Archive for the ‘Service Status’ Category


Details on the recent outages

By zuhaib | 3 days ago | 0 Comments

HipChat has been growing like crazy. A year ago we blogged about how we passed 1 billion messages after 4 years. Today we are at over 3 billion messages sent – and growing faster every month! Over the past few weeks, however, we have had some growing pains: we've suffered a few different types of service outage with HipChat, and we felt a blog post would be helpful to explain the issues and how we are moving forward.

We understand that HipChat is critical for your team – we live in HipChat all day as well – so we know how crippling it feels to lose that connection to your team. We take this responsibility very seriously, and we’re ensuring that even as we continue to scale, your HipChat experience continues to be amazing.

Web App Troubles

Starting Oct 1st, we saw higher than normal load on our web tier, which caused load issues at that tier and in turn triggered all web clients to suddenly reconnect, causing a spike across our whole system. Throughout Oct 2nd we added more capacity to the web tier to handle the extra load, made several code and configuration changes to optimize how we use one of our databases, Redis, to support the higher load, and declared the issue resolved on Oct 2nd.
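We won't go into the specifics of our changes here, but as a general illustration, one common way to reduce Redis load from a web tier is to reuse a capped connection pool and batch related lookups with pipelining. The sketch below (using the redis-py client) is just that, an illustration; the host, key names, and pool size are all made up rather than taken from our systems.

```python
import redis

# Hypothetical host and pool size: a shared pool caps how many sockets each
# web process opens to Redis, instead of opening one per request.
pool = redis.ConnectionPool(host="redis.internal.example", port=6379,
                            db=0, max_connections=50)
r = redis.Redis(connection_pool=pool)

def fetch_presence(user_ids):
    """Look up many (hypothetical) presence hashes in one round trip."""
    pipe = r.pipeline(transaction=False)
    for uid in user_ids:
        pipe.hgetall("presence:%s" % uid)  # key layout is made up
    return dict(zip(user_ids, pipe.execute()))
```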

The optimization addressed the load issue (yay). However, along the way we inadvertently introduced two additional issues into our system. On Oct 7th, we discovered and resolved an issue that resulted in our Android clients making extra requests and effectively DDoSing us. On Oct 9th, we discovered and resolved a second issue, in which bad cache data locked a small percentage of users out of HipChat.
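As a general pattern (a sketch of the idea, not our actual fix), cache reads can be made defensive so that a corrupt or outdated entry is treated as a cache miss rather than something that blocks login. Everything below, the host, the key layout, and the load_from_db callback, is hypothetical.

```python
import json
import redis

r = redis.Redis(host="redis.internal.example", port=6379)  # hypothetical host

CACHE_VERSION = 3  # bump whenever the cached layout changes

def get_user(user_id, load_from_db):
    """Read a cached user record, treating anything malformed or from an
    older cache version as a miss instead of an error."""
    key = "user:%s" % user_id  # hypothetical key layout
    raw = r.get(key)
    if raw is not None:
        try:
            cached = json.loads(raw)
            if cached.get("_v") == CACHE_VERSION:
                return cached["data"]
        except (ValueError, TypeError, KeyError, AttributeError):
            pass
        r.delete(key)  # purge the bad entry rather than serving it again
    user = load_from_db(user_id)
    r.set(key, json.dumps({"_v": CACHE_VERSION, "data": user}), ex=3600)
    return user
```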

SSLv3 “POODLE” bites HipChat

Many of you have likely seen the reports going around of a security vulnerability in SSLv3. As soon as we had confirmation of the vulnerability, we rolled out a patched version of our server code to the whole system, including new front-end XMPP servers. In the process of adding the new front-end XMPP servers, the automated tool that manages our DNS records failed to update our domain with the new servers' addresses. When we removed the old servers from rotation, this left users unable to log in for 15 minutes while we manually updated our DNS records.
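One lesson for us here is to verify DNS before retiring servers, not after. Below is a minimal sketch of that kind of sanity check; the hostname and addresses are hypothetical, and because resolvers cache records for their TTL, a check like this is a safety net rather than a guarantee.

```python
import socket

def resolve_all(hostname):
    """Return the set of IPv4 addresses a name currently resolves to."""
    infos = socket.getaddrinfo(hostname, 443, socket.AF_INET, socket.SOCK_STREAM)
    return {info[4][0] for info in infos}

def safe_to_retire(old_ips, hostname="chat.example.com"):
    """Only retire old front-end servers once DNS no longer points at them
    and at least one replacement address is being served."""
    current = resolve_all(hostname)
    still_referenced = current & set(old_ips)
    return not still_referenced and len(current) > 0
```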

Going Forward

We take our commitment to you, our end users, as seriously at 3B+ messages as we did at the first message. It's unacceptable for us to have this much downtime, and while the underlying cause of each issue is different, the net result for you, our users, is often the same. We continue to invest heavily in platform, scale, performance, and reliability projects – we have several big ones in the works and hope to blog about them soon. We have an entire team dedicated to scaling HipChat further, and we are looking to scale our team as well – come join us!

Sincerely,

Your HipChat Team


AWS Rolling Restarts

By zuhaib | 3 weeks ago | 4 Comments

We know many in the Devops community use HipChat as an incident management tool. As many of you know, we use Amazon Web Services to host HipChat. If you see anything weird happening with HipChat during the AWS rolling restarts, keep in mind that we may be experiencing some side effects from our own rolling restarts.

Amazon notified us that maintenance will be occurring for the instances hosting HipChat during the following windows over the coming days:

September 26, 2014 11:00:00 PM UTC-7 to September 27, 2014 5:00:00 AM UTC-7
September 27, 2014 11:00:00 PM UTC-7 to September 28, 2014 5:00:00 AM UTC-7
September 28, 2014 11:00:00 PM UTC-7 to September 29, 2014 5:00:00 AM UTC-7
September 29, 2014 11:00:00 PM UTC-7 to September 30, 2014 5:00:00 AM UTC-7

Our own Devops team is getting ready for this maintenance and we expect no major problems for HipChat. Minor things to be aware of might include client reconnections and slower search responses.
Here’s an article we found helpful whilst planning.

Although we don’t expect any major issues, we just wanted to provide a heads-up.

The HipChat Devops Team.


Details on last week’s outages

By Garret Heaton | 1 year ago | 1 Comment

Last week we suffered two outages affecting a large number of users. The first was from 10-11pm PST on Thursday and the second was from 12-1:30pm on Friday. Like you, we live in HipChat all day, so we know how crippling it feels to lose that connection to your team. We wanted to share the cause of the trouble since it's something that likely affects some of you and was not trivial to debug, though the fix is simple.

What happened

At the beginning of both outages, we were alerted that the backend services which provide the XMPP BOSH endpoint for our web app had dropped all their connections, disconnecting all of their users. This led to a rush of reconnections, which caused load to increase. Despite having the capacity to handle the load, many web requests became very slow and caused our database (we use MySQL on Amazon RDS) to run out of available connections. Shortly after this, our XMPP services also began to handle requests very slowly despite using persistent MySQL connections. At this point all users were unable to sign in or chat, and we had to disable our API and chat sign-ins. It was not clear to us what was causing the slowdown, since our servers and database had plenty of resources available.

Root cause

Log jam!

It turns out that some new processing on our rsyslog logging server had caused its load to spike, which in turn caused all our servers to begin queuing their logs locally and slowing down our services. As load increased, the problem only got worse. We stopped our rsyslog clients and everything started coming back to life immediately. Once we were back online, we started searching and found that the Bitbucket team, which sits within shouting distance of us, had run into the same issue last year. Check out their blog post for the technical details.

We have since found a way to reproduce this failure locally and implemented a fix by switching from TCP to UDP logging. If you’re using rsyslog we’d recommend you test your setup against this sort of failure. It’s quite embarrassing to have your logging cause so much trouble.
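In rsyslog's forwarding rules the difference is a single character: forwarding to @loghost uses UDP, while @@loghost uses TCP. If your application also logs to syslog directly, the same trade-off applies. Here's a small Python illustration (the log host is hypothetical) where UDP means a slow or overloaded log server drops messages instead of backing up the application.

```python
import logging
import logging.handlers
import socket

# Hypothetical log host. SOCK_DGRAM sends each record over UDP, so a slow
# syslog server loses messages instead of blocking the application.
handler = logging.handlers.SysLogHandler(
    address=("logs.internal.example", 514),
    socktype=socket.SOCK_DGRAM,
)
handler.setFormatter(logging.Formatter("web-app: %(levelname)s %(message)s"))

log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("syslog over UDP: lossy under pressure, but it never blocks us")
```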

Thank you

Thanks to everyone for your patience during these troubles. We know they happen at the worst possible time for some of you. In the end, our service improves with each one of these failures as we continue to learn and resolve issues.

Finally, we’d like to remind people to keep an eye on @HipChat and our status site if you believe we may be having trouble. We received many individual tweets and emails during the outage but did not have the time to immediately reply to everyone.


Things that go bump in the… middle of the day

By Garret Heaton | 2 years ago | 1 Comment

Many of you probably noticed that we had some service issues around 11:00 PST yesterday (October 2nd). The timing was horrible and we weren't able to fix things as quickly as we would have liked. Here's a little detail on what happened and what we've learned.

What happened

HipChat is built using the XMPP protocol and in order to allow our web app (JavaScript) to talk to our XMPP server we use a BOSH proxy service. Lately, we’ve been tracking down an issue that causes these services to drop large numbers of active connections without warning. Usually users don’t notice when this happens since they automatically reconnect to another server, but this time the problem struck during the middle of the day when we have our highest usage. The increased load caused by all these users reconnecting led to some cascading failures we had not seen before and were not well prepared to deal with.

Here’s a bit more detail for the techies out there. One of our main Redis servers had stopped accepting new connections due to the high load and a configuration issue (low maxclients setting), and as a result most users were unable to sign in (though people who were already connected were able to chat normally). At this point we began Tweeting, updating our status site, and working to redirect the load from the affected servers. It would have been handy to temporarily block users from signing in so that we could get things under control, but we hadn’t built a way to prevent users from using HipChat. Perhaps this is a ‘feature’ we’ll have to add moving forward so we don’t DDoS ourselves.
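For anyone running Redis themselves, it's worth watching the connected client count against the maxclients setting before the limit bites (and remember that raising maxclients only helps if the process's open-file limit is high enough to back it). A rough sketch of that check with redis-py, hypothetical host and threshold included:

```python
import redis

r = redis.Redis(host="redis.internal.example", port=6379)  # hypothetical host

def check_client_headroom(warn_ratio=0.8):
    """Warn before Redis starts refusing connections rather than after."""
    connected = r.info("clients")["connected_clients"]
    maxclients = int(r.config_get("maxclients")["maxclients"])
    if connected >= warn_ratio * maxclients:
        print("WARNING: %d of %d Redis client slots in use"
              % (connected, maxclients))
    return connected, maxclients
```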

We considered performing a manual failover to a slave server, but realized that it had the same configuration issue as the master and would end up suffering the same fate. In the end we discovered that extra Apache child processes had been spawned that were holding connections to our Redis server open and refusing to terminate. After removing them we were able to start churning through the backlog of work, and users were able to sign in again.

Final thoughts

We realize you all rely on HipChat to be available 24/7 so you can stay connected to your team and be productive. The length of this outage is unacceptable to us and we’ll be prioritizing some changes to improve our systems and prevent this from happening in the future. On the plus side, every issue we run into helps us learn more about our systems and protect them from future failures. Growing pains are a necessary thing.

Thanks to everyone who was patient and humorously supportive.


Saint Patrick’s Day downtime

By Garret Heaton | 4 years ago | 2 Comments

Yesterday afternoon, while everyone was enjoying the new (shamrock) and (greenbeer) emoticons, one of our Amazon EC2 instances suddenly stopped responding. This particular instance helped serve our website and chat service and had a decent percentage of chat clients connected to it. We're not sure why it went down, but it was EBS-backed, so we think it may be related to the issues Reddit and others were experiencing.

Whatever the cause, we should be able to recover from server failures. Normally the users connected to the failed server would be briefly disconnected before automatically reconnecting to another server. Unfortunately, the other services in our chat cluster weren't properly detecting the failed instance and continued to send it requests, which would then fail. Instead of trying to manually correct this unfamiliar failure state and risk causing more harm, we decided to restart all the chat services. This is when most users noticed that we were having trouble.

Having everyone reconnect shouldn't be too big a problem, but we've been growing an awful lot lately and had more users connected than ever before (although today is looking even bigger!). The load caused by everyone reconnecting was too much for our existing servers to bear, and everything slowed to a crawl. We had essentially triggered a denial-of-service attack against ourselves – not good. After about 15 minutes we were able to process all the requests and service returned to normal. Total service interruption: about 1 hour.

Here is what we'll be fixing:

  1. Increasing our capacity so that server failures affect a smaller percentage of users and we can recover more quickly.
  2. Making sure our clustered chat services properly handle the failure case we experienced.
  3. Optimizing the client reconnection flow so it generates less load (a rough sketch of one approach follows below).
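On that last point, the standard trick is exponential backoff with jitter, so that thousands of clients disconnected at the same moment don't all come back at the same moment. A generic sketch (not our client code):

```python
import random
import time

def reconnect_with_backoff(connect, base=1.0, cap=60.0, max_attempts=10):
    """Retry `connect` with exponential backoff and full jitter so that
    many clients disconnected together don't all reconnect together."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            # Sleep a random amount between 0 and min(cap, base * 2**attempt).
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
    raise ConnectionError("gave up after %d attempts" % max_attempts)
```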

Thanks to all our users for your patience and understanding yesterday. We know HipChat is an integral part of your workflow and that reliability is the most important feature we can provide.