Archive for the ‘Service Status’ Category

Details on last week’s outages

By Garret Heaton | 9 months ago | 1 Comment

Last week we suffered two outages affecting a large number of users. The first was from 10-11pm PST on Thursday and the second was from 12-1:30pm on Friday. We live in HipChat all day too, so we know how crippling it feels to lose that connection to your team. We wanted to share the cause of the trouble since it’s something that may affect some of you as well; it was not trivial to debug, though the fix is simple.

What happened

At the beginning of both outages, we were alerted that the backend services which provide the XMPP BOSH endpoint for our web app had dropped all of their connections, disconnecting every user on them. This led to a rush of reconnections, which caused load to increase. Although we had the capacity to handle that load, many web requests became very slow and caused our database (we use MySQL on Amazon RDS) to run out of available connections. Shortly after this, our XMPP services also began handling requests very slowly despite using persistent MySQL connections. At that point no one was able to sign in or chat, and we had to disable our API and chat sign-ins. It was not clear to us what was causing the slowdown, since our servers and database had plenty of resources available.
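
As a rough illustration (this isn’t our actual monitoring, and the host, credentials, and threshold below are made up), here’s the kind of check that would have flagged the connection exhaustion earlier, using the pymysql package to compare active connections against MySQL’s max_connections limit:

    # Sketch: warn when MySQL connection usage approaches max_connections.
    # The host, credentials, and 80% threshold are illustrative placeholders.
    import pymysql

    ALERT_THRESHOLD = 0.8

    conn = pymysql.connect(host="db.example.com", user="monitor", password="secret")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
            threads_connected = int(cur.fetchone()[1])
            cur.execute("SHOW GLOBAL VARIABLES LIKE 'max_connections'")
            max_connections = int(cur.fetchone()[1])
        usage = threads_connected / max_connections
        print(f"{threads_connected}/{max_connections} connections in use ({usage:.0%})")
        if usage >= ALERT_THRESHOLD:
            print("WARNING: approaching the connection limit")
    finally:
        conn.close()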

Root cause

Log jam!

It turns out that some new processing on our rsyslog logging server had caused its load to spike, which in turn caused all of our servers to begin queuing their logs locally, slowing down our services. As load increased, the problem only got worse. We stopped our rsyslog clients and everything started coming back to life immediately. Once we were back online we started searching and found that the Bitbucket team, which sits within shouting distance of us, had run into the same issue last year. Check out their blog post for the technical details.

We have since found a way to reproduce this failure locally and implemented a fix by switching from TCP to UDP logging. If you’re using rsyslog, we’d recommend testing your setup against this sort of failure. It’s quite embarrassing to have your logging cause so much trouble.
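
For anyone making the same change, the switch comes down to a single character in rsyslog’s forwarding rule: one @ forwards over UDP (fire-and-forget), while @@ forwards over TCP and can back up locally when the log server is struggling. The host and port below are placeholders, not our actual configuration:

    # /etc/rsyslog.conf on each client (host and port are placeholders)

    # TCP forwarding (what we had): clients queue and slow down when the
    # central log server can't keep up
    # *.* @@loghost.example.com:514

    # UDP forwarding (what we switched to): fire-and-forget, so a slow or
    # overloaded log server can't stall the services doing the logging
    *.* @loghost.example.com:514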

Thank you

Thanks to everyone for your patience during these troubles. We know they happen at the worst possible time for some of you. In the end, our service improves with each one of these failures as we continue to learn and resolve issues.

Finally, we’d like to remind people to keep an eye on @HipChat and our status site if you believe we may be having trouble. We received many individual tweets and emails during the outage but did not have the time to immediately reply to everyone.

Things that go bump in the… middle of the day

By Garret Heaton | 2 years ago | 1 Comment

Many of you probably noticed that we had some service issues around 11:00 PST yesterday (October 2nd). The timing was horrible and we weren’t able to fix things as quickly as we’d have liked. Here’s a little detail on what happened and what we’ve learned.

What happened

HipChat is built on the XMPP protocol, and to allow our web app (JavaScript) to talk to our XMPP server we use a BOSH proxy service. Lately we’ve been tracking down an issue that causes these services to drop large numbers of active connections without warning. Usually users don’t notice when this happens since they automatically reconnect to another server, but this time the problem struck in the middle of the day, when we have our highest usage. The increased load caused by all of these users reconnecting led to some cascading failures we had not seen before and were not well prepared to deal with.

Here’s a bit more detail for the techies out there. One of our main Redis servers had stopped accepting new connections due to the high load and a configuration issue (a low maxclients setting), and as a result most users were unable to sign in (though people who were already connected could chat normally). At this point we began tweeting, updating our status site, and working to redirect load away from the affected servers. It would have been handy to temporarily block users from signing in so that we could get things under control, but we hadn’t built a way to prevent people from using HipChat. Perhaps that’s a ‘feature’ we’ll have to add going forward so we don’t DDoS ourselves.

We considered performing a manual failover to a slave server, but realized that it had the same configuration issue as the master and would end up suffering the same fate. In the end we discovered that extra Apache child processes had been spawned which were holding connections to our Redis server open and refusing to terminate. After killing them off we were able to start churning through the backlog of work, and users were able to sign in again.
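
For the curious, here’s a rough sketch (using the redis-py client, with a made-up host and idle cutoff) of how you might check connection headroom against maxclients and spot long-idle clients like those stuck Apache children:

    # Rough sketch: check Redis connection headroom and list long-idle clients.
    # The host and the 5 minute idle cutoff are illustrative placeholders.
    import redis

    r = redis.Redis(host="redis.example.com", port=6379)

    connected = int(r.info("clients")["connected_clients"])
    maxclients = int(r.config_get("maxclients")["maxclients"])
    print(f"{connected}/{maxclients} client connections in use")

    # Connections that have sat idle for a long time are worth a closer look;
    # in this incident they would have been the stuck Apache children.
    IDLE_CUTOFF = 300
    for client in r.client_list():
        if int(client["idle"]) > IDLE_CUTOFF:
            print(f"idle {client['idle']}s  addr={client['addr']}  cmd={client['cmd']}")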

Final thoughts

We realize you all rely on HipChat to be available 24/7 so you can stay connected to your team and be productive. The length of this outage is unacceptable to us and we’ll be prioritizing some changes to improve our systems and prevent this from happening in the future. On the plus side, every issue we run into helps us learn more about our systems and protect them from future failures. Growing pains are a necessary thing.

Thanks to everyone who was patient and humorously supportive.

Saint Patrick’s Day downtime

By Garret Heaton | 3 years ago | 2 Comments

Yesterday afternoon, while everyone was enjoying the new (shamrock) and (greenbeer) emoticons, one of our Amazon EC2 instances suddenly stopped responding. This particular instance helped serve our website and chat service and had a decent percentage of chat clients connected to it. We’re not sure why it went down, but it was EBS-backed, so we think it may be related to the issues Reddit and others were experiencing.

Whatever the cause, we should be able to recover from server failures. Normally the users connected to the failed server would be briefly disconnected before automatically reconnecting to another server. Unfortunately, the other services in our chat cluster weren’t properly detecting the failed instance and continued to send it requests, which would then fail. Instead of trying to manually correct this unfamiliar failure state and risk causing more harm, we decided to restart all of the chat services. This is when most users noticed that we were having trouble.
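
We’ll spare you our cluster internals, but the missing piece was active failure detection: probe each peer and stop routing to anything that doesn’t answer. Here’s a minimal sketch of that idea; the hostnames, port, and timeout are hypothetical, not our real topology:

    # Minimal sketch of active failure detection: probe each peer with a TCP
    # connect and drop unreachable nodes from the routing pool. Hostnames,
    # port, and timeout are hypothetical examples.
    import socket

    CHAT_NODES = ["chat1.example.com", "chat2.example.com", "chat3.example.com"]
    PORT = 5222          # standard XMPP client port
    TIMEOUT_SECONDS = 2

    def healthy_nodes(nodes):
        """Return only the nodes that accept a TCP connection within the timeout."""
        alive = []
        for host in nodes:
            try:
                with socket.create_connection((host, PORT), timeout=TIMEOUT_SECONDS):
                    alive.append(host)
            except OSError:
                print(f"ejecting {host}: not accepting connections")
        return alive

    print("routing pool:", healthy_nodes(CHAT_NODES))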

Having everyone reconnect shouldn’t have been too big a problem, but we’ve been growing an awful lot lately and had more users connected than ever before (although today is looking even bigger!). The load caused by everyone reconnecting was too much for our existing servers to bear, and everything slowed to a crawl. We had essentially triggered a denial-of-service attack against ourselves. Not good. After about 15 minutes we were able to process all the requests and service returned to normal. Total service interruption: about 1 hour.

Here’s what we’ll be fixing:

  1. Increasing our capacity so that server failures affect a smaller percentage of users and we can recover more quickly.
  2. Making sure our clustered chat services properly handle the failure case we experienced.
  3. Optimizing the client reconnection flow so it generates less load (one way to spread reconnections out is sketched below).
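
We haven’t settled on the exact changes for item 3, but one common way to keep a reconnect storm from turning into a self-inflicted DDoS is exponential backoff with random jitter on the client side. The delay bounds and the connect() stub below are illustrative only:

    # Sketch of client-side reconnection with exponential backoff and full jitter,
    # one common way to spread a reconnect storm out over time. The delay bounds
    # and the connect() stub are illustrative, not HipChat's client code.
    import random
    import time

    BASE_DELAY = 1.0    # seconds before the first retry
    MAX_DELAY = 60.0    # cap so clients never wait unreasonably long

    def connect():
        """Placeholder for the real connection attempt; returns True on success."""
        return random.random() < 0.2  # simulate ~20% success for demonstration

    def reconnect_with_backoff():
        attempt = 0
        while not connect():
            ceiling = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
            # Full jitter: wait a random amount up to the current ceiling so
            # disconnected clients don't all retry at the same instant.
            time.sleep(random.uniform(0, ceiling))
            attempt += 1

    reconnect_with_backoff()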

Thanks to all our users for your patience and understanding yesterday. We know HipChat is an integral part of your workflow and that reliability is the most important feature we can provide.