Last week we suffered two outages affecting a large number of users. The first was from 10-11pm PST on Thursday and the second was from 12-1:30pm on Friday. Like you, we live in HipChat all day, so we know how crippling it feels to lose that connection to your team. We wanted to share the cause of the trouble since it’s something that likely affects some of you and was not trivial to debug, though the fix is simple.
At the beginning of both outages, we were alerted that the backend services which provide the XMPP BOSH endpoint for our web app had dropped all their connections, disconnecting all their users. The resulting rush of reconnections caused load to increase. Although we had the capacity to handle the load, many web requests became very slow and caused our database (we use MySQL on Amazon RDS) to run out of available connections. Shortly after this, our XMPP services also began to handle requests very slowly, even though they use persistent MySQL connections. At this point all users were unable to sign in or chat, and we had to disable our API and chat sign-ins. It was not clear to us what was causing the slowdown, since our servers and database had plenty of resources available.
It turns out that some new processing on our rsyslog logging server had caused its load to spike. This caused all our servers to begin queuing their logs locally, which slowed down our services. As load increased, the problem only got worse. We stopped our rsyslog clients and everything started coming back to life immediately. Once we were back online we started searching and found that the Bitbucket team, which sits within shouting distance of us, had run into the same issue last year. Check out their blog post for technical details.
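To illustrate the general mechanism, here is a minimal Python sketch of TCP backpressure. It is not a reproduction of the rsyslog failure itself: the "server" is a hypothetical stand-in that listens but never reads, and the function name `bytes_until_backpressure` is made up for this example. It only shows that a TCP sender can write a limited amount before a stalled receiver forces it to wait.

```python
import socket


def bytes_until_backpressure(chunk_size=65536, max_chunks=10000):
    """Count how many bytes a TCP sender can write before a stalled receiver blocks it."""
    # Hypothetical stand-in for an overloaded log server: it listens but never reads.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)

    client = socket.create_connection(server.getsockname())
    # Non-blocking mode makes the socket raise instead of hanging, so we can
    # observe the exact point where a blocking writer would start to stall.
    client.setblocking(False)

    sent = 0
    blocked = False
    try:
        for _ in range(max_chunks):
            try:
                sent += client.send(b"x" * chunk_size)
            except BlockingIOError:
                blocked = True  # kernel buffers are full; a blocking send would wait here
                break
    finally:
        client.close()
        server.close()
    return sent, blocked


if __name__ == "__main__":
    sent, blocked = bytes_until_backpressure()
    print(f"wrote {sent} bytes before the sender would block: {blocked}")
```

The number of bytes written depends on the kernel's socket buffer sizes. Once those buffers are full, any process that writes to the socket synchronously waits on the receiver, however much spare CPU and memory the sending machine has.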
We have since found a way to reproduce this failure locally and implemented a fix by switching from TCP to UDP logging. If you’re using rsyslog, we’d recommend you test your setup against this sort of failure. It’s quite embarrassing to have your logging cause so much trouble.
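In rsyslog's forwarding syntax, a target prefixed with `@@` is sent over TCP and one prefixed with `@` is sent over UDP. The same trade-off can be sketched at the application level with Python's standard `logging.handlers.SysLogHandler`. This is an illustration under assumptions, not the configuration described above: the helper name `make_udp_syslog_logger`, the logger name, and the host and port are all hypothetical.

```python
import logging
import logging.handlers
import socket


def make_udp_syslog_logger(host, port, name="app"):
    """Build a logger that sends syslog messages as UDP datagrams."""
    # SOCK_DGRAM selects UDP: each record is sent as a single datagram and
    # the sender does not wait for the receiver to acknowledge or read it.
    handler = logging.handlers.SysLogHandler(
        address=(host, port), socktype=socket.SOCK_DGRAM
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep the example's output away from the root logger
    return logger, handler


if __name__ == "__main__":
    # Hypothetical address; nothing needs to be listening for the call to return.
    logger, handler = make_udp_syslog_logger("127.0.0.1", 5514)
    logger.info("service started")
    handler.close()
```

The cost of this design is that UDP offers no delivery guarantee, so messages can be lost silently when the log server is overloaded or unreachable. That is usually an acceptable price for keeping logging from stalling the services that emit the logs.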
Thanks to everyone for your patience during these troubles. We know they happen at the worst possible time for some of you. In the end, our service improves with each one of these failures as we continue to learn and resolve issues.
Finally, we’d like to remind people to keep an eye on @HipChat and our status site if you believe we may be having trouble. We received many individual tweets and emails during the outage but did not have the time to immediately reply to everyone.