Garret Heaton

Things that go bump in the… middle of the day

By Garret Heaton | 8 months ago | 1 Comment

Many of you probably noticed that we had some service issues around 11:00 PST yesterday (October 2nd). The timing was horrible, and we weren’t able to fix things as quickly as we would have liked. Here’s a little detail on what happened and what we’ve learned.

What happened

HipChat is built on the XMPP protocol, and to let our web app (JavaScript) talk to our XMPP server we use a BOSH proxy service. Lately, we’ve been tracking down an issue that causes these BOSH proxies to drop large numbers of active connections without warning. Usually users don’t notice when this happens, since they automatically reconnect to another server, but this time the problem struck in the middle of the day, when our usage is highest. The extra load from all these users reconnecting at once led to cascading failures we had not seen before and were not well prepared to deal with.
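For the technically curious, one common way to soften a reconnect storm like this is to have each client wait a randomized, exponentially growing delay before retrying, so that dropped users do not all come back in the same instant. The sketch below is illustrative only. It is not HipChat's client code, and the function name `reconnect_delay` and the `base` and `cap` values are assumptions made for the example.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    Uses exponential backoff with "full jitter": the upper bound doubles
    with each failed attempt (up to `cap`), and the actual delay is drawn
    uniformly between 0 and that bound so clients spread out over time.
    """
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(attempt, 32)
    ceiling = min(cap, base * (2 ** exponent))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Show the growing upper bound and one sampled delay per attempt.
    for attempt in range(6):
        print(attempt, round(reconnect_delay(attempt), 2))
```

The randomness matters as much as the doubling: without it, every client dropped at the same moment would retry on the same schedule and hit the servers in synchronized waves.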

Here’s a bit more detail for the techies out there. One of our main Redis servers had stopped accepting new connections due to the high load and a configuration issue (a low maxclients setting), and as a result most users were unable to sign in (though people who were already connected were able to chat normally). At this point we began tweeting, updating our status site, and working to redirect the load from the affected servers. It would have been handy to temporarily block users from signing in so that we could get things under control, but we hadn’t built a way to prevent users from using HipChat. Perhaps this is a ‘feature’ we’ll have to add going forward so we don’t DDoS ourselves.
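To make the sign-in block idea concrete, here is a minimal sketch of what such a switch could look like. It is hypothetical: the `SignInGate` class, its method names, and the percentage-based ramp are assumptions for illustration, not something we have built. Each user ID is hashed into one of 100 buckets, and only users whose bucket falls below the current percentage may start a new session, so an operator could drop to 0 during an incident and then raise the number gradually as the backlog clears.

```python
import zlib


class SignInGate:
    """Operator-controlled throttle for new sign-ins (hypothetical sketch).

    Existing sessions are not affected; only new sign-in attempts are
    checked against the gate.
    """

    def __init__(self, admit_percent=100):
        self.set_admit_percent(admit_percent)

    def set_admit_percent(self, percent):
        # 0 blocks every new sign-in, 100 admits everyone.
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.admit_percent = percent

    def allows(self, user_id):
        # Stable hash, so a given user gets the same answer on every retry
        # until the operator changes the percentage.
        bucket = zlib.crc32(user_id.encode("utf-8")) % 100
        return bucket < self.admit_percent


if __name__ == "__main__":
    gate = SignInGate()
    gate.set_admit_percent(0)       # incident: stop all new sign-ins
    print(gate.allows("alice"))     # False
    gate.set_admit_percent(100)     # recovered: admit everyone again
    print(gate.allows("alice"))     # True
```

Hashing the user ID rather than rolling a random number on each attempt means a blocked user cannot get in simply by retrying, which would otherwise add to the load the gate is meant to shed.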

We considered performing a manual failover to a slave server but realized that it had the same configuration issue as the master and would end up suffering the same fate. In the end we found that extra Apache child processes had been spawned and were holding connections to our Redis server open and refusing to terminate. After removing them we were able to start churning through the backlog of work, and users were able to sign in again.
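One way to spot this kind of problem is to look at which hosts are holding connections open. Redis's `CLIENT LIST` command returns one line per connection as space-separated `key=value` fields, including `addr` (the client's address and port) and `idle` (seconds since its last command). The sketch below groups long-idle connections by host. It is an illustration rather than the tooling we used during the incident, and the function names, the 600-second threshold, and the sample addresses are all made up for the example.

```python
from collections import Counter


def parse_client_list(raw):
    """Turn raw CLIENT LIST output into a list of field dictionaries."""
    clients = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each connection is one line of space-separated key=value pairs.
        fields = dict(pair.split("=", 1) for pair in line.split(" ") if "=" in pair)
        clients.append(fields)
    return clients


def idle_connections_by_host(clients, min_idle=600):
    """Count connections per host that have been idle at least `min_idle` seconds."""
    counts = Counter()
    for client in clients:
        if int(client.get("idle", 0)) >= min_idle:
            # addr looks like "host:port"; drop the port to group by host.
            host = client.get("addr", "unknown").rsplit(":", 1)[0]
            counts[host] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample in the CLIENT LIST format, not real output from the incident.
    sample = (
        "addr=10.0.0.5:41000 fd=8 name= age=4000 idle=900 cmd=get\n"
        "addr=10.0.0.5:41001 fd=9 name= age=4200 idle=1200 cmd=get\n"
        "addr=10.0.0.9:52000 fd=10 name= age=30 idle=0 cmd=client\n"
    )
    for host, count in idle_connections_by_host(parse_client_list(sample)).most_common():
        print(host, count)
```

A host with many long-idle connections is a reasonable place to start looking for stuck processes, and the total connection count can be compared against the configured `maxclients` limit to see how close the server is to refusing new clients.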

Final thoughts

We realize you all rely on HipChat to be available 24/7 so you can stay connected to your team and be productive. The length of this outage is unacceptable to us and we’ll be prioritizing some changes to improve our systems and prevent this from happening in the future. On the plus side, every issue we run into helps us learn more about our systems and protect them from future failures. Growing pains are a necessary thing.

Thanks to everyone who was patient and humorously supportive.

HipChat is group chat and IM built for teams.
  • Pavel

    Great job being proactive and making it better for the future. Just joined HipChat and loving it so far :)