Many of you probably noticed that we had some service issues around 11:00 PST yesterday (October 2nd). The timing was horrible, and we weren't able to fix things as quickly as we'd have liked. Here's a little detail on what happened and what we've learned.
Here's a bit more detail for the techies out there. One of our main Redis servers had stopped accepting new connections due to high load combined with a configuration issue (a maxclients setting that was too low), and as a result most users were unable to sign in (though people who were already connected could chat normally). At this point we began Tweeting, updating our status site, and working to redirect load away from the affected servers. It would have been handy to temporarily block users from signing in so that we could get things under control, but we had never built a way to prevent users from using HipChat. Perhaps this is a 'feature' we'll have to add moving forward so we don't DDoS ourselves.
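If you run Redis and want to see how close you are to that cliff, the check is cheap. Here's a minimal diagnostic sketch using redis-py against a Redis on localhost:6379; it's illustrative, not tooling we actually had in place at the time:

```python
import redis

# Minimal sketch: compare Redis's connection cap against current usage.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# maxclients caps simultaneous connections; once it's hit, Redis starts
# refusing new ones while existing connections keep working.
max_clients = int(r.config_get("maxclients")["maxclients"])
connected = int(r.info("clients")["connected_clients"])

print(f"{connected} of {max_clients} client slots in use")
if connected >= max_clients:
    print("maxclients exhausted: new connections will be refused")
```

Of course, once the cap is actually hit, even a check like this can't get a connection in, which is part of what made the outage so painful to debug.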
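As for that kill switch, the idea is simple enough. Here's a purely hypothetical sketch (the flag path and function names are made up, and this is not HipChat code) of why gating sign-ins on a local flag file is attractive: the switch keeps working even when the backend it protects is the thing that's on fire. Touch the file to close the front door, remove it to reopen.

```python
import os

# Hypothetical maintenance-mode gate; the flag path is an assumption.
MAINTENANCE_FLAG = "/etc/hipchat/maintenance"

def signin_allowed():
    # Only new sign-ins are refused; already-connected users keep chatting.
    return not os.path.exists(MAINTENANCE_FLAG)

def handle_signin(credentials):
    if not signin_allowed():
        return (503, "Sign-in is temporarily disabled for maintenance")
    return (200, "proceed with normal authentication")
```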
We considered performing a manual failover to a slave server, but realized that it had the same configuration issue as the master and would end up suffering the same fate. In the end we discovered that extra Apache child processes had been spawned that were holding connections to our Redis server open and refusing to terminate. After removing them, we were able to start churning through the backlog of work, and users were able to sign in again.
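For the curious, here's roughly what that cleanup looks like if you script it. This sketch uses psutil, with assumed process names and the default Redis port; it shows the shape of the fix, not the actual commands we ran:

```python
import psutil

# Rough sketch: find Apache children still holding TCP connections to Redis
# and ask them to exit so their connection slots are freed.
REDIS_PORT = 6379  # assumed default port

for proc in psutil.process_iter(["name"]):
    if proc.info["name"] not in ("apache2", "httpd"):
        continue
    try:
        conns = proc.connections(kind="tcp")
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        continue
    if any(c.raddr and c.raddr.port == REDIS_PORT for c in conns):
        print(f"terminating stuck worker pid={proc.pid}")
        proc.terminate()  # SIGTERM first; escalate to proc.kill() if ignored
```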
We realize you all rely on HipChat to be available 24/7 so you can stay connected to your team and be productive. The length of this outage is unacceptable to us, and we'll be prioritizing changes to improve our systems and prevent this from happening again. On the plus side, every issue we run into teaches us more about our systems and helps us protect them from future failures. Growing pains are a necessary thing.