HipChat has been growing like crazy. A year ago we blogged about how we passed 1 billion messages after 4 years. Today, we are over 3 billion messages sent, and growing faster every month! However, over the past few weeks we have had some growing pains: we’ve suffered a few different types of service outage with HipChat, and felt a blog post would help explain the issues and how we are moving forward.
We understand that HipChat is critical for your team – we live in HipChat all day as well – so we know how crippling it feels to lose that connection to your team. We take this responsibility very seriously, and we’re ensuring that even as we continue to scale, your HipChat experience continues to be amazing.
Web App Troubles
On Oct 1st, we started to see higher-than-normal load on our web tier. The resulting load issues triggered all web clients to reconnect at once, which caused a spike across our whole system. Throughout Oct 2nd we added more capacity in the web tier to handle the extra load and made several code and configuration changes to optimize how we use one of our databases, Redis, under the higher load. We declared the issue resolved on Oct 2nd.
The optimization addressed the load issue (yay). However, along the way we inadvertently introduced two additional issues into our system. On Oct 7th, we discovered and resolved an issue that caused our Android clients to make extra requests, so we were effectively DDoSing ourselves. On Oct 9th, we discovered and resolved a second issue, in which bad cache data prevented a small percentage of users from logging in to HipChat.
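For readers curious about the reconnect spike described above: when every client retries at the same moment, the retries themselves become the load. A common general mitigation is exponential backoff with random jitter, so that clients spread their reconnect attempts over time. The sketch below is illustrative only and is not our production client code; the function name `reconnect_delays` and its parameters (`base`, `cap`, `attempts`) are hypothetical.

```python
import random


def reconnect_delays(base=1.0, cap=60.0, attempts=6, rng=None):
    """Yield one randomized wait time (in seconds) per reconnect attempt.

    Uses the "full jitter" variant: each delay is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so the upper bound doubles on every
    attempt until it reaches the cap.
    """
    if rng is None:
        rng = random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Hypothetical usage: a client would sleep for each delay before retrying.
    for attempt, delay in enumerate(reconnect_delays()):
        print(f"attempt {attempt}: wait {delay:.2f}s before reconnecting")
```

Because each client draws its own random delay, a large group of clients that disconnect together is unlikely to reconnect together.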
Many of you have likely seen the reports of a security vulnerability in SSLv3. As soon as we had confirmation of the vulnerability, we rolled out a patched version of our server code to the whole system, including new front-end XMPP servers. While we were adding those servers, the automated tool that manages our DNS records failed to update our domain with the new servers’ addresses. When we removed the old servers from rotation, users were unable to log in for 15 minutes while we manually updated our DNS records.
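One general safeguard against this kind of failure is to confirm that DNS actually returns the new servers' addresses before taking the old ones out of rotation. The sketch below shows the idea under stated assumptions; it is not the tool we run. The function names are hypothetical, the hostname in the usage example is a placeholder, the addresses come from the reserved documentation range, and port 5222 is simply the standard XMPP client port.

```python
import socket


def resolve_ips(hostname, port=5222):
    """Return the set of IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def missing_from_dns(expected_ips, resolved_ips):
    """Return the expected addresses that DNS does not currently return."""
    return set(expected_ips) - set(resolved_ips)


def safe_to_remove_old_servers(new_server_ips, resolved_ips):
    """True only if every new server address is already present in DNS."""
    return not missing_from_dns(new_server_ips, resolved_ips)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    new_servers = {"192.0.2.10", "192.0.2.11"}
    current = resolve_ips("chat.example.com")
    if safe_to_remove_old_servers(new_servers, current):
        print("DNS lists all new servers; old servers can be drained.")
    else:
        print("Not in DNS yet:", sorted(missing_from_dns(new_servers, current)))
```

A check from a single machine only reflects what that machine's resolver sees, including cached records, so in practice it would complement rather than replace verification at the DNS provider.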
We take our commitment to you, our end users, as seriously at 3B+ messages as we did at the first message. It’s unacceptable for us to have this much downtime, and while the underlying cause of each issue is different, the net result for you, our users, is often the same. We continue to invest heavily in platform, scale, performance, and reliability projects. We have several big ones in the works and hope to blog about them soon. We have an entire team dedicated to scaling HipChat further, and we are looking to scale our team as well. Come join us!
Your HipChat Team