This has been a difficult month for all of our customers as well as the HipChat team. I was hoping my next blog post would be about some of the successes and plans we have to improve stability, but in the spirit of our company values I want to be transparent about the outage on March 25th.
March 23, 2016 — Update: we have added more details below about what happened and what we're doing about it.
Many of you have struggled to connect, send messages, or use chat history at different times this week. Our team has been working around the clock to find a solution to these issues, but we know that you, our users, have borne the brunt of the impact.
As many of you know, HipChat experienced performance issues this week. First, we want to apologize. We know how much your teams rely on HipChat every day to get work done, and downtime is a huge inconvenience.
We also want to provide you with the technical detail on what happened and share the lessons we’ve learned. You deserve answers, and we’re here to give them.
Two days ago, HipChat experienced a short disruption in service that prevented users from signing in and sending messages. To all our users, we’re sincerely sorry. We know you rely on HipChat to keep work flowing, and any interruption can be a setback.
We want to give you all the details you deserve, from what happened to what we plan to do to prevent this in the future.
What happened and what we’re doing about it
We sincerely apologize for our recent outage: you trust us with your chats, your important documents, your cat gifs, your personal conversations, your system notifications, your internet memes – and we let you down.
Our team takes pride in building an important part of your (work) life, and we're sorry.
Short version: the recent Mac client release, which included the much-anticipated "multiple account" feature, also had a subtle reconnection bug that only manifested under very high load.
When a large network provider in the SF Bay Area had an issue Monday morning, it caused all of those clients to start reconnecting at once. This saturated our systems and prevented normal usage.
On Monday, we released an update to our backend systems, and Tuesday morning we released a new Mac app (v 3.3.1), both of which increased protection against this type of issue in the future.
We also fixed various other bugs related to reconnection in the Mac app that will prevent another connection overload like this one. And we continue to have teams building new, amazing technology that improves our system isolation, enhances our ability to do sophisticated load testing, supports even higher scale, and increases our server-side capability management (to disable misbehaving client functionality more directly, for example).
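The reconnect storm described above is a classic thundering-herd problem, and the usual client-side protection is exponential backoff with random jitter, so that thousands of disconnected clients spread their reconnect attempts out instead of hammering the servers at the same instant. Here's a minimal sketch of the idea (illustrative only, not HipChat's actual client code; the function names and limits are assumptions):

```python
import random
import time

def reconnect_with_backoff(connect, base_delay=1.0, max_delay=300.0):
    """Retry `connect` with exponential backoff and full jitter,
    so many clients that disconnect together do not all reconnect
    at the same moment."""
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            # Cap the exponential growth, then pick a random point
            # in [0, cap] ("full jitter") to spread clients out.
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
            attempt += 1
```

Without the jitter, every client computes the same delay schedule and the herd simply reconnects in synchronized waves; the randomness is what flattens the load spike.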
We never want you to be without HipChat. We fell far short of that Monday and are very sorry we let you and your teams down.
We just passed 6B messages delivered via HipChat, more than 2B of which have been delivered in 2015 – our platform is scaling and growing faster than ever thanks to teams like yours. We’re moving quickly to build a stronger HipChat as a result of the experience – thanks for your patience while we do.
TL;DR: A brief network outage exposed a timing bug in the HipChat servers, which locked out a few thousand end users for up to four hours. We are continuing to work every day to make your HipChat experience awesome.
On January 14, starting at approximately 20:00 UTC, HipChat began logging a large spike of network errors across our chat frontend layer. Transient network errors aren't something we enjoy, but they happen occasionally with high-scale internet services.
During this time clients would attempt to connect to HipChat's servers, wait for a while trying to authenticate, and then lose their connections, forcing them to retry. The network errors cleared up around 20:25 UTC, but unfortunately this wasn't the end of our problem.
The HipChat frontend servers contain a session limit “circuit breaker” that trips when a particular user has too many simultaneous logins. This feature generally protects the service against badly behaving clients. During the network problem, a bug in our frontend servers caused many user logins to hang around in an orphaned state, even though the clients had disconnected. Eventually these orphaned simultaneous logins built up and started tripping the circuit breaker, disallowing more login attempts from those clients.
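A session-limit breaker of this kind is conceptually simple, and the failure mode above falls out of it directly: if the disconnect path ever fails to clean up, orphaned sessions keep counting against the limit forever. Here's an illustrative toy (not HipChat's server code; the limit and names are assumptions):

```python
class SessionLimiter:
    """Trips when a single user holds too many simultaneous logins.
    If a disconnect is missed, the orphaned session never leaves the
    set, and the user is eventually locked out -- the failure mode
    described above."""

    def __init__(self, max_sessions=10):
        self.max_sessions = max_sessions
        self.sessions = {}  # user_id -> set of active session ids

    def login(self, user_id, session_id):
        active = self.sessions.setdefault(user_id, set())
        if len(active) >= self.max_sessions:
            raise RuntimeError("circuit breaker tripped: too many sessions")
        active.add(session_id)

    def logout(self, user_id, session_id):
        # If this cleanup is skipped (e.g. by a timing bug during a
        # network blip), the session is orphaned and keeps counting
        # against the user's limit.
        self.sessions.get(user_id, set()).discard(session_id)
```

One common hardening step is to attach a timestamp to each session and expire entries that haven't been seen in a while, so orphans age out on their own instead of requiring an emergency cleanup job.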
HipChat Ops quickly kicked off a backend process that began cleaning up the orphaned login sessions, but the sheer number of users affected meant the process took hours to complete. During this window, we pushed a hot patch to the frontend servers temporarily raising the session limit to give some relief until the cleanup process could complete.
Simultaneously our backend engineers found the timing bug in the network code that allowed the bogus logins to pile up. The permanent fix has been rolled out to production. We are working on performance improvements for the emergency cleanup process that just took too long to finish.
Our beloved users noticed this downtime: we opened more support cases in this few-hour window than we normally do in a busy month. We take our commitment to you – our HipChat end users – very seriously, and we continue to invest in our platform, operations technologies, and teams to support our enormous user growth. Our sincere apologies for any interruption in your team communication.
HipChat has been growing like crazy. A year ago we blogged about how we passed 1 Billion messages after 4 years. Today, we are over 3 Billion messages sent–and growing faster every month! However, over the past few weeks we have had some growing pains – we've suffered a few different types of service outages with HipChat – and we felt like a blog post would be helpful to explain the issues and how we are moving forward.
We understand that HipChat is critical for your team – we live in HipChat all day as well – so we know how crippling it feels to lose that connection to your team. We take this responsibility very seriously, and we’re ensuring that even as we continue to scale, your HipChat experience continues to be amazing.
Web App Troubles
Starting Oct 1st, we began to see higher than normal load on our web tier. This caused load issues at that tier, which in turn triggered all web clients to suddenly reconnect, causing a spike across our whole system. Throughout Oct 2nd we added more capacity in the web tier to handle the extra load, made several code and configuration changes to optimize how we use one of our databases, Redis, to support the higher load, and declared the issue resolved on Oct 2nd.
The optimization addressed the load issue (yay). However, along the way we inadvertently introduced two additional issues. On Oct 7th, we discovered and resolved an issue that caused our Android clients to make extra requests, effectively DDoSing ourselves. On Oct 9th, we discovered and resolved a second issue, in which bad cache data locked a small percentage of users out of HipChat.
Many of you have likely seen the report circulating about a security vulnerability in SSLv3. As soon as we had confirmation of the vulnerability, we rolled out a patched version of our server code to the whole system, including new front-end XMPP servers. In the process of adding those new front-end XMPP servers, the automated tool that manages our DNS records failed to update our domain with the new servers' addresses. When we removed the old servers from rotation, this left users unable to log in for 15 minutes while we manually updated our DNS records.
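One guard against this class of failure is a pre-flight check that DNS actually serves the new addresses before any old server is pulled from rotation. A hedged sketch using only the standard library (the hostname and IPs below are placeholders, and this is not our actual tooling):

```python
import socket

def dns_serves_all(hostname, expected_ips):
    """Return True only if every expected new-server IP already
    appears in the hostname's IPv4 records, meaning it is safe to
    remove the old servers from rotation."""
    resolved = {info[4][0]
                for info in socket.getaddrinfo(hostname, None, socket.AF_INET)}
    return set(expected_ips) <= resolved
```

A deploy script could loop on this check (DNS changes propagate asynchronously) and refuse to decommission old servers until it passes.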
We take our commitment to you, our end users, as seriously at 3B+ messages as we did at the first message. It's unacceptable for us to have this much downtime, and while the underlying cause of each issue is different, the net result for you, our users, is often the same. We continue to invest heavily in platform, scale, performance, and reliability projects – we have several big ones in the works and we hope to blog about them soon. We have an entire team dedicated to scaling HipChat further, and we are looking to scale our team as well – come join us!
Your HipChat Team
We know many in the Devops community use HipChat as an incident management tool. As many of you know, we use Amazon Web Services for hosting HipChat. If you see anything weird happening with HipChat during the AWS rolling restarts, just remember that HipChat itself could be experiencing side-effects from those same restarts.
Amazon notified us that maintenance will be occurring for instances hosting HipChat during the following windows over the coming days:
| Start Time | End Time |
| --- | --- |
| September 26, 2014 11:00:00 PM UTC-7 | September 27, 2014 5:00:00 AM UTC-7 |
| September 27, 2014 11:00:00 PM UTC-7 | September 28, 2014 5:00:00 AM UTC-7 |
| September 28, 2014 11:00:00 PM UTC-7 | September 29, 2014 5:00:00 AM UTC-7 |
| September 29, 2014 11:00:00 PM UTC-7 | September 30, 2014 5:00:00 AM UTC-7 |
Our own Devops team is getting ready for this maintenance and we expect no major problems for HipChat. Minor things to be aware of might include client reconnections and slower search responses.
Here’s an article we found helpful whilst planning.
Although we don’t expect any major issues, we just wanted to provide a heads-up.
The HipChat Devops Team.
Last week we suffered two outages affecting a large number of users. The first was from 10-11pm PST on Thursday and the second was from 12-1:30pm on Friday. Like you, we live in HipChat all day, so we know how crippling it feels to lose that connection to your team. We wanted to share the cause of the trouble since it's something that likely affects some of you and was not trivial to debug, though the fix is simple.
At the beginning of both outages, we were alerted that the backend services which provide the XMPP BOSH endpoint for our web app dropped all their connections which disconnected all their users. This led to a rush of reconnections which caused load to increase. Despite having the capacity to handle the load, many web requests became very slow and caused our database (we use MySQL on Amazon RDS) to run out of available connections. Shortly after this, our XMPP services also began to handle requests very slowly despite using persistent MySQL connections. At this point all users were unable to sign in or chat, and we had to disable our API and chat signins. It was not clear to us what was causing the slowdown since our servers and database had plenty of resources available.
It turns out that some new processing on our rsyslog logging server had caused its load to spike, which caused all our servers to begin queuing their logs locally and slowing down our services. As load increased, the problem only got worse. We stopped our rsyslog clients and everything started coming back to life immediately. Once we were back online we started searching and found that the Bitbucket team, which sits within shouting distance of us, had run into the same issue last year. Check out their blog post for technical details.
We have since found a way to reproduce this failure locally and implemented a fix by switching from TCP to UDP logging. If you’re using rsyslog we’d recommend you test your setup against this sort of failure. It’s quite embarrassing to have your logging cause so much trouble.
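To make the failure mode concrete: with TCP forwarding, a log server that stops reading eventually back-pressures every client, and logging calls inside your application start to block; UDP sends are fire-and-forget, so a slow log server drops messages instead of stalling your services. As an illustration (our services weren't necessarily Python, and the address below is a placeholder), Python's standard syslog handler exposes exactly this choice:

```python
import logging
import logging.handlers
import socket

# UDP syslog: fire-and-forget, so an overloaded rsyslog server
# cannot back-pressure and stall the application.
udp_handler = logging.handlers.SysLogHandler(
    address=("localhost", 514), socktype=socket.SOCK_DGRAM)

# TCP syslog (the risky setup described above): if the server stops
# reading, writes eventually block and every logging call stalls.
# tcp_handler = logging.handlers.SysLogHandler(
#     address=("localhost", 514), socktype=socket.SOCK_STREAM)

log = logging.getLogger("app")
log.addHandler(udp_handler)
```

The trade-off is the usual one: UDP can silently drop log lines under load, which is far preferable to TCP silently freezing the service that is doing the logging.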
Thanks to everyone for your patience during these troubles. We know they happen at the worst possible time for some of you. In the end, our service improves with each one of these failures as we continue to learn and resolve issues.
Finally, we’d like to remind people to keep an eye on @HipChat and our status site if you believe we may be having trouble. We received many individual tweets and emails during the outage but did not have the time to immediately reply to everyone.
Many of you probably noticed that we had some service issues around 11:00 PST yesterday (October 2nd). The timing was horrible and we weren't able to fix things as quickly as we'd have liked. Here's a little detail on what happened and what we've learned.
Here’s a bit more detail for the techies out there. One of our main Redis servers had stopped accepting new connections due to the high load and a configuration issue (low maxclients setting), and as a result most users were unable to sign in (though people who were already connected were able to chat normally). At this point we began Tweeting, updating our status site, and working to redirect the load from the affected servers. It would have been handy to temporarily block users from signing in so that we could get things under control, but we hadn’t built a way to prevent users from using HipChat. Perhaps this is a ‘feature’ we’ll have to add moving forward so we don’t DDoS ourselves.
We considered performing a manual failover to a slave server but realized that it had the same configuration issue as the master and would end up suffering the same fate. In the end we realized that extra Apache child processes had been spawned which were holding connections to our Redis server open and refusing to terminate. After removing them we were able to start churning through the backlog of work and users were able to sign in again.
We realize you all rely on HipChat to be available 24/7 so you can stay connected to your team and be productive. The length of this outage is unacceptable to us and we’ll be prioritizing some changes to improve our systems and prevent this from happening in the future. On the plus side, every issue we run into helps us learn more about our systems and protect them from future failures. Growing pains are a necessary thing.