The air is getting chilly here at HipChat headquarters…okay, we’re in San Francisco, so it’s getting chillier. But we’ve been keeping ourselves warm with pumpkin spice lattes, long sleeves, and a lot of hard work creating new and improved features like group video chat and screen sharing, as well as new integrations with your favorite tools.
Late Friday night, right after wrapping up a maintenance window, HipChat experienced downtime that impacted a number of users. Some users found themselves experiencing slow messages while others had difficulty accessing HipChat altogether.
We take downtime very seriously and apologize for any inconvenience this caused. We understand you rely on HipChat to work, and in the spirit of our company values, we want to be open with you about the situation and our next steps.
Two days ago, HipChat experienced a short disruption in service that prevented users from signing in and sending messages. To all our users, we’re sincerely sorry. We know you rely on HipChat to keep work flowing, and any interruption can be a setback.
We want to give you all the details you deserve, from what happened to what we plan to do to prevent this in the future.
Three weeks ago, we introduced HipChat’s brand new, badass web client. It’s fast, beautiful and built to change how people connect. Needless to say, we’re incredibly proud of it. But, as much as we wanted a perfect launch, we weren’t so lucky: if you tried to use the client in the first week or two, you might have noticed a few hiccups.
Sorry about that.
In the spirit of Open Company – No Bullshit, we wanted to keep our users informed about the recent outages and what we did to fix them. Some of those outages degraded other areas of HipChat, like slowing our main website and message delivery. We’ve made moves to strengthen our web client’s stability so these issues never happen again.
How connecting to the HipChat web client works, at 10,000 feet
- You log into www.hipchat.com, creating a session with HipChat’s web layer.
- After logging in, you click Launch the web app which, in the web layer, creates a session with our BOSH Server.
- Once connected, our BOSH server in turn creates a session with our XMPP server.
In this chain, our BOSH Server is the weakest link. It wasn’t standing up to the popularity of the new client. And unfortunately, it’s coupled to our main web tier in a really bad way.
As our BOSH server came under pressure, it triggered a large number of sessions to reconnect. This, coupled with other issues, would cause hipchat.com to degrade. This is what happened the last week of March.
The little connection that could
With the new web client, the goal was to improve client reconnection, allowing HipChat to maintain resiliency toward network changes, roaming, outages, etc.
Previously, HipChat’s web client attempted reconnection every 10 – 30 seconds following a disconnection. This time around, we wanted a better experience: reconnecting as “automatically” as possible, hoping users never noticed a thing.
To do this, we decreased the reconnection retry interval from 10-30 seconds down to 2 seconds. This drastically shortened interval, combined with a surge of new users, strained our system. When we rewrote the hipchat-js-client, we tried to ensure we had reasonable polling rates with exponential back-off and an eventual timeout.
Here’s what the new reconnect model looked like:
The initial reconnection attempts were too aggressive for the amount of traffic we saw. So, our first action was to quickly update the back-off rate and initial poll time to be more reasonable.
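To make the back-off idea concrete, here is a minimal sketch of a plain exponential back-off schedule. It is not the actual hipchat-js-client code: the function name, the doubling rate, and the 60-second cap are assumptions chosen for illustration, and only the 2-second first retry comes from the post.

```python
def backoff_delays(initial_wait, backoff_rate, max_wait, attempts):
    """Return the wait (in seconds) before each reconnection attempt.

    Each wait is the previous one multiplied by backoff_rate,
    capped at max_wait. All parameter values are hypothetical.
    """
    delays = []
    wait = initial_wait
    for _ in range(attempts):
        delays.append(min(wait, max_wait))
        wait *= backoff_rate
    return delays


# 2-second first retry (from the post), doubling each time, capped at 60 s.
print(backoff_delays(2, 2, 60, 7))  # [2, 4, 8, 16, 32, 60, 60]
```

Note that every client that disconnects at the same moment computes exactly the same schedule, which matters once many clients are involved.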
The problem with exponential back-off
As always, things get complicated when we consider this at scale (webscale). Let’s say a large number of clients become disconnected at once due to a BOSH node failure. With our current reconnection model, we saw the following traffic pattern:
(Above example from AWS Blog, not actually pulled from HipChat, but you get the idea.)
Well, that’s not that much more awesome.
We’ve effectively just bunched all the reconnection requests into a series of incredibly high-load windows where all of the clients compete with each other. What we really want is more randomness. We implemented a heavily jittered algorithm design. This gives us the benefit of having the fewest competing clients, and encourages the clients to back off over time.
waitTime = min(MAX_WAIT, random_integer_between(MIN_WAIT, lastComputedWaitTime * BACKOFF_RATE))
(Again, this example from AWS Blog. They have prettier graphs.)
This model has had a huge impact, and made the service much more resilient.
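The waitTime formula above can be sketched in a few lines. This is an illustrative reading of the formula, not our production code: the constant values (MIN_WAIT of 2 seconds, MAX_WAIT of 60 seconds, BACKOFF_RATE of 3) and the function names are assumptions.

```python
import random

# Hypothetical tuning values; the post does not state the real ones.
MIN_WAIT = 2      # seconds
MAX_WAIT = 60     # seconds
BACKOFF_RATE = 3


def next_wait(last_wait, rng=random):
    """waitTime = min(MAX_WAIT, random_integer_between(MIN_WAIT, last * rate))."""
    upper = max(MIN_WAIT, last_wait * BACKOFF_RATE)
    return min(MAX_WAIT, rng.randint(MIN_WAIT, upper))


def reconnect_schedule(attempts, seed=None):
    """Return the sequence of waits one client would use across retries."""
    rng = random.Random(seed)
    waits = []
    wait = MIN_WAIT
    for _ in range(attempts):
        wait = next_wait(wait, rng)
        waits.append(wait)
    return waits


# Two clients that drop at the same instant draw different schedules,
# so their retries are spread out instead of landing in the same window.
print(reconnect_schedule(6, seed=1))
print(reconnect_schedule(6, seed=2))
```

Because each wait is drawn from a range that grows with the previous wait, clients tend to back off over time while still avoiding lockstep retries.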
Untangling the Gordian knot
As mentioned, our BOSH server and our web tier are unfortunately coupled. Currently, it’s the web tier’s job to attach a pre-authed BOSH session to new clients. We do a lot of nginx hackery to ensure that your web session and your BOSH session are live, and are routed to the same box. This means anytime a web client reconnects, it hammers on its corresponding web box, making both unstable. This also makes scaling our BOSH server really tricky. And worse, it prevents service isolation, since we share a lot of resources between our website and HipChat’s web client.
As of March 26th, we’ve deployed changes that allow our web sessions and BOSH sessions to be uncoupled. In fact, all of our new web client users are already using this new auth method. This means we can scale our main website and our web client independently. We’ve already set up isolated worker pools for each. Together, these changes should ensure a misbehaving web client doesn’t cause a dead hipchat.com.
Double the trouble, double the fun
Since we knew session acquisition was our biggest pain point, we combed through our connection code, looking for ways to make it less expensive. We noticed that it was double-hitting Redis in some cases. A fix was quickly deployed, and the results?
They speak for themselves.
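The post doesn't say exactly where the double hit was, so the following is only a sketch of one common shape of this bug: a connection path that fetches the same key twice, once to validate and once to use it. CountingStore is a hypothetical in-memory stand-in for Redis, and both connect functions are invented for illustration.

```python
class CountingStore:
    """In-memory stand-in for Redis that counts how many reads it serves."""

    def __init__(self, data):
        self._data = dict(data)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self._data.get(key)


def connect_double_hit(store, session_id):
    # Validation and setup each fetch the same key: two round trips.
    if store.get(session_id) is None:
        return None
    return {"session": session_id, "user": store.get(session_id)}


def connect_single_hit(store, session_id):
    # Fetch once and reuse the value: one round trip.
    user = store.get(session_id)
    if user is None:
        return None
    return {"session": session_id, "user": user}


before = CountingStore({"abc": "alice"})
after = CountingStore({"abc": "alice"})
connect_double_hit(before, "abc")
connect_single_hit(after, "abc")
print(before.reads, after.reads)  # 2 1
```

When every reconnect goes through this path, halving the reads per connection halves that part of the load during a reconnect storm.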
How’s it looking?
Since we made these changes, distribution of load on our system has been much improved. In the graphs below, the white lines show the start of Friday 3/27.
Four days of traffic prior to change (Tue – Fri)
Preceding two weeks of traffic (Mon – Fri, Mon – Fri), notice/compare Fridays (end user platform use level is approximately the same).
Last fall we discussed our journey to 1 billion chat messages stored and how we used Elasticsearch to get there. By April we’d already surpassed 2 billion messages and our growth rate only continues to increase. Unfortunately all this growth has highlighted flaws in our initial Elasticsearch setup.
There’s nothing more annoying than having HipChat crash in the middle of a call or when you’re chatting with a coworker, right? It pisses us off, too. Well, you may have noticed that our Mac, iOS, and Android clients are working a lot better lately. Now that we’ve launched video (phew!) we’ve been able to focus more time addressing issues causing random bugs and crashes. We’ve reduced the number of crashes/day by almost 95% on our Mac app, 75% on our Android app, and 25% on our iOS app. In addition, we’ve resolved dozens of bugs reported by you (thanks again!).
What We’ve Been Doing
We’ve made a lot of changes in code and process over the past few months. Here are some highlights:
- Aggressively monitor crash reports in HockeyApp and quickly triage the worst offenders
- Use Kanban to optimize our development workflow
- Focus on a 2-3 week release cadence, so fixes get to you sooner rather than later
- Major changes to thread handling and concurrency:
  - Switch from using Key Value Observers to GCDMulticastDelegate (part of the XMPPFramework) – KVO didn’t work well in our multithreaded environment
  - Encapsulate classes to use a single queue to manage all their behaviors
  - Fix every place we were accidentally trying to modify an immutable collection
- Better error handling – errors happen and the client should be able to absorb them gracefully without a complete meltdown
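The "single queue per class" idea from the list above can be sketched outside of Objective-C too. The real change used GCD queues in our Mac and iOS clients. The Python analogue below, with its hypothetical Roster class, only illustrates the pattern: every read and write of the object's state is funneled through one serial worker, so callers on any thread never touch the state directly.

```python
from concurrent.futures import ThreadPoolExecutor


class Roster:
    """Hypothetical class whose state is only touched on one serial queue."""

    def __init__(self):
        # A one-worker pool runs submitted tasks one at a time, in order,
        # playing the role of a serial dispatch queue.
        self._queue = ThreadPoolExecutor(max_workers=1)
        self._members = []

    def add(self, name):
        # Callable from any thread; the mutation itself runs on the queue.
        return self._queue.submit(self._members.append, name)

    def snapshot(self):
        # Return an immutable copy so callers cannot mutate internal state.
        return self._queue.submit(lambda: tuple(self._members)).result()

    def close(self):
        self._queue.shutdown(wait=True)


roster = Roster()
for name in ("ada", "grace", "linus"):
    roster.add(name)
print(roster.snapshot())  # ('ada', 'grace', 'linus')
roster.close()
```

Because tasks run in submission order on a single worker, the snapshot always sees every add that was submitted before it.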
We’re not done yet and there are many more improvements on our list. Here are a few:
- Give our Windows and Linux apps some well deserved love
- Performance and usability improvements!
- Continued bug fixing on Mac, iOS, and Android
What You Can Do
If you see something funky or want to request a new feature, just let us know at http://help.hipchat.com. Also, make sure you’re using our latest clients to take advantage of all the improvements we’ve been making. You can download our Mac, Windows, iOS, Android, or Linux clients from https://www.hipchat.com/downloads.
Oh, by the way, we’re hiring! One of our core values at Atlassian is “Be the change you seek”. Want to be part of making our clients even better? Great, come join us! Apply at http://hipchat.com/jobs/.
We’ll be upgrading one of our main databases this Saturday, March 24th at 10PM PST for about 10 minutes. During this time our website will be unavailable and users will not be able to sign in to chat. However, if you’re already signed in when we begin the maintenance you’ll be able to keep chatting.
We have grown tremendously in the past few months and need to perform this upgrade to keep everything running smoothly. Thanks for your patience!
Update 10:15pm: All set. Everything went very smoothly.
We’ll be taking HipChat offline for about 30 minutes this Friday, December 30th at 11PM PST (23:00) in order to roll out some major updates to our system. The biggest change we’re making is to add the all-time most requested feature: being able to sign in from multiple locations at once. More details on that soon. 🙂
Note: If you have not already installed the latest desktop app update (instructions here), you will be required to do so before signing in again after the maintenance is finished.
Have a fun and safe New Year’s weekend!
Update (23:20) – Maintenance was completed successfully. Thank you for your patience!
We’ll be shutting things down for about 30 minutes this evening beginning at 11PM PST (23:00) in order to upgrade some of our core systems. All users will be disconnected and unable to chat during this time. The website will remain available.
Additionally, all desktop client users will be prompted with a required update before signing in after the maintenance. We haven’t had a required update since mid-2010 so depending on when you signed up you may see a large number of performance improvements and bug fixes. This new update also contains support for some new features we’ll be releasing in the future.
For the tech curious: we’re increasing our memcached cache sizes, upgrading to the latest Redis, and moving a few systems to more powerful machines in order to increase our capacity. We’ve seen massive growth lately and March is looking to be another very busy month.
Thanks to all our users for your patience and support. We’ll update this post and post on Twitter when the maintenance is complete.