Three weeks ago, we introduced HipChat’s brand new, badass web client. It’s fast, beautiful and built to change how people connect. Needless to say, we’re incredibly proud of it. But, as much as we wanted a perfect launch, we weren’t so lucky: if you tried to use the client in the first week or two, you might have noticed a few hiccups.
Sorry about that.
In the spirit of Open Company – No Bullshit, we want to keep our users informed about the recent outages and what we did to fix them. Some of those outages degraded other areas of HipChat, like slowing our main website and message delivery. We’ve made moves to strengthen our web client’s stability so these issues never happen again.
How connecting to the HipChat web client works, at 10,000 feet
- You log into www.hipchat.com, creating a session with HipChat’s web layer.
- After logging in, you click Launch the web app, which creates a session with our BOSH server.
- Once connected, our BOSH server in turn creates a session with our XMPP server.
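For the curious, here’s roughly what step two looks like from the client’s side. This is a minimal sketch using the open-source Strophe.js library (which speaks XMPP over BOSH), not our actual client code; the endpoint URL and credentials are placeholders.

```javascript
// Minimal sketch with Strophe.js; not HipChat's actual client code.
// The BOSH endpoint, JID, and password below are placeholders.
var connection = new Strophe.Connection('https://example.hipchat.com/http-bind/');

connection.connect('user@chat.hipchat.com', 'password', function (status) {
  if (status === Strophe.Status.CONNECTED) {
    // The BOSH server now holds an XMPP session on the client's behalf.
    console.log('connected');
  } else if (status === Strophe.Status.DISCONNECTED) {
    // This is where the reconnection logic discussed below kicks in.
    console.log('disconnected');
  }
});
```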
In this chain, BOSH Server is the weakest link. It wasn’t standing up to the popularity of the new client. And unfortunately, it’s coupled to our main web tier in a really bad way.
As our BOSH server came under pressure, it triggered a large number of sessions to reconnect. This, coupled with other issues, would cause hipchat.com to degrade. This is what happened the last week of March.
The little connection that could
With the new web client, the goal was to improve client reconnection, keeping HipChat resilient to network changes, roaming, outages, etc.
Previously, HipChat’s web client attempted reconnection every 10 – 30 seconds following a disconnection. This time around, we wanted a better experience: reconnecting as “automatically” as possible, hoping users never noticed a thing.
To do this, we decreased the connection retry interval from 10–30 seconds down to 2 seconds. This drastically shorter retry time, combined with a surge of new users, strained our system. When we rewrote the hipchat-js-client, we tried to ensure we had reasonable polling rates with exponential back-off and an eventual timeout.
Here’s what the new reconnect model looked like:
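In rough code, the logic was something like the following. The constants and helper functions here are illustrative placeholders, not our production values:

```javascript
// Illustrative sketch of exponential back-off with an eventual timeout.
// INITIAL_WAIT, BACKOFF_RATE, and MAX_TOTAL_WAIT are placeholder values.
var INITIAL_WAIT = 2 * 1000;         // first retry after 2 seconds
var BACKOFF_RATE = 2;                // double the wait on each failure
var MAX_TOTAL_WAIT = 5 * 60 * 1000;  // give up after ~5 minutes

function reconnect(waitTime, elapsed) {
  if (elapsed > MAX_TOTAL_WAIT) {
    giveUpAndShowError(); // hypothetical helper
    return;
  }
  setTimeout(function () {
    attemptConnection(function (ok) { // hypothetical helper
      if (!ok) {
        reconnect(waitTime * BACKOFF_RATE, elapsed + waitTime);
      }
    });
  }, waitTime);
}

reconnect(INITIAL_WAIT, 0);
```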
The initial reconnection attempts were too aggressive for the amount of traffic we saw. So, our first action was to quickly update the back-off rate and initial poll time to be more reasonable.
The problem with exponential back-off
As always, things get complicated when we consider this at scale (webscale). Let’s say a large number of clients become disconnected at once due to a BOSH node failure. With our current reconnection model, we saw the following traffic pattern:
(Above example from AWS Blog, not actually pulled from HipChat, but you get the idea.)
Well, that’s not that much more awesome.
We’ve effectively just bunched all the reconnection requests into a series of incredibly high-load windows where all of the clients compete with each other. What we really want is more randomness. We implemented a randomized (“jittered”) back-off that results in the fewest competing clients at any moment, while still encouraging the clients to back off over time:
waitTime = min(MAX_WAIT, random_integer_between(MIN_WAIT, lastComputedWaitTime * BACKOFF_RATE))
(Again, this example from AWS Blog. They have prettier graphs.)
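In code, that formula works out to something like this. MIN_WAIT, MAX_WAIT, and BACKOFF_RATE below are illustrative values, not our production settings:

```javascript
// Jittered back-off, per the formula above.
// MIN_WAIT, MAX_WAIT, and BACKOFF_RATE are placeholder values.
var MIN_WAIT = 2 * 1000;        // 2 seconds
var MAX_WAIT = 5 * 60 * 1000;   // cap waits at 5 minutes
var BACKOFF_RATE = 3;

function randomIntegerBetween(min, max) {
  return Math.floor(Math.random() * (max - min + 1)) + min;
}

function nextWaitTime(lastComputedWaitTime) {
  return Math.min(
    MAX_WAIT,
    randomIntegerBetween(MIN_WAIT, lastComputedWaitTime * BACKOFF_RATE)
  );
}
```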
This model has had a huge impact, and made the service much more stable.
Untangling the Gordian knot
As mentioned, our BOSH server and our web tier are unfortunately coupled. Currently, it’s the web tier’s job to attach a pre-authed BOSH session to new clients. We do a lot of nginx hackery to ensure that your web session and your BOSH session are live and routed to the same box. This means any time a web client reconnects, it hammers on its corresponding web box, making both unstable. It also makes scaling our BOSH server really tricky. Worse, it prevents service isolation, since our website and HipChat’s web client share a lot of resources.
As of March 26th, we’ve deployed changes that allow our web sessions and BOSH sessions to be uncoupled. In fact, all of our new web client users are already using this new auth method. This means we can scale our main website and our web client independently. We’ve already set up isolated worker pools for each. Together, these changes should ensure a misbehaving web client doesn’t cause a dead hipchat.com.
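We won’t go into the full details of the new auth method here, but the general shape is a token handoff: the web tier issues a short-lived signed token at launch time, and any BOSH node can validate it on its own, with no shared session state. The sketch below is hypothetical; the token format, secret handling, and helper names are all assumptions, not our actual implementation:

```javascript
// Hypothetical token handoff; not HipChat's actual auth code.
var crypto = require('crypto');

// Web tier: sign a short-lived token when the user clicks Launch.
function issueBoshToken(userId, secret) {
  var payload = JSON.stringify({ userId: userId, exp: Date.now() + 60 * 1000 });
  var sig = crypto.createHmac('sha256', secret).update(payload).digest('hex');
  return Buffer.from(payload).toString('base64') + '.' + sig;
}

// BOSH tier: any node can validate the token with the shared secret,
// so web sessions and BOSH sessions no longer need to be co-located.
function validateBoshToken(token, secret) {
  var parts = token.split('.');
  var payload = Buffer.from(parts[0], 'base64').toString();
  var expected = crypto.createHmac('sha256', secret).update(payload).digest('hex');
  if (parts[1] !== expected) return null;
  var claims = JSON.parse(payload);
  return claims.exp > Date.now() ? claims : null;
}
```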
Double the trouble, double the fun
Since we knew session acquisition was our biggest pain point, we combed through our connection code, looking for ways to make it less expensive. We noticed that it was double-hitting Redis in some cases. A fix was quickly deployed, and the results?
They speak for themselves.
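We won’t pretend this sketch is the exact fix, but here’s the general shape of the problem: two code paths each fetching the same session from Redis. Deduplicating the in-flight lookup, shown below with placeholder names and a promise-based Redis client, means concurrent callers share a single round trip:

```javascript
// Hypothetical sketch of deduplicating Redis session lookups;
// names and structure are illustrative, not HipChat's actual code.
var pendingLookups = new Map();

function getSession(redisClient, sessionId) {
  // If a lookup for this session is already in flight, reuse it
  // instead of issuing a second GET to Redis.
  if (!pendingLookups.has(sessionId)) {
    var lookup = redisClient.get('session:' + sessionId)
      .finally(function () { pendingLookups.delete(sessionId); });
    pendingLookups.set(sessionId, lookup);
  }
  return pendingLookups.get(sessionId);
}
```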
How’s it looking?
Since we made these changes, distribution of load on our system has been much improved. In the graphs below, the white lines show the start of Friday 3/27.
Four days of traffic prior to change (Tue – Fri)
Preceding two weeks of traffic (Mon – Fri, Mon – Fri). Compare the Fridays: end-user platform usage is approximately the same.