Archive for the ‘Maintenance’ Category

HipChat and the little connection that could

Three weeks ago, we introduced HipChat’s brand new, badass web client. It’s fast, beautiful and built to change how people connect. Needless to say, we’re incredibly proud of it. But, as much as we wanted a perfect launch, we weren’t so lucky: if you tried to use the client in the first week or two, you might have noticed a few hiccups.

Sorry about that.

In the spirit of Open Company – No Bullshit, we wanted to keep our users informed about the recent outages and what we did to fix them. Some of those outages degraded other areas of HipChat, like slowing our main website and message delivery. We’ve made moves to strengthen our web client’s stability so these issues never happen again.

How connecting to the HipChat web client works, at 10,000 feet

  1. You log into www.hipchat.com, creating a session with HipChat’s web layer.
  2. After logging in, you click Launch the web app which, in the web layer, creates a session with our BOSH Server.
  3. Once connected, our BOSH server in turn creates a session with our XMPP server.

In this chain, our BOSH Server is the weakest link. It wasn’t standing up to the popularity of the new client. And unfortunately, it’s coupled to our main web tier in a really bad way.

As our BOSH server came under pressure, it triggered a large number of sessions to reconnect. This, coupled with other issues, would cause hipchat.com to degrade. This is what happened the last week of March.

The little connection that could

With the new web client, the goal was to improve client reconnection, allowing HipChat to stay resilient through network changes, roaming, outages, etc.

Previously, HipChat’s web client attempted reconnection every 10 – 30 seconds following a disconnection. This time around, we wanted a better experience: reconnecting as “automatically” as possible, hoping users would never notice a thing.

To do this, we decreased the connection retry interval from 10-30 seconds down to 2 seconds. This drastically shortened retry time, combined with a surge of new users, strained our system. When we rewrote the hipchat-js-client, we tried to ensure we had reasonable polling rates, with exponential back-off and an eventual timeout.

Here’s what the new reconnect model looked like:

[Figure: the new web client reconnect back-off schedule]

The initial reconnection attempts were too aggressive for the amount of traffic we saw. So, our first action was to quickly update the back-off rate and initial poll time to be more reasonable.
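
To make that concrete, here’s a minimal sketch of a capped exponential back-off with an eventual timeout. It’s in Python rather than the actual hipchat-js-client JavaScript, and the constants are illustrative, not our production values:

    import time

    INITIAL_WAIT = 2      # seconds -- illustrative, not our production values
    BACKOFF_RATE = 2      # each failed attempt doubles the wait
    MAX_WAIT = 300        # never wait longer than this between attempts
    GIVE_UP_AFTER = 1800  # eventual timeout: stop retrying after 30 minutes

    def reconnect_with_backoff(connect):
        started = time.time()
        wait = INITIAL_WAIT
        while time.time() - started < GIVE_UP_AFTER:
            if connect():          # connect() returns True once a session is re-established
                return True
            time.sleep(wait)
            wait = min(MAX_WAIT, wait * BACKOFF_RATE)   # exponential growth, capped
        return False               # give up and surface an error to the user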

The problem with exponential back-off

As always, things get complicated when we consider this at scale (webscale). Let’s say a large number of clients become disconnected at once due to a BOSH node failure. With our current reconnection model, we saw the following traffic pattern:

[Figure: reconnect traffic bunched into high-load spikes under plain exponential back-off. Example from the AWS blog, not actually pulled from HipChat, but you get the idea.]

Well, that’s not that much more awesome.

We’ve effectively just bunched all the reconnection requests into a series of incredibly high-load windows where all of the clients compete with each other. What we really want is more randomness, so we implemented a heavily jittered back-off algorithm. This gives us the fewest competing clients at any given moment, and still encourages the clients to back off over time.

waitTime = min(MAX_WAIT, random_integer_between(MIN_WAIT, lastComputedWaitTime * BACKOFF_RATE))
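
In the same Python-flavored sketch (the real client is JavaScript, and these constants are again made up), the jittered version looks something like this:

    import random

    MIN_WAIT = 2       # seconds -- illustrative constants, not our production values
    MAX_WAIT = 300
    BACKOFF_RATE = 3

    def next_wait(last_wait):
        # Pick a random wait between the floor and a multiple of the last wait,
        # capped at MAX_WAIT. The randomness spreads clients apart; the growing
        # upper bound still backs them off over time.
        return min(MAX_WAIT, random.randint(MIN_WAIT, int(last_wait * BACKOFF_RATE)))

    wait = MIN_WAIT
    for attempt in range(5):
        wait = next_wait(wait)
        print(attempt, wait)   # every client sees a different sequence, so retries spread out

If that looks familiar, it’s essentially the “decorrelated jitter” strategy described in the AWS post the graphs here are borrowed from.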

[Figure: reconnect traffic spread out by jittered back-off. Again, this example is from the AWS blog; they have prettier graphs.]

This model has had a huge impact, and made the service much more resilient.

Untangling the Gordian knot

As mentioned, our BOSH server and our web tier are unfortunately coupled. Currently, it’s the web tier’s job to attach a pre-authed BOSH session to new clients. We do a lot of nginx hackery to ensure that your web session and your BOSH session are live and are routed to the same box. This means anytime a web client reconnects, it hammers on its corresponding web box, making both unstable. This also makes scaling our BOSH server really tricky. And worse, it prevents service isolation, since we share a lot of resources between our website and HipChat’s web client.

As of March 26th, we’ve deployed changes that allow our web sessions and BOSH sessions to be uncoupled. In fact, all of our new web client users are already using this new auth method. This means we can scale our main website and our web client independently. We’ve already set up isolated worker pools for each. Together, these changes should ensure a misbehaving web client doesn’t cause a dead hipchat.com.

Double the trouble, double the fun

Since we knew session acquisition was our biggest pain point, we combed through our connection code, looking for ways to make it less expensive. We noticed that it was double-hitting Redis in some cases. A fix was quickly deployed, and the results?

[Figure: Redis query volume before and after the double-query fix]

They speak for themselves.
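
For the curious, the shape of the fix was roughly this (a sketch with hypothetical key names, not our actual code): fetch the session from Redis once and reuse the result, instead of letting two code paths each make their own round trip for the same key.

    import redis

    r = redis.StrictRedis()

    # Before: two round trips to Redis for the same session during one connect
    def attach_session_old(session_id):
        if r.get("session:" + session_id) is None:   # hit #1: existence check
            raise ValueError("unknown session")
        return r.get("session:" + session_id)        # hit #2: actual fetch

    # After: a single round trip, with the result checked and reused
    def attach_session_new(session_id):
        session = r.get("session:" + session_id)
        if session is None:
            raise ValueError("unknown session")
        return session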

How’s it looking?

Since we made these changes, the distribution of load on our system has improved considerably. In the graphs below, the white lines show the start of Friday 3/27.

[Figure: four days of traffic prior to the change (Tue – Fri)]


[Figure: preceding two weeks of traffic (Mon – Fri, Mon – Fri). Compare the Fridays; end-user platform usage was approximately the same.]

Many thanks

We’ve got a long list of stability and performance fixes in the pipeline to keep up with amazing growth in demand for HipChat. Thanks for your patience and support. (heart) (hipchat).

Elasticsearch at HipChat: 10x faster queries

Last fall we discussed our journey to 1 billion chat messages stored and how we used Elasticsearch to get there. By April we’d already surpassed 2 billion messages and our growth rate only continues to increase. Unfortunately all this growth has highlighted flaws in our initial Elasticsearch setup.

When we first migrated to Elasticsearch we were under time pressure from a dying CouchDB architecture and did not have the time to evaluate as many design options as we would have liked. In the end we chose a model that was easy to roll out but did not have great performance. In the graph below you can see that requests to load uncached history could take many seconds:

[Figure: average response times between 500ms and 1000ms, with spikes as high as 6000ms!]

Identifying our problem

Obviously taking this long to fetch data is not acceptable, so we started investigating.

[Image: “HipChat, Y U SO SLOW?” meme]

What we found was a simple problem that had been compounded by the sheer data size we were now working with. With CouchDB, we had stored our datetime field as a string and built views around it to do efficient range queries, something CouchDB did very well and with little memory usage.

So why did this cause such a performance problem for Elasticsearch?

Well, an old and incorrect design decision resulted in us storing datetime values in a way that was close to ISO 8601, but not entirely the same. This custom format posed no problem for CouchDB, which treated it as any other sortable string.

On the other hand, Elasticsearch keeps as much of your data in memory as possible, including the field you sort by. Since we were using these long datetime strings, it needed a lot of memory to store them: up to 18GB across our 16 nodes.

In addition, all of our in-app history queries use a range filter so we can request history between two datetimes. For Elasticsearch to answer this query, it had to load all the datetime fields from disk into memory, compute the range, and then throw away the data it didn’t need.
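
For context, one of those history queries looks roughly like the following, using the official Python client. The index and field names are simplified examples, not our actual schema:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Request the 50 most recent messages between two datetimes.
    results = es.search(
        index="chat-messages",
        body={
            "query": {
                "filtered": {
                    "filter": {
                        "range": {
                            "date": {
                                "gte": "2015-04-01 00:00:00",
                                "lte": "2015-04-02 00:00:00",
                            }
                        }
                    }
                }
            },
            "sort": [{"date": {"order": "desc"}}],
            "size": 50,
        },
    )

With the field mapped as our custom string, answering that range filter meant pulling the raw values off disk first.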

As you can imagine, this resulted in high disk usage and CPU I/O wait.

But as we mentioned earlier, Elasticsearch stores this datetime field in memory, so why can’t it use that data (known as field data) instead of going to disk? It turns out that it can, but only if the field is indexed as a numeric type (which dates are, under the hood), and we were indexing these custom datetime strings.

Kick off the reindexing!

Once we identified this problem, we tweaked our index mapping so it would store our datetime field as a proper date type (with our custom format), so all new data would get stored correctly. We leveraged Elasticsearch’s ability to store a multi-field, which meant we were able to keep our old string datetimes around for backwards compatibility. But what about the old data? Since Elasticsearch does not support applying a mapping change to an existing index, we’d need to reindex all of our old data to a new set of indices and create aliases for them. And since our cluster was under so much IO load during normal usage, we needed to do this reindexing on nights and weekends when resources were available. There were around 100 indices to rebuild, and the larger ones took 12+ hours each.
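
Here’s a sketch of what that mapping change looks like, with simplified names and a stand-in custom format. The primary field becomes a real date type, and a multi-field keeps the old string form around for backwards compatibility:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    es.indices.put_mapping(
        index="chat-messages-v2",
        doc_type="message",
        body={
            "message": {
                "properties": {
                    "date": {
                        "type": "date",                    # stored numerically under the hood
                        "format": "yyyy-MM-dd HH:mm:ss",   # stand-in for our custom format
                        "fields": {
                            # keep the old string representation for backwards compatibility
                            "as_string": {"type": "string", "index": "not_analyzed"},
                        },
                    }
                }
            }
        },
    )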

Elasticsearch helped this process by providing helper methods in their client library to assist with our reindexing. We also built a custom script around their Python client to automate the process and ensure we caused no downtime and lost no data. We hope to share this script in the future.
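
A minimal sketch of that kind of reindex pass, built on the scan and bulk helpers in the Python client (index names are examples, and the alias swap, retries, and safety checks a real run needs are left out):

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch()

    def reindex(source, target):
        # Stream every document out of the old index and bulk-load it into the
        # new one, which was created ahead of time with the corrected mapping.
        docs = (
            {
                "_index": target,
                "_type": doc["_type"],
                "_id": doc["_id"],
                "_source": doc["_source"],
            }
            for doc in helpers.scan(es, index=source)
        )
        helpers.bulk(es, docs)

    reindex("chat-messages-v1", "chat-messages-v2")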

The fruits of our labor

Once we finished reindexing we switched our query to use numeric_ranges and the results were well worth the work:

[Figure: going from 1-5s to sub-200ms queries (and data transfer)]

So the big takeaway from this experience for us was that while Elasticsearch’s dynamic mapping is great for getting you started quickly, it can handcuff you as you scale. All of our new projects with Elasticsearch use explicit mapping templates, so we know our data structure and can write queries that take advantage of it. We expect to see far more consistent and predictable performance as we race towards 10 billion messages stored.
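
As an example of what that looks like in practice, an index template pins down the mapping for every future index up front. This is a hedged sketch with placeholder names, not our production template:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    es.indices.put_template(
        name="chat-messages",
        body={
            "template": "chat-messages-*",   # applies to every index matching this pattern
            "mappings": {
                "message": {
                    "dynamic": "strict",     # reject fields we haven't mapped explicitly
                    "properties": {
                        "date": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
                        "body": {"type": "string"},
                    },
                }
            },
        },
    )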

We’d love to be able to make another order-of-magnitude performance improvement to our Elasticsearch setup and ditch our intermediate Redis cache entirely. Sound fun to you too? We’re hiring! https://www.hipchat.com/jobs

Our Mac, iOS, and Android clients are better than ever!

There’s nothing more annoying than having HipChat crash in the middle of a call or when you’re chatting with a coworker, right? It pisses us off, too. Well, you may have noticed that our Mac, iOS, and Android clients are working a lot better lately. Now that we’ve launched video (phew!) we’ve been able to focus more time addressing issues causing random bugs and crashes. We’ve reduced the number of crashes/day by almost 95% on our Mac app, 75% on our Android app, and 25% on our iOS app. In addition, we’ve resolved dozens of bugs reported by you (thanks again!).

 

What We’ve Been Doing

We’ve made a lot of changes in code and process over the past few months. Here are some highlights:

  • Aggressively monitoring crash reports in HockeyApp and quickly triaging the worst offenders
  • Using Kanban to optimize our development workflow
  • Focusing on a 2-3 week release cadence, so fixes get to you sooner rather than later
  • Making major changes to thread handling and concurrency:
    • Switching from Key Value Observers to GCDMulticastDelegate (part of the XMPPFramework) – KVO didn’t work well in our multithreaded environment
    • Encapsulating classes so a single queue manages all their behaviors
    • Fixing every place we were accidentally trying to modify an immutable collection
    • Better error handling – errors happen and the client should be able to absorb them gracefully without a complete meltdown

What’s Next?

We’re not done yet and there are many more improvements on our list. Here are a few:

  • Give our Windows and Linux apps some well deserved love
  • Performance and usability improvements!
  • Continued bug fixing on Mac, iOS, and Android

What You Can Do

If you see something funky or want to request a new feature, just let us know at http://help.hipchat.com. Also, make sure you’re using our latest clients to take advantage of all the improvements we’ve been making. You can download our Mac, Windows, iOS, Android, or Linux clients from https://www.hipchat.com/downloads.

We’re Hiring

Oh, by the way, we’re hiring! One of our core values at Atlassian is “Be the change you seek”. Want to be part of making our clients even better? Great, come join us! Apply at http://hipchat.com/jobs/.

Scheduled maintenance this Saturday at 10PM PST

We’ll be upgrading one of our main databases this Saturday, March 24th at 10PM PST for about 10 minutes. During this time our website will be unavailable and users will not be able to sign in to chat. However, if you’re already signed in when we begin the maintenance you’ll be able to keep chatting.

We have grown tremendously in the past few months and need to perform this upgrade to keep everything running smoothly. Thanks for your patience!

Update 10:15pm: All set. Everything went very smoothly.

Scheduled maintenance this Friday at 11PM PST

We’ll be taking HipChat offline for about 30 minutes this Friday, December 30th at 11PM PST (23:00) in order to roll out some major updates to our system. The biggest change we’re making is to add the all-time most requested feature: being able to sign in from multiple locations at once. More details on that soon. :)

Note: If you have not already installed the latest desktop app update (instructions here), you will be required to do so before signing in again after the maintenance is finished.

Have a fun and safe New Year’s weekend!

Scheduled Maintenance Tonight at 11PM PST

Update (23:20) – Maintenance was completed successfully. Thank you for your patience!

We’ll be shutting things down for about 30 minutes this evening beginning at 11PM PST (23:00) in order to upgrade some of our core systems. All users will be disconnected and unable to chat during this time. The website will remain available.

Additionally, all desktop client users will be prompted with a required update before signing in after the maintenance. We haven’t had a required update since mid-2010, so depending on when you signed up, you may see a large number of performance improvements and bug fixes. This new update also contains support for some new features we’ll be releasing in the future.

For the tech curious: we’re increasing our memcached cache sizes, upgrading to the latest Redis, and moving a few systems to more powerful machines in order to increase our capacity. We’ve seen massive growth lately and March is looking to be another very busy month.

Thanks to all our users for your patience and support. We’ll update this post and post on Twitter when the maintenance is complete.