Archive for the ‘Maintenance’ Category


Elasticsearch at HipChat: 10x faster queries

By zuhaib | 3 months ago | 5 Comments

Last fall we discussed our journey to 1 billion chat messages stored and how we used Elasticsearch to get there. By April we’d already surpassed 2 billion messages, and our growth rate only continues to increase. Unfortunately, all this growth has highlighted flaws in our initial Elasticsearch setup.

When we first migrated to Elasticsearch we were under time pressure from a dying CouchDB architecture and did not have the time to evaluate as many design options as we would have liked. In the end we chose a model that was easy to roll out but did not have great performance. In the graph below you can see that requests to load uncached history could take many seconds:

Average response times between 500ms and 1000ms, with spikes as high as 6000ms!

Identifying our problem

Obviously, taking this long to fetch data is not acceptable, so we started investigating.

Hipchat Y U SO SLOW

What we found was a simple problem that had been compounded by the sheer size of the data we were now working with. With CouchDB we had stored our datetime field as a string and built views around it to do efficient range queries, something CouchDB did very well and with little memory usage.

So why did this cause such a performance problem for Elasticsearch?

Well, an old and incorrect design decision had us storing datetime values in a format that was close to ISO 8601, but not quite the same. This custom format posed no problem for CouchDB, which treated it like any other sortable string.

On the other hand, Elasticsearch keeps as much of your data in memory as possible, including the field you sort by. Since we were using these long datetime strings, it needed a lot of memory to store them: up to 18GB across our 16 nodes.
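
If you’re curious how to check this on your own cluster, the nodes stats API exposes field data memory per node. Here’s a minimal sketch using the official Python client, assuming a cluster reachable at localhost:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])

    # Ask every node how much memory its field data cache is holding.
    stats = es.nodes.stats(metric="indices", index_metric="fielddata")

    for node_id, node in stats["nodes"].items():
        used_bytes = node["indices"]["fielddata"]["memory_size_in_bytes"]
        print("%s: %.2f GB of field data" % (node.get("name", node_id),
                                             used_bytes / 1024.0 ** 3))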

In addition, all of our in-app history queries use a range filter so we can request history between two datetimes. For Elasticsearch to answer such a query, it had to load all the datetime fields from disk into memory, compute the range, and then throw away the data it didn’t need.
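
Roughly, an uncached history query looked like the sketch below. The index, type, and field names (and the datetime format) are illustrative stand-ins, not our actual schema; the important part is the range filter running against a string-typed field:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])

    body = {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "range": {
                        # "created" held our custom almost-ISO-8601 strings,
                        # so Elasticsearch pulled the values off disk to
                        # evaluate the range.
                        "created": {
                            "gte": "2014-06-01 00:00:00",
                            "lte": "2014-06-02 00:00:00",
                        }
                    }
                },
            }
        },
        "sort": [{"created": {"order": "desc"}}],
        "size": 50,
    }

    results = es.search(index="messages", doc_type="message", body=body)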

As you can imagine, this resulted in high disk usage and CPU I/O wait.

But as we mentioned earlier, Elasticsearch stores this datetime field in memory, so why can’t it use that data (known as field data) instead of going to disk? It turns out that it can, but only when the field is a numeric type, and our custom datetime strings were anything but.

Kick off the reindexing!

Once we identified this problem we tweaked our index mapping to store our datetime field as a datetime type (with our custom format) so all new data would be stored correctly. We leveraged Elasticsearch’s multi-field support, which let us keep our old string datetimes around for backwards compatibility. But what about the old data? Since Elasticsearch does not support changing the mapping of an existing field on an index, we’d need to reindex all of our old data into a new set of indices and create aliases for them. And since our cluster was under so much I/O load during normal usage, we needed to do this reindexing on nights and weekends when resources were available. There were around 100 indices to rebuild, and the larger ones took 12+ hours each.
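
Concretely, the new mapping looked something like this sketch (illustrative names again, and a stand-in for our custom format): the field becomes a proper date, with a multi-field keeping the raw string for backwards compatibility.

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])

    mapping = {
        "message": {
            "properties": {
                "created": {
                    # Parsed as a date (stored as a long internally),
                    # using the custom format (a stand-in shown here).
                    "type": "date",
                    "format": "yyyy-MM-dd HH:mm:ss",
                    "fields": {
                        # The original string, kept around for
                        # backwards compatibility.
                        "raw": {"type": "string", "index": "not_analyzed"},
                    },
                }
            }
        }
    }

    es.indices.put_mapping(index="messages-v2", doc_type="message",
                           body=mapping)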

Elasticsearch helped this process along by providing reindexing helper methods in their client library. We also built a custom script around their Python client to automate the process and ensure no downtime or data loss. We hope to share this script in the future.
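
Until then, here’s a bare-bones sketch of the same idea built on the scan and bulk helpers that ship with the Python client (the index names are made up):

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(["localhost:9200"])

    def reindex(source_index, target_index):
        """Stream every document from source_index into target_index."""
        def actions():
            # scan() walks the entire index with the scroll API, so we
            # never hold more than one page of documents in memory.
            for doc in helpers.scan(es, index=source_index,
                                    query={"query": {"match_all": {}}}):
                yield {
                    "_index": target_index,
                    "_type": doc["_type"],
                    "_id": doc["_id"],
                    "_source": doc["_source"],
                }

        # bulk() batches the writes against the new index.
        helpers.bulk(es, actions())

    reindex("messages-000042", "messages-000042-v2")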

The fruits of our labor

Once we finished reindexing we switched our queries to use the numeric_range filter, and the results were well worth the work:

Going from 1-5s to sub-200ms queries (and data transfer)
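
The query change itself was tiny. A sketch of the new version, with the same illustrative names as before:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])

    body = {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    # numeric_range evaluates against the in-memory field
                    # data of the (now numeric) date field instead of
                    # reading values off disk.
                    "numeric_range": {
                        "created": {
                            "gte": "2014-06-01 00:00:00",
                            "lte": "2014-06-02 00:00:00",
                        }
                    }
                },
            }
        },
        "sort": [{"created": {"order": "desc"}}],
        "size": 50,
    }

    results = es.search(index="messages", doc_type="message", body=body)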

So the big takeaway from this experience for us was that while Elasticsearch’s dynamic mapping is great for getting started quickly, it can handcuff you as you scale. All of our new projects with Elasticsearch use explicit mapping templates so we know our data structure and can write queries that take advantage of it. We expect to see far more consistent and predictable performance as we race towards 10 billion messages stored.
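
To make that concrete, an explicit template might look like this sketch (illustrative names), which locks down the mapping for any new index before dynamic mapping can guess at it:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])

    template = {
        # Applied automatically to any new index whose name matches.
        "template": "messages-*",
        "mappings": {
            "message": {
                # Reject documents with fields we didn't plan for.
                "dynamic": "strict",
                "properties": {
                    "body": {"type": "string"},
                    "created": {"type": "date",
                                "format": "yyyy-MM-dd HH:mm:ss"},
                },
            }
        },
    }

    es.indices.put_template(name="messages", body=template)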

We’d love to be able to make another order-of-magnitude performance improvement to our Elasticsearch setup and ditch our intermediate Redis cache entirely. Sound fun to you too? We’re hiring! https://www.hipchat.com/jobs


Our Mac, iOS, and Android clients are better than ever!

By Michael Benner | 4 months ago | 1 Comment

There’s nothing more annoying than having HipChat crash in the middle of a call or when you’re chatting with a coworker, right? It pisses us off, too. Well, you may have noticed that our Mac, iOS, and Android clients are working a lot better lately. Now that we’ve launched video (phew!) we’ve been able to focus more time on addressing the issues causing random bugs and crashes. We’ve reduced the number of crashes per day by almost 95% on our Mac app, 75% on our Android app, and 25% on our iOS app. In addition, we’ve resolved dozens of bugs reported by you (thanks again!).


What We’ve Been Doing

We’ve made a lot of changes in code and process over the past few months. Here are some highlights:

  • Aggressively monitoring crash reports in HockeyApp and quickly triaging the worst offenders
  • Using Kanban to optimize our development workflow
  • Focusing on a 2-3 week release cadence, so fixes get to you sooner rather than later
  • Making major changes to thread handling and concurrency:
    • Switching from Key Value Observers to GCDMulticastDelegate (part of the XMPPFramework), since KVO didn’t work well in our multithreaded environment
    • Encapsulating classes so that a single queue manages all of their behavior
    • Fixing every place we were accidentally trying to modify an immutable collection
    • Improving error handling, because errors happen and the client should be able to absorb them gracefully without a complete meltdown

What’s Next?

We’re not done yet and there are many more improvements on our list. Here are a few:

  • Give our Windows and Linux apps some well-deserved love
  • Performance and usability improvements!
  • Continued bug fixing on Mac, iOS, and Android

What You Can Do

If you see something funky or want to request a new feature, just let us know at http://help.hipchat.com. Also, make sure you’re using our latest clients to take advantage of all the improvements we’ve been making. You can download our Mac, Windows, iOS, Android, or Linux clients from https://www.hipchat.com/downloads.

We’re Hiring

Oh, by the way, we’re hiring! One of our core values at Atlassian is “Be the change you seek”. Want to be part of making our clients even better? Great, come join us! Apply at http://hipchat.com/jobs/.


Scheduled maintenance this Saturday at 10PM PST

By Garret Heaton | 2 years ago | 3 Comments

We’ll be upgrading one of our main databases this Saturday, March 24th at 10PM PST for about 10 minutes. During this time our website will be unavailable and users will not be able to sign in to chat. However, if you’re already signed in when we begin the maintenance you’ll be able to keep chatting.

We have grown tremendously in the past few months and need to perform this upgrade to keep everything running smoothly. Thanks for your patience!

Update 10:15pm: All set. Everything went very smoothly.


Scheduled maintenance this Friday at 11PM PST

By Garret Heaton | 3 years ago | 0 Comments

We’ll be taking HipChat offline for about 30 minutes this Friday, December 30th at 11PM PST (23:00) in order to roll out some major updates to our system. The biggest change we’re making is to add the all-time most requested feature: being able to sign in from multiple locations at once. More details on that soon. :)

Note: If you have not installed the latest desktop app update already (instructions here) you will be required to do so before signing in again after the maintenance is finished.

Have a fun and safe New Year’s weekend!


Scheduled Maintenance Tonight at 11PM PST

By Garret Heaton | 4 years ago | 0 Comments

Update (23:20) - Maintenance was completed successfully. Thank you for your patience!

We’ll be shutting things down for about 30 minutes this evening beginning at 11PM PST (23:00) in order to upgrade some of our core systems. All users will be disconnected and unable to chat during this time. The website will remain available.

Additionally, all desktop client users will be prompted with a required update before signing in after the maintenance. We haven’t had a required update since mid-2010, so depending on when you signed up you may see a large number of performance improvements and bug fixes. This new update also contains support for some new features we’ll be releasing in the future.

For the tech-curious: we’re increasing our memcached cache sizes, upgrading to the latest Redis, and moving a few systems to more powerful machines in order to increase our capacity. We’ve seen massive growth lately and March is looking to be another very busy month.

Thanks to all our users for your patience and support. We’ll update this post and post on Twitter when the maintenance is complete.