How HipChat scales to 1 Billion Messages

By zuhaib | 12 months ago | 11 Comments

When Atlassian acquired HipChat, we had sent about 110 million messages. Today that number has grown tenfold, and it’s still growing at a record pace. Scaling to meet these demands has not been easy, but the HipChat Ops team is up to the task. We thought it’d be cool to shine some light on what it took, infrastructure-wise, for those who are curious about this kind of stuff. In this post, we’ll highlight how we use CouchDB, Elasticsearch, and Redis to handle our load and make sure we provide as reliable a service for our users as possible.

Road to 1 billion messages

Getting off the Couch to scale chat history and search

Originally HipChat had a single m2.4xlarge EC2 instance running CouchDB as the datastore for chat history and couchdb-lucene for search, a fine setup for a small application. However, once we started to grow, we began to hit the limits of CouchDB and the AWS instance size, and we would run out of memory daily. We kicked off a project to look at other data stores and indexers to solve this problem, and we concluded that the first step involved upgrading our search indexer. So we kicked Lucene to the curb in favor of Elasticsearch.

Heeding the advice of the Loggly team, we set up 7 Elasticsearch index servers and 3 dedicated master nodes to help prevent split brain. Elasticsearch lets us add more nodes to the cluster when we need more capacity, so we can take on extra load while continuing to serve requests. Moreover, having our shards replicated across the cluster means that if we ever lose an instance, we can still keep serving requests, reducing the amount of time HipChat Search is offline.
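
To give a rough idea of what this looks like from the application side, here is a minimal sketch using the elasticsearch-ruby client. The host names, index name, and shard/replica counts are made up for illustration and are not our production values; the elasticsearch.yml settings in the comments are the standard way to run dedicated master nodes.

    require 'elasticsearch' # elasticsearch-ruby gem

    # On the 3 dedicated masters, elasticsearch.yml carries something like:
    #   node.master: true
    #   node.data: false
    #   discovery.zen.minimum_master_nodes: 2   # quorum of masters, helps prevent split brain
    # The 7 index servers are plain data nodes (node.master: false, node.data: true).

    # Clients only talk to the data nodes; the client round-robins requests across them.
    client = Elasticsearch::Client.new(
      hosts: (1..7).map { |i| "es-data-#{i}.example.internal:9200" },  # hypothetical hosts
      retry_on_failure: true
    )

    # Replicated shards are what keep search online when an instance dies:
    # every primary shard has a copy living on a different node.
    client.indices.create(
      index: 'chat_history',
      body:  { settings: { number_of_shards: 7, number_of_replicas: 1 } }
    )

    puts client.cluster.health['status'] # "green" once all primaries and replicas are assigned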

For chat history, we still use CouchDB as our datastore, but we are beginning to hit limits with AWS trying to fit everything into a single instance. Just prior to hitting a billion messages, we noticed that during compaction, the EBS volume storing our CouchDB files was running out of disk space. AWS limits EBS volumes to 1TB, so as a stopgap solution, we decided to try RAID over multiple EBS volumes. We at HipChat don’t believe in one-off solutions, so we used a slightly hacked version of the Opscode AWS Chef cookbook to automate the process of creating, mounting, and formatting our RAID arrays. Our hack can even rebuild the RAID from EBS snapshots. True webscale stuff.
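
For anyone wanting to do something similar, here is a simplified sketch of what such a recipe can look like, built on the aws cookbook’s aws_ebs_volume resource plus Chef’s built-in mdadm and mount resources. The device names, volume size, attribute paths, and mount point below are placeholders, and our real recipe (including the snapshot-based rebuild) has more moving parts.

    # Simplified sketch of an EBS RAID recipe (not our production cookbook).
    # Assumes the Opscode "aws" cookbook is available for aws_ebs_volume, and that
    # node['aws'] holds credentials (attribute names here are illustrative).

    devices = %w(/dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi)  # placeholder device names

    # Create and attach one EBS volume per device in the array.
    devices.each do |dev|
      aws_ebs_volume "couch_raid_#{dev.split('/').last}" do
        aws_access_key        node['aws']['access_key']
        aws_secret_access_key node['aws']['secret_key']
        size                  1000               # GB; roughly the per-volume EBS limit at the time
        device                dev
        action [:create, :attach]
      end
    end

    # Stripe the volumes together into a single RAID 0 array.
    mdadm '/dev/md0' do
      devices devices
      level   0
      action  [:create, :assemble]
    end

    # Format on first run only, then mount where CouchDB expects its files.
    execute 'mkfs_md0' do
      command 'mkfs.ext4 /dev/md0'
      not_if  'blkid /dev/md0'   # skip if a filesystem already exists
    end

    mount '/var/lib/couchdb' do
      device  '/dev/md0'
      fstype  'ext4'
      options 'noatime'
      action [:mount, :enable]
    end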

Currently, we pull data from CouchDB using a custom Ruby import script, but since Elasticsearch has treated us so well, we are looking to replace CouchDB with just Elasticsearch. If you want to hear more about this, we plan on giving a talk about Elasticsearch at a meetup here at Atlassian.
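
The import itself is conceptually simple: page through CouchDB and feed batches of documents into Elasticsearch’s bulk API. A stripped-down sketch (not our actual script; the database name, index name, and hosts are made up) looks something like this:

    #!/usr/bin/env ruby
    # Rough sketch of a CouchDB -> Elasticsearch import, using only the standard library.
    require 'net/http'
    require 'json'
    require 'uri'

    COUCH = 'http://localhost:5984/history'                   # hypothetical CouchDB database
    ES    = 'http://localhost:9200/chat_history/message/_bulk' # hypothetical ES index/type
    BATCH = 1000

    start_key = nil
    loop do
      # Page through CouchDB with _all_docs, pulling full documents along.
      params = { 'include_docs' => 'true', 'limit' => BATCH + 1 }
      params['startkey'] = start_key.to_json if start_key
      uri = URI("#{COUCH}/_all_docs")
      uri.query = URI.encode_www_form(params)
      rows = JSON.parse(Net::HTTP.get(uri))['rows']
      break if rows.empty?

      page      = rows.first(BATCH)
      start_key = rows.size > BATCH ? rows.last['key'] : nil

      # Build an Elasticsearch bulk body: one action line plus one document line per message.
      body = page.reject { |row| row['id'].start_with?('_design/') }.flat_map do |row|
        doc = row['doc'].reject { |k, _| k.start_with?('_') }  # drop CouchDB metadata (_id, _rev)
        [{ index: { _id: row['id'] } }.to_json, doc.to_json]
      end.join("\n") + "\n"

      Net::HTTP.post(URI(ES), body, 'Content-Type' => 'application/x-ndjson')
      break if start_key.nil?
    end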

Caching in on Redis

We at HipChat use Redis a lot, caching everything from XMPP session info to up to 2 weeks of chat history. Originally we started with two Redis servers, one caching stats and the other caching everything else, but we soon realized that we’d need more help. Today, we shard our data over 3 Redis servers, with each server having its own slave. We continue to dedicate one of these servers to hosting our stats, while leaving the other two to cache everything else.
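
As a rough illustration of the client side of this layout, here is what it could look like with the redis-rb gem. The host names are invented, and redis-rb’s built-in client-side hashing is just one way to spread keys over several masters; the replication to each slave is configured on the servers themselves.

    require 'redis'
    require 'redis/distributed'

    # Stats get their own dedicated server...
    stats = Redis.new(url: 'redis://redis-stats.example.internal:6379')

    # ...while everything else (XMPP session info, recent chat history, etc.) is
    # sharded across the remaining masters, each of which has a slave for failover.
    cache = Redis::Distributed.new(%w[
      redis://redis-cache-1.example.internal:6379
      redis://redis-cache-2.example.internal:6379
    ])

    cache.setex('history:room:42', 14 * 24 * 3600, '{"msgs": [...]}')  # keep ~2 weeks of history
    stats.incr('messages:sent:total')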

However, even with these changes, we found that we had to upgrade our Redis history instance size as we were running out of memory close to our billion-message milestone. We will continue to improve the scalability of this area of the HipChat architecture so we can handle the load, wean off our dependence on any single Redis instance, and look to Redis clustering to mitigate single points of failure.

Future

This is just a highlight of some of the parts of the HipChat infrastructure we needed to tweak to help us reach 1 billion messages. We still have a long way to go to scale HipChat for our growing enterprise needs, improving our Redis architecture for example. Building a more robust system, increasing the performance of our code, and mitigating or removing single points of failure are large objectives that our Ops team looks forward to tackling in the coming months.

If you want to learn more, or think you can help us scale HipChat better, I suggest you come by our meetup. If you can’t make it to that, feel free to submit your resume here. Our team is growing fast, and we would love to have you on board.

HipChat is group chat and IM built for teams. Learn more
  • hltbra

    Is the meetup link in the end correct? (It’s pointing to elasticsearch.org)

  • antirez

    Are you interested in Redis Cluster? I’m willing to work with companies that want to introduce Redis Cluster in their architecture. Cheers.

  • Jeff

    Good catch! It should be fixed now :)

  • erlend_sh

    I love the fact that HipChat works with XMPP, but I’m still curious: Would development, especially scaling challenges like this one, have been any easier if you had chosen not to go with the XMPP protocol?

    I noticed a lot of the new kids on the block don’t seem to give a damn about it and just roll their own, so I’m wondering if they’re doing it for any particular reason besides just getting to do things exactly their way (and as a result creating more lock-in… boo!)

  • Mark

An interesting read, and it seems Elasticsearch is really gaining traction. In terms of your AWS setup, are you using stock AMIs (Amazon Linux, Ubuntu, RHEL, etc.) or something custom? Also, do you develop in AWS as well, or do this in house and push to AWS for staging/prod using Chef?

  • Benoit

Great article! It’s always good to have an overview of how it’s built behind the scenes :)

One question about CouchDB: are you dealing with the 1B messages in a single DB, or do you have a split at the organization level, for example (or something else)?

  • http://www.zuhaiblog.com zuhaib

@antirez:disqus would love to chat with you. Matt (my co-worker at HipChat) sent you an email earlier, and if you like you can reach me at zuhaib@hipchat.com

  • http://www.zuhaiblog.com zuhaib

@182d98fe5997c11ba507d035553f9c63:disqus Currently we are running on a single AWS instance, with a slave set up as a hot failover. It’s one of the reasons we are looking at moving to an Elasticsearch cluster, to remove this single point of failure.

  • http://www.zuhaiblog.com zuhaib

@disqus_YMJ4jrIfhN:disqus We currently run on Ubuntu LTS using stock AMIs, but have played with custom AMIs for the future. For dev we use Vagrant with Chef to configure local VMs using the same configurations as prod, and once we feel something is well baked we roll it out to prod. In the future we might talk more about our workflow.

  • John Reeder

    The statement that you ‘kicked Lucene to the curb in favor of Elasticsearch’ is interesting considering the fact that Elasticsearch is a front end for Lucene. Perhaps you meant ‘kicked Couch-Lucene to the curb’?

  • AJW

    Your last few updates make us want to stop using HipChat