When Atlassian acquired HipChat, we had sent about 110 million messages. Today that number has grown tenfold, and it’s still growing at a record pace. Scaling to meet these demands has not been easy but the HipChat Ops team is up to the task. We thought it’d be cool to shine some light on what it took, infrastructure wise, for those who are curious about this kind of stuff. In this post, we’ll highlight how we use CouchDB, ElasticSearch, and Redis to handle our load and make sure we provide as reliable a service for our users as possible.
Getting off the Couch to scale chat history and search
Originally HipChat had a single m2.4xlarge EC2 Instance running CouchDB as datastore for chat history and Couch-lucene for search, a fine set up for a small application. However, once we started to grow, we began to hit the limits of CouchDB and AWS instance size, and we’d be out of memory daily. We kicked off a project to look at other data stores and indexers to solve this problem, and we concluded that the first step involved upgrading our search indexer. So we kicked Lucene to the curb in favor of Elasticsearch.
Heeding the advice of the Loggly team, we set up 7 Elasticsearch index servers and 3 dedicated master nodes to help prevent split brain. Elasticsearch lets us add more nodes to our cluster when we need more capacity, so we can handle extra load while concurrently serving requests. Moreover, the ability to have our shards replicated across the cluster means if we ever lose an instance, we can still continue serving requests, reducing the amount of time HipChat Search is offline.
For chat history, we still use CouchDB as our datastore, but we are beginning to hit limits with AWS trying to fit everything into a single instance. Just prior to hitting a billion messages, we noticed that during compaction, our EBS volume storing our CouchDB files was running out of disk space. AWS limits EBS volumes to 1TB, so as a stop gap solution, we decided to try out EBS Raid. We at HipChat don’t believe in one-off solutions, so we used a slightly hacked version of AWS using the Opscode Chef cookbook to automate the process of creating, mounting, and formatting our RAID arrays. Our hack can even rebuild the RAID using EBS Snapshots. True webscale stuff.
Currently, we pull data from couchDB using a custom ruby import script, but since Elasticsearch has treated us so well, we are looking to replace CouchDB with just Elasticsearch. If you want to hear more about this, we plan on giving a talk about Elasticsearch at a meetup here at Atlassian.
Caching in on Redis
We at HipChat use Redis a lot, caching everything from XMPP session info to up to 2 weeks of chat history. Originally we started with two Redis servers, one caching stats and the other caching everything else, but we soon realized that we’d need more help. Today, we shard our data over 3 Redis servers, with each server having its own slave. We continue to dedicate one of these servers to hosting our stats, while leaving the other two to cache everything else.
However, even with these changes, we found that we had to upgrade our Redis history instance size as we were running out of memory close to our billion message milestone. We will continue to improve the scalability in this area of the HipChat architecture, so we can handle load and ween off our dependence on Redis clustering to mitigate single points of failure.
This is just a highlight of some parts of the HipChat infrastructure we needed to tweak to help us reach 1 billion messages. We still have a long ways to go to scale HipChat for our growing enterprise needs – improving our Redis architecture for example. A more robust system, increasing performance of our code, and mitigating or removing Single Points of Failure are large objectives that our Ops team look forward to tackling in the coming months.
If you want to learn more or think you can help us scale HipChat better, I suggest you come by our meetup. If you can’t make it to that, feel free to submit your resume here. Our team is growing fast, and we would love to have you on our team.