We recently announced new search improvements in HipChat that support advanced searching of chat histories. At the same time, to support the scale at which HipChat is growing we needed to rethink our search architecture. The result? We switched to Elasticsearch.
Originally, HipChat search was powered by a single AWS instance running couchdb-lucene. This was acceptable in the early days. But we had the issue of a single point of failure for our search system.
As HipChat grew, we needed a bigger and bigger AWS instance – to the point that we were using the 2nd largest memory instance AWS had. Even then we experienced periods of search outages preventing our users from searching all because we have a single instance with no redundancy.
Say Hello to Elasticsearch
We determined our previous setup was not sustainable so we kicked off a project to find a new search engine. After kicking the tires on a few solutions we landed on Elasticsearch.
- It is built on top of apache lucene so it is familiar to us
- It supports distributed nodes allowing us to run multiple nodes on different AWS availability zones
- It supports robust plugins, including one allowing AWS node discovery
- We could roll it out with as little impact to users as possible
You can read more about all the great Elasticsearch features here.
Deploying Elasticsearch at HipChat
Search error percentage before and after Elasticsearch
Rolling out Elasticsearch at HipChat with as little impact to users was a key goals of ours. During the evaluation phase we wrote a script that duplicated our search queries in production and ran them against the search engines we were testing, logging any differences in the response and logging response times to statd/graphite. This allowed us to figure our which service could handle the load we generated.
Working with the Elasticserch consulting team we determined that the standard couchdb-river would not work for us so we built a custom ruby importer to support the type of performance we needed. We hope someday to open source this script but currently its very tailored to our needs.
At HipChat, we leverage feature toggling for many of our features so when we roll out something new we can enable it for only certain groups. This allows us to test at scale without causing disruption to all of our customers. We used this feature to roll out Elasticsearch slowly to a small sub-set of customers (Thanks to all the customers who helped us Beta test Elasticsearch!). Once we got comfortable with Elasticsearch and saw it was beating out couchdb-lucene we decided to roll it out to all of our customers with minimal impact on end users.
The Ops team at HipChat is still working on other ways to scale HipChat to make it as reliable as possible!
Happy Searching all!