Archive for the ‘How HipChat Works’ Category


Elasticsearch at HipChat: 10x faster queries

By zuhaib | 4 months ago | 6 Comments

Last fall we discussed our journey to 1 billion chat messages stored and how we used Elasticsearch to get there. By April we’d already surpassed 2 billion messages, and our growth rate only continues to increase. Unfortunately, all this growth has highlighted flaws in our initial Elasticsearch setup.

When we first migrated to Elasticsearch we were under time pressure from a dying CouchDB architecture and did not have the time to evaluate as many design options as we would have liked. In the end we chose a model that was easy to roll out but did not have great performance. In the graph below you can see that requests to load uncached history could take many seconds:

Average response times between 500ms and 1000ms, with spikes as high as 6000ms!

Identifying our problem

Obviously taking this long to fetch data is not acceptable, so we started investigating.

Hipchat Y U SO SLOW

What we found was a simple problem that had been compounded by the sheer data size we were now working with. With CouchDB we had stored our datetime field as a string and built views around it to do efficient range queries, something CouchDB did very well and with little memory usage.

So why did this cause such a performance problem for Elasticsearch?

Well, an old and incorrect design decision resulted in us storing datetime values in a way that was close to ISO 8601, but not entirely the same. This custom format posed no problem for CouchDB, which treated it like any other sortable string.

On the other hand, Elasticsearch keeps as much of your data in memory as possible, including the field you sort by. Since we were using these long datetime strings, it needed a large amount of memory to store them: up to 18GB across our 16 nodes.

In addition, all of our in-app history queries use a range filter so we can request history between two datetimes. For Elasticsearch to answer such a query, it had to load all the datetime fields from disk into memory, compute the range, and then throw away the data it didn’t need.

As you can imagine, this resulted in high disk usage and CPU I/O wait.

But as we mentioned earlier, Elasticsearch already stores this datetime field in memory, so why can’t it use that data (known as field data) instead of going to disk? It turns out that it can, but only if the field is indexed as a numeric type that supports numeric range queries, and ours was a custom datetime string.

Kick off the reindexing!

Once we identified this problem, we tweaked our index mapping to store our datetime field as a date type (with our custom format) so all new data would be stored correctly. We leveraged Elasticsearch’s ability to store a multi-field, which meant we were able to keep our old string datetimes around for backwards compatibility. But what about the old data? Since Elasticsearch does not support changing the mapping of an existing field in place, we’d need to reindex all of our old data into a new set of indices and create aliases for them. And since our cluster was under so much I/O load during normal usage, we needed to do this reindexing on nights and weekends when resources were available. There were around 100 indices to rebuild, and the larger ones took 12+ hours each.
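To make the change concrete, here is a rough sketch of the shape of mapping we moved to, written against the Python client. The index name, field names, and date format below are placeholders rather than our actual schema.

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["es-node1:9200"])  # placeholder host

    # Create a new index whose "created" field is a real date (with a custom
    # format), while a multi-field keeps the original string form around for
    # backwards compatibility. All names here are illustrative.
    es.indices.create(
        index="messages_v2_0001",
        body={
            "mappings": {
                "message": {
                    "properties": {
                        "created": {
                            "type": "date",
                            "format": "yyyy-MM-dd HH:mm:ss",  # stand-in for our custom format
                            "fields": {
                                "raw": {"type": "string", "index": "not_analyzed"}
                            },
                        }
                    }
                }
            }
        },
    )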

Elasticsearch made this process easier by providing reindexing helpers in its client library. We also built a custom script around the Python client to automate the process and ensure we caused no downtime and lost no data. We hope to share this script in the future.
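As a rough illustration of that workflow, here is a minimal sketch using the reindex helper from the Python client; the index and alias names are hypothetical, and the real script added throttling, verification, and scheduling around this core.

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(["es-node1:9200"])  # placeholder host

    # Copy an old index into a new index that carries the corrected mapping.
    helpers.reindex(es, source_index="messages_0001", target_index="messages_0001_v2")

    # After verifying the copy, drop the old index and alias the new one under
    # the old name so existing queries keep working unchanged.
    es.indices.delete(index="messages_0001")
    es.indices.put_alias(index="messages_0001_v2", name="messages_0001")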

The fruits of our labor

Once we finished reindexing, we switched our query to use numeric_range filters, and the results were well worth the work:

Going from 1-5s to sub-200ms queries (and data transfer)
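For the curious, this is roughly what an in-app history query looks like once the sort field is a real date; the index, field names, and values are placeholders, and the exact filter name (range vs. numeric_range) depends on the Elasticsearch version in use.

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["es-node1:9200"])  # placeholder host

    # Fetch the most recent messages between two datetimes. Because "created"
    # is now a date (stored as a number internally), the filter and sort can
    # be answered from field data instead of re-reading long strings from disk.
    results = es.search(
        index="messages_0001",
        body={
            "query": {
                "filtered": {
                    "filter": {
                        "range": {
                            "created": {
                                "gte": "2014-06-01 00:00:00",
                                "lte": "2014-06-02 00:00:00",
                            }
                        }
                    }
                }
            },
            "sort": [{"created": {"order": "desc"}}],
            "size": 50,
        },
    )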

So the big takeaway from this experience for us was that while Elasticsearch’s dynamic mapping is great for getting you started quickly, it can handcuff you as you scale. All of our new projects with Elasticsearch use explicit mapping templates so we know our data structure and can write queries that take advantage of it. We expect to see far more consistent and predictable performance as we race towards 10 billion messages stored.
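A sketch of what such an explicit template can look like, again via the Python client with hypothetical names; new indices matching the pattern pick up the mapping automatically, and strict dynamic mapping keeps surprise fields out.

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["es-node1:9200"])  # placeholder host

    es.indices.put_template(
        name="messages-template",        # hypothetical template name
        body={
            "template": "messages_*",    # applied to any index matching this pattern
            "mappings": {
                "message": {
                    "dynamic": "strict",  # reject fields we did not declare up front
                    "properties": {
                        "created": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
                        "body": {"type": "string"},
                        "room_id": {"type": "long"},
                    },
                }
            },
        },
    )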

We’d love to be able to make another order of magnitude performance improvement to our Elasticsearch setup and ditch our intermediate Redis cache entirely. Sound fun to you too? We’re hiring! https://www.hipchat.com/jobs


HipChat + Elasticsearch guest list expanded

By zuhaib | 11 months ago | 1 Comment

You want it – You got it

75 more slots to attend SF Elasticsearch’s Meetup on November 18th

Capacity reached! But we will be recording the talk and sharing it with the Elasticsearch community.

At HipChat, we’re big fans of Elasticsearch. It’s helped us scale our infrastructure. The title of this talk will be “Heavy Lifting: How HipChat Scaled to 1 Billion Messages.”

Originally, we thought 75 people might register for the talk. But, the Bay Area Elasticsearch community is bigger and more passionate than we anticipated. So we’re doubling our guest count to 150 people. RSVP now to save your spot!

Note: We also changed the time of the Meetup. The main talk will begin at 6:30pm. Not 7pm.

What you’ll learn

One of the keys to our success has been building a scalable backend. Elasticsearch has played a big part in this.

We plan to talk about how we scaled to sending over 1 billion messages and how Elasticsearch allows us to index all of them and make near-realtime search possible across the entire billion. We will also discuss our future with Elasticsearch: using it for more than just search and logs. We’ll share some tips and things we learned (and are still learning) about our transition to Elasticsearch.

Why you should attend

  • Free pizza, beer and sodas
  • A chance to talk with our engineering team, including HipChat Founders
  • You’ll get some of HipChat’s popular meme stickers
  • Chance to win limited-edition HipChat t-shirts
  • Learn something cool

Did we mention we’re hiring?

Full disclosure: we know the Elasticsearch community is packed with incredible engineering talent. We’d love to talk with you about current and future opportunities to build the best damn group chat application for teams. We’ll have one of our talent coordinators on-site in case you have questions about the company, our values or the hiring process.

Can’t make it? You can always submit a resume to jobs@hipchat.com. We (heart) smart people.


How HipChat scales to 1 Billion Messages

By zuhaib | 1 year ago | 11 Comments

When Atlassian acquired HipChat, we had sent about 110 million messages. Today that number has grown tenfold, and it’s still growing at a record pace. Scaling to meet these demands has not been easy, but the HipChat Ops team is up to the task. We thought it’d be cool to shine some light on what it took, infrastructure-wise, for those who are curious about this kind of stuff. In this post, we’ll highlight how we use CouchDB, Elasticsearch, and Redis to handle our load and make sure we provide as reliable a service for our users as possible.

Road to 1 billion messages

Getting off the Couch to scale chat history and search

Originally HipChat had a single m2.4xlarge EC2 instance running CouchDB as the datastore for chat history and couchdb-lucene for search, a fine setup for a small application. However, once we started to grow, we began to hit the limits of both CouchDB and the AWS instance size, and we’d run out of memory daily. We kicked off a project to look at other data stores and indexers to solve this problem, and we concluded that the first step involved upgrading our search indexer. So we kicked Lucene to the curb in favor of Elasticsearch.

Heeding the advice of the Loggly team, we set up 7 Elasticsearch index servers and 3 dedicated master nodes to help prevent split brain. Elasticsearch lets us add more nodes to our cluster when we need more capacity, so we can handle extra load while concurrently serving requests. Moreover, the ability to have our shards replicated across the cluster means if we ever lose an instance, we can still continue serving requests, reducing the amount of time HipChat Search is offline.
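Dedicated masters only prevent split brain if a quorum is enforced. The node roles themselves live in each node’s elasticsearch.yml, but the quorum size can be set dynamically; below is a minimal sketch, assuming three master-eligible nodes and a placeholder host.

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["es-master-1:9200"])  # placeholder host

    # With 3 master-eligible nodes, require a quorum of 2 ((3 / 2) + 1) before a
    # master can be elected, so a partitioned minority can never elect its own.
    es.cluster.put_settings(
        body={"persistent": {"discovery.zen.minimum_master_nodes": 2}}
    )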

For chat history, we still use CouchDB as our datastore, but we are beginning to hit limits with AWS when trying to fit everything into a single instance. Just prior to hitting a billion messages, we noticed that during compaction, the EBS volume storing our CouchDB files was running out of disk space. AWS limits EBS volumes to 1TB, so as a stop-gap solution we decided to try RAID over EBS. We at HipChat don’t believe in one-off solutions, so we used a slightly hacked version of the Opscode aws Chef cookbook to automate the process of creating, mounting, and formatting our RAID arrays. Our hack can even rebuild the RAID from EBS snapshots. True webscale stuff.

Currently, we pull data from CouchDB using a custom Ruby import script, but since Elasticsearch has treated us so well, we are looking to replace CouchDB with just Elasticsearch. If you want to hear more about this, we plan on giving a talk about Elasticsearch at a meetup here at Atlassian.

Caching in on Redis

We at HipChat use Redis a lot, caching everything from XMPP session info to up to 2 weeks of chat history. Originally we started with two Redis servers, one caching stats and the other caching everything else, but we soon realized that we’d need more help. Today, we shard our data over 3 Redis servers, with each server having its own slave. We continue to dedicate one of these servers to hosting our stats, while leaving the other two to cache everything else.
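The sketch below shows the general idea of client-side sharding, not our exact scheme: hash each key to pick a server so a given key always lands on the same shard. Hostnames and keys are placeholders.

    import zlib

    import redis

    # One connection per shard; each shard's slave is configured on the servers
    # themselves and is not shown here.
    SHARDS = [
        redis.StrictRedis(host="redis-cache-1"),
        redis.StrictRedis(host="redis-cache-2"),
    ]

    def shard_for(key):
        """Hash the key so it always maps to the same Redis server."""
        return SHARDS[zlib.crc32(key.encode("utf-8")) % len(SHARDS)]

    shard_for("history:room:42").lpush("history:room:42", "hello world")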

However, even with these changes, we found that we had to upgrade our Redis history instance size as we were running out of memory close to our billion-message milestone. We will continue to improve the scalability of this area of the HipChat architecture so we can handle the load and wean our dependence off single Redis servers, moving toward Redis clustering to mitigate single points of failure.

Future

This is just a highlight of some parts of the HipChat infrastructure we needed to tweak to help us reach 1 billion messages. We still have a long way to go to scale HipChat for our growing enterprise needs – improving our Redis architecture, for example. A more robust system, better-performing code, and mitigating or removing single points of failure are large objectives that our Ops team looks forward to tackling in the coming months.

If you want to learn more or think you can help us scale HipChat better, I suggest you come by our meetup. If you can’t make it, feel free to submit your resume here. Our team is growing fast, and we would love to have you on board.


HipChat search now powered by Elasticsearch

By zuhaib | 1 year ago | 8 Comments

We recently announced new search improvements in HipChat that support advanced searching of chat histories. At the same time, to support the scale at which HipChat is growing we needed to rethink our search architecture. The result? We switched to Elasticsearch.

Previous Setup

Originally, HipChat search was powered by a single AWS instance running couchdb-lucene. This was acceptable in the early days, but it left us with a single point of failure for our search system.

As HipChat grew, we needed a bigger and bigger AWS instance, to the point that we were using the second-largest memory instance AWS had. Even then we experienced periods of search outages that prevented our users from searching, all because we had a single instance with no redundancy.

Say Hello to Elasticsearch

We determined our previous setup was not sustainable so we kicked off a project to find a new search engine. After kicking the tires on a few solutions we landed on Elasticsearch.

Why Elasticsearch?

  • It is built on top of Apache Lucene, so it is familiar to us
  • It supports distributed nodes, allowing us to run multiple nodes in different AWS availability zones
  • It supports robust plugins, including one for AWS node discovery
  • We could roll it out with as little impact to users as possible

You can read more about all the great Elasticsearch features here.

Deploying Elasticsearch at HipChat

Search error percentage before and after Elasticsearch

Rolling out Elasticsearch at HipChat with as little impact to users as possible was a key goal of ours. During the evaluation phase we wrote a script that duplicated our search queries in production and ran them against the search engines we were testing, logging any differences in the responses as well as the response times to statsd/graphite. This allowed us to figure out which engine could handle the load we generated.
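In spirit, that script looked something like the Python sketch below; the client setup, index name, and statsd keys are placeholders, and the real version replayed live production traffic rather than a single query.

    import time

    import statsd
    from elasticsearch import Elasticsearch

    stats = statsd.StatsClient("localhost", 8125)   # placeholder statsd endpoint
    candidate = Elasticsearch(["es-node1:9200"])    # placeholder candidate cluster

    def mirror_search(query_body, lucene_ids):
        """Replay a production search against the candidate engine, compare, and time it."""
        start = time.time()
        response = candidate.search(index="messages", body=query_body)
        stats.timing("search.elasticsearch.ms", (time.time() - start) * 1000)

        es_ids = [hit["_id"] for hit in response["hits"]["hits"]]
        if es_ids != lucene_ids:
            stats.incr("search.mismatch")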

Working with the Elasticsearch consulting team, we determined that the standard couchdb-river would not work for us, so we built a custom Ruby importer to get the kind of performance we needed. We hope someday to open source this script, but currently it’s very tailored to our needs.
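The actual importer is Ruby and tuned to our schema, but the core idea, reading documents out of CouchDB and bulk-indexing them into Elasticsearch, looks roughly like this Python sketch (the URLs, index, and field names are placeholders):

    import requests
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(["es-node1:9200"])        # placeholder host
    COUCH = "http://couch-host:5984/history"     # placeholder CouchDB database

    def actions(batch_size=1000):
        """Fetch a batch of CouchDB docs and yield Elasticsearch bulk actions.
        (Real pagination via startkey is omitted for brevity.)"""
        params = {"include_docs": "true", "limit": batch_size}
        rows = requests.get(COUCH + "/_all_docs", params=params).json()["rows"]
        for row in rows:
            doc = row["doc"]
            yield {"_index": "messages", "_type": "message", "_id": doc["_id"], "_source": doc}

    helpers.bulk(es, actions())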

At HipChat, we leverage feature toggling for many of our features, so when we roll out something new we can enable it for only certain groups. This allows us to test at scale without causing disruption to all of our customers. We used this capability to roll out Elasticsearch slowly to a small subset of customers (thanks to all the customers who helped us beta test Elasticsearch!). Once we got comfortable with Elasticsearch and saw it was beating out couchdb-lucene, we rolled it out to all of our customers with minimal impact on end users.
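A toggle like this can be as simple as a per-group check; the group IDs and function below are purely illustrative, not our actual flag system.

    # Hypothetical feature toggle: groups in this set are routed to the new
    # search backend, everyone else stays on the legacy couchdb-lucene path.
    ELASTICSEARCH_GROUPS = {101, 202, 303}   # placeholder group IDs

    def use_elasticsearch(group_id):
        """Decide per group which search backend should handle the request."""
        return group_id in ELASTICSEARCH_GROUPS

    backend = "elasticsearch" if use_elasticsearch(group_id=202) else "couchdb-lucene"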

The Ops team at HipChat is still working on other ways to scale HipChat to make it as reliable as possible!

Happy Searching all!


Make HipChat your Team’s Command Center

By Jeff Park | 2 years ago | 4 Comments

Our customers love HipChat because it’s so easy to extend. HipChat connects to over 45 tools that your company uses every day. Here are 5 ways to make HipChat your team’s command center and stay on top of everything your team needs to know about.

1. Connect to JIRA and Pivotal Tracker

Track issues with JIRA

Every company has projects they need to manage. Keeping up with the issues your team needs to address for these projects is a breeze with HipChat. Integrate with project management tools like Atlassian JIRA or Pivotal Tracker, and receive updates whenever an issue is opened, commented on, or resolved.

2. Collaborate on code with Bitbucket or GitHub

Collaborate on code with Bitbucket

With software eating the world, your team most likely has some code to work with. Tie in your repositories from GitHub and Bitbucket to receive a notification whenever a teammate pushes code, creates a branch, opens a pull request, and more.

3. Builds and deploys with Bamboo or Jenkins

Build and deploy with Bamboo

Deploying clean code is critical to your team’s success. Integrate HipChat with a continuous integration tool like Jenkins or Atlassian Bamboo and be the first to know whenever your code passes or fails a build. If your team deploys with Heroku, you can have HipChat send you a message to let you know a team member deployed your app.

4. Tackle customer service with UserVoice and Zendesk

Provide kick-ass customer service with UserVoice

No matter what your company does, the customer is critical to your success. Stay connected and provide immediate kick-ass service to your customers by bringing UserVoice and Zendesk into HipChat.



5. Missing something? Zapier has you covered

Using Zapier, you can integrate HipChat with any other tool your team uses. Zapier supports 200+ services and has a simple interface, so your team doesn’t have to spend any time writing code to get these integrations set up. Start getting notifications in HipChat in just a couple steps. Check out the list of services and instructions to set up your Zaps, and when you’re ready, sign up through this link to get an extra 100 tasks per month!


There’s no need to fumble around to stay updated on what’s going on. Keep a pulse on your team’s activity by integrating the tools you use with HipChat. With one service sending notifications, your team spends less time distracted and more time shipping awesome products.