Archive for the ‘Service Status’ Category

June 15th Outage

What happened and what we’re doing about it

We sincerely apologize for our recent outage: you trust us with your chats, your important documents, your cat gifs, your personal conversations, your system notifications, your internet memes – and we let you down.

Our team takes pride in making HipChat an important part of your (work) life, and we’re sorry.

What happened?

Short version: the recent Mac client release, which had the much-anticipated “multiple account” feature, also had a subtle reconnection bug that only manifested under very high load (facepalm).

When a large network provider in the SF Bay Area had an issue Monday morning, it caused all of those clients to start reconnecting at once. This saturated our systems and prevented normal usage.

On Monday, we released an update to our backend systems, and Tuesday morning we released a new Mac app (v 3.3.1), both of which increased protection against this type of issue in the future.

We also fixed various other bugs related to reconnection in the Mac app that will prevent another connection overload like this one. And we continue to have teams building new, amazing technology that improves our system isolation, enables more sophisticated load testing, supports even higher scale, and strengthens our server-side capability management (for example, so we can disable misbehaving client functionality directly).
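
The post doesn’t describe how that server-side capability management works, so here is only a minimal, hypothetical sketch of the general idea of a server-side kill switch: the servers consult a central flag store before honoring a client behavior, so a misbehaving feature can be switched off without waiting on a new client release. The flag names, the Redis hash, and the session methods below are illustrative, not HipChat’s actual code.

    # Hypothetical server-side capability kill switch (not HipChat's real code).
    # A central flag store lets operators disable a client behavior at runtime.
    import redis

    FLAGS_KEY = "capability_flags"   # illustrative hash: capability name -> "on"/"off"
    store = redis.Redis(host="localhost", port=6379, decode_responses=True)

    def capability_enabled(name, default=True):
        """Return True unless the capability has been explicitly switched off."""
        value = store.hget(FLAGS_KEY, name)
        return default if value is None else value == "on"

    def handle_reconnect(session):
        # If aggressive auto-reconnect has been disabled server-side, ask the
        # client to back off instead of accepting an immediate retry.
        # (session.send_retry_after and session.accept are placeholder APIs.)
        if not capability_enabled("fast_reconnect"):
            session.send_retry_after(seconds=60)
            return
        session.accept()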

We never want you to be without HipChat. We fell far short of that Monday and are very sorry we let you and your teams down. 

We just passed 6B messages delivered via HipChat, more than 2B of which have been delivered in 2015 – our platform is scaling and growing faster than ever thanks to teams like yours.  We’re moving quickly to build a stronger HipChat as a result of the experience – thanks for your patience while we do.

 

HipChat login outage (Jan 14, 2015)

7 months ago by Matt Gleeson

TL;DR: A brief network outage exposed a timing bug in the HipChat servers, which locked out a few thousand end users for up to four hours. We are continuing to work every day to make your HipChat experience awesome. 

On January 14, starting at approximately 20:00 UTC, HipChat began logging a large spike of network errors across our chat frontend layer. Transient network errors aren’t something we enjoy, but they happen occasionally with high-scale internet services.

During this time, clients would attempt to connect to HipChat’s servers, wait for a while trying to authenticate, and then lose their connections, forcing them to retry. The network errors cleared up around 20:25 UTC, but unfortunately this wasn’t the end of our problem.

The HipChat frontend servers contain a session limit “circuit breaker” that trips when a particular user has too many simultaneous logins. This feature generally protects the service against badly behaving clients. During the network problem, a bug in our frontend servers caused many user logins to hang around in an orphaned state, even though the clients had disconnected. Eventually these orphaned simultaneous logins built up and started tripping the circuit breaker, disallowing more login attempts from those clients.
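
For readers curious what such a circuit breaker looks like, here is a deliberately simplified sketch in Python; the limit, data structures, and timings are made up and are not HipChat’s implementation. The orphaned-login bug described above corresponds to the logout step being skipped when a disconnect is missed, so stale sessions keep counting against the limit until a reaper cleans them up.

    # Simplified per-user session-limit "circuit breaker" (illustrative only).
    import time
    from collections import defaultdict

    MAX_SESSIONS_PER_USER = 10   # breaker trips above this many live logins
    SESSION_TTL = 300            # seconds without activity => considered orphaned

    sessions = defaultdict(dict)  # user_id -> {session_id: last_seen_timestamp}

    def login(user_id, session_id):
        live = sessions[user_id]
        if len(live) >= MAX_SESSIONS_PER_USER:
            # Breaker is open: refuse further logins for this user.
            raise RuntimeError("too many simultaneous logins for %s" % user_id)
        live[session_id] = time.time()

    def logout(user_id, session_id):
        # If this step is skipped (e.g. a disconnect event is lost during a
        # network blip), the session lingers and counts against the limit.
        sessions[user_id].pop(session_id, None)

    def reap_orphans():
        # Periodic cleanup: drop sessions that have not been seen recently.
        cutoff = time.time() - SESSION_TTL
        for live in sessions.values():
            for session_id, last_seen in list(live.items()):
                if last_seen < cutoff:
                    del live[session_id]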

HipChat Ops quickly kicked off a backend process that began cleaning up the orphaned login sessions, but the sheer number of users affected meant the process took hours to complete. During this window, we pushed a hot patch to the frontend servers temporarily raising the session limit to give some relief until the cleanup process could complete.

Simultaneously, our backend engineers found the timing bug in the network code that allowed the bogus logins to pile up. The permanent fix has been rolled out to production. We are also working on performance improvements for the emergency cleanup process, which simply took too long to finish.

Our beloved users noticed this downtime :) We opened more support cases in those few hours than we normally do in a busy month. We take our commitment to you – our HipChat end users – very seriously, and we continue to invest in our platform, operations technology, and teams to support our enormous user growth. Our sincere apologies for any interruption in your team communication.

Details on the recent outages

10 months ago by Zuhaib Siddique

HipChat has been growing like crazy. A year ago we blogged about how we passed 1 billion messages after 4 years. Today, we are over 3 billion messages sent – and growing faster every month! However, over the past few weeks we have had some growing pains – we’ve suffered a few different types of service outage with HipChat, and we felt a blog post would be helpful to explain the issues and how we are moving forward.

We understand that HipChat is critical for your team – we live in HipChat all day as well – so we know how crippling it feels to lose that connection to your team. We take this responsibility very seriously, and we’re ensuring that even as we continue to scale, your HipChat experience continues to be amazing.

Web App Troubles

Starting Oct 1st, we saw higher-than-normal load on our web tier, which caused issues at that tier and in turn triggered all web clients to suddenly reconnect, causing a spike across our whole system. We added more capacity to the web tier throughout Oct 2nd to handle the extra load, made several code and configuration changes to optimize how we use one of our databases, Redis, to support the higher load, and declared the issue resolved on Oct 2nd.

The optimization addressed the load issue (yay). However, along the way we inadvertently introduced two additional issues into our system. On Oct 7th, we discovered and resolved an issue that resulted in our Android clients making extra requests and effectively DDoSing ourselves. On Oct 9th, we discovered and resolved a second issue, which resulted in bad cache data that prevented a small percentage of users from logging into HipChat.
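
The post doesn’t say exactly how the extra Android requests were contained, but a common server-side safeguard against a client flooding an API is a per-client rate limit. As a generic illustration only (key names, limits, and window size are invented), a fixed-window limiter backed by Redis looks roughly like this:

    # Hypothetical per-client fixed-window rate limiter backed by Redis.
    import time
    import redis

    r = redis.Redis(host="localhost", port=6379)

    WINDOW_SECONDS = 60
    MAX_REQUESTS_PER_WINDOW = 120

    def allow_request(client_id):
        """Return True if this client is still under its request budget."""
        window = int(time.time()) // WINDOW_SECONDS
        key = "ratelimit:%s:%d" % (client_id, window)
        pipe = r.pipeline()
        pipe.incr(key)                        # count this request
        pipe.expire(key, WINDOW_SECONDS * 2)  # let old windows expire on their own
        count, _ = pipe.execute()
        return count <= MAX_REQUESTS_PER_WINDOW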

SSLv3 “POODLE” bites HipChat

Many of you have likely seen the reports of the “POODLE” security vulnerability in SSLv3. As soon as we had confirmation of the vulnerability, we rolled out a patched version of our server code to the whole system, including new front-end XMPP servers. In the process of adding those new front-end XMPP servers, the automated tool that manages our DNS records failed to update our domain with the new servers’ addresses. When we removed the old servers from rotation, users were unable to log in for 15 minutes while we manually updated our DNS records.
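
The post doesn’t say which server software the patch touched, but for anyone hardening their own Python-based service against POODLE, the heart of the fix is simply refusing SSLv3 handshakes. A generic sketch follows (certificate paths, address, and port are placeholders, and recent Python versions already exclude SSLv3 by default):

    # Generic example of refusing SSLv3 on a TLS server socket (not HipChat's stack).
    import socket
    import ssl

    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.options |= ssl.OP_NO_SSLv3                   # explicitly refuse SSLv3 (POODLE)
    context.load_cert_chain("server.crt", "server.key")  # placeholder paths

    listener = socket.create_server(("0.0.0.0", 443))    # placeholder address/port
    tls_listener = context.wrap_socket(listener, server_side=True)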

Going Forward

We take our commitment to you, our end users, as seriously at 3B+ messages as we did at the first message. It’s unacceptable for us to have this much downtime, and while the underlying cause of each issue is different, the net result for you, our users, is often the same. We continue to invest heavily in platform, scale, performance, and reliability projects – we have several big ones in the works and we hope to blog about them soon. We have an entire team dedicated to scaling HipChat further, and we are looking to scale our team as well – come join us!

Sincerely

Your HipChat Team

 

 

AWS Rolling Restarts

11 months ago by Zuhaib Siddique

We know many in the Devops community use HipChat as an incident management tool. As many of you know, we use Amazon Web Services for hosting HipChat. If you’re affected by the AWS rolling restarts and see anything weird happening with HipChat, just remember that HipChat could be experiencing some side-effects from our own rolling restarts.

Amazon notified us that maintenance will be occurring for the instances hosting HipChat during the following windows over the coming days:

Start Time – End Time
September 26, 2014 11:00:00 PM UTC-7 – September 27, 2014 5:00:00 AM UTC-7
September 27, 2014 11:00:00 PM UTC-7 – September 28, 2014 5:00:00 AM UTC-7
September 28, 2014 11:00:00 PM UTC-7 – September 29, 2014 5:00:00 AM UTC-7
September 29, 2014 11:00:00 PM UTC-7 – September 30, 2014 5:00:00 AM UTC-7

Our own Devops team is getting ready for this maintenance and we expect no major problems for HipChat. Minor things to be aware of might include client reconnections and slower search responses.
Here’s an article we found helpful whilst planning.

Although we don’t expect any major issues, we just wanted to provide a heads-up.

The HipChat Devops Team.

Details on last week’s outages

2 years ago by Garret Heaton

Last week we suffered two outages affecting a large number of users. The first was from 10-11pm PST on Thursday and the second from 12-1:30pm on Friday. Like you, we live in HipChat all day, so we know how crippling it feels to lose that connection to your team. We wanted to share the cause of the trouble since it’s something that likely affects some of you and was not trivial to debug, though the fix is simple.

What happened

At the beginning of both outages, we were alerted that the backend services which provide the XMPP BOSH endpoint for our web app had dropped all their connections, disconnecting all of their users. This led to a rush of reconnections, which caused load to increase. Despite having the capacity to handle the load, many web requests became very slow and caused our database (we use MySQL on Amazon RDS) to run out of available connections. Shortly after this, our XMPP services also began to handle requests very slowly despite using persistent MySQL connections. At this point all users were unable to sign in or chat, and we had to disable our API and chat sign-ins. It was not clear to us what was causing the slowdown, since our servers and database had plenty of resources available.

Root cause

Log jam!

It turns out that some new processing on our rsyslog logging server had caused its load to spike, which caused all of our servers to begin queuing their logs locally and slowed down our services. As load increased, the problem only got worse. We stopped our rsyslog clients and everything started coming back to life immediately. Once we were back online we started searching and found that the Bitbucket team, which sits within shouting distance of us, had run into the same issue last year. Check out their blog post for technical details.

We have since found a way to reproduce this failure locally and implemented a fix by switching from TCP to UDP logging. If you’re using rsyslog we’d recommend you test your setup against this sort of failure. It’s quite embarrassing to have your logging cause so much trouble.
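
For anyone making the same change, the difference in rsyslog’s forwarding syntax is a single character: ‘@@’ forwards over TCP, which queues and can stall the sender when the central log server is overloaded, while ‘@’ forwards over UDP, which may drop messages but never blocks the application hosts. The hostname and port below are placeholders:

    # /etc/rsyslog.conf forwarding rule on each client host (placeholder hostname/port)
    # Before: TCP forwarding; a slow log server makes senders queue and stall
    *.* @@loghost.example.com:514

    # After: UDP forwarding; messages can be dropped, but senders never block
    *.* @loghost.example.com:514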

Thank you

Thanks to everyone for your patience during these troubles. We know they happen at the worst possible time for some of you. In the end, our service improves with each one of these failures as we continue to learn and resolve issues.

Finally, we’d like to remind people to keep an eye on @HipChat and our status site if you believe we may be having trouble. We received many individual tweets and emails during the outage but did not have the time to immediately reply to everyone.

Things that go bump in the… middle of the day

3 years ago by Garret Heaton

Many of you probably noticed that we had some service issues around 11:00 PST yesterday (October 2nd). The timing was horrible, and we weren’t able to fix things as quickly as we’d have liked. Here’s a little detail on what happened and what we’ve learned.

What happened

HipChat is built using the XMPP protocol and in order to allow our web app (JavaScript) to talk to our XMPP server we use a BOSH proxy service. Lately, we’ve been tracking down an issue that causes these services to drop large numbers of active connections without warning. Usually users don’t notice when this happens since they automatically reconnect to another server, but this time the problem struck during the middle of the day when we have our highest usage. The increased load caused by all these users reconnecting led to some cascading failures we had not seen before and were not well prepared to deal with.

Here’s a bit more detail for the techies out there. One of our main Redis servers had stopped accepting new connections due to the high load and a configuration issue (low maxclients setting), and as a result most users were unable to sign in (though people who were already connected were able to chat normally). At this point we began Tweeting, updating our status site, and working to redirect the load from the affected servers. It would have been handy to temporarily block users from signing in so that we could get things under control, but we hadn’t built a way to prevent users from using HipChat. Perhaps this is a ‘feature’ we’ll have to add moving forward so we don’t DDoS ourselves.
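
As an aside for readers running their own Redis: it’s worth watching how close your connection count is getting to the maxclients ceiling before it becomes an emergency. A small, generic monitoring sketch (host, port, and threshold are placeholders, and this is not HipChat’s tooling):

    # Generic check: warn when a Redis server nears its maxclients limit.
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def check_connection_headroom(warn_ratio=0.8):
        connected = int(r.info("clients")["connected_clients"])
        maxclients = int(r.config_get("maxclients")["maxclients"])
        if connected >= warn_ratio * maxclients:
            print("WARNING: %d of %d Redis client slots in use" % (connected, maxclients))
        return connected, maxclients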

We considered performing a manual failover to a slave server, but realized that it had the same configuration issue as the master and would end up suffering the same fate. In the end we found that extra Apache child processes had been spawned which were holding connections to our Redis server open and refusing to terminate. After removing them, we were able to start churning through the backlog of work and users were able to sign in again.

Final thoughts

We realize you all rely on HipChat to be available 24/7 so you can stay connected to your team and be productive. The length of this outage is unacceptable to us and we’ll be prioritizing some changes to improve our systems and prevent this from happening in the future. On the plus side, every issue we run into helps us learn more about our systems and protect them from future failures. Growing pains are a necessary thing.

Thanks to everyone who was patient and humorously supportive.

Saint Patrick’s Day downtime

Yesterday afternoon, while everyone was enjoying the new (shamrock) and (greenbeer) emoticons, one of our Amazon EC2 instances suddenly stopped responding. This particular instance helped serve our website and chat service and had a decent percentage of chat clients connected to it. We’re not sure why it went down, but it was EBS-backed, so we think it may be related to the issues Reddit and others were experiencing.

Whatever the cause, we should be able to recover from server failures. Normally the users connected to the failed server would be briefly disconnected before automatically reconnecting to another server. Unfortunately, the other services in our chat cluster weren’t properly detecting the failed instance and continued to send it requests, which would then fail. Instead of trying to manually correct this unfamiliar failure state and risk causing more harm, we decided to restart all the chat services. This is when most users noticed that we were having trouble.

Having everyone reconnect shouldn’t be too big a problem, but we’ve been growing an awful lot lately and had more users connected than ever before (although today is looking even bigger!). The load caused by everyone reconnecting was too much for our existing servers to bear, and everything slowed to a crawl. We had essentially triggered a denial-of-service attack against ourselves – not good. After about 15 minutes we were able to process all the requests and service returned to normal. Total service interruption: about 1 hour.

Here is what we’ll be fixing:

  1. Increasing our capacity so that server failures affect a smaller percentage of users and we can recover more quickly.
  2. Making sure our clustered chat services properly handle the failure case we experienced.
  3. Optimizing the client reconnection flow so it generates less load (a rough sketch of this approach follows below).
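
As a rough illustration of the third item, the standard way to keep a mass reconnection from turning into a self-inflicted denial of service is exponential backoff with random jitter on the client side, so clients spread their retries out instead of hammering the servers in lockstep. A generic sketch, with arbitrary delay values and a placeholder connect() callable:

    # Generic client reconnect loop: exponential backoff with full jitter, so
    # thousands of disconnected clients don't all retry at the same instant.
    import random
    import time

    BASE_DELAY = 1.0     # seconds
    MAX_DELAY = 120.0

    def reconnect_with_backoff(connect):
        attempt = 0
        while True:
            try:
                return connect()            # placeholder for the real connection attempt
            except ConnectionError:
                attempt += 1
                ceiling = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
                time.sleep(random.uniform(0, ceiling))   # full jitter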

Thanks to all our users for your patience and understanding yesterday. We know HipChat is an integral part of your workflow and that reliability is the most important feature we can provide.