Yesterday afternoon, while everyone was enjoying the new (shamrock) and (greenbeer) emoticons, one of our Amazon EC2 instances suddenly stopped responding. This particular instance helped serve our website and chat service and had a decent percentage of chat clients connected to it. We're not sure why it went down, but it was EBS-backed, so we think it may be related to the issues Reddit and others were experiencing.
Whatever the cause, we should be able to recover from server failures. Normally, users connected to the failed server would be briefly disconnected before automatically reconnecting to another server. Unfortunately, the other services in our chat cluster weren't properly detecting the failed instance and kept sending it requests, which then failed. Rather than trying to manually correct this unfamiliar failure state and risk causing more harm, we decided to restart all the chat services. This is when most users noticed that we were having trouble.
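We're still digging into exactly where the detection logic fell short, but the general shape of the fix is a simple circuit breaker: after a few consecutive failed requests to a peer, stop routing to it and only probe it occasionally to see if it has recovered. Here's a rough sketch of the idea; the names and thresholds are illustrative, not our actual code:

```python
import time

# Hypothetical sketch -- none of these names come from our actual codebase.
MAX_FAILURES = 3     # consecutive failures before we stop routing to a peer
RETRY_AFTER = 30.0   # seconds to wait before probing a dead peer again

class PeerHealth:
    """Tracks whether a chat-cluster peer should receive requests."""

    def __init__(self):
        self.failures = 0
        self.dead_since = None

    def record_success(self):
        self.failures = 0
        self.dead_since = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= MAX_FAILURES:
            self.dead_since = time.time()

    def is_available(self):
        if self.dead_since is None:
            return True
        # Allow an occasional probe so a recovered peer rejoins the pool.
        return time.time() - self.dead_since >= RETRY_AFTER
```

The routing layer would check is_available() before dispatching each request, so a failed instance stops receiving traffic after a handful of errors instead of indefinitely.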
Having everyone reconnect at once shouldn't be too big a problem, but we've been growing an awful lot lately and had more users connected than ever before (although today is looking even bigger!). The load from everyone reconnecting was more than our existing servers could bear, and everything slowed to a crawl. We had essentially triggered a denial-of-service attack against ourselves. Not good. After about 15 minutes we were able to process all the requests and service returned to normal. Total service interruption: about 1 hour.
Here's what we'll be fixing:
- Increasing our capacity so that server failures affect a smaller percentage of users and we can recover more quickly.
- Making sure our clustered chat services properly handle the failure case we experienced.
- Optimizing the client reconnection flow so it generates less load (see the sketch after this list).
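On that last point: the standard remedy for the thundering-herd problem we hit yesterday is exponential backoff with random jitter, so clients spread their reconnection attempts out instead of retrying in lockstep. A rough sketch of what that looks like (the constants are illustrative, not what we'll actually ship):

```python
import random
import time

BASE_DELAY = 1.0   # initial retry window in seconds (illustrative values)
MAX_DELAY = 60.0   # cap so clients never wait unreasonably long

def reconnect_with_backoff(connect):
    """Retry `connect` until it succeeds, backing off with full jitter."""
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            # Double the window each attempt, then pick a random point in it
            # so thousands of clients don't all retry at the same instant.
            window = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
            time.sleep(random.uniform(0, window))
            attempt += 1
```

Picking a random point anywhere in the window (sometimes called "full jitter") spreads retries roughly evenly, so the servers see a steady trickle of reconnections rather than synchronized waves.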
Thanks to all our users for your patience and understanding yesterday. We know HipChat is an integral part of your workflow and that reliability is the most important feature we can provide.