TL;DR: A brief network outage exposed a timing bug in the HipChat servers, which locked out a few thousand end users for up to four hours. We are continuing to work every day to make your HipChat experience awesome.
On January 14, starting at approximately 20:00 UTC, HipChat began logging a large spike of network errors across our chat frontend layer. Transient network errors aren’t something we enjoy, but they happen occasionally with high-scale internet services.
During this time, clients would attempt to connect to HipChat’s servers, wait for a while trying to authenticate, and then lose their connections, forcing them to retry. The network errors cleared up around 20:25 UTC, but unfortunately this wasn’t the end of our problem.
The HipChat frontend servers contain a session limit “circuit breaker” that trips when a particular user has too many simultaneous logins. This feature generally protects the service against badly behaving clients. During the network problem, a bug in our frontend servers caused many user logins to hang around in an orphaned state, even though the clients had disconnected. Eventually these orphaned simultaneous logins built up and started tripping the circuit breaker, disallowing more login attempts from those clients.
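To make the failure mode concrete, here is a minimal sketch of a per-user session limit and of how sessions that are never released end up locking a user out. It is an illustration only, not HipChat's actual code: the `SessionLimiter` class, its method names, and the limit of 3 are all assumptions made for the example.

```python
from collections import defaultdict


class SessionLimiter:
    """Per-user cap on simultaneous sessions (hypothetical sketch)."""

    def __init__(self, max_sessions):
        self.max_sessions = max_sessions
        self.sessions = defaultdict(set)  # user_id -> open session ids

    def login(self, user_id, session_id):
        # Trip the breaker once the user is already at the limit.
        if len(self.sessions[user_id]) >= self.max_sessions:
            return False
        self.sessions[user_id].add(session_id)
        return True

    def logout(self, user_id, session_id):
        # Releasing the session frees a slot for the next login.
        self.sessions[user_id].discard(session_id)


limiter = SessionLimiter(max_sessions=3)

# Simulate reconnect attempts whose disconnects are never recorded,
# so every attempt leaves an orphaned session behind.
results = [limiter.login("alice", f"conn-{i}") for i in range(5)]
print(results)  # [True, True, True, False, False]
```

Because `logout` is never called for the dropped connections, the fourth and fifth attempts are refused even though the user has no live connection at all.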
HipChat Ops quickly kicked off a backend process that began cleaning up the orphaned login sessions, but the sheer number of users affected meant the process took hours to complete. During this window, we pushed a hot patch to the frontend servers temporarily raising the session limit to give some relief until the cleanup process could complete.
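One common way to clean up orphaned sessions is to sweep for sessions that have shown no activity for some time. The sketch below assumes each session records a last-activity timestamp; that assumption, the `reap_orphans` function, and the 60-second threshold are hypothetical and are not a description of the process HipChat Ops actually ran.

```python
def reap_orphans(last_seen, now, max_idle):
    """Drop sessions whose last activity is older than max_idle seconds.

    last_seen maps (user_id, session_id) -> timestamp of last activity.
    Returns the list of session keys that were removed.
    """
    stale = [key for key, ts in last_seen.items() if now - ts > max_idle]
    for key in stale:
        del last_seen[key]
    return stale


last_seen = {
    ("alice", "conn-0"): 100.0,  # orphaned: no traffic for a long time
    ("alice", "conn-1"): 990.0,  # still active
    ("bob", "conn-2"): 50.0,     # orphaned
}
removed = reap_orphans(last_seen, now=1000.0, max_idle=60.0)
print(sorted(removed))  # [('alice', 'conn-0'), ('bob', 'conn-2')]
```

A sweep like this visits every session, which gives a sense of why a cleanup across a large user base can take a long time to finish.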
Simultaneously, our backend engineers found the timing bug in the network code that allowed the bogus logins to pile up. The permanent fix has been rolled out to production. We are also working on performance improvements for the emergency cleanup process, which took too long to finish.
Our beloved users noticed this downtime. We opened more support cases in this few-hour window than we normally do in a busy month. We take our commitment to you, our HipChat end users, very seriously, and we have ongoing investment in our platform and in our operations technologies and teams to support our enormous user growth. Our sincere apologies for any interruption in your team communication.