This is a guest post by David Mytton, founder of the server and website monitoring product Server Density. He has been programming for over 10 years and has grown Server Density to the point where it now processes over 30TB of incoming data per month. You can email him on email@example.com or follow @davidmytton
Whether your company works entirely from an office, is fully remote or is a combination of the two, using HipChat to help run infrastructure operations is a common way to organize and coordinate teams.
For us, HipChat acts as a news feed with events being piped in from all the services we use - GitHub, ZenDesk, JIRA, deploys, new signups and ops alerts. It’s this final one that I want to focus on in this post.
The ops war room
There is a lot of activity going on during each day and so our main room gets quite noisy with commits, builds, customer upgrades, etc.
This works well for staying up to date with what is going on, but when there’s an infrastructure incident, e.g. an outage, the response team immediately switches to sterile cockpit rules - only essential communication is allowed.
To achieve this, we have a dedicated HipChat room which is used only to discuss ongoing incidents. Only critical information from our alerting system gets piped into this room, which is a combination of Server Density’s own HipChat integration and alerts handled by PagerDuty. This has a number of advantages:
- We have an easy way to see a chronological timeline of exactly what has been happening, for the first responder to triage and for additional responders to review to get up to speed.
- We have a single place to communicate for the responders.
- We have a permanent record of what happened for the followup post-mortem.
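The split between a noisy main room and a quiet incident room can be sketched as a small routing rule. This is only an illustration, not how Server Density or PagerDuty actually route alerts: the `Event` type, the severity levels, the "Main" room name and the choice to send critical events only to the war room are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

MAIN_ROOM = "Main"          # hypothetical name for the busy news-feed room
WAR_ROOM = "Ops War Room"   # the dedicated incident room described in this post


@dataclass(frozen=True)
class Event:
    source: str    # e.g. "GitHub", "PagerDuty"
    severity: str  # hypothetical levels: "info", "warning", "critical"
    message: str


def rooms_for(event: Event) -> List[str]:
    """Return the rooms an event should be posted to.

    Assumption: only critical alerts reach the war room, and everything
    else stays in the main room so the incident timeline stays readable.
    """
    if event.severity == "critical":
        return [WAR_ROOM]
    return [MAIN_ROOM]


if __name__ == "__main__":
    for e in (
        Event("GitHub", "info", "New commit pushed"),
        Event("PagerDuty", "critical", "Primary database unreachable"),
    ):
        print(e.source, "->", rooms_for(e))
```

Keeping the rule this strict is what makes the war room useful as a timeline: anything posted there is, by construction, something a responder needs to see.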
This seems to be a similar pattern amongst other companies. The privacy oriented search engine DuckDuckGo, users of both Server Density and HipChat, also have alerts piped into a single Ops room which is used to help increase sysops transparency across the whole team.
Communicating during outages
Every incident starts with a first responder doing some initial investigation to diagnose the issue. This involves them joining the Ops War Room, triaging the alerts and then generating a new incident tracking ticket in JIRA.
The most important thing is to keep a close-to-real-time record of what is being done and who is doing it. This helps any additional people who might join later to review what has already been done, and it really helps with the post-mortem analysis, so you can work out how to improve responses in the future.
We record events in several ways:
- Quick communication and initial investigation is done through HipChat text chat in the Ops War Room.
- Actions performed, e.g. commands run, failover scripts executed, etc., are logged in JIRA with the command line and output. It’s important to know what commands were run so there’s no duplication, they can be considered for automation next time, and we have a history for review.
- If there is a long-running incident or there is some complexity, text-based chat can become time-consuming. Instead, we often switch to video conferencing so we can talk through what’s happening and coordinate individual responders. Even when there’s not much to say and it’s mostly silence while people are working, video is a good way to work with people remotely. We have been using Google Hangouts for this but are now testing the HipChat Video features.
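Recording the command line together with its output is easy to get wrong when done by hand under pressure. As a rough sketch of the idea, a small wrapper can run a command and produce a text entry ready to paste into the incident ticket. The function name `run_and_record`, the entry layout and the responder name are invented for this example, and posting the entry to JIRA is deliberately left out.

```python
import shlex
import subprocess
import sys
from datetime import datetime, timezone


def run_and_record(argv, responder):
    """Run a command and return a text entry with its command line and output."""
    started = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    # Capture stdout and stderr as text so both end up in the record.
    result = subprocess.run(argv, capture_output=True, text=True)
    return "\n".join([
        f"[{started}] {responder} ran: {shlex.join(argv)}",
        f"exit code: {result.returncode}",
        "stdout:",
        result.stdout.rstrip(),
        "stderr:",
        result.stderr.rstrip(),
    ])


if __name__ == "__main__":
    # Stand-in for a real failover script: a trivial Python one-liner.
    entry = run_and_record(
        [sys.executable, "-c", "print('failover complete')"],
        responder="alice",  # hypothetical responder name
    )
    print(entry)
```

Because each entry carries who ran what and when, the same text doubles as raw material for the post-mortem and as a candidate list for automation.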
A gathering place
The linking factor between many users of HipChat is how it acts as a gathering place for all teams. At idio, they also use HipChat to help fight fires, with Airbrake making them aware of code-level exceptions, and it also brings the dev teams together with build events from Janky and Hubot.
The combination of developer and operations teams, with the ability to pipe events in real time and access crucial tools like video chat in a single location, helps to improve ops response times, which all leads to better uptime – something which customers really notice!