Last Wednesday we were painfully reminded how much it sucks when your service goes down. We lost a server that was a single point of failure and had to move all of its services to a new machine. It kept us offline for about an hour. The next day GitHub and Facebook both suffered surprising outages that confused many programmers. No matter how big you are or how much money you put into your architecture, things are going to fail.
Obviously your just-launched startup isn’t going to have the availability of Google, and it shouldn’t. Maybe you’re bootstrapped and running on a single server — that’s fine (and a great way to save cash). But as you grow you’ll need to remain aware of your service’s weak spots and incrementally improve its ability to handle failures. ‘Incrementally’ is a key word here. Your service doesn’t need to be capable of five nines of availability before you’ve seen if the idea is even going to work out. On the flip side, it can’t have weekly outages once you have traction and paying users. In the middle you’ll go through various stages of improvement which probably look something like this:
- The “epic fail” stage – Everything’s running on one server. If it goes down your entire service and homepage are unavailable. Users don’t even see an error message. This is usually where you start.
- The “oh crap” stage – If a critical component is lost your service is mostly unusable, but it can at least let users know there’s a problem.
- The “uh oh” stage – Critical services are highly available and it’d take a major infrastructure failure for the core service to become unusable. Things may be a little slower during a failure but many users won’t notice.
- The “smooth sailing” stage – Your system can self-heal and recover from all expected failures automatically and users are unaffected. You only have to get involved when sh*t really hits the fan.
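To make the jump from “epic fail” to “oh crap” concrete, here’s a minimal Python sketch of a request handler that tells users something is wrong when a critical dependency is down, instead of failing silently. Everything in it is hypothetical and only for illustration (`handle_request`, the two fake backends, and the outage message); it isn’t how our service is built.

```python
from typing import Callable, Tuple

# Hypothetical wording; pick whatever message suits your users.
OUTAGE_MESSAGE = "We're having technical problems and are working on a fix."


def handle_request(load_page: Callable[[], str]) -> Tuple[int, str]:
    """Return (status_code, body), falling back to an outage notice on failure."""
    try:
        return 200, load_page()
    except Exception:
        # A critical dependency failed: show users a message instead of nothing.
        return 503, OUTAGE_MESSAGE


def healthy_backend() -> str:
    """Stand-in for a dependency that is working normally."""
    return "chat history"


def broken_backend() -> str:
    """Stand-in for a dependency that is currently unreachable."""
    raise ConnectionError("database unreachable")


if __name__ == "__main__":
    print(handle_request(healthy_backend))
    print(handle_request(broken_backend))
```

The service is still mostly unusable while the backend is down, but users get an explanation rather than a blank page. Getting to the later stages means removing the single point of failure itself, which a fallback like this doesn’t do.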
At any startup, infrastructure upgrades must fight for time alongside new features, bug fixes, performance improvements, PR, and a million other things. There’s always a risk of downtime, just like there’s always a risk of losing customers, losing an employee, or being the target of an attack. It’s all about balance.
As for us, we obviously still have a service that’s a single point of failure and we’re working to fix that. Thanks for your patience during the outage last week.


When we started HipChat a year ago, our goal was to bring HipChat from concept to profitable product while controlling both the direction and ownership of the company. HipChat has been bootstrapped for the last year, we have awesome customers that we love, and we’re quickly approaching profitability… so why are we raising money?