On the unplanned downtime Friday night

Friday night, September 4th 2009, around 6:30 pm UTC, Bitbucket went down for about half an hour. It’s back up now. I’ll explain the outage and some of the measures we’re going to take to make sure this doesn’t happen again.

The first suspect was high load, but it turned out to be quite the opposite: load was lower than it had been for weeks, if not months. This was a pretty strong indicator that no requests were making it through to the site. People were reporting “504 Gateway Timeout” errors to us, meaning that they were in fact reaching our first tier of load balancing, run by nginx, but nginx wasn’t getting answers from the backend behind it.

So what’s wrong? A quick “ps aux” revealed that all services were running, but still something wasn’t quite right. Restart nginx. Nope. Is apache2 running? Yep. Restart apache2. Everything’s back to normal.

At least it was an easy fix. I have 74 MB of error logs to trawl through, trying to figure out why apache2 decided to drop its cookies and not tell anyone.

Now, this has never happened before. We’ve had a few unexpected outages, but they’ve always been caused by flukes or other oddities and, a single time, by a hardcore kernel crash.

We’re of course interested in this not happening again, especially since it was such an easy fix to automate. We’ll be deploying periodic probes to check for apache2 responsiveness, and if apache2 doesn’t answer, the probe will restart it. I don’t know which software yet, but monit is a good candidate.
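
To make the probe idea concrete, here’s a rough sketch in Python of what such a check boils down to. This is not what we’ll be deploying (a tool like monit would do this for us), just an illustration. The URL, port, timeout and restart command below are all assumptions made up for the example, not our actual setup.

```python
import subprocess
import urllib.error
import urllib.request

# All three values are hypothetical and only here for illustration.
PROBE_URL = "http://127.0.0.1:8080/"              # assumed backend address
RESTART_CMD = ["/etc/init.d/apache2", "restart"]  # assumed restart command
TIMEOUT_SECONDS = 10


def is_responsive(url, timeout=TIMEOUT_SECONDS, opener=urllib.request.urlopen):
    """Return True if the server answers within the timeout without a 5xx."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as exc:
        # The server did answer; only 5xx counts as unhealthy.
        return exc.code < 500
    except OSError:
        # Connection refused, timeout, DNS failure and friends.
        return False


def probe_and_restart(url=PROBE_URL, check=is_responsive, restart=None):
    """Probe once; restart the server if it does not answer.

    Returns True if a restart was triggered, False otherwise.
    """
    if check(url):
        return False
    if restart is None:
        # Default action: run the (assumed) restart command.
        def restart():
            subprocess.run(RESTART_CMD, check=True)
    restart()
    return True
```

Calling probe_and_restart() from cron every minute or so would be the crude version of this. A proper monitoring tool adds the things you actually want on top, such as retry counts before restarting and alerting when a restart happens.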

I’m very sorry for any inconvenience this caused you (according to Twitter, quite a few), and know that we’re doing what we can to prevent this from happening again.

EDIT:

Our apache is set up to recycle its workers after N requests, and it seems what happened is that a loaded egg had been updated, and apache2 refused to pick it up. I found several “bad local file header” errors in the log, which made Django throw an ImproperlyConfigured exception, which is basically a death sentence. This explains why the workers were plucked one by one, which finally resulted in the site’s demise.

Again, the probe-check would’ve fixed this, and I think we’ve settled on monit.