On the unplanned downtime Friday night

Posted on September 4, 2009

Friday night, September 4th 2009, around 6:30 pm UTC, Bitbucket went down for about half an hour. It’s back up now. I’ll explain the outage and some of the measures we’re going to take to make sure this doesn’t happen again.

The first suspect was high load, but that turned out not to be the case. On the contrary, load was lower than it had been for weeks, if not months, which was a pretty strong indicator that no requests were reaching the backend. People were reporting “504 Gateway Timeout” errors to us, meaning that they were in fact reaching our first tier of load balancing, run by nginx, and that nginx was not getting an answer from whatever sits behind it.

So what’s wrong? A quick “ps aux” revealed that all services were running, but still something wasn’t quite right. Restart nginx. Nope. Is apache2 running? Yep. Restart apache2. Everything’s back to normal.

At least it was an easy fix. I have 74 MB of error logs to trawl through, trying to figure out why apache2 decided to drop its cookies and not tell anyone.

Now, this has never happened before. We’ve had a few unexpected outages, but they’ve always been caused by flukes or other oddities and, on one occasion, a hardcore kernel crash.

We’re of course interested in this not happening again, especially since it is such an easy fix to automate. We’ll be deploying periodic probes that check apache2 for responsiveness; if it doesn’t answer, the probe will restart it. I don’t know which software we’ll use yet, but monit is a good candidate.
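To make the probe idea concrete, here is a minimal sketch in Python of what such a check could look like. It is not what we run, and monit would replace all of it. The URL, the restart command and every function name below are assumptions made up for the illustration.

```python
import http.client
import subprocess
import time
import urllib.error
import urllib.request

# Assumed values for illustration only; the real backend URL and
# restart command are not stated in this post.
PROBE_URL = "http://127.0.0.1:8080/"
RESTART_CMD = ["/etc/init.d/apache2", "restart"]


def is_responsive(url, timeout=5.0):
    """Return True if the backend answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx or 5xx).
        return exc.code < 500
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, reset, or a malformed response.
        return False


def check_and_restart(probe, restart, failures_allowed=2, delay=0.0):
    """Probe up to failures_allowed + 1 times.

    Calls restart() only if every attempt fails, so a single slow
    response does not trigger a restart. Returns True if a restart
    was triggered.
    """
    for attempt in range(failures_allowed + 1):
        if probe():
            return False
        if attempt < failures_allowed:
            time.sleep(delay)
    restart()
    return True


def restart_apache():
    """Run the (assumed) restart command and fail loudly if it errors."""
    subprocess.run(RESTART_CMD, check=True)


if __name__ == "__main__":
    check_and_restart(lambda: is_responsive(PROBE_URL), restart_apache, delay=5.0)
```

Run from cron every minute or so, this would have brought the site back without anyone noticing. Requiring a few consecutive failures before restarting is a deliberate choice to avoid bouncing apache2 on one slow request.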

I’m very sorry for any inconvenience this caused you (according to Twitter, quite a few), and know that we’re doing what we can to prevent this from happening again.

EDIT:

Our Apache is set up to recycle its workers after N requests, and it seems what happened is that an egg that was already loaded had been updated on disk, and apache2 refused to pick up the new version. I found several “bad local file header” errors in the log, which made Django throw an ImproperlyConfigured exception, which is basically a death sentence for a worker. This explains why the workers were plucked one by one, which finally resulted in the site’s demise.
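For anyone facing a similarly large error log, a rough sketch of how the two messages could be counted with Python follows. Only the two quoted strings come from our log; the function names and the pattern labels are made up for the example.

```python
import re
from collections import Counter

# The two search strings are the ones quoted above; the dictionary
# keys are just labels chosen for this example.
PATTERNS = {
    "bad_local_file_header": re.compile(r"bad local file header", re.IGNORECASE),
    "improperly_configured": re.compile(r"ImproperlyConfigured"),
}


def count_errors(lines):
    """Count how many lines match each pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_log(path):
    """Stream a log file line by line so a large file is not loaded at once."""
    with open(path, errors="replace") as handle:
        return count_errors(handle)
```

Calling `scan_log` with the path to the apache2 error log (wherever that lives on your system) returns a `Counter` keyed by the labels above.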

Again, the probe-check would’ve fixed this, and I think we’ve settled on monit.

  • Carl Meyer

    Thanks Jesper! Bitbucket’s a great service.

    I use monit all over the place; very nice, simple, easy to set up, and reliable.

  • http://zacharyvoase.com Zachary Voase

    Thanks for the update!

    With regard to process management, I recommend Supervisor (http://supervisord.org/) and Superlance (http://pypi.python.org/pypi/superlance). I’ve never looked at monit before though.

  • http://bitbucket.org/jespern/ Jesper

    @Zachary,

    We’ve been using supervisord since day 1. I haven’t seen superlance before, but I’ll check it out :-)

  • rorio

    Thanks guys! Good to see you back!
