Post mortem on our availability earlier today

Posted on September 19, 2012

Earlier today, at 2am San Francisco time, Bitbucket served 500 error pages for about three hours to users attempting to access the user newsfeed and repository overview pages. The outage was caused by a kernel panic on our Redis server, which is responsible for the pages that display recent events related to a user. We are very sorry for the inconvenience this outage has caused.

After rebooting the Redis server, we found that the index Redis uses to serve newsfeed content was corrupt, which caused certain pages on Bitbucket to fail. Pages deeper in the site, such as pull requests, commit views, wikis, and issues, continued to work as expected, and Git and Mercurial access continued to work over both HTTP and SSH throughout. After identifying the cause of the problem, we turned off the newsfeed for all of Bitbucket, bringing an end to the 500 errors.

With the newsfeed temporarily disabled, we investigated the corruption and discovered a forum post with instructions and a repair tool for fixing the corrupted index. We followed those instructions to repair the index and restore full service to Bitbucket.

During this outage, we identified several areas for improvement, and we are implementing the following changes to how we manage Bitbucket's operations:

  1. Improve our escalation procedures so that response times are faster during non-office hours
  2. Update the Bitbucket codebase so that the dashboard and repository overview do not fail when Redis becomes unavailable
  3. Increase the number of tests that trigger our automatic phone alert system
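For the curious, item 2 essentially means wrapping the newsfeed's Redis lookup in a fallback so the rest of the page still renders. A minimal sketch of the idea, with hypothetical function and key names (this is not Bitbucket's actual code):

```python
# Sketch: degrade gracefully when the events store is unreachable,
# instead of letting the whole page return a 500.

def get_newsfeed(redis_client, user_id, limit=30):
    """Return recent event IDs for a user, or an empty feed on failure."""
    try:
        # Hypothetical key layout: a per-user list of recent event IDs.
        return redis_client.lrange("events:%s" % user_id, 0, limit - 1)
    except Exception:
        # Redis is down or its index is corrupt: render the page
        # without the newsfeed rather than failing the whole request.
        return []

class DownRedis:
    """Stand-in client simulating an unavailable Redis server."""
    def lrange(self, key, start, stop):
        raise ConnectionError("connection refused")

print(get_newsfeed(DownRedis(), 42))  # -> []
```

The page template can then show "newsfeed temporarily unavailable" (or nothing) when the feed comes back empty, while the rest of the dashboard renders normally.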
Again, we are very sorry for the inconvenience this outage has caused.


  • Anonymous
    Posted September 19, 2012 at 1:13 pm | Permalink

    Wow you had outage issues just a few days after Github. What a coincidence! =)

  • Ola Martins
    Posted September 20, 2012 at 12:23 am | Permalink

    The “status” page claimed all systems were “green and go”. Maybe this should reflect the user experience rather than a server’s availability?

    • Posted September 20, 2012 at 12:22 pm | Permalink

      Ola, totally agree.

      As part of our areas for improvement, we have “Increase the number of tests that trigger our automatic phone alert system”

      Cheers, Justen.

  • Freddy Potargent
    Posted September 21, 2012 at 2:00 am | Permalink

    Nicely handled IMHO! When I was hit by the problem I uttered some words not suitable for repetition in public. Quickly discovered however that pages inside the repos were still available so not that big of a problem. I jumped onto the IRC and got a short answer. Well, short is understandable in these circumstances but that short answer did one important thing: it ensured me you were on top of it. 🙂 So, thanks for the great job done and keep it up!