What caused our service outage on Wednesday?

Posted on September 4, 2015

We owe you an explanation for the recent service interruption. On Wednesday, our users were unable to push or pull over SSH or HTTPS, access the GUI or API, or complete important, time-sensitive tasks. We violated the cardinal rule of SaaS – “don’t have downtime” – and we’re very sorry for all the trouble we’ve caused. We know that you rely on Bitbucket for your daily work, and when our service isn’t working correctly it affects your productivity.

I’d like to take a little time to explain what happened, how we handled it, and what we’re doing to avoid a repeat of this situation in the future.

What actually happened?

At 11:50 UTC on Wednesday, September 2nd, we noticed a drop in the webhook processing rate. Our Site Reliability Engineers (SREs) escalated the issue to the Bitbucket team. At the same time, we noticed that the load balancer queues for every other service – SSH, Git over HTTPS, Mercurial over HTTPS, the API, and the GUI – were growing very quickly. By 12:10 UTC, all services were failing for users.

Response times soon skyrocketed for Git over HTTPS.

Initially, the issues seemed unrelated. Application nodes all showed a reasonable load for that time of day, and our RabbitMQ processes had not triggered any alerts. As our investigation later revealed, we had experienced severe resource exhaustion on the cluster serving both Redis and RabbitMQ, and the kernel’s out-of-memory killer had begun terminating both sets of services. This led to a split brain in RabbitMQ, as individual RabbitMQ instances were unable to communicate with each other and select a consistent master. Additionally, the application nodes’ Celery workers were unable to publish their background tasks consistently to RabbitMQ, which blocked normal processing. With application processes blocked, the load balancers’ queues backed up, and soon the HAProxy daemon handling SSH traffic crashed outright.
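
To make that failure mode concrete, here is a rough sketch (not our actual code; the task, broker URL, and settings are hypothetical and assume a recent Celery) of how a publish can be bounded with a connection timeout and a retry policy, so that a partitioned broker degrades a single request instead of blocking application processes indefinitely:

    # Hypothetical sketch: bounding how long an application process waits on a
    # broken broker when publishing a background task with Celery.
    from celery import Celery

    app = Celery("example_app", broker="amqp://rabbit.example.internal//")

    # Give up on broker connections quickly instead of retrying forever, so
    # request handling is not blocked indefinitely during a partition.
    app.conf.broker_connection_timeout = 2        # seconds to establish a connection
    app.conf.broker_connection_max_retries = 3    # stop reconnecting after a few attempts

    @app.task
    def fire_webhook(repo_id, payload):
        ...  # deliver the webhook (details omitted)

    def enqueue_webhook(repo_id, payload):
        try:
            # retry_policy bounds the publish itself; an unbounded publish
            # against a partitioned RabbitMQ cluster can hang the caller.
            fire_webhook.apply_async(
                args=(repo_id, payload),
                retry=True,
                retry_policy={"max_retries": 2, "interval_start": 0.2,
                              "interval_step": 0.2, "interval_max": 1.0},
            )
        except Exception:
            # A real system would fall back to a durable local queue or at
            # least record the failure for later replay.
            pass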

By 12:30 UTC, we had focused our attention on the load balancer queues. Restarting haproxy-ssh did not work at first, as the kernel’s SYN backlog was still full, but once enough broken connection requests had timed out we were able to restore haproxy-ssh at 12:47 UTC. By that point, RabbitMQ had stabilized to the point that Celery workers were once again able to publish. Between the load balancer fix and the resumption of Celery publishing, our backlog of connection requests started to come down.

With traffic moving again, we started looking for a root cause. There were a few false leads: network changes, software deployments, back-end storage, and new RabbitMQ/Redis hardware. By 14:00 UTC, we had reached a consensus that RabbitMQ’s failure was the reason we had seen so many problems. Unfortunately, Celery consumers were still unable to connect reliably to the queues, so the queues grew very quickly, adding roughly 250,000 new tasks between 13:31 UTC and 13:59 UTC.
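
For a sense of how that kind of growth could be caught earlier, here is a minimal monitoring sketch (not our production tooling; the host, credentials, and threshold are placeholders) that polls RabbitMQ’s management HTTP API for queue depth:

    # Hypothetical monitoring sketch: poll RabbitMQ's management API and flag
    # queues whose backlog exceeds a threshold.
    import requests

    RABBIT_API = "http://rabbit.example.internal:15672/api/queues"
    ALERT_THRESHOLD = 50000  # messages; tune to normal traffic levels

    def check_queue_backlogs():
        resp = requests.get(RABBIT_API, auth=("monitor", "secret"), timeout=5)
        resp.raise_for_status()
        for queue in resp.json():
            depth = queue.get("messages", 0)
            if depth > ALERT_THRESHOLD:
                # In practice this would page an SRE rather than print.
                print("ALERT: queue %s has %d messages backed up" % (queue["name"], depth))

    if __name__ == "__main__":
        check_queue_backlogs()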

We spent the next several hours attempting to get Celery consuming again. However, the sheer size of the task queues, combined with the backlog of other services and our normal level of traffic for a Wednesday, meant slow going. At several points, consumer workers lost their ability to communicate effectively with their master processes. This kept the queues from being consumed consistently – workers would take a few tasks, run them, then die, all within the space of a few minutes.
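
As an illustration of the knobs involved (these are not our production settings, and they assume a recent Celery), configuration like the following limits how much damage a crashing worker child can do:

    # Hypothetical sketch: Celery settings that limit the blast radius of a
    # crashing worker child.
    from celery import Celery

    app = Celery("example_app", broker="amqp://rabbit.example.internal//")

    app.conf.update(
        task_acks_late=True,             # ack only after a task finishes, so a
                                         # dying child's tasks are redelivered
        worker_prefetch_multiplier=1,    # don't let one child hoard a batch of
                                         # messages it may never process
        worker_max_tasks_per_child=100,  # recycle children periodically to
                                         # contain slow resource leaks
    )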

Once Celery was fixed, it began working through a sizable message backlog, which then started to diminish.

By 19:27 UTC, we had managed to troubleshoot Celery enough to get it consuming tasks consistently, though slowly. We still had two million backlogged tasks to process, and even with a relatively quiet back end that takes some time. By 21:00 UTC, we had reduced the backlog to about 3,500 tasks, and we resolved the incident as soon as we were confident the backlog wouldn’t grow again.

What are we doing to keep this from happening again?

There are a few actions we’re taking immediately:

  1. We’ve migrated Redis to new hardware, separate from RabbitMQ, so that we can avoid resource contention like this in the future.
  2. We’ve corrected a flaw in a script that truncates old keys in Redis; had it been working, it could have helped us notice the Redis cluster problems sooner. The script has already purged about 100,000 dead keys, and it’s still going. (A rough sketch of this kind of cleanup appears after this list.)
  3. Our developers are re-evaluating their usage of Redis and RabbitMQ, and they are preparing adjustments that should reduce unnecessary traffic to either cluster.
  4. Last but not least, the outage exposed some holes in our monitoring, especially around RabbitMQ and Celery. We’re upgrading our monitors to improve problem detection, and we’re rewriting SRE runbooks so that there’s a much clearer plan of action.
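
For the curious, here is a rough sketch of the kind of Redis key cleanup mentioned in item 2 (the key pattern, idle cutoff, and connection details are placeholders, not our actual script):

    # Hypothetical sketch of a Redis dead-key sweep using SCAN, in the spirit
    # of the cleanup script mentioned above.
    import redis

    r = redis.StrictRedis(host="redis.example.internal", port=6379)

    def purge_stale_keys(pattern="webhook:*", max_idle_seconds=7 * 24 * 3600):
        """Delete keys matching `pattern` that have been idle past the cutoff."""
        removed = 0
        # scan_iter avoids the blocking KEYS command on a busy cluster.
        for key in r.scan_iter(match=pattern, count=1000):
            idle = r.object("idletime", key)  # seconds since last access
            if idle is not None and idle > max_idle_seconds:
                r.delete(key)
                removed += 1
        return removed

    if __name__ == "__main__":
        print("purged %d stale keys" % purge_stale_keys())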

Thank you for using our service. We know that many of you rely on it. We truly strive every hour of every day to keep Bitbucket running reliably, so you can get stuff done and create amazing software.

6 Comments

  • Cheng-Wei Chien
    Posted September 5, 2015 at 4:16 am | Permalink

    Thanks for the explanation.
    That’s why I like Bitbucket so much. 🙂

    • Robclancy
      Posted September 21, 2015 at 9:10 am | Permalink

      Because when they go down over and over again they give a reason? where are the reasons for bugs that have existed for years or feature requests that have been very important for people but then sit there for years while all the paid products get their updates. They don’t care about people using Bitbucket. Time to move on to something stable.

  • Posted September 8, 2015 at 4:31 am | Permalink

    Thank you for the explanation. Down time is fine, communication is key 🙂

    • Robclancy
      Posted September 21, 2015 at 9:11 am | Permalink

      It’s really not though. Especially when it happens so often.

  • Mxx
    Posted September 8, 2015 at 8:48 am | Permalink

    Jim, but you still have not communicated what was the root cause of the outage. What happened that caused the service to go down? Yes, RMQ queues started to grow, but why? If it was celery, why did it suddenly stop working?

  • Richard
    Posted September 14, 2015 at 2:51 am | Permalink

    My biggest complaint is the fact that information was nigh on impossible to get a hold of. At around 1400 UTC we started seeing these issues. At that point your status site was showing all green and no updates were given by your Twitter account. Once the updates came in there was no mention of a general outage until much later than what you describe in the article here. It took you several hours to even acknowledge that there was a problem which of course wasted my time trying to find the problem internally.

    As mentioned by Daryl, downtime is not a problem as long as you communicate 🙂 Kudos of course for releasing this info.