What caused our service outage on Wednesday?

We owe you an explanation for the recent service interruption. On Wednesday, our users were unable to push or pull over SSH or HTTPS, access the GUI or API, or complete important tasks on time. We violated the cardinal rule of SaaS – “don’t have downtime” – and we’re very sorry for all the trouble we’ve caused. We know that you rely on Bitbucket for your daily work and when our service isn’t working correctly it affects your productivity.

I’d like to take a little time to explain what happened, how we handled it, and what we’re doing to avoid a repeat of the situation in the future.

What actually happened?

At 11:50 UTC, Wednesday, September 2nd, we noticed a drop in the webhook processing rate. Our Site Reliability Engineers (SREs) escalated the issue to the Bitbucket team. At the same time, we noticed that the load balancer queues for every other service – SSH, Git over HTTPS, Mercurial over HTTPS, the API, the GUI – had begun growing very, very quickly. By 12:10 UTC, all services began failing for users.

Response times soon skyrocketed for Git over HTTPS.

Initially, the issues seemed unrelated. Application nodes all showed a reasonable load for that time of the day, and our RabbitMQ processes had not thrown any alerts. As our investigation later revealed, we had experienced severe resource exhaustion on the cluster serving both Redis and RabbitMQ, and the kernel’s out-of-memory killer began terminating both sets of services. This led to a split brain in RabbitMQ, as individual RabbitMQ instances were unable to communicate with each other and select a consistent master. Additionally, the application nodes’ Celery workers were unable to publish their background tasks consistently to RabbitMQ, thus blocking normal processing. With application processes blocked, the load balancers’ queues backed up, and soon the HAProxy daemon handling SSH traffic crashed outright.
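
For readers who run Celery themselves: the pattern to watch for is a publisher that keeps waiting on a broker that is accepting connections but not behaving. Below is a minimal sketch (not our production configuration, and assuming a recent Celery release; the broker URL, task, and numbers are placeholders) of settings that bound how long a publish will retry before surfacing an error to the application:

    # Hypothetical illustration only: make Celery publishers give up quickly
    # when RabbitMQ is unreachable, instead of blocking the calling process.
    from celery import Celery

    app = Celery("bitbucket_tasks", broker="amqp://rabbitmq.internal:5672//")

    app.conf.update(
        broker_connection_timeout=2,   # seconds to wait for a broker connection
        task_publish_retry=True,       # retry a failed publish...
        task_publish_retry_policy={    # ...but only briefly, then raise to the caller
            "max_retries": 3,
            "interval_start": 0,
            "interval_step": 0.5,
            "interval_max": 2,
        },
    )

    @app.task
    def fire_webhook(repo_id, payload):
        """Placeholder for a background job such as webhook delivery."""
        pass

With a bounded retry policy, a failed publish surfaces as an exception the application can log and handle, rather than tying up an application process while the broker sorts itself out.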

By 12:30 UTC, we had focused our attention on the load balancer queues. Restarting haproxy-ssh did not work at first, as the kernel’s SYN backlog was still full, but once enough broken connection requests had timed out, we were able to restore haproxy-ssh at 12:47 UTC. By that point, RabbitMQ had stabilized to the point that Celery workers were once again able to publish. Between the load balancer fix and the resumption of Celery publishing, our backlog of connection requests started to come down.

With traffic moving again, we started looking for a root cause. There were a few false leads: network changes, software deployments, back-end storage, and new RabbitMQ/Redis hardware. By 14:00 UTC, we came to a consensus that RabbitMQ’s failure was the reason we had seen so many problems. Unfortunately, Celery consumers were still unable to connect reliably to the queues, so the queues began growing very, very quickly, adding roughly 250,000 new tasks from 13:31 UTC to 13:59 UTC.

We spent the next several hours attempting to get Celery consuming again. However, the sheer size of the task queues, combined with the backlog of other services and our normal level of traffic for a Wednesday, meant slow going. At several points, consumer workers lost their ability to communicate effectively with their master processes. This kept the queues from being consumed consistently – workers would take a few tasks, run them, then die, all within the space of a few minutes.
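
The fix was not a single setting, but to give a sense of the worker-side tuning involved, here is a rough sketch (again assuming a recent Celery release; none of these values are our production settings) of options that limit how much work a dying worker can strand:

    # Illustrative worker settings for draining a large backlog with flaky
    # workers; the values are placeholders, not Bitbucket's configuration.
    from celery import Celery

    app = Celery("bitbucket_tasks", broker="amqp://rabbitmq.internal:5672//")

    app.conf.update(
        worker_prefetch_multiplier=1,     # reserve one task at a time, so a dying
                                          # worker cannot strand a large batch
        worker_max_tasks_per_child=100,   # recycle child processes before they degrade
        task_acks_late=True,              # acknowledge only after a task finishes,
                                          # so unfinished work is redelivered
        task_reject_on_worker_lost=True,  # requeue a task whose child process is killed
    )

The trade-off is throughput: a prefetch of one and late acknowledgements make each worker a little slower, but they keep a crashing worker from silently dropping the tasks it had already claimed.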

Once Celery was fixed, it began working through a sizable message backlog, which steadily diminished.

By 19:27 UTC, we had managed to troubleshoot Celery enough to get it consuming tasks consistently, though slowly. We still had two million backlogged tasks to process, and even with a relatively quiet back end, that takes some time. By 21:00 UTC, we had managed to reduce the backlog to about 3500 tasks, and we resolved the incident as soon as we were confident the backlog wouldn’t go back up.

What are we doing to keep this from happening again?

There are a few actions we’re taking immediately:

  1. We’ve migrated Redis to new hardware, separate from RabbitMQ, so that we can avoid resource contention like this in the future.
  2. We’ve corrected a flaw in a script that truncates old keys in Redis; had it been working properly, it could have helped us notice the Redis cluster problems sooner. The script has already purged about 100,000 dead keys, and it’s still going.
  3. Our developers are re-evaluating their usage of Redis and RabbitMQ, and they are preparing adjustments that should reduce unnecessary traffic to either cluster.
  4. Last but not least, the outage exposed some holes in our monitoring, especially around RabbitMQ and Celery. We’re upgrading our monitors to improve problem detection (a simplified sketch of the idea follows this list), and we’re rewriting SRE runbooks so that there’s a much clearer plan of action.
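
As an example of the monitoring direction (a simplified sketch, not our actual checks; the host, credentials, and threshold below are placeholders), the RabbitMQ management API makes it straightforward to page someone as soon as a queue starts backing up:

    # Simplified monitoring sketch: poll the RabbitMQ management API and
    # alert when any queue holds too many ready messages. Host, credentials,
    # and threshold are illustrative.
    import requests

    RABBIT_API = "http://rabbitmq.internal:15672/api/queues"
    THRESHOLD = 10000  # ready messages before we page someone

    def alert(message):
        # stand-in for a real paging integration
        print("ALERT: " + message)

    def check_queue_depth():
        resp = requests.get(RABBIT_API, auth=("monitor", "secret"), timeout=5)
        resp.raise_for_status()
        for queue in resp.json():
            ready = queue.get("messages_ready", 0)
            if ready > THRESHOLD:
                alert("{0} has {1} ready messages".format(queue["name"], ready))

    if __name__ == "__main__":
        check_queue_depth()

Run every minute, a check like this would have flagged the growing queues long before the task count reached the millions.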

Thank you for using our service. We know that many of you rely on it. We truly strive every hour of every day to keep Bitbucket running reliably, so you can get stuff done and create amazing software.