Report on Sunday’s Outage

After last Sunday morning’s downtime, we thought it’d be nice to share exactly what happened and what steps we took to resolve things.

Timeline

At roughly 3am Pacific time on Sunday morning, April 7th, we (Bitbucket’s SF-based developers) were alerted to reduced availability of the site. Our support engineers responded first, but the problem required the help of the Bitbucket developers, which at this particular time of day complicated the investigation a bit.

When we logged in, we noticed extremely high load on all of our webservers, leaving them unable to keep up with incoming traffic. As load on our fileservers had also risen significantly, we initially focused our attention on some of the recent changes we had made to our storage infrastructure and configuration.

When this did not reveal any regressions, we noticed that Dogslow was reporting an excessive number of page timeouts on one specific, popular public repository. This repo, as well as its forks, was being flooded with requests, many of them hitting pages that are relatively expensive for us to render. At its peak, as much as 10% of all traffic went to these repositories, and because the access pattern differed dramatically from normal traffic (mostly hammering expensive pages), it overwhelmed us, filling up our worker pools and causing pages to time out.

As the traffic appeared to target just this one repo and its forks, we temporarily made these repositories unavailable, serving a 503 Service Unavailable for any request to them. The site immediately came back, confirming that the load was indeed caused by this targeted traffic.
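
For those curious about the mechanics, here is a minimal sketch of how such a block can be expressed at the load balancer. The repository path, backend names and addresses are made up, and this is an illustration rather than a copy of our actual configuration; routing matching requests to a backend with no servers makes HAProxy answer them with a 503:

    # sketch: assumes 'mode http' is set in a defaults section
    frontend http-in
        bind *:80
        # hypothetical path; each fork would need a similar rule (or one regex)
        acl blocked_repo path_beg /exampleowner/examplerepo
        use_backend be_blocked if blocked_repo
        default_backend be_web

    backend be_blocked
        # intentionally empty: with no servers available, HAProxy answers
        # matching requests with a 503 Service Unavailable

    backend be_web
        server web1 10.0.0.10:8000 check

The nice property of this approach is that it sheds the unwanted traffic before it reaches a single application worker, and removing the ACL reverts the block.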

Next we looked for patterns in the now-blocked traffic and noticed that while it seemed to come from unique IP addresses all over the world, the requests shared a distinct User-Agent string identifying them as coming from a web crawler.

We contacted the people at this company about their excessive traffic and preemptively blocked them without waiting for a reply. We added the crawler to our robots.txt, but since we couldn’t afford to wait for it to re-fetch that file, we also blacklisted its User-Agent string on our HAProxy load balancers.
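
The robots.txt side of that is simple; with a made-up crawler name (the real User-Agent is not shown here) it looks something like this:

    # hypothetical crawler name, for illustration only
    User-agent: ExampleCrawler
    Disallow: /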

By now the crawler people had gotten back to us, saying they had reduced the aggressiveness of their crawling. However, their traffic never showed any sign of reduction and it was clear that we needed to keep the blacklist in place.

Immediately after deploying this, the site appeared healthy again and we were very keen to go back to sleep. Before long, however, we noticed a different problem: the site became unresponsive once again, requests started to time out, and a very ugly 504 page was being served by our Nginx-based SSL terminators. All the while, the actual load on our servers, at about 60%, was a lot lower than normal. Something was preventing traffic from reaching the backend.

It turned out that when we blocked the crawler in HAProxy, we had used its reqtarpit directive without realizing that reqtarpit keeps each matching connection open for several seconds before closing it. With the crawler still opening a ton of new connections every second, this starved our HAProxy connection pools, triggering the 504s from Nginx. Instead of reqtarpit we should have used reqideny, which rejects the request immediately. We quickly corrected our mistake and brought the site back to life.
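
For reference, the difference comes down to a single directive. The User-Agent pattern below is made up and this is a sketch rather than our exact configuration:

    # What we had: matching requests are tarpitted, i.e. the connection is held
    # open for the tarpit timeout before being closed, tying up a slot each time.
    reqtarpit ^User-Agent:.*ExampleCrawler

    # What we should have used: matching requests are rejected immediately
    # (with a 403 Forbidden), freeing the slot right away.
    reqideny ^User-Agent:.*ExampleCrawler

With the crawler opening so many new connections every second, that multi-second hold is the difference between a harmless block and exhausting our own connection pools.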

Conclusion

In hindsight we’re not too impressed by our own performance in addressing this issue and would have liked a speedier resolution. It’s worth noting that in addition to automated monitoring, we have staff in different time zones covering all 24 hours of the day. In this instance, staff in Asia were the first to respond, but investigation of site-wide calamities sometimes requires the help of core developers, who in this case had to be woken up, delaying our response a bit.

It was about 6am by the time we flipped our status site back to green, ending a significant period of limited availability, for which we apologize.

We’d love to say that this will never happen again, but what we can say is that we’ll be much better prepared to handle similar incidents in the future. Calamities are often unique, making it hard to predict and anticipate the unknown, but we are committed to making our infrastructure more flexible so that dealing with issues like this one becomes less complicated, and we’ll work as hard as we can to ensure uptime remains what you have come to expect from Bitbucket.