We know how important the reliability of Bitbucket is, so we thought we should take some time to explain the downtime we’ve had during the last two weeks.
We release in one-week iterations, deploying upgrades every Thursday afternoon from our San Francisco office. Most of the time our database schema migrations don’t require any downtime, and we’re able to upgrade the site live. However, our last couple of upgrades have required downtime because we changed the way we store repositories on disk in preparation for an upcoming 50TB storage deployment. The database migration took 20 minutes. We try to keep these sorts of disruptive upgrades to a minimum, but sometimes there’s no way to avoid taking the site offline for a short while in order to make fundamental changes to our infrastructure.
On May 6th our load balancers reached the maximum number of connections they’re configured to accept. We’ve seen this problem happen twice now, but this time it happened in the middle of the night while everyone on our team was asleep, so we weren’t as quick to react to SMS alerts and phone calls as we normally are. We use HAProxy for load balancing, which is a lovely piece of software. It’s well documented, and the author, Willy Tarreau, graciously answers questions on mailing lists and forums. When we originally set up HAProxy we read through the excellent documentation and scoured the web for examples. We probably should have stuck to the official docs, but instead we ended up copying a very high timeout setting from a published HAProxy config from a popular social media site. In our configuration the high timeout resulted in a large number of connections in the CLOSE_WAIT state, which built up until HAProxy hit its connection limit. We’ve lowered that timeout setting on our load balancers and set up additional monitoring to help prevent it from happening again.
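To illustrate the kind of monitoring that can catch this, here is a minimal sketch of counting sockets per TCP state on a Linux host. It is not the check we actually run; the function names `count_tcp_states` and `close_wait_exceeds` and the threshold are hypothetical, and it assumes the text passed in has the layout of Linux’s `/proc/net/tcp`, where the fourth column is a hex state code and `08` means CLOSE_WAIT.

```python
from collections import Counter

# Hex state codes as they appear in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the contents of /proc/net/tcp."""
    counts = Counter()
    # The first line is a column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


def close_wait_exceeds(proc_net_tcp_text, threshold):
    """Return True when the number of CLOSE_WAIT sockets is above threshold."""
    return count_tcp_states(proc_net_tcp_text)["CLOSE_WAIT"] > threshold


if __name__ == "__main__":
    # Hypothetical threshold; pick one well below the load balancer's limit.
    with open("/proc/net/tcp") as handle:
        text = handle.read()
    print(count_tcp_states(text))
    print("alert" if close_wait_exceeds(text, 1000) else "ok")
```

Run from cron or a monitoring agent, a check like this would raise an alert while the CLOSE_WAIT count is still climbing, rather than after the load balancer has stopped accepting connections.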
We’ve been having ongoing issues with our database connection pooler, pgbouncer, closing connections to our database instead of pooling them. We’ve known about the problem for a while, but it became more pronounced with recent spikes in activity. We’ve made improvements to our pgbouncer configuration to mitigate it, and we’re working on changes to the way the site manages database connections that should fix it for good. We’ll follow up with another blog post providing more technical details later in the week.
For the past two weeks we’ve received occasional alerts indicating that requests to our application servers were periodically timing out. We’ve been really concerned, as this pointed to an overall degradation in our site’s performance. This particular problem was very difficult to troubleshoot. Unlike an internal server error, which would result in log entries and Django error emails, our Gunicorn workers were simply timing out. It was pretty clear something was blocking Gunicorn, but since Gunicorn SIGKILLs its workers after they time out, there was little chance to log the problem. This sparked the curiosity of one of our developers, Erik van Zijst. As a 20% project he wrote a Django middleware we call Dogslow that logs slow requests with full tracebacks. Once we deployed Dogslow to production we quickly realized one of our NFS servers was being heavily loaded by backup software and another was slowed down by our Redis instance. Our data center staff resolved the backup issue by downgrading to an older version of the software, and we migrated Redis to another host to free up resources. Later today we’ll follow up with another post detailing the work we did to develop Dogslow.
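The general idea behind logging a slow request before its worker is killed can be sketched with a watchdog timer. This is not Dogslow’s actual code; the name `watchdog` and its `timeout` and `report` parameters are hypothetical, and it is written as a plain context manager rather than Django middleware so that it runs on its own. It assumes the timeout is set lower than the Gunicorn worker timeout, so the traceback is captured while the worker is still alive.

```python
import sys
import threading
import time
import traceback
from contextlib import contextmanager


@contextmanager
def watchdog(timeout, report):
    """Call report(traceback_text) if the wrapped block runs longer than timeout seconds."""
    watched_thread_id = threading.get_ident()

    def fire():
        # Look up the frame the watched thread is executing right now.
        frame = sys._current_frames().get(watched_thread_id)
        if frame is not None:
            report("".join(traceback.format_stack(frame)))

    timer = threading.Timer(timeout, fire)
    timer.daemon = True
    timer.start()
    try:
        yield
    finally:
        # Fast requests cancel the timer before it fires, so nothing is logged.
        timer.cancel()


def slow_view():
    """Stand-in for a request handler blocked on something slow, such as NFS."""
    time.sleep(0.3)


if __name__ == "__main__":
    with watchdog(0.05, print):
        slow_view()
```

The report shows where the thread is blocked at the moment the timer fires, which is the information that is otherwise lost when a worker is killed without a chance to log anything.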