Segregating services

I wanted to take a moment to talk about some infrastructure changes we’ve made on Bitbucket lately, and apologize for some flakiness this week.

Over the years, the architecture behind Bitbucket has changed significantly. On day 1, we ran Apache with mod_wsgi on EC2, but today our stack looks completely different. And this week, we made yet another major change.

So what did we do?

Bitbucket is already segregated into smaller parts; we have pools of Django workers, mercurial workers, and git workers. However, up until today, everything has run on every machine. This is really just a leftover from the early days, when we had a handful of machines on EC2. It makes a lot of sense to split up your service, designating sets of machines to handle specific tasks. That makes it easier to measure, profile & improve.

Using clever routing and inspection, we’ve shortened request paths all over the place. This is good news for everyone: you have fewer hops to get your data, and we have fewer moving parts when things act funny. Simpler is always better. This also means we can re-route traffic when necessary, and easily provision new workers and stick them in the pool.

Over the past few years, our offering has grown beyond a handful of virtualized machines to many racks of expensive hardware. We have automated a lot, most importantly deployments: the sheer number of machines we would otherwise need to SSH into had simply become unmanageable (hello, carpal tunnel). So it made a lot of sense to simplify where we could, increasing reliability, measurability and transparency.

I’ll get a picture for you guys soon. The rest of this post is rather technical, and unless you’re interested in that sort of thing, you can skip right to the end.

Technical details (nerd porn)

Previously, when you cloned a git or mercurial repository, the request went something like this:

HAProxy (load balancer) → nginx → Django (authentication & authorization) → X-Accel-Redirect back to nginx → hgweb/gitweb workers

Let me explain: first you hit one of our load balancers, running HAProxy. HAProxy proxies you through to Django. Why Django? Because we need to authenticate and authorize your request, and that’s all done there. We then make use of a feature in nginx called “X-Accel-Redirect”. It’s a special header you can return in your response, and it tells nginx “go look here”. If you do an X-Accel-Redirect to a local file, nginx serves that file. However, if you do an X-Accel-Redirect to a location, nginx replays the entire request, as it came in, against that new location. This is very handy: we let Django authenticate the request and then hand it on to our mercurial workers. That way, the workers don’t even need to have knowledge of Bitbucket, and can just be mercurial workhorses.
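To make that concrete, here is a minimal sketch of the old flow. It’s illustrative only: the view name, URL layout and internal location are made up, not our actual code.

    # Hypothetical Django view sketching the old flow (not our actual code):
    # Django does the authentication/authorization work, then hands the
    # request back to nginx via the X-Accel-Redirect header instead of
    # serving the repository data itself.
    from django.http import HttpResponse

    def serve_repo(request, path):
        # ... authenticate and authorize the request here; on failure,
        # return a 401/403 instead of falling through ...
        response = HttpResponse()
        # nginx intercepts this header and replays the original request
        # against an internal location that proxies to the hg/git workers.
        response['X-Accel-Redirect'] = '/hg-internal/' + path
        return response

On the nginx side, that /hg-internal/ path would be declared as an internal location that proxies to the mercurial or git workers, so the repository data never passes through Django itself.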

And so they have been. But this introduces a dependency, namely Django. It’d be a lot better to get rid of that, and get to your destination as early as you can.

What we’ve done is develop a small WSGI library called singlemalt (get it?). It’s a thin middleware that handles authentication and provides hooks for authorization for each individual service. We plugged this in under the hood of hgweb, gitweb, etc. That gives us transparent authentication, and enough flexibility to reuse the library across different services. It’s nothing special, but for us it was worth the investment, and it helps keep things simple and consistent. While we were at it, we also improved health checks across the services.
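Singlemalt itself isn’t public, but the general shape of a middleware like this looks roughly as follows. This is a simplified sketch; the class name and the credential/authorization hooks are placeholders, not the real thing.

    # Rough sketch of a WSGI authentication middleware in the spirit of
    # singlemalt. The hooks passed to the constructor are placeholders.
    import base64

    class AuthMiddleware(object):
        def __init__(self, app, check_credentials, authorize=None):
            self.app = app
            self.check_credentials = check_credentials  # (user, password) -> bool
            self.authorize = authorize                  # optional per-service hook

        def __call__(self, environ, start_response):
            user = None
            header = environ.get('HTTP_AUTHORIZATION', '')
            if header.startswith('Basic '):
                decoded = base64.b64decode(header.split(' ', 1)[1]).decode('utf-8')
                user, _, password = decoded.partition(':')
                if not self.check_credentials(user, password):
                    user = None
            if user is None or (self.authorize and not self.authorize(user, environ)):
                start_response('401 Unauthorized',
                               [('WWW-Authenticate', 'Basic realm="Bitbucket"'),
                                ('Content-Type', 'text/plain')])
                return ['authentication required\n'.encode('utf-8')]
            environ['REMOTE_USER'] = user
            return self.app(environ, start_response)

Wrapping hgweb (or any other WSGI application) is then just a matter of constructing AuthMiddleware around it with the right hooks, which is what makes the approach reusable across services.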

The new request chain looks like this:

HAProxy (ACL routing on the load balancer) → hgweb/gitweb workers (singlemalt handles authentication)

Yay! No more Django dependency, and you get to talk straight to hgweb. We did this using the ACL feature of HAProxy; ACLs are akin to rewrite rules. We look at incoming requests and, based on various headers (like User-Agent), determine who you should talk to. This lets us entirely bypass parts that aren’t strictly necessary, freeing them up to serve normal web page requests.
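As a rough illustration, the routing boils down to something like the snippet below. The bind and server addresses and the exact ACL matches are made up; our production config is more involved.

    # Simplified sketch of the User-Agent based routing in HAProxy.
    # Addresses and the exact ACL matches are illustrative only.
    frontend main
        bind *:80
        acl is_hg   hdr_sub(User-Agent) -i mercurial
        acl is_git  hdr_sub(User-Agent) -i git
        use_backend servers-hg  if is_hg
        use_backend servers-git if is_git
        # everything else (normal web page requests) still goes to Django
        default_backend servers-ssl

    backend servers-hg
        # pool of mercurial workers running hgweb behind singlemalt
        server hg1 10.0.0.11:8000 check
        server hg2 10.0.0.12:8000 check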

This is the result

This graph shows sessions per second, per backend. Backends are pools of remote workers HAProxy will forward requests to.

The red line, servers-ssl, was the backend that served all requests, including Django, hgweb and gitweb. After we deployed the new routing, look at how its traffic first dropped significantly and a new light green line appeared. That one shows sustained sessions to the new servers-hg backend.

As a side note, look at the difference in mercurial traffic vs. web page traffic! Mercurial sure is chatty.

Shortly thereafter, we began re-routing gitweb traffic as well, causing a further drop in the red line and the introduction of a new purple line.

Mind you, we have several load balancers, so this only represents what a single one puts through.

Having made these changes, we have increased our fault tolerance across the board. Eliminating dependencies such as Django from the request chain means that if all the Django workers are busy (or down), you can still interact with your repositories. Or if someone decides to bombard our sshd, Django will still serve requests.

Conclusion

This was a huge rollout! Along the way, we tripped a few times, but in retrospect, it helped us identify problem areas immediately, and we appreciate the patience & understanding of those affected at the time.

One thing that helped us immensely, especially with retaining our sanity, was to break this up into smaller bits that we could roll out individually. We used branches and pull requests for this. :)

PS: HAProxy is a seriously neat piece of software, and Willy Tarreau is a boss for creating it.