Segregating services

By on August 24, 2012

I wanted to take a moment to talk about some infrastructure changes we’ve made on Bitbucket lately, and apologize for some flakiness this week.

Over the years, the architecture behind Bitbucket has changed significantly. On day 1, we ran Apache with mod_wsgi on EC2, but today our stack looks completely different. And this week, we made yet another major change.

So what did we do?

Bitbucket is already segregated into smaller parts; we have a pool of Django workers, mercurial workers, and git workers. However, up until today, everything has run on every machine. This is really just a leftover from the early days, when we had a handful of machines on EC2. It makes a lot of sense to split up your service, designating a set of machines to handle specific things. This makes it easier to measure, profile & improve.

Using clever routing and inspection, we’ve shortened request paths all over the place. This is good news for everyone: You have fewer hops to get your data, and we have fewer moving parts when things act funny. Simpler is always better. This also means we can re-route traffic when necessary, and easily provision new workers and stick them in the pool.

Over the past few years, our offering has grown beyond a handful of virtualized machines to many racks of expensive hardware. We have automated a lot, most importantly deployments. It simply became unmanageable, given the sheer number of machines we would need to SSH into (and develop carpal tunnel in the process). So it made a lot of sense to simplify where we could, increasing reliability, measurability and transparency.

I’ll get a picture for you guys soon. The rest of this post is rather technical, and unless you’re interested in that sort of thing, you can skip right to the end.

Technical details (nerd porn)

Previously, when you cloned a git or mercurial repository, the request went something like this:

Let me explain: First you hit one of our load balancers, running HAProxy. HAProxy proxies you through to Django. Why Django? Because we need to authorize/authenticate your request. That’s all done there. We then make use of a feature in nginx called “X-Accel-Redirect”. It’s a special header you can return in your response, and it tells nginx “go look here”. So if you did an X-Accel-Redirect to a local file, nginx would serve that file. However, if you do an X-Accel-Redirect to a location, nginx will replay the entire request as it came in, at a new location. This is very handy, as we let Django authenticate the request, and pass it on to our mercurial workers. That way, they don’t need to even have knowledge of Bitbucket, and can just be mercurial workhorses.
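To make the X-Accel-Redirect trick a bit more concrete, here is a minimal sketch in plain WSGI rather than our actual Django code. The `check_credentials` helper and the `/_hg` internal location are made up for illustration; on the nginx side you would need a matching location (typically marked `internal`) that proxies to the repository workers.

```python
# Sketch only: a bare WSGI app standing in for the Django view.
# check_credentials() and the "/_hg" prefix are hypothetical.

def check_credentials(environ):
    """Placeholder check: accept any request that carries an Authorization header."""
    return bool(environ.get("HTTP_AUTHORIZATION"))


def auth_then_redirect(environ, start_response):
    if not check_credentials(environ):
        start_response("401 Unauthorized",
                       [("WWW-Authenticate", 'Basic realm="repos"'),
                        ("Content-Length", "0")])
        return [b""]

    # Tell nginx where to go next. nginx handles the request at that
    # location instead of sending this empty body back to the client.
    internal_path = "/_hg" + environ.get("PATH_INFO", "/")
    start_response("200 OK",
                   [("X-Accel-Redirect", internal_path),
                    ("Content-Length", "0")])
    return [b""]
```

The application only decides yes or no and names a destination; nginx does the actual hand-off, which is why the workers behind it never need to know about the web app.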

And so they have been. But this introduces a dependency, namely Django. It’d be a lot better to get rid of that, and get to your destination as early as you can.

What we’ve done is develop a small WSGI library, called singlemalt (get it?). It’s a thin middleware that authenticates, and provides hooks for authorization for each individual thing. We plugged this under the hood of hgweb, gitweb, etc. That gives us transparent authentication, and enough flexibility to reuse the library throughout different services. It’s nothing special, but for us, it was worth the investment. This helps keep things simple and consistent. We also took the liberty of improving health checks across the services.
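If you have never written WSGI middleware, the general shape is roughly the following. To be clear, this is not singlemalt's code or API; the class name, the two callbacks and the use of HTTP Basic auth are all assumptions picked to keep the sketch short.

```python
import base64


class AuthMiddleware:
    """Wrap a WSGI app: authenticate via HTTP Basic, then ask an authorization hook."""

    def __init__(self, app, authenticate, authorize):
        self.app = app
        self.authenticate = authenticate  # (username, password) -> bool
        self.authorize = authorize        # (username, environ) -> bool

    def __call__(self, environ, start_response):
        username = self._basic_user(environ)
        if username is None:
            return self._deny(start_response, "401 Unauthorized",
                              [("WWW-Authenticate", 'Basic realm="repos"')])
        if not self.authorize(username, environ):
            return self._deny(start_response, "403 Forbidden", [])
        environ["REMOTE_USER"] = username
        return self.app(environ, start_response)

    def _basic_user(self, environ):
        """Return the authenticated username, or None if credentials are missing or bad."""
        header = environ.get("HTTP_AUTHORIZATION", "")
        scheme, _, encoded = header.partition(" ")
        if scheme.lower() != "basic" or not encoded:
            return None
        try:
            decoded = base64.b64decode(encoded).decode("utf-8")
        except (ValueError, UnicodeDecodeError):
            return None
        username, sep, password = decoded.partition(":")
        if not sep or not self.authenticate(username, password):
            return None
        return username

    @staticmethod
    def _deny(start_response, status, extra_headers):
        start_response(status, [("Content-Length", "0")] + extra_headers)
        return [b""]


# Usage: the wrapped app (a stand-in for something like hgweb) never sees
# unauthenticated requests. Credentials and the path rule are toy values.
def repo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello " + environ["REMOTE_USER"].encode()]


app = AuthMiddleware(
    repo_app,
    authenticate=lambda user, pw: (user, pw) == ("alice", "secret"),
    authorize=lambda user, environ: environ.get("PATH_INFO", "").startswith("/alice/"),
)
```

Keeping authentication in the wrapper and leaving authorization as a per-service hook is what lets the same middleware sit in front of different services.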

The new request chain looks like this:

Yay! No more Django dependency, and you get to talk straight to hgweb. We did this by using the ACL feature of HAProxy; ACLs are akin to rewrite rules. We look at incoming requests, and based on various headers (like User-Agent), we determine who you should talk to. This lets us entirely bypass parts that aren’t strictly necessary, freeing them up to serve normal web page requests.
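The real rules live in HAProxy's configuration, but the decision they make can be sketched in a few lines of Python. The backend names servers-ssl and servers-hg are the ones on the graph in this post; the servers-git name and the exact User-Agent prefixes are assumptions for illustration, not our actual rules.

```python
# Sketch of header-based routing, not the actual HAProxy ACLs.
# "servers-git" and the User-Agent prefixes below are assumed values.
ROUTES = [
    ("mercurial/", "servers-hg"),
    ("git/", "servers-git"),
]
DEFAULT_BACKEND = "servers-ssl"


def pick_backend(headers):
    """Return the backend name for a request, based on its User-Agent header."""
    user_agent = headers.get("User-Agent", "").lower()
    for prefix, backend in ROUTES:
        if user_agent.startswith(prefix):
            return backend
    # Anything that does not look like a repository client goes to the web pool.
    return DEFAULT_BACKEND
```

For example, `pick_backend({"User-Agent": "mercurial/proto-1.0"})` picks the Mercurial pool, while a browser User-Agent falls through to the default.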

This is the result

This graph shows sessions per second, per backend. Backends are pools of remote workers HAProxy will forward requests to.

The red line, servers-ssl, was the backend that served all requests, including Django, hgweb and gitweb. After we deployed the new routing, look at how the traffic first dropped significantly, and a new light green line appeared. That one shows sustained sessions to the new servers-hg backend.

As a side note, look at the difference in mercurial traffic vs. web page traffic! Mercurial sure is chatty.

Shortly thereafter, we began re-routing gitweb traffic as well, causing a further drop of the red line, and the introduction of a new, purple line.

Mind you, we have several load balancers, so this only represents what a single one puts through.

Having made these changes, we have increased our fault tolerance across the board. Eliminating dependencies such as Django in the request chain now means that if all the Django workers are busy (or down), you can still interact with your repositories. Or if someone decides to bombard our sshd, Django will still serve requests.

Conclusion

This was a huge rollout! Along the way, we tripped a few times, but in retrospect, it helped us identify problem areas immediately, and we appreciate the patience & understanding of those affected at the time.

One thing that helped us immensely, especially with retaining our sanity, was to break this up into smaller bits that we could roll out individually. We used branches and pull requests for this. (smile)

PS: HAProxy is a seriously neat piece of software, and Willy Tarreau is a boss for creating it.

  • http://michaelgrace.org/ MikeGrace

    Awesome!!! Very cool to see that the team was able to make all those changes. Great job of team! : )

  • http://twitter.com/marcopinheiro Marco Pinheiro

    tks for sharing this!

  • ArneBab

    looks great!

    One thing, though: “mercurial sure is chatty” — don’t you mean “most people who access our sites want to get info on hg repos”?

    Or rather: “we really have lots of hg users interacting with our service via Mercurial”?

    • http://twitter.com/jespern Jesper Noehr

      No, I meant that Mercurial’s protocol does a lot of roundtrips. For example, a push requires something like 5 or 6 HTTPS requests. This just means it’s very “talkative”, requiring a lot more requests than for example Git.

  • Alir3z4

    Love to know more about how Bitbucket works under the hood.
    Especially the database and how it handles the data[store], and also the server architecture.
    Since the footer says that Bitbucket uses Django 1.3.1, I’m guessing that it uses MongoDB for the database backend, etc.

    Anyway, thanks for the sharing and amazing services ;)

  • Arian Fornaris

    What I see is that right now it is the slower thing I’m running in the web. My connection is not good, but it takes minutes to post a comment. I hope you guys fix it soon because right now I just see one solution: migrate to github or something, but I like bitbucket, please fix it.

  • den0833.cx

    Dear Jesper Noehr
    I have skill is very small. Politeness follow up.
    I hope bring up my character.
    Do one’s best works!!
    Thank you**
    I’m very tyred ♪( ´θ`)ノ
    But just going on bitbucke’s
    Thank you very much.

    den0833.cx

  • Nicolas Grilly

    Thank you for sharing this! One question though: how do you restart haproxy, for example after a configuration change, without refusing new connections?