
Migrating to new hard drives

In case you haven’t noticed, Bitbucket’s been suffering from slowdowns. This is mainly due to poor I/O performance we get from our Amazon EBS mounts.

After much planning and benchmarking, we’ve decided to move our data over to an 8-disk RAID 0 setup instead. Over the past few days, we’ve been synchronizing data to the new mountpoint, and earlier this evening, we made the switch. This resulted in about 2 minutes of downtime.

Now, seeing as causing more I/O load on the live drives would result in Bitbucket being completely unreachable, we did our best to get the new RAID disks into action with as-fresh-as-possible data before we made the switch. This allows us to have minimal downtime, but it also means that the live data will now be a few hours old.

If you’re seeing old data in your repositories (or in the worst case, missing repositories that were created within the last few hours before the switch), that’s to be expected.

We’re currently synchronizing everything back from the now-legacy drives to the live drives. Synchronizing data can be a somewhat lengthy operation, but it’s going pretty fast.

In the end, this should result in a much faster Bitbucket. We’ve been under I/O strain for some time now, and this is the first (large) step in resolving these issues for good.

Posted in bitbucket.

On our extended downtime, Amazon and what’s coming

As many of you are well aware, we’ve been experiencing some serious downtime the past couple of days. Starting Friday evening, our network storage became virtually unavailable to us, and the site crawled to a halt.

We’re hosting everything on Amazon EC2, a.k.a. “the cloud”, and we’re also using their EBS service to store everything from our database and logfiles to user data (repositories).

Amazon EBS is a persistent storage solution for EC2, where you get high-speed (and free) connectivity from your instances, while it’s also replicated. That gives you a lot for free, since you don’t have to worry about hardware failure, and you can create periodic “snapshots” of your volumes easily.

While we were down, it was unknown to us what exactly the problem was, but it was almost certainly a problem with the EBS store. We’ve been working closely with Amazon the past 24 hours resolving the issue, and this post will outline what exactly went wrong, and what was done to remedy the problem.

Symptoms

What we were seeing on the server was high load, even after turning off anything that took up CPU. Load is a result of stuff “waiting to happen”, and after reviewing iostat, it became apparent that the “iowait” was very high, while the “tps” (transactions per second) was very low for our EBS volume. We tried several things at this point:

  • Un-mounting and re-mounting the volume.
  • Running xfs_check on the volume, which reported no errors (we use XFS.)
  • Moving our instances and volumes from us-east-1b to both us-east-1a and us-east-1c.

None of these resolved the problem, and it was at this point we decided to upgrade to the “Gold plan” of support to gain access to the 1-hour turnaround technical support with Amazon.
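
As an aside, the symptom pattern described above (very high “iowait” combined with very low “tps”) can be written down as a simple check. This is only a minimal sketch: the DiskSample type, the looks_io_starved function and both threshold values are hypothetical, chosen for illustration, and the numbers would have to be read off iostat output by hand or by a separate parser.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """One reading for a device, as you might copy it from iostat output."""
    iowait_percent: float  # share of CPU time spent waiting on I/O
    tps: float             # transfers per second issued to the device


def looks_io_starved(sample, iowait_threshold=50.0, tps_threshold=10.0):
    """Return True when the CPU is mostly waiting on I/O while the device
    completes very few transfers. Both thresholds are illustrative guesses,
    not values taken from the incident."""
    return (sample.iowait_percent >= iowait_threshold
            and sample.tps <= tps_threshold)
```

A healthy but busy volume would normally show high tps alongside its iowait, which is why the check requires both conditions at once.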

The Support (0 hours after reporting it)

We filed an “urgent” ticket with Amazon’s support system, and within 5 minutes we had them on the phone. I spoke to the person there, describing our issue, continuously claiming that everything pointed to a network problem between the instance and the store.

What came from that was 5 or 6 hours of advice, some of it an obvious timesink, some of it somewhat credible. What they kept coming back to was that EBS is a “shared network resource” and performance would vary. We were also told to use RAID 0 to distribute our load over several EBS volumes to increase the throughput.

At this point, we were getting less throughput than you can pull off of a 1.44MB floppy, so we didn’t accept this as an answer. We did some more tests, trying to measure the bandwidth of the machine by fetching their “100mb.bin” files, which we couldn’t do. We again emphasized that this was in fact, in all likelihood, a network problem.

At this point, our outage was well known, especially in the Twittosphere. We have some rather large customers relying on service with us, and some of these customers have some hefty support contracts with Amazon. Emails were sent.

Shortly after this, I requested an additional phone-call from Amazon, this time to our system administrator. He had been compiling some rather worrying numbers over the past hours, since up until now, the support had refused to acknowledge a problem with the service. They claimed that everything was working fine, when clearly, it was not.

This time, a different support rep. called, and this time, they were ready to acknowledge our problem as “very serious.” We sent them our aggregated logs, and shortly thereafter, they reported that “we have found something out of the ordinary with your volume.”

We had been extremely frustrated up until this point, because 1) we couldn’t actually *do* anything about it, and 2) we were being told that everything should be fine. It felt like there was an elephant right in front of us, and a person next to us was insisting that there wasn’t.

Anyway (8 hours after reporting it)

From here on, we had been graced with the acknowledgement we had been waiting for: There was a problem, and it wasn’t us. We had been thinking that, you know, *maybe* we had screwed up somewhere and this was our fault. We didn’t find anything.

So, back to waiting.

What exactly triggered what happened after this, I’m not sure.

The Big Dogs (11 hours after reporting it)

I received an unrequested phone call from some higher-up at Amazon. He wanted to tell me what was going on, which was much appreciated.

He wanted to re-assure me that we were now their top priority, and he had brought in a whole team of specialized engineers to look at our case. That’s nice.

I received periodic updates, and frequent things for us to try. We sent them the logs they asked for, and complied with their wishes.

From this point on, we were treated like they owed us money, which is quite the difference from basically being called a liar earlier on.

Closing in (15 hours after reporting it)

OK, so we are finally getting somewhere. We all agreed that there was a serious networking problem between our EC2 instances and our EBS. This is around the time Amazon called me and asked me to try and put the application back online. So I did. And all was well.

I kindly asked the manager I had on the phone to please explain to me what the problem had been. He said he wasn’t really sure, and that he would set up a telephone conference with his team of engineers.

I dial in, and they start explaining what the problem is.

Now, I have been specifically advised not to say what the problem was, but I believe we owe it to our customers to explain what went wrong. Also, we owe it to Amazon to clear it up, since they were looking pretty bad due to this. I’ve already mentioned the cause briefly on our earlier status page, as well as on IRC, but let me re-iterate.

We were attacked. Bigtime. We had a massive flood of UDP packets coming in to our IP, basically eating away all bandwidth to the box. This explains why we couldn’t read with any sort of acceptable speed from our EBS, as that is done over the network. So, basically a massive-scale DDoS. That’s nice.

This is 16-17 hours after we reported the problem, which frankly, is a bit disheartening. Why did it take so long to discover? Oh well.

Amazon blocked the UDP traffic a couple of levels above us, and everything went back to normal. We monitored the services for a while longer, and after deciding that everything was holding up fine, we went to bed (it was 4 am.)

This morning

So, when we got up again this morning, things weren’t looking good, again. We were having the exact same symptoms as previously, and before our morning coffee, we re-opened our urgent ticket with Amazon. 2 minutes later I had them on the phone.

I explained that the problem was back, and they assured me the team of engineers working on this yesterday would be re-gathered and have a look. Cool.

About… 2 hours later, the problem was again resolved. It seems the attackers figured that we were now invulnerable to the UDP flood, so they instead initiated something like a TCP SYN flood. Amazon employed new techniques for filtering our traffic, and everything is fine again now.

What’s next

Amazon contacted us again after this was over, and told us they wanted to work with us in the coming days to make sure this doesn’t happen again. They have some ideas on how both they and we can improve things in general.

Are we going to do that? Maybe. We’re seriously considering moving to a different setup now. Not because Amazon isn’t providing us with decent service; they are, most of the time. While we were down, several large hosting companies contacted us directly, pitching their solutions. I won’t mention names, but some of the offerings are quite tempting, and have several advantages over what we get with Amazon.

One thing’s for sure, we’re investing a lot of man-hours into making sure this won’t happen again. If this means moving to a different host, so be it. We haven’t decided yet.

In conclusion

Let me round this post off by saying that Amazon doesn’t entirely deserve the criticism it has received over this outage. I do think they could’ve taken precautions to at least be warned if one of their routers started pumping through millions of bogus UDP packets to one IP, and I also think that 16+ hours is too long to discover the root of the problem.

After a bit of stalling with their first rep., our case received absolutely stellar attention. They were very professional in their correspondence, and in communicating things to us along the way.

And to re-iterate, the problem wasn’t really Amazon EC2 or EBS; it was isolated to our case, due to the nature of the attack. All the UDP traffic was conveniently spoofed, so we can’t tell where it originated.

Posted in bitbucket, status.

On the unplanned downtime Friday night

Friday night, September 4th 2009, around 6:30 pm UTC, Bitbucket went down for about half an hour. It’s back up now. I’ll explain the outage and some of the measures we’re going to take to make sure this doesn’t happen again.

The first suspect was high load, which it turned out not to be. On the contrary, load was lower than it had been for weeks, if not months. This was a pretty strong indicator that no one was accessing the site. People were reporting “504 Gateway Timeout” errors to us, meaning that they were in fact reaching our first tier of load balancing, run by nginx.

So what’s wrong? A quick “ps aux” revealed that all services were running, but still something wasn’t quite right. Restart nginx. Nope. Is apache2 running? Yep. Restart apache2. Everything’s back to normal.

At least it was an easy fix. I have 74 MB of error logs to trawl through, trying to figure out why apache2 decided to drop its cookies and not tell anyone.

Now, this has never happened before. We’ve had a few unexpected outages, but they’ve always been caused by flukes or other oddities, and a single time, a hardcore kernel crash.

We’re of course interested in this not happening again, especially since it was such an easy fix to automate. We’ll be deploying periodic probes to check for apache2 responsiveness, and if it doesn’t answer, the probe will restart it. I don’t know which software yet, but monit is a good candidate.

I’m very sorry for any inconvenience this caused you (according to Twitter, quite a few), and know that we’re doing what we can to prevent this from happening again.

EDIT:

Our apache is set up to recycle the workers after N requests, and it seems what happened is that a loaded egg had been updated, and apache2 refused to pick it up. I found several “bad local file header” errors in the log, which made Django throw an ImproperlyConfigured exception, which is basically a death sentence. This explains why the workers were plucked one by one, which finally resulted in the site’s demise.

Again, the probe-check would’ve fixed this, and I think we’ve settled on monit.
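
To make the probe idea concrete, here is a minimal sketch in Python. It is not monit configuration and not what we actually deployed; the function names, the retry count, and the URL and restart command mentioned afterwards are all assumptions made for illustration.

```python
import http.client
import urllib.request


def is_responsive(url, timeout=5.0):
    """Return True if the URL answers with a successful HTTP status.

    urllib raises HTTPError (a subclass of OSError) for 4xx/5xx answers,
    so a "504 Gateway Timeout" counts as unresponsive here.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        return False


def probe_and_restart(check, restart, attempts=2):
    """Call check() up to `attempts` times; if every call fails, call
    restart() once. Returns True if a restart was triggered."""
    for _ in range(attempts):
        if check():
            return False
    restart()
    return True
```

A cron job could then call probe_and_restart with a check such as `lambda: is_responsive("http://127.0.0.1:8080/")` and a restart callable that runs the init script for apache2. Both of those are placeholders; the real address and restart command depend on the setup. Retrying before restarting avoids bouncing the server over a single slow answer.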

Posted in bitbucket.

About last night

Last night, May 26th, the site was down for about 3 hours. This happened after midnight, European time, which means most of us were sleeping.

Our administrator, Mads Jørgensen, was attentive though, and had to endure long hours with coffee and disk resizing.

What happened: Late last night, we started getting errors saying “no space left on device.” True enough, the space had almost run out (well, the system reported 6 GB free, but OK), and what we did to remedy this was to truncate a large logfile, and restart the services. Everything then went humming along, and we figured no one was going to push 8 GB overnight, and we’d deal with it early this morning.

Not so.

Just after I had gone to bed, allegedly things began acting up once more. Mads jumped to action, and replaced the front page with status updates. If you want to read just exactly what went wrong, and how he remedied it, you can read it on his blog:

http://swag.dk/bitbucket/downtime_27052009.html

Again, Mads, you saved the day, and we wouldn’t want to continue without you!

Posted in bitbucket.

Collapsed-mode and a new header

We just deployed a minor design-change, specifically a new header. The main goal with the new header is to give a cleaner look, and save a few pixels vertically.

We also added a new “collapsed-mode” toggle at the bottom of the repository infobox:

[Screenshot: the collapsed-mode toggle at the bottom of the repository infobox]

If you click the arrow, you’ll hide a lot of the content in the infobox, stuff you’re probably not interested in seeing all the time anyway. Click the same arrow again to toggle back on.

Feedback is as always appreciated!

Posted in bitbucket, new stuff.

New feature: Downloads

Ask, and you shall receive.

We’ve just rolled out a new feature on the site, Downloads. This lets you upload any file you want to your repository, be it public or private. The files are uploaded directly to Amazon S3, so our servers won’t even break a sweat.

After the upload is complete, your file will be accessible via the CloudFront content delivery network (CDN), which allows for extremely fast downloads, all over the globe. If a repository is private, the URL for downloading a file will look a bit different (it will include an authentication token that is good for 24 hours), but that also means that your files are not accessible to anyone else, even if they have the full URL. Public files will be served from cdn.bitbucket.org/username/reponame/downloads/.
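
For the curious, a time-limited download token of this general kind can be sketched with an HMAC over the path and an expiry time. This is a generic illustration only, not how Bitbucket, S3 or CloudFront actually sign URLs; the function names, the query parameter names and the secret are all hypothetical.

```python
import hashlib
import hmac
import time

DAY_SECONDS = 24 * 60 * 60  # the "good for 24 hours" lifetime


def make_token(path, expires, secret):
    """Sign the path together with its expiry timestamp."""
    message = f"{path}:{expires}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_download_path(path, secret, now=None, ttl_seconds=DAY_SECONDS):
    """Return the path with hypothetical expires/token query parameters."""
    issued = int(time.time() if now is None else now)
    expires = issued + ttl_seconds
    return f"{path}?expires={expires}&token={make_token(path, expires, secret)}"


def verify_download(path, expires, token, secret, now=None):
    """Accept only an unexpired token that matches this exact path."""
    current = int(time.time() if now is None else now)
    if current > expires:
        return False
    expected = make_token(path, expires, secret)
    return hmac.compare_digest(expected, token)
```

Because the expiry is part of the signed message, nobody can extend a link by editing the expires value, and a token issued for one file does not work for another.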

Files uploaded will count against your quota, but Mercurial repositories being notoriously small, you should have plenty of space, even on the free plan.

Some of our users have already begun uploading, as can be seen here and here.

On another note, tags are now ordered by date instead of alphabetically.

A big thanks to Vetle Roeim for spending some time helping us out with this feature.

Posted in bitbucket, new stuff.

SPF record

Some of you may have experienced mail showing up late, especially to @gmail.com addresses and Google groups.

This was caused by a missing SPF record for our mail server. One has been set up now, and those delays should be gone.

Read more about SPF (Sender Policy Framework) here.

Posted in bitbucket.

Announcing Git support

Well, it was inevitable. With the ever-growing popularity of Git, we figured we had to support it.

For several months now, we’ve been scratching our scalps, spending endless nights trying to wrap our heads around the inner workings of Git. But in the end, it paid off, and we’re proud to announce:

We now support Git!

It took a while, but we feel confident we can provide an outstanding service to the community, and the barrier between hg and git is no more.

Some of our first git users are already emerging, go check them out.

Enjoy!

EDIT: This was an April Fools’ joke. We don’t support Git now, nor will we any time soon.

Posted in new stuff.

Wiki additions

Just this weekend we rolled out some exciting new changes, wiki-wise. 3 brand new things, actually!

1. /help/ is now a wiki

Yep. And it’s editable by anyone, so feel free to contribute! Anything you think might benefit others will do.

2. Wikis are globally editable

That means you no longer have to be a writer of a repository to be able to edit its wiki. All that is required is that you’re logged into Bitbucket.

Sounds kinda scary, but we’re trying it and we’ll see how it goes. Since you need an account, if anyone messes up your wiki, you can always see who it was.

3. The new “Table Of Contents” macro

This is something a lot of people have asked for, and we think we’ve delivered! The new macro allows you to generate a TOC for your page, but it goes way beyond that.

It can also generate TOCs for other files, or even a directory of files. On top of that, it takes an optional argument indicating how many header-levels it should include.

So <<toc>> works, but so does <<toc FAQ/>>, <<toc OtherPage/MaybeInSubDir>> and <<toc FAQ/ 2>>!

This is how we construct the FAQ section on http://bitbucket.org/help/.

The wiki there is itself a repository, so in case you want an offline copy, or your own fork, go for it.
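
To illustrate the macro forms listed above, here is a small sketch of how such calls could be picked out of wiki text. It is not the parser Bitbucket uses; the TocCall type, the parse_toc_macros function and the regular expression are assumptions based only on the four examples in this post.

```python
import re
from typing import NamedTuple, Optional


class TocCall(NamedTuple):
    target: Optional[str]  # page or directory, e.g. "FAQ/"; None for the current page
    depth: Optional[int]   # number of header levels to include; None if not given


# "<<toc", an optional target, an optional numeric depth, then ">>".
_TOC_PATTERN = re.compile(r"<<toc(?:\s+([^\s>]+))?(?:\s+(\d+))?\s*>>")


def parse_toc_macros(text):
    """Return every toc macro call found in text, in order of appearance."""
    calls = []
    for match in _TOC_PATTERN.finditer(text):
        target, depth = match.group(1), match.group(2)
        calls.append(TocCall(target, int(depth) if depth is not None else None))
    return calls
```

One caveat of this sketch: a call like <<toc 2>> is read as a target named “2”, since the examples above do not say how a depth without a target should be written.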

Posted in new stuff, tips & tricks, wiki.

Issue editing

This has been requested by many users, and it’s finally live: issue editing.

Edit the title of an issue

[Screenshot: Title description editable]

As you can see underlined in the screenshot above, when you hover over the title, a pencil-icon appears to indicate that the title is editable. Clicking the title allows you to make your changes:

[Screenshot: Title editable]

Hit enter to save, and you’re done.

Edit the description of an issue

Also in the first screenshot, you can see an Edit-link underlined in the bottom right. Clicking this link lets you edit the description of the issue.

If you don’t see any of these Edit-links, that’s probably OK; you just don’t have access to edit all issues.

Edit an issue comment

[Screenshot: Comment editable]

After submitting a comment to an issue, you can edit it for 15 minutes. As you can see from the screenshot above, the time you have left to edit the comment is shown and continuously updated.

That’s it, simple as that. Any questions or comments are more than welcome!

Feel free to join us on IRC, write a message to bitbucket-users on Google Groups, or report a bug or suggestion in our issue tracker.

Posted in bitbucket, new stuff.