On our extended downtime, Amazon and what’s coming

As many of you are well aware, we’ve been experiencing some serious downtime the past couple of days. Starting Friday evening, our network storage became virtually unavailable to us, and the site crawled to a halt.

We’re hosting everything on Amazon EC2, a.k.a. “the cloud”, and we’re also using their EBS service to store everything: our database, logfiles, and user data (repositories).

Amazon EBS is a persistent storage solution for EC2, where you get high-speed (and free) connectivity from your instances, while it’s also replicated. That gives you a lot for free, since you don’t have to worry about hardware failure, and you can create periodic “snapshots” of your volumes easily.

While we were down, it was unknown to us what exactly the problem was, but it was almost certainly a problem with the EBS store. We’ve been working closely with Amazon the past 24 hours resolving the issue, and this post will outline what exactly went wrong, and what was done to remedy the problem.

Symptoms

What we were seeing on the server was high load, even after turning off anything that took up CPU. Load is a result of stuff “waiting to happen”, and after reviewing iostat, it became apparent that the “iowait” was very high, while the “tps” (transactions per second) was very low for our EBS volume. We tried several things at this point:

  • Un-mounting and re-mounting the volume.
  • Running xfs_check on the volume, which reported no errors (we use XFS.)
  • Moving our instances and volumes from us-east-1b to both us-east-1a and us-east-1c.

None of these resolved the problem, and it was at this point we decided to upgrade to the “Gold plan” of support to gain access to the 1-hour turnaround technical support with Amazon.
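For readers who want to reproduce the symptom check described above (high iowait combined with a very low tps), here is a minimal sketch that derives both numbers from two samples of Linux’s /proc/stat and /proc/diskstats instead of from iostat itself. The function names and the thresholds are assumptions made up for this illustration; nothing here is taken from the tooling actually used during the incident.

```python
"""Sketch: spot the 'high iowait, low tps' pattern from raw Linux counters."""


def iowait_percent(cpu_before: str, cpu_after: str) -> float:
    """Percent of CPU time spent in iowait between two 'cpu' lines of /proc/stat."""
    # Field order after the 'cpu' label: user nice system idle iowait irq softirq steal
    before = [int(v) for v in cpu_before.split()[1:9]]
    after = [int(v) for v in cpu_after.split()[1:9]]
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


def transfers_per_second(disk_before: str, disk_after: str, interval_s: float) -> float:
    """Completed reads + writes per second between two /proc/diskstats lines."""
    before = disk_before.split()
    after = disk_after.split()
    # Index 3 = reads completed, index 7 = writes completed.
    done_before = int(before[3]) + int(before[7])
    done_after = int(after[3]) + int(after[7])
    return (done_after - done_before) / interval_s


def looks_io_starved(iowait_pct: float, tps: float,
                     iowait_min: float = 50.0, tps_max: float = 10.0) -> bool:
    """Hypothetical thresholds: lots of waiting, very little I/O completing."""
    return iowait_pct >= iowait_min and tps <= tps_max
```

Taking the two samples a few seconds apart and feeding them to these functions gives roughly the same picture iostat paints. A volume that is merely busy tends to show high iowait together with a high tps; the worrying combination is high iowait while almost nothing completes.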

The Support (0 hours after reporting it)

We filed an “urgent” ticket with Amazon’s support system, and within 5 minutes we had them on the phone. I spoke to the person there, describing our issue and repeatedly insisting that everything pointed to a network problem between the instance and the store.

What came of that was 5 or 6 hours of advice, some of which was an obvious timesink, while some was somewhat credible. What they kept coming back to was that EBS is a “shared network resource” and performance would vary. We were also told to use RAID0 to distribute our load over several EBS volumes to increase the throughput.

At this point, we were getting less throughput than you can pull off of a 1.44MB floppy, so we didn’t accept this for an answer. We did some more tests, trying to measure the bandwidth of the machine by fetching their “100mb.bin” files, which we couldn’t do. We again emphasized that this was in fact, in all likelihood, a network problem.

At this point, our outage was well known, especially in the Twittosphere. We have some rather large customers relying on service with us, and some of these customers have some hefty support contracts with Amazon. Emails were sent.

Shortly after this, I requested an additional phone-call from Amazon, this time to our system administrator. He had been compiling some rather worrying numbers over the past hours, since up until now, the support had refused to acknowledge a problem with the service. They claimed that everything was working fine, when clearly, it was not.

This time, a different support rep. called, and this time, they were ready to acknowledge our problem as “very serious.” We sent them our aggregated logs, and shortly thereafter, they reported that “we have found something out of the ordinary with your volume.”

We had been extremely frustrated up until this point, because 1) we couldn’t actually *do* anything about it, and 2) we were being told that everything should be fine. It felt like there was an elephant right in front of us, and a person next to us was insisting that there wasn’t.

Anyway (8 hours after reporting it)

From here on, we had been graced with the acknowledgement we had been waiting for: There was a problem, and it wasn’t us. We had been thinking that, you know, *maybe* we had screwed up somewhere and this was our fault. We didn’t find anything.

So, back to waiting.

What exactly triggered what happened after this, I’m not sure.

The Big Dogs (11 hours after reporting it)

I received an unsolicited phone call from some higher-up at Amazon. He wanted to tell me what was going on, which was much appreciated.

He wanted to re-assure me that we were now their top priority, and he had brought in a whole team of specialized engineers to look at our case. That’s nice.

I received periodic updates, and frequent things for us to try. We sent them the logs they asked for, and complied with their wishes.

From this point on, we were treated like they owed us money, which is quite the difference from basically being called a liar earlier on.

Closing in (15 hours after reporting it)

OK, so we are finally getting somewhere. We all agreed that there was a serious networking problem between our EC2 instances and our EBS. This is around the time Amazon called me and asked me to try and put the application back online. So I did. And all was well.

I kindly asked the manager I had on the phone to please explain to me what the problem had been. He said he wasn’t really sure, and that he would set up a telephone conference with his team of engineers.

I dial in, and they start explaining what the problem is.

Now, I have been specifically advised not to say what the problem was, but I believe we owe it to our customers to explain what went wrong. Also, we owe it to Amazon to clear it up, since they were looking pretty bad due to this. I’ve already mentioned the cause briefly on our earlier status page, as well as on IRC, but let me re-iterate.

We were attacked. Bigtime. We had a massive flood of UDP packets coming in to our IP, basically eating away all bandwidth to the box. This explains why we couldn’t read with any sort of acceptable speed from our EBS, as that is done over the network. So, basically a massive-scale DDOS. That’s nice.

This is 16-17 hours after we reported the problem, which frankly, is a bit disheartening. Why did it take so long to discover? Oh well.

Amazon blocked the UDP traffic a couple of levels above us, and everything went back to normal. We monitored the services for a while longer, and after deciding that everything was holding up fine, we went to bed (it was 4am.)
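A flood like the one described above can sometimes be spotted from inbound interface counters. The sketch below computes per-interface receive rates from two samples of Linux’s /proc/net/dev. The function names are hypothetical, and there is an important caveat: according to the follow-up in the comments, the traffic apparently never reached the instance in this case, so a check like this would only help in setups where the flood actually arrives at the interface being measured.

```python
"""Sketch: inbound traffic rate per interface from two /proc/net/dev samples."""


def rx_counters(proc_net_dev: str) -> dict:
    """Map interface name -> (rx_bytes, rx_packets) from /proc/net/dev text."""
    counters = {}
    for line in proc_net_dev.splitlines()[2:]:  # first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # The first two receive columns are bytes and packets.
        counters[name.strip()] = (int(fields[0]), int(fields[1]))
    return counters


def rx_rates(before: str, after: str, interval_s: float) -> dict:
    """Map interface name -> (bytes/s, packets/s) received over the interval."""
    old, new = rx_counters(before), rx_counters(after)
    rates = {}
    for name, (new_bytes, new_packets) in new.items():
        if name in old:
            old_bytes, old_packets = old[name]
            rates[name] = ((new_bytes - old_bytes) / interval_s,
                           (new_packets - old_packets) / interval_s)
    return rates
```

A sudden jump in packets per second without a matching rise in legitimate requests is the kind of signal worth alerting on. What counts as “sudden” depends entirely on the site’s normal traffic, so no threshold is suggested here.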

This morning

So, when we got up again this morning, things weren’t looking good, again. We were having the exact same symptoms as previously, and before our morning coffee, we re-opened our urgent ticket with Amazon. 2 minutes later I had them on the phone.

I explained that the problem was back, and they assured me the team of engineers working on this yesterday would be re-gathered and have a look. Cool.

About… 2 hours later, the problem was again resolved. It seems the attackers figured that we were now invulnerable to a UDP flood, so they instead initiated something like a TCP SYN flood. Amazon employed new techniques for filtering our traffic and everything is fine again now.
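One common host-side hint of a SYN flood is an unusually large number of half-open connections. As a rough sketch, the function below counts sockets in the SYN_RECV state from the contents of Linux’s /proc/net/tcp. The function name is hypothetical, and the same caveat applies as for any host-side check: it only sees traffic that reaches the machine, and features such as SYN cookies can change what shows up in this table.

```python
"""Sketch: count half-open (SYN_RECV) connections in /proc/net/tcp text."""

SYN_RECV = "03"  # hex state code the Linux kernel prints for TCP_SYN_RECV


def count_syn_recv(proc_net_tcp: str) -> int:
    """Number of sockets in SYN_RECV state in the given /proc/net/tcp contents."""
    count = 0
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header line
        fields = line.split()
        # Columns: sl, local_address, rem_address, st, ...
        if len(fields) > 3 and fields[3] == SYN_RECV:
            count += 1
    return count
```

Sampling this count periodically and comparing it against the usual baseline gives a crude early warning; a handful of SYN_RECV entries is normal on any busy server.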

What’s next

Amazon contacted us again after this was over, and told us they wanted to work with us in the coming days to make sure this doesn’t happen again. They have some ideas on how both they and we can improve things in general.

Are we going to do that? Maybe. We’re seriously considering moving to a different setup now. Not because Amazon isn’t providing us with decent service, which they are, most of the time. While we were down, several large hosting companies contacted us directly, pitching their solutions. I won’t mention names, but some of the offerings are quite tempting, and have several advantages over what we get with Amazon.

One thing’s for sure, we’re investing a lot of man-hours into making sure this won’t happen again. If this means moving to a different host, so be it. We haven’t decided yet.

In conclusion

Let me round this post off by saying that Amazon doesn’t entirely deserve the criticism it has received over this outage. I do think they could’ve taken precautions to at least be warned if one of their routers started pumping through millions of bogus UDP packets to one IP, and I also think that 16+ hours is too long to discover the root of the problem.

After a bit of stalling with their first rep., our case received absolutely stellar attention. They were very professional in their correspondence, and in communicating things to us along the way.

And to re-iterate, the problem wasn’t really Amazon EC2 or EBS, it was isolated to our case, due to the nature of the attack. All the UDP traffic was conveniently spoofed, so we can’t tell where it originated.

Posted in bitbucket, status.

57 Responses

  1. Your issue got me involved in a discussion on Twitter last night and inspired this blog post: http://www.bretpiatt.com/blog/2009/10/03/availability-is-a-fundamental-design-concept/

    I’m sorry to hear it was a spoofed DDoS, all ISPs should be running uRPF ( http://en.wikipedia.org/wiki/Reverse_path_forwarding ) at the edge so people can’t spoof. Once the traffic is at the target provider, in this case AWS, or in the middle of the Internet backbone you can’t tell if it is spoofed or not anymore. It won’t stop DDoS but it’ll allow for identification of all of the zombie machines participating which can help clean up the “disease” in the future each time an attack happens.

  2. kevun said

    MRTG?

  3. Me said

    Sounds more like a problem on your site.

    Maybe you should have checked your in/out traffic before complaining that the volumes weren’t working…

  4. Interesting that you had this experience – I had a small version today.

    I’m a much smaller customer of a standard hosting service for my small personal website. Uh, teeny, really. I recently moved my personal domain to HostGator. At around 11am PDT today, my site went down, and the DNS servers pointing to my site went down. I fooled with a few things, then called HostGator (where I spend about $5/month) and waited 6 minutes in their call queue. Right when they picked up, I tried again – I was up, I said “thanks” and hung up.

    I had already sent an email support request though – they responded about 10 minutes later saying an IP near me (but not me) was under attack (likely the same virtual host, but the mail wasn’t crystal clear), that there were about 30-odd hosts attacking, which was within their standard mitigation potential, but he’d keep an eye on it during his shift.

    I wasn’t terribly impressed by the fact that I had any downtime, or 10 whole minutes, but I was cheered by the transparency and effectiveness of the HG support staff. But hearing that your attack resulted in 15 hours of downtime, I guess I can feel pretty happy.

    Disclaimer: I 100% understand that your situation is *far* more complex, with more moving parts, which can lead to finger-pointing… but still, attacks are a way of life now, and anything with an external address needs monitoring for quick resolution under attack load.

  5. saf said

    Why in the hell didn’t YOU discover this?

  6. So basically this all boils down to the fact that neither you nor Amazon was doing decent reporting on network (and maybe other critical operating system) metrics?

    Or am I missing something obvious?

  7. Thank you Jesper and the team for working so hard on this and for being so open about the issue.

  8. So the UDP/TCP traffic came from the internet? It’s disconcerting that an attack coming from an external source could affect your access to internal network resources.

  9. Did you not look at your network traffic graphs and notice a huge influx of inbound traffic? It boggles my mind that this wasn’t one of the first troubleshooting steps taken.

    You guys are running something like Munin, right?

  10. Roger said

    It isn’t clear to me if you have separate network interfaces for “internal” (other instances, EBS) vs “external” (the great unwashed internet) traffic. That traffic should never be mingled together, and if spoofing could happen what is to stop bad guys on the outside spoofing traffic to make it look like stuff on the inside?

  11. Ken Baker said

    There is no need to employ any engineers to try to work out what is happening. The simple and cost effective answer to your problem is to employ an IntelliGuard DPS DDoS defence which will block any attack immediately, maintain your business on line for all our customers, and provide extensive reporting on the traffic received and filtered.

  12. alex said

    Ever consider that maybe one of the hosting companies that called you was the source of the attack? It would be pretty outrageous marketing behaviour but it certainly proves a point.

  13. SysNetEngineer said

    I think it’s time for you to fire your sysadmin and replace him with someone who can dual-function as a network engineer/admin. This could have been resolved in an hour or less as problems like this are extremely easy to detect and fix. The “fix” part is the longest part of the ordeal because if the DDoS is truly large-scale, you will need the assistance of the upstream ISP.

  14. Sounds like you only got good service from Amazon after they realized you were an “important” customer (vs. everyone else). I say that because of the timeline: You got crap service in the beginning. Then some emails were sent from “important” people outside of Amazon to “important” people inside of Amazon. Then you got the royal treatment. That is disturbing (but, sadly, not surprising).

  15. Just a little question that comes to mind about this.

    How can traffic coming from the public network interfaces (the attack) interfere with the private network interfaces which, I assume, handle the EBS volume?

  16. Jonathan Wight said

    All the commenters replying “why didn’t you just ______?” amuse me. Being an armchair network expert is so much easier in hindsight.

  17. Sorry to hear about your DDoS woes. What is frustrating, I guess, is that cloud service providers essentially have no security mechanisms in place to immediately address a massive DDoS attack.

  18. @teo
    > How can traffic coming from
    > the public network interfaces…

    There is no such thing as “external” interface on Amazon. Internal is the only interface instances have, then internal IPs are DNAT’ed into “external” ones somewhere up the stream.

  19. Something from my personal experience: an unrelated, but nevertheless severe network problem was resolved by the linode.com team in FIFTEEN minutes. I had my support ticket answered with “we’re on it, hold on” within.. I bet that was FIVE minutes, no more than that.

  20. Just a quick follow-up: Our sysadmins are writing a follow-up post outlining some more meaty details with numbers and graphs.

    There are some questions posted in comments here that I won’t answer now, but they will be answered in the follow-up.

    In short: We couldn’t see anything on the servers as the traffic never reached it. It was somehow caught in the black box in front of us, and bogged down the resources we needed to speak to EBS.

    Amazon eventually asked us to turn on their (paid) CloudWatch service, and in there, we did observe high peaks, which only made us re-insist that this was, in fact, a network problem.

  21. Jon Watte said

    “How can traffic coming from the public network interfaces (the attack) interfere with the private network interfaces which, I assume, handle the EBS volume?”

    It’s virtual images, with virtual network interfaces. Even if you have two virtual network interfaces on your image, you have no control over what physical interface/es are used on the physical host. You don’t even know whether the physical host has more than one interface.

    In fact, depending on exactly how the UDP traffic was shaped, it may have been pretty hard to pick up on the flood at the virtual interface level, despite what all the snide remarks seem to suggest.

  22. Damian Menscher said

    How large were the attacks? And how big are your pipes?

  23. Right, echoing Jeremy’s comment, so the problem here was that you weren’t monitoring your systems? If you are looking for a hosting provider that will manage your servers for you, Amazon isn’t the solution, but you should’ve known that going in.

  24. JAB said

    we also host with AWS and frankly this kind of service is becoming all too familiar… here’s what happens when something happens with your instance in AWS:

    1) Instance Failure. Your instance becomes unresponsive and you can’t ssh into it.
    2) Go to AWS Forum. You find that there’s an increase in the number of complaints and a lot of “please??!!!” and moaning and nail biting and people post their instance IDs like the mothers holding up their babies so they can get on the last train out of Warsaw.
    3) You Wait. For several hours, as other complaints dribble in.. no response from anybody in AWS.. service dashboard remains “normal”. Until…
    4) AWS Trolls are unleashed. This is when you get something like: that’s because some users are “new” and “don’t understand”.. anything but accepting that there really might be something wrong with AWS service… and then they suggest you should “upgrade” to AWS premium service (which bitbucket did).

    So think about this.. AWS will actually stand to earn more if they dont address problems quickly!

  25. Taking the truthful, and maybe harsh, path, rather than the polite one, I must say that from the description of the case in the post it sounds like you have no one at your company that has experience with hosting servers, on the cloud or off it. I’ll be more than happy, exhilarated even, to be corrected on this, by some more, hard core, technical data regarding the case, as promised.
    The ongoing hype talk about yes cloud, no cloud, trust, maybe not, maybe yes, is not really promoting anything anymore. Detailed study cases are what the cloud industry needs now to grow and blossom. Just a bunch of geeky sysadmins talking shop, that’s where the cloud future is at.

  26. That is a very long time to find the root of the problem. It must have been a real misery for you guys.

Continuing the Discussion

  1. Availability is a fundamental design concept | Bret Piatt linked to this post on October 4, 2009

    [...] page is back to normal now, no longer the explanation since the problem is fixed). [UPDATE: Adding BitBucket blog post on the [...]

  2. Tweets that mention On our extended downtime, Amazon and what’s coming – Bitbucket -- Topsy.com linked to this post on October 4, 2009

    [...] This post was mentioned on Twitter by Steve Losh. Steve Losh said: RT @jespern: Blog post detailing the #bitbucket outage: http://bit.ly/80Rha [...]

  3. Twitted by gianluca_r linked to this post on October 4, 2009

    [...] This post was Twitted by gianluca_r [...]

  4. On our extended downtime, Amazon and what’s coming – Bitbucket « Netcrema – creme de la social news via digg + delicious + stumpleupon + reddit linked to this post on October 5, 2009

    [...] On our extended downtime, Amazon and what’s coming – Bitbucketblog.bitbucket.org [...]

  5. Closer To The Ideal » Blog Archive » BitBucket gets hacked and, since they rely on Amazon Web Services, they were dependent on Amazon to fix the problem linked to this post on October 5, 2009

    [...] BitBucket got hit with a denial of service attack. But they host everything on Amazon Web services (EC2, EBS, etc). So they were dependent on Amazon to figure out what the problem was. And Amazon took 16 long hours to figure out what the problem was. And in the mean time, Amazon kept telling BitBucket that everything was fine: What came from that, was 5 or 6 hours of advice, some of which were obvious timesinks, while others were somewhat credible. What they kept coming back to was that EBS is a “shared network resource” and performance would vary. We were also told to use RAID0 to distribute our load over several EBS instances to increase the throughput. [...]

  6. BitBucket.org hit by massive DDoS « DoS Attacks linked to this post on October 5, 2009

    [...] Read a detailed explanation offered by BitBucket on their blog. [...]

  7. DDoS attack rains down on Amazon cloud - Computer Forums linked to this post on October 5, 2009

    [...] [...]

  8. The perils of a single external provider « Shift Research linked to this post on October 6, 2009

    [...] extended reading, here is the blog post by Jesper with a blow-by-blow of the attacks, service level by Amazon, initial resolution, second wave, and [...]

  9. Amazons cloud-dienst EC2 getroffen door ddos-aanval | ISPam.nl linked to this post on October 6, 2009

    [...] a blog post by Jesper Nøhr of Bitbucket.org, Amazon’s network was attacked, as a result of which the service, from [...]

  10. eungju's me2DAY linked to this post on October 6, 2009

    EP’s thoughts…

    Is something like Bitbucket’s On our extended downtime, Amazon and what’s coming what people call accountability….

  11. Amazon Web Services Gets DDoS Attack and the Client Waits linked to this post on October 6, 2009

    [...] when their network storage became virtually unavailable. According to the detailed account on their blog, the site crawled to a [...]

  12. Tech News World » Amazon Web Services Gets DDoS Attack and the Client Waits linked to this post on October 6, 2009

    [...] when their network storage became virtually unavailable. According to the detailed account on their blog, the site crawled to a [...]

  13. Amazon Web Services Gets DDoS Attack and the Client Waits | Stoth linked to this post on October 6, 2009

    [...] when their network storage became virtually unavailable. According to the detailed account on their blog, the site crawled to a [...]

  14. TumbleTech » Amazon Web Services Gets DDoS Attack and the Client Waits linked to this post on October 6, 2009

    [...] when their network storage became virtually unavailable. According to the detailed account on their blog, the site crawled to a [...]

  15. Amazon Web Services Gets DDoS Attack and the Client Waits | UpOff.com linked to this post on October 6, 2009

    [...] when their network storage became virtually unavailable. According to the detailed account on their blog, the site crawled to a [...]

  16. Amazon Web Services Gets DDoS Attack and the Client Waits | GeekStream linked to this post on October 6, 2009

    [...] when their network storage became virtually unavailable. According to the detailed account on their blog, the site crawled to a [...]

  17. Amazon Web Services Gets DDoS Attack and the Client Waits | GroupHelp.NET - Easy everything! linked to this post on October 6, 2009

    [...] when their network storage became virtually unavailable. According to the detailed account on their blog, the site crawled to a [...]

  18. the hive » Amazon Web Services Gets DDoS Attack and the Client Waits linked to this post on October 6, 2009

    [...] when their network storage became virtually unavailable. According to the detailed account on their blog, the site crawled to a [...]

  19. Amazon Web Services Gets DDoS Attack and the Client Waits | Samachar Express linked to this post on October 6, 2009

    [...] when their network storage became virtually unavailable. According to the detailed account on their blog, the site crawled to a [...]

  20. RSSguru.com | ReadWriteWeb | Amazon Web Services Gets DDoS Attack and the Client Waits linked to this post on October 6, 2009

    [...] when their network storage became virtually unavailable. According to the detailed account on their blog, the site crawled to a [...]

  21. Amazon Web Services Gets DDoS Attack and the Client Waits | Techdare linked to this post on October 6, 2009

    [...] when their network storage became virtually unavailable. According to the detailed account on their blog, the site crawled to a [...]

  22. Amazon Web Services Gets DDoS Attack and the Client Waits | Techno Portal linked to this post on October 6, 2009

    [...] when their network storage became virtually unavailable. According to the detailed account on their blog, the site crawled to a [...]

  23. BitBucket attacked by DDoS at comp527 linked to this post on October 6, 2009

    [...] More can be read about the story in a BitBucket blog post found here. [...]

  24. Minuter senare - Amazons EC2 går ner | City|Network linked to this post on October 6, 2009

    [...] you can read about the company Bitbucket’s frustration both via its Twitter account and the blog, where Jesper Nøhr describes in more detail both what happened and the dialogue with Amazon sup… Evidently it took time for Amazon to agree that there actually was a problem with the service. [...]

  25. DDoS Attack Hits Amazon Cloud! | SecTechno linked to this post on October 6, 2009

    [...] Cloud EC2), we had previously posted on a several cases of DDoS attacks, Jesper posted on the company blog some details about the incident which is not [...]

  26. Amazon EC2 Customer Hammered by DDOS Attack linked to this post on October 7, 2009

    [...] To get the fine detail on this story, the Twitter feed of Bitbucket’s chief, Jesper Noehr makes for interesting reading, as does the more detailed blog posts of the outage. [...]

  27. egrep-cloud-cambrian-watch-2009-10-06 « すでにそこにある雲 linked to this post on October 9, 2009

    [...] On our extended downtime, Amazon and what’s coming – Bitbucket [...]

  28. troeger.eu » Blog Archive » When the cloud is gone linked to this post on October 12, 2009

    [...] Bitbucket Gone (while this was not directly Amazon’s fault) [...]
