Another EC2+EBS incident: What happened

Today we held another one of our infamous downtime parties on IRC, courtesy of some unforeseen downtime in our ever-improving infrastructure.

Along with our growth, we’ve hit just about every snag and bottleneck known to man^Hsysadmins, and we’ve done our best to keep up. We’ve recently introduced sharding to our architecture, which is working very well. More importantly, we’ve moved all of our drives over to RAID0 arrays of EBS volumes to gain some throughput. This has also given us quite a nice improvement.
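For context, a RAID0 stripe over EBS volumes is typically assembled with plain mdadm on the instance. A minimal sketch of that kind of setup; the device names, filesystem and mount point here are illustrative, not our exact configuration:

    # Stripe eight attached EBS volumes into a single RAID0 device.
    # The device names are placeholders for whatever the volumes attach as.
    mdadm --create /dev/md0 --level=0 --raid-devices=8 \
        /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp

    # Put a filesystem on the array and mount it (xfs and /data are assumptions).
    mkfs.xfs /dev/md0
    mount /dev/md0 /data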

That is, until one of the eight drives in an array decides to get the hiccups and stops putting any data through.

That’s what happened today. The load on one of our application servers went through the roof (200+ in less than 2 minutes), IO was queueing up, and nothing was responding. We quickly ran ‘iostat’ and saw that one device (specifically /dev/sdi) was at 249% utilization (we didn’t know that was possible) and that its queue kept growing.
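For reference, this is roughly the kind of check that surfaced it; the flags are the usual ones, not a transcript of the exact session:

    # Extended per-device statistics every 2 seconds; watch %util and avgqu-sz.
    # On the bad day, /dev/sdi sat pegged far past 100% util with a growing queue.
    iostat -x 2

    # Load average, confirming the pile-up.
    uptime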

From previous experience, this seems to indicate either a) an underlying hardware failure on the virtual block device (EBS), or b) network trouble, neither of which you can do anything about.

We immediately opened a case with Amazon (after shelling out for the “1 hour support” premium-gold-amazing support package they offer), and got them on the phone pretty quickly. They couldn’t really tell us what was up, and the best they could do was forward the case to the EBS team. They couldn’t tell me when we could expect to hear back, let alone have the issue fixed, nor could they tell me how long these things usually take.

Oh well. Drinks aren’t serving themselves at the downtime party.

~30 minutes later, I requested an update from Amazon by phone and asked what they’d recommend we do. Our best bet, apparently, was a reboot of the faulty instance. I don’t know what kind of policy they have for the support team, but my “wish me luck then, I guess” was met with awkward silence.

The reboot didn’t help at first. In fact, the entire instance became completely unreachable. In CloudWatch (their paid monitoring) we could see 0% CPU utilization and 0% network, but curiously, high disk writes. For lack of a better explanation, we decided this was a swap file being zeroed out. Alas, the instance remained unreachable for another 20 minutes(!) or so. I was writing up our findings in the open case with Amazon when one of my many speculative SSH attempts finally connected, and I was in.
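As an aside, the same CloudWatch numbers can also be pulled from the command line. A rough sketch using the aws CLI; the instance ID and time window below are made up:

    # Average CPU utilization for the instance over the incident window.
    aws cloudwatch get-metric-statistics \
        --namespace AWS/EC2 \
        --metric-name CPUUtilization \
        --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
        --start-time 2012-01-01T12:00:00Z \
        --end-time 2012-01-01T13:00:00Z \
        --period 300 \
        --statistics Average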

After a quick check, everything seemed to be intact. I reassembled the RAID array, started our services, and opened the floodgates. Things are looking fine now.
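The recovery itself was nothing exotic; roughly the following, where the array device, mount point and service name are placeholders rather than our exact commands:

    # Reassemble the existing RAID0 array from its member devices and remount it.
    mdadm --assemble --scan
    mount /dev/md0 /data

    # Bring back whatever services live on the array.
    /etc/init.d/mysql start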

We will be actively looking into moving elsewhere, but such a migration is no small undertaking. Still, something needs to happen.

If anyone has had similar experiences with EC2/EBS, please feel free to share your knowledge.