Another EC2+EBS incident: What happened

April 20, 2010

Today we had another one of our infamous downtime parties on IRC, prompted by some unforeseen downtime in our ever-improving infrastructure.

Along with our growth, we’ve hit just about every snag and bottleneck known to man^Hsysadmins, and we’ve done our best to keep up. We’ve recently introduced sharding to our architecture, which is working very well. More importantly, we’ve moved all of our drives over to RAID0 arrays of EBS volumes to gain some throughput. This, too, has given us quite a nice improvement.

That is, until one of the 8 drives decides to have the hiccups and stops putting any data through.

That’s what happened today. The load on one of our application servers went through the roof (200+ in less than 2 minutes), I/O was queueing up, and nothing was responding. We quickly ran ‘iostat’ and saw that one device (specifically /dev/sdi) was at 249% utilization (we didn’t know that was possible) and that its queue kept growing.
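
For the curious, this is roughly the number ‘iostat -x’ derives for you, computed from /proc/diskstats (an illustrative sketch, not our actual monitoring; the interval and threshold are arbitrary):

    # Rough per-device utilization check, similar in spirit to what `iostat -x`
    # reports as %util. The 13th field of each /proc/diskstats line is the number
    # of milliseconds the device spent doing I/O, so comparing two samples over a
    # known interval gives an approximate utilization percentage.
    import time

    def io_ticks():
        """Map device name -> ms spent doing I/O, read from /proc/diskstats."""
        ticks = {}
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                ticks[fields[2]] = int(fields[12])  # io_ticks, in milliseconds
        return ticks

    INTERVAL = 5  # seconds between samples (arbitrary)
    before = io_ticks()
    time.sleep(INTERVAL)
    after = io_ticks()

    for dev, t in sorted(after.items()):
        util = (t - before.get(dev, t)) / (INTERVAL * 1000.0) * 100
        if util > 90:  # arbitrary "looks pegged" threshold
            print(f"{dev}: ~{util:.0f}% utilized")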

From previous experience, this seems to indicate either a) an underlying hardware failure behind the virtual block device (EBS), or b) network trouble. Neither is something you can do anything about yourself.

We immediately opened a case with Amazon (after shelling out for the “1 hour support” premium-gold-amazing support package they offer), and got them on the phone pretty quickly. They couldn’t really tell us what was up, and the best they could do was forward the case to the EBS team. They couldn’t tell me when we could expect to hear back, let alone have the issue fixed, nor could they tell me how long these things usually take.

Oh well. Drinks aren’t serving themselves at the downtime party.

~30 minutes later, I requested an update from Amazon by phone, and asked what they’d recommend we do. Our best bet would apparently be a reboot of the faulty instance. I don’t know what kind of policy they have for the support team, but my “wish me luck then, I guess” was met by awkward silence.

The reboot didn’t help at first. In fact, the entire instance became completely unreachable. In CloudWatch (their paid monitoring) we could see 0% CPU utilization and 0% network, but, curiously, high disk writes. For lack of a better explanation, we decided this was a swapfile being zeroed out. Alas, the instance remained unreachable for another 20 minutes(!) or so. I was writing up our findings in the open case with Amazon when one of my repeated SSH attempts finally connected, and I was in.

After a quick check, everything seemed intact. I reassembled the RAID array, started our services, and opened the floodgates. Things are looking fine now.

We will be actively looking into moving elsewhere. Such a migration is no small undertaking, but something needs to happen.

If anyone has had similar experiences with EC2/EBS, please feel free to share your knowledge.

  • http://blog.coredumped.org/ Chris Moyer

    I think you meant to say “CloudWatch”, not “CloudFront”. CloudFront is a Content Distribution Network, not a monitoring service.

    I’ve never been a fan of EBS, but it sounds like your resolution took you about an hour. Can you name any other company that responds and resolves that quickly?

    • http://conspyre.com/ Andy (zenom)

      I think the biggest problem is that:

      1. This isn’t the first time BB has been down due to amazon’s service.
      2. Most hosting companies could reboot a machine and be back up in less than an hour.
      3. Rebooting isn’t the best result. They should be helping diagnose the problem. It’s what happens after the downtime has ended that makes the difference.

      I don’t blame BB for wanting to switch services at all.

    • jespern

      Updated the typo, thanks.

      The thing is, Amazon didn’t actually *resolve* the issue within an hour. They were dumbfounded and recommended we simply reboot. That’s not resolving the issue. They’ve called me today (24+ hours later) to explain what happened, and it was again, a network issue between our instance and the EBS mounts.

  • schickb

    I don't know the details of your setup, but EBS volumes are only supposed to be somewhat more reliable than standard hard-drives. They are not guaranteed to be fault free. Creating a single 8 disk RAID0 volume for system critical data is nuts. You're almost certain to have problems since you are multiplying your failure rate by 8. This would be the same or worse with physical disks.

    I'd only consider using such a disk configuration on stateless nodes (like a web-front-end) or if you have near active backup/replication. In other words, if you have a traditional database on a volume like this you may want to consider replication to a failover system. Or change your design to go through a caching layer that does lazy writes and keep the DB on a more robust storage volume.

    If your DB update rate isn't too fast for replication to keep up, that is an easy solution. Of course you'll still have to live with (hopefully) small windows of potential data loss.
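
    To sketch the lazy-write idea (a toy example only; a real version would need durability, eviction, and error handling):

      # Toy write-behind cache: reads and writes hit memory immediately, and a
      # background thread flushes dirty keys to the slow backing store later.
      import queue
      import threading
      import time

      class WriteBehindCache:
          def __init__(self, backing_write, flush_interval=1.0):
              self.cache = {}
              self.dirty = queue.Queue()
              self.backing_write = backing_write
              self.flush_interval = flush_interval
              threading.Thread(target=self._flusher, daemon=True).start()

          def put(self, key, value):
              self.cache[key] = value   # fast path: memory only
              self.dirty.put(key)       # remember to persist it later

          def get(self, key):
              return self.cache.get(key)

          def _flusher(self):
              while True:
                  time.sleep(self.flush_interval)
                  while not self.dirty.empty():
                      key = self.dirty.get()
                      self.backing_write(key, self.cache[key])  # the lazy disk write

      cache = WriteBehindCache(lambda k, v: print("flushing", k, "to disk"))
      cache.put("some-key", "some-value")
      time.sleep(2)  # give the background flusher a chance to run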

  • vanbas

    Jesper, why do you use just RAID 0? There is no protection for your data. You will notice the annual failure rate of EBS is 0.1%–0.4% (http://aws.amazon.com/ebs/). Even if it's safer than a normal disk, it's not safe enough.

    If the workload of BB is write-dominated, you can consider sharding it more heavily to reduce the write requests per second on each shard, so that a single replication stream can catch up.
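
    As a rough back-of-the-envelope check of those numbers (assuming independent volume failures, which is optimistic):

      # Probability that at least one volume in an n-way RAID0 stripe fails in a
      # year, using the 0.1%-0.4% annual EBS failure rates cited above.
      def raid0_annual_failure_rate(per_volume_rate, volumes):
          return 1 - (1 - per_volume_rate) ** volumes

      for p in (0.001, 0.004):
          print(f"per-volume {p:.1%} -> 8-volume RAID0 {raid0_annual_failure_rate(p, 8):.2%}")
      # per-volume 0.1% -> 8-volume RAID0 0.80%
      # per-volume 0.4% -> 8-volume RAID0 3.16%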

  • http://buffered.io/ OJ

    I agree Jesper. It looks like it’s time to move on. Amazon have been the cause of a bit too much heartache for you of late.

    What other options are you currently looking into?

    • schickb

      If the design has an 8-disk RAID0 volume as a single point of failure, it doesn’t matter where they go. That is going to blow up no matter who provides the underlying disks.

      • jespern

        That’s the design we have to have with Amazon. Squeezing decent I/O out of EBS isn’t possible until you RAID up this way. Google it :)

        • schickb

          Oh I understand that. The key point is that this can’t be a single point of failure in your system. With this design you absolutely need a “close-to-live” replicated/synced failover volume.

          • jespern

            Absolutely. We have very recent snapshots of all data, at all times. If disaster strikes, we can recover pretty well. But yeah, it doesn’t sit well with me either. Maybe RAID10?

          • schickb

            I meant more like a hot spare, rather than a snapshot sitting on S3 that needs restoring. Basically you’d have 2 or more instances, each with their own RAID0 volumes. One is the master and all operations that happen on the master get sent to the slave. How that happens depends on your app and what the data is. DB replication, DRBD, or hg push, are all options. If the replication is async you’ll need some tolerance for data loss during failures.

            You then need some sort of failover solution and protection against “split-brain” problems. Linux tools like heartbeat, keepalived, or pacemaker can help with those. Keepalived is my current favorite due to its fairly simple configuration. Although to be honest, I have not attempted to use it within EC2.

            Perhaps RAID10 is a simpler solution :) Although that still won’t help when network problems prevent a single instance from talking to any of its EBS devices. Having distinct servers would probably solve that if they were in different zones.
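
            To make the failover part concrete, here is a deliberately naive watcher (everything in it is hypothetical: the addresses, the thresholds, and the promote step). It also shows why the dedicated tools matter, since it does nothing about split-brain on its own:

              # Naive warm-standby watcher: promote only after several consecutive
              # failed health checks. Real setups should use keepalived/heartbeat/
              # pacemaker, which also handle split-brain and virtual IP takeover.
              import socket
              import time

              MASTER = ("master.internal", 22)   # hypothetical primary address
              CHECK_INTERVAL = 2                 # seconds between checks
              FAILURES_BEFORE_PROMOTE = 5        # require sustained failure

              def master_alive(addr, timeout=1.0):
                  try:
                      with socket.create_connection(addr, timeout=timeout):
                          return True
                  except OSError:
                      return False

              def promote_standby():
                  # Placeholder: mount the replicated volume read-write, start
                  # services, repoint DNS or a virtual IP at this box.
                  print("promoting standby to master")

              failures = 0
              while True:
                  failures = 0 if master_alive(MASTER) else failures + 1
                  if failures >= FAILURES_BEFORE_PROMOTE:
                      promote_standby()
                      break
                  time.sleep(CHECK_INTERVAL)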

          • jespern

            Thanks for the detailed answer :)

            The issue is that we’re maxing out our pipe (GBit interface) as it is now, and if we add DRBD/hot spare/mirroring, we’re cutting our write throughput in half.

            In our case, it was a single EBS device that was unreachable, not the whole array.

          • schickb

            I see. Sounds like RAID10 might be a stopgap measure with your current setup.

            Beyond that, perhaps you’d need to consider a design change that splits your load across many servers (and many RAID volumes) so that the throughput to any single server is much smaller. Or something even fancier, like an in-memory caching layer that allows you to do lazy disk writes.

          • schickb

            One thing I can say for sure is that if you grow wildly, you will eventually hit these problems no matter where the application is hosted. Eventually even directly attached Ultra-640 SCSI devices aren’t going to be able to keep up on a single machine. At some point you’ll probably need a true scale-out design… which can be difficult.

            With EC2 I’d say you are just going to hit that point sooner since you have limited ability to scale-up the hardware. :)

          • jespern

            We do have sharding on the application level (we can redirect any repository/user to any appserver), so we can scale this out. But I’d like to squeeze as much juice out of these boxes as I can before I go provision more.
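
            To illustrate the idea (a simplified, hypothetical sketch, not our actual code; the server names and override table are made up):

              # Application-level shard routing: any repository can be pinned to
              # any appserver via an override, with a hash of the slug as default.
              import hashlib

              APPSERVERS = ["app1.internal", "app2.internal", "app3.internal"]
              OVERRIDES = {"some-user/some-repo": "app3.internal"}  # explicit pins

              def appserver_for(repo_slug):
                  if repo_slug in OVERRIDES:
                      return OVERRIDES[repo_slug]
                  digest = hashlib.md5(repo_slug.encode("utf-8")).hexdigest()
                  return APPSERVERS[int(digest, 16) % len(APPSERVERS)]

              print(appserver_for("some-user/some-repo"))  # pinned -> app3.internal
              print(appserver_for("another/repo"))         # hash-based default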

          • schickb

            Well if you are squeezing so much that you don’t have enough bandwidth for replication, you might be over squeezing :)

            In any case… I’ll stop blathering. BB is a cool site, I hope you get it all figured out!

      • http://buffered.io/ OJ

        Thanks for the info mate. I did understand that from your previous comment. I guess what I wasn’t totally clear on with my comment was that it’s not the first time that Amazon has been the issue. This time it might have been the RAID setup, but other times it hasn’t been. Plus their turnaround time has been poor.

        Cheers :)

        • jespern

          Oh, sorry, I thought I was replying to @schickb, not you :)

  • http://www.stevemilner.org Steve

    Ever think about having a secondary RO instance running on another provider (slicehost or something) which syncs data over in the background? At the very least people would still be able to check out and download code while 'da clowd' stops evaporating.

  • Pingback: Bitbucket downtime for a hardware upgrade – Bitbucket
