Downtime Postmortem

May 26, 2011

We experienced intermittent downtime, timeouts, and general performance problems over the past 24 hours as the result of a failing disk drive. To prepare for situations like this, all our storage is redundant and we keep hot and cold spares in stock. However, this particular drive failure was problematic because performance degraded as the drive slowly failed. Dell’s hard drive monitoring utilities didn’t flag the drive as failing, and our applications blocked on I/O as the drive became slower and slower.

We suspected a problem with the drive earlier this week when we began to see this warning in our logs (the time stamps are CDT):

May 22 04:08:44 bitbucket04 Server Administrator: Storage Service EventID: 2095 Unexpected sense. SCSI sense data: Sense key: 6 Sense code: 29 Sense qualifier: 2: Physical Disk 0:0:16 Controller 0, Connector 0

We contacted Dell for warranty support and an RMA, but they insisted that it was only a warning and that we shouldn’t be concerned. At the time, iostat didn’t show anything out of the ordinary. Yesterday we began to notice extremely high I/O utilization on the NFS server with the faulty disk. As a result, the load on our front-end Django servers skyrocketed as they blocked on I/O.

Atlassian’s internal development teams heavily use Atlassian Bamboo continuous integration build agents running on Amazon EC2 instances, and many of our teams host their repositories on Bitbucket. To alleviate load during this situation, we blocked several of our EC2 IP addresses. However, since EC2 IP addresses are dynamic, later in the day those IP addresses were reassigned to other EC2 users. That inadvertently blocked some of our customers’ continuous integration servers, and we’re extra sorry for that inconvenience.

We contacted Dell again today and took immediate action after seeing this error in our logs:

May 25 12:41:25 bitbucket04 Server Administrator: Storage Service EventID: 2271 The Patrol Read corrected a media error.: Physical Disk 0:0:16 Controller 0, Connector 0
May 25 12:41:28 bitbucket04 kernel: INFO: task nfsd:3397 blocked for more than 120 seconds.
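Both failures announced themselves in syslog days before the outage, so a small log watcher could have raised an alert earlier. A minimal sketch in Python; the event-ID list and message format are assumptions inferred from the excerpts above, not a complete catalog of Dell OpenManage storage events:

```python
import re

# Dell OpenManage storage events that preceded this disk failure:
# 2095 ("Unexpected sense") and 2271 ("Patrol Read corrected a media
# error"). This set is an assumption based on the log excerpts in the
# post, not an exhaustive list of predictive event IDs.
SUSPECT_EVENT_IDS = {"2095", "2271"}

EVENT_RE = re.compile(
    r"Server Administrator: Storage Service EventID: (\d+)\s+(.*)"
)

def suspect_events(log_lines):
    """Yield (event_id, detail) for storage events worth alerting on."""
    for line in log_lines:
        m = EVENT_RE.search(line)
        if m and m.group(1) in SUSPECT_EVENT_IDS:
            yield m.group(1), m.group(2)

# Sample lines taken from the excerpts in this post.
log = [
    "May 22 04:08:44 bitbucket04 Server Administrator: Storage Service "
    "EventID: 2095 Unexpected sense. SCSI sense data: Sense key: 6 "
    "Sense code: 29 Sense qualifier: 2: Physical Disk 0:0:16 "
    "Controller 0, Connector 0",
    "May 25 12:41:25 bitbucket04 Server Administrator: Storage Service "
    "EventID: 2271 The Patrol Read corrected a media error.: "
    "Physical Disk 0:0:16 Controller 0, Connector 0",
    "May 25 12:41:28 bitbucket04 kernel: INFO: task nfsd:3397 blocked "
    "for more than 120 seconds.",
]

for event_id, detail in suspect_events(log):
    print(event_id, detail)
```

Run against the three lines above, this flags events 2095 and 2271 and ignores the kernel message; wiring the generator to a pager or chat hook is left out of the sketch.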

We took the failing drive out of service and let the RAID rebuild itself with a hot spare. We also replaced the hot spare with a drive from our inventory. I/O utilization quickly returned to normal levels and our site has been stable since.

We apologize for the downtime this caused.

  • Christoph

    I’m glad to see Bitbucket informing users with such a high level of detail for every failure that occurs (last time EC2 went down, now the HDD failed). This shows how much you care for your users :-)

    Keep up your great work!


  • http://www.facebook.com/vglebov Vadim Glebov

    Thank you for the truth, this is awesome!


  • Ayaz Khan

    Thank you for catching and fixing the problem, and for taking the time to jot down the incident in detail. Kudos!


  • https://daenney.startssl.com/ Daniele Sluijters

    Glad that Atlassian is informing us about these things.

    Not to be a party-pooper, but BitBucket’s performance and responsiveness have taken quite a dive the past few weeks; this isn’t the first incident in the past month that rendered BitBucket (nearly) unusable at times…

    • http://www.bitbucket.org Justen Stepka

      You’re right, during the last two weeks the site performance has been below par, which is why we’re being as transparent as possible. We’re accountable to you as a site user, and we want you to know we’re doing everything we can to resolve all issues.


  • http://www.facebook.com/vassilevsky Ilya Vassilevsky

    How many storage nodes do you have in total?

    • http://www.bitbucket.org Justen Stepka

      Right now we have two storage servers, with rsync backups to multiple locations.

      We recently purchased a NetApp with 50TB of storage.



