We experienced intermittent downtime, timeouts, and general performance problems over the past 24 hours as the result of a failing disk drive. To prepare for situations like this, all our storage is redundant and we keep hot and cold spares in stock. However, this particular drive failure was problematic because performance degraded gradually as the drive slowly failed. Dell’s hard drive monitoring utilities didn’t detect the drive as failing, and our applications were blocked on I/O as the drive became slower and slower.
We suspected a problem with the drive earlier this week when we began to see this warning in our logs (the time stamps are CDT):
We contacted Dell for warranty support and an RMA, but they insisted that it was only a warning and that we shouldn’t be concerned. At the time, iostat didn’t show anything out of the ordinary. Yesterday we began to notice extremely high I/O utilization on the NFS server with the faulty disk. As a result, the load on our front-end Django servers skyrocketed as they were blocked on I/O.
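For readers curious how an I/O utilization figure like the one iostat reports can be derived, here is a minimal Python sketch. It assumes the Linux `/proc/diskstats` layout, where the tenth statistic after the device name is the cumulative number of milliseconds the device has spent doing I/O. The function names and the sample lines in the usage note are illustrative, not output from our servers.

```python
def parse_io_ticks(diskstats_line):
    """Return (device, cumulative ms spent doing I/O) from one /proc/diskstats line."""
    fields = diskstats_line.split()
    # fields[0] and fields[1] are the major/minor numbers, fields[2] is the
    # device name, and fields[12] is the cumulative time spent doing I/O (ms).
    return fields[2], int(fields[12])


def utilization_percent(before_line, after_line, interval_ms):
    """Approximate utilization between two samples taken interval_ms apart."""
    dev_before, ticks_before = parse_io_ticks(before_line)
    dev_after, ticks_after = parse_io_ticks(after_line)
    if dev_before != dev_after:
        raise ValueError("samples are from different devices")
    busy_ms = ticks_after - ticks_before
    # Cap at 100 to absorb small timing jitter between the two samples.
    return min(100.0, 100.0 * busy_ms / interval_ms)
```

Given two made-up samples for a device taken one second apart, where the busy-time counter advances from 1000 to 1950, `utilization_percent(before, after, 1000)` reports that the device was busy for roughly 95% of the interval. A disk that stays near 100% while throughput falls is the kind of pattern we eventually saw.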
Atlassian’s internal development teams heavily use Atlassian Bamboo continuous integration build agents running on Amazon EC2 instances, and many of our teams host their repositories on Bitbucket. To alleviate load during this situation, we blocked several of our own EC2 IP addresses. However, since EC2 IP addresses are dynamic, later in the day those addresses were reassigned to other EC2 users. That inadvertently blocked some of our customers’ continuous integration servers, and we’re extra sorry for that inconvenience.
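One way to reduce the risk of a stale block on a dynamic address is to give every block an expiry time. The sketch below is a hypothetical illustration in Python, not what we ran during the incident: the `ExpiringBlocklist` class, its method names, and the TTL value are all assumptions made for the example.

```python
import time


class ExpiringBlocklist:
    """Block IP addresses for a limited time so reassigned addresses recover."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the behaviour can be tested
        self._expires_at = {}      # ip -> time at which the block lapses

    def block(self, ip):
        """Block ip for ttl_seconds starting now."""
        self._expires_at[ip] = self._clock() + self._ttl

    def is_blocked(self, ip):
        """Return True while the block is active; drop it once it has lapsed."""
        expiry = self._expires_at.get(ip)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expires_at[ip]
            return False
        return True
```

With a short TTL, an address that has been handed to a different EC2 user stops being blocked on its own, at the cost of having to renew blocks that are still needed.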
We contacted Dell again today and took immediate action after seeing this error in our logs:
May 25 12:41:28 bitbucket04 kernel: INFO: task nfsd:3397 blocked for more than 120 seconds.
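Messages of this form come from the Linux kernel's hung task detector, and they are easy to watch for automatically. The following Python sketch extracts the task name, PID, and timeout from such a line; the regular expression and the `parse_hung_task` helper are our own illustration and assume the message format shown above.

```python
import re

# Matches the kernel's "task NAME:PID blocked for more than N seconds" report.
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<seconds>\d+) seconds"
)


def parse_hung_task(line):
    """Return (task name, pid, seconds) for a hung-task line, else None."""
    match = HUNG_TASK.search(line)
    if match is None:
        return None
    return (
        match.group("name"),
        int(match.group("pid")),
        int(match.group("seconds")),
    )
```

Applied to the line above, the helper identifies task `nfsd` with PID 3397 blocked past the 120 second threshold. A monitoring script could alert on the first such match rather than waiting for someone to read the logs.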
We took the failing drive out of service and let the RAID rebuild itself with a hot spare. We also replaced the hot spare with a drive from our inventory. I/O utilization quickly returned to normal levels and our site has been stable since.
We apologize for the downtime this caused.