Outage incident and our new monitoring setup

Today around 18:00 GMT, two of our front-end servers ran into the limit set by Apache's MaxClients directive. After receiving Pingdom alerts, it took us 10 minutes to find the problem, change the setting, and reload Apache. During that time you may have noticed poor performance and timeouts; we apologize for that.
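
For those curious, the change itself is small. Below is a sketch of the relevant prefork section of httpd.conf; the numbers are illustrative rather than our production values, since the right ceiling depends on how much RAM each Apache process uses:

    <IfModule mpm_prefork_module>
        StartServers           8
        MinSpareServers        5
        MaxSpareServers       20
        # ServerLimit must be >= MaxClients in the prefork MPM
        ServerLimit          512
        # hard cap on simultaneous worker processes serving requests
        MaxClients           512
        MaxRequestsPerChild 4000
    </IfModule>

After editing, a graceful reload picks up the new limit without dropping in-flight requests:

    apachectl graceful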

Our analysis points to some legacy HTTP load-balancing code left over from when we ran Bitbucket on EC2. We're implementing a fix and will deploy it to production soon.

Since moving off EC2, we've been working hard to improve our monitoring, which will help in situations like this. For instance, we're testing Monit, which could have detected this problem automatically and bounced Apache. We're also working to expand our live functional tests, using Twill and Kong, to cover Mercurial operations such as checkouts.
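
To give a feel for both pieces, here are two small sketches. The first is a Monit rule along the lines of what we're evaluating; the pidfile path and init script are illustrative and will differ on our machines:

    # restart Apache if it dies or stops answering HTTP locally
    check process apache with pidfile /var/run/apache2.pid
        start program = "/etc/init.d/apache2 start"
        stop program  = "/etc/init.d/apache2 stop"
        if failed host 127.0.0.1 port 80 protocol http then restart

The second is the shape of a Twill script of the kind Kong runs against the live site; the URL and the string it looks for are placeholders:

    # fetch the front page and fail loudly if it doesn't come back healthy
    go http://bitbucket.org/
    code 200
    find "bitbucket"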

For those of you interested in our hardware setup, the new front-end machines each have 16 cores and 32GB of RAM. Since migrating from EC2 to Contegix, we've rarely seen the load average exceed 2 (roughly 12% of capacity on 16 cores), whereas we were at or beyond 100% load on EC2.