Follow Up On Our Downtime Last Week

Last week we experienced unexpected downtime.  Anytime something like this happens we perform a full post-incident investigation to ensure it never happens again.  Here are our findings.

The Crash

We determined the problem started with our syslog server crashing.  We found log rotation occurred at 04:05 as the cron.daily logrotate script ran:

-rw------- 1 root root 101714 Jan 5 04:05 messages-20120105.gz
-rw------- 1 root root 5012515 Jan 5 15:11 messages

Instead of responding to HUP with a graceful reload, rsyslogd completely stopped:

Jan 5 04:05:04 bitbucket04 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="1905" x-info="http://www.rsyslog.com"] rsyslogd was HUPed, type 'restart'.
Jan 5 04:05:04 bitbucket04 kernel: Kernel logging (proc) stopped.

Based on previous log entries this is normal, but it typically restarts properly.  In this case, rsyslogd didn’t start back up at all.

We’re currently running rsyslog 4.6.2 that ships with RHEL 6.2.  We suspect we hit a known bug – the fix isn’t yet packaged for RHEL.

Blocking on syslog()

So we determined rsyslog crashed.  We wouldn’t have expected that to be much of a problem.  However, closer inspection shows we were syslogging over TCP.  Our syslog server was configured for both TCP and UDP:

# Provides UDP syslog reception

$ModLoad imudp.so
$UDPServerRun 514

# Provides TCP syslog reception
$ModLoad imtcp.so
$InputTCPServerRun 514

But our clients were specifically configured to log over TCP – two @ signs prior to the hostname means use TCP and one would mean use UDP:

*.*    @@syslog.private.bitbucket.org

When our central rsyslog crashed all applications logging to it over TCP were blocked and our systems became unresponsive.  This included our custom applications and almost every other piece of our technology stack.  Unknown to us, we were  vulnerable to this problem since switching data centers over a year ago.  Moreover, recently we began syslogging much more information, such as Nginx logs.  In this instance, the crash together with the increased logging combined to make the situation much worse.

We’ve made these changes to ensure this doesn’t happen again:

  • Disabled rsyslog’s HUPIsRestart option
  • Added a Monit check to ensure rsyslog is restarted even if it crashes
  • Switched from syslog over TCP to UDP

We’re sorry for the inconvenience this caused.  We’re always working to make Bitbucket more reliable and we won’t see this particular issue occur again.