Follow Up On Our Downtime Last Week

January 12, 2012

Last week we experienced unexpected downtime.  Anytime something like this happens we perform a full post-incident investigation to ensure it never happens again.  Here are our findings.

The Crash

We determined the problem started with our syslog server crashing.  We found log rotation occurred at 04:05 as the cron.daily logrotate script ran:

-rw------- 1 root root 101714 Jan 5 04:05 messages-20120105.gz
-rw------- 1 root root 5012515 Jan 5 15:11 messages

Instead of responding to HUP with a graceful reload, rsyslogd completely stopped:

Jan 5 04:05:04 bitbucket04 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="1905" x-info="http://www.rsyslog.com"] rsyslogd was HUPed, type 'restart'.
Jan 5 04:05:04 bitbucket04 kernel: Kernel logging (proc) stopped.

Based on previous log entries, this message is normal and rsyslogd typically starts back up properly afterwards.  In this case, it didn’t start back up at all.

We’re currently running the rsyslog 4.6.2 that ships with RHEL 6.2.  We suspect we hit a known bug whose fix isn’t yet packaged for RHEL.

Blocking on syslog()

So we determined rsyslogd crashed.  On its own, we wouldn’t have expected that to be much of a problem.  However, closer inspection showed we were syslogging over TCP.  Our syslog server was configured to accept both TCP and UDP:

# Provides UDP syslog reception
$ModLoad imudp.so
$UDPServerRun 514

# Provides TCP syslog reception
$ModLoad imtcp.so
$InputTCPServerRun 514

But our clients were specifically configured to log over TCP.  Two @ signs before the hostname mean use TCP; a single @ would mean use UDP:

*.*    @@syslog.private.bitbucket.org

When our central rsyslog server crashed, all applications logging to it over TCP blocked and our systems became unresponsive.  This included our custom applications and almost every other piece of our technology stack.  Unbeknownst to us, we had been vulnerable to this problem since switching data centers over a year ago.  Moreover, we recently began syslogging much more information, such as Nginx logs.  The crash and the increased logging volume combined to make the situation much worse.

We’ve made changes to ensure this doesn’t happen again.  Most importantly, our clients now log to the central syslog server over UDP instead of TCP, so a crashed or unreachable syslog server can no longer block our applications.
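
For illustration, that client-side change is a one-character edit to the directive shown above; dropping one of the @ signs switches the transport from TCP to UDP:

*.*    @syslog.private.bitbucket.org

UDP delivery isn’t guaranteed, but a slow or crashed syslog server can no longer stall the sender.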

We’re sorry for the inconvenience this caused.  We’re always working to make Bitbucket more reliable, and we’re confident this particular issue won’t occur again.

  • Chris Moyer

    I think you mean “two at signs”, not “ampersands”, ampersands are &, not @.

    • Charles McLaughlin

      Thanks for pointing that out.  I corrected it. 

      • http://www.miketheman.net/ Mike Fiedler

        I can still see the term “two ampersands”.

  • http://twitter.com/_liamb Liam Bateman

    Fair play guys. It was highly frustrating; but things don’t always go right and we’ve all been there. The more information you can share with us the better, the only criticism I have for you is that I feel your live status updates could have been better to keep us all posted on what was going on. Either way it’s done now and what’s important is as you say, it doesn’t happen again.

    Cheers for the report.

  • http://kev.inburke.com/ Kevin Burke

    Wouldn’t it be better to log without blocking in the application? So that if the log server is unavailable, the apps can continue running?

    • Anonymous

      that’s exactly what they’ve done by switching from TCP to UDP

  • Anonymous

    Why not instead use a local “store and forward” approach?  Now, if your syslog server goes down, you lose all data during that time.  With store and forward, it would be buffered locally.

    • http://twitter.com/brodie Brodie Rao

      That’s a good point. That’s something we’re looking to move to in the near future.
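
A minimal sketch of what such a local store-and-forward setup could look like in rsyslog, using its standard action-queue directives (the spool path, file name, and size limit here are illustrative; the hostname is the one from the post above).  Messages are queued in memory, spilled to a local spool file whenever the central server is unreachable, and retried indefinitely instead of being dropped:

# Spool to local disk when the central server is unreachable,
# then forward once it comes back.
$WorkDirectory /var/lib/rsyslog
$ActionQueueType LinkedList
$ActionQueueFileName centralspool
$ActionQueueMaxDiskSpace 1g
$ActionQueueSaveOnShutdown on
$ActionResumeRetryCount -1
*.*    @@syslog.private.bitbucket.org

Because the queue sits in front of the forwarding action, the send also becomes asynchronous, so application-side logging no longer blocks on the network.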

  • andrei lisnic

    It’s a good tradition to go public with the cause of downtime, I like it.

  • http://www.miketheman.net/ Mike Fiedler

    This might be because you are running both TCP and UDP on the same incoming port.
    From the documentation here: http://rsyslog.com/config-snippets/the-recipies/basic-configuration/

    “If you need both syslog server types configured, please make sure they run on proper ports. By default UDP syslog is received on port 514. TCP syslog needs a different port because often the RPC service is using this port as well.”
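
A sketch of the server-side variant the rsyslog documentation describes, with the TCP listener moved off port 514 (10514 is an arbitrary choice here; clients would then append :10514 to the forwarding target):

# Provides UDP syslog reception
$ModLoad imudp.so
$UDPServerRun 514

# Provides TCP syslog reception on its own port
$ModLoad imtcp.so
$InputTCPServerRun 10514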

  • David Frey

    I don’t know anything about the higher level protocol syslog is using.  Does syslog over UDP do any checks to ensure that logging messages aren’t lost or arriving out of order?

    • http://www.red-dove.com/ Vinay Sajip

      I doubt it, since UDP does not guarantee delivery, nor in-order delivery (i.e. stuff that’s sent may all be delivered, but not in the order it was sent). For reliable delivery, TCP is the ticket. But then you need to avoid the senders blocking when no receiver is available.

      • David Frey

        It seems to me like the logging call should be asynchronous to avoid locking up an application for an unknown amount of time.

  • Matt Doar

    Good explanation!

  • demofly

    just use syslog-ng and don’t give a fuck!