Follow Up On Our Downtime Last Week

January 12, 2012

Last week we experienced unexpected downtime.  Any time something like this happens, we perform a full post-incident investigation to keep it from happening again.  Here are our findings.

The Crash

We determined the problem started with our syslog server crashing.  We found log rotation occurred at 04:05 as the cron.daily logrotate script ran:

-rw------- 1 root root 101714 Jan 5 04:05 messages-20120105.gz
-rw------- 1 root root 5012515 Jan 5 15:11 messages
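
Rotation by itself is harmless; what matters is the postrotate step, which signals rsyslogd to reopen its log files.  On RHEL 6 the stock /etc/logrotate.d/syslog does roughly the following (a sketch; the exact file may differ slightly):

/var/log/messages
/var/log/secure
/var/log/cron
{
    sharedscripts
    postrotate
        /bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
    endscript
}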

Instead of responding to HUP with a graceful reload, rsyslogd completely stopped:

Jan 5 04:05:04 bitbucket04 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="1905" x-info="http://www.rsyslog.com"] rsyslogd was HUPed, type 'restart'.
Jan 5 04:05:04 bitbucket04 kernel: Kernel logging (proc) stopped.

Based on previous log entries this is normal behavior, and rsyslogd typically starts right back up after the HUP.  In this case, it didn’t start back up at all.

We’re currently running the rsyslog 4.6.2 that ships with RHEL 6.2.  We suspect we hit a known bug whose fix isn’t yet packaged for RHEL.

Blocking on syslog()

So we determined rsyslog crashed.  We wouldn’t have expected that to be much of a problem on its own.  However, closer inspection showed we were syslogging over TCP.  Our syslog server was configured to accept both TCP and UDP:

# Provides UDP syslog reception
$ModLoad imudp.so
$UDPServerRun 514

# Provides TCP syslog reception
$ModLoad imtcp.so
$InputTCPServerRun 514

But our clients were specifically configured to log over TCP.  In rsyslog forwarding rules, two @ signs before the hostname mean TCP, while a single @ means UDP:

*.*    @@syslog.private.bitbucket.org
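
For comparison, the UDP version of that forwarding rule would simply drop one of the @ signs:

*.*    @syslog.private.bitbucket.org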

When our central rsyslog server crashed, every application logging to it over TCP blocked in its syslog() calls, and our systems became unresponsive.  This included our custom applications and almost every other piece of our technology stack.  Unknown to us, we had been vulnerable to this problem since switching data centers over a year ago.  Moreover, we had recently begun syslogging much more information, such as Nginx logs, so the crash and the increased logging volume combined to make the situation much worse.

We’ve made changes to ensure this doesn’t happen again, most importantly switching our clients from TCP to UDP logging so that a crashed syslog server can no longer block our applications.

We’re sorry for the inconvenience this caused.  We’re always working to make Bitbucket more reliable, and we’re confident this particular issue won’t occur again.

Comments

  • Chris Moyer
    Posted January 12, 2012 at 8:11 pm | Permalink

    I think you mean “two at signs”, not “ampersands”, ampersands are &, not @.

    • Charles McLaughlin
      Posted January 13, 2012 at 12:16 am | Permalink

      Thanks for pointing that out.  I corrected it. 

      • Posted January 13, 2012 at 6:33 am | Permalink

        I can still see the term “two ampersands”.

  • Posted January 12, 2012 at 8:51 pm | Permalink

    Fair play guys. It was highly frustrating; but things don’t always go right and we’ve all been there. The more information you can share with us the better, the only criticism I have for you is that I feel your live status updates could have been better to keep us all posted on what was going on. Either way it’s done now and what’s important is as you say, it doesn’t happen again.

    Cheers for the report.

  • Posted January 12, 2012 at 10:15 pm | Permalink

    Wouldn’t it be better to log without blocking in the application? So that if the log server is unavailable, the apps can continue running?

    • Anonymous
      Posted January 12, 2012 at 10:49 pm | Permalink

      that’s exactly what they’ve done by switching from TCP to UDP

  • Anonymous
    Posted January 12, 2012 at 11:32 pm | Permalink

    Why not instead use a local “store and forward” approach?  Now, if your syslog server goes down, you lose all data during that time.  With store and forward, it would be buffered locally.

    • Posted January 13, 2012 at 12:36 am | Permalink

      That’s a good point. That’s something we’re looking to move to in the near future.
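
      For reference, rsyslog can already buffer locally using its disk-assisted queues; a minimal forwarding sketch along those lines (directive names from the rsyslog documentation, values here only illustrative, not our production settings):

      $WorkDirectory /var/lib/rsyslog
      $ActionQueueType LinkedList
      $ActionQueueFileName fwdq
      $ActionQueueMaxDiskSpace 1g
      $ActionQueueSaveOnShutdown on
      $ActionResumeRetryCount -1
      *.*    @@syslog.private.bitbucket.org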

  • andrei lisnic
    Posted January 12, 2012 at 11:35 pm | Permalink

    It’s a good tradition to go public with the cause of downtime, I like it.

  • Posted January 13, 2012 at 4:50 am | Permalink

    This might be because you are running both TCP and UDP on the same incoming port.
    From the documentation here: http://rsyslog.com/config-snippets/the-recipies/basic-configuration/

    “If you need both syslog server types configured, please make sure they run on proper ports. By default UDP syslog is received on port 514. TCP syslog needs a different port because often the RPC service is using this port as well.”
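
    In config terms that would mean something like the following on the server, with clients pointing at the same non-default port (10514 is just an example):

    $ModLoad imtcp.so
    $InputTCPServerRun 10514

    *.*    @@syslog.private.bitbucket.org:10514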

  • David Frey
    Posted January 24, 2012 at 12:41 am | Permalink

    I don’t know anything about the higher level protocol syslog is using.  Does syslog over UDP do any checks to ensure that logging messages aren’t lost or arriving out of order?

    • Posted January 24, 2012 at 2:16 pm | Permalink

      I doubt it, since UDP does not guarantee delivery, nor in-order delivery (i.e. stuff that’s sent may all be delivered, but not in the order it was sent). For reliable delivery, TCP is the ticket. But then you need to avoid the senders blocking when no receiver is available.

      • David Frey
        Posted January 24, 2012 at 5:34 pm | Permalink

        It seems to me like the logging call should be asynchronous to avoid locking up an application for an unknown amount of time.

  • Matt Doar
    Posted January 24, 2012 at 6:47 pm | Permalink

    Good explanation!

  • demofly
    Posted January 4, 2013 at 3:54 am | Permalink

    just use syslog-ng and don’t give a fuck!