Bitbucket downtime for a hardware upgrade

August 25, 2010

To badly paraphrase everyone’s favorite Wall Crawler series, with great success comes great responsibility. Bitbucket has grown fast – faster than we were ready for.

We’re aware that there have been ongoing stability and performance issues. That’s why we’re happy to announce that on Monday, August 30 at 01:00 GMT, we’ll be moving off Amazon EC2 to a dedicated server deployment, professionally managed at Contegix.

The current Amazon EC2 setup looks like this:

Many of the problems we have are related to disk I/O and memory, which is why we’ve chosen to move to a physical machine setup.

When we switch to Contegix, we’ll be switching to:

Expected Downtime

Over the last month we’ve been putting together a plan that limits downtime, which should come to no more than one hour.

The main part of that hour is moving the database; everyone’s repositories will be moved over gradually. Your repositories will remain available during the transition, but while an individual repository is being migrated it will be in read-only mode. That should last only 10-60 seconds, even for the largest repositories, so chances are you won’t even notice it.
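
To give a rough idea of what that looks like in practice, here is a simplified sketch of a per-repository cutover. It is illustrative only; the read-only flag, paths, and host name below are placeholders rather than our actual migration scripts:

    # Illustrative sketch only; the paths, host name, and read-only flag
    # mechanism are placeholders, not the real migration tooling.
    import subprocess
    from pathlib import Path

    OLD_ROOT = Path("/srv/repos")        # hypothetical repo root on the old host
    NEW_HOST = "newbox.example.com"      # hypothetical new dedicated server
    NEW_ROOT = "/srv/repos"

    def migrate_repo(slug):
        repo = OLD_ROOT / slug
        flag = repo / ".readonly"        # pushes are rejected while this file exists

        flag.touch()                     # 1. put this repository into read-only mode
        try:
            # 2. copy it to the new machine (the 10-60 second window)
            subprocess.run(
                ["rsync", "-a", "--delete",
                 str(repo) + "/", NEW_HOST + ":" + NEW_ROOT + "/" + slug + "/"],
                check=True)
            # 3. repoint routing for this repository at the new host (not shown)
        finally:
            flag.unlink()                # 4. writable again, success or not

    # Walk every repository one at a time, so only one is ever read-only.
    for slug in sorted(p.name for p in OLD_ROOT.iterdir()):
        migrate_repo(slug)

The real cutover also has to handle the database move and any failures mid-copy, which is what the one-hour window is for.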

We’d like to thank everyone for their patience in helping us get this far.

Be sure to check back soon for some very exciting updates, and look forward to a more stable, faster Bitbucket!

Comments

  • Devine
    Posted August 24, 2010 at 9:00 pm | Permalink

    I get the feeling that the downtime estimate is a little optimistic.
    And then we have to consider the “what if it all goes wrong” time which could be 24 or 48 hours.

    Should just say that the “…expected downtime is 24 hours, but you should be able to access your repositories except for a 10-60 sec unavailability that might occur when we are moving your repo…”

  • azrul
    Posted August 25, 2010 at 8:14 am | Permalink

    I think this should be good..

  • Adam N
    Posted August 25, 2010 at 3:03 pm | Permalink

    Can you comment a bit more on your architecture and where the bottlenecks are? We're using a small instance for the web server, a large RDS for the database, and S3/CloudFront for static content, and it's sailing. We do use Celery on a large instance for offline tasks, which helps a lot too.

  • Mazahakacimla
    Posted August 25, 2010 at 3:23 pm | Permalink

    I hope the awful slowness stops after the upgrade 🙂 Good luck 🙂

  • axolx
    Posted August 25, 2010 at 3:35 pm | Permalink

    Like Adam N, I'm curious what bottlenecks you ran into with EC2.

  • jespern
    Posted August 25, 2010 at 5:11 pm | Permalink

    @devine: We've been performing daily synchronizations of the database and repositories into our staging environment as dry runs for the change on Monday. During the one-hour downtime we will perform the final synchronization and promote the staging environment to be the live environment. If anything goes wrong with the synchronization, we will roll back to our existing environment.

    @axolx: We'll post a follow-up after the migration about the migration itself, what we did to prepare, and more details on the problems we had with Amazon.

  • Posted August 27, 2010 at 11:57 pm | Permalink

    Well, you know the law: “If anything can go wrong, it will.”

  • Posted August 29, 2010 at 7:21 pm | Permalink

    So, right at this moment, I can't push.

  • jespern
    Posted August 29, 2010 at 7:41 pm | Permalink

    You can now. We're back!

  • Sayane
    Posted August 30, 2010 at 8:17 am | Permalink

    I'm not able to push over SSH. When will it be fixed?

    Error:
    remote: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
    abort: no suitable response from remote hg!

  • Posted August 30, 2010 at 9:33 am | Permalink

    Ditto. (user IndigoJo, repo IndigoJo/qtm-1.3)

  • Posted August 30, 2010 at 11:03 am | Permalink

    Confirmation for 2 more users at different network locations

  • Posted August 30, 2010 at 11:11 am | Permalink

    We're seeing this as well . . .

    user: tonybuckingham
    repo: nextscreenlabs/forefront

  • Posted August 30, 2010 at 12:19 pm | Permalink

    Yet another similar experience. ssh authentication appears foobared.

  • Posted August 30, 2010 at 3:19 pm | Permalink

    The error message has changed now:

    remote: hg serve: invalid arguments
    abort: no suitable response from remote hg!

  • Luciano Longo
    Posted August 30, 2010 at 3:30 pm | Permalink

    same here

  • apotheon
    Posted August 30, 2010 at 3:33 pm | Permalink

    I've been trying to pull for a period measured in hours rather than seconds, starting something like 18 hours after the one hour of projected downtime, with no success. Is there something I need to do on my end to get hg-over-ssh working again? My specific error looks like this:

    remote: hg serve: invalid arguments
    abort: no suitable response from remote hg!

  • Posted August 30, 2010 at 4:17 pm | Permalink

    Can't clone/pull over ssh

    remote: Access granted
    remote: Opened channel for session
    remote: Started a shell/command

    remote: hg serve: invalid arguments
    remote: Server sent command exit status 0
    remote: Disconnected: All channels closed
    no suitable response from remote hg

  • apotheon
    Posted August 30, 2010 at 4:47 pm | Permalink

    It seems to be working for me again.

  • jespern
    Posted August 30, 2010 at 8:49 pm | Permalink

    Overnight we ran into a hiccup with SSH authentication, for which we apologize.
    After the rollout of our new setup, whenever anyone uploaded a new SSH key, that key would overwrite the existing SSH key store. If you tried to authenticate to your repositories over SSH during that window, your authentication would have failed. A stray process in charge of handling the store was the cause, and it took us a while to track it down.

    A couple of hours ago we rolled out a fix for the problem, and all of your SSH authentication should be working as expected.

  • Posted August 31, 2010 at 12:19 am | Permalink

    Good to you.
    Thanks for this post.

    G.J.

  • Posted September 17, 2010 at 9:48 am | Permalink

    Thanks!
    Interesting.
