Bitbucket downtime for a hardware upgrade

August 25, 2010

To badly paraphrase everyone’s favorite Wall Crawler series, with great success comes great responsibility. Bitbucket has grown fast – faster than we were ready for.

We’re aware that there have been ongoing stability and performance issues. That’s why we’re happy to announce that on Monday, August 30 at 01:00 GMT, we’ll be moving off Amazon EC2 to a dedicated server deployment, professionally managed at Contegix.

The current Amazon EC2 setup looks like this:

[diagram: the current EC2 deployment]

Many of the problems we’ve been having are related to disk I/O and memory, which is why we’ve chosen to move to a physical machine setup.

When we switch to Contegix, our new setup will look like this:

[diagram: the new Contegix deployment]

Expected Downtime

Over the last month we’ve been putting together a migration plan that limits downtime to about one hour.

The main part of that hour is moving the database. Everyone’s repositories will be moved over gradually: your repositories will remain available during the transition, but while an individual repository is being migrated it will be in read-only mode. This should last only 10-60 seconds, even for the largest repositories, so chances are you won’t even notice.
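
As a rough illustration, the per-repository step could look something like the sketch below (the read-only flag, hosts, and paths are hypothetical stand-ins, not our actual tooling):

    import subprocess

    def migrate_repo(slug, flags, old="old-host", new="new-host"):
        """Move one repository to the new hardware, keeping it readable."""
        flags[slug] = "read-only"  # pushes are rejected, pulls still served
        try:
            # Final incremental copy; repeated pre-syncs keep this to 10-60s.
            subprocess.check_call([
                "rsync", "-a", "--delete",
                "%s:/repos/%s/" % (old, slug),
                "%s:/repos/%s/" % (new, slug),
            ])
            flags[slug] = "serve-from-new"  # start routing traffic to the new host
        finally:
            if flags[slug] == "read-only":  # sync failed: restore writes on the old host
                flags[slug] = "read-write"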

We’d like to thank everyone for their patience in helping us get this far.

Be sure to check back soon for some very exciting updates, and look forward to a more stable, faster Bitbucket!

  • Devine

    I get the feeling that the downtime estimate is a little optimistic.
    And then we have to consider the “what if it all goes wrong” time, which could be 24 or 48 hours.

    They should just say “…expected downtime is 24 hours, but you should be able to access your repositories except for a 10-60 second unavailability that might occur when we are moving your repo…”

  • azrul

    I think this should be good..

  • Adam N

    Can you comment a bit more on your architecture and where the bottlenecks are? We're using a small instance for a web server, a large RDS instance for the database, and S3 plus CloudFront for static content, and it's sailing. We do use Celery on a large instance for offline tasks, which helps a lot too.

  • Mazahakacimla

    I hope the horrible slowdowns stop after the upgrade :) Good luck to you :)

  • axolx

    Like Adam N, I'm curious what bottlenecks you ran into with EC2.

  • jespern

    @devine: We've been performing synchronizations of the database and repositories daily into our staging environment as dry runs for the change on Monday. During the 1-hour downtime, we will perform the final synchronization and promote the staging environment to be the live environment. If anything goes wrong with the synchronization, we will roll back to our existing environment.

    @axolx: We'll be posting a follow-up post-migration about the migration, what we did to prepare, and more details on what our problems with Amazon were.
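
    In rough terms, the cutover logic looks like the sketch below (every function is a simplified, hypothetical stand-in for the real procedure, which we haven't published):

        import subprocess

        def final_sync():
            # Last incremental sync; the daily dry runs keep this short.
            subprocess.check_call(
                ["rsync", "-a", "--delete", "old-host:/repos/", "staging:/repos/"])

        def promote():
            # Point live traffic (DNS / load balancer) at the staging environment.
            print("staging promoted to live")

        def rollback():
            # The old environment is untouched by the sync, so falling back
            # is just a matter of routing traffic to it again.
            print("rolled back to the old environment")

        def cutover():
            try:
                final_sync()
                promote()
            except Exception:
                rollback()
                raise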

  • http://twitter.com/xrogaan xrogaan

    Well, you know the law: “If anything can go wrong, it will.”

  • http://twitter.com/duduzerah duduzerah

    So, right at this moment, I can't push.

  • jespern

    You can now. We're back!

  • Sayane

    I'm not able to push over SSH. When will it be fixed?

    Error:
    remote: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
    abort: no suitable response from remote hg!

  • http://www.blogistan.co.uk/blog/ Matthew Smith

    Ditto. (user IndigoJo, repo IndigoJo/qtm-1.3)

  • http://dotnetchris.wordpress.com/ Chris Marisic

    Confirmation for 2 more users at different network locations

  • http://tonybuckingham.net/ Tony

    We're seeing this as well . . .

    user: tonybuckingham
    repo: nextscreenlabs/forefront

  • http://petter-haggholm.livejournal.com/ Petter Häggholm

    Yet another similar experience. ssh authentication appears foobared.

  • http://www.blogistan.co.uk/blog/ Matthew Smith

    The error message has changed now:

    remote: hg serve: invalid arguments
    abort: no suitable response from remote hg!

  • http://twitter.com/xNephilimx Luciano Longo

    same here

  • apotheon

    I've been trying to pull for a period measured in hours rather than seconds, starting something like 18 hours after the one hour of projected downtime, with no success. Is there something I need to do on my end to get hg-over-ssh working again? My specific error looks like this:

    remote: hg serve: invalid arguments
    abort: no suitable response from remote hg!

  • http://falsefalse.tumblr.com/ falsefalse

    Can't clone/pull over ssh

    remote: Access granted
    remote: Opened channel for session
    remote: Started a shell/command

    remote: hg serve: invalid arguments
    remote: Server sent command exit status 0
    remote: Disconnected: All channels closed
    no suitable response from remote hg

  • apotheon

    It seems to be working for me again.

  • jespern

    Overnight we ran into a hiccup with SSH authentication, for which we apologize.
    When anyone uploaded a new SSH key after the rollout of our new setup, the new key would overwrite the existing SSH key store. If you attempted to authenticate to your repositories over SSH during this time, your authentication would have failed. A stray process in charge of handling the store was the culprit, and it took us a while to track it down.

    A couple of hours ago we rolled out a fix for the problem and all of your SSH authentications should be working as expected.
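
    In essence, the bug was of this shape (a hypothetical sketch, not our actual code): the key-store writer truncated the shared file instead of regenerating it from all keys on record.

        import os
        import tempfile

        AUTHORIZED_KEYS = "/path/to/authorized_keys"  # illustrative path

        def add_key_buggy(new_key):
            # BUG: mode "w" truncates the store, dropping every existing key.
            with open(AUTHORIZED_KEYS, "w") as f:
                f.write(new_key + "\n")

        def rewrite_store_fixed(all_keys):
            # Fix: regenerate the store from all keys on record, atomically.
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(AUTHORIZED_KEYS))
            with os.fdopen(fd, "w") as f:
                f.write("\n".join(all_keys) + "\n")
            os.replace(tmp, AUTHORIZED_KEYS)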

  • http://www.tapety.cjb.net tapety

    Good for you.
    Thanks for this post.

    G.J.

  • http://www.tapety.nub.pl tapety na telefon

    Thanks!
    Interesting.
