Bitbucket downtime for a hardware upgrade
By Jesper Noehr on August 25, 2010

To badly paraphrase everyone’s favorite Wall Crawler series, with great success comes great responsibility. Bitbucket has grown fast – faster than we were ready for.
We’re aware that there have been ongoing stability and performance issues. That is why we’re happy to announce that on Monday, August 30, at 01:00 GMT we’ll be moving off Amazon EC2 to a dedicated server deployment, professionally managed at Contegix.
The current Amazon EC2 setup looks like this:
- 2 x m1.small
- 2 x c1.xlarge
- 2 x m2.4xlarge
Many of the problems we have are related to disk I/O and memory, which is why we’ve chosen to move to a physical machine setup.
When we switch to Contegix, we’ll be switching to:
- 5 x Dell R610, 32 GB RAM, 16 cores
- Storage: Dell MD1120 DAS array
- 22 x 600 GB 10k RPM disks in RAID 10 (~2.4 TB final storage)
- Redundant backbone providers
Expected Downtime
Over the last month we’ve been putting together a migration plan that keeps downtime to a minimum: we expect about 1 hour of total downtime.
The main part of that is moving the database; everyone’s repositories will be moved over gradually. Your repositories will still be available during the transition; however, while an individual repository is being migrated, it will be in read-only mode. This should last only 10-60 seconds, even for the largest repositories, so chances are you won’t even notice it.
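For the curious, the per-repository move is conceptually along the lines of the sketch below. This is only an illustration in Python: the paths, the flag-file mechanism, and the rsync call are placeholders, not our actual migration tooling.

import subprocess
from pathlib import Path

REPO_ROOT = Path("/srv/repos")        # placeholder path, not our real layout
NEW_HOST = "storage.new.example"      # placeholder for the new storage host at Contegix

def migrate_repo(slug):
    """Copy one repository to the new storage while it sits in read-only mode."""
    repo = REPO_ROOT / slug
    readonly_flag = repo / ".migrating"   # hypothetical flag the app checks before accepting pushes
    readonly_flag.touch()                 # 1. reject writes for the next 10-60 seconds
    try:
        # 2. copy the repository to the new machines
        subprocess.run(
            ["rsync", "-a", "--delete", str(repo) + "/", NEW_HOST + ":" + str(repo) + "/"],
            check=True)
        # 3. re-point the application at the new copy (omitted here)
    finally:
        readonly_flag.unlink()            # 4. lift the read-only flag; pushes resume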
We’d like to thank everyone for their patience in helping us get this far.
Be sure to check back soon for some very exciting updates, and look forward to a more stable, faster Bitbucket!
Comments
I get the feeling that the downtime estimate is a little optimistic.
And then we have to consider the “what if it all goes wrong” time which could be 24 or 48 hours.
Should just say that the “…expected downtime is 24 hours, but you should be able to access your repositories except for a 10-60 sec unavailability that might occur when we are moving your repo…”
I think this should be good.
Can you comment a bit more on your architecture and where the bottlenecks are? We're using a small instance for the web server, a large RDS instance for the database, and S3 plus CloudFront for static content, and it's sailing. We also use Celery on a large instance for offline tasks, which helps a lot.
I hope the terrible slowdowns stop after the update 🙂 Good luck to you 🙂
Like Adam N, I'm curious what bottlenecks you ran into with EC2.
@devine: We've been performing synchronizations of the database and repositories daily into our staging environment as dry runs for the change on Monday. During the 1-hour downtime, we will perform the final synchronization and promote the staging environment to be the live environment. If anything goes wrong with the synchronization, we will roll back to our existing environment.
@axolx: We'll be posting a follow-up after the migration about what we did to prepare, with more details on what our problems with Amazon were.
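Roughly speaking, the dry-run-and-promote cycle works like the sketch below. The hostnames, paths, and commands here are illustrative placeholders, not our actual scripts.

import subprocess

STAGING_HOST = "staging.example"   # placeholder for the Contegix staging environment

def sync_to_staging():
    """Nightly dry run: ship the latest database dump and the repositories to staging."""
    subprocess.run(["rsync", "-a", "/backups/db.dump", STAGING_HOST + ":/restore/"], check=True)
    subprocess.run(["rsync", "-a", "/srv/repos/", STAGING_HOST + ":/srv/repos/"], check=True)

def cut_over():
    """During the downtime window: one last (small) sync, then promote staging to live."""
    sync_to_staging()
    # Re-point DNS / the load balancer at the staging environment (omitted).
    # Until that switch is flipped, the old environment is untouched, so a
    # rollback is simply a matter of not promoting staging.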
Well, you know the law: “If anything can go wrong, it will.”
So, right at this moment, I can't push.
You can now. We're back!
I'm not able to push over SSH. When will it be fixed?
Error:
remote: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
abort: no suitable response from remote hg!
Ditto. (user IndigoJo, repo IndigoJo/qtm-1.3)
Confirmation for 2 more users at different network locations
We’re seeing this as well . . .
user: tonybuckingham
repo: nextscreenlabs/forefront
Yet another similar experience. ssh authentication appears foobared.
The error message has changed now:
remote: hg serve: invalid arguments
abort: no suitable response from remote hg!
same here
I've been trying to pull for a period measured in hours rather than seconds, starting something like 18 hours after the one hour of projected downtime, with no success. Is there something I need to do on my end to get hg-over-ssh working again? My specific error looks like this:
remote: hg serve: invalid arguments
abort: no suitable response from remote hg!
Can't clone/pull over ssh
remote: Access granted
remote: Opened channel for session
remote: Started a shell/command
remote: hg serve: invalid arguments
remote: Server sent command exit status 0
remote: Disconnected: All channels closed
no suitable response from remote hg
It seems to be working for me again.
Overnight we ran into a hiccup with SSH authentication. For this, we apologize.
When anyone uploaded a new SSH key after the rollout of our new setup, the new key would overwrite the existing SSH key store. If you attempted to authenticate to your repositories over SSH during this time, your authentication would have failed. A stray process in charge of handling the store was behind the problem, and it took us a while to track it down.
A couple of hours ago we rolled out a fix for the problem, and all of your SSH authentications should be working as expected.
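To illustrate the class of bug with a simplified, hypothetical example (not our actual key-management code): the faulty path wrote only the newest key to the store, while the fix regenerates the store from every key on record.

KEY_STORE = "/tmp/authorized_keys"   # placeholder path for the SSH key store

def add_key_buggy(new_key):
    # The bug: writing only the uploaded key replaces the whole store,
    # silently dropping every other user's key.
    with open(KEY_STORE, "w") as f:
        f.write(new_key + "\n")

def rebuild_store(all_keys):
    # The fix: rebuild the store from the full set of keys in the database
    # whenever any key is added or removed.
    with open(KEY_STORE, "w") as f:
        for key in all_keys:
            f.write(key + "\n")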
good to you.
thanks for this post.
G.J.
Thanks!
Interesting.