Does this workflow sound familiar? Commit, trigger a build, switch to your continuous integration tool, check the status, configure your deployment environment, execute complex scripts, switch back to Bitbucket, start again… you get the idea.
What if all this context switching could be a thing of the past? How much time would you gain back from being able to view build information and even deploy without leaving Bitbucket? We think it’s time you find out.
We’re working with industry leaders – Amazon, Microsoft, DigitalOcean, and more – to close the loop on your development workflow with new build status and deployment integrations on Bitbucket.
Are you using a different continuous integration tool in your workflow? Don’t fret – the community has been busy as well. Users of Wercker can not only visualize their build pipelines using its add-on, but can also see build status while they work. Buildkite has you covered as well, with automatic build status integration for all builds on Bitbucket repositories. Don’t see your favorite tool listed here? Build an integration using our documentation.
Deploy from Bitbucket
Now that you know your builds are passing, it’s time to deploy your work. In the past, getting code from your team repository to your staging or production environments required executing scripts or configuring complicated deployment plans – all outside of Bitbucket.
Thanks to Bitbucket Connect, you can now deploy your code from Bitbucket to several leading cloud services including Amazon Web Services (AWS) CodeDeploy, Microsoft Azure App Service, and DigitalOcean. Bitbucket’s Connect architecture takes this a level beyond a simple “click to deploy” button. The ability of add-ons to add features into the user interface means you can configure your deployment environments without leaving Bitbucket. These cloud services have worked with us to build add-ons to make your life easier. Workflow simplification for the win!
Amazon, Microsoft, and DigitalOcean see value from deploying directly from Bitbucket, and we hope you do, too.
Ready to get started?
From commit to ship, your team can now complete their workflow without ever leaving Bitbucket. Get started by heading to your CI tool of choice, enabling build status for Bitbucket, and installing a deployment add-on. Now teams can spend less time switching between tools and more time doing what they love – coding.
Many of you have been asking for better support for continuous integration in Bitbucket Cloud. Every time you trigger a build, whether by pushing commits or creating a pull request, you have to log in to your build server to see whether it passed or failed. We know it’s been a major hassle to have no way of seeing build status right within the UI – until now.
Starting today, the build status API is available, with updates to the UI providing at-a-glance feedback on commits, branches, and pull requests in Bitbucket Cloud. Now you’ll know when your build is passing and when it’s safe to merge changes, saving you precious time to do what you do best: coding.
When viewing the commits in your repository, you can clearly see which of them have been checked out and tested by your CI tool of choice. If all the tests pass, you see a green checkmark; otherwise, we display a red warning indicator.
For a more detailed view of a commit’s status, the commit view lists the passed or failed builds (if you have multiple builds) and the passed or failed tests for each build. This saves you precious time, as you don’t have to browse through your CI tool’s log files trying to find out why a build failed.
An arguably even more useful application of Bitbucket’s build status API is for pull requests. If you use pull requests to do code reviews (like we do), you know that one of the first questions you ask as a reviewer is “Do the tests still pass?” This question is now easily answered by looking for a successful build status indicator in the pull request view.
You can also see the build status at the branch level, which is great for distributed teams. Make sure your builds have passed before you merge your changes into the master branch.
We’re working on integrating with other CI tools using the build status API. In the meantime, if you want to use build status now, the best way is to write a simple script to post the results of your builds to the Bitbucket API. Most importantly, if you want to build an integration for Bitbucket Cloud with the CI tool of your choice, get started by taking a look at our documentation. We’re excited to see all the integrations you build in the next few weeks.
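As a sketch of such a script: the repository, commit hash, and credentials below are placeholders, and the request follows the shape of Bitbucket Cloud’s 2.0 REST API (one authenticated POST to the commit’s statuses/build resource per build). The block builds the request and prints it rather than sending it:

```shell
# Placeholders -- substitute your own repository, commit, and app password.
OWNER=myteam REPO=myrepo COMMIT=abc123
PAYLOAD='{"state": "SUCCESSFUL", "key": "MY-CI-BUILD", "name": "Build #42",
          "url": "https://ci.example.com/builds/42", "description": "All tests green"}'
# "state" may also be INPROGRESS or FAILED, reported as the build progresses.

# Dry run: print the request a CI hook would fire after each build finishes.
echo curl -s -u user:app_password -H "Content-Type: application/json" -d "$PAYLOAD" \
  "https://api.bitbucket.org/2.0/repositories/$OWNER/$REPO/commit/$COMMIT/statuses/build"
```

A real CI hook would run the printed command with its own credentials and fill in the commit hash of the build it just finished.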
Many users have embraced Git for its flexibility as a distributed version control system. In particular, Git’s branching and merging model provides powerful ways to decentralize development workflows. While this flexibility works for the majority of use cases, some aren’t handled so elegantly. One of these use cases is the use of Git with large, monolithic repositories, or monorepos. This article explores issues when dealing with monorepos using Git and offers tips to mitigate them.
What is a monorepo?
Definitions vary, but we define a monorepo as follows:
The repository contains more than one logical project (e.g. an iOS client and a web application)
These projects are most likely unrelated, loosely connected, or can be connected by other means (e.g. via dependency management tools)
The repository is large in many ways:
Number of commits
Number of branches and/or tags
Number of files tracked
Size of content tracked (as measured by looking at the .git directory of the repository)
With thousands of commits a week across hundreds of thousands of files, Facebook’s main source repository is enormous – many times larger than even the Linux kernel, which clocked in at 17 million lines of code and 44,000 files in 2013.
There are many conceptual challenges when managing unrelated projects in a monorepo in Git.
First, Git tracks the state of the whole tree in every single commit made. This is fine for single or related projects but becomes unwieldy for a repository containing many unrelated projects. Simply put, commits in unrelated parts of the tree affect the subtree that is relevant to a developer. This issue is pronounced at scale with large numbers of commits advancing the history of the tree. As the branch tip is changing all the time, frequent merging or rebasing locally is required to push changes.
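The push cycle this forces can be seen in a throwaway pair of clones (a sketch with made-up names: a local bare “central” repository, and clones for alice and bob). When the shared tip advances under you, your push is rejected until you rebase:

```shell
set -e
w2=$(mktemp -d)
git init -q --bare "$w2/central.git"
for who in alice bob; do
  git clone -q "$w2/central.git" "$w2/$who" 2>/dev/null
  git -C "$w2/$who" config user.name "$who"
  git -C "$w2/$who" config user.email "$who@example.com"
done

echo a > "$w2/alice/a.txt"
git -C "$w2/alice" add a.txt && git -C "$w2/alice" commit -qm "alice: a"
git -C "$w2/alice" push -q origin HEAD          # Alice pushes first: accepted

echo b > "$w2/bob/b.txt"
git -C "$w2/bob" add b.txt && git -C "$w2/bob" commit -qm "bob: b"
# Bob's push is rejected: the branch tip moved underneath him while he worked...
git -C "$w2/bob" push -q origin HEAD 2>/dev/null || echo "push rejected (non-fast-forward)"
# ...so he must rebase and retry -- constantly, on a busy monorepo branch.
branch=$(git -C "$w2/bob" symbolic-ref --short HEAD)
git -C "$w2/bob" pull -q --rebase origin "$branch"
git -C "$w2/bob" push -q origin HEAD && echo "push accepted after rebase"
```

With two people this is an occasional annoyance; with thousands of commits a week on one branch it becomes a constant tax.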
In Git, a tag is a named alias for a particular commit, referring to the whole tree. But the usefulness of tags diminishes in the context of a monorepo. Ask yourself: if you’re working on a web application that is continuously deployed from a monorepo, what relevance does the release tag for the versioned iOS client have?
Alongside these conceptual challenges are numerous performance issues that can affect a monorepo setup.
Number of commits
Managing unrelated projects in a single repository at scale can prove troublesome at the commit level. Over time this can lead to a large number of commits with a significant rate of growth (Facebook cites “thousands of commits a week”). This becomes especially troublesome as Git uses a directed acyclic graph (DAG) to represent the history of a project. With a large number of commits any command that walks the graph could become slow as the history deepens.
Some examples include investigating a repository’s history via git log or annotating changes in a file using git blame. With git blame, if your repository has a large number of commits, Git has to walk many unrelated commits to calculate the blame information. Another example is answering any kind of reachability question (e.g. is commit A reachable from commit B?). Add together the many unrelated modules found in a monorepo and the performance issues compound.
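Reachability questions like the one above can be asked directly on the command line. As a sketch in a toy three-commit repository (instant here, but each query walks the commit graph, so it deepens with history):

```shell
set -e
r3=$(mktemp -d); git init -q "$r3"
git -C "$r3" config user.name demo; git -C "$r3" config user.email demo@example.com
for n in 1 2 3; do
  echo "$n" > "$r3/file.txt"
  git -C "$r3" add file.txt
  git -C "$r3" commit -qm "commit $n"
done
A=$(git -C "$r3" rev-parse HEAD~2)   # oldest commit
B=$(git -C "$r3" rev-parse HEAD)     # newest commit

# "Is commit A reachable from commit B?" -- exit status 0 means yes.
git -C "$r3" merge-base --is-ancestor "$A" "$B" && echo "A is reachable from B"
git -C "$r3" merge-base --is-ancestor "$B" "$A" || echo "B is not reachable from A"
```

The same `git merge-base --is-ancestor` check on a history with millions of commits is exactly the kind of graph walk that slows down in a monorepo.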
Number of refs
A large number of refs (i.e. branches or tags) in your monorepo affects performance in many ways.
Ref advertisements contain every ref in your monorepo. As ref advertisements are the first phase in any remote Git operation, this affects operations like git clone, git fetch, or git push. With a large number of refs, performance takes a hit when performing these operations. You can see the ref advertisement by using git ls-remote with a repository URL. For example:
git ls-remote https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
will list all the references in the Linux Kernel repository.
If refs are stored loose, listing branches is slow. After a git gc, refs are packed into a single file, and even listing over 20,000 refs is fast (~0.06 seconds).
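That packing step can also be run by hand. A small sketch: create a repository with 50 tags, then let git pack-refs (the step git gc performs for you) collapse the loose per-ref files into one .git/packed-refs file:

```shell
set -e
r4=$(mktemp -d); git init -q "$r4"
git -C "$r4" config user.name demo; git -C "$r4" config user.email demo@example.com
echo x > "$r4/f"; git -C "$r4" add f; git -C "$r4" commit -qm init
for n in $(seq 1 50); do git -C "$r4" tag "v0.$n"; done

ls "$r4/.git/refs/tags" | wc -l            # 50 loose files, one per tag
git -C "$r4" pack-refs --all               # the step `git gc` performs for you
ls "$r4/.git/refs/tags" | wc -l            # 0 -- all refs now live in one file
grep -c refs/tags "$r4/.git/packed-refs"   # 50
```

One file read instead of tens of thousands of stat() calls is the difference between the slow and fast listing above.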
Any operation that needs to traverse a repository’s commit history and consider each ref (e.g. git branch --contains SHA1) will be slow in a monorepo. In a repository with 21,708 refs, listing the refs that contain an old commit (one reachable from almost all refs) took:
User time (seconds): 146.44*
*This will vary depending on buffer caches and the underlying storage layer.
Number of files tracked
The index or directory cache (.git/index) tracks every file in your repository. Git uses this index to determine whether a file has changed by executing stat(1) on every single file and comparing file modification information with the information contained in the index.
Thus the number of files tracked impacts the performance* of many operations:
git status could be slow (stats every single file, index file will be large)
git commit could be slow as well (also stats every single file)
*This will vary depending on buffer caches and the underlying storage layer, and is only noticeable when there are a large number of files, in the realm of tens or hundreds of thousands.
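What this looks like on disk can be sketched in a toy repository: the index is one binary file that grows with the number of tracked paths, and every git status has to stat() each of them.

```shell
set -e
r5=$(mktemp -d); git init -q "$r5"
git -C "$r5" config user.name demo; git -C "$r5" config user.email demo@example.com
for n in $(seq 1 200); do echo "$n" > "$r5/file$n.txt"; done
git -C "$r5" add .
git -C "$r5" ls-files | wc -l        # 200 tracked paths
ls -l "$r5/.git/index"               # a single binary file with one entry per path
# every `git status` has to stat() each of those files to spot changes:
git -C "$r5" status --short | wc -l  # 200
```

At 200 files this is instantaneous; multiply the same per-file work by hundreds of thousands of paths and the slowdowns described above appear.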
Large files in a single subtree/project affect the performance of the entire repository. For example, large media assets added to an iOS client project in a monorepo are cloned even when a developer (or build agent) is working on an unrelated project.
Whether it’s the number of files, how often they’re changed, or how large they are, these issues in combination have an increased impact on performance:
Switching between branches/tags, which is most useful in a subtree context (e.g. the subtree I’m working on), still updates the entire tree. This process can be slow due to the number of files affected, or it requires a workaround. Using git checkout ref-28642-31335 -- templates, for example, updates the ./templates directory to match the given branch without updating HEAD, which has the side effect of marking the updated files as modified in the index.
Cloning and fetching slow down and are resource-intensive on the server, as all information is condensed into a packfile before transfer.
Garbage collection is slow and by default triggered on a push (if garbage collection is necessary).
Resource usage is high for every operation that involves the (re-)creation of a packfile, e.g. git upload-pack, git gc.
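The git checkout <ref> -- <path> workaround from the first point above can be seen end to end in a toy repository (the branch name reuses the article’s example; everything else is illustrative):

```shell
set -e
r6=$(mktemp -d); git init -q "$r6"
git -C "$r6" config user.name demo; git -C "$r6" config user.email demo@example.com
mkdir "$r6/templates"
echo v1 > "$r6/templates/page.html"
git -C "$r6" add .; git -C "$r6" commit -qm "templates v1"
main6=$(git -C "$r6" symbolic-ref --short HEAD)

git -C "$r6" checkout -q -b ref-28642-31335
echo v2 > "$r6/templates/page.html"
git -C "$r6" add .; git -C "$r6" commit -qm "templates v2"
git -C "$r6" checkout -q "$main6"

# Update only ./templates to the other branch's state; HEAD stays put...
git -C "$r6" checkout ref-28642-31335 -- templates
git -C "$r6" status --short     # ...so the files show up as modified (staged)
```

Only the named path is touched, which is why this is faster than a full branch switch, and the status output shows the side effect the text describes.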
What about Bitbucket?
Monolithic repositories are a challenge for any Git repository management tool due to the design goals that Git follows, and Bitbucket is no different. More importantly, monolithic repositories pose challenges that need a solution on both the server and client (user) side.
The following table presents these challenges:
While it would be great if Git supported the special use case that monolithic repositories tend to be, the design goals that made Git hugely successful and popular are sometimes at odds with using it in a way it wasn’t designed for. The good news for the vast majority of teams is that really, truly large monolithic repositories tend to be the exception rather than the rule. So, as interesting as this post hopefully is, it most likely won’t apply to a situation you’re facing.
That said, there are a range of mitigation strategies that can help when working with large repositories. For repositories with long histories or large binary assets, my colleague Nicola Paolucci describes a few workarounds.
If your repository has refs in the tens of thousands, you should consider removing refs you don’t need anymore. The DAG retains the history of how changes evolved, and a merge commit points to its parents, so work conducted on a branch can be traced even if the branch doesn’t exist anymore.
In a branch based workflow the number of long lived branches you want to retain should be small. Don’t be afraid to delete a short lived feature branch after a merge.
Consider removing all branches that have been merged into a main branch like master or production. Tracing the history of how changes have evolved is still possible, as long as a commit is reachable from your main branch and you have merged your branch with a merge commit. The default merge commit message often contains the branch name, allowing you to retain this information if necessary.
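The cleanup described above can be sketched with a classic one-liner: list branches already merged into the current branch, skip the branch you’re on, and delete the rest (branch names here are illustrative; run it from your main branch).

```shell
set -e
r7=$(mktemp -d); git init -q "$r7"
git -C "$r7" config user.name demo; git -C "$r7" config user.email demo@example.com
echo base > "$r7/f"; git -C "$r7" add f; git -C "$r7" commit -qm base
main7=$(git -C "$r7" symbolic-ref --short HEAD)

git -C "$r7" checkout -q -b feature/login
echo work > "$r7/g"; git -C "$r7" add g; git -C "$r7" commit -qm "login work"
git -C "$r7" checkout -q "$main7"
git -C "$r7" merge -q --no-ff -m "Merge branch 'feature/login'" feature/login

# Delete every branch already merged into the current branch, except itself.
git -C "$r7" branch --merged | grep -v '^\*' | xargs -r git -C "$r7" branch -d
git -C "$r7" log --oneline -1   # the merge message still names the deleted branch
```

Because `branch -d` (not `-D`) refuses to delete unmerged work, this is safe to run periodically; the commits stay reachable through the merge commit.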
Handling large numbers of files
If your repository has a large number of files (in the tens to hundreds of thousands), using fast local storage with plenty of memory that can serve as a buffer cache can help. Beyond that, this is an area that would require more significant changes to the client – similar, for example, to the changes that Facebook implemented for Mercurial.
Their approach used file system notifications to record file changes instead of iterating over all files to check whether any of them had changed. A similar approach (also using Watchman) has been discussed for Git but has not eventuated yet.
Use Git LFS
For projects that include large files like videos or graphics, Git LFS is one option of integrating your large binary files and limiting the impact on overall performance. My colleague Steve Streeting is an active contributor to the LFS project and recently wrote about the project.
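The setup itself is lightweight. As a sketch (the file patterns are just examples), running git lfs track "*.psd" "*.mov" in a repository writes rules like these into a .gitattributes file, which is committed alongside your code:

```
*.psd filter=lfs diff=lfs merge=lfs -text
*.mov filter=lfs diff=lfs merge=lfs -text
```

From then on, matching files are stored in the repository as small pointer files, with the actual content uploaded to the LFS store – so clones stay small even as the assets grow.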
Identify boundaries and split your repository
The most radical workaround is splitting your monorepo into smaller, more focused Git repositories. Try moving away from tracking every change in a single repository, and instead identify component boundaries – perhaps by finding modules or components that share a release cycle. A good litmus test for clear subcomponents is the use of tags in a repository: do they make sense for other parts of the source tree?
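One way to carve out such a component while keeping its history is git subtree split (shipped with Git as a contrib command; the directory and branch names below are illustrative). A sketch in a toy monorepo:

```shell
set -e
r9=$(mktemp -d); git init -q "$r9"
git -C "$r9" config user.name demo; git -C "$r9" config user.email demo@example.com
mkdir "$r9/ios-client" "$r9/web-app"
echo ios > "$r9/ios-client/app.swift"; echo web > "$r9/web-app/index.html"
git -C "$r9" add .; git -C "$r9" commit -qm "monorepo: ios + web"
echo v2 > "$r9/ios-client/app.swift"
git -C "$r9" add .; git -C "$r9" commit -qm "ios-only change"

# Carve the ios-client directory -- and only its history -- onto a branch
# that can be pushed to a new, smaller repository.
git -C "$r9" subtree split --prefix=ios-client -b ios-client-only >/dev/null
git -C "$r9" log --oneline ios-client-only        # only the commits touching ios-client
git -C "$r9" ls-tree --name-only ios-client-only  # app.swift now at the root
```

The resulting branch can then be pushed to a fresh repository, giving the component its own history, refs, and tags.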
While it would be great if Git supported monorepos elegantly, the concept of a monorepo is slightly at odds with what made Git hugely successful and popular in the first place. However, that doesn’t mean you should give up on the capabilities of Git because you have a monorepo – in most cases there are workable solutions to the issues that arise.
This guest post is written by Mike Neumegen, co-founder of CloudCannon. Mike’s passionate about startups, discovering new technologies, and web design.
Bitbucket provides developers with great workflows for collaborating on software projects. So why can’t we have these workflows when building websites for non-developers?
What if you could deploy websites straight from Bitbucket? What if you could build static websites and have the power of a full blown CMS? What if non-developers could update content and have changes pushed back to your Bitbucket repository?
The new CloudCannon Bitbucket add-on makes all of this possible.
What is CloudCannon?
CloudCannon helps agencies and enterprises build websites for non-developers. Developers build a static or Jekyll site and push it to a Bitbucket repository. CloudCannon synchronizes the files and deploys the site live. Non-developers log in and update the content inline. All changes are synced between CloudCannon and Bitbucket.
The CloudCannon Atlassian Connect add-on allows you to work on websites without leaving Bitbucket. Deploy your static/Jekyll website directly from your repository and have non-developers update content in seconds.
Once you install the add-on, visit one of your repositories and a CloudCannon option will be available in the sidebar. Selecting this option will allow you to create sites from this repository. If you already have CloudCannon sites attached to this repository they will be visible here.
After adding a site, your files are cloned from your selected branch. A few seconds later, you have a live website. CloudCannon automatically compiles, optimizes, and deploys your site to a CDN. Any changes you make on that branch appear on CloudCannon. As a developer you can update locally through Git or using the CloudCannon code editor. Changes made on CloudCannon push back to Bitbucket.
How it works for non-developers
Non-developers update content inline without needing to understand Git or the underlying files. CloudCannon abstracts all of that away with a clean, easy-to-use interface. Just add class="editable" to any element to allow non-developers to edit it inline. They can also update metadata and create blog posts through a simple interface.
Get started with CloudCannon and Bitbucket
The CloudCannon Atlassian Connect add-on enables new workflows for the entire team. Developers can build sites locally and deploy them directly from Bitbucket. Non-developers push changes seamlessly by updating content visually.
If you need help setting up your first site, have a read of our get started tutorial, get in touch at firstname.lastname@example.org, or post a comment below.
Need to store large media files in Git? We’re making major contributions to the Git LFS open source project to help make this happen! Want to know how this came about? What follows is a true story…
Git’s great. When it comes to keeping a handle on your source code, there’s nothing quite as flexible, and developers are adopting it in droves. But there are plenty of teams whose needs haven’t been well met by Git in the past – teams whose projects consist not just of code but also of media files or other large assets. Game developers and web design studios are common examples, and in many cases Git’s inability to handle large files elegantly has meant they’ve had to remain on older source control systems.
I’m well aware of this problem myself, because before I came to work at Atlassian (by way of creating SourceTree), I ran an open source graphics engine called Ogre3d for 10 years. I worked with a lot of teams that kept large textures, models, and so on in their repositories using pre-DVCS tools. In the years after, when I started making Git & Mercurial tools, I kept in contact with many of these people and witnessed frequent problems adopting Git because of large files clogging up repositories, slowing everything down, and making clone sizes intolerable. A few teams used Mercurial’s largefiles extension, but many of them still wished they had the option of using Git as well.
Atlassian has focused on serving the needs of professional software teams for the last 13 years and we’ve heard from many of our customers that they’ve struggled with the transition from legacy source code management systems to Git because of the lack of a good solution for the large files problem. Software teams really like using Git but this one problem is a real spanner in the works for their adoption plans. What could we do?
The quest begins
So, in late 2014, we started seriously looking at this issue. We tried everything that was already out there, and reluctantly concluded that the best way to solve this properly for the long term was to create a new tool. Creating a brand new tool wasn’t our first preference, but we felt that existing tools were either too complicated to configure for a team environment (once all the features you really needed, like pruning, were factored in) or not fully developed and built on technology we didn’t think would scale – so extending them wasn’t attractive.
We chose to write this new tool in Go, a modern language that was both good at producing stand-alone, fast, native binaries for all platforms, but was also fairly easy for most developers to learn; important because we intended to make it open source.
We initially called it Git LOB, and after working on it for a few months we attended Git Merge this May to announce it.
What neither Atlassian nor GitHub realised is that we’d both been working on the same problem! Atlassians and GitHubbers met in the bar the night before our talks, to discover that we’d both:
– made the decision to write a new tool for large files in git
– chosen to write it in Go
– designed the tools in very similar ways
– planned to announce the work on the same day at Git Merge.
Crazy right? Was it really all a coincidence? Turns out that yeah, it absolutely was, we’d both done this completely independently. I guess great minds really do think alike 🙂
We decided it made no sense to fragment the community when Git LOB and Git LFS were clearly so similar in their approach. It wasn’t a complete overlap – there were things Git LOB did that Git LFS didn’t, and vice versa – so the best solution would be to have the best of both. We open sourced our version as a reference, then switched our efforts to contributing to Git LFS instead, starting by porting across features we’d already developed that were useful for Git LFS. We at Atlassian plan to continue collaborating on the Git LFS project for the foreseeable future.
How it’s been going
I’ve really enjoyed working with the community around Git LFS; it’s turning out to be a really productive team effort. I’ve contributed 36 pull requests so far, making me the biggest non-GitHub contributor to the project right now.
If you’re using Git LFS, my name crops up quite a lot in the new features for v0.6 🙂 Many of these features were ported across from Git LOB, but the best thing is, I actually managed to make them better as I did the port, since you can always find improvements when you look at a problem the second time.
While it was a hard decision for me personally to stop working on our own solution, a few months on I’m really happy with the outcome and it was absolutely the right thing to do for the community. I firmly believe we’ll create something even more awesome by working with the open source community, and that I’ll be able to contribute positively to that effort. We at Atlassian feel it makes perfect sense for us to work to a common standard just like we do with Git itself and concentrate on creating great solutions around it, especially those that fit the needs of professional teams.
I’m working on a number of features for future Git LFS versions, including support for pruning, extensions to the SSH support, and more. I’ll also be just generally around the community commenting on stuff and trying to help out. I’ll be at GitHub Universe too, talking about our collaboration as part of the “Using Git LFS” panel on 1st October.
You’ll also be hearing much more about Git LFS support in Atlassian products, which of course includes Bitbucket and SourceTree. Watch this space! In any case, thanks for reading and I hope my little story of random coincidences and community collaboration was interesting. 🙂
We’ve had the chance to talk to many of you about how you’re using Bitbucket Server and Bitbucket Cloud (formerly Stash and Bitbucket.org, respectively). And along the way, we’ve learned a lot about what Bitbucket customers are building.
Simply put, we are in awe.
Take Halogenics, for example. They use Bitbucket to build database and software solutions for the biomedical and education sectors. Their tools, and the research they support, are a literal life-saver for the cancer patients of tomorrow, and a ray of hope for the patients of today and their families.
Or how about Fugro Roames? They built the code that drives a drone to look at vegetation and record geographical data that makes it easier to determine risks when building out electrical power lines. We all need power, right? The code that drives Fugro Roames’ drone, built with Bitbucket, is making it possible for all of us to hang out at home and binge on Netflix or cook dinner for our families with the lights on.
We want you
We know the teams using Bitbucket (including one in three Fortune 500 companies!) are building diverse and innovative products that are moving the world forward. But we want to hear about it first-hand. From you.
To celebrate the software that you’ve poured your heart and soul into, we’re launching #BuiltWithBitbucket. This is your chance to strut your stuff in front of the entire Atlassian user base of 50,000 companies around the world.
We’ll be featuring a different product daily on Twitter, Facebook, LinkedIn, and Google+ to show what diverse products are built with Bitbucket – and to brag about the coding skills of our customers. Here’s what you need to do:
Watch the #BuiltWithBitbucket video (because who doesn’t want to hear about an asteroid heading towards earth and the algorithm, built with Bitbucket, that saved the day?)
Tell us what you have #BuiltWithBitbucket on Twitter and you could receive a surprise from the Bitbucket team. Hint hint, wink wink (ok, you get it).
Don’t be shy. Even if you aren’t curing cancer or revolutionizing power line safety, your day-to-day coding has a bigger impact in the world than you give yourself credit for. And we want to give you credit for it.
So tell us: what are some of the cool things your team has #BuiltWithBitbucket?
It’s no secret anymore that Git is gaining traction within professional teams. 33% of respondents from a Forrester Enterprise Study last year indicated that 60% or more of their source code is managed by Git-based systems. Git surpassed Subversion for the first time in the 2014 Eclipse Community Survey. Git is becoming increasingly popular because of its easy branching model, flexible workflows, and distributed architecture.
At Atlassian, we’re committed to supporting professional teams making the switch to Git with Bitbucket. So, we’re announcing new capabilities today that will be available soon to help you use Git at massive scale:
Git mirroring to increase performance for distributed team members
Git large file storage support to allow collaboration on all file types of any size
Projects to keep your repositories organized
Build status for tighter integration between Bitbucket and Bamboo or any other Continuous Integration vendor
Our goal is to make it easier for professional teams to collaborate and deliver software faster. We’ve already added active-active clustering to ensure continuous uptime for source code management systems, a free business-ready product for small teams, and the first marketplace that allows for the discovery and distribution of third-party add-ons.
Organizations of all sizes, from large enterprises such as Samsung, Splunk, Netflix, and Barclays Capital to small startups like Pinger, Metromile, and Kaazing, are using Bitbucket today. Our JIRA Cloud customers picked Bitbucket as their #1 Git solution. One in three Fortune 500 companies trust Bitbucket and use it every day to build and ship software.
We hope you do as well.
Bitbucket is built for professional teams
Git was not originally designed for professional teams that are agile, distributed, and need secure and extensible workflows. Bitbucket makes it easier for a professional team to use Git:
Features like pull requests, inline comments, and permissioned workflows make it easy for teams to collaborate.
It’s the only Git solution that massively scales with active-active clustering, mirroring, and flexible deployment models.
Teams have access to integrations with the tools they use most, such as JIRA, which provides traceability across issues and source. Developers can also create custom functionality or install from a rich set of third-party add-ons available in the Atlassian Marketplace.
For larger organizations who need Premier Support and Strategic Services to get the most out of Bitbucket, we have already added Atlassian Enterprise.
Bitbucket: a unified brand for professional teams
To make it easier for you to find a collaborative code management solution that best meets your needs, we’ve unified our Git products under the Bitbucket name. With Bitbucket, you now have a range of options that can be adopted by teams of all sizes and requirements: Bitbucket Cloud (previously known as Bitbucket), Bitbucket Server (previously known as Stash) and Bitbucket Data Center (previously known as Stash Data Center).
Get started with Bitbucket: Git your way
We have a solution for teams of all sizes and needs – collaborate on code self-managed or in the cloud, and use Git via the command line or SourceTree. If you’re new to Git, head over to “Getting Git right”; if you’ve already decided to switch, try Bitbucket today.
Note: If you’re an existing customer of Stash or Bitbucket and have more questions, please visit the Bitbucket Rebrand FAQ.
Two-step verification secures your account by requiring a second component, in addition to your password, to access your account. That second step means your account stays secure even if your password is compromised. Bitbucket’s implementation is built on the Time-based One-time Password (TOTP) algorithm, to ensure compatibility with mobile apps like Authy and Google Authenticator.
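For the curious, a sketch of the math behind those six digits (RFC 6238 on top of RFC 4226): HMAC-SHA1 the 30-second time counter with the shared secret, then dynamically truncate. This uses the RFCs’ published test secret and a fixed counter instead of the real clock, so the output matches the RFC 6238 test vector:

```shell
set -e
# RFC 4226/6238 shared test secret: the ASCII bytes "12345678901234567890"
secret_hex=3132333435363738393031323334353637383930
counter=1        # real TOTP uses: current unix time / 30; T=59s gives counter 1

# Step 1: HMAC-SHA1 over the 8-byte big-endian counter
mac=$(printf '%016x' "$counter" | xxd -r -p |
      openssl dgst -sha1 -mac HMAC -macopt "hexkey:$secret_hex" |
      awk '{print $NF}')

# Step 2: dynamic truncation -- the low nibble of the last byte picks a
# 4-byte window, masked to 31 bits, then reduced to 6 decimal digits
offset=$((16#${mac:39:1}))
code=$(( (16#${mac:$((offset * 2)):8}) & 0x7fffffff ))
totp=$(printf '%06d' $((code % 1000000)))
echo "$totp"     # RFC 6238's SHA-1 test vector for T=59s: 287082
```

Apps like Authy and Google Authenticator run this same computation on your phone, which is why the codes match without any network connection.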
Two-step verification is optional. You’ll find it in your Bitbucket account settings, which will take you through the onboarding process. More information can be found here.
The Bitbucket team has been hard at work adding features to make your day more productive and Bitbucket even more wonderful to use.
Et voilà… Snippets API, wiki edit restrictions, and relative links in READMEs and Snippets.
Snippets API
Snippets become even more of a first-class citizen among Bitbucket resources with the addition of APIs to create, rename, update, and delete snippets – and more. See the documentation here. We’ve just made it easier for you to update your snippets from any third-party application.
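As a hedged sketch of the create call (the credentials are placeholders, and the request follows the shape of Bitbucket’s 2.0 API): a snippet is created with one multipart POST. The block prints the request instead of sending it:

```shell
# Dry run: print the request. Swap in your user:app_password to run it for real.
snippet_file=$(mktemp)
echo 'print("hello from a snippet")' > "$snippet_file"
echo curl -s -u user:app_password \
  -F title="Hello snippet" \
  -F file=@"$snippet_file" \
  https://api.bitbucket.org/2.0/snippets
```

The same resource accepts PUT and DELETE for updating and removing snippets, which is what makes scripting them from third-party tools practical.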
Wiki edit restrictions
Wiki editing is now restricted by default to those users with write permissions to the parent repository to prevent spamming on wikis. There is a new setting in the repository admin to toggle between this behavior and the previous behavior, where all users with any repo access could edit the wiki.
Relative links in READMEs and Snippets
A fix, really. This answers the very popular issue (#6513) on our public issues board.
Now you can use relative links in READMEs for both files and images. This also makes it possible to link to files from issues, changesets, and pull requests. Images can also be embedded in Snippets. Both “.” and “..” path notations are supported, and you can now start a link with “/” to make it relative to the repository root.
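For instance, a README at the repository root can now contain links like these (the paths are illustrative):

```
[Contributing guide](docs/CONTRIBUTING.md)   relative to the README's directory
[Build script](../scripts/build.sh)          ".." walks up one directory
![Logo](/assets/logo.png)                    a leading "/" means the repository root
```

The same paths keep working when the repository is cloned, since they resolve against the file tree rather than a fixed URL.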
We owe you an explanation for the recent service interruption. On Wednesday, our users were unable to push or pull over SSH or HTTPS, access the GUI or API, or complete important tasks on time. We violated the cardinal rule of SaaS – “don’t have downtime” – and we’re very sorry for all the trouble we’ve caused. We know that you rely on Bitbucket for your daily work and when our service isn’t working correctly it affects your productivity.
I’d like to take a little time to explain what happened, how we handled it, and what we’re doing to avoid a repeat of this situation in the future.
What actually happened?
At 11:50 UTC on Wednesday, September 2nd, we noticed a drop in the webhook processing rate. Our Site Reliability Engineers (SREs) escalated the issue to the Bitbucket team. At the same time, we noticed that the load balancer queues for every other service – SSH, Git over HTTPS, Mercurial over HTTPS, the API, the GUI – had begun growing very, very quickly. By 12:10 UTC, all services were failing for users.
Response times soon skyrocketed for Git over HTTPS.
Initially, the issues seemed unrelated. Application nodes all showed a reasonable load for that time of the day, and our RabbitMQ processes had not thrown any alerts. As our investigation later revealed, we had experienced severe resource exhaustion on the cluster serving both Redis and RabbitMQ, and the kernel’s out-of-memory killer began terminating both sets of services. This led to a split brain in RabbitMQ, as individual RabbitMQ instances were unable to communicate with each other and select a consistent master. Additionally, the application nodes’ Celery workers were unable to publish their background tasks consistently to RabbitMQ, thus blocking normal processing. With application processes blocked, the load balancers’ queues backed up, and soon the HAProxy daemon handling SSH traffic crashed outright.
By 12:30 UTC, we had focused our attention on the load balancer queues. Restarting haproxy-ssh did not work at first, as the kernel’s SYN backlog was still full, but once enough broken connection requests had timed out we were able to restore haproxy-ssh at 12:47 UTC. By that point, RabbitMQ had stabilized to the point that Celery workers were once again able to publish. Between the load balancer fix and the resumption of the Celery workflow, our backlog of connection requests started to come down.
With traffic moving again, we started looking for a root cause. There were a few false leads: network changes, software deployments, back-end storage, and new RabbitMQ/Redis hardware. By 14:00 UTC, we came to a consensus that RabbitMQ’s failure was the reason we had seen so many problems. Unfortunately, Celery consumers were still unable to connect reliably to the queues, so the queues began growing very, very quickly, adding roughly 250,000 new tasks between 13:31 UTC and 13:59 UTC.
We spent the next several hours attempting to get Celery consuming again. However, the sheer size of the task queues, combined with the backlog of other services and our normal level of traffic for a Wednesday, meant slow going. At several points, consumer workers lost their ability to communicate effectively with their master processes. This kept the queues from being consumed consistently – workers would take a few tasks, run them, then die, all within the space of a few minutes.
Once Celery was fixed, it began handling a sizable message backlog.
By 19:27 UTC, we had managed to troubleshoot Celery enough to get it consuming tasks consistently, though slowly. We still had two million backlogged tasks to process, and even with a relatively quiet back end that takes time. By 21:00 UTC, we had reduced the backlog to about 3,500 tasks, and we resolved the incident as soon as we were confident the backlog wouldn’t grow again.
What are we doing to keep this from happening again?
There are a few actions we’re taking immediately:
We’ve migrated Redis to new hardware, separate from RabbitMQ, so that we can avoid resource contentions like this one in the future.
We’ve corrected a flaw in a script that truncates old keys in Redis; had it been working, it could have helped us notice the Redis cluster problems sooner. The script has already purged about 100,000 dead keys, and it’s still going.
Our developers are re-evaluating their usage of Redis and RabbitMQ, and they are preparing adjustments that should reduce unnecessary traffic to either cluster.
Last but not least, the outage exposed some holes in our monitoring, especially centered around RabbitMQ and Celery. We’re upgrading our monitors to improve problem detection, and we’re rewriting SRE runbooks so that there’s a much clearer plan of action.
Thank you for using our service. We know that many of you rely on it. We truly strive every hour of every day to keep Bitbucket running reliably, so you can get stuff done and create amazing software.