
From scripts to software: scaling beyond Jenkins with large monorepos


At the beginning of every software project, teams face fundamental questions: 

  • How will we organize our code? 
  • How will we build and deliver our software?

For many, the answers are monorepos and Jenkins. It's easy to see why: monorepos bring everything together — projects, dependencies, docs, tooling, coding standards — and can help teams collaborate and enable velocity in remarkable ways. And Jenkins, as an open-source, self-hostable platform with a large community, can seem like a compelling choice for teams looking to grab the reins of CI and control their own destinies.

But while these two in combination can work well for a while, at a certain point, many teams find themselves hitting the limits of what Jenkins can do for them — both as a platform and as a tool for managing the complex challenges that large monorepos present. 

Fortunately, there's a way forward. With the right set of tools, and a fundamental shift in thinking — from viewing pipelines as static collections of build scripts to treating them as dynamic software applications of their own — teams can break free of these constraints and unlock new possibilities. In this post, you'll learn how successful teams are making this shift, and how it's helping them not only make monorepos work at massive scale, but also transform how their organizations ship software.

How it goes: The journey to complexity

The story typically goes something like this: 

Early on, life is good. You’ve got a few projects checked into your new monorepo — a web front-end, a service backend, and a shared library, all written in one language — and you’re able to build, test, and ship it all into production in a couple of minutes with a single Jenkins controller and a couple of workers. It all works, you’re running the show from commit to deployment, and getting value into the hands of your customers several times a day. Everyone’s happy.

Over time, though, things start to get complicated:

  • As the team grows, PR volume goes up. This is of course what you want (growth is good!), but more PRs means running more PR jobs, which in turn means having to wait for available job runners. Queueing ensues, pushing build times up slightly, so the team adds a few more Jenkins workers to handle the load.
  • Fast forward a few months, and the still-growing team has now added several more projects, now in multiple languages, each with its own build tools, dependencies, and tests — lots of tests. Build times have crept up to almost an hour, so the team adds several more Jenkins workers, refactoring the pipeline’s Groovy scripts to run more steps in parallel to bring build times back down to more reasonable levels.
  • That works for a while — but fast forward a few dozen more engineers, and the repo’s now so active that the team’s pushing the limits of what the Jenkins controller can support. At peak times, it crashes, taking all running jobs down with it — including the occasional deployment to production. This prompts the team to add a second Jenkins server for failover (along with a load balancer and some shared storage to hold them together), and that helps, but it doesn’t give them any more throughput; it only keeps Jenkins itself from blocking the path to production.

Now fast forward an order of magnitude or two and you begin to see how this looks at enterprise scale: one pair of Jenkins controllers becomes ten, then twenty, all held together with load balancers, shared storage, monitoring, networking, and an increasing number of Jenkins experts to keep it all up and running. More growth leads to more load, which leads to more crashes, more controllers, more humans to manage it all... and on it goes, rising in lockstep with the size of the organization. 

What’s driving this narrative: The challenges of building monorepos at scale

Why is this story so common? We’ve spoken to a lot of teams and we find that the challenges with monorepos are universal; Jenkins is just one way they’re exposed and exacerbated. The bottom line is that there’s constant work needed to keep build times under control as contributions increase. Regardless of your infrastructure, you will have to confront two major challenges.

First, the need for massive concurrency. As commit volume rises (to thousands per day in some cases), you need to run more and more jobs in parallel. That concurrency isn’t just for commits, though; a single commit might be split into hundreds of individual jobs to distribute the work of building dozens of projects, running thousands of tests, uploading packages, running deployments, and more, across as many processes as possible. Concurrency being the main lever for keeping build times under control, teams lean into it as heavily as they can — but that only works when the underlying infrastructure is there (and elastic enough) to support it.

Next, the need for much more control over pipeline dynamics. As monorepos gather more projects, and the relationships between those projects become more complex, teams search for ways to make pipelines efficient. Avoiding unnecessary work is the name of the game here, and one way to do that is by building only the code that’s changed — for example, with selective builds and path-based filters. That works too, but when the change is to a shared library, testing only that library doesn’t make sense; you’ll usually want to test some or all of its consumers as well, to guard against regressions. Which is where things get complicated.
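To make that concrete, here’s a minimal sketch, in Go, of the kind of selective-build logic teams end up writing: diff the change against the main branch, map changed files to projects, and expand any shared-library change to its known consumers. (The project paths and the consumers table are hypothetical; in a real monorepo, that mapping would come from a build graph rather than a hand-maintained table.)

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// consumers maps each shared library to the projects that depend on it.
// Hypothetical paths; a real version would derive this from a build graph.
var consumers = map[string][]string{
	"libs/auth": {"services/api", "web/frontend"},
}

// changedProjects returns the set of projects to build: every project
// with a changed file, plus the consumers of any changed shared library.
func changedProjects(base, head string) (map[string]bool, error) {
	out, err := exec.Command("git", "diff", "--name-only", base, head).Output()
	if err != nil {
		return nil, err
	}
	projects := map[string]bool{}
	for _, file := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		parts := strings.SplitN(file, "/", 3)
		if len(parts) < 2 {
			continue // top-level file; a real version might trigger a full build
		}
		project := parts[0] + "/" + parts[1]
		projects[project] = true
		// A change to a shared library implicates everything that uses it.
		for _, consumer := range consumers[project] {
			projects[consumer] = true
		}
	}
	return projects, nil
}

func main() {
	projects, err := changedProjects("origin/main", "HEAD")
	if err != nil {
		panic(err)
	}
	for p := range projects {
		fmt.Println(p) // one build/test job per project goes here
	}
}
```

Even this toy version shows why the logic quickly outgrows path-based filters: the hard part isn’t matching paths, it’s knowing who depends on what.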

Monorepos also tend to attract merge conflicts, which can bring the release train to a halt and leave the main branch wedged, blocking the path to production. To address this, teams frequently introduce merge queues — but since merge queues also intentionally slow the train down, the moment you add one, you start looking for ways to speed things back up — e.g., by moving higher-priority changes to the front of the line, or combining multiple non-conflicting changes into a single job to save time. All of these scenarios call for weaving more logic into your pipeline definitions, often at runtime — and that’s not easy to do when the languages you're working with are Groovy and Bash.
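To illustrate the kind of logic involved, here’s a toy sketch, in Go, of a merge queue that moves higher-priority changes to the front and batches changes into groups that can be validated by a single CI run. (The data shapes are invented, and path overlap stands in for a real conflict check.)

```go
package main

import (
	"fmt"
	"sort"
)

// change is a queued merge candidate; the fields are illustrative.
type change struct {
	id       string
	priority int             // higher merges sooner
	paths    map[string]bool // files the change touches
}

// batchQueue orders the queue by priority, then groups entries that
// touch disjoint paths so each batch can be validated together.
func batchQueue(queue []change) [][]change {
	sort.SliceStable(queue, func(i, j int) bool {
		return queue[i].priority > queue[j].priority
	})
	var batches [][]change
	for _, c := range queue {
		placed := false
		for i := range batches {
			if !conflicts(batches[i], c) {
				batches[i] = append(batches[i], c)
				placed = true
				break
			}
		}
		if !placed {
			batches = append(batches, []change{c})
		}
	}
	return batches
}

// conflicts reports whether c touches any path already touched by the batch.
func conflicts(batch []change, c change) bool {
	for _, other := range batch {
		for p := range c.paths {
			if other.paths[p] {
				return true
			}
		}
	}
	return false
}

func main() {
	queue := []change{
		{"pr-101", 0, map[string]bool{"web/app.ts": true}},
		{"pr-102", 9, map[string]bool{"services/api.go": true}},
		{"pr-103", 0, map[string]bool{"web/app.ts": true}},
	}
	for i, b := range batchQueue(queue) {
		fmt.Printf("batch %d:", i)
		for _, c := range b {
			fmt.Printf(" %s", c.id)
		}
		fmt.Println()
	}
}
```

Real merge queues are far more involved (speculative builds, retries, fairness), but the shape is the same: queue policy is program logic, not static configuration.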

These are far from the only challenges. Caching is one, across many dimensions (the repo itself being the first — monorepos get big). Visibility is another: it’s tough to know what’s going on with a given change when it’s splintered across a half-dozen Jenkins UIs.

In pursuit of the primary goals, though — keeping build times down and the main branch shippable — it largely comes down to these two. And your ability to succeed at both relies directly on the scalability and flexibility of your delivery platform. 

You can make Jenkins work — but should you?

All this said, it’s certainly possible to make Jenkins work with a large monorepo if you’re committed — and we’ve seen some teams put a ton of effort into doing so. With significant investment in custom tooling and infrastructure, and a staff of specialists with deep Jenkins experience, you can make it happen. A few things we’ve seen work:

  • Adding yet more Jenkins controllers as described — and with each one, all of the compute, networking, shared storage, load balancing, monitoring, and humans to support it. 
  • Building out a publicly accessible API endpoint that pulls all of those individual controllers behind a single URL to handle callbacks from your source-code provider, so you can route a given code change to the right controller.
  • Building out the orchestration to gather up all of the job statuses across all of your organization’s Jenkins controllers so you can capture and report a collective status for a given change back up to your source provider — for example, as a GitHub check status.
  • Building abstractions on top of all of these running Jenkins controllers (e.g., custom-built internal front-ends) to make it possible for your developers to find and debug their builds when something goes wrong — or even just track their build as it moves through the queue.
  • Wrapping everything up with an infrastructure-as-code tool like Terraform or Pulumi to make it easier to deploy and manage all of these (and future) Jenkins controllers, workers, shared storage, load balancers, abstraction layers, etc.

But here’s the thing: Look closely and you’ll see that every one of these is a workaround — an attempt to fix something most teams would prefer just worked.

Worse, they only address the concurrency half of the problem. The other half — the need for more precise, programmatic control over the definition and behavior of the pipeline — remains, and as monorepos grow, that’s where the majority of the complexity lies (and where most teams would like to spend most of their time). 

Unfortunately, that part doesn’t really have a workaround; Jenkins pipelines, in the end, bottom out on Groovy and Bash scripts — and there’s only so far you can get with them in terms of expressive capability (not to mention maintainability).

So where does that leave a team that’s found itself in this situation? What’s next?

Breaking out of the loop: from scripts to software-driven pipelines

It’s easy for teams to get bogged down in this operational loop of trying to make Jenkins work — so bogged down, and for so long, that they forget to ask whether it makes sense to go on doing so.

Most high-velocity teams, however, eventually realize it’s holding them back and that they need a different way forward. A key part of that, as we’ve learned from some of our largest customers, is to stop thinking of the delivery pipeline as a statically defined collection of shell scripts, and instead to begin treating it as a flexible, dynamic, constantly evolving software application of its own.

What does that look like? At first, it might just mean pulling some of your gnarlier pipeline logic out of Groovy and Bash and into freestanding programs written in modern programming languages like TypeScript, Python, or Go. Small, incremental changes like these can go a long way toward making even Jenkins-based monorepo pipelines more flexible and maintainable. 
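For instance, the test-sharding logic that might otherwise live as Bash in a Jenkinsfile can become a small standalone program, testable and reviewable like any other code. A rough sketch, with the invocation and file layout as assumptions:

```go
// shardtests prints the test files assigned to one shard of a parallel
// test run. Usage: shardtests <shard-index> <shard-count>
package main

import (
	"fmt"
	"hash/fnv"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

func main() {
	index, _ := strconv.Atoi(os.Args[1])
	count, _ := strconv.Atoi(os.Args[2])
	filepath.Walk(".", func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, "_test.go") {
			return err
		}
		// Hash the file name so the assignment is stable across runs
		// and spread roughly evenly across shards.
		h := fnv.New32a()
		h.Write([]byte(path))
		if int(h.Sum32())%count == index {
			fmt.Println(path)
		}
		return nil
	})
}
```

Each of your N parallel workers runs the same binary with its own index and executes only the files it prints; the same trick works for any language’s test suite.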

But the big wins come when you realize you can drive the whole pipeline in one of those languages, and then start doing that. Being able to define and shape the behavior of the pipeline as it unfolds — based on the content of the change, the depth of the queue, the number of tests to be run, or whether the step that just ran finished successfully — gives you a whole other level of power and flexibility that you can use to unlock higher levels of efficiency as your team grows.

By shifting to software-driven pipelines, you can:

  • Define the characteristics of your pipelines and steps programmatically based on the conditions of the environment — for instance, in response to asynchronous calls to other systems 
  • Trigger additional pipeline steps (or even whole other pipelines) based on the outcomes of other steps, and without having to shell out to Bash
  • Analyze the output of specialized tools like Bazel (e.g., with bazel query) to expand the pipeline in response to a given change — for instance, to run the integration tests of all of a shared library’s consumers to guard against regressions (see the sketch after this list)
  • Calculate the number of agents you’d need to compile all of the build targets of a given change, and then spread those jobs evenly across all of those agents
  • Extract some or all of your pipeline logic — error handling, notifications, access to secrets stores, and more — into a shared library, and then make that library available to other teams in your organization to use in their own processes
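Here’s a minimal sketch of the second and third bullets in action, in Go. Run as the first step of a build, it asks bazel query for everything that depends on a changed shared library (the //libs/auth label is hypothetical), generates one test step per affected target, and pipes the result to buildkite-agent pipeline upload, Buildkite’s mechanism for appending steps to a running build (more on that below). Error handling is trimmed for brevity.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	// Ask Bazel for every target that depends on the changed library.
	// (A real version would filter to test targets, e.g., with kind().)
	out, err := exec.Command("bazel", "query",
		"rdeps(//..., //libs/auth:auth)").Output()
	if err != nil {
		panic(err)
	}

	// Generate one test step per affected target, as pipeline YAML.
	var pipeline strings.Builder
	pipeline.WriteString("steps:\n")
	for _, target := range strings.Fields(string(out)) {
		fmt.Fprintf(&pipeline,
			"  - label: \"test %s\"\n    command: \"bazel test %s\"\n",
			target, target)
	}

	// Hand the generated steps to the running agent, which appends
	// them to the current build.
	upload := exec.Command("buildkite-agent", "pipeline", "upload")
	upload.Stdin = strings.NewReader(pipeline.String())
	upload.Stdout, upload.Stderr = os.Stdout, os.Stderr
	if err := upload.Run(); err != nil {
		panic(err)
	}
}
```

Because the program runs inside the build, it can consult anything it likes (the diff, the queue depth, another system entirely) before deciding what the rest of the pipeline looks like.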

Dynamic, software-driven pipelines are also a big part of the Delivery First mindset, and at the core of how many of our largest customers — Uber, Rippling, Elastic, and others — deliver their large monorepos successfully. Unfortunately, though, this level of flexibility isn’t possible with Jenkins, as Jenkins pipelines must be written in Groovy and defined statically, before the pipeline begins; they can’t be modified or extended at runtime.  

But it is possible with Buildkite — specifically with dynamic pipelines — and with the added benefit of unlimited concurrency and a managed, scalable control plane that you never have to think about. Here's a webinar, for example, in which engineers from the Uber team describe how they use Go to manage the pipeline of their 50-million-plus-line monorepo with Buildkite, after migrating from Jenkins:

Webinar: Monorepos at scale: Building CI for 1,000 daily commits at Uber (recorded June 26, 2024; 40 minutes)

Large monorepos are nothing if not complex — and modern software delivery is nothing if not fundamentally dynamic. With Buildkite, you have all the tools you need — including unbounded scale and a unified control plane developers love — to handle both. 

Schedule a demo with one of our experts to learn more, or get started with a free trial on your own today.

