Philipp Wollermann is a Staff Software Engineer at Google. He was on the open-source team of Google's build system Bazel and worked on Bazel for over eight years.
At UnblockConf ‘21 Philipp shared the things the Bazel team learnt while implementing their own Bazel CI on top of Buildkite, and some Bazel tooling tips and best practices that will help speed up your own build pipelines.
Bazel is Google’s open-source build system, created in 2015 to solve specific issues Google was tackling. Since then, it has been developed with the community and continues to grow in popularity. Bazel can provide considerably faster build times: it recompiles only the files that need to be recompiled and can skip re-running tests whose inputs haven’t changed.
About Bazel:
- It’s an open-source build and test tool created by Google and their community.
- It has a high-level, human-readable build language.
- It’s fast, reliable, hermetic, incremental, parallelized, and extensible.
- It supports multiple languages, platforms, and architectures.
Bazel CI is Bazel’s custom CI/CD system for testing and releasing Bazel and its ecosystem. Bazel CI is built on top of Buildkite with some custom VCS integrations, a configuration DSL, and its own infrastructure, all written and maintained by the Bazel team. The Bazel team uses Bazel CI and Buildkite for pre-submit, post-submit, and downstream testing. Downstream testing means testing Bazel against projects that use Bazel, in order to minimise regressions (with automatic culprit finding) and to maximise stability when building and deploying Bazel releases.
“The problems you encounter and the way you structure your CI setup highly depends on how your source code is structured.
Perhaps you have:
- a monorepo
- or a collection of project files
- or maybe just one big project file (called a WORKSPACE file in Bazel)
- or you might be using the classic open-source approach of having a set of completely independent Git repositories, hosted on a remote source control manager such as GitHub or Bitbucket.
Getting started is relatively easy for all of these cases.”
Philipp Wollermann
In most cases, starting out with Buildkite and Bazel is relatively easy, according to Philipp. You can run the following commands in a pipeline step:
bazel build //src:bazel
bazel test --build_tests_only //...
This approach works and performs quite well, but if you have a large monorepo or a large collection of repos, it isn't likely to scale. The ... in //... is a wildcard, which means all the tests in the repository will be run. You can use the --build_tests_only flag to prevent building non-test targets during bazel test, but for huge repositories it can still be slow, computationally intensive, or even fail completely.
So, once you've got Buildkite and Bazel working together, what happens when you hit the limits of what the initial implementation can offer? Let's take a look at some common problems and Philipp's suggested solutions.
“Time spent waiting for your build to complete is okay if it’s post-submit, but if you’re waiting for pull requests to build, developers get very uneasy waiting around for test feedback," says Philipp.
“Combine Buildkite’s native sharding support with bazel query to turn one big job into n jobs that take roughly 1/n time.”
Philipp Wollermann
Bring down build times with:
- bazel query, to expand the ... wildcard in bazel test --build_tests_only //... into the full list of test targets.
Here’s an example Python script that Philipp extracted from the team’s Bazel CI logic. It illustrates how to implement target sharding by combining Buildkite’s parallelisation with bazel query to significantly reduce build times.
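The script mentioned above is not reproduced here. As a separate, minimal sketch of the same idea (not the Bazel team's implementation), the Python below expands the wildcard with the tests() query function and keeps every n-th target for the current shard. It assumes the step is configured with Buildkite's parallelism setting, which exposes BUILDKITE_PARALLEL_JOB and BUILDKITE_PARALLEL_JOB_COUNT to each job. The function names are illustrative only.

```python
import os
import subprocess


def list_test_targets(pattern="//..."):
    """Expand a wildcard pattern into concrete test targets via bazel query."""
    output = subprocess.check_output(
        ["bazel", "query", f"tests({pattern})"], text=True
    )
    return [line.strip() for line in output.splitlines() if line.strip()]


def select_shard(targets, shard_index, shard_count):
    """Return every shard_count-th target, starting at shard_index.

    Sorting first makes sure every parallel job sees the same ordering,
    so the shards never overlap and together cover all targets.
    """
    return sorted(targets)[shard_index::shard_count]


def main():
    # Buildkite sets these variables for steps that use `parallelism: n`.
    shard_index = int(os.environ.get("BUILDKITE_PARALLEL_JOB", "0"))
    shard_count = int(os.environ.get("BUILDKITE_PARALLEL_JOB_COUNT", "1"))
    targets = select_shard(list_test_targets(), shard_index, shard_count)
    if targets:
        # "--" separates flags from the explicit target list.
        subprocess.check_call(
            ["bazel", "test", "--build_tests_only", "--"] + targets
        )
```

Calling main() from a step with parallelism set to n would run roughly 1/n of the test targets in each job. Round-robin assignment ignores test duration, so shards may still finish at different times.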
It's true that Bazel is an incremental build system with built-in caching. Users should be able to rely on running bazel test //... to limit the scope of tests that are re-run to those relating to changes in a commit. According to Philipp, "this works, but eventually your repository might become so large that provisioning machines with enough RAM and CPU for Bazel to be able to work on the full dependency graph is too expensive, or even impossible".
The answer, then, is to calculate affected targets computationally. You can use path-based triggers that look at the git commit triggering the job. The optimal solution is to calculate affected targets by using bazel query.
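As a rough illustration of calculating affected targets with bazel query, the sketch below lists the files changed between two commits and asks Bazel for the tests that depend on them, using the rdeps() and tests() query functions. It is an assumption-laden sketch rather than Bazel CI's logic: the function names and the default commit range are illustrative, and it does not handle deleted files or changes to BUILD files specially.

```python
import subprocess


def changed_files(base="HEAD~1", head="HEAD"):
    """List files that differ between two commits (paths relative to the repo root)."""
    output = subprocess.check_output(
        ["git", "diff", "--name-only", base, head], text=True
    )
    return [line for line in output.splitlines() if line]


def affected_tests_query(files, universe="//..."):
    """Build a query for all tests in `universe` that depend on any of `files`."""
    return f"tests(rdeps({universe}, set({' '.join(files)})))"


def affected_tests(files):
    """Run the query and return the affected test labels."""
    if not files:
        return []
    # --keep_going lets the query continue past files that are not part of
    # any Bazel package; Bazel then exits with code 3 (partial success).
    result = subprocess.run(
        ["bazel", "query", "--keep_going", affected_tests_query(files)],
        stdout=subprocess.PIPE,
        text=True,
    )
    if result.returncode not in (0, 3):
        raise RuntimeError(f"bazel query failed with exit code {result.returncode}")
    return [line for line in result.stdout.splitlines() if line]
```

If affected_tests(changed_files()) returns an empty list, the job can skip bazel test entirely; otherwise the returned labels can be passed to bazel test in place of //... .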
Helpful resources for getting started:
Target skipping means you’re not only optimising your builds, you’re also keeping your infrastructure and cloud-compute costs under control.
There are a number of reasons why you might want to split one big pipeline for your repository into multiple pipelines. You might be:
Philipp suggests applying the 1:1:1 rule:
It’s important to remember that your monorepo’s webhook will continue to fire on every commit, so with this new architecture many commits will be irrelevant to an individual team, project, or pipeline. Ideally, you would implement target skipping to reduce the noise once you have re-architected your CI to have more granular pipelines.
Inevitably, as the size of your repository grows, the git clone step in your pipeline will become increasingly slow and painful. Repeatedly downloading Bazel (and other utilities) also wastes network resources that could be better used elsewhere.
Philipp recommends one of these approaches to reduce the time spent waiting for a build to start:
- Use the Buildkite agent's git-mirrors feature.
- git clone --bare your repositories into the same directory structure that the agent's git-mirrors feature expects, and update them regularly (e.g. daily) via git fetch.
- Store a tar.gz copy of this directory structure somewhere and extract it into the right location when building your agent image.
- buildkite-agent then only has to git fetch the difference, saving a lot of network bandwidth and time.
- Related scripts: gitbundle.sh, docker.sh
Initially the Bazel team had "a classic CI setup – most things were shell scripts, project owners had particular ideas how these shell scripts should be written, we ended up with 50 different ways to do roughly the same thing. The result meant our CI was very difficult to reason about and modify.”
Philipp Wollermann
The team wanted Bazel CI to have a high-level abstraction that worked with Bazel’s primitives. To achieve this, they designed a custom high-level DSL that all projects on Bazel CI use.
platforms:
  # ...
  windows:
    build_targets:
      - //...
    test_targets:
      - //...
      # Windows doesn't have a `python3` executable on PATH.
      - -//:py3_bazelisk_test
    test_flags:
      - --flaky_test_attempts=1
      - --test_env=PATH
      - --test_env=PROCESSOR_ARCHITECTURE
      - --test_output=streamed
A list of platforms can be passed in. Each of the platforms can:
- specify build_targets
- specify test_targets, and exclude particular tests if they don't need to be run on a particular platform

steps:
  - command: |-
      curl -fsS "https://raw.githubusercontent.com/bazelbuild/continuous-integration/master/buildkite/bazelci.py?$(date +%s)" -o bazelci.py
      python3 bazelci.py project_pipeline --file_config=.bazelci/presubmit.yml --monitor_flaky_tests=true | buildkite-agent pipeline upload
    label: ":pipeline:"
    agents:
      - "queue=default"
Pipelines have a single step configured in the Buildkite web UI that:
- downloads the bazelci.py script from the repository
- runs the bazelci.py script, which generates the rest of the jobs from the DSL programmatically and uploads them with the buildkite-agent pipeline upload command
This DSL helped Bazel CI users a lot, transforming it “into a self service system that is very easy to modify and understand. Users no longer have to deal with the implementation details of how to do things in multiple platforms”, instead having just one high-level, consistent format for everything.
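To make the idea of generating jobs from a DSL more concrete, here is a small hypothetical sketch. It is not how bazelci.py works internally. It takes a platforms mapping shaped like the example config and emits one Buildkite command step per platform as JSON, which buildkite-agent pipeline upload can read from standard input. Using the platform name as the agent queue is purely an assumption for illustration.

```python
import json
import shlex


def step_for_platform(name, config):
    """Translate one platform entry into a Buildkite command step (hypothetical mapping)."""
    commands = []
    build_targets = config.get("build_targets", [])
    test_targets = config.get("test_targets", [])
    if build_targets:
        # "--" keeps excluded targets such as "-//:foo" from being parsed as flags.
        commands.append(shlex.join(["bazel", "build", "--"] + build_targets))
    if test_targets:
        flags = config.get("test_flags", [])
        commands.append(shlex.join(["bazel", "test"] + flags + ["--"] + test_targets))
    # Assumption: each platform has an agent queue with the same name.
    return {"label": name, "command": commands, "agents": {"queue": name}}


def generate_pipeline(config):
    """Build the full pipeline document from the `platforms` section."""
    steps = [step_for_platform(n, c) for n, c in config["platforms"].items()]
    return {"steps": steps}


EXAMPLE_CONFIG = {
    "platforms": {
        "windows": {
            "build_targets": ["//..."],
            "test_targets": ["//...", "-//:py3_bazelisk_test"],
            "test_flags": ["--flaky_test_attempts=1", "--test_output=streamed"],
        }
    }
}

PIPELINE_JSON = json.dumps(generate_pipeline(EXAMPLE_CONFIG), indent=2)
```

Printing PIPELINE_JSON and piping it into buildkite-agent pipeline upload would add the generated step to the running build. Project owners only edit the declarative config, while the translation logic lives in one place.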
“We made it so easy to create new pipelines (with our new DSL) that over time we accumulated a lot of pipelines. Eventually, we needed to update a plugin in the bazelci.py script. It was then I realized I’d probably have to open a hundred Chrome tabs and make this change manually – I didn't see any other way.”
Philipp Wollermann
Not a great way to maintain Buildkite at scale.
Fortunately there’s a better way to manage your Buildkite pipelines at scale! The Bazel team migrated their pipelines to the official Terraform Buildkite Provider, meaning:
You can check out the Bazel team’s GitHub repository, which contains all the Terraform configuration files used to manage their Buildkite pipelines.
With Buildkite, you’re in control of your own agent infrastructure, and this means deciding whether to use stateful or stateless workers.
Stateful workers give you the ability to cache things on disk, such as build system output and git clones. This improves performance and makes auto-scaling simpler. The downside is that it’s risky to test untrusted code on stateful workers: third-party pull requests may have unintended, negative side effects that impact future builds. For this reason, wherever possible, Bazel CI uses stateless workers.
Stateless workers are configured to:
Physical machines:
Philipp’s recommended approach to managing bare-metal stateless workers safely:
- kill -9 (or equivalent)
Here’s an example of a clean-up script from Bazel CI for Buildkite Agents on macOS.
If you’re using stateful workers, you can make use of Bazel’s local caching features: repository caching and disk caching. Bazel is incremental by default and should only do the minimum work required between two builds.
bazel test --repository_cache=/home/buildkite/repocache
bazel test --disk_cache=/home/buildkite/diskcache
Having access to these caches means Bazel can access the outputs of an action immediately without needing to re-run it, even if Bazel is shut down between jobs or bazel clean is run.
Unfortunately, stateless workers are unable to utilize local caches, as the cache is lost when the machine gets replaced after each job. In this case, a remote cache can be used. Bazel can query and upload cache entries for build outputs and test results to a shared remote cache, using HTTP (for example, a WebDAV server) or a gRPC-based API.
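The choice between local and remote caches can be captured in a few lines. The sketch below is only an illustration: it returns --repository_cache and --disk_cache for stateful workers and --remote_cache when a cache endpoint is supplied. The function name, the directory layout under cache_root, and the example URL are assumptions, not part of Bazel CI.

```python
def cache_flags(stateful, cache_root="/home/buildkite", remote_cache_url=None):
    """Pick Bazel cache flags for a worker.

    stateful: True if the worker's disk survives between jobs.
    remote_cache_url: address of a shared cache (hypothetical), or None.
    """
    flags = []
    if stateful:
        # Local caches only pay off when the disk persists across jobs.
        flags.append(f"--repository_cache={cache_root}/repocache")
        flags.append(f"--disk_cache={cache_root}/diskcache")
    if remote_cache_url:
        # A remote cache is shared by all workers, stateful or not.
        flags.append(f"--remote_cache={remote_cache_url}")
    return flags


# Example: a stateless worker that relies on a shared cache.
# The URL is a placeholder, not a real endpoint.
STATELESS_FLAGS = cache_flags(False, remote_cache_url="grpcs://cache.example.com")
```

The returned list can be appended to a bazel build or bazel test command line when the job command is assembled.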
By default, Bazel executes builds and tests on your local machine. Remote execution of a Bazel build allows you to distribute build and test actions across multiple machines, such as a datacenter. It also gives you an automatic persistent shared cache across all of your CI workers.
“Even if you have a nice 64-core workstation, you can only run 64 actions in parallel for integration tests, probably less because they tend to use a lot of resources. This is nice, but what's even nicer is to run thousands of actions in parallel - and with remote execution, you can make this work.”
Philipp Wollermann
Remote execution can be complex to set up but provides great benefits:
Some options to get you started with Bazel's remote execution feature:
Initially test logs were made available to users after the build was completed - using the Buildkite Agent’s artifact upload feature, but users wanted faster access to test failure logs.
Developers wanted faster feedback, asking “if I already know that a test is failing, why do I have to wait for the CI job to finish before I’m able to read it?”
Philipp Wollermann
In response to this very valid question, the team built the Bazel CI Agent tool.
The bazelci-agent tool:
- uploads test logs using the buildkite-agent artifact upload command
Check out the team’s bazelci.py setup to see an example of the integration in use.
Where there are tests, there are flaky tests – tests that might fail 20% of the time, for a reason that is often unclear.
Bazel has built-in support for detecting flaky tests:
- Run bazel test --flaky_test_attempts=3 //... to allow up to three attempts for a failing test.
- A test that fails and then passes on a later attempt is marked FLAKY in the output and the BEP (Build Event Protocol).
- Bazel still considers the bazel test invocation to have passed.
- If FLAKY test logs are found, Bazel CI uploads the BEP JSON to a GCS bucket for further processing.
“Of course, you’ll want to fix flaky tests because they are a pretty bad performance drag on your pipelines, they can cause the runtime to explode, and Bazel has to retry them multiple times.”
Philipp Wollermann
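For readers who want to spot flaky tests themselves, the sketch below scans a BEP JSON file for FLAKY test summaries. It assumes the file was written with --build_event_json_file and contains one JSON event per line, and that test summary events carry the label under id.testSummary.label and the status under testSummary.overallStatus. Those field names reflect my understanding of the BEP JSON encoding and should be checked against your Bazel version. The function name is illustrative.

```python
import json


def flaky_tests(bep_lines):
    """Return labels of tests whose summary status is FLAKY.

    bep_lines: an iterable of strings, one JSON-encoded build event each.
    """
    labels = []
    for line in bep_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        summary = event.get("testSummary")
        if summary and summary.get("overallStatus") == "FLAKY":
            # Assumed layout: the test label lives in the event id.
            labels.append(event["id"]["testSummary"]["label"])
    return labels
```

A typical use would be to open the BEP file after bazel test finishes, pass the file object to flaky_tests(), and report the returned labels.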
At Buildkite we know flaky tests can be a huge problem. That’s why we built Test Analytics – to help you identify and fix your flaky tests for good. ;)
You can watch Philipp's UnblockConf '21 presentation here: