Shay Sofer is a Backend Engineer on Wix’s Infrastructure team. Recently, at UnblockConf '21, he shared with us how Wix built a scalable, highly concurrent CI solution using Buildkite while reducing the time builds spend in the queue from 40-60 minutes to just a few seconds. To give you an idea of the scale of Wix's engineering efforts:
Over the past few years, Wix has put significant work into improving its CI system. After making the switch from Maven to Bazel as their build tool in 2019, build times improved by 90 percent. Despite this marked improvement, things were not running smoothly.
The issue was the time that builds spent in the queue before they began to run. Once the build system hit a certain threshold, this wait often ranged from 40 to 60 minutes.
The bottleneck stemmed from:
Very simple day-to-day scenarios could lead to what the Wix team refers to as “build storms,” a state where everyone is blocked, and there is no way to prioritize builds.
Wix’s new build system needed to meet the following requirements:
After exploring multiple tools, the Wix team found that Buildkite met most of their requirements right out of the box.
Moving to Buildkite – or any new build system – introduced a few challenges, architecture decisions, and dilemmas that Wix had to address. Here’s what Shay had to say about how Buildkite met all six of Wix’s requirements:
We have around 60 backend repositories. We chose the Buildkite pipeline per repo approach; it works quite well for us. Since we maintain a microservice that is responsible for triggering builds, we now also have a configuration file that holds the mapping of a Git Repository => Buildkite’s pipeline slug. This helps us trigger the correct Buildkite pipeline.
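As a rough illustration of the repository-to-pipeline mapping described above, here is a minimal sketch in Python. The repository URLs, slugs, and the `build_trigger_request` helper are hypothetical and are not taken from Wix's service; the endpoint shape follows Buildkite's REST API for creating a build. The function only prepares the request and does not send it.

```python
import json
import urllib.request

# Hypothetical mapping of Git repository => Buildkite pipeline slug.
REPO_TO_PIPELINE = {
    "git@github.com:example-org/payments.git": "payments",
    "git@github.com:example-org/bookings.git": "bookings",
}

BUILDKITE_API = "https://api.buildkite.com/v2"


def build_trigger_request(org_slug, repo_url, commit, branch, api_token):
    """Prepare (but do not send) a create-build request for the repo's pipeline."""
    pipeline_slug = REPO_TO_PIPELINE[repo_url]  # raises KeyError for unmapped repos
    url = f"{BUILDKITE_API}/organizations/{org_slug}/pipelines/{pipeline_slug}/builds"
    body = json.dumps({"commit": commit, "branch": branch}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )
```

A triggering service could pass the returned request to `urllib.request.urlopen` once it has a real API token. Keeping the mapping in configuration means adding a repository does not require a code change.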
We use Buildkite’s dynamic pipeline mechanism, which means that we do not hard-code the Bazel commands we run in the pipeline. It allows us to react to certain conditions and feature flags. We offload that logic to a Buildkite plug-in that we created, which reduces the chances that the pipeline steps themselves will need to change. Now, if there's a change to the code that dynamically decides which steps to run, we just need to release a new version of the plug-in.
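To make the idea of a dynamic pipeline concrete, here is a minimal sketch of a step generator. It is a standalone script rather than a plug-in, and the docs-only condition, the `remote_cache` flag, and the `--config=remote` Bazel config are all assumptions made for illustration, not Wix's actual logic.

```python
import json


def generate_steps(changed_paths, feature_flags):
    """Return a Buildkite pipeline definition chosen at build time."""
    steps = [{"label": "build", "command": "bazel build //..."}]

    # Hypothetical condition: skip tests when only documentation changed.
    docs_only = bool(changed_paths) and all(
        path.startswith("docs/") for path in changed_paths
    )
    if not docs_only:
        test_command = "bazel test //..."
        # Hypothetical feature flag; assumes a 'remote' config exists in .bazelrc.
        if feature_flags.get("remote_cache"):
            test_command += " --config=remote"
        steps.append({"label": "test", "command": test_command})

    return {"steps": steps}


if __name__ == "__main__":
    pipeline = generate_steps(["services/a/Main.scala"], {"remote_cache": True})
    print(json.dumps(pipeline, indent=2))
```

In a pipeline, the printed JSON could be piped into `buildkite-agent pipeline upload`, which accepts a pipeline definition on standard input. The static pipeline then only needs a single step that runs the generator.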
It was important to perform the migration safely while also preventing downtime. We created a mechanism that allows us to quickly opt repositories in or out of Buildkite, and a way to run builds in parallel: in the old build system and in a ‘dry run’ mode in Buildkite (without side effects). That gave us the ability to run all of the load in Buildkite with minimum risk. When everything worked as expected, we gradually moved all of our repositories to Buildkite.
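One way to picture the opt-in and dry-run mechanism is as a small routing decision made per repository. The sketch below is an assumption about how such a switch could look; the `Mode` values, the `Target` type, and `plan_builds` are hypothetical names, not Wix's implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    LEGACY_ONLY = "legacy_only"  # repository not opted in yet
    DRY_RUN = "dry_run"          # old system is authoritative; Buildkite runs without side effects
    BUILDKITE = "buildkite"      # repository fully migrated


@dataclass(frozen=True)
class Target:
    system: str
    side_effects: bool  # whether the build may publish artifacts, report statuses, etc.


def plan_builds(repo, modes):
    """Decide where a build for `repo` should run; unknown repos stay on the old system."""
    mode = modes.get(repo, Mode.LEGACY_ONLY)
    if mode is Mode.BUILDKITE:
        return [Target("buildkite", side_effects=True)]
    targets = [Target("legacy", side_effects=True)]
    if mode is Mode.DRY_RUN:
        targets.append(Target("buildkite", side_effects=False))
    return targets
```

For example, `plan_builds("payments", {"payments": Mode.DRY_RUN})` yields one authoritative build in the old system plus one side-effect-free build in Buildkite. Flipping a repository back to `LEGACY_ONLY` acts as the quick opt-out.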
We chose to use Buildkite’s integration with EventBridge as it provides a reliable way of listening to notifications on the lifecycle of a build. We also defined retries in case our handler fails for any reason, to improve resilience.
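The sketch below shows what a handler for build lifecycle notifications with retries might look like. It uses an in-process retry loop as a stand-in, since the talk summary does not say where Wix configures its retries. The event field names (`detail-type`, `detail.build.pipeline_slug`, and so on) reflect our understanding of the Buildkite EventBridge payload and should be treated as assumptions; `with_retries` and `handle_event` are hypothetical helpers.

```python
import time


def with_retries(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on failure, retry with exponential backoff and re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


def handle_event(event, on_finished):
    """Dispatch a 'Build Finished' notification; ignore other lifecycle events."""
    if event.get("detail-type") != "Build Finished":
        return None
    build = event["detail"]["build"]  # assumed payload layout
    return with_retries(
        lambda: on_finished(build["pipeline_slug"], build["number"], build["state"])
    )
```

Here `on_finished` is whatever downstream action should follow a completed build, such as updating a commit status. Passing `sleep` as a parameter keeps the backoff testable without real delays.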
We use Buildkite’s queuing mechanism. For each of our different build types, we have dedicated queues. This creates much-needed isolation and prevents incidents where critical builds, such as production hotfixes, are queued behind and blocked by other types of builds.
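As a small illustration of per-type queues, the sketch below tags each generated step with the queue for its build type. The build types and queue names are hypothetical; the `agents` block with a `queue` key is how a Buildkite step targets agents that were started with a matching `queue` tag.

```python
# Hypothetical build types mapped to dedicated queue names.
QUEUE_BY_BUILD_TYPE = {
    "hotfix": "hotfix",
    "pull_request": "pr",
    "master": "master",
}


def step_for(build_type, command):
    """Build a pipeline step that only agents in the matching queue will pick up."""
    queue = QUEUE_BY_BUILD_TYPE[build_type]  # raises KeyError for unknown build types
    return {
        "label": f"{build_type} build",
        "command": command,
        "agents": {"queue": queue},
    }
```

Because a hotfix step can only be picked up by agents in the `hotfix` queue, a burst of pull request builds cannot consume the capacity reserved for it.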
We’re running on Kubernetes and every Buildkite build agent is a K8s pod. For autoscaling, we are leveraging Kubernetes Event Driven Autoscaler (KEDA) and buildkite-agent-metrics. We can optimize each queue for low time-in-queue (by having a large buffer of pre-warmed agents) or optimize for cost (no buffer of pre-warmed agents as they spawn on-demand).
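The trade-off between time-in-queue and cost comes down to simple sizing arithmetic per queue. The sketch below is not KEDA's scaling algorithm or Wix's configuration; it is a hypothetical `desired_agents` function that assumes job counts like those reported by buildkite-agent-metrics are available, and that each queue has its own buffer size and upper limit.

```python
def desired_agents(scheduled_jobs, running_jobs, buffer, max_agents):
    """Pods wanted for one queue: cover running and waiting jobs, plus a warm buffer."""
    wanted = running_jobs + scheduled_jobs + buffer
    return max(0, min(wanted, max_agents))


# Hypothetical queue profiles: one tuned for latency, one tuned for cost.
LOW_LATENCY = {"buffer": 20, "max_agents": 100}
LOW_COST = {"buffer": 0, "max_agents": 100}

# With 12 jobs waiting and 30 running:
print(desired_agents(12, 30, **LOW_LATENCY))  # 62
print(desired_agents(12, 30, **LOW_COST))     # 42

# With an idle queue, only the latency-tuned profile keeps agents warm:
print(desired_agents(0, 0, **LOW_LATENCY))    # 20
print(desired_agents(0, 0, **LOW_COST))       # 0
```

A nonzero buffer means a new job usually finds an idle agent immediately, at the price of paying for pods that sit unused. A zero buffer scales the queue down to nothing when it is idle, so the first job waits for a pod to start.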
Check out Shay’s talk to learn more about how Wix’s infrastructure team met the challenge of supporting its growing engineering team.