Shay Sofer is a Backend Engineer on Wix’s Infrastructure team. At UnblockConf '21, he shared how Wix built a scalable, highly concurrent CI solution using Buildkite, reducing the time builds spend in queue from 40-60 minutes to just a few seconds. To give you an idea of the scale of Wix's engineering efforts:
- 220M users
- ~500 backend engineers
- More than 2,000 microservices in production
- 400 deployments per day
- 9,000 backend builds per day
When time in queue is the problem
Over the past few years, Wix has put significant work into improving its CI system. After making the switch from Maven to Bazel as their build tool in 2019, build times improved by 90 percent. Despite this marked improvement, things were not running smoothly.
The issue was the time that builds spent in the queue before they began to run. Once load on the build system passed a certain threshold, new builds had to wait, often for 40 to 60 minutes.
The bottleneck stemmed from:
- No way to prioritize builds. Wix’s system handles different types of builds. Many of them are very “bursty,” firing a large number of builds in a short period of time. Others are critical, such as production hotfixes. These build types require different SLAs, but the old system had no way to enforce them, so "less important" builds frequently clogged the system.
- Limited concurrency. The system could not run more than 200 builds concurrently.
Very simple day-to-day scenarios could lead to what the Wix team refers to as “build storms,” a state where everyone is blocked, and there is no way to prioritize builds.
In search of a new build system
Wix’s new build system needed to meet the following requirements:
- Define Wix repos as pipelines
- Dynamically load and run all of Wix’s build types
- Trigger builds in multiple build systems with no side effects
- Resilient notifications
- Prevent “build storms”
After exploring multiple tools, the Wix team found that Buildkite met most of their requirements right out of the box.
The challenge of moving to a new build system
Moving to Buildkite, or to any new build system, introduced challenges, architecture decisions, and dilemmas that Wix had to address. Here’s what Shay had to say about how Wix addressed each of its five requirements with Buildkite:
1. Wix Repositories ⇔ Buildkite Pipelines
We have around 60 backend repositories. We chose the Buildkite pipeline per repo approach; it works quite well for us. Since we maintain a microservice that is responsible for triggering builds, we now also have a configuration file that holds the mapping of a Git Repository => Buildkite’s pipeline slug. This helps us trigger the correct Buildkite pipeline.
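The repository-to-pipeline lookup described above can be sketched roughly as follows. This is a minimal illustration, not Wix's actual service: the repository URLs, slugs, and the `build_trigger_request` helper are hypothetical, and the request is only constructed, not sent. The endpoint shape follows Buildkite's REST API for creating a build.

```python
import json

# Hypothetical mapping of Git repository => Buildkite pipeline slug.
# In the setup described above this would live in a configuration file.
REPO_TO_PIPELINE = {
    "git@github.com:example-org/payments.git": "payments",
    "git@github.com:example-org/site-editor.git": "site-editor",
}

API_ROOT = "https://api.buildkite.com/v2"


def build_trigger_request(org_slug, repo_url, commit, branch, message=""):
    """Return the (url, json_body) needed to create a build for repo_url."""
    try:
        pipeline_slug = REPO_TO_PIPELINE[repo_url]
    except KeyError:
        raise LookupError(f"no Buildkite pipeline configured for {repo_url}") from None
    url = f"{API_ROOT}/organizations/{org_slug}/pipelines/{pipeline_slug}/builds"
    body = {"commit": commit, "branch": branch, "message": message}
    return url, json.dumps(body)
```

A triggering service would POST the returned body to the returned URL with an API token. Failing loudly on an unmapped repository keeps a missing configuration entry from silently triggering the wrong pipeline.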
2. Dynamically support multiple build types
We use Buildkite’s dynamic pipeline mechanism, which means that we do not hard-code the Bazel commands we run in the pipeline. It allows us to decide which steps to run based on certain conditions and feature flags. We offload that logic to a Buildkite plugin that we created, which reduces the chances that the pipeline definition itself has to change. Now, if there's a change to the code that dynamically decides which steps to run, we just need to release a new version of the plugin.
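To make the idea of a dynamic pipeline concrete, here is a minimal sketch of a step generator. It is a standalone script rather than a plugin, and the build types, the `skip_tests` flag, the `BUILD_TYPE` environment variable, and the Bazel targets are all assumptions made for illustration. Buildkite accepts pipeline definitions as JSON, so the output could be piped into `buildkite-agent pipeline upload`.

```python
import json
import os


def generate_pipeline(build_type, feature_flags=frozenset()):
    """Decide at runtime which steps a build should run."""
    steps = [{"label": "build", "command": "bazel build //..."}]
    # Hypothetical feature flag that skips the test step.
    if "skip_tests" not in feature_flags:
        steps.append({"label": "test", "command": "bazel test //..."})
    if build_type == "hotfix":
        # Give hotfix jobs a higher job priority (illustrative value).
        for step in steps:
            step["priority"] = 10
    return {"steps": steps}


if __name__ == "__main__":
    # BUILD_TYPE is a hypothetical variable set by whatever triggers the build.
    print(json.dumps(generate_pipeline(os.environ.get("BUILD_TYPE", "pull_request"))))
```

Because the decision logic lives in code rather than in a static pipeline file, changing which steps run only requires shipping a new version of that code.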
3. Triggering builds in multiple build systems
It was important to perform the migration safely while also preventing downtime. We created a mechanism that allows us to quickly opt repositories in or out of Buildkite, and a way to run builds in parallel: in the old build system, and in a ‘dry run’ mode in Buildkite (without side effects). That gave us the ability to run all of the load in Buildkite with minimum risk. When everything worked as expected, we gradually moved all of our repositories to Buildkite.
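One way to picture the opt-in and dry-run mechanism is as a small routing decision made for every incoming build. The sketch below is purely illustrative: the `Target` type, the `plan_targets` function, and the system names are hypothetical and do not come from Wix's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    system: str          # "legacy" or "buildkite"
    side_effects: bool   # False means a dry run: nothing is published or reported


def plan_targets(repo, migrated, dry_run):
    """Return the build systems that should receive a build for repo."""
    if repo in migrated:
        # Fully opted in: Buildkite is the only system, with side effects on.
        return [Target("buildkite", True)]
    targets = [Target("legacy", True)]
    if repo in dry_run:
        # Shadow the legacy build in Buildkite without side effects.
        targets.append(Target("buildkite", False))
    return targets
```

Keeping `migrated` and `dry_run` as plain sets means opting a repository in or out is a configuration change, so a rollback does not require a deploy of the build logic.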
4. Resilient notifications
We chose to use Buildkite’s integration with EventBridge as it provides a reliable way of listening to notifications on the lifecycle of a build. We also defined retries in case our handler fails for any reason, to improve resilience.
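A handler for these lifecycle notifications might look roughly like the sketch below. It assumes an EventBridge event whose `detail-type` is `Build Finished` and whose `detail.build` object carries `uuid` and `state` fields; treat those field names, the `notify` callback, and the in-process retry helper as assumptions. In practice, retries may instead be configured on the EventBridge target or the function that consumes the events.

```python
import time


def with_retries(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


def handle_event(event, notify, sleep=time.sleep):
    """Forward finished builds to notify; ignore every other event type."""
    if event.get("detail-type") != "Build Finished":
        return None
    build = event["detail"]["build"]
    summary = {"uuid": build["uuid"], "state": build["state"]}
    return with_retries(lambda: notify(summary), sleep=sleep)
```

Passing `sleep` in as a parameter keeps the backoff testable without real delays.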
5. Preventing "build storms"
We use Buildkite’s queuing mechanism. Each of our build types has its own dedicated queue. This creates much-needed isolation and prevents incidents where critical builds, such as production hotfixes, are queued behind and blocked by other types of builds.
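In Buildkite, a step is routed to a queue through its `agents` targeting rules, so per-type isolation can be sketched as a small function that stamps the right queue onto each step. The build types and queue names below are hypothetical; only the `agents: {queue: ...}` shape and the `default` queue name reflect Buildkite itself.

```python
# Hypothetical build types and queue names.
QUEUE_BY_BUILD_TYPE = {
    "hotfix": "hotfix",
    "pull_request": "pr",
    "nightly": "batch",
}


def assign_queue(step, build_type):
    """Return a copy of step that targets the queue dedicated to build_type."""
    queue = QUEUE_BY_BUILD_TYPE.get(build_type, "default")
    agents = {**step.get("agents", {}), "queue": queue}
    return {**step, "agents": agents}
```

Since agents only pick up jobs from the queue they are assigned to, a burst of jobs in one queue cannot consume the agents reserved for another.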
We’re running on Kubernetes, and every Buildkite build agent is a K8s pod. For autoscaling, we use Kubernetes Event-driven Autoscaling (KEDA) together with buildkite-agent-metrics. We can tune each queue either for low time-in-queue (by keeping a large buffer of pre-warmed agents) or for cost (no buffer of pre-warmed agents, as they spawn on demand).
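The trade-off between time-in-queue and cost comes down to how many agents are kept ready beyond current demand. The function below is a simplified model of that decision, not KEDA's actual scaling algorithm: the inputs correspond loosely to the scheduled and running job counts that buildkite-agent-metrics reports per queue, and `warm_buffer` and `max_agents` are hypothetical tuning knobs.

```python
def desired_agents(scheduled_jobs, running_jobs, warm_buffer=0, max_agents=200):
    """Agents a queue should have: current demand plus a pre-warmed buffer."""
    wanted = scheduled_jobs + running_jobs + warm_buffer
    # Never go below zero or above the queue's ceiling.
    return max(0, min(wanted, max_agents))


# Latency-optimized queue: idle agents are always ready for the next build.
hotfix_agents = desired_agents(scheduled_jobs=5, running_jobs=10, warm_buffer=20)

# Cost-optimized queue: scale to zero when there is nothing to run.
batch_agents = desired_agents(scheduled_jobs=0, running_jobs=0)
```

With a buffer, a new build finds an idle agent immediately; without one, it waits for a pod to start but the queue costs nothing while idle.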
Want to learn more?
Check out Shay’s talk to learn more about how Wix’s infrastructure team met the challenge of supporting its growing engineering team.