Lelia Bray-Musso and Gary White Jr. are members of Wayfair’s Open Source Program Office (OSPO) team. At last year’s Unblock conference, they shared how they upleveled their approach to CI with Buildkite, moving from an unstable, difficult-to-manage set of tools to a resilient, fun-to-use platform.
Defining success for CI
Gary and Lelia began working on the CI/CD pipeline at Wayfair on the Test Enablement team in late 2018. During that time they were responsible for the success of Wayfair’s CI platform.
In evaluating the efficacy of their CI tooling then and now, they map their goals to three distinct themes that must be achievable:
- Code reusability (the “reusable”)
- Infrastructure resiliency (the “resilient”)
- User experience and developer leverage (the “fun”)
With these themes in mind, Gary and Lelia began searching for a solution for a development organization of over 2000 running builds and deployments at scale.
For many years, Wayfair’s CI infrastructure was in a state that some would refer to as “snowflake” infrastructure.
“We had a lot of machines that were fully unique and bespoke in their setup and maintenance. And as a result, teams would regularly have to maintain their own infrastructure or use a machine they didn’t know much about,” Gary shared with us.
Wayfair teams expected there would be an infrastructure platform available to them so they could focus on application development. Instead, there was chaos and confusion, as machines ran scarce and builds took longer and longer. At one point, the entire build infrastructure depended on two virtual machines requiring specialized knowledge.
“A lot of things went wrong,” Gary said. “The solution was not resilient. It certainly wasn’t reusable. And our developers were not having fun trying to use it.”
Buildkite simplified many hard things about Wayfair’s previous CI infrastructure. Gary and Lelia shared three Buildkite features that led to making Wayfair’s CI infrastructure reusable, resilient, and fun: queues, dynamic pipelines, and plugins.
Gary and Lelia had to specialize machines by language, framework, and developer preferences. With Buildkite queues, they could do this easily while also centralizing the keys to build machines, which made for a much better developer experience.
Containers before Buildkite queues
Gary described the problem with containers before Buildkite to illustrate the power of queues:
Often as a developer you are looking for somewhere for a container job to run. You might see a long list of build environments and get reasonably confused. You might select the one whose name seems the best fit, but it’s possible the container daemon isn’t available to that build environment for security reasons. You might choose another and have a similar problem.
Containers with Buildkite queues
With Buildkite, they’ve been able to use queues to make it much easier for developers to identify what machines to use for any given purpose.
“In Buildkite we specified queues through the Buildkite API to designate which machines did what,” Gary said. “So for example, we could have a fleet of machines that run Java, and not require developers to be intimately familiar with our infrastructure to use them effectively. You just provide some docs and a guide.”
The name of the queue negates the need for engineers to think about the infrastructure, so they don’t try to target specific machines directly at all. This means on the infrastructure side, the team can make changes without interrupting service. Maintenance events, outages, and auto-scaling are much easier with this abstraction than when developers are targeting machines purpose-built for their needs.
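As a rough illustration of that abstraction, a step can name a queue rather than a machine. The sketch below is not Wayfair's setup: the `queue_step` helper, the `java` queue name, and the Gradle command are all assumptions made for the example. It prints a Buildkite step whose `agents` block targets a queue.

```shell
#!/bin/sh
# Hypothetical sketch: emit a Buildkite step that targets a queue by name.
# The helper name, queue name, and command are illustrative assumptions.
set -eu

# queue_step LABEL COMMAND QUEUE
# Prints one pipeline step that runs on any agent in the given queue.
queue_step() {
  cat <<YAML
  - label: "$1"
    command: "$2"
    agents:
      queue: "$3"
YAML
}

echo "steps:"
queue_step "unit tests" "./gradlew test" "java"
```

On the infrastructure side, an agent would typically join such a queue when it starts, for example with `buildkite-agent start --tags "queue=java"`, so machines can be added or replaced without any change to the step definition.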
The team uses Terraform and Puppet for configuration management to spin up and down the machines used in these queues, giving them an ultimately reusable and resilient approach to providing a better developer experience.
“Switching to Buildkite queues makes the developer experience better, allowing developers to do more with less and have more fun,” Gary said.
Prior to using Buildkite, Wayfair’s pipeline config existed as a single unwieldy YAML file. Because it was also the single orchestration layer responsible for building a critical production asset, engineers risked disrupting dozens of deployments to make a change. In order to make a change, an engineer would need to work with the maintainers of the pipeline to access their tribal knowledge and request access to modify the config.
With Buildkite, they were able to break up the single, large YAML file into smaller digestible chunks with clear separation of domain and ownership.
“Buildkite agents typically expect to receive pipeline instructions in YAML, but you can supercharge your everyday pipeline.yml file by converting it to a pipeline.sh script,” Lelia said. “This in turn can concatenate multiple chunks of YAML based on environment variables, build conditions, or context from commit diffs. This magically appears in the Buildkite UI as a unified pipeline.”
Dynamic pipelines allow them to make decisions about where the pipeline is going during the runtime of the pipeline. From there, a pipeline.sh script helps them achieve that goal. Using dynamic pipelines, their platform teams were able to define a highly reusable pattern for engineers building moderately complex pipelines. And, because of the flexibility and modularity of dynamic pipelines, teams can easily share pipeline definitions and reuse what’s relevant to them.
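A minimal sketch of such a script might look like the following. The chunk functions, step labels, and `make` targets are hypothetical, and the branch rule is only one example of a build condition; `BUILDKITE_BRANCH` is an environment variable the Buildkite agent sets for each build.

```shell
#!/bin/sh
# Hypothetical .buildkite/pipeline.sh: prints pipeline YAML to stdout.
# Step labels and make targets are illustrative assumptions.
set -eu

# Chunk that every build receives.
test_chunk() {
  cat <<'YAML'
  - label: "test"
    command: "make test"
YAML
}

# Chunk that is appended only for builds of the main branch.
deploy_chunk() {
  cat <<'YAML'
  - wait
  - label: "deploy"
    command: "make deploy"
YAML
}

# generate_pipeline BRANCH
# Concatenates the chunks that apply to the given branch.
generate_pipeline() {
  echo "steps:"
  test_chunk
  case "$1" in
    main) deploy_chunk ;;
  esac
}

# Default to an empty branch name when run outside of CI.
generate_pipeline "${BUILDKITE_BRANCH:-}"
```

The first step of the build would then pipe the generated YAML to the agent, for example `.buildkite/pipeline.sh | buildkite-agent pipeline upload`, and the uploaded steps show up in the UI as one pipeline.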
“Beyond the ability to programmatically stitch together pipeline steps on the fly, dynamic pipelines unleashed the power of YAML templating and DRY (Don’t Repeat Yourself) principles. By combining with the notion of a common YAML file read in by pipeline.sh, we can store environment variables and build configurations for multiple steps that need to be extracted by many of the same values over and over again. This makes it easier to do things like propagate a tag or agent specification across numerous pipelines,” Lelia shared.
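The idea of defining shared values once and propagating them across steps can be sketched as follows. Note the substitution: for brevity this example keeps the shared values in plain shell variables rather than in a common YAML file, and the variable names, queue name, and tag value are all hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: define shared values once, reuse them in every step.
# Variable names and default values are illustrative assumptions.
set -eu

# Shared values; these could instead be loaded from a common file.
COMMON_QUEUE="${COMMON_QUEUE:-java}"
COMMON_IMAGE_TAG="${COMMON_IMAGE_TAG:-1.2.3}"

# shared_step LABEL COMMAND
# Prints a step that carries the shared tag and agent specification.
shared_step() {
  cat <<YAML
  - label: "$1"
    command: "$2"
    env:
      IMAGE_TAG: "$COMMON_IMAGE_TAG"
    agents:
      queue: "$COMMON_QUEUE"
YAML
}

echo "steps:"
shared_step "build" "make build"
shared_step "test" "make test"
```

Changing the tag or the queue in one place then updates every step that the script generates.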
Before Buildkite, the team created many ad hoc shell scripts to address challenges with their build pipelines.
“More often than not, these shell scripts would call a secondary script, which in turn called another script, and another script and so on and so forth,” Lelia explained. “Not only were the shell scripts extremely difficult to troubleshoot, they were even harder to discover or share with others, often undocumented, almost never versioned.”
Buildkite plugins proved to be the solution to the problem.
“Since Buildkite plugins all begin with a shell script entry point, there is near-guarantee that your plugin will run on any CI agent it is deployed on to,” Lelia said. “Using plugins, it allows us to centralize and track commonly requested pipeline functionality, version it and create an easy path for anyone to improve and extend the plugin as needed.”
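To make the shell entry point concrete, here is a minimal sketch of a `hooks/command` script for a hypothetical plugin named `greeter` with a single `name` option. The plugin and its option are invented for illustration; the part that comes from Buildkite is the convention of passing plugin options to hooks as `BUILDKITE_PLUGIN_<PLUGIN>_<OPTION>` environment variables.

```shell
#!/bin/sh
# Hypothetical hooks/command for an illustrative plugin called "greeter".
# The plugin name and its "name" option are assumptions for this sketch.
set -eu

# Plugin options arrive as BUILDKITE_PLUGIN_<PLUGIN>_<OPTION> variables;
# fall back to a default when the option is not set.
greet() {
  echo "Hello, ${BUILDKITE_PLUGIN_GREETER_NAME:-world}!"
}

greet
```

A real plugin would usually also ship a `plugin.yml` describing its options and live in its own versioned repository, which is what makes it discoverable and shareable in a way the ad hoc scripts were not.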
The following are Buildkite plugins that the Wayfair team regularly uses:
The following are Buildkite Plugins that Wayfair builds and forks:
See the full video to learn more about how Wayfair uses Buildkite and builds resilient engineering teams.