Bluecore is a multichannel personalization platform for retail that makes it easy for digital marketers to launch personalized campaigns and experiences that get shoppers to buy again and again.
Shray Kumar is a Staff Software Engineer on Bluecore’s infrastructure team. Since he joined more than two years ago, the engineering organization has doubled in size, and the team’s need to customize and optimize its CI system outgrew the limits of its previous CI tool. We sat down with Shray to discuss the move to Buildkite and what the team loves about hosting CI on its own infrastructure.
Q: What prompted the move to Buildkite?
A: Our previous provider’s pricing model forced us into annual billing and required plan upgrades whenever we hit certain limits. We were also on its cloud-hosted offering, which limited our ability to optimize and customize our CI setup.
We started looking at other solutions. Everyone’s mind went to Jenkins first because it’s what they know.
It’s surprising how many companies choose a CI provider just because it’s well known. If you really evaluate that option, and what you’re getting out of it as your organization grows, you’ll find there are lesser-known but better options out there.
Q: What were your requirements when seeking out a new solution?
A: Our general goal as an engineering organization is for any engineer to be able to start a project within 1-2 hours, max. Everything, including CI, should be standardized and easy to onboard onto. From the moment you write code locally, to when you push it to GitHub, to when it builds in Buildkite, to when it lands in your environments, we want the whole process to be smooth and seamless.
In addition to that, we were looking for:
- Ease of internal management
- Single sign-on
- The ability to bring all of our infrastructure in-house
Q: How long did you spend evaluating alternatives? Who really championed adopting Buildkite and why?
A: During the evaluation phase, people usually think, for something like CI, “it will take a long time to adopt this.” But it was quick. It was maybe one week before we decided Buildkite was what we wanted to go with.
The infrastructure team in particular loved it. We can run our own nodes, cache locally, and pull our own build tools onto nodes. It’s amazing how much time we’re saving, how much standardization we’re getting, and how much we are able to control on our own.
The security team strongly approved of moving things in-house and not running our code on third-party systems.
The development team enjoyed decreased build times. Deployments on our previous provider took 45 minutes to an hour. With Buildkite, we immediately got that down to 15 minutes, and we’ve made additional optimizations from there. Buildkite’s YAML-based pipeline format is also familiar to people, which makes adoption an easy lift. Setting up a new pipeline is a single step documented in our internal Stack Overflow, and the pipeline is generated automatically. 90% of our pipelines are simply “Upload pipeline from repository.”
Q: What was the migration to Buildkite like?
A: We built a script that read our pipelines in our previous tool, extracted the relevant steps, and ported them to Buildkite pipelines. This worked quickly and painlessly for 90% of our projects. We then progressively downsized our plan with the previous provider while migrating the remaining 10% of projects, which took a little longer to move.
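The interview doesn’t name the previous provider or its config format, so the sketch below assumes a hypothetical legacy format: a top-level `jobs` map where each job holds a list of `steps`, and shell commands appear under a `run` key. It keeps only those commands and emits one Buildkite command step (`label` plus `command`) per job, dropping provider-specific keys. It illustrates the porting idea, not Bluecore’s actual script.

```python
import json
from typing import Any


def port_job(job_name: str, job: dict[str, Any]) -> dict[str, Any]:
    """Convert one job from a hypothetical legacy config into a Buildkite command step."""
    # Keep only the shell commands; legacy-only steps (e.g. a bare "checkout") are dropped,
    # because the Buildkite agent checks out the repository itself.
    commands = [
        step["run"]
        for step in job.get("steps", [])
        if isinstance(step, dict) and "run" in step
    ]
    return {"label": job_name, "command": commands}


def port_pipeline(legacy_config: dict[str, Any]) -> dict[str, Any]:
    """Build a Buildkite pipeline definition from every job in the legacy config."""
    jobs = legacy_config.get("jobs", {})
    return {"steps": [port_job(name, job) for name, job in jobs.items()]}


if __name__ == "__main__":
    # Hypothetical legacy config used only for illustration.
    legacy = {
        "jobs": {
            "test": {"steps": ["checkout", {"run": "make deps"}, {"run": "make test"}]},
            "lint": {"steps": [{"run": "make lint"}]},
        }
    }
    print(json.dumps(port_pipeline(legacy), indent=2))
```

Emitting JSON keeps the sketch dependency-free. JSON is also valid YAML, so the output could be committed as a pipeline file and picked up by an “Upload pipeline from repository” step. Projects whose jobs rely on provider-specific features would fall outside this simple mapping, which is roughly the kind of long tail the remaining 10% represents.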
Q: What are some of the ways you’ve customized your CI setup with Buildkite?
A: With Buildkite’s plugin system, we were able to handle a lot of things, like secrets management and test coverage.
For secrets management, we forked a popular Google Cloud secrets management plugin and modified it to fit our needs, which saved us from building one from scratch in-house.
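The forked plugin isn’t shown, and Buildkite plugin hooks are typically shell scripts, so this Python sketch only illustrates the core lookup such a plugin performs: mapping environment variable names to Google Cloud Secret Manager secrets and resolving each one before a job runs. The resource name format `projects/{project}/secrets/{secret}/versions/latest` is Secret Manager’s standard form; the project `ci`, the secret `npm-token`, and the `fetch` callable are hypothetical. In a real hook, `fetch` could wrap `SecretManagerServiceClient.access_secret_version` from the `google-cloud-secret-manager` library and return `response.payload.data`.

```python
from typing import Callable, Mapping

# A fetcher takes a full secret version resource name and returns the raw secret bytes.
# Hypothetical: in production this would call Secret Manager; here it is injected.
SecretFetcher = Callable[[str], bytes]


def secret_version_name(project: str, secret: str, version: str = "latest") -> str:
    """Build a Secret Manager resource name for one secret version."""
    return f"projects/{project}/secrets/{secret}/versions/{version}"


def resolve_secrets(
    project: str, env_to_secret: Mapping[str, str], fetch: SecretFetcher
) -> dict[str, str]:
    """Map each environment variable name to the decoded value of its secret."""
    return {
        env_var: fetch(secret_version_name(project, secret)).decode("utf-8")
        for env_var, secret in env_to_secret.items()
    }


if __name__ == "__main__":
    # Fake in-memory store standing in for Secret Manager (hypothetical values).
    fake_store = {"projects/ci/secrets/npm-token/versions/latest": b"not-a-real-token"}
    env = resolve_secrets("ci", {"NPM_TOKEN": "npm-token"}, fake_store.__getitem__)
    print(sorted(env))
```

Injecting the fetcher keeps the lookup logic testable without cloud credentials, and the agent only ever sees secrets fetched at job time from inside the company’s own project.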
For test coverage, we created our own plugin around Coveralls that can dynamically create tokens and associate them with pipelines using the Coveralls API.
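The specific Coveralls API calls aren’t described here, so `create_token` below is a hypothetical stand-in for whatever request the plugin makes. The sketch only shows the “dynamically create and associate” idea as a get-or-create keyed by pipeline: the first build of a pipeline creates a token and records it, and later builds reuse it. The in-memory `store` is an assumption; a real plugin would persist the mapping somewhere durable.

```python
from typing import Callable, MutableMapping

# Hypothetical stand-in for the Coveralls API call that creates a token for a pipeline.
TokenCreator = Callable[[str], str]


def token_for_pipeline(
    pipeline_slug: str, store: MutableMapping[str, str], create_token: TokenCreator
) -> str:
    """Return the coverage token for a pipeline, creating and recording one on first use."""
    if pipeline_slug not in store:
        # First build for this pipeline: create a token and remember the association.
        store[pipeline_slug] = create_token(pipeline_slug)
    return store[pipeline_slug]
```

Inside a Buildkite job, the slug could come from the `BUILDKITE_PIPELINE_SLUG` environment variable that the agent sets, so a new pipeline gets coverage reporting without anyone creating a token by hand.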
We use a plugin for monorepos that identifies which files have changed and, based on those changes, runs a specific set of tests instead of the whole suite. So if you change a frontend file, the pull request runs only the tests for that area of the monolith. We’re not burning expensive tests on code that hasn’t changed, only on the specific subsets that have.
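The plugin itself isn’t named, so here is a minimal sketch of the selection logic it describes, assuming each test suite “watches” one or more path prefixes. The prefixes and suite names are hypothetical. The list of changed files could come from something like `git diff --name-only origin/main...HEAD`.

```python
from typing import Iterable, Mapping


def select_test_suites(
    changed_files: Iterable[str], suites_by_prefix: Mapping[str, str]
) -> list[str]:
    """Return the test suites whose watched path prefix matches at least one changed file."""
    files = list(changed_files)
    selected: list[str] = []
    for prefix, suite in suites_by_prefix.items():
        # Several prefixes may map to the same suite; add each suite only once.
        if suite not in selected and any(path.startswith(prefix) for path in files):
            selected.append(suite)
    return selected


if __name__ == "__main__":
    # Hypothetical mapping from areas of the monolith to their test suites.
    watches = {"frontend/": "frontend-tests", "api/": "api-tests"}
    print(select_test_suites(["frontend/app.tsx", "README.md"], watches))
```

A change to `frontend/app.tsx` selects only `frontend-tests`, while files matching no prefix select nothing. A real setup would usually also map shared code (lockfiles, common libraries) to the full suite so that risky changes still get complete coverage.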
We also built our own autoscaler for Google Cloud after reviewing the code for Buildkite’s Elastic CI Stack for AWS. Depending on the time of day, we might have 50 nodes running for our build queue. We monitor the build queue throughout the day and spin up extra nodes as needed. Those nodes are preemptible VMs, Google Cloud’s equivalent of AWS Spot Instances, and running CI on these ephemeral nodes lets us shave down costs.
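The autoscaler’s code isn’t shown, so this is a minimal sketch of queue-based scaling under stated assumptions. It assumes the autoscaler can read the number of scheduled and running jobs for a queue on each polling cycle. It also assumes a hypothetical `ScalingPolicy` with `jobs_per_node`, `min_nodes`, and `max_nodes`. The target node count is the number of nodes needed to run every job, clamped to those bounds.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    jobs_per_node: int  # hypothetical: agents (concurrent jobs) each node runs
    min_nodes: int      # hypothetical: floor kept warm during quiet hours
    max_nodes: int      # hypothetical: cap that bounds spend during spikes


def desired_nodes(scheduled_jobs: int, running_jobs: int, policy: ScalingPolicy) -> int:
    """Nodes needed to run every waiting and running job, clamped to the policy's bounds."""
    needed = math.ceil((scheduled_jobs + running_jobs) / policy.jobs_per_node)
    return max(policy.min_nodes, min(policy.max_nodes, needed))


def scaling_action(current_nodes: int, target: int) -> str:
    """Describe the change the autoscaler would make on this polling cycle."""
    if target > current_nodes:
        return f"create {target - current_nodes} preemptible node(s)"
    if target < current_nodes:
        return f"drain {current_nodes - target} idle node(s)"
    return "no change"


if __name__ == "__main__":
    policy = ScalingPolicy(jobs_per_node=4, min_nodes=2, max_nodes=80)
    target = desired_nodes(scheduled_jobs=30, running_jobs=10, policy=policy)
    print(target, scaling_action(current_nodes=6, target=target))
```

Draining idle nodes rather than deleting busy ones avoids killing in-flight builds. Because preemptible VMs can be reclaimed by Google Cloud at any time, jobs running on them should be safe to retry.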
Q: How does hosting CI on your own infrastructure change the way your team works?
A: We’re no longer spending unnecessary time or money on egress to pull down build container images. Our previous provider required pulling down the build container image (almost 2 GB) each time, and if a build didn’t start on the same node, the pull took 30-45 seconds. If you’re doing 10 PRs an hour, that adds up quickly. On top of that, if something went wrong, you had to SSH into an instance outside your network to troubleshoot.
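To make “adds up quickly” concrete, here is a back-of-the-envelope sketch using the figures above. It assumes the worst case, where every PR build lands on a fresh node and pulls the full image once. The eight-hour working day in the usage line is a hypothetical figure, not one from the interview.

```python
def pull_overhead(
    prs_per_hour: float, hours: float, image_gb: float, pull_seconds: float
) -> tuple[float, float]:
    """Worst case: every PR build lands on a fresh node and pulls the full image once.

    Returns (gigabytes transferred, minutes spent waiting on image pulls).
    """
    pulls = prs_per_hour * hours
    return pulls * image_gb, pulls * pull_seconds / 60


if __name__ == "__main__":
    # 10 PRs an hour, ~2 GB image, 30 s and 45 s pulls, over a hypothetical 8-hour day.
    for seconds in (30, 45):
        print(pull_overhead(prs_per_hour=10, hours=8, image_gb=2, pull_seconds=seconds))
```

At 10 PRs an hour, that is 20 GB of egress and 5 to 7.5 minutes of pulling every hour. Over an eight-hour day, it comes to 160 GB and 40 to 60 minutes spent waiting on images alone.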
With Buildkite, any engineer can access nodes on our infrastructure to troubleshoot, which makes reproducing a bug seen in CI much easier. We don’t want to be the gatekeepers for troubleshooting an external platform. Engineers can go directly to the machine and troubleshoot the issue without leaking any code onto third-party networks.