Keith Smiley is a Principal Engineer working on infrastructure at Lyft. He's also a maintainer of Bazel's iOS support and Envoy Proxy, an LLVM and Swift contributor, and the creator of the Mobile Native Foundation.
Recently, at UnblockConf '21, he shared with us Lyft’s CI journey to Buildkite, their current setup based on Buildkite and Bazel, and where the team is heading next. Some facts about Lyft's mobile teams:
- ~200 active contributors to the mobile code base
- 1000s of CI builds per week
- Dozens of PRs merged per day
Slow build times on macOS machines
Lyft has two platform-specific monorepos to house their Android and iOS code bases. They perform most of their iOS builds on macOS machines. In 2015, Lyft did a rewrite of their iOS app in Swift, which at the time was very slow to compile. In addition, Apple’s constant changes to Xcode presented challenges to the team. Build times were particularly slow because of these issues.
Flexible CI Abstraction
To address the slow build times, Lyft decided to come up with a more flexible CI abstraction. Here, Keith explains how they did so:
"First, we abstracted our CI configuration out of any vendor-specific file, and moved all of that into our repository, which made it much easier for us to give a CI provider a top-level script, and have everything work out of the box."
Picking Jobs to Run
"We threw a fast Linux CI job in between the environment variable and the CI provider that picked what we actually wanted to run for each commit. This let us do things like commit to a release branch and run some extra jobs."
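The job-picking step Keith describes could look roughly like the following. This is a minimal sketch, not Lyft's actual script: the job names, the `release/` branch prefix, and the `pick_jobs` function are all hypothetical.

```python
from typing import List

# Hypothetical job names; the talk does not list Lyft's real jobs.
DEFAULT_JOBS = ["lint", "unit-tests"]
RELEASE_ONLY_JOBS = ["ui-tests", "archive-build"]


def pick_jobs(branch: str) -> List[str]:
    """Return the CI jobs to run for a commit on the given branch."""
    jobs = list(DEFAULT_JOBS)
    # Assumption: release branches are named "release/<version>".
    if branch.startswith("release/"):
        jobs.extend(RELEASE_ONLY_JOBS)
    return jobs


if __name__ == "__main__":
    print(pick_jobs("main"))
    print(pick_jobs("release/1.2.3"))
```

Because the selection lives in the repository rather than in a vendor's config file, a rule like "extra jobs on release branches" is changed and reviewed like any other code.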
Triggering jobs via the API
"Finally, we would just hit the CI providers API, passing the right environment for whatever we wanted to do. And then it would trigger the job. But most of the logic was still happening in our repository separate from any CI provider."
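As an illustration of triggering a job over HTTP, the sketch below builds a request for the "create a build" endpoint of Buildkite's REST API. The talk does not say which providers or endpoints Lyft called at this stage, so the choice of Buildkite here, the organization and pipeline slugs, the `CI_JOB` variable, and the `build_trigger_request` helper are all assumptions.

```python
import json
import urllib.request


def build_trigger_request(org, pipeline, token, commit, branch, env):
    """Build (but do not send) a POST request that creates a Buildkite build."""
    url = (
        "https://api.buildkite.com/v2/organizations/"
        f"{org}/pipelines/{pipeline}/builds"
    )
    # The env mapping is how the chosen job is communicated to the build.
    payload = {"commit": commit, "branch": branch, "env": env}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    request = build_trigger_request(
        "example-org", "ios", "TOKEN", "HEAD", "main", {"CI_JOB": "unit-tests"}
    )
    print(request.get_method(), request.full_url)
    # To actually trigger the build: urllib.request.urlopen(request)
```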
"This setup let us do many interesting things. Not only could we run some jobs on Linux, and some jobs on macOS, but we could also use different CI providers on the same commit. It was a nice way for us to dip our toes in a little bit to see if the new provider was fast and stable with just one job, while still running most of the other jobs on other providers."
While the new CI setup did help somewhat with Lyft’s slow build times, they still faced the challenge of using hosted macOS CI services. The Lyft team decided to try self-hosting macOS machines. This is also the point, Keith said, that the mobile infrastructure team began thinking very seriously about using Buildkite as a CI provider.
“We knew we wanted to host the machines ourselves, but we still weren't excited about the idea of hosting the central kind of CI scheduling piece ourselves. We knew that could be error prone and just take a lot of our time.
“Lyft does have some other internal CI setups, and some folks were using Buildkite already. It gave us a nice ‘in' to test that out, and worked well with us not wanting to maintain that central piece of infrastructure.”
The initial Buildkite trial on self-hosted macOS machines decreased CI times from 20 minutes to 5.
How Lyft Uses Buildkite
Keith considers dynamic pipelines to be what most differentiates Buildkite from other CI providers. Here, he explains how Lyft uses them.
Here’s a simple Buildkite pipeline:
steps:
  - command: "./ci/run.sh"
“This works really well to test. You can just change the UI and quickly trigger a new build and see what happens. That’s great, but pretty quickly, I think you'll want to check in your configs so that you don't break old branches, as you change configs over time.”
Here’s a different setup done in the Buildkite UI:
steps:
  - command: "buildkite-agent pipeline upload"
“Here, you run the Buildkite Agent, and it uploads the pipeline to Buildkite, which adds all the jobs to your currently running Buildkite build. This works great and lets you easily test changes to the pipeline configuration in PRs, as well as versioning it over time. But the thing to understand here is that this is just an arbitrary shell command that runs in the context of your repo.”
Here's what Lyft does:
steps:
  - command: "./ci/generate_pipeline.py && buildkite-agent pipeline upload"
"We can actually throw some other scripts in here. In ours we run an arbitrary Python script that's checked into our repo first. And then we tell the build agent to upload the pipeline. So the difference is, we don't actually check in a static Buildkite pipeline at all, and we just generate it with these scripts."
generate_pipeline.py is responsible for the following tasks:
"First, we query GitHub. The biggest benefit of doing this is that we can get all the files that change in a PR. One practical example of that is that if you have a pull request that changes Swift files, we know that we want to run our Swift formatting linter. Whereas if you don't, we can skip that step which not only saves time and some risk of flakiness, but also saves machine scheduling tasks, so that it can go and grab a different job instead. Once you have enough of these rules, this helps you tune your CI machine utilization."
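A rule like "only run the Swift linter when Swift files changed" can be sketched as a small table of patterns. This is an illustration under assumptions, not Lyft's code: the `RULES` table, the job names, and `jobs_for_changed_files` are hypothetical, and the list of changed paths is taken as input (for example, as returned by GitHub's "list pull request files" API).

```python
from fnmatch import fnmatch

# Hypothetical rules: glob pattern for a changed path -> job to schedule.
RULES = [
    ("*.swift", "swift-format-lint"),
    ("*.py", "python-lint"),
]


def jobs_for_changed_files(changed_files):
    """Return the jobs whose pattern matches at least one changed file."""
    jobs = []
    for pattern, job in RULES:
        matched = any(fnmatch(path, pattern) for path in changed_files)
        if matched and job not in jobs:
            jobs.append(job)
    return jobs


if __name__ == "__main__":
    print(jobs_for_changed_files(["Sources/App/Ride.swift", "README.md"]))
```

A pull request that touches no Swift files produces no lint job at all, so no machine is ever scheduled for it.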
"The second thing we do is query our build system. We use Bazel to build our apps and it has a built-in feature called Bazel query. Bazel query lets you query information about your build graph, given some specific set of files or other criteria."
"In this case, we take the files that we query from GitHub, and then ask the build system, ‘given these files have changed, what apps do I need to rebuild, and what test targets do I need to rerun?’"
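One common way to ask Bazel this question is an `rdeps` (reverse dependencies) query over the changed files, filtered down to test rules with `kind`. The sketch below shows that general technique; the talk does not show the query Lyft actually runs, and the helper names are hypothetical. It also assumes every changed path still exists and belongs to a Bazel package, since deleted files would need separate handling.

```python
import subprocess


def affected_tests_query(changed_files):
    """Build a Bazel query expression for test rules that depend on the files."""
    # Quote each path so unusual characters do not break the query parser.
    file_set = " ".join(f'"{path}"' for path in changed_files)
    return f'kind(".*_test rule", rdeps(//..., set({file_set})))'


def run_bazel_query(expression):
    """Run `bazel query` in the current workspace and return matching labels."""
    result = subprocess.run(
        ["bazel", "query", expression, "--output=label"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.split()


if __name__ == "__main__":
    # Only prints the expression; run_bazel_query needs a Bazel workspace.
    print(affected_tests_query(["Modules/Rides/Ride.swift"]))
```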
"In our platform this is a massive win for us, because the standard case for our developers is that they're working on isolated features. In this case, their small change may only affect one app and a handful of test targets when we actually have dozens of apps and thousands of test targets. So, in the worst case, you could end up triggering 20 CI jobs on some change, whereas in the best case, you may only trigger two."
"So especially for our macOS CI machines, where we can't auto scale them, the difference is that one or two pull requests could take over our whole CI fleet and leave other developers queuing, versus maybe 20 or 30 that would all have to be pushed at the same time to actually exhaust our entire fleet. So this is a huge win for us."
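The arithmetic behind this comparison is simple to check. The fleet size of 40 machines below is purely hypothetical (the talk gives no number); only the 20-jobs and 2-jobs figures come from Keith's description.

```python
def prs_to_saturate(fleet_size, jobs_per_pr):
    """Number of simultaneous PRs needed to occupy every machine in the fleet."""
    # Ceiling division: a partially filled last PR still uses up the fleet.
    return -(-fleet_size // jobs_per_pr)


if __name__ == "__main__":
    FLEET_SIZE = 40  # hypothetical, for illustration only
    print(prs_to_saturate(FLEET_SIZE, 20))  # every PR triggers all 20 jobs
    print(prs_to_saturate(FLEET_SIZE, 2))   # each PR triggers only 2 jobs
```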
"We use all this information along with other heuristics to produce a valid buildkite.json file. And this is what Buildkite ends up taking and throwing into the current build given all the conditions we've applied to it so far. This is a huge win for CI utilization, and leads to a much better developer experience in general."
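The final step, turning the selected jobs into a pipeline file, might look like the sketch below. It is not Lyft's `generate_pipeline.py`: the `make_pipeline` and `write_pipeline` helpers and the idea of passing the job name as an argument to `./ci/run.sh` are assumptions. The output uses Buildkite's standard `steps` list with `label` and `command` keys.

```python
import json


def make_pipeline(jobs):
    """Return a Buildkite pipeline dict with one command step per job."""
    steps = [
        # Assumption: the repo's top-level script accepts a job name.
        {"label": job, "command": f"./ci/run.sh {job}"}
        for job in jobs
    ]
    return {"steps": steps}


def write_pipeline(jobs, path="buildkite.json"):
    """Write the generated pipeline to disk for the agent to upload."""
    with open(path, "w") as handle:
        json.dump(make_pipeline(jobs), handle, indent=2)


if __name__ == "__main__":
    print(json.dumps(make_pipeline(["swift-format-lint", "unit-tests"]), indent=2))
```

Since nothing static is checked in, each build's pipeline contains only the steps that the changed files and the build graph call for.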
Want to learn more about Lyft’s mobile CI setup?
Check out Keith’s full UnblockConf '21 presentation to learn more about how the team also uses Linux and slash commands to interact with CI through pull requests.