Accelerating Intercom’s development: harnessing Buildkite for rapid, reliable CI/CD
Intercom is a live chat system for support, sales, and marketing teams that allows businesses to track and filter customer data; this data can be used to create personalized, automated marketing emails and in-app messages.
Intercom runs on two large applications: a Ruby on Rails monolith (the backend of Intercom, where the majority of code is written) and a large Ember.js application (the frontend that customers use). Both applications are complex in terms of the number of test cases, the volume of code, and the effort that the Intercom team invests in giving developers the best possible environment so they can ship code quickly. “Shipping is our heartbeat,” says Intercom’s Principal Systems Engineer, Brian Scanlan. “We like having a single application and invest in it accordingly so that our developers can be very productive in it.”
The Intercom team deploys about 150 times per day across multiple applications in the codebase. Shipping code continuously means running a large number of tests, and the team needed a CI solution that could keep up with that demand.
Previously, Intercom used two other CI tools, almost in redundancy, to achieve the stability needed to build and to let Intercom developers do performance work in either environment. Unfortunately, even with dual solutions, they could not reach the speed and level of control they needed. “We’d still have outages and all sorts of problems which was frustrating,” says Scanlan. In the end, neither platform let the team optimize for what mattered to them. Plus, keeping both running was time-consuming and expensive.
The team knew what they needed: reliability, control, and speed. “Getting the reliability of tests to run correctly was the first thing we needed to focus on. Just having full insight into what was going on would give us more visibility, and we’d be able to fix things quicker rather than working on a more closed platform,” Scanlan reflects. “Then, optimizing for speed because the faster we can get feedback to the developers, the better.”
“There was nothing wrong with the tests; it was the environment that was always failing,” Scanlan reports. This seemed like a good case for trying a new approach, so, focusing solely on reliability, the Intercom team used Buildkite to orchestrate the build within their own EC2 infrastructure. “We didn’t even have an efficiency goal at that point. We just wanted a reliability improvement, so that was the small thing we were going to try out first,” says Scanlan. Once they saw that reliability was no longer an issue, the team began optimizing for speed and moving all of their builds over to Buildkite.
For a company whose “heartbeat is shipping”, time is priceless. Running tens of thousands of tests used to take 20 to 25 minutes; the Intercom team can now do it in just three. This gives developers rapid feedback, which translates to a better developer experience and lets them respond faster when rolling back problems, getting changes out, or dealing with security problems.
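The article does not say how Intercom spreads its tests across machines. One common way to shorten a large test run is to split the test files across parallel jobs, so the sketch below illustrates that assumption and is not Intercom's actual setup. It reads the BUILDKITE_PARALLEL_JOB and BUILDKITE_PARALLEL_JOB_COUNT environment variables, which Buildkite sets on steps that use parallelism. The function names and example file paths are hypothetical.

```python
import os
import zlib


def shard_for(path: str, shard_count: int) -> int:
    """Map a test file path to a shard index in a stable way."""
    # crc32 gives the same value on every machine, unlike the built-in hash().
    return zlib.crc32(path.encode("utf-8")) % shard_count


def select_tests(paths, shard_index: int, shard_count: int):
    """Return the sorted subset of test files that one shard should run."""
    return sorted(p for p in paths if shard_for(p, shard_count) == shard_index)


if __name__ == "__main__":
    # Fall back to a single shard when the variables are not set (e.g. locally).
    index = int(os.environ.get("BUILDKITE_PARALLEL_JOB", "0"))
    count = int(os.environ.get("BUILDKITE_PARALLEL_JOB_COUNT", "1"))
    # Hypothetical file list; a real suite would discover these from disk.
    all_tests = [f"spec/models/example_{n}_spec.rb" for n in range(10)]
    for path in select_tests(all_tests, index, count):
        print(path)
```

Hashing the path keeps each file on the same shard from build to build, but it ignores how long each file takes to run. Balancing shards by recorded test duration would likely even out the runtimes, at the cost of needing timing data.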
“The mix of being able to own the infrastructure, performance-tune everything ourselves, having that full control, and then taking advantage of some Buildkite features (things like retries for failed build jobs and other kinds of automation) allows us to go really, really fast,” says Scanlan.
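As a rough illustration of the retry feature Scanlan mentions, the sketch below generates a one-step Buildkite pipeline as JSON with automatic retries and parallel jobs. The parallelism and retry attributes are standard Buildkite command step settings. The label, command, and numeric values are hypothetical and do not come from Intercom's configuration.

```python
import json


def build_pipeline(parallelism: int, retry_limit: int) -> dict:
    """Build a one-step pipeline definition as a plain dictionary."""
    step = {
        "label": "tests",
        # Hypothetical command; substitute the real test runner.
        "command": "bin/run-tests",
        # Run this many copies of the step as parallel jobs.
        "parallelism": parallelism,
        "retry": {
            "automatic": [
                # An exit status of -1 is how Buildkite reports a lost agent,
                # which is an environment failure rather than a test failure.
                {"exit_status": -1, "limit": retry_limit},
            ]
        },
    }
    return {"steps": [step]}


if __name__ == "__main__":
    # Example values only.
    print(json.dumps(build_pipeline(parallelism=20, retry_limit=2), indent=2))
```

The output could be piped to the agent from an earlier step, for example: python pipeline.py | buildkite-agent pipeline upload. Limiting automatic retries to the lost-agent status is one possible design choice: infrastructure failures are retried, while genuine test failures still surface immediately.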