How PagerDuty leverages Buildkite for 20% faster incident resolution
PagerDuty is the leader in digital operations management with more than 13,000 customers worldwide who rely on its platform to keep their own digital services running. Its platform helps clients identify issues and opportunities in real time and bring together the right people to fix problems faster and prevent them in the future. Headquartered in San Francisco, PagerDuty has a distributed engineering team operating in Toronto, Atlanta, London, Sydney, and Seattle in addition to its home base.
As the world rushes online and to remote work, there has been a 47% increase in the number of daily incidents. However, PagerDuty customers have been able to resolve those incidents 20% faster. PagerDuty’s use of AI and ML to address issues in minutes and seconds, not hours, is what has attracted 60 of the Fortune 100, and companies like GE, Cisco, Genentech, Electronic Arts, Netflix, Shopify, Zoom, DoorDash, lululemon and more as customers.
When CI/CD Becomes A ‘Choose Your Own Adventure’ Game
As one can imagine, the ability to deploy, test, and ship code faster is essential for a company that describes itself as the “central nervous system” of an organization’s IT operations. However, the CI/CD solutions used by PagerDuty weren’t able to keep up with the testing and deployment needs of engineering and other teams.
“CI/CD was a choose-your-own adventure for our feature delivery teams,” recalled Tristan Bates, Senior Site Reliability Engineer at PagerDuty. “Deployments were either a mix of self-hosted GoCD or manual deployments using shell scripts and Makefiles.”
Neither option was ideal for PagerDuty. “GoCD was difficult to maintain and required constant collaboration between feature delivery teams and the SRE team whenever a change to a deploy pipeline was required, a new service was created, or an engineer switched teams,” said Bates.
He added, “Custom scripts for deployment meant there was no standardization and it was difficult to tell when, how, or why something was deployed.”
This inconsistent process prompted the search for a replacement. After reviewing more than 50 vendors, PagerDuty narrowed it down to three finalists: Jenkins with CloudBees support, an updated version of GoCD, and Buildkite.
They arrived at their final decision by allowing service delivery teams to build pipelines for all three tools and test them against one another.
“We set up environments for all of these so that teams could test deploying to staging, deploying to production,” said Bates. “We created a decision matrix of all the different features from all the tools and our requirements, and scored each of them. It was a very analytically-driven selection.”
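As a rough illustration of the kind of decision matrix Bates describes, the sketch below scores candidate tools against weighted requirements and ranks them. The tool names, weights, and scores are placeholders invented for the example; they are not PagerDuty's actual criteria or results.

```python
from typing import Dict, List

# Hypothetical requirement weights (higher means more important).
# These are illustrative only, not PagerDuty's real weighting.
WEIGHTS: Dict[str, float] = {
    "hybrid_agents": 3.0,
    "secrets_in_house": 3.0,
    "self_service": 2.0,
}

# Hypothetical 1-5 scores for three placeholder tools.
SCORES: Dict[str, Dict[str, int]] = {
    "tool_a": {"hybrid_agents": 2, "secrets_in_house": 4, "self_service": 2},
    "tool_b": {"hybrid_agents": 3, "secrets_in_house": 3, "self_service": 3},
    "tool_c": {"hybrid_agents": 5, "secrets_in_house": 5, "self_service": 4},
}


def weighted_total(scores: Dict[str, int], weights: Dict[str, float]) -> float:
    """Sum each requirement's score multiplied by its weight."""
    return sum(weights[name] * scores[name] for name in weights)


def rank(matrix: Dict[str, Dict[str, int]], weights: Dict[str, float]) -> List[str]:
    """Return tool names ordered from highest to lowest weighted total."""
    return sorted(matrix, key=lambda tool: weighted_total(matrix[tool], weights), reverse=True)


if __name__ == "__main__":
    for tool in rank(SCORES, WEIGHTS):
        print(tool, weighted_total(SCORES[tool], WEIGHTS))
```

Keeping the weights separate from the scores makes it easy to see how a change in priorities would change the ranking.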
Buildkite’s Solution for Faster Shipping
In the end, Buildkite was the solution most closely aligned with PagerDuty’s key requirements:
- Hybrid: PagerDuty needed a hybrid solution where the control plane was cloud managed, but agents could run locally.
- Secrets management: The team needed to keep all of their secrets within their own infrastructure.
- Self-service and ease of use: PagerDuty’s multiple engineering teams operate under a full-service ownership model. Common tooling and services help them reduce their cognitive load.
“The hassle of managing the control plane is totally out of our hands,” Bates said of Buildkite. “We can set up our own secrets management, access to internal schedulers, and AWS and don’t have to have those secrets out on the cloud at large.”
PagerDuty has since moved all deployment pipelines to the platform. This includes the work of any team that touches code within the company, not just engineering.
“Ninety-nine percent of everything that makes it into production passes through Buildkite for deployment,” said Bates. “We also use it a lot for running tests and other jobs as well.”
Freeing Up Time for SRE and Feature Teams
PagerDuty has multiple SRE teams; the one responsible for Buildkite is charged with enabling the rest of the engineering teams to deliver reliable and scalable services efficiently.
“The ability to bootstrap new services quickly is key for us so we can focus on delivering features, bug fixes, and improvements instead of repeating the same common setup and configuration steps for every service,” Bates said. “With Buildkite and a small amount of automation we’ve built ourselves, teams are able to go from a blank repo to a service running in production in a few minutes. This allows us to go from an idea to an MVP much quicker.”
Recalling PagerDuty’s previous CI/CD process, Bates said, “It used to take a day to get a pipeline up and running because you had to learn this archaic XML format, and set up credentials, and perform these rituals and other manual steps. It was just really difficult to get code deployed to any environment.”
But now, with Buildkite, “it’s trivial,” Bates said. “You click a button and it generates a pipeline, and it does all this dynamic magic. Teams just don’t think about it anymore.”
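The article does not show PagerDuty's automation, but the general idea of generating a pipeline dynamically can be sketched as a small script that emits Buildkite pipeline steps as JSON. The service name, environment names, and `make` commands below are hypothetical stand-ins, not PagerDuty's actual setup.

```python
import json
from typing import Any, Dict, List


def build_pipeline(service: str, environments: List[str]) -> Dict[str, Any]:
    """Build a pipeline definition: one test step, then one deploy step per environment.

    A "wait" step before each deploy makes it run only after earlier steps pass.
    The make targets used here are hypothetical placeholders.
    """
    steps: List[Any] = [{"label": f"Test {service}", "command": "make test"}]
    for env in environments:
        steps.append("wait")
        steps.append(
            {"label": f"Deploy {service} to {env}", "command": f"make deploy ENV={env}"}
        )
    return {"steps": steps}


if __name__ == "__main__":
    # Print the pipeline as JSON so it can be piped to the Buildkite agent.
    pipeline = build_pipeline("example-service", ["staging", "production"])
    print(json.dumps(pipeline, indent=2))
```

In a Buildkite job, output like this could be piped to `buildkite-agent pipeline upload`, which reads a pipeline definition from standard input; the script name and the rest of the wiring would depend on the team's own setup.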