
How to lower costs while scaling your CI/CD: Use Spot Instances


Every great engineering team is running CI/CD. That’s because CI/CD pipelines let you automate workflows like building, testing, and deploying your software. These are jobs that would otherwise take hours of manual work and likely involve a few special scripts that need careful handling.

In short: CI/CD is about unblocking your engineers so they can focus on shipping updates to users, planning new changes, and improving your team processes—all wins for your organization.

However, as the team creates more pipelines and runs them more often, costs creep up. CI/CD pipelines need to be run by build agents after all, and infrastructure isn’t free.

There are a few ways to tackle this cost problem:

  • Ship changes slower. Fewer changes mean you need to run pipelines less often, but that also means features and fixes get to your users more slowly.
  • Run fewer agents. Fewer agents mean lower infrastructure costs, but pipelines will take longer to start running, leaving your engineers waiting around.

Neither option is compelling. What if you could reduce the cost of your infrastructure while retaining your current speed and efficiency? Spot Instances let you do just that.

Introducing Spot Instances

Your build agents may already run on AWS, GCP, Azure, or another cloud provider. A Spot Instance (just Spot, for short) is a virtual machine that cloud providers offer at a lower price, on the condition that it can be terminated with little notice. Importantly, Spot Instances run on spare capacity: if only 20% of a cloud provider's compute capacity is being used by regular machines, the provider wants to incentivize teams to take up the extra space.

Here’s how they work on Amazon EC2:

Diagram showing two Amazon AWS Availability Zones. The first Zone's EC2 only has a small amount of regular instances. The second Zone's EC2 has a large amount of regular instances.

AWS Availability Zone 1 has lots of spare capacity for new Spot Instances, but Availability Zone 2 is filled with regular instances. The amount of spare capacity affects how many Spot Instances can be started, how long those machines will live, and their current price.

Let’s talk about the benefits and challenges of Spot Instances.

Benefits of Spot Instances

Spot Instances are significantly discounted: up to 90% off regular (On-Demand) prices on Amazon EC2. This is because they run on a cloud provider's unused compute capacity. You can either run the same compute capacity at a lower cost, or increase your capacity for the same spend.

Switching to Spot Instances is usually fairly simple. Most Infrastructure-as-Code templates have a parameter to control how much of a fleet of agents uses Spot Instances. For example, the Buildkite Elastic CI Stack for AWS exposes the OnDemandPercentage parameter to control this.
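As a rough illustration of how such a parameter might be set, the sketch below builds a CloudFormation-style parameter list from a target Spot percentage. Only OnDemandPercentage comes from the text above; the BuildkiteQueue key, the helper name, and the default queue are assumptions for illustration and should be checked against the template you actually deploy.

```python
from typing import Dict, List


def fleet_parameters(spot_percentage: int, queue: str = "default") -> List[Dict[str, str]]:
    """Build stack parameters for a fleet with the given share of Spot Instances.

    Hypothetical helper: the parameter keys other than OnDemandPercentage
    are assumptions, not confirmed by the article.
    """
    if not 0 <= spot_percentage <= 100:
        raise ValueError("spot_percentage must be between 0 and 100")
    # The stack is configured by its On-Demand share, so invert the Spot share.
    on_demand_percentage = 100 - spot_percentage
    return [
        # CloudFormation parameter values are passed as strings.
        {"ParameterKey": "OnDemandPercentage", "ParameterValue": str(on_demand_percentage)},
        {"ParameterKey": "BuildkiteQueue", "ParameterValue": queue},  # assumed key
    ]


if __name__ == "__main__":
    # Start small: 20% Spot, 80% regular instances.
    print(fleet_parameters(20))
```

Keeping the Spot share as a single input makes it easy to raise the percentage gradually as confidence grows.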

Downsides of Spot Instances

Spot Instances are interruptible. If prices rise (and they are volatile), or the provider needs the capacity back for regular instances, your machines can be terminated at short notice: GCP and Azure give 30 seconds of notice, while AWS gives two minutes. Long-running, uninterruptible processes are therefore a poor fit for Spot Instances.
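To make the notice period concrete, here is a minimal sketch of how a job wrapper might work out how much time is left once an interruption is scheduled. It assumes the AWS behaviour of publishing a small JSON document with an action and a UTC time through the instance metadata service. Fetching that document (including any metadata token handling) is left to the caller, and the function name is hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Optional

# On EC2 the notice is read from the instance metadata service at
# http://169.254.169.254/latest/meta-data/spot/instance-action
# (the endpoint returns 404 while no interruption is scheduled).


def seconds_until_interruption(body: str, now: datetime) -> Optional[float]:
    """Return the seconds left before the scheduled action, or None if there is no notice.

    `body` is the raw response text; pass an empty string when the
    endpoint reported that nothing is scheduled.
    """
    if not body.strip():
        return None
    notice = json.loads(body)
    # Assumed payload shape: {"action": "terminate", "time": "2024-01-01T12:02:00Z"}
    scheduled = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return (scheduled - now).total_seconds()
```

A wrapper could poll every few seconds and, once a value is returned, stop picking up new work and save progress within the remaining window.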

Spot Instances also may not be available at your desired price point. If your whole agent fleet is made up of Spot Instances, you can run into issues if the Spot Instance price spikes or there’s simply no Spot capacity available.

Guidelines for using Spot Instances in CI/CD

Given the challenges, how do you make use of Spot Instances in a way that ensures your pipelines will be cost effective and continue running? Here are some guidelines we recommend:

  • Be ready for interruptions. We have an article outlining CI/CD retry steps for Spot Instances, but the basics are to either put shorter tasks on Spot Instances or make sure those tasks can save their progress and resume on another machine.
  • Play with Spot Instance percentages. If you already have a fleet of agents running well on regular instances, you don’t want to switch it to 100% Spot Instances overnight. After all, your tasks may not handle interruptions well and there may be issues to resolve. Start small and see how using Spot Instances impacts build times and throughput. Over time, increase the percentage of Spot Instances as you gain confidence they can run your pipelines.
  • Ensure you always have machines. Due to capacity and pricing, Spot Instances may not always be available, and you don’t want your pipelines grinding to a halt. Accepting a broader range of instance types in your agent fleet can make it more likely that you’ll always have machines. You can also fall back to regular instances on EC2 when there’s no Spot capacity.

Doing this should ensure your pipelines are ready for Spot Instances and keep executing no matter what.
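The "save progress and resume" guideline can be sketched as a small checkpointing loop. This is an illustration under stated assumptions, not a prescribed implementation: the function name is hypothetical, and the checkpoint is a local JSON file standing in for storage that would need to outlive the machine (for example, an artifact or object store).

```python
import json
from pathlib import Path
from typing import Callable, List


def run_resumable(items: List[str], work: Callable[[str], None], checkpoint: Path) -> List[str]:
    """Process items in order, recording each finished item so a retry can skip it."""
    # Load progress left behind by a previous (possibly interrupted) attempt.
    done: List[str] = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    for item in items:
        if item in done:
            continue  # already completed before the interruption
        work(item)
        done.append(item)
        # Persist after every item so at most one item of work is lost.
        checkpoint.write_text(json.dumps(done))
    return done
```

When the job is retried on another machine, it reads the same checkpoint and only runs the items that have not finished yet.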

Cost savings with Spot Instances

Exactly how much can Spot Instances save your organization? Rippling saved 60% on their EC2 compute costs by leveraging Spot Instances—the savings really stack up when you make this change across your whole agent fleet. We’ve seen other customers save similar percentages with Spot Instances.

So, how do you justify moving to Spot Instances? Analyzing your cloud provider bill is a good place to start. It will show how much you currently spend on instances. If you compare your spend to the price of equivalent Spot Instances, you can estimate the savings your organization can make. Both Azure and AWS also expose historical Spot Instance pricing in their portals, which can give your team a more accurate estimate.
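A back-of-the-envelope version of that comparison might look like the sketch below. The function name and all prices are hypothetical placeholders; substitute the figures from your own bill and your provider's Spot pricing history.

```python
def estimate_monthly_savings(
    instance_hours: float,
    on_demand_price: float,
    spot_price: float,
    spot_fraction: float,
) -> float:
    """Estimate savings from moving a fraction of instance hours to Spot.

    Prices are per instance hour; spot_fraction is between 0.0 and 1.0.
    """
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0.0 and 1.0")
    current_cost = instance_hours * on_demand_price
    # Blend the two hourly prices according to the share of hours on Spot.
    blended_price = spot_fraction * spot_price + (1.0 - spot_fraction) * on_demand_price
    return current_cost - instance_hours * blended_price


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 instance hours, $0.20 regular, $0.06 Spot, half the fleet on Spot.
    savings = estimate_monthly_savings(10_000, 0.20, 0.06, 0.5)
    print(f"Estimated monthly savings: ${savings:,.2f}")
```

With these placeholder numbers the estimate comes to about $700 a month against a $2,000 baseline. The estimate ignores retried work after interruptions, so treat it as an upper bound.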

Most cloud providers cite a 50-90% reduction in costs for Spot Instances compared to their standard instances. But as the prices are variable, your exact savings will depend on the regions and machine types you choose.

To make sure you get these savings, you need to prevent interruptions and availability constraints from reducing your throughput. Buildkite’s job queues and agent targeting features let you control exactly where jobs are run in your infrastructure. In short, each step in your pipeline can define which agents (instances) should run them. With this, you can ensure that the only tasks running on Spot Instances are those ready to save and resume their work.
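As a sketch of what that targeting could look like, the snippet below generates a pipeline definition in JSON with one interruptible step and one step that stays on regular instances. The queue names ("spot", "on-demand"), the commands, and the helper are hypothetical, and the retry rules shown are one plausible way to re-run a job whose agent disappeared; check the attribute names against the current pipeline documentation before relying on them.

```python
import json
from typing import Any, Dict


def step(label: str, command: str, queue: str, interruptible: bool) -> Dict[str, Any]:
    """Build one command step that targets agents in the given queue."""
    definition: Dict[str, Any] = {
        "label": label,
        "command": command,
        "agents": {"queue": queue},  # only agents in this queue pick up the job
    }
    if interruptible:
        # Assumed retry rules: re-run automatically when the agent was lost or stopped.
        definition["retry"] = {
            "automatic": [
                {"exit_status": -1, "limit": 2},
                {"signal_reason": "agent_stop", "limit": 2},
            ]
        }
    return definition


pipeline = {
    "steps": [
        step("Unit tests", "make test", queue="spot", interruptible=True),
        step("Deploy", "make deploy", queue="on-demand", interruptible=False),
    ]
}

if __name__ == "__main__":
    print(json.dumps(pipeline, indent=2))
```

Short, repeatable steps go to the Spot queue with automatic retries, while the deploy step, which should not be cut off halfway, stays on regular instances.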

How Rippling reduced cost and improved developer experience by moving CI to Spot Instances

Register to watch our webinar with Rippling. We dive into all of these aspects and more to help you apply these strategies to your own CI/CD processes.

Alternatives to Spot Instances for reducing CI/CD cost

While Spot Instances can provide significant savings, sometimes they just aren’t right for your pipelines. Luckily, they aren’t the only option for cost-effective CI/CD infrastructure.

On Amazon EC2, you have two options:

  • Reserved Instances: You commit to running a particular instance type for a 1- or 3-year term in exchange for discounted pricing. Reserved Instances add consistent compute to your agent fleet. However, you pay whether or not you use the reserved capacity.
  • AWS Savings Plans: You commit to a certain hourly spend for a 1- or 3-year term in exchange for discounted pricing. In contrast to Reserved Instances, you keep the flexibility to run any instance types across services, which lets you adapt as your needs change.

You could even combine some of these approaches—for example, using Spot Instances for cost-effective burst capacity while maintaining a baseline of Reserved Instances at all times. Since these options require a commitment to either instances or spend, though, they aren’t as flexible as Spot Instances when scaling your fleet of agents up or down on the fly. It’s easy to end up with unused capacity unless you take the time to monitor your workloads and figure out the spend that makes sense before committing.
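One way to reason about the unused-capacity risk is to compute how busy a committed instance must be before the commitment beats regular pricing. The sketch below uses hypothetical function names and placeholder prices; it is simple arithmetic, not a model of any provider's billing.

```python
def effective_hourly_cost(committed_price: float, utilization: float) -> float:
    """Cost per hour of useful work when a commitment is only partly used.

    utilization is the fraction of committed hours actually used (0 < u <= 1).
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in the range (0, 1]")
    # You pay for every committed hour, so idle hours inflate the cost of busy ones.
    return committed_price / utilization


def break_even_utilization(committed_price: float, on_demand_price: float) -> float:
    """Utilization below which the commitment costs more than regular pricing."""
    return committed_price / on_demand_price


if __name__ == "__main__":
    # Hypothetical prices: $0.12 per committed hour versus $0.20 per regular hour.
    print(break_even_utilization(0.12, 0.20))
    print(effective_hourly_cost(0.12, 0.5))
```

With these placeholder prices, a committed instance needs to be busy roughly 60% of the time to pay off. At 50% utilization, each useful hour effectively costs more than the regular price.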

Whichever route you choose, your infrastructure should be able to handle interruptions, balance your agent fleet, and target the right group of agents for each workload. Understanding the pricing models your cloud provider offers is key to optimizing the cost of your CI/CD infrastructure.

Conclusion

As your CI/CD jobs multiply, cost optimization strategies like using Spot Instances become critical. Interruptible Spot capacity needs upfront work to handle properly, but the potential savings (up to 90% in some cases) make it well worth the investment.

The key is taking a pragmatic approach—ramping up Spot Instance usage slowly, making your pipelines resilient to interruption, and maintaining non-Spot capacity as needed. This allows teams to realize the cost benefits of Spot Instances while mitigating operational risks.

For help implementing Spot Instances with Buildkite, you can talk with our friendly team for tailored assistance.
