Agent management best practices

This page covers best practices for managing Buildkite Agents, which execute your pipeline's jobs. The infrastructure, queue layout, and lifecycle policies you choose determine the security, speed, and cost of your agent fleet.

Choosing the right architecture

Buildkite Agents can run on local machines, cloud compute, container schedulers, and serverless infrastructure. Choose based on your workload characteristics, cost constraints, and operational maturity. Many teams adopt a hybrid approach, combining different stacks for different workload types.

Stack	Best for	Key benefits
Cloud compute	High utilization, disk-heavy jobs	Bin-pack multiple agents, warm images, large cache support
Containers (Kubernetes/ECS)	Elastic isolation per job, burst isolation	Fast autoscaling, clean environments, strong isolation
Buildkite hosted agents	Speed to value, zero ops, bursty workloads	Fully managed, isolated clusters, per-minute billing
Hybrid approach	Cost optimization, teams with different use cases	Matches each workload to the best-suited agent infrastructure

The following sections give a more detailed overview of each architecture type to help you choose what's right for your Buildkite organization.

Cloud compute

Run multiple agents per instance to maximize cost efficiency and enable heavy caching.

Pros:

Strong isolation with predictable performance
Warm images reduce job startup time
Compatible with spot instances for cost savings
Support for large disk caches and GPU/TPU workloads

Cons:

Additional operational overhead to patch and maintain instances
Cost inefficiency when agents are under-used
Slower agent spin-up times compared to other agent architectures

Learn more in Elastic CI Stack for AWS.

Containers (Kubernetes, ECS)

You can deploy ephemeral agents per job for maximum isolation and rapid scaling, or long-running agents that stay alive between jobs for improved performance through warm starts and persistent caching.

Pros:

Fast spin-up with fine-grained autoscaling
Clean environments reduce build flakiness
Native resource limits and multi-tenant isolation

Cons:

Pulling large images can increase job startup latency
Requires cluster expertise and ongoing platform maintenance
Limited access to large persistent disk caches per job

Learn more in Agent Stack for Kubernetes.

Buildkite hosted agents

Buildkite hosted agents provide fully managed infrastructure with isolated clusters and minimal operational overhead.

Pros:

Fully managed infrastructure with zero operational overhead
Built-in caching for Git mirrors and containers, as well as attachable Cache volumes for temporary data storage
Isolated clusters that provide strong security boundaries
Per-minute billing with automatic scaling for bursty workloads
Ideal for highly parallel test suites

Cons:

Hosted agents run outside your private network boundary, so may not meet strict compliance or data-residency requirements
Less control over hardware configuration and OS versions than in self-managed compute
Higher cost for sustained high throughput compared to self-managed compute

Capacity strategy

You don't need to settle on a single architecture within your Buildkite organization. You can use different stacks based on the needs and knowledge level of your teams.

For example, a popular approach among Buildkite users is to run a self-managed agent fleet on either Kubernetes or cloud compute instances (AWS or Google Cloud Platform) alongside Buildkite macOS hosted agents, which offer ease of management, clean development environments, and optimized caching. Different teams in those Buildkite organizations can then use the stacks best suited to their needs.

Similarly, in terms of agent fleet scaling, instead of choosing between using static or autoscaling agents exclusively, you can:

Keep one or two small static instances in your default queue for pipeline uploads, as this speeds up pipeline starts and allows proper autoscaling.
Use dedicated autoscaling queues for the actual workload.
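
As a rough illustration of this split, the sketch below generates a pipeline in which the upload step targets the default queue and the workload steps target an autoscaling queue. The `agents` and `queue` step attributes are standard pipeline syntax. The queue names, labels, commands, and the `.buildkite/pipeline.py` path are hypothetical placeholders, not values from this page.

```python
import json

# Hypothetical queue names; substitute the queues defined in your cluster.
UPLOAD_QUEUE = "default"              # small static pool, used for pipeline uploads
WORKLOAD_QUEUE = "autoscaling-linux"  # autoscaling pool for the actual workload


def command_step(label: str, command: str, queue: str) -> dict:
    """Build a command step that targets agents in the given queue."""
    return {"label": label, "command": command, "agents": {"queue": queue}}


def upload_step() -> dict:
    """Initial step: runs on the static queue and uploads the generated pipeline.

    The script path is a placeholder for wherever this file lives in your repo.
    """
    return command_step(
        "Upload pipeline",
        "python3 .buildkite/pipeline.py | buildkite-agent pipeline upload",
        UPLOAD_QUEUE,
    )


def build_pipeline() -> dict:
    """Return the workload steps, all routed to the autoscaling queue."""
    return {
        "steps": [
            command_step("Unit tests", "make test", WORKLOAD_QUEUE),
            command_step("Build", "make build", WORKLOAD_QUEUE),
        ]
    }


if __name__ == "__main__":
    # Emit JSON on stdout so it can be piped to `buildkite-agent pipeline upload`.
    print(json.dumps(build_pipeline(), indent=2))
```

Because only the lightweight upload step runs on the static instances, the default queue can stay small while the heavier steps wait for autoscaled capacity.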

Structuring clusters and queues

Organize clusters as security boundaries and use queues for workload routing. Use separate queues and a small subset of agents to trial new architectures (for example, Buildkite hosted agents) before rolling them out broadly across your Buildkite organization.

Learn more about using clusters and queues in Managing clusters and Managing queues.

Agent lifecycle

Long-running agents provide caching benefits (Git mirrors, dependencies). When you run them:

Retire the oldest agents first during scale-down
Add telemetry to detect flaky agents

Ephemeral agents reduce attack surface and configuration drift. Buildkite hosted agents support repository caches and shared volumes.
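
The "retire oldest agents first" policy for long-running agents can be sketched as a small selection function. This is a minimal illustration, not part of any Buildkite tool: the `Agent` record and its fields are hypothetical, and skipping agents that are currently running a job is an added assumption so that scale-down does not interrupt work.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Agent:
    """Hypothetical record of a running agent; not a Buildkite API type."""
    name: str
    started_at: datetime
    busy: bool  # True while the agent is running a job


def pick_agents_to_retire(agents: list[Agent], count: int) -> list[Agent]:
    """Pick up to `count` idle agents, oldest first."""
    # Assumption: busy agents are never selected, so running jobs can finish.
    idle = [a for a in agents if not a.busy]
    idle.sort(key=lambda a: a.started_at)
    return idle[:max(count, 0)]


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
fleet = [
    Agent("agent-a", now - timedelta(days=3), busy=False),
    Agent("agent-b", now - timedelta(days=7), busy=True),
    Agent("agent-c", now - timedelta(hours=2), busy=False),
    Agent("agent-d", now - timedelta(days=5), busy=False),
]
print([a.name for a in pick_agents_to_retire(fleet, 2)])
# ['agent-d', 'agent-a']
```

Here `agent-b` is the oldest agent but is skipped because it is busy, so the two oldest idle agents are chosen instead.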

Right-sizing your agent fleet

Monitor queue times with cluster insights and Buildkite Agent Metrics.
Use cloud-based autoscaling (Elastic CI Stack for AWS, Buildkite Agent Scaler, Agent Stack for Kubernetes).
Maintain dedicated pools for CPU-intensive, GPU-enabled, or OS-specific workloads.
Configure graceful termination to allow jobs to complete.
To make your agent fleet easy to duplicate, favor agent images and configurations that can run in more than one environment. For example, you can build a single Docker image that contains the latest Buildkite Agent binary, a selection of development and deployment tools, and a configuration that reads information such as queues or tags from environment variables. You could then run this image as Kubernetes agents, ECS agents, or in a Docker setup on a virtual machine.
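
To make the autoscaling guidance in this section more concrete, the following sketch shows one simple way to turn job counts for a queue into a target instance count. It is a simplified assumption for illustration and does not describe how Elastic CI Stack for AWS, Buildkite Agent Scaler, or Agent Stack for Kubernetes compute capacity. The function name, parameters, and default limits are all hypothetical.

```python
import math


def desired_instance_count(
    scheduled_jobs: int,
    running_jobs: int,
    agents_per_instance: int = 1,
    min_instances: int = 0,
    max_instances: int = 10,
) -> int:
    """Return the instance count needed to cover a queue's unfinished jobs.

    Assumes one job per agent and that every instance runs the same
    number of agents. The result is clamped to [min_instances, max_instances].
    """
    if agents_per_instance < 1:
        raise ValueError("agents_per_instance must be at least 1")
    unfinished = max(scheduled_jobs, 0) + max(running_jobs, 0)
    needed = math.ceil(unfinished / agents_per_instance)
    return max(min_instances, min(needed, max_instances))


# 7 waiting + 4 running jobs, 4 agents per instance: ceil(11 / 4) instances.
print(desired_instance_count(scheduled_jobs=7, running_jobs=4, agents_per_instance=4))
# 3
```

Counting running jobs as well as scheduled ones keeps the target from dropping below the capacity already in use, which complements graceful termination during scale-down.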

Resilience and redundancy

Aim for an architecture that lets you run agents in multiple regions or on a secondary platform, so that critical queues keep running during outages. For example, instead of running all the agents for a critical queue in a single availability zone, spread them across several availability zones. This way, if one availability zone experiences issues, the agents in the other zones can still pick up jobs.
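
The effect of spreading agents across availability zones can be illustrated with a short calculation. This is a minimal sketch with hypothetical function names and placeholder zone names. It assumes zone names are unique and that agents are interchangeable within a queue.

```python
def spread_across_zones(agent_count: int, zones: list[str]) -> dict[str, int]:
    """Distribute agents as evenly as possible across the given zones."""
    if not zones:
        raise ValueError("at least one zone is required")
    if agent_count < 0:
        raise ValueError("agent_count must not be negative")
    base, extra = divmod(agent_count, len(zones))
    # The first `extra` zones each take one additional agent.
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


def surviving_capacity(placement: dict[str, int], failed_zone: str) -> int:
    """Count the agents still available if one zone becomes unavailable."""
    return sum(n for zone, n in placement.items() if zone != failed_zone)


placement = spread_across_zones(10, ["zone-a", "zone-b", "zone-c"])
print(placement)
# {'zone-a': 4, 'zone-b': 3, 'zone-c': 3}
print(surviving_capacity(placement, "zone-a"))
# 6
```

With all ten agents in a single zone, losing that zone would leave the queue with no capacity. With the spread above, the worst-case single-zone failure still leaves six agents to pick up jobs.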

Design your agent architecture so that a problem with a single host or cluster affects only a limited (preferably small) subset of queues or pipelines, and not your entire agent fleet.

Security

Build security into agent infrastructure from the start. Follow least privilege principles and integrate proper secret management. It's recommended that you:

Store secrets in hooks or cloud secret stores. Learn more about secrets management in Buildkite Pipelines in Buildkite secrets and Secrets management.
Use short-lived tokens and ephemeral agents
Enforce infrastructure-as-code (Terraform, CloudFormation)

For more information on agent security, see Buildkite Agent security.