Monitoring and observability

This page covers best practices for monitoring, observability, and logging in Buildkite Pipelines.

Telemetry operational tips

  • When implementing telemetry, start by profiling queue wait times and checkout times, as these are usually the biggest, cheapest wins.
  • Include pipeline, queue, repo path, and commit metadata in spans and events to make troubleshooting easier.
  • Stream Buildkite Pipelines telemetry data to your standard observability stack so platform-level SLOs and alerts exist alongside the app telemetry, keeping one source of truth.

Quick checklist for using telemetry

Choose integrations based on your existing observability tooling and needs:

  • Enable Amazon EventBridge for real-time alerting when you need to integrate with AWS-native tooling. Start by setting up notifications and subscribing your alerting pipeline.
  • Turn on OpenTelemetry (OTel) export when you need vendor-neutral observability that works with your existing OTel collector. Start with job spans and queue metrics.
  • If you are using Datadog, enable the agent's APM tracing to send job execution traces to Datadog.
  • If you are using Backstage, integrate the Buildkite Backstage plugin to surface pipeline health and build status directly in your developer portal.
  • If you are using Honeycomb, send build events and traces to enable high-cardinality analysis of pipeline performance and failures.

Core pipeline telemetry recommendations

Establish standardized metrics collection across all pipelines to enable consistent monitoring and analysis:

  • Track build times by pipeline, step, and queue to identify performance bottlenecks with build duration metrics.
  • Monitor agent availability and scaling efficiency across different workload types by tracking queue wait times.
  • Measure success rates by pipeline, branch, and time period to identify reliability trends through failure rate analysis.
  • Standardize retry counts for flaky tests and assign custom exit statuses that you can report on with your telemetry provider.
  • Track retry success rates by exit code to differentiate between transient failures worth retrying and permanent failures that need fixing.
  • Use OTel integration to gain deep visibility into pipeline execution flows.
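As a sketch, the retry standardization described above can be captured in pipeline YAML with automatic retry rules keyed to exit statuses. The step label, command, and exit status values here are illustrative:

```yaml
steps:
  - label: "Integration tests"
    command: "scripts/integration-tests.sh"
    retry:
      automatic:
        # Retry only transient failures; report on these exit codes
        # with your telemetry provider to track retry success rates.
        - exit_status: 255  # illustrative: infrastructure failure
          limit: 2
        - exit_status: 42   # illustrative: custom "flaky test" status
          limit: 1
```

Permanent failures (any other exit status) fail immediately, which keeps the retry-success-rate metric meaningful.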

Using analytics for performance improvement

  • Monitor build duration, throughput, and success rate as key metrics. Use OTel integration and queue metrics.
  • You can also use OTel integration to identify the slowest steps and optimize them through bottleneck analysis.
  • Look for repeated error types with failure clustering.

Logging and monitoring

  • Favor JSON or other parsable formats for structured logs, as such formats can be easily queried when debugging. Use log groups to better represent relevant sections in the logs visually.
  • Differentiate between info, warnings, and errors by using appropriate log levels.
  • Store logs, reports, and binaries as artifacts for debugging and compliance.
  • Use cluster insights or external tools to analyze durations and failure patterns to track trends.
  • Avoid creating log files that are too large. Large log files make issues harder to troubleshoot and are harder to manage in the Buildkite Pipelines interface.
    • To avoid overly large log files, avoid verbose output from apps and tools unless it is needed. See also Managing log output.
    • If you are using Bazel, note that Bazel's log file is extremely verbose. Instead, consider using the Bazel BEP Failure Analyzer Buildkite Plugin to get a simplified view of the error(s).
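As a minimal sketch, a build script can combine log groups with structured JSON log lines. In Buildkite log output, a line starting with "---" opens a collapsed group and "+++" opens a group that is expanded by default; the JSON lines here are illustrative:

```shell
#!/bin/bash
set -euo pipefail

# "---" starts a collapsed log group; "+++" starts one expanded by default.
echo "--- :package: Installing dependencies"
echo '{"level":"info","msg":"dependencies installed"}'

echo "+++ :test_tube: Running tests"
echo '{"level":"error","msg":"2 tests failed","suite":"unit"}'
```

Emitting one JSON object per line keeps the logs easy to query in whatever log tooling you stream them to.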

Set relevant alerts

  • Notify responsible teams of failing builds with failure alerts.
  • Detect bottlenecks when builds queue too long by monitoring queue depth. You can use queue metrics (insights) for this.
  • Trigger alerts when agents go offline or degrade to monitor agent health. If individual agent health is less of a concern, terminate unhealthy instances and spin up replacements.

Getting metrics out of Buildkite Pipelines

Buildkite Pipelines provides multiple ways to export CI/CD metrics depending on your needs (agent fleet health, build performance, trace correlation, test quality, and so on) and where you want the data (Datadog, Prometheus, Grafana, CloudWatch, your own OpenTelemetry collector, or Buildkite's built-in dashboards).

Most teams need two or three of these approaches working together, as they are complementary rather than competing. The following sections introduce each approach, explain when to use it, and link to detailed setup documentation.

Decision matrix

| What you want to measure | Best approach | Plan tier | Push or pull | Key destinations |
| --- | --- | --- | --- | --- |
| Agent fleet health (agents online, busy, idle per queue) | buildkite-agent-metrics | All | Pull (polls Buildkite API) | Prometheus, StatsD/DogStatsD to Datadog, CloudWatch |
| Agent process metrics (goroutines, memory, GC) | Agent health check service | All | Pull (Prometheus scrape) | Prometheus |
| Build and job lifecycle traces¹ (spans, durations, wait times) | OpenTelemetry notification service | Enterprise | Push (OTel) | Any OTel-compatible collector (Honeycomb, Grafana Tempo, Datadog, and others) |
| Agent-side job execution traces | OpenTelemetry agent tracing | All | Push (OTel) | Any OTel-compatible collector |
| Queue depth, wait times, concurrency² | Cluster insights and GraphQL API | Varies | Pull or UI | Built-in UI; GraphQL or REST for custom dashboards |
| Build events for alerting and dashboards | Webhooks and Amazon EventBridge | All | Push | PagerDuty, Datadog, custom endpoints |
| Test performance and flaky tests | Test Engine | Add-on | UI and API | Built-in UI; API for export |

¹ The buildkite.job span includes the pipeline slug, build number, and a wait_time_ms attribute. You can also use a signals-to-metrics connector to produce metrics from spans.

² The GraphQL ClusterQueue node exposes a metrics field with connectedAgentsCount, runningJobsCount, waitingJobsCount, and waitTimeSec (min/p50/p95/max). The same data is available through REST API at /v2/organizations/{org}/clusters/{cluster_uuid}/queues/{queue_uuid}/metrics.
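A GraphQL query for these queue metrics could be sketched as follows. The field names on metrics come from the note above; the query path from organization down to the queues, and the slug and ID values, are assumptions to illustrate the shape:

```graphql
# Sketch only: "my-org" and "cluster-uuid" are placeholders, and the
# path to the ClusterQueue node is an assumption.
query QueueMetrics {
  organization(slug: "my-org") {
    cluster(id: "cluster-uuid") {
      queues(first: 10) {
        edges {
          node {
            key
            metrics {
              connectedAgentsCount
              runningJobsCount
              waitingJobsCount
            }
          }
        }
      }
    }
  }
}
```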

buildkite-agent-metrics and the agent health check service are different tools

The buildkite-agent-metrics tool gives you fleet-level queue and agent counts by polling the Buildkite API. The agent's health check service exposes per-agent process health through a Prometheus endpoint on the agent binary itself. You likely want both.

Metrics approaches in detail

Each approach below covers a different aspect of CI/CD observability available in Buildkite Pipelines. Choose a combination of these to get full coverage across fleet health, build performance, and test quality.

Fleet health dashboard

buildkite-agent-metrics is a standalone binary (separate from the agent) that polls the Buildkite API and exports agent and queue metrics.

Metrics provided:

  • Agents: total, busy, idle counts per queue
  • Jobs: running, scheduled, waiting counts
  • Queue depth and wait times

Supported destinations:

  • Prometheus — exposes a /metrics endpoint for scraping
  • StatsD — emits StatsD-format metrics, which is also the path to get metrics into Datadog (configure DogStatsD as the StatsD receiver)
  • CloudWatch — publishes directly to AWS CloudWatch Metrics

Use this approach when you want a fleet-level view of agent capacity and queue health in your external monitoring tool. This is the primary path for getting agent metrics into Datadog, Prometheus, or CloudWatch.

Getting agent metrics into Datadog

To get Buildkite agent metrics into Datadog, configure buildkite-agent-metrics with the StatsD backend pointed at a DogStatsD receiver (the Datadog Agent's built-in StatsD server). See the buildkite-agent-metrics CLI documentation for setup details.

This tool polls the Buildkite API, so it shows point-in-time snapshots rather than event-level granularity. It does not cover build lifecycle events or trace data.

Per-agent process health

The Buildkite agent's health check service includes a native Prometheus-compatible /metrics endpoint served by the agent process itself (available since agent version 3.113.0).

Metrics provided:

  • Go runtime metrics: goroutines, memory allocation, GC pause times
  • Agent process health: uptime, version info

Use this approach when you run Prometheus and want to monitor agent process health alongside your other infrastructure. This is useful for detecting agent crashes, memory leaks, or degraded agents.

This endpoint shows individual agent process health, not fleet-level queue or capacity data. For fleet-level metrics, use buildkite-agent-metrics alongside it.

Build lifecycle traces with OpenTelemetry

Enterprise-only feature

The OpenTelemetry tracing notification service requires an Enterprise plan. It provides traces (spans), not traditional metrics (gauges or counters). If you need time-series metrics, you need to derive them from spans in your backend (for example, using span-to-metrics features in Datadog or Grafana).

The OpenTelemetry tracing notification service pushes build and job lifecycle events as OpenTelemetry (OTel) traces to your collector.

Data provided (as trace spans):

  • Build lifecycle: created, scheduled, running, finished
  • Job lifecycle with durations, wait times, queue information
  • Pipeline and organization metadata as span attributes

Supported destinations: Any OTel-compatible backend, including Honeycomb, Grafana, Datadog APM, Jaeger, or your own OpenTelemetry collector.

Use this approach when you have an existing distributed tracing setup and want CI/CD events to appear as spans alongside your application traces. This is best for correlating build activity with deployments and service health.

Agent-side execution traces

The Buildkite agent can emit OpenTelemetry spans for job execution, providing execution-side trace context.

Data provided (as trace spans):

  • Job checkout, plugin, command, and artifact upload phases as individual spans
  • Execution timing for each phase

Supported destinations: Any OTel-compatible backend.

Use this approach when you want end-to-end trace context flowing from your application code through CI and back. This works alongside the notification service, as they are complementary:

  • Notification service provides control-plane lifecycle (build created, scheduled, running)
  • Agent tracing provides execution-side detail (checkout, plugins, command, artifacts)

Built-in cluster insights dashboards

Buildkite's built-in cluster insights dashboards show queue health, wait times, agent utilization, and concurrency.

Metrics provided:

  • Queue depth and wait times over time
  • Agent utilization and concurrency
  • Job throughput

Use this approach for quick visual checks of CI health without any external tooling. This is useful for debugging queue backups or capacity issues in real time. For queue-specific data, see queue metrics.

Note that some of the data shown in cluster insights is not yet available through an external export path (API, OpenTelemetry, or otherwise).

Custom dashboards with the GraphQL API

Buildkite's GraphQL API exposes build, job, agent, pipeline, and queue data for programmatic access.

Data available:

  • Build and job metadata, statuses, timings
  • Agent and queue information
  • Pipeline configuration and metrics

Use this approach when building custom dashboards (for example, in Retool or Grafana using a JSON API datasource), automation scripts, or when feeding data into your own data warehouse.

This is a polling-based approach, so you need to build your own scheduling to keep data fresh. Rate limits apply.

Real-time events with webhooks and EventBridge

Buildkite pushes build, job, and agent lifecycle events to your HTTP endpoints (webhooks) or Amazon EventBridge.

Events available:

  • Build created, started, finished, blocked
  • Job scheduled, started, finished, activated
  • Agent connected, disconnected, stopped

Supported destinations: Any HTTP endpoint (PagerDuty, Datadog webhook intake, custom services), or Amazon EventBridge to Lambda, SQS, or SNS.

Use this approach for event-driven alerting (for example, notifying a team when a build fails), feeding CI events into incident management systems, or building custom integrations. You can also configure pipeline-level notifications directly in your pipeline YAML.
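As an illustrative sketch, an EventBridge rule pattern that matches finished builds from the Buildkite partner event source might look like the following. Both the source prefix and the detail-type value are assumptions; check your partner event bus for the exact names:

```json
{
  "source": [{ "prefix": "aws.partner/buildkite.com" }],
  "detail-type": ["Build Finished"]
}
```

Attach the rule to a Lambda, SQS, or SNS target to fan the events out to your alerting or incident tooling.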

Test-level performance metrics

Buildkite Test Engine ingests test results and provides test-level metrics.

Metrics provided:

  • Test duration trends
  • Flaky test detection and rates
  • Pass and fail rates over time
  • Slowest tests

Use this approach when you care about test health independently from build infrastructure health. This is best for engineering teams focused on test suite reliability and performance.

Test Engine is a separate product from build and agent metrics. It covers test execution quality, not CI infrastructure health.

Common metrics recipes

The following recipes show how to connect metrics from Buildkite Pipelines to popular destinations. Each one maps a common goal to the right approach and configuration.

Agent metrics in Datadog

Configure buildkite-agent-metrics to emit StatsD metrics and point it at your Datadog Agent's DogStatsD listener (default: localhost:8125). This gives you agent counts, queue depth, and job counts as Datadog metrics that you can graph and alert on.

buildkite-agent-metrics -backend statsd \
  -statsd-host localhost:8125 \
  -statsd-tags \
  -token "$BUILDKITE_AGENT_TOKEN"

The -statsd-tags flag enables Datadog-compatible tagging, so metrics are tagged by queue rather than including the queue name in the metric name. This makes it easier to filter and group metrics in Datadog dashboards.

Build traces in Honeycomb or Grafana Tempo

Set up the OpenTelemetry tracing notification service to push to your OTel endpoint. For deeper execution-phase spans, also enable agent-level OpenTelemetry tracing. Together they provide control-plane lifecycle and execution detail.
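A minimal OpenTelemetry Collector config that accepts these spans over OTLP and forwards them to a backend might look like the following sketch. The exporter endpoint and the API-key header are placeholders for a Honeycomb-style backend:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlphttp:
    endpoint: "https://api.honeycomb.io"        # placeholder backend endpoint
    headers:
      x-honeycomb-team: "${HONEYCOMB_API_KEY}"  # placeholder auth header

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

For Grafana Tempo or another backend, swap the exporter endpoint and auth header for that backend's OTLP intake.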

Queue wait times in Prometheus

Run buildkite-agent-metrics with the Prometheus backend and scrape its /metrics endpoint. You get queue-level wait time metrics. For more granular per-job wait times, use OpenTelemetry traces, which provide span durations rather than traditional gauges.
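A minimal prometheus.yml scrape job for this setup could be sketched as follows. The target address is an assumption; point it at wherever you configured the exporter's Prometheus listener:

```yaml
scrape_configs:
  - job_name: "buildkite-agent-metrics"
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8080"]  # assumed exporter listen address
```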

Build failure alerts in PagerDuty

Configure a webhook notification service to send build.finished events to PagerDuty's Events API. Filter on build.state == "failed" in PagerDuty's event rules. You can also use conditional notifications in your pipeline YAML to send alerts to specific channels.
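For example, a pipeline-level notification that fires only on failure could be sketched in pipeline YAML like this (the channel name is illustrative):

```yaml
notify:
  - slack: "#build-alerts"  # illustrative channel
    if: build.state == "failed"
```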

Pipeline performance data collection

Poll the GraphQL API for build and job data on a schedule and store it in your own data warehouse. The API has time window limits on queryable data, so start collecting early. For built-in historical views, cluster insights provides some data with limited time ranges.

Per-agent process health in Prometheus

Enable the agent's health check service and add the /metrics endpoint to your Prometheus scrape config. This gives you Go runtime metrics for each agent process, which is useful for detecting degraded or unhealthy agents.
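The scrape job for agent process health could be sketched like this, with one target per agent host. The hostnames and port are illustrative; use the address you configured the agent's health check service to serve on:

```yaml
scrape_configs:
  - job_name: "buildkite-agent-health"
    metrics_path: /metrics
    static_configs:
      - targets:
          - "agent-1.internal:8080"  # illustrative agent hosts and port
          - "agent-2.internal:8080"
```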

Current limitations

The following areas of the current metrics capabilities have known limitations:

  • Metrics export parity: Cluster insights shows data that can't be fully replicated through any external export path today. If you are building external dashboards, some metrics might not currently be available for export.
  • OpenTelemetry enrichment: Additional span attributes such as build metadata, trigger context, and span links for triggered builds are being actively improved.
  • Historical data: Current cluster insights and queue metrics have limited lookback periods. If you need longer time windows for capacity planning, consider using the GraphQL API to collect and store data in your own warehouse.
  • Traces and metrics gap: OpenTelemetry exports are trace-based (spans), but some workflows require traditional time-series metrics (gauges, counters). Converting spans to metrics requires backend-side processing that not all observability stacks handle well.
  • Event payload coverage: Webhooks and Amazon EventBridge event payloads don't include all metadata, such as retry context and manual-versus-automatic action flags.