Monitoring and observability

This page covers best practices for monitoring, observability, and logging in Buildkite Pipelines.

Telemetry operational tips

  • When implementing telemetry, start by profiling wait and checkout times for your queues, as these are usually the biggest, cheapest wins.
  • Include pipeline, queue, repo path, and commit metadata in spans and events to make troubleshooting easier (see the sketch after this list).
  • Stream Buildkite Pipelines telemetry into your standard observability stack so that platform-level SLOs and alerts live alongside your application telemetry, keeping a single source of truth.
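
For example, the following pipeline sketch forwards build metadata to a telemetry endpoint as a custom event. The endpoint URL and event schema are hypothetical; the BUILDKITE_* environment variables are set by the agent, and the queue value assumes the agent was started with a queue tag.

    steps:
      - label: ":satellite: Emit build telemetry event"
        command: |
          # Forward pipeline, queue, repo, and commit metadata to a
          # (hypothetical) internal telemetry collector.
          curl -sf -X POST "https://telemetry.example.com/v1/events" \
            -H "Content-Type: application/json" \
            -d "{
              \"pipeline\": \"${BUILDKITE_PIPELINE_SLUG}\",
              \"queue\": \"${BUILDKITE_AGENT_META_DATA_QUEUE:-default}\",
              \"repo\": \"${BUILDKITE_REPO}\",
              \"commit\": \"${BUILDKITE_COMMIT}\"
            }"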

Quick checklist for using telemetry

Choose integrations based on your existing tooling and needs:

  • Enable Amazon EventBridge for real-time alerting when you need to integrate with AWS-native tooling. Start by setting up notifications and subscribing your alerting pipeline (see the sketch after this list).
  • Turn on OpenTelemetry (OTel) export when you need vendor-neutral observability that works with your existing OTel collector. Start with job spans and queue metrics.
  • If you are using Datadog, enable APM tracing on the Buildkite agent.
  • If you are using Backstage, integrate the Buildkite Backstage plugin to surface pipeline health and build status directly in your developer portal.
  • If you are using Honeycomb, send build events and traces to enable high-cardinality analysis of pipeline performance and failures.
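
As a starting point for the EventBridge route, the CloudFormation sketch below matches failed-build events from the Buildkite partner event bus and forwards them to an alerting target. The event bus and topic parameters are placeholders, and the detail-type and field names should be verified against the events your organization actually receives.

    Parameters:
      BuildkiteEventBusName:
        Type: String   # the partner event bus associated with your Buildkite organization
      AlertsTopicArn:
        Type: String   # an existing SNS topic used for CI alerts

    Resources:
      FailedBuildRule:
        Type: AWS::Events::Rule
        Properties:
          EventBusName: !Ref BuildkiteEventBusName
          EventPattern:
            detail-type:
              - "Build Finished"
            detail:
              build:
                state:
                  - "failed"
          Targets:
            - Arn: !Ref AlertsTopicArn
              Id: ci-failure-alerts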

Core pipeline telemetry recommendations

Establish standardized metrics collection across all pipelines to enable consistent monitoring and analysis:

  • Track build duration metrics by pipeline, step, and queue to identify performance bottlenecks.
  • Monitor agent availability and scaling efficiency across different workload types by tracking queue wait times.
  • Measure success and failure rates by pipeline, branch, and time period to identify reliability trends.
  • Standardize retry counts for flaky tests and assign custom exit statuses that you can report on with your telemetry provider (see the sketch after this list).
  • Track retry success rates by exit code to differentiate between transient failures worth retrying and permanent failures that need fixing.
  • Use OTel integration to gain deep visibility into pipeline execution flows.
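
For standardized retries and custom exit statuses, a step-level retry rule might look like the sketch below. The exit status 42 is only an example of a team convention for transient infrastructure failures; exit status -1 is what Buildkite reports when the agent is lost.

    steps:
      - label: ":test_tube: Integration tests"
        command: "./scripts/integration-tests.sh"
        retry:
          automatic:
            # Retry the team's "transient infrastructure failure" status (example convention).
            - exit_status: 42
              limit: 2
            # Retry when the agent was lost mid-job.
            - exit_status: -1
              limit: 2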

Using analytics for performance improvement

  • Monitor build duration, throughput, and success rate as key metrics, using the OTel integration and queue metrics.
  • Use bottleneck analysis, for example through the OTel integration, to identify the slowest steps and optimize them (a lightweight alternative is sketched after this list).
  • Look for repeated error types with failure clustering.
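
If a full tracing backend is not yet in place, a lightweight way to surface slow steps is to annotate the build whenever a step exceeds a threshold, as in the sketch below. The script path and the 600-second threshold are example values.

    steps:
      - label: ":stopwatch: Build"
        command: |
          start=$(date +%s)
          ./scripts/build.sh
          status=$?
          elapsed=$(( $(date +%s) - start ))
          # Flag unusually slow runs directly on the build page.
          if [ "$elapsed" -gt 600 ]; then
            buildkite-agent annotate "Build step took ${elapsed}s" \
              --style "warning" --context "slow-build"
          fi
          exit "$status"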

Logging and monitoring

  • Favor JSON or other parsable formats for structured logs, as these are easy to query when debugging. Use log groups to visually separate relevant sections of the log output (see the sketch after this list).
  • Differentiate between info, warnings, and errors by using appropriate log levels.
  • Store logs, reports, and binaries as artifacts for debugging and compliance.
  • Use cluster insights or external tools to analyze durations and failure patterns and to track trends over time.
  • Avoid creating log files that are too large. Large log files make issues harder to troubleshoot and are harder to manage in the Buildkite Pipelines interface.
    • To keep log files small, turn off verbose output from apps and tools unless it is needed. See also Managing log output.
    • If you are using Bazel, note that its log output is extremely verbose. Consider using the Bazel BEP Failure Analyzer Buildkite Plugin instead to get a simplified view of the errors.
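
The sketch below combines collapsible log groups with artifact upload. Lines beginning with --- open a collapsed log group and lines beginning with +++ open an expanded one; the script paths and artifact globs are example values.

    steps:
      - label: ":package: Build and test"
        command: |
          echo "--- :package: Installing dependencies"
          ./scripts/install-deps.sh
          echo "+++ :test_tube: Running tests"
          ./scripts/run-tests.sh
        artifact_paths:
          # Keep logs and reports for debugging and compliance.
          - "logs/**/*"
          - "reports/**/*.xml"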

Set relevant alerts

  • Set up failure alerts that notify the responsible teams when builds fail (see the sketch after this list).
  • Monitor queue depth to detect bottlenecks when builds wait too long; queue metrics (insights) can help here.
  • Monitor agent health by triggering alerts when agents go offline or degrade. If individual agent health is less of a concern, terminate the unhealthy instance and spin up a new one.
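
A pipeline-level notify block is one way to wire up failure alerts, as in the sketch below. The Slack channel and email address are placeholders, and the conditions assume your default branch is main.

    notify:
      - slack: "#ci-alerts"
        if: build.state == "failed" && build.branch == "main"
      - email: "oncall@example.com"
        if: build.state == "failed"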