Monitoring and observability

This page covers best practices for monitoring, observability, and logging in Buildkite Pipelines.

Telemetry operational tips

  • When implementing telemetry, start by profiling queue wait times and checkout times, as these are usually the biggest, cheapest wins.
  • Include pipeline, queue, repo path, and commit metadata in spans and events to make troubleshooting easier.
  • Stream Buildkite Pipelines telemetry data to your standard observability stack so platform-level SLOs and alerts exist alongside the app telemetry, keeping one source of truth.

Quick checklist for using telemetry

Choose integrations based on your existing observability tooling and needs:

  • Enable Amazon EventBridge for real-time alerting when you need to integrate with AWS-native tooling. Start by setting up notifications and subscribing your alerting pipeline.
  • Turn on OpenTelemetry (OTel) export when you need vendor-neutral observability that works with your existing OTel collector. Start with job spans and queue metrics.
  • If you are using Datadog, enable the agent's APM tracing to send job execution traces to Datadog.
  • If you are using Backstage, integrate the Buildkite Backstage plugin to surface pipeline health and build status directly in your developer portal.
  • If you are using Honeycomb, send build events and traces to enable high-cardinality analysis of pipeline performance and failures.

Core pipeline telemetry recommendations

Establish standardized metrics collection across all pipelines to enable consistent monitoring and analysis:

  • Track build times by pipeline, step, and queue to identify performance bottlenecks with build duration metrics.
  • Monitor agent availability and scaling efficiency across different workload types by tracking queue wait times.
  • Measure success rates by pipeline, branch, and time period to identify reliability trends through failure rate analysis.
  • Standardize retry counts for flaky tests and assign custom exit statuses that you can report on with your telemetry provider.
  • Track retry success rates by exit code to differentiate between transient failures worth retrying and permanent failures that need fixing.
  • Use OTel integration to gain deep visibility into pipeline execution flows.
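As a sketch, the retry standardization described above can be captured in pipeline YAML with automatic retry rules keyed to exit statuses. The step label, command, and exit status values here are illustrative:

```yaml
steps:
  - label: "Integration tests"
    command: "scripts/integration-tests.sh"
    retry:
      automatic:
        # Retry only transient failures; report on these exit codes
        # with your telemetry provider to track retry success rates.
        - exit_status: 255  # illustrative: infrastructure failure
          limit: 2
        - exit_status: 42   # illustrative: custom "flaky test" status
          limit: 1
```

Permanent failures (any other exit status) fail immediately, which keeps the retry-success-rate metric meaningful.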

Using analytics for performance improvement

  • Monitor build duration, throughput, and success rate as key metrics. Use OTel integration and queue metrics.
  • You can also use OTel integration to identify the slowest steps and optimize them through bottleneck analysis.
  • Look for repeated error types with failure clustering.

Logging and monitoring

  • Favor JSON or other parsable formats for structured logs, as such formats can be easily queried when debugging. Use log groups to better represent relevant sections in the logs visually.
  • Differentiate between info, warnings, and errors by using appropriate log levels.
  • Store logs, reports, and binaries as artifacts for debugging and compliance.
  • Use cluster insights or external tools to analyze durations and failure patterns to track trends.
  • Avoid creating log files that are too large. Large log files make issues harder to troubleshoot and are harder to manage in the Buildkite Pipelines interface.
    • To avoid overly large log files, avoid verbose output from apps and tools unless it is needed. See also Managing log output.
    • If you are using Bazel, note that Bazel's log file is extremely verbose. Instead, consider using the Bazel BEP Failure Analyzer Buildkite Plugin to get a simplified view of the error(s).
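As a minimal sketch, a build script can combine log groups with structured JSON log lines. In Buildkite log output, a line starting with "---" opens a collapsed group and "+++" opens a group that is expanded by default; the JSON lines here are illustrative:

```shell
#!/bin/bash
set -euo pipefail

# "---" starts a collapsed log group; "+++" starts one expanded by default.
echo "--- :package: Installing dependencies"
echo '{"level":"info","msg":"dependencies installed"}'

echo "+++ :test_tube: Running tests"
echo '{"level":"error","msg":"2 tests failed","suite":"unit"}'
```

Emitting one JSON object per line keeps the logs easy to query in whatever log tooling you stream them to.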

Set relevant alerts

  • Notify responsible teams of failing builds with failure alerts.
  • Detect bottlenecks when builds queue too long by monitoring queue depth. You can use queue metrics (insights) for this.
  • Trigger alerts when agents go offline or degrade to monitor agent health. If individual agent health is less of a concern, terminate unhealthy instances and spin up replacements.

Getting metrics out of Buildkite Pipelines

Buildkite Pipelines provides multiple ways to export CI/CD metrics depending on your needs (agent fleet health, build performance, trace correlation, test quality, and so on) and where you want the data (Datadog, Prometheus, Grafana, CloudWatch, your own OpenTelemetry collector, or Buildkite's built-in dashboards).

Most teams need two or three of these approaches working together, as they are complementary rather than competing. The following sections introduce each approach, explain when to use it, and link to detailed setup documentation.

Decision matrix

| What you want to measure | Best approach | Plan tier | Push or pull | Key destinations |
| --- | --- | --- | --- | --- |
| Agent fleet health (agents online, busy, idle per queue) | buildkite-agent-metrics | All | Pull (polls Buildkite API) | Prometheus, StatsD/DogStatsD to Datadog, CloudWatch |
| Agent process metrics (goroutines, memory, GC) | Agent health check service | All | Pull (Prometheus scrape) | Prometheus |
| Build and job lifecycle traces¹ (spans, durations, wait times) | OpenTelemetry notification service | Enterprise | Push (OTel) | Any OTel-compatible collector (Honeycomb, Grafana Tempo, Datadog, and others) |
| Agent-side job execution traces | OpenTelemetry agent tracing | All | Push (OTel) | Any OTel-compatible collector |
| Queue depth, wait times, concurrency² | Cluster insights and GraphQL API | Varies | Pull or UI | Built-in UI; GraphQL or REST for custom dashboards |
| Build events for alerting and dashboards | Webhooks and Amazon EventBridge | All | Push | PagerDuty, Datadog, custom endpoints |
| Test performance and flaky tests | Test Engine | Add-on | UI and API | Built-in UI; API for export |

¹ The buildkite.job span includes the pipeline slug, build number, and a wait_time_ms attribute. You can also use a signals-to-metrics connector to produce metrics from spans.

² The GraphQL ClusterQueue node exposes a metrics field with connectedAgentsCount, runningJobsCount, waitingJobsCount, and waitTimeSec (min/p50/p95/max). The same data is available through REST API at /v2/organizations/{org}/clusters/{cluster_uuid}/queues/{queue_uuid}/metrics.
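A GraphQL query for these queue metrics could be sketched as follows. The field names on metrics come from the note above; the query path from organization down to the queues, and the slug and ID values, are assumptions to illustrate the shape:

```graphql
# Sketch only: "my-org" and "cluster-uuid" are placeholders, and the
# path to the ClusterQueue node is an assumption.
query QueueMetrics {
  organization(slug: "my-org") {
    cluster(id: "cluster-uuid") {
      queues(first: 10) {
        edges {
          node {
            key
            metrics {
              connectedAgentsCount
              runningJobsCount
              waitingJobsCount
            }
          }
        }
      }
    }
  }
}
```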

buildkite-agent-metrics and the agent health check service are different tools

The buildkite-agent-metrics tool gives you fleet-level queue and agent counts by polling the Buildkite API. The agent's health check service exposes per-agent process health through a Prometheus endpoint on the agent binary itself. You likely want both.

Metrics approaches in detail

Each approach below covers a different aspect of CI/CD observability available in Buildkite Pipelines. Choose a combination of these to get full coverage across fleet health, build performance, and test quality.

Fleet health dashboard

buildkite-agent-metrics is a standalone binary (separate from the agent) that polls the Buildkite API and exports agent and queue metrics.

Metrics provided:

  • Agents: total, busy, idle counts per queue
  • Jobs: running, scheduled, waiting counts
  • Queue depth and wait times

Supported destinations:

  • Prometheus — exposes a /metrics endpoint for scraping
  • StatsD — emits StatsD-format metrics, which is also the path to get metrics into Datadog (configure DogStatsD as the StatsD receiver)
  • CloudWatch — publishes directly to AWS CloudWatch Metrics

Use this approach when you want a fleet-level view of agent capacity and queue health in your external monitoring tool. This is the primary path for getting agent metrics into Datadog, Prometheus, or CloudWatch.

Getting agent metrics into Datadog

To get Buildkite agent metrics into Datadog, configure buildkite-agent-metrics with the StatsD backend pointed at a DogStatsD receiver (the Datadog Agent's built-in StatsD server). See the buildkite-agent-metrics CLI documentation for setup details.

This tool polls the Buildkite API, so it shows point-in-time snapshots rather than event-level granularity. It does not cover build lifecycle events or trace data.

Per-agent process health

The Buildkite agent's health check service includes a native Prometheus-compatible /metrics endpoint served by the agent process itself (available since agent version 3.113.0).

Metrics provided:

  • Go runtime metrics: goroutines, memory allocation, GC pause times
  • Agent process health: uptime, version info

Use this approach when you run Prometheus and want to monitor agent process health alongside your other infrastructure. This is useful for detecting agent crashes, memory leaks, or degraded agents.

This endpoint shows individual agent process health, not fleet-level queue or capacity data. For fleet-level metrics, use buildkite-agent-metrics alongside it.

Build lifecycle traces with OpenTelemetry

Enterprise-only feature

The OpenTelemetry tracing notification service requires an Enterprise plan. It provides traces (spans), not traditional metrics (gauges or counters). If you need time-series metrics, you need to derive them from spans in your backend (for example, using span-to-metrics features in Datadog or Grafana).

The OpenTelemetry tracing notification service pushes build and job lifecycle events as OpenTelemetry (OTel) traces to your collector.

Data provided (as trace spans):

  • Build lifecycle: created, scheduled, running, finished
  • Job lifecycle with durations, wait times, queue information
  • Pipeline and organization metadata as span attributes

Supported destinations: Any OTel-compatible backend, including Honeycomb, Grafana, Datadog APM, Jaeger, or your own OpenTelemetry collector.

Use this approach when you have an existing distributed tracing setup and want CI/CD events to appear as spans alongside your application traces. This is best for correlating build activity with deployments and service health.

Agent-side execution traces

The Buildkite agent can emit OpenTelemetry spans for job execution, providing execution-side trace context.

Data provided (as trace spans):

  • Job checkout, plugin, command, and artifact upload phases as individual spans
  • Execution timing for each phase

Supported destinations: Any OTel-compatible backend.

Use this approach when you want end-to-end trace context flowing from your application code through CI and back. This works alongside the notification service, as they are complementary:

  • Notification service provides control-plane lifecycle (build created, scheduled, running)
  • Agent tracing provides execution-side detail (checkout, plugins, command, artifacts)

Built-in cluster insights dashboards

Buildkite's built-in cluster insights dashboards show queue health, wait times, agent utilization, and concurrency.

Metrics provided:

  • Queue depth and wait times over time
  • Agent utilization and concurrency
  • Job throughput

Use this approach for quick visual checks of CI health without any external tooling. This is useful for debugging queue backups or capacity issues in real time. For queue-specific data, see queue metrics.

Note that some of the data shown in cluster insights is not yet available through an external export path (API, OpenTelemetry, or otherwise).

Custom dashboards with the GraphQL API

Buildkite's GraphQL API exposes build, job, agent, pipeline, and queue data for programmatic access.

Data available:

  • Build and job metadata, statuses, timings
  • Agent and queue information
  • Pipeline configuration and metrics

Use this approach when building custom dashboards (for example, in Retool or Grafana using a JSON API datasource), automation scripts, or when feeding data into your own data warehouse.

This is a polling-based approach, so you need to build your own scheduling to keep data fresh. Rate limits apply.

Real-time events with webhooks and EventBridge

Buildkite pushes build, job, and agent lifecycle events to your HTTP endpoints (webhooks) or Amazon EventBridge.

Events available:

  • Build created, started, finished, blocked
  • Job scheduled, started, finished, activated
  • Agent connected, disconnected, stopped

Supported destinations: Any HTTP endpoint (PagerDuty, Datadog webhook intake, custom services), or Amazon EventBridge to Lambda, SQS, or SNS.

Use this approach for event-driven alerting (for example, notifying a team when a build fails), feeding CI events into incident management systems, or building custom integrations. You can also configure pipeline-level notifications directly in your pipeline YAML.
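As an illustrative sketch, an EventBridge rule pattern that matches finished builds from the Buildkite partner event source might look like the following. Both the source prefix and the detail-type value are assumptions; check your partner event bus for the exact names:

```json
{
  "source": [{ "prefix": "aws.partner/buildkite.com" }],
  "detail-type": ["Build Finished"]
}
```

Attach the rule to a Lambda, SQS, or SNS target to fan the events out to your alerting or incident tooling.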

Test-level performance metrics

Buildkite Test Engine ingests test results and provides test-level metrics.

Metrics provided:

  • Test duration trends
  • Flaky test detection and rates
  • Pass and fail rates over time
  • Slowest tests

Use this approach when you care about test health independently from build infrastructure health. This is best for engineering teams focused on test suite reliability and performance.

Test Engine is a separate product from build and agent metrics. It covers test execution quality, not CI infrastructure health.

Common metrics recipes

The following recipes show how to connect metrics from Buildkite Pipelines to popular destinations. Each one maps a common goal to the right approach and configuration.

Agent metrics in Datadog

Configure buildkite-agent-metrics to emit StatsD metrics and point it at your Datadog Agent's DogStatsD listener (default: localhost:8125). This gives you agent counts, queue depth, and job counts as Datadog metrics that you can graph and alert on.

buildkite-agent-metrics -backend statsd \
  -statsd-host localhost:8125 \
  -statsd-tags \
  -token "$BUILDKITE_AGENT_TOKEN"

The -statsd-tags flag enables Datadog-compatible tagging, so metrics are tagged by queue rather than including the queue name in the metric name. This makes it easier to filter and group metrics in Datadog dashboards.

Build traces in Honeycomb or Grafana Tempo

Set up the OpenTelemetry tracing notification service to push to your OTel endpoint. For deeper execution-phase spans, also enable agent-level OpenTelemetry tracing. Together they provide control-plane lifecycle and execution detail.
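A minimal OpenTelemetry Collector config that accepts these spans over OTLP and forwards them to a backend might look like the following sketch. The exporter endpoint and the API-key header are placeholders for a Honeycomb-style backend:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlphttp:
    endpoint: "https://api.honeycomb.io"        # placeholder backend endpoint
    headers:
      x-honeycomb-team: "${HONEYCOMB_API_KEY}"  # placeholder auth header

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

For Grafana Tempo or another backend, swap the exporter endpoint and auth header for that backend's OTLP intake.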

Queue wait times in Prometheus

Run buildkite-agent-metrics with the Prometheus backend and scrape its /metrics endpoint. You get queue-level wait time metrics. For more granular per-job wait times, use OpenTelemetry traces, which provide span durations rather than traditional gauges.
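A minimal prometheus.yml scrape job for this setup could be sketched as follows. The target address is an assumption; point it at wherever you configured the exporter's Prometheus listener:

```yaml
scrape_configs:
  - job_name: "buildkite-agent-metrics"
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8080"]  # assumed exporter listen address
```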

Build failure alerts in PagerDuty

Configure a webhook notification service to send build.finished events to PagerDuty's Events API. Filter on build.state == "failed" in PagerDuty's event rules. You can also use conditional notifications in your pipeline YAML to send alerts to specific channels.
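For example, a pipeline-level notification that fires only on failure could be sketched in pipeline YAML like this (the channel name is illustrative):

```yaml
notify:
  - slack: "#build-alerts"  # illustrative channel
    if: build.state == "failed"
```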

Pipeline performance data collection

Poll the GraphQL API for build and job data on a schedule and store it in your own data warehouse. The API has time window limits on queryable data, so start collecting early. For built-in historical views, cluster insights provides some data with limited time ranges.

Per-agent process health in Prometheus

Enable the agent's health check service and add the /metrics endpoint to your Prometheus scrape config. This gives you Go runtime metrics for each agent process, which is useful for detecting degraded or unhealthy agents.
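The scrape job for agent process health could be sketched like this, with one target per agent host. The hostnames and port are illustrative; use the address you configured the agent's health check service to serve on:

```yaml
scrape_configs:
  - job_name: "buildkite-agent-health"
    metrics_path: /metrics
    static_configs:
      - targets:
          - "agent-1.internal:8080"  # illustrative agent hosts and port
          - "agent-2.internal:8080"
```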

Current limitations

The following areas of the current metrics capabilities have known limitations:

  • Metrics export parity: Cluster insights shows data that can't be fully replicated through any external export path today. If you are building external dashboards, some metrics might not currently be available for export.
  • OpenTelemetry enrichment: Additional span attributes such as build metadata, trigger context, and span links for triggered builds are being actively improved.
  • Historical data: Current cluster insights and queue metrics have limited lookback periods. If you need longer time windows for capacity planning, consider using the GraphQL API to collect and store data in your own warehouse.
  • Traces and metrics gap: OpenTelemetry exports are trace-based (spans), but some workflows require traditional time-series metrics (gauges, counters). Converting spans to metrics requires backend-side processing that not all observability stacks handle well.
  • Event payload coverage: Webhooks and Amazon EventBridge event payloads don't include all metadata, such as retry context and manual-versus-automatic action flags.