Monitoring and observability
This page covers best practices for monitoring, observability, and logging in Buildkite Pipelines.
Telemetry operational tips
- When implementing telemetry, start by profiling queue wait times and checkout times; these are usually the biggest, cheapest wins.
- Include pipeline, queue, repository, and commit metadata in spans and events to make troubleshooting easier, as shown in the sketch after this list.
- Stream Buildkite Pipelines telemetry into your standard observability stack so that platform-level SLOs and alerts live alongside your application telemetry, keeping a single source of truth.
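For example, here is a minimal sketch of attaching that metadata to spans by exporting the standard `OTEL_RESOURCE_ATTRIBUTES` environment variable from a step's command. The script path is a placeholder, and the sketch assumes your build tooling emits OTel spans and reads the standard resource-attribute variable:

```yaml
steps:
  - label: ":hammer: Build"
    # Propagate build context into every span the build tooling emits, so
    # traces can be filtered by pipeline, branch, repository, and commit.
    command: |
      export OTEL_RESOURCE_ATTRIBUTES="buildkite.pipeline=${BUILDKITE_PIPELINE_SLUG},buildkite.branch=${BUILDKITE_BRANCH},buildkite.repo=${BUILDKITE_REPO},buildkite.commit=${BUILDKITE_COMMIT}"
      ./scripts/build.sh   # placeholder build script
    agents:
      queue: default
```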
Quick checklist for using telemetry
Choose integrations based on your existing tooling and needs:
- Enable Amazon EventBridge for real-time alerting when you need to integrate with AWS-native tooling. Start by setting up notifications and subscribing your alerting pipeline.
- Turn on OpenTelemetry (OTel) export when you need vendor-neutral observability that works with your existing OTel collector. Start with job spans and queue metrics.
- If you are using Datadog, enable APM tracing in the Buildkite agent so build and job traces flow into Datadog (see the agent sketch after this list).
- If you are using Backstage, integrate the Buildkite Backstage plugin to surface pipeline health and build status directly in your developer portal.
- If you are using Honeycomb, send build events and traces to enable high-cardinality analysis of pipeline performance and failures.
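As a rough sketch of where these switches live, the fragment below runs an agent with tracing enabled in a Docker Compose style configuration. It assumes the agent's `tracing-backend` option and its `BUILDKITE_TRACING_BACKEND` environment variable; the token, tags, and collector endpoint are placeholders:

```yaml
# Docker Compose style sketch: one agent with tracing switched on.
services:
  buildkite-agent:
    image: buildkite/agent:3
    environment:
      - BUILDKITE_AGENT_TOKEN=xxxx               # placeholder agent token
      - BUILDKITE_AGENT_TAGS=queue=default
      - BUILDKITE_TRACING_BACKEND=opentelemetry  # or "datadog" for Datadog APM
      # For OTel, point the exporter at your collector, for example:
      # - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
```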
Core pipeline telemetry recommendations
Establish standardized metrics collection across all pipelines to enable consistent monitoring and analysis:
- Track build times by pipeline, step, and queue to identify performance bottlenecks with build duration metrics.
- Monitor agent availability and scaling efficiency across different workload types by tracking queue wait times.
- Measure success rates by pipeline, branch, and time period to identify reliability trends through failure rate analysis.
- Standardize retry counts for flaky tests and assign custom exit statuses that you can report on with your telemetry provider (see the retry sketch after this list).
- Track retry success rates by exit code to differentiate between transient failures worth retrying and permanent failures that need fixing.
- Use the OTel integration to gain deep visibility into pipeline execution flows.
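A minimal sketch of a standardized retry policy keyed on exit statuses; the exit code and script path are illustrative placeholders:

```yaml
steps:
  - label: ":test_tube: Integration tests"
    command: "./scripts/run-tests.sh"   # placeholder test script
    retry:
      automatic:
        # 42 is a placeholder exit status your scripts could use to signal
        # a transient infrastructure failure that is worth retrying.
        - exit_status: 42
          limit: 2
        # -1 covers jobs whose agent was lost or forcefully terminated.
        - exit_status: -1
          limit: 2
```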
Using analytics for performance improvement
- Monitor build duration, throughput, and success rate as key metrics. Use OTel integration and queue metrics.
- You can also use the OTel integration to identify the slowest steps and optimize them through bottleneck analysis.
- Look for repeated error types with failure clustering.
Logging and monitoring
- Favor structured logs in JSON or another parsable format, as they are easy to query when debugging. Use log groups to visually separate the relevant sections of the log output (see the sketch after this list).
- Differentiate between info, warnings, and errors by using appropriate log levels.
- Store logs, reports, and binaries as artifacts for debugging and compliance.
- Use cluster insights or external tools to analyze durations and failure patterns and track trends over time.
- Avoid creating log files that are too large. Large log files make it harder to troubleshoot issues and are harder to manage in the Buildkite Pipelines interface.
- To keep log files small, avoid verbose output from apps and tools unless you need it. See also Managing log output.
- If you are using Bazel, note that Bazel's log file is extremely verbose. Instead, consider using the Bazel BEP Failure Analyzer Buildkite Plugin to get a simplified view of the error(s).
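As a sketch, log groups and artifact uploads can be combined in a single step; the script names and artifact paths below are placeholders:

```yaml
steps:
  - label: ":package: Build and test"
    command: |
      echo "--- :package: Installing dependencies"   # collapsed log group
      ./scripts/install-deps.sh
      echo "+++ :hammer: Building and testing"       # expanded log group
      make build test 2>&1 | tee build.log
    artifact_paths:
      - "build.log"
      - "reports/**/*"
```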
Set relevant alerts
- Set up failure alerts that notify the responsible teams when builds fail (see the notify sketch after this list).
- Detect bottlenecks when builds queue too long by monitoring queue depth. You can use queue metrics (insights) for this.
- Monitor agent health by triggering alerts when agents go offline or degrade. If individual agent health is less of a concern, terminate the unhealthy instance and spin up a new one.
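For example, a minimal sketch of a pipeline-level failure notification; the Slack channel is a placeholder and assumes a Slack notification service is already configured for the organization:

```yaml
# Notify the owning team's channel whenever a build on this pipeline fails.
notify:
  - slack: "#team-platform-alerts"
    if: build.state == "failed"
```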