Prometheus metrics
All Prometheus metrics exported by the Agent Stack for Kubernetes controller begin with buildkite_
. The second component of the metric name refers to the controller component that produces the metric.
How to enable Prometheus monitoring
The Agent Stack for Kubernetes controller can expose Prometheus metrics for monitoring and observability. To enable Prometheus monitoring, complete these two steps:
- Enable metrics port exposure in the Helm chart.
- Create a PodMonitor resource for scraping.
The instructions that follow assume that you have Prometheus Operator installed in your cluster. If you're using a different Prometheus setup, you'll need to configure scraping manually.
Enabling metrics port exposure
Configure the prometheus-port
option in your Helm deployment to expose the metrics endpoint. You can use either the command-line or the value file approach.
Command-line approach
Use the following command to expose the metrics endpoint:
helm upgrade --install agent-stack-k8s oci://ghcr.io/buildkite/helm/agent-stack-k8s \
--namespace buildkite \
--create-namespace \
--set agentToken=<buildkite-cluster-agent-token> \
--set config.prometheus-port=8080
Values file approach
Set the following configuration in your values file:
# values.yml
agentToken: "<buildkite-cluster-agent-token>"
config:
prometheus-port: 8080
tags:
- queue=kubernetes
And run the following command:
helm upgrade --install agent-stack-k8s oci://ghcr.io/buildkite/helm/agent-stack-k8s \
--namespace buildkite \
--create-namespace \
--values values.yml
This exposes metrics on port 8080 at the /metrics
endpoint within the controller pod.
Creating a PodMonitor for scraping
If you're using Prometheus Operator, create a PodMonitor
resource to automatically scrape metrics from the controller:
# buildkite-podmonitor.yml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: buildkite-agent-stack
namespace: buildkite
labels:
app: buildkite-agent-stack
spec:
selector:
matchLabels:
app: agent-stack-k8s # Replace with your actual Helm release name followed by "-agent-stack-k8s"
podMetricsEndpoints:
- port: metrics
path: /metrics
interval: 30s
Apply the PodMonitor:
kubectl apply -f buildkite-podmonitor.yml
Verification
Verify that monitoring is working correctly:
# Check that the metrics port is exposed
kubectl get pods -n buildkite -o wide
kubectl port-forward -n buildkite deployment/agent-stack-k8s 8080:8080
# In another terminal, test metrics endpoint
curl http://localhost:8080/metrics
# Verify PodMonitor is created and discovered
kubectl get podmonitor -n buildkite
Notes on using the metrics
Most metrics below are counter metrics, designed to be used in conjunction with the rate
and a time window. These are named ending in _total
.
PromQL examples:
-
rate(buildkite_scheduler_job_create_success_total[10m])
- jobs successfully created per second over a 10 minute window. -
rate(buildkite_scheduler_job_create_errors_total[10m])
- errors per second of failures to create jobs over a 10-minute window.
Some metrics are gauges, which can be useful for diagnosing particular issues.
A few metrics are native histograms, which requires the Prometheus feature flag to be enabled (--enable-feature=native-histograms
). These are mostly latency histograms named ending in _seconds
, and again work well with rate
:
-
rate(buildkite_job_end_to_end_seconds[10m])
- histogram of time in seconds that jobs spent between being returned from a query to Buildkite and being created in Kubernetes, over a 10 minute window. -
rate(buildkite_monitor_job_query_seconds[10m])
- histogram of time spent querying Buildkite for jobs that can be scheduled, over a 10 minute window.
Labels and their meanings
Label name | Description | Values |
---|---|---|
source |
The event that caused a counter to increase. |
|
reason , error_reason
|
For operations on Kubernetes, the Kubernetes reason associated with an error | Examples
|
reason |
For the limiter, a classification of the error returned by the downstream component |
|
eviction_reason |
The reason an eviction was created |
|
completion_watcher
Full metric name | Labels | Description |
---|---|---|
buildkite_completion_watcher_cleanup_errors_total |
reason |
Count of errors during attempts to clean up a job with a finished agent |
buildkite_completion_watcher_cleanups_total |
- | Count of jobs with finished agents successfully cleaned up |
buildkite_completion_watcher_onadd_events_total |
- | Count of OnAdd informer events |
buildkite_completion_watcher_onupdate_events_total |
- | Count of OnUpdate informer events |
deduper
Full metric name | Labels | Description |
---|---|---|
buildkite_deduper_job_handler_calls_total |
- | Count of jobs that were passed to the next handler in the chain |
buildkite_deduper_job_handler_errors_total |
- | Count of jobs that weren't scheduled because the next handler in the chain returned an error |
buildkite_deduper_jobs_already_not_running_total |
source |
Count of times a job was already missing from inFlight |
buildkite_deduper_jobs_already_running_total |
source |
Count of times a job was already present in inFlight |
buildkite_deduper_jobs_marked_running_total |
source |
Count of times a job was added to inFlight |
buildkite_deduper_jobs_running |
- | Current number of running jobs according to deduper |
buildkite_deduper_jobs_unmarked_running_total |
source |
Count of times a job was removed from inFlight |
buildkite_deduper_onadd_events_total |
- | Count of OnAdd informer events |
buildkite_deduper_ondelete_events_total |
- | Count of OnDelete informer events |
buildkite_deduper_onupdate_events_total |
- | Count of OnUpdate informer events |
job_watcher
Full metric name | Labels | Description |
---|---|---|
buildkite_job_watcher_cleanup_errors_total |
reason |
Count of errors during attempts to clean up a stalled job |
buildkite_job_watcher_cleanups_total |
- | Count of stalled jobs successfully cleaned up |
buildkite_job_watcher_job_fail_on_buildkite_errors_total |
- | Count of errors when jobWatcher tried to acquire and fail a job on Buildkite |
buildkite_job_watcher_jobs_failed_on_buildkite_total |
- | Count of jobs that jobWatcher successfully acquired and failed on Buildkite |
buildkite_job_watcher_jobs_finished_without_pod_total |
- | Count of jobs that entered a terminal state (Failed or Succeeded) without a pod |
buildkite_job_watcher_jobs_stalled_without_pod_total |
- | Count of jobs that ran for too long without a pod |
buildkite_job_watcher_num_ignored_jobs |
- | Current count of jobs ignored for jobWatcher checks |
buildkite_job_watcher_num_stalling_jobs |
- | Current number of jobs that are running but have no pods |
buildkite_job_watcher_onadd_events_total |
- | Count of OnAdd informer events |
buildkite_job_watcher_ondelete_events_total |
- | Count of OnDelete informer events |
buildkite_job_watcher_onupdate_events_total |
- | Count of OnUpdate informer events |
limiter
Full metric name | Labels | Description |
---|---|---|
buildkite_limiter_job_handler_calls_total |
- | Count of jobs that were passed to the next handler in the chain |
buildkite_limiter_job_handler_errors_total |
reason |
Count of jobs that weren't scheduled because the next handler in the chain returned an error |
buildkite_limiter_max_in_flight |
- | Configured limit on number of jobs simultaneously in flight |
buildkite_limiter_onadd_events_total |
- | Count of OnAdd informer events |
buildkite_limiter_ondelete_events_total |
- | Count of OnDelete informer events |
buildkite_limiter_onupdate_events_total |
- | Count of OnUpdate informer events |
buildkite_limiter_token_overflows_total |
source |
Count of attempts to return a token when the bucket was full |
buildkite_limiter_token_underflows_total |
source |
Count of attempts to take a token when the bucket was empty |
buildkite_limiter_token_wait_duration_seconds |
- | Time spent waiting for a limiter token to become available |
buildkite_limiter_tokens_available |
- | Limiter tokens currently available |
buildkite_limiter_waiting_for_token |
- | Number of limiter workers currently waiting for a token |
buildkite_limiter_waiting_for_work |
- | Number of limiter workers currently waiting for work |
buildkite_limiter_work_queue_length |
- | Amount of enqueued work in the limiter |
buildkite_limiter_work_wait_duration_seconds |
- | Time spent waiting in the limiter for work to become available |
monitor
Full metric name | Labels | Description |
---|---|---|
buildkite_monitor_job_handler_errors_total |
- | Count of jobs that weren't scheduled because the next handler in the chain returned an error |
buildkite_monitor_job_queries_total |
- | Count of queries to Buildkite to fetch jobs |
buildkite_monitor_job_query_errors_total |
- | Count of errors from queries to Buildkite to fetch jobs |
buildkite_monitor_job_query_seconds |
- | Time taken to fetch jobs from Buildkite |
buildkite_monitor_jobs_filtered_out_total |
- | Count of jobs that didn't match the configured agent tags |
buildkite_monitor_jobs_handled_total |
- | Count of jobs that were passed to the next handler in the chain |
buildkite_monitor_jobs_returned_total |
- | Count of jobs returned from queries to Buildkite |
buildkite_monitor_monitor_up |
- | Whether the monitor loop is running (0 = stopped, 1 = running) |
pod_watcher
Full metric name | Labels | Description |
---|---|---|
buildkite_pod_watcher_job_fail_on_buildkite_errors_total |
- | Count of errors when podWatcher tried to acquire and fail a job on Buildkite |
buildkite_pod_watcher_jobs_failed_on_buildkite_total |
- | Count of jobs that podWatcher successfully acquired and failed on Buildkite |
buildkite_pod_watcher_num_ignored_jobs |
- | Current count of jobs ignored for podWatcher checks |
buildkite_pod_watcher_num_job_cancel_checkers |
- | Current count of job cancellation checkers |
buildkite_pod_watcher_num_watching_for_image_failure |
- | Current count of pods being watched for potential image-related failures |
buildkite_pod_watcher_onadd_events_total |
- | Count of OnAdd informer events |
buildkite_pod_watcher_ondelete_events_total |
- | Count of OnDelete informer events |
buildkite_pod_watcher_onupdate_events_total |
- | Count of OnUpdate informer events |
buildkite_pod_watcher_pod_eviction_errors_total |
eviction_reason , error_reason
|
Count of failures to create pod evictions by podWatcher |
buildkite_pod_watcher_pods_evicted_total |
eviction_reason |
Count of evictions created for pods by podWatcher |
scheduler
Full metric name | Labels | Description |
---|---|---|
buildkite_scheduler_job_create_calls_total |
- | Count of jobs that were passed to Kubernetes to create |
buildkite_scheduler_job_create_errors_total |
reason |
Count of jobs that weren't created in Kubernetes because of an error |
buildkite_scheduler_job_create_success_total |
- | Count of jobs that were successfully created in Kubernetes |
buildkite_scheduler_job_fail_on_buildkite_errors_total |
- | Count of errors when scheduler tried to acquire and fail a job on Buildkite |
buildkite_scheduler_jobs_failed_on_buildkite_total |
- | Count of jobs that scheduler successfully acquired and failed on Buildkite |
Other
Full metric name | Labels | Description |
---|---|---|
buildkite_job_end_to_end_seconds |
- | End-to-end processing times of jobs. Specifically, for each job, the duration between starting the query that returned the job from Buildkite, and successfully creating that job in Kubernetes. |