Troubleshooting
If you're experiencing issues with the Buildkite Agent Stack for Kubernetes controller, enable debug mode and log collection to gain better visibility and insight into the problem.
Enable debug mode
You can increase the verbosity of the Buildkite Agent Stack for Kubernetes controller's logs by enabling debug mode. Once enabled, the logs record the individual, detailed actions the controller performs while obtaining jobs from Buildkite's API, processing configurations into a Kubernetes PodSpec, and creating Kubernetes Jobs. Debug mode can help identify processing delays or incorrect job processing.
Debug mode can be enabled during the installation (Helm chart deployment) of the Buildkite Agent Stack for Kubernetes controller via the command line:
```bash
helm upgrade --install agent-stack-k8s oci://ghcr.io/buildkite/helm/agent-stack-k8s \
  --namespace buildkite \
  --create-namespace \
  --set config.debug=true \
  --values values.yaml
```
Or within the controller's configuration values YAML file:
```yaml
# values.yaml
...
config:
  debug: true
...
```
Kubernetes log collection
To enable log collection for the Buildkite Agent Stack for Kubernetes controller, use the `utils/log-collector` script in the controller repository (a fetch-and-run sketch is shown below).
Prerequisites
- The kubectl binary
- kubectl configured and authenticated against the correct Kubernetes cluster (a quick check is shown below)
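To confirm kubectl is pointing at the intended cluster, you can run standard checks such as:

```bash
# Show the cluster context kubectl is currently using
kubectl config current-context

# Confirm you can reach the namespace the controller is deployed in
kubectl get pods --namespace buildkite
```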
Inputs to the script
When executing the `log-collector` script, you will be prompted for:
- The Kubernetes namespace where the Buildkite Agent Stack for Kubernetes controller is deployed.
- The Buildkite job ID to collect Job and Pod logs for.
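A minimal sketch of fetching and running the script, assuming the controller repository is `buildkite/agent-stack-k8s` on GitHub and the script path on the default branch is unchanged:

```bash
# Fetch the log-collector script from the controller repository (path assumed) and run it
curl -fsSL -o log-collector \
  https://raw.githubusercontent.com/buildkite/agent-stack-k8s/main/utils/log-collector
chmod +x log-collector
./log-collector
```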
Gathering of data and logs
The `log-collector` script will gather the following information:
- Kubernetes Job and Pod resource details for the Buildkite Agent Stack for Kubernetes controller.
- Kubernetes Pod logs for the Buildkite Agent Stack for Kubernetes controller.
- Kubernetes Job and Pod resource details for the given Buildkite job ID (if provided).
- Kubernetes Pod logs of the Pod that executed the given Buildkite job ID (if provided).
The logs will be archived in a tarball named `logs.tar.gz` in the current directory. If requested, these logs may be provided to Buildkite Support via email (support@buildkite.com).
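Before sharing the archive, you can review its contents:

```bash
# List the files captured in the archive without extracting it
tar -tzf logs.tar.gz
```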
Common issues and fixes
Below are some common issues that users may experience when using the Buildkite Agent Stack for Kubernetes controller to process Buildkite jobs.
Jobs are being created, but not processed by the controller
The primary requirement for the Buildkite Agent Stack for Kubernetes controller to acquire and process a Buildkite job is a matching `queue` tag. If the controller is configured to process scheduled jobs with the tag `queue=kubernetes`, you will need to ensure that your pipeline YAML targets the same queue, either at the pipeline level or at each step level.
If a job is created without a queue target, the default queue is applied. The Buildkite Agent Stack for Kubernetes controller expects all jobs to have a `queue` tag explicitly defined, even for "default" cluster queues. Any job missing a `queue` tag will be skipped by the controller during processing, and the controller emits the following log:

```
job missing 'queue' tag, skipping...
```
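For example, a pipeline targeting the `kubernetes` queue can set it once for the whole pipeline or per step (the step label and command below are illustrative):

```yaml
# pipeline.yml
agents:
  queue: kubernetes       # pipeline-level queue target

steps:
  - label: "build"
    command: "echo hello"
    # Alternatively, target the queue on an individual step:
    agents:
      queue: kubernetes
```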
To view the agent tags applied to your job(s), you can run the following GraphQL query (be sure to substitute your organization's slug and cluster ID):
```graphql
query getClusterScheduledJobs {
  organization(slug: "<organization-slug>") {
    jobs(
      state: [SCHEDULED]
      type: [COMMAND]
      order: RECENTLY_CREATED
      first: 100
      clustered: true
      cluster: "<cluster-id>"
    ) {
      count
      edges {
        node {
          ... on JobTypeCommand {
            url
            uuid
            agentQueryRules
          }
        }
      }
    }
  }
}
```
This will return the 100 most recently created jobs in the `<cluster-id>` cluster of the `<organization-slug>` organization that are in the `scheduled` state, waiting for the controller to convert each of them into a Kubernetes Job. Each Buildkite job's agent tags are listed under `agentQueryRules`.
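One way to run this query is with curl against Buildkite's GraphQL endpoint, using an API access token that has the GraphQL scope (the token variable name is illustrative):

```bash
# Execute the query above via Buildkite's GraphQL API
curl -s https://graphql.buildkite.com/v1 \
  -H "Authorization: Bearer $BUILDKITE_GRAPHQL_TOKEN" \
  -H "Content-Type: application/json" \
  -d @- <<'EOF'
{"query": "query getClusterScheduledJobs { organization(slug: \"<organization-slug>\") { jobs(state: [SCHEDULED], type: [COMMAND], order: RECENTLY_CREATED, first: 100, clustered: true, cluster: \"<cluster-id>\") { count edges { node { ... on JobTypeCommand { url uuid agentQueryRules } } } } } }"}
EOF
```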
Controller stops accepting new jobs from a cluster queue
Sometimes the count of jobs in the `waiting` state in the Buildkite Pipelines UI may increase while no new Pods are created. Reviewing the logs may reveal a `max-in-flight reached` error, for example:

```
DEBUG limiter scheduler/limiter.go:77 max-in-flight reached {"in-flight": 25}
```
Initial troubleshooting steps
- Enable debug logging and look for errors related to `max-in-flight reached`.
- Confirm that no new Kubernetes Jobs are created while the UI displays the jobs as `waiting`.
Workaround
Restart the controller Pod to clear the `max-in-flight reached` condition and allow scheduling to resume:

```bash
kubectl -n buildkite rollout restart deployment agent-stack-k8s
```
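Separately, if the limit is simply too low for your workload, you can raise it in the controller's configuration. A sketch, assuming the kebab-case `max-in-flight` key under `config` used by recent chart versions:

```yaml
# values.yaml
config:
  # Assumption: raises the cap on concurrently scheduled jobs
  # (the log line above shows a limit of 25 being hit).
  max-in-flight: 100
```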
Fix
If you are using any version of the controller older than v0.2.7, upgrade to the latest version.
Wrong exit code affects auto job retries
The exit code from a Kubernetes Pod may not be passed through to the agent, preventing the use of exit-status-based automatic retries. The error could look like this:
```
The following init containers failed:

    CONTAINER   EXIT CODE   SIGNAL   REASON                   MESSAGE
    My-agent    137         0        ContainerStatusUnknown   The container could not be located when the pod was terminated
```
Such a scenario can occur when the Buildkite Pipelines UI reports an exit code of `137`, while the exit code actually emitted by the container was `1`. As a result, automatic retries configured for exit code `1` will not be triggered.
Workaround
Add a retry rule that covers all stack-level failures. An example of such a configuration would look like this:
```yaml
retry:
  automatic:
    - signal_reason: "stack_error"
      limit: 3
```
Fix
Upgrading to version v0.29.0 (or later) is recommended, as a "stack_error" signal reason was added to the agent to provide better visibility into stack-level errors.
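A sketch of pinning the Helm chart to that release, assuming chart versions track controller releases:

```bash
helm upgrade --install agent-stack-k8s oci://ghcr.io/buildkite/helm/agent-stack-k8s \
  --namespace buildkite \
  --version 0.29.0 \
  --values values.yaml
```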