Troubleshooting the Elastic CI Stack for GCP
Infrastructure as code isn't always easy to troubleshoot. This page describes ways to debug what's going on inside the Elastic CI Stack for GCP, along with solutions to specific issues.
Using Cloud Logging
Elastic CI Stack for GCP sends logs to Cloud Logging via the Ops Agent. The following log sources are available:
Application logs
- Buildkite Agent logs - log name: buildkite_agent
  - Contains agent lifecycle events, job execution, and errors
  - Severity levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
  - View in Logs Explorer: log_name="projects/PROJECT_ID/logs/buildkite_agent"
- Docker Daemon logs (if Docker is installed) - log name: docker
  - Contains Docker daemon events and errors
  - View in Logs Explorer: log_name="projects/PROJECT_ID/logs/docker"
- Preemption Monitor logs - log name: preemption_monitor
  - Contains preemptible instance termination handling logs
  - View in Logs Explorer: log_name="projects/PROJECT_ID/logs/preemption_monitor"
System logs
- System messages - log name: syslog
  - General system messages and events
  - View in Logs Explorer: log_name="projects/PROJECT_ID/logs/syslog"
- Authentication logs - log name: auth
  - SSH and authentication events
  - View in Logs Explorer: log_name="projects/PROJECT_ID/logs/auth"
Cloud Initialization logs
- Cloud-init logs - log name: cloud_init
  - VM bootstrap process logs
  - View in Logs Explorer: log_name="projects/PROJECT_ID/logs/cloud_init"
- Cloud-init output - log name: cloud_init_output
  - Output from startup scripts
  - View in Logs Explorer: log_name="projects/PROJECT_ID/logs/cloud_init_output"
Viewing logs in Cloud Console
- Navigate to Monitoring > Logs Explorer in the Cloud Console
- Use filters to view specific logs
View all logs from a specific instance:
resource.type="gce_instance"
resource.labels.instance_id="INSTANCE_ID"
View Buildkite agent errors:
resource.type="gce_instance"
log_name="projects/PROJECT_ID/logs/buildkite_agent"
severity >= ERROR
View startup script output:
resource.type="gce_instance"
log_name="projects/PROJECT_ID/logs/cloud_init_output"
Viewing logs with gcloud CLI
View recent Buildkite Agent logs:
gcloud logging read "resource.type=gce_instance AND log_name=projects/PROJECT_ID/logs/buildkite_agent" \
--limit 50 \
--format json \
--project PROJECT_ID
View logs from a specific instance:
gcloud logging read "resource.labels.instance_id=INSTANCE_ID" \
--limit 100 \
--freshness 1h \
--project PROJECT_ID
View ERROR-level logs only:
gcloud logging read "resource.type=gce_instance AND severity>=ERROR" \
--limit 50 \
--format json \
--project PROJECT_ID
For more information on logging, see LOGGING.md.
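The filters used in the commands above share one pattern, so it can help to build them in one place. The following is a minimal sketch, not part of the stack: bk_log_filter is a hypothetical helper name, and it assumes the log names listed earlier on this page.

```shell
# Hypothetical helper (not shipped with the stack): print a Cloud Logging
# filter for one of the stack's log names.
# Usage: bk_log_filter PROJECT_ID LOG_NAME [INSTANCE_ID] [MIN_SEVERITY]
bk_log_filter() (
  project="$1"
  log="$2"
  instance="${3:-}"
  severity="${4:-}"

  filter="resource.type=\"gce_instance\" AND log_name=\"projects/${project}/logs/${log}\""
  # Narrow to one instance only when an instance ID was given.
  if [ -n "$instance" ]; then
    filter="${filter} AND resource.labels.instance_id=\"${instance}\""
  fi
  # Add a minimum severity only when one was given.
  if [ -n "$severity" ]; then
    filter="${filter} AND severity>=${severity}"
  fi
  printf '%s\n' "$filter"
)

# Example (requires gcloud and access to the project):
# gcloud logging read "$(bk_log_filter PROJECT_ID buildkite_agent INSTANCE_ID ERROR)" \
#   --limit 50 --format json --project PROJECT_ID
```

The function only prints a string, so the result can also be pasted into Logs Explorer.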
Accessing Elastic CI Stack for GCP instances directly
Sometimes, looking at the logs isn't enough to figure out what's going on in your instances. In these cases, it can be useful to access the shell on the instance directly.
SSH access (if enabled)
If your Elastic CI Stack for GCP has been configured to allow SSH access (enable_ssh_access = true):
# SSH directly (requires an external IP or another inbound network path; Cloud NAT alone does not allow inbound SSH)
gcloud compute ssh INSTANCE_NAME --zone ZONE --project PROJECT_ID
Identity-Aware Proxy (IAP)
If IAP is enabled (enable_iap_access = true), you can SSH without external IPs:
# SSH via IAP tunnel
gcloud compute ssh INSTANCE_NAME \
--zone ZONE \
--tunnel-through-iap \
--project PROJECT_ID
Or use the SSH button in the Cloud Console:
- Navigate to Compute Engine > VM instances
- Click the SSH button next to the instance
Serial console
For instances that won't boot or are inaccessible:
# View serial console output
gcloud compute instances get-serial-port-output INSTANCE_NAME \
--zone ZONE \
--project PROJECT_ID
Managed instance group fails to boot instances
Resource shortages or configuration errors can cause this issue. Check the managed instance group's activity log for diagnostics.
Check instance group status:
gcloud compute instance-groups managed describe INSTANCE_GROUP_NAME \
--region REGION \
--project PROJECT_ID
Check for quota issues:
gcloud compute project-info describe --project PROJECT_ID
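The quota output can be long, so one option is to filter it down to the metrics that are close to their limit. The sketch below is an illustration under assumptions: bk_quota_near_limit is a hypothetical helper, and it expects tab-separated METRIC, LIMIT, and USAGE columns on standard input, which the --flatten and --format flags in the commented example are assumed to produce. Note that this command reports project-level quotas; regional quotas are reported by gcloud compute regions describe REGION.

```shell
# Hypothetical helper: read "METRIC<TAB>LIMIT<TAB>USAGE" lines on stdin and
# print each metric whose usage is at or above a fraction of its limit.
# Usage: bk_quota_near_limit [FRACTION]   (default 0.9)
bk_quota_near_limit() {
  awk -v threshold="${1:-0.9}" '
    # Skip quotas with a zero limit to avoid dividing by zero.
    $2 > 0 && ($3 / $2) >= threshold {
      printf "%s %.0f%%\n", $1, 100 * $3 / $2
    }'
}

# Example (requires gcloud; the output shape is an assumption):
# gcloud compute project-info describe --project PROJECT_ID \
#   --flatten="quotas[]" \
#   --format="value(quotas.metric,quotas.limit,quotas.usage)" \
#   | bk_quota_near_limit 0.9
```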
Instances are abruptly terminated
This can happen when using preemptible instances. GCP sends a notification to a preemptible instance 30 seconds before termination. The preemption-monitor service intercepts that notification and attempts a graceful shutdown.
To identify if your instance was preempted
Check Cloud Logging for the preemption monitor logs:
gcloud logging read "resource.type=gce_instance AND log_name=projects/PROJECT_ID/logs/preemption_monitor" \
--limit 20 \
--format json \
--project PROJECT_ID
Look for log lines indicating a termination notice:
Received preemption notice for instance INSTANCE_ID
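To see which instances were preempted, the matching log lines can be reduced to a list of instance IDs. This is a minimal sketch: bk_preempted_instances is a hypothetical helper, it relies on the message format shown above, and the commented example assumes the message is stored in the textPayload field, which may not match your log entries.

```shell
# Hypothetical helper: read log messages on stdin and print the unique
# instance IDs that received a preemption notice.
bk_preempted_instances() {
  sed -n 's/.*Received preemption notice for instance \([^ ]*\).*/\1/p' | sort -u
}

# Example (requires gcloud; textPayload is an assumption about the log format):
# gcloud logging read \
#   "resource.type=gce_instance AND log_name=projects/PROJECT_ID/logs/preemption_monitor" \
#   --freshness 1d --format "value(textPayload)" --project PROJECT_ID \
#   | bk_preempted_instances
```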
Stacks over-provision agents
If you have multiple stacks, check that each one listens to a unique queue, as set by the buildkite_queue variable. Each Elastic CI Stack for GCP you deploy should have a unique value for this variable. Otherwise, each stack scales out independently to service all the jobs on the queue, even though the jobs are distributed among the stacks, which leaves your stacks over-provisioned.
This can also happen if agents that are not part of an Elastic CI Stack for GCP are started with a tag of the form queue=<name of queue>. Such agents compete with the stack for jobs, but the stack scales out as if this competition did not exist.
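A quick way to spot shared queues is to list each stack with its buildkite_queue value and look for duplicates. The sketch below assumes you can produce that list yourself (for example, from your Terraform variable files); bk_duplicate_queues and the "STACK QUEUE" input format are hypothetical.

```shell
# Hypothetical helper: read "STACK_NAME QUEUE_NAME" pairs on stdin and print
# every queue that is used by more than one stack.
bk_duplicate_queues() {
  awk '
    {
      # Remember which stacks use each queue.
      if ($2 in stacks) stacks[$2] = stacks[$2] "," $1
      else stacks[$2] = $1
      count[$2]++
    }
    END {
      for (q in count) if (count[q] > 1) print q ": " stacks[q]
    }' | sort
}

# Example:
# printf 'stack-a default\nstack-b default\nstack-c deploy\n' | bk_duplicate_queues
```

An empty result means every stack in the list has its own queue. It does not detect standalone agents started with a matching queue tag.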
Instances fail to boot the Buildkite Agent
Check the managed instance group's activity logs and Cloud Logging for the booting instances to determine the issue. Observe where in the startup script the boot is failing, identify which resource the instances fail to use at that point, and fix that issue.
Check startup script logs:
gcloud logging read "resource.labels.instance_id=INSTANCE_ID AND log_name=projects/PROJECT_ID/logs/cloud_init_output" \
--limit 100 \
--format json \
--project PROJECT_ID
Instances fail jobs
Successfully booted instances can fail jobs for numerous reasons. A frequent cause is the disk filling up before the hourly cleanup job frees space or terminates the instance.
Check disk space on an instance:
# SSH into the instance
gcloud compute ssh INSTANCE_NAME --zone ZONE --project PROJECT_ID
# Check disk usage
df -h
# Check inode usage
df -i
# Check Docker disk usage
sudo docker system df
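When checking several instances, it can be quicker to print only the filesystems that are nearly full. The following is a sketch under assumptions: bk_full_filesystems is a hypothetical helper, it expects the six-column POSIX layout produced by df -P, and it does not handle filesystem names that contain spaces.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount point
# and usage of each filesystem at or above a usage percentage.
# Usage: bk_full_filesystems [PERCENT]   (default 90)
bk_full_filesystems() {
  awk -v threshold="${1:-90}" '
    NR > 1 {                      # skip the header line
      used = $5
      sub(/%/, "", used)          # "95%" -> "95"
      if (used + 0 >= threshold + 0) print $6 " " $5
    }'
}

# Example (run on the instance):
# df -P | bk_full_filesystems 90
```

With GNU df, the inode report from df -Pi uses the same column positions, so it can likely be piped through the same function.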
Check Docker cleanup logs:
# View regular cleanup logs
sudo journalctl -u docker-gc.service -n 50
# View emergency cleanup logs
sudo journalctl -u docker-low-disk-gc.service -n 50
Perform a manual cleanup
If an instance has a full disk, you can manually trigger cleanup:
# Run regular garbage collection
sudo systemctl start docker-gc.service
# Run emergency garbage collection
sudo systemctl start docker-low-disk-gc.service
# Check disk space status
sudo /usr/local/bin/bk-check-disk-space.sh
echo $? # 0 = healthy, 1 = low disk space
Autoscaling not working
If the managed instance group isn't scaling based on queue depth, you can try the following troubleshooting steps.
Check if autoscaling is enabled:
gcloud compute instance-groups managed describe INSTANCE_GROUP_NAME \
--region REGION \
--project PROJECT_ID
Verify if the buildkite-agent-metrics function is deployed:
gcloud functions list --project PROJECT_ID | grep buildkite-agent-metrics
Check if the metrics are being published:
gcloud monitoring time-series list \
--filter 'metric.type="custom.googleapis.com/buildkite/scheduled_jobs"' \
--project PROJECT_ID
Permission errors
If instances can't access resources, start by checking the service account's permissions:
gcloud projects get-iam-policy PROJECT_ID \
--flatten="bindings[].members" \
--filter="bindings.members:serviceAccount:elastic-ci-agent@*"
Common permission issues
- "Can't access Secret Manager" - set enable_secret_access = true.
- "Can't access Cloud Storage" - set enable_storage_access = true.
- "Can't pull Docker images from Artifact Registry" - grant the Artifact Registry Reader role.
- "Can't write logs" - verify that the Logs Writer role is assigned.
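The role checks above can be partly automated by comparing the roles granted to the service account against the roles you expect. This is a minimal sketch: bk_missing_roles is a hypothetical helper, the default role IDs (roles/logging.logWriter and roles/artifactregistry.reader) are the standard IDs for Logs Writer and Artifact Registry Reader, and the commented example only sees project-level bindings, so a role granted on an individual resource would be reported as missing.

```shell
# Hypothetical helper: read granted role IDs on stdin (one per line) and print
# each expected role that is not among them.
# Usage: bk_missing_roles [ROLE ...]   (defaults to the two roles below)
bk_missing_roles() (
  if [ "$#" -eq 0 ]; then
    set -- roles/logging.logWriter roles/artifactregistry.reader
  fi
  granted="$(cat)"
  for role in "$@"; do
    # -x matches the whole line, -F treats the role ID as a fixed string.
    printf '%s\n' "$granted" | grep -qxF "$role" || printf '%s\n' "$role"
  done
)

# Example (requires gcloud):
# gcloud projects get-iam-policy PROJECT_ID \
#   --flatten="bindings[].members" \
#   --filter="bindings.members:serviceAccount:elastic-ci-agent@*" \
#   --format="value(bindings.role)" | bk_missing_roles
```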
Getting help
If you're still stuck after trying the troubleshooting steps suggested above:
- Check the Issues in the GitHub repository.
- Email Buildkite Support at support@buildkite.com with:
  - Your stack configuration (redact sensitive values)
  - Relevant Cloud Logging logs
  - Terraform error messages
  - Instance group status and errors
Additional information
The following GCP documentation resources can help you with the troubleshooting process: