Cancelling hanging jobs
We have rolled out a new fix for pipelines that have hanging jobs.
If you use a test-splitting queue such as Shopify's open-source tool ci-queue (https://github.com/Shopify/ci-queue) at scale, you may have noticed that some jobs hang around and make your builds take longer than expected.
There is now a new agent CLI command that cancels these pesky jobs and improves your build times.
To force cancel all jobs in a step, run the following from any job within the target build:
buildkite-agent step cancel --step <step_key> --force
In this kind of situation you may have a pipeline that looks like this:
steps:
  - command: test_reporter.sh
    key: reporter
  - command: test_worker.sh
    key: workers
    parallelism: 500
    soft_fail: true
In this pipeline the workers run the tests, but the outcome of the tests is reported by the reporter step, so the outcome of the worker jobs in Buildkite is not relevant to the build outcome. Even when a step is marked soft_fail, Buildkite still waits for all of its jobs to finish before marking the build as passed, which can make the build take longer than necessary. To fix this, you can add a command to the reporter step that cancels the workers once the reporter has finished:
steps:
  - command:
      - test_reporter.sh
      - buildkite-agent step cancel --step workers --force
    key: reporter
  - command: test_worker.sh
    key: workers
    parallelism: 500
    soft_fail: true
If you find this cancels too quickly, leaving agents unable to upload logs and artefacts, you can set a custom grace period with the --force-grace-period-seconds flag. This gives the agents time to finish their work before they are cancelled.
steps:
  - command:
      - test_reporter.sh
      - buildkite-agent step cancel --step workers --force --force-grace-period-seconds 10
    key: reporter
  - command: test_worker.sh
    key: workers
    parallelism: 500
    soft_fail: true
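One thing to check in your own setup is what happens when test_reporter.sh exits non-zero: if a failing command stops the rest of the command list from running, the cancel would be skipped exactly when tests fail. The following is a minimal sketch of a wrapper that always attempts the cancel and still returns the reporter's exit status. The function name report_then_cancel and the REPORTER_CMD and AGENT_BIN override variables are hypothetical, introduced only for illustration; the buildkite-agent flags are the ones shown above.

```shell
#!/usr/bin/env bash
# Sketch only: run the reporter, then cancel the worker step regardless of the
# reporter's result, and hand back the reporter's exit status.
# REPORTER_CMD and AGENT_BIN are hypothetical overrides so the function can be
# dry-run outside Buildkite; by default it uses the commands from this post.

report_then_cancel() {
  local step_key="$1"       # key of the step to cancel, e.g. "workers"
  local grace_seconds="$2"  # value for --force-grace-period-seconds
  local status=0

  # Capture the reporter's exit status instead of aborting on failure.
  "${REPORTER_CMD:-./test_reporter.sh}" || status=$?

  # Cancel the remaining worker jobs. A failed cancel is reported on stderr
  # but does not change the outcome reported by the reporter.
  "${AGENT_BIN:-buildkite-agent}" step cancel \
    --step "$step_key" \
    --force \
    --force-grace-period-seconds "$grace_seconds" \
    || echo "warning: could not cancel step '$step_key'" >&2

  return "$status"
}
```

A script used as the reporter step's command could define this function and finish with a call such as report_then_cancel workers 10, so that the job's exit status is still the reporter's.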
Quinn

