Stack API

The Stack API provides endpoints for implementing a stack reliably. These endpoints require Agent tokens for authentication.

A stack is defined as a software process that has these two abilities simultaneously:

  • The ability to pull/receive new jobs from the Buildkite API.
  • The ability to turn those job definitions into running agents.

A stack can also be broadly understood as an orchestrator or a scheduler of Buildkite jobs.

The Stack API powers Buildkite's Agent Kubernetes Stack. It's designed to give advanced enterprise users custom control over the scheduling of jobs at larger scales.

This API is currently available in preview.

Register a stack

Register a new stack or update an existing one. You must use this API to register a stack key before using any of the following APIs. You can register a stack key once as an ad hoc step, or make registration part of your stack implementation. This endpoint is idempotent, so repeated registration is safe.

The register payload includes a mandatory queue_key field, which tells Buildkite which cluster queue the stack is intended to serve. However, this binding isn't enforced, so a single stack implementation could serve all cluster queues.

The number of active stacks per organization is limited, and each stack is subject to independent rate limits.

Request payload:

Field      Type              Required  Description
key        string            Yes       Unique identifier for the stack within the organization.
type       string            Yes       Type of stack. Third-party stacks should use "custom".
queue_key  string            Yes       Key of the cluster queue the stack plans to serve.
metadata   key-value object  Yes       Additional metadata for the stack.

Example:

curl -H "Authorization: Token $BUILDKITE_CLUSTER_TOKEN" \
  -H "Content-Type: application/json" \
  -X POST "https://agent.buildkite.com/v3/stacks/register" \
  -d '{
    "key": "my-kubernetes-stack",
    "type": "custom",
    "queue_key": "default",
    "metadata": {
      "version": "1.0.0",
      "region": "us-east-1"
    }
  }'

Response:

{
  "key": "my-kubernetes-stack",
  "type": "custom",
  "cluster_queue_key": "default",
  "metadata": {
    "version": "1.0.0",
    "region": "us-east-1"
  }
}

Success response: 201 Created (new stack) or 200 OK (existing stack updated)
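
As a rough illustration of the registration call above, the following Python sketch builds the same request as the curl example without sending it. The function names build_register_request and is_registered are hypothetical helpers introduced here, not part of any Buildkite library; the URL, headers, and field names are taken from the example above.

```python
import json
import urllib.request

# Base URL copied from the curl example above.
AGENT_API = "https://agent.buildkite.com/v3"


def build_register_request(token, key, queue_key, metadata, stack_type="custom"):
    """Return an unsent POST request for the register endpoint (hypothetical helper)."""
    body = {
        "key": key,
        "type": stack_type,
        "queue_key": queue_key,
        "metadata": metadata,
    }
    return urllib.request.Request(
        f"{AGENT_API}/stacks/register",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def is_registered(status):
    """201 means a new stack was created, 200 means an existing stack was updated."""
    return status in (200, 201)
```

Passing the returned request to urllib.request.urlopen would send it, and the status attribute of the response could then be checked with is_registered.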

De-register a stack

De-register a stack from the cluster. Ideally, when a stack stops, it should use this API to de-register its key from the Buildkite backend. This will ensure an organization doesn't exceed the stack count quota.

curl -H "Authorization: Token $BUILDKITE_CLUSTER_TOKEN" \
  -X POST "https://agent.buildkite.com/v3/stacks/my-kubernetes-stack/deregister"

Success response: 204 No Content
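
One way a stack could make sure it de-registers when it stops, even after an error, is to wrap its main loop in a context manager. This is only a sketch: registered_stack is a hypothetical name, and the register and deregister arguments stand in for whatever functions your stack uses to call the two endpoints.

```python
from contextlib import contextmanager


@contextmanager
def registered_stack(register, deregister):
    """Register on entry and always de-register on exit, even after an error."""
    register()
    try:
        yield
    finally:
        # Runs on normal exit and when the body raises.
        deregister()
```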

List scheduled jobs (Metadata only)

This is the most important endpoint in the Stack API. It fetches all jobs that Buildkite's internal state machine has scheduled to run. When a cluster queue is paused, cluster_queue.dispatch_paused is true in the response, and a stack implementation must respect this flag by not starting new jobs while the queue is paused.

A stack typically makes scheduling decisions based on the returned metadata, then turns each job into a running agent by starting buildkite-agent with the --acquire-job flag.

It's worth noting that until these jobs transition into another state, the API will keep returning them. To avoid starting duplicate jobs, we offer some utility APIs below.

Query parameters:

Parameter  Type     Required  Description
queue_key  string   Yes       Filter jobs by queue key
limit      integer  No        Maximum number of jobs to return, max 1000

Example:

curl -H "Authorization: Token $BUILDKITE_CLUSTER_TOKEN" \
  -X GET "https://agent.buildkite.com/v3/stacks/my-kubernetes-stack/scheduled_jobs?queue_key=default&limit=10"

Response:

{
  "jobs": [
    {
      "id": "01234567-89ab-cdef-0123-456789abcdef",
      "scheduled_at": "2023-10-01T12:00:00.000Z",
      "priority": 1,
      "agent_query_rules": ["test=a"],
      "pipeline_slug": "my-pipeline",
      "pipeline_id": "pipeline-uuid",
      "build_number": 123,
      "build_branch": "main",
      "build_id": "build-uuid",
      "step_key": "test"
    }
  ],
  "page_info": {
    "has_next_page": false,
    "end_cursor": null
  },
  "cluster_queue": {
    "id": "queue-id",
    "dispatch_paused": false
  }
}

Success response: 200 OK
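
To illustrate how a stack might act on this response, here is a minimal Python sketch that honors dispatch_paused and skips jobs the stack has already started. The function name jobs_to_start, the capacity parameter, and the ordering rule (higher priority first, then oldest scheduled_at) are assumptions made for this example, not behavior defined by the API.

```python
def jobs_to_start(response, already_started, capacity):
    """Pick job IDs to start from a list scheduled jobs response (hypothetical helper)."""
    # A paused queue means no new jobs may be started.
    if response["cluster_queue"]["dispatch_paused"]:
        return []
    # The API keeps returning jobs until they change state, so skip known ones.
    candidates = [j for j in response["jobs"] if j["id"] not in already_started]
    # Assumption: larger priority values run first; ties go to the oldest job.
    # Timestamps in the same ISO 8601 format sort correctly as strings.
    candidates.sort(key=lambda j: (-j["priority"], j["scheduled_at"]))
    return [j["id"] for j in candidates[:capacity]]
```

The already_started set is the simplest form of duplicate protection; reserving jobs through the API is an alternative when the stack keeps no state of its own.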

Get a job (Env + command)

In some cases, the job metadata returned by the list scheduled jobs API isn't sufficient to make a full scheduling decision. In such cases, you can use this API to get the full payload of an individual job, which contains env and command. Due to the dynamic nature of Buildkite pipelines, these two fields can often grow beyond 100KB.

This API is useful when you want to make scheduling decisions based on in-depth analysis of a job.

JOB_UUID="01234567-89ab-cdef-0123-456789abcdef"
curl -H "Authorization: Token $BUILDKITE_CLUSTER_TOKEN" \
  -X GET "https://agent.buildkite.com/v3/stacks/my-kubernetes-stack/jobs/$JOB_UUID"

Response:

{
  "id": "01234567-89ab-cdef-0123-456789abcdef",
  "env": {
    "BUILDKITE_JOB_ID": "01234567-89ab-cdef-0123-456789abcdef",
    "BUILDKITE_BUILD_NUMBER": "123"
  },
  "command": "echo Hello 👋"
}

Success response: 200 OK
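
As one possible use of the full payload, a stack could read a value from env to decide how much infrastructure to provision. The sketch below is purely illustrative: pick_resource_class is a hypothetical helper, and STACK_RESOURCE_CLASS is a made-up variable that pipeline authors would have to set themselves; Buildkite does not define it.

```python
def pick_resource_class(job_payload, default="small"):
    """Choose a resource class from a job's full payload (hypothetical helper).

    STACK_RESOURCE_CLASS is an invented variable name used for illustration only.
    """
    env = job_payload.get("env", {})
    return env.get("STACK_RESOURCE_CLASS", default)
```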

Reserve jobs

To avoid picking up duplicate jobs, a stack can reserve the jobs that it has decided to execute. A stack that calls this API should only execute jobs that were successfully reserved, as listed in the reserved field of the response. Until the reservation expires, reserved jobs do not appear in subsequent list scheduled jobs API calls. If the reservation expires, the jobs return to the scheduled state.

You can reserve multiple jobs in one call, and you can call this API repeatedly to extend the expiry of existing reservations.

Alternatively, a stack implementation can maintain its own persistence layer to keep track of the job lifecycle, in which case calling this API is unnecessary.

Request payload:

Field                       Type           Required  Description
job_uuids                   array[string]  Yes       Array of job UUIDs to reserve
reservation_expiry_seconds  integer        No        Reservation duration in seconds

Example:

curl -H "Authorization: Token $BUILDKITE_CLUSTER_TOKEN" \
  -H "Content-Type: application/json" \
  -X PUT "https://agent.buildkite.com/v3/stacks/my-kubernetes-stack/scheduled_jobs/batch_reserve" \
  -d '{
    "job_uuids": [
      "01234567-89ab-cdef-0123-456789abcdef",
      "fedcba98-7654-3210-fedc-ba9876543210"
    ],
    "reservation_expiry_seconds": 1800
  }'

Response:

{
  "reserved": [
    "01234567-89ab-cdef-0123-456789abcdef",
    "fedcba98-7654-3210-fedc-ba9876543210"
  ],
  "not_reserved": []
}

Success response: 200 OK
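
The rule that only successfully reserved jobs should be executed can be sketched in Python as follows. The helper names build_reserve_body and runnable_jobs are hypothetical; the field names come from the request and response shown above.

```python
import json


def build_reserve_body(job_uuids, expiry_seconds=None):
    """Build the JSON body for a batch reserve call (hypothetical helper)."""
    body = {"job_uuids": list(job_uuids)}
    # reservation_expiry_seconds is optional, so only send it when given.
    if expiry_seconds is not None:
        body["reservation_expiry_seconds"] = expiry_seconds
    return json.dumps(body)


def runnable_jobs(requested, reserve_response):
    """Keep only the jobs the API reports as reserved, in the requested order."""
    reserved = set(reserve_response.get("reserved", []))
    return [uuid for uuid in requested if uuid in reserved]
```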

Get job states

Retrieve the current state of multiple jobs. This is useful when a stack is provisioning infrastructure for a job and the job is cancelled before the infrastructure is ready. A stack can choose to decommission infrastructure proactively to save cost.

This API also tells a stack when responsibility for a job can safely shift to the running agent.

This API uses the POST method so that a batch of job UUIDs can be sent in the request body.

Request payload:

Field      Type           Required  Description
job_uuids  array[string]  Yes       Array of job UUIDs to get states for

Example:

curl -H "Authorization: Token $BUILDKITE_CLUSTER_TOKEN" \
  -H "Content-Type: application/json" \
  -X POST "https://agent.buildkite.com/v3/stacks/my-kubernetes-stack/jobs/get-states" \
  -d '{
    "job_uuids": [
      "01234567-89ab-cdef-0123-456789abcdef",
      "fedcba98-7654-3210-fedc-ba9876543210"
    ]
  }'

Response:

{
  "states": {
    "01234567-89ab-cdef-0123-456789abcdef": "scheduled",
    "fedcba98-7654-3210-fedc-ba9876543210": "running"
  }
}

Success response: 200 OK
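
A stack could use the returned states to find jobs whose infrastructure is no longer needed. The sketch below is written under a stated assumption: the example response only confirms the state names "scheduled" and "running", so the "canceled" value used as the default here is a guess that should be checked against real responses. The function name jobs_to_decommission is hypothetical.

```python
def jobs_to_decommission(states, provisioning, cancelled_states=frozenset({"canceled"})):
    """Return provisioning jobs whose reported state means the work was cancelled.

    Assumption: "canceled" is the state name for a cancelled job; pass a
    different set in cancelled_states if the API reports other values.
    """
    return sorted(uuid for uuid in provisioning if states.get(uuid) in cancelled_states)
```

Jobs missing from the states mapping are left alone, so an incomplete response never triggers a teardown.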

Fail a job

Mark a job as failed when the stack cannot execute it. In some situations, an agent cannot be spawned due to infrastructure or other issues. In this case, for each job, a stack can call this API at most once to fail the job with error details. error_detail can be an arbitrary string less than 4KB.

This API is critical for shortening the feedback cycle for end users. For example, if a pod has an image pull issue, the Kubernetes stack uses this API to fail the job with feedback.

A job that is failed with this approach will have a special error popup on the Buildkite Build page.

Request payload:

Field         Type     Required  Description
exit_status   integer  Yes       Exit status code for the failed job. Cannot be 0
error_detail  string   Yes       Error description (max 4KB)

Example:

curl -H "Authorization: Token $BUILDKITE_CLUSTER_TOKEN" \
  -H "Content-Type: application/json" \
  -X POST "https://agent.buildkite.com/v3/stacks/my-kubernetes-stack/jobs/$JOB_UUID/fail" \
  -d '{
    "exit_status": -1,
    "error_detail": "Stack failed to start agent: insufficient resources"
  }'

Success response: 200 OK
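
The payload constraints above (a non-zero exit_status and an error_detail under 4KB) can be enforced before the request is sent. This Python sketch assumes that "4KB" means 4096 bytes of UTF-8; the helper name build_fail_body is hypothetical.

```python
import json

# Assumption: the 4KB limit is measured as 4096 bytes of UTF-8 text.
MAX_ERROR_DETAIL_BYTES = 4096


def build_fail_body(exit_status, error_detail):
    """Build the JSON body for a fail call, enforcing the documented limits."""
    if exit_status == 0:
        raise ValueError("exit_status cannot be 0")
    raw = error_detail.encode("utf-8")
    if len(raw) >= MAX_ERROR_DETAIL_BYTES:
        # Cut below the limit and drop any partially cut trailing character.
        raw = raw[: MAX_ERROR_DETAIL_BYTES - 1]
        error_detail = raw.decode("utf-8", errors="ignore")
    return json.dumps({"exit_status": exit_status, "error_detail": error_detail})
```

Because the API can be called at most once per job, a stack would also need to remember which jobs it has already failed; that bookkeeping is left out of this sketch.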

Create stack notification

When a stack takes more than a few seconds to provision infrastructure for a job, or is waiting for an external condition to be satisfied, it can post short textual notifications to the Buildkite build page. This helps with visibility and debugging. Each notification detail is a short string. A job cannot have more than 50 stack notifications, so a stack should use this API judiciously.

Request payload:

Field   Type    Required  Description
detail  string  Yes       Short notification message (max length 255)

Example:

curl -H "Authorization: Token $BUILDKITE_CLUSTER_TOKEN" \
  -H "Content-Type: application/json" \
  -X POST "https://agent.buildkite.com/v3/stacks/my-kubernetes-stack/jobs/$JOB_UUID/stack_notifications" \
  -d '{
    "detail": "Pod is starting up"
  }'

Success response: 200 OK
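
Since each job is limited to 50 notifications, a stack may want a small client-side guard before calling this API. The following Python sketch is one possible approach: the NotificationBudget class is hypothetical, it assumes the 255 limit counts characters, and skipping consecutive duplicates is a design choice made here rather than something the API requires.

```python
class NotificationBudget:
    """Tracks notifications per job so a stack stays within the documented limits."""

    MAX_PER_JOB = 50
    MAX_DETAIL_LENGTH = 255  # assumption: the limit is counted in characters

    def __init__(self):
        self._sent = {}  # job UUID -> number of notifications prepared
        self._last = {}  # job UUID -> most recent detail text

    def prepare(self, job_uuid, detail):
        """Return the detail to send, or None if the notification should be skipped."""
        detail = detail[: self.MAX_DETAIL_LENGTH]
        if self._last.get(job_uuid) == detail:
            return None  # do not spend budget on a repeated message
        if self._sent.get(job_uuid, 0) >= self.MAX_PER_JOB:
            return None  # budget for this job is exhausted
        self._sent[job_uuid] = self._sent.get(job_uuid, 0) + 1
        self._last[job_uuid] = detail
        return detail
```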