Terraform setup for the Elastic CI Stack for GCP

This guide helps you to get started with the Elastic CI Stack for GCP using Terraform.

Elastic CI Stack for GCP allows you to launch a private, autoscaling Buildkite agent cluster in your own GCP project.

Before you start

Before deploying the Elastic CI Stack for GCP, review the prerequisites, required skills, and billable services to ensure you have the necessary tools, knowledge, and budget planning in place.

Prerequisites

Terraform version >= 1.0
Buildkite Account
GCP Account with a project
gcloud CLI configured

Required and recommended skills

The Elastic CI Stack for GCP does not require familiarity with the underlying GCP services to deploy it. However, to run builds, some familiarity with the following GCP services is recommended:

Google Compute Engine (to select a machine_type appropriate for your workload)
Google Cloud Storage (for storing build artifacts)
Secret Manager (for storing the Buildkite agent token securely)

Elastic CI Stack for GCP provides defaults and pre-configurations suited for most use cases without the need for additional customization. Still, you'll benefit from familiarity with VPCs, Cloud NAT, and firewall rules for custom instance networking.

For post-deployment diagnostic purposes, deeper familiarity with Compute Engine is recommended to be able to access the instances launched to execute Buildkite jobs over SSH or Identity-Aware Proxy.

Billable services

The Elastic CI Stack for GCP template deploys several billable GCP services that do not require upfront payment and operate on a pay-as-you-go principle, with the bill proportional to usage.

Service name	Purpose	Required
Compute Engine	Deployment of VM instances	☑️
Persistent Disk	Root disk storage of VM instances	☑️
Cloud Functions	Publishing queue metrics for autoscaling	☑️
Secret Manager	Storing the Buildkite agent token (recommended)	☑️
Cloud Logging	Logs for instances and Cloud Function	☑️
Cloud Monitoring	Metrics for autoscaling	☑️
Cloud NAT	Outbound internet access for instances	☑️
Cloud Storage	Build artifacts storage (if enabled)	❌

Buildkite services are billed according to your plan.

What's on each machine?

When using the default base image, each machine includes:

Debian 13 (trixie)
The Buildkite agent
Git
Docker (when using custom Packer image)
Docker Compose v2 (when using custom Packer image)
Docker Buildx (when using custom Packer image)
gcloud CLI - useful for performing any ops-related tasks
jq - useful for manipulating JSON responses from CLI tools

You can build a custom image if you need additional tools for your pipelines.

For more details on what versions are installed, see the corresponding Packer templates.

The Buildkite agent runs as user buildkite-agent.

Supported builds

This stack is designed to run your builds in a shared-nothing pattern similar to the 12 factor application principles:

Each project should encapsulate its dependencies through Docker and Docker Compose.
Build pipeline steps should assume no state on the machine (and instead rely on the build meta-data, build artifacts, or Cloud Storage).
Secrets, including SSH keys for source control, are configured using Secret Manager.

By following these conventions, you get a scalable, repeatable, and source-controlled CI environment that any team within your organization can use.

Custom images

Custom images help teams ensure that their agents have all required tools and configurations before instance launch. This prevents instances from reverting to the base image state when agents restart, which would lose any manual changes made during run time.

Requirements

To use the Packer templates provided, you will need to install the following installed on your system:

Docker
Make
gcloud CLI

The following GCP IAM permissions are required for building custom images using the provided Packer templates:

{
  "title": "Packer Image Builder",
  "description": "Permissions required to build VM images with Packer",
  "includedPermissions": [
    "compute.disks.create",
    "compute.disks.delete",
    "compute.disks.get",
    "compute.disks.use",
    "compute.images.create",
    "compute.images.delete",
    "compute.images.get",
    "compute.images.useReadOnly",
    "compute.instances.create",
    "compute.instances.delete",
    "compute.instances.get",
    "compute.instances.setMetadata",
    "compute.instances.setServiceAccount",
    "compute.machineTypes.get",
    "compute.networks.get",
    "compute.subnetworks.use",
    "compute.subnetworks.useExternalIp",
    "compute.zones.get",
    "iam.serviceAccounts.actAs"
  ]
}

It is also recommended that you have a base knowledge of:

Creating an image

To create a custom image with Docker support (recommended for production):

cd packer
./build --project-id your-gcp-project-id

This builds a Debian 13-based image with:

Pre-installed Buildkite agent
Docker Engine with Compose v2 and Buildx
Multi-architecture build support
Automated Docker garbage collection
Disk space monitoring and self-protection
Centralized logging with Ops Agent

Deploying the stack

This section walks through the deployment process step by step, from obtaining your agent token to initializing and applying your Terraform configuration.

Step 1: Get your Buildkite agent token

Obtain the value for the agent token you'd previously configured for your Buildkite cluster.

If you don't have your agent token's value, you'll need to create a new one, which you can do from the Agents > Clusters > your specific cluster page. Once created, don't forget to copy the agent token's value and save it somewhere secure, as you won't be able to see its value from Buildkite again.

Step 2: Store the token in Secret Manager (recommended)

For production deployments, store the token in Secret Manager:

echo -n "your-agent-token" | gcloud secrets create buildkite-agent-token \
  --data-file=- \
  --project=your-project-id

# Verify the secret was created
gcloud secrets describe buildkite-agent-token --project=your-project-id

Step 3: Create your Terraform configuration

Create a new directory for your Terraform configuration:

mkdir buildkite-gcp-stack
cd buildkite-gcp-stack

Create a main.tf file:

terraform {
  required_version = ">= 1.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 4.0, < 8.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

module "buildkite_stack" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  # Required
  project_id                  = var.project_id
  buildkite_organization_slug = var.buildkite_organization_slug
  buildkite_agent_token_secret = "projects/${var.project_id}/secrets/buildkite-agent-token/versions/latest"

  # Stack configuration
  stack_name      = "buildkite"
  buildkite_queue = "default"
  region          = var.region

  # Scaling configuration
  min_size = 0
  max_size = 10

  # Instance configuration
  machine_type = "e2-standard-4"
}

Create a variables.tf file:

variable "project_id" {
  description = "GCP project ID"
  type        = string
}

variable "region" {
  description = "GCP region"
  type        = string
  default     = "us-central1"
}

variable "buildkite_organization_slug" {
  description = "Buildkite organization slug"
  type        = string
}

Create a terraform.tfvars file:

project_id                  = "your-gcp-project-id"
region                      = "us-central1"
buildkite_organization_slug = "your-org-slug"

Create an outputs.tf file (optional):

output "network_name" {
  description = "Name of the VPC network"
  value       = module.buildkite_stack.network_name
}

output "instance_group_name" {
  description = "Name of the managed instance group"
  value       = module.buildkite_stack.instance_group_manager_name
}

output "agent_service_account_email" {
  description = "Email of the agent service account"
  value       = module.buildkite_stack.agent_service_account_email
}

Step 4: Initialize and deploy

Authenticate with GCP:

gcloud auth application-default login

Initialize Terraform:

terraform init

Review the planned changes:

terraform plan

Deploy the stack:

terraform apply

Type yes when prompted to confirm the deployment.

The module will create:

VPC network with Cloud NAT
IAM service accounts with appropriate permissions
Managed instance group with Buildkite agents
Cloud Function for autoscaling metrics
Health checks and autoscaling based on queue depth

Advanced configuration

This section covers some of the configurations you might want to use for a deeper customization of your stack.

Using a custom VM image

If you built a custom Packer image with Docker support:

module "buildkite_stack" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  # ... other configuration ...

  # Use custom image family
  image = "buildkite-ci-stack"
}

Configuring agent tags

Target specific agents in your pipeline steps using tags:

module "buildkite_stack" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  # ... other configuration ...

  buildkite_agent_tags = "docker=true,os=linux,environment=production"
}

Then in your pipeline.yml, set the following:

steps:
  - command: echo "hello from production"
    agents:
      queue: "default"
      environment: "production"

For more information, see the Queues overview page.

Multiple queues

To create multiple agent pools with different configurations, deploy multiple stacks with different queue names:

# Production stack
module "buildkite_stack_production" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  stack_name      = "buildkite-production"
  buildkite_queue = "production"
  machine_type    = "e2-standard-4"
  max_size        = 20

  # ... other configuration ...
}

# Build stack for larger builds
module "buildkite_stack_builds" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  stack_name      = "buildkite-builds"
  buildkite_queue = "builds"
  machine_type    = "n1-standard-8"
  max_size        = 10

  # ... other configuration ...
}

Enabling Cloud Storage access

If your builds need to upload/download artifacts to Cloud Storage:

module "buildkite_stack" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  # ... other configuration ...

  enable_storage_access = true
}

Using IAP for secure SSH access

Enable Identity-Aware Proxy for secure SSH access without external IPs:

module "buildkite_stack" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  # ... other configuration ...

  enable_iap_access = true
}

Then connect to instances:

gcloud compute ssh INSTANCE_NAME \
  --zone ZONE \
  --tunnel-through-iap \
  --project PROJECT_ID

Restricting SSH access

Restrict SSH access to specific IP ranges:

module "buildkite_stack" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  # ... other configuration ...

  enable_ssh_access  = true
  ssh_source_ranges  = ["203.0.113.0/24"]  # Your office IP range
}

SSH keys for source control

The Elastic CI Stack for GCP automatically loads SSH keys from GCP Secret Manager and adds them to an ephemeral ssh-agent for your builds. This allows your builds to clone private repositories without storing keys on disk.

The agent's environment hook checks for secrets in the following order:

{pipeline-slug}/private_ssh_key - pipeline-specific SSH key
{pipeline-slug}/id_rsa_github - pipeline-specific GitHub deploy key
private_ssh_key - global SSH key shared across all pipelines
id_rsa_github - global GitHub deploy key shared across all pipelines

Where {pipeline-slug} is the slug of the pipeline running the build. Pipeline-specific keys are checked first, then the global keys. The first key found is loaded into the agent.

The enable_secret_access Terraform variable must be set to true (the default) for agents to access secrets from Secret Manager.

Uploading an SSH key

To generate a private SSH key and store it in Secret Manager:

# Generate a deploy key for your project
ssh-keygen -t rsa -b 4096 -f id_rsa_buildkite
pbcopy < id_rsa_buildkite.pub # Add this to your repository's deploy keys

# Store as a global key (available to all pipelines)
gcloud secrets create private_ssh_key \
  --data-file=id_rsa_buildkite \
  --project=your-project-id

# Clean up the local key
rm id_rsa_buildkite id_rsa_buildkite.pub

To store a pipeline-specific key, include the pipeline slug in the secret name:

gcloud secrets create "my-pipeline-slug/private_ssh_key" \
  --data-file=id_rsa_buildkite \
  --project=your-project-id

Updating an existing SSH key

To update a key already stored in Secret Manager, add a new version:

gcloud secrets versions add private_ssh_key \
  --data-file=id_rsa_buildkite \
  --project=your-project-id

Adding resource labels

Add labels for cost tracking and organization:

module "buildkite_stack" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  # ... other configuration ...

  labels = {
    team        = "platform"
    environment = "production"
    cost-center = "engineering"
  }
}

Updating the stack

To update your stack configuration:

Modify your Terraform configuration files
Review the changes:

terraform plan

Apply the changes:

terraform apply

Terraform will automatically perform rolling updates to minimize disruption:

New instances will be created with the updated configuration
Old instances will be drained and terminated
The process of updating the stack respects max_surge and max_unavailable settings

Destroying the stack

To tear down the entire stack, use:

terraform destroy

Additional information

To gain a better understanding of how Elastic CI Stack for GCP works and how to use it most effectively and securely, check out the following resources: