Terraform setup for the Elastic CI Stack for GCP

This guide helps you get started with the Elastic CI Stack for GCP using Terraform.

Elastic CI Stack for GCP allows you to launch a private, autoscaling Buildkite agent cluster in your own GCP project.

Before you start

Before deploying the Elastic CI Stack for GCP, review the prerequisites, required skills, and billable services to ensure you have the necessary tools, knowledge, and budget planning in place.

Prerequisites

To follow this guide, you will need:

  • A GCP project with billing enabled
  • Terraform (version 1.0 or later)
  • The gcloud CLI, authenticated to your project
  • An agent token for your Buildkite cluster (covered in the deployment steps below)

Billable services

The Elastic CI Stack for GCP template deploys several billable GCP services that do not require upfront payment and operate on a pay-as-you-go principle, with the bill proportional to usage.

Service name       Purpose                                            Required
Compute Engine     Deployment of VM instances                         Yes
Persistent Disk    Root disk storage of VM instances                  Yes
Cloud Functions    Publishing queue metrics for autoscaling           Yes
Secret Manager     Storing the Buildkite agent token (recommended)    Yes
Cloud Logging      Logs for instances and Cloud Function              Yes
Cloud Monitoring   Metrics for autoscaling                            Yes
Cloud NAT          Outbound internet access for instances             Yes
Cloud Storage      Build artifacts storage (if enabled)               No

Buildkite services are billed according to your plan.

What's on each machine?

When using the default base image, each machine includes:

  • The Buildkite agent
  • Docker Engine with Compose v2 and Buildx
  • Automated Docker garbage collection
  • Disk space monitoring and self-protection
  • Centralized logging with the Ops Agent

You can build a custom image if you need additional tools for your pipelines.

For more details on what versions are installed, see the corresponding Packer templates.

The Buildkite agent runs as user buildkite-agent.

Supported builds

This stack is designed to run your builds in a shared-nothing pattern, similar to the 12-factor app principles:

  • Each project should encapsulate its dependencies through Docker and Docker Compose (see the example after this list).
  • Build pipeline steps should assume no state on the machine (and instead rely on build meta-data, build artifacts, or Cloud Storage).
  • Secrets, including SSH keys for source control, are configured using Secret Manager.
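
For example, a step that runs its tests inside a Docker Compose service using the docker-compose Buildkite plugin might look like the following; the plugin version and service name here are illustrative:

steps:
  - label: "Run tests"
    command: "rake test"
    plugins:
      - docker-compose#v4.16.0:
          run: app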

By following these conventions, you get a scalable, repeatable, and source-controlled CI environment that any team within your organization can use.

Custom images

Custom images help teams ensure that their agents have all required tools and configuration in place before instances launch. Instances revert to their image state when they restart, so any manual changes made at runtime are lost; baking those changes into a custom image makes them persistent.

Requirements

To use the Packer templates provided, you will need the following installed on your system:

  • Docker
  • Make
  • gcloud CLI

The following GCP IAM permissions are required for building custom images using the provided Packer templates:

{
  "title": "Packer Image Builder",
  "description": "Permissions required to build VM images with Packer",
  "includedPermissions": [
    "compute.disks.create",
    "compute.disks.delete",
    "compute.disks.get",
    "compute.disks.use",
    "compute.images.create",
    "compute.images.delete",
    "compute.images.get",
    "compute.images.useReadOnly",
    "compute.instances.create",
    "compute.instances.delete",
    "compute.instances.get",
    "compute.instances.setMetadata",
    "compute.instances.setServiceAccount",
    "compute.machineTypes.get",
    "compute.networks.get",
    "compute.subnetworks.use",
    "compute.subnetworks.useExternalIp",
    "compute.zones.get",
    "iam.serviceAccounts.actAs"
  ]
}
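
Assuming you save the role definition above as packer-role.json, one way to create it as a custom role is with the gcloud CLI; the role ID, file name, and project ID below are placeholders:

# Create a custom IAM role from the JSON definition above
gcloud iam roles create packerImageBuilder \
  --project=your-gcp-project-id \
  --file=packer-role.json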

It is also recommended that you have a basic working knowledge of Packer and Compute Engine images.

Creating an image

To create a custom image with Docker support (recommended for production):

cd packer
./build --project-id your-gcp-project-id

This builds a Debian 12-based image with:

  • Pre-installed Buildkite agent
  • Docker Engine with Compose v2 and Buildx
  • Multi-architecture build support
  • Automated Docker garbage collection
  • Disk space monitoring and self-protection
  • Centralized logging with Ops Agent
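
Once the build completes, you can confirm the image was published. The filter below assumes the template publishes to the buildkite-ci-stack image family used in the custom image example later in this guide:

gcloud compute images list \
  --project=your-gcp-project-id \
  --filter="family:buildkite-ci-stack"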

Deploying the stack

This section walks through the deployment process step by step, from obtaining your agent token to initializing and applying your Terraform configuration.

Step 1: Get your Buildkite agent token

Obtain the value of the agent token you previously configured for your Buildkite cluster.

If you don't have your agent token's value, you'll need to create a new one, which you can do from the Agents > Clusters > your specific cluster page. Once created, don't forget to copy the agent token's value and save it somewhere secure, as you won't be able to see its value from Buildkite again.
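
Step 2: Store the agent token in Secret Manager

The main.tf example in the next step reads the token from a Secret Manager secret named buildkite-agent-token. One way to create that secret with the gcloud CLI (the token value and project ID below are placeholders; the secret name must match the one referenced in your Terraform configuration):

# Create the secret the Terraform module will read the token from
printf '%s' "your-agent-token" | gcloud secrets create buildkite-agent-token \
  --data-file=- \
  --replication-policy="automatic" \
  --project=your-gcp-project-id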

Step 3: Create your Terraform configuration

Create a new directory for your Terraform configuration:

mkdir buildkite-gcp-stack
cd buildkite-gcp-stack

Create a main.tf file:

terraform {
  required_version = ">= 1.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 4.0, < 8.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

module "buildkite_stack" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  # Required
  project_id                   = var.project_id
  buildkite_organization_slug  = var.buildkite_organization_slug
  buildkite_agent_token_secret = "projects/${var.project_id}/secrets/buildkite-agent-token/versions/latest"

  # Stack configuration
  stack_name      = "buildkite"
  buildkite_queue = "default"
  region          = var.region

  # Scaling configuration
  min_size = 0
  max_size = 10

  # Instance configuration
  machine_type = "e2-standard-4"
}

Create a variables.tf file:

variable "project_id" {
  description = "GCP project ID"
  type        = string
}

variable "region" {
  description = "GCP region"
  type        = string
  default     = "us-central1"
}

variable "buildkite_organization_slug" {
  description = "Buildkite organization slug"
  type        = string
}

Create a terraform.tfvars file:

project_id                  = "your-gcp-project-id"
region                      = "us-central1"
buildkite_organization_slug = "your-org-slug"

Create an outputs.tf file (optional):

output "network_name" {
  description = "Name of the VPC network"
  value       = module.buildkite_stack.network_name
}

output "instance_group_name" {
  description = "Name of the managed instance group"
  value       = module.buildkite_stack.instance_group_manager_name
}

output "agent_service_account_email" {
  description = "Email of the agent service account"
  value       = module.buildkite_stack.agent_service_account_email
}

Step 4: Initialize and deploy

  1. Authenticate with GCP:

     gcloud auth application-default login

  2. Initialize Terraform:

     terraform init

  3. Review the planned changes:

     terraform plan

  4. Deploy the stack:

     terraform apply

  5. Type yes when prompted to confirm the deployment.

The module will create:

  • VPC network with Cloud NAT
  • IAM service accounts with appropriate permissions
  • Managed instance group with Buildkite agents
  • Cloud Function for autoscaling metrics
  • Health checks and autoscaling based on queue depth
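
Once the apply completes, a quick way to verify the deployment is to read the module outputs defined earlier and check that the instance group exists; the project ID below is a placeholder:

# Show the module outputs (network, instance group, service account)
terraform output

# List managed instance groups in the project
gcloud compute instance-groups managed list --project=your-gcp-project-id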

Advanced configuration

This section covers configurations you can use for deeper customization of your stack.

Using a custom VM image

If you built a custom Packer image with Docker support:

module "buildkite_stack" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  # ... other configuration ...

  # Use custom image family
  image = "buildkite-ci-stack"
}

Configuring agent tags

Target specific agents in your pipeline steps using tags:

module "buildkite_stack" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  # ... other configuration ...

  buildkite_agent_tags = "docker=true,os=linux,environment=production"
}

Then in your pipeline.yml, set the following:

steps:
  - command: echo "hello from production"
    agents:
      queue: "default"
      environment: "production"

For more information, see the Queues overview page.

Multiple queues

To create multiple agent pools with different configurations, deploy multiple stacks with different queue names:

# Production stack
module "buildkite_stack_production" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  stack_name      = "buildkite-production"
  buildkite_queue = "production"
  machine_type    = "e2-standard-4"
  max_size        = 20

  # ... other configuration ...
}

# Build stack for larger builds
module "buildkite_stack_builds" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  stack_name      = "buildkite-builds"
  buildkite_queue = "builds"
  machine_type    = "n1-standard-8"
  max_size        = 10

  # ... other configuration ...
}

Enabling Cloud Storage access

If your builds need to upload/download artifacts to Cloud Storage:

module "buildkite_stack" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  # ... other configuration ...

  enable_storage_access = true
}
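
With enable_storage_access set, the agents' service account can read and write Cloud Storage objects. One way to direct the agent's artifact uploads to a bucket is the BUILDKITE_ARTIFACT_UPLOAD_DESTINATION environment variable, for example in an agent environment hook; the bucket name below is a placeholder:

# In the agent's environment hook
export BUILDKITE_ARTIFACT_UPLOAD_DESTINATION="gs://your-artifacts-bucket/$BUILDKITE_JOB_ID"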

Using IAP for secure SSH access

Enable Identity-Aware Proxy for secure SSH access without external IPs:

module "buildkite_stack" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  # ... other configuration ...

  enable_iap_access = true
}

Then connect to instances:

gcloud compute ssh INSTANCE_NAME \
  --zone ZONE \
  --tunnel-through-iap \
  --project PROJECT_ID

Restricting SSH access

Restrict SSH access to specific IP ranges:

module "buildkite_stack" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  # ... other configuration ...

  enable_ssh_access  = true
  ssh_source_ranges  = ["203.0.113.0/24"]  # Your office IP range
}

SSH keys for source control

The Elastic CI Stack for GCP automatically loads SSH keys from GCP Secret Manager and adds them to an ephemeral ssh-agent for your builds. This allows your builds to clone private repositories without storing keys on disk.

The agent's environment hook checks for secrets in the following order:

  1. {pipeline-slug}/private_ssh_key - pipeline-specific SSH key
  2. {pipeline-slug}/id_rsa_github - pipeline-specific GitHub deploy key
  3. private_ssh_key - global SSH key shared across all pipelines
  4. id_rsa_github - global GitHub deploy key shared across all pipelines

Where {pipeline-slug} is the slug of the pipeline running the build. Pipeline-specific keys are checked first, then the global keys. The first key found is loaded into the agent.

The enable_secret_access Terraform variable must be set to true (the default) for agents to access secrets from Secret Manager.

Uploading an SSH key

To generate a private SSH key and store it in Secret Manager:

# Generate a deploy key for your project
ssh-keygen -t rsa -b 4096 -f id_rsa_buildkite
pbcopy < id_rsa_buildkite.pub # Add this to your repository's deploy keys

# Store as a global key (available to all pipelines)
gcloud secrets create private_ssh_key \
  --data-file=id_rsa_buildkite \
  --project=your-project-id

# Clean up the local key
rm id_rsa_buildkite id_rsa_buildkite.pub

To store a pipeline-specific key, include the pipeline slug in the secret name:

gcloud secrets create "my-pipeline-slug/private_ssh_key" \
  --data-file=id_rsa_buildkite \
  --project=your-project-id

Updating an existing SSH key

To update a key already stored in Secret Manager, add a new version:

gcloud secrets versions add private_ssh_key \
  --data-file=id_rsa_buildkite \
  --project=your-project-id
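
You can then confirm that the new version is enabled:

gcloud secrets versions list private_ssh_key \
  --project=your-project-id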

Adding resource labels

Add labels for cost tracking and organization:

module "buildkite_stack" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  # ... other configuration ...

  labels = {
    team        = "platform"
    environment = "production"
    cost-center = "engineering"
  }
}

Updating the stack

To update your stack configuration:

  1. Modify your Terraform configuration files.

  2. Review the changes:

     terraform plan

  3. Apply the changes:

     terraform apply

Terraform will automatically trigger a rolling update to minimize disruption:

  • New instances are created with the updated configuration
  • Old instances are drained and terminated
  • The update respects the max_surge and max_unavailable settings (see the sketch after this list)
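
If the module exposes the rolling update policy as inputs, you can tune this behavior. The variable names below are assumptions, so verify them against the module's documented inputs before using them:

module "buildkite_stack" {
  source  = "buildkite/elastic-ci-stack-for-gcp/buildkite"
  version = ">= 0.1.0"

  # ... other configuration ...

  # Hypothetical rolling update settings; check the module's inputs
  max_surge       = 3
  max_unavailable = 0
}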

Destroying the stack

To tear down the entire stack, use:

terraform destroy

Additional information

To gain a better understanding of how Elastic CI Stack for GCP works and how to use it most effectively and securely, check out the following resources: