Running Buildkite Agent on AWS

The Buildkite Agent can be run on AWS using our Elastic CI Stack for AWS CloudFormation template, or by installing the agent on your self-managed instances.

Using our Elastic CI Stack for AWS CloudFormation template

The Elastic CI Stack for AWS is a CloudFormation template for an autoscaling Buildkite Agent cluster. The agent instances include Docker, S3 and CloudWatch integration.

You can use an Elastic CI Stack deployment to test Linux or Windows projects, parallelize large test suites, run Docker containers or docker-compose integration tests, or perform any AWS ops related tasks.

Read the documentation on GitHub, and launch it using the button on your organization’s Agents page.

Installing the Agent on your own AWS instances

To run the agent on your own AWS instances, use whichever installer matches your instance operating system. For Amazon Linux 2 use the Red Hat/CentOS installer. For macOS, see installing the agent on AWS mac1.metal instances

Installing the Agent on AWS mac1.metal instances

Setting up a macOS AMI that starts a Buildkite Agent on launch is a multi step process. You can start with one of the macOS AMIs from AWS, or with an AMI you've already installed Xcode or other software on.

To use Xcode and the iOS Simulator, you must configure auto-login of a GUI session, and launch the Buildkite Agent in an aqua session as a Launchd Agent:

  1. Reserve a AWS mac1.metal Dedicated Instance.
  2. Boot a macOS instance using your desired AMI on the Dedicated Instance
  3. Configure instance VPC subnets, security groups, and key pairs so that you can access the instance.
  4. Using an SSH or SSM session, set a password for the ec2-user and enable screen sharing.
  5. Using a VNC session, enable auto-login for the ec2-user in System Preferences and disable the auto screen lock and screen saver.
  6. Follow the OS X installation guide instructions to install the Buildkite Agent using Homebrew and configure starting on login.
  7. Verify that the Buildkite Agent has connected to with your desired agent tags.
  8. Create an AMI from your instance.

Your saved AMI can now be used to boot as many macOS instances as you require. To make this process repeatable, save your instance configuration in a Launch Template.

There is an excellent blog post on running iOS agents in the cloud that goes into more detail on preparing macOS AMIs using Packer.

Known issues

Preventing builds from accessing Amazon EC2 metadata

If you provision infrastructure like databases, Redis, Amazon SQS, etc using AWS permission sandboxes, you might want to restrict access to those roles in your builds.

If you run your builds on an AWS EC2 permission sandbox and then allow Buildkite agents to generate and inject some sandboxed AWS credentials into the build secrets, such builds will have access to the EC2 metadata API. They will also be able to access the same permissions as your EC2.

To avoid this, you need to prevent the builds from accessing your EC2 metadata or provide sandboxed AWS credentials for each build and restrict their permissions. There are two main ways to do it:

  • Compartmentalizing your Buildkite agents
  • Downgrading an instance profile role

If you run all the build steps in Docker containers, take a look at compartmentalizing your agents. If you are using Kubernetes for your Buildkite CI, use the same approach and also check out this article for more information and inspiration.

Restricting permissions via compartmentalization of agents

This approach suggests the use of Elastic Stack. However, these instructions can also be followed using hooks or scripts.

You can divide your Buildkite agents by responsibilities. For example — agents building for development environments or release, agents deploying for staging or production, etc. This will help reflect multiple AWS environments at your Buildkite organization.

To divide the responsibilities and permissions of Buildkite agents and provide the relevant teams with sandboxed IAM permissions for their own microservices, for each pipeline you will need to use a third-party AWS AssumeRole Buildkite Plugin. This plugin also takes care of the injection of AWS credentials.

To ensure that the agent in charge of a job, build, pipeline, etc., is allowed to run and will assume the role it has permission to, you can perform a pre-checkout hook on the agent.

Restricting permissions by downgrading an instance profile role

This approach is suggested by Amazon and is helpful if you are not using Elastic CI Stack for AWS.

To restrict permissions of an instance, you can permanently downgrade an instance’s profile from a high-permission bootstrap role to a low-permission steady-state role. The high-permission role has a policy that allows replacing the instance profile with a low-permission role, but there is no such policy for a low-permission role.

Further tightening the security around EC2 permissions

For added security, you can expire agents after a job. For example, you can:
1. Create a new agent for a pending job
2. Transition the agent to a sandbox role
3. Terminate the agent instance when the agent completes the job

Starting a new EC2 instance for every job results in a small trade-off of speed in favor of security. However, the Buildkite CI stack for AWS uses a Lambda to start new EC2 instances on demand, and it usually takes around one minute for a typical Linux instance.

A larger trade-off here is the need to keep discarding the cache on the machine — for example, pre-fetched and pre-built Docker images — and start anew every time.

If you’re less concerned about the CI spend, new EC2 instance starting time, and other resources, you can specify a minimum stack size large enough to keep a pool of agents ready to go. This way, you can quickly replace any terminated agent instance with a clean instance. Buildkite uses this approach to secure open-source agent instances as they could be running untrusted code.

For more information on AWS security practices regarding restricting access to the API in EKS, see Amazon EKS security best practices.

There is no exact recommended quantity of agents in a pool. An optimal pool size is the minimum number of available agents you would want to have ready to run jobs instantly.

You can start with one or two extra instances that are always available for running lightweight jobs (e.g., pipeline uploads), and you can increase the number of agents per machine so that they can run in parallel.

For organizations where at any given moment there are engineers working (for example, for shift-based 24/7 schedules or in globally distributed teams), having a large pool of build agents always available makes sense. Otherwise, idly running agents overnight might be a waste of resources.