2023-12-01

Waiting for lost agents to time out can be frustrating and leads to considerably longer build times. This issue is amplified when running large fleets with thousands of agents or using cost-saving measures like Spot Instances, which can introduce instability.

You'll notice a significant difference with this update: we've reduced the time it takes to detect and clean up lost agents from 10 minutes to 4 minutes. Jobs can now be reassigned to new agents as early as 4 minutes after we stop receiving a heartbeat.

Where is the 60% time reduction coming from?

-2 minutes: The grace period for missing agent heartbeats has been reduced from 5 to 3 minutes.

-4 minutes: The process for cleaning up lost agents has been optimized, reducing its runtime from 5 minutes to 1 minute.
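To make the arithmetic concrete, here is a minimal sketch that adds up the two stages before and after the change. The durations come from the figures above; the stage names and the `total_detection_time` helper are hypothetical and only serve the illustration. They are not real configuration keys or part of any API.

```python
from datetime import timedelta

# Durations taken from the figures in this post.
# The dictionary keys are illustrative names, not real settings.
OLD_STAGES = {
    "heartbeat_grace_period": timedelta(minutes=5),
    "lost_agent_cleanup": timedelta(minutes=5),
}
NEW_STAGES = {
    "heartbeat_grace_period": timedelta(minutes=3),
    "lost_agent_cleanup": timedelta(minutes=1),
}


def total_detection_time(stages):
    """Sum the stage durations a lost agent passes through before its job can be reassigned."""
    return sum(stages.values(), timedelta())


old_total = total_detection_time(OLD_STAGES)
new_total = total_detection_time(NEW_STAGES)
saved = old_total - new_total

# Dividing one timedelta by another yields a plain float ratio.
reduction_pct = 100 * (saved / old_total)

print(f"before: {old_total}, after: {new_total}, saved: {saved}")
print(f"reduction: {reduction_pct:.0f}%")
```

The two savings (2 minutes from the grace period, 4 minutes from cleanup) sum to 6 minutes out of the original 10, which is where the 60% figure comes from.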

With shorter timeouts, lost jobs now fail and recover faster, slashing build times. And we all love faster feedback loops.