fix(scale-up): prevent negative TotalTargetCapacity when runners exceed maximum#5062
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes an edge case in the control-plane runner scaling logic where concurrent pool/scale-up activity can temporarily push currentRunners above maximumRunners, leading to negative target capacity being sent to EC2. Also adds retry handling to improve resilience when de-registering runners from GitHub during scale-down.
Changes:
- Guard
newRunnerscalculation inscale-up.tswithMath.max(0, ...)to prevent negative capacity. - Add a unit test covering the “current runners exceed maximum” race condition scenario.
- Add retry-with-backoff behavior for GitHub runner deletion in
scale-down.ts, plus tests for transient 5xx failures.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| lambdas/functions/control-plane/src/scale-runners/scale-up.ts | Prevents negative runner counts from being passed into fleet creation. |
| lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts | Adds coverage for the race-condition scenario where current runners exceed max. |
| lambdas/functions/control-plane/src/scale-runners/scale-down.ts | Introduces retry logic for GitHub runner de-registration to handle transient API errors. |
| lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts | Adds tests validating retry behavior and non-termination when retries are exhausted. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
lambdas/functions/control-plane/src/scale-runners/scale-down.ts
Outdated
Show resolved
Hide resolved
lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts
Outdated
Show resolved
Hide resolved
8a1ea0b to
78c1ddb
Compare
…ed maximum When pool and scale-up lambdas run concurrently, currentRunners can temporarily exceed maximumRunners. This caused the calculation `maximumRunners - currentRunners` to produce a negative value, which was then passed to EC2 CreateFleet API, resulting in: InvalidTargetCapacitySpecification: TotalTargetCapacity should not be negative. This fix wraps the calculation with Math.max(0, ...) to ensure we never attempt to create a negative number of runners. Fixes race condition between pool-lambda and scale-up-lambda.
78c1ddb to
07e96d6
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Problem
When pool and scale-up lambdas run concurrently,
currentRunnerscan temporarily exceedmaximumRunners. This causes the calculationmaximumRunners - currentRunnersto produce a negative value, which is then passed to the EC2 CreateFleet API.Error observed in CloudWatch logs:
Root cause:
The scale-up lambda calculates:
When
currentRunners > maximumRunners(e.g., 55 runners with max 50), this produces:This negative value is passed to
CreateFleetAPI which rejects it.Solution
Wrap the calculation with
Math.max(0, ...)to ensure we never attempt to create a negative number of runners:Changes
Math.max(0, ...)guard to prevent negative runner countTesting
createRunnercall is made in this scenario