Skip to content

fix(scale-up): prevent negative TotalTargetCapacity when runners exceed maximum#5062

Open
shivdesh wants to merge 1 commit intogithub-aws-runners:mainfrom
shivdesh:fix/negative-target-capacity-race-condition
Open

fix(scale-up): prevent negative TotalTargetCapacity when runners exceed maximum#5062
shivdesh wants to merge 1 commit intogithub-aws-runners:mainfrom
shivdesh:fix/negative-target-capacity-race-condition

Conversation

@shivdesh
Copy link

@shivdesh shivdesh commented Mar 10, 2026

Problem

When pool and scale-up lambdas run concurrently, currentRunners can temporarily exceed maximumRunners. This causes the calculation maximumRunners - currentRunners to produce a negative value, which is then passed to the EC2 CreateFleet API.

Error observed in CloudWatch logs:

InvalidTargetCapacitySpecification: TotalTargetCapacity should not be negative.

Root cause:

The scale-up lambda calculates:

const newRunners = Math.min(scaleUp, maximumRunners - currentRunners);

When currentRunners > maximumRunners (e.g., 55 runners with max 50), this produces:

newRunners = Math.min(1, 50 - 55) = Math.min(1, -5) = -5

This negative value is passed to CreateFleet API which rejects it.

Solution

Wrap the calculation with Math.max(0, ...) to ensure we never attempt to create a negative number of runners:

const newRunners = Math.max(0, Math.min(scaleUp, maximumRunners - currentRunners));

Changes

  • scale-up.ts: Added Math.max(0, ...) guard to prevent negative runner count
  • scale-up.test.ts: Added test case for race condition scenario

Testing

  • Added unit test that simulates 10 current runners with max of 5
  • Verified no createRunner call is made in this scenario
  • All existing tests pass

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an edge case in the control-plane runner scaling logic where concurrent pool/scale-up activity can temporarily push currentRunners above maximumRunners, leading to negative target capacity being sent to EC2. Also adds retry handling to improve resilience when de-registering runners from GitHub during scale-down.

Changes:

  • Guard newRunners calculation in scale-up.ts with Math.max(0, ...) to prevent negative capacity.
  • Add a unit test covering the “current runners exceed maximum” race condition scenario.
  • Add retry-with-backoff behavior for GitHub runner deletion in scale-down.ts, plus tests for transient 5xx failures.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
lambdas/functions/control-plane/src/scale-runners/scale-up.ts Prevents negative runner counts from being passed into fleet creation.
lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts Adds coverage for the race-condition scenario where current runners exceed max.
lambdas/functions/control-plane/src/scale-runners/scale-down.ts Introduces retry logic for GitHub runner de-registration to handle transient API errors.
lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts Adds tests validating retry behavior and non-termination when retries are exhausted.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@shivdesh shivdesh force-pushed the fix/negative-target-capacity-race-condition branch from 8a1ea0b to 78c1ddb Compare March 11, 2026 21:01
…ed maximum

When pool and scale-up lambdas run concurrently, currentRunners can
temporarily exceed maximumRunners. This caused the calculation
`maximumRunners - currentRunners` to produce a negative value, which
was then passed to EC2 CreateFleet API, resulting in:

  InvalidTargetCapacitySpecification: TotalTargetCapacity should not be negative.

This fix wraps the calculation with Math.max(0, ...) to ensure we never
attempt to create a negative number of runners.

Fixes race condition between pool-lambda and scale-up-lambda.
@shivdesh shivdesh force-pushed the fix/negative-target-capacity-race-condition branch from 78c1ddb to 07e96d6 Compare March 12, 2026 16:27
@npalm npalm requested a review from Copilot March 13, 2026 00:21
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants