fix(core): prevent uncontrolled retry loop during fallback#27008
fix(core): prevent uncontrolled retry loop during fallback#27008adamfweidman wants to merge 5 commits into
Conversation
The retry/fallback path was re-enqueuing the pending user message without deduplication or delay when falling back to the same model, causing a multiplication loop. This fix ensures that we only retry immediately when transitioning to a new model, and respect the backoff delay when retrying the same model. TAG=agy CONV=37b3cdb6-8437-4ced-ae4e-3d04054bacb1
|
|
Size Change: +1.1 kB (0%) Total Size: 34.1 MB
ℹ️ View Unchanged
|
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses an issue where persistent 429 errors caused an uncontrolled retry loop when falling back to the same model. By differentiating between switching to a new model and retrying the existing one, the system now correctly applies backoff delays, preventing excessive request spamming. Highlights
Ignored Files
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
Extracts the duplicated delay and backoff logic in retryWithBackoff into a helper function applyBackoffDelay to improve maintainability. TAG=agy CONV=37b3cdb6-8437-4ced-ae4e-3d04054bacb1
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request adds debug logging for non-interactive mode and updates the fallback handler to return model names instead of booleans. The core change modifies the retry logic to apply exponential backoff when retrying with the same model, rather than retrying immediately. Review feedback identified a critical logic flaw where the attempt counter is reset even when the model remains the same, which could lead to infinite retry loops and bypass maximum attempt limits. The reviewer recommended unifying the model resolution logic and ensuring the attempt counter is only reset when a new model is actually introduced.
| const currentModel = getAvailabilityContext?.()?.policy.model; | ||
|
|
||
| const fallbackModel = await onPersistent429( | ||
| authType, | ||
| classifiedError, | ||
| ); | ||
| if (fallbackModel) { | ||
| attempt = 0; // Reset attempts and retry with the new model. | ||
| currentDelay = initialDelayMs; | ||
| continue; | ||
|
|
||
| // Only continue (immediate retry) if fallbackModel is a NEW model | ||
| if ( | ||
| typeof fallbackModel === 'string' && | ||
| fallbackModel !== currentModel | ||
| ) { | ||
| continue; | ||
| } else { | ||
| // If it's the same model (or a boolean retry signal), wait before retrying | ||
| currentDelay = await applyBackoffDelay( | ||
| currentDelay, | ||
| maxDelayMs, | ||
| signal, | ||
| ); | ||
| continue; | ||
| } | ||
| } |
There was a problem hiding this comment.
The fallback logic in retryWithBackoff contains a logic flaw that can lead to an infinite retry loop, bypassing the maxAttempts limit. This occurs because the attempt counter is unconditionally reset to 0 when a fallback model is suggested, even if the model remains the same. This violates the requirement for recursive handlers to have a mechanism to limit retry attempts. Additionally, the logic should use the resolved primary model from the availability context to ensure consistency and avoid re-evaluating conditions. Resetting currentDelay when falling back to the same model is too aggressive; exponential backoff should continue unless a new model is actually introduced.
const fallbackModel = await onPersistent429(
authType,
classifiedError,
);
if (fallbackModel) {
const currentContext = getAvailabilityContext?.();
const currentModel = currentContext?.policy.model;
if (
typeof fallbackModel === 'string' &&
fallbackModel !== currentModel
) {
attempt = 0;
currentDelay = initialDelayMs;
continue;
} else {
const jitter = currentDelay * 0.3 * (Math.random() * 2 - 1);
const delayWithJitter = Math.max(0, currentDelay + jitter);
await delay(delayWithJitter, signal);
currentDelay = Math.min(maxDelayMs, currentDelay * 2);
continue;
}
}References
- A recursive error/reconnect handler is acceptable as long as it includes a mechanism to limit the maximum number of retry attempts to prevent infinite loops.
- Avoid overlapping checks in model resolution logic by relying on condensed sources of truth. Use the resolved primary model directly rather than re-evaluating conditions.
- In retry loops, re-evaluate dynamic conditions like the maximum number of attempts on each iteration to ensure the logic adapts to potential state changes.
| const currentModel = getAvailabilityContext?.()?.policy.model; | ||
|
|
||
| const fallbackModel = await onPersistent429( | ||
| authType, | ||
| classifiedError, | ||
| ); | ||
| if (fallbackModel) { | ||
| attempt = 0; // Reset attempts and retry with the new model. | ||
| currentDelay = initialDelayMs; | ||
| continue; | ||
|
|
||
| // Only continue (immediate retry) if fallbackModel is a NEW model | ||
| if ( | ||
| typeof fallbackModel === 'string' && | ||
| fallbackModel !== currentModel | ||
| ) { | ||
| continue; | ||
| } else { | ||
| // If it's the same model (or a boolean retry signal), wait before retrying | ||
| currentDelay = await applyBackoffDelay( | ||
| currentDelay, | ||
| maxDelayMs, | ||
| signal, | ||
| ); | ||
| continue; | ||
| } | ||
| } |
There was a problem hiding this comment.
This block duplicates the logic flaw where the attempt counter is prematurely reset, potentially causing infinite loops. To ensure consistency and reduce logic duplication, this model resolution and retry logic should be unified. The check should rely on the resolved primary model from the context rather than static constants or redundant evaluations.
const fallbackModel = await onPersistent429(
authType,
classifiedError,
);
if (fallbackModel) {
const currentContext = getAvailabilityContext?.();
const currentModel = currentContext?.policy.model;
if (
typeof fallbackModel === 'string' &&
fallbackModel !== currentModel
) {
attempt = 0;
currentDelay = initialDelayMs;
continue;
} else {
const jitter = currentDelay * 0.3 * (Math.random() * 2 - 1);
const delayWithJitter = Math.max(0, currentDelay + jitter);
await delay(delayWithJitter, signal);
currentDelay = Math.min(maxDelayMs, currentDelay * 2);
continue;
}
}References
- Recursive handlers must include a mechanism to limit the maximum number of retry attempts to prevent infinite loops.
- Avoid overlapping checks in model resolution logic and reduce logic duplication by using the resolved primary model directly.
There was a problem hiding this comment.
Code Review
This pull request updates the fallback handler to return model identifiers and introduces a backoff mechanism in the retry logic when the fallback model is identical to the current one. A new test case was added to verify this behavior. Review feedback identified a potential infinite loop vulnerability if the current model is undefined, which would cause the logic to skip the necessary backoff delay and reset retry attempts prematurely.
| if ( | ||
| typeof fallbackModel === 'string' && | ||
| fallbackModel !== currentModel | ||
| ) { | ||
| continue; | ||
| } else { | ||
| // If it's the same model (or a boolean retry signal), wait before retrying | ||
| currentDelay = await applyBackoffDelay( | ||
| currentDelay, | ||
| maxDelayMs, | ||
| signal, | ||
| ); | ||
| continue; | ||
| } |
There was a problem hiding this comment.
The retry logic is vulnerable to a Denial of Service (DoS) due to uncontrolled retry loops and non-exponential backoff. This occurs because the attempt counter and currentDelay are reset before checking if the fallbackModel is new. Specifically, if currentModel is undefined, the condition fallbackModel !== currentModel incorrectly evaluates to true, leading to immediate retries without delay. This violates the repository rule requiring a mechanism to limit retry attempts to prevent infinite loops.
// Only continue (immediate retry) if fallbackModel is a NEW model
if (
typeof fallbackModel === 'string' &&
currentModel !== undefined &&
fallbackModel !== currentModel
) {
continue;
} else {
// If it's the same model (or a boolean retry signal), wait before retrying
currentDelay = await applyBackoffDelay(
currentDelay,
maxDelayMs,
signal,
);
continue;
}References
- A recursive error/reconnect handler is acceptable as long as it includes a mechanism to limit the maximum number of retry attempts to prevent infinite loops.
| if ( | ||
| typeof fallbackModel === 'string' && | ||
| fallbackModel !== currentModel | ||
| ) { | ||
| continue; | ||
| } else { | ||
| // If it's the same model (or a boolean retry signal), wait before retrying | ||
| currentDelay = await applyBackoffDelay( | ||
| currentDelay, | ||
| maxDelayMs, | ||
| signal, | ||
| ); | ||
| continue; | ||
| } |
There was a problem hiding this comment.
This section contains a similar Denial of Service (DoS) vulnerability. The attempt counter and currentDelay are reset before verifying if the fallbackModel is new. This, combined with the backoff delay being skipped if currentModel is undefined, can lead to an infinite loop and broken exponential backoff. This violates the repository rule requiring a mechanism to limit retry attempts to prevent infinite loops.
// Only continue (immediate retry) if fallbackModel is a NEW model
if (
typeof fallbackModel === 'string' &&
currentModel !== undefined &&
fallbackModel !== currentModel
) {
continue;
} else {
// If it's the same model (or a boolean retry signal), wait before retrying
currentDelay = await applyBackoffDelay(
currentDelay,
maxDelayMs,
signal,
);
continue;
}References
- A recursive error/reconnect handler is acceptable as long as it includes a mechanism to limit the maximum number of retry attempts to prevent infinite loops.
Summary
Fixes an uncontrolled retry loop bug where a single user message gets re-sent 30-50+ times to the fallback model when the primary model returns a 429 error.
Details
The bug was caused by
retryWithBackoffresetting the attempt counter and bypassing the delay on fallback, even when falling back to the same model. The fix ensures that it only skips the delay when transitioning to a new model, and respects the backoff delay when retrying the same model.Related Issues
Fixes #23889
How to Validate
Run the reproduction test added in
packages/core/src/utils/retry.test.ts:npm test -w @google/gemini-cli-core -- src/utils/retry.test.tsVerify that it waits between retries when falling back to the same model.
Pre-Merge Checklist