feat: add self-optimizing adaptive rate limiting with automatic retry #162
Conversation
Implements a `RequestQueue` class that manages API requests with configurable rate limiting. The queue automatically processes requests at the specified interval, preventing rate limit errors while ensuring all requests are eventually fulfilled.

Key features:
- Automatic request queuing when rate limit is configured
- Sequential processing with configurable delays
- Detailed logging of queue status and wait times
- Zero overhead when rate limiting is disabled

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>
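A minimal sketch of what such a queue might look like (the `RequestQueue` name and behavior follow this commit's description; the internals are illustrative, not the exact implementation):

```ts
// Sketch: a rate-limited FIFO queue. Requests run immediately when no
// rate limit is set; otherwise they are dispatched at a fixed interval.
class RequestQueue {
  private queue: Array<() => void> = []
  private processing = false

  constructor(private rateLimitSeconds: number) {}

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.queue.push(() => void task().then(resolve, reject))
      void this.process()
    })
  }

  private async process(): Promise<void> {
    if (this.processing) return
    this.processing = true
    while (this.queue.length > 0) {
      const run = this.queue.shift()!
      run() // dispatch without awaiting the response body
      if (this.rateLimitSeconds > 0 && this.queue.length > 0) {
        console.info(`[queue] ${this.queue.length} waiting, next in ${this.rateLimitSeconds}s`)
        await new Promise((r) => setTimeout(r, this.rateLimitSeconds * 1000))
      }
    }
    this.processing = false
  }
}
```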
Updates the rate limiting system to use the new `RequestQueue` for better handling of concurrent requests. Instead of rejecting or blocking requests that exceed the rate limit, they are now automatically queued and processed at the configured interval.

Changes:
- Add requestQueue to global state
- Introduce `executeWithRateLimit()` wrapper function
- Update chat-completions and messages handlers to use the queue
- Initialize the queue with the configured rate limit on server startup
- Add an eslint exception for a state-assignment race condition

The old `checkRateLimit()` function is kept for backwards compatibility but marked as deprecated.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>
Add utility module to parse rate limit headers from API responses. Supports multiple header formats:
- `X-RateLimit-*` (GitHub style)
- `RateLimit-*` (RFC draft)
- `Retry-After` (for 429 responses)

Implements an even-distribution strategy to calculate the optimal delay based on remaining requests and reset time.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>
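The even-distribution strategy can be expressed roughly like this (a sketch; the header names come from the commit, the function shape is assumed):

```ts
// Sketch: spread the remaining request budget evenly across the time
// left until the rate limit window resets.
interface RateLimitInfo {
  remaining: number // e.g. X-RateLimit-Remaining
  resetAt: number   // Unix seconds, e.g. X-RateLimit-Reset
}

function calculateDelayMs(info: RateLimitInfo, now: number = Date.now()): number {
  const windowMs = info.resetAt * 1000 - now
  if (windowMs <= 0 || info.remaining <= 0) return 0
  return Math.floor(windowMs / info.remaining)
}

// Example: 30 requests remaining and 60s until reset -> ~2000ms between requests.
```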
Add an optional `onHeaders` callback parameter to the `createChatCompletions` service to allow capturing response headers before processing the response body. Works for both streaming and non-streaming responses.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>
Integrate rate limit header parsing in the chat completions and messages handlers. The system now:
- Parses rate limit headers from API responses
- Calculates the optimal delay using even distribution
- Dynamically updates the request queue rate limit
- Falls back to the configured rate limit when headers are absent

This enables automatic adaptation to API rate limits and helps prevent abuse detection while maximizing throughput.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>
Add unit tests covering all rate limit header formats and delay calculation logic:
- `X-RateLimit-*` (GitHub/Copilot style)
- `RateLimit-*` (RFC draft format)
- `Retry-After` header (seconds and HTTP date)
- Header priority and fallback behavior
- Delay calculation with various scenarios

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>
Update comments to specify that `X-RateLimit-*` headers are in GitHub/Copilot style, since this proxy only calls the GitHub Copilot API.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>
Remove the default value of 3 seconds for the `--rate-limit` flag to ensure rate limiting is only active when explicitly requested by the user. This allows requests to execute immediately without queuing when the flag is not provided.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>
Remove adaptive rate limiting since the GitHub Copilot API does not provide rate limit headers. The API only returns `x-quota-snapshot-*` headers, which track quota usage rather than rate limits, and overage is permitted freely.

Removed:
- `src/lib/rate-limit-parser.ts`
- `tests/rate-limit-parser.test.ts`
- `onHeaders` callback from `createChatCompletions`
- Rate limit header parsing logic from handlers

The opt-in request queue remains functional for users who want to set a fixed rate limit via the `--rate-limit` flag.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>
Implement comprehensive rate limit resilience to make the API proxy unstoppable for AI agents running autonomously.

Features:
- Parse `Retry-After` header from 429 responses (supports seconds and HTTP dates)
- Automatic retry with exponential backoff (up to 5 retries)
- Dynamic rate limit adjustment based on API responses
- Enhanced error messages with retry information
- Works with and without the `--rate-limit` flag

Implementation:
- New `RateLimitError` class with retry information
- `parseRetryAfter()` handles GitHub's retry headers
- `RequestQueue.executeWithRetry()` handles automatic retries
- Queue adjusts rate limit dynamically when 429s occur
- `forwardError()` returns structured 429 responses with `Retry-After`

Benefits:
- No manual intervention needed for rate limit errors
- Agents can work autonomously all day long
- Learns and adapts to API rate limits in real time
- Never drops requests (retries up to 5 times)
- Clear logging shows retry attempts and wait times

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>
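The `Retry-After` parsing described here handles both value forms; a minimal sketch (the real `parseRetryAfter()` may differ in detail):

```ts
// Sketch: convert a Retry-After header into a wait in milliseconds.
// The header is either a number of seconds or an HTTP date.
function parseRetryAfter(value: string, now: number = Date.now()): number | null {
  const seconds = Number(value)
  if (!Number.isNaN(seconds)) return Math.max(0, seconds * 1000)
  const at = Date.parse(value) // e.g. "Wed, 21 Oct 2025 07:28:00 GMT"
  return Number.isNaN(at) ? null : Math.max(0, at - now)
}
```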
High-impact improvements for production resilience:

1. Jitter for Retry Delays:
   - Adds ±20% random jitter to all retry delays
   - Prevents thundering herd when many requests retry simultaneously
   - Applies to both rate limit retries and exponential backoff
2. Request Timeout:
   - 60-second timeout per request to prevent hanging
   - Timeout errors are automatically retried (transient)
   - Protects against an unresponsive upstream API
3. Queue Backpressure Warning (NOT rejection):
   - Logs a warning when queue depth exceeds 100 requests
   - NEVER rejects client requests - queues them all
   - Allows the API proxy to handle any volume gracefully
4. Better Error Categorization:
   - Retries transient errors: 429, 500, 502, 503, 504, timeouts, network errors
   - Fails immediately on permanent errors: 400, 401, 403, 404
   - Uses exponential backoff with jitter for non-429 retries (1s, 2s, 4s, 8s, 16s)
   - Smart detection of `HTTPError` status codes
5. Rate Limit Headers on All Responses:
   - `X-RateLimit-Limit`: Maximum requests per minute
   - `X-RateLimit-Remaining`: Requests remaining before rate limit
   - `X-RateLimit-Reset`: Unix timestamp when rate limit resets
   - `X-Queue-Depth`: Current queue size for visibility
   - `Retry-After`: Set when queue depth is high (>50 requests)

Benefits:
- Clients get proactive rate limit information
- No client requests are ever rejected
- Better distributed retry attempts (jitter)
- Faster failure on permanent errors
- Automatic recovery from transient failures
- Full transparency into API proxy state

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>
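The 60-second timeout can be sketched with an `AbortController` (illustrative; the actual `executeWithTimeout()` signature is not shown in this PR):

```ts
// Sketch: abort the underlying request after a deadline. When the signal
// fires, fetch rejects with an AbortError, which is treated as transient
// and retried.
async function executeWithTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  timeoutMs = 60_000,
): Promise<T> {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), timeoutMs)
  try {
    return await task(controller.signal)
  } finally {
    clearTimeout(timer)
  }
}
```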
Store the error body text when creating `HTTPError` to avoid consuming the `Response` body twice. The body can only be read once, so we cache it during initial error logging and reuse it in `forwardError`. This fixes crashes when handling non-retryable errors like 499 (client canceled request).

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>
Adds intelligent rate limiting that learns from both successes and failures:

**Adaptive Increase (on 429s):**
- Tracks rate limit hits in 60s windows
- Adds a 20% buffer when >3 hits/minute
- Adjusts to GitHub's `Retry-After` + buffer

**Adaptive Decrease (on successes):**
- Tracks consecutive successful requests
- Decreases rate limit by 10% after 10 successes
- Speeds up when the API allows it

**Smart Default:**
- Changed from 0 (disabled) to 1s (adaptive enabled)
- Use `--rate-limit 0` to explicitly disable
- Minimum: 100ms, Maximum: 60s

**Frequency-Based Adjustment:**
- More conservative when hitting many 429s
- Gradually speeds up when the API is happy
- Prevents over-aggressive rate limiting

This reduces 429 responses while maximizing throughput automatically.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>
Only call `updateRateLimit()` if the `--rate-limit` flag is explicitly provided. This allows the `RequestQueue` constructor's default of 1s to take effect for adaptive rate limiting.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>
Makes the system much more conservative to reduce 429 errors:

**Slower Decrease (Speed Up Less Aggressively):**
- Increase success threshold: 10 → 20 requests
- Decrease factor: 10% (0.9) → 5% (0.95)
- Now requires 20 consecutive successes before speeding up by only 5%

**Faster Increase with Buffer (Slow Down More Aggressively):**
- Lower buffer trigger: >3 hits → >2 hits per minute
- Increase buffer: 20% → 40%
- Applies a 40% buffer after just 3 rate limit hits in a 60s window

**Impact:**
- Reduces 429 errors significantly
- Stays at higher rate limits longer
- More cautious when speeding up
- More aggressive when hitting rate limits

This should dramatically reduce the ~64% rate limit error rate observed in production.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>
⚙️ Conservative Tuning Applied

Based on production testing showing ~64% rate limit errors, the adaptive algorithm has been made significantly more conservative. Changes in commit `5a47bff`:

**Slower Speed-Up (Decrease Rate Limit):**
- Success threshold: 10 → 20 consecutive requests
- Decrease factor: 10% → 5% per adjustment

**Faster Slow-Down (Increase Rate Limit):**
- Buffer trigger: >3 → >2 hits per minute
- Buffer size: 20% → 40%

**Expected Impact:** The system will now favor reliability over throughput optimization.
Implements two major optimizations to balance speed and reliability:

**1. Adaptive Decrease Strategy (Smarter Initial Rate Discovery)**
- When far from limit (>10s): 10% decrease after 10 successes
- When medium distance (2-10s): 7% decrease after 15 successes
- When close to limit (<2s): 5% decrease after 20 successes (cautious)

Impact: Converges to the optimal rate much faster (3-4x improvement)
Example: 20s → 18s → 16.7s → 15.4s (instead of 20s → 19s → 18.1s...)

**2. Request Deduplication/Caching**
- In-memory cache with 30s TTL, max 1000 entries
- SHA-256 hash of the request payload as the cache key
- Only caches non-streaming responses
- Reduces GitHub API calls for identical requests
- Automatic cleanup of expired entries

Impact: Dramatically reduces API calls for duplicate requests
Example: count_tokens requests, repeated messages

**Benefits:**
- Faster convergence from high rate limits (20s → ~10s)
- Reduced GitHub API usage (fewer 429s, lower quota consumption)
- Better client experience (faster responses for cached requests)
- Still maintains a conservative approach near actual limits

**Implementation:**
- Created `RequestCache` class with get/set/cleanup methods
- Integrated the cache into both /messages and /chat-completions handlers
- Cache only used for non-streaming to keep the implementation simple
- Cache returns null if an entry is expired or not found

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>
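A sketch of the cache described above (the `RequestCache` name and get/set/cleanup shape come from the commit; details such as key normalization are assumptions):

```ts
import { createHash } from "node:crypto"

// Sketch: in-memory cache keyed by a SHA-256 hash of the request payload,
// with a 30s TTL and a 1000-entry cap.
class RequestCache {
  private entries = new Map<string, { body: unknown; expiresAt: number }>()

  constructor(private ttlMs = 30_000, private maxEntries = 1000) {}

  private key(payload: unknown): string {
    // Note: JSON.stringify is order-sensitive; a real implementation
    // might normalize key order first.
    return createHash("sha256").update(JSON.stringify(payload)).digest("hex")
  }

  get(payload: unknown): unknown | null {
    const entry = this.entries.get(this.key(payload))
    if (!entry || entry.expiresAt < Date.now()) return null
    return entry.body
  }

  set(payload: unknown, body: unknown): void {
    if (this.entries.size >= this.maxEntries) this.cleanup()
    this.entries.set(this.key(payload), { body, expiresAt: Date.now() + this.ttlMs })
  }

  cleanup(): void {
    const now = Date.now()
    for (const [k, v] of this.entries) {
      if (v.expiresAt < now) this.entries.delete(k)
    }
  }
}
```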
🚀 Speed & Efficiency Improvements

Added two major optimizations in the latest commit.

**1. Adaptive Decrease Strategy (Faster Convergence)**

Instead of a fixed 5% decrease after 20 successes, the system now adapts based on distance from the limit:
- Far from limit (>10s): 10% decrease after 10 successes
- Medium distance (2-10s): 7% decrease after 15 successes
- Close to limit (<2s): 5% decrease after 20 successes

Impact: Converges to the optimal rate 3-4x faster.

**2. Request Caching (Reduced API Calls)**

Simple in-memory cache for duplicate requests:
- 30s TTL, max 1000 entries
- SHA-256 hash of the request payload as the key
- Non-streaming responses only

Impact: Dramatically fewer GitHub API calls for identical requests.

Combined benefit: Faster optimization + fewer API calls = better experience for proxy clients while maintaining a low error rate.
Combination approach to minimize 429 errors:
- Always add a buffer on every rate limit hit (no more bare minimum)
- 1st hit: +25% buffer
- 2+ hits: +50% buffer
- 3+ hits: +75% buffer

This addresses the issue of hitting multiple 429s in succession by being immediately conservative on the first rate limit, then increasingly cautious if we continue to hit limits.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>
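The escalation reduces roughly to one function (a sketch; the name is illustrative):

```ts
// Sketch: always pad GitHub's Retry-After, padding harder the more 429s
// we have seen in the current 60s window.
function bufferedDelayMs(retryAfterMs: number, hitsInWindow: number): number {
  const buffer = hitsInWindow >= 3 ? 0.75 : hitsInWindow >= 2 ? 0.5 : 0.25
  return Math.ceil(retryAfterMs * (1 + buffer))
}
```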
Summary
Implements a production-ready, self-optimizing adaptive rate limiting system with automatic retry, intelligent error handling, and comprehensive rate limit headers. The system learns from both successes and failures to find the optimal request rate automatically, favoring reliability over speed.
Motivation
When multiple requests arrive faster than GitHub's rate limits allow, the proxy needs to handle errors gracefully while maximizing throughput. AI agents need to work autonomously all day long without getting blocked by rate limits or transient failures.
This PR introduces the key principles and features described below.
Key Principles
✅ Never reject client requests - Queue everything, let clients decide based on headers
✅ Always return rate limit headers - Full transparency for client-side backpressure
✅ Reliability over speed - Conservative tuning minimizes 429 errors
✅ Automatic optimization - Finds optimal rate automatically without manual tuning
✅ Prevent thundering herd - Jitter distributes retry attempts
Features
🎯 Bidirectional Adaptive Rate Limiting

Starts Smart:
- Defaults to a 1s interval with adaptive adjustment enabled
- `--rate-limit 0` to explicitly disable
- `--rate-limit N` to set a custom initial rate

Learns from Failures (increases rate limit) - Conservative:
- Honors GitHub's `Retry-After` header instantly
- Adds an escalating buffer on repeated 429s (+25% / +50% / +75%)

Learns from Success (decreases rate limit) - Cautious:
- Speeds up only after sustained success, in small steps scaled by distance from the limit

Conservative Tuning (commit `5a47bff`):
- Success threshold: 20 consecutive requests (was 10)
- Decrease factor: 5% (was 10%)
- Buffer trigger: >2 hits per minute (was >3)
- Buffer size: 40% (was 20%)

Example Adaptation: see the worked timeline under "Behavior" below.
🔄 Automatic Retry with Jitter
- Parses the `Retry-After` header from GitHub (seconds or HTTP dates)
- Retries up to 5 times with exponential backoff (1s, 2s, 4s, 8s, 16s)
- Adds ±20% jitter to every delay to prevent thundering herd

📊 Dynamic Rate Limit Adjustment
- The queue raises its interval when 429s occur and lowers it after sustained success, so the configured rate tracks what the API actually allows
🛡️ Resilient Error Handling
- Retries transient errors: 429, 500, 502, 503, 504, timeouts, network errors
- Fails immediately on permanent errors: 400, 401, 403, 404
- 60-second timeout per request protects against a hanging upstream
📡 Rate Limit Headers on All Responses

Standard Headers:
- `X-RateLimit-Limit`: Maximum requests per minute (based on configured rate)
- `X-RateLimit-Remaining`: Requests remaining before hitting queue depth
- `X-RateLimit-Reset`: Unix timestamp when rate limit window resets
- `Retry-After`: Set when queue depth is high (>50), suggests client slowdown

Custom Headers:
- `X-Queue-Depth`: Current number of requests waiting in queue

Benefits:
- Clients get proactive rate limit information for client-side backpressure
- Full transparency into the proxy's state
🚀 Request Queue
- Never rejects client requests; everything is queued
- Logs a backpressure warning when queue depth exceeds 100
- Exposes the current depth via the `X-Queue-Depth` header

Implementation
Core Files
- `src/lib/queue.ts` (enhanced)
  - `successThresholdToDecrease = 20` (was 10)
  - `decreaseFactor = 0.95` (was 0.9 - 5% vs 10%)
  - `trackRateLimitHit()`: Tracks 429 frequency in 60s windows
  - `adjustRateLimitUp()`: Increases rate limit on 429s, adds 40% buffer on frequent hits
  - `trackSuccessfulRequest()`: Tracks successes and decreases rate limit cautiously
  - `executeWithRetry()`: Automatic retry logic with jitter
  - `executeWithTimeout()`: 60s request timeout
- `src/lib/retry.ts`
  - `addJitter()`: Adds ±20% random jitter to delays
  - `isRetryableError()`: Categorizes errors (retryable vs permanent)
  - `isTransientError()`: Checks HTTP status codes for transience
  - `parseRetryAfter()`: Parses `Retry-After` header (seconds or HTTP date)
  - `RateLimitError`: Structured error with retry information
  - `checkRateLimitError()`: Detects 429 responses and extracts retry info
- `src/lib/rate-limit-headers.ts`
  - `addRateLimitHeaders()`: Adds standard rate limit headers to responses
- `src/lib/error.ts` (enhanced)
  - Handles `RateLimitError` with structured 429 responses
  - Sets the `Retry-After` header for client compatibility
- `src/services/copilot/create-chat-completions.ts` (enhanced)
  - Throws `RateLimitError` instead of generic `HTTPError` on 429s
- `src/routes/*/handler.ts` (enhanced)
  - Routes requests through the queue and attaches rate limit headers
- `src/start.ts` (enhanced)
  - Only overrides the queue's adaptive default when `--rate-limit` is explicitly provided

Usage
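A usage sketch (the `npx copilot-api start` entry point is assumed from the repository; only the `--rate-limit` flag is described by this PR):

```
# Default: adaptive rate limiting, starting at a 1s interval
npx copilot-api start

# Custom initial rate of 5 seconds, still adaptive
npx copilot-api start --rate-limit 5

# Explicitly disable rate limiting
npx copilot-api start --rate-limit 0
```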
Response Headers Example
Normal operation:
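For example (header names from this PR; values illustrative):

```
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1735689600
X-Queue-Depth: 3
```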
When queue is high (>50 requests):
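For example (values illustrative):

```
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1735689600
X-Queue-Depth: 57
Retry-After: 30
```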
Behavior

Default (1s start with conservative adaptive adjustment):
- Requests are queued at a 1s interval; the interval rises on 429s and falls slowly after sustained success.

With custom `--rate-limit N`:
- Starts at N seconds and adapts from there.

Explicitly disabled (`--rate-limit 0`):
- Requests execute immediately with no queuing and no adaptation.

Example Conservative Adaptive Behavior:
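A worked sequence under the documented rules (numbers illustrative):

1. Start at the default 1s interval.
2. A 429 arrives with `Retry-After: 5`; the first hit in the window adds a +25% buffer, so the interval becomes 6.25s.
3. After 15 consecutive successes (medium distance from the limit), the interval drops 7% to ~5.81s.
4. Steps repeat until the interval stabilizes just above GitHub's actual limit.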
Example Resilience (Rate Limit):
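Illustrative flow:

1. A request receives a 429 with `Retry-After: 10`.
2. The queue waits 10s ± 20% jitter, raises its interval (+25% buffer on the first hit), and retries automatically.
3. The retry succeeds; the client sees a normal response, never the 429.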
Example Resilience (Transient Error):
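Illustrative flow:

1. Upstream returns a 502.
2. The request is retried after 1s (+jitter), then 2s, 4s, 8s, 16s if needed, up to 5 attempts.
3. A permanent error (e.g. 401) would instead be forwarded immediately with no retry.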
Benefits
- No manual intervention needed for rate limit errors; agents can run autonomously all day
- Learns and adapts to API rate limits in real time
- Never drops or rejects requests (retries up to 5 times, queues everything)
- Clear logging of retry attempts, wait times, and queue depth
Technical Notes
Conservative Adaptive Rate Limiting Algorithm
Increase on 429 (Aggressive):
- Honor GitHub's `Retry-After` header instantly
- Always add a buffer: +25% on the 1st hit, +50% on 2+ hits, +75% on 3+ hits in a 60s window
Decrease on Success (Cautious):
- Far from limit (>10s): 10% decrease after 10 consecutive successes
- Medium distance (2-10s): 7% decrease after 15 successes
- Close to limit (<2s): 5% decrease after 20 successes
- Bounds: minimum 100ms, maximum 60s
Rate Limit Detection
- `Retry-After` header (supports seconds and HTTP dates)
- `x-ratelimit-user-retry-after` (GitHub-specific)
- `x-ratelimit-exceeded` for detailed error info

Retry Strategy
- 429s: wait for the `Retry-After` header + jitter (±20%)
- Other transient errors: exponential backoff (1s, 2s, 4s, 8s, 16s) + jitter
- Up to 5 retries per request

Jitter Implementation
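A sketch of the ±20% jitter described above (the actual `addJitter()` may differ):

```ts
// Sketch: shift a delay by a uniform random amount in [-20%, +20%].
function addJitter(delayMs: number, factor = 0.2): number {
  const offset = delayMs * factor * (Math.random() * 2 - 1)
  return Math.max(0, Math.round(delayMs + offset))
}
```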
Error Categorization
- Transient (retried): 429, 500, 502, 503, 504, timeouts, network errors
- Permanent (fail fast): 400, 401, 403, 404
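Sketched as a predicate (the `HTTPError` shape with a numeric `status` is assumed):

```ts
// Sketch: transient statuses are retried; anything else with a status
// fails fast. Errors without a status (timeouts, network) are retried.
const TRANSIENT_STATUSES = new Set([429, 500, 502, 503, 504])

function isRetryableError(error: unknown): boolean {
  const status = (error as { status?: number }).status
  if (status === undefined) return true
  return TRANSIENT_STATUSES.has(status)
}
```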
Rate Limit Headers Calculation
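The PR does not show the exact arithmetic; a plausible sketch consistent with the header semantics above:

```ts
// Sketch (assumption): derive headers from the current interval and depth.
function rateLimitHeaders(intervalSeconds: number, queueDepth: number): Record<string, string> {
  const limit = intervalSeconds > 0 ? Math.floor(60 / intervalSeconds) : 0 // requests/minute
  const headers: Record<string, string> = {
    "X-RateLimit-Limit": String(limit),
    "X-RateLimit-Remaining": String(Math.max(0, limit - queueDepth)),
    "X-RateLimit-Reset": String(Math.floor(Date.now() / 1000) + 60),
    "X-Queue-Depth": String(queueDepth),
  }
  if (queueDepth > 50) {
    // Suggest a client slowdown proportional to the backlog (assumption).
    headers["Retry-After"] = String(Math.ceil(queueDepth * intervalSeconds))
  }
  return headers
}
```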
Breaking Changes
The default changes from rate limiting disabled to a 1s adaptive interval. Use `--rate-limit 0` to explicitly disable.

Test Plan