@leocavalcante commented on Dec 30, 2025

Summary

Implements a production-ready, self-optimizing adaptive rate limiting system with automatic retry, intelligent error handling, and comprehensive rate limit headers. The system learns from both successes and failures to find the optimal request rate automatically, favoring reliability over speed.

Motivation

When multiple requests arrive faster than GitHub's rate limits allow, the proxy needs to handle 429s gracefully while maximizing throughput. AI agents need to run autonomously all day without being blocked by rate limits or transient failures.

This PR introduces:

  1. Bidirectional adaptive rate limiting - learns to speed up and slow down automatically
  2. Smart 1s default - proactive rate limiting enabled by default
  3. Conservative tuning - favors reliability, minimizes 429 errors
  4. Automatic retry with intelligent backoff and jitter
  5. Dynamic rate limit adjustment based on API responses and frequency
  6. Comprehensive rate limit headers on all responses
  7. Smart error categorization (retryable vs permanent)
  8. Request timeout protection against hanging requests

Key Principles

Never reject client requests - Queue everything, let clients decide based on headers
Always return rate limit headers - Full transparency for client-side backpressure
Reliability over speed - Conservative tuning minimizes 429 errors
Automatic optimization - Finds optimal rate automatically without manual tuning
Prevent thundering herd - Jitter distributes retry attempts

Features

🎯 Bidirectional Adaptive Rate Limiting

Starts Smart:

  • Default: 1 second between requests (proactive)
  • Use --rate-limit 0 to explicitly disable
  • Use --rate-limit N to set custom initial rate

Learns from Failures (increases rate limit) - Aggressive slow-down:

  • Tracks 429 responses in 60-second windows
  • Adjusts to GitHub's Retry-After header instantly
  • Adds 40% buffer when hitting >2 rate limits/minute (very conservative)
  • More aggressive buffer when hitting many 429s
  • Maximum: 60s between requests

Learns from Success (decreases rate limit) - Cautious speed-up:

  • Tracks consecutive successful requests
  • Decreases rate limit by 5% after 20 successes (cautious)
  • Gradually speeds up only when API consistently allows it
  • Minimum: 100ms between requests

Conservative Tuning (commit 5a47bff):

Based on production testing showing ~64% rate limit errors:
- Success threshold: 10 → 20 requests (slower speed-up)
- Decrease factor: 10% → 5% (smaller speed-up)
- Buffer trigger: >3 → >2 hits/min (faster slow-down)
- Buffer percentage: 20% → 40% (more aggressive slow-down)

Result: System favors staying at higher rate limits longer

Example Adaptation:

Start: 1.0s → Hit 429 → 14.0s (conservative jump)
→ 3 more 429s → 19.6s (14s × 1.4 buffer applied)
→ 20 successes → 18.6s (5% decrease)
→ 20 successes → 17.7s (5% decrease)
→ Hit 429 → 24.8s (17.7s × 1.4 buffer)

🔄 Automatic Retry with Jitter

  • Automatic retry: Up to 5 retry attempts per request on transient errors
  • Intelligent backoff:
    • Rate limit (429): Uses Retry-After header from GitHub
    • Other transient errors: Exponential backoff (1s, 2s, 4s, 8s, 16s)
  • Jitter: Adds ±20% randomization to prevent thundering herd
  • No dropped requests: Every request is retried until it succeeds or exhausts its retry budget
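
A minimal sketch of this retry loop, assuming the addJitter, isRetryableError, and RateLimitError helpers from src/lib/retry.ts (names and shapes are illustrative, not the exact implementation):

async function executeWithRetry<T>(fn: () => Promise<T>, maxRetries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (error) {
      // Permanent errors and exhausted budgets fail immediately
      if (attempt >= maxRetries || !isRetryableError(error)) throw error
      // 429s honor Retry-After; other transient errors back off exponentially (1s, 2s, 4s, 8s, 16s)
      const baseDelay = error instanceof RateLimitError ? error.retryAfter : 2 ** attempt
      const delaySeconds = addJitter(baseDelay) // ±20% jitter prevents thundering herd
      await new Promise((resolve) => setTimeout(resolve, delaySeconds * 1000))
    }
  }
}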

📊 Dynamic Rate Limit Adjustment

  • Learning system: Adjusts rate limit based on actual 429 responses
  • Real-time adaptation: Updates queue delay when rate limits are hit
  • Proactive prevention: Prevents future 429s by learning from API
  • Frequency-aware: More aggressive buffer when hitting many 429s
  • Conservative by default: Minimizes rate limit errors over maximizing speed

🛡️ Resilient Error Handling

  • Retryable errors (with retry):
    • 429 (rate limit) - uses Retry-After header
    • 500, 502, 503, 504 (server errors)
    • Timeout errors
    • Network errors (ECONNRESET, ETIMEDOUT, etc.)
  • Non-retryable errors (fail immediately):
    • 400, 401, 403, 404 (client errors)
  • Request timeout: 60s timeout per request prevents hanging
  • Body caching: Prevents "Body already used" errors

📡 Rate Limit Headers on All Responses

Standard Headers:

  • X-RateLimit-Limit: Maximum requests per minute (based on configured rate)
  • X-RateLimit-Remaining: Requests remaining before hitting queue depth
  • X-RateLimit-Reset: Unix timestamp when rate limit window resets
  • Retry-After: Set when queue depth is high (>50), suggests client slowdown

Custom Headers:

  • X-Queue-Depth: Current number of requests waiting in queue

Benefits:

  • Clients can implement client-side backpressure (see the sketch after this list)
  • Full visibility into proxy state
  • Compatible with standard rate limit conventions
  • Proactive notification before hitting limits
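
For illustration, a hypothetical client could self-regulate off these headers (the URL and thresholds here are assumptions, not part of this PR):

async function sendWithBackpressure(url: string, init: RequestInit): Promise<Response> {
  const response = await fetch(url, init)
  const remaining = Number(response.headers.get("X-RateLimit-Remaining") ?? "1")
  const retryAfter = response.headers.get("Retry-After")
  if (retryAfter !== null) {
    // Proxy queue is deep (>50); wait the suggested number of seconds
    await new Promise((resolve) => setTimeout(resolve, Number(retryAfter) * 1000))
  } else if (remaining === 0) {
    // Rate limit window exhausted; pause briefly before the next call
    await new Promise((resolve) => setTimeout(resolve, 1000))
  }
  return response
}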

🚀 Request Queue

  • 1s default with adaptive adjustment: A sensible starting point for most use cases
  • Never rejects requests: Logs warnings at >100 queue depth, but always queues
  • Automatic queuing: Requests queued and processed with optimal spacing
  • Sequential processing: Respects learned rate limit between requests
  • Queue visibility: Exposed via X-Queue-Depth header
  • Conservative tuning: Favors reliability over throughput
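
Conceptually, the queue loop looks like this (a simplified sketch; the actual RequestQueue in src/lib/queue.ts also layers in retries, timeouts, and the adaptive adjustments described above):

class RequestQueue {
  private queue: Array<() => Promise<void>> = []
  private rateLimitSeconds = 1 // smart default; adapts between 0.1s and 60s
  private draining = false

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.queue.push(() => task().then(resolve, reject))
      if (this.queue.length > 100) console.warn(`Queue depth high: ${this.queue.length}`) // warn, never reject
      void this.drain()
    })
  }

  private async drain(): Promise<void> {
    if (this.draining) return
    this.draining = true
    while (this.queue.length > 0) {
      await this.queue.shift()!() // sequential: one request at a time
      // Respect the learned spacing before processing the next request
      await new Promise((resolve) => setTimeout(resolve, this.rateLimitSeconds * 1000))
    }
    this.draining = false
  }
}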

Implementation

Core Files

src/lib/queue.ts (enhanced)

  • Conservative parameters:
    • successThresholdToDecrease = 20 (was 10)
    • decreaseFactor = 0.95 (was 0.9 - 5% vs 10%)
    • Buffer trigger: >2 hits (was >3)
    • Buffer: 40% (was 20%)
  • trackRateLimitHit(): Tracks 429 frequency in 60s windows
  • adjustRateLimitUp(): Increases rate limit on 429s, adds 40% buffer on frequent hits
  • trackSuccessfulRequest(): Tracks successes and decreases rate limit cautiously
  • executeWithRetry(): Automatic retry logic with jitter
  • executeWithTimeout(): 60s request timeout
  • Smart default: 1 second (not 0)
  • Min: 100ms, Max: 60s

src/lib/retry.ts

  • addJitter(): Adds ±20% random jitter to delays
  • isRetryableError(): Categorizes errors (retryable vs permanent)
  • isTransientError(): Checks HTTP status codes for transience
  • parseRetryAfter(): Parses Retry-After header (seconds or HTTP date)
  • RateLimitError: Structured error with retry information
  • checkRateLimitError(): Detects 429 responses and extracts retry info
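
A plausible shape for this error type (field names here are assumptions for illustration):

class RateLimitError extends Error {
  constructor(
    message: string,
    public readonly retryAfter: number, // seconds to wait, parsed from the 429 response
    public readonly rateLimitExceeded?: string, // detail from x-ratelimit-exceeded, if present
  ) {
    super(message)
    this.name = "RateLimitError"
  }
}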

src/lib/rate-limit-headers.ts

  • addRateLimitHeaders(): Adds standard rate limit headers to responses
  • Calculates limits, remaining, reset based on queue state
  • Compatible with GitHub/Anthropic header conventions

src/lib/error.ts (enhanced)

  • Special handling for RateLimitError
  • Returns structured 429 responses with retry information
  • Includes Retry-After header for client compatibility
  • Caches error body to prevent "Body already used" errors

src/services/copilot/create-chat-completions.ts (enhanced)

  • Detects 429 responses before other errors
  • Throws RateLimitError instead of generic HTTPError
  • Caches error body for reuse in error handlers
  • Checks for transient errors for retry logic

src/routes/*/handler.ts (enhanced)

  • Adds rate limit headers to all responses
  • Non-streaming and streaming responses both include headers

src/start.ts (enhanced)

  • Updated help text to reflect 1s default
  • Only overrides default if --rate-limit is explicitly provided

Usage

# Smart 1s default with conservative adaptive adjustment (RECOMMENDED)
# Will learn optimal rate automatically, favoring reliability
copilot-api start

# Explicitly disable rate limiting (not recommended)
copilot-api start --rate-limit 0

# Start with custom rate (e.g., 5s) and adapt from there
copilot-api start --rate-limit 5

Response Headers Example

Normal operation:

HTTP/1.1 200 OK
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 58
X-RateLimit-Reset: 1704729601
X-Queue-Depth: 2
Content-Type: application/json

When queue is high (>50 requests):

HTTP/1.1 200 OK
X-RateLimit-Limit: 12
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1704729600
X-Queue-Depth: 75
Retry-After: 5
Content-Type: application/json

Behavior

Default (1s start with conservative adaptive adjustment):

  • Requests start with 1s spacing (prevents most 429s)
  • System learns optimal rate from API responses
  • Speeds up cautiously when API is consistently happy (20 successes → 5% decrease)
  • Slows down aggressively when hitting rate limits (40% buffer after 3 hits)
  • Transient errors trigger automatic retry with exponential backoff
  • 429 errors trigger retry with GitHub's Retry-After + jitter
  • Rate limit headers reflect actual limits and queue state
  • Favors reliability: stays at higher rate limits longer

With custom --rate-limit N:

  • Starts with N seconds spacing
  • Adapts up or down from there based on API behavior
  • All other features work the same

Explicitly disabled (--rate-limit 0):

  • Requests execute immediately (no queue overhead)
  • Still retries transient errors automatically
  • Still shows rate limit headers ("unlimited" state)
  • Not recommended - will hit many 429s initially

Example Conservative Adaptive Behavior:

INFO  Rate limit: waiting 1s before processing next request
WARN  Rate limit hit (attempt 1/5). Waiting 14.2s before retry...
INFO  Rate limit increased: 1.0s → 14.0s (1 hits in last minute)

... 2 more 429s in 60s window ...

DEBUG Frequent rate limits detected (3 in last minute), adding 40% buffer
INFO  Rate limit increased: 14.0s → 19.6s (3 hits in last minute)

... after 20 successful requests with no 429s ...

INFO  Rate limit decreased: 19.6s → 18.6s (20 consecutive successes)

... after 20 more successful requests ...

INFO  Rate limit decreased: 18.6s → 17.7s (20 consecutive successes)

Example Resilience (Rate Limit):

WARN  Rate limit hit (attempt 1/5). Waiting 19.2s before retry...
INFO  Rate limit increased: 17.0s → 19.0s (2 hits in last minute)
DEBUG Frequent rate limits detected (3 in last minute), adding 40% buffer
WARN  Rate limit hit (attempt 2/5). Waiting 4.3s before retry...
INFO  Retrying request after rate limit wait...
SUCCESS POST /v1/messages 200 32s (successful after 2 retries!)

Example Resilience (Transient Error):

WARN  Transient error (attempt 1/5): Request timeout after 60000ms. Waiting 1.2s before retry...
INFO  Retrying request after transient error...
SUCCESS POST /v1/messages 200 3s (successful after 1 retry!)

Benefits

  • Minimizes 429 errors: Conservative tuning reduces rate limit hits dramatically
  • Self-optimizing: Finds optimal rate automatically without manual tuning
  • Reliability first: Favors staying at safe rate limits over maximizing speed
  • Proactive by default: 1s spacing prevents most 429s before they happen
  • Learns from failures: Instantly adapts with 40% buffer when hitting rate limits
  • Cautious on success: Only speeds up by 5% after 20 consecutive successes
  • Never rejects clients: All requests are queued and processed
  • Full transparency: Rate limit headers on every response
  • Client-side backpressure: Clients can self-regulate based on headers
  • Truly autonomous: No manual intervention needed for transient failures
  • Unstoppable: Agents can work all day without rate limit blocks
  • No thundering herd: Jitter prevents synchronized retries
  • Smart retries: Only retries errors that make sense
  • Timeout protection: 60s timeout prevents hanging requests
  • Production-ready: Handles edge cases and provides fallbacks

Technical Notes

Conservative Adaptive Rate Limiting Algorithm

Increase on 429 (Aggressive):

// Track hits in 60s window
if (rateLimitHitsInWindow > 2) {  // Was >3
  // Add 40% buffer when hitting frequently (was 20%)
  adjustedRateLimit = retryAfter * 1.4  // Was 1.2
}
// Always increase, never decrease on 429
if (adjustedRateLimit > currentRateLimit) {
  currentRateLimit = adjustedRateLimit
}

Decrease on Success (Cautious):

// After 20 consecutive successes (was 10)
successCount++
if (successCount >= 20) {  // Was 10
  // Decrease by only 5% (was 10%)
  currentRateLimit = currentRateLimit * 0.95  // Was 0.9
  successCount = 0
}
// Reset counter on any failure

Rate Limit Detection

  • Parses Retry-After header (supports seconds and HTTP dates)
  • Reads x-ratelimit-user-retry-after (GitHub-specific)
  • Extracts x-ratelimit-exceeded for detailed error info
  • Falls back to 60s if no retry information provided
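
A sketch of that parsing logic under the same assumptions (the real src/lib/retry.ts may differ in detail):

function parseRetryAfter(headerValue: string | null): number {
  if (headerValue === null) return 60 // fallback when no retry info is provided
  const seconds = Number(headerValue)
  if (!Number.isNaN(seconds)) return seconds // e.g. "Retry-After: 30"
  const date = Date.parse(headerValue) // e.g. an HTTP date like "Wed, 08 Jan 2026 07:28:00 GMT"
  if (!Number.isNaN(date)) return Math.max(0, (date - Date.now()) / 1000)
  return 60
}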

Retry Strategy

  • Rate limit errors: Use Retry-After header + jitter (±20%)
  • Other transient errors: Exponential backoff + jitter (1s, 2s, 4s, 8s, 16s)
  • Max 5 retries per request
  • Dynamically adjusts queue rate limit after each 429

Jitter Implementation

// Adds ±20% randomization to prevent thundering herd
// Example: 10s delay becomes 8-12s (randomized)
const delayWithJitter = addJitter(10) // Returns 8-12s
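
One way to implement this, consistent with the ±20% behavior described (an assumed implementation, not necessarily the exact one):

function addJitter(delaySeconds: number, jitterFactor = 0.2): number {
  // Uniform random offset in [-20%, +20%] of the base delay
  const offset = delaySeconds * jitterFactor * (Math.random() * 2 - 1)
  return Math.max(0, delaySeconds + offset)
}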

Error Categorization

// Retryable: 429, 500, 502, 503, 504, timeouts, network errors
// Non-retryable: 400, 401, 403, 404, etc.
if (!isRetryableError(error)) {
  throw error // Fail immediately on permanent errors
}
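
A sketch of the categorization itself, assuming HTTPError exposes the upstream status and network errors carry a Node-style code (details are illustrative):

const RETRYABLE_STATUSES = new Set([429, 500, 502, 503, 504])
const RETRYABLE_CODES = new Set(["ECONNRESET", "ETIMEDOUT", "ECONNREFUSED"])

function isRetryableError(error: unknown): boolean {
  if (error instanceof HTTPError) return RETRYABLE_STATUSES.has(error.status) // assumed field
  if (error instanceof Error) {
    const code = (error as NodeJS.ErrnoException).code
    if (code !== undefined && RETRYABLE_CODES.has(code)) return true
    return error.message.toLowerCase().includes("timeout") // 60s request timeouts are transient
  }
  return false
}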

Rate Limit Headers Calculation

// X-RateLimit-Limit: requests per minute
const limit = Math.floor(60 / rateLimitSeconds) // 20s rate = 3 req/min

// X-RateLimit-Remaining: based on queue depth
const remaining = Math.max(0, limit - queueSize)

// X-RateLimit-Reset: when current window expires
const resetTime = lastProcessedTime + rateLimitSeconds

Breaking Changes

  • Changed default from disabled (0) to 1s adaptive: Previously no rate limiting by default, now starts with 1s and adapts conservatively. Use --rate-limit 0 to explicitly disable.

Test Plan

  • Manual testing with various rate limit values
  • Verification of automatic retry on 429 errors
  • Verification of jitter in retry delays
  • Verification of exponential backoff for transient errors
  • Verification of timeout handling
  • Verification of error categorization (retryable vs permanent)
  • Verification of dynamic rate limit adjustment (increase with buffer)
  • Verification of adaptive rate limit decrease on successes
  • Verification of conservative tuning (40% buffer, 5% decrease, 20 success threshold)
  • Verification of frequency-based buffer (>2 hits triggers 40% buffer)
  • Verification of 1s default rate limit at startup
  • Testing with multiple concurrent requests
  • Verification of rate limit headers in responses
  • Verification that system learns optimal rate automatically
  • Verification that system minimizes 429 errors with conservative tuning
  • Type checking passes
  • Linting passes
  • Live testing with real rate limit errors from GitHub API
  • Verification that no client requests are rejected
  • Verification that system favors reliability over speed
  • Verification of HTTPError body caching fix
  • Production testing showing ~64% rate limit errors → conservative tuning applied

Implements a RequestQueue class that manages API requests with configurable
rate limiting. The queue automatically processes requests at the specified
interval, preventing rate limit errors while ensuring all requests are
eventually fulfilled.

Key features:
- Automatic request queuing when rate limit is configured
- Sequential processing with configurable delays
- Detailed logging of queue status and wait times
- Zero overhead when rate limiting is disabled

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>

Updates the rate limiting system to use the new RequestQueue for better
handling of concurrent requests. Instead of rejecting or blocking requests
that exceed the rate limit, they are now automatically queued and processed
at the configured interval.

Changes:
- Add requestQueue to global state
- Introduce executeWithRateLimit() wrapper function
- Update chat-completions and messages handlers to use queue
- Initialize queue with configured rate limit on server startup
- Add eslint exception for state assignment race condition

The old checkRateLimit() function is kept for backwards compatibility
but marked as deprecated.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>

Add utility module to parse rate limit headers from API responses.
Supports multiple header formats:
- X-RateLimit-* (GitHub style)
- RateLimit-* (RFC draft)
- Retry-After (for 429 responses)

Implements even distribution strategy to calculate optimal delay
based on remaining requests and reset time.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>

Add optional onHeaders callback parameter to createChatCompletions
service to allow capturing response headers before processing the
response body. Works for both streaming and non-streaming responses.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>

Integrate rate limit header parsing in chat completions and messages
handlers. The system now:
- Parses rate limit headers from API responses
- Calculates optimal delay using even distribution
- Dynamically updates request queue rate limit
- Falls back to configured rate limit when headers absent

This enables automatic adaptation to API rate limits and helps
prevent abuse detection while maximizing throughput.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>

@leocavalcante changed the title from "feat: add request queue for better rate limiting" to "feat: add request queue with adaptive rate limiting" on Jan 8, 2026

Add unit tests covering all rate limit header formats and delay calculation logic:
- X-RateLimit-* (GitHub/Copilot style)
- RateLimit-* (RFC draft format)
- Retry-After header (seconds and HTTP date)
- Header priority and fallback behavior
- Delay calculation with various scenarios

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>

Update comments to specify that X-RateLimit-* headers are in GitHub/Copilot style,
since this proxy only calls the GitHub Copilot API.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>

Remove the default value of 3 seconds for --rate-limit flag to ensure
rate limiting is only active when explicitly requested by the user.
This allows requests to execute immediately without queuing when the
flag is not provided.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>

@leocavalcante changed the title from "feat: add request queue with adaptive rate limiting" to "feat: add opt-in request queue with adaptive rate limiting" on Jan 8, 2026

Remove adaptive rate limiting since GitHub Copilot API does not provide
rate limit headers. The API only returns x-quota-snapshot-* headers which
track quota usage, not rate limits, and overage is permitted freely.

Removed:
- src/lib/rate-limit-parser.ts
- tests/rate-limit-parser.test.ts
- onHeaders callback from createChatCompletions
- Rate limit header parsing logic from handlers

The opt-in request queue remains functional for users who want to set
a fixed rate limit via --rate-limit flag.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>

@leocavalcante changed the title from "feat: add opt-in request queue with adaptive rate limiting" to "feat: add opt-in request queue for rate limiting" on Jan 8, 2026

Implement comprehensive rate limit resilience to make the API proxy
unstoppable for AI agents running autonomously.

Features:
- Parse Retry-After header from 429 responses (supports seconds and HTTP dates)
- Automatic retry with exponential backoff (up to 5 retries)
- Dynamic rate limit adjustment based on API responses
- Enhanced error messages with retry information
- Works with and without --rate-limit flag

Implementation:
- New RateLimitError class with retry information
- parseRetryAfter() handles GitHub's retry headers
- RequestQueue.executeWithRetry() handles automatic retries
- Queue adjusts rate limit dynamically when 429s occur
- forwardError() returns structured 429 responses with Retry-After

Benefits:
- No manual intervention needed for rate limit errors
- Agents can work autonomously all day long
- Learns and adapts to API rate limits in real-time
- Never drops requests (retries up to 5 times)
- Clear logging shows retry attempts and wait times

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>

@leocavalcante changed the title from "feat: add opt-in request queue for rate limiting" to "feat: add resilient rate limiting with automatic retry" on Jan 8, 2026

High-impact improvements for production resilience:

1. Jitter for Retry Delays:
   - Adds ±20% random jitter to all retry delays
   - Prevents thundering herd when many requests retry simultaneously
   - Applies to both rate limit retries and exponential backoff

2. Request Timeout:
   - 60-second timeout per request to prevent hanging
   - Timeout errors are automatically retried (transient)
   - Protects against unresponsive upstream API

3. Queue Backpressure Warning (NOT rejection):
   - Logs warning when queue depth exceeds 100 requests
   - NEVER rejects client requests - queues them all
   - Allows API proxy to handle any volume gracefully

4. Better Error Categorization:
   - Retries transient errors: 429, 500, 502, 503, 504, timeouts, network errors
   - Fails immediately on permanent errors: 400, 401, 403, 404
   - Uses exponential backoff with jitter for non-429 retries (1s, 2s, 4s, 8s, 16s)
   - Smart detection of HTTPError status codes

5. Rate Limit Headers on All Responses:
   - X-RateLimit-Limit: Maximum requests per minute
   - X-RateLimit-Remaining: Requests remaining before rate limit
   - X-RateLimit-Reset: Unix timestamp when rate limit resets
   - X-Queue-Depth: Current queue size for visibility
   - Retry-After: Set when queue depth is high (>50 requests)

Benefits:
- Clients get proactive rate limit information
- No client requests are ever rejected
- Better distributed retry attempts (jitter)
- Faster failure on permanent errors
- Automatic recovery from transient failures
- Full transparency into API proxy state

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>

Store error body text when creating HTTPError to avoid consuming
Response body twice. The body can only be read once, so we cache
it during initial error logging and reuse it in forwardError.

This fixes crashes when handling non-retryable errors like 499
(client canceled request).

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>

Adds intelligent rate limiting that learns from both successes and failures:

**Adaptive Increase (on 429s):**
- Tracks rate limit hits in 60s windows
- Adds 20% buffer when >3 hits/minute
- Adjusts to GitHub's Retry-After + buffer

**Adaptive Decrease (on successes):**
- Tracks consecutive successful requests
- Decreases rate limit by 10% after 10 successes
- Speeds up when API allows it

**Smart Default:**
- Changed from 0 (disabled) to 1s (adaptive enabled)
- Use --rate-limit 0 to explicitly disable
- Minimum: 100ms, Maximum: 60s

**Frequency-Based Adjustment:**
- More conservative when hitting many 429s
- Gradually speeds up when API is happy
- Prevents over-aggressive rate limiting

This reduces 429 responses while maximizing throughput automatically.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>

Only call updateRateLimit() if --rate-limit flag is explicitly provided.
This allows the RequestQueue constructor's default of 1s to take effect
for adaptive rate limiting.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>

@leocavalcante changed the title from "feat: add resilient rate limiting with automatic retry" to "feat: add self-optimizing adaptive rate limiting with automatic retry" on Jan 8, 2026

Makes the system much more conservative to reduce 429 errors:

**Slower Decrease (Speed Up Less Aggressively):**
- Increase success threshold: 10 → 20 requests
- Decrease factor: 10% (0.9) → 5% (0.95)
- Now requires 20 consecutive successes before speeding up by only 5%

**Faster Increase with Buffer (Slow Down More Aggressively):**
- Lower buffer trigger: >3 hits → >2 hits per minute
- Increase buffer: 20% → 40%
- Applies 40% buffer after just 3 rate limit hits in 60s window

**Impact:**
- Reduces 429 errors significantly
- Stays at higher rate limits longer
- More cautious when speeding up
- More aggressive when hitting rate limits

This should dramatically reduce the ~64% rate limit error rate observed
in production.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>

@leocavalcante (Author) commented:

⚙️ Conservative Tuning Applied

Based on production testing showing ~64% rate limit errors, the adaptive algorithm has been made significantly more conservative:

Changes in commit 5a47bff:

Slower Speed-Up (Decrease Rate Limit)

  • Success threshold: 10 → 20 requests
    • Now requires 20 consecutive successes before speeding up
  • Decrease factor: 10% → 5%
    • Only speeds up by 5% instead of 10% each time

Faster Slow-Down (Increase Rate Limit)

  • Buffer trigger: >3 hits → >2 hits per minute
    • Applies buffer after just 3 rate limit hits (was 4)
  • Buffer percentage: 20% → 40%
    • Adds 40% extra time instead of 20% when hitting frequent 429s

Expected Impact:

  • Dramatically fewer 429 errors
  • System stays at higher (slower) rate limits longer
  • More cautious when attempting to speed up
  • More aggressive when detecting rate limit pressure

The system will now favor reliability over throughput optimization.

Implements two major optimizations to balance speed and reliability:

**1. Adaptive Decrease Strategy (Smarter Initial Rate Discovery)**
- When far from limit (>10s): 10% decrease after 10 successes
- When medium distance (2-10s): 7% decrease after 15 successes
- When close to limit (<2s): 5% decrease after 20 successes (cautious)

Impact: Converges to optimal rate much faster (3-4x improvement)
Example: 20s → 18s → 16.2s → 14.6s (instead of 20s → 19s → 18.1s...)

**2. Request Deduplication/Caching**
- In-memory cache with 30s TTL, max 1000 entries
- SHA-256 hash of request payload as cache key
- Only caches non-streaming responses
- Reduces GitHub API calls for identical requests
- Automatic cleanup of expired entries

Impact: Dramatically reduces API calls for duplicate requests
Example: count_tokens requests, repeated messages

**Benefits:**
- Faster convergence from high rate limits (20s → ~10s)
- Reduced GitHub API usage (fewer 429s, lower quota consumption)
- Better client experience (faster responses for cached requests)
- Still maintains conservative approach near actual limits

**Implementation:**
- Created RequestCache class with get/set/cleanup methods
- Integrated cache into both /messages and /chat-completions handlers
- Cache only used for non-streaming to keep implementation simple
- Cache returns null if entry expired or not found

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>

@leocavalcante (Author) commented:

🚀 Speed & Efficiency Improvements

Added two major optimizations in commit 2339859:

1. Adaptive Decrease Strategy (Faster Convergence)

Instead of fixed 5% decrease after 20 successes, the system now adapts based on distance from limit:

Distance from Limit    Success Threshold    Decrease %    Result
>10s (far)             10 successes         10%           Aggressive speed-up
2-10s (medium)         15 successes         7%            Moderate speed-up
<2s (close)            20 successes         5%            Cautious (unchanged)

Impact:

  • Converges 3-4x faster from high rate limits
  • Example: 20s → 18s → 16.2s → 14.6s (instead of 20s → 19s → 18.1s...)
  • Still maintains caution when close to actual limits

2. Request Caching (Reduced API Calls)

Simple in-memory cache for duplicate requests:

  • TTL: 30 seconds
  • Max size: 1000 entries
  • Key: SHA-256 hash of request payload
  • Scope: Non-streaming responses only

Impact:

  • Dramatically reduces GitHub API calls for identical requests
  • Common scenarios: count_tokens requests, repeated messages, retries
  • Faster response times for cached requests (no GitHub API call)
  • Lower quota consumption

Combined benefit: Faster optimization + fewer API calls = better experience for proxy clients while maintaining low error rate.
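
A minimal sketch of such a cache (illustrative; assumes Node's built-in node:crypto for the SHA-256 key and a simple oldest-entry eviction):

import { createHash } from "node:crypto"

class RequestCache {
  private entries = new Map<string, { body: string; expiresAt: number }>()
  constructor(private ttlMs = 30_000, private maxSize = 1000) {}

  key(payload: unknown): string {
    // SHA-256 hash of the request payload identifies duplicates
    return createHash("sha256").update(JSON.stringify(payload)).digest("hex")
  }

  get(key: string): string | null {
    const entry = this.entries.get(key)
    if (!entry || entry.expiresAt < Date.now()) {
      this.entries.delete(key) // drop expired entries lazily
      return null
    }
    return entry.body
  }

  set(key: string, body: string): void {
    if (this.entries.size >= this.maxSize) {
      // Evict the oldest insertion to stay under the size cap
      const oldest = this.entries.keys().next().value
      if (oldest !== undefined) this.entries.delete(oldest)
    }
    this.entries.set(key, { body, expiresAt: Date.now() + this.ttlMs })
  }
}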

Combination approach to minimize 429 errors:
- Always add buffer on every rate limit hit (no more bare minimum)
- 1st hit: +25% buffer
- 2+ hits: +50% buffer
- 3+ hits: +75% buffer

This addresses the issue of hitting multiple 429s in succession by being
immediately conservative on the first rate limit, then increasingly
cautious if we continue to hit limits.

Signed-off-by: leocavalcante <leonardo.cavalcante@picpay.com>

@leocavalcante closed this by deleting the head repository on Jan 8, 2026