How to build a large-scale real-time and batch processing system for OpenAI workloads? #2073

prdai · 2025-02-02T08:22:53Z

prdai
Feb 2, 2025

Hi,

I'm trying to figure out how to design a system that can manage real-time and batch processing for OpenAI applications. What architectures, tools, and best practices have you used to nail high performance, fault tolerance, and reliability?

If anyone has any help or advice you could give, I would appreciate it :)

Thank you

Answered by manimovassagh

Mar 20, 2026

Here's a practical architecture that covers both real-time and batch workloads with OpenAI:

Real-time path (low latency, user-facing)

Client -> API Gateway -> Worker Pool -> OpenAI API
                              |
                          Redis (cache + rate limit tracking)

Use AsyncOpenAI with connection pooling for concurrent requests
Implement a token bucket rate limiter in Redis to stay within API limits
Cache identical prompts (hash the messages array) to avoid redundant calls
Set aggressive timeouts and retry with exponential backoff

import asyncio
from openai import AsyncOpenAI
from tenacity import retry, wait_exponential, stop_after_attempt

client = AsyncOpenAI(max_retries=3…

View full answer

manimovassagh · 2026-03-20T22:45:33Z

manimovassagh
Mar 20, 2026

Here's a practical architecture that covers both real-time and batch workloads with OpenAI:

Real-time path (low latency, user-facing)

Client -> API Gateway -> Worker Pool -> OpenAI API
                              |
                          Redis (cache + rate limit tracking)

Use AsyncOpenAI with connection pooling for concurrent requests
Implement a token bucket rate limiter in Redis to stay within API limits
Cache identical prompts (hash the messages array) to avoid redundant calls
Set aggressive timeouts and retry with exponential backoff

import asyncio
from openai import AsyncOpenAI
from tenacity import retry, wait_exponential, stop_after_attempt

client = AsyncOpenAI(max_retries=3, timeout=30.0)

@retry(wait=wait_exponential(min=1, max=10), stop=stop_after_attempt(3))
async def complete(messages):
    return await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )

Batch path (high throughput, background jobs)

Use OpenAI's native Batch API for 50% cost savings:

import json
from openai import OpenAI

client = OpenAI()

# 1. Create a JSONL file with your requests
requests = [
    {"custom_id": f"req-{i}", "method": "POST", "url": "/v1/chat/completions",
     "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": prompt}]}}
    for i, prompt in enumerate(prompts)
]

with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# 2. Upload and create batch
input_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(input_file_id=input_file.id, endpoint="/v1/chat/completions", completion_window="24h")

# 3. Poll for completion
import time
while batch.status not in ("completed", "failed", "expired"):
    time.sleep(60)
    batch = client.batches.retrieve(batch.id)

Fault tolerance essentials:

Dead letter queue: failed requests go to a DLQ (SQS, RabbitMQ, Redis streams) for manual retry
Idempotency keys: deduplicate requests if your workers restart mid-batch
Circuit breaker: if OpenAI returns 5xx errors repeatedly, stop sending requests temporarily
Observability: log request IDs, latencies, token usage, and error rates. OpenAI returns x-request-id in every response header

Queue-based architecture for mixed workloads:

Producers -> Message Queue (Redis Streams / SQS)
                    |
            +-------+-------+
            |               |
      RT Workers      Batch Aggregator
      (async, <2s)    (collects -> Batch API)
            |               |
        Response          Results File
        Cache             (parse & store)

Route requests by urgency: anything user-facing goes through the real-time workers, everything else gets batched.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to build a large-scale real-time and batch processing system for OpenAI workloads? #2073

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

How to build a large-scale real-time and batch processing system for OpenAI workloads? #2073

Uh oh!

prdai Feb 2, 2025

Replies: 1 comment

Uh oh!

manimovassagh Mar 20, 2026

prdai
Feb 2, 2025

manimovassagh
Mar 20, 2026