llm proxy: add --disable-streaming flag to force stream:false for clients that don't handle SSE #5213

@yrobla

Summary

thv llm proxy start cannot be used as a drop-in backend for Gemini CLI because Gemini CLI sends requests in native Gemini API format, which is incompatible with the AI gateway's OpenAI-compatible endpoint.

Root Cause (discovered via debugging)

When GOOGLE_GEMINI_BASE_URL is set, Gemini CLI sends requests to:

POST /v1beta/models/gemini-2.5-flash-lite:generateContent
{"contents":[{"parts":[{"text":"..."}]}]}

The AI gateway only accepts:

POST /v1/chat/completions
{"model":"...","messages":[{"role":"user","content":"..."}]}

These are fundamentally different protocols. The proxy cannot bridge this gap without implementing a full Gemini↔OpenAI request/response translation layer.
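To make the mismatch concrete, here is a minimal sketch of the request half of such a translation, covering only the text-only body shown above. The helper name GeminiToChat and the struct types are hypothetical and are not part of thv. The role mapping ("model" to "assistant") is an assumption about how a bridge would treat multi-turn history. Responses, streaming, errors, and tool calls are not handled.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// geminiRequest models only the text-only subset of a generateContent body.
type geminiRequest struct {
	Contents []struct {
		Role  string `json:"role,omitempty"`
		Parts []struct {
			Text string `json:"text"`
		} `json:"parts"`
	} `json:"contents"`
}

// chatMessage and chatRequest model the minimal chat completions body.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

// GeminiToChat (hypothetical helper) converts a text-only generateContent
// body into a chat completions body. The model comes from the URL path in
// the Gemini format, so it is passed in separately.
func GeminiToChat(model string, body []byte) ([]byte, error) {
	var in geminiRequest
	if err := json.Unmarshal(body, &in); err != nil {
		return nil, fmt.Errorf("decode gemini request: %w", err)
	}
	out := chatRequest{Model: model, Messages: make([]chatMessage, 0, len(in.Contents))}
	for _, c := range in.Contents {
		// Assumption: a missing role is treated as "user"; "model" maps to "assistant".
		role := "user"
		if c.Role == "model" {
			role = "assistant"
		}
		texts := make([]string, 0, len(c.Parts))
		for _, p := range c.Parts {
			texts = append(texts, p.Text)
		}
		out.Messages = append(out.Messages, chatMessage{Role: role, Content: strings.Join(texts, "")})
	}
	return json.Marshal(out)
}
```

Even this small case has to move the model name from the URL path into the body, which is one reason a path rewrite alone cannot work.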

What was investigated

During debugging, several proxy-level fixes were attempted and partially implemented:

  • Path rewriting: /v1beta/openai/... → /v1/... (works for some requests but not the main chat path)
  • --disable-streaming flag: strips or replaces the data: [DONE] sentinel in SSE responses (a valid fix for OpenAI-compatible clients that mishandle SSE, but it does not address the protocol mismatch)
  • Forcing stream:false for requests without an explicit stream field (valid for OpenAI-compatible clients sending non-streaming requests)
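The third item can be sketched as a small body rewrite. This is an illustration only, not the code written during the investigation: the function name ForceNonStreaming is hypothetical, and it assumes the request body is a JSON object. A body that already carries a stream field is returned unchanged.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ForceNonStreaming (hypothetical helper) adds "stream": false to a JSON
// request body that has no explicit "stream" field. Bodies that already set
// the field, to either value, are returned untouched.
func ForceNonStreaming(body []byte) ([]byte, error) {
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(body, &fields); err != nil {
		return nil, fmt.Errorf("decode request body: %w", err)
	}
	if fields == nil {
		// A literal JSON null decodes without error but is not an object.
		return nil, fmt.Errorf("request body is not a JSON object")
	}
	if _, ok := fields["stream"]; ok {
		return body, nil
	}
	fields["stream"] = json.RawMessage("false")
	// Note: re-marshalling a map emits keys in sorted order.
	return json.Marshal(fields)
}
```

Decoding into json.RawMessage values leaves every other field byte-for-byte as the client sent it, apart from key order and whitespace.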

The [DONE] JSON parse error seen in earlier testing was a symptom of a different request path (generateJson in NumericalClassifierStrategy), not the main chat stream.

Real fix options

  1. Gateway level (preferred): The AI gateway should accept native Gemini API paths (/v1beta/models/{model}:generateContent) and translate them to its backend format. This makes the gateway a true drop-in replacement for generativelanguage.googleapis.com.

  2. Translation layer in proxy: Add a full Gemini↔OpenAI translation layer to thv llm proxy. This is significantly complex: the request schema, response schema, streaming format, error format, and tool call format all differ.

  3. Gemini CLI configuration: If Gemini CLI supports an OpenAI-compatible mode (some versions do via a different URL/auth configuration), configure it to send OpenAI-format requests instead of native Gemini format.
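Whichever component does the translation (options 1 and 2), its first step is recognizing the native path and pulling out the model name. A minimal sketch, assuming the path shape /v1beta/models/{model}:{method} quoted above; ParseGeminiPath is a hypothetical name, and neither the gateway nor thv is known to expose such a function.

```go
package main

import (
	"fmt"
	"strings"
)

// ParseGeminiPath (hypothetical helper) splits a native Gemini path of the
// form "/v1beta/models/{model}:{method}" into its model and method parts.
func ParseGeminiPath(path string) (model, method string, err error) {
	const prefix = "/v1beta/models/"
	if !strings.HasPrefix(path, prefix) {
		return "", "", fmt.Errorf("not a Gemini models path: %q", path)
	}
	rest := strings.TrimPrefix(path, prefix)
	model, method, ok := strings.Cut(rest, ":")
	if !ok || model == "" || method == "" {
		return "", "", fmt.Errorf("missing model or method in %q", path)
	}
	return model, method, nil
}
```

For the request in this issue, the function would yield the model gemini-2.5-flash-lite and the method generateContent. OpenAI-style paths such as /v1/chat/completions return an error and can be passed through unchanged.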

Related

  • The --disable-streaming flag implemented during this investigation is still useful for OpenAI-compatible clients (e.g. Cursor, VS Code extensions) that don't handle SSE [DONE] correctly. That work should be kept.


Labels: enhancement, go, proxy
