llm proxy: add --disable-streaming flag to force stream:false for clients that don't handle SSE #5213

## Summary

`thv llm proxy start` cannot be used as a drop-in backend for Gemini CLI because Gemini CLI sends requests in **native Gemini API format**, which is incompatible with the AI gateway's OpenAI-compatible endpoint.

## Root Cause (discovered via debugging)

When `GOOGLE_GEMINI_BASE_URL` is set, Gemini CLI sends requests to:
```
POST /v1beta/models/gemini-2.5-flash-lite:generateContent
{"contents":[{"parts":[{"text":"..."}]}]}
```

The AI gateway only accepts:
```
POST /v1/chat/completions
{"model":"...","messages":[{"role":"user","content":"..."}]}
```

These are different protocols: the path and the request schema shown above differ, and so do the response formats. The proxy cannot bridge this gap without implementing a full Gemini↔OpenAI request/response translation layer.
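To make the mismatch concrete, here is a minimal sketch of the request half of such a translation, covering text-only content. The helper name `gemini_to_openai` and the path regex are hypothetical and not part of `thv`. The `"model"` to `"assistant"` role mapping reflects the public API shapes as I understand them. Tool calls, system instructions, generation config, and streaming are deliberately left out.

```python
import re

# Path shape taken from the request shown above; the regex itself is an assumption.
_GENERATE_CONTENT = re.compile(r"^/v1beta/models/(?P<model>[^/:]+):generateContent$")


def gemini_to_openai(path: str, body: dict) -> tuple[str, dict]:
    """Translate a text-only Gemini generateContent request (hypothetical helper)."""
    match = _GENERATE_CONTENT.match(path)
    if match is None:
        raise ValueError(f"not a generateContent path: {path}")

    messages = []
    for content in body.get("contents", []):
        # Gemini calls the assistant role "model"; a missing role is treated as "user".
        role = "assistant" if content.get("role") == "model" else "user"
        # Concatenate the text parts; non-text parts are ignored in this sketch.
        text = "".join(part.get("text", "") for part in content.get("parts", []))
        messages.append({"role": role, "content": text})

    # The model moves from the URL path into the JSON body.
    return "/v1/chat/completions", {"model": match.group("model"), "messages": messages}
```

Even this reduced case has to move the model name from the URL into the body and restructure every message, which is why a simple path rewrite is not enough.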

## What was investigated

During debugging, several proxy-level fixes were attempted and partially implemented:
- Path rewriting: `/v1beta/openai/...` → `/v1/...` (works for some requests but not the main chat path)
- `--disable-streaming` flag: strips or replaces the `data: [DONE]` sentinel in SSE responses (a valid fix for OpenAI-compatible clients that mishandle SSE, but it does not address the protocol mismatch)
- Forcing `stream:false` for requests without an explicit `stream` field (valid for OpenAI-compatible clients sending non-streaming requests)
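The path rewrite and the `stream:false` default from the list above can be sketched as two small functions. This is an illustration of the idea only, not the proxy's actual code: `rewrite_path` and `force_non_streaming` are hypothetical names, and the assumption is that the request body is a JSON object.

```python
import json


def rewrite_path(path: str) -> str:
    """Map /v1beta/openai/... to /v1/...; leave every other path untouched."""
    prefix = "/v1beta/openai/"
    if path.startswith(prefix):
        return "/v1/" + path[len(prefix):]
    return path


def force_non_streaming(raw_body: bytes) -> bytes:
    """Set stream:false only when the client did not send a stream field."""
    payload = json.loads(raw_body)
    if isinstance(payload, dict) and "stream" not in payload:
        payload["stream"] = False
    return json.dumps(payload).encode("utf-8")
```

Note that `rewrite_path` leaves `/v1beta/models/...:generateContent` unchanged, which matches the observation that path rewriting does not help on the main chat path.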

The `[DONE]` JSON parse error seen in earlier testing was a symptom of a different request path (`generateJson` in `NumericalClassifierStrategy`), not the main chat stream.
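For reference, the sketch below shows why a client that feeds every `data:` line to a JSON parser fails on the sentinel, and how dropping the sentinel avoids it. Both function names are hypothetical, and this is not the code path in Gemini CLI or in the proxy. It assumes the stream has already been decoded to text.

```python
import json


def parse_sse_payloads(sse_text: str) -> list:
    """Naive client: treat every data: line as JSON, with no [DONE] special case."""
    payloads = []
    for line in sse_text.splitlines():
        if line.startswith("data: "):
            # json.loads("[DONE]") raises JSONDecodeError: DONE is not a JSON value.
            payloads.append(json.loads(line[len("data: "):]))
    return payloads


def strip_done_sentinel(sse_text: str) -> str:
    """Drop the data: [DONE] line so naive clients only see JSON payloads."""
    kept = [line for line in sse_text.splitlines() if line.strip() != "data: [DONE]"]
    return "\n".join(kept)
```

Calling `parse_sse_payloads` on a raw stream that ends in `data: [DONE]` raises `json.JSONDecodeError`, while calling it on the output of `strip_done_sentinel` returns only the JSON events.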

## Real fix options

1. **Gateway level (preferred)**: The AI gateway should accept native Gemini API paths (`/v1beta/models/{model}:generateContent`) and translate them to its backend format. This makes the gateway a true drop-in replacement for `generativelanguage.googleapis.com`.

2. **Translation layer in proxy**: Add a full Gemini↔OpenAI translation layer to `thv llm proxy`. This adds significant complexity, because the request schema, response schema, streaming format, error format, and tool call format all differ.

3. **Gemini CLI configuration**: If Gemini CLI supports an OpenAI-compatible mode (some versions do via a different URL/auth configuration), configure it to send OpenAI-format requests instead of native Gemini format.
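Whichever of options 1 or 2 is chosen, the response also has to be translated back. The sketch below covers only a non-streaming, text-only completion. The helper name is hypothetical, and the field names (`candidates`, `finishReason`, `choices`, `finish_reason`) reflect the public API shapes as I understand them, so they should be checked against the current API references before use.

```python
# Partial mapping, assumed for illustration; anything unknown falls back to "OTHER".
_FINISH_REASONS = {"stop": "STOP", "length": "MAX_TOKENS"}


def openai_to_gemini_response(response: dict) -> dict:
    """Translate a non-streaming chat completion into a Gemini-style response."""
    candidates = []
    for choice in response.get("choices", []):
        # content can be null (for example on tool calls), so fall back to "".
        text = (choice.get("message") or {}).get("content") or ""
        candidates.append({
            "content": {"role": "model", "parts": [{"text": text}]},
            "finishReason": _FINISH_REASONS.get(choice.get("finish_reason"), "OTHER"),
            "index": choice.get("index", 0),
        })
    return {"candidates": candidates}
```

Streaming, error bodies, usage metadata, and tool calls each need their own mapping, which is where most of the complexity noted in option 2 comes from.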

## Related

- The `--disable-streaming` flag implemented during this investigation is still useful for OpenAI-compatible clients (e.g. Cursor, VS Code extensions) that don't handle SSE `[DONE]` correctly. That work should be kept.
