feat(plugins-google): add cached_content option for explicit context caching #5661

Closed
kamil-bidus wants to merge 1 commit into livekit:main from kamil-bidus:kamdibus/gemini-cached-content-support

Conversation


@kamil-bidus kamil-bidus commented May 6, 2026

Motivation

The Gemini plugin's LLM class supports many GenerateContentConfig options (thinking_config, retrieval_config, safety_settings, etc.) but not cached_content. The plugin already reads cached_content_token_count from response usage in LLMStream._parse_part, so cache hits surface in metrics — there's just no way to attach a CachedContent resource to outgoing requests.

For voice-agent workloads on Gemini 3 Flash with ~6 KB system prompts, implicit caching is unreliable: in a 100-call/day deployment only ~3% of turn-1 requests pick up cached tokens despite firing a same-prefix warmup before the user's first utterance. This matches the broader user reports in #2359 ("Gemini Implicit Caching is still broken - I tested through the gemini API"). Explicit context caching is the documented alternative: create a CachedContent once, reference it by name on every generateContent call. Prefix tokens are processed in under 100 ms and billed at a discount.

Change

Add cached_content: NotGivenOr[str] = NOT_GIVEN to LLM.__init__. Standard propagation pattern (a usage sketch follows the list):

  1. _LLMOptions dataclass field
  2. __init__ parameter and docstring
  3. Pass-through to _LLMOptions(...)
  4. is_given(...) check in chat() propagating into extra["cached_content"]
  5. Reaches GenerateContentConfig via **self._extra_kwargs
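
Viewed from the caller's side, the end result of that propagation is roughly the following (a minimal usage sketch; the model and cache names are placeholders, and the keyword is the one this PR adds):

```python
from livekit.plugins import google

# Minimal usage sketch assuming this PR: the string is stored on _LLMOptions and
# forwarded into GenerateContentConfig via _extra_kwargs on every chat() call.
llm = google.LLM(
    model="gemini-2.0-flash",                    # placeholder model name
    cached_content="cachedContents/1234567890",  # name returned by client.caches.create(...)
)
```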

End-to-end usability — request-side suppression

Gemini's API rejects generateContent requests that pass cached_content together with system_instruction, tools, or tool_config — those fields belong inside the CachedContent resource. The exact server response on the conflict:

"CachedContent can not be used with GenerateContent request setting system_instruction, tools or tool_config. Proposed fix: move those values to CachedContent from GenerateContent request."

Without handling that, exposing the parameter would still 400 on any realistic agent — anyone with a system prompt or function tools (i.e. the plugin's primary user base) couldn't actually use the new option. So LLMStream._run now strips `system_instruction`, `tools`, and `tool_config` from the outgoing request whenever `cached_content` is attached. Behaviour is unchanged for callers that don't set `cached_content`: gating is strictly `is-given` on that one option.
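
As a standalone sketch (not the actual `LLMStream._run` diff, where these fields are assembled rather than popped from a dict), the rule amounts to:

```python
from google.genai import types

def build_request_config(extra: dict, cached_content: str | None) -> types.GenerateContentConfig:
    """Illustrative helper only: mirror the request-side suppression rule."""
    cfg = dict(extra)
    if cached_content is not None:
        cfg["cached_content"] = cached_content
        # Gemini rejects these alongside cached_content; they must live inside the
        # CachedContent resource instead, so keep them off the per-request config.
        for field in ("system_instruction", "tools", "tool_config"):
            cfg.pop(field, None)
    return types.GenerateContentConfig(**cfg)
```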

Cache lifecycle (creation via `client.caches.create(...)`, TTL refresh, deletion) and the choice of what to bake into the cache stay the application's responsibility. The docstring spells out the contract: the cache resource must contain whichever of `system_instruction` / `tools` the model needs, since the plugin will keep them off the request.
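
For reference, the application-side half of that contract could look like this with the `google-genai` SDK (values are placeholders; a real cache also has to satisfy the API's minimum cached-token count):

```python
from google import genai
from google.genai import types
from livekit.plugins import google as google_plugin

client = genai.Client()

# The cache carries the system prompt (and tool declarations, if any), since the
# plugin keeps those fields off every generateContent request that references it.
cache = client.caches.create(
    model="gemini-2.0-flash",  # placeholder model
    config=types.CreateCachedContentConfig(
        system_instruction="You are a friendly voice agent ...",
        ttl="3600s",  # refreshing or deleting the cache stays the app's job
    ),
)

llm = google_plugin.LLM(model="gemini-2.0-flash", cached_content=cache.name)
```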

Compatibility

Default `NOT_GIVEN` keeps existing behavior unchanged. Verified by `test_cached_content_omitted_when_not_set` and `test_request_includes_system_instruction_and_tools_when_no_cache`: when the parameter isn't passed, the field is absent from `_extra_kwargs` and the outgoing request still carries `system_instruction` and `tools` exactly as before.

Works with both Gemini Developer API (`cachedContents/{id}`) and Vertex AI (`projects/{p}/locations/{l}/cachedContents/{id}`). The plugin passes the string through unmodified; format validation is the SDK's responsibility.

Tests

`tests/test_plugin_google_llm.py` — 6 cases covering both halves:

Propagation (3):

  • `test_cached_content_propagates_to_extra_kwargs` — set on init, observed in stream `_extra_kwargs`
  • `test_cached_content_omitted_when_not_set` — default `NOT_GIVEN` produces no key
  • `test_cached_content_stored_on_opts` — `_LLMOptions.cached_content` round-trips

Request-side suppression (3) — patch `client.aio.models.generate_content_stream` and capture the `GenerateContentConfig` actually received (a skeleton of this pattern follows the list):

  • `test_request_omits_system_instruction_when_cached_content_set` — config arrives with `system_instruction=None` even though the chat context contains a system message
  • `test_request_omits_tools_when_cached_content_set` — `tools` and `tool_config` absent even when the stream is constructed with a function tool
  • `test_request_includes_system_instruction_and_tools_when_no_cache` — backward-compat: without the option, both fields propagate as before
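
A heavily abbreviated skeleton of that capture pattern (paraphrased, not the test file verbatim; the `_client` attribute name and the turn-driving code are assumptions):

```python
from unittest.mock import AsyncMock, patch

from livekit.plugins import google

llm = google.LLM(model="gemini-2.0-flash", cached_content="cachedContents/abc")
with patch.object(llm._client.aio.models, "generate_content_stream", new=AsyncMock()) as fake:
    ...  # drive one chat() turn whose chat context contains a system message
config = fake.call_args.kwargs["config"]  # the GenerateContentConfig the SDK received
assert config.cached_content == "cachedContents/abc"
assert config.system_instruction is None  # stripped because cached_content is set
assert config.tools is None and config.tool_config is None
```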

All existing google-plugin tests still pass. `ruff check` / `ruff format` clean.

Refs

#2359 ("Gemini Implicit Caching is still broken - I tested through the gemini API")

@CLAassistant

CLAassistant commented May 6, 2026

CLA assistant check
All committers have signed the CLA.


@devin-ai-integration devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 1 additional finding.


@kamil-bidus kamil-bidus marked this pull request as draft May 7, 2026 14:35
feat(plugins-google): add cached_content option for explicit context caching

The plugin currently relies on Gemini's implicit cache, which is
heuristic. In voice-agent workloads where the system prompt is large
and stable across calls, implicit caching often misses on turn 1 of
a conversation, paying the full cold-start cost.

Explicit caching is the documented alternative: the application
creates a CachedContent resource via client.caches.create(...) and
references it by name on subsequent generateContent calls. Cached
prefix tokens are billed at a discount and processed in under 100ms.

The plugin already reads cached_content_token_count from response
usage but had no way to set cached_content on requests. This adds
the parameter on LLM.__init__, stores it on _LLMOptions, and
propagates it into GenerateContentConfig via extra_kwargs.

End-to-end usability matters: Gemini rejects generateContent
requests that pass cached_content together with system_instruction,
tools, or tool_config — those fields belong inside the CachedContent
resource. Without handling that, setting cached_content on any LLM
that also has a system prompt or function tools would 400. So
LLMStream._run now suppresses system_instruction, tools, and
tool_config from the outgoing request whenever cached_content is
attached. Cache lifecycle (creation, TTL refresh, deletion) and the
choice of what to bake into the cache stay the application's
responsibility — the plugin only consumes the resource name and
ensures the matching fields are absent from the request.

Behaviour is unchanged for callers that don't pass cached_content:
the gating is strictly is-given on that one option. Documented in
the docstring so users know the cache must contain whichever of
system_instruction / tools the model needs.

Tests cover propagation, the omitted-when-not-set default, and the
three suppression branches (system_instruction stripped, tools
stripped, tool_config stripped) plus the unchanged-when-no-cache
backward-compat path.

Refs livekit#2359.
@kamil-bidus kamil-bidus force-pushed the kamdibus/gemini-cached-content-support branch from 853f638 to c57dd80 Compare May 7, 2026 15:06
@kamil-bidus kamil-bidus closed this May 7, 2026
@kamil-bidus kamil-bidus deleted the kamdibus/gemini-cached-content-support branch May 7, 2026 15:09
@kamil-bidus
Author

Superseded by #5675.
