feat(plugins-google): add cached_content option for explicit context caching #5661
Closed
kamil-bidus wants to merge 1 commit into livekit:main from
Conversation
…caching

The plugin currently relies on Gemini's implicit cache, which is heuristic. In voice-agent workloads where the system prompt is large and stable across calls, implicit caching often misses on turn 1 of a conversation, paying the full cold-start cost. Explicit caching is the documented alternative: the application creates a CachedContent resource via client.caches.create(...) and references it by name on subsequent generateContent calls. Cached prefix tokens are billed at a discount and processed in under 100 ms.

The plugin already reads cached_content_token_count from response usage but had no way to set cached_content on requests. This adds the parameter on LLM.__init__, stores it on _LLMOptions, and propagates it into GenerateContentConfig via extra_kwargs.

End-to-end usability matters: Gemini rejects generateContent requests that pass cached_content together with system_instruction, tools, or tool_config — those fields belong inside the CachedContent resource. Without handling that, setting cached_content on any LLM that also has a system prompt or function tools would 400. So LLMStream._run now suppresses system_instruction, tools, and tool_config from the outgoing request whenever cached_content is attached.

Cache lifecycle (creation, TTL refresh, deletion) and the choice of what to bake into the cache stay the application's responsibility — the plugin only consumes the resource name and ensures the matching fields are absent from the request. Behaviour is unchanged for callers that don't pass cached_content: the gating is strictly is_given on that one option. Documented in the docstring so users know the cache must contain whichever of system_instruction / tools the model needs.

Tests cover propagation, the omitted-when-not-set default, and the three suppression branches (system_instruction stripped, tools stripped, tool_config stripped) plus the unchanged-when-no-cache backward-compat path.

Refs livekit#2359.
Force-pushed from 853f638 to c57dd80
Author

Superseded by #5675.
Motivation
The Gemini plugin's `LLM` class supports many `GenerateContentConfig` options (`thinking_config`, `retrieval_config`, `safety_settings`, etc.) but not `cached_content`. The plugin already reads `cached_content_token_count` from response usage in `LLMStream._parse_part`, so cache hits surface in metrics — there's just no way to attach a `CachedContent` resource to outgoing requests.

For voice-agent workloads on Gemini 3 Flash with ~6 KB system prompts, implicit caching is unreliable: in a 100-call/day deployment only ~3% of turn-1 requests pick up cached tokens despite firing a same-prefix warmup before the user's first utterance. This matches the broader user reports in #2359 ("Gemini Implicit Caching is still broken - I tested through the gemini API"). Explicit context caching is the documented alternative: create a `CachedContent` once, reference it by name on every `generateContent` call. Prefix tokens are processed in under 100 ms and billed at a discount.

Change
Add `cached_content: NotGivenOr[str] = NOT_GIVEN` to `LLM.__init__`. Standard propagation pattern:

- `__init__` parameter and docstring
- `_LLMOptions` dataclass field, populated in `_LLMOptions(...)`
- `is_given(...)` check in `chat()` propagating into `extra["cached_content"]`
- merged into `GenerateContentConfig` via `**self._extra_kwargs`

End-to-end usability — request-side suppression
Gemini's API rejects `generateContent` requests that pass `cached_content` together with `system_instruction`, `tools`, or `tool_config` — those fields belong inside the `CachedContent` resource. Without handling that, exposing the parameter would still 400 on any realistic agent — anyone with a system prompt or function tools (i.e. the plugin's primary user base) couldn't actually use the new option.

So `LLMStream._run` now strips `system_instruction`, `tools`, and `tool_config` from the outgoing request whenever `cached_content` is attached. Behaviour is unchanged for callers that don't set `cached_content`: gating is strictly `is_given` on that one option.

Cache lifecycle (creation via `client.caches.create(...)`, TTL refresh, deletion) and the choice of what to bake into the cache stay the application's responsibility. The docstring spells out the contract: the cache resource must contain whichever of `system_instruction` / `tools` the model needs, since the plugin will keep them off the request.
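The suppression rule itself is small. A minimal sketch — field names follow Gemini's `GenerateContentConfig`, but `apply_cache_suppression` is a hypothetical helper illustrating the behaviour, not the plugin's actual `_run()` code:

```python
# Fields Gemini rejects when cached_content is present on the request;
# they must live inside the CachedContent resource instead.
_CACHE_CONFLICTS = ("system_instruction", "tools", "tool_config")

def apply_cache_suppression(config: dict) -> dict:
    """Drop conflicting fields iff a cached_content resource is attached."""
    if "cached_content" not in config:
        return dict(config)  # no cache: request goes out unchanged
    return {k: v for k, v in config.items() if k not in _CACHE_CONFLICTS}
```

Note the no-cache branch returns the config untouched, which is the backward-compatibility guarantee stated below.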
Compatibility
Default `NOT_GIVEN` keeps existing behavior unchanged. Verified by `test_cached_content_omitted_when_not_set` and `test_request_includes_system_instruction_and_tools_when_no_cache`: when the parameter isn't passed, the field is absent from `_extra_kwargs` and the outgoing request still carries `system_instruction` and `tools` exactly as before.
Works with both Gemini Developer API (`cachedContents/{id}`) and Vertex AI (`projects/{p}/locations/{l}/cachedContents/{id}`). The plugin passes the string through unmodified; format validation is the SDK's responsibility.
Tests
`tests/test_plugin_google_llm.py` — 6 cases covering both halves:
Propagation (3):
Request-side suppression (3) — patch `client.aio.models.generate_content_stream`, capture the `GenerateContentConfig` actually received:
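The capture pattern those suppression tests use can be sketched without the real SDK. `FakeModels` and `run_request` below are illustrative stand-ins for `client.aio.models.generate_content_stream` and `LLMStream._run`, assuming a plain-dict config for brevity:

```python
import asyncio

class FakeModels:
    """Records the config the 'SDK' call actually received."""
    def __init__(self):
        self.captured_config = None

    async def generate_content_stream(self, *, model, contents, config):
        self.captured_config = config
        return iter(())  # empty stream; tests only inspect the config

async def run_request(models: FakeModels, config: dict) -> dict:
    # Stand-in for LLMStream._run: suppression applied before the call.
    if "cached_content" in config:
        config = {
            k: v for k, v in config.items()
            if k not in ("system_instruction", "tools", "tool_config")
        }
    await models.generate_content_stream(
        model="gemini-2.0-flash", contents=[], config=config
    )
    return models.captured_config

models = FakeModels()
captured = asyncio.run(run_request(
    models,
    {"system_instruction": "sys", "cached_content": "cachedContents/abc"},
))
assert "system_instruction" not in captured
```

Asserting on what the patched call *received*, rather than on plugin internals, keeps the tests pinned to the wire-level contract Gemini enforces.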
All existing google-plugin tests still pass. `ruff check` / `ruff format` clean.
Refs