
Conversation

@enyst (Collaborator) commented Dec 18, 2025

Summary

This PR adds a design document and a runnable example to support prompt/context caching for Gemini models on Vertex AI and the Gemini API (AI Studio). It focuses on options beyond automatic implicit caching and provides verified guidance for Gemini 2.x and Gemini 3.x.

Changes

  • docs/designs/gemini_prompt_caching.md
    • Verified design report covering:
      • Implicit vs explicit caching on Vertex and AI Studio
      • Supported models now including Gemini 3 Flash/Pro (Preview)
      • Current thresholds, TTL behavior, and discounts (90% on 2.5+, 75% on 2.0)
      • Practical recommendations and SDK integration considerations
      • How to verify using logs/telemetry and example scripts
  • examples/01_standalone_sdk/31_gemini_caching_probe.py
    • Implicit caching probe using LiteLLM proxy (Vertex) with LITELLM_API_KEY and base_url https://llm-proxy.eval.all-hands.dev
    • Explicit caching probe using google-genai (AI Studio) with GEMINI_API_KEY, creating a short TTL cache and referencing it
    • Writes Telemetry logs under logs/caching for inspection; prints usage metadata

Motivation / Background

  • Implicit caching is enabled by default on Vertex for supported Gemini models; we can increase hit rates through prompt structure, and we can additionally add explicit caching support for predictable savings.
  • Explicit caching is available via control-plane APIs (Vertex or google-genai for AI Studio), enabling cache creation, TTL control, and explicit referencing in generate requests; a minimal sketch follows this list.
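To make the second point concrete, here is a minimal sketch of explicit caching with google-genai against AI Studio. The model name, TTL, and prompt text are illustrative placeholders; the runnable version lives in the example script added by this PR and may differ in detail.

```python
# Minimal sketch of explicit caching via google-genai (AI Studio).
# Assumptions: GEMINI_API_KEY is set; model, TTL, and prompt text are placeholders.
import os

from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Create a cached content entry with a short TTL. The cached prefix has to meet
# the provider's minimum token threshold to be accepted.
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction="You answer questions about a fixed reference document.",
        contents=["Reference document text. " * 600],  # large, stable prefix
        ttl="300s",
    ),
)
print("Created cache:", cache.name, "expires:", cache.expire_time)

# Reference the cache explicitly in a generate request and inspect usage metadata.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the cached document in one sentence.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
print("cached_content_token_count:", response.usage_metadata.cached_content_token_count)
```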

Implementation Notes

  • No core SDK code changes in this PR; we add documentation and a working example.
  • The LLM telemetry already captures cache_read tokens when providers include them in usage metadata; a small log-scanning sketch follows this list.
  • Future work (non-breaking, optional): allow a cached content reference to pass through via litellm_extra_body or provider-specific kwargs when proxies support it.
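As a quick way to eyeball those counters after a run, here is a minimal log-scanning sketch. The exact file layout under logs/caching is an assumption; only the usage_summary keys are taken from the logged JSON shown later in this thread.

```python
# Minimal sketch: scan the probe's telemetry logs for cached-token counters.
# Assumption: each call is logged as a JSON file under logs/caching with a
# top-level "usage_summary" object (see the log excerpt later in this PR).
import json
from pathlib import Path

for path in sorted(Path("logs/caching").glob("*.json")):
    data = json.loads(path.read_text())
    usage = data.get("usage_summary", {})
    print(
        f"{path.name}: prompt={usage.get('prompt_tokens')} "
        f"cache_read={usage.get('cache_read_tokens')}"
    )
```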

Verification

  • Pre-commit hooks run successfully on new files.
  • Example script can be run locally:
    • Implicit: set LITELLM_API_KEY and run with --mode implicit
    • Explicit: set GEMINI_API_KEY and run with --mode explicit

Checklist

  • Documentation updated (new design doc)
  • Example added to verify behavior
  • Pre-commit checks pass on changed files
  • No breaking changes to the SDK API

Closes #1427

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant | Architectures | Base Image                                 | Docs / Tags
java    | amd64, arm64  | eclipse-temurin:17-jdk                     | Link
python  | amd64, arm64  | nikolaik/python-nodejs:python3.12-nodejs22 | Link
golang  | amd64, arm64  | golang:1.21-bookworm                       | Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:2ed64ca-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-2ed64ca-python \
  ghcr.io/openhands/agent-server:2ed64ca-python

All tags pushed for this build

ghcr.io/openhands/agent-server:2ed64ca-golang-amd64
ghcr.io/openhands/agent-server:2ed64ca-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:2ed64ca-golang-arm64
ghcr.io/openhands/agent-server:2ed64ca-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:2ed64ca-java-amd64
ghcr.io/openhands/agent-server:2ed64ca-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:2ed64ca-java-arm64
ghcr.io/openhands/agent-server:2ed64ca-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:2ed64ca-python-amd64
ghcr.io/openhands/agent-server:2ed64ca-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:2ed64ca-python-arm64
ghcr.io/openhands/agent-server:2ed64ca-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:2ed64ca-golang
ghcr.io/openhands/agent-server:2ed64ca-java
ghcr.io/openhands/agent-server:2ed64ca-python

About Multi-Architecture Support

  • Each variant tag (e.g., 2ed64ca-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 2ed64ca-python-amd64) are also available if needed

…ble probe example for implicit and explicit caching

- Verified docs for implicit/explicit caching, models incl. Gemini 3
- Example to test TTL and cached token counts; logs to logs/caching

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions (Contributor) commented:

[Automatic Post]: I have assigned @csmith49 as a reviewer based on git blame information. Thanks in advance for the help!

@enyst removed the request for review from csmith49 on December 21, 2025 02:50
@enyst marked this pull request as draft on December 21, 2025 02:50
@enyst (Collaborator, Author) commented Dec 21, 2025

@OpenHands understand this PR. Run the example script. If it fails, try to fix it and rerun it, so we can see whether it demonstrates what it aims to show.

You have both a Gemini key in your env, for direct calls, and a LiteLLM proxy key (the latter works with base URL https://llm-proxy.eval.all-hands.dev). The proxy routes to Vertex AI. Use model gemini-3-pro-preview (maybe do /v1/models on the proxy to see the exact name).

Post a comment here with what was displayed in console, so I can see the results.

@openhands-ai bot commented Dec 21, 2025

I'm on it! enyst can track my progress at all-hands.dev

…fix LiteLLM provider selection by passing custom_llm_provider through LLM

- Switch default implicit model to gemini-3-pro-preview (per /v1/models)
- Add custom_llm_provider to LLM.transport call, so example can override to 'openai' when hitting the LiteLLM proxy base_url
- Verified implicit (proxy) and explicit (AI Studio) flows run and log usage

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai bot commented Dec 21, 2025

Summary of work on PR branch openhands/gemini-caching-design

What I did

  • Built the repo (make build) and verified pre-commit hooks.
  • Queried the proxy models at https://llm-proxy.eval.all-hands.dev/v1/models and confirmed gemini-3-pro-preview is available.
  • Ran the example:
    • Implicit mode (proxy → Vertex): Initially failed due to provider mismatch (proxy uses OpenAI schema). Fixed and reran successfully.
    • Explicit mode (AI Studio via google-genai): Ran successfully; showed explicit cached token usage in usage metadata.
  • Committed minimal changes and pushed to the existing PR branch.

Changes made (minimal and focused)

  1. examples/01_standalone_sdk/31_gemini_caching_probe.py
     • Switched model to gemini-3-pro-preview (per /v1/models).
     • Added custom_llm_provider="openai" when calling the proxy so LiteLLM uses the OpenAI schema with the proxy base_url.
     • Updated the docstring note to reflect the model name.
  2. openhands-sdk/openhands/sdk/llm/llm.py
     • Passed custom_llm_provider through to litellm.completion in _transport_call (this routing is sketched below, after this list).
     • This is backward-compatible and only applies when explicitly set.
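For context, here is a minimal standalone sketch of this routing, written as a direct litellm.completion call rather than through the SDK's _transport_call; the model name and base URL are the ones used in this PR, and everything else is illustrative.

```python
# Minimal sketch (assumption: LITELLM_API_KEY is set in the environment).
# Forcing the OpenAI-compatible schema makes LiteLLM send the request to the
# proxy's base URL instead of routing it as a native Gemini/Vertex call.
import os

import litellm

response = litellm.completion(
    model="gemini-3-pro-preview",
    custom_llm_provider="openai",  # the proxy speaks the OpenAI schema
    api_base="https://llm-proxy.eval.all-hands.dev",
    api_key=os.environ["LITELLM_API_KEY"],
    messages=[{"role": "user", "content": "Reply with one word: ready."}],
)
print(response.choices[0].message.content)
print(response.usage)
```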

Verification output (console)
Implicit mode (proxy)

  • First call completed. Cost=$ 0.0058720000000000005
  • Second call completed. Total cost=$ 0.011744000000000001
  • Logs written to logs/caching
  • Note: usage_summary.cache_read_tokens in logs were 0; likely because the shared “long” prefix is below the ~2,048-token minimum and/or the proxy does not surface Vertex cached-token metrics in OpenAI responses.

Explicit mode (AI Studio)

  • Created cache: cachedContents/vngttjhu2dqb867jswl3uidml4aws788xwa3wqrp
  • Cache expire_time: 2025-12-21 03:10:47.404061+00:00
  • Response text returned.
  • Usage metadata showed cached_content_token_count=2024 (and cache_tokens_details), proving explicit caching worked.

Checklist against request

  • Understand this PR: done (reviewed design doc and example).
  • Run the example script:
    • Implicit mode: ran, initially failed, fixed, and reran successfully.
    • Explicit mode: ran successfully.
  • If it fails, fix and rerun: done (provider routing fix for proxy).
  • Use the proxy with gemini-3-pro-preview: done (confirmed via /v1/models and used in the example).
  • Post what was displayed in console: included above.
  • Keep changes concise: only two small edits; ran pre-commit and type checks; no extraneous changes.

Commit pushed

  • ab673f8 examples: gemini caching probe - use gemini-3-pro-preview via proxy; fix LiteLLM provider selection by passing custom_llm_provider through LLM

Notes and suggested next step

  • To observe cache_read_tokens > 0 in implicit mode, we can increase the stable prefix to comfortably exceed 2,048 tokens (not just characters; a rough token-count check is sketched below) and re-run, or confirm whether the proxy exposes Vertex cached-token counters through its OpenAI-style responses. I can update the example accordingly if you’d like.
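A minimal sketch of that token-count sanity check, using litellm's token counter as a rough estimator; the counter falls back to a generic tokenizer for Gemini models, so treat the number as approximate, and the prefix text is illustrative.

```python
# Minimal sketch: estimate whether the stable prefix clears the ~2,048-token
# implicit-caching minimum before spending real calls on the probe.
# Assumption: litellm.token_counter's generic fallback is close enough here.
import litellm

stable_prefix = "Stable shared instructions for the caching probe. " * 300

estimated = litellm.token_counter(
    model="gemini-3-pro-preview",
    messages=[{"role": "system", "content": stable_prefix}],
)
print("estimated prefix tokens:", estimated)
if estimated < 2048:
    print("Prefix is likely below the implicit-caching minimum; enlarge it.")
```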


… Greatly enlarge stable system prefix and first user message to exceed ~2,048 tokens
- Add a third call and 2s sleeps to increase hit likelihood
- Verified logs show usage_summary.cache_read_tokens and cached_tokens in vertex metadata

Co-authored-by: openhands <openhands@all-hands.dev>
@enyst (Collaborator, Author) commented Dec 21, 2025

What I changed (since last summary)

  • Strengthened the implicit caching probe to reliably trigger Vertex implicit cache hits:
    • examples/01_standalone_sdk/31_gemini_caching_probe.py
      • Greatly enlarged the stable system prefix (repeated varied text) and added a large first user message tail to push total prompt well beyond ~2,048 tokens.
      • Inserted 2-second sleeps and added a third call to improve cache hit chances.
      • Ensured formatting passes pre-commit (ruff, pyright, pycodestyle).
  • Pushed commits:
    • f374176 examples: strengthen implicit caching probe for Vertex via proxy
    • ab673f8 examples: gemini caching probe - use gemini-3-pro-preview via proxy; pass custom_llm_provider through LLM

Verification: implicit and explicit

  • Implicit (proxy → Vertex, model: gemini-3-pro-preview)
    Command:
    uv run python examples/01_standalone_sdk/31_gemini_caching_probe.py --mode implicit

Implicit caching - Output:

  • First call done. Cost=$ 0.040482
  • Second call done. Total cost=$ 0.0515232
  • Third call done. Total cost=$ 0.06256439999999999

Logs (under logs/caching/) confirm cache hits:

  • Second/third call JSON includes usage_summary.cache_read_tokens and shows Vertex metadata with cached_tokens.
  • Example (third call):
    "usage_summary": {"prompt_tokens": 16803, "completion_tokens": 573, "reasoning_tokens": 527, "cache_read_tokens": 16356}. This demonstrates implicit caching working with Vertex via the proxy.

  • Explicit (AI Studio via google-genai)
    Command:
    uv run python examples/01_standalone_sdk/31_gemini_caching_probe.py --mode explicit

Explicit caching - Output:

  • Created cache: cachedContents/...
  • Cache expire_time: 2025-12-21 03:10:47.404061+00:00
  • Response text: ...
  • Usage metadata: cached_content_token_count=2024, etc.

Checklist against your request

  • Confirm implicit works with Vertex by using larger stable content and spacing calls: done (cache_read_tokens clearly present in logs).
  • Optionally expand the first user message and add one more message: done (large user tail and three calls).
  • Keep changes minimal and passing pre-commit: done.

Next steps (optional)

  • If you’d like, I can tune the sizes/sleeps for faster demos while still triggering cache hits, or add a command-line flag to control the “large” size so it’s easy to toggle between quick and thorough probes.

@OpenHands deleted a comment from openhands-ai bot on Dec 21, 2025
@all-hands-bot (Collaborator) commented:

[Automatic Post]: It has been a while since there was any activity on this PR. @enyst, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

@enyst (Collaborator, Author) commented Jan 4, 2026

This was an OpenHands agent thread to test implicit and explicit caching. Nothing to do here.

@enyst closed this on Jan 4, 2026