159 changes: 105 additions & 54 deletions product/ai-gateway/cache-simple-and-semantic.mdx
@@ -4,16 +4,20 @@ description: Speed up requests and reduce costs by caching LLM responses.
---

<Info>
**Simple** caching is available for all plans.
<br />
**Semantic** caching requires a vector database and is only available on select Enterprise plans. [Contact us](https://portkey.ai/docs/support/contact-us) to learn more about enabling this feature.
</Info>

Cache LLM responses to serve requests up to **20x faster** and cheaper.

| Mode | How it Works | Best For | Supported Routes |
| ------------ | ------------------------------------- | -------------------------------- | ------------------------------------- |
| **Simple** | Exact match on input | Repeated identical prompts | All models including image generation |
| **Semantic** | Matches semantically similar requests | Denoising variations in phrasing | `/chat/completions`, `/completions` |

## Enable Cache

@@ -36,25 +40,63 @@ Add `cache` to your [config object](/api-reference/config-object#cache-object-de
</CodeGroup>

<Note>
Caching won't work if the `x-portkey-debug: "false"` header is included.
</Note>

## Simple Cache

Exact match on input prompts. If the same request comes again, Portkey returns the cached response.
Returns the cached response when the **exact same request** is sent again.

**Hit when all of these match a cached entry:**

- Full request body (`messages` or `prompt`, `model`, `temperature`, `max_tokens`, and every other parameter)
- `x-portkey-metadata` (if used)
- `x-portkey-cache-namespace` (if used)
- The entry is still within `max_age`

**Miss when:**

- The request is sent for the first time
- Any field in the body changes—even one character in the prompt or a different parameter
- The cached entry has expired
- `x-portkey-cache-force-refresh: true` is set on the request
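
One way to picture the exact-match rule is as a lookup key derived from everything that must match. The sketch below is an illustration under stated assumptions, not Portkey's actual key derivation: `simple_cache_key` is a hypothetical helper, and treating key order in the body as irrelevant (via `sort_keys`) is an assumption. Expiry and force-refresh are not modeled here.

```python
import hashlib
import json
from typing import Optional


def simple_cache_key(
    body: dict,
    metadata: Optional[dict] = None,
    namespace: Optional[str] = None,
) -> str:
    """Hypothetical helper: hash every field that must match for a simple-cache hit."""
    # Serialize deterministically so identical requests always yield the same key.
    canonical = json.dumps(
        {"body": body, "metadata": metadata, "namespace": namespace},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


request = {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}
key = simple_cache_key(request, namespace="user-123")
```

Because the whole body feeds the hash, changing a single character in the prompt or any parameter produces a different key, which corresponds to a miss.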

## Semantic Cache

Matches requests with similar meaning using cosine similarity. [Learn more →](https://portkey.ai/blog/reducing-llm-costs-and-latency-semantic-cache/)
Matches requests with **similar meaning**, not just identical text. [Learn more →](https://portkey.ai/blog/reducing-llm-costs-and-latency-semantic-cache/)

<Info>
Semantic cache is a superset—it handles simple cache hits too.
Semantic cache is a superset of simple cache. Portkey checks for an exact
match first and only runs semantic search on a miss.
</Info>

<Note>
Semantic cache works with requests under 8,191 tokens and ≤4 messages.
</Note>
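
The limits in the note above can be expressed as a small pre-check. This is an illustrative sketch rather than Portkey's implementation: `eligible_for_semantic_cache` is a hypothetical helper, the token count is assumed to be computed elsewhere, and the non-system-message check is an assumption drawn from the matching rules described in this section.

```python
MAX_TOKENS_EXCLUSIVE = 8191  # requests must be under this many tokens
MAX_MESSAGES = 4             # requests may carry at most this many messages


def eligible_for_semantic_cache(messages: list, token_count: int) -> bool:
    """Hypothetical pre-check for whether a request can use semantic cache."""
    # Assumption: at least one message must have a role other than "system".
    has_non_system = any(m.get("role") != "system" for m in messages)
    return (
        token_count < MAX_TOKENS_EXCLUSIVE
        and len(messages) <= MAX_MESSAGES
        and has_non_system
    )
```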

**Hit when all of these are true (after a simple-cache miss):**

- User text has **similar meaning** to a cached request — cosine similarity above the threshold (default `0.95`)
- `model`, `temperature`, `max_tokens`, and any other body parameter **match exactly**
- `x-portkey-metadata` matches exactly (if used)
- The chat has at least one non-system message — the **first message (usually `system`) is ignored** during matching, so changing it does not affect cache hits

**Example — same model, different wording → semantic hit:**

```json
// First request (cached)
{
  "model": "gpt-4o",
  "messages": [{ "role": "user", "content": "Who is the US president?" }]
}

// Second request: SEMANTIC HIT
{
  "model": "gpt-4o",
  "messages": [{ "role": "user", "content": "Tell me who is the president of the US." }]
}
```
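
The hit conditions can be sketched as a two-step check: parameters must match exactly, then the embeddings must be similar enough. This is a minimal illustration, not the gateway's code: `cosine_similarity` and `is_semantic_hit` are hypothetical helpers, and the embeddings are assumed to come from whichever embedding provider is configured.

```python
import math
from typing import Sequence

DEFAULT_THRESHOLD = 0.95  # default similarity threshold stated above


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as no similarity
    return dot / (norm_a * norm_b)


def is_semantic_hit(
    new_embedding: Sequence[float],
    cached_embedding: Sequence[float],
    new_params: dict,
    cached_params: dict,
    threshold: float = DEFAULT_THRESHOLD,
) -> bool:
    """Hypothetical check: exact parameter match plus similarity above threshold."""
    if new_params != cached_params:
        return False  # model, temperature, etc. must match exactly
    return cosine_similarity(new_embedding, cached_embedding) > threshold
```

Checking the parameters first keeps the cheap comparison ahead of the vector math, and it reflects the rule that similar wording alone is not enough for a hit.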

### Set up semantic caching (self-hosted)

To enable semantic caching on a self-hosted Portkey gateway, configure the embedding provider and a vector database.
@@ -73,6 +115,7 @@ To enable semantic caching on a self-hosted Portkey gateway, configure the embed
```

`SEMANTIC_CACHE_EMBEDDING_PROVIDER` accepts `openai`, `google` (Gemini embeddings), or `vertex-ai` (Vertex AI embeddings). Set `SEMANTIC_CACHE_EMBEDDINGS_URL`, `SEMANTIC_CACHE_EMBEDDING_MODEL`, and `SEMANTIC_CACHE_EMBEDDING_DIMENSIONS` to match the chosen provider's embedding model.

</Step>
<Step title="Configure the vector database">
Set the following environment variables in your gateway environment to connect to your vector store (Milvus or Pinecone):
@@ -110,29 +153,18 @@ To enable semantic caching on a self-hosted Portkey gateway, configure the embed
```json
{ "cache": { "mode": "semantic" } }
```

</Step>
</Steps>

<Warning>
**Limitations:**
- Embedding generation supports OpenAI, Google (Gemini), and Vertex AI embedding providers.
- The LLM model used for generating responses must be OpenAI-compatible.
- Each request must include at least one `user` message along with system messages. Requests with only system messages are dropped.
</Warning>

### Message matching behavior

Semantic cache requires **at least two messages**. The first message (typically `system`) is ignored for matching:

```json
[
{ "role": "system", "content": "You are a helpful assistant" },
{ "role": "user", "content": "Who is the president of the US?" }
]
```

Only the `user` message is used for matching. Change the system message without affecting cache hits.

## Cache TTL

Set expiration with `max_age` (in seconds):
@@ -141,11 +173,11 @@ Set expiration with `max_age` (in seconds):
{ "cache": { "mode": "semantic", "max_age": 60 } }
```

| Setting | Value                       |
| ------- | --------------------------- |
| Minimum | 60 seconds                  |
| Maximum | 90 days (7,776,000 seconds) |
| Default | 7 days (604,800 seconds)    |

### Organization-Level TTL

@@ -156,6 +188,7 @@ Admins can set default TTL for all workspaces to align with data retention polic
3. Save

**Precedence:**

- No `max_age` in request → org default used
- Request `max_age` > org default → org default wins
- Request `max_age` < org default → request value honored
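
The precedence rules amount to taking the smaller of the two values when the request sets one. A minimal sketch, assuming both values are in seconds; `effective_max_age` is a hypothetical helper, not part of any Portkey SDK:

```python
from typing import Optional


def effective_max_age(request_max_age: Optional[int], org_default: int) -> int:
    """Hypothetical helper: resolve the TTL (seconds) applied to a cached entry."""
    if request_max_age is None:
        return org_default  # no max_age in the request: org default is used
    # The org default acts as a ceiling; smaller request values are honored.
    return min(request_max_age, org_default)
```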
@@ -178,12 +211,15 @@ response = portkey.with_options(
```

```javascript Node
const response = await portkey.chat.completions.create(
  {
    messages: [{ role: "user", content: "Hello" }],
    model: "@openai-prod/gpt-4o",
  },
  {
    cacheForceRefresh: true,
  },
);
```

```bash cURL
@@ -197,8 +233,8 @@ curl https://api.portkey.ai/v1/chat/completions \
</CodeGroup>

<Info>
- Requires cache config to be passed
- For semantic hits, refreshes ALL matching entries
</Info>

## Cache Namespace
@@ -217,12 +253,15 @@ response = portkey.with_options(
```

```javascript Node
const response = await portkey.chat.completions.create(
  {
    messages: [{ role: "user", content: "Hello" }],
    model: "@openai-prod/gpt-4o",
  },
  {
    cacheNamespace: "user-123",
  },
);
```

```bash cURL
@@ -247,7 +286,11 @@ Set cache at top-level or per-target:
"strategy": { "mode": "fallback" },
"targets": [
{ "override_params": { "model": "@openai-prod/gpt-4o" } },
{ "override_params": { "model": "@anthropic-prod/claude-3-5-sonnet-20241022" } }
]
}
```
@@ -256,31 +299,39 @@ Set cache at top-level or per-target:
{
"strategy": { "mode": "fallback" },
"targets": [
{ "override_params": { "model": "@openai-prod/gpt-4o" }, "cache": { "mode": "simple", "max_age": 200 } },
{ "override_params": { "model": "@anthropic-prod/claude-3-5-sonnet-20241022" }, "cache": { "mode": "semantic", "max_age": 100 } }
]
}
```

</CodeGroup>

<Info>Target-level cache takes precedence over top-level.</Info>

<Note>
Targets with `override_params` need that exact param combination cached before
hits occur.
</Note>
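
The precedence between target-level and top-level cache settings can be sketched as a simple lookup. This is an illustration only: `resolve_cache` is a hypothetical helper, and the config shapes mirror the JSON examples above.

```python
from typing import Optional


def resolve_cache(config: dict, target: dict) -> Optional[dict]:
    """Hypothetical helper: pick the cache settings that apply to one target."""
    # A target's own "cache" entry wins; otherwise fall back to the top-level one.
    return target.get("cache", config.get("cache"))


config = {
    "cache": {"mode": "simple"},
    "strategy": {"mode": "fallback"},
    "targets": [
        {"override_params": {"model": "@openai-prod/gpt-4o"}},
        {
            "override_params": {"model": "@anthropic-prod/claude-3-5-sonnet-20241022"},
            "cache": {"mode": "semantic", "max_age": 100},
        },
    ],
}
resolved = [resolve_cache(config, target) for target in config["targets"]]
```

In this sketch the first target inherits the top-level simple cache, while the second uses its own semantic settings.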

## Analytics & Logs

**Analytics** → Cache tab shows:

- Cache hit rate
- Latency savings
- Cost savings

**Logs** → Status column shows: `Cache Hit`, `Cache Semantic Hit`, `Cache Miss`, `Cache Refreshed`, or `Cache Disabled`. [Learn more →](/product/observability/logs)

<Frame>
<img src="/images/product/ai-gateway/ai-11.png" />
</Frame>