Merged
52 changes: 52 additions & 0 deletions README.md
@@ -20,6 +20,7 @@ A Payload CMS plugin that adds vector search capabilities to your collections. P
- 🧩 **Extensible Schema** — attach custom [`extensionFields`](#knowledge-pool-config) to the embeddings collection and persist values per chunk for querying.
- 🌐 [**REST API**](#rest-endpoints) — built-in vector-search endpoint with Payload-style [`where` filtering](#metadata-filtering-where) and configurable limits.
- 🏊 [**Multiple Knowledge Pools**](#knowledge-pool-config) — separate knowledge pools with independent configurations.
- 🌍 [**Localization (i18n)**](#localization-i18n) — first-class pattern for embedding and searching multi-locale Payload content.

## Table of Contents

@@ -34,6 +35,7 @@ A Payload CMS plugin that adds vector search capabilities to your collections. P
- [CollectionVectorizeOption](#collectionvectorizeoption)
- [Metadata Filtering (`where`)](#metadata-filtering-where)
- [Chunkers](#chunkers)
- [Localization (i18n)](#localization-i18n)
- [Bulk Embeddings API](#bulk-embeddings-api)
- [Validation & Retries](#validation--retries)
- [API Reference](#api-reference)
@@ -398,6 +400,55 @@ const postsToKnowledgePool: ToKnowledgePoolFn = async (doc, payload) => {

Because you control the output, you can mix different field types, discard empty values, or inject any metadata that aligns with your `extensionFields`.

## Localization (i18n)

Payload's localization works with this plugin out of the box. There is no dedicated `locale` config — locale-aware embedding and search is a first-class supported workflow built on the existing `extensionFields` and [`where` filter](#metadata-filtering-where) primitives.

The pattern in three steps:

**1. Declare `locale` as a required extension field on your knowledge pool:**

```typescript
extensionFields: [
  { name: 'locale', type: 'text', required: true },
  // ...other fields
],
```

**2. Iterate locales inside `toKnowledgePool`** and tag each chunk with the locale it came from:

```typescript
const postsToKnowledgePool: ToKnowledgePoolFn = async (doc, payload) => {
  const result: Array<{ chunk: string; locale: string }> = []
  for (const locale of ['en', 'es', 'fr']) {
    const localized = await payload.findByID({
      collection: 'posts',
      id: doc.id,
      locale,
    })
    const chunks = await chunkText(localized.title ?? '', payload)
    for (const chunk of chunks) {
      result.push({ chunk, locale })
    }
  }
  return result
}
```
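The locale list in the snippet above is hard-coded. One way to avoid drift is to derive the codes from the project's localization settings. The following is a minimal sketch under stated assumptions: `getLocaleCodes` and the `LocalizationLike` shape are hypothetical illustrations, not part of this plugin or of Payload's exported types, and you would pass in whatever your own config exposes.

```typescript
// Hypothetical helper (not part of the plugin): derive locale codes from a
// simplified, Payload-like localization object instead of hard-coding them.
// The shape below is an assumption for illustration only.
type LocaleEntry = string | { code: string }

interface LocalizationLike {
  locales: LocaleEntry[]
}

function getLocaleCodes(localization: LocalizationLike | false | undefined): string[] {
  // Localization disabled or absent: there is nothing to iterate.
  if (!localization) return []
  // Accept both plain string codes and objects carrying a `code` property.
  const codes = localization.locales.map((entry) =>
    typeof entry === 'string' ? entry : entry.code,
  )
  // De-duplicate while preserving the original order.
  return codes.filter((code, index) => codes.indexOf(code) === index)
}

const exampleCodes = getLocaleCodes({ locales: ['en', { code: 'es' }, 'fr', 'en'] })
console.log(exampleCodes) // logs the codes en, es, fr
```

Inside `toKnowledgePool`, the loop could then iterate over the returned array rather than a literal list.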

**3. Filter at search time** by the visitor's locale using the existing [`where` filter](#metadata-filtering-where):

```typescript
const results = await vectorizedPayload.search({
  query: 'product warranty terms',
  knowledgePool: 'mainKnowledgePool',
  where: { locale: { equals: req.locale } },
})
```

Visitors get locale-scoped semantic search: because results are filtered by `req.locale`, a visitor browsing in English only sees English chunks, a visitor browsing in Spanish only sees Spanish chunks, and the two never mix.
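If you combine the locale constraint with other metadata filters, a small helper keeps the locale clause from being forgotten. This is a sketch only: `localeWhere` is a hypothetical helper, `category` is a made-up extension field, and it assumes the plugin's `where` accepts Payload's `and` operator, which you should confirm against the Metadata Filtering section.

```typescript
// Minimal stand-in for a Payload-style `where` clause; the real type comes from Payload.
type WhereLike = { [key: string]: unknown }

// Hypothetical helper: always constrain by locale, optionally AND-ing extra filters.
function localeWhere(locale: string, extra?: WhereLike): WhereLike {
  const localeClause: WhereLike = { locale: { equals: locale } }
  return extra ? { and: [localeClause, extra] } : localeClause
}

// Locale only.
console.log(JSON.stringify(localeWhere('es')))
// Locale plus an assumed `category` extension field.
console.log(JSON.stringify(localeWhere('es', { category: { equals: 'guides' } })))
```

The returned object can be passed as the `where` argument in the search call shown in step 3.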

**Tradeoff to know.** Every document save re-embeds every locale together. For CMS workloads this is usually fine — edits are infrequent, embeddings are cheap. If your workflow can't tolerate this (e.g. per-locale incremental re-embedding to control cost), see the [Roadmap](#roadmap) entry on scope-aware chunk identity and open an issue with your use case.

## Bulk Embeddings API

The bulk embedding API is designed for large-scale embedding using provider batch APIs (like Voyage AI). **Bulk runs are never auto-queued** — they must be triggered manually via the admin UI or API.
@@ -1022,6 +1073,7 @@ Common scripts:

- **Additional adapters** — Pinecone, Qdrant, SQLite, etc. See [adapters/README.md](./adapters/README.md) for the `DbAdapter` contract.
- **Vercel CI matrix** — exercising the serverless job model end-to-end on Vercel preview deployments.
- **Scope-aware chunk identity** — `(sourceCollection, docId, ...scopeFields)` as identity for advanced editorial workflows: draft/published with locale, per-tenant isolation, A/B variants. Design is drafted (see [`docs/plans/archive/2026-05-10-scope-aware-chunk-identity.md`](./docs/plans/archive/2026-05-10-scope-aware-chunk-identity.md)). Waiting on a real use case before building — open an issue if this would unblock you.

Want one of these sooner? Star the repo and open an issue.

146 changes: 146 additions & 0 deletions dev/specs/vectorizeReorder.spec.ts
@@ -0,0 +1,146 @@
import type { Payload } from 'payload'
import { afterAll, beforeAll, describe, expect, test } from 'vitest'
import { postgresAdapter } from '@payloadcms/db-postgres'
import { buildConfig, getPayload } from 'payload'
import { lexicalEditor } from '@payloadcms/richtext-lexical'

import payloadcmsVectorize from 'payloadcms-vectorize'
import { createMockAdapter } from 'helpers/mockAdapter.js'
import { DIMS } from './constants.js'
import { createTestDb, destroyPayload, waitForVectorizationJobs } from './utils.js'

/**
 * Verifies the safety invariant of the vectorize task ordering:
 * if the embedding API fails on a re-embed, the doc's existing chunks
 * must remain in the DB. The destructive delete must not run before
 * we have valid embeddings ready to insert.
 */
describe('Vectorize task does not wipe existing chunks on embed failure', () => {
  let payload: Payload
  const dbName = 'vectorize_reorder_test'

  // Controlled embed fn that can be made to fail mid-test.
  let shouldEmbedFail = false
  const embedDocs = async (texts: string[]) => {
    if (shouldEmbedFail) {
      throw new Error('Simulated embedding provider failure')
    }
    return texts.map(() => Array(DIMS).fill(0.5))
  }
  const embedQuery = async (_text: string) => Array(DIMS).fill(0)

  beforeAll(async () => {
    await createTestDb({ dbName })

    const config = await buildConfig({
      secret: 'reorder-test-secret',
      editor: lexicalEditor(),
      jobs: {
        tasks: [],
        autoRun: [{ cron: '*/2 * * * * *', limit: 10 }],
      },
      collections: [
        {
          slug: 'posts',
          fields: [{ name: 'title', type: 'text' }],
        },
      ],
      db: postgresAdapter({
        pool: {
          connectionString: `postgresql://postgres:password@localhost:5433/${dbName}`,
        },
      }),
      plugins: [
        payloadcmsVectorize({
          dbAdapter: createMockAdapter(),
          knowledgePools: {
            default: {
              collections: {
                posts: {
                  toKnowledgePool: async (doc: any) => [{ chunk: doc.title ?? '' }],
                },
              },
              embeddingConfig: {
                version: 'reorder-test-v1',
                queryFn: embedQuery,
                realTimeIngestionFn: embedDocs,
              },
            },
          },
        }),
      ],
    })

    payload = await getPayload({
      config,
      key: `vectorize-reorder-test-${Date.now()}`,
      cron: true,
    })
  })

  afterAll(async () => {
    await destroyPayload(payload)
  })

  test('existing chunks survive when re-embed fails', async () => {
    // 1. Create a doc and let the first vectorize succeed.
    const post = await payload.create({
      collection: 'posts',
      data: { title: 'Original title' } as any,
    })
    await waitForVectorizationJobs(payload)

    const beforeFailure = await payload.find({
      collection: 'default',
      where: {
        and: [
          { sourceCollection: { equals: 'posts' } },
          { docId: { equals: String(post.id) } },
        ],
      },
    })
    expect(beforeFailure.docs.length).toBeGreaterThan(0)
    const originalIds = beforeFailure.docs.map((d: any) => d.id).sort()

    // 2. Flip the embed fn to throw, then trigger a re-vectorize.
    shouldEmbedFail = true
    try {
      await payload.update({
        collection: 'posts',
        id: post.id,
        data: { title: 'Updated title' } as any,
      })
      await waitForVectorizationJobs(payload, 15000)

      // 3. The job must have errored.
      const failedJobs = await payload.find({
        collection: 'payload-jobs',
        where: {
          and: [
            { taskSlug: { equals: 'payloadcms-vectorize:vectorize' } },
            { hasError: { equals: true } },
          ],
        },
        sort: '-createdAt',
        limit: 1,
      })
      expect(failedJobs.totalDocs).toBeGreaterThan(0)

      // 4. The existing chunks must STILL be present, with the same IDs.
      //    Before the reorder fix, deleteChunks ran first and wiped these.
      const afterFailure = await payload.find({
        collection: 'default',
        where: {
          and: [
            { sourceCollection: { equals: 'posts' } },
            { docId: { equals: String(post.id) } },
          ],
        },
      })
      const remainingIds = afterFailure.docs.map((d: any) => d.id).sort()
      expect(remainingIds).toEqual(originalIds)
    } finally {
      shouldEmbedFail = false
    }
  })
})
105 changes: 105 additions & 0 deletions docs/plans/2026-05-13-vectorize-safety-and-localization-docs.md
@@ -0,0 +1,105 @@
# Vectorize Task Safety Reorder + Localization Docs

**Date:** 2026-05-13
**Status:** Design
**Topic:** Two small, independently valuable changes split off from the parked scope-aware-chunk-identity spec. They ship together because they share a release story.

## Background

A separate brainstorming session (see [archive/2026-05-10-scope-aware-chunk-identity.md](archive/2026-05-10-scope-aware-chunk-identity.md)) explored generalizing the locale-scoping problem into a first-class `scopeKey` config. We concluded that full feature is YAGNI for now: the locale case (the dominant motivator) is already solvable with the existing extension-field + `where` pattern, and the remaining motivators (draft/published with locale, per-tenant isolation, A/B variants) are rare enough to defer until a user reports the gap.

Two pieces of that work are valuable on their own and ship here:

1. The **vectorize task reorder** is a general safety improvement, unrelated to scope.
2. The **README "Localization" section** turns the existing capability into a discoverable feature and neutralizes competitor positioning that markets "locale-scoped semantic search" as a differentiator.

A third small piece — a **Roadmap line** about scope-aware identity — converts the parked spec into a market-research signal: if users file issues citing it, that surfaces real demand and unparks the spec.

## Change 1: Vectorize Task Reorder

### Problem

`src/tasks/vectorize.ts` currently runs in this order:

```
deleteChunks → toKnowledgePool → validateChunkData → embed → storeChunk
```

The destructive step (`deleteChunks`) happens first. Any failure in `toKnowledgePool`, validation, or the external embedding API leaves the doc with **no embeddings at all** until someone fixes the underlying issue and re-triggers. The most common real-world cause is a transient embedding-provider failure (rate limit, network blip, malformed input), which silently wipes a doc's chunks until the next save.

### Fix

Reorder to:

```
toKnowledgePool → validateChunkData → embed → deleteChunks → storeChunk
```

The destructive step now runs only after we have valid embeddings ready to insert. A rate-limit error fails the task without touching the DB; the next retry rebuilds cleanly with the existing chunks intact in the meantime.

Concretely, in [src/tasks/vectorize.ts:83-123](src/tasks/vectorize.ts#L83-L123), move the `await adapter.deleteChunks(...)` call from before `toKnowledgePoolFn(...)` to just before the `Promise.all` over `storeChunk(...)`.
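The reordered flow can be sketched as follows. This is an illustration under stated assumptions, not the contents of `src/tasks/vectorize.ts`: `Chunk`, `Adapter`, `createMemoryAdapter`, and this `vectorize` signature are simplified stand-ins invented for the example, and the in-memory adapter replaces the real database.

```typescript
// Sketch only: simplified stand-ins, not the plugin's actual types or task.
type Chunk = { docId: string; text: string; embedding: number[] }

interface Adapter {
  deleteChunks(docId: string): Promise<void>
  storeChunk(chunk: Chunk): Promise<void>
}

// In-memory adapter so the example runs without a database.
function createMemoryAdapter(store: Chunk[]): Adapter {
  return {
    async deleteChunks(docId) {
      for (let i = store.length - 1; i >= 0; i--) {
        if (store[i].docId === docId) store.splice(i, 1)
      }
    },
    async storeChunk(chunk) {
      store.push(chunk)
    },
  }
}

async function vectorize(
  docId: string,
  texts: string[],
  embed: (texts: string[]) => Promise<number[][]>,
  adapter: Adapter,
): Promise<void> {
  // 1. Non-destructive work first: if embed() throws, nothing below runs.
  const embeddings = await embed(texts)
  if (embeddings.length !== texts.length) {
    throw new Error('embedding count does not match chunk count')
  }
  // 2. Only now remove the old chunk set...
  await adapter.deleteChunks(docId)
  // 3. ...and write the replacements.
  await Promise.all(
    texts.map((text, i) => adapter.storeChunk({ docId, text, embedding: embeddings[i] })),
  )
}

async function demo(): Promise<number> {
  const store: Chunk[] = [{ docId: '1', text: 'Original title', embedding: [0.5] }]
  const failingEmbed = async (_texts: string[]): Promise<number[][]> => {
    throw new Error('Simulated embedding provider failure')
  }
  // The task fails, but the failure happens before deleteChunks is reached.
  await vectorize('1', ['Updated title'], failingEmbed, createMemoryAdapter(store)).catch(
    () => undefined,
  )
  return store.length
}

demo().then((remaining) => console.log(remaining)) // logs 1: the original chunk survives
```

With the old ordering, the same failing `embed` call would have run after the delete and left the store empty.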

### Residual gap (out of scope)

A window remains between `deleteChunks` and the end of the `storeChunk` `Promise.all`: a failure there can still leave the doc partially embedded. Closing this fully needs an adapter-level transaction across delete+store. That is a separate, larger change; the reorder alone removes the much more common failure mode (pre-delete failures) at near-zero cost.
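To make the residual window concrete, here is a small sketch of how a `Promise.all` over writes can end in a partial state. `storeAll` and the failing `chunk-2` write are hypothetical and exist only for this illustration; they are not the plugin's adapter code.

```typescript
// Sketch only: a fake store where one specific write fails, to show the partial state.
async function storeAll(texts: string[], sink: string[]): Promise<boolean> {
  const storeChunk = async (text: string): Promise<void> => {
    if (text === 'chunk-2') throw new Error('Simulated write failure')
    sink.push(text)
  }
  try {
    // All writes start together; one rejection does not undo the others.
    await Promise.all(texts.map((t) => storeChunk(t)))
    return true
  } catch {
    return false
  }
}

const sink: string[] = []
storeAll(['chunk-1', 'chunk-2', 'chunk-3'], sink).then((ok) => {
  // ok is false, and sink holds chunk-1 and chunk-3 but not chunk-2.
  console.log(ok, sink)
})
```

Because the old chunks have already been deleted at this point, only a transaction spanning delete and store would restore the previous state.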

### Bulk embed path

[src/tasks/bulkEmbedAll.ts](src/tasks/bulkEmbedAll.ts) uses [src/utils/deleteDocumentEmbeddings.ts](src/utils/deleteDocumentEmbeddings.ts) at batch-completion time. The same reorder principle applies: the delete should happen only after the batch result has been validated and the embeddings are ready to write. Planning phase should map the exact call site; the change is conceptually identical to the per-doc path.

## Change 2: README "Localization" Section

### Problem

A Payload developer evaluating vector-search plugins wants to know whether the plugin supports multi-locale content. Today's README has no "Localization" anchor in the TOC, no example, no mention of the `where` filter pattern for locale-aware search. A reasonable evaluator concludes the plugin doesn't support i18n and chooses a competitor that markets the feature explicitly — even though the underlying capability is identical.

### Fix

Add a new top-level section between [Chunkers](README.md#chunkers) and [Bulk Embeddings API](README.md#bulk-embeddings-api), titled **"Localization (i18n)"**, that walks through the recommended pattern end-to-end:

1. **Declare `locale` as a required extension field** on the knowledge pool.
2. **Iterate locales inside `toKnowledgePool`**, returning all-locale chunks tagged with `locale`. Provide a working snippet using `payload.findByID({ locale })`.
3. **Filter at search time** with `where: { locale: { equals: req.locale } }` (link to the existing [Metadata Filtering](README.md#metadata-filtering-where) section).
4. **Note the tradeoff**: every edit re-embeds every locale together. For most CMS workloads this is a non-issue (edits are infrequent, embeddings are cheap). If a user's workflow can't tolerate this, point them at the Roadmap line (Change 3) to file an issue.

Add the section to the TOC directly after the `Chunkers` entry (between `Chunkers` and `Bulk Embeddings API`, matching the section's position in the body) and add a Features bullet near the top of the README so the capability is discoverable from the first scroll. Suggested bullet: `🌍 **Localization (i18n)** — first-class pattern for embedding and searching multi-locale Payload content.`

The section is ~50 lines of prose plus the snippet. Self-contained; no other README rewrites required.

## Change 3: Roadmap Signal for Scope-Aware Identity

### Problem

The parked spec is a complete design for a real-but-niche capability. Burying it in the archive folder means we never hear from the users who would benefit. We want a low-cost mechanism that surfaces real demand without committing to build.

### Fix

Add one line to the **Help wanted** subsection of [README.md#roadmap](README.md#roadmap) (around line 1021-1025):

> - **Scope-aware chunk identity** — `(sourceCollection, docId, …scopeFields)` as identity for advanced editorial workflows: draft/published with locale, per-tenant isolation, A/B variants. Design is drafted (see `docs/plans/archive/`). Waiting on a real use case before building — open an issue if this would unblock you.

This converts the parked spec into a market-research instrument. Issues citing it go straight to the prioritization queue, and the link to the archived design gives interested users (and future-us) a starting point.

## Out of Scope

- The `scopeKey` config field, contract redesign, per-scope delete-and-replace algorithm, and adapter changes. All preserved in the archived spec; deferred pending user demand.
- Adapter-level transactions across delete+store.
- Backfill tooling for any future scope-key opt-in.

## Testing

**Change 1 (reorder):**

- New integration test in `dev/specs/`: vectorize a doc, then trigger another vectorize where `realTimeIngestionFn` is mocked to throw. Assert that the existing chunks for the doc are still present after the failure. (This is the test that exercises the bug the reorder fixes; it should fail on the current `main` and pass after the change.)
- Re-run existing vectorize specs to confirm no regression for the happy path.

**Changes 2 and 3 (README only):**

- No code tests. Manual review of rendered Markdown (GitHub preview) before merge to verify the TOC anchor resolves and the snippet is copy-pasteable.

## Release Notes

- Patch or minor bump (no API change). Changelog entry framing:
> **Vectorize task safety:** embedding-provider failures no longer wipe a doc's existing embeddings. The vectorize task now generates, validates, and embeds chunks before deleting the old chunk-set, so transient errors leave the previous chunks intact for the next retry.
>
> **Localization docs:** new README section covers the recommended pattern for embedding and searching multi-locale Payload content using extension fields and the existing `where` filter.