Add vector_search_indexes resource (direct engine)#5123
Add vector_search_indexes resource (direct engine)#5123janniklasrose wants to merge 45 commits intomainfrom
Conversation
Approval status: pending
|
1943af9 to
87018ce
Compare
) ## Changes Persist `endpoint_uuid` in state and detect identity drift on `vector_search_endpoints`. The endpoint name is stable but its UUID changes if the endpoint is deleted and recreated by name (e.g. via the workspace UI). Without persisting the UUID: - The bundle silently rebound permissions to a different backing endpoint without recreating the endpoint resource. - Anything else referencing `endpoint_uuid` (most importantly the permissions object_id, but also indexes added on top in the next PR) raced the recreate. `VectorSearchEndpointState` now embeds `vectorsearch.CreateEndpoint` and adds `EndpointUuid`. `DoCreate` records the UUID from the create response; `DoUpdate` copies it from `entry.RemoteState` so unrelated updates (e.g. `min_qps`) don't blank it out. `OverrideChangeDesc` classifies `endpoint_uuid` drift as `Recreate` when saved differs from remote, `Skip` otherwise. `drift/recreated_same_name` flips from a "badness snapshot" (which captured the old behavior of permissions silently rebinding) to the recreate behavior, with a permissions block on the endpoint to verify the cascade rebinds correctly. `drift/min_qps/out.plan.direct.json` regenerates to include the new `endpoint_uuid` skip entry in the detailed plan. ## Why Splitting this out of the larger `vector_search_indexes` PR ([#5123](#5123)) so it can land independently. The index PR builds on the persisted UUID for orphan detection, but the endpoint UUID work stands on its own and is useful regardless. ## Tests - `make fmtfull`, `make checks`, `make lintfull` — clean. - `make test` — green (`libs/apps/runlocal` needed `NODE_OPTIONS=` for the harness leak; unrelated). `bundle/internal/schema TestRequiredAnnotationsForNewFields` panics, which is failing on `main` for unrelated reasons. - `go test ./acceptance -run 'TestAccept/bundle/resources/vector_search_endpoints'` — all green, including the flipped `drift/recreated_same_name`. _This PR was written by Claude Code._
- invariant vector_search_index config now creates the required endpoint in-bundle so the testserver's endpoint-existence check passes - drift/budget_policy now asserts that the remote drift is ignored (matches ignore_remote_changes for budget_policy_id: effective value may include inherited workspace policy, not user-set) - vector_search_indexes/basic URL updated to match the current catalog.schema.name-based InitializeURL behavior Co-authored-by: Isaac
Covers the case where an index is deleted outside the bundle: the next plan should propose creation and the next deploy restores it. Mirrors acceptance/bundle/resources/volumes/remote-delete and closes the missing drift coverage for indexes (endpoints have drift/budget_policy and drift/min_qps; indexes had none). Co-authored-by: Isaac
…ectorIndexRequest make lint surfaces exhaustruct issues on changed packages. Explicitly initialize the zero-value fields introduced by the prior two commits (EndpointUuid on the new state type, IndexSubtype on the existing RemapState literal). Co-authored-by: Isaac
The name is a 3-part UC identifier (catalog.schema.index) where catalog and schema are external references the bundle does not own. Prefixing the whole name turned "main.default.my_index" into "dev_user_main.default.my_index", which is not a valid catalog. Prefix only the leaf component, matching how references work: "main.default.my_index" -> "main.default.dev_user_my_index". Co-authored-by: Isaac
Exercises the fix from the previous commit: for
vector_search_indexes.vs_index.name = my_catalog.${resources.schemas.my_schema.name}.my_index
only the leaf (my_index) is prefixed, the catalog is preserved as-is,
and the schema reference stays intact — it resolves at deploy time to
the already-prefixed schema, yielding the expected 3-part name.
Co-authored-by: Isaac
Indexes are UC-only (delta sync against a UC table, UC grants as ACL). Mirrors the flag already set on vector_search_endpoints so these tests are skipped on non-UC cloud runs instead of failing on missing catalog. Co-authored-by: Isaac
If the endpoint an index is attached to is deleted out-of-band, the index still exists by name but its backing endpoint is gone. The planner saw the index's remote state as valid and produced "skip", then deploy recreated the endpoint (new UUID) and left the index orphaned against the old UUID — on the next deploy attempt the user saw "index already exists" with no way to resolve short of manual deletion. VectorSearchIndexState now persists endpoint_uuid alongside the create request. DoRead looks it up from the endpoint service (the index API doesn't return it). OverrideChangeDesc classifies saved vs remote endpoint UUID drift as Recreate so the plan correctly shows "recreate vector_search_indexes.*" when the endpoint has been replaced, and the apply path delete+creates the index against the new endpoint. Covered by acceptance/.../vector_search_indexes/drift/orphaned_endpoint. Co-authored-by: Isaac
Recreate of a Vector Search index re-runs the full embedding pipeline (Delta Sync) or drops every upserted vector (Direct Access). Both can take significant time and cost. Bring vector_search_indexes into the same destructive-action prompt path used for schemas, pipelines, volumes, dashboards, and the Lakebase resources, with a message explaining the cost. Acceptance scripts that recreate indexes now pass --auto-approve, and all VS index test outputs pick up the prompt message at destroy time. Co-authored-by: Isaac
DELETE on a Vector Search index is asynchronous: the backend tears down the embedding pipeline over a few minutes, and any operation on the same name during that window returns "Index ... is currently pending deletion". Recreate (delete+create on the same name) hit this on every deploy. DoDelete now polls GetIndex until it returns 404, so a follow-up DoCreate sees a clean slate. Timeout is 15 minutes, matching observed worst-case teardown. Also harden recovery from a partial recreate: if Create after Delete fails, apply.Recreate now drops the state entry instead of leaving it with an empty ID. Pre-fix state files that already carry an empty ID are tolerated by the planner — empty ID is treated the same as a missing entry, so the next plan re-creates the resource cleanly instead of failing with "invalid state: empty id". Testserver now models the async deletion window: DELETE moves the index into a pending buffer, CREATE during pending returns the backend's exact error, and the next GET (the wait loop's poll) clears the buffer. The recreate/index_type acceptance test exercises the full path. Co-authored-by: Isaac
2b22f02 to
44ade3f
Compare
Previously lookupEndpointUuid swallowed all non-404 errors and returned "",
which would feed empty remoteUuid into OverrideChangeDesc and propose a
destructive Recreate ("endpoint replaced out-of-band") on transient or
permission errors. The Recreate is dangerous: Delta Sync re-runs the
embedding pipeline, and Direct Access loses all upserted vectors.
Now the helper returns (string, error): 404 maps to ("", nil) — the orphan
signal — and any other error is propagated through DoRead/DoCreate so the
plan fails loudly instead of misclassifying it as drift.
Document the OverrideChangeDesc divergence from vector_search_endpoint
(which requires remoteUuid != ""): for indexes, an empty remoteUuid is the
orphan signal, and the lookup contract guarantees that case is unambiguous.
Add a Badness-marked test that deploys a bundle with both a vector_search_endpoint and a vector_search_index referencing it, then changes the endpoint_type to trigger an endpoint Recreate. The plan correctly recreates the endpoint but leaves the dependent index unchanged, so on a real workspace the endpoint delete would either fail (indexes still attached) or orphan the index. Root cause is in the planner (bundle/direct/bundle_plan.go): there is no logic to propagate Recreate from a dependency to its dependents. This is a framework-level concern that affects more than just VS, so it's deferred to a follow-up. The Badness entry documents the gap.
Add a Badness-marked validate test showing that the name_prefix preset
does not rewrite a vector_search_indexes.*.endpoint_name literal that
points at a bundle-managed (and therefore prefixed) endpoint. The output
shows vs_endpoint -> prefix_vs_endpoint while vs_index_literal still
targets the unprefixed name vs_endpoint.
The DABs idiom is to use ${resources.vector_search_endpoints.X.name}
(captured by vs_index_ref in the same fixture). That form resolves
correctly to the prefixed name at plan/deploy time, so users have a
working pattern. The literal form silently breaks though, and the
preset has enough information to rewrite it; tracked as Badness for a
follow-up fix in apply_presets.go.
Mirror the existing vector_search_endpoint bind test: pre-create both endpoint and index, bind the index into the bundle, deploy, unbind, and destroy. Verifies the index survives unbind+destroy as expected. Required by bundle/direct/dresources/README.md for new resource types.
The placeholder substitutes index_type, not primary_key — the original name was misleading.
Drop the Terraform-provider justification (already implied by "direct engine only") and the long list of internal mechanics. Keep the entry focused on what customers see.
CreateIndex returns immediately with metadata of an index whose embedding pipeline is still provisioning; queries against an index that isn't ready fail. Implement WaitAfterCreate so dependent resources (and the next plan) see a usable index. 75-minute timeout matches the terraform provider. Co-authored-by: Isaac
| # columns_to_sync is request-only in the create spec and not returned by the read API. | ||
| - field: delta_sync_index_spec.columns_to_sync | ||
| reason: input_only |
There was a problem hiding this comment.
I made a PR (under review by VS team) to return this field
| // createIndexTimeout caps the wait for an index to become ready after creation. | ||
| // Delta sync indexes do an initial sync from the source table, which can stretch | ||
| // out for large tables. Matches the terraform provider's defaultIndexProvisionTimeout. | ||
| // https://github.com/databricks/terraform-provider-databricks/blob/c61a32300445f84efb2bb6827dee35e6e523f4ff/vectorsearch/resource_vector_search_index.go#L19 |
There was a problem hiding this comment.
update reference after databricks/terraform-provider-databricks#5598 is merged
|
|
||
| trace $CLI bundle summary | ||
|
|
||
| trace $CLI vector-search-endpoints create-endpoint "${endpoint_name}" STANDARD | jq '{id, name, endpoint_type}' |
There was a problem hiding this comment.
Could you create it as part of the same bundle and refer to it in the index?
There was a problem hiding this comment.
no strong feeling either way.
- pro: they are strongly coupled so should be declared together
- con: precedent is to have resources/RESOURCE/*/databricks.yml.tmpl only contain the resource
There was a problem hiding this comment.
precedent is to have resources/RESOURCE/*/databricks.yml.tmpl only contain the resource
I don't think we are strict about it, we have already resources combining both (see catalogs or database_catalogs for example).
Having them together helps us to test the actual use case users will have + cleaning up is a bit easier too
| // provisioning; queries against an index that isn't ready fail. Blocking here | ||
| // lets dependent resources (and the next plan) see a usable index. | ||
| func (r *ResourceVectorSearchIndex) WaitAfterCreate(ctx context.Context, config *VectorSearchIndexState) (*VectorSearchIndexRemote, error) { | ||
| index, err := retries.Poll(ctx, createIndexTimeout, func() (*vectorsearch.VectorIndex, *retries.Err) { |
There was a problem hiding this comment.
This can be a very long operation, right? Shall we maybe not wait for index to be ready and then later introduce support for lifecycle.started here which will wait?
There was a problem hiding this comment.
This is following terraform's 75min timeout (see databricks/terraform-provider-databricks#5598 that I made by request from VS team). Don't we always wait for creation for every resource? We can also keep WaitAfterCreate but make it opt-in (maybe use an experimental flag)?
There was a problem hiding this comment.
We don't for clusters and sql warehouses for example (in both direct and TF mode) and we now make it opt in with lifecycle.started option #5150
My concern with waiting is that if it's a very long operation users don't have a way out of it right now
But implementation wise I'm fine with both to be honest, we can iterate on it any time
| } | ||
|
|
||
| func (r *ResourceVectorSearchIndex) DoUpdate(ctx context.Context, id string, config *VectorSearchIndexState, entry *PlanEntry) (*VectorSearchIndexRemote, error) { | ||
| // Vector search indexes have no update API; all field changes trigger recreation via resources.yml. |
There was a problem hiding this comment.
What will happen if the new field is added but it can be updatable, how do we not leave it unnoticed?
Changes
Adds
vector_search_indexesas a first-class DABs resource on the direct engine, alongside the existingvector_search_endpoints. Direct engine only — vector search has no Terraform provider.What's included:
bundle/config/resources/vector_search_index.go(withgrants) andbundle/direct/dresources/vector_search_index.go(state, lifecycle, drift classification).RemapStateround-tripsindex_subtypeso a populated remote subtype isn't classified as drift on the next plan.table.recreate_on_changesfor immutable spec fields (name,endpoint_name,index_type,index_subtype,primary_key,delta_sync_index_spec,direct_access_index_spec);delta_sync_index_spec.columns_to_syncmarkedignore_remote_changes(request-only field — see follow-up note below). The index API has no rename or update path, so any config-side change has to round-trip through delete + create.endpoint_uuidof the endpoint it was created against.DoReadlooks up the current endpoint UUID by name; if the endpoint was deleted out-of-band the lookup returns""andOverrideChangeDescclassifies the saved-vs-remote mismatch asRecreate. Builds on the endpoint UUID persistence merged in Persist endpoint UUID for vector_search_endpoints drift detection #5127.WaitAfterDeleteadapter method (sibling toWaitAfterCreate/WaitAfterUpdate). For VS indexes it pollsGetIndexuntil 404 (15-minute cap).apply.RecreaterunsDoDelete → DeleteState → WaitAfterDelete → DoCreate → SaveState → WaitAfterCreate, so a wait-time failure leaves the bundle consistent. Replaces the priorSaveState("", nil, nil)placeholder that producedinvalid state: empty idplanning failures on partial recreate.bundle/phases/. The message intentionally covers both Delta Sync ("re-runs the embedding pipeline") and Direct Access ("upserted vectors lost") in one paragraph — picking a type-specific message from the bundle config would be wrong on type changes (DELTA_SYNC→DIRECT_ACCESSrecreates would describe the destination type while the actual teardown is of the source type).catalog.schema.name, since catalog and schema are external references (the previous behavior produced invalid names likedev_jan_main.default.my_index). The mutator skips names that still carry literal${...}tokens, since the leaf split would otherwise inject the prefix inside the trailing ref expression itself.Ready: trueimmediately, matching the convention used by every other slow resource the testserver fakes (endpoints →ONLINE, database instances →AVAILABLE, apps →RUNNING).index_type/ spec-block consistency is intentionally not validated client-side — the CreateIndex API rejects mismatched combinations at deploy time, and replicating that check in DABs would just duplicate backend logic.Why
The direct engine recently gained
vector_search_endpoints(#4887). This PR extends the support to indexes, which were the missing half. Along the way it surfaces and fixes a number of issues:recreatedeploys hit it every time. Without a wait, every recreate failed on the immediate Create.apply.Recreatewas writing a malformed empty-ID state entry as its "delete state" step, which then poisoned the next plan withinvalid state: empty id.Follow-ups
delta_sync_index_spec.columns_to_syncis request-only in the SDK today: the field is accepted onCreatebut theGetresponse doesn't echo it back, which is why we mark itignore_remote_changeshere. There's an open backend PR to exposecolumns_to_syncon the read path; once the SDK is regenerated against that, we can drop theignore_remote_changesentry and let normal drift detection handle the field.vector_search_endpoints.budget_policy_iddrift (effective vs. requested) and the SDK doc-comment forvector_search_endpoints.usage_policy_idare intentionally not in this PR — both will be addressed by the next SDK bump and the corresponding./task generate-schemaregen.Tests
./task fmt,./task checks,./task lint— all clean../task test— unit tests green acrossbundle/....TestVectorSearchIndexNameWithUnresolvedRefsLeftAloneinapply_target_mode_test.goexercises the leaf-prefix skip on${var.catalog}.${var.schema}.${var.index}.acceptance/bundle/resources/vector_search_indexes/:basic,drift/columns_to_sync,drift/deleted_remotely,drift/orphaned_endpoint,recreate/index_type,recreate/mixed_types,grants/select.recreate/index_type/out.requests.recreate.direct.json) capturesGET → DELETE → GET → POSTwith--getenabled inprint_requests.py. The middleGETis theWaitAfterDeletepoll; if a future change drops the wait the regenerated capture loses that line and the test fails.acceptance/bundle/validate/presets_name_prefixcovers the leaf-only name prefix on a 3-part index name.acceptance/bundle/invariant/configs/vector_search_index.yml.tmplexercises the resource through the invariant matrix; the testserver enforces endpoint existence on index create.--profile tmpagainst staging across initial deploy / drift / recreate / destroy.This PR was written by Claude Code.