keyspace, schedule: make keyspace creation atomic #10467 by bufferflies · Pull Request #10492 · tikv/pd

bufferflies · 2026-03-26T02:15:36Z

What problem does this PR solve?

Issue Number: Close #10461

What is changed and how does it work?

Make keyspace creation persist keyspace metadata, keyspace-group membership,
and region label rules in one transaction so create failures no longer leave
partial storage state behind. Move in-memory keyspace and keyspace-group cache
updates to post-commit callbacks so failed transactions do not leak committed
state into memory.
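
A minimal sketch of the pattern described above, assuming a toy in-memory store in place of etcd. The names `txnOp`, `txnCb`, `memStore`, and `runInTxn` are hypothetical stand-ins for illustration and are not PD's actual types or signatures.

```go
package main

import (
	"errors"
	"fmt"
)

// txnOp stages writes into a pending batch (hypothetical type).
type txnOp func(pending map[string]string) error

// txnCb runs after the transaction outcome is known (hypothetical type).
type txnCb func(err error)

// memStore is a toy key-value store standing in for the real storage backend.
type memStore struct{ data map[string]string }

// runInTxn applies every op to a scratch batch and commits only if all succeed.
func (s *memStore) runInTxn(ops []txnOp) error {
	pending := map[string]string{}
	for _, op := range ops {
		if err := op(pending); err != nil {
			return err // nothing has reached s.data
		}
	}
	for k, v := range pending {
		s.data[k] = v
	}
	return nil
}

// createKeyspace stages meta, group membership, and a label rule in one
// transaction, then updates the in-memory cache only from a callback.
func createKeyspace(s *memStore, cache map[string]bool, name string, failGroup bool) error {
	ops := []txnOp{
		func(p map[string]string) error {
			p["meta/"+name] = "ENABLED"
			return nil
		},
		func(p map[string]string) error {
			if failGroup {
				return errors.New("save keyspace group failed")
			}
			p["group/"+name] = "default"
			return nil
		},
		func(p map[string]string) error {
			p["rule/"+name] = "keyspace-range"
			return nil
		},
	}
	cbs := []txnCb{
		func(err error) {
			if err == nil {
				cache[name] = true // publish to memory only after commit
			}
		},
	}
	err := s.runInTxn(ops)
	for _, cb := range cbs {
		cb(err)
	}
	return err
}

func main() {
	s := &memStore{data: map[string]string{}}
	cache := map[string]bool{}
	fmt.Println(createKeyspace(s, cache, "ks1", true), len(s.data), len(cache))
	fmt.Println(createKeyspace(s, cache, "ks1", false), len(s.data), len(cache))
}
```

In this sketch a failure in the middle step leaves both the store and the cache empty, so the same create request can simply be retried.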

Check List

Tests

Unit test

Code changes

Has persistent data change

Side effects

Increased code complexity

Release note

Fix partial keyspace creation state when later create steps fail after the
initial metadata transaction succeeds.

Summary by CodeRabbit

Refactor
- Keyspace creation now runs as a single transactional pipeline with unified validation and optional explicit IDs; reserve/init avoids pre-splitting and pre-creating disabled meta.
Bug Fixes
- Improved robustness: retries for bootstrap creation, safer handling of duplicate/partial failures, and post-load state checks that may disable keyspaces with invalid region bounds.
Metrics
- Removed two fine-grained step metrics; tracer records fewer, clearer creation steps.
Tests
- Added/updated tests and helpers for repeated init/create and transactional flows.
Labeling
- Label-rule validation made exported and reused consistently.

Signed-off-by: tongjian <1045931706@qq.com>

ti-chi-bot · 2026-03-26T02:15:43Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign nolouch for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-03-26T02:15:55Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Consolidates keyspace creation into a single transactional pipeline with txn-op + post-commit callbacks, replaces direct group cache/store writes with txn-op producers, updates tracer/metrics and labeler APIs, adds load-time keyspace state check, and adapts tests to the new txn-op/callback pattern.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Keyspace core: `pkg/keyspace/keyspace.go`, `pkg/keyspace/metrics.go` | Rewrote create flow: added `createKeyspaceInner`/`createKeyspaceWithoutCheck`, moved to txn-op + post-commit callbacks (`saveNewKeyspaceTxnOp`, `saveKeyspaceRegionLabelerTxnOp`, `updateKeyspaceForGroupTxnOp`), removed separate enable/update steps, adjusted tracer steps and added `userKind`/`waitSplit` fields and failpoint support. |
| Group management: `pkg/keyspace/tso_keyspace_group.go` | Replaced exported `UpdateKeyspaceForGroup` with txn-op/callback producers (`updateKeyspaceForGroupTxnOp`, `saveKeyspaceGroupTxnOp`); added in-lock prechecks and success-only cache update in callbacks. |
| Keyspace tests: `pkg/keyspace/keyspace_test.go`, `pkg/keyspace/tso_keyspace_group_test.go` | Added `TestInitTwice`, `TestCreateSameKeyspaceTwice`; adjusted expected create-keyspace metric steps; added helpers to execute txn ops and invoke callbacks (`saveNewKeyspaceForTest`, `updateKeyspaceForGroupForTest`). |
| Labeler API & call sites: `pkg/schedule/labeler/rules.go`, `pkg/schedule/labeler/labeler.go`, `pkg/schedule/labeler/plan.go`, `pkg/schedule/labeler/labeler_test.go` | Exported `LabelRule.CheckAndAdjust()` (was `checkAndAdjust()`), updated call sites; added `RegionLabeler.GetRuleStorage()` and `SaveRuleWithoutTxn()` and switched `SetLabelRule` to use it. |
| Handler load-time check: `server/apiv2/handlers/keyspace.go` | Handlers now call `manager.CheckKeyspaceState(meta)` after loading metadata to possibly disable an `ENABLED` keyspace in-memory when region bounds aren’t satisfied. |
| Tests (integration): `tests/integrations/mcs/tso/keyspace_group_manager_test.go` | Use loaded keyspace IDs instead of hardcoded IDs when asserting group splits and membership. |
| Misc / manifest: `go.mod` | Minor dependency/manifest adjustments. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Manager as KeyspaceManager
    participant Store
    participant Labeler
    participant GroupMgr as GroupManager

    Client->>Manager: CreateKeyspace(request)
    Manager->>Manager: validate name, config, optional ID
    Manager->>Store: RunInTxn(txnOps: save meta/id map, region labeler rule save, group mutation)
    Note right of Store: atomic commit of meta, id map, labeler, group
    Store-->>Manager: txn commit success
    Manager->>Labeler: post-txn callback: register/update region label rules
    Manager->>GroupMgr: post-txn callback: update in-memory group membership
    Manager-->>Client: return created keyspace (ENABLED)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

keyspace: add create keyspace metrics and logs #10263: overlaps with create-keyspace control flow and tracer/metrics changes.
keyspace: roll back keyspace cache when etcd save failed #10240: related to keyspace-group update flow and moving to txn-op + callback semantics.
mcs: add affinity redirect and scheduling watcher #10042: touches labeler API and rule validation adjustments.

Suggested labels

lgtm, approved

Suggested reviewers

rleungx
JmPotato

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely describes the main change: making keyspace creation atomic, which directly addresses issue `#10467`. |
| Description check | ✅ Passed | The description provides the issue reference, explains the change using a commit message format, and checks appropriate boxes (unit test, persistent data change, code complexity increase) with a relevant release note. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue `#10461` by consolidating keyspace creation into a single transaction, preventing partial storage state, and moving cache updates to post-commit callbacks. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the atomicity objective: keyspace/keyspace-group creation refactoring, transaction-operation helpers, metrics updates, and related test modifications with no extraneous changes. |

bufferflies · 2026-03-26T02:16:54Z

/run-check-issue-triage-complete

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

pkg/keyspace/metrics.go (1)
17-25: Import order does not follow gci/goimports convention.

The imports should follow: stdlib | third-party | pingcap | tikv/pd. Currently github.com/tikv/pd/pkg/storage/endpoint (line 21) is placed between third-party imports (prometheus and zap).

As per coding guidelines: "Use gci/goimports ordering: stdlib | third-party | pingcap | tikv/pd".
♻️ Suggested import reordering
 import (
 	"time"
 
 	"github.com/prometheus/client_golang/prometheus"
-	"github.com/tikv/pd/pkg/storage/endpoint"
 	"go.uber.org/zap"
 
 	"github.com/pingcap/log"
+
+	"github.com/tikv/pd/pkg/storage/endpoint"
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/keyspace/metrics.go` around lines 17 - 25, Reorder the imports in
pkg/keyspace/metrics.go to follow the gci/goimports convention (stdlib |
third-party | pingcap | tikv/pd): keep "time" first, group third-party imports
("github.com/prometheus/client_golang/prometheus", "go.uber.org/zap") next, then
pingcap ("github.com/pingcap/log"), and finally tikv/pd
("github.com/tikv/pd/pkg/storage/endpoint"); adjust the import block ordering
accordingly and run goimports/gci to verify formatting.
server/apiv2/handlers/keyspace.go (1)
174-179: Consider extracting duplicated state adjustment logic.

The same logic for adjusting keyspace state appears in both LoadKeyspace (lines 174-179) and LoadKeyspaceByID (lines 209-213). Consider extracting to a helper function to reduce duplication.

Also note: mutating meta.State directly modifies the proto object. Verify this doesn't affect any in-memory caches that might hold a reference to the same object returned by LoadKeyspace.
♻️ Suggested helper extraction
// adjustKeyspaceStateForResponse adjusts the keyspace state for API response
// if region split is pending.
func adjustKeyspaceStateForResponse(meta *keyspacepb.KeyspaceMeta, svr *server.Server, manager *keyspace.Manager) {
	if meta != nil && meta.State == keyspacepb.KeyspaceState_ENABLED {
		if svr.GetKeyspaceConfig().ToWaitRegionSplit() && !manager.CheckKeyspaceRegionBound(meta.Id) {
			meta.State = keyspacepb.KeyspaceState_DISABLED
		}
	}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/apiv2/handlers/keyspace.go` around lines 174 - 179, The duplicated
logic that mutates meta.State in LoadKeyspace and LoadKeyspaceByID should be
extracted into a helper (e.g., adjustKeyspaceStateForResponse) that takes the
*keyspacepb.KeyspaceMeta, *server.Server and *keyspace.Manager and encapsulates
the ToWaitRegionSplit()/CheckKeyspaceRegionBound(meta.Id) check and sets State
to DISABLED for responses; to avoid mutating shared in-memory proto caches,
ensure you operate on a defensive copy of meta (clone the proto or create a
shallow copy) before calling the helper so the in-memory cached meta is not
modified.
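
A rough sketch of the defensive-copy idea, assuming a simplified `keyspaceMeta` struct in place of the real protobuf message. `stateForResponse` and the `splitReady` flag are hypothetical; a real proto message carries maps and internal fields, so it would need a proper protobuf clone rather than the plain struct copy shown here.

```go
package main

import "fmt"

// keyspaceMeta is a hypothetical, flattened stand-in for the proto message.
type keyspaceMeta struct {
	ID    uint32
	State string
}

// stateForResponse returns a copy adjusted for the API response and leaves
// the cached object untouched.
func stateForResponse(cached *keyspaceMeta, splitReady bool) *keyspaceMeta {
	out := *cached // value copy; safe here because the struct has no reference fields
	if out.State == "ENABLED" && !splitReady {
		out.State = "DISABLED"
	}
	return &out
}

func main() {
	cached := &keyspaceMeta{ID: 1, State: "ENABLED"}
	resp := stateForResponse(cached, false)
	fmt.Println(resp.State, cached.State)
}
```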

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/keyspace/keyspace.go`:
- Around line 185-197: initReserveKeyspace currently returns any error from
createKeyspaceWithoutCheck, which now returns ErrKeyspaceExists and causes
bootstrap to fail on restart; update initReserveKeyspace to treat
ErrKeyspaceExists as benign by checking the returned error from
createKeyspaceWithoutCheck and returning nil when err == ErrKeyspaceExists (but
return other errors normally), keeping the rest of the tracing and logic
unchanged; reference createKeyspaceWithoutCheck, ErrKeyspaceExists and
initReserveKeyspace when making the change.
- Around line 232-248: The CreateKeyspaceByID path currently skips validateID()
when caller-supplied ids are used; update the branch that sets newID = ids[0] to
call validateID(newID) and return an error if validation fails (in the same
manner as the allocator path), before calling tracer.SetKeyspace and
isProtectedKeyspaceID; apply the same change to the other similar branch around
the CreateKeyspaceByID/ids handling (the one at the 363–367 site). Use the
existing validateID, isProtectedKeyspaceID, and manager.allocID symbols to
locate and mirror the allocator-path checks so out-of-range or otherwise invalid
supplied IDs are rejected consistently.

In `@pkg/keyspace/tso_keyspace_group.go`:
- Around line 464-512: updateKeyspaceForGroupTxnOp currently mutates the live
cache entry kg before the txn runs, leaking uncommitted changes; instead, clone
the group (e.g., make a deep copy of kg), apply the opAdd/opDelete mutation to
the clone and pass that clone into saveKeyspaceGroupsTxnOp, and do not touch
m.groups while unlocked; in the returned callback (cb) only on success reacquire
m.Lock() and call m.groups[userKind].Put(clonedKg) to publish the change, and on
error do nothing (or explicitly discard the clone) so the live kg is never
mutated outside the committed-path. Ensure kg (the original from m.groups) is
never modified in updateKeyspaceForGroupTxnOp before the transaction runs.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 3eb99ae and 5ef55c9.

📒 Files selected for processing (10)

pkg/keyspace/keyspace.go
pkg/keyspace/keyspace_test.go
pkg/keyspace/metrics.go
pkg/keyspace/tso_keyspace_group.go
pkg/keyspace/tso_keyspace_group_test.go
pkg/schedule/labeler/labeler.go
pkg/schedule/labeler/labeler_test.go
pkg/schedule/labeler/plan.go
pkg/schedule/labeler/rules.go
server/apiv2/handlers/keyspace.go

Signed-off-by: tongjian <1045931706@qq.com>

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (3)

pkg/keyspace/keyspace.go (2)
233-248: ⚠️ Potential issue | 🟠 Major

Validate caller-supplied IDs before creating the keyspace.

The ids branch at Lines 233-235 bypasses validateID(), so CreateKeyspaceByID() can persist IDs that allocID() would reject. Reuse the same validation before SetKeyspace() and the protected-ID check.
💡 Minimal fix
 	if len(ids) > 0 {
 		newID = ids[0]
+		if err := validateID(newID); err != nil {
+			return nil, err
+		}
 	} else {
 		// Allocate new keyspaceID.
 		newID, err = manager.allocID()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/keyspace/keyspace.go` around lines 233 - 248, The caller-supplied branch
bypasses validation: after assigning newID (whether from ids[0] or
manager.allocID()), call the existing validateID(newID) helper and return its
error if it fails before invoking tracer.SetKeyspace(newID, name) and before the
isProtectedKeyspaceID(newID) check; in short, centralize validation on newID
(used by CreateKeyspaceByID path) immediately after newID is chosen so both
supplied IDs and allocated IDs go through validateID() prior to SetKeyspace()
and the protected-ID error logic.
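
The centralized check suggested here could look roughly like the following sketch. `pickID` and the `maxID` bound are assumptions made for illustration; the real limit and error type live in PD's own `validateID`, which the thread does not show.

```go
package main

import "fmt"

// maxID is a hypothetical upper bound used only for this sketch.
const maxID uint32 = 0xFFFFFF

// validateID rejects IDs above the assumed bound.
func validateID(id uint32) error {
	if id > maxID {
		return fmt.Errorf("illegal keyspace id %d, larger than %d", id, maxID)
	}
	return nil
}

// pickID uses the caller-supplied ID when present, otherwise allocates one,
// and validates the result on both paths before it is used anywhere else.
func pickID(alloc func() (uint32, error), ids ...uint32) (uint32, error) {
	var newID uint32
	if len(ids) > 0 {
		newID = ids[0]
	} else {
		var err error
		if newID, err = alloc(); err != nil {
			return 0, err
		}
	}
	if err := validateID(newID); err != nil {
		return 0, err
	}
	return newID, nil
}

func main() {
	alloc := func() (uint32, error) { return 7, nil }
	fmt.Println(pickID(alloc))          // allocated path
	fmt.Println(pickID(alloc, 100))     // supplied and valid
	fmt.Println(pickID(alloc, maxID+1)) // supplied and rejected
}
```
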
185-197: ⚠️ Potential issue | 🔴 Critical

Keep reserved-keyspace bootstrap idempotent.

At Line 196, initReserveKeyspace() now propagates errs.ErrKeyspaceExists from createKeyspaceWithoutCheck(). On restart/recovery the reserved keyspace is already persisted, so this turns a normal bootstrap into a startup failure. Please treat ErrKeyspaceExists as benign here, the same way the pre-alloc loop already does.
💡 Minimal fix
 	_, err = manager.createKeyspaceWithoutCheck(tracer, config, now)
-	return err
+	if err != nil && err != errs.ErrKeyspaceExists {
+		return err
+	}
+	return nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/keyspace/keyspace.go` around lines 185 - 197, initReserveKeyspace
currently returns errs.ErrKeyspaceExists from createKeyspaceWithoutCheck which
makes normal restarts fail; modify Manager.initReserveKeyspace to treat
errs.ErrKeyspaceExists as benign by checking the returned error from
manager.createKeyspaceWithoutCheck(tracer, config, now) and returning nil when
the error equals errs.ErrKeyspaceExists, otherwise returning the original error;
reference the functions Manager.initReserveKeyspace and
createKeyspaceWithoutCheck and the sentinel error errs.ErrKeyspaceExists in your
change.
pkg/keyspace/tso_keyspace_group.go (1)
488-516: ⚠️ Potential issue | 🔴 Critical

Keep updateKeyspaceForGroupTxnOp side-effect free until commit.

At Line 490, kg is the live cache entry from m.groups, but it is mutated before the returned txn runs. If a later setup step aborts before RunInTxn, or another request interleaves on the same group, uncommitted membership leaks into memory; then Lines 503-513 mutate the same shared state again without holding m.Lock(). Clone the group, persist the clone, and only publish it from a success callback that reacquires the lock.

As per coding guidelines, "Guard shared state with mutex/RWMutex; keep lock ordering consistent."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/keyspace/tso_keyspace_group.go` around lines 488 - 516, The live group
entry kg is being mutated before the transaction runs; instead, make
updateKeyspaceForGroupTxnOp operate on a deep copy (e.g., clone := kg.Clone() or
manual copy) and apply the add/delete changes to the clone only, build the txn
op from the clone, and in the success callback reacquire m.Lock()/m.Unlock() and
call m.groups[userKind].Put(clone) to publish the committed change; ensure the
rollback branch does not mutate the original shared kg and that any temporary
mutation touches only the cloned object.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/keyspace/keyspace_test.go`:
- Around line 143-150: The failpoint injection for "waitSplitKeyspaceFailed" is
behind a dead branch because manager is constructed with &mockConfig{} in
SetupTest which makes ToWaitRegionSplit() return false; update the test so the
manager runs the wait-split path (either by configuring a non-zero wait-split
setting on the mockConfig used in SetupTest or constructing manager with a
config where ToWaitRegionSplit() returns true) before enabling the failpoint, or
move the failpoint enable to a code path that always executes (e.g., immediately
before calling CreateKeyspace or inside the commit-handling path); ensure the
test calls CreateKeyspace and exercises the post-commit split path so the
failpoint actually triggers.

In `@pkg/keyspace/keyspace.go`:
- Around line 313-349: The code must not return an error after
manager.store.RunInTxn (the transaction that persisted keyspace metadata) has
committed; modify the post-commit split logic in the function that calls
manager.store.RunInTxn / tracer.OnSaveKeyspaceMetaFinished so that failures from
waitKeyspaceRegionSplit(keyspaceID) are treated as best-effort: do not return
nil,err after the txn has succeeded. Instead either (A) swallow the wait error
(log.Warn with details and set a "split pending" condition on the returned
keyspace result or an explicit status field) and return the created keyspace and
nil error, or (B) change the function's result type to include a status enum
(e.g., Created|CreatedSplitPending) and return that along with the keyspace when
waitKeyspaceRegionSplit fails. Ensure callers are updated to handle the new
status or the guaranteed non-error return when the keyspace is durable.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 5ef55c9 and 9c634d1.

📒 Files selected for processing (6)

pkg/keyspace/keyspace.go
pkg/keyspace/keyspace_test.go
pkg/keyspace/metrics.go
pkg/keyspace/tso_keyspace_group.go
pkg/keyspace/tso_keyspace_group_test.go
server/apiv2/handlers/keyspace.go

✅ Files skipped from review due to trivial changes (1)

server/apiv2/handlers/keyspace.go

coderabbitai · 2026-03-26T02:52:41Z

+	re.NoError(failpoint.Enable("github.com/tikv/pd/pkg/keyspace/waitSplitKeyspaceFailed", `return(true)`))
+	ks, err := manager.CreateKeyspace(requests[0])
+	re.NoError(err)
+	km, err := manager.LoadKeyspace(ks.Name)
+	re.NoError(err)
+	re.Equal(keyspacepb.KeyspaceState_ENABLED, km.State)
+
+	re.NoError(failpoint.Disable("github.com/tikv/pd/pkg/keyspace/waitSplitKeyspaceFailed"))


⚠️ Potential issue | 🟡 Minor

This failpoint never executes in this suite.

manager is created with &mockConfig{} in SetupTest, so ToWaitRegionSplit() is false and the waitSplitKeyspaceFailed injection at Line 143 sits behind a dead branch. This block is currently just another happy-path create, so it does not actually cover the post-commit failure path. Configure non-zero wait-split settings here, or move the injection to a path that always runs.

no, injection will also affect the region labeler setting

🧩 Analysis chain

Let me check how waitSplitKeyspaceFailed is used in the implementation.

🏁 Script executed:

#!/bin/bash
rg -n "waitSplitKeyspaceFailed" --type=go -C 10

Length of output: 3334

Let me look at more context around the waitSplitKeyspaceFailed usage and the region labeler path.

🏁 Script executed:

#!/bin/bash
# Get broader context around the saveKeyspaceRegionLabelerTxnOp and the full createKeyspaceWithoutCheck function
rg -n "saveKeyspaceRegionLabelerTxnOp\|waitSplitKeyspaceFailed\|saveKeyspaceGroupsTxnOpFailed" --type=go -A 5 -B 5

Length of output: 0

🏁 Script executed:

#!/bin/bash
# Search more broadly
rg -n "waitSplitKeyspace\|saveKeyspaceGroups\|RegionLabeler\|regionLabeler\|labeler" pkg/keyspace/keyspace.go | head -60

Length of output: 0

🏁 Script executed:

#!/bin/bash
# View the full createKeyspaceWithoutCheck function area
grep -n "waitSplitKeyspace\|saveKeyspaceGroups\|labeler\|RegionLabel\|createKeyspaceWithoutCheck\|txnOp\|txnCb" pkg/keyspace/keyspace.go | head -80

Length of output: 1357

🏁 Script executed:

#!/bin/bash
# View the full createKeyspaceWithoutCheck function
sed -n '290,375p' pkg/keyspace/keyspace.go

Length of output: 2402

🏁 Script executed:

#!/bin/bash
# View saveKeyspaceRegionLabelerTxnOp implementation
sed -n '425,480p' pkg/keyspace/keyspace.go

Length of output: 1811

@bufferflies, thanks for the clarification — I dug into the code to understand your point, but the evidence points the other way.

Looking at createKeyspaceWithoutCheck (lines 292–375 of keyspace.go):

saveKeyspaceRegionLabelerTxnOp is called at line 322, before the transaction runs — it is not guarded by waitSplit. However, it uses the skipSplitRegion failpoint (line 426), not waitSplitKeyspaceFailed. And since SetupSuite already enables skipSplitRegion suite-wide, this function returns nil, nil, nil for every test in the suite — meaning the region labeler is never actually set up in these tests.

waitSplitKeyspaceFailed is injected at line 354, strictly inside if tracer.waitSplit { (line 352). With manager built from &mockConfig{} (where ToWaitRegionSplit() returns false), tracer.waitSplit is false, so this block — and the failpoint inside it — is never reached.

These are two separate code paths using separate failpoints. waitSplitKeyspaceFailed does not influence saveKeyspaceRegionLabelerTxnOp.

The test still provides value (verifying that requests[0] can be retried successfully after the saveKeyspaceGroupsTxnOpFailed-induced failure), but the waitSplitKeyspaceFailed enable at line 150 is effectively a no-op in this configuration. Would you consider either:

Setting WaitRegionSplit: true on the local manager config in this test (so tracer.waitSplit = true and the failpoint fires), or

Adding a note in the test comment that this section only validates retry-after-failure, not the post-commit split error path?
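
To make the dead-branch argument concrete, here is a toy model in which a plain boolean hook, `injectWaitSplitFailure`, stands in for the failpoint. The `config` type and `create` function are hypothetical and do not use the `github.com/pingcap/failpoint` package.

```go
package main

import (
	"errors"
	"fmt"
)

// injectWaitSplitFailure is a hypothetical hook standing in for the failpoint.
var injectWaitSplitFailure bool

// config mirrors only the single setting that matters for this sketch.
type config struct{ WaitRegionSplit bool }

// create reports whether the injection site was reached.
func create(cfg config) (hit bool, err error) {
	// ... the transaction would have committed before this point ...
	if cfg.WaitRegionSplit {
		if injectWaitSplitFailure {
			return true, errors.New("failpoint triggered: waitSplitKeyspaceFailed")
		}
	}
	return false, nil
}

func main() {
	injectWaitSplitFailure = true
	fmt.Println(create(config{WaitRegionSplit: false})) // hook enabled but never reached
	fmt.Println(create(config{WaitRegionSplit: true}))  // hook reached
}
```

With the guard off, enabling the hook changes nothing, which is the situation described for a manager built from `&mockConfig{}`.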

🧠 Learnings used

Learnt from: bufferflies
Repo: tikv/pd
PR: 10492
File: pkg/keyspace/keyspace.go:292-301
Timestamp: 2026-03-27T03:13:21.715Z
Learning: In tikv/pd `pkg/keyspace/keyspace.go`, `createKeyspaceWithoutCheck` intentionally persists the new keyspace as `keyspacepb.KeyspaceState_ENABLED` immediately upon creation (not DISABLED as before). Keyspace state is intentionally decoupled from the region split process. `CheckKeyspaceState` only transiently flips the state on the caller's in-memory copy when `CheckKeyspaceRegionBound` returns false; it does NOT persist the change. This is by design: split readiness is a separate, read-time concern and must not be flagged as inconsistent state handling. Do not raise issues about ENABLED vs DISABLED state being different across code paths due to this pattern.

Learnt from: bufferflies
Repo: tikv/pd
PR: 10330
File: pkg/keyspace/util.go:462-485
Timestamp: 2026-03-18T13:01:12.324Z
Learning: In tikv/pd, `GetKeyspaceSplitKeys` in `pkg/keyspace/util.go` is intentionally designed to return split keys for only one keyspace at a time (a single `checker.GetKeyspaceIDInRange` call). The SplitChecker is invoked repeatedly across scheduling cycles to incrementally resolve all cross-keyspace spans, so splitting all boundaries in a single call is not the goal.

Learnt from: CR
Repo: tikv/pd
PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-02-09T06:38:09.379Z
Learning: Applies to **/*_test.go : Tests must handle failpoints using github.com/pingcap/failpoint. Do not run `go test` directly unless failpoints are enabled/disabled correctly. Use `make test`, `make basic-test`, or `make gotest` commands to auto enable/disable failpoints.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

pkg/keyspace/keyspace.go (1)
345-357: ⚠️ Potential issue | 🟠 Major

Don't report CreateKeyspace failure after the txn has already committed.

Once the transaction at line 321 succeeds, the keyspace meta, keyspace-group membership, and label rule are already durable. If waitKeyspaceRegionSplit() then fails at line 346, the caller gets nil, err even though the keyspace exists. A retry will hit ErrKeyspaceExists.

Either make the wait best-effort after commit (log and continue), or return a result that lets callers distinguish "created, split pending" from a real rollback.
💡 Suggested approach: treat split-wait as best-effort
 	// Split keyspace region.
 	if tracer.waitSplit {
 		err = manager.waitKeyspaceRegionSplit(tracer.keyspaceID)
 		failpoint.Inject("waitSplitKeyspaceFailed", func() {
 			err = errors.New("failpoint triggered: waitSplitKeyspaceFailed")
 		})
 		if err != nil {
 			log.Warn("[create-keyspace] failed to wait keyspace region split",
 				zap.Uint32("keyspace-id", keyspace.GetId()),
 				zap.String("keyspace-name", keyspace.GetName()),
 				zap.Error(err),
 			)
-			return nil, err
+			// Keyspace is already persisted; treat split-wait failure as best-effort.
+			// The keyspace is usable; splits will eventually occur.
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/keyspace/keyspace.go` around lines 345 - 357, The CreateKeyspace path
currently returns an error if manager.waitKeyspaceRegionSplit(tracer.keyspaceID)
fails after the transaction has committed; change this to treat the split-wait
as best-effort: when tracer.waitSplit is true and waitKeyspaceRegionSplit
returns an error (including failpoint injection), log the failure with context
(keep the existing zap fields) but do not return nil,error — instead continue
and return the created keyspace result (i.e., do not treat post-commit
split-wait failures as transaction failures). Update the branch around
tracer.waitSplit / manager.waitKeyspaceRegionSplit in CreateKeyspace so callers
receive the created keyspace even if the split wait failed.
pkg/keyspace/keyspace_test.go (1)
150-157: ⚠️ Potential issue | 🟡 Minor

This failpoint never executes in this suite.

manager is created with &mockConfig{} in SetupTest, so ToWaitRegionSplit() returns false. The waitSplitKeyspaceFailed failpoint at line 150 is behind a dead branch in createKeyspaceWithoutCheck (guarded by if tracer.waitSplit). This block is currently just another happy-path create and does not actually cover the post-commit failure path.

Configure non-zero wait-split settings here, or move the injection to a path that always runs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/keyspace/keyspace_test.go` around lines 150 - 157, The failpoint
waitSplitKeyspaceFailed never runs because manager is initialized with
&mockConfig{} in SetupTest so ToWaitRegionSplit() returns false and the if
tracer.waitSplit branch in createKeyspaceWithoutCheck is skipped; fix by
enabling the wait-split path in this test: construct or configure manager with a
non-zero wait-split setting (or replace &mockConfig{} with a config that returns
true from ToWaitRegionSplit), or move the failpoint injection into a code path
that always executes (e.g., before calling createKeyspaceWithoutCheck or into
the happy-path create function), so that the failpoint
"github.com/tikv/pd/pkg/keyspace/waitSplitKeyspaceFailed" is actually hit when
calling manager.CreateKeyspace; ensure references to ToWaitRegionSplit,
createKeyspaceWithoutCheck, tracer.waitSplit, and manager are used to locate and
change the test setup.

🧹 Nitpick comments (1)

pkg/keyspace/keyspace.go (1)
296-305: Consider pre-allocating with exact capacity.

The slices are initialized with capacity 3, but the actual number of operations depends on runtime conditions (e.g., skipSplitRegion failpoint, nil kgm). This is minor, but using make([]txnOp, 0) or dynamically computing the expected count would be slightly cleaner.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/keyspace/keyspace.go` around lines 296 - 305, The txnOps/txnCbs slices
are preallocated with capacity 3 even though the actual number depends on
runtime conditions (e.g., skipSplitRegion failpoint and presence of kgm); update
the initialization to either use make([]txnOp, 0) and make([]txnCb, 0) (no fixed
capacity) or compute the expected capacity based on those conditions before
calling make. Modify the declarations of txnOps and txnCbs and keep the addTxn
closure as-is (txnOps, txnCbs, addTxn are the symbols to change), so capacity
matches runtime logic instead of hardcoding 3.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/keyspace/tso_keyspace_group.go`:
- Around line 349-351: The check that returns the wrong error uses oldKG to
detect a missing group but returns ErrKeyspaceGroupExists; update the code to
return ErrKeyspaceGroupNotExists instead. Locate the branch where oldKG == nil
(in the function handling keyspace group lookup) and replace the
ErrKeyspaceGroupExists return with ErrKeyspaceGroupNotExists so the error
accurately reflects the missing group; keep all other logic unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9bbe6b47-6718-470a-8a2c-12f02d150c4b

📥 Commits

Reviewing files that changed from the base of the PR and between 9c634d1 and 693a40b.

📒 Files selected for processing (4)

pkg/keyspace/keyspace.go
pkg/keyspace/keyspace_test.go
pkg/keyspace/metrics.go
pkg/keyspace/tso_keyspace_group.go

Signed-off-by: tongjian <1045931706@qq.com>

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (2)

pkg/keyspace/keyspace_test.go (1)
150-157: ⚠️ Potential issue | 🟡 Minor

This failpoint still never executes in this suite.

SetupTest() builds suite.manager with &mockConfig{}, so ToWaitRegionSplit() is false and the waitSplitKeyspaceFailed injection at Line 150 stays behind a dead branch. This is currently another happy-path create, not coverage for the post-commit wait failure path. Reconfigure the manager with WaitRegionSplit: true for this case, and register the disable with defer/Cleanup immediately after Enable so a failing assertion cannot leak the failpoint into later tests.

As per coding guidelines, "Tests must handle failpoints using github.com/pingcap/failpoint."
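
The dead-branch problem and the cleanup ordering can be illustrated with a small stdlib-only model. This is a sketch under stated assumptions: the `injected` map and `enable` helper are a toy stand-in and are not the github.com/pingcap/failpoint API, and `config.waitRegionSplit` is a hypothetical field mirroring what `ToWaitRegionSplit()` reports.

```go
package main

import (
	"errors"
	"fmt"
)

// injected is a toy failpoint registry; it is NOT the pingcap/failpoint API.
var injected = map[string]bool{}

// enable turns an injection on and returns the function that turns it off,
// so the caller can defer the cleanup immediately after enabling.
func enable(name string) func() {
	injected[name] = true
	return func() { delete(injected, name) }
}

// config mirrors the idea of ToWaitRegionSplit(); the field is hypothetical.
type config struct{ waitRegionSplit bool }

// create models the post-commit step: the injection sits behind the
// wait-split guard, so it only fires when waiting is enabled.
func create(cfg config) error {
	if cfg.waitRegionSplit {
		if injected["waitSplitKeyspaceFailed"] {
			return errors.New("failpoint triggered: waitSplitKeyspaceFailed")
		}
	}
	return nil
}

func main() {
	disable := enable("waitSplitKeyspaceFailed")
	defer disable() // registered before any assertion can fail

	fmt.Println("wait-split off:", create(config{waitRegionSplit: false}))
	fmt.Println("wait-split on: ", create(config{waitRegionSplit: true}))
}
```

With waiting disabled the injection is never reached, which is the situation the comment describes; only the second call exercises the failure path.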
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/keyspace/keyspace_test.go` around lines 150 - 157, The test currently
enables the failpoint "waitSplitKeyspaceFailed" but SetupTest() constructs
suite.manager with &mockConfig{} where ToWaitRegionSplit() is false, so the
injected branch never executes; update the test to configure the manager to
enable waiting by building it with WaitRegionSplit: true (or otherwise set
mockConfig.ToWaitRegionSplit() to return true) before calling
manager.CreateKeyspace/LoadKeyspace so the failpoint is exercised, and
immediately register
failpoint.Disable("github.com/tikv/pd/pkg/keyspace/waitSplitKeyspaceFailed")
with defer or t.Cleanup right after failpoint.Enable to ensure the failpoint is
always removed even if assertions fail; refer to SetupTest, mockConfig,
ToWaitRegionSplit, manager.CreateKeyspace, and the "waitSplitKeyspaceFailed"
failpoint when making the change.
pkg/keyspace/keyspace.go (1)
349-360: ⚠️ Potential issue | 🟠 Major

Don't report create failure after the transaction has already committed.

By the time this branch runs, the keyspace meta, group membership, region label rule, and post-commit cache updates are already durable. Returning nil, err here makes callers think create rolled back, but a retry will just hit ErrKeyspaceExists. Treat the split wait as best-effort after commit, or return the created keyspace with an explicit "split pending" status.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/keyspace/keyspace.go` around lines 349 - 360, The wait-for-split branch
currently calls manager.waitKeyspaceRegionSplit (guarded by tracer.waitSplit)
and returns nil, err on failure which incorrectly signals the create rolled
back; change this to treat the split wait as best-effort: catch errors from
manager.waitKeyspaceRegionSplit, log a warning with context
(keyspace.GetId()/GetName()) but do NOT return nil, err after the transaction
has committed — instead continue returning the created keyspace; alternatively,
if you need to surface state, set a "split pending" status/flag on the returned
keyspace object before returning so callers know split is outstanding, and
remove the existing early return that returns nil, err.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/keyspace/keyspace.go`:
- Around line 289-298: createKeyspaceWithoutCheck currently persists the new
KeyspaceMeta with State set to ENABLED while CheckKeyspaceState mutates a
caller-local copy when regions aren’t split, causing inconsistent views; either
persist the true post-create state (e.g., set State to DISABLED until split
readiness is confirmed and have CheckKeyspaceState update the stored
KeyspaceMeta) or add an explicit persisted readiness field (e.g., SplitReady or
Readiness) on keyspacepb.KeyspaceMeta and use that instead of flipping State in
CheckKeyspaceState; update createKeyspaceWithoutCheck, CheckKeyspaceState, and
the other affected block referenced (the code around the 519-534 area) so all
readers derive the same persisted state rather than mutating only in-memory
copies.

In `@pkg/keyspace/tso_keyspace_group.go`:
- Around line 494-519: The callback cb currently mutates the in-memory kg
fetched from m.groups and then calls m.groups[userKind].Put(kg), which can
overwrite a newer committed snapshot; instead, after acquiring m.Lock() when
err==nil, reload the committed group snapshot for userKind/groupID from
persistent storage (do not use the live m.groups entry), apply the
opAdd/opDelete change to that freshly loaded snapshot, and then call
m.groups[userKind].Put(...) with that reloaded-and-modified snapshot (or skip
Put if the reload shows the mutation already applied); reference cb, m.groups,
userKind, groupID, kg, Put, opAdd and opDelete when making the change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 761b7560-e073-45aa-8cfd-18c0fbdfdb76

📥 Commits

Reviewing files that changed from the base of the PR and between 693a40b and 4f06e65.

📒 Files selected for processing (5)

pkg/keyspace/keyspace.go
pkg/keyspace/keyspace_test.go
pkg/keyspace/metrics.go
pkg/keyspace/tso_keyspace_group.go
server/apiv2/handlers/keyspace.go

Signed-off-by: tongjian <1045931706@qq.com>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

pkg/keyspace/keyspace.go (2)
295-300: ⚠️ Potential issue | 🟠 Major

Persisted/cached ENABLED still diverges from the effective keyspace state.

Line 297 persists new keyspaces as ENABLED, Line 396 caches that value, but Line 534 only rewrites a few caller-local copies when the split is still pending. That leaves different read paths disagreeing about the same keyspace: GetKeyspaceStateByID / GetEnabledKeyspaceNameByID will still treat it as enabled while handlers that call CheckKeyspaceState can return disabled. Persist the real transition, or add a separate readiness field and use it consistently everywhere.

Also applies to: 393-396, 522-537
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/keyspace/keyspace.go` around lines 295 - 300, The persisted
Keyspace.State is being set to ENABLED at creation
(tracer.keyspaceID/tracer.keyspaceName) but other code paths
(CheckKeyspaceState) treat it as not-ready, causing divergence; fix by either
(A) persisting the actual transitional state instead of always ENABLED (update
the creation/site where State is assigned and where the cache is written around
the caching code that references the same state), or (B) add a clear, consistent
readiness field (e.g., Ready or Initialized) to the persisted Keyspace and
populate it at the real transition point, then update the cache write logic and
all accessors (GetKeyspaceStateByID, GetEnabledKeyspaceNameByID,
CheckKeyspaceState and the cache population code) to read that single field so
all read paths agree on a keyspace’s effective state.
328-345: ⚠️ Potential issue | 🟠 Major

Don't surface a create failure after the transaction has committed.

By the time Line 357 runs, the metadata, keyspace-group membership, and label rule are already durable and the post-commit callbacks have executed. Returning nil, err here makes a successful create look like a rollback, and the next retry just turns into ErrKeyspaceExists (including the bootstrap pre-alloc retry loop above). Make the split wait best-effort after commit, or return the created keyspace with an explicit "split pending" status.

Also applies to: 352-363
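
The ordering this comment relies on (all writes in one transaction, cache callbacks only after commit) can be sketched roughly as follows. Everything here is a simplified model: `store`, `runInTxn`, and the key names are hypothetical and only loosely mirror `manager.store.RunInTxn` and `txnCbs`.

```go
package main

import (
	"errors"
	"fmt"
)

// store is a hypothetical in-memory stand-in for a transactional KV store.
type store struct{ data map[string]string }

// runInTxn stages writes on a copy and swaps it in only if fn succeeds.
func (s *store) runInTxn(fn func(staged map[string]string) error) error {
	staged := make(map[string]string, len(s.data))
	for k, v := range s.data {
		staged[k] = v
	}
	if err := fn(staged); err != nil {
		return err // nothing is committed
	}
	s.data = staged
	return nil
}

type txnOp func(staged map[string]string) error
type txnCb func()

// create runs every op in one transaction and runs the cache callbacks
// only after the transaction has committed.
func create(s *store, cache map[string]bool, name string, failLabelRule bool) error {
	ops := []txnOp{
		func(m map[string]string) error { m["meta/"+name] = "ENABLED"; return nil },
		func(m map[string]string) error { m["group/"+name] = "default"; return nil },
		func(m map[string]string) error {
			if failLabelRule {
				return errors.New("label rule write failed")
			}
			m["rule/"+name] = "range"
			return nil
		},
	}
	cbs := []txnCb{func() { cache[name] = true }}

	err := s.runInTxn(func(m map[string]string) error {
		for _, op := range ops {
			if err := op(m); err != nil {
				return err
			}
		}
		return nil
	})
	if err != nil {
		return err // rolled back: neither storage nor cache changed
	}
	for _, cb := range cbs {
		cb()
	}
	return nil
}

func main() {
	s := &store{data: map[string]string{}}
	cache := map[string]bool{}
	fmt.Println(create(s, cache, "ks1", true), len(s.data), len(cache))
	fmt.Println(create(s, cache, "ks2", false), len(s.data), len(cache))
}
```

In this model an error returned from `create` always means nothing was persisted, which is the contract a post-commit split-wait error would break.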
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/keyspace/keyspace.go` around lines 328 - 345, After the transaction
commit and post-commit callbacks (the manager.store.RunInTxn block and the
subsequent txnCbs iteration) you must not surface an error that would make a
successful create look like a rollback; instead detect that the txn committed
and convert split-wait failures into a best-effort async action or return the
created keyspace with a "split pending" status. Change the error handling around
err after the RunInTxn/txnCbs block so that if the metadata/membership/label
updates were applied you do not return nil, err: either swallow the
split/split-wait error (log it) and launch the split wait asynchronously, or
populate and return the created Keyspace object with an explicit SplitPending
flag; ensure callbacks already executed still see success and remove the final
return of the transaction error that implies rollback (update the code paths
that reference manager.store.RunInTxn, txnCbs, the err check and the log.Warn).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integrations/mcs/tso/keyspace_group_manager_test.go`:
- Around line 924-940: The test calls LoadKeyspace immediately after Bootstrap,
which can race with background pre-allocation; change the test to poll until
both keyspaces resolve before proceeding: repeatedly call
leaderServer.GetServer().GetKeyspaceManager().LoadKeyspace for each name
(keyspaces[0] and keyspaces[1]) inside a testutil.Eventually (or similar wait
loop) and only continue when both return without error, then use their Ids for
the subsequent SplitKeyspaceGroupByID, waitFinishSplit, and assertions
(references: LoadKeyspace, GetKeyspaceManager, leaderServer,
SplitKeyspaceGroupByID, waitFinishSplit).
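
A rough sketch of the suggested wait loop, using only the standard library. The `eventually` helper is a simplified stand-in for helpers such as `testutil.Eventually`, and `fakeManager` is a hypothetical loader that starts succeeding only after a couple of calls, to imitate background pre-allocation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// eventually polls cond until it returns true or the timeout elapses.
func eventually(cond func() bool, timeout, tick time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}

// fakeManager simulates lookups that fail until pre-allocation has finished.
type fakeManager struct {
	calls int
	ids   map[string]uint32
}

func (m *fakeManager) load(name string) (uint32, error) {
	m.calls++
	if m.calls < 3 {
		return 0, errors.New("keyspace not found yet")
	}
	id, ok := m.ids[name]
	if !ok {
		return 0, errors.New("unknown keyspace")
	}
	return id, nil
}

// waitForKeyspaces proceeds only once every name resolves without error.
func waitForKeyspaces(names []string, load func(string) (uint32, error)) ([]uint32, bool) {
	ids := make([]uint32, len(names))
	ok := eventually(func() bool {
		for i, name := range names {
			id, err := load(name)
			if err != nil {
				return false
			}
			ids[i] = id
		}
		return true
	}, time.Second, time.Millisecond)
	return ids, ok
}

func main() {
	m := &fakeManager{ids: map[string]uint32{"keyspace_a": 1, "keyspace_b": 2}}
	ids, ok := waitForKeyspaces([]string{"keyspace_a", "keyspace_b"}, m.load)
	fmt.Println(ids, ok)
}
```

The IDs returned by the loop would then feed the later split and assertion steps, rather than being read once right after bootstrap.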

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b388bcb0-9876-4822-879c-66a73fbc25cd

📥 Commits

Reviewing files that changed from the base of the PR and between 4f06e65 and 19abccb.

📒 Files selected for processing (3)

pkg/keyspace/keyspace.go
pkg/keyspace/tso_keyspace_group.go
tests/integrations/mcs/tso/keyspace_group_manager_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/keyspace/tso_keyspace_group.go

Signed-off-by: tongjian <1045931706@qq.com>

lhy1024 · 2026-04-01T04:03:26Z

-		err2 := manager.store.RunInTxn(manager.ctx, func(txn kv.Txn) error {
-			metaPath := keypath.KeyspaceMetaPath(id)
-			return txn.Remove(metaPath)
+	if tracer.waitSplit {


Do we need to roll back the keyspace if the split fails? Or should we use DISABLED as the default keyspace state at creation?

This now commits the keyspace meta / keyspace-group membership / region label rule before the region split has actually finished, and the persisted keyspace state is already ENABLED at that point.

That changes the failure semantics of CreateKeyspace: if the subsequent wait-for-split step times out, CreateKeyspace returns an error, but the keyspace has already been created and committed. A retry will then immediately hit ErrKeyspaceExists. In other words, "create failed" no longer means "nothing was created".

func TestWaitSplitFailureLeavesCommittedKeyspace(t *testing.T) {
	re := require.New(t)
	manager := setupMinimalKeyspaceManager(t)
	manager.UpdateConfig(&mockConfig{
		WaitRegionSplit:          true,
		WaitRegionSplitTimeout:   typeutil.Duration{Duration: time.Millisecond},
		CheckRegionSplitInterval: typeutil.Duration{Duration: time.Millisecond},
	})
	req := &CreateKeyspaceRequest{
		Name:       "waitsplitcase1",
		CreateTime: time.Now().Unix(),
	}
	created, err := manager.CreateKeyspace(req)
	re.Nil(created)
	re.ErrorIs(err, errs.ErrRegionSplitTimeout)
	loaded, err := manager.LoadKeyspace(req.Name)
	re.NoError(err)
	re.Equal(keyspacepb.KeyspaceState_ENABLED, loaded.State)
}

I added a minimal unit test to verify this behavior: after CreateKeyspace returns ErrRegionSplitTimeout, LoadKeyspace still succeeds and the stored state is ENABLED (see TestWaitSplitFailureLeavesCommittedKeyspace).

We only ensure that the meta write is atomic; the keyspace state returned by the API is derived by checking the split status.

pd/server/apiv2/handlers/keyspace.go

Lines 173 to 175 in 4f06e65

}

manager.CheckKeyspaceState(meta)

c.IndentedJSON(http.StatusOK, &KeyspaceMeta{meta})


That means a retry no longer re-attempts the same create; it immediately hits ErrKeyspaceExists.

I think we still need to fix one of these:

treat the post-commit wait-split as best-effort and return the created keyspace, or

persist a real transitional state / readiness signal so the API and callers can distinguish "created but split pending" from a rollback.
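
One way to picture the difference between these two outcomes is a result type that carries a split-pending flag. This is only a sketch: `createResult`, `SplitPending`, `errSplitTimeout`, and the `commit`/`waitSplit` function parameters are hypothetical names for illustration and do not exist in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errSplitTimeout is a hypothetical sentinel standing in for a split-wait timeout.
var errSplitTimeout = errors.New("region split timeout")

// createResult lets callers tell "created, split pending" apart from a rollback.
type createResult struct {
	Name         string
	SplitPending bool
}

// createKeyspace returns an error only when nothing was committed.
func createKeyspace(name string, commit, waitSplit func() error) (*createResult, error) {
	if err := commit(); err != nil {
		return nil, err // nothing persisted, so a retry is safe
	}
	res := &createResult{Name: name}
	if err := waitSplit(); err != nil {
		// The keyspace is already durable: report the pending split
		// instead of returning an error that implies a rollback.
		res.SplitPending = true
	}
	return res, nil
}

func main() {
	ok := func() error { return nil }
	timeout := func() error { return errSplitTimeout }

	res, err := createKeyspace("ks1", ok, timeout)
	fmt.Println(res.Name, res.SplitPending, err)

	_, err = createKeyspace("ks2", timeout, ok)
	fmt.Println(err)
}
```

A persisted transitional state would serve the same purpose for readers other than the creating caller, which the in-memory flag above does not cover.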

lhy1024 · 2026-04-01T04:08:35Z

+	if meta.State != keyspacepb.KeyspaceState_ENABLED {
+		return
+	}
+	if !manager.CheckKeyspaceRegionBound(meta.Id) {
+		meta.State = keyspacepb.KeyspaceState_DISABLED
+		return
+	}
+}
+


It will miss some paths.

This read-time CheckKeyspaceState patch only fixes a few single-keyspace read paths, but it does not fix the underlying state inconsistency.

Because create now persists the keyspace as ENABLED before split completion, the same keyspace can be observed with different states depending on which path reads it:

LoadKeyspace + CheckKeyspaceState sees DISABLED

LoadRangeKeyspace / LoadAllKeyspaces still see ENABLED

background logic such as GC also still treats it as ENABLED

func TestLoadRangeKeyspaceStillShowsEnabledBeforeSplitComplete(t *testing.T) {
	re := require.New(t)
	manager := setupMinimalKeyspaceManager(t)
	req := &CreateKeyspaceRequest{
		Name:       "rangebefore1",
		CreateTime: time.Now().Unix(),
	}
	created, err := manager.CreateKeyspace(req)
	re.NoError(err)
	single, err := manager.LoadKeyspace(req.Name)
	re.NoError(err)
	manager.CheckKeyspaceState(single)
	re.Equal(keyspacepb.KeyspaceState_DISABLED, single.State)
	scanned, err := manager.LoadRangeKeyspace(created.Id, 1)
	re.NoError(err)
	re.Len(scanned, 1)
	re.Equal(created.Id, scanned[0].Id)
	re.Equal(keyspacepb.KeyspaceState_ENABLED, scanned[0].State)
}

Yes, LoadAllKeyspaces is too large for that; checking the split status of every keyspace is expensive.

My concern is not only LoadAllKeyspaces.

Right now the same keyspace can still be observed with different effective states depending on the read path:

detail API paths call CheckKeyspaceState

LoadRangeKeyspace / list APIs still expose the persisted ENABLED

GetKeyspaceStateByID / GetEnabledKeyspaceNameByID still read the cached/persisted state directly

background logic also still consumes that state, for example GC and storage-size collection paths
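
To make the concern concrete, here is a toy model in which every read path goes through one helper, so single and range reads cannot disagree. All names (`manager`, `effective`, `splitDone`, and so on) are hypothetical, and the sketch deliberately ignores the cost of checking split status for many keyspaces, which is the objection raised above.

```go
package main

import "fmt"

type state int

const (
	enabled state = iota
	disabled
)

type meta struct {
	ID    uint32
	State state
}

// manager is a toy model: persisted metas plus a per-keyspace split flag.
type manager struct {
	persisted []meta
	splitDone map[uint32]bool
}

// effective is the single place that combines persisted state with split readiness.
func (m *manager) effective(ks meta) meta {
	if ks.State == enabled && !m.splitDone[ks.ID] {
		ks.State = disabled
	}
	return ks
}

// loadKeyspace is the single-keyspace read path.
func (m *manager) loadKeyspace(id uint32) (meta, bool) {
	for _, ks := range m.persisted {
		if ks.ID == id {
			return m.effective(ks), true
		}
	}
	return meta{}, false
}

// loadRange is the scan read path; it applies the same helper to each entry.
func (m *manager) loadRange(startID uint32, limit int) []meta {
	out := make([]meta, 0, limit)
	for _, ks := range m.persisted {
		if ks.ID >= startID && len(out) < limit {
			out = append(out, m.effective(ks))
		}
	}
	return out
}

func main() {
	m := &manager{
		persisted: []meta{{ID: 1, State: enabled}},
		splitDone: map[uint32]bool{},
	}
	single, _ := m.loadKeyspace(1)
	scanned := m.loadRange(1, 1)
	fmt.Println(single.State == scanned[0].State)
}
```

Persisting the transitional state instead would give the same agreement without a per-read check, at the price of an extra write when the split completes.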

Signed-off-by: tongjian <1045931706@qq.com>

ti-chi-bot · 2026-04-02T08:43:49Z

@bufferflies: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-error-log-review	`5d9ed48`	link	true	`/test pull-error-log-review`
pull-unit-test-next-gen-2	`5d9ed48`	link	true	`/test pull-unit-test-next-gen-2`
pull-unit-test-next-gen-3	`5d9ed48`	link	true	`/test pull-unit-test-next-gen-3`
pull-integration-realcluster-test	`5d9ed48`	link	true	`/test pull-integration-realcluster-test`


lhy1024 · 2026-04-02T09:50:26Z

-	userKind := endpoint.StringUserKind(request.Config[UserKindKey])
-	config, err := manager.kgm.GetKeyspaceConfigByKind(userKind)
+	addTxn(op, cb)
+	groupID, err := strconv.ParseUint(config[TSOKeyspaceGroupIDKey], 10, 64)


This assumes TSOKeyspaceGroupIDKey is always present, but that is not true when keyspace-group is disabled.

CI is already hitting strconv.ParseUint: parsing "": invalid syntax during keyspace manager bootstrap.
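
A small sketch of a guard for the missing-key case. The helper `groupIDFromConfig` and the literal value of `tsoKeyspaceGroupIDKey` are assumptions made for this example; only the `strconv.ParseUint` behavior on an empty string is standard-library fact.

```go
package main

import (
	"fmt"
	"strconv"
)

// tsoKeyspaceGroupIDKey stands in for TSOKeyspaceGroupIDKey; the literal
// value used here is an assumption for illustration.
const tsoKeyspaceGroupIDKey = "tso_keyspace_group_id"

// groupIDFromConfig treats a missing or empty entry as "no group configured"
// rather than passing an empty string to strconv.ParseUint.
func groupIDFromConfig(config map[string]string) (id uint64, present bool, err error) {
	raw, ok := config[tsoKeyspaceGroupIDKey]
	if !ok || raw == "" {
		return 0, false, nil
	}
	id, err = strconv.ParseUint(raw, 10, 64)
	if err != nil {
		return 0, true, fmt.Errorf("invalid %s %q: %w", tsoKeyspaceGroupIDKey, raw, err)
	}
	return id, true, nil
}

func main() {
	// Parsing an empty string fails, which matches the CI error above.
	_, err := strconv.ParseUint("", 10, 64)
	fmt.Println(err)

	id, present, err := groupIDFromConfig(map[string]string{})
	fmt.Println(id, present, err)
}
```

Callers would then skip the keyspace-group membership step when `present` is false, assuming that is the intended behavior when keyspace groups are disabled.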

okJiang · 2026-04-08T10:10:15Z

 	var k api.KeyspaceMeta
 	keyspaceName := "keyspace_1"
 	testutil.Eventually(re, func() bool {
+		fmt.Printf("get keyspace aaaa: %v\n", k)


Should we keep them?

bufferflies added 2 commits March 20, 2026 09:51

wrap one txn

8960828

Signed-off-by: tongjian <1045931706@qq.com>

atomic to create keyspace

e7eef2f

Signed-off-by: tongjian <1045931706@qq.com>

ti-chi-bot Bot added the do-not-merge/needs-triage-completed, release-note (denotes a PR that will be considered when it comes time to generate release notes), and dco-signoff: yes (indicates the PR's author has signed the DCO) labels Mar 26, 2026

ti-chi-bot Bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) Mar 26, 2026

bufferflies mentioned this pull request Mar 26, 2026

keyspace, schedule: make keyspace creation atomic #10467

Closed

ti-chi-bot Bot removed the do-not-merge/needs-triage-completed label Mar 26, 2026

coderabbitai Bot reviewed Mar 26, 2026

View reviewed changes

Comment thread pkg/keyspace/keyspace.go Outdated

Comment thread pkg/keyspace/keyspace.go

Comment thread pkg/keyspace/tso_keyspace_group.go Outdated

atoimc

9c634d1

Signed-off-by: tongjian <1045931706@qq.com>

bufferflies force-pushed the bugfix/create_keyspace_atomic branch from 5ef55c9 to 9c634d1 Compare March 26, 2026 02:40

coderabbitai Bot reviewed Mar 26, 2026

View reviewed changes

bufferflies force-pushed the bugfix/create_keyspace_atomic branch from e9b0b94 to 693a40b Compare March 26, 2026 03:42

coderabbitai Bot reviewed Mar 26, 2026

View reviewed changes

Comment thread pkg/keyspace/tso_keyspace_group.go

lock and unlock

4f06e65

Signed-off-by: tongjian <1045931706@qq.com>

bufferflies force-pushed the bugfix/create_keyspace_atomic branch from 693a40b to 4f06e65 Compare March 26, 2026 07:21

coderabbitai Bot reviewed Mar 26, 2026

View reviewed changes

Comment thread pkg/keyspace/keyspace.go

Comment thread pkg/keyspace/tso_keyspace_group.go Outdated

add retry for prealloc

19abccb

Signed-off-by: tongjian <1045931706@qq.com>

coderabbitai Bot reviewed Mar 26, 2026

View reviewed changes

Comment thread tests/integrations/mcs/tso/keyspace_group_manager_test.go Outdated

fix unstable test

252d9c8

Signed-off-by: tongjian <1045931706@qq.com>

bufferflies force-pushed the bugfix/create_keyspace_atomic branch from 618a771 to 252d9c8 Compare March 27, 2026 06:32

bufferflies requested review from lhy1024, okJiang and rleungx March 27, 2026 06:33

lhy1024 reviewed Apr 1, 2026

View reviewed changes

add lock

5d9ed48

Signed-off-by: tongjian <1045931706@qq.com>

lhy1024 reviewed Apr 2, 2026

View reviewed changes

okJiang reviewed Apr 8, 2026

View reviewed changes

	}
	manager.CheckKeyspaceState(meta)
	c.IndentedJSON(http.StatusOK, &KeyspaceMeta{meta})
