feat(container-runtime): enhance DuplicateBatch telemetry diagnostics#27355
feat(container-runtime): enhance DuplicateBatch telemetry diagnostics#27355anthony-murphy wants to merge 1 commit into
Conversation
The DuplicateBatch dataCorruptionError event recorded the new (duplicate) batch's identity but not the original's, and its `messageClientSequenceNumber` was the synthetic counter produced by `OpGroupingManager.ungroupOp` rather than the real outer-envelope csn. Both shortcomings made it impossible to tell from telemetry whether a duplicate was caused by a service-side double-sequence, a client-side resubmit collision, or some other path. Changes: * `DuplicateBatchDetector` now records a `RecordedBatchInfo` (`clientId`, `batchStartCsn`, `batchIdExplicit`) for every runtime-observed batch, and returns the original's info on a duplicate match. Snapshot wire format (`[[sequenceNumber, batchId]]`) is unchanged. * The `DuplicateBatch` telemetry event now includes `batchIdExplicit`, `otherClientId`, `otherBatchStartCsn`, `otherBatchIdExplicit`, `otherFromSnapshot`, and overrides `messageClientSequenceNumber` to `batchStart.batchStartCsn` so the logged value is the real wire csn. * Tests updated; 2 new tests cover the new `RecordedBatchInfo` path and detection of explicit-vs-synthesized batchIds. No public API surface change; `DuplicateBatchDetector` is internal. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Hi! Thank you for opening this PR. Want me to review it? Based on the diff (215 lines, 3 files), I've queued these reviewers:
How this works
|
There was a problem hiding this comment.
Pull request overview
Improves DuplicateBatch diagnostics in container-runtime by preserving identifying information for the original batch occurrence (when available) and ensuring telemetry reports the real outer-envelope client sequence number (avoiding the grouped-message synthetic CSN artifact).
Changes:
- Extend
DuplicateBatchDetectorto track per-sequence recorded batch info (clientId / batchStartCsn / explicit-batchId bit) and return it on duplicate detection while keeping the existing snapshot wire format. - Enhance
DuplicateBatchtelemetry inContainerRuntimewith original-batch fields and overridemessageClientSequenceNumberwithbatchStart.batchStartCsn. - Update and expand unit tests to validate runtime-recorded info and explicit-vs-derived batchId behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/runtime/container-runtime/src/test/opLifecycle/duplicateBatchDetector.spec.ts | Updates tests for new duplicate result shape and adds coverage for runtime-recorded original info + explicit batchId detection. |
| packages/runtime/container-runtime/src/opLifecycle/duplicateBatchDetector.ts | Records richer per-batch runtime info and includes it in duplicate detection results while preserving snapshot serialization shape. |
| packages/runtime/container-runtime/src/containerRuntime.ts | Logs new DuplicateBatch telemetry fields and corrects messageClientSequenceNumber to use batchStartCsn. |
Comments suppressed due to low confidence (1)
packages/runtime/container-runtime/src/opLifecycle/duplicateBatchDetector.ts:143
BatchStartInfodefinesclientIdandbatchStartCsnas required fields, so the=== undefinedchecks here are effectively dead code in production. RecordingRecordedBatchInfounconditionally for runtime-seen batches (and reservinginfo: undefinedonly for snapshot-loaded entries) would make the intent clearer and keepotherBatchInfo === undefinedaligned with “from snapshot”.
// Add new batch. Record identifying info so we can report it if a future duplicate matches us.
const info: RecordedBatchInfo | undefined =
batchStart.clientId === undefined || batchStart.batchStartCsn === undefined
? undefined
: {
clientId: batchStart.clientId,
batchStartCsn: batchStart.batchStartCsn,
// True iff the wire carried explicit batchId metadata (resubmit path).
// False indicates the batchId was derived from clientId + batchStartCsn (fresh submit).
batchIdExplicit: batchStart.batchId !== undefined,
};
| function makeBatch({ | ||
| sequenceNumber, | ||
| minimumSequenceNumber, | ||
| batchId, | ||
| clientId, | ||
| batchStartCsn, | ||
| }: { sequenceNumber: number; minimumSequenceNumber: number } & ( | ||
| | { batchId: string; clientId?: undefined; batchStartCsn?: undefined } | ||
| | { batchId?: undefined; clientId: string; batchStartCsn: number } | ||
| )): BatchStartInfo { | ||
| }: { | ||
| sequenceNumber: number; | ||
| minimumSequenceNumber: number; | ||
| batchId?: string; | ||
| clientId?: string; | ||
| batchStartCsn?: number; | ||
| }): BatchStartInfo { |
| if (otherSequenceNumber !== undefined) { | ||
| const other = this.batchesBySeqNum.get(otherSequenceNumber); | ||
| assert( | ||
| this.batchIdsBySeqNum.get(otherSequenceNumber) === batchId, | ||
| other?.batchId === batchId, | ||
| 0xce0 /* batchIdToSeqNum and seqNumToBatchId should be in sync for duplicate */, | ||
| ); | ||
| return { duplicate: true, otherSequenceNumber }; | ||
| return { | ||
| duplicate: true, | ||
| otherSequenceNumber, | ||
| otherBatchInfo: other.info, | ||
| }; |
Deep ReviewReviewed commit Readiness: 7/10 — ALMOST READY The detector-layer refactor and the Path to Ready
Context for Reviewers
For human reviewer
|
| // with a synthetic counter (1, 2, 3, ...). Override with the real outer | ||
| // envelope's clientSequenceNumber so downstream telemetry doesn't get a | ||
| // misleading "fake csn" value. | ||
| messageClientSequenceNumber: batchStart.batchStartCsn, |
There was a problem hiding this comment.
Deep Review: This override silently changes the meaning of an already-emitted telemetry property. ...extractSafePropertiesFromMessage(batchStart.keyMessage) already emits messageClientSequenceNumber (the post-ungroupOp synthetic counter); the subsequent messageClientSequenceNumber: batchStart.batchStartCsn replaces that value with the outer-envelope CSN.
The PR description confirms the field is in production dashboards today: "Across 50 production events over 7 days, every DuplicateBatch showed ... MessageClientSequenceNumber == 1". Any KQL query, alert, or dashboard tile keyed off DuplicateBatch.MessageClientSequenceNumber == 1 will stop matching once this ships.
Before merging, either:
(a) notify the on-call/telemetry owner (cc kian-thompson — historical reviewer on telemetry shape, see #22454) that the field's semantics are changing, and call out the back-compat impact in the PR description; or
(b) emit the corrected value under a distinct field name (e.g. outerBatchClientSequenceNumber) and leave the inherited spread value untouched.
(a) is preferred — the existing value is known-misleading and is exactly what this PR is trying to stop surfacing as authoritative — but it requires the explicit communication step. Also worth extending the inline comment at the override site to flag that this overrides an inherited field rather than adding a new one.
| // with a synthetic counter (1, 2, 3, ...). Override with the real outer | ||
| // envelope's clientSequenceNumber so downstream telemetry doesn't get a | ||
| // misleading "fake csn" value. | ||
| messageClientSequenceNumber: batchStart.batchStartCsn, |
There was a problem hiding this comment.
Deep Review: The rollout/opt-in path for the new diagnostic fields is undocumented. PR #27351 (daesunp, merged 2026-05-20T00:20:38Z — ~2h before this PR was created) reverted #27262 and shifted Fluid.ContainerRuntime.enableBatchIdTracking back to explicit opt-in. The DuplicateBatch event only fires inside the duplicate-detection path that is now opt-in.
The PR description's success criterion — "the next event will tell us which it is" — implicitly assumes the tenant population emitting these events is opted in. The gated path is reachable via Fluid.Container.enableOfflineFull or Fluid.ContainerRuntime.enableBatchIdTracking, so this is a documentation gap, not a code defect. But the diagnostic value of the new fields is contingent on opt-in coverage; reviewers and the on-call investigator need that context.
Suggested fix: add a one-paragraph note to the PR description stating which enablement gate will be set for the affected tenant population and roughly when — either the existing flags will be turned on for those tenants, or this PR is staged ahead of a separate re-enablement.
Description
The
DuplicateBatchdataCorruptionErrorevent recorded the new (duplicate) batch's identity but not the original's, and itsmessageClientSequenceNumberwas the synthetic counter produced byOpGroupingManager.ungroupOprather than the real outer-envelope csn. Both shortcomings made it impossible to tell from telemetry whether a production duplicate was caused by a service-side double-sequence, a client-side resubmit collision, or another path.Changes
opLifecycle/duplicateBatchDetector.ts— ExportedRecordedBatchInfo(clientId,batchStartCsn,batchIdExplicit); renamedbatchIdsBySeqNum→batchesBySeqNum(nowMap<number, { batchId, info? }>);processInboundBatchreturns the original'sRecordedBatchInfo(orundefinedwhen the original came from a snapshot) on duplicate match. Snapshot wire format[[sequenceNumber, batchId]]preserved for backward compatibility.containerRuntime.ts—DuplicateBatchevent now logsbatchIdExplicit,otherClientId,otherBatchStartCsn,otherBatchIdExplicit,otherFromSnapshot, and overridesmessageClientSequenceNumberwith the real outer-envelopebatchStart.batchStartCsn(with a code comment explaining theungroupOpfake-csn artifact).Motivation — what production telemetry will now reveal
Across 50 production events over 7 days, every
DuplicateBatchshowedDupClientId == MsgClientId,BatchStartCsn == 1,MessageClientSequenceNumber == 1, and an emptydetails.batchId— which leaves multiple very different root-cause hypotheses (service-side double-sequence vs. client-side resubmit collision vs. another path) indistinguishable. With the new fields, the next event will tell us which it is:batchIdExplicit == falseandotherBatchIdExplicit == false, both csn equal*Explicitflags istrueReviewer Guidance
The review process is outlined on this wiki page.
DuplicateBatchDetectoris internal (exported fromopLifecycle/index.tsbut not in anyapi-report.md). No public API surface change, no changeset required.getRecentBatchInfoForSummarystill emits the same[number, string][]tuples; a round-trip integration test exercises this.messageClientSequenceNumberoverride is correctness-tied — please confirmbatchStart.batchStartCsnis the right substitute (it's set from the outer-envelopeclientSequenceNumberbeforeungroupOpruns; seeremoteMessageProcessor.ts).extractSafePropertiesFromMessage(channelCollection.ts:480/526/578,dataStoreContext.ts:761,pendingStateManager.ts:729,inboundBatchAggregator.ts:183). Deliberately left for a follow-up — flagging for awareness.Tagged
deep-reviewbecause the fix is bound up with a live production data-corruption investigation; we want to be sure the new telemetry is correct before the next event lands.