Skip to content

feat(container-runtime): enhance DuplicateBatch telemetry diagnostics#27355

Open
anthony-murphy wants to merge 1 commit into
microsoft:mainfrom
anthony-murphy:duplicate-batch-telemetry-diagnostics
Open

feat(container-runtime): enhance DuplicateBatch telemetry diagnostics#27355
anthony-murphy wants to merge 1 commit into
microsoft:mainfrom
anthony-murphy:duplicate-batch-telemetry-diagnostics

Conversation

@anthony-murphy
Copy link
Copy Markdown
Contributor

Description

The DuplicateBatch dataCorruptionError event recorded the new (duplicate) batch's identity but not the original's, and its messageClientSequenceNumber was the synthetic counter produced by OpGroupingManager.ungroupOp rather than the real outer-envelope csn. Both shortcomings made it impossible to tell from telemetry whether a production duplicate was caused by a service-side double-sequence, a client-side resubmit collision, or another path.

Changes

  • opLifecycle/duplicateBatchDetector.ts — Exported RecordedBatchInfo (clientId, batchStartCsn, batchIdExplicit); renamed batchIdsBySeqNumbatchesBySeqNum (now Map<number, { batchId, info? }>); processInboundBatch returns the original's RecordedBatchInfo (or undefined when the original came from a snapshot) on duplicate match. Snapshot wire format [[sequenceNumber, batchId]] preserved for backward compatibility.
  • containerRuntime.tsDuplicateBatch event now logs batchIdExplicit, otherClientId, otherBatchStartCsn, otherBatchIdExplicit, otherFromSnapshot, and overrides messageClientSequenceNumber with the real outer-envelope batchStart.batchStartCsn (with a code comment explaining the ungroupOp fake-csn artifact).
  • Tests — All existing assertions updated; 2 new tests cover the runtime-info path and explicit-vs-synthesized batchId detection. 11/11 unit + 3/3 integration tests pass.

Motivation — what production telemetry will now reveal

Across 50 production events over 7 days, every DuplicateBatch showed DupClientId == MsgClientId, BatchStartCsn == 1, MessageClientSequenceNumber == 1, and an empty details.batchId — which leaves multiple very different root-cause hypotheses (service-side double-sequence vs. client-side resubmit collision vs. another path) indistinguishable. With the new fields, the next event will tell us which it is:

Field combination on next event Likely root cause
batchIdExplicit == false and otherBatchIdExplicit == false, both csn equal Service-side double-sequence — escalate to ODSP.
Exactly one of the two *Explicit flags is true Client-side resubmit colliding with fresh submit.

Reviewer Guidance

The review process is outlined on this wiki page.

  • DuplicateBatchDetector is internal (exported from opLifecycle/index.ts but not in any api-report.md). No public API surface change, no changeset required.
  • Snapshot wire format unchanged: getRecentBatchInfoForSummary still emits the same [number, string][] tuples; a round-trip integration test exercises this.
  • Worth a careful look: the messageClientSequenceNumber override is correctness-tied — please confirm batchStart.batchStartCsn is the right substitute (it's set from the outer-envelope clientSequenceNumber before ungroupOp runs; see remoteMessageProcessor.ts).
  • Out of scope, but related: the same fake-csn artifact exists in other call sites that consume grouped sub-messages via extractSafePropertiesFromMessage (channelCollection.ts:480/526/578, dataStoreContext.ts:761, pendingStateManager.ts:729, inboundBatchAggregator.ts:183). Deliberately left for a follow-up — flagging for awareness.

Tagged deep-review because the fix is bound up with a live production data-corruption investigation; we want to be sure the new telemetry is correct before the next event lands.

The DuplicateBatch dataCorruptionError event recorded the new
(duplicate) batch's identity but not the original's, and its
`messageClientSequenceNumber` was the synthetic counter produced by
`OpGroupingManager.ungroupOp` rather than the real outer-envelope
csn. Both shortcomings made it impossible to tell from telemetry
whether a duplicate was caused by a service-side double-sequence, a
client-side resubmit collision, or some other path.

Changes:

* `DuplicateBatchDetector` now records a `RecordedBatchInfo`
  (`clientId`, `batchStartCsn`, `batchIdExplicit`) for every
  runtime-observed batch, and returns the original's info on a
  duplicate match. Snapshot wire format
  (`[[sequenceNumber, batchId]]`) is unchanged.
* The `DuplicateBatch` telemetry event now includes
  `batchIdExplicit`, `otherClientId`, `otherBatchStartCsn`,
  `otherBatchIdExplicit`, `otherFromSnapshot`, and overrides
  `messageClientSequenceNumber` to `batchStart.batchStartCsn` so
  the logged value is the real wire csn.
* Tests updated; 2 new tests cover the new `RecordedBatchInfo` path
  and detection of explicit-vs-synthesized batchIds.

No public API surface change; `DuplicateBatchDetector` is internal.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 20, 2026 02:19
@github-actions
Copy link
Copy Markdown
Contributor

Hi! Thank you for opening this PR. Want me to review it?

Based on the diff (215 lines, 3 files), I've queued these reviewers:

  • Correctness — logic errors, race conditions, lifecycle issues
  • Security — vulnerabilities, secret exposure, injection
  • API Compatibility — breaking changes, release tags, type design
  • Performance — algorithmic regressions, memory leaks
  • Testing — coverage gaps, hollow tests

How this works

  • Adjust the reviewer set by ticking/unticking boxes above. Reviewer toggles alone don't trigger anything.

  • Tick Start review below to dispatch the review fleet.

  • After review finishes, tick Start review again to request another run — it auto-resets after each dispatch.

  • This comment updates as new commits land; your reviewer selections are preserved.

  • Start review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves DuplicateBatch diagnostics in container-runtime by preserving identifying information for the original batch occurrence (when available) and ensuring telemetry reports the real outer-envelope client sequence number (avoiding the grouped-message synthetic CSN artifact).

Changes:

  • Extend DuplicateBatchDetector to track per-sequence recorded batch info (clientId / batchStartCsn / explicit-batchId bit) and return it on duplicate detection while keeping the existing snapshot wire format.
  • Enhance DuplicateBatch telemetry in ContainerRuntime with original-batch fields and override messageClientSequenceNumber with batchStart.batchStartCsn.
  • Update and expand unit tests to validate runtime-recorded info and explicit-vs-derived batchId behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
packages/runtime/container-runtime/src/test/opLifecycle/duplicateBatchDetector.spec.ts Updates tests for new duplicate result shape and adds coverage for runtime-recorded original info + explicit batchId detection.
packages/runtime/container-runtime/src/opLifecycle/duplicateBatchDetector.ts Records richer per-batch runtime info and includes it in duplicate detection results while preserving snapshot serialization shape.
packages/runtime/container-runtime/src/containerRuntime.ts Logs new DuplicateBatch telemetry fields and corrects messageClientSequenceNumber to use batchStartCsn.
Comments suppressed due to low confidence (1)

packages/runtime/container-runtime/src/opLifecycle/duplicateBatchDetector.ts:143

  • BatchStartInfo defines clientId and batchStartCsn as required fields, so the === undefined checks here are effectively dead code in production. Recording RecordedBatchInfo unconditionally for runtime-seen batches (and reserving info: undefined only for snapshot-loaded entries) would make the intent clearer and keep otherBatchInfo === undefined aligned with “from snapshot”.
		// Add new batch. Record identifying info so we can report it if a future duplicate matches us.
		const info: RecordedBatchInfo | undefined =
			batchStart.clientId === undefined || batchStart.batchStartCsn === undefined
				? undefined
				: {
						clientId: batchStart.clientId,
						batchStartCsn: batchStart.batchStartCsn,
						// True iff the wire carried explicit batchId metadata (resubmit path).
						// False indicates the batchId was derived from clientId + batchStartCsn (fresh submit).
						batchIdExplicit: batchStart.batchId !== undefined,
					};

Comment on lines 21 to +33
function makeBatch({
sequenceNumber,
minimumSequenceNumber,
batchId,
clientId,
batchStartCsn,
}: { sequenceNumber: number; minimumSequenceNumber: number } & (
| { batchId: string; clientId?: undefined; batchStartCsn?: undefined }
| { batchId?: undefined; clientId: string; batchStartCsn: number }
)): BatchStartInfo {
}: {
sequenceNumber: number;
minimumSequenceNumber: number;
batchId?: string;
clientId?: string;
batchStartCsn?: number;
}): BatchStartInfo {
Comment on lines 114 to +124
if (otherSequenceNumber !== undefined) {
const other = this.batchesBySeqNum.get(otherSequenceNumber);
assert(
this.batchIdsBySeqNum.get(otherSequenceNumber) === batchId,
other?.batchId === batchId,
0xce0 /* batchIdToSeqNum and seqNumToBatchId should be in sync for duplicate */,
);
return { duplicate: true, otherSequenceNumber };
return {
duplicate: true,
otherSequenceNumber,
otherBatchInfo: other.info,
};
@anthony-murphy
Copy link
Copy Markdown
Contributor Author

Deep Review

Reviewed commit 4230044 on 2026-05-19.

Readiness: 7/10 — ALMOST READY

The detector-layer refactor and the batchStartCsn override are well-designed and well-tested at the unit layer. No correctness defects found. Three polish items keep this short of sign-off: a missing event-site test that pins the grouped-batch messageClientSequenceNumber override (the PR's highest-value correctness line), a silent semantic change to an already-emitted telemetry property that production dashboards depend on, and an undocumented rollout/opt-in story given that #27351 reverted enableBatchIdTracking to opt-in only ~2h before this PR opened.

Path to Ready

  • Resolve inline threads
  • Add an event-site test in containerRuntime.spec.ts that synthesizes a grouped duplicate batch with an outer-envelope clientSequenceNumber distinct from the post-ungroupOp sub-message counter and asserts the emitted DuplicateBatch event's messageClientSequenceNumber equals batchStartCsn (not 1). Without this, a future refactor that reorders the object literal, drops the override, or renames batchStartCsn would silently reintroduce the exact MessageClientSequenceNumber == 1 symptom this PR is shipping to fix.

Context for Reviewers

For human reviewer
  • markfields sign-off on the batchStartCsn substitution — the PR description explicitly requests this ("Worth a careful look ... please confirm batchStart.batchStartCsn is the right substitute"). The originator's confirmation is the canonical sign-off.
  • kian-thompson telemetry-shape review — the new field names (batchIdExplicit, otherClientId, otherBatchStartCsn, otherBatchIdExplicit, otherFromSnapshot) plus the silent semantic change to messageClientSequenceNumber are exactly the surface kian-thompson historically audits (see Offline: Add recent batch info from DuplicateBatchDetector to summary #22454 review).
  • Two unresolved threads from copilot-pull-request-reviewer worth a human pass alongside the inline threads below: duplicateBatchDetector.spec.ts:33 (the makeBatch helper now accepts batchId/clientId/batchStartCsn all optional, weakening type safety against invalid combinations) and duplicateBatchDetector.ts:124 (the assert(other?.batchId === batchId, ...) pattern relies on optional-chaining for type narrowing; also applies to line 133).
  • Production rollout / opt-in coverage decision — whether to set enableBatchIdTracking for the affected tenants, and on what timeline, is an operational call that belongs with the on-call investigator.

// with a synthetic counter (1, 2, 3, ...). Override with the real outer
// envelope's clientSequenceNumber so downstream telemetry doesn't get a
// misleading "fake csn" value.
messageClientSequenceNumber: batchStart.batchStartCsn,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep Review: This override silently changes the meaning of an already-emitted telemetry property. ...extractSafePropertiesFromMessage(batchStart.keyMessage) already emits messageClientSequenceNumber (the post-ungroupOp synthetic counter); the subsequent messageClientSequenceNumber: batchStart.batchStartCsn replaces that value with the outer-envelope CSN.

The PR description confirms the field is in production dashboards today: "Across 50 production events over 7 days, every DuplicateBatch showed ... MessageClientSequenceNumber == 1". Any KQL query, alert, or dashboard tile keyed off DuplicateBatch.MessageClientSequenceNumber == 1 will stop matching once this ships.

Before merging, either:

(a) notify the on-call/telemetry owner (cc kian-thompson — historical reviewer on telemetry shape, see #22454) that the field's semantics are changing, and call out the back-compat impact in the PR description; or

(b) emit the corrected value under a distinct field name (e.g. outerBatchClientSequenceNumber) and leave the inherited spread value untouched.

(a) is preferred — the existing value is known-misleading and is exactly what this PR is trying to stop surfacing as authoritative — but it requires the explicit communication step. Also worth extending the inline comment at the override site to flag that this overrides an inherited field rather than adding a new one.

// with a synthetic counter (1, 2, 3, ...). Override with the real outer
// envelope's clientSequenceNumber so downstream telemetry doesn't get a
// misleading "fake csn" value.
messageClientSequenceNumber: batchStart.batchStartCsn,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep Review: The rollout/opt-in path for the new diagnostic fields is undocumented. PR #27351 (daesunp, merged 2026-05-20T00:20:38Z — ~2h before this PR was created) reverted #27262 and shifted Fluid.ContainerRuntime.enableBatchIdTracking back to explicit opt-in. The DuplicateBatch event only fires inside the duplicate-detection path that is now opt-in.

The PR description's success criterion — "the next event will tell us which it is" — implicitly assumes the tenant population emitting these events is opted in. The gated path is reachable via Fluid.Container.enableOfflineFull or Fluid.ContainerRuntime.enableBatchIdTracking, so this is a documentation gap, not a code defect. But the diagnostic value of the new fields is contingent on opt-in coverage; reviewers and the on-call investigator need that context.

Suggested fix: add a one-paragraph note to the PR description stating which enablement gate will be set for the affected tenant population and roughly when — either the existing flags will be turned on for those tenants, or this PR is staged ahead of a separate re-enablement.

Copy link
Copy Markdown
Contributor

@dannimad dannimad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only needs reformat

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants