Skip to content

feat(container-runtime): enhance DuplicateBatch telemetry diagnostics#27381

Open
dannimad wants to merge 3 commits into
microsoft:mainfrom
dannimad:danielcar/duplicate-batch-telemetry-diagnostics
Open

feat(container-runtime): enhance DuplicateBatch telemetry diagnostics#27381
dannimad wants to merge 3 commits into
microsoft:mainfrom
dannimad:danielcar/duplicate-batch-telemetry-diagnostics

Conversation

@dannimad
Copy link
Copy Markdown
Contributor

Re-opens #27355 by @anthony-murphy with the biome formatting fix applied so CI can pass. All code is Tony's; only a whitespace fix was added on top.

Description

The DuplicateBatch dataCorruptionError event recorded the new (duplicate) batch's identity but not the original's, and its messageClientSequenceNumber was the synthetic counter produced by OpGroupingManager.ungroupOp rather than the real outer-envelope csn. Both shortcomings made it impossible to tell from telemetry whether a production duplicate was caused by a service-side double-sequence, a client-side resubmit collision, or another path.

Changes

  • opLifecycle/duplicateBatchDetector.ts — Exported RecordedBatchInfo (clientId, batchStartCsn, batchIdExplicit); renamed batchIdsBySeqNumbatchesBySeqNum (now Map<number, { batchId, info? }>); processInboundBatch returns the original's RecordedBatchInfo (or undefined when the original came from a snapshot) on duplicate match. Snapshot wire format [[sequenceNumber, batchId]] preserved for backward compatibility.
  • containerRuntime.tsDuplicateBatch event now logs batchIdExplicit, otherClientId, otherBatchStartCsn, otherBatchIdExplicit, otherFromSnapshot, and overrides messageClientSequenceNumber with the real outer-envelope batchStart.batchStartCsn (with a code comment explaining the ungroupOp fake-csn artifact).
  • Tests — All existing assertions updated; 2 new tests cover the runtime-info path and explicit-vs-synthesized batchId detection. 11/11 unit + 3/3 integration tests pass.

Motivation — what production telemetry will now reveal

Across 50 production events over 7 days, every DuplicateBatch showed DupClientId == MsgClientId, BatchStartCsn == 1, MessageClientSequenceNumber == 1, and an empty details.batchId — which leaves multiple very different root-cause hypotheses (service-side double-sequence vs. client-side resubmit collision vs. another path) indistinguishable. With the new fields, the next event will tell us which it is:

Field combination on next event Likely root cause
batchIdExplicit == false and otherBatchIdExplicit == false, both csn equal Service-side double-sequence — escalate to ODSP.
Exactly one of the two *Explicit flags is true Client-side resubmit colliding with fresh submit.

Reviewer Guidance

The review process is outlined on this wiki page.

  • DuplicateBatchDetector is internal (exported from opLifecycle/index.ts but not in any api-report.md). No public API surface change, no changeset required.
  • Snapshot wire format unchanged: getRecentBatchInfoForSummary still emits the same [number, string][] tuples; a round-trip integration test exercises this.
  • Worth a careful look: the messageClientSequenceNumber override is correctness-tied — please confirm batchStart.batchStartCsn is the right substitute (it's set from the outer-envelope clientSequenceNumber before ungroupOp runs; see remoteMessageProcessor.ts).
  • Out of scope, but related: the same fake-csn artifact exists in other call sites that consume grouped sub-messages via extractSafePropertiesFromMessage (channelCollection.ts:480/526/578, dataStoreContext.ts:761, pendingStateManager.ts:729, inboundBatchAggregator.ts:183). Deliberately left for a follow-up — flagging for awareness.

Tagged deep-review because the fix is bound up with a live production data-corruption investigation; we want to be sure the new telemetry is correct before the next event lands.

anthony-murphy and others added 2 commits May 19, 2026 19:18
The DuplicateBatch dataCorruptionError event recorded the new
(duplicate) batch's identity but not the original's, and its
`messageClientSequenceNumber` was the synthetic counter produced by
`OpGroupingManager.ungroupOp` rather than the real outer-envelope
csn. Both shortcomings made it impossible to tell from telemetry
whether a duplicate was caused by a service-side double-sequence, a
client-side resubmit collision, or some other path.

Changes:

* `DuplicateBatchDetector` now records a `RecordedBatchInfo`
  (`clientId`, `batchStartCsn`, `batchIdExplicit`) for every
  runtime-observed batch, and returns the original's info on a
  duplicate match. Snapshot wire format
  (`[[sequenceNumber, batchId]]`) is unchanged.
* The `DuplicateBatch` telemetry event now includes
  `batchIdExplicit`, `otherClientId`, `otherBatchStartCsn`,
  `otherBatchIdExplicit`, `otherFromSnapshot`, and overrides
  `messageClientSequenceNumber` to `batchStart.batchStartCsn` so
  the logged value is the real wire csn.
* Tests updated; 2 new tests cover the new `RecordedBatchInfo` path
  and detection of explicit-vs-synthesized batchIds.

No public API surface change; `DuplicateBatchDetector` is internal.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ector

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 21, 2026 20:25
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 21, 2026

Hi! Thank you for opening this PR. Want me to review it?

Based on the diff (215 lines, 3 files), I've queued these reviewers:

  • Correctness — logic errors, race conditions, lifecycle issues
  • Security — vulnerabilities, secret exposure, injection
  • API Compatibility — breaking changes, release tags, type design
  • Performance — algorithmic regressions, memory leaks
  • Testing — coverage gaps, hollow tests

How this works

  • Adjust the reviewer set by ticking/unticking boxes above. Reviewer toggles alone don't trigger anything.

  • Tick Start review below to dispatch the review fleet.

  • After review finishes, tick Start review again to request another run — it auto-resets after each dispatch.

  • This comment updates as new commits land; your reviewer selections are preserved.

  • Start review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves DuplicateBatch data-corruption telemetry in container-runtime so production events can be diagnosed by correlating the duplicate batch with the original occurrence (including whether the batchId was explicitly stamped vs. derived) and by reporting the real outer-envelope client sequence number for grouped batches.

Changes:

  • Extend DuplicateBatchDetector to retain and return identifying info about the original batch when a duplicate is detected, while preserving the summary snapshot wire format.
  • Enrich ContainerRuntime DuplicateBatch telemetry with original-batch fields and override messageClientSequenceNumber to use batchStartCsn (avoiding the grouped-batch synthetic CSN artifact).
  • Update and expand unit tests to cover runtime-info reporting and explicit-vs-derived batchId detection.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
packages/runtime/container-runtime/src/opLifecycle/duplicateBatchDetector.ts Tracks per-seq recorded batch info and returns original-batch diagnostics on duplicate detection; keeps summary payload backward compatible.
packages/runtime/container-runtime/src/containerRuntime.ts Adds new DuplicateBatch telemetry fields and overrides messageClientSequenceNumber with batchStartCsn for correctness under grouped batching.
packages/runtime/container-runtime/src/test/opLifecycle/duplicateBatchDetector.spec.ts Updates existing assertions and adds new tests for runtime-info + explicit/derived batchId scenarios.

Comment on lines +27 to +31
}: {
sequenceNumber: number;
minimumSequenceNumber: number;
batchId?: string;
clientId?: string;
…chDetector

Collapse wrapped continuation lines in the RecordedBatchInfo @remarks list so
eslint's jsdoc/check-indentation rule passes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants