feat(container-runtime): enhance DuplicateBatch telemetry diagnostics#27381
feat(container-runtime): enhance DuplicateBatch telemetry diagnostics#27381dannimad wants to merge 3 commits into
Conversation
The DuplicateBatch dataCorruptionError event recorded the new (duplicate) batch's identity but not the original's, and its `messageClientSequenceNumber` was the synthetic counter produced by `OpGroupingManager.ungroupOp` rather than the real outer-envelope csn. Both shortcomings made it impossible to tell from telemetry whether a duplicate was caused by a service-side double-sequence, a client-side resubmit collision, or some other path. Changes: * `DuplicateBatchDetector` now records a `RecordedBatchInfo` (`clientId`, `batchStartCsn`, `batchIdExplicit`) for every runtime-observed batch, and returns the original's info on a duplicate match. Snapshot wire format (`[[sequenceNumber, batchId]]`) is unchanged. * The `DuplicateBatch` telemetry event now includes `batchIdExplicit`, `otherClientId`, `otherBatchStartCsn`, `otherBatchIdExplicit`, `otherFromSnapshot`, and overrides `messageClientSequenceNumber` to `batchStart.batchStartCsn` so the logged value is the real wire csn. * Tests updated; 2 new tests cover the new `RecordedBatchInfo` path and detection of explicit-vs-synthesized batchIds. No public API surface change; `DuplicateBatchDetector` is internal. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ector Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Hi! Thank you for opening this PR. Want me to review it? Based on the diff (215 lines, 3 files), I've queued these reviewers:
How this works
|
There was a problem hiding this comment.
Pull request overview
Improves DuplicateBatch data-corruption telemetry in container-runtime so production events can be diagnosed by correlating the duplicate batch with the original occurrence (including whether the batchId was explicitly stamped vs. derived) and by reporting the real outer-envelope client sequence number for grouped batches.
Changes:
- Extend
DuplicateBatchDetectorto retain and return identifying info about the original batch when a duplicate is detected, while preserving the summary snapshot wire format. - Enrich
ContainerRuntimeDuplicateBatchtelemetry with original-batch fields and overridemessageClientSequenceNumberto usebatchStartCsn(avoiding the grouped-batch synthetic CSN artifact). - Update and expand unit tests to cover runtime-info reporting and explicit-vs-derived batchId detection.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| packages/runtime/container-runtime/src/opLifecycle/duplicateBatchDetector.ts | Tracks per-seq recorded batch info and returns original-batch diagnostics on duplicate detection; keeps summary payload backward compatible. |
| packages/runtime/container-runtime/src/containerRuntime.ts | Adds new DuplicateBatch telemetry fields and overrides messageClientSequenceNumber with batchStartCsn for correctness under grouped batching. |
| packages/runtime/container-runtime/src/test/opLifecycle/duplicateBatchDetector.spec.ts | Updates existing assertions and adds new tests for runtime-info + explicit/derived batchId scenarios. |
| }: { | ||
| sequenceNumber: number; | ||
| minimumSequenceNumber: number; | ||
| batchId?: string; | ||
| clientId?: string; |
…chDetector Collapse wrapped continuation lines in the RecordedBatchInfo @remarks list so eslint's jsdoc/check-indentation rule passes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Description
The
DuplicateBatchdataCorruptionErrorevent recorded the new (duplicate) batch's identity but not the original's, and itsmessageClientSequenceNumberwas the synthetic counter produced byOpGroupingManager.ungroupOprather than the real outer-envelope csn. Both shortcomings made it impossible to tell from telemetry whether a production duplicate was caused by a service-side double-sequence, a client-side resubmit collision, or another path.Changes
opLifecycle/duplicateBatchDetector.ts— ExportedRecordedBatchInfo(clientId,batchStartCsn,batchIdExplicit); renamedbatchIdsBySeqNum→batchesBySeqNum(nowMap<number, { batchId, info? }>);processInboundBatchreturns the original'sRecordedBatchInfo(orundefinedwhen the original came from a snapshot) on duplicate match. Snapshot wire format[[sequenceNumber, batchId]]preserved for backward compatibility.containerRuntime.ts—DuplicateBatchevent now logsbatchIdExplicit,otherClientId,otherBatchStartCsn,otherBatchIdExplicit,otherFromSnapshot, and overridesmessageClientSequenceNumberwith the real outer-envelopebatchStart.batchStartCsn(with a code comment explaining theungroupOpfake-csn artifact).Motivation — what production telemetry will now reveal
Across 50 production events over 7 days, every
DuplicateBatchshowedDupClientId == MsgClientId,BatchStartCsn == 1,MessageClientSequenceNumber == 1, and an emptydetails.batchId— which leaves multiple very different root-cause hypotheses (service-side double-sequence vs. client-side resubmit collision vs. another path) indistinguishable. With the new fields, the next event will tell us which it is:batchIdExplicit == falseandotherBatchIdExplicit == false, both csn equal*Explicitflags istrueReviewer Guidance
The review process is outlined on this wiki page.
DuplicateBatchDetectoris internal (exported fromopLifecycle/index.tsbut not in anyapi-report.md). No public API surface change, no changeset required.getRecentBatchInfoForSummarystill emits the same[number, string][]tuples; a round-trip integration test exercises this.messageClientSequenceNumberoverride is correctness-tied — please confirmbatchStart.batchStartCsnis the right substitute (it's set from the outer-envelopeclientSequenceNumberbeforeungroupOpruns; seeremoteMessageProcessor.ts).extractSafePropertiesFromMessage(channelCollection.ts:480/526/578,dataStoreContext.ts:761,pendingStateManager.ts:729,inboundBatchAggregator.ts:183). Deliberately left for a follow-up — flagging for awareness.Tagged
deep-reviewbecause the fix is bound up with a live production data-corruption investigation; we want to be sure the new telemetry is correct before the next event lands.