Skip to content

feat(ARSN-572): Propagate W3C trace context to MongoDB metadata writes#2611

Open
delthas wants to merge 3 commits into
development/8.4from
improvement/ARSN-572/trace-context
Open

feat(ARSN-572): Propagate W3C trace context to MongoDB metadata writes#2611
delthas wants to merge 3 commits into
development/8.4from
improvement/ARSN-572/trace-context

Conversation

@delthas
Copy link
Copy Markdown
Contributor

@delthas delthas commented Apr 13, 2026

Context

Cloudserver recently gained OpenTelemetry tracing (CLDSRV-884, cloudserver PR scality/cloudserver#6140). That gives us a single end-to-end trace per S3 request spanning cloudserver → vault → hdproxy → hyperiod.

But many state mutations in the S3 stack are written to MongoDB and consumed asynchronously by worker services via the oplog:

  • Backbeat — replication, lifecycle expiration/transition, garbage collection
  • Sorbet — cold-storage restore completion
  • Other oplog consumers

Today the oplog entry carries no trace context, so async work performed hours or days after the S3 request cannot be tied back to the request that caused it. A user's POST …?restore produces a trace that ends at the metadata write; the actual cold-tier retrieval by Sorbet is invisible from the original trace.

Solution

Stamp a W3C trace context (traceparent/tracestate) on every mutating metadata write, auto-derived from the currently-active OTEL context.

The field lives on ObjectMDData next to originOp:

traceContext?: {
    traceparent?: string;
    tracestate?: string;
};

A new helper stampActiveTraceContext in lib/storage/metadata/captureTraceContext.ts:

  • reads the active OTEL context via @opentelemetry/api
  • serializes it via propagation.inject()
  • writes the result to data.traceContext, or clears the field if no span is active (so a stale value loaded from storage cannot leak forward into a new oplog entry)

It is exposed with TS overloads so callers pass whichever shape they have — an ObjectMD instance or a raw metadata object:

export function stampActiveTraceContext(objMd: ObjectMD): void;
export function stampActiveTraceContext(data: Record<string, any>): void;
export function stampActiveTraceContext(data: ObjectMD | Record<string, any>): void {
    const target = data instanceof ObjectMD ? data.getValue() : data;
    // ...
}

It is called from inside MongoUtils.serialize(), so every metadata write that goes through serialize automatically gets the stamp (the putObject / putObjectVerCase4 / repair paths in MongoClientInterface). For the 2 paths that bypass serialize because they rewrite a doc previously read from the DB with already-escaped tags (internalPutObject, internalDeleteObject), stampActiveTraceContext() is called explicitly on the ObjectMD instance (the overload resolves .getValue() internally).

Consumers (backbeat, sorbet — out of scope for this PR) will later:

  1. Read value.traceContext.traceparent from the oplog entry
  2. propagation.extract() it to reconstruct a parent context
  3. Wrap per-entry processing in context.with(extractedCtx, ...)

This closes the loop on end-to-end tracing across the async boundary.

Design decisions

1. Auto-inject at the arsenal write boundary (vs. param plumbing).
The alternative was to thread a new traceContext parameter through the ~15 cloudserver call sites that already set originOp. Rejected because:

  • @opentelemetry/context-async-hooks (loaded by the OTEL SDK at cloudserver startup) already propagates the request context through arsenal's async call chain for free. Re-reading it at the mongo write is idiomatic OTEL.
  • Zero cloudserver surface change — we can iterate on arsenal independently.
  • Parallels existing behavior where arsenal reads originOp out of params and attaches it to the ObjectMD.
  • Avoids replicating the maintenance cost that makes originOp easy to forget on new API endpoints today.

2. Only @opentelemetry/api as a runtime dep (~10 KB, no SDK).
The api package is specifically designed to be safe for libraries to depend on. When no SDK is registered (today's state for anyone not opting into cloudserver OTEL), context.active() returns a no-op context, trace.getSpan() returns undefined, and stampActiveTraceContext() clears the field and returns. Zero runtime cost when OTEL is off — covered by a dedicated unit test (see Verification).

3. Optional field, no schema migration.
traceContext is absent by default — consistent with other optional ObjectMD fields like legalHold, retentionDate. No default-value change in the ObjectMDData construction. Existing consumers that don't understand the field simply ignore it; fully backward compatible.

4. Always reflects the current write — never carries a stale context forward.
If no span is active when stampActiveTraceContext() runs, it clears traceContext (rather than leaving whatever was loaded from storage). Same when propagation.inject() produces no traceparent. This prevents both edge-case OTEL output and the more practical hazard where a doc read for a delete/repair carries a traceContext from its original write that does not belong on the new oplog entry.

5. Single helper, single contract.
The trace-stamping API on arsenal is one function — stampActiveTraceContext, with overloads for the ObjectMD and raw-data call shapes but a single implementation. There is intentionally no public setter on ObjectMD: the contract is "trace context is stamped by the metadata backend at write time," and a model-level setter would invite callers in cloudserver/backbeat to bypass that contract (issue raised in review). Internally this also keeps the stamp/clear semantics in exactly one place. The overload pattern matches existing arsenal usage (convertToEpochTime, uuidv4).

Interaction with async flows (cold restore as example)

Cold storage restore exercises both a sync and an async metadata write:

  1. ObjectRestore:Post (sync, user-initiated via POST /bucket/key?restore) — cloudserver sets originOp; arsenal now also stamps traceContext pointing to the user's request trace. ✅
  2. ObjectRestore:Completed (async, Sorbet-triggered via cloudserver's /_/backbeat/* route) — behavior depends on Sorbet:
    • If Sorbet forwards traceparent on its HTTP call (ideally extracted from step 1's stored traceContext): arsenal picks it up via context.with(remoteContext, ...) in cloudserver's server.js, and the Completed write ties back to the original user restore. ✅
    • If Sorbet calls without traceparent (today's state): safely no-ops — Completed write has no traceContext, no regression. ✅

Same pattern applies to lifecycle expiration/transition, CRR replication, and GC.

Verification

Unit tests

  • tests/unit/storage/metadata/captureTraceContext.spec.jsstampActiveTraceContext against a mocked @opentelemetry/api: no-active-span clears (and is a no-op when there was nothing to clear), traceparent-only write, traceparent + tracestate write, defensive handling of a propagator that injects no traceparent, overwrite of a stale value, and the ObjectMD-instance overload stamping onto getValue().
  • tests/unit/storage/metadata/captureTraceContext.noOtel.spec.js — drives the real @opentelemetry/api with no SDK registered (the "arsenal bumped into a consumer that never initialized OTEL" case): confirms a safe no-op that clears stale context and never throws.
  • tests/unit/storage/metadata/mongoclient/{putObject,delObject,repair}.spec.js — sinon-stubbed tests verify the field lands on the document handed to findOneAndReplace / bulkWrite for each write path the cloudserver hot path exercises, and that it is cleared rather than carried forward when no span is active.

All unit tests in tests/unit/models/ and tests/unit/storage/metadata/ pass; yarn build is clean.

End-to-end (follow-up on a test cluster)

  1. Bump cloudserver's arsenal pin to a build of this branch; roll the cloudserver deployment on an Artesca cluster with ENABLE_OTEL=true
  2. Make a PutObject; note the trace ID in Jaeger
  3. Inspect the MongoDB oplog: o.value.traceContext.traceparent should be present, with characters 3-35 matching the Jaeger trace ID
  4. Negative test with ENABLE_OTEL=false → field should be absent

Issue: ARSN-572

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Apr 13, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Apr 13, 2026

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from d42f30b to 01693cd Compare April 13, 2026 15:38
Comment thread lib/models/ObjectMD.ts Outdated
Comment thread lib/storage/metadata/mongoclient/MongoClientInterface.ts Outdated
@scality scality deleted a comment from bert-e Apr 13, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.53%. Comparing base (1acb14f) to head (593039a).

Additional details and impacted files
@@                 Coverage Diff                 @@
##           development/8.3    #2611      +/-   ##
===================================================
+ Coverage            73.48%   73.53%   +0.04%     
===================================================
  Files                  222      223       +1     
  Lines                18183    18207      +24     
  Branches              3762     3790      +28     
===================================================
+ Hits                 13362    13388      +26     
+ Misses                4816     4814       -2     
  Partials                 5        5              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from 01693cd to ca456c9 Compare April 13, 2026 15:56
Comment thread lib/storage/metadata/captureTraceContext.ts
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from ca456c9 to a30bce8 Compare April 14, 2026 07:19
Comment thread lib/storage/metadata/captureTraceContext.ts
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from a30bce8 to 39233c8 Compare April 14, 2026 07:23
Comment thread lib/storage/metadata/captureTraceContext.ts
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from 39233c8 to a1fe7ca Compare May 5, 2026 15:58
@delthas delthas marked this pull request as ready for review May 5, 2026 16:02
@scality scality deleted a comment from bert-e May 5, 2026
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from a1fe7ca to eb5a535 Compare May 5, 2026 16:04
@delthas delthas requested review from a team, DarkIsDude and benzekrimaha May 5, 2026 16:08
Comment thread lib/models/ObjectMD.ts Outdated
Comment thread lib/models/ObjectMD.ts Outdated
Comment thread lib/storage/metadata/mongoclient/MongoClientInterface.ts Outdated
Comment thread lib/storage/metadata/mongoclient/MongoClientInterface.ts Outdated
Comment thread lib/storage/metadata/mongoclient/MongoClientInterface.ts
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from eb5a535 to 7c5ecce Compare May 6, 2026 14:25
Comment thread lib/storage/metadata/mongoclient/MongoClientInterface.ts Outdated
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from 7c5ecce to 28bca79 Compare May 6, 2026 14:43
@delthas delthas requested a review from DarkIsDude May 6, 2026 14:49
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from 28bca79 to 8619151 Compare May 7, 2026 09:06
@delthas delthas requested review from SylvainSenechal and removed request for benzekrimaha May 12, 2026 08:36
Comment thread lib/storage/metadata/mongoclient/MongoClientInterface.ts Outdated
Comment thread lib/storage/metadata/captureTraceContext.ts Outdated
Comment thread lib/storage/metadata/mongoclient/MongoClientInterface.ts Outdated
@scality scality deleted a comment from bert-e May 18, 2026
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch 2 times, most recently from 04348d1 to 972bdac Compare May 19, 2026 18:13
@scality scality deleted a comment from claude Bot May 19, 2026
Comment thread lib/models/ObjectMD.ts
@scality scality deleted a comment from claude Bot May 19, 2026
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from 972bdac to d9198ea Compare May 20, 2026 09:05
@scality scality deleted a comment from claude Bot May 20, 2026
@delthas delthas requested a review from francoisferrand May 20, 2026 09:43
Copy link
Copy Markdown
Contributor

@francoisferrand francoisferrand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with just a small fear (probably not an issue, but I'd like to be sure) about behavior if/when Otel is not configured/imported/... at all in the component (c.f. comment).

Which brings a related question: shoud this be merged in 8.3, or in a new branch?

Comment thread lib/storage/metadata/captureTraceContext.ts Outdated
Comment thread tests/unit/storage/metadata/captureTraceContext.spec.js
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from d9198ea to 593039a Compare May 27, 2026 08:52
@scality scality deleted a comment from claude Bot May 27, 2026
Pre-existing prettier debt on the files the ARSN-572 trace-context work
touches. Extracted as a separate commit so the feature commit shows
only logic changes, not formatting noise.

Issue: ARSN-572
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from 593039a to 72ad687 Compare May 27, 2026 09:12
@delthas delthas changed the base branch from development/8.3 to development/8.4 May 27, 2026 09:12
@delthas
Copy link
Copy Markdown
Contributor Author

delthas commented May 27, 2026

LGTM, with just a small fear (probably not an issue, but I'd like to be sure) about behavior if/when Otel is not configured/imported/... at all in the component (c.f. comment).

Which brings a related question: shoud this be merged in 8.3, or in a new branch?

Added a test for this; no issues.

But moving to 8.4 since the other repos for tracing will target their latest branch, too.

Comment thread tests/unit/storage/metadata/mongoclient/delObject.spec.js Outdated
@scality scality deleted a comment from claude Bot May 27, 2026
delthas added 2 commits May 27, 2026 11:22
Add a traceContext field to ObjectMDData carrying the currently-active
OTEL trace context (W3C traceparent/tracestate). A new helper,
stampActiveTraceContext, reads the active OTEL context and stamps it
onto the metadata object in place. It is called from
MongoUtils.serialize (covering the putObject / putObjectVerCase4 /
repair write paths) and explicitly at the two paths that bypass
serialize because they rewrite an already-escaped doc read from the DB
(internalPutObject, internalDeleteObject). The helper is overloaded so
callers pass either an ObjectMD instance or a raw metadata object.

The value ends up in the MongoDB oplog; downstream consumers (backbeat,
sorbet) can extract it to continue the trace across the async boundary,
closing the loop on end-to-end tracing for flows that cross the oplog.

When OTEL is not active on the caller (no SDK, or a request outside a
traced context), there is no active span: stampActiveTraceContext clears
the field and returns — zero cost, and no stale context loaded from
storage is carried forward into a new oplog entry.

Only adds @opentelemetry/api as a runtime dependency (the API-only
surface package, a no-op when no SDK is registered).

Issue: ARSN-572
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from 3499403 to 7a8cc7b Compare May 27, 2026 09:22
Comment thread tests/unit/storage/metadata/mongoclient/delObject.spec.js Outdated
@claude
Copy link
Copy Markdown

claude Bot commented May 27, 2026

LGTM

Clean, well-designed implementation. The trace-context stamping strategy — automatic via MongoUtils.serialize() for normal paths, explicit stampActiveTraceContext() calls for paths bypassing serialize (internalPutObject, internalDeleteObject) — covers all write paths correctly with no double-stamping. The stale-context clearing contract is solid. Test coverage is thorough, including the no-SDK edge case.

Review by Claude Code

@scality scality deleted a comment from claude Bot May 27, 2026
@scality scality deleted a comment from claude Bot May 27, 2026
@scality scality deleted a comment from bert-e May 27, 2026
@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented May 27, 2026

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants