Skip to content

feat(CLDSRV-884): Add OpenTelemetry tracing instrumentation#6140

Open
delthas wants to merge 8 commits into
development/9.4from
improvement/CLDSRV-884/otel-instrumentation
Open

feat(CLDSRV-884): Add OpenTelemetry tracing instrumentation#6140
delthas wants to merge 8 commits into
development/9.4from
improvement/CLDSRV-884/otel-instrumentation

Conversation

@delthas
Copy link
Copy Markdown
Contributor

@delthas delthas commented Apr 2, 2026

Summary

Add OpenTelemetry tracing instrumentation to cloudserver, gated behind ENABLE_OTEL=true. When the flag is unset, no @opentelemetry/* package is loaded — zero overhead off the OTEL path.

Rebased onto development/9.4. The branch leads with a prettier-only commit so the feature commits show logic changes, not formatting noise.

Commits

  1. chore: prettier-format files touched by CLDSRV-884 — isolates the pre-existing prettier reformat of lib/api/api.js + lib/server.js so the subsequent feat: commits carry only logic deltas (e.g. the api.js instrumentation is then +11 / -0).

  2. chore: add OpenTelemetry dependencies — the OTEL Node SDK, OTLP/HTTP trace exporter, and the three instrumentation packages we explicitly opt into (instrumentation-http, instrumentation-mongodb, instrumentation-ioredis), all as direct deps — no auto-instrumentations-node bundle (which would pull ~36 unused instrumentations). protobufjs resolution for a transitive CVE.

  3. feat: add trust-boundary host filter (lib/tracing/trustedHosts.js) — outbound trust boundary. OTEL_TRUSTED_HOSTS is an operator-supplied comma-joined list of lowercase bare hostnames; loadTrustedHosts() splits it and adds loopback aliases (localhost, 127.0.0.1, ::1). makeRequestHook strips traceparent/tracestate on outbound requests to hosts not in the set (the client span is preserved and tagged scality.trace.suppressed=true). normalizeHostHeader strips IPv6 brackets + port from the incoming Host header before matching. Cloudserver does not self-derive the list from its Config — that is computed operator-side (tracked as ZKOP-551); unset env → only loopback trusted (safe default).

  4. feat: add health-probe ignore filter (lib/tracing/healthPaths.js) — isHealthPath() matches k8s liveness/readiness + metrics-scrape paths; wired into instrumentation-http's ignoreIncomingRequestHook (which also drops OPTIONS preflight) so probe traffic produces no spans.

  5. feat: add OTEL bootstrap and manual span helper

    • lib/tracing/index.jsinit() / close() / isEnabled() facade. Boots NodeSDK with the OTLP/HTTP exporter, ParentBasedSampler({ root: TraceIdRatioBasedSampler(ratio) }) (so an inbound traceparent with sampled=1 from NGINX/Beyla is honored end-to-end), the three instrumentations above, span limits, and logRecordProcessors: [] / metricReaders: [] (explicit traces-only — otherwise NodeSDK silently starts OTLP log + metric exporters). OTEL_SAMPLING_RATIO is parsed with Number() and fails fast outside [0,1]. Cluster primary skips init (workers re-exec and init themselves). All @opentelemetry/* requires are lazy inside init() so a process with OTEL off never loads them.
    • lib/instrumentation/simple.jsinstrumentApiMethod(handler, methodName) wraps an S3 handler in an api.<methodName> span owning the whole handler execution; auto-instrumentation spans (HTTP/Mongo/ioredis) nest beneath. Handles callback / Promise / sync return shapes with a try/catch + rethrow (no .then() chains — repo CodeQL rule), and sets cloudserver.error_code on the error path.
    • index.js calls require('./lib/tracing').init() before server.js, so require-in-the-middle patches land before any HTTP/Mongo/ioredis module is loaded.
  6. feat: flush OTEL on shutdownS3Server.cleanUp() is now async: it awaits tracing.close() after server.close() (via util.promisify) and before process.exit(0). close() is race-safe (captures + nulls the module-level SDK before awaiting) and bounded at 5 s via Promise.race with a .unref()'d timer, so an unreachable collector can't block past Kubernetes' default 30 s grace period; the shutdown promise carries its own rejection handler so a late failure can't crash the process. caughtExceptionShutdown() flushes on the non-cluster (process.exit(1)) and cluster-worker (worker.kill()) paths; the cluster primary keeps its original survive-and-log behavior.

    • Inbound trace-context extraction is intentionally not done manually: instrumentation-http already extracts traceparent and makes the server span a child of the remote parent. A manual extract would demote downstream api.<handler> spans to siblings.
  7. feat: instrument all S3 API handlers with OTEL spanslib/api/api.js applies the wrapper via an Object.entries(api) loop with a NON_INSTRUMENTED_KEYS skip set (callApiMethod, checkAuthResults, handleAuthorizationResults). New S3 handlers are auto-instrumented; no per-handler boilerplate. Pure logic delta is 11 lines.

  8. chore: bump arsenal to ARSN-572 PoC branch for e2e trace context testingtemporary: pins arsenal at the ARSN-572 branch (scality/Arsenal#2611), which stamps traceContext onto MongoDB metadata writes so async consumers (Backbeat, Sorbet, lifecycle) can continue the trace across the oplog boundary. To be repinned to a real release tag once ARSN-572 ships.

Origin

Extracted and cleaned up from William Lardier's user/test/wlardier/servicemesh-2 branch (OTEL mixed with perf optimizations + debug artifacts). This PR contains only the OTEL instrumentation. Dropped: lib/otelContextPropagation.js (dead code), MOCK_DOAUTH auth-result caching, ~15 console.log artifacts, and all performance optimizations (out of scope).

Tests

34 unit tests, no functional regressions:

  • tests/unit/lib/tracing/init.spec.js (10) — isEnabled, init (no-op when off, throws without endpoint, boots + idempotent — asserted via sinon stubs on NodeSDK.prototype.start), close (no-op, post-init, exactly-once shutdown under concurrent callers)
  • tests/unit/lib/tracing/trustedHosts.spec.js (13) — loadTrustedHosts parsing + loopback defaults; makeRequestHook trusted/untrusted/inbound-skip/missing-host, port + IPv6-bracket Host-header normalization
  • tests/unit/lib/tracing/healthPaths.spec.js (4) — probe-path matching
  • tests/unit/lib/instrumentationSimple.spec.js (7) — instrumentApiMethod callback success/error, sync throw, async resolve/reject, sync return, OTEL-off bypass

Context

  • Parent investigation: OS-1072
  • Companion arsenal change: scality/Arsenal#2611 (ARSN-572) — to merge before this PR can drop its temporary branch pin
  • Operator-side trust-list propagation: ZKOP-551
  • Scality ADR mandates OpenTelemetry across products; the storage layer (hdcontroller / hyperiod) is already instrumented, and this PR completes the chain via instrumentation-http's automatic outbound traceparent propagation

Issue: CLDSRV-884

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Apr 2, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Apr 2, 2026

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 89.20000% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.27%. Comparing base (3f7ce59) to head (72b897a).
⚠️ Report is 9 commits behind head on development/9.4.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
lib/api/api.js 89.47% 10 Missing ⚠️
lib/server.js 65.38% 9 Missing ⚠️
lib/tracing/index.js 86.36% 6 Missing ⚠️
lib/instrumentation/simple.js 96.29% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

Files with missing lines Coverage Δ
lib/tracing/healthPaths.js 100.00% <100.00%> (ø)
lib/tracing/trustedHosts.js 100.00% <100.00%> (ø)
lib/instrumentation/simple.js 96.29% <96.29%> (ø)
lib/tracing/index.js 86.36% <86.36%> (ø)
lib/server.js 78.19% <65.38%> (-1.42%) ⬇️
lib/api/api.js 90.87% <89.47%> (+0.19%) ⬆️

... and 1 file with indirect coverage changes

@@                 Coverage Diff                 @@
##           development/9.4    #6140      +/-   ##
===================================================
+ Coverage            85.21%   85.27%   +0.05%     
===================================================
  Files                  207      211       +4     
  Lines                13832    13972     +140     
===================================================
+ Hits                 11787    11914     +127     
- Misses                2045     2058      +13     
Flag Coverage Δ
file-ft-tests 68.75% <48.40%> (-0.51%) ⬇️
kmip-ft-tests 28.17% <41.60%> (-0.09%) ⬇️
mongo-v0-ft-tests 69.83% <48.40%> (-0.58%) ⬇️
mongo-v1-ft-tests 69.85% <48.40%> (-0.63%) ⬇️
multiple-backend 36.61% <43.20%> (-0.18%) ⬇️
sur-tests 35.47% <43.20%> (-0.17%) ⬇️
sur-tests-inflights 37.36% <43.20%> (-0.22%) ⬇️
unit 72.03% <84.00%> (+0.18%) ⬆️
utapi-v2-tests 34.61% <43.60%> (-0.16%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread lib/otel.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 9c93a1d to 97e63fc Compare April 2, 2026 16:32
Comment thread lib/instrumentation/simple.js Outdated
Comment thread package.json Outdated
Comment thread lib/otel.js Outdated
Comment thread lib/otel.js Outdated
@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 97e63fc to f978280 Compare April 2, 2026 16:35
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/otel.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from f978280 to 06eea4e Compare April 2, 2026 16:49
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread package.json Outdated
@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 06eea4e to 75c3afb Compare April 2, 2026 17:02
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread package.json Outdated
Comment thread lib/otel.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 75c3afb to 207b785 Compare April 13, 2026 13:05
Comment thread lib/instrumentation/simple.js Outdated
Comment thread tests/unit/lib/tracing/healthPaths.spec.js Outdated
Comment thread tests/unit/lib/tracing/init.spec.js Outdated
Comment thread tests/unit/lib/tracing/trustedHosts.spec.js Outdated
@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented May 21, 2026

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

The following reviewers are expecting changes from the author, or must review again:

Comment thread package.json
Pre-existing prettier debt on lib/api/api.js and lib/server.js — the
9.4 base did not run prettier on these. Extracted as a separate
commit so the subsequent CLDSRV-884 OTEL commits show only logic
changes, not formatting noise.

Issue: CLDSRV-884
Comment thread package.json
Comment thread package.json
Comment thread package.json
Comment thread lib/server.js Outdated
Comment thread package.json
Comment thread package.json
Comment thread lib/tracing/trustedHosts.js
Comment thread package.json
@delthas
Copy link
Copy Markdown
Contributor Author

delthas commented May 21, 2026

Updates since the last review pass:

  • Extracted a first chore: prettier-format files touched by CLDSRV-884 commit, so the subsequent OTEL commits show only logic changes (no formatting noise).
  • Split the original big OTEL commit into 4 smaller ones (deps, trust-boundary host filter, health-probe filter, bootstrap + manual span helper), each reviewable in isolation.
  • Trust list is now an OTEL_TRUSTED_HOSTS env var supplied by the operator (comma-joined bare hostnames). Cloudserver no longer self-derives from Config. Deploying the env var is tracked in ZKOP-551 (PR).

Comment thread package.json Outdated
Comment thread lib/instrumentation/simple.js
Comment thread lib/server.js Outdated
Comment thread tests/unit/lib/tracing/init.spec.js Outdated
Comment thread tests/unit/lib/tracing/init.spec.js Outdated
Comment thread tests/unit/lib/tracing/init.spec.js Outdated
Comment thread tests/unit/lib/tracing/init.spec.js Outdated
Comment thread tests/unit/lib/tracing/init.spec.js Outdated
Comment thread package.json
Comment thread lib/instrumentation/simple.js
Comment thread lib/tracing/trustedHosts.js
@claude
Copy link
Copy Markdown

claude Bot commented May 22, 2026

  • Arsenal is pinned to a branch, not a tag, and is a version downgrade (8.4.2 → 8.3.11). Must be pinned to a release tag before merge.
    - Wait for ARSN-572 to ship and pin to the resulting 8.x.y tag.
    - instrumentCallbackHandler can double-end a span (callback ends it with OK, then a throw in the catch sets ERROR on an already-ended span — the error is silently lost).
    - Add a boolean ended guard flag checked by both the callback wrapper and the catch block.
    - loadTrustedHosts does not trim whitespace from comma-separated entries in OTEL_TRUSTED_HOSTS.
    - Add .trim() before .toLowerCase().

    Review by Claude Code

delthas added 7 commits May 26, 2026 14:13
Brings in the OTEL Node SDK, OTLP/HTTP exporter, and the three
instrumentation packages we explicitly opt into (http, mongodb,
ioredis). Split out of the main CLDSRV-884 work so reviewers can
verify the dependency surface independently of the code that
consumes it.

Issue: CLDSRV-884
Pure helper: parses OTEL_TRUSTED_HOSTS into a Set, exposes a
requestHook that strips traceparent/tracestate from outbound
HTTP requests to hosts outside the set. Loopback aliases are
always trusted so dev / local tooling work without extra config.

Used in the next commit's bootstrap (HttpInstrumentation requestHook).

Issue: CLDSRV-884
Pure helper: matches request URLs against a set of known
probe/scrape paths (k8s liveness/readiness, metrics scrape).
Wired into HttpInstrumentation.ignoreIncomingRequestHook in
the next commit so probe traffic doesn't dominate the trace
volume.

Issue: CLDSRV-884
The two pieces that wire OTEL into the process:

  - lib/tracing/index.js: init() / close() / isEnabled() facade.
    Boots the Node SDK with an explicit allowlist of three
    instrumentations (http, mongodb, ioredis), a ParentBased
    TraceIdRatio sampler, the trust-boundary requestHook, and
    the health-probe ignore hook. Cluster primary is skipped
    (workers re-exec the entry and init themselves). close()
    flushes the BatchSpanProcessor with a bounded deadline so
    SIGTERM can't hang past the kubelet's grace period.

  - lib/instrumentation/simple.js: instrumentApiMethod() — wraps
    a callback-style or async S3 handler in a span that ends
    when the underlying handler settles. The next commits use
    this to instrument all 80+ S3 API methods.

  - index.js: calls tracing.init() before requiring server.js
    so the require-in-the-middle patches land before any
    HTTP/Mongo/ioredis module is loaded.

Issue: CLDSRV-884
@delthas
Copy link
Copy Markdown
Contributor Author

delthas commented May 26, 2026

// always-on sampling overwhelms the exporter and storage with traffic
// nobody queries.
const HEALTH_PATHS = new Set(['/live', '/ready', '/_/healthcheck', '/_/healthcheck/deep', '/metrics']);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should health path filtering be configurable, i.e. allow enabling it (possibly with an extra env variable)?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simple.js is a really weird name: can we find better?
Not much idea if it is in its own module/folder except instrumentatin/instrumentatin.js); maybe tracing/instrumentation.js would be better ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since both module are inter-dependent (tracing requires this one & vice-versa), they should probably indeed go into the same module...

}

function instrumentCallbackHandler(self, apiMethod, spanName, args, callbackIndex) {
const api = require('@opentelemetry/api');
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any penalty with repeatedly require'ing the same api module (i.e. extra load time on startup, more memory usage...), or is this somehow memoized by node.js anyway and thus essentially free (but for the syntaxic weight, which stems from line 15) ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, this may be done in api.js or at the top level of module:

const OTEL_ENABLED = process.env.get('OTEL_ENABLED');
const api = OTEL_ENABLED ? require('api') : null;

this would avoid the extra api parameter in endSpan, and repeated require/api var ; while still not loading OTEL module unless enabled.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the repeat require, it's a no cost. Node have a cache

Comment thread lib/api/api.js
for (const [name, handler] of Object.entries(api)) {
if (typeof handler === 'function' && !NON_INSTRUMENTED_KEYS.has(name)) {
api[name] = instrumentApiMethod(handler, name);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a really dumb question: what do we get by instrumenting each method here, instead of hooking just once in Server.routeRequest (in server.js) or in arsenal's router ?

Comment thread lib/tracing/index.js
Comment on lines +23 to +24
const { config } = require('../Config');
const cluster = require('cluster');
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can also conditinally require:

const cluster = process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT ? require('cluster') : undefined;

(and specifically for cluster, I think it is loaded anyway by server.js ?)

Comment thread lib/tracing/index.js
Comment on lines +23 to +24
const { config } = require('../Config');
const cluster = require('cluster');
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(c.f. other comment: lazy loading opentelemetry here is acceptable i guess, as this is very limited in scope: to a single functino; but more problematic in instrumentatin where all functions use it...)

Comment thread lib/tracing/index.js
if (process.env.OTEL_SAMPLING_RATIO !== undefined) {
const parsed = Number(process.env.OTEL_SAMPLING_RATIO);
if (!Number.isFinite(parsed) || parsed < 0 || parsed > 1) {
throw new Error(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: in cloudserver' Config.js, we assert() when config is valid. Is it better to throw or to assert ?

Comment thread lib/server.js
// worker process with 'SIGTERM'.
// worker.kill() is graceful (closes servers, disconnects IPC) but
// does not fire our SIGTERM handler, so the BatchSpanProcessor
// would lose buffered spans without an explicit flush here.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we/can we add tests for verify behavior on shutdown?

Copy link
Copy Markdown
Contributor

@DarkIsDude DarkIsDude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost good for me. My main concern is about the package.json 🙏

Comment thread package.json
"ts-morph/**/brace-expansion": "^5.0.5",
"ts-morph/**/picomatch": "^4.0.4"
"ts-morph/**/picomatch": "^4.0.4",
"protobufjs": "^7.5.8"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this one is needed now ?

Comment thread index.js
Comment on lines +10 to +12
// Start tracing before requiring anything that hooks into HTTP, MongoDB,
// or ioredis — instrumentation patches modules on require, so anything
// loaded earlier than init() would run unpatched.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Start tracing before requiring anything that hooks into HTTP, MongoDB,
// or ioredis — instrumentation patches modules on require, so anything
// loaded earlier than init() would run unpatched.

This is purely standard behaviour in tracing / monitoring. All package that do that (datadog, optl, newrelic, ...) are doing the same things and explain it already.

Comment thread lib/instrumentation/simple.js
}

function instrumentCallbackHandler(self, apiMethod, spanName, args, callbackIndex) {
const api = require('@opentelemetry/api');
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the repeat require, it's a no cost. Node have a cache

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants