
[SDTEST-3211] Use Datadog test suite durations endpoint#49

Merged
anmarchenko merged 8 commits into main from anmarchenko/use_backend_api
Apr 29, 2026

Conversation

@anmarchenko
Member

@anmarchenko anmarchenko commented Apr 14, 2026

Summary

Adds support for the test suite durations backend API and uses that data to improve ddtest planning.

  • Adds a Datadog test suite durations client for POST /api/v2/ci/ddtest/test_suite_durations, including pagination and in-memory storage.
  • Fetches suite durations during test optimization setup without failing planning when the API errors or returns no data.
  • Tracks discovered tests as suite aggregates with total duration, estimated runnable duration, test count, and skipped test count.
  • Uses backend p50 durations to weight test files for splitting, with count-based fallbacks when backend data is unavailable.
  • Calculates skippable percentage from estimated saved duration instead of raw skipped test count.
  • Handles repo-root vs subdirectory execution paths so backend source files can match locally discovered files.
  • Ensures backend-only/stale suites are only added when their source file exists in local discovery results.
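
The weighting described in these bullets can be sketched as follows. This is a minimal illustration, not the PR's actual code — the type and field names (`suiteAggregate`, `P50Ms`, `NumTests`, `NumTestsSkipped`) are assumptions:

```go
package main

import "fmt"

// suiteAggregate mirrors the per-file aggregate described above;
// field names are illustrative, not the PR's real identifiers.
type suiteAggregate struct {
	P50Ms           float64 // backend p50 duration in ms, 0 when unavailable
	NumTests        int
	NumTestsSkipped int
}

// testFileWeight returns the split weight for a test file and whether the
// file should be scheduled at all. Fully skipped files are dropped; files
// without backend data fall back to a count-based weight.
func testFileWeight(a suiteAggregate) (float64, bool) {
	runnable := a.NumTests - a.NumTestsSkipped
	if runnable <= 0 {
		return 0, false // every test skipped: omit the file from the split
	}
	if a.P50Ms <= 0 {
		return float64(runnable), true // count-based fallback
	}
	// Estimated runnable duration: p50 scaled by the runnable fraction.
	return a.P50Ms * float64(runnable) / float64(a.NumTests), true
}

func main() {
	w, ok := testFileWeight(suiteAggregate{P50Ms: 3000, NumTests: 3, NumTestsSkipped: 2})
	fmt.Println(w, ok) // 1000 true
}
```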

Testing

Automated validation:

  • make test
  • make lint

E2E validation should cover:

  • Full discovery with ITR enabled: verify discovered suites are aggregated, skipped tests are omitted from runnable files, and backend p50 durations affect split weights.
  • Fast discovery path with ITR/test skipping disabled: verify ddtest runs only locally discovered test files and does not reintroduce deleted/stale files returned by the backend.
  • Backend durations API behavior: verify successful non-empty response, empty response warning, and API error handling all keep planning non-fatal.
  • Duration-based skippable percentage: verify the output percentage and parallel runner count reflect saved time, not skipped test count.
  • Missing/invalid backend durations: verify ddtest falls back to count-based suite duration and still creates valid splits.
  • Subdirectory execution: run ddtest from a repo subdirectory and verify git-root-relative backend source files map to CWD-relative runnable files.
  • Split execution: verify generated test split files contain only runnable local test files and are weighted by backend durations where available.

Introduces TestSuiteDurationsClient that calls POST /api/v2/ci/ddtest/test_suite_durations
to fetch historical test suite duration percentiles (p50, p90) for optimizing
parallel test splitting. Follows the same layered architecture as the existing
TestOptimizationClient with interface-based dependency injection for testability.
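
The layered, interface-injected shape described above might look roughly like this sketch. The interface and method names here are assumptions for illustration, not the real signatures:

```go
package main

import "fmt"

// DurationsAPI is the dependency-injection seam: tests substitute a fake,
// production wires in the real HTTP client. The signature is illustrative.
type DurationsAPI interface {
	FetchTestSuiteDurations() (map[string]float64, error) // source_file -> p50 ms
}

// fakeAPI stands in for the real HTTP client in tests.
type fakeAPI struct{ durations map[string]float64 }

func (f fakeAPI) FetchTestSuiteDurations() (map[string]float64, error) {
	return f.durations, nil
}

type DurationsClient struct{ api DurationsAPI }

// Fetch never fails planning: errors and empty responses degrade to nil,
// letting the planner fall back to count-based weights.
func (c DurationsClient) Fetch() map[string]float64 {
	d, err := c.api.FetchTestSuiteDurations()
	if err != nil || len(d) == 0 {
		return nil
	}
	return d
}

func main() {
	c := DurationsClient{api: fakeAPI{durations: map[string]float64{"spec/a_spec.rb": 5000}}}
	fmt.Println(c.Fetch()["spec/a_spec.rb"])
}
```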

Made-with: Cursor
@anmarchenko anmarchenko changed the title from "Add API client for test suite durations endpoint" to "[SDTEST-3211] Use Datadog test suite durations endpoint" Apr 14, 2026
Fetch backend test suite durations during optimization setup, store them in memory for later use, and keep planning behavior unchanged when the API is empty or errors.

Made-with: Cursor
Add debug logging for the raw backend response body when fetching test suite durations to match the visibility we already have for settings.

Made-with: Cursor
@anmarchenko
Member Author

E2E Test Report: SUCCESS ✅

Tested by: Shepherd Agent (autonomous QA for Datadog Test Optimization)

Test Environment

  • Method: Local testing via Shepherd / crook against mockdog
  • Playground: anmarchenko/forem — Rails app with RSpec, scoped to spec/policies/*_spec.rb (23 spec files)
  • Revision tested: 00af6a60c05e1b915948de25d30b3a35b78eb21e (branch anmarchenko/use_backend_api)

Approach

  • Restricted ddtest discovery to 23 policy specs via DD_TEST_OPTIMIZATION_RUNNER_TESTS_LOCATION=spec/policies/*_spec.rb.
  • Ran bin/ddtest plan (planning phase only — no test execution needed to evaluate the split) with --min-parallelism 1 --max-parallelism 8.
  • Inspected the produced .testoptimization/tests-split/runner-N split files for two scenarios:
    1. Default mockdog scenario — /api/v2/ci/ddtest/test_suite_durations returns an empty response.
    2. Custom scenario with three heavy outliers + 20 light files — article_policy_spec p50=10s, comment_policy_spec p50=8s, response_template_policy_spec p50=5s, all others 100 ms.

Results

| Scenario | runner-0 | runner-1 | runner-2 | runner-3 | runner-4 | runner-5 | runner-6 | runner-7 |
|---|---|---|---|---|---|---|---|---|
| No backend durations (count-based fallback) | 3 files | 3 | 3 | 3 | 3 | 3 | 3 | 2 |
| Skewed backend durations (3 heavy + 20 light) | article (10s) | comment (8s) | response_template (5s) | 4 light | 4 light | 4 light | 4 light | 4 light |
Checks (all passed):

  • POST /api/v2/ci/ddtest/test_suite_durations issued during planning
  • Empty-response handling — warning emitted, planning continues, count-based weights used
  • Non-empty response — Found test suite durations testSuitesCount=23 logged, p50 used as weight
  • Backend-only suites attached to local files via addBackendTestSuites (source_file matched)
  • Bin-packing reflects weights: heavy outliers isolated, light files grouped

Methodology

  1. Built ddtest from the PR branch via crook run forem -c ddtest-plan --dep ddtest=anmarchenko/use_backend_api --debug.
  2. Targeted local mockdog (/api/v2/ci/ddtest/test_suite_durations mock implemented in shepherd's mockdog).
  3. Verified feature behavior end-to-end through debug logs (durations request/response, suite-count log, weight resolution) and the resulting tests-split/runner-* files.
  4. Repeated the run with a custom durations scenario to confirm the bin-packer responds to p50 weights.

Conclusion

The duration-based weighting flips the bin-packing input from equal weights (1 s default per file) to backend-p50-derived weights, and the resulting split clearly reflects that. Both code paths (no/empty backend data → count fallback; non-empty backend data → p50 weights) were exercised and behave as the PR description specifies.


This E2E test was performed by Shepherd - autonomous QA agent for Datadog Test Optimization

@anmarchenko
Member Author

E2E Test Report (Round 2): Subdirectory Execution ✅

Tested by: Shepherd Agent

Follow-up to the earlier forem report — this round specifically exercises the getCwdSubdirPrefix / stripCwdSubdirPrefix path normalization, since the PR description calls out subdirectory execution as a verification target.

Test Environment

  • Method: Local testing via Shepherd / crook against mockdog
  • Playground: DataDog/dd-testopt-playground-ruby-spree — Rails monorepo with core/ subproject, scoped to core/spec/helpers/*_spec.rb (6 spec files)
  • Working directory: playgrounds/dd-testopt-playground-ruby-spree/core/ (not the repo root — this is the case the PR's normalization is for)
  • Revision tested: 00af6a60c05e1b915948de25d30b3a35b78eb21e (branch anmarchenko/use_backend_api)

Approach

  • Ran bin/ddtest plan from core/ with --max-parallelism 4.
  • Mocked /api/v2/ci/ddtest/test_suite_durations to return repo-root-relative paths (core/spec/helpers/...), since the real backend reports paths relative to git root, not CWD.
  • Two heavy outliers (products_helper p50=5s, base_helper p50=3s) + 4 light files (100 ms each).
  • Compared splits with vs without backend data.

Results

| Scenario | runner-0 | runner-1 | runner-2 | runner-3 |
|---|---|---|---|---|
| No backend durations (count-based fallback) | images, base (2 files) | locale, currency (2) | products (1) | shipment (1) |
| Skewed durations, repo-root-relative source files | products (5s) | base (3s) | shipment + images (200 ms) | currency + locale (200 ms) |

File counts flip from [2, 2, 1, 1] (even by count) to [1, 1, 2, 2] (heavies isolated, lights paired) — same qualitative result as forem, scaled to 4 runners.

Checks (all passed):

  • subdirPrefix=core detected from CWD
  • Each core/spec/helpers/foo_spec.rb from backend stripped to spec/helpers/foo_spec.rb
  • All 6 backend suites attached to locally discovered files (testSuitesCount=6)
  • Bin-packing reflects p50 weights despite path-shape mismatch between backend and local
  • Planning output correctly written to core/.testoptimization/ (not the repo root)

Confirming log lines from the run:

INFO Running from subdirectory, will normalize repo-root-relative paths subdirPrefix=core
DEBUG Normalized test file path for subdirectory execution
      original=core/spec/helpers/products_helper_spec.rb
      normalized=spec/helpers/products_helper_spec.rb
      subdirPrefix=core
... (one per backend suite)
DEBUG Found test suite durations testSuitesCount=6

Conclusion

The subdirectory execution path called out in the PR description is wired correctly:

  1. CWD subdir prefix is computed from git toplevel.
  2. Backend repo-root-relative source_file values are normalized before the lookup against locally discovered (CWD-relative) files.
  3. Backend-only suites are correctly added to suiteAggregates only when their normalized source_file exists in the local discovery results — verified by the testSuitesCount=6 log and the resulting weight-driven split.
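
The normalization in step 2 can be sketched as below. The function name follows the helper mentioned in this report (`stripCwdSubdirPrefix`), but its real signature and edge-case handling may differ:

```go
package main

import (
	"fmt"
	"strings"
)

// stripCwdSubdirPrefix converts a repo-root-relative backend path into a
// CWD-relative one, matching the behavior shown in the logs above.
func stripCwdSubdirPrefix(subdirPrefix, path string) string {
	if subdirPrefix == "" {
		return path // running from the repo root: nothing to strip
	}
	prefix := subdirPrefix + "/"
	if strings.HasPrefix(path, prefix) {
		return strings.TrimPrefix(path, prefix)
	}
	return path // outside the subdir; the local-discovery lookup will miss it
}

func main() {
	fmt.Println(stripCwdSubdirPrefix("core", "core/spec/helpers/products_helper_spec.rb"))
	// spec/helpers/products_helper_spec.rb
}
```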

Combined with the earlier forem run, both the repo-root and subdirectory execution modes are validated end-to-end.


This E2E test was performed by Shepherd - autonomous QA agent for Datadog Test Optimization

@anmarchenko
Member Author

E2E Test Report (Round 3): ITR Skip + Duration Weighting ✅

Tested by: Shepherd Agent

Third round, focused on the interaction between ITR test skipping and the new p50-based weighting in resolveSuiteDurations / testFileWeight. Specifically verifying:

  1. A file whose suites are fully skipped is excluded from the split entirely (no runner gets it).
  2. A file whose suites are partially skipped gets its weight proportionally reduced: p50 × (NumTests - NumTestsSkipped) / NumTests.

Test Environment

  • Method: Local testing via Shepherd / crook against mockdog
  • Playground: anmarchenko/forem — full RSpec spec/policies/*_spec.rb (23 files, 363 tests, ITR full discovery via rspec --dry-run)
  • Revision tested: 00af6a60c05e1b915948de25d30b3a35b78eb21e

Setup

Custom mockdog scenario forem-policies-itr-skip-mix.yaml:

| File | Tests | p50 | skippable_tests entries |
|---|---|---|---|
| admin_policy_spec.rb | 2 | 4 s | 2/2 (full skip) |
| follow_policy_spec.rb | 3 | 3 s | 2/3 (partial skip) |
| article_policy_spec.rb | 194 | 10 s | 0 |
| comment_policy_spec.rb | 35 | 8 s | 0 |
| 19 other policies | various | 100 ms each | 0 |
itr_enabled: true, tests_skipping: true, --max-parallelism 8 --min-parallelism 1.

Results — Predicted vs Actual

All hand-computed predictions hit:

| Check | Predicted | Actual |
|---|---|---|
| Backend testSuitesCount | 23 | 23 ✅ |
| Local discoveredTestsCount | 363 | 363 ✅ |
| Tests matched as skippable | 4 | Processed... skippableTestsCount=4 ✅ |
| Test is not skipped log lines | 363 − 4 = 359 | 359 ✅ |
| Among admin + follow tests, count NOT marked skippable | 1 (only FollowPolicy ... when user is not signed in) | exactly that one ✅ |
| Skippable % = (4000 + 2000) / 26900 × 100 | 22.30% | 22.30 ✅ |
| Parallel runners = round(8 − 0.223 × 7) | 6 | 6 ✅ |
| test-files.txt count after exclusion | 22 | 22 ✅ |
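
The two predicted values in the last rows follow from straightforward arithmetic, sketched here. The function names are illustrative, not the PR's identifiers; only the formulas come from the table above:

```go
package main

import (
	"fmt"
	"math"
)

// skippablePct is the estimated saved duration over the total estimated
// duration — durations-based, not count-based, per the PR description.
func skippablePct(savedMs, totalMs float64) float64 {
	if totalMs == 0 {
		return 0
	}
	return savedMs / totalMs
}

// parallelism reproduces the table's arithmetic:
// runners = round(max − pct × (max − min)).
func parallelism(skippablePct float64, min, max int) int {
	return int(math.Round(float64(max) - skippablePct*float64(max-min)))
}

func main() {
	pct := skippablePct(4000+2000, 26900)
	fmt.Printf("%.2f%% -> %d runners\n", pct*100, parallelism(pct, 1, 8))
	// 22.30% -> 6 runners
}
```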

Resulting Split

| Runner | Files | Bin load | Notes |
|---|---|---|---|
| runner-0 | article_policy_spec.rb | 10 000 ms | Heavy, isolated |
| runner-1 | comment_policy_spec.rb | 8 000 ms | Heavy, isolated |
| runner-2 | follow_policy_spec.rb | 1 000 ms | p50=3 s reduced × 1/3 by partial skip — still heavier than any 6 lights, kept alone |
| runner-3 | 7 light files | ~700 ms | |
| runner-4 | 6 light files | ~600 ms | |
| runner-5 | 6 light files | ~600 ms | |
| (none) | admin_policy_spec.rb | — | Excluded — testFileWeight returned (0, false) because NumTests == NumTestsSkipped |
Sanity-grepped admin_policy_spec.rb against both test-files.txt and every runner-N file: no matches, confirming the file is dropped end-to-end (not just hidden in a runner).

Conclusion

Both PR-49 invariants for the EstimatedDuration formula are observable in the output:

  1. aggregate.NumTests == aggregate.NumTestsSkipped → testFileWeight returns (0, false) → file omitted from weightedTestFiles and never reaches DistributeTestFiles. ✅
  2. EstimatedDuration = p50 × (NumTests − NumTestsSkipped) / NumTests → follow_policy_spec.rb placed in its own bin at exactly 1 000 ms (3 000 × 1/3), demonstrably reduced from its un-skipped p50 of 3 000 ms. ✅

Combined with rounds 1 (forem repo-root, no skips) and 2 (spree from core/ subdir, path normalization), the feature has been validated across:

  • count-based fallback (no backend data),
  • duration-only weighting (no skips),
  • subdirectory execution with repo-root-relative backend paths,
  • ITR full skip excludes from runners,
  • ITR partial skip proportionally reduces weight,
  • skippable percentage flows correctly into parallel-runner count.

This E2E test was performed by Shepherd - autonomous QA agent for Datadog Test Optimization

@anmarchenko anmarchenko marked this pull request as ready for review April 29, 2026 10:57
@anmarchenko anmarchenko requested a review from a team as a code owner April 29, 2026 10:57

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00af6a60c0


```go
apiKey := os.Getenv(constants.APIKeyEnvironmentVariable)
if apiKey == "" {
	slog.Error("An API key is required for agentless mode. Use the DD_API_KEY env variable to set it")
	return nil
}
```

P2: Handle missing agentless API key without panicking

When DD_CIVISIBILITY_AGENTLESS_ENABLED=true but DD_API_KEY is absent, this branch returns a nil *DatadogDurationsAPI. NewDurationsClient stores that typed nil in the DurationsAPI interface, so PrepareTestOptimization later calls c.api.FetchTestSuiteDurations and panics instead of simply falling back without durations. Please return an error/no-op client or guard this typed-nil case before fetching durations.
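
The typed-nil hazard described here is a classic Go footgun: an interface holding a nil concrete pointer is itself non-nil. A minimal reproduction, with types simplified from the real client (only the shape matches):

```go
package main

import "fmt"

type DurationsAPI interface{ Fetch() error }

type DatadogDurationsAPI struct{}

func (d *DatadogDurationsAPI) Fetch() error { return nil }

// newAPI mimics the branch above: it returns a typed nil pointer
// when the API key is missing.
func newAPI(apiKey string) *DatadogDurationsAPI {
	if apiKey == "" {
		return nil
	}
	return &DatadogDurationsAPI{}
}

func main() {
	var api DurationsAPI = newAPI("") // typed nil stored in the interface
	fmt.Println(api == nil)           // false — the interface holds (*DatadogDurationsAPI)(nil)
	// A later api.Fetch() call runs with a nil receiver, so any guard of the
	// form `if c.api != nil` never fires, and a dereference inside panics.
}
```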

Useful? React with 👍 / 👎.

Comment thread internal/testoptimization/durations_client.go
@anmarchenko
Member Author

E2E Test Report (Round 4): Production EU End-to-End ✅

Tested by: Shepherd Agent

Fourth round, this time against real production EU backend (not mockdog) — confirming the feature works end-to-end in production and quantifying the real-world wall-time savings.

Test Environment

  • Method: Local execution against datadoghq.eu (citestcycle-intake, durations endpoint, settings, skippable, git pack upload — all real prod APIs)
  • Playground: anmarchenko/forem — full RSpec suite (836 test files, 9546 examples)
  • Revision tested: 00af6a60c05e1b915948de25d30b3a35b78eb21e (branch anmarchenko/use_backend_api)

Five-Run Progression

The same bin/ddtest run --framework rspec invocation was repeated five times. Between run #1 and #2, the TIA feature flag was flipped on for the forem service in EU. Between run #3 and #4, --max-parallelism was lowered to find the optimum.

| Run | Max parallelism | Backend coverage (testSuitesCount) | Slowest runner | Spread (non-outlier) | Worker-minutes burned |
|---|---|---|---|---|---|
| 1 | 8 | 0 / 836 (TIA off) | 8:46 | 1:37 | ~36:30 |
| 2 | 8 | 104 / 836 (TIA on, durations endpoint live) | 6:08 | 1:42 | ~30:00 |
| 3 | 8 | 834 / 836 (99.7%) | 7:24 | 0:04 | 34:49 |
| 4 | 6 | 834 / 836 | 5:49 | 0:08 | 28:19 |
| 5 | 5 | 834 / 836 | 5:27 | 0:07 | ~27:05 |

Net improvement run #1 → run #5: 8:46 → 5:27 = 3:19 saved (38% faster) using 3 fewer parallel workers.

Key Observations

1. Backend coverage compounds quickly.
After run #1 (TIA off, no contribution), runs #1+#2 sent enough span data that the durations endpoint went from 0 → 104 → 834 covered files in two iterations. This is great news for adoption: a single CI cycle with TIA on is enough to seed near-complete coverage for a service.

2. Empty / partial responses are correctly non-fatal.
Run #1 hit the new endpoint but production returned no data (feature flag was still off for the service). The log emitted the expected Test durations API returned no test suites warning and planning continued with count-based fallback. Exactly what the PR description claims.

3. The 7 packed runners in run #3 finished within 4 seconds of each other (3:51 → 3:55).
With 99.7% backend coverage, FFD has full information and packs essentially perfectly. The 4-second spread across 7 packed runners is the tightest possible given the workload — a clear demonstration that p50 weights converge to optimal scheduling.

4. The bin-packer correctly unisolates the heavy file when bin capacity grows.
At max=8 (run #3), spec/services/articles/feeds/variant_query_spec.rb (p50 ≈ 380s) sat alone in runner-0. Drop to max=5 (run #5) and the per-bin "ideal" load grows to ~410s, so FFD fills runner-0 with the elephant plus 79 light files — and runner-0 finished at 5:20, slightly earlier than the 4 packed runners at 5:26–5:27. The bin-packer found a better placement than isolation as soon as the bin sizes allowed it.

5. Optimal max-parallelism is now an observable property, not a guess.
Without backend p50 data, you have to overprovision parallel runners to hedge against unlucky count-based stragglers. With p50 data, the optimum is the largest max-parallelism where each packed runner's load is just below the slowest single-file load. For forem today that's 5; run #5 used 3 fewer workers than run #1 and still finished 38% faster. This is real CI cost reduction, not a synthetic benchmark.
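
The packing behavior observed in runs #3–#5 can be illustrated with a simplified greedy decreasing-weight packer (heaviest file first, into the currently lightest runner). This is a sketch for intuition — the PR's actual FFD implementation and placement rule may differ:

```go
package main

import (
	"fmt"
	"sort"
)

// ffdSplit distributes file weights across runners: sort descending,
// then place each file into the least-loaded runner so far.
func ffdSplit(weights []float64, runners int) [][]int {
	type file struct {
		idx int
		w   float64
	}
	files := make([]file, len(weights))
	for i, w := range weights {
		files[i] = file{i, w}
	}
	sort.Slice(files, func(a, b int) bool { return files[a].w > files[b].w })

	bins := make([][]int, runners)
	load := make([]float64, runners)
	for _, f := range files {
		best := 0 // pick the least-loaded runner
		for r := 1; r < runners; r++ {
			if load[r] < load[best] {
				best = r
			}
		}
		bins[best] = append(bins[best], f.idx)
		load[best] += f.w
	}
	return bins
}

func main() {
	// three heavy outliers + four light files, echoing the forem scenario
	weights := []float64{10000, 8000, 5000, 100, 100, 100, 100}
	for r, b := range ffdSplit(weights, 4) {
		fmt.Println("runner", r, "files", b)
	}
}
```

With skewed weights the heavies end up isolated and the lights grouped — the same qualitative shape as the split tables earlier in this thread.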

Per-Run Worker Distribution (Run #5)

runner-0 (80 files,  1246 examples): variant_query_spec.rb + 79 lights, 5:20
runner-1 (188 files, 2026 examples): 5:26
runner-2 (189 files, 2050 examples): 5:26
runner-3 (190 files, 2228 examples): 5:27
runner-4 (189 files, 1996 examples): 5:27

Total wall time bound: 5:27. Spread across all 5 runners: 7 seconds.

Production-Side Verification

Checks (all passed):

  • POST /api/v2/ci/ddtest/test_suite_durations against api.datadoghq.eu — ✅ 200
  • Durations response paginated correctly (page_size=500)
  • TIA settings respected (itr_enabled=true tests_skipping=false → durations attached but no skippable filtering)
  • addBackendTestSuites matched 834/836 source files to local discovery
  • Test cycle / coverage / telemetry intakes all returned 202 across all 5 runs
  • Build status: 9546 examples, 0 failures across all 5 runs

Conclusion

PR-49 works end-to-end against production EU. Beyond the feature merely functioning, the real-data run produces concretely better outcomes than count-based scheduling: tighter packing across non-outlier runners, lower total wall time, and the ability to right-size max-parallelism based on observed weights. The 38% wall-time reduction with 3 fewer parallel workers is the result on a real Rails app's full test suite, not a constructed benchmark.


This E2E test was performed by Shepherd - autonomous QA agent for Datadog Test Optimization

@anmarchenko anmarchenko merged commit 4c5f238 into main Apr 29, 2026
3 checks passed
@anmarchenko anmarchenko deleted the anmarchenko/use_backend_api branch April 29, 2026 15:20