Investigate Spree split imbalance despite even p50 plans

## Summary

We tested ddtest's Spree split behavior using the Shepherd Spree playground against the EU backend and three GitHub Actions runs in `anmarchenko/spree-dd-testopt`.

The planner is producing very even splits according to its current p50 model, but the actual CI worker times are much less even. This suggests the immediate problem is not weighted list scheduling itself, but the duration model used as input to the scheduler.

## Data

Local crook plan with `DD_SERVICE=spree`, `DD_SITE=datadoghq.eu`:

- Backend duration suites: 632
- Files planned: 419
- Duration sources: 419 known, 0 default
- Runners selected: 8
- Modeled p50 wall time: 2m09.915s
- Modeled p50 imbalance: 287ms

Three GHA probes all had similarly even p50-modeled plans but much wider actual worker times:

| Run | p50-modeled spread | p90 load spread on existing split | actual RSpec step spread |
| --- | ---: | ---: | ---: |
| 1 | 0.283s | 6.527s | 56s |
| 2 | 0.273s | 10.241s | 42s |
| 3 | 0.280s | 8.390s | 66s |

The slow bucket moved between runs:

- Run 1: runner 4, 189s actual vs ~130.5s p50 / ~139.5s p90
- Run 2: runner 2, 171s actual vs ~128.7s p50 / ~142.5s p90
- Run 3: runner 5, 172s actual vs ~130.7s p50 / ~149.2s p90

## Would p90 have helped?

p90 would not obviously solve the observed tail by itself. The worst real workers were still roughly 23-50s slower than their p90-modeled load on the current split, so Spree appears to have per-run Rails/runtime variance that is not fully captured by p90 suite durations.
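The gap can be checked directly from the slow-bucket figures listed earlier. This is a small sketch of that arithmetic using only those rounded numbers; the variable and function names are illustrative.

```python
# Slowest worker per run: (actual seconds, ~p50-modeled load, ~p90-modeled load),
# copied from the "slow bucket" list in the Data section.
slow_buckets = {
    1: (189.0, 130.5, 139.5),
    2: (171.0, 128.7, 142.5),
    3: (172.0, 130.7, 149.2),
}


def overshoot(actual: float, modeled: float) -> float:
    """Seconds by which the real worker time exceeded the modeled load."""
    return actual - modeled


gaps = {}
for run, (actual, p50_load, p90_load) in slow_buckets.items():
    gaps[run] = (
        round(overshoot(actual, p50_load), 1),
        round(overshoot(actual, p90_load), 1),
    )
    print(f"run {run}: +{gaps[run][0]:.1f}s over p50, +{gaps[run][1]:.1f}s over p90")
```

Even against the p90-modeled load, the slowest worker in each run overshoots by tens of seconds, which is larger than the p90 load spread reported for the existing split.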

The p90 simulation also reshuffled most files: 359/419, 362/419, and 370/419 files moved in the three runs. So p90 changes the plan substantially, but job-level timing alone cannot show whether it reduces wall time.

## Current implementation notes

ddtest currently:

- fetches test suite durations using repository URL + service;
- uses backend p50 as each suite's `EstimatedDuration`;
- sums suite estimates into file weights;
- applies weighted list scheduling across runners.
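For reference, the pipeline above can be sketched roughly as follows. This is not ddtest's code: the function names, the input shape (a list of `(file, estimated_seconds)` pairs), and the heaviest-first ordering are assumptions made for illustration. Only the overall shape (sum suite estimates per file, then greedy list scheduling onto the least-loaded runner) comes from the notes above.

```python
import heapq
from collections import defaultdict


def file_weights(suites):
    """Sum per-suite duration estimates (seconds) into per-file weights."""
    weights = defaultdict(float)
    for file_path, estimated_duration in suites:
        weights[file_path] += estimated_duration
    return dict(weights)


def split(weights, runners):
    """Greedy list scheduling: heaviest file first, onto the least-loaded runner.

    The heaviest-first order is an assumption; the notes only say
    "weighted list scheduling".
    """
    heap = [(0.0, i) for i in range(runners)]  # (current load, runner index)
    heapq.heapify(heap)
    buckets = [[] for _ in range(runners)]
    # Sort by descending weight, then by path for a deterministic plan.
    for path, weight in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        load, i = heapq.heappop(heap)
        buckets[i].append(path)
        heapq.heappush(heap, (load + weight, i))
    loads = [0.0] * runners
    for load, i in heap:
        loads[i] = load
    return buckets, loads


# Hypothetical suites: two suites live in a_spec.rb.
suites = [
    ("a_spec.rb", 30.0), ("a_spec.rb", 20.0),
    ("b_spec.rb", 40.0), ("c_spec.rb", 25.0), ("d_spec.rb", 15.0),
]
buckets, loads = split(file_weights(suites), runners=2)
print(buckets, loads, max(loads) - min(loads))
```

A plan like this is only as balanced as its input weights: if the per-file estimates are off, the modeled spread can be near zero while actual worker times diverge.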

## Ideas to iterate on the split

1. Add an experimental weighting mode: `p50`, `p90`, and perhaps `blend` (`max(p50, p50 + k * (p90 - p50))` or weighted average). Run the same Spree probes across modes.
2. Emit per-file actual runtime per worker as an artifact. Job-level RSpec step times tell us which bucket tailed, but not which file(s) caused the miss.
3. Track model error by file: compare observed per-file runtime against backend p50/p90 and identify files with high variance or systematic underestimation.
4. Add a variance-aware guardrail for volatile files: for files where `p90/p50` or historical error is high, schedule using a more conservative weight.
5. Consider a "large file isolation" heuristic. Spree has very large specs like `shipment_spec.rb`, `return_item_spec.rb`, and `order_spec.rb`; file-level scheduling cannot split those further, so the heaviest files set a lower bound on balance.
6. If the framework supports it, experiment with finer-grained splitting for oversized files, such as sharding by example group or by line.
7. Keep p50 as the default initially, but add debug output comparing selected p50 loads and alternative p90/blended loads so we can evaluate without changing behavior.
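As a starting point for ideas 1 and 4, the weighting modes could look something like the sketch below. The blend formula is the one given in idea 1; the function names, the default `k`, and the `ratio_threshold` value are hypothetical placeholders that would need tuning against the Spree probes.

```python
def blended_weight(p50: float, p90: float, k: float = 0.5) -> float:
    """Blend from idea 1: max(p50, p50 + k * (p90 - p50)).

    k = 0 gives p50 and k = 1 gives p90; the max() keeps the weight from
    dropping below p50 if the backend ever reports p90 < p50.
    """
    return max(p50, p50 + k * (p90 - p50))


def guarded_weight(p50: float, p90: float,
                   ratio_threshold: float = 1.5, k: float = 0.5) -> float:
    """Idea 4 sketch: use the conservative blend only for volatile files.

    A file counts as volatile when p90/p50 exceeds ratio_threshold
    (an arbitrary placeholder value); stable files keep their p50 weight.
    """
    if p50 > 0 and p90 / p50 > ratio_threshold:
        return blended_weight(p50, p90, k)
    return p50


print(blended_weight(10.0, 20.0))   # halfway between p50 and p90
print(guarded_weight(10.0, 12.0))   # stable file: stays at p50
print(guarded_weight(10.0, 20.0))   # volatile file: blended
```

Keeping the mode behind a single weight function means the scheduler itself does not change, so p50, p90, and blended plans can be compared on the same inputs.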

## Expected next step

Use Spree as a benchmark playground and compare p50 vs p90/blended planner output with per-file actual runtime capture enabled. The goal is to determine whether p90 improves wall time or merely changes the model while the dominant problem remains unmodeled runtime variance.
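Once per-file actual runtimes are captured, the comparison against the backend model could be as simple as the sketch below. The input shapes, the function name, and the sample values are all hypothetical and exist only to show the calculation; they are not measurements from the Spree runs.

```python
from statistics import median


def model_error(observed, p50, p90):
    """Compare observed per-file runtimes (seconds) with backend p50/p90.

    observed maps file path -> list of runtimes across runs;
    p50 and p90 map file path -> modeled duration.
    Returns rows sorted by how far the median runtime exceeds p50.
    """
    rows = []
    for path, runs in observed.items():
        med = median(runs)
        rows.append({
            "file": path,
            "median_actual": med,
            "p50_error": med - p50[path],  # > 0 means p50 underestimates
            "over_p90_runs": sum(1 for r in runs if r > p90[path]),
        })
    return sorted(rows, key=lambda row: row["p50_error"], reverse=True)


# Made-up placeholder data, not real measurements.
observed = {
    "spec/a_spec.rb": [12.0, 14.0, 13.0],
    "spec/b_spec.rb": [5.0, 5.5, 4.5],
}
p50 = {"spec/a_spec.rb": 10.0, "spec/b_spec.rb": 5.0}
p90 = {"spec/a_spec.rb": 12.5, "spec/b_spec.rb": 6.0}

report = model_error(observed, p50, p90)
for row in report:
    print(row)
```

Files that sit near the top of such a report across runs point to systematic underestimation, while files that exceed p90 only occasionally point to per-run variance.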

