Investigate Spree split imbalance despite even p50 plans #63

@anmarchenko

Summary

We tested ddtest's Spree split behavior using the Shepherd Spree playground against the EU backend and three GitHub Actions runs in anmarchenko/spree-dd-testopt.

The planner is producing very even splits according to its current p50 model, but the actual CI worker times are much less even. This suggests the immediate problem is not weighted list scheduling itself, but the duration model used as input to the scheduler.

Data

Local crook plan with DD_SERVICE=spree, DD_SITE=datadoghq.eu:

  • Backend duration suites: 632
  • Files planned: 419
  • Duration sources: 419 known, 0 default
  • Runners selected: 8
  • Modeled p50 wall time: 2m09.915s
  • Modeled p50 imbalance: 287ms

Three GHA probes all had similarly even p50-modeled plans but much wider actual worker times:

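| Run | p50-modeled spread | p90 load spread on existing split | Actual RSpec step spread |
|-----|--------------------|-----------------------------------|--------------------------|
| 1   | 0.283s             | 6.527s                            | 56s                      |
| 2   | 0.273s             | 10.241s                           | 42s                      |
| 3   | 0.280s             | 8.390s                            | 66s                      |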

The slow bucket moved between runs:

  • Run 1: runner 4, 189s actual vs ~130.5s p50 / ~139.5s p90
  • Run 2: runner 2, 171s actual vs ~128.7s p50 / ~142.5s p90
  • Run 3: runner 5, 172s actual vs ~130.7s p50 / ~149.2s p90

Would p90 have helped?

p90 alone would not obviously fix the observed tail. On the current split, the worst real workers still ran about 23-50s over their p90-modeled load (49.5s, 28.5s, and 22.8s in runs 1-3), so Spree appears to have per-run Rails/runtime variance that p90 suite durations do not fully capture.
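
The gap between actual and modeled load can be reproduced from the slow-bucket figures listed under Data. The snippet below is only a worked-arithmetic sketch: the dictionary layout and the `overruns` helper are illustrative names, not part of ddtest, and the p50/p90 values are the approximate ones quoted above.

```python
# Worst-worker figures copied from the "slow bucket" list (seconds).
# The p50/p90 values are the approximate (~) numbers quoted in the issue.
worst_workers = {
    1: {"runner": 4, "actual": 189.0, "p50": 130.5, "p90": 139.5},
    2: {"runner": 2, "actual": 171.0, "p50": 128.7, "p90": 142.5},
    3: {"runner": 5, "actual": 172.0, "p50": 130.7, "p90": 149.2},
}


def overruns(row):
    """Return (actual - p50, actual - p90) in seconds, rounded to 0.1s."""
    return (
        round(row["actual"] - row["p50"], 1),
        round(row["actual"] - row["p90"], 1),
    )


for run, row in worst_workers.items():
    over_p50, over_p90 = overruns(row)
    print(f"run {run}, runner {row['runner']}: "
          f"+{over_p50}s over p50, +{over_p90}s over p90")
```

Even against the p90-modeled load, each worst worker is more than 20s over, which is the basis for the claim that p90 alone leaves a large unexplained tail.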

The p90 simulation also reshuffled most files: 359/419, 362/419, and 370/419 files moved in the three runs. So p90 changes the plan substantially, but job-level timing alone cannot show whether it reduces wall time.

Current implementation notes

ddtest currently:

  • fetches test suite durations using repository URL + service;
  • uses backend p50 as each suite's EstimatedDuration;
  • sums suite estimates into file weights;
  • applies weighted list scheduling across runners.
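
The steps above can be sketched roughly as follows. This is a minimal illustration, not ddtest's code: it assumes "weighted list scheduling" means the common greedy variant (heaviest file first, placed on the currently least-loaded runner), and `plan_split`, the input shape, and the file names are hypothetical.

```python
import heapq


def plan_split(suite_p50, runners):
    """Greedy weighted list scheduling over file weights.

    suite_p50 maps a file path to the p50 durations (seconds) of the suites
    it contains. Returns (buckets, loads), one entry per runner.
    """
    # Sum suite estimates into one weight per file.
    file_weights = {path: sum(ds) for path, ds in suite_p50.items()}

    loads = [0.0] * runners
    buckets = [[] for _ in range(runners)]
    heap = [(0.0, i) for i in range(runners)]  # (current load, runner index)
    heapq.heapify(heap)

    # Heaviest file first; ties broken by path so the plan is deterministic.
    for path, weight in sorted(file_weights.items(), key=lambda kv: (-kv[1], kv[0])):
        load, i = heapq.heappop(heap)  # least-loaded runner so far
        buckets[i].append(path)
        loads[i] = load + weight
        heapq.heappush(heap, (loads[i], i))
    return buckets, loads


# Hypothetical input: five files, two runners.
example = {
    "spec/a_spec.rb": [30.0, 20.0],
    "spec/b_spec.rb": [40.0],
    "spec/c_spec.rb": [25.0],
    "spec/d_spec.rb": [10.0, 10.0],
    "spec/e_spec.rb": [5.0],
}
buckets, loads = plan_split(example, runners=2)
```

The scheduler only balances the weights it is given, so a perfectly even modeled plan says nothing about how well those weights predict real runtime.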

Ideas to iterate on the split

  1. Add an experimental weighting mode: p50, p90, and perhaps a blend (for example max(p50, p50 + k * (p90 - p50)), or a weighted average). Run the same Spree probes across modes.
  2. Emit per-file actual runtime per worker as an artifact. Job-level RSpec step times tell us which bucket tailed, but not which file(s) caused the miss.
  3. Track model error by file: compare observed per-file runtime against backend p50/p90 and identify files with high variance or systematic underestimation.
  4. Add a variance-aware guardrail for volatile files: for files where p90/p50 or historical error is high, schedule using a more conservative weight.
  5. Consider a "large file isolation" heuristic. Spree has very large specs like shipment_spec.rb, return_item_spec.rb, and order_spec.rb; file-level scheduling cannot split those further, so the heaviest files set a lower bound on balance.
  6. If the framework supports it, experiment with finer-grained splitting for oversized files, for example example-group or line-based shards.
  7. Keep p50 as the default initially, but add debug output comparing selected p50 loads and alternative p90/blended loads so we can evaluate without changing behavior.
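
Ideas 1 and 4 could be combined into a single weight function, sketched below. Everything here is an assumption for illustration: the function names, the default `k=0.5`, and the `volatility_ratio=1.5` threshold are placeholders that would need tuning against the Spree probes.

```python
def blended_weight(p50, p90, k=0.5):
    """Blend from idea 1: move a fraction k of the way from p50 toward p90.

    The max() guards against a backend p90 that is below p50.
    """
    return max(p50, p50 + k * (p90 - p50))


def conservative_weight(p50, p90, k=0.5, volatility_ratio=1.5):
    """Guardrail from idea 4: use p90 outright for volatile files.

    A file counts as volatile when p90/p50 reaches volatility_ratio
    (a hypothetical threshold); otherwise fall back to the blend.
    """
    if p50 > 0 and p90 / p50 >= volatility_ratio:
        return p90
    return blended_weight(p50, p90, k)
```

With `k=0` the blend reduces to p50 and with `k=1` to p90, so one parameter covers all three modes in the experiment.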

Expected next step

Use Spree as a benchmark playground and compare p50 vs p90/blended planner output with per-file actual runtime capture enabled. The goal is to determine whether p90 improves wall time or merely changes the model while the dominant problem remains unmodeled runtime variance.
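
Once per-file runtimes are captured, the comparison could look something like the sketch below. The `model_error` helper and its input shape are hypothetical, and the sample numbers are invented purely to show the output format; they are not measurements from Spree.

```python
def model_error(observed, p50, p90):
    """Compare observed per-file runtime with backend estimates.

    All arguments map file path -> seconds. Files without both estimates are
    skipped. Rows are sorted with the largest p50 underestimate first.
    """
    rows = []
    for path, actual in observed.items():
        if path not in p50 or path not in p90:
            continue
        rows.append({
            "path": path,
            "over_p50": actual - p50[path],
            "over_p90": actual - p90[path],
            "exceeded_p90": actual > p90[path],
        })
    return sorted(rows, key=lambda r: r["over_p50"], reverse=True)


# Invented sample values, for illustration only.
observed = {"spec/a_spec.rb": 12.0, "spec/b_spec.rb": 30.0, "spec/c_spec.rb": 8.0}
p50 = {"spec/a_spec.rb": 10.0, "spec/b_spec.rb": 18.0, "spec/c_spec.rb": 8.5}
p90 = {"spec/a_spec.rb": 13.0, "spec/b_spec.rb": 24.0, "spec/c_spec.rb": 9.0}

report = model_error(observed, p50, p90)
for row in report:
    print(row)
```

Files that repeatedly land at the top of such a report, or that exceed p90 in most runs, would be the candidates for the conservative weighting and large-file isolation ideas.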
