Motivation
Exemplar benchmarks like linear_wpd, allen_cahn_fdm_wpd, and BCR follow a pattern of within-family and between-family comparisons (e.g., all Rosenbrock methods, all 5th order methods, then best-of-each-family comparisons). Currently, achieving this requires significant manual boilerplate in each benchmark file: creating separate WorkPrecisionSets per family, manually selecting the best methods, repeating tolerance ranges and error modes, and manually composing comparison plots.
This issue proposes a set of infrastructure improvements to DiffEqDevTools that would:
- Make it trivial to tag methods and generate family/cross-family comparison plots
- Run all methods once and generate many views from the same data
- Automatically identify "best of family" methods for cross-family comparisons
- Support interactive plots for dense comparison diagrams
- Eventually simplify SciMLBenchmarks code significantly via these helpers
Planned Features
Phase 1: Core tagging infrastructure
Add a tags field to WorkPrecision:
mutable struct WorkPrecision
    # ... existing fields ...
    tags::Vector{Symbol}  # NEW: e.g., [:rosenbrock, :stiff, :order4, :autodiff]
end
(Note: Symbol literals cannot start with a digit, so order tags are written :order4 rather than :4th_order.)
Tags are specified via the setups dict (backward compatible — no tags = empty vector):
setups = [
    Dict(:alg => Rosenbrock23(), :tags => [:rosenbrock, :order2, :stiff]),
    Dict(:alg => Rodas5P(),      :tags => [:rosenbrock, :order5, :stiff]),
    Dict(:alg => TRBDF2(),       :tags => [:sdirk, :order2, :stiff]),
    Dict(:alg => KenCarp4(),     :tags => [:sdirk, :order4, :stiff, :imex]),
    Dict(:alg => Tsit5(),        :tags => [:rk, :order5, :nonstiff, :reference]),
    Dict(:alg => CVODE_BDF(),    :tags => [:bdf, :sundials, :reference]),
]
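Backward compatibility should fall out of a single defaulted lookup in the constructor; a minimal sketch of the relevant line:

```julia
# Inside the WorkPrecisionSet constructor (sketch): a setup without a :tags
# entry yields an empty tag vector, so existing benchmark scripts work unchanged.
tags = [get(setup, :tags, Symbol[]) for setup in setups]
```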
Filtering helpers:
# Get the subset matching ALL specified tags (AND logic)
filter_by_tags(wp_set, :rosenbrock)      # all Rosenbrock methods
filter_by_tags(wp_set, :order5, :stiff)  # 5th order AND stiff
filter_by_tags(wp_set, :reference)       # just the reference methods

# Exclude by tags
exclude_by_tags(wp_set, :reference)      # everything except reference methods
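A minimal sketch of how these helpers could be implemented, assuming WorkPrecisionSet exposes its vector of WorkPrecision objects (the wps field name and the subset helper are illustrative):

```julia
# Hypothetical sketch: keep the WorkPrecision entries whose tags contain all
# of the requested tags (AND logic), and rebuild a set from the subset.
function filter_by_tags(wp_set, tags::Symbol...)
    keep = [wp for wp in wp_set.wps if all(t in wp.tags for t in tags)]
    return subset(wp_set, keep)  # hypothetical helper rebuilding a WorkPrecisionSet
end

# Drop every entry carrying any of the given tags.
function exclude_by_tags(wp_set, tags::Symbol...)
    keep = [wp for wp in wp_set.wps if !any(t in wp.tags for t in tags)]
    return subset(wp_set, keep)
end
```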
SDE/DAE compatibility: Tags are purely additive metadata. The existing WorkPrecisionSet constructors for AbstractRODEProblem, AbstractEnsembleProblem, AbstractBVProblem, etc., just need to pass tags through. No changes to numruns_error, error_estimate = :weak_final, prob_choice, or any SDE/DAE-specific parameters.
DAE formalism comparisons can use tags naturally:
setups = [
    Dict(:alg => Rodas5P(), :prob_choice => 1, :tags => [:mass_matrix, :rosenbrock]),
    Dict(:alg => Rodas5P(), :prob_choice => 2, :tags => [:sparse, :rosenbrock]),
    Dict(:alg => IDA(),     :prob_choice => 3, :tags => [:sundials, :dae_residual]),
]
Phase 2: Multi-error-mode runs
Currently each WorkPrecisionSet uses a single error_estimate. To avoid re-running everything for each error mode, support computing multiple error metrics in one pass:
wp_set = WorkPrecisionSet(prob, abstols, reltols, setups;
    error_estimates = [:final, :l2, :L2],  # compute all three
    appxsol = test_sol)
The errors StructArray in WorkPrecision already stores all computed error types as a NamedTuple, so this is mainly a matter of requesting timeseries_errors = true and dense_errors = true together and keeping a vector of the active error estimates for plotting.
For weak SDE errors (:weak_final, :weak_l2, etc.), the same mechanism applies — these are already stored in the errors dict. The expensive part (ensemble runs via numruns_error) only needs to happen once regardless of how many weak error metrics are extracted from the results.
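With all metrics computed in one pass, switching views becomes a lookup; a hypothetical accessor plus the proposed plot-time keyword:

```julia
# Hypothetical accessor: pull one metric out of the stored per-tolerance errors.
errors_for(wp, est::Symbol) = [err[est] for err in wp.errors]

# One expensive solve pass, many views (error_estimate as a plot-time keyword
# is part of this proposal, not the current API):
plot(wp_set, error_estimate = :final)
plot(wp_set, error_estimate = :L2)
```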
Phase 3: Tag-based and reference-method plotting
Extend the plot recipe to support tag-based subsetting and reference method overlays:
# Plot only the Rosenbrock family
plot(wp_set, tags = [:rosenbrock])

# Plot 5th order methods with reference methods always included
plot(wp_set, tags = [:order5], include_tags = [:reference])

# Reference methods get distinct styling (dashed, thinner)
plot(wp_set, tags = [:imex], reference_tags = [:reference],
     reference_style = (linestyle = :dash, linewidth = 1, alpha = 0.5))
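Under the hood, the recipe additions could compose the Phase 1 filters with a styled overlay; a sketch assuming the filter helpers above and standard Plots.jl series attributes:

```julia
using Plots

# Sketch: plot the tag-selected subset, then overlay reference methods with
# muted styling so they anchor the comparison without dominating it.
function plot_with_reference(wp_set; tags, reference_tags = [:reference],
                             reference_style = (linestyle = :dash, linewidth = 1, alpha = 0.5))
    p = plot(filter_by_tags(wp_set, tags...))
    ref = filter_by_tags(wp_set, reference_tags...)
    plot!(p, ref; reference_style...)
    return p
end
```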
When reference methods fall far outside the frame of the main methods, provide an option to:
- Auto-adjust the axis limits to include them
- Clip/omit them with a warning
- Use a secondary inset plot
Phase 4: Best-of-family helpers
Automatically identify standout methods per family:
# Get the best methods per family tag (by Pareto efficiency on the error-time curve)
best = best_by_tag(wp_set, :rosenbrock; n = 2, error_estimate = :final)

# Create a "best of all families" WorkPrecisionSet
families = [:rosenbrock, :bdf, :sdirk, :rk, :imex, :exponential]
best_of = best_of_families(wp_set, families; n = 2)
plot(best_of)  # cross-family comparison with the top 2 from each
"Best" should be determined by Pareto efficiency on the work-precision curve (not just minimum error or minimum time, but the overall curve quality). Could use area under the log-log curve, or minimum time at a reference error level, or a combination.
Phase 5: Time cutoff for slow methods
Some methods are extremely slow at certain tolerances. The current NaN-filtering handles crashes, but not the case where a method takes 100x longer than others. Options:
- Process-level timeout via Distributed: Run each solve in a worker process with a timeout. If it exceeds the cutoff, kill the worker and mark the result as NaN.

  WorkPrecisionSet(prob, abstols, reltols, setups;
      timeout = 300.0,          # seconds per solve
      parallel = :distributed)  # use Distributed workers

- Relative timeout: Set the timeout as a multiple of the fastest solve at each tolerance level (e.g., 50x the fastest).

- Callback-based: Use a DiscreteCallback that checks wall-clock time and terminates the integration.
The Distributed approach is cleanest for hard timeouts but adds a dependency. The callback approach works within a single process. We should support both.
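A minimal sketch of the callback-based variant, using the standard DiscreteCallback and terminate! API:

```julia
using OrdinaryDiffEq  # provides DiscreteCallback and terminate!

# Sketch: terminate the integration once `limit` seconds of wall-clock time
# have elapsed. Construct it immediately before each solve so t0 is fresh;
# the benchmark loop would then treat a Terminated return code as NaN.
function walltime_cutoff(limit)
    t0 = time()
    condition(u, t, integrator) = time() - t0 > limit
    affect!(integrator) = terminate!(integrator)
    DiscreteCallback(condition, affect!; save_positions = (false, false))
end

sol = solve(prob, Rodas5P(); callback = walltime_cutoff(300.0))
```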
For SDE weak benchmarks (which are already very expensive with numruns_error = 1000), the timeout should apply per-trajectory or per-ensemble, not per-individual-solve.
Phase 6: AutoDiff on/off comparison helpers
# Automatically create AD vs no-AD variants of setups
setups_with_ad = with_autodiff_variants(setups;
    ad_backends = [AutoForwardDiff(), AutoFiniteDiff()],
    methods = [:best, 3]  # only for the top 3 methods, or :all
)
# Or tag-based: only show AD comparison for reference + best methods
plot(wp_set, tags = [:autodiff_forward], compare_tags = [:autodiff_finite])
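For illustration, a hand-rolled version of what with_autodiff_variants might generate, assuming the solver constructors accept ADTypes backends via the autodiff keyword (as recent OrdinaryDiffEq versions do):

```julia
using ADTypes  # AutoForwardDiff, AutoFiniteDiff

setups_with_ad = [
    Dict(:alg => Rodas5P(autodiff = AutoForwardDiff()),
         :tags => [:rosenbrock, :order5, :autodiff_forward]),
    Dict(:alg => Rodas5P(autodiff = AutoFiniteDiff()),
         :tags => [:rosenbrock, :order5, :autodiff_finite]),
]
```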
Phase 7: Autoplot — generate comprehensive plot sets
A single function that generates all standard comparison plots:
plots = autoplot(wp_set;
    families = [:rosenbrock, :bdf, :sdirk, :rk, :imex],
    tolerance_ranges = Dict(:low => (1e-3, 1e-8), :high => (1e-8, 1e-13)),
    error_modes = [:final, :l2, :L2],
    reference_tags = [:reference],
    autodiff_compare = true,
    best_n = 2,
    backend = :gr  # or :plotlyjs for interactive
)
Returns a structured collection of plots (one possible layout is sketched after this list):
- Per-family plots (within-family comparison)
- Cross-family "best of" plots
- AD on/off comparison (with best methods only)
- Low tolerance and high tolerance versions of each
- Each error mode version
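One possible layout for the returned collection, keyed by descriptive tuples so benchmark scripts can select or iterate over views (entirely hypothetical):

```julia
# Hypothetical access patterns for the autoplot result:
plots[(:family, :rosenbrock, :final, :low)]  # within-family, final error, low tolerances
plots[(:best_of, :l2, :high)]                # cross-family best-of, l2 error, high tolerances
plots[(:autodiff, :final, :low)]             # AD on/off comparison for the best methods
```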
Phase 8: Interactive Plotly support
For plots with many overlapping curves, interactive Plotly plots would help:
- Hover to see method name and exact values
- Click legend to toggle individual methods
- Zoom into regions of interest
This could be a separate package extension (DiffEqDevToolsPlotlyExt) or just work via PlotlyJS backend for Plots.jl. The plot recipes should degrade gracefully — same recipe, different backend.
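Since Plots.jl already supports backend switching, a recipe-based design gets interactivity essentially for free; a minimal sketch (the tags keyword is the Phase 3 proposal, not the current API):

```julia
using Plots

plotlyjs()                           # switch Plots.jl to the interactive PlotlyJS backend
plot(wp_set, tags = [:rosenbrock])   # same recipe: now with hover, zoom, legend toggling
gr()                                 # back to static GR for the rendered benchmark pages
```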
Phase 9: SciMLBenchmarks migration
After the DiffEqDevTools infrastructure is in place, update SciMLBenchmarks to use it:
- Replace manual family-grouping with tags
- Replace repeated WorkPrecisionSet calls with single tagged runs
- Use autoplot to generate the standard comparison plots
- Dramatically reduce per-benchmark boilerplate
Design Constraints
- Full backward compatibility: All new fields have defaults, all new parameters are keyword-only with defaults matching current behavior.
- SDE weak benchmarks: The most expensive benchmarks (1000+ trajectories × multiple methods). Tagging/filtering must be zero-cost at solve time — it's purely metadata for post-hoc plot generation.
- DAE problem formalism: The prob_choice pattern must continue to work. Tags complement it by adding semantic meaning (:mass_matrix vs :dae_residual vs :mtk_reduced).
- No new hard dependencies: Plotly support via package extension. Distributed timeout via package extension or optional import.
Implementation Plan
PRs 1–4 are the core value and can be done incrementally. PRs 5–8 are enhancements. PR 9+ is the payoff in simplified benchmark code.