Motivation
Exemplar benchmarks like linear_wpd, allen_cahn_fdm_wpd, and BCR follow a pattern of within-family and between-family comparisons (e.g., all Rosenbrock methods, all 5th order methods, then best-of-each-family comparisons). Currently, achieving this requires significant manual boilerplate in each benchmark file: creating separate WorkPrecisionSets per family, manually selecting the best methods, repeating tolerance ranges and error modes, and manually composing comparison plots.
This issue proposes a set of infrastructure improvements to DiffEqDevTools that would:
- Make it trivial to tag methods and generate family/cross-family comparison plots
- Run all methods once and generate many views from the same data
- Automatically identify "best of family" methods for cross-family comparisons
- Support interactive plots for dense comparison diagrams
- Eventually simplify SciMLBenchmarks code significantly via these helpers
Planned Features
Phase 1: Core tagging infrastructure
Add a tags field to WorkPrecision:
mutable struct WorkPrecision
    # ... existing fields ...
    tags::Vector{Symbol}  # NEW: e.g., [:rosenbrock, :stiff, :order4, :autodiff]
end
(Note: Symbol literals cannot start with a digit, so order tags are written :order4 rather than :4th_order.)
Tags are specified via the setups dict (backward compatible — no tags = empty vector):
setups = [
    Dict(:alg => Rosenbrock23(), :tags => [:rosenbrock, :order2, :stiff]),
    Dict(:alg => Rodas5P(),      :tags => [:rosenbrock, :order5, :stiff]),
    Dict(:alg => TRBDF2(),       :tags => [:sdirk, :order2, :stiff]),
    Dict(:alg => KenCarp4(),     :tags => [:sdirk, :order4, :stiff, :imex]),
    Dict(:alg => Tsit5(),        :tags => [:rk, :order5, :nonstiff, :reference]),
    Dict(:alg => CVODE_BDF(),    :tags => [:bdf, :sundials, :reference]),
]
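Backward compatibility should fall out of a single defaulted lookup in the constructor; a minimal sketch of the relevant line:

```julia
# Inside the WorkPrecisionSet constructor (sketch): a setup without a :tags
# entry yields an empty tag vector, so existing benchmark scripts work unchanged.
tags = [get(setup, :tags, Symbol[]) for setup in setups]
```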
Filtering helpers:
# Get the subset matching ALL specified tags (AND logic)
filter_by_tags(wp_set, :rosenbrock)      # all Rosenbrock methods
filter_by_tags(wp_set, :order5, :stiff)  # 5th order AND stiff
filter_by_tags(wp_set, :reference)       # just the reference methods

# Exclude by tags
exclude_by_tags(wp_set, :reference)      # everything except reference methods
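A minimal sketch of how these helpers could be implemented, assuming WorkPrecisionSet exposes its vector of WorkPrecision objects (the wps field name and the subset helper are illustrative):

```julia
# Hypothetical sketch: keep the WorkPrecision entries whose tags contain all
# of the requested tags (AND logic), and rebuild a set from the subset.
function filter_by_tags(wp_set, tags::Symbol...)
    keep = [wp for wp in wp_set.wps if all(t in wp.tags for t in tags)]
    return subset(wp_set, keep)  # hypothetical helper rebuilding a WorkPrecisionSet
end

# Drop every entry carrying any of the given tags.
function exclude_by_tags(wp_set, tags::Symbol...)
    keep = [wp for wp in wp_set.wps if !any(t in wp.tags for t in tags)]
    return subset(wp_set, keep)
end
```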
SDE/DAE compatibility: Tags are purely additive metadata. The existing WorkPrecisionSet constructors for AbstractRODEProblem, AbstractEnsembleProblem, AbstractBVProblem, etc., just need to pass tags through. No changes to numruns_error, error_estimate = :weak_final, prob_choice, or any SDE/DAE-specific parameters.
DAE formalism comparisons can use tags naturally:
setups = [
    Dict(:alg => Rodas5P(), :prob_choice => 1, :tags => [:mass_matrix, :rosenbrock]),
    Dict(:alg => Rodas5P(), :prob_choice => 2, :tags => [:sparse, :rosenbrock]),
    Dict(:alg => IDA(),     :prob_choice => 3, :tags => [:sundials, :dae_residual]),
]
Phase 2: Multi-error-mode runs
Currently each WorkPrecisionSet uses a single error_estimate. To avoid re-running everything for each error mode, support computing multiple error metrics in one pass:
wp_set = WorkPrecisionSet(prob, abstols, reltols, setups;
    error_estimates = [:final, :l2, :L2],  # compute all three
    appxsol = test_sol)
The errors StructArray in WorkPrecision already stores all computed error types as a NamedTuple, so this is mainly a matter of requesting timeseries_errors = true and dense_errors = true together and keeping a vector of the active error estimates for plotting.
For weak SDE errors (:weak_final, :weak_l2, etc.), the same mechanism applies — these are already stored in the errors dict. The expensive part (ensemble runs via numruns_error) only needs to happen once regardless of how many weak error metrics are extracted from the results.
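With all metrics computed in one pass, switching views becomes a lookup; a hypothetical accessor plus the proposed plot-time keyword:

```julia
# Hypothetical accessor: pull one metric out of the stored per-tolerance errors.
errors_for(wp, est::Symbol) = [err[est] for err in wp.errors]

# One expensive solve pass, many views (error_estimate as a plot-time keyword
# is part of this proposal, not the current API):
plot(wp_set, error_estimate = :final)
plot(wp_set, error_estimate = :L2)
```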
Phase 3: Tag-based and reference-method plotting
Extend the plot recipe to support tag-based subsetting and reference method overlays:
# Plot only the Rosenbrock family
plot(wp_set, tags = [:rosenbrock])

# Plot 5th order methods with reference methods always included
plot(wp_set, tags = [:order5], include_tags = [:reference])

# Reference methods get distinct styling (dashed, thinner)
plot(wp_set, tags = [:imex], reference_tags = [:reference],
     reference_style = (linestyle = :dash, linewidth = 1, alpha = 0.5))
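Under the hood, the recipe additions could compose the Phase 1 filters with a styled overlay; a sketch assuming the filter helpers above and standard Plots.jl series attributes:

```julia
using Plots

# Sketch: plot the tag-selected subset, then overlay reference methods with
# muted styling so they anchor the comparison without dominating it.
function plot_with_reference(wp_set; tags, reference_tags = [:reference],
                             reference_style = (linestyle = :dash, linewidth = 1, alpha = 0.5))
    p = plot(filter_by_tags(wp_set, tags...))
    ref = filter_by_tags(wp_set, reference_tags...)
    plot!(p, ref; reference_style...)
    return p
end
```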
When reference methods fall far outside the frame of the main methods, provide an option to:
- Auto-adjust the axis limits to include them
- Clip/omit them with a warning
- Use a secondary inset plot
Phase 4: Best-of-family helpers
Automatically identify standout methods per family:
# Get the best methods per family tag (by Pareto efficiency on the error-time curve)
best = best_by_tag(wp_set, :rosenbrock; n = 2, error_estimate = :final)

# Create a "best of all families" WorkPrecisionSet
families = [:rosenbrock, :bdf, :sdirk, :rk, :imex, :exponential]
best_of = best_of_families(wp_set, families; n = 2)
plot(best_of)  # cross-family comparison with the top 2 from each
"Best" should be determined by Pareto efficiency on the work-precision curve (not just minimum error or minimum time, but the overall curve quality). Could use area under the log-log curve, or minimum time at a reference error level, or a combination.
Phase 5: Time cutoff for slow methods
Some methods are extremely slow at certain tolerances. The current NaN-filtering handles crashes, but not the case where a method takes 100x longer than others. Options:
- Process-level timeout via Distributed: Run each solve in a worker process with a timeout. If it exceeds the cutoff, kill the worker and mark the result as NaN.

  WorkPrecisionSet(prob, abstols, reltols, setups;
      timeout = 300.0,          # seconds per solve
      parallel = :distributed)  # use Distributed workers

- Relative timeout: Set the timeout as a multiple of the fastest solve at each tolerance level (e.g., 50x the fastest).

- Callback-based: Use a DiscreteCallback that checks wall-clock time and terminates the integration.
The Distributed approach is cleanest for hard timeouts but adds a dependency. The callback approach works within a single process. We should support both.
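A minimal sketch of the callback-based variant, using the standard DiscreteCallback and terminate! API:

```julia
using OrdinaryDiffEq  # provides DiscreteCallback and terminate!

# Sketch: terminate the integration once `limit` seconds of wall-clock time
# have elapsed. Construct it immediately before each solve so t0 is fresh;
# the benchmark loop would then treat a Terminated return code as NaN.
function walltime_cutoff(limit)
    t0 = time()
    condition(u, t, integrator) = time() - t0 > limit
    affect!(integrator) = terminate!(integrator)
    DiscreteCallback(condition, affect!; save_positions = (false, false))
end

sol = solve(prob, Rodas5P(); callback = walltime_cutoff(300.0))
```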
For SDE weak benchmarks (which are already very expensive with numruns_error = 1000), the timeout should apply per-trajectory or per-ensemble, not per-individual-solve.
Phase 6: AutoDiff on/off comparison helpers
# Automatically create AD vs no-AD variants of setups
setups_with_ad = with_autodiff_variants(setups;
    ad_backends = [AutoForwardDiff(), AutoFiniteDiff()],
    methods = [:best, 3]  # only for the top 3 methods, or :all
)
# Or tag-based: only show AD comparison for reference + best methods
plot(wp_set, tags = [:autodiff_forward], compare_tags = [:autodiff_finite])
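For illustration, a hand-rolled version of what with_autodiff_variants might generate, assuming the solver constructors accept ADTypes backends via the autodiff keyword (as recent OrdinaryDiffEq versions do):

```julia
using ADTypes  # AutoForwardDiff, AutoFiniteDiff

setups_with_ad = [
    Dict(:alg => Rodas5P(autodiff = AutoForwardDiff()),
         :tags => [:rosenbrock, :order5, :autodiff_forward]),
    Dict(:alg => Rodas5P(autodiff = AutoFiniteDiff()),
         :tags => [:rosenbrock, :order5, :autodiff_finite]),
]
```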
Phase 7: Autoplot — generate comprehensive plot sets
A single function that generates all standard comparison plots:
plots = autoplot(wp_set;
    families = [:rosenbrock, :bdf, :sdirk, :rk, :imex],
    tolerance_ranges = Dict(:low => (1e-3, 1e-8), :high => (1e-8, 1e-13)),
    error_modes = [:final, :l2, :L2],
    reference_tags = [:reference],
    autodiff_compare = true,
    best_n = 2,
    backend = :gr  # or :plotlyjs for interactive
)
Returns a structured collection of plots (one possible layout is sketched after this list):
- Per-family plots (within-family comparison)
- Cross-family "best of" plots
- AD on/off comparison (with best methods only)
- Low tolerance and high tolerance versions of each
- Each error mode version
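One possible layout for the returned collection, keyed by descriptive tuples so benchmark scripts can select or iterate over views (entirely hypothetical):

```julia
# Hypothetical access patterns for the autoplot result:
plots[(:family, :rosenbrock, :final, :low)]  # within-family, final error, low tolerances
plots[(:best_of, :l2, :high)]                # cross-family best-of, l2 error, high tolerances
plots[(:autodiff, :final, :low)]             # AD on/off comparison for the best methods
```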
Phase 8: Interactive Plotly support
For plots with many overlapping curves, interactive Plotly plots would help:
- Hover to see method name and exact values
- Click legend to toggle individual methods
- Zoom into regions of interest
This could be a separate package extension (DiffEqDevToolsPlotlyExt) or just work via PlotlyJS backend for Plots.jl. The plot recipes should degrade gracefully — same recipe, different backend.
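Since Plots.jl already supports backend switching, a recipe-based design gets interactivity essentially for free; a minimal sketch (the tags keyword is the Phase 3 proposal, not the current API):

```julia
using Plots

plotlyjs()                           # switch Plots.jl to the interactive PlotlyJS backend
plot(wp_set, tags = [:rosenbrock])   # same recipe: now with hover, zoom, legend toggling
gr()                                 # back to static GR for the rendered benchmark pages
```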
Phase 9: SciMLBenchmarks migration
After the DiffEqDevTools infrastructure is in place, update SciMLBenchmarks to use it:
- Replace manual family-grouping with tags
- Replace repeated WorkPrecisionSet calls with single tagged runs
- Use autoplot to generate the standard comparison plots
- Dramatically reduce per-benchmark boilerplate
Design Constraints
- Full backward compatibility: All new fields have defaults, all new parameters are keyword-only with defaults matching current behavior.
- SDE weak benchmarks: The most expensive benchmarks (1000+ trajectories × multiple methods). Tagging/filtering must be zero-cost at solve time — it's purely metadata for post-hoc plot generation.
- DAE problem formalism: The prob_choice pattern must continue to work. Tags complement it by adding semantic meaning (:mass_matrix vs :dae_residual vs :mtk_reduced).
- No new hard dependencies: Plotly support via package extension. Distributed timeout via package extension or optional import.
Implementation Plan
PRs 1–4 are the core value and can be done incrementally. PRs 5–8 are enhancements. PR 9+ is the payoff in simplified benchmark code.