eval: consolidate onto the unified Evaluator pipeline (+ viewer/docs)#437
Open
eugenevinitsky wants to merge 20 commits into
Port the two useful capabilities from vcha's standalone eval_multi_scenarios
into the unified Evaluator/EvalManager pipeline as opt-in features, rather
than carrying a second eval system:
- Per-episode metrics CSV: when eval.export_episode_csv is set, the rollout
collects completed_episode summaries (separate from the my_log aggregate
stream that drives the weighted-mean metrics) and writes one row per
finished episode to episode_metrics/<name>_epoch{E}_step{N}.csv.
- Scenario coverage: when eval.verify_coverage is set, report expected vs.
evaluated episode counts (folded into metrics as coverage_*), plus
duplicate detection when the env tags episodes with a scenario identity.
The manager turns on emit_completed_episodes for evaluators that opt in;
this is purely additive and leaves the existing metric path and _should_stop
untouched. Enabled on the multi_scenario evaluators (validation_replay,
validation_gigaflow).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
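The per-episode export described in this commit could be sketched roughly as follows. This assumes each completed-episode summary arrives as a flat dict; `write_episode_csv` and the field names in the usage note are hypothetical, and only the output path pattern is taken from the commit message.

```python
import csv
from pathlib import Path


def write_episode_csv(episodes, out_dir, name, epoch, step):
    """Write one CSV row per finished episode (hypothetical helper)."""
    target_dir = Path(out_dir) / "episode_metrics"
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / f"{name}_epoch{epoch}_step{step}.csv"

    # Union of keys in first-seen order, so episodes with extra fields still fit.
    fieldnames = []
    for episode in episodes:
        for key in episode:
            if key not in fieldnames:
                fieldnames.append(key)

    with path.open("w", newline="") as handle:
        # restval fills columns that a given episode did not report.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(episodes)
    return path
```

Calling `write_episode_csv(rows, "results", "validation_replay", 3, 100)` would write `results/episode_metrics/validation_replay_epoch3_step100.csv`.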
Pull request overview
This PR ports two evaluation features (per-episode CSV export and scenario coverage verification) from a standalone evaluator flow into the unified EvalManager/Evaluator pipeline, keeping evaluation logic centralized while making the new behavior opt-in per evaluator.
Changes:
- Enable `env.emit_completed_episodes` automatically for evaluators that set `eval.export_episode_csv` and/or `eval.verify_coverage`.
- Extend the base evaluator rollout to collect `completed_episode` summaries separately, optionally write per-episode CSVs, and report coverage scalars into the evaluator metrics.
- Update `drive.ini` to enable these features for the validation evaluators and repoint gigaflow validation `env.map_dir` to `.../binaries/carla`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pufferlib/ocean/benchmark/manager.py | Turns on emit_completed_episodes when CSV/coverage features are enabled for an evaluator. |
| pufferlib/ocean/benchmark/evaluators/base.py | Collects per-episode summaries, exports CSV, computes coverage metrics, and avoids polluting the default aggregated metric stream. |
| pufferlib/config/ocean/drive.ini | Enables episode CSV + coverage for validation evaluators; updates gigaflow validation map_dir. |
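
As a rough illustration of the manager change summarized in the table, the opt-in could look like the sketch below. It assumes an evaluator's config is a plain dict with an `env` sub-dict; `env_overrides_for` is a hypothetical name, not the function in `manager.py`.

```python
def env_overrides_for(evaluator_cfg):
    """Build env overrides for one evaluator (hypothetical helper)."""
    overrides = dict(evaluator_cfg.get("env", {}))
    wants_episodes = bool(
        evaluator_cfg.get("export_episode_csv") or evaluator_cfg.get("verify_coverage")
    )
    if wants_episodes:
        # Only evaluators that opt in pay for per-episode emission.
        overrides["emit_completed_episodes"] = True
    return overrides
```

Evaluators that set neither flag get their env overrides back unchanged, which is what keeps the feature additive.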
Comment on lines +318 to +322

        "expected": expected,
        "found": found,
        "complete": found >= expected,
        "duplicates": duplicates,
    }

      render_views = ["sim_state", "bev"]
      env.simulation_mode = "gigaflow"
    - env.map_dir = "pufferlib/resources/drive/binaries/carla_py123d"
    + env.map_dir = "pufferlib/resources/drive/binaries/carla"

Comment on lines +237 to +245

    def _maybe_export_episodes(self, args, metrics) -> None:
        """Write a per-episode metrics CSV and/or a scenario-coverage report.

        Both are off by default and enabled per-evaluator via the
        [eval.<name>] section:

            eval.export_episode_csv = true
            eval.verify_coverage = true

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
925b7c8 to 944cb3d
Rework [eval.validation_replay] / [eval.validation_gigaflow] so the unified
Evaluator pipeline runs the same eval that the standalone eval_multi_scenarios
path configured via build_eval_overrides:
- New [eval.validation_defaults] template carries the shared clean-eval env +
fixed eval reward weights (collision/offroad 3.0, goal 1.0, ...), eval_mode,
termination_mode=0, reward_randomization off, target_type=static,
traffic_light_behavior=0 (explicit value wins over the clean macro),
num_agents=512, and num_scenarios=250.
- validation_replay: replay over the real nuplan bins, num_maps=250,
max_agents_per_env=64, scenario_length=200, control_sdc_only.
- validation_gigaflow: gigaflow on the carla maps, num_maps=8, 40 agents/env,
scenario_length=500.
Both run every 25 epochs and emit the per-episode CSV + coverage report.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Capture the scenario identity into CompletedEpisodeSummary at episode completion (before c_reset resamples the env slot) and emit it from my_completed_episode_to_dict, so per-episode consumers can attribute each summary to its map. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
G2: coverage report now uses the per-episode map_name to report unique maps
and duplicates, not just counts.
G5: add a CPU-only observation render path (eval.render_obs) that writes one
interactive HTML per scenario via pufferlib.viz, including each agent's
unpacked observation, plus a gallery index. Takes precedence over the egl/html
render backends when set.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
G3: puffer eval adopts a checkpoint's policy/rnn architecture from its sibling
config.yaml, so arbitrary checkpoints load without manual --policy.* overrides.
G4: puffer eval accepts --eval_simulation / --num_scenarios / --render /
--render_obs / --num_carla_maps and applies them to the chosen evaluator
(override-only-when-passed). scripts/eval/* now drive `puffer eval`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add docs/evaluation.md (Evaluator/EvalManager pipeline: config schema, running inline/standalone/ad-hoc, outputs, built-in evaluators) and point the README Eval section at it. Stop ignoring docs/*.md (keep ignoring the sphinx _build output). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- coverage_complete compares unique maps to expected for unique-scenario
sweeps (expected <= num_maps, e.g. replay) and falls back to episode count
when maps cycle (gigaflow), so a 16-distinct-of-250 sweep no longer reports
complete.
- obs render filenames end in the numeric index so build_gallery_index matches
them and writes the gallery.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
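The completeness rule in this commit can be sketched as below. This is a minimal illustration, not the code in `base.py`: `coverage_report` is a hypothetical helper, and the result keys mirror the `expected` / `found` / `complete` / `duplicates` fields quoted in the review plus the `unique_maps` scalar named in the PR description.

```python
from collections import Counter


def coverage_report(map_names, expected, num_maps):
    """Summarize scenario coverage from per-episode map names (hypothetical helper)."""
    counts = Counter(map_names)
    found = len(map_names)
    unique_maps = len(counts)
    duplicates = sorted(name for name, n in counts.items() if n > 1)

    if expected <= num_maps:
        # Unique-scenario sweep (e.g. replay): every expected map must appear.
        complete = unique_maps >= expected
    else:
        # Maps cycle (e.g. gigaflow): only the episode count can be checked.
        complete = found >= expected

    return {
        "expected": expected,
        "found": found,
        "unique_maps": unique_maps,
        "complete": complete,
        "duplicates": duplicates,
    }
```

With 250 episodes drawn from only 16 distinct maps and `expected=250, num_maps=250`, this reports `complete` as false, which is the case the commit message calls out.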
The unified Evaluator/EvalManager pipeline now reproduces the standalone
eval_multi_scenarios path (per-episode CSV, coverage, obs render, the
build_eval_overrides config as [eval.*] sections, ad-hoc CLI, checkpoint arch
merge), so drop the duplicate:
- delete eval_multi_scenarios / eval_multi_scenarios_render and their helpers
(build_eval_overrides, load_eval_multi_scenarios_config, _export_metrics,
_log_eval_metrics, verify_scenario_coverage[_gigaflow]) and the two CLI modes.
- drop the dead [eval] scalar config block (never wired into training).
- set validation_replay env.num_eval_scenarios=250 so eval_mode sweeps all 250
distinct maps instead of the C default of 16.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The replay sweep hung: setting num_eval_scenarios so eval_mode sweeps all
distinct maps means the env stops producing episodes after the sweep, but
_should_stop counted my_log emissions (~one per batch), a target the sweep
never reaches. Fixes:
- MultiScenarioEvaluator._should_stop counts completed episodes (1:1 with
scenarios) when per-episode collection is on; legacy emission count otherwise.
- env_overrides auto-derives num_eval_scenarios from num_scenarios for replay,
so they can't drift (incl. ad-hoc --num_scenarios overrides); gigaflow maps
still cycle. Drops the hard-coded env.num_eval_scenarios from drive.ini.
- base rollout adds a stall backstop: stop after 3x scenario_length steps with
no new episode, so an exhausted/misconfigured sweep can't spin to timeout.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
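The stop condition and stall backstop from this commit might be combined as in the sketch below. It assumes a hypothetical `step_fn` callable that advances the env one step and returns how many episodes finished on that step; the loop shape and return values are illustrative, and only the `3 * scenario_length` threshold comes from the commit message.

```python
def run_until_done(step_fn, target_episodes, scenario_length, max_steps):
    """Step until enough episodes finish, or the rollout stalls (hypothetical loop)."""
    stall_limit = 3 * scenario_length
    completed = 0
    steps_since_last_episode = 0

    for _ in range(max_steps):
        newly_finished = step_fn()  # episodes that ended on this step
        if newly_finished > 0:
            completed += newly_finished
            steps_since_last_episode = 0
        else:
            steps_since_last_episode += 1

        # Count completed episodes, not log emissions, against the target.
        if completed >= target_episodes:
            return completed, "target_reached"
        # Backstop: an exhausted sweep produces no more episodes, so give up.
        if steps_since_last_episode >= stall_limit:
            return completed, "stalled"
    return completed, "timeout"
```

Counting finished episodes directly is what makes the target reachable when the env emits one aggregate log per batch rather than one per scenario.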
Evaluating an arbitrary checkpoint also needs the env to pack observations the way the policy was trained to read them. Extend _merge_checkpoint_arch to pull the obs/action-layout env keys (max_*_observations, target_type, num_target_waypoints, action_type, dynamics_model, traffic_control_scope, reward_conditioning, trajectory_*) from the checkpoint's config.yaml, alongside the policy/rnn architecture. Eval-policy env config (sim mode, maps, rewards) still comes from the [eval.<name>] section. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
….yaml Evaluating a checkpoint also needs the env to normalize observations and clip the observed scene exactly as the policy was trained — otherwise positions are mis-scaled and the scene is cropped, so the policy drives offroad even on the same maps. Extend _ARCH_ENV_KEYS with the observation normalization scales (max_position, max_goal_position, max_veh_*, max_road_segment_*, max_traffic_control_distance), the observation distances (agent_obs_max_dist, road_obs_front/behind/side_dist), and the target waypoint spacing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
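A minimal sketch of the merge described in the last two commits, assuming both configs are nested dicts with `env`, `policy`, and `rnn` sections. The function name, the dict layout, and the short key tuple are illustrative assumptions; the full key list is the `_ARCH_ENV_KEYS` in the PR, and only the key names shown here are quoted from the commit messages.

```python
# Illustrative subset only; the PR's _ARCH_ENV_KEYS carries the full list.
ARCH_ENV_KEYS = (
    "target_type",
    "action_type",
    "dynamics_model",
    "max_position",
    "agent_obs_max_dist",
)


def merge_checkpoint_arch(eval_config, ckpt_config, env_keys=ARCH_ENV_KEYS):
    """Overlay a checkpoint's observation contract onto an eval config (sketch)."""
    merged = {section: dict(values) for section, values in eval_config.items()}

    # Network architecture comes wholesale from the checkpoint.
    for section in ("policy", "rnn"):
        if section in ckpt_config:
            merged[section] = dict(ckpt_config[section])

    # Only whitelisted obs/action-layout keys override the eval env; sim mode,
    # maps, and rewards stay with the evaluator's own section.
    ckpt_env = ckpt_config.get("env", {})
    env = merged.setdefault("env", {})
    for key in env_keys:
        if key in ckpt_env:
            env[key] = ckpt_env[key]
    return merged
```

The whitelist is the design point: copying the whole training `env` section would also drag in training-time maps and reward settings that the evaluator is meant to control.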
The completed_episode summary's map_name is the full bin path. The html
render built its output path as out_dir / f"{map_name}_...", and an absolute
map_name makes pathlib discard out_dir — so the HTML files were written next
to the source bins instead of the render dir (and never appeared in the
results). Basename map_name like the obs render already does, and guard the
info loop against non-dict entries.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
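The path bug in this commit follows from how `pathlib` joins paths: an absolute right-hand operand replaces everything to its left. The snippet below reproduces it with `PurePosixPath` so it behaves the same on any platform; the directory, map path, and `_ep0` suffix are made up for illustration, and using `.stem` is one way to take the basename, not necessarily the one the PR uses.

```python
from pathlib import PurePosixPath

out_dir = PurePosixPath("renders/epoch_10")
map_name = "/data/bins/map_001.bin"  # hypothetical absolute bin path

# Joining with an absolute right-hand side drops out_dir entirely.
escaped = out_dir / f"{map_name}_ep0.html"

# Reducing map_name to its basename first keeps the file under out_dir.
stem = PurePosixPath(map_name).stem
fixed = out_dir / f"{stem}_ep0.html"

print(escaped)  # /data/bins/map_001.bin_ep0.html
print(fixed)    # renders/epoch_10/map_001_ep0.html
```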
Click the panel header to collapse it to just the title (chevron flips), click again to expand. Keeps the replay view unobstructed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Start the panel minimized (title + chevron only); click to expand. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
On load, follow the controlled agent the observations were recorded for (first id in ALL_OBS) so the camera centers on the SDC and its obs view opens without a manual click. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Spell out that --eval_simulation gigaflow|replay selects the built-in validation_gigaflow / validation_replay evaluator (the by-name mode names its own), and that the override flags apply only when passed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
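One common way to get override-only-when-passed behavior with `argparse` is to default every flag to `None` and skip unset values. The sketch below assumes that approach; `build_parser`, `select_evaluator`, and `apply_overrides` are hypothetical names, and only the two flags and the `validation_<mode>` evaluator names come from the commit messages.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="eval-sketch")
    # default=None distinguishes "not passed" from any real value.
    parser.add_argument("--eval_simulation", choices=["gigaflow", "replay"], default=None)
    parser.add_argument("--num_scenarios", type=int, default=None)
    return parser


def select_evaluator(args):
    """Map --eval_simulation onto the built-in evaluator name, if given."""
    if args.eval_simulation is None:
        return None
    return f"validation_{args.eval_simulation}"


def apply_overrides(evaluator_cfg, args):
    """Return a copy of evaluator_cfg with only explicitly passed flags applied."""
    updated = dict(evaluator_cfg)
    for key, value in vars(args).items():
        if key == "eval_simulation":
            continue  # selects the evaluator rather than overriding its config
        if value is not None:
            updated[key] = value
    return updated
```

With this shape, `--eval_simulation gigaflow` alone leaves the evaluator's configured `num_scenarios` untouched, while adding `--num_scenarios 32` replaces it.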
Two confusing names cleaned up:
- The two HTML renderers + the render_obs boolean are unified into one
render_backend enum: egl (mp4) | triage_html (scene + per-episode metrics for
triage, from the captured compact-replay bundle) | obs_html (interactive scene
+ the agent's NN observation). Drops the render_obs boolean that silently
overrode render_backend, and renames the old "html" value to triage_html. CLI:
--render-backend replaces --render_obs.
- --num_carla_maps -> --num_maps: eval isn't CARLA-only (replay uses nuPlan
bins), and it just sets env.num_maps anyway.
Updates the dispatcher, drive.ini (validation_replay -> triage_html), the
ad-hoc CLI, scripts/eval/*, and docs (with a Render backends table).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
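A single `render_backend` setting with three values could be modeled as an enum, as in this sketch. The class and function names are hypothetical; the three values are the ones listed in the commit message, and the rejection of the old `"html"` value is an assumption about how a rename might be enforced.

```python
from enum import Enum


class RenderBackend(str, Enum):
    EGL = "egl"                  # mp4 output
    TRIAGE_HTML = "triage_html"  # scene + per-episode metrics
    OBS_HTML = "obs_html"        # interactive scene + agent observation


def parse_render_backend(value):
    """Convert a config string to a RenderBackend, with a readable error."""
    try:
        return RenderBackend(value)
    except ValueError:
        valid = ", ".join(backend.value for backend in RenderBackend)
        raise ValueError(
            f"unknown render_backend {value!r}; expected one of: {valid}"
        ) from None
```

Making the backend one exclusive choice removes the situation the commit describes, where a separate boolean could silently override the configured backend.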
Consolidates all evaluation onto the unified `Evaluator`/`EvalManager` pipeline: the validation evaluators now reproduce the standalone `eval_multi_scenarios` setup, the useful bits of that path are folded into the pipeline, the duplicate code is removed, and standalone-checkpoint eval is made robust. Plus replay-viewer UX and an evaluation doc.

Verified end-to-end on the cluster with real checkpoints (`salade`, `tomate`).

1. Validation evaluators reproduce the multi-scenario eval
`[eval.validation_replay]` / `[eval.validation_gigaflow]` (via a shared `validation_defaults` template) now carry the env + reward config that `build_eval_overrides` used: replay over the nuplan bins (`control_sdc_only`) and gigaflow over the carla maps, fixed eval reward weights, `traffic_light_behavior=0`, etc.

2. Per-episode CSV + coverage (opt-in)
- `eval.export_episode_csv`: one row per finished episode → `episode_metrics/<name>_epoch{E}_step{N}.csv`, with `map_name`/`scenario_id` identity (G1: attached to the C `CompletedEpisodeSummary`).
- `eval.verify_coverage`: `coverage_expected/found/unique_maps/complete` folded into metrics + missing/duplicate logging (G2: per-map). Completeness compares unique maps for unique-scenario sweeps, episode count when maps cycle.
- The `completed_episode` queue is separate from the my_log accumulator, so enabling this is additive and leaves the default metric path unchanged.

3. Robust standalone-checkpoint eval
- `puffer eval --load-model-path <ckpt>` reconstructs the checkpoint's full observation contract from its sibling `config.yaml`: network arch (`policy`/`rnn`), obs/action layout (token counts, target type, `reward_conditioning`, …), and obs normalization / spatial extent (`max_position`, `agent_obs_max_dist`, road distances, …). Without this the env feeds a mis-scaled/cropped observation and the policy drives offroad on identical maps.
- `puffer eval` accepts `--eval_simulation` / `--num_scenarios` / `--render` / `--render_obs` / `--num_carla_maps` (override-only-when-passed); `scripts/eval/*` rewired to `puffer eval`.

4. Observation render (G5)
`eval.render_obs` → CPU-only interactive HTML per scenario via `pufferlib.viz` (scene + each agent's unpacked obs) + a gallery `index.html`. No EGL.

5. Consolidation
Removed the superseded standalone path: `eval_multi_scenarios`/`_render` and helpers (`build_eval_overrides`, `load_eval_multi_scenarios_config`, `_export_metrics`, `_log_eval_metrics`, `verify_scenario_coverage[_gigaflow]`), the two CLI modes, and the dead `[eval]` scalar config block. The pipeline is now the only eval system.

6. Correctness fixes found via cluster smokes
- `MultiScenarioEvaluator._should_stop` counts completed episodes (1:1 with scenarios) when per-episode collection is on; `num_eval_scenarios` auto-syncs with `num_scenarios` for replay so the sweep can't hang or under/over-run; added a stall backstop.
- The HTML render now uses the basename of `map_name` (an absolute path was making `out_dir / stem` escape the render dir, writing files next to the source bins).

7. Replay viewer (viz.py)
Scenario Info panel is click-to-minimize and collapsed by default; the SDC (the controlled agent the obs were recorded for) is selected by default so the camera centers on it and its obs view opens on load.
8. Misc
- `weigths/` → `weights/` rename.
- `docs/evaluation.md` (config schema, running inline/standalone/ad-hoc, outputs, built-in evaluators); README Eval section rewired; `docs/*.md` un-ignored.

Verification
On the cluster: `validation_replay` / `validation_gigaflow` run, terminate, and produce the CSV + coverage (e.g. 32/32 unique maps, complete). A real checkpoint (`salade`) evals correctly only once G3 reconstructs its obs normalization: offroad drops 0.64 → 0.002 and return rises −5.55 → +8.36, confirming the obs-contract merge. Both render backends produce correct output.

🤖 Generated with Claude Code