eval: consolidate onto the unified Evaluator pipeline (+ viewer/docs) by eugenevinitsky · Pull Request #437 · Emerge-Lab/PufferDrive

eugenevinitsky · 2026-05-21T21:33:10Z

Consolidates all evaluation onto the unified Evaluator/EvalManager pipeline: the validation evaluators now reproduce the standalone eval_multi_scenarios setup, the useful bits of that path are folded into the pipeline, the duplicate code is removed, and standalone-checkpoint eval is made robust. Plus replay-viewer UX and an evaluation doc.

Verified end-to-end on the cluster with real checkpoints (salade, tomate).

1. Validation evaluators reproduce the multi-scenario eval

[eval.validation_replay] / [eval.validation_gigaflow] (via a shared validation_defaults template) now carry the env + reward config that build_eval_overrides used — replay over the nuplan bins (control_sdc_only) and gigaflow over the carla maps, fixed eval reward weights, traffic_light_behavior=0, etc.

2. Per-episode CSV + coverage (opt-in)

eval.export_episode_csv — one row per finished episode → episode_metrics/<name>_epoch{E}_step{N}.csv, with map_name/scenario_id identity (G1: attached to the C CompletedEpisodeSummary).
eval.verify_coverage — coverage_expected/found/unique_maps/complete folded into metrics + missing/duplicate logging (G2: per-map). Completeness compares unique maps against the expected count for unique-scenario sweeps, and falls back to the episode count when maps cycle.
The env's completed_episode queue is separate from the my_log accumulator, so enabling this is additive and leaves the default metric path unchanged.
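A minimal sketch of the coverage rule described above, not the code in this PR. The function name `coverage_report` and its signature are hypothetical; the `coverage_*` metric names and the `expected <= num_maps` test are taken from the PR text, and the sketch assumes each episode summary is a dict carrying a `map_name`.

```python
from collections import Counter
from pathlib import PurePosixPath


def coverage_report(episodes, expected, num_maps):
    """Summarize which maps a sweep covered (hypothetical helper).

    episodes: list of dicts with a "map_name" key, one per finished episode.
    expected: number of scenarios the sweep was asked to evaluate.
    num_maps: number of distinct maps available to the env.
    """
    # Basename the map path so the same map compares equal regardless of directory.
    names = [PurePosixPath(ep["map_name"]).name for ep in episodes]
    counts = Counter(names)
    found = len(names)
    unique_maps = len(counts)
    duplicates = sorted(name for name, n in counts.items() if n > 1)

    if expected <= num_maps:
        # Unique-scenario sweep: every expected scenario should be a distinct map.
        complete = unique_maps >= expected
    else:
        # Maps cycle, so repeats are normal; fall back to the episode count.
        complete = found >= expected

    return {
        "coverage_expected": expected,
        "coverage_found": found,
        "coverage_unique_maps": unique_maps,
        "coverage_complete": complete,
        "duplicates": duplicates,
    }
```

Under this rule a replay sweep that finishes 250 episodes over only 16 distinct maps is reported as incomplete, while a gigaflow sweep cycling through 8 maps is judged on episode count alone.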

3. Robust standalone-checkpoint eval

G3 — puffer eval --load-model-path <ckpt> reconstructs the checkpoint's full observation contract from its sibling config.yaml: network arch (policy/rnn), obs/action layout (token counts, target type, reward_conditioning, …), and obs normalization / spatial extent (max_position, agent_obs_max_dist, road distances, …). Without this the env feeds a mis-scaled/cropped observation and the policy drives offroad on identical maps.
G4 — puffer eval accepts --eval_simulation / --num_scenarios / --render / --render_obs / --num_carla_maps (override-only-when-passed); scripts/eval/* rewired to puffer eval.
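One common way to get override-only-when-passed behaviour is to default every flag to `None` and drop the unset ones before merging. The sketch below illustrates that pattern with `argparse`; it is not the actual `puffer eval` parser. The flag names come from the PR, while `parse_overrides`, `apply_overrides`, and the flat config dict are assumptions made for the example.

```python
import argparse


def parse_overrides(argv):
    """Return only the eval overrides the user actually passed (sketch)."""
    parser = argparse.ArgumentParser(prog="puffer-eval-sketch")
    # default=None marks "not passed", so config values are left alone.
    parser.add_argument("--eval_simulation", choices=["gigaflow", "replay"], default=None)
    parser.add_argument("--num_scenarios", type=int, default=None)
    parser.add_argument("--render", action="store_true", default=None)
    args = parser.parse_args(argv)
    return {key: value for key, value in vars(args).items() if value is not None}


def apply_overrides(config, overrides):
    """Copy the config and overlay the passed flags on top of it."""
    merged = dict(config)
    merged.update(overrides)
    return merged
```

With no flags, `parse_overrides([])` returns an empty dict and the evaluator's own `[eval.<name>]` values survive untouched.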

4. Observation render (G5)

eval.render_obs → CPU-only interactive HTML per scenario via pufferlib.viz (scene + each agent's unpacked obs) + a gallery index.html. No EGL.

5. Consolidation

Removed the superseded standalone path — eval_multi_scenarios / _render and helpers (build_eval_overrides, load_eval_multi_scenarios_config, _export_metrics, _log_eval_metrics, verify_scenario_coverage[_gigaflow]), the two CLI modes, and the dead [eval] scalar config block. The pipeline is now the only eval system.

6. Correctness fixes found via cluster smokes

Stop condition: MultiScenarioEvaluator._should_stop counts completed episodes (1:1 with scenarios) when per-episode collection is on; num_eval_scenarios auto-syncs with num_scenarios for replay so the sweep can't hang or under/over-run; added a stall backstop.
HTML render path: basename map_name (an absolute path was making out_dir / stem escape the render dir, writing files next to the source bins).

7. Replay viewer (viz.py)

Scenario Info panel is click-to-minimize and collapsed by default; the SDC (the controlled agent the obs were recorded for) is selected by default so the camera centers on it and its obs view opens on load.

8. Misc

weigths/ → weights/ rename.
docs/evaluation.md (config schema, running inline/standalone/ad-hoc, outputs, built-in evaluators); README Eval section rewired; docs/*.md un-ignored.

Verification

On the cluster: validation_replay/validation_gigaflow run, terminate, and produce the CSV + coverage (e.g. 32/32 unique maps, complete). A real checkpoint (salade) evals correctly only once G3 reconstructs its obs normalization — offroad drops 0.64 → 0.002, return −5.55 → +8.36 — confirming the obs-contract merge. Both render backends produce correct output.

🤖 Generated with Claude Code

Port the two useful capabilities from vcha's standalone eval_multi_scenarios into the unified Evaluator/EvalManager pipeline as opt-in features, rather than carrying a second eval system:

- Per-episode metrics CSV: when eval.export_episode_csv is set, the rollout collects completed_episode summaries (separate from the my_log aggregate stream that drives the weighted-mean metrics) and writes one row per finished episode to episode_metrics/<name>_epoch{E}_step{N}.csv.
- Scenario coverage: when eval.verify_coverage is set, report expected vs. evaluated episode counts (folded into metrics as coverage_*), plus duplicate detection when the env tags episodes with a scenario identity.

The manager turns on emit_completed_episodes for evaluators that opt in; this is purely additive and leaves the existing metric path and _should_stop untouched. Enabled on the multi_scenario evaluators (validation_replay, validation_gigaflow).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

This PR ports two evaluation features (per-episode CSV export and scenario coverage verification) from a standalone evaluator flow into the unified EvalManager/Evaluator pipeline, keeping evaluation logic centralized while making the new behavior opt-in per evaluator.

Changes:

Enable env.emit_completed_episodes automatically for evaluators that set eval.export_episode_csv and/or eval.verify_coverage.
Extend the base evaluator rollout to collect completed_episode summaries separately, optionally write per-episode CSVs, and report coverage scalars into the evaluator metrics.
Update drive.ini to enable these features for the validation evaluators and repoint gigaflow validation env.map_dir to .../binaries/carla.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pufferlib/ocean/benchmark/manager.py | Turns on `emit_completed_episodes` when CSV/coverage features are enabled for an evaluator. |
| pufferlib/ocean/benchmark/evaluators/base.py | Collects per-episode summaries, exports CSV, computes coverage metrics, and avoids polluting the default aggregated metric stream. |
| pufferlib/config/ocean/drive.ini | Enables episode CSV + coverage for validation evaluators; updates gigaflow validation `map_dir`. |


+            "expected": expected,
+            "found": found,
+            "complete": found >= expected,
+            "duplicates": duplicates,
+        }


 render_views = ["sim_state", "bev"]
 env.simulation_mode = "gigaflow"
-env.map_dir = "pufferlib/resources/drive/binaries/carla_py123d"
+env.map_dir = "pufferlib/resources/drive/binaries/carla"


+    def _maybe_export_episodes(self, args, metrics) -> None:
+        """Write a per-episode metrics CSV and/or a scenario-coverage report.
+
+        Both are off by default and enabled per-evaluator via the
+        [eval.<name>] section:
+
+            eval.export_episode_csv = true
+            eval.verify_coverage    = true
+


Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Rework [eval.validation_replay] / [eval.validation_gigaflow] so the unified Evaluator pipeline runs the same eval that the standalone eval_multi_scenarios path configured via build_eval_overrides:

- New [eval.validation_defaults] template carries the shared clean-eval env + fixed eval reward weights (collision/offroad 3.0, goal 1.0, ...), eval_mode, termination_mode=0, reward_randomization off, target_type=static, traffic_light_behavior=0 (explicit value wins over the clean macro), num_agents=512, and num_scenarios=250.
- validation_replay: replay over the real nuplan bins, num_maps=250, max_agents_per_env=64, scenario_length=200, control_sdc_only.
- validation_gigaflow: gigaflow on the carla maps, num_maps=8, 40 agents/env, scenario_length=500.

Both run every 25 epochs and emit the per-episode CSV + coverage report.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Capture the scenario identity into CompletedEpisodeSummary at episode completion (before c_reset resamples the env slot) and emit it from my_completed_episode_to_dict, so per-episode consumers can attribute each summary to its map. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

G2: coverage report now uses the per-episode map_name to report unique maps and duplicates, not just counts. G5: add a CPU-only observation render path (eval.render_obs) that writes one interactive HTML per scenario via pufferlib.viz, including each agent's unpacked observation, plus a gallery index. Takes precedence over the egl/html render backends when set. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

G3: puffer eval adopts a checkpoint's policy/rnn architecture from its sibling config.yaml, so arbitrary checkpoints load without manual --policy.* overrides. G4: puffer eval accepts --eval_simulation / --num_scenarios / --render / --render_obs / --num_carla_maps and applies them to the chosen evaluator (override-only-when-passed). scripts/eval/* now drive `puffer eval`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Add docs/evaluation.md (Evaluator/EvalManager pipeline: config schema, running inline/standalone/ad-hoc, outputs, built-in evaluators) and point the README Eval section at it. Stop ignoring docs/*.md (keep ignoring the sphinx _build output). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- coverage_complete compares unique maps to expected for unique-scenario sweeps (expected <= num_maps, e.g. replay) and falls back to episode count when maps cycle (gigaflow), so a 16-distinct-of-250 sweep no longer reports complete.
- obs render filenames end in the numeric index so build_gallery_index matches them and writes the gallery.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The unified Evaluator/EvalManager pipeline now reproduces the standalone eval_multi_scenarios path (per-episode CSV, coverage, obs render, the build_eval_overrides config as [eval.*] sections, ad-hoc CLI, checkpoint arch merge), so drop the duplicate:

- delete eval_multi_scenarios / eval_multi_scenarios_render and their helpers (build_eval_overrides, load_eval_multi_scenarios_config, _export_metrics, _log_eval_metrics, verify_scenario_coverage[_gigaflow]) and the two CLI modes.
- drop the dead [eval] scalar config block (never wired into training).
- set validation_replay env.num_eval_scenarios=250 so eval_mode sweeps all 250 distinct maps instead of the C default of 16.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The replay sweep hung: setting num_eval_scenarios so eval_mode sweeps all distinct maps means the env stops producing episodes after the sweep, but _should_stop counted my_log emissions (~one per batch), a target the sweep never reaches. Fixes:

- MultiScenarioEvaluator._should_stop counts completed episodes (1:1 with scenarios) when per-episode collection is on; legacy emission count otherwise.
- env_overrides auto-derives num_eval_scenarios from num_scenarios for replay, so they can't drift (incl. ad-hoc --num_scenarios overrides); gigaflow maps still cycle. Drops the hard-coded env.num_eval_scenarios from drive.ini.
- base rollout adds a stall backstop: stop after 3x scenario_length steps with no new episode, so an exhausted/misconfigured sweep can't spin to timeout.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
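The stall backstop can be pictured as a counter that resets whenever an episode finishes. This is a rough sketch under stated assumptions, not the rollout in base.py: `run_sweep`, `step_fn`, and the returned status strings are hypothetical, and only the "3x scenario_length steps with no new episode" rule is taken from the commit message.

```python
def run_sweep(step_fn, target_episodes, scenario_length, stall_factor=3):
    """Roll out until enough episodes finish or the sweep stalls (sketch).

    step_fn: callable returning how many episodes completed on this step.
    """
    completed = 0
    steps_since_episode = 0
    stall_limit = stall_factor * scenario_length
    while completed < target_episodes:
        finished = step_fn()
        if finished > 0:
            # Progress was made, so the stall counter starts over.
            completed += finished
            steps_since_episode = 0
        else:
            steps_since_episode += 1
            if steps_since_episode >= stall_limit:
                # No new episode for too long: exhausted or misconfigured sweep.
                return completed, "stalled"
    return completed, "done"
```

Counting completed episodes rather than log emissions keeps the stop target reachable, and the stall limit bounds the damage when it is not.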

Evaluating an arbitrary checkpoint also needs the env to pack observations the way the policy was trained to read them. Extend _merge_checkpoint_arch to pull the obs/action-layout env keys (max_*_observations, target_type, num_target_waypoints, action_type, dynamics_model, traffic_control_scope, reward_conditioning, trajectory_*) from the checkpoint's config.yaml, alongside the policy/rnn architecture. Eval-policy env config (sim mode, maps, rewards) still comes from the [eval.<name>] section. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

….yaml Evaluating a checkpoint also needs the env to normalize observations and clip the observed scene exactly as the policy was trained — otherwise positions are mis-scaled and the scene is cropped, so the policy drives offroad even on the same maps. Extend _ARCH_ENV_KEYS with the observation normalization scales (max_position, max_goal_position, max_veh_*, max_road_segment_*, max_traffic_control_distance), the observation distances (agent_obs_max_dist, road_obs_front/behind/side_dist), and the target waypoint spacing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
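To make the merge concrete, here is a minimal sketch that assumes the sibling config.yaml has already been parsed into nested dicts. `merge_checkpoint_contract` is a hypothetical name, and `ARCH_ENV_KEYS` below is only a small subset of the keys named in these commit messages; the real `_ARCH_ENV_KEYS` and `_merge_checkpoint_arch` in the repository may be shaped differently.

```python
# Hypothetical subset of the keys listed in the commit messages.
ARCH_ENV_KEYS = (
    "target_type",
    "action_type",
    "dynamics_model",
    "reward_conditioning",
    "max_position",
    "agent_obs_max_dist",
)


def merge_checkpoint_contract(eval_cfg, ckpt_cfg):
    """Overlay a checkpoint's observation contract onto an eval config.

    Both arguments are dicts shaped like
    {"policy": {...}, "rnn": {...}, "env": {...}}.
    """
    # Copy one level deep so the caller's eval config is not mutated.
    merged = {section: dict(values) for section, values in eval_cfg.items()}
    # Network architecture comes wholesale from the checkpoint.
    for section in ("policy", "rnn"):
        if section in ckpt_cfg:
            merged[section] = dict(ckpt_cfg[section])
    # Only whitelisted env keys are copied; sim mode, maps and rewards stay.
    env = merged.setdefault("env", {})
    ckpt_env = ckpt_cfg.get("env", {})
    for key in ARCH_ENV_KEYS:
        if key in ckpt_env:
            env[key] = ckpt_env[key]
    return merged
```

The whitelist is the design point: observation layout and normalization follow the checkpoint, while everything that defines the evaluation itself still comes from the `[eval.<name>]` section.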

The completed_episode summary's map_name is the full bin path. The html render built its output path as out_dir / f"{map_name}_...", and an absolute map_name makes pathlib discard out_dir — so the HTML files were written next to the source bins instead of the render dir (and never appeared in the results). Basename map_name like the obs render already does, and guard the info loop against non-dict entries. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
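The pathlib behaviour behind this bug is easy to reproduce. The sketch below uses made-up paths and a hypothetical `_ep{N}.html` suffix (the real filename pattern is elided in the commit message), and uses `PurePosixPath` so it behaves the same on any OS.

```python
from pathlib import PurePosixPath


def render_path(out_dir, map_name, episode):
    """Build an output path that always stays inside out_dir (sketch)."""
    # .stem drops both the directory and the final extension of map_name.
    stem = PurePosixPath(map_name).stem
    return PurePosixPath(out_dir) / f"{stem}_ep{episode}.html"


# The bug: joining onto an absolute path discards the left-hand side.
buggy = PurePosixPath("renders") / "/data/bins/map_001.bin_ep0.html"
fixed = render_path("renders", "/data/bins/map_001.bin", 0)
print(buggy)  # /data/bins/map_001.bin_ep0.html
print(fixed)  # renders/map_001_ep0.html
```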

Click the panel header to collapse it to just the title (chevron flips), click again to expand. Keeps the replay view unobstructed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Start the panel minimized (title + chevron only); click to expand. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

On load, follow the controlled agent the observations were recorded for (first id in ALL_OBS) so the camera centers on the SDC and its obs view opens without a manual click. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Spell out that --eval_simulation gigaflow|replay selects the built-in validation_gigaflow / validation_replay evaluator (the by-name mode names its own), and that the override flags apply only when passed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two confusing names cleaned up:

- The two HTML renderers + the render_obs boolean are unified into one render_backend enum: egl (mp4) | triage_html (scene + per-episode metrics for triage, from the captured compact-replay bundle) | obs_html (interactive scene + the agent's NN observation). Drops the render_obs boolean that silently overrode render_backend, and renames the old "html" value to triage_html. CLI: --render-backend replaces --render_obs.
- --num_carla_maps -> --num_maps: eval isn't CARLA-only (replay uses nuPlan bins), and it just sets env.num_maps anyway.

Updates the dispatcher, drive.ini (validation_replay -> triage_html), the ad-hoc CLI, scripts/eval/*, and docs (with a Render backends table).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
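As an illustration of why a single enum is less error-prone than a boolean that overrides another setting, the three backend values could be modelled as below. `RenderBackend` and `parse_render_backend` are hypothetical names, and rejecting the retired "html" value with an error is an assumption of this sketch, not something the commit states.

```python
from enum import Enum


class RenderBackend(Enum):
    EGL = "egl"                  # mp4 video
    TRIAGE_HTML = "triage_html"  # scene + per-episode metrics
    OBS_HTML = "obs_html"        # interactive scene + agent observation


def parse_render_backend(value):
    """Map a config string to a backend (sketch)."""
    if value == "html":
        # Assumed behaviour: point users of the old name at the new one.
        raise ValueError('render_backend "html" was renamed to "triage_html"')
    # Enum lookup by value raises ValueError for any unknown string.
    return RenderBackend(value)
```

With one field there is exactly one source of truth for the backend, so no second flag can silently win.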

Copilot AI review requested due to automatic review settings May 21, 2026 21:33

Copilot started reviewing on behalf of eugenevinitsky May 21, 2026 21:33

Copilot AI reviewed May 21, 2026


eval: ruff format (collapse coverage print to one line)

944cb3d

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

eugenevinitsky force-pushed the ev/eval-csv-coverage branch from 925b7c8 to 944cb3d Compare May 21, 2026 21:48

eugenevinitsky changed the title ~~eval: port per-episode CSV + coverage into the Evaluator pipeline~~ eval: run the multi-scenario eval via the unified Evaluator pipeline May 21, 2026

Eugene Vinitsky and others added 15 commits May 21, 2026 17:15

eval: run validation evaluators every 250 epochs

78e9930

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

viz: make the Scenario Info panel click-to-minimize

f742534

Click the panel header to collapse it to just the title (chevron flips), click again to expand. Keeps the replay view unobstructed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

viz: collapse the Scenario Info panel by default

7aea57d

Start the panel minimized (title + chevron only); click to expand. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

viz: select the SDC by default in the replay viewer

4c1b65b

On load, follow the controlled agent the observations were recorded for (first id in ALL_OBS) so the camera centers on the SDC and its obs view opens without a manual click. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

eval: ruff format wrap long lines in _render_pass_obs

2437950

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

eugenevinitsky changed the title ~~eval: run the multi-scenario eval via the unified Evaluator pipeline~~ eval: consolidate onto the unified Evaluator pipeline (+ viewer/docs) May 22, 2026

Eugene Vinitsky and others added 2 commits May 22, 2026 01:40

eugenevinitsky wants to merge 20 commits into vcha/update from ev/eval-csv-coverage
