eval: consolidate onto the unified Evaluator pipeline (+ viewer/docs) #437

Open
eugenevinitsky wants to merge 20 commits into vcha/update from ev/eval-csv-coverage

@eugenevinitsky eugenevinitsky commented May 21, 2026

Consolidates all evaluation onto the unified Evaluator/EvalManager pipeline: the validation evaluators now reproduce the standalone eval_multi_scenarios setup, the useful bits of that path are folded into the pipeline, the duplicate code is removed, and standalone-checkpoint eval is made robust. Plus replay-viewer UX and an evaluation doc.

Verified end-to-end on the cluster with real checkpoints (salade, tomate).

1. Validation evaluators reproduce the multi-scenario eval

[eval.validation_replay] / [eval.validation_gigaflow] (via a shared validation_defaults template) now carry the env + reward config that build_eval_overrides used — replay over the nuplan bins (control_sdc_only) and gigaflow over the carla maps, fixed eval reward weights, traffic_light_behavior=0, etc.

2. Per-episode CSV + coverage (opt-in)

  • eval.export_episode_csv — one row per finished episode → episode_metrics/<name>_epoch{E}_step{N}.csv, with map_name/scenario_id identity (G1: attached to the C CompletedEpisodeSummary).
  • eval.verify_coverage → coverage_expected/found/unique_maps/complete folded into metrics, plus missing/duplicate logging (G2: per-map). Completeness compares unique maps for unique-scenario sweeps and episode count when maps cycle.
  • The env's completed_episode queue is separate from the my_log accumulator, so enabling this is additive and leaves the default metric path unchanged.
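
A minimal sketch of the per-episode export described in this section, assuming each completed-episode summary arrives as a plain dict that carries `map_name` and `scenario_id`. The helper name `write_episode_csv`, the `offroad` column, and the sample values are hypothetical; only the output path pattern is taken from this PR.

```python
import csv
import tempfile
from pathlib import Path


def write_episode_csv(episodes, out_dir, name, epoch, step):
    """Write one CSV row per finished episode and return the file path.

    Hypothetical helper: `episodes` is assumed to be a list of dicts.
    """
    out_path = Path(out_dir) / "episode_metrics" / f"{name}_epoch{epoch}_step{step}.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # Use the union of keys so a summary with an extra field does not break the writer.
    fieldnames = sorted({key for ep in episodes for key in ep})
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(episodes)
    return out_path


# Illustrative episode summaries (made-up values).
episodes = [
    {"map_name": "map_000.bin", "scenario_id": "a", "offroad": 0.0},
    {"map_name": "map_001.bin", "scenario_id": "b", "offroad": 1.0},
]
with tempfile.TemporaryDirectory() as tmp:
    csv_path = write_episode_csv(episodes, tmp, "validation_replay", epoch=25, step=1000)
    csv_name = csv_path.name
    csv_lines = csv_path.read_text().splitlines()
```

Keeping this writer fed from a separate list of episode dicts mirrors the point above: the aggregate metric stream is never touched, so turning the export on or off should not change the default metrics.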

3. Robust standalone-checkpoint eval

  • G3: puffer eval --load-model-path <ckpt> reconstructs the checkpoint's full observation contract from its sibling config.yaml: network arch (policy/rnn), obs/action layout (token counts, target type, reward_conditioning, …), and obs normalization / spatial extent (max_position, agent_obs_max_dist, road distances, …). Without this, the env feeds a mis-scaled/cropped observation and the policy drives offroad on identical maps.
  • G4: puffer eval accepts --eval_simulation / --num_scenarios / --render / --render_obs / --num_carla_maps (override-only-when-passed); scripts/eval/* rewired to puffer eval.

4. Observation render (G5)

eval.render_obs → CPU-only interactive HTML per scenario via pufferlib.viz (scene + each agent's unpacked obs) + a gallery index.html. No EGL.

5. Consolidation

Removed the superseded standalone path — eval_multi_scenarios / _render and helpers (build_eval_overrides, load_eval_multi_scenarios_config, _export_metrics, _log_eval_metrics, verify_scenario_coverage[_gigaflow]), the two CLI modes, and the dead [eval] scalar config block. The pipeline is now the only eval system.

6. Correctness fixes found via cluster smokes

  • Stop condition: MultiScenarioEvaluator._should_stop counts completed episodes (1:1 with scenarios) when per-episode collection is on; num_eval_scenarios auto-syncs with num_scenarios for replay so the sweep can't hang or under/over-run; added a stall backstop.
  • HTML render path: basename map_name (an absolute path was making out_dir / stem escape the render dir, writing files next to the source bins).

7. Replay viewer (viz.py)

Scenario Info panel is click-to-minimize and collapsed by default; the SDC (the controlled agent the obs were recorded for) is selected by default so the camera centers on it and its obs view opens on load.

8. Misc

  • weigths/ → weights/ rename.
  • docs/evaluation.md (config schema, running inline/standalone/ad-hoc, outputs, built-in evaluators); README Eval section rewired; docs/*.md un-ignored.

Verification

On the cluster: validation_replay/validation_gigaflow run, terminate, and produce the CSV + coverage (e.g. 32/32 unique maps, complete). A real checkpoint (salade) evals correctly only once G3 reconstructs its obs normalization — offroad drops 0.64 → 0.002, return −5.55 → +8.36 — confirming the obs-contract merge. Both render backends produce correct output.

🤖 Generated with Claude Code

Port the two useful capabilities from vcha's standalone eval_multi_scenarios
into the unified Evaluator/EvalManager pipeline as opt-in features, rather
than carrying a second eval system:

- Per-episode metrics CSV: when eval.export_episode_csv is set, the rollout
  collects completed_episode summaries (separate from the my_log aggregate
  stream that drives the weighted-mean metrics) and writes one row per
  finished episode to episode_metrics/<name>_epoch{E}_step{N}.csv.
- Scenario coverage: when eval.verify_coverage is set, report expected vs.
  evaluated episode counts (folded into metrics as coverage_*), plus
  duplicate detection when the env tags episodes with a scenario identity.

The manager turns on emit_completed_episodes for evaluators that opt in;
this is purely additive and leaves the existing metric path and _should_stop
untouched. Enabled on the multi_scenario evaluators (validation_replay,
validation_gigaflow).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 21, 2026 21:33

Copilot AI left a comment

Pull request overview

This PR ports two evaluation features (per-episode CSV export and scenario coverage verification) from a standalone evaluator flow into the unified EvalManager/Evaluator pipeline, keeping evaluation logic centralized while making the new behavior opt-in per evaluator.

Changes:

  • Enable env.emit_completed_episodes automatically for evaluators that set eval.export_episode_csv and/or eval.verify_coverage.
  • Extend the base evaluator rollout to collect completed_episode summaries separately, optionally write per-episode CSVs, and report coverage scalars into the evaluator metrics.
  • Update drive.ini to enable these features for the validation evaluators and repoint gigaflow validation env.map_dir to .../binaries/carla.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pufferlib/ocean/benchmark/manager.py | Turns on emit_completed_episodes when CSV/coverage features are enabled for an evaluator. |
| pufferlib/ocean/benchmark/evaluators/base.py | Collects per-episode summaries, exports CSV, computes coverage metrics, and avoids polluting the default aggregated metric stream. |
| pufferlib/config/ocean/drive.ini | Enables episode CSV + coverage for validation evaluators; updates gigaflow validation map_dir. |

Comment on lines +318 to +322
"expected": expected,
"found": found,
"complete": found >= expected,
"duplicates": duplicates,
}
render_views = ["sim_state", "bev"]
env.simulation_mode = "gigaflow"
- env.map_dir = "pufferlib/resources/drive/binaries/carla_py123d"
+ env.map_dir = "pufferlib/resources/drive/binaries/carla"
Comment on lines +237 to +245
def _maybe_export_episodes(self, args, metrics) -> None:
"""Write a per-episode metrics CSV and/or a scenario-coverage report.

Both are off by default and enabled per-evaluator via the
[eval.<name>] section:

eval.export_episode_csv = true
eval.verify_coverage = true

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@eugenevinitsky eugenevinitsky force-pushed the ev/eval-csv-coverage branch from 925b7c8 to 944cb3d May 21, 2026 21:48
Rework [eval.validation_replay] / [eval.validation_gigaflow] so the unified
Evaluator pipeline runs the same eval that the standalone eval_multi_scenarios
path configured via build_eval_overrides:

- New [eval.validation_defaults] template carries the shared clean-eval env +
  fixed eval reward weights (collision/offroad 3.0, goal 1.0, ...), eval_mode,
  termination_mode=0, reward_randomization off, target_type=static,
  traffic_light_behavior=0 (explicit value wins over the clean macro),
  num_agents=512, and num_scenarios=250.
- validation_replay: replay over the real nuplan bins, num_maps=250,
  max_agents_per_env=64, scenario_length=200, control_sdc_only.
- validation_gigaflow: gigaflow on the carla maps, num_maps=8, 40 agents/env,
  scenario_length=500.

Both run every 25 epochs and emit the per-episode CSV + coverage report.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@eugenevinitsky eugenevinitsky changed the title eval: port per-episode CSV + coverage into the Evaluator pipeline eval: run the multi-scenario eval via the unified Evaluator pipeline May 21, 2026
Eugene Vinitsky and others added 15 commits May 21, 2026 17:15
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Capture the scenario identity into CompletedEpisodeSummary at episode
completion (before c_reset resamples the env slot) and emit it from
my_completed_episode_to_dict, so per-episode consumers can attribute each
summary to its map.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
G2: coverage report now uses the per-episode map_name to report unique maps
and duplicates, not just counts.
G5: add a CPU-only observation render path (eval.render_obs) that writes one
interactive HTML per scenario via pufferlib.viz, including each agent's
unpacked observation, plus a gallery index. Takes precedence over the
egl/html render backends when set.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
G3: puffer eval adopts a checkpoint's policy/rnn architecture from its
sibling config.yaml, so arbitrary checkpoints load without manual
--policy.* overrides.
G4: puffer eval accepts --eval_simulation / --num_scenarios / --render /
--render_obs / --num_carla_maps and applies them to the chosen evaluator
(override-only-when-passed). scripts/eval/* now drive `puffer eval`.
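
One common way to get "override-only-when-passed" behaviour is to give every flag a `None` default and skip unset values when merging. The sketch below assumes that approach; it is not taken from the `puffer eval` source, and `build_parser`, `apply_overrides`, and the config dict are hypothetical names.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="eval-sketch")
    # default=None lets us tell "flag omitted" apart from "flag set to a value".
    parser.add_argument("--eval_simulation", choices=["gigaflow", "replay"], default=None)
    parser.add_argument("--num_scenarios", type=int, default=None)
    return parser


def apply_overrides(eval_cfg, cli_args):
    """Return a copy of eval_cfg with only the explicitly passed flags applied."""
    merged = dict(eval_cfg)
    for key, value in vars(cli_args).items():
        if value is not None:
            merged[key] = value
    return merged


# Made-up evaluator config; only --num_scenarios is passed on the command line.
base_cfg = {"eval_simulation": "replay", "num_scenarios": 250}
args = build_parser().parse_args(["--num_scenarios", "32"])
merged_cfg = apply_overrides(base_cfg, args)
```

Here `eval_simulation` keeps the evaluator's own value because the flag was omitted. Boolean flags need the same care: a `store_true` action defaults to `False`, which would silently override a config value of `True` unless its default is also set to `None`.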

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add docs/evaluation.md (Evaluator/EvalManager pipeline: config schema,
running inline/standalone/ad-hoc, outputs, built-in evaluators) and point
the README Eval section at it. Stop ignoring docs/*.md (keep ignoring the
sphinx _build output).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- coverage_complete compares unique maps to expected for unique-scenario
  sweeps (expected <= num_maps, e.g. replay) and falls back to episode count
  when maps cycle (gigaflow), so a 16-distinct-of-250 sweep no longer reports
  complete.
- obs render filenames end in the numeric index so build_gallery_index matches
  them and writes the gallery.
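
The completeness rule in the first bullet can be sketched as follows. This is an illustration under assumptions, not the code in base.py: `coverage_report` is a hypothetical helper that takes the per-episode map names, and the map names are made up.

```python
from collections import Counter


def coverage_report(map_names, expected, num_maps):
    """Summarise sweep coverage from per-episode map names (hypothetical helper)."""
    counts = Counter(map_names)
    found = len(map_names)
    unique_maps = len(counts)
    duplicates = sorted(name for name, n in counts.items() if n > 1)

    if expected <= num_maps:
        # Unique-scenario sweep (e.g. replay): every expected map must appear.
        complete = unique_maps >= expected
    else:
        # Maps cycle (e.g. gigaflow): repeats are expected, so count episodes.
        complete = found >= expected

    return {
        "coverage_expected": expected,
        "coverage_found": found,
        "coverage_unique_maps": unique_maps,
        "coverage_complete": complete,
        "duplicates": duplicates,
    }


# 250 episodes that only ever visit 16 distinct maps: not a complete sweep.
cycled = [f"map_{i % 16:03d}" for i in range(250)]
replay_report = coverage_report(cycled, expected=250, num_maps=250)

# 40 episodes over 8 cycling maps: complete by episode count.
gigaflow_report = coverage_report([f"town_{i % 8}" for i in range(40)], expected=40, num_maps=8)
```

A plain `found >= expected` check would have marked the first sweep complete, which is the false positive this commit removes.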

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The unified Evaluator/EvalManager pipeline now reproduces the standalone
eval_multi_scenarios path (per-episode CSV, coverage, obs render, the
build_eval_overrides config as [eval.*] sections, ad-hoc CLI, checkpoint arch
merge), so drop the duplicate:

- delete eval_multi_scenarios / eval_multi_scenarios_render and their helpers
  (build_eval_overrides, load_eval_multi_scenarios_config, _export_metrics,
  _log_eval_metrics, verify_scenario_coverage[_gigaflow]) and the two CLI modes.
- drop the dead [eval] scalar config block (never wired into training).
- set validation_replay env.num_eval_scenarios=250 so eval_mode sweeps all 250
  distinct maps instead of the C default of 16.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The replay sweep hung: setting num_eval_scenarios so eval_mode sweeps all
distinct maps means the env stops producing episodes after the sweep, but
_should_stop counted my_log emissions (~one per batch), a target the sweep
never reaches. Fixes:

- MultiScenarioEvaluator._should_stop counts completed episodes (1:1 with
  scenarios) when per-episode collection is on; legacy emission count otherwise.
- env_overrides auto-derives num_eval_scenarios from num_scenarios for replay,
  so they can't drift (incl. ad-hoc --num_scenarios overrides); gigaflow maps
  still cycle. Drops the hard-coded env.num_eval_scenarios from drive.ini.
- base rollout adds a stall backstop: stop after 3x scenario_length steps with
  no new episode, so an exhausted/misconfigured sweep can't spin to timeout.
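
A sketch of the stop logic from the bullets above, assuming a rollout loop that can ask the env how many episodes finished on each step. `run_sweep`, `step_env`, and the fake env are hypothetical stand-ins; only the "3x scenario_length steps with no new episode" rule comes from the commit message.

```python
def run_sweep(step_env, target_episodes, scenario_length, max_steps=100_000):
    """Roll out until enough episodes finish or the env stalls.

    `step_env` is a hypothetical callable returning the number of episodes
    that completed on that step.
    """
    stall_limit = 3 * scenario_length
    completed = 0
    steps_since_episode = 0

    for step in range(1, max_steps + 1):
        finished = step_env()
        if finished > 0:
            completed += finished
            steps_since_episode = 0
        else:
            steps_since_episode += 1

        if completed >= target_episodes:
            return {"completed": completed, "steps": step, "reason": "target_reached"}
        if steps_since_episode >= stall_limit:
            return {"completed": completed, "steps": step, "reason": "stalled"}

    return {"completed": completed, "steps": max_steps, "reason": "max_steps"}


def make_exhausting_env(total_episodes, scenario_length):
    """Fake env: finishes one episode every `scenario_length` steps, then goes quiet."""
    state = {"step": 0, "emitted": 0}

    def step_env():
        state["step"] += 1
        if state["emitted"] < total_episodes and state["step"] % scenario_length == 0:
            state["emitted"] += 1
            return 1
        return 0

    return step_env


# The env only has 4 scenarios but the evaluator asks for 10: the backstop fires.
result = run_sweep(make_exhausting_env(4, scenario_length=5), target_episodes=10, scenario_length=5)
```

With a mismatched target the loop exits on the stall branch instead of running until the step cap, which is the hang described at the top of this commit.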

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Evaluating an arbitrary checkpoint also needs the env to pack observations the
way the policy was trained to read them. Extend _merge_checkpoint_arch to pull
the obs/action-layout env keys (max_*_observations, target_type,
num_target_waypoints, action_type, dynamics_model, traffic_control_scope,
reward_conditioning, trajectory_*) from the checkpoint's config.yaml, alongside
the policy/rnn architecture. Eval-policy env config (sim mode, maps, rewards)
still comes from the [eval.<name>] section.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
….yaml

Evaluating a checkpoint also needs the env to normalize observations and clip
the observed scene exactly as the policy was trained — otherwise positions are
mis-scaled and the scene is cropped, so the policy drives offroad even on the
same maps. Extend _ARCH_ENV_KEYS with the observation normalization scales
(max_position, max_goal_position, max_veh_*, max_road_segment_*,
max_traffic_control_distance), the observation distances (agent_obs_max_dist,
road_obs_front/behind/side_dist), and the target waypoint spacing.
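
The merge can be pictured as a key-filtered copy from the checkpoint's config into the eval env config. This is a sketch under assumptions: `merge_checkpoint_contract` and `OBS_CONTRACT_KEYS` are hypothetical names, the key tuple is only a small subset of the keys named in this PR, and the config values are made up.

```python
# Illustrative subset of the keys named in this PR; the real _ARCH_ENV_KEYS may differ.
OBS_CONTRACT_KEYS = (
    "target_type",
    "reward_conditioning",
    "max_position",
    "agent_obs_max_dist",
)


def merge_checkpoint_contract(eval_env_cfg, checkpoint_env_cfg, keys=OBS_CONTRACT_KEYS):
    """Copy obs-contract keys from the checkpoint config into the eval env config.

    Keys outside `keys` (sim mode, maps, rewards) keep their eval values.
    """
    merged = dict(eval_env_cfg)
    for key in keys:
        if key in checkpoint_env_cfg:
            merged[key] = checkpoint_env_cfg[key]
    return merged


# Made-up values standing in for an [eval.<name>] section and a checkpoint's config.yaml.
eval_env = {"simulation_mode": "replay", "num_maps": 250, "max_position": 100.0}
ckpt_env = {"simulation_mode": "gigaflow", "max_position": 250.0, "agent_obs_max_dist": 60.0}
merged_env = merge_checkpoint_contract(eval_env, ckpt_env)
```

The allow-list is what keeps the two concerns apart: the checkpoint decides how observations are packed and scaled, while the evaluator section still decides what is being evaluated.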

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The completed_episode summary's map_name is the full bin path. The html
render built its output path as out_dir / f"{map_name}_...", and an absolute
map_name makes pathlib discard out_dir — so the HTML files were written next
to the source bins instead of the render dir (and never appeared in the
results). Basename map_name like the obs render already does, and guard the
info loop against non-dict entries.
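
The pathlib behaviour behind this bug is easy to reproduce in isolation. The paths below are made up, and `PurePosixPath` is used only so the example behaves the same on any platform.

```python
from pathlib import PurePosixPath

out_dir = PurePosixPath("renders/validation_replay")
map_name = "/data/nuplan/bins/map_042.bin"  # made-up absolute bin path

# Joining with an absolute path discards everything to its left.
escaped = out_dir / f"{map_name}_ep0.html"

# Taking the stem first keeps the output inside out_dir.
contained = out_dir / f"{PurePosixPath(map_name).stem}_ep0.html"
```

`escaped` ends up under `/data/nuplan/bins/`, next to the source bin, while `contained` stays under the render directory.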

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Click the panel header to collapse it to just the title (chevron flips),
click again to expand. Keeps the replay view unobstructed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Start the panel minimized (title + chevron only); click to expand.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
On load, follow the controlled agent the observations were recorded for
(first id in ALL_OBS) so the camera centers on the SDC and its obs view
opens without a manual click.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@eugenevinitsky eugenevinitsky changed the title eval: run the multi-scenario eval via the unified Evaluator pipeline eval: consolidate onto the unified Evaluator pipeline (+ viewer/docs) May 22, 2026
Eugene Vinitsky and others added 2 commits May 22, 2026 01:40
Spell out that --eval_simulation gigaflow|replay selects the built-in
validation_gigaflow / validation_replay evaluator (the by-name mode names its
own), and that the override flags apply only when passed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two confusing names cleaned up:

- The two HTML renderers + the render_obs boolean are unified into one
  render_backend enum: egl (mp4) | triage_html (scene + per-episode metrics
  for triage, from the captured compact-replay bundle) | obs_html (interactive
  scene + the agent's NN observation). Drops the render_obs boolean that
  silently overrode render_backend, and renames the old "html" value to
  triage_html. CLI: --render-backend replaces --render_obs.
- --num_carla_maps -> --num_maps: eval isn't CARLA-only (replay uses nuPlan
  bins), and it just sets env.num_maps anyway.

Updates the dispatcher, drive.ini (validation_replay -> triage_html), the
ad-hoc CLI, scripts/eval/*, and docs (with a Render backends table).
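
A sketch of what a single render_backend enum could look like, using the three values listed above. The class and function names are hypothetical, and rejecting unknown strings outright is an assumption; the actual dispatcher may handle them differently.

```python
from enum import Enum


class RenderBackend(Enum):
    EGL = "egl"                  # mp4 video
    TRIAGE_HTML = "triage_html"  # scene + per-episode metrics
    OBS_HTML = "obs_html"        # interactive scene + the agent's NN observation


def parse_render_backend(value):
    """Validate a render_backend string, rejecting anything outside the enum."""
    try:
        return RenderBackend(value)
    except ValueError:
        valid = ", ".join(b.value for b in RenderBackend)
        raise ValueError(f"unknown render_backend {value!r}; expected one of: {valid}") from None


backend = parse_render_backend("obs_html")
```

With one enum there is no second boolean that can silently win over it, which is the confusion this commit removes.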

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>