feat(wan22): add WAN 2.2 text-to-video adapter and dataset for MLPerf inference #293
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces support for the WAN2.2 MLPerf text-to-video benchmark, including a new adapter, dataset loader, and associated Pydantic models. It also adds comprehensive documentation and an example configuration for running benchmarks on Lyris. The review feedback identifies several critical omissions and inconsistencies: the VideoPathRequest and Wan22Dataset are missing the latent_path field required for MLPerf reproducibility, and there is a mismatch between the adapter implementation and unit tests regarding the response_format and handling of VideoPayloadResponse. Additionally, the feedback suggests using None as a default for negative_prompt to allow server-side defaults and injecting the canonical MLPerf negative prompt into the dataset.
b0878b5 to
3230fec
Compare
…er, Wan22Dataset→VideoGenDataset
- Rename src/inference_endpoint/wan22/ → videogen/
- Rename tests/unit/wan22/ → tests/unit/videogen/
- Rename tests/integration/wan22/ → tests/integration/videogen/
- APIType.WAN22 → APIType.VIDEOGEN ("wan22" → "videogen")
- Wan22Adapter → VideoGenAdapter
- Wan22Dataset → VideoGenDataset
- Wan22Accumulator → VideoGenAccumulator
- Update all imports, maps, __all__, tests, docs, and example yaml
- Keep dataset_id="wan22_mlperf" and model_params.name="wan22" (MLPerf identifiers)
…bytes) encode_query: response_format defaults to "video_bytes" but can be overridden via query.data["response_format"] = "video_path" for Lustre-path mode. decode_response: dispatches on response shape — "video_bytes" key → VideoPayloadResponse, otherwise → VideoPathResponse.
arekay-nv
left a comment
Made a quick pass - will make another later.
Pull request overview
Adds WAN 2.2 text-to-video (trtllm-serve) support to the inference-endpoint client by introducing a new videogen module (adapter + wire types + dataset) and wiring it into the existing endpoint client + dataset loader factory.
Changes:
- Introduce `APIType.VIDEOGEN` with `VideoGenAdapter` / `VideoGenAccumulator` and Pydantic request/response wire models for `POST /v1/videos/generations`.
- Add `VideoGenDataset` registered as predefined dataset `wan22_mlperf`, plus unit/integration tests and example offline benchmark config/scripts.
- Update dataset factory + templates/docs to recognize the new workload.
Reviewed changes
Copilot reviewed 30 out of 32 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| uv.lock | Bumps transformers and adds videogen extras to lock metadata. |
| pyproject.toml | Adds a videogen optional dependency group (currently empty). |
| src/inference_endpoint/videogen/types.py | Adds Pydantic wire models for video generation request/response. |
| src/inference_endpoint/videogen/adapter.py | Adds HTTP adapter + no-op accumulator for non-streaming video endpoint. |
| src/inference_endpoint/videogen/dataset.py | Adds predefined dataset wan22_mlperf (prompt text file loader). |
| src/inference_endpoint/videogen/__init__.py | Exposes videogen public API symbols. |
| src/inference_endpoint/core/types.py | Registers APIType.VIDEOGEN and default route /v1/videos/generations. |
| src/inference_endpoint/endpoint_client/config.py | Wires adapter/accumulator into ADAPTER_MAP/ACCUMULATOR_MAP. |
| src/inference_endpoint/dataset_manager/factory.py | Modifies predefined dataset loader invocation to pass path=. |
| src/inference_endpoint/dataset_manager/__init__.py | Imports VideoGenDataset so it registers into Dataset.PREDEFINED. |
| src/inference_endpoint/dataset_manager/dataset.py | Adds mypy ignore on datasets imports. |
| src/inference_endpoint/dataset_manager/predefined/shopify_product_catalogue/__init__.py | Adds mypy ignore on datasets import. |
| src/inference_endpoint/evaluation/livecodebench/generate.py | Adds mypy ignore on datasets import. |
| src/inference_endpoint/config/templates/online_template_full.yaml | Updates documented api_type options to include videogen. |
| src/inference_endpoint/config/templates/offline_template_full.yaml | Updates documented api_type options to include videogen. |
| src/inference_endpoint/config/templates/concurrency_template_full.yaml | Updates documented api_type options to include videogen. |
| tests/unit/videogen/test_types.py | Unit tests for videogen Pydantic wire models. |
| tests/unit/videogen/test_adapter.py | Unit tests for adapter encode/decode behavior and accumulator contract. |
| tests/unit/videogen/test_dataset.py | Unit tests for dataset loading + sample shaping. |
| tests/unit/videogen/test_factory.py | Unit tests asserting factory can create the videogen predefined dataset. |
| tests/unit/videogen/test_registration.py | Unit tests for enum + adapter/accumulator registration. |
| tests/unit/videogen/test_init.py | Unit tests for videogen module public exports. |
| tests/unit/videogen/__init__.py | Test package init (licensing header). |
| tests/integration/videogen/conftest.py | Adds aiohttp mock trtllm-serve fixtures for adapter integration tests. |
| tests/integration/videogen/test_adapter.py | Integration tests for encode→POST→decode round-trip and error cases. |
| tests/integration/videogen/__init__.py | Test package init (licensing header). |
| examples/09_Wan22_VideoGen_Example/offline_wan22.yaml | Example offline benchmark config targeting videogen endpoint. |
| examples/09_Wan22_VideoGen_Example/setup_and_test.sh | Example script to set up venv and run videogen tests. |
| examples/09_Wan22_VideoGen_Example/wan22_prompts.jsonl | Bundled 248-prompt dataset artifact for the example benchmark. |
| endpoints_changed.md | Design summary / documentation for WAN2.2 videogen integration. |
| AGENTS.md | Adds videogen module to repo architecture documentation. |
| .gitignore | Ignores .worktrees/. |
After rebasing onto origin/main, these files needed updates from:
- ruff lint/format (test imports, line wrapping)
- prettier (markdown table formatting, YAML alignment)
- regenerate-templates (api_type docstring now lists "videogen")
- uv lock refresh (pyproject.toml now has "videogen" optional group)
Also tightens an over-broad pytest.raises(Exception) in the videogen integration tests to (ValidationError, json.JSONDecodeError) for the 500-response case and json.JSONDecodeError for the malformed-JSON case (B017).
…t_path
Address MR review feedback (Task 3 of wan22-trtllm-plan.md):
types.py:
- VideoPathRequest.negative_prompt: str = "" -> str | None = None
- Add VideoPathRequest.latent_path: str | None = None field (per-request fixed latent tensor path for MLPerf reproducibility)
adapter.py:
- encode_query: read negative_prompt and latent_path with .get() (no default), and serialise with model_dump_json(exclude_none=True) so optional fields fall back to server-side defaults when absent.
dataset.py:
- Add _MLPERF_NEGATIVE_PROMPT module constant (canonical MLPerf string).
- VideoGenDataset injects this negative prompt into every sample by default; pass negative_prompt=None to omit. Accepts latent_path as a per-dataset config so all samples share the same fixed latent.
- load() conditionally includes negative_prompt and latent_path in each sample dict only when set, so adapter exclude_none does the right thing end-to-end.
Tests:
- Update test_types defaults (negative_prompt None, latent_path None)
- Update test_dataset for the canonical negative-prompt default, add coverage for negative_prompt=None and latent_path propagation.
- Add adapter tests for exclude_none behaviour and latent_path forwarding.
Note for reviewer: the other two review comments (response_format hardcoded to video_path; decode_response only handling VideoPathResponse) are stale; both were addressed in 7a1b4d3 "fix: make response_format optional in VideoGenAdapter (default video_bytes)". Current adapter already defaults response_format to video_bytes and dispatches in decode_response on whether "video_bytes" is present in the response.
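The `exclude_none` behaviour described in this commit can be sketched as follows. This is a minimal illustration, assuming Pydantic v2; `VideoRequestSketch` is a hypothetical three-field stand-in for `VideoPathRequest`, not the real model, and the prompt strings and latent path are made up.

```python
import json
from typing import Optional

from pydantic import BaseModel


class VideoRequestSketch(BaseModel):
    """Hypothetical stand-in for VideoPathRequest; only three fields shown."""

    prompt: str
    negative_prompt: Optional[str] = None
    latent_path: Optional[str] = None


def encode(data: dict) -> str:
    # .get() with no default yields None for absent keys; exclude_none=True
    # then drops those keys from the JSON body entirely.
    request = VideoRequestSketch(
        prompt=data["prompt"],
        negative_prompt=data.get("negative_prompt"),
        latent_path=data.get("latent_path"),
    )
    return request.model_dump_json(exclude_none=True)


# Only the prompt is sent, so the server is free to apply its own defaults.
body_minimal = encode({"prompt": "a red fox running through snow"})

# Fields that are set are forwarded unchanged.
body_full = encode(
    {"prompt": "a red fox", "negative_prompt": "blurry", "latent_path": "/tmp/latent.pt"}
)

print(json.loads(body_minimal))
print(json.loads(body_full))
```

Omitting the key, rather than sending `null` or an empty string, is what lets the server distinguish "not specified" from "explicitly empty".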
Aligns three previously-inconsistent statements about the request default:
- Adapter `encode_query` previously fell back to "video_bytes" when
query.data did not specify a response_format, but the Pydantic field
default on VideoPathRequest was "video_path" — the latter was dead
because the adapter always supplied a value.
- Dataset docstring claimed "always requests video_bytes"; types
docstring described a perf/accuracy split.
Pick the perf/accuracy split (Option A): default = video_path (perf),
opt-in to video_bytes via query.data["response_format"] (accuracy).
- adapter.py: flip `data.get("response_format", ...)` default to
"video_path"; rewrite class + encode_query docstrings to match.
- dataset.py: drop the "always requests video_bytes" line.
- test_adapter.py (unit + integration): split the old
test_encode_query_always_requests_video_bytes test into
default-is-video_path + accuracy-mode-override tests.
Also rewrite endpoints_changed.md:
- Replace the "always video_path" framing with the dual-mode reality.
- Document VideoPayloadResponse and the decode_response shape dispatch.
- Fix the payload-size claim (300 MB -> 3-5 MB; 300 MB was raw uncompressed).
- Drop stale "Pending" tasks 2 (latent_path -- already wired) and 3
(negative_prompt None -- already done).
- Update module name `wan22` -> `videogen` and `api_type` example.
Verified on aarch64 GB200: 58 unit+integration videogen tests pass.
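The perf/accuracy split chosen in this commit amounts to a one-line default lookup. The sketch below is an illustration only: `resolve_response_format` and `DEFAULT_RESPONSE_FORMAT` are hypothetical names, and the real adapter does more than this.

```python
DEFAULT_RESPONSE_FORMAT = "video_path"  # perf mode is the default (Option A)


def resolve_response_format(data: dict) -> str:
    """Return the response_format for one query's data dict (hypothetical helper)."""
    # Accuracy runs opt in by placing response_format="video_bytes" in query.data.
    return data.get("response_format", DEFAULT_RESPONSE_FORMAT)


perf = resolve_response_format({"prompt": "a red fox"})
accuracy = resolve_response_format({"prompt": "a red fox", "response_format": "video_bytes"})
print(perf, accuracy)
```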
…wan22 refs
Move datasets/wan22_prompts.jsonl into the example folder so the example
is self-contained and drop the absolute Lustre path baked into the setup
script.
setup_and_test.sh:
- Remove PROMPTS_TXT (hardcoded /lustre/share/... path) and the entire
prompts.txt -> JSONL conversion block. The JSONL is now bundled with
the example, so regeneration from a Lustre source is no longer needed.
- Retarget PROMPTS_JSONL to ${SCRIPT_DIR}/wan22_prompts.jsonl.
- Drop the now-orphaned PYTHON variable (only used by the conversion
heredoc).
- Fix stale post-rename references that were left over from
ddac990 (wan22 -> videogen): pip extras [wan22,test] -> [videogen,test]
and test paths tests/unit/wan22 / tests/integration/wan22 ->
tests/unit/videogen / tests/integration/videogen. Without these the
script failed on a fresh setup (no [wan22] extra) and collected zero
tests.
offline_wan22.yaml: dataset path -> examples/09_Wan22_VideoGen_Example/wan22_prompts.jsonl.
endpoints_changed.md: update bundled-dataset path reference.
…ader
VideoGenDataset duplicated functionality the generic JsonlLoader already
provides, was bugged on JSONL input (read each line as a raw text prompt
instead of parsing JSON), and wasn't actually invoked by the example
config: offline_wan22.yaml uses `name: wan22_prompts`, which doesn't
match its dataset_id (`wan22_mlperf`), so DataLoaderFactory already
routed to JsonlLoader. The class was dead code in the only path the
example exercises.
Bake the MLPerf canonical negative_prompt into every row of the bundled
JSONL so runtime injection is unnecessary, then delete the workload-
specific dataset class.
- examples/09_Wan22_VideoGen_Example/wan22_prompts.jsonl: replace
empty negative_prompt with canonical MLPerf string in all 248 rows.
- src/inference_endpoint/videogen/dataset.py: deleted.
- src/inference_endpoint/dataset_manager/__init__.py: drop the
side-effect VideoGenDataset import and __all__ entry.
- tests/unit/videogen/test_dataset.py, test_factory.py: deleted.
- src/inference_endpoint/videogen/types.py: update negative_prompt
field docstring to point at the bundled JSONL instead of
VideoGenDataset.
- examples/09_Wan22_VideoGen_Example/offline_wan22.yaml: drop
`format: jsonl` (the factory tries DatasetFormat("jsonl") and
crashes because enum values are `.jsonl`; auto-detection from the
path extension works), and update the comment block.
- endpoints_changed.md: replace the dataset.py / VideoGenDataset
section with a brief note about the bundled JSONL + JsonlLoader.
Verified on aarch64 GB200:
- pre-commit run --all-files: all hooks pass
- 42 videogen tests pass (down from 58: 14 dataset tests + 3 factory
tests removed; adapter/types/init/registration tests retained)
- end-to-end smoke: DataLoaderFactory creates a Dataset with 248
samples, each carrying prompt + canonical negative_prompt;
VideoGenAdapter.encode_query produces a valid request with
response_format=video_path.
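The `format: jsonl` crash mentioned in this commit is ordinary `Enum` value lookup. A minimal sketch, assuming (as the commit states) that the enum values carry a leading dot; `DatasetFormatSketch` and `detect_format` are hypothetical stand-ins, not the project's code.

```python
from enum import Enum
from pathlib import Path


class DatasetFormatSketch(Enum):
    """Hypothetical stand-in: values include the leading dot."""

    JSONL = ".jsonl"


def detect_format(path: str) -> DatasetFormatSketch:
    # Path.suffix keeps the leading dot, so it matches the enum value directly.
    return DatasetFormatSketch(Path(path).suffix)


try:
    DatasetFormatSketch("jsonl")  # what an explicit `format: jsonl` amounts to
    explicit_ok = True
except ValueError:
    explicit_ok = False  # "jsonl" is not a valid enum value; ".jsonl" is

detected = detect_format("examples/09_Wan22_VideoGen_Example/wan22_prompts.jsonl")
print(explicit_ok, detected)
```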
The `metrics:` top-level block was rejected by `BenchmarkConfig` (extra="forbid", no `metrics` field in the schema), so loading the YAML via `BenchmarkConfig.from_yaml_file()` failed validation. The block had no effect on metrics collection, which is controlled by `settings.runtime` and the metrics aggregator service. Caught by an end-to-end functional smoke test that loads the YAML, runs encode → POST → decode against an inline mock trtllm-serve in both perf (video_path) and accuracy (video_bytes) modes, and bulk-encodes all 248 samples.
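For readers unfamiliar with `extra="forbid"`, the rejection works roughly like this. A minimal sketch assuming Pydantic v2; `ConfigSketch` and its single `name` field are hypothetical, and the real `BenchmarkConfig` schema is much larger.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class ConfigSketch(BaseModel):
    """Hypothetical stand-in for BenchmarkConfig."""

    model_config = ConfigDict(extra="forbid")

    name: str


def try_load(raw: dict) -> str:
    """Return "ok" or the type of the first validation error."""
    try:
        ConfigSketch.model_validate(raw)
        return "ok"
    except ValidationError as exc:
        return exc.errors()[0]["type"]


accepted = try_load({"name": "offline_wan22"})
# An unknown top-level key such as `metrics` fails validation outright.
rejected = try_load({"name": "offline_wan22", "metrics": {"enabled": True}})
print(accepted, rejected)
```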
… simplify adapter, broaden tests
Five small fixes from a pre-review pass:
- factory.py: drop `path=dataset_path` kwarg on the predefined-dataset branch. It was added in 9ded0e6 to feed VideoGenDataset (since deleted in 6ce6bfa); none of the remaining predefined datasets' generate() signatures accept `path`, so any user passing both `name=<predefined>` and `path=<file>` would hit a TypeError. Restores the pre-9ded0e6 behavior. Verified the videogen example still loads end-to-end (it routes through Dataset.load_from_file, not the predefined branch).
- adapter.py encode_query: replace the 12-line data.get() boilerplate with `VideoPathRequest.model_validate({k: data[k] for k in known if k in data})`. Pydantic applies defaults, eliminating the drift risk between adapter.py and types.py. Extra keys in query.data (sample_id, sample_index, mode, ...) are now ignored cleanly.
- integration mock: MockTrtllmServe._handle_sync now honors body["response_format"] and routes to a VideoPathResponse when the request asks for video_path. Previously the mock always returned video_bytes regardless of the request, so the integration tests never exercised the perf-mode decode branch end-to-end.
- integration tests: add test_perf_mode_round_trip_returns_video_path asserting result.metadata == {video_path: ...} via real HTTP. Rename its accuracy-mode counterpart to test_accuracy_mode_round_trip_returns_video_bytes. Replace the misleadingly named test_missing_video_bytes_field_raises_validation_error with two targeted tests, one per decode dispatch branch.
- endpoints_changed.md: drop two stale references: `HealthResponse` (never existed in types.py) and `APIType.WAN22` (the actual enum is APIType.VIDEOGEN). Trim the doc from 155 → 84 lines (~46% reduction) by collapsing the architecture diagram and removing repetition; factual content is unchanged.
Verified on aarch64 GB200:
- pre-commit run --all-files: all hooks pass
- 43 videogen tests pass (one new perf-mode round-trip test added)
- end-to-end smoke (load YAML → factory → JsonlLoader → adapter → mock server) passes both perf and accuracy modes; encode_query ignores extra keys cleanly
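The `model_validate` simplification in the adapter.py item above can be sketched like this. It assumes Pydantic v2; `VideoRequestSketch` is a hypothetical stand-in for `VideoPathRequest`, and deriving `known` from `model_fields` is an assumption, since the commit does not say where `known` comes from.

```python
from typing import Optional

from pydantic import BaseModel


class VideoRequestSketch(BaseModel):
    """Hypothetical stand-in for VideoPathRequest."""

    prompt: str
    negative_prompt: Optional[str] = None
    response_format: str = "video_path"


def encode_query(data: dict) -> VideoRequestSketch:
    # Assumption: `known` is the set of declared field names on the model.
    known = VideoRequestSketch.model_fields
    # Forward only known keys that are present; Pydantic fills in the defaults,
    # so the defaults live in exactly one place (the model).
    return VideoRequestSketch.model_validate({k: data[k] for k in known if k in data})


request = encode_query({"prompt": "a red fox", "sample_id": 7, "mode": "perf"})
print(request.model_dump())
```

Because bookkeeping keys such as `sample_id` never reach the model, they cannot leak into the request body.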
- AGENTS.md: drop stale HealthResponse, removed VideoGenDataset row; rephrase Key Components blurb to be model-agnostic.
- Delete endpoints_changed.md per author note (kept internally).
- schema.py / probe.py: api_type help now lists videogen alongside openai and sglang; regenerated full-template YAMLs accordingly.
- offline_wan22.yaml: rewrite the comment block to match the bundled JSONL contents; drop misleading min_duration_ms warm-up annotation.
- adapter.py: clarify that exclude_none falls back only when the query.data value is None, not unconditionally.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Replace WAN 2.2 references in module docstrings, error messages, and comments with model-agnostic wording (the adapter can serve other video-generation models behind the same trtllm-serve route).
- Trim wan22_prompts.jsonl to prompt + canonical MLPerf negative_prompt; drop unused mode/sample_id/sample_index columns the loader synthesises.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tray type-ignores
- conftest.py: replace hardcoded /lustre/.../mock_video_001.mp4 with a tmp_path_factory-driven fixture; thread it through the integration test.
- pyproject.toml: drop the empty videogen optional-dependencies extra; uv.lock regenerated to match.
- Revert three unrelated # type: ignore[attr-defined] additions on the datasets imports (dataset.py, shopify, livecodebench), which are out of scope.
- setup_and_test.sh: replace the test-runner-only script with a brief end-to-end runbook (HF download, server launch hint, benchmark).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…use it
- Extract route registration in EchoServer into _register_routes(app) so subclasses can swap the OpenAI-shaped routes for a different wire contract while reusing the background-thread aiohttp lifecycle.
- Convert MockTrtllmServe and MockTrtllmServeError into EchoServer subclasses; drops ~80 lines of duplicated start/stop/port plumbing from tests/integration/videogen/conftest.py.
Addresses arekay-nv's review comment on conftest.py asking whether the existing echo server can be reused with a video-gen route.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
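The `_register_routes` hook described in this commit follows the template-method pattern. The sketch below is a dependency-free illustration only: the real EchoServer is aiohttp-based and runs in a background thread, whereas here a plain dict stands in for the router, and every class name, route handler, and payload is hypothetical.

```python
from typing import Callable, Dict

Handler = Callable[[dict], dict]


class EchoServerSketch:
    """Stand-in base class: owns shared setup, delegates routes to a hook."""

    def __init__(self) -> None:
        self.routes: Dict[str, Handler] = {}
        self._register_routes(self.routes)  # subclasses override this hook

    def _register_routes(self, routes: Dict[str, Handler]) -> None:
        # Assumed default: an OpenAI-shaped route that echoes the request body.
        routes["/v1/chat/completions"] = lambda body: {"echo": body}

    def handle(self, path: str, body: dict) -> dict:
        return self.routes[path](body)


class MockVideoServerSketch(EchoServerSketch):
    """Swaps in a video-generation route while reusing the base class plumbing."""

    def _register_routes(self, routes: Dict[str, Handler]) -> None:
        routes["/v1/videos/generations"] = self._generate

    def _generate(self, body: dict) -> dict:
        # Honour the requested response_format, as the integration mock does.
        if body.get("response_format") == "video_path":
            return {"video_id": "mock", "video_path": "/tmp/mock.mp4"}
        return {"video_id": "mock", "video_bytes": "AAAA"}


server = MockVideoServerSketch()
perf_reply = server.handle("/v1/videos/generations", {"response_format": "video_path"})
print(perf_reply)
```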
arekay-nv
left a comment
Review Council — Multi-AI Code Review
Reviewed by: Claude (Codex run failed: workspace-managed policies error during cloud-requirements load).
Depth: thorough (PR is +1244/-24 across 25 files).
Found 12 issues — 5 medium, 7 low. No critical/high. See inline comments for details and the summary table in the follow-up comment.
Review Council: Multi-AI Code Review
Reviewed by: Claude | Depth: thorough
Codex review failed at the cloud-requirements / workspace-managed-policies load step; falling back to Claude-only.
Found 12 issues across 6 files.
🟡 Should Fix (medium)
Real issues that trigger under specific conditions or design flaws that will compound.
🔵 Consider (low)
Valid improvements that could be follow-ups.
Note on duplicates: Two candidate issues were dropped during dedupe — one on |
arekay-nv
left a comment
Looks good overall, minor issues as identified by review-council.
Minor issues can be addressed in followup.
…ming
Addresses arekay-nv [Claude] review batch on PR mlcommons#293.
- adapter.py:
  * dataset_transforms now ships a ColumnFilter (required: prompt, optional: VideoPathRequest fields). This satisfies the abstract-base contract and turns typo'd column names into hard errors at dataset load time instead of silent server-side fallbacks.
  * decode_response dispatches on `isinstance(raw.get('video_bytes'), str)` rather than key presence, so a server reply with `video_bytes: null` falls through to the video_path branch instead of crashing VideoPayloadResponse validation.
  * encode_query rejects `stream=True` upfront with a clear ValueError; VideoGenAccumulator.get_final_output now raises instead of returning an empty QueryResult, since the worker SSE path swallows NotImplementedError from decode_sse_message and would otherwise silently report zero-output queries as successful.
  * Class docstring rewritten: adapter does not auto-derive response_format from BenchmarkConfig.benchmark_mode; callers must inject `response_format=video_bytes` via the dataset to opt in.
- types.py: VideoPathResponse and VideoPayloadResponse now use `extra='forbid'` so server shape drift (e.g. both video_path and video_bytes populated, or unknown fields) fails loud at deserialise time instead of silently picking the wrong branch.
- offline_wan22.yaml: max_new_tokens 0 → 1 with a comment explaining the field is ignored by VideoGenAdapter; copying the YAML and switching api_type to openai/sglang for debugging no longer yields a 400 from a zero-completion-token request.
- tests/unit/videogen/test_adapter.py: update locked-in assertions for the new contract (ColumnFilter present, accumulator raises) and add coverage for the stream=True rejection.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
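The difference between key-presence and type-based dispatch in `decode_response` shows up on a reply with `video_bytes: null`. A minimal sketch; the two helper functions and the sample reply are hypothetical, and only the `isinstance` expression is taken from the commit.

```python
def branch_by_key_presence(raw: dict) -> str:
    # Old behaviour: any reply containing the key goes to the bytes branch.
    return "video_bytes" if "video_bytes" in raw else "video_path"


def branch_by_type(raw: dict) -> str:
    # New behaviour: only a real string payload selects the bytes branch.
    return "video_bytes" if isinstance(raw.get("video_bytes"), str) else "video_path"


# Hypothetical server reply that sets the unused field to null.
reply = {"video_id": "abc", "video_path": "/tmp/out.mp4", "video_bytes": None}

old_choice = branch_by_key_presence(reply)  # would then fail payload validation
new_choice = branch_by_type(reply)
print(old_choice, new_choice)
```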
- decode_response no longer wraps video_id in TextModelOutput. video_id is a server-side handle, not generated text; packing it into response_output meant downstream consumers (e.g. probe.py:195 printing get_response_output_string()) surfaced it as the response body. Both decode branches now set response_output=None and place video_id alongside video_path/video_bytes in metadata.
- Update unit + integration test assertions for the new shape.
- Add the missing unit-level coverage for the perf-mode (video_path) decode branch, which was previously only exercised by the integration suite.
- Mock trtllm-serve now derives its mock video_id from sha1(prompt) instead of Python's salted hash(), so ids are reproducible across test runs and CI workers.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
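The reproducibility point about `sha1(prompt)` versus the built-in `hash()` can be illustrated as below. A minimal sketch; the `video_` prefix and the 12-character truncation are hypothetical choices, not the mock's actual id format.

```python
import hashlib


def mock_video_id(prompt: str) -> str:
    """Derive a stable id from the prompt text (hypothetical format)."""
    # sha1 depends only on the input bytes. The built-in hash() of a str is
    # salted per process unless PYTHONHASHSEED is fixed, so it can differ
    # between test runs and between CI workers.
    digest = hashlib.sha1(prompt.encode("utf-8")).hexdigest()
    return f"video_{digest[:12]}"


first = mock_video_id("a red fox running through snow")
second = mock_video_id("a red fox running through snow")
other = mock_video_id("a blue whale breaching")
print(first == second, first != other)
```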
The probe path assumes second-scale latencies (probe_timeout=60s) and text-prompt/text-response semantics. Video generation requests on WAN2.2-T2V-A14B take minutes per request and emit no chat tokens, so the probe always fails with a misleading 'Probe failed: only 0/N requests successful' after the timeout.
Reject `--api-type videogen` upfront in _probe_async with a clear InputValidationError pointing users at `benchmark from-config` (or a dedicated health check, if one is later added).
Addresses arekay-nv [Claude] medium-severity comment on probe.py:54.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
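An upfront rejection of this kind might look like the sketch below. It is an illustration under stated assumptions: `InputValidationError` is redefined locally as a stand-in for the project's exception, and `check_probe_supported`, the constant, and the message text are hypothetical.

```python
class InputValidationError(ValueError):
    """Local stand-in; the project's own exception class is not shown here."""


# API types whose latency or output shape the probe cannot handle (assumed set).
UNSUPPORTED_PROBE_API_TYPES = frozenset({"videogen"})


def check_probe_supported(api_type: str) -> None:
    """Fail fast, before any request is sent, with an actionable message."""
    if api_type in UNSUPPORTED_PROBE_API_TYPES:
        raise InputValidationError(
            f"probe does not support api_type={api_type!r}; "
            "use `benchmark from-config` instead"
        )


check_probe_supported("openai")  # passes silently

try:
    check_probe_supported("videogen")
    message = ""
except InputValidationError as exc:
    message = str(exc)
print(message)
```

Failing before the first request avoids waiting out the full probe timeout only to see a generic failure count.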
Fixed/modified based on the review.
Summary
- Adds `wan22` module with `Wan22Adapter`, `Wan22Accumulator`, `Wan22Dataset`, and Pydantic wire types for the trtllm-serve `POST /v1/videos/generations` endpoint.
- `Wan22Adapter` uses `response_format=video_path`: the server saves the encoded video to shared storage (Lustre) and returns only the file path, avoiding 3–5 MB of base64 video bytes per request over HTTP and ZMQ transport.
- `Wan22Dataset` loads MLPerf WAN2.2 prompt text files (one prompt per line); dataset ID `wan22_mlperf` is registered with `DataLoaderFactory` for `--dataset` CLI use.
- Adds `APIType.WAN22` and wires `Wan22Adapter`/`Wan22Accumulator` into `HTTPClientConfig.with_updates()` to reset adapter and accumulator when `api_type` changes.
Test plan
- `pytest -m unit tests/unit/wan22/`: adapter, dataset, factory, types, init, registration unit tests
- `pytest -m integration tests/integration/wan22/`: adapter round-trip with mock server
- `pre-commit run --all-files` passes clean
What does this PR do?
Type of change
Related issues
Testing
Checklist