fix(llm_eval): migrate lm_eval_hf.py to lm-eval >= 0.4.10 HarnessCLI #1416
kevalmorabia97 wants to merge 4 commits into main from
Conversation
lm-eval 0.4.10 replaced lm_eval.__main__.{setup_parser, parse_eval_args}
with a HarnessCLI-based interface in lm_eval._cli, breaking the script's
import. Drive HarnessCLI directly: extend the run subparser with the
ModelOpt args, then move them out of the namespace into args.model_args
so EvaluatorConfig.from_cli does not reject them. Bump pinned lm-eval
versions in examples/llm_eval and examples/puzzletron requirements,
and add an end-to-end test that runs lm_eval_hf.py against a tiny qwen3.
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
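In sketch form, the change drives the new CLI roughly like this. This is a minimal illustration of the approach the commit describes, not the verbatim script; `parse_args` as the entry point is an assumption, and only one ModelOpt flag is shown.

```python
# Rough sketch of the migration described above; the real script registers
# all ModelOpt flags and guards the private attribute access.
from lm_eval._cli import HarnessCLI  # new home of the CLI in lm-eval 0.4.10

def main():
    cli = HarnessCLI()
    # Private API: the PR wraps this access in a RuntimeError guard.
    run_parser = cli._subparsers.choices["run"]
    run_parser.add_argument("--quant_cfg", default=None)  # one flag, abridged
    args = cli.parse_args()  # assumed entry point; only cli.execute is confirmed
    # Move the ModelOpt flag out of the namespace into model_args so
    # EvaluatorConfig.from_cli does not reject it as an unknown kwarg.
    quant_cfg = args.quant_cfg
    delattr(args, "quant_cfg")
    if quant_cfg is not None:
        args.model_args["quant_cfg"] = quant_cfg
    cli.execute(args)

if __name__ == "__main__":
    main()
```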
📝 Walkthrough

Relax lm-eval pins, enforce lm_eval >= 0.4.10 at runtime, migrate lm_eval_hf.py to HarnessCLI with ModelOpt argparse registration and injection, propagate trust_remote_code into model args, and update tests to use dynamically created Qwen3 model fixtures.

Changes: lm-eval Upgrade and CLI Refactoring
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Script
    participant VersionCheck
    participant HarnessCLI
    participant Parser
    participant Injector
    participant Executor
    User->>Script: invoke lm_eval_hf.py
    Script->>VersionCheck: verify lm_eval >= 0.4.10
    VersionCheck-->>Script: pass or raise ImportError
    Script->>HarnessCLI: construct CLI wrapper, locate 'run' subparser
    HarnessCLI->>Parser: obtain parser for run
    Parser->>Injector: parse args (includes ModelOpt flags)
    Injector->>Parser: move ModelOpt flags into args.model_args
    Injector->>Executor: call cli.execute(args)
    Executor-->>User: evaluation completes
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 5 passed | ❌ 1 failed (1 warning)
/claude review
🧹 Nitpick comments (1)
examples/llm_eval/lm_eval_hf.py (1)
50-50: 💤 Low value — Consider whether CLI extension is necessary, or refactor to use lm-eval's public Python API instead.

The code imports from `lm_eval._cli` (line 50) and accesses `cli._subparsers.choices["run"]` (line 267), both private/internal APIs marked by underscore prefixes. While the version check (line 47) provides some protection against major breaking changes, these internal APIs may still shift in minor releases without notice. A preferable approach would be to use lm-eval's documented Python API (`evaluate()` / `simple_evaluate()`) directly in the script, which would avoid reliance on unstable CLI internals. If CLI extension is essential, verify with lm-eval maintainers whether `_cli` module stability is guaranteed for 0.4.10+.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/llm_eval/lm_eval_hf.py` at line 50, The code uses the internal CLI module and private attributes (HarnessCLI and cli._subparsers.choices["run"]), which are unstable; replace this with lm-eval's public Python API by calling the documented evaluate() or simple_evaluate() functions directly (or, if CLI extension is truly required, confirm stability with maintainers and encapsulate CLI usage behind a clear wrapper). Locate uses of HarnessCLI and any direct access to cli._subparsers.choices["run"] and refactor to build and pass the same config/arguments into evaluate()/simple_evaluate(), or else wrap the CLI logic in a single adapter that isolates private API usage and adds version checks and error handling.
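For reference, the public-API route the comment above recommends would look roughly like this. A sketch assuming only the documented `simple_evaluate` entry point; the pretrained path is a placeholder, and the ModelOpt quantization hooks would still need separate wiring.

```python
# Sketch of the reviewer's suggested alternative: lm-eval's public Python API
# instead of the private _cli internals.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                  # HF backend
    model_args={"pretrained": "path/to/model"},  # placeholder checkpoint
    tasks=["mmlu"],
    limit=0.1,                                   # fraction of each task
)
print(results["results"])
```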
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8e41d01f-ae4a-4286-8fcc-e9b6d62237e0
📒 Files selected for processing (4)
- examples/llm_eval/lm_eval_hf.py
- examples/llm_eval/requirements.txt
- examples/puzzletron/requirements.txt
- tests/examples/llm_eval/test_llm_eval.py
💤 Files with no reviewable changes (1)
- examples/puzzletron/requirements.txt
Claude Code Review Summary
CRITICAL: 0 | IMPORTANT: 2 | SUGGESTION: 3
Key Findings
- [IMPORTANT Compatibility] — `cli._subparsers.choices["run"]` accesses a private attribute of `HarnessCLI`. Combined with the unbounded `>=0.4.10` floor, a future lm-eval release could break this without warning. Recommend adding a guard with a helpful error message.
- [IMPORTANT Compatibility] — The requirements pin uses `>=0.4.10` but the stated motivation (vLLM 0.20+ support) requires `>=0.4.11` per the PR description. The mismatch could confuse future maintainers.
- [SUGGESTION] — `_inject_modelopt_args_into_model_args` passes `None` values for all unset ModelOpt args into `model_args`. Not a bug (downstream handles it), but filtering would be cleaner.
What Looks Good
- The migration from deprecated `parse_eval_args`/`cli_evaluate` to `HarnessCLI` is structurally correct.
- The separation into `_MODELOPT_ARG_KEYS`, `_add_modelopt_args()`, and `_inject_modelopt_args_into_model_args()` is well-factored and testable.
- The version gate at module load (`raise ImportError`) gives a clear error instead of a cryptic traceback.
- The test correctly exercises the full `__main__` path via subprocess with a tiny model.
- Backward compatibility for existing CLI invocations is preserved (HarnessCLI auto-inserts the `run` subcommand for legacy-style argv).
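As a concrete illustration of that version gate, a minimal sketch (the exact message in the PR may differ):

```python
# Illustrative module-load version gate using packaging, as the review describes.
import lm_eval
from packaging.version import Version

if Version(lm_eval.__version__) < Version("0.4.10"):
    raise ImportError(
        f"lm_eval_hf.py requires lm-eval >= 0.4.10, found {lm_eval.__version__}."
    )
```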
Risk Assessment
Low-Medium. No algorithm or mode/state correctness concerns — this is an example script (not library code). The primary risk is fragility against future lm-eval releases due to the private API access. The test coverage added mitigates regression risk for the current version.
🤖 Generated with Claude Code
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1416      +/-   ##
==========================================
- Coverage   77.40%   77.38%   -0.02%
==========================================
  Files         476      476
  Lines       51319    51319
==========================================
- Hits        39721    39715       -6
- Misses      11598    11604       +6
```
Flags with carried forward coverage won't be shown.
- Annotate trust_remote_code propagation as user-driven (satisfies CodeRabbit Security Anti-Pattern check on hardcoded trust_remote_code=True).
- Guard the private cli._subparsers access with a clear RuntimeError so a future lm-eval refactor surfaces a useful error instead of AttributeError.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
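A guard of that shape could look like the following. The attribute names come from the review comments; the actual PR code may differ.

```python
# Illustrative RuntimeError guard around the private HarnessCLI internals.
from lm_eval._cli import HarnessCLI

cli = HarnessCLI()
try:
    run_parser = cli._subparsers.choices["run"]
except (AttributeError, KeyError) as e:
    raise RuntimeError(
        "Could not locate the 'run' subparser on HarnessCLI; lm-eval's "
        "internal CLI layout may have changed since 0.4.10."
    ) from e
```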
/claude review
Claude Review Summary
CRITICAL: 0 | IMPORTANT: 1 | SUGGESTION: 3
Most impactful finding
[IMPORTANT Compatibility] — `_inject_modelopt_args_into_model_args` unconditionally injects all `_MODELOPT_ARG_KEYS` (including None-valued ones) into `model_args`. If a user invokes this script with `--model` set to a non-HF model type (e.g., vllm), those extraneous None keys could leak into that model's constructor and cause a TypeError. Filtering out None values would make this more robust.
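In sketch form, the suggested filtering (helper and key names are taken from the review; the body is illustrative, not the PR's actual code):

```python
# Illustrative None-filtering for the injection helper the review discusses.
_MODELOPT_ARG_KEYS = ("quant_cfg", "compress", "sparse_cfg")  # abridged

def _inject_modelopt_args_into_model_args(args):
    for key in _MODELOPT_ARG_KEYS:
        value = getattr(args, key, None)
        if hasattr(args, key):
            delattr(args, key)  # keep the namespace clean either way
        if value is not None:  # skip unset flags so None never reaches
            args.model_args[key] = value  # a non-HF model constructor
    return args
```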
Overall assessment
Clean, well-structured migration from the legacy setup_parser/parse_eval_args/cli_evaluate API to lm-eval 0.4.10+'s HarnessCLI. The private API access (cli._subparsers.choices["run"]) is properly guarded with a clear error message. The data flow through _inject_modelopt_args_into_model_args → create_from_arg_obj is traced and correct for the HF model path. Test coverage exercises the happy path (no quant) and the FP8 quantization path.
Risk level: Low. This is an example script (not core library), the breaking change (lm-eval < 0.4.10 no longer supported) is clearly documented, and the fallback error messages are informative.
LGTM — approving with minor suggestions above.
…test

The default tiny qwen3 max_position_embeddings is 32, shorter than typical MMLU prompts (39-46 tokens). lm-eval's HFLM path tolerates this by truncating, but the TRT-LLM serve path used in test_qwen3_eval_fp8 rejects oversized prompts with `default_max_tokens (-14) must be greater than 0`. Bump to 2048 to give TRT-LLM headroom for MMLU/hellaswag/gsm8k/humaneval prompts.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
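For context, a tiny-Qwen3 fixture with the bumped context length might be built like this. Every name below is illustrative, since the repo's actual fixture helper is not shown here, and Qwen3Config requires a transformers release with Qwen3 support.

```python
# Illustrative tiny random-weight Qwen3 with room for MMLU-length prompts.
from transformers import Qwen3Config, Qwen3ForCausalLM

config = Qwen3Config(
    hidden_size=32,
    intermediate_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    num_key_value_heads=1,
    max_position_embeddings=2048,  # was 32: shorter than 39-46 token MMLU prompts
)
Qwen3ForCausalLM(config).save_pretrained("/tmp/tiny-qwen3")  # path illustrative
```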
What does this PR do?
Type of change: Bug fix
Related: NVbug 6153721. To make `lm_eval_hf.py` and the `vllm_causallms` adapter compatible with vLLM 0.20+, lm_eval must be upgraded to 0.4.11.

`examples/llm_eval/lm_eval_hf.py` failed to import on lm-eval >= 0.4.10: lm-eval 0.4.10 replaced `lm_eval.__main__.{setup_parser, parse_eval_args, cli_evaluate}` with a `HarnessCLI`-based interface in `lm_eval._cli`. This PR drops the legacy code path and drives `HarnessCLI` directly:

- Registers the ModelOpt args (`--quant_cfg`, `--auto_quantize_*`, `--calib_*`, `--compress`, `--sparse_cfg`) to the new `run` subparser.
- Moves them out of the parsed namespace into `args.model_args` (now a dict, courtesy of `MergeDictAction`), so `EvaluatorConfig.from_cli` doesn't reject them as unknown kwargs and so they reach our `HFLM.create_from_arg_obj` override.
- Enforces the minimum version with `packaging.version.Version` and bumps `examples/llm_eval/requirements.txt` and `examples/puzzletron/requirements.txt` accordingly.

Usage
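An illustrative invocation (task, checkpoint, and quantization config values are placeholders, not taken from the PR):

```sh
# Illustrative; --quant_cfg is one of the ModelOpt flags this PR re-registers.
python lm_eval_hf.py \
    --model hf \
    --model_args pretrained=path/to/model \
    --tasks mmlu \
    --quant_cfg NVFP4_DEFAULT_CFG \
    --batch_size 4
```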
Testing
Added `tests/examples/llm_eval/test_llm_eval.py::test_lm_eval_hf` — an end-to-end test that builds a tiny qwen3 (no GPU required) and runs `lm_eval_hf.py` against MMLU with `--limit 0.1`. Verifies both the new HarnessCLI integration and the `HFLM.create_from_arg_obj` override actually execute. The existing FP8 PTQ test (`test_qwen3_eval_fp8`) was retargeted to the same tiny qwen3 helper so it no longer pulls down the 1.1B TinyLlama checkpoint.
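A rough sketch of the shape such a subprocess-driven test can take (the fixture name and paths are assumptions, not the PR's actual code):

```python
# Hypothetical end-to-end test sketch; tiny_qwen3_dir would be a fixture that
# saves a small random-weight Qwen3 checkpoint to a temporary directory.
import subprocess
import sys

def test_lm_eval_hf(tiny_qwen3_dir):
    result = subprocess.run(
        [
            sys.executable, "examples/llm_eval/lm_eval_hf.py",
            "--model", "hf",
            "--model_args", f"pretrained={tiny_qwen3_dir}",
            "--tasks", "mmlu",
            "--limit", "0.1",
        ],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0, result.stderr
```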
Before your PR is "Ready for review"

- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Is backward compatibility maintained? Yes: HarnessCLI auto-inserts the `run` subcommand for legacy-style argv.
- `CONTRIBUTING.md`: N/A

Additional Information
NVbug 6153721 — vLLM 0.20+ compatibility requires lm_eval 0.4.11. This PR pins to >= 0.4.10 to fix the import error; bumping the floor to 0.4.11 (or letting users opt in via vLLM extras) unblocks vLLM 0.20+ usage.
Summary by CodeRabbit

Chores
- Relaxed lm-eval version pins and enforced lm_eval >= 0.4.10 at runtime.

Tests
- Switched tests to dynamically created tiny Qwen3 model fixtures.