fix(llm_eval): migrate lm_eval_hf.py to lm-eval >= 0.4.10 HarnessCLI #1416

Open
kevalmorabia97 wants to merge 4 commits into main from fix/lm-eval-0.4.10-cli

Conversation

kevalmorabia97 (Collaborator) commented May 8, 2026

What does this PR do?

Type of change: Bug fix

Related: NVbug 6153721. To make lm_eval_hf.py and the vllm_causallms adapter compatible with vLLM 0.20+, lm_eval must be upgraded to 0.4.11.

examples/llm_eval/lm_eval_hf.py failed to import on lm-eval >= 0.4.10:

ImportError: cannot import name 'parse_eval_args' from 'lm_eval.__main__'

lm-eval 0.4.10 replaced lm_eval.__main__.{setup_parser, parse_eval_args, cli_evaluate} with a HarnessCLI-based interface in lm_eval._cli. This PR drops the legacy code path and drives HarnessCLI directly:

  • Attach the ModelOpt arguments (--quant_cfg, --auto_quantize_*, --calib_*, --compress, --sparse_cfg) to the new run subparser.
  • After parsing, move those keys out of the argparse namespace and into args.model_args (now a dict, courtesy of MergeDictAction), so EvaluatorConfig.from_cli doesn't reject them as unknown kwargs and so they reach our HFLM.create_from_arg_obj override. A sketch of this step follows the list.
  • Hard-require lm-eval >= 0.4.10 via packaging.version.Version and bump examples/llm_eval/requirements.txt and examples/puzzletron/requirements.txt accordingly.
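A minimal sketch of the injection step, using only the concrete flag names listed above (the --auto_quantize_* and --calib_* families are elided; the real implementation in examples/llm_eval/lm_eval_hf.py may differ in detail):

# Sketch only: key set abbreviated relative to the script's _MODELOPT_ARG_KEYS.
_MODELOPT_ARG_KEYS = ("quant_cfg", "compress", "sparse_cfg")

def _inject_modelopt_args_into_model_args(args):
    # model_args is already a dict courtesy of MergeDictAction, so our keys
    # can be moved in without EvaluatorConfig.from_cli ever seeing them.
    for key in _MODELOPT_ARG_KEYS:
        args.model_args[key] = getattr(args, key)
        delattr(args, key)
    # trust_remote_code is forwarded only when the user actually passed it
    if getattr(args, "trust_remote_code", False):
        args.model_args["trust_remote_code"] = True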

Usage

# Same CLI as before — HarnessCLI auto-inserts the `run` subcommand for legacy-style invocations.
python examples/llm_eval/lm_eval_hf.py \
    --model hf \
    --model_args pretrained=<HF model> \
    --tasks hellaswag \
    --quant_cfg FP8_DEFAULT_CFG \
    --batch_size 4

Testing

Added tests/examples/llm_eval/test_llm_eval.py::test_lm_eval_hf — an end-to-end test that builds a tiny qwen3 (no GPU required) and runs lm_eval_hf.py against MMLU with --limit 0.1. Verifies both the new HarnessCLI integration and the HFLM.create_from_arg_obj override actually execute. Existing FP8 PTQ test (test_qwen3_eval_fp8) was retargeted to the same tiny qwen3 helper so it no longer pulls down the 1.1B TinyLlama checkpoint.
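A rough shape of that smoke test; the tiny-model helper name is an assumption here, not the repository's actual fixture API:

import subprocess
import sys

def test_lm_eval_hf(tmp_path):
    model_dir = create_tiny_qwen3(tmp_path)  # hypothetical shared helper
    # Run the example end-to-end in a subprocess so the __main__ path,
    # HarnessCLI integration, and HFLM override all execute for real.
    subprocess.run(
        [
            sys.executable, "examples/llm_eval/lm_eval_hf.py",
            "--model", "hf",
            "--model_args", f"pretrained={model_dir}",
            "--tasks", "mmlu",
            "--limit", "0.1",
            "--batch_size", "1",
        ],
        check=True,
    )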

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ❌ — drops support for lm-eval < 0.4.10. Requirements pins are bumped in the same PR; existing CLI invocations continue to work because HarnessCLI auto-inserts the run subcommand for legacy-style argv.
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ✅
  • Did you update Changelog?: N/A
  • Did you get Claude approval on this PR?: ❌

Additional Information

NVbug 6153721 — vLLM 0.20+ compatibility requires lm_eval 0.4.11. This PR pins to >= 0.4.10 to fix the import error; bumping the floor to 0.4.11 (or letting users opt in via vLLM extras) unblocks vLLM 0.20+ usage.

Summary by CodeRabbit

  • Chores

    • Require lm_eval >= 0.4.10 for evaluation examples and remove an lm-eval pin from a puzzletron example.
    • Improve LM-eval CLI integration so model-related CLI options (including trust_remote_code) are recognized and forwarded correctly.
  • Tests

    • Added an end-to-end smoke test for the LM evaluation example.
    • Updated evaluation tests to use the new tiny model setup for more robust validation.

lm-eval 0.4.10 replaced lm_eval.__main__.{setup_parser, parse_eval_args}
with a HarnessCLI-based interface in lm_eval._cli, breaking the script's
import. Drive HarnessCLI directly: extend the run subparser with the
ModelOpt args, then move them out of the namespace into args.model_args
so EvaluatorConfig.from_cli does not reject them. Bump pinned lm-eval
versions in examples/llm_eval and examples/puzzletron requirements,
and add an end-to-end test that runs lm_eval_hf.py against a tiny qwen3.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
coderabbitai Bot commented May 8, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e88d50c5-7ef3-4018-bc21-d9d660c4a99f

📥 Commits

Reviewing files that changed from the base of the PR and between 8d83ac3 and 4af002f.

📒 Files selected for processing (1)
  • tests/examples/llm_eval/test_llm_eval.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/examples/llm_eval/test_llm_eval.py

📝 Walkthrough

Relax lm-eval pins, enforce lm_eval >= 0.4.10 at runtime, migrate lm_eval_hf.py to HarnessCLI with ModelOpt argparse registration and injection, propagate trust_remote_code into model args, and update tests to use dynamically created Qwen3 model fixtures.

Changes

lm-eval Upgrade and CLI Refactoring

  • Dependency Version Constraints (examples/llm_eval/requirements.txt, examples/puzzletron/requirements.txt): lm_eval[api,ifeval] relaxed from ==0.4.8 to >=0.4.10; removed lm-eval==0.4.8 from puzzletron.
  • Version Enforcement and Imports (examples/llm_eval/lm_eval_hf.py): hard-enforces lm_eval >= 0.4.10 using packaging.version.Version, raising ImportError if unmet (sketched below this list); switches imports to HarnessCLI and setup_logging.
  • ModelOpt Argument Refactoring (examples/llm_eval/lm_eval_hf.py): replaces setup_parser_with_modelopt_args() with _add_modelopt_args(parser) and adds _MODELOPT_ARG_KEYS.
  • Argument Injection & main (examples/llm_eval/lm_eval_hf.py): adds _inject_modelopt_args_into_model_args(args) to move ModelOpt flags into args.model_args (propagating trust_remote_code) and rewrites __main__ to use HarnessCLI().execute(args).
  • Tests and Fixtures (tests/examples/llm_eval/test_llm_eval.py): updates imports to shared helpers and the Qwen3 fixture creator; adds a test_lm_eval_hf(tmp_path) smoke test; adapts the FP8 test to use generated Qwen3 dirs.
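A minimal sketch of that version gate, assuming lm_eval exposes __version__ (the error message is illustrative):

import lm_eval
from packaging.version import Version

# Fail fast at import time with an actionable message instead of a
# cryptic ImportError from lm_eval.__main__.
if Version(lm_eval.__version__) < Version("0.4.10"):
    raise ImportError(
        f"lm_eval_hf.py requires lm-eval >= 0.4.10, found {lm_eval.__version__}; "
        "please upgrade, e.g. pip install 'lm_eval>=0.4.10'."
    )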

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Script
  participant VersionCheck
  participant HarnessCLI
  participant Parser
  participant Injector
  participant Executor

  User->>Script: invoke lm_eval_hf.py
  Script->>VersionCheck: verify lm_eval >= 0.4.10
  VersionCheck-->>Script: pass or raise ImportError
  Script->>HarnessCLI: construct CLI wrapper, locate 'run' subparser
  HarnessCLI->>Parser: obtain parser for run
  Parser->>Injector: parse args (includes ModelOpt flags)
  Injector->>Parser: move ModelOpt flags into args.model_args
  Injector->>Executor: call cli.execute(args)
  Executor-->>User: evaluation completes

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 50.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (5 passed)
  • Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): the title accurately summarizes the main change, migrating lm_eval_hf.py to the new lm-eval >= 0.4.10 HarnessCLI interface, which is the core objective of the PR.
  • Linked Issues check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Security Anti-Patterns (✅ Passed): no security anti-patterns found; no unsafe torch.load, numpy.load, eval/exec, or nosec comments; trust_remote_code is conditional with justification; no new non-permissive dependencies.


kevalmorabia97 (Collaborator, Author) commented:

/claude review

coderabbitai Bot (Contributor) left a comment

🧹 Nitpick comments (1)
examples/llm_eval/lm_eval_hf.py (1)

50-50: 💤 Low value

Consider whether CLI extension is necessary, or refactor to use lm-eval's public Python API instead.

The code imports from lm_eval._cli (line 50) and accesses cli._subparsers.choices["run"] (line 267), both private/internal APIs marked by underscore prefixes. While the version check (line 47) provides some protection against major breaking changes, these internal APIs may still shift in minor releases without notice.

A preferable approach would be to use lm-eval's documented Python API (evaluate() / simple_evaluate()) directly in a script, which would avoid reliance on unstable CLI internals. If CLI extension is essential, verify with lm-eval maintainers whether _cli module stability is guaranteed for 0.4.10+.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8e41d01f-ae4a-4286-8fcc-e9b6d62237e0

📥 Commits

Reviewing files that changed from the base of the PR and between e2d29c8 and 4727635.

📒 Files selected for processing (4)
  • examples/llm_eval/lm_eval_hf.py
  • examples/llm_eval/requirements.txt
  • examples/puzzletron/requirements.txt
  • tests/examples/llm_eval/test_llm_eval.py
💤 Files with no reviewable changes (1)
  • examples/puzzletron/requirements.txt


claude Bot left a comment
Claude Code Review Summary

CRITICAL: 0 | IMPORTANT: 2 | SUGGESTION: 3

Key Findings

  1. [IMPORTANT Compatibility] cli._subparsers.choices["run"] accesses a private attribute of HarnessCLI. Combined with the unbounded >=0.4.10 floor, a future lm-eval release could break this without warning. Recommend adding a guard with a helpful error message (sketched after this list).

  2. [IMPORTANT Compatibility] — The requirements pin uses >=0.4.10 but the stated motivation (vLLM 0.20+ support) requires >=0.4.11 per the PR description. The mismatch could confuse future maintainers.

  3. [SUGGESTION] _inject_modelopt_args_into_model_args passes None values for all unset ModelOpt args into model_args. Not a bug (downstream handles it), but filtering would be cleaner.
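A sketch of the suggested guard (a follow-up commit in this PR adds a RuntimeError of this shape); HarnessCLI and _subparsers are lm-eval internals, so the attribute names may shift:

from lm_eval._cli import HarnessCLI  # internal module, per this PR

cli = HarnessCLI()
subparsers = getattr(cli, "_subparsers", None)
if subparsers is None or "run" not in subparsers.choices:
    raise RuntimeError(
        "lm-eval CLI internals changed: expected HarnessCLI._subparsers"
        ".choices['run'] (lm-eval >= 0.4.10). Pin a compatible lm-eval version."
    )
run_parser = subparsers.choices["run"]
# the real script then registers the ModelOpt flags on run_parser and
# eventually calls cli.execute(args)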

What Looks Good

  • The migration from deprecated parse_eval_args/cli_evaluate to HarnessCLI is structurally correct.
  • The separation into _MODELOPT_ARG_KEYS, _add_modelopt_args(), and _inject_modelopt_args_into_model_args() is well-factored and testable.
  • The version gate at module load (raise ImportError) gives a clear error instead of a cryptic traceback.
  • The test correctly exercises the full __main__ path via subprocess with a tiny model.
  • Backward compatibility for existing CLI invocations is preserved (HarnessCLI auto-inserts the run subcommand for legacy-style argv).

Risk Assessment

Low-Medium. No algorithm or mode/state correctness concerns — this is an example script (not library code). The primary risk is fragility against future lm-eval releases due to the private API access. The test coverage added mitigates regression risk for the current version.

🤖 Generated with Claude Code

codecov Bot commented May 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.38%. Comparing base (e2d29c8) to head (4af002f).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1416      +/-   ##
==========================================
- Coverage   77.40%   77.38%   -0.02%     
==========================================
  Files         476      476              
  Lines       51319    51319              
==========================================
- Hits        39721    39715       -6     
- Misses      11598    11604       +6     
Flag Coverage Δ
examples 41.81% <ø> (-0.16%) ⬇️
unit 52.55% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.


- Annotate trust_remote_code propagation as user-driven (satisfies CodeRabbit
  Security Anti-Pattern check on hardcoded trust_remote_code=True).
- Guard the private cli._subparsers access with a clear RuntimeError so a
  future lm-eval refactor surfaces a useful error instead of AttributeError.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
kevalmorabia97 (Collaborator, Author) commented:

/claude review


claude Bot left a comment

Claude Review Summary

CRITICAL: 0 | IMPORTANT: 1 | SUGGESTION: 3

Most impactful finding

[IMPORTANT Compatibility] _inject_modelopt_args_into_model_args unconditionally injects all _MODELOPT_ARG_KEYS (including None-valued ones) into model_args. If a user invokes this script with --model set to a non-HF model type (e.g., vllm), those extraneous None keys could leak into that model's constructor and cause a TypeError. Filtering out None values, as sketched below, would make this more robust.
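A hypothetical shape for that filter, reusing the PR's _MODELOPT_ARG_KEYS and args names:

# drop unset ModelOpt keys so they never reach a non-HF model constructor
modelopt_values = {k: getattr(args, k) for k in _MODELOPT_ARG_KEYS}
args.model_args.update({k: v for k, v in modelopt_values.items() if v is not None})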

Overall assessment

Clean, well-structured migration from the legacy setup_parser/parse_eval_args/cli_evaluate API to lm-eval 0.4.10+'s HarnessCLI. The private API access (cli._subparsers.choices["run"]) is properly guarded with a clear error message. The data flow through _inject_modelopt_args_into_model_args → create_from_arg_obj is traced and correct for the HF model path. Test coverage exercises the happy path (no quant) and the FP8 quantization path.

Risk level: Low. This is an example script (not core library), the breaking change (lm-eval < 0.4.10 no longer supported) is clearly documented, and the fallback error messages are informative.

LGTM — approving with minor suggestions above.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…test

The default tiny qwen3 max_position_embeddings is 32, shorter than typical
MMLU prompts (39-46 tokens). lm-eval's HFLM path tolerates this by truncating,
but the TRT-LLM serve path used in test_qwen3_eval_fp8 rejects oversized
prompts with `default_max_tokens (-14) must be greater than 0`. Bump to 2048
to give TRT-LLM headroom for MMLU/hellaswag/gsm8k/humaneval prompts.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
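A sketch of the fixture change this commit describes; only max_position_embeddings comes from the commit message, the other tiny dimensions are illustrative placeholders:

from transformers import Qwen3Config

tiny_cfg = Qwen3Config(
    hidden_size=32,                # placeholder tiny dims
    num_hidden_layers=2,
    num_attention_heads=2,
    num_key_value_heads=1,
    max_position_embeddings=2048,  # was 32; MMLU prompts run 39-46 tokens
)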