fix(llm_eval): migrate lm_eval_hf.py to lm-eval >= 0.4.10 HarnessCLI #1416

Open
kevalmorabia97 wants to merge 4 commits into main from fix/lm-eval-0.4.10-cli

Conversation

kevalmorabia97 (Collaborator) commented May 8, 2026

What does this PR do?

Type of change: Bug fix

Related: NVbug 6153721. To make lm_eval_hf.py and the vllm_causallms adapter compatible with vLLM 0.20+, lm_eval must be upgraded to 0.4.11.

examples/llm_eval/lm_eval_hf.py failed to import on lm-eval >= 0.4.10:

ImportError: cannot import name 'parse_eval_args' from 'lm_eval.__main__'

lm-eval 0.4.10 replaced lm_eval.__main__.{setup_parser, parse_eval_args, cli_evaluate} with a HarnessCLI-based interface in lm_eval._cli. This PR drops the legacy code path and drives HarnessCLI directly:

  • Attach the ModelOpt arguments (--quant_cfg, --auto_quantize_*, --calib_*, --compress, --sparse_cfg) to the new run subparser.
  • After parsing, move those keys out of the argparse namespace and into args.model_args (now a dict, courtesy of MergeDictAction), so EvaluatorConfig.from_cli doesn't reject them as unknown kwargs and so they reach our HFLM.create_from_arg_obj override. A sketch of this step follows the list.
  • Hard-require lm-eval >= 0.4.10 via packaging.version.Version and bump examples/llm_eval/requirements.txt and examples/puzzletron/requirements.txt accordingly.
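A minimal sketch of the injection step, using only the concrete flag names listed above (the --auto_quantize_* and --calib_* families are elided; the real implementation in examples/llm_eval/lm_eval_hf.py may differ in detail):

# Sketch only: key set abbreviated relative to the script's _MODELOPT_ARG_KEYS.
_MODELOPT_ARG_KEYS = ("quant_cfg", "compress", "sparse_cfg")

def _inject_modelopt_args_into_model_args(args):
    # model_args is already a dict courtesy of MergeDictAction, so our keys
    # can be moved in without EvaluatorConfig.from_cli ever seeing them.
    for key in _MODELOPT_ARG_KEYS:
        args.model_args[key] = getattr(args, key)
        delattr(args, key)
    # trust_remote_code is forwarded only when the user actually passed it
    if getattr(args, "trust_remote_code", False):
        args.model_args["trust_remote_code"] = True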

Usage

# Same CLI as before — HarnessCLI auto-inserts the `run` subcommand for legacy-style invocations.
python examples/llm_eval/lm_eval_hf.py \
    --model hf \
    --model_args pretrained=<HF model> \
    --tasks hellaswag \
    --quant_cfg FP8_DEFAULT_CFG \
    --batch_size 4

Testing

Added tests/examples/llm_eval/test_llm_eval.py::test_lm_eval_hf — an end-to-end test that builds a tiny qwen3 (no GPU required) and runs lm_eval_hf.py against MMLU with --limit 0.1. Verifies both the new HarnessCLI integration and the HFLM.create_from_arg_obj override actually execute. Existing FP8 PTQ test (test_qwen3_eval_fp8) was retargeted to the same tiny qwen3 helper so it no longer pulls down the 1.1B TinyLlama checkpoint.
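A rough shape of that smoke test; the tiny-model helper name is an assumption here, not the repository's actual fixture API:

import subprocess
import sys

def test_lm_eval_hf(tmp_path):
    model_dir = create_tiny_qwen3(tmp_path)  # hypothetical shared helper
    # Run the example end-to-end in a subprocess so the __main__ path,
    # HarnessCLI integration, and HFLM override all execute for real.
    subprocess.run(
        [
            sys.executable, "examples/llm_eval/lm_eval_hf.py",
            "--model", "hf",
            "--model_args", f"pretrained={model_dir}",
            "--tasks", "mmlu",
            "--limit", "0.1",
            "--batch_size", "1",
        ],
        check=True,
    )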

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ❌ — drops support for lm-eval < 0.4.10. Requirements pins are bumped in the same PR; existing CLI invocations continue to work because HarnessCLI auto-inserts the run subcommand for legacy-style argv.
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ✅
  • Did you update Changelog?: N/A
  • Did you get Claude approval on this PR?: ❌

Additional Information

NVbug 6153721 — vLLM 0.20+ compatibility requires lm_eval 0.4.11. This PR pins to >= 0.4.10 to fix the import error; bumping the floor to 0.4.11 (or letting users opt in via vLLM extras) unblocks vLLM 0.20+ usage.

Summary by CodeRabbit

  • Chores

    • Require lm_eval >= 0.4.10 for evaluation examples and remove an lm-eval pin from a puzzletron example.
    • Improve LM-eval CLI integration so model-related CLI options (including trust_remote_code) are recognized and forwarded correctly.
  • Tests

    • Added an end-to-end smoke test for the LM evaluation example.
    • Updated evaluation tests to use the new tiny model setup for more robust validation.

lm-eval 0.4.10 replaced lm_eval.__main__.{setup_parser, parse_eval_args}
with a HarnessCLI-based interface in lm_eval._cli, breaking the script's
import. Drive HarnessCLI directly: extend the run subparser with the
ModelOpt args, then move them out of the namespace into args.model_args
so EvaluatorConfig.from_cli does not reject them. Bump pinned lm-eval
versions in examples/llm_eval and examples/puzzletron requirements,
and add an end-to-end test that runs lm_eval_hf.py against a tiny qwen3.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
coderabbitai Bot commented May 8, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e88d50c5-7ef3-4018-bc21-d9d660c4a99f

📥 Commits

Reviewing files that changed from the base of the PR and between 8d83ac3 and 4af002f.

📒 Files selected for processing (1)
  • tests/examples/llm_eval/test_llm_eval.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/examples/llm_eval/test_llm_eval.py

📝 Walkthrough

Relax lm-eval pins, enforce lm_eval >= 0.4.10 at runtime, migrate lm_eval_hf.py to HarnessCLI with ModelOpt argparse registration and injection, propagate trust_remote_code into model args, and update tests to use dynamically created Qwen3 model fixtures.

Changes

lm-eval Upgrade and CLI Refactoring

  • Dependency Version Constraints (examples/llm_eval/requirements.txt, examples/puzzletron/requirements.txt): lm_eval[api,ifeval] relaxed from ==0.4.8 to >=0.4.10; removed lm-eval==0.4.8 from puzzletron.
  • Version Enforcement and Imports (examples/llm_eval/lm_eval_hf.py): hard-enforces lm_eval >= 0.4.10 using packaging.version.Version, raising ImportError if unmet (sketched below this list); switches imports to HarnessCLI and setup_logging.
  • ModelOpt Argument Refactoring (examples/llm_eval/lm_eval_hf.py): replaces setup_parser_with_modelopt_args() with _add_modelopt_args(parser) and adds _MODELOPT_ARG_KEYS.
  • Argument Injection & main (examples/llm_eval/lm_eval_hf.py): adds _inject_modelopt_args_into_model_args(args) to move ModelOpt flags into args.model_args (propagating trust_remote_code) and rewrites __main__ to use HarnessCLI().execute(args).
  • Tests and Fixtures (tests/examples/llm_eval/test_llm_eval.py): updates imports to shared helpers and the Qwen3 fixture creator; adds a test_lm_eval_hf(tmp_path) smoke test; adapts the FP8 test to use generated Qwen3 dirs.
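A minimal sketch of that version gate, assuming lm_eval exposes __version__ (the error message is illustrative):

import lm_eval
from packaging.version import Version

# Fail fast at import time with an actionable message instead of a
# cryptic ImportError from lm_eval.__main__.
if Version(lm_eval.__version__) < Version("0.4.10"):
    raise ImportError(
        f"lm_eval_hf.py requires lm-eval >= 0.4.10, found {lm_eval.__version__}; "
        "please upgrade, e.g. pip install 'lm_eval>=0.4.10'."
    )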

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Script
  participant VersionCheck
  participant HarnessCLI
  participant Parser
  participant Injector
  participant Executor

  User->>Script: invoke lm_eval_hf.py
  Script->>VersionCheck: verify lm_eval >= 0.4.10
  VersionCheck-->>Script: pass or raise ImportError
  Script->>HarnessCLI: construct CLI wrapper, locate 'run' subparser
  HarnessCLI->>Parser: obtain parser for run
  Parser->>Injector: parse args (includes ModelOpt flags)
  Injector->>Parser: move ModelOpt flags into args.model_args
  Injector->>Executor: call cli.execute(args)
  Executor-->>User: evaluation completes

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 50.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (5 passed)
  • Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): the title accurately summarizes the main change, migrating lm_eval_hf.py to the new lm-eval >= 0.4.10 HarnessCLI interface, which is the core objective of the PR.
  • Linked Issues check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Security Anti-Patterns (✅ Passed): no security anti-patterns found; no unsafe torch.load, numpy.load, eval/exec, or nosec comments; trust_remote_code is conditional with justification; no new non-permissive dependencies.


kevalmorabia97 (Collaborator, Author) commented:

/claude review

coderabbitai Bot (Contributor) left a comment

🧹 Nitpick comments (1)
examples/llm_eval/lm_eval_hf.py (1)

50-50: 💤 Low value

Consider whether CLI extension is necessary, or refactor to use lm-eval's public Python API instead.

The code imports from lm_eval._cli (line 50) and accesses cli._subparsers.choices["run"] (line 267), both private/internal APIs marked by underscore prefixes. While the version check (line 47) provides some protection against major breaking changes, these internal APIs may still shift in minor releases without notice.

A preferable approach would be to use lm-eval's documented Python API (evaluate() / simple_evaluate()) directly in a script, which would avoid reliance on unstable CLI internals. If CLI extension is essential, verify with lm-eval maintainers whether _cli module stability is guaranteed for 0.4.10+.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8e41d01f-ae4a-4286-8fcc-e9b6d62237e0

📥 Commits

Reviewing files that changed from the base of the PR and between e2d29c8 and 4727635.

📒 Files selected for processing (4)
  • examples/llm_eval/lm_eval_hf.py
  • examples/llm_eval/requirements.txt
  • examples/puzzletron/requirements.txt
  • tests/examples/llm_eval/test_llm_eval.py
💤 Files with no reviewable changes (1)
  • examples/puzzletron/requirements.txt


claude Bot left a comment
Claude Code Review Summary

CRITICAL: 0 | IMPORTANT: 2 | SUGGESTION: 3

Key Findings

  1. [IMPORTANT Compatibility] cli._subparsers.choices["run"] accesses a private attribute of HarnessCLI. Combined with the unbounded >=0.4.10 floor, a future lm-eval release could break this without warning. Recommend adding a guard with a helpful error message (sketched after this list).

  2. [IMPORTANT Compatibility] — The requirements pin uses >=0.4.10 but the stated motivation (vLLM 0.20+ support) requires >=0.4.11 per the PR description. The mismatch could confuse future maintainers.

  3. [SUGGESTION] _inject_modelopt_args_into_model_args passes None values for all unset ModelOpt args into model_args. Not a bug (downstream handles it), but filtering would be cleaner.
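A sketch of the suggested guard (a follow-up commit in this PR adds a RuntimeError of this shape); HarnessCLI and _subparsers are lm-eval internals, so the attribute names may shift:

from lm_eval._cli import HarnessCLI  # internal module, per this PR

cli = HarnessCLI()
subparsers = getattr(cli, "_subparsers", None)
if subparsers is None or "run" not in subparsers.choices:
    raise RuntimeError(
        "lm-eval CLI internals changed: expected HarnessCLI._subparsers"
        ".choices['run'] (lm-eval >= 0.4.10). Pin a compatible lm-eval version."
    )
run_parser = subparsers.choices["run"]
# the real script then registers the ModelOpt flags on run_parser and
# eventually calls cli.execute(args)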

What Looks Good

  • The migration from deprecated parse_eval_args/cli_evaluate to HarnessCLI is structurally correct.
  • The separation into _MODELOPT_ARG_KEYS, _add_modelopt_args(), and _inject_modelopt_args_into_model_args() is well-factored and testable.
  • The version gate at module load (raise ImportError) gives a clear error instead of a cryptic traceback.
  • The test correctly exercises the full __main__ path via subprocess with a tiny model.
  • Backward compatibility for existing CLI invocations is preserved (HarnessCLI auto-inserts the run subcommand for legacy-style argv).

Risk Assessment

Low-Medium. No algorithm or mode/state correctness concerns — this is an example script (not library code). The primary risk is fragility against future lm-eval releases due to the private API access. The test coverage added mitigates regression risk for the current version.

🤖 Generated with Claude Code

codecov Bot commented May 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.38%. Comparing base (e2d29c8) to head (4af002f).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1416      +/-   ##
==========================================
- Coverage   77.40%   77.38%   -0.02%     
==========================================
  Files         476      476              
  Lines       51319    51319              
==========================================
- Hits        39721    39715       -6     
- Misses      11598    11604       +6     
Flag Coverage Δ
examples 41.81% <ø> (-0.16%) ⬇️
unit 52.55% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.


- Annotate trust_remote_code propagation as user-driven (satisfies CodeRabbit
  Security Anti-Pattern check on hardcoded trust_remote_code=True).
- Guard the private cli._subparsers access with a clear RuntimeError so a
  future lm-eval refactor surfaces a useful error instead of AttributeError.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
kevalmorabia97 (Collaborator, Author) commented:

/claude review


claude Bot left a comment

Claude Review Summary

CRITICAL: 0 | IMPORTANT: 1 | SUGGESTION: 3

Most impactful finding

[IMPORTANT Compatibility] _inject_modelopt_args_into_model_args unconditionally injects all _MODELOPT_ARG_KEYS (including None-valued ones) into model_args. If a user invokes this script with --model set to a non-HF model type (e.g., vllm), those extraneous None keys could leak into that model's constructor and cause a TypeError. Filtering out None values, as sketched below, would make this more robust.
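A hypothetical shape for that filter, reusing the PR's _MODELOPT_ARG_KEYS and args names:

# drop unset ModelOpt keys so they never reach a non-HF model constructor
modelopt_values = {k: getattr(args, k) for k in _MODELOPT_ARG_KEYS}
args.model_args.update({k: v for k, v in modelopt_values.items() if v is not None})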

Overall assessment

Clean, well-structured migration from the legacy setup_parser/parse_eval_args/cli_evaluate API to lm-eval 0.4.10+'s HarnessCLI. The private API access (cli._subparsers.choices["run"]) is properly guarded with a clear error message. The data flow through _inject_modelopt_args_into_model_args → create_from_arg_obj is traced and correct for the HF model path. Test coverage exercises the happy path (no quant) and the FP8 quantization path.

Risk level: Low. This is an example script (not core library), the breaking change (lm-eval < 0.4.10 no longer supported) is clearly documented, and the fallback error messages are informative.

LGTM — approving with minor suggestions above.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…test

The default tiny qwen3 max_position_embeddings is 32, shorter than typical
MMLU prompts (39-46 tokens). lm-eval's HFLM path tolerates this by truncating,
but the TRT-LLM serve path used in test_qwen3_eval_fp8 rejects oversized
prompts with `default_max_tokens (-14) must be greater than 0`. Bump to 2048
to give TRT-LLM headroom for MMLU/hellaswag/gsm8k/humaneval prompts.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
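A sketch of the fixture change this commit describes; only max_position_embeddings comes from the commit message, the other tiny dimensions are illustrative placeholders:

from transformers import Qwen3Config

tiny_cfg = Qwen3Config(
    hidden_size=32,                # placeholder tiny dims
    num_hidden_layers=2,
    num_attention_heads=2,
    num_key_value_heads=1,
    max_position_embeddings=2048,  # was 32; MMLU prompts run 39-46 tokens
)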