[https://nvbugs/6069543][fix] Lower accuracy threshold for H20 qwen3.5 test by rosenrodt · Pull Request #13895 · NVIDIA/TensorRT-LLM

rosenrodt · 2026-05-08T09:03:37Z

Summary by CodeRabbit

Tests
- Added GPU-specific (H20) accuracy reference configurations for model evaluation benchmarks.
- Updated evaluation test logic to compute and apply hardware-specific accuracy thresholds, enabling more granular validation across different GPU types.

Description

For some reason H20 has small and sometimes fluctuating accuracy gap relative to H100/H200 BF16 MoE config, resulting in occasional failures. To keep track if it regresses further, we lower the accuracy threshold.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-05-08T09:06:31Z

📝 Walkthrough

Walkthrough

This PR adds H20 GPU-specific accuracy thresholds for the Qwen 3.5 35B model by introducing a reference accuracy specification and conditional test logic. The changes enable the test suite to apply different acceptance criteria when evaluating the model on H20 hardware.

Changes

H20 GPU Accuracy Handling

Layer / File(s)	Summary
Reference Accuracy Configuration `tests/integration/defs/accuracy/references/gsm8k.yaml`	Adds `extra_acc_spec: h20` with `accuracy: 83.9` under the Qwen3.5-35B-A3B reference entry.
H20-Conditional Evaluation Logic `tests/integration/defs/accuracy/test_llm_api_pytorch.py`	`TestQwen3_5_35B_A3B.test_bf16` conditionally sets `extra_acc_spec` to `"h20"` when the current GPU is H20, and passes it to the GSM8K evaluation call to apply adjusted thresholds.

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: lowering the accuracy threshold for H20 qwen3.5 test, with proper NVBugs reference and [fix] type.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description provides a clear explanation of the issue (H20 accuracy fluctuation) and the solution (lowering accuracy thresholds), meeting the template requirements.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

tests/integration/defs/accuracy/references/gsm8k.yaml (1)
205-206: ⚡ Quick win

Document how the H20 threshold was derived.

Please add a short inline note next to accuracy: 83.9 (for example: calibration date + run window/build IDs). Without provenance, future threshold changes are hard to audit and can mask drift.
Also, QA list updates look unnecessary here (no new/renamed integration test definition, so no change needed in tests/integration/test_lists/qa/llm_function_core.txt).

As per coding guidelines, "Keep feedback actionable: suggest concrete list file names and whether coverage is sufficient, insufficient, or needs follow-up outside the PR."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/defs/accuracy/references/gsm8k.yaml` around lines 205 -
206, The YAML entry uses extra_acc_spec: h20 with accuracy: 83.9 but lacks
provenance; update the line with a short inline note after accuracy: 83.9 (e.g.,
" # derived: calibration YYYY-MM-DD; run window: [start:end]; build IDs:
<build1>,<build2>") explaining how the H20 threshold was computed and which
calibration/run/build produced it, and keep the extra_acc_spec key unchanged;
also revert any edits to the QA list (llm_function_core.txt) — coverage is
sufficient for this change and no QA list update is needed.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/integration/defs/accuracy/references/gsm8k.yaml`:
- Around line 205-206: The YAML entry uses extra_acc_spec: h20 with accuracy:
83.9 but lacks provenance; update the line with a short inline note after
accuracy: 83.9 (e.g., " # derived: calibration YYYY-MM-DD; run window:
[start:end]; build IDs: <build1>,<build2>") explaining how the H20 threshold was
computed and which calibration/run/build produced it, and keep the
extra_acc_spec key unchanged; also revert any edits to the QA list
(llm_function_core.txt) — coverage is sufficient for this change and no QA list
update is needed.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 28f722bb-c912-4250-b384-31ccac392441

📥 Commits

Reviewing files that changed from the base of the PR and between 7d37c74 and f2bd0fc.

📒 Files selected for processing (2)

tests/integration/defs/accuracy/references/gsm8k.yaml
tests/integration/defs/accuracy/test_llm_api_pytorch.py

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>

StanleySun639 · 2026-05-08T09:31:50Z

/bot run

tensorrt-cicd · 2026-05-08T09:37:51Z

PR_Github #47373 [ run ] triggered by Bot. Commit: db77af5 Link to invocation

tensorrt-cicd · 2026-05-08T14:44:50Z

PR_Github #47373 [ run ] completed with state SUCCESS. Commit: db77af5
/LLM/main/L0_MergeRequest_PR pipeline #37304 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

rosenrodt requested a review from a team as a code owner May 8, 2026 09:03

github-actions Bot assigned rosenrodt May 8, 2026

coderabbitai Bot reviewed May 8, 2026

View reviewed changes

lower accuracy threshold for h20 qwen3.5 test

db77af5

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>

rosenrodt force-pushed the bug/6069543 branch from f2bd0fc to db77af5 Compare May 8, 2026 09:09

StanleySun639 approved these changes May 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[https://nvbugs/6069543][fix] Lower accuracy threshold for H20 qwen3.5 test#13895

[https://nvbugs/6069543][fix] Lower accuracy threshold for H20 qwen3.5 test#13895
rosenrodt wants to merge 1 commit intoNVIDIA:mainfrom
rosenrodt:bug/6069543

rosenrodt commented May 8, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 8, 2026 •

edited

Loading

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

StanleySun639 commented May 8, 2026

Uh oh!

tensorrt-cicd commented May 8, 2026

Uh oh!

tensorrt-cicd commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

rosenrodt commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

StanleySun639 commented May 8, 2026

Uh oh!

tensorrt-cicd commented May 8, 2026

Uh oh!

tensorrt-cicd commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rosenrodt commented May 8, 2026 •

edited

Loading

coderabbitai Bot commented May 8, 2026 •

edited

Loading