[None][docs] Update supported models matrix with AD-onboarded architectures #12340
bmarimuthu-nv wants to merge 55 commits into NVIDIA:main from
Conversation
* build_and_run_ad.py from registry Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * csv generator file Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Working models: ibm-granite/granite-4.0-micro, ibm-granite/granite-4.0-tiny-preview, ibm-granite/granite-4.0-h-small Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
* hunyuan model onboarding Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com> * support hunyuan instruct model Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com> * update world_size for hunyuan models Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com> --------- Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com> Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
* [None][feat] Add AutoDeploy custom model for Qwen3 (Qwen/Qwen3-0.6B-FP8) Add prefill-only Qwen3 custom model implementation for AutoDeploy export with GQA attention, per-head Q/K normalization, and FP8 dtype handling. - Custom model: modeling_qwen3.py using torch_attention and torch_rope ops - Hierarchical equivalence tests against HF reference (MLP, attention, decoder layer, full model, export with dynamic shapes) - AD config with fuse_finegrained_fp8_linear disabled (CUDA 13.1 NVRTC workaround) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Address PR feedback: use torch_rmsnorm op, simplify yaml - Replace custom RMSNorm implementation with torch.ops.auto_deploy.torch_rmsnorm - Simplify AD config yaml to only override fuse_finegrained_fp8_linear Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Add SM check to fuse_finegrained_fp8_linear, remove FP8 config override - Skip fuse_finegrained_fp8_linear on SM < 100 (Hopper and below) since fp8_block_scaling_gemm requires Blackwell (SM 100+) - Remove the no-longer-needed qwen3_0_6b_fp8_no_fuse_fp8.yaml config override Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Move position slicing to RoPE, assert position_ids, update models.yaml - Move cos/sin position_ids slicing from attention into RoPE forward (avoids redundant slicing per layer) - Assert position_ids is not None in both Model and ForCausalLM forward (no fallback — AD always provides position_ids) - Update Qwen3-14B and Qwen3-32B to world_size_2 in models.yaml Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* [None][feat] Add AutoDeploy custom model for nvidia/Llama-3_3-Nemotron-Super-49B-v1 Onboard the DeciLM (Nemotron-NAS) architecture as a custom AutoDeploy model. This is a heterogeneous Llama-like model where each layer can have GQA attention + varying-width SwiGLU MLP, or FFN-only (attention skipped via no_op). - Custom prefill-only model using torch_attention AD op - Bundled DeciLMConfig (model_type "nemotron-nas" not in transformers) - Llama3-style RoPE with frequency scaling - Hierarchical equivalence tests against HF reference - AD config YAML for TP=8 deployment Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][feat] Address review: use canonical AD IR ops, remove bundled config - Remove bundled DeciLMConfig (use HF config via trust_remote_code=True) - Replace manual RMSNorm with torch.ops.auto_deploy.torch_rmsnorm - Replace manual RoPE with torch.ops.auto_deploy.torch_rope_with_explicit_cos_sin - Move position_ids indexing into RoPE forward (avoid repeated indexing) - Make position_ids required (no fallback) - Remove custom config YAML, update models.yaml entry (dashboard_default + world_size_8) - Remove simple_shard_only from models.yaml entry Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][feat] Add custom config for Nemotron-Super-49B with memory-safe defaults The default max_batch_size=128 from dashboard_default.yaml causes OOM during KV cache allocation on 8x H100 80GB (~7GB free per GPU after weight loading). Add model-specific config with reduced batch/token/seq limits and register it in the model registry pipeline. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* update world_size Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com> * further update the model yaml Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com> --------- Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
* build_and_run_ad.py config clean up Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * max_num_tokens clean up Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Add ByteDance-Seed/Seed-Coder-8B-Base to the AutoDeploy model registry. All Seed-Coder models use standard LlamaForCausalLM architecture (GQA, 8B params) and require no custom model code. Seed-Coder-8B-Instruct and 8B-Reasoning were already in the registry. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Add prefill-only custom model for the DeepSeek-V2 family (model_type="deepseek_v2") covering: - DeepSeek-Coder-V2-Instruct (236B MoE, group_limited_greedy routing) - DeepSeek-Coder-V2-Lite-Instruct (16B MoE, greedy routing, no Q LoRA) - DeepSeek-V2.5 (236B MoE, group_limited_greedy routing) Uses AD canonical ops: torch_mla, torch_moe, torch_rmsnorm, torch_rope_with_explicit_cos_sin. Includes RoPE weight de-interleaving hooks for both q_b_proj and q_proj (when q_lora_rank is None). Multi-GPU support via EP (expert parallelism) with enable_attention_dp=true, which replicates MLA attention across GPUs while distributing MoE experts. This avoids MLA TP sharding complexity for the V2-Lite variant (q_lora_rank=None). Comprehensive test suite with 43 tests covering block-level, layer-level, and full-model numerical equivalence against HF, plus torch.export compatibility. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* [None][feat] Add Llama 3 family AD custom model and registry entries Add explicit AutoDeploy custom modeling code for the Llama 3 family (Llama 3, 3.1, 3.2, 3.3) using AD canonical ops for export compatibility. All text-only Llama 3 variants share the same architecture (LlamaConfig). - Custom model: modeling_llama3.py with torch_rmsnorm, torch_attention, torch_rope_with_explicit_cos_sin canonical ops - Supports all RoPE variants (default, llama3) via ROPE_INIT_FUNCTIONS - Hierarchical tests: RMSNorm, MLP, Attention, Decoder Layer, Full Model, Export with dynamic shapes - Registry: added Meta-Llama-3-8B-Instruct, Llama-3.1-70B-Instruct, Meta-Llama-3-70B-Instruct Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Move RoPE position_ids slicing to model level Move position_ids slicing back into Llama3RotaryEmbedding.forward() so it happens once per forward pass, not redundantly in every attention layer. Also update ad-model-onboard skill to reflect this convention. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* [None][feat] Add AutoDeploy custom model for Qwen3.5 dense (Qwen/Qwen3.5-0.8B) Add prefill-only custom model implementation for the Qwen3.5 dense architecture (hybrid GatedDeltaNet linear attention + full attention, mRoPE, gated attention output, tied embeddings). Includes multimodal wrapper (vision + language) matching the HF checkpoint hierarchy. Files: - modeling_qwen3_5.py: Full model with AD custom ops - test_qwen3_5_modeling.py: Hierarchical equivalence tests - __init__.py: Registration Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Address PR feedback on Qwen3.5 dense model - Use torch_rmsnorm and torch_rmsnorm_gated AD custom ops instead of plain PyTorch for RMSNorm and GatedRMSNorm - Add comments explaining why config classes are bundled (qwen3_5 not in installed transformers, requires >= 4.58) - Add comment explaining why no AD rope op is used for mRoPE (existing torch_rope_* ops don't support interleaved mRoPE with 3D position_ids) - Remove unnecessary _split_fused_projections load hook from GatedDeltaNet (Qwen3.5 checkpoint already uses split projections; HF Qwen3Next fused format conversion is test-only) Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Accept runtime position_ids in Qwen3.5 forward path The multimodal wrapper was ignoring runtime-provided position_ids and always computing its own via torch.arange, which breaks during decode when the KV cache has accumulated tokens. Now: - Text-only: uses runtime position_ids directly (expanded 2D->3D for mRoPE) - Multimodal: computes spatial (T,H,W) positions via get_rope_index - Fallback: sequential positions when no position_ids provided This fixes garbled output during generation, especially for the 27B model with tensor parallelism. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][feat] Add Qwen3.5 dense model registry config with manual TP sharding Add qwen3.5_dense.yaml with manual TP sharding plan matching the MoE model's pattern (delta for GDN, colwise/rowwise for attention and MLP). Update models.yaml to use this config for both 0.8B and 27B models. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Add GQA support to GatedDeltaNet ops and remove repeat_interleave from model Move the repeat_interleave for GQA (num_v_heads > num_k_heads) from the model code into the GDN ops themselves, so that: - The model passes Q/K with their native head counts to the op - The uncached op, torch cached backend, and FLA cached backend all handle the GQA expansion internally via repeat_interleave - Cache initializers use V's head count (not K's) for the recurrent state shape This allows proper TP sharding of the GDN layer since the sharding transform sees the original head counts rather than the expanded ones. 
Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Address PR feedback on Qwen3.5 dense model - Minimize qwen3.5_dense.yaml to only overrides from defaults - Put dashboard_default.yaml first in yaml_extra - Remove repeat_interleave from model code (GDN op handles GQA) - Revert FLA backend GQA handling (FLA kernel handles it internally) - Remove position_ids fallback in Qwen3_5Model.forward — raise error instead, with TODO comment for mRoPE position delta cache transform - Remove position_ids fallback in Qwen3_5TextModel.forward Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Fix FLA cached GDN head count for GQA (g/beta use num_v_heads) The FLA cached backend was using Q's head count (num_k_heads) to reshape g, beta, and V tensors, but these have num_v_heads dimensions. Use separate head counts for Q/K vs V/g/beta to correctly handle GQA where num_v_heads > num_k_heads. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Skip delta sharding for GDN layer to isolate TP issue Disable GDN (delta) sharding in the manual TP plan to test if the garbled output on 27B TP=4 is caused by the delta sharding transform. Only attention and MLP layers are sharded. Also add assertion for unsupported vision path per reviewer feedback. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Fix tie_word_embeddings default and re-enable delta sharding Root cause of 27B garbled output: Qwen3_5Config defaulted tie_word_embeddings=True but the 27B checkpoint has tie_word_embeddings=False with a separate lm_head.weight. This caused embed_tokens to be overwritten with lm_head weights during loading. - Change tie_word_embeddings default to False in both Qwen3_5TextConfig and Qwen3_5Config (the 0.8B checkpoint explicitly sets True, the 27B sets False — letting the checkpoint value take precedence) - Re-enable delta sharding in qwen3.5_dense.yaml Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* codex updates Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com> * update reviewer Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com> * updates Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com> --------- Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
Adds a prefill-only custom model for Starcoder2 (bigcode/starcoder2-7b and variants) compatible with AutoDeploy torch.export pipeline. Note: starcoder2-7b struggles with non-code questions, but this seems to be an artifact of the model. Key implementation details: - GQA (36 Q / 4 KV heads in 7B) via torch_attention canonical op - Standard RoPE via torch_rope_with_explicit_cos_sin with _ad_ buffers - 4096-token sliding window passed directly to torch_attention - LayerNorm normalization (not RMSNorm) - Vanilla GELU MLP (no gating) - Config imported from transformers (Starcoder2Config) Tested with hierarchical equivalence tests (14/14 passing) covering block, layer, full-model, and export levels on both CPU and CUDA. E2e run on bigcode/starcoder2-7b (2 GPUs) produces coherent code completions. Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
* [None][feat] Add AutoDeploy custom model for Skywork-R1V2
Adds a prefill-only AutoDeploy custom model for Skywork/Skywork-R1V2-38B,
an InternVL-based VLM with a Qwen2-38B LLM backbone. Only the text path
is exported; the vision tower remains in eager mode.
Key implementation details:
- Bundles SkyworkChatConfig (internvl_chat model_type, not in standard
transformers) with tie_word_embeddings=False to avoid post_init tying
lm_head to embed_tokens
- GQA attention with bias on Q/K/V projections (Qwen2 style)
- Uses AD canonical ops: torch_rmsnorm, torch_rope_with_explicit_cos_sin,
torch_attention (bsnd layout, GQA-native)
- RoPE buffers use _ad_ prefix and return full cached tables; attention
slices by position_ids downstream (D2/D3 convention)
- Vision weights (vision_model.*, mlp1.*) silently skipped via strict=False
Tests: 14/14 pass (RMSNorm, MLP, Attention, DecoderLayer, FullModel
equivalence + torch.export with dynamic shapes)
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
* [None][feat] Address PR review: use HF config directly for Skywork-R1V2
- Replace bundled SkyworkChatConfig with a lazy loader that loads the real
config from the HF checkpoint cache via trust_remote_code=True; this avoids
maintaining a copy that could drift from the upstream config
- Remove AutoConfig.register (HF auto_map handles config registration; our
bundled config also had the wrong model_type 'internvl_chat' instead of
'skywork_chat')
- Remove arbitrary defaults from SkyworkR1V2RotaryEmbedding.__init__; values
always come from config.max_position_embeddings and config.rope_theta
- Restore original import order in custom/__init__.py (revert linter reorder)
- Update tests: add architectures=["Qwen2ForCausalLM"] to the small Qwen2
config (required by HF SkyworkChatConfig.__init__), forward tie_word_embeddings
from llm_config to the outer SkyworkChatConfig to prevent post_init() from
incorrectly tying lm_head to embed_tokens, add pytest.skip if HF config not
in local cache, fix model_type assertion to "skywork_chat"
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
* [None][test] Improve Skywork-R1V2 test: explain reference impls and strengthen state_dict check
- Add block comment explaining why HF Qwen2 reference implementations are
inline (SDPA/flash backend non-determinism in HF Qwen2Attention, private
module path stability)
- Strengthen test_skywork_r1v2_state_dict_keys: assert all keys start with
'language_model.' instead of two specific deny-list prefix checks
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
* [None][feat] Remove SkyworkChatConfig dependency from modeling and tests
SkyworkChatConfig uses trust_remote_code and requires the HF checkpoint to
be present in the local cache. Neither the modeling file nor the test file
need to import or instantiate this class:
Modeling (modeling_skywork_r1v2.py):
- Remove _load_skywork_chat_config_cls() and module-level SkyworkChatConfig var
- Set config_class = None (not needed by AD's _from_config load path)
- SkyworkR1V2ForCausalLM.__init__ uses getattr(config, 'llm_config', config):
at runtime AD passes a SkyworkChatConfig (returns config.llm_config);
in tests a Qwen2Config can be passed directly (fallback returns config itself)
- Add comment to registration explaining why the string key is required
Tests (test_skywork_r1v2_modeling.py):
- Remove SkyworkChatConfig import and module-level skip guard
- Remove _create_small_chat_config() helper
- All tests pass Qwen2Config directly to SkyworkR1V2ForCausalLM
- Remove test_skywork_r1v2_config_parsing (was testing SkyworkChatConfig
internals, not the AD custom model behavior)
- Tests now run unconditionally without the HF checkpoint in local cache
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
* [None][feat] Address PR review: use HF config directly for Skywork-R1V2
Modeling (modeling_skywork_r1v2.py):
- Remove lift_to_meta reference from SkyworkR1V2RotaryEmbedding docstring
Tests (test_skywork_r1v2_modeling.py):
- Load SkyworkChatConfig via AutoConfig.from_pretrained (trust_remote_code=True,
local_files_only=True) — same path AutoDeploy's factory takes at runtime
- Skip all tests if the checkpoint is not in the local HF cache
- Restore _create_small_chat_config() and test_skywork_r1v2_config_parsing
- Full model, export, and structural tests use SkyworkChatConfig to exercise
the same code path as production; block/layer tests use Qwen2Config directly
- architectures=["Qwen2ForCausalLM"] required in llm_config dict because
SkyworkChatConfig.__init__ indexes llm_config.get('architectures')[0]
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
* some changes
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
* [None][feat] Extract HF config defaults as named constants in Skywork-R1V2 model
Replace inline fallback literals in getattr calls with named constants
sourced from the HuggingFace remote-code config files, and fix the
ps_version default (v1, not v2).
_HF_DEFAULT_SELECT_LAYER from configuration_skywork_chat.py
_HF_DEFAULT_DOWNSAMPLE_RATIO from configuration_skywork_chat.py
_HF_DEFAULT_PS_VERSION from configuration_skywork_chat.py (was "v2", now "v1")
_HF_DEFAULT_NORM_TYPE from configuration_skywork_vit.py
_HF_DEFAULT_INITIALIZER_FACTOR from configuration_skywork_vit.py
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
* [None][chore] Revert SKILL.md change from Skywork-R1V2 onboarding
The ad-model-onboard skill update is tracked separately.
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
* [None][test] Use HF Qwen2 classes directly in Skywork-R1V2 tests
Replace fully-inline reference implementations with imports from
transformers where possible:
- Qwen2RMSNorm (replaces _HFQwen2RMSNorm)
- Qwen2MLP (replaces _HFQwen2MLP)
- Qwen2ForCausalLM (replaces _HFQwen2ForCausalLM; converter simplified
to a single dict comprehension)
- Qwen2RotaryEmbedding used for attention/layer tests where the
inline rotary is still needed
Keep inline _InlineQwen2Attention / _InlineQwen2DecoderLayer /
_InlineQwen2RotaryEmbedding for the standalone attention and decoder-layer
equivalence tests: SkyworkR1V2Attention uses AD custom ops
(torch_rope_with_explicit_cos_sin + torch_attention in bsnd layout) whose
numerical output differs from HF's standard SDPA even with identical
weights, so the inline reference matching the AD convention is necessary
for a tight tolerance check.
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
* [None][test] Replace inline Qwen2 impls with HF classes in Skywork-R1V2 tests
Use Qwen2Attention, Qwen2DecoderLayer, and Qwen2RotaryEmbedding directly from
transformers instead of hand-rolled inline reference classes. Set
_attn_implementation="eager" to force manual matmul+softmax (same numerical
path as all other AD model tests) and pass an explicit additive causal mask so
the HF eager path matches AD's is_causal=True convention.
Removes ~100 lines of redundant inline code and aligns the test structure with
the established Llama3/Qwen3 pattern.
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
* [None][style] Remove leading underscore from vision classes in Skywork-R1V2
Rename _VisionRMSNorm, _VisionEmbeddings, _VisionAttention, _VisionMLP,
_VisionEncoderLayer, _VisionEncoder, _VisionModel to public names per
PR review feedback.
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
* [None][refactor] Rename SkyworkR1V2ForCausalLM to ForConditionalGeneration, fold PreTrainedModel base
- Rename SkyworkR1V2ForCausalLM -> SkyworkR1V2ForConditionalGeneration to
reflect that the model includes a vision tower (follows AD VLM convention
used by Qwen3_5ForConditionalGeneration, KimiK25ForConditionalGeneration)
- Remove intermediate SkyworkR1V2PreTrainedModel base class; fold its
class attributes and _init_weights into SkyworkR1V2ForConditionalGeneration
directly (only one derived class, so no sharing benefit)
- Update registration and all test references accordingly
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
* [None][chore] Remove stale Qwen2Config unit-test-path comments in Skywork-R1V2
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
---------
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Add AutoDeploy custom model implementation covering both the cohere (aya-expanse) and cohere2 (command-a) model families. Key architectural features handled: - Interleaved RoPE via torch_rope_with_qk_interleaving canonical op - Parallel attention + MLP pattern (single LayerNorm, both branches read from same normed input) - LayerNorm (not RMSNorm) with mean subtraction - Logit scaling on output logits - Cohere2 sliding window attention pattern with conditional RoPE (RoPE only on sliding attention layers, not full attention layers) Models covered in model_registry (already existed): - CohereForAI/aya-expanse-8b (world_size=2) - CohereForAI/aya-expanse-32b (world_size=4) - CohereLabs/c4ai-command-a-03-2025 (world_size=8) - CohereLabs/command-a-reasoning-08-2025 (world_size=8) - CohereLabs/command-a-translate-08-2025 (world_size=8) Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
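For readers unfamiliar with the parallel attention + MLP pattern mentioned above, a minimal sketch in plain PyTorch (module and attribute names are illustrative, not the actual AD implementation):

```python
import torch.nn as nn

class ParallelBlockSketch(nn.Module):
    # One pre-norm feeds both branches; attention and MLP outputs are summed into the residual.
    def __init__(self, hidden_size: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(hidden_size)  # Cohere uses LayerNorm, not RMSNorm
        self.attn, self.mlp = attn, mlp

    def forward(self, x, **attn_kwargs):
        h = self.input_layernorm(x)
        return x + self.attn(h, **attn_kwargs) + self.mlp(h)

# At the output head, logits are additionally scaled by the config's logit scale:
#   logits = lm_head(hidden_states) * config.logit_scale
```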
* hunyuan moe model support Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com> * further update the hunyuan model Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com> --------- Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com> Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Add AutoDeploy custom model implementation for the IBM Granite dense model family (granite-3.0/3.1/3.3 + guardian variants). Granite is architecturally similar to Llama but has four extra scaling factors: embedding_multiplier, residual_multiplier, attention_multiplier, and logits_scaling. The custom model uses AD canonical ops (torch_rmsnorm, torch_rope_with_explicit_cos_sin, torch_attention) for export. Also fixes a bug in the trtllm attention backend where the q_scaling parameter was hardcoded to 1.0, ignoring the model's custom attention scale. This caused wrong results for any model whose attention scale differs from the default 1/sqrt(head_dim) — e.g. Granite uses attention_multiplier = 1/head_dim. Files added: - tensorrt_llm/_torch/auto_deploy/models/custom/modeling_granite.py - tests/unittest/auto_deploy/singlegpu/models/test_granite_modeling.py Files modified: - tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py - tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
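As a rough illustration of where Granite's four scaling factors enter the forward pass, here is a sketch assuming HF Llama-style submodule and config attribute names (not the exact AD implementation; mask and position plumbing omitted):

```python
import torch.nn as nn

def granite_forward_sketch(model: nn.Module, config, input_ids):
    # embedding_multiplier scales the token embeddings
    h = model.embed_tokens(input_ids) * config.embedding_multiplier
    for layer in model.layers:
        # residual_multiplier damps every residual branch
        h = h + config.residual_multiplier * layer.self_attn(layer.input_layernorm(h))
        h = h + config.residual_multiplier * layer.mlp(layer.post_attention_layernorm(h))
    h = model.norm(h)
    # attention_multiplier replaces the default 1/sqrt(head_dim) scale inside the attention op;
    # logits_scaling divides the final logits
    return model.lm_head(h) / config.logits_scaling
```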
Add a lean, prefill-only custom model for the SmolLM3 family (HuggingFaceTB/SmolLM3-3B, SmolLM3-3B-Base). SmolLM3 is a Llama-like dense model with GQA and its distinguishing feature: NoPE (No Position Embedding) layers where every 4th layer skips RoPE entirely. Uses AD canonical ops (torch_rmsnorm, torch_rope_with_explicit_cos_sin, torch_attention) and handles per-layer RoPE toggling via config.no_rope_layers. Includes hierarchical equivalence tests covering both RoPE and NoPE layer types, plus torch.export with dynamic shapes. Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
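A minimal sketch of the per-layer RoPE toggle described above (the helper name and the exact encoding of config.no_rope_layers are assumptions and should be verified against the SmolLM3 checkpoint config):

```python
def maybe_apply_rope(q, k, cos, sin, layer_idx, no_rope_layers, rope_fn):
    # no_rope_layers: per-layer flags from config.no_rope_layers; a falsy entry marks a
    # NoPE layer that skips rotary embeddings entirely (every 4th layer in SmolLM3).
    if no_rope_layers[layer_idx]:
        return rope_fn(q, k, cos, sin)
    return q, k
```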
Add AutoDeploy custom model implementation for the Gemma 2 family (google/gemma-2-2b-it, gemma-2-9b-it, gemma-2-27b-it). Key Gemma 2 architecture features handled: - RMSNorm with (1 + weight) scaling (zero-initialized weights) - 4 layer norms per decoder layer (pre/post attention, pre/post feedforward) - Attention logit softcapping via torch_attention's logit_cap parameter - Final logit softcapping on lm_head output - Alternating sliding window / full attention layers - Custom attention scaling via query_pre_attn_scalar - Embedding normalization by sqrt(hidden_size) - GQA with explicit head_dim (can differ from hidden_size / num_heads) - gelu_pytorch_tanh activation - Tied word embeddings (lm_head shares weights with embed_tokens) Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
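Both softcapping steps above reduce to the same tanh squashing; a minimal sketch (the cap values are the usual HF Gemma 2 config attributes, assumed here):

```python
import torch

def softcap(x: torch.Tensor, cap: float) -> torch.Tensor:
    # Smoothly squashes values into (-cap, cap). Applied to attention logits
    # (attn_logit_softcapping) and to the lm_head output (final_logit_softcapping).
    return cap * torch.tanh(x / cap)
```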
* [None][feat] Add AD custom model for Qwen3 MoE family Replace the qwen3.py export patch with a proper custom model (modeling_qwen3_moe.py) for the qwen3_moe architecture family. Models covered: Qwen3-30B-A3B, Qwen3-30B-A3B-Instruct-2507, Qwen3-235B-A22B, Qwen3-235B-A22B-Instruct-2507, Qwen3-Coder-30B-A3B-Instruct. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][test] Fix Qwen3 MoE attention equivalence test determinism Pass an explicit additive causal mask to HF attention instead of attention_mask=None. With None, HF eager dispatches to F.scaled_dot_product_attention(is_causal=True) which can choose non-deterministic CUDA backends across runs. An explicit float -inf upper-triangular mask forces the deterministic additive-mask path that matches torch_attention's manual causal masking exactly. Follows the same pattern established in test_skywork_r1v2_modeling.py. Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com> Co-authored-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
* [None][feat] Add AD custom model for Qwen2 family
Add a lean prefill-only custom model implementation for the Qwen2 architecture
family, covering Qwen2.5, QwQ, DeepSeek-R1-Distill-Qwen, and related models.
The Qwen2 architecture is similar to Llama3 (GQA + SwiGLU MLP + RMSNorm + RoPE)
with the key difference that Q/K/V projections have bias=True while O and MLP
projections have bias=False.
Uses AD canonical ops:
- torch_rmsnorm for normalization
- torch_rope_with_explicit_cos_sin for rotary embeddings
- torch_attention for grouped-query attention (no repeat_kv needed)
Files:
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen2.py
- tests/unittest/auto_deploy/singlegpu/models/test_qwen2_modeling.py
Registered for Qwen2Config covering all qwen2-based models in the registry:
Qwen2.5-{0.5B,1.5B,3B,7B,14B,72B}, Qwen2.5-Coder, Qwen2.5-Math, QwQ-32B,
DeepSeek-R1-Distill-Qwen-{1.5B,7B,14B,32B}, Skywork-SWE-32B,
OpenReasoning-Nemotron-32B, r1-1776-distill-qwen-32b.
Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* [None][fix] Fix causal mask in Qwen2 standalone attention/decoder-layer tests
HF eager_attention_forward skips masking when attention_mask=None, making
the HF reference run non-causal (full) attention. The custom AD attention
uses is_causal=True. This mismatch caused RMSE ratio ~0.93 in the
standalone attention and decoder-layer equivalence tests.
Fix: build an additive causal mask and pass it to the HF attention call
in test_qwen2_attention_equivalence and test_qwen2_decoder_layer_equivalence.
The full-model tests are unaffected since Qwen2Model.forward builds
the causal mask internally for both HF and custom paths.
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
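For context, a sketch of the kind of additive causal mask the fix above passes to the HF eager attention path (illustrative plain PyTorch; the actual test helper may differ):

```python
import torch

def make_causal_mask(seq_len: int, dtype=torch.float32, device="cpu") -> torch.Tensor:
    # Additive mask: 0 on and below the diagonal, -inf above it. Broadcast to
    # [batch, heads, seq, seq] and added to the attention scores, it reproduces
    # is_causal=True behavior deterministically.
    mask = torch.full((seq_len, seq_len), float("-inf"), dtype=dtype, device=device)
    return torch.triu(mask, diagonal=1)[None, None, :, :]
```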
* [None][fix] Use _get_hf_rotary_class() helper instead of redundant module-level import
Remove the module-level try/except import of HFQwen2RotaryEmbedding and use
the HFRotary variable from _get_hf_rotary_class() already called in the test.
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
---------
Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Co-authored-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
* ckpt Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com> * [None][model] onboard Gemma 3 27B IT to AutoDeploy Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com> * [None][model] address Gemma 3 review feedback Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com> * [None][model] fix Gemma 3 sliding-window runtime handling Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com> * [None][model] clarify Gemma sliding window backend contract Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com> * [None][model] drop Gemma reduced-layer registry configs Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com> --------- Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
* [None][chore] Update AD onboard skill and reviewer agent - SKILL.md: Add GPU memory sanity check (Phase 0 Step 0) before onboarding - SKILL.md: Clarify HF reference strategy — import from HF cache via importlib instead of copying class definitions into test files - SKILL.md: Tighten Phase 9/10/11 to consolidate --use-registry guidance and remove duplicate mandatory-warning blocks; simplify summary report format - SKILL.md: Phase 4 note that AD factory already calls AutoConfig.from_pretrained - ad-onboard-reviewer.md: Add BB section for vision/multi-modal support checks - ad-onboard-reviewer.md: Clarify B2 custom config justification criteria - ad-onboard-reviewer.md: F4 — no standalone HF-like class definitions in tests; must import from actual HF source Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com> * [None][chore] Address PR review feedback on skill/agent updates - SKILL.md line 110: Relax "always use HF classes" to "use if they exist" - SKILL.md line 112: Restore fallback to copying from HF source when HF cache unavailable; keep strong preference for HF cache import first - SKILL.md Phase 9: Restore⚠️ MANDATORY --use-registry block - SKILL.md Phase 9 failure steps: Restore bold registry config yaml, items 4+5 - SKILL.md Phase 10: Restore⚠️ MANDATORY block and item 9 (raw prompts) - SKILL.md Phase 11: Restore MUST list with raw prompts requirement - SKILL.md Key Gotchas: Restore bold MUST/NEVER RoPE cos/sin guidance - reviewer F4: Relax to flag standalone class defs only when viable HF alt exists Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com> * [None][chore] Address second round of PR review feedback - reviewer BB3 → BB2 (nit rename) - SKILL.md line 205: remove "first sentence" about --use-registry flag - SKILL.md line 112: drop overly specific importlib/sys.path sentence Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com> * [None][chore] Require pytest results to be included in PR Phase 11: agent must run the pytest command on the latest commit and include results in the PR, not just provide the command. Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com> * [None][chore] Add E2E sanity check note for custom config class in Phase 4 Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com> * [None][chore] Reviewer B2: replace Eagle example with E2E heuristic for custom config Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com> * [None][chore] Reviewer B2: remove E2E heuristic, keep AutoConfig.from_pretrained check Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com> * [None][chore] Revert HF cache approach; restore copy-into-test-file for missing HF modules HF cache paths are not dependable in CI. When HF modules are not in the installed transformers, copy minimal faithful class definitions from the HF source into the test file instead of loading from HF cache. Update reviewer F4 accordingly. Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com> * [None][chore] Extend HF reference strategy to cover trust_remote_code config classes Clarify that the "copy into test file" pattern applies to config classes too: when a model's config uses trust_remote_code (not in transformers), copy a minimal faithful version into the test file rather than loading from the HF cache. 
The modeling file itself should not import the config. Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com> * [None][chore] Address PR review comments on skill config/test guidance - Remove overly specific examples from config import note (line 96) - Simplify trust_remote_code config guidance to avoid multi-modal specifics (line 112) Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com> --------- Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
…242) * [None][fix] Remove HF cache dependency from Skywork-R1V2 unit tests Replace SkyworkChatConfig (loaded via AutoConfig with trust_remote_code, which requires the checkpoint in the local HF cache) with a plain Qwen2Config passed directly to SkyworkR1V2ForConditionalGeneration. The model's __init__ already has a fallback: llm_config = getattr(config, "llm_config", config) so passing Qwen2Config directly works without any wrapper config. The vision tower is simply not instantiated (no vision_config attr). This makes all tests runnable in CI without requiring the full 38B checkpoint in the local HF cache. Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com> * [None][fix] Remove HF cache dependency from Skywork-R1V2 unit tests Replace AutoConfig.from_pretrained(..., local_files_only=True) with minimal faithful copies of SkyworkChatConfig and SkyworkVisionConfig defined in the test file (same pattern used for HF modeling classes not in transformers). This removes the module-level pytest.skip that silently skipped all Skywork tests in CI when the 38B checkpoint was absent. Tests now run without any HF checkpoint, while still exercising the real config- wrapping behavior (nested llm_config, vision_config, vision weight keys). Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com> --------- Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
* [None][feat] Add AD custom model for Seed-OSS family Add prefill-only custom model implementation for ByteDance-Seed/Seed-OSS-36B-Instruct using AutoDeploy canonical ops (torch_attention, torch_rope_with_explicit_cos_sin, torch_rmsnorm). Seed-OSS is a dense Llama-style model with GQA (80 Q / 8 KV heads), SwiGLU MLP, and attention_bias=True on Q/K/V projections. Includes hierarchical equivalence tests (MLP, Attention, Decoder Layer, Full Model, Export) comparing against HF reference implementation. Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Add causal mask to HF reference in attention/decoder layer tests The HF eager attention does NOT apply causal masking when attention_mask=None, while our custom model always uses is_causal=True. Provide explicit causal mask to HF reference to ensure equivalent comparison. Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Add AutoDeploy custom model for the Mistral family (model_type="mistral"), covering Mistral-7B, Codestral-22B, Mistral-Small-24B, Mistral-Large, and NeMo-Minitron models. Uses AD canonical ops for RMSNorm, RoPE, and GQA attention with sliding window support. - New modeling_mistral.py with prefill-only export-compatible implementation - Hierarchical equivalence tests (block, layer, full model, export) - Registers against MistralConfig from transformers Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* [None][feat] Add AD custom model for Qwen3-Next family Replace patch-based approach with a self-contained custom model for the Qwen3-Next (qwen3_next) architecture. The custom model uses AD canonical ops (torch_attention, torch_moe, torch_gated_delta_rule, torch_causal_conv1d, torch_l2norm, torch_rmsnorm, torch_rmsnorm_gated, torch_rope_with_explicit_cos_sin) for prefill-only export via torch.export. Key architecture features handled: - Hybrid linear attention (GatedDeltaNet) + full attention layers - MoE with softmax+topk routing and shared expert (sigmoid gate) - Partial RoPE (25% of head_dim) via canonical rope op - Gated attention output (q_proj 2x -> query + gate) - (1+w) RMSNorm parameterization with load-time offset hook - Fused GDN projections (in_proj_qkvz, in_proj_ba) matching checkpoint Files: - New: modeling_qwen3_next.py (custom model) - New: test_qwen3_next_modeling.py (hierarchical equivalence tests) - Removed: patches/qwen3_next.py (replaced by custom model) - Removed: test_qwen3_next_patches.py, test_qwen3_next_gdn_patches.py - Updated: models.yaml registry entry, custom/__init__.py Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Move position_ids slicing to model level Move cos/sin slicing by position_ids from per-layer attention forward to Qwen3NextModel.forward, so it's done once per forward pass instead of redundantly in every attention layer. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
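Partial RoPE as described above rotates only a leading slice of each head; a minimal sketch (assuming rotary_dim = head_dim // 4 for Qwen3-Next and a generic rope_fn, both illustrative):

```python
import torch

def apply_partial_rope(q, k, cos, sin, rotary_dim, rope_fn):
    # Rotate only the first rotary_dim channels of each head; pass the rest through unchanged.
    q_rot, q_pass = q[..., :rotary_dim], q[..., rotary_dim:]
    k_rot, k_pass = k[..., :rotary_dim], k[..., rotary_dim:]
    q_rot, k_rot = rope_fn(q_rot, k_rot, cos, sin)
    return torch.cat([q_rot, q_pass], dim=-1), torch.cat([k_rot, k_pass], dim=-1)
```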
* [None][feat] Add AD custom model for EXAONE 3.5 family Add prefill-only custom model implementation for the EXAONE 3.5 family (LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct, EXAONE-3.5-32B-Instruct) using AutoDeploy canonical ops (torch_rmsnorm, torch_rope_with_explicit_cos_sin, torch_attention). EXAONE is a dense GQA transformer with non-standard naming conventions (wte, h, attn.attention, c_fc_0/c_fc_1/c_proj). Includes bundled ExaoneConfig (not in installed transformers), hierarchical equivalence tests, model-specific registry config with trust_remote_code:false and dtype:bfloat16, and export compatibility verification. Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][feat] Move position_ids slicing to RotaryEmbedding (once) per reviewer feedback Slice cos/sin by position_ids once in RotaryEmbedding.forward() instead of per-layer in attention. Remove position_ids from attention/block/layer forward signatures. This matches the Llama3 AD model pattern. Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* [None][feat] Add AD custom model for InternLM3 family Add a lean prefill-only custom model implementation for the InternLM3 architecture (GQA + SwiGLU MLP + RMSNorm + dynamic NTK-scaled RoPE) using AutoDeploy canonical ops (torch_attention, torch_rmsnorm, torch_rope_with_explicit_cos_sin). Includes hierarchical equivalence tests (block, layer, full model, export) and bundles a minimal InternLM3Config since the model is not natively in transformers. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][feat] Address review: remove bundled config, document inline refs Remove the bundled InternLM3Config from the modeling file. The AD pipeline loads the config from the HF checkpoint via trust_remote_code=True (same pattern as DeciLM). The test file now loads InternLM3Config dynamically from the HF cache. Inline HF reference classes are kept because the HF modeling_internlm3.py cannot be imported on the installed transformers version (requires LossKwargs from transformers >=4.48). Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* [None][feat] Add AD custom model for GLM-4 MoE family (glm4_moe) Add AutoDeploy custom model implementation for the GLM-4 MoE architecture (model_type: glm4_moe), covering zai-org/GLM-4.6 and zai-org/GLM-4.7. Key architectural features: - GQA attention (96 Q heads, 8 KV heads, head_dim=128) - Partial rotary embeddings (partial_rotary_factor=0.5) - Per-head QK normalization (RMSNorm) - MoE with sigmoid gating, group top-k routing (160 experts, top-8) - First 3 layers dense, rest MoE with shared experts Uses AD canonical ops: torch_attention, torch_rope_with_explicit_cos_sin, torch_moe, torch_rmsnorm. Includes hierarchical unit tests (19 tests): block, layer, full model equivalence against HF reference, plus torch.export test. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][feat] Add glm4_moe registry config with reduced layers and expert export limit Add glm4_moe.yaml config for GLM-4 MoE family models (GLM-4.6, GLM-4.7): - num_hidden_layers: 5 (reduce from 92 for CI/testing) - num_moe_experts_for_export: 2 (reduce from 160 for export) Update registry entries for both GLM-4.6 and GLM-4.7 to use the new config. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Fix device mismatch in GLM4 MoE gate buffer for export Fix e_score_correction_bias buffer to not hardcode dtype, and cast to match scores tensor device/dtype in forward. This prevents a device mismatch (cpu vs meta) during the MoE expert reduction pre-trace in export_to_gm. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][feat] Use noaux_tc_op for GLM4 MoE gate and split registry configs Address reviewer feedback: - Use torch.ops.trtllm.noaux_tc_op for fused sigmoid + bias + group top-k routing instead of vanilla PyTorch - Move num_hidden_layers out of glm4_moe.yaml, use num_hidden_layers_5.yaml separately in models.yaml entries - Ensure e_score_correction_bias is passed as float32 to noaux_tc_op Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
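A rough sketch of the sigmoid + group top-k routing described above in vanilla PyTorch (a simplified illustration of the noaux_tc-style scheme, not the fused trtllm kernel; tensor shapes and tie-breaking details may differ from the real implementation):

```python
import torch

def group_limited_topk(logits, bias, n_group, topk_group, top_k):
    # logits: [tokens, experts] router outputs; bias: [experts] e_score_correction_bias
    scores = torch.sigmoid(logits)
    biased = scores + bias
    # Score each expert group by the sum of its two best experts, keep topk_group groups.
    grouped = biased.view(biased.size(0), n_group, -1)
    group_scores = grouped.topk(2, dim=-1).values.sum(-1)
    keep = group_scores.topk(topk_group, dim=-1).indices
    group_mask = torch.zeros_like(group_scores).scatter_(1, keep, 1.0)
    expert_mask = group_mask.unsqueeze(-1).expand_as(grouped).reshape_as(biased)
    # Pick top_k experts among the surviving groups; routing weights use the unbiased scores.
    topk_idx = biased.masked_fill(expert_mask == 0, float("-inf")).topk(top_k, dim=-1).indices
    weights = scores.gather(1, topk_idx)
    weights = weights / weights.sum(-1, keepdim=True).clamp_min(1e-20)
    return topk_idx, weights
```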
* [None][feat] Add AD custom model for MiniMax-M2 family Replace the existing MiniMax-M2 MoE patch with a full custom model implementation using AD canonical ops. Covers both MiniMaxAI/MiniMax-M2 and MiniMaxAI/MiniMax-M2.5 (same architecture, model_type: minimax_m2). Key architecture features: - MoE with 256 experts, top-8, sigmoid routing + e_score_correction_bias - GQA (48 Q heads, 8 KV heads, head_dim=128) - Partial RoPE (rotary_dim=64 out of head_dim=128) - Per-layer QK normalization (RMSNorm on full Q/K before reshape) - FP8 block-wise quantized checkpoint Canonical ops used: torch_rmsnorm, torch_rope_with_explicit_cos_sin, torch_attention (GQA-native, no repeat_kv), torch_moe. Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Disable fuse_finegrained_fp8_moe for MiniMax-M2 in registry config The trtllm fused MoE kernel fails with NVRTC compilation error for MiniMax-M2's MoE configuration (256 experts, block-wise FP8). Add transform disablement and torch-simple compile backend to the model registry config so --use-registry works out of the box. Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* [None][feat] Add AD custom model for GLM MoE DSA family (GLM-5) Add prefill-only AutoDeploy custom model for the glm_moe_dsa architecture (zai-org/GLM-5, zai-org/GLM-5-FP8). The model uses Multi-head Latent Attention (MLA) and Mixture of Experts (MoE) with noaux_tc-style sigmoid routing, similar to DeepSeek-V3. Key implementation details: - Bundled GlmMoeDsaConfig (not yet in transformers) - Uses canonical AD ops: torch_rmsnorm, torch_mla, torch_moe, torch_rope_with_explicit_cos_sin - Vanilla PyTorch noaux_tc router (sigmoid + group topk + normalize) - Shared rotary embedding at model level with _ad_ buffer prefix - RoPE weight de-interleaving via mla_rope_utils load hook - TokenizersBackend alias for GLM-5-FP8 tokenizer compatibility - DSA indexer and MTP layers skipped (not needed for prefill) Includes hierarchical equivalence tests (MLP, MoE, Attention, Decoder Layer, Full Model, Export) against standalone HF-faithful reference implementations. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * Address PR review feedback - Add num_hidden_layers_5.yaml to GLM-5 registry entries for dashboard runs - Switch MoE gate to torch.ops.trtllm.noaux_tc_op (fused routing kernel) Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Add a lean, prefill-only custom model for the Llama 4 family (Scout-17B-16E, Maverick-17B-128E) using AutoDeploy canonical ops. Replaces the previous MoE/vision patches with a self-contained implementation. Key features: - GQA with complex-frequency RoPE (torch_rope_with_complex_freqs) - NoPE layers with attention temperature tuning - L2 QK normalization on RoPE layers (mean-based, plain PyTorch) - MoE with stacked expert weights (bmm) matching HF checkpoint format; AD MatchBmmMoePattern transform handles conversion at deployment - Multimodal wrapper (ForConditionalGeneration) for weight compat - Fix multimodal processor chat template for text-only prompts Includes hierarchical unit tests (block, layer, full model, export) covering RoPE/NoPE layers, MoE/dense layers, and dynamic shapes. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* [None][feat] Add AD custom model for GPT-OSS family Add AutoDeploy custom model implementation for the GPT-OSS MoE model family (openai/gpt-oss-20b, openai/gpt-oss-120b), replacing the previous gptoss-mxfp4 export patch with a full custom model. Key architectural features: - GQA attention with learnable per-head attention sinks - Custom clamped SwiGLU activation with per-expert biases - YaRN RoPE with precomputed cos/sin cache - MXFP4 checkpoint dequantization via load_state_dict pre-hook - Alternating sliding-window and full-attention layers Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Address reviewer feedback for GPT-OSS custom model - Restore torch_moe_router canonical op (fix fake kernel to use hidden_states.device instead of hardcoded "meta") - Add gpt_oss.yaml config with attn_backend: torch (sinks only supported by torch_attention, not flashinfer/trtllm backends) - Fix sinks TP slicing in torch_backend_attention.py for multi-GPU - Add CUDA graph capture guard for empty batch_sizes list - Add eager fallback in CapturedGraph.forward() when no graphs captured Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Use torch-simple backend and attention DP per review - Use compile_backend: torch-simple instead of disabling CUDA graphs - Use enable_attention_dp: true for attention replication (avoids sinks TP sharding issue without modifying torch_backend_attention) - Revert torch_cudagraph.py and torch_backend_attention.py changes - Add TODO note about sinks TP sharding limitation Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * [None][fix] Use simple shard + BMM and fix chat template for GPT-OSS - Use simple_shard_only + bmm sharding per reviewer feedback (uses all_gather for functional multi-GPU support) - Guard multimodal content-to-list conversion in llm.py with hasattr(processor, "image_processor") to fix TypeError in text-only model chat templates (e.g., GPT-OSS) Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
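A sketch of how learnable per-head attention sinks enter the softmax, based on the description above (illustrative only, not the exact torch_attention backend code; causal masking omitted for brevity):

```python
import torch

def attention_with_sinks(q, k, v, sinks, scale):
    # q, k, v: [batch, heads, seq, head_dim]; sinks: [heads] learnable per-head logits.
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale             # [b, h, s_q, s_k]
    sink_col = sinks.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)  # one extra logit per row
    probs = torch.softmax(torch.cat([scores, sink_col], dim=-1), dim=-1)
    # The sink column absorbs probability mass but contributes no value vector.
    return torch.matmul(probs[..., :-1], v)
```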
* updated agent skill with regards to PR Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * instruct agents to poll PRs Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Onboard the OpenELM architecture (apple/OpenELM-270M/1_1B/3B-Instruct) as a custom AutoDeploy model. This is a heterogeneous transformer with: - Per-layer varying query/KV head counts (GQA) - Per-layer varying FFN intermediate sizes - Fused QKV projection with Q/K normalization - Shared input/output embeddings (no separate lm_head) - GLU-style FFN (proj_1 = fused gate+up, proj_2 = down) Uses canonical AD IR ops: torch_rmsnorm, torch_rope_with_explicit_cos_sin, torch_attention. Config loaded from checkpoint via trust_remote_code=True. Updated openelm.yaml with attn_backend=flashinfer (trtllm backend produces degenerate output for OpenELM). Works with torch-cudagraph, default batch settings from dashboard_default.yaml. All 3 variants produce coherent generation via build_and_run_ad.py. Signed-off-by: Lucas Liebenwein <lliebenwein@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
… architectures Add 15 new architecture entries to the model support matrix for models onboarded via the AutoDeploy backend, and expand existing entries to reflect broader model family coverage from the AD sprint. Signed-off-by: Bala Marimuthu <bmarimuthu@nvidia.com> Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
… architectures Add 15 new architecture entries and expand 8 existing entries in the model support matrix to reflect models onboarded via AutoDeploy in this branch. Each new entry traces to a specific commit in the branch. New architectures (all AD-supported via [^7]): - DeepseekV2ForCausalLM, ExaoneForCausalLM, Gemma2ForCausalLM, GemmaForCausalLM, GlmMoeDsaForCausalLM, GraniteMoeHybridForCausalLM, HunYuanDenseV1ForCausalLM, HunYuanMoEV1ForCausalLM, InternLM2ForCausalLM, Olmo2ForCausalLM, OpenELMForCausalLM, Phi4FlashForCausalLM, Phi4VisionRForConditionalGeneration, SeedOssForCausalLM, Starcoder2ForCausalLM Expanded existing architectures: - Cohere2, LlamaForCausalLM, MistralForCausalLM, Phi3ForCausalLM, Qwen2ForCausalLM, Qwen3ForCausalLM, Qwen3MoeForCausalLM, MiniMaxM2ForCausalLM Signed-off-by: Bala Marimuthu <bmarimuthu@nvidia.com> Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
📝 Walkthrough
This pull request introduces a comprehensive AutoDeploy infrastructure expansion, including agent definitions and configurations for model onboarding workflows, new AutoDeploy model implementations for 30+ architectures, model registry configurations, custom operator enhancements for Mamba/SSM and attention backends, and runtime framework updates for registry-based model configuration management.
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 18
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
`1-6`: _⚠️ Potential issue_ | _🟡 Minor_ **Missing NVIDIA copyright header.**
This Python file is missing the required NVIDIA copyright header. Per coding guidelines, all TensorRT-LLM source files should contain an Apache 2.0 license header.
🛡️ Proposed fix to add copyright header
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 from importlib.resources import files
As per coding guidelines: "All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of the latest meaningful modification."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/llm_args.py` around lines 1 - 6, Add the required NVIDIA Apache-2.0 copyright header at the top of this source file (tensorrt_llm._torch.auto_deploy.llm_args) including "Copyright (c) YEAR, NVIDIA CORPORATION" with the year set to the latest meaningful modification, followed by the standard Apache License, Version 2.0 boilerplate; place it before any imports so the header covers the entire file..claude/agents/ad-onboard-reviewer.md (2)
`118-121`: _⚠️ Potential issue_ | _🟡 Minor_ **The sample PASS output still teaches the old config policy.**
Lines 120-121 show a locally-defined config and `AutoConfig.register()` as an unconditional PASS, which contradicts the stricter B2/B3 rules above. Please update the example so it only blesses that pattern when the config is genuinely unavailable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/agents/ad-onboard-reviewer.md around lines 118 - 121, Update the example PASS output so it does not unconditionally endorse a locally-defined config and AutoConfig.register(); specifically change the lines referencing modeling_foo.py, FooConfig, and AutoConfig.register("foo", FooConfig, exist_ok=True) so the sample only shows this pattern as PASS when the global config is genuinely unavailable (e.g., indicate a conditional note or change to FAIL otherwise). Locate the example block that lists "modeling_foo.py:15 — FooConfig defined in file" and "AutoConfig.register(...)" and modify the text to reflect the stricter policy: either mark the register line as conditional/pass-when-unavailable or mark it as failing guidance unless an external config cannot be found.
`72-74`: _⚠️ Potential issue_ | _🟠 Major_ **Align D2/D3 with the new slice-once RoPE guidance.**
Lines 73-74 still require RoPE to return full tables and be sliced inside attention, but `.codex/skills/ad-model-onboard/SKILL.md` lines 310-311 now require the opposite. Leaving both docs unchanged will make the reviewer disagree with the skill on otherwise-correct model implementations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/agents/ad-onboard-reviewer.md around lines 72 - 74, Update D2 and D3 to match the new "slice-once" RoPE policy: change D2 to state that RoPE.forward (the RoPE class' forward method) must return tensors already sliced to the requested sequence length (not the full cached table), and change D3 to remove/negate any requirement that downstream attention (e.g., the attention forward implementation that indexes cos/sin via position_ids or cos[position_ids]) perform additional slicing — attention should consume the already-sliced RoPE outputs. Reference the RoPE.forward symbol and the attention forward/position_ids usage when making these edits.
🟡 Minor comments (12)
tensorrt_llm/_torch/auto_deploy/llm.py-49-59 (1)
`49-59`: _⚠️ Potential issue_ | _🟡 Minor_ **In-place mutation of caller's `messages` may cause side effects.**
The code modifies `msg["content"]` in-place for each message. If the caller reuses the `inputs["messages"]` list after this call, they may unexpectedly see the transformed format instead of their original plain strings.
Consider creating a copy of the messages list before modification to avoid mutating the caller's data.
🛡️ Proposed fix to avoid in-place mutation
 messages = inputs["messages"]
 is_multimodal = hasattr(self.processor, "image_processor")
 if is_multimodal:
+    # Create a shallow copy to avoid mutating the caller's data
+    messages = [dict(msg) for msg in messages]
     for msg in messages:
         if isinstance(msg.get("content"), str):
             msg["content"] = [{"type": "text", "text": msg["content"]}]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/llm.py` around lines 49 - 59, The current loop mutates the caller's inputs["messages"] in-place when is_multimodal (hasattr(self.processor, "image_processor")), which can leak formatted message objects back to the caller; instead, make a local copy of the messages list and each message dict before transforming content (e.g., create a new list like local_messages = [msg.copy() for msg in inputs["messages"]] and work on local_messages), update only the copied msg["content"] to the [{"type":"text","text":...}] form for string contents, and then use local_messages for downstream processing so inputs["messages"] remains unchanged; refer to inputs["messages"], is_multimodal, self.processor, and msg["content"] to locate where to apply this change.examples/auto_deploy/model_registry/configs/granite_4.0_micro.yaml-2-2 (1)
`2-2`: _⚠️ Potential issue_ | _🟡 Minor_ **Typo in comment: "attn ackend" → "attn backend".**
📝 Proposed fix
-# Note: Uses flashinfer attention backend (trtllm attn ackend produces garbled output) +# Note: Uses flashinfer attention backend (trtllm attn backend produces garbled output)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/auto_deploy/model_registry/configs/granite_4.0_micro.yaml` at line 2, There's a typo in the comment string "# Note: Uses flashinfer attention backend (trtllm attn ackend produces garbled output)" — change "attn ackend" to "attn backend" so the comment reads "...(trtllm attn backend produces garbled output)"; update the comment text in the YAML (look for the exact comment containing "flashinfer attention backend" / "trtllm attn ackend") to correct the spelling..claude/agents/ad-debug-agent.md-53-53 (1)
`53-53`: _⚠️ Potential issue_ | _🟡 Minor_ **Minor grammar: "user given" should be hyphenated.**
Per static analysis hint, "user given" should be "user-given" when used as a compound adjective.
📝 Proposed fix
-Run the AD flow with the user given model-id using the below command. +Run the AD flow with the user-given model-id using the below command.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/agents/ad-debug-agent.md at line 53, Update the sentence "Run the AD flow with the user given model-id using the below command." to use the compound adjective form by replacing "user given" with "user-given" so it reads "Run the AD flow with the user-given model-id using the below command."; ensure only that hyphenation change is applied and no other wording is altered..claude/agents/ad-debug-agent.md-72-82 (1)
`72-82`: _⚠️ Potential issue_ | _🟡 Minor_ **Duplicate instruction and typo.**
- Line 72 has a typo: "use you your own tools" should be "use your own tools"
- Lines 72 and 82 are duplicates — consider removing one
📝 Proposed fix
-Remember to use you your own tools - Read, Grep, Glob, Bash, Edit, Write
+Remember to use your own tools - Read, Grep, Glob, Bash, Edit, Write
And remove line 82 as it duplicates line 72.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/agents/ad-debug-agent.md around lines 72 - 82, Fix the duplicated instruction and typo: change the phrase "Remember to use you your own tools - Read, Grep, Glob, Bash, Edit, Write" to "Remember to use your own tools - Read, Grep, Glob, Bash, Edit, Write" and remove the duplicate occurrence of that exact line (the second instance) so it appears only once in the document; search for the exact sentence to locate both instances to update and delete the duplicate..codex/agents/ad_onboard_reviewer.toml-63-63 (1)
`63-63`: _⚠️ Potential issue_ | _🟡 Minor_ **Checklist item G5 appears to be missing.**
The checklist jumps from G4 to G6 in the "Test File Hierarchical Levels" section. If G5 was intentionally removed, consider renumbering G6 to G5 for consistency. If it was accidentally omitted, please add the missing item.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.codex/agents/ad_onboard_reviewer.toml at line 63, The checklist numbering in the "Test File Hierarchical Levels" section skips G5 (it goes from G4 to G6); either restore the missing G5 entry or renumber G6 to G5 for consistency. Locate the checklist lines referencing G4 and G6 in .codex/agents/ad_onboard_reviewer.toml and either insert the intended G5 checklist item between G4 and G6 or change the label "G6: Export test runs a second forward with a different shape to validate dynamic dims." to "G5: ..." so the sequence is contiguous.examples/auto_deploy/model_registry/generate_csv.py-1-6 (1)
`1-6`: _⚠️ Potential issue_ | _🟡 Minor_ **Missing NVIDIA copyright header.**
New Python files require the NVIDIA Apache 2.0 copyright header. Please add the standard header at the top of the file.
📝 Add copyright header
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """Generate a CSV from models.yaml with HF model id, link, and build_and_run_ad.py command."""
 import csv
As per coding guidelines: "All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of the latest meaningful modification."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/auto_deploy/model_registry/generate_csv.py` around lines 1 - 6, Add the standard NVIDIA Apache-2.0 copyright header (including the year of latest meaningful modification) to the top of generate_csv.py above the module docstring and imports; ensure the header text matches other TensorRT-LLM source files (Apache 2.0 license boilerplate and NVIDIA copyright line) so the file complies with project guidelines.tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py-1-2 (1)
`1-2`: _⚠️ Potential issue_ | _🟡 Minor_ **Restore the standard NVIDIA Apache-2.0 header.**
This modified Python source file now starts directly with imports. Please add the required license header back at the top. As per coding guidelines, "Add NVIDIA copyright header on ALL new files, and update year on modified files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py` around lines 1 - 2, This file (__init__.py) is missing the required NVIDIA Apache-2.0 license header; add the standard NVIDIA Apache-2.0 copyright header block at the very top of tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py (above the existing imports of CohereForCausalLM and DeciLMForCausalLM), updating the copyright year as needed to reflect modification..codex/skills/ad-model-onboard/SKILL.md-156-160 (1)
`156-160`: _⚠️ Potential issue_ | _🟡 Minor_ **Fix the `world_size_N.yaml` size buckets.**
Lines 158-160 leave 15B-20B models uncovered and make 80B ambiguous between the 4-GPU and 8-GPU buckets. Please make the ranges exhaustive and non-overlapping.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.codex/skills/ad-model-onboard/SKILL.md around lines 156 - 160, Update the world_size_N.yaml size buckets to be exhaustive and non-overlapping: define world_size_1.yaml for models <2B, world_size_2.yaml for models from 2B up to and including 15B, world_size_4.yaml for models >15B and <80B, and world_size_8.yaml for models >=80B; modify the explanatory bullets in SKILL.md (the section describing the yaml_extra selection and the "Pick world_size_N.yaml based on model size" text) accordingly so the ranges are explicit and unambiguous while keeping the instruction to always include dashboard_default.yaml first and to append any model-specific YAML after the world_size entry..codex/skills/ad-model-onboard/SKILL.md-245-247 (1)
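To make the intended mapping concrete, here is a minimal sketch of an exhaustive, non-overlapping bucket rule (hypothetical helper name; the thresholds follow the ranges suggested in the prompt above):

```python
# Hypothetical helper; thresholds follow the prompt's suggested ranges so that
# every model size (including 16B-19B and exactly 80B) maps to exactly one bucket.
def pick_world_size_yaml(params_in_billions: float) -> str:
    if params_in_billions < 2:
        return "world_size_1.yaml"
    if params_in_billions <= 15:
        return "world_size_2.yaml"
    if params_in_billions < 80:
        return "world_size_4.yaml"
    return "world_size_8.yaml"
```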
`245-247`: _⚠️ Potential issue_ | _🟡 Minor_ **Add a language to this fenced command block.**
This currently trips markdownlint MD040. Use `bash` like the surrounding examples.
Suggested fix: add the `bash` language tag to the opening fence of the block containing `python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --use-registry`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.codex/skills/ad-model-onboard/SKILL.md around lines 245 - 247, the fenced code block containing the command `python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --use-registry` is missing a language tag and triggers markdownlint MD040; update that triple-backtick fence to include "bash" so the
block matches surrounding examples and satisfies the linter.</details> </blockquote></details> <details> <summary>tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_v2.py-1-1 (1)</summary><blockquote> `1-1`: _⚠️ Potential issue_ | _🟡 Minor_ **Update copyright year to 2026.** The copyright year shows 2025, but since this is a new file created in March 2026, it should be updated to 2026 to match other new files in this PR. <details> <summary>📝 Proposed fix</summary> ```diff -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_v2.py` at line 1, Update the SPDX copyright header year from 2025 to 2026 at the top of the file (the file-level header comment in modeling_deepseek_v2.py) so the header reads 2026, matching the rest of the PR. ``` </details> </blockquote></details> <details> <summary>.claude/skills/ad-model-onboard/SKILL.md-17-19 (1)</summary><blockquote> `17-19`: _⚠️ Potential issue_ | _🟡 Minor_ **Explicitly sum the `nvidia-smi` output or reword this step.** `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits` returns one value per GPU, not a single total. As written, Step 0 can be read as comparing the model size against one line of output instead of the aggregate VRAM budget it mentions. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/ad-model-onboard/SKILL.md around lines 17 - 19, The step currently suggests running `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits` but doesn't clarify that it returns per-GPU values; update the SKILL.md step to either instruct users to sum those per-GPU values (e.g., "sum the returned lines to get total VRAM across all GPUs") or reword the step to explicitly say "run the command and sum the per-GPU outputs to compute total system VRAM" so the model size is compared to the aggregate VRAM budget rather than a single GPU value. ``` </details> </blockquote></details> <details> <summary>.claude/skills/ad-model-onboard/SKILL.md-158-160 (1)</summary><blockquote> `158-160`: _⚠️ Potential issue_ | _🟡 Minor_ **The size buckets skip 16B-19B models.** The current guidance covers `<2B`, `2-15B`, `20-80B`, and `80B+`, so 16B/17B/18B/19B family members have no prescribed `world_size_N.yaml`. Please close that gap so the onboarding rule stays deterministic. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.claude/skills/ad-model-onboard/SKILL.md around lines 158 - 160, Update the size-bucket guidance in SKILL.md so 16–19B models are covered: keep the rule to "Always include dashboard_default.yaml first", then change the world_size mapping to explicitly include "world_size_1.yaml for <2B, world_size_2.yaml for 2–15B, world_size_4.yaml for 16–80B, world_size_8.yaml for 80B+" (or alternatively add a separate "16–19B -> world_size_4.yaml" bullet if you prefer minimal change); ensure any mention of "world_size" or "world_size_N.yaml" in the same section is updated to reflect the new bucket so onboarding guidance is deterministic. 
``` </details> </blockquote></details> </blockquote></details> <details> <summary>🧹 Nitpick comments (13)</summary><blockquote> <details> <summary>tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gpt_oss.py (1)</summary><blockquote> `467-493`: **Consider adding error handling for dequantization failures.** The `convert_moe_packed_tensors` call could fail (e.g., invalid tensor shapes, dtype mismatches) and would raise an unhandled exception during model loading. Consider wrapping the conversion in a try-except to provide a clearer error message. <details> <summary>♻️ Proposed improvement for error handling</summary> ```diff for base, blocks_key, scales_key in keys_to_process: - blocks = state_dict.pop(blocks_key).cpu() - scales = state_dict.pop(scales_key).cpu() - dequantized = convert_moe_packed_tensors(blocks, scales, dtype=torch.bfloat16) - state_dict[base] = dequantized + try: + blocks = state_dict.pop(blocks_key).cpu() + scales = state_dict.pop(scales_key).cpu() + dequantized = convert_moe_packed_tensors(blocks, scales, dtype=torch.bfloat16) + state_dict[base] = dequantized + except Exception as e: + raise RuntimeError( + f"Failed to dequantize MXFP4 weights for {base}: {e}" + ) from e ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gpt_oss.py` around lines 467 - 493, Wrap the convert_moe_packed_tensors call inside _mxfp4_dequant_load_hook in a try/except so dequantization failures produce a clear contextual error instead of crashing silently; catch exceptions around converting each (blocks_key, scales_key) pair, on error restore any popped entries back into state_dict (blocks_key and scales_key) and either raise a new RuntimeError with details including base/blocks_key/scales_key and the original exception message or log the error before re-raising, so callers can identify which expert tensor failed and why. ``` </details> </blockquote></details> <details> <summary>tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_moe.py (1)</summary><blockquote> `179-188`: **Minor: Unused unpacked variables.** The variables `T` and `D` from the shape unpacking are never used. Consider prefixing with underscore for clarity. <details> <summary>✏️ Suggested fix</summary> ```diff def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]: """Return (topk_indices, topk_weights), shapes [T, K] each.""" - T, D = hidden_states.shape + _T, _D = hidden_states.shape # Cast both input and weight to float32 for gate computation logits = F.linear(hidden_states.float(), self.wg.weight.float()) ``` Or simply remove the unpacking if not needed for documentation: ```diff - T, D = hidden_states.shape ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_moe.py` around lines 179 - 188, In the forward method of class/model handling routing (function forward), remove the unused local variables T and D or rename them to _T and _D (or simply use a single underscore _) when unpacking hidden_states.shape to avoid unused-variable warnings; update the line "T, D = hidden_states.shape" accordingly and ensure no other code relies on T or D. 
``` </details> </blockquote></details> <details> <summary>tensorrt_llm/_torch/auto_deploy/models/custom/modeling_llama4.py (2)</summary><blockquote> `179-185`: **Verify the input shape assumption for expert computation.** The reshape on line 180 assumes `hidden_states` has shape `[num_experts * tokens, hidden_size]` and reshapes to `[num_experts, tokens, hidden_size]`. This shape is set by the caller (`Llama4MoE.forward`), but the assumption isn't immediately obvious and could break if the MoE layer is used differently. Consider adding a shape assertion or comment documenting the expected input contract. <details> <summary>📝 Suggested documentation</summary> ```diff def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + # Input shape: [num_experts * num_tokens, hidden_size] + # Reshaped to: [num_experts, num_tokens, hidden_size] for parallel expert computation hidden_states = hidden_states.view(self.gate_up_proj.shape[0], -1, self.hidden_size) ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_llama4.py` around lines 179 - 185, The forward method in modeling_llama4.py assumes hidden_states is shaped as [num_experts * tokens, hidden_size] and reshapes it using self.gate_up_proj.shape[0] to [num_experts, tokens, hidden_size]; add an explicit check and/or clear inline comment documenting this contract so misuse is caught early. In the forward function (method name: forward) reference hidden_states, self.gate_up_proj, self.down_proj and the caller Llama4MoE.forward in the comment and add an assertion that hidden_states.dim() == 2 and hidden_states.size(1) == self.hidden_size and that hidden_states.size(0) is divisible by self.gate_up_proj.shape[0] (or otherwise matches expected num_experts * tokens) before the view operation. ``` </details> --- `614-623`: **Consider guarding the HF vision model import for robustness.** The import of `Llama4VisionModel` from transformers at line 617 only occurs inside `__init__`, unlike other Llama4 imports at module level. While the project pins `transformers==4.57.1` (which includes Llama4VisionModel support since 4.51.0), adding error handling would make the code more defensive against unexpected version mismatches or environments where the model is unavailable. <details> <summary>🛡️ Proposed improvement</summary> ```diff def __init__(self, config: Llama4Config, **kwargs): super().__init__(config) # Import HF's vision model for weight loading compatibility - from transformers.models.llama4.modeling_llama4 import ( - Llama4VisionModel as HFLlama4VisionModel, - ) + try: + from transformers.models.llama4.modeling_llama4 import ( + Llama4VisionModel as HFLlama4VisionModel, + ) + except ImportError as e: + raise ImportError( + "Llama4VisionModel requires transformers >= 4.51.0 with Llama 4 support" + ) from e self.vision_model = HFLlama4VisionModel(config.vision_config) ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. 
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_llama4.py` around lines 614 - 623, The inline import of Llama4VisionModel inside __init__ (used to set self.vision_model) is not guarded and can raise an unclear ImportError on mismatched transformers versions; wrap the import of Llama4VisionModel in a try/except ImportError inside __init__ (or perform a module-level guarded import) and on failure raise a clear RuntimeError that explains the missing transformer class and required version, or provide a sensible fallback (e.g., set self.vision_model = None) so downstream code can handle absence gracefully; reference the __init__, Llama4VisionModel, and self.vision_model symbols when applying the change. ``` </details> </blockquote></details> <details> <summary>tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4_visionr.py (1)</summary><blockquote> `1-14`: **Copyright header format differs from other files.** This file uses `Copyright (c) 2026` while other new files in this PR use `Copyright (c) 2022-2026`. Consider aligning with the established pattern for consistency. <details> <summary>📝 Suggested header alignment</summary> ```diff -# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 # # Licensed under the Apache License, Version 2.0 (the "License"); ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4_visionr.py` around lines 1 - 14, Update the copyright header in modeling_phi4_visionr.py to match the project's standard range by changing "Copyright (c) 2026" to "Copyright (c) 2022-2026" at the top of the file; keep the rest of the Apache License boilerplate unchanged so the header format is consistent with other new files in the PR. ``` </details> </blockquote></details> <details> <summary>tensorrt_llm/_torch/auto_deploy/custom_ops/fla/torch_backend_gated_delta.py (1)</summary><blockquote> `196-203`: **Clarify variable naming: `num_k_heads` is derived from Q's shape, not K's.** The variable `num_k_heads` at line 197 is assigned from `q.shape[2]` (Q's head count), not K's shape. This naming is misleading. The condition at line 199 (`num_v_heads > num_k_heads`) is intentional and documented in the FLA gated delta rule: Q/K have fewer heads than V. However, the variable name obscures this intent. Consider renaming to `num_qk_heads` to clarify that Q and K share the same head count, which differs from V. <details> <summary>📝 Suggested clarification</summary> ```diff # Handle GQA: expand Q/K heads to match V heads if needed - num_k_heads = q.shape[2] + num_qk_heads = q.shape[2] # Q and K share the same head count num_v_heads = v.shape[2] - if num_v_heads > num_k_heads: - n_rep = num_v_heads // num_k_heads + if num_v_heads > num_qk_heads: + n_rep = num_v_heads // num_qk_heads q = q.repeat_interleave(n_rep, dim=2) k = k.repeat_interleave(n_rep, dim=2) ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. 
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fla/torch_backend_gated_delta.py` around lines 196 - 203, Rename the misleading variable num_k_heads (assigned from q.shape[2]) to num_qk_heads to make it clear Q and K share the same head count; update the condition (num_v_heads > num_qk_heads) and the subsequent repeat_interleave calls on q and k to use num_qk_heads, and adjust the comment above the block to mention "Q/K heads" explicitly; locate this change in torch_backend_gated_delta.py around the block that references q, k, v and num_v_heads. ``` </details> </blockquote></details> <details> <summary>examples/auto_deploy/model_registry/configs/hunyuan_mt_7b.yaml (1)</summary><blockquote> `1-4`: **Consider adding at least an empty YAML mapping or a placeholder key.** This file contains only comments with no YAML keys. While YAML parsers typically handle comment-only files gracefully (returning `null` or empty), some registry implementations may expect at least an empty document. If `dashboard_default` truly provides all necessary values, consider adding a brief placeholder like `{}` or `# (uses dashboard_default)` to make the intent explicit in the parsed structure. <details> <summary>💡 Optional: Add empty mapping for clarity</summary> ```diff # Configuration for Tencent Hunyuan-MT-7B (Dense) # Standard dense GQA model with QK norm and Dynamic NTK-Alpha RoPE. # Uses a custom AutoDeploy model implementation (HunYuanDenseForCausalLM) # registered locally; no additional overrides needed beyond dashboard_default. + +# Empty config - inherits all settings from dashboard_default +{} ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@examples/auto_deploy/model_registry/configs/hunyuan_mt_7b.yaml` around lines 1 - 4, Add an explicit empty YAML document or placeholder to make the intent explicit and ensure parsers/registries don't return null; update hunyuan_mt_7b.yaml to include either an empty mapping ("{}") or a comment placeholder like "# uses dashboard_default" so the file parses to a non-null document while still relying on dashboard_default for values referenced in the config. ``` </details> </blockquote></details> <details> <summary>.codex/agents/ad_debug_agent.toml (1)</summary><blockquote> `14-18`: **CLI flag inconsistency with the Markdown agent documentation.** This TOML uses `--args.yaml-extra examples/auto_deploy/model_registry/configs/<CONFIG_YAML_FILE>` (line 17), but `.claude/agents/ad-debug-agent.md` uses `--use-registry` (line 60). The PR's registry-driven workflow suggests `--use-registry` is the preferred approach since it auto-resolves configs from `models.yaml`. Consider aligning this TOML with the registry-based workflow: <details> <summary>♻️ Proposed fix to align with registry workflow</summary> ```diff Typical run command: -AD_DUMP_GRAPHS_DIR=<AD_DUMP_GRAPHS_DIR> python examples/auto_deploy/build_and_run_ad.py \ +CUDA_VISIBLE_DEVICES=<SELECTED_GPUS> AD_DUMP_GRAPHS_DIR=<AD_DUMP_GRAPHS_DIR> \ + python examples/auto_deploy/build_and_run_ad.py \ --model <MODEL_HF_ID> \ - --args.yaml-extra examples/auto_deploy/model_registry/configs/<CONFIG_YAML_FILE> \ + --use-registry \ 2>&1 | tee <LOG_FILE> ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. 
In @.codex/agents/ad_debug_agent.toml around lines 14 - 18, Update the CLI example in the TOML to use the registry-driven flag to match the Markdown docs: replace the usage of the --args.yaml-extra flag in the Typical run command with the --use-registry flag so the example follows the registry workflow (i.e., run examples/auto_deploy/build_and_run_ad.py with --use-registry to auto-resolve configs from models.yaml), ensuring the flags --args.yaml-extra and --use-registry are not both shown as the canonical example. ``` </details> </blockquote></details> <details> <summary>examples/auto_deploy/model_registry/generate_csv.py (2)</summary><blockquote> `21-23`: **Specify explicit encoding for file operations.** For cross-platform compatibility, explicitly specify `encoding="utf-8"` when opening text files. <details> <summary>📝 Add explicit encoding</summary> ```diff - with open(MODELS_YAML) as f: + with open(MODELS_YAML, encoding="utf-8") as f: data = yaml.safe_load(f) ``` And for the output file: ```diff - with open(OUTPUT_CSV, "w", newline="") as f: + with open(OUTPUT_CSV, "w", newline="", encoding="utf-8") as f: ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@examples/auto_deploy/model_registry/generate_csv.py` around lines 21 - 23, The file operations in main() currently open MODELS_YAML (and the CSV output) without an explicit encoding; update the open(...) calls to include encoding="utf-8" when reading MODELS_YAML in main() and when writing the CSV output file (wherever the write/open for the CSV is performed) to ensure cross-platform UTF-8 handling; locate the open calls in main() and the CSV-writing function/section and add encoding="utf-8". ``` </details> --- `9-9`: **Unused constant `REPO_ROOT`.** `REPO_ROOT` is defined but never referenced in the script. Consider removing it if it's not needed. <details> <summary>🧹 Remove unused constant</summary> ```diff -# Paths relative to TensorRT-LLM repo root -REPO_ROOT = Path(__file__).resolve().parents[3] MODELS_YAML = Path(__file__).resolve().parent / "models.yaml" ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@examples/auto_deploy/model_registry/generate_csv.py` at line 9, The constant REPO_ROOT is declared but not used in generate_csv.py; remove the unused symbol REPO_ROOT (the Path(__file__).resolve().parents[3] assignment) to eliminate dead code, or if it was intended to be used for resolving paths, replace its usage by referencing REPO_ROOT in the relevant functions (e.g., in CSV generation or file reads) instead of hardcoding paths; target the REPO_ROOT declaration to either delete or wire into path resolution logic. ``` </details> </blockquote></details> <details> <summary>.codex/agents/ad_run_agent.toml (1)</summary><blockquote> `18-22`: **Inconsistent with `.claude/agents/ad-run-agent.md`: consider using `--use-registry`.** This workflow uses `--args.yaml-extra` directly, while the corresponding `.claude/agents/ad-run-agent.md` uses `--use-registry` for automatic config resolution from the registry. Consider aligning on `--use-registry` for consistency, unless there's a specific reason to require explicit YAML paths here. <details> <summary>📝 Update to use --use-registry</summary> ```diff 1. 
Run: -AD_DUMP_GRAPHS_DIR=<AD_DUMP_GRAPHS_DIR> python examples/auto_deploy/build_and_run_ad.py \ +CUDA_VISIBLE_DEVICES=<SELECTED_GPUS> AD_DUMP_GRAPHS_DIR=<AD_DUMP_GRAPHS_DIR> python examples/auto_deploy/build_and_run_ad.py \ --model <MODEL_HF_ID> \ - --args.yaml-extra examples/auto_deploy/model_registry/configs/<CONFIG_YAML_FILE> \ + --use-registry \ 2>&1 | tee <LOG_FILE> ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In @.codex/agents/ad_run_agent.toml around lines 18 - 22, The command in ad_run_agent.toml uses the explicit flag --args.yaml-extra to pass a config file to examples/auto_deploy/build_and_run_ad.py, but the documentation (.claude/agents/ad-run-agent.md) expects the --use-registry flow; update the Run snippet to use --use-registry instead of --args.yaml-extra (or add an explanatory comment if explicit YAML is required) so the CLI invocation in this file matches the automatic registry-based config resolution described in the docs; reference build_and_run_ad.py and the flags --args.yaml-extra and --use-registry when making the change. ``` </details> </blockquote></details> <details> <summary>.claude/agents/ad-run-agent.md (1)</summary><blockquote> `86-88`: **Add language identifier to fenced code block.** The code block showing the log path format should have a language identifier (or use empty identifier for plain text). <details> <summary>📝 Add language identifier</summary> ```diff -``` +```text $PWD/ad-test-workspace/ad_run_logs/<MODEL_SHORT_NAME>_<YYYYMMDD_HHMMSS>_<STATUS>.log ``` ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In @.claude/agents/ad-run-agent.md around lines 86 - 88, The fenced code block
that displays the log path
"$PWD/ad-test-workspace/ad_run_logs/<MODEL_SHORT_NAME><YYYYMMDD_HHMMSS>.log"
lacks a language identifier; update that Markdown block to include a language
tag such as "text" (i.e., changetotext) so the snippet is treated as
plain text and renders consistently.</details> </blockquote></details> <details> <summary>tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_gated_delta.py (1)</summary><blockquote> `160-166`: **Add validation for GQA head count divisibility.** The GQA expansion assumes `num_v_heads` is evenly divisible by `num_k_heads`. If this invariant is violated, the resulting tensor shape will be incorrect without any clear error message. <details> <summary>🛡️ Add divisibility assertion</summary> ```diff # Handle GQA: expand Q/K heads to match V heads if needed num_k_heads = q.shape[2] num_v_heads = v.shape[2] if num_v_heads > num_k_heads: + assert num_v_heads % num_k_heads == 0, ( + f"GQA requires num_v_heads ({num_v_heads}) to be divisible by " + f"num_k_heads ({num_k_heads})" + ) n_rep = num_v_heads // num_k_heads q = q.repeat_interleave(n_rep, dim=2) k = k.repeat_interleave(n_rep, dim=2) ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_gated_delta.py` around lines 160 - 166, The code assumes num_v_heads is divisible by num_k_heads before using n_rep = num_v_heads // num_k_heads and repeat_interleave on q and k; add a validation check right after computing num_k_heads and num_v_heads that raises a clear ValueError (or AssertionError) when num_v_heads % num_k_heads != 0 (include the actual values in the message), and only perform n_rep = num_v_heads // num_k_heads and q = q.repeat_interleave(n_rep, dim=2); k = k.repeat_interleave(n_rep, dim=2) when the divisibility check passes to avoid silent shape errors. ``` </details> </blockquote></details> </blockquote></details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: Path: .coderabbit.yaml **Review profile**: CHILL **Plan**: Pro **Run ID**: `3859f7f0-d841-4b1e-abe2-67386567f9fd` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between 2fb5e805335d2cd0a9e580797a9d39b85cfbfc58 and 89a5aeaafc7b1a34fa0e38f9975537f1bb79d41b. 
</details> <details> <summary>📒 Files selected for processing (138)</summary> * `.claude/agents/ad-debug-agent.md` * `.claude/agents/ad-onboard-reviewer.md` * `.claude/agents/ad-run-agent.md` * `.claude/skills/ad-model-onboard/SKILL.md` * `.codex/AGENTS.md` * `.codex/agents/ad_debug_agent.toml` * `.codex/agents/ad_onboard_reviewer.toml` * `.codex/agents/ad_run_agent.toml` * `.codex/agents/onboard_update_reviewer.toml` * `.codex/config.toml` * `.codex/skills/ad-model-onboard/SKILL.md` * `AGENTS.md` * `docs/source/models/supported-models.md` * `examples/auto_deploy/build_and_run_ad.py` * `examples/auto_deploy/model_registry/configs/dashboard_default.yaml` * `examples/auto_deploy/model_registry/configs/deepseek_v2_ep.yaml` * `examples/auto_deploy/model_registry/configs/exaone.yaml` * `examples/auto_deploy/model_registry/configs/glm4_moe.yaml` * `examples/auto_deploy/model_registry/configs/glm_5.yaml` * `examples/auto_deploy/model_registry/configs/gpt_oss.yaml` * `examples/auto_deploy/model_registry/configs/granite_4.0_h_small.yaml` * `examples/auto_deploy/model_registry/configs/granite_4.0_micro.yaml` * `examples/auto_deploy/model_registry/configs/granite_4.0_tiny_preview.yaml` * `examples/auto_deploy/model_registry/configs/hunyuan_mt_7b.yaml` * `examples/auto_deploy/model_registry/configs/minimax_m2.yaml` * `examples/auto_deploy/model_registry/configs/nano_v3.yaml` * `examples/auto_deploy/model_registry/configs/nemotron_flash.yaml` * `examples/auto_deploy/model_registry/configs/nemotron_super_49b.yaml` * `examples/auto_deploy/model_registry/configs/openelm.yaml` * `examples/auto_deploy/model_registry/configs/phi4-multimodal-instruct.yaml` * `examples/auto_deploy/model_registry/configs/phi4-reasoning-vision-15b.yaml` * `examples/auto_deploy/model_registry/configs/qwen3.5_dense.yaml` * `examples/auto_deploy/model_registry/configs/super_v3.yaml` * `examples/auto_deploy/model_registry/generate_csv.py` * `examples/auto_deploy/model_registry/models.yaml` * `tensorrt_llm/_torch/auto_deploy/config/default.yaml` * `tensorrt_llm/_torch/auto_deploy/custom_ops/README.md` * `tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py` * `tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py` * `tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_gated_delta.py` * `tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_gated_delta.py` * `tensorrt_llm/_torch/auto_deploy/custom_ops/fla/torch_backend_gated_delta.py` * `tensorrt_llm/_torch/auto_deploy/custom_ops/linear/torch_router.py` * `tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py` * `tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_mamba.py` * `tensorrt_llm/_torch/auto_deploy/llm.py` * `tensorrt_llm/_torch/auto_deploy/llm_args.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_cohere.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_decilm.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_v2.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_exaone.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma2.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma3.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm4_moe.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gpt_oss.py` * 
`tensorrt_llm/_torch/auto_deploy/models/custom/modeling_granite.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_granite_moe_hybrid.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_dense.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_moe.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_internlm3.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_llama3.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_llama4.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_minimax_m2.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_mistral.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_mistral3.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_olmo3.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_openelm.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4_visionr.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4flash.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4mm.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen2.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_moe.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_next.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_seed_oss.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_skywork_r1v2.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_smollm3.py` * `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_starcoder2.py` * `tensorrt_llm/_torch/auto_deploy/models/patches/gptoss-mxfp4.py` * `tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py` * `tensorrt_llm/_torch/auto_deploy/models/patches/minimax_m2.py` * `tensorrt_llm/_torch/auto_deploy/models/patches/mistral3.py` * `tensorrt_llm/_torch/auto_deploy/models/patches/qwen3.py` * `tensorrt_llm/_torch/auto_deploy/models/patches/qwen3_next.py` * `tensorrt_llm/_torch/auto_deploy/shim/interface.py` * `tensorrt_llm/_torch/auto_deploy/transform/library/fuse_quant.py` * `tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py` * `tensorrt_llm/_torch/auto_deploy/utils/benchmark.py` * `tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_gemma3_modeling.py` * `tests/unittest/auto_deploy/singlegpu/custom_ops/test_resource_handlers.py` * `tests/unittest/auto_deploy/singlegpu/models/test_cohere_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_decilm_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_deepseek_v2_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_exaone_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_gemma2_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_gemma_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_glm4_moe_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_glm_moe_dsa_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_gpt_oss_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_granite_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_granite_moe_hybrid_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_hunyuan_dense_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_hunyuan_moe_modeling.py` * 
`tests/unittest/auto_deploy/singlegpu/models/test_internlm3_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_llama3_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_llama4_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_minimax_m2_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_mistral3_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_mistral3_patches.py` * `tests/unittest/auto_deploy/singlegpu/models/test_mistral_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_olmo3_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_openelm_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_phi4_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_phi4_visionr_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_phi4flash_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_phi4mm_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_qwen2_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_qwen3_5_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_qwen3_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_qwen3_moe_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_qwen3_next_gdn_patches.py` * `tests/unittest/auto_deploy/singlegpu/models/test_qwen3_next_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_qwen3_next_patches.py` * `tests/unittest/auto_deploy/singlegpu/models/test_seed_oss_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_skywork_r1v2_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_smollm3_modeling.py` * `tests/unittest/auto_deploy/singlegpu/models/test_starcoder2_modeling.py` * `tests/unittest/auto_deploy/singlegpu/shim/test_cached_sequence_interface.py` * `tests/unittest/auto_deploy/singlegpu/shim/test_engine.py` * `tests/unittest/auto_deploy/singlegpu/transformations/library/test_gated_delta_rule_cache.py` * `tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py` * `tests/unittest/auto_deploy/singlegpu/transformations/library/test_torch_gated_delta_rule_cache.py` * `tests/unittest/auto_deploy/singlegpu/utils/test_benchmark_mlp.py` </details> <details> <summary>💤 Files with no reviewable changes (1)</summary> * examples/auto_deploy/model_registry/configs/dashboard_default.yaml </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
batch_size, seq_len = inputs_embeds.shape[:2]
if position_ids is None:
    position_ids = (
        torch.arange(seq_len, device=inputs_embeds.device)
        .unsqueeze(0)
        .expand(batch_size, -1)
    )
Do not synthesize position_ids in the export path.
AutoDeploy prefills need explicit positions. Defaulting to 0..S-1 hides a missing input and gives wrong RoPE for non-zero-offset or packed sequences.
Suggested fix
- batch_size, seq_len = inputs_embeds.shape[:2]
- if position_ids is None:
-     position_ids = (
-         torch.arange(seq_len, device=inputs_embeds.device)
-         .unsqueeze(0)
-         .expand(batch_size, -1)
-     )
+ if position_ids is None:
+     raise ValueError("position_ids must be provided for AD export")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma3.py` around
lines 414 - 420, The current code synthesizes position_ids from inputs_embeds
(using torch.arange and expand), which must not happen in the export/AutoDeploy
path; update the block that creates position_ids so that if position_ids is None
and we are in an export/tracing/scripting context (torch.jit.is_tracing() or
torch.jit.is_scripting()), we raise a clear ValueError stating position_ids must
be provided for export instead of synthesizing them; keep the existing fallback
synthesis only for normal runtime (non-export) paths so references to
inputs_embeds, position_ids, and the surrounding forward (or the method
containing this code) remain correct.
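For illustration, a minimal sketch of the export-aware check the prompt describes (the helper name and the tracing/compiling predicates are assumptions, not the repo's actual code):

```python
import torch


def resolve_position_ids(position_ids, inputs_embeds):
    if position_ids is not None:
        return position_ids
    # Fail fast under export/tracing so a missing input is not silently hidden.
    if torch.jit.is_tracing() or torch.jit.is_scripting() or torch.compiler.is_compiling():
        raise ValueError("position_ids must be provided for AD export")
    # Eager-only fallback: plain 0..S-1 positions per batch row.
    batch_size, seq_len = inputs_embeds.shape[:2]
    return (
        torch.arange(seq_len, device=inputs_embeds.device)
        .unsqueeze(0)
        .expand(batch_size, -1)
    )
```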
def forward(
    self,
    input_ids: Optional[torch.LongTensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    inputs_embeds: Optional[torch.FloatTensor] = None,
    **kwargs,
) -> Gemma3CausalLMOutput:
    language_model_kwargs = {}
    if input_ids is not None:
        language_model_kwargs["input_ids"] = input_ids
    if position_ids is not None:
        language_model_kwargs["position_ids"] = position_ids
    if inputs_embeds is not None:
        language_model_kwargs["inputs_embeds"] = inputs_embeds

    language_model_signature = inspect.signature(self.language_model.forward)
    accepts_var_kwargs = any(
        parameter.kind == inspect.Parameter.VAR_KEYWORD
        for parameter in language_model_signature.parameters.values()
    )
    if not accepts_var_kwargs:
        allowed_extra_kwargs = set(language_model_signature.parameters) - set(
            language_model_kwargs
        )
        language_model_kwargs.update(
            {key: value for key, value in kwargs.items() if key in allowed_extra_kwargs}
        )

    return self.language_model(**language_model_kwargs)
This wrapper is registered as image-text, but its forward path is still text-only.
Gemma3ForConditionalGeneration.forward never uses vision_tower or multi_modal_projector, and the current kwarg filtering drops multimodal inputs before the call into Gemma3ForCausalLM. Because the class is also registered in AutoModelForImageTextToTextFactory, multimodal runs can silently ignore the image input instead of failing fast. That is especially risky because the registry still marks google/gemma-3-27b-it with multimodal.yaml.
Suggested fix
 def forward(
     self,
     input_ids: Optional[torch.LongTensor] = None,
     position_ids: Optional[torch.LongTensor] = None,
     inputs_embeds: Optional[torch.FloatTensor] = None,
     **kwargs,
 ) -> Gemma3CausalLMOutput:
+    if "pixel_values" in kwargs:
+        raise NotImplementedError(
+            "Gemma3ForConditionalGeneration export path is text-only; image inputs are not supported."
+        )
     language_model_kwargs = {}
If image-text support is not intended yet, please also drop the AutoModelForImageTextToTextFactory registration until the vision path is wired.
Also applies to: 567-569
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma3.py` around
lines 532 - 560, The forward wrapper (Gemma3ForConditionalGeneration.forward) is
registered as image-text but currently only forwards text args to
self.language_model and filters out multimodal kwargs; either wire the vision
path or fail fast: detect multimodal inputs (e.g., pixel_values / vision_tower /
multi_modal_projector related kwargs) before the existing kwargs filtering in
forward and if self.vision_tower and self.multi_modal_projector are available,
run the vision_tower and multi_modal_projector to produce inputs_embeds and
include them in language_model_kwargs, otherwise raise a clear ValueError
stating multimodal inputs are unsupported (or remove the
AutoModelForImageTextToTextFactory registration) so image inputs are not
silently dropped. Ensure checks reference the symbols language_model.forward,
vision_tower, multi_modal_projector, and language_model_kwargs so the fix is
easy to locate.
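If the vision path is wired instead of failing fast, the merge step the prompt describes usually amounts to scattering projected image features into the text embeddings at image-token positions. A self-contained toy of that pattern (all names, shapes, and the image token id here are illustrative, not the repo's or HF's actual API):

```python
import torch

hidden_size, image_token_id = 8, 99
input_ids = torch.tensor([[1, 99, 99, 2]])      # two image-token slots
inputs_embeds = torch.randn(1, 4, hidden_size)  # text embeddings from embed_tokens
image_features = torch.randn(2, hidden_size)    # projector output, one row per image token

# Rows holding image tokens receive the projected vision features.
image_mask = (input_ids == image_token_id).unsqueeze(-1)
inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_features)
```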
@@ -0,0 +1,675 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Add the standard SPDX/Apache header to this new file.
This new TensorRT-LLM source file is missing the required Apache 2.0 header block, and the year should reflect the 2026 addition.
Suggested fix
-# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py` at
line 1, This file
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py is missing
the required NVIDIA Apache-2.0 header; add the standard SPDX/Apache header block
at the top of the file with the NVIDIA copyright line and the year 2026, using
the exact license text and SPDX identifier (Apache-2.0) required by project
guidelines so the header precedes the module code in modeling_glm_moe_dsa.py.
if inputs_embeds is None:
    inputs_embeds = self.embed_tokens(input_ids)

assert position_ids is not None, "position_ids is required"
Replace these asserts with explicit input validation.
Optimized Python removes assert, so position_ids becomes optional at runtime and the model fails later with a less actionable error.
Suggested fix
- assert position_ids is not None, "position_ids is required"
+ if position_ids is None:
+ raise ValueError("position_ids is required")
- assert position_ids is not None, "position_ids is required"
+ if position_ids is None:
+ raise ValueError("position_ids is required")
Also applies to: 659-659
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py` at
line 597, Replace the runtime-removed asserts for input validation: instead of
using "assert position_ids is not None, 'position_ids is required'" (and the
similar check at the other occurrence), add explicit validation that checks if
position_ids is None and raise a clear exception (e.g., ValueError or TypeError)
with the message "position_ids is required". Update the checks in
modeling_glm_moe_dsa.py where position_ids is validated so the function/method
fails fast with a clear error at runtime rather than relying on assert.
def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    bsz, seq_len, hidden_dim = hidden_states.shape
    hidden_states_flat = hidden_states.view(-1, hidden_dim)

    # Router GEMM in float32
    router_logits = F.linear(hidden_states_flat.float(), self.weight.float())

    # Fused routing: sigmoid + bias + group top-k + normalize + scale
    topk_weights, topk_indices = torch.ops.trtllm.noaux_tc_op(
        router_logits,
        self.e_score_correction_bias.float(),
        self.n_group,
        self.topk_group,
        self.top_k,
        self.routed_scaling_factor,
    )

    return topk_indices, topk_weights
Remove the direct torch.ops.trtllm dependency from the router.
Line 155 calls torch.ops.trtllm.noaux_tc_op, but the onboarding contract updated in this PR only allows torch.ops.auto_deploy.torch_* or plain PyTorch in model code. This will also break the required CPU unit-test path. If helpful, I can sketch the pure-PyTorch sigmoid/bias/group-topk rewrite.
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 148-148: Unpacked variable bsz is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
[warning] 148-148: Unpacked variable seq_len is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
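The RUF059 hints above amount to a one-line change; a minimal sketch of the rewrite (the variable names are taken from the reviewed forward, everything else is illustrative):

    # Prefix unpacked-but-unused values with an underscore so Ruff treats them as intentional.
    _bsz, _seq_len, hidden_dim = hidden_states.shape
    hidden_states_flat = hidden_states.view(-1, hidden_dim)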
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm4_moe.py` around
lines 147 - 164, The forward method currently calls torch.ops.trtllm.noaux_tc_op
(in modeling_glm4_moe.forward) which violates the new onboarding contract and
breaks CPU unit tests; replace that call with either a pure-PyTorch
implementation of the fused operations (apply bias, sigmoid, perform group-wise
top-k selection, normalize and scale using the existing tensors router_logits,
self.e_score_correction_bias, self.n_group, self.topk_group, self.top_k,
self.routed_scaling_factor) or call the approved operator namespace
(torch.ops.auto_deploy.*) if an equivalent op exists; ensure the replacement
reproduces the original outputs topk_indices and topk_weights and runs on CPU.
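Since the comment offers to sketch the rewrite, here is a rough pure-PyTorch version of the routing the fused op is understood to perform: sigmoid gating, bias applied only for expert selection, group top-k scored by each group's top-2 biased scores, per-token top-k, normalization, and scaling. The helper name noaux_tc_routing_reference and the top-2-per-group heuristic are assumptions based on DeepSeek-V3-style no-aux routing and should be verified against the noaux_tc_op kernel before swapping it in:

    import torch

    def noaux_tc_routing_reference(
        router_logits: torch.Tensor,            # [num_tokens, num_experts], float32
        e_score_correction_bias: torch.Tensor,  # [num_experts]
        n_group: int,
        topk_group: int,
        top_k: int,
        routed_scaling_factor: float,
    ):
        num_tokens, num_experts = router_logits.shape
        scores = torch.sigmoid(router_logits)
        # The bias only influences which experts are selected, not the returned weights.
        scores_for_choice = scores + e_score_correction_bias

        # Score each group by the sum of its top-2 biased scores, keep topk_group groups per token.
        group_scores = (
            scores_for_choice.view(num_tokens, n_group, -1).topk(2, dim=-1).values.sum(dim=-1)
        )
        group_idx = group_scores.topk(topk_group, dim=-1).indices
        group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1.0)
        expert_mask = (
            group_mask.unsqueeze(-1)
            .expand(num_tokens, n_group, num_experts // n_group)
            .reshape(num_tokens, num_experts)
        )

        # Top-k experts within the kept groups; weights come from the unbiased scores.
        masked_scores = scores_for_choice.masked_fill(expert_mask == 0, float("-inf"))
        topk_indices = masked_scores.topk(top_k, dim=-1).indices
        topk_weights = scores.gather(1, topk_indices)
        topk_weights = topk_weights / (topk_weights.sum(dim=-1, keepdim=True) + 1e-20)
        topk_weights = topk_weights * routed_scaling_factor
        return topk_weights, topk_indices

This runs on CPU with plain PyTorch, which is what the onboarding contract asks for; whether it matches the fused op exactly (tie-breaking, epsilon, dtype) still needs a unit-test comparison.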
def forward(self, hidden_states: torch.Tensor, position_ids: torch.LongTensor) -> torch.Tensor:
    batch_size, seq_len, _ = hidden_states.shape
    qkv = self.qkv_proj(hidden_states)
    query_pos = self.num_heads * self.head_dim
    kv_pos = self.num_key_value_heads * self.head_dim
    query_states = qkv[..., :query_pos]
    key_states = qkv[..., query_pos : query_pos + kv_pos]
    value_states = qkv[..., query_pos + kv_pos :]

    query_states = query_states.view(batch_size, seq_len, self.num_heads, self.head_dim)
    key_states = key_states.view(batch_size, seq_len, self.num_key_value_heads, self.head_dim)
    value_states = value_states.view(
        batch_size, seq_len, self.num_key_value_heads, self.head_dim
    )

    position_embeddings = self.rotary_emb(value_states)
    if self.rotary_emb.rope_scaling is None:
        cos, sin = position_embeddings
    else:
        short_cos, short_sin, long_cos, long_sin = position_embeddings
        if position_ids.is_meta:
            cos, sin = short_cos, short_sin
        else:
            seq_len = int(position_ids.max().item()) + 1
            if seq_len <= self.rotary_emb.original_max_position_embeddings:
                cos, sin = short_cos, short_sin
            else:
                cos, sin = long_cos, long_sin
    cos = cos[position_ids]
    sin = sin[position_ids]
    query_rot = query_states[..., : self.rotary_ndims]
    query_pass = query_states[..., self.rotary_ndims :]
    key_rot = key_states[..., : self.rotary_ndims]
    key_pass = key_states[..., self.rotary_ndims :]
    query_rot, key_rot = torch.ops.auto_deploy.torch_rope_with_explicit_cos_sin(
        query_rot,
        key_rot,
        cos,
        sin,
        2,
    )
    query_states = torch.cat((query_rot, query_pass), dim=-1)
    key_states = torch.cat((key_rot, key_pass), dim=-1)
    attn_output = torch.ops.auto_deploy.torch_attention(
        query_states,
        key_states,
        value_states,
        is_causal=True,
        scale=self.scaling,
        layout="bsnd",
    )
    attn_output = attn_output.reshape(batch_size, seq_len, self.num_heads * self.head_dim)
    return self.o_proj(attn_output)
Keep query length separate from max position.
Line 280 overwrites seq_len with position_ids.max() + 1, but Line 308 still uses seq_len for the final reshape. If a caller passes offset or padded position_ids, that reshape size no longer matches the actual query length.
🐛 Proposed fix
def forward(self, hidden_states: torch.Tensor, position_ids: torch.LongTensor) -> torch.Tensor:
- batch_size, seq_len, _ = hidden_states.shape
+ batch_size, query_len, _ = hidden_states.shape
qkv = self.qkv_proj(hidden_states)
query_pos = self.num_heads * self.head_dim
kv_pos = self.num_key_value_heads * self.head_dim
query_states = qkv[..., :query_pos]
key_states = qkv[..., query_pos : query_pos + kv_pos]
value_states = qkv[..., query_pos + kv_pos :]
- query_states = query_states.view(batch_size, seq_len, self.num_heads, self.head_dim)
- key_states = key_states.view(batch_size, seq_len, self.num_key_value_heads, self.head_dim)
+ query_states = query_states.view(batch_size, query_len, self.num_heads, self.head_dim)
+ key_states = key_states.view(batch_size, query_len, self.num_key_value_heads, self.head_dim)
value_states = value_states.view(
- batch_size, seq_len, self.num_key_value_heads, self.head_dim
+ batch_size, query_len, self.num_key_value_heads, self.head_dim
)
position_embeddings = self.rotary_emb(value_states)
if self.rotary_emb.rope_scaling is None:
cos, sin = position_embeddings
else:
short_cos, short_sin, long_cos, long_sin = position_embeddings
if position_ids.is_meta:
cos, sin = short_cos, short_sin
else:
- seq_len = int(position_ids.max().item()) + 1
- if seq_len <= self.rotary_emb.original_max_position_embeddings:
+ max_position = int(position_ids.max().item()) + 1
+ if max_position <= self.rotary_emb.original_max_position_embeddings:
cos, sin = short_cos, short_sin
else:
cos, sin = long_cos, long_sin
cos = cos[position_ids]
sin = sin[position_ids]
@@
attn_output = torch.ops.auto_deploy.torch_attention(
query_states,
key_states,
value_states,
is_causal=True,
scale=self.scaling,
layout="bsnd",
)
- attn_output = attn_output.reshape(batch_size, seq_len, self.num_heads * self.head_dim)
+ attn_output = attn_output.reshape(batch_size, query_len, self.num_heads * self.head_dim)
return self.o_proj(attn_output)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4mm.py` around
lines 257 - 309, The method forward overwrites the original query length stored
in seq_len when computing seq_len = int(position_ids.max().item()) + 1, which
breaks the final reshape of attn_output; preserve the original batch_size and
seq_len for tensor shapes by using a separate variable (e.g. pos_seq_len or
rope_seq_len) when inspecting position_ids and choosing cos/sin in the rotary
embedding logic (inside forward, around the rotary_emb/position_ids handling)
instead of reassigning seq_len so attn_output.reshape(batch_size, seq_len, ...)
still uses the true query length.
if position_ids is None:
    seq_len = inputs_embeds.shape[1]
    position_ids = torch.arange(seq_len, device=inputs_embeds.device).unsqueeze(0)
Require position_ids instead of synthesizing them.
Line 726 makes position_ids optional again and generates a single [1, S] row for the whole batch. That changes the exported interface and can hide upstream caller bugs that should have been caught at the boundary.
🛡️ Proposed change
- if position_ids is None:
- seq_len = inputs_embeds.shape[1]
- position_ids = torch.arange(seq_len, device=inputs_embeds.device).unsqueeze(0)
+ if position_ids is None:
+ raise ValueError("position_ids must be provided for AD export")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4mm.py` around
lines 726 - 728, The code currently synthesizes position_ids when None (using
inputs_embeds.shape[1]) which hides caller errors and changes the exported
interface; instead, make position_ids mandatory by removing the fallback
generation and adding an explicit check that raises a clear error if
position_ids is None (e.g., in the same function/method where position_ids is
used, reference the position_ids parameter and inputs_embeds to validate), so
callers must supply a [batch_size, seq_len] tensor; ensure any
docstring/signature for the forward/inference method (the function using
position_ids in modeling_phi4mm.py) reflects this required parameter and the
raised error message points to the missing position_ids input.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

"""Qwen3.5 dense model for auto_deploy (text + vision).

Reference HF modeling file (not yet in a released transformers version):
transformers/models/qwen3_next/modeling_qwen3_next.py
(Qwen3.5 dense is architecturally a Qwen3Next without MoE)

This implementation differs from the HuggingFace original in the following ways:
* External kernel dependencies (flash-linear-attention, causal_conv1d) are replaced with
  autodeploy custom ops.
* Cache-related code paths have been removed (prefill-only).
* Training-related code paths have been removed.
* Unnecessary output fields have been removed.
* The GatedDeltaNet forward uses autodeploy custom ops:
  torch_causal_conv1d, torch_l2norm, torch_gated_delta_rule.
* mRoPE cos/sin can be computed outside the export boundary and
  passed in as ``position_embeddings`` for multimodal inputs.

This allows us to have a "pytorch" native reference implementation decoupled from bugs and
dependency issues in the source, while remaining weight-compatible with HF checkpoints.
"""
Add the standard 2026 Apache header to this new file.
Line 1 still says 2025, and the file is missing the Apache 2.0 header block used in the other new source files under this directory. That leaves the new file out of compliance with the repo's required header format.
As per coding guidelines **/*.{cpp,h,cu,cuh,hpp,py}: All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of the latest meaningful modification. The header should be an Apache 2.0 license block as specified.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5.py` around
lines 1 - 22, Update the file header in modeling_qwen3_5.py to include the
standard 2026 NVIDIA Apache-2.0 license block used across the project (replace
the current "2025" year and the missing header); insert the full Apache 2.0
header/license comment at the top of the file matching other TensorRT-LLM source
files and ensure the copyright line references 2026 and NVIDIA CORPORATION,
leaving the rest of the module (e.g., the Qwen3.5 docstring and functions)
unchanged.
if has_vision:
    # Multimodal: compute mRoPE positions with spatial (T, H, W) layout.
    # NOTE: This path needs the mRoPE position delta cache transform to work
    # correctly with the AD runtime during decode steps.
    assert False, (
        "Vision path not yet supported in AD runtime. Requires mRoPE position "
        "delta cache transform. See: "
        "https://github.com/nv-auto-deploy/TensorRT-LLM/pull/189#discussion_r2915084063"
    )
    position_ids, _ = self.get_rope_index(
        input_ids,
        image_grid_thw=image_grid_thw,
        video_grid_thw=video_grid_thw,
        attention_mask=attention_mask,
    )
🧩 Analysis chain
🌐 Web query:
Does Python remove assert statements when running with the -O flag, and should unsupported code paths raise NotImplementedError or RuntimeError instead?
💡 Result:
Yes. Running Python with -O (“optimize”) removes assert statements from the generated bytecode (and also removes code guarded by if __debug__: by setting __debug__ to False).[1] So assert must not be relied on for required validation or control flow.
For “unsupported” code paths, pick the exception based on why it’s unsupported:
- Use NotImplementedError when the object is intended to support the operation, but the implementation is intentionally missing (e.g., abstract base-class method, stub during development).[2]
- Use RuntimeError for an unexpected internal situation that "doesn't fall into any other category".[2]

For "should be unreachable" paths where you still want the failure in optimized runs, raise an exception explicitly (e.g., raise RuntimeError("unreachable")), rather than assert False.
Sources: [1] Python “Setup and Usage” (-O removes asserts / __debug__-guarded code). [2] Python docs: built-in exceptions (NotImplementedError, RuntimeError).
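To make the -O behavior concrete, here is a small standalone sketch (not from the PR) that can be compared under python script.py versus python -O script.py:

    def unsafe_guard(has_vision: bool) -> str:
        if has_vision:
            assert False, "vision path unsupported"  # stripped under -O
            # Under -O the assert disappears, so execution falls through to the return below.
        return "text"

    def safe_guard(has_vision: bool) -> str:
        if has_vision:
            raise NotImplementedError("vision path unsupported")  # survives -O
        return "text"

    if __name__ == "__main__":
        try:
            print("unsafe:", unsafe_guard(True))  # AssertionError normally, "text" under -O
        except AssertionError as exc:
            print("unsafe raised:", exc)
        try:
            print("safe:", safe_guard(True))      # raises in both modes
        except NotImplementedError as exc:
            print("safe raised:", exc)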
🏁 Script executed:
# Check if file exists and get line count
wc -l tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 133
🏁 Script executed:
# Read the specific lines mentioned in the review (1391-1405)
sed -n '1391,1405p' tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 823
🏁 Script executed:
# Search for other instances of assert False in this file
grep -n "assert False" tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 95
🏁 Script executed:
# Get broader context around line 1395
sed -n '1370,1420p' tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 2967
🏁 Script executed:
# Check CODING_GUIDELINES.md for exception handling guidance
find . -name "CODING_GUIDELINES.md" -type f
Repository: NVIDIA/TensorRT-LLM
Length of output: 85
🏁 Script executed:
# Check CODING_GUIDELINES.md for exception handling recommendations
cat CODING_GUIDELINES.md | grep -A 20 -B 2 "exception\|Exception\|assert\|Assert"
Repository: NVIDIA/TensorRT-LLM
Length of output: 3787
Replace assert False with raise NotImplementedError() for unsupported vision path.
Line 1395 uses assert False, which Python removes entirely when run with the -O flag (optimize mode). In optimized runs, this guard disappears and execution continues into a path explicitly marked unsupported. Use raise NotImplementedError() instead to ensure the guard persists across all execution modes.
Proposed change
if has_vision:
# Multimodal: compute mRoPE positions with spatial (T, H, W) layout.
# NOTE: This path needs the mRoPE position delta cache transform to work
# correctly with the AD runtime during decode steps.
- assert False, (
+ raise NotImplementedError(
"Vision path not yet supported in AD runtime. Requires mRoPE position "
"delta cache transform. See: "
"https://github.com/nv-auto-deploy/TensorRT-LLM/pull/189#discussion_r2915084063"
)
- position_ids, _ = self.get_rope_index(
- input_ids,
- image_grid_thw=image_grid_thw,
- video_grid_thw=video_grid_thw,
- attention_mask=attention_mask,
- )
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 1395-1395: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5.py` around
lines 1391 - 1405, The guard in the vision branch (when has_vision is true in
the block that calls self.get_rope_index) uses assert False which is stripped
under Python -O; replace that assert with raise NotImplementedError(...) so the
unsupported vision path always raises at runtime; update the exception text to
keep the same explanatory message referencing the mRoPE position delta cache
transform and the PR link to aid debugging (keep reference to
self.get_rope_index and the vision/multimodal comment).
| AutoConfig.register("qwen3_5", Qwen3_5Config, exist_ok=True) | ||
| AutoConfig.register("qwen3_5_text", Qwen3_5TextConfig, exist_ok=True) |
🧩 Analysis chain
🌐 Web query:
Does transformers.AutoConfig.register accept an exist_ok keyword in the Transformers version used by this repo, and if not, what backward-compatible registration pattern should be used?
💡 Result:
No—if this repo is on an older Transformers where AutoConfig.register is defined as register(model_type, config) (e.g., Transformers v4.12.0), it does not accept an exist_ok keyword. In v4.12.0 the method signature has no exist_ok parameter. [1] By contrast, in newer Transformers (e.g., v4.42.4) the documented signature is register(model_type, config, exist_ok=False). [2]
A backward-compatible pattern is to feature-detect (or catch TypeError) and fall back:
from transformers import AutoConfig
import inspect

def register_autoconfig(model_type: str, config_cls):
    sig = inspect.signature(AutoConfig.register)
    if "exist_ok" in sig.parameters:
        AutoConfig.register(model_type, config_cls, exist_ok=True)
        return
    # Old Transformers: no exist_ok
    try:
        AutoConfig.register(model_type, config_cls)
    except ValueError:
        # already registered; ignore only if it's the same class
        from transformers.models.auto.configuration_auto import CONFIG_MAPPING
        if CONFIG_MAPPING[model_type] is not config_cls:
            raise

Sources:
[1] Transformers v4.12.0 AutoConfig.register(model_type, config) implementation (no exist_ok).
[2] Transformers v4.42.4 docs showing AutoConfig.register(model_type, config, exist_ok=False).
🏁 Script executed:
# Check the target file at the specified lines
cat -n tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5.py | sed -n '1475,1485p'
Repository: NVIDIA/TensorRT-LLM
Length of output: 651
🏁 Script executed:
# Find other custom-model files to verify the existing pattern
find tensorrt_llm -name "modeling_*.py" -path "*/custom/*" -type f
Repository: NVIDIA/TensorRT-LLM
Length of output: 2891
🏁 Script executed:
# Check Transformers version requirement
fd -t f -e txt -e toml -e py -e cfg . | xargs grep -l "transformers" | head -10
Repository: NVIDIA/TensorRT-LLM
Length of output: 594
🏁 Script executed:
# Check a few other custom-model files for the error-handling pattern
head -50 tensorrt_llm/_torch/auto_deploy/models/custom/modeling_cohere.py | tail -30
Repository: NVIDIA/TensorRT-LLM
Length of output: 1332
🏁 Script executed:
# Look at the end of a custom model file to see registration pattern
tail -20 tensorrt_llm/_torch/auto_deploy/models/custom/modeling_cohere.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 810
🏁 Script executed:
# Check another one
tail -20 tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 763
🏁 Script executed:
# Find Transformers version requirement
find . -maxdepth 2 -name "pyproject.toml" -o -name "setup.py" -o -name "requirements*.txt" | xargs grep -i transformers 2>/dev/null | head -10
Repository: NVIDIA/TensorRT-LLM
Length of output: 223
🏁 Script executed:
# Search for all uses of AutoConfig.register in custom modeling files
rg "AutoConfig\.register" tensorrt_llm/_torch/auto_deploy/models/custom/ -B 2 -A 2Repository: NVIDIA/TensorRT-LLM
Length of output: 8444
Guard AutoConfig.register(..., exist_ok=True) calls with try-except for backward compatibility with older Transformers.
Lines 1480-1481 will fail on Transformers versions that do not support the exist_ok parameter. Several other custom-model files already use the try-except fallback pattern (e.g., modeling_kimi_k2.py, modeling_exaone.py, modeling_phi4flash.py). Apply the same pattern here:
Proposed change
-AutoConfig.register("qwen3_5", Qwen3_5Config, exist_ok=True)
-AutoConfig.register("qwen3_5_text", Qwen3_5TextConfig, exist_ok=True)
+for model_type, config_cls in (
+ ("qwen3_5", Qwen3_5Config),
+ ("qwen3_5_text", Qwen3_5TextConfig),
+):
+ try:
+ AutoConfig.register(model_type, config_cls, exist_ok=True)
+ except TypeError:
+ try:
+ AutoConfig.register(model_type, config_cls)
+ except ValueError:
+ pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5.py` around
lines 1480 - 1481, The AutoConfig.register calls for Qwen3_5Config and
Qwen3_5TextConfig use the exist_ok=True parameter which breaks on older
Transformers; wrap both registrations in a try-except that first calls
AutoConfig.register("qwen3_5", Qwen3_5Config, exist_ok=True) and
AutoConfig.register("qwen3_5_text", Qwen3_5TextConfig, exist_ok=True) inside the
try, and in the except TypeError (or generic Exception) fall back to calling
AutoConfig.register("qwen3_5", Qwen3_5Config) and
AutoConfig.register("qwen3_5_text", Qwen3_5TextConfig) so the code remains
compatible with versions that lack the exist_ok parameter.
Please rebase; the nano config already exists in this location. Just update the relevant config (sharding_source: ['manual'] was removed here).
Same as nanov3: the file exists, so please rebase and add any relevant changes.
Summary
[^7] footnote for all AutoDeploy-supported architectures

New architectures
DeepseekV2ForCausalLM, ExaoneForCausalLM, Gemma2ForCausalLM, GemmaForCausalLM, GlmMoeDsaForCausalLM, GraniteMoeHybridForCausalLM, HunYuanDenseV1ForCausalLM, HunYuanMoEV1ForCausalLM, InternLM2ForCausalLM, Olmo2ForCausalLM, OpenELMForCausalLM, Phi4FlashForCausalLM, Phi4VisionRForConditionalGeneration, SeedOssForCausalLM, Starcoder2ForCausalLM

Test plan
__init__.py

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Infrastructure