
[Bug] TRTLLMGenFusedMoE + CUTLASS MoE both fail on SM120 (RTX PRO 6000) with FP4 Qwen3-Next MoE in v1.3.0rc4 #11932

@bhaktatejas922

Description

System Info

  • TensorRT-LLM version: 1.3.0rc4 (Docker: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc4)
  • GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (SM120, 96GB GDDR7) x4
  • Driver: 570.211.01
  • OS: Ubuntu 24.04 (GCP g4-standard-192)
  • Model: Qwen3-Next 80B MoE (~3B active), NVFP4 quantized

Problem

Neither the TRTLLM nor CUTLASS MoE backend works on SM120 GPUs with FP4 MoE models. This model runs fine on B200 (SM100).

Error 1: moe_config.backend: TRTLLM

NotImplementedError: TRTLLMGenFusedMoE does not support SM120 and above.
[TRT-LLM] [E] Failed to initialize executor on rank 0: TRTLLMGenFusedMoE does not support SM120 and above.

Error 2: moe_config.backend: CUTLASS

[TRT-LLM] [W] [Autotuner] Failed when profiling runner=MoERunner, tactic=6
Error: [TensorRT-LLM][ERROR] Assertion failed: Failed to initialize cutlass TMA WS grouped gemm.
Error: Error Internal (cutlass_kernel_file_gemm_grouped_sm120_M128_BS_group2.generated.cu:39)

Total failed profiling tactics: 16 for custom_op=trtllm::fused_moe::gemm2

The CUTLASS backend loads all 1,412 weights (~46 s) and runs the autotuner, but the executor worker then crashes with RuntimeError: Executor worker returned error.
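
For reference, the compute capability each backend sees can be confirmed with a few lines of standard PyTorch (nothing TRT-LLM-specific): SM120 (RTX PRO 6000 Blackwell) reports (12, 0), while B200 reports (10, 0), which matches the "SM120 and above" check in Error 1.

# Print the compute capability of each visible GPU.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> SM{major}{minor}")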

Serving Config

tensor_parallel_size: 1
moe_expert_parallel_size: 1
max_batch_size: 4
max_num_tokens: 32768
max_seq_len: 32768
trust_remote_code: true
moe_config:
  backend: CUTLASS  # also tried TRTLLM
kv_cache_config:
  free_gpu_memory_fraction: 0.85
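
If it helps triage, here is roughly the same configuration expressed through the Python LLM API. This is a sketch only: the MoeConfig and KvCacheConfig class names and fields are my assumption based on the YAML keys above, so adjust as needed.

# Approximate Python-API equivalent of serving.yaml (class/field names assumed from the YAML keys).
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig, MoeConfig

llm = LLM(
    model="/workspace/model",                 # NVFP4 Qwen3-Next checkpoint
    tensor_parallel_size=1,
    moe_expert_parallel_size=1,
    max_batch_size=4,
    max_num_tokens=32768,
    max_seq_len=32768,
    trust_remote_code=True,
    moe_config=MoeConfig(backend="CUTLASS"),  # "TRTLLM" fails the same way (Error 1)
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.85),
)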

Questions

  1. Is SM120 FP4 MoE expected to work in 1.3.0rc4, or is this a known regression?
  2. If not yet supported, which upcoming release will include SM120 FP4 MoE support?
  3. Are there any workarounds (e.g., a specific quantization format, env flags, or config changes)? A sketch for checking what quantization layout the checkpoint declares is included below.
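
For question 3, this is how I would dump what the checkpoint itself declares about its quantization, in case the exact NVFP4 layout matters. The file names are assumptions (standard Hugging Face config.json plus a ModelOpt-style hf_quant_config.json); adjust if the checkpoint differs.

# Print the quantization block(s) the checkpoint declares (file names assumed, see note above).
import json
import pathlib

ckpt = pathlib.Path("/workspace/model")
for name in ("config.json", "hf_quant_config.json"):
    path = ckpt / name
    if not path.exists():
        continue
    cfg = json.loads(path.read_text())
    print(f"--- {name} ---")
    # config.json usually nests it under "quantization_config"; hf_quant_config.json is the block itself.
    print(json.dumps(cfg.get("quantization_config", cfg), indent=2))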

Steps to Reproduce

# On a g4-standard-192 (4x RTX PRO 6000 SM120)
docker run --gpus '"device=0"' --shm-size=16g \
  -v /path/to/nvfp4_checkpoint:/workspace/model:ro \
  nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc4 \
  trtllm-serve /workspace/model --host 0.0.0.0 --port 8000 \
  --extra_llm_api_options serving.yaml

Here serving.yaml contains the serving config shown above.
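
When the server does come up (e.g., on B200), I verify it with a plain OpenAI-compatible request like the sketch below; on SM120 the crash happens during executor initialization, so this request is never answered. The served model name is assumed to be the mounted path; adjust if trtllm-serve registers it under a different name.

# Smoke test against the OpenAI-compatible endpoint exposed by trtllm-serve (model name assumed).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
resp = client.chat.completions.create(
    model="/workspace/model",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=16,
)
print(resp.choices[0].message.content)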
