gpt-oss-20b will not properly load with MXFP4 quantization even though the Triton version satisfies the requirement. #42723

@nickeisenberg

Description

System Info

$ hf env

Copy-and-paste the text below in your GitHub issue.

- huggingface_hub version: 0.36.0
- Platform: Linux-4.18.0-553.83.1.1toss.t4.x86_64-x86_64-with-glibc2.28
- Python version: 3.12.11
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: /g/g11/eisenbnt/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers: store
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.9.1
- Jinja2: 3.1.6
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: N/A
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 2.3.5
- pydantic: N/A
- aiohttp: 3.13.2
- hf_xet: 1.2.0
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /g/g11/eisenbnt/.cache/huggingface/hub
- HF_ASSETS_CACHE: /g/g11/eisenbnt/.cache/huggingface/assets
- HF_TOKEN_PATH: /g/g11/eisenbnt/.cache/huggingface/token
- HF_STORED_TOKENS_PATH: /g/g11/eisenbnt/.cache/huggingface/stored_tokens
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_DISABLE_XET: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
((dev) ) eisenbnt@matrix9:~
$ nvidia-smi
Mon Dec  8 17:00:32 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:4C:00.0 Off |                    0 |
| N/A   35C    P0             70W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
((dev) ) eisenbnt@matrix9:~
$
((dev) ) eisenbnt@matrix9:~
$ pip show triton
Name: triton
Version: 3.5.1
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/triton-lang/triton/
Author: Philippe Tillet
Author-email: phil@openai.com
License:
Location: /usr/workspace/eisenbnt/.venvman/envs/3.12/dev/lib64/python3.12/site-packages
Requires:
Required-by: torch
((dev) ) eisenbnt@matrix9:~
$

Who can help?

@ArthurZucker @Cyrilvallez

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

>>> from transformers.models.gpt_oss.modeling_gpt_oss import GptOssForCausalLM
>>> def get_gpt_oss(device):
...     model = GptOssForCausalLM.from_pretrained(
...         "openai/gpt-oss-20b",
...     )
...     return model.to(device)
...
>>> model = get_gpt_oss("cuda:0")
MXFP4 quantization requires Triton and kernels installed: CUDA requires Triton >= 3.4.0, XPU requires Triton
>= 3.5.0, we will default to dequantizing the model to bf16
Loading checkpoint shards: 100%|███████████████████████████████████████████████| 3/3 [00:19<00:00,  6.56s/it]
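
The warning names two dependencies, Triton and "kernels", and Triton 3.5.1 is already installed, so the missing piece may be the kernels package rather than Triton itself. A minimal check (my own diagnostic sketch, not part of the original report):

>>> import importlib.util
>>> for name in ("triton", "kernels"):
...     print(name, "installed" if importlib.util.find_spec(name) else "missing")
...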

Expected behavior

Triton 3.5.1 is installed, which satisfies the stated CUDA requirement (>= 3.4.0), so I expected the model to load with its MXFP4 weights intact rather than being dequantized to bf16. Is there anything special I need to do to get this working? Thank you!
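
For reference, a sketch of what I would expect to work, assuming the missing dependency is the Hugging Face kernels package; the package name and the device_map flag are my assumptions and are not confirmed by the warning itself:

$ pip install -U kernels
$ python
>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained(
...     "openai/gpt-oss-20b",
...     device_map="cuda:0",  # place the model on the GPU at load time instead of .to(device)
... )

With both Triton >= 3.4.0 and kernels present, the dequantization warning should not appear.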
