
feat: fully implement compressed-tensors gs32 support in TurboMind #4429

Open

lapy wants to merge 5 commits into InternLM:main from lapy:split/compressed-tensors-gs32

Conversation

@lapy (Contributor) commented Mar 19, 2026

Implement the complete end-to-end TurboMind path for compressed-tensors checkpoints with group_size=32, while keeping the format-specific behavior explicit and narrowly scoped to the combinations we actually support.

Converter and format handling:

  • replace the old implicit group_size == 128 assumption with an explicit quantized format/group-size support matrix
  • keep AWQ and GPTQ restricted to gs128 while allowing compressed-tensors to use the validated gs32 and gs128 paths
  • centralize default group-size normalization so grouped formats behave consistently when the caller leaves the value unset
  • surface compressed-tensors in the CLI and engine config help text so the format is first-class instead of only being recognized internally
  • continue routing compressed-tensors through the AWQ-style int4 export path only after the format- and group-size-specific validation passes
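The explicit support matrix and default group-size normalization described above can be sketched roughly as follows. This is an illustrative sketch only: the names (`SUPPORTED_GROUP_SIZES`, `normalize_group_size`, `DEFAULT_GROUP_SIZE`) and the exact validation message are assumptions, not the actual lmdeploy identifiers.

```python
# Illustrative sketch of an explicit format/group-size support matrix.
# All names here are invented for this example, not lmdeploy's real API.
SUPPORTED_GROUP_SIZES = {
    'awq': {128},                       # AWQ stays restricted to gs128
    'gptq': {128},                      # GPTQ stays restricted to gs128
    'compressed-tensors': {32, 128},    # validated gs32 and gs128 paths
}
DEFAULT_GROUP_SIZE = 128


def normalize_group_size(model_format, group_size=None):
    """Apply the shared default, then validate against the matrix."""
    if not group_size:  # caller left the value unset (None or 0)
        group_size = DEFAULT_GROUP_SIZE
    allowed = SUPPORTED_GROUP_SIZES.get(model_format)
    if allowed is None or group_size not in allowed:
        raise ValueError(f'unsupported combination: '
                         f'format={model_format!r}, group_size={group_size}')
    return group_size
```

Centralizing the default in one place means every grouped format sees the same behavior when the caller leaves the group size unset, and unsupported combinations fail loudly instead of silently falling through.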

Grouped int4 export behavior:

  • handle AWQ/GPTQ qweight-based tensors and compressed-tensors weight_packed-based tensors through the same grouped-int4 parameter path instead of introducing a dedicated compressed-tensors parameter class
  • synthesize symmetric int4 zero-points directly from exported scale tensor shapes instead of relying on a hard-coded gs128-derived shape
  • document the synthesized zero-point behavior in the parameter export path so the intent is clear
  • tighten the Qwen3.5 compressed-tensors dequantization path used when linear-attention weights must be materialized in fp16
  • switch the dequant implementation to unpack directly into the final fp16 layout, avoiding the previous temporary-heavy unpack pattern before scaling
  • preserve the symmetric pack-quantized int4 interpretation used by the compressed-tensors checkpoints supported here
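The synthesized zero-point behavior above can be illustrated with a small sketch. For symmetric int4 every zero-point is the midpoint value 8, so the tensor can be built purely from the exported scale shape rather than a hard-coded gs128-derived shape. The AWQ-style packed layout assumed here (eight 4-bit values per 32-bit word, stored as `uint32` for simplicity) and the function name are illustrative, not the actual lmdeploy code.

```python
import numpy as np


def synthesize_zero_points(scales: np.ndarray) -> np.ndarray:
    """Synthesize symmetric int4 zero-points from the scale shape.

    Assumed layout: scales is (k // group_size, n); the result packs
    eight 4-bit midpoint values (8) per 32-bit word, giving shape
    (k // group_size, n // 8) regardless of the group size.
    """
    n_groups, n = scales.shape
    assert n % 8 == 0, 'output dim must pack eight int4 values per word'
    word = np.uint32(0x88888888)  # eight packed nibbles, each equal to 8
    return np.full((n_groups, n // 8), word, dtype=np.uint32)
```

Because the shape is derived from the scales, a gs32 checkpoint (which has four times as many scale groups as gs128) automatically gets matching zero-point rows.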

Qwen3.5 linear-attention support:

  • keep the linear-attention fallback able to initialize from compressed-tensors weights, not only AWQ weights
  • preserve the fixed positional tuple layout expected by the TurboMind export code while only materializing the tensors that are present
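The fixed positional tuple contract above can be pictured with a tiny sketch: the export code indexes by position, so tensors that a given checkpoint does not provide are kept as `None` placeholders rather than dropped. The slot names below are invented for this illustration and do not reflect the real TurboMind layout.

```python
# Hypothetical slot order; the real TurboMind export layout may differ.
LINATTN_SLOTS = ('weight', 'scales', 'zeros', 'bias')


def pack_linattn_tuple(available: dict) -> tuple:
    """Return a fixed-arity tuple, with None for absent tensors."""
    return tuple(available.get(name) for name in LINATTN_SLOTS)
```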

Kernel registrations:

  • add gs32 int4 GEMM registrations across the relevant TurboMind backends used by grouped int4 weight-only execution
  • implement the extra registrations with small local helpers so the gs32 enablement does not require duplicating full registration tables in each backend file
  • keep the registration changes limited to the formats and group sizes that are intentionally supported

Regression coverage:

  • add a focused compressed-tensors test module that can run without the native _turbomind extension being present during collection
  • verify synthesized zero-point tensor shapes are derived from scale shapes correctly through the shared grouped-int4 parameter path
  • verify _compressed_tensors_dequant() matches a trusted reference implementation for pack-quantized symmetric int4 weights
  • verify Qwen3.5 linear-attention initialization can materialize compressed-tensors weights through the fallback path
  • verify unsupported format/group-size combinations still fail loudly with explicit errors
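A trusted reference implementation for the symmetric pack-quantized int4 dequant (the kind of oracle the `_compressed_tensors_dequant()` check above compares against) might look like the sketch below. The packing layout assumed here — eight two's-complement nibbles per 32-bit word, low nibble first, scales shaped (rows, cols // group_size) — is an assumption for illustration, not the verified lmdeploy behavior.

```python
import numpy as np


def dequant_pack_int4(weight_packed, scales, group_size=32):
    """Unpack symmetric pack-quantized int4 directly into fp16.

    Assumed layout: weight_packed is (rows, cols // 8) int32 holding
    eight 4-bit two's-complement values per word, low nibble first;
    scales is (rows, cols // group_size) fp16.
    """
    rows, packed_cols = weight_packed.shape
    cols = packed_cols * 8
    # Extract all eight nibbles per word in one vectorized pass.
    shifts = np.arange(8, dtype=np.uint32) * 4
    nibbles = (weight_packed.view(np.uint32)[..., None] >> shifts) & 0xF
    nibbles = nibbles.reshape(rows, cols).astype(np.int16)
    # Sign-extend 4-bit two's complement: nibbles >= 8 map to negatives.
    nibbles = nibbles - (nibbles >= 8) * 16
    # Scale per group, writing into the final fp16 layout.
    out = nibbles.astype(np.float16)
    out *= np.repeat(scales.astype(np.float16), group_size, axis=1)
    return out
```

Unpacking straight into the fp16 output like this avoids the intermediate unpacked-int buffers mentioned in the description, at the cost of fixing the nibble order up front.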

Verification:

  • python -m py_compile lmdeploy/turbomind/deploy/parameter.py tests/test_lmdeploy/test_turbomind/test_compressed_tensors.py
  • python -m pytest tests/test_lmdeploy/test_turbomind/test_compressed_tensors.py -q


Tested successfully with model:

lmdeploy serve api_server cyankiwi/Qwen3.5-27B-AWQ-4bit --tp 2

Copilot AI review requested due to automatic review settings March 19, 2026 00:50
Copilot AI left a comment


Pull request overview

Implements end-to-end TurboMind support for compressed-tensors checkpoints with group_size=32 (and 128), including format/group-size validation, shared grouped-int4 export plumbing, Qwen3.5 linear-attention fp16 materialization for compressed-tensors weights, and kernel registrations for gs32 int4 GEMM.

Changes:

  • Add explicit format/group-size support matrix + centralized group-size normalization/validation in the converter.
  • Unify AWQ/GPTQ and compressed-tensors weight-only int4 export handling (including synthesized symmetric zero-points derived from scale shapes).
  • Register gs32 kernels across multiple backends and add regression tests for compressed-tensors behavior (including dequant reference checks and linear-attn fallback).

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

File Description
tests/test_lmdeploy/test_turbomind/test_compressed_tensors.py New regression tests covering support matrix, synthesized zero-points, dequant correctness, and Qwen3.5 linear-attn fallback.
src/turbomind/kernels/gemm/kernel/sm90_16816_4.cu Adds gs32 registrations via helper lambdas for SM90 u4 GEMM configs.
src/turbomind/kernels/gemm/kernel/sm80_16816_4.cu Adds gs32 registrations via helper lambdas for SM80 u4 GEMM configs.
src/turbomind/kernels/gemm/kernel/sm75_16816_4.cu Adds gs32 registrations via helper lambdas for SM75 u4 GEMM configs.
src/turbomind/kernels/gemm/kernel/sm70_884_4.cu Adds explicit gs32 registration blocks for SM70 u4 GEMM configs.
lmdeploy/turbomind/deploy/source_model/qwen.py Implements _compressed_tensors_dequant and extends linear-attn fallback to handle compressed-tensors weights.
lmdeploy/turbomind/deploy/parameter.py Unifies grouped int4 parameter export for AWQ/GPTQ + compressed-tensors; synthesizes zero-points from scale shapes.
lmdeploy/turbomind/deploy/converter.py Adds supported format list updates plus group-size normalization/validation used by converter + engine init.
lmdeploy/messages.py Updates engine-config help text for model formats and compressed-tensors behavior.
lmdeploy/cli/utils.py Updates CLI --model-format help text to mention auto-detected formats like compressed-tensors.


@lvhan028 lvhan028 self-requested a review March 19, 2026 06:12
@lvhan028 lvhan028 added the enhancement New feature or request label Mar 19, 2026
web-flow and others added 3 commits March 19, 2026 08:24
Enhance the model format options in the engine configuration and CLI to include 'compressed-tensors' as a valid choice. This change clarifies the handling of pack-quantized grouped int4 checkpoints and ensures consistency across documentation and implementation. Additionally, remove a redundant kernel registration in the TurboMind backend to streamline the codebase.
Introduce two new functions, _complete_parallel_config and _update_parallel_config, to enhance the configuration management for parallel processing in TurboMind. These functions ensure proper setup of device and parallel sizes, improving the handling of distributed training scenarios. The changes include assertions for configuration consistency and adjustments to device allocation based on available resources.
@lvhan028
Copy link
Collaborator

Hi, @lapy
Could you sync your branch with the latest main? There have been some CI fixes merged recently that should resolve the current failures.

@lvhan028 lvhan028 requested review from 43758726 and lzhangzz March 20, 2026 07:10

Labels

enhancement New feature or request

4 participants