feat: fully implement compressed-tensors gs32 support in TurboMind#4429
Open
lapy wants to merge 5 commits into InternLM:main from
Conversation
Implement the complete end-to-end TurboMind path for `compressed-tensors` checkpoints with `group_size=32`, while keeping the format-specific behavior explicit and narrowly scoped to the combinations we actually support.

Converter and format handling:
- replace the old implicit `group_size == 128` assumption with an explicit quantized format/group-size support matrix
- keep AWQ and GPTQ restricted to gs128 while allowing `compressed-tensors` to use the validated gs32 and gs128 paths
- centralize default group-size normalization so grouped formats behave consistently when the caller leaves the value unset
- surface `compressed-tensors` in the CLI and engine config help text so the format is first-class instead of only being recognized internally
- continue routing `compressed-tensors` through the AWQ-style int4 export path only after the format- and group-size-specific validation passes

Grouped int4 export behavior:
- handle AWQ/GPTQ `qweight`-based tensors and compressed-tensors `weight_packed`-based tensors through the same grouped-int4 parameter path instead of introducing a dedicated compressed-tensors parameter class
- synthesize symmetric int4 zero-points directly from exported scale tensor shapes instead of relying on a hard-coded gs128-derived shape
- document the synthesized zero-point behavior in the parameter export path so the intent is clear
- tighten the Qwen3.5 compressed-tensors dequantization path used when linear-attention weights must be materialized in fp16
- switch the dequant implementation to unpack directly into the final fp16 layout, avoiding the previous temporary-heavy unpack pattern before scaling
- preserve the symmetric pack-quantized int4 interpretation used by the compressed-tensors checkpoints supported here

Qwen3.5 linear-attention support:
- keep the linear-attention fallback able to initialize from `compressed-tensors` weights, not only AWQ weights
- preserve the fixed positional tuple layout expected by the TurboMind export code while only materializing the tensors that are present

Kernel registrations:
- add gs32 int4 GEMM registrations across the relevant TurboMind backends used by grouped int4 weight-only execution
- implement the extra registrations with small local helpers so the gs32 enablement does not require duplicating full registration tables in each backend file
- keep the registration changes limited to the formats and group sizes that are intentionally supported

Regression coverage:
- add a focused compressed-tensors test module that can run without the native `_turbomind` extension being present during collection
- verify synthesized zero-point tensor shapes are derived from scale shapes correctly through the shared grouped-int4 parameter path
- verify `_compressed_tensors_dequant()` matches a trusted reference implementation for pack-quantized symmetric int4 weights
- verify Qwen3.5 linear-attention initialization can materialize compressed-tensors weights through the fallback path
- verify unsupported format/group-size combinations still fail loudly with explicit errors

Verification:
- `python -m py_compile lmdeploy/turbomind/deploy/parameter.py tests/test_lmdeploy/test_turbomind/test_compressed_tensors.py`
- `python -m pytest tests/test_lmdeploy/test_turbomind/test_compressed_tensors.py -q`
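The support-matrix and normalization behavior described above can be sketched as follows. This is an illustrative sketch only: the names `SUPPORTED_GROUP_SIZES` and `normalize_group_size` are hypothetical placeholders, not the actual lmdeploy symbols.

```python
# Hypothetical sketch of an explicit format/group-size support matrix with
# centralized default normalization; names are illustrative, not lmdeploy API.
SUPPORTED_GROUP_SIZES = {
    'awq': (128,),                      # AWQ stays restricted to gs128
    'gptq': (128,),                     # GPTQ stays restricted to gs128
    'compressed-tensors': (32, 128),    # validated gs32 and gs128 paths
}
DEFAULT_GROUP_SIZE = 128


def normalize_group_size(model_format, group_size=None):
    """Fill in the default when the caller leaves the value unset, then
    fail loudly on unsupported format/group-size combinations."""
    if not group_size:                  # centralized default normalization
        group_size = DEFAULT_GROUP_SIZE
    allowed = SUPPORTED_GROUP_SIZES.get(model_format)
    if allowed is None:
        raise ValueError(f'unsupported quantized format: {model_format!r}')
    if group_size not in allowed:
        raise ValueError(f'{model_format} supports group_size in {allowed}, '
                         f'got {group_size}')
    return group_size
```

With this shape, `normalize_group_size('compressed-tensors', 32)` succeeds while `normalize_group_size('awq', 32)` raises, matching the "fail loudly with explicit errors" requirement in the regression coverage.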
Contributor
Pull request overview
Implements end-to-end TurboMind support for compressed-tensors checkpoints with group_size=32 (and 128), including format/group-size validation, shared grouped-int4 export plumbing, Qwen3.5 linear-attention fp16 materialization for compressed-tensors weights, and kernel registrations for gs32 int4 GEMM.
Changes:
- Add explicit format/group-size support matrix + centralized group-size normalization/validation in the converter.
- Unify AWQ/GPTQ and compressed-tensors weight-only int4 export handling (including synthesized symmetric zero-points derived from scale shapes).
- Register gs32 kernels across multiple backends and add regression tests for compressed-tensors behavior (including dequant reference checks and linear-attn fallback).
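The "synthesized symmetric zero-points derived from scale shapes" change can be illustrated with a minimal sketch. Assumptions: with a symmetric int4 scheme every zero-point is the mid-code 8, so the zeros tensor carries no information beyond its shape, which simply mirrors the exported scales; plain Python lists stand in for the real torch tensors, and the helper name is hypothetical.

```python
# Hypothetical sketch: synthesize symmetric int4 zero-points purely from the
# exported scales' shape instead of a hard-coded gs128-derived shape.
def synthesize_symmetric_zeros(scales):
    """Return a zeros tensor matching `scales` shape, filled with mid-code 8."""
    return [[8] * len(row) for row in scales]


scales = [[0.5, 0.25, 1.0]] * 4   # e.g. 4 groups x 3 output channels
zeros = synthesize_symmetric_zeros(scales)
```

Because the group count is implicit in the scales' shape, the same path serves gs32 and gs128 checkpoints alike, which is the point of deriving the shape rather than hard-coding it.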
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_lmdeploy/test_turbomind/test_compressed_tensors.py | New regression tests covering support matrix, synthesized zero-points, dequant correctness, and Qwen3.5 linear-attn fallback. |
| src/turbomind/kernels/gemm/kernel/sm90_16816_4.cu | Adds gs32 registrations via helper lambdas for SM90 u4 GEMM configs. |
| src/turbomind/kernels/gemm/kernel/sm80_16816_4.cu | Adds gs32 registrations via helper lambdas for SM80 u4 GEMM configs. |
| src/turbomind/kernels/gemm/kernel/sm75_16816_4.cu | Adds gs32 registrations via helper lambdas for SM75 u4 GEMM configs. |
| src/turbomind/kernels/gemm/kernel/sm70_884_4.cu | Adds explicit gs32 registration blocks for SM70 u4 GEMM configs. |
| lmdeploy/turbomind/deploy/source_model/qwen.py | Implements _compressed_tensors_dequant and extends linear-attn fallback to handle compressed-tensors weights. |
| lmdeploy/turbomind/deploy/parameter.py | Unifies grouped int4 parameter export for AWQ/GPTQ + compressed-tensors; synthesizes zero-points from scale shapes. |
| lmdeploy/turbomind/deploy/converter.py | Adds supported format list updates plus group-size normalization/validation used by converter + engine init. |
| lmdeploy/messages.py | Updates engine-config help text for model formats and compressed-tensors behavior. |
| lmdeploy/cli/utils.py | Updates CLI --model-format help text to mention auto-detected formats like compressed-tensors. |
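A plain-Python reference in the spirit of the `_compressed_tensors_dequant()` check listed in the file summary can look like the sketch below. The packing convention assumed here (eight signed int4 values per 32-bit word, lowest nibble first) is an assumption for illustration; the real checkpoint layout should be verified against the compressed-tensors pack-quantized specification.

```python
# Hedged reference for pack-quantized symmetric int4 dequantization.
# Assumption: 8 signed nibbles per 32-bit word, lowest nibble first.
def unpack_int4_word(word):
    """Split one 32-bit word into eight signed int4 values in [-8, 7]."""
    nibbles = [(word >> s) & 0xF for s in range(0, 32, 4)]
    return [n - 16 if n >= 8 else n for n in nibbles]


def dequant_row(packed_row, scales_row, group_size):
    """Dequantize one output row: q * scale, one scale per group of columns."""
    q = [v for word in packed_row for v in unpack_int4_word(word)]
    return [q[i] * scales_row[i // group_size] for i in range(len(q))]
```

For example, `dequant_row([0x21], [0.5, 2.0], 4)` unpacks the word into `[1, 2, 0, ...]` and scales the first group of four columns by 0.5 and the second by 2.0. A trusted reference like this is what the PR's regression test compares the optimized fp16-layout implementation against.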
Enhance the model format options in the engine configuration and CLI to include 'compressed-tensors' as a valid choice. This change clarifies the handling of pack-quantized grouped int4 checkpoints and ensures consistency across documentation and implementation. Additionally, remove a redundant kernel registration in the TurboMind backend to streamline the codebase.
Introduce two new functions, _complete_parallel_config and _update_parallel_config, to enhance the configuration management for parallel processing in TurboMind. These functions ensure proper setup of device and parallel sizes, improving the handling of distributed training scenarios. The changes include assertions for configuration consistency and adjustments to device allocation based on available resources.
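The commit message above does not show the actual code, so the following is a highly speculative sketch of what a `_complete_parallel_config`-style helper might do: derive a missing device count from the parallel sizes, assert consistency, and allocate devices. Every name, signature, and rule here is an assumption.

```python
# Speculative sketch only: fill in the device count from tp/dp sizes and
# assert configuration consistency before allocating devices.
def complete_parallel_config(tp, dp, device_num=None):
    """Complete and validate a (tp, dp, device_num) parallel configuration."""
    if device_num is None:
        device_num = tp * dp                 # derive from parallel sizes
    assert tp * dp == device_num, 'tp * dp must match the device count'
    devices = list(range(device_num))        # simple contiguous allocation
    return tp, dp, devices
```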
Collaborator
Hi, @lapy
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry; just open the pull request and ask the maintainers for help.
Tested successfully with model: