Conversation

@43758726
Collaborator

Thanks for your contribution; we appreciate it a lot. The following instructions will help keep your pull request healthy and make it easier to receive feedback. If you do not understand some items, don't worry; just make the pull request and seek help from the maintainers.

Motivation

Due to a bug in the scale calculation, the weights of a model converted from bf16 to fp8 by this code differ slightly from the weights of the official fp8 model.

Modification

The quant_utils.py file under the path lmdeploy/lmdeploy/lite/quantization/weight has been modified. In quant_blocked_fp8, the code that converted the weight to fp32 has been removed. In _get_quant_scaling, an eps value specific to the weight's dtype is used to clamp the scale and prevent division by zero.
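
As a minimal sketch of the idea (the function name and the fp8 dtype below are illustrative assumptions, not the exact code in the PR):

import torch

# Illustrative sketch only; the real _get_quant_scaling in quant_utils.py differs.
# The point of the change: take eps from the weight's own dtype and use it to keep
# the scale strictly positive, instead of first casting the weight to fp32.
def get_blocked_fp8_scale_sketch(weight: torch.Tensor,
                                 fp8_dtype: torch.dtype = torch.float8_e4m3fn,
                                 dim: int = -1) -> torch.Tensor:
    fmax = torch.finfo(fp8_dtype).max            # largest representable fp8 value
    eps = torch.finfo(weight.dtype).eps          # dtype-specific floor (bf16 vs fp32)
    amax = weight.abs().amax(dim, keepdim=True)  # per-block absolute maximum
    return (amax / fmax).clamp_min(eps)          # the scale can never be zero downstream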

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@lvhan028
Collaborator

There are linting errors. You may resolve them as follows:

pip install pre-commit
cd /the/root/path/of/lmdeploy/repo
pre-commit install
pre-commit run --all-files

Make sure that the Python version is 3.10.

Contributor

Copilot AI left a comment

Pull request overview

This PR aims to fix a quantization scale calculation bug that caused bf16 to fp8 weight conversion to differ from official fp8 model weights. The changes modify the scale computation logic and dtype handling in the quantization utilities.

Key changes:

  • Modified fast_pow2_torch and fast_round_scale_torch to accept a dtype parameter
  • Updated _get_quant_scaling to use dtype-specific epsilon values and prevent division by zero
  • Removed explicit float32 conversion in quant_blocked_fp8 to preserve original weight dtype

Comment on lines 46 to 51

scaling = torch.where(scaling == 0,
                      torch.tensor(eps, dtype=scaling.dtype, device=scaling.device),
                      scaling,
                      )

Copilot AI Dec 24, 2025

The torch.tensor creation inside torch.where can be inefficient as it creates a new tensor on each call. Consider using scaling.clamp_min(eps) instead, which would be more efficient and cleaner. This would achieve the same result of ensuring scaling values are at least eps.

Suggested change
-scaling = torch.where(scaling == 0,
-                      torch.tensor(eps, dtype=scaling.dtype, device=scaling.device),
-                      scaling,
-                      )
+scaling = scaling.clamp_min(eps)
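
As a side note, the two forms are not identical for values that are positive but smaller than eps; a quick illustrative check (not part of the PR):

import torch

# torch.where only replaces exact zeros, while clamp_min also lifts tiny positive
# values up to eps. For the zero case the reviewer targets, both give the same result.
eps = torch.finfo(torch.float32).eps
scaling = torch.tensor([0.0, 1e-9, 0.5])
print(torch.where(scaling == 0, torch.tensor(eps), scaling))  # tensor([1.1921e-07, 1.0000e-09, 5.0000e-01])
print(scaling.clamp_min(eps))                                 # tensor([1.1921e-07, 1.1921e-07, 5.0000e-01])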

finfo = torch.finfo(fp8_dtype)
fmax = finfo.max
amax = weight.abs().amax(dim, keepdim=True).clamp_min(1e-6).float()
eps = torch.finfo(weight.dtype).eps

Copilot AI Dec 24, 2025

Getting eps from weight.dtype may produce unexpected behavior when weight.dtype is bfloat16. The epsilon value for bfloat16 is much larger than float32 (approximately 0.0078 vs 1.19e-07), which could significantly affect quantization accuracy. Consider using a consistent epsilon value or explicitly document this dtype-dependent behavior.

Suggested change
-eps = torch.finfo(weight.dtype).eps
+eps = torch.finfo(torch.float32).eps
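
For reference, the epsilon values in question can be checked directly (illustrative snippet, not part of the PR):

import torch

print(torch.finfo(torch.bfloat16).eps)  # 0.0078125
print(torch.finfo(torch.float16).eps)   # 0.0009765625
print(torch.finfo(torch.float32).eps)   # 1.1920928955078125e-07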

Comment on lines +21 to +23
 def fast_pow2_torch(x: torch.Tensor, weight_dtype: torch.dtype) -> torch.Tensor:
     bits_x = (x + 127) << 23
-    return bits_x.view(torch.float32)
+    return bits_x.view(weight_dtype)

Copilot AI Dec 24, 2025

The fast_pow2_torch function performs bit manipulation assuming IEEE 754 float32 representation (adding 127 bias and shifting 23 bits for the mantissa). The bitcast operation on line 23 should always return float32 to maintain correctness, regardless of the input weight dtype. Using weight_dtype for the view operation will produce incorrect results for non-float32 dtypes like bfloat16 or float16, as they have different bit representations. The result should always be float32 since that's what the bit pattern represents.
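
To make the reviewer's point concrete, a small standalone demonstration (not code from the PR) of why the bit pattern must be viewed as float32:

import torch

# (x + 127) << 23 places x in the float32 exponent field, so reinterpreting the
# bits as float32 yields 2**x; viewing the same bits as bf16 or fp16 would not.
e = torch.tensor([-3, 0, 5], dtype=torch.int32)
bits = (e + 127) << 23
print(bits.view(torch.float32))  # tensor([ 0.1250,  1.0000, 32.0000])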

-def fast_round_scale_torch(amax: torch.Tensor, fp8_max: torch.Tensor) -> torch.Tensor:
-    return fast_pow2_torch(fast_log2_ceil_torch(amax / fp8_max))
+def fast_round_scale_torch(amax: torch.Tensor, fp8_max: torch.Tensor, weight_dtype: torch.dtype) -> torch.Tensor:
+    return fast_pow2_torch(fast_log2_ceil_torch(amax / fp8_max), weight_dtype)

Copilot AI Dec 24, 2025

Remove the trailing whitespace at the end of this line.

Suggested change (identical except for the removed trailing whitespace)
    return fast_pow2_torch(fast_log2_ceil_torch(amax / fp8_max), weight_dtype)
Comment on lines 26 to 27
def fast_round_scale_torch(amax: torch.Tensor, fp8_max: torch.Tensor, weight_dtype: torch.dtype) -> torch.Tensor:
    return fast_pow2_torch(fast_log2_ceil_torch(amax / fp8_max), weight_dtype)

Copilot AI Dec 24, 2025

The parameter weight_dtype is misleading because this function doesn't operate on weights but on scaling factors. Additionally, the bit manipulation in this function assumes IEEE 754 float32 representation, so the result type should always be float32. This parameter should be removed and the function should always return float32.
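
A rough sketch of the shape the reviewer is suggesting (fast_log2_ceil_torch here is an assumed stand-in; the helper in the repo may be implemented differently):

import torch

# Sketch only: no dtype parameter, and the result is always float32, because the
# bit trick is only meaningful for the IEEE 754 float32 layout.
def fast_log2_ceil_torch(x: torch.Tensor) -> torch.Tensor:
    # assumed stand-in for the repo's helper; returns ceil(log2(x)) as int32
    return torch.ceil(torch.log2(x)).to(torch.int32)

def fast_pow2_torch(e: torch.Tensor) -> torch.Tensor:
    # build float32 bit patterns for 2**e via the exponent field
    return ((e + 127) << 23).view(torch.float32)

def fast_round_scale_torch(amax: torch.Tensor, fp8_max: torch.Tensor) -> torch.Tensor:
    # round the scale up to the nearest power of two, always returned as float32
    return fast_pow2_torch(fast_log2_ceil_torch(amax / fp8_max))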
