Skip to content

[Feature] 集成 TurboQuant KV Cache 压缩#6

Open
sunghajung6688 wants to merge 4 commits into
hw-native-sys:mainfrom
sunghajung6688:turboquant
Open

[Feature] 集成 TurboQuant KV Cache 压缩#6
sunghajung6688 wants to merge 4 commits into
hw-native-sys:mainfrom
sunghajung6688:turboquant

Conversation

@sunghajung6688
Copy link
Copy Markdown

将 TurboQuant V3(纯 MSE 模式)集成到 pypto-serving 的 KV Cache 管理中,支持在线压缩/解压缩历史 token,减少 KV Cache 内存占用。

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces TurboQuant KV cache quantization to optimize memory usage during prefill and decode phases. The implementation includes a new turboquant module featuring MSE-optimal compression using random rotation and Lloyd-Max quantization. Feedback focuses on critical performance bottlenecks: the current compression logic re-processes the entire sequence prefix in each step, leading to $O(N^2)$ complexity; the manual Python loops used for writing to paged cache should be replaced with vectorized advanced indexing; and several constant tensor allocations and PDF calculations should be pre-computed or vectorized to reduce overhead.

Comment thread python/core/kv_cache.py
if alloc.request_id not in pool.compressed_segments:
pool.compressed_segments[alloc.request_id] = {}

for layer_idx in range(pool.num_layers):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

当前实现在每次调用时都会重新压缩整个序列前缀(从 token 0 到 tokens_used - residual_window)。在 decode 循环中调用时,这将导致相对于序列长度的 $O(N^2)$ 复杂度,随着上下文增长,性能将严重下降。建议实现增量压缩,仅处理刚刚超出 residual_window 的新 token。此外,调用的 read_context 方法内部也存在类似的循环读取瓶颈,建议一并优化。

Comment thread python/core/kv_cache.py Outdated
Comment on lines +379 to +385
for row in range(keys.shape[0]):
token_index = start_token_index + row
page_idx = token_index // pool.page_size
offset = token_index % pool.page_size
physical_page = alloc.page_ids[page_idx]
pool.key_pages[layer_idx, physical_page, :, offset, :] = keys[row].to(cache_dtype)
pool.value_pages[layer_idx, physical_page, :, offset, :] = values[row].to(cache_dtype)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

在 Python 循环中遍历 token 并写入分页缓存(paged cache)非常低效,特别是对于设备张量(NPU/GPU)。应使用 PyTorch 的高级索引(advanced indexing)进行向量化,以在单个操作中完成更新。

Suggested change
for row in range(keys.shape[0]):
token_index = start_token_index + row
page_idx = token_index // pool.page_size
offset = token_index % pool.page_size
physical_page = alloc.page_ids[page_idx]
pool.key_pages[layer_idx, physical_page, :, offset, :] = keys[row].to(cache_dtype)
pool.value_pages[layer_idx, physical_page, :, offset, :] = values[row].to(cache_dtype)
token_indices = torch.arange(start_token_index, start_token_index + keys.shape[0], device=keys.device)
page_indices = token_indices // pool.page_size
offsets = token_indices % pool.page_size
physical_pages = torch.tensor(alloc.page_ids, device=keys.device)[page_indices]
pool.key_pages[layer_idx, physical_pages, :, offsets, :] = keys.to(cache_dtype)
pool.value_pages[layer_idx, physical_pages, :, offsets, :] = values.to(cache_dtype)

Comment on lines +68 to +72
idx_powers = torch.tensor(
[2 ** (self.bits * i) for i in range(indices_per_byte - 1, -1, -1)],
dtype=torch.long,
device=idx_flat.device,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

idx_powers 是一个常量张量。在每次 compress 调用时创建它是低效的。建议在 __init__ 中根据 bits 预先计算并存储为 buffer,以减少冗余分配和 H2D 拷贝。

Comment thread python/core/turboquant/lloyd_max.py Outdated
Comment on lines +47 to +48
pdf_vals = torch.tensor([pdf(x) for x in xs])
weighted = xs * pdf_vals
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

高斯 PDF 计算目前在 Python 列表推导式中执行,对于 2048 个样本来说较慢。可以使用 PyTorch 操作轻松实现向量化,提高初始化速度。

            xs = torch.linspace(a, b, n_samples)
            pdf_vals = (1.0 / (math.sqrt(2 * math.pi) * sigma)) * torch.exp(-xs**2 / (2 * sigma**2))
            weighted = xs * pdf_vals

@sunghajung6688 sunghajung6688 force-pushed the turboquant branch 4 times, most recently from c502fd4 to 628a70a Compare May 26, 2026 07:20
@sunghajung6688 sunghajung6688 force-pushed the turboquant branch 15 times, most recently from d3cc145 to d717090 Compare May 28, 2026 03:02
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5c09f685-00ed-42dc-b5df-2a84328e3e87

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds TurboQuant, a KV cache quantization system for reducing inference memory, consisting of core quantization algorithms (Lloyd-Max optimal scalar codebook, absmax quantization), integration into the KV cache manager with compress/restore lifecycles, NPU and PyTorch execution paths, worker/serving coordination, and CLI configuration support. Documentation and example scripts demonstrate offline and online usage.

Changes

TurboQuant KV Cache Quantization Integration

Layer / File(s) Summary
Core quantization types and configuration
python/core/types.py, python/core/turboquant/__init__.py
KvQuantConfig dataclass (enabled, key_bits, value_bits, residual_window, protected_layers, protected_bits) and RuntimeConfig.kv_quant_config field enable quantization configuration across the system.
Lloyd-Max optimal scalar quantizer
python/core/turboquant/lloyd_max.py, tests/test_lloyd_max.py
Iterative Lloyd-Max algorithm with vectorized Gaussian conditional-mean updates, LRU caching, and codebook precomputation; test suite validates MSE quality vs. legacy numerical approach and benchmarks speed across (d, bits) configurations.
Absmax compressor and KVCompressor implementation
python/core/turboquant/compressor.py
AbsmaxCompressor performs per-row absmax scaling and INT-N quantization with optional bit-packing; KVCompressor manages per-layer compressors with layer-adaptive protected bit-widths and dual Python/NPU execution paths for compress/decompress.
KV cache manager quantization integration
python/core/kv_cache.py
KvCacheManager initializes compressor during model registration, provides compress_old_tokens / restore_compressed_tokens APIs for token lifecycle management, tracks compressed segments per request, and exposes quantization_stats for memory reporting.
NPU executor L2 callable compilation
examples/model/qwen3_14b/runner/npu_executor.py
_compile_l2_callable_from_jit compiles @pl.jit functions into _L2Callable with JIT caching, dummy-arg binding for metadata extraction, and compile_and_assemble integration; placeholder tq_compress/tq_decompress stubs in _compile_model.
Model runner TurboQuant wiring
examples/model/qwen3_14b/runner/npu_runner.py
run_tq_compress / run_tq_decompress dispatch to compiled L2 callables or PyTorch fallback; compress/restore integrated into prefill and decode paths; L2 orchestration refactored to separate tensor/scalar arguments; debug logging tracks per-batch TQ state.
Serving worker compressed segment lifecycle
python/core/serving_worker.py
Worker corrects tokens_used to full context length after prefill; clears stale compressed segments during allocation creation/update and freeing to maintain segment consistency across preemption.
CLI configuration and serving startup
python/cli/main.py
load_serving_config() parses optional kv_quant JSON block and instantiates KvQuantConfig; default weight_dtype updated to bfloat16; startup logging reports TurboQuant settings when enabled.
Example scripts and CLI utilities
examples/model/qwen3_14b/npu_generate.py, examples/model/qwen3_14b/test_kv_capacity.py, examples/model/qwen3_14b/npu_serving.json
npu_generate.py adds --kv-quant and quantization-parameter flags; test_kv_capacity.py measures KV pool capacity with/without TurboQuant; config updated to weight_dtype: bfloat16.
Documentation and repository metadata
README.md, .gitmodules, .claude/skills
README restructured with Offline/Online mode sections, TurboQuant setup instructions, Common CLI Flags table, and Benchmark section; submodule URL updated; skills config added.

Sequence Diagram

sequenceDiagram
  participant Client
  participant Worker as WorkerProcess
  participant Runner as Qwen314BModelRunner
  participant KVCache as KvCacheManager
  participant Compressor as KVCompressor
  participant NPU as NPU/L2
  
  Client->>Worker: generate_result(prompt)
  Worker->>Runner: run_prefill()
  Runner->>Runner: prefill tokens
  Runner->>KVCache: compress_old_tokens(alloc)
  KVCache->>Compressor: compress_layer()
  Compressor->>NPU: run_kv_quantize (if available)
  NPU-->>Compressor: quantized indices
  Compressor->>KVCache: store compressed segment
  KVCache-->>Runner: freed pages
  
  loop decode steps
    Worker->>Runner: run_decode()
    Runner->>KVCache: restore_compressed_tokens(alloc)
    KVCache->>Compressor: decompress_layer()
    Compressor->>NPU: run_kv_dequantize (if available)
    NPU-->>Compressor: decompressed tokens
    Compressor->>KVCache: write into pages
    KVCache-->>Runner: restored pages
    Runner->>Runner: decode kernel
    Runner->>KVCache: compress_old_tokens(alloc)
    KVCache->>Compressor: compress_layer()
    Compressor-->>KVCache: compressed segment
  end
  
  Worker->>KVCache: free(alloc)
  KVCache->>KVCache: clear_compressed_segments()
  KVCache-->>Client: result
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 In a digital warren deep and vast,
Where tokens cache and memories amass,
A TurboQuant tale—compress to save the day,
BF16 to INT-N in the NPU's way!
Absmax scales the hoard with Lloyd-Max's grace, 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 67.80% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title '[Feature] 集成 TurboQuant KV Cache 压缩' clearly describes the main feature being added: integration of TurboQuant KV Cache compression, which aligns with the substantial changes across multiple files for implementing KV cache quantization.
Description check ✅ Passed The description 'Integrating TurboQuant V3 (pure MSE mode) into pypto-serving's KV Cache management, supporting online compression/decompression of historical tokens to reduce KV Cache memory usage' directly relates to the changeset, explaining the purpose and impact of the TurboQuant integration.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

🧹 Nitpick comments (1)
examples/model/qwen3_14b/test_kv_capacity.py (1)

112-112: ⚡ Quick win

Align weight dtype with the other NPU TurboQuant entrypoints.

Using float32 here makes KV-capacity comparisons noisier by increasing weight memory pressure; bfloat16 better matches the updated serving/generation defaults.

💡 Proposed fix
-            weight_dtype="float32",
+            weight_dtype="bfloat16",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/model/qwen3_14b/test_kv_capacity.py` at line 112, Update the weight
dtype used in the KV-capacity test to match other NPU TurboQuant entrypoints by
changing the weight_dtype setting from "float32" to "bfloat16" in the
test_kv_capacity setup (look for the weight_dtype="float32" argument in the KV
capacity test config and replace it with weight_dtype="bfloat16").
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/model/qwen3_14b/runner/npu_runner.py`:
- Around line 299-305: The restore/compress calls operate on stale host-side KV
state and must be synchronized with the worker child-memory buffer: before
calling restore_compressed_tokens() or compress_old_tokens() (e.g., inside the
block where you iterate batch.kv_allocations when
is_quantization_enabled(model_id)), explicitly pull/sync the latest pages from
the worker child buffer into the host cache (use the same mechanism as
_l2_child_tensor(..., refresh=True) or add/ call a kv cache
sync_from_worker(model_id, alloc) helper) so host reads reflect the decode
buffer; after restore/compress, push any modified pages back to the worker
(sync_to_worker) or mark _l2_dirty_kv_models appropriately so pages will be
uploaded before the next decode, ensuring the quantization lifecycle is kept
consistent between restore_compress and the worker-resident cache.
- Around line 1177-1188: Wrap the noisy debug prints in _prepare_decode_inputs
behind the existing debug flag (e.g. check self._l3_trace or a dedicated
TurboQuant debug switch) so they only execute when tracing is enabled;
specifically guard the two print calls that reference batch_idx, seq_len,
max_blocks, alloc.page_ids, alloc.tokens_used, slot_mapping and the block_table
print with an if self._l3_trace (or if _l3_trace) block so the heavy stdout
output and flush=True are skipped by default.
- Around line 391-408: The compressed-segments memory calc calls sum(...) on
values returned by .numel(), which is an int and causes TypeError; update the
block that iterates pool.compressed_segments so that you add the .numel()
results directly (and multiply by 2 for "scales" entries) instead of wrapping
them in sum(), e.g., use seg.compressed_k.get("idx_bytes",
torch.Tensor()).numel() and seg.compressed_k.get("scales",
torch.Tensor()).numel() * 2 (same for compressed_v), and ensure the accumulated
value is cast to a float before dividing by 1024*1024 to compute compressed_mb.

In `@examples/model/qwen3_14b/test_kv_capacity.py`:
- Around line 58-59: The --step argument must be validated to be a positive
integer to avoid non-terminating loops that use prompt_len and step in the
capacity test; add validation (either a custom argparse type like positive_int
or an explicit check after parsing) to ensure args.step > 0 (and optionally cap
it against args.max_prompt or other relevant bounds), and raise/exit with a
clear error if the value is invalid so the loop that manipulates prompt_len
cannot stall or regress (refer to parser, args.step, prompt_len and the loop
logic around lines where prompt_len is increased).
- Around line 73-74: The code resolves the provided model directory into
model_dir but doesn't validate that it exists and is a directory; update the
argument handling around model_dir (where model_dir =
Path(args.model_dir).resolve()) to check model_dir.exists() and
model_dir.is_dir(), and if not, exit early with a clear error message (e.g., via
sys.exit or raising ArgumentError) that includes the invalid path
(args.model_dir) so failures fail fast with useful context.

In `@python/core/kv_cache.py`:
- Around line 489-516: compress_old_tokens is prematurely replacing BF16 pages
with compressed segments (see read_context -> pool.compressed_segments
assignment and use of pool.kv_compressor.compress_layer / compress_layer_npu
with run_tq_compress) so chunked run_prefill which expects original pages later
ends up with missing prompt pages; change the logic to not overwrite/free the
original BF16 pages during prefill: either defer creating/assigning
pool.compressed_segments inside compress_old_tokens to only run in decode paths,
or ensure a restore step is performed before any chunked prefill by invoking the
existing restore/restore_pages routine (or add one) so read_context +
compress_layer(*) do not remove the resident BF16 pages until after prefill is
complete.
- Around line 373-377: Replace all synchronous print(..., flush=True) debug
traces in python/core/kv_cache.py with gated logging (e.g., logger.debug) so
they are emitted only when debug logging is enabled; specifically update the
print in the restore/compress code paths and inside
_write_tokens_from_compressed and any other restore/compress helpers to call the
module logger (or an existing logger instance) instead of printing to stdout,
remove explicit flush usage, and preserve the original message formatting so
semantics remain the same.

In `@python/core/serving_worker.py`:
- Around line 343-346: The call to
self.kv_cache_manager.clear_compressed_segments(self.config.model_id, req_id) in
_get_or_update_allocation() is clearing compressed state on every allocation
refresh (prefill/decode) and breaks KvCacheManager.restore_compressed_tokens()
for multi-step requests; move or guard this call so it only runs on actual
preemption/reset paths (e.g., where you detect a preemption or allocation reset)
instead of the general allocation update flow—adjust _get_or_update_allocation()
to only invoke kv_cache_manager.clear_compressed_segments(...) when the function
is handling a preemption/reset event, leaving the compressed segments intact
during normal prefill/decode steps.

In `@python/core/turboquant/compressor.py`:
- Around line 189-199: Replace the duplicate KvQuantConfig dataclass in
compressor.py with an import of the single authoritative KvQuantConfig
definition so there is one source of truth; specifically, remove the local
KvQuantConfig class and import the KvQuantConfig symbol defined in the other
module (the canonical types definition) and update any local references to use
that imported class to ensure default values (e.g., value_bits) remain
consistent across the codebase.

In `@README.md`:
- Line 120: Update the README command example to clarify the device ID
placeholder: replace the ambiguous `--device {}` with a concrete example like
`--device 0` or add a short parenthetical note after the command explaining that
users must replace `{}` with their device ID (e.g., `0` for the first GPU).
Ensure the README's example uses the chosen concrete value or includes the
explicit instruction so readers aren't confused by empty braces.
- Around line 155-166: Update the README's kv_quant JSON example to include the
two supported options missing from the example: add "protected_layers" and
"protected_bits" with their default values (4 and 8) so the configuration
matches the implementation in python/cli/main.py; locate the kv_quant block in
the README sample and append these keys with appropriate values and brief
comments or values so users see all available parameters.

In `@tests/test_lloyd_max.py`:
- Line 127: The f-string prefix is used on plain strings without placeholders in
tests/test_lloyd_max.py (e.g., the print calls that currently read print(f" 
PASS: new MSE <= old MSE (within 0.5%)") and the similar PASS/FAIL prints),
which triggers Ruff F541; fix by removing the unnecessary leading "f" from those
print statements so they are plain strings (e.g., change print(f"...") to
print("...")) for each affected print occurrence.
- Around line 107-120: quantization_mse is integrating via Riemann sums but
never multiplies by the interval width dx, so the returned value total_mse /
n_pts is wrong; fix it by computing dx per interval (dx = (b - a) / (n_pts - 1))
inside the loop where a,b are taken from boundaries, multiply each interval's
summed contribution mse_i by that dx before adding to total_mse (i.e., total_mse
+= mse_i.item() * dx), and then return total_mse (remove the final division by
n_pts); refer to the function quantization_mse, variables boundaries, a, b, xs,
mse_i, total_mse, n_pts, and sigma.

---

Nitpick comments:
In `@examples/model/qwen3_14b/test_kv_capacity.py`:
- Line 112: Update the weight dtype used in the KV-capacity test to match other
NPU TurboQuant entrypoints by changing the weight_dtype setting from "float32"
to "bfloat16" in the test_kv_capacity setup (look for the weight_dtype="float32"
argument in the KV capacity test config and replace it with
weight_dtype="bfloat16").
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4f1389df-bcec-4ad4-98b9-f16906f52a18

📥 Commits

Reviewing files that changed from the base of the PR and between efe0a6f and d717090.

📒 Files selected for processing (18)
  • .claude/skills
  • .claude/skills
  • .gitmodules
  • README.md
  • examples/model/qwen3_14b/npu_generate.py
  • examples/model/qwen3_14b/npu_serving.json
  • examples/model/qwen3_14b/runner/npu_executor.py
  • examples/model/qwen3_14b/runner/npu_runner.py
  • examples/model/qwen3_14b/test_kv_capacity.py
  • pypto-lib
  • python/cli/main.py
  • python/core/kv_cache.py
  • python/core/serving_worker.py
  • python/core/turboquant/__init__.py
  • python/core/turboquant/compressor.py
  • python/core/turboquant/lloyd_max.py
  • python/core/types.py
  • tests/test_lloyd_max.py

Comment on lines +299 to +305
# ── KV quantization: restore compressed old tokens BEFORE building
# block_table so that workspace pages are included in page_ids. ──
t_tq_restore = time.perf_counter()
if self._kv_cache_manager.is_quantization_enabled(model_id):
for alloc in batch.kv_allocations:
self._kv_cache_manager.restore_compressed_tokens(model_id, alloc, npu_runner=self)
dt_tq_restore = (time.perf_counter() - t_tq_restore) * 1000
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Restore/compress are operating on stale KV state.

restore_compressed_tokens() and compress_old_tokens() both mutate/read the host-side cache, but decode runs against _l2_child_tensor(..., refresh=refresh_kv_cache) worker allocations. Once _l2_dirty_kv_models has been cleared, restored pages are never re-uploaded before the next decode, and the post-kernel compression path reads host KV that has not been synced back from the child-memory decode buffer. That will quantize/decode the wrong history after the first quantized step. Please either add explicit KV sync points around TurboQuant or keep the quantization lifecycle entirely on the worker-resident cache. python/runtime/worker.py:89-124 and python/core/kv_cache.py:346-403,444-507 show the contract mismatch.

Also applies to: 313-333, 374-379

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/model/qwen3_14b/runner/npu_runner.py` around lines 299 - 305, The
restore/compress calls operate on stale host-side KV state and must be
synchronized with the worker child-memory buffer: before calling
restore_compressed_tokens() or compress_old_tokens() (e.g., inside the block
where you iterate batch.kv_allocations when is_quantization_enabled(model_id)),
explicitly pull/sync the latest pages from the worker child buffer into the host
cache (use the same mechanism as _l2_child_tensor(..., refresh=True) or add/
call a kv cache sync_from_worker(model_id, alloc) helper) so host reads reflect
the decode buffer; after restore/compress, push any modified pages back to the
worker (sync_to_worker) or mark _l2_dirty_kv_models appropriately so pages will
be uploaded before the next decode, ensuring the quantization lifecycle is kept
consistent between restore_compress and the worker-resident cache.

Comment on lines +391 to +408
# Compressed segments memory estimate
compressed_mb = 0.0
if pool and pool.compressed_segments:
for req_segs in pool.compressed_segments.values():
for seg in req_segs.values():
compressed_mb += sum(
seg.compressed_k.get("idx_bytes", torch.Tensor()).numel(),
)
compressed_mb += sum(
seg.compressed_k.get("scales", torch.Tensor()).numel() * 2,
)
compressed_mb += sum(
seg.compressed_v.get("idx_bytes", torch.Tensor()).numel(),
)
compressed_mb += sum(
seg.compressed_v.get("scales", torch.Tensor()).numel() * 2,
)
compressed_mb = compressed_mb / 1024 / 1024
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The KV-memory stats block will throw on the first compressed segment.

numel() returns an int, so sum(seg.compressed_k.get(...).numel(),) raises TypeError: 'int' object is not iterable. Because this sits on the quantized decode path, the logger will start crashing requests as soon as compression succeeds once.

Suggested fix
-            compressed_mb = 0.0
+            compressed_bytes = 0
             if pool and pool.compressed_segments:
                 for req_segs in pool.compressed_segments.values():
                     for seg in req_segs.values():
-                        compressed_mb += sum(
-                            seg.compressed_k.get("idx_bytes", torch.Tensor()).numel(),
-                        )
-                        compressed_mb += sum(
-                            seg.compressed_k.get("scales", torch.Tensor()).numel() * 2,
-                        )
-                        compressed_mb += sum(
-                            seg.compressed_v.get("idx_bytes", torch.Tensor()).numel(),
-                        )
-                        compressed_mb += sum(
-                            seg.compressed_v.get("scales", torch.Tensor()).numel() * 2,
-                        )
-                compressed_mb = compressed_mb / 1024 / 1024
+                        idx_k = seg.compressed_k.get("idx_bytes", torch.empty(0, dtype=torch.uint8))
+                        scales_k = seg.compressed_k.get("scales", torch.empty(0, dtype=torch.bfloat16))
+                        idx_v = seg.compressed_v.get("idx_bytes", torch.empty(0, dtype=torch.uint8))
+                        scales_v = seg.compressed_v.get("scales", torch.empty(0, dtype=torch.bfloat16))
+                        compressed_bytes += idx_k.numel() * idx_k.element_size()
+                        compressed_bytes += scales_k.numel() * scales_k.element_size()
+                        compressed_bytes += idx_v.numel() * idx_v.element_size()
+                        compressed_bytes += scales_v.numel() * scales_v.element_size()
+                compressed_mb = compressed_bytes / 1024 / 1024
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Compressed segments memory estimate
compressed_mb = 0.0
if pool and pool.compressed_segments:
for req_segs in pool.compressed_segments.values():
for seg in req_segs.values():
compressed_mb += sum(
seg.compressed_k.get("idx_bytes", torch.Tensor()).numel(),
)
compressed_mb += sum(
seg.compressed_k.get("scales", torch.Tensor()).numel() * 2,
)
compressed_mb += sum(
seg.compressed_v.get("idx_bytes", torch.Tensor()).numel(),
)
compressed_mb += sum(
seg.compressed_v.get("scales", torch.Tensor()).numel() * 2,
)
compressed_mb = compressed_mb / 1024 / 1024
# Compressed segments memory estimate
compressed_bytes = 0
if pool and pool.compressed_segments:
for req_segs in pool.compressed_segments.values():
for seg in req_segs.values():
idx_k = seg.compressed_k.get("idx_bytes", torch.empty(0, dtype=torch.uint8))
scales_k = seg.compressed_k.get("scales", torch.empty(0, dtype=torch.bfloat16))
idx_v = seg.compressed_v.get("idx_bytes", torch.empty(0, dtype=torch.uint8))
scales_v = seg.compressed_v.get("scales", torch.empty(0, dtype=torch.bfloat16))
compressed_bytes += idx_k.numel() * idx_k.element_size()
compressed_bytes += scales_k.numel() * scales_k.element_size()
compressed_bytes += idx_v.numel() * idx_v.element_size()
compressed_bytes += scales_v.numel() * scales_v.element_size()
compressed_mb = compressed_bytes / 1024 / 1024
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/model/qwen3_14b/runner/npu_runner.py` around lines 391 - 408, The
compressed-segments memory calc calls sum(...) on values returned by .numel(),
which is an int and causes TypeError; update the block that iterates
pool.compressed_segments so that you add the .numel() results directly (and
multiply by 2 for "scales" entries) instead of wrapping them in sum(), e.g., use
seg.compressed_k.get("idx_bytes", torch.Tensor()).numel() and
seg.compressed_k.get("scales", torch.Tensor()).numel() * 2 (same for
compressed_v), and ensure the accumulated value is cast to a float before
dividing by 1024*1024 to compute compressed_mb.

Comment on lines +1177 to +1188
print(
f"[TQ-DEBUG] _prepare_decode_inputs: batch_idx={batch_idx}, "
f"seq_len={seq_len}, max_blocks={max_blocks}, "
f"alloc.page_ids len={len(alloc.page_ids)}, "
f"alloc.page_ids={alloc.page_ids}, "
f"alloc.tokens_used={alloc.tokens_used}, "
f"slot_mapping={slot_mapping[batch_idx].item()}",
flush=True,
)
print(
f"[TQ-DEBUG] _prepare_decode_inputs: block_table={block_table.tolist()[:max_blocks]}",
flush=True,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gate the TQ decode dumps behind a debug flag.

This runs on every decode step and prints full page metadata with flush=True, so long generations will spend a surprising amount of time in stdout and produce very noisy logs. Please hide it behind _l3_trace or a dedicated TurboQuant debug switch.

Suggested fix
-            print(
-                f"[TQ-DEBUG] _prepare_decode_inputs: batch_idx={batch_idx}, "
-                f"seq_len={seq_len}, max_blocks={max_blocks}, "
-                f"alloc.page_ids len={len(alloc.page_ids)}, "
-                f"alloc.page_ids={alloc.page_ids}, "
-                f"alloc.tokens_used={alloc.tokens_used}, "
-                f"slot_mapping={slot_mapping[batch_idx].item()}",
-                flush=True,
-            )
-        print(
-            f"[TQ-DEBUG] _prepare_decode_inputs: block_table={block_table.tolist()[:max_blocks]}",
-            flush=True,
-        )
+            if self._l3_trace:
+                print(
+                    f"[TQ-DEBUG] _prepare_decode_inputs: batch_idx={batch_idx}, "
+                    f"seq_len={seq_len}, max_blocks={max_blocks}, "
+                    f"alloc.page_ids len={len(alloc.page_ids)}, "
+                    f"alloc.page_ids={alloc.page_ids}, "
+                    f"alloc.tokens_used={alloc.tokens_used}, "
+                    f"slot_mapping={slot_mapping[batch_idx].item()}",
+                    flush=True,
+                )
+        if self._l3_trace:
+            print(
+                f"[TQ-DEBUG] _prepare_decode_inputs: block_table={block_table.tolist()[:max_blocks]}",
+                flush=True,
+            )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/model/qwen3_14b/runner/npu_runner.py` around lines 1177 - 1188, Wrap
the noisy debug prints in _prepare_decode_inputs behind the existing debug flag
(e.g. check self._l3_trace or a dedicated TurboQuant debug switch) so they only
execute when tracing is enabled; specifically guard the two print calls that
reference batch_idx, seq_len, max_blocks, alloc.page_ids, alloc.tokens_used,
slot_mapping and the block_table print with an if self._l3_trace (or if
_l3_trace) block so the heavy stdout output and flush=True are skipped by
default.

Comment on lines +58 to +59
parser.add_argument("--step", type=int, default=32,
help="Increase prompt tokens by this much each iteration.")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate --step (and related bounds) to prevent non-terminating runs.

With --step <= 0, Line 147 can stall or regress prompt_len, making the loop at Line 135 non-terminating.

💡 Proposed fix
 def main() -> None:
     args = build_parser().parse_args()
+    if args.step <= 0:
+        raise ValueError("--step must be > 0.")
+    if args.start_len <= 0:
+        raise ValueError("--start-len must be > 0.")
+    if args.max_len < args.start_len:
+        raise ValueError("--max-len must be >= --start-len.")
+
     model_dir = Path(args.model_dir).resolve()

Also applies to: 135-148

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/model/qwen3_14b/test_kv_capacity.py` around lines 58 - 59, The
--step argument must be validated to be a positive integer to avoid
non-terminating loops that use prompt_len and step in the capacity test; add
validation (either a custom argparse type like positive_int or an explicit check
after parsing) to ensure args.step > 0 (and optionally cap it against
args.max_prompt or other relevant bounds), and raise/exit with a clear error if
the value is invalid so the loop that manipulates prompt_len cannot stall or
regress (refer to parser, args.step, prompt_len and the loop logic around lines
where prompt_len is increased).

Comment on lines +73 to +74
model_dir = Path(args.model_dir).resolve()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fail fast when --model-dir is invalid.

Line 73 resolves the path but never verifies it exists as a directory, so errors surface later with less clear context.

💡 Proposed fix
     args = build_parser().parse_args()
     model_dir = Path(args.model_dir).resolve()
+    if not model_dir.is_dir():
+        raise FileNotFoundError(f"Model directory does not exist: {model_dir}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
model_dir = Path(args.model_dir).resolve()
args = build_parser().parse_args()
model_dir = Path(args.model_dir).resolve()
if not model_dir.is_dir():
raise FileNotFoundError(f"Model directory does not exist: {model_dir}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/model/qwen3_14b/test_kv_capacity.py` around lines 73 - 74, The code
resolves the provided model directory into model_dir but doesn't validate that
it exists and is a directory; update the argument handling around model_dir
(where model_dir = Path(args.model_dir).resolve()) to check model_dir.exists()
and model_dir.is_dir(), and if not, exit early with a clear error message (e.g.,
via sys.exit or raising ArgumentError) that includes the invalid path
(args.model_dir) so failures fail fast with useful context.

Comment thread python/core/turboquant/compressor.py Outdated
Comment on lines +189 to +199
@dataclass
class KvQuantConfig:
"""Configuration for KV cache quantization."""

enabled: bool = False
key_bits: int = 4
value_bits: int = 4
residual_window: int = 128
protected_layers: int = 4
protected_bits: int = 8

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Unify KvQuantConfig to a single source of truth.

This file redefines KvQuantConfig with defaults that diverge from python/core/types.py (e.g., Line 195 uses value_bits=4 vs Line 26 in python/core/types.py uses value_bits=2). Having two config classes for the same contract can silently change behavior depending on import path.

Proposed fix
-from dataclasses import dataclass
+from dataclasses import dataclass
+from python.core.types import KvQuantConfig
@@
-@dataclass
-class KvQuantConfig:
-    """Configuration for KV cache quantization."""
-
-    enabled: bool = False
-    key_bits: int = 4
-    value_bits: int = 4
-    residual_window: int = 128
-    protected_layers: int = 4
-    protected_bits: int = 8
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@dataclass
class KvQuantConfig:
"""Configuration for KV cache quantization."""
enabled: bool = False
key_bits: int = 4
value_bits: int = 4
residual_window: int = 128
protected_layers: int = 4
protected_bits: int = 8
# (Import statement to be added at the top of the file:)
from python.core.types import KvQuantConfig
# Lines 189-199 are removed entirely; no code remains at this location
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/core/turboquant/compressor.py` around lines 189 - 199, Replace the
duplicate KvQuantConfig dataclass in compressor.py with an import of the single
authoritative KvQuantConfig definition so there is one source of truth;
specifically, remove the local KvQuantConfig class and import the KvQuantConfig
symbol defined in the other module (the canonical types definition) and update
any local references to use that imported class to ensure default values (e.g.,
value_bits) remain consistent across the codebase.

Comment thread README.md
```bash
python -m python.cli.main \
--config examples/model/qwen3_14b/npu_serving.json \
--serve --port 8899 --device {}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify the device ID placeholder.

The command uses --device {} with empty braces, which may confuse users. Consider using a concrete example like --device 0 or adding a comment explaining that users should replace {} with their device ID.

📝 Suggested fix
-  --serve --port 8899 --device {}
+  --serve --port 8899 --device 0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
--serve --port 8899 --device {}
--serve --port 8899 --device 0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` at line 120, Update the README command example to clarify the
device ID placeholder: replace the ambiguous `--device {}` with a concrete
example like `--device 0` or add a short parenthetical note after the command
explaining that users must replace `{}` with their device ID (e.g., `0` for the
first GPU). Ensure the README's example uses the chosen concrete value or
includes the explicit instruction so readers aren't confused by empty braces.

Comment thread README.md
Comment on lines +155 to +166
```json
{
"model": { ... },
"runtime": { ... },
"kv_quant": {
"enabled": true,
"key_bits": 4,
"value_bits": 2,
"residual_window": 128
}
}
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Complete the TurboQuant JSON configuration example.

The example kv_quant configuration is missing two available options: protected_layers and protected_bits. According to the implementation in python/cli/main.py, these options are supported with defaults of 4 and 8 respectively. Users should be aware of all available configuration parameters.

📝 Suggested fix
 {
   "model": { ... },
   "runtime": { ... },
   "kv_quant": {
     "enabled": true,
     "key_bits": 4,
     "value_bits": 2,
-    "residual_window": 128
+    "residual_window": 128,
+    "protected_layers": 4,
+    "protected_bits": 8
   }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 155 - 166, Update the README's kv_quant JSON example
to include the two supported options missing from the example: add
"protected_layers" and "protected_bits" with their default values (4 and 8) so
the configuration matches the implementation in python/cli/main.py; locate the
kv_quant block in the README sample and append these keys with appropriate
values and brief comments or values so users see all available parameters.

Comment thread tests/test_lloyd_max.py
Comment on lines +107 to +120
def quantization_mse(centroids_in):
"""Compute MSE of a scalar quantizer for N(0, sigma^2)."""
cs = sorted(centroids_in.tolist())
boundaries = [-1e10] + [(cs[i] + cs[i+1]) / 2 for i in range(len(cs)-1)] + [1e10]
total_mse = 0.0
n_pts = 10000
for i in range(len(cs)):
a, b = boundaries[i], boundaries[i+1]
xs = torch.linspace(a, b, n_pts, dtype=torch.float64)
pdf_vals = torch.exp(-0.5 * (xs / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
mse_i = ((xs - cs[i]) ** 2 * pdf_vals).sum()
total_mse += mse_i.item()
dx = (boundaries[-1] - boundaries[0]) / n_pts # approximate
return total_mse / n_pts # normalized
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Correct the MSE integration math in the validator.

quantization_mse currently computes dx but does not apply it; returning total_mse / n_pts is not the integral-based MSE and can mis-rank quantizers.

Proposed fix
 def quantization_mse(centroids_in):
@@
-        total_mse = 0.0
+        total_mse = 0.0
         n_pts = 10000
         for i in range(len(cs)):
             a, b = boundaries[i], boundaries[i+1]
             xs = torch.linspace(a, b, n_pts, dtype=torch.float64)
             pdf_vals = torch.exp(-0.5 * (xs / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
-            mse_i = ((xs - cs[i]) ** 2 * pdf_vals).sum()
-            total_mse += mse_i.item()
-        dx = (boundaries[-1] - boundaries[0]) / n_pts  # approximate
-        return total_mse / n_pts  # normalized
+            integrand = (xs - cs[i]) ** 2 * pdf_vals
+            total_mse += torch.trapz(integrand, xs).item()
+        return total_mse
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_lloyd_max.py` around lines 107 - 120, quantization_mse is
integrating via Riemann sums but never multiplies by the interval width dx, so
the returned value total_mse / n_pts is wrong; fix it by computing dx per
interval (dx = (b - a) / (n_pts - 1)) inside the loop where a,b are taken from
boundaries, multiply each interval's summed contribution mse_i by that dx before
adding to total_mse (i.e., total_mse += mse_i.item() * dx), and then return
total_mse (remove the final division by n_pts); refer to the function
quantization_mse, variables boundaries, a, b, xs, mse_i, total_mse, n_pts, and
sigma.

Comment thread tests/test_lloyd_max.py
print(f" MSE old: {mse_old:.6e}")
print(f" MSE new: {mse_new:.6e}")
if mse_new <= mse_old * 1.005:
print(f" PASS: new MSE <= old MSE (within 0.5%)")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix invalid f-strings flagged by lint.

Lines 127/134/136/146/148 use f"..." without placeholders, which triggers Ruff F541 and can fail CI.

Proposed fix
-        print(f"  PASS: new MSE <= old MSE (within 0.5%)")
+        print("  PASS: new MSE <= old MSE (within 0.5%)")
...
-        print(f"  PASS: centroids monotonically increasing")
+        print("  PASS: centroids monotonically increasing")
...
-        print(f"  FAIL: centroids not monotonically increasing")
+        print("  FAIL: centroids not monotonically increasing")
...
-        print(f"  PASS: symmetric around 0")
+        print("  PASS: symmetric around 0")
...
-        print(f"  WARN: not perfectly symmetric (expected for finite iterations)")
+        print("  WARN: not perfectly symmetric (expected for finite iterations)")

Also applies to: 134-134, 136-136, 146-146, 148-148

🧰 Tools
🪛 Ruff (0.15.14)

[error] 127-127: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_lloyd_max.py` at line 127, The f-string prefix is used on plain
strings without placeholders in tests/test_lloyd_max.py (e.g., the print calls
that currently read print(f"  PASS: new MSE <= old MSE (within 0.5%)") and the
similar PASS/FAIL prints), which triggers Ruff F541; fix by removing the
unnecessary leading "f" from those print statements so they are plain strings
(e.g., change print(f"...") to print("...")) for each affected print occurrence.

huangzhuo and others added 4 commits May 30, 2026 10:01
Integrate TurboQuant module (compressor, lloyd_max) with serving-v3
stack: extend kv_cache.py with compress/restore/stats methods, add
TurboQuant hooks in npu_runner.py (compress after prefill/decode,
restore before decode), clean compressed segments on preemption in
serving_worker.py, and add kv_quant JSON config parsing in CLI.
Integrate TurboQuant module (compressor, lloyd_max) with serving-v3
stack: extend kv_cache.py with compress/restore/stats methods, add
TurboQuant hooks in npu_runner.py (compress after prefill/decode,
restore before decode), clean compressed segments on preemption in
serving_worker.py, and add kv_quant JSON config parsing in CLI.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant