[Feature] 集成 TurboQuant KV Cache 压缩 by sunghajung6688 · Pull Request #6 · hw-native-sys/pypto-serving

sunghajung6688 · 2026-05-19T07:08:14Z

Integrates TurboQuant V3 (pure MSE mode) into pypto-serving's KV Cache management, supporting online compression/decompression of historical tokens to reduce KV Cache memory usage.

gemini-code-assist

Code Review

This pull request introduces TurboQuant KV cache quantization to optimize memory usage during prefill and decode phases. The implementation includes a new turboquant module featuring MSE-optimal compression using random rotation and Lloyd-Max quantization. Feedback focuses on critical performance bottlenecks: the current compression logic re-processes the entire sequence prefix in each step, leading to $O(N^2)$ complexity; the manual Python loops used for writing to paged cache should be replaced with vectorized advanced indexing; and several constant tensor allocations and PDF calculations should be pre-computed or vectorized to reduce overhead.

gemini-code-assist · 2026-05-19T07:10:27Z

+        if alloc.request_id not in pool.compressed_segments:
+            pool.compressed_segments[alloc.request_id] = {}
+
+        for layer_idx in range(pool.num_layers):


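The current implementation re-compresses the entire sequence prefix (from token 0 to tokens_used - residual_window) on every call. When it is called inside the decode loop, this gives $O(N^2)$ complexity in the sequence length, so performance degrades severely as the context grows. Consider implementing incremental compression that only processes the new tokens that have just fallen outside the residual_window. The read_context method it calls has a similar per-token read loop internally and should be optimized at the same time.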
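One way to picture the incremental approach is a per-request watermark that records how many leading tokens are already compressed, so each call touches only the tokens that newly left the residual window. This is a minimal sketch, not the PR's code: `IncrementalCompressionTracker`, `compressed_upto`, and `tokens_to_compress` are hypothetical names, and a real `compress_old_tokens` would also have to append the new tokens to the stored segment.

```python
from dataclasses import dataclass, field


@dataclass
class IncrementalCompressionTracker:
    """Hypothetical helper: remembers how far each request is already compressed."""

    residual_window: int
    # request_id -> number of leading tokens already compressed (the watermark)
    compressed_upto: dict[str, int] = field(default_factory=dict)

    def tokens_to_compress(self, request_id: str, tokens_used: int) -> range:
        """Return only the token indices that newly left the residual window."""
        start = self.compressed_upto.get(request_id, 0)
        end = max(start, tokens_used - self.residual_window)
        self.compressed_upto[request_id] = end
        return range(start, end)


tracker = IncrementalCompressionTracker(residual_window=4)
touched = 0
for tokens_used in range(1, 11):  # simulate a context growing from 1 to 10 tokens
    touched += len(tracker.tokens_to_compress("req-0", tokens_used))
```

Over the simulated steps each token is visited at most once, so the total work grows linearly with the sequence length instead of quadratically.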

gemini-code-assist · 2026-05-19T07:10:27Z

+        for row in range(keys.shape[0]):
+            token_index = start_token_index + row
+            page_idx = token_index // pool.page_size
+            offset = token_index % pool.page_size
+            physical_page = alloc.page_ids[page_idx]
+            pool.key_pages[layer_idx, physical_page, :, offset, :] = keys[row].to(cache_dtype)
+            pool.value_pages[layer_idx, physical_page, :, offset, :] = values[row].to(cache_dtype)


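Iterating over tokens in a Python loop to write into the paged cache is very inefficient, especially for device tensors (NPU/GPU). Use PyTorch advanced indexing to vectorize the write so that the update completes in a single operation.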

Suggested change

-        for row in range(keys.shape[0]):
-            token_index = start_token_index + row
-            page_idx = token_index // pool.page_size
-            offset = token_index % pool.page_size
-            physical_page = alloc.page_ids[page_idx]
-            pool.key_pages[layer_idx, physical_page, :, offset, :] = keys[row].to(cache_dtype)
-            pool.value_pages[layer_idx, physical_page, :, offset, :] = values[row].to(cache_dtype)
+        token_indices = torch.arange(start_token_index, start_token_index + keys.shape[0], device=keys.device)
+        page_indices = token_indices // pool.page_size
+        offsets = token_indices % pool.page_size
+        physical_pages = torch.tensor(alloc.page_ids, device=keys.device)[page_indices]
+        pool.key_pages[layer_idx, physical_pages, :, offsets, :] = keys.to(cache_dtype)
+        pool.value_pages[layer_idx, physical_pages, :, offsets, :] = values.to(cache_dtype)

gemini-code-assist · 2026-05-19T07:10:28Z

+        idx_powers = torch.tensor(
+            [2 ** (self.bits * i) for i in range(indices_per_byte - 1, -1, -1)],
+            dtype=torch.long,
+            device=idx_flat.device,
+        )


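idx_powers is a constant tensor. Creating it on every compress call is inefficient. Consider precomputing it from bits in __init__ and storing it as a buffer to avoid redundant allocations and H2D copies.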
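To make the role of `idx_powers` concrete, here is a pure-Python stand-in for the packing step, with the positional weights computed once in the constructor. It is only a sketch under assumptions: `IndexPacker`, `pack`, and `unpack` are hypothetical names, and plain lists replace the tensors used in the PR.

```python
class IndexPacker:
    """Hypothetical pure-Python stand-in for the bit-packing step."""

    def __init__(self, bits: int) -> None:
        if bits not in (1, 2, 4, 8):
            raise ValueError("bits must divide 8 evenly")
        self.bits = bits
        self.indices_per_byte = 8 // bits
        # Computed once here instead of on every pack() call.
        self.idx_powers = [
            2 ** (bits * i) for i in range(self.indices_per_byte - 1, -1, -1)
        ]

    def pack(self, indices: list[int]) -> bytes:
        """Pack quantization indices into bytes, most significant slot first."""
        n = self.indices_per_byte
        if len(indices) % n != 0:
            raise ValueError("length must be a multiple of indices_per_byte")
        return bytes(
            sum(v * w for v, w in zip(indices[i:i + n], self.idx_powers))
            for i in range(0, len(indices), n)
        )

    def unpack(self, packed: bytes) -> list[int]:
        """Recover the original indices from the packed bytes."""
        mask = (1 << self.bits) - 1
        return [(b // w) & mask for b in packed for w in self.idx_powers]


packer = IndexPacker(bits=4)
packed = packer.pack([1, 2, 15, 0])
restored = packer.unpack(packed)
```

If the compressor happens to be a `torch.nn.Module`, the equivalent tensor could be stored with `register_buffer` so it follows the module across devices; whether that applies here depends on how the class is defined in the PR.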

gemini-code-assist · 2026-05-19T07:10:28Z

+            pdf_vals = torch.tensor([pdf(x) for x in xs])
+            weighted = xs * pdf_vals


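The Gaussian PDF is currently computed in a Python list comprehension, which is slow for 2048 samples. It can easily be vectorized with PyTorch operations to speed up initialization.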

xs = torch.linspace(a, b, n_samples)
pdf_vals = (1.0 / (math.sqrt(2 * math.pi) * sigma)) * torch.exp(-xs**2 / (2 * sigma**2))
weighted = xs * pdf_vals

coderabbitai · 2026-05-28T03:02:39Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5c09f685-00ed-42dc-b5df-2a84328e3e87

Walkthrough

This PR adds TurboQuant, a KV cache quantization system for reducing inference memory, consisting of core quantization algorithms (Lloyd-Max optimal scalar codebook, absmax quantization), integration into the KV cache manager with compress/restore lifecycles, NPU and PyTorch execution paths, worker/serving coordination, and CLI configuration support. Documentation and example scripts demonstrate offline and online usage.

Changes

TurboQuant KV Cache Quantization Integration

Layer / File(s)	Summary
Core quantization types and configuration `python/core/types.py`, `python/core/turboquant/__init__.py`	`KvQuantConfig` dataclass (enabled, key_bits, value_bits, residual_window, protected_layers, protected_bits) and `RuntimeConfig.kv_quant_config` field enable quantization configuration across the system.
Lloyd-Max optimal scalar quantizer `python/core/turboquant/lloyd_max.py`, `tests/test_lloyd_max.py`	Iterative Lloyd-Max algorithm with vectorized Gaussian conditional-mean updates, LRU caching, and codebook precomputation; test suite validates MSE quality vs. legacy numerical approach and benchmarks speed across (d, bits) configurations.
Absmax compressor and KVCompressor implementation `python/core/turboquant/compressor.py`	`AbsmaxCompressor` performs per-row absmax scaling and INT-N quantization with optional bit-packing; `KVCompressor` manages per-layer compressors with layer-adaptive protected bit-widths and dual Python/NPU execution paths for compress/decompress.
KV cache manager quantization integration `python/core/kv_cache.py`	`KvCacheManager` initializes compressor during model registration, provides `compress_old_tokens` / `restore_compressed_tokens` APIs for token lifecycle management, tracks compressed segments per request, and exposes `quantization_stats` for memory reporting.
NPU executor L2 callable compilation `examples/model/qwen3_14b/runner/npu_executor.py`	`_compile_l2_callable_from_jit` compiles `@pl.jit` functions into `_L2Callable` with JIT caching, dummy-arg binding for metadata extraction, and `compile_and_assemble` integration; placeholder tq_compress/tq_decompress stubs in `_compile_model`.
Model runner TurboQuant wiring `examples/model/qwen3_14b/runner/npu_runner.py`	`run_tq_compress` / `run_tq_decompress` dispatch to compiled L2 callables or PyTorch fallback; compress/restore integrated into prefill and decode paths; L2 orchestration refactored to separate tensor/scalar arguments; debug logging tracks per-batch TQ state.
Serving worker compressed segment lifecycle `python/core/serving_worker.py`	Worker corrects `tokens_used` to full context length after prefill; clears stale compressed segments during allocation creation/update and freeing to maintain segment consistency across preemption.
CLI configuration and serving startup `python/cli/main.py`	`load_serving_config()` parses optional `kv_quant` JSON block and instantiates `KvQuantConfig`; default `weight_dtype` updated to `bfloat16`; startup logging reports TurboQuant settings when enabled.
Example scripts and CLI utilities `examples/model/qwen3_14b/npu_generate.py`, `examples/model/qwen3_14b/test_kv_capacity.py`, `examples/model/qwen3_14b/npu_serving.json`	`npu_generate.py` adds `--kv-quant` and quantization-parameter flags; `test_kv_capacity.py` measures KV pool capacity with/without TurboQuant; config updated to `weight_dtype: bfloat16`.
Documentation and repository metadata `README.md`, `.gitmodules`, `.claude/skills`	README restructured with Offline/Online mode sections, TurboQuant setup instructions, Common CLI Flags table, and Benchmark section; submodule URL updated; skills config added.
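The configuration rows above can be pictured with a small sketch of how an optional `kv_quant` JSON block might map onto `KvQuantConfig`. The field names come from the summary; the defaults are copied from one `KvQuantConfig` definition quoted later in this review and may differ from the canonical one in `python/core/types.py`. `parse_kv_quant` is a hypothetical helper, not the PR's `load_serving_config()`, and the assumption that JSON keys equal the field names is mine.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class KvQuantConfig:
    """Fields as listed in the walkthrough; defaults are an assumption."""

    enabled: bool = False
    key_bits: int = 4
    value_bits: int = 4
    residual_window: int = 128
    protected_layers: int = 4
    protected_bits: int = 8


def parse_kv_quant(serving_config: dict) -> KvQuantConfig:
    """Hypothetical parser: build KvQuantConfig from an optional kv_quant block."""
    block = serving_config.get("kv_quant") or {}
    known = {f.name for f in fields(KvQuantConfig)}
    unknown = set(block) - known
    if unknown:
        # Reject typos early rather than silently ignoring them.
        raise ValueError(f"unknown kv_quant keys: {sorted(unknown)}")
    return KvQuantConfig(**block)


raw = '{"kv_quant": {"enabled": true, "key_bits": 4, "residual_window": 64}}'
config = parse_kv_quant(json.loads(raw))
```

Rejecting unknown keys is a design choice in this sketch; a more lenient parser could log and ignore them instead.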

Sequence Diagram

sequenceDiagram
  participant Client
  participant Worker as WorkerProcess
  participant Runner as Qwen314BModelRunner
  participant KVCache as KvCacheManager
  participant Compressor as KVCompressor
  participant NPU as NPU/L2
  
  Client->>Worker: generate_result(prompt)
  Worker->>Runner: run_prefill()
  Runner->>Runner: prefill tokens
  Runner->>KVCache: compress_old_tokens(alloc)
  KVCache->>Compressor: compress_layer()
  Compressor->>NPU: run_kv_quantize (if available)
  NPU-->>Compressor: quantized indices
  Compressor->>KVCache: store compressed segment
  KVCache-->>Runner: freed pages
  
  loop decode steps
    Worker->>Runner: run_decode()
    Runner->>KVCache: restore_compressed_tokens(alloc)
    KVCache->>Compressor: decompress_layer()
    Compressor->>NPU: run_kv_dequantize (if available)
    NPU-->>Compressor: decompressed tokens
    Compressor->>KVCache: write into pages
    KVCache-->>Runner: restored pages
    Runner->>Runner: decode kernel
    Runner->>KVCache: compress_old_tokens(alloc)
    KVCache->>Compressor: compress_layer()
    Compressor-->>KVCache: compressed segment
  end
  
  Worker->>KVCache: free(alloc)
  KVCache->>KVCache: clear_compressed_segments()
  KVCache-->>Client: result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 67.80% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title '[Feature] 集成 TurboQuant KV Cache 压缩' clearly describes the main feature being added: integration of TurboQuant KV Cache compression, which aligns with the substantial changes across multiple files for implementing KV cache quantization.
Description check	✅ Passed	The description 'Integrating TurboQuant V3 (pure MSE mode) into pypto-serving's KV Cache management, supporting online compression/decompression of historical tokens to reduce KV Cache memory usage' directly relates to the changeset, explaining the purpose and impact of the TurboQuant integration.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 13

🧹 Nitpick comments (1)

examples/model/qwen3_14b/test_kv_capacity.py (1)
112-112: ⚡ Quick win

Align weight dtype with the other NPU TurboQuant entrypoints.

Using float32 here makes KV-capacity comparisons noisier by increasing weight memory pressure; bfloat16 better matches the updated serving/generation defaults.
💡 Proposed fix
-            weight_dtype="float32",
+            weight_dtype="bfloat16",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/model/qwen3_14b/test_kv_capacity.py` at line 112, Update the weight
dtype used in the KV-capacity test to match other NPU TurboQuant entrypoints by
changing the weight_dtype setting from "float32" to "bfloat16" in the
test_kv_capacity setup (look for the weight_dtype="float32" argument in the KV
capacity test config and replace it with weight_dtype="bfloat16").

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/model/qwen3_14b/runner/npu_runner.py`:
- Around line 299-305: The restore/compress calls operate on stale host-side KV
state and must be synchronized with the worker child-memory buffer: before
calling restore_compressed_tokens() or compress_old_tokens() (e.g., inside the
block where you iterate batch.kv_allocations when
is_quantization_enabled(model_id)), explicitly pull/sync the latest pages from
the worker child buffer into the host cache (use the same mechanism as
_l2_child_tensor(..., refresh=True) or add/ call a kv cache
sync_from_worker(model_id, alloc) helper) so host reads reflect the decode
buffer; after restore/compress, push any modified pages back to the worker
(sync_to_worker) or mark _l2_dirty_kv_models appropriately so pages will be
uploaded before the next decode, ensuring the quantization lifecycle is kept
consistent between restore_compress and the worker-resident cache.
- Around line 1177-1188: Wrap the noisy debug prints in _prepare_decode_inputs
behind the existing debug flag (e.g. check self._l3_trace or a dedicated
TurboQuant debug switch) so they only execute when tracing is enabled;
specifically guard the two print calls that reference batch_idx, seq_len,
max_blocks, alloc.page_ids, alloc.tokens_used, slot_mapping and the block_table
print with an if self._l3_trace (or if _l3_trace) block so the heavy stdout
output and flush=True are skipped by default.
- Around line 391-408: The compressed-segments memory calc calls sum(...) on
values returned by .numel(), which is an int and causes TypeError; update the
block that iterates pool.compressed_segments so that you add the .numel()
results directly (and multiply by 2 for "scales" entries) instead of wrapping
them in sum(), e.g., use seg.compressed_k.get("idx_bytes",
torch.Tensor()).numel() and seg.compressed_k.get("scales",
torch.Tensor()).numel() * 2 (same for compressed_v), and ensure the accumulated
value is cast to a float before dividing by 1024*1024 to compute compressed_mb.

In `@examples/model/qwen3_14b/test_kv_capacity.py`:
- Around line 58-59: The --step argument must be validated to be a positive
integer to avoid non-terminating loops that use prompt_len and step in the
capacity test; add validation (either a custom argparse type like positive_int
or an explicit check after parsing) to ensure args.step > 0 (and optionally cap
it against args.max_prompt or other relevant bounds), and raise/exit with a
clear error if the value is invalid so the loop that manipulates prompt_len
cannot stall or regress (refer to parser, args.step, prompt_len and the loop
logic around lines where prompt_len is increased).
- Around line 73-74: The code resolves the provided model directory into
model_dir but doesn't validate that it exists and is a directory; update the
argument handling around model_dir (where model_dir =
Path(args.model_dir).resolve()) to check model_dir.exists() and
model_dir.is_dir(), and if not, exit early with a clear error message (e.g., via
sys.exit or raising ArgumentError) that includes the invalid path
(args.model_dir) so failures fail fast with useful context.

In `@python/core/kv_cache.py`:
- Around line 489-516: compress_old_tokens is prematurely replacing BF16 pages
with compressed segments (see read_context -> pool.compressed_segments
assignment and use of pool.kv_compressor.compress_layer / compress_layer_npu
with run_tq_compress) so chunked run_prefill which expects original pages later
ends up with missing prompt pages; change the logic to not overwrite/free the
original BF16 pages during prefill: either defer creating/assigning
pool.compressed_segments inside compress_old_tokens to only run in decode paths,
or ensure a restore step is performed before any chunked prefill by invoking the
existing restore/restore_pages routine (or add one) so read_context +
compress_layer(*) do not remove the resident BF16 pages until after prefill is
complete.
- Around line 373-377: Replace all synchronous print(..., flush=True) debug
traces in python/core/kv_cache.py with gated logging (e.g., logger.debug) so
they are emitted only when debug logging is enabled; specifically update the
print in the restore/compress code paths and inside
_write_tokens_from_compressed and any other restore/compress helpers to call the
module logger (or an existing logger instance) instead of printing to stdout,
remove explicit flush usage, and preserve the original message formatting so
semantics remain the same.

In `@python/core/serving_worker.py`:
- Around line 343-346: The call to
self.kv_cache_manager.clear_compressed_segments(self.config.model_id, req_id) in
_get_or_update_allocation() is clearing compressed state on every allocation
refresh (prefill/decode) and breaks KvCacheManager.restore_compressed_tokens()
for multi-step requests; move or guard this call so it only runs on actual
preemption/reset paths (e.g., where you detect a preemption or allocation reset)
instead of the general allocation update flow—adjust _get_or_update_allocation()
to only invoke kv_cache_manager.clear_compressed_segments(...) when the function
is handling a preemption/reset event, leaving the compressed segments intact
during normal prefill/decode steps.

In `@python/core/turboquant/compressor.py`:
- Around line 189-199: Replace the duplicate KvQuantConfig dataclass in
compressor.py with an import of the single authoritative KvQuantConfig
definition so there is one source of truth; specifically, remove the local
KvQuantConfig class and import the KvQuantConfig symbol defined in the other
module (the canonical types definition) and update any local references to use
that imported class to ensure default values (e.g., value_bits) remain
consistent across the codebase.

In `@README.md`:
- Line 120: Update the README command example to clarify the device ID
placeholder: replace the ambiguous `--device {}` with a concrete example like
`--device 0` or add a short parenthetical note after the command explaining that
users must replace `{}` with their device ID (e.g., `0` for the first GPU).
Ensure the README's example uses the chosen concrete value or includes the
explicit instruction so readers aren't confused by empty braces.
- Around line 155-166: Update the README's kv_quant JSON example to include the
two supported options missing from the example: add "protected_layers" and
"protected_bits" with their default values (4 and 8) so the configuration
matches the implementation in python/cli/main.py; locate the kv_quant block in
the README sample and append these keys with appropriate values and brief
comments or values so users see all available parameters.

In `@tests/test_lloyd_max.py`:
- Line 127: The f-string prefix is used on plain strings without placeholders in
tests/test_lloyd_max.py (e.g., the print calls that currently read print(f" 
PASS: new MSE <= old MSE (within 0.5%)") and the similar PASS/FAIL prints),
which triggers Ruff F541; fix by removing the unnecessary leading "f" from those
print statements so they are plain strings (e.g., change print(f"...") to
print("...")) for each affected print occurrence.
- Around line 107-120: quantization_mse is integrating via Riemann sums but
never multiplies by the interval width dx, so the returned value total_mse /
n_pts is wrong; fix it by computing dx per interval (dx = (b - a) / (n_pts - 1))
inside the loop where a,b are taken from boundaries, multiply each interval's
summed contribution mse_i by that dx before adding to total_mse (i.e., total_mse
+= mse_i.item() * dx), and then return total_mse (remove the final division by
n_pts); refer to the function quantization_mse, variables boundaries, a, b, xs,
mse_i, total_mse, n_pts, and sigma.
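The `dx` point in the last item can be checked numerically. The sketch below integrates the squared error of a scalar quantizer against a Gaussian density, weighting each cell's sum by the sub-interval width. It deliberately swaps in the midpoint rule (with `dx = (b - a) / n_pts`) rather than the endpoint-based sum described in the comment, and uses plain `math` instead of tensors. The function signature is an assumption and not the one in `tests/test_lloyd_max.py`.

```python
import math


def gaussian_pdf(x: float, sigma: float) -> float:
    """Density of N(0, sigma^2) at x."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * sigma)


def quantization_mse(boundaries: list[float], centroids: list[float],
                     sigma: float = 1.0, n_pts: int = 4096) -> float:
    """Integrate (x - c_i)^2 * pdf(x) over each cell [a, b] with the midpoint rule."""
    total_mse = 0.0
    for a, b, c in zip(boundaries[:-1], boundaries[1:], centroids):
        dx = (b - a) / n_pts                  # width of one sub-interval
        cell_sum = 0.0
        for k in range(n_pts):
            x = a + (k + 0.5) * dx            # midpoint of sub-interval k
            cell_sum += (x - c) ** 2 * gaussian_pdf(x, sigma)
        total_mse += cell_sum * dx            # weight by dx; no division by n_pts
    return total_mse


# Two-level quantizer for N(0, 1), truncated to [-8, 8]; centroids are +/- sqrt(2/pi).
c = math.sqrt(2 / math.pi)
mse = quantization_mse([-8.0, 0.0, 8.0], [-c, c])
```

For this two-level case the closed-form distortion is $1 - 2/\pi$, which gives a quick sanity check that the `dx` weighting produces a value on the right scale.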

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4f1389df-bcec-4ad4-98b9-f16906f52a18

📥 Commits

Reviewing files that changed from the base of the PR and between efe0a6f and d717090.

📒 Files selected for processing (18)

.claude/skills
.claude/skills
.gitmodules
README.md
examples/model/qwen3_14b/npu_generate.py
examples/model/qwen3_14b/npu_serving.json
examples/model/qwen3_14b/runner/npu_executor.py
examples/model/qwen3_14b/runner/npu_runner.py
examples/model/qwen3_14b/test_kv_capacity.py
pypto-lib
python/cli/main.py
python/core/kv_cache.py
python/core/serving_worker.py
python/core/turboquant/__init__.py
python/core/turboquant/compressor.py
python/core/turboquant/lloyd_max.py
python/core/types.py
tests/test_lloyd_max.py

coderabbitai · 2026-05-28T03:15:28Z

+        # ── KV quantization: restore compressed old tokens BEFORE building
+        # block_table so that workspace pages are included in page_ids. ──
+        t_tq_restore = time.perf_counter()
+        if self._kv_cache_manager.is_quantization_enabled(model_id):
+            for alloc in batch.kv_allocations:
+                self._kv_cache_manager.restore_compressed_tokens(model_id, alloc, npu_runner=self)
+        dt_tq_restore = (time.perf_counter() - t_tq_restore) * 1000


⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Restore/compress are operating on stale KV state.

restore_compressed_tokens() and compress_old_tokens() both mutate/read the host-side cache, but decode runs against _l2_child_tensor(..., refresh=refresh_kv_cache) worker allocations. Once _l2_dirty_kv_models has been cleared, restored pages are never re-uploaded before the next decode, and the post-kernel compression path reads host KV that has not been synced back from the child-memory decode buffer. That will quantize/decode the wrong history after the first quantized step. Please either add explicit KV sync points around TurboQuant or keep the quantization lifecycle entirely on the worker-resident cache. python/runtime/worker.py:89-124 and python/core/kv_cache.py:346-403,444-507 show the contract mismatch.

Also applies to: 313-333, 374-379
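The contract mismatch described above can be illustrated with a toy model of a host cache mirrored into a worker-resident buffer. Everything in it is hypothetical: `ToyKvSync` and its methods only borrow the `sync_to_worker` / `sync_from_worker` naming floated in this review, pages are plain strings, and nothing here reflects the actual `_l2_child_tensor` mechanism.

```python
class ToyKvSync:
    """Toy model of a host cache mirrored into a worker-resident buffer."""

    def __init__(self) -> None:
        self.host: dict[int, str] = {}       # page_id -> contents on the host
        self.worker: dict[int, str] = {}     # page_id -> contents used by decode
        self.dirty_pages: set[int] = set()   # host pages not yet uploaded

    def host_write(self, page_id: int, data: str) -> None:
        """A host-side mutation (for example a restore) marks the page dirty."""
        self.host[page_id] = data
        self.dirty_pages.add(page_id)

    def sync_to_worker(self) -> None:
        """Upload dirty host pages before the next decode kernel runs."""
        for page_id in self.dirty_pages:
            self.worker[page_id] = self.host[page_id]
        self.dirty_pages.clear()

    def worker_decode_write(self, page_id: int, data: str) -> None:
        """The decode kernel writes new KV only into the worker buffer."""
        self.worker[page_id] = data

    def sync_from_worker(self, page_id: int) -> None:
        """Pull a page back so host-side compression reads current data."""
        self.host[page_id] = self.worker[page_id]


kv = ToyKvSync()
kv.host_write(0, "restored")            # restore_compressed_tokens analogue
kv.sync_to_worker()                     # without this, decode sees a stale page 0
kv.worker_decode_write(1, "new-token")  # decode appends KV on the worker side
kv.sync_from_worker(1)                  # without this, compression reads stale host data
```

The two sync calls mark the points where the review asks for explicit synchronization; dropping either one reproduces the stale-state problem in miniature.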

coderabbitai · 2026-05-28T03:15:28Z

+            # Compressed segments memory estimate
+            compressed_mb = 0.0
+            if pool and pool.compressed_segments:
+                for req_segs in pool.compressed_segments.values():
+                    for seg in req_segs.values():
+                        compressed_mb += sum(
+                            seg.compressed_k.get("idx_bytes", torch.Tensor()).numel(),
+                        )
+                        compressed_mb += sum(
+                            seg.compressed_k.get("scales", torch.Tensor()).numel() * 2,
+                        )
+                        compressed_mb += sum(
+                            seg.compressed_v.get("idx_bytes", torch.Tensor()).numel(),
+                        )
+                        compressed_mb += sum(
+                            seg.compressed_v.get("scales", torch.Tensor()).numel() * 2,
+                        )
+                compressed_mb = compressed_mb / 1024 / 1024


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The KV-memory stats block will throw on the first compressed segment.

numel() returns an int, so sum(seg.compressed_k.get(...).numel(),) raises TypeError: 'int' object is not iterable. Because this sits on the quantized decode path, the logger will start crashing requests as soon as compression succeeds once.
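The failure mode is easy to reproduce without PyTorch. In this small sketch, `numel` is a stand-in function that returns a plain int, as `torch.Tensor.numel()` does. The trailing comma inside the call does not build a tuple, so `sum` receives the int itself.

```python
def numel() -> int:
    """Stand-in for torch.Tensor.numel(), which returns a plain int."""
    return 1024


try:
    # A trailing comma inside a call does not create a tuple: this is sum(1024).
    total = sum(numel(),)
except TypeError as exc:
    error_message = str(exc)
else:
    error_message = ""

# Adding the int directly is all the accounting code needs.
compressed_bytes = 0
compressed_bytes += numel()
```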

Suggested fix

-            compressed_mb = 0.0
+            compressed_bytes = 0
             if pool and pool.compressed_segments:
                 for req_segs in pool.compressed_segments.values():
                     for seg in req_segs.values():
-                        compressed_mb += sum(
-                            seg.compressed_k.get("idx_bytes", torch.Tensor()).numel(),
-                        )
-                        compressed_mb += sum(
-                            seg.compressed_k.get("scales", torch.Tensor()).numel() * 2,
-                        )
-                        compressed_mb += sum(
-                            seg.compressed_v.get("idx_bytes", torch.Tensor()).numel(),
-                        )
-                        compressed_mb += sum(
-                            seg.compressed_v.get("scales", torch.Tensor()).numel() * 2,
-                        )
-                compressed_mb = compressed_mb / 1024 / 1024
+                        idx_k = seg.compressed_k.get("idx_bytes", torch.empty(0, dtype=torch.uint8))
+                        scales_k = seg.compressed_k.get("scales", torch.empty(0, dtype=torch.bfloat16))
+                        idx_v = seg.compressed_v.get("idx_bytes", torch.empty(0, dtype=torch.uint8))
+                        scales_v = seg.compressed_v.get("scales", torch.empty(0, dtype=torch.bfloat16))
+                        compressed_bytes += idx_k.numel() * idx_k.element_size()
+                        compressed_bytes += scales_k.numel() * scales_k.element_size()
+                        compressed_bytes += idx_v.numel() * idx_v.element_size()
+                        compressed_bytes += scales_v.numel() * scales_v.element_size()
+            compressed_mb = compressed_bytes / 1024 / 1024

coderabbitai · 2026-05-28T03:15:29Z

+            print(
+                f"[TQ-DEBUG] _prepare_decode_inputs: batch_idx={batch_idx}, "
+                f"seq_len={seq_len}, max_blocks={max_blocks}, "
+                f"alloc.page_ids len={len(alloc.page_ids)}, "
+                f"alloc.page_ids={alloc.page_ids}, "
+                f"alloc.tokens_used={alloc.tokens_used}, "
+                f"slot_mapping={slot_mapping[batch_idx].item()}",
+                flush=True,
+            )
+        print(
+            f"[TQ-DEBUG] _prepare_decode_inputs: block_table={block_table.tolist()[:max_blocks]}",
+            flush=True,


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gate the TQ decode dumps behind a debug flag.

This runs on every decode step and prints full page metadata with flush=True, so long generations will spend a surprising amount of time in stdout and produce very noisy logs. Please hide it behind _l3_trace or a dedicated TurboQuant debug switch.

Suggested fix

-            print(
-                f"[TQ-DEBUG] _prepare_decode_inputs: batch_idx={batch_idx}, "
-                f"seq_len={seq_len}, max_blocks={max_blocks}, "
-                f"alloc.page_ids len={len(alloc.page_ids)}, "
-                f"alloc.page_ids={alloc.page_ids}, "
-                f"alloc.tokens_used={alloc.tokens_used}, "
-                f"slot_mapping={slot_mapping[batch_idx].item()}",
-                flush=True,
-            )
-        print(
-            f"[TQ-DEBUG] _prepare_decode_inputs: block_table={block_table.tolist()[:max_blocks]}",
-            flush=True,
-        )
+            if self._l3_trace:
+                print(
+                    f"[TQ-DEBUG] _prepare_decode_inputs: batch_idx={batch_idx}, "
+                    f"seq_len={seq_len}, max_blocks={max_blocks}, "
+                    f"alloc.page_ids len={len(alloc.page_ids)}, "
+                    f"alloc.page_ids={alloc.page_ids}, "
+                    f"alloc.tokens_used={alloc.tokens_used}, "
+                    f"slot_mapping={slot_mapping[batch_idx].item()}",
+                    flush=True,
+                )
+        if self._l3_trace:
+            print(
+                f"[TQ-DEBUG] _prepare_decode_inputs: block_table={block_table.tolist()[:max_blocks]}",
+                flush=True,
+            )
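An alternative to a boolean trace flag is the standard `logging` module, which the review also recommends for `python/core/kv_cache.py`. The sketch below is illustrative only: the logger name and the `log_decode_inputs` helper are hypothetical, and the function returns a bool purely so that the gating is observable.

```python
import logging

logger = logging.getLogger("turboquant.debug")  # hypothetical logger name


def log_decode_inputs(batch_idx: int, seq_len: int, page_ids: list[int]) -> bool:
    """Log the per-step dump only when DEBUG is enabled; return whether it ran."""
    if not logger.isEnabledFor(logging.DEBUG):
        return False  # skip formatting the (potentially large) message entirely
    logger.debug(
        "[TQ-DEBUG] _prepare_decode_inputs: batch_idx=%d, seq_len=%d, page_ids=%s",
        batch_idx, seq_len, page_ids,
    )
    return True


logger.setLevel(logging.INFO)
skipped = log_decode_inputs(0, 17, [3, 5])   # fails the level check at INFO
logger.setLevel(logging.DEBUG)
emitted = log_decode_inputs(0, 17, [3, 5])   # passes the level check at DEBUG
```

Passing the values as arguments rather than pre-formatting an f-string lets `logging` defer string construction until a handler actually accepts the record.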

coderabbitai · 2026-05-28T03:15:29Z

+    parser.add_argument("--step", type=int, default=32,
+                        help="Increase prompt tokens by this much each iteration.")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate --step (and related bounds) to prevent non-terminating runs.

With --step <= 0, Line 147 can stall or regress prompt_len, making the loop at Line 135 non-terminating.

💡 Proposed fix

 def main() -> None:
     args = build_parser().parse_args()
+    if args.step <= 0:
+        raise ValueError("--step must be > 0.")
+    if args.start_len <= 0:
+        raise ValueError("--start-len must be > 0.")
+    if args.max_len < args.start_len:
+        raise ValueError("--max-len must be >= --start-len.")
+
     model_dir = Path(args.model_dir).resolve()

Also applies to: 135-148
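The same guard can be moved into the parser with a custom argparse type, so that an invalid `--step` is rejected with a usage error before `main()` runs. In this sketch, `positive_int` is a hypothetical helper, and the parser contains only the `--step` flag rather than the full set of options in `test_kv_capacity.py`.

```python
import argparse


def positive_int(text: str) -> int:
    """argparse type: accept only integers greater than zero."""
    value = int(text)  # a ValueError here is also reported by argparse as invalid
    if value <= 0:
        raise argparse.ArgumentTypeError(f"expected a positive integer, got {text!r}")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--step", type=positive_int, default=32,
                        help="Increase prompt tokens by this much each iteration.")
    return parser


args = build_parser().parse_args(["--step", "8"])
```

With this type in place, `--step 0` or `--step -4` makes argparse print a usage message and exit instead of entering the capacity loop.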

coderabbitai · 2026-05-28T03:15:29Z

+    model_dir = Path(args.model_dir).resolve()
+


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fail fast when --model-dir is invalid.

Line 73 resolves the path but never verifies it exists as a directory, so errors surface later with less clear context.

💡 Proposed fix

     args = build_parser().parse_args()
     model_dir = Path(args.model_dir).resolve()
+    if not model_dir.is_dir():
+        raise FileNotFoundError(f"Model directory does not exist: {model_dir}")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    model_dir = Path(args.model_dir).resolve()
+    args = build_parser().parse_args()
+    model_dir = Path(args.model_dir).resolve()
+    if not model_dir.is_dir():
+        raise FileNotFoundError(f"Model directory does not exist: {model_dir}")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/model/qwen3_14b/test_kv_capacity.py` around lines 73 - 74, The code resolves the provided model directory into model_dir but doesn't validate that it exists and is a directory; update the argument handling around model_dir (where model_dir = Path(args.model_dir).resolve()) to check model_dir.exists() and model_dir.is_dir(), and if not, exit early with a clear error message (e.g., via sys.exit or raising ArgumentError) that includes the invalid path (args.model_dir) so failures fail fast with useful context.
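The prompt above asks for both an existence check and a directory check. A minimal sketch that separates the two cases follows; the function name `resolve_model_dir` is a hypothetical helper, and reporting the original argument alongside the resolved path is a design assumption rather than something the PR does.

```python
from pathlib import Path


def resolve_model_dir(raw: str) -> Path:
    """Resolve a --model-dir value and fail fast if it is unusable (hypothetical helper)."""
    model_dir = Path(raw).resolve()
    if not model_dir.exists():
        # Include both the raw argument and the resolved path for clearer context.
        raise FileNotFoundError(
            f"Model directory does not exist: {raw!r} (resolved to {model_dir})")
    if not model_dir.is_dir():
        raise NotADirectoryError(f"Model path is not a directory: {model_dir}")
    return model_dir
```

Calling this once right after `parse_args()` keeps the failure close to the user's input instead of surfacing later during model loading.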

coderabbitai · 2026-05-28T03:15:29Z

+@dataclass
+class KvQuantConfig:
+    """Configuration for KV cache quantization."""
+
+    enabled: bool = False
+    key_bits: int = 4
+    value_bits: int = 4
+    residual_window: int = 128
+    protected_layers: int = 4
+    protected_bits: int = 8
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Unify KvQuantConfig to a single source of truth.

This file redefines KvQuantConfig with defaults that diverge from python/core/types.py (e.g., Line 195 uses value_bits=4 vs Line 26 in python/core/types.py uses value_bits=2). Having two config classes for the same contract can silently change behavior depending on import path.

Proposed fix

-from dataclasses import dataclass
+from dataclasses import dataclass
+from python.core.types import KvQuantConfig
@@
-@dataclass
-class KvQuantConfig:
-    """Configuration for KV cache quantization."""
-
-    enabled: bool = False
-    key_bits: int = 4
-    value_bits: int = 4
-    residual_window: int = 128
-    protected_layers: int = 4
-    protected_bits: int = 8

📝 Committable suggestion

Suggested change

-@dataclass
-class KvQuantConfig:
-    """Configuration for KV cache quantization."""
-
-    enabled: bool = False
-    key_bits: int = 4
-    value_bits: int = 4
-    residual_window: int = 128
-    protected_layers: int = 4
-    protected_bits: int = 8
+# (Import statement to be added at the top of the file:)
+from python.core.types import KvQuantConfig
+# Lines 189-199 are removed entirely; no code remains at this location

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/core/turboquant/compressor.py` around lines 189 - 199, Replace the duplicate KvQuantConfig dataclass in compressor.py with an import of the single authoritative KvQuantConfig definition so there is one source of truth; specifically, remove the local KvQuantConfig class and import the KvQuantConfig symbol defined in the other module (the canonical types definition) and update any local references to use that imported class to ensure default values (e.g., value_bits) remain consistent across the codebase.

coderabbitai · 2026-05-28T03:15:29Z

 ```bash
 python -m python.cli.main \
  --config examples/model/qwen3_14b/npu_serving.json \
  --serve --port 8899 --device {}


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify the device ID placeholder.

The command uses --device {} with empty braces, which may confuse users. Consider using a concrete example like --device 0 or adding a comment explaining that users should replace {} with their device ID.

📝 Suggested fix

-  --serve --port 8899 --device {}
+  --serve --port 8899 --device 0

📝 Committable suggestion

Suggested change

-  --serve --port 8899 --device {}
+  --serve --port 8899 --device 0

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` at line 120, Update the README command example to clarify the device ID placeholder: replace the ambiguous `--device {}` with a concrete example like `--device 0` or add a short parenthetical note after the command explaining that users must replace `{}` with their device ID (e.g., `0` for the first GPU). Ensure the README's example uses the chosen concrete value or includes the explicit instruction so readers aren't confused by empty braces.

coderabbitai · 2026-05-28T03:15:29Z

+```json
+{
+  "model": { ... },
+  "runtime": { ... },
+  "kv_quant": {
+    "enabled": true,
+    "key_bits": 4,
+    "value_bits": 2,
+    "residual_window": 128
+  }
+}
+```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Complete the TurboQuant JSON configuration example.

The example kv_quant configuration is missing two available options: protected_layers and protected_bits. According to the implementation in python/cli/main.py, these options are supported with defaults of 4 and 8 respectively. Users should be aware of all available configuration parameters.

📝 Suggested fix

 {
   "model": { ... },
   "runtime": { ... },
   "kv_quant": {
     "enabled": true,
     "key_bits": 4,
     "value_bits": 2,
-    "residual_window": 128
+    "residual_window": 128,
+    "protected_layers": 4,
+    "protected_bits": 8
   }
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` around lines 155 - 166, Update the README's kv_quant JSON example to include the two supported options missing from the example: add "protected_layers" and "protected_bits" with their default values (4 and 8) so the configuration matches the implementation in python/cli/main.py; locate the kv_quant block in the README sample and append these keys with appropriate values and brief comments or values so users see all available parameters.
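To illustrate how the `kv_quant` block and its defaults fit together, here is a minimal sketch of parsing that block into a dataclass. The dataclass below is a local stand-in: its field names come from the reviewed diff, and its defaults follow the values the review attributes to python/core/types.py and python/cli/main.py, but the real definitions may differ. The `parse_kv_quant` function and its unknown-key check are assumptions, not the PR's parser.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class KvQuantConfig:
    # Stand-in for the canonical class; defaults are assumed from the review text.
    enabled: bool = False
    key_bits: int = 4
    value_bits: int = 2
    residual_window: int = 128
    protected_layers: int = 4
    protected_bits: int = 8


def parse_kv_quant(config: dict) -> KvQuantConfig:
    """Build a KvQuantConfig from the optional "kv_quant" block (hypothetical helper)."""
    raw = config.get("kv_quant", {})
    known = {f.name for f in fields(KvQuantConfig)}
    unknown = set(raw) - known
    if unknown:
        # Reject typos such as "residual_windwo" instead of silently ignoring them.
        raise ValueError(f"unknown kv_quant keys: {sorted(unknown)}")
    return KvQuantConfig(**raw)


cfg = parse_kv_quant(json.loads('{"kv_quant": {"enabled": true, "value_bits": 2}}'))
```

Keys omitted from the JSON, such as `protected_layers` and `protected_bits` here, fall back to the dataclass defaults, which is why documenting them in the README example helps users discover them.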

coderabbitai · 2026-05-28T03:15:29Z

+    def quantization_mse(centroids_in):
+        """Compute MSE of a scalar quantizer for N(0, sigma^2)."""
+        cs = sorted(centroids_in.tolist())
+        boundaries = [-1e10] + [(cs[i] + cs[i+1]) / 2 for i in range(len(cs)-1)] + [1e10]
+        total_mse = 0.0
+        n_pts = 10000
+        for i in range(len(cs)):
+            a, b = boundaries[i], boundaries[i+1]
+            xs = torch.linspace(a, b, n_pts, dtype=torch.float64)
+            pdf_vals = torch.exp(-0.5 * (xs / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
+            mse_i = ((xs - cs[i]) ** 2 * pdf_vals).sum()
+            total_mse += mse_i.item()
+        dx = (boundaries[-1] - boundaries[0]) / n_pts  # approximate
+        return total_mse / n_pts  # normalized


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Correct the MSE integration math in the validator.

quantization_mse currently computes dx but does not apply it; returning total_mse / n_pts is not the integral-based MSE and can mis-rank quantizers.

Proposed fix

 def quantization_mse(centroids_in):
@@
-        total_mse = 0.0
+        total_mse = 0.0
         n_pts = 10000
         for i in range(len(cs)):
             a, b = boundaries[i], boundaries[i+1]
             xs = torch.linspace(a, b, n_pts, dtype=torch.float64)
             pdf_vals = torch.exp(-0.5 * (xs / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
-            mse_i = ((xs - cs[i]) ** 2 * pdf_vals).sum()
-            total_mse += mse_i.item()
-        dx = (boundaries[-1] - boundaries[0]) / n_pts  # approximate
-        return total_mse / n_pts  # normalized
+            integrand = (xs - cs[i]) ** 2 * pdf_vals
+            total_mse += torch.trapz(integrand, xs).item()
+        return total_mse

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_lloyd_max.py` around lines 107 - 120, quantization_mse is integrating via Riemann sums but never multiplies by the interval width dx, so the returned value total_mse / n_pts is wrong; fix it by computing dx per interval (dx = (b - a) / (n_pts - 1)) inside the loop where a,b are taken from boundaries, multiply each interval's summed contribution mse_i by that dx before adding to total_mse (i.e., total_mse += mse_i.item() * dx), and then return total_mse (remove the final division by n_pts); refer to the function quantization_mse, variables boundaries, a, b, xs, mse_i, total_mse, n_pts, and sigma.
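The per-interval integration described above can be sanity-checked without PyTorch. The sketch below is a pure-Python trapezoid-rule version, not the test file's code: it swaps `torch.linspace` for plain lists, and the `tail` parameter (truncating the outer cells at ±10σ instead of ±1e10) is an assumption made so that the grid is fine enough to resolve the Gaussian.

```python
import math


def gaussian_pdf(x: float, sigma: float) -> float:
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def quantization_mse(centroids, sigma: float = 1.0, n_pts: int = 4001, tail: float = 10.0) -> float:
    """MSE of a scalar quantizer for N(0, sigma^2), integrated cell by cell."""
    cs = sorted(centroids)
    # Decision boundaries are midpoints between adjacent centroids;
    # the two outer cells are truncated at +/- tail * sigma (assumption).
    bounds = ([-tail * sigma]
              + [(cs[i] + cs[i + 1]) / 2 for i in range(len(cs) - 1)]
              + [tail * sigma])
    total = 0.0
    for i, c in enumerate(cs):
        a, b = bounds[i], bounds[i + 1]
        dx = (b - a) / (n_pts - 1)  # n_pts samples span n_pts - 1 intervals
        ys = [((a + k * dx) - c) ** 2 * gaussian_pdf(a + k * dx, sigma)
              for k in range(n_pts)]
        # Trapezoid rule: full weight inside the cell, half weight at the endpoints.
        total += dx * (sum(ys) - 0.5 * (ys[0] + ys[-1]))
    return total
```

A convenient check is the 1-bit case: for σ = 1 the optimal centroids are ±√(2/π) and the exact MSE is 1 − 2/π, so the function should return a value very close to that. Note that with boundaries at ±1e10 and 10000 samples per cell, as in the reviewed code, the outer cells are sampled far too coarsely to resolve the density, so applying `dx` alone may not be enough to get a meaningful integral there.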

coderabbitai · 2026-05-28T03:15:29Z

+    print(f"  MSE old: {mse_old:.6e}")
+    print(f"  MSE new: {mse_new:.6e}")
+    if mse_new <= mse_old * 1.005:
+        print(f"  PASS: new MSE <= old MSE (within 0.5%)")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix invalid f-strings flagged by lint.

Lines 127/134/136/146/148 use f"..." without placeholders, which triggers Ruff F541 and can fail CI.

Proposed fix

-        print(f"  PASS: new MSE <= old MSE (within 0.5%)")
+        print("  PASS: new MSE <= old MSE (within 0.5%)")
 ...
-        print(f"  PASS: centroids monotonically increasing")
+        print("  PASS: centroids monotonically increasing")
 ...
-        print(f"  FAIL: centroids not monotonically increasing")
+        print("  FAIL: centroids not monotonically increasing")
 ...
-        print(f"  PASS: symmetric around 0")
+        print("  PASS: symmetric around 0")
 ...
-        print(f"  WARN: not perfectly symmetric (expected for finite iterations)")
+        print("  WARN: not perfectly symmetric (expected for finite iterations)")

Also applies to: 134-134, 136-136, 146-146, 148-148

🧰 Tools

🪛 Ruff (0.15.14)

[error] 127-127: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_lloyd_max.py` at line 127, The f-string prefix is used on plain strings without placeholders in tests/test_lloyd_max.py (e.g., the print calls that currently read print(f" PASS: new MSE <= old MSE (within 0.5%)") and the similar PASS/FAIL prints), which triggers Ruff F541; fix by removing the unnecessary leading "f" from those print statements so they are plain strings (e.g., change print(f"...") to print("...")) for each affected print occurrence.

Integrate TurboQuant module (compressor, lloyd_max) with serving-v3 stack: extend kv_cache.py with compress/restore/stats methods, add TurboQuant hooks in npu_runner.py (compress after prefill/decode, restore before decode), clean compressed segments on preemption in serving_worker.py, and add kv_quant JSON config parsing in CLI.
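The "compress after prefill/decode" hook interacts with `residual_window`: only tokens that have fallen out of the most recent window are eligible for compression. A minimal sketch of the bookkeeping for an incremental variant follows, where each call handles only the tokens not yet compressed. The function name and the `compressed_upto` counter are hypothetical and do not appear in the PR; `tokens_used` and `residual_window` are the names used in the reviewed code.

```python
def next_compress_range(tokens_used: int, residual_window: int,
                        compressed_upto: int) -> tuple[int, int]:
    """Return the half-open token range [start, end) that still needs compressing.

    compressed_upto is a hypothetical per-request counter of tokens that
    have already been compressed by earlier calls.
    """
    # Everything older than the residual window is eligible for compression.
    target = max(0, tokens_used - residual_window)
    # Skip what earlier calls already handled; an empty range has start == end.
    start = min(compressed_upto, target)
    return start, target
```

For example, after a 200-token prefill with a window of 128 the range is `(0, 72)`; one decode step later, with `compressed_upto` advanced to 72, the range shrinks to the single token `(72, 73)`, so per-step work stays constant rather than growing with the context length.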

gemini-code-assist Bot reviewed May 19, 2026

sunghajung6688 force-pushed the turboquant branch 4 times, most recently from c502fd4 to 628a70a Compare May 26, 2026 07:20

sunghajung6688 force-pushed the turboquant branch 15 times, most recently from d3cc145 to d717090 Compare May 28, 2026 03:02

coderabbitai Bot reviewed May 28, 2026

huangzhuo and others added 4 commits May 30, 2026 10:01

test

2e3bc2c

test2

30e7c55

sunghajung6688 force-pushed the turboquant branch from a0306af to 30e7c55 Compare May 30, 2026 02:10
