[Feature] 集成 TurboQuant KV Cache 压缩#6
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces TurboQuant KV cache quantization to optimize memory usage during prefill and decode phases. The implementation includes a new turboquant module featuring MSE-optimal compression using random rotation and Lloyd-Max quantization. Feedback focuses on critical performance bottlenecks: the current compression logic re-processes the entire sequence prefix in each step, leading to
| if alloc.request_id not in pool.compressed_segments: | ||
| pool.compressed_segments[alloc.request_id] = {} | ||
|
|
||
| for layer_idx in range(pool.num_layers): |
| for row in range(keys.shape[0]): | ||
| token_index = start_token_index + row | ||
| page_idx = token_index // pool.page_size | ||
| offset = token_index % pool.page_size | ||
| physical_page = alloc.page_ids[page_idx] | ||
| pool.key_pages[layer_idx, physical_page, :, offset, :] = keys[row].to(cache_dtype) | ||
| pool.value_pages[layer_idx, physical_page, :, offset, :] = values[row].to(cache_dtype) |
There was a problem hiding this comment.
在 Python 循环中遍历 token 并写入分页缓存(paged cache)非常低效,特别是对于设备张量(NPU/GPU)。应使用 PyTorch 的高级索引(advanced indexing)进行向量化,以在单个操作中完成更新。
| for row in range(keys.shape[0]): | |
| token_index = start_token_index + row | |
| page_idx = token_index // pool.page_size | |
| offset = token_index % pool.page_size | |
| physical_page = alloc.page_ids[page_idx] | |
| pool.key_pages[layer_idx, physical_page, :, offset, :] = keys[row].to(cache_dtype) | |
| pool.value_pages[layer_idx, physical_page, :, offset, :] = values[row].to(cache_dtype) | |
| token_indices = torch.arange(start_token_index, start_token_index + keys.shape[0], device=keys.device) | |
| page_indices = token_indices // pool.page_size | |
| offsets = token_indices % pool.page_size | |
| physical_pages = torch.tensor(alloc.page_ids, device=keys.device)[page_indices] | |
| pool.key_pages[layer_idx, physical_pages, :, offsets, :] = keys.to(cache_dtype) | |
| pool.value_pages[layer_idx, physical_pages, :, offsets, :] = values.to(cache_dtype) |
| idx_powers = torch.tensor( | ||
| [2 ** (self.bits * i) for i in range(indices_per_byte - 1, -1, -1)], | ||
| dtype=torch.long, | ||
| device=idx_flat.device, | ||
| ) |
| pdf_vals = torch.tensor([pdf(x) for x in xs]) | ||
| weighted = xs * pdf_vals |
c502fd4 to
628a70a
Compare
d3cc145 to
d717090
Compare
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR adds TurboQuant, a KV cache quantization system for reducing inference memory, consisting of core quantization algorithms (Lloyd-Max optimal scalar codebook, absmax quantization), integration into the KV cache manager with compress/restore lifecycles, NPU and PyTorch execution paths, worker/serving coordination, and CLI configuration support. Documentation and example scripts demonstrate offline and online usage. ChangesTurboQuant KV Cache Quantization Integration
Sequence DiagramsequenceDiagram
participant Client
participant Worker as WorkerProcess
participant Runner as Qwen314BModelRunner
participant KVCache as KvCacheManager
participant Compressor as KVCompressor
participant NPU as NPU/L2
Client->>Worker: generate_result(prompt)
Worker->>Runner: run_prefill()
Runner->>Runner: prefill tokens
Runner->>KVCache: compress_old_tokens(alloc)
KVCache->>Compressor: compress_layer()
Compressor->>NPU: run_kv_quantize (if available)
NPU-->>Compressor: quantized indices
Compressor->>KVCache: store compressed segment
KVCache-->>Runner: freed pages
loop decode steps
Worker->>Runner: run_decode()
Runner->>KVCache: restore_compressed_tokens(alloc)
KVCache->>Compressor: decompress_layer()
Compressor->>NPU: run_kv_dequantize (if available)
NPU-->>Compressor: decompressed tokens
Compressor->>KVCache: write into pages
KVCache-->>Runner: restored pages
Runner->>Runner: decode kernel
Runner->>KVCache: compress_old_tokens(alloc)
KVCache->>Compressor: compress_layer()
Compressor-->>KVCache: compressed segment
end
Worker->>KVCache: free(alloc)
KVCache->>KVCache: clear_compressed_segments()
KVCache-->>Client: result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 13
🧹 Nitpick comments (1)
examples/model/qwen3_14b/test_kv_capacity.py (1)
112-112: ⚡ Quick winAlign weight dtype with the other NPU TurboQuant entrypoints.
Using
float32here makes KV-capacity comparisons noisier by increasing weight memory pressure;bfloat16better matches the updated serving/generation defaults.💡 Proposed fix
- weight_dtype="float32", + weight_dtype="bfloat16",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/model/qwen3_14b/test_kv_capacity.py` at line 112, Update the weight dtype used in the KV-capacity test to match other NPU TurboQuant entrypoints by changing the weight_dtype setting from "float32" to "bfloat16" in the test_kv_capacity setup (look for the weight_dtype="float32" argument in the KV capacity test config and replace it with weight_dtype="bfloat16").
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/model/qwen3_14b/runner/npu_runner.py`:
- Around line 299-305: The restore/compress calls operate on stale host-side KV
state and must be synchronized with the worker child-memory buffer: before
calling restore_compressed_tokens() or compress_old_tokens() (e.g., inside the
block where you iterate batch.kv_allocations when
is_quantization_enabled(model_id)), explicitly pull/sync the latest pages from
the worker child buffer into the host cache (use the same mechanism as
_l2_child_tensor(..., refresh=True) or add/ call a kv cache
sync_from_worker(model_id, alloc) helper) so host reads reflect the decode
buffer; after restore/compress, push any modified pages back to the worker
(sync_to_worker) or mark _l2_dirty_kv_models appropriately so pages will be
uploaded before the next decode, ensuring the quantization lifecycle is kept
consistent between restore_compress and the worker-resident cache.
- Around line 1177-1188: Wrap the noisy debug prints in _prepare_decode_inputs
behind the existing debug flag (e.g. check self._l3_trace or a dedicated
TurboQuant debug switch) so they only execute when tracing is enabled;
specifically guard the two print calls that reference batch_idx, seq_len,
max_blocks, alloc.page_ids, alloc.tokens_used, slot_mapping and the block_table
print with an if self._l3_trace (or if _l3_trace) block so the heavy stdout
output and flush=True are skipped by default.
- Around line 391-408: The compressed-segments memory calc calls sum(...) on
values returned by .numel(), which is an int and causes TypeError; update the
block that iterates pool.compressed_segments so that you add the .numel()
results directly (and multiply by 2 for "scales" entries) instead of wrapping
them in sum(), e.g., use seg.compressed_k.get("idx_bytes",
torch.Tensor()).numel() and seg.compressed_k.get("scales",
torch.Tensor()).numel() * 2 (same for compressed_v), and ensure the accumulated
value is cast to a float before dividing by 1024*1024 to compute compressed_mb.
In `@examples/model/qwen3_14b/test_kv_capacity.py`:
- Around line 58-59: The --step argument must be validated to be a positive
integer to avoid non-terminating loops that use prompt_len and step in the
capacity test; add validation (either a custom argparse type like positive_int
or an explicit check after parsing) to ensure args.step > 0 (and optionally cap
it against args.max_prompt or other relevant bounds), and raise/exit with a
clear error if the value is invalid so the loop that manipulates prompt_len
cannot stall or regress (refer to parser, args.step, prompt_len and the loop
logic around lines where prompt_len is increased).
- Around line 73-74: The code resolves the provided model directory into
model_dir but doesn't validate that it exists and is a directory; update the
argument handling around model_dir (where model_dir =
Path(args.model_dir).resolve()) to check model_dir.exists() and
model_dir.is_dir(), and if not, exit early with a clear error message (e.g., via
sys.exit or raising ArgumentError) that includes the invalid path
(args.model_dir) so failures fail fast with useful context.
In `@python/core/kv_cache.py`:
- Around line 489-516: compress_old_tokens is prematurely replacing BF16 pages
with compressed segments (see read_context -> pool.compressed_segments
assignment and use of pool.kv_compressor.compress_layer / compress_layer_npu
with run_tq_compress) so chunked run_prefill which expects original pages later
ends up with missing prompt pages; change the logic to not overwrite/free the
original BF16 pages during prefill: either defer creating/assigning
pool.compressed_segments inside compress_old_tokens to only run in decode paths,
or ensure a restore step is performed before any chunked prefill by invoking the
existing restore/restore_pages routine (or add one) so read_context +
compress_layer(*) do not remove the resident BF16 pages until after prefill is
complete.
- Around line 373-377: Replace all synchronous print(..., flush=True) debug
traces in python/core/kv_cache.py with gated logging (e.g., logger.debug) so
they are emitted only when debug logging is enabled; specifically update the
print in the restore/compress code paths and inside
_write_tokens_from_compressed and any other restore/compress helpers to call the
module logger (or an existing logger instance) instead of printing to stdout,
remove explicit flush usage, and preserve the original message formatting so
semantics remain the same.
In `@python/core/serving_worker.py`:
- Around line 343-346: The call to
self.kv_cache_manager.clear_compressed_segments(self.config.model_id, req_id) in
_get_or_update_allocation() is clearing compressed state on every allocation
refresh (prefill/decode) and breaks KvCacheManager.restore_compressed_tokens()
for multi-step requests; move or guard this call so it only runs on actual
preemption/reset paths (e.g., where you detect a preemption or allocation reset)
instead of the general allocation update flow—adjust _get_or_update_allocation()
to only invoke kv_cache_manager.clear_compressed_segments(...) when the function
is handling a preemption/reset event, leaving the compressed segments intact
during normal prefill/decode steps.
In `@python/core/turboquant/compressor.py`:
- Around line 189-199: Replace the duplicate KvQuantConfig dataclass in
compressor.py with an import of the single authoritative KvQuantConfig
definition so there is one source of truth; specifically, remove the local
KvQuantConfig class and import the KvQuantConfig symbol defined in the other
module (the canonical types definition) and update any local references to use
that imported class to ensure default values (e.g., value_bits) remain
consistent across the codebase.
In `@README.md`:
- Line 120: Update the README command example to clarify the device ID
placeholder: replace the ambiguous `--device {}` with a concrete example like
`--device 0` or add a short parenthetical note after the command explaining that
users must replace `{}` with their device ID (e.g., `0` for the first GPU).
Ensure the README's example uses the chosen concrete value or includes the
explicit instruction so readers aren't confused by empty braces.
- Around line 155-166: Update the README's kv_quant JSON example to include the
two supported options missing from the example: add "protected_layers" and
"protected_bits" with their default values (4 and 8) so the configuration
matches the implementation in python/cli/main.py; locate the kv_quant block in
the README sample and append these keys with appropriate values and brief
comments or values so users see all available parameters.
In `@tests/test_lloyd_max.py`:
- Line 127: The f-string prefix is used on plain strings without placeholders in
tests/test_lloyd_max.py (e.g., the print calls that currently read print(f"
PASS: new MSE <= old MSE (within 0.5%)") and the similar PASS/FAIL prints),
which triggers Ruff F541; fix by removing the unnecessary leading "f" from those
print statements so they are plain strings (e.g., change print(f"...") to
print("...")) for each affected print occurrence.
- Around line 107-120: quantization_mse is integrating via Riemann sums but
never multiplies by the interval width dx, so the returned value total_mse /
n_pts is wrong; fix it by computing dx per interval (dx = (b - a) / (n_pts - 1))
inside the loop where a,b are taken from boundaries, multiply each interval's
summed contribution mse_i by that dx before adding to total_mse (i.e., total_mse
+= mse_i.item() * dx), and then return total_mse (remove the final division by
n_pts); refer to the function quantization_mse, variables boundaries, a, b, xs,
mse_i, total_mse, n_pts, and sigma.
---
Nitpick comments:
In `@examples/model/qwen3_14b/test_kv_capacity.py`:
- Line 112: Update the weight dtype used in the KV-capacity test to match other
NPU TurboQuant entrypoints by changing the weight_dtype setting from "float32"
to "bfloat16" in the test_kv_capacity setup (look for the weight_dtype="float32"
argument in the KV capacity test config and replace it with
weight_dtype="bfloat16").
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4f1389df-bcec-4ad4-98b9-f16906f52a18
📒 Files selected for processing (18)
.claude/skills.claude/skills.gitmodulesREADME.mdexamples/model/qwen3_14b/npu_generate.pyexamples/model/qwen3_14b/npu_serving.jsonexamples/model/qwen3_14b/runner/npu_executor.pyexamples/model/qwen3_14b/runner/npu_runner.pyexamples/model/qwen3_14b/test_kv_capacity.pypypto-libpython/cli/main.pypython/core/kv_cache.pypython/core/serving_worker.pypython/core/turboquant/__init__.pypython/core/turboquant/compressor.pypython/core/turboquant/lloyd_max.pypython/core/types.pytests/test_lloyd_max.py
| # ── KV quantization: restore compressed old tokens BEFORE building | ||
| # block_table so that workspace pages are included in page_ids. ── | ||
| t_tq_restore = time.perf_counter() | ||
| if self._kv_cache_manager.is_quantization_enabled(model_id): | ||
| for alloc in batch.kv_allocations: | ||
| self._kv_cache_manager.restore_compressed_tokens(model_id, alloc, npu_runner=self) | ||
| dt_tq_restore = (time.perf_counter() - t_tq_restore) * 1000 |
There was a problem hiding this comment.
Restore/compress are operating on stale KV state.
restore_compressed_tokens() and compress_old_tokens() both mutate/read the host-side cache, but decode runs against _l2_child_tensor(..., refresh=refresh_kv_cache) worker allocations. Once _l2_dirty_kv_models has been cleared, restored pages are never re-uploaded before the next decode, and the post-kernel compression path reads host KV that has not been synced back from the child-memory decode buffer. That will quantize/decode the wrong history after the first quantized step. Please either add explicit KV sync points around TurboQuant or keep the quantization lifecycle entirely on the worker-resident cache. python/runtime/worker.py:89-124 and python/core/kv_cache.py:346-403,444-507 show the contract mismatch.
Also applies to: 313-333, 374-379
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/model/qwen3_14b/runner/npu_runner.py` around lines 299 - 305, The
restore/compress calls operate on stale host-side KV state and must be
synchronized with the worker child-memory buffer: before calling
restore_compressed_tokens() or compress_old_tokens() (e.g., inside the block
where you iterate batch.kv_allocations when is_quantization_enabled(model_id)),
explicitly pull/sync the latest pages from the worker child buffer into the host
cache (use the same mechanism as _l2_child_tensor(..., refresh=True) or add/
call a kv cache sync_from_worker(model_id, alloc) helper) so host reads reflect
the decode buffer; after restore/compress, push any modified pages back to the
worker (sync_to_worker) or mark _l2_dirty_kv_models appropriately so pages will
be uploaded before the next decode, ensuring the quantization lifecycle is kept
consistent between restore_compress and the worker-resident cache.
| # Compressed segments memory estimate | ||
| compressed_mb = 0.0 | ||
| if pool and pool.compressed_segments: | ||
| for req_segs in pool.compressed_segments.values(): | ||
| for seg in req_segs.values(): | ||
| compressed_mb += sum( | ||
| seg.compressed_k.get("idx_bytes", torch.Tensor()).numel(), | ||
| ) | ||
| compressed_mb += sum( | ||
| seg.compressed_k.get("scales", torch.Tensor()).numel() * 2, | ||
| ) | ||
| compressed_mb += sum( | ||
| seg.compressed_v.get("idx_bytes", torch.Tensor()).numel(), | ||
| ) | ||
| compressed_mb += sum( | ||
| seg.compressed_v.get("scales", torch.Tensor()).numel() * 2, | ||
| ) | ||
| compressed_mb = compressed_mb / 1024 / 1024 |
There was a problem hiding this comment.
The KV-memory stats block will throw on the first compressed segment.
numel() returns an int, so sum(seg.compressed_k.get(...).numel(),) raises TypeError: 'int' object is not iterable. Because this sits on the quantized decode path, the logger will start crashing requests as soon as compression succeeds once.
Suggested fix
- compressed_mb = 0.0
+ compressed_bytes = 0
if pool and pool.compressed_segments:
for req_segs in pool.compressed_segments.values():
for seg in req_segs.values():
- compressed_mb += sum(
- seg.compressed_k.get("idx_bytes", torch.Tensor()).numel(),
- )
- compressed_mb += sum(
- seg.compressed_k.get("scales", torch.Tensor()).numel() * 2,
- )
- compressed_mb += sum(
- seg.compressed_v.get("idx_bytes", torch.Tensor()).numel(),
- )
- compressed_mb += sum(
- seg.compressed_v.get("scales", torch.Tensor()).numel() * 2,
- )
- compressed_mb = compressed_mb / 1024 / 1024
+ idx_k = seg.compressed_k.get("idx_bytes", torch.empty(0, dtype=torch.uint8))
+ scales_k = seg.compressed_k.get("scales", torch.empty(0, dtype=torch.bfloat16))
+ idx_v = seg.compressed_v.get("idx_bytes", torch.empty(0, dtype=torch.uint8))
+ scales_v = seg.compressed_v.get("scales", torch.empty(0, dtype=torch.bfloat16))
+ compressed_bytes += idx_k.numel() * idx_k.element_size()
+ compressed_bytes += scales_k.numel() * scales_k.element_size()
+ compressed_bytes += idx_v.numel() * idx_v.element_size()
+ compressed_bytes += scales_v.numel() * scales_v.element_size()
+ compressed_mb = compressed_bytes / 1024 / 1024📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Compressed segments memory estimate | |
| compressed_mb = 0.0 | |
| if pool and pool.compressed_segments: | |
| for req_segs in pool.compressed_segments.values(): | |
| for seg in req_segs.values(): | |
| compressed_mb += sum( | |
| seg.compressed_k.get("idx_bytes", torch.Tensor()).numel(), | |
| ) | |
| compressed_mb += sum( | |
| seg.compressed_k.get("scales", torch.Tensor()).numel() * 2, | |
| ) | |
| compressed_mb += sum( | |
| seg.compressed_v.get("idx_bytes", torch.Tensor()).numel(), | |
| ) | |
| compressed_mb += sum( | |
| seg.compressed_v.get("scales", torch.Tensor()).numel() * 2, | |
| ) | |
| compressed_mb = compressed_mb / 1024 / 1024 | |
| # Compressed segments memory estimate | |
| compressed_bytes = 0 | |
| if pool and pool.compressed_segments: | |
| for req_segs in pool.compressed_segments.values(): | |
| for seg in req_segs.values(): | |
| idx_k = seg.compressed_k.get("idx_bytes", torch.empty(0, dtype=torch.uint8)) | |
| scales_k = seg.compressed_k.get("scales", torch.empty(0, dtype=torch.bfloat16)) | |
| idx_v = seg.compressed_v.get("idx_bytes", torch.empty(0, dtype=torch.uint8)) | |
| scales_v = seg.compressed_v.get("scales", torch.empty(0, dtype=torch.bfloat16)) | |
| compressed_bytes += idx_k.numel() * idx_k.element_size() | |
| compressed_bytes += scales_k.numel() * scales_k.element_size() | |
| compressed_bytes += idx_v.numel() * idx_v.element_size() | |
| compressed_bytes += scales_v.numel() * scales_v.element_size() | |
| compressed_mb = compressed_bytes / 1024 / 1024 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/model/qwen3_14b/runner/npu_runner.py` around lines 391 - 408, The
compressed-segments memory calc calls sum(...) on values returned by .numel(),
which is an int and causes TypeError; update the block that iterates
pool.compressed_segments so that you add the .numel() results directly (and
multiply by 2 for "scales" entries) instead of wrapping them in sum(), e.g., use
seg.compressed_k.get("idx_bytes", torch.Tensor()).numel() and
seg.compressed_k.get("scales", torch.Tensor()).numel() * 2 (same for
compressed_v), and ensure the accumulated value is cast to a float before
dividing by 1024*1024 to compute compressed_mb.
| print( | ||
| f"[TQ-DEBUG] _prepare_decode_inputs: batch_idx={batch_idx}, " | ||
| f"seq_len={seq_len}, max_blocks={max_blocks}, " | ||
| f"alloc.page_ids len={len(alloc.page_ids)}, " | ||
| f"alloc.page_ids={alloc.page_ids}, " | ||
| f"alloc.tokens_used={alloc.tokens_used}, " | ||
| f"slot_mapping={slot_mapping[batch_idx].item()}", | ||
| flush=True, | ||
| ) | ||
| print( | ||
| f"[TQ-DEBUG] _prepare_decode_inputs: block_table={block_table.tolist()[:max_blocks]}", | ||
| flush=True, |
There was a problem hiding this comment.
Gate the TQ decode dumps behind a debug flag.
This runs on every decode step and prints full page metadata with flush=True, so long generations will spend a surprising amount of time in stdout and produce very noisy logs. Please hide it behind _l3_trace or a dedicated TurboQuant debug switch.
Suggested fix
- print(
- f"[TQ-DEBUG] _prepare_decode_inputs: batch_idx={batch_idx}, "
- f"seq_len={seq_len}, max_blocks={max_blocks}, "
- f"alloc.page_ids len={len(alloc.page_ids)}, "
- f"alloc.page_ids={alloc.page_ids}, "
- f"alloc.tokens_used={alloc.tokens_used}, "
- f"slot_mapping={slot_mapping[batch_idx].item()}",
- flush=True,
- )
- print(
- f"[TQ-DEBUG] _prepare_decode_inputs: block_table={block_table.tolist()[:max_blocks]}",
- flush=True,
- )
+ if self._l3_trace:
+ print(
+ f"[TQ-DEBUG] _prepare_decode_inputs: batch_idx={batch_idx}, "
+ f"seq_len={seq_len}, max_blocks={max_blocks}, "
+ f"alloc.page_ids len={len(alloc.page_ids)}, "
+ f"alloc.page_ids={alloc.page_ids}, "
+ f"alloc.tokens_used={alloc.tokens_used}, "
+ f"slot_mapping={slot_mapping[batch_idx].item()}",
+ flush=True,
+ )
+ if self._l3_trace:
+ print(
+ f"[TQ-DEBUG] _prepare_decode_inputs: block_table={block_table.tolist()[:max_blocks]}",
+ flush=True,
+ )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/model/qwen3_14b/runner/npu_runner.py` around lines 1177 - 1188, Wrap
the noisy debug prints in _prepare_decode_inputs behind the existing debug flag
(e.g. check self._l3_trace or a dedicated TurboQuant debug switch) so they only
execute when tracing is enabled; specifically guard the two print calls that
reference batch_idx, seq_len, max_blocks, alloc.page_ids, alloc.tokens_used,
slot_mapping and the block_table print with an if self._l3_trace (or if
_l3_trace) block so the heavy stdout output and flush=True are skipped by
default.
| parser.add_argument("--step", type=int, default=32, | ||
| help="Increase prompt tokens by this much each iteration.") |
There was a problem hiding this comment.
Validate --step (and related bounds) to prevent non-terminating runs.
With --step <= 0, Line 147 can stall or regress prompt_len, making the loop at Line 135 non-terminating.
💡 Proposed fix
def main() -> None:
args = build_parser().parse_args()
+ if args.step <= 0:
+ raise ValueError("--step must be > 0.")
+ if args.start_len <= 0:
+ raise ValueError("--start-len must be > 0.")
+ if args.max_len < args.start_len:
+ raise ValueError("--max-len must be >= --start-len.")
+
model_dir = Path(args.model_dir).resolve()Also applies to: 135-148
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/model/qwen3_14b/test_kv_capacity.py` around lines 58 - 59, The
--step argument must be validated to be a positive integer to avoid
non-terminating loops that use prompt_len and step in the capacity test; add
validation (either a custom argparse type like positive_int or an explicit check
after parsing) to ensure args.step > 0 (and optionally cap it against
args.max_prompt or other relevant bounds), and raise/exit with a clear error if
the value is invalid so the loop that manipulates prompt_len cannot stall or
regress (refer to parser, args.step, prompt_len and the loop logic around lines
where prompt_len is increased).
| model_dir = Path(args.model_dir).resolve() | ||
|
|
There was a problem hiding this comment.
Fail fast when --model-dir is invalid.
Line 73 resolves the path but never verifies it exists as a directory, so errors surface later with less clear context.
💡 Proposed fix
args = build_parser().parse_args()
model_dir = Path(args.model_dir).resolve()
+ if not model_dir.is_dir():
+ raise FileNotFoundError(f"Model directory does not exist: {model_dir}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| model_dir = Path(args.model_dir).resolve() | |
| args = build_parser().parse_args() | |
| model_dir = Path(args.model_dir).resolve() | |
| if not model_dir.is_dir(): | |
| raise FileNotFoundError(f"Model directory does not exist: {model_dir}") |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/model/qwen3_14b/test_kv_capacity.py` around lines 73 - 74, The code
resolves the provided model directory into model_dir but doesn't validate that
it exists and is a directory; update the argument handling around model_dir
(where model_dir = Path(args.model_dir).resolve()) to check model_dir.exists()
and model_dir.is_dir(), and if not, exit early with a clear error message (e.g.,
via sys.exit or raising ArgumentError) that includes the invalid path
(args.model_dir) so failures fail fast with useful context.
| @dataclass | ||
| class KvQuantConfig: | ||
| """Configuration for KV cache quantization.""" | ||
|
|
||
| enabled: bool = False | ||
| key_bits: int = 4 | ||
| value_bits: int = 4 | ||
| residual_window: int = 128 | ||
| protected_layers: int = 4 | ||
| protected_bits: int = 8 | ||
|
|
There was a problem hiding this comment.
Unify KvQuantConfig to a single source of truth.
This file redefines KvQuantConfig with defaults that diverge from python/core/types.py (e.g., Line 195 uses value_bits=4 vs Line 26 in python/core/types.py uses value_bits=2). Having two config classes for the same contract can silently change behavior depending on import path.
Proposed fix
-from dataclasses import dataclass
+from dataclasses import dataclass
+from python.core.types import KvQuantConfig
@@
-@dataclass
-class KvQuantConfig:
- """Configuration for KV cache quantization."""
-
- enabled: bool = False
- key_bits: int = 4
- value_bits: int = 4
- residual_window: int = 128
- protected_layers: int = 4
- protected_bits: int = 8📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| @dataclass | |
| class KvQuantConfig: | |
| """Configuration for KV cache quantization.""" | |
| enabled: bool = False | |
| key_bits: int = 4 | |
| value_bits: int = 4 | |
| residual_window: int = 128 | |
| protected_layers: int = 4 | |
| protected_bits: int = 8 | |
| # (Import statement to be added at the top of the file:) | |
| from python.core.types import KvQuantConfig | |
| # Lines 189-199 are removed entirely; no code remains at this location |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/core/turboquant/compressor.py` around lines 189 - 199, Replace the
duplicate KvQuantConfig dataclass in compressor.py with an import of the single
authoritative KvQuantConfig definition so there is one source of truth;
specifically, remove the local KvQuantConfig class and import the KvQuantConfig
symbol defined in the other module (the canonical types definition) and update
any local references to use that imported class to ensure default values (e.g.,
value_bits) remain consistent across the codebase.
| ```bash | ||
| python -m python.cli.main \ | ||
| --config examples/model/qwen3_14b/npu_serving.json \ | ||
| --serve --port 8899 --device {} |
There was a problem hiding this comment.
Clarify the device ID placeholder.
The command uses --device {} with empty braces, which may confuse users. Consider using a concrete example like --device 0 or adding a comment explaining that users should replace {} with their device ID.
📝 Suggested fix
- --serve --port 8899 --device {}
+ --serve --port 8899 --device 0📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| --serve --port 8899 --device {} | |
| --serve --port 8899 --device 0 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` at line 120, Update the README command example to clarify the
device ID placeholder: replace the ambiguous `--device {}` with a concrete
example like `--device 0` or add a short parenthetical note after the command
explaining that users must replace `{}` with their device ID (e.g., `0` for the
first GPU). Ensure the README's example uses the chosen concrete value or
includes the explicit instruction so readers aren't confused by empty braces.
| ```json | ||
| { | ||
| "model": { ... }, | ||
| "runtime": { ... }, | ||
| "kv_quant": { | ||
| "enabled": true, | ||
| "key_bits": 4, | ||
| "value_bits": 2, | ||
| "residual_window": 128 | ||
| } | ||
| } | ||
| ``` |
There was a problem hiding this comment.
Complete the TurboQuant JSON configuration example.
The example kv_quant configuration is missing two available options: protected_layers and protected_bits. According to the implementation in python/cli/main.py, these options are supported with defaults of 4 and 8 respectively. Users should be aware of all available configuration parameters.
📝 Suggested fix
{
"model": { ... },
"runtime": { ... },
"kv_quant": {
"enabled": true,
"key_bits": 4,
"value_bits": 2,
- "residual_window": 128
+ "residual_window": 128,
+ "protected_layers": 4,
+ "protected_bits": 8
}
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` around lines 155 - 166, Update the README's kv_quant JSON example
to include the two supported options missing from the example: add
"protected_layers" and "protected_bits" with their default values (4 and 8) so
the configuration matches the implementation in python/cli/main.py; locate the
kv_quant block in the README sample and append these keys with appropriate
values and brief comments or values so users see all available parameters.
| def quantization_mse(centroids_in): | ||
| """Compute MSE of a scalar quantizer for N(0, sigma^2).""" | ||
| cs = sorted(centroids_in.tolist()) | ||
| boundaries = [-1e10] + [(cs[i] + cs[i+1]) / 2 for i in range(len(cs)-1)] + [1e10] | ||
| total_mse = 0.0 | ||
| n_pts = 10000 | ||
| for i in range(len(cs)): | ||
| a, b = boundaries[i], boundaries[i+1] | ||
| xs = torch.linspace(a, b, n_pts, dtype=torch.float64) | ||
| pdf_vals = torch.exp(-0.5 * (xs / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi)) | ||
| mse_i = ((xs - cs[i]) ** 2 * pdf_vals).sum() | ||
| total_mse += mse_i.item() | ||
| dx = (boundaries[-1] - boundaries[0]) / n_pts # approximate | ||
| return total_mse / n_pts # normalized |
There was a problem hiding this comment.
Correct the MSE integration math in the validator.
quantization_mse currently computes dx but does not apply it; returning total_mse / n_pts is not the integral-based MSE and can mis-rank quantizers.
Proposed fix
def quantization_mse(centroids_in):
@@
- total_mse = 0.0
+ total_mse = 0.0
n_pts = 10000
for i in range(len(cs)):
a, b = boundaries[i], boundaries[i+1]
xs = torch.linspace(a, b, n_pts, dtype=torch.float64)
pdf_vals = torch.exp(-0.5 * (xs / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
- mse_i = ((xs - cs[i]) ** 2 * pdf_vals).sum()
- total_mse += mse_i.item()
- dx = (boundaries[-1] - boundaries[0]) / n_pts # approximate
- return total_mse / n_pts # normalized
+ integrand = (xs - cs[i]) ** 2 * pdf_vals
+ total_mse += torch.trapz(integrand, xs).item()
+ return total_mse🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_lloyd_max.py` around lines 107 - 120, quantization_mse is
integrating via Riemann sums but never multiplies by the interval width dx, so
the returned value total_mse / n_pts is wrong; fix it by computing dx per
interval (dx = (b - a) / (n_pts - 1)) inside the loop where a,b are taken from
boundaries, multiply each interval's summed contribution mse_i by that dx before
adding to total_mse (i.e., total_mse += mse_i.item() * dx), and then return
total_mse (remove the final division by n_pts); refer to the function
quantization_mse, variables boundaries, a, b, xs, mse_i, total_mse, n_pts, and
sigma.
| print(f" MSE old: {mse_old:.6e}") | ||
| print(f" MSE new: {mse_new:.6e}") | ||
| if mse_new <= mse_old * 1.005: | ||
| print(f" PASS: new MSE <= old MSE (within 0.5%)") |
There was a problem hiding this comment.
Fix invalid f-strings flagged by lint.
Lines 127/134/136/146/148 use f"..." without placeholders, which triggers Ruff F541 and can fail CI.
Proposed fix
- print(f" PASS: new MSE <= old MSE (within 0.5%)")
+ print(" PASS: new MSE <= old MSE (within 0.5%)")
...
- print(f" PASS: centroids monotonically increasing")
+ print(" PASS: centroids monotonically increasing")
...
- print(f" FAIL: centroids not monotonically increasing")
+ print(" FAIL: centroids not monotonically increasing")
...
- print(f" PASS: symmetric around 0")
+ print(" PASS: symmetric around 0")
...
- print(f" WARN: not perfectly symmetric (expected for finite iterations)")
+ print(" WARN: not perfectly symmetric (expected for finite iterations)")Also applies to: 134-134, 136-136, 146-146, 148-148
🧰 Tools
🪛 Ruff (0.15.14)
[error] 127-127: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_lloyd_max.py` at line 127, The f-string prefix is used on plain
strings without placeholders in tests/test_lloyd_max.py (e.g., the print calls
that currently read print(f" PASS: new MSE <= old MSE (within 0.5%)") and the
similar PASS/FAIL prints), which triggers Ruff F541; fix by removing the
unnecessary leading "f" from those print statements so they are plain strings
(e.g., change print(f"...") to print("...")) for each affected print occurrence.
Integrate TurboQuant module (compressor, lloyd_max) with serving-v3 stack: extend kv_cache.py with compress/restore/stats methods, add TurboQuant hooks in npu_runner.py (compress after prefill/decode, restore before decode), clean compressed segments on preemption in serving_worker.py, and add kv_quant JSON config parsing in CLI.
Integrate TurboQuant module (compressor, lloyd_max) with serving-v3 stack: extend kv_cache.py with compress/restore/stats methods, add TurboQuant hooks in npu_runner.py (compress after prefill/decode, restore before decode), clean compressed segments on preemption in serving_worker.py, and add kv_quant JSON config parsing in CLI.
a0306af to
30e7c55
Compare
将 TurboQuant V3(纯 MSE 模式)集成到 pypto-serving 的 KV Cache 管理中,支持在线压缩/解压缩历史 token,减少 KV Cache 内存占用。