Reuse L3 generate runtime across requests by luohuan19 · Pull Request #12 · hw-native-sys/pypto-serving

luohuan19 · 2026-05-26T03:53:21Z

Summary

Switch run_generate_l3 to DistributedCompiledProgram.prepare() so assemble + worker fork + static weight/KV upload happen once per model instead of ~3.4s of setup per request; per-request code only refreshes shared buffers in place and dispatches on the held DistributedRuntime.
Add close() to LLMEngine, ModelExecutor, and ModelRunner (default no-op), implement it on PyptoExecutor and Qwen314BModelRunner, and wrap qwen3_14b's npu_generate main() in try/finally so the long-lived worker is releasable on exit.

Test plan

Run examples/model/qwen3_14b/npu_generate.py with multiple sequential generate requests and confirm setup cost is paid once, not per request.
Verify outputs match the previous per-request setup path.
Confirm close() releases worker / KV resources cleanly on normal exit and on exception.

🤖 Generated with Claude Code

Previously each run_generate_l3 call rebuilt a Worker, re-uploaded weights, and tore it down — paying ~3.4s of setup per request. Switch to DistributedCompiledProgram.prepare() so assemble + worker fork + static weight/KV upload happen once per model; per-request code only refreshes shared buffers in place and dispatches on the held DistributedRuntime. To make the long-lived worker releasable, add close() to LLMEngine, ModelExecutor, and ModelRunner (default no-op), implement it on PyptoExecutor and Qwen314BModelRunner, and wrap the qwen3_14b npu_generate main() in try/finally so resources drop before exit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request refactors the L3-wrapped generation loop for the Qwen3-14B model to use a reusable runtime handle (_L3Runtime), which avoids expensive worker forks and weight uploads on every request. It also introduces a close() method across the engine, executor, and runner classes to properly release backend resources and prevent memory leaks. The review feedback suggests adding a validation check to ensure max_new_tokens is at least 1, and enforcing contiguity on the host KV cache tensors before raw memory copies to prevent potential data corruption.

gemini-code-assist · 2026-05-26T03:54:30Z

+        if max_new_tokens > model.runtime.max_new_tokens:
+            raise ValueError(
+                f"max_new_tokens={max_new_tokens} exceeds compiled L3 limit "
+                f"{model.runtime.max_new_tokens}"
+            )


Add a validation check to ensure max_new_tokens is at least 1. If max_new_tokens is less than 1, the sub-worker's sampling logic and buffer indexing could behave unexpectedly.

if max_new_tokens < 1: raise ValueError(f"max_new_tokens must be >= 1, got {max_new_tokens}") if max_new_tokens > model.runtime.max_new_tokens: raise ValueError( f"max_new_tokens={max_new_tokens} exceeds compiled L3 limit " f"{model.runtime.max_new_tokens}" )

gemini-code-assist · 2026-05-26T03:54:30Z

+        kv_k_host, kv_v_host = self._kv_cache_manager.materialize_full_layer_cache(model_id)
+        kv_k_host = self._share_cpu_tensor(kv_k_host)
+        kv_v_host = self._share_cpu_tensor(kv_v_host)


The host KV cache tensors kv_k_host and kv_v_host are copied to and from the device using raw pointers (data_ptr()). If these tensors are not contiguous, raw memory copies will result in data corruption. Enforcing contiguity before sharing the CPU tensors prevents this potential issue.

Suggested change

kv_k_host, kv_v_host = self._kv_cache_manager.materialize_full_layer_cache(model_id)

kv_k_host = self._share_cpu_tensor(kv_k_host)

kv_v_host = self._share_cpu_tensor(kv_v_host)

kv_k_host, kv_v_host = self._kv_cache_manager.materialize_full_layer_cache(model_id)

if not kv_k_host.is_contiguous():

kv_k_host = kv_k_host.contiguous()

if not kv_v_host.is_contiguous():

kv_v_host = kv_v_host.contiguous()

kv_k_host = self._share_cpu_tensor(kv_k_host)

kv_v_host = self._share_cpu_tensor(kv_v_host)

gemini-code-assist Bot reviewed May 26, 2026

View reviewed changes

luohuan19 marked this pull request as draft May 26, 2026 03:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reuse L3 generate runtime across requests#12

Reuse L3 generate runtime across requests#12
luohuan19 wants to merge 1 commit into
hw-native-sys:mainfrom
luohuan19:reuse-l3-generate-runtime

luohuan19 commented May 26, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Uh oh!

gemini-code-assist Bot May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

luohuan19 commented May 26, 2026

Summary

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant