Skip to content

Reuse L3 generate runtime across requests#12

Draft
luohuan19 wants to merge 1 commit into
hw-native-sys:mainfrom
luohuan19:reuse-l3-generate-runtime
Draft

Reuse L3 generate runtime across requests#12
luohuan19 wants to merge 1 commit into
hw-native-sys:mainfrom
luohuan19:reuse-l3-generate-runtime

Conversation

@luohuan19
Copy link
Copy Markdown

Summary

  • Switch run_generate_l3 to DistributedCompiledProgram.prepare() so assemble + worker fork + static weight/KV upload happen once per model instead of ~3.4s of setup per request; per-request code only refreshes shared buffers in place and dispatches on the held DistributedRuntime.
  • Add close() to LLMEngine, ModelExecutor, and ModelRunner (default no-op), implement it on PyptoExecutor and Qwen314BModelRunner, and wrap qwen3_14b's npu_generate main() in try/finally so the long-lived worker is releasable on exit.

Test plan

  • Run examples/model/qwen3_14b/npu_generate.py with multiple sequential generate requests and confirm setup cost is paid once, not per request.
  • Verify outputs match the previous per-request setup path.
  • Confirm close() releases worker / KV resources cleanly on normal exit and on exception.

🤖 Generated with Claude Code

Previously each run_generate_l3 call rebuilt a Worker, re-uploaded weights,
and tore it down — paying ~3.4s of setup per request. Switch to
DistributedCompiledProgram.prepare() so assemble + worker fork + static
weight/KV upload happen once per model; per-request code only refreshes
shared buffers in place and dispatches on the held DistributedRuntime.

To make the long-lived worker releasable, add close() to LLMEngine,
ModelExecutor, and ModelRunner (default no-op), implement it on
PyptoExecutor and Qwen314BModelRunner, and wrap the qwen3_14b
npu_generate main() in try/finally so resources drop before exit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the L3-wrapped generation loop for the Qwen3-14B model to use a reusable runtime handle (_L3Runtime), which avoids expensive worker forks and weight uploads on every request. It also introduces a close() method across the engine, executor, and runner classes to properly release backend resources and prevent memory leaks. The review feedback suggests adding a validation check to ensure max_new_tokens is at least 1, and enforcing contiguity on the host KV cache tensors before raw memory copies to prevent potential data corruption.

Comment on lines +569 to +573
if max_new_tokens > model.runtime.max_new_tokens:
raise ValueError(
f"max_new_tokens={max_new_tokens} exceeds compiled L3 limit "
f"{model.runtime.max_new_tokens}"
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Add a validation check to ensure max_new_tokens is at least 1. If max_new_tokens is less than 1, the sub-worker's sampling logic and buffer indexing could behave unexpectedly.

        if max_new_tokens < 1:
            raise ValueError(f"max_new_tokens must be >= 1, got {max_new_tokens}")
        if max_new_tokens > model.runtime.max_new_tokens:
            raise ValueError(
                f"max_new_tokens={max_new_tokens} exceeds compiled L3 limit "
                f"{model.runtime.max_new_tokens}"
            )

Comment on lines +731 to +733
kv_k_host, kv_v_host = self._kv_cache_manager.materialize_full_layer_cache(model_id)
kv_k_host = self._share_cpu_tensor(kv_k_host)
kv_v_host = self._share_cpu_tensor(kv_v_host)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The host KV cache tensors kv_k_host and kv_v_host are copied to and from the device using raw pointers (data_ptr()). If these tensors are not contiguous, raw memory copies will result in data corruption. Enforcing contiguity before sharing the CPU tensors prevents this potential issue.

Suggested change
kv_k_host, kv_v_host = self._kv_cache_manager.materialize_full_layer_cache(model_id)
kv_k_host = self._share_cpu_tensor(kv_k_host)
kv_v_host = self._share_cpu_tensor(kv_v_host)
kv_k_host, kv_v_host = self._kv_cache_manager.materialize_full_layer_cache(model_id)
if not kv_k_host.is_contiguous():
kv_k_host = kv_k_host.contiguous()
if not kv_v_host.is_contiguous():
kv_v_host = kv_v_host.contiguous()
kv_k_host = self._share_cpu_tensor(kv_k_host)
kv_v_host = self._share_cpu_tensor(kv_v_host)

@luohuan19 luohuan19 marked this pull request as draft May 26, 2026 03:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant