Reuse L3 generate runtime across requests #12
Previously each run_generate_l3 call rebuilt a Worker, re-uploaded weights, and tore it down, paying ~3.4s of setup per request. Switch to DistributedCompiledProgram.prepare() so assemble + worker fork + static weight/KV upload happen once per model; per-request code only refreshes shared buffers in place and dispatches on the held DistributedRuntime.
To make the long-lived worker releasable, add close() to LLMEngine, ModelExecutor, and ModelRunner (default no-op), implement it on PyptoExecutor and Qwen314BModelRunner, and wrap the qwen3_14b npu_generate main() in try/finally so resources drop before exit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
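The prepare-once / dispatch-many pattern described above can be sketched in isolation. This is only an illustration under stated assumptions: `FakeRuntime`, `FakeRunner`, and `main` are hypothetical stand-ins, not the actual `DistributedRuntime`, `Qwen314BModelRunner`, or `npu_generate` code, and the expensive setup is represented by a simple counter.

```python
class FakeRuntime:
    """Hypothetical stand-in for a long-lived runtime handle."""

    def __init__(self):
        self.closed = False

    def dispatch(self, prompt):
        # Per-request work only; no setup is repeated here.
        if self.closed:
            raise RuntimeError("runtime already closed")
        return f"generated:{prompt}"

    def close(self):
        self.closed = True


class FakeRunner:
    """Hypothetical runner that prepares its runtime lazily and only once."""

    def __init__(self):
        self._runtime = None
        self.prepare_calls = 0  # counts how often the expensive setup ran

    def _get_runtime(self):
        if self._runtime is None:
            # Stands in for the one-time assemble + fork + upload step.
            self.prepare_calls += 1
            self._runtime = FakeRuntime()
        return self._runtime

    def generate(self, prompt):
        return self._get_runtime().dispatch(prompt)

    def close(self):
        # Safe to call repeatedly, and safe if nothing was ever prepared.
        if self._runtime is not None:
            self._runtime.close()
            self._runtime = None


def main(prompts):
    runner = FakeRunner()
    try:
        outputs = [runner.generate(p) for p in prompts]
        return outputs, runner.prepare_calls
    finally:
        # Runs on normal exit and on exception, so the worker is released.
        runner.close()
```

Calling `main(["a", "b", "c"])` in this sketch runs the setup step once for three requests, and the `finally` clause releases the handle whether or not a request raised.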
Code Review
This pull request refactors the L3-wrapped generation loop for the Qwen3-14B model to use a reusable runtime handle (_L3Runtime), which avoids expensive worker forks and weight uploads on every request. It also introduces a close() method across the engine, executor, and runner classes to properly release backend resources and prevent memory leaks. The review feedback suggests adding a validation check to ensure max_new_tokens is at least 1, and enforcing contiguity on the host KV cache tensors before raw memory copies to prevent potential data corruption.
    if max_new_tokens > model.runtime.max_new_tokens:
        raise ValueError(
            f"max_new_tokens={max_new_tokens} exceeds compiled L3 limit "
            f"{model.runtime.max_new_tokens}"
        )
Add a validation check to ensure max_new_tokens is at least 1. If max_new_tokens is less than 1, the sub-worker's sampling logic and buffer indexing could behave unexpectedly.
    if max_new_tokens < 1:
        raise ValueError(f"max_new_tokens must be >= 1, got {max_new_tokens}")
    if max_new_tokens > model.runtime.max_new_tokens:
        raise ValueError(
            f"max_new_tokens={max_new_tokens} exceeds compiled L3 limit "
            f"{model.runtime.max_new_tokens}"
        )

    kv_k_host, kv_v_host = self._kv_cache_manager.materialize_full_layer_cache(model_id)
    kv_k_host = self._share_cpu_tensor(kv_k_host)
    kv_v_host = self._share_cpu_tensor(kv_v_host)
The host KV cache tensors kv_k_host and kv_v_host are copied to and from the device using raw pointers (data_ptr()). If these tensors are not contiguous, raw memory copies will result in data corruption. Enforcing contiguity before sharing the CPU tensors prevents this potential issue.
- kv_k_host, kv_v_host = self._kv_cache_manager.materialize_full_layer_cache(model_id)
- kv_k_host = self._share_cpu_tensor(kv_k_host)
- kv_v_host = self._share_cpu_tensor(kv_v_host)
+ kv_k_host, kv_v_host = self._kv_cache_manager.materialize_full_layer_cache(model_id)
+ if not kv_k_host.is_contiguous():
+     kv_k_host = kv_k_host.contiguous()
+ if not kv_v_host.is_contiguous():
+     kv_v_host = kv_v_host.contiguous()
+ kv_k_host = self._share_cpu_tensor(kv_k_host)
+ kv_v_host = self._share_cpu_tensor(kv_v_host)
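Why a raw-pointer copy goes wrong on a non-contiguous buffer can be shown with a small standard-library analogy. This sketch uses Python's `memoryview` rather than the tensors in this PR, so it only illustrates the general hazard; the variable names are hypothetical.

```python
# Underlying storage: eight bytes with values 0..7.
buf = bytearray(range(8))

# A strided view over every second byte: logically [0, 2, 4, 6].
view = memoryview(buf)[::2]

# The view is not contiguous: its elements are not adjacent in memory.
is_contiguous = view.contiguous

# What the view logically contains.
logical = view.tobytes()

# What a naive copy of len(view) bytes starting at the base address would
# read instead: the first four bytes of the underlying storage.
raw_prefix = bytes(buf[:len(view)])

# Materializing a contiguous copy first makes logical and raw layout agree,
# which is the same role .contiguous() plays in the suggestion above.
fixed = memoryview(view.tobytes())
```

Here `logical` and `raw_prefix` differ, while `fixed` is contiguous and holds exactly the logical contents, so a byte-for-byte copy of it is safe.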
Summary
- Switch run_generate_l3 to DistributedCompiledProgram.prepare() so assemble + worker fork + static weight/KV upload happen once per model instead of ~3.4s of setup per request; per-request code only refreshes shared buffers in place and dispatches on the held DistributedRuntime.
- Add close() to LLMEngine, ModelExecutor, and ModelRunner (default no-op), implement it on PyptoExecutor and Qwen314BModelRunner, and wrap qwen3_14b's npu_generate main() in try/finally so the long-lived worker is releasable on exit.
Test plan
- Run examples/model/qwen3_14b/npu_generate.py with multiple sequential generate requests and confirm setup cost is paid once, not per request.
- Confirm close() releases worker / KV resources cleanly on normal exit and on exception.
🤖 Generated with Claude Code