Add generation end-to-end profiling#17

Open
ndleslx wants to merge 1 commit into
hw-native-sys:mainfrom
ndleslx:add-generation-e2e-profiler

Conversation

@ndleslx
Contributor

@ndleslx ndleslx commented May 30, 2026

Summary

  • add SA_PROFILE_OUTPUT / SA_PROFILE_LEVEL Chrome trace profiling under python/profile
  • instrument CLI, one-shot generation, HTTP serving, scheduler, worker, executor, and NPU kernel spans
  • include Simpler RunTiming host/device wall metrics on kernel trace args and profile reports

Validation

  • python -m pytest tests/test_profile.py tests/test_batching.py tests/test_cli.py
  • ruff check --config ruff.toml examples/model/qwen3_14b/npu_generate.py examples/model/qwen3_14b/runner/npu_executor.py examples/model/qwen3_14b/runner/npu_runner.py python/cli/main.py python/core/async_engine.py python/core/engine.py python/core/pypto_executor.py python/core/server.py python/core/serving_worker.py python/runtime/worker.py python/profile tests/test_batching.py tests/test_profile.py
  • task-submit HTTP non-L3 256-token profile completed; trace at profile_out/http_non_l3_256_profile_fixed/trace.json contained 256 kernel events with host_wall_us and device_wall_us

@coderabbitai

coderabbitai Bot commented May 30, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1c83d941-3de4-4cf7-aaef-23f6b88e9353

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This pull request adds comprehensive Chrome trace profiling throughout the LLM serving stack. It introduces a new python/profile/ package with environment-driven configuration and JSONL-to-JSON trace merging, then integrates profiling spans at the CLI, engine, executor, worker, HTTP server, and NPU model execution layers to capture end-to-end and kernel-level timing.

Changes

Profiling System Implementation and Integration

Layer / File(s) | Summary

Core profiling infrastructure
Files: .gitignore, python/profile/__init__.py, python/profile/env.py, python/profile/merge.py, python/profile/recorder.py, tests/test_profile.py
Summary: ProfileConfig loads profiling settings from SA_PROFILE_OUTPUT/SA_PROFILE_LEVEL environment variables; ProfileRecorder writes Chrome trace events to per-process JSONL fragments in a fragments_dir; merge_fragments aggregates fragments into a single traceEvents JSON file. Tests validate configuration defaults, output path resolution, JSONL-to-JSON merging, and stale fragment cleanup.

Worker return type changes
Files: python/runtime/worker.py
Summary: Worker.run() and Worker.run_prepared() return Any instead of None, enabling downstream code to capture timing objects returned by the underlying simpler.worker.Worker.

L2 kernel-level profiling
Files: examples/model/qwen3_14b/runner/npu_runner.py, examples/model/qwen3_14b/runner/npu_executor.py, tests/test_batching.py
Summary: _L2Callable dataclass gains a name: str field; new helpers _l2_trace_name(), _run_timing_us(), and _add_run_timing_args() extract and inject wall-clock timings into profiling span arguments. _run_l2_program and run_generate_l3 wrap kernel dispatch in profile_span and return timing results. Tests verify trace naming and timing injection.

Core service profiling
Files: python/core/engine.py, python/core/pypto_executor.py, python/core/async_engine.py, python/core/server.py, python/core/serving_worker.py
Summary: Wraps LLMEngine model initialization and batch generation, PyptoExecutor register/prefill/decode, AsyncLLMEngine startup/request queueing/scheduling loop, HTTP request handlers, and WorkerProcess lifecycle/batch operations in profile_span with metadata (model_id, batch_size, request_id, token counts).

Application-level profiling
Files: python/cli/main.py, examples/model/qwen3_14b/npu_generate.py
Summary: CLI and serving modes initialize the profiler via get_profiler() and merge profiles in try/finally cleanup. The NPU model script extends _TimingCollector to record per-kernel host/device wall times, and wraps generation in profile_span with request metadata.
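As a rough illustration of the fragment format summarized above, the sketch below writes Chrome trace "complete" events (ph "X") as one JSON object per line. The `span` helper, the fragment file name, and the argument names are assumptions made for this example; they are not the PR's ProfileRecorder or profile_span API.

```python
import json
import os
import tempfile
import threading
import time
from contextlib import contextmanager


@contextmanager
def span(fh, name, cat="demo", args=None):
    """Hypothetical helper: emit one Chrome trace complete event as a JSONL line."""
    start_ns = time.perf_counter_ns()
    try:
        yield
    finally:
        end_ns = time.perf_counter_ns()
        event = {
            "name": name,
            "cat": cat,
            "ph": "X",                          # complete event: start plus duration
            "ts": start_ns / 1000,              # Chrome trace timestamps are in microseconds
            "dur": (end_ns - start_ns) / 1000,  # duration in microseconds
            "pid": os.getpid(),
            "tid": threading.get_ident(),
            "args": args or {},
        }
        fh.write(json.dumps(event, separators=(",", ":")) + "\n")


# Write a per-process fragment (the file name is illustrative only).
fragment_path = os.path.join(tempfile.mkdtemp(), f"fragment-{os.getpid()}.jsonl")
with open(fragment_path, "w", encoding="utf-8") as fh:
    with span(fh, "demo.outer", args={"batch_size": 1}):
        with span(fh, "demo.inner"):
            time.sleep(0.001)

# Read the fragment back: one event per line, inner span first because it closes first.
with open(fragment_path, encoding="utf-8") as fh:
    events = [json.loads(line) for line in fh]
```

Because each event is a self-contained line, several processes can write their own fragments without coordinating, and a later merge step only has to concatenate the parsed lines.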

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hop along, dear trace events bright,
Chrome's fragments dance in JSONL light,
From kernel walls to engine spans,
We profile every compute plan!
Wall clocks tick where microseconds gleam, 🕐✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.04% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title clearly summarizes the main change: adding end-to-end profiling for generation, which aligns with the substantive addition of the python/profile module and comprehensive instrumentation throughout the codebase.
Description check | ✅ Passed | The description is directly related to the changeset, providing specific details about the profiling infrastructure added (SA_PROFILE_OUTPUT/SA_PROFILE_LEVEL), instrumented components (CLI, HTTP serving, scheduler, worker, executor, NPU kernels), and validation results demonstrating functional correctness.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive profiling framework (python/profile) for PyPTO, including a Chrome trace-event recorder, environment-based configuration, and trace merging capabilities. It integrates this profiling system across the codebase, wrapping key execution steps (such as model initialization, prefill, decode, serving requests, and scheduler operations) in profile spans, and tracks host and device wall times for kernel execution. The review feedback focuses on improving the robustness, performance, and accuracy of the profiling framework. Key suggestions include optimizing I/O performance by removing aggressive flushing and making writes fail-silent, handling malformed JSON lines gracefully during trace merging, supporting proper task-level visualization in asyncio environments by using task IDs as thread IDs, preventing worker processes from deleting other workers' trace fragments, and improving cross-platform compatibility for default process names.

Comment thread python/profile/recorder.py Outdated
Comment on lines +167 to +175
    def _write(self, event: dict[str, Any]) -> None:
        if self._fh is None:
            return
        line = json.dumps(event, separators=(",", ":"), default=str)
        with self._lock:
            if self._fh is not None:
                self._fh.write(line)
                self._fh.write("\n")
                self._fh.flush()

high

In a high-frequency profiling context (such as recording a span for every kernel execution), calling self._fh.flush() on every write pushes each event to the OS with its own write syscall, which adds avoidable overhead on the hot path. Additionally, an unhandled I/O exception during _write (e.g., disk full) could crash the main model execution flow. We should remove the per-event flush and rely on Python's file buffering, since close() already flushes the file at the end, and wrap the write in a try-except block so that profiling is fail-silent.

Suggested change
-    def _write(self, event: dict[str, Any]) -> None:
-        if self._fh is None:
-            return
-        line = json.dumps(event, separators=(",", ":"), default=str)
-        with self._lock:
-            if self._fh is not None:
-                self._fh.write(line)
-                self._fh.write("\n")
-                self._fh.flush()
+    def _write(self, event: dict[str, Any]) -> None:
+        if self._fh is None:
+            return
+        try:
+            line = json.dumps(event, separators=(",", ":"), default=str)
+            with self._lock:
+                if self._fh is not None:
+                    self._fh.write(line)
+                    self._fh.write("\n")
+        except Exception:
+            pass

Comment thread python/profile/merge.py Outdated
Comment on lines +23 to +27
    for line in f:
        stripped = line.strip()
        if not stripped:
            continue
        events.append(json.loads(stripped))

medium

If a process crashed or was terminated abruptly, its last line in the fragment file might be incomplete or malformed, causing json.loads to raise a json.JSONDecodeError and preventing the entire trace from being merged. We should wrap json.loads in a try-except block to handle malformed lines gracefully and continue merging the rest of the trace.

Suggested change
-    for line in f:
-        stripped = line.strip()
-        if not stripped:
-            continue
-        events.append(json.loads(stripped))
+    for line in f:
+        stripped = line.strip()
+        if not stripped:
+            continue
+        try:
+            events.append(json.loads(stripped))
+        except json.JSONDecodeError:
+            continue
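To make the failure mode concrete, here is a minimal self-contained sketch of a tolerant merge. The function name `merge_fragments_sketch`, its signature, and the fragment file names are assumptions for illustration and do not reproduce the PR's merge_fragments; only the `traceEvents` wrapper key comes from the walkthrough.

```python
import json
import tempfile
from pathlib import Path


def merge_fragments_sketch(fragments_dir: Path, output_path: Path) -> int:
    """Merge *.jsonl fragments into one Chrome trace JSON file, skipping bad lines."""
    events = []
    for fragment in sorted(fragments_dir.glob("*.jsonl")):
        with fragment.open(encoding="utf-8") as f:
            for line in f:
                stripped = line.strip()
                if not stripped:
                    continue
                try:
                    events.append(json.loads(stripped))
                except json.JSONDecodeError:
                    continue  # e.g. a final line cut short when the writer was killed
    output_path.write_text(json.dumps({"traceEvents": events}), encoding="utf-8")
    return len(events)


workdir = Path(tempfile.mkdtemp())
# Fragment "a" ends with a truncated line, as a crashed process might leave behind.
(workdir / "a.jsonl").write_text(
    '{"name":"prefill","ph":"X","ts":0,"dur":5,"pid":1,"tid":1}\n'
    '{"name":"decode","ph":"X","ts":5,"du',
    encoding="utf-8",
)
(workdir / "b.jsonl").write_text(
    '{"name":"http","ph":"X","ts":0,"dur":9,"pid":2,"tid":1}\n',
    encoding="utf-8",
)

merged_count = merge_fragments_sketch(workdir, workdir / "trace.json")
merged_trace = json.loads((workdir / "trace.json").read_text(encoding="utf-8"))
```

Silently skipping lines trades completeness for robustness; a variant that counts skipped lines and reports the count would make data loss visible without aborting the merge.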

Comment on lines +178 to +183
def get_profiler(*, process_name: str | None = None) -> ProfileRecorder:
    """Return the process-local recorder configured from SA_PROFILE_* envs."""
    global _profiler, _profiler_pid
    pid = os.getpid()
    if _profiler is not None and _profiler_pid == pid:
        return _profiler

medium

Since get_profiler caches the profiler instance in the global _profiler variable, any subsequent calls to get_profiler with a different process_name (such as the transition from pypto-serving to pypto-serving-api in run_serve) will silently ignore the new process name. We should allow updating the process name and writing a new metadata event if a different process_name is provided.

Suggested change
-def get_profiler(*, process_name: str | None = None) -> ProfileRecorder:
-    """Return the process-local recorder configured from SA_PROFILE_* envs."""
-    global _profiler, _profiler_pid
-    pid = os.getpid()
-    if _profiler is not None and _profiler_pid == pid:
-        return _profiler
+def get_profiler(*, process_name: str | None = None) -> ProfileRecorder:
+    """Return the process-local recorder configured from SA_PROFILE_* envs."""
+    global _profiler, _profiler_pid
+    pid = os.getpid()
+    if _profiler is not None and _profiler_pid == pid:
+        if process_name is not None and _profiler.process_name != process_name:
+            _profiler.process_name = process_name
+            _profiler.metadata("process_name", process_name, tid=0)
+        return _profiler

Comment thread python/profile/recorder.py Outdated
Comment on lines +158 to +165
    def _ensure_thread_metadata(self, tid: int) -> None:
        if tid in self._thread_names:
            return
        with self._lock:
            if tid in self._thread_names:
                return
            self._thread_names.add(tid)
        self.metadata("thread_name", threading.current_thread().name, tid=tid)

medium

In an asyncio application (like the serving server), all concurrent requests run on the same thread, leading to overlapping, non-nested spans on the same tid in Chrome Trace Viewer, which makes the trace unreadable. We should use the asyncio task ID as the tid when running inside an active event loop to separate concurrent async tasks into their own rows.

Suggested change
-    def _ensure_thread_metadata(self, tid: int) -> None:
-        if tid in self._thread_names:
-            return
-        with self._lock:
-            if tid in self._thread_names:
-                return
-            self._thread_names.add(tid)
-        self.metadata("thread_name", threading.current_thread().name, tid=tid)
+    def _get_tid_and_name(self) -> tuple[int, str]:
+        try:
+            import asyncio
+            task = asyncio.current_task()
+            if task is not None:
+                return id(task), task.get_name()
+        except (ImportError, AttributeError, RuntimeError):
+            # asyncio.current_task() raises RuntimeError when no event loop is running
+            pass
+        return threading.get_ident(), threading.current_thread().name
+    def _ensure_thread_metadata(self, tid: int, name: str) -> None:
+        if tid in self._thread_names:
+            return
+        with self._lock:
+            if tid in self._thread_names:
+                return
+            self._thread_names.add(tid)
+        self.metadata("thread_name", name, tid=tid)
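The idea of giving each asyncio task its own trace lane can be checked in isolation. The sketch below uses a hypothetical free function `lane_id` (not part of the PR) and shows that two concurrent tasks share one thread identifier but have distinct task identities. It also handles the case where no event loop is running, in which `asyncio.current_task()` raises RuntimeError.

```python
import asyncio
import threading


def lane_id() -> tuple[int, str]:
    """Return (id, name) for the current asyncio task, else for the current thread."""
    try:
        task = asyncio.current_task()
    except RuntimeError:
        task = None  # no event loop is running in this thread
    if task is not None:
        return id(task), task.get_name()
    return threading.get_ident(), threading.current_thread().name


async def handle(results: list) -> None:
    # Record the OS-level thread id next to the lane id chosen for this task.
    results.append((threading.get_ident(), *lane_id()))
    await asyncio.sleep(0)


async def main() -> list:
    results: list = []
    await asyncio.gather(handle(results), handle(results))
    return results


sync_lane = lane_id()       # called outside any event loop: falls back to the thread
rows = asyncio.run(main())  # two concurrent tasks running on one thread
```

One caveat with this approach: `id(task)` is only unique among objects that are alive at the same time, so a lane id may be reused after a task has been garbage-collected.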

Comment thread python/profile/recorder.py Outdated
Comment on lines +71 to +72
tid = threading.get_ident()
self._ensure_thread_metadata(tid)

medium

Update span to resolve the tid and name using the new _get_tid_and_name helper to support proper task-level visualization in asyncio environments.

Suggested change
-        tid = threading.get_ident()
-        self._ensure_thread_metadata(tid)
+        tid, name = self._get_tid_and_name()
+        self._ensure_thread_metadata(tid, name)

Comment thread python/profile/recorder.py Outdated
Comment on lines +101 to +102
tid = threading.get_ident()
self._ensure_thread_metadata(tid)

medium

Update instant to resolve the tid and name using the new _get_tid_and_name helper to support proper task-level visualization in asyncio environments.

Suggested change
-        tid = threading.get_ident()
-        self._ensure_thread_metadata(tid)
+        tid, name = self._get_tid_and_name()
+        self._ensure_thread_metadata(tid, name)

Comment thread python/profile/recorder.py Outdated
Comment on lines +128 to +129
tid = threading.get_ident()
self._ensure_thread_metadata(tid)

medium

Update duration to resolve the tid and name using the new _get_tid_and_name helper to support proper task-level visualization in asyncio environments.

Suggested change
-        tid = threading.get_ident()
-        self._ensure_thread_metadata(tid)
+        tid, name = self._get_tid_and_name()
+        self._ensure_thread_metadata(tid, name)

Comment on lines +185 to +188
    config = load_profile_config()
    if config.enabled and _MAIN_PID_ENV not in os.environ:
        os.environ[_MAIN_PID_ENV] = str(pid)
        _prepare_run(config)

medium

If get_profiler is not called in the main process before spawning workers, the first worker to call it will run _prepare_run and potentially delete other workers' fragments. We should explicitly check if we are in the main process (e.g., using multiprocessing.current_process().name == 'MainProcess') before unlinking.

Suggested change
-    config = load_profile_config()
-    if config.enabled and _MAIN_PID_ENV not in os.environ:
-        os.environ[_MAIN_PID_ENV] = str(pid)
-        _prepare_run(config)
+    config = load_profile_config()
+    import multiprocessing
+    is_main = multiprocessing.current_process().name == "MainProcess"
+    if config.enabled and is_main and _MAIN_PID_ENV not in os.environ:
+        os.environ[_MAIN_PID_ENV] = str(pid)
+        _prepare_run(config)

Comment on lines +245 to +246
def _default_process_name() -> str:
    return Path(os.environ.get("_", "python")).name or "python"

medium

Using os.environ.get('_') is Unix-specific and often just returns 'python'. Using sys.argv[0] as a fallback or primary source would provide a much more descriptive default process name (e.g., the script name) across both Unix and Windows.

Suggested change
-def _default_process_name() -> str:
-    return Path(os.environ.get("_", "python")).name or "python"
+def _default_process_name() -> str:
+    import sys
+    if sys.argv and sys.argv[0]:
+        return Path(sys.argv[0]).stem
+    return Path(os.environ.get("_", "python")).name or "python"


@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (3)
python/core/async_engine.py (1)

188-218: ⚡ Quick win

Span scope is narrower than its name suggests.

profile_span("AsyncLLMEngine.add_request", ...) only wraps self.tokenizer.encode(prompt); it closes before the request is built, queued, or streamed. The recorded duration reflects tokenization only, which is misleading under this name. Either rename the span to reflect tokenization, or widen the scope to cover request construction/queueing.

♻️ Option: rename to reflect actual scope
-        with profile_span(
-            "AsyncLLMEngine.add_request",
-            cat="serving",
-            args={"request_id": request_id, "max_new_tokens": config.max_new_tokens},
-        ):
-            prompt_token_ids = self.tokenizer.encode(prompt)
+        with profile_span(
+            "AsyncLLMEngine.tokenize",
+            cat="serving",
+            args={"request_id": request_id, "max_new_tokens": config.max_new_tokens},
+        ):
+            prompt_token_ids = self.tokenizer.encode(prompt)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/core/async_engine.py` around lines 188 - 218, The profiling span named
"AsyncLLMEngine.add_request" currently only wraps tokenizer.encode(prompt) which
mislabels the measured work; either expand the profile_span to include request
construction and queueing (wrap from the tokenizer.encode call through creation
of Request, storing _request_contexts[request_id], and scheduler.add_request) or
rename the span to something like "AsyncLLMEngine.tokenize" to reflect it only
measures tokenization; update the span boundaries around
profile_span/profile_instant so the timing covers the intended calls
(tokenizer.encode, Request(...), _request_contexts assignment, and
scheduler.add_request) or change the span name accordingly to avoid misleading
metrics.
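The effect of span scope on the recorded duration can be sketched without the real engine. Everything below is a stand-in: `profile_span_sketch`, `tokenize`, and `enqueue` are hypothetical names for this example and only mimic the shape of profile_span, tokenizer.encode, and scheduler.add_request.

```python
import time
from contextlib import contextmanager

durations: dict[str, float] = {}


@contextmanager
def profile_span_sketch(name: str):
    """Hypothetical stand-in for profile_span: store elapsed seconds under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[name] = time.perf_counter() - start


def tokenize(prompt: str) -> list[int]:
    time.sleep(0.002)  # stand-in for tokenizer.encode
    return [ord(ch) for ch in prompt]


def enqueue(token_ids: list[int], queue: list) -> None:
    time.sleep(0.002)  # stand-in for request construction and scheduler.add_request
    queue.append(token_ids)


queue: list = []

# Narrow scope: only tokenization is timed, although the name suggests more.
with profile_span_sketch("narrow"):
    ids = tokenize("hi")
enqueue(ids, queue)

# Wide scope: tokenization and queueing are both inside the span.
with profile_span_sketch("wide"):
    ids = tokenize("hi")
    enqueue(ids, queue)
```

Comparing `durations["narrow"]` with `durations["wide"]` after a run shows how much work the narrow span leaves unattributed, which is the reason to either rename it or widen it.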
python/core/pypto_executor.py (1)

51-74: 💤 Low value

Inconsistent span naming convention.

register_model uses "PyptoExecutor.register_model" while the other two use "executor.run_prefill" / "executor.run_decode". Aligning them (e.g. PyptoExecutor.run_prefill/run_decode) keeps trace grouping consistent with the rest of the codebase's Class.method span names.

♻️ Proposed naming alignment
-        with profile_span(
-            "executor.run_prefill",
+        with profile_span(
+            "PyptoExecutor.run_prefill",
             cat="executor",
             args={"model_id": model.config.model_id, "batch_size": len(batch.request_ids)},
         ):
-        with profile_span(
-            "executor.run_decode",
+        with profile_span(
+            "PyptoExecutor.run_decode",
             cat="executor",
             args={"model_id": model.config.model_id, "batch_size": len(batch.request_ids)},
         ):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/core/pypto_executor.py` around lines 51 - 74, The span names are
inconsistent: register_model uses "PyptoExecutor.register_model" while
run_prefill and run_decode use "executor.run_prefill"/"executor.run_decode";
update the profile_span name arguments in the methods run_prefill and run_decode
to "PyptoExecutor.run_prefill" and "PyptoExecutor.run_decode" respectively so
all spans follow the Class.method convention (look for profile_span calls inside
PyptoExecutor.register_model, PyptoExecutor.run_prefill, and
PyptoExecutor.run_decode).
python/core/serving_worker.py (1)

74-79: ⚡ Quick win

Avoid re-initializing main profiler in in-process mode; process_name is likely ignored

  • get_profiler is a per-PID singleton (_profiler/_profiler_pid), so calling get_profiler(process_name=...) again in the main process won’t re-initialize or overwrite the existing profiler/trace metadata.
  • In in-process mode, AsyncLLMEngine.start already initializes the profiler via profile_span("AsyncLLMEngine.start") (which calls get_profiler() without process_name) before WorkerProcess.init_device_and_model() calls get_profiler(process_name="serving-worker-..."); that process_name won’t take effect unless it’s the first get_profiler call.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/core/serving_worker.py` around lines 74 - 79, The call to
get_profiler(process_name=...) in WorkerProcess.init_device_and_model is
ineffective in in-process mode because the profiler singleton is already created
by AsyncLLMEngine.start (via profile_span -> get_profiler() without a name);
remove the redundant get_profiler(process_name=...) call from
WorkerProcess.init_device_and_model (or move/ensure the single get_profiler call
that sets process_name happens before AsyncLLMEngine.start) so the profiler
process_name is set only on the first get_profiler invocation; update code
references in WorkerProcess.init_device_and_model and ensure
AsyncLLMEngine.start/profile_span remain the canonical initialization path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0b410563-0087-4682-8304-c49712ecbcb7

📥 Commits

Reviewing files that changed from the base of the PR and between 7eccbb9 and 4217cca.

📒 Files selected for processing (17)
  • .gitignore
  • examples/model/qwen3_14b/npu_generate.py
  • examples/model/qwen3_14b/runner/npu_executor.py
  • examples/model/qwen3_14b/runner/npu_runner.py
  • python/cli/main.py
  • python/core/async_engine.py
  • python/core/engine.py
  • python/core/pypto_executor.py
  • python/core/server.py
  • python/core/serving_worker.py
  • python/profile/__init__.py
  • python/profile/env.py
  • python/profile/merge.py
  • python/profile/recorder.py
  • python/runtime/worker.py
  • tests/test_batching.py
  • tests/test_profile.py

@ndleslx ndleslx force-pushed the add-generation-e2e-profiler branch from 4217cca to f1112ca Compare May 30, 2026 06:19
@ndleslx
Contributor Author

ndleslx commented May 30, 2026

Addressed the review feedback in f1112ca:

  • removed per-event trace flushes and made profiler writes fail-silent
  • made trace merging skip malformed JSONL lines from partial/crashed fragments
  • added async task-aware trace lanes for serving requests
  • limited stale fragment cleanup to the main process and allowed process-name metadata updates
  • improved default process names from sys.argv[0]
  • widened AsyncLLMEngine.add_request span and aligned PyptoExecutor span names

Validation:

  • python -m pytest tests/test_profile.py tests/test_batching.py tests/test_cli.py
  • ruff check --config ruff.toml python/profile tests/test_profile.py python/core/async_engine.py python/core/pypto_executor.py python/core/serving_worker.py

@ndleslx ndleslx force-pushed the add-generation-e2e-profiler branch from f1112ca to a82259f Compare May 30, 2026 06:55
@ndleslx
Contributor Author

ndleslx commented May 30, 2026

Updated README examples to use the larger PTO2 ring settings:

  • PTO2_RING_HEAP=2147483648
  • PTO2_RING_TASK_WINDOW=262144
  • PTO2_RING_DEP_POOL=262144

Reran HTTP serving with 32 tokens using those settings. Result: HTTP 200, clean task exit=0, generated text starts with: " a Chinese company. The company is located in the United States..."

@ndleslx ndleslx force-pushed the add-generation-e2e-profiler branch from a82259f to 6a1c232 Compare May 30, 2026 08:43
@ndleslx
Contributor Author

ndleslx commented May 30, 2026

Rebased on latest origin/main (25edc21) and resolved conflicts against the new block_ids-based serving/runner path. PR head is now 6a1c232.

Validation on rebased branch:

  • python -m pytest tests/test_profile.py tests/test_batching.py tests/test_cli.py: 27 passed
  • ruff check on touched Python paths: passed

128-token profiling:

  • normal npu_generate: task exit=0, trace profile_out/rebased_non_l3_normal_128_profile/trace.json
    • npu_generate.generate: 23.713s
    • kernels: 1 prefill + 127 decode
    • decode kernel avg: 115.10ms, host total 14462.34ms, device total 14272.66ms
  • HTTP serving: request returned HTTP 200 and trace merged at profile_out/rebased_http_serving_128_profile/trace.json
    • http.completions: 24.200s
    • kernels: 1 prefill + 127 decode
    • decode kernel avg: 115.53ms, host total 14492.63ms, device total 14253.01ms
    • note: the task wrapper reported exit=137 during server shutdown after the successful 200 response and trace merge; a queued clean-shutdown retry was cancelled because it was blocked waiting for an NPU lock.
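Figures like the kernel counts and per-kernel averages above can be derived from a merged trace. The sketch below is one possible way to do that, under stated assumptions: kernel events are Chrome trace complete events (ph "X") in a category named "kernel", and their args carry host_wall_us and device_wall_us as described in the PR summary. The category name, the event names, and the sample numbers are hypothetical and are not data from this PR.

```python
from collections import defaultdict


def summarize_kernels(trace: dict, cat: str = "kernel") -> dict:
    """Group complete events of one category by name and total their timings."""
    summary = defaultdict(
        lambda: {"count": 0, "dur_us": 0.0, "host_wall_us": 0.0, "device_wall_us": 0.0}
    )
    for event in trace.get("traceEvents", []):
        if event.get("ph") != "X" or event.get("cat") != cat:
            continue  # skip metadata, instant events, and other categories
        row = summary[event["name"]]
        args = event.get("args", {})
        row["count"] += 1
        row["dur_us"] += event.get("dur", 0.0)
        row["host_wall_us"] += args.get("host_wall_us", 0.0)
        row["device_wall_us"] += args.get("device_wall_us", 0.0)
    return dict(summary)


# Synthetic trace for illustration only.
trace = {
    "traceEvents": [
        {"name": "process_name", "ph": "M", "pid": 1, "tid": 0, "args": {"name": "demo"}},
        {"name": "prefill", "cat": "kernel", "ph": "X", "ts": 0, "dur": 500.0,
         "args": {"host_wall_us": 480.0, "device_wall_us": 450.0}},
        {"name": "decode", "cat": "kernel", "ph": "X", "ts": 500, "dur": 100.0,
         "args": {"host_wall_us": 90.0, "device_wall_us": 80.0}},
        {"name": "decode", "cat": "kernel", "ph": "X", "ts": 600, "dur": 120.0,
         "args": {"host_wall_us": 110.0, "device_wall_us": 100.0}},
        {"name": "http.completions", "cat": "serving", "ph": "X", "ts": 0, "dur": 800.0},
    ]
}

summary = summarize_kernels(trace)
decode_avg_us = summary["decode"]["dur_us"] / summary["decode"]["count"]
```

For a real run, the same function could be applied to the dictionary obtained by loading the merged trace.json with `json.load`.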
