
Refactor: introduce tiered profiling levels for a2a3 tensormap_and_ringbuffer swimlane export#500

Open
indigo1973 wants to merge 1 commit into hw-native-sys:main from indigo1973:prof_0409

Conversation

@indigo1973
Contributor

Replace the boolean `enable_profiling` flag with a four-level `perf_level` (0=off, 1=AICore-only, 2=task+fanout, 3=full with AICPU phase records). The tensormap_and_ringbuffer runtime honors all four levels, while the legacy host_build_graph / aicpu_build_graph paths continue to treat any non-zero value as a simple on/off switch and keep their existing bool member (synchronized via a shared SFINAE helper in runtime_profiling_mode.h).
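The tier semantics and the legacy non-zero-means-on coercion can be sketched as follows. This is a minimal illustration in Python; the names (`records_enabled`, `legacy_enable_profiling`) are hypothetical, and the real synchronization lives in the C++ SFINAE helper in runtime_profiling_mode.h:

```python
# Hypothetical sketch of the perf_level tiers described above.
# Names are illustrative, not the actual runtime API.

PERF_OFF = 0          # no profiling
PERF_AICORE = 1       # AICore records only
PERF_TASK_FANOUT = 2  # + task and fanout/dispatch timestamps
PERF_FULL = 3         # + AICPU phase records


def records_enabled(perf_level: int) -> set:
    """Which record families a given level collects."""
    enabled = set()
    if perf_level >= PERF_AICORE:
        enabled.add("aicore")
    if perf_level >= PERF_TASK_FANOUT:
        enabled.update({"task", "fanout", "dispatch_timestamp"})
    if perf_level >= PERF_FULL:
        enabled.add("aicpu_phase")
    return enabled


def legacy_enable_profiling(perf_level: int) -> bool:
    """Legacy host_build_graph / aicpu_build_graph paths: any
    non-zero level degrades to a plain on/off bool."""
    return perf_level > 0
```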

CLI and JSON are lifted to match:

- `--enable-profiling` in run_example.py now takes an optional int (default 3 when the flag is given, 0 otherwise).
- The swimlane JSON schema gains a new version=0 (level 1: AICore-only) that omits the dispatch/finish/fanout fields, and swimlane_converter.py accepts it.
- Phase buffer allocation, scheduler-phase recording, and orchestrator summary writes in aicpu_executor.cpp are gated on perf_level>=3, so lower levels no longer pay the phase-profiling overhead; fanout/dispatch_timestamp collection is gated on perf_level>=2.

Additionally:

- CallConfig and WorkerPayload switch from bool to int; the Python bindings accept both bool and int for backward compatibility (_normalize_perf_level in code_runner.py, a getter/setter shim in task_interface.cpp).
- PerformanceCollector skips phase-buffer shared-memory allocation and phase-thread management when perf_level < 3 (the calc_perf_data_size path).
- device_runner.cpp (onboard + sim): all enable_profiling guards are replaced with perf_level > 0, and set_perf_level() is called before initialize().
- Unit tests are updated for int-based profiling values.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a more granular performance-profiling system by replacing the boolean `enable_profiling` flag with an integer `perf_level` across the codebase. The change allows multiple profiling modes (0=off, 1=AICore-only, 2=task+fanout, 3=full) and updates the runtime, device runner, and performance collector to handle these levels. It also versions the swimlane JSON export by profiling level and makes swimlane_converter.py more robust by handling optional fanout data. I have no feedback to provide, as there are no review comments to evaluate.

