```
pto-runtime/
├── src/
│   ├── common/task_interface/    # Cross-architecture shared headers (data_type.h, tensor_arg.h, task_args.h)
│   └── {arch}/                   # Architecture-specific code (a2a3, a5)
│       ├── platform/             # Platform-specific implementations
│       │   ├── include/          # Shared headers (host/, aicpu/, aicore/, common/)
│       │   ├── src/              # Shared source (compiled into both backends)
│       │   ├── onboard/          # Real hardware backend
│       │   │   ├── host/         # Host runtime (.so)
│       │   │   ├── aicpu/        # AICPU kernel (.so)
│       │   │   └── aicore/       # AICore kernel (.o)
│       │   └── sim/              # Thread-based simulation backend
│       │       ├── host/
│       │       ├── aicpu/
│       │       └── aicore/
│       │
│       └── runtime/              # Runtime implementations
│           ├── common/           # Shared components across runtimes
│           ├── host_build_graph/          # Host-built graph runtime
│           ├── aicpu_build_graph/         # AICPU-built graph runtime
│           └── tensormap_and_ringbuffer/  # Advanced production runtime
│
├── python/                       # Language bindings
│   ├── bindings/                 # nanobind extension module (_task_interface)
│   │   ├── CMakeLists.txt
│   │   ├── task_interface.cpp
│   │   └── dist_worker_bind.h
│   └── simpler/                  # Python package
│       ├── worker.py             # Unified Worker (L2 single-chip, L3 distributed)
│       ├── task_interface.py     # Python re-exports of nanobind types + helpers
│       ├── runtime_compiler.py   # Multi-platform runtime compiler
│       ├── kernel_compiler.py    # Kernel compiler
│       ├── elf_parser.py         # ELF binary parser
│       ├── env_manager.py        # Environment variable management
│       └── toolchain.py          # Toolchain configuration
│
├── examples/                     # Working examples
│   ├── scripts/                  # Build and test framework
│   │   ├── run_example.py        # Run a single example
│   │   ├── code_runner.py        # Example execution engine
│   │   ├── runtime_builder.py    # Runtime binary builder (pre-built lookup or compile)
│   │   ├── build_runtimes.py     # Pre-build all runtime variants
│   │   └── platform_info.py      # Platform/runtime discovery utilities
│   └── {arch}/                   # Architecture-specific examples
│       ├── host_build_graph/
│       ├── aicpu_build_graph/
│       └── tensormap_and_ringbuffer/
│
├── tests/                        # Test suite
│   ├── ut/                       # Unit tests
│   │   ├── py/                   # Python unit tests (pytest)
│   │   └── cpp/                  # C++ unit tests (GoogleTest)
│   └── st/                       # Device scene tests (hardware-only)
│
└── docs/                         # Documentation
```
| Role | Directory | Responsibility |
|---|---|---|
| Platform Developer | `src/{arch}/platform/` | Platform-specific logic and abstractions |
| Runtime Developer | `src/{arch}/runtime/` | Runtime logic (host, aicpu, aicore, common) |
| Codegen Developer | `examples/` | Code generation examples and kernel implementations |
Rules:
- Stay within your assigned directory unless explicitly requested otherwise
- Create new subdirectories under your assigned directory as needed
- When in doubt, ask before making changes to other areas
The build has two layers: runtime binaries (platform-dependent, user-code-independent) and user code (orchestration + kernels, compiled per-example).
Runtime binaries (host `.so`, aicpu `.so`, aicore `.o`) are pre-built during `pip install .` and cached in `build/lib/{arch}/{variant}/{runtime}/`. The pipeline:
- `examples/scripts/build_runtimes.py` — detects available toolchains, iterates all (platform, runtime) combinations
- `examples/scripts/runtime_builder.py` — orchestrates each per-runtime build (looks up a pre-built binary or compiles)
- `python/runtime_compiler.py` — invokes cmake for each target (host, aicpu, aicore)
Persistent cmake build directories under build/cache/ enable incremental compilation — only changed files are recompiled.
Architecture note: a2a3 and a5 differ only at runtime (device selection, block dimensions, etc.). The compiled binaries are architecture-independent: the same toolchain and flags produce artifacts that work on both chips. Therefore `pip install .` should build all architectures (both a2a3 and a5, both onboard and sim) whenever the corresponding toolchain is available. Toolchain detection (`build_runtimes.py`):

- sim (a2a3sim, a5sim): requires `gcc` + `g++` in `PATH`
- onboard (a2a3, a5): requires `ccec` in `PATH` + the cross-compiler under `ASCEND_HOME_PATH`
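The toolchain checks above can be sketched in a few lines of Python. This is a hedged illustration, not the actual `build_runtimes.py` code; the function name `detect_toolchains` is hypothetical, only the tool names and environment variable come from the rules above.

```python
import os
import shutil

def detect_toolchains():
    """Hypothetical sketch of the toolchain detection rules described above."""
    available = []
    # sim platforms (a2a3sim, a5sim) need a host C/C++ compiler on PATH
    if shutil.which("gcc") and shutil.which("g++"):
        available += ["a2a3sim", "a5sim"]
    # onboard platforms (a2a3, a5) need ccec on PATH plus the Ascend cross-compiler
    if shutil.which("ccec") and os.environ.get("ASCEND_HOME_PATH"):
        available += ["a2a3", "a5"]
    return available
```

Note that the sim and onboard checks are independent: a machine with only gcc/g++ builds the two sim variants, while a full Ascend toolchain enables all four.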
- `python/kernel_compiler.py` — compiles user-written `kernel.cpp` files (one per `func_id`)
- `python/bindings/` — nanobind extension providing ChipWorker, task types, and distributed types to Python
When preprocessor guards are used to isolate platform code paths, the `__aarch64__` block must be placed first:

```cpp
#if defined(__aarch64__)
// aarch64 path (must be first)
#elif defined(__x86_64__)
// x86_64 host simulation path
#else
// other platforms
#endif
```

Every example and device test follows this structure:
```
my_example/
  golden.py            # generate_inputs() + compute_golden()
  kernels/
    kernel_config.py   # KERNELS list + ORCHESTRATION dict + RUNTIME_CONFIG
    aic/               # AICore kernel sources (optional)
    aiv/               # AIV kernel sources (optional)
    orchestration/     # Orchestration C++ source
```
Run with: `python examples/scripts/run_example.py -k <kernels_dir> -g <golden.py> -p <platform>`
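For orientation, a minimal `kernel_config.py` might look like the following. Only the three top-level names (`KERNELS`, `ORCHESTRATION`, `RUNTIME_CONFIG`) come from the layout above; the fields inside each entry are assumptions for illustration, not the actual schema.

```python
# Hypothetical kernel_config.py sketch. The entry fields (func_id,
# core_type, source, runtime) are illustrative assumptions.
KERNELS = [
    {"func_id": 0, "core_type": "aiv", "source": "aiv/add_kernel.cpp"},
]

ORCHESTRATION = {
    "source": "orchestration/orchestration.cpp",
}

RUNTIME_CONFIG = {
    "runtime": "host_build_graph",
}
```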
```
pip install -e .
```

This builds the nanobind `_task_interface` extension and pre-builds all runtime binaries for available toolchains into `build/lib/`. Sim platforms (a2a3sim, a5sim) are built when `gcc`/`g++` are available; onboard platforms (a2a3, a5) are built when `ccec` and the cross-compiler under `ASCEND_HOME_PATH` are available. Since a2a3 and a5 share the same compilation (differing only at runtime), both architectures are always built together when their toolchain is present.
| What changed | Action |
|---|---|
| First time / clean checkout | `pip install -e .` |
| Runtime C++ source (`src/{arch}/runtime/`, `src/{arch}/platform/`) | Pass `--build` to `run_example.py` (incremental, ~1-2s) |
| Nanobind bindings (`python/bindings/`) | Re-run `pip install -e .` |
| Python-only code (`python/*.py`, `examples/scripts/*.py`) | No rebuild needed (editable install) |
| Examples / kernels (`examples/{arch}/`, `tests/st/`) | No rebuild needed, just re-run |
By default, run_example.py loads pre-built runtime binaries from build/lib/. When runtime C++ source has changed, pass --build to recompile incrementally:
```
python examples/scripts/run_example.py --build \
  -k examples/a2a3/host_build_graph/vector_example/kernels \
  -g examples/a2a3/host_build_graph/vector_example/golden.py \
  -p a2a3sim
```

This uses the persistent cmake cache in `build/cache/`, recompiling only what changed. In CI, `pip install .` pre-builds all runtimes before `ci.sh` runs, so examples use pre-built binaries.
```
build/
  cache/{arch}/{variant}/{runtime}/   # cmake intermediate files (persistent)
    host/                             # cmake build dir for host target
    aicpu/                            # cmake build dir for aicpu target
    aicore/                           # cmake build dir for aicore target
  lib/{arch}/{variant}/{runtime}/     # final binaries (stable lookup paths)
    libhost_runtime.so
    libaicpu_kernel.so
    aicore_kernel.o                   # or .so for sim
```
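Because the `lib/` paths are stable, resolving a pre-built binary is a pure path computation. A sketch, assuming the `{arch}/{variant}/{runtime}` layout above (the helper name `runtime_lib_dir` is hypothetical):

```python
from pathlib import Path

def runtime_lib_dir(arch: str, variant: str, runtime: str) -> Path:
    """Hypothetical helper: stable lookup path for pre-built runtime binaries."""
    return Path("build") / "lib" / arch / variant / runtime

# e.g. the host runtime .so for the a2a3 sim host_build_graph runtime
so_path = runtime_lib_dir("a2a3", "sim", "host_build_graph") / "libhost_runtime.so"
```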
Kernels are compiled externally by KernelCompiler and uploaded to the device at runtime:

```python
from simpler.kernel_compiler import KernelCompiler

compiler = KernelCompiler(platform="a2a3sim")
kernel_binary = compiler.compile_incore("path/to/kernel.cpp", core_type="aiv")
```

The compiled binary is then uploaded via `DeviceRunner::upload_kernel_binary(func_id, bin_data, bin_size)`, which loads it into device memory and returns the function address for task dispatch.
- Three programs compile independently with clear API boundaries
- Full Python API via nanobind with torch integration
- Modular design enables parallel component development
- Runtime linking via binary loading