Neuron Agentic Development provides a collection of AI agents and skills for developing custom NKI (Neuron Kernel Interface) kernels on AWS Trainium and Inferentia hardware. These capabilities equip coding agents in Kiro and Claude to author, debug, profile, and optimize NKI kernels — abstracting away the proprietary hardware knowledge required to work with Neuron.
NKI is the programming interface for writing custom compute kernels that run directly on AWS Trainium and Inferentia NeuronCores. NKI gives you fine-grained control over the hardware's tensor engines, vector engines, and DMA subsystem, enabling performance that exceeds what framework-level compilation alone can achieve.
Common use cases include fused attention kernels, custom normalization operations, quantized matrix multiplications, and any operation where the default compiler output leaves performance on the table.
| # | Requirement | Details | Needed for |
|---|---|---|---|
| 1 | Trainium/Inferentia instance | trn1, trn2, inf2 EC2 instance (AL2023 DLAMI recommended) | Compiling, profiling, optimization |
| 2 | Neuron SDK | aws-neuronx-tools (pre-installed on DLAMI) | All on-device skills |
| 3 | Python venv with Neuron packages | neuronx-cc, torch-neuronx, neuron-explorer | Compilation, profiling, analysis |
| 4 | Kiro or Claude Code | Installed on the Trainium instance | Running agents and skills |
| 5 | Anthropic API key | For Claude model inference | Agent reasoning |
Important: The agent runs on the same machine as the hardware. There is no laptop-to-remote-box file transfer — everything is co-located. Writing and documentation skills work anywhere (no hardware needed), but profiling and debugging require on-instance execution.
Launch a trn2.3xlarge instance in São Paulo (sa-east-1) or Melbourne (ap-southeast-4) using the Neuron Deep Learning AMI (DLAMI), then SSH into the instance.
Verify the Neuron devices are visible:
# Confirm Neuron devices are visible
neuron-ls
# Confirm neuron-explorer is available
which neuron-explorer && neuron-explorer --version
The DLAMI comes with a pre-installed virtual environment. Activate it:
source ~/opt/aws_neuronx_venv_pytorch_2_9/bin/activate
Clone the repository and install the package:
git clone https://github.com/aws-neuron/neuron-agentic-development.git
cd neuron-agentic-development
pip install .
Then deploy to your preferred tool:
# For Kiro
deploy-neuron-agentic-development-to-kiro
# For Claude Code
deploy-neuron-agentic-development-to-claude
# Install the Kiro CLI
curl -fsSL https://cli.kiro.dev/install | bash
# Start a chat session with the top-level agent
kiro-cli chat --agent neuron-nki-agent
The neuron-nki-agent is the unified entry point. It automatically selects the right workflow based on your request and orchestrates the appropriate skills.
| What you want to do | What to say | Hardware needed? |
|---|---|---|
| Write a new kernel | "Write a fused softmax kernel for bf16 inputs" | No |
| Debug a compilation error | "Fix this kernel" (with error output) | Yes |
| Profile a kernel | "Profile my kernel and show me the metrics" | Yes |
| Analyze a profile | "What's the bottleneck in this kernel?" | Yes (neuron-explorer) |
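Since three of the four requests above depend on on-instance tooling, it can help to confirm the tools are on PATH before starting a session. The following is a minimal sketch, not part of the package: the helper name `find_missing_tools` is hypothetical, and it assumes that `neuron-ls`, `neuron-explorer`, and `neuronx-cc` are each exposed as executables in the active environment.

```python
import shutil

# Executable names taken from the prerequisites and verification steps in this post.
# Assumption: each one is installed as a command on PATH in the active venv.
REQUIRED_TOOLS = ("neuron-ls", "neuron-explorer", "neuronx-cc")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH, in order."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required Neuron tools were found on PATH.")
```

An empty result only shows that the commands exist; it does not confirm that Neuron devices are attached, which is what `neuron-ls` reports.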
The package provides five specialized skills that follow the natural kernel development pipeline: write → debug → profile → analyze.
| # | Skill | Category | Use when |
|---|---|---|---|
| 1 | neuron-nki-writing | Authoring | Writing new kernels or modifying existing ones |
| 2 | neuron-nki-debugging | Debugging | Resolving compiler errors or numerical mismatches |
| 3 | neuron-nki-docs | Documentation | Looking up API signatures, tutorials, error codes |
| 4 | neuron-nki-profiling | Profiling | Capturing execution traces on hardware |
| 5 | neuron-nki-profile-querying | Analysis | SQL-based performance bounds and bottleneck analysis |
neuron-nki-writing: Your starting point for creating NKI kernels. It translates PyTorch, NumPy, or natural language descriptions into correct NKI code, and covers tiling strategies that respect hardware constraints (e.g., 128 partition dimension, 512/4096 PSUM free dimension), memory access patterns, and efficiency guidelines for DMA sizing and SBUF reuse.
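To make the tiling constraint concrete, here is a small sketch in plain Python (it does not use the NKI API). It splits a 2D shape into tiles of at most 128 rows and 512 columns, using the limits quoted above. The helper names `tile_ranges` and `plan_tiles` are hypothetical, and treating 512 as the column limit is an assumption; which of the two PSUM figures applies to a given kernel is not stated here.

```python
PARTITION_MAX = 128  # partition-dimension limit quoted in the text
FREE_MAX = 512       # assumption: the smaller PSUM free-dimension figure


def tile_ranges(extent, max_tile):
    """Yield (start, size) pairs covering `extent` in chunks of at most `max_tile`."""
    if extent < 0 or max_tile <= 0:
        raise ValueError("extent must be >= 0 and max_tile must be > 0")
    for start in range(0, extent, max_tile):
        yield start, min(max_tile, extent - start)


def plan_tiles(rows, cols, p_max=PARTITION_MAX, f_max=FREE_MAX):
    """Return (row_start, row_size, col_start, col_size) for every tile."""
    return [
        (r0, rs, c0, cs)
        for r0, rs in tile_ranges(rows, p_max)
        for c0, cs in tile_ranges(cols, f_max)
    ]


tiles = plan_tiles(300, 1000)
print(len(tiles))   # 3 row tiles x 2 column tiles = 6
print(tiles[-1])    # the ragged edge tile: (256, 44, 512, 488)
```

The last tile is smaller than the limits in both dimensions, which is the edge case a real kernel has to handle with masking or a separate remainder path.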
neuron-nki-debugging: A systematic workflow for resolving NKI compilation and execution errors. It covers environment setup with the correct --target flags, compiler error resolution with a categorized index of all 28 NCC error codes, and numerical validation against CPU-computed references.
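Numerical validation against a CPU reference usually comes down to an element-wise tolerance check. The sketch below shows one way to do that in plain Python; it is not the skill's actual implementation. The function name `find_mismatches` is hypothetical, and the default tolerances are assumptions chosen loosely for bfloat16 output, not values taken from the package.

```python
import math


def find_mismatches(actual, expected, rtol=1e-2, atol=1e-2):
    """Return the indices where `actual` differs from `expected` beyond tolerance.

    An element passes when |a - e| <= atol + rtol * |e|. NaN never passes.
    """
    if len(actual) != len(expected):
        raise ValueError("actual and expected must have the same length")
    bad = []
    for i, (a, e) in enumerate(zip(actual, expected)):
        if math.isnan(a) or math.isnan(e) or abs(a - e) > atol + rtol * abs(e):
            bad.append(i)
    return bad


kernel_out = [1.0, 2.0, 3.0]     # stand-in for flattened kernel output
reference = [1.001, 2.0, 3.5]    # stand-in for the CPU-computed reference
print(find_mismatches(kernel_out, reference))  # [2]
```

Returning indices rather than a single boolean makes it easier to see whether errors cluster at tile boundaries, which often points to an indexing bug rather than a precision issue.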
neuron-nki-profiling: Captures execution profiles on hardware. It configures the runtime inspection environment variables, runs the kernel, identifies the correct NEFF (Neuron Execution File Format), captures the trace with neuron-explorer (including DGE notifications for DMA-level detail), and extracts JSON metrics.
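One step in that sequence, picking out the NEFF to profile, can be approximated with a simple file search. The sketch below is a heuristic under stated assumptions, not the skill's actual logic: it assumes compiled artifacts use a `.neff` extension and that the most recently modified one is the kernel just compiled. The function name `newest_neff` is hypothetical.

```python
from pathlib import Path


def newest_neff(search_dir):
    """Return the most recently modified *.neff under `search_dir`, or None.

    Heuristic only: "newest" is assumed to mean "the kernel just compiled".
    """
    return max(
        Path(search_dir).rglob("*.neff"),
        key=lambda path: path.stat().st_mtime,
        default=None,
    )


if __name__ == "__main__":
    print(newest_neff("."))
```

If a run produces several NEFFs, for example one per compiled graph, the newest file may not be the one containing the kernel of interest, so the result is worth checking by hand.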
neuron-nki-profile-querying: Ingests NEFF/NTFF files and runs SQL queries to compute performance bounds, identify bottleneck engines, and localize inefficiencies to specific NKI source lines. It supports three analysis approaches: the neuron-explorer API server, DuckDB directly on parquet, or Pandas for custom computation.
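The kind of query involved can be illustrated with a toy example. The sketch below uses Python's built-in `sqlite3` as a stand-in for DuckDB, and the `engine_activity` table, its columns, and the sample rows are all hypothetical; the real profile schema is not described in this post. It ranks engines by busy time and treats the busiest one as the likely bottleneck.

```python
import sqlite3

# Hypothetical trace rows: (engine, start_ns, end_ns).
# Assumption: intervals for the same engine do not overlap.
ROWS = [
    ("tensor", 0, 400),
    ("tensor", 500, 900),
    ("vector", 100, 300),
    ("dma", 0, 950),
    ("scalar", 200, 250),
]

QUERY = """
SELECT engine,
       SUM(end_ns - start_ns) AS busy_ns,
       1.0 * SUM(end_ns - start_ns)
           / (SELECT MAX(end_ns) - MIN(start_ns) FROM engine_activity) AS utilization
FROM engine_activity
GROUP BY engine
ORDER BY busy_ns DESC
"""


def rank_engines(rows):
    """Return (engine, busy_ns, utilization) tuples, busiest engine first."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE engine_activity (engine TEXT, start_ns INTEGER, end_ns INTEGER)"
        )
        conn.executemany("INSERT INTO engine_activity VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


ranking = rank_engines(ROWS)
for engine, busy_ns, utilization in ranking:
    print(f"{engine:<7} busy={busy_ns:>4} ns  utilization={utilization:.2f}")
print("Likely bottleneck:", ranking[0][0])
```

In this made-up data the DMA engine is busy for the whole trace, so the kernel would be classed as DMA-bound. Real analysis also has to account for overlapping intervals and stalls, which this sketch ignores.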
neuron-nki-docs: Used across all stages of development. It provides API signatures and tutorials during authoring, explains error codes during debugging, and clarifies hardware architecture details during profiling. Ask about any nisa.* or nl.* API, look up error codes, find tutorials, or browse architecture guides for Trainium 1/2/3.
| # | Agent | Focus | What it does |
|---|---|---|---|
| 1 | neuron-nki-agent | Full lifecycle | Top-level entry point. Auto-selects the right workflow based on your request and orchestrates the appropriate skills. |
| 2 | neuron-nki-writer-agent | Authoring | Translates PyTorch, NumPy, or natural language descriptions into NKI code. Handles modifications to existing kernels. |
| 3 | neuron-nki-debugger-agent | Debugging | Autonomously analyzes compiler errors, searches documentation for fixes, and applies corrections. Tracks iterations (up to 10) and progressively simplifies when stuck. |
| 4 | neuron-nki-profile-analysis-agent | Profiling + Analysis | Captures execution profiles on hardware, then runs SQL queries against profile data to compute performance bounds, identify bottleneck engines, and localize inefficiencies. |
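The debugger agent's bounded iteration can be pictured as a compile, fix, retry loop. The sketch below is illustrative only and is not the agent's real control flow: `compile_fn` and `fix_fn` are hypothetical callables standing in for the compiler and the agent's correction step, and the "progressively simplifies" behavior is not modeled.

```python
def debug_loop(compile_fn, fix_fn, source, max_iterations=10):
    """Compile `source`, applying `fix_fn` after each failure.

    Returns (final_source, history) on success or (None, history) once
    `max_iterations` attempts have failed. `history` holds one
    (attempt_number, error_or_None) entry per attempt.
    """
    history = []
    for attempt in range(1, max_iterations + 1):
        ok, error = compile_fn(source)
        history.append((attempt, error))
        if ok:
            return source, history
        source = fix_fn(source, error)
    return None, history


# Toy stand-ins: "compilation" fails while the text still contains "BUG".
def fake_compile(src):
    return ("BUG" not in src, None if "BUG" not in src else "found BUG")


def fake_fix(src, error):
    return src.replace("BUG ", "", 1)  # remove one defect per iteration


fixed, history = debug_loop(fake_compile, fake_fix, "BUG BUG kernel")
print(fixed)          # kernel
print(len(history))   # 3
```

Capping the loop keeps a stuck session from running indefinitely, and the recorded history gives a trail of which errors were seen on each attempt.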
┌─────────────────────────────────────────────────┐
│             Your Trainium Instance              │
│                                                 │
│  ┌──────────────┐    ┌───────────────────────┐  │
│  │ Claude Code  │───▶│ Neuron Agentic Dev    │  │
│  │ or Kiro      │    │ • write kernel        │  │
│  └──────────────┘    │ • compile + debug     │  │
│         │            │ • profile (NEFF→NTFF) │  │
│         │            │ • analyze (SQL/DuckDB)│  │
│         ▼            │ • optimize + iterate  │  │
│   Anthropic API      └───────────────────────┘  │
│   (LLM inference)              │                │
│                                ▼                │
│                      ┌────────────────────┐     │
│                      │ NeuronCores        │     │
│                      │ (compile + execute)│     │
│                      └────────────────────┘     │
└─────────────────────────────────────────────────┘
- Profiling and debugging skills require execution on actual Trainium/Inferentia hardware. The writing and docs skills work anywhere.
- All skills target the current NKI Beta 0.3.0 API.
- Skills support Trainium 1 (gen2), Trainium 2 (gen3), and Trainium 3 (gen4) with appropriate --target flags.
- Skills and agents are designed to work together: the top-level agent automatically invokes profiling and debugging skills as needed.
Here's a typical end-to-end workflow:
- Write. Ask the agent: "Write an NKI kernel that computes scaled softmax: softmax(x * scale) along the last dimension, for input shape [batch, seq_len, hidden_dim] in bfloat16." The agent produces a complete kernel with proper tiling, hardware-accelerated exp, float32 accumulation, and bfloat16 output cast.
- Debug. Ask the agent to run the kernel and verify numerical parity against a PyTorch reference. If compilation errors occur, the agent diagnoses and fixes them autonomously.
- Profile. Point the agent at your kernel and ask it to profile. It compiles to a NEFF, benchmarks across multiple input shapes, and captures a hardware profile with neuron-profile.
- Analyze. The agent runs a full bounds analysis on the captured profile and delivers a diagnosis: which engine is the bottleneck, what the stall rates are, and what optimization opportunities exist (e.g., DMA pipelining, double-buffering).
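For the parity check in the Debug step, it helps to have a reference for what the kernel should compute. The sketch below is a plain-Python reference for softmax(x * scale) along the last dimension, written with the standard max-subtraction trick for numerical stability. It is an illustration rather than the agent's output: it runs in double precision on nested lists, does not model bfloat16 rounding, and the function names are hypothetical.

```python
import math


def scaled_softmax_row(row, scale):
    """Return softmax(row * scale) for one 1D list of floats."""
    scaled = [value * scale for value in row]
    peak = max(scaled)                      # subtract the max to avoid overflow in exp
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_softmax(x, scale):
    """Apply scaled softmax along the last dimension of [batch][seq_len][hidden_dim]."""
    return [[scaled_softmax_row(row, scale) for row in seq] for seq in x]


x = [[[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]]   # shape [1, 2, 3]
out = scaled_softmax(x, scale=1.0)
print([round(v, 4) for v in out[0][0]])     # [0.09, 0.2447, 0.6652]
```

Comparing kernel output against a reference like this one requires a tolerance, since the kernel's bfloat16 output carries far fewer mantissa bits than the double-precision values computed here.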
Give the Neuron Agentic Development capabilities a try and send feedback through GitHub Issues or your usual AWS Support contacts.