2 changes: 1 addition & 1 deletion .gitignore
@@ -26,4 +26,4 @@ z_local_saved/
tags

# Generated spell check config
.spellcheck-non-draft.yml
.spellcheck-non-draft.yml
2 changes: 1 addition & 1 deletion assets/contributors.csv
@@ -66,7 +66,7 @@ Daniel Nguyen,,,,,
Joe Stech,Arm,JoeStech,joestech,,
visualSilicon,,,,,
Konstantinos Margaritis,VectorCamp,,,,
Kieran Hejmadi,,,,,
Kieran Hejmadi,Arm,kieranhejmadi01,kieran-hejmadi-88920815b,,
Alex Su,,,,,
Chaodong Gong,,,,,
Owen Wu,Arm,,,,
@@ -0,0 +1,66 @@
---
title: Get started with memory access analysis using Arm Performix and the Arm MCP Server

description: Learn how to profile memory access behavior in a C++ particle simulation on Arm Linux using the Arm Performix Memory Access recipe through the Arm MCP Server.

minutes_to_complete: 45

who_is_this_for: This is an introductory topic for C++ developers who want to use Arm Performix and the Arm MCP Server to diagnose cache and translation behavior in applications running on Arm Neoverse systems.

learning_objectives:
- Explain how L1 cache hits, TLB misses, and page walks affect C++ application runtime.
- Build and visualize the orbiting galaxies example on an Arm Linux target.
- Inspect and optimize the particle data structure using insights from the Memory Access recipe.
- Use the Arm MCP Server with Arm Performix for an agentic analysis workflow.

prerequisites:
- Access to an Arm Neoverse-based Linux metal instance.
- Basic understanding of the memory hierarchy within a CPU.
- Basic C++ development experience.
- Familiarity with the Linux command line.

author: Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: Performance and Architecture
armips:
- Neoverse
tools_software_languages:
- Arm Performix
- MCP
- C++
- CMake
- Python
- Linux perf
operatingsystems:
- Linux

further_reading:
- resource:
title: Identify code hotspots using Arm Performix through the Arm MCP Server
link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/performix-mcp-agent/
type: learning-path
- resource:
title: Find Code Hotspots with Arm Performix
link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/
type: learning-path
- resource:
title: Optimize application performance using Arm Performix CPU microarchitecture analysis
link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/performix-microarchitecture/
type: learning-path
- resource:
title: Automate x86-to-Arm application migration using Arm MCP Server
link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/arm-mcp-server/
type: learning-path
- resource:
title: Arm Performix
link: https://developer.arm.com/servers-and-cloud-computing/arm-performix
type: website

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # The weight controls the order of the pages. _index.md always has weight 1.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
@@ -0,0 +1,72 @@
---
title: Background
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Review of the CPU memory hierarchy

This Learning Path assumes you already understand memory hierarchy fundamentals. It is a recap, not an exhaustive explanation, and focuses on concepts used in the worked example.

Modern Arm server CPUs use a hierarchy of memories to reduce the cost of loading and storing data. The fastest storage sits close to each CPU core, while larger memories sit farther away and take more cycles to access.

You typically see:

- L1 data cache (`L1d`) and L1 instruction cache (`L1i`) close to each core, with each access usually taking up to 10 cycles.
- L2 cache, often private to each core, with each access usually taking 10-20 cycles.
- Last-level cache, often shared across multiple cores, with each access usually taking 20 or more cycles.
- DRAM, which is much larger but much slower than on-chip cache.

You can inspect cache topology on a Linux system with Arm's [sysreport](https://learn.arm.com/learning-paths/servers-and-cloud-computing/sysreport/) tool or the `lscpu` command. Unlike `lscpu`, sysreport also reports the set associativity for each cache level. For example, on a system with `git` and `python` installed, run:

```bash
git clone https://github.com/ArmDeveloperEcosystem/sysreport.git
cd sysreport
python3 src/sysreport.py | grep -i cache -A 4
```

The output is similar to:

```output
cache info: size, associativity, sharing
cache line size: 64
Caches:
64 x L1D 64K 4-way 64b-line
64 x L1I 64K 4-way 64b-line
64 x L2U 1M 8-way 64b-line
1 x L3U 32M 16-way 64b-line
```

For a more visual view, install `hwloc` and generate a topology image:

```bash
sudo apt update
sudo apt install -y hwloc
hwloc-ls --of png > topology.png
```

![Hardware locality topology for an Arm server showing per-core L1 and L2 caches and a shared L3 cache across all cores, which helps you verify cache hierarchy before profiling.#center](./topology.png "Example hardware locality topology")

The graphic above illustrates cache tiers on an AWS Graviton3 metal instance based on Neoverse V1. Each of the 64 cores has private `L1d`, `L1i`, and `L2` caches, and all cores share one `L3` cache, sometimes referred to as last-level cache (LLC). Cache sizes, especially at later levels, are not fixed by the Neoverse architecture; implementers such as AWS or Google can configure larger or smaller caches based on design goals.

NUMA, or non-uniform memory access, means memory latency can depend on which processor or socket owns the memory being accessed. On this AWS Graviton3 instance, there is only one NUMA node.

If you would like a comprehensive system-level understanding of the memory subsystem, review our learning path on the [Arm system characterisation tool](https://learn.arm.com/learning-paths/servers-and-cloud-computing/memory-subsystem/).

## Definition of terms used in this learning path

Applications use virtual addresses, which are the addresses a program sees instead of physical DRAM locations. Virtual addressing lets the operating system isolate processes, protect memory, and map each program's address space to available physical memory. The processor translates virtual addresses to physical addresses before it accesses memory.

### Translation lookaside buffer (TLB)

The translation lookaside buffer (TLB) caches recent virtual-to-physical translations at page granularity to avoid page table walks. A TLB miss occurs when the needed translation is not cached, so the processor performs a page table walk to find the mapping. Page walks add latency before a load or store can complete. Large working sets and irregular access patterns, such as strides larger than the typical 4KB page size, can increase TLB pressure because the program touches many pages with little reuse.

### Page faults

A minor page fault is usually harmless: the data is already in RAM, and the kernel only creates the mapping. This commonly happens during anonymous paging when Linux lazily backs newly allocated heap or stack memory on first touch. A major page fault is more expensive because the kernel must fetch the page from disk, such as from a file or swap, so repeated major faults are usually a real performance concern.

### Working set size

The working set is the data your program actively touches during a period of execution. It differs from resident set size (RSS), which is the amount of physical memory currently resident for a process. A process can have a large RSS while the hot loop actively uses only a smaller working set.

### Memory access from a programmer's perspective

From a programmer's perspective, much of the cache and memory subsystem is a black box defined by processor architecture and implementation. Features such as cache associativity, prefetching, and translation caching are designed to hide latency across many workloads. Your main software levers are data structure layout, allocation patterns, and choices such as page size. The layout of your C++ data structures can determine whether the memory hierarchy helps or hurts runtime. The compiler generally cannot reorder structure fields or split objects automatically because that would change program semantics.

@@ -0,0 +1,78 @@
---
title: Build Example
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Install required packages and clone the repository

Use your remote Arm Linux target for all build and run steps. This example uses an AWS `c7g.metal` instance running `Ubuntu 24.04 LTS`.

## Install Arm Performix

Install and configure Arm Performix using the [install guide](https://learn.arm.com/install-guides/performix/) on both your local machine and the remote Neoverse-based system.

## Install the required system packages

Run the following commands, replacing `apt` with the package manager for your Linux distribution.

```bash
sudo apt update
sudo apt install -y git cmake build-essential python3 python3-venv python3-pip
```
{{% notice Please Note %}}

If you are running on an **AWS Ubuntu 24.04 LTS image**, you also need to enable SPE with the following commands. If you are running on another platform, see the [enable SPE learning path](https://github.com/ArmDeveloperEcosystem/arm-learning-paths/pull/3186).

```bash
sudo apt install -y linux-modules-extra-$(uname -r)
sudo modprobe arm_spe_pmu
```
{{% /notice %}}

Clone the example:

```bash
git clone https://github.com/arm-education/Orbiting-Galaxy-Example.git
cd Orbiting-Galaxy-Example
git checkout -b my-work v1.0.3
```

## Build with CMake

```bash
mkdir -p build
cd build
cmake ..
cmake --build . --parallel
```

This produces the workload binaries in `build/`.

## Set up a Python virtual environment and run visualization

From the repository root:

```bash
cd ..
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r scripts/requirements.txt
```

Generate simulation frames and create the GIF:

```bash
cd build
./baseline --visualize
python3 ../scripts/visualize.py galaxy_baseline.bin
```

The script reads simulation data from `galaxy_baseline.bin` and writes a GIF into `assets/`.

![Animated orbiting galaxy simulation generated by the baseline workload, showing particle motion over time so you can verify that the simulation output looks correct before profiling.#center](galaxy_compressed.gif "Orbiting galaxies workload visualization")

Use `--visualize` only for understanding the workload behavior. Do not include visualization mode in profiling runs because file I/O alters the measured runtime characteristics.
@@ -0,0 +1,101 @@
---
title: Inspect with Performix
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Inspect the particle data structure

Start by inspecting the baseline particle model in `src/baseline/particle.hpp`.

{{% notice Tip %}}

If you are using an IDE or editor with an LLM-based coding assistant, the `AGENT.md` file can improve your learning experience. This file provides repository context and helps guide the agent to give more useful assistance.

![Screenshot showing the AGENT.md file in the repository, highlighting the context file your coding assistant uses to provide more relevant guidance during this task.#center](./agent_screen_shot.png "Screenshot of GitHub Copilot in VS Code using AGENT.md as a system prompt to act as a learning assistant.")

{{% /notice %}}

The baseline implementation stores every property for one particle in a single structure:

```cpp
struct Particle {
float x, y, z; // position (12 bytes)
float vx, vy, vz; // velocity (12 bytes)
float mass, charge, temperature; // properties (12 bytes)
float pressure, energy, density; // (12 bytes)
float spin_x, spin_y, spin_z; // (12 bytes)
float pad; // padding (4 bytes)
};
```

The ownership container in the same file is:

```cpp
class ParticleOwner {
// Stores particle references used by the simulation.
std::vector<Particle*> particles_;
};
```

The update loop in `src/baseline/baseline.cpp` repeatedly updates particle positions:

```cpp
for (int iter = 0; iter < iters; ++iter) {
update_positions(particles.data(), NUM_PARTICLES, dt);
}
```

This baseline design can create avoidable memory overhead:

- `ParticleOwner` stores pointers to separately allocated `Particle` objects, so the hot loop must follow an extra level of indirection.
- Each `Particle` is 64 bytes, but the position update only uses `x`, `y`, `z`, `vx`, `vy`, and `vz`.
- Loading whole particle objects can waste cache capacity and memory bandwidth when the loop only needs a subset of fields.

Before you optimize anything, profile and measure.

## Run the Performix Memory Access Recipe

Open the Performix GUI on your local machine and select the **Memory Access** recipe.

Configure the recipe to launch the baseline workload on your remote Arm target:

- Select the configured remote target.
- Set **Workload type** to **Launch a new process**.
- Set **Workload** to the baseline executable:

```output
<path to build directory>/baseline
```

Keep the default profiling duration so Performix records until the workload exits.

![Performix Memory Access recipe setup showing the selected remote Arm target and the workload path field populated with the baseline binary, which confirms the run configuration before profiling starts.#center](./setup.png "Configure the Performix Memory Access recipe")

Start the recipe and wait for the results to load.

## Assess performance

![Performix Memory Access results for the baseline binary showing update_positions with about 66 percent L1C load hits and around 26-cycle average L1C latency, indicating weak cache locality in the hot path.#center](./performix_before_optimizations.png "Baseline memory access results before optimization")

Look at the memory access results for the baseline binary. Most samples are associated with the `update_positions()` function. The `L1C % Loads` value shows that only about two thirds of loads hit in L1 cache, and the average L1 cache load latency is about 26 cycles. A cache-friendly hot loop should have a much higher L1 hit rate and lower average latency.

To investigate further, check the TLB walk data. As described in the background section, the TLB caches virtual-to-physical address translations. As the image below shows, the `TLB Walk Breakdown` tab reports no significant TLB walks, which means address translation is not the main issue.

![Performix Memory Access results show 0% TLB walks across all functions in the baseline binary, indicating that TLB pressure and costly address translation misses are not contributing to the performance issue.#center](./no_tlb_walks.png "TLB walk results showing 0 page table walks for all functions in baseline implementation")

In summary:

- Average load latency is about 26 cycles, indicating frequent accesses beyond L1 cache.
- SPE samples are concentrated in `update_positions()`, confirming this loop dominates execution.
- TLB misses are not significant, so page walks are not the source of the slowdown.

Double-click the `update_positions()` row to open the source code view. The source view shows that the samples concentrate on the per-particle position updates.

![Performix source code view for update_positions showing sample concentration on the x, y, and z update statements, helping you confirm that this loop is the main optimization target.#center](./source_code.png "Baseline source-level samples in update_positions")

Because the majority of samples are associated with accessing the `Particle` data structure, and roughly one third of loads fall back to L2 cache, improving the execution time of this example means finding a more cache-efficient way to access the `Particle` member variables. For example, an alternative data structure might achieve better cache utilization.

In the next section, you use this evidence to guide optimization.