
Commit 2d7ffad

Qualcomm AI Engine Direct - Debugger Convergence Phase 2: Migrating to official numeric discrepancy evaluator (pytorch#18834)
### Summary
Debugger Convergence Stage 2.
- Stage 1 (Merged): pytorch#17804
- Stage 2: This PR
- Stage 3: Adding SKILL.md for debugger

Changes made on QNN backend:
- Removed the comparator logic and reuse the dev tools `NumericalComparatorBase`.
- Use ETRecord to retrieve the `edge_after_transform/forward` reference graph.
- Reuse the online `edge_after_transform/forward` graph instead of the one that goes through serialize and deserialize, since serialization does not save `quant attributes`. Reference: https://github.com/pytorch/executorch/blob/d31d4be15c176045ce3bae2c76a50c891fa5973a/exir/serde/serialize.py#L141
- Changed the expected number of events in the UT, since multi-output nodes are not supported in dev tools.
- Verified that the IO order of the graph is working properly.

Changes made on dev tools:
- https://github.com/pytorch/executorch/blob/411ede26bd8abfe723ec34e5a6e729f8c60cfee2/devtools/inspector/_inspector.py#L1150 is changed because it is hardcoded to use the `edge before passes` graph for now. Added a param and made sure it is still backward compatible.
- Added debug_handle to the `pandas dataframe`, since it helps users map the `dataframe` back to the original graph.

### Test plan
Passing E2E test:
- `python backends/qualcomm/tests/test_qnn_delegate.py TestUtilsScript.test_intermediate_debugger --device ${DEVICE} --model SM8750 --build_folder build-android --executorch_root . --artifact_dir ./test_debugger --image_dataset ../datasets/imagenet-mini/val/`

Passing the following UT:
- `python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNQuantizedUtils.test_qnn_backend_dump_intermediate_outputs_simple_model --model SM8750 --device ${DEVICE} --build_folder build-android`
- `python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNQuantizedUtils.test_qnn_backend_dump_intermediate_outputs_topk --model SM8750 --device ${DEVICE} --build_folder build-android`
- `python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNFloatingPointUtils.test_qnn_backend_dump_intermediate_outputs_topk --model SM8750 --device ${DEVICE} --build_folder build-android`
- `python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNFloatingPointUtils.test_qnn_backend_dump_intermediate_outputs_simple_model --model SM8750 --device ${DEVICE} --build_folder build-android`
- Under `devtools/inspector/_inspector_utils.py`, skip the delegate call event, since it holds all `debug_handles` and would mess up the op events' `debug handle`.
1 parent f220e71 commit 2d7ffad
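
The commit message says the delegate call event is skipped because it holds all `debug_handles`. A minimal sketch of why that matters, assuming a simplified event record: the `Event` tuple, its field names, and `owners_per_handle` are hypothetical and are not the dev tools data model.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Event(NamedTuple):
    """Hypothetical, simplified profiling event (not the devtools class)."""

    name: str
    debug_handles: Tuple[int, ...]
    is_delegate_call: bool


def owners_per_handle(
    events: List[Event], skip_delegate_call: bool
) -> Dict[int, List[str]]:
    """List, for every debug handle, the events that claim it."""
    owners: Dict[int, List[str]] = defaultdict(list)
    for event in events:
        if skip_delegate_call and event.is_delegate_call:
            continue  # the delegate call spans every handle in the partition
        for handle in event.debug_handles:
            owners[handle].append(event.name)
    return dict(owners)


events = [
    Event("DELEGATE_CALL", (1, 2, 3), True),
    Event("conv", (1,), False),
    Event("relu", (2,), False),
    Event("topk", (3,), False),
]

with_delegate = owners_per_handle(events, skip_delegate_call=False)
without_delegate = owners_per_handle(events, skip_delegate_call=True)
```

In this toy model every handle has two owners when the delegate call event is kept, and exactly one owner when it is skipped, which is the property a per-op comparison needs.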

18 files changed

Lines changed: 733 additions & 718 deletions

backends/qualcomm/builders/node_visitor.py

Lines changed: 6 additions & 6 deletions
@@ -418,12 +418,12 @@ def get_tensor_name(
         elif is_graph_output(node):
             tensor_name = f"output_{tensor_name}"

-        # Save this for intermediate debugger
-        # Needs idx since node like topk has 2 outputs
-        if QCOM_TENSOR_NAME in node.meta:
-            node.meta[QCOM_TENSOR_NAME][wrapper_idx] = tensor_name
-        else:
-            node.meta[QCOM_TENSOR_NAME] = {wrapper_idx: tensor_name}
+        # Only add qcom_tensor_name when enable tensor dump.
+        # Only runs in qnn_preprocess (not op validation) since that's when
+        # tensor names are finalized and enable_tensor_dump is True.
+        if self.enable_tensor_dump:
+            node.meta.setdefault(QCOM_TENSOR_NAME, {})[wrapper_idx] = tensor_name
+
         return tensor_name

     def define_custom_tensor_wrapper(

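The hunk above replaces an `if`/`else` with `dict.setdefault` and gates the write behind `enable_tensor_dump`. The following sketch, under stated assumptions, checks that the two forms store the same mapping when the flag is on. The constant's value, the tensor names, and both helper functions are placeholders for illustration and are not taken from the repository.

```python
from typing import Dict

QCOM_TENSOR_NAME = "qcom_tensor_name"  # placeholder value for this sketch


def record_old(meta: Dict, wrapper_idx: int, tensor_name: str) -> None:
    """Branching form from the removed lines."""
    if QCOM_TENSOR_NAME in meta:
        meta[QCOM_TENSOR_NAME][wrapper_idx] = tensor_name
    else:
        meta[QCOM_TENSOR_NAME] = {wrapper_idx: tensor_name}


def record_new(
    meta: Dict, wrapper_idx: int, tensor_name: str, enable_tensor_dump: bool
) -> None:
    """setdefault form from the added lines, gated by the dump flag."""
    if enable_tensor_dump:
        meta.setdefault(QCOM_TENSOR_NAME, {})[wrapper_idx] = tensor_name


old_meta: Dict = {}
new_meta: Dict = {}
untouched_meta: Dict = {}
# A two-output node (such as topk) records one name per wrapper index.
for idx, name in enumerate(["topk_values", "topk_indices"]):
    record_old(old_meta, idx, name)
    record_new(new_meta, idx, name, enable_tensor_dump=True)
    record_new(untouched_meta, idx, name, enable_tensor_dump=False)
```

With the flag off, `untouched_meta` stays empty, so node metadata is only annotated when tensor dumping is requested.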
backends/qualcomm/debugger/README.md

Lines changed: 101 additions & 61 deletions
@@ -50,7 +50,7 @@ Generate optrace and QHAS files using QNN tools under $QNN_SDK_ROOT. After finis
 adb = SimpleADB(
     qnn_config=qnn_config,
     pte_path=f"{args.artifact}/{pte_filename}.pte",
-    workspace=f"/data/local/tmp/executorch/{pte_filename},
+    workspace=f"/data/local/tmp/executorch/{pte_filename}",
 )
 binaries_trace = generate_optrace(
     args, adb, f"{args.artifact}/{pte_filename}.pte", example_input
@@ -121,24 +121,24 @@ flowchart TB;
     debug --> output["Output Results"]
 ```

-## Instructions
-
-### 1. Setup
+## Prerequisites
 1. Follow the [tutorial](https://pytorch.org/executorch/main/getting-started-setup) to set up ExecuTorch.
 2. Follow the [tutorial](https://pytorch.org/executorch/stable/build-run-qualcomm-ai-engine-direct-backend.html) to build Qualcomm AI Engine Direct Backend.

-### 2. Enable Flag
+## Instructions

-When executing the script, please add the flag `--dump_intermediate_outputs`. This tells QNN to dump all intermediate tensors during execution.
+### 1. Initialize debugger and build binary
+
+Create a `QNNIntermediateDebugger` with a sample input and pass it to `build_executorch_binary`. The `--dump_intermediate_outputs` flag tells QNN to dump all intermediate tensors during execution.

-### 3. Add debugger to the example script
-Initialize a `QNNIntermediateDebugger`. Please pass initialized `QNNIntermediateDebugger` and the `args.dump_intermediate_outputs` to `build_executorch_binary` method as well.
-#### Example:
 ```python
 from executorch.backends.qualcomm.export_utils import build_executorch_binary
-from executorch.backends.qualcomm.debugger.qnn_intermediate_debugger import QNNIntermediateDebugger
+from executorch.backends.qualcomm.debugger.qnn_intermediate_debugger import (
+    OutputFormat,
+    QNNIntermediateDebugger,
+)

-qnn_intermediate_debugger = QNNIntermediateDebugger()
+qnn_intermediate_debugger = QNNIntermediateDebugger(sample_input=inputs[0])
 build_executorch_binary(
     model=MyModel(),
     qnn_config=qnn_config,
@@ -148,81 +148,121 @@ build_executorch_binary(
 )
 ```

-### 4. Set data num to 1
-It is perfectly fine for users to pass the desired amount of datasets to `build_executorch_binary`, which helps achieve better quantization results. However, after `build_executorch_binary` is called, we need to ensure that we only perform one inference during execution. Please ensure that CPU and QNN is using the same input during execution; otherwise, the debugging results might not be accurate.
+After `build_executorch_binary()`, the debugger holds:
+- `edge_ep` — edge `ExportedProgram` for CPU golden inference.
+- `etrecord_file_path` — path to the generated ET record.
+
+### 2. Execute on device
+
+Ensure `dump_intermediate_outputs` is enabled in your `QnnConfig` (or pass `--dump_intermediate_outputs` via CLI). Only run **one inference** for debugging — multiple executions are not supported.

-### 5: Pull and process the results.
-After QNN execution with the runner, if the previous steps are done correctly, we should be able to get two files: `etdump.etdp` and `debug_output.bin`.
-The following example pulls the files back and calls a callback function to process the results. In this callback function, we create the `Inspector`. Then we perform CPU inference to get CPU intermediate results. Now, we have both QNN and CPU intermediate results, we can start generating results to compare the accuracy. Taking the following example, we should be able to get `debug_graph.svg` as an output in the current directory.
-#### Example:
 ```python
-from executorch.backends.qualcomm.debugger.qnn_intermediate_debugger import OutputFormat
+from executorch.examples.qualcomm.utils import SimpleADB
+
+adb = SimpleADB(
+    qnn_config=qnn_config,
+    pte_path=f"{args.artifact}/{pte_filename}.pte",
+    workspace=f"/data/local/tmp/executorch/{pte_filename}",
+)
+adb.push(inputs=inputs)
+adb.execute()
+```
+
+### 3. Pull results and compare
+
+After execution, pull `etdump.etdp` and `debug_output.bin` from the device. Use `setup_inspector()` to create the `Inspector`, then create comparators and generate results.
+
+Before comparing per-layer outputs, it is highly recommended to verify that the edge program's final output aligns with the original `nn.Module`. The debugger uses the edge program as the CPU golden reference, so if the edge graph itself has diverged (e.g., due to weights quantization or pass transformations), per-layer comparisons against it may be misleading.
+
+```python
+from executorch.backends.qualcomm.debugger.qcom_numerical_comparator_sample import (
+    QcomCosineSimilarityComparator, QcomMSEComparator,
+)
+
 def validate_intermediate_tensor():
-    inspector = Inspector(
+    qnn_intermediate_debugger.setup_inspector(
         etdump_path=f"{args.artifact}/etdump.etdp",
         debug_buffer_path=f"{args.artifact}/debug_output.bin",
     )
-    qnn_intermediate_debugger.intermediate_output_module(*(inputs[0]))
+
+    # Verify edge program output aligns with the original nn.Module.
+    # This ensures the edge graph is a reliable golden reference.
+    edge_result = qnn_intermediate_debugger.edge_ep.module()(*(inputs[0]))
+    with torch.no_grad():
+        source_result = source_model(*(inputs[0]))
+    score = torch.nn.functional.cosine_similarity(
+        edge_result.flatten(), source_result.flatten(), dim=0
+    ).item()
+    print("Cosine similarity between nn.Module and edge CPU:", score)
+
+    cos_comparator = qnn_intermediate_debugger.create_comparator(
+        QcomCosineSimilarityComparator, threshold=0.9
+    )
     qnn_intermediate_debugger.generate_results(
-        title="debug_graph",
-        path=".",
-        output_format=OutputFormat.SVG_GRAPHS,
-        inspector=inspector,
-        evaluator=CosineSimilarityEvaluator(0.9),
+        title="debug_cos_similarity",
+        path=args.artifact,
+        output_format=OutputFormat.SVG_GRAPH,
+        comparator=cos_comparator,
     )

 adb.pull_debug_output(
     args.artifact, args.artifact, callback=validate_intermediate_tensor
 )
 ```

-#### Additional Options
-The above example sets output formats as SVG and evaluation metrics using Cosine Similarity. Based on different needs, users can choose other output formats as shown in the `OutputFormat` class under [qnn_intermediate_debugger](./qnn_intermediate_debugger.py)
+## Comparators
+
+Create comparators via the `create_comparator()` factory, which automatically injects the `edge_ep`. A couple sample comparators are provided under [qcom_numerical_comparator_sample.py](./qcom_numerical_comparator_sample.py):
+
 ```python
-class OutputFormat(IntEnum):
-    SVG_GRAPHS = 0
-    CSV_FILES = 1
-    DUMP_RAW = 2
+cos = qnn_intermediate_debugger.create_comparator(QcomCosineSimilarityComparator, threshold=0.9)
+mse = qnn_intermediate_debugger.create_comparator(QcomMSEComparator, threshold=0.1)
 ```

-For evaluation metrics, if users would like to implement their own metrics, we have provided the option to implement [MetricEvaluatorBase](./metrics_evaluator.py). The following shows how to define custom metrics.
+### Custom comparators
+
+Users can also define their own comparator by implementing a derived class from [QcomNumericalComparatorBase](./qcom_numerical_comparator_base.py). Inside the derived class, users will need to implement `metric_name()`, `is_valid_score()`, and `element_compare()`. The base class handles QNN-specific preprocessing (dequantization, layout conversion) internally — `preprocessing` cannot be overridden.
 ```python
-class RootMeanSquaredErrorEvaluator(MetricEvaluatorBase):
-    def __init__(self, threshold=0.02):
+from executorch.backends.qualcomm.debugger.qcom_numerical_comparator_base import (
+    QcomNumericalComparatorBase,
+)
+
+class MyComparator(QcomNumericalComparatorBase):
+    def __init__(self, edge_ep, threshold=0.5):
+        super().__init__(edge_ep)
         self.threshold = threshold

     def metric_name(self) -> str:
-        return "Root Mean Squared Error"
-
-    def evaluate(
-        self, qnn_output: torch.Tensor, cpu_output: torch.Tensor
-    ) -> Tuple[Any, bool]:
-        mse = F.mse_loss(qnn_output, cpu_output)
-        rmse = torch.sqrt(mse)
-        valid = rmse < self.threshold
-        return rmse, valid
-
-qnn_intermediate_debugger.generate_results(
-    title="my_metric",
-    path=".",
-    output_format=OutputFormat.SVG_GRAPHS,
-    inspector=inspector,
-    evaluator=RootMeanSquaredErrorEvaluator(),
-)
+        return "my_metric"
+
+    def is_valid_score(self, score: float) -> bool:
+        return score >= self.threshold
+
+    def element_compare(self, a, b) -> float:
+        # your comparison logic here
+        ...
 ```

-### Example Script
-We have provided an inception_v3 demo script to help users better understand how to apply the debugger to their scripts. Please refer to [qnn_intermediate_debugger_demo.py](../../../examples/qualcomm/util_scripts/qnn_intermediate_debugger_demo.py) for the example script.
+## Output formats
+
+| Format | Enum | Output |
+|--------|------|--------|
+| SVG graph | `OutputFormat.SVG_GRAPH` | Color-coded computation graph (green=pass, red=fail) |
+| CSV file | `OutputFormat.CSV_FILE` | Per-node tabular results |
+
+## Example Script
+
+An Inception_V3 demo script is provided at [qnn_intermediate_debugger_demo.py](../../../examples/qualcomm/util_scripts/qnn_intermediate_debugger_demo.py).

-Before running the example script, please ensure that dataset is downloaded. Example dataset can be retrieved [here](https://www.kaggle.com/datasets/ifigotin/imagenetmini-1000).
+Before running, ensure the dataset is downloaded. An example dataset can be retrieved [here](https://www.kaggle.com/datasets/ifigotin/imagenetmini-1000).

-To execute the model:
 ```bash
-python examples/qualcomm/util_scripts/qnn_intermediate_debugger_demo.py -b build-android -m ${SOC_MODEL} --device ${SERIAL_NUM} --dataset ${PATH_TO_DATASET} --dump_intermediate_outputs
+python -m examples.qualcomm.util_scripts.qnn_intermediate_debugger_demo -b build-android -s $DEVICE_SERIAL -m $SOC_MODEL -d path/to/imagenet/val --dump_intermediate_outputs
 ```

-### Limitation
-1. The current debugger only supports performing one execution. Multiple executions may cause unknown behavior and are not recommended.
-2. Please ignore this if you are using `qnn_executor_runner`. If you have decided to write your own runner, please follow the [tutorial](https://pytorch.org/executorch/stable/etdump.html) on how to implement etdump into your own runner.
-3. The current debugger does not support graph with partitions. (WIP)
-4. The current debugger does not support LLM models. (WIP)
+## Limitations
+1. Only one execution per debug session — multiple executions may cause unknown behavior.
+2. If you have decided to write your own runner (instead of `qnn_executor_runner`), follow the [tutorial](https://pytorch.org/executorch/stable/etdump.html) on how to implement etdump.
+3. Does not support graphs with partitions (partial delegation).
+4. Does not support LLM models.
+5. Does not support graphs with multiple methods.

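The new README asks custom comparators to implement `metric_name()`, `is_valid_score()`, and `element_compare()`. Below is a dependency-free sketch of how those three methods could fit together for an RMSE metric, plus the cosine similarity used for the edge-program sanity check. It is an illustration only: `RmseComparatorSketch` is a hypothetical stand-in that does not inherit from `QcomNumericalComparatorBase`, and it works on flat lists of floats rather than tensors so that it runs without ExecuTorch or PyTorch.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, flattened outputs."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch only
    return dot / (norm_a * norm_b)


class RmseComparatorSketch:
    """Hypothetical stand-in exposing the three methods named in the README."""

    def __init__(self, threshold: float = 0.02) -> None:
        self.threshold = threshold

    def metric_name(self) -> str:
        return "rmse"

    def is_valid_score(self, score: float) -> bool:
        # RMSE is an error metric: lower is better, so compare with <=.
        return score <= self.threshold

    def element_compare(self, a: Sequence[float], b: Sequence[float]) -> float:
        if len(a) != len(b) or not a:
            raise ValueError("inputs must be non-empty and equal in length")
        mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
        return math.sqrt(mse)


comparator = RmseComparatorSketch(threshold=0.02)
golden = [1.0, 2.0, 3.0]
score = comparator.element_compare(golden, golden)
passed = comparator.is_valid_score(score)
```

The direction of the threshold test depends on the metric. The README's `MyComparator` passes when `score >= threshold`, which suits similarity scores, while an error metric such as RMSE needs the opposite comparison.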
backends/qualcomm/debugger/TARGETS

Lines changed: 2 additions & 1 deletion
@@ -15,7 +15,8 @@ runtime.python_library(
     name = "qnn_intermediate_debugger",
     srcs = [
         "format_outputs.py",
-        "metrics_evaluator.py",
+        "qcom_numerical_comparator_base.py",
+        "qcom_numerical_comparator_sample.py",
         "qnn_intermediate_debugger.py",
     ],
     deps = [
