Commit 54dfde3

quic-boyuc and claude committed
observatory: RFC + docs for Report (JSON) (json_report 3/3)
RFC_SESSIONS.md: new §4a 'Report (JSON)' documents --output-report-json, the three-level lenses[lens][archive][session_id] nesting, the accuracy worst_record semantics (argmin for quality metrics, argmax for error metrics), worst_layers sort direction for per_layer_accuracy, and a side-by-side table contrasting Archive / Report (HTML) / Report (JSON).

REFERENCE.md §1.3: adds a json_report row parallel to the existing dashboard row, documenting the call structure, return contract, and payload destination.

USAGE.md §4a: new 'Report (JSON) for CI and LLM triage' section with a CLI example, a compact payload sketch, and a pointer to RFC §4a.

README.md: one-sentence mention with USAGE.md link added to step 4 of the workflow list.

accuracy.py: docstring corrected: worst_record for error metrics (mse, abs_err) = argmax (highest value = worst quality), not argmin.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
1 parent 60b93f8 commit 54dfde3

5 files changed

Lines changed: 109 additions & 3 deletions

devtools/observatory/README.md

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ capture --> store --> analyze --> visualize --> share
 1. **Capture**: Observatory wraps your export script. Built-in lenses (e.g. `pipeline_graph_collector`) install monkey-patches that call `collect(...)` at each compilation stage; you can also call `Observatory.collect(...)` directly anywhere in your code.
 2. **Store**: Records + per-Session metadata are persisted as a single Archive (JSON) for later re-analysis or comparison.
 3. **Analyze**: Each lens processes the Archive into findings, comparisons, and derived insights.
-4. **Visualize**: Results are assembled into an interactive HTML report (Report (HTML)) with multiple view types.
+4. **Visualize**: Results are assembled into an interactive HTML report (Report (HTML)) with multiple view types. Use `--output-report-json` to also emit a Report (JSON), a lens-summarised dict suitable for CI threshold checks, LLM-driven triage, and dashboard time-series ingestion (see [USAGE.md §4a](USAGE.md)).
 5. **Share**: The Report is a single self-contained HTML file. Send it, attach it to a bug report, or host it on GitHub Pages.

 ## What you get

devtools/observatory/REFERENCE.md

Lines changed: 1 addition & 0 deletions
@@ -43,6 +43,7 @@ frontend rendering, and custom JS callbacks.
 | --- | --- | --- | --- |
 | `resources` | `resources() -> Dict[str, str]` | lens frontend implementation | `resources.js[]`, `resources.css[]` |
 | `dashboard` | `dashboard(session, session_records, analysis) -> Optional[ViewList]` | Framework calls once per `(Session, lens)` pair. `session` is the active `Session` (carries `id`, `name`, `archive`, `start_ts`, `end_ts`, per-lens `start_data` / `end_data`); `session_records` is `[r for r in records if r.session_id == session.id]`; `analysis` is the `AnalysisResult` for this lens. See [RFC_SESSIONS.md](RFC_SESSIONS.md) for the full contract. | `dashboard[lens][session_id]` |
+| `json_report` | `json_report(session, session_records, analysis) -> Optional[Dict[str, Any]]` | Same call structure as `dashboard`. Called once per `(Session, lens)` pair by `export_report_json`. Result lands at `report["lenses"][lens_name][archive_label][session_id]`. Return `None` to opt out (no ghost keys). Returned dict must be JSON-serialisable. See `lenses/accuracy.py` and `lenses/per_layer_accuracy.py` for reference implementations. | `lenses[lens][archive][session_id]` in Report (JSON) |
 | `record` | `record(digest, analysis, context) -> Optional[ViewList]` | `digest=record.data[lens]`, `analysis={"global": global_data, "record": per_record_data[name].data}`, `context={"index", "name"}` | `records[i].views[lens]` |
 | `check_badges` | `check_badges(digest, analysis) -> List[Dict[str, str]]` | current digest + `global_data` | `records[i].badges[]` |
 | `check_index_diffs` | `check_index_diffs(prev_digest, curr_digest, analysis) -> Dict[str, str]` | previous/current digest + `global_data` | `records[i].diff_index` |
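
To make the `json_report` row concrete, here is a minimal sketch of a lens hook that follows the documented contract: same arguments as `dashboard`, a JSON-serialisable dict on success, and `None` to opt out. The `LatencyFrontend` class, the `latency` lens name, the `record.name` attribute, and the `{"ms": ...}` digest layout are all hypothetical and chosen only for illustration; the real base class and record type live in the Observatory package and are not shown in this commit.

```python
import json
from types import SimpleNamespace
from typing import Any, Dict, Optional


class LatencyFrontend:
    """Hypothetical lens frontend; only the json_report signature comes from REFERENCE.md."""

    def json_report(self, session, session_records, analysis) -> Optional[Dict[str, Any]]:
        # Assumed digest layout: record.data["latency"] == {"ms": <float>}.
        samples = [
            (r.name, r.data["latency"]["ms"])
            for r in session_records
            if "latency" in r.data
        ]
        if not samples:
            return None  # opt out: nothing is written for this (Session, lens) pair

        worst_name, _ = max(samples, key=lambda s: s[1])
        payload = {
            "records_measured": len(samples),
            "mean_ms": sum(ms for _, ms in samples) / len(samples),
            "worst_record": worst_name,
        }
        json.dumps(payload)  # raises TypeError if the payload is not JSON-serialisable
        return payload


# Stand-ins for the framework's Session and record objects (assumed attribute names).
session = SimpleNamespace(id="default", name="default", archive="default")
records = [
    SimpleNamespace(name="Exported Float", session_id="default", data={"latency": {"ms": 4.0}}),
    SimpleNamespace(name="Quantized Model", session_id="default", data={"latency": {"ms": 6.0}}),
]

report = LatencyFrontend().json_report(session, records, analysis=None)
empty = LatencyFrontend().json_report(session, [], analysis=None)
```

Calling `json.dumps` inside the hook is one cheap way to catch non-serialisable values (for example tensors) at the lens boundary rather than at export time; whether the framework performs its own check is not stated in the diff.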

devtools/observatory/RFC_SESSIONS.md

Lines changed: 68 additions & 0 deletions
@@ -96,6 +96,74 @@ Archive

 There is no flat `start_data` / `end_data` in the archive either. `Observatory._load_archive_sessions` reads both the new shape and the legacy nested `session: {sessions, ...}` shape (forward-compat for previously-written archives) and synthesises an `archive` field for legacy entries.

+## 4a. Report (JSON), `--output-report-json`
+
+A third optional derived output alongside the HTML report. Intended for CI dashboards, LLM-driven regression triage, and automated comparison tooling. Produced by `Observatory.export_report_json`.
+
+**Shape:**
+
+```jsonc
+{
+  "title": "...",
+  "generated_at": "...",
+  "archives": [                 // same grouping as HTML report
+    { "label": "default", "session_ids": ["default"] }
+  ],
+  "sessions": [                 // identity+timing only (no start_data/end_data)
+    { "id": "default", "name": "default", "archive": "default",
+      "start_ts": 1731.123, "end_ts": 1734.456 }
+  ],
+  "lenses": {                   // lens -> archive -> session_id -> lens dict
+    "accuracy": {
+      "default": {
+        "default": {
+          "records_measured": 5,
+          "metrics": {
+            "psnr":       { "mean": 38.5,  "min": 30.1,  "max": 42.0,
+                            "worst_record": "Quantized Model" },
+            "cosine_sim": { "mean": 0.998, "min": 0.991, "max": 0.999,
+                            "worst_record": "Quantized Model" },
+            "mse":        { "mean": 0.02,  "min": 0.005, "max": 0.05,
+                            "worst_record": "Quantized Model" }
+          }
+        }
+      }
+    },
+    "per_layer_accuracy": {
+      "default": {
+        "default": {
+          "anchor": "Exported Float",
+          "target": "Quantized Model",
+          "n_layers": 142,
+          "sample_source": "accuracy.worst[mse]",
+          "metric_ranges": { "psnr": [10.0, 50.0], "cosine_sim": [0.5, 1.0] },
+          "worst_layers": {
+            "psnr": [
+              { "layer": "layer_3", "psnr": 12.5, "cosine_sim": 0.85, "mse": 0.5 }
+              // top-N by metric, worst-first
+            ]
+          }
+        }
+      }
+    }
+  }
+}
+```
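
The three-level `lenses[lens][archive][session_id]` nesting shown above can be walked with three nested loops. The sketch below flattens every `metrics` entry into rows of the kind a dashboard time-series ingester might want. The helper names `iter_lens_payloads` and `metric_rows` are made up for this example and are not part of Observatory; the sample dict is a trimmed copy of the payload above.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_lens_payloads(report: Dict[str, Any]) -> Iterator[Tuple[str, str, str, Dict[str, Any]]]:
    """Yield (lens, archive, session_id, payload) for every populated leaf."""
    for lens, by_archive in report.get("lenses", {}).items():
        for archive, by_session in by_archive.items():
            for session_id, payload in by_session.items():
                yield lens, archive, session_id, payload


def metric_rows(report: Dict[str, Any]) -> List[Tuple[str, str, str, str, float]]:
    """Flatten each `metrics` entry into (lens, archive, session_id, metric, mean)."""
    rows = []
    for lens, archive, session_id, payload in iter_lens_payloads(report):
        # Lenses without a `metrics` dict (e.g. per_layer_accuracy) contribute no rows.
        for metric, stats in payload.get("metrics", {}).items():
            rows.append((lens, archive, session_id, metric, stats["mean"]))
    return rows


sample_report = {
    "lenses": {
        "accuracy": {"default": {"default": {
            "records_measured": 5,
            "metrics": {"psnr": {"mean": 38.5}, "mse": {"mean": 0.02}},
        }}},
        "per_layer_accuracy": {"default": {"default": {"n_layers": 142}}},
    }
}
rows = metric_rows(sample_report)
```

Because a lens that returns `None` leaves no key behind, the loops never need to special-case empty entries.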
+
+**`worst_record` semantics for `accuracy`:** the record whose metric value was most unfavorable. For quality metrics (psnr, cosine_sim, top_k) this is the record with the *minimum* value; for error metrics (mse, abs_err) this is the record with the *maximum* value.
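
The argmin/argmax rule above can be sketched in a few lines. This is an illustration of the documented semantics, not the code in `lenses/accuracy.py`; the function name, the two metric sets, and the choice to raise on unknown metrics are assumptions made for the example.

```python
from typing import Dict, Optional

QUALITY_METRICS = {"psnr", "cosine_sim", "top_k"}  # lower value = worse quality
ERROR_METRICS = {"mse", "abs_err"}                 # higher value = worse quality


def worst_record(metric: str, values_by_record: Dict[str, float]) -> Optional[str]:
    """Return the record name with the most unfavorable value for `metric`."""
    if not values_by_record:
        return None
    if metric in ERROR_METRICS:
        pick = max  # argmax: the largest error is the worst sample
    elif metric in QUALITY_METRICS:
        pick = min  # argmin: the lowest quality score is the worst sample
    else:
        raise ValueError(f"unknown metric direction for {metric!r}")
    return pick(values_by_record, key=values_by_record.get)


psnr_worst = worst_record("psnr", {"Exported Float": 42.0, "Quantized Model": 30.1})
mse_worst = worst_record("mse", {"Exported Float": 0.005, "Quantized Model": 0.05})
```

Both calls pick "Quantized Model": it has the lowest psnr and the highest mse, which is the case the pre-fix docstring described incorrectly for error metrics.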
+
+**`worst_layers` sort order for `per_layer_accuracy`:** psnr/cosine_sim sorted ascending (lower = worse); mse/abs_err sorted descending (higher = worse). Depth controlled by `config["per_layer_accuracy"]["json_report_top_n"]` (default 10).
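
A small sketch of the worst-first ordering described above, assuming each layer is a dict shaped like the `worst_layers` entries in the payload. The helper name and sample values are invented; only the sort directions, the config key, and the default of 10 come from the text.

```python
from typing import Any, Dict, List

ERROR_METRICS = {"mse", "abs_err"}  # sorted descending; psnr/cosine_sim sort ascending


def worst_layers(rows: List[Dict[str, Any]], metric: str, top_n: int = 10) -> List[Dict[str, Any]]:
    """Return the `top_n` worst layers for `metric`, worst first."""
    descending = metric in ERROR_METRICS
    return sorted(rows, key=lambda row: row[metric], reverse=descending)[:top_n]


layers = [
    {"layer": "layer_1", "psnr": 40.0, "mse": 0.01},
    {"layer": "layer_3", "psnr": 12.5, "mse": 0.5},
    {"layer": "layer_7", "psnr": 25.0, "mse": 0.1},
]
config = {"per_layer_accuracy": {"json_report_top_n": 2}}
top_n = config["per_layer_accuracy"]["json_report_top_n"]

by_psnr = [row["layer"] for row in worst_layers(layers, "psnr", top_n)]
by_mse = [row["layer"] for row in worst_layers(layers, "mse", top_n)]
```

With these sample values both orderings agree (`layer_3`, then `layer_7`), since the layer with the lowest psnr also has the highest mse.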
+
+**Lens hook:** lenses contribute by overriding `Frontend.json_report(session, session_records, analysis) -> Optional[Dict]`. Returning `None` opts out, so no ghost keys appear. See `devtools/observatory/lenses/accuracy.py` and `devtools/observatory/lenses/per_layer_accuracy.py` for reference implementations.
+
+**Distinguishing the three outputs:**
+
+| Output | Flag | What it contains | For whom |
+|---|---|---|---|
+| Archive | `--output-archive` | Raw records + sessions, lossless | Re-visualization, compare |
+| Report (HTML) | `--output-html` | Interactive HTML with graphs, lens panels | Human reviewers |
+| Report (JSON) | `--output-report-json` | Lens-summarised semantic dicts | CI, LLMs, dashboards |
+
 ## 5. Lens contract

 ```python

devtools/observatory/USAGE.md

Lines changed: 35 additions & 0 deletions
@@ -109,6 +109,41 @@ python -m executorch.backends.qualcomm.debugger.observatory visualize \
 This separates the persisted Archive from the rendered Report (HTML), which can
 be re-run on demand (e.g. comparing models between two history commits).

+## 4a. Report (JSON) for CI and LLM triage
+
+Add `--output-report-json PATH` to any collection or compare invocation to write a
+machine-readable derived summary alongside the HTML report:
+
+```bash
+python -m executorch.backends.qualcomm.debugger.observatory \
+    --output-archive artifacts/report.json \
+    --output-html artifacts/report.html \
+    --output-report-json artifacts/report.summary.json \
+    --lens-recipe accuracy --lens-recipe adb \
+    --archive qualcomm/swin_v2 \
+    my_export_script.py --output_dir artifacts/
+```
+
+The resulting `report.summary.json` contains:
+
+```jsonc
+{
+  "archives": [...],
+  "sessions": [{ "id", "name", "archive", "start_ts", "end_ts" }],
+  "lenses": {
+    "accuracy": { "<archive>": { "<session_id>": { "records_measured", "metrics": {...} } } },
+    "per_layer_accuracy": { "<archive>": { "<session_id>": { "anchor", "target", "worst_layers", ... } } }
+  }
+}
+```
+
+Unlike the Archive (which is raw and lossless), the Report (JSON) is derived and
+semantic, suited for CI threshold checks, LLM failure-analysis prompts, and
+dashboard time-series ingestion. Lenses that do not override `Frontend.json_report`
+contribute nothing; the key simply does not appear.
+
+See [RFC_SESSIONS.md §4a](RFC_SESSIONS.md) for the complete payload spec.
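
As one example of the CI threshold checks mentioned above, the sketch below flags any session whose mean psnr falls under a floor. The function name and the 40.0 threshold are hypothetical, and the payload is inlined so the snippet runs on its own; in a real job it would presumably be loaded from the file passed to `--output-report-json`.

```python
import json
from typing import Any, Dict, List


def find_psnr_violations(report: Dict[str, Any], min_mean_psnr: float) -> List[str]:
    """Return one message per (archive, session) whose mean psnr is below the floor."""
    violations = []
    accuracy = report.get("lenses", {}).get("accuracy", {})
    for archive, by_session in accuracy.items():
        for session_id, payload in by_session.items():
            psnr = payload.get("metrics", {}).get("psnr")
            if psnr is not None and psnr["mean"] < min_mean_psnr:
                violations.append(
                    f"{archive}/{session_id}: mean psnr {psnr['mean']} < {min_mean_psnr} "
                    f"(worst record: {psnr['worst_record']})"
                )
    return violations


# Inline stand-in for the contents of artifacts/report.summary.json.
report = json.loads("""
{
  "lenses": {
    "accuracy": {
      "default": {
        "default": {
          "records_measured": 5,
          "metrics": {
            "psnr": {"mean": 38.5, "min": 30.1, "max": 42.0,
                     "worst_record": "Quantized Model"}
          }
        }
      }
    }
  }
}
""")

violations = find_psnr_violations(report, min_mean_psnr=40.0)
```

A CI wrapper could print the messages and exit non-zero when the list is non-empty; the `worst_record` name gives a reviewer (or an LLM triage prompt) a starting point in the HTML report.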
+
 ## 5. Disabling Lenses via Config

 When using the Observatory Python API directly, pass a config dict to

devtools/observatory/lenses/accuracy.py

Lines changed: 4 additions & 2 deletions
@@ -747,8 +747,10 @@ def json_report(self, session, session_records, analysis) -> Optional[Dict[str,
         """Aggregate accuracy metrics across the session's records.

         Each numeric primary metric contributes a ``mean``, ``min``, ``max``,
-        and ``worst_record`` (the record name where the metric was lowest,
-        indicating the worst quality sample for that metric).
+        and ``worst_record``, the record name where the metric was most
+        unfavorable: minimum for quality metrics (psnr, cosine_sim, top_k)
+        where lower = worse quality, and maximum for error metrics (mse,
+        abs_err) where higher = worse quality.
         Internal ``_*`` keys, ``_min``/``_max`` per-sample stats, and
         ``_worst_idx`` indices are excluded.
         """
