observatory: RFC + docs for Report (JSON) (json_report 3/3)

quic-boyuc · claude · quic-boyuc · commit 54dfde32ab92 · 2026-05-18T18:27:46.000+08:00
RFC_SESSIONS.md: new §4a 'Report (JSON)' documents --output-report-json,
the three-level lenses[lens][archive][session_id] nesting, the accuracy
worst_record semantics (argmin for quality metrics, argmax for error
metrics), worst_layers sort direction for per_layer_accuracy, and a
side-by-side table contrasting Archive / Report (HTML) / Report (JSON).
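
A minimal sketch of how a CI job might consume the three-level nesting
described above. It is not code from this commit: the helper names
(iter_lens_payloads, psnr_failures) and the 35.0 dB threshold are
hypothetical, and the sample payload is trimmed from the RFC §4a shape.

```python
from typing import Any, Dict, Iterator, List, Tuple

Payload = Dict[str, Any]


def iter_lens_payloads(report: Payload) -> Iterator[Tuple[str, str, str, Payload]]:
    """Yield (lens, archive, session_id, payload) from report["lenses"]."""
    for lens, archives in report.get("lenses", {}).items():
        for archive, sessions in archives.items():
            for session_id, payload in sessions.items():
                yield lens, archive, session_id, payload


def psnr_failures(report: Payload, min_psnr: float) -> List[str]:
    """Describe every accuracy payload whose psnr minimum is below min_psnr."""
    failures = []
    for lens, archive, session_id, payload in iter_lens_payloads(report):
        if lens != "accuracy":
            continue
        psnr = payload.get("metrics", {}).get("psnr")
        if psnr is not None and psnr["min"] < min_psnr:
            failures.append(
                f"{archive}/{session_id}: psnr min {psnr['min']} "
                f"(worst: {psnr['worst_record']})"
            )
    return failures


# Sample payload (assumption: trimmed copy of the RFC §4a example).
report = {
    "lenses": {
        "accuracy": {
            "default": {
                "default": {
                    "records_measured": 5,
                    "metrics": {
                        "psnr": {"mean": 38.5, "min": 30.1, "max": 42.0,
                                 "worst_record": "Quantized Model"},
                    },
                }
            }
        }
    }
}

# 35.0 is an illustrative threshold, not a project default.
print(psnr_failures(report, min_psnr=35.0))
```

Lenses that return None from json_report leave no key behind, so the
walker simply never visits them.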

REFERENCE.md §1.3: adds a json_report row parallel to the existing
dashboard row, documenting the call structure, return contract, and
payload destination.

USAGE.md §4a: new 'Report (JSON) for CI and LLM triage' section with
a CLI example, a compact payload sketch, and a pointer to RFC §4a.

README.md: one-sentence mention with USAGE.md link added to step 4
of the workflow list.

accuracy.py: corrects the json_report docstring. worst_record for error
metrics (mse, abs_err) is the argmax (highest value = worst quality),
not the argmin.
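
The selection rule above can be sketched as follows. This is an
illustration only, not the code in accuracy.py: the two metric sets and
the worst_record helper are assumptions built from the metric names
listed in this commit.

```python
from typing import Dict

# Assumption: metric families as named in this commit message.
QUALITY_METRICS = {"psnr", "cosine_sim", "top_k"}  # lower value = worse
ERROR_METRICS = {"mse", "abs_err"}                 # higher value = worse


def worst_record(metric: str, values: Dict[str, float]) -> str:
    """Return the record name with the most unfavorable value for metric."""
    if metric in ERROR_METRICS:
        return max(values, key=values.__getitem__)  # argmax
    if metric in QUALITY_METRICS:
        return min(values, key=values.__getitem__)  # argmin
    raise ValueError(f"unknown metric direction: {metric}")


# Hypothetical per-record values keyed by record name.
psnr = {"Exported Float": 42.0, "Quantized Model": 30.1}
mse = {"Exported Float": 0.005, "Quantized Model": 0.05}

print(worst_record("psnr", psnr))
print(worst_record("mse", mse))
```

Both calls pick the same record here, but for opposite reasons: it has
the lowest psnr and the highest mse.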

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
diff --git a/devtools/observatory/README.md b/devtools/observatory/README.md
@@ -37,7 +37,7 @@ capture  -->  store  -->  analyze  -->  visualize  -->  share
 1. **Capture**: Observatory wraps your export script. Built-in lenses (e.g. `pipeline_graph_collector`) install monkey-patches that call `collect(...)` at each compilation stage; you can also call `Observatory.collect(...)` directly anywhere in your code.
 2. **Store**: Records + per-Session metadata are persisted as a single Archive (JSON) for later re-analysis or comparison.
 3. **Analyze**: Each lens processes the Archive into findings, comparisons, and derived insights.
-4. **Visualize**: Results are assembled into an interactive HTML report (Report (HTML)) with multiple view types.
+4. **Visualize**: Results are assembled into an interactive HTML report (Report (HTML)) with multiple view types. Use `--output-report-json` to also emit a Report (JSON) — a lens-summarised dict suitable for CI threshold checks, LLM-driven triage, and dashboard time-series ingestion (see [USAGE.md §4a](USAGE.md)).
 5. **Share**: The Report is a single self-contained HTML file. Send it, attach it to a bug report, or host it on GitHub Pages.
 
 ## What you get
diff --git a/devtools/observatory/REFERENCE.md b/devtools/observatory/REFERENCE.md
@@ -43,6 +43,7 @@ frontend rendering, and custom JS callbacks.
 | --- | --- | --- | --- |
 | `resources` | `resources() -> Dict[str, str]` | lens frontend implementation | `resources.js[]`, `resources.css[]` |
 | `dashboard` | `dashboard(session, session_records, analysis) -> Optional[ViewList]` | Framework calls once per `(Session, lens)` pair. `session` is the active `Session` (carries `id`, `name`, `archive`, `start_ts`, `end_ts`, per-lens `start_data` / `end_data`); `session_records` is `[r for r in records if r.session_id == session.id]`; `analysis` is the `AnalysisResult` for this lens. See [RFC_SESSIONS.md](RFC_SESSIONS.md) for the full contract. | `dashboard[lens][session_id]` |
+| `json_report` | `json_report(session, session_records, analysis) -> Optional[Dict[str, Any]]` | Same call structure as `dashboard`. Called once per `(Session, lens)` pair by `export_report_json`. Result lands at `report["lenses"][lens_name][archive_label][session_id]`. Return `None` to opt out (no ghost keys). Returned dict must be JSON-serialisable. See `lenses/accuracy.py` and `lenses/per_layer_accuracy.py` for reference implementations. | `lenses[lens][archive][session_id]` in Report (JSON) |
 | `record` | `record(digest, analysis, context) -> Optional[ViewList]` | `digest=record.data[lens]`, `analysis={"global": global_data, "record": per_record_data[name].data}`, `context={"index", "name"}` | `records[i].views[lens]` |
 | `check_badges` | `check_badges(digest, analysis) -> List[Dict[str, str]]` | current digest + `global_data` | `records[i].badges[]` |
 | `check_index_diffs` | `check_index_diffs(prev_digest, curr_digest, analysis) -> Dict[str, str]` | previous/current digest + `global_data` | `records[i].diff_index` |
diff --git a/devtools/observatory/RFC_SESSIONS.md b/devtools/observatory/RFC_SESSIONS.md
@@ -96,6 +96,74 @@ Archive
 
 There is no flat `start_data` / `end_data` in the archive either. `Observatory._load_archive_sessions` reads both the new shape and the legacy nested `session: {sessions, ...}` shape (forward-compat for previously-written archives) and synthesises an `archive` field for legacy entries.
 
+## 4a. Report (JSON), `--output-report-json`
+
+A third optional derived output alongside the HTML report. Intended for CI dashboards, LLM-driven regression triage, and automated comparison tooling. Produced by `Observatory.export_report_json`.
+
+**Shape:**
+
+```jsonc
+{
+  "title": "...",
+  "generated_at": "...",
+  "archives": [                              // same grouping as HTML report
+    { "label": "default", "session_ids": ["default"] }
+  ],
+  "sessions": [                              // identity+timing only (no start_data/end_data)
+    { "id": "default", "name": "default", "archive": "default",
+      "start_ts": 1731.123, "end_ts": 1734.456 }
+  ],
+  "lenses": {                                // lens -> archive -> session_id -> lens dict
+    "accuracy": {
+      "default": {
+        "default": {
+          "records_measured": 5,
+          "metrics": {
+            "psnr":      { "mean": 38.5,  "min": 30.1, "max": 42.0,
+                           "worst_record": "Quantized Model" },
+            "cosine_sim":{ "mean": 0.998, "min": 0.991, "max": 0.999,
+                           "worst_record": "Quantized Model" },
+            "mse":       { "mean": 0.02,  "min": 0.005, "max": 0.05,
+                           "worst_record": "Quantized Model" }
+          }
+        }
+      }
+    },
+    "per_layer_accuracy": {
+      "default": {
+        "default": {
+          "anchor": "Exported Float",
+          "target": "Quantized Model",
+          "n_layers": 142,
+          "sample_source": "accuracy.worst[mse]",
+          "metric_ranges": { "psnr": [10.0, 50.0], "cosine_sim": [0.5, 1.0] },
+          "worst_layers": {
+            "psnr": [
+              { "layer": "layer_3", "psnr": 12.5, "cosine_sim": 0.85, "mse": 0.5 }
+              // top-N by metric, worst-first
+            ]
+          }
+        }
+      }
+    }
+  }
+}
+```
+
+**`worst_record` semantics for `accuracy`:** the record whose metric value was most unfavorable. For quality metrics (psnr, cosine_sim, top_k) this is the record with the *minimum* value; for error metrics (mse, abs_err) this is the record with the *maximum* value.
+
+**`worst_layers` sort order for `per_layer_accuracy`:** psnr/cosine_sim sorted ascending (lower = worse); mse/abs_err sorted descending (higher = worse). Depth controlled by `config["per_layer_accuracy"]["json_report_top_n"]` (default 10).
+
+**Lens hook:** lenses contribute by overriding `Frontend.json_report(session, session_records, analysis) -> Optional[Dict]`. Returning `None` opts out — no ghost keys appear. See `devtools/observatory/lenses/accuracy.py` and `devtools/observatory/lenses/per_layer_accuracy.py` for reference implementations.
+
+**Distinguishing the three outputs:**
+
+| Output | Flag | What it contains | For whom |
+|---|---|---|---|
+| Archive | `--output-archive` | Raw records + sessions, lossless | Re-visualization, compare |
+| Report (HTML) | `--output-html` | Interactive HTML with graphs, lens panels | Human reviewers |
+| Report (JSON) | `--output-report-json` | Lens-summarised semantic dicts | CI, LLMs, dashboards |
+
 ## 5. Lens contract
 
 ```python
diff --git a/devtools/observatory/USAGE.md b/devtools/observatory/USAGE.md
@@ -109,6 +109,41 @@ python -m executorch.backends.qualcomm.debugger.observatory visualize \
 This separates the persisted Archive from the rendered Report (HTML), which can
 be re-run on demand (e.g. comparing models between two history commits).
 
+## 4a. Report (JSON) for CI and LLM triage
+
+Add `--output-report-json PATH` to any collection or compare invocation to write a
+machine-readable derived summary alongside the HTML report:
+
+```bash
+python -m executorch.backends.qualcomm.debugger.observatory \
+    --output-archive artifacts/report.json \
+    --output-html artifacts/report.html \
+    --output-report-json artifacts/report.summary.json \
+    --lens-recipe accuracy --lens-recipe adb \
+    --archive qualcomm/swin_v2 \
+    my_export_script.py --output_dir artifacts/
+```
+
+The resulting `report.summary.json` contains:
+
+```jsonc
+{
+  "archives": [...],
+  "sessions": [{ "id", "name", "archive", "start_ts", "end_ts" }],
+  "lenses": {
+    "accuracy": { "<archive>": { "<session_id>": { "records_measured", "metrics": {...} } } },
+    "per_layer_accuracy": { "<archive>": { "<session_id>": { "anchor", "target", "worst_layers", ... } } }
+  }
+}
+```
+
+Unlike the Archive (which is raw and lossless), the Report (JSON) is derived and
+semantic — suited for CI threshold checks, LLM failure-analysis prompts, and
+dashboard time-series ingestion. Lenses that do not override `Frontend.json_report`
+contribute nothing; the key simply does not appear.
+
+See [RFC_SESSIONS.md §4a](RFC_SESSIONS.md) for the complete payload spec.
+
 ## 5. Disabling Lenses via Config
 
 When using the Observatory Python API directly, pass a config dict to
diff --git a/devtools/observatory/lenses/accuracy.py b/devtools/observatory/lenses/accuracy.py
@@ -747,8 +747,10 @@ def json_report(self, session, session_records, analysis) -> Optional[Dict[str,
         """Aggregate accuracy metrics across the session's records.
 
         Each numeric primary metric contributes a ``mean``, ``min``, ``max``,
-        and ``worst_record`` (the record name where the metric was lowest,
-        indicating the worst quality sample for that metric).
+        and ``worst_record`` — the record name where the metric was most
+        unfavorable: minimum for quality metrics (psnr, cosine_sim, top_k)
+        where lower = worse quality, and maximum for error metrics (mse,
+        abs_err) where higher = worse quality.
         Internal ``_*`` keys, ``_min``/``_max`` per-sample stats, and
         ``_worst_idx`` indices are excluded.
         """