
Commit 96992d6

brunoborges and Copilot committed

Rewrite benchmark README: CI-first with detailed methodology

Restructured to lead with the GitHub Actions CI benchmark, explaining why CI cold-start measurements matter more than local benchmarks. Detailed explanation of the three-job workflow design and why Java AOT wins in CI environments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

1 parent 253bcc7 commit 96992d6

File tree

1 file changed: +48 -17 lines

html-generators/benchmark/README.md

Lines changed: 48 additions & 17 deletions
@@ -2,7 +2,44 @@
 
 Performance comparison of execution methods for the HTML generator, measured on 95 snippets across 10 categories.
 
-## Phase 1: Training / Build Cost (one-time)
+## CI Benchmark (GitHub Actions)
+
+[![Benchmark Generator](https://github.com/javaevolved/javaevolved.github.io/actions/workflows/benchmark.yml/badge.svg)](https://github.com/javaevolved/javaevolved.github.io/actions/workflows/benchmark.yml)
+
+The most important benchmark runs on GitHub Actions because it measures performance in the environment where the generator actually executes — CI. The [Benchmark Generator](https://github.com/javaevolved/javaevolved.github.io/actions/workflows/benchmark.yml) workflow is manually triggered and runs across **Ubuntu**, **Windows**, and **macOS**.
+
+### Why CI benchmarks matter
+
+On a developer machine, repeated runs benefit from warm OS file caches — the operating system keeps recently read files in RAM, making subsequent reads nearly instant. This masks real-world performance differences. Python also benefits from `__pycache__/` bytecode that persists between runs.
+
+In CI, **every workflow run starts on a fresh runner**. There is no `__pycache__/`, no warm OS cache, no JBang compilation cache. This is the environment where the deploy workflow runs, so these numbers reflect actual production performance.
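Python's bytecode cache is easy to observe directly. A minimal sketch against a scratch module (nothing here touches the generator's real sources):

```shell
# Create a scratch module and compile it the way CPython does on first import.
workdir=$(mktemp -d)
cat > "$workdir/hello.py" <<'EOF'
def greet():
    return "hi"
EOF

# compileall performs the same bytecode compilation a first import would trigger.
python3 -m compileall -q "$workdir"
ls "$workdir/__pycache__"    # hello.cpython-*.pyc

# A fresh CI runner is equivalent to this cache never existing:
# rm -rf "$workdir/__pycache__"
```

On a developer machine that `__pycache__/` directory survives between runs; on a fresh runner it must be rebuilt from scratch.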
+
+### How the CI benchmark works
+
+The workflow has three jobs:
+
+1. **`benchmark`** — Runs Phase 1 (training/build costs) and Phase 2 (steady-state execution) on each OS. All tools are installed in the same job, so this measures raw execution speed after setup.
+
+2. **`build-jar`** — Builds the fat JAR and AOT cache on each OS, then uploads them as workflow artifacts. This simulates what the `build-generator.yml` workflow does weekly: produce the JAR and AOT cache and store them in the GitHub Actions cache.
+
+3. **`ci-cold-start`** — The key benchmark. Runs on a **completely fresh runner** that has never executed Java or Python in the current job. It downloads the JAR and AOT artifacts (simulating the `actions/cache/restore` step in the deploy workflow), then measures a single cold run of each method. This is the closest simulation of what happens when the deploy workflow runs:
+   - **Python** has no `__pycache__/` — it must interpret every `.py` file from scratch
+   - **Fat JAR** must load and link all classes on a cold JVM
+   - **Fat JAR + AOT** loads pre-linked classes from the `.aot` file, skipping class loading entirely
+
+The `setup-java` and `setup-python` actions are required to provide the runtimes, but they don't warm up the generator code. The first invocation of `java` or `python3` in this job is the benchmark measurement itself.
+
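The three-job layout can be pictured as a workflow fragment. This is an illustrative sketch only, not the real `.github/workflows/benchmark.yml`; the artifact name and the Python entry point below are hypothetical:

```yaml
# Sketch of the ci-cold-start job (hypothetical details; see benchmark.yml).
ci-cold-start:
  needs: build-jar
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-java@v4
      with: { distribution: temurin, java-version: '24' }   # JEP 483 needs JDK 24+
    - uses: actions/setup-python@v5
      with: { python-version: '3.14' }
    - uses: actions/download-artifact@v4      # stands in for actions/cache/restore
      with: { name: generator-jar }           # hypothetical artifact name
    - name: Cold run (first java/python3 invocation in this job)
      run: |
        time java -XX:AOTCache=html-generators/generate.aot -jar html-generators/generate.jar
        time python3 html-generators/generate.py    # hypothetical entry point
```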
+### Why Java AOT wins in CI
+
+Java's AOT cache (JEP 483) snapshots the result of class loading and linking from a training run into a `.aot` file. This file is platform-specific and ~21 MB. When restored from the actions cache, the JVM skips the expensive class discovery, verification, and linking steps that normally happen on first run.
+
+Python's `__pycache__/` serves a similar purpose — it caches compiled bytecode so Python doesn't re-parse `.py` files. But `__pycache__/` is not committed to git or stored in CI caches, so **Python always pays full interpretation cost in CI**. Java AOT, by contrast, is stored in the actions cache and restored before each deploy.
+
+## Local Benchmark
+
+The local benchmark script runs all three phases on your development machine. Local results will differ from CI because of OS file caching and warm `__pycache__/`.
+
+### Phase 1: Training / Build Cost (one-time)
 
 These are one-time setup costs, comparable across languages.
 
@@ -12,7 +49,7 @@ These are one-time setup costs, comparable across languages.
 | JBang export | 2.19s | Compiles source + bundles dependencies into fat JAR |
 | AOT training run | 2.92s | Runs JAR once to record class loading, produces `.aot` cache |
 
-## Phase 2: Steady-State Execution (avg of 5 runs)
+### Phase 2: Steady-State Execution (avg of 5 runs)
 
 After one-time setup, these are the per-run execution times.
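A multi-run average like this can be taken with a plain shell loop. This sketch times `python3 -c pass` as a stand-in for the real generator command (the timer subprocesses add a little overhead of their own):

```shell
# Average wall-clock time over 5 runs of a command (stand-in workload shown).
runs=5
total_ms=0
for i in $(seq "$runs"); do
  start=$(python3 -c 'import time; print(time.time_ns())')
  python3 -c 'pass'    # stand-in; substitute the method being measured
  end=$(python3 -c 'import time; print(time.time_ns())')
  total_ms=$(( total_ms + (end - start) / 1000000 ))
done
echo "avg over $runs runs: $(( total_ms / runs )) ms"
```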

@@ -23,11 +60,9 @@ After one-time setup, these are the per-run execution times.
 | **JBang** | 1.08s | Includes JBang launcher overhead |
 | **Python** | 1.26s | Uses cached `__pycache__` bytecode |
 
-## Phase 3: CI Cold Start (fresh runner, no caches)
+### Phase 3: CI Cold Start (simulated locally)
 
-Simulates a CI environment where every run is the first run.
-Python has no `__pycache__`, JBang has no compilation cache.
-Java AOT benefits from the pre-built `.aot` file restored from actions cache.
+Clears `__pycache__/` and JBang cache, then measures a single run. On a local machine the OS file cache still helps, so these numbers are faster than true CI.
 
 | Method | Time | Notes |
 |--------|------|-------|
@@ -36,14 +71,14 @@ Java AOT benefits from the pre-built `.aot` file restored from actions cache.
 | **JBang** | 3.25s | Must compile source before running |
 | **Python** | 0.16s | No `__pycache__`; full interpretation |
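Locally, the cold-start simulation amounts to deleting these caches before timing a run. A sketch against a scratch directory; `jbang cache clear` is JBang's own subcommand for dropping its compilation cache:

```shell
# Simulate the cache-clearing step against a scratch tree (not the real repo).
project=$(mktemp -d)
mkdir -p "$project/pkg/__pycache__"
touch "$project/pkg/__pycache__/mod.cpython-314.pyc"

# Drop every Python bytecode cache under the tree.
find "$project" -type d -name '__pycache__' -exec rm -rf {} +

# JBang's compilation cache is cleared separately:
# jbang cache clear
```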

-## How It Works
+### How each method works
 
-- **Python** caches compiled bytecode in `__pycache__/` after the first run, similar to how Java's AOT cache works.
-- **Java AOT** (JEP 483) snapshots ~3,300 pre-loaded classes from a training run into a `.aot` file, eliminating class loading overhead on subsequent runs.
+- **Python** caches compiled bytecode in `__pycache__/` after the first run, similar to how Java's AOT cache works. But this cache is local-only and not available in CI.
+- **Java AOT** (JEP 483) snapshots ~3,300 pre-loaded classes from a training run into a `.aot` file, eliminating class loading overhead on subsequent runs. The `.aot` file is stored in the GitHub Actions cache.
 - **JBang** compiles and caches internally but adds launcher overhead on every invocation.
 - **Fat JAR** (`java -jar`) loads and links all classes from scratch each time.
 
-## AOT Cache Setup
+### AOT Cache Setup
 
 ```bash
 # One-time: build the fat JAR
@@ -56,7 +91,7 @@ java -XX:AOTCacheOutput=html-generators/generate.aot -jar html-generators/genera
 java -XX:AOTCache=html-generators/generate.aot -jar html-generators/generate.jar
 ```
 
-## Environment
+### Environment
 
 | | |
 |---|---|
@@ -67,13 +102,9 @@ java -XX:AOTCache=html-generators/generate.aot -jar html-generators/generate.jar
 | **Python** | 3.14.3 |
 | **OS** | Darwin |
 
-## Reproduce
+### Reproduce
 
 ```bash
 ./html-generators/benchmark/run.sh           # print results to stdout
-./html-generators/benchmark/run.sh --update  # also update this file
+./html-generators/benchmark/run.sh --update  # also update local results in this file
 ```
-
-### CI Benchmark
-
-The [Benchmark Generator](https://github.com/javaevolved/javaevolved.github.io/actions/workflows/benchmark.yml) workflow runs cross-platform benchmarks (Ubuntu, Windows, macOS) on GitHub Actions. It includes a CI cold-start phase on a fresh runner to measure true first-run performance. Trigger it manually from the Actions tab.
