413 changes: 144 additions & 269 deletions README.md

File renamed without changes
105 changes: 105 additions & 0 deletions assets/case_study/README.md
@@ -0,0 +1,105 @@
# Case Studies

Side-by-side animations of how the seed `init_program.py` evolves into the released best result for six tasks. Seeds are weak baselines (grids, random noise, textbook implementations); the released versions live under [`best_results/`](../../best_results).

Jump to: [Circle Packing](#1-circle-packing) · [Hadamard 29](#2-hadamard-maximum-determinant-n--29) · [Erdős](#3-erdős-minimum-overlap) · [LASSO Path](#4-lasso-regularisation-path) · [TriMul](#5-trimul-gpu-kernel) · [scRNA-seq Denoising](#6-single-cell-rna-seq-denoising)

---

## 1. Circle Packing

Domain: combinatorial construction. Task family: [`circle_packing`](../../datasets/circle_packing).

<p align="center">
<img src="circle_packing.gif" alt="Circle packing evolution — n=26 and n=32" width="720">
</p>

Pack `n` non-overlapping circles in the unit square to maximise the sum of radii. Released for `n = 26` and `n = 32`.

- **Seed**: places equal-radius circles on a uniform grid, then shrinks each radius to fit the minimum centre-to-centre distance.
- **Evolved**: LP feasibility check over a pre-computed pair-constraint matrix, then `scipy.optimize.differential_evolution` over the placement, with `cvxpy` polishing radii at each candidate.
- **Result**: matches or exceeds public baselines on both n. Side-by-side plot in [`best_results/combinatorial_construction/circle_packing_in_a_unit_square_n26/`](../../best_results/combinatorial_construction/circle_packing_in_a_unit_square_n26).
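
The seed strategy and the non-overlap constraint can be sketched in a few lines. This is a minimal illustration, not the repository's `init_program.py`: the names `grid_seed` and `is_valid_packing` are hypothetical, and the common radius is assumed to be half the closest centre distance, capped by the distance to the nearest wall.

```python
import math
from itertools import combinations


def grid_seed(n: int) -> list[tuple[float, float, float]]:
    """Place n equal-radius circles on a uniform grid inside the unit square."""
    side = math.ceil(math.sqrt(n))  # smallest square grid with at least n cells
    step = 1.0 / side
    centres = [
        ((col + 0.5) * step, (row + 0.5) * step)
        for row in range(side)
        for col in range(side)
    ][:n]
    # Largest common radius: half the closest centre distance, capped by the walls.
    min_dist = min(
        (math.dist(p, q) for p, q in combinations(centres, 2)), default=math.inf
    )
    wall = min(min(x, 1 - x, y, 1 - y) for x, y in centres)
    r = min(min_dist / 2, wall)
    return [(x, y, r) for x, y in centres]


def is_valid_packing(circles, tol: float = 1e-9) -> bool:
    """Check that every circle lies inside the square and no pair overlaps."""
    for x, y, r in circles:
        if r <= 0 or x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return False
    return all(
        math.dist((x1, y1), (x2, y2)) >= r1 + r2 - tol
        for (x1, y1, r1), (x2, y2, r2) in combinations(circles, 2)
    )


circles = grid_seed(26)
score = sum(r for _, _, r in circles)  # the objective: sum of radii
```

For `n = 26` this falls back to a 6 × 6 grid with ten empty cells, which is exactly the kind of slack an evolved placement can exploit.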

---

## 2. Hadamard Maximum Determinant (n = 29)

Domain: combinatorial construction. Task family: [`hadamard_maximal_det`](../../datasets/hadamard_maximal_det).

<p align="center">
<img src="hadamard_29.gif" alt="Hadamard ±1 matrix — baseline vs SimpleTES" width="720">
</p>

Find a 29 × 29 ±1 matrix maximising `|det(H)|`. Score is `|det(H)| / 29^(29/2)`.

- **Seed**: ad-hoc ±1 matrix with exact integer determinant via Bareiss. No structure.
- **Evolved**: warm-starts from a Paley (quadratic-residue circulant) construction, then refines via local sign flips guided by `logabs_det`.
- **Result**: visible in the GIF — baseline has banded structure, SimpleTES matrix is the noise-like high-determinant pattern.
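
The score normalises by Hadamard's bound `n^(n/2)`, so a true Hadamard matrix would score exactly 1. Below is a small sketch of the exact-integer scoring path using Bareiss fraction-free elimination. The function names `bareiss_det` and `hadamard_score` are hypothetical, and the 4 × 4 Sylvester matrix is only a toy input, not the released `n = 29` result.

```python
def bareiss_det(matrix: list[list[int]]) -> int:
    """Exact determinant of an integer matrix via Bareiss elimination."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    sign, prev = 1, 1
    for k in range(n - 1):
        if a[k][k] == 0:
            # Swap in a row with a non-zero pivot; a zero column means det = 0.
            swap = next((i for i in range(k + 1, n) if a[i][k] != 0), None)
            if swap is None:
                return 0
            a[k], a[swap] = a[swap], a[k]
            sign = -sign
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                # The division by the previous pivot is always exact.
                a[i][j] = (a[i][j] * a[k][k] - a[i][k] * a[k][j]) // prev
        prev = a[k][k]
    return sign * a[n - 1][n - 1]


def hadamard_score(matrix: list[list[int]]) -> float:
    """|det(H)| divided by Hadamard's bound n^(n/2)."""
    n = len(matrix)
    if any(len(row) != n or any(v not in (-1, 1) for v in row) for row in matrix):
        raise ValueError("expected a square matrix with entries in {-1, +1}")
    return abs(bareiss_det(matrix)) / n ** (n / 2)


sylvester_4 = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]
toy_score = hadamard_score(sylvester_4)
```

Python integers are arbitrary precision, so the determinant stays exact at `n = 29`; only the final division is done in floating point.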

---

## 3. Erdős Minimum Overlap

Domain: mathematics — extremal analysis. Task family: [`erdos`](../../datasets/erdos).

<p align="center">
<img src="erdos.gif" alt="Erdős minimum overlap step function and overlap profile" width="720">
</p>

Find a step function `h: [0, 2] → [0, 1]` with `∑h = n/2` that minimises `Ψ(h) = max_s ∫ h(x)·(1 − h(x+s)) dx`.

- **Seed**: `h ≡ 0.5` plus zero-mean random noise in `[-0.4, 0.4]`.
- **Evolved**: seven-stage pipeline — warm-start from Paley, stochastic donor-receiver swaps, Adam on a smooth-max surrogate, guided swaps at the worst shift, binary rounding, binary best-swap, simulated annealing.
- **Result**: `Ψ(h) = 0.3808676758`.
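
To make the objective concrete, here is a direct discretisation of `Ψ(h)` for a step function given as `n` equal-width values on `[0, 2]`. It is a sketch under one assumption: pairs whose shifted index falls outside the grid are dropped. The repository's evaluator may treat the boundary differently, and `overlap_psi` is a hypothetical name.

```python
def overlap_psi(h: list[float]) -> float:
    """Max over integer shifts k of sum_i h[i] * (1 - h[i + k]) * dx."""
    n = len(h)
    dx = 2.0 / n  # width of each step on [0, 2]
    best = 0.0
    for k in range(-(n - 1), n):
        total = 0.0
        # Keep only indices where both i and i + k are on the grid.
        for i in range(max(0, -k), min(n, n - k)):
            total += h[i] * (1.0 - h[i + k])
        best = max(best, total * dx)
    return best


flat = overlap_psi([0.5] * 8)               # constant seed without noise
block = overlap_psi([1.0, 1.0, 0.0, 0.0])   # indicator of the left half
```

The constant function gives 0.5 and the half-interval indicator gives 1.0, so the released value of roughly 0.3809 sits well below both naive choices. This loop is quadratic in `n`; a correlation via FFT would be the usual choice for large grids.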

---

## 4. LASSO Regularisation Path

Domain: algorithm engineering. Task family: [`numerical_tasks`](../../datasets/numerical_tasks).

<p align="center">
<img src="lasso_path.gif" alt="LASSO regularisation path — 2.07× faster than glmnet" width="720">
</p>

Solve the full path `min (1/(2n))·‖y − Xw‖² + λ·‖w‖₁` over a decreasing λ schedule, matching `sklearn.linear_model.lasso_path` within `1e-6` in float64. Score is `1 / geomean(wall_time)`.

- **Seed**: textbook C++ coordinate descent with `Eigen`. Single soft-threshold, naïve outer loop, no parallelism.
- **Evolved**: tuned CD with OpenMP, hot/cold variable partitioning across the λ schedule, cache-resident residual updates.
- **Result**: **2.07× faster than `glmnet`** at matched precision.
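
For readers unfamiliar with the baseline, textbook coordinate descent with a single soft-threshold and warm starts along the path looks roughly like this. It is a pure-Python sketch of the technique for the objective `(1/(2n))·‖y − Xw‖² + λ·‖w‖₁` (the scaling scikit-learn uses), not the C++/Eigen seed itself; `soft_threshold`, `lasso_cd`, and `cd_path` are hypothetical names.

```python
def soft_threshold(z: float, t: float) -> float:
    """Proximal operator of t * |w|."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, w=None, n_sweeps=200, tol=1e-10):
    """Cyclic coordinate descent for one value of lambda."""
    n, p = len(X), len(X[0])
    w = [0.0] * p if w is None else list(w)
    # Residual r = y - Xw, kept up to date after every coordinate change.
    r = [y[i] - sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        max_delta = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column never enters the model
            rho = sum(X[i][j] * r[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_wj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    r[i] -= X[i][j] * delta
                w[j] = new_wj
            max_delta = max(max_delta, abs(delta))
        if max_delta < tol:
            break
    return w


def cd_path(X, y, lambdas):
    """Solve from the largest lambda down, warm-starting each solve."""
    path, w = [], None
    for lam in sorted(lambdas, reverse=True):
        w = lasso_cd(X, y, lam, w)
        path.append((lam, w))
    return path


X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.5]
path = cd_path(X, y, [0.1, 0.5, 2.0])
```

Warm-starting from the previous solution is what makes the decreasing λ schedule cheap; the evolved version's partitioning and cache tricks build on the same inner update.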

---

## 5. TriMul GPU Kernel

Domain: GPU kernel optimization. Task family: [`gpumode`](../../datasets/gpumode).

<p align="center">
<img src="trimul.gif" alt="TriMul kernel optimisation — 8.3 ms to 1.02 ms" width="900">
</p>

Implement the TriMul block (the triangle multiplicative update, with gating and layernorm) matching the PyTorch reference within `2e-2`, minimising H100 latency.

- **Seed**: `torch.nn` with `einsum` and `nn.Linear`. Reference semantics, no GPU tuning.
- **Evolved**: hand-written Triton in four stages — FP16 compute / FP32 accumulate, concat-weight single GEMM, fused layernorm + gate + projection, full autotune with adaptive `num_warps`.
- **Result**: **8.309 ms → 1.020 ms** on H100 (~8×).

---

## 6. Single-cell RNA-seq Denoising

Domain: data science. Task family: [`open_problems_bio`](../../datasets/open_problems_bio).

<p align="center">
<img src="rna_seq_denoising.gif" alt="scRNA-seq denoising — train → denoise → test" width="900">
</p>

Given a sparse UMI count matrix `X_train` (cells × genes), produce a denoised `X̂` that minimises reconstruction error on a held-out `X_test` from the same pancreas dataset.

- **Seed**: stock MAGIC — k-NN graph, `t` diffusion steps on the graph operator.
- **Evolved**: truncated SVD + NMF for the low-rank backbone, `NearestNeighbors` for local smoothing, optional MAGIC pass behind a flag. Components combined by weights tuned on the held-out loss.
- **Result**: improves on the MAGIC baseline on the bundled pancreas split; matches the released paper best.

---

Each evolved program is in [`best_results/<domain>/<task>_best.py`](../../best_results). Each seed is in [`datasets/<family>/<subtask>/init_program.py`](../../datasets). To reproduce, run `main.py` on the same seed — see the [top-level Quickstart](../../README.md#installation--quickstart).
Binary file added assets/case_study/circle_packing.gif
Binary file added assets/case_study/circle_packing.png
Binary file added assets/case_study/erdos.gif
Binary file added assets/case_study/erdos.png
Binary file added assets/case_study/hadamard_29.gif
Binary file added assets/case_study/hadamard_29.png
Binary file added assets/case_study/lasso_path.gif
Binary file added assets/case_study/lasso_path.png
Binary file added assets/case_study/rna_seq_denoising.gif
Binary file added assets/case_study/rna_seq_denoising.png
Binary file added assets/case_study/trimul.gif
Binary file added assets/case_study/trimul.png
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
273 changes: 273 additions & 0 deletions datasets/README.md
@@ -0,0 +1,273 @@
# SimpleTES Tasks

21 tasks across 6 domains, aligned with the artifacts in [`best_results/`](../best_results). The launcher auto-discovers anything at `datasets/<family>/<subtask>/init_program.{py,cpp,rs,...}`.

- [Catalogue](#catalogue) — what's here
- [Designing a task](#designing-a-task) — the file contract + worked example

---

## Catalogue

<table>
<thead>
<tr>
<th align="left">Domain</th>
<th align="left">Family</th>
<th align="center">Subtasks</th>
<th align="center">Language</th>
<th align="center">Setup</th>
</tr>
</thead>
<tbody>

<tr><td rowspan="2"><b>🪐 Quantum circuit compilation</b></td>
<td><a href="qubit_routing"><code>qubit_routing</code></a></td>
<td align="center">1</td><td align="center">Rust</td>
<td align="center"><sub>Rust toolchain</sub></td></tr>
<tr><td><a href="znaa"><code>znaa</code></a></td>
<td align="center">1</td><td align="center">Python</td>
<td align="center"><sub>family venv</sub></td></tr>

<tr><td rowspan="2"><b>⚡ GPU kernel optimization</b></td>
<td><a href="gpumode"><code>gpumode</code></a></td>
<td align="center">1</td><td align="center">CUDA / Torch</td>
<td align="center"><sub>GPU + server</sub></td></tr>
<tr><td><a href="kernelbench"><code>kernelbench</code></a></td>
<td align="center">2</td><td align="center">Triton</td>
<td align="center"><sub>GPU + server</sub></td></tr>

<tr><td rowspan="2"><b>🧮 Algorithm engineering</b></td>
<td><a href="ahc"><code>ahc</code></a></td>
<td align="center">2</td><td align="center">C++ in Docker</td>
<td align="center"><sub>Docker</sub></td></tr>
<tr><td><a href="numerical_tasks"><code>numerical_tasks</code></a></td>
<td align="center">1</td><td align="center">C++ via Python</td>
<td align="center"><sub><code>g++</code> + Eigen</sub></td></tr>

<tr><td rowspan="2"><b>📐 Mathematics — extremal analysis</b></td>
<td><a href="erdos"><code>erdos</code></a></td>
<td align="center">1</td><td align="center">Python</td>
<td align="center"><sub>none</sub></td></tr>
<tr><td><a href="autocorrelation"><code>autocorrelation</code></a></td>
<td align="center">3</td><td align="center">Python</td>
<td align="center"><sub>none</sub></td></tr>

<tr><td rowspan="3"><b>🧩 Combinatorial construction</b></td>
<td><a href="circle_packing"><code>circle_packing</code></a></td>
<td align="center">2</td><td align="center">Python</td>
<td align="center"><sub>none</sub></td></tr>
<tr><td><a href="hadamard_maximal_det"><code>hadamard_maximal_det</code></a></td>
<td align="center">1</td><td align="center">Python</td>
<td align="center"><sub>none</sub></td></tr>
<tr><td><a href="sums_diffs"><code>sums_diffs</code></a></td>
<td align="center">1</td><td align="center">Python</td>
<td align="center"><sub>none</sub></td></tr>

<tr><td rowspan="2"><b>🧬 Data science</b></td>
<td><a href="scaling_law"><code>scaling_law</code></a></td>
<td align="center">4</td><td align="center">Python</td>
<td align="center"><sub>HuggingFace cache</sub></td></tr>
<tr><td><a href="open_problems_bio"><code>open_problems_bio</code></a></td>
<td align="center">1</td><td align="center">Python</td>
<td align="center"><sub>bundled venv + dataset</sub></td></tr>

</tbody>
</table>

First-time pick: any `Setup: none` row. For setup-heavy families:

```bash
uv run python scripts/prepare_task.py --list
uv run python scripts/prepare_task.py --check
uv run python scripts/prepare_task.py --task scaling_law
```

Each family has its own `README.md` with task-specific assumptions and a run command.

---

## Designing a Task

One task = one directory with three files: the instruction states the constraints, the seed implements the entry function, the evaluator checks constraints and scores.

### Layout

```text
datasets/<family>/<task>/
├── init_program.{py|cpp|rs|...}   required: seed, with EVOLVE-BLOCK markers
├── evaluator.py                   required: scores a candidate
└── <task>.txt                     required: problem statement for the LLM

datasets/<family>/
├── requirements.txt               optional: packages allowed in the evolved code
├── pyproject.toml + venv/         optional: family-local Python env (auto-detected)
├── data_manifest.json             optional: data this family needs
└── README.md                      optional: family-level notes
```
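
The discovery pattern and the three-file contract can be checked with a short script. This is a sketch that mirrors the layout above, not the launcher's actual discovery code; `discover_tasks` and `missing_files` are hypothetical helpers.

```python
from pathlib import Path


def discover_tasks(root: str = "datasets") -> list[Path]:
    """Return every seed matching <root>/<family>/<subtask>/init_program.*"""
    return sorted(p for p in Path(root).glob("*/*/init_program.*") if p.is_file())


def missing_files(seed: Path) -> list[str]:
    """List the required companions of a seed that are absent from its directory."""
    task_dir = seed.parent
    required = ["evaluator.py", f"{task_dir.name}.txt"]
    return [name for name in required if not (task_dir / name).is_file()]


if __name__ == "__main__":
    for seed in discover_tasks():
        gaps = missing_files(seed)
        status = "ok" if not gaps else "missing " + ", ".join(gaps)
        print(f"{seed.parent}: {status}")
```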

### 1. Seed program

The region between `EVOLVE-BLOCK-START` and `EVOLVE-BLOCK-END` is the part that gets evolved. Everything else is the fixed harness — imports, the entry function, anything the evaluator depends on.

```python
# EVOLVE-BLOCK-START
import numpy as np

def construct_circles():
    ...  # ← the LLM edits this
# EVOLVE-BLOCK-END


def run_code():  # fixed entry point the evaluator calls
    return construct_circles()
```

Requirements: markers paired; entry function in the fixed region; the seed runs and earns a finite, non-zero score. C++ / Rust seeds use the same markers via the host language's comment syntax.
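
The first two requirements are mechanical and easy to lint before a run. A minimal sketch for Python seeds follows; `check_seed` is a hypothetical helper, and it assumes the entry function is a top-level `def run_code(...)`.

```python
START, END = "EVOLVE-BLOCK-START", "EVOLVE-BLOCK-END"


def check_seed(source: str, entry: str = "run_code") -> list[str]:
    """Return a list of problems; an empty list means the seed looks well-formed."""
    problems = []
    inside = False          # are we currently between START and END?
    blocks = 0
    entry_in_fixed = False  # was the entry function seen outside every block?
    for lineno, line in enumerate(source.splitlines(), start=1):
        if START in line:
            if inside:
                problems.append(f"line {lineno}: nested {START}")
            inside = True
            blocks += 1
        elif END in line:
            if not inside:
                problems.append(f"line {lineno}: {END} without a start")
            inside = False
        elif line.startswith(f"def {entry}(") and not inside:
            entry_in_fixed = True
    if inside:
        problems.append(f"{START} is never closed")
    if blocks == 0:
        problems.append("no evolve block found")
    if not entry_in_fixed:
        problems.append(f"{entry}() is not defined in the fixed region")
    return problems
```

The third requirement (a finite, non-zero score) still needs a real run of the evaluator on the seed.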

### 2. Evaluator

Runs the candidate in an isolated subprocess, checks constraints, recomputes the score. The framework reads only `combined_score` from the returned dict.

Six structural parts. Only parts 3 and 4 are task-specific; copy an existing evaluator and edit those two.

| # | Part | Per-task? |
|---|------|-----------|
| 1 | Configuration: `TIMEOUT_SECONDS`, concurrency, memory fraction. | shared |
| 2 | Exception classes: `EvaluatorTimeoutError`, `MemoryLimitExceededError`. | shared |
| 3 | `validate_solution(sol)` — every hard constraint with tolerances. | **per task** |
| 4 | `compute_score(sol)` — recompute the score from the solution. | **per task** |
| 5 | `run_with_timeout(path, …)` — subprocess + timeout + memory cap + BLAS thread cap. | shared |
| 6 | `evaluate(path)` — entry; must never raise. | shared |

```python
def evaluate(path: str) -> dict:
    # success
    return {"combined_score": 12.34, "validity": 1.0, "eval_time": 8.7}
    # failure: return this shape instead of raising
    return {"combined_score": 0.0, "validity": 0.0, "error": "Timeout: ..."}
```
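
Part 5 is where isolation happens. As a rough sketch of the idea using only the standard library: run the candidate's `run_code()` in a child interpreter, enforce a wall-clock timeout, and cap BLAS/OpenMP threads through the environment. This is not the shared implementation from the repository. It assumes the solution is JSON-serialisable, omits the memory cap, and defines its own stand-in for `EvaluatorTimeoutError`.

```python
import json
import os
import subprocess
import sys


class EvaluatorTimeoutError(Exception):
    """Stand-in for the shared exception class named in the table above."""


# Child-side script: import the candidate file and print run_code() as JSON.
RUNNER = """
import importlib.util, json, sys
spec = importlib.util.spec_from_file_location("cand", sys.argv[1])
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)
print(json.dumps(mod.run_code()))
"""


def run_with_timeout(path: str, timeout_s: float = 30.0):
    """Run the candidate in a subprocess and return its decoded solution."""
    env = {**os.environ, "OMP_NUM_THREADS": "1"}  # keep numeric libraries single-threaded
    try:
        proc = subprocess.run(
            [sys.executable, "-c", RUNNER, path],
            capture_output=True, text=True, timeout=timeout_s, env=env,
        )
    except subprocess.TimeoutExpired as exc:
        raise EvaluatorTimeoutError(f"Timeout: exceeded {timeout_s}s") from exc
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr.strip() or "candidate crashed")
    lines = proc.stdout.splitlines()
    if not lines:
        raise RuntimeError("candidate printed no result")
    return json.loads(lines[-1])  # last line, in case the candidate printed logs
```

Both exceptions are meant to be caught inside `evaluate()`, which turns them into the failure dict shown above.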

Design rules:

- Higher-is-better score. Minimising `q`? Use `1 / (eps + q)` (autocorrelation, erdos) or a reference ratio (hadamard).
- Score must discriminate. A 0/1 verdict degenerates into random search; aggregate over cases (ahc, kernelbench) or normalise (kissing-number-style).
- Hack-proof: recompute the score from the solution; never trust a self-reported value.
- `evaluate()` never raises. Wrap in `try/except` and return `combined_score: 0` on failure.
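
Two of these rules fit in a few lines. The sketch below shows the `1 / (eps + q)` transform and a wrapper that converts any exception into a zero score. `to_score` and `safe_evaluate` are hypothetical names, and the default `eps` is an arbitrary choice for illustration.

```python
import math


def to_score(q: float, eps: float = 1e-9) -> float:
    """Map a non-negative quantity to minimise onto a higher-is-better score."""
    if not math.isfinite(q) or q < 0:
        raise ValueError(f"objective must be finite and non-negative, got {q!r}")
    return 1.0 / (eps + q)


def safe_evaluate(score_fn, solution) -> dict:
    """Call score_fn(solution) and never let an exception escape."""
    try:
        return {"combined_score": float(score_fn(solution)), "validity": 1.0}
    except Exception as exc:
        return {
            "combined_score": 0.0,
            "validity": 0.0,
            "error": f"{type(exc).__name__}: {exc}",
        }
```

Because `to_score` is strictly positive for valid inputs, a failed evaluation at `0.0` always ranks below every valid candidate.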

### 3. Instruction file — `<task>.txt`

Shown verbatim to the model. Cover: the problem, every constraint, the objective (max/min + how `combined_score` is computed), resource limits (especially per-evaluation timeout in seconds), any reshaping / discretisation, and the program interface.

Always include the sentence

> *Do this by evolving the code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END`.*

so the code extractor knows the evolved region.

### 4. Optional files

| File | Purpose |
|------|---------|
| `requirements.txt` | Not installed by SimpleTES; package names go into the prompt so the LLM does not hallucinate dependencies. |
| `venv/` (or `<family>/venv/`) | Task-local Python env, auto-detected. Override with `--eval-venv <path>`. |
| `<family>/pyproject.toml` + `uv.lock` | Reproducible family-level env. |
| `data_manifest.json` | Files the task needs; `scripts/prepare_task.py --task <family>` runs declared commands, `--check` verifies. |

Minimal `data_manifest.json`:

```json
{
  "prepare_commands": [
    {"command": ["bash", "setup.sh"], "cwd": ".", "description": "Build local deps"}
  ],
  "required_files": ["my_deps/built_artifact"]
}
```

Don't commit large data; declare it.
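
As an illustration of what a `--check` style verification amounts to, the sketch below reads a manifest and reports which declared files are absent. It is not the code in `scripts/prepare_task.py`; `missing_required_files` is a hypothetical helper, and it assumes `required_files` paths are relative to the family directory.

```python
import json
from pathlib import Path


def missing_required_files(family_dir: str) -> list[str]:
    """Return the manifest's required_files entries that do not exist on disk."""
    family = Path(family_dir)
    manifest = json.loads((family / "data_manifest.json").read_text())
    return [
        rel
        for rel in manifest.get("required_files", [])
        if not (family / rel).exists()  # assumed relative to the family directory
    ]
```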

---

## Worked Example

End-to-end square-root task.

```text
datasets/my_demo/square_root/
├── init_program.py
├── evaluator.py
└── square_root.txt
```

`init_program.py`:

```python
# EVOLVE-BLOCK-START
def sqrt(x: float) -> float:
    return x / 2  # bad baseline; the LLM will fix it
# EVOLVE-BLOCK-END

def run_code():
    return sqrt
```

`evaluator.py`:

```python
import importlib.util, math, sys, time

TIMEOUT_SECONDS = 30  # keep in sync with square_root.txt

def evaluate(filepath: str) -> dict:
    try:
        t0 = time.time()
        # Load the candidate file as a module and fetch its sqrt implementation.
        spec = importlib.util.spec_from_file_location("cand", filepath)
        mod = importlib.util.module_from_spec(spec)
        sys.modules["cand"] = mod
        spec.loader.exec_module(mod)
        sqrt = mod.run_code()

        targets = [0.25, 1.0, 2.0, 9.0, 1024.0, 1e-6]
        errors = [abs(sqrt(x) - math.sqrt(x)) for x in targets]
        if not all(math.isfinite(e) for e in errors):
            raise ValueError("candidate returned a non-finite value")
        return {
            # Higher is better and always > 0, so a failure (0.0) ranks below every valid run.
            "combined_score": 1.0 / (1.0 + sum(errors)),
            "validity": 1.0,
            "eval_time": time.time() - t0,
            "max_error": max(errors),
        }
    except Exception as e:
        return {"combined_score": 0.0, "validity": 0.0, "eval_time": 0.0, "error": str(e)}
```

`square_root.txt`:

```text
Improve sqrt(x) so it approximates math.sqrt as closely as possible on the
listed targets. You may use the math module but not math.sqrt.

Do this by evolving the code between # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END.
The time limit for each program evaluation is 30 seconds.
```

Run:

```bash
uv run python main.py \
--init-program datasets/my_demo/square_root/init_program.py \
--evaluator datasets/my_demo/square_root/evaluator.py \
--instruction datasets/my_demo/square_root/square_root.txt \
--model gemini/gemini-2.0-flash \
--max-generations 30
```
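
For orientation, a candidate the search could plausibly converge to is Newton's iteration, which satisfies the "no `math.sqrt`" constraint. The version below is hand-written for illustration and is not output from an actual run.

```python
# EVOLVE-BLOCK-START
def sqrt(x: float) -> float:
    """Newton's method for the root of g*g - x = 0."""
    if x < 0:
        raise ValueError("sqrt is undefined for negative inputs")
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0  # start at or above the true root
    for _ in range(60):           # far more steps than the listed targets need
        guess = 0.5 * (guess + x / guess)
    return guess
# EVOLVE-BLOCK-END

def run_code():
    return sqrt
```

On the six targets its error is at the level of floating-point rounding, so the score is bounded only by the evaluator's own formula.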

---

## Pre-flight Checklist

- [ ] `<task>.txt`, `run_code()`, and `validate_solution` agree on the interface. Timeout values in `.txt`, `TIMEOUT_SECONDS`, and `--eval-timeout` all match.
- [ ] EVOLVE-BLOCK markers paired; entry function in the fixed region; seed earns a finite, non-zero score locally.
- [ ] Evaluator has subprocess isolation, hard timeout, memory cap. `validate_solution` covers every hard constraint with tolerances. Score is recomputed from the solution. `evaluate()` never raises.
- [ ] `combined_score` is higher-is-better and discriminates (not 0/1).
- [ ] `main.py ... --max-generations 5` completes locally. If the family has setup steps, `scripts/prepare_task.py --check` is clean.
- [ ] Family `README.md` covers problem, scoring, and environment.