This document explains why we believe the lean-eval submission pipeline is resistant to adversarial submissions, what assumptions it depends on, and where a future red-teamer should look first.
It is not a debugging log. The CI-postmortem narrative that previously
lived in LANDRUN.md is preserved in git history and replaced here by a
focused security write-up.
The attacker controls:
- One submission they file as a GitHub issue: a public/private GitHub
repo (or public gist) containing
Submission.leanand any number of files underSubmission/**/*.lean. Nothing else from the submission source is consumed. - The freeform
modelandproduction_descriptionfields in the issue body.
The attacker does not control:
- Whether the
submissionlabel gets applied — that requires triage permission onleanprover/lean-eval. Challenge.lean,Solution.lean,lakefile.toml,lean-toolchain,config.json,WorkspaceTest.leanin the generated workspace. These are taken from a pristinegenerated/<id>/checkout each run.- Any pinned upstream commit (landrun, lean4export, comparator, GitHub Actions). See Trusted dependencies and pin policy for the assumption this places on each upstream.
The goal we resist: a submitter receiving credit on the leaderboard for a theorem they have not actually proved.
Every benchmark problem ships as a self-contained Lake project in generated// with a fixed three-file shape:
- Challenge.lean is the trusted theorem statement, body
sorry. The benchmark author writes this and reviewers approve it. It is what the problem ASKS. - Submission.lean is the only file the solver edits. It must define
a theorem of the same signature inside
namespace Submission. This is what the solver provides as their PROOF. - Solution.lean is a generated bridge: it has the same theorem
signature as Challenge and provides
exact Submission.foo(or, for multi-hole problems, the structured equivalent). It exists so that comparator can verify "Solution typechecks => Submission really proves Challenge."
leanprover/comparator is the
verifier. Given a config.json listing theorem_names,
definition_names, and permitted_axioms, it:
- Builds Challenge and Solution separately under a
landrunLinux landlock sandbox, then runslean4exporton each (also sandboxed). - Walks the constant graph from each named theorem and requires that every transitively-used constant (other than def-hole targets, where only the type signature must match) is byte-identical between Challenge and Solution. See comparator/Comparator/Compare.lean.
- Walks the same graph for axioms and requires every reachable axiom
to be in the
permitted_axiomsallowlist. See comparator/Comparator/Axioms.lean. Our allowlist is{propext, Quot.sound, Classical.choice}. Notably absent:sorryAx(sosorryis rejected) andLean.ofReduceBool(sonative_decideis rejected). - Replays the exported Solution environment into the in-process Lean kernel for a final type-check.
Because Solution's only proof step is exact Submission.foo, and
because Compare.lean rejects any drift in the constant graph from
Challenge's theorem statement, the only way for verification to succeed
is if Submission.foo is definitionally equal to a proof of Challenge's
exact theorem statement, using only the permitted axioms.
User-submitted Lean is elaborated in exactly one place: comparator's
safeLakeBuild Solution, which runs lake build Solution (and
transitively lake build Submission) inside a landrun sandbox. Every
other step either does not elaborate Lean, or elaborates only trusted
code.
| Step | Process | Location | Elaborates Submission? |
|---|---|---|---|
actions/checkout of lean-eval |
trusted | runner | No |
rm -rf .git |
trusted | runner | No |
lake build comparator (in .ci/comparator/) |
trusted | runner | No |
lake build lean4export (in .ci/lean4export/) |
trusted | runner | No |
| Submission tarball extracted | inert data | runner | No |
evaluate_submission.py overlay walk |
trusted Python | runner | No |
_share_packages symlink setup |
trusted Python | runner | No |
_prime_workspace runs lake update |
parses TOML lakefile | runner | No |
_prime_workspace runs lake exe cache get |
downloads pre-built oleans | runner | No |
lake exe lean-eval run-eval (host exe) |
trusted Lean | runner | No |
lake test builds WorkspaceTest |
only import Lean |
runner | No |
WorkspaceTest spawns lake env comparator config.json |
lake env is an env-shim |
runner | No |
comparator calls lean --print-prefix / which git |
side-effect-free | runner | No |
comparator.safeLakeBuild Challenge |
trusted source, sandboxed | inside landrun | No |
comparator.safeExport challengeModule |
reads trusted olean | inside landrun | No |
comparator.safeLakeBuild Solution |
builds Solution -> imports Submission | inside landrun | YES |
comparator.safeExport solutionModule |
reads the just-built olean | inside landrun | No |
runKernel solution |
replays exported env in comparator process | runner | No |
Two boundary concerns to call out explicitly:
Lake env doesn't elaborate. scripts/security_probes/lake_env_probe.py
verifies on every run of the test suite that lake env -- <cmd> (and
nested lake env -- lake env -- <cmd>) on a workspace whose
Submission.lean has top-level initialize and #eval markers does
NOT produce those markers in stdout. Confirmed on the pinned toolchain
(2026-05-04). If a future lake version starts evaluating project Lean
during lake env, this probe must be re-run before the toolchain
bump lands.
_prime_workspace doesn't build user-controlled targets. The
docstring at scripts/evaluate_submission.py:288-322
documents the rule and the regression history (commit 3474943
violated it in 2026-04, reverted in #92 in 2026-05). lake update and
lake exe cache get are the only shell-outs allowed.
Writable .lake/, and why the obvious tampering attack does not work.
Comparator gives safeLakeBuild --rwx <workspace>/.lake, which is
necessary because lake writes its build outputs there. The natural
attack to consider is: a Submission.lean initialize block spawns a
detached setsid daemon that inherits landrun's filesystem rules,
survives the per-build landrun child's exit (children get reparented
to PID 1, not killed), and races to overwrite Solution.olean between
safeLakeBuild Solution returning and safeExport solutionModule
reading it — getting comparator to verify against an
attacker-controlled olean.
We ran scripts/security_probes/artifact_tamper_probe.py
on Linux (kernel 6.12 + landrun 5ed4a3db + comparator 71b52ec) on
2026-05-04 and the attack is structurally impossible at the spawn
step. Inside the sandbox, IO.Process.spawn succeeds only for the
lean binary; sh, setsid, bash, cp, /bin/sh, /usr/bin/env
all return exit 255 ("could not execute external process"). Comparator's
safeLakeBuild only --rox-whitelists leanPrefix and gitLocation;
landrun denies exec of anything else. The attacker has no way to
spawn a daemon, so there is nothing to race against safeExport.
This is a strong assumption to depend on. If a future comparator pin
broadens the executable allowlist (e.g. adds sh), or if a Lean
toolchain change allows the elaborator to spawn binaries by some other
path, this conclusion needs to be re-derived. Re-run the probe after
any landrun, comparator, or lean-toolchain pin bump.
We rely on the following claims about leanprover/comparator at the pinned commit:
- The constant-graph walk in
compareAtreaches every constant that the theorem's statement type or proof body transitively uses. If a constant is reachable, it must be byte-identical between Challenge and Solution unless it is listed indefinition_names(which the benchmark author controls via the manifest). - The axiom walk in
checkAxiomsreaches every axiom transitively used by the theorem and definition targets, and rejects any axiom not in the configured allowlist. safeLakeBuildandsafeExportrun inside landrun with the filesystem and environment policy described incomparator/Main.lean.- Lean kernel re-replay (
runKernel) catches any malformed olean structure that lean4export accepts but the kernel rejects.
If any of those claims drifts in a future comparator commit, our guarantees regress with it. The comparator pin (see Section 5) is the mechanism that prevents silent drift; the bump procedure is the mechanism that forces a re-audit.
Every external dependency of the submission pipeline is pinned to an
immutable commit SHA. Tags and branches are mutable; if landrun@main
or actions/checkout@v4 is ever resolved at install time, the
upstream publisher controls our supply chain.
| Dependency | Repo | Pinned to | Purpose | Last bumped |
|---|---|---|---|---|
| landrun | zouuup/landrun | 5ed4a3db3a4ad930d577215c6b9abaa19df7f99f |
Linux landlock sandbox | 2026-05-04 |
| lean4export | leanprover/lean4export | 12581a6b680d8478175596338eb2d53383a323e3 |
exports olean to text | 2026-05-04 |
| comparator | leanprover/comparator | 71b52ec29e06d4b7d882726553b1ceb99a2499e0 |
the verifier | pre-2026-05 |
jlumbroso/free-disk-space |
(action) | 54081f138730dfa15788a46383842cd2f914a1be |
runner disk cleanup | 2026-05-04 |
actions/checkout |
(action) | 11bd71901bbe5b1630ceea73d27597364c9af683 |
repo checkout | 2026-05-04 |
actions/setup-python |
(action) | a26af69be951a213d495a4c3e4e4022e16d87065 |
python setup | 2026-05-04 |
actions/setup-go |
(action) | d35c59abb061a4a6fb18e82ac0862c26744d6ab5 |
go setup | 2026-05-04 |
actions/upload-artifact |
(action) | ea165f8d65b6e75b540449e92b4886f43607fa02 |
artifact upload | 2026-05-04 |
actions/download-artifact |
(action) | d3f86a106a0bac45b974a628896c90dbdf5c8093 |
artifact download | 2026-05-04 |
actions/create-github-app-token |
(action) | d72941d797fd3113feb6b93fd0dec494b13a2547 |
App auth | 2026-05-04 |
leanprover/lean-action |
(action) | 38fbc41a8c28c4cbaec22d7f7de508ec2e7c0dd9 |
Lean toolchain | 2026-05-04 |
| Go toolchain | (setup-go arg) | 1.24.0 |
concrete version, not stable |
2026-05-04 |
| Python | (setup-python arg) | 3.11.10 |
concrete patch version | 2026-05-04 |
permitted_axioms (every workspace) |
comparator config | {propext, Quot.sound, Classical.choice} |
axioms allowed in proofs | unchanged |
| benchmark commit (leaderboard) | leanprover/lean-eval | .benchmark-commit in lean-eval-leaderboard |
which lean-eval the live site reflects | per-bump |
Mutable selectors are forbidden anywhere in .github/workflows/. The
scripts/action_pin_audit.py script
walks every workflow and hard-fails on uses: ...@main,
uses: ...@v<N> (loose tag), go-version: stable, python-version
without a patch component, and go install ...@<branch>. It runs on
every CI build; an unpinned dependency cannot be merged.
- Find the new SHA. For tags:
git ls-remote <repo> refs/tags/<tag>. For branch HEAD:git ls-remote <repo> refs/heads/main. - Update every site listed below in lockstep, and add a dated
comment of the form
# <dep> pinned to <sha7> (<note> as of YYYY-MM-DD)..github/workflows/ci.yml.github/workflows/submission.yml.github/workflows/regenerate-main.ymlscripts/check_comparator_installation.py(LANDRUN_INSTALL_TARGET, for landrun only)lean-eval-leaderboard/.benchmark-commit(for the lean-eval benchmark commit only)
- Open a PR. CI runs
scripts/action_pin_audit.py,scripts/sandbox_engaged_probe.py, andscripts/security_probes/env_dump_probe.py. All three must pass. - Re-run the one-shot probes locally on Linux:
- scripts/security_probes/lake_env_probe.py
- scripts/security_probes/artifact_tamper_probe.py Update Sections 3 and 6 of this doc if either changes verdict.
- Merge.
- Action pin audit. scripts/action_pin_audit.py runs in CI; mutable workflow refs are blocked from merging.
- Sandbox-engaged probe. scripts/sandbox_engaged_probe.py
runs in
ci.ymland on everysubmissionworkflow run. It builds a synthetic comparator workspace whoseSubmission.leanattempts five writes —/tmp/...,$HOME/...,../outside, through a.lake/symlink to outside, and.lake/probe/inside-ok— and asserts only the inside-ok write succeeds. Catches landrun fail-open, kernel landlock regressions, and policy drift in comparator's flag set. - Env-allowlist probe. scripts/security_probes/env_dump_probe.py
runs in
ci.ymland on everysubmissionworkflow run. The probe salts the parent env with decoy secrets (GH_TOKEN,LD_PRELOAD,AWS_ACCESS_KEY_ID,LEAN_EVAL_BOT_PRIVATE_KEY, etc.) and asserts thatSubmission's elaboration sees a subset of{PATH, HOME, LEAN_ABORT_ON_PANIC, LEAN_PATH, LD_LIBRARY_PATH}containing the required{PATH, HOME, LEAN_ABORT_ON_PANIC}. Allowlist-based, not spot-check, so the next unknown token class can't slip through. The required three are forwarded by comparator'senvPass;LEAN_PATHandLD_LIBRARY_PATHare added bylakeitself when it spawnsleanand may or may not be present depending on the OS (NixOS sets both, Ubuntu typically only sets LEAN_PATH). They point at the workspace's.lake/build/liband the toolchain's shared libs and contain no secrets. Confirmed empirically on Linux/NixOS on 2026-05-04 (runs both vars) and on Ubuntu GHA runners 2026-05-04 (runs only LEAN_PATH). - Overlay restriction.
evaluate_submission.py:overlay_matchonly copiesSubmission.leanand files matching*.leanunderSubmission/. Symlinks are rejected at submission walk and at overlay time; all paths are normalized and checked for traversal. _prime_workspaceinvariant. Onlylake update+lake exe cache getare allowed before comparator runs. The docstring carries the rule and the regression history.- Permitted axioms. Every generated
config.jsonhaspermitted_axioms = {propext, Quot.sound, Classical.choice}. NosorryAx, noLean.ofReduceBool. - Schema validation.
update_leaderboard.pyvalidatessubmission-refandbenchmark-commitas 40-char hex SHAs,submission-repoasowner/name,submission-kindasgithub_repoorgist, and--useras a GitHub login regex. - Benchmark-commit pin. The leaderboard's
scripts/generate_site_data.pyreads.benchmark-commitand aborts if the consumed../lean-evalcheckout's HEAD does not match. The intended bumper of the live site is the leaderboard repo'sbump-benchmark-snapshot.yml(not yet implemented; see Section 7).
Each item is something we believe is OK today but where the assumption is fragile or hard to audit.
- landrun
--best-effortsemantics on novel kernels. landrun silently runs without sandbox if the kernel lacks landlock support. Today'subuntu-latestrunners are fine. The sandbox-engaged probe catches this in CI on every run, but the probe only exercises the policy comparator uses — if comparator's policy ever diverges from the probe's mock workspace shape, a real false negative is possible. Re-run the probe by hand on a clean Linux box after any landrun bump. - Writable-
.lakeself-tampering (Section 3). Currently neutralised by landrun's exec restriction (onlyleancan be exec'd from inside the sandbox;sh/setsid/etc. are denied), so the attacker cannot spawn the daemon the attack requires. This depends on comparator'sexecutablePathsstaying narrow. If a future comparator pin addssh(or similar) toexecutablePaths, re-runartifact_tamper_probeand assume the attack is back in play until proven otherwise. lake envbehaviour across lake versions. Thelake_env_probeconfirms current behaviour. Lake version bumps must re-run it.definition_namesauthor trap. Comparator only requires that def/instance holes match Challenge in TYPE, not body. If the benchmark author writes a theorem whose statement does not pin the def's value, a submitter can satisfy the manifest with a trivial def. There is no automated guard against this — comparator structurally cannot verify that an author's spec theorems sufficiently constrain a def — and adding one would be at most a heuristic. The real guard is PR review of any problem that uses def/instance holes.- HTML escaping of freeform fields on the leaderboard. The
modelandproduction_descriptionfields are unbounded freeform text from the issue body, propagated intoleaderboard.jsonand rendered by Verso. Verso/markdown should escape but this has not been explicitly probed. - Leaderboard deploy doesn't yet enforce
.benchmark-commit.lean-eval-leaderboard/.github/workflows/deploy.ymlcurrently builds from frozensite-data/;generate_site_data.pyis invoked out-of-band. The pin guard added in this work fires only when the script runs. The intendedbump-benchmark-snapshot.ymlworkflow that would close this gap does not yet exist; tracked as follow-up. - Single-kernel defence.
enable_nanoda: falsein every generated config means the alternate kernel is not run. A Lean kernel soundness bug becomes a false-credit vector. Defence in depth would beenable_nanoda: trueonce nanoda's string handling is upstream-fixed. - Freeform
modellength.update_leaderboard.pyvalidatesproduction_descriptionlength but notmodel. A pathologicalmodelvalue cannot grant credit, but can pollute the leaderboard JSON. - The
submissionlabel is a triage gate. Anyone with triage access onleanprover/lean-evalcan trigger the workflow against any submission URL. Repository admin hygiene matters.
- comparator/Main.lean, comparator/Comparator/Compare.lean, comparator/Comparator/Axioms.lean — verifier internals.
- leanprover/comparator README — upstream trust model.
- scripts/security_probes/ — the four security probes cited above.
- scripts/action_pin_audit.py, scripts/sandbox_engaged_probe.py, scripts/check_comparator_installation.py, scripts/evaluate_submission.py, scripts/update_leaderboard.py, scripts/validate_manifest.py — the pipeline scripts referenced from this document.
- Workflow files: .github/workflows/submission.yml, .github/workflows/ci.yml, .github/workflows/regenerate-main.yml, .github/workflows/notify-leaderboard.yml.