Ground truth has no ground

Majority vote is not a neutral way to combine human judgement. It is a specific, contestable rule that silently makes governance decisions and ships them into the model as if they were facts.

A small, self-contained argument — with a runnable proof and a trainer-ready export — about the most consequential and least examined act in AI data labeling: collapsing many human judgements into one "ground truth." For contested / aesthetic / safety / synthetic-human data there is no ground truth to recover — only a distribution of human judgement to preserve — and the disagreement that pipelines are built to delete is usually the most valuable signal in the set.
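The collapse described above fits in a few lines. This is a minimal illustration, not the repo's `disagreement.py`: the vote list and label names are hypothetical, and the alphabetical tie-break is assumed here as one common default, to show how a 4-4 fork gets "decided" without anyone deciding it.

```python
from collections import Counter
from math import log2

def majority_vote(votes):
    """Plurality label; ties fall to the alphabetically first label."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)

def distribution(votes):
    """Relative frequency of each label: the object majority vote throws away."""
    counts = Counter(votes)
    return {label: n / len(votes) for label, n in sorted(counts.items())}

def entropy_bits(dist):
    """Shannon entropy of a label distribution, in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Hypothetical 4-4 split between two labels (illustrative names only).
votes = ["natural"] * 4 + ["uncanny"] * 4

hard_label = majority_vote(votes)      # the tie is broken by spelling, not by judgement
soft_label = distribution(votes)       # both halves of the room are still visible
erased = entropy_bits(soft_label)      # disagreement the hard label no longer carries

print(hard_label, soft_label, erased)
```

Renaming a label is enough to flip `hard_label` here, which is the sense in which the rule, not the evidence, is making the call.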

The spine. A label is not ground truth; it is the output of an aggregation rule applied to evidence under a task definition — four objects, not one: the distribution (statistical), the hard label (a decision), the aggregation rule (governance), and the record (auditable). the-bayes-optimal-label.md proves this from decision theory, and the rest of the repo is those four objects examined closely: the tools produce the distribution and the decision, the four foundations dissect why the rule is never neutral, and the schema is the record.
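A rough sketch of the "label as Bayes action" idea, under stated assumptions: the two distributions, the label names, and the `review_cost` value are hypothetical, and the review rule below is a simplified stand-in for the cost model in `the-bayes-optimal-label.md`, not a transcription of it.

```python
from math import log2

def expected_log_loss(q, p):
    """Expected log loss (bits) of predicting q when judgements follow p."""
    return -sum(p[k] * log2(q[k]) for k in p if p[k] > 0)

def bayes_action(p, review_cost=None):
    """Under 0-1 loss pick the mode, unless review is cheaper than the expected error."""
    mode = max(sorted(p), key=p.get)   # sorted() makes the tie-break explicit
    expected_error = 1.0 - p[mode]
    if review_cost is not None and review_cost < expected_error:
        return "REVIEW"
    return mode

fork = {"acceptable": 0.5, "unacceptable": 0.5}        # hypothetical value fork
clear = {"acceptable": 0.9, "unacceptable": 0.1}       # hypothetical near-consensus
near_hard = {"acceptable": 0.99, "unacceptable": 0.01} # an almost one-hot prediction

# Log loss: predicting the distribution itself beats predicting a near-hard label.
print(expected_log_loss(fork, fork), expected_log_loss(near_hard, fork))

# 0-1 loss with a review option priced at 0.2 (an assumed cost, in error units).
print(bayes_action(fork), bayes_action(fork, review_cost=0.2), bayes_action(clear, review_cost=0.2))
```

The same evidence yields three different "labels" depending on the loss and the available actions, which is why the rule belongs in the record alongside the output.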

What's here

| file | what it is |
|---|---|
| `the-groundless-label.md` | The argument. Read this first. ~10 min, grounded in current research (HLV, VariErr, pluralistic alignment, model collapse). |
| `the-bayes-optimal-label.md` | The decision-theoretic spine. A label is a Bayes action, not ground truth: under log loss the optimal prediction is the whole distribution (Thm 1); under 0–1 loss it is the mode (Thm 2); under a cost model with a review option, the optimal action at a value fork is review, not a label (Thm 5). Concedes majority vote is correct in its one regime and proves where it ends. Read after the argument. |
| `disagreement.py` | Diagnostic. Instead of majority vote: keeps the distribution, separates genuine variation from likely error, flags value forks and manufactured consensus, prices what the collapse to one label destroys. Writes `triage.json`. |
| `soft_labels.py` | Operational. Turns the triage into things a trainer consumes: per-cell soft labels + entropy-derived weights (`soft_labels.jsonl`), and a governance queue of value forks awaiting a named human owner (`governance.jsonl`). |
| `aggregation.py` | Proof. Runs Arrow, May, and the Condorcet Jury Theorem against the same `data/labels.json`: the ribbon's "fact" flips with the aggregation rule, both 4–4 forks are decided by alphabetical order, and "get more labels" is shown to backfire under a shared norm. |
| `frustration.py` | Proof. Runs the spin-glass mapping on the same data: majority vote shown as a zero-temperature quench (and the bits it destroys), and an inferred-Ising ground state that recovers the two cohorts from votes alone (fact = ferromagnet, value fork = antiferromagnet, cyclic disagreement = spin glass). |
| `topology.py` | Proof. Runs the topological mapping on the same data: a continuous mean exists on a line but not a circle (Chichilnisky), a reward function exists iff the preference field is curl-free (Hodge: the Condorcet cycle has circulation 3), and the fork's Betti numbers show preference space torn in two (b₀=2). |
| `geometry.py` | Proof. Runs the information-geometry centres on the same data: cross-entropy = the arithmetic centre, which on an ordered axis is bimodal where every metric-aware centre is central (gap 0.70 TV); prints a per-cell "geometry gap" so the choice of loss stops being a silent default. |
| `data/labels.json` | A tiny hand-built annotation set modeled on the scenario this repo grew out of: AI-generated editorial portraits, 8 annotators in 2 normative cohorts, 3 questions each. |
| `the-aggregation-theorem.md` | The proof the argument didn't claim. Social choice theory (Arrow 1951, May 1952, Condorcet 1785) already settled the thesis, and drew the exact line the triage draws by hand. Companion to the argument. |
| `the-frustrated-label.md` | The physics one layer down. A crowd has no ground truth for the same reason a spin glass has no ground state (Parisi, Nobel 2021). The soft label is a Gibbs state at finite temperature; majority vote is the T→0 quench; model collapse is the second law applied to values. |
| `the-topological-label.md` | The shape underneath both. Aggregation is possible iff the preference space is contractible (Chichilnisky). A reward model is a potential a value fork forbids (it has curl: H¹≠0). Baryshnikov: Arrow = this hole. Closes the triptych. |
| `the-geometric-label.md` | The constructive turn (not a fourth impossibility). Given you keep the cloud: which cloud? On the curved (Fisher) simplex the KL, Fisher–Rao, and Wasserstein centres disagree, and cross-entropy silently picks one. A computable "geometry gap" + a decision: choose the loss to match the label's semantics. |
| `schema/resolution_record.schema.json` | The record: what the whole argument produces. A canonical, replayable record of one aggregation act: input judgements + reasons, the aggregation/tie-break rule, the loss/geometry, the measures (entropy, fork status, curl, geometry gap), the policy version + authority + owner, the disposition + conditions, and a replay hash. Most fields are produced by the tools above; the authority, disposition, and provenance fields (owner, decision, timestamp, replay hash) are the human and operational record the schema specifies. It names the object they were always describing. |
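One claim in the `topology.py` row, that a reward function exists only if the preference field is curl-free, can be checked by hand on three options. The sketch below is an illustration under assumptions, not the repo's code: the option names, the reward values, and the convention `flow[a, b] = r(a) - r(b)` are all hypothetical choices made for the example.

```python
def circulation(flow, loop):
    """Sum of pairwise preference margins around a closed loop of options."""
    edges = zip(loop, loop[1:] + loop[:1])
    return sum(flow[(a, b)] for a, b in edges)

def flow_from_reward(reward):
    """Pairwise margins induced by a scalar reward: flow[a, b] = r(a) - r(b)."""
    return {(a, b): reward[a] - reward[b]
            for a in reward for b in reward if a != b}

loop = ["A", "B", "C"]

# Any scalar reward induces margins that cancel around every loop.
consistent = flow_from_reward({"A": 2.0, "B": 1.0, "C": 0.0})

# A Condorcet cycle: A beats B, B beats C, C beats A, each by one unit.
cycle = {("A", "B"): 1.0, ("B", "C"): 1.0, ("C", "A"): 1.0}

print(circulation(consistent, loop))  # margins telescope to zero
print(circulation(cycle, loop))       # non-zero: no reward reproduces these margins
```

Because reward differences always telescope to zero around a loop, a non-zero circulation is a certificate that no scalar reward can reproduce the observed pairwise preferences.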

Run it (two steps, zero dependencies, Python 3.8+)

```
python3 disagreement.py     # diagnose -> triage.json
python3 soft_labels.py      # operationalize -> soft_labels.jsonl + governance.jsonl
python3 bayes_optimal.py    # (optional) the decision-theoretic spine: a label is a Bayes action, not ground truth
python3 aggregation.py      # (optional) the theorem under the thesis: social choice theory on the same data
python3 frustration.py      # (optional) the physics under the thesis: the label as a frustrated (spin-glass) system
python3 topology.py         # (optional) the shape under the thesis: aggregation fails iff preference space has a hole
python3 geometry.py         # (optional) the constructive turn: which centre of the cloud? (your loss already chose)
```

The first prints a per-cell triage and a "bill" — how many bits of human disagreement a single ground-truth column would erase, and where. The second emits trainer-ready records and, crucially, a governance.jsonl queue: every value fork the pipeline would otherwise resolve silently, held open until a named human records a decision and a rationale.
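One plausible way to read the "bill" is as per-cell Shannon entropy, summed over the set. The sketch below assumes that accounting; the cell identifiers and votes are hypothetical, and the actual `disagreement.py` may price the collapse differently.

```python
from collections import Counter
from math import log2

def entropy_bits(votes):
    """Shannon entropy (bits) of the empirical label distribution in one cell."""
    total = len(votes)
    return sum(-(n / total) * log2(n / total) for n in Counter(votes).values())

def bill(cells):
    """Per-cell bits, plus the total a single hard-label column would erase."""
    per_cell = {cell: entropy_bits(votes) for cell, votes in cells.items()}
    return per_cell, sum(per_cell.values())

# Hypothetical cells: (image, question) -> eight annotator votes.
cells = {
    "img1/q1": ["yes"] * 8,                # unanimous: nothing to erase
    "img1/q2": ["yes"] * 4 + ["no"] * 4,   # 4-4 split: one full bit
    "img2/q1": ["yes"] * 6 + ["no"] * 2,   # 6-2 split: somewhere in between
}

per_cell, total_bits = bill(cells)
for cell, bits in per_cell.items():
    print(f"{cell}: {bits:.3f} bits")
print(f"total: {total_bits:.3f} bits")
```

The per-cell breakdown matters more than the total: it shows where the erased disagreement is concentrated, which is where a governance queue would start.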

The full arc: diagnostic → triage → trainer-ready export → governance queue. Generated files (`triage.json`, `soft_labels.*`) are git-ignored; reproduce them by running the two scripts. `governance.jsonl` is the exception in spirit: the exporter merges with any existing copy, preserving recorded decisions and owner assignments across runs (and keeping a decided record even if a cell is no longer a fork). Re-running never silently loses state; in production you persist this file as a living backlog of policy decisions.
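The merge behaviour described above might look roughly like this. It is a sketch, not the exporter in `soft_labels.py`: the field names (`cell`, `owner`, `decision`, `rationale`) are hypothetical, file I/O is omitted, and dropping undecided records whose cell is no longer a fork is an assumption the text does not confirm.

```python
def merge_queue(existing, fork_cells):
    """Merge freshly detected forks into a saved queue without losing decisions."""
    by_cell = {rec["cell"]: rec for rec in existing}
    merged = []
    for cell in fork_cells:
        prior = by_cell.pop(cell, None)
        if prior is not None:
            merged.append(prior)  # keep any owner, decision, and rationale as recorded
        else:
            merged.append({"cell": cell, "owner": None,
                           "decision": None, "rationale": None})
    # Decided records survive even when the cell is no longer flagged as a fork.
    merged.extend(rec for rec in by_cell.values() if rec.get("decision") is not None)
    return merged

# Hypothetical saved queue from an earlier run.
saved = [
    {"cell": "img1/q2", "owner": "policy-lead", "decision": "keep both",
     "rationale": "cohorts disagree on a norm, not a fact"},
    {"cell": "img3/q1", "owner": None, "decision": None, "rationale": None},
]

# This run flags one old fork and one new one; img3/q1 is no longer a fork.
queue = merge_queue(saved, ["img1/q2", "img4/q3"])
for rec in queue:
    print(rec)
```

Running the merge twice with the same inputs gives the same queue, so re-running the pipeline does not overwrite a recorded decision.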

What this does not prove

The dataset here is illustrative, not empirical — six images × three questions, eight hand-built annotators in two designed cohorts. It exists to make the mechanisms legible and the scripts runnable end-to-end, not to estimate how often value forks, manufactured consensus, or geometry gaps occur in production. The four foundational results (Arrow/May/Condorcet, the spin-glass mapping, Chichilnisky, the information-geometry centres) are mathematical: they show these failures must arise wherever plural judgement is aggregated under a non-neutral rule — not that they are frequent in your data. That last question is empirical, and this repo ships the instruments to measure it (the disagreement bill, the frustration index, the geometry gap) rather than the measurement.
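As one computable instance of the mathematical results named above, the Condorcet Jury Theorem can be evaluated exactly. The sketch assumes independent annotators who each match some reference answer with probability `p`, and models a "shared norm" simply as `p < 0.5` relative to that reference; both are simplifying assumptions made here, and `aggregation.py` may set the problem up differently.

```python
from math import comb

def majority_accuracy(n, p):
    """P(a strict majority of n independent annotators matches the reference), n odd."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 3, 9, 27):
    helped = majority_accuracy(n, 0.6)   # annotators lean toward the reference
    hurt = majority_accuracy(n, 0.4)     # annotators share a norm that leans away
    print(f"n={n:2d}  p=0.6 -> {helped:.3f}   p=0.4 -> {hurt:.3f}")
```

With `p` above one half, adding annotators pushes majority accuracy toward 1; with `p` below one half, the same move pushes it toward 0, which is the sense in which "get more labels" can backfire.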

The one thing to take away

So the honest deliverable of high-stakes labeling is not a label. It is the distribution + the reasons + a record of who got out-voted — and, for genuine value forks, a named human who owns the call.

That record has a shape: schema/resolution_record.schema.json — what an aggregation act looks like when it is written down on purpose, replayable and owned.
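To give a feel for what "replayable" could mean, here is a hedged sketch of a record with a content hash. Every field name and value below is hypothetical and is not taken from `schema/resolution_record.schema.json`; hashing a canonical JSON encoding with SHA-256 is one reasonable choice assumed for the example, not necessarily the schema's.

```python
import hashlib
import json

def replay_hash(record):
    """SHA-256 over a canonical JSON encoding of everything except the hash itself."""
    payload = {k: v for k, v in record.items() if k != "replay_hash"}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical record of one aggregation act (illustrative field names only).
record = {
    "cell": "img1/q2",
    "judgements": {"yes": 4, "no": 4},
    "aggregation_rule": "plurality",
    "tie_break": "alphabetical",
    "loss": "cross_entropy",
    "measures": {"entropy_bits": 1.0, "fork": True},
    "owner": None,          # filled in by a named human, not by the pipeline
    "disposition": "open",
}
record["replay_hash"] = replay_hash(record)

# Replaying the same inputs reproduces the hash; changing the rule does not.
replayed = dict(record)
altered = dict(record, tie_break="random")
print(replay_hash(replayed) == record["replay_hash"])
print(replay_hash(altered) == record["replay_hash"])
```

Sorting keys before hashing means two records with the same content hash identically regardless of field order, so the hash tracks the aggregation act rather than its serialization.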
