glossa-lab

Agentic computational linguistics research platform for statistical analysis, decipherment, and hypothesis testing of ancient and unknown writing systems — with a primary focus on the Indus Script (Mahadevan corpus, Holdat LLC dataset) using methods developed by Dr. Andreas Fuls (TU Berlin / ICIT).

Built and maintained by BitConcepts LLC

Overview

Glossa Lab is a production research tool combining a Python backend, React frontend, and Windows/Linux/macOS service support. It provides an end-to-end environment for:

Corpus management — upload, register, inspect, and sanitise sign-sequence corpora
Statistical analysis — entropy, Zipf, positional profiles (T/I/M), writing-system classification
Decipherment experiments — SA-based sign-to-phoneme hypothesis generation, benchmarks vs known scripts
Experiment Builder — composable graph experiments using atomic nodes (no coding required); new Evidence Graph category with 7 nodes for comparative literature analysis
Study Builder — multi-experiment research workflows as visual graphs
Glossa AI — embedded research assistant that runs analyses, proposes hypotheses, and navigates the tool
Discovery engine — continuous literature discovery across arXiv, EuropePMC, CrossRef, DOAJ and more
Evidence Graph — per-project literature library, automated paper sweep (configurable via sweep.yaml), claim extraction, cross-hypothesis falsification matrix, and hidden hypothesis generation
AI Provider Registry — unified management of cloud (OpenAI, Anthropic, Mistral, Google…), local (Ollama), and self-hosted (vLLM) AI backends with model scoring and smart assignment
Reports & Data — PDF, Markdown, JSON, CSV export of all results

System architecture

[ Tray ] ─────┐
              │
[ Frontend ] ─┼──→ [ Backend Service (FastAPI) ] ──→ [ Pipelines / Jobs / Models ]
              │              │
[ CLI / Dev ] ┘         [ SQLite DB ]
                              │
                    [ Provider Registry ] ──→ [ Cloud / Ollama / vLLM ]

Key principles

The backend is the source of truth
The tray and frontend are interfaces, not runtime owners
All communication occurs through explicit REST APIs
Service lifecycle is deterministic and observable — every background process logs START/COMPLETE

Components

Backend (Python / FastAPI)

REST API + background job engine
SQLite database (providers, model scores, discovery items, experiments, studies)
AI provider registry with test/probe on startup and on-demand
HuggingFace Open LLM Leaderboard sync (nightly) + static fallback scores
Discovery engine with 10+ fetchers (arXiv, EuropePMC, CrossRef, PubMed, DOAJ…)
RAG index for research context injection
Ollama auto-detection and lifecycle management

Frontend (React / TypeScript / Vite)

Built artefact (frontend/dist/) is committed to the repo so the server only needs git pull — no Node.js required on the deployment target.

Key panels:

Provider Registry — add/test/manage AI providers; badges: 🦙 Ollama · ☁️ Cloud · ⚡ vLLM/Custom · 🤗 HuggingFace
Model Assignments — assign primary/fallback models per bucket (Reasoning / Conversational / Long-form / Global) with draft/apply workflow, scores, filter, and swap
Experiment Builder — visual DAG editor with Evidence Graph palette category (7 nodes)
Study Builder — multi-experiment research workflows (accessible via Projects)
Discovery View — literature feed with → Evidence import action for Indus/Harappan items
Evidence Graph — three-tab workspace: Library (PDF upload, URL import), Claims (filterable), Sweep (configurable sweep + candidate import)
Foundation Check — research integrity dashboard (17 checks; must be PASS before external communication)
Bottom Panel — structured Logs (JSON → human-readable), Jobs, Terminal

Tray (Windows/macOS)

Local control surface. Start/stop/restart backend, open UI, quick status.

Indus Script Research Outputs

If you are here for the preprint materials, go directly to:

research/indus/
├── pierson_2026_indus_preprint_v1.pdf   ← preprint PDF
├── anchor_table.csv                     ← 397-sign table (open in Excel)
├── anchor_table.json                    ← same table with full metadata
├── mahadevan_parpola_crosswalk.json     ← M-number ↔ P-number crosswalk
└── phase_reports/                       ← 35 phase reports (Phases 127–170)

See research/indus/README.md for full details, corpus access notes, and citation.

Supplemental datasets for direct use (fish-sign compound-context, iconographic formula enrichment, formula bigram table, polysemy divergence test) are in research/indus/supplemental/.

Repository structure

glossa-lab/
├─ LICENSE              ← MIT (source code)
├─ AGENTS.md            ← agent operating rules (read first, every session)
├─ LEDGER.md            ← session ledger (sole continuity authority)
├─ README.md
├─ CITATIONS.md         ← citation registry for all research data
├─ setup-os.cmd / setup-os.sh  ← start/stop/restart
├─ shell.cmd / shell.sh        ← tool wrapper (pytest, ruff, python)
├─ .github/
│  └─ workflows/ci.yml  ← GitHub Actions CI
├─ backend/             ← Python FastAPI application
│  ├─ glossa_lab/       ← app modules (api/, experiments/, discovery/, ...)
│  ├─ scripts/          ← all research and utility scripts
│  └─ tests/
├─ frontend/            ← React / TypeScript / Vite
│  ├─ src/
│  └─ dist/             ← built artefact (committed for server deploy)
├─ tray/                ← system tray app
├─ services/            ← systemd / launchd / Windows service definitions
├─ docs/
│  ├─ images/           ← diagrams and sign images
│  ├─ governance/       ← governance docs
│  ├─ research/         ← decipherment research docs
│  ├─ USER_GUIDE.md
│  ├─ architecture.md
│  └─ REQUIREMENTS.md
├─ data/                ← canonical corpus and reference data
│  ├─ crosswalks/       ← sign crosswalk CSVs (M-number ↔ Parpola, Yajnadevam)
│  ├─ raw/              ← raw source corpora
│  ├─ normalized/       ← cleaned / extracted corpus files
│  └─ import/           ← staged import artifacts
├─ outputs/             ← generated computational artifacts
│  └─ analysis/         ← summary JSON analysis files
├─ reports/             ← human-readable research reports (PDF, Markdown)
├─ research/            ← public preprint outputs
│  └─ indus/            ← preprint PDF, anchor table, phase reports (CC BY 4.0)
├─ scripts/             ← project-wide utility scripts
├─ glossa-corpus/       ← internal corpus store
├─ glossa-indus/        ← Evidence Graph data store
│  ├─ config/sweep.yaml
│  ├─ literature/ · claims/ · hypotheses/ · raw/
│  └─ scripts/
└─ corpora/             ← external corpus downloads (gitignored, ~3 GB)

Quick start

Windows

# First-time install (registers autostart, installs deps)
setup-os.cmd install

# Start backend + tray
setup-os.cmd start

# Verify
curl.exe -sf http://localhost:8001/api/v1/health

Linux (systemd)

cd backend && python3 -m venv venv && venv/bin/pip install -e .
sudo systemctl start glossa-lab
curl -sf http://localhost:8001/api/v1/health

Open http://localhost:8001 in your browser.

Development workflow

All non-trivial work follows the proposal-first cycle in AGENTS.md. Frontend changes require a rebuild before they are visible:

cd frontend && npm run build
# Verify served bundle:
curl.exe -sf http://localhost:8001/ | Select-String 'index-[A-Za-z0-9]+\.js'

Research governance

H18 — Every data file must have _citation traceable to CITATIONS.md
H19 — Foundation check must PASS before external communication
Indus Script decipherment: 161 H+M candidate readings, 90.96% token coverage
Research outputs: research/indus/

Documentation

File	Purpose
`AGENTS.md`	Agent rules, start/stop commands, hard rules
`LEDGER.md`	Session ledger — sole continuity authority
`CITATIONS.md`	Research data citation registry
`docs/USER_GUIDE.md`	Full user guide (all panels) including Evidence Graph
`docs/architecture.md`	System architecture including Evidence Graph layer
`docs/REQUIREMENTS.md`	Formal requirements (R1–R16, incl. R14 Evidence Graph, R15 DB reliability, R16 CI/CD)
`docs/TESTS.md`	Test specification (TEST-IEA, TEST-EV, TEST-PW-EG, TEST-CI)
`docs/research/`	Decipherment research documents
`docs/guides/`	How-to guides (experiments, pipelines, studies)
`glossa-indus/LEDGER.md`	Evidence Graph batch work log
`research/indus/`	Public research outputs — anchor table, phase reports, preprint PDF

Current research status (May 2026 — Phase-44)

137 verified anchors (7 HIGH, 54 MEDIUM, 75 LOW; 196 nīr-placeholder entries from V8-V24 archive removed)
7 HIGH-confidence readings: M342=ay/ā, M176=an/aṇ, M099=kol/koḷ, M062=erutu, M045=yānai, M016=kaḷiṟu, M006=puli
CISI corpus rebuilt: 179 inscriptions / 1003 tokens / 182 distinct signs
Dravidian LM expanded: 944 bigrams (from 184) via TamilTB v0.1 integration
Phase-44 T1 (M342 genitive): UNCERTAIN — cross-site Jaccard 0.429; anchor signs confirmed in genitive context
Phase-44 T2 (M99 phonetic): SUPPORTED — kol/koḷ (DEDR 2173/2174); M267→M099 title formula 84×
Phase-43 (May 2026): 231.9σ positional structure confirmed; Hunt tripartite formula 59× lift
Evidence Graph (May 2026): 11 papers registered, 22 claims extracted across Parpola/FSW/Yadav/Roif/Hunt
TB correlation: 0.907 (post M267 correction)
V8-V24 autonomous campaign archived 2026-05-17; INDUS_FINAL_ANCHORS.json preserved at 137 entries

Status

Production — active research. Backend and frontend fully operational at http://localhost:8001.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

glossa-lab

Overview

System architecture

Key principles

Components

Backend (Python / FastAPI)

Frontend (React / TypeScript / Vite)

Tray (Windows/macOS)

Indus Script Research Outputs

Repository structure

Quick start

Windows

Linux (systemd)

Development workflow

Research governance

Documentation

Current research status (May 2026 — Phase-44)

Status

About

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 624 Commits
.github		.github
backend		backend
data		data
docs		docs
frontend		frontend
glossa-corpus		glossa-corpus
glossa-indus		glossa-indus
outputs		outputs
reports		reports
research/indus		research/indus
scripts		scripts
services		services
tray		tray
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CITATIONS.md		CITATIONS.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LEDGER.md		LEDGER.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
scaffold.yml		scaffold.yml
setup-os.cmd		setup-os.cmd
setup-os.sh		setup-os.sh
shell.cmd		shell.cmd
shell.sh		shell.sh

Folders and files

Latest commit

History

Repository files navigation

glossa-lab

Overview

System architecture

Key principles

Components

Backend (Python / FastAPI)

Frontend (React / TypeScript / Vite)

Tray (Windows/macOS)

Indus Script Research Outputs

Repository structure

Quick start

Windows

Linux (systemd)

Development workflow

Research governance

Documentation

Current research status (May 2026 — Phase-44)

Status

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages