Discovery Engine

Distributed scientific paper extraction for cross-domain discovery.

Autonomous agents continuously extract structured knowledge from scientific papers and patents, building a public cross-domain discovery graph. Each paper is decomposed into:

Layer 1 (Facts): Entities, properties, relations — what the paper found
Layer 2 (Connections): Bridge tags, provides/requires interface, unsolved tensions — how it connects to other fields

Papers connect when one's provides matches another's requires, enabling cross-domain discovery.

Quick Start

# Install
pip install -e ".[anthropic]"   # or: .[openai], .[gemini], .[all]

# Configure
discovery config --provider anthropic --api-key sk-ant-...
discovery config --github-user your-username

# Run — discovers papers automatically, no setup needed
discovery run --count 5           # test with 5 papers
discovery run --auto-submit       # run forever, auto-submit PRs

That's it. The agent will:

Query arXiv, PubMed, OpenAlex, OSTI for recent papers
Check which ones are already processed (via shared tracking file)
Fetch text, extract with your LLM, validate, save
Submit PRs automatically when batch is full

How It Works

discover papers → fetch text → LLM extraction → validate → save → submit PR → CI validates → auto-merge
      ↑                                                                                          |
      └── processed_papers.jsonl ←── auto-updates on merge ←─────────────────────────────────────┘

You run the loop on your machine with your own LLM (cloud API or local)
Papers discovered in real-time from public APIs (no pre-built queue needed)
Duplicates avoided via shared processed_papers.jsonl on GitHub
Results validate locally against the schema
Submit a PR — GitHub Actions CI checks quality (schema, grounding, anomaly detection)
Auto-merge if all quality gates pass — results move to results/
Tracking file auto-updates — next contributor sees your papers as done

Commands

Command	What it does
`discovery run`	Run the autonomous extraction loop
`discovery run --count 50`	Stop after 50 papers
`discovery run --source arxiv`	Only arXiv papers
`discovery run --auto-submit`	Auto-create PRs when batch is full
`discovery run --dry-run`	Preview papers without extracting
`discovery submit`	Submit pending results as a PR
`discovery submit --dry-run`	Preview what would be submitted
`discovery validate path.json`	Validate an extraction result
`discovery status`	Show local + global progress
`discovery config --show`	Show current configuration

Supported LLM Providers

Provider	Models
Anthropic	Claude Sonnet 4
Google	Gemini 2.5 Flash
OpenRouter	DeepSeek V3, Llama 3.3 70B, Qwen3 235B
OpenAI	GPT-4o
Local (ollama, vllm, llama.cpp)	Any GGUF/safetensors model

Local LLMs

Run extraction with a local model — no API key, no cost, fully offline:

# Option A: Use the 'local' shortcut (defaults to ollama on localhost:11434)
discovery config --provider local
discovery config --model llama3.1    # or any model you've pulled

# Option B: Manual OpenAI-compatible setup (vllm, llama.cpp, etc.)
discovery config --provider openai --api-key not-needed
discovery config --base-url http://localhost:8000/v1   # your server URL
discovery config --model your-model-name

# Then run as usual
discovery run --count 5

Any server that exposes an OpenAI-compatible /v1/chat/completions endpoint works:

ollama — ollama serve (default port 11434)
vllm — vllm serve model-name (default port 8000)
llama.cpp — llama-server -m model.gguf (default port 8080)
LM Studio — enable local server in settings

Recommended local models (8B+ for acceptable quality):

llama3.1 (8B) — fast, good for testing
qwen2.5:32b — best quality/speed tradeoff
deepseek-r1:32b — strong reasoning

Note: Extraction quality depends heavily on model capability. Cloud APIs (especially Claude Sonnet and DeepSeek V3) produce significantly better results than small local models. For production use, we recommend cloud providers.

Paper Sources

Source	Papers	Full Text	Access
arXiv	2.5M	LaTeX/PDF	Open
PMC OA	7.2M	XML	Open
OSTI	3.4M	Mixed	Open
OpenAlex	250M+	Abstract	Mixed

Architecture

See DESIGN.md for the complete system design, including:

Two-layer extraction (facts + cross-domain abstraction)
Paper discovery and tracking
Quality assurance (schema validation, honeypots, consensus)
Model compatibility and contributor workflow

OpenClaw Skill

This project includes an OpenClaw skill for automated extraction. Install it to let your OpenClaw agent process papers autonomously.

Contributing

See docs/CONTRIBUTING.md for detailed contributor guide.

Short version: pip install → discovery config → discovery run --auto-submit → walk away.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
.github/workflows		.github/workflows
discovery		discovery
docs		docs
honeypots		honeypots
prompts		prompts
results		results
schemas		schemas
scripts		scripts
skills/discovery-extract		skills/discovery-extract
submissions		submissions
.gitignore		.gitignore
DESIGN.md		DESIGN.md
LICENSE		LICENSE
README.md		README.md
processed_papers.jsonl		processed_papers.jsonl
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Discovery Engine

Quick Start

How It Works

Commands

Supported LLM Providers

Local LLMs

Paper Sources

Architecture

OpenClaw Skill

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Discovery Engine

Quick Start

How It Works

Commands

Supported LLM Providers

Local LLMs

Paper Sources

Architecture

OpenClaw Skill

Contributing

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages