Structured Data & Semantic Tagging Pipeline — A distributed pipeline that turns raw text (conversations, web pages, surveys) into structured, semantically tagged data for downstream AI and search applications. Enables low-cost, scalable semantic annotation and metadata generation using LLMs and embedding-based quality scoring.
- Coordinators establish ground-truth tags and embeddings on full content, split it into windows, and score worker outputs.
- Workers run LLM inference on text windows and return tags and embeddings.
- Scoring uses cosine similarity between each worker’s output and the coordinator’s ground-truth tag embeddings (weighted combination of top-3 mean, mean, median, max; penalties for low overlap or too few tags).
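The scoring rule above can be sketched in a few lines of NumPy. This is a hedged illustration: the function name, the specific weights (0.4/0.3/0.2/0.1), the minimum-tag threshold, and the penalty factor are assumptions for demonstration, not the repo's actual constants, and the low-overlap penalty is omitted for brevity.

```python
# Illustrative sketch of embedding-based scoring: names, weights, and the
# penalty factor are assumptions, not the pipeline's actual values.
import numpy as np

def score_worker(worker_embs: np.ndarray, truth_embs: np.ndarray,
                 min_tags: int = 3) -> float:
    """Score one worker's tag embeddings against ground-truth tag embeddings."""
    # Normalize rows so plain dot products become cosine similarities.
    w = worker_embs / np.linalg.norm(worker_embs, axis=1, keepdims=True)
    t = truth_embs / np.linalg.norm(truth_embs, axis=1, keepdims=True)
    sims = w @ t.T                     # (n_worker, n_truth) similarity matrix
    best = sims.max(axis=1)            # best ground-truth match per worker tag

    # Weighted combination of top-3 mean, mean, median, and max (weights assumed).
    top3_mean = np.sort(best)[-3:].mean()
    parts = np.array([top3_mean, best.mean(), np.median(best), best.max()])
    score = float(parts @ np.array([0.4, 0.3, 0.2, 0.1]))

    # Penalize workers that return too few tags (factor is an assumption).
    if len(best) < min_tags:
        score *= 0.5
    return score
```

A worker whose tag embeddings exactly match the ground truth scores 1.0; returning fewer than `min_tags` tags halves the score under these assumed constants.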
The pipeline supports multiple LLM providers (OpenAI, Anthropic, OpenRouter, Chutes, Groq) and can run against any conversation server that implements the same API.
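One way such a unified backend can be organized is a small provider registry that maps a provider name to a completion function. The sketch below is an assumption about the shape of that interface, not the repo's actual API; real provider entries would call the corresponding SDK instead of the stub shown.

```python
# Hedged sketch of a unified multi-provider dispatch; all names are
# illustrative assumptions, not the project's real interface.
from typing import Callable, Dict

# Each provider registers a function: (model, prompt) -> completion text.
PROVIDERS: Dict[str, Callable[[str, str], str]] = {}

def register(name: str):
    def wrap(fn: Callable[[str, str], str]):
        PROVIDERS[name] = fn
        return fn
    return wrap

@register("openai")
def _openai(model: str, prompt: str) -> str:
    # A real implementation would call the OpenAI SDK here; stubbed for the sketch.
    return f"[openai:{model}] {prompt}"

def complete(provider: str, model: str, prompt: str) -> str:
    """Dispatch a completion request to the named provider."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    return PROVIDERS[provider](model, prompt)
```

Adding Anthropic, OpenRouter, Chutes, or Groq support then means registering one more function, leaving callers provider-agnostic.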
| Category | Technologies |
|---|---|
| Language & runtime | Python 3.8–3.11 |
| ML / embeddings | PyTorch, NumPy, OpenAI Embeddings (e.g. text-embedding-3-large), cosine similarity |
| LLM inference | OpenAI, Anthropic, OpenRouter, Chutes, Groq (unified backend) |
| Backend & API | FastAPI, Pydantic |
| Data & storage | SQLite, Hugging Face datasets |
| Observability | Weights & Biases, Prometheus |
| DevOps | Docker, Docker Compose, PM2 |
| Testing | pytest, mocks for LLM/API |
Requirements: Python 3.8–3.11, OpenAI API key (and optionally WandB for coordinators).
git clone <this-repo> cgp && cd cgp
pip install -r requirements.txt
cp env.example .env
# Edit .env: set OPENAI_API_KEY and, for coordinators, WANDB_API_KEY and the conversation-server API key if needed.

Run tests (full loop: coordinator + workers, local):
python3 -m venv test_venv && source test_venv/bin/activate
pip install -r requirements_test.txt
python -m pytest -s --disable-warnings tests/test_full_loop.py

Run a worker:
python3 -m neurons.miner --netuid 33 --wallet.name <coldkey> --wallet.hotkey <hotkey> --axon.port <port>

Run a coordinator:
python3 -m neurons.validator --netuid 33 --wallet.name <coldkey> --wallet.hotkey <hotkey> --axon.port <port>

Use `--subtensor.network test --netuid 138` for testnet. For Docker, Runpod, PM2, and coordinator API key generation, see `docs/`.
Coordinators can use a public conversation API or their own. To use your own data:
- Run the FastAPI server in `web/` (see `web/README` or `start_conversation_store.sh`) and point `.env` at it (`CGP_API_READ_HOST`, `CGP_API_WRITE_HOST`, etc.).
- For coordinator access to a shared conversation server, generate an API key (see `docs/generate-validator-api-key.md`) and place the generated file (e.g. `readyai_api_data.json`) in the coordinator run directory.
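The host variables above can be read like any other environment settings. A minimal sketch, assuming the variable names from the text (`CGP_API_READ_HOST`, `CGP_API_WRITE_HOST`); the helper function and the localhost defaults are illustrative assumptions:

```python
# Hedged sketch of loading conversation-server settings from the environment.
# The variable names come from the docs; the defaults are assumptions.
import os

def conversation_server_config() -> dict:
    return {
        "read_host": os.environ.get("CGP_API_READ_HOST", "http://localhost:8000"),
        "write_host": os.environ.get("CGP_API_WRITE_HOST", "http://localhost:8000"),
    }
```

In practice a `.env` loader (e.g. python-dotenv) would populate these variables before the pipeline starts.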
MIT. See repository for full license text.