Conversation Genome Project

License: MIT

Structured Data & Semantic Tagging Pipeline — A distributed pipeline that turns raw text (conversations, web pages, surveys) into structured, semantically tagged data for downstream AI and search applications. Enables low-cost, scalable semantic annotation and metadata generation using LLMs and embedding-based quality scoring.


What it does

  • Coordinators establish ground-truth tags and embeddings on full content, split it into windows, and score worker outputs.
  • Workers run LLM inference on text windows and return tags and embeddings.
  • Scoring uses cosine similarity between each worker’s output and the coordinator’s ground-truth tag embeddings (weighted combination of top-3 mean, mean, median, max; penalties for low overlap or too few tags).
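The weighted-combination scoring above can be sketched as follows. This is an illustrative sketch only: the function name, the weights, and the penalty thresholds are assumptions, not the repository's actual values.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def score_worker(worker_embs, truth_embs, min_tags=3,
                 weights=(0.4, 0.3, 0.2, 0.1)):
    """Score one worker's tag embeddings against the coordinator's
    ground-truth tag embeddings.

    For each worker embedding, take its best cosine similarity to any
    ground-truth embedding, then combine top-3 mean, mean, median, and
    max with the (illustrative) weights. Workers returning fewer than
    `min_tags` tags are penalized proportionally.
    """
    if len(worker_embs) == 0:
        return 0.0
    # Best match of each worker tag against the ground-truth set,
    # sorted descending so sims[:3] is the top 3.
    sims = sorted(
        (max(cosine(w, t) for t in truth_embs) for w in worker_embs),
        reverse=True,
    )
    combined = (weights[0] * float(np.mean(sims[:3]))
                + weights[1] * float(np.mean(sims))
                + weights[2] * float(np.median(sims))
                + weights[3] * float(max(sims)))
    if len(sims) < min_tags:  # penalty for too few tags
        combined *= len(sims) / min_tags
    return combined
```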

The pipeline supports multiple LLM providers (OpenAI, Anthropic, OpenRouter, Chutes, Groq) and can run against any conversation server that implements the same API.
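A unified multi-provider backend of this kind typically amounts to one inference interface with interchangeable per-provider clients. A minimal sketch, with class and method names that are illustrative assumptions rather than the repository's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class LlmBackend:
    """Hypothetical unified backend: each provider registers a completion
    function with the same signature, so callers never branch on provider."""
    providers: Dict[str, Callable[[str, str], str]]

    def complete(self, provider: str, model: str, prompt: str) -> str:
        if provider not in self.providers:
            raise ValueError(f"unknown provider: {provider}")
        return self.providers[provider](model, prompt)


# Register a stub "openai" provider; a real client would call the
# provider's chat-completions endpoint here instead.
backend = LlmBackend(providers={
    "openai": lambda model, prompt: f"[{model}] tags for: {prompt}",
})
```

Adding Anthropic, OpenRouter, Chutes, or Groq then means registering one more callable, with no changes at the call sites.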


Tech stack

Language & runtime: Python 3.8–3.11
ML / embeddings: PyTorch, NumPy, OpenAI Embeddings (e.g. text-embedding-3-large), cosine similarity
LLM inference: OpenAI, Anthropic, OpenRouter, Chutes, Groq (unified backend)
Backend & API: FastAPI, Pydantic
Data & storage: SQLite, Hugging Face datasets
Observability: Weights & Biases, Prometheus
DevOps: Docker, Docker Compose, PM2
Testing: pytest, mocks for LLM/API

Quick start

Requirements: Python 3.8–3.11, OpenAI API key (and optionally WandB for coordinators).

git clone <this-repo> cgp && cd cgp
pip install -r requirements.txt
cp env.example .env
# Edit .env: set OPENAI_API_KEY and, for coordinators, WANDB_API_KEY and conversation-server API key if needed.

Run tests (full loop: coordinator + workers, local):

python3 -m venv test_venv && source test_venv/bin/activate
pip install -r requirements_test.txt
python -m pytest -s --disable-warnings tests/test_full_loop.py

Run a worker:

python3 -m neurons.miner --netuid 33 --wallet.name <coldkey> --wallet.hotkey <hotkey> --axon.port <port>

Run a coordinator:

python3 -m neurons.validator --netuid 33 --wallet.name <coldkey> --wallet.hotkey <hotkey> --axon.port <port>

Use --subtensor.network test --netuid 138 for testnet. For Docker, Runpod, PM2, and coordinator API key generation, see docs/.


Custom conversation server

Coordinators can use a public conversation API or their own. To use your own data:

  1. Run the FastAPI server in web/ (see web/README or start_conversation_store.sh) and point .env at it (CGP_API_READ_HOST, CGP_API_WRITE_HOST, etc.).
  2. For coordinator access to a shared conversation server, generate an API key: docs/generate-validator-api-key.md. Place the generated file (e.g. readyai_api_data.json) in the coordinator run directory.

License

MIT. See repository for full license text.
