GitHub - rafacm/ragtime: RAGtime ("Retrieval in the Key of Jazz") is a Django application for ingesting jazz-related podcast episodes and powering Scott, a jazz-focused conversational agent.

RAGtime -- Retrieval Augmented Generation (RAG) in the Key of Jazz

Image generated with Nano Banana from the cover of E.L. Doctorow's novel "Ragtime"

What is RAGtime?

RAGtime is a Django application for ingesting jazz-related podcast episodes. It extracts metadata, transcribes audio, identifies jazz entities, and powers Scott — a jazz-focused AI agent that answers questions strictly from ingested episode content, with references to specific episodes and timestamps.

Features

🎙️ Episode Ingestion — Add podcast episodes by URL. RAGtime scrapes metadata (title, description, date, image), downloads audio, and processes it through the pipeline.
📝 Multilingual Transcription — Transcribes episodes using configurable backends (Whisper API by default) with segment and word-level timestamps. Supports multiple languages (English, Spanish, German, Swedish, etc.).
🔍 Entity Extraction — Identifies jazz entities: musicians, musical groups, albums, music venues, recording sessions, record labels, years. Entities are resolved against existing records using LLM-based matching.
📇 Episode Indexing — Splits transcripts into segments and generates multilingual embeddings stored in ChromaDB. Enables cross-language semantic search so Scott can retrieve relevant content regardless of the question's language.
🎷 Scott — Your Jazz AI — A conversational agent that answers questions strictly from ingested episode content. Scott responds in the user's language and provides references to specific episodes and timestamps. Responses stream in real-time.
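
To illustrate the retrieval idea behind Episode Indexing and Scott's timestamped references, here is a minimal sketch of ranking transcript chunks by cosine similarity. It is not RAGtime's code and does not use ChromaDB: the `Chunk` class, the toy two-dimensional vectors, and the helper names are assumptions made for this example. With a multilingual embedding model, a question and a chunk in different languages would ideally land close together in the same vector space.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    episode: str            # hypothetical episode title
    start_seconds: float    # where the chunk begins in the audio
    text: str
    embedding: list         # toy vector; a real one comes from an embedding model


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding, chunks, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_embedding, chunk.embedding),
        reverse=True,
    )
    return ranked[:k]


def format_timestamp(seconds):
    """Render a position in seconds as HH:MM:SS for an episode reference."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"
```

Each chunk returned by `top_k` carries its episode and start time, so an answer can cite both, for example via `format_timestamp(chunk.start_seconds)`.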

Status

RAGtime is under active development.

What's already implemented

Episode ingestion: submit episodes by URL, metadata scraping, audio download, transcription, summarization, chunking, entity extraction and resolution with Wikidata integration.
Episode management UI: Django admin interface to view episode status and metadata and browse extracted entities.
Configuration wizard: interactive manage.py configure command for all RAGTIME_* env vars.
LLM observability: optional Langfuse integration for tracing and monitoring LLM calls across the pipeline.
Agent-based recovery: Pydantic AI agent with Playwright browser automation recovers from scraping and downloading failures automatically.

See CHANGELOG.md for the full list of implemented features, fixes, implementation plans, feature documentation and session transcripts.

What's coming

Embed step (pipeline step 9): generate multilingual embeddings for transcript chunks and store them in ChromaDB.
Scott — the RAG chatbot (pipeline step 10 + chat app): conversational agent that answers questions strictly from ingested content, with episode/timestamp references, multilingual support, and streaming responses.

Processing Pipeline

Each step updates the episode's status field. A post_save signal dispatches the next step as an async Django Q2 task. A step that fails with an exception triggers the recovery layer.

#	Step	Status	Description
1	📥 Submit	`pending`	User submits an episode URL
2	🕷️ Scrape	`scraping`	Extract metadata and detect language
3	⬇️ Download	`downloading`	Download audio and extract duration
4	🎙️ Transcribe	`transcribing`	Whisper API transcription with timestamps
5	📋 Summarize	`summarizing`	LLM-generated episode summary
6	✂️ Chunk	`chunking`	Split transcript into ~150-word chunks
7	🔍 Extract	`extracting`	Named entity recognition per chunk
8	🧩 Resolve	`resolving`	Entity linking and deduplication via Wikidata
9	📐 Embed	`embedding`	Multilingual embeddings into ChromaDB
10	✅ Ready	`ready`	Episode available for Scott to query

Steps 9–10 (Embed, Ready) are planned and not yet implemented.
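
The status-driven flow in the table above can be modelled as a small state machine. The sketch below is framework-free and only illustrates the idea. The status names come from the table, but `next_status`, `handle_saved`, and the injected `enqueue` callable are hypothetical. In the real project the equivalent logic would live in a Django `post_save` receiver that queues a Django Q2 task, and this README does not show how that handler is written.

```python
from typing import Callable, Optional

# Status values in pipeline order, as listed in the table.
STATUSES = [
    "pending", "scraping", "downloading", "transcribing", "summarizing",
    "chunking", "extracting", "resolving", "embedding", "ready",
]


def next_status(current: str) -> Optional[str]:
    """Return the status that follows `current`, or None at the end."""
    position = STATUSES.index(current)  # raises ValueError for unknown statuses
    if position + 1 < len(STATUSES):
        return STATUSES[position + 1]
    return None


def handle_saved(episode_id: int, status: str,
                 enqueue: Callable[[str, int], None]) -> Optional[str]:
    """Stand-in for a post_save receiver: queue work for the following step.

    `enqueue` is an assumption standing in for a task-queue call.
    """
    following = next_status(status)
    if following is not None:
        enqueue(following, episode_id)
    return following
```

Passing `enqueue` as an argument keeps the sketch testable without a running task queue: a list's `append` wrapped in a lambda is enough to observe what would have been dispatched.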

See the full pipeline documentation for per-step details, entity types, and the recovery layer.
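
As an illustration of the Chunk step (step 6), the following sketch groups timestamped transcript segments into chunks of roughly 150 words while keeping each chunk's start and end time. It assumes a greedy strategy and hypothetical `Segment` and `TranscriptChunk` types. RAGtime's actual chunking rules may differ, so treat this as one plausible approach.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass
class TranscriptChunk:
    start: float
    end: float
    text: str


def _merge(group):
    """Combine consecutive segments into one chunk spanning their time range."""
    return TranscriptChunk(
        start=group[0].start,
        end=group[-1].end,
        text=" ".join(segment.text for segment in group),
    )


def chunk_segments(segments, target_words=150):
    """Greedily add segments to a chunk until it reaches target_words."""
    chunks = []
    current = []
    word_count = 0
    for segment in segments:
        current.append(segment)
        word_count += len(segment.text.split())
        if word_count >= target_words:
            chunks.append(_merge(current))
            current, word_count = [], 0
    if current:  # keep the shorter remainder as a final chunk
        chunks.append(_merge(current))
    return chunks
```

Because segments are never split, chunks can overshoot the target by up to one segment. The benefit is that every chunk boundary lines up with a transcription timestamp.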

Documentation

Detailed documentation lives in the doc/ directory:

Full pipeline documentation — per-step details, entity types, recovery layer
How Scott works — RAG architecture and query flow
LLM observability with Langfuse — tracing setup and traced steps
Architecture diagrams — processing pipeline diagram
Feature documentation — per-feature docs with problem, changes, and verification
Plans — implementation plans
Session transcripts — planning and implementation session logs

Getting Started

Prerequisites

Python 3.13+
uv
ffmpeg (for audio downsampling)
wget (for audio downloading)
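
If you want to check these prerequisites before installing, a small script along the following lines could help. It is a sketch, not part of RAGtime. It uses only the standard library, and the function names are made up for this example.

```python
import shutil
import sys

# Command-line tools listed under Prerequisites.
REQUIRED_TOOLS = ("uv", "ffmpeg", "wget")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_is_supported(version_info=sys.version_info, minimum=(3, 13)):
    """True when the interpreter is at least the minimum (major, minor)."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.13+ is required")
    for tool in missing_tools():
        print(f"Missing prerequisite: {tool}")
```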

Installation

git clone <repo-url>
cd ragtime
uv sync

Optional dependency groups:

Extra	Install command	Description
`observability`	`uv sync --extra observability`	LLM observability via Langfuse
`recovery`	`uv sync --extra recovery`	Agent recovery with Pydantic AI + Playwright

Set up the database, create an admin account, and start the services:

uv run python manage.py migrate
uv run python manage.py createsuperuser   # Create an admin user for the Django admin UI
uv run python manage.py load_entity_types # Seed initial entity types
uv run python manage.py configure         # Interactive setup wizard for RAGTIME_* env vars
uv run python manage.py runserver         # Start the web server
uv run python manage.py qcluster          # Start the Django Q2 task worker (separate terminal)

Configuration

Run uv run python manage.py configure to launch an interactive setup wizard for all RAGTIME_* env vars.

Alternatively, copy .env.sample to .env and fill in your values.
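
To show what "RAGTIME_* env vars" means in practice, here is a minimal sketch that reads KEY=VALUE lines and keeps only the entries with the RAGTIME_ prefix. The parser and the variable names in the usage note are hypothetical. The actual names are in .env.sample, and how the project loads .env is not described in this README.

```python
def parse_ragtime_settings(lines):
    """Collect RAGTIME_* entries from .env-style lines.

    Blank lines, comments, and lines without '=' are ignored.
    Surrounding single or double quotes are stripped from values.
    """
    settings = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("RAGTIME_"):
            settings[key] = value.strip().strip('"').strip("'")
    return settings
```

For example, `parse_ragtime_settings(open(".env"))` would return a dictionary of the configured values, and a made-up line such as `RAGTIME_EXAMPLE_KEY=abc` would appear as `{"RAGTIME_EXAMPLE_KEY": "abc"}`.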

Tech Stack

Runtime: Python 3.13
Framework: Django 5.2
Database: SQLite
Vector Store: ChromaDB
Task Queue: Django Q2
AI Agents: Pydantic AI (recovery agent)
Transcription: Configurable — Whisper API (default), local Whisper, etc.
LLM: Configurable — Claude (Anthropic), GPT (OpenAI), etc.
Embeddings: Configurable — must support multilingual models for cross-language retrieval
Frontend: Django templates + HTMX + Tailwind CSS
Package Manager: uv

License

This project is licensed under the MIT License — see the LICENSE file for details.
