DeepContext

A semantic chunking service for documents, GitHub repos, webpages, and Confluence. It creates semantically relevant chunks from your content that can be queried later.

Features

Multi-source ingestion: Markdown files, GitHub repositories, webpages, Confluence pages
Website crawling: Ingest entire documentation sites via sitemap or link crawling
Semantic chunking: Intelligent text chunking with code-awareness
Code-aware: Extracts code blocks with context and language detection
Per-library indexing: Each library/site gets its own Qdrant collection plus a central libraries index
Vector search: Powered by Qdrant for fast semantic search
Fast embeddings: Uses FastEmbed / BAAI bge models for efficient embedding generation
HTTP API + background worker: Queue-based ingestion API with an async worker that processes jobs in the background
MCP server: First-class Model Context Protocol (MCP) server exposing DeepContext as a documentation tool

Quick Start

1. Start Qdrant

# Using Docker
docker compose up -d

# Or using Podman
podman compose up -d

2. Install dependencies

uv sync

3. Ingest content (CLI)

# From a local markdown file
uv run deepcontext ingest ./docs/README.md

# From a GitHub URL
uv run deepcontext ingest https://github.com/vercel/next.js/blob/canary/contributing/core/developing.md

# From a webpage
uv run deepcontext ingest https://nextjs.org/docs/app/building-your-application/routing/middleware

# From a Confluence page (requires auth)
export CONFLUENCE_EMAIL=you@company.com
export CONFLUENCE_TOKEN=your_api_token
uv run deepcontext ingest https://company.atlassian.net/wiki/spaces/DOCS/pages/123456/Page-Title

# From an entire Confluence space
uv run deepcontext ingest-confluence https://company.atlassian.net/wiki DOCS --limit 50

# From an entire website (uses sitemap or crawls links)
uv run deepcontext ingest-website https://ui.shadcn.com/docs --max-pages 50

4. Search (CLI)

uv run deepcontext search "middleware authentication"

5. Run the HTTP API + MCP server (optional)

You can also run DeepContext as a long-lived HTTP service that exposes:

REST API for ingestion, search, stats, and library metadata
MCP server mounted at /mcp for streamable-http MCP clients

# Start the API + MCP service (FastAPI + Uvicorn)
uv run deepcontext serve --host 0.0.0.0 --port 8000

# Open interactive docs
#   http://localhost:8000/docs

For stdio-based MCP clients (e.g. local tools), you can also run:

uv run deepcontext-mcp

and configure your MCP client to talk to the deepcontext-mcp command.

Example Output

Search Results for: how to redirect in middleware

────────────────────────────────────────────────────────────

### Example

Source: /path/to/middleware.md
Score: 0.6693

Example typescript code demonstrating implementation.

┌──────────────────────────────────────────────────────────┐
│ import { NextResponse } from 'next/server'               │
│ import type { NextRequest } from 'next/server'           │
│                                                          │
│ export function middleware(request: NextRequest) {       │
│   return NextResponse.redirect(new URL('/home', ...))    │
│ }                                                        │
│                                                          │
│ export const config = {                                  │
│   matcher: '/about/:path*',                              │
│ }                                                        │
└──────────────────────────────────────────────────────────┘

────────────────────────────────────────────────────────────

CLI Commands

`deepcontext ingest <source>`

Ingest content from any supported source.

SOURCE can be:

A local file path (e.g., ./docs/README.md)
A GitHub URL (e.g., https://github.com/vercel/next.js/blob/canary/docs/...)
A Confluence page (e.g., https://company.atlassian.net/wiki/spaces/DOCS/pages/123456)
A webpage URL (e.g., https://example.com/docs)

Options:

--chunk-size: Maximum chunk size in tokens (default: 1024)
--threshold: Similarity threshold for semantic chunking (default: 0.7)
--code-aware/--no-code-aware: Use code-aware chunking (default: enabled)
--collection: Qdrant collection name (default: deepcontext_chunks)
--host: Qdrant host (default: localhost)
--port: Qdrant port (default: 6333)

`deepcontext ingest-repo <owner/repo>`

Ingest all documentation from a GitHub repository.

Options:

--branch: Branch to fetch from (default: auto-detect)
--path: Path within the repo to start from
--extensions: File extensions to fetch (default: .md,.mdx)
--collection: Qdrant collection name (default: deepcontext_chunks)
--host: Qdrant host (default: localhost)
--port: Qdrant port (default: 6333)

Authentication: Set GITHUB_TOKEN environment variable for higher rate limits (5,000/hour vs 60/hour for unauthenticated requests).

# Create token at https://github.com/settings/tokens with 'public_repo' scope
export GITHUB_TOKEN=ghp_your_token_here
uv run deepcontext ingest-repo vercel/next.js --path docs

`deepcontext ingest-confluence <base_url> <space_key>`

Ingest all pages from a Confluence space.

Requires environment variables:

CONFLUENCE_EMAIL: Your Atlassian account email
CONFLUENCE_TOKEN: Your Atlassian API token (create one here)

Options:

--limit: Maximum pages to fetch (default: 100)
--collection: Qdrant collection name (default: deepcontext_chunks)
--host: Qdrant host (default: localhost)
--port: Qdrant port (default: 6333)

Example:

export CONFLUENCE_EMAIL=you@company.com
export CONFLUENCE_TOKEN=your_api_token
uv run deepcontext ingest-confluence https://company.atlassian.net/wiki ENGINEERING

`deepcontext ingest-website <url>`

Ingest all pages from a website using sitemap or link crawling.

The command will:

Try to find and parse sitemap.xml
If no sitemap, crawl links from the base URL
Filter URLs to only those under the base path

Options:

--max-pages: Maximum pages to fetch (default: 100)
--pattern: URL pattern to filter pages (regex)
--no-sitemap: Don't try to use sitemap.xml, crawl links instead
--collection: Qdrant collection name (default: deepcontext_chunks)
--host: Qdrant host (default: localhost)
--port: Qdrant port (default: 6333)

Examples:

# Ingest shadcn docs (uses sitemap automatically)
uv run deepcontext ingest-website https://ui.shadcn.com/docs

# Ingest with URL pattern filter
uv run deepcontext ingest-website https://nextjs.org/docs --pattern "/docs/app/"

# Force crawling instead of sitemap
uv run deepcontext ingest-website https://example.com/docs --no-sitemap

`deepcontext search <query>`

Search for semantically similar chunks.

Options:

--limit: Maximum results (default: 5)
--language: Filter by code language
--source-type: Filter by source type (markdown, github, webpage, confluence)
--collection: Qdrant collection name (default: deepcontext_chunks)
--host: Qdrant host (default: localhost)
--port: Qdrant port (default: 6333)

`deepcontext stats`

Show statistics about the vector store.

Options:

--collection: Qdrant collection name (default: deepcontext_chunks)
--host: Qdrant host (default: localhost)
--port: Qdrant port (default: 6333)

`deepcontext clear`

Clear all data from the vector store. Requires confirmation.

Options:

--collection: Qdrant collection name (default: deepcontext_chunks)
--host: Qdrant host (default: localhost)
--port: Qdrant port (default: 6333)

Queue-based ingestion and job management

DeepContext also supports background ingestion via a simple job queue. This is useful when you want to queue many large ingestions and process them with a worker.

`deepcontext queue <source>`

Queue a single source for ingestion (queued version of deepcontext ingest).

Options mirror deepcontext ingest:

--host, --port, --collection
--chunk-size, --threshold, --code-aware/--no-code-aware

The command prints a job ID; use deepcontext jobs / deepcontext job to inspect it.

`deepcontext queue-repo <owner/repo>`

Queue a GitHub repository for ingestion (queued version of ingest-repo).

Options mirror deepcontext ingest-repo:

--branch, --path, --extensions, --host, --port, --collection

`deepcontext queue-confluence <base_url> <space_key>`

Queue a Confluence space for ingestion (queued version of ingest-confluence).

Options mirror deepcontext ingest-confluence:

--limit, --host, --port, --collection

`deepcontext queue-website <url>`

Queue a website for ingestion (queued version of ingest-website).

Options mirror deepcontext ingest-website:

--max-pages, --pattern, --no-sitemap, --host, --port, --collection

`deepcontext jobs`

List jobs in the queue with optional filters:

--status: pending, processing, completed, failed
--limit: maximum number of jobs to show (default: 20)

`deepcontext job <job_id>`

Show details for a specific job, including status, counts, timestamps and any error message.

`deepcontext worker`

Run a background worker that polls the queue and processes ingestion jobs:

--once: process a single job and exit
--poll-interval: seconds between polling for new jobs (default: 2.0)

This is useful alongside the HTTP API, which also uses the same queue internally.

`deepcontext cancel-job <job_id>`

Cancel/delete a job from the queue. If the job is currently processing, it is removed from the queue but may still complete.

`deepcontext clear-jobs`

Clear all completed and failed jobs from the queue.

`deepcontext serve`

Start the HTTP API server (FastAPI + Uvicorn) with the MCP HTTP server mounted at /mcp:

uv run deepcontext serve --host 0.0.0.0 --port 8000

Options:

--host: Host to bind to (default: 0.0.0.0)
--port: Port to bind to (default: 8000)
--reload: Enable auto-reload for development

Architecture

┌─────────────────┐     ┌─────────────────┐     ┌────────────────────────────┐
│  ContentFetcher │ ──▶ │ DocumentChunker │ ──▶ │   VectorStore (per-lib)    │
│                 │     │ /SemanticChunker│     │    + LibraryStore (meta)   │
│ - Markdown      │     │ - Code-aware    │     │                            │
│ - GitHub        │     │ - Section-based │     │  Qdrant collections:       │
│ - Webpages      │     │ - Smart titles  │     │  - one per library/site    │
│ - Websites      │     │                 │     │  - one global `libraries`  │
│ - Confluence    │     │                 │     │                            │
└─────────────────┘     └─────────────────┘     └────────────────────────────┘

                 ┌──────────────────────────────────────────────────────────┐
                 │                   DeepContext Service                    │
                 │   - HTTP API (/ingest, /search, /libraries, /stats, …)   │
                 │   - Background worker processing queued jobs             │
                 │   - MCP server (streamable-http) mounted at `/mcp`       │
                 └──────────────────────────────────────────────────────────┘

Python API

from deepcontext import ContentFetcher, VectorStore
from deepcontext.chunker import ChunkConfig, DocumentChunker

# Fetch content
with ContentFetcher() as fetcher:
    doc = fetcher.fetch("https://github.com/vercel/next.js/blob/canary/docs/...")

# Chunk with code awareness
config = ChunkConfig(
    max_chunk_size=2000,  # Max characters per chunk
    min_chunk_size=200,   # Min characters for a chunk
    overlap_size=100,     # Overlap between chunks
)
chunker = DocumentChunker(config)
chunks = chunker.chunk_document(doc)

# Store and search
store = VectorStore()
store.index_chunks(chunks)

results = store.search("middleware authentication")
for result in results:
    print(result.chunk.to_display_format())
    print(f"Score: {result.score}")

store.close()

Fetching multiple documents

from deepcontext import ContentFetcher

with ContentFetcher() as fetcher:
    # Fetch entire GitHub repo
    docs = fetcher.fetch_github_repo("vercel/next.js", path="docs")
    
    # Fetch Confluence space
    docs = fetcher.fetch_confluence_space(
        base_url="https://company.atlassian.net/wiki",
        space_key="DOCS",
        limit=50,
    )
    
    # Fetch entire website
    docs = fetcher.fetch_website(
        base_url="https://ui.shadcn.com/docs",
        max_pages=100,
        url_pattern=r"/docs/",  # Optional regex filter
    )

Environment Variables

Variable	Description
`GITHUB_TOKEN`	GitHub personal access token for higher API rate limits
`CONFLUENCE_EMAIL`	Atlassian account email for Confluence access
`CONFLUENCE_TOKEN`	Atlassian API token for Confluence access
`QDRANT_HOST`	Host for the Qdrant instance used by the HTTP API / MCP server (default: `localhost`)
`QDRANT_PORT`	Port for the Qdrant instance (default: `6333`)
`DEFAULT_COLLECTION`	Default collection name for legacy CLI commands (default: `deepcontext_chunks`)

You can also use a .env file in the project root - it will be loaded automatically.

Requirements

Python 3.12+
Docker or Podman (for Qdrant)

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
deepcontext		deepcontext
.cursorignore		.cursorignore
.gitignore		.gitignore
.python-version		.python-version
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
context7.json		context7.json
docker-compose.yml		docker-compose.yml
main.py		main.py
pyproject.toml		pyproject.toml
uv.lock		uv.lock

License

pranavms13/deepcontext

Folders and files

Latest commit

History

Repository files navigation

DeepContext

Features

Quick Start

1. Start Qdrant

2. Install dependencies

3. Ingest content (CLI)

4. Search (CLI)

5. Run the HTTP API + MCP server (optional)

Example Output

CLI Commands

deepcontext ingest <source>

deepcontext ingest-repo <owner/repo>

deepcontext ingest-confluence <base_url> <space_key>

deepcontext ingest-website <url>

deepcontext search <query>

deepcontext stats

deepcontext clear

Queue-based ingestion and job management

deepcontext queue <source>

deepcontext queue-repo <owner/repo>

deepcontext queue-confluence <base_url> <space_key>

deepcontext queue-website <url>

deepcontext jobs

deepcontext job <job_id>

deepcontext worker

deepcontext cancel-job <job_id>

deepcontext clear-jobs

deepcontext serve

Architecture

Python API

Fetching multiple documents

Environment Variables

Requirements

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

`deepcontext ingest <source>`

`deepcontext ingest-repo <owner/repo>`

`deepcontext ingest-confluence <base_url> <space_key>`

`deepcontext ingest-website <url>`

`deepcontext search <query>`

`deepcontext stats`

`deepcontext clear`

`deepcontext queue <source>`

`deepcontext queue-repo <owner/repo>`

`deepcontext queue-confluence <base_url> <space_key>`

`deepcontext queue-website <url>`

`deepcontext jobs`

`deepcontext job <job_id>`

`deepcontext worker`

`deepcontext cancel-job <job_id>`

`deepcontext clear-jobs`

`deepcontext serve`

Packages