Skip to content

A semantic chunking service for documents, GitHub repos, webpages, and Confluence. It creates semantically relevant chunks from your content that can be queried later.

License

Notifications You must be signed in to change notification settings

pranavms13/deepcontext

Repository files navigation

DeepContext

A semantic chunking service for documents, GitHub repos, webpages, and Confluence. It creates semantically relevant chunks from your content that can be queried later.

Features

  • Multi-source ingestion: Markdown files, GitHub repositories, webpages, Confluence pages
  • Website crawling: Ingest entire documentation sites via sitemap or link crawling
  • Semantic chunking: Intelligent text chunking with code-awareness
  • Code-aware: Extracts code blocks with context and language detection
  • Per-library indexing: Each library/site gets its own Qdrant collection plus a central libraries index
  • Vector search: Powered by Qdrant for fast semantic search
  • Fast embeddings: Uses FastEmbed / BAAI bge models for efficient embedding generation
  • HTTP API + background worker: Queue-based ingestion API with an async worker that processes jobs in the background
  • MCP server: First-class Model Context Protocol (MCP) server exposing DeepContext as a documentation tool

Quick Start

1. Start Qdrant

# Using Docker
docker compose up -d

# Or using Podman
podman compose up -d

2. Install dependencies

uv sync

3. Ingest content (CLI)

# From a local markdown file
uv run deepcontext ingest ./docs/README.md

# From a GitHub URL
uv run deepcontext ingest https://github.com/vercel/next.js/blob/canary/contributing/core/developing.md

# From a webpage
uv run deepcontext ingest https://nextjs.org/docs/app/building-your-application/routing/middleware

# From a Confluence page (requires auth)
export CONFLUENCE_EMAIL=you@company.com
export CONFLUENCE_TOKEN=your_api_token
uv run deepcontext ingest https://company.atlassian.net/wiki/spaces/DOCS/pages/123456/Page-Title

# From an entire Confluence space
uv run deepcontext ingest-confluence https://company.atlassian.net/wiki DOCS --limit 50

# From an entire website (uses sitemap or crawls links)
uv run deepcontext ingest-website https://ui.shadcn.com/docs --max-pages 50

4. Search (CLI)

uv run deepcontext search "middleware authentication"

5. Run the HTTP API + MCP server (optional)

You can also run DeepContext as a long-lived HTTP service that exposes:

  • REST API for ingestion, search, stats, and library metadata
  • MCP server mounted at /mcp for streamable-http MCP clients
# Start the API + MCP service (FastAPI + Uvicorn)
uv run deepcontext serve --host 0.0.0.0 --port 8000

# Open interactive docs
#   http://localhost:8000/docs

For stdio-based MCP clients (e.g. local tools), you can also run:

uv run deepcontext-mcp

and configure your MCP client to talk to the deepcontext-mcp command.

Example Output

Search Results for: how to redirect in middleware

────────────────────────────────────────────────────────────

### Example

Source: /path/to/middleware.md
Score: 0.6693

Example typescript code demonstrating implementation.

┌──────────────────────────────────────────────────────────┐
│ import { NextResponse } from 'next/server'               │
│ import type { NextRequest } from 'next/server'           │
│                                                          │
│ export function middleware(request: NextRequest) {       │
│   return NextResponse.redirect(new URL('/home', ...))    │
│ }                                                        │
│                                                          │
│ export const config = {                                  │
│   matcher: '/about/:path*',                              │
│ }                                                        │
└──────────────────────────────────────────────────────────┘

────────────────────────────────────────────────────────────

CLI Commands

deepcontext ingest <source>

Ingest content from any supported source.

SOURCE can be:

  • A local file path (e.g., ./docs/README.md)
  • A GitHub URL (e.g., https://github.com/vercel/next.js/blob/canary/docs/...)
  • A Confluence page (e.g., https://company.atlassian.net/wiki/spaces/DOCS/pages/123456)
  • A webpage URL (e.g., https://example.com/docs)

Options:

  • --chunk-size: Maximum chunk size in tokens (default: 1024)
  • --threshold: Similarity threshold for semantic chunking (default: 0.7)
  • --code-aware/--no-code-aware: Use code-aware chunking (default: enabled)
  • --collection: Qdrant collection name (default: deepcontext_chunks)
  • --host: Qdrant host (default: localhost)
  • --port: Qdrant port (default: 6333)

deepcontext ingest-repo <owner/repo>

Ingest all documentation from a GitHub repository.

Options:

  • --branch: Branch to fetch from (default: auto-detect)
  • --path: Path within the repo to start from
  • --extensions: File extensions to fetch (default: .md,.mdx)
  • --collection: Qdrant collection name (default: deepcontext_chunks)
  • --host: Qdrant host (default: localhost)
  • --port: Qdrant port (default: 6333)

Authentication: Set GITHUB_TOKEN environment variable for higher rate limits (5,000/hour vs 60/hour for unauthenticated requests).

# Create token at https://github.com/settings/tokens with 'public_repo' scope
export GITHUB_TOKEN=ghp_your_token_here
uv run deepcontext ingest-repo vercel/next.js --path docs

deepcontext ingest-confluence <base_url> <space_key>

Ingest all pages from a Confluence space.

Requires environment variables:

  • CONFLUENCE_EMAIL: Your Atlassian account email
  • CONFLUENCE_TOKEN: Your Atlassian API token (create one here)

Options:

  • --limit: Maximum pages to fetch (default: 100)
  • --collection: Qdrant collection name (default: deepcontext_chunks)
  • --host: Qdrant host (default: localhost)
  • --port: Qdrant port (default: 6333)

Example:

export CONFLUENCE_EMAIL=you@company.com
export CONFLUENCE_TOKEN=your_api_token
uv run deepcontext ingest-confluence https://company.atlassian.net/wiki ENGINEERING

deepcontext ingest-website <url>

Ingest all pages from a website using sitemap or link crawling.

The command will:

  1. Try to find and parse sitemap.xml
  2. If no sitemap, crawl links from the base URL
  3. Filter URLs to only those under the base path

Options:

  • --max-pages: Maximum pages to fetch (default: 100)
  • --pattern: URL pattern to filter pages (regex)
  • --no-sitemap: Don't try to use sitemap.xml, crawl links instead
  • --collection: Qdrant collection name (default: deepcontext_chunks)
  • --host: Qdrant host (default: localhost)
  • --port: Qdrant port (default: 6333)

Examples:

# Ingest shadcn docs (uses sitemap automatically)
uv run deepcontext ingest-website https://ui.shadcn.com/docs

# Ingest with URL pattern filter
uv run deepcontext ingest-website https://nextjs.org/docs --pattern "/docs/app/"

# Force crawling instead of sitemap
uv run deepcontext ingest-website https://example.com/docs --no-sitemap

deepcontext search <query>

Search for semantically similar chunks.

Options:

  • --limit: Maximum results (default: 5)
  • --language: Filter by code language
  • --source-type: Filter by source type (markdown, github, webpage, confluence)
  • --collection: Qdrant collection name (default: deepcontext_chunks)
  • --host: Qdrant host (default: localhost)
  • --port: Qdrant port (default: 6333)

deepcontext stats

Show statistics about the vector store.

Options:

  • --collection: Qdrant collection name (default: deepcontext_chunks)
  • --host: Qdrant host (default: localhost)
  • --port: Qdrant port (default: 6333)

deepcontext clear

Clear all data from the vector store. Requires confirmation.

Options:

  • --collection: Qdrant collection name (default: deepcontext_chunks)
  • --host: Qdrant host (default: localhost)
  • --port: Qdrant port (default: 6333)

Queue-based ingestion and job management

DeepContext also supports background ingestion via a simple job queue. This is useful when you want to queue many large ingestions and process them with a worker.

deepcontext queue <source>

Queue a single source for ingestion (queued version of deepcontext ingest).

Options mirror deepcontext ingest:

  • --host, --port, --collection
  • --chunk-size, --threshold, --code-aware/--no-code-aware

The command prints a job ID; use deepcontext jobs / deepcontext job to inspect it.

deepcontext queue-repo <owner/repo>

Queue a GitHub repository for ingestion (queued version of ingest-repo).

Options mirror deepcontext ingest-repo:

  • --branch, --path, --extensions, --host, --port, --collection

deepcontext queue-confluence <base_url> <space_key>

Queue a Confluence space for ingestion (queued version of ingest-confluence).

Options mirror deepcontext ingest-confluence:

  • --limit, --host, --port, --collection

deepcontext queue-website <url>

Queue a website for ingestion (queued version of ingest-website).

Options mirror deepcontext ingest-website:

  • --max-pages, --pattern, --no-sitemap, --host, --port, --collection

deepcontext jobs

List jobs in the queue with optional filters:

  • --status: pending, processing, completed, failed
  • --limit: maximum number of jobs to show (default: 20)

deepcontext job <job_id>

Show details for a specific job, including status, counts, timestamps and any error message.

deepcontext worker

Run a background worker that polls the queue and processes ingestion jobs:

  • --once: process a single job and exit
  • --poll-interval: seconds between polling for new jobs (default: 2.0)

This is useful alongside the HTTP API, which also uses the same queue internally.

deepcontext cancel-job <job_id>

Cancel/delete a job from the queue. If the job is currently processing, it is removed from the queue but may still complete.

deepcontext clear-jobs

Clear all completed and failed jobs from the queue.

deepcontext serve

Start the HTTP API server (FastAPI + Uvicorn) with the MCP HTTP server mounted at /mcp:

uv run deepcontext serve --host 0.0.0.0 --port 8000

Options:

  • --host: Host to bind to (default: 0.0.0.0)
  • --port: Port to bind to (default: 8000)
  • --reload: Enable auto-reload for development

Architecture

┌─────────────────┐     ┌─────────────────┐     ┌────────────────────────────┐
│  ContentFetcher │ ──▶ │ DocumentChunker │ ──▶ │   VectorStore (per-lib)    │
│                 │     │ /SemanticChunker│     │    + LibraryStore (meta)   │
│ - Markdown      │     │ - Code-aware    │     │                            │
│ - GitHub        │     │ - Section-based │     │  Qdrant collections:       │
│ - Webpages      │     │ - Smart titles  │     │  - one per library/site    │
│ - Websites      │     │                 │     │  - one global `libraries`  │
│ - Confluence    │     │                 │     │                            │
└─────────────────┘     └─────────────────┘     └────────────────────────────┘

                 ┌──────────────────────────────────────────────────────────┐
                 │                   DeepContext Service                    │
                 │   - HTTP API (/ingest, /search, /libraries, /stats, …)   │
                 │   - Background worker processing queued jobs             │
                 │   - MCP server (streamable-http) mounted at `/mcp`       │
                 └──────────────────────────────────────────────────────────┘

Python API

from deepcontext import ContentFetcher, VectorStore
from deepcontext.chunker import ChunkConfig, DocumentChunker

# Fetch content
with ContentFetcher() as fetcher:
    doc = fetcher.fetch("https://github.com/vercel/next.js/blob/canary/docs/...")

# Chunk with code awareness
config = ChunkConfig(
    max_chunk_size=2000,  # Max characters per chunk
    min_chunk_size=200,   # Min characters for a chunk
    overlap_size=100,     # Overlap between chunks
)
chunker = DocumentChunker(config)
chunks = chunker.chunk_document(doc)

# Store and search
store = VectorStore()
store.index_chunks(chunks)

results = store.search("middleware authentication")
for result in results:
    print(result.chunk.to_display_format())
    print(f"Score: {result.score}")

store.close()

Fetching multiple documents

from deepcontext import ContentFetcher

with ContentFetcher() as fetcher:
    # Fetch entire GitHub repo
    docs = fetcher.fetch_github_repo("vercel/next.js", path="docs")
    
    # Fetch Confluence space
    docs = fetcher.fetch_confluence_space(
        base_url="https://company.atlassian.net/wiki",
        space_key="DOCS",
        limit=50,
    )
    
    # Fetch entire website
    docs = fetcher.fetch_website(
        base_url="https://ui.shadcn.com/docs",
        max_pages=100,
        url_pattern=r"/docs/",  # Optional regex filter
    )

Environment Variables

Variable Description
GITHUB_TOKEN GitHub personal access token for higher API rate limits
CONFLUENCE_EMAIL Atlassian account email for Confluence access
CONFLUENCE_TOKEN Atlassian API token for Confluence access
QDRANT_HOST Host for the Qdrant instance used by the HTTP API / MCP server (default: localhost)
QDRANT_PORT Port for the Qdrant instance (default: 6333)
DEFAULT_COLLECTION Default collection name for legacy CLI commands (default: deepcontext_chunks)

You can also use a .env file in the project root - it will be loaded automatically.

Requirements

  • Python 3.12+
  • Docker or Podman (for Qdrant)

License

MIT

About

A semantic chunking service for documents, GitHub repos, webpages, and Confluence. It creates semantically relevant chunks from your content that can be queried later.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published