A contextual understanding layer for your file system. Cortex indexes, extracts, and embeds every file in your workspace — documents, images, spreadsheets, PDFs, audio, video — into a semantic vector space. AI agents can then query this space through MCP to understand your files, not just search them.
Your file system is a flat list of bytes. AI agents that need to work with your files can only do text search — they have no understanding of what a Word document is about, how a PDF relates to a spreadsheet, or what a folder structure represents.
Cortex sits between your files and AI agents. It:
- Extracts content from every file type — PDFs, DOCX, XLSX, PPTX, images, audio, video
- Indexes deep metadata — not just filenames, but content structure, document authors, EXIF data, MIME types, named entities
- Embeds everything into a vector space using local embedding models
- Locates each file precisely in semantic space based on its full context — content, metadata, relationships, usage patterns
- Serves this understanding to AI agents via gRPC (MCP-ready)
The result: an agent can ask "find the contracts related to the Q3 financial report" and get meaningful answers, even if those files share no keywords, live in different folders, and are in different formats.
Your Files (any type)
│
▼
Cortex Pipeline
├── Extract: text from PDFs, Office docs, images (OCR), audio (transcription)
├── Analyze: metadata, structure, relationships, entities
├── Embed: vector representations via local models
├── Classify: AI-generated tags, projects, categories, summaries
└── Link: cross-document relationships, temporal co-occurrence
│
▼
Semantic Vector Space (SQLite + embeddings)
│
▼
gRPC API / MCP Server
├── Semantic search: "files about network configuration"
├── RAG queries: "summarize the Q3 financial reports"
├── Contextual retrieval: related files, clusters, knowledge graph
└── Structured metadata: tags, projects, states, relationships
| Component | Language | Role |
|---|---|---|
| Backend Daemon (`cortexd`) | Go | Core engine: extraction, indexing, embedding, AI, storage |
| VS Code Extension | TypeScript | UI layer: browse the semantic space visually |
The daemon is the brain. It processes files through a multi-stage pipeline, stores everything in SQLite with vector embeddings, and exposes a gRPC API with 100+ methods across 8 services. The VS Code extension is one client; any MCP-compatible agent can be another.
cd backend
cp configs/cortexd.yaml.example cortexd.local.yaml
# Edit cortexd.local.yaml — set watch_paths to your workspace directory
make build
./cortexd --config cortexd.local.yaml

ollama pull llama3.2 # Language model for AI features
ollama pull nomic-embed-text # Embedding model for vector space

npm install
npm run compile
# Press F5 in VS Code to launch the Extension Development Host

The daemon works standalone; the VS Code extension is just one way to interact with it.
Every file passes through a pipeline that extracts progressively deeper understanding:
| Stage | What It Extracts |
|---|---|
| `basic` | Size, timestamps, hashes (MD5, SHA-256), path structure |
| `mime` | True MIME type via magic bytes (not just extension) |
| `mirror` | Full text from PDFs, DOCX, XLSX, PPTX, legacy Office; OCR for scanned docs |
| `document` | Parsing, chunking, heading structure, word/page counts |
| `metadata` | Author, title, creation date, EXIF, IPTC, XMP, OS-level attributes |
| `relationship` | Cross-document references and dependencies |
| `state` | Document lifecycle: draft, active, replaced, archived |
| `enrichment` | Named entities, sentiment, citations, tables, transcription (Whisper) |
| `embedding` | Vector representation via nomic-embed-text for semantic search |
| `ai` | LLM-generated tags, project assignments, summaries, categories |
| `clustering` | Semantic clusters via embedding similarity, temporal co-occurrence |
| `taxonomy` | AI-induced hierarchical category tree (Chain-of-Layer) |
After indexing, every file has a rich vector representation that captures its full context — not just its text content, but its metadata, relationships, and semantic meaning.
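As a rough illustration of what the first two stages produce, the sketch below computes size, MD5 and SHA-256 hashes, and a MIME type sniffed from the leading bytes. It is not Cortex's implementation: the `BasicInfo` type and `inspect` function are hypothetical names, and Go's standard `http.DetectContentType` stands in for whatever magic-byte detection the daemon actually uses.

```go
package main

import (
	"crypto/md5"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
)

// BasicInfo is a hypothetical record for the "basic" and "mime" stages.
type BasicInfo struct {
	Size   int
	MD5    string
	SHA256 string
	MIME   string
}

// inspect computes size, hashes, and a sniffed MIME type from raw bytes.
func inspect(data []byte) BasicInfo {
	m := md5.Sum(data)
	s := sha256.Sum256(data)
	return BasicInfo{
		Size:   len(data),
		MD5:    hex.EncodeToString(m[:]),
		SHA256: hex.EncodeToString(s[:]),
		// DetectContentType only looks at the leading bytes (at most 512),
		// so a misleading file extension does not affect the result.
		MIME: http.DetectContentType(data),
	}
}

func main() {
	// Content that starts with a PDF header, regardless of the file name.
	info := inspect([]byte("%PDF-1.7\n"))
	fmt.Printf("%+v\n", info)
}
```

Sniffing content rather than trusting the extension is what lets a renamed or extension-less file still reach the right extractor in later stages.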
Ask natural language questions and get answers with source citations:
"Which documents discuss the supplier agreement?"
"Summarize the financial reports from Q3"
"What contracts are related to the Acme project?"
The RAG system retrieves relevant document chunks by embedding similarity, then generates an answer using the LLM with the retrieved context.
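The retrieve-then-generate flow can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the daemon's code: the `Chunk` type, the `topK` and `buildPrompt` helpers, and the two-dimensional vectors are all hypothetical, and real embeddings from nomic-embed-text have far more dimensions.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// Chunk is a hypothetical indexed document chunk with its embedding.
type Chunk struct {
	Source string
	Text   string
	Vec    []float64
}

// cosine returns the cosine similarity of two equal-length vectors,
// or 0 if either vector has zero magnitude.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK returns the k chunks most similar to the query vector, best first.
func topK(query []float64, chunks []Chunk, k int) []Chunk {
	ranked := append([]Chunk(nil), chunks...) // copy so the input order is kept
	sort.SliceStable(ranked, func(i, j int) bool {
		return cosine(query, ranked[i].Vec) > cosine(query, ranked[j].Vec)
	})
	if k > len(ranked) {
		k = len(ranked)
	}
	return ranked[:k]
}

// buildPrompt lists the retrieved chunks, labelled by source, before the question.
func buildPrompt(question string, ctx []Chunk) string {
	var b strings.Builder
	for i, c := range ctx {
		fmt.Fprintf(&b, "[%d] %s: %s\n", i+1, c.Source, c.Text)
	}
	fmt.Fprintf(&b, "Question: %s\n", question)
	return b.String()
}

func main() {
	chunks := []Chunk{
		{"supplier.pdf", "Supplier agreement terms", []float64{0.9, 0.1}},
		{"notes.md", "Lunch menu", []float64{0.0, 1.0}},
		{"q3.xlsx", "Q3 supplier costs", []float64{0.8, 0.3}},
	}
	query := []float64{1, 0} // stand-in for the embedded question
	fmt.Print(buildPrompt("Which documents discuss the supplier agreement?", topK(query, chunks, 2)))
}
```

Numbering each chunk with its source is one simple way to let the LLM cite where an answer came from.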
- Related files — find files semantically similar to a given file
- Document clusters — auto-detected groups of related files
- Knowledge graph — relationships between documents (depends_on, replaces, references)
- Usage patterns — files frequently accessed or edited together
- Faceted browsing — filter by tag, project, type, date, size, category, metrics
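To make the knowledge-graph idea concrete, here is a small sketch of typed relationships between documents, with a breadth-first search for the shortest chain linking two of them. The `Edge` and `Graph` types and the file names are hypothetical, and treating edges as directed is an assumption; Cortex stores its relationships in SQLite and may traverse them differently.

```go
package main

import "fmt"

// Edge is a hypothetical typed relationship between two documents.
type Edge struct {
	From, To, Type string
}

// Graph stores outgoing edges per document.
type Graph map[string][]Edge

// add records a directed, typed edge.
func (g Graph) add(from, to, typ string) {
	g[from] = append(g[from], Edge{from, to, typ})
}

// outgoing returns the targets of edges of the given type leaving doc.
func (g Graph) outgoing(doc, typ string) []string {
	var out []string
	for _, e := range g[doc] {
		if e.Type == typ {
			out = append(out, e.To)
		}
	}
	return out
}

// shortestPath runs a breadth-first search over outgoing edges of any type
// and returns the documents on the path, or nil if dst is unreachable.
func (g Graph) shortestPath(src, dst string) []string {
	prev := map[string]string{src: ""} // "" marks the start of the path
	queue := []string{src}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		if cur == dst {
			var path []string
			for n := dst; n != ""; n = prev[n] {
				path = append([]string{n}, path...)
			}
			return path
		}
		for _, e := range g[cur] {
			if _, seen := prev[e.To]; !seen {
				prev[e.To] = cur
				queue = append(queue, e.To)
			}
		}
	}
	return nil
}

func main() {
	g := Graph{}
	g.add("api-spec-v2.md", "api-spec.md", "replaces")
	g.add("api-spec.md", "schema.md", "depends_on")
	g.add("schema.md", "glossary.md", "references")

	fmt.Println(g.outgoing("api-spec.md", "depends_on"))
	fmt.Println(g.shortestPath("api-spec-v2.md", "glossary.md"))
}
```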
Every file has extractable structured data:
- Tags (manual + AI-generated)
- Project assignments (manual + AI-inferred)
- AI summaries
- Document state (draft/active/replaced/archived)
- Document metrics (pages, words, author)
- Image metadata (EXIF, GPS, camera, dimensions)
- Audio/video metadata (duration, bitrate, codec, ID3 tags)
- Hierarchical categories (AI-generated taxonomy)
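One way to picture this structured data is as a typed record that can be filtered by facet. The sketch below is illustrative only: `FileRecord`, `State`, and `filter` are hypothetical names covering a small subset of the fields listed above, not the schema Cortex uses.

```go
package main

import "fmt"

// State mirrors the four lifecycle values listed above.
type State string

const (
	Draft    State = "draft"
	Active   State = "active"
	Replaced State = "replaced"
	Archived State = "archived"
)

// FileRecord is a hypothetical subset of the structured data kept per file.
type FileRecord struct {
	Path    string
	Tags    []string
	Project string
	State   State
	Pages   int
}

// hasTag reports whether the record carries the given tag.
func hasTag(r FileRecord, tag string) bool {
	for _, t := range r.Tags {
		if t == tag {
			return true
		}
	}
	return false
}

// filter returns the records in the given state that carry the given tag.
func filter(records []FileRecord, state State, tag string) []FileRecord {
	var out []FileRecord
	for _, r := range records {
		if r.State == state && hasTag(r, tag) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	records := []FileRecord{
		{"docs/api-spec.md", []string{"review", "api"}, "Nexus Platform", Active, 12},
		{"docs/old-spec.md", []string{"review"}, "Nexus Platform", Replaced, 9},
		{"notes/meeting.md", []string{"minutes"}, "", Draft, 2},
	}
	for _, r := range filter(records, Active, "review") {
		fmt.Println(r.Path)
	}
}
```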
The backend exposes 8 gRPC services (defined in backend/api/proto/cortex/v1/):
| Service | Methods | Purpose |
|---|---|---|
| `AdminService` | 16 | Daemon control, workspace management, pipeline streaming |
| `FileService` | 11 | Workspace scanning, file queries, grouped listings |
| `MetadataService` | 16 | Tags, projects, notes, AI summaries, suggestions |
| `LLMService` | 10 | AI operations: tag/project/summary/category generation |
| `RAGService` | 3 | Semantic search, RAG queries with citations, index stats |
| `KnowledgeService` | 34 | Projects, relationships, states, usage analytics, visualization |
| `TaxonomyService` | 15 | Hierarchical categories, AI-driven taxonomy induction |
| `ClusteringService` | 6 | Document clustering, similarity graph analysis |
See the backend README for the full API reference with every method documented.
Cortex can run as an MCP server, giving AI coding agents (Claude Code, Copilot, Cursor) structured access to the knowledge graph. Instead of reading entire files, agents query documents, projects, and relationships with token-efficient tool calls.
cortexd --mcp --config cortexd.yaml

| Tool | Purpose | Example |
|---|---|---|
| `cortex_find` | Search documents, projects, or files | `{"kind": "document", "state": "active", "tag": "review"}` |
| `cortex_show` | Inspect metadata, outline, or content | `{"target": "architecture.md", "view": "outline"}` |
| `cortex_relations` | Navigate the knowledge graph | `{"target": "api-spec.md", "type": "depends_on"}` |

// Find active documents tagged "review"
{"kind": "document", "state": "active", "tag": "review"}
// Find software projects
{"kind": "project", "nature": "development.software"}
// Semantic search (RAG-powered)
{"kind": "document", "query": "network VPN configuration"}
// Find PDF files
{"kind": "file", "extension": ".pdf", "limit": 10}

// Document metadata (title, state, tags, projects, AI summary)
{"target": "meeting-notes.md", "view": "signature"}
// Document outline (heading structure)
{"target": "architecture.pdf", "view": "outline"}
// Project members (assigned documents)
{"target": "Nexus Platform", "view": "members"}
// Full file metadata (tags, contexts, language, AI category)
{"target": "invoice.xlsx", "view": "metadata"}

// What does this document depend on?
{"target": "api-spec.md", "direction": "outgoing", "type": "depends_on"}
// What references this template?
{"target": "template.md", "direction": "incoming", "type": "references"}
// Find shortest path between two documents
{"target": "doc-a.md", "path_to": "doc-b.md"}

Add to your Claude Code settings (~/.claude/settings.json):

{
  "mcpServers": {
    "cortex": {
      "command": "/path/to/cortexd",
      "args": ["--mcp", "--config", "/path/to/cortexd.yaml"]
    }
  }
}

Then Claude Code can query your document knowledge graph directly:

- "What documents do I have about network configuration?" -> `cortex_find` with semantic search
- "Show me the outline of the architecture document" -> `cortex_show` with outline view
- "What depends on the API spec?" -> `cortex_relations` with depends_on traversal
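
For clients other than Claude Code, it may help to see what such a tool call looks like on the wire. MCP uses JSON-RPC 2.0, and a tool is invoked with the `tools/call` method. The sketch below only builds that request; the `newToolCall` helper and the struct names are hypothetical, the arguments are taken from the `cortex_find` examples above, and a real session would first perform the MCP `initialize` handshake over the server's transport.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// callParams carries the tool name and its arguments.
type callParams struct {
	Name      string         `json:"name"`
	Arguments map[string]any `json:"arguments"`
}

// request is a JSON-RPC 2.0 request envelope.
type request struct {
	JSONRPC string     `json:"jsonrpc"`
	ID      int        `json:"id"`
	Method  string     `json:"method"`
	Params  callParams `json:"params"`
}

// newToolCall builds a tools/call request for the named tool.
func newToolCall(id int, tool string, args map[string]any) ([]byte, error) {
	return json.Marshal(request{
		JSONRPC: "2.0",
		ID:      id,
		Method:  "tools/call",
		Params:  callParams{Name: tool, Arguments: args},
	})
}

func main() {
	msg, err := newToolCall(1, "cortex_find", map[string]any{
		"kind":  "document",
		"query": "network VPN configuration",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(msg))
}
```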
The extension provides a visual interface to browse the semantic space:
- Faceted views — by project, tag, type, date, size, folder, content type
- Taxonomy tree — AI-generated hierarchical categories
- Admin dashboard — backend status, pipeline progress, configuration
- Cluster graph — visual representation of document relationships
- Semantic commands — natural language file operations
- RAG queries — ask questions about your workspace from the editor
| Command | Description |
|---|---|
| `Cortex: Ask AI` | RAG query about your workspace |
| `Cortex: Execute Semantic Command` | Natural language file operation |
| `Cortex: Suggest Tags (AI)` | AI tag suggestions for current file |
| `Cortex: Suggest Project (AI)` | AI project suggestions for current file |
| `Cortex: Generate File Summary (AI)` | Generate AI summary |
| `Cortex: Add tag to current file` | Manual tag assignment |
| `Cortex: Assign project to current file` | Manual project assignment |
| `Re-index Everything` | Trigger full re-indexing |
| `Backend Admin` | Open admin dashboard |
| `Pipeline Progress` | Real-time indexing progress |
See backend/configs/cortexd.yaml.example for the full reference.
grpc_address: "localhost:50051"
data_dir: "./cortex-data"
worker_count: 4
watch_paths:
  - "/path/to/your/workspace"
llm:
  enabled: true
  default_provider: "ollama"
  default_model: "llama3.2"
embeddings:
  enabled: true
  model: "nomic-embed-text"
tika:
  enabled: true # Apache Tika for deep document extraction
  auto_download: true # Downloads Tika JAR automatically

cortex/
├── backend/ # Go daemon (the core engine)
│ ├── cmd/cortexd/ # Entry point
│ ├── api/proto/ # gRPC service definitions (8 services)
│ ├── internal/
│ │ ├── application/ # Pipeline, services, business logic
│ │ ├── domain/ # Entities, repository interfaces
│ │ ├── infrastructure/ # SQLite, LLM providers, embeddings, file system
│ │ └── interfaces/grpc/ # gRPC handlers and adapters
│ └── Makefile
├── src/ # VS Code extension (TypeScript)
│ ├── extension.ts # Entry point
│ ├── core/ # gRPC clients
│ ├── views/ # Facet-based tree providers
│ ├── frontend/ # WebView panels
│ ├── commands/ # Command handlers
│ └── services/ # Extension services
├── docker-compose.yml # Apache Tika (document extraction)
├── docker-compose.onlyoffice.yml # OnlyOffice (document editing)
└── LICENSE # MIT
# Backend
cd backend
make build # Build cortexd
make run # Build and run
make test # Run tests
make proto # Regenerate gRPC code
# Extension
npm install
npm run compile
npm run watch # Watch mode
npm test

See CONTRIBUTING.md for the full development guide.
- Image understanding — visual content analysis via LLM vision models, not just EXIF metadata
- Multi-workspace — index across multiple directories and projects
- Graph visualization — interactive knowledge graph in the browser
- Plugin system — custom extractors and pipeline stages