Cortex

A contextual understanding layer for your file system. Cortex indexes, extracts, and embeds every file in your workspace — documents, images, spreadsheets, PDFs, audio, video — into a semantic vector space. AI agents can then query this space through MCP to understand your files, not just search them.

The Problem

Your file system is a flat list of bytes. AI agents that need to work with your files can only do text search — they have no understanding of what a Word document is about, how a PDF relates to a spreadsheet, or what a folder structure represents.

What Cortex Does

Cortex sits between your files and AI agents. It:

Extracts content from every file type — PDFs, DOCX, XLSX, PPTX, images, audio, video
Indexes deep metadata — not just filenames, but content structure, document authors, EXIF data, MIME types, named entities
Embeds everything into a vector space using local embedding models
Locates each file precisely in semantic space based on its full context — content, metadata, relationships, usage patterns
Serves this understanding to AI agents via gRPC (MCP-ready)

The result: an agent can ask "find the contracts related to the Q3 financial report" and get meaningful answers, even if those files share no keywords, live in different folders, and are in different formats.

Your Files (any type)
  │
  ▼
Cortex Pipeline
  ├── Extract: text from PDFs, Office docs, images (OCR), audio (transcription)
  ├── Analyze: metadata, structure, relationships, entities
  ├── Embed: vector representations via local models
  ├── Classify: AI-generated tags, projects, categories, summaries
  └── Link: cross-document relationships, temporal co-occurrence
  │
  ▼
Semantic Vector Space (SQLite + embeddings)
  │
  ▼
gRPC API / MCP Server
  ├── Semantic search: "files about network configuration"
  ├── RAG queries: "summarize the Q3 financial reports"
  ├── Contextual retrieval: related files, clusters, knowledge graph
  └── Structured metadata: tags, projects, states, relationships

Architecture

Component	Language	Role
Backend Daemon (`cortexd`)	Go	Core engine — extraction, indexing, embedding, AI, storage
VS Code Extension	TypeScript	UI layer — browse the semantic space visually

The daemon is the brain. It processes files through a multi-stage pipeline, stores everything in SQLite with vector embeddings, and exposes a gRPC API with 100+ methods across 8 services. The VS Code extension is one client; any MCP-compatible agent can be another.

Quick Start

Prerequisites

Go 1.22+
Node.js 18+ (for the VS Code extension)
Ollama with llama3.2 and nomic-embed-text models

1. Start the backend

cd backend
cp configs/cortexd.yaml.example cortexd.local.yaml
# Edit cortexd.local.yaml — set watch_paths to your workspace directory
make build
./cortexd --config cortexd.local.yaml

2. Pull the models

ollama pull llama3.2           # Language model for AI features
ollama pull nomic-embed-text   # Embedding model for vector space

3. (Optional) Run the VS Code extension

npm install
npm run compile
# Press F5 in VS Code to launch the Extension Development Host

The daemon works standalone — the VS Code extension is just one way to interact with it.

How Indexing Works

Every file passes through a pipeline that extracts progressively deeper understanding:

Stage	What It Extracts
`basic`	Size, timestamps, hashes (MD5, SHA-256), path structure
`mime`	True MIME type via magic bytes (not just extension)
`mirror`	Full text from PDFs, DOCX, XLSX, PPTX, legacy Office; OCR for scanned docs
`document`	Parsing, chunking, heading structure, word/page counts
`metadata`	Author, title, creation date, EXIF, IPTC, XMP, OS-level attributes
`relationship`	Cross-document references and dependencies
`state`	Document lifecycle: draft, active, replaced, archived
`enrichment`	Named entities, sentiment, citations, tables, transcription (Whisper)
`embedding`	Vector representation via `nomic-embed-text` for semantic search
`ai`	LLM-generated tags, project assignments, summaries, categories
`clustering`	Semantic clusters via embedding similarity, temporal co-occurrence
`taxonomy`	AI-induced hierarchical category tree (Chain-of-Layer)

After indexing, every file has a rich vector representation that captures its full context — not just its text content, but its metadata, relationships, and semantic meaning.

What You Can Query

Semantic Search (RAG)

Ask natural language questions and get answers with source citations:

"Which documents discuss the supplier agreement?"
"Summarize the financial reports from Q3"
"What contracts are related to the Acme project?"

The RAG system retrieves relevant document chunks by embedding similarity, then generates an answer using the LLM with the retrieved context.

Contextual Retrieval

Related files — find files semantically similar to a given file
Document clusters — auto-detected groups of related files
Knowledge graph — relationships between documents (depends_on, replaces, references)
Usage patterns — files frequently accessed or edited together
Faceted browsing — filter by tag, project, type, date, size, category, metrics

Structured Metadata

Every file has extractable structured data:

Tags (manual + AI-generated)
Project assignments (manual + AI-inferred)
AI summaries
Document state (draft/active/replaced/archived)
Document metrics (pages, words, author)
Image metadata (EXIF, GPS, camera, dimensions)
Audio/video metadata (duration, bitrate, codec, ID3 tags)
Hierarchical categories (AI-generated taxonomy)

gRPC API

The backend exposes 8 gRPC services (defined in backend/api/proto/cortex/v1/):

Service	Methods	Purpose
`AdminService`	16	Daemon control, workspace management, pipeline streaming
`FileService`	11	Workspace scanning, file queries, grouped listings
`MetadataService`	16	Tags, projects, notes, AI summaries, suggestions
`LLMService`	10	AI operations — tag/project/summary/category generation
`RAGService`	3	Semantic search, RAG queries with citations, index stats
`KnowledgeService`	34	Projects, relationships, states, usage analytics, visualization
`TaxonomyService`	15	Hierarchical categories, AI-driven taxonomy induction
`ClusteringService`	6	Document clustering, similarity graph analysis

See the backend README for the full API reference with every method documented.

MCP Server (AI Agent Interface)

Cortex can run as an MCP server, giving AI coding agents (Claude Code, Copilot, Cursor) structured access to the knowledge graph. Instead of reading entire files, agents query documents, projects, and relationships with token-efficient tool calls.

cortexd --mcp --config cortexd.yaml

Tools

Tool	Purpose	Example
`cortex_find`	Search documents, projects, or files	`{"kind": "document", "state": "active", "tag": "review"}`
`cortex_show`	Inspect metadata, outline, or content	`{"target": "architecture.md", "view": "outline"}`
`cortex_relations`	Navigate the knowledge graph	`{"target": "api-spec.md", "type": "depends_on"}`

`cortex_find` — Search the knowledge graph

// Find active documents tagged "review"
{"kind": "document", "state": "active", "tag": "review"}

// Find software projects
{"kind": "project", "nature": "development.software"}

// Semantic search (RAG-powered)
{"kind": "document", "query": "network VPN configuration"}

// Find PDF files
{"kind": "file", "extension": ".pdf", "limit": 10}

`cortex_show` — Inspect a document or project

// Document metadata (title, state, tags, projects, AI summary)
{"target": "meeting-notes.md", "view": "signature"}

// Document outline (heading structure)
{"target": "architecture.pdf", "view": "outline"}

// Project members (assigned documents)
{"target": "Nexus Platform", "view": "members"}

// Full file metadata (tags, contexts, language, AI category)
{"target": "invoice.xlsx", "view": "metadata"}

`cortex_relations` — Navigate relationships

// What does this document depend on?
{"target": "api-spec.md", "direction": "outgoing", "type": "depends_on"}

// What references this template?
{"target": "template.md", "direction": "incoming", "type": "references"}

// Find shortest path between two documents
{"target": "doc-a.md", "path_to": "doc-b.md"}

Integration with Claude Code

Add to your Claude Code settings (~/.claude/settings.json):

{
  "mcpServers": {
    "cortex": {
      "command": "/path/to/cortexd",
      "args": ["--mcp", "--config", "/path/to/cortexd.yaml"]
    }
  }
}

Then Claude Code can query your document knowledge graph directly:

"What documents do I have about network configuration?" -> cortex_find with semantic search

"Show me the outline of the architecture document" -> cortex_show with outline view

"What depends on the API spec?" -> cortex_relations with depends_on traversal

VS Code Extension

The extension provides a visual interface to browse the semantic space:

Faceted views — by project, tag, type, date, size, folder, content type
Taxonomy tree — AI-generated hierarchical categories
Admin dashboard — backend status, pipeline progress, configuration
Cluster graph — visual representation of document relationships
Semantic commands — natural language file operations
RAG queries — ask questions about your workspace from the editor

Commands

Command	Description
`Cortex: Ask AI`	RAG query about your workspace
`Cortex: Execute Semantic Command`	Natural language file operation
`Cortex: Suggest Tags (AI)`	AI tag suggestions for current file
`Cortex: Suggest Project (AI)`	AI project suggestions for current file
`Cortex: Generate File Summary (AI)`	Generate AI summary
`Cortex: Add tag to current file`	Manual tag assignment
`Cortex: Assign project to current file`	Manual project assignment
`Re-index Everything`	Trigger full re-indexing
`Backend Admin`	Open admin dashboard
`Pipeline Progress`	Real-time indexing progress

Configuration

Daemon (YAML)

See backend/configs/cortexd.yaml.example for the full reference.

grpc_address: "localhost:50051"
data_dir: "./cortex-data"
worker_count: 4

watch_paths:
  - "/path/to/your/workspace"

llm:
  enabled: true
  default_provider: "ollama"
  default_model: "llama3.2"
  embeddings:
    enabled: true
    model: "nomic-embed-text"

tika:
  enabled: true          # Apache Tika for deep document extraction
  auto_download: true    # Downloads Tika JAR automatically

Extension (VS Code Settings)

"cortex.grpc.address": "127.0.0.1:50051",
"cortex.llm.endpoint": "http://localhost:11434",
"cortex.llm.model": "llama3.2"

Project Structure

cortex/
├── backend/                       # Go daemon (the core engine)
│   ├── cmd/cortexd/               # Entry point
│   ├── api/proto/                 # gRPC service definitions (8 services)
│   ├── internal/
│   │   ├── application/           # Pipeline, services, business logic
│   │   ├── domain/                # Entities, repository interfaces
│   │   ├── infrastructure/        # SQLite, LLM providers, embeddings, file system
│   │   └── interfaces/grpc/       # gRPC handlers and adapters
│   └── Makefile
├── src/                           # VS Code extension (TypeScript)
│   ├── extension.ts               # Entry point
│   ├── core/                      # gRPC clients
│   ├── views/                     # Facet-based tree providers
│   ├── frontend/                  # WebView panels
│   ├── commands/                  # Command handlers
│   └── services/                  # Extension services
├── docker-compose.yml             # Apache Tika (document extraction)
├── docker-compose.onlyoffice.yml  # OnlyOffice (document editing)
└── LICENSE                        # MIT

Development

# Backend
cd backend
make build           # Build cortexd
make run             # Build and run
make test            # Run tests
make proto           # Regenerate gRPC code

# Extension
npm install
npm run compile
npm run watch        # Watch mode
npm test

See CONTRIBUTING.md for the full development guide.

Roadmap

Image understanding — visual content analysis via LLM vision models, not just EXIF metadata
Multi-workspace — index across multiple directories and projects
Graph visualization — interactive knowledge graph in the browser
Plugin system — custom extractors and pipeline stages

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.claude/agents		.claude/agents
.vscode		.vscode
backend		backend
benchmarks		benchmarks
bin		bin
docs		docs
examples/go-backend		examples/go-backend
resources		resources
scripts		scripts
src		src
test-workspace/.vscode		test-workspace/.vscode
.gitignore		.gitignore
.vscodeignore		.vscodeignore
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
build-go.sh		build-go.sh
docker-compose.onlyoffice.yml		docker-compose.onlyoffice.yml
docker-compose.yml		docker-compose.yml
eslint.config.mjs		eslint.config.mjs
package-lock.json		package-lock.json
package.json		package.json
test-llm-api.js		test-llm-api.js
test-llm-demo.ts		test-llm-demo.ts
tsconfig.json		tsconfig.json

Folders and files

Latest commit

History

Repository files navigation

Cortex

The Problem

What Cortex Does

Architecture

Quick Start

Prerequisites

1. Start the backend

2. Pull the models

3. (Optional) Run the VS Code extension

How Indexing Works

What You Can Query

Semantic Search (RAG)

Contextual Retrieval

Structured Metadata

gRPC API

MCP Server (AI Agent Interface)

Tools

cortex_find — Search the knowledge graph

cortex_show — Inspect a document or project

cortex_relations — Navigate relationships

Integration with Claude Code

VS Code Extension

Commands

Configuration

Daemon (YAML)

Extension (VS Code Settings)

Project Structure

Development

Roadmap

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`cortex_find` — Search the knowledge graph

`cortex_show` — Inspect a document or project

`cortex_relations` — Navigate relationships

Packages