Commit db5c83a

docs: add connectors-for-ai-context upgrade plan (Notion AI-style connectors + PageIndex RAG)
1 parent b95aebb commit db5c83a

1 file changed

Lines changed: 197 additions & 0 deletions

@@ -0,0 +1,197 @@
1+
# Connectors for AI Context — Notion AI-style
2+
3+
Add a **Connectors** system that lets users connect external data sources to enrich AI context — similar to Notion AI connectors (Slack, Google Drive, GitHub, etc.).
4+
5+
TextAgent already has `context-memory.js` with SQLite FTS5 indexing and folder/file attachment. This plan builds on that foundation with **named connector adapters** that fetch, index, and surface external content.
6+
7+
## Design Philosophy
8+
9+
Since TextAgent is 100% client-side with zero-knowledge privacy:
10+
- **No server-side indexing** — all indexing happens in-browser via SQLite/FTS5
11+
- **API keys stored in localStorage** (same pattern as `ai-web-search.js`)
12+
- **Content fetched, chunked, and indexed locally** into the existing memory system
13+
- **Database**: Stay with sql.js + FTS5 (see DB Assessment below)
14+
15+
---
16+
17+
## Connector Registry
18+
19+
| Connector | Icon | Requires Key? | Description |
20+
|-----------|------|--------------|-------------|
21+
| `url` | `bi-link-45deg` | No | Fetch & index any public URL/webpage |
22+
| `rss` | `bi-rss` | No | Index RSS/Atom feed entries |
23+
| `github` | `bi-github` | Yes (PAT) | Index repo files (README, docs, code) |
24+
| `paste` | `bi-clipboard-data` | No | Manually paste text/markdown to index |
25+
| `pageindex` | `bi-file-earmark-text` | Yes (API key) | Vectorless reasoning-based RAG for PDFs via [PageIndex](https://github.com/VectifyAI/PageIndex) |
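
As a rough sketch, the table above could be expressed as a registry object along these lines. The field names (`icon`, `requiresKey`, `keyLabel`, `description`) and the `connectorsNeedingKey` helper are assumptions made for illustration, not the final shape of `REGISTRY` in `js/connectors.js`.

```js
// Hypothetical registry shape; the keys mirror the connector names in the table.
const REGISTRY = {
  url:       { icon: 'bi-link-45deg',        requiresKey: false, description: 'Fetch & index any public URL/webpage' },
  rss:       { icon: 'bi-rss',               requiresKey: false, description: 'Index RSS/Atom feed entries' },
  github:    { icon: 'bi-github',            requiresKey: true,  keyLabel: 'PAT', description: 'Index repo files (README, docs, code)' },
  paste:     { icon: 'bi-clipboard-data',    requiresKey: false, description: 'Manually paste text/markdown to index' },
  pageindex: { icon: 'bi-file-earmark-text', requiresKey: true,  keyLabel: 'API key', description: 'Vectorless reasoning-based RAG for PDFs' },
};

// Hypothetical helper: list the connector types that need a credential before connecting.
function connectorsNeedingKey(registry) {
  return Object.keys(registry).filter((type) => registry[type].requiresKey);
}
```

A panel could use such a helper to decide when to show a key input in the inline config form.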
26+
27+
### Future Connectors (require OAuth)
28+
- Google Drive (Google Picker API)
29+
- Notion (OAuth)
30+
- Slack (OAuth)
31+
32+
---
33+
34+
## PageIndex Integration (VectifyAI)
35+
36+
[PageIndex](https://github.com/VectifyAI/PageIndex) is a **vectorless, reasoning-based RAG** system that replaces similarity search with LLM reasoning over hierarchical document tree indexes. It's ideal for long, complex documents (financial reports, legal filings, technical manuals) where traditional chunking + FTS5 falls short.
37+
38+
### Why PageIndex complements the existing connector approach
39+
40+
| Aspect | Local FTS5 (url, rss, github, paste) | PageIndex |
41+
|--------|--------------------------------------|-----------|
42+
| **Indexing** | Client-side chunking + SQLite FTS5 | Server-side hierarchical tree index |
43+
| **Retrieval** | Keyword matching with BM25 ranking | LLM reasoning-based tree search |
44+
| **Best for** | Short-to-medium docs, quick lookups | Long professional documents (100+ pages) |
45+
| **Privacy** | 100% local | API call to `api.pageindex.ai` (document uploaded) |
46+
| **Cost** | Free | Requires PageIndex API key (free tier available) |
47+
48+
### Integration approach
49+
50+
PageIndex operates differently from the other connectors — it doesn't feed into the local FTS5 pipeline. Instead:
51+
52+
1. **Upload**: User attaches a PDF via the Connector panel → file is submitted to PageIndex over its REST API (the equivalent of `pi_client.submit_document()` in the Python SDK)
53+
2. **Index**: PageIndex builds a hierarchical tree index (async, poll for completion)
54+
3. **Query**: When AI chat queries fire, the PageIndex Chat API is called alongside local FTS5 search
55+
4. **Response**: PageIndex returns reasoning-traced, page-referenced answers in OpenAI-compatible format
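
The "poll for completion" part of step 2 could look roughly like the sketch below. `waitForIndexing` and its options are hypothetical names. The only thing taken from this plan is that the status call eventually returns an object with `status: "completed"`. The status function is injected so the loop does not depend on any particular endpoint.

```js
// Hypothetical helper: call checkStatus() repeatedly until it reports completion.
// checkStatus is assumed to resolve to an object such as { status: "completed", result: [...] }.
async function waitForIndexing(checkStatus, { intervalMs = 2000, maxAttempts = 30 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await checkStatus();
    if (res && res.status === 'completed') return res;
    // Wait before the next attempt, but not after the last one.
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Indexing did not complete after ${maxAttempts} attempts`);
}
```

With the adapter stubs from this plan, a call might read `waitForIndexing(() => getPageIndexTree(apiKey, docId))`.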
56+
57+
### Adapter — `fetchPageIndex(apiKey, docId, query)`
58+
59+
```js
// REST API calls (no Python SDK needed; direct fetch to api.pageindex.ai)

// Submit document
async function submitToPageIndex(apiKey, pdfBlob, filename) {
  // POST multipart/form-data to PageIndex API
  // Returns { doc_id: "pi-..." }
}

// Check processing status
async function getPageIndexTree(apiKey, docId) {
  // GET document tree structure
  // Returns { status: "completed", result: [...tree nodes...] }
}

// Chat with document (reasoning-based RAG)
async function chatWithPageIndex(apiKey, docId, messages, stream) {
  // POST to chat_completions endpoint
  // OpenAI-compatible response format
  // Supports streaming via SSE
}
```
81+
82+
### DocGen tag support
83+
84+
```
{{AI:
@connect: pageindex:pi-abc123def456
@prompt: What are the key financial risks mentioned in this report?
}}
```
90+
91+
### Storage keys
92+
- `API_KEY_PAGEINDEX` — PageIndex API key (from [developer dashboard](https://dash.pageindex.ai))
93+
- `CONNECTOR_PAGEINDEX_DOCS` — JSON map of `{ docId → { filename, status, uploadedAt } }`
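
A minimal sketch of reading and updating the `CONNECTOR_PAGEINDEX_DOCS` map follows. It assumes the key's stored string equals its name and that the storage object exposes `getItem`/`setItem` like `localStorage`. The helper names `loadPageIndexDocs` and `savePageIndexDoc` are hypothetical.

```js
// Assumption: the storage key string matches the constant name used in this plan.
const DOCS_KEY = 'CONNECTOR_PAGEINDEX_DOCS';

// Read the docId → metadata map; fall back to an empty map on missing or corrupt JSON.
function loadPageIndexDocs(storage) {
  try {
    const parsed = JSON.parse(storage.getItem(DOCS_KEY) || '{}');
    return parsed && typeof parsed === 'object' && !Array.isArray(parsed) ? parsed : {};
  } catch (err) {
    return {};
  }
}

// Insert or update one document entry, then persist the whole map.
function savePageIndexDoc(storage, docId, patch) {
  const docs = loadPageIndexDocs(storage);
  docs[docId] = { ...docs[docId], ...patch };
  storage.setItem(DOCS_KEY, JSON.stringify(docs));
  return docs[docId];
}
```

In the browser the `storage` argument would be `localStorage`. Passing it in keeps the helpers testable with an in-memory stand-in.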
94+
95+
### Privacy note
96+
Unlike the local connectors, PageIndex **uploads documents to an external service**. This should trigger:
97+
- A clear consent banner on first use ("Your PDF will be uploaded to pageindex.ai for processing")
98+
- A separate privacy indicator icon on PageIndex-connected sources
99+
- Option to delete documents from PageIndex via `delete_document(doc_id)`
100+
101+
---
102+
103+
## Implementation Scope
104+
105+
### 1. Storage Keys (`storage-keys.js`)
106+
- `CONNECTORS_CONFIG` — JSON blob of enabled connectors and settings
107+
- `API_KEY_GITHUB_PAT` — GitHub Personal Access Token
108+
- `CONNECTOR_SYNC_LOG` — Last sync timestamps per connector
109+
110+
### 2. Connector Engine — `js/connectors.js` (~400 lines)
111+
112+
**Public API** (`M._connectors`):
113+
```js
// Shorthand properties: each name refers to a constant or function defined in connectors.js
M._connectors = {
  REGISTRY,             // Connector type definitions
  getConnectedSources,  // () → list active connectors
  connect,              // (type, config) → add connector, then fetch + index
  disconnect,           // (id) → remove connector + indexed data
  syncAll,              // () → re-fetch all connectors
  sync,                 // (id) → re-fetch one connector
  search,               // (query, connectorIds, max) → search across connector indices
  formatForContext,     // (results) → format for LLM injection
};
```
125+
126+
**Data flow:**
127+
1. User configures connector (URL, repo, etc.)
128+
2. Adapter fetches content (fetch API, GitHub API)
129+
3. Content chunked via existing `chunkMarkdown`/`chunkPlainText`
130+
4. Chunks indexed into a per-connector FTS5 database (persisted in IndexedDB)
131+
5. AI queries search across all enabled connectors
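
Step 3 relies on the existing `chunkMarkdown`/`chunkPlainText` helpers, which are not shown in this plan. As an illustration only, a paragraph-packing chunker might look like the sketch below. `chunkByParagraphs` and its `maxChars` default are hypothetical and do not describe the existing implementation.

```js
// Hypothetical stand-in for the existing chunkers:
// pack whole paragraphs into chunks of at most maxChars characters.
function chunkByParagraphs(text, maxChars = 1000) {
  const paragraphs = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const chunks = [];
  let current = '';
  for (const para of paragraphs) {
    const candidate = current ? current + '\n\n' + para : para;
    if (candidate.length <= maxChars) {
      current = candidate;
      continue;
    }
    if (current) chunks.push(current);
    // A single oversized paragraph is hard-split so no chunk exceeds maxChars.
    let rest = para;
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current = rest;
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each returned chunk would then become one row in the connector's FTS5 table (step 4).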
132+
133+
**Adapters:**
134+
- `fetchUrl(url)` — fetch + DOMParser to extract text from HTML
135+
- `fetchRss(feedUrl)` — Parse RSS/Atom XML, extract entries
136+
- `fetchGitHub(owner, repo, pat)` — GitHub REST API for repo tree + files
137+
- `pasteText(label, text)` — Directly index user-provided text
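
For the GitHub adapter, one plausible starting point is GitHub's "get a tree" REST endpoint with `recursive=1`, which lists a repo's files in a single call. The sketch below only builds the request and filters the response. The helper names and the extension list are assumptions, and the actual `fetchGitHub` may be structured differently.

```js
// Build the request for GET /repos/{owner}/{repo}/git/trees/{tree_sha}?recursive=1.
// treeSha may be a commit SHA or a ref name such as "main".
function buildTreeRequest(owner, repo, treeSha, pat) {
  const url =
    `https://api.github.com/repos/${encodeURIComponent(owner)}/${encodeURIComponent(repo)}` +
    `/git/trees/${encodeURIComponent(treeSha)}?recursive=1`;
  return {
    url,
    headers: {
      Accept: 'application/vnd.github+json',
      Authorization: `Bearer ${pat}`,
    },
  };
}

// Keep only file entries (type "blob") whose extension looks indexable.
// The extension list is an illustrative assumption.
function selectIndexableFiles(tree, extensions = ['.md', '.txt', '.js']) {
  return tree
    .filter((entry) => entry.type === 'blob')
    .filter((entry) => extensions.some((ext) => entry.path.toLowerCase().endsWith(ext)))
    .map((entry) => entry.path);
}
```

The request would be sent with `fetch(req.url, { headers: req.headers })`, and the response's `tree` array passed to `selectIndexableFiles`. For very large repos the response can carry `truncated: true`, so the adapter should check that flag before treating the listing as complete.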
138+
139+
### 3. Connector Panel UI (`index.html`)
140+
- Toolbar button: `<button id="connector-toggle"><i class="bi bi-plug"></i></button>`
141+
- Slide-out panel (same architecture as AI panel)
142+
- Cards for each active connector (icon, label, status, sync/delete buttons)
143+
- "+ Add Connector" dropdown for selecting type
144+
- Inline config form per type
145+
146+
### 4. Panel CSS (`styles.css`, ~200 lines)
147+
- `.connector-panel` — Fixed right panel with slide animation
148+
- `.connector-card` — Card per connector
149+
- `.connector-add-modal` — Type selector
150+
- Status indicators (synced, syncing, error)
151+
- Mobile responsive
152+
153+
### 5. AI Chat Integration (`ai-chat.js`)
154+
In `sendChatMessage()`, inject connector context alongside web search:
155+
```js
if (M._connectors) {
  var sources = M._connectors.getConnectedSources();
  if (sources.length > 0) {
    var results = await M._connectors.search(text, sources.map(s => s.id), 5);
    if (results.length > 0) {
      context += '\n\n[Connected Sources]\n' + M._connectors.formatForContext(results);
    }
  }
}
```
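
The snippet above leaves `formatForContext` unspecified. One possible sketch is shown here, assuming each search result carries `source`, `title`, and `text` fields. That result shape and the `maxCharsPerResult` cap are assumptions for illustration.

```js
// Hypothetical result shape: { source: string, title: string, text: string }.
// Produces numbered, truncated snippets suitable for appending to the LLM context.
function formatForContext(results, maxCharsPerResult = 500) {
  return results
    .map((r, i) => {
      const body = r.text.length > maxCharsPerResult
        ? r.text.slice(0, maxCharsPerResult) + '…'
        : r.text;
      return `[${i + 1}] ${r.source}: ${r.title}\n${body}`;
    })
    .join('\n\n');
}
```

Numbering the snippets gives the model a way to refer back to a specific source, and the per-result cap keeps five results from crowding out the rest of the prompt.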
166+
167+
### 6. DocGen Tag Integration (`ai-docgen.js`)
168+
Add `@connect:` field parsing:
169+
```
{{AI:
@connect: github-repo, docs-site
@prompt: Summarize the API changes
}}
```
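
Parsing the `@connect:` field could be sketched as below. `parseAiTag` is a hypothetical helper, and the existing parser in `ai-docgen.js` may already handle `@field:` lines differently. The sketch only assumes one `@field: value` pair per line and a comma-separated connector list.

```js
// Hypothetical sketch: parse an {{AI: ... }} tag into its @connect list and @prompt text.
function parseAiTag(tag) {
  const match = tag.trim().match(/^\{\{AI:\s*([\s\S]*?)\}\}$/);
  if (!match) return null;
  const fields = {};
  for (const line of match[1].split('\n')) {
    // Only the first colon after the field name separates name from value,
    // so values like "pageindex:pi-abc123def456" stay intact.
    const m = line.trim().match(/^@(\w+):\s*(.*)$/);
    if (m) fields[m[1]] = m[2].trim();
  }
  const connect = (fields.connect || '')
    .split(',')
    .map((s) => s.trim())
    .filter(Boolean);
  return { connect, prompt: fields.prompt || '' };
}
```

The resulting `connect` array can be passed as the `connectorIds` argument of `M._connectors.search`.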
175+
176+
---
177+
178+
## DB Assessment (sql.js vs alternatives)
179+
180+
| DB | Size | FTS? | Verdict |
181+
|---|---|---|---|
182+
| **sql.js** (current) | ~300KB | FTS5 ✓ | ✅ Keep — right tool for chunked doc search |
183+
| **PGlite** | ~3–5MB | tsvector | ❌ Overkill — massive bundle, less FTS capability |
184+
| **wa-sqlite** | ~300KB | FTS5 | ✅ Future upgrade — same SQL API, native OPFS |
185+
| **cr-sqlite** | ~400KB | FTS5 | ✅ Future — adds CRDTs for multi-device sync |
186+
| **DuckDB-WASM** | ~8MB | Limited | ❌ Analytics engine, wrong use case |
187+
188+
**Recommendation**: Stay with sql.js + FTS5. If a performance upgrade is needed, migrate to **wa-sqlite** (same SQL API, native OPFS persistence). If multi-device sync is needed, consider **cr-sqlite**.
189+
190+
---
191+
192+
## Privacy Notes
193+
194+
- URL/RSS connectors make fetch requests to external URLs
195+
- GitHub connector sends PAT to `api.github.com`
196+
- Consistent with existing web search feature (calls external APIs)
197+
- Consider adding a consent/warning banner for first-time use
