
Add Librarian — AI chat panel for SQL lineage Q&A #43

Open
liliyaminibaeva wants to merge 38 commits into pondpilot:master from liliyaminibaeva:feature/librarian

Conversation

@liliyaminibaeva

PR: Add Librarian — AI-powered chat panel for SQL lineage Q&A

Summary

Librarian is a new chat panel that lets users ask natural-language questions about their data using SQL lineage context and uploaded PDF documentation. It sits as a third resizable panel to the right of the analysis view.

Features

  • AI Chat with structured responses (Summary / Data Lineage / Documentation sections); off-topic questions get a fixed refusal.
  • Multiple AI providers: OpenAI, Anthropic, and custom OpenAI-compatible endpoints (e.g. LiteLLM). Config in localStorage, sent only to the configured provider.
  • PDF documentation — drag-and-drop or click upload. Local pipeline: pdfjs-dist text extraction → 500-char chunks → Xenova/multilingual-e5-small embedding (Web Worker, 100+ languages) → cosine similarity search. 10 MB per file, no file count limit.
  • Inline identifier highlighting in answers — table and column names are styled distinctly from inline code; matching is case-insensitive and normalized to canonical schema casing.
  • Click-to-navigate: clicking an assistant answer reads its Summary section and drives the lineage view via the existing search pipeline. Every referenced column highlights across all owning tables.
  • Schema search in the Schema view header — substring match by table or column name with Prev/Next cycling.
  • Per-project state isolation (RAM-only) — chat history, PDFs, and embedded chunks are scoped to the active project; switching projects does not leak content into another project's prompt. F5 wipes everything.
  • Keyboard shortcut ⌘L / Ctrl+L to toggle the panel.
  • Test coverage — 304 Vitest tests passing (yarn typecheck clean).
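The local PDF pipeline above ends in a cosine-similarity search over embedded chunks. A minimal sketch of that last step (the `Chunk` shape and `searchChunks` helper are illustrative assumptions, not the PR's actual API):

```typescript
// Minimal sketch of the local vector-search step: cosine similarity
// over pre-embedded PDF chunks. Names are illustrative.
interface Chunk {
  text: string;
  embedding: number[]; // produced by the e5 model in the Web Worker
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Guard against zero vectors to avoid dividing by zero.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

// Return the k chunks most similar to the embedded question.
function searchChunks(query: number[], chunks: Chunk[], k = 3): Chunk[] {
  return [...chunks]
    .sort(
      (x, y) =>
        cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding),
    )
    .slice(0, k);
}
```

A real implementation would precompute norms instead of recomputing them inside the sort comparator, but the ranking logic is the same.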

How it works

User question → use-librarian-chat.ts (orchestrator)
  ├── Lineage context ← lineage-formatter.ts ← useLineageState()
  ├── SQL snippet ← useProject() (active file, truncated to 3000 chars)
  ├── PDF context ← vector search ← embedding-service (Web Worker)
  └── Chat history ← Zustand store (last 10 messages sent to AI, full history shown in UI)
        ↓
  context-builder.ts → assembled prompt
        ↓
  ai-service.ts → fetch() → OpenAI / Anthropic / Custom endpoint
        ↓
  Response → store → chat UI (markdown + highlighted identifiers)
  • All processing runs locally in the browser (embeddings via Web Worker, no backend).
  • AI answers strictly from provided data (SQL, lineage, PDFs) — no general knowledge.
  • API keys stored in localStorage, sent only to the configured provider.
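The assembly step in the diagram can be sketched as follows. Only the 3000-char SQL truncation and 10-message history window come from the description above; the section headings, field names, and `buildPrompt` signature are illustrative assumptions:

```typescript
// Hedged sketch of the context-builder step, not the PR's actual code.
interface ChatMessage { role: 'user' | 'assistant'; content: string }

const SQL_SNIPPET_LIMIT = 3000; // chars, per the PR description
const HISTORY_LIMIT = 10;       // last N messages sent to the model

function buildPrompt(opts: {
  lineage: string;
  sql: string;
  pdfChunks: string[];
  history: ChatMessage[];
}): string {
  const sql = opts.sql.slice(0, SQL_SNIPPET_LIMIT);   // truncate active SQL
  const history = opts.history.slice(-HISTORY_LIMIT); // last 10 messages only
  return [
    '## Lineage\n' + opts.lineage,
    '## SQL\n' + sql,
    '## Documentation\n' + opts.pdfChunks.join('\n---\n'),
    '## History\n' + history.map((m) => `${m.role}: ${m.content}`).join('\n'),
  ].join('\n\n');
}
```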

Files added

app/src/features/librarian/
├── components/          # UI: panel, chat, PDF upload, settings dialog, toggle
├── services/            # AI service, context builder, lineage formatter, PDF processor, vector search, embeddings
├── hooks/               # Chat orchestrator, sync-active-project
├── workers/             # Embedding Web Worker (Xenova/transformers)
├── utils/               # schema-identifiers (detect, resolve, extractSummary)
├── __tests__/           # Vitest suites
├── types.ts
├── constants.ts
├── store.ts
└── index.ts

app/public/polly-icon.svg                 # Custom duck icon
app/src/lib/lineage-node-resolver.ts      # Resolve ChatReference[] → lineage node ids
app/src/lib/lineage-navigation.ts         # Pure consumer of NavigationTarget for the lineage tab
app/src/lib/__tests__/lineage-node-resolver.test.ts
app/src/lib/__tests__/lineage-navigation.test.ts
docs/librarian.md                         # User guide

Files modified

  • app/src/components/Workspace.tsx: third ResizablePanel for Librarian; toggle button in the analysis toolbar; LibrarianPanelWithNavigation wires chat-click → lineage search-term + reveal
  • app/src/components/AnalysisView.tsx: wires actionsRef.current.revealNodeInGraph into applyLineageNavigation deps
  • app/src/components/GlobalDropZone.tsx: overlay made pointer-events-none so PDF drops reach the Librarian dropzone
  • app/src/lib/view-state-store.ts: added librarianOpen flag and per-project Librarian state hook integration
  • app/src/lib/shortcuts.ts: toggle-librarian shortcut
  • app/src/lib/navigation-context.tsx: NavigationTarget.highlightNodeIds / tablesToExpand / primaryFocusId for chat-click navigation
  • app/package.json: added pdfjs-dist, @xenova/transformers, Vitest + RTL deps
  • app/vitest.config.ts: new — Vitest configuration
  • CHANGELOG.md: single [Unreleased] entry describing the Librarian feature
  • docs/librarian.md: user guide

New dependencies

  • pdfjs-dist: PDF text extraction in the browser (~800 KB)
  • @xenova/transformers: local embedding model inference (~2 MB JS + ~23 MB model, lazy-loaded and cached via the browser cache / env.useBrowserCache)
  • vitest, @testing-library/react, jsdom: testing (devDependencies)

Known limitations

  • Single-substring search. Click-to-navigate writes one substring into the lineage search box; heterogeneous Summary references (e.g. both MANDT and BUKRS) cannot share a single search term — the first column wins. The user can dismiss the highlight via the lineage search box.
  • SchemaView highlights one table at a time. The Schema view's selection prop accepts a single selectedTableName; schema search uses Prev/Next cycling to compensate.
  • Chat-click overrides active manual lineage search. Setting the search term on click overwrites whatever the user had typed; clearing the search box resets to no highlight.
  • Per-project state is RAM-only. Page reload (F5) clears chat, PDFs, and embedded chunks. By design — no localStorage persistence to keep the LLM context predictable.

Test plan

  • Open FlowScope, paste SQL, verify lineage renders normally.
  • Click the Librarian toolbar button (or ⌘L) — Librarian panel opens on the right.
  • Configure AI settings (OpenAI, Anthropic, or custom endpoint).
  • Ask a question about the SQL — verify structured response with Summary / Data Lineage / Documentation.
  • Upload a PDF (test both click-to-upload and drag-and-drop) — verify it processes and shows "ready" status.
  • Ask a question related to PDF content — verify Documentation section cites the PDF.
  • Click an assistant answer that mentions a column — verify the lineage view highlights the column across all owning tables and pans/pulses on a source table.
  • Click an answer that mentions only a table — verify pan/pulse on that table.
  • Switch projects — verify chat, PDFs, and answers are scoped to the active project (no leakage).
  • Run yarn typecheck — 0 errors.
  • Run yarn test — all tests pass.

liliyaminibaeva and others added 30 commits April 30, 2026 17:26
Librarian is a chat panel that lets users ask natural-language questions
about their data using SQL lineage context and uploaded PDF documentation.

Features:
- AI chat with structured responses (Summary / Data Lineage / Documentation)
- OpenAI, Anthropic, and custom endpoint support (LiteLLM compatible)
- PDF upload with text extraction, chunking, and vector search (embeddings)
- Drag-and-drop PDF upload with GlobalDropZone integration
- Keyboard shortcut to toggle panel
- Full test coverage with Vitest

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Set useBrowserCache=true to prevent CORS errors when loading the
  embedding model from Hugging Face (no-cache requests get blocked)
- Strengthen "No information." instruction in prompt to reduce AI
  deviations from the expected response format
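The cache fix described above amounts to a small bit of transformers.js configuration before the embedding pipeline is created. A hedged sketch (env.useBrowserCache and allowLocalModels are real transformers.js flags; the surrounding code is illustrative, not the PR's actual worker setup):

```typescript
// Illustrative transformers.js setup for the embedding Web Worker.
import { env, pipeline } from '@xenova/transformers';

// Cache downloaded model files in the browser Cache API. Without this,
// no-cache requests to the Hugging Face CDN can be blocked by CORS.
env.useBrowserCache = true;
env.allowLocalModels = false; // always fetch from the hub

// Model name taken from the PR description.
const embedder = await pipeline(
  'feature-extraction',
  'Xenova/multilingual-e5-small',
);
```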

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add min-w-0 on the file-name cell, whitespace-nowrap on the size span,
and shrink-0 on the delete button so the file name truncates with
ellipsis while the size and trash icon stay fully on screen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a schema-identifiers utility and renders known table/column names in
assistant messages as distinct font-mono/primary tokens. Updates the system
prompt to emit bare identifiers, include technical names in Summary, and
refuse off-topic questions with a canned response.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a collapsible magnifying-glass search control overlaid on the
Schema tab. Typing prefix-matches a table (case-insensitive) and drives
the existing selectedTableName highlight; clearing or Escape collapses
the control and clears selection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
App workspace tests, lint, and typecheck all clean. Pre-existing
failures in packages/react and pdf-processor.test.ts are unrelated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes out the 2026-04-20 Librarian UI/LLM plan: Unreleased notes
cover the new help popover, schema search, clickable assistant
messages, styled schema identifiers, and the toolbar toggle move;
plan archived to docs/plans/completed/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix two test issues found during code review of the Librarian UI/LLM
polish plan: a misleading test name that referenced max-h-[200px] while
asserting on max-h-[64px], and a fake persistence test for librarianOpen
that only restated the prior test's assertion instead of verifying the
partialize output reaches localStorage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- AnalysisView: keep LibrarianToggleButton reachable when no analysis
  result is loaded (was hidden because the early-return placeholder
  dropped the toolbar after the toggle was moved out of the global
  header).
- chat-messages: don't navigate to schema when the user is selecting
  text inside an assistant bubble or clicking inside a code block —
  preserves text selection and SQL copy interactions on clickable
  bubbles.
- embedding-worker: refresh stale comment that contradicted the now
  enabled browser cache.
- pdf-processor.test: replace `as any` casts with a typed cast to
  clear pre-existing lint errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
UI:
- Narrow-panel fixes: PDF file row truncates name, delete button stays visible
- Moved Librarian toggle to analysis toolbar next to Schema button, rounded-full shape preserved in both states
- Replaced BookOpen with custom Polly duck icon in toolbar, chat avatar, empty state
- Aligned Librarian panel header height with analysis toolbar (44px)
- Chat input placeholder: "Ask about your data..."
- Colored inline code (table/column names) in accent color

Schema search:
- Extended search to column names (falls back to owning table)
- Prev/Next navigation for multiple matches with "1/N" indicator
- Enter/Shift+Enter to cycle matches, Escape to close

Prompt tuning:
- Summary must include concrete table/column names, not vague overviews
- Documentation must cite source PDF file name
- Off-topic refusal and bare-token identifier formatting retained

Embeddings:
- Switched to multilingual-e5-small (100+ languages)
- Added query/passage prefix handling required by e5 models
- User questions embedded as "query:", PDF chunks as "passage:"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New docs/librarian.md with user-facing guide (AI setup, PDFs,
  shortcuts, privacy, troubleshooting)
- New app/src/features/librarian/TEST_CASES.md with 25 manual test
  cases and the sample SAP SQL pipeline used during testing
- README.md: mention Librarian in Web App features, Key Features,
  and Documentation sections
- docs/README.md: add Features section pointing to librarian.md
- app/ARCHITECTURE.md: document features/librarian/ module, chat
  data flow, Polly icon, and ⌘L shortcut
- Remove unused Polly_librarian_violet.svg (icons use polly-icon.svg)
- Remove stale plan doc (superseded by current code + CHANGELOG)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduce byProject keyed by activeProjectId; rewrite mutators to
operate on the active bucket and no-op when no project is active.
Add useLibrarianMessages / useLibrarianPdfFiles / useLibrarianPdfChunks
selector hooks. Flat-shape mirror retained transitionally so existing
consumers keep typechecking until Task 3 migrates them.
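The byProject shape and the no-op guard can be sketched in plain TypeScript (no Zustand, and the field names are illustrative rather than the PR's actual store):

```typescript
// Sketch of the per-project bucket store. Mutators no-op when no
// project is active, so nothing is written into a shared flat shape.
interface Bucket { messages: string[]; pdfChunks: string[] }

interface LibrarianState {
  activeProjectId: string | null;
  byProject: Record<string, Bucket>;
}

function addMessage(state: LibrarianState, message: string): LibrarianState {
  const id = state.activeProjectId;
  if (!id) return state; // no active project: no-op
  const bucket = state.byProject[id] ?? { messages: [], pdfChunks: [] };
  return {
    ...state,
    byProject: {
      ...state.byProject,
      [id]: { ...bucket, messages: [...bucket.messages, message] },
    },
  };
}
```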

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds useSyncActiveProject hook and invokes it once in Workspace.tsx so
the Librarian store stays pointed at the right per-project bucket.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch LibrarianPanel and PdfUpload to useLibrarianMessages /
useLibrarianPdfFiles so the panel reads from the active project's
bucket. Add a no-active-project guard: ChatInput accepts an optional
noActiveProject flag that disables input and surfaces an empty-state
hint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace flat-store reads in use-librarian-chat with bucket-scoped reads
keyed by activeProjectId, so PDF chunks and chat history sent to the LLM
come strictly from the current project. Bail out with a friendly
assistant message when no project is active. Seed byProject in the test
beforeEach and add a case verifying project A's PDF chunks aren't
searched while project B is active.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…olation

typecheck and lint clean. Per-project isolation manually verified by code
review: store mutators are activeProjectId-scoped, selectors return
empty when no active project, chat hook reads from the active bucket,
chat input disables with a hint when activeProjectId is null. No
regressions in the librarian test suite (pre-existing prompt/model/UI
test rot from earlier commits is out of scope).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a CHANGELOG entry, document per-project chat/PDF scoping in the user
guide, and move the plan into docs/plans/completed/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Fix mid-flight project switch leak in use-librarian-chat: capture
  activeProjectId once and route writes through addMessageToProject
  so an in-flight LLM response always lands in the originating
  project's bucket, not whichever project is active when the
  network call returns.
- Add pruneProjectBuckets mutator and wire it from useSyncActiveProject
  so deleting a project also drops its Librarian bucket (chat history
  + embedded chunks) instead of leaking RAM for the tab lifetime.
- Drop dead-code addMessage('Open or create a project') call in the
  chat hook; the chat-input UI's noActiveProject hint already covers
  the empty-state UX.
- Compare PDF file names case-insensitively in hasPdfFile so duplicate
  detection works on case-insensitive filesystems.
- Update ARCHITECTURE.md to describe the per-project store shape and
  the new sync hook.
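The mid-flight capture fix boils down to reading the project id once, before the async call, so the response lands in the originating bucket even if the user switches projects while the request is in flight. A sketch under assumed names (none of these are the PR's actual signatures):

```typescript
// Illustrative: capture the project id up front, then route the write
// through an explicit-id mutator instead of the "current active" id.
type AddMessageToProject = (projectId: string, message: string) => void;

async function askLibrarian(
  getActiveProjectId: () => string | null,
  callLlm: (q: string) => Promise<string>,
  addMessageToProject: AddMessageToProject,
  question: string,
): Promise<void> {
  const projectId = getActiveProjectId(); // captured once, before awaiting
  if (!projectId) return;
  const answer = await callLlm(question);
  addMessageToProject(projectId, answer); // originating bucket, always
}
```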

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Close two remaining per-project isolation gaps surfaced in the
second-pass review.

PDF mid-flight project switch leak: handlePdfUpload now captures the
active project id at upload time and routes addPdfFile / addPdfChunks /
setPdfStatus through new explicit-id mutators (addPdfFileToProject,
addPdfChunksToProject, setPdfStatusForProject). Without this, switching
projects between addPdfFile and the chunks/status writes that follow
processPdf would land the chunks in the wrong bucket and leave the
originating PDF stuck on processing.

Zombie bucket on deleted project: addMessageToProject and the new PDF
*ToProject mutators now drop writes when the bucket is missing AND the
project is no longer active. This prevents a late-arriving response or
chunk write from resurrecting a bucket that pruneProjectBuckets just
removed because the project was deleted. Lazy initialization for fresh
projects still works because the active project always passes the
guard.
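The zombie-bucket guard reduces to one predicate: accept the write if the bucket still exists, or if the target project is the active one (which covers lazy initialization). A sketch with illustrative names:

```typescript
// Drop late-arriving writes when the bucket was pruned AND the project
// is no longer active; the active project always passes, so fresh
// projects still lazily initialize their bucket.
function shouldAcceptWrite(
  projectId: string,
  activeProjectId: string | null,
  buckets: Record<string, unknown>,
): boolean {
  return projectId in buckets || projectId === activeProjectId;
}
```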

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves a list of ChatReference values from chat answers into concrete
GlobalNode IDs in the lineage graph, plus the parent table IDs that need
to be expanded so matched columns become visible. This is the resolution
layer the chat-click multi-highlight wiring (Task 4) will consume.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the schema-tab single-table click handler with a reference-set
flow: chat-messages now parses every identifier with `resolveAllReferences`,
the panel hands the full ref array to Workspace, and
LibrarianPanelWithNavigation resolves them to lineage node IDs and triggers
a `navigateTo('lineage', { highlightNodeIds, tablesToExpand })` so all
referenced tables/columns can be highlighted together.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion

When a Librarian chat answer is clicked, AnalysisView now expands any parent
tables that are not already expanded and selects the first highlighted node
so the column references become visible. The branching logic was extracted
into applyLineageNavigation so the navigation contract can be unit-tested
without mounting the full view.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns stale tests with current production code (multilingual embedding
model, rewritten prompt, Polly icon, PDF size whitespace) and fixes a
SchemaSearchControl regression where emptying the input no longer
cleared the table selection. All gates pass: yarn typecheck (0 errors),
yarn test (291/291), yarn lint (0 errors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update the Librarian user guide to describe the new chat-click → Lineage
flow, log the change in CHANGELOG.md, and move the completed plan into
docs/plans/completed/. Notes the single-select fallback caused by the
@pondpilot/flowscope-react public API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Tighten chat-reference qualifier gap to single dot + horizontal
  whitespace only. Newlines or repeated dots between identifiers
  (e.g. "BKPF.\n\nMANDT" or "BKPF..MANDT") no longer cause spurious
  qualified references that would mis-resolve in lineage.
- Remove dead `resolveFirstTableReference` and the `SchemaReference`
  type they returned — no production caller remains after the
  multi-highlight migration; plan called for removal if unused.
- Add regression tests for the tightened gap behavior.
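The tightened gap rule (exactly one dot, horizontal whitespace only) can be expressed as a small regex. The pattern below is illustrative of the behavior described, not the PR's exact expression:

```typescript
// One identifier, a single dot with optional spaces/tabs (no newlines,
// no repeated dots), then a second identifier.
const QUALIFIED_REF = /\b([A-Za-z_][\w$]*)[ \t]*\.[ \t]*([A-Za-z_][\w$]*)\b/;

function matchQualifiedRef(text: string): [string, string] | null {
  const m = QUALIFIED_REF.exec(text);
  return m ? [m[1], m[2]] : null;
}
```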

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
liliyaminibaeva and others added 8 commits April 30, 2026 17:28
Recenter on parent table when chat-click first highlight is a column.

Columns are not top-level ReactFlow nodes (they render inside table nodes),
so passing a column id to useNodeFocus.getNode returns undefined and the
fitView never fires — the viewport silently failed to recenter whenever a
chat answer's first matched reference was a column. The resolver now
exposes a primaryFocusId that points at the parent table for column refs;
applyLineageNavigation uses it for setFocusNodeId while still passing the
column id to selectNode so the column highlights inside its table.
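The resolver change above is essentially a fallback from column id to parent table id when picking the focus target; a sketch with assumed shapes:

```typescript
// Columns are not top-level ReactFlow nodes, so viewport focus must
// target the owning table. Types and names here are illustrative.
interface ResolvedRef {
  nodeId: string;
  kind: 'table' | 'column';
  parentTableId?: string;
}

function primaryFocusId(refs: ResolvedRef[]): string | null {
  const first = refs[0];
  if (!first) return null;
  return first.kind === 'column' ? first.parentTableId ?? null : first.nodeId;
}
```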

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two correctness fixes from third-pass review:

- lineage-node-resolver: index table-like nodes by qualified name
  (catalog.schema.name) so columns from multi-schema graphs with
  duplicate table names route to their actual parent for expansion.
  Previously the bare-name index dropped schema, so a column from
  staging.BKPF.MANDT was mapped to whichever BKPF (sap or staging)
  registered first — leaving the real owning table collapsed and the
  column highlight invisible.

- chat-messages: scope the page-wide-text-selection guard to mouse
  clicks only. Pressing Enter/Space on a focused chat bubble now
  always activates regardless of any stale selection elsewhere on
  the page (keyboard a11y).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on chat click

- detectIdentifiers now matches case-insensitively and normalizes matches
  back to canonical schema casing, so LLM-emitted lowercase references
  (e.g. rbkp.ZLSPR) resolve like their uppercase canonical counterparts.
- LibrarianPanelWithNavigation writes the first column (preferred) or
  first table from resolved refs into the persisted lineage searchTerm,
  reusing the existing GraphView search pipeline. Force-enables column
  edges when a column is referenced so the matching column rows can
  highlight.
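The normalization step described in the first bullet is a case-insensitive lookup that returns the canonical schema spelling; a minimal sketch (helper name is illustrative):

```typescript
// Match an LLM-emitted identifier against known schema names without
// regard to case, returning the canonical casing on a hit.
function canonicalize(identifier: string, schemaNames: string[]): string | null {
  const lower = identifier.toLowerCase();
  return schemaNames.find((n) => n.toLowerCase() === lower) ?? null;
}
```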

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
applyLineageNavigation no longer calls setFocusNodeId for the
highlightNodeIds branch. The useNodeFocus hook in flowscope-react
auto-zooms aggressively, which is jarring for chat answers that
reference multiple tables. The search-term highlight + selectNode
still flag the relevant tables; the user keeps their viewport.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Upstream v0.7.0 flattened AnalyzeResult: graph nodes and edges are now
top-level fields instead of nested under `globalLineage`. Migrate the
librarian-side consumers and their tests:

- formatLineage now reads result.nodes / result.edges directly
- resolveLineageNodeIds uses result.nodes for the global node index
- Test fixtures replace globalLineage.{nodes,edges} with the flat
  fields and statementRefs with statementIds, matching the new Node /
  Edge contract

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nt table

The lineage view's chat-click handler now reacts only to identifiers
mentioned in the answer's Summary section, instead of every identifier
across Summary / Data Lineage / Documentation. This avoids navigation
landing on tables that appear only as supporting context (e.g. "MANDT
is not exposed in INVOICE_HEADER" no longer pulls focus to
INVOICE_HEADER).

- Add extractSummary() utility that lifts the Summary block from the
  three-section LLM answer template, falling back to the full text when
  no marker is found
- chat-messages.tsx feeds extractSummary(content) into
  resolveAllReferences for click navigation; inline identifier styling
  still uses the full message
- LibrarianPanelWithNavigation writes the lineage searchTerm before the
  resolver short-circuits on empty nodeIds, so table cards still
  highlight via isNodeHighlighted even when no concrete column nodes
  are reachable
- lineage-node-resolver ranks column matches by parent type — actual
  source tables ('table') win over views ('view') and CTEs ('cte') so
  primaryFocusId lands on a base table when one exists, instead of a
  view that just touches the column transitively
- lineage-navigation now calls revealNodeInGraph for the highlight
  branch (gentle pan + pulse via flowscope-react v0.7.0) instead of the
  aggressive useNodeFocus zoom we removed earlier; AnalysisView wires
  actionsRef.current.revealNodeInGraph into the deps
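The extractSummary() utility described above lifts one section out of the templated answer and falls back to the full text. A sketch assuming markdown-style `## Summary` headings (the PR's actual section markers may differ):

```typescript
// Pull the Summary block from a three-section answer; return the whole
// answer when no Summary marker is found.
function extractSummary(answer: string): string {
  const m = /#+\s*Summary\s*\n([\s\S]*?)(?=\n#+\s|\s*$)/i.exec(answer);
  return m ? m[1].trim() : answer;
}
```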

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CHANGELOG: collapse the per-iteration Librarian bullets in
[Unreleased] into a single dated entry that describes the feature as
shipped, since the iterations were local development noise rather
than user-visible increments.

docs/librarian.md: rewrite the "Jump to Lineage from a chat answer"
section to reflect the actual behavior — Summary-scoped navigation,
search-term-driven multi-highlight, auto-enabled column edges,
gentle pan + pulse on the parent source table, and case-insensitive
identifier matching. Replaces the stale "single-select due to API
limitation" note with the substring-search trade-off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run prettier --write on the Librarian feature directory and the two
lib files (lineage-navigation, lineage-node-resolver) it owns. No
behavior changes — formatting only.

Also removes a completed-plans doc that is no longer relevant: the
chat-click multi-highlight plan landed via the search-term pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@twoxfh

twoxfh commented May 8, 2026

Very cool! I tried it out and it's very useful. From what I have seen, prompts are also part of the configuration in tools like this. Could the context-builder source its prompt from an editable field, with the default being the current prompt?

Context can get huge; surfacing the raw context string size to the user might be very helpful. Just a thought!
