Add web search UI #4

Open

welsberr wants to merge 15 commits into timschmidt:main from welsberr:add-web-search-ui
Conversation

welsberr (Contributor) commented May 5, 2026

A web search UI for Wolfe.

Establishes a locally served page with a search-phrase input, a field for the number of top matches, and a submit-search button. Matches are displayed by filename and can also incorporate metadata (additional scripts handle the metadata analysis). The search phrase is embedded with a local Jina model.

Also adds a 'context' match function that returns a flattened text result containing the N chunks preceding and following each match.
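The context-flattening idea can be sketched roughly as follows (a hypothetical Python illustration; `chunks`, `match_indices`, and `n` are illustrative names, not identifiers from the PR):

```python
def flatten_with_context(chunks, match_indices, n):
    """Return a flattened text result that includes the n chunks
    before and after each matched chunk, de-duplicating overlaps."""
    keep = set()
    for i in match_indices:
        # widen each match to a window of n prior and n succeeding chunks
        lo = max(0, i - n)
        hi = min(len(chunks) - 1, i + n)
        keep.update(range(lo, hi + 1))
    # emit the kept chunks in document order as one flat string
    return "\n".join(chunks[i] for i in sorted(keep))
```

For example, `flatten_with_context(["a", "b", "c", "d", "e"], [2], 1)` yields `"b\nc\nd"`; overlapping windows from adjacent matches are merged rather than duplicated.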

gemini-code-assist (bot) left a comment

Code Review

This pull request significantly expands the Wolfe search tool by introducing a local web UI, a persistent embedding service, and a comprehensive corpus enrichment pipeline for extracting bibliographic metadata, references, and concept phrases. It also adds support for remote OpenAI-compatible embedding endpoints, a lexical match context search mode, and improved CUDA device management. Feedback focused on removing a hardcoded row limit in index retrieval to prevent data truncation, implementing atomic file writes for metadata catalogs to ensure reliability, and reducing the network timeout for remote embedding requests.

Comment thread: src/main.rs

    let table = open_table(connection, table_name)
        .await
        .ok_or("search table does not exist")?;
    let mut results = table.query().limit(1_000_000).execute().await?;
Severity: high

Hardcoding a limit of 1,000,000 rows in all_index_rows may cause data truncation for large corpora. Consider removing the limit or making it configurable.

Suggested change:

    - let mut results = table.query().limit(1_000_000).execute().await?;
    + let mut results = table.query().execute().await?;

Comment thread: src/main.rs

            fs::create_dir_all(parent)?;
        }
    }
    let catalog_file = fs::File::create(&args.metadata_catalog)?;
Severity: medium

Writing directly to the catalog file is not atomic. If the process fails, the file may be left in a corrupted state. Consider writing to a temporary file and renaming it to the final destination.
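The code under review is Rust, but the suggested write-then-rename pattern is language-agnostic; a minimal sketch in Python (file and function names are illustrative, not from the PR):

```python
import json
import os
import tempfile

def write_catalog_atomically(path, catalog):
    """Write the catalog to a temp file in the same directory, then
    rename it over the destination, so readers never observe a
    partially written file."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(catalog, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # ensure bytes hit disk before rename
        os.replace(tmp_path, path)  # atomic replacement on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # leave no stray temp file behind
        raise
```

The temp file must live in the same directory (and thus the same filesystem) as the destination, since rename is only atomic within one filesystem; the analogous Rust calls would write to a temporary path and `std::fs::rename` it into place.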

Comment thread: scripts/embed.py

    request = urllib.request.Request(embedding_url, data=payload, headers=headers, method="POST")
    try:
        with urllib.request.urlopen(request, timeout=300) as response:
Severity: medium

A 300-second timeout for an HTTP request is quite long and may cause the process to hang if the embedding service is unresponsive. Consider a shorter timeout.

Suggested change:

    - with urllib.request.urlopen(request, timeout=300) as response:
    + with urllib.request.urlopen(request, timeout=30) as response:
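A shorter timeout is usually paired with retries so a transient stall does not fail the whole batch; a hedged sketch of one way to do that (`post_with_retry` and its parameters are hypothetical, not names from embed.py):

```python
import socket
import time
import urllib.request

def post_with_retry(request, timeout=30, retries=3, backoff=2.0,
                    opener=urllib.request.urlopen):
    """Send a request with a bounded timeout, retrying on timeouts
    with linearly increasing backoff; re-raises the last timeout
    error if every attempt fails."""
    last_error = None
    for attempt in range(retries):
        try:
            return opener(request, timeout=timeout)
        except (socket.timeout, TimeoutError) as err:
            last_error = err
            if attempt + 1 < retries:
                # linear backoff: 2s, 4s, ... with backoff=2.0
                time.sleep(backoff * (attempt + 1))
    raise last_error
```

Passing the opener as a parameter keeps the retry logic testable without a live embedding service; in production the default `urllib.request.urlopen` is used unchanged.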
