404 changes: 404 additions & 0 deletions docs/branching-example.md

Large diffs are not rendered by default.

36 changes: 34 additions & 2 deletions docs/schema.md
@@ -36,7 +36,8 @@ The database consists of the following tables:
7. **git_commits** - Git commit metadata with unified diffs and changed symbols
8. **lore** - Lore.kernel.org email archive with FTS indices for fast searching
9. **lore_vectors** - Vector embeddings for semantic search of lore emails
10. **content_0 through content_15** - Deduplicated content storage (16 shards)
10. **indexed_branches** - Tracks which git branches have been indexed with their tip commits
11. **content_0 through content_15** - Deduplicated content storage (16 shards)

## Table Schemas

@@ -354,7 +355,38 @@ Embeddings combine from, subject, recipients, and body into a single representat
- IVF-PQ vector index for fast approximate nearest neighbor search


### 10. content_0 through content_15 (Content Shards)
### 10. indexed_branches

Tracks which git branches have been indexed, enabling multi-branch support and efficient incremental indexing across branches.

**Schema:**
```
branch_name (Utf8, NOT NULL) - Branch name (e.g., "main", "origin/develop")
tip_commit (Utf8, NOT NULL) - Commit SHA at the tip when indexed (40-char hex)
indexed_at (Int64, NOT NULL) - Unix timestamp of when branch was last indexed
remote (Utf8, nullable) - Remote name if tracking branch (e.g., "origin")
```

**Purpose:**
- Tracks which branches have been indexed and at which commit
- Enables efficient multi-branch indexing by skipping already-current branches
- Supports both local branches (e.g., "main") and remote-tracking branches (e.g., "origin/develop")
- Stores indexing timestamp for freshness tracking

**Use Cases:**
- Multi-branch indexing: `semcode-index --branches main,develop,feature-x`
- Branch update detection: Skip branches already indexed at current tip
- Query scoping: Limit queries to specific branch context
- Branch cleanup: Remove data for deleted branches

**Indices:**
- BTree on `branch_name` (primary lookup by branch name)
- BTree on `tip_commit` (find branches at specific commits)
- BTree on `remote` (filter by remote)
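
As a rough illustration of how a row in this table might be used, the sketch below mirrors the four columns in a plain Rust struct and shows the two lookups the indices are meant to serve. The struct and the helper names (`IndexedBranch`, `is_full_sha`, `is_branch_current`, `branches_at_commit`) are hypothetical and operate on an in-memory slice; they are not the project's actual database API.

```rust
/// Hypothetical in-memory mirror of one `indexed_branches` row.
#[derive(Debug, Clone, PartialEq)]
struct IndexedBranch {
    branch_name: String,    // e.g. "main", "origin/develop"
    tip_commit: String,     // 40-char hex SHA at the time of indexing
    indexed_at: i64,        // Unix timestamp (seconds)
    remote: Option<String>, // e.g. Some("origin") for remote-tracking branches
}

/// Checks the documented shape of `tip_commit`: exactly 40 hex characters.
fn is_full_sha(s: &str) -> bool {
    s.len() == 40 && s.bytes().all(|b| b.is_ascii_hexdigit())
}

/// Lookup by branch name: true when the stored tip equals `current_tip`,
/// which is the condition for skipping an already-current branch.
fn is_branch_current(rows: &[IndexedBranch], branch_name: &str, current_tip: &str) -> bool {
    rows.iter()
        .any(|r| r.branch_name == branch_name && r.tip_commit == current_tip)
}

/// Lookup by tip commit: names of all branches recorded at `commit`.
fn branches_at_commit<'a>(rows: &'a [IndexedBranch], commit: &str) -> Vec<&'a str> {
    rows.iter()
        .filter(|r| r.tip_commit == commit)
        .map(|r| r.branch_name.as_str())
        .collect()
}
```

In the real table these lookups would presumably be served by the BTree indices rather than a linear scan; the scan here only keeps the sketch self-contained.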

---

### 11. content_0 through content_15 (Content Shards)

Stores deduplicated content referenced by other tables, distributed across 16 shard tables for optimal performance.

19 changes: 18 additions & 1 deletion docs/semcode-mcp.md
@@ -1,12 +1,15 @@
# semcode usage guide

All semcode functions are git aware and default to lookups on the current
commit. You can also pass a specific commit you're interested in.
commit. You can also pass a specific commit you're interested in, or a branch name.

**Note on Regex Patterns**: All regex patterns in semcode are **case-insensitive by default**. This applies to all pattern matching including function names, commit messages, symbols, and lore email searches. You don't need to use the `(?i)` flag.

**Branch Support**: Most query tools support a `branch` parameter as an alternative to `git_sha`. When you specify a branch name (e.g., "main", "develop"), it will be resolved to the current tip commit of that branch. Branch takes precedence over git_sha if both are provided.
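
The precedence rule can be sketched as a small selection function. This is only an illustration of the documented ordering (branch first, then git_sha, then the current commit); `Target` and `select_target` are hypothetical names, and resolving a branch name to its tip commit is left out.

```rust
/// Hypothetical description of what a query should run against.
#[derive(Debug, PartialEq)]
enum Target<'a> {
    /// A branch name, to be resolved to its current tip commit.
    Branch(&'a str),
    /// An explicit commit SHA.
    Commit(&'a str),
    /// Neither parameter given: use the current commit.
    CurrentHead,
}

/// Applies the documented precedence: `branch` wins over `git_sha`,
/// and the current commit is the default when both are absent.
fn select_target<'a>(branch: Option<&'a str>, git_sha: Option<&'a str>) -> Target<'a> {
    match (branch, git_sha) {
        (Some(b), _) => Target::Branch(b),
        (None, Some(sha)) => Target::Commit(sha),
        (None, None) => Target::CurrentHead,
    }
}
```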

**find_function**: search for functions and macros
- git_sha: indicates which commit to search (default: current)
- branch: branch name to search (alternative to git_sha, e.g., "main", "develop")
- name: function/macro name, or a regex
- also displays details on callers and callees
**find_type**: search for types and typedefs
@@ -74,6 +77,20 @@ commit. You can also pass a specific commit you're interested in.
sha provided. Mutually exclusive with git_range
- page: optional page number for pagination (1-based). Each page contains
50 lines, results indicate current page and total pages. Default: full results
**list_branches**: list all indexed branches with their status
- No parameters required
- Shows branch names, indexed commit SHAs, and freshness status
- **up-to-date**: indexed commit matches current branch tip
- **outdated**: branch has new commits since indexing (re-index to update)
- Useful for tracking multiple stable branches (e.g., linux-5.10.y, 6.1.y, 6.12.y)
and knowing when they need re-indexing after new releases
**compare_branches**: compare two branches and show their relationship
- branch1: first branch name (e.g., "main")
- branch2: second branch name (e.g., "feature-branch")
- Shows merge base, ahead/behind status, and indexing status for both branches
**indexing_status**: check the status of the background indexing operation
- No parameters required
- Shows current indexing progress, errors, and timing
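
The freshness labels reported by list_branches come down to comparing two commit SHAs. A minimal sketch, assuming the status is derived purely from that comparison (`Freshness` and `classify` are hypothetical names, not the tool's actual types):

```rust
use std::fmt;

/// Hypothetical freshness status for an indexed branch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Freshness {
    UpToDate,
    Outdated,
}

impl fmt::Display for Freshness {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Labels match the wording used in the list_branches description.
        let label = match self {
            Freshness::UpToDate => "up-to-date",
            Freshness::Outdated => "outdated",
        };
        f.write_str(label)
    }
}

/// A branch is up-to-date when the indexed commit equals the current tip.
fn classify(indexed_tip: &str, current_tip: &str) -> Freshness {
    if indexed_tip == current_tip {
        Freshness::UpToDate
    } else {
        Freshness::Outdated
    }
}
```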

## Recipes

240 changes: 240 additions & 0 deletions src/bin/index.rs
@@ -103,6 +103,30 @@ struct Args {
    /// Clone and index a lore.kernel.org archive into <db_dir>/lore/<repo>
    #[arg(long, value_name = "URL")]
    lore: Option<String>,

    // ==================== Multi-Branch Indexing ====================
    /// Index a specific branch (can be specified multiple times)
    /// Example: --branch main --branch develop
    #[arg(long, value_name = "BRANCH")]
    branch: Vec<String>,

    /// Comma-separated list of branches to index
    /// Example: --branches main,develop,feature-x
    #[arg(long, value_name = "LIST", value_delimiter = ',')]
    branches: Vec<String>,

    /// Index all local branches
    #[arg(long)]
    all_branches: bool,

    /// Include remote-tracking branches when using --all-branches
    #[arg(long)]
    remote_branches: bool,

    /// Only index branches that have new commits since last indexed
    /// (skip branches already indexed at current tip)
    #[arg(long)]
    update_branches: bool,
}

/// Fetch and parse the lore.kernel.org manifest
@@ -401,6 +425,215 @@ async fn clone_lore_repository(lore_url: &str, db_path: &str) -> Result<PathBuf>
    Ok(clone_path)
}

// ==================== Branch Indexing Support ====================

/// Collect branches to index from the various branch-related CLI flags
/// Returns BranchRef structs with proper is_remote/remote metadata
fn collect_branches_to_index(args: &Args) -> Result<Vec<semcode::git::BranchRef>> {
    use semcode::git::{get_branch_info, list_branches, BranchRef};

    let mut branch_names = Vec::new();

    // Collect branch names from --branch flags (can be repeated)
    branch_names.extend(args.branch.iter().cloned());

    // Collect from --branches (comma-separated)
    branch_names.extend(args.branches.iter().cloned());

    // Remove duplicates while preserving order
    let mut seen = HashSet::new();
    branch_names.retain(|b| seen.insert(b.clone()));

    // Convert manually-specified branch names to BranchRef with proper metadata
    let mut branches: Vec<BranchRef> = Vec::new();
    for name in &branch_names {
        match get_branch_info(&args.source, name) {
            Ok(branch_ref) => branches.push(branch_ref),
            Err(e) => return Err(anyhow::anyhow!("Branch '{}' does not exist: {}", name, e)),
        }
    }

    // If --all-branches is set, get all branches from git
    if args.all_branches {
        let git_branches = list_branches(&args.source, args.remote_branches)?;
        for branch in git_branches {
            // Skip if already added from --branch/--branches
            if !branches.iter().any(|b| b.name == branch.name) {
                branches.push(branch);
            }
        }
    }

    Ok(branches)
}
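
Review note: the order-preserving de-duplication in `collect_branches_to_index` relies on `HashSet::insert` returning `false` for a value that is already present, so `retain` keeps only the first occurrence of each name. A standalone sketch of that step, with a hypothetical helper name and plain string slices standing in for the parsed CLI arguments:

```rust
use std::collections::HashSet;

/// Hypothetical helper: merges names from `--branch` and `--branches`
/// and drops later duplicates while keeping first-seen order.
fn merge_branch_names(repeated: &[String], comma_separated: &[String]) -> Vec<String> {
    let mut names: Vec<String> = repeated.iter().chain(comma_separated).cloned().collect();

    // `insert` returns true only the first time a value is seen,
    // so `retain` keeps exactly the first occurrence of each name.
    let mut seen = HashSet::new();
    names.retain(|n| seen.insert(n.clone()));
    names
}
```

A `Vec` plus `HashSet` is used instead of collecting into a set directly because the set alone would lose the order in which the user listed the branches.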

/// Run branch indexing mode - index multiple branches
async fn run_branch_indexing(args: Args, branches: Vec<semcode::git::BranchRef>) -> Result<()> {
    info!("Starting multi-branch indexing mode");
    info!(
        "Branches to index: {:?}",
        branches.iter().map(|b| &b.name).collect::<Vec<_>>()
    );

    // Process database path
    let database_path = process_database_path(args.database.as_deref(), Some(&args.source));

    // Create database manager wrapped in Arc for efficient sharing with process_git_range
    // (Arc::clone is cheap - just increments ref count, reuses same LanceDB connection)
    let db_manager = Arc::new(
        DatabaseManager::new(&database_path, args.source.to_string_lossy().to_string()).await?,
    );
    db_manager.create_tables().await?;

    if args.clear {
        println!("Clearing existing data...");
        db_manager.clear_all_data().await?;
        println!("Existing data cleared.");
    }

    let start_time = std::time::Instant::now();
    let mut branches_indexed = 0;
    let mut branches_skipped = 0;

    for branch in &branches {
        let branch_name = &branch.name;
        let tip_commit = &branch.tip_commit;

        println!(
            "\n{}",
            format!("=== Processing branch: {} ===", branch_name).cyan()
        );

        info!("Branch {} at commit {}", branch_name, &tip_commit[..8]);

        // Check if branch is already indexed at current tip (if --update-branches)
        if args.update_branches
            && db_manager
                .is_branch_current(branch_name, tip_commit)
                .await?
        {
            println!(
                " {} Branch already indexed at current tip, skipping",
                "→".yellow()
            );
            branches_skipped += 1;
            continue;
        }

        // Get extensions for this indexing operation
        let extensions: Vec<String> = args
            .extensions
            .split(',')
            .map(|s| s.trim().to_string())
            .collect();

        // Check if this is initial or incremental indexing
        if let Some(indexed_tip) = db_manager.get_branch_tip(branch_name).await? {
            // Incremental indexing: commits from last indexed tip to current tip
            let range = format!("{}..{}", indexed_tip, tip_commit);
            info!("Incremental indexing: {} for branch {}", range, branch_name);

            println!(" {} Processing range: {}", "→".blue(), range);

            // Get commit count for this range
            let repo = gix::discover(&args.source)
                .map_err(|e| anyhow::anyhow!("Not in a git repository: {}", e))?;

            let commit_shas = match list_shas_in_range(&repo, &range) {
                Ok(shas) => shas,
                Err(e) => {
                    warn!("Failed to get commits for {}: {}", range, e);
                    vec![]
                }
            };

            if commit_shas.is_empty() {
                println!(" {} No new commits to index", "✓".green());
            } else {
                println!(
                    " {} Found {} commits to process",
                    "→".blue(),
                    commit_shas.len()
                );

                semcode::git_range::process_git_range(
                    &args.source,
                    &range,
                    &extensions,
                    db_manager.clone(),
                    args.no_macros,
                    args.db_threads,
                )
                .await?;
            }
        } else {
            // Initial indexing: index the tree snapshot at the tip commit
            // This is MUCH faster than walking all commits (e.g., 80k files vs 1.4M commits for Linux)
            info!(
                "Initial indexing for branch {} (tree at {})",
                branch_name,
                &tip_commit[..8]
            );

            println!(" {} Processing tree at {}", "→".blue(), &tip_commit[..8]);

            semcode::git_range::process_git_tree(
                &args.source,
                tip_commit,
                &extensions,
                db_manager.clone(),
                args.no_macros,
                args.db_threads,
            )
            .await?;
        }

        // Record that this branch has been indexed at the current tip
        // Use the proper remote info from BranchRef (not a naive string split)
        db_manager
            .record_branch_indexed(branch_name, tip_commit, branch.remote.as_deref())
            .await?;

        println!(" {} Branch indexed successfully", "✓".green());
        branches_indexed += 1;
    }

    let total_time = start_time.elapsed();

    println!(
        "\n{}",
        "=== Multi-Branch Indexing Complete ===".green().bold()
    );
    println!("Total time: {:.1}s", total_time.as_secs_f64());
    println!("Branches indexed: {}", branches_indexed);
    if branches_skipped > 0 {
        println!("Branches skipped (already current): {}", branches_skipped);
    }

    // List all indexed branches
    let indexed = db_manager.list_indexed_branches().await?;
    if !indexed.is_empty() {
        println!("\nIndexed branches:");
        for branch in indexed {
            println!(
                " {} → {} (indexed {})",
                branch.branch_name.cyan(),
                &branch.tip_commit[..8],
                chrono::DateTime::from_timestamp(branch.indexed_at, 0)
                    .map(|dt| dt.format("%Y-%m-%d %H:%M").to_string())
                    .unwrap_or_else(|| "unknown".to_string())
            );
        }
    }

    println!("\nTo query this database, run:");
    println!(" semcode --database {}", database_path);

    Ok(())
}
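
Review note: the per-branch control flow in `run_branch_indexing` reduces to a three-way decision between skipping, incremental range indexing, and an initial tree snapshot. The sketch below restates that decision as a pure function so it can be read and tested on its own; `IndexPlan` and `plan_branch` are hypothetical names, and the database and git calls are replaced by plain arguments.

```rust
/// Hypothetical summary of what the indexer does for one branch.
#[derive(Debug, PartialEq)]
enum IndexPlan {
    /// `--update-branches` was given and the stored tip equals the current tip.
    Skip,
    /// A previous tip exists: process the commits in `old..new`.
    Incremental { range: String },
    /// No previous tip: index the tree snapshot at the tip commit.
    InitialTree { tip: String },
}

/// Mirrors the branch loop: `indexed_tip` stands in for the stored tip
/// (if any), `update_only` for the `--update-branches` flag.
fn plan_branch(indexed_tip: Option<&str>, current_tip: &str, update_only: bool) -> IndexPlan {
    match indexed_tip {
        Some(old) if update_only && old == current_tip => IndexPlan::Skip,
        Some(old) => IndexPlan::Incremental {
            range: format!("{}..{}", old, current_tip),
        },
        None => IndexPlan::InitialTree {
            tip: current_tip.to_string(),
        },
    }
}
```

Without `--update-branches`, a branch whose stored tip already equals the current tip still takes the incremental path with a range of the form `X..X`, which contains no commits, so the loop reports that there is nothing new and re-records the branch.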

// ==================== End Branch Indexing Support ====================

#[tokio::main]
async fn main() -> Result<()> {
    // Suppress ORT verbose logging
@@ -642,6 +875,13 @@ async fn main() -> Result<()> {
        ));
    }

    // Check if branch indexing mode is requested
    let branches_to_index = collect_branches_to_index(&args)?;
    if !branches_to_index.is_empty() {
        info!("Branch indexing mode: {} branches", branches_to_index.len());
        return run_branch_indexing(args, branches_to_index).await;
    }

    info!("Starting semantic code indexing");
    if let Some(ref git_range) = args.git {
        info!("Git commit indexing mode: {}", git_range);
15 changes: 14 additions & 1 deletion src/bin/query.rs
@@ -27,6 +27,11 @@ struct Args {
    /// Path to local model directory (for semantic search)
    #[arg(long, value_name = "PATH")]
    model_path: Option<String>,

    /// Query code at a specific branch instead of current HEAD
    /// Example: --branch main
    #[arg(long, value_name = "BRANCH")]
    branch: Option<String>,
}

/// Check if the current commit needs indexing and perform incremental indexing if needed
@@ -192,7 +197,15 @@ async fn main() -> Result<()> {
        let parts: Vec<&str> = parts_owned.iter().map(|s| s.as_str()).collect();

        // Handle command and check if we should exit
        if handle_command(&parts, &db_manager, &args.git_repo, &args.model_path).await? {
        if handle_command(
            &parts,
            &db_manager,
            &args.git_repo,
            &args.model_path,
            &args.branch,
        )
        .await?
        {
            break;
        }
    }