Merged
172 changes: 172 additions & 0 deletions AGENTS.md
@@ -0,0 +1,172 @@
# AGENTS.md - Engineering and Agent Governance for ACI

This file defines mandatory engineering rules for this repository.

Applies to:
- Human contributors
- AI coding agents

## 1. Purpose and Scope

This document is the single governance contract for:
- Requirements
- Design
- Programming
- Testing
- CI/CD
- Git workflow
- AI safety constraints
- AI commit discipline

If a change conflicts with this document, this document takes precedence unless a maintainer explicitly overrides it in the task context.

## 2. Roles and Audience

- Humans MUST follow these standards when planning, coding, testing, and reviewing.
- AI agents MUST follow these standards when proposing, editing, validating, and reporting changes.

## 3. Requirement Engineering Standards

- Every non-trivial task MUST define:
  - Objective
  - In-scope
  - Out-of-scope
  - Acceptance criteria
- Bugfixes MUST include:
  - Reproducible symptom
  - Expected behavior
- If acceptance criteria are undefined, implementation MUST NOT start.

## 4. Design Standards

- Non-trivial changes MUST include a brief design rationale in PR/task notes.
- Design rationale MUST state:
  - Selected approach
  - Alternatives considered
  - Tradeoffs
- Layering MUST be respected:
  - `core`: domain logic and cross-cutting primitives
  - `infrastructure`: external systems and adapters
  - `services`: orchestration and business workflows
  - `cli` / `http` / `mcp`: entrypoint adapters only

## 5. Programming Standards

- New code MUST include type annotations.
- Error handling and logs MUST preserve root causes and actionable context.
- Security-sensitive data MUST NOT be hardcoded or logged.
- Changes SHOULD minimize diff surface; avoid opportunistic refactors.
- Existing repository conventions MUST be preserved unless the task explicitly requests a redesign.

## 6. Testing Standards

- Logic changes MUST include tests appropriate to the change:
  - Unit tests
  - Property tests
  - Integration tests
- Flaky or non-deterministic tests MUST NOT be merged.
- Minimum local validation before PR:
  - `uv run ruff check src tests`
  - `uv run pytest tests/ -v --tb=short -q --durations=10`
- Type checking SHOULD be run and reviewed:
  - `uv run mypy src --ignore-missing-imports --no-error-summary`

## 7. CI/CD Standards

- CI source of truth:
  - `.github/workflows/test.yml`
  - `.github/workflows/release.yml`
- Required CI commands are:
  - `uv run ruff check src tests`
  - `uv run pytest tests/ -v --tb=short -q --durations=10`
- Current mypy execution in CI is informational/non-blocking:
  - `uv run mypy src --ignore-missing-imports --no-error-summary || true`
- Release tags MUST follow `v*`.
- A release is marked prerelease when the tag name contains `-`.
- PRs MUST NOT merge with failing required checks.
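
The two tag rules above can be checked mechanically. The following is a minimal sketch, assuming the `v*` rule is a shell-style glob and that the prerelease rule is a plain substring test; the function name `classify_release_tag` is hypothetical and not part of the repository.

```python
from fnmatch import fnmatchcase


def classify_release_tag(tag: str) -> str:
    """Return 'release', 'prerelease', or 'invalid' for a tag name."""
    # Rule from this section: release tags follow the glob `v*`.
    if not fnmatchcase(tag, "v*"):
        return "invalid"
    # Rule from this section: a `-` anywhere in the tag name marks a prerelease.
    return "prerelease" if "-" in tag else "release"


for tag in ("v1.2.0", "v1.3.0-rc1", "1.2.0"):
    print(tag, classify_release_tag(tag))
```

The tags in the loop are illustrative only. The authoritative behavior lives in `.github/workflows/release.yml`.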

## 8. Git Workflow and PR Standards

- Workflow MUST be Trunk-based development + PR.
- Commit messages MUST follow Conventional Commits with scope:
  - `type(scope): short imperative summary`
- Allowed types:
  - `feat`, `fix`, `refactor`, `test`, `docs`, `chore`, `ci`, `perf`
- Branch naming SHOULD use:
  - `feat/<topic>`
  - `fix/<topic>`
  - `chore/<topic>`
  - `docs/<topic>`
- Every PR MUST include:
  - What changed
  - Why
  - Test evidence (commands + outcomes)
  - Risk/rollback notes for non-trivial changes
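
The commit header format above lends itself to a simple check. This is a sketch under assumptions: the regular expression and the name `is_valid_header` are illustrative, and the scope is assumed to be a single token without spaces or parentheses, which this document does not specify.

```python
import re

# Allowed types copied from the list above.
ALLOWED_TYPES = ("feat", "fix", "refactor", "test", "docs", "chore", "ci", "perf")

# Assumed shape: type(scope): summary, with a mandatory non-empty scope.
HEADER_RE = re.compile(
    r"(?P<type>" + "|".join(ALLOWED_TYPES) + r")"
    r"\((?P<scope>[^()\s]+)\): "
    r"(?P<summary>\S.*)"
)


def is_valid_header(header: str) -> bool:
    """Check a single-line commit header against the assumed pattern."""
    return HEADER_RE.fullmatch(header) is not None


print(is_valid_header("fix(indexing): retry qdrant upsert on transient timeout"))
print(is_valid_header("fix: missing scope"))
```

`fullmatch` is used so that trailing newlines or extra lines fail the check instead of being silently ignored.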

## 9. AI Safety and Dangerous-Command Policy

### 9.1 Strict Blocking Rules

AI MUST NOT run destructive or high-risk commands.

Default prohibited patterns:
- `rm -rf`
- `del /s /q`
- `format`
- `mkfs`
- `dd`
- shutdown/reboot commands
- recursive permission/ownership changes outside task scope
- `git reset --hard`
- `git clean -fdx`
- `git checkout -- .`
- force push to protected branches

### 9.2 Additional Safety Constraints

- AI MUST NOT modify files outside repository task scope without explicit user instruction.
- AI MUST explicitly call out risk before any potentially destructive operation.
- AI MUST prefer reversible operations.
- AI MUST preserve unrelated local changes.

## 10. AI Commit Discipline and Commit Message Standard

### 10.1 Milestone-Based Commit Discipline

- AI MUST commit at each verifiable milestone, not only at final completion.
- A milestone is:
  - A complete logical subtask
  - With relevant checks passing
  - With coherent rollback boundaries
- AI MUST avoid oversized mixed-purpose commits.
- AI SHOULD keep one intent per commit.

### 10.2 AI Commit Message Template

Header (required):
- `type(scope): short imperative summary`

Body (required):
- `Why:` context/problem
- `What:` key changes
- `Test:` commands executed and outcomes

Optional footer:
- `BREAKING CHANGE:` when applicable

Example:
- `fix(indexing): retry qdrant upsert on transient timeout`
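
A full message that follows the template can be assembled as below. This is a minimal sketch: `build_commit_message` is a hypothetical helper, and the layout (blank line after the header, one line per body field) is an assumption, since the template only names the fields.

```python
from typing import Optional


def build_commit_message(
    header: str,
    why: str,
    what: str,
    test: str,
    breaking: Optional[str] = None,
) -> str:
    """Assemble header, required body fields, and the optional footer."""
    lines = [header, "", f"Why: {why}", f"What: {what}", f"Test: {test}"]
    if breaking:
        # Footer is only emitted when a breaking change is declared.
        lines += ["", f"BREAKING CHANGE: {breaking}"]
    return "\n".join(lines)


print(
    build_commit_message(
        "fix(indexing): retry qdrant upsert on transient timeout",
        why="<context/problem>",
        what="<key changes>",
        test="<commands executed and outcomes>",
    )
)
```

The angle-bracket values are placeholders to be filled in per commit.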

## 11. Definition of Done (DoD) Checklist

Before marking work complete, ALL applicable items MUST be satisfied:

- Requirements are explicit and testable
- Design rationale is captured for non-trivial changes
- Code is scoped and consistent with project architecture
- Tests are added/updated and passing
- Lint checks are passing
- CI/CD impact is considered
- Documentation is updated when behavior changes
- Commit/PR metadata follows policy

1 change: 1 addition & 0 deletions README.md
@@ -1,6 +1,7 @@
# Project ACI - Augmented Codebase Indexer

Language: **English** | [简体中文](doc/README.zh-CN.md)
Development governance: see [AGENTS.md](AGENTS.md)

A Python tool for semantic code search with precise line-level location results.

109 changes: 109 additions & 0 deletions doc/CHUNKING_ALGORITHM.zh-CN.md
@@ -0,0 +1,109 @@
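# Chunking Algorithm Principles (Current Implementation)

This document is based on the current code in `src/aci/core/chunker` and explains how ACI splits source code into retrievable chunks during indexing.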

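## 1. Overall Flow

The main flow of `Chunker.chunk(file, ast_nodes)`:

1. Extract the import list according to the file's language (written into every chunk's metadata).
2. If AST nodes exist: use **semantic chunking (AST-based)**.
3. If there are no AST nodes: use **fixed line-count chunking (fixed-size fallback)**.
4. If a `summary_generator` is configured, produce function/class/file summary artifacts in parallel.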

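## 2. AST Semantic Chunking (Preferred Path)

When the parser can produce AST nodes:

- Each AST node (`function/class/method`) is a chunk candidate by default.
- Metadata is enriched with structural information:
  - `function_name`
  - `class_name`
  - `parent_class` (for methods)
  - `imports`, `file_hash`, `language`
- If a node has a docstring, it is normalized first and then prepended to the chunk content with a separator, which improves semantic retrievability.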

### Token Limit Control

For each candidate node:

- `token_count <= max_tokens`: emit a single chunk directly.
- `token_count > max_tokens`: hand the node to `SmartChunkSplitter` for smart splitting.

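## 3. SmartChunkSplitter Splitting Strategy

Goal: stay within the token limit while breaking code syntax/semantic boundaries as little as possible.

### 3.1 Split Priority

Inside an oversized node, split points are preferred in this order:

1. Blank lines
2. Statement boundaries (`def/class/if/for/while/try/except/return/...` patterns)
3. Lines with lower indentation (block boundaries)
4. As a last resort, cut at the largest range that fits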

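### 3.2 Finding the "Largest Range That Fits"

- `_find_max_end_index` binary-searches for the furthest `end_idx`, starting from `start_idx`, that stays within the token limit.
- The splitter then backtracks within `[start_idx, end_idx]` to pick the "best split point".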
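
The binary search and the backtracking step can be sketched as follows. This is not the code in `src/aci/core/chunker`; it is an illustration under stated assumptions: the tokenizer is a whitespace word count, the statement pattern uses only the keywords listed in 3.1, and `find_max_end_index`, `choose_split_end`, and `split_ranges` are hypothetical names.

```python
import re
from collections.abc import Callable, Sequence

# Assumed statement starters, taken from the pattern list in 3.1 (Python-flavoured).
STATEMENT_RE = re.compile(r"^\s*(?:def|class|if|for|while|try|except|return)\b")


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def find_max_end_index(
    lines: Sequence[str],
    start_idx: int,
    max_tokens: int,
    counter: Callable[[str], int] = count_tokens,
) -> int:
    """Binary-search the furthest inclusive end index whose span fits max_tokens.

    Falls back to start_idx when even a single line is over the limit,
    so the caller always makes progress.
    """
    lo, hi, best = start_idx, len(lines) - 1, start_idx
    while lo <= hi:
        mid = (lo + hi) // 2
        if counter("\n".join(lines[start_idx : mid + 1])) <= max_tokens:
            best, lo = mid, mid + 1  # fits: try to extend further
        else:
            hi = mid - 1  # too large: shrink the range
    return best


def _indent(line: str) -> int:
    return len(line) - len(line.lstrip())


def choose_split_end(lines: Sequence[str], start_idx: int, end_idx: int) -> int:
    """Backtrack from end_idx and return the inclusive end of the next chunk."""
    if end_idx >= len(lines) - 1:
        return end_idx  # the rest of the node fits; nothing to split
    candidates = range(end_idx, start_idx, -1)
    for i in candidates:  # 1. end the chunk on a blank line
        if not lines[i].strip():
            return i
    for i in candidates:  # 2. end just before a statement starter
        if STATEMENT_RE.match(lines[i + 1]):
            return i
    for i in candidates:  # 3. end just before a drop in indentation
        if _indent(lines[i + 1]) < _indent(lines[i]):
            return i
    return end_idx  # 4. last resort: the largest range that fits


def split_ranges(lines: Sequence[str], max_tokens: int) -> list[tuple[int, int]]:
    """Return inclusive 0-based (start, end) index pairs covering all lines."""
    ranges, start = [], 0
    while start < len(lines):
        end = choose_split_end(lines, start, find_max_end_index(lines, start, max_tokens))
        ranges.append((start, end))
        start = end + 1
    return ranges
```

The binary search relies on the token count never decreasing as lines are appended, which holds for the stand-in counter and is assumed to hold for the real tokenizer.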

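### 3.3 Context Compensation

After splitting, subsequent sub-chunks get a context prefix so they are not detached from their surroundings:

- Method: `# Context: class <Parent>`
- Function: `# Context: function <Name>`
- Class: `# Context: class <Name>`

In addition:

- The docstring prefix is attached only to the first sub-chunk.
- Metadata records fields such as `is_partial / part_index / total_parts`.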
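
A rough sketch of how the prefixes and partial markers might be applied to already-split parts is shown below. The helper names, the dictionary layout, the zero-based `part_index`, and the newline joining the prefix to the body are all assumptions made for illustration; only the prefix strings and the three metadata field names come from the text above.

```python
from typing import Any, Optional


def context_prefix(kind: str, name: str, parent: Optional[str] = None) -> str:
    """Build the context comment for a sub-chunk of the given node kind."""
    if kind == "method":
        return f"# Context: class {parent}"
    if kind == "function":
        return f"# Context: function {name}"
    if kind == "class":
        return f"# Context: class {name}"
    raise ValueError(f"unsupported node kind: {kind}")


def assemble_parts(
    parts: list[str],
    kind: str,
    name: str,
    parent: Optional[str] = None,
    docstring_prefix: str = "",
) -> list[dict[str, Any]]:
    """Attach the docstring prefix to the first part and context to the rest."""
    prefix = context_prefix(kind, name, parent)
    total = len(parts)
    result = []
    for index, body in enumerate(parts):
        # First part keeps the docstring prefix; later parts get the context line.
        content = docstring_prefix + body if index == 0 else f"{prefix}\n{body}"
        metadata = {"is_partial": total > 1, "part_index": index, "total_parts": total}
        result.append({"content": content, "metadata": metadata})
    return result
```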

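## 4. Fixed Line-Count Chunking (Fallback)

When a language does not yet support AST parsing (or the AST is empty):

- Chunks are cut every `fixed_chunk_lines` lines (default 50).
- Adjacent chunks keep `overlap_lines` lines of overlap (default 5) to reduce semantic breaks across chunk boundaries.
- Every chunk is still token-checked; if it exceeds the limit, lines are removed from the end of the chunk until it fits (at least 1 line is kept).
- The chunk type is marked as `fixed`.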
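
The fallback path can be approximated as below. The defaults of 50 and 5 come from the list above; everything else is an assumption, including the whitespace tokenizer, the function name `fixed_size_ranges`, and how the next chunk start is chosen after a chunk has been shrunk.

```python
from collections.abc import Callable, Sequence


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def fixed_size_ranges(
    lines: Sequence[str],
    max_tokens: int,
    fixed_chunk_lines: int = 50,
    overlap_lines: int = 5,
    counter: Callable[[str], int] = count_tokens,
) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start_line, end_line) pairs for fixed chunks."""
    if fixed_chunk_lines <= 0 or not 0 <= overlap_lines < fixed_chunk_lines:
        raise ValueError("need fixed_chunk_lines > overlap_lines >= 0")
    ranges: list[tuple[int, int]] = []
    start = 0  # 0-based index of the first line in the current chunk
    while start < len(lines):
        end = min(start + fixed_chunk_lines, len(lines))  # exclusive
        # Token check: drop trailing lines until the chunk fits, keeping >= 1 line.
        while end - start > 1 and counter("\n".join(lines[start:end])) > max_tokens:
            end -= 1
        ranges.append((start + 1, end))
        if end == len(lines):
            break
        # Assumed policy: step back by the overlap, but always move forward.
        start = max(end - overlap_lines, start + 1)
    return ranges
```

With 12 one-token lines, `fixed_chunk_lines=5`, and `overlap_lines=2`, each chunk shares its last two lines with the start of the next one.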

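## 5. Import Extraction Strategy

Imports are extracted before chunking and written into metadata:

- Python: recognizes `import ...` / `from ...`
- JS/TS: recognizes `import ...` and `const ... require(...)`
- Go: supports `import (...)` blocks and single-line imports
- Other languages: no-op implementation (returns an empty list)

This lets retrieval and downstream summarization models make use of dependency context.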
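
A line-based extractor along these lines might look like the sketch below. The regular expressions, the language keys, and the choice to return stripped source lines rather than module names are assumptions; the actual implementation may normalize the results differently.

```python
import re

# Assumed patterns; deliberately simple and line-based.
PY_IMPORT = re.compile(r"^\s*(?:import\s+\S|from\s+\S+\s+import\s)")
JS_IMPORT = re.compile(r"^\s*(?:import\s|const\s+.+?=\s*require\()")
GO_SINGLE = re.compile(r'^\s*import\s+(?:[\w.]+\s+)?"[^"]+"')
GO_BLOCK_START = re.compile(r"^\s*import\s*\(")


def extract_imports(source: str, language: str) -> list[str]:
    """Return the import lines found in source for the given language."""
    lines = source.splitlines()
    if language == "python":
        return [ln.strip() for ln in lines if PY_IMPORT.match(ln)]
    if language in ("javascript", "typescript"):
        return [ln.strip() for ln in lines if JS_IMPORT.match(ln)]
    if language == "go":
        imports: list[str] = []
        in_block = False
        for ln in lines:
            stripped = ln.strip()
            if in_block:
                if stripped.startswith(")"):
                    in_block = False  # end of the import ( ... ) block
                elif stripped and not stripped.startswith("//"):
                    imports.append(stripped)
            elif GO_BLOCK_START.match(ln):
                in_block = True
            elif GO_SINGLE.match(ln):
                imports.append(stripped)
        return imports
    return []  # other languages: no-op
```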

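## 6. Output Data Shape

The final `ChunkingResult` contains two kinds of output:

- `chunks: list[CodeChunk]`
- `summaries: list[SummaryArtifact]`

`CodeChunk` is the primary indexed object and carries:

- Line range (1-based, end line inclusive)
- Original or post-split content
- Chunk type (`function/class/method/fixed`)
- Metadata (imports, symbol names, partial-chunk markers, etc.)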
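
For orientation, the shapes described above could be modelled with dataclasses as in this sketch. Only the type names and the `chunks` and `summaries` fields are taken from the text; every other field name (`start_line`, `end_line`, `content`, `chunk_type`, `metadata`, `kind`) is a guess and may differ from the real classes.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CodeChunk:
    start_line: int  # 1-based
    end_line: int  # inclusive
    content: str  # original or post-split text
    chunk_type: str  # "function", "class", "method", or "fixed"
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class SummaryArtifact:
    kind: str  # assumed: "function", "class", or "file"
    content: str


@dataclass
class ChunkingResult:
    chunks: list[CodeChunk] = field(default_factory=list)
    summaries: list[SummaryArtifact] = field(default_factory=list)


result = ChunkingResult(
    chunks=[CodeChunk(1, 3, "def f():\n    return 1\n", "function", {"function_name": "f"})]
)
print(len(result.chunks), len(result.summaries))
```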

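## 7. Summary of Design Tradeoffs

The current algorithm is a hybrid of "**semantics first + token-limit safeguard + line-based fallback**":

- Advantages:
  - Chunks align with language structure (functions/classes/methods) where possible, so retrieval granularity is more natural.
  - Oversized nodes are split intelligently with context preserved, which reduces semantic loss.
  - Languages without AST support still work, which keeps the tool usable in practice.
- Potential limitations:
  - The statement-boundary patterns are currently Python-style regular expressions and are not fully accurate for other languages.
  - The fixed-size path relies mainly on line counts and overlap, so its semantic consistency is weaker than the AST path.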
