Merged
43 commits
bb69f37
Implement chunk-level extraction and embedding.
gvanrossum-ms May 9, 2026
5a60bad
Add producer task. Chunk ID is TextLocation.
gvanrossum-ms May 10, 2026
4510004
Add _worker_task().
gvanrossum-ms May 10, 2026
c8db35d
Add reassembler task and simplify some data structures.
gvanrossum-ms May 10, 2026
f4ece84
Refactor index writes to support precomputed embeddings.
gvanrossum-ms May 10, 2026
423eb57
Fix test (add more mock methods)
gvanrossum-ms May 10, 2026
90091b0
Use a dispatcher task that spawns one-shot workers.
gvanrossum-ms May 10, 2026
2b320dd
Chunk validation without try/except.
gvanrossum-ms May 10, 2026
9f8f51d
Add _commit_batch_from_chunk_results to ConversationBase.
gvanrossum-ms May 10, 2026
4cae56c
Reformat conversation_base.py
gvanrossum-ms May 11, 2026
7f484a6
Add tests, 100% coverage. Fix one thing in add_messages.py.
gvanrossum-ms May 11, 2026
c45a814
Add the new add_messages_streaming()
gvanrossum-ms May 11, 2026
4558ebf
Treat extraction Failure as a hard error, same as raised exceptions
gvanrossum-ms May 12, 2026
208529f
Fix forward reference and run pyright for 3.12/3.14
gvanrossum-ms May 12, 2026
69fa64b
Simplify exception aggregation in add_messages_streaming
gvanrossum-ms May 12, 2026
c8e9a67
A.md update
gvanrossum-ms May 13, 2026
36eec7a
Add maxsize=concurrency*2 to Queue(); use only one embedding model (p…
gvanrossum-ms May 13, 2026
8cdbe68
Oops, fix tests
gvanrossum-ms May 13, 2026
83a252d
Good docstring for add_messages_streaming()
gvanrossum-ms May 13, 2026
bf55d96
Fix two more docstrings
gvanrossum-ms May 13, 2026
ae84430
Eliminate unused 'success' property; rename {_,}NoOpKnowledgeExtractor
gvanrossum-ms May 13, 2026
1dea32b
Swap in new add_messages_streaming. Update message text index when me…
gvanrossum-ms May 13, 2026
96e6934
Add skip_failed_messages flag and use in ingest_email.py
gvanrossum-ms May 13, 2026
927fdae
Print chunk summaries after clipping
gvanrossum-ms May 13, 2026
fd14b95
[Incomplete] Handle ^C
gvanrossum-ms May 13, 2026
766b655
Make second ^C a hard exit
gvanrossum-ms May 13, 2026
35bb11a
No optional args for add_messages.py helper. Formatted test_add_messa…
gvanrossum-ms May 14, 2026
90f8cf5
Ensure pre-computed chunk embeddings are used and not recomputed
gvanrossum-ms May 14, 2026
61f7fef
Merge branch 'main' into pipeline
bmerkle May 14, 2026
db13cb0
Remove dead chunk_embeddings param from _update_secondary_indexes_inc…
gvanrossum-ms May 15, 2026
efe1ed2
Replace type: ignore with assert for extracted_knowledge narrowing
gvanrossum-ms May 15, 2026
cffb4c3
Tighten chunk_embeddings type from list[Any] to list[NormalizedEmbedd…
gvanrossum-ms May 15, 2026
b2fa6ce
Avoid wasted embedding work when deserializing message collections
gvanrossum-ms May 15, 2026
a71f93a
Fix reassembler staged-state retry hazard on post-commit callback fai…
gvanrossum-ms May 16, 2026
72ceb90
Fail fast when staged chunk embeddings are missing
gvanrossum-ms May 16, 2026
18561c3
Align sqlite message extend embedding typing with protocol
gvanrossum-ms May 16, 2026
216ea03
Clarify deserialize message-index replacement semantics
gvanrossum-ms May 16, 2026
fa513a5
Use itertools.chain for related-action term collection
gvanrossum-ms May 16, 2026
55feb8d
Replace redundant embedding list-copy comprehensions
gvanrossum-ms May 16, 2026
88e812a
Refactor process_chunk_with_extraction_and_embeddings: parallel chunk…
gvanrossum-ms May 16, 2026
72ab410
Move semaphore release before result queue put
gvanrossum-ms May 16, 2026
ad157f3
Remove unused on_batch_committed from _reassembler_task
gvanrossum-ms May 16, 2026
619b3b6
Consolidate skip-failed logging to message level
gvanrossum-ms May 16, 2026
19 changes: 19 additions & 0 deletions AGENTS.md
@@ -19,6 +19,7 @@ AGENTS.md. In all cases show what you added to AGENTS.md.

- Don't use '!' on the command line, it's some bash magic (even inside single quotes)
- When running 'make' commands, do not use the venv (the Makefile uses 'uv run')
- If a term can refer to OS behavior or repository code behavior (for example, 'force quit'), prefer the in-repo meaning first and verify by searching the code.
- To get API keys in ad-hoc code, call `load_dotenv()`
- Use `pytest test` to run tests in test/
- Use `pyright` to check type annotations in src/, tools/, tests/, examples/
@@ -29,7 +30,21 @@ AGENTS.md. In all cases show what you added to AGENTS.md.
- Use `make check test` to run `make check` and if it passes also run `make test`
- Use `make format` to format all files using `black`. Do this before reporting success.
- When validating changes, first run `pytest` only on new/modified test files, then run `make format check test` once at the end.
- While building `add_messages.py` before dedicated tests exist, skip running the full test suite; run full tests after those tests are added.
- Keep ad-hoc and performance benchmarks under `tools/`, not `tests/`, so `make test` does not run them.
- In add-messages pipeline chunk processing, compute chunk-text embeddings with uncached model calls and related-term embeddings with cached model calls.
- In add-messages pipeline flow, lower stop_at_message_id to min(existing, failing_message_id), and always enqueue queue-1 sentinels even when the input iterator fails so workers can drain and exit cleanly.
- In add-messages pipeline data structures, use `TextLocation` as the chunk identifier instead of a formatted string chunk ID.
- In add-messages reassembler validation, prefer explicit guard checks over wrapping validation-only logic in `try/except` blocks.
- In add-messages reassembler validation, prefer a single `validation_error` variable with consistent `if/elif` checks over helper functions for simple message-only validation.
- When adding precomputed-embedding write paths, expose explicit `*_with_embeddings` methods and have existing methods compute embeddings then delegate to those methods.
- In asyncio code, avoid locks for in-memory state updates that do not `await` between read/modify/write; use locks only when a critical section spans `await` points.
- Name returned summary/value objects as `*Result`; reserve `*State` for mutable shared/internal state.
- Keep internal helper type naming consistent within a module; avoid mixing underscored and non-underscored helper class names without a clear API-boundary reason.
- Prefer variable names that reflect role rather than lifecycle; for accumulators like message assemblies, use neutral names (e.g., `assembly`) instead of state-qualified names (e.g., `existing`).
- Avoid potential import cycles between conversation orchestration and pipeline modules by using neutral payload protocols/arguments instead of importing concrete pipeline result classes across modules.
- Prefer ordinal type aliases (e.g., `MessageOrdinal`, `ChunkOrdinal`) over raw `int` in pipeline code for readability.
- When the user asks to "fix the test only", update tests/mocks first and avoid adding production compatibility fallbacks unless explicitly requested.

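The `*_with_embeddings` guideline above might translate into something like the following minimal sketch. `TextIndex`, `add_texts`, `add_texts_with_embeddings`, and the `Embedding` alias are hypothetical stand-ins rather than the repository's actual classes; the point is only the shape of the delegation, plus an explicit guard check instead of `try/except` for validation.

```python
from collections.abc import Callable, Sequence

Embedding = list[float]  # stand-in for the project's normalized embedding type


class TextIndex:
    """Hypothetical in-memory index illustrating the delegation pattern."""

    def __init__(self, embed: Callable[[Sequence[str]], list[Embedding]]) -> None:
        self._embed = embed
        self.entries: list[tuple[str, Embedding]] = []

    def add_texts(self, texts: Sequence[str]) -> None:
        # Existing entry point: compute embeddings, then delegate.
        self.add_texts_with_embeddings(texts, self._embed(texts))

    def add_texts_with_embeddings(
        self, texts: Sequence[str], embeddings: Sequence[Embedding]
    ) -> None:
        # Explicit guard check: fail fast when the inputs do not line up.
        if len(texts) != len(embeddings):
            raise ValueError(
                f"expected {len(texts)} embeddings, got {len(embeddings)}"
            )
        self.entries.extend(zip(texts, embeddings))


def demo() -> tuple[int, int]:
    calls: list[int] = []

    def fake_embed(texts: Sequence[str]) -> list[Embedding]:
        calls.append(len(texts))  # record each model call
        return [[float(len(text))] for text in texts]

    index = TextIndex(fake_embed)
    index.add_texts(["a", "bb"])  # one embedding call
    index.add_texts_with_embeddings(["ccc"], [[3.0]])  # no embedding call
    return len(index.entries), len(calls)
```

Keeping the write logic in the `*_with_embeddings` method means a pipeline that already holds embeddings can skip the model call, while older callers keep the same behavior.
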
## Package Management with uv

@@ -55,8 +70,12 @@ please follow these guidelines:

* Assume Python 3.12

* `from __future__ import annotations` is not allowed.

* Always strip trailing spaces

* Keep docstrings in sync with code when changing implementation.

* Keep class and type names in `PascalCase`
* Use `python_case` for variable/field and function/method names

9 changes: 5 additions & 4 deletions Makefile
@@ -13,18 +13,19 @@ format: venv

.PHONY: check
check: venv
- uv run pyright src tests tools examples
+ uv run pyright --pythonversion 3.12 src tests tools examples
+ uv run pyright --pythonversion 3.14 src tests tools examples

.PHONY: test
test: venv
uv run pytest $(FLAGS)

.PHONY: coverage
coverage: venv
- coverage erase
+ uv run coverage erase
COVERAGE_PROCESS_START=.coveragerc uv run coverage run -m pytest $(FLAGS)
- coverage combine
- coverage report
+ uv run coverage combine
+ uv run coverage report

.PHONY: demo
demo: venv