fix(tokenizer): add configurable tokenizer strategies and improve token-limit detection #17
Open
AperturePlus wants to merge 1 commit into develop from
Conversation
Why:
- Indexing against Ollama WordPiece-based embedding models can exceed model context limits because chunk sizing used only `cl100k_base` token counts.
- Ollama error payloads using "input length exceeds the context length" were not consistently classified as token-limit failures.

What:
- Added tokenizer strategies (tiktoken, character, simple) and a strategy-based default tokenizer factory.
- Added `ACI_TOKENIZER` config/env support and wired tokenizer selection into service initialization so chunking and summary generation share the chosen tokenizer.
- Expanded token-limit error detection to include context-length style 400 responses.
- Updated unit/property tests for tokenizer strategy selection, offline-safe tokenizer tests, config strategy generation, and token-limit message coverage.

Test:
- `uv run pytest tests/unit/test_tokenizer.py tests/property/test_embedding_client_properties.py tests/property/test_config_properties.py -q` (pass)
- `uv run ruff check src tests` (pass)
- `uv run pytest tests/ -v --tb=short -q --durations=10 --maxfail=1` (fails in this environment due to a proxy blocking the tiktoken encoding download)
- `uv run mypy src --ignore-missing-imports --no-error-summary` (existing repo-wide mypy errors)
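The expanded token-limit classification can be sketched roughly as follows. The function name and the first marker string are assumptions for illustration; only the Ollama-style message is quoted from this PR:

```python
# Hypothetical sketch of the token-limit classification described above;
# the real logic lives in src/aci/infrastructure/embedding/response_parser.py.
TOKEN_LIMIT_MARKERS = (
    "maximum context length",                    # OpenAI-style wording (assumed)
    "input length exceeds the context length",   # Ollama-style wording (from this PR)
)

def is_token_limit_error(status_code: int, message: str) -> bool:
    """Treat a 400 response as a token-limit failure based on its message."""
    if status_code != 400:
        return False
    lowered = message.lower()
    return any(marker in lowered for marker in TOKEN_LIMIT_MARKERS)
```

Matching on message substrings is fragile but unavoidable when the provider reports context overflow only as a generic 400 with free-form text.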
Motivation

Chunk sizing previously relied on a single tiktoken encoding (`cl100k_base`), which underestimates token counts for WordPiece/BERT-based embedding models (e.g., Ollama), causing embedding API rejections when input exceeds the context length.

Description
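A minimal sketch of the strategy-based tokenizer selection (class and function names follow this PR; the internals, including the chars-per-token ratio, are assumptions):

```python
# Illustrative only; the real implementations live in src/aci/core/tokenizer.py.
class SimpleTokenizer:
    """Whitespace-based token counting."""
    def count_tokens(self, text: str) -> int:
        return len(text.split())

class CharacterTokenizer:
    """Conservative chars-per-token estimate (the ratio of 4 is an assumption)."""
    def __init__(self, chars_per_token: int = 4) -> None:
        self.chars_per_token = chars_per_token

    def count_tokens(self, text: str) -> int:
        # Round up so chunk sizing errs on the side of smaller chunks.
        return -(-len(text) // self.chars_per_token)

def get_default_tokenizer(strategy: str = "tiktoken"):
    """Return a tokenizer for the given strategy; raise on unsupported ones."""
    if strategy == "character":
        return CharacterTokenizer()
    if strategy == "simple":
        return SimpleTokenizer()
    if strategy == "tiktoken":
        # The real branch wraps tiktoken's cl100k_base encoding; omitted
        # here so the sketch stays offline-safe.
        raise NotImplementedError("tiktoken branch omitted in this sketch")
    raise ValueError(f"Unsupported tokenizer strategy: {strategy}")
```

Raising on an unknown strategy (rather than silently falling back to tiktoken) surfaces `ACI_TOKENIZER` misconfiguration at startup instead of at indexing time.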
- Added `CharacterTokenizer` (a conservative chars-per-token estimator) and `SimpleTokenizer` (whitespace-based), keeping `TiktokenTokenizer` as the default (`src/aci/core/tokenizer.py`).
- Extended `get_default_tokenizer(strategy)` to accept `tiktoken|character|simple` and raise on unsupported strategies, and exported the new tokenizers in the core package (`src/aci/core/__init__.py`).
- Added `IndexingConfig.tokenizer` and the environment override `ACI_TOKENIZER`, and wired tokenizer selection into service initialization so the chunker and summary generator share the same tokenizer instance (`src/aci/core/config.py`, `src/aci/services/container.py`).
- Expanded token-limit error detection to cover context-length style 400 responses (`src/aci/infrastructure/embedding/response_parser.py`).
- Made `tiktoken` unit tests offline-safe by stubbing the encoding during unit tests (`tests/unit/test_tokenizer.py`, `tests/property/test_embedding_client_properties.py`, `tests/property/test_config_properties.py`).

Testing
- Ran `uv run pytest tests/unit/test_tokenizer.py tests/property/test_embedding_client_properties.py tests/property/test_config_properties.py -q`, and the tests passed.
- Ran `uv run ruff check src tests`, and it passed.
- Ran `uv run pytest tests/ -v --tb=short -q --durations=10 --maxfail=1`, which failed in this environment due to a proxy blocking the `tiktoken` encoding download (a network/proxy issue, not a logic regression).
- Ran `uv run mypy src --ignore-missing-imports --no-error-summary`, which reports pre-existing repository-wide mypy issues unrelated to this change.
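The offline-safe stubbing mentioned above can be approximated by patching the encoding lookup; in the repository this would target `tiktoken.get_encoding` (typically via pytest's `monkeypatch`), but this self-contained sketch patches a stand-in object so it runs without `tiktoken` installed:

```python
from unittest import mock

class FakeEncoding:
    """Stand-in encoding: one fake token id per whitespace-separated word."""
    def encode(self, text: str) -> list[int]:
        return list(range(len(text.split())))

class FakeTiktoken:
    """Stand-in for the tiktoken module; real tests would patch tiktoken itself."""
    @staticmethod
    def get_encoding(name: str):
        raise RuntimeError("would download the encoding through the blocked proxy")

def count_tokens(text: str, tk=FakeTiktoken) -> int:
    return len(tk.get_encoding("cl100k_base").encode(text))

def test_count_tokens_offline():
    # Stub the encoding lookup so no network access happens during the test.
    with mock.patch.object(FakeTiktoken, "get_encoding",
                           return_value=FakeEncoding()):
        assert count_tokens("three word input") == 3
```

This is why the targeted unit/property test run passes offline while the full suite still trips on the real encoding download.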