
fix(tokenizer): add configurable tokenizer strategies and improve token-limit detection #17

Open

AperturePlus wants to merge 1 commit into develop from codex/fix-token-count-mismatch-with-ollama-models

Conversation

@AperturePlus
Owner

Motivation

  • Indexing used a hardcoded OpenAI BPE tokenizer (cl100k_base), which underestimates token counts for WordPiece/BERT-based embedding models (e.g., those served by Ollama), causing the embedding API to reject inputs that exceed the model’s context length.
  • Provide a conservative, configurable token-estimation option and make token-limit error detection more robust so a single oversized item doesn’t silently break indexing.

Description

  • Added two alternate tokenizer implementations: CharacterTokenizer (a conservative chars-per-token estimator) and SimpleTokenizer (whitespace-based); TiktokenTokenizer remains the default (src/aci/core/tokenizer.py).
  • Extended get_default_tokenizer(strategy) to accept tiktoken|character|simple and raise on unsupported strategies, and exported the new tokenizers in the core package (src/aci/core/__init__.py).
  • Added IndexingConfig.tokenizer and environment override ACI_TOKENIZER, and wired tokenizer selection into service initialization so chunker and summary generator share the same tokenizer instance (src/aci/core/config.py, src/aci/services/container.py).
  • Improved embedding error classification to detect context-length-style messages (e.g., "the input length exceeds the context length") in 400 responses so the batch-fallback logic triggers correctly (src/aci/infrastructure/embedding/response_parser.py).
  • Updated and added tests to cover tokenizer strategies and token-limit patterns, and made tiktoken unit tests offline-safe by stubbing encoding during unit tests (tests/unit/test_tokenizer.py, tests/property/test_embedding_client_properties.py, tests/property/test_config_properties.py).
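As a rough illustration of the strategy selection described above, the two estimators and the factory might look like the following sketch. Class and function names mirror the PR description, but constructor parameters, the default chars-per-token ratio, and the TiktokenTokenizer wiring are assumptions; the actual implementation in src/aci/core/tokenizer.py may differ.

```python
import math


class CharacterTokenizer:
    """Conservative estimator: assumes a fixed number of characters per token.

    The default ratio of 3.0 is illustrative, not the value used in the PR.
    """

    def __init__(self, chars_per_token: float = 3.0) -> None:
        self.chars_per_token = chars_per_token

    def count_tokens(self, text: str) -> int:
        # Round up so the estimate never undershoots the true count.
        return math.ceil(len(text) / self.chars_per_token)


class SimpleTokenizer:
    """Whitespace-based estimator: one token per whitespace-separated word."""

    def count_tokens(self, text: str) -> int:
        return len(text.split())


def get_default_tokenizer(strategy: str = "tiktoken"):
    # The real factory also returns a TiktokenTokenizer for "tiktoken";
    # that branch is omitted here to keep the sketch dependency-free.
    if strategy == "character":
        return CharacterTokenizer()
    if strategy == "simple":
        return SimpleTokenizer()
    raise ValueError(f"Unsupported tokenizer strategy: {strategy!r}")
```

Because the character-based estimator rounds up, it tends to overestimate token counts for BPE models, which is the safe direction when the goal is to avoid context-length rejections.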

Testing

  • Ran targeted unit/property tests: uv run pytest tests/unit/test_tokenizer.py tests/property/test_embedding_client_properties.py tests/property/test_config_properties.py -q, and they passed.
  • Ran linter: uv run ruff check src tests, and it passed.
  • Ran full test suite: uv run pytest tests/ -v --tb=short -q --durations=10 --maxfail=1, which failed in this environment due to a proxy blocking tiktoken encoding download (environment network/proxy issue), not a logic regression.
  • Ran type checks: uv run mypy src --ignore-missing-imports --no-error-summary, which reports pre-existing repository-wide mypy issues unrelated to this change.

Codex Task

Why:
- Indexing against Ollama WordPiece-based embedding models can exceed model context limits because chunk sizing used only cl100k_base token counts.
- Ollama error payloads using "input length exceeds the context length" were not consistently classified as token-limit failures.

What:
- Added tokenizer strategies (tiktoken, character, simple) and strategy-based default tokenizer factory.
- Added ACI_TOKENIZER config/env support and wired tokenizer selection into service initialization so chunking and summary generation share the chosen tokenizer.
- Expanded token-limit error detection to include context-length style 400 responses.
- Updated unit/property tests for tokenizer strategy selection, offline-safe tokenizer tests, config strategy generation, and token-limit message coverage.
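The expanded token-limit detection could be sketched as below. Only the "input length exceeds the context length" phrasing comes from this PR; the other patterns, the function name, and the 400-only gate are illustrative assumptions, and the real logic lives in src/aci/infrastructure/embedding/response_parser.py.

```python
import re

# Phrasings that indicate a context-length / token-limit failure.
# Only the second pattern is taken from the PR; the others are examples.
_TOKEN_LIMIT_PATTERNS = [
    re.compile(r"maximum context length", re.IGNORECASE),
    re.compile(r"input length exceeds the context length", re.IGNORECASE),
    re.compile(r"too many tokens", re.IGNORECASE),
]


def is_token_limit_error(status_code: int, message: str) -> bool:
    """Classify a 400 response as a token-limit failure when its error
    message matches any known context-length phrasing."""
    if status_code != 400:
        return False
    return any(p.search(message) for p in _TOKEN_LIMIT_PATTERNS)
```

Classifying these responses correctly is what lets the batch-fallback logic split an oversized batch and retry, instead of failing the whole indexing run on one oversized item.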

Test:
- uv run pytest tests/unit/test_tokenizer.py tests/property/test_embedding_client_properties.py tests/property/test_config_properties.py -q (pass)
- uv run ruff check src tests (pass)
- uv run pytest tests/ -v --tb=short -q --durations=10 --maxfail=1 (fails in this environment due to a proxy blocking the tiktoken encoding download)
- uv run mypy src --ignore-missing-imports --no-error-summary (existing repo-wide mypy errors)
