
fix(tokenizer): add configurable tokenizer strategies and improve token-limit detection #17

Open

AperturePlus wants to merge 1 commit into develop from codex/fix-token-count-mismatch-with-ollama-models

Conversation

@AperturePlus
Owner

Motivation

  • Indexing used a hardcoded OpenAI BPE tokenizer (cl100k_base), which underestimates token counts for WordPiece/BERT-based embedding models (e.g., those served by Ollama), causing the embedding API to reject inputs that exceed the model’s context length.
  • Provide a conservative, configurable token-estimation option and make token-limit error detection more robust so a single oversized item doesn’t silently break indexing.

Description

  • Added two alternate tokenizer implementations: CharacterTokenizer (a conservative chars-per-token estimator) and SimpleTokenizer (whitespace-based); TiktokenTokenizer remains the default (src/aci/core/tokenizer.py).
  • Extended get_default_tokenizer(strategy) to accept tiktoken|character|simple and raise on unsupported strategies, and exported the new tokenizers in the core package (src/aci/core/__init__.py).
  • Added IndexingConfig.tokenizer and environment override ACI_TOKENIZER, and wired tokenizer selection into service initialization so chunker and summary generator share the same tokenizer instance (src/aci/core/config.py, src/aci/services/container.py).
  • Improved embedding error classification to detect context-length-style messages (e.g., "the input length exceeds the context length") in 400 responses so the batch-fallback logic triggers correctly (src/aci/infrastructure/embedding/response_parser.py).
  • Updated and added tests to cover tokenizer strategies and token-limit patterns, and made tiktoken unit tests offline-safe by stubbing encoding during unit tests (tests/unit/test_tokenizer.py, tests/property/test_embedding_client_properties.py, tests/property/test_config_properties.py).
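As a rough illustration of the strategy selection described above, the two estimators and the factory might look like the following sketch. Class and function names mirror the PR description, but constructor parameters, the default chars-per-token ratio, and the TiktokenTokenizer wiring are assumptions; the actual implementation in src/aci/core/tokenizer.py may differ.

```python
import math


class CharacterTokenizer:
    """Conservative estimator: assumes a fixed number of characters per token.

    The default ratio of 3.0 is illustrative, not the value used in the PR.
    """

    def __init__(self, chars_per_token: float = 3.0) -> None:
        self.chars_per_token = chars_per_token

    def count_tokens(self, text: str) -> int:
        # Round up so the estimate never undershoots the true count.
        return math.ceil(len(text) / self.chars_per_token)


class SimpleTokenizer:
    """Whitespace-based estimator: one token per whitespace-separated word."""

    def count_tokens(self, text: str) -> int:
        return len(text.split())


def get_default_tokenizer(strategy: str = "tiktoken"):
    # The real factory also returns a TiktokenTokenizer for "tiktoken";
    # that branch is omitted here to keep the sketch dependency-free.
    if strategy == "character":
        return CharacterTokenizer()
    if strategy == "simple":
        return SimpleTokenizer()
    raise ValueError(f"Unsupported tokenizer strategy: {strategy!r}")
```

Because the character-based estimator rounds up, it tends to overestimate token counts for BPE models, which is the safe direction when the goal is to avoid context-length rejections.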

Testing

  • Ran targeted unit/property tests: uv run pytest tests/unit/test_tokenizer.py tests/property/test_embedding_client_properties.py tests/property/test_config_properties.py -q, and they passed.
  • Ran linter: uv run ruff check src tests, and it passed.
  • Ran full test suite: uv run pytest tests/ -v --tb=short -q --durations=10 --maxfail=1, which failed in this environment due to a proxy blocking tiktoken encoding download (environment network/proxy issue), not a logic regression.
  • Ran type checks: uv run mypy src --ignore-missing-imports --no-error-summary, which reports pre-existing repository-wide mypy issues unrelated to this change.

Codex Task

Why:
- Indexing against Ollama WordPiece-based embedding models can exceed model context limits because chunk sizing used only cl100k_base token counts.
- Ollama error payloads using "input length exceeds the context length" were not consistently classified as token-limit failures.

What:
- Added tokenizer strategies (tiktoken, character, simple) and strategy-based default tokenizer factory.
- Added ACI_TOKENIZER config/env support and wired tokenizer selection into service initialization so chunking and summary generation share the chosen tokenizer.
- Expanded token-limit error detection to include context-length style 400 responses.
- Updated unit/property tests for tokenizer strategy selection, offline-safe tokenizer tests, config strategy generation, and token-limit message coverage.
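The expanded token-limit detection could be sketched as below. Only the "input length exceeds the context length" phrasing comes from this PR; the other patterns, the function name, and the 400-only gate are illustrative assumptions, and the real logic lives in src/aci/infrastructure/embedding/response_parser.py.

```python
import re

# Phrasings that indicate a context-length / token-limit failure.
# Only the second pattern is taken from the PR; the others are examples.
_TOKEN_LIMIT_PATTERNS = [
    re.compile(r"maximum context length", re.IGNORECASE),
    re.compile(r"input length exceeds the context length", re.IGNORECASE),
    re.compile(r"too many tokens", re.IGNORECASE),
]


def is_token_limit_error(status_code: int, message: str) -> bool:
    """Classify a 400 response as a token-limit failure when its error
    message matches any known context-length phrasing."""
    if status_code != 400:
        return False
    return any(p.search(message) for p in _TOKEN_LIMIT_PATTERNS)
```

Classifying these responses correctly is what lets the batch-fallback logic split an oversized batch and retry, instead of failing the whole indexing run on one oversized item.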

Test:
- uv run pytest tests/unit/test_tokenizer.py tests/property/test_embedding_client_properties.py tests/property/test_config_properties.py -q (pass)
- uv run ruff check src tests (pass)
- uv run pytest tests/ -v --tb=short -q --durations=10 --maxfail=1 (fails in this environment due to a proxy blocking the tiktoken encoding download)
- uv run mypy src --ignore-missing-imports --no-error-summary (existing repo-wide mypy errors)
