ACI uses `tiktoken` with the `cl100k_base` encoding (OpenAI's BPE tokenizer) to count chunk tokens. However, Ollama's BERT-based embedding models (e.g., `nomic-embed-text`, `mxbai-embed-large`) use WordPiece tokenizers that produce significantly different token counts. This mismatch causes indexing to always fail with "the input length exceeds the context length" — no matter how low `ACI_INDEXING_MAX_CHUNK_TOKENS` is set.
Environment
- ACI version: latest master (`4b52424`)
- Embedding provider: Ollama (local) via `http://localhost:11434/v1/embeddings`
- Models tested: `nomic-embed-text` (768 dim, 2048 context), `mxbai-embed-large` (1024 dim, 512 context)
- OS: macOS (Darwin), Python 3.14
- Qdrant: local Docker (auto-started by ACI)
Steps to Reproduce
- Install ACI and set up `.env` for Ollama:

  ```
  ACI_EMBEDDING_API_URL=http://localhost:11434/v1/embeddings
  ACI_EMBEDDING_API_KEY=ollama
  ACI_EMBEDDING_MODEL=nomic-embed-text
  ACI_EMBEDDING_DIMENSION=768
  ACI_EMBEDDING_BATCH_SIZE=1
  ACI_INDEXING_MAX_CHUNK_TOKENS=256  # Even extremely low values don't help
  ACI_VECTOR_STORE_VECTOR_SIZE=768
  ```

- Run: `aci index /path/to/any/codebase`
- Observe failure:

  ```
  Chunk exceeds token limit (528 > 256), this may indicate a very long single line
  Batch batch_xxx failed: API error: 400 - {"error":{"message":"the input length exceeds the context length","type":"api_error","param":null,"code":null}}
  ```
Root Cause
In `src/aci/core/tokenizer.py`, ACI hardcodes `cl100k_base` (OpenAI's BPE tokenizer):

```python
class TiktokenTokenizer(TokenizerInterface):
    def __init__(self, encoding_name: str = "cl100k_base"):
        self._encoding_name = encoding_name
        # ...

    def count_tokens(self, text: str) -> int:
        return len(self.encoding.encode(text))
```

And in `get_default_tokenizer()`:

```python
def get_default_tokenizer() -> TokenizerInterface:
    return TiktokenTokenizer(encoding_name="cl100k_base")
```
The problem: `cl100k_base` and BERT WordPiece tokenizers can produce wildly different token counts for the same text. A code chunk that tiktoken counts as 256 tokens can easily be 600–1000+ tokens in a BERT WordPiece tokenizer, because:
- BPE (tiktoken) merges common subword pairs aggressively — code identifiers like `handleAuthenticationCallback` might be 2-3 BPE tokens
- WordPiece splits more conservatively — the same identifier could be 5-8 WordPiece tokens
- Special characters, camelCase, and code syntax amplify the divergence
This means the chunker produces chunks that it thinks are within limits, but the actual embedding model rejects them.
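For illustration, a small comparison script makes the divergence easy to reproduce. It assumes `tiktoken` and Hugging Face `transformers` are installed, and uses `bert-base-uncased` as a stand-in WordPiece tokenizer (not necessarily the exact tokenizer Ollama's embedding models ship with):

```python
# compare_token_counts.py — illustrative comparison of BPE vs. WordPiece counts
import tiktoken
from transformers import AutoTokenizer

code_chunk = """
def handleAuthenticationCallback(request, oauth_state, redirect_uri):
    token_response = exchange_authorization_code(request.args["code"], redirect_uri)
    session["access_token"] = token_response["access_token"]
    return redirect(build_post_login_url(oauth_state))
"""

bpe = tiktoken.get_encoding("cl100k_base")
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")

bpe_count = len(bpe.encode(code_chunk))
wp_count = len(wordpiece.encode(code_chunk, add_special_tokens=True))

print(f"cl100k_base (BPE):      {bpe_count} tokens")
print(f"bert-base-uncased (WP): {wp_count} tokens")
print(f"ratio: {wp_count / bpe_count:.2f}x")
```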
Observed Behavior
| `ACI_INDEXING_MAX_CHUNK_TOKENS` | Result |
| --- | --- |
| 8192 (default) | Fails — chunks up to 42,982 tokens in model's tokenizer |
| 2048 | Fails — chunks still exceed nomic's 2048 context |
| 512 | Fails — chunks reported as 528 by ACI, but actually much larger for model |
| 256 | Fails — chunks reported as 256-528, still exceed context |
There is no safe value because the token count ratio is unpredictable and can be 2-4x.
Suggested Fix
Option A: Configurable tokenizer (minimal change)
Add an `ACI_TOKENIZER` env var that allows selecting the tokenizer strategy:

```
# .env
ACI_TOKENIZER=character  # or "tiktoken" (default), "simple" (whitespace split)
```
A simple character-based estimator (e.g., `len(text) / 4`) would be conservative enough for any model.
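A minimal sketch of what that strategy could look like, reusing the `TokenizerInterface` and `TiktokenTokenizer` names from the snippet above. The class name, the env-var wiring, and the 4-chars-per-token constant are illustrative, not an actual implementation:

```python
import math
import os


class CharacterEstimateTokenizer(TokenizerInterface):
    """Conservative estimator: assume ~4 characters per token, independent of
    the embedding model's real tokenizer."""

    def __init__(self, chars_per_token: float = 4.0):
        self._chars_per_token = chars_per_token

    def count_tokens(self, text: str) -> int:
        # Round up so short chunks are never under-counted.
        return max(1, math.ceil(len(text) / self._chars_per_token))


def get_default_tokenizer() -> TokenizerInterface:
    # Hypothetical wiring: choose the strategy from ACI_TOKENIZER,
    # defaulting to the current tiktoken behaviour.
    strategy = os.getenv("ACI_TOKENIZER", "tiktoken")
    if strategy == "character":
        return CharacterEstimateTokenizer()
    return TiktokenTokenizer(encoding_name="cl100k_base")
```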
Option B: Auto-detect from embedding model (better UX)
Query the Ollama API (`/api/show`) at startup to get the model's actual context length and tokenizer type, then apply a safety margin.
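A rough sketch of that startup probe, assuming Ollama's `/api/show` endpoint and its `model_info` response map, whose context-length key is prefixed by the model architecture (e.g. `bert.context_length`); the function name and the 75% safety margin are made up for illustration:

```python
import requests


def detect_context_length(model: str, base_url: str = "http://localhost:11434") -> int | None:
    """Ask Ollama for a model's context window; return None if it can't be found."""
    resp = requests.post(f"{base_url}/api/show", json={"model": model}, timeout=5)
    resp.raise_for_status()
    model_info = resp.json().get("model_info", {})
    # The exact key is architecture-prefixed, so scan for any "*.context_length" entry.
    for key, value in model_info.items():
        if key.endswith(".context_length"):
            return int(value)
    return None


# Hypothetical usage: cap chunks at ~75% of the model's context as a safety margin.
ctx = detect_context_length("nomic-embed-text")
max_chunk_tokens = int(ctx * 0.75) if ctx else 256
```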
Option C: Graceful skip on embedding failure (safety net)
The branch `codex/fix-indexing-failure-for-oversized-items` (commit `72e88b5`) adds skip-on-failure logic. This should be merged to master as a safety net regardless of the tokenizer fix — one oversized chunk shouldn't abort the entire index.
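For illustration only (this is not the code on that branch), the skip-on-failure idea amounts to catching the per-chunk embedding error instead of letting it abort the whole batch:

```python
def embed_batch_with_skip(chunks, embed_fn, logger):
    """Embed each chunk, skipping (and logging) any chunk the embedding API
    rejects instead of failing the whole batch."""
    embedded, skipped = [], []
    for chunk in chunks:
        try:
            embedded.append((chunk, embed_fn(chunk.text)))
        except Exception as exc:  # in practice, catch the client's specific API error type
            logger.warning("Skipping chunk %s: %s", getattr(chunk, "id", "?"), exc)
            skipped.append(chunk)
    return embedded, skipped
```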
Recommended approach
Combine Option A + Option C: let users pick a conservative tokenizer for non-OpenAI models, and always gracefully skip chunks that the embedding API rejects rather than aborting the whole batch.
Additional Context
- The README advertises "OpenAI-compatible API (OpenAI, SiliconFlow, etc.)" support, which implies Ollama should work since it exposes an OpenAI-compatible `/v1/embeddings` endpoint
- Ollama is a very popular local alternative — many users who want a "free" setup (as documented in ACI's README) will hit this
- The `ACI_EMBEDDING_BATCH_SIZE=1` setting doesn't help because even individual chunks exceed the model's context
- The `ACI_INDEXING_FILE_EXTENSIONS` filter doesn't help because the issue affects normal source code files, not just minified assets