Indexing always fails with Ollama/BERT-based embedding models due to tiktoken tokenizer mismatch #14

@vkuprin

Description

ACI uses tiktoken with cl100k_base encoding (OpenAI's BPE tokenizer) to count chunk tokens, but when using Ollama with BERT-based embedding models (e.g., nomic-embed-text, mxbai-embed-large), these models use WordPiece tokenizers that produce significantly different token counts. This mismatch causes indexing to always fail with "the input length exceeds the context length" — no matter how low ACI_INDEXING_MAX_CHUNK_TOKENS is set.

Environment

  • ACI version: latest master (4b52424)
  • Embedding provider: Ollama (local) via http://localhost:11434/v1/embeddings
  • Models tested: nomic-embed-text (768 dim, 2048 context), mxbai-embed-large (1024 dim, 512 context)
  • OS: macOS (Darwin), Python 3.14
  • Qdrant: local Docker (auto-started by ACI)

Steps to Reproduce

  1. Install ACI and set up .env for Ollama:
ACI_EMBEDDING_API_URL=http://localhost:11434/v1/embeddings
ACI_EMBEDDING_API_KEY=ollama
ACI_EMBEDDING_MODEL=nomic-embed-text
ACI_EMBEDDING_DIMENSION=768
ACI_EMBEDDING_BATCH_SIZE=1
ACI_INDEXING_MAX_CHUNK_TOKENS=256  # Even extremely low values don't help
ACI_VECTOR_STORE_VECTOR_SIZE=768
  2. Run: aci index /path/to/any/codebase

  3. Observe failure:

Chunk exceeds token limit (528 > 256), this may indicate a very long single line
Batch batch_xxx failed: API error: 400 - {"error":{"message":"the input length exceeds the context length","type":"api_error","param":null,"code":null}}

Root Cause

In src/aci/core/tokenizer.py, ACI hardcodes cl100k_base (OpenAI's BPE tokenizer):

class TiktokenTokenizer(TokenizerInterface):
    def __init__(self, encoding_name: str = "cl100k_base"):
        self._encoding_name = encoding_name
        # ...

    def count_tokens(self, text: str) -> int:
        return len(self.encoding.encode(text))

And in get_default_tokenizer():

def get_default_tokenizer() -> TokenizerInterface:
    return TiktokenTokenizer(encoding_name="cl100k_base")

The problem: cl100k_base and BERT WordPiece tokenizers can produce wildly different token counts for the same text. A code chunk that tiktoken counts as 256 tokens can easily be 600–1000+ tokens in a BERT WordPiece tokenizer, because:

  • BPE (tiktoken) merges common subword pairs aggressively — code identifiers like handleAuthenticationCallback might be 2-3 BPE tokens
  • WordPiece splits more conservatively — the same identifier could be 5-8 WordPiece tokens
  • Special characters, camelCase, and code syntax amplify the divergence

This means the chunker produces chunks that it thinks are within limits, but the actual embedding model rejects them.
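The divergence can be illustrated with a toy greedy longest-match splitter. This is not the real tiktoken or WordPiece implementation, and the two vocabularies below are made up purely to show the mechanism:

```python
def greedy_wordpiece(text: str, vocab: set) -> list:
    """Greedy longest-match-first subword split (WordPiece-style)."""
    pieces = []
    i = 0
    while i < len(text):
        j = len(text)
        # Shrink the window until the prefix is in the vocabulary.
        while j > i and text[i:j] not in vocab:
            j -= 1
        if j == i:
            # Unknown character: emit it as its own piece.
            pieces.append(text[i])
            i += 1
        else:
            pieces.append(text[i:j])
            i = j
    return pieces

# BPE-style vocab: common code subwords already merged into large units.
bpe_vocab = {"handle", "Authentication", "Callback"}
# WordPiece-style vocab: only short, general-purpose pieces.
wp_vocab = {"hand", "le", "Auth", "ent", "ic", "ation", "Call", "back"}

ident = "handleAuthenticationCallback"
print(len(greedy_wordpiece(ident, bpe_vocab)))  # 3 pieces
print(len(greedy_wordpiece(ident, wp_vocab)))   # 8 pieces
```

Under these toy vocabularies the same identifier costs 3 tokens versus 8, mirroring the 2-3 vs 5-8 split described above: a count that looks safe under one vocabulary can be far over the limit under the other.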

Observed Behavior

ACI_INDEXING_MAX_CHUNK_TOKENS   Result
8192 (default)                  Fails — chunks up to 42,982 tokens in the model's tokenizer
2048                            Fails — chunks still exceed nomic-embed-text's 2048-token context
512                             Fails — chunks reported as 528 by ACI, but actually much larger for the model
256                             Fails — chunks reported as 256-528, still exceed the context

There is no safe value because the token-count ratio between the two tokenizers is unpredictable and can reach 2-4x.

Suggested Fix

Option A: Configurable tokenizer (minimal change)

Add an ACI_TOKENIZER env var that allows selecting the tokenizer strategy:

# .env
ACI_TOKENIZER=character  # or "tiktoken" (default), "simple" (whitespace split)

A simple character-based estimator (e.g., len(text) / 4) would be conservative enough for any model.
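A minimal sketch of such an estimator, assuming ACI's TokenizerInterface only requires a count_tokens() method; the class name and ceiling division are illustrative, not ACI code:

```python
import math

class CharacterEstimateTokenizer:
    """Estimate token counts as ceil(len(text) / chars_per_token).

    Lowering chars_per_token inflates the estimate, which is the safe
    direction for chunk-size limits: chunks get smaller, never larger.
    """

    def __init__(self, chars_per_token: int = 4):
        self._chars_per_token = chars_per_token

    def count_tokens(self, text: str) -> int:
        # Round up so non-empty text never reports zero tokens.
        return math.ceil(len(text) / self._chars_per_token)
```

With the default, a 400-character chunk counts as 100 tokens; dropping chars_per_token to 2 roughly doubles the estimate, which would cover even the worst BPE-to-WordPiece ratios observed above.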

Option B: Auto-detect from embedding model (better UX)

Query the Ollama API (/api/show) at startup to get the model's actual context length and tokenizer type, then apply a safety margin.
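A hedged sketch of that probe using only the standard library. The exact key names inside Ollama's model_info response vary by model architecture (e.g. something like nomic-bert.context_length for nomic-embed-text), so the code searches for any *.context_length key rather than hardcoding one; the margin value is an assumption, not a tested constant:

```python
import json
import urllib.request

def fetch_model_info(model: str, base_url: str = "http://localhost:11434") -> dict:
    """POST to Ollama's /api/show and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/api/show",
        data=json.dumps({"model": model}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def context_length(show_response: dict, default: int = 512) -> int:
    """Find an architecture-specific '<arch>.context_length' entry."""
    for key, value in show_response.get("model_info", {}).items():
        if key.endswith(".context_length"):
            return int(value)
    return default  # conservative fallback when the key is absent

def safe_chunk_limit(ctx: int, margin: float = 0.5) -> int:
    """Halving the context absorbs a ~2x tokenizer-count mismatch."""
    return max(1, int(ctx * margin))
```

For nomic-embed-text this would yield safe_chunk_limit(2048) == 1024, which could become the effective ACI_INDEXING_MAX_CHUNK_TOKENS instead of the user-supplied value.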

Option C: Graceful skip on embedding failure (safety net)

The branch codex/fix-indexing-failure-for-oversized-items (commit 72e88b5) adds skip-on-failure logic. This should be merged to master as a safety net regardless of the tokenizer fix — one oversized chunk shouldn't abort the entire index.
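The skip logic could look roughly like this. It is a sketch, not the actual code from 72e88b5, and it assumes an embed callable that raises on a 400 response:

```python
import logging

def embed_batch(chunks, embed, logger=logging.getLogger("aci.indexing")):
    """Embed each chunk, skipping (not aborting on) ones the API rejects."""
    vectors, skipped = [], []
    for chunk in chunks:
        try:
            vectors.append(embed(chunk))
        except Exception as exc:  # e.g. 400 "input length exceeds the context length"
            logger.warning(
                "Skipping oversized chunk (%d chars): %s", len(chunk), exc
            )
            skipped.append(chunk)
    return vectors, skipped
```

A failed chunk is logged and collected rather than re-raised, so one oversized input no longer aborts the whole batch; the skipped list could be surfaced in the index summary.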

Recommended approach

Combine Option A + Option C: let users pick a conservative tokenizer for non-OpenAI models, and always gracefully skip chunks that the embedding API rejects rather than aborting the whole batch.

Additional Context

  • The README advertises "OpenAI-compatible API (OpenAI, SiliconFlow, etc.)" support, which implies Ollama should work since it exposes an OpenAI-compatible /v1/embeddings endpoint
  • Ollama is a very popular local alternative — many users who want a "free" setup (as documented in ACI's README) will hit this
  • The ACI_EMBEDDING_BATCH_SIZE=1 setting doesn't help because even individual chunks exceed the model's context
  • The ACI_INDEXING_FILE_EXTENSIONS filter doesn't help because the issue affects normal source code files, not just minified assets
