Indexing always fails with Ollama/BERT-based embedding models due to tiktoken tokenizer mismatch #14

@vkuprin

Description

ACI uses tiktoken with cl100k_base encoding (OpenAI's BPE tokenizer) to count chunk tokens, but when using Ollama with BERT-based embedding models (e.g., nomic-embed-text, mxbai-embed-large), these models use WordPiece tokenizers that produce significantly different token counts. This mismatch causes indexing to always fail with "the input length exceeds the context length" — no matter how low ACI_INDEXING_MAX_CHUNK_TOKENS is set.

Environment

  • ACI version: latest master (4b52424)
  • Embedding provider: Ollama (local) via http://localhost:11434/v1/embeddings
  • Models tested: nomic-embed-text (768 dim, 2048 context), mxbai-embed-large (1024 dim, 512 context)
  • OS: macOS (Darwin), Python 3.14
  • Qdrant: local Docker (auto-started by ACI)

Steps to Reproduce

  1. Install ACI and set up .env for Ollama:
ACI_EMBEDDING_API_URL=http://localhost:11434/v1/embeddings
ACI_EMBEDDING_API_KEY=ollama
ACI_EMBEDDING_MODEL=nomic-embed-text
ACI_EMBEDDING_DIMENSION=768
ACI_EMBEDDING_BATCH_SIZE=1
ACI_INDEXING_MAX_CHUNK_TOKENS=256  # Even extremely low values don't help
ACI_VECTOR_STORE_VECTOR_SIZE=768
  2. Run: aci index /path/to/any/codebase

  3. Observe failure:

Chunk exceeds token limit (528 > 256), this may indicate a very long single line
Batch batch_xxx failed: API error: 400 - {"error":{"message":"the input length exceeds the context length","type":"api_error","param":null,"code":null}}

Root Cause

In src/aci/core/tokenizer.py, ACI hardcodes cl100k_base (OpenAI's BPE tokenizer):

class TiktokenTokenizer(TokenizerInterface):
    def __init__(self, encoding_name: str = "cl100k_base"):
        self._encoding_name = encoding_name
        # ...

    def count_tokens(self, text: str) -> int:
        return len(self.encoding.encode(text))

And in get_default_tokenizer():

def get_default_tokenizer() -> TokenizerInterface:
    return TiktokenTokenizer(encoding_name="cl100k_base")

The problem: cl100k_base and BERT WordPiece tokenizers can produce wildly different token counts for the same text. A code chunk that tiktoken counts as 256 tokens can easily be 600–1000+ tokens in a BERT WordPiece tokenizer, because:

  • BPE (tiktoken) merges common subword pairs aggressively — code identifiers like handleAuthenticationCallback might be 2-3 BPE tokens
  • WordPiece splits more conservatively — the same identifier could be 5-8 WordPiece tokens
  • Special characters, camelCase, and code syntax amplify the divergence

This means the chunker produces chunks that it thinks are within limits, but the actual embedding model rejects them.
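The divergence can be illustrated with a toy greedy longest-match splitter. This is not the real tiktoken or WordPiece implementation, and the two vocabularies below are made up purely to show the mechanism:

```python
def greedy_wordpiece(text: str, vocab: set) -> list:
    """Greedy longest-match-first subword split (WordPiece-style)."""
    pieces = []
    i = 0
    while i < len(text):
        j = len(text)
        # Shrink the window until the prefix is in the vocabulary.
        while j > i and text[i:j] not in vocab:
            j -= 1
        if j == i:
            # Unknown character: emit it as its own piece.
            pieces.append(text[i])
            i += 1
        else:
            pieces.append(text[i:j])
            i = j
    return pieces

# BPE-style vocab: common code subwords already merged into large units.
bpe_vocab = {"handle", "Authentication", "Callback"}
# WordPiece-style vocab: only short, general-purpose pieces.
wp_vocab = {"hand", "le", "Auth", "ent", "ic", "ation", "Call", "back"}

ident = "handleAuthenticationCallback"
print(len(greedy_wordpiece(ident, bpe_vocab)))  # 3 pieces
print(len(greedy_wordpiece(ident, wp_vocab)))   # 8 pieces
```

Under these toy vocabularies the same identifier costs 3 tokens versus 8, mirroring the 2-3 vs 5-8 split described above: a count that looks safe under one vocabulary can be far over the limit under the other.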

Observed Behavior

ACI_INDEXING_MAX_CHUNK_TOKENS   Result
8192 (default)                  Fails — chunks up to 42,982 tokens in the model's tokenizer
2048                            Fails — chunks still exceed nomic-embed-text's 2048-token context
512                             Fails — chunks reported as 528 by ACI, but actually much larger for the model
256                             Fails — chunks reported as 256-528, still exceed the context

There is no safe value because the token-count ratio between the two tokenizers is unpredictable and can reach 2-4x.

Suggested Fix

Option A: Configurable tokenizer (minimal change)

Add an ACI_TOKENIZER env var that allows selecting the tokenizer strategy:

# .env
ACI_TOKENIZER=character  # or "tiktoken" (default), "simple" (whitespace split)

A simple character-based estimator (e.g., len(text) / 4) would be conservative enough for any model.
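A minimal sketch of such an estimator, assuming ACI's TokenizerInterface only requires a count_tokens() method; the class name and ceiling division are illustrative, not ACI code:

```python
import math

class CharacterEstimateTokenizer:
    """Estimate token counts as ceil(len(text) / chars_per_token).

    Lowering chars_per_token inflates the estimate, which is the safe
    direction for chunk-size limits: chunks get smaller, never larger.
    """

    def __init__(self, chars_per_token: int = 4):
        self._chars_per_token = chars_per_token

    def count_tokens(self, text: str) -> int:
        # Round up so non-empty text never reports zero tokens.
        return math.ceil(len(text) / self._chars_per_token)
```

With the default, a 400-character chunk counts as 100 tokens; dropping chars_per_token to 2 roughly doubles the estimate, which would cover even the worst BPE-to-WordPiece ratios observed above.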

Option B: Auto-detect from embedding model (better UX)

Query the Ollama API (/api/show) at startup to get the model's actual context length and tokenizer type, then apply a safety margin.
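A hedged sketch of that probe using only the standard library. The exact key names inside Ollama's model_info response vary by model architecture (e.g. something like nomic-bert.context_length for nomic-embed-text), so the code searches for any *.context_length key rather than hardcoding one; the margin value is an assumption, not a tested constant:

```python
import json
import urllib.request

def fetch_model_info(model: str, base_url: str = "http://localhost:11434") -> dict:
    """POST to Ollama's /api/show and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/api/show",
        data=json.dumps({"model": model}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def context_length(show_response: dict, default: int = 512) -> int:
    """Find an architecture-specific '<arch>.context_length' entry."""
    for key, value in show_response.get("model_info", {}).items():
        if key.endswith(".context_length"):
            return int(value)
    return default  # conservative fallback when the key is absent

def safe_chunk_limit(ctx: int, margin: float = 0.5) -> int:
    """Halving the context absorbs a ~2x tokenizer-count mismatch."""
    return max(1, int(ctx * margin))
```

For nomic-embed-text this would yield safe_chunk_limit(2048) == 1024, which could become the effective ACI_INDEXING_MAX_CHUNK_TOKENS instead of the user-supplied value.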

Option C: Graceful skip on embedding failure (safety net)

The branch codex/fix-indexing-failure-for-oversized-items (commit 72e88b5) adds skip-on-failure logic. This should be merged to master as a safety net regardless of the tokenizer fix — one oversized chunk shouldn't abort the entire index.
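The skip logic could look roughly like this. It is a sketch, not the actual code from 72e88b5, and it assumes an embed callable that raises on a 400 response:

```python
import logging

def embed_batch(chunks, embed, logger=logging.getLogger("aci.indexing")):
    """Embed each chunk, skipping (not aborting on) ones the API rejects."""
    vectors, skipped = [], []
    for chunk in chunks:
        try:
            vectors.append(embed(chunk))
        except Exception as exc:  # e.g. 400 "input length exceeds the context length"
            logger.warning(
                "Skipping oversized chunk (%d chars): %s", len(chunk), exc
            )
            skipped.append(chunk)
    return vectors, skipped
```

A failed chunk is logged and collected rather than re-raised, so one oversized input no longer aborts the whole batch; the skipped list could be surfaced in the index summary.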

Recommended approach

Combine Option A + Option C: let users pick a conservative tokenizer for non-OpenAI models, and always gracefully skip chunks that the embedding API rejects rather than aborting the whole batch.

Additional Context

  • The README advertises "OpenAI-compatible API (OpenAI, SiliconFlow, etc.)" support, which implies Ollama should work since it exposes an OpenAI-compatible /v1/embeddings endpoint
  • Ollama is a very popular local alternative — many users who want a "free" setup (as documented in ACI's README) will hit this
  • The ACI_EMBEDDING_BATCH_SIZE=1 setting doesn't help because even individual chunks exceed the model's context
  • The ACI_INDEXING_FILE_EXTENSIONS filter doesn't help because the issue affects normal source code files, not just minified assets
