Streaming commit transaction does embedding and fuzzy index work that could be pipelined #270

@KRRT7

_commit_batch_streaming (conversation_base.py line 434) opens a transaction and calls _update_secondary_indexes_incremental inside it. That function does two expensive things:

  1. Embedding generation: _update_message_index_incremental -> message_index.add_messages() -> text_location_index.add_text_locations() -> _embedding_index.add_texts(). These are API calls to the embedding model, happening inside the DB transaction.

  2. Fuzzy index term embeddings: _update_related_terms_incremental collects new terms from semantic refs and calls fuzzy_index.add_terms(), which generates an embedding per unique term. Many terms repeat across batches; the CachingEmbeddingModel helps, but the per-batch overhead of collecting and checking is still there.

The two-stage pipeline already overlaps LLM extraction(N+1) with commit(N). But the commit phase itself is slower than necessary because of these embedding calls. Pre-computing embeddings alongside knowledge extraction (before the transaction opens) would keep the commit phase to pure DB writes. The fuzzy index could also be deferred to a single pass after all batches complete, since it is only needed at query time, not for correctness of the ingested data.
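
A rough sketch of the proposed shape, not the actual conversation_base.py code. Every name here (fake_embed, prepare_batch, commit_batch, finalize_fuzzy_index, the table names) is hypothetical, and an in-memory sqlite3 database stands in for the real storage layer. It only shows the ordering: embeddings are computed before the transaction opens, the transaction does nothing but writes, and fuzzy-index terms are accumulated and embedded once at the end. It does not try to show the extraction/commit overlap.

```python
import asyncio
import sqlite3


async def fake_embed(texts: list[str]) -> list[list[float]]:
    """Hypothetical stand-in for the embedding model (a network call in practice)."""
    await asyncio.sleep(0)  # yield to the event loop, as a real API call would
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


async def prepare_batch(messages: list[str]) -> list[tuple[str, list[float]]]:
    """Stage 1: all embedding work happens here, with no transaction open."""
    vectors = await fake_embed(messages)
    return list(zip(messages, vectors))


def commit_batch(
    db: sqlite3.Connection,
    prepared: list[tuple[str, list[float]]],
    terms: list[str],
    pending_terms: set[str],
) -> None:
    """Stage 2: the transaction contains only DB writes, no model calls."""
    with db:  # commits on success, rolls back on exception
        db.executemany(
            "INSERT INTO message_embeddings (text, vec) VALUES (?, ?)",
            [(text, repr(vec)) for text, vec in prepared],
        )
    # Terms are only remembered here; the set de-duplicates across batches.
    pending_terms.update(terms)


async def finalize_fuzzy_index(db: sqlite3.Connection, pending_terms: set[str]) -> None:
    """Deferred pass: embed each unique term once, after all batches are committed."""
    terms = sorted(pending_terms)
    vectors = await fake_embed(terms)  # embedding still happens outside the transaction
    with db:
        db.executemany(
            "INSERT INTO term_embeddings (term, vec) VALUES (?, ?)",
            [(term, repr(vec)) for term, vec in zip(terms, vectors)],
        )


async def ingest(batches: list[tuple[list[str], list[str]]]) -> sqlite3.Connection:
    """Run every (messages, terms) batch, then build the fuzzy index in one pass."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE message_embeddings (text TEXT, vec TEXT)")
    db.execute("CREATE TABLE term_embeddings (term TEXT PRIMARY KEY, vec TEXT)")
    pending_terms: set[str] = set()
    for messages, terms in batches:
        prepared = await prepare_batch(messages)
        commit_batch(db, prepared, terms, pending_terms)
    await finalize_fuzzy_index(db, pending_terms)
    return db


if __name__ == "__main__":
    demo = [(["hello", "world"], ["greeting", "planet"]), (["hi"], ["greeting"])]
    conn = asyncio.run(ingest(demo))
    print(conn.execute("SELECT term FROM term_embeddings ORDER BY term").fetchall())
```

In the real pipeline prepare_batch would presumably run alongside knowledge extraction for the same batch, so its cost would be hidden behind the LLM call. One trade-off of the deferred pass is that fuzzy term lookups are unavailable until ingestion finishes.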
