
Introduce staged parallelism in ingestion pipeline (classification → judge → weaver) #153

@anirudhaacharyap


Problem

The current ingestion pipeline runs in one of two modes:

  • Fully sequential (safe but slow), or
  • Fully parallel (fast but risks race conditions during updates)

We need a hybrid approach to improve throughput while maintaining consistency.

Goal

Introduce staged parallelism to balance performance and correctness.

Proposed Approach

Split pipeline into 3 stages:

Phase A — Classification / Extraction (Parallel)

  • Run LLM-based extraction for all batch items concurrently
  • Independent per item (no shared state)

Phase B — Judge (Sequential)

  • Critical section
  • Evaluate and decide updates one-by-one
  • Ensures deterministic memory updates

Phase C — Weaver (Parallel)

  • Apply final writes (DB / vector store)
  • Parallelizable when writes do not conflict on shared state
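A minimal sketch of the three stages, assuming an asyncio-based pipeline; the function names (`classify`, `judge`, `weave`) and payload shapes are illustrative, not the actual module API:

```python
import asyncio

async def classify(item: str) -> dict:
    # Phase A: independent per-item extraction (no shared state, safe to fan out)
    await asyncio.sleep(0)  # stands in for the LLM call
    return {"item": item, "facts": [item.upper()]}

async def judge(state: dict, extraction: dict) -> dict:
    # Phase B: critical section -- mutates shared memory one item at a time
    state.setdefault("memory", []).extend(extraction["facts"])
    return {"item": extraction["item"], "accepted": True}

async def weave(decision: dict) -> str:
    # Phase C: final write (DB / vector store); parallel when rows don't conflict
    await asyncio.sleep(0)
    return f"wrote:{decision['item']}"

async def ingest(items: list[str]) -> list[str]:
    state: dict = {}
    # Phase A: concurrent fan-out across the batch
    extractions = await asyncio.gather(*(classify(i) for i in items))
    # Phase B: strictly sequential, so memory updates stay deterministic
    decisions = [await judge(state, e) for e in extractions]
    # Phase C: concurrent fan-out for the final writes
    return list(await asyncio.gather(*(weave(d) for d in decisions)))

results = asyncio.run(ingest(["a", "b", "c"]))
```

The stage boundaries are the two `gather` calls: everything between them is the sequential critical section, which is the only place shared memory is touched.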

Benefits

  • Reduces end-to-end batch latency (Phases A and C fan out across items)
  • Preserves correctness in update step
  • Scales better for batch ingestion

Scope

  • Refactor pipeline execution into stages
  • Introduce controlled concurrency boundaries
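One possible shape for a "controlled concurrency boundary": cap the fan-out of the parallel phases with a semaphore so a large batch cannot exhaust LLM or DB connections. `bounded_gather` and its `limit` parameter are hypothetical names for this sketch:

```python
import asyncio

async def bounded_gather(coros, limit: int = 8):
    # At most `limit` coroutines in flight; results keep input order.
    sem = asyncio.Semaphore(limit)

    async def _run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(_run(c) for c in coros))

async def _demo(x: int) -> int:
    await asyncio.sleep(0)
    return x * 2

out = asyncio.run(bounded_gather([_demo(i) for i in range(5)], limit=2))
```

Phases A and C could both call such a helper, while Phase B bypasses it entirely and awaits items one by one.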

Acceptance Criteria

  • Classification runs concurrently
  • Judge phase is strictly sequential
  • Weaver phase can run concurrently
  • No race conditions in memory updates
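The second and fourth criteria are checkable: even when Phase A completes out of order, the Judge phase must see items in deterministic batch order. A hypothetical test along these lines (names are illustrative):

```python
import asyncio
import random

async def classify(item: int) -> int:
    # Simulate out-of-order completion of the parallel extraction phase
    await asyncio.sleep(random.random() / 100)
    return item

async def ingest(items: list[int], judged_order: list[int]) -> None:
    # gather() returns results in input order regardless of completion order,
    # so the sequential Judge loop below observes a deterministic sequence.
    extractions = await asyncio.gather(*(classify(i) for i in items))
    for e in extractions:  # Phase B: strictly sequential
        judged_order.append(e)

order: list[int] = []
asyncio.run(ingest(list(range(10)), order))
```

Repeating this under varying delays would exercise the "no race conditions in memory updates" criterion.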
