Optional: Cross-algorithm allocation budget / tokenisation cache for Scorer #2

@millerjp

Background

During the Phase 8 review (2026-05-14) and the Phase 8.5 discuss-phase (2026-05-17), the security-reviewer agent flagged that DefaultScorer.Score() allocates approximately 5 MB of heap per call across its 6 algorithms:

  • DamerauLevenshteinOSA: ~2.4 MB (three-row int DP)
  • JaroWinkler: 2 × [256]bool (stack, negligible)
  • TokenJaccard: 2 tokenisations (rune-count maps)
  • QGramJaccard: 2 × map[string]int with capacity (len(s)-n+1)*5/4
  • SorensenDice: same as QGramJaccard
  • DoubleMetaphone: 2 × [4]byte (negligible post-Phase 8.5 optimisation)

Total: ~5 MB heap per Score() call.
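
For readers unfamiliar with the "three-row int DP" that dominates the figures above, here is a minimal sketch of an optimal-string-alignment distance using three rolling rows. It is an illustration only, not the library's DamerauLevenshteinOSA code; osaDistance, min3 and threeRowBytes are hypothetical names, and the byte estimate assumes the rows are sized by the second input's rune count.

```go
package main

import (
	"fmt"
	"math/bits"
)

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// osaDistance computes the optimal string alignment distance
// (Levenshtein plus adjacent transpositions) using three rolling rows.
// Hypothetical sketch, not the library's implementation.
func osaDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	n := len(rb)
	prev2 := make([]int, n+1) // row i-2, needed for transpositions
	prev := make([]int, n+1)  // row i-1
	cur := make([]int, n+1)   // row i
	for j := 0; j <= n; j++ {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur[0] = i
		for j := 1; j <= n; j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			d := min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
			if i > 1 && j > 1 && ra[i-1] == rb[j-2] && ra[i-2] == rb[j-1] {
				if t := prev2[j-2] + 1; t < d {
					d = t
				}
			}
			cur[j] = d
		}
		// Rotate rows: the oldest row becomes scratch space for the next pass.
		prev2, prev, cur = prev, cur, prev2
	}
	return prev[n]
}

// threeRowBytes estimates the bytes held by the three rows for a
// second input of n runes on the current platform.
func threeRowBytes(n int) int {
	return 3 * (n + 1) * (bits.UintSize / 8)
}

func main() {
	fmt.Println(osaDistance("ca", "abc")) // 3
	fmt.Println(threeRowBytes(1000))
}
```

The three slices are the only per-call heap allocations in this sketch apart from the rune conversions, so their combined size grows linearly with input length.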

The dispatch-table abstraction in dispatch_*.go means each algorithm allocates its own tokenisation independently — there is no cross-algorithm input reuse within a single Score() invocation.
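
To make the lack of reuse concrete, the sketch below models two q-gram algorithms behind a dispatch table, each building its own profiles. All names (qgramProfile, overlap, dispatch, profileBuilds) are hypothetical and the real dispatch_*.go layout may differ; the map capacity uses the (len-n+1)*5/4 formula from the list above, applied here to the rune count as an assumption.

```go
package main

import "fmt"

// profileBuilds counts profile constructions (instrumentation for this sketch only).
var profileBuilds int

// qgramProfile builds a q-gram multiset for s with gram size n.
func qgramProfile(s string, n int) map[string]int {
	profileBuilds++
	r := []rune(s)
	if len(r) < n {
		return map[string]int{}
	}
	m := make(map[string]int, (len(r)-n+1)*5/4)
	for i := 0; i+n <= len(r); i++ {
		m[string(r[i:i+n])]++
	}
	return m
}

// overlap returns the multiset intersection size and both profile sizes.
func overlap(pa, pb map[string]int) (inter, sizeA, sizeB int) {
	for g, ca := range pa {
		sizeA += ca
		if cb := pb[g]; cb < ca {
			inter += cb
		} else {
			inter += ca
		}
	}
	for _, cb := range pb {
		sizeB += cb
	}
	return
}

// qgramJaccard tokenises both inputs itself.
func qgramJaccard(a, b string) float64 {
	inter, sa, sb := overlap(qgramProfile(a, 2), qgramProfile(b, 2))
	if sa+sb-inter == 0 {
		return 0
	}
	return float64(inter) / float64(sa+sb-inter)
}

// sorensenDice repeats the same tokenisation independently.
func sorensenDice(a, b string) float64 {
	inter, sa, sb := overlap(qgramProfile(a, 2), qgramProfile(b, 2))
	if sa+sb == 0 {
		return 0
	}
	return 2 * float64(inter) / float64(sa+sb)
}

// dispatch mimics a per-algorithm table; every entry only sees the raw strings.
var dispatch = []struct {
	name string
	fn   func(a, b string) float64
}{
	{"QGramJaccard", qgramJaccard},
	{"SorensenDice", sorensenDice},
}

func main() {
	for _, d := range dispatch {
		fmt.Printf("%s %.3f\n", d.name, d.fn("night", "nacht"))
	}
	fmt.Println("profile builds:", profileBuilds) // 4: two per algorithm
}
```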

Decision (deferred from v1.0)

The Phase 8.5 discuss-phase session decided to defer this optimisation work from v1.0 for the following reasons:

  1. Not a real-world DoS today. Go's GC handles ~5 MB/call fine at expected library throughput. No consumer has reported GC pressure.
  2. Both implementation options need real design work that does not fit the v1.0 schedule:
    • WithMaxScoreAllocBytes(n int) ScorerOption: alloc-budget enforcement has unclear semantics if the budget is hit mid-Score (truncate? error? best-effort?).
    • Cross-algorithm tokenisation cache scoped to a single Score() call: needs careful lifetime/safety design to avoid data races, or reuse of buffers after they have been released, across goroutines if the Scorer is shared.
  3. v1.0 documents the 5 MB/call as expected behaviour in docs/algorithms.md#performance-characteristics, allowing consumers to profile their own workloads.
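
As a starting point for the design discussion, here is one possible shape for the allocation-budget option. It sidesteps the mid-Score question by rejecting a call up front, based on a crude estimate. Only WithMaxScoreAllocBytes and ScorerOption come from this issue; sketchScorer, ScoreChecked, ErrAllocBudget and estimateAllocBytes are hypothetical, and the estimate counts only three int rows, so treat it as a sketch of "error" semantics rather than a proposal.

```go
package main

import (
	"errors"
	"fmt"
	"math/bits"
	"unicode/utf8"
)

// ErrAllocBudget is a hypothetical sentinel for a rejected call.
var ErrAllocBudget = errors.New("scorer: estimated allocation exceeds budget")

// sketchScorer stands in for the real Scorer; 0 means "no limit".
type sketchScorer struct {
	maxAllocBytes int
}

// ScorerOption configures a sketchScorer (functional-option pattern).
type ScorerOption func(*sketchScorer)

// WithMaxScoreAllocBytes sets the per-call allocation budget in bytes.
func WithMaxScoreAllocBytes(n int) ScorerOption {
	return func(s *sketchScorer) { s.maxAllocBytes = n }
}

func newSketchScorer(opts ...ScorerOption) *sketchScorer {
	s := &sketchScorer{}
	for _, o := range opts {
		o(s)
	}
	return s
}

// estimateAllocBytes is a deliberately crude upper bound: three int rows
// sized by the longer input. A real estimate would cover every algorithm.
func estimateAllocBytes(a, b string) int {
	n := utf8.RuneCountInString(a)
	if m := utf8.RuneCountInString(b); m > n {
		n = m
	}
	return 3 * (n + 1) * (bits.UintSize / 8)
}

// ScoreChecked refuses to start when the estimate exceeds the budget,
// so no work is ever abandoned halfway through.
func (s *sketchScorer) ScoreChecked(a, b string) (float64, error) {
	if s.maxAllocBytes > 0 && estimateAllocBytes(a, b) > s.maxAllocBytes {
		return 0, ErrAllocBudget
	}
	// Placeholder scoring: exact match only. The real algorithms would run here.
	if a == b {
		return 1, nil
	}
	return 0, nil
}

func main() {
	tight := newSketchScorer(WithMaxScoreAllocBytes(16))
	_, err := tight.ScoreChecked("hello", "hello")
	fmt.Println(err)

	unlimited := newSketchScorer()
	score, err := unlimited.ScoreChecked("hello", "hello")
	fmt.Println(score, err)
}
```

Adding a new method rather than changing the existing Score() signature keeps this shape backward-compatible, at the cost of a second entry point.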

Triggers for picking this up

Re-open this work when any of the following hold:

  • A real-world consumer reports measurable GC pressure under their workload.
  • Benchmarks show a GC-bound throughput regression that per-algorithm optimisations cannot address.
  • A use case emerges (e.g. streaming match against millions of candidates) where amortising tokenisation across algorithms would deliver a 2x+ throughput improvement.
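
The amortisation mentioned in the last trigger could look roughly like the sketch below: a context value created inside one Score() call builds the q-gram profiles lazily, once, and hands them to every algorithm that needs them. scoreCtx, buildProfile and scoreAll are hypothetical names. Because the context is a local value that never leaves the call, it is not shared between goroutines even when the Scorer is.

```go
package main

import "fmt"

// scoreCtx holds the inputs and lazily built q-gram profiles for ONE call.
type scoreCtx struct {
	a, b   string
	n      int
	pa, pb map[string]int
	builds int // instrumentation for this sketch only
}

// buildProfile returns a non-nil q-gram multiset for s.
func buildProfile(s string, n int) map[string]int {
	r := []rune(s)
	m := map[string]int{}
	for i := 0; i+n <= len(r); i++ {
		m[string(r[i:i+n])]++
	}
	return m
}

// profiles builds both profiles on first use and reuses them afterwards.
func (c *scoreCtx) profiles() (map[string]int, map[string]int) {
	if c.pa == nil {
		c.pa, c.pb = buildProfile(c.a, c.n), buildProfile(c.b, c.n)
		c.builds += 2
	}
	return c.pa, c.pb
}

// counts returns the multiset intersection size and both profile sizes.
func (c *scoreCtx) counts() (inter, sizeA, sizeB int) {
	pa, pb := c.profiles()
	for g, ca := range pa {
		sizeA += ca
		if cb := pb[g]; cb < ca {
			inter += cb
		} else {
			inter += ca
		}
	}
	for _, cb := range pb {
		sizeB += cb
	}
	return
}

func jaccard(c *scoreCtx) float64 {
	inter, sa, sb := c.counts()
	if sa+sb-inter == 0 {
		return 0
	}
	return float64(inter) / float64(sa+sb-inter)
}

func dice(c *scoreCtx) float64 {
	inter, sa, sb := c.counts()
	if sa+sb == 0 {
		return 0
	}
	return 2 * float64(inter) / float64(sa+sb)
}

// scoreAll creates a fresh context per call, so nothing outlives the call.
func scoreAll(a, b string) (j, d float64, builds int) {
	c := &scoreCtx{a: a, b: b, n: 2}
	j, d = jaccard(c), dice(c)
	return j, d, c.builds
}

func main() {
	j, d, builds := scoreAll("night", "nacht")
	fmt.Printf("jaccard=%.3f dice=%.3f builds=%d\n", j, d, builds)
}
```

In this sketch the two algorithms share two profile builds instead of performing four. Whether that translates into the 2x+ throughput named above would need to be measured.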

Acceptance criteria when picked up

  • Design doc covering both options (allocation budget vs tokenisation cache) with explicit semantics for each.
  • Benchmark suite demonstrating measurable GC reduction and per-Score latency impact.
  • API addition that is backward-compatible — no breaking change to the existing Scorer surface.
  • Documentation update in docs/algorithms.md#performance-characteristics.
  • New BDD scenarios covering the chosen option.
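
A benchmark for the second criterion could be structured along these lines. This sketch compares a workload that allocates a fresh buffer per call against one that reuses a buffer, using the standard library's testing.Benchmark and BenchmarkResult.AllocedBytesPerOp. The workloads (fillFresh, fillReuse) are hypothetical stand-ins for "Score before" and "Score after", not the real scorer.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps results reachable so the compiler cannot elide the allocation.
var sink []int

// fillFresh allocates a new buffer on every call (stand-in for current behaviour).
func fillFresh(n int) []int {
	buf := make([]int, n)
	for i := range buf {
		buf[i] = i
	}
	return buf
}

// fillReuse writes into a caller-supplied buffer (stand-in for an optimised path).
func fillReuse(buf []int) []int {
	for i := range buf {
		buf[i] = i
	}
	return buf
}

// measure returns heap bytes allocated per operation for both workloads.
func measure() (freshBytes, reuseBytes int64) {
	testing.Init() // needed when calling testing.Benchmark outside "go test"
	const n = 4096

	fresh := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink = fillFresh(n)
		}
	})

	reuse := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		buf := make([]int, n)
		b.ResetTimer() // exclude the one-off setup allocation
		for i := 0; i < b.N; i++ {
			sink = fillReuse(buf)
		}
	})

	return fresh.AllocedBytesPerOp(), reuse.AllocedBytesPerOp()
}

func main() {
	f, r := measure()
	fmt.Printf("fresh: %d B/op, reuse: %d B/op\n", f, r)
}
```

In a real suite the same comparison would normally live in _test.go files as BenchmarkXxx functions and be run with go test -bench . -benchmem, which also reports per-operation latency.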

References

  • L4237 in REVIEW-FINDINGS.md (Phase 8 review, 2026-05-14)
  • Phase 8.5 discuss-phase decision Q14a (2026-05-17)

Metadata

    Labels

    deferred-v1.x: Deferred from v1.0 scope; pick up in v1.x when triggers in the issue body fire
    enhancement: New feature or request
    performance: Performance, allocation budgets, benchmark regression
