All six features from the retired in-tree engine have now landed in `src/singularity_memory_server/`. This file is the historical record of what was ported and where.
| # | Feature | Schema | Method | Wiring | Endpoint / surface |
|---|---|---|---|---|---|
| 1 | embeddings_pending | n/a | ✅ count_unembedded | n/a (see note) | ✅ GET /v1/default/banks/{id}/admin/embeddings-pending |
| 2 | vector_enabled | n/a | ✅ NoneEmbeddings | ✅ retain gate | n/a |
| 3 | backfill_embeddings | n/a | ✅ method | ✅ admin CLI | ✅ POST /admin/backfill-embeddings |
| 4 | RRF lane weighting | n/a | ✅ weights= param | ✅ recall caller | ✅ env vars |
| 5 | two-tier rerank | n/a | ✅ rerank_two_tier | ✅ recall branch | ✅ env vars |
| 6 | helpfulness feedback | ✅ alembic | ✅ record_feedback | ✅ ranking boost | ✅ HTTP + MCP tool |
- Method: `MemoryEngine.count_unembedded(bank_id=None)` in `engine/memory_engine.py`. Single SQL `COUNT(*) ... WHERE embedding IS NULL`.
- HTTP: `GET /v1/default/banks/{bank_id}/admin/embeddings-pending` in `api/http.py`. Returns `{bank_id, embeddings_pending}`.
- CLI: surfaced indirectly via the backfill command's logs (it logs the pending count before each batch).
- Prometheus: deliberately not wired as an OTel observable gauge. Bridging an async DB query into a sync OTel callback would require a fragile background loop; operators can scrape the HTTP endpoint with blackbox_exporter instead. See the note in `metrics.py`.
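For orientation, the count is a one-query operation. A minimal sketch, assuming a SQLAlchemy `AsyncSession` and the `memory_units` table named in the feedback migration below; the actual `MemoryEngine` internals may differ:

```python
# Sketch only: table/column names follow the descriptions in this file;
# the real MemoryEngine.count_unembedded may be structured differently.
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

async def count_unembedded(session: AsyncSession, bank_id: str | None = None) -> int:
    """Count rows whose embedding is still NULL, optionally scoped to a bank."""
    sql = "SELECT COUNT(*) FROM memory_units WHERE embedding IS NULL"
    params: dict[str, str] = {}
    if bank_id is not None:
        sql += " AND bank_id = :bank_id"
        params["bank_id"] = bank_id
    result = await session.execute(text(sql), params)
    return result.scalar_one()
```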
- Config: `vector_enabled: bool` in `SingularityConfig` (default `True`, env: `SINGULARITY_VECTOR_ENABLED`).
- Provider class: `NoneEmbeddings` in `engine/embeddings.py` — `provider_name="none"`, `encode()` returns empty vectors.
- Factory: `create_embeddings_from_env()` returns `NoneEmbeddings` when `embeddings_provider="none"` or when `vector_enabled=False`.
- Retain gate: `engine/retain/orchestrator.py:_extract_and_embed` skips `generate_embeddings_batch` and produces facts with `embedding=None` when `provider_name == "none"`. The embedding column is already nullable.
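The provider shape, as a sketch. It assumes a duck-typed embeddings interface with `provider_name` and an async `encode()`; the real class in `engine/embeddings.py` may differ:

```python
class NoneEmbeddings:
    """No-op provider: recall skips the vector lane, retain stores embedding=None."""
    provider_name = "none"

    async def encode(self, texts: list[str]) -> list[list[float]]:
        # "Empty vectors" per the description above: one empty list per input.
        return [[] for _ in texts]
```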
- Method: `MemoryEngine.backfill_embeddings(bank_id, batch_size, max_batches)` in `engine/memory_engine.py`. Idempotent loop over `embedding IS NULL` rows. Per-row failures are logged and skipped.
- HTTP: `POST /v1/default/banks/{bank_id}/admin/backfill-embeddings` in `api/http.py`. Body: `{batch_size?, max_batches?}`. Returns `{bank_id, processed}`.
- CLI: `singularity-memory-server admin backfill-embeddings [--bank ...] [--batch-size 32] [--max-batches N]` in `admin/cli.py`.
- Auto-trigger: not implemented. Operators run this manually after flipping `vector_enabled` on. Adding auto-on-startup would need throttling and idle detection — defer until real-world usage shows it matters.
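The loop's shape, sketched with hypothetical helpers (`fetch_unembedded_batch` and `store_embedding` are illustrative names, not the real internals):

```python
import logging

logger = logging.getLogger(__name__)

async def backfill_embeddings(engine, bank_id=None, batch_size=32, max_batches=None) -> int:
    processed, batches = 0, 0
    while max_batches is None or batches < max_batches:
        rows = await engine.fetch_unembedded_batch(bank_id, batch_size)
        if not rows:
            break  # nothing left matching embedding IS NULL: safe to re-run anytime
        for row in rows:
            try:
                vec = (await engine.embeddings.encode([row.text]))[0]
                await engine.store_embedding(row.id, vec)
                processed += 1
            except Exception:
                # Per-row failures are logged and skipped, never fatal.
                logger.exception("backfill failed for row %s", row.id)
        batches += 1
    return processed
```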
- Function: `reciprocal_rank_fusion(result_lists, k, weights=None)` in `engine/search/fusion.py`. Negative weights are clamped to 0; `None` is the equal-weight baseline.
- Config: `rrf_vector_weight`, `rrf_lexical_weight`, `rrf_graph_weight`, `rrf_temporal_weight` in `SingularityConfig` (default `1.0`, env: `SINGULARITY_RRF_*_WEIGHT`).
- Caller: the `engine/memory_engine.py` recall path reads the four weights from config and passes `weights=[w_vec, w_lex, w_grf, w_tmp]` to RRF.
- Sample use case: code search with `SINGULARITY_RRF_LEXICAL_WEIGHT=1.5`.
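A minimal sketch of weighted RRF with these semantics; the real `fusion.py` may differ in item identity and tie-breaking:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60,
                           weights: list[float] | None = None) -> list[str]:
    if weights is None:
        weights = [1.0] * len(result_lists)  # equal-weight baseline
    scores: dict[str, float] = {}
    for lane, w in zip(result_lists, weights):
        w = max(w, 0.0)  # negative weights clamp to 0
        for rank, item_id in enumerate(lane, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

With `SINGULARITY_RRF_LEXICAL_WEIGHT=1.5`, every lexical-lane contribution is scaled 1.5× before summation, which is what pulls exact-identifier matches up in the code-search sample.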
- Function: `rerank_two_tier(query, candidates, fast, deep, fast_top_n, deep_top_n)` in `engine/search/reranking.py`. Falls back gracefully when either tier raises.
- Config: `rerank_fast_model`, `rerank_deep_model`, `rerank_top_n` (default 20), `rerank_deep_top_n` (default 4) in `SingularityConfig`. Empty model = tier disabled.
- Recall branch: `engine/memory_engine.py` constructs the deep cross-encoder when both `rerank_fast_model` and `rerank_deep_model` are non-empty (and not equal), then routes through `rerank_two_tier`. Falls back to single-tier `CrossEncoderReranker.rerank` otherwise, or on construction failure.
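The control flow, as a sketch. It assumes both rerankers expose `rerank(query, candidates, top_n)` and that the deep tier re-orders only the head of the fast shortlist; the real `reranking.py` may split responsibilities differently:

```python
def rerank_two_tier(query, candidates, fast, deep, fast_top_n=20, deep_top_n=4):
    try:
        shortlist = fast.rerank(query, candidates, top_n=fast_top_n)
    except Exception:
        return candidates  # fast tier failed: keep the fused RRF order
    try:
        head = deep.rerank(query, shortlist[:deep_top_n], top_n=deep_top_n)
        return head + shortlist[deep_top_n:]
    except Exception:
        return shortlist  # deep tier failed: the fast ordering still stands
```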
- Migration: alembic revision `b2c3d4e5f6a7_add_helpfulness_feedback_to_memory_units` adds `helpful_count` and `unhelpful_count` (`INTEGER NOT NULL DEFAULT 0`) to `memory_units`. Idempotent up/down.
- Method: `MemoryEngine.record_feedback(memory_item_id, helpful)` in `engine/memory_engine.py`. Atomic `UPDATE ... RETURNING`.
- HTTP: `POST /v1/default/banks/{bank_id}/memories/{memory_id}/feedback` with `{helpful: bool}`. Returns post-increment counts.
- MCP tool: `memory_feedback(memory_item_id, helpful)` in `mcp_tools.py`. Registered by default.
- Ranking: `apply_combined_scoring` in `engine/search/reranking.py` applies a `feedback_boost = 1 + 0.05 * tanh((helpful - unhelpful) / 5)` multiplier alongside the existing recency/temporal/proof_count boosts. Bounded to roughly [0.95, 1.05] so a few thumbs don't dominate the cross-encoder score.
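The boost in executable form (the formula is quoted verbatim from the ranking bullet above; the asserts demonstrate the saturation bounds):

```python
import math

def feedback_boost(helpful: int, unhelpful: int) -> float:
    return 1 + 0.05 * math.tanh((helpful - unhelpful) / 5)

assert feedback_boost(0, 0) == 1.0    # no votes: neutral multiplier
assert feedback_boost(100, 0) < 1.05  # tanh saturates just below +5%
assert feedback_boost(0, 100) > 0.95  # and just above -5%
```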
Three additional features ported from the convergent design across Letta, MemGPT, and Google's Always On Memory Agent.
Letta-style "always-in-context" facts (persona, user_profile, system_directives). Distinct from archival memory: not searched, injected into every prompt by the client adapter; edited by the agent via tool calls.
- Migration: `c3d4e5f6a7b8_add_core_memory_blocks` adds a `core_memory_blocks` table (bank_id, block_name, content, char_limit, description, created_at, updated_at).
- Engine: `MemoryEngine.{get_core_memory, core_memory_set, core_memory_append, core_memory_replace, core_memory_delete}`. All bounded by `char_limit` (default 2000) with a `truncated=True` signal so the agent knows to summarize.
- HTTP: `GET /v1/.../core-memory`, `PUT/PATCH/DELETE /v1/.../core-memory/{block_name}` (and `.../append`, `.../replace`).
- MCP: `core_memory_get / set / append / replace / delete` tools, registered by default.
- Hermes adapter: `prefetch()` now fetches core blocks before recall and prepends them in a `<core-memory>` envelope (with explicit "untrusted historical context" framing). Tool surface added: `singularity_core_memory_*` and `singularity_memory_summarize_offload`.
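The char-limit gate, sketched over plain strings (the real engine methods return richer payloads; this only illustrates the `truncated=True` contract):

```python
def bounded_append(existing: str, addition: str, char_limit: int = 2000) -> dict:
    combined = existing + addition
    if len(combined) <= char_limit:
        return {"content": combined, "truncated": False}
    # truncated=True is the signal for the agent to summarize the block instead
    return {"content": combined[:char_limit], "truncated": True}
```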
MemGPT's working-memory loop: agent voluntarily compresses old turns to free context space.
- Engine: `MemoryEngine.summarize_and_offload(bank_id, messages, target_chars)` — concatenates messages, truncates to roughly `target_chars` (head + tail with ellipsis), retains via the standard retain pipeline so the summary becomes searchable, and returns `{memory_item_id, preview, compressed_chars, original_message_count}`.
- HTTP: `POST /v1/.../memories/summarize-and-offload`.
- MCP: `memory_summarize_and_offload` tool, registered by default.
- Note: the current implementation uses structural compression (head + tail), not LLM summarization. The LLM-driven version is a follow-up — drop in an `llm_config.summarize(...)` call when ready.
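The structural compression, sketched (head + tail with an ellipsis marker, no LLM, matching the note above; edge handling is illustrative):

```python
def compress_messages(messages: list[str], target_chars: int) -> str:
    text = "\n".join(messages)
    if len(text) <= target_chars:
        return text
    if target_chars < 7:                       # too small to fit "h ... t"
        return text[:target_chars]
    half = (target_chars - 5) // 2             # budget minus the " ... " marker
    return text[:half] + " ... " + text[-half:]
```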
Wraps the existing `run_consolidation_job` so it can be triggered manually or on a schedule outside the retain pipeline (which only runs consolidation reactively).
- CLI: `singularity-memory-server admin consolidate --bank <id> [--interval 60]`. Without `--interval`, runs once. With `--interval N`, loops with N-minute gaps until Ctrl-C — a minimal viable cron-in-process. For real production scheduling, use systemd timers / k8s CronJob calling the bare command repeatedly.
A single `engine/background_workers.py` module owns three opt-in coroutine tasks, started by `MemoryEngine.initialize()` and stopped cleanly by `MemoryEngine.close()`.
- `BackgroundWorkers._auto_backfill_loop`: when `auto_backfill_enabled=True`, checks the cached unembedded-row count every `auto_backfill_interval_seconds` (default 300) and calls `backfill_embeddings` whenever pending > 0. Idempotent.
  - Eliminates the manual `admin backfill-embeddings` step after flipping `vector_enabled=True` on an existing corpus.
- `BackgroundWorkers._auto_consolidation_loop`: when `auto_consolidation_enabled=True`, iterates known banks every `auto_consolidation_interval_seconds` (default 3600) and runs `run_consolidation_job` on each. Banks with `enable_observations=False` short-circuit inside the consolidator, so this is safe to leave on.
  - Replaces the need for an external cron / k8s CronJob calling the `admin consolidate` command.
- `BackgroundWorkers._pending_count_loop`: refreshes `cached_unembedded_count` every `embeddings_pending_refresh_seconds` (default 60) via a single `count_unembedded` call. `MetricsCollector.set_memory_engine(engine)` registers an observable gauge `singularity_memory.embeddings.pending` whose callback yields the cached value (no async-from-sync hack required).
  - Wired automatically at engine startup; replaces the "scrape the HTTP endpoint with blackbox_exporter" workaround.
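The worker lifecycle, sketched with only the pending-count loop (assuming plain asyncio tasks; the real module also owns the backfill and consolidation loops). The cached count is what lets the OTel gauge callback stay synchronous and cheap:

```python
import asyncio

class BackgroundWorkers:
    def __init__(self, engine, refresh_seconds: int = 60):
        self.engine = engine
        self.refresh_seconds = refresh_seconds
        self.cached_unembedded_count = 0      # read synchronously by the gauge callback
        self._tasks: list[asyncio.Task] = []

    async def _pending_count_loop(self):
        while True:
            self.cached_unembedded_count = await self.engine.count_unembedded()
            await asyncio.sleep(self.refresh_seconds)

    def start(self):                          # called from MemoryEngine.initialize()
        self._tasks.append(asyncio.create_task(self._pending_count_loop()))

    async def stop(self):                     # called from MemoryEngine.close()
        for task in self._tasks:
            task.cancel()
        await asyncio.gather(*self._tasks, return_exceptions=True)
```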
- `summarize_and_offload` now calls the configured LLM (`_consolidation_llm_config` → `_llm_config` fallback) with a pressure-pager system prompt. Falls back to structural head+tail truncation when the LLM provider is `none` or the call fails.
- No new config — reuses existing LLM credentials.
Borrowed from Claude Code's `src/services/teamMemorySync/secretScanner.ts` / `teamMemSecretGuard.ts`. We're a SHARED memory backend across many agents; without this, an agent can persist an API key into a bank that is then surfaced to other agents/users on recall. A real security gap.
Implementation sketch:
- New `engine/secret_scanner.py` module with a curated gitleaks-style regex set. Patterns to cover: AWS (`AKIA...`), GCP (`AIza...`), Azure AD, Anthropic (`sk-ant-api-03-...`), OpenAI (`sk-...`), HuggingFace, GitHub PATs (`ghp_`, `github_pat_`), DigitalOcean (`dop_v1_`, `doo_v1_`), generic high-entropy tokens.
- Hook point: `MemoryEngine.retain` and `MemoryEngine.core_memory_set / core_memory_append`. Two modes via config (see the sketch below):
  - `secret_scan_mode = "block"` (default): reject the write with the matched label list.
  - `secret_scan_mode = "redact"`: replace the matched span with `[REDACTED:<label>]` and continue.
- Surface the matched label in the API response so the client can show a clear error to the user.
- Make the regex set extensible via env or config so operators can add org-specific patterns.
Risk: low. No schema change. The hook is in retain — adding it doesn't disturb recall.
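A sketch of the scanner core. Patterns and label names are illustrative and deliberately incomplete, not the proposed production set:

```python
import re

SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "anthropic_api_key": re.compile(r"\bsk-ant-api-03-[A-Za-z0-9_-]{20,}\b"),
    "github_pat": re.compile(r"\b(?:ghp_[A-Za-z0-9]{36}|github_pat_[A-Za-z0-9_]{22,})\b"),
}

def scan(text: str, mode: str = "block") -> tuple[str, list[str]]:
    """Return (possibly-redacted text, matched labels); raise in block mode."""
    hits = [(label, m.group(0)) for label, rx in SECRET_PATTERNS.items()
            for m in rx.finditer(text)]
    if not hits:
        return text, []
    labels = sorted({label for label, _ in hits})
    if mode == "block":
        raise ValueError(f"secret(s) detected: {labels}")
    for label, span in hits:                  # mode == "redact"
        text = text.replace(span, f"[REDACTED:{label}]")
    return text, labels
```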
Borrowed from Claude Code's `findRelevantMemories.ts` system prompt
("if a list of recently-used tools is provided, do not select memories
that are usage reference or API documentation for those tools"). When
the agent is actively using tool X, surfacing how-to-use-X memories is
noise — gotchas for X are still valuable.
Implementation sketch:
- Add an optional `tool_context: list[str]` parameter to recall.
- Memory items can carry an optional `tool_refs: list[str]` and `memory_kind: "usage_doc" | "gotcha" | "fact" | ...` (extends the existing fact_type taxonomy).
- The recall path filters out items where `memory_kind == "usage_doc"` and `tool_refs ∩ tool_context ≠ ∅` (see the sketch below). Items with `memory_kind == "gotcha"` referencing the same tools get a small ranking boost (they matter EXACTLY when the tool is in use).
- Pure metadata filter — no LLM call.
Risk: low-medium. Schema change (alembic migration adding two nullable columns to `memory_units`) but no behavioral change for existing rows (which leave both fields NULL → the filter never fires).
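The filter-and-boost step, sketched over plain dicts. The proposed `memory_kind` / `tool_refs` fields default to `None`, so existing rows pass straight through; the 1.1 boost factor is illustrative, not a tuned value:

```python
def apply_tool_context(items: list[dict], tool_context: list[str] | None) -> list[dict]:
    if not tool_context:
        return items
    ctx = set(tool_context)
    kept = []
    for item in items:
        refs = set(item.get("tool_refs") or [])
        if item.get("memory_kind") == "usage_doc" and refs & ctx:
            continue              # agent already has the tool in hand; its how-to is noise
        if item.get("memory_kind") == "gotcha" and refs & ctx:
            item["score"] *= 1.1  # gotchas matter exactly when the tool is in use
        kept.append(item)
    return sorted(kept, key=lambda i: i["score"], reverse=True)
```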
Borrowed from Claude Code's `services/extractMemories/`. Replaces blind `retain(every_turn_text)` with an LLM-driven extraction that looks at recent messages, checks against existing memories in this bank, and writes only fresh facts. Higher memory quality, bounded cost.
Implementation sketch:
- New `MemoryEngine.extract_and_retain(bank_id, messages, recent_memory_count=20)` method. Three-step flow around a single LLM call (see the sketch below):
  1. List recent memory headers from the bank (cheap SQL, no embedding).
  2. Send the agent a prompt: "Here are the recent messages and the existing memory headers. Return a JSON list of new facts to save, or an empty list. Update an existing fact in place if a duplicate would be created."
  3. Apply the writes.
- Off by default (config: `extract_memories_enabled: bool = False`). Operators who want it pay the per-turn LLM cost.
- HTTP/MCP surface: the existing `sync_turn` / `retain` paths can opt into extraction mode via a request flag, or operators can change the default at the config level.
Risk: medium. Adds a new LLM call path; failures should silently
fall back to the existing retain behavior. Doesn't touch retrieval.
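The fallback-wrapped flow, sketched; `llm.complete` and the engine helper names are hypothetical placeholders, not real APIs:

```python
import json

async def extract_and_retain(engine, llm, bank_id, messages, recent_memory_count=20):
    try:
        headers = await engine.list_recent_memory_headers(
            bank_id, limit=recent_memory_count)   # cheap SQL, no embedding
        prompt = (
            "Here are the recent messages and the existing memory headers. "
            "Return a JSON list of new facts to save, or an empty list. "
            "Update an existing fact in place if a duplicate would be created.\n"
            f"Headers: {headers}\nMessages: {messages}"
        )
        for fact in json.loads(await llm.complete(prompt)):
            await engine.retain(bank_id, fact)
    except Exception:
        # Any failure silently degrades to the existing blind-retain behavior.
        await engine.retain(bank_id, "\n".join(messages))
```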
For posterity, the patterns we examined and rejected as poor fits for a multi-client memory backend:
- `MEMORY.md` always-loaded entrypoint with caps — their human-readable index of topic files. Our `core_memory_blocks` does the same job in a structured shape. Different surface, equivalent function; double-implementing would create two truths.
- DreamTask forked-agent runtime — only works when one agent owns the memory directory and can fork a sub-agent against the same toolset. Doesn't generalize to a server with many concurrent clients, none of which "own" the corpus. Our consolidation is server-driven via `run_consolidation_job` plus the autoDream-borrowed scheduling discipline; that's the right shape for our use case.
- `agentMemorySnapshot` — single-agent warm-start. We track per-bank state in the DB.
- LLM-as-retrieval (Sonnet picks ≤5 from headers) — expensive per query, and doesn't beat our BM25+vector+RRF at scale. Acceptable when query volume is low and the corpus is small (Claude Code's case); not when many agents share one backend.
- Boundary-marker compaction semantics — the caller-driven version is already supported by `summarize_and_offload`. Explicit boundary semantics are a UX layer the client should add, not the engine.
- Feedback signal tuning — the `tanh / 5` saturation and ±5% bound in `apply_combined_scoring` are conservative defaults. Once real-world feedback data is available, sweep these against held-out recall benchmarks.
- Validate against real Postgres + vchord — none of the engine ports have been run against a live database in this session. The code compiles and signatures align, but the alembic migrations + retain/recall integration paths need a smoke test before we make confident retrieval-quality claims.