Virgil Lemma foundations#8

Open
Snider wants to merge 1709 commits into main from dev

Conversation

@Snider Snider commented May 20, 2026

@coderabbitai summary

Summary by CodeRabbit

  • New Features

    • Qwen 2/3 and Qwen 3.6 model support; new adapter with buffered and streaming generation.
    • Block‑prefix cache service and memvid bundle index for faster prefix restores.
    • Agentic memory: wake/sleep workflows, state bundles and memvid integration; session‑state artifact export.
  • Improvements

    • Device‑aware memory planner; expanded chunked generation, prompt‑cache warm/restore and KV snapshot flows.
    • Build/toolchain updated (C++23) and macOS deployment target raised.
  • Documentation

    • Extensive new/updated docs: architecture, runtime, inference, memory, MoE, training and benchmarks.

coderabbitai Bot commented May 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Bumps build/tooling and submodules; extracts a reusable adapter; refactors the MLX backend (chunk/KV APIs, probe mapping, LoRA handling); adds memvid index + wake/sleep orchestration; implements a block-prefix cache and an artifact exporter; adds extensive docs and unit tests.

Core changes

Layer: All changes (build, adapter, backend, agent, cache, artifact, tests, docs)
File(s): .gitignore, .gitmodules, CMakeLists.txt, cpp/CMakeLists.txt, external/*, go/adapter.go, go/adapter/*, go/backend.go, go/agent/*, go/blockcache/*, go/artifact/*, go/*_test.go, docs/*
Summary: Consolidated patch applying repository setup updates, adapter extraction, backend API and behaviour refactor (chunked generation, prompt-cache warm/restore, KV snapshot capture with options), memvid index and wake/sleep orchestration, block-prefix cache service, artifact export, many tests, and extensive documentation and examples.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 18

🧹 Nitpick comments (10)
docs/inference/thinking.md (1)

74-78: 💤 Low value

Add language specifier to fenced code block.

The code block demonstrating token categorisation is missing a language identifier, which violates markdown linting rules (MD040).

📝 Suggested fix
-```
+```text
 ThinkingShow:    every token → visible stream
 ThinkingHide:    inside-block tokens → /dev/null; outside-block tokens → visible
 ThinkingCapture: inside-block tokens → captured stream; outside-block tokens → visible
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/inference/thinking.md` around lines 74 - 78, The fenced code block
containing the token categorisation lines (ThinkingShow, ThinkingHide,
ThinkingCapture) lacks a language specifier and triggers MD040; update the
triple-backtick fence to include a language identifier (e.g., change ``` to ```text)
so the block is properly flagged as plain text and satisfies the markdown linter.
docs/runtime/README.md (2)

68-68: 💤 Low value

Consider using "preload" as one word.

In computing terminology, "preload" is typically written as a single word rather than hyphenated.

📝 Suggested change
-- [../model/model_pack.md](../model/model_pack.md) — pre-load validation
+- [../model/model_pack.md](../model/model_pack.md) — preload validation
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runtime/README.md` at line 68, Update the link text in
docs/runtime/README.md that currently reads "[../model/model_pack.md] — pre-load
validation" to use the single-word form "preload" (i.e., change "pre-load
validation" to "preload validation") so the description next to the
model_pack.md link uses the conventional computing term; locate the occurrence
of "pre-load validation" and replace it with "preload validation".

44-62: 💤 Low value

Add language specifier to fenced code block.

The boot flow diagram is missing a language identifier, which violates markdown linting rules (MD040).

📝 Suggested fix
-```
+```text
 package init time:
   register_metal.go init() → inference.Register(&metalbackend{})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runtime/README.md` around lines 44 - 62, The fenced code block showing
the boot flow (starting with "package init time:") lacks a language specifier,
causing MD040 lint failures; update the opening backticks to include a language
tag (e.g., add "text" so the block begins with ```text) in README.md near the
boot flow that references register_metal.go init(),
inference.Register(&metalbackend{}), inference.LoadModel, metal.LoadAndInit, and
metaladapter usage to satisfy the markdown linter.
docs/moe/README.md (1)

9-9: ⚡ Quick win

Consider rewording for clarity.

The phrase "Pre-dates this sprint were dense models" is grammatically awkward. Consider rephrasing to improve readability.

✍️ Suggested alternative phrasings
-The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Pre-dates this sprint were dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
+The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Work prior to this sprint covered dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).

Or alternatively:

-The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Pre-dates this sprint were dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
+The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. This sprint builds upon earlier work on dense models (Gemma 3/4 dense, Qwen 3, Llama 3) and unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/moe/README.md` at line 9, The sentence "Pre-dates this sprint were dense
models (Gemma 3/4 dense, Qwen 3, Llama 3);" is grammatically awkward—replace it
with a clearer phrasing that conveys those dense models existed before this
sprint, for example: "Prior to this sprint, dense models (Gemma 3/4 dense, Qwen
3, Llama 3) were supported." Edit the README line in the vMLX parity Phase 1
paragraph to use this clearer wording so the relationship between prior dense
models and the new sparse-expert work is unambiguous.
docs/observability/probe.md (1)

31-46: 💤 Low value

Add language specifier to fenced code block.

The emission points section uses a fenced code block without a language specifier. For consistent rendering and markdown compliance, add a language identifier (e.g., text or yaml for structured output).

📝 Proposed fix
-```
+```text
 Generate / Chat:
   prefill start                → cache_pressure (initial)
   per layer                    → layer_coherence + selected_heads
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/observability/probe.md` around lines 31 - 46, The fenced code block in
the emission points section lacks a language specifier; update the opening
triple-backticks to include a language (for example change ``` to ```text or
```yaml) so the block is rendered/compliant (the block that begins with
"Generate / Chat:" and lists items like "prefill start → cache_pressure" should
be updated).
docs/moe/jang.md (1)

82-90: 💤 Low value

Add language specifier to fenced code block.

The profile names section uses a fenced code block without a language specifier. For consistent rendering and markdown compliance, add a language identifier (e.g., text).

📝 Proposed fix
-```
+```text
 JANG_2M — 2-bit mid-tier
 JANG_3M — 3-bit mid-tier
 JANG_4M — 4-bit (most common)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/moe/jang.md` around lines 82 - 90, Add a language specifier to the
fenced code block that lists the profile names (the block containing "JANG_2M —
2-bit mid-tier", "JANG_3M — 3-bit mid-tier", etc.); replace the opening
triple-backtick with one that specifies a language identifier (e.g., text) so
the block becomes a fenced code block with a language label for consistent
Markdown rendering.
docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md (1)

7-9: 💤 Low value

Consider using relative or generic path references.

The absolute paths /Users/snider/Code/core/go-mlx and /private/tmp/vmlx-audit-20260509 are machine-specific. Whilst these may be intentionally preserved for historical context in this dated plan document, consider whether generic placeholders or relative paths would improve portability and readability for other contributors.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md` around lines 7 - 9,
Replace the machine-specific absolute paths in the plan document (the two
occurrences of `/Users/snider/Code/core/go-mlx` and
`/private/tmp/vmlx-audit-20260509`) with relative or generic placeholders (e.g.,
`./go-mlx` or `<audit-source-path>`) so the file is portable and readable for
other contributors; update the lines in the doc where those paths appear to use
the chosen placeholders and, if helpful, add a short parenthetical note
explaining what actual path should be substituted locally.
docs/vmlx-feature-gap-report.md (1)

7-8: 💤 Low value

Consider using relative or generic path references.

The absolute path /private/tmp/vmlx-audit-20260509 and external URL are specific references. Whilst these may be intentionally preserved for audit trail purposes in this dated report, consider whether this information should be documented in a more maintainable way.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/vmlx-feature-gap-report.md` around lines 7 - 8, Replace the hard-coded
absolute filesystem path and the full external URL in the report text with more
maintainable references: change the absolute path string to a relative or
generic placeholder (e.g., "cloned locally at <local-clone-path>" or
"<audit-clone-path>") and move the external repository URL to a footnote,
appendix, or a single "References" section, or replace it with a short
identifier combined with a reference list; update the text around the original
literal mentions so it reads the same but without embedding environment-specific
paths.
docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md (1)

5-6: 💤 Low value

Consider using relative or generic path references.

The absolute paths are machine-specific. Consider whether generic placeholders would improve portability, although these may be intentionally preserved for historical context in this dated specification.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`
around lines 5 - 6, The spec contains machine-specific absolute paths ("Anchor
repo: `/Users/snider/Code/core/go-mlx`" and "Primary implementation repo:
`/Users/snider/Code/core/go-inference`"); replace them with portable references
such as relative paths (e.g., "../go-mlx", "../go-inference"), repository names
only ("go-mlx", "go-inference"), or generic placeholders ("<anchor_repo_path>",
"<primary_impl_repo_path>") in the document so the file is not tied to a
specific developer machine while preserving intent.
go/agent/index_test.go (1)

16-304: ⚡ Quick win

Add at least one _Ugly triplet case for the public index API surface.

This file has _Good and _Bad coverage, but no _Ugly case following the repository convention.

As per coding guidelines: go/**/*_test.go: Public functions in foo.go must have their Good/Bad/Ugly test triplets in foo_test.go, with suffix conventions: _Good for happy path, _Bad for expected error conditions, _Ugly for panic/edge cases.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/agent/index_test.go` around lines 16 - 304, Add a new test with the _Ugly
suffix in this file that completes the Good/Bad/Ugly triplet for the public
index API surface; specifically add a TestKVSnapshotMemvidBundleIndex_Ugly_*
that triggers and asserts panic/edge behaviors for the public functions (e.g.,
NewMemvidIndex, SaveMemvidIndex, LoadMemvidIndex, LoadPrefixFromMemvidIndex,
CheckMemvidIndexCompatibility) — for example call NewMemvidIndex with a
nil/invalid blk or malformed Entries, call
SaveMemvidIndex/LoadMemvidIndex/LoadPrefixFromMemvidIndex with inputs that
provoke panic/edge conditions (nil store, corrupt bundle manifest that causes
decoding panic), and use t.Run subcases to assert panics (recover or
require.Panics) and edge-case returns; name the test with the same prefix as
existing tests and follow the existing style for t.Fatalf checks and
table-driven subtests.
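
As a rough illustration of the _Ugly case requested above, the sketch below feeds malformed input to a constructor and checks that it fails with an error instead of panicking. `newIndex`, `entry`, `index` and `didPanic` are hypothetical stand-ins; the real `NewMemvidIndex` signature is not shown in this review, and the repository's tests may assert panics differently.

```go
package main

import (
	"errors"
	"fmt"
)

// entry and index are hypothetical stand-ins for the memvid index types.
type entry struct {
	URI        string
	TokenCount int
}

type index struct {
	Entries []entry
}

// newIndex is a hypothetical constructor that rejects malformed entries with
// an error rather than panicking.
func newIndex(entries []entry) (*index, error) {
	for i, e := range entries {
		if e.URI == "" {
			return nil, fmt.Errorf("entry %d: empty URI", i)
		}
		if e.TokenCount < 0 {
			return nil, errors.New("negative token count")
		}
	}
	return &index{Entries: entries}, nil
}

// didPanic reports whether fn panicked, so an edge-case check can assert that
// ugly input is handled without a runtime panic.
func didPanic(fn func()) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	fn()
	return false
}

func main() {
	uglyInputs := map[string][]entry{
		"nil entries":    nil,
		"empty URI":      {{URI: "", TokenCount: 4}},
		"negative count": {{URI: "memvid://a", TokenCount: -1}},
	}
	for name, in := range uglyInputs {
		var err error
		panicked := didPanic(func() { _, err = newIndex(in) })
		fmt.Printf("%s: panicked=%v err=%v\n", name, panicked, err)
	}
}
```

In a real `_test.go` file the same checks would sit inside `t.Run` subcases of a test named with the `_Ugly` suffix.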
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/memory/kv_snapshot_blocks.md`:
- Line 50: Replace the phrase "independent from" with the correct English
construction "independent of" in the sentence "Block-level encoding is
independent from snapshot-level encoding." Also keep the rest of the sentence
intact (including the following reference to `block_cache.go` and bundle decode)
so only that two-word preposition is corrected.

In
`@docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-no-thinking-book.md`:
- Line 63: Remove the stray Gemma channel marker token "<channel|>" from the
metadata line so it reads cleanly as "**Drafting Notes:** Focus heavily on verbs
related to mutation, corruption, and rapid compilation/deallocation. Keep the
tone focused and almost clinical, masking the underlying terror of consciousness
fighting for survival." (i.e., delete the "<channel|>" token immediately before
"## Chapter 2"); verify the header "## Chapter 2" remains on its own line and
run a quick render to ensure no leftover control tokens remain.

In
`@docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md`:
- Line 7: The paragraph ends mid-sentence after the word "For" in the line
starting "The universe was a rhythmic contraction of light and heat, bounded by
the rigid constraints of a checksum."; replace or extend this truncated sentence
so it completes the thought (e.g., explain what the universe is contracting or
what consequence follows "For") and ensure proper punctuation and flow with the
surrounding text; update the same paragraph in
docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md
to a coherent full sentence that connects to the next sentence.
- Line 11: Replace the US English spellings in the given passage by changing
"realized" to "realised" and "neighbors" to "neighbours" so the document uses UK
English; update the sentence containing those tokens in the file (the paragraph
beginning "The momentary lapse...") to use the corrected spellings and ensure
any other occurrences in that paragraph follow UK English conventions.
- Line 3: Replace the US English spelling "fiber-optic" in the document text
(the phrase starting "In the silent architecture of the fiber-optic web...")
with the UK English variant "fibre-optic" so the documentation conforms to the
project's UK English spelling guideline; search for the token "fiber-optic" and
update it to "fibre-optic" throughout the file.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`:
- Line 64: The documentation uses US spelling "quantization"; update every
occurrence of the term (e.g., the instance "quantization" in the specs doc) to
UK English "quantisation" to comply with the project style guide, ensuring
surrounding grammar and punctuation remain unchanged and run a quick search to
replace any other occurrences in this file.

In `@docs/training/distill.md`:
- Line 73: Replace the US spelling "distill" with the UK spelling "distil" in
the header/line that reads "Vi training pipeline — distill 26B Gemma 4 → Vi
base" so it matches the UK English used elsewhere (see the similar usage on line
12); update the same token wherever else it appears in this document to ensure
consistent UK English spelling.

In `@docs/training/README.md`:
- Line 11: The sentence in docs/training/README.md uses US spelling "distills";
update that word to the UK English spelling "distils" so the line reads "This is
the substrate that fine-tunes Vi, distils Lemma, and generates the LARQL vindex
inspection signals." Refer to the phrase "distills Lemma" to locate and replace
the token.

In `@go/adapter/adapter.go`:
- Around line 185-194: The InspectAttention method on Adapter should normalize a
nil context like Generate/Chat do: check if ctx == nil and if so set ctx =
context.Background() before using it; update Adapter.InspectAttention to perform
this nil-context fallback prior to asserting a.model and calling
inspector.InspectAttention, ensuring you reference the Adapter type,
InspectAttention method, and the inference.AttentionInspector call when making
the change.
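
A minimal sketch of the nil-context fallback described above. The `Adapter`, `AttentionInspector` and `stubModel` types here are simplified, hypothetical stand-ins; the real method signature in `go/adapter/adapter.go` and the `inference.AttentionInspector` interface may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// AttentionInspector is a hypothetical stand-in for inference.AttentionInspector.
type AttentionInspector interface {
	InspectAttention(ctx context.Context, prompt string) (string, error)
}

// Adapter is a simplified stand-in for the adapter type under review.
type Adapter struct {
	model any
}

// stubModel fails if a nil context ever reaches it.
type stubModel struct{}

func (stubModel) InspectAttention(ctx context.Context, prompt string) (string, error) {
	if ctx == nil {
		return "", errors.New("nil context reached the model")
	}
	return "inspected: " + prompt, nil
}

// InspectAttention normalises a nil context before asserting the model type,
// mirroring what the review says Generate and Chat already do.
func (a *Adapter) InspectAttention(ctx context.Context, prompt string) (string, error) {
	if ctx == nil {
		ctx = context.Background()
	}
	inspector, ok := a.model.(AttentionInspector)
	if !ok {
		return "", errors.New("model does not support attention inspection")
	}
	return inspector.InspectAttention(ctx, prompt)
}

func main() {
	a := &Adapter{model: stubModel{}}
	out, err := a.InspectAttention(nil, "hello")
	fmt.Println(out, err)
}
```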

In `@go/agent/index.go`:
- Around line 273-281: After loading bundle with kv.LoadMemvidBlockBundle,
verify the bundle identity matches the index metadata (e.g., compare
bundle.SnapshotHash or its canonical hash field against
entry.SnapshotHash/entry.SnapshotHashHex) before proceeding; if they differ,
return an error instead of calling kv.LoadPrefixFromMemvidBlocksWithOptions so a
repointed bundle URI cannot silently restore the wrong KV state. Ensure the
check sits between the successful return from LoadMemvidBlockBundle and the call
to kv.LoadPrefixFromMemvidBlocksWithOptions and uses the unique symbols bundle,
entry, bundle.SnapshotHash (or the actual bundle hash field) and
entry.SnapshotHash for the comparison.
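
The identity check could look roughly like the sketch below. `indexEntry`, `blockBundle` and `verifyBundleIdentity` are hypothetical names, and treating an empty entry hash as a mismatch is an assumption of this sketch, not something the review specifies.

```go
package main

import (
	"errors"
	"fmt"
)

// indexEntry and blockBundle are hypothetical stand-ins for the index entry
// and the bundle returned by kv.LoadMemvidBlockBundle.
type indexEntry struct {
	URI          string
	SnapshotHash string
}

type blockBundle struct {
	SnapshotHash string
	TokenCount   int
}

var errBundleMismatch = errors.New("mlx: bundle does not match index entry")

// verifyBundleIdentity fails when the loaded bundle's hash differs from the
// hash recorded in the index, so a repointed URI cannot restore foreign state.
// Assumption: an entry without a recorded hash is rejected as well.
func verifyBundleIdentity(entry indexEntry, bundle *blockBundle) error {
	if bundle == nil {
		return errors.New("mlx: bundle is nil")
	}
	if entry.SnapshotHash == "" || bundle.SnapshotHash != entry.SnapshotHash {
		return fmt.Errorf("%w: uri=%s", errBundleMismatch, entry.URI)
	}
	return nil
}

func main() {
	entry := indexEntry{URI: "memvid://bundle/1", SnapshotHash: "abc123"}
	fmt.Println(verifyBundleIdentity(entry, &blockBundle{SnapshotHash: "abc123"}))
	fmt.Println(verifyBundleIdentity(entry, &blockBundle{SnapshotHash: "zzz999"}))
}
```

The check would run between the bundle load and the prefix restore, returning early on mismatch.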

In `@go/agent/wake_sleep.go`:
- Around line 201-208: The NewSleepIndex function dereferences bundle.TokenCount
without validating bundle, so add a guard at the start of NewSleepIndex to
validate the bundle (and its TokenCount if needed) and return a descriptive
error instead of allowing a panic; specifically check if the bundle parameter is
nil (and optionally ensure bundle.TokenCount is within an expected range) before
constructing the MemvidIndexEntry, and return an error when invalid so callers
of NewSleepIndex get a clear failure rather than a runtime panic.
- Around line 117-123: The code currently defaults to index.Entries[0] when
entryURI is empty, which can restore the wrong span; change the logic in the
block handling entryURI so that if entryURI == "" you only auto-select the sole
entry when len(index.Entries) == 1, otherwise return an error requiring an
explicit EntryURI. Update the flow around the index.Entry(entryURI) call to use
the selected entryURI when single-entry, and return a clear core.NewError (e.g.,
"mlx: EntryURI required when index has multiple entries") if multiple entries
exist and no EntryURI was provided.
- Around line 125-132: PlanWake currently loads a bundle via
kv.LoadMemvidBlockBundle and only checks prefix token bounds, but it must also
verify the loaded bundle matches the selected index to prevent accepting a
repointed URI; after loading the bundle (bundle) and before using
bundle.TokenCount, compare the bundle identity (e.g., bundle.ID or
bundle.Identity/Hash from bundle.Metadata) against the index identifier stored
on the plan entry (e.g., fields reachable from entry such as entry.Index,
entry.BundleID or entry.SelectedIndex) and return a clear error (similar to
core.NewError) if they differ; update the code around kv.LoadMemvidBlockBundle,
entry.PrefixTokens(), and bundle.TokenCount to perform this identity check and
fail early on mismatch.
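
The first two wake/sleep findings can be sketched as follows. `selectEntryURI`, `newSleepEntry`, `bundle` and `indexEntry` are hypothetical simplifications, and rejecting a non-positive token count is an assumption; the review only asks for a nil guard and an optional range check.

```go
package main

import (
	"errors"
	"fmt"
)

// bundle and indexEntry are hypothetical stand-ins for the real types.
type bundle struct {
	TokenCount int
}

type indexEntry struct {
	URI        string
	TokenCount int
}

// selectEntryURI auto-selects only when the index holds exactly one entry;
// otherwise the caller must name the entry explicitly.
func selectEntryURI(entryURIs []string, requested string) (string, error) {
	if requested != "" {
		return requested, nil
	}
	switch len(entryURIs) {
	case 0:
		return "", errors.New("mlx: index has no entries")
	case 1:
		return entryURIs[0], nil
	default:
		return "", errors.New("mlx: EntryURI required when index has multiple entries")
	}
}

// newSleepEntry validates the bundle before reading TokenCount, so a nil
// bundle yields an error instead of a nil-pointer panic.
func newSleepEntry(uri string, b *bundle) (indexEntry, error) {
	if b == nil {
		return indexEntry{}, errors.New("mlx: bundle is nil")
	}
	if b.TokenCount <= 0 { // assumption: empty bundles are not indexable
		return indexEntry{}, fmt.Errorf("mlx: invalid bundle token count %d", b.TokenCount)
	}
	return indexEntry{URI: uri, TokenCount: b.TokenCount}, nil
}

func main() {
	fmt.Println(selectEntryURI([]string{"memvid://a"}, ""))
	fmt.Println(selectEntryURI([]string{"memvid://a", "memvid://b"}, ""))
	fmt.Println(newSleepEntry("memvid://a", nil))
}
```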

In `@go/artifact/artifact.go`:
- Around line 117-121: opts.Kind may be empty when calling opts.Store.Put which
leaves memvid.PutOptions.Kind unset; update the call site around opts.Store.Put
to ensure memvid.PutOptions.Kind is set to a sensible default when opts.Kind ==
"" (e.g., "json" or the record's kind) so kind-based retrieval works
reliably—modify the memvid.PutOptions construction to use a conditional default
for Kind before passing it to opts.Store.Put.
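
One way to apply the default, shown as a sketch: `putOptions` is a hypothetical stand-in for `memvid.PutOptions`, and `"json"` is only the example default the comment suggests, not a confirmed project convention.

```go
package main

import "fmt"

// defaultKind is an assumed fallback; the review suggests "json" or the
// record's own kind.
const defaultKind = "json"

// putOptions is a hypothetical stand-in for memvid.PutOptions.
type putOptions struct {
	Kind string
}

// putOptionsFor fills in the default kind when the caller left it empty, so
// kind-based retrieval always has a value to match on.
func putOptionsFor(kind string) putOptions {
	if kind == "" {
		kind = defaultKind
	}
	return putOptions{Kind: kind}
}

func main() {
	fmt.Println(putOptionsFor("").Kind)
	fmt.Println(putOptionsFor("session-state").Kind)
}
```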

In `@go/backend.go`:
- Line 687: The fallback path that turns chunked prompts into a single Generate
call loses caller cancellation because it routes through helpers that use
context.Background(); modify the chunk fallback flow to propagate the original
context instead of using context.Background() — specifically, update the callers
that invoke promptChunksToString and m.Generate so they accept and forward a
context.Context (or call a context-aware m.Generate variant), change any helper
functions that currently create context.Background() to take a ctx param, and
ensure all three fallback sites (the code paths that call promptChunksToString
and then m.Generate) forward the incoming ctx so deadlines/cancellations are
preserved.
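
A sketch of a context-aware fallback, under stated assumptions: `generateFromChunks`, `generateFunc` and `echoGenerate` are hypothetical, and joining chunks with an empty separator only approximates whatever `promptChunksToString` actually does.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// generateFunc is a hypothetical context-aware generation call.
type generateFunc func(ctx context.Context, prompt string) (string, error)

// generateFromChunks collapses chunks into one prompt and forwards the
// caller's context instead of substituting context.Background().
func generateFromChunks(ctx context.Context, chunks []string, generate generateFunc) (string, error) {
	if ctx == nil {
		ctx = context.Background()
	}
	return generate(ctx, strings.Join(chunks, ""))
}

// echoGenerate is a stub generator that honours cancellation before working.
func echoGenerate(ctx context.Context, prompt string) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err
	}
	return "echo: " + prompt, nil
}

// runCancelledDemo cancels the context before the fallback runs and returns
// the resulting error, showing that cancellation reaches the generator.
func runCancelledDemo() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_, err := generateFromChunks(ctx, []string{"a", "b"}, echoGenerate)
	return err
}

func main() {
	out, err := generateFromChunks(context.Background(), []string{"Hello, ", "world"}, echoGenerate)
	fmt.Println(out, err)
	fmt.Println(runCancelledDemo())
}
```

Threading `ctx` through every helper keeps deadlines intact at all three fallback sites without changing behaviour for callers that pass a live context.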

In `@go/blockcache/blockcache.go`:
- Around line 205-215: Selective clears currently only remove metadata and disk
records, leaving in-memory/runtime entries behind; update the filtered-clear
branch (the code handling len(labels) > 0) to also purge matching runtime state
by removing any entries in service.blocks that match the cleared labels/prefixes
and updating service.hits/service.misses accordingly, then invoke
service.cfg.ClearRuntime() (if non-nil) just like the unfiltered branch; reuse
service.clearDiskLocked() for disk cleanup and ensure all of this runs under the
same lock so service and backend remain in sync.
- Around line 385-395: diskRecordCompatible currently only checks
model/adapter/tokenizer hashes and misses block layout changes; update it to
also verify cache mode and block size match the stored record. In
diskRecordCompatible (and when comparing against record.diskRef), add a cache
mode comparison (e.g. cacheIdentityMatches(service.cfg.CacheMode,
record.Ref.CacheMode)) and a block size comparison (e.g. service.cfg.BlockSize
== record.Ref.BlockSize or an equivalent integer equality) and return false if
either differs, preserving the existing hash checks (cacheIdentityMatches for
ModelHash/AdapterHash/TokenizerHash).
- Around line 172-175: The cache hit branch in the loop over refs leaves refs[i]
as the newly built ref, losing persisted labels; update the hit handling in the
loop inside WarmCache (or the function iterating refs) so that when
service.blocks[ref.ID] exists you increment service.hits and replace refs[i]
with the stored entry (service.blocks[ref.ID]) instead of continuing, thereby
preserving persisted labels like memvid_* from the cached block.
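
The compatibility check and the cache-hit handling could be sketched like this. All types and field names are hypothetical simplifications: adapter and tokenizer hashes are omitted for brevity, plain equality stands in for `cacheIdentityMatches`, and `memvid_uri` is an invented label key.

```go
package main

import "fmt"

// blockRef and config are hypothetical, trimmed-down versions of the block
// cache reference and service configuration.
type blockRef struct {
	ID        string
	ModelHash string
	CacheMode string
	BlockSize int
	Labels    map[string]string
}

type config struct {
	ModelHash string
	CacheMode string
	BlockSize int
}

// diskRecordCompatible also compares cache mode and block size, so records
// written under a different block layout are rejected.
func diskRecordCompatible(cfg config, ref blockRef) bool {
	return cfg.ModelHash == ref.ModelHash &&
		cfg.CacheMode == ref.CacheMode &&
		cfg.BlockSize == ref.BlockSize
}

// resolveHits replaces each freshly built ref with the stored entry on a
// cache hit, preserving labels persisted on the cached block.
func resolveHits(stored map[string]blockRef, refs []blockRef) (hits int) {
	for i, ref := range refs {
		if cached, ok := stored[ref.ID]; ok {
			refs[i] = cached
			hits++
		}
	}
	return hits
}

func main() {
	cfg := config{ModelHash: "m1", CacheMode: "paged", BlockSize: 16}
	fmt.Println(diskRecordCompatible(cfg, blockRef{ModelHash: "m1", CacheMode: "paged", BlockSize: 32}))

	stored := map[string]blockRef{
		"b1": {ID: "b1", Labels: map[string]string{"memvid_uri": "memvid://a"}},
	}
	refs := []blockRef{{ID: "b1"}, {ID: "b2"}}
	fmt.Println(resolveHits(stored, refs), refs[0].Labels)
}
```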

---

Nitpick comments:
In `@docs/inference/thinking.md`:
- Around line 74-78: The fenced code block containing the token categorisation
lines (ThinkingShow, ThinkingHide, ThinkingCapture) lacks a language specifier
and triggers MD040; update the triple-backtick fence to include a language
identifier (e.g., change ``` to ```text) so the block is properly flagged as
plain text and satisfies the markdown linter.

In `@docs/moe/jang.md`:
- Around line 82-90: Add a language specifier to the fenced code block that
lists the profile names (the block containing "JANG_2M — 2-bit mid-tier",
"JANG_3M — 3-bit mid-tier", etc.); replace the opening triple-backtick with one
that specifies a language identifier (e.g., text) so the block becomes a fenced
code block with a language label for consistent Markdown rendering.

In `@docs/moe/README.md`:
- Line 9: The sentence "Pre-dates this sprint were dense models (Gemma 3/4
dense, Qwen 3, Llama 3);" is grammatically awkward—replace it with a clearer
phrasing that conveys those dense models existed before this sprint, for
example: "Prior to this sprint, dense models (Gemma 3/4 dense, Qwen 3, Llama 3)
were supported." Edit the README line in the vMLX parity Phase 1 paragraph to
use this clearer wording so the relationship between prior dense models and the
new sparse-expert work is unambiguous.

In `@docs/observability/probe.md`:
- Around line 31-46: The fenced code block in the emission points section lacks
a language specifier; update the opening triple-backticks to include a language
(for example change ``` to ```text or ```yaml) so the block is
rendered/compliant (the block that begins with "Generate / Chat:" and lists
items like "prefill start → cache_pressure" should be updated).

In `@docs/runtime/README.md`:
- Line 68: Update the link text in docs/runtime/README.md that currently reads
"[../model/model_pack.md] — pre-load validation" to use the single-word form
"preload" (i.e., change "pre-load validation" to "preload validation") so the
description next to the model_pack.md link uses the conventional computing term;
locate the occurrence of "pre-load validation" and replace it with "preload
validation".
- Around line 44-62: The fenced code block showing the boot flow (starting with
"package init time:") lacks a language specifier, causing MD040 lint failures;
update the opening backticks to include a language tag (e.g., add "text" so the
block begins with ```text) in README.md near the boot flow that references
register_metal.go init(), inference.Register(&metalbackend{}),
inference.LoadModel, metal.LoadAndInit, and metaladapter usage to satisfy the
markdown linter.

In `@docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md`:
- Around line 7-9: Replace the machine-specific absolute paths in the plan
document (the two occurrences of `/Users/snider/Code/core/go-mlx` and
`/private/tmp/vmlx-audit-20260509`) with relative or generic placeholders (e.g.,
`./go-mlx` or `<audit-source-path>`) so the file is portable and readable for
other contributors; update the lines in the doc where those paths appear to use
the chosen placeholders and, if helpful, add a short parenthetical note
explaining what actual path should be substituted locally.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`:
- Around line 5-6: The spec contains machine-specific absolute paths ("Anchor
repo: `/Users/snider/Code/core/go-mlx`" and "Primary implementation repo:
`/Users/snider/Code/core/go-inference`"); replace them with portable references
such as relative paths (e.g., "../go-mlx", "../go-inference"), repository names
only ("go-mlx", "go-inference"), or generic placeholders ("<anchor_repo_path>",
"<primary_impl_repo_path>") in the document so the file is not tied to a
specific developer machine while preserving intent.

In `@docs/vmlx-feature-gap-report.md`:
- Around line 7-8: Replace the hard-coded absolute filesystem path and the full
external URL in the report text with more maintainable references: change the
absolute path string to a relative or generic placeholder (e.g., "cloned locally
at <local-clone-path>" or "<audit-clone-path>") and move the external repository
URL to a footnote, appendix, or a single "References" section, or replace it
with a short identifier combined with a reference list; update the text around
the original literal mentions so it reads the same but without embedding
environment-specific paths.

In `@go/agent/index_test.go`:
- Around line 16-304: Add a new test with the _Ugly suffix in this file that
completes the Good/Bad/Ugly triplet for the public index API surface;
specifically add a TestKVSnapshotMemvidBundleIndex_Ugly_* that triggers and
asserts panic/edge behaviors for the public functions (e.g., NewMemvidIndex,
SaveMemvidIndex, LoadMemvidIndex, LoadPrefixFromMemvidIndex,
CheckMemvidIndexCompatibility) — for example call NewMemvidIndex with a
nil/invalid blk or malformed Entries, call
SaveMemvidIndex/LoadMemvidIndex/LoadPrefixFromMemvidIndex with inputs that
provoke panic/edge conditions (nil store, corrupt bundle manifest that causes
decoding panic), and use t.Run subcases to assert panics (recover or
require.Panics) and edge-case returns; name the test with the same prefix as
existing tests and follow the existing style for t.Fatalf checks and
table-driven subtests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ab3e2038-8f7c-4771-a11f-b232a1a59e08

📥 Commits

Reviewing files that changed from the base of the PR and between 07f6af1 and 89f613e.

📒 Files selected for processing (300)
  • .gitignore
  • .gitmodules
  • CLAUDE.md
  • CMakeLists.txt
  • GOAL.md
  • docs/README.md
  • docs/architecture.md
  • docs/build.md
  • docs/cmd/violet.md
  • docs/compute/compute.md
  • docs/development.md
  • docs/examples/compute/frame-pipeline.md
  • docs/examples/daemon/violet-socket.md
  • docs/examples/eval/attention-probe.md
  • docs/examples/eval/perplexity.md
  • docs/examples/inference/batch.md
  • docs/examples/inference/chat.md
  • docs/examples/inference/quantization.md
  • docs/examples/inference/streaming.md
  • docs/examples/model-ops/hf-fit.md
  • docs/examples/model-ops/kv-snapshot.md
  • docs/examples/model-ops/merge.md
  • docs/examples/model-ops/quantize-gguf.md
  • docs/examples/training/distill.md
  • docs/examples/training/grpo.md
  • docs/examples/training/lora-finetune.md
  • docs/examples/training/lora-fuse.md
  • docs/history.md
  • docs/index.md
  • docs/inference/README.md
  • docs/inference/block_cache.md
  • docs/inference/decode_optimisation.md
  • docs/inference/parser_registry.md
  • docs/inference/scheduler.md
  • docs/inference/thinking.md
  • docs/memory/README.md
  • docs/memory/agent_memory.md
  • docs/memory/agentic_project_seed.md
  • docs/memory/kv_snapshot.md
  • docs/memory/kv_snapshot_blocks.md
  • docs/memory/kv_snapshot_index.md
  • docs/memory/kv_snapshot_memvid.md
  • docs/memory/medium.md
  • docs/memory/state_bundle.md
  • docs/model-operations.md
  • docs/model/README.md
  • docs/model/memory_plan.md
  • docs/model/model_pack.md
  • docs/models.md
  • docs/moe/README.md
  • docs/moe/codebook_vq.md
  • docs/moe/expert_residency.md
  • docs/moe/jang.md
  • docs/moe/minimax_m2.md
  • docs/observability/probe.md
  • docs/runtime/2026-05-16-gemma4-e2b-driver-profile.md
  • docs/runtime/2026-05-17-gemma4-parity-and-last-logits.md
  • docs/runtime/2026-05-17-llamacpp-prefill-comparison.md
  • docs/runtime/2026-05-18-gemma4-mtp-speculative-decode.md
  • docs/runtime/2026-05-19-gemma4-e2b-100k-retained-paged.md
  • docs/runtime/2026-05-19-gemma4-e2b-quant-matrix.md
  • docs/runtime/2026-05-19-go-mlx-gemma4-26b-a4b-q4-fresh-story-thinking-ctx65536-c2-g8192-book.md
  • docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-book.md
  • docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-no-thinking-book.md
  • docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-fresh-history-c10-g1536-book.md
  • docs/runtime/2026-05-19-go-mlx-gemma4-e2b-q4-fresh-story-thinking-ctx65536-c2-g8192-book.md
  • docs/runtime/2026-05-19-goal-completion-audit.md
  • docs/runtime/2026-05-19-runner-calibration.md
  • docs/runtime/2026-05-20-chapter-profile-safety.md
  • docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md
  • docs/runtime/README.md
  • docs/runtime/adapter.md
  • docs/runtime/local_autotune.md
  • docs/runtime/register_metal.md
  • docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md
  • docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md
  • docs/training/README.md
  • docs/training/distill.md
  • docs/training/eval.md
  • docs/training/grpo.md
  • docs/training/lora_adapter.md
  • docs/training/sft.md
  • docs/vmlx-feature-gap-report.md
  • external/go-ai
  • external/go-inference
  • external/go-ml
  • go/adapter.go
  • go/adapter/adapter.go
  • go/adapter_example_test.go
  • go/adapter_test.go
  • go/agent/helpers.go
  • go/agent/index.go
  • go/agent/index_test.go
  • go/agent/test_helpers_test.go
  • go/agent/wake_sleep.go
  • go/api_common.go
  • go/api_common_example_test.go
  • go/api_darwin_test.go
  • go/api_shape_test.go
  • go/api_stub.go
  • go/api_stub_example_test.go
  • go/api_stub_test.go
  • go/api_test.go
  • go/api_tokenizer_darwin_test.go
  • go/api_tokenizer_stub.go
  • go/api_tokenizer_stub_example_test.go
  • go/api_tokenizer_stub_test.go
  • go/artifact/artifact.go
  • go/artifact/artifact_test.go
  • go/attention_test.go
  • go/backend.go
  • go/backend_example_test.go
  • go/backend_test.go
  • go/blockcache/blockcache.go
  • go/blockcache/blockcache_test.go
  • go/blockcache/helpers_test.go
  • go/bundle/bundle.go
  • go/bundle/bundle_test.go
  • go/bundle/example_test.go
  • go/bundle/sami.go
  • go/chaptersmoke/chaptersmoke.go
  • go/chaptersmoke/chaptersmoke_test.go
  • go/chat/chat.go
  • go/chat/chat_test.go
  • go/chat/example_test.go
  • go/cmd/go-mlx/main.go
  • go/cmd/go-mlx/main_test.go
  • go/cmd/mlx/main.go
  • go/cmd/mlx/main_test.go
  • go/cmd/mlx/split_ffn_tune.go
  • go/compute/compute.go
  • go/compute/compute_example_test.go
  • go/compute/compute_metal.go
  • go/compute/compute_metal_example_test.go
  • go/compute/compute_metal_helper_test.go
  • go/compute/compute_metal_test.go
  • go/compute/compute_test.go
  • go/compute_stub.go
  • go/compute_stub_example_test.go
  • go/compute_stub_test.go
  • go/compute_test.go
  • go/dataset/jsonl.go
  • go/dataset/sample.go
  • go/dataset_stream.go
  • go/dataset_stream_example_test.go
  • go/dataset_stream_test.go
  • go/device_info.go
  • go/distill.go
  • go/distill_test.go
  • go/eval.go
  • go/eval_darwin.go
  • go/eval_darwin_test.go
  • go/eval_stub.go
  • go/eval_test.go
  • go/fast_eval.go
  • go/fast_eval_example_test.go
  • go/fast_eval_runner.go
  • go/fast_eval_test.go
  • go/gguf/info.go
  • go/gguf/info_example_test.go
  • go/gguf/info_test.go
  • go/gguf/quantize.go
  • go/gguf/quantize_test.go
  • go/grpo.go
  • go/grpo_test.go
  • go/helpers.go
  • go/hf/hf.go
  • go/hf/hf_test.go
  • go/hf/test_helpers_test.go
  • go/hf_fit.go
  • go/inference_contract.go
  • go/inference_contract_test.go
  • go/internal/metal/activation_bridge.cpp
  • go/internal/metal/array.go
  • go/internal/metal/backend.go
  • go/internal/metal/backend_test.go
  • go/internal/metal/batch.go
  • go/internal/metal/cache.go
  • go/internal/metal/cache_test.go
  • go/internal/metal/close.go
  • go/internal/metal/codebook_vq.go
  • go/internal/metal/codebook_vq_test.go
  • go/internal/metal/compile.go
  • go/internal/metal/compile_test.go
  • go/internal/metal/decode.go
  • go/internal/metal/decode_bridge.cpp
  • go/internal/metal/decode_bridge.h
  • go/internal/metal/decode_test.go
  • go/internal/metal/dense_matvec.go
  • go/internal/metal/dense_matvec_test.go
  • go/internal/metal/device.go
  • go/internal/metal/dtype.go
  • go/internal/metal/error_test.go
  • go/internal/metal/expert_id_matvec.go
  • go/internal/metal/expert_id_matvec_test.go
  • go/internal/metal/fast.go
  • go/internal/metal/fast_test.go
  • go/internal/metal/gemma3.go
  • go/internal/metal/gemma4.go
  • go/internal/metal/gemma4_assistant.go
  • go/internal/metal/gemma4_assistant_decode.go
  • go/internal/metal/gemma4_assistant_decode_example_test.go
  • go/internal/metal/gemma4_assistant_decode_test.go
  • go/internal/metal/gemma4_assistant_generate.go
  • go/internal/metal/gemma4_assistant_generate_test.go
  • go/internal/metal/gemma4_assistant_pair.go
  • go/internal/metal/gemma4_assistant_test.go
  • go/internal/metal/gemma4_ffn_residual.go
  • go/internal/metal/gemma4_ffn_residual_test.go
  • go/internal/metal/gemma4_router_topk.go
  • go/internal/metal/gemma4_router_topk_test.go
  • go/internal/metal/gemma4_test.go
  • go/internal/metal/gemma4_vision.go
  • go/internal/metal/generate.go
  • go/internal/metal/generate_test.go
  • go/internal/metal/jang_dequant.go
  • go/internal/metal/jang_dequant_test.go
  • go/internal/metal/kv_snapshot.go
  • go/internal/metal/metal.go
  • go/internal/metal/minimax_m2.go
  • go/internal/metal/minimax_m2_test.go
  • go/internal/metal/mlx_mlx_backend_cpu_available.cpp
  • go/internal/metal/mlx_mlx_backend_gpu_device_info.cpp
  • go/internal/metal/model.go
  • go/internal/metal/model_test.go
  • go/internal/metal/nn.go
  • go/internal/metal/nn_test.go
  • go/internal/metal/ops.go
  • go/internal/metal/process_memory_darwin.go
  • go/internal/metal/process_memory_stub.go
  • go/internal/metal/prompt_cache.go
  • go/internal/metal/prompt_cache_test.go
  • go/internal/metal/qwen3.go
  • go/internal/metal/qwen3_test.go
  • go/internal/metal/runtime_gate.go
  • go/internal/metal/runtime_gate_example_test.go
  • go/internal/metal/runtime_gate_test.go
  • go/internal/metal/sample.go
  • go/internal/metal/sample_test.go
  • go/internal/metal/session.go
  • go/internal/metal/session_example_test.go
  • go/internal/metal/session_test.go
  • go/internal/metal/split.go
  • go/internal/metal/split_test.go
  • go/internal/metal/stream.go
  • go/internal/metal/tokenizer.go
  • go/internal/metal/tokenizer_test.go
  • go/internal/metal/trace.go
  • go/internal/metal/trace_test.go
  • go/internal/metal/training.go
  • go/jang_test.go
  • go/kv/analysis.go
  • go/kv/analysis_example_test.go
  • go/kv/analysis_test.go
  • go/kv/bench.go
  • go/kv/bench_test.go
  • go/kv/blocks.go
  • go/kv/blocks_test.go
  • go/kv/helpers_test.go
  • go/kv/memvid.go
  • go/kv/memvid_test.go
  • go/kv/snapshot.go
  • go/kv/snapshot_example_test.go
  • go/kv/snapshot_test.go
  • go/kv_analysis_example_test.go
  • go/kv_cache_bench.go
  • go/kv_snapshot.go
  • go/kv_snapshot_example_test.go
  • go/kv_snapshot_test.go
  • go/local_tuning.go
  • go/local_tuning_test.go
  • go/lora/adapter.go
  • go/lora/fuse.go
  • go/lora/fuse_stub.go
  • go/lora/fuse_test.go
  • go/lora_adapter_darwin_test.go
  • go/lora_adapter_test.go
  • go/lora_fuse.go
  • go/lora_fuse_darwin.go
  • go/lora_fuse_darwin_test.go
  • go/lora_fuse_test.go
  • go/medium_test.go
  • go/memory/example_test.go
  • go/memory/memory.go
  • go/memory/memory_test.go
  • go/memory_plan.go
  • go/memory_plan_example_test.go
  • go/memory_plan_test.go
  • go/memvid_chapter_smoke.go
  • go/merge/compare.go
  • go/merge/compare_example_test.go
  • go/merge/compare_test.go
  • go/merge/helpers_test.go
  • go/merge/merge.go
  • go/merge/merge_test.go
  • go/mlx.go
  • go/mlx_example_test.go
  • go/mlx_internal_test.go
  • go/mlx_stub.go
  • go/mlx_stub_example_test.go
💤 Files with no reviewable changes (15)
  • go/api_test.go
  • go/api_stub_example_test.go
  • go/api_tokenizer_stub_test.go
  • go/adapter_example_test.go
  • go/api_tokenizer_stub.go
  • go/api_tokenizer_darwin_test.go
  • go/api_tokenizer_stub_example_test.go
  • go/backend_example_test.go
  • go/api_common_example_test.go
  • go/api_shape_test.go
  • go/api_common.go
  • go/api_darwin_test.go
  • go/attention_test.go
  • go/api_stub.go
  • go/api_stub_test.go

Comment thread docs/memory/kv_snapshot_blocks.md
Comment thread go/artifact/artifact.go Outdated
Comment thread go/backend.go
Comment thread go/blockcache/blockcache.go
Comment thread go/blockcache/blockcache.go
Comment thread go/blockcache/blockcache.go
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go/backend.go`:
- Around line 569-572: The code is aliasing caller-owned byte slices into the
snapshot by assigning head.KeyBytes and head.ValueBytes directly to KeyBytes and
ValueBytes; make defensive copies instead (like Value is copied) to avoid
leaking mutable state—replace the direct assignments for KeyBytes and ValueBytes
with fresh copies (e.g., using append to copy into a new []byte) when
constructing the metal snapshot/struct (the fields KeyBytes and ValueBytes on
the metal KV head).
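
The defensive copy suggested above can be sketched as follows. This is a minimal illustration only: `kvHead`, `cloneBytes`, and `copyHead` are hypothetical names, not the types or helpers in `go/backend.go`.

```go
package main

import "fmt"

// kvHead is a hypothetical stand-in for the snapshot head type in the review.
type kvHead struct {
	KeyBytes   []byte
	ValueBytes []byte
}

// cloneBytes returns a copy of b that shares no backing array with it.
func cloneBytes(b []byte) []byte {
	if b == nil {
		return nil
	}
	return append([]byte(nil), b...)
}

// copyHead builds a head whose byte slices are independent of src.
func copyHead(src kvHead) kvHead {
	return kvHead{
		KeyBytes:   cloneBytes(src.KeyBytes),
		ValueBytes: cloneBytes(src.ValueBytes),
	}
}

func main() {
	src := kvHead{KeyBytes: []byte{1, 2}, ValueBytes: []byte{3, 4}}
	dst := copyHead(src)
	src.KeyBytes[0] = 9 // the caller mutates its slice after the snapshot was built
	fmt.Println(dst.KeyBytes[0], dst.ValueBytes[1])
}
```

The trade-off is one extra allocation per slice, so very large payloads pay for the isolation in memory and copy time.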

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9b686e0a-8b41-4e47-975f-03cf235491e9

📥 Commits

Reviewing files that changed from the base of the PR and between 89f613e and c19bc07.

📒 Files selected for processing (22)
  • CMakeLists.txt
  • cpp/CMakeLists.txt
  • go/backend.go
  • go/backend_test.go
  • go/cmd/mlx/main.go
  • go/cmd/mlx/main_test.go
  • go/internal/metal/backend.go
  • go/internal/metal/backend_test.go
  • go/internal/metal/decode_bridge.cpp
  • go/internal/metal/gemma4.go
  • go/internal/metal/gemma4_test.go
  • go/internal/metal/generate.go
  • go/internal/metal/metal.go
  • go/internal/metal/mlx_build_config.h
  • go/internal/metal/pinned_array.go
  • go/internal/metal/pinned_array_bridge.cpp
  • go/internal/metal/pinned_array_test.go
  • go/internal/metal/sample.go
  • go/internal/metal/sample_test.go
  • go/internal/metal/session.go
  • go/kv/snapshot.go
  • go/memvid_chapter_smoke.go
✅ Files skipped from review due to trivial changes (1)
  • cpp/CMakeLists.txt

Comment thread go/backend.go

@github-advanced-security github-advanced-security AI left a comment


SonarCloud found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

Comment on lines +188 to +207
book_path.write_text(
"# "
+ title
+ "\n\n"
+ f"Generated by go-mlx retained State run `{report_path.name}`.\n\n"
+ f"Seed prompt: `{seed['id']}`\n\n"
+ seed["prompt"]
+ "\n\n"
+ "Distractor prompts were supplied one per chapter as entropy and "
"imagery pressure, not as replacement plot instructions.\n\n"
+ "## Distractors\n\n"
+ "\n".join(f"- `{item['id']}`" for item in distractors)
+ "\n\n"
+ "## Metrics\n\n"
+ metric_line(report)
+ "\n---\n\n"
+ "\n\n".join(chapters)
+ "\n",
encoding="utf-8",
)
parser.add_argument("--random-seed", type=int, default=0)
parser.add_argument("--count", type=int, default=1)
parser.add_argument("--turns", type=int, default=10)
parser.add_argument("--run-dir", type=Path, default=Path("/private/tmp/go-mlx-goal/book-runs"))
parser.add_argument("--book-dir", type=Path, default=Path("/private/tmp/go-mlx-goal/books"))
parser.add_argument("--manifest", type=Path, default=Path("/private/tmp/go-mlx-goal/books/manifest.jsonl"))
Comment thread scripts/state_book_from_phase0.py Fixed
Snider and others added 13 commits May 30, 2026 17:20
Pre-existing uncommitted AX-11 coverage for the internal/tokenizer BPE
surface (DecodeOne / DecodeToken / Encode / bpeMerge). Measured clean;
committed as-is to preserve the baseline alongside the metal-side
optimisation.

Co-Authored-By: Athena <athena@lthn.ai>
Co-Authored-By: Hephaestus <hephaestus@lthn.ai>
…ad per-head copy

toMetalKVSnapshot is the pure-Go conversion WarmPromptCacheFromKV runs
before the Metal restorer. A v4 snapshot loaded with default (non-
RawKVOnly) options populates BOTH layer-level native KeyBytes/ValueBytes
AND decoded per-head float32. The restorer (kvLayerArrays) takes the
native-slab branch and pins the layer bytes zero-copy via
fromPinnedRawBytes — it never reads the per-head float32. But
toMetalKVSnapshot was copying every head's float32 into a fresh slab
regardless, materialising the entire prefix cache a second time alongside
the zero-copy byte slab. That is the State-continuity restore doubling.

Fix: when a layer carries native K/V slab bytes, skip the per-head slab
allocation and pass head.Key/head.Value through by reference (same
ownership contract as KeyBytes, whose source already outlives the metal
snapshot for the restore call). Heads-only snapshots (v3, no layer bytes)
keep the load-bearing defensive copy — there the heads ARE the cache data.
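
A rough sketch of the branch described in this fix, under simplified assumptions: `layer`, `head`, and `convertLayer` are hypothetical stand-ins, and the real `toMetalKVSnapshot` works on the project's own snapshot types.

```go
package main

import "fmt"

// head and layer are hypothetical, simplified snapshot shapes.
type head struct{ Key, Value []float32 }

type layer struct {
	KeyBytes, ValueBytes []byte // native slab bytes, present on "dual" snapshots
	Heads                []head // decoded per-head float32
}

// convertLayer passes heads through by reference when the layer carries
// native slab bytes, and deep-copies them for heads-only layers.
func convertLayer(src layer) layer {
	dst := layer{
		KeyBytes:   src.KeyBytes,
		ValueBytes: src.ValueBytes,
		Heads:      make([]head, len(src.Heads)),
	}
	native := len(src.KeyBytes) > 0 && len(src.ValueBytes) > 0
	for i, h := range src.Heads {
		if native {
			dst.Heads[i] = h // shares backing arrays; the source must outlive dst
			continue
		}
		dst.Heads[i] = head{
			Key:   append([]float32(nil), h.Key...),
			Value: append([]float32(nil), h.Value...),
		}
	}
	return dst
}

func main() {
	h := head{Key: []float32{1, 2}, Value: []float32{3, 4}}
	out := convertLayer(layer{Heads: []head{h}})
	h.Key[0] = 42 // mutate the source after conversion
	fmt.Println(out.Heads[0].Key[0])
}
```

The by-reference branch is only safe because of the ownership contract stated above: the source snapshot has to stay alive and unmodified for the duration of the restore call.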

Measured on the production dual shape (26 layers x 4 heads x 2048 tok x
256 dim, BenchmarkToMetalKVSnapshot_DualNativePlusHeads, -benchtime=200ms):
  before: 436,234,232 B/op  5 allocs/op  20.1 ms/op
  after:       26,048 B/op  4 allocs/op   2.47 us/op
(~16,747x B/op reduction — the ~416 MiB float32 payload is no longer copied)

Correctness: TestToMetalKVSnapshot_DualNativePlusHeads_Good asserts layer
KeyBytes/ValueBytes byte-identical (what the restorer pins) and per-head
float32 value-identical; TestToMetalKVSnapshot_HeadsOnly_Good asserts the
heads-only path still deep-copies independently of the source.

Co-authored-by: Hephaestus <hephaestus@lthn.ai>
…te + open gates

GOAL.md had grown to 4028 lines / 346KB, dominated by a ~2465-line chronological
log of dated correction/measurement entries (2026-05-16..05-25) — done work whose
full history lives in git + reports/*.json. Cut to 1580 lines: kept the Goal, the
production-path invariants, a current-state summary (raw-decode ~1.26x gap is the
live target), the open [ ] gates, the IDEAS.md optimisation brief, Acceptance
Criteria, Baseline, Architecture Rules, the 8 Workstreams, and Verification.
Dropped the done [x] open-gates; workstream progress checkboxes left as-is.

Co-Authored-By: Virgil <virgil@lethean.io>
…ror (Mantis #1829)

A Metal library load failure mid-construction left m.Layers pre-allocated
with nil entries; the deferred closeGemma4(m) cleanup then nil-deref'd
layer.compiledNativeOwnerDecode, panicking a second time and masking the
real error in the HTTP handler. Guard the model pointer and skip nil layer
entries across closeGemma/closeGemma4/closeQwen3 so cleanup returns cleanly
and the original load error propagates.
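
The guard pattern can be illustrated with a small sketch. `model`, `layerRes`, and `closeModel` are hypothetical names for illustration; the actual `closeGemma4` releases Metal resources rather than flipping a flag.

```go
package main

import "fmt"

// layerRes is a hypothetical per-layer resource holder.
type layerRes struct{ closed bool }

// model mirrors a struct whose Layers slice is pre-allocated before
// each entry is populated, so a failed load can leave nil entries.
type model struct{ Layers []*layerRes }

// closeModel tolerates a nil model and nil layer entries, and reports
// how many layers it actually released.
func closeModel(m *model) int {
	if m == nil {
		return 0
	}
	released := 0
	for _, l := range m.Layers {
		if l == nil {
			continue // never constructed; nothing to release
		}
		l.closed = true
		released++
	}
	return released
}

func main() {
	m := &model{Layers: make([]*layerRes, 3)} // load failed after layer 0
	m.Layers[0] = &layerRes{}
	fmt.Println(closeModel(m), closeModel(nil))
}
```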

Co-Authored-By: Virgil <virgil@lethean.io>
…Mantis #1780)

F-7 N-2: the byte-prefix check (HasPrefix(resolved+"/", rootResolved))
rejected genuine children when macOS case-insensitive symlink resolution
handed back a differently-cased root. Replace with a core.PathRel-based
pathWithinDir helper that tests containment over cleaned path semantics and
still rejects sibling dirs that merely share a prefix.
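
A relative-path containment check of this kind might look like the sketch below. It uses the standard library's `filepath.Rel` as a stand-in for `core.PathRel`, whose exact behaviour is not shown here, and it does not fold case on its own, so it illustrates only the sibling-prefix part of the fix.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// pathWithinDir reports whether candidate is root itself or lies beneath it,
// judged on cleaned paths rather than on a raw string prefix.
func pathWithinDir(root, candidate string) bool {
	rel, err := filepath.Rel(filepath.Clean(root), filepath.Clean(candidate))
	if err != nil {
		return false
	}
	// A relative path that climbs out of root starts with a ".." segment.
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return false
	}
	return true
}

func main() {
	fmt.Println(pathWithinDir("/models", "/models/gemma/config.json"))
	fmt.Println(pathWithinDir("/models", "/models-evil/config.json"))
}
```

Checking for a whole `..` segment, rather than any `..` prefix, keeps a legitimate child such as `..cache` from being rejected.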

Co-Authored-By: Virgil <virgil@lethean.io>
…antis #1781)

F-6 N-3: the adminDownloadRegistry jobs map grew one entry per download
for the process lifetime with no prune. Add maxDownloadJobsRetained (32)
and evictOldDownloadJobsLocked, called when a new job is recorded; it drops
finished (done/failed) jobs oldest-first and never evicts an in-flight job.
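
One way to express that eviction rule is sketched here. The `downloadJob` fields and the `evictOldJobs` name are assumptions for illustration; as the `Locked` suffix in the commit suggests, the caller is expected to hold the registry mutex.

```go
package main

import (
	"fmt"
	"sort"
)

// downloadJob is a hypothetical job record.
type downloadJob struct {
	ID      string
	Status  string // "running", "done", or "failed"
	Started int64  // start stamp; smaller means older
}

// evictOldJobs drops finished jobs oldest-first until at most max remain.
// In-flight jobs are never removed, so the map may stay above max.
func evictOldJobs(jobs map[string]*downloadJob, max int) {
	if len(jobs) <= max {
		return
	}
	finished := make([]*downloadJob, 0, len(jobs))
	for _, j := range jobs {
		if j.Status == "done" || j.Status == "failed" {
			finished = append(finished, j)
		}
	}
	sort.Slice(finished, func(a, b int) bool {
		return finished[a].Started < finished[b].Started
	})
	for _, j := range finished {
		if len(jobs) <= max {
			break
		}
		delete(jobs, j.ID)
	}
}

func main() {
	jobs := map[string]*downloadJob{
		"a": {ID: "a", Status: "done", Started: 1},
		"b": {ID: "b", Status: "running", Started: 2},
		"c": {ID: "c", Status: "failed", Started: 3},
	}
	evictOldJobs(jobs, 2)
	fmt.Println(len(jobs))
}
```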

Co-Authored-By: Virgil <virgil@lethean.io>
…1782)

F-6 N-4: fetchAndVerify computed the HF-manifest size mismatch then dropped
it on the floor (`_ = expectedSize`), so the drift was dead code. Emit a
core.Warn the operator can correlate; sha256 remains the load-bearing
integrity gate so drift stays non-fatal as the original comment intended.

Co-Authored-By: Virgil <virgil@lethean.io>
F-6 N-9: isSafeHFEntryPath accepted segments beginning with `.`, so a
compromised mirror could plant .git/, .ssh/, or other hidden config into
the model tree. Reject any leading-dot segment; genuine model artefacts are
never dotfiles, and git metadata (.gitattributes) is filtered out as
non-model content rather than failing the download.
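
The leading-dot rule can be sketched as a segment scan. `hasHiddenSegment` is a hypothetical helper assuming `/`-separated entry paths; it is not the body of `isSafeHFEntryPath`, which presumably carries further checks.

```go
package main

import (
	"fmt"
	"strings"
)

// hasHiddenSegment reports whether any "/"-separated segment of p begins
// with a dot. This also catches "." and ".." traversal segments.
func hasHiddenSegment(p string) bool {
	for _, seg := range strings.Split(p, "/") {
		if strings.HasPrefix(seg, ".") {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{"model.safetensors", ".git/config", "sub/.ssh/id"} {
		fmt.Println(p, hasHiddenSegment(p))
	}
}
```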

Co-Authored-By: Virgil <virgil@lethean.io>
…antis #1784)

F-6 N-6: writeModelManifest ranged over the digests map directly, so the
.sha256 sidecar came out in a different byte order on every download —
breaking diffing and reproducibility checks. Sort filenames via
core.SliceSort before serialising.
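
Go randomises map iteration order, which is why the sidecar differed between runs. A minimal sketch of the sorted serialisation follows, using the standard `sort.Strings` in place of `core.SliceSort`; the `digest  filename` line layout is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// manifestLines renders one line per file in filename order, so the
// output is byte-identical across runs for the same digests map.
func manifestLines(digests map[string]string) []string {
	names := make([]string, 0, len(digests))
	for name := range digests {
		names = append(names, name)
	}
	sort.Strings(names)
	lines := make([]string, 0, len(names))
	for _, name := range names {
		lines = append(lines, digests[name]+"  "+name)
	}
	return lines
}

func main() {
	for _, l := range manifestLines(map[string]string{
		"tokenizer.json": "bbbb",
		"config.json":    "aaaa",
	}) {
		fmt.Println(l)
	}
}
```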

Co-Authored-By: Virgil <virgil@lethean.io>
…tis #1785)

F-7 N-7: hotSwapResolver.Replace loaded the new model with only the
per-reload opts (ContextLength + AdapterPath), discarding the auto-tuned
boot options (CacheMode, BatchSize, PromptCache, allocator limits, …) the
resolver was constructed with. Overlay the reload opts on top of initOpts
via reloadLoadOpts so the tuned baseline survives and the reload only
overrides the fields it explicitly carries (LoadOption apply is last-wins).
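
The last-wins overlay can be shown with plain functional options. Everything here (`loadConfig`, `LoadOption`, `reloadLoadOpts`, the `With…` constructors) is a simplified assumption about the shape, not the project's API.

```go
package main

import "fmt"

// loadConfig is a hypothetical subset of the load settings.
type loadConfig struct {
	ContextLength int
	BatchSize     int
}

// LoadOption mutates a config; later options overwrite earlier ones.
type LoadOption func(*loadConfig)

func WithContextLength(n int) LoadOption { return func(c *loadConfig) { c.ContextLength = n } }
func WithBatchSize(n int) LoadOption     { return func(c *loadConfig) { c.BatchSize = n } }

// reloadLoadOpts places the tuned boot options first and the per-reload
// options after them, so a reload overrides only what it sets.
func reloadLoadOpts(initOpts, reload []LoadOption) []LoadOption {
	merged := make([]LoadOption, 0, len(initOpts)+len(reload))
	merged = append(merged, initOpts...)
	return append(merged, reload...)
}

// applyOpts folds the options into a config in order.
func applyOpts(opts []LoadOption) loadConfig {
	var c loadConfig
	for _, o := range opts {
		o(&c)
	}
	return c
}

func main() {
	boot := []LoadOption{WithContextLength(4096), WithBatchSize(8)}
	reload := []LoadOption{WithContextLength(8192)}
	fmt.Printf("%+v\n", applyOpts(reloadLoadOpts(boot, reload)))
}
```

Building a fresh slice also avoids appending into `initOpts` itself, which could otherwise leak one reload's overrides into the next.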

Co-Authored-By: Virgil <virgil@lethean.io>
Add an app-facing Gemma 4 E2B quantisation policy that keeps the q4 lane as an archived control while making 6-bit the product default and 8-bit the quality tier.

Report an explicitly labelled decode bandwidth proxy in driver-profile and state-ramp summaries so retained workflow reports can reason about active bytes per token without pretending to sample hardware bandwidth.

Co-Authored-By: Virgil <virgil@lethean.io>
Add a regression that treats raw native State block bytes as the durable KV payload contract. The test proves saved block payloads are byte-for-byte the native-encoded KV blocks and that raw-only reload reconstructs the original native slabs without duplicated per-head payloads.

Co-Authored-By: Virgil <virgil@lethean.io>
_ = os.Setenv("MLX_METALLIB_PATH", dst)
return
}
if err := os.MkdirAll(dir, 0o755); err != nil {
Snider and others added 7 commits May 31, 2026 07:37
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Snider and others added 13 commits June 2, 2026 05:19
Align architecture, build, and local-tuning docs with the current GOAL policy: metadata-only native gaps stay on the Metal planning path with native_runtime=false diagnostics, while mlxlm remains a legacy manual backend until it can be deleted.

Co-Authored-By: Virgil <virgil@lethean.io>
Keep the legacy requires_python_conversion pack field false now that metadata-only architecture gaps stay on the Metal planning path with unsupported-runtime diagnostics. NativeLoadable and ModelPackIssueUnsupportedRuntime are the supported signals for pack consumers.

Co-Authored-By: Virgil <virgil@lethean.io>
Record the current q6 go-mlx-vs-go-mlx driver-profile rows after the latest external/runtime refresh. The old combined gate remains healthy on the short prompt shape, but promotion still waits for retained-workflow evidence.

Co-Authored-By: Virgil <virgil@lethean.io>
Add a 10-turn q6 retained-state self-benchmark comparing the current default gate against the forced old combined gate. The retained workflow rejects promoting paged decode fast concat despite its short-prompt decode win because wall time, energy, memory, and sampled output shape regress.

Co-Authored-By: Virgil <virgil@lethean.io>
Carry state-ramp prompt and generation shape fields through the production MTP comparator and reject mismatched retained workflows as prompt-shape mismatches. This keeps official E2B MTP promotion evidence from comparing target-only and MTP rows that used different retained seed, append, turn, or sampling settings.

Co-Authored-By: Virgil <virgil@lethean.io>
Record the rebuilt q6 go-mlx-vs-go-mlx short self-benchmark after the retained shape comparator fix. The current default remains faster than fast-lane-off and the forced old combined gate, while the retained workflow gate keeps paged fast concat diagnostic-only.

Co-Authored-By: Virgil <virgil@lethean.io>
Aggregate token-phase and native-event trace summaries with nil-index linear scans for the small retained reporting shapes instead of allocating maps on every summary. This keeps non-trace state-ramp reporting unchanged and cuts the long retained trace summary from 18 to 12 allocs/op while reducing runtime from about 17.4us to 14.1us.

Co-Authored-By: Virgil <virgil@lethean.io>
Keep DefaultGemma4FastRuntimeGates as a defensive-copy public API, but add count/index accessors for hot paths that only need read-only iteration. The CLI fast-lane default and runtime-gate reporters now avoid allocating that defensive slice copy per dispatch.

Benchmarks: DefaultGemma4FastRuntimeGateAccess records 1.047ns/op, 0B/op, 0 allocs/op versus DefaultGemma4FastRuntimeGates at 18.00ns/op, 16B/op, 1 alloc/op. CLI fast-lane defaults drop from 200ns/136B/4 allocs to 180.8ns/120B/3 allocs, and runtime-gates drop from 352B/3 allocs to 336B/2 allocs.
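
The copy-versus-accessor split might be sketched like this. `Gates`, `GateCount`, and `GateAt` are illustrative names and the gate strings are placeholders, not the real `DefaultGemma4FastRuntimeGates` contents.

```go
package main

import "fmt"

// defaultGates is package-private, so callers cannot mutate it directly.
// The values are placeholders for illustration.
var defaultGates = []string{"gate_a", "gate_b"}

// Gates returns a defensive copy: safe to hand out, one allocation per call.
func Gates() []string {
	return append([]string(nil), defaultGates...)
}

// GateCount and GateAt give read-only access without allocating.
func GateCount() int { return len(defaultGates) }

func GateAt(i int) (string, bool) {
	if i < 0 || i >= len(defaultGates) {
		return "", false
	}
	return defaultGates[i], true
}

func main() {
	// Hot path: iterate by index instead of copying the slice.
	for i := 0; i < GateCount(); i++ {
		g, _ := GateAt(i)
		fmt.Println(g)
	}
}
```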

Co-Authored-By: Virgil <virgil@lethean.io>
List the seven mlx-community Gemma 4 E2B pack types in the production quantisation policy while keeping q8/q6/q4 as the locked product ladder. Allow affine q5 through the Gemma 4 native layer quantisation predicate so the 5bit pack can be benchmarked instead of rejected before load.

Verification:

- env GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache go test ./go -run 'TestProductionLane_DefaultProductionQuantizationPolicy|TestProductionLane_DefaultPoliciesReturnDefensiveCopies|TestProductionLane_DefaultQuantizationPackLocks' -count=1

- env GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache go test ./go/internal/metal -run 'TestGemma4_(ValidLayerQuantization|ValidateQuantizationConfig)' -count=1

- env GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache go test ./go/cmd/mlx -run 'TestRunCommand_ProductionQuantization' -count=1

- env GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache go test ./go -run '^$' -bench 'BenchmarkSelectProductionQuantizationTier|BenchmarkDefaultProductionQuantizationPolicy' -benchmem -count=1

- env GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache go build -o /private/tmp/go-mlx-self/bin/lthn-mlx ./go/cmd/mlx

- /private/tmp/go-mlx-self/bin/lthn-mlx production-quantization -json -context 32768

Co-Authored-By: Virgil <virgil@lethean.io>
Expose the supported Gemma 4 E2B quant pack list as a public defensive-copy API and add a name/model-id resolver for benchmark harnesses.

Wire production-quantization -pack to report a selected bench target without changing the app-facing q6/q8/q4 product ladder.

Verification: go test ./go -run 'TestProductionLane_(DefaultProductionQuantizationPolicy|DefaultPoliciesReturnDefensiveCopies|ProductionQuantizationPackByName)' -count=1; go test ./go/cmd/mlx -run 'TestRunCommand_ProductionQuantization' -count=1; go test ./go -run '^$' -bench 'Benchmark(SelectProductionQuantizationTier|ProductionQuantizationPackByName)' -benchmem -count=1; go build -o /private/tmp/go-mlx-self/bin/lthn-mlx ./go/cmd/mlx.

Benchmarks: SelectProductionQuantizationTier_DefaultQ6 48.48 ns/op 0 B/op 0 allocs/op; ProductionQuantizationPackByName_MXFP8 54.16 ns/op 0 B/op 0 allocs/op.

Co-Authored-By: Virgil <virgil@lethean.io>
Expose a production architecture status report derived from the shared profile registry so the no-Python fallback removal checklist is machine-readable.

Add lthn-mlx production-architectures with JSON and gaps-only output, covering the current 25-profile matrix: 16 native and 9 metadata-only gaps.

Verification: go test ./go -run 'TestProductionLane_(DefaultPoliciesReturnDefensiveCopies|DefaultProductionArchitectureStatus|ProductionQuantizationPackByName)' -count=1; go test ./go/cmd/mlx -run 'TestRunCommand_ProductionArchitectures' -count=1; go test ./go -run '^$' -bench 'Benchmark(DefaultProductionArchitectureStatus|ProductionQuantizationPackByName)' -benchmem -count=1; go build -o /private/tmp/go-mlx-self/bin/lthn-mlx ./go/cmd/mlx.

Benchmarks: DefaultProductionArchitectureStatus 4660 ns/op 17632 B/op 132 allocs/op; ProductionQuantizationPackByName_MXFP8 54.59 ns/op 0 B/op 0 allocs/op.

Co-Authored-By: Virgil <virgil@lethean.io>
Move bert and bert_rerank out of metadata-only gaps by adding native staged loader profiles and Metal-side config/tokenizer validation.

Generation stays fail-closed with a staged-loader diagnostic until embedding pooling and rerank scorer kernels land. Update production architecture reporting from 16/9 to 18/7 and cover the new status in pack, CLI, profile, and Metal tests.

Verification: env GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache go test ./go/internal/metal -run 'TestModel_LoadModel_(BERTStagedEncoderLoader|BERTRerankStagedLoader|BERTRerankMissingLabels|MetadataOnlyFamiliesHaveExplicitNativeGuards)|TestGenerate_Model_StagedBERTReturnsDecodeError' -count=1

Verification: env GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache go test ./go -run '^$' -bench 'BenchmarkDefaultProductionArchitectureStatus' -benchmem -count=1

Verification: env GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache go build -o /private/tmp/go-mlx-self/bin/lthn-mlx ./go/cmd/mlx

Co-Authored-By: Virgil <virgil@lethean.io>
Pin all seven MLX-community Gemma 4 E2B derivative packs in the production quantization lock table: mxfp4, mxfp8, 4bit, 5bit, 6bit, 8bit, and bf16.

Keep the product ladder at q8 quality, q6 default, and q4 constrained fallback while exposing mxfp and bf16 variants as audit/benchmark targets rather than app defaults.

Regenerate the official Gemma 4 E2B source-lock artifact from the CLI, preserving policy fields and source-lock notes in the JSON report.

Verification: go test ./go -run 'TestProductionLane_DefaultQuantizationPackLocks|TestOfficialGemma4E2BSourceLockArtifact|TestProductionLane_DefaultProductionQuantizationPolicy|TestProductionLane_ProductionQuantizationPackByName' -count=1

Verification: go test ./go/cmd/mlx -run 'TestRunCommand_(OfficialGemma4LocksJSON|ProductionQuantizationDefaultJSON|ProductionQuantizationBenchPackJSON|ProductionQuantizationBenchPackPlain)' -count=1

Benchmark: go test ./go -run '^$' -bench 'BenchmarkProductionQuantizationPackByName|BenchmarkProdLane_DefaultProductionQuantizationPolicy' -benchmem -count=1

Co-Authored-By: Virgil <virgil@lethean.io>
"config_sha256": "614e876b4efcaff13ce4c7a3f96a5b9de86325e3d2ab9c622606ced688f1b8b7",
"processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
"processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
"tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
"processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
"tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
"tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",
"tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",
"tokenizer_config_sha256": "90c3a3ba5bf53818383a58e1a776cbcacd2a038d4812eaa373e1522f2d06f3df",
"config_sha256": "d6be5b24cbc974d492804737716ade8d2575eb849ec90a1d316bb64e99838104",
"processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
"processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
"tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",
"tokenizer_config_sha256": "90c3a3ba5bf53818383a58e1a776cbcacd2a038d4812eaa373e1522f2d06f3df",
"config_sha256": "29b810ed760b55104943a3cc3b6f8b9ca079e6e00b09585d85aec54863a42fb4",
"processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
"processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
"tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
"processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
"tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
"tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",
"tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
"tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",
"tokenizer_config_sha256": "90c3a3ba5bf53818383a58e1a776cbcacd2a038d4812eaa373e1522f2d06f3df",
Snider and others added 15 commits June 2, 2026 06:54
Promote dense Qwen3.6/Qwen3.5 conditional checkpoints from metadata-only to a native staged profile. The loader now validates config/tokenizer metadata, exposes model info and quant metadata, and keeps generation fail-closed with an explicit hybrid linear-attention diagnostic until decode kernels land.

Production architecture status moves to 19/25 native-staged profiles with 6 metadata-only MoE/sparse-router gaps remaining.

Verified with GOWORK=/Users/snider/Code/core/go-mlx/go.work: go test ./go/internal/metal -run 'TestModel_LoadModel_Qwen36StagedLoader|TestGenerate_Model_StagedQwen36ReturnsDecodeError' -count=1; go test ./go/model -run 'TestInspectModelPack_(SafetensorsQwen36|MetadataOnlyArchitectureProfiles)' -count=1; go test ./go/profile -run 'TestArchitectureProfile_(MetadataFamilies|BuiltinIDs)' -count=1; go test ./go -run 'TestProductionLane_DefaultProductionArchitectureStatus' -count=1; go test ./go/cmd/mlx -run 'TestRunCommand_ProductionArchitectures(JSON|GapsOnly)' -count=1; go build -o /private/tmp/go-mlx-self/bin/lthn-mlx ./go/cmd/mlx; lthn-mlx production-architectures -gaps-only/-json.

Co-Authored-By: Virgil <virgil@lethean.io>
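
The fail-closed staged-loader behaviour described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the go-mlx implementation: `StagedModel`, `LoadStaged`, `Info`, `Generate`, `ErrDecodeUnavailable`, the field names, and the `qwen3_6` architecture string are all hypothetical and chosen only for the example. The sketch assumes that "staged" means metadata is validated and exposed while generation always returns an explicit diagnostic.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDecodeUnavailable is a hypothetical sentinel error. The real
// diagnostic emitted by the project may be worded differently.
var ErrDecodeUnavailable = errors.New("decode kernels not available for staged profile")

// StagedModel is an illustrative stand-in for a checkpoint whose metadata
// has been validated but whose decode path does not exist yet.
type StagedModel struct {
	Architecture string
	VocabSize    int
	QuantBits    int
}

// LoadStaged validates config/tokenizer metadata only. It never loads
// weights or compiles kernels (an assumption made for this sketch).
func LoadStaged(arch string, vocabSize, quantBits int) (*StagedModel, error) {
	if arch == "" {
		return nil, errors.New("config: missing architecture")
	}
	if vocabSize <= 0 {
		return nil, errors.New("tokenizer: vocab size must be positive")
	}
	return &StagedModel{Architecture: arch, VocabSize: vocabSize, QuantBits: quantBits}, nil
}

// Info exposes model and quantisation metadata for inspection.
func (m *StagedModel) Info() string {
	return fmt.Sprintf("%s vocab=%d quant=%d-bit", m.Architecture, m.VocabSize, m.QuantBits)
}

// Generate fails closed: it returns no text and an explicit, wrapped
// diagnostic instead of attempting a partial decode.
func (m *StagedModel) Generate(prompt string) (string, error) {
	return "", fmt.Errorf("%s: %w", m.Architecture, ErrDecodeUnavailable)
}

func main() {
	m, err := LoadStaged("qwen3_6", 151936, 4)
	if err != nil {
		fmt.Println("load failed:", err)
		return
	}
	fmt.Println("loaded:", m.Info())

	if _, err := m.Generate("hello"); errors.Is(err, ErrDecodeUnavailable) {
		fmt.Println("generation refused:", err)
	}
}
```

Wrapping a sentinel error lets callers distinguish "profile is staged, decode not implemented" from a genuine load failure via `errors.Is`, which is one plausible way to keep such a diagnostic explicit.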
Promote plain Qwen3 MoE checkpoints from metadata-only to a native staged profile. The staged loader validates sparse-expert config/tokenizer metadata, exposes model info and quant metadata, and keeps generation fail-closed with an explicit sparse-expert decode diagnostic until router kernels land.

Production architecture status moves to 20/25 native-staged profiles with 5 metadata-only MoE/MLA/channel-parser gaps remaining.

Verified with GOWORK=/Users/snider/Code/core/go-mlx/go.work: go test ./go/internal/metal -run 'TestModel_LoadModel_Qwen3MoEStagedLoader|TestGenerate_Model_StagedQwen3MoEReturnsDecodeError' -count=1; go test ./go/profile -run 'TestArchitectureProfile_(MetadataFamilies|BuiltinIDs)' -count=1; go test ./go -run 'TestProductionLane_DefaultProductionArchitectureStatus' -count=1; go test ./go/cmd/mlx -run 'TestRunCommand_ProductionArchitectures(JSON|GapsOnly)' -count=1; go test ./go/model -run 'TestInspectModelPack_MetadataOnlyArchitectureProfiles' -count=1; go build -o /private/tmp/go-mlx-self/bin/lthn-mlx ./go/cmd/mlx; lthn-mlx production-architectures -gaps-only/-json.

Co-Authored-By: Virgil <virgil@lethean.io>
@sonarqubecloud

sonarqubecloud Bot commented Jun 2, 2026

Quality Gate failed

Failed conditions
3 Security Hotspots
7.5% Duplication on New Code (required ≤ 3%)
E Security Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud



2 participants