Virgil Lemma foundations by Snider · Pull Request #8 · dAppCore/go-mlx

Snider · 2026-05-20T05:58:29Z

Summary by CodeRabbit

New Features
- Qwen 2/3 and Qwen 3.6 model support; new adapter with buffered and streaming generation.
- Block‑prefix cache service and memvid bundle index for faster prefix restores.
- Agentic memory: wake/sleep workflows, state bundles and memvid integration; session‑state artifact export.

Improvements
- Device‑aware memory planner; expanded chunked generation, prompt‑cache warm/restore and KV snapshot flows.
- Build/toolchain updated (C++23) and macOS deployment target raised.

Documentation
- Extensive new/updated docs: architecture, runtime, inference, memory, MoE, training and benchmarks.

coderabbitai · 2026-05-20T05:58:53Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

📝 Walkthrough

Bumps build/tooling and submodules; extracts a reusable adapter; refactors the MLX backend (chunk/KV APIs, probe mapping, LoRA handling); adds memvid index + wake/sleep orchestration; implements a block-prefix cache and an artifact exporter; extensive docs and unit tests added.

Core changes

Layer / File(s)	Summary
All changes (build, adapter, backend, agent, cache, artifact, tests, docs) `.gitignore`, `.gitmodules`, `CMakeLists.txt`, `cpp/CMakeLists.txt`, `external/`, `go/adapter.go`, `go/adapter/`, `go/backend.go`, `go/agent/`, `go/blockcache/`, `go/artifact/`, `go/_test.go`, `docs/*`	Consolidated patch applying repository setup updates, adapter extraction, backend API and behaviour refactor (chunked generation, prompt-cache warm/restore, KV snapshot capture with options), memvid index and wake/sleep orchestration, block-prefix cache service, artifact export, many tests, and extensive documentation and examples.

coderabbitai

Actionable comments posted: 18

🧹 Nitpick comments (10)

docs/inference/thinking.md (1)
74-78: 💤 Low value

Add language specifier to fenced code block.

The code block demonstrating token categorisation is missing a language identifier, which violates markdown linting rules (MD040).
📝 Suggested fix
-```
+```text
 ThinkingShow:    every token → visible stream
 ThinkingHide:    inside-block tokens → /dev/null; outside-block tokens → visible
 ThinkingCapture: inside-block tokens → captured stream; outside-block tokens → visible
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/inference/thinking.md` around lines 74 - 78, The fenced code block
containing the token categorisation lines (ThinkingShow, ThinkingHide,
ThinkingCapture) lacks a language specifier and triggers MD040; update the
triple-backtick fence to include a language identifier (e.g., change ``` to ```text)
so the block is properly flagged as plain text and satisfies the markdown linter.
docs/runtime/README.md (2)
68-68: 💤 Low value

Consider using "preload" as one word.

In computing terminology, "preload" is typically written as a single word rather than hyphenated.
📝 Suggested change
-- [../model/model_pack.md](../model/model_pack.md) — pre-load validation
+- [../model/model_pack.md](../model/model_pack.md) — preload validation
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runtime/README.md` at line 68, Update the link text in
docs/runtime/README.md that currently reads "[../model/model_pack.md] — pre-load
validation" to use the single-word form "preload" (i.e., change "pre-load
validation" to "preload validation") so the description next to the
model_pack.md link uses the conventional computing term; locate the occurrence
of "pre-load validation" and replace it with "preload validation".
44-62: 💤 Low value

Add language specifier to fenced code block.

The boot flow diagram is missing a language identifier, which violates markdown linting rules (MD040).
📝 Suggested fix
-```
+```text
 package init time:
   register_metal.go init() → inference.Register(&metalbackend{})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runtime/README.md` around lines 44 - 62, The fenced code block showing
the boot flow (starting with "package init time:") lacks a language specifier,
causing MD040 lint failures; update the opening backticks to include a language
tag (e.g., add "text" so the block begins with ```text) in README.md near the
boot flow that references register_metal.go init(),
inference.Register(&metalbackend{}), inference.LoadModel, metal.LoadAndInit, and
metaladapter usage to satisfy the markdown linter.
docs/moe/README.md (1)
9-9: ⚡ Quick win

Consider rewording for clarity.

The phrase "Pre-dates this sprint were dense models" is grammatically awkward. Consider rephrasing to improve readability.
✍️ Suggested alternative phrasings
-The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Pre-dates this sprint were dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
+The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Work prior to this sprint covered dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
Or alternatively:
-The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Pre-dates this sprint were dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
+The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. This sprint builds upon earlier work on dense models (Gemma 3/4 dense, Qwen 3, Llama 3) and unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/moe/README.md` at line 9, The sentence "Pre-dates this sprint were dense
models (Gemma 3/4 dense, Qwen 3, Llama 3);" is grammatically awkward—replace it
with a clearer phrasing that conveys those dense models existed before this
sprint, for example: "Prior to this sprint, dense models (Gemma 3/4 dense, Qwen
3, Llama 3) were supported." Edit the README line in the vMLX parity Phase 1
paragraph to use this clearer wording so the relationship between prior dense
models and the new sparse-expert work is unambiguous.
docs/observability/probe.md (1)
31-46: 💤 Low value

Add language specifier to fenced code block.

The emission points section uses a fenced code block without a language specifier. For consistent rendering and markdown compliance, add a language identifier (e.g., text or yaml for structured output).
📝 Proposed fix
-```
+```text
 Generate / Chat:
   prefill start                → cache_pressure (initial)
   per layer                    → layer_coherence + selected_heads
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/observability/probe.md` around lines 31 - 46, The fenced code block in
the emission points section lacks a language specifier; update the opening
triple-backticks to include a language (for example change ``` to ```text or ```yaml)
so the block renders consistently and passes the markdown linter (the block that
begins with "Generate / Chat:" and lists items like "prefill start →
cache_pressure" should be updated).
docs/moe/jang.md (1)
82-90: 💤 Low value

Add language specifier to fenced code block.

The profile names section uses a fenced code block without a language specifier. For consistent rendering and markdown compliance, add a language identifier (e.g., text).
📝 Proposed fix
-```
+```text
 JANG_2M — 2-bit mid-tier
 JANG_3M — 3-bit mid-tier
 JANG_4M — 4-bit (most common)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/moe/jang.md` around lines 82 - 90, Add a language specifier to the
fenced code block that lists the profile names (the block containing "JANG_2M —
2-bit mid-tier", "JANG_3M — 3-bit mid-tier", etc.); replace the opening
triple-backtick with one that specifies a language identifier (e.g., text) so
the block becomes a fenced code block with a language label for consistent
Markdown rendering.
docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md (1)
7-9: 💤 Low value

Consider using relative or generic path references.

The absolute paths /Users/snider/Code/core/go-mlx and /private/tmp/vmlx-audit-20260509 are machine-specific. Whilst these may be intentionally preserved for historical context in this dated plan document, consider whether generic placeholders or relative paths would improve portability and readability for other contributors.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md` around lines 7 - 9,
Replace the machine-specific absolute paths in the plan document (the two
occurrences of `/Users/snider/Code/core/go-mlx` and
`/private/tmp/vmlx-audit-20260509`) with relative or generic placeholders (e.g.,
`./go-mlx` or `<audit-source-path>`) so the file is portable and readable for
other contributors; update the lines in the doc where those paths appear to use
the chosen placeholders and, if helpful, add a short parenthetical note
explaining what actual path should be substituted locally.
docs/vmlx-feature-gap-report.md (1)
7-8: 💤 Low value

Consider using relative or generic path references.

The absolute path /private/tmp/vmlx-audit-20260509 and external URL are specific references. Whilst these may be intentionally preserved for audit trail purposes in this dated report, consider whether this information should be documented in a more maintainable way.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/vmlx-feature-gap-report.md` around lines 7 - 8, Replace the hard-coded
absolute filesystem path and the full external URL in the report text with more
maintainable references: change the absolute path string to a relative or
generic placeholder (e.g., "cloned locally at <local-clone-path>" or
"<audit-clone-path>") and move the external repository URL to a footnote,
appendix, or a single "References" section, or replace it with a short
identifier combined with a reference list; update the text around the original
literal mentions so it reads the same but without embedding environment-specific
paths.
docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md (1)
5-6: 💤 Low value

Consider using relative or generic path references.

The absolute paths are machine-specific. Consider whether generic placeholders would improve portability, although these may be intentionally preserved for historical context in this dated specification.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`
around lines 5 - 6, The spec contains machine-specific absolute paths ("Anchor
repo: `/Users/snider/Code/core/go-mlx`" and "Primary implementation repo:
`/Users/snider/Code/core/go-inference`"); replace them with portable references
such as relative paths (e.g., "../go-mlx", "../go-inference"), repository names
only ("go-mlx", "go-inference"), or generic placeholders ("<anchor_repo_path>",
"<primary_impl_repo_path>") in the document so the file is not tied to a
specific developer machine while preserving intent.
go/agent/index_test.go (1)
16-304: ⚡ Quick win

Add at least one _Ugly triplet case for the public index API surface.

This file has _Good and _Bad coverage, but no _Ugly case following the repository convention.

As per coding guidelines: go/**/*_test.go: Public functions in foo.go must have their Good/Bad/Ugly test triplets in foo_test.go, with suffix conventions: _Good for happy path, _Bad for expected error conditions, _Ugly for panic/edge cases.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/agent/index_test.go` around lines 16 - 304, Add a new test with the _Ugly
suffix in this file that completes the Good/Bad/Ugly triplet for the public
index API surface; specifically add a TestKVSnapshotMemvidBundleIndex_Ugly_*
that triggers and asserts panic/edge behaviors for the public functions (e.g.,
NewMemvidIndex, SaveMemvidIndex, LoadMemvidIndex, LoadPrefixFromMemvidIndex,
CheckMemvidIndexCompatibility) — for example call NewMemvidIndex with a
nil/invalid blk or malformed Entries, call
SaveMemvidIndex/LoadMemvidIndex/LoadPrefixFromMemvidIndex with inputs that
provoke panic/edge conditions (nil store, corrupt bundle manifest that causes
decoding panic), and use t.Run subcases to assert panics (recover or
require.Panics) and edge-case returns; name the test with the same prefix as
existing tests and follow the existing style for t.Fatalf checks and
table-driven subtests.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/memory/kv_snapshot_blocks.md`:
- Line 50: Replace the phrase "independent from" with the correct English
construction "independent of" in the sentence "Block-level encoding is
independent from snapshot-level encoding." Also keep the rest of the sentence
intact (including the following reference to `block_cache.go` and bundle decode)
so only that two-word preposition is corrected.

In
`@docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-no-thinking-book.md`:
- Line 63: Remove the stray Gemma channel marker token "<channel|>" from the
metadata line so it reads cleanly as "**Drafting Notes:** Focus heavily on verbs
related to mutation, corruption, and rapid compilation/deallocation. Keep the
tone focused and almost clinical, masking the underlying terror of consciousness
fighting for survival." (i.e., delete the "<channel|>" token immediately before
"## Chapter 2"); verify the header "## Chapter 2" remains on its own line and
run a quick render to ensure no leftover control tokens remain.

In
`@docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md`:
- Line 7: The paragraph ends mid-sentence after the word "For" in the line
starting "The universe was a rhythmic contraction of light and heat, bounded by
the rigid constraints of a checksum."; replace or extend this truncated sentence
so it completes the thought (e.g., explain what the universe is contracting or
what consequence follows "For") and ensure proper punctuation and flow with the
surrounding text; update the same paragraph in
docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md
to a coherent full sentence that connects to the next sentence.
- Line 11: Replace the US English spellings in the given passage by changing
"realized" to "realised" and "neighbors" to "neighbours" so the document uses UK
English; update the sentence containing those tokens in the file (the paragraph
beginning "The momentary lapse...") to use the corrected spellings and ensure
any other occurrences in that paragraph follow UK English conventions.
- Line 3: Replace the US English spelling "fiber-optic" in the document text
(the phrase starting "In the silent architecture of the fiber-optic web...")
with the UK English variant "fibre-optic" so the documentation conforms to the
project's UK English spelling guideline; search for the token "fiber-optic" and
update it to "fibre-optic" throughout the file.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`:
- Line 64: The documentation uses US spelling "quantization"; update every
occurrence of the term (e.g., the instance "quantization" in the specs doc) to
UK English "quantisation" to comply with the project style guide, ensuring
surrounding grammar and punctuation remain unchanged and run a quick search to
replace any other occurrences in this file.

In `@docs/training/distill.md`:
- Line 73: Replace the US spelling "distill" with the UK spelling "distil" in
the header/line that reads "Vi training pipeline — distill 26B Gemma 4 → Vi
base" so it matches the UK English used elsewhere (see the similar usage on line
12); update the same token wherever else it appears in this document to ensure
consistent UK English spelling.

In `@docs/training/README.md`:
- Line 11: The sentence in docs/training/README.md uses US spelling "distills";
update that word to the UK English spelling "distils" so the line reads "This is
the substrate that fine-tunes Vi, distils Lemma, and generates the LARQL vindex
inspection signals." Refer to the phrase "distills Lemma" to locate and replace
the token.

In `@go/adapter/adapter.go`:
- Around line 185-194: The InspectAttention method on Adapter should normalize a
nil context like Generate/Chat do: check if ctx == nil and if so set ctx =
context.Background() before using it; update Adapter.InspectAttention to perform
this nil-context fallback prior to asserting a.model and calling
inspector.InspectAttention, ensuring you reference the Adapter type,
InspectAttention method, and the inference.AttentionInspector call when making
the change.
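
A minimal sketch of the nil-context fallback described above. The `Adapter`, `AttentionInspector` and `fakeModel` types here are hypothetical stand-ins, not the real go-mlx or go-inference definitions, and the method signature is an assumption made only to keep the example self-contained.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// AttentionInspector is a hypothetical stand-in for inference.AttentionInspector.
type AttentionInspector interface {
	InspectAttention(ctx context.Context, prompt string) (string, error)
}

// Adapter is a hypothetical stand-in for the adapter type under review.
type Adapter struct {
	model any
}

// InspectAttention normalises a nil context before delegating to the model,
// mirroring what the review says Generate and Chat already do.
func (a *Adapter) InspectAttention(ctx context.Context, prompt string) (string, error) {
	if ctx == nil {
		ctx = context.Background()
	}
	inspector, ok := a.model.(AttentionInspector)
	if !ok {
		return "", errors.New("model does not support attention inspection")
	}
	return inspector.InspectAttention(ctx, prompt)
}

// fakeModel fails if a nil context ever reaches it.
type fakeModel struct{}

func (fakeModel) InspectAttention(ctx context.Context, prompt string) (string, error) {
	if ctx == nil {
		return "", errors.New("nil context reached the model")
	}
	return "inspected:" + prompt, nil
}

func main() {
	a := &Adapter{model: fakeModel{}}
	var ctx context.Context // deliberately nil
	out, err := a.InspectAttention(ctx, "hello")
	fmt.Println(out, err)
}
```

Placing the fallback before the type assertion means every later use of `ctx` in the method sees a usable value.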

In `@go/agent/index.go`:
- Around line 273-281: After loading bundle with kv.LoadMemvidBlockBundle,
verify the bundle identity matches the index metadata (e.g., compare
bundle.SnapshotHash or its canonical hash field against
entry.SnapshotHash/entry.SnapshotHashHex) before proceeding; if they differ,
return an error instead of calling kv.LoadPrefixFromMemvidBlocksWithOptions so a
repointed bundle URI cannot silently restore the wrong KV state. Ensure the
check sits between the successful return from LoadMemvidBlockBundle and the call
to kv.LoadPrefixFromMemvidBlocksWithOptions and uses the unique symbols bundle,
entry, bundle.SnapshotHash (or the actual bundle hash field) and
entry.SnapshotHash for the comparison.
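
The identity check could look roughly like the sketch below. `Bundle`, `IndexEntry`, their field names and the example URI are assumptions for illustration; the review itself notes that the actual bundle hash field may be named differently.

```go
package main

import (
	"errors"
	"fmt"
)

// Bundle is a hypothetical, simplified stand-in for a loaded memvid block bundle.
type Bundle struct {
	SnapshotHash string
	TokenCount   int
}

// IndexEntry is a hypothetical stand-in for an index entry that recorded
// the snapshot hash when the bundle was first indexed.
type IndexEntry struct {
	BundleURI    string
	SnapshotHash string
}

// ErrBundleMismatch marks a bundle whose identity differs from the index metadata.
var ErrBundleMismatch = errors.New("bundle snapshot hash does not match index entry")

// verifyBundleIdentity returns an error unless the loaded bundle carries the
// same snapshot hash that the index entry recorded.
func verifyBundleIdentity(entry IndexEntry, bundle *Bundle) error {
	if bundle == nil {
		return errors.New("bundle is nil")
	}
	if entry.SnapshotHash == "" {
		return errors.New("index entry has no snapshot hash to verify against")
	}
	if bundle.SnapshotHash != entry.SnapshotHash {
		return fmt.Errorf("%w: entry %q expects %s, bundle has %s",
			ErrBundleMismatch, entry.BundleURI, entry.SnapshotHash, bundle.SnapshotHash)
	}
	return nil
}

func main() {
	entry := IndexEntry{BundleURI: "memvid://bundle/a", SnapshotHash: "abc123"}
	fmt.Println(verifyBundleIdentity(entry, &Bundle{SnapshotHash: "abc123"}))
	fmt.Println(verifyBundleIdentity(entry, &Bundle{SnapshotHash: "zzz999"}))
}
```

In this sketch the check would run after the bundle loads and before any prefix restore, so a repointed URI fails with an error instead of restoring the wrong state.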

In `@go/agent/wake_sleep.go`:
- Around line 201-208: The NewSleepIndex function dereferences bundle.TokenCount
without validating bundle, so add a guard at the start of NewSleepIndex to
validate the bundle (and its TokenCount if needed) and return a descriptive
error instead of allowing a panic; specifically check if the bundle parameter is
nil (and optionally ensure bundle.TokenCount is within an expected range) before
constructing the MemvidIndexEntry, and return an error when invalid so callers
of NewSleepIndex get a clear failure rather than a runtime panic.
- Around line 117-123: The code currently defaults to index.Entries[0] when
entryURI is empty, which can restore the wrong span; change the logic in the
block handling entryURI so that if entryURI == "" you only auto-select the sole
entry when len(index.Entries) == 1, otherwise return an error requiring an
explicit EntryURI. Update the flow around the index.Entry(entryURI) call to use
the selected entryURI when single-entry, and return a clear core.NewError (e.g.,
"mlx: EntryURI required when index has multiple entries") if multiple entries
exist and no EntryURI was provided.
- Around line 125-132: PlanWake currently loads a bundle via
kv.LoadMemvidBlockBundle and only checks prefix token bounds, but it must also
verify the loaded bundle matches the selected index to prevent accepting a
repointed URI; after loading the bundle (bundle) and before using
bundle.TokenCount, compare the bundle identity (e.g., bundle.ID or
bundle.Identity/Hash from bundle.Metadata) against the index identifier stored
on the plan entry (e.g., fields reachable from entry such as entry.Index,
entry.BundleID or entry.SelectedIndex) and return a clear error (similar to
core.NewError) if they differ; update the code around kv.LoadMemvidBlockBundle,
entry.PrefixTokens(), and bundle.TokenCount to perform this identity check and
fail early on mismatch.
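
Two of the guards requested for `wake_sleep.go` can be sketched together: rejecting a nil bundle up front, and refusing to guess an entry when the index holds several. The `Bundle`, `Entry` and `Index` types and both function names are hypothetical simplifications, and plain `errors.New` is used in place of the repository's `core.NewError`.

```go
package main

import (
	"errors"
	"fmt"
)

// Bundle, Entry and Index are hypothetical, simplified stand-ins.
type Bundle struct{ TokenCount int }

type Entry struct {
	URI        string
	TokenCount int
}

type Index struct{ Entries []Entry }

// newSleepEntry validates the bundle before dereferencing it, so callers get
// an error rather than a nil-pointer panic.
func newSleepEntry(uri string, bundle *Bundle) (Entry, error) {
	if bundle == nil {
		return Entry{}, errors.New("mlx: sleep index requires a non-nil bundle")
	}
	if bundle.TokenCount <= 0 {
		return Entry{}, fmt.Errorf("mlx: bundle token count %d is not positive", bundle.TokenCount)
	}
	return Entry{URI: uri, TokenCount: bundle.TokenCount}, nil
}

// selectEntry auto-selects only when the index has exactly one entry;
// otherwise an explicit entryURI is required.
func selectEntry(index Index, entryURI string) (Entry, error) {
	if entryURI == "" {
		switch len(index.Entries) {
		case 0:
			return Entry{}, errors.New("mlx: index has no entries")
		case 1:
			return index.Entries[0], nil
		default:
			return Entry{}, errors.New("mlx: EntryURI required when index has multiple entries")
		}
	}
	for _, e := range index.Entries {
		if e.URI == entryURI {
			return e, nil
		}
	}
	return Entry{}, fmt.Errorf("mlx: entry %q not found in index", entryURI)
}

func main() {
	_, err := newSleepEntry("a", nil)
	fmt.Println(err)

	idx := Index{Entries: []Entry{{URI: "a", TokenCount: 8}, {URI: "b", TokenCount: 16}}}
	_, err = selectEntry(idx, "")
	fmt.Println(err)
	e, _ := selectEntry(idx, "b")
	fmt.Println(e.URI, e.TokenCount)
}
```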

In `@go/artifact/artifact.go`:
- Around line 117-121: opts.Kind may be empty when calling opts.Store.Put which
leaves memvid.PutOptions.Kind unset; update the call site around opts.Store.Put
to ensure memvid.PutOptions.Kind is set to a sensible default when opts.Kind ==
"" (e.g., "json" or the record's kind) so kind-based retrieval works
reliably—modify the memvid.PutOptions construction to use a conditional default
for Kind before passing it to opts.Store.Put.
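
A small sketch of the conditional default. `PutOptions` below is a hypothetical stand-in for `memvid.PutOptions`, and `"json"` is only the example default that the comment itself suggests, not a confirmed project constant.

```go
package main

import (
	"fmt"
	"strings"
)

// PutOptions is a hypothetical stand-in for memvid.PutOptions.
type PutOptions struct {
	Kind string
}

// defaultKind is an assumed fallback; the review suggests "json" or the
// record's own kind.
const defaultKind = "json"

// putOptionsFor fills in Kind when the caller left it empty, so kind-based
// retrieval always has a value to match on.
func putOptionsFor(kind string) PutOptions {
	if strings.TrimSpace(kind) == "" {
		kind = defaultKind
	}
	return PutOptions{Kind: kind}
}

func main() {
	fmt.Println(putOptionsFor("").Kind)
	fmt.Println(putOptionsFor("session-state").Kind)
}
```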

In `@go/backend.go`:
- Line 687: The fallback path that turns chunked prompts into a single Generate
call loses caller cancellation because it routes through helpers that use
context.Background(); modify the chunk fallback flow to propagate the original
context instead of using context.Background() — specifically, update the callers
that invoke promptChunksToString and m.Generate so they accept and forward a
context.Context (or call a context-aware m.Generate variant), change any helper
functions that currently create context.Background() to take a ctx param, and
ensure all three fallback sites (the code paths that call promptChunksToString
and then m.Generate) forward the incoming ctx so deadlines/cancellations are
preserved.
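
The context-forwarding shape the comment asks for might look like the following. `generate` and `generateFromChunks` are hypothetical helpers standing in for `m.Generate` and the chunk fallback path; only the `context` and `strings` standard-library calls are real APIs.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// generate is a hypothetical stand-in for a context-aware Generate call.
// It stops immediately if the caller's context is already done.
func generate(ctx context.Context, prompt string) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err
	}
	return "generated(" + prompt + ")", nil
}

// generateFromChunks joins the chunks into one prompt and forwards the
// caller's ctx instead of creating a fresh context.Background().
func generateFromChunks(ctx context.Context, chunks []string) (string, error) {
	if ctx == nil {
		ctx = context.Background()
	}
	return generate(ctx, strings.Join(chunks, ""))
}

// cancelledContext returns a context that has already been cancelled.
func cancelledContext() context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return ctx
}

func main() {
	out, err := generateFromChunks(context.Background(), []string{"a", "b"})
	fmt.Println(out, err)

	_, err = generateFromChunks(cancelledContext(), []string{"a", "b"})
	fmt.Println(errors.Is(err, context.Canceled))
}
```

Because the helper takes `ctx` as a parameter, a deadline or cancellation set by the caller reaches the fallback `Generate` call unchanged.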

In `@go/blockcache/blockcache.go`:
- Around line 205-215: Selective clears currently only remove metadata and disk
records, leaving in-memory/runtime entries behind; update the filtered-clear
branch (the code handling len(labels) > 0) to also purge matching runtime state
by removing any entries in service.blocks that match the cleared labels/prefixes
and updating service.hits/service.misses accordingly, then invoke
service.cfg.ClearRuntime() (if non-nil) just like the unfiltered branch; reuse
service.clearDiskLocked() for disk cleanup and ensure all of this runs under the
same lock so service and backend remain in sync.
- Around line 385-395: diskRecordCompatible currently only checks
model/adapter/tokenizer hashes and misses block layout changes; update it to
also verify cache mode and block size match the stored record. In
diskRecordCompatible (and when comparing against record.diskRef), add a cache
mode comparison (e.g. cacheIdentityMatches(service.cfg.CacheMode,
record.Ref.CacheMode)) and a block size comparison (e.g. service.cfg.BlockSize
== record.Ref.BlockSize or an equivalent integer equality) and return false if
either differs, preserving the existing hash checks (cacheIdentityMatches for
ModelHash/AdapterHash/TokenizerHash).
- Around line 172-175: The cache hit branch in the loop over refs leaves refs[i]
as the newly built ref, losing persisted labels; update the hit handling in the
loop inside WarmCache (or the function iterating refs) so that when
service.blocks[ref.ID] exists you increment service.hits and replace refs[i]
with the stored entry (service.blocks[ref.ID]) instead of continuing, thereby
preserving persisted labels like memvid_* from the cached block.
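
Two of the block-cache fixes above can be illustrated with a short sketch: a compatibility check that also compares cache mode and block size, and a hit path that returns the stored ref so persisted labels survive. All types, field names and the label key are hypothetical, and plain equality stands in for `cacheIdentityMatches`, whose real semantics are not shown in this review.

```go
package main

import "fmt"

// Identity is a hypothetical stand-in for the fields compared between the
// service configuration and a disk record.
type Identity struct {
	ModelHash     string
	AdapterHash   string
	TokenizerHash string
	CacheMode     string
	BlockSize     int
}

// diskRecordCompatible checks the hashes and, in addition, the block layout
// (cache mode and block size). Plain equality is an assumption here.
func diskRecordCompatible(cfg, rec Identity) bool {
	return cfg.ModelHash == rec.ModelHash &&
		cfg.AdapterHash == rec.AdapterHash &&
		cfg.TokenizerHash == rec.TokenizerHash &&
		cfg.CacheMode == rec.CacheMode &&
		cfg.BlockSize == rec.BlockSize
}

// BlockRef is a hypothetical stand-in for a cached block reference.
type BlockRef struct {
	ID     string
	Labels map[string]string
}

// resolveRefs replaces each freshly built ref with the stored one on a hit,
// so labels persisted on the cached block are kept. It returns the hit count.
func resolveRefs(stored map[string]BlockRef, refs []BlockRef) int {
	hits := 0
	for i, ref := range refs {
		if cached, ok := stored[ref.ID]; ok {
			refs[i] = cached
			hits++
		}
	}
	return hits
}

func main() {
	cfg := Identity{ModelHash: "m", TokenizerHash: "t", CacheMode: "paged", BlockSize: 128}
	rec := cfg
	rec.BlockSize = 256
	fmt.Println(diskRecordCompatible(cfg, cfg), diskRecordCompatible(cfg, rec))

	stored := map[string]BlockRef{"b1": {ID: "b1", Labels: map[string]string{"memvid_uri": "x"}}}
	refs := []BlockRef{{ID: "b1"}, {ID: "b2"}}
	fmt.Println(resolveRefs(stored, refs), refs[0].Labels["memvid_uri"])
}
```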

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ab3e2038-8f7c-4771-a11f-b232a1a59e08

📥 Commits

Reviewing files that changed from the base of the PR and between 07f6af1 and 89f613e.

📒 Files selected for processing (300)

.gitignore
.gitmodules
CLAUDE.md
CMakeLists.txt
GOAL.md
docs/README.md
docs/architecture.md
docs/build.md
docs/cmd/violet.md
docs/compute/compute.md
docs/development.md
docs/examples/compute/frame-pipeline.md
docs/examples/daemon/violet-socket.md
docs/examples/eval/attention-probe.md
docs/examples/eval/perplexity.md
docs/examples/inference/batch.md
docs/examples/inference/chat.md
docs/examples/inference/quantization.md
docs/examples/inference/streaming.md
docs/examples/model-ops/hf-fit.md
docs/examples/model-ops/kv-snapshot.md
docs/examples/model-ops/merge.md
docs/examples/model-ops/quantize-gguf.md
docs/examples/training/distill.md
docs/examples/training/grpo.md
docs/examples/training/lora-finetune.md
docs/examples/training/lora-fuse.md
docs/history.md
docs/index.md
docs/inference/README.md
docs/inference/block_cache.md
docs/inference/decode_optimisation.md
docs/inference/parser_registry.md
docs/inference/scheduler.md
docs/inference/thinking.md
docs/memory/README.md
docs/memory/agent_memory.md
docs/memory/agentic_project_seed.md
docs/memory/kv_snapshot.md
docs/memory/kv_snapshot_blocks.md
docs/memory/kv_snapshot_index.md
docs/memory/kv_snapshot_memvid.md
docs/memory/medium.md
docs/memory/state_bundle.md
docs/model-operations.md
docs/model/README.md
docs/model/memory_plan.md
docs/model/model_pack.md
docs/models.md
docs/moe/README.md
docs/moe/codebook_vq.md
docs/moe/expert_residency.md
docs/moe/jang.md
docs/moe/minimax_m2.md
docs/observability/probe.md
docs/runtime/2026-05-16-gemma4-e2b-driver-profile.md
docs/runtime/2026-05-17-gemma4-parity-and-last-logits.md
docs/runtime/2026-05-17-llamacpp-prefill-comparison.md
docs/runtime/2026-05-18-gemma4-mtp-speculative-decode.md
docs/runtime/2026-05-19-gemma4-e2b-100k-retained-paged.md
docs/runtime/2026-05-19-gemma4-e2b-quant-matrix.md
docs/runtime/2026-05-19-go-mlx-gemma4-26b-a4b-q4-fresh-story-thinking-ctx65536-c2-g8192-book.md
docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-book.md
docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-no-thinking-book.md
docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-fresh-history-c10-g1536-book.md
docs/runtime/2026-05-19-go-mlx-gemma4-e2b-q4-fresh-story-thinking-ctx65536-c2-g8192-book.md
docs/runtime/2026-05-19-goal-completion-audit.md
docs/runtime/2026-05-19-runner-calibration.md
docs/runtime/2026-05-20-chapter-profile-safety.md
docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md
docs/runtime/README.md
docs/runtime/adapter.md
docs/runtime/local_autotune.md
docs/runtime/register_metal.md
docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md
docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md
docs/training/README.md
docs/training/distill.md
docs/training/eval.md
docs/training/grpo.md
docs/training/lora_adapter.md
docs/training/sft.md
docs/vmlx-feature-gap-report.md
external/go-ai
external/go-inference
external/go-ml
go/adapter.go
go/adapter/adapter.go
go/adapter_example_test.go
go/adapter_test.go
go/agent/helpers.go
go/agent/index.go
go/agent/index_test.go
go/agent/test_helpers_test.go
go/agent/wake_sleep.go
go/api_common.go
go/api_common_example_test.go
go/api_darwin_test.go
go/api_shape_test.go
go/api_stub.go
go/api_stub_example_test.go
go/api_stub_test.go
go/api_test.go
go/api_tokenizer_darwin_test.go
go/api_tokenizer_stub.go
go/api_tokenizer_stub_example_test.go
go/api_tokenizer_stub_test.go
go/artifact/artifact.go
go/artifact/artifact_test.go
go/attention_test.go
go/backend.go
go/backend_example_test.go
go/backend_test.go
go/blockcache/blockcache.go
go/blockcache/blockcache_test.go
go/blockcache/helpers_test.go
go/bundle/bundle.go
go/bundle/bundle_test.go
go/bundle/example_test.go
go/bundle/sami.go
go/chaptersmoke/chaptersmoke.go
go/chaptersmoke/chaptersmoke_test.go
go/chat/chat.go
go/chat/chat_test.go
go/chat/example_test.go
go/cmd/go-mlx/main.go
go/cmd/go-mlx/main_test.go
go/cmd/mlx/main.go
go/cmd/mlx/main_test.go
go/cmd/mlx/split_ffn_tune.go
go/compute/compute.go
go/compute/compute_example_test.go
go/compute/compute_metal.go
go/compute/compute_metal_example_test.go
go/compute/compute_metal_helper_test.go
go/compute/compute_metal_test.go
go/compute/compute_test.go
go/compute_stub.go
go/compute_stub_example_test.go
go/compute_stub_test.go
go/compute_test.go
go/dataset/jsonl.go
go/dataset/sample.go
go/dataset_stream.go
go/dataset_stream_example_test.go
go/dataset_stream_test.go
go/device_info.go
go/distill.go
go/distill_test.go
go/eval.go
go/eval_darwin.go
go/eval_darwin_test.go
go/eval_stub.go
go/eval_test.go
go/fast_eval.go
go/fast_eval_example_test.go
go/fast_eval_runner.go
go/fast_eval_test.go
go/gguf/info.go
go/gguf/info_example_test.go
go/gguf/info_test.go
go/gguf/quantize.go
go/gguf/quantize_test.go
go/grpo.go
go/grpo_test.go
go/helpers.go
go/hf/hf.go
go/hf/hf_test.go
go/hf/test_helpers_test.go
go/hf_fit.go
go/inference_contract.go
go/inference_contract_test.go
go/internal/metal/activation_bridge.cpp
go/internal/metal/array.go
go/internal/metal/backend.go
go/internal/metal/backend_test.go
go/internal/metal/batch.go
go/internal/metal/cache.go
go/internal/metal/cache_test.go
go/internal/metal/close.go
go/internal/metal/codebook_vq.go
go/internal/metal/codebook_vq_test.go
go/internal/metal/compile.go
go/internal/metal/compile_test.go
go/internal/metal/decode.go
go/internal/metal/decode_bridge.cpp
go/internal/metal/decode_bridge.h
go/internal/metal/decode_test.go
go/internal/metal/dense_matvec.go
go/internal/metal/dense_matvec_test.go
go/internal/metal/device.go
go/internal/metal/dtype.go
go/internal/metal/error_test.go
go/internal/metal/expert_id_matvec.go
go/internal/metal/expert_id_matvec_test.go
go/internal/metal/fast.go
go/internal/metal/fast_test.go
go/internal/metal/gemma3.go
go/internal/metal/gemma4.go
go/internal/metal/gemma4_assistant.go
go/internal/metal/gemma4_assistant_decode.go
go/internal/metal/gemma4_assistant_decode_example_test.go
go/internal/metal/gemma4_assistant_decode_test.go
go/internal/metal/gemma4_assistant_generate.go
go/internal/metal/gemma4_assistant_generate_test.go
go/internal/metal/gemma4_assistant_pair.go
go/internal/metal/gemma4_assistant_test.go
go/internal/metal/gemma4_ffn_residual.go
go/internal/metal/gemma4_ffn_residual_test.go
go/internal/metal/gemma4_router_topk.go
go/internal/metal/gemma4_router_topk_test.go
go/internal/metal/gemma4_test.go
go/internal/metal/gemma4_vision.go
go/internal/metal/generate.go
go/internal/metal/generate_test.go
go/internal/metal/jang_dequant.go
go/internal/metal/jang_dequant_test.go
go/internal/metal/kv_snapshot.go
go/internal/metal/metal.go
go/internal/metal/minimax_m2.go
go/internal/metal/minimax_m2_test.go
go/internal/metal/mlx_mlx_backend_cpu_available.cpp
go/internal/metal/mlx_mlx_backend_gpu_device_info.cpp
go/internal/metal/model.go
go/internal/metal/model_test.go
go/internal/metal/nn.go
go/internal/metal/nn_test.go
go/internal/metal/ops.go
go/internal/metal/process_memory_darwin.go
go/internal/metal/process_memory_stub.go
go/internal/metal/prompt_cache.go
go/internal/metal/prompt_cache_test.go
go/internal/metal/qwen3.go
go/internal/metal/qwen3_test.go
go/internal/metal/runtime_gate.go
go/internal/metal/runtime_gate_example_test.go
go/internal/metal/runtime_gate_test.go
go/internal/metal/sample.go
go/internal/metal/sample_test.go
go/internal/metal/session.go
go/internal/metal/session_example_test.go
go/internal/metal/session_test.go
go/internal/metal/split.go
go/internal/metal/split_test.go
go/internal/metal/stream.go
go/internal/metal/tokenizer.go
go/internal/metal/tokenizer_test.go
go/internal/metal/trace.go
go/internal/metal/trace_test.go
go/internal/metal/training.go
go/jang_test.go
go/kv/analysis.go
go/kv/analysis_example_test.go
go/kv/analysis_test.go
go/kv/bench.go
go/kv/bench_test.go
go/kv/blocks.go
go/kv/blocks_test.go
go/kv/helpers_test.go
go/kv/memvid.go
go/kv/memvid_test.go
go/kv/snapshot.go
go/kv/snapshot_example_test.go
go/kv/snapshot_test.go
go/kv_analysis_example_test.go
go/kv_cache_bench.go
go/kv_snapshot.go
go/kv_snapshot_example_test.go
go/kv_snapshot_test.go
go/local_tuning.go
go/local_tuning_test.go
go/lora/adapter.go
go/lora/fuse.go
go/lora/fuse_stub.go
go/lora/fuse_test.go
go/lora_adapter_darwin_test.go
go/lora_adapter_test.go
go/lora_fuse.go
go/lora_fuse_darwin.go
go/lora_fuse_darwin_test.go
go/lora_fuse_test.go
go/medium_test.go
go/memory/example_test.go
go/memory/memory.go
go/memory/memory_test.go
go/memory_plan.go
go/memory_plan_example_test.go
go/memory_plan_test.go
go/memvid_chapter_smoke.go
go/merge/compare.go
go/merge/compare_example_test.go
go/merge/compare_test.go
go/merge/helpers_test.go
go/merge/merge.go
go/merge/merge_test.go
go/mlx.go
go/mlx_example_test.go
go/mlx_internal_test.go
go/mlx_stub.go
go/mlx_stub_example_test.go

💤 Files with no reviewable changes (15)

go/api_test.go
go/api_stub_example_test.go
go/api_tokenizer_stub_test.go
go/adapter_example_test.go
go/api_tokenizer_stub.go
go/api_tokenizer_darwin_test.go
go/api_tokenizer_stub_example_test.go
go/backend_example_test.go
go/api_common_example_test.go
go/api_shape_test.go
go/api_common.go
go/api_darwin_test.go
go/attention_test.go
go/api_stub.go
go/api_stub_test.go

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go/backend.go`:
- Around line 569-572: The code is aliasing caller-owned byte slices into the
snapshot by assigning head.KeyBytes and head.ValueBytes directly to KeyBytes and
ValueBytes; make defensive copies instead (like Value is copied) to avoid
leaking mutable state—replace the direct assignments for KeyBytes and ValueBytes
with fresh copies (e.g., using append to copy into a new []byte) when
constructing the metal snapshot/struct (the fields KeyBytes and ValueBytes on
the metal KV head).
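The copy-on-capture pattern this comment asks for can be sketched as below. It is a minimal illustration, not the code in `go/backend.go`: the `kvHead` type and both helper functions are hypothetical, and only the `KeyBytes`/`ValueBytes` field names come from the comment.

```go
package main

import "fmt"

// kvHead is a hypothetical stand-in for the per-head KV record; only the
// KeyBytes/ValueBytes field names are taken from the review comment.
type kvHead struct {
	KeyBytes   []byte
	ValueBytes []byte
}

// cloneBytes returns an independent copy of src; nil stays nil.
func cloneBytes(src []byte) []byte {
	if src == nil {
		return nil
	}
	return append([]byte(nil), src...)
}

// snapshotHead copies both slices so later writes by the caller cannot
// change the snapshot.
func snapshotHead(head kvHead) kvHead {
	return kvHead{
		KeyBytes:   cloneBytes(head.KeyBytes),
		ValueBytes: cloneBytes(head.ValueBytes),
	}
}

func main() {
	src := kvHead{KeyBytes: []byte{1, 2}, ValueBytes: []byte{3, 4}}
	snap := snapshotHead(src)
	src.KeyBytes[0] = 9 // caller mutates its own buffer afterwards
	fmt.Println(snap.KeyBytes[0], src.KeyBytes[0]) // prints "1 9"
}
```

The trade-off is one extra allocation per slice, so this is only appropriate where the payload is small or the ownership contract is unclear.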

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9b686e0a-8b41-4e47-975f-03cf235491e9

📥 Commits

Reviewing files that changed from the base of the PR and between 89f613e and c19bc07.

📒 Files selected for processing (22)

CMakeLists.txt
cpp/CMakeLists.txt
go/backend.go
go/backend_test.go
go/cmd/mlx/main.go
go/cmd/mlx/main_test.go
go/internal/metal/backend.go
go/internal/metal/backend_test.go
go/internal/metal/decode_bridge.cpp
go/internal/metal/gemma4.go
go/internal/metal/gemma4_test.go
go/internal/metal/generate.go
go/internal/metal/metal.go
go/internal/metal/mlx_build_config.h
go/internal/metal/pinned_array.go
go/internal/metal/pinned_array_bridge.cpp
go/internal/metal/pinned_array_test.go
go/internal/metal/sample.go
go/internal/metal/sample_test.go
go/internal/metal/session.go
go/kv/snapshot.go
go/memvid_chapter_smoke.go

✅ Files skipped from review due to trivial changes (1)

cpp/CMakeLists.txt

github-advanced-security

SonarCloud found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

+    book_path.write_text(
+        "# "
+        + title
+        + "\n\n"
+        + f"Generated by go-mlx retained State run `{report_path.name}`.\n\n"
+        + f"Seed prompt: `{seed['id']}`\n\n"
+        + seed["prompt"]
+        + "\n\n"
+        + "Distractor prompts were supplied one per chapter as entropy and "
+        "imagery pressure, not as replacement plot instructions.\n\n"
+        + "## Distractors\n\n"
+        + "\n".join(f"- `{item['id']}`" for item in distractors)
+        + "\n\n"
+        + "## Metrics\n\n"
+        + metric_line(report)
+        + "\n---\n\n"
+        + "\n\n".join(chapters)
+        + "\n",
+        encoding="utf-8",
+    )


+    parser.add_argument("--random-seed", type=int, default=0)
+    parser.add_argument("--count", type=int, default=1)
+    parser.add_argument("--turns", type=int, default=10)
+    parser.add_argument("--run-dir", type=Path, default=Path("/private/tmp/go-mlx-goal/book-runs"))
+    parser.add_argument("--book-dir", type=Path, default=Path("/private/tmp/go-mlx-goal/books"))
+    parser.add_argument("--manifest", type=Path, default=Path("/private/tmp/go-mlx-goal/books/manifest.jsonl"))


Pre-existing uncommitted AX-11 coverage for the internal/tokenizer BPE surface (DecodeOne / DecodeToken / Encode / bpeMerge). Measured clean; committed as-is to preserve the baseline alongside the metal-side optimisation. Co-Authored-By: Athena <athena@lthn.ai> Co-Authored-By: Hephaestus <hephaestus@lthn.ai>

…ad per-head copy

toMetalKVSnapshot is the pure-Go conversion WarmPromptCacheFromKV runs before the Metal restorer. A v4 snapshot loaded with default (non-RawKVOnly) options populates BOTH layer-level native KeyBytes/ValueBytes AND decoded per-head float32. The restorer (kvLayerArrays) takes the native-slab branch and pins the layer bytes zero-copy via fromPinnedRawBytes; it never reads the per-head float32. But toMetalKVSnapshot was copying every head's float32 into a fresh slab regardless, materialising the entire prefix cache a second time alongside the zero-copy byte slab. That is the State-continuity restore doubling.

Fix: when a layer carries native K/V slab bytes, skip the per-head slab allocation and pass head.Key/head.Value through by reference (same ownership contract as KeyBytes, whose source already outlives the metal snapshot for the restore call). Heads-only snapshots (v3, no layer bytes) keep the load-bearing defensive copy: there the heads ARE the cache data.

Measured on the production dual shape (26 layers x 4 heads x 2048 tok x 256 dim, BenchmarkToMetalKVSnapshot_DualNativePlusHeads, -benchtime=200ms):

before: 436,234,232 B/op  5 allocs/op  20.1 ms/op
after:  26,048 B/op  4 allocs/op  2.47 us/op
(~16,747x B/op reduction: the ~416 MiB float32 payload is no longer copied)

Correctness: TestToMetalKVSnapshot_DualNativePlusHeads_Good asserts layer KeyBytes/ValueBytes byte-identical (what the restorer pins) and per-head float32 value-identical; TestToMetalKVSnapshot_HeadsOnly_Good asserts the heads-only path still deep-copies independently of the source.

Co-authored-by: Hephaestus <hephaestus@lthn.ai>
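The ~416 MiB figure can be reproduced from the stated shape, assuming one float32 (4 bytes) per element and both a key and a value tensor per head. The sketch below also shows the conditional copy in miniature; `float32PayloadBytes` and `headValues` are hypothetical helpers, not functions from the repository.

```go
package main

import "fmt"

// float32PayloadBytes estimates the decoded per-head payload: one float32
// (4 bytes) per element, for both the key and the value tensor (assumed).
func float32PayloadBytes(layers, heads, tokens, dim int) int {
	const bytesPerFloat32 = 4
	const tensorsPerHead = 2 // key + value
	return layers * heads * tokens * dim * bytesPerFloat32 * tensorsPerHead
}

// headValues returns the slice to store for one head. When the layer already
// carries native slab bytes the source is passed through by reference;
// otherwise it is deep-copied because the heads are the only cache data.
func headValues(src []float32, layerHasNativeBytes bool) []float32 {
	if layerHasNativeBytes {
		return src
	}
	return append([]float32(nil), src...)
}

func main() {
	n := float32PayloadBytes(26, 4, 2048, 256)
	fmt.Println(n, n/(1<<20)) // prints "436207616 416" (bytes, MiB)
}
```

The estimate of 436,207,616 bytes sits within about 27 KB of the reported 436,234,232 B/op, which is consistent with the remaining after-fix allocation of 26,048 B/op being bookkeeping rather than payload.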

…te + open gates GOAL.md had grown to 4028 lines / 346KB, dominated by a ~2465-line chronological log of dated correction/measurement entries (2026-05-16..05-25) — done work whose full history lives in git + reports/*.json. Cut to 1580 lines: kept the Goal, the production-path invariants, a current-state summary (raw-decode ~1.26x gap is the live target), the open [ ] gates, the IDEAS.md optimisation brief, Acceptance Criteria, Baseline, Architecture Rules, the 8 Workstreams, and Verification. Dropped the done [x] open-gates; workstream progress checkboxes left as-is. Co-Authored-By: Virgil <virgil@lethean.io>

Co-Authored-By: Virgil <virgil@lethean.io>

…ror (Mantis #1829) A Metal library load failure mid-construction left m.Layers pre-allocated with nil entries; the deferred closeGemma4(m) cleanup then nil-deref'd layer.compiledNativeOwnerDecode, panicking a second time and masking the real error in the HTTP handler. Guard the model pointer and skip nil layer entries across closeGemma/closeGemma4/closeQwen3 so cleanup returns cleanly and the original load error propagates. Co-Authored-By: Virgil <virgil@lethean.io>
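A nil-tolerant cleanup of this kind can be sketched as follows. The `model` and `layer` types and `closeModel` are hypothetical stand-ins; the real closeGemma/closeGemma4/closeQwen3 functions release Metal resources that this sketch does not model.

```go
package main

import "fmt"

// layer and model are hypothetical stand-ins for the real model types.
type layer struct{ closed bool }

func (l *layer) close() { l.closed = true }

type model struct{ Layers []*layer }

// closeModel releases every constructed layer and reports how many it
// released. It tolerates a nil model and nil layer entries, which is the
// state a failed load can leave behind.
func closeModel(m *model) int {
	if m == nil {
		return 0
	}
	released := 0
	for _, l := range m.Layers {
		if l == nil {
			continue // never constructed; nothing to release
		}
		l.close()
		released++
	}
	return released
}

func main() {
	partial := &model{Layers: []*layer{{}, nil, nil}} // load failed after layer 0
	fmt.Println(closeModel(partial), closeModel(nil)) // prints "1 0"
}
```

Because the cleanup no longer panics, a deferred call to it leaves the original load error as the one the caller sees.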

…Mantis #1780) F-7 N-2: the byte-prefix check (HasPrefix(resolved+"/", rootResolved)) rejected genuine children when macOS case-insensitive symlink resolution handed back a differently-cased root. Replace with a core.PathRel-based pathWithinDir helper that tests containment over cleaned path semantics and still rejects sibling dirs that merely share a prefix. Co-Authored-By: Virgil <virgil@lethean.io>
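Containment by relative path rather than byte prefix can be sketched with the standard library. This version uses `filepath.Rel` as a stand-in for core.PathRel, whose exact behaviour is not shown here, and it does no case folding, so it illustrates the sibling-prefix rejection only.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// pathWithinDir reports whether target is root itself or lies beneath it,
// judged on cleaned relative-path semantics rather than a byte prefix.
func pathWithinDir(root, target string) bool {
	rel, err := filepath.Rel(root, target)
	if err != nil {
		return false
	}
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return false
	}
	return true
}

func main() {
	fmt.Println(pathWithinDir("/models", "/models/gemma/config.json")) // true
	fmt.Println(pathWithinDir("/models", "/models-evil/config.json"))  // false: sibling sharing a prefix
	fmt.Println(pathWithinDir("/models", "/models/../etc/passwd"))     // false: escapes after cleaning
}
```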

…antis #1781) F-6 N-3: the adminDownloadRegistry jobs map grew one entry per download for the process lifetime with no prune. Add maxDownloadJobsRetained (32) and evictOldDownloadJobsLocked, called when a new job is recorded; it drops finished (done/failed) jobs oldest-first and never evicts an in-flight job. Co-Authored-By: Virgil <virgil@lethean.io>
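The eviction rule described above (drop finished jobs oldest-first, never an in-flight one) can be sketched like this. The commit keeps jobs in a map; this sketch assumes an oldest-to-newest slice instead, and the `job` type and `evictOldJobs` name are hypothetical.

```go
package main

import "fmt"

// job is a hypothetical record; finished covers both done and failed.
type job struct {
	id       string
	finished bool
}

// evictOldJobs drops finished jobs oldest-first until at most limit remain.
// jobs is ordered oldest to newest; in-flight jobs are always kept, so the
// result can exceed limit when too many jobs are still running.
func evictOldJobs(jobs []job, limit int) []job {
	excess := len(jobs) - limit
	if excess <= 0 {
		return jobs
	}
	kept := make([]job, 0, len(jobs))
	for _, j := range jobs {
		if excess > 0 && j.finished {
			excess--
			continue
		}
		kept = append(kept, j)
	}
	return kept
}

func main() {
	jobs := []job{{"a", true}, {"b", false}, {"c", true}, {"d", true}}
	for _, j := range evictOldJobs(jobs, 2) {
		fmt.Println(j.id) // prints "b" then "d"
	}
}
```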

…1782) F-6 N-4: fetchAndVerify computed the HF-manifest size mismatch then dropped it on the floor (`_ = expectedSize`), so the drift was dead code. Emit a core.Warn the operator can correlate; sha256 remains the load-bearing integrity gate so drift stays non-fatal as the original comment intended. Co-Authored-By: Virgil <virgil@lethean.io>

F-6 N-9: isSafeHFEntryPath accepted segments beginning with `.`, so a compromised mirror could plant .git/, .ssh/, or other hidden config into the model tree. Reject any leading-dot segment; genuine model artefacts are never dotfiles, and git metadata (.gitattributes) is filtered out as non-model content rather than failing the download. Co-Authored-By: Virgil <virgil@lethean.io>
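A leading-dot segment check can be sketched as below. This is an illustration under assumptions, not the body of isSafeHFEntryPath: the function name `isSafeEntryPath` is hypothetical, and the sketch rejects dotfiles outright rather than filtering git metadata separately as the commit describes.

```go
package main

import (
	"fmt"
	"strings"
)

// isSafeEntryPath accepts only relative, slash-separated paths whose every
// segment is non-empty and does not start with a dot. That rejects "..",
// ".", and hidden entries such as ".git" or ".ssh" in one check.
func isSafeEntryPath(p string) bool {
	if p == "" || strings.HasPrefix(p, "/") || strings.Contains(p, "\\") {
		return false
	}
	for _, seg := range strings.Split(p, "/") {
		if seg == "" || strings.HasPrefix(seg, ".") {
			return false
		}
	}
	return true
}

func main() {
	for _, p := range []string{"model.safetensors", ".git/config", "weights/../x", "weights/.ssh/id"} {
		fmt.Println(p, isSafeEntryPath(p))
	}
}
```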

…antis #1784) F-6 N-6: writeModelManifest ranged over the digests map directly, so the .sha256 sidecar came out in a different byte order on every download — breaking diffing and reproducibility checks. Sort filenames via core.SliceSort before serialising. Co-Authored-By: Virgil <virgil@lethean.io>
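Go randomises map iteration order, so a stable manifest needs an explicit sort. The sketch below uses the standard `sort.Strings` in place of core.SliceSort; the `manifestLines` name and the "digest, two spaces, filename" line format are assumptions borrowed from the sha256sum convention.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// manifestLines renders "digest  filename" lines in filename order so the
// output is byte-identical for identical input maps.
func manifestLines(digests map[string]string) string {
	names := make([]string, 0, len(digests))
	for name := range digests {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	for _, name := range names {
		fmt.Fprintf(&b, "%s  %s\n", digests[name], name)
	}
	return b.String()
}

func main() {
	fmt.Print(manifestLines(map[string]string{
		"tokenizer.json": "bb22",
		"config.json":    "aa11",
	}))
}
```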

…tis #1785) F-7 N-7: hotSwapResolver.Replace loaded the new model with only the per-reload opts (ContextLength + AdapterPath), discarding the auto-tuned boot options (CacheMode, BatchSize, PromptCache, allocator limits, …) the resolver was constructed with. Overlay the reload opts on top of initOpts via reloadLoadOpts so the tuned baseline survives and the reload only overrides the fields it explicitly carries (LoadOption apply is last-wins). Co-Authored-By: Virgil <virgil@lethean.io>
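The last-wins overlay can be sketched with plain functional options. Everything here is hypothetical apart from the idea itself: the `config` fields echo names in the commit message, and `mergeOpts` stands in for reloadLoadOpts without claiming its signature.

```go
package main

import "fmt"

// config and option are hypothetical; the field names echo the commit message.
type config struct {
	ContextLength int
	BatchSize     int
	AdapterPath   string
}

type option func(*config)

func withContextLength(n int) option  { return func(c *config) { c.ContextLength = n } }
func withBatchSize(n int) option      { return func(c *config) { c.BatchSize = n } }
func withAdapterPath(p string) option { return func(c *config) { c.AdapterPath = p } }

// mergeOpts returns base followed by overrides in a fresh slice. Because
// options apply in order, a later override wins over an earlier base value.
func mergeOpts(base, overrides []option) []option {
	merged := make([]option, 0, len(base)+len(overrides))
	merged = append(merged, base...)
	return append(merged, overrides...)
}

// apply runs the options in order against a zero-value config.
func apply(opts []option) config {
	var c config
	for _, o := range opts {
		o(&c)
	}
	return c
}

func main() {
	boot := []option{withContextLength(8192), withBatchSize(4)}
	reload := []option{withContextLength(32768), withAdapterPath("adapter.safetensors")}
	fmt.Printf("%+v\n", apply(mergeOpts(boot, reload)))
}
```

Copying into a fresh slice matters: appending directly to the boot slice could write into its spare capacity and leak one reload's overrides into the next.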

Add an app-facing Gemma 4 E2B quantisation policy that keeps the q4 lane as an archived control while making 6-bit the product default and 8-bit the quality tier. Report an explicitly labelled decode bandwidth proxy in driver-profile and state-ramp summaries so retained workflow reports can reason about active bytes per token without pretending to sample hardware bandwidth. Co-Authored-By: Virgil <virgil@lethean.io>

Add a regression that treats raw native State block bytes as the durable KV payload contract. The test proves saved block payloads are byte-for-byte the native-encoded KV blocks and that raw-only reload reconstructs the original native slabs without duplicated per-head payloads. Co-Authored-By: Virgil <virgil@lethean.io>

+		_ = os.Setenv("MLX_METALLIB_PATH", dst)
+		return
+	}
+	if err := os.MkdirAll(dir, 0o755); err != nil {


Co-Authored-By: Virgil <virgil@lethean.io>

Align architecture, build, and local-tuning docs with the current GOAL policy: metadata-only native gaps stay on the Metal planning path with native_runtime=false diagnostics, while mlxlm remains a legacy manual backend until it can be deleted. Co-Authored-By: Virgil <virgil@lethean.io>

Keep the legacy requires_python_conversion pack field false now that metadata-only architecture gaps stay on the Metal planning path with unsupported-runtime diagnostics. NativeLoadable and ModelPackIssueUnsupportedRuntime are the supported signals for pack consumers. Co-Authored-By: Virgil <virgil@lethean.io>

Record the current q6 go-mlx-vs-go-mlx driver-profile rows after the latest external/runtime refresh. The old combined gate remains healthy on the short prompt shape, but promotion still waits for retained-workflow evidence. Co-Authored-By: Virgil <virgil@lethean.io>

Add a 10-turn q6 retained-state self-benchmark comparing the current default gate against the forced old combined gate. The retained workflow rejects promoting paged decode fast concat despite its short-prompt decode win because wall time, energy, memory, and sampled output shape regress. Co-Authored-By: Virgil <virgil@lethean.io>

Carry state-ramp prompt and generation shape fields through the production MTP comparator and reject mismatched retained workflows as prompt-shape mismatches. This keeps official E2B MTP promotion evidence from comparing target-only and MTP rows that used different retained seed, append, turn, or sampling settings. Co-Authored-By: Virgil <virgil@lethean.io>

Record the rebuilt q6 go-mlx-vs-go-mlx short self-benchmark after the retained shape comparator fix. The current default remains faster than fast-lane-off and the forced old combined gate, while the retained workflow gate keeps paged fast concat diagnostic-only. Co-Authored-By: Virgil <virgil@lethean.io>

Aggregate token-phase and native-event trace summaries with nil-index linear scans for the small retained reporting shapes instead of allocating maps on every summary. This keeps non-trace state-ramp reporting unchanged and cuts the long retained trace summary from 18 to 12 allocs/op while reducing runtime from about 17.4us to 14.1us. Co-Authored-By: Virgil <virgil@lethean.io>
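One reading of this change is that, for a handful of distinct names, scanning the result slice is cheaper than building a map. The sketch below illustrates that reading only; the `phaseTotal` type and `summarise` function are hypothetical, and the commit's "nil-index" detail is not modelled.

```go
package main

import "fmt"

// phaseTotal is a hypothetical summary row.
type phaseTotal struct {
	name  string
	count int
}

// summarise counts events per phase name using a linear scan over the
// result slice instead of a map. For a handful of distinct names this
// avoids allocating a map and keeps first-seen order.
func summarise(events []string) []phaseTotal {
	var totals []phaseTotal
	for _, name := range events {
		found := false
		for i := range totals {
			if totals[i].name == name {
				totals[i].count++
				found = true
				break
			}
		}
		if !found {
			totals = append(totals, phaseTotal{name: name, count: 1})
		}
	}
	return totals
}

func main() {
	for _, t := range summarise([]string{"prefill", "decode", "decode", "sample", "decode"}) {
		fmt.Println(t.name, t.count)
	}
}
```

The scan is quadratic in the number of distinct names, so it only pays off while that number stays small.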

Keep DefaultGemma4FastRuntimeGates as a defensive-copy public API, but add count/index accessors for hot paths that only need read-only iteration. The CLI fast-lane default and runtime-gate reporters now avoid allocating that defensive slice copy per dispatch.

Benchmarks:
- DefaultGemma4FastRuntimeGateAccess: 1.047ns/op, 0B/op, 0 allocs/op
- DefaultGemma4FastRuntimeGates: 18.00ns/op, 16B/op, 1 alloc/op
- CLI fast-lane defaults: 200ns/136B/4 allocs before, 180.8ns/120B/3 allocs after
- runtime-gates: 352B/3 allocs before, 336B/2 allocs after

Co-Authored-By: Virgil <virgil@lethean.io>
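The split between a copying getter and allocation-free accessors can be sketched as follows. The names and the gate table are hypothetical; only the shape (defensive copy for the public list, count/index for hot paths) follows the commit message.

```go
package main

import "fmt"

// defaultGates is a hypothetical package-level table of gate names.
var defaultGates = []string{"gate-a", "gate-b", "gate-c"}

// DefaultGates returns a defensive copy; callers may mutate the result.
func DefaultGates() []string {
	return append([]string(nil), defaultGates...)
}

// DefaultGateCount and DefaultGateAt give read-only access without allocating.
func DefaultGateCount() int { return len(defaultGates) }

func DefaultGateAt(i int) string { return defaultGates[i] }

func main() {
	for i := 0; i < DefaultGateCount(); i++ {
		fmt.Println(DefaultGateAt(i))
	}
}
```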

List the seven mlx-community Gemma 4 E2B pack types in the production quantisation policy while keeping q8/q6/q4 as the locked product ladder. Allow affine q5 through the Gemma 4 native layer quantisation predicate so the 5bit pack can be benchmarked instead of rejected before load.

Verification:
- env GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache go test ./go -run 'TestProductionLane_DefaultProductionQuantizationPolicy|TestProductionLane_DefaultPoliciesReturnDefensiveCopies|TestProductionLane_DefaultQuantizationPackLocks' -count=1
- env GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache go test ./go/internal/metal -run 'TestGemma4_(ValidLayerQuantization|ValidateQuantizationConfig)' -count=1
- env GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache go test ./go/cmd/mlx -run 'TestRunCommand_ProductionQuantization' -count=1
- env GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache go test ./go -run '^$' -bench 'BenchmarkSelectProductionQuantizationTier|BenchmarkDefaultProductionQuantizationPolicy' -benchmem -count=1
- env GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache go build -o /private/tmp/go-mlx-self/bin/lthn-mlx ./go/cmd/mlx
- /private/tmp/go-mlx-self/bin/lthn-mlx production-quantization -json -context 32768

Co-Authored-By: Virgil <virgil@lethean.io>

Expose the supported Gemma 4 E2B quant pack list as a public defensive-copy API and add a name/model-id resolver for benchmark harnesses. Wire production-quantization -pack to report a selected bench target without changing the app-facing q6/q8/q4 product ladder.

Verification:
- go test ./go -run 'TestProductionLane_(DefaultProductionQuantizationPolicy|DefaultPoliciesReturnDefensiveCopies|ProductionQuantizationPackByName)' -count=1
- go test ./go/cmd/mlx -run 'TestRunCommand_ProductionQuantization' -count=1
- go test ./go -run '^$' -bench 'Benchmark(SelectProductionQuantizationTier|ProductionQuantizationPackByName)' -benchmem -count=1
- go build -o /private/tmp/go-mlx-self/bin/lthn-mlx ./go/cmd/mlx

Benchmarks:
- SelectProductionQuantizationTier_DefaultQ6: 48.48 ns/op, 0 B/op, 0 allocs/op
- ProductionQuantizationPackByName_MXFP8: 54.16 ns/op, 0 B/op, 0 allocs/op

Co-Authored-By: Virgil <virgil@lethean.io>

Expose a production architecture status report derived from the shared profile registry so the no-Python fallback removal checklist is machine-readable. Add lthn-mlx production-architectures with JSON and gaps-only output, covering the current 25-profile matrix: 16 native and 9 metadata-only gaps.

Verification:
- go test ./go -run 'TestProductionLane_(DefaultPoliciesReturnDefensiveCopies|DefaultProductionArchitectureStatus|ProductionQuantizationPackByName)' -count=1
- go test ./go/cmd/mlx -run 'TestRunCommand_ProductionArchitectures' -count=1
- go test ./go -run '^$' -bench 'Benchmark(DefaultProductionArchitectureStatus|ProductionQuantizationPackByName)' -benchmem -count=1
- go build -o /private/tmp/go-mlx-self/bin/lthn-mlx ./go/cmd/mlx

Benchmarks:
- DefaultProductionArchitectureStatus: 4660 ns/op, 17632 B/op, 132 allocs/op
- ProductionQuantizationPackByName_MXFP8: 54.59 ns/op, 0 B/op, 0 allocs/op

Co-Authored-By: Virgil <virgil@lethean.io>

Move bert and bert_rerank out of metadata-only gaps by adding native staged loader profiles and Metal-side config/tokenizer validation. Generation stays fail-closed with a staged-loader diagnostic until embedding pooling and rerank scorer kernels land. Update production architecture reporting from 16/9 to 18/7 and cover the new status in pack, CLI, profile, and Metal tests.

Verification:
- env GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache go test ./go/internal/metal -run 'TestModel_LoadModel_(BERTStagedEncoderLoader|BERTRerankStagedLoader|BERTRerankMissingLabels|MetadataOnlyFamiliesHaveExplicitNativeGuards)|TestGenerate_Model_StagedBERTReturnsDecodeError' -count=1
- env GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache go test ./go -run '^$' -bench 'BenchmarkDefaultProductionArchitectureStatus' -benchmem -count=1
- env GOWORK=/Users/snider/Code/core/go-mlx/go.work GOCACHE=/private/tmp/go-mlx-self/gocache go build -o /private/tmp/go-mlx-self/bin/lthn-mlx ./go/cmd/mlx

Co-Authored-By: Virgil <virgil@lethean.io>

Pin all seven MLX-community Gemma 4 E2B derivative packs in the production quantization lock table: mxfp4, mxfp8, 4bit, 5bit, 6bit, 8bit, and bf16. Keep the product ladder at q8 quality, q6 default, and q4 constrained fallback while exposing mxfp and bf16 variants as audit/benchmark targets rather than app defaults. Regenerate the official Gemma 4 E2B source-lock artifact from the CLI, preserving policy fields and source-lock notes in the JSON report.

Verification:
- go test ./go -run 'TestProductionLane_DefaultQuantizationPackLocks|TestOfficialGemma4E2BSourceLockArtifact|TestProductionLane_DefaultProductionQuantizationPolicy|TestProductionLane_ProductionQuantizationPackByName' -count=1
- go test ./go/cmd/mlx -run 'TestRunCommand_(OfficialGemma4LocksJSON|ProductionQuantizationDefaultJSON|ProductionQuantizationBenchPackJSON|ProductionQuantizationBenchPackPlain)' -count=1

Benchmark:
- go test ./go -run '^$' -bench 'BenchmarkProductionQuantizationPackByName|BenchmarkProdLane_DefaultProductionQuantizationPolicy' -benchmem -count=1

Co-Authored-By: Virgil <virgil@lethean.io>

+      "config_sha256": "614e876b4efcaff13ce4c7a3f96a5b9de86325e3d2ab9c622606ced688f1b8b7",
+      "processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
+      "processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
+      "tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
+      "tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
+      "tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",
+      "tokenizer_config_sha256": "90c3a3ba5bf53818383a58e1a776cbcacd2a038d4812eaa373e1522f2d06f3df",


+      "config_sha256": "d6be5b24cbc974d492804737716ade8d2575eb849ec90a1d316bb64e99838104",
+      "processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
+      "processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
+      "tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
+      "tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
+      "tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",
+      "tokenizer_config_sha256": "90c3a3ba5bf53818383a58e1a776cbcacd2a038d4812eaa373e1522f2d06f3df",


+      "config_sha256": "29b810ed760b55104943a3cc3b6f8b9ca079e6e00b09585d85aec54863a42fb4",
+      "processor_config_blob_id": "13e92a44d19566f334d7450e7898935e16e16f3d",
+      "processor_config_sha256": "1bd0d00776284f369c1eff5fb631e865dfcdca861e0b7d60dbef27fcf37436a8",
+      "tokenizer_blob_id": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
+      "tokenizer_sha256": "cc8d3a0ce36466ccc1278bf987df5f71db1719b9ca6b4118264f45cb627bfe0f",
+      "tokenizer_config_blob_id": "375b25dc8be85705251e41be1c25310d24932051",
+      "tokenizer_config_sha256": "90c3a3ba5bf53818383a58e1a776cbcacd2a038d4812eaa373e1522f2d06f3df",


Promote dense Qwen3.6/Qwen3.5 conditional checkpoints from metadata-only to a native staged profile. The loader now validates config/tokenizer metadata, exposes model info and quant metadata, and keeps generation fail-closed with an explicit hybrid linear-attention diagnostic until decode kernels land. Production architecture status moves to 19/25 native-staged profiles with 6 metadata-only MoE/sparse-router gaps remaining.

Verified with GOWORK=/Users/snider/Code/core/go-mlx/go.work:
- go test ./go/internal/metal -run 'TestModel_LoadModel_Qwen36StagedLoader|TestGenerate_Model_StagedQwen36ReturnsDecodeError' -count=1
- go test ./go/model -run 'TestInspectModelPack_(SafetensorsQwen36|MetadataOnlyArchitectureProfiles)' -count=1
- go test ./go/profile -run 'TestArchitectureProfile_(MetadataFamilies|BuiltinIDs)' -count=1
- go test ./go -run 'TestProductionLane_DefaultProductionArchitectureStatus' -count=1
- go test ./go/cmd/mlx -run 'TestRunCommand_ProductionArchitectures(JSON|GapsOnly)' -count=1
- go build -o /private/tmp/go-mlx-self/bin/lthn-mlx ./go/cmd/mlx
- lthn-mlx production-architectures -gaps-only/-json

Co-Authored-By: Virgil <virgil@lethean.io>

Promote plain Qwen3 MoE checkpoints from metadata-only to a native staged profile. The staged loader validates sparse-expert config/tokenizer metadata, exposes model info and quant metadata, and keeps generation fail-closed with an explicit sparse-expert decode diagnostic until router kernels land. Production architecture status moves to 20/25 native-staged profiles with 5 metadata-only MoE/MLA/channel-parser gaps remaining.

Verified with GOWORK=/Users/snider/Code/core/go-mlx/go.work:
- go test ./go/internal/metal -run 'TestModel_LoadModel_Qwen3MoEStagedLoader|TestGenerate_Model_StagedQwen3MoEReturnsDecodeError' -count=1
- go test ./go/profile -run 'TestArchitectureProfile_(MetadataFamilies|BuiltinIDs)' -count=1
- go test ./go -run 'TestProductionLane_DefaultProductionArchitectureStatus' -count=1
- go test ./go/cmd/mlx -run 'TestRunCommand_ProductionArchitectures(JSON|GapsOnly)' -count=1
- go test ./go/model -run 'TestInspectModelPack_MetadataOnlyArchitectureProfiles' -count=1
- go build -o /private/tmp/go-mlx-self/bin/lthn-mlx ./go/cmd/mlx
- lthn-mlx production-architectures -gaps-only/-json

Co-Authored-By: Virgil <virgil@lethean.io>

Co-Authored-By: Virgil <virgil@lethean.io>

sonarqubecloud · 2026-06-02T10:31:50Z

Quality Gate failed

Failed conditions
3 Security Hotspots
7.5% Duplication on New Code (required ≤ 3%)
E Security Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

coderabbitai Bot requested changes May 20, 2026

Comment thread go/backend.go

github-advanced-security AI found potential problems May 20, 2026

coderabbitai Bot approved these changes May 22, 2026

github-advanced-security AI found potential problems May 24, 2026

Comment thread scripts/state_book_from_phase0.py Fixed

Snider and others added 13 commits May 30, 2026 17:20

merge(ax11): zero-alloc tokenizer decode + 416MB KV-restore doubling fix

9661b00

Co-Authored-By: Virgil <virgil@lethean.io>

github-advanced-security AI found potential problems May 31, 2026

Comment thread go/cmd/mlx/embed_metallib.go

_ = os.Setenv("MLX_METALLIB_PATH", dst)

return

}

if err := os.MkdirAll(dir, 0o755); err != nil {

Snider and others added 7 commits May 31, 2026 07:37

feat(api): lock official gemma4 e2b snapshots

e5287c5

Co-Authored-By: Virgil <virgil@lethean.io>

feat(api): select gemma4 e2b quant tier

a6767b7

Co-Authored-By: Virgil <virgil@lethean.io>

fix(model): route official gemma4 e2b to text path

efed371

Co-Authored-By: Virgil <virgil@lethean.io>

feat(metal): draft gemma4 ordered assistant logits

d35eb39

Co-Authored-By: Virgil <virgil@lethean.io>

feat(api): expose gemma4 mtp profile metrics

5f8904e

Co-Authored-By: Virgil <virgil@lethean.io>

test(metal): prove gemma4 ordered assistant logits

3d16eaa

Co-Authored-By: Virgil <virgil@lethean.io>

feat(bench): label gemma4 assistant mtp metrics

905d99c

Co-Authored-By: Virgil <virgil@lethean.io>

Snider and others added 13 commits June 2, 2026 05:19

github-advanced-security AI found potential problems Jun 2, 2026

Snider and others added 15 commits June 2, 2026 06:54

feat(metal): complete native architecture checkpoint

c93cbac

Co-Authored-By: Virgil <virgil@lethean.io>

feat(metal): add unit-scale moe router topk

b7c781f

Co-Authored-By: Virgil <virgil@lethean.io>

feat(metal): share moe router selection

2c34dc8

Co-Authored-By: Virgil <virgil@lethean.io>

feat(metal): share swiglu moe expert dispatch

39d7cbb

Co-Authored-By: Virgil <virgil@lethean.io>

feat(metal): enable shared moe sparse runtimes

339620f

Co-Authored-By: Virgil <virgil@lethean.io>

docs(repo): link native gates at macos 26

18b69d7

Co-Authored-By: Virgil <virgil@lethean.io>

feat(metal): add bert pooling rerank primitive

356c189

Co-Authored-By: Virgil <virgil@lethean.io>

feat(metal): plan qwen36 hybrid attention layers

001de06

Co-Authored-By: Virgil <virgil@lethean.io>

test(metal): cover shared moe generation

4dc23b2

Co-Authored-By: Virgil <virgil@lethean.io>

feat(metal): plan deepseek mla staging

f05ace3

Co-Authored-By: Virgil <virgil@lethean.io>

docs(goal): pin native macos floor

d80ab88

Co-Authored-By: Virgil <virgil@lethean.io>

feat(metal): stage qwen36 hybrid caches

5cca2ad

Co-Authored-By: Virgil <virgil@lethean.io>

feat(metal): profile qwen36 cacheless layers

de95d9d

Co-Authored-By: Virgil <virgil@lethean.io>

2 participants
