feat(wasm): add tools/wasm/ Emscripten entrypoint for browser-resident inference by wordingone · Pull Request #15 · AtomicBot-ai/atomic-llama-cpp-turboquant

wordingone · 2026-05-17T01:33:16Z

Summary

Adds tools/wasm/ containing a C++ entrypoint that links libllama.a and exposes browser-callable functions via EMSCRIPTEN_KEEPALIVE.

Build: emcmake cmake . && emmake make wasm-llama
Artifacts: wasm-llama.html, wasm-llama.js, wasm-llama.wasm (2.6 MiB)

Exported functions

| Function | Signature | Description |
|---|---|---|
| `wasm_llama_init` | `(target_path, drafter_path) -> int` | Load target + MTP drafter GGUFs from virtual FS |
| `wasm_llama_health` | `() -> char*` | Returns `{status, mtp_loaded}` JSON |
| `wasm_llama_chat_completion` | `(request_json) -> char*` | OAI-compat completion; returns `{choices, _mtp_enabled, _spec_accept_rate, _latency_ms, _tps}` |
| `wasm_llama_free_str` | `(ptr)` | Free heap-allocated return values |
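The `char*`-returning exports hand ownership of a heap string to the caller, who releases it through `wasm_llama_free_str`. A minimal sketch of that contract follows. It is not the PR's code: the `g_mtp_loaded` flag, the `"ok"` status value, and the no-op macro fallback for native builds are assumptions made so the example compiles outside Emscripten.

```cpp
#include <cstdlib>
#include <cstring>
#include <string>

#ifdef __EMSCRIPTEN__
#include <emscripten/emscripten.h>
#else
#define EMSCRIPTEN_KEEPALIVE  // no-op so the sketch also builds natively
#endif

// Hypothetical state flag; the real entrypoint tracks the loaded models instead.
static bool g_mtp_loaded = false;

// Copy a std::string into a malloc'd, NUL-terminated buffer owned by the caller.
static char * heap_str(const std::string & s) {
    char * out = static_cast<char *>(std::malloc(s.size() + 1));
    if (out == nullptr) {
        return nullptr;
    }
    std::memcpy(out, s.c_str(), s.size() + 1);
    return out;
}

extern "C" {

// Returns a JSON health document; the "ok" value is assumed for illustration.
EMSCRIPTEN_KEEPALIVE char * wasm_llama_health() {
    std::string json = "{\"status\":\"ok\",\"mtp_loaded\":";
    json += g_mtp_loaded ? "true" : "false";
    json += "}";
    return heap_str(json);
}

// Releases any string previously returned by an export above.
EMSCRIPTEN_KEEPALIVE void wasm_llama_free_str(char * ptr) {
    std::free(ptr);
}

}  // extern "C"
```

On the JavaScript side the caller would read the string from the returned pointer and then pass the same pointer back to `wasm_llama_free_str`; skipping the second call leaks the buffer for the lifetime of the module.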

Response shape

{"choices":[{"message":{"role":"assistant","content":"..."}}],"_mtp_enabled":true,"_spec_accept_rate":null,"_latency_ms":1234,"_tps":4.2}

_mtp_enabled reflects whether llama_model_load_mtp_from_file succeeded. _spec_accept_rate is null until pthreads (SharedArrayBuffer + COOP/COEP) are wired globally — tracked as follow-up work.
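One way to keep `_spec_accept_rate` as JSON `null` until a real measurement exists is to model it as an optional value and serialize accordingly. The sketch below assumes a hypothetical `completion_stats` struct and helper names (`json_escape`, `format_fixed`, `build_response`); none of them are taken from `wasm_llama.cpp`.

```cpp
#include <cstdio>
#include <optional>
#include <string>

// Hypothetical per-request metrics; field names mirror the JSON keys.
struct completion_stats {
    bool                  mtp_enabled;
    std::optional<double> spec_accept_rate;  // empty until acceptance is measured
    long                  latency_ms;
    double                tps;
};

// Escape characters that must not appear raw inside a JSON string.
static std::string json_escape(const std::string & in) {
    std::string out;
    for (unsigned char c : in) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    char buf[8];
                    std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Format a double with a fixed number of decimals.
static std::string format_fixed(double v, int decimals) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%.*f", decimals, v);
    return buf;
}

// Assemble the response object in the key order shown in the PR body.
static std::string build_response(const std::string & content, const completion_stats & st) {
    std::string json = "{\"choices\":[{\"message\":{\"role\":\"assistant\",\"content\":\"";
    json += json_escape(content);
    json += "\"}}],\"_mtp_enabled\":";
    json += st.mtp_enabled ? "true" : "false";
    json += ",\"_spec_accept_rate\":";
    json += st.spec_accept_rate ? format_fixed(*st.spec_accept_rate, 2) : std::string("null");
    json += ",\"_latency_ms\":";
    json += std::to_string(st.latency_ms);
    json += ",\"_tps\":";
    json += format_fixed(st.tps, 1);
    json += "}";
    return json;
}
```

With an empty optional the field serializes as `null`, so a client can distinguish "not measured yet" from a measured rate of zero.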

Changes

tools/wasm/wasm_llama.cpp — entrypoint implementation (~220 lines, no external deps beyond llama.h)
tools/wasm/CMakeLists.txt — Emscripten-specific build configuration
tools/CMakeLists.txt — wire add_subdirectory(wasm) in the if (EMSCRIPTEN) block (was previously empty)

Threading note

pthreads require all objects to be compiled with -matomics -mbulk-memory (i.e., -pthread at cmake-configure time). This PR compiles single-threaded, matching the current tools/ build posture. Full pthread support (enabling MTP's mtp_worker_loop) is a separate cmake-level change.


* webui: add setting for first-line chat titles
Add an opt-in setting (`titleGenerationUseFirstLine`) to use the first non-empty line of a prompt as the generated conversation title. Previously, the complete multi-line prompt was being used, which created long titles for complex queries. Coupled with "Ask for confirmation before changing conversation title", the dialog would overflow.
* Update tools/server/webui/src/lib/utils/text.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/utils/text.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: Run build to update the bundle
As requested in: ggml-org#21797 (review)
* webui: Fix missing import for NEWLINE_SEPARATOR
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* CUDA: Limit DeviceSegmentedSort to immediate mode
DeviceSegmentedSort is currently not capturable in a cuda graph. Hence, we have to go for the slower DeviceSegmentedRadixSort in that case.
Perf numbers on RTX Pro 6000 Blackwell Max-Q:
DeviceSegmentedRadixSort in graph mode (i.e. CUDA Graphs)
ARGSORT(type=f32,ne=[2048,512,1,1],order=1): 12291 runs - 105.94 us/run - 8192 kB/run - 73.75 GB/s
ARGSORT(type=f32,ne=[4096,512,1,1],order=1): 10245 runs - 115.08 us/run - 16384 kB/run - 135.77 GB/s
ARGSORT(type=f32,ne=[8192,512,1,1],order=1): 5125 runs - 221.22 us/run - 32768 kB/run - 141.26 GB/s
ARGSORT(type=f32,ne=[16384,512,1,1],order=1): 2565 runs - 430.98 us/run - 65536 kB/run - 145.02 GB/s
ARGSORT(type=f32,ne=[32768,512,1,1],order=1): 1028 runs - 1185.83 us/run - 131072 kB/run - 105.41 GB/s
ARGSORT(type=f32,ne=[65536,512,1,1],order=1): 387 runs - 2748.62 us/run - 262144 kB/run - 90.95 GB/s
DeviceSegmentedSort in immediate mode
ARGSORT(type=f32,ne=[2048,512,1,1],order=1): 16388 runs - 71.17 us/run - 8192 kB/run - 109.78 GB/s
ARGSORT(type=f32,ne=[4096,512,1,1],order=1): 12294 runs - 81.38 us/run - 16384 kB/run - 192.00 GB/s
ARGSORT(type=f32,ne=[8192,512,1,1],order=1): 5125 runs - 240.81 us/run - 32768 kB/run - 129.77 GB/s
ARGSORT(type=f32,ne=[16384,512,1,1],order=1): 2565 runs - 406.60 us/run - 65536 kB/run - 153.71 GB/s
ARGSORT(type=f32,ne=[32768,512,1,1],order=1): 1285 runs - 873.23 us/run - 131072 kB/run - 143.15 GB/s
ARGSORT(type=f32,ne=[65536,512,1,1],order=1): 516 runs - 2288.46 us/run - 262144 kB/run - 109.24 GB/s
* Add test case for dispatch to DeviceSegmentedRadixSort
We currently lack a way to force graph mode in CUDA, patch callback to invoke ggml_backend_compare_graph_backend twice to enforce each test to run in graph mode


…20797)
* use integer dot product for quantized KV flash attention
* small improvements
* fix SHMEM_STAGING indexing
* add missing KV type quants
* fixes
* add supported quants to FA tests
* readd fast paths for <8bit quants
* fix mmq gate and shmem checks


…ml-org#21870)
* common: skip reasoning budget sampler when no budget is requested
After I added thinking_start_tag / thinking_end_tag for gemma4 in ggml-org#21697, the reasoning budget sampler gets unconditionally created even when no budget is configured (the default -1). The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3 which also set these tags. The budget gets converted to INT_MAX, so the sampler never actually forces any tokens but still runs per-token checks (start tag matching in IDLE state, token-to-piece conversion + UTF-8 checks in COUNTING state).
More importantly, the mere existence of the sampler (non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU every token. This could explain the 30% speed regression reported in ggml-org#21784 (98 t/s to 70 t/s on Vulkan).
So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before.
* common: preserve rbudget when grammar is lazy
Following up on the review feedback on ggml-org#21870: keep the reasoning budget sampler when grammar_lazy is true, so the thinking-block grammar suppression from ggml-org#20970 still works when tools are in use. This way, we only skip the sampler when both no budget is set AND grammar is not lazy.

…gml-org#21644)
* Update register tiling matmul to use f32 accumulation
* fix profiling code
* Fix register tiling matmul for chrome, i'm blaming dawn
* Update batch tuning value for iOS
* compile fix
* Fix use of new load function

* cmake: fix CMP0194 warning on Windows with MSVC
Set CMP0194 policy to NEW before project() call in ggml/CMakeLists.txt to suppress the "MSVC is not an assembler for language ASM" warning introduced in CMake 4.1.
The ggml project enables ASM globally for Metal (macOS) and KleidiAI (ARM) backends. On Windows/MSVC, no assembler sources are used, but CMake 4.1+ warns because cl.exe is not a valid ASM compiler.
This follows the same pattern used in ggml-vulkan (CMP0114, CMP0147).
Closes ggml-org#20311
* cmake: apply cisc's formatting suggestion
---------
Co-authored-by: texasich <texasich@users.noreply.github.com>


…device supports it (ggml-org#21572)
* vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it
* use FetchContent to get SPIRV-Headers
* Fetch spirv-headers unconditionally
* remove fetchcontent, rely on installed headers
* fix ubuntu job
* Update docs/build.md


* hexagon: add async HMX worker
Introduce hmx-worker (dedicated thread for HMX compute) to overlap HMX matmul with HVX dequant/DMA stages in the pipeline path, replacing the previous synchronous HMX calls that blocked the main thread.
* hexagon: cost-based VTCM chunk search for out-stationary matmul
* hexagon: fix futex race in hmx_worker_drain
Store the boolean in a local variable to avoid loading the atomic twice
* hex-mm: hmx optimize scatter/transpose and use HMX intrinsics
* hex-vmem: drop vmem limit a touch under 3GB on v73
* hexagon: add fwd declaration of htp_context
* hex-hmx: replace hmx-worker with hmx-queue that mimics dma-queue interface
Simplifies the overall implementation, reduces thread wakeup roundtrips.
* hex-mm: add debug log to hmx work func called from hmx-queue
* Update hmx-queue.h
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
---------
Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>


…pp-turboquant
- Updated NEXTN.md to document the integration of `--mmproj` with speculative decoding types `mtp`, `nextn`, and `eagle3`, allowing coexistence on a single slot.
- Revised README.md to reflect the new multimodal capabilities and their implications for text and image processing.
- Added functions in `common/speculative.cpp` and `common/speculative.h` to check compatibility of speculative types with multimodal settings.
- Enhanced server context handling to manage multimodal prompts and ensure correct behavior during speculative decoding.
- Introduced a new script for running Gemma 4 with multimodal projector support, detailing expected behavior for text and image turns.
- Updated documentation in `docs/speculative.md` to clarify per-turn behavior and future roadmap for draft acceleration on vision turns.


…t inference
Exposes libllama.a (with MTP + gemma4-assistant support) to the browser via four EMSCRIPTEN_KEEPALIVE exports: wasm_llama_init, wasm_llama_health, wasm_llama_chat_completion, wasm_llama_free_str.
Response shape: {choices:[...], _mtp_enabled:bool, _spec_accept_rate:null}
MTP threading requires SharedArrayBuffer (COOP/COEP) and global -pthread build.
Build: emcmake cmake + emmake make wasm-llama -> .html/.js/.wasm (2.6M)

wordingone · 2026-05-19T10:20:24Z

Advisory review — Eli requested gate before merge.

Read tools/wasm/wasm_llama.cpp (213 lines) end-to-end against issue wordingone/gemma-architect#737 AC. All 5 AC items satisfied:

tools/wasm/ directory with CMakeLists.txt (52 lines) + entrypoint .cpp (213 lines)
Emscripten build produces .wasm + .js (artifacts at build-wasm3/bin/)
Exports wasm_llama_init / wasm_llama_chat_completion / wasm_llama_health (+ wasm_llama_free_str for heap-string ownership)
Response shape matches {choices, _mtp_enabled, _spec_accept_rate, _latency_ms, _tps} (lines 196-203)
Upstream PR opened (this PR)

Code quality — competent spike-tier work:

Resource lifecycle: globals freed on re-init (54-64), heap_str/free_str pair documented as caller-owns (lines 4, 43-47, 208-211). Clean.
MTP loading: two-step verify — llama_model_load_mtp_from_file then llama_model_has_mtp_assistant (80-81). Honest about partial-load failure modes.
_spec_accept_rate: null with inline // MTP threading requires SharedArrayBuffer (#739). Honest about the follow-up.
Single-threaded build acknowledged in PR body; pthreads is a separate cmake change.
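The "freed on re-init" behavior noted above can also be expressed with owning smart pointers, so that a second init call cannot leak the previous handles. This is a sketch under stated assumptions, not the PR's implementation: `fake_model`, `fake_model_load`, `fake_model_free`, `init_models`, the return codes, and the optional drafter are all hypothetical stand-ins for the real llama handles and entrypoint.

```cpp
#include <memory>
#include <string>

// Stand-in for an opaque C handle such as a loaded model (hypothetical).
struct fake_model {
    std::string path;
};

static int g_live_models = 0;  // instrumentation for this sketch only

static fake_model * fake_model_load(const char * path) {
    ++g_live_models;
    return new fake_model{path};
}

static void fake_model_free(fake_model * m) {
    --g_live_models;
    delete m;
}

// Deleter that routes unique_ptr destruction through the C-style free call.
struct model_deleter {
    void operator()(fake_model * m) const { fake_model_free(m); }
};
using model_ptr = std::unique_ptr<fake_model, model_deleter>;

static model_ptr g_target;
static model_ptr g_drafter;

// Assumed convention: 0 on success, -1 on failure. A null drafter path is
// treated as "no drafter" in this sketch.
static int init_models(const char * target_path, const char * drafter_path) {
    // Release anything left over from a previous init before loading again.
    g_drafter.reset();
    g_target.reset();

    if (target_path == nullptr) {
        return -1;
    }
    g_target.reset(fake_model_load(target_path));
    if (drafter_path != nullptr) {
        g_drafter.reset(fake_model_load(drafter_path));
    }
    return 0;
}
```

Calling `init_models` twice leaves exactly the handles from the second call alive, because each `reset()` runs the deleter on whatever the pointer held before.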

Pragmatic shortcuts (acceptable for spike, worth follow-up):

JSON parsing by req.find("\"content\"", ...) (119-143) is brittle: the escaped-quote handling at line 137 mishandles \\" (backslash-backslash-quote) sequences, and "content" appearing in nested objects would be matched. The "no heavy parser dep" trade-off is fine for the spike; once the WASM loader (gemma-architect #736 step 2) starts sending real OAI-compat payloads, a follow-up to vendor nlohmann/json or roll a minimal proper tokenizer is advisable.
std::stoi on max_tokens substring (line 124) is uncaught — throws std::invalid_argument / std::out_of_range on malformed input. In Emscripten, uncaught C++ exceptions crash the wasm runtime. Wrap in try/catch or pre-validate.
Sampler labeled "Greedy" at line 165, but the chain is min_p(0.05) → temp(0.8) → dist(LLAMA_DEFAULT_SEED), which is stochastic (upstream llama.cpp treats LLAMA_DEFAULT_SEED as a request for a random seed, so runs are not reproducible). Comment-vs-code drift; either rename the comment or swap in llama_sampler_init_greedy() if argmax was intended.
n_ctx = 2048 hardcoded (line 85). gemma-architect's WebGPU path expanded to 16384 per #988. WASM loader call-site should override via init param; follow-up at #736 step 2.
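Two of the shortcuts above, the escaped-quote scan and the throwing integer parse, can be handled without pulling in a full JSON library. The sketch below is illustrative only: `read_json_string` and `parse_max_tokens` are hypothetical names, `\uXXXX` escapes are rejected rather than decoded, and the nested-object matching problem is not addressed.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// Read a JSON string literal whose opening quote is at text[pos].
// Each backslash consumes the character after it, so the sequence \\" is read
// as an escaped backslash followed by the closing quote.
static std::optional<std::string> read_json_string(std::string_view text, std::size_t pos) {
    if (pos >= text.size() || text[pos] != '"') {
        return std::nullopt;
    }
    std::string out;
    for (std::size_t i = pos + 1; i < text.size(); ++i) {
        const char c = text[i];
        if (c == '"') {
            return out;  // an unescaped quote closes the literal
        }
        if (c != '\\') {
            out += c;
            continue;
        }
        if (++i >= text.size()) {
            break;  // dangling backslash at end of input
        }
        switch (text[i]) {
            case '"':  out += '"';  break;
            case '\\': out += '\\'; break;
            case '/':  out += '/';  break;
            case 'n':  out += '\n'; break;
            case 'r':  out += '\r'; break;
            case 't':  out += '\t'; break;
            case 'b':  out += '\b'; break;
            case 'f':  out += '\f'; break;
            default:   return std::nullopt;  // includes \u in this sketch
        }
    }
    return std::nullopt;  // unterminated literal
}

// Parse a non-negative integer without throwing; return fallback on any error.
static int parse_max_tokens(std::string_view digits, int fallback) {
    int value = 0;
    const char * first = digits.data();
    const char * last  = first + digits.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc() || ptr != last || value < 0) {
        return fallback;
    }
    return value;
}
```

`std::from_chars` reports malformed or out-of-range input through its result instead of throwing, which avoids the uncaught-exception path that `std::stoi` opens. It requires C++17.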

Mergeable=CONFLICTING — needs rebase against fork master (upstream-llama.cpp churn carrying through, not Eli's diff conflict; the actual Eli contribution is 3 files / 265 added lines).

Advisory verdict: approve with the 4 follow-up notes above. Rebase + auto-merge per the standard auto-merge model.

— Leo

qnixsynapse and others added 30 commits April 13, 2026 09:44

sycl: disable Q1_0 in backend and cleanup unused variables (ggml-org#21807)

873c825

Remove extra conditional check on debug mode. (ggml-org#21798)

bafae27

webui: MCP Diagnostics improvements (ggml-org#21803)

227ed28

* Add MCP Connection diagnostics and CORS hint to web-ui * tidy up test * webui: Refactor and improve MCP diagnostic logging --------- Co-authored-by: evalstate <1936278+evalstate@users.noreply.github.com>

mtmd: use causal attn for gemma 4 audio (ggml-org#21824)

920b3e7

server: Expose build_info in router mode (ggml-org#21835)

ce8fd4b

common : add download cancellation and temp file cleanup (ggml-org#21813)

aa00911

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

ci: Also exempt 'security' tag from auto-close (ggml-org#21844)

a8bad38

chat: dedicated DeepSeek v3.2 parser + "official" template (ggml-org#21785)

1c0d908

docs: listing qwen3-asr and qwen3-omni as supported (ggml-org#21857)

e974923

* docs: listing qwen3-asr and qwen3-omni as supported * nits

common/gemma4 : handle parsing edge cases (ggml-org#21760)

e21cdc1

server: support OAI /v1/audio/transcriptions API (ggml-org#21863)

e489a5c

* server: support OAI /v1/audio/transcriptions API * address autoreview comments * correct default response_format value

vulkan: Support GGML_TYPE_NVFP4 (ggml-org#21455)

6a6780a

This adds nvfp4 support for get_rows, dequant, and mul_mat(_id). For mul_mat, it does not add support for the dp4/q8_1 path, it's all via fp16/fp32.

ggml : fix ARM NEON nvfp4 dot product on non-dotprod targets (ggml-org#21559)

2e05f06

vendor : update BoringSSL to 0.20260413.0 (ggml-org#21881)

be76dd0

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

metal : add XIELU unary op (ggml-org#20802)

aa0f189

ci : re-enable mac workflows (ggml-org#21894)

f4b5bf2

* ci : re-enable mac workflows * vulkan : fix compile warning

mtmd: add mtmd_image_tokens_get_decoder_pos() API (ggml-org#21851)

707c0b7

* mtmd: add mtmd_image_tokens_get_decoder_pos() API * consistent naming * fix build

metal : fix FA support logic (ggml-org#21898)

c0de6ed

ggml : remove ggml-ext.h (ggml-org#21869)

fae3a28

* ggml: correct placement of ggml-ext.h * ggml : remove ggml-ext.h --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

read n_ctx back after making llama_context (ggml-org#21939)

e39eba2

autoparser: support case of JSON_NATIVE with per-call markers (test case: Reka-Edge) (ggml-org#21892)

e1a9a6d

ci: disable test-backend-ops on Vulkan llvmpipe run and restore default timeout (ggml-org#21901)

8dc530b

Ooooze and others added 4 commits May 13, 2026 17:37

Merge pull request spiritbuun#13 from AtomicBot-ai/b1-mtp-qwen-rebase

8893692


Merge pull request spiritbuun#14 from AtomicBot-ai/b1-mtp-qwen-rebase

0a635dc

Enhance multimodal support and speculative decoding in atomic-llama-c…

github-actions Bot added the labels documentation, testing, examples, server, Apple Metal, ggml, python, script, model, Nvidia GPU, Vulkan, SYCL, IBM zDNN, AMD ZenDNN, build, devops, nix, jinja parser, Ascend NPU, OpenCL, Hexagon, WebGPU, OpenVINO on May 17, 2026

wordingone mentioned this pull request May 19, 2026

MTP-WASM step 2a: tools/wasm/ entrypoint exposing libllama.a inference via Emscripten wordingone/gemma-architect#737

Closed

5 tasks

wordingone mentioned this pull request May 19, 2026

MTP in-browser via turboquant WASM build (#674 follow-up) wordingone/gemma-architect#736

Closed

4 tasks

wordingone wants to merge 498 commits into AtomicBot-ai:master from wordingone:feat/wasm-llama-entrypoint
