This folder is the canonical, per-model reference for every architecture that TensorSharp can run. Each card is a self-contained brief: it walks an engineer or researcher from "I have never heard of this model" all the way to "I can explain the forward graph and reproduce the inference path in TensorSharp." If you only need a top-level pointer, use the table below; otherwise jump into the individual cards.
Each card follows the same shape so you can diff architectures cleanly:
- Origin and intent — who designed the model, what the GGUF arch keys are, and which capabilities (modalities, thinking, tools) it exposes.
- Model architecture — the high-level block diagram, layer counts, and any per-layer heterogeneity.
- Forward graph — the exact ordered list of ops a single token (decode) and a multi-token sequence (prefill) flow through, including residuals and normalizations.
- Components — every sub-block (attention, FFN/SSM, routing, normalization, RoPE flavor, vision/audio encoder) explained in detail with the math that governs it.
- Parameters and settings — the GGUF metadata keys, weight tensor naming convention, and dtype expectations.
- TensorSharp implementation — pointers to the C# source files, the instantiation order, the cache layout, and the way the model plugs into ModelBase / Ops / native GGML kernels.
- Prefill optimization — chunking, fused per-layer kernels, parallelization, cross-layer caches.
- Decode optimization — fused single-call kernels, pre-resolved weight pointers, batched MoE, in-place kernels, cache reuse.
- Memory and KV cache strategy — circular vs. linear caches, mmap-backed weights, pre-allocated decode buffers.
- Multimodal pipeline — how images / audio / video are processed, encoded, and injected into the language model.
- Output / chat template — protocol parser, stop tokens, thinking / tool formats.
- Optimization opportunities — work that has not been done yet but that we know would unlock more performance or capability.
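
For readers new to the "circular vs. linear" distinction in the memory section of each card, the indexing difference can be sketched as follows. This is a minimal illustration only: `LinearSlot`, `CircularSlot`, and `OldestResident` are hypothetical helpers, not TensorSharp APIs, and the modulo mapping is an assumption about how a sliding-window cache is typically laid out rather than a description of the actual implementation.

```csharp
using System;

// Hypothetical helpers for illustration; these are not TensorSharp APIs.

// Linear cache: the slot index equals the absolute token position, so the
// buffer has to be sized for the whole context.
static int LinearSlot(int position) => position;

// Circular cache: only the last `window` positions are kept, and a new token
// overwrites the slot of the token that just left the window.
static int CircularSlot(int position, int window) => position % window;

// Oldest absolute position still resident after `written` tokens were stored.
static int OldestResident(int written, int window) => Math.Max(0, written - window);

const int swaWindow = 4;
for (int pos = 0; pos < 6; pos++)
{
    Console.WriteLine(
        $"pos {pos}: linear slot {LinearSlot(pos)}, circular slot {CircularSlot(pos, swaWindow)}");
}
Console.WriteLine($"oldest resident after 6 tokens: {OldestResident(6, swaWindow)}");
```

With a window of 4, positions 4 and 5 reuse slots 0 and 1, which is why a circular cache needs memory proportional to the window rather than to the context length.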
| Architecture | Card | Source class | GGUF keys | Modalities | Reasoning | Tools | Notable acceleration |
|---|---|---|---|---|---|---|---|
| Gemma 3 | gemma3.md | Gemma3Model | `gemma3` | Text, image | No | No | Alternating SWA / global attention, GeGLU FFN, QK-norm, V-norm |
| Gemma 4 | gemma4.md | Gemma4Model | `gemma4` | Text, image, video, audio | Yes | Yes | Single-graph fused decode (all layers in one GGML dispatch), fused per-layer prefill, chunked prefill, circular SWA cache, PLE, KV sharing, MoE variants |
| Qwen 3 | qwen3.md | Qwen3Model | `qwen3` | Text | Yes | Yes | Native whole-model decode with pre-resolved weight pointers |
| Qwen 3.5 / 3.6 family | qwen35.md | Qwen35Model | `qwen35`, `qwen35moe`, `qwen3next` | Text, image | Yes | Yes | Hybrid FullAttention + GatedDeltaNet recurrent, fused attention layer decode, fused prefill attention, fused output-projection + FFN, fused output-projection + norm + router, batched MoE (routed + shared + residual in a single kernel), fused vision encoder blocks |
| GPT OSS | gptoss.md | GptOssModel | `gptoss`, `gpt-oss` | Text | Yes (always) | No | Stacked MoE prefill kernel (mul_mat_id + add_id + swiglu_oai), attention sinks, MXFP4 expert weights |
| Nemotron-H | nemotron.md | NemotronModel | `nemotron_h`, `nemotron_h_moe` | Text, image (Omni-class) | Yes | Yes | Mamba2 + attention + MoE FFN hybrid stack, batched GPU MoE, optional Parakeet audio frontend, RADIO/v2_vl image encoder |
| Mistral 3 | mistral3.md | Mistral3Model | `mistral3` | Text, image | No | No | YaRN-corrected RoPE with position-dependent Q scaling, fused QKV / gate_up, Pixtral vision encoder |
Model code is intentionally backend-agnostic. ModelBase selects tensor
storage through BackendType and the registered execution plan, then delegates
the actual ops to the backend that owns those allocators:
| Backend type | Package | Notes |
|---|---|---|
| `Cpu` | `TensorSharp.Core` | Pure managed tensors with SIMD/managed quantized fast paths (RMSNorm, RoPE, softmax, fused activations, GEMM, dequant). |
| `Cuda` | `TensorSharp.Backends.Cuda` | Direct CUDA Driver-API allocator and storage, cuBLAS GEMM, PTX kernels for hot ops, native quantized matmul / get_rows for supported quant types, CPU fallback for ops that are not yet implemented. |
| `GgmlCpu` / `GgmlMetal` / `GgmlCuda` | `TensorSharp.Backends.GGML` + `TensorSharp.GGML.Native` | Native ggml bridge with quantized graph dispatch and platform backends. mmap-backed quantized weights are bound zero-copy through host-pointer buffers. |
When a card mentions a fused GGML kernel (for example Qwen35AttentionLayerDecode,
Gemma4LayerPrefill, or MoEExpertsSwiGLUResidual), the kernel is compiled from
TensorSharp.GGML.Native/ggml_ops_*.cpp and exposed through
TensorSharp.Backends.GGML/GgmlBasicOps.cs. The native bridge is the place to
look when a fused path engages on GGML CPU / Metal / CUDA but not on the pure
managed CPU or direct CUDA backends.
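
The rule of thumb in the paragraph above can be sketched as a small decision function. This is only an illustration under stated assumptions: the backend names are taken from the table, but `UsesGgmlBridge` and `SelectPath` are hypothetical helpers, and the string prefix check stands in for whatever capability test the real code performs.

```csharp
using System;

// Hypothetical helpers for illustration; these are not TensorSharp APIs.
// Backend names come from the backend table; the prefix check is an assumption.
static bool UsesGgmlBridge(string backendType) =>
    backendType.StartsWith("Ggml", StringComparison.Ordinal);

// A fused kernel is only a candidate when the backend goes through the native
// ggml bridge and a fused kernel exists for the op; otherwise fall back to
// the generic per-op path.
static string SelectPath(string backendType, bool fusedKernelExists) =>
    UsesGgmlBridge(backendType) && fusedKernelExists ? "fused" : "per-op";

foreach (var backend in new[] { "Cpu", "Cuda", "GgmlCpu", "GgmlMetal", "GgmlCuda" })
{
    Console.WriteLine($"{backend}: {SelectPath(backend, fusedKernelExists: true)}");
}
```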
| Feature | Gemma 3 | Gemma 4 | Qwen 3 | Qwen 3.5 / 3.6 family | GPT OSS | Nemotron-H | Mistral 3 |
|---|---|---|---|---|---|---|---|
| Layer type | Dense | Dense / MoE | Dense | Hybrid (Attn + Recurrent) ± MoE | MoE | Hybrid (Mamba2 + Attn + MoE FFN) | Dense |
| Attention | SWA + Global | SWA + Global | Full GQA | Full GQA + Sigmoid Gate | Full + Sinks | Full GQA (no RoPE) | Full GQA |
| FFN activation | GeGLU | GeGLU | SwiGLU | SwiGLU | SiLUAlphaLimit (clamped GLU) | ReLU² | SwiGLU |
| RoPE variant | NeoX (dual base) | NeoX + proportional / partial | NeoX | NeoX / MRoPE | NeoX + YaRN | None | GPT-J + YaRN |
| QK-norm | Yes | Yes | Yes | Yes | No | No | No |
| V-norm | No | Yes (unweighted) | No | No | No | No | No |
| Bias in projections | No | No | No | No | Yes (all linear) | No | No |
| Per-layer scaling | No | Yes | No | No | No | No | No |
| Per-Layer Embedding (PLE) | No | Yes | No | No | No | No | No |
| KV sharing | No | Yes (tail layers) | No | No | No | No | No |
| Attention sinks | No | No | No | No | Yes | No | No |
| Circular KV cache | No | Yes (SWA layers) | No | No | No | No | No |
| SSM / recurrent layers | No | No | No | Yes (GatedDeltaNet) | No | Yes (Mamba2) | No |
| Shared experts | No | No | No | Yes (qwen35moe / qwen3next) | No | Yes (optional) | No |
| Latent bottleneck FFN | No | No | No | No | No | Yes (optional) | No |
| Position-dependent Q scaling | No | No | No | No | No | No | Yes (with YaRN) |
| Vision | Yes | Yes | No | Yes | No | Yes (Omni) | Yes (Pixtral) |
| Audio | No | Yes | No | No | No | Yes (Parakeet, when mmproj present) | No |
| Video | No | Yes | No | No | No | No | No |
| Thinking | No | Yes | Yes | Yes | Yes (always) | Yes | No |
| Tool calling | No | Yes | Yes | Yes | No | Yes | No |
| Fused QKV | No | Yes | Yes | Mixed (full attention layers split, recurrent layers fuse a 5-way pack) | Yes | Yes | Yes |
| Fused single-graph decode | No | Yes (Gemma4ModelDecode) | Yes (TransformerModelDecode, native loop) | Per-layer fused (Qwen35AttentionLayerDecode, FusedOutProjFFN, FusedOutProjNormRouter) | Per-layer | Per-layer / batched MoE | No |
| Fused single-graph prefill | No | Yes (Gemma4LayerPrefill, dense layers) | No | Yes (FusedPrefillAttention, FusedOutProjFFN, MoE prefill) | Yes (MoE prefill via mul_mat_id) | No | No |
| Batched GPU MoE | n/a | Pending | n/a | Yes (routed + shared + residual fused) | Yes (stacked weight slabs) | Yes | n/a |
| Fused vision encoder | n/a | Standard | n/a | Yes (FusedVisionAttention + FusedVisionMLP) | n/a | Standard (RADIO ViT) | Standard (Pixtral) |
| Output parser | `PassthroughOutputParser` | `Gemma4OutputParser` | `Qwen3OutputParser` | `Qwen35OutputParser` | `HarmonyOutputParser` (always required) | `Qwen3OutputParser` | `PassthroughOutputParser` |
When you add a new model:
- Create `TensorSharp.Models/Models/<Name>/<Name>Model.cs` inheriting `ModelBase`.
- In the constructor: read GGUF metadata via `_gguf.GetXxx()`, call `ParseBaseConfig()` and `ParseTokenizer()`, call `LoadWeights()`, fuse weights, then initialize caches.
- Implement `Forward(int[] tokens) → float[]`: embedding → optional multimodal injection → transformer blocks → final norm → LM head → logit copy.
- Implement `ResetKVCache()` and `Dispose()`. Implement `TruncateKVCache()` when KV-cache reuse is supported.
- Register in `ModelBase.Create()` switch expression in `TensorSharp.Models/ModelBase.cs`.
- Add an `IOutputParser` implementation in `TensorSharp.Runtime/OutputParser.cs` if the model uses a non-standard output format and register it in `OutputParserFactory.Create()`.
- Add chat template support in `TensorSharp.Runtime/ChatTemplate.cs` / `Jinja2Template.cs` if the model uses a novel template format.
- Add a card under `docs/models/<name>.md` (and `<name>_zh-cn.md` if you want bilingual coverage), update this README's matrix, and link the card from the project root README.
- Update `TensorSharp.Server/testdata/` capability gates if the model exposes new modalities, thinking, or tool capabilities.
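
The registration step can be pictured with the sketch below. It is an assumption-laden illustration, not the contents of `ModelBase.Create()`: `ResolveModelClass` is a hypothetical function that returns class names as strings instead of constructing models, and the architecture keys and class names are copied from the overview table in this README.

```csharp
using System;

// Hypothetical stand-in for the registration switch; it maps a GGUF
// architecture key to the name of the source class listed in the overview
// table. The real factory presumably constructs the model instead.
static string ResolveModelClass(string arch) => arch switch
{
    "gemma3" => "Gemma3Model",
    "gemma4" => "Gemma4Model",
    "qwen3" => "Qwen3Model",
    "qwen35" or "qwen35moe" or "qwen3next" => "Qwen35Model",
    "gptoss" or "gpt-oss" => "GptOssModel",
    "nemotron_h" or "nemotron_h_moe" => "NemotronModel",
    "mistral3" => "Mistral3Model",
    // A new architecture would add its own arm above this line.
    _ => throw new NotSupportedException($"Unsupported GGUF architecture: {arch}")
};

Console.WriteLine(ResolveModelClass("qwen3next"));
Console.WriteLine(ResolveModelClass("gpt-oss"));
```

Keeping every alias of an architecture in a single arm means a new GGUF key for an existing model family is a one-word change, and an unknown key fails loudly instead of silently selecting the wrong model.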