structure sparse tcu merge N=8, 32 by yanggon-kim · Pull Request #323 · vortexgpgpu/vortex

yanggon-kim · 2026-02-24T06:07:32Z

2:4 Structured Sparsity Support for Vortex Tensor Core Unit

Summary

This PR adds 2:4 structured sparsity support to the Vortex TCU, enabling ~2x reduction in matrix-A bandwidth
and halved K-loop iterations for sparse GEMM workloads. The implementation spans RTL hardware, kernel
software, and a new regression test.

Key changes:

Separate sparse instruction (mma_struct_sparse_sync, funct3=1) — dense path (mma_sync, funct3=0) is fully
preserved
VX_tcu_meta.sv — Per-warp metadata SRAM storing 2:4 sparsity bitmasks, with runtime-writable meta_store
instruction
VX_tcu_sel.sv — B-column gather mux driven by sparsity metadata (supports int8, fp16, int4)
VX_tcu_core.sv — Dense/sparse mux: dense reads B directly, sparse goes through metadata + gather
VX_tcu_uops.sv — Micro-op sequencer with is_sparse latch and counter reset between mma_sync calls
vx_tensor.h — Template-based load_matrix_sync<layout, sparse> and mma_sync with load_metadata_sync
for runtime metadata upload
New test: sgemm_tcu_struct_sparse/ — Full regression test with real data-dependent 2:4 pruning, compression,
and metadata packing
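As a rough illustration of what the test's host side does (pruning, compression, metadata packing) and what the B-column gather computes, here is a minimal Python sketch. It is not the code in this PR: the function names, the magnitude-based pruning rule, and the nibble-per-group packing order are assumptions chosen for clarity.

```python
from typing import List, Tuple

GROUP = 4  # group size along K
KEEP = 2   # values kept per group (the "2" in 2:4)


def prune_and_compress(row: List[int]) -> Tuple[List[int], List[int]]:
    """Keep the 2 largest-magnitude values in every group of 4.

    Returns (values, masks): values holds the kept elements in K order,
    masks holds one 4-bit bitmask per group (bit i set = element i kept).
    """
    assert len(row) % GROUP == 0, "K must be a multiple of 4"
    values: List[int] = []
    masks: List[int] = []
    for base in range(0, len(row), GROUP):
        group = row[base:base + GROUP]
        # Rank by magnitude; ties go to the lower index so exactly 2 are kept.
        ranked = sorted(range(GROUP), key=lambda i: (-abs(group[i]), i))
        kept = sorted(ranked[:KEEP])
        masks.append(sum(1 << i for i in kept))
        values.extend(group[i] for i in kept)
    return values, masks


def pack_metadata(masks: List[int]) -> int:
    """Pack 4-bit masks into one integer, group 0 in the low nibble (assumed layout)."""
    word = 0
    for g, mask in enumerate(masks):
        word |= mask << (GROUP * g)
    return word


def sparse_dot(values: List[int], masks: List[int], b_col: List[int]) -> int:
    """Dot product that gathers only the B elements selected by the masks."""
    acc, v = 0, 0
    for g, mask in enumerate(masks):
        for i in range(GROUP):
            if mask & (1 << i):
                acc += values[v] * b_col[g * GROUP + i]
                v += 1
    return acc


def expand(values: List[int], masks: List[int]) -> List[int]:
    """Rebuild the pruned dense row; used here only as a reference."""
    row, v = [], 0
    for mask in masks:
        for i in range(GROUP):
            if mask & (1 << i):
                row.append(values[v])
                v += 1
            else:
                row.append(0)
    return row


a_row = [3, -1, 0, 7, 2, 2, -5, 1]
b_col = [1, 2, 3, 4, 5, 6, 7, 8]
values, masks = prune_and_compress(a_row)
dense_ref = sum(a * b for a, b in zip(expand(values, masks), b_col))
```

The compressed `values` list has K/2 entries, which is where the reduced matrix-A traffic and the halved K-loop come from. `sparse_dot` reads only the B elements the masks select and matches the dense product of the pruned row.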

Supported Configurations

┌───────────┬────────────────┬──────┬───────┐
│ Data Type │ Input → Output │ NT=8 │ NT=32 │
├───────────┼────────────────┼──────┼───────┤
│ INT8 │ int8 → int32 │ PASS │ PASS │
├───────────┼────────────────┼──────┼───────┤
│ FP16 │ fp16 → fp32 │ PASS │ PASS │
├───────────┼────────────────┼──────┼───────┤
│ INT4 │ int4 → int32 │ PASS │ PASS │
└───────────┴────────────────┴──────┴───────┘

How to Build and Run sgemm_tcu_struct_sparse

Prerequisites: Set up the build environment from the build/ directory:
cd build
export TOOLDIR=/opt # or your toolchain path
source ./ci/toolchain_env.sh

NT=8 Configurations

INT8/INT32 (NT=8):

Build test binary

make -C tests/regression/sgemm_tcu_struct_sparse clean
CONFIGS="-DNUM_THREADS=8 -DITYPE=int8 -DOTYPE=int32" make -C tests/regression/sgemm_tcu_struct_sparse

Run on RTLSim (default M=N=8, K=32)

CONFIGS="-DNUM_THREADS=8 -DEXT_TCU_ENABLE -DTCU_TYPE_DPI -DTCU_ITYPE_BITS=8" \
./ci/blackbox.sh --driver=rtlsim --app=sgemm_tcu_struct_sparse

Run with larger matrices

CONFIGS="-DNUM_THREADS=8 -DEXT_TCU_ENABLE -DTCU_TYPE_DPI -DTCU_ITYPE_BITS=8" \
./ci/blackbox.sh --driver=rtlsim --app=sgemm_tcu_struct_sparse --args="-m16 -n16 -k64"

FP16/FP32 (NT=8):
make -C tests/regression/sgemm_tcu_struct_sparse clean
CONFIGS="-DNUM_THREADS=8 -DITYPE=fp16 -DOTYPE=fp32" make -C tests/regression/sgemm_tcu_struct_sparse

CONFIGS="-DNUM_THREADS=8 -DEXT_TCU_ENABLE -DTCU_TYPE_DPI -DTCU_ITYPE_BITS=16" \
./ci/blackbox.sh --driver=rtlsim --app=sgemm_tcu_struct_sparse

INT4/INT32 (NT=8):
make -C tests/regression/sgemm_tcu_struct_sparse clean
CONFIGS="-DNUM_THREADS=8 -DITYPE=int4 -DOTYPE=int32" make -C tests/regression/sgemm_tcu_struct_sparse

CONFIGS="-DNUM_THREADS=8 -DEXT_TCU_ENABLE -DTCU_TYPE_DPI -DTCU_ITYPE_BITS=4" \
./ci/blackbox.sh --driver=rtlsim --app=sgemm_tcu_struct_sparse

NT=32 Configurations

For NT=32, tile sizes are larger (tileM=16, tileN=16), so minimum matrix dimensions are 16×16:

INT8/INT32 (NT=32):
make -C tests/regression/sgemm_tcu_struct_sparse clean
CONFIGS="-DNUM_THREADS=32 -DITYPE=int8 -DOTYPE=int32" make -C tests/regression/sgemm_tcu_struct_sparse

CONFIGS="-DNUM_THREADS=32 -DEXT_TCU_ENABLE -DTCU_TYPE_DPI -DTCU_ITYPE_BITS=8" \
./ci/blackbox.sh --driver=rtlsim --app=sgemm_tcu_struct_sparse --args="-m16 -n16 -k64"

FP16/FP32 (NT=32):
make -C tests/regression/sgemm_tcu_struct_sparse clean
CONFIGS="-DNUM_THREADS=32 -DITYPE=fp16 -DOTYPE=fp32" make -C tests/regression/sgemm_tcu_struct_sparse

CONFIGS="-DNUM_THREADS=32 -DEXT_TCU_ENABLE -DTCU_TYPE_DPI -DTCU_ITYPE_BITS=16" \
./ci/blackbox.sh --driver=rtlsim --app=sgemm_tcu_struct_sparse --args="-m16 -n16 -k32"

INT4/INT32 (NT=32):
make -C tests/regression/sgemm_tcu_struct_sparse clean
CONFIGS="-DNUM_THREADS=32 -DITYPE=int4 -DOTYPE=int32" make -C tests/regression/sgemm_tcu_struct_sparse

CONFIGS="-DNUM_THREADS=32 -DEXT_TCU_ENABLE -DTCU_TYPE_DPI -DTCU_ITYPE_BITS=4" \
./ci/blackbox.sh --driver=rtlsim --app=sgemm_tcu_struct_sparse --args="-m16 -n16 -k64"

Important Build Notes

-DTCU_ITYPE_BITS=N must match the ITYPE (8 for int8, 16 for fp16, 4 for int4) — it controls RTL's I_RATIO
and metadata width
-DTCU_TYPE_DPI is required for RTLSim builds
Always clean before switching ITYPE/OTYPE — stale object files cause silent mismatches
Dense tests (sgemm_tcu) are unaffected; the sparse instruction uses a separate funct3 encoding
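Because TCU_ITYPE_BITS has to track ITYPE by hand, a small generator can keep the two in sync. The sketch below is a hypothetical Python helper, not part of this PR. It only reproduces the command strings listed above and does not execute anything.

```python
# Mappings taken from the configurations listed in this PR description.
ITYPE_BITS = {"int8": 8, "fp16": 16, "int4": 4}
OTYPE_FOR = {"int8": "int32", "fp16": "fp32", "int4": "int32"}
TEST = "sgemm_tcu_struct_sparse"
TEST_DIR = f"tests/regression/{TEST}"


def build_configs(itype: str, num_threads: int) -> str:
    """CONFIGS value for building the test binary."""
    return f"-DNUM_THREADS={num_threads} -DITYPE={itype} -DOTYPE={OTYPE_FOR[itype]}"


def run_configs(itype: str, num_threads: int) -> str:
    """CONFIGS value for the RTLSim run; the bit width is derived from ITYPE."""
    return (f"-DNUM_THREADS={num_threads} -DEXT_TCU_ENABLE "
            f"-DTCU_TYPE_DPI -DTCU_ITYPE_BITS={ITYPE_BITS[itype]}")


def commands(itype: str, num_threads: int, args: str = "") -> list:
    """Return the clean, build, and run command lines for one configuration."""
    run = (f'CONFIGS="{run_configs(itype, num_threads)}" '
           f"./ci/blackbox.sh --driver=rtlsim --app={TEST}")
    if args:
        run += f' --args="{args}"'
    return [
        f"make -C {TEST_DIR} clean",  # always clean before switching types
        f'CONFIGS="{build_configs(itype, num_threads)}" make -C {TEST_DIR}',
        run,
    ]


if __name__ == "__main__":
    for line in commands("fp16", 32, "-m16 -n16 -k32"):
        print(line)
```

Run from the build/ directory, the printed lines can be pasted into a shell. An unsupported ITYPE raises KeyError instead of producing a mismatched bit width.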

The function was missing global barrier flag handling (bit 31 of bar_id). Other barrier functions in emulator.cpp already route global barriers to socket->get_barrier_phase(), but this getter did not, causing SimX test failures after the upstream barrier instruction merge.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Resolved conflict in kernel/include/vx_tensor.h: kept our mma_sync<sparse=true> template approach, dropped upstream's mma_sp_sync placeholder.

Key upstream changes integrated:
- TMA renamed to DXA (Data eXchange Accelerator)
- DRL renamed to TFR in TCU
- Barrier instruction encoding changed (arrive+wait → single)
- ASIC synthesis fixes (Synopsys/Yosys)
- mxint8 & fp8 support added

Verified: dense (sgemm_tcu) and sparse (sgemm_tcu_struct_sparse) RTL tests pass with int8/int32, NT=8.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… arithmetic

Replace the multiplier-based address calculation (step_m * HALF_K_STEPS + step_k) with a generate-if selecting pure bit-concatenation at elaboration time. This fixes the Verilator SELRANGE error when HALF_K_STEPS=1 (e.g. NT=4) without introducing combinational logic; all paths are wire routing only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
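The equivalence this commit relies on can be checked with a small Python model. This is a sketch under assumptions, not the RTL: it assumes HALF_K_STEPS is a power of two and step_k < HALF_K_STEPS, and it reads the HALF_K_STEPS=1 case as the one where step_k contributes no address bits. The function names are hypothetical.

```python
def addr_multiply(step_m: int, step_k: int, half_k_steps: int) -> int:
    """Original form: multiplier-based address."""
    return step_m * half_k_steps + step_k


def addr_concat(step_m: int, step_k: int, half_k_steps: int) -> int:
    """Bit-concatenation form, i.e. {step_m, step_k} in Verilog terms."""
    # Assumption: HALF_K_STEPS is a power of two, so step_k fits in log2 bits.
    assert half_k_steps >= 1 and half_k_steps & (half_k_steps - 1) == 0
    assert 0 <= step_k < half_k_steps
    k_bits = half_k_steps.bit_length() - 1
    if k_bits == 0:
        # HALF_K_STEPS == 1: step_k has zero width, so the address is step_m alone.
        return step_m
    return (step_m << k_bits) | step_k


def forms_agree(max_m: int = 8) -> bool:
    """Exhaustively compare both forms over a small parameter range."""
    for half_k_steps in (1, 2, 4, 8):
        for step_m in range(max_m):
            for step_k in range(half_k_steps):
                if addr_multiply(step_m, step_k, half_k_steps) != \
                        addr_concat(step_m, step_k, half_k_steps):
                    return False
    return True
```

The `k_bits == 0` branch plays the role of the generate-if: it picks the form that never slices a zero-width step_k field.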

yanggon-kim and others added 24 commits February 5, 2026 15:39

Add sgemm_tcu_struct_sparse test

d73a042

and update TCU files

Add sparse TCU support: VX_tcu_meta module and B-column mux

93752d2

Add sparse TCU support: B-column mux with VX_tcu_sel module

a580a6c

changed the cpu_ref function

aaa4a53

randomize the operands, fix the rtl index for b_col_1, b_col_2.

5164075

fp16 fp32 printing. This code works for int8 and int32

de3cd7a

fp16/fp32 done by claude

7a125dd

all pass with claude code

7630e3b

after all config passes, test the 0101/1010 two pattern sweep pass

815ee7c

new instruction working mma_struct_sparse_sync by claude code

eda31a1

prune and compress with fixed mask, fix matmul_cpu

bfcf24b

code minimization with same functionality

833bacf

loop code change

edd3361

NT=16 problem

1164bfe

meta_store new SRAM feeding instruction

6a203ac

real meta, dynamic meta generation and run

5bc74fb

separate the tcu only time using csr hardware count

348187e

past NT=16 clean up

0613bb2

comment from professor

741750a

Merge remote-tracking branch 'upstream/bug_fixes' into rtlsim_260203

9323558

# Conflicts:
#   hw/rtl/core/VX_uop_sequencer.sv
#   hw/rtl/tcu/VX_tcu_core.sv
#   tests/regression/sgemm_tcu/common.h
#   tests/regression/sgemm_tcu/kernel.cpp
#   tests/regression/sgemm_tcu/main.cpp

fix verilator lint warning for vld_mask after upstream merge

9a2c32d

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

yanggon-kim wants to merge 24 commits into vortexgpgpu:bug_fixes from yanggon-kim:pr_sparse_tcu_merge
