structure sparse tcu merge N=8, 32 #323

Open
yanggon-kim wants to merge 24 commits into vortexgpgpu:bug_fixes from yanggon-kim:pr_sparse_tcu_merge

@yanggon-kim
Collaborator

2:4 Structured Sparsity Support for Vortex Tensor Core Unit

Summary

This PR adds 2:4 structured sparsity support to the Vortex TCU, enabling a ~2x reduction in matrix-A bandwidth
and half as many K-loop iterations for sparse GEMM workloads. The implementation spans RTL hardware, kernel
software, and a new regression test.

Key changes:

  • Separate sparse instruction (mma_struct_sparse_sync, funct3=1) — dense path (mma_sync, funct3=0) is fully
    preserved
  • VX_tcu_meta.sv — Per-warp metadata SRAM storing 2:4 sparsity bitmasks, with runtime-writable meta_store
    instruction
  • VX_tcu_sel.sv — B-column gather mux driven by sparsity metadata (supports int8, fp16, int4)
  • VX_tcu_core.sv — Dense/sparse mux: dense reads B directly, sparse goes through metadata + gather
  • VX_tcu_uops.sv — Micro-op sequencer with is_sparse latch and counter reset between mma_sync calls
  • vx_tensor.h — Template-based load_matrix_sync<layout, sparse> and mma_sync with load_metadata_sync
    for runtime metadata upload
  • New test: sgemm_tcu_struct_sparse/ — Full regression test with real data-dependent 2:4 pruning, compression,
    and metadata packing
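
To make the pruning, compression, and metadata steps listed above concrete, here is a minimal host-side sketch for int8 data. It is not taken from this PR: the names `Group24`, `prune_group`, `compress_row`, and `sparse_dot` are hypothetical, magnitude-based selection is an assumed pruning rule, and the 4-bit-mask-per-group layout is only one plausible reading of "2:4 sparsity bitmasks". The actual packing in VX_tcu_meta.sv and the test may differ.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical container for one compressed 2:4 group: the two kept values
// plus a 4-bit mask whose set bits mark their positions in the group of four.
struct Group24 {
  std::array<int8_t, 2> vals;
  uint8_t mask;
};

// Assumed pruning rule: keep the two largest-magnitude elements of a group of
// four; ties go to the lower index. Kept values stay in their original order.
Group24 prune_group(const int8_t* a) {
  int first = 0;
  for (int i = 1; i < 4; ++i)
    if (std::abs(a[i]) > std::abs(a[first])) first = i;
  int second = (first == 0) ? 1 : 0;
  for (int i = 0; i < 4; ++i)
    if (i != first && std::abs(a[i]) > std::abs(a[second])) second = i;
  const int lo = first < second ? first : second;
  const int hi = first < second ? second : first;
  return Group24{{a[lo], a[hi]},
                 static_cast<uint8_t>((1u << lo) | (1u << hi))};
}

// Compress one row of A; the row length is assumed to be a multiple of 4.
std::vector<Group24> compress_row(const std::vector<int8_t>& row) {
  std::vector<Group24> out;
  for (std::size_t k = 0; k + 4 <= row.size(); k += 4)
    out.push_back(prune_group(&row[k]));
  return out;
}

// Software model of the gather: for each group, the mask selects which two of
// the four matching B elements are multiplied with the kept A values.
int32_t sparse_dot(const std::vector<Group24>& a, const std::vector<int8_t>& b) {
  int32_t acc = 0;
  for (std::size_t g = 0; g < a.size(); ++g) {
    int slot = 0;
    for (int i = 0; i < 4; ++i) {
      if (a[g].mask & (1u << i)) {
        acc += static_cast<int32_t>(a[g].vals[slot++]) *
               static_cast<int32_t>(b[4 * g + i]);
      }
    }
  }
  return acc;
}
```

Under these assumptions the row {1, -5, 3, 0, 7, 7, -8, 2} compresses to the values {-5, 3} with mask 0b0110 and {7, -8} with mask 0b0101. Only half of the A values are stored and only half of the multiplies are performed, which is where the bandwidth and K-loop savings described in the summary come from.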

Supported Configurations

┌───────────┬────────────────┬──────┬───────┐
│ Data Type │ Input → Output │ NT=8 │ NT=32 │
├───────────┼────────────────┼──────┼───────┤
│ INT8      │ int8 → int32   │ PASS │ PASS  │
├───────────┼────────────────┼──────┼───────┤
│ FP16      │ fp16 → fp32    │ PASS │ PASS  │
├───────────┼────────────────┼──────┼───────┤
│ INT4      │ int4 → int32   │ PASS │ PASS  │
└───────────┴────────────────┴──────┴───────┘

How to Build and Run sgemm_tcu_struct_sparse

Prerequisites: Set up the build environment from the build/ directory:
cd build
export TOOLDIR=/opt # or your toolchain path
source ./ci/toolchain_env.sh

NT=8 Configurations

INT8/INT32 (NT=8):

Build test binary

make -C tests/regression/sgemm_tcu_struct_sparse clean
CONFIGS="-DNUM_THREADS=8 -DITYPE=int8 -DOTYPE=int32" make -C tests/regression/sgemm_tcu_struct_sparse

Run on RTLSim (default M=N=8, K=32)

CONFIGS="-DNUM_THREADS=8 -DEXT_TCU_ENABLE -DTCU_TYPE_DPI -DTCU_ITYPE_BITS=8" \
  ./ci/blackbox.sh --driver=rtlsim --app=sgemm_tcu_struct_sparse

Run with larger matrices

CONFIGS="-DNUM_THREADS=8 -DEXT_TCU_ENABLE -DTCU_TYPE_DPI -DTCU_ITYPE_BITS=8" \
  ./ci/blackbox.sh --driver=rtlsim --app=sgemm_tcu_struct_sparse --args="-m16 -n16 -k64"

FP16/FP32 (NT=8):
make -C tests/regression/sgemm_tcu_struct_sparse clean
CONFIGS="-DNUM_THREADS=8 -DITYPE=fp16 -DOTYPE=fp32" make -C tests/regression/sgemm_tcu_struct_sparse

CONFIGS="-DNUM_THREADS=8 -DEXT_TCU_ENABLE -DTCU_TYPE_DPI -DTCU_ITYPE_BITS=16" \
  ./ci/blackbox.sh --driver=rtlsim --app=sgemm_tcu_struct_sparse

INT4/INT32 (NT=8):
make -C tests/regression/sgemm_tcu_struct_sparse clean
CONFIGS="-DNUM_THREADS=8 -DITYPE=int4 -DOTYPE=int32" make -C tests/regression/sgemm_tcu_struct_sparse

CONFIGS="-DNUM_THREADS=8 -DEXT_TCU_ENABLE -DTCU_TYPE_DPI -DTCU_ITYPE_BITS=4" \
  ./ci/blackbox.sh --driver=rtlsim --app=sgemm_tcu_struct_sparse

NT=32 Configurations

For NT=32, tile sizes are larger (tileM=16, tileN=16), so minimum matrix dimensions are 16×16:

INT8/INT32 (NT=32):
make -C tests/regression/sgemm_tcu_struct_sparse clean
CONFIGS="-DNUM_THREADS=32 -DITYPE=int8 -DOTYPE=int32" make -C tests/regression/sgemm_tcu_struct_sparse

CONFIGS="-DNUM_THREADS=32 -DEXT_TCU_ENABLE -DTCU_TYPE_DPI -DTCU_ITYPE_BITS=8" \
  ./ci/blackbox.sh --driver=rtlsim --app=sgemm_tcu_struct_sparse --args="-m16 -n16 -k64"

FP16/FP32 (NT=32):
make -C tests/regression/sgemm_tcu_struct_sparse clean
CONFIGS="-DNUM_THREADS=32 -DITYPE=fp16 -DOTYPE=fp32" make -C tests/regression/sgemm_tcu_struct_sparse

CONFIGS="-DNUM_THREADS=32 -DEXT_TCU_ENABLE -DTCU_TYPE_DPI -DTCU_ITYPE_BITS=16" \
  ./ci/blackbox.sh --driver=rtlsim --app=sgemm_tcu_struct_sparse --args="-m16 -n16 -k32"

INT4/INT32 (NT=32):
make -C tests/regression/sgemm_tcu_struct_sparse clean
CONFIGS="-DNUM_THREADS=32 -DITYPE=int4 -DOTYPE=int32" make -C tests/regression/sgemm_tcu_struct_sparse

CONFIGS="-DNUM_THREADS=32 -DEXT_TCU_ENABLE -DTCU_TYPE_DPI -DTCU_ITYPE_BITS=4" \
  ./ci/blackbox.sh --driver=rtlsim --app=sgemm_tcu_struct_sparse --args="-m16 -n16 -k64"

Important Build Notes

  • -DTCU_ITYPE_BITS=N must match the ITYPE (8 for int8, 16 for fp16, 4 for int4) — it controls RTL's I_RATIO
    and metadata width
  • -DTCU_TYPE_DPI is required for RTLSim builds
  • Always clean before switching ITYPE/OTYPE — stale object files cause silent mismatches
  • Dense tests (sgemm_tcu) are unaffected; the sparse instruction uses a separate funct3 encoding

yanggon-kim and others added 24 commits February 5, 2026 15:39
  and update TCU files
# Conflicts:
#	hw/rtl/core/VX_uop_sequencer.sv
#	hw/rtl/tcu/VX_tcu_core.sv
#	tests/regression/sgemm_tcu/common.h
#	tests/regression/sgemm_tcu/kernel.cpp
#	tests/regression/sgemm_tcu/main.cpp
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The function was missing global barrier flag handling (bit 31 of bar_id).
Other barrier functions in emulator.cpp already route global barriers to
socket->get_barrier_phase(), but this getter did not, causing SimX test
failures after the upstream barrier instruction merge.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolved conflict in kernel/include/vx_tensor.h: kept our
mma_sync<sparse=true> template approach, dropped upstream's
mma_sp_sync placeholder.

Key upstream changes integrated:
- TMA renamed to DXA (Data eXchange Accelerator)
- DRL renamed to TFR in TCU
- Barrier instruction encoding changed (arrive+wait → single)
- ASIC synthesis fixes (Synopsys/Yosys)
- mxint8 & fp8 support added

Verified: dense (sgemm_tcu) and sparse (sgemm_tcu_struct_sparse)
RTL tests pass with int8/int32, NT=8.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… arithmetic

Replace the multiplier-based address calculation (step_m * HALF_K_STEPS + step_k)
with a generate-if selecting pure bit-concatenation at elaboration time. This fixes
the Verilator SELRANGE error when HALF_K_STEPS=1 (e.g. NT=4) without introducing
combinational logic — all paths are wire routing only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
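
The address change in the commit above can be checked with a small software model. This is a C++ sketch rather than the SystemVerilog from the PR, and it assumes that HALF_K_STEPS is a power of two and that step_k is always less than HALF_K_STEPS. The function names `log2_pow2`, `addr_mul`, and `addr_concat` are hypothetical. Under those assumptions the multiplier form and the concatenation form produce the same address, and the HALF_K_STEPS=1 case reduces to step_m alone because step_k contributes zero bits.

```cpp
#include <cstdint>

// Number of address bits needed for `n` entries; `n` is assumed to be a
// power of two, so n == 1 yields 0 bits.
constexpr unsigned log2_pow2(unsigned n) {
  unsigned bits = 0;
  while ((1u << bits) < n) ++bits;
  return bits;
}

// Multiplier form quoted in the commit message.
constexpr unsigned addr_mul(unsigned step_m, unsigned step_k,
                            unsigned half_k_steps) {
  return step_m * half_k_steps + step_k;
}

// Concatenation form, modelling {step_m, step_k}. When half_k_steps == 1 the
// step_k field is zero bits wide, so the address is step_m by itself; this
// branch stands in for the generate-if selection described above.
constexpr unsigned addr_concat(unsigned step_m, unsigned step_k,
                               unsigned half_k_steps) {
  return log2_pow2(half_k_steps) == 0
             ? step_m
             : ((step_m << log2_pow2(half_k_steps)) | step_k);
}

// Exhaustively compare both forms over a small range of step_m values.
constexpr bool forms_agree(unsigned half_k_steps, unsigned m_steps) {
  for (unsigned m = 0; m < m_steps; ++m)
    for (unsigned k = 0; k < half_k_steps; ++k)
      if (addr_mul(m, k, half_k_steps) != addr_concat(m, k, half_k_steps))
        return false;
  return true;
}

static_assert(forms_agree(1, 8), "HALF_K_STEPS=1");
static_assert(forms_agree(2, 8), "HALF_K_STEPS=2");
static_assert(forms_agree(4, 8), "HALF_K_STEPS=4");
```

The equivalence breaks if HALF_K_STEPS is not a power of two, so that precondition would need to hold for every supported NT for the concatenation form to be a drop-in replacement.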