Compute Core Modules

RTL source on GitHub

SystemVerilog sources documented on this page: GEMM_systolic_top.sv, GEMV_top.sv, CVO_top.sv, GEMM_dsp_unit.sv.

1. Matrix Core — Systolic Top

GEMM_systolic_top.sv wraps the 32 × 32 systolic array, whose cascade is split at row 16 into two 32 × 16 sub-chains. It receives weight tiles from HP0/HP1 and activation rows from the L2 cache, then streams the accumulated results to the post-processor.

2. Vector Core — GEMV Top

GEMV_top.sv instantiates 4 parallel GEMV cores. Each core has a 32-wide LUT-based MAC and a 5-stage reduction tree (Stage 1 uses 16 DSP48E2 slices; Stages 2–5 are LUT adders). Weights stream from HP2/HP3.
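The stage count follows from halving the 32 MAC outputs until one value remains. The Python sketch below is a behavioral model only, not the RTL: the function name `reduction_tree` is hypothetical, and it assumes one two-input adder per pair at each stage, so that Stage 1 needs 16 adders (matching the 16 DSP48E2 slices) and Stages 2–5 need 8, 4, 2 and 1.

```python
def reduction_tree(products):
    """Pairwise-add 32 values; return (total, adder count per stage).

    Behavioral model only. Assumes one two-input adder per pair,
    which is an illustration choice, not taken from GEMV_top.sv.
    """
    assert len(products) == 32, "the MAC is 32 lanes wide"
    level = list(products)
    adders_per_stage = []
    while len(level) > 1:
        # Each stage halves the number of partial sums.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        adders_per_stage.append(len(level))
    return level[0], adders_per_stage


total, adders = reduction_tree(list(range(32)))
print(total, adders)
```

With the inputs 0 through 31 the model returns the plain sum, and the per-stage adder counts come out as 16, 8, 4, 2, 1, which is five stages.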

See also

GEMV Core

3. CVO / SFU Core

CVO_top.sv orchestrates the CORDIC + LUT hybrid units for non-linear operations: exp, sqrt, gelu, sin, cos, reduce_sum, scale, recip. Precision is promoted to BF16/FP32 for all computations.
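For readers unfamiliar with CORDIC, the sketch below shows the textbook rotation-mode iteration for sin and cos in floating-point Python. It is not a model of CVO_top.sv: the iteration count `N_ITER`, the function name `cordic_sin_cos`, and the use of Python floats in place of the BF16/FP32 datapath are all assumptions made for illustration. The precomputed `ANGLES` table stands in for the kind of small lookup table a CORDIC unit typically carries.

```python
import math

N_ITER = 16  # hypothetical iteration count, not taken from the RTL

# Elementary rotation angles atan(2^-i), normally held in a small LUT.
ANGLES = [math.atan(2.0 ** -i) for i in range(N_ITER)]

# Accumulated CORDIC gain; pre-dividing by it keeps the result unit length.
GAIN = 1.0
for i in range(N_ITER):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_sin_cos(theta):
    """Return (sin, cos) for theta in [-pi/2, pi/2] using shift-add rotations."""
    x, y, z = 1.0 / GAIN, 0.0, theta
    for i in range(N_ITER):
        d = 1.0 if z >= 0 else -1.0  # rotate toward zero residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return y, x


s, c = cordic_sin_cos(math.pi / 6)
print(s, c)
```

In hardware the multiplications by 2^-i become arithmetic shifts, which is why the iteration needs only adders, shifters, and the angle table.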

4. DSP48E2 MAC Unit

GEMM_dsp_unit.sv implements the dual-channel W4A8 MAC using a single DSP48E2 slice. See DSP48E2 W4A8 Bit Packing and Sign Recovery for the bit-packing derivation.
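The general idea behind sharing one multiplier between two channels can be sketched in Python. Everything specific here is an assumption for illustration, not the layout used by GEMM_dsp_unit.sv: the sketch assumes two signed 4-bit weights share one signed 8-bit activation, that the weights are packed `SHIFT = 18` bits apart, and the helper names `pack_weights`, `dual_mac` and `to_signed` are hypothetical. The linked derivation page remains the reference for the actual bit positions.

```python
SHIFT = 18  # hypothetical field spacing; the real RTL layout may differ


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= 1 << (bits - 1) else value


def pack_weights(w_hi, w_lo):
    """Place two signed 4-bit weights in one multiplier operand."""
    return (w_hi << SHIFT) + w_lo


def dual_mac(w_hi, w_lo, act):
    """One multiply yields both products: (w_hi * act, w_lo * act)."""
    product = pack_weights(w_hi, w_lo) * act
    # The low field holds w_lo * act as a signed value.
    lo = to_signed(product, SHIFT)
    # Sign recovery: a negative low product borrows from the high field,
    # so remove it before shifting the high product down.
    hi = (product - lo) >> SHIFT
    return hi, lo


print(dual_mac(-8, 7, -128))
```

The subtraction before the shift is the sign-recovery step: without it, a negative low product would leave the high product off by one.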

Last verified against

Commit 773bd82 @ pccxai/pccx-FPGA-NPU-LLM-kv260 (2026-04-21).