SoC: RISC-V with a Custom Vector Processing Unit

Tags: System on Chip, Vector Processing Unit, RISC-V, TSMC 65nm

Introduction

Welcome to the SoC team page! We present a complete RISC-V System-on-Chip (SoC) implemented and taped out in TSMC 65-nm technology. This work was developed as part of Columbia University’s EE6350 VLSI Laboratory, and we gratefully acknowledge the supervision of Prof. Mingoo Seok, the generous support from Apple Inc., and the support of the course TAs.

The chip adopts a CPU–VPU heterogeneous architecture, integrating the open-source PicoRV32 RISC-V CPU and a custom-designed VPU, along with on-chip memory and essential I/O (UART, SPI), to accelerate data-parallel workloads such as vector operations and convolution-like kernels used in image processing and lightweight AI pipelines. The CPU provides system-level programmability and orchestration, while the VPU couples a SIMD datapath with a dedicated control plane for vector decode, memory sequencing, and PE scheduling. This separation enables straightforward CPU-only vs. VPU-offloaded benchmarking, making acceleration benefits easy to reproduce and quantify on silicon.

From RTL design and verification through synthesis, place-and-route, and pad integration, we completed an end-to-end ASIC flow and validated the chip via post-silicon board bring-up and testing. The result is a compact, software-programmable platform that demonstrates digital system integration and silicon-proven hardware/software co-design.

System Architecture

This diagram illustrates the architecture of our RISC-V based System-on-Chip (SoC). At its core, a RISC-V processor is paired with a dedicated Vector Processing Unit to handle accelerated parallel computations. These processing units communicate with system resources via a central AXI-Lite Interconnect. Essential external communication is handled through UART and SPI interfaces. Additionally, the design incorporates support infrastructure, including a Clock Generator, Finite State Machine (FSM), and a Scan Chain for hardware testing and debugging.

Key Modules

Design Flow

The following diagram outlines the complete RTL-to-GDSII design flow implemented for this project. It highlights the transition from behavioral Verilog RTL through Synthesis and Place & Route, utilizing industry-standard tools for verification at every stage.

RTL-to-GDSII Design Flow

RTL Design

The RTL (Register Transfer Level) design serves as the core of our digital implementation, bridging the gap between high-level algorithms and physical hardware.

Hierarchical Design & Verification

We adopted a rigorous Bottom-Up Implementation strategy. Design work starts with coding individual, low-level submodules. Once these leaf blocks are completed, they are integrated into larger parent modules, eventually culminating in the Top-Level SoC.

Verification follows a corresponding layered approach:

  • Unit-Level Verification: Submodules are first verified in isolation using dedicated testbenches to ensure local correctness.
  • Integration Verification: As blocks are assembled, verification focuses on interface protocols and data flow between modules.
  • Full-Chip Verification: The complete design is simulated (using tools like ModelSim, VCS, and Verdi) and cross-compared against the C reference model to validate system-wide functionality.

Version  Feature
0.0      Basic Model from PicoRV32 RISC-V Core
0.1      Integrated Instruction SRAM with memory wrappers
0.2      Integrated AXI-Lite Interconnect
0.3      Integrated Data SRAM with memory wrappers
0.4      Integrated Scan Chain
0.5      Integrated UART and SPI
0.6      Integrated Vector Processing Unit
0.7      Integrated Clock Generator, Scan Chain and FSM
0.8      Modified Vector Processing Unit for Smaller Size
0.9      Integrated Pad Frame and IO Cells
1.0      Verified Functionality from Chip Level and RTL Freeze

Synthesis

We used Synopsys Design Compiler for logic synthesis. Before running synthesis, we prepared a complete SDC file that defined all necessary constraints: clock definitions (clock name, period, uncertainty, and transition), input and output delays, drive and load constraints, operating conditions, and wire-load models.

Our synthesis strategy was bottom-up. We first synthesized several major submodules independently, such as the CPU core and the VPU module. The remaining modules were then synthesized together within the top-level design, with the pre-synthesized blocks marked as dont_touch to preserve their optimized structures. This process produced the complete set of netlist files for the entire chip.

Physical Design

Auto P&R

We used Cadence Innovus for physical design. The major steps of our place-and-route flow were: loading design files, floorplanning, power routing, cell placement, pre-CTS optimization, clock tree synthesis (CTS), signal routing, RC parasitic extraction, filler cell insertion, design verification, and final GDS export.
The physical hierarchy of the chip consists of submodules, the top module, the chip-level module, and the IO pads. Our place-and-route strategy followed the same bottom-up approach used during synthesis. The Dmem and Imem blocks used existing layout macros, while modules such as the SPI block, scan chain, test FSM, coprocessor, and CPU core were placed and routed independently. The remaining modules were integrated and routed within the top-level design.

Physical design flow

The final result is the complete chip-level layout, as shown in the figure. The two memory blocks are placed on opposite ends of the layout, while the central region contains key macros such as the CPU core and the VPU. The scan chain and clock generator are positioned toward the outer region but kept as close as possible to the center to minimize clock skew. Surrounding the entire design are the required IO pads that interface the chip with the external system.

Chip layout labeled

IO Pad

The figure below shows the IO pad integration for the LQFP64L package. We have 11 + 16 + 1 = 28 IOs, covering the input/output signals (SPI, UART, Scan_Chain, rst_n, clk) as well as POC, CVDD, DVDD, VDD, VDD_TEST, and GND. We kept the IO count as low as possible to leave more area for logic circuits.

IO pad integration

Package

Below are pictures of our taped-out chip, showing the bonding wires and the package.


STA & Sign-off

During the STA phase, we used PrimeTime with a target operating frequency of 100 MHz. After place-and-route, the clock uncertainty was controlled within approximately 0.5 ns. According to the timing reports, all critical paths within the top module met their setup and hold time requirements, and no setup or hold violations were observed. In addition, the asynchronous reset signals passed all recovery and removal checks without violations.

STA and sign-off summary

After completing the physical design, we performed full signoff verification on each submodule and on the top-level design. Using Mentor Calibre, we ran DRC, LVS, ESD, and antenna rule checks and confirmed that all results were clean. Only after passing all signoff criteria did we export the final GDS files for the chip.

PPA Optimization

In our chip design, we placed strong emphasis on PPA (Power, Performance, Area) optimization. Several low-power techniques were applied throughout the design stages:

  • Fixed Voltage Domains: The chip uses fixed voltage domains, with 2.5V for IO pads and 1.0V for the core logic, ensuring stable and efficient power distribution.
  • Clock Division: We applied clock division techniques. For example, the SPI module operates on a divided clock derived from the main clock, reducing switching activity and lowering dynamic power consumption.
  • RTL Resource Sharing: At the RTL level, we adopted resource sharing, reusing registers and functional units whenever possible to reduce redundant hardware.
  • State Encoding Optimizations: State encoding optimizations were applied, such as using Gray code instead of conventional binary encoding to minimize bit toggles and reduce dynamic power.
  • Automatic Clock Gating: During synthesis, automatic clock gating was enabled to shut off clocks when modules are idle. Although this increases area slightly, it significantly reduces dynamic power consumption.
  • High-Vth Cell Insertion: In the physical design stage, we selectively replaced LVT (Low-Vth) cells with HVT (High-Vth) cells, which reduces leakage power at the cost of some performance and area overhead.

These combined strategies allowed us to effectively manage power while maintaining overall performance and design efficiency.
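
To make the state-encoding point concrete, here is a minimal sketch that compares bit toggles for binary and Gray encodings. It assumes a hypothetical 8-state FSM that steps through its states in order and wraps around; the actual FSMs on the chip and their transition patterns are not described here, and the benefit shrinks for FSMs that jump between non-adjacent states.

```python
def binary_to_gray(n: int) -> int:
    """Convert a binary-coded state index to its Gray-code equivalent."""
    return n ^ (n >> 1)


def total_toggles(codes):
    """Count register bit flips along one cyclic walk through the state codes."""
    pairs = zip(codes, codes[1:] + codes[:1])
    return sum(bin(a ^ b).count("1") for a, b in pairs)


# Hypothetical 8-state FSM visiting its states in order, then wrapping to 0.
binary_codes = list(range(8))
gray_codes = [binary_to_gray(i) for i in binary_codes]

print("binary toggles per loop:", total_toggles(binary_codes))
print("gray toggles per loop:  ", total_toggles(gray_codes))
```

For this sequential walk, every Gray-coded transition flips exactly one bit, whereas binary encoding flips up to three bits on carries and on the wrap-around.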

Software & Testing Flow

The software and testing flow converts C-level test programs into memory-mapped initialization data, generates scan sequences, and loads them onto silicon through an FPGA-based scan controller. This unified environment supports both pre-silicon verification and post-silicon validation of the CPU, memory system, and VPU.

Software Testing Flow Diagram

Software Development

Test applications are written in C and compiled using a customized RISC-V GCC toolchain aligned with the chip's IMEM/DMEM memory map. The linker output is automatically converted into test.h, which contains the initialization arrays for instruction and data memory. These files form the common interface between software and the scan-based testing environment.
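
The conversion step can be sketched roughly as follows. This is an illustration in Python, not the project's actual script: it assumes the linker output has already been flattened to raw little-endian binary images, and the array names `imem_init` and `dmem_init` are hypothetical stand-ins for whatever test.h really contains.

```python
import struct


def words_from_binary(blob: bytes):
    """Split a flat binary image into little-endian 32-bit words (zero-padded tail)."""
    padded = blob + b"\x00" * (-len(blob) % 4)
    return [w for (w,) in struct.iter_unpack("<I", padded)]


def c_array(name: str, words, per_line: int = 4) -> str:
    """Render a list of 32-bit words as a C uint32_t array definition."""
    lines = []
    for i in range(0, len(words), per_line):
        chunk = ", ".join(f"0x{w:08x}" for w in words[i:i + per_line])
        lines.append(f"    {chunk},")
    body = "\n".join(lines)
    return f"static const uint32_t {name}[{len(words)}] = {{\n{body}\n}};\n"


def make_test_header(imem: bytes, dmem: bytes) -> str:
    """Build header text with one init array per memory (names are assumptions)."""
    parts = [
        "#include <stdint.h>\n",
        c_array("imem_init", words_from_binary(imem)),
        c_array("dmem_init", words_from_binary(dmem)),
    ]
    return "\n".join(parts)


# Two RV32I instructions (nop, ret) as IMEM contents and one data word.
header = make_test_header(bytes.fromhex("1300000067800000"), bytes.fromhex("efbeadde"))
print(header)
```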

The software framework allows different test programs to be executed on silicon, including routines that stress the CPU pipeline, memory system, or vector/AI units. Inline RISC-V assembly is supported for low-level testing of microarchitectural features.

Testing Methodology

MATLAB scripts serialize the IMEM/DMEM initialization arrays into cycle-accurate scan-in command sequences. These command streams are transferred to an FPGA board over USB, where a Vitis-based scan controller shifts data into the chip’s scan chain. After execution, the FPGA retrieves scan_out data for functional comparison and validation.
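
The serialization and comparison steps can be illustrated with a short sketch. It is written in Python rather than the MATLAB used in the project, and the frame layout (a 13-bit word address followed by 32 data bits, MSB first) is an assumption chosen to fit a 32KB word-addressed memory; the real scan frame format is not documented here.

```python
def word_to_bits(value: int, width: int, msb_first: bool = True):
    """Expand an integer into a list of 0/1 bits of the given width."""
    bits = [(value >> i) & 1 for i in range(width)]
    return bits[::-1] if msb_first else bits


def scan_frames(words, base_addr: int = 0, addr_bits: int = 13, data_bits: int = 32):
    """Yield one scan-in frame per memory word: address bits, then data bits.

    The frame layout is hypothetical and only illustrates the idea.
    """
    for offset, word in enumerate(words):
        yield word_to_bits(base_addr + offset, addr_bits) + word_to_bits(word, data_bits)


def compare_scan_out(expected, observed):
    """Return (index, expected, observed) for every mismatching word."""
    return [(i, e, o) for i, (e, o) in enumerate(zip(expected, observed)) if e != o]


frames = list(scan_frames([0x00000013, 0x00008067]))
print(len(frames), "frames of", len(frames[0]), "bits each")
print(compare_scan_out([1, 2, 3], [1, 0, 3]))
```

An empty list from `compare_scan_out` means the scanned-out memory matches the reference.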

PCB Design

We used a custom PCB for our final demonstration. The PCB provides power, voltage stabilization, and decoupling for the chip.

Schematic

The PCB schematic has three main parts: the main circuit, the power supply circuit, and the decoupling circuit.

PCB schematic

In the main circuits, we created the symbol for our chip. The footprint of our chip is LQFP-64 (10×10 mm, 0.5 mm pitch). Some pins are not used. Reset IO pins are connected to switches, so we can control reset to VDD/GND by hand (asynchronous reset). DVDD powers IO pads, and VDD powers the core. All resistors and capacitors use 1206 footprints. The PCB supports two power modes (battery or FPGA), selectable by a switch.

Power supply block

In the power supply circuit, the LDO stage steps the 5V input down to 2.5V and 1V. The resistor and capacitor values for input/output filtering follow the TI reference recommendations for the TPS71710CDR.

LDO and decoupling details

The decoupling circuit is designed to reduce supply noise during operation. The number of capacitors is determined by the requirements of each power net.

The main circuit connects the chip to other board components. We used jumpers for debugging and left pins for FPGA connections. The PCB was manufactured by JLCPCB, and soldering was performed manually.

PCB Layout

The PCB layout and the assembled PCB are shown below.

PCB layout
Assembled PCB

Demonstration

Demonstration Overview
We demonstrate a silicon-proven CPU–VPU heterogeneous RISC-V SoC on fabricated silicon. The CPU orchestrates program control and system flow, while a custom Vector Processing Unit (VPU) accelerates data-parallel kernels. This platform enables direct, reproducible comparisons between scalar execution and vector offload under the same memory map, I/O path, and workload code structure.

Demo Setup: PCB + FPGA Scan + MATLAB Control
The chip is mounted on a custom PCB and connected to an FPGA through a level shifter. The FPGA handles low-level scan-in/scan-out pin toggling and basic control signals. A MATLAB-based host script orchestrates the full loop: it streams the program and input data into on-chip memories, triggers execution, then retrieves scan-out data to visualize outputs and compute performance metrics.

Demo setup: PCB + FPGA Scan + MATLAB Control

Measurement Method: CPU-only vs. CPU+VPU
Each workload is executed in two modes using the same inputs, same memory map, and identical output locations:
  • CPU-only: scalar inner loops on PicoRV32 (no vector offload).
  • CPU+VPU: inner loops offloaded through custom PCPI vector instructions.
Speedup is defined as:
Speedup = Cycles(CPU-only) / Cycles(CPU+VPU).
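
As a small worked example of this definition, the helper below computes the ratio from two cycle counts. The function name and the numbers are made up for illustration and are not measured values from the chip.

```python
def speedup(cycles_cpu_only: int, cycles_cpu_vpu: int) -> float:
    """Speedup = Cycles(CPU-only) / Cycles(CPU+VPU)."""
    if cycles_cpu_only <= 0 or cycles_cpu_vpu <= 0:
        raise ValueError("cycle counts must be positive")
    return cycles_cpu_only / cycles_cpu_vpu


# Illustrative cycle counts only, not measurements.
print(speedup(600_000, 100_000))  # 6.0: the offloaded run is six times faster
print(speedup(90_000, 100_000))   # 0.9: below 1 means the offload made things slower
```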

Demo Flow
We prepared three workloads that cover high, medium, and low VPU utilization. The goal is to show very clearly how VPU utilization translates into end-to-end speedup on our chip.

Demo 1 — CNN-style 4-Channel 3×3 Convolution

Demo 1 is a CNN-style 4-channel 3×3 convolution. Here almost every pixel goes through two 3×3 convolutions on four feature maps, so the code is very MAC-heavy. In this case, the VPU is active for around 80% of the total cycles, and we see a large speedup compared to the CPU-only baseline.

Demo1: 'CPU-only' Mode
Demo1: 'CPU+VPU' Mode
Demo1: End-to-end Speedup (CPU-only vs. CPU+VPU)

Demo 2 — Sobel Edge Detection (Full-frame vs. ROI-Sobel)

Demo 2 is edge detection with Sobel filters. We show two versions: full-frame Sobel, where we run the VPU convolution on every pixel, and ROI-Sobel, where we first do a cheap scalar pre-check and only call the VPU on high-gradient regions. Full-frame Sobel keeps the VPU busy for about 60% of the time, while ROI-Sobel is lower, at around 45%. This demo highlights how, even within the same algorithm family, changing the amount of VPU work shifts the speedup.
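
The difference between the two variants can be sketched with a plain-Python reference model. The pre-check criterion used here (sum of central differences against a threshold) is an assumption for illustration, since the actual scalar test is not specified, and `conv3x3_at` merely stands in for the work the VPU would do.

```python
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def conv3x3_at(img, r, c, kernel):
    """3x3 multiply-accumulate centred on (r, c); stand-in for the VPU kernel."""
    return sum(kernel[i][j] * img[r - 1 + i][c - 1 + j]
               for i in range(3) for j in range(3))


def sobel_magnitude(img, r, c):
    """|Gx| + |Gy| approximation of the gradient magnitude."""
    return abs(conv3x3_at(img, r, c, SOBEL_X)) + abs(conv3x3_at(img, r, c, SOBEL_Y))


def cheap_precheck(img, r, c, threshold):
    """Hypothetical scalar test: is the local central difference large enough?"""
    dx = abs(img[r][c + 1] - img[r][c - 1])
    dy = abs(img[r + 1][c] - img[r - 1][c])
    return dx + dy >= threshold


def sobel(img, roi_threshold=None):
    """Return (edge_map, vector_calls); roi_threshold=None means full-frame."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    calls = 0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            if roi_threshold is not None and not cheap_precheck(img, r, c, roi_threshold):
                continue  # skip the convolution for low-gradient pixels
            out[r][c] = sobel_magnitude(img, r, c)
            calls += 1
    return out, calls


# 6x6 test image with a vertical step edge between columns 2 and 3.
img = [[0, 0, 0, 100, 100, 100] for _ in range(6)]
full_map, full_calls = sobel(img)
roi_map, roi_calls = sobel(img, roi_threshold=50)
print("full-frame convolution calls:", full_calls)
print("ROI convolution calls:       ", roi_calls)
```

On this test image the ROI variant produces the same edge map with half the convolution calls. On real images a pre-check like this one can miss weak edges, so the outputs are not guaranteed to match in general.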

Demo2 ROI: 'CPU-only' Mode
Demo2 ROI: 'CPU+VPU' Mode
Demo2 ROI: End-to-end Speedup (CPU-only vs. CPU+VPU)

Demo2 Full-frame: 'CPU-only' Mode
Demo2 Full-frame: 'CPU+VPU' Mode
Demo2 Full-frame: End-to-end Speedup (CPU-only vs. CPU+VPU)

Demo 3 — 4-Level Posterize Filter

Demo 3 is a 4-level posterize filter. This is mostly simple per-pixel quantization with almost no vector multiply or convolution. VPU utilization is only about 5%, so we see very little acceleration, and the offloaded version is sometimes even slower: the PCPI handshake and data-movement cost exceeds the cost of doing the few adds and shifts directly on the CPU. In those runs, end-to-end cycles are higher with the VPU enabled than without it.
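
To show how little arithmetic a 4-level posterize needs, here is a minimal sketch using only shifts and ORs. The output levels (0, 85, 170, 255) are one common choice and an assumption here; the exact mapping used in the demo is not stated.

```python
def posterize4(pixel: int) -> int:
    """Quantize an 8-bit pixel to one of four levels using shifts and ORs only."""
    if not 0 <= pixel <= 255:
        raise ValueError("pixel must be an 8-bit value")
    level = pixel >> 6  # top two bits select the level: 0..3
    # Replicate the 2-bit level across the byte: 0, 85, 170, or 255.
    return (level << 6) | (level << 4) | (level << 2) | level


def posterize_image(img):
    """Apply the per-pixel quantization to a 2-D list of 8-bit pixels."""
    return [[posterize4(p) for p in row] for row in img]


print(posterize_image([[0, 63, 64], [128, 200, 255]]))
```

Each pixel costs a handful of scalar operations with no multiplies, which is why a fixed per-call offload cost can easily outweigh any vector benefit for this workload.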

Demo3: 'CPU-only' Mode
Demo3: 'CPU+VPU' Mode
Demo3: End-to-end Speedup (CPU-only vs. CPU+VPU)

So, across these three workloads, we sweep from VPU-heavy to light, and you can directly see how higher VPU utilization gives better speedup on our architecture.

Results Summary

The table below summarizes our results. For the CNN-style 4-channel 3×3 convolution, the VPU is busy about 80% of the time (high utilization) and we see roughly 6× speedup. For full-frame Sobel, VPU utilization drops to about 60% (medium utilization) and the speedup is around 3×; with ROI Sobel, utilization is around 45% and the speedup is about 2×. Finally, the 4-level posterize workload uses the VPU only about 5% of the time (low utilization), so the speedup is basically 1×: overhead dominates.

Overall, speedup clearly scales with vector intensity: the more time we spend in VPU-friendly vector math, the more benefit we get from our custom VPU. We reached the same conclusion after processing a large number of images and other data.

Workload                 VPU Utilization     Speedup
CNN-style 4ch 3×3 Conv   ~80% (High)         ~6×
Sobel (Full-frame)       ~60% (Medium)       ~3×
Sobel (ROI)              ~45% (Medium–Low)   ~2×
Posterize (4-level)      ~5% (Low)           ~1×
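
The trend in the table can be checked mechanically. The sketch below uses the approximate figures from the table (not raw cycle counts) and simply confirms that speedup never decreases as utilization increases.

```python
# (workload, approximate VPU utilization, approximate speedup) from the table.
results = [
    ("CNN-style 4ch 3x3 Conv", 0.80, 6.0),
    ("Sobel (Full-frame)", 0.60, 3.0),
    ("Sobel (ROI)", 0.45, 2.0),
    ("Posterize (4-level)", 0.05, 1.0),
]

# Sort by utilization and check that speedup is non-decreasing along that order.
by_util = sorted(results, key=lambda row: row[1])
speedups = [s for _, _, s in by_util]
is_monotonic = all(a <= b for a, b in zip(speedups, speedups[1:]))

for name, util, s in by_util:
    print(f"{name:24s} util={util:.0%} speedup={s:.0f}x")
print("speedup non-decreasing with utilization:", is_monotonic)
```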

Video Demonstration

Chip Specifications

Parameter                    Specification
Technology                   TSMC 65nm LP
Supported ISAs               RV32I and custom vector ISA
Memory (SRAM)                32KB data and 32KB instruction
Peripherals                  SPI and UART
Programming Interface        Scan chain
Die Size                     2.0 mm × 1.1 mm
Gate Count                   ~766,000 gates
Maximum Operating Frequency  252.52 MHz
Supply Voltage               1.0V core, 2.5V I/O
Power Consumption            13 mW typical
Performance Metric           See the Demonstration section
Package                      LQFP-64

Internal Clock Frequency Table

div[0:3] fc[0:4] Frequency (Hz)
1111 11111 9.77k
1111 01010 21.12k
1111 00101 30.21k
1110 01010 42.33k
1110 00101 59.89k
1101 01010 83.92k
1101 00101 119.22k
1100 01010 167.63k
1100 00101 237.11k
1011 01010 333.46k
1011 00101 471.12k
1010 01010 665.49k
1010 00101 943.08k
1001 01010 1.32M
1001 00101 1.89M
1000 01010 2.64M
1000 00101 3.80M
0111 01010 5.29M
0111 00101 7.55M
0110 01010 10.64M
0110 00101 15.09M
0101 01010 21.67M
0101 00101 31.21M
0100 01010 44.50M
0100 00101 64.12M
0011 01010 90.78M
0011 00101 118.57M
0010 01010 178.23M
0010 00101 252.52M
0001 01010 349.75M
0001 00101 493.33M
0000 01010 680.21M
0000 00101 950.88M
0000 00000 1.52G
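
When choosing a clock setting during bring-up, a small lookup helper over this table can be convenient. The sketch below takes the table values as-is and picks the entry closest to a target frequency; the helper names are hypothetical, and the frequencies observed on a given chip may differ from these listed values.

```python
TABLE = """
1111 11111 9.77k
1111 01010 21.12k
1111 00101 30.21k
1110 01010 42.33k
1110 00101 59.89k
1101 01010 83.92k
1101 00101 119.22k
1100 01010 167.63k
1100 00101 237.11k
1011 01010 333.46k
1011 00101 471.12k
1010 01010 665.49k
1010 00101 943.08k
1001 01010 1.32M
1001 00101 1.89M
1000 01010 2.64M
1000 00101 3.80M
0111 01010 5.29M
0111 00101 7.55M
0110 01010 10.64M
0110 00101 15.09M
0101 01010 21.67M
0101 00101 31.21M
0100 01010 44.50M
0100 00101 64.12M
0011 01010 90.78M
0011 00101 118.57M
0010 01010 178.23M
0010 00101 252.52M
0001 01010 349.75M
0001 00101 493.33M
0000 01010 680.21M
0000 00101 950.88M
0000 00000 1.52G
"""

SUFFIX = {"k": 1e3, "M": 1e6, "G": 1e9}


def parse_freq(text: str) -> float:
    """Convert strings like '9.77k' or '1.52G' to a frequency in Hz."""
    return float(text[:-1]) * SUFFIX[text[-1]]


def load_settings(table: str = TABLE):
    """Parse the table into (div, fc, frequency_hz) tuples."""
    rows = []
    for line in table.split("\n"):
        if line.strip():
            div, fc, freq = line.split()
            rows.append((div, fc, parse_freq(freq)))
    return rows


def closest_setting(target_hz: float, rows=None):
    """Return the table row whose listed frequency is nearest to target_hz."""
    rows = rows if rows is not None else load_settings()
    return min(rows, key=lambda row: abs(row[2] - target_hz))


div, fc, freq = closest_setting(100e6)
print(f"closest to 100 MHz: div={div} fc={fc} ({freq / 1e6:.2f} MHz)")
```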

Conclusions

We taped out and demonstrated a silicon-proven CPU–VPU heterogeneous RISC-V SoC in TSMC 65 nm, featuring an in-house, custom-designed VPU attached to PicoRV32 via PCPI, along with on-chip IMEM/DMEM SRAM, essential I/O, and a scan-based bring-up/measurement flow. Live demos on fabricated silicon validate end-to-end hardware/software co-design and reveal a clear trend: end-to-end speedup increases monotonically with VPU utilization (vector intensity)—from ~1× on vector-light workloads to ~6× on convolution/MAC-heavy workloads. This behavior matches our architectural intent: the CPU orchestrates control flow while the VPU accelerates regular, data-parallel inner loops.

Key Achievements

  • In-house custom VPU microarchitecture: Designed and implemented the VPU datapath and controller from scratch, including dedicated decode and multi-cycle control FSMs that manage instruction sequencing, base/stride address generation, memory handshaking, and PE scheduling—reducing CPU-side loop and address-update overhead.
  • Custom Vector ISA + PCPI offload path: Created a custom vector ISA and integrated it through PicoRV32’s PCPI interface, enabling seamless CPU-issued vector offloads and clean, reproducible CPU-only vs. CPU+VPU benchmarking under the same workload structure and memory map.
  • Silicon-proven SoC integration: PicoRV32 + custom VPU + on-chip IMEM/DMEM + peripherals were fully integrated and validated on fabricated silicon.
  • Demo-ready post-silicon infrastructure: A repeatable MATLAB(PC) ⇄ FPGA(PMOD) ⇄ level shifter ⇄ PCB ⇄ chip pipeline supports scan-in / run / scan-out, automated visualization, and cycle-accurate performance measurement.
  • Measured end-to-end scaling: CNN-style multi-channel convolution achieves the highest utilization and the largest speedup, while ROI-based and vector-light workloads show proportionally smaller gains as fixed offload and data-movement overhead becomes visible. As a result, our architecture is particularly well-suited for CNN-style AI kernels where computation is dominated by regular, vectorizable MAC patterns, making this SoC a compact AI-acceleration chip for convolution-heavy workloads.
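
For readers unfamiliar with how a custom instruction reaches a PCPI coprocessor: PicoRV32 presents instruction words it does not implement on its PCPI port, so an accelerator can claim encodings in an otherwise unused opcode space. The sketch below packs and unpacks a standard RISC-V R-type word using the custom-0 major opcode. The specific opcode, funct3, and funct7 values used by our vector ISA are not listed on this page, so the field values here are placeholders.

```python
CUSTOM0_OPCODE = 0b0001011  # RISC-V "custom-0" major opcode

# R-type layout from MSB to LSB: funct7 | rs2 | rs1 | funct3 | rd | opcode
FIELDS = (("funct7", 7), ("rs2", 5), ("rs1", 5), ("funct3", 3), ("rd", 5), ("opcode", 7))


def encode_rtype(funct7, rs2, rs1, funct3, rd, opcode=CUSTOM0_OPCODE) -> int:
    """Pack R-type fields into a 32-bit instruction word."""
    values = {"funct7": funct7, "rs2": rs2, "rs1": rs1,
              "funct3": funct3, "rd": rd, "opcode": opcode}
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode_rtype(word: int) -> dict:
    """Unpack a 32-bit instruction word into its R-type fields."""
    fields = {}
    shift = 32
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


# Placeholder encoding: funct7=1 as a made-up vector op, rd=x10, rs1=x11, rs2=x12.
insn = encode_rtype(funct7=1, rs2=12, rs1=11, funct3=0, rd=10)
print(f"0x{insn:08x}", decode_rtype(insn))
```

In a C test program, a word produced this way could be emitted with a `.word` directive inside inline assembly, though how our toolchain actually exposes the vector instructions is not shown here.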

Lessons Learned

  • Acceleration is workload-dependent: The VPU delivers the biggest benefit when kernels are dominated by regular MAC/vector inner loops; when vector intensity is low, fixed PCPI and memory-transaction overhead can dominate and reduce (or negate) speedup.
  • Co-design is essential: ISA design, memory layout, and measurement methodology must be aligned with the datapath and controller to obtain silicon-reproducible gains, not just isolated micro-benchmark wins.
  • Bring-up infrastructure is first-class: A robust scan-based flow plus MATLAB/FPGA control significantly improves iteration speed and debugging confidence during post-silicon validation.

Future Work

  • Reduce demo latency: Speed up scan-in/scan-out using higher scan clock, more efficient scan framing, or a higher-throughput debug interface to improve live-demo responsiveness.
  • Scale beyond on-chip SRAM: Use SPI as a practical path to attach external storage (or an FPGA-based memory model) for larger tensors/weights and rapid re-parameterization of AI workloads.
  • Increase VPU capability and observability: Extend the custom vector ISA (more ops, better reduction support, improved load/store patterns) and add lightweight performance counters to attribute cycles spent in CPU vs. VPU more precisely.
  • Richer I/O demonstrations: Stream intermediate results and processed images more directly (e.g., UART-driven display pipeline) for more interactive, real-time demos.

References

  1. Clifford Wolf. PicoRV32 — A Size-Optimized RISC-V CPU. YosysHQ (GitHub repository). https://github.com/YosysHQ/picorv32
  2. Alex Forencich. verilog-axi — AXI and AXI-Stream Components for Verilog. (GitHub repository). https://github.com/alexforencich/verilog-axi
  3. Alex Forencich. verilog-axi: Tree / master (source code reference used for AXI modules/integration). https://github.com/alexforencich/verilog-axi/tree/master
  4. “pulp-platform/apb_spi_master,” GitHub, 2025. (accessed Dec. 15, 2025). https://github.com/pulp-platform/apb_spi_master
  5. AXI-4-Lite-to-APB3-Bridge, “GitHub - AXI-4-Lite-to-APB3-Bridge/AXI-4-Lite-to-APB3-Bridge,” GitHub, 2025. (accessed Dec. 08, 2025). https://github.com/AXI-4-Lite-to-APB3-Bridge/AXI-4-Lite-to-APB3-Bridge
  6. B. Green, D. Todd, J. C. Calhoun and M. C. Smith, "TIGRA: A Tightly Integrated Generic RISC-V Accelerator Interface," 2021 IEEE International Conference on Cluster Computing (CLUSTER), Portland, OR, USA, 2021, pp. 779-782, doi: 10.1109/Cluster48925.2021.00115.
  7. V. S. Chakravarthi and S. R. Koteshwar, SoC Physical Design. Springer Nature, 2022.

Acknowledgments

We want to sincerely thank everyone who supported us throughout this project and made our first tape-out and post-silicon bring-up possible.

First and foremost, we are deeply grateful to Prof. Mingoo Seok for his supervision and guidance throughout the project. His technical expertise and hands-on feedback were invaluable at every stage, from architecture decisions to post-silicon validation.

Special thanks to our Teaching Assistants — Chuan-Tung Lin, Da Won Kim, and Mosom Jana — for their constant support. They were always available to answer questions, troubleshoot issues, and help us overcome critical obstacles during design, integration, and bring-up.

We also thank Richard T. Lee for his help with the practical logistics of post-silicon testing, especially for supporting the purchase and preparation of the materials needed for chip evaluation and measurement.

Finally, we gratefully acknowledge Apple Inc. for generous sponsorship and support. Without this sponsorship and the associated design review feedback, this project would not have been possible.

Team Members

Jiajun Jiang VPU Development, Frontend Design, Design Verification, Software, Post‑silicon Testing
Zhenning Yang Frontend Design, Design Verification, Backend Design, Soldering
Yicheng Huang Frontend Design, Backend Design, PCB Design
Zhuohao Chang Frontend Design, DFT
Yu Jia Frontend Design, Design Verification, Software







