By IdeaWorksInnovations Engineering Team
At IdeaWorksInnovations, we don't just run models; we build the silicon they run on. In our latest R&D project, we set out to break the memory bottleneck in Edge AI inference. The result? A custom Neural Processing Unit (NPU) coupled with a RISC-V core that delivers a 45% performance boost over a standard CPU-only implementation.
The Challenge
Running quantized neural networks on standard embedded CPUs often hits a wall. It's not just about compute—it's about feeding the beast. In a standard loop, the CPU spends thousands of cycles just shuffling data: loading weights, packing them, and then finally executing.
The Solution: Hardware-Software Co-Design
We chose an industrial-grade RISC-V core and extended it with our custom NPU via the APU (Auxiliary Processing Unit) interface.
Architecture Highlights
- 8x8 Systolic Array: 64 parallel INT8 MAC units capable of single-cycle matrix multiplication. This massive parallelism is key to our throughput.
- Vector Register File (VRF): 32 registers, each 64 bits wide, serving as a high-speed scratchpad to keep data close to the compute engine and minimize memory traffic.
- DMA-Driven Weight Loading: Our proprietary "secret sauce". A dedicated DMA controller bursts data from main memory directly into the VRF, completely eliminating the CPU overhead typically associated with weight packing.
Code Spotlight: Simplicity by Design
We designed our software stack to be as elegant as our hardware. Here is how simple it is to program a matrix multiplication layer with our DMA intrinsics:
```c
// Pre-packed weights are streamed into the VRF in the background by DMA;
// no CPU cycles wasted on scalar "load" instructions.
npu_dma_load(VRF_B_BASE, &weights[tile_idx], 8);

// The RISC-V core can do other work here, then synchronize.
npu_fence();

// Fire the 8x8 systolic array: 64 INT8 MACs per cycle.
npu_matmul(VRF_A_BASE, VRF_B_BASE, count);
```
The Results
We benchmarked this on standard edge workloads. The numbers speak for themselves:
| Metric | Improvement |
|---|---|
| Inference Speed | +45% Faster |
| Cycle Count | 31% Reduction |
| Energy Efficiency | Improved (fewer active cycles per inference) |
This isn't just theory. We have reduced the cycle count from 138k to 95k for standard classification tasks, purely by optimizing how data moves through the silicon.
Take the Next Step
This design demonstrates how IdeaWorksInnovations solves the toughest edge AI challenges. Whether you need custom IP for your SoC or optimized deployment for your models, we have the expertise to make it happen.
Interested in seeing the full benchmark data?
Discuss how this IP can accelerate your product.
Contact Our Engineering Team