Pipelining NEORV32 #1298
-
I would really like to cooperate with you on this if you want.
-
Hi there! The task of pipelining the CPU is not trivial. In my opinion, NEORV32 is perfect for learning how RISC-V microcontrollers work, and with a single in-order thread everything operates in a relatively "simple", easy-to-understand way. By taking a look at the assembly listing of the software you can clearly see how the program executes in the core step by step, one instruction after the other. However, for high-performance contexts, Stephan has added the possibility to instantiate several cores. In my opinion, this is likely the best choice because it keeps the core simple for those who want to learn while also providing an option for those who need a more powerful setup for demanding problems. Beyond that, if you need to deal with a truly high-demand problem, you should consider using a 64-bit or even 128-bit pipelined RISC-V implementation. In conclusion, I would keep the single-threaded version, although I wouldn't rule out a parallel pipelined version. 🤔 Just my opinion. 😃 Cheers!
-
Pipelining is actually at the top of my to-do list. 😅 I've thought about a lot of concepts and have already tried out some of them in hardware. The problem is really the additional hardware overhead. This is manageable for the actual pipeline: in the end, you only need forwarding and a kind of ready/valid handshake between the stages. But then there are all the traps that can occur in different stages (illegal instruction in DECODE, bus access error in MEMORY ACCESS, privilege mode error in WRITE BACK, ...). It gets even worse with multi-cycle operations (a division, or a bus access with wait states). These must be synchronized across all (previous) stages to halt everything until the operation is completed. This is all feasible (and almost standard nowadays), but it costs additional hardware, and at some point we end up with Ibex, VexRiscv and the like with their more classic DLX-style pipelines. In addition, the bus interface would have to be adapted, as it currently needs at least 2 clock cycles to answer a request (ok, we now also have bursts...).

Unfortunately, I can't give you any concrete figures, but my gut feeling is that full pipelining would roughly double the size of the CPU. But then the dual-core configuration becomes interesting again. Since the CPUs need several clock cycles per instruction anyway, they surprisingly don't get in each other's way much when it comes to bus accesses, and you get roughly +100% performance (I made a very simple test for that: https://github.com/stnolting/neorv32/tree/main/sw/example/demo_dual_core_primes).

TL;DR: So yeah, pipelining is something I would find very cool. But due to the high hardware costs, I would limit it to only some parts of the CPU. E.g. faster jumps would be a real performance boost. Maybe there are things that can be improved? 🤔
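To sketch the handshake idea (this is not NEORV32 code; the entity and signal names like `ex_valid_i` / `mem_busy_i` are made up for illustration): each stage register gets a valid flag, the downstream stage provides a ready signal, and a multi-cycle operation simply keeps its busy flag set, which stalls everything upstream.

```vhdl
-- Hypothetical EXECUTE -> MEMORY stage handshake (illustrative only, not NEORV32 code).
-- The downstream stage stalls the upstream one by deasserting 'ready'; a multi-cycle
-- unit (e.g. a divider or a bus access with wait states) keeps 'mem_busy_i' asserted.
library ieee;
use ieee.std_logic_1164.all;

entity pipe_stage_handshake is
  port (
    clk_i       : in  std_ulogic;
    rstn_i      : in  std_ulogic; -- async reset, low-active
    -- from EXECUTE stage --
    ex_valid_i  : in  std_ulogic; -- EXECUTE provides a new result
    ex_data_i   : in  std_ulogic_vector(31 downto 0);
    ex_ready_o  : out std_ulogic; -- MEMORY can accept it
    -- multi-cycle operation in MEMORY stage --
    mem_busy_i  : in  std_ulogic; -- e.g. bus access with wait states
    -- to WRITE-BACK stage --
    mem_valid_o : out std_ulogic;
    mem_data_o  : out std_ulogic_vector(31 downto 0);
    wb_ready_i  : in  std_ulogic
  );
end entity;

architecture rtl of pipe_stage_handshake is
  signal ex_ready : std_ulogic;
  signal valid_q  : std_ulogic;
  signal data_q   : std_ulogic_vector(31 downto 0);
begin
  -- accept a new instruction only if no multi-cycle operation is in flight
  -- and the stage register is free (or gets drained in this very cycle)
  ex_ready    <= (not mem_busy_i) and ((not valid_q) or wb_ready_i);
  ex_ready_o  <= ex_ready;
  mem_valid_o <= valid_q and (not mem_busy_i);
  mem_data_o  <= data_q;

  stage_reg: process(rstn_i, clk_i)
  begin
    if (rstn_i = '0') then
      valid_q <= '0';
      data_q  <= (others => '0');
    elsif rising_edge(clk_i) then
      if (ex_valid_i = '1') and (ex_ready = '1') then
        valid_q <= '1'; -- latch new instruction from EXECUTE
        data_q  <= ex_data_i;
      elsif (wb_ready_i = '1') and (mem_busy_i = '0') then
        valid_q <= '0'; -- drained to WRITE-BACK; insert a bubble
      end if;
    end if;
  end process stage_reg;
end architecture;
```

Chaining such registers for every stage gives the stall behavior described above; forwarding and the per-stage trap handling would still have to be layered on top, which is where most of the extra hardware goes.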
-
Hi. See results: *) Notes:
-
Hi. Maybe it's due to the time-multiplexed comparator of the PMP module, which is shared between LSU and instruction accesses? (That scheme assumes instructions take multiple cycles.)
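To illustrate the idea (this is not the actual NEORV32 PMP implementation, just a made-up VHDL sketch with hypothetical names): a single address comparator could be shared in time between the instruction-fetch and LSU ports, which only works because each access takes more than one clock cycle anyway.

```vhdl
-- Hypothetical sketch: one comparator time-multiplexed between instruction
-- fetch (IF) and load/store (LSU) address checks (illustrative only).
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity pmp_cmp_shared is
  port (
    clk_i       : in  std_ulogic;
    if_addr_i   : in  std_ulogic_vector(31 downto 0); -- instruction fetch address
    lsu_addr_i  : in  std_ulogic_vector(31 downto 0); -- load/store address
    region_hi_i : in  std_ulogic_vector(31 downto 0); -- top of the (single) region
    if_fail_o   : out std_ulogic;
    lsu_fail_o  : out std_ulogic
  );
end entity;

architecture rtl of pmp_cmp_shared is
  signal sel_lsu : std_ulogic := '0'; -- which port owns the comparator this cycle
  signal addr    : std_ulogic_vector(31 downto 0);
  signal fail    : std_ulogic;
begin
  -- one wide comparator, multiplexed between the two ports
  addr <= lsu_addr_i when (sel_lsu = '1') else if_addr_i;
  fail <= '1' when (unsigned(addr) >= unsigned(region_hi_i)) else '0';

  -- ping-pong between IF and LSU every clock cycle; each port only gets a
  -- freshly checked result every second cycle, which is fine as long as
  -- instructions take several cycles anyway
  mux_ctrl: process(clk_i)
  begin
    if rising_edge(clk_i) then
      sel_lsu <= not sel_lsu;
      if (sel_lsu = '1') then
        lsu_fail_o <= fail;
      else
        if_fail_o <= fail;
      end if;
    end if;
  end process mux_ctrl;
end architecture;
```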
-
Hi
Have you ever tried to pipeline the NEORV32 CPU?
Although the area overhead would be higher in this case, pipelining would make this project suitable for high-performance applications (especially FP computations).