Replies: 4 comments
-
Hey @esherriff! Sorry for the late reply... 🙈 Actually, division and square root operations are already on my TODO list. Division isn't really complicated, but I don't have a good (= small) implementation idea for square roots yet (PRs welcome! 😉).
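For reference, here is a minimal C model of the classic bit-serial (digit-recurrence) integer square root — the usual candidate for a small multi-cycle hardware unit, since it produces one result bit per iteration using only shift/add/subtract/compare. This is a generic textbook sketch, not code from the core:

```c
#include <stdint.h>

/* Bit-serial digit-recurrence square root: one result bit per iteration,
 * shift/add/subtract/compare only. In hardware this maps to one iteration
 * per clock cycle, i.e. a small multi-cycle unit. */
static uint32_t isqrt32(uint32_t x)
{
    uint32_t res = 0;
    uint32_t bit = 1u << 30;           /* highest power of four <= 2^31 */

    while (bit > x)
        bit >>= 2;

    while (bit != 0) {
        if (x >= res + bit) {
            x   -= res + bit;
            res  = (res >> 1) + bit;   /* shift in a 1 result bit */
        } else {
            res >>= 1;                 /* shift in a 0 result bit */
        }
        bit >>= 2;
    }
    return res;                        /* floor(sqrt(x)) */
}
```

The same recurrence applied to the (normalized) FP mantissa is what a minimal FSQRT unit would iterate on, at roughly one mantissa bit per cycle.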
Fused operations are a tricky thing. Since FPU operations require several cycles anyway, fused operations would not bring a significant performance gain. However, they would probably reduce register file pressure and register spilling rates. In any case, I have not planned to implement fused operations for the time being.
Actually, that's something I'd like to fix so that we can (finally) run the official RISC-V FPU tests. However, processing denormalized operands requires a considerable amount of hardware resources, so we should refrain from doing so for the time being (although it could perhaps be included as an additional optional configuration option).
Good idea, but the rounding operations are really quite cheap in terms of hardware.
-
Well, for my own Altera-based fork, for the option 2 unit I used the Altera megafunctions for FP divide and square root, which themselves don't support denormals or all rounding modes but are fast. The square root uses a block-RAM look-up table, so it uses less logic. Just having options to make such trade-offs between logic and other resources can be useful. For options 3 and 4, I am going to use the arithmetic units from this project. I've used these before in my forks of the Neo430 and Bonfire for Microchip Igloo2 and SmartFusion2, so I'm pretty sure they will work in this Altera implementation.

One reason I am interested in fused operations is to minimise the toll on memory bandwidth and increase code density for maths routines. I will be running the code either from XIP, external SRAM, or the limited quantity of block RAM in the smaller Cyclone 10 LP dies, so every instruction helps.
-
I also like the Vivado floating-point modules. They have an AXI-Stream interface that can be easily connected via SLINK.
That looks very good! Perhaps something from there could be used as inspiration for implementing division and square roots.
You could also implement custom compressed instructions for simple floating-point operations. This would allow you to encode, for example, an ADD and a MUL sequentially as 16-bit instructions, resulting in the same instruction bandwidth as a single MADD instruction.
-
Yes, you'll see that division and square root are a single unit, with options for either multi-cycle iteration or a look-up table. I modified my fork of it to fuse the look-up tables into a single inferred dual-port ROM, then added some pipelining to that, so it takes a few extra cycles. Similarly, for the fused multiply-add I had to unroll it into a registered pipeline due to the excessive combinatorial paths; I ended up with a six-stage pipeline, excluding the rounding unit.
Yeah, one of the reasons our port got traction resourcing-wise is that our existing cores all use memory-mapped FPU peripherals, and it was irritating to see a wall of macros instead of the arithmetic itself. So I doubt we will depart from what is supported by Zfinx.

Speaking of Zfinx, I have had real issues getting the xPack compiler (Windows x86_64, 14.2.0) to emit an FSQRT instruction. Given that your existing demo project's makefile includes the common makefile, which adds -mno-fdiv to the compiler flags, are you sure you have tested that the square root instruction can be emitted for -march values including Zfinx? Due to the apparent unlikelihood of ever getting a square root, I decided to remove it from my option 2 FPU. Given that my build tools inject the FPU generic as a C #define into the software build, it was easy enough to write a custom square root routine that uses fast inverse square root on the FP adder and multiplier, and a similar algorithm for reciprocal when division is not present in options 1 or 3 (a sketch of the idea is below).

Another feature branch in work is the required DWARF decoders for the testbench; our other cores all use these to provide simulation traces of the C code line being run and the contents of global and in-scope local variables. It's much more convenient to just have the C code as a wave beside everything else than having to parse a post-run trace log file or try to get GDB talking through the simulator.
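For illustration, a minimal sketch of that kind of routine — this is the classic fast-inverse-square-root trick plus Newton-Raphson refinement, not the poster's actual code. It needs only FADD/FMUL, which every option level provides:

```c
#include <stdint.h>
#include <string.h>

/* Fast inverse square root with two Newton-Raphson refinement steps.
 * Uses only FP add/multiply, so it works when FDIV/FSQRT hardware
 * is absent. Generic sketch, not the poster's actual routine. */
static float rsqrt_sw(float x)
{
    float half = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);       /* bit-level view without UB */
    i = 0x5f3759dfu - (i >> 1);     /* magic initial estimate */
    float y;
    memcpy(&y, &i, sizeof y);
    y = y * (1.5f - half * y * y);  /* 1st Newton-Raphson step */
    y = y * (1.5f - half * y * y);  /* 2nd step, near-full float precision */
    return y;
}

/* sqrt(x) = x * rsqrt(x): a software stand-in for FSQRT */
static float sqrt_sw(float x)
{
    return (x > 0.0f) ? x * rsqrt_sw(x) : 0.0f;
}
```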
-
Hi Stephan,
I wanted to know if there is a roadmap for further options on the FPU, particularly in the area of FDIV and the fused multiply-add instructions, to offer more performance at the cost of resources. Given the available compiler switches to disable FDIV/FSQRT and fused operations, I would like to suggest the following levels of conditional synthesis be available:

1. The existing FPU (no FDIV/FSQRT, no fused operations)
2. FDIV and FSQRT added
3. Fused multiply-add added, but no FDIV/FSQRT
4. FDIV, FSQRT and fused multiply-add all added
The existing policy of flushing subnormals is fine. Another option for adding more operators is to restrict rounding modes on the more advanced ops, for example supporting only round-to-nearest, ties-to-even — or perhaps even just round-to-nearest, ties-away, if that ends up simpler.
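To make the trade-off concrete, here is a hedged sketch (my own illustration, not anything from the core) of the final rounding decision for both modes, given the guard bit and the OR of the remaining round/sticky bits:

```c
#include <stdint.h>

/* Round-to-nearest, ties-to-even: increment when the guard bit is set
 * and either a lower bit survives (round|sticky) or the LSB is odd. */
static uint32_t round_rne(uint32_t mant, unsigned guard, unsigned round_sticky)
{
    if (guard && (round_sticky || (mant & 1u)))
        mant += 1u;
    return mant;
}

/* Round-to-nearest, ties-away: increment whenever the guard bit is set. */
static uint32_t round_rna(uint32_t mant, unsigned guard)
{
    return guard ? mant + 1u : mant;
}
```

The only difference between the two modes is the `(round_sticky || (mant & 1u))` term — a handful of gates either way.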
I think there is an argument for option 3 (fused multiply-add without FDIV or FSQRT), as fused operators can offer considerable performance increases on factorised polynomials used in optimised "good enough" approximations of maths operators like logarithms or trigonometry, as well as on Newton-Raphson-based software implementations of FDIV and FSQRT (see the sketch below).
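As a hedged illustration of both uses — assuming an `fmaf`-style fused operation; the names here are my own, not from the core, and the initial estimate `y0` would come from a small table or an exponent trick:

```c
#include <math.h>

/* Horner evaluation of a polynomial approximation: with hardware FMADD,
 * each step is one fused operation instead of a separate MUL + ADD. */
static float poly_eval(const float *coeff, int n, float x)
{
    float acc = coeff[n - 1];
    for (int i = n - 2; i >= 0; --i)
        acc = fmaf(acc, x, coeff[i]);  /* acc = acc*x + coeff[i], one rounding */
    return acc;
}

/* Newton-Raphson reciprocal, y' = y*(2 - a*y): each iteration roughly
 * doubles the number of correct bits and maps directly onto fused ops.
 * A software FDIV is then b/a = b * recip_nr(a, y0). */
static float recip_nr(float a, float y0)
{
    float y = y0;
    for (int k = 0; k < 3; ++k)
        y = y * fmaf(-a, y, 2.0f);     /* 2 - a*y in a single fused op */
    return y;
}
```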