Skip to content

Conversation

@sermazz
Copy link

@sermazz sermazz commented Feb 18, 2025

  • riscv cores with TNN extension (ISA extension for ternary-weight neural networks) [moved to https://github.com/Integrate TNN-enabled Flex-V core #98]
  • up-to-date dmac_wrap for iDMA
  • iDMA requests to SoC do not go anymore through cluster bus, but through an additional, independent (wide) port
  • iDMA transfers to/from TCDM can now go through wide HWPEs port of TCDM

Cc: @arpansur @da-gazzi @DanielKellerM

To-do

  • Rebase on master

Dependencies

Hardware

  • Solve combinational loop when iDMA connects to HWPE port of HCI (for now solved with FIFOs)
  • Solve iDMA warnings for mismatching of port sizes
  • Parametrize the additional wide port of cluster and HCI (if using mchan we don't want the additional wide master port on the interface of the cluster, or we might want iDMA without wide transfers)

Verification

  • Verify with different hw configurations (multiple/different HWPEs, with/without EEC, with/without HMR, with different cores, with iDMA connected to narrow ports of HCI)
  • Merge pulp-cluster-nonfree to have proper CI testing those configurations

@sermazz sermazz force-pushed the smazzola/chimera branch 2 times, most recently from 0b4d3c1 to e329d44 Compare February 20, 2025 10:10
@FrancescoConti
Copy link
Member

As #87 was just merged, this is the next major PR for the cluster. @sermazz check if you can rebase this on master!

env/env.sh Outdated

if (hostname | grep -qE "\.ee\.ethz\.ch$") ; then
export PULP_RUNTIME_GCC_TOOLCHAIN=/usr/pack/riscv-1.0-kgf/pulp-gcc-1.0.16
export PULP_RUNTIME_GCC_TOOLCHAIN=/usr/scratch/dolcedorme/smazzola/install/pulp-tnn-gnu-7.1.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this compiler available open-source? This is important before we merge this PR

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is open-source, but as a fork of @da-gazzi.. we should merge it but I'm no expert here
https://github.com/da-gazzi/pulp-tnn-gnu-toolchain
As far as I could tell it's also outdated with respect to the currently used toolchain of the upstream PULP Cluster

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's evoke @gtagliavini for this question :D

XBAR_PERIPH_BUS.Slave hci_ecc_periph_slave,
hci_core_intf.target core_tcdm_slave [0 : NB_CORES-1 ],
hci_core_intf.target hwpe_tcdm_slave [0 : 0 ],
hci_core_intf.target hwpe_tcdm_slave [0 : NB_HWPE-1 ],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fail if NB_HWPE=0, which I think is still possible (if iDMA uses LIC ports). See pulp-platform/hci#50 for a possible better solution.

end
// if DMA_USE_HWPE_PORT, assign DMA interfaces to the rest of s_hwpe_intc[:]
for (genvar i=NB_HWPE; i<N_HCI_HWPE_PORTS; i++) begin : connect_dma_hwpe_intf
hci_core_assign_expand #(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This depends on pulp-platform/hci#45, which should be merged before this one is merged.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes there are also other dependencies that I am taking an eye on, that need to be merged before we merge this (iDMA for example)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a check-list in the PR description to keep track!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On it!

.N_MEM ( NB_TCDM_BANKS ),
.IW ( TCDM_ID_WIDTH ),
.TS_BIT ( TEST_SET_BIT ),
.EXPFIFO ( 2 ),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the FIFO? did you have problems with timing closure?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because there is a combinational loop.. I was not sure how to solve it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we should check this one separately!

.N_MEM ( NB_TCDM_BANKS ),
.IW ( TCDM_ID_WIDTH ),
.TS_BIT ( TEST_SET_BIT ),
.EXPFIFO ( 2 ),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the FIFO? problems with timing closure?

assign core_busy_o = ~core_sleep;
end else if ( CORE_TYPE_CL == 1 ) begin: RI5CY_CORE
assign boot_addr = boot_addr_i;
riscv_core #(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I believe we need to manage the two cores differently. They are all riscv_core but there are substantial differences and I fear merging all the changes is far from trivial.
We have

  • a riscv_nn that now resides in https://github.com/pulp-platform/flex-v, with the HMR branch that is referred to in the current cluster v3
  • the TNN core from @da-gazzi you target here (where is it? where did it branch out?)
  • CV32E40P, with a different history
    @alenad95 do you have suggestions on how to cut this Gordian knot?

Copy link
Author

@sermazz sermazz Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely, this is an important point.
The RI5CY (riscv-nn) used in this PR is on our gitlab: https://iis-git.ee.ethz.ch/pulpissimov2/riscv-nn/-/tree/53b053cd0be8dae25caa24bd57824e81cf87ba21

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it's not tragic, it's only a few commits away from the main of https://github.com/pulp-platform/flex-v/network .
@sermazz can you push TNN to a branch of flex-v? then @alenad95 i think we should have a look whether the changes can be merged (at least with the Shaheen branch), which will leave us at two flex-v branches (main + hmr) and cv32e40p.
Then we can tidy the room up a bit in a future PR with an if-generate like in PULPissimo...

@RiccardoGandolfi
Copy link

RiccardoGandolfi commented May 19, 2025

As mentioned in today's meeting, I'm adding here the links to my forked repositories related to the work on iDMA:

The main changes on my side until now are:

  • updated hal functions to have 3d transfers (as I understand most of this is repeated work of what Daniel already did)
  • added tests for 2d, 3d and multi-core execution of transfers (+ some stimuli randomization: transfer sizes, 2D and 3D strides). Multicore + 2D and 3D is still ongoing
  • iDMA rtl:
  • a comment inside the register top module for the 32_3d frontend in order to enable the dimension counters for 3d transfers
  • fixes for arbitration in multi_core execution: stream_idx is now passed to the rr_arb_tree as part of the dma request

Another point I forgot to mention in the meeting:

  • I didn't test yet the L1->L1 transfers, I remember doing a quick trial but something was failing. Might be related to how the test was written though, I'll get back to this.

  • @DanielKellerM @sermazz @FrancescoConti

@RiccardoGandolfi
Copy link

As mentioned in today's meeting, I'm adding here the links to my forked repositories related to the work on iDMA:

The main changes on my side until now are:

  • updated hal functions to have 3d transfers (as I understand most of this is repeated work of what Daniel already did)
  • added tests for 2d, 3d and multi-core execution of transfers (+ some stimuli randomization: transfer sizes, 2D and 3D strides). Multicore + 2D and 3D is still ongoing
  • iDMA rtl:
  • a comment inside the register top module for the 32_3d frontend in order to enable the dimension counters for 3d transfers
  • fixes for arbitration in multi_core execution: stream_idx is now passed to the rr_arb_tree as part of the dma request

Another point I forgot to mention in the meeting:

  • I didn't test yet the L1->L1 transfers, I remember doing a quick trial but something was failing. Might be related to how the test was written though, I'll get back to this.
  • @DanielKellerM @sermazz @FrancescoConti

I updated the multicore tests as follows:

  • introduced a TEST_ALL_CORES flag that switches between single-core execution and multi-core execution (parallel-only for now, I'll add the sequential one as well on my side).
  • I refined a bit the randomization of the constraints, meaning the dst_stride can't be lower than the length parameter (overwriting results which messes up the comparison in the test) and the size as well can't be lower than the length parameter.
  • I'm still facing some weird issues almost exclusively in multicore mode where some cores will fail due to some mismatches. I'm investigating these cases at the moment.

@RiccardoGandolfi
Copy link

RiccardoGandolfi commented May 30, 2025

As mentioned in today's meeting, I'm adding here the links to my forked repositories related to the work on iDMA:

The main changes on my side until now are:

  • updated hal functions to have 3d transfers (as I understand most of this is repeated work of what Daniel already did)
  • added tests for 2d, 3d and multi-core execution of transfers (+ some stimuli randomization: transfer sizes, 2D and 3D strides). Multicore + 2D and 3D is still ongoing
  • iDMA rtl:
  • a comment inside the register top module for the 32_3d frontend in order to enable the dimension counters for 3d transfers
  • fixes for arbitration in multi_core execution: stream_idx is now passed to the rr_arb_tree as part of the dma request

Another point I forgot to mention in the meeting:

  • I didn't test yet the L1->L1 transfers, I remember doing a quick trial but something was failing. Might be related to how the test was written though, I'll get back to this.
  • @DanielKellerM @sermazz @FrancescoConti

I updated the multicore tests as follows:

  • introduced a TEST_ALL_CORES flag that switches between single-core execution and multi-core execution (parallel-only for now, I'll add the sequential one as well on my side).
  • I refined a bit the randomization of the constraints, meaning the dst_stride can't be lower than the length parameter (overwriting results which messes up the comparison in the test) and the size as well can't be lower than the length parameter.
  • I'm still facing some weird issues almost exclusively in multicore mode where some cores will fail due to some mismatches. I'm investigating these cases at the moment.

I pushed some further updates:

  • The issues I was facing in multicore execution were related to the use of the pulp_dma_cl_barrier function, which works fine as long as you transfer towards L2. So I added some new wait functions to work with transfers in the opposite direction as well. This seems to fix the issues.
  • All tests now contain: single core mode, sequential and parallel multicore modes + transfers L1->L2, L2->L1 and L1->L1
  • I properly fixed the arbitration on the stream_idx signal by changing the sv template, so that the correct version of the frontend can be automatically generated inside the iDMA repo.

@DanielKellerM @sermazz @FrancescoConti

@DanielKellerM
Copy link

Hello @RiccardoGandolfi @sermazz

I have created some new feature branches to continue the work from the branches in #89 (comment) and the on-going work from Riccardo.

The updated branches are:
PULP Cluster: https://github.com/pulp-platform/pulp_cluster/tree/dkeller/chimera-v2
Pulp-runtime: https://github.com/pulp-platform/pulp-runtime/tree/dkeller/chimera-v2
Regression-tests: https://github.com/pulp-platform/regression_tests/tree/dkeller/chimera-v2

@RiccardoGandolfi
Copy link

RiccardoGandolfi commented Jun 10, 2025

I pushed some substantial updates to the multi_core tests:

https://github.com/RiccardoGandolfi/regression_tests/tree/add_iDMA_tests

Main changes are:

  • now using pi_l1_malloc and pi_l2_malloc functions to divide the memory space between the different cores that use the iDMA
  • some restructuring of the tests to have a more understandable code
  • some refinements on the stimuli generation

These last changes should have fixed all the remaining issues with multi core execution. My next steps will be improving the stimuli generation. @FrancescoConti @sermazz @DanielKellerM @yvantor

@sermazz sermazz mentioned this pull request Jul 10, 2025
2 tasks
@sermazz sermazz force-pushed the smazzola/chimera branch 5 times, most recently from 08415ee to ddf4846 Compare July 10, 2025 14:29
@sermazz sermazz force-pushed the smazzola/chimera branch 4 times, most recently from 60cbbc7 to a3ac2c2 Compare July 11, 2025 11:11
@DanielKellerM
Copy link

DanielKellerM commented Jul 19, 2025

I have rebased branch dkeller/chimera-v2 onto this latest update from Sergio.
Working on #99 to merge the contributions from @RiccardoGandolfi and mine

@DanielKellerM
Copy link

DanielKellerM commented Jul 24, 2025

@sermazz maybe this point/task can be added to the CI?

  • Verify with different hw configurations (multiple/different HWPEs, with/without EEC, with/without HMR, with different cores, with iDMA connected to narrow ports of HCI)

sermazz and others added 8 commits August 12, 2025 16:47
The linker script has L1 address ORIGIN set to 0x10000004 even through
in hardware it is set to 0x10000000. However the testbench assumes 64b
alignment to initialize the L1. Thus, the data was shifted by 32b in the
simulation. While the AXI bursts are set to 64b, the misalignment needs
to be handled coming from the linker script.
Being unnecessarily unpacked, it was not compatible with other systems (like Cheshire)
@sermazz sermazz force-pushed the smazzola/chimera branch 3 times, most recently from ece4dda to cd19098 Compare August 13, 2025 08:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants