Upstream features from `chimera` #89

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Closed

sermazz wants to merge 14 commits into master from smazzola/chimera

sermazz commented Feb 18, 2025 •

edited by DanielKellerM

Loading

~~riscv cores with TNN extension (ISA extension for ternary-weight neural networks)~~ [moved to https://github.com/Integrate TNN-enabled Flex-V core #98]
up-to-date dmac_wrap for iDMA
iDMA requests to SoC do not go anymore through cluster bus, but through an additional, independent (wide) port
iDMA transfers to/from TCDM can now go through wide HWPEs port of TCDM

Cc: @arpansur @da-gazzi @DanielKellerM

To-do

Rebase on master

Dependencies

Clean up and merge hardware dependencies
- hci supporting multiple wide ports + ECC support Multiple HWPE support + ECC hci#53 = prasadar/multi-hwpe-support rebased on lg/ecc_rebase_v2.1.1
- iDMA Upstream PULPv2 features iDMA#66
- Add iDMA work from Daniel & Riccardo
- neureka https://github.com/pulp-platform/neureka/compare/prasadar/devel
Clean up and merge software dependencies
- pulp-runtime https://github.com/pulp-platform/pulp-runtime/tree/smazzola/chimera = georgr/nn_tests rebased on upstream_features from Astral
- regression_tests https://github.com/pulp-platform/regression_tests/tree/smazzola/chimera = prasadar/chimera rebased on Astral's lg/upstream (remove unused old tests like ./neureka, parametrize sw libraries to reflect hw configuration like pulp_nnx which has HWPEs IDs hardcoded only to the PULP Cluster's default config with 3 HWPEs)

Hardware

Solve combinational loop when iDMA connects to HWPE port of HCI (for now solved with FIFOs)
Solve iDMA warnings for mismatching of port sizes
Parametrize the additional wide port of cluster and HCI (if using mchan we don't want the additional wide master port on the interface of the cluster, or we might want iDMA without wide transfers)

Verification

Verify with different hw configurations (multiple/different HWPEs, with/without EEC, with/without HMR, with different cores, with iDMA connected to narrow ports of HCI)
Merge pulp-cluster-nonfree to have proper CI testing those configurations

sermazz force-pushed the smazzola/chimera branch 2 times, most recently from 0b4d3c1 to e329d44 Compare

February 20, 2025 10:10

Member

FrancescoConti commented Feb 24, 2025

As #87 was just merged, this is the next major PR for the cluster. @sermazz check if you can rebase this on master!

sermazz force-pushed the smazzola/chimera branch from 72a5642 to e309337 Compare

February 25, 2025 19:06

FrancescoConti reviewed

View reviewed changes

env/env.sh Outdated

    
              if (hostname | grep -qE "\.ee\.ethz\.ch$") ;  then

                export PULP_RUNTIME_GCC_TOOLCHAIN=/usr/pack/riscv-1.0-kgf/pulp-gcc-1.0.16

                export PULP_RUNTIME_GCC_TOOLCHAIN=/usr/scratch/dolcedorme/smazzola/install/pulp-tnn-gnu-7.1.1

Member

FrancescoConti Feb 25, 2025

Is this compiler available open-source? This is important before we merge this PR

Author

sermazz Feb 26, 2025

It is open-source, but as a fork of @da-gazzi.. we should merge it but I'm no expert here
https://github.com/da-gazzi/pulp-tnn-gnu-toolchain
As far as I could tell it's also outdated with respect to the currently used toolchain of the upstream PULP Cluster

Member

FrancescoConti Feb 26, 2025

Let's evoke @gtagliavini for this question :D

FrancescoConti reviewed

View reviewed changes

rtl/cluster_interconnect_wrap.sv

    
                XBAR_PERIPH_BUS.Slave                hci_ecc_periph_slave,

                hci_core_intf.target                 core_tcdm_slave   [0             : NB_CORES-1     ],

                hci_core_intf.target                 hwpe_tcdm_slave   [0             : 0              ],

                hci_core_intf.target                 hwpe_tcdm_slave   [0             : NB_HWPE-1      ],

Member

FrancescoConti Feb 25, 2025

This will fail if NB_HWPE=0, which I think is still possible (if iDMA uses LIC ports). See pulp-platform/hci#50 for a possible better solution.

FrancescoConti reviewed

View reviewed changes

rtl/cluster_interconnect_wrap.sv Show resolved Hide resolved

FrancescoConti reviewed

View reviewed changes

rtl/cluster_interconnect_wrap.sv

    
                    end

                    // if DMA_USE_HWPE_PORT, assign DMA interfaces to the rest of s_hwpe_intc[:]

                    for (genvar i=NB_HWPE; i<N_HCI_HWPE_PORTS; i++) begin : connect_dma_hwpe_intf

                      hci_core_assign_expand #(

Member

FrancescoConti Feb 25, 2025

This depends on pulp-platform/hci#45, which should be merged before this one is merged.

Author

sermazz Feb 26, 2025

Yes there are also other dependencies that I am taking an eye on, that need to be merged before we merge this (iDMA for example)

Member

FrancescoConti Feb 26, 2025

maybe add a check-list in the PR description to keep track!

Author

sermazz Feb 27, 2025

On it!

FrancescoConti reviewed

View reviewed changes

rtl/cluster_interconnect_wrap.sv Show resolved Hide resolved

FrancescoConti reviewed

View reviewed changes

rtl/cluster_interconnect_wrap.sv

    
                        .N_MEM  ( NB_TCDM_BANKS            ),

                        .IW     ( TCDM_ID_WIDTH            ),

                        .TS_BIT ( TEST_SET_BIT             ),

                        .EXPFIFO ( 2                       ),

Member

FrancescoConti Feb 25, 2025

Why the FIFO? did you have problems with timing closure?

Author

sermazz Feb 26, 2025

Because there is a combinational loop.. I was not sure how to solve it

Member

FrancescoConti Feb 26, 2025

Ok, we should check this one separately!

FrancescoConti reviewed

View reviewed changes

rtl/cluster_interconnect_wrap.sv

    
                        .N_MEM  ( NB_TCDM_BANKS            ),

                        .IW     ( TCDM_ID_WIDTH            ),

                        .TS_BIT ( TEST_SET_BIT             ),

                        .EXPFIFO ( 2                       ),

Member

FrancescoConti Feb 25, 2025

why the FIFO? problems with timing closure?

FrancescoConti reviewed

View reviewed changes

rtl/core_region.sv

    
                    assign core_busy_o = ~core_sleep;

                  end else if ( CORE_TYPE_CL == 1 ) begin: RI5CY_CORE

                    assign boot_addr = boot_addr_i;

                    riscv_core #(

Member

FrancescoConti Feb 25, 2025

Here I believe we need to manage the two cores differently. They are all riscv_core but there are substantial differences and I fear merging all the changes is far from trivial.
We have

a riscv_nn that now resides in https://github.com/pulp-platform/flex-v, with the HMR branch that is referred to in the current cluster v3
the TNN core from @da-gazzi you target here (where is it? where did it branch out?)
CV32E40P, with a different history
@alenad95 do you have suggestions on how to cut this Gordian knot?

Author

sermazz Feb 26, 2025 •

edited

Loading

Absolutely, this is an important point.
The RI5CY (riscv-nn) used in this PR is on our gitlab: https://iis-git.ee.ethz.ch/pulpissimov2/riscv-nn/-/tree/53b053cd0be8dae25caa24bd57824e81cf87ba21

Member

FrancescoConti Feb 26, 2025

OK, it's not tragic, it's only a few commits away from the main of https://github.com/pulp-platform/flex-v/network .
@sermazz can you push TNN to a branch of flex-v? then @alenad95 i think we should have a look whether the changes can be merged (at least with the Shaheen branch), which will leave us at two flex-v branches (main + hmr) and cv32e40p.
Then we can tidy the room up a bit in a future PR with an if-generate like in PULPissimo...

FrancescoConti reviewed

View reviewed changes

rtl/pulp_cluster.sv Outdated Show resolved Hide resolved

FrancescoConti reviewed

View reviewed changes

rtl/pulp_cluster.sv Show resolved Hide resolved

sermazz force-pushed the smazzola/chimera branch from e309337 to 67bcfb3 Compare

February 27, 2025 10:02

sermazz requested a review from arpansur

February 27, 2025 10:42

sermazz mentioned this pull request

Integrate PULP Cluster pulp-platform/chimera#66

Draft

7 tasks

sermazz force-pushed the smazzola/chimera branch from bced681 to d16ac0a Compare

March 11, 2025 18:44

DanielKellerM commented Mar 13, 2025 •

edited

Loading

On-going: rebasing dkeller/chimera branches on smazzola/chimera:
Pulp-runtime:

https://github.com/pulp-platform/pulp-runtime/commits/dkeller/chimera/ rebase on https://github.com/pulp-platform/pulp-runtime/tree/smazzola/chimera

Regression-tests:

https://github.com/pulp-platform/regression_tests/tree/dkeller/chimera rebase on https://github.com/pulp-platform/regression_tests/tree/smazzola/chimera

iDMA:

https://github.com/pulp-platform/iDMA/tree/dkeller/chimera rebase on release (Release v0.6.4)

PULP cluster:

https://github.com/pulp-platform/pulp_cluster/tree/dkeller/chimera commit by commit rebase on https://github.com/pulp-platform/pulp_cluster/tree/smazzola/chimera.

RiccardoGandolfi commented May 19, 2025 •

edited

Loading

As mentioned in today's meeting, I'm adding here the links to my forked repositories related to the work on iDMA:

The main changes on my side until now are:

updated hal functions to have 3d transfers (as I understand most of this is repeated work of what Daniel already did)
added tests for 2d, 3d and multi-core execution of transfers (+ some stimuli randomization: transfer sizes, 2D and 3D strides). Multicore + 2D and 3D is still ongoing
iDMA rtl:
a comment inside the register top module for the 32_3d frontend in order to enable the dimension counters for 3d transfers
fixes for arbitration in multi_core execution: stream_idx is now passed to the rr_arb_tree as part of the dma request

Another point I forgot to mention in the meeting:

I didn't test yet the L1->L1 transfers, I remember doing a quick trial but something was failing. Might be related to how the test was written though, I'll get back to this.
@DanielKellerM @sermazz @FrancescoConti

RiccardoGandolfi commented May 23, 2025

As mentioned in today's meeting, I'm adding here the links to my forked repositories related to the work on iDMA:

https://github.com/RiccardoGandolfi/pulp_cluster/tree/update_idma

https://github.com/RiccardoGandolfi/pulp-runtime/tree/lg/upstream

https://github.com/RiccardoGandolfi/regression_tests/tree/add_iDMA_tests

https://github.com/RiccardoGandolfi/iDMA/tree/dkeller/chimera

The main changes on my side until now are:

updated hal functions to have 3d transfers (as I understand most of this is repeated work of what Daniel already did)

added tests for 2d, 3d and multi-core execution of transfers (+ some stimuli randomization: transfer sizes, 2D and 3D strides). Multicore + 2D and 3D is still ongoing

iDMA rtl:

a comment inside the register top module for the 32_3d frontend in order to enable the dimension counters for 3d transfers

fixes for arbitration in multi_core execution: stream_idx is now passed to the rr_arb_tree as part of the dma request

Another point I forgot to mention in the meeting:

I didn't test yet the L1->L1 transfers, I remember doing a quick trial but something was failing. Might be related to how the test was written though, I'll get back to this.

@DanielKellerM @sermazz @FrancescoConti

I updated the multicore tests as follows:

introduced a TEST_ALL_CORES flag that switches between single-core execution and multi-core execution (parallel-only for now, I'll add the sequential one as well on my side).
I refined a bit the randomization of the constraints, meaning the dst_stride can't be lower than the length parameter (overwriting results which messes up the comparison in the test) and the size as well can't be lower than the length parameter.
I'm still facing some weird issues almost exclusively in multicore mode where some cores will fail due to some mismatches. I'm investigating these cases at the moment.

RiccardoGandolfi commented May 30, 2025 •

edited

Loading

As mentioned in today's meeting, I'm adding here the links to my forked repositories related to the work on iDMA:

https://github.com/RiccardoGandolfi/pulp_cluster/tree/update_idma

https://github.com/RiccardoGandolfi/pulp-runtime/tree/lg/upstream

https://github.com/RiccardoGandolfi/regression_tests/tree/add_iDMA_tests

https://github.com/RiccardoGandolfi/iDMA/tree/dkeller/chimera

The main changes on my side until now are:

updated hal functions to have 3d transfers (as I understand most of this is repeated work of what Daniel already did)

added tests for 2d, 3d and multi-core execution of transfers (+ some stimuli randomization: transfer sizes, 2D and 3D strides). Multicore + 2D and 3D is still ongoing

iDMA rtl:

a comment inside the register top module for the 32_3d frontend in order to enable the dimension counters for 3d transfers

fixes for arbitration in multi_core execution: stream_idx is now passed to the rr_arb_tree as part of the dma request

Another point I forgot to mention in the meeting:

I didn't test yet the L1->L1 transfers, I remember doing a quick trial but something was failing. Might be related to how the test was written though, I'll get back to this.

@DanielKellerM @sermazz @FrancescoConti

I updated the multicore tests as follows:

introduced a TEST_ALL_CORES flag that switches between single-core execution and multi-core execution (parallel-only for now, I'll add the sequential one as well on my side).

I refined a bit the randomization of the constraints, meaning the dst_stride can't be lower than the length parameter (overwriting results which messes up the comparison in the test) and the size as well can't be lower than the length parameter.

I'm still facing some weird issues almost exclusively in multicore mode where some cores will fail due to some mismatches. I'm investigating these cases at the moment.

I pushed some further updates:

The issues I was facing in multicore execution were related to the use of the pulp_dma_cl_barrier function, which works fine as long as you transfer towards L2. So I added some new wait functions to work with transfers in the opposite direction as well. This seems to fix the issues.
All tests now contain: single core mode, sequential and parallel multicore modes + transfers L1->L2, L2->L1 and L1->L1
I properly fixed the arbitration on the stream_idx signal by changing the sv template, so that the correct version of the frontend can be automatically generated inside the iDMA repo.

@DanielKellerM @sermazz @FrancescoConti

DanielKellerM commented Jun 2, 2025

Hello @RiccardoGandolfi @sermazz

I have created some new feature branches to continue the work from the branches in #89 (comment) and the on-going work from Riccardo.

The updated branches are:
PULP Cluster: https://github.com/pulp-platform/pulp_cluster/tree/dkeller/chimera-v2
Pulp-runtime: https://github.com/pulp-platform/pulp-runtime/tree/dkeller/chimera-v2
Regression-tests: https://github.com/pulp-platform/regression_tests/tree/dkeller/chimera-v2

RiccardoGandolfi commented Jun 10, 2025 •

edited

Loading

I pushed some substantial updates to the multi_core tests:

https://github.com/RiccardoGandolfi/regression_tests/tree/add_iDMA_tests

Main changes are:

now using pi_l1_malloc and pi_l2_malloc functions to divide the memory space between the different cores that use the iDMA
some restructuring of the tests to have a more understandable code
some refinements on the stimuli generation

These last changes should have fixed all the remaining issues with multi core execution. My next steps will be improving the stimuli generation. @FrancescoConti @sermazz @DanielKellerM @yvantor

sermazz mentioned this pull request

Integrate TNN-enabled Flex-V core #98

Draft

2 tasks

sermazz force-pushed the smazzola/chimera branch 5 times, most recently from 08415ee to ddf4846 Compare

July 10, 2025 14:29

sermazz force-pushed the smazzola/chimera branch 4 times, most recently from 60cbbc7 to a3ac2c2 Compare

July 11, 2025 11:11

DanielKellerM commented Jul 19, 2025 •

edited

Loading

I have rebased branch dkeller/chimera-v2 onto this latest update from Sergio.
Working on #99 to merge the contributions from @RiccardoGandolfi and mine

DanielKellerM commented Jul 24, 2025 •

edited

Loading

@sermazz maybe this point/task can be added to the CI?

Verify with different hw configurations (multiple/different HWPEs, with/without EEC, with/without HMR, with different cores, with iDMA connected to narrow ports of HCI)

sermazz force-pushed the smazzola/chimera branch from a3ac2c2 to 7a75821 Compare

July 25, 2025 12:08

sermazz force-pushed the smazzola/chimera branch from 7a75821 to 5b810dd Compare

August 12, 2025 14:27

sermazz and others added 8 commits

August 12, 2025 16:47


          bender,hw: Bump iDMA and update wrapper

a96bfc1


          hw,tb: Detach DMA from cluster bus, add wide port, test

dfaeb01


          hw,tb: Share HCI shallow intc between (wide) DMA and HWPEs

b79adbd


          tb: handle non-word (64-bit bursts) aligned sections

90fcddc

The linker script has L1 address ORIGIN set to 0x10000004 even through
in hardware it is set to 0x10000000. However the testbench assumes 64b
alignment to initialize the L1. Thus, the data was shifted by 32b in the
simulation. While the AXI bursts are set to 64b, the misalignment needs
to be handled coming from the linker script.


          hw: Parametrize number of EXT ports to TCDM

d14d19c


          [WIP] hw: Fix combinational loop in HCI

3c22e0b


          package: Change config struct to packed type

4d7f555

Being unnecessarily unpacked, it was not compatible with other systems (like Cheshire)


          hw,bender: Fix idma_wrap for updated obi v0.1.3

sermazz force-pushed the smazzola/chimera branch 3 times, most recently from ece4dda to cd19098 Compare

August 13, 2025 08:25

sermazz added 6 commits

August 13, 2025 11:11


          makefile: Add target to generate iDMA hardware

e7e8350


          hw: 🐛 Fix cluster_id_i tied to 0 in cluster_interco and core_demux

a659d55

It created problems in routing of requests through peripheral interconnect


          bender,hw: Bump hci to align ECC and non-ECC HCI interfaces + update …

35ded25

…Bender.lock


          makefile: Bump pulp-runtime, regression_tests, and nonfree revs

8bb49ab


          ci,runtime: Disable ECC test in CI + disable HMR in runtime

1e0eb0b


          tb,bender: Disable ECC and HMR + bump to Neureka without hardcoded ECC

d41ca84

sermazz force-pushed the smazzola/chimera branch from cd19098 to d41ca84 Compare

August 13, 2025 09:12

DanielKellerM mentioned this pull request

Upstream: Add support for FC-less cluster targets and HMR, iDMA pulp-platform/pulp-runtime#51

Merged

sermazz closed this

sermazz mentioned this pull request

Upstream Chimera features #103

Open

4 tasks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet