@singalsu
Collaborator

@singalsu singalsu commented Mar 6, 2025

Now ready, please check the commit messages for changes in every patch.

@singalsu
Collaborator Author

singalsu commented Mar 7, 2025

Checked that the change is bit exact with the old code, looks good. However, with the testbench I could only test 44100 to 48000, which uses the 6th order (n7) polynomial function. The capture ASRC run does not work with the testbench (should fix it), so I can't test the full component with the other n4/n5/n6 functions. In a separate small test program they were bit exact too.

Member

@lgirdwood lgirdwood left a comment

Nice 11.6 MCPS saving!

@singalsu
Collaborator Author

singalsu commented Mar 7, 2025

> Nice 11.6 MCPS saving!

It worked well, finally. For the FIR part of the ASRC, the saving from the dual-MAC to quad-MAC update seems to be about 1 MCPS or less, but I'm still checking things.

There are no functional changes. This patch just duplicates
asrc_farrow_hifi3.c to asrc_farrow_hifi5.c and updates
the SOF_USE_HIFI() macros and CMakeLists.txt to handle
the files. The copyright texts are also updated.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>

These coefficients are re-ordered for use with the new HiFi5
versions of the functions asrc_calc_impulse_response_n[4-7]().

This patch only adds the files, nothing is changed yet.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>

This patch optimizes for HiFi5 the polynomial evaluations that
calculate the rate conversion FIR impulse responses. The functions
asrc_calc_impulse_response_n[4-7]() calculate each FIR coefficient
with Horner's method from a 3rd to 6th order polynomial.

The header files with the polynomial coefficients are re-ordered
for direct 128-bit int32x4 loads. The loop is modified to calculate
four FIR coefficients per iteration. Since no suitable quad-MAC
instruction was found, the previous dual-MAC is used twice.

The saving is 11.6 MCPS, from 38.75 to 27.20 MCPS, for 32-bit stereo
44.1 to 48 kHz conversion in push mode.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
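The Horner evaluation and the coefficient re-ordering described in the commit message above can be pictured with the following plain C sketch. It is not the HiFi5 code from the patch: the sizes `NUM_TAPS` and `POLY_LEN`, the helper `q31_mul()`, the function names, and the exact re-ordered layout are all assumptions made for illustration, and no HiFi intrinsics are used. It only shows that grouping the coefficients of four consecutive taps per polynomial power allows four FIR coefficients to be computed per iteration with purely sequential loads, with bit exact results.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_TAPS 8 /* hypothetical FIR length, must be a multiple of 4 */
#define POLY_LEN 4 /* hypothetical: 4 polynomial coefficients per tap, 3rd order */

/* Illustrative Q1.31 multiply with rounding. No saturation is done, and the
 * right shift of a negative value is assumed to be arithmetic.
 */
static int32_t q31_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b + (1LL << 30)) >> 31);
}

/* Reference version: layout coef[power * NUM_TAPS + tap], one tap per pass */
static void impulse_response_scalar(const int32_t *coef, int32_t t, int32_t *h)
{
    for (int n = 0; n < NUM_TAPS; n++) {
        int32_t acc = coef[(POLY_LEN - 1) * NUM_TAPS + n];

        /* Horner: acc = (...((c3 * t + c2) * t + c1) * t + c0) */
        for (int p = POLY_LEN - 2; p >= 0; p--)
            acc = q31_mul(acc, t) + coef[p * NUM_TAPS + n];
        h[n] = acc;
    }
}

/* Re-order so that each group of four taps finds its coefficients as
 * consecutive blocks of four values, highest power first.
 */
static void reorder_x4(const int32_t *coef, int32_t *r)
{
    assert(NUM_TAPS % 4 == 0);
    for (int g = 0; g < NUM_TAPS / 4; g++)
        for (int k = 0; k < POLY_LEN; k++)
            for (int lane = 0; lane < 4; lane++)
                *r++ = coef[(POLY_LEN - 1 - k) * NUM_TAPS + 4 * g + lane];
}

/* Four taps per iteration; every inner lane loop stands in for one
 * 4 x int32 vector load and vector operation.
 */
static void impulse_response_x4(const int32_t *r, int32_t t, int32_t *h)
{
    for (int g = 0; g < NUM_TAPS / 4; g++) {
        int32_t acc[4];

        for (int lane = 0; lane < 4; lane++)
            acc[lane] = *r++;
        for (int k = 1; k < POLY_LEN; k++)
            for (int lane = 0; lane < 4; lane++)
                acc[lane] = q31_mul(acc[lane], t) + *r++;
        for (int lane = 0; lane < 4; lane++)
            h[4 * g + lane] = acc[lane];
    }
}
```

Calling `reorder_x4()` once on the table and then `impulse_response_x4()` with the same fractional time value `t` should give the same output as `impulse_response_scalar()` on the original table, which is the kind of bit exactness check mentioned earlier in this conversation.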
There are more efficient HiFi instructions to load 128 bits
aligned forward than backward. This change makes the FIR write
index go backward when a new sample is written to the delay
line. Since the FIR is computed from the newest sample toward the
oldest, the FIR compute read direction becomes forward.

The ASRC FIR avoids circular addressing by writing each sample
twice into a double-length buffer. The original code keeps the
buffer_write_position index in the second half. After this
change the index is kept in the first half.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
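The delay line scheme in the commit message above may be easier to follow with a small sketch. The following plain C code is an illustration only: the struct, the function names, and `FIR_LEN` are hypothetical, and whether the real code decrements the index before or after the write is an assumption. It shows a double-length buffer with a duplicated write, a write index that moves backward within the first half, and a FIR read that runs forward over contiguous memory without wrap handling.

```c
#include <assert.h>
#include <stdint.h>

#define FIR_LEN 8 /* hypothetical filter length */

struct delay_line {
    int32_t buf[2 * FIR_LEN]; /* double length, every sample is stored twice */
    int pos;                  /* write index, always in the first half */
};

static void delay_init(struct delay_line *d)
{
    for (int i = 0; i < 2 * FIR_LEN; i++)
        d->buf[i] = 0;
    d->pos = 0;
}

/* Write a new sample. The index moves backward and wraps in the first half. */
static void delay_write(struct delay_line *d, int32_t x)
{
    d->pos = (d->pos == 0) ? FIR_LEN - 1 : d->pos - 1;
    assert(d->pos >= 0 && d->pos < FIR_LEN);
    d->buf[d->pos] = x;           /* copy in the first half */
    d->buf[d->pos + FIR_LEN] = x; /* duplicate in the second half */
}

/* The newest to oldest samples are buf[pos], buf[pos + 1], ...,
 * buf[pos + FIR_LEN - 1], so the read is a plain forward walk.
 */
static int64_t fir_compute(const struct delay_line *d, const int32_t *coef)
{
    const int32_t *x = &d->buf[d->pos];
    int64_t acc = 0;

    for (int i = 0; i < FIR_LEN; i++)
        acc += (int64_t)coef[i] * x[i];
    return acc;
}
```

Because the read always starts in the first half and spans at most `FIR_LEN` samples, it never runs past the end of the double-length buffer.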
This change improves the efficiency of the FIR filter computation.
The FIR calculation is unrolled by four. The coefficient and
delay line reads are changed to 128-bit 4x int32_t reads.

The MAC instruction is changed from dual-MAC to quad-MAC.
The FIR accuracy improves a bit due to the internal Q17.47 format
instead of Q1.31.

The saving is 1.7 MCPS, from 27.2 to 25.5 MCPS, with 32-bit
44.1 to 48 kHz stereo push mode ASRC.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
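The accuracy remark in the commit message above can be illustrated with a short plain C sketch. It is an assumption about how the two accumulation formats behave, not the HiFi5 instruction semantics: the function names are hypothetical, the rounding choices are illustrative, and no intrinsics are used. The first function rounds every product to Q1.31 before summing, while the second keeps a 64-bit Q17.47 sum over a loop unrolled by four and rounds only once at the end.

```c
#include <assert.h>
#include <stdint.h>

/* Each Q1.31 x Q1.31 product is rounded to Q1.31 before accumulation.
 * No saturation is done in this sketch.
 */
static int32_t fir_acc_q31(const int32_t *c, const int32_t *x, int n)
{
    int32_t acc = 0;

    for (int i = 0; i < n; i++)
        acc += (int32_t)(((int64_t)c[i] * x[i] + (1LL << 30)) >> 31);
    return acc;
}

/* Products have 62 fractional bits. Shifting right by 15 leaves 47, so the
 * 64-bit sum is in Q17.47. The result is rounded to Q1.31 once and saturated.
 */
static int32_t fir_acc_q47(const int32_t *c, const int32_t *x, int n)
{
    int64_t acc = 0;

    assert(n % 4 == 0); /* the loop is unrolled by four */
    for (int i = 0; i < n; i += 4) {
        acc += ((int64_t)c[i] * x[i]) >> 15;
        acc += ((int64_t)c[i + 1] * x[i + 1]) >> 15;
        acc += ((int64_t)c[i + 2] * x[i + 2]) >> 15;
        acc += ((int64_t)c[i + 3] * x[i + 3]) >> 15;
    }

    acc = (acc + (1LL << 15)) >> 16; /* Q17.47 to Q1.31 with rounding */
    if (acc > INT32_MAX)
        acc = INT32_MAX;
    if (acc < INT32_MIN)
        acc = INT32_MIN;
    return (int32_t)acc;
}
```

As a worked case, take four taps with coefficient 0.5 (`0x40000000`) and input of one LSB. The exact sum is two LSBs. In this sketch the per-product rounding rounds each half LSB up and returns 4, while the wide accumulator returns 2.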
This change improves the efficiency of the FIR filter computation.
The filter coefficient load is changed to 128 bits wide for four
32-bit coefficients. The dual-MAC is changed to a quad-MAC with
a single accumulator.

The saving is 1.3 MCPS, from 25.4 to 24.1 MCPS, with 16-bit
44.1 to 48 kHz stereo push mode ASRC.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
Collaborator

@kv2019i kv2019i left a comment

The reverse delay write direction patch is a bit of a head scratcher, but seems ok in the end!

@singalsu
Collaborator Author

> The reverse delay write direction patch is a bit of a head scratcher, but seems ok in the end!

Yep, fortunately the testbench valgrind run caught the first mistakes I made with it. I think we could get about the same MCPS with a circular FIR delay buffer (and save about 128 int32_t words of RAM). Then it would not matter whether the index is initialized to the 1st or 2nd half, or to the beginning or end. But I didn't want to change too much this time.

@lgirdwood lgirdwood merged commit acc6762 into thesofproject:main Mar 11, 2025
43 of 49 checks passed

3 participants