Audio: ASRC: HiFi5 optimizations #9875
Checked that the change is bit exact with the old version; looks good. However, with the testbench I could only test the 44100 to 48000 conversion, which uses the 6th order (n7) polynomial function. The ASRC capture run does not work with the testbench (this should be fixed), so I can't test the full component with the other n4/n5/n6 functions. In a separate small test program they were bit exact too.
lgirdwood
left a comment
Nice 11.6 MCPS saving!
It worked well, finally. Now with the FIR part of ASRC, the saving from the dual-MAC to quad-MAC update seems to be about 1 MCPS or less, but I'm still checking things.
There are no functional changes; this just duplicates asrc_farrow_hifi3.c to asrc_farrow_hifi5.c and updates the SOF_USE_HIFI() macros and CMakeLists.txt to handle the files. The copyright texts are also updated. Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
These coefficients are re-ordered for use with the new HiFi5 versions of the functions asrc_calc_impulse_response_n[4-7](). This patch only adds the files; nothing else is changed yet. Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
This patch optimizes for HiFi5 the polynomial evaluations that calculate the rate conversion FIR impulse responses. The functions asrc_calc_impulse_response_n[4-7]() calculate each coefficient of the FIR with Horner's method from a 3rd to 6th order polynomial. The header files with the polynomial coefficients are re-ordered for direct 128-bit int32x4 loads. The loop is modified to calculate four FIR coefficients per iteration. Since no suitable quad-MAC instruction was found, the previous dual-MAC is used twice. The saving is 11.6 MCPS, from 38.75 to 27.20 MCPS, for 32-bit stereo 44.1 to 48 kHz conversion in push mode. Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
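The Horner evaluation with four FIR coefficients per loop iteration could look roughly like the portable C sketch below. It is not the HiFi5 code from this patch: the function name `calc_impulse_response()`, the Q1.31 helper `q31_mul()`, and the coefficient layout (for each block of four taps, four same-degree polynomial coefficients stored together, highest degree first) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply a 32-bit value by a Q1.31 fraction, rounding the result.
 * Note: right shift of a negative int64_t is arithmetic on common
 * compilers but is implementation-defined in ISO C.
 */
static inline int32_t q31_mul(int32_t a, int32_t frac)
{
	int64_t p = (int64_t)a * frac;

	return (int32_t)((p + (1LL << 30)) >> 31);
}

/* Hypothetical layout: for each block of four taps, n_poly groups of
 * four coefficients, highest polynomial degree first. One group then
 * corresponds to one 128-bit (4 x int32_t) load. No saturation is done
 * in this sketch.
 */
void calc_impulse_response(const int32_t *coefs, int32_t *ir,
			   int num_taps, int n_poly, int32_t frac)
{
	int tap, lane, j;

	assert(num_taps % 4 == 0);
	for (tap = 0; tap < num_taps; tap += 4) {
		/* Block (tap / 4) starts at (tap / 4) * 4 * n_poly */
		const int32_t *c = coefs + tap * n_poly;
		int32_t acc[4];

		for (lane = 0; lane < 4; lane++)
			acc[lane] = c[lane];

		/* Horner: acc = acc * frac + next lower degree coefficient */
		for (j = 1; j < n_poly; j++) {
			c += 4;
			for (lane = 0; lane < 4; lane++)
				acc[lane] = q31_mul(acc[lane], frac) + c[lane];
		}

		for (lane = 0; lane < 4; lane++)
			ir[tap + lane] = acc[lane];
	}
}
```

The inner `lane` loops stand in for the SIMD lanes; on HiFi5 each of them would map to vector operations, with the multiply-accumulate done by two dual-MAC instructions as the commit message describes.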
There are more efficient HiFi instructions for aligned 128-bit loads in the forward direction than in the backward direction. This change makes the FIR write index go backward when a new sample is written to the delay line. Since the FIR is computed from the newest sample toward the oldest, the FIR compute read direction becomes forward. The ASRC FIR avoids circular addressing by writing each sample twice into a double-length buffer. The original code keeps the buffer_write_position index in the second half. After this change the index is kept in the first half. Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
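As a rough illustration of the scheme described in this commit message, the sketch below keeps the write index in the first half of a double-length buffer, steps it backward on each write, and duplicates the sample into the second half. The struct, the function names, and the length `FIR_LEN` are hypothetical and are not taken from the SOF sources.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FIR_LEN 4 /* hypothetical filter length */

struct delay_line {
	int32_t buf[2 * FIR_LEN];
	int wr; /* write index, always in the first half [0, FIR_LEN) */
};

void delay_init(struct delay_line *d)
{
	memset(d->buf, 0, sizeof(d->buf));
	d->wr = 0;
}

/* Step the write index backward and store the sample twice, so that
 * FIR_LEN consecutive samples starting at wr never wrap.
 */
void delay_write(struct delay_line *d, int32_t s)
{
	d->wr = d->wr > 0 ? d->wr - 1 : FIR_LEN - 1;
	d->buf[d->wr] = s;
	d->buf[d->wr + FIR_LEN] = s;
}

/* Pointer to the newest sample; p[0]..p[FIR_LEN - 1] reads forward in
 * memory from the newest sample to the oldest.
 */
const int32_t *delay_newest(const struct delay_line *d)
{
	assert(d->wr >= 0 && d->wr < FIR_LEN);
	return &d->buf[d->wr];
}
```

Because the highest index read is `wr + FIR_LEN - 1`, the FIR loop can use plain forward loads with no wrap check, at the cost of the doubled buffer.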
This change improves the efficiency of the FIR filter compute. The FIR calculation is unrolled by four. The coefficient and delay line reads are changed to 128-bit 4x int32_t reads. The MAC instruction is changed from dual-MAC to quad-MAC. The FIR accuracy improves a bit due to the internal Q17.47 format instead of Q1.31. The saving is 1.7 MCPS, from 27.2 to 25.5 MCPS, with 32-bit 44.1 to 48 kHz stereo push mode ASRC. Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
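A plain C model of a FIR unrolled by four with a wide accumulator might look like the sketch below. It assumes Q1.31 coefficients and data, and it models the Q17.47 accumulator by shifting each Q2.62 product right by 15 bits before summing; that is an assumption about how the format could be emulated, not a description of the HiFi5 quad-MAC instruction. The function name `fir_q31()` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* FIR over len taps, len a multiple of four. Products are accumulated
 * with 47 fractional bits and rounded to Q1.31 only once at the end.
 * Note: right shift of negative values is arithmetic on common
 * compilers but is implementation-defined in ISO C.
 */
int32_t fir_q31(const int32_t *coef, const int32_t *data, int len)
{
	int64_t acc = 0;
	int i;

	assert(len % 4 == 0);
	for (i = 0; i < len; i += 4) {
		/* Four multiply-accumulates per iteration, one accumulator */
		acc += ((int64_t)coef[i] * data[i]) >> 15;
		acc += ((int64_t)coef[i + 1] * data[i + 1]) >> 15;
		acc += ((int64_t)coef[i + 2] * data[i + 2]) >> 15;
		acc += ((int64_t)coef[i + 3] * data[i + 3]) >> 15;
	}

	/* Round from 47 to 31 fractional bits, then saturate to 32 bits */
	acc = (acc + (1LL << 15)) >> 16;
	if (acc > INT32_MAX)
		acc = INT32_MAX;
	if (acc < INT32_MIN)
		acc = INT32_MIN;

	return (int32_t)acc;
}
```

Rounding once after the accumulation, rather than rounding every product to Q1.31, is one plausible reason for the small accuracy improvement mentioned in the commit message.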
This change improves the efficiency of the FIR filter compute. The filter coefficient load is changed to 128 bits wide for four 32-bit coefficients. The dual-MAC is changed to a quad-MAC with a single accumulator. The saving is 1.3 MCPS, from 25.4 to 24.1 MCPS, with 16-bit 44.1 to 48 kHz stereo push mode ASRC. Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
5bd1b45 to 407710b
kv2019i
left a comment
The reverse delay write direction patch is a bit of a head-scratcher, but it seems OK in the end!
Yep, fortunately testbench valgrind caught the first mistakes I made with it. I think we could get about the same MCPS with a circular FIR delay buffer (and save about 128 int32_t of RAM); then whether the index is initialized to the 1st or 2nd half, or to the beginning or end, would not matter. But I didn't want to change too much this time.
Now ready, please check the commit messages for changes in every patch.