Conversation

@lyakh (Collaborator) commented Feb 21, 2025:

Move all of IPC and some initialisation code to DRAM.
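For orientation, here is a minimal sketch of the mechanism discussed below: an attribute that steers selected functions into a DRAM-resident section, gated by CONFIG_COLD_STORE_EXECUTE_DRAM. The section names and macro bodies are assumptions for illustration; the actual SOF definitions may differ.

```c
/* Hypothetical sketch only: how a "cold" attribute can place code in DRAM.
 * It assumes the linker script maps the .cold_text/.cold_rodata output
 * sections to DRAM when CONFIG_COLD_STORE_EXECUTE_DRAM is enabled; the
 * actual SOF macros and section names may differ.
 */
#ifdef CONFIG_COLD_STORE_EXECUTE_DRAM
#define __cold		__attribute__((section(".cold_text")))
#define __cold_rodata	__attribute__((section(".cold_rodata")))
#else
#define __cold
#define __cold_rodata
#endif

/* Example use: a slow-path handler that never runs in LL context */
__cold int example_cold_handler(void)
{
	/* parsing, allocation, bookkeeping - no hot-path work here */
	return 0;
}
```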

@marcinszkudlinski (Contributor) commented:

I understand why init functions should go to DRAM, but why IPC?

@lyakh (Collaborator, Author) commented Feb 24, 2025:

> I understand why init functions should go to DRAM, but why IPC?

@marcinszkudlinski the idea is that only audio protocols are "hot" - only schedulers and audio processing threads. Everything else can be "cold", and IPC processing is one such large code area. But if you have concerns that this can break something, let's discuss - maybe we're overlooking some use cases?

@marcinszkudlinski (Contributor) commented:

@lyakh not really.
We're already facing performance problems: when starting multiple sophisticated pipelines, some LL cycles are lost because of long operations like "prepare" for each component.
We need to be careful about what goes to DRAM: it is slower and, worse, the access time is not guaranteed, as the physical memory is shared with Linux/Windows/Chrome and our requests go last.

I think that, as long as we do have enough HPSRAM, we should use it.

@abonislawski (Member) commented:

The IPC part looks really suspicious - do you have any data on what the benefit and the performance drop are? Especially when the main CPU is under high load and we will lag more on DRAM accesses.

@lgirdwood (Member) commented:

HPSRAM is precious; agreed, we need to be really careful about what we put in DRAM - it should only be the parts of IPC that are not time critical. E.g. trigger is time critical, but load module is not. We need to find this balance. Linux only really cares about the prepare()/trigger() driver ops and any associated IPCs. Don't know about Windows?

@lgirdwood (Member) left a review comment:

Some functions are really obvious pipeline construction/free APIs, but some utility APIs could be used in the stream triggering flow. Best to check.

@lyakh (Collaborator, Author) commented Feb 25, 2025:

@lgirdwood @marcinszkudlinski @abonislawski as far as I understand, the worst case would be when we're running close to 100% of performance capacity and at that moment the user issues some IPCs - maybe to start an additional light stream. In principle we still have a couple of free DSP cycles to run an additional stream, but while preparing it, IPC processing adds significant DSP load. So if we process IPCs in DRAM, that processing becomes slower. As long as we don't disable interrupts during IPC processing for too long, we still shouldn't disturb the higher priority audio processing running in parallel, but the IPC response time will become longer. Is that what we're worried about? Is that important? Replying to @marcinszkudlinski - do we really lose LL cycles because of IPC processing? That shouldn't happen AFAICS? If we have code locking interrupts, we have to identify and improve it...

@lgirdwood (Member) commented:

> Replying to @marcinszkudlinski - do we really lose LL cycles because of IPC processing? That shouldn't happen AFAICS? If we have code locking interrupts, we have to identify and improve it...

We don't lose LL cycles, since LL preempts low-priority workloads/threads (even if a workload's TEXT is in DRAM, its stack/heap will be in SRAM). @jsarha can you share some data soon? Thanks.

@kv2019i (Collaborator) left a review comment:

Hmm, @lgirdwood, you mention in the comments that "trigger is time critical, but load module is not time critical". The current PR doesn't seem to make any provision to keep trigger-related code in hot memory. Not sure how to review this - is this intentional or not?

Diff:

 }

-int ipc4_pipeline_trigger(struct ipc_comp_dev *ppl_icd, uint32_t cmd, bool *delayed)
+__cold int ipc4_pipeline_trigger(struct ipc_comp_dev *ppl_icd, uint32_t cmd, bool *delayed)
Review comment (Collaborator):

Weren't trigger ops supposed to be kept on the warm path?

Review comment (Member):

@jsarha can you share the hot vs cold trigger IPC logs to help here.

Review comment (Collaborator, Author):

This specific function doesn't run in LL context, it only runs in IPC context, but I updated the trigger module-interface method documentation, double-checked other functions, and removed __cold from a few of them.

Review comment (Member):

@lyakh this was the last trigger() I could find.

Diff:

 }

-void ipc_cmd(struct ipc_cmd_hdr *_hdr)
+__cold void ipc_cmd(struct ipc_cmd_hdr *_hdr)
Review comment (Collaborator):

If we want to separate trigger/start from less timing critical IPCs, then we need to keep this top-level ipc_cmd as warm.
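To illustrate the split being asked for here - top-level dispatch stays warm, only the slow sub-handlers go cold - a minimal sketch; the handler names and dispatch detail are hypothetical, not the actual SOF code:

```c
#ifndef __cold
#define __cold	/* fallback for this sketch; SOF provides the real definition */
#endif

struct ipc_cmd_hdr;

/* Hypothetical slow-path handler: safe to execute from DRAM */
__cold void ipc_handle_module_load(struct ipc_cmd_hdr *hdr);

/* Hypothetical warm-path handler: trigger must stay in HPSRAM */
void ipc_handle_trigger(struct ipc_cmd_hdr *hdr);

/* The top-level dispatcher itself stays warm; only non-time-critical
 * branches jump into cold (DRAM-resident) code.
 */
void ipc_cmd(struct ipc_cmd_hdr *_hdr)
{
	(void)_hdr;
	/* ...decode _hdr and call ipc_handle_trigger() or
	 * ipc_handle_module_load() depending on the message type...
	 */
}
```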

Review comment (Member):

Yes, I think we should keep the most important IPC functions in HPSRAM, especially the whole trigger path.

Review comment (Member):

The IPC path runs outside of the LL context and in most cases is not time sensitive, with the exception of the trigger path. @lyakh can you update to remove the trigger path from __cold and add the new debug API to the non-trigger calls.

Review comment (Collaborator, Author):

> The IPC path runs outside of the LL context and in most cases is not time sensitive, with the exception of the trigger path. @lyakh can you update to remove the trigger path from __cold and add the new debug API to the non-trigger calls.

@lgirdwood I've already removed anything that I could identify as potentially running in LL-scheduling context. Do you see anything that I've missed? As for adding debugging - I was considering the pros and cons of doing it in this PR vs. adding it in a follow-up... But you're probably right - let's add it here. It will be safe (or at least safer) from the beginning, and a new PR would need all the CI runs anyway.

Review comment (Member):

I think I can only see the ipc_trigger() API call that needs to be hot.

Review comment (Collaborator, Author):

@lgirdwood sorry, which function exactly do you mean? If you mean ipc4_pipeline_trigger(), then I think it's fine for it to be cold. It's only called from the IPC handler and from IDC, never in the LL-scheduler context. If it were, my assert debugging would catch it - it's working; the previous iteration was all red because of that.
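As background, a sketch of the kind of runtime check being referred to (a cold function asserting that it is not executing in LL-scheduler context). The helper names ll_sched_is_current() and panic_dump() are illustrative assumptions, not the actual debug API added by this PR:

```c
#include <stdbool.h>

/* Assumed helpers, for illustration only - not the real SOF API */
extern bool ll_sched_is_current(void);	/* true while LL scheduler work runs */
extern void panic_dump(const char *msg);	/* fatal error reporting hook */

/* A __cold (DRAM-resident) function calls this to assert that it is not
 * running on the hot LL scheduling path; the offending function name is
 * reported to make violations easy to spot in the logs.
 */
static inline void assert_can_be_cold(const char *func)
{
	if (ll_sched_is_current())
		panic_dump(func);
}

#define ASSERT_CAN_BE_COLD()	assert_can_be_cold(__func__)
```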

@jsarha (Contributor) commented Feb 27, 2025:

> Replying to @marcinszkudlinski - do we really lose LL cycles because of IPC processing? That shouldn't happen AFAICS? If we have code locking interrupts, we have to identify and improve it...

> We don't lose LL cycles since LL preempts low priority workloads/threads (even if workload TEXT is in DRAM, stack/heap will be SRAM). @jsarha can you share some data soon. Thanks

[Screenshot at 2025-02-27 16-22-11: MCPS measurement plot]

There is indeed some impact on MCPS, at least in 44.1kHz playback through SRC. SRC playback was chosen because it's readily available in the nocodec topology and SRC has a lot of __cold-tagged functions in its configuration code. In addition to this PR I also merged #9844 on top of it. The test is a 5min 44.1kHz playback using the branch built with xcc, with both CONFIG_COLD_STORE_EXECUTE_DRAM=n and =y. It was run on an LNL RVP using the nocodec topology. The original mtrace files are here:
testb-dram-y-hw02-300s-mtrace.log
testb-dram-n-hw02-300s-mtrace.log

@lgirdwood (Member) commented:

> There is indeed some impact on MCPS, at least in 44.1kHz playback through SRC. SRC playback was chosen because it's readily available in the nocodec topology and SRC has a lot of __cold-tagged functions in its configuration code. In addition to this PR I also merged #9844 on top of it. The test is a 5min 44.1kHz playback using the branch built with xcc, with both CONFIG_COLD_STORE_EXECUTE_DRAM=n and =y. It was run on an LNL RVP using the nocodec topology. The original mtrace files are here:
> testb-dram-y-hw02-300s-mtrace.log
> testb-dram-n-hw02-300s-mtrace.log

Thanks @jsarha - there is a 20kcps delta with DRAM=y and this PR on LNL. I think the peaks are related to L1-exit work, and the 20kcps is due to the relocatable code used for llext. @lyakh do you concur?
@jsarha btw - can you upstream the script that scrapes the logs and produces the plots? :)

@abonislawski (Member) commented:

> I think the Peaks are related to L1 exit work

In general, peaks are a bigger problem than the average with DRAM, due to access-time instability (compared to "flat" HPSRAM).

@jsarha could you please repeat your test with added main-CPU (and memory-controller) load? Try both the large FFT and the smallest FFT.
Currently this shows the best-case scenario with very low CPU/memory load; with the FFT test it will show the worst-case scenario.
It is very important to test both for DRAM, because our latencies may vary significantly.

@lyakh (Collaborator, Author) commented Mar 4, 2025:

> Thanks @jsarha - there is a 20kcps delta with DRAM=y and this PR on LNL. I think the peaks are related to L1-exit work, and the 20kcps is due to the relocatable code used for llext. @lyakh do you concur?

@lgirdwood I think we need some more research there

@lyakh (Collaborator, Author) commented Mar 6, 2025:

@lgirdwood @jsarha lightly checked on MTL with nocodec on Port2 playback (core 2; includes SRC, volume and mixin - all have __cold code and some have __cold_rodata data) with no additional load - the difference remains within around 15kcps.
UPDATE: for a counter-test I moved mixin_process() and mixout_process() to DRAM and ran the same test. This time processing jumped by 1550kcps - by more than 80%.

@lgirdwood (Member) commented:

> @lgirdwood @jsarha lightly checked on MTL with nocodec on Port2 playback (core 2; includes SRC, volume and mixin - all have __cold code and some have __cold_rodata data) with no additional load - the difference remains within around 15kcps. UPDATE: for a counter-test I moved mixin_process() and mixout_process() to DRAM and ran the same test. This time processing jumped by 1550kcps - by more than 80%.

OK, thanks @jsarha, that makes sense - mixin/mixout are not yet __cold-enabled, so running process() from DRAM will have that level of impact.

@lyakh (Collaborator, Author) commented Mar 6, 2025:

CI:

@lgirdwood (Member) left a review comment:

Let's remove the trigger path and add the new debug API.

Diff:

 }

-void ipc_cmd(struct ipc_cmd_hdr *_hdr)
+__cold void ipc_cmd(struct ipc_cmd_hdr *_hdr)
Review comment (Member):

The IPC path runs outside of the LL context and in most cases is not time sensitive, with the exception of the trigger path. @lyakh can you update to remove the trigger path from __cold and add the new debug API to the non-trigger calls.

@lyakh force-pushed the dram branch 2 times, most recently from e3c38e8 to 7586c5f on March 7, 2025 11:48.
@jsarha (Contributor) commented Mar 11, 2025:

> You need to put load on cpu memory controller, otherwise such measurements will only show the best case scenario

I ran the test on this PR again, this time with this script running in the background the whole time:
while true; do dd bs=1024 count=1048576 if=/dev/urandom of=/run/dummy ; rm /run/dummy; done

i.e. copy a gigabyte of data from /dev/urandom to ramfs, delete the file, and do it all over again.

It does not appear to have too much effect on the results:
host-copier.0.playback init min 196 us max 456 us average 322 us of 100
gain.1.1 init min 192 us max 552 us average 327 us of 100
gain.1.1 conf min 197 us max 494 us average 325 us of 100
mixin.1.1 init min 193 us max 8219 us average 406 us of 100
mixout.2.1 init min 197 us max 573 us average 327 us of 100
gain.2.1 init min 197 us max 484 us average 324 us of 100
gain.2.1 conf min 193 us max 418 us average 322 us of 100
smart_amp.2.1 init min 194 us max 526 us average 326 us of 100
smart_amp.2.1 conf min 197 us max 1826 us average 393 us of 100
dai-copier.SSP.NoCodec-0.playback init min 202 us max 2135 us average 479 us of 100
pipeline.1: host-copier.0.playback, gain.1.1, mixin.1.1,
pipeline.1 3 min 200 us max 2592 us average 553 us of 200
pipeline.1 4 min 228 us max 3202 us average 673 us of 100
pipeline.1 2 min 208 us max 1867 us average 565 us of 100
pipeline.2: mixout.2.1, gain.2.1, smart_amp.2.1, dai-copier.SSP.NoCodec-0.playback,
pipeline.2 3 min 199 us max 2667 us average 863 us of 200
pipeline.2 4 min 205 us max 8656 us average 794 us of 100

And from the handling times of GLB_SET_PIPELINE_STATE messages from FW:
max 1790 min 139 avg 748.895

BTW, while the FW-measured times are more accurate in an individual case, the multi-line test gets distorted due to duplicated log lines. There is also one idle PAUSE message before the start of the playback, and it is usually handled really fast. The next RUNNING command actually starts the playback and takes longer, as does the final PAUSE command that actually stops the playback.

@abonislawski (Member) commented Mar 12, 2025:

> It does not appear to have too much effect on the results:

I can see significant differences even though we are not doing much work in DRAM - look how the mixin/mixout peak value can increase; it's crazy, but of course also expected for DRAM.

> I ran the test on this PR again, this time with this script running in the background the whole time: while true; do dd bs=1024 count=1048576 if=/dev/urandom of=/run/dummy ; rm /run/dummy; done
>
> i.e. copy a gigabyte of data from /dev/urandom to ramfs, delete the file, and do it all over again.

Try it the hard way: prime95 small/large FFT on all threads, let it burn :))

@lgirdwood (Member) commented:

@lyakh @jsarha @abonislawski for Linux-based OSes the critical time-sensitive IPC operation is trigger(), as this is done in atomic kernel context, so keeping it under 4ms is necessary (which the current results show). The other IPCs for pipeline construction are NOT time sensitive for Linux OSes (but that may be different for Windows flows?), as the host DMA buffers are still empty at this point. Please also consider that this platform is not DRAM-optimized, so results will be slower than on optimized platforms.

@lyakh are you able to put some of the core IPC functions - i.e. functions that will be used in all IPCs - under a Kconfig option as cold/hot, so that this can be a tuning option (see the sketch below)? Thanks!

@jsarha if you can't run prime95 on the test HW, please build some kernels with a high -j (greater than the number of cores/threads).
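A possible shape for the Kconfig tuning knob suggested above, sketched under assumptions - CONFIG_IPC_COLD and __cold_ipc are illustrative names, not existing SOF symbols:

```c
#ifndef __cold
#define __cold	/* fallback for this sketch; SOF provides the real definition */
#endif

/* Gate the cold attribute for the common IPC path behind its own option,
 * so the SRAM/DRAM split becomes a per-platform tuning choice.
 */
#ifdef CONFIG_IPC_COLD
#define __cold_ipc	__cold	/* core IPC functions execute from DRAM */
#else
#define __cold_ipc		/* keep core IPC functions in HPSRAM */
#endif

struct ipc_cmd_hdr;

/* A core IPC entry point takes the tunable attribute instead of being
 * unconditionally __cold.
 */
__cold_ipc void ipc_cmd(struct ipc_cmd_hdr *_hdr);
```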

@lyakh (Collaborator, Author) commented Mar 12, 2025:

Repeated a simple 20s playback test on MTL with nocodec on Port2 / core 2 - interestingly, the stats show a 10kcps saving with this version, possibly because of a smaller SRAM footprint and thus a better cache hit ratio.

> @lyakh are you able to put some of the core IPC functions - i.e. functions that will be used in all IPCs - under a Kconfig option as cold/hot, so that this can be a tuning option? Thanks!

@lgirdwood sure, I don't see why it shouldn't be possible. Any specific wish as to which ones? Trigger-related? But before that - can we check how long IPCs take with CONFIG_COLD_STORE_EXECUTE_DRAM=n and under DRAM load? For that, can we concentrate on a smaller number of values? Now that we know roughly how the different values relate, maybe take one that is usually among the smallest, one that's usually the largest, and one more somewhere in the middle, @jsarha?

@lyakh (Collaborator, Author) commented Mar 13, 2025:

@lgirdwood now updated to exclude the trigger path from DRAM. I guess we need that measured by @jsarha. I'll work on adding some debugging for this...

@jsarha (Contributor) commented Mar 13, 2025:

> @jsarha if you can't run prime95 on the test HW, please build some kernels with a high -j (greater than the number of cores/threads).

Oh, I can. I've just been trying to get my FW-side processing-time measurements working reliably, but the results are still weird. Anyway, here are the Linux-log-measured timings.

First, the main branch with CONFIG_COLD_STORE_EXECUTE_DRAM=n and no background load:

host-copier.0.playback init min 182 us max 378 us average 310 us of 100
gain.1.1 init min 188 us max 436 us average 312 us of 100
gain.1.1 conf min 187 us max 416 us average 310 us of 100
mixin.1.1 init min 183 us max 362 us average 304 us of 100
mixout.2.1 init min 182 us max 391 us average 304 us of 100
gain.2.1 init min 181 us max 427 us average 310 us of 100
gain.2.1 conf min 181 us max 503 us average 305 us of 100
smart_amp.2.1 init min 182 us max 456 us average 306 us of 100
smart_amp.2.1 conf min 180 us max 383 us average 307 us of 100
dai-copier.SSP.NoCodec-0.playback init min 182 us max 538 us average 309 us of 100
pipeline.1: host-copier.0.playback, gain.1.1, mixin.1.1,
pipeline.1 3 min 179 us max 1653 us average 594 us of 200
pipeline.1 4 min 186 us max 469 us average 304 us of 100
pipeline.1 2 min 189 us max 406 us average 312 us of 100
pipeline.2: mixout.2.1, gain.2.1, smart_amp.2.1, dai-copier.SSP.NoCodec-0.playback,
pipeline.2 3 min 185 us max 480 us average 332 us of 200
pipeline.2 4 min 178 us max 422 us average 305 us of 100

These measurements have been made with this PR [1], CONFIG_COLD_STORE_EXECUTE_DRAM=y, and with mprime running in the background with 8 threads and large FFTs:

host-copier.0.playback init min 310 us max 2367 us average 425 us of 100
gain.1.1 init min 309 us max 3448 us average 451 us of 100
gain.1.1 conf min 315 us max 1529 us average 439 us of 100
mixin.1.1 init min 309 us max 1062 us average 403 us of 100
mixout.2.1 init min 309 us max 1512 us average 402 us of 100
gain.2.1 init min 311 us max 698 us average 397 us of 100
gain.2.1 conf min 301 us max 1064 us average 391 us of 100
smart_amp.2.1 init min 307 us max 2449 us average 432 us of 100
smart_amp.2.1 conf min 305 us max 664 us average 385 us of 100
dai-copier.SSP.NoCodec-0.playback init min 309 us max 449 us average 378 us of 100
pipeline.1: host-copier.0.playback, gain.1.1, mixin.1.1,
pipeline.1 3 min 307 us max 1718 us average 526 us of 200
pipeline.1 4 min 303 us max 463 us average 375 us of 100
pipeline.1 2 min 314 us max 1723 us average 534 us of 100
pipeline.2: mixout.2.1, gain.2.1, smart_amp.2.1, dai-copier.SSP.NoCodec-0.playback,
pipeline.2 3 min 303 us max 1514 us average 410 us of 200
pipeline.2 4 min 308 us max 467 us average 377 us of 100

[1] This was run before the latest update; the version used was 7586c5f. Sorry, I did not notice in time that there was a new version. I'll do yet another run tomorrow.

Move several initialisation functions to run from DRAM directly.
Note that we cannot use assert_can_be_cold() in these functions yet,
because the schedulers haven't been initialised at that point, but
we're sufficiently confident that these functions never run in
LL-scheduler context.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
lyakh added 6 commits March 18, 2025 10:24
Mark most IPC functions as "cold" to run them directly in DRAM.
Explicitly avoid making exported functions "cold", as well as those
that are either known to be, or can potentially be, called from hot
code paths.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
ipc4_pipeline_complete() returns POSIX error codes, not IPC4 ones.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
struct module_interface::trigger() is called from both hot and cold
contexts, therefore it usually shouldn't be marked as __cold.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
Add the name of the function that was called in DRAM from LL context,
to simplify debugging.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
struct module_interface::reset() is called from the trigger IPC
context, therefore it shouldn't be "cold."

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
.prepare() is called as a part of the trigger processing flow, so it
shouldn't be "cold."

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
@jsarha (Contributor) commented Mar 19, 2025:

One more test result round:

IPC response time comparison report on 18th Mar 2025 for PR 9850

The test consists of two SW configurations, two test sequences and
two different load conditions, so 4 different tests. All the tests
were run on an LNL SDW RVP sitting in the Espoo lab with the nocodec
topology. The test itself was 10 playback runs of 2 seconds each to
hw:0,2 (Port 2) at 44.1kHz, with 3 seconds of sleep in between.

The SW configurations are SOF main (1ef4e603a61d) built with
CONFIG_COLD_STORE_EXECUTE_DRAM=n, and the latest PR9850 (28b93ee9aaf)
with CONFIG_COLD_STORE_EXECUTE_DRAM=y. The two load conditions were a
no-load situation and Prime95 running with 8 threads and large FFTs.


1.1 Playback tests, no load

1.1.1 Playback tests, no load, main branch 

host-copier.2.playback fw init  min 75 us       max 91 us       average 83 us of 10
gain.5.1 fw init        min 44 us       max 274 us      average 72 us of 10
gain.5.1 fw conf        min 23 us       max 23 us       average 23 us of 10
src.5.1 fw init 	min 40 us       max 257 us      average 94 us of 10
mixin.5.1 fw init       min 36 us       max 227 us      average 102 us of 10
mixout.6.1 fw init      min 31 us       max 31 us       average 31 us of 10
gain.6.1 fw init        min 37 us       max 37 us       average 37 us of 10
gain.6.1 fw conf        min 20 us       max 21 us       average 20 us of 10
dai-copier.SSP.NoCodec-2.playback fw init       min 141 us      max 148 us     average 142 us of 10

pipeline.5: host-copier.2.playback, gain.5.1, src.5.1, mixin.5.1, 
pipeline.5 4 fw min 516 us      max 1363 us     average 833 us of 10
pipeline.5 3 fw min 348 us      max 658 us      average 415 us of 10
pipe multi 2 fw min 226 us      max 227 us      average 226 us of 10

pipeline.6: mixout.6.1, gain.6.1, dai-copier.SSP.NoCodec-2.playback, 
pipeline.6 4 fw min 488 us      max 533 us      average 506 us of 10
pipeline.6 3 fw min 366 us      max 1038 us     average 517 us of 10

1.1.2 Playback tests, no load, pr9850

host-copier.2.playback fw init  min 193 us      max 267 us      average 226 us of 10
gain.5.1 fw init        min 150 us      max 218 us      average 192 us of 10
gain.5.1 fw conf        min 81 us       max 104 us      average 87 us of 10
src.5.1 fw init 	min 214 us      max 241 us      average 229 us of 10
mixin.5.1 fw init       min 186 us      max 223 us      average 210 us of 10
mixout.6.1 fw init      min 185 us      max 208 us      average 198 us of 10
gain.6.1 fw init        min 189 us      max 211 us      average 202 us of 10
gain.6.1 fw conf        min 78 us       max 92 us       average 83 us of 10
dai-copier.SSP.NoCodec-2.playback fw init       min 308 us      max 331 us     average 319 us of 10

pipeline.5: host-copier.2.playback, gain.5.1, src.5.1, mixin.5.1, 
pipeline.5 4 fw min 1156 us     max 2135 us     average 1581 us of 10
pipeline.5 3 fw min 307 us      max 770 us      average 476 us of 10
pipe multi 2 fw min 222 us      max 223 us      average 222 us of 10

pipeline.6: mixout.6.1, gain.6.1, dai-copier.SSP.NoCodec-2.playback, 
pipeline.6 4 fw min 502 us      max 541 us      average 525 us of 10
pipeline.6 3 fw min 386 us      max 1047 us     average 571 us of 10

1.2 Playback tests, Prime95 load

1.2.1 Playback tests, Prime95 load, main branch

host-copier.2.playback fw init  min 92 us       max 146 us      average 112 us of 10
gain.5.1 fw init        min 63 us       max 106 us      average 86 us of 10
gain.5.1 fw conf        min 23 us       max 23 us       average 23 us of 10
src.5.1 fw init 	min 93 us       max 139 us      average 111 us of 10
mixin.5.1 fw init       min 77 us       max 116 us      average 94 us of 10
mixout.6.1 fw init      min 81 us       max 103 us      average 92 us of 10
gain.6.1 fw init        min 89 us       max 113 us      average 100 us of 10
gain.6.1 fw conf        min 19 us       max 20 us       average 19 us of 10
dai-copier.SSP.NoCodec-2.playback fw init       min 199 us      max 295 us     average 223 us of 10

pipeline.5: host-copier.2.playback, gain.5.1, src.5.1, mixin.5.1, 
pipeline.5 4 fw min 1072 us     max 2062 us     average 1722 us of 10
pipeline.5 3 fw min 992 us      max 1006 us     average 999 us of 10
pipe multi 2 fw min 252 us      max 420 us      average 271 us of 10

pipeline.6: mixout.6.1, gain.6.1, dai-copier.SSP.NoCodec-2.playback, 
pipeline.6 4 fw min 516 us      max 725 us      average 579 us of 10
pipeline.6 3 fw min 417 us      max 716 us      average 681 us of 10

1.2.2 Playback tests, Prime95 load, pr9850

host-copier.2.playback fw init  min 122 us      max 251 us      average 188 us of 10
gain.5.1 fw init        min 97 us       max 197 us      average 149 us of 10
gain.5.1 fw conf        min 39 us       max 88 us       average 68 us of 10
src.5.1 fw init 	min 121 us      max 225 us      average 192 us of 10
mixin.5.1 fw init       min 121 us      max 207 us      average 178 us of 10
mixout.6.1 fw init      min 162 us      max 198 us      average 181 us of 10
gain.6.1 fw init        min 174 us      max 199 us      average 186 us of 10
gain.6.1 fw conf        min 61 us       max 92 us       average 80 us of 10
dai-copier.SSP.NoCodec-2.playback fw init       min 281 us      max 321 us     average 301 us of 10

pipeline.5: host-copier.2.playback, gain.5.1, src.5.1, mixin.5.1, 
pipeline.5 4 fw min 988 us      max 2018 us     average 1588 us of 10
pipeline.5 3 fw min 991 us      max 1047 us     average 1001 us of 10
pipe multi 2 fw min 222 us      max 223 us      average 222 us of 10

pipeline.6: mixout.6.1, gain.6.1, dai-copier.SSP.NoCodec-2.playback, 
pipeline.6 4 fw min 454 us      max 752 us      average 549 us of 10
pipeline.6 3 fw min 426 us      max 722 us      average 654 us of 10

@lyakh (Collaborator, Author) commented Mar 19, 2025:

@jsarha thanks for the numbers! Let me try to pick out a couple of examples for easy comparison. I'll only look at average values.

  1. No load, main vs. PR:
  • gain.5.1 init: 72us vs. 192us
  • gain.5.1 conf: 23us vs. 87us
  • pipeline.5 starting: 833us vs. 1581us
  • pipeline.5 stopping: 415us vs. 476us
  • pipeline.6 starting: 506us vs. 525us
  2. Memory load, main vs. PR:
  • gain.5.1 init: 86us vs. 149us
  • gain.5.1 conf: 23us vs. 68us
  • pipeline.5 starting: 1722us vs. 1588us
  • pipeline.5 stopping: 999us vs. 1001us
  • pipeline.6 starting: 796us vs. 549us

Some conclusions: DRAM execution does increase IPC processing times significantly, sometimes by up to a factor of 4. Under load the differences become smaller, and the DRAM times actually improve - to the point that the pipeline state-set IPC times reverse and become better when run from DRAM...

@lgirdwood (Member) commented:

Thanks for the results, @lyakh and @jsarha. Whilst IPCs can take longer, the results are still negligible in the non-time-critical flows.

@lyakh (Collaborator, Author) commented Mar 19, 2025:

> Thanks for the results, @lyakh and @jsarha. Whilst IPCs can take longer, the results are still negligible in the non-time-critical flows.

Yes, it looks like an acceptable price to pay for non-performance-critical code when SRAM is really tight.

@lyakh (Collaborator, Author) commented Mar 19, 2025:

Also note that #9907 has not identified any DRAM-on-hot-path violations in this PR. Of course it's possible that the verification isn't perfect, but I've used it to identify and clean up many such violations, so at least we should be much cleaner now. Once this PR is merged, we should merge #9907 too.

@lgirdwood merged commit bac3031 into thesofproject:main on Mar 19, 2025 - 45 of 49 checks passed.
@lyakh deleted the dram branch on March 19, 2025 at 16:04.