DRAM: more cold functions #9850
Conversation
|
I understand why init functions should go to DRAM, but why IPC? |
@marcinszkudlinski the idea is that only audio protocols are "hot" - only schedulers and audio processing threads. Everything else can be "cold", and IPC processing is one such large code area. But if you have concerns that this can break something, let's discuss - maybe we're overlooking some use-cases? |
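For context, a minimal sketch of how a marker like __cold can steer a function's code into a DRAM-mapped linker section. The section name and the example handler below are illustrative, not necessarily the exact SOF definitions:

```c
#include <stddef.h>

/* Illustrative only: the real SOF definition of __cold and the section name
 * used by the linker script may differ.
 */
#ifdef CONFIG_COLD_STORE_EXECUTE_DRAM
/* Place the function's text in a ".cold" section, which the linker script
 * maps to DRAM instead of HPSRAM.
 */
#define __cold __attribute__((__section__(".cold")))
#else
#define __cold
#endif

/* Hypothetical example: an IPC helper that never runs in the LL-scheduler
 * context can tolerate slower instruction fetches from DRAM.
 */
__cold static int example_parse_large_config(const void *payload, size_t size)
{
	/* ... non-time-critical parsing and bookkeeping ... */
	return payload && size ? 0 : -1;
}
```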
|
@lyakh not really I think - as long as we do have enough HPSRAM, use it. |
|
The IPC part looks really suspicious - do you have any data on what the benefit and the performance drop are? Especially when the main CPU is under high load and we will lag more with DRAM access |
|
HPSRAM is precious - agreed, we need to be really careful about what we put in DRAM; it should only be the parts of IPC that are not time critical, i.e. trigger is time critical, but load module is not. We need to find this balance. Linux only really cares about the prepare()/trigger() driver ops and any associated IPCs. Don't know about Windows?
lgirdwood
left a comment
Some functions are really obvious pipeline construction/free APIs, but some utility APIs could be used in the stream triggering flow. Best to check.
|
@lgirdwood @marcinszkudlinski @abonislawski as far as I understand the worst would be cases when we're running close to 100% performance capacity and at that moment the user is issuing some IPCs - maybe to start an additional light stream. In principle we still have a couple of free DSP cycles to run an additional stream, but while preparing it, IPC processing adds significant DSP load. So, if we process IPCs in DRAM, that processing becomes slower. As long as we don't disable interrupts during IPC processing for too long, we still shouldn't disturb higher priority audio processing, running in parallel, but IPC response time will become longer. Is that what we're worried about? Is that important? Replying to @marcinszkudlinski - do we really lose LL cycles because of IPC processing? That shouldn't happen AFAICS? If we have code, locking interrupts, we have to identify and improve it... |
|
> Replying to @marcinszkudlinski - do we really lose LL cycles because of IPC processing? That shouldn't happen AFAICS? If we have code locking interrupts, we have to identify and improve it...

We don't lose LL cycles, since LL preempts low-priority workloads/threads (even if the workload's TEXT is in DRAM, its stack/heap will be in SRAM). @jsarha can you share some data soon? Thanks |
kv2019i
left a comment
Hmm, @lgirdwood you mention in the comments that "trigger is time critical, but load module is not time critical". The current PR doesn't seem to make any provision to keep trigger-related code in hot memory. Not sure how to review this - is this intentional or not?
src/ipc/ipc4/handler.c (outdated)
-int ipc4_pipeline_trigger(struct ipc_comp_dev *ppl_icd, uint32_t cmd, bool *delayed)
+__cold int ipc4_pipeline_trigger(struct ipc_comp_dev *ppl_icd, uint32_t cmd, bool *delayed)
Weren't trigger ops supposed to be kept on the warm path?
@jsarha can you share the hot vs cold trigger IPC logs to help here.
This specific function doesn't run in LL context - it only runs in IPC context. But I updated the trigger module-interface method documentation, double-checked other functions, and removed __cold from a few of them
@lyakh this was the last trigger() I could find.
src/ipc/ipc4/handler.c (outdated)
-void ipc_cmd(struct ipc_cmd_hdr *_hdr)
+__cold void ipc_cmd(struct ipc_cmd_hdr *_hdr)
If we want to separate trigger/start from less timing critical IPCs, then we need to keep this top-level ipc_cmd as warm.
Yes, I think we should keep the most important IPC functions in HPSRAM, especially the whole trigger path
The IPC path runs outside of the LL context and in most cases is not time sensitive, with the exception of the trigger path. @lyakh can you update to remove the trigger path from __cold and add the new debug API to the non-trigger calls.
> The IPC path runs outside of the LL context and in most cases is not time sensitive, with the exception of the trigger path. @lyakh can you update to remove the trigger path from __cold and add the new debug API to the non-trigger calls.

@lgirdwood I've already removed anything that I could identify as potentially running in the LL-scheduling context. Do you see anything that I've missed? As for adding debugging - I was considering the pros and cons of doing it in this PR vs. adding it in a follow-up... But you're probably right - let's add it here. It will be safe (or at least safer) from the beginning, and a new PR would need all the CI runs anyway.
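As a rough illustration of what such debugging could look like (all names below are hypothetical and simplified, not the actual SOF implementation): a guard at the top of a "cold" function that reports the function's own name if it is ever entered from the LL-scheduler context.

```c
#include <stdbool.h>

/* Hypothetical helpers: real code would query the scheduler state and use
 * the platform's panic/logging facilities instead of these stubs.
 */
extern bool ll_scheduler_context(void);
extern void panic_with_msg(const char *func_name);

/* Sketch of a guard in the spirit of assert_can_be_cold(): placed at the top
 * of a DRAM-resident function, it reports the offending function's name if
 * that function is ever executed from the LL-scheduler (hot) context.
 */
#define assert_can_be_cold_sketch()			\
	do {						\
		if (ll_scheduler_context())		\
			panic_with_msg(__func__);	\
	} while (0)

/* Imagine this function marked __cold and therefore running from DRAM. */
static int example_cold_ipc_handler(void)
{
	assert_can_be_cold_sketch();	/* catches accidental hot-path calls */
	/* ... non-time-critical IPC processing ... */
	return 0;
}
```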
I think I can only see the ipc_trigger() API call that needs to be hot.
@lgirdwood sorry, which function exactly do you mean? If you mean ipc4_pipeline_trigger(), then I think it's good for it to be cold. It's only called from the IPC handler and from IDC, never in the LL-scheduler context. If it were, my assert debugging would catch it - it's working; the previous iteration was all red because of that.
There is indeed some impact to MCPS, at least in 44.1kHz playback through SRC. SRC playback was chosen because it's readily available in the nocodec topology and SRC has a lot of __cold-tagged functions in its configuration code. In addition to this PR I also merged #9844 on top of it. The test is a 5-minute 44.1kHz playback using the branch built with xcc, with both CONFIG_COLD_STORE_EXECUTE_DRAM=n and =y. It was run on an LNL RVP using the nocodec topology. The original mtrace files are here: |
Thanks @jsarha - there is a 20kcps delta with DRAM=y and this PR on LNL. I think the peaks are related to L1 exit work, and the 20kcps is due to the relocatable code used for llext. @lyakh do you concur? |
In general, peaks are a bigger problem than averages in DRAM due to access instability (compared to "flat" HPSRAM). @jsarha could you please repeat your test with added main CPU (and memory controller) load? (Try large FFT and smallest FFT.) |
@lgirdwood I think we need some more research there |
|
@lgirdwood @jsarha lightly checked on MTL with nocodec on Port2 playback (core 2, includes SRC, volume and mixin - all have |
Ok, thanks @jsarha that makes sense, mixin/mixout are not yet __cold enabled so doing |
|
CI:
|
lgirdwood
left a comment
Let's remove the trigger path and add the new debug API
e3c38e8 to 7586c5f
I ran the test on this PR again, this time with this script running in the background all the time: e.g. copy a gigabyte of data from /dev/urandom to ramfs, delete the file, and do it all over again. It does not appear to have too much effect on the results: And from the handling times of GLB_SET_PIPELINE_STATE messages from FW: BTW, while the FW-measured times are more accurate in an individual case, the multi-line test gets distorted due to duplicated log lines. There is also one idle PAUSE message before the start of the playback and it's usually handled really fast. The next RUNNING command actually starts the playback and takes longer, as does the final PAUSE command that actually stops the playback. |
I can see significant differences even if we are not doing much work in DRAM - look how the mixin/out peak value can increase; it's crazy, but of course also expected for DRAM
try the hard way: prime95 small/large fft on all threads, let it burn :)) |
|
@lyakh @jsarha @abonislawski for Linux-based OSes the critical time-sensitive IPC operation is trigger(), as this is done in atomic kernel context, so keeping it under 4ms is necessary (which the current results show). The other IPCs for pipeline construction are NOT time sensitive for Linux OSes (but maybe different for Windows flows?) as host DMA buffers are still empty at this point. Please also consider that this platform is not DRAM optimized, so results will be slower than on optimized platforms. @lyakh are you able to put some of the core IPC functions under Kconfig as cold/hot (so that this can be a tuning option), i.e. functions that will be used on all IPCs? Thanks! @jsarha if you can't run prime95 on the test HW, please build some kernels with a high -j (greater than the number of cores/threads) |
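A rough sketch of what such a Kconfig tuning knob could look like. The CONFIG_IPC_CORE_COLD symbol and the __cold_ipc macro are made-up names for illustration; only ipc_cmd() and __cold come from the PR itself:

```c
/* Placeholder definition; in the real tree __cold comes from SOF headers. */
#ifndef __cold
#define __cold __attribute__((__section__(".cold")))
#endif

/* Hypothetical Kconfig symbol controlling only the core IPC entry points,
 * so their hot/cold placement becomes a per-platform tuning option.
 */
#ifdef CONFIG_IPC_CORE_COLD
#define __cold_ipc __cold	/* run the core IPC dispatch path from DRAM */
#else
#define __cold_ipc		/* keep the core IPC dispatch path in HPSRAM */
#endif

struct ipc_cmd_hdr;

/* The top-level dispatcher discussed in this thread would then be declared: */
__cold_ipc void ipc_cmd(struct ipc_cmd_hdr *_hdr);
```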
|
Repeated a simple 20s playback test on MTL with nocodec on Port2 / core 2 - interestingly, stats show a 10kcps saving with this version, possibly because of a smaller SRAM footprint and thus a better cache hit ratio
@lgirdwood sure, don't see why it shouldn't be possible. Any specific wish which ones? Trigger-related? But before that - can we check how long IPCs take with |
|
@lgirdwood now updated to exclude the trigger path from DRAM. I guess we need that measured by @jsarha . I'll work on adding some debugging for this... |
Oh, I can. I've just been trying to get my FW-side processing time measurements working reliably, but the results are still weird. Anyway, here are the Linux-log-measured timings. First, the main branch with CONFIG_COLD_STORE_EXECUTE_DRAM=n and no background load:

host-copier.0.playback init min 182 us max 378 us average 310 us of 100

These measurements have been made with this PR [1], CONFIG_COLD_STORE_EXECUTE_DRAM=y, and with mprime running in the background with 8 threads and Large FFTs:

host-copier.0.playback init min 310 us max 2367 us average 425 us of 100

[1] This was run before the latest update. The version used was 7586c5f. Sorry, I did not notice in time that there was a new version. I'll do yet another run tomorrow. |
Move several initialisation functions to run from DRAM directly. Note that we cannot use assert_can_be_cold() in these functions yet, because the schedulers haven't been initialised at that point, but we're sufficiently confident that these functions never run in the LL-scheduler context. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
Mark most IPC functions as "cold" to run them directly in DRAM. Explicitly avoid making exported functions "cold," as well as those that are either known to be or can potentially be called from hot code paths. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
ipc4_pipeline_complete() returns POSIX error codes, not IPC4 ones. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
struct module_interface::trigger() is called from both hot and cold contexts, therefore it usually shouldn't be marked as __cold. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
Add the name of the function that was called in DRAM from LL context, to simplify debugging. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
struct module_interface::reset() is called from the trigger IPC context, therefore it shouldn't be "cold." Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
.prepare() is called as a part of the trigger processing flow, so it shouldn't be "cold." Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
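Taken together, the last few commits amount to a split along these lines. This is a simplified, hypothetical ops structure for illustration; the real struct module_interface has a different and larger layout:

```c
/* Simplified, hypothetical module ops illustrating the hot/cold split the
 * commits above describe; not the actual struct module_interface.
 */
struct example_module_ops {
	int (*init)(void *mod);             /* construction, not time critical: could be cold */
	int (*prepare)(void *mod);          /* part of the trigger flow: keep hot */
	int (*trigger)(void *mod, int cmd); /* hot and cold callers: keep hot */
	int (*reset)(void *mod);            /* called from trigger IPC context: keep hot */
	int (*free)(void *mod);             /* teardown, not time critical: could be cold */
};
```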
|
One more test result round: |
|
@jsarha thanks for the numbers! Let me try to pick up a couple of examples for easy comparison. I'll only look at average values.
Some conclusions: DRAM execution does increase IPC processing times significantly, sometimes by up to a factor of 4. Under load the differences become smaller and DRAM times actually improve. This leads to the pipeline-state-set IPC times actually reversing and becoming better when run from DRAM... |
|
Also note that #9907 has not identified any DRAM-on-hot-path violations in this PR. Of course, it's possible that the verification isn't perfect, but I've used it to identify and clean up many such violations, so at least we should be much cleaner now. Once this PR is merged, we should merge #9907 too |

Move all of IPC and some initialisation code to DRAM.