-
Notifications
You must be signed in to change notification settings - Fork 349
DRAM debug take 2 #9907
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRAM debug take 2 #9907
Conversation
8b5e30c to
6c3faba
Compare
|
empty PTL. retest |
|
SOFCI TEST |
|
...but in fact the important for DRAM testing platform is currently MTL, because it has |
|
CI: strange pause-release failures on
Cannot reproduce locally so far. Since there's now a conflict, I'll rebase and see if it re-occurs... |
Adding all source files in a single, giant zephyr/CMakeLists.txt is inconvenient and does not scale. Link: thesofproject#8260 Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
lgirdwood
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we need to go down to this level when we have the performance testing and the cold assert check now ?
| endif() | ||
| add_subdirectory(tester) | ||
|
|
||
| is_zephyr(it_is) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @kv2019i has improved this check now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could stay just with assert_can_be_cold to not overcomplicate dram things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgirdwood don't think so, we plan to do this globally but I don't think this has been changed yet, I still see this in "main"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could stay just with assert_can_be_cold to not overcomplicate dram things
@abonislawski that part doesn't change. That's remains the only check that we add to __cold functions. But inside that function it now performs 2 checks: (1) for LL-context, and (2) for any "critical path" violations
@lgirdwood it is very easy to miss functions that can be called on hot paths and assign them as "cold." Without this debugging I'd have missed 10 or more of them. Particularly those, called during trigger IPC handling. And we don't have automated performance checking yet either. So I think we need this. Maybe we could enable it selectively. E.g. on MTL and LNL only? I've run both playback and capture tests on MTL nocodec on various interfaces over 13 seconds. The difference between the first (after timer sync) and the last firmware message with debugging was in most cases 0-4ms longer. Only in one case it was 14ms longer. LL measurements were on average 10us longer. So, I think it should be ok to enable this debugging for all. |
|
CI:
|
kv2019i
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly ok, a couple of note/questions inline.
src/ipc/ipc-zephyr.c
Outdated
| } | ||
|
|
||
| mem_hot_path_stop_watching(); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we power down on L243 and we have context save enabled in the build? Wouldn't that mess the tracker?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kv2019i not sure why it should confuse it? The tracker is just a couple of simple variables, if the context is saved, then those variables are saved too. But in fact if any of those power on / off functions are cold, then it might trigger a bug. Let me reduce the watched scope
| return IPC4_INVALID_REQUEST; | ||
| } | ||
|
|
||
| mem_hot_path_confirm(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be an excellent place to add a comment why this is considered a hot-path.
DRAM execution is banned in LL task context, but there are also time
intervals during which performance is important. Moreover, sometimes
such critical periods aren't certain until a later time. E.g. when
processing IPCs, only some of them are performance critical. And
those time-critical IPCs should be handled with maximum performance
from IPC-entry till IPC-exit, not only while handling that specific
IPC. Consider this pseudo-code:
void ipc_cmd()
{
common_pre_processing();
switch (cmd) {
case NON_CRITICAL_CMD:
handle_non_critical();
break;
case CRITICAL_CMD:
handle_critical();
}
common_post_processing();
}
In this case ipc_cmd(), common_pre_processing(), handle_critical()
and common_post_processing() cannot be executed in DRAM, while
handle_non_critical() can be executed in DRAM. If we place
start-critical-debugging and stop-critical-debugging to cover all of
ipc_cmd(), we'll be forced to also place handle_non_critical() in
SRAM. OTOH if we only surround handle_critical() with those markers,
we will miss all the common code. To support such cases we use 3
markers:
mem_hot_path_start_watching()
mem_hot_path_stop_watching()
mem_hot_path_confirm()
Then we place start- and stop-watching in the beginning and end of
ipc_cmd(), and mem_hot_path_confirm() under the CRITICAL_CMD case.
Then if we made common_pre_processing() or common_post_processing()
__cold, the fact of them being called while watching will be recorded.
Then mem_hot_path_confirm() will set a confirmation flag. Then when
watching is stopped, we check if the flag was raised and a cold
function was called, in which case we issue an error.
Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
|
CI:
|
| static const char *cold_path_fn; | ||
| static bool hot_path_confirmed; | ||
|
|
||
| void mem_cold_path_enter(const char *fn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We really need to put all debug logic behind a dbg_ API prefix and a Kconfig to enable and disable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgirdwood so far we don't seem to have such a consistent convention. If you grep the source tree for dbg_ you'll see a few internal uses and a few global macros in src/include/sof/debug/debug.h which don't seem to be used anywhere any more, so they can be removed. After that a consistent dbg_* namespace could be established. FWIW other files under src/debug/ don't use it either. One of the files there uses the debug_ prefix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets make it consistent starting with this file. We really have to namespace APIs better.
|
Ok, lets come back and use the |
Additional DRAM / cold code debugging. This allows to designate and verify and run-time interval as performance-critical and banned for DRAM code. ATM this includes #9850 , will remove once merged