DRAM debug take 2 #9907

lyakh · 2025-03-18T16:01:58Z

Additional DRAM / cold code debugging. This allows designating a run-time interval as performance-critical and off-limits for DRAM code, and verifying that. ATM this includes #9850; I will remove it once that is merged.

lyakh · 2025-03-19T14:12:17Z

empty PTL. retest

lyakh · 2025-03-19T14:12:30Z

SOFCI TEST

lyakh · 2025-03-19T14:22:10Z

...but in fact the platform that currently matters for DRAM testing is MTL, because it has CONFIG_COLD_STORE_EXECUTE_DRAM=y in its default configuration, and https://sof-ci.01.org/sofpr/PR9907/build11620/devicetest/index.html was clean (apart from HDA not running)

lyakh · 2025-03-20T14:10:23Z

CI: strange pause-release failures on

Cannot reproduce locally so far. Since there's now a conflict, I'll rebase and see if it re-occurs...

lgirdwood

Not sure we need to go down to this level when we have the performance testing and the cold assert check now ?

lgirdwood · 2025-03-20T15:00:08Z

src/debug/CMakeLists.txt

-endif()
+add_subdirectory(tester)
+
+is_zephyr(it_is)


I think @kv2019i has improved this check now.

I think we could stay just with assert_can_be_cold to not overcomplicate dram things

@lgirdwood don't think so, we plan to do this globally but I don't think this has been changed yet, I still see this in "main"

> I think we could stay just with assert_can_be_cold to not overcomplicate dram things

@abonislawski that part doesn't change. That remains the only check that we add to __cold functions. But inside that function it now performs 2 checks: (1) for LL-context, and (2) for any "critical path" violations
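To illustrate the idea of one entry check covering both conditions, here is a minimal sketch. It is not the SOF implementation: `sketch_assert_can_be_cold()`, the two state flags and the result codes are hypothetical names made up for this example, and real firmware would take the LL-context information from the scheduler and would log or assert instead of returning a code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state, for illustration only. */
static bool sketch_in_ll_context;      /* true while running in LL task context */
static bool sketch_watching;           /* true while a potential hot path is watched */
static const char *sketch_recorded_fn; /* first cold function seen while watching */

enum sketch_cold_result {
	SKETCH_COLD_OK,           /* cold code is allowed here */
	SKETCH_COLD_LL_VIOLATION, /* check 1 failed: called from LL context */
	SKETCH_COLD_RECORDED,     /* check 2: call noted for a later verdict */
};

/* The single call that a cold function would make on entry. */
static enum sketch_cold_result sketch_assert_can_be_cold(const char *fn)
{
	/* Check 1: DRAM code must never run in LL task context. */
	if (sketch_in_ll_context)
		return SKETCH_COLD_LL_VIOLATION;

	/*
	 * Check 2: while a potential critical path is being watched, only
	 * record the call. Whether it is an error is decided later.
	 */
	if (sketch_watching) {
		if (!sketch_recorded_fn)
			sketch_recorded_fn = fn;
		return SKETCH_COLD_RECORDED;
	}

	return SKETCH_COLD_OK;
}
```

The point of the sketch is that callers still add exactly one line to each cold function; the second check is hidden behind it.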

lyakh · 2025-03-20T16:50:41Z

> Not sure we need to go down to this level when we have the performance testing and the cold assert check now ?

@lgirdwood it is very easy to miss functions that can be called on hot paths and mark them as "cold". Without this debugging I'd have missed 10 or more of them, particularly those called during trigger IPC handling. And we don't have automated performance checking yet either. So I think we need this. Maybe we could enable it selectively, e.g. on MTL and LNL only? I've run both playback and capture tests on MTL nocodec on various interfaces over 13 seconds. With debugging enabled, the time between the first (after timer sync) and the last firmware message was in most cases 0-4ms longer. Only in one case was it 14ms longer. LL measurements were on average 10us longer. So I think it should be OK to enable this debugging for all.

lyakh · 2025-03-21T13:51:00Z

CI:

MTL: HDA not tested https://sof-ci.01.org/sofpr/PR9907/build11693/devicetest/index.html
cAVS: TGL-nocodec not tested and a suspend-resume timeout on ADL-nocodec https://sof-ci.01.org/sofpr/PR9907/build11694/devicetest/index.html

kv2019i

Mostly ok, a couple of notes/questions inline.

kv2019i · 2025-03-21T13:58:39Z

src/ipc/ipc-zephyr.c

 	}

+	mem_hot_path_stop_watching();
+


What if we power down on L243 and we have context save enabled in the build? Wouldn't that mess the tracker?

@kv2019i not sure why it should confuse it? The tracker is just a couple of simple variables, if the context is saved, then those variables are saved too. But in fact if any of those power on / off functions are cold, then it might trigger a bug. Let me reduce the watched scope

kv2019i · 2025-03-21T13:59:16Z

src/ipc/ipc4/handler.c

 		return IPC4_INVALID_REQUEST;
 	}

+	mem_hot_path_confirm();


Would be an excellent place to add a comment why this is considered a hot-path.

DRAM execution is banned in LL task context, but there are also time intervals during which performance is important. Moreover, sometimes such critical periods aren't certain until a later time. E.g. when processing IPCs, only some of them are performance critical. And those time-critical IPCs should be handled with maximum performance from IPC-entry till IPC-exit, not only while handling that specific IPC. Consider this pseudo-code:

	void ipc_cmd()
	{
		common_pre_processing();
		switch (cmd) {
		case NON_CRITICAL_CMD:
			handle_non_critical();
			break;
		case CRITICAL_CMD:
			handle_critical();
		}
		common_post_processing();
	}

In this case ipc_cmd(), common_pre_processing(), handle_critical() and common_post_processing() cannot be executed in DRAM, while handle_non_critical() can be executed in DRAM. If we place start-critical-debugging and stop-critical-debugging to cover all of ipc_cmd(), we'll be forced to also place handle_non_critical() in SRAM. OTOH if we only surround handle_critical() with those markers, we will miss all the common code.

To support such cases we use 3 markers:

	mem_hot_path_start_watching()
	mem_hot_path_stop_watching()
	mem_hot_path_confirm()

Then we place start- and stop-watching in the beginning and end of ipc_cmd(), and mem_hot_path_confirm() under the CRITICAL_CMD case. Then if we made common_pre_processing() or common_post_processing() __cold, the fact of them being called while watching will be recorded. Then mem_hot_path_confirm() will set a confirmation flag. Then when watching is stopped, we check if the flag was raised and a cold function was called, in which case we issue an error.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
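The deferred decision described in this commit message can be sketched as a small state machine. The marker names, `mem_cold_path_enter()`, `cold_path_fn` and `hot_path_confirmed` appear in this PR; the function bodies, the `watching` flag, the return value of `mem_hot_path_stop_watching()` and the `ipc_cmd()` demo are assumptions made for this illustration and may not match the merged code, which issues an error rather than returning the offender.

```c
#include <stdbool.h>
#include <stddef.h>

static bool watching;            /* assumed flag: between start- and stop-watching */
static bool hot_path_confirmed;  /* set once the path is known to be critical */
static const char *cold_path_fn; /* first cold function called while watching */

void mem_hot_path_start_watching(void)
{
	watching = true;
	hot_path_confirmed = false;
	cold_path_fn = NULL;
}

/* Called on entry to every cold function. */
void mem_cold_path_enter(const char *fn)
{
	/* Only record the call; it is not yet known whether this is an error. */
	if (watching && !cold_path_fn)
		cold_path_fn = fn;
}

void mem_hot_path_confirm(void)
{
	if (watching)
		hot_path_confirmed = true;
}

/*
 * Sketch-only simplification: return the offending function name, or NULL,
 * instead of issuing an error.
 */
const char *mem_hot_path_stop_watching(void)
{
	const char *offender = NULL;

	/* A violation needs both: a confirmed hot path and a recorded cold call. */
	if (watching && hot_path_confirmed && cold_path_fn)
		offender = cold_path_fn;

	watching = false;
	return offender;
}

/* Demo mirroring the pseudo-code, with the common code wrongly made cold. */
enum { NON_CRITICAL_CMD, CRITICAL_CMD };

static void common_pre_processing(void)
{
	mem_cold_path_enter(__func__);
}

const char *ipc_cmd(int cmd)
{
	mem_hot_path_start_watching();
	common_pre_processing();
	if (cmd == CRITICAL_CMD)
		mem_hot_path_confirm();
	return mem_hot_path_stop_watching();
}
```

In this demo `ipc_cmd(NON_CRITICAL_CMD)` reports nothing, because the path was never confirmed as hot, while `ipc_cmd(CRITICAL_CMD)` reports `common_pre_processing`, even though the cold call happened before the confirmation.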

lyakh · 2025-03-24T12:06:57Z

CI:

MTL only an alsabat failure on HDA https://sof-ci.01.org/sofpr/PR9907/build11723/devicetest/index.html
TGL 2 nocodec configurations not tested
PTL also alsabat failure https://sof-ci.01.org/sofpr/PR9907/build11722/devicetest/index.html

lgirdwood · 2025-03-24T17:02:13Z

src/debug/dram.c

+static const char *cold_path_fn;
+static bool hot_path_confirmed;
+
+void mem_cold_path_enter(const char *fn)


We really need to put all debug logic behind a dbg_ API prefix and a Kconfig to enable and disable.

@lgirdwood so far we don't seem to have such a consistent convention. If you grep the source tree for dbg_ you'll see a few internal uses and a few global macros in src/include/sof/debug/debug.h which don't seem to be used anywhere any more, so they can be removed. After that a consistent dbg_* namespace could be established. FWIW other files under src/debug/ don't use it either. One of the files there uses the debug_ prefix

Let's make it consistent starting with this file. We really have to namespace APIs better.
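For reference, the usual way to put a debug API behind a build option is to provide real functions when the option is set and empty inline stubs otherwise, so call sites need no `#ifdef`. The sketch below only illustrates that pattern: `CONFIG_SKETCH_DBG_HOT_PATH` and the `dbg_hot_path_*` names are hypothetical and are not symbols from this PR or from SOF.

```c
#include <stdbool.h>

/* Hypothetical Kconfig symbol, used here only to show the pattern. */
#ifdef CONFIG_SKETCH_DBG_HOT_PATH

static bool dbg_hot_path_watching;

static inline void dbg_hot_path_start_watching(void)
{
	dbg_hot_path_watching = true;
}

static inline bool dbg_hot_path_is_watching(void)
{
	return dbg_hot_path_watching;
}

#else /* debugging disabled: stubs that the compiler can drop entirely */

static inline void dbg_hot_path_start_watching(void)
{
}

static inline bool dbg_hot_path_is_watching(void)
{
	return false;
}

#endif
```

With the option disabled, callers still compile unchanged and the calls cost nothing at run time.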

src/debug/dram.c

lgirdwood · 2025-03-28T14:58:23Z

OK, let's come back and use the dbg_ prefix for this API as a next step.

lyakh force-pushed the dram-dbg2 branch 3 times, most recently from 8b5e30c to 6c3faba, March 19, 2025 11:52

lyakh mentioned this pull request Mar 19, 2025

DRAM: more cold functions #9850

Merged

lyakh force-pushed the dram-dbg2 branch from 6c3faba to 80a0e9b, March 19, 2025 16:06

lyakh marked this pull request as ready for review March 19, 2025 16:07

lyakh requested review from abonislawski, bardliao, dbaluta, iuliana-prodan, kv2019i, lbetlej, lgirdwood, marcinszkudlinski, mmaka1, pblaszko and plbossart as code owners March 19, 2025 16:07

cmake/zephyr: decentralize src/debug/

8679ca3

Adding all source files in a single, giant zephyr/CMakeLists.txt is inconvenient and does not scale. Link: thesofproject#8260 Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>

lyakh force-pushed the dram-dbg2 branch from 80a0e9b to 094f8b1, March 20, 2025 14:24

lgirdwood reviewed Mar 20, 2025

kv2019i reviewed Mar 21, 2025

lyakh force-pushed the dram-dbg2 branch from 094f8b1 to e78f624, March 21, 2025 14:21

kv2019i approved these changes Mar 24, 2025

lgirdwood reviewed Mar 24, 2025

kv2019i mentioned this pull request Mar 27, 2025

cmake/zephyr: unify cmake rules for lib, probes and... #9931

Merged

jsarha reviewed Mar 28, 2025

jsarha approved these changes Mar 28, 2025

lyakh mentioned this pull request Mar 28, 2025

fix IPC timeouts #9926

Merged

lgirdwood approved these changes Mar 28, 2025

lgirdwood merged commit 25f704a into thesofproject:main Mar 28, 2025
45 of 49 checks passed

lyakh deleted the dram-dbg2 branch March 28, 2025 14:59

lyakh mentioned this pull request Mar 28, 2025

Logging: switch to delayed work #9929

Merged
