fix: join dump-collector writer thread on aborted runs to stop host hang#920
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR fixes a host-side hang occurring when tensor dump is enabled but the run terminates abnormally. Both a2a3 and a5 platform variants now explicitly clean up the writer thread in ChangesWriter thread cleanup on abnormal finalize
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request updates the finalize() method in TensorDumpCollector for both a2a3 and a5 platforms to join the writer thread if it was started, preventing thread leaks and potential crashes when the export path is skipped. The reviewer suggests also closing bin_file_ in finalize() if it remains open to ensure proper resource cleanup and prevent subsequent run failures.
… to stop host hang on aborted runs TensorDumpCollector's writer thread was only signalled (writer_done_) and joined in export_dump_files(), which runs only on the success path, and bin_file_ (opened in start_writer_thread_once()) was closed only there too. When a run ends abnormally, DeviceRunner::run() returns early and skips the collector-teardown block; the perf_cleanup guard still reaches finalize(), but finalize() and the base-class stop() only join the mgmt + poll threads. The writer thread was left blocked on its condition variable with a still-joinable std::thread, wedging teardown until the external watchdog killed the host process; bin_file_ was also left open, so a re-run's bin_file_.open() would set failbit. In finalize(), after stop(), join the writer thread and close bin_file_ if open. finalize() is reached via run()'s perf_cleanup guard on every exit path, so both are now reclaimed on early error returns too. Both are idempotent: export_dump_files() clears writer_started_ and closes bin_file_ on the success path, making the new code a no-op there. Applied to a2a3 and a5. Fixes hw-native-sys#919
f395aa1 to
a338fd3
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/a2a3/platform/src/host/tensor_dump_collector.cpp`:
- Around line 580-584: finalize() currently joins writer_thread_ but doesn't
close bin_file_, which allows start_writer_thread_once() to reopen an
already-open .bin stream; modify finalize() to explicitly close bin_file_ (call
bin_file_.close()) after joining the writer_thread_ and before resetting
writer_started_ (or before any subsequent open), ensuring
writer_done_.store(true) / write_cv_.notify_one() / writer_thread_.join()
sequence remains intact; reference writer_started_, writer_thread_,
writer_done_, write_cv_, finalize(), export_dump_files(), bin_file_.close(), and
start_writer_thread_once() so the change is applied in the same teardown path
where writer_thread_.join() is called.
In `@src/a5/platform/src/host/tensor_dump_collector.cpp`:
- Around line 627-631: In TensorDumpCollector::finalize(), ensure the bin_file_
stream/descriptor is closed on the abnormal exit path: after setting
writer_done_ and notifying write_cv_ and joining writer_thread_ (the block that
checks writer_started_ && writer_thread_.joinable()), add logic to check if
bin_file_.is_open() (or the equivalent for the stream type opened in
start_writer_thread_once()) and call bin_file_.close(); also consider
null/validity checks and avoid calling close twice since export_dump_files()
also closes bin_file_; this guarantees start_writer_thread_once()/writer_thread_
cleanup does not leave the .bin descriptor open when export_dump_files() is
skipped.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: dc1986cd-1b93-445a-95ab-9308984bc71d
📒 Files selected for processing (2)
src/a2a3/platform/src/host/tensor_dump_collector.cppsrc/a5/platform/src/host/tensor_dump_collector.cpp
Problem
Fixes #919.
With
--dump-tensorenabled, a run that ends abnormally (times out or the runtime reports a fatal status) does not fail cleanly on the host — the process hangs and is only terminated by the external watchdog (e.g. the 300s task-queue limit). The same failing run without--dump-tensorexits promptly, so the hang is purely host-side and specific to the dump path.Root cause
TensorDumpCollector's writer thread is only signalled (writer_done_) and joined insideexport_dump_files(), which runs only on the success path. On an abnormal run,DeviceRunner::run()returns early and skips the collector-teardown block. The cleanup path still reachesTensorDumpCollector::finalize(), butfinalize()and the base-classstop()only join the mgmt + poll threads — never the writer thread. The writer thread is left blocked on its condition variable whilewriter_thread_stays joinable, wedging teardown.Fix
Join the writer thread in
finalize(), right afterstop().finalize()is reached viarun()'sperf_cleanupguard on every exit path (including early error returns), so the writer is always reclaimed. The join is idempotent:export_dump_files()clearswriter_started_on the success path, making the new block a no-op there.Applied symmetrically to a2a3 and a5.
Test
TensorDumpCollector; the reproducer is an end-to-end hardware run with--dump-tensorthat ends abnormally.export_dump_files()).