Skip to content

fix: join dump-collector writer thread on aborted runs to stop host hang#920

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
ChaoZheng109:fix/collector-teardown-error-path
May 31, 2026
Merged

fix: join dump-collector writer thread on aborted runs to stop host hang#920
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
ChaoZheng109:fix/collector-teardown-error-path

Conversation

@ChaoZheng109
Copy link
Copy Markdown
Collaborator

Problem

Fixes #919.

With --dump-tensor enabled, a run that ends abnormally (times out or the runtime reports a fatal status) does not fail cleanly on the host — the process hangs and is only terminated by the external watchdog (e.g. the 300s task-queue limit). The same failing run without --dump-tensor exits promptly, so the hang is purely host-side and specific to the dump path.

Root cause

TensorDumpCollector's writer thread is only signalled (writer_done_) and joined inside export_dump_files(), which runs only on the success path. On an abnormal run, DeviceRunner::run() returns early and skips the collector-teardown block. The cleanup path still reaches TensorDumpCollector::finalize(), but finalize() and the base-class stop() only join the mgmt + poll threads — never the writer thread. The writer thread is left blocked on its condition variable while writer_thread_ stays joinable, wedging teardown.

Fix

Join the writer thread in finalize(), right after stop(). finalize() is reached via run()'s perf_cleanup guard on every exit path (including early error returns), so the writer is always reclaimed. The join is idempotent: export_dump_files() clears writer_started_ on the success path, making the new block a no-op there.

Applied symmetrically to a2a3 and a5.

Test

  • No existing cpput harness for TensorDumpCollector; the reproducer is an end-to-end hardware run with --dump-tensor that ends abnormally.
  • Verified the change is a no-op on the success path (writer already torn down by export_dump_files()).

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 30, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c0b052a6-3012-40b7-9b59-d435c97ec3a0

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR fixes a host-side hang occurring when tensor dump is enabled but the run terminates abnormally. Both a2a3 and a5 platform variants now explicitly clean up the writer thread in finalize() by signaling completion and joining when the thread remains active after skipping the normal export_dump_files() path.

Changes

Writer thread cleanup on abnormal finalize

Layer / File(s) Summary
Terminate blocked writer thread in finalize() for both platforms
src/a2a3/platform/src/host/tensor_dump_collector.cpp, src/a5/platform/src/host/tensor_dump_collector.cpp
Both platform implementations add a conditional teardown in finalize(): if the writer thread was started and is joinable, the code sets writer_done_, notifies write_cv_, and joins the thread. This prevents the writer from remaining blocked when abnormal termination skips export_dump_files(), ensuring clean shutdown on timeout and failure paths.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A thread was blocked, waiting in the night,
When dumps went wrong and runs took flight.
Now finalize sings its cleanup song—
"Writer, wake up, you've waited long!"
Two platforms fixed with idempotent grace,
No more hangs in the teardown race. 🎪

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main fix: joining the dump-collector writer thread to prevent host hangs on aborted runs.
Description check ✅ Passed The description thoroughly explains the problem, root cause, fix, and testing approach, all directly related to the changeset.
Linked Issues check ✅ Passed The code changes fully address issue #919 by adding writer thread join logic in finalize() to prevent host hangs on abnormal runs.
Out of Scope Changes check ✅ Passed All changes are focused on fixing the writer thread teardown issue in TensorDumpCollector, with no out-of-scope modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the finalize() method in TensorDumpCollector for both a2a3 and a5 platforms to join the writer thread if it was started, preventing thread leaks and potential crashes when the export path is skipped. The reviewer suggests also closing bin_file_ in finalize() if it remains open to ensure proper resource cleanup and prevent subsequent run failures.

Comment thread src/a2a3/platform/src/host/tensor_dump_collector.cpp
Comment thread src/a5/platform/src/host/tensor_dump_collector.cpp
… to stop host hang on aborted runs

TensorDumpCollector's writer thread was only signalled (writer_done_) and
joined in export_dump_files(), which runs only on the success path, and
bin_file_ (opened in start_writer_thread_once()) was closed only there too.
When a run ends abnormally, DeviceRunner::run() returns early and skips the
collector-teardown block; the perf_cleanup guard still reaches finalize(),
but finalize() and the base-class stop() only join the mgmt + poll threads.
The writer thread was left blocked on its condition variable with a
still-joinable std::thread, wedging teardown until the external watchdog
killed the host process; bin_file_ was also left open, so a re-run's
bin_file_.open() would set failbit.

In finalize(), after stop(), join the writer thread and close bin_file_ if
open. finalize() is reached via run()'s perf_cleanup guard on every exit
path, so both are now reclaimed on early error returns too. Both are
idempotent: export_dump_files() clears writer_started_ and closes bin_file_
on the success path, making the new code a no-op there. Applied to a2a3 and a5.

Fixes hw-native-sys#919
@ChaoZheng109 ChaoZheng109 force-pushed the fix/collector-teardown-error-path branch from f395aa1 to a338fd3 Compare May 30, 2026 10:16
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a2a3/platform/src/host/tensor_dump_collector.cpp`:
- Around line 580-584: finalize() currently joins writer_thread_ but doesn't
close bin_file_, which allows start_writer_thread_once() to reopen an
already-open .bin stream; modify finalize() to explicitly close bin_file_ (call
bin_file_.close()) after joining the writer_thread_ and before resetting
writer_started_ (or before any subsequent open), ensuring
writer_done_.store(true) / write_cv_.notify_one() / writer_thread_.join()
sequence remains intact; reference writer_started_, writer_thread_,
writer_done_, write_cv_, finalize(), export_dump_files(), bin_file_.close(), and
start_writer_thread_once() so the change is applied in the same teardown path
where writer_thread_.join() is called.

In `@src/a5/platform/src/host/tensor_dump_collector.cpp`:
- Around line 627-631: In TensorDumpCollector::finalize(), ensure the bin_file_
stream/descriptor is closed on the abnormal exit path: after setting
writer_done_ and notifying write_cv_ and joining writer_thread_ (the block that
checks writer_started_ && writer_thread_.joinable()), add logic to check if
bin_file_.is_open() (or the equivalent for the stream type opened in
start_writer_thread_once()) and call bin_file_.close(); also consider
null/validity checks and avoid calling close twice since export_dump_files()
also closes bin_file_; this guarantees start_writer_thread_once()/writer_thread_
cleanup does not leave the .bin descriptor open when export_dump_files() is
skipped.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dc1986cd-1b93-445a-95ab-9308984bc71d

📥 Commits

Reviewing files that changed from the base of the PR and between aa6ce64 and f395aa1.

📒 Files selected for processing (2)
  • src/a2a3/platform/src/host/tensor_dump_collector.cpp
  • src/a5/platform/src/host/tensor_dump_collector.cpp

Comment thread src/a2a3/platform/src/host/tensor_dump_collector.cpp
Comment thread src/a5/platform/src/host/tensor_dump_collector.cpp
@ChaoWao ChaoWao merged commit 3939da9 into hw-native-sys:main May 31, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Host hangs until watchdog kill when a --dump-tensor run times out or fails

2 participants