
fix(core): spawn blocking ops onto worker pool to avoid stack overflow #7371

Merged
Xuanwo merged 2 commits into apache:main from kszucs:fix/blocking-stack-overflow on Apr 10, 2026



@kszucs kszucs commented Apr 8, 2026

Summary

  • Fix SIGSEGV in Java binding's BlockingWriteTest on Linux x86_64 when using HF/XET backend

Root cause

blocking::Operator called Handle::block_on(), which polls the entire async state machine on the calling thread's stack. Some backends have deep async call chains. For example, HF/XET uploads go through:

block_on (calling thread, 1 MB stack on Linux JVM)
  → RetryLayer::write
    → HfBackend::write
      → HfWriter::try_new
        → xet_upload_commit
          → bridge_async
            → XetUploadCommit::new
              → upload_stream
                → CAS upload tasks

This exceeds the JVM's default 1 MB thread stack on Linux x86_64 and hits a guard page, which raises SIGSEGV (SEGV_ACCERR).

Why macOS wasn't affected: macOS default thread stack is 8 MB.
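Why pure Rust tests passed: they ran on Rust-managed threads with larger stacks than the JVM's 1 MB (threads spawned by Rust's standard library default to 2 MiB, and the main thread typically gets 8 MB on Linux).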
Why the JVM crashed without an hs_err file: The overflow happened on the JVM's own thread before its crash handler could activate.
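
To make the stack-size dependence concrete, here is a minimal std-only sketch. It is not OpenDAL code: `deep_call` and `run_with_stack` are hypothetical names, and plain recursion stands in for the nested poll frames of the HF/XET call chain. Each frame keeps about 1 KiB alive, so a depth of 2000 needs roughly 2 MiB of stack. That fits on an 8 MiB thread and would overflow a 1 MiB one.

```rust
use std::hint::black_box;
use std::thread;

/// Recursion that keeps roughly 1 KiB of data alive per frame.
/// Hypothetical stand-in for a deep chain of nested calls.
fn deep_call(depth: u32) -> u32 {
    let scratch = [0u8; 1024];
    let below = if depth == 0 { 0 } else { deep_call(depth - 1) };
    // Touch the buffer after the recursive call so it cannot be optimized away.
    black_box(&scratch);
    below + 1
}

/// Runs `deep_call(depth)` on a fresh thread with an explicit stack size
/// and returns the number of frames that were entered.
fn run_with_stack(stack_bytes: usize, depth: u32) -> u32 {
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(move || deep_call(depth))
        .expect("failed to spawn thread")
        .join()
        .expect("thread panicked")
}
```

Calling `run_with_stack(8 * 1024 * 1024, 2000)` completes, whereas the same depth on a 1 MiB stack would run past the guard page and abort the process, which is the failure mode described above.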

Workaround

Increase the Java thread stack size:

diff --git a/bindings/java/pom.xml b/bindings/java/pom.xml
index 575d470fe..e8c46fe9a 100644
--- a/bindings/java/pom.xml
+++ b/bindings/java/pom.xml
@@ -241,6 +241,13 @@
                 <groupId>org.apache.maven.plugins</groupId>
                 <artifactId>maven-surefire-plugin</artifactId>
                 <version>${maven-surefire-plugin.version}</version>
+                <configuration>
+                    <!-- Some storage backends (e.g. HF/XET) flatten deep async
+                         state machines onto the calling thread via block_on(),
+                         which can exceed the default 1 MB thread stack on Linux.
+                         See https://github.com/apache/opendal/issues/7367 -->
+                    <argLine>-Xss8m</argLine>
+                </configuration>
             </plugin>
             <plugin>
                 <groupId>org.apache.maven.plugins</groupId>

Fix

Replace direct Handle::block_on(future) calls with Handle::block_on(Handle::spawn(future)) for the main I/O operations (stat, read, write, copy, rename, delete, list, create_dir).

The async future now runs on tokio worker threads, whose stacks are sized by the runtime rather than by the JVM (2 MiB by default, configurable through the runtime builder's thread_stack_size), while the calling thread only waits on a lightweight JoinHandle. This keeps the calling thread's stack usage minimal regardless of how deep the backend's async state machine goes.
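
The idea can be sketched without tokio, using only the standard library. This is an illustrative analog, not the code in this PR: `ThreadWaker`, `block_on`, and `block_on_worker` are hypothetical names, the 8 MiB stack size is an arbitrary choice, and a fresh thread per call stands in for tokio's worker pool. `block_on` polls the future on the current thread, so all nested poll frames land on that thread's stack. `block_on_worker` moves the polling to a thread with its own stack and leaves the caller waiting on a join handle.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread;

/// Wakes the polling thread by unparking it.
struct ThreadWaker(thread::Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal executor: polls `fut` to completion on the current thread,
/// so every nested poll frame uses this thread's stack.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => thread::park(),
        }
    }
}

/// Polls `fut` on a separate thread with its own stack; the caller only
/// blocks on the join handle and holds no poll frames itself.
fn block_on_worker<F>(fut: F) -> F::Output
where
    F: Future + Send + 'static,
    F::Output: Send + 'static,
{
    thread::Builder::new()
        .stack_size(8 * 1024 * 1024) // assumed size, chosen for illustration
        .spawn(move || block_on(fut))
        .expect("failed to spawn worker thread")
        .join()
        .expect("worker thread panicked")
}
```

As in the tokio version, the future and its output must be `Send + 'static` because they cross a thread boundary, and a panic inside the future surfaces to the caller through the join handle.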

Operations that return streaming handles (reader, writer, lister, deleter) still use direct block_on since they only set up lightweight state — the actual I/O happens through separate blocking wrapper types that already manage their own block_on calls.

Verification

Tested in an x86_64 Docker container (Rosetta) with:

  • Java 17 (default -Xss1m)
  • HF bucket backend with XET protocol
  • BlockingWriteTest — all 3 tests pass without any -Xss workaround

Test plan

  • BlockingWriteTest passes on Linux x86_64 via Docker (previously SIGSEGV)
  • Async HF write roundtrip still passes on macOS
  • cargo check -p opendal-core --features blocking compiles cleanly

Closes #7367

@kszucs kszucs requested a review from Xuanwo as a code owner April 8, 2026 20:57
@dosubot added the size:M (This PR changes 30-99 lines, ignoring generated files) and releases-note/fix (The PR fixes a bug or has a title that begins with "fix") labels Apr 8, 2026

Xuanwo commented Apr 9, 2026

Thank you for figuring this out. I will enable hf on Java after #7368 has been merged.

Also cc @tisonkun to take a look at the Java-side changes?

kszucs added 2 commits April 9, 2026 19:15
blocking::Operator previously called Handle::block_on() which polls the
entire async state machine on the calling thread's stack. For backends
with deep async call chains (e.g. HF/XET uploads going through retry
layers, bridge_async, upload commits, and CAS streams), this can exceed
the default 1 MB thread stack on Linux — causing SIGSEGV in the Java
binding where JVM threads use this default.

Replace direct block_on() calls with spawn() + block_on(JoinHandle) for
the main I/O operations (stat, read, write, copy, rename, delete, list,
create_dir). The async future now runs on tokio worker threads (which
have adequate stack space) while the calling thread only waits on a
lightweight JoinHandle.

Closes apache#7367
@kszucs kszucs force-pushed the fix/blocking-stack-overflow branch from d83bdb7 to 48cc3a4 Compare April 9, 2026 17:15

@Xuanwo Xuanwo left a comment


Thank you!

@dosubot added the lgtm (This PR has been approved by a maintainer) label Apr 9, 2026
@Xuanwo Xuanwo merged commit a3cd6b0 into apache:main Apr 10, 2026
374 of 375 checks passed


Development

Successfully merging this pull request may close these issues.

bug(bindings/java): HF blocking path segfaults on Linux with XET async upload

2 participants