Better vcztools concurrency#9
Merged
jeromekelleher merged 6 commits intosgkit-dev:mainfrom May 6, 2026
Merged
Conversation
5cd7cdb to
73258ce
Compare
The FUSE handler used to await indefinitely on any operation against the plink-server subprocess: a slow or wedged worker pinned the consumer's syscall in uninterruptible D-state, fusermount3 -u couldn't unmount, and the mount was unrecoverable without SIGKILL. Three concentric timeouts close that gap: - BedConnection.read wraps the lock-and-I/O block in ``trio.move_on_after(_REQUEST_TIMEOUT_S)``. On expiry the connection is marked closed (so queued readers short-circuit immediately), the stream is best-effort closed inside a shielded scope, and ``OSError(EIO)`` is raised — already mapped to ``FUSEError(EIO)`` by PlinkOps.read. - BedConnection.aclose runs its send_eof + aclose under shielded ``trio.move_on_after(_ACLOSE_TIMEOUT_S)``, so release can never outlive its deadline. PlinkOps.release already swallows aclose exceptions; with the deadline in place it can no longer hang. - PlinkClient._connect_stream wraps each ``connect()`` with ``trio.fail_after(_OPEN_TIMEOUT_S)`` so a stuck accept queue can't lock the parent. - PlinkOps.open wraps the limiter acquire with ``trio.move_on_after(_LIMITER_TIMEOUT_S)`` and surfaces ``FUSEError(EAGAIN)`` on expiry, guarding against a leaked slot permanently wedging FUSE_OPEN. Verified by killing the plink-server mid-mount: dd surfaces ECONNREFUSED in 12ms and fusermount3 -u completes in 10ms. fs_tests fio runner goes from a 1314s hang-with-SIGKILL to a 161s clean fail-fast where every job either succeeds or returns an errno within its configured runtime.
The robustness layer made the mount durable, but EAGAIN was still firing under repeated fio churn and we couldn't tell whether the slow path was parent-side limiter contention, parent-side aclose, or server-side encoder shutdown. Extend the access log with a ``kind`` field (default ``read``) and emit lifecycle events alongside reads: - ``open`` — PlinkOps.open entry → return. - ``limiter_wait`` — only for .bed; the time spent on the CapacityLimiter acquire. - ``release`` — PlinkOps.release entry → return. - ``aclose`` — BedConnection.aclose entry → return, via a new ``on_aclose`` callback hook the FUSE layer wires in at open time. Server-side: add DEBUG-level logs around encoder construction, each encoder.read, and encoder.close so a -vv run records the per-connection timing on the worker. Add ``fs_tests/harness/trace_summary.py`` to print per-kind duration distributions and the slowest events from an access.jsonl; running it on a real fio trace pinpointed the bottleneck as read latency itself (max ~1 s) rather than anything FUSE-side (every lifecycle op is sub-5 ms). Two existing tests adapted: ``test_records_per_read`` filters to ``kind="read"`` now that lifecycle events share the log; the in-test fakes accept ``on_aclose`` and propagate it.
The plink-server's ``logger.debug``/``info`` messages were going nowhere: the subprocess never called ``logging.basicConfig``, so the default ``WARNING`` level dropped them. Pass the parent's effective level into ``_server_main`` and configure logging at entry, so a ``-vv`` parent run now also shows the per-connection encoder lifecycle traces (``encoder created in N.NNNs``, ``encoder.read off=O size=S in N.NNNs``, ``encoder.close done in N.NNNs``). Default of ``WARNING`` keeps subprocess logs quiet under normal operation.
- BedConnection.read: replace ``assert cs.cancelled_caught`` with an explicit ``if not ... raise RuntimeError`` so the timeout EIO path stays correct under ``python -O``. The branch is unreachable in normal operation; tagged with a coverage pragma. - _server_main: pass ``force=True`` to ``logging.basicConfig`` so the explicitly-forwarded log level wins even if an upstream import (or the parent's logging state replayed via ``spawn``) has already configured the root logger.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.