Better vcztools concurrency by jeromekelleher · Pull Request #9 · sgkit-dev/biofuse

jeromekelleher · 2026-05-05T13:50:46Z

No description provided.

The FUSE handler used to await indefinitely on any operation against the plink-server subprocess: a slow or wedged worker pinned the consumer's syscall in uninterruptible D-state, fusermount3 -u couldn't unmount, and the mount was unrecoverable without SIGKILL. Three concentric timeouts close that gap: - BedConnection.read wraps the lock-and-I/O block in ``trio.move_on_after(_REQUEST_TIMEOUT_S)``. On expiry the connection is marked closed (so queued readers short-circuit immediately), the stream is best-effort closed inside a shielded scope, and ``OSError(EIO)`` is raised — already mapped to ``FUSEError(EIO)`` by PlinkOps.read. - BedConnection.aclose runs its send_eof + aclose under shielded ``trio.move_on_after(_ACLOSE_TIMEOUT_S)``, so release can never outlive its deadline. PlinkOps.release already swallows aclose exceptions; with the deadline in place it can no longer hang. - PlinkClient._connect_stream wraps each ``connect()`` with ``trio.fail_after(_OPEN_TIMEOUT_S)`` so a stuck accept queue can't lock the parent. - PlinkOps.open wraps the limiter acquire with ``trio.move_on_after(_LIMITER_TIMEOUT_S)`` and surfaces ``FUSEError(EAGAIN)`` on expiry, guarding against a leaked slot permanently wedging FUSE_OPEN. Verified by killing the plink-server mid-mount: dd surfaces ECONNREFUSED in 12ms and fusermount3 -u completes in 10ms. fs_tests fio runner goes from a 1314s hang-with-SIGKILL to a 161s clean fail-fast where every job either succeeds or returns an errno within its configured runtime.

The robustness layer made the mount durable, but EAGAIN was still firing under repeated fio churn and we couldn't tell whether the slow path was parent-side limiter contention, parent-side aclose, or server-side encoder shutdown. Extend the access log with a ``kind`` field (default ``read``) and emit lifecycle events alongside reads: - ``open`` — PlinkOps.open entry → return. - ``limiter_wait`` — only for .bed; the time spent on the CapacityLimiter acquire. - ``release`` — PlinkOps.release entry → return. - ``aclose`` — BedConnection.aclose entry → return, via a new ``on_aclose`` callback hook the FUSE layer wires in at open time. Server-side: add DEBUG-level logs around encoder construction, each encoder.read, and encoder.close so a -vv run records the per-connection timing on the worker. Add ``fs_tests/harness/trace_summary.py`` to print per-kind duration distributions and the slowest events from an access.jsonl; running it on a real fio trace pinpointed the bottleneck as read latency itself (max ~1 s) rather than anything FUSE-side (every lifecycle op is sub-5 ms). Two existing tests adapted: ``test_records_per_read`` filters to ``kind="read"`` now that lifecycle events share the log; the in-test fakes accept ``on_aclose`` and propagate it.

The plink-server's ``logger.debug``/``info`` messages were going nowhere: the subprocess never called ``logging.basicConfig``, so the default ``WARNING`` level dropped them. Pass the parent's effective level into ``_server_main`` and configure logging at entry, so a ``-vv`` parent run now also shows the per-connection encoder lifecycle traces (``encoder created in N.NNNs``, ``encoder.read off=O size=S in N.NNNs``, ``encoder.close done in N.NNNs``). Default of ``WARNING`` keeps subprocess logs quiet under normal operation.

- BedConnection.read: replace ``assert cs.cancelled_caught`` with an explicit ``if not ... raise RuntimeError`` so the timeout EIO path stays correct under ``python -O``. The branch is unreachable in normal operation; tagged with a coverage pragma. - _server_main: pass ``force=True`` to ``logging.basicConfig`` so the explicitly-forwarded log level wins even if an upstream import (or the parent's logging state replayed via ``spawn``) has already configured the root logger.

jeromekelleher added 2 commits May 5, 2026 14:00

Udpate vcztools version

a18566e

Update to use new vcztools API

73258ce

jeromekelleher force-pushed the better-vcztools-concurrency branch from 5cd7cdb to 73258ce Compare May 5, 2026 14:03

jeromekelleher added 4 commits May 5, 2026 16:22

jeromekelleher merged commit 9702ff5 into sgkit-dev:main May 6, 2026
5 checks passed

jeromekelleher deleted the better-vcztools-concurrency branch May 6, 2026 10:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better vcztools concurrency#9

Better vcztools concurrency#9
jeromekelleher merged 6 commits intosgkit-dev:mainfrom
jeromekelleher:better-vcztools-concurrency

jeromekelleher commented May 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jeromekelleher commented May 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant