procfs emulation now treats the OOM trio (oom_score_adj, legacy oom_adj, read-only oom_score) as one process-wide adjustment with per-path read and write semantics: legacy oom_adj scales to oom_score_adj on writes (special-casing OOM_DISABLE -> SCORE_ADJ_MIN and OOM_ADJUST_MAX -> SCORE_ADJ_MAX so the boundary intent survives the lossy multiply) and back-clamps to [-17, 15] on reads; oom_score is read-only with a stub zero. The OOM write path serializes the truncate+pwrite+lseek under a new oom_write_lock and publishes the global atomic only after the backing rewrite succeeds, so a partial-rewrite failure no longer leaves the process-wide value diverged from a returned -1. Zero-length writes short-circuit to success (matches Linux for proc nodes; sys_writev previously hit -EINVAL in the parser). Stat reports st_size 0 for every synthetic /proc file so callers that pre-size buffers from stat cannot truncate (a 256-byte cap had silently chopped /proc/cpuinfo on hosts with many CPUs; a 2-byte cap had reduced -1000 to -1 on oom_score_adj).
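The oom_adj mapping described above can be sketched roughly as follows. This is an illustration, not the elfuse code: the function names are hypothetical, the constants carry the Linux values, and the multiply-then-divide-by-17 scaling is assumed to follow the kernel's legacy formula.

```c
/* Linux values for the legacy and current OOM knobs. */
#define OOM_DISABLE        (-17)
#define OOM_ADJUST_MAX       15
#define OOM_SCORE_ADJ_MIN (-1000)
#define OOM_SCORE_ADJ_MAX   1000

/* Write side (hypothetical helper): legacy oom_adj -> oom_score_adj.
 * The two boundary values are mapped explicitly so "disable" and
 * "maximum" keep their meaning regardless of integer truncation. */
int oom_adj_to_score_adj(int oom_adj)
{
    if (oom_adj == OOM_DISABLE)
        return OOM_SCORE_ADJ_MIN;
    if (oom_adj == OOM_ADJUST_MAX)
        return OOM_SCORE_ADJ_MAX;
    /* Assumed scaling: multiply by 1000, divide by 17, truncate. */
    return oom_adj * OOM_SCORE_ADJ_MAX / -OOM_DISABLE;
}

/* Read side (hypothetical helper): oom_score_adj -> legacy oom_adj,
 * clamped back into [-17, 15]. */
int score_adj_to_oom_adj(int score_adj)
{
    int v = score_adj * -OOM_DISABLE / OOM_SCORE_ADJ_MAX;
    if (v < OOM_DISABLE)
        v = OOM_DISABLE;
    if (v > OOM_ADJUST_MAX)
        v = OOM_ADJUST_MAX;
    return v;
}
```

Under this assumed formula, oom_adj=15 without the special case would scale to 882 instead of 1000, which is the lossy-multiply problem the boundary handling avoids. The round trip is also lossy in the middle of the range: 8 scales to 470, and 470 reads back as 7.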
A new read-intercept path mirrors the write side. proc_intercept_read and proc_intercept_readv let read/pread/readv/preadv on the OOM nodes return the live atomic value rather than the per-open temp file content, and sendfile/copy_file_range route through the same hook so proc-source byte counts stay consistent with the value an immediately following open would observe.
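A read intercept of this kind might look like the sketch below. The names oom_publish and oom_render_read are hypothetical, and the actual signatures of proc_intercept_read and proc_intercept_readv are not shown in this description. The idea is only that each read formats the current atomic value and serves the requested offset from that text rather than from a per-open temp file.

```c
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical process-wide value; the write path would store to it
 * only after the backing rewrite has succeeded. */
atomic_int oom_score_adj_value;

void oom_publish(int value)
{
    atomic_store(&oom_score_adj_value, value);
}

/* Serve a pread-style request from the live value.
 * Returns bytes copied, 0 at or past EOF, -1 for a negative offset. */
ssize_t oom_render_read(char *buf, size_t len, off_t off)
{
    char text[16];
    int n = snprintf(text, sizeof text, "%d\n",
                     atomic_load(&oom_score_adj_value));

    if (off < 0)
        return -1;
    if ((size_t)off >= (size_t)n)
        return 0;                       /* EOF */

    size_t avail = (size_t)n - (size_t)off;
    if (len > avail)
        len = avail;
    memcpy(buf, text + off, len);
    return (ssize_t)len;
}
```

A caller with a 2-byte buffer gets "-1" from the first call on a value of -1000 and has to keep reading until 0 is returned, which is why sizing a single read from st_size is unsafe for these nodes.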
/proc/self/fdinfo gains type-specific lines for the special fd classes elfuse implements: eventfd-count (16-char hex matching fs/eventfd.c), sigmask (16-char hex), and timerfd clockid/ticks/it_value/it_interval. The accessors live in src/syscall/fd.c (eventfd_fdinfo_snapshot, signalfd_fdinfo_snapshot, timerfd_fdinfo_snapshot) and read state under sfd_lock to prevent tearing across concurrent read/write/settime. The per-fd lseek probe now uses fd_to_host_dup so a concurrent close+reopen on another vCPU cannot redirect the probe to an unrelated host fd, and errno is saved/restored across the ESPIPE-prone lseek so non-seekable fds (sockets, pipes) do not pollute the caller's state.
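The probe and the eventfd line can be sketched as below. Both helper names are hypothetical, and plain dup() stands in for fd_to_host_dup, whose signature is not given here. The "%16llx" format (width 16, space-padded) is an assumption based on the fs/eventfd.c output this description says it matches.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Probe the current file offset without disturbing the caller's errno.
 * dup() shares the open file description, so the offset seen through
 * the duplicate is the same, while the private descriptor cannot be
 * swapped out by a concurrent close+reopen of the original number. */
off_t probe_offset(int host_fd)
{
    int saved_errno = errno;
    off_t pos = (off_t)-1;

    int probe_fd = dup(host_fd);        /* stand-in for fd_to_host_dup */
    if (probe_fd >= 0) {
        pos = lseek(probe_fd, 0, SEEK_CUR); /* ESPIPE on pipes/sockets */
        close(probe_fd);
    }
    errno = saved_errno;                /* hide the probe's failure */
    return pos;
}

/* Render the eventfd fdinfo line from an already-snapshotted counter. */
int format_eventfd_count(char *buf, size_t len, unsigned long long count)
{
    return snprintf(buf, len, "eventfd-count: %16llx\n", count);
}
```

On a pipe, probe_offset returns -1 and leaves errno as it was, so the fdinfo generator can fall back to a default position without leaking ESPIPE to the guest.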
/proc/self/fdinfo and /proc/self/fd no longer share one static backing directory across opens. The previous design let a second open unlink and recreate entries while a sibling thread iterated its dirfd; both nodes now go through proc_open_fd_scratch, which mkdtemps a private directory per open, populates it from a fresh fd-table snapshot, and tracks the path in proc_scratch_dirs[] for atexit cleanup so the previously-leaked backing dirs are reaped at process exit.
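The per-open scratch directory pattern can be illustrated like this. It is a sketch under stated assumptions, not proc_open_fd_scratch itself: the names, the /tmp template, and the fixed-size registry are hypothetical, populating the directory from the fd-table snapshot is left out, and the locking a multi-threaded caller would need around the registry is omitted.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_SCRATCH 64

static char *scratch_dirs[MAX_SCRATCH]; /* paths to reap at exit */
static int scratch_count;
static int cleanup_registered;

/* atexit hook: remove every directory handed out so far.
 * rmdir() only removes empty directories; real code would unlink
 * the populated entries first. */
static void scratch_cleanup(void)
{
    for (int i = 0; i < scratch_count; i++) {
        rmdir(scratch_dirs[i]);
        free(scratch_dirs[i]);
    }
    scratch_count = 0;
}

/* Create a private directory for one open and remember it.
 * Returns the path, or NULL on failure. */
const char *scratch_open(void)
{
    char tmpl[] = "/tmp/proc-fd-XXXXXX";

    if (scratch_count >= MAX_SCRATCH)
        return NULL;
    if (mkdtemp(tmpl) == NULL)
        return NULL;

    char *path = strdup(tmpl);
    if (path == NULL) {
        rmdir(tmpl);
        return NULL;
    }
    if (!cleanup_registered) {
        atexit(scratch_cleanup);
        cleanup_registered = 1;
    }
    scratch_dirs[scratch_count++] = path;
    return path;
}
```

Because every call gets its own mkdtemp path, a second open never touches the directory a sibling thread is still iterating.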
The unix-net visitor's buffer-tail margin grew from 128 to 256 bytes to fit the longest possible row (54 fixed + 108 sun_path + newline); the previous margin let the snprintf truncate the path and drop the trailing newline. Eight explicit /proc/<pid>/X cases collapsed into one general alias-and-recurse, so /proc/<our_pid>/maps, /oom_score_adj, /limits, etc. now route through the matching /proc/self handler.
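The alias step can be sketched as a path rewrite, after which the caller would re-enter the normal /proc/self dispatch. The helper name and signature are hypothetical. The sketch covers only the rewrite, not the recursion into the handler.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* If path is /proc/<our_pid> or /proc/<our_pid>/<rest>, write the
 * equivalent /proc/self path into out and return 1; otherwise 0. */
int proc_alias_pid_path(const char *path, long our_pid,
                        char *out, size_t outsz)
{
    static const char prefix[] = "/proc/";
    size_t plen = sizeof prefix - 1;

    if (strncmp(path, prefix, plen) != 0)
        return 0;

    const char *p = path + plen;
    if (*p < '0' || *p > '9')           /* strtol would accept " 12", "+12" */
        return 0;

    char *end;
    long pid = strtol(p, &end, 10);
    if (pid != our_pid)
        return 0;
    if (*end != '\0' && *end != '/')    /* reject "/proc/1234x/..." */
        return 0;

    int n = snprintf(out, outsz, "/proc/self%s", end);
    return n > 0 && (size_t)n < outsz;  /* 0 if the alias was truncated */
}
```

With our_pid 1234, "/proc/1234/oom_score_adj" becomes "/proc/self/oom_score_adj", while "/proc/12345/maps" and "/proc/self/maps" are left for other handlers.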
Locked in by tests/test-tier-b.c (35 cases including oom write persistence, out-of-range -EINVAL, oom_adj=15 -> 1000 scaling, oom_score read-only and write-rejected, zero-length writev, stat-size-zero, fdinfo eventfd-count hex, fdinfo sigmask, fdinfo timerfd next expiry for periodic timers, concurrent fdinfo enumeration, and a /proc/net/tcp sl-density regression that opens non-TCP sockets before TCP listeners so the iterator visits rejected sockets first; the post-fix dense sl=0,1,... output matches qemu Linux ground truth, and a manual bug reintroduction confirms the test catches the sparse-slot regression with sl=4 expected=0). tests/test-io-opt.c adds sendfile and copy_file_range coverage for the read-intercept path.
Summary by cubic
Hardened procfs emulation for OOM controls and fdinfo to match Linux behavior and eliminate races. Adds live read intercepts (incl. sendfile/copy_file_range), richer fdinfo for special fds, per-open scratch dirs, and st_size=0 for synthetic files to prevent truncation.
New Features
- oom_score_adj; legacy oom_adj scales on write; oom_score is read-only and returns 0.
- /proc/self/fdinfo: adds eventfd count, signalfd mask, and timerfd clock/ticks/it_value/it_interval (snapshotted under lock).
Bug Fixes
- /proc files now report st_size=0 to avoid caller-side truncation.
- /proc/self/fd and /proc/self/fdinfo use per-open scratch dirs to prevent cross-open races and leaks; fix the lseek probe via fd_to_host_dup and preserve errno.
- /proc/<pid>/X uniformly aliases to /proc/self; /proc/net exists and lists tcp/udp/unix.
Written for commit 33fc800. Summary will update on new commits.