Harden /proc/self oom and fdinfo nodes #10

jserv merged 1 commit into main from procfs on May 6, 2026

@jserv commented on May 5, 2026

procfs emulation now treats the OOM trio (oom_score_adj, legacy oom_adj, read-only oom_score) as one process-wide adjustment with per-path read and write semantics: legacy oom_adj scales to oom_score_adj on writes (special-casing OOM_DISABLE -> SCORE_ADJ_MIN and OOM_ADJUST_MAX -> SCORE_ADJ_MAX so the boundary intent survives the lossy multiply) and back-clamps to [-17, 15] on reads; oom_score is read-only with a stub zero. The OOM write path serializes the truncate+pwrite+lseek under a new oom_write_lock and publishes the global atomic only after the backing rewrite succeeds, so a partial-rewrite failure no longer leaves the process-wide value diverged from a returned -1. Zero-length writes short-circuit to success (matches Linux for proc nodes; sys_writev previously hit -EINVAL in the parser). Stat reports st_size 0 for every synthetic /proc file so callers that pre-size buffers from stat cannot truncate (a 256-byte cap had silently chopped /proc/cpuinfo on hosts with many CPUs; a 2-byte cap had reduced -1000 to -1 on oom_score_adj).
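The legacy scaling rule is easiest to see in code. The following is a minimal sketch, not the elfuse implementation: the helper names `oom_adj_to_score_adj` and `score_adj_to_oom_adj` are hypothetical, and the constants are assumed to mirror the Linux uapi values (OOM_DISABLE = -17, OOM_ADJUST_MAX = 15, score range [-1000, 1000]).

```c
/* Assumed to match the Linux uapi constants; not taken from the elfuse tree. */
#define OOM_DISABLE        (-17)
#define OOM_ADJUST_MAX     15
#define OOM_SCORE_ADJ_MIN  (-1000)
#define OOM_SCORE_ADJ_MAX  1000

/* Hypothetical helper: map a legacy oom_adj write onto oom_score_adj.
 * Returns 0 on success, -1 when the input is out of range (the caller
 * would translate that to -EINVAL). */
static int oom_adj_to_score_adj(int oom_adj, int *score_adj)
{
    if (oom_adj < OOM_DISABLE || oom_adj > OOM_ADJUST_MAX)
        return -1;
    if (oom_adj == OOM_DISABLE)
        *score_adj = OOM_SCORE_ADJ_MIN;   /* boundary intent preserved */
    else if (oom_adj == OOM_ADJUST_MAX)
        *score_adj = OOM_SCORE_ADJ_MAX;   /* plain scaling would give 882 */
    else
        *score_adj = oom_adj * OOM_SCORE_ADJ_MAX / -OOM_DISABLE;
    return 0;
}

/* Hypothetical helper: map the process-wide oom_score_adj back to the
 * legacy scale for an oom_adj read, clamping to [-17, 15]. */
static int score_adj_to_oom_adj(int score_adj)
{
    int adj;

    if (score_adj == OOM_SCORE_ADJ_MAX)
        adj = OOM_ADJUST_MAX;
    else
        adj = score_adj * -OOM_DISABLE / OOM_SCORE_ADJ_MAX;
    if (adj < OOM_DISABLE)
        adj = OOM_DISABLE;
    if (adj > OOM_ADJUST_MAX)
        adj = OOM_ADJUST_MAX;
    return adj;
}
```

The integer division makes the round trip lossy: writing oom_adj=8 stores 470, and reading it back through the legacy node yields 7. The read-side clamp matters for values such as 999, which would otherwise scale to 16.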

A new read-intercept path mirrors the write side. proc_intercept_read and proc_intercept_readv let read/pread/readv/preadv on the OOM nodes return the live atomic value rather than the per-open temp file content, and sendfile/copy_file_range route through the same hook so proc-source byte counts stay consistent with the value an immediately following open would observe.
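The publish-after-rewrite ordering on the write side and the live-value read on this side can be sketched together. This is an illustration under assumptions: `oom_store` and `oom_read_live` are hypothetical names, the backing file is any regular file descriptor, and the exact role of the lseek in the real write path is a guess (here it rewinds the backing fd).

```c
#define _POSIX_C_SOURCE 200809L  /* for ftruncate and pwrite */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static _Atomic int oom_score_adj_live;  /* process-wide value */
static pthread_mutex_t oom_write_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical write path: rewrite the backing file under the lock and
 * publish the atomic only if every step succeeded. Returns 0 or -1. */
static int oom_store(int backing_fd, int value)
{
    char buf[16];
    int len = snprintf(buf, sizeof buf, "%d\n", value);
    int rc = -1;

    assert(len > 0 && (size_t)len < sizeof buf);
    pthread_mutex_lock(&oom_write_lock);
    if (ftruncate(backing_fd, 0) == 0 &&
        pwrite(backing_fd, buf, (size_t)len, 0) == (ssize_t)len &&
        lseek(backing_fd, 0, SEEK_SET) == 0) {
        atomic_store(&oom_score_adj_live, value);  /* publish last */
        rc = 0;
    }
    pthread_mutex_unlock(&oom_write_lock);
    return rc;
}

/* Hypothetical read intercept: serve bytes formatted from the live
 * value at offset off, ignoring whatever the per-open file holds. */
static ssize_t oom_read_live(char *dst, size_t cap, off_t off)
{
    char buf[16];
    int len = snprintf(buf, sizeof buf, "%d\n",
                       atomic_load(&oom_score_adj_live));
    size_t n;

    if (off < 0 || off >= (off_t)len)
        return 0;  /* EOF */
    n = (size_t)(len - (int)off);
    if (n > cap)
        n = cap;
    memcpy(dst, buf + off, n);
    return (ssize_t)n;
}
```

With this ordering a failed rewrite returns -1 and leaves the previously published value in place, so a reader never observes a value the writer was told did not take effect.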

/proc/self/fdinfo gains type-specific lines for the special fd classes elfuse implements: eventfd-count (16-char hex matching fs/eventfd.c), sigmask (16-char hex), and timerfd clockid/ticks/it_value/it_interval. The accessors live in src/syscall/fd.c (eventfd_fdinfo_snapshot, signalfd_fdinfo_snapshot, timerfd_fdinfo_snapshot) and read state under sfd_lock to prevent tearing across concurrent read/write/settime. The per-fd lseek probe now uses fd_to_host_dup so a concurrent close+reopen on another vCPU cannot redirect the probe to an unrelated host fd, and errno is saved/restored across the ESPIPE-prone lseek so non-seekable fds (sockets, pipes) do not pollute the caller's state.
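The position probe described above combines two ideas: probe a private duplicate rather than the original descriptor number, and keep errno untouched for the caller. A rough sketch, with plain dup() standing in for elfuse's fd_to_host_dup (whose signature is not shown here) and a hypothetical function name:

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical probe: report the current file offset of fd, or 0 for
 * non-seekable descriptors, without disturbing the caller's errno. */
static off_t fdinfo_probe_pos(int fd)
{
    int saved_errno = errno;
    off_t pos = 0;
    /* dup() stands in for fd_to_host_dup: the duplicate shares the
     * file offset but cannot be retargeted by a concurrent
     * close+reopen of the original descriptor number. */
    int priv = dup(fd);

    if (priv >= 0) {
        off_t r = lseek(priv, 0, SEEK_CUR);  /* ESPIPE on pipes/sockets */
        if (r != (off_t)-1)
            pos = r;
        close(priv);
    }
    errno = saved_errno;
    return pos;
}
```

On a pipe or socket the lseek fails with ESPIPE, the probe reports 0, and the caller still sees whatever errno it had before the call.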

/proc/self/fdinfo and /proc/self/fd no longer share one static backing directory across opens. The previous design let a second open unlink and recreate entries while a sibling thread iterated its dirfd; both nodes now go through proc_open_fd_scratch, which mkdtemps a private directory per open, populates it from a fresh fd-table snapshot, and tracks the path in proc_scratch_dirs[] for atexit cleanup so the previously-leaked backing dirs are reaped at process exit.
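A per-open scratch directory with exit-time reaping might look roughly like the sketch below. The names `scratch_open`, `scratch_cleanup`, and `scratch_dirs` are hypothetical, as are the /tmp template and the fixed table size; the sketch is single-threaded, whereas a real implementation would need to guard the table with a lock.

```c
#define _POSIX_C_SOURCE 200809L  /* for mkdtemp and strdup */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_SCRATCH 64
static char *scratch_dirs[MAX_SCRATCH];  /* paths owned by this table */
static int scratch_count;

/* Remove every tracked directory and the entries inside it. */
static void scratch_cleanup(void)
{
    for (int i = 0; i < scratch_count; i++) {
        DIR *d = opendir(scratch_dirs[i]);
        if (d) {
            struct dirent *e;
            while ((e = readdir(d)) != NULL) {
                char path[512];
                if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
                    continue;
                snprintf(path, sizeof path, "%s/%s",
                         scratch_dirs[i], e->d_name);
                unlink(path);
            }
            closedir(d);
        }
        rmdir(scratch_dirs[i]);
        free(scratch_dirs[i]);
    }
    scratch_count = 0;
}

/* Create a private directory for one open and populate it with one
 * entry per fd in the snapshot. Returns the path, or NULL on failure. */
static const char *scratch_open(const int *fds, int nfds)
{
    static int registered;
    char tmpl[] = "/tmp/procfd-XXXXXX";
    char *dir;

    if (scratch_count == MAX_SCRATCH || !mkdtemp(tmpl))
        return NULL;
    dir = strdup(tmpl);
    if (!dir) {
        rmdir(tmpl);
        return NULL;
    }
    for (int i = 0; i < nfds; i++) {
        char path[64];
        FILE *f;
        snprintf(path, sizeof path, "%s/%d", dir, fds[i]);
        f = fopen(path, "w");
        if (f)
            fclose(f);
    }
    if (!registered) {
        atexit(scratch_cleanup);  /* reap all dirs at process exit */
        registered = 1;
    }
    scratch_dirs[scratch_count++] = dir;
    return dir;
}
```

Because each open gets its own directory, a second open never unlinks entries that another thread is still iterating.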

The unix-net visitor's buffer-tail margin grew from 128 to 256 bytes to fit the longest possible row (54 fixed + 108 sun_path + newline); the previous margin let the snprintf truncate the path and drop the trailing newline. Eight explicit /proc/<pid>/X cases collapsed into one general alias-and-recurse, so /proc/<our_pid>/maps, /oom_score_adj, /limits, etc. now route through the matching /proc/self handler.
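The margin arithmetic works out to 54 + 108 + 1 = 163 bytes, which is why 128 was too small. The sketch below shows one way to make that failure visible instead of silent: a hypothetical `append_row` helper that checks the snprintf return value and refuses to emit a truncated row. The row layout (fixed prefix immediately followed by the path) is an assumption for illustration, not the real /proc/net/unix format string.

```c
#include <stdio.h>
#include <string.h>

#define UNIX_ROW_FIXED 54   /* fixed-width columns, per the text above */
#define SUN_PATH_MAX   108  /* sizeof(sun_path) on Linux */
#define UNIX_ROW_MAX   (UNIX_ROW_FIXED + SUN_PATH_MAX + 1)  /* + newline */

_Static_assert(UNIX_ROW_MAX > 128, "old 128-byte margin cannot hold a row");
_Static_assert(UNIX_ROW_MAX < 256, "new 256-byte margin holds a full row");

/* Hypothetical helper: append "<fixed><path>\n" at buf+used only if the
 * whole row fits. Returns the bytes appended, or 0 if it would truncate. */
static size_t append_row(char *buf, size_t cap, size_t used,
                         const char *fixed, const char *path)
{
    size_t room;
    int n;

    if (used >= cap)
        return 0;
    room = cap - used;
    n = snprintf(buf + used, room, "%s%s\n", fixed, path);
    if (n < 0 || (size_t)n >= room) {
        buf[used] = '\0';  /* discard the partial row */
        return 0;
    }
    return (size_t)n;
}
```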

Locked in by tests/test-tier-b.c: 35 cases including oom write persistence, out-of-range -EINVAL, oom_adj=15 -> 1000 scaling, oom_score read-only and write-rejected, zero-length writev, stat-size-zero, fdinfo eventfd-count hex, fdinfo sigmask, fdinfo timerfd next expiry for periodic timers, and concurrent fdinfo enumeration. One of them, a /proc/net/tcp sl-density regression, opens non-TCP sockets before TCP listeners so the iterator visits rejected sockets first; the post-fix dense sl=0,1,... output matches qemu Linux ground truth, and a manual bug reintroduction confirms the test catches the sparse-slot regression with sl=4 expected=0. tests/test-io-opt.c adds sendfile and copy_file_range coverage for the read-intercept path.
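The sl-density regression comes down to which counter supplies the sl column. A minimal sketch of the dense numbering, assuming a flat array of socket kinds in visit order (the enum and the function name `assign_tcp_slots` are hypothetical):

```c
#include <stddef.h>

enum sock_kind { SOCK_KIND_TCP, SOCK_KIND_UDP, SOCK_KIND_UNIX };

/* Hypothetical numbering pass: give each TCP socket the next dense sl
 * value and mark skipped sockets with -1. Returns the number of rows. */
static int assign_tcp_slots(const enum sock_kind *kinds, size_t n,
                            int *sl_out)
{
    int next_sl = 0;

    for (size_t i = 0; i < n; i++) {
        if (kinds[i] != SOCK_KIND_TCP) {
            sl_out[i] = -1;      /* rejected socket: no row, no slot */
            continue;
        }
        sl_out[i] = next_sl++;   /* row counter, not the visit index i */
    }
    return next_sl;
}
```

With four non-TCP sockets visited before the first TCP listener, the row counter yields sl=0 for that listener, whereas using the visit index would yield sl=4, which is the sparse output the test reports.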


Summary by cubic

Hardened procfs emulation for OOM controls and fdinfo to match Linux behavior and eliminate races. Adds live read intercepts (incl. sendfile/copy_file_range), richer fdinfo for special fds, per-open scratch dirs, and st_size=0 for synthetic files to prevent truncation.

  • New Features

    • Unified OOM controls: process-wide oom_score_adj; legacy oom_adj scales on write; oom_score is read-only and returns 0.
    • Live reads for OOM nodes: read/pread/readv/preadv and sendfile/copy_file_range return the current value, not a temp-file snapshot.
    • /proc/self/fdinfo: adds eventfd count, signalfd mask, and timerfd clock/ticks/it_value/it_interval (snapshotted under lock).
  • Bug Fixes

    • OOM writes are serialized and only publish after a successful rewrite; zero-length writes succeed.
    • Synthetic /proc files now report st_size=0 to avoid caller-side truncation.
    • /proc/self/fd and /proc/self/fdinfo use per-open scratch dirs to prevent cross-open races and leaks; fix the lseek probe via fd_to_host_dup and preserve errno.
    • Proc-backed fds stay on the slow I/O path so read/write interceptors always run; /proc/<pid>/X uniformly aliases to /proc/self; /proc/net exists and lists tcp/udp/unix.

Written for commit 33fc800.

@jserv merged commit e0552f6 into main on May 6, 2026
4 checks passed
@jserv deleted the procfs branch on May 6, 2026 00:17