Skip to content

Fix unix socket driver listener lifecycle races#2156

Draft
petermm wants to merge 4 commits intoatomvm:mainfrom
petermm:fix-unix-socket_driver
Draft

Fix unix socket driver listener lifecycle races#2156
petermm wants to merge 4 commits intoatomvm:mainfrom
petermm:fix-unix-socket_driver

Conversation

@petermm
Copy link
Contributor

@petermm petermm commented Mar 4, 2026

https://ampcode.com/threads/T-019cb8b8-9e4c-7316-9566-c7e3f5f2b6db

Claims to fix #2155

1

Fix a use-after-free race condition in the generic_unix socket driver's close handler, detected by Valgrind during CI gen_tcp tests.

The close handler in socket_consume_mailbox used a two-phase locking pattern: it acquired the glb->listeners lock to NULL-out the socket_data listener pointers, released it, then called sys_unregister_listener (which re-acquires the lock) to remove the listener from the linked list. Between the unlock and re-lock, the event loop thread could also unlink the same listener node via process_listener_handler after the callback returned NULL. The subsequent list_remove in sys_unregister_listener then operated on stale prev/next pointers, corrupting the list or writing to freed memory.

The fix makes the pointer detach and list unlink atomic under a single lock hold by introducing sys_unregister_listener_nolock, a variant that assumes the caller already holds the glb->listeners write lock. The close handler now NULLs the pointers, unlinks the listeners, and releases the lock before freeing the memory.

2

Follow-up analysis also found a separate listener publication race in the same generic_unix socket driver path. Several active and passive listener setup paths registered a listener in the global listener list before publishing the corresponding socket_data->{active,passive}_listener pointer. If the event loop processed that listener in the small window before the pointer assignment, the callback could consume, free, or replace the listener first, and the later assignment would leave socket_data pointing at stale listener memory. That race could surface as random hangs or corruption in CI gen_tcp tests.

Together, these fixes make listener publication, teardown, and unlinking consistent with the event loop's locking model on generic_unix.

This pattern is specific to generic_unix; ESP32 and RP2 use a single global listener for the socket driver subsystem and are not affected.

These changes are made under both the "Apache 2.0" and the "GNU Lesser General
Public License 2.1 or later" license terms (dual license).

SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later

@petermm petermm force-pushed the fix-unix-socket_driver branch 4 times, most recently from 101457a to 350a449 Compare March 11, 2026 14:00
@petermm petermm changed the title Fix use-after-free race in socket driver close Fix socket driver listener lifecycle races Mar 11, 2026
@petermm petermm changed the title Fix socket driver listener lifecycle races Fix unix socket driver listener lifecycle races Mar 11, 2026
@petermm petermm force-pushed the fix-unix-socket_driver branch from 93f634e to 299c561 Compare March 12, 2026 20:17
https://ampcode.com/threads/T-019cb8b8-9e4c-7316-9566-c7e3f5f2b6db

Fix a use-after-free race condition in the generic_unix socket driver's close
handler, detected by Valgrind during CI gen_tcp tests.

The close handler in socket_consume_mailbox used a two-phase locking pattern:
it acquired the glb->listeners lock to NULL-out the socket_data listener
pointers, released it, then called sys_unregister_listener (which re-acquires
the lock) to remove the listener from the linked list. Between the unlock and
re-lock, the event loop thread could also unlink the same listener node via
process_listener_handler after the callback returned NULL. The subsequent
list_remove in sys_unregister_listener then operated on stale prev/next
pointers, corrupting the list or writing to freed memory.

The fix makes the pointer detach and list unlink atomic under a single lock hold
by introducing sys_unregister_listener_nolock — a variant that assumes the
caller already holds the glb->listeners write lock. The close handler now
NULLs the pointers, unlinks the listeners, and releases the lock before freeing
the memory.

This pattern is specific to generic_unix; ESP32 and RP2 use a single global
listener for the socket driver subsystem and are not affected.

Signed-off-by: Peter M <petermm@gmail.com>
@petermm petermm force-pushed the fix-unix-socket_driver branch from 299c561 to e7b8f0c Compare March 13, 2026 18:29
That commit fixed the close-time double-unlink/use-after-free race in
generic_unix listener teardown. This change addresses a separate race in
listener registration, where a listener could become visible to the
event loop before socket_data published the corresponding pointer. Both
fixes are needed; this patch complements the earlier teardown fix rather
than replacing it.

Fix a race in the generic_unix socket driver where newly created
listeners were registered in the global listener list before the
corresponding socket_data->{active,passive}_listener pointer was
published.

If the event loop processed the listener in that window, the callback
could consume, free, or replace the listener before the socket driver
stored the pointer. The later assignment then left socket_data
pointing at stale listener memory, which could surface as random hangs
or corruption in gen_tcp tests, including timeouts waiting for the
server helper process to start.

Publish the listener pointer before calling sys_register_listener
in all affected paths:

active UDP receive listener setup
active TCP receive listener setup
passive recv/recvfrom listener setup
accept listener setup
This complements the earlier close-path fix by removing another
generic_unix listener lifecycle race.

Signed-off-by: Peter M <petermm@gmail.com>
@petermm petermm force-pushed the fix-unix-socket_driver branch from e7b8f0c to bf13b72 Compare March 14, 2026 16:27
petermm and others added 2 commits March 14, 2026 18:34
Configure newly accepted generic_unix TCP sockets as nonblocking before publishing them to the socket driver machinery. If fcntl fails, close the accepted fd and reply with an error so callers never observe a partially initialized connection.

Signed-off-by: Peter M <petermm@gmail.com>
If the owning process terminates between the accept() call and
globalcontext_get_process_lock(), ctx will be NULL. The code
immediately dereferences ctx->platform_data without checking,
causing a segfault. Add a NULL check consistent with other
callbacks (e.g. recv_callback), closing the accepted fd if
needed and freeing the listener before returning.

Amp-Thread-ID: https://ampcode.com/threads/T-019ced69-2d93-772b-b884-c90e4cbaefe5
Co-authored-by: Amp <amp@ampcode.com>
@petermm petermm marked this pull request as draft March 19, 2026 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

"Invalid write" in socket_consume_mailbox / sys_unregister_listener / sys_register_listener

2 participants