Bug report
Summary
PHP-FPM worker processes crash with SIGABRT due to heap corruption in the background writer thread (`_dd_writer_loop`) with dd-trace-php 1.18.0. Each crash causes in-flight HTTP requests to be dropped with a 504 Gateway Timeout from the upstream reverse proxy.
Environment
- dd-trace-php version: 1.18.0
- PHP: 7.4 NTS, PHP-FPM (standard process manager, not FrankenPHP / not ZTS)
- OS: Linux x86_64 (glibc)
- libcurl: system libcurl (`/lib64/libcurl.so.4`)
- Rollback version that resolved the issue: 1.0.0
Symptom
PHP-FPM worker processes terminate with `SIGABRT` under production load. glibc aborts the process after detecting heap corruption (`munmap_chunk` calling `malloc_printerr`) when the background writer thread calls `curl_slist_free_all` on the stored HTTP header list (`writer->headers`) from `_dd_curl_reset_headers`.
For the request handled by the crashing worker, the upstream reverse proxy (nginx) returns a `504 Gateway Timeout` (`upstream timed out, 110: Connection timed out`), so request processing is interrupted mid-flight.
Coredump stack trace
#0 0x00007f59ee354690 in raise () from /lib64/libpthread.so.0
#1 0x00007f59dfa617b0 in libdd_crashtracker::collector::signal_handler_manager::chain_signal_handler ()
    at libdatadog/libdd-crashtracker/src/collector/signal_handler_manager.rs:125
#2 libdd_crashtracker::collector::crash_handler::handle_posix_sigaction () at libdatadog/libdd-crashtracker/src/collector/crash_handler.rs:209
#3 <signal handler called>
#4 0x00007f59ee594ae0 in raise () from /lib64/libc.so.6
#5 0x00007f59ee595f88 in abort () from /lib64/libc.so.6
#6 0x00007f59ee5d4b94 in __libc_message () from /lib64/libc.so.6
#7 0x00007f59ee5da80a in malloc_printerr () from /lib64/libc.so.6
#8 0x00007f59ee5dad16 in munmap_chunk () from /lib64/libc.so.6
#9 0x00007f59eb30021d in curl_slist_free_all () from /lib64/libcurl.so.4
#10 0x00007f59df6512d8 in _dd_curl_reset_headers (writer=<optimized out>)
    at /go/src/github.com/DataDog/apm-reliability/dd-trace-php/tmp/build_extension/ext/coms.c:928
#11 0x00007f59df6521db in _dd_curl_send_stack (metrics=0x7f59d4f1e990, stack=<optimized out>, writer=0x7f59e0215960)
    at /go/src/github.com/DataDog/apm-reliability/dd-trace-php/tmp/build_extension/ext/coms.c:1057
#12 _dd_writer_loop (_=<optimized out>) at /go/src/github.com/DataDog/apm-reliability/dd-trace-php/tmp/build_extension/ext/coms.c:1268
#13 0x00007f59ee34a40b in start_thread () from /lib64/libpthread.so.0
#14 0x00007f59ee64de7f in clone () from /lib64/libc.so.6
Impact
Crashes occurred intermittently over multiple days. Each crash dropped an in-flight request, and dozens of core dump files accumulated across multiple web hosts.
Workaround
Rolling back to 1.0.0 resolved the issue immediately.
Notes
- `DD_TRACE_CLI_ENABLED=false` and `DD_TRACE_SIDECAR_TRACE_SENDER=false` were already set in the environment at the time the crash occurred. These flags are documented to suppress the background writer thread for CLI, but the crash happened in PHP-FPM (web) regardless. The reason this configuration did not prevent the crash is unknown.
- Code inspection of `coms.c` shows that `writer->headers` is declared as `_Atomic(struct curl_slist *)` (line 412) but is written with a plain non-atomic assignment in `_dd_curl_set_headers` (line 955).
- `dd_agent_curl_headers` (the global slist used to seed per-request headers) is read from the writer thread without a lock, while it can be freed or reassigned from the main PHP thread (e.g., `ddtrace_coms_curl_shutdown`). A race between these two accesses is a candidate root cause for the heap corruption.
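
To make the suspected pattern concrete, here is a minimal sketch of a header-list swap in which each list is freed exactly once. It is not the code in `coms.c`. `struct slist` is a stand-in for libcurl's `struct curl_slist`, and `struct writer`, `writer_set_headers`, `writer_reset_headers`, and `freed_nodes` are hypothetical names used only for illustration. One caveat on the first note: in standard C11 a plain assignment to an `_Atomic` object is itself an atomic store, so if `_Atomic` is the real C11 qualifier here, the assignment alone may not be the defect. A separate load, free, and store sequence would still be racy, and that is what the sketch avoids with `atomic_exchange`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for libcurl's struct curl_slist so the sketch builds without libcurl. */
struct slist {
    char *data;
    struct slist *next;
};

/* Counts freed nodes so a double free or a leak is visible in a test. */
static atomic_int freed_nodes;

/* Append a copy of `s`. Returns the list head, or NULL on allocation failure
 * (the existing list is left untouched in that case). */
static struct slist *slist_append(struct slist *head, const char *s) {
    struct slist *node = malloc(sizeof *node);
    if (!node) return NULL;
    size_t len = strlen(s) + 1;
    node->data = malloc(len);
    if (!node->data) { free(node); return NULL; }
    memcpy(node->data, s, len);
    node->next = NULL;
    if (!head) return node;
    struct slist *tail = head;
    while (tail->next) tail = tail->next;
    tail->next = node;
    return head;
}

/* Free every node of the list; a NULL list is a no-op. */
static void slist_free_all(struct slist *head) {
    while (head) {
        struct slist *next = head->next;
        free(head->data);
        free(head);
        atomic_fetch_add(&freed_nodes, 1);
        head = next;
    }
}

/* Hypothetical writer that publishes its header list through an atomic pointer. */
struct writer {
    _Atomic(struct slist *) headers;
};

/* Publish `fresh` and free the previously stored list.
 * atomic_exchange returns the old pointer to exactly one caller, so two
 * threads racing here can never both free the same list. */
static void writer_set_headers(struct writer *w, struct slist *fresh) {
    struct slist *old = atomic_exchange(&w->headers, fresh);
    slist_free_all(old);
}

/* Drop the stored list, leaving NULL behind. Safe to call repeatedly. */
static void writer_reset_headers(struct writer *w) {
    writer_set_headers(w, NULL);
}
```

A caller would initialise the field with `atomic_init(&w.headers, NULL)`, build a list with `slist_append`, and hand it over with `writer_set_headers`. The exchange only protects ownership of the pointer. If another thread is still reading a list that has been swapped out (for example, a transfer that still references it), that needs a lock or a take-ownership step, which this sketch does not cover.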
Related