Destroyed pool fix: prevent SIGSEGV on IOC exit when pvAccess holds NDArrays after driver/pool destroyed#570
kgofron wants to merge 3 commits into areaDetector:master
Hi Kaz! Have you seen #572? It would be nice to determine how these two interact, since you're currently registering your exit handler manually, and you could take advantage of
You need the latest asyn, and the ADCore from the PR that Erico linked above. Then, follow these guidelines
destructible-drivers
I compiled with the destructible-drivers branch, but unfortunately exiting after acquisition still results in a segmentation fault. Perhaps I missed something.
Summary
The destructible-drivers branch (ADCore PR 572) fixes shutdown order (asyn calls shutdownPortDriver() and then deletes the driver), but it does not include the pool-safety fix from ADCore PR 570. So the crash you see is still the same: after the driver (and its NDArrayPool) are destroyed, pvAccess (PVA) can later call NDArray::release() on arrays that belonged to that pool → use-after-free → SIGSEGV. So ADTimePix3 needs both:
What to do
Apply the ADCore PR 570 (destroyed-pool) changes on top of your current destructible-drivers ADCore. That PR adds:
Ways to get that into your tree:
After ADCore has both destructible-drivers and the pool-safety logic, exit should no longer segfault. If you paste your ADCore branch/commit and the PR 570 patch or link, I can outline exact merge/cherry-pick steps or a minimal patch for your tree.
Stack trace:
“SIGSEGV on exit with ADCore master. Backtrace shows crash in NDArrayPool::release (ADCore) called from PVA teardown (freeNDArray → NDArray::release) after the driver and its pool are already destroyed. Full bt full and info sharedlibrary attached.”
Attached: The full bt full output (and optionally info sharedlibrary) from your tpx3_SIGSEGV.md file, or the relevant frames (#0–#2, #37, #49, #65–#66, #71–#72, #79–#81) to see the PVA → NDArray release → pool release path and the exit-handler order.
GDB analysis – SIGSEGV on exit (ADCore master)
Build: ADTimePix3 IOC, run with ADCore current master (no PR 570, no destructible PR 572 in this run).
Where it crashes
#0 – NDArrayPool::release(this=0x555556ccc870, pArray=0x7fff3c001a00) at NDArrayPool.cpp:373
So the fault is inside the pool’s release() (use-after-free on the pool or its internals).
Call chain (who called into the pool)
#81–79 – User types exit → epicsExit(0) → C library runs atexit handlers.
So: PVA is shutting down (atexit), destroying MonitorElements that still hold NDArray-backed data. Their deleter calls NDArray::release(), which calls NDArrayPool::release() on a pool that no longer exists.
Why the pool is gone
Shutdown order is:
Conclusion
Root cause: Use-after-free in ADCore: NDArray::release() is called from PVA’s deleter after the driver’s NDArrayPool has already been destroyed. The crash is in NDArrayPool::release (ADCore), not in ADTimePix3.
tpx3_SIGSEGV.md
Why fix ADCore
So the use-after-free is between PVA’s lifetime and ADCore’s pool, not inside ADTimePix3.
What ADTimePix3 can and cannot do
Segmentation fault
"fix: prevent SIGSEGV on IOC exit when pvAccess holds NDArrays after driver/pool destroyed" refers to the segmentation fault that occurs when the IOC exits after an acquisition has been performed (memory/pool allocated).
Fix applied to ADCore 3.14.0 master.
Problem
When an IOC exits (e.g. user types exit) after acquisition has run, the process can hit a SIGSEGV (signal 11). The crash is in NDArrayPool::release() (or equivalent use of the pool) after the detector driver and its NDArrayPool have already been destroyed.
Cause: Shutdown order: the detector driver destructor runs and deletes pNDArrayPoolPvt_. Later, the pvAccess ServerContext is torn down (atexit). Its MonitorElements still hold NDArray-derived data. The deleter used by ntndArrayConverter (freeNDArray) calls NDArray::release() on those arrays. By then the pool is gone, so release() runs against freed memory → SIGSEGV.
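The ordering described above can be illustrated with a small stand-alone sketch. It is not ADCore code: `ToyPool`, `ToyArray` and `shutdownOrder()` are hypothetical names invented for this example, and the deleter deliberately never dereferences the dangling pool pointer, so the sketch only records the order of events instead of crashing.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT the ADCore classes.
struct ToyPool {
    std::vector<std::string>* log;
    explicit ToyPool(std::vector<std::string>* l) : log(l) {}
    ~ToyPool() { log->push_back("pool destroyed"); }
};

struct ToyArray {
    ToyPool* pool;  // raw back-pointer: the array remembers which pool owns it
};

// Replays "driver deleted first, consumer teardown second" and returns
// the events in the order they happened.
std::vector<std::string> shutdownOrder() {
    std::vector<std::string> log;
    ToyPool* pool = new ToyPool(&log);
    ToyArray* array = new ToyArray{pool};

    // A consumer (PVA in the real case) holds the array through a shared_ptr
    // whose custom deleter would normally call back into the pool.
    std::shared_ptr<ToyArray> heldByConsumer(array, [&log](ToyArray* a) {
        // In the real crash, this is the point where release() touches the
        // pool. Here a->pool is dangling, so it is intentionally not used.
        log.push_back("deleter ran");
        delete a;
    });

    delete pool;             // driver destructor: the pool goes away first
    heldByConsumer.reset();  // later teardown: the deleter runs afterwards
    return log;
}
```

Calling `shutdownOrder()` yields the pool destruction first and the deleter second, which is the window in which a real `release()` would run against freed memory.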
This has been seen with areaDetector IOCs (e.g. ADTimePix3) using ADCore 3.12.1 and 3.14.0. See issue areaDetector/ADTimePix3#5.
Approach
Two parts:
“Destroyed pool” registry
So any late release() (from PVA or elsewhere) no-ops safely, even for NDArrays that are not the driver’s pArrays[] (e.g. copies handed to PVA).
asynNDArrayDriver destructor
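As a rough illustration of the "destroyed pool" registry idea, the following is a minimal sketch of how a late release() could be turned into a no-op. It is an assumption about the general technique, not the code from this PR: `DestroyedPools`, `ToyPool`, `ToyArray` and all of their members are hypothetical names, and the real ADCore changes may be structured differently.

```cpp
#include <cassert>
#include <mutex>
#include <unordered_set>

class ToyPool;  // hypothetical stand-in, not NDArrayPool

// Hypothetical registry of pool addresses whose destructor has already run.
class DestroyedPools {
public:
    static void markDestroyed(const ToyPool* p) {
        std::lock_guard<std::mutex> guard(mtx());
        pools().insert(p);
    }
    static void markAlive(const ToyPool* p) {
        // A new pool may reuse the address of an old one; forget the old entry.
        std::lock_guard<std::mutex> guard(mtx());
        pools().erase(p);
    }
    static bool isDestroyed(const ToyPool* p) {
        std::lock_guard<std::mutex> guard(mtx());
        return pools().count(p) != 0;
    }
private:
    // Heap-allocated and never freed, so both objects stay valid while
    // atexit handlers and static destructors are running.
    static std::mutex& mtx() {
        static std::mutex* m = new std::mutex;
        return *m;
    }
    static std::unordered_set<const ToyPool*>& pools() {
        static auto* s = new std::unordered_set<const ToyPool*>;
        return *s;
    }
};

class ToyPool {
public:
    ToyPool() { DestroyedPools::markAlive(this); }
    ~ToyPool() { DestroyedPools::markDestroyed(this); }
    void release() { ++released_; }
    int released() const { return released_; }
private:
    int released_ = 0;
};

struct ToyArray {
    ToyPool* pool;
    // Returns true if the release reached the pool, false if it was skipped.
    bool release() {
        if (pool == nullptr || DestroyedPools::isDestroyed(pool)) return false;
        pool->release();
        return true;
    }
};
```

In this sketch the check and the call into the pool are not atomic, so it only protects against releases that arrive after the pool destructor has finished, which matches the exit-time scenario described above; a pool being destroyed concurrently with a release would need additional synchronization.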
Changes
ADCore314_fix.md
References