
Destroyed pool fix: prevent SIGSEGV on IOC exit when pvAccess holds NDArrays after driver/pool destroyed#570

Open
kgofron wants to merge 3 commits into areaDetector:master from kgofron:destroyed-pool

Conversation

@kgofron
Member

@kgofron kgofron commented Feb 19, 2026

Segmentation fault

The title, "fix: prevent SIGSEGV on IOC exit when pvAccess holds NDArrays after driver/pool destroyed", refers to a segmentation fault that occurs when the IOC exits after an acquisition has been performed (that is, after pool memory has been allocated).

epics> auto_settings.sav: 2354 of 2354 PV's connected
ACQUIRE CHANGE: ADAcquire=1 (was 0), current ADStatus=0
PrvHst: Checking if TCP streaming should start - WritePrvHst=0
PrvHst: WritePrvHst is disabled (0) - TCP streaming not started
After acquireStart: ADStatus=1
ACQUIRE CHANGE: ADAcquire=0 (was 1), current ADStatus=1
PrvImg TCP connection closed by peer
PrvHst TCP disconnected
After acquireStop: ADStatus=0

epics> exit
PrvHst TCP disconnected
./st.cmd: line 5: 2343260 Segmentation fault      ../../bin/linux-x86_64/tpx3App st_base.cmd

With the fix applied to ADCore 3.14.0 master, exit is clean:

epics> exit
PrvHst TCP disconnected

Problem

When an IOC exits (e.g. user types exit) after acquisition has run, the process can hit a SIGSEGV (signal 11). The crash is in NDArrayPool::release() (or equivalent use of the pool) after the detector driver and its NDArrayPool have already been destroyed.
Cause: shutdown order. The detector driver destructor runs and deletes pNDArrayPoolPvt_. Later, the pvAccess ServerContext is torn down (atexit). Its MonitorElements still hold NDArray-derived data. The deleter used by ntndArrayConverter (freeNDArray) calls NDArray::release() on those arrays. By then the pool is gone, so release() runs against freed memory → SIGSEGV.
This has been seen with areaDetector IOCs (e.g. ADTimePix3) using ADCore 3.12.1 and 3.14.0. See issue areaDetector/ADTimePix3#5.

Approach

Two parts:
“Destroyed pool” registry

  • Before the driver deletes its pool, it registers the pool pointer in a static set.
  • In NDArray::release(), we check that set using only the pool address (no dereference).
  • If the pool was registered as destroying, we set pNDArrayPool = NULL and return without calling the pool.
    So any late release() (from PVA or elsewhere) no-ops safely, even for NDArrays that are not the driver’s pArrays[] (e.g. copies handed to PVA).
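
The registry idea above can be sketched in isolation. This is only an illustration, not the code from the PR: DemoPool and DemoArray are hypothetical stand-ins for NDArrayPool and NDArray, the return codes are simplified to 0 and -1, and the helper demoLateReleaseIsNoOp() exists only to walk through the sequence.

```cpp
// Minimal sketch of a "destroyed pool" registry. DemoPool and DemoArray are
// hypothetical stand-ins for NDArrayPool and NDArray, not the ADCore classes.
#include <cassert>
#include <mutex>
#include <set>

class DemoPool {
public:
    int released = 0;                        // counts arrays returned to the pool
    void release() { ++released; }

    // Record a pool address just before the pool is deleted.
    static void registerDestroyingPool(const DemoPool* pool) {
        std::lock_guard<std::mutex> guard(registryMutex());
        registry().insert(pool);
    }

    // Look the address up; the pointer is compared, never dereferenced.
    static bool isPoolDestroyed(const DemoPool* pool) {
        std::lock_guard<std::mutex> guard(registryMutex());
        return registry().count(pool) != 0;
    }

private:
    static std::set<const DemoPool*>& registry() {
        static std::set<const DemoPool*> s;  // process-lifetime, only grows
        return s;
    }
    static std::mutex& registryMutex() {
        static std::mutex m;
        return m;
    }
};

struct DemoArray {
    DemoPool* pool;
    explicit DemoArray(DemoPool* p) : pool(p) {}

    // Returns 0 on success, -1 if there is no live pool to release into.
    int release() {
        if (pool == nullptr) return -1;
        if (DemoPool::isPoolDestroyed(pool)) {
            pool = nullptr;                  // forget the dead pool, do not call it
            return -1;
        }
        pool->release();
        return 0;
    }
};

// Walks through a normal release followed by a late one.
inline bool demoLateReleaseIsNoOp() {
    DemoPool* pool = new DemoPool;
    DemoArray early(pool);
    DemoArray late(pool);                    // e.g. a copy still held elsewhere

    bool ok = (early.release() == 0) && (pool->released == 1);

    DemoPool::registerDestroyingPool(pool);  // must happen before delete
    delete pool;

    ok = ok && (late.release() == -1);       // guarded: the pool is never touched
    ok = ok && (late.pool == nullptr);
    ok = ok && (late.release() == -1);       // repeated calls stay harmless
    return ok;
}
```

Calling demoLateReleaseIsNoOp() from a small main() should return true. The point of the sketch is the ordering: the address is registered before the delete, and the late release() only compares that address.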

asynNDArrayDriver destructor

  • Store maxAddr in a member maxAddr_.
  • In ~asynNDArrayDriver(): call NDArrayPool::registerDestroyingPool(pNDArrayPoolPvt_), null pNDArrayPool on each pArrays[i], then delete pNDArrayPoolPvt_.
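
The destructor sequence can be sketched as follows. Everything here is a simplified assumption for illustration: DemoDriver, DemoPool, DemoArray and destroyingPools() are hypothetical names, the registry stores integer addresses rather than pointers, and locking is omitted for brevity.

```cpp
// Sketch of the destructor ordering: register, null the back-pointers, delete.
// All types are hypothetical stand-ins, not the asynNDArrayDriver classes.
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

struct DemoPool {};
struct DemoArray { DemoPool* pool; };

// Process-lifetime set of addresses of pools that are being destroyed.
inline std::set<std::uintptr_t>& destroyingPools() {
    static std::set<std::uintptr_t> s;
    return s;
}

class DemoDriver {
public:
    // Initializer order matches the member declaration order below.
    explicit DemoDriver(int maxAddr)
        : maxAddr_(maxAddr), pool_(new DemoPool), arrays_(maxAddr, nullptr) {}

    DemoDriver(const DemoDriver&) = delete;
    DemoDriver& operator=(const DemoDriver&) = delete;

    ~DemoDriver() {
        // 1. Mark the pool as going away while its address is still valid.
        destroyingPools().insert(reinterpret_cast<std::uintptr_t>(pool_));
        // 2. Detach the arrays the driver itself tracks.
        for (int i = 0; i < maxAddr_; ++i) {
            if (arrays_[i] != nullptr) arrays_[i]->pool = nullptr;
        }
        // 3. Only now free the pool.
        delete pool_;
    }

    DemoPool* pool() const { return pool_; }
    void setArray(int addr, DemoArray* array) { arrays_[addr] = array; }

private:
    int maxAddr_;                     // remembered so the destructor can loop
    DemoPool* pool_;
    std::vector<DemoArray*> arrays_;  // not owned; stands in for pArrays[]
};

// Destroys a driver and reports whether both safety steps took effect.
inline bool demoDestructorOrder() {
    DemoArray held{nullptr};
    std::uintptr_t poolAddr = 0;
    {
        DemoDriver driver(2);
        held.pool = driver.pool();
        driver.setArray(0, &held);
        poolAddr = reinterpret_cast<std::uintptr_t>(driver.pool());
    }  // ~DemoDriver runs here
    return held.pool == nullptr && destroyingPools().count(poolAddr) == 1;
}
```

Nulling the tracked arrays covers the driver's own slots, while the registry covers arrays the driver no longer tracks, which is why the sketch does both.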

Changes

| File | Change |
| --- | --- |
| NDArray.h | Declare NDArrayPool::registerDestroyingPool(NDArrayPool*) and NDArrayPool::isPoolDestroyed(NDArrayPool*). |
| NDArrayPool.cpp | Implement both with a static std::set<NDArrayPool*> and a mutex. Pools are only ever added; the set is process-lifetime. |
| NDArray.cpp | At the start of NDArray::release(), if isPoolDestroyed(pNDArrayPool) then set pNDArrayPool = NULL and return ND_ERROR without calling the pool. |
| asynNDArrayDriver.h | Add private member int maxAddr_. |
| asynNDArrayDriver.cpp | Constructor: initialize maxAddr_(maxAddr) (initializer order matches member declaration). Destructor: call registerDestroyingPool(pNDArrayPoolPvt_), then loop over pArrays[0..maxAddr_-1] and set pArrays[i]->pNDArrayPool = NULL, then delete pNDArrayPoolPvt_. |

ADCore314_fix.md


@ericonr
Member

ericonr commented Mar 5, 2026

Hi Kaz! Have you seen #572 ? It would be nice to determine how these two interact, since you're currently registering your exit handler manually, and you could take advantage of ASYN_DESTRUCTIBLE in the future

@exzombie

exzombie commented Mar 5, 2026

You need the latest asyn, and the ADCore from the PR that Erico linked above. Then, follow these guidelines

@kgofron
Member Author

kgofron commented Mar 11, 2026

destructible-drivers

I compiled with the destructible-drivers branch, but unfortunately exit after acquisition still results in a segmentation fault.

epics> exit
PrvHst TCP disconnected
./st.cmd: line 5: 2228300 Segmentation fault      ../../bin/linux-x86_64/tpx3App st_base.cmd

Perhaps I missed something.

kg1@lap133454:/epics/support2/areaDetector/ADCore/iocBoot$ git branch -a
* destructible-drivers
kg1@lap133454:/epics/support2/areaDetector/ADCore/iocBoot$ git log
commit 91d002c3afa482c6da827f3b11b95706d1d5028b (HEAD -> destructible-drivers, origin/destructible-drivers)
Author: Jure Varlec <jure.varlec@cosylab.com>
Date:   Fri Feb 27 09:25:06 2026 +0000

    Add a release note about port shutdown
  • asyn/asyn/asynDriver/asynDriver.h
/* Version number names similar to those provide by base
 * These macros are always numeric */
#define ASYN_VERSION       4
#define ASYN_REVISION     45
#define ASYN_MODIFICATION  0

Summary

The destructible-drivers branch (ADCore PR 572) fixes shutdown order (asyn calls shutdownPortDriver() and then deletes the driver), but it does not include the pool-safety fix from ADCore PR 570. So the crash you see is still the same: after the driver (and its NDArrayPool) are destroyed, pvAccess (PVA) can later call NDArray::release() on arrays that belonged to that pool → use-after-free → SIGSEGV.

So ADTimePix3 needs both:

  • Destructible drivers (PR 572)
  • Destroyed-pool safety (PR 570) – this is what prevents the SIGSEGV when PVA (or anything else) calls release() after the pool is gone.

What to do

Apply the ADCore PR 570 (destroyed-pool) changes on top of your current destructible-drivers ADCore. That PR adds:

  • NDArrayPool: registerDestroyingPool(NDArrayPool*) and isPoolDestroyed(NDArrayPool*) (e.g. static set + mutex).
  • NDArray::release(): at the start, if isPoolDestroyed(pNDArrayPool) then set pNDArrayPool = NULL and return (do not call the pool).
  • asynNDArrayDriver destructor: call registerDestroyingPool(pNDArrayPoolPvt_), null pNDArrayPool on each pArrays[i], then delete pNDArrayPoolPvt_.

Ways to get that into your tree:

  • Option A: In your ADCore repo (on destructible-drivers), merge or cherry-pick the commits from PR 570 (the “fix: prevent SIGSEGV on IOC exit when pvAccess holds NDArrays after driver/pool destroyed” commit and any dependencies). Then rebuild ADCore and your IOC.
  • Option B: Manually apply the same code changes from PR 570’s diff into your current ADCore (same files and logic as above), then rebuild.

After ADCore has both destructible-drivers and the pool-safety logic, exit should no longer segfault. If you paste your ADCore branch/commit and the PR 570 patch or link, I can outline exact merge/cherry-pick steps or a minimal patch for your tree.

@kgofron
Member Author

kgofron commented Mar 13, 2026

Stack trace:

“SIGSEGV on exit with ADCore master. Backtrace shows crash in NDArrayPool::release (ADCore) called from PVA teardown (freeNDArray → NDArray::release) after the driver and its pool are already destroyed. Full bt full and info sharedlibrary attached.”

Attached: the full bt full output (and optionally info sharedlibrary) from tpx3_SIGSEGV.md, or the relevant frames (#0–#2, #37, #49, #65–#66, #71–#72, #79–#81), which show the PVA → NDArray release → pool release path and the exit-handler order.

GDB analysis – SIGSEGV on exit (ADCore master)

Build: ADTimePix3 IOC, run with ADCore current master (no PR 570, no destructible PR 572 in this run).
Repro: Acquisition run, then exit in the IOC shell.
Crash: SIGSEGV in NDArrayPool::release() at NDArrayPool.cpp:373 (onReleaseArray(pArray)).

Where it crashes

#0 – NDArrayPool::release(this=0x555556ccc870, pArray=0x7fff3c001a00) at NDArrayPool.cpp:373. So the fault is inside the pool’s release() (use-after-free on the pool or its internals).

Call chain (who called into the pool)

#81–79 – User types exit → epicsExit(0) → C library runs atexit handlers.
#72–71 – One of those handlers destroys the pvAccess ServerContext (static/atexit cleanup).
#70–65 – PVA tears down transports and channels, then ServerMonitorRequesterImpl::destroy.
#57–48 – MonitorElementQueue and its MonitorElements are destroyed.
#37–36 – MonitorElement destructor runs (PVA held a structure that contained array data).
#30–11 – Destructors for PVStructure → PVValueArray → shared_vector; the shared storage uses a custom deleter.
#2 – freeNDArray::operator() (ADCore ntndArrayConverter.cpp:44) – deleter for the PVA copy of the NDArray data.
#1 – NDArray::release() – called from that deleter.
#0 – NDArrayPool::release() – crash.

So: PVA is shutting down (atexit), destroying MonitorElements that still hold NDArray-backed data. Their deleter calls NDArray::release(), which calls NDArrayPool::release() on a pool that no longer exists.
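
The role of the custom deleter in this chain can be illustrated with std::shared_ptr. This is a sketch under stated assumptions: std::shared_ptr stands in for the PVA shared storage, ReleaseOwner is a hypothetical deleter written in the spirit of freeNDArray, and the log strings replace the real driver teardown so that nothing is actually freed too early.

```cpp
// Sketch: a custom deleter runs when the LAST holder lets go, which can be
// long after the producer side has shut down. All names are hypothetical.
#include <cassert>
#include <memory>
#include <string>
#include <vector>

struct DemoArray {
    std::vector<std::string>* log;    // event log shared with the demo
    std::vector<unsigned char> data;  // the image buffer
    void release() { log->push_back("array released"); }
};

// Deleter that does not free the buffer itself; it releases the owning array.
struct ReleaseOwner {
    DemoArray* owner;
    void operator()(unsigned char*) const { owner->release(); }
};

// Returns the order in which shutdown events happened.
inline std::vector<std::string> demoTeardownOrder() {
    std::vector<std::string> log;
    DemoArray array{&log, std::vector<unsigned char>(16)};

    // The buffer is shared; a consumer (think of a queued monitor element)
    // keeps its own reference.
    std::shared_ptr<unsigned char> view(array.data.data(), ReleaseOwner{&array});
    std::shared_ptr<unsigned char> queued = view;

    view.reset();                        // producer drops its reference: no release yet
    log.push_back("driver shut down");   // stands in for driver and pool teardown
    queued.reset();                      // consumer tears down last: deleter fires now
    return log;
}
```

In this sketch the release is logged after the simulated driver shutdown, which is the same ordering the backtrace shows between the driver teardown and the PVA exit handler.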

Why the pool is gone

Shutdown order is:

  • Earlier in exit, the ADTimePix3 driver is destroyed (either by epicsAtExit(exitCallbackC) or by asyn destructible teardown). That runs ~asynNDArrayDriver, which deletes the NDArrayPool (pNDArrayPoolPvt_). So the pool at 0x555556ccc870 is freed.
  • Later, the pvAccess ServerContext is destroyed (another atexit). Its MonitorElements still hold NDArrays (or views) that point at that same pool. When those are freed, freeNDArray → NDArray::release() → NDArrayPool::release() runs on the already deleted pool → use-after-free → SIGSEGV.
    So the trace matches the “destroyed pool” scenario: driver (and pool) destroyed first, PVA tears down later and calls release() on that pool.

Conclusion

Root cause: Use-after-free in ADCore: NDArray::release() is called from PVA’s deleter after the driver’s NDArrayPool has already been destroyed. The crash is in NDArrayPool::release (ADCore), not in ADTimePix3.
Fix: This is exactly what ADCore PR 570 (destroyed-pool safety) addresses: make NDArray::release() (and optionally the pool) safe when the pool has already been marked destroyed / unregistered, so PVA’s late release no-ops instead of touching freed pool memory.

tpx3_SIGSEGV.md
tpx3_SIGSEGV_frames.md

Why fix ADCore

  • This is an ADCore/PVA lifetime bug; ADTimePix3 can at best work around it, not truly fix it.

  • The crash is in:

    • NDArrayPool::release(), called from NDArray::release(), called from freeNDArray (from ntndArrayConverter)
    • Called while pvAccess ServerContext is being destroyed at atexit.
  • By that time, the driver’s NDArrayPool has already been deleted (driver shutdown/destructor ran earlier), but PVA still holds NDArrays whose deleter calls NDArray::release().

So the use‑after‑free is between PVA’s lifetime and ADCore’s pool, not inside ADTimePix3.

What ADTimePix3 can and cannot do

  • What it cannot do:

    • It cannot intercept NDArray::release() or freeNDArray calls coming from PVA. Once an NDArray is handed to PVA, only ADCore’s NDArray/NDArrayPool layer can protect against late release() calls (this is exactly what PR 570 does with the “destroyed pool” registry and early return in NDArray::release()).
  • What it can do as a workaround (no ADCore change):

    • Do not destroy the driver/pool at exit.
      • Don’t pass ASYN_DESTRUCTIBLE in the IOC (use ADTimePixConfigWithFlags(..., 0)), and
      • Don’t register epicsAtExit (or guard it behind a macro).
    • Then, when PVA shuts down at atexit, the NDArrayPool is still alive; NDArray::release() works and the process exits cleanly.
    • Trade‑off: the driver, its threads, and the pool leak until process exit (which is usually fine for IOC shutdown).
  • If we are willing to leak on exit: yes, ADTimePix3 can avoid the crash by not being destructible (no ASYN_DESTRUCTIBLE, no epicsAtExit deletion) when using an unpatched ADCore.

  • If you want a correct, leak‑free, future‑proof fix: ADCore must be changed (PR 570 or equivalent). That’s the only place that can safely handle “PVA calls NDArray::release() after the pool is gone.”
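
The leak-on-exit workaround can be sketched without any EPICS dependency. The names below are assumptions made for illustration: DEMO_DESTROY_ON_EXIT is a hypothetical flag loosely analogous to passing or omitting ASYN_DESTRUCTIBLE, and the exitHandlers vector stands in for handlers registered with epicsAtExit.

```cpp
// Sketch of the workaround: if no exit handler deletes the driver, its pool
// is still alive when a late release arrives. All names are hypothetical.
#include <cassert>
#include <functional>
#include <vector>

struct DemoPool { int released = 0; };

struct DemoDriver {
    DemoPool* pool;
    DemoDriver() : pool(new DemoPool) {}
    DemoDriver(const DemoDriver&) = delete;
    DemoDriver& operator=(const DemoDriver&) = delete;
    ~DemoDriver() { delete pool; }
};

// Hypothetical flag: when set, the driver is destroyed by an exit handler.
const int DEMO_DESTROY_ON_EXIT = 0x1;

// Creates a driver and registers a destroying exit handler only if asked to.
inline DemoDriver* configureDriver(int flags,
                                   std::vector<std::function<void()>>& exitHandlers) {
    DemoDriver* driver = new DemoDriver;
    if (flags & DEMO_DESTROY_ON_EXIT) {
        exitHandlers.push_back([driver] { delete driver; });
    }
    return driver;  // without the flag: deliberately leaked until process exit
}

// Simulates exit with the workaround active, then a late release.
inline bool demoLeakWorkaround() {
    std::vector<std::function<void()>> exitHandlers;
    DemoDriver* driver = configureDriver(0, exitHandlers);

    for (auto& handler : exitHandlers) handler();  // "exit": nothing to run
    driver->pool->released++;                      // late release: pool still valid

    bool ok = exitHandlers.empty() && driver->pool->released == 1;
    delete driver;  // cleanup for the demo only; a real IOC would just exit
    return ok;
}
```

The trade-off is the one described above: the driver and pool are never torn down explicitly, so the approach only hides the ordering problem rather than fixing it.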
