Skip to content

Deterministic test proving AppRuntime destructor deadlock#145

Closed
bghgary wants to merge 3 commits intoBabylonJS:mainfrom
bghgary:repro/appruntime-deadlock
Closed

Deterministic test proving AppRuntime destructor deadlock#145
bghgary wants to merge 3 commits intoBabylonJS:mainfrom
bghgary:repro/appruntime-deadlock

Conversation

@bghgary
Copy link
Copy Markdown
Contributor

@bghgary bghgary commented Apr 2, 2026

Deterministic test proving the AppRuntime destructor has a deadlock race condition.

The Bug

The destructor cancels from the main thread via cancel() + notify_all(). The notify fires without holding the queue mutex, so if the worker thread hasn't entered condition_variable::wait() yet, the signal is lost and join() hangs forever.

How This Test Works

Uses an arcana testing hook (ARCANA_TESTING_HOOKS) that sleeps while holding the queue mutex, right before wait(). This guarantees the worker is between the condition check and wait() when the destructor fires.

  • Old code: cancel() + notify_all() don't need the mutex, so they fire and the notification is lost while the worker sleeps. Worker enters wait() with no pending signal. Deadlock.
  • Fixed code: Append(cancel) calls push() which needs the mutex, so it blocks until the worker enters wait(). Push completes, notifies, worker wakes. No deadlock.

Expected Result

This PR should FAIL CI -- it intentionally has the old broken destructor to demonstrate the deadlock. The test will timeout after 5 seconds on iteration 0.

Dependencies

bghgary and others added 3 commits April 1, 2026 13:24
WorkQueue was a separate class with its own thread, dispatcher, and cancel
source. AppRuntime captured 'this' in lambdas passed to WorkQueue, creating
split-lifetime issues where the worker thread could access partially-destroyed
AppRuntime members during shutdown.

This change inlines WorkQueue's functionality directly into AppRuntime:
- Thread, dispatcher, cancel source, and env are now AppRuntime members
- Destruction order is deterministic (no cross-object lifetime issues)
- Cancellation is dispatched as a work item instead of signaled externally,
  eliminating the race between cancel and blocking_tick's condition_variable
- WorkQueue.h and WorkQueue.cpp are removed
- AppRuntime_JSI.cpp updated to use AppRuntime::Dispatch instead of
  WorkQueue::Append

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tests that destroying AppRuntime does not deadlock when the worker thread
is inside blocking_tick's condition_variable::wait. Uses a timeout to
detect deadlock deterministically.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Uses arcana testing hooks to sleep while holding the queue mutex right
before condition_variable::wait(). This guarantees the worker is between
the condition check and wait() when the destructor fires cancel+notify.

The old destructor's notify_all() fires without the mutex, so the signal
is lost and the worker deadlocks in wait(). The test detects this via a
5-second timeout.

This branch intentionally has the OLD broken destructor to demonstrate
the failure. The fix branch (fix/merge-workqueue-into-appruntime) uses
Append(cancel) which coordinates through push()/mutex and passes this
test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary
Copy link
Copy Markdown
Contributor Author

bghgary commented Apr 2, 2026

Superseded — was based on the merged WorkQueue branch, not main.

@bghgary bghgary closed this Apr 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant