Skip to content

Fix WorkQueue destructor deadlock#147

Merged
bghgary merged 13 commits intoBabylonJS:mainfrom
bghgary:fix/workqueue-deadlock
Apr 7, 2026
Merged

Fix WorkQueue destructor deadlock#147
bghgary merged 13 commits intoBabylonJS:mainfrom
bghgary:fix/workqueue-deadlock

Conversation

@bghgary
Copy link
Copy Markdown
Contributor

@bghgary bghgary commented Apr 2, 2026

Fix the WorkQueue destructor deadlock by using a mutex-coordinated wake-up.

The Bug

WorkQueue::~WorkQueue() cancelled from the main thread via cancel() + notify_all(). The notify fired without holding the queue mutex, so if the worker thread hadn't entered condition_variable::wait() yet, the signal was lost and join() hung forever. See #146 for a deterministic repro of the deadlock against the old code.

Main Thread                          Worker Thread
                                     internal_drain():
                                       lock(mutex)
                                       check: empty? -> yes
                                       check: cancelled? -> no
                                       unlock(mutex)  <- about to call wait()
                                            |
~WorkQueue():                          RACE WINDOW
  cancel()                            (no mutex held)
  notify_all() ---- LOST! --------->
                                            |
                                       lock(mutex)
                                       wait(lock) ---- blocks forever

  join() ----------------------------- blocks forever
       |
   DEADLOCK

The Fix

Cancel immediately (so pending work is dropped promptly), then append a no-op work item to wake the worker. The no-op goes through push() which acquires the same mutex that wait() releases, so the notification cannot be lost.

Main Thread                          Worker Thread
                                     internal_drain():
                                       lock(mutex)
                                       check: empty? -> yes
                                       wait(lock) ---- releases mutex
                                            |
~WorkQueue():
  cancel()
  Append(no-op):
    push():
      lock(mutex) <---- acquires it (worker is in wait)
      queue.push(no-op)
      unlock(mutex)
      notify_one() ------------------> wakes up!
                                            |
                                       drain -> execute no-op
                                       loop check: cancelled? -> yes
                                       exit loop
  join() <----------------------------- thread exits
       |
   SUCCESS

Also fixes member declaration order in AppRuntime.h so m_options outlives m_workQueue during destruction.

Regression Test

Includes a deterministic test using arcana testing hooks (microsoft/arcana.cpp#59, merged) that sleeps while holding the queue mutex before wait(). This guarantees the worker is in the vulnerable window when destruction fires. The test passes with this fix and deadlocks with the old code (#146).

Follow-up

Merging WorkQueue into AppRuntime will be done in a separate PR to eliminate split-lifetime issues.

@bghgary bghgary force-pushed the fix/workqueue-deadlock branch 2 times, most recently from 8649d1f to 3e5dd39 Compare April 2, 2026 20:32
@bghgary bghgary marked this pull request as ready for review April 2, 2026 21:02
Copilot AI review requested due to automatic review settings April 2, 2026 21:02
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a shutdown deadlock in the JavaScript worker queue by inlining the former WorkQueue implementation into Babylon::AppRuntime and changing cancellation to be dispatched as a queued work item, then adds a deterministic regression test using arcana testing hooks to reproduce the old race.

Changes:

  • Merge WorkQueue into AppRuntime (thread, dispatcher, cancellation, suspend/resume) and cancel via queued work item.
  • Add a deterministic unit test (DestroyDoesNotDeadlock) guarded by ARCANA_TESTING_HOOKS.
  • Update test harness / build configuration (Win32 gtest argv init; arcana fork + ARCANA_TESTING_HOOKS definition).

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
Tests/UnitTests/Win32/App.cpp Initializes gtest with argc/argv and runs tests directly.
Tests/UnitTests/Shared/Shared.cpp Adds deterministic deadlock regression test using arcana hooks.
Core/AppRuntime/Source/WorkQueue.h Removes standalone WorkQueue (merged into AppRuntime).
Core/AppRuntime/Source/WorkQueue.cpp Removes standalone WorkQueue implementation.
Core/AppRuntime/Source/AppRuntime.cpp Moves queue/thread/cancel/suspend logic into AppRuntime and cancels via queued work item.
Core/AppRuntime/Source/AppRuntime_JSI.cpp Switches V8 JSI task runner adapter to post tasks via AppRuntime::Dispatch.
Core/AppRuntime/Include/Babylon/AppRuntime.h Adds dispatcher/thread/cancellation members and in-class Append helper.
Core/AppRuntime/CMakeLists.txt Removes WorkQueue sources from build.
CMakeLists.txt Switches arcana dependency to a fork and defines ARCANA_TESTING_HOOKS when tests are enabled.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@bghgary bghgary force-pushed the fix/workqueue-deadlock branch 2 times, most recently from 8452ce3 to 8ac4ba4 Compare April 2, 2026 21:12
WorkQueue::~WorkQueue() had a race condition where cancel() + notify_all()
fired without the queue mutex, so the signal could be lost if the worker
thread hadn't entered condition_variable::wait() yet, causing join() to
hang forever.

This change merges WorkQueue into AppRuntime, eliminating split-lifetime
issues, and dispatches cancellation as a work item via Append(). Since
push() acquires the queue mutex, it blocks until the worker enters wait(),
guaranteeing the notification is delivered.

Changes:
- Merge WorkQueue members into AppRuntime (thread, dispatcher, cancel
  source, env, suspension lock)
- Remove WorkQueue.h and WorkQueue.cpp
- Update AppRuntime_JSI.cpp TaskRunnerAdapter to use AppRuntime::Dispatch
- Add deterministic regression test using arcana testing hooks
- Fix member declaration order so m_options outlives worker thread

The regression test uses arcana::set_before_wait_callback() to sleep while
holding the queue mutex before wait(), deterministically triggering the race.
See BabylonJS#146 for the test running against the old broken code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary bghgary force-pushed the fix/workqueue-deadlock branch from 8ac4ba4 to 354f12a Compare April 2, 2026 23:26
@bghgary
Copy link
Copy Markdown
Contributor Author

bghgary commented Apr 2, 2026

⚠️ Do not merge until microsoft/arcana.cpp#59 is merged. The arcana.cpp FetchContent currently points to bghgary/arcana.cpp. Once the upstream PR merges, update the GIT_REPOSITORY and GIT_TAG back to microsoft/arcana.cpp.

bghgary and others added 4 commits April 2, 2026 16:59
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eded)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… item

The old destructor called cancel() + notify_all() from the main thread
without the queue mutex. If the worker thread hadn't entered
condition_variable::wait() yet, the notification was lost and join()
hung forever.

The fix dispatches cancellation as a work item via Append(). Since push()
acquires the same mutex that wait() releases, the notification cannot be
lost.

Also fixes member declaration order in AppRuntime.h so m_options outlives
m_workQueue during destruction.

Includes a deterministic regression test using arcana testing hooks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary bghgary changed the title Fix WorkQueue destructor deadlock by merging into AppRuntime Fix WorkQueue destructor deadlock Apr 7, 2026
@bghgary bghgary requested a review from ryantrem April 7, 2026 20:49
Clang/GCC with -Werror,-Wreorder-ctor requires the initializer list to
match the member declaration order. m_options is now declared before
m_workQueue, so initialize it first.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

bghgary and others added 3 commits April 7, 2026 14:06
The define is always set when tests are built, so the #ifdef guards
in Shared.cpp are unnecessary noise.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Destructor now calls cancel() immediately then appends a no-op to wake
  the worker, preserving prompt shutdown semantics.
- Test cleanup (hook reset, detachable destroy) runs on all paths including
  hook timeout, preventing hangs from scope-exit destruction.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The global add_compile_definitions in the subdirectory scope doesn't
apply to UnitTestsJNI which is defined in the parent Android CMakeLists.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary bghgary enabled auto-merge (squash) April 7, 2026 22:48
…or issue)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary bghgary disabled auto-merge April 7, 2026 22:57
bghgary and others added 2 commits April 7, 2026 16:08
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
If the hook didn't fire, there's no deadlock risk — the runtime destructs
normally on scope exit. The detachable thread is only needed when the
worker is confirmed stuck in the hook.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary bghgary merged commit 598f004 into BabylonJS:main Apr 7, 2026
19 checks passed
bghgary added a commit to BabylonJS/BabylonNative that referenced this pull request Apr 7, 2026
Updates JsRuntimeHost to include the fix for a race condition in
WorkQueue::~WorkQueue() where cancel() + notify_all() could miss
condition_variable::wait(), causing a deadlock on thread join.

See BabylonJS/JsRuntimeHost#147 for details.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary added a commit to BabylonJS/BabylonNative that referenced this pull request Apr 8, 2026
Bump JsRuntimeHost to include the fix for a race condition in
WorkQueue::~WorkQueue() where cancel() + notify_all() could miss
condition_variable::wait(), causing a deadlock on thread join during
shutdown.

See BabylonJS/JsRuntimeHost#147 for full details and threading diagrams.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary bghgary deleted the fix/workqueue-deadlock branch April 8, 2026 16:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants