Fix WorkQueue destructor deadlock #147
Pull request overview
This PR fixes a shutdown deadlock in the JavaScript worker queue by inlining the former WorkQueue implementation into Babylon::AppRuntime and changing cancellation to be dispatched as a queued work item, then adds a deterministic regression test using arcana testing hooks to reproduce the old race.
Changes:
- Merge WorkQueue into AppRuntime (thread, dispatcher, cancellation, suspend/resume) and cancel via queued work item.
- Add a deterministic unit test (DestroyDoesNotDeadlock) guarded by ARCANA_TESTING_HOOKS.
- Update test harness / build configuration (Win32 gtest argv init; arcana fork + ARCANA_TESTING_HOOKS definition).
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| Tests/UnitTests/Win32/App.cpp | Initializes gtest with argc/argv and runs tests directly. |
| Tests/UnitTests/Shared/Shared.cpp | Adds deterministic deadlock regression test using arcana hooks. |
| Core/AppRuntime/Source/WorkQueue.h | Removes standalone WorkQueue (merged into AppRuntime). |
| Core/AppRuntime/Source/WorkQueue.cpp | Removes standalone WorkQueue implementation. |
| Core/AppRuntime/Source/AppRuntime.cpp | Moves queue/thread/cancel/suspend logic into AppRuntime and cancels via queued work item. |
| Core/AppRuntime/Source/AppRuntime_JSI.cpp | Switches V8 JSI task runner adapter to post tasks via AppRuntime::Dispatch. |
| Core/AppRuntime/Include/Babylon/AppRuntime.h | Adds dispatcher/thread/cancellation members and in-class Append helper. |
| Core/AppRuntime/CMakeLists.txt | Removes WorkQueue sources from build. |
| CMakeLists.txt | Switches arcana dependency to a fork and defines ARCANA_TESTING_HOOKS when tests are enabled. |
WorkQueue::~WorkQueue() had a race condition where cancel() + notify_all() fired without the queue mutex, so the signal could be lost if the worker thread hadn't entered condition_variable::wait() yet, causing join() to hang forever.

This change merges WorkQueue into AppRuntime, eliminating split-lifetime issues, and dispatches cancellation as a work item via Append(). Since push() acquires the queue mutex, it blocks until the worker enters wait(), guaranteeing the notification is delivered.

Changes:
- Merge WorkQueue members into AppRuntime (thread, dispatcher, cancel source, env, suspension lock)
- Remove WorkQueue.h and WorkQueue.cpp
- Update AppRuntime_JSI.cpp TaskRunnerAdapter to use AppRuntime::Dispatch
- Add deterministic regression test using arcana testing hooks
- Fix member declaration order so m_options outlives worker thread

The regression test uses arcana::set_before_wait_callback() to sleep while holding the queue mutex before wait(), deterministically triggering the race. See BabylonJS#146 for the test running against the old broken code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eded) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… item

The old destructor called cancel() + notify_all() from the main thread without the queue mutex. If the worker thread hadn't entered condition_variable::wait() yet, the notification was lost and join() hung forever.

The fix dispatches cancellation as a work item via Append(). Since push() acquires the same mutex that wait() releases, the notification cannot be lost.

Also fixes member declaration order in AppRuntime.h so m_options outlives m_workQueue during destruction.

Includes a deterministic regression test using arcana testing hooks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Clang/GCC with -Werror,-Wreorder-ctor requires the initializer list to match the member declaration order. m_options is now declared before m_workQueue, so initialize it first. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.
The define is always set when tests are built, so the #ifdef guards in Shared.cpp are unnecessary noise. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Destructor now calls cancel() immediately then appends a no-op to wake the worker, preserving prompt shutdown semantics.
- Test cleanup (hook reset, detachable destroy) runs on all paths including hook timeout, preventing hangs from scope-exit destruction.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The global add_compile_definitions in the subdirectory scope doesn't apply to UnitTestsJNI which is defined in the parent Android CMakeLists. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…or issue) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
If the hook didn't fire, there's no deadlock risk — the runtime destructs normally on scope exit. The detachable thread is only needed when the worker is confirmed stuck in the hook. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Updates JsRuntimeHost to include the fix for a race condition in WorkQueue::~WorkQueue() where cancel() + notify_all() could miss condition_variable::wait(), causing a deadlock on thread join. See BabylonJS/JsRuntimeHost#147 for details. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bump JsRuntimeHost to include the fix for a race condition in WorkQueue::~WorkQueue() where cancel() + notify_all() could miss condition_variable::wait(), causing a deadlock on thread join during shutdown. See BabylonJS/JsRuntimeHost#147 for full details and threading diagrams. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix the WorkQueue destructor deadlock by using a mutex-coordinated wake-up.
The Bug
WorkQueue::~WorkQueue() cancelled from the main thread via cancel() + notify_all(). The notify fired without holding the queue mutex, so if the worker thread hadn't entered condition_variable::wait() yet, the signal was lost and join() hung forever. See #146 for a deterministic repro of the deadlock against the old code.
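The lost wake-up can be reproduced in miniature. The sketch below is not the WorkQueue or arcana code; `RaceResult` and `SimulateLostWakeup` are hypothetical names, and a timed wait stands in for the real `wait()` so the demonstration returns instead of hanging. The worker checks the cancellation flag, pauses before waiting, and the "main" thread cancels and notifies during that pause without taking the mutex.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical result type for this illustration only.
struct RaceResult {
    bool sawCancelBeforeWait; // what the worker read before it started waiting
    bool timedOut;            // true when the wait ended by timeout (notify missed)
};

// Simulates the old shutdown order with standard primitives.
inline RaceResult SimulateLostWakeup() {
    std::mutex mutex;
    std::condition_variable cv;
    std::atomic<bool> cancelled{false};
    std::atomic<bool> inWindow{false};
    RaceResult result{};

    std::thread worker([&] {
        std::unique_lock<std::mutex> lock(mutex);
        result.sawCancelBeforeWait = cancelled.load(); // check: not cancelled yet
        inWindow.store(true);                          // between the check and wait()
        std::this_thread::sleep_for(std::chrono::milliseconds(100)); // widen the window
        // A real worker would call cv.wait(lock) here and could sleep forever.
        // The timeout only keeps this demonstration from hanging.
        result.timedOut =
            cv.wait_for(lock, std::chrono::milliseconds(200)) == std::cv_status::timeout;
    });

    while (!inWindow.load()) std::this_thread::yield();
    cancelled.store(true); // cancel() from the "main" thread ...
    cv.notify_all();       // ... then notify without the mutex: nobody is waiting yet
    worker.join();         // results are safe to read after join()
    return result;
}
```

On a typical run `timedOut` comes back `true`, which corresponds to the hang described above. The standard permits spurious wake-ups, so that field is illustrative rather than guaranteed.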
The Fix
Cancel immediately (so pending work is dropped promptly), then append a no-op work item to wake the worker. The no-op goes through push() which acquires the same mutex that wait() releases, so the notification cannot be lost.
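A minimal sketch of this shutdown order, assuming a simple mutex-protected queue: `MiniWorkQueue` and its members are hypothetical and only mirror the shape of the fix, not the actual AppRuntime or arcana implementation. The destructor sets the cancellation flag, then goes through `Append()`, which takes the same mutex the worker's `wait()` releases.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical miniature of the pattern described above.
class MiniWorkQueue {
public:
    MiniWorkQueue() : m_thread([this] { Run(); }) {}

    ~MiniWorkQueue() {
        m_cancelled.store(true); // cancel first so pending work is dropped
        Append([] {});           // no-op work item whose only job is the wake-up
        m_thread.join();
    }

    void Append(std::function<void()> work) {
        {
            // If the worker holds the mutex (checking its predicate), this
            // blocks until the worker releases it inside wait().
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push(std::move(work));
        }
        m_cv.notify_one();
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(m_mutex);
        for (;;) {
            m_cv.wait(lock, [this] { return !m_queue.empty(); });
            if (m_cancelled.load()) return; // drop whatever is still queued
            std::function<void()> work = std::move(m_queue.front());
            m_queue.pop();
            lock.unlock(); // run the work item without holding the mutex
            work();
            lock.lock();
        }
    }

    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::queue<std::function<void()>> m_queue;
    std::atomic<bool> m_cancelled{false};
    std::thread m_thread; // declared last so it starts after the members above exist
};
```

Because the no-op is pushed under the mutex and the worker re-checks the queue under that same mutex, the wake-up is observed whether the worker is about to wait, already waiting, or busy running a work item.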
Also fixes member declaration order in AppRuntime.h so m_options outlives m_workQueue during destruction.
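The ordering rule behind this is that C++ destroys non-static members in reverse declaration order. The sketch below uses hypothetical `Tracked` and `Runtime` types (only the member names are borrowed from the description) to show that a member declared first is destroyed last, so anything the work queue's thread still reads during shutdown remains alive.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Records the order in which members are destroyed.
inline std::vector<std::string>& DestructionLog() {
    static std::vector<std::string> log;
    return log;
}

// Hypothetical stand-in for any member with a non-trivial destructor.
struct Tracked {
    std::string name;
    explicit Tracked(std::string n) : name(std::move(n)) {}
    ~Tracked() { DestructionLog().push_back(name); }
};

// Hypothetical stand-in for the runtime class; not the real AppRuntime.
struct Runtime {
    Tracked m_options{"m_options"};     // declared first, destroyed last
    Tracked m_workQueue{"m_workQueue"}; // declared last, destroyed first
};
```

Destroying a `Runtime` logs `m_workQueue` and then `m_options`; swapping the two declarations would reverse that and let the options die while the queue could still be using them.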
Regression Test
Includes a deterministic test using arcana testing hooks (microsoft/arcana.cpp#59, merged) that sleeps while holding the queue mutex before wait(). This guarantees the worker is in the vulnerable window when destruction fires. The test passes with this fix and deadlocks with the old code (#146).
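The idea of such a hook can be sketched with standard primitives. Everything here is hypothetical: `HookedWorker` takes its before-wait callback as a constructor argument and does not reproduce the arcana API, whose exact signature is not shown in this PR text. The callback runs on the worker with the mutex held, immediately before `wait()`, and the test destroys the worker while the callback is still sleeping.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical worker with a before-wait testing hook.
class HookedWorker {
public:
    explicit HookedWorker(std::function<void()> beforeWait)
        : m_beforeWait(std::move(beforeWait)), m_thread([this] { Run(); }) {}

    ~HookedWorker() {
        m_cancelled.store(true);
        {
            // Blocks while the worker sleeps in the hook; proceeds once the
            // worker has entered wait() and released the mutex.
            std::lock_guard<std::mutex> lock(m_mutex);
            m_wake = true;
        }
        m_cv.notify_all();
        m_thread.join();
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (!m_cancelled.load()) {
            if (m_beforeWait) m_beforeWait(); // mutex is held here, just before wait()
            m_cv.wait(lock, [this] { return m_wake; });
        }
    }

    std::function<void()> m_beforeWait;
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::atomic<bool> m_cancelled{false};
    bool m_wake = false;  // guarded by m_mutex
    std::thread m_thread; // declared last so the state above exists first
};

// Destroys the worker while it is confirmed to be inside the hook.
inline bool DestroyDuringVulnerableWindow() {
    std::atomic<bool> hookFired{false};
    {
        HookedWorker worker([&] {
            hookFired.store(true);
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
        });
        while (!hookFired.load()) std::this_thread::yield();
    } // destructor runs while the worker is still sleeping in the hook
    return hookFired.load();
}
```

If the function returns, destruction completed even though it started inside the vulnerable window. With a notify that bypasses the mutex and a plain `wait()`, the same timing would leave the worker asleep and `join()` blocked.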
Follow-up
Merging WorkQueue into AppRuntime will be done in a separate PR to eliminate split-lifetime issues.