Deterministic test proving WorkQueue destructor deadlock#146
Open
bghgary wants to merge 1 commit intoBabylonJS:mainfrom
Open
Deterministic test proving WorkQueue destructor deadlock#146bghgary wants to merge 1 commit intoBabylonJS:mainfrom
bghgary wants to merge 1 commit intoBabylonJS:mainfrom
Conversation
2cea4fa to
421a7ba
Compare
d806b08 to
fe9c37f
Compare
bghgary
added a commit
to bghgary/JsRuntimeHost
that referenced
this pull request
Apr 2, 2026
WorkQueue::~WorkQueue() had a race condition where cancel() + notify_all() fired without the queue mutex, so the signal could be lost if the worker thread hadn't entered condition_variable::wait() yet, causing join() to hang forever. This change merges WorkQueue into AppRuntime, eliminating split-lifetime issues, and dispatches cancellation as a work item via Append(). Since push() acquires the queue mutex, it blocks until the worker enters wait(), guaranteeing the notification is delivered. Changes: - Merge WorkQueue members into AppRuntime (thread, dispatcher, cancel source, env, suspension lock) - Remove WorkQueue.h and WorkQueue.cpp - Update AppRuntime_JSI.cpp TaskRunnerAdapter to use AppRuntime::Dispatch - Add deterministic regression test using arcana testing hooks - Fix member declaration order so m_options outlives worker thread The regression test uses arcana::set_before_wait_callback() to sleep while holding the queue mutex before wait(), deterministically triggering the race. See BabylonJS#146 for the test running against the old broken code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary
added a commit
to bghgary/JsRuntimeHost
that referenced
this pull request
Apr 2, 2026
WorkQueue::~WorkQueue() had a race condition where cancel() + notify_all() fired without the queue mutex, so the signal could be lost if the worker thread hadn't entered condition_variable::wait() yet, causing join() to hang forever. This change merges WorkQueue into AppRuntime, eliminating split-lifetime issues, and dispatches cancellation as a work item via Append(). Since push() acquires the queue mutex, it blocks until the worker enters wait(), guaranteeing the notification is delivered. Changes: - Merge WorkQueue members into AppRuntime (thread, dispatcher, cancel source, env, suspension lock) - Remove WorkQueue.h and WorkQueue.cpp - Update AppRuntime_JSI.cpp TaskRunnerAdapter to use AppRuntime::Dispatch - Add deterministic regression test using arcana testing hooks - Fix member declaration order so m_options outlives worker thread The regression test uses arcana::set_before_wait_callback() to sleep while holding the queue mutex before wait(), deterministically triggering the race. See BabylonJS#146 for the test running against the old broken code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary
added a commit
to bghgary/JsRuntimeHost
that referenced
this pull request
Apr 2, 2026
WorkQueue::~WorkQueue() had a race condition where cancel() + notify_all() fired without the queue mutex, so the signal could be lost if the worker thread hadn't entered condition_variable::wait() yet, causing join() to hang forever. This change merges WorkQueue into AppRuntime, eliminating split-lifetime issues, and dispatches cancellation as a work item via Append(). Since push() acquires the queue mutex, it blocks until the worker enters wait(), guaranteeing the notification is delivered. Changes: - Merge WorkQueue members into AppRuntime (thread, dispatcher, cancel source, env, suspension lock) - Remove WorkQueue.h and WorkQueue.cpp - Update AppRuntime_JSI.cpp TaskRunnerAdapter to use AppRuntime::Dispatch - Add deterministic regression test using arcana testing hooks - Fix member declaration order so m_options outlives worker thread The regression test uses arcana::set_before_wait_callback() to sleep while holding the queue mutex before wait(), deterministically triggering the race. See BabylonJS#146 for the test running against the old broken code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2c0dcab to
6763017
Compare
bghgary
added a commit
to bghgary/JsRuntimeHost
that referenced
this pull request
Apr 2, 2026
WorkQueue::~WorkQueue() had a race condition where cancel() + notify_all() fired without the queue mutex, so the signal could be lost if the worker thread hadn't entered condition_variable::wait() yet, causing join() to hang forever. This change merges WorkQueue into AppRuntime, eliminating split-lifetime issues, and dispatches cancellation as a work item via Append(). Since push() acquires the queue mutex, it blocks until the worker enters wait(), guaranteeing the notification is delivered. Changes: - Merge WorkQueue members into AppRuntime (thread, dispatcher, cancel source, env, suspension lock) - Remove WorkQueue.h and WorkQueue.cpp - Update AppRuntime_JSI.cpp TaskRunnerAdapter to use AppRuntime::Dispatch - Add deterministic regression test using arcana testing hooks - Fix member declaration order so m_options outlives worker thread The regression test uses arcana::set_before_wait_callback() to sleep while holding the queue mutex before wait(), deterministically triggering the race. See BabylonJS#146 for the test running against the old broken code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary
added a commit
to bghgary/JsRuntimeHost
that referenced
this pull request
Apr 2, 2026
WorkQueue::~WorkQueue() had a race condition where cancel() + notify_all() fired without the queue mutex, so the signal could be lost if the worker thread hadn't entered condition_variable::wait() yet, causing join() to hang forever. This change merges WorkQueue into AppRuntime, eliminating split-lifetime issues, and dispatches cancellation as a work item via Append(). Since push() acquires the queue mutex, it blocks until the worker enters wait(), guaranteeing the notification is delivered. Changes: - Merge WorkQueue members into AppRuntime (thread, dispatcher, cancel source, env, suspension lock) - Remove WorkQueue.h and WorkQueue.cpp - Update AppRuntime_JSI.cpp TaskRunnerAdapter to use AppRuntime::Dispatch - Add deterministic regression test using arcana testing hooks - Fix member declaration order so m_options outlives worker thread The regression test uses arcana::set_before_wait_callback() to sleep while holding the queue mutex before wait(), deterministically triggering the race. See BabylonJS#146 for the test running against the old broken code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
41ade60 to
3c4d787
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
Adds a deterministic unit test that reproduces a WorkQueue/AppRuntime destructor deadlock by leveraging Arcana “before wait” testing hooks, and updates the Windows unit-test entrypoint to support standard gtest CLI arguments.
Changes:
- Switches the
arcana.cppFetchContent dependency to a fork that includes testing hooks and definesARCANA_TESTING_HOOKSwhen tests are enabled. - Adds a new
AppRuntime.DestroyDoesNotDeadlocktest (guarded byARCANA_TESTING_HOOKS) that attempts to detect the destructor deadlock via timeouts. - Updates Win32 unit test
mainto passargc/argvinto GoogleTest initialization and run tests viaRUN_ALL_TESTS().
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| CMakeLists.txt | Points Arcana dependency to a fork and globally defines ARCANA_TESTING_HOOKS under JSRUNTIMEHOST_TESTS. |
| Tests/UnitTests/Shared/Shared.cpp | Adds a new deterministic deadlock-repro test using Arcana testing hooks. |
| Tests/UnitTests/Win32/App.cpp | Updates test entrypoint to initialize gtest with argc/argv and run tests directly. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
89322f6 to
9190b96
Compare
bghgary
added a commit
to bghgary/JsRuntimeHost
that referenced
this pull request
Apr 2, 2026
WorkQueue::~WorkQueue() had a race condition where cancel() + notify_all() fired without the queue mutex, so the signal could be lost if the worker thread hadn't entered condition_variable::wait() yet, causing join() to hang forever. This change merges WorkQueue into AppRuntime, eliminating split-lifetime issues, and dispatches cancellation as a work item via Append(). Since push() acquires the queue mutex, it blocks until the worker enters wait(), guaranteeing the notification is delivered. Changes: - Merge WorkQueue members into AppRuntime (thread, dispatcher, cancel source, env, suspension lock) - Remove WorkQueue.h and WorkQueue.cpp - Update AppRuntime_JSI.cpp TaskRunnerAdapter to use AppRuntime::Dispatch - Add deterministic regression test using arcana testing hooks - Fix member declaration order so m_options outlives worker thread The regression test uses arcana::set_before_wait_callback() to sleep while holding the queue mutex before wait(), deterministically triggering the race. See BabylonJS#146 for the test running against the old broken code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary
added a commit
that referenced
this pull request
Apr 7, 2026
Fix the WorkQueue destructor deadlock by using a mutex-coordinated wake-up. ## The Bug WorkQueue::~WorkQueue() cancelled from the main thread via cancel() + notify_all(). The notify fired **without holding the queue mutex**, so if the worker thread hadn't entered condition_variable::wait() yet, the signal was lost and join() hung forever. See #146 for a deterministic repro of the deadlock against the old code. ``` Main Thread Worker Thread internal_drain(): lock(mutex) check: empty? -> yes check: cancelled? -> no unlock(mutex) <- about to call wait() | ~WorkQueue(): RACE WINDOW cancel() (no mutex held) notify_all() ---- LOST! ---------> | lock(mutex) wait(lock) ---- blocks forever join() ----------------------------- blocks forever | DEADLOCK ``` ## The Fix Cancel immediately (so pending work is dropped promptly), then append a no-op work item to wake the worker. The no-op goes through push() which acquires the same mutex that wait() releases, so the notification cannot be lost. ``` Main Thread Worker Thread internal_drain(): lock(mutex) check: empty? -> yes wait(lock) ---- releases mutex | ~WorkQueue(): cancel() Append(no-op): push(): lock(mutex) <---- acquires it (worker is in wait) queue.push(no-op) unlock(mutex) notify_one() ------------------> wakes up! | drain -> execute no-op loop check: cancelled? -> yes exit loop join() <----------------------------- thread exits | SUCCESS ``` Also fixes member declaration order in AppRuntime.h so m_options outlives m_workQueue during destruction. ## Regression Test Includes a deterministic test using arcana testing hooks (microsoft/arcana.cpp#59, merged) that sleeps while holding the queue mutex before wait(). This guarantees the worker is in the vulnerable window when destruction fires. The test **passes** with this fix and **deadlocks** with the old code (#146). ## Follow-up Merging WorkQueue into AppRuntime will be done in a separate PR to eliminate split-lifetime issues. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
Author
|
Superseded by #147 which is now merged. This PR served its purpose demonstrating the deadlock with the old code. |
Reverts the deadlock fix to demonstrate the test detects it. Test updated with cleanup from BabylonJS#151. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
0d21742 to
15bb2f1
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Deterministic test proving the AppRuntime destructor has a deadlock race condition.
The Bug
WorkQueue::~WorkQueue() cancels from the main thread via cancel() + notify_all(). The notify fires without holding the queue mutex, so if the worker thread hasn't entered condition_variable::wait() yet, the signal is lost and join() hangs forever.
How This Test Works
Uses arcana testing hooks (microsoft/arcana.cpp#59) to sleep while holding the queue mutex right before wait(). This guarantees the worker is between the condition check and wait() when the destructor fires.
Expected Result
This PR should FAIL CI -- the test will detect the deadlock via a 5-second timeout on iteration 0.
Changes
All additive, no production code modified:
Dependencies