Skip to content

Add optional testing hook for blocking_concurrent_queue#59

Merged
bghgary merged 3 commits intomicrosoft:masterfrom
bghgary:feature/testing-hooks
Apr 2, 2026
Merged

Add optional testing hook for blocking_concurrent_queue#59
bghgary merged 3 commits intomicrosoft:masterfrom
bghgary:feature/testing-hooks

Conversation

@bghgary
Copy link
Copy Markdown
Contributor

@bghgary bghgary commented Apr 2, 2026

Add a global before-wait callback to blocking_concurrent_queue, invoked while holding the queue mutex right before condition_variable::wait().

Motivation

This enables deterministic testing of lost-wakeup race conditions. By sleeping in the callback (while holding the mutex), a test can guarantee a worker thread is between the condition check and wait(). Code that coordinates through push() (which acquires the mutex) will block until the worker enters wait() and deliver the notification correctly. Code that uses bare notify_all() (without the mutex) will lose the signal.

Changes

  • blocking_concurrent_queue.h: global g_beforeWaitCallback (initialized to no-op), called before each wait()

All behind ARCANA_TESTING_HOOKS -- zero cost when not enabled. No other files changed.

Usage

Used by BabylonJS/JsRuntimeHost#146 to deterministically reproduce a deadlock in AppRuntime shutdown.

@bghgary bghgary force-pushed the feature/testing-hooks branch from 63bc00d to d5804f8 Compare April 2, 2026 16:31
@bghgary bghgary force-pushed the feature/testing-hooks branch 2 times, most recently from 0afc70e to fa054fd Compare April 2, 2026 18:17
@bghgary bghgary changed the title Add optional testing hooks for blocking_concurrent_queue Add optional testing hook for blocking_concurrent_queue Apr 2, 2026
@bghgary bghgary force-pushed the feature/testing-hooks branch from fa054fd to 8328898 Compare April 2, 2026 18:36
@bghgary bghgary marked this pull request as ready for review April 2, 2026 22:55
Copilot AI review requested due to automatic review settings April 2, 2026 22:55
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a test-only hook to arcana::blocking_concurrent_queue to allow deterministic reproduction of lost-wakeup races by running a callback immediately before condition_variable::wait() while the queue mutex is held.

Changes:

  • Introduces a global (test-only) beforeWaitCallback and set_before_wait_callback() setter behind ARCANA_TESTING_HOOKS.
  • Invokes the callback right before m_dataReady.wait(lock) in both internal_pop() and internal_drain().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Add a global before-wait callback invoked while holding the queue mutex,
right before condition_variable::wait(). This enables deterministic testing
of lost-wakeup race conditions in code that uses the queue.

Behind #ifdef ARCANA_TESTING_HOOKS -- zero cost when not enabled.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary bghgary force-pushed the feature/testing-hooks branch from 8328898 to cf25e34 Compare April 2, 2026 23:53
bghgary and others added 2 commits April 2, 2026 16:57
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary bghgary merged commit d5dd03c into microsoft:master Apr 2, 2026
15 checks passed
bghgary added a commit to bghgary/JsRuntimeHost that referenced this pull request Apr 2, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary added a commit to bghgary/JsRuntimeHost that referenced this pull request Apr 3, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary added a commit to BabylonJS/JsRuntimeHost that referenced this pull request Apr 7, 2026
Fix the WorkQueue destructor deadlock by using a mutex-coordinated
wake-up.

## The Bug

WorkQueue::~WorkQueue() cancelled from the main thread via cancel() +
notify_all(). The notify fired **without holding the queue mutex**, so
if the worker thread hadn't entered condition_variable::wait() yet, the
signal was lost and join() hung forever. See #146 for a deterministic
repro of the deadlock against the old code.

```
Main Thread                          Worker Thread
                                     internal_drain():
                                       lock(mutex)
                                       check: empty? -> yes
                                       check: cancelled? -> no
                                       unlock(mutex)  <- about to call wait()
                                            |
~WorkQueue():                          RACE WINDOW
  cancel()                            (no mutex held)
  notify_all() ---- LOST! --------->
                                            |
                                       lock(mutex)
                                       wait(lock) ---- blocks forever

  join() ----------------------------- blocks forever
       |
   DEADLOCK
```

## The Fix

Cancel immediately (so pending work is dropped promptly), then append a
no-op work item to wake the worker. The no-op goes through push() which
acquires the same mutex that wait() releases, so the notification cannot
be lost.

```
Main Thread                          Worker Thread
                                     internal_drain():
                                       lock(mutex)
                                       check: empty? -> yes
                                       wait(lock) ---- releases mutex
                                            |
~WorkQueue():
  cancel()
  Append(no-op):
    push():
      lock(mutex) <---- acquires it (worker is in wait)
      queue.push(no-op)
      unlock(mutex)
      notify_one() ------------------> wakes up!
                                            |
                                       drain -> execute no-op
                                       loop check: cancelled? -> yes
                                       exit loop
  join() <----------------------------- thread exits
       |
   SUCCESS
```

Also fixes member declaration order in AppRuntime.h so m_options
outlives m_workQueue during destruction.

## Regression Test

Includes a deterministic test using arcana testing hooks
(microsoft/arcana.cpp#59, merged) that sleeps while holding the queue
mutex before wait(). This guarantees the worker is in the vulnerable
window when destruction fires. The test **passes** with this fix and
**deadlocks** with the old code (#146).

## Follow-up

Merging WorkQueue into AppRuntime will be done in a separate PR to
eliminate split-lifetime issues.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants