
auth thread safety#454

Draft
mikasenghaas wants to merge 12 commits into main from mika/auth-thread-safety

Conversation

@mikasenghaas
Member

@mikasenghaas mikasenghaas commented Mar 20, 2026

Note

Medium Risk
Touches auth token caching and refresh behavior, adding new locking/coalescing paths that could deadlock or cause stale/failed token reuse if incorrect, but scope is limited to client-side cache management.

Overview
Auth cache concurrency was hardened. SandboxAuthCache is now thread-safe and async-safe via locks, and it coalesces concurrent POST /sandbox/{id}/auth refreshes so only one request is issued per sandbox while others wait.

Cache lifecycle and I/O were adjusted. The cache can now be lazily loaded (used by AsyncSandboxClient), expired entries are cleaned on load with a save-on-clean, async file operations are offloaded via asyncio.to_thread (mkdir/unlink/exists/makedirs) so they do not block the event loop, and AsyncSandboxClient adds clear_auth_cache_async() alongside the existing sync method.

Written by Cursor Bugbot for commit dcf61b7. This will update automatically on new commits.

kcoopermiller and others added 12 commits March 18, 2026 16:53
_get_async_lock() had a race where two coroutines (or threads running
per-thread event loops via ThreadedAsyncSandboxClient) could both see
_async_lock as None and create separate Lock instances. Use a
threading.Lock with double-checked locking to ensure exactly one
asyncio.Lock is created.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When many sandboxes are created in a burst, all cache-miss concurrently
and fire redundant auth POST requests. Add per-sandbox in-flight
tracking so only one request is issued per sandbox_id while others wait
for the result. Applies to both sync (threading.Event) and async
(asyncio.Future) paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
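
The sync coalescing described here can be sketched as follows. This is a happy-path illustration with hypothetical names (fetch stands in for the POST /sandbox/{id}/auth call); failure handling is simplified and is discussed further down in the review:

```python
import threading
from typing import Any, Callable, Dict

class SyncAuthCoalescer:
    """Per-key in-flight coalescing for the sync path (happy-path sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._cache: Dict[str, Any] = {}
        self._inflight: Dict[str, threading.Event] = {}

    def get_or_refresh(self, key: str, fetch: Callable[[str], Any]) -> Any:
        with self._lock:
            if key in self._cache:
                return self._cache[key]
            event = self._inflight.get(key)
            if event is None:
                # No request in flight: become the leader for this key
                event = threading.Event()
                self._inflight[key] = event
                leader = True
            else:
                leader = False
        if not leader:
            event.wait()  # leader publishes to the cache before setting this
            with self._lock:
                return self._cache[key]
        try:
            result = fetch(key)  # exactly one network request per key
            with self._lock:
                self._cache[key] = result
            return result
        finally:
            with self._lock:
                self._inflight.pop(key, None)
            event.set()
```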
_save_cache() does blocking file I/O (json.dump to disk). When called
inside self._lock, it blocks all other threads waiting for the lock
while doing disk writes. Move the save outside the critical section —
the in-memory cache is already updated atomically under the lock, so
the file write can happen without holding it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
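
A sketch of this pattern (names are illustrative, not the PR's code): the in-memory update stays atomic under the lock, and a snapshot taken inside the critical section lets the slow disk write happen outside it:

```python
import json
import threading
from typing import Any, Dict

class CacheSaverSketch:
    """Mutate state under the lock; do the blocking json.dump outside it."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._lock = threading.Lock()
        self._data: Dict[str, Any] = {}

    def update(self, key: str, value: Any) -> None:
        with self._lock:
            self._data[key] = value
            # Snapshot while consistent, so the write needs no lock
            snapshot = dict(self._data)
        with open(self._path, "w") as f:
            json.dump(snapshot, f)  # slow disk I/O no longer blocks other threads
```

Concurrent updates may write snapshots out of order; for a best-effort cache file, last-writer-wins is usually acceptable.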
- _save_cache_async(): mkdir was sync, now offloaded to thread
- AsyncSandboxClient.download_file(): os.makedirs was sync, now offloaded
- AsyncSandboxClient.upload_file(): os.path.exists was sync, now offloaded

These sync syscalls block the event loop. Under high concurrency they
contribute to event loop lag accumulation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
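
The offloading pattern looks roughly like this (the helper name is hypothetical): every blocking syscall goes through asyncio.to_thread so the event loop keeps servicing other tasks.

```python
import asyncio
import json
import os

async def save_cache_async(path: str, data: dict) -> None:
    # Directory creation is a blocking syscall: run it in a worker thread
    parent = os.path.dirname(path)
    if parent:
        await asyncio.to_thread(os.makedirs, parent, exist_ok=True)

    def _write() -> None:
        with open(path, "w") as f:
            json.dump(data, f)

    # The file write is blocking too: offload the whole thing
    await asyncio.to_thread(_write)
```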

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dcf61b7c1d


Comment on lines +412 to +416
```python
except BaseException as e:
    async with self._get_async_lock():
        fut = self._inflight_async.pop(sandbox_id, None)
        if fut is not None and not fut.done():
            fut.set_exception(e)
```


P1 Badge Stop one cancelled auth refresh from aborting peer requests

In SandboxAuthCache.get_or_refresh_async, concurrent callers for the same sandbox share the leader's future. If that leader task is cancelled mid-request (for example because one caller is wrapped in asyncio.wait_for or a client disconnects), this except BaseException path stores the CancelledError on the shared future, so every other waiter is cancelled too even though their own operations are still valid. Before this coalescing change those peer commands/downloads would have failed or succeeded independently.
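
One possible fix, sketched with illustrative names (not this PR's code): distinguish leader cancellation from ordinary failure, cancel the shared future so waiters loop back and elect a new leader, and only store real exceptions on the future.

```python
import asyncio
from typing import Any, Callable, Dict

class AsyncCoalescer:
    """Coalesce per-key fetches; a cancelled leader does not cancel waiters."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._inflight: Dict[str, asyncio.Future] = {}

    async def get(self, key: str, fetch: Callable) -> Any:
        while True:
            async with self._lock:
                fut = self._inflight.get(key)
                if fut is None:
                    fut = asyncio.get_running_loop().create_future()
                    self._inflight[key] = fut
                    leader = True
                else:
                    leader = False
            if not leader:
                try:
                    # shield so cancelling the leader does not cancel us
                    return await asyncio.shield(fut)
                except asyncio.CancelledError:
                    if fut.cancelled():
                        continue  # leader was cancelled: retry as new leader
                    raise  # we ourselves were cancelled
            try:
                result = await fetch(key)
            except asyncio.CancelledError:
                # Leader cancelled: cancel the shared future so waiters retry,
                # instead of handing them our CancelledError as their failure
                async with self._lock:
                    self._inflight.pop(key, None)
                fut.cancel()
                raise
            except Exception as e:
                async with self._lock:
                    self._inflight.pop(key, None)
                fut.set_exception(e)
                raise
            else:
                async with self._lock:
                    self._inflight.pop(key, None)
                fut.set_result(result)
                return result
```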


Comment on lines +320 to +328
```python
if event is not None:
    # Wait for the in-flight request to complete
    event.wait()
    with self._lock:
        cached_auth = self._check_cached_auth(sandbox_id)
        if cached_auth:
            return cached_auth
    # Fallback: if the in-flight request failed, fetch ourselves
    return self.get_or_refresh(sandbox_id)
```

P2 Badge Avoid recursive retries after a shared auth failure

When many threads call SandboxAuthCache.get_or_refresh for the same sandbox while /sandbox/{id}/auth is failing, each waiter wakes up here and re-enters get_or_refresh() recursively. Only one waiter becomes the new leader each round, so the others can accumulate one extra stack frame per failure; under a large burst this can end in RecursionError instead of surfacing the original auth error. An iterative retry path would preserve the coalescing without making outage handling stack-depth dependent.
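
An iterative waiter along these lines (illustrative names, not the repo's code) retries in a loop, so a sustained auth outage surfaces the fetch error to each caller at constant stack depth:

```python
import threading
from typing import Any, Callable, Dict

class IterativeWaiter:
    """Waiters loop instead of recursing; at most one leader per round."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._cache: Dict[str, Any] = {}
        self._inflight: Dict[str, threading.Event] = {}

    def get_or_refresh(self, key: str, fetch: Callable[[str], Any]) -> Any:
        while True:  # loop, never recurse: stack depth stays constant
            with self._lock:
                if key in self._cache:
                    return self._cache[key]
                event = self._inflight.get(key)
                if event is None:
                    event = threading.Event()
                    self._inflight[key] = event
                    break  # we are this round's leader
            event.wait()  # a leader is in flight; re-check the cache afterwards
        try:
            result = fetch(key)
            with self._lock:
                self._cache[key] = result
            return result
        finally:
            # Pop before set, so woken waiters never re-grab a stale event
            with self._lock:
                self._inflight.pop(key, None)
            event.set()
```

If fetch raises, the leader propagates the original error and one waiter becomes the next round's leader, so every caller eventually sees the real auth failure rather than a RecursionError.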


Contributor


yeah, this recursion is sus.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


```python
with self._lock:
    if sandbox_id in self._auth_cache:
        self._auth_cache[sandbox_id]["is_gpu"] = is_gpu
    self._save_cache()
```

Sync is_gpu saves cache unconditionally unlike async

Low Severity

In the sync is_gpu, _save_cache() was moved outside the if sandbox_id in self._auth_cache block, so it now runs even when no update was made. The original code (and the async counterpart is_gpu_async) only calls save inside the if block. This is an unintentional behavioral change that performs unnecessary file I/O and is inconsistent with the async version.
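
The suggested fix, as a small self-contained sketch (a saves counter stands in for the real file write; names follow the snippet above but the class is hypothetical):

```python
import threading
from typing import Any, Dict

class AuthCacheGpuSketch:
    """Persist only when an entry actually changed, matching is_gpu_async."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._auth_cache: Dict[str, Dict[str, Any]] = {}
        self.saves = 0  # stand-in for file I/O, to make the behavior visible

    def _save_cache(self) -> None:
        self.saves += 1

    def set_is_gpu(self, sandbox_id: str, is_gpu: bool) -> None:
        updated = False
        with self._lock:
            if sandbox_id in self._auth_cache:
                self._auth_cache[sandbox_id]["is_gpu"] = is_gpu
                updated = True
        if updated:
            # Save only on a real update, and outside the lock
            self._save_cache()
```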

Additional Locations (1)

```python
self._async_lock: Optional[asyncio.Lock] = None
# Per-sandbox in-flight auth futures to coalesce concurrent requests
self._inflight_sync: Dict[str, threading.Event] = {}
self._inflight_sync_results: Dict[str, Dict[str, Any]] = {}
```

Unused _inflight_sync_results dict is dead code

Low Severity

_inflight_sync_results is initialized in __init__ but never read or written to anywhere in the codebase. It appears to be a leftover from an earlier design of the sync coalescing logic that was replaced by the event-based approach. This unused attribute adds confusion about the concurrency design.

Contributor

@DamianB-BitFlipper DamianB-BitFlipper left a comment


I get the gist of this PR, but it is trying to shoehorn new logic into the existing foundation, which is not super compatible.

A lot of the asyncio.to_thread calls and the thread lock + async lock pairing are due to def _load_cache(self) -> tuple[Dict[str, Any], bool]: and other methods being "sync".

By my inspection, nothing prevents them from being async (or having async variants). Then one would not need to do the whole to_thread business.
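
An async-native load might take this shape. File reads are still inherently blocking, so one to_thread remains at the raw-I/O boundary, but the parsing and expiry cleanup stay in the coroutine with no sync/async lock juggling. Everything here — the function name, the expires_at field, the tuple return — is an illustrative guess at the shape, not the repo's API:

```python
import asyncio
import json
import os
import time
from typing import Any, Dict, Tuple

async def load_cache_async(path: str) -> Tuple[Dict[str, Any], bool]:
    def _read() -> str:
        # The only blocking piece, isolated in one helper
        if not os.path.exists(path):
            return "{}"
        with open(path) as f:
            return f.read()

    raw = await asyncio.to_thread(_read)
    try:
        cache: Dict[str, Any] = json.loads(raw)
    except json.JSONDecodeError:
        return {}, False
    # Drop expired entries; report whether a save-on-clean is needed
    now = time.time()
    cleaned = {k: v for k, v in cache.items() if v.get("expires_at", 0) > now}
    return cleaned, len(cleaned) != len(cache)
```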

@DamianB-BitFlipper
Contributor

I would also not define lazy flag. Just have this be "lazily loaded" always. I do not see where this is worse than eager.

@mikasenghaas
Member Author

By my inspection, nothing prevents them from being async (or having async variants). Then one would not need to do the whole to_thread business.

would love this, let's do it

@mikasenghaas mikasenghaas marked this pull request as draft March 20, 2026 18:23

3 participants