
auth thread safety#454

Draft
mikasenghaas wants to merge 12 commits into main from mika/auth-thread-safety

Conversation

@mikasenghaas
Member

@mikasenghaas mikasenghaas commented Mar 20, 2026

Note

Medium Risk
Touches auth token caching and refresh behavior, adding new locking/coalescing paths that could deadlock or cause stale/failed token reuse if incorrect, but scope is limited to client-side cache management.

Overview
Auth cache concurrency was hardened. SandboxAuthCache is now thread-safe and async-safe via locks, and it coalesces concurrent POST /sandbox/{id}/auth refreshes so only one request is issued per sandbox while others wait.

Cache lifecycle and I/O were adjusted. The cache can now be lazily loaded (used by AsyncSandboxClient), expired entries are cleaned on load with a save-on-clean, async file operations are offloaded via asyncio.to_thread (mkdir/unlink/exists/makedirs) so they do not block the event loop, and AsyncSandboxClient adds clear_auth_cache_async() alongside the existing sync method.

Written by Cursor Bugbot for commit dcf61b7. This will update automatically on new commits.

kcoopermiller and others added 12 commits March 18, 2026 16:53
_get_async_lock() had a race where two coroutines (or threads running
per-thread event loops via ThreadedAsyncSandboxClient) could both see
_async_lock as None and create separate Lock instances. Use a
threading.Lock with double-checked locking to ensure exactly one
asyncio.Lock is created.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When many sandboxes are created in a burst, all cache-miss concurrently
and fire redundant auth POST requests. Add per-sandbox in-flight
tracking so only one request is issued per sandbox_id while others wait
for the result. Applies to both sync (threading.Event) and async
(asyncio.Future) paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
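
The sync coalescing described here can be sketched as follows. This is a happy-path illustration with hypothetical names (fetch stands in for the POST /sandbox/{id}/auth call); failure handling is simplified and is discussed further down in the review:

```python
import threading
from typing import Any, Callable, Dict

class SyncAuthCoalescer:
    """Per-key in-flight coalescing for the sync path (happy-path sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._cache: Dict[str, Any] = {}
        self._inflight: Dict[str, threading.Event] = {}

    def get_or_refresh(self, key: str, fetch: Callable[[str], Any]) -> Any:
        with self._lock:
            if key in self._cache:
                return self._cache[key]
            event = self._inflight.get(key)
            if event is None:
                # No request in flight: become the leader for this key
                event = threading.Event()
                self._inflight[key] = event
                leader = True
            else:
                leader = False
        if not leader:
            event.wait()  # leader publishes to the cache before setting this
            with self._lock:
                return self._cache[key]
        try:
            result = fetch(key)  # exactly one network request per key
            with self._lock:
                self._cache[key] = result
            return result
        finally:
            with self._lock:
                self._inflight.pop(key, None)
            event.set()
```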
_save_cache() does blocking file I/O (json.dump to disk). When called
inside self._lock, it blocks all other threads waiting for the lock
while doing disk writes. Move the save outside the critical section —
the in-memory cache is already updated atomically under the lock, so
the file write can happen without holding it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
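
A sketch of this pattern (names are illustrative, not the PR's code): the in-memory update stays atomic under the lock, and a snapshot taken inside the critical section lets the slow disk write happen outside it:

```python
import json
import threading
from typing import Any, Dict

class CacheSaverSketch:
    """Mutate state under the lock; do the blocking json.dump outside it."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._lock = threading.Lock()
        self._data: Dict[str, Any] = {}

    def update(self, key: str, value: Any) -> None:
        with self._lock:
            self._data[key] = value
            # Snapshot while consistent, so the write needs no lock
            snapshot = dict(self._data)
        with open(self._path, "w") as f:
            json.dump(snapshot, f)  # slow disk I/O no longer blocks other threads
```

Concurrent updates may write snapshots out of order; for a best-effort cache file, last-writer-wins is usually acceptable.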
- _save_cache_async(): mkdir was sync, now offloaded to thread
- AsyncSandboxClient.download_file(): os.makedirs was sync, now offloaded
- AsyncSandboxClient.upload_file(): os.path.exists was sync, now offloaded

These sync syscalls block the event loop. Under high concurrency they
contribute to event loop lag accumulation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
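
The offloading pattern looks roughly like this (the helper name is hypothetical): every blocking syscall goes through asyncio.to_thread so the event loop keeps servicing other tasks.

```python
import asyncio
import json
import os

async def save_cache_async(path: str, data: dict) -> None:
    # Directory creation is a blocking syscall: run it in a worker thread
    parent = os.path.dirname(path)
    if parent:
        await asyncio.to_thread(os.makedirs, parent, exist_ok=True)

    def _write() -> None:
        with open(path, "w") as f:
            json.dump(data, f)

    # The file write is blocking too: offload the whole thing
    await asyncio.to_thread(_write)
```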

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dcf61b7c1d


Comment on lines +412 to +416
```python
except BaseException as e:
    async with self._get_async_lock():
        fut = self._inflight_async.pop(sandbox_id, None)
        if fut is not None and not fut.done():
            fut.set_exception(e)
```


P1 Badge Stop one cancelled auth refresh from aborting peer requests

In SandboxAuthCache.get_or_refresh_async, concurrent callers for the same sandbox share the leader's future. If that leader task is cancelled mid-request (for example because one caller is wrapped in asyncio.wait_for or a client disconnects), this except BaseException path stores the CancelledError on the shared future, so every other waiter is cancelled too even though their own operations are still valid. Before this coalescing change those peer commands/downloads would have failed or succeeded independently.
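
One possible fix, sketched with illustrative names (not this PR's code): distinguish leader cancellation from ordinary failure, cancel the shared future so waiters loop back and elect a new leader, and only store real exceptions on the future.

```python
import asyncio
from typing import Any, Callable, Dict

class AsyncCoalescer:
    """Coalesce per-key fetches; a cancelled leader does not cancel waiters."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._inflight: Dict[str, asyncio.Future] = {}

    async def get(self, key: str, fetch: Callable) -> Any:
        while True:
            async with self._lock:
                fut = self._inflight.get(key)
                if fut is None:
                    fut = asyncio.get_running_loop().create_future()
                    self._inflight[key] = fut
                    leader = True
                else:
                    leader = False
            if not leader:
                try:
                    # shield so cancelling the leader does not cancel us
                    return await asyncio.shield(fut)
                except asyncio.CancelledError:
                    if fut.cancelled():
                        continue  # leader was cancelled: retry as new leader
                    raise  # we ourselves were cancelled
            try:
                result = await fetch(key)
            except asyncio.CancelledError:
                # Leader cancelled: cancel the shared future so waiters retry,
                # instead of handing them our CancelledError as their failure
                async with self._lock:
                    self._inflight.pop(key, None)
                fut.cancel()
                raise
            except Exception as e:
                async with self._lock:
                    self._inflight.pop(key, None)
                fut.set_exception(e)
                raise
            else:
                async with self._lock:
                    self._inflight.pop(key, None)
                fut.set_result(result)
                return result
```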


Comment on lines +320 to +328
```python
if event is not None:
    # Wait for the in-flight request to complete
    event.wait()
    with self._lock:
        cached_auth = self._check_cached_auth(sandbox_id)
        if cached_auth:
            return cached_auth
    # Fallback: if the in-flight request failed, fetch ourselves
    return self.get_or_refresh(sandbox_id)
```

P2 Badge Avoid recursive retries after a shared auth failure

When many threads call SandboxAuthCache.get_or_refresh for the same sandbox while /sandbox/{id}/auth is failing, each waiter wakes up here and re-enters get_or_refresh() recursively. Only one waiter becomes the new leader each round, so the others can accumulate one extra stack frame per failure; under a large burst this can end in RecursionError instead of surfacing the original auth error. An iterative retry path would preserve the coalescing without making outage handling stack-depth dependent.
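
An iterative waiter along these lines (illustrative names, not the repo's code) retries in a loop, so a sustained auth outage surfaces the fetch error to each caller at constant stack depth:

```python
import threading
from typing import Any, Callable, Dict

class IterativeWaiter:
    """Waiters loop instead of recursing; at most one leader per round."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._cache: Dict[str, Any] = {}
        self._inflight: Dict[str, threading.Event] = {}

    def get_or_refresh(self, key: str, fetch: Callable[[str], Any]) -> Any:
        while True:  # loop, never recurse: stack depth stays constant
            with self._lock:
                if key in self._cache:
                    return self._cache[key]
                event = self._inflight.get(key)
                if event is None:
                    event = threading.Event()
                    self._inflight[key] = event
                    break  # we are this round's leader
            event.wait()  # a leader is in flight; re-check the cache afterwards
        try:
            result = fetch(key)
            with self._lock:
                self._cache[key] = result
            return result
        finally:
            # Pop before set, so woken waiters never re-grab a stale event
            with self._lock:
                self._inflight.pop(key, None)
            event.set()
```

If fetch raises, the leader propagates the original error and one waiter becomes the next round's leader, so every caller eventually sees the real auth failure rather than a RecursionError.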


Contributor


yeah, this recursion is sus.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


```python
with self._lock:
    if sandbox_id in self._auth_cache:
        self._auth_cache[sandbox_id]["is_gpu"] = is_gpu
    self._save_cache()
```

Sync is_gpu saves cache unconditionally unlike async

Low Severity

In the sync is_gpu, _save_cache() was moved outside the if sandbox_id in self._auth_cache block, so it now runs even when no update was made. The original code (and the async counterpart is_gpu_async) only calls save inside the if block. This is an unintentional behavioral change that performs unnecessary file I/O and is inconsistent with the async version.
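
The suggested fix, as a small self-contained sketch (a saves counter stands in for the real file write; names follow the snippet above but the class is hypothetical):

```python
import threading
from typing import Any, Dict

class AuthCacheGpuSketch:
    """Persist only when an entry actually changed, matching is_gpu_async."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._auth_cache: Dict[str, Dict[str, Any]] = {}
        self.saves = 0  # stand-in for file I/O, to make the behavior visible

    def _save_cache(self) -> None:
        self.saves += 1

    def set_is_gpu(self, sandbox_id: str, is_gpu: bool) -> None:
        updated = False
        with self._lock:
            if sandbox_id in self._auth_cache:
                self._auth_cache[sandbox_id]["is_gpu"] = is_gpu
                updated = True
        if updated:
            # Save only on a real update, and outside the lock
            self._save_cache()
```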

Additional Locations (1)

```python
self._async_lock: Optional[asyncio.Lock] = None
# Per-sandbox in-flight auth futures to coalesce concurrent requests
self._inflight_sync: Dict[str, threading.Event] = {}
self._inflight_sync_results: Dict[str, Dict[str, Any]] = {}
```

Unused _inflight_sync_results dict is dead code

Low Severity

_inflight_sync_results is initialized in __init__ but never read or written to anywhere in the codebase. It appears to be a leftover from an earlier design of the sync coalescing logic that was replaced by the event-based approach. This unused attribute adds confusion about the concurrency design.

Contributor

@DamianB-BitFlipper DamianB-BitFlipper left a comment


I get the gist of this PR, but it is trying to shoehorn new logic into the existing foundation, which is not super compatible.

A lot of the asyncio.to_thread calls and the thread lock + async lock pairing are due to def _load_cache(self) -> tuple[Dict[str, Any], bool]: and other methods being "sync".

By my inspection, nothing prevents them from being async (or having async variants). Then one would not need to do the whole to_thread business.
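
An async-native load might take this shape. File reads are still inherently blocking, so one to_thread remains at the raw-I/O boundary, but the parsing and expiry cleanup stay in the coroutine with no sync/async lock juggling. Everything here — the function name, the expires_at field, the tuple return — is an illustrative guess at the shape, not the repo's API:

```python
import asyncio
import json
import os
import time
from typing import Any, Dict, Tuple

async def load_cache_async(path: str) -> Tuple[Dict[str, Any], bool]:
    def _read() -> str:
        # The only blocking piece, isolated in one helper
        if not os.path.exists(path):
            return "{}"
        with open(path) as f:
            return f.read()

    raw = await asyncio.to_thread(_read)
    try:
        cache: Dict[str, Any] = json.loads(raw)
    except json.JSONDecodeError:
        return {}, False
    # Drop expired entries; report whether a save-on-clean is needed
    now = time.time()
    cleaned = {k: v for k, v in cache.items() if v.get("expires_at", 0) > now}
    return cleaned, len(cleaned) != len(cache)
```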

@DamianB-BitFlipper
Contributor

I would also not define lazy flag. Just have this be "lazily loaded" always. I do not see where this is worse than eager.

@mikasenghaas
Member Author

By my inspection, nothing prevents them from being async (or having async variants). Then one would not need to do the whole to_thread business.

would love this, let's do it

@mikasenghaas mikasenghaas marked this pull request as draft March 20, 2026 18:23

3 participants