Skip to content

Fix/publish races#1572

Open
neolynx wants to merge 3 commits into
masterfrom
fix/publish-races
Open

Fix/publish races#1572
neolynx wants to merge 3 commits into
masterfrom
fix/publish-races

Conversation

@neolynx
Copy link
Copy Markdown
Member

@neolynx neolynx commented May 23, 2026

Fixes #

Requirements

All new code should be covered with tests, documentation should be updated. CI should pass.

Also, to speed up things, if you could kindly "Allow edits and access to secrets by maintainers" in the
PR settings, as this allows us to rebase the PR on master, fix conflicts, run coverage and help with
implementing code and tests.

Description of the Change

Checklist

  • allow Maintainers to edit PR (rebase, run coverage, help with tests, ...)
  • unit-test added (if change is algorithm)
  • functional test added/updated (if change is functional)
  • man page updated (if applicable)
  • bash completion updated (if applicable)
  • documentation updated
  • author name in AUTHORS

neolynx added 3 commits May 23, 2026 13:54
apiPublishRepoOrSnapshot appended published.Key() to resources inside
the task closure, after maybeRunTaskInBackground had already been called.
The task's locked-resource set is fixed at submission time, so that append
had no effect — the published repo key was never registered as a resource.

Two concurrent POST /api/publish/{prefix} requests for the same
prefix/distribution therefore did not conflict in the task queue: both
ran in parallel, each loaded an empty PublishedRepoCollection from the DB,
both passed CheckDuplicate, and the second Add silently overwrote the first.

Fix: compute the published repo key ("U{storagePrefix}>>{distribution}")
from the already-known storage/prefix/distribution values and append it to
resources before calling maybeRunTaskInBackground, so concurrent creates
for the same destination are serialised by the task queue.  The now-dead
append inside the closure is removed.
…oints

Affected endpoints: apiPublishAddSource, apiPublishSetSources,
apiPublishUpdateSource, apiPublishRemoveSource, apiPublishDropChanges.

All five handlers shared the same flawed pattern: they loaded the
published repo from the DB and mutated it (ObtainRevision / DropRevision)
outside the task closure, before the task lock was acquired.  Each task
closure then just wrote back the already-mutated, pre-lock object.

Because the task queue serialises tasks that share a resource key, two
concurrent requests appear safe — but each task closure holds a stale
copy of the object captured before the lock was taken:

  Request A loads published: revision = {}
  Request B loads published: revision = {}   <- same DB state
  A mutates: revision = {main: snap1}
  B mutates: revision = {contrib: snap2}
  Task A runs: saves {main: snap1}           OK
  Task B runs: saves {contrib: snap2}        <- clobbers A's change

Fix: perform only a shallow ByStoragePrefixDistribution outside the task
(for the early 404 response, resource key, and task name).  Inside the
task closure a dedicated taskCollectionFactory is created, the published
repo is re-read fresh from the DB (after the lock is acquired), and
LoadComplete + all mutations + Update are executed against that
authoritative copy.
Affected endpoints: apiPublishUpdateSwitch (PUT), apiPublishUpdate (POST).

Both handlers loaded the published repo and mutated scalar fields
(Label, Origin, SkipContents, SkipBz2, AcquireByHash, SignedBy,
MultiDist, Version) outside the task closure, before the lock was
acquired.  Inside the task, LoadComplete only refreshed sourceItems —
it did not reload scalar fields or the Revision.  Two concurrent
requests therefore each operated on a stale base:

  Request A loads published (Label="old"), sets Label="A"
  Request B loads published (Label="old"), sets Label="B"
  Task A runs: Update() + Publish() + collection.Update() -> saves Label="A"
  Task B runs: Update() on B's stale copy -> saves Label="B",
               silently discarding A's Label change and potentially
               reconciling a Revision built against the pre-A state.

Fix: remove all field mutations and the LoadComplete call from the HTTP
handler.  Inside the task, a fresh taskCollectionFactory is created, the
published repo is re-read via ByStoragePrefixDistribution + LoadComplete
(obtaining the current DB state after the lock is held), and then all
field mutations are applied before Update / Publish / collection.Update.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 23, 2026

Codecov Report

❌ Patch coverage is 73.58491% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.23%. Comparing base (8dc61cf) to head (b7969c7).

Files with missing lines Patch % Lines
api/publish.go 73.58% 21 Missing and 21 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1572      +/-   ##
==========================================
+ Coverage   77.22%   77.23%   +0.01%     
==========================================
  Files         161      161              
  Lines       15085    15114      +29     
==========================================
+ Hits        11649    11673      +24     
+ Misses       2292     2290       -2     
- Partials     1144     1151       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread api/publish.go
// Pre-register the published repo key in resources so that concurrent
// POST requests for the same prefix/distribution are serialized by the
// task queue rather than racing on CheckDuplicate + Add.
if b.Distribution != "" {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Comment thread api/publish.go
taskCollectionFactory := context.NewCollectionFactory()
taskCollection := taskCollectionFactory.PublishedRepoCollection()

published, err := taskCollection.ByStoragePrefixDistribution(storage, prefix, distribution)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

published already there ?

Comment thread api/publish.go

// Field mutations and fresh DB load are deferred to inside the task so
// they always operate on a consistent state after the lock is held.
resources := []string{string(published.Key())}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lock snapshots?

Comment thread api/publish.go
taskName := fmt.Sprintf("Update published %s repository %s/%s", published.SourceKind, published.StoragePrefix(), published.Distribution)
maybeRunTaskInBackground(c, taskName, resources, func(out aptly.Progress, _ *task.Detail) (*task.ProcessReturnValue, error) {
err = collection.LoadComplete(published, collectionFactory)
taskCollectionFactory := context.NewCollectionFactory()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not existing factory?

Comment thread api/publish.go
maybeRunTaskInBackground(c, taskName, resources, func(out aptly.Progress, _ *task.Detail) (*task.ProcessReturnValue, error) {
err = collection.LoadComplete(published, collectionFactory)
taskCollectionFactory := context.NewCollectionFactory()
taskCollection := taskCollectionFactory.PublishedRepoCollection()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not existing ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant