Skip to content

Memberlist: prevent accidental cross-cluster joins with cluster labels#7385

Open
sandy2008 wants to merge 3 commits intocortexproject:masterfrom
sandy2008:sandy/memberlist-cluster-label
Open

Memberlist: prevent accidental cross-cluster joins with cluster labels#7385
sandy2008 wants to merge 3 commits intocortexproject:masterfrom
sandy2008:sandy/memberlist-cluster-label

Conversation

@sandy2008
Copy link
Copy Markdown
Contributor

@sandy2008 sandy2008 commented Mar 30, 2026

Summary

This fixes a Cortex gap in the memberlist integration: the vendored memberlist library already supports cluster labels, but Cortex did not expose that functionality in its own config. As a result, separate deployments that could reach the same gossip seed addresses could accidentally join the same memberlist cluster.

This change:

  • exposes memberlist.cluster-label and memberlist.cluster-label-verification-disabled in Cortex and wires them into memberlist
  • adds unit tests for same-label clustering, mixed-label isolation, and migration mode with inbound verification disabled
  • adds single-binary integration coverage for cross-cluster isolation and two-phase rollout migration
  • documents rollout guidance and updates example configs

Note: The doc change in docs/guides/migration-kv-store-to-memberlist.md also fixes a pre-existing typo (abort_if_join_fails to abort_if_cluster_join_fails), which is unrelated to the cluster label feature itself.

Security Note

Cluster labels are transmitted in plaintext over the gossip protocol. They are not a security boundary, only an operational safety net. An attacker on the network could still join by matching the label. This is documented in the migration guide.

Why

Without this, Cortex operators have no built-in way to prevent accidental cross-cluster memberlist merges caused by shared seed addresses or cross-namespace discovery. This is the same class of problem described in Mimir issue #6187.

Verification

  • go test ./pkg/ring/kv/memberlist -run "Test(BuildMemberlistConfigClusterLabelOptions|MultipleClients)"
  • go test -v -tags=integration,integration_memberlist -run "TestSingleBinaryWithMemberlistClusterLabel(Isolation|MigrationPhases)$" -count=1 ./integration/...

Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@sandy2008 sandy2008 changed the title Memberlist: add cluster label support Memberlist: prevent accidental cross-cluster joins with cluster labels Mar 30, 2026
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
Copy link
Copy Markdown
Member

@CharlieTLe CharlieTLe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall Review

This is a solid, well-tested feature PR. The core implementation is minimal and correct — it properly wires Cortex config to the already-supported memberlist library fields. The test coverage (unit + integration) and documentation updates are thorough.

Main actionable items:

  1. Fix CHANGELOG category[BUGFIX][FEATURE] or [ENHANCEMENT] (see inline comment)
  2. Reconsider the log line change — the original per-RandomizeNodeName log was removed and replaced with an unconditional one; keep original behavior or document the change (see inline comment)
  3. Clean up the polling loop in the test helper (see inline comment)
  4. Minor nits on struct field ordering, test matrix coverage, and migration guide doc change noted inline

Security note: Cluster labels are transmitted in plaintext over the gossip protocol — they are not a security boundary, only an operational safety net. Consider adding a note in the docs that this is not a security mechanism; an attacker on the network could still join by matching the label.

…, clean up polling loop, add test cases, and strengthen integration assertion

Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@sandy2008
Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review @CharlieTLe! All feedback addressed in the latest push:

  1. CHANGELOG — Changed [BUGFIX][FEATURE]
  2. Log line — Restored the original per-RandomizeNodeName log and added a separate conditional log for cluster label
  3. Polling loop — Replaced busy-wait with time.Ticker + time.After
  4. getClientErr() — Acknowledged as intentional test diagnostic behavior
  5. Test matrix — Added "configured label with verification enabled" case
  6. Integration negative assertion — Added a 5s observation window + re-assertion to verify cross-cluster isolation stays stable
  7. Doc fix callout — Noted the abort_if_join_fails rename in the PR description
  8. Security note — Added a Security Note section to the PR description clarifying cluster labels are not a security boundary

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants