
Bug report draft: SHC ManualDetention + Kubernetes readiness gating can deadlock operator-driven rollouts (and can strand users with sticky ingress) #1676

@dpericaxon


Describe the request

When running Search Head Cluster (SHC) on Kubernetes, the Splunk Operator’s rollout/recycle logic can become stuck if an SHC member enters ManualDetention and Kubernetes readiness is used as the primary gate to proceed.

In addition, when ingress/session stickiness is enabled (common for Splunk Web), users can appear “stuck” on a specific SH member during rollouts/detention events, even though the cluster is attempting to drain traffic from that member.

We need a robust operator behavior that:

  • avoids deadlocking upgrades/rollouts when an SH member is detained, and
  • avoids or minimizes user impact when a member is out of service (detained/unhealthy) in environments that use sticky sessions.

Expected behavior

  • During an operator-driven SHC recycle/upgrade:

    • The operator detains a member, waits for safe conditions, updates/restarts as needed, then releases detention.
    • The rollout should always converge without requiring manual intervention.
  • During detention/restart events:

    • Users should not be pinned indefinitely to an SH member that is out of service.
    • If sticky sessions are used, there should be a failover path when a pinned backend becomes unhealthy/out of service.

Splunk setup on K8S

  • Splunk Enterprise 10.0.3 deployed with Splunk Operator 3.0.0.
  • Includes:
    • SearchHeadCluster (SHC) with multiple members
    • (Optionally) Standalone instances in the same cluster (not required to reproduce this issue)
  • Splunk Web traffic routed through Kubernetes Ingress (NGINX Ingress Controller) with cookie-based session affinity enabled (typical for Splunk Web session behavior).

Reproduction/Testing steps

A) Deadlock / “circular loop” between detention and readiness

  1. Start a normal SHC with N members managed by Splunk Operator.
  2. Trigger an operator-driven rolling operation (e.g., image upgrade / recycle path).
  3. Ensure an SH member becomes detained as part of the recycle process (status=ManualDetention at the Splunk SHC member layer).
  4. Use a readiness probe that fails while a member is in ManualDetention, so that:
    • ManualDetention → readiness probe fails → Kubernetes marks the pod Ready=false
  5. Observe:
    • Kubernetes removes the pod from Service endpoints (expected).
    • The Splunk Operator’s SHC rollout can stall waiting for ReadyReplicas / pod readiness gates, and never reaches the step that releases detention.

The loop in one line

Operator detains a member → probe marks it NotReady → operator waits for all pods Ready → operator never reaches the step that undetains → member stays detained → probe keeps it NotReady → rollout is stuck.

Why this happens (mechanically)

During an operator-driven SHC upgrade/recycle, the operator:

  • puts a member into detention (Splunk-side “out of service”)
  • later, when safe, releases detention (i.e., clears manual detention)

But the operator’s control loop gates progress on Kubernetes readiness (e.g., StatefulSet ReadyReplicas == replicas or equivalent “cluster is ready” conditions).

If the readiness probe is defined such that detention implies NotReady, Kubernetes will keep that pod Ready=false while detained. If the operator requires all pods Ready before proceeding to the release/undetain step, the process can deadlock.
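
The deadlock can be illustrated with a toy model of the control loop. This is only a sketch of the gating described above, not the operator's actual code: `rollout_converges`, its parameters, and the tick-based loop are all hypothetical, and the real orchestration has more steps.

```python
def rollout_converges(n_members: int, detention_fails_probe: bool,
                      max_ticks: int = 50) -> bool:
    """Toy reconcile loop (hypothetical): recycle members one at a time.

    Each tick the operator either detains the next member or, if every
    pod reports Ready, finishes the current member and releases detention.
    """
    detained = [False] * n_members
    upgraded = [False] * n_members
    current = None  # index of the member being recycled, if any

    for _ in range(max_ticks):
        if all(upgraded):
            return True
        if current is None:
            # Pick the next member and put it into detention.
            current = upgraded.index(False)
            detained[current] = True
            continue
        # Readiness as the probe would report it for each pod.
        ready = [not (d and detention_fails_probe) for d in detained]
        if not all(ready):
            # Gate: the operator waits for all pods Ready, so it never
            # reaches the release step below.
            continue
        upgraded[current] = True
        detained[current] = False  # release detention
        current = None
    return all(upgraded)


if __name__ == "__main__":
    print("detention => NotReady:", rollout_converges(3, True))
    print("detention ignored by probe:", rollout_converges(3, False))
```

In this model the rollout only converges when detention does not feed back into the readiness gate, which matches the loop described above.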

Important nuance

A “detention ⇒ NotReady” probe is reasonable for human/manual detentions (you want traffic drained).

It conflicts specifically with operator-driven rolling restarts/upgrades, because the operator expects to be able to detain a member and still progress through the rest of the orchestration, eventually releasing detention.

We attempted a “fail-open” guard: the probe marks a detained member NotReady only when no rolling restart is in progress. Without a guard like that (or a different operator gating model), the deadlock can occur.
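
A minimal sketch of the guard's decision logic, for illustration only. The function name, the status strings other than `ManualDetention`, and the `rolling_restart_in_progress` input are assumptions; how a probe would actually determine that a rolling restart is in progress is not covered here.

```python
def probe_is_ready(member_status: str, rolling_restart_in_progress: bool) -> bool:
    """Hypothetical fail-open readiness decision for an SHC member.

    - Manual (human) detention outside a rollout: report NotReady so
      traffic is drained.
    - Detention during an operator-driven rolling restart: report Ready
      ("fail open") so the operator's readiness gate is not blocked.
    """
    if member_status == "ManualDetention":
        return rolling_restart_in_progress
    # Assumption: "Up" is the only other status treated as ready.
    return member_status == "Up"


if __name__ == "__main__":
    for status in ("Up", "ManualDetention", "Down"):
        for rolling in (False, True):
            print(status, rolling, "->", probe_is_ready(status, rolling))
```

The trade-off of this approach is that a detained member stays in Service endpoints during the rollout, which is where the sticky-session behavior in section B becomes relevant.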

B) User impact: sticky ingress can strand users on a detained member

  1. Enable cookie-based stickiness in the ingress for Splunk Web (common).
  2. Have a user establish a session routed to SH member X.
  3. During a rollout, member X is detained/restarted/unhealthy.
  4. Observe:
    • The user’s browser continues to send requests with the same affinity cookie.
    • Depending on ingress behavior, the user can appear “stuck” (errors/timeouts/looping) until the cookie expires or the ingress fails over the session.

This is typically addressed at the ingress layer (e.g., session-cookie-change-on-failure and proxy-next-upstream settings), but the outcome is tightly coupled to how detention is reflected in readiness/endpoints during operator-driven operations.
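
To make the coupling concrete, here is a simplified model of cookie-based routing. It is a sketch under assumptions, not a reproduction of NGINX Ingress behavior: `route`, the `change_on_failure` flag (loosely named after the setting mentioned above), and the split between `endpoints` (pods still listed as Ready) and `serving` (pods actually able to answer) are all hypothetical.

```python
from typing import Optional


def route(cookie: Optional[str], endpoints: list[str], serving: set[str],
          change_on_failure: bool) -> tuple[Optional[str], Optional[str]]:
    """Return (backend that answers or None, cookie sent back to the user)."""
    pinned = cookie if cookie in endpoints else None
    if pinned is not None and (pinned in serving or not change_on_failure):
        # The affinity cookie is honored. If the pinned backend is not
        # serving and failover is off, the request fails and the user
        # stays pinned.
        return (pinned if pinned in serving else None), pinned
    # No usable pin, or failover allowed: pick the first serving endpoint
    # and re-pin the session to it.
    for backend in endpoints:
        if backend in serving:
            return backend, backend
    return None, cookie


if __name__ == "__main__":
    endpoints = ["sh-0", "sh-1"]
    serving = {"sh-1"}  # sh-0 is detained but still listed as an endpoint
    print(route("sh-0", endpoints, serving, change_on_failure=False))
    print(route("sh-0", endpoints, serving, change_on_failure=True))
```

In this model, a user pinned to a member that remains in the endpoint list but is out of service only recovers when failover is allowed to replace the cookie.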


K8s environment

  • Kubernetes cluster (managed)
  • NGINX Ingress Controller
  • (Optional) strict session stickiness enabled for Splunk Web ingress
