Bug report draft: SHC `ManualDetention` + Kubernetes readiness gating can deadlock operator-driven rollouts (and can strand users with sticky ingress)

## Describe the request

When running **Search Head Cluster (SHC)** on Kubernetes, the Splunk Operator’s rollout/recycle logic can become **stuck** if an SHC member enters **`ManualDetention`** and Kubernetes readiness is used as the primary gate to proceed.

In addition, when **ingress/session stickiness** is enabled (common for Splunk Web), users can appear “stuck” on a specific SH member during rollouts/detention events, even though the cluster is attempting to drain traffic from that member.

We need a robust operator behavior that:
- avoids **deadlocking** upgrades/rollouts when an SH member is detained, and
- avoids or minimizes user impact when a member is out of service (detained/unhealthy) in environments that use sticky sessions.

---

## Expected behavior

- During an operator-driven SHC recycle/upgrade:
  - The operator detains a member, waits for safe conditions, updates/restarts as needed, then **releases detention**.
  - The rollout should **always converge** without requiring manual intervention.

- During detention/restart events:
  - Users should not be pinned indefinitely to an SH member that is **out of service**.
  - If sticky sessions are used, there should be a **failover** path when a pinned backend becomes unhealthy/out of service.

---

## Splunk setup on K8S

- Splunk Enterprise 10.0.3 deployed with Splunk Operator 3.0.0.
- Includes:
  - **SearchHeadCluster** (SHC) with multiple members
  - (Optionally) Standalone instances in the same cluster (not required to reproduce this issue)
- Splunk Web traffic routed through Kubernetes Ingress (NGINX Ingress Controller) with **cookie-based session affinity** enabled (typical for Splunk Web session behavior).

---

## Reproduction/Testing steps

### A) Deadlock / “circular loop” between detention and readiness

1. Start a normal SHC with N members managed by Splunk Operator.
2. Trigger an operator-driven rolling operation (e.g., image upgrade / recycle path).
3. Ensure an SH member becomes **detained** as part of the recycle process (`status=ManualDetention` at the Splunk SHC member layer).
4. Use a readiness probe that treats detention as not ready:
   - `ManualDetention` → readiness probe fails → Kubernetes marks the pod `Ready=false`
5. Observe:
   - Kubernetes removes the pod from Service endpoints (expected).
   - The Splunk Operator’s SHC rollout can **stall** waiting for `ReadyReplicas` / pod readiness gates, and never reaches the step that releases detention.

#### The loop in one line
Operator detains a member → probe marks it NotReady → operator waits for all pods Ready → operator never reaches the step that undetains → member stays detained → probe keeps it NotReady → rollout is stuck.

#### Why this happens (mechanically)
During an operator-driven SHC upgrade/recycle, the operator:
- puts a member into detention (Splunk-side “out of service”), and
- later, when safe, releases detention (i.e., clears manual detention).

But the operator’s control loop gates progress on Kubernetes readiness (e.g., StatefulSet `ReadyReplicas == replicas` or equivalent “cluster is ready” conditions).

If the readiness probe is defined such that *detention implies NotReady*, Kubernetes will keep that pod `Ready=false` while detained. If the operator requires all pods Ready before proceeding to the release/undetain step, the process can deadlock.
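The loop can be reproduced with a toy model. The sketch below is not the operator's code: `Member`, `reconcile`, and both probe functions are hypothetical names, and the "all pods Ready" gate is reduced to a single `all(...)` check. It only illustrates why the gate cannot be satisfied while the probe reports a detained member as NotReady.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    detained: bool = False


def detention_means_not_ready(member: Member) -> bool:
    # Assumed probe policy: a detained member reports NotReady.
    return not member.detained


def ignores_detention(member: Member) -> bool:
    # Contrasting policy: detention does not affect readiness.
    return True


def reconcile(members, probe, max_passes=5):
    """Toy control loop: detain one member, then require every pod to be
    Ready before reaching the step that releases detention."""
    target = members[0]
    target.detained = True  # operator detains the member
    for _ in range(max_passes):
        if all(probe(m) for m in members):  # gate: all pods Ready
            target.detained = False  # release step, only reachable past the gate
            return "converged"
        # Gate not satisfied: requeue and check again on the next pass.
    return "stuck"


print(reconcile([Member("sh-0"), Member("sh-1")], detention_means_not_ready))
print(reconcile([Member("sh-0"), Member("sh-1")], ignores_detention))
```

With the first probe the gate depends on a condition that only the release step can restore, so no number of passes helps. With the second probe the same loop converges on the first pass.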

#### Important nuance
A “detention ⇒ NotReady” probe is reasonable for **human/manual detentions** (you want traffic drained).

It conflicts specifically with **operator-driven rolling restarts/upgrades**, because the operator expects to be able to detain a member and still progress through the rest of the orchestration, eventually releasing detention.

As a workaround, we tried a “fail-open” guard in the probe: detention marks the pod NotReady only when no rolling restart is in progress. Without such a guard (or a different operator gating model), the deadlock can occur.
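The guard's decision logic can be written as a single predicate. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, and how a probe would actually learn the two inputs (member status and whether a rolling restart is in progress) is not specified in this report and is left out.

```python
def pod_should_be_ready(detained: bool, rolling_restart_in_progress: bool) -> bool:
    """Fail-open readiness decision for an SHC member (hypothetical helper).

    Detention marks the pod NotReady only when no rolling restart is in
    progress; in every other case the pod stays Ready.
    """
    if detained and not rolling_restart_in_progress:
        return False  # human/manual detention: drain traffic
    return True  # not detained, or detained by an operator-driven restart


for detained in (False, True):
    for rolling in (False, True):
        print(detained, rolling, pod_should_be_ready(detained, rolling))
```

One consequence of this design is that a member detained during an operator-driven restart still counts as Ready, so it stays in the Service endpoints for that period.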

### B) User impact: sticky ingress can strand users on a detained member

1. Enable cookie-based stickiness in the ingress for Splunk Web (common).
2. Have a user establish a session routed to SH member X.
3. During a rollout, member X is detained/restarted/unhealthy.
4. Observe:
   - The user’s browser continues to send requests with the same affinity cookie.
   - Depending on ingress behavior, the user can appear “stuck” (errors/timeouts/looping) until the cookie expires or the ingress fails over the session.

This is typically addressed at the ingress layer (e.g., `session-cookie-change-on-failure` and `proxy-next-upstream` settings), but it’s tightly coupled to how detention is reflected in readiness/endpoints during operator-driven operations.
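The difference between a pinned session and a failed-over session can be shown with a toy routing function. This is a simplified model, not the NGINX Ingress Controller's actual balancing algorithm: `route`, `serving_backends`, and the `change_on_failure` flag are hypothetical names that only mimic the idea of reissuing the affinity cookie when the pinned backend cannot serve.

```python
def route(cookie_backend, serving_backends, change_on_failure):
    """Toy sticky-session router.

    Returns (backend_used, cookie_after_request). backend_used is None
    when the request fails because the pinned backend cannot serve it.
    """
    if cookie_backend in serving_backends:
        return cookie_backend, cookie_backend  # normal sticky path
    if change_on_failure and serving_backends:
        fallback = sorted(serving_backends)[0]  # deterministic pick for the sketch
        return fallback, fallback  # cookie is reissued for the new backend
    return None, cookie_backend  # user stays pinned to the unavailable member


# Member sh-1 is detained/restarting; sh-0 and sh-2 can still serve.
print(route("sh-1", {"sh-0", "sh-2"}, change_on_failure=False))
print(route("sh-1", {"sh-0", "sh-2"}, change_on_failure=True))
```

In this model the user only recovers without failover once the cookie expires, which matches the "stuck" behavior described in the steps above.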

---

## K8s environment

- Kubernetes cluster (managed)
- NGINX Ingress Controller
- (Optional) strict session stickiness enabled for Splunk Web ingress

