feat(valkey): add Valkey cluster addon as a sibling to redis #2

Open
mogita wants to merge 3 commits into main from feature/valkey-addon

Conversation


@mogita mogita commented May 6, 2026

Summary

Adds addons/valkey/ as a cluster-mode-only, side-by-side KubeBlocks addon. It replaces the five post-install Helm hooks in stream-infra/kubernetes/codebase/charts/valkey/templates/hooks/ (patch-cache-config, patch-maxmemory, patch-prefer-ip, patch-reshard-cm, patch-valkey-image) by baking the same behaviour into the addon at the template level — no more racing the operator with kubectl patch jobs against KubeBlocks-managed ConfigMaps.

The Valkey addon is independent of addons/redis/. Upstream redis evolution can land cleanly via merge — there are no shared files to conflict on.

What's in the addon

  • Single Valkey major (9.x) — no multi-version range loop, no sentinel, no twemproxy. Slim by design; we can extend valkeyVersions in values.yaml for new patches.
  • cmpv-valkey-cluster.yaml ships docker.io/valkey/valkey:<version>. dbctl and agamotto stay on apecloud (KubeBlocks-side tooling, not the engine).
  • ShardingDefinition with minShards: 1 — provisions 1, 2, 3+ shards. The create_redis_cluster helper branches on primary_count == 1 to use CLUSTER ADDSLOTSRANGE 0 16383 (mirroring AWS ElastiCache's approach), bypassing redis-cli --cluster create which Redis itself rejects below 3 masters.
  • redis.conf tuned for cache workload (config/valkey-cluster-config.tpl): appendonly no, save "", io-threads 1, latency-monitor-threshold 25, maxmemory-policy allkeys-lru, maxmemory at 85% of pod memory limit.
  • valkey-cluster-server-start.sh emits cluster-preferred-endpoint-type ip on the default-network branch (upstream emits hostname), so CLUSTER SLOTS announces VPC-routable IPs.
  • valkey-cluster-manage.sh skips the legacy redis-cli --cluster reshard call on scale-out — slot migration is driven by ASM (CLUSTER MIGRATESLOTS via ape-dts) through the OpsDefinition in stream-infra.
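The single-primary bootstrap described above can be sketched in plain bash. The function names follow the PR (`create_redis_cluster`, `build_single_shard_addslots_command`), but the bodies here are illustrative, not the addon's actual scripts:

```shell
# Hypothetical, simplified sketch of the single-primary branch in
# valkey-cluster-common.sh; the real script also handles auth, announce
# addresses, and replica attachment.
build_single_shard_addslots_command() {
  # One primary owns the whole keyspace: claim every slot in a single
  # command, mirroring how ElastiCache bootstraps a one-shard cluster.
  echo "CLUSTER ADDSLOTSRANGE 0 16383"
}

create_redis_cluster() {
  local primary_count="$1"
  if [ "$primary_count" -eq 1 ]; then
    # redis-cli --cluster create refuses fewer than 3 masters, so bypass it.
    build_single_shard_addslots_command
  else
    # 3+ primaries: the normal cluster-create path still applies.
    echo "redis-cli --cluster create <node-endpoints> --cluster-yes"
  fi
}

create_redis_cluster 1   # prints: CLUSTER ADDSLOTSRANGE 0 16383
```

Branching on `primary_count` at cluster-create time keeps the multi-shard path byte-identical to upstream, which is what makes future bug-porting cheap.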

Internal function names keep their redis_* identifiers to minimise diff vs. upstream redis scripts (easier future bug-porting). Filenames and CR names are valkey-cluster-* for clarity.

What this retires in stream-infra

| Hook | Replaced by |
| --- | --- |
| patch-valkey-image.yaml | cmpv-valkey-cluster.yaml ships valkey/valkey directly |
| patch-cache-config.yaml | valkey-cluster-config.tpl bakes appendonly/save/io-threads/latency-monitor |
| patch-maxmemory.yaml | Same template, 85% of PHY_MEMORY (limit), allkeys-lru policy |
| patch-prefer-ip.yaml | valkey-cluster-server-start.sh line 652 set to ip |
| patch-reshard-cm.yaml | valkey-cluster-manage.sh simply doesn't call scale_out_shard_reshard |
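As a rough illustration of the maxmemory replacement, the arithmetic reduces to integer math over the pod memory limit. `PHY_MEMORY` stands in for however the template resolves the limit in bytes; the snippet is a sketch, not the rendered tpl:

```shell
# Derive maxmemory as 85% of the pod memory limit in bytes, the same
# budget the retired patch-maxmemory hook used to patch in after install.
PHY_MEMORY="${PHY_MEMORY:-1073741824}"   # e.g. a 1Gi pod memory limit
MAXMEMORY=$(( PHY_MEMORY * 85 / 100 ))
echo "maxmemory ${MAXMEMORY}"
echo "maxmemory-policy allkeys-lru"
```

Integer division truncates, so the derived budget always stays at or under 85% of the limit.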

What stays in stream-infra: NLB / TargetGroupBinding, NetworkPolicies, ServiceMonitor, the auto-heal CronJob, the ASM OpsDefinition. Engine-agnostic infra.

Settings global for v1

No per-cluster Helm knobs yet — every Valkey cluster on this addon picks up the same tunings. If divergence is needed later, we can either wire values.yaml overrides into the config tpl or add a real ParametersDefinition / ParamConfigRenderer for per-Cluster overrides.

Verification

  • helm template addons/valkey renders 5 resources cleanly (ShardingDefinition, ComponentDefinition, ComponentVersion, plus 2 ConfigMap templates). All 9 script files mount into the scripts ConfigMap.
  • shellspec for build_single_shard_addslots_command and the create_redis_cluster branch: 4 examples, 0 failures (run on bash 5).
  • Upstream redis spec still passes (no shared files).
  • e2e: switch a Cluster from componentDef: redis-cluster-8 to componentDef: valkey-cluster-9-0.1.0, verify provision, set/get via redis-cli -c.
  • e2e: provision with shards: 1; verify cluster_state ok and slots covered.
  • e2e: scale 1 → 3, then run ASM OpsRequest; verify slot migration and --cluster check passes.
  • e2e: confirm CLUSTER SLOTS announces pod IPs (not FQDNs) so chat-api on EC2 can reach the cluster.
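The shards: 1 health assertion can be sketched as a small parser over CLUSTER INFO output. The helper name is hypothetical; a real e2e run would feed it `redis-cli -c CLUSTER INFO` from a provisioned pod:

```shell
# Verify cluster_state:ok with all 16384 slots assigned — the condition
# the shards:1 e2e case asserts. CLUSTER INFO replies use CRLF line
# endings, hence the tr -d '\r'.
check_cluster_health() {
  local info="$1" state slots
  state=$(printf '%s\n' "$info" | awk -F: '/^cluster_state/{print $2}' | tr -d '\r')
  slots=$(printf '%s\n' "$info" | awk -F: '/^cluster_slots_assigned/{print $2}' | tr -d '\r')
  [ "$state" = "ok" ] && [ "$slots" -eq 16384 ]
}

# Canned sample standing in for `redis-cli -c CLUSTER INFO` output.
sample='cluster_state:ok
cluster_slots_assigned:16384'
check_cluster_health "$sample" && echo "cluster healthy"
```

The same check works unchanged after the 1 → 3 scale-out, since full slot coverage is invariant across shard counts.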

Out of scope

  • Reverse path (3→2→1 scale-in) is untouched in this PR. The manage.sh lifecycle hook for shardRemove calls --pre-terminate; that flow is unchanged. We can revisit if/when we need to scale a Valkey cluster down through 1.
  • No changes to addons/redis/. The redis addon stays exactly as upstream — anyone still on a redis-componentDef cluster keeps the existing 3-shard floor and stock behaviour.

mogita added 3 commits May 6, 2026 12:43
Stand up addons/valkey/ as a cluster-mode-only side-by-side addon, so our
Valkey customizations live in their own file tree and never collide with
upstream redis evolution. This retires the five post-install Helm hooks
in stream-infra (patch-cache-config, patch-maxmemory, patch-prefer-ip,
patch-reshard-cm, patch-valkey-image) by baking the equivalent behaviour
into the addon at template-level.

What's in the addon
-------------------
- Single Valkey major (9.x) — no multi-version range loop, no sentinel,
  no twemproxy. cmpv-valkey-cluster.yaml ships docker.io/valkey/valkey
  images. dbctl/agamotto stay on apecloud.
- ShardingDefinition with `minShards: 1` (provisions 1, 2, 3+ shards,
  matching how AWS ElastiCache exposes the same engine).
- redis.conf tuned for a cache workload at template-level: appendonly no,
  save "" (no scheduled BGSAVE), io-threads 1 (avoids CFS throttling at
  our pod CPU limit), latency-monitor-threshold 25 (observability),
  maxmemory-policy allkeys-lru, maxmemory at 85% of pod memory limit.
- valkey-cluster-server-start.sh: emits `cluster-preferred-endpoint-type
  ip` on the default-network branch (was `hostname`), so CLUSTER SLOTS
  announces VPC-routable IPs for chat-api and other external clients.
- valkey-cluster-manage.sh: skips the legacy `redis-cli --cluster reshard`
  call on shard scale-out — slot migration is driven by ASM
  (CLUSTER MIGRATESLOTS via ape-dts) through the OpsDefinition in
  stream-infra.
- valkey-cluster-common.sh: branches `create_redis_cluster` on a single
  primary to use `CLUSTER ADDSLOTSRANGE 0 16383` (mirroring ElastiCache),
  bypassing `redis-cli --cluster create` which rejects fewer than 3
  masters. Lifts the matching guard in initialize_redis_cluster.

Function names inside the scripts intentionally keep their `redis_*`
identifiers to minimise the diff vs. upstream redis scripts and ease
future bug-porting.

Settings are global for now — no per-cluster Helm knobs. Add
ParametersDefinition / values overrides later if cluster-specific
tunings are needed.

Verification
------------
- `helm template addons/valkey` renders 5 resources cleanly:
  ShardingDefinition, ComponentDefinition, ComponentVersion, plus the
  config + scripts ConfigMap templates. All 9 script files mount.
- shellspec for `build_single_shard_addslots_command` and
  `create_redis_cluster` branch logic: 4 examples, 0 failures.

Add 9.0.0, 9.0.1, 9.0.2, 9.0.4 alongside existing 9.0.3 / 9.1.0 in
ComponentVersion releases. 9.0.4 (released 2026-05-06) becomes the chart
appVersion and the default `serviceVersion` on the ComponentDefinition.

The full 9.0.x range gives operators a pinned set of options for
OpsRequest type=Upgrade rollback / patch-version testing without needing
to redeploy the addon. Same-image-tag mapping; no behavioural change.

9.1.0 is still RC upstream and not yet a tagged release on
docker.io/valkey/valkey. Keep ComponentVersion to the stable 9.0.x line
(9.0.0 - 9.0.4) for now; re-add 9.1.0 once the GA tag ships.
