Replace wal_sender_timeout-based liveness with TCP keepalive. #373

ibrarahmad wants to merge 1 commit into main from
Conversation
The apply worker previously relied on wal_sender_timeout as both a server-side disconnect trigger and an indirect keepalive pressure on the subscriber. This caused spurious disconnects in two scenarios: a flood of 'w' messages keeping the subscriber too busy to send 'r' feedback in time, and large transactions whose apply time exceeded wal_sender_timeout.

The workaround was maybe_send_feedback(), which force-sent 'r' after every 10 'w' messages or wal_sender_timeout/2, whichever came first. This was a fragile band-aid that coupled subscriber behavior to a server GUC it cannot control.

Replace the entire mechanism with a clean two-layer model:

- TCP keepalive (keepalives_idle=10, keepalives_interval=5, keepalives_count=3) is the primary liveness detector on both sides. A dead network or crashed host is detected in ~25 seconds.
- wal_sender_timeout=0 is set on replication connections so the walsender never disconnects due to missing 'r' feedback. Liveness on the server side is now handled entirely by TCP keepalive.
- spock.apply_idle_timeout (default 300s) is a subscriber-side safety net for a hung-but-connected walsender whose TCP keepalive probes are answered by the kernel but which sends no data. Set to 0 to disable.

Fix a bug in last_receive_timestamp handling: it was updated unconditionally after every PQgetCopyData call, including when r==0 (no data available). Each 1-second WL_TIMEOUT spin silently reset the timer, making apply_idle_timeout never fire. Move the update to after the r==0 guard so it reflects actual data receipt only.

Remove maybe_send_feedback() as it is no longer needed.
📝 Walkthrough

A new idle-timeout control mechanism was introduced for apply workers with a configurable GUC parameter. Replication connection handling was extended to disable server-side wal_sender_timeout. Timeout logic in the apply worker was refactored from keepalive-based to idle-timeout-based, with supporting infrastructure adjustments throughout the codebase.
🧹 Nitpick comments (1)
src/spock.c (1)
345-356: Consider making TCP keepalive parameters configurable. The hardcoded values (idle=10s, interval=5s, count=3) result in ~25s dead-connection detection. While reasonable for most deployments, high-latency or unreliable network environments may experience false-positive disconnects.
Consider exposing these as GUCs (e.g., spock.keepalives_idle, spock.keepalives_interval, spock.keepalives_count) to allow tuning without code changes.
📒 Files selected for processing (3): include/spock.h, src/spock.c, src/spock_apply.c
```diff
 		 * kernel ACKs them, but no data is being sent.
 		 */
-		if (rc & WL_TIMEOUT)
+		if (rc & WL_TIMEOUT && spock_apply_idle_timeout > 0)
```
It seems like if the walsender just doesn't have data to send for a long time, the subscriber will restart. Am I wrong?
It would be better to modify the walsender a little: skip keepalive messages while busy and rely on TCP status, but send keepalive messages if no data arrives from the WAL. In that case we wouldn't need any subscriber-side GUC at all.