Skip to content

Async client: fix silent data loss on the final frame of a server stream#312

Open
shvbsle wants to merge 1 commit into
containerd:masterfrom
shvbsle:ttrpc-bug
Open

Async client: fix silent data loss on the final frame of a server stream#312
shvbsle wants to merge 1 commit into
containerd:masterfrom
shvbsle:ttrpc-bug

Conversation

@shvbsle
Copy link
Copy Markdown
Contributor

@shvbsle shvbsle commented Apr 18, 2026

Fixes #311

Process each frame inline in ClientReader::handle_msg instead of tokio::spawn-ing per frame. The read loop in connection.rs already awaits handle_msg per frame, so inline preserves per-stream wire order and restores natural back-pressure on the per-stream mpsc. handle_err adjusted the same way for consistency.

Server dispatch also spawns per frame, but there the spawn is load-bearing (user handlers can take arbitrary time) and a oneshot already preserves ordering. Left alone.

Regression tests

Added two examples that exercise the race on a multi_thread runtime:

  • async-stream-close-order - server-streaming handler sends one payload and returns (DATA immediately followed by REMOTE_CLOSED). Repeats 2000 times in batches of 200 and asserts every iteration receives the payload before EOF. Targets data loss from the close-frame removing the stream before the data-frame task looks it up.

  • async-data-order - server-streaming handler sends 50 sequential DATA frames per stream. Opens 100 concurrent streams and asserts every stream receives frames in sequence order. Targets reordering from spawned tasks executing out of submission order.

Without the fix:

repeating the experiment 2000 times, we find data loss 148/2000 times
repeating the experiment 100 times, we find out of order frames 8/100 times

With the fix both pass cleanly across repeated runs.

Compatibility

Public API, wire format, and observable behavior for correctly-behaving consumers are unchanged.

@shvbsle
Copy link
Copy Markdown
Contributor Author

shvbsle commented Apr 20, 2026

On CI Failures: The CI build failures are unrelated to the code changes. Cargo 1.81 cannot resolve the current crate ecosystem because new crate versions now require edition2024. Seems like infrastructure rot.

Comment thread src/asynchronous/client.rs Outdated
@fletcherw
Copy link
Copy Markdown

On CI Failures: The CI build failures are unrelated to the code changes. Cargo 1.81 cannot resolve the current crate ecosystem because new crate versions now require edition2024. Seems like infrastructure rot.

You can update rust-toolchain.toml to 1.95 to fix this.

@shvbsle
Copy link
Copy Markdown
Contributor Author

shvbsle commented Apr 20, 2026

You can update rust-toolchain.toml to 1.95 to fix this.

I had a PR for this but its not as simple as updating the toolchain. Need to fix the new clippy issues that are introduced in new versions:
#313

I can cherry pick or rebase this PR after the above one is merged

@wllenyj
Copy link
Copy Markdown
Collaborator

wllenyj commented Apr 22, 2026

LGTM.

At the time of implementation, it was simply designed to behave exactly like the Go version, even though the Go version might be incorrect...

@shvbsle shvbsle force-pushed the ttrpc-bug branch 2 times, most recently from 2dadc6b to c908838 Compare May 2, 2026 17:09
Copy link
Copy Markdown

@tzneal tzneal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@dims
Copy link
Copy Markdown
Member

dims commented May 4, 2026

@tzneal @shvbsle can the reproducer be turned into a test case perhaps? worth it?

@shvbsle
Copy link
Copy Markdown
Contributor Author

shvbsle commented May 5, 2026

@tzneal @dims Added a regression test in the second commit. example/async-stream-close-order.rs sends one payload on a server-stream and returns, 1000 times on a multi_thread runtime. Verified it fails without the fix (8/1000 dropped) and passes with it. Wired into tests/run-examples.rs so it runs under cargo test.

Comment thread example/async-stream-close-order.rs Outdated
@shvbsle shvbsle force-pushed the ttrpc-bug branch 4 times, most recently from e6e7521 to 2afc15c Compare May 16, 2026 03:19
ClientReader::handle_msg spawns a new task per frame. For a
server-streaming RPC, the final DATA frame and the subsequent
FLAG_REMOTE_CLOSED frame then race: if the close-frame task grabs
the req_map lock first, it removes the stream from the map, and
the preceding data-frame task finds nothing and silently drops
the payload. The stream consumer sees Ok(None) (EOF) without
ever observing the payload the server sent.

The connection read loop already awaits handle_msg per frame, so
processing inline preserves per-stream wire order. It also gives
the per-stream mpsc natural back-pressure in place of the
unbounded per-frame spawning.

handle_err gets the same treatment for consistency.

Add two regression examples:
- async-stream-close-order: verifies DATA before CLOSE is not lost
- async-data-order: verifies multi-frame streams maintain sequence

Signed-off-by: Shiv Bhosale <shvbsle@amazon.com>
@shvbsle shvbsle requested a review from tzneal May 16, 2026 03:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Async client silently drops the final DATA frame of a server-streaming RPC

5 participants