openvmm_entry: Replat ttrpc server on VmController #3259
jstarks wants to merge 3 commits into microsoft:main
Conversation
Refactor the ttrpc/gRPC server so that after `CreateVm`, VM lifecycle operations are routed through `VmController` instead of being managed inline by `VmService`. This eliminates duplicated worker lifecycle management between the ttrpc and REPL paths.

`VmController` fields `diag_inspector` and `paravisor_diag` are now `Option<_>` so the ttrpc path can construct a `VmController` without paravisor-specific resources. `VmService` no longer owns a `WorkerHandle` directly; instead it holds a `Sender<VmControllerRpc>` and receives `VmControllerEvent`s for guest halt notifications.

Teardown/Quit send `VmControllerRpc::Quit` to the controller, which handles worker stop, join, and mesh shutdown. Inspect routes through the controller via `VmControllerRpc::Inspect`. `WaitVm` stores a response sender that is completed when `GuestHalt` or `WorkerStopped` events arrive from the controller. Pause/Resume/SCSI/NIC operations remain as direct `VmRpc` calls, unchanged.
Pull request overview
Refactors the ttrpc/gRPC VM service to route VM lifecycle operations through VmController (aligning the ttrpc and REPL entry paths), so worker lifecycle + mesh shutdown are owned in one place instead of being duplicated in VmService.
Changes:
- `VmService` now spawns a `VmController` task after `CreateVm`, sends `VmControllerRpc` (Quit/Inspect), and consumes `VmControllerEvent` (halt/worker stop).
- `VmController` is made constructible for the ttrpc path by making `diag_inspector` and `paravisor_diag` optional.
- Removes `parking_lot` usage from the ttrpc path by eliminating the prior `Mutex<Option<Receiver<_>>>` wait pattern.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| openvmm/openvmm_entry/src/vm_controller.rs | Makes diag/paravisor diagnostics optional and updates inspect + servicing paths accordingly. |
| openvmm/openvmm_entry/src/ttrpc/mod.rs | Replats ttrpc VM lifecycle management onto VmController, replacing inline worker-handle ownership and wait logic. |
| openvmm/openvmm_entry/src/lib.rs | Updates REPL path VmController construction to wrap new optional fields in Some(...). |
| openvmm/openvmm_entry/Cargo.toml | Drops parking_lot dependency now that it’s no longer needed by the ttrpc module. |
| Cargo.lock | Removes parking_lot from the resolved dependency graph for openvmm_entry. |
```diff
 vmservice::Vm::WaitVm((), response) => {
-    let r = self.wait_vm(ctx, vm);
-    self.start_rpc(response, r);
+    if self.wait_vm_response.is_some() {
+        response.send(Err(grpc_error(anyhow!("wait VM already in flight"))));
+    } else {
+        self.wait_vm_response = Some(response);
+    }
```
WaitVm no longer ties the stored response sender to the request's CancelContext. If the client cancels/disconnects before the VM halts, wait_vm_response stays occupied until a controller event arrives, which blocks subsequent WaitVm calls and can leak state. Consider tracking the CancelContext alongside the sender and clearing/completing it on cancellation/deadline (or reintroduce the previous per-request wait future).
```rust
if let Some(event) = event {
    self.handle_controller_event(event);
} else {
    tracing::debug!("controller event channel closed");
}
```
When the controller event channel is closed, this branch only logs but leaves vm_controller_events intact. Since recv.next().await will then return None immediately, the main loop can spin and burn CPU logging "controller event channel closed" repeatedly. On None, drop/clear self.vm_controller_events (and possibly controller_task) to stop polling a closed stream.
Suggested change:
```diff
 tracing::debug!("controller event channel closed");
+self.vm_controller_events = None;
```
```rust
    Ok(())
})
// ...
if let Some(response) = self.wait_vm_response.take() {
    response.send(Err(grpc_error(anyhow!("VM worker stopped"))));
```
On VmControllerEvent::WorkerStopped, the error string from the controller (error: Option<String>) is logged but not propagated to a pending WaitVm response, which makes debugging harder. Consider including the controller-provided error string in the returned Status message (and distinguish normal stop vs failure if applicable).
Suggested change:
```diff
-response.send(Err(grpc_error(anyhow!("VM worker stopped"))));
+let status = if let Some(err) = error {
+    grpc_error(anyhow!("VM worker stopped: {}", err))
+} else {
+    grpc_error(anyhow!("VM worker stopped"))
+};
+response.send(Err(status));
```
```diff
 InspectTarget::Paravisor => {
-    self.diag_inspector.inspect_mut(req);
+    if let Some(inspector) = &mut self.diag_inspector {
+        inspector.inspect_mut(req);
```
InspectTarget::Paravisor does nothing when diag_inspector is None, leaving the inspect::Request unevaluated. In the inspect framework, unevaluated nodes surface as "Unresolved" errors and can force callers to wait for timeouts. If paravisor inspection is unavailable, explicitly respond (e.g., req.ignore() or return a small node explaining diagnostics aren't configured).
Suggested change:
```diff
 inspector.inspect_mut(req);
+} else {
+    req.ignore();
```
```diff
 vmservice::Vm::WaitVm((), response) => {
-    let r = self.wait_vm(ctx, vm);
-    self.start_rpc(response, r);
+    if self.wait_vm_response.is_some() {
+        response.send(Err(grpc_error(anyhow!("wait VM already in flight"))));
+    } else {
+        self.wait_vm_response = Some((ctx.clone(), response));
+    }
```
WaitVm can miss a GuestHalt/WorkerStopped event if the controller event arrives before the client issues WaitVm. The main loop continuously drains vm_controller_events, and handle_controller_event drops GuestHalt when no wait_vm_response is set, so a later WaitVm may never complete (despite the VM already being halted). Consider buffering the last terminal controller state (halted/stopped/teardown) and completing WaitVm immediately when called, or only polling controller events while a WaitVm is in flight (while still recording terminal events for later).