nvme_driver: don't lazily restore interrupts (#3220) #3235
mattkur wants to merge 1 commit into `microsoft:release/1.7.2511`
Conversation
Works around a device-side issue: some devices cannot handle the case where interrupts are reprogrammed out of IV order. (cherry picked from commit 958b8a4)
Pull request overview
Adjusts NVMe driver servicing-restore behavior to avoid device bugs triggered when MSI-X interrupts are reprogrammed out of interrupt-vector (IV) order, by making the default restore path eagerly restore all I/O queues in IV-sorted order.
Changes:
- Default to an eager restore strategy that restores all I/O queues, sorted by interrupt vector, to preserve ascending IV programming order.
- Add `allow_lazy_restore: Option<bool>` to the saved state to optionally fall back to the prior eager+lazy split restore behavior.
- Update Underhill NVMe manager test helpers to populate the new saved-state field.
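As a hedged illustration of how an optional saved-state flag like this could gate the restore path, here is a minimal sketch. The type and function names (`SavedState`, `RestoreStrategy`, `pick_strategy`) are invented for illustration and are not the driver's actual API; the key point is that a missing flag (states saved by older builds) defaults to the new eager, IV-sorted behavior:

```rust
/// Illustrative stand-in for the restore strategies described in the PR.
#[derive(Debug, PartialEq)]
enum RestoreStrategy {
    /// New default: restore every I/O queue up front, sorted by
    /// interrupt vector, so IVs are programmed in ascending order.
    EagerIvSorted,
    /// Prior behavior: eager restore for some queues, lazy for the rest.
    Lazy,
}

/// Illustrative fragment of the saved state; only the new field is shown.
struct SavedState {
    /// `None` for states saved before this change.
    allow_lazy_restore: Option<bool>,
}

fn pick_strategy(state: &SavedState) -> RestoreStrategy {
    // A missing or false flag selects the eager, IV-sorted path.
    if state.allow_lazy_restore.unwrap_or(false) {
        RestoreStrategy::Lazy
    } else {
        RestoreStrategy::EagerIvSorted
    }
}

fn main() {
    let old_state = SavedState { allow_lazy_restore: None };
    assert_eq!(pick_strategy(&old_state), RestoreStrategy::EagerIvSorted);

    let lazy_state = SavedState { allow_lazy_restore: Some(true) };
    assert_eq!(pick_strategy(&lazy_state), RestoreStrategy::Lazy);
}
```

Defaulting `None` to the eager path means previously-saved states automatically pick up the workaround on restore.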
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| vm/devices/storage/disk_nvme/nvme_driver/src/driver.rs | Implements IV-sorted eager restoration of all I/O queues; adds saved-state flag to optionally enable the old lazy behavior. |
| openhcl/underhill_core/src/nvme_manager/save_restore_helpers.rs | Updates test-state construction to include the new allow_lazy_restore field. |
```rust
// Eager restore path: restore ALL queues sorted by interrupt
// vector for ordered VPci allocation (MSI-X ordering workaround).
//
// Devnote: Safety of inline new_self_drained(): This loop is fully
// synchronous (no .await). Although IoQueue::restore() spawns
// queue handler tasks, they don't poll until the async runtime
// yields — which happens only after .collect() completes. So all
// new_self_drained() and new_draining() calls finish before any
// handler can fire the drain-complete signal. If this loop is ever
// refactored to be async, the waiters for empty queues must be
// pre-created (as done in the lazy path above).
let mut sorted_io: Vec<_> = saved_state.worker_data.io.iter().collect();
sorted_io.sort_by_key(|q| q.iv);

worker.io = sorted_io
    .into_iter()
    .flat_map(|q| -> Result<IoQueue<D>, anyhow::Error> {
        let qid = q.queue_data.qid;
        let cpu = q.cpu;
        tracing::info!(qid, cpu, iv = q.iv, ?pci_id, "restoring queue");
        max_seen_qid = max_seen_qid.max(qid);
        let interrupt =
            worker.device.map_interrupt(q.iv, q.cpu).with_context(|| {
                format!(
                    "failed to map interrupt for {}, cpu {}, iv {}",
                    pci_id, q.cpu, q.iv
                )
            })?;
        tracing::info!(qid, cpu, ?pci_id, "restoring queue: search for mem block");
        let mem_block = restored_memory
            .iter()
            .find(|mem| {
                mem.len() == q.queue_data.mem_len
                    && q.queue_data.base_pfn == mem.pfns()[0]
            })
            .expect("unable to find restored mem block")
            .to_owned();
        tracing::info!(qid, cpu, ?pci_id, "restoring queue: restore IoQueue");
        let q = IoQueue::restore(
            driver.clone(),
            interrupt,
            registers.clone(),
            mem_block,
            &pci_id,
            q,
            bounce_buffer,
            if q.queue_data.handler_data.pending_cmds.commands.is_empty() {
                drain_after_restore_template.new_self_drained()
            } else {
                drain_after_restore_template.new_draining()
            },
            // … (diff excerpt ends here)
```
In the eager-restore path, empty queues get DrainAfterRestoreBuilder::new_self_drained() created inline while IoQueue::restore() is spawning queue-handler tasks. The devnote assumes those tasks won’t run until the async runtime yields, but VmTaskDriver explicitly allows spawned tasks to run on other executors/threads immediately. That reintroduces the missed-signal race you previously avoided by pre-creating waiters: a non-empty queue could drain and signal before later empty queues register their waiters, leaving them stuck in SelfDrained forever. Consider pre-creating the DrainAfterRestore handles for all queues before restoring/spawning any of them, or use the builder’s new_for_new_queue() (double-check) logic for empty queues so late registration can’t miss a completed drain.
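The reviewer's suggested fix — register every drain waiter before any queue task can run — can be sketched with a two-pass structure. The types below (`DrainTracker`, `DrainHandle`) are illustrative stand-ins for the driver's `DrainAfterRestore` machinery, not its real API; the point is only the ordering: all handles exist before any completion can be signaled, so a fast-draining queue cannot signal before later waiters register.

```rust
use std::sync::{Arc, Mutex};

/// Illustrative tracker: counts queues that still have work to drain.
#[derive(Clone)]
struct DrainTracker {
    pending: Arc<Mutex<usize>>,
}

/// Handle held by one queue's restore; completing it marks that
/// queue as drained.
struct DrainHandle {
    pending: Arc<Mutex<usize>>,
}

impl DrainTracker {
    fn new() -> Self {
        Self { pending: Arc::new(Mutex::new(0)) }
    }

    /// Register a waiter. In the suggested design this is called for
    /// every queue *before* any queue task is spawned.
    fn new_draining(&self) -> DrainHandle {
        *self.pending.lock().unwrap() += 1;
        DrainHandle { pending: self.pending.clone() }
    }

    fn all_drained(&self) -> bool {
        *self.pending.lock().unwrap() == 0
    }
}

impl DrainHandle {
    fn complete(self) {
        *self.pending.lock().unwrap() -= 1;
    }
}

fn main() {
    let tracker = DrainTracker::new();

    // Pass 1: pre-create one handle per queue. No queue work has run
    // yet, so no completion can be missed.
    let handles: Vec<_> = (0..3).map(|_| tracker.new_draining()).collect();
    assert!(!tracker.all_drained());

    // Pass 2: queue tasks may now complete in any order — even
    // immediately on another executor — without a missed-signal race,
    // because every waiter already exists.
    for h in handles {
        h.complete();
    }
    assert!(tracker.all_drained());
}
```

Splitting registration from restoration is what makes the pattern safe regardless of where the spawned queue-handler tasks execute.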
Clean cherry pick of PR #3220