chore: ensure no subnet message execution on canister aborted in install code #9619

Merged
mraszyk merged 7 commits into master from
mraszyk/no-subnet-message-on-aborted-install-code
Apr 1, 2026

@mraszyk mraszyk commented Mar 27, 2026

This PR hardens the scheduler implementation by ensuring that no subnet messages are executed on a canister that has an aborted install code execution. This prevents out-of-order subnet message execution: for example, a canister_status call enqueued after install_code could otherwise return before install_code if canister_status completes quickly while the canister's install_code execution is aborted. Currently, out-of-order subnet message execution is not possible because there can be only a single aborted install code execution, and it is resumed (in advance_long_running_install_code) before other subnet messages are executed (in drain_subnet_queues). Relying on that assumption is quite fragile, though.
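
The guard described above can be sketched roughly as follows. This is a minimal model, not the actual scheduler code: the `ExecutionTask` enum here is a hypothetical, simplified stand-in (the real variants carry data and there are more of them), and the function takes the canister's next task directly instead of the full canister state.

```rust
/// Hypothetical, simplified stand-in for the task at the front of a
/// canister's task queue. The real enum has more variants and fields.
#[derive(Debug)]
enum ExecutionTask {
    Heartbeat,
    PausedInstallCode,
    AbortedInstallCode,
}

/// Returns true if a subnet message targeting the canister may be executed now.
/// Simplified sketch: only the task-queue check is modeled.
fn can_execute_subnet_msg(next_task: Option<&ExecutionTask>) -> bool {
    match next_task {
        // No long-running install code in progress: the message may run.
        None | Some(ExecutionTask::Heartbeat) => true,
        // A paused install code blocks other subnet messages on the canister.
        Some(ExecutionTask::PausedInstallCode) => false,
        // The hardening: an aborted install code blocks them as well, so a
        // later canister_status cannot complete before install_code.
        Some(ExecutionTask::AbortedInstallCode) => false,
    }
}
```

With this check in place, a subnet message aimed at a canister whose next task is an aborted install code stays in the queue until the install code has been resumed and has finished.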

The new test in this PR fails if the scheduler hardening from this PR is reverted and drain_subnet_queues is executed before advance_long_running_install_code, as in the following diff:

diff --git a/rs/execution_environment/src/scheduler.rs b/rs/execution_environment/src/scheduler.rs
index d63de5041f9..9cc7d555046 100644
--- a/rs/execution_environment/src/scheduler.rs
+++ b/rs/execution_environment/src/scheduler.rs
@@ -466,6 +466,35 @@ impl SchedulerImpl {
                 scheduler_round_limits.update_subnet_round_limits(&subnet_round_limits);
             }
 
+            // Subnet queues: execute long running install code call if present.
+            {
+                let measurement_scope = MeasurementScope::nested(
+                    &self.metrics.round_advance_long_install_code,
+                    &root_measurement_scope,
+                );
+                let long_running_canister_ids = state
+                    .canister_states()
+                    .iter()
+                    .filter_map(|(&canister_id, canister)| match canister.next_execution() {
+                        NextExecution::None | NextExecution::StartNew => None,
+                        NextExecution::ContinueLong | NextExecution::ContinueInstallCode => {
+                            Some(canister_id)
+                        }
+                    })
+                    .collect();
+
+                let mut subnet_round_limits = scheduler_round_limits.subnet_round_limits();
+                state = self.advance_long_running_install_code(
+                    state,
+                    &mut subnet_round_limits,
+                    &long_running_canister_ids,
+                    &measurement_scope,
+                    registry_settings.subnet_size,
+                );
+
+                scheduler_round_limits.update_subnet_round_limits(&subnet_round_limits);
+            };
+
             let mut round_limits = scheduler_round_limits.canister_round_limits();
             if round_limits.instructions_reached() {
                 self.metrics
@@ -1451,25 +1480,6 @@ impl Scheduler for SchedulerImpl {
             scheduler_round_limits.update_subnet_round_limits(&subnet_round_limits);
         }
 
-        // Subnet queues: execute long running install code call if present.
-        {
-            let measurement_scope = MeasurementScope::nested(
-                &self.metrics.round_advance_long_install_code,
-                &root_measurement_scope,
-            );
-
-            let mut subnet_round_limits = scheduler_round_limits.subnet_round_limits();
-            state = self.advance_long_running_install_code(
-                state,
-                &mut subnet_round_limits,
-                &long_running_canister_ids,
-                &measurement_scope,
-                registry_settings.subnet_size,
-            );
-
-            scheduler_round_limits.update_subnet_round_limits(&subnet_round_limits);
-        };
-
         // Scheduling.
         let round_schedule = {
             let _timer = self.metrics.round_scheduling_duration.start_timer();
@@ -1983,11 +1993,7 @@ fn can_execute_subnet_msg(
                 // Note, this does NOT include aborted executions.
                 return false;
             }
-            Some(ExecutionTask::AbortedInstallCode { .. }) => {
-                // If there is an aborted install code in progress, we can't execute the subnet message.
-                // This is to prevent out-of-order subnet message execution.
-                return false;
-            }
+            Some(ExecutionTask::AbortedInstallCode { .. }) => true,
         };
 
     // Some heavy methods use round instructions.
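
To illustrate why the diff above makes the test fail, here is a toy simulation of one round. It is only a sketch under loud assumptions: `Canister`, `run_round`, and `resume_install_code` are hypothetical names, subnet messages are modeled as plain method-name strings, and the aborted install_code is assumed to complete as soon as it is resumed (in the real scheduler it may take several rounds).

```rust
use std::collections::VecDeque;

/// Hypothetical, heavily simplified canister: it only tracks whether an
/// install_code execution was aborted and is waiting to be resumed.
struct Canister {
    aborted_install_code: bool,
}

/// Stand-in for advance_long_running_install_code: resumes the aborted
/// install_code and (by assumption) completes it immediately.
fn resume_install_code(canister: &mut Canister, completed: &mut Vec<&'static str>) {
    if canister.aborted_install_code {
        canister.aborted_install_code = false;
        completed.push("install_code");
    }
}

/// Simulates one round and returns the order in which subnet messages complete.
fn run_round(
    canister: &mut Canister,
    subnet_queue: &mut VecDeque<&'static str>,
    resume_before_drain: bool,
    guard_enabled: bool,
) -> Vec<&'static str> {
    let mut completed = Vec::new();

    if resume_before_drain {
        resume_install_code(canister, &mut completed);
    }

    // Stand-in for drain_subnet_queues: with the guard enabled, messages for
    // a canister with an aborted install_code stay in the queue.
    let mut still_queued = VecDeque::new();
    while let Some(msg) = subnet_queue.pop_front() {
        if guard_enabled && canister.aborted_install_code {
            still_queued.push_back(msg);
        } else {
            completed.push(msg);
        }
    }
    *subnet_queue = still_queued;

    if !resume_before_drain {
        resume_install_code(canister, &mut completed);
    }
    completed
}
```

In this model, draining first without the guard completes canister_status before install_code, which is the out-of-order case. Draining first with the guard completes only install_code and leaves canister_status queued for a later round, so the order is preserved regardless of which step runs first.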

@github-actions github-actions bot added the chore label Mar 27, 2026
@mraszyk mraszyk marked this pull request as ready for review March 27, 2026 10:03
@mraszyk mraszyk requested a review from a team as a code owner March 27, 2026 10:03
@mraszyk mraszyk enabled auto-merge April 1, 2026 06:43
@mraszyk mraszyk added this pull request to the merge queue Apr 1, 2026
Merged via the queue into master with commit 6b069e7 Apr 1, 2026
64 of 65 checks passed
@mraszyk mraszyk deleted the mraszyk/no-subnet-message-on-aborted-install-code branch April 1, 2026 08:07