Configurable native stack size for Stylus with auto-retry on overflow #4538

Merged
eljobe merged 48 commits into master from nm004-config-stylus-stack-size on Apr 10, 2026

Conversation

@bragaigor
Contributor

@bragaigor bragaigor commented Mar 20, 2026

  • Add --stylus-target.native-stack-size node config to set the initial Wasmer coroutine stack size for Stylus execution (default: 0 = Wasmer's 1 MB default)
  • Add NativeStackOverflow variant to UserOutcome/UserOutcomeKind to distinguish retriable native overflows from deterministic OutOfStack (DepthChecker)
  • On native stack overflow with --stylus-target.allow-fallback=true (default):
    a. Double the stack size once (capped at 100 MB) and retry with Cranelift
    b. Persist the Cranelift-compiled ASM to the wasm store so subsequent overflows skip recompilation
    c. If a native stack overflow occurs after the stack has already been doubled, panic
  • With --stylus-target.allow-fallback=false, no retry is attempted on overflow
  • Off-chain calls (eth_call, gas estimation) do not trigger retries or Cranelift compilation
  • Add new WasmTarget variants for Cranelift ASM storage (arm64-cranelift, amd64-cranelift, host-cranelift) in go-ethereum

pulls in OffchainLabs/go-ethereum#645
pulls in OffchainLabs/wasmer#37
closes NIT-4686
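
For readers skimming the description, the "double once, capped at 100 MB" rule can be sketched roughly as follows. This is only an illustration, not the code in this PR: `effectiveStackSize` and `doubledStackSize` are hypothetical names, and the sketch assumes 1 MB means 2^20 bytes and that a configured value of 0 resolves to the 1 MB default before doubling.

```go
package main

import "fmt"

const (
	// Assumed default used when the flag is 0 (the description says Wasmer's default is 1 MB).
	defaultNativeStackSize uint64 = 1 << 20
	// Cap on the doubled stack size, taken from the description (100 MB).
	maxNativeStackSize uint64 = 100 << 20
)

// effectiveStackSize resolves the configured value; 0 means "use the default".
// Hypothetical helper, not part of the PR.
func effectiveStackSize(configured uint64) uint64 {
	if configured == 0 {
		return defaultNativeStackSize
	}
	return configured
}

// doubledStackSize returns the stack size for the single retry and whether
// doubling actually increases the size. Hypothetical helper, not part of the PR.
func doubledStackSize(configured uint64) (uint64, bool) {
	base := effectiveStackSize(configured)
	doubled := maxNativeStackSize
	if base < maxNativeStackSize/2 {
		doubled = base * 2 // cannot overflow: base is below 50 MB here
	}
	if doubled <= base {
		return base, false // already at or above the cap, nothing to gain
	}
	return doubled, true
}

func main() {
	for _, configured := range []uint64{0, 60 << 20, 100 << 20} {
		size, ok := doubledStackSize(configured)
		fmt.Printf("configured=%d retrySize=%d canDouble=%t\n", configured, size, ok)
	}
}
```

The boolean mirrors the `newStackSize <= baseStackSize` check quoted later in the review: when the cap leaves no headroom, no doubling takes place.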

Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Comment thread crates/stylus/src/lib.rs Outdated
@codecov

codecov bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 7.82209% with 601 lines in your changes missing coverage. Please review.
✅ Project coverage is 33.80%. Comparing base (005cd96) to head (56e26a6).
⚠️ Report is 72 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4538      +/-   ##
==========================================
- Coverage   34.07%   33.80%   -0.27%     
==========================================
  Files         497      497              
  Lines       59178    59791     +613     
==========================================
+ Hits        20164    20212      +48     
- Misses      35455    36021     +566     
+ Partials     3559     3558       -1     

@github-actions
Contributor

github-actions bot commented Mar 20, 2026

❌ 8 Tests Failed:

Tests completed: 4865 | Failed: 8 | Passed: 4857 | Skipped: 0
View the top 3 failed tests by shortest run time
TestAliasingFlaky
Stack Traces | -0.000s run time
... [CONTENT TRUNCATED: Keeping last 20 lines]
INFO [04-10|10:40:47.077] Persisted trie from memory database      nodes=112 flushnodes=0 size=13.52KiB flushsize=0.00B time="225.461µs" flushtime=0s gcnodes=0 gcsize=0.00B gctime="3.717µs"   livenodes=110 livesize=20.44KiB
INFO [04-10|10:40:47.077] Writing cached state to disk             block=6  hash=aa93b5..60417d root=94189f..cc0c5a
INFO [04-10|10:40:47.077] Persisted trie from memory database      nodes=17  flushnodes=0 size=3.31KiB  flushsize=0.00B time="32.391µs"  flushtime=0s gcnodes=0 gcsize=0.00B gctime=0s          livenodes=93  livesize=17.13KiB
INFO [04-10|10:40:47.077] Writing cached state to disk             block=1  hash=f64e73..9f3478 root=c349f6..9b02e2
INFO [04-10|10:40:47.077] Persisted trie from memory database      nodes=25  flushnodes=0 size=4.56KiB  flushsize=0.00B time="62.456µs"  flushtime=0s gcnodes=0 gcsize=0.00B gctime=0s          livenodes=68  livesize=12.57KiB
INFO [04-10|10:40:47.077] Imported new potential chain segment     number=45 hash=0325f7..47e686 blocks=1  txs=1  mgas=0.021 elapsed="396.742µs" mgasps=52.931   triediffs=218.00KiB triedirty=0.00B
INFO [04-10|10:40:47.077] Writing snapshot state to disk           root=77ae46..2fbcae
INFO [04-10|10:40:47.078] Persisted trie from memory database      nodes=0   flushnodes=0 size=0.00B    flushsize=0.00B time="1.142µs"   flushtime=0s gcnodes=0 gcsize=0.00B gctime=0s          livenodes=68  livesize=12.57KiB
INFO [04-10|10:40:47.078] Chain head was updated                   number=45 hash=0325f7..47e686 root=3ebe8c..ef4466 elapsed="37.65µs"
INFO [04-10|10:40:47.078] Blockchain stopped
INFO [04-10|10:40:47.078] Ethereum protocol stopped
INFO [04-10|10:40:47.078] Transaction pool stopped
INFO [04-10|10:40:47.078] Persisting dirty state                   head=34 root=4c5b35..d20f65 layers=34
INFO [04-10|10:40:47.078] Starting work on payload                 id=0x037cdfe7978413db
INFO [04-10|10:40:47.078] Updated payload                          id=0x037cdfe7978413db                      number=46 hash=42bb48..4db166 txs=0  withdrawals=0 gas=0         fees=0              root=c809d7..d79836 elapsed="254.235µs"
INFO [04-10|10:40:47.079] Stopping work on payload                 id=0x037cdfe7978413db                      reason=delivery
INFO [04-10|10:40:47.079] Imported new potential chain segment     number=46 hash=42bb48..4db166 blocks=1  txs=0  mgas=0.000 elapsed="496.571µs" mgasps=0.000    triediffs=221.21KiB triedirty=0.00B
INFO [04-10|10:40:47.079] Chain head was updated                   number=46 hash=42bb48..4db166 root=c809d7..d79836 elapsed="49.232µs"
INFO [04-10|10:40:47.079] Persisted dirty state to disk            size=166.16KiB elapsed=1.329ms
INFO [04-10|10:40:47.080] Blockchain stopped
TestBatchPosterL1SurplusMatchesBatchGasFlaky
Stack Traces | 0.550s run time
... [CONTENT TRUNCATED: Keeping last 20 lines]
panic: runtime error: invalid memory address or nil pointer dereference [recovered, repanicked]
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x20f5db2]

goroutine 9 [running]:
testing.tRunner.func1.2({0x386bc60, 0x63069e0})
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1872 +0x237
testing.tRunner.func1()
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1875 +0x35b
panic({0x386bc60?, 0x63069e0?})
	/opt/hostedtoolcache/go/1.25.8/x64/src/runtime/panic.go:783 +0x132
github.com/offchainlabs/nitro/arbnode.(*InboxTracker).GetBatchCount(0x1656b900?)
	/home/runner/work/nitro/nitro/arbnode/inbox_tracker.go:211 +0x12
github.com/offchainlabs/nitro/arbnode.(*InboxTracker).FindInboxBatchContainingMessage(0x0, 0x7)
	/home/runner/work/nitro/nitro/arbnode/inbox_tracker.go:226 +0x2f
github.com/offchainlabs/nitro/system_tests.TestBatchPosterL1SurplusMatchesBatchGasFlaky(0xc000412fc0)
	/home/runner/work/nitro/nitro/system_tests/batch_poster_test.go:839 +0x725
testing.tRunner(0xc000412fc0, 0x4248498)
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1934 +0xea
created by testing.(*T).Run in goroutine 1
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1997 +0x465
TestAnyTrustRekey
Stack Traces | 2.270s run time
... [CONTENT TRUNCATED: Keeping last 20 lines]
TRACE[04-10|10:40:32.758] Handled RPC response                     reqid=1529 duration="32µs"
INFO [04-10|10:40:32.757] New Key                                  name=Sequencer           Address=0xb386a74Dcab67b66F8AC07B4f08365d37495Dd23
DEBUG[04-10|10:40:32.758] Served eth_getBlockByNumber              reqid=1375 duration="44.372µs"
DEBUG[04-10|10:40:32.758] Executing EVM call finished              runtime="57.026µs"
DEBUG[04-10|10:40:32.758] Served eth_call                          reqid=1556 duration="86.241µs"
TRACE[04-10|10:40:32.757] Handled RPC response                     reqid=1336 duration="1.312µs"
TRACE[04-10|10:40:32.758] Handled RPC response                     reqid=1556 duration=922ns
INFO [04-10|10:40:32.758] New Key                                  name=Validator           Address=0x83FFCFaCE2Fb0E1286686815503608A16EF41e47
TRACE[04-10|10:40:32.758] Handled RPC response                     reqid=1375 duration="1.573µs"
DEBUG[04-10|10:40:32.758] Served eth_getCode                       reqid=1557 duration="25.999µs"
TRACE[04-10|10:40:32.757] Handled RPC response                     reqid=1374 duration="1.152µs"
TRACE[04-10|10:40:32.758] Handled RPC response                     reqid=1557 duration=330ns
DEBUG[04-10|10:40:32.758] Served eth_getBlockByNumber              reqid=1338 duration="50.024µs"
TRACE[04-10|10:40:32.756] Engine API request received              method=GetPayload                    id=0x0385c7fe82baf8f0
TRACE[04-10|10:40:32.758] Handled RPC response                     reqid=1338 duration="1.062µs"
DEBUG[04-10|10:40:32.758] Served eth_estimateGas                   reqid=70   duration="386.02µs"
DEBUG[04-10|10:40:32.758] Served eth_getBlockByNumber              reqid=1339 duration="51.035µs"
TRACE[04-10|10:40:32.758] Performed indexed log search             begin=48 end=48 "true matches"=1 "false positives"=0 elapsed="41.918µs"
--- FAIL: TestAnyTrustRekey (2.27s)
DEBUG[04-10|10:40:32.758] Served eth_getLogs                       reqid=1375 duration="124.362µs"


Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
@bragaigor bragaigor marked this pull request as ready for review March 30, 2026 22:22
@bragaigor bragaigor marked this pull request as draft March 31, 2026 00:46
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Contributor

@tsahee tsahee left a comment

small changes

Comment thread arbos/programs/native.go Outdated
return userNativeStackOverflow, nil
}
if !runCtx.IsExecutedOnChain() {
log.Warn("native stack overflow, no stack doubling for off-chain execution",
Contributor

don't warn here, can log.Info

Comment thread arbos/programs/native.go Outdated
"program", address, "module", moduleHash)
return userNativeStackOverflow, nil
}
if hasDoubledNativeStack.Load() {
Contributor

  1. We still need to try again with cranelift (not sure if the first attempt was cranelift or not, and there might've been a race condition where the initial run actually had less stack).
  2. I don't like hasDoubled as a separate atomic. It feels racy, even if it's not too important. We can probably just have a single "doubleStackSize" function that handles the setting and never races with itself.

Member

Yeah. I think this Load() could (in theory) race with the Store on line 640.

In practice, though, isn't the EVM single-threaded? Also, wouldn't a race cause an idempotent "duplicate" doubling from the same baseStackSize to the same newStackSize?

Comment thread execution/gethexec/node.go Outdated
f.String(prefix+".host", DefaultStylusTargetConfig.Host, "stylus programs compilation target for system other than 64-bit ARM or 64-bit x86")
f.StringSlice(prefix+".extra-archs", DefaultStylusTargetConfig.ExtraArchs, fmt.Sprintf("Comma separated list of extra architectures to cross-compile stylus program to and cache in wasm store (additionally to local target). Currently must include at least %s. (supported targets: %s, %s, %s, %s)", rawdb.TargetWavm, rawdb.TargetWavm, rawdb.TargetArm64, rawdb.TargetAmd64, rawdb.TargetHost))
f.Bool(prefix+".allow-fallback", DefaultStylusTargetConfig.AllowFallback, "if true, fall back to an alternative compiler when compilation of a Stylus program fails")
f.Uint64(prefix+".native-stack-size", DefaultStylusTargetConfig.NativeStackSize, "initial native stack size in bytes for Wasmer coroutines used by Stylus execution (0 = default 1MB). On native stack overflow with allow-fallback, the stack size is doubled once (capped at 100MB) and the call is retried with cranelift-compiled code")
Contributor

nit: no need for that much info here. Enough to say this is native stack size for wasmer, no need to detail our fallback.
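
A trimmed-down flag definition along those lines might look like the sketch below. It is only an illustration under assumptions: it uses the standard library `flag` package so it runs on its own (the snippet above uses a different flag set), and `parseNativeStackSize` is a hypothetical helper, not a function in this PR.

```go
package main

import (
	"flag"
	"fmt"
)

// parseNativeStackSize is a hypothetical helper that registers only the
// native-stack-size option with a short usage string and parses args.
func parseNativeStackSize(args []string) (uint64, error) {
	fs := flag.NewFlagSet("node", flag.ContinueOnError)
	size := fs.Uint64(
		"stylus-target.native-stack-size",
		0, // 0 keeps Wasmer's default
		"native stack size in bytes for Wasmer coroutines (0 = Wasmer default)",
	)
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *size, nil
}

func main() {
	size, err := parseNativeStackSize([]string{"--stylus-target.native-stack-size=2097152"})
	fmt.Println(size, err)
}
```

The usage string only says what the value is; the fallback behaviour stays documented elsewhere.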

@eljobe eljobe assigned eljobe and unassigned tsahee Apr 8, 2026
@eljobe eljobe requested review from diegoximenes and eljobe and removed request for diegoximenes April 8, 2026 07:51
Comment thread arbos/programs/native.go

depth := evm.Depth()
data, msg, err := status.toResult(rustBytesIntoBytes(output), debug)
if status == userNativeStackOverflow {
Member

Can we at least document why we're panic-ing here?

It makes me nervous. Wouldn't it mean that even for calls like eth_call or eth_estimateGas, we might have nodes crashing. Should we return an error for those cases where !runCtx.IsExecutedOnChain() and save the panic for when we're actually trying to execute on chain?
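
To make the proposal concrete, here is a rough sketch of the split being asked about: return an error for off-chain execution and panic only on chain. The names `onRepeatedOverflow` and `errNativeStackOverflow` are hypothetical, and the boolean parameter stands in for `runCtx.IsExecutedOnChain()`.

```go
package main

import (
	"errors"
	"fmt"
)

// errNativeStackOverflow is a hypothetical sentinel error for this sketch.
var errNativeStackOverflow = errors.New("native stack overflow after the stack was already doubled")

// onRepeatedOverflow sketches the suggested behaviour once the stack has
// already been doubled and the program overflows again.
func onRepeatedOverflow(executedOnChain bool) error {
	if !executedOnChain {
		// eth_call / eth_estimateGas style requests: report the failure, keep the node up.
		return errNativeStackOverflow
	}
	// On-chain execution: a second overflow is not recoverable in this sketch.
	panic(errNativeStackOverflow)
}

func main() {
	fmt.Println("off-chain result:", onRepeatedOverflow(false))
}
```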

Comment thread arbos/programs/native.go
// with a doubled stack size.
type savedState struct {
gas uint64
usedMultiGas multigas.MultiGas
Member

Are we sure these are the only 4 fields we need to save?
Like, one that pops into my mind is Contract.RetainedMultiGas

But, I'm not at all confident. Just want to make sure we're not going to be missing something.

Contributor Author

Here's some more context on a similar question #4538 (comment)

RetainedMultiGas is exclusively modified by EVM interpreter opcodes — CALL, STATICCALL, DELEGATECALL, SLOAD, etc. in go-ethereum/core/vm/instructions.go and operations_acl.go. These are EVM-level accounting entries that track gas forwarded to child calls.

In the Stylus path, the callback handlers in arbos/programs/api.go only modify scope.Contract.UsedMultiGas (via SaturatingAddInto). They never touch RetainedMultiGas. Even when a Stylus program makes subcalls, the returned multi-gas goes into UsedMultiGas, not RetainedMultiGas.

Comment thread arbos/programs/native.go

// SetInitialNativeStackSize configures the Wasmer coroutine stack size and
// records it as the baseline for overflow recovery. Call once at node startup.
func SetInitialNativeStackSize(size uint64) {
Member

Optional:

I could be too paranoid here, but setters that absolutely need to be called make me nervous. I could see someone coming along later and thinking, "We're actually never setting this, so let's just remove the call and allow the default of 1MB as the starting stack size."

I think it might make sense to actually record that this method has been called. Maybe in a package global or something and then check it and panic with a programming error if it wasn't called at least once at process startup.

Is this just too paranoid?
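
A minimal sketch of the kind of guard meant here, assuming a small struct in place of package globals so it can be exercised in isolation. `stackConfig` and `baselineOrPanic` are hypothetical names; only `SetInitialNativeStackSize` comes from the diff.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// stackConfig is a hypothetical stand-in for the package-level state.
type stackConfig struct {
	baseline    atomic.Uint64
	initialized atomic.Bool // records that the setter ran at least once
}

// SetInitialNativeStackSize stores the baseline and marks it as set.
func (c *stackConfig) SetInitialNativeStackSize(size uint64) {
	c.baseline.Store(size)
	c.initialized.Store(true)
}

// baselineOrPanic returns the baseline, or panics with a programming error
// if the setter was never called.
func (c *stackConfig) baselineOrPanic() uint64 {
	if !c.initialized.Load() {
		panic("programming error: SetInitialNativeStackSize was not called at startup")
	}
	return c.baseline.Load()
}

func main() {
	cfg := &stackConfig{}
	cfg.SetInitialNativeStackSize(1 << 20)
	fmt.Println("baseline:", cfg.baselineOrPanic())
}
```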

Contributor Author

My thinking is that if someone comes along and thinks this is not being used, they will see compilation errors at the call sites in execution/gethexec/wasmstorerebuilder.go and execution/gethexec/executionengine.go. On another note, a global to guard a usage like this sounds like overkill:

  1. Delete the function → compile errors at the two call sites → developer investigates why they exist
  2. Delete the call sites → SetInitialNativeStackSize becomes unused → linters/review catch dead code, and the developer has to look at the function to understand why it existed
  3. Delete both → nativeStackBaseline becomes unused → same signal

Each layer references the next, so we can't cleanly remove any piece without confronting the others. Adding a runtime guard (atomic.Bool) on top of that sounds a bit redundant.

Maybe a brief comment on SetInitialNativeStackSize noting why it must be called (it establishes the baseline for doubleNativeStackSize) would be enough; I'm adding that. But let me know if you feel strongly about this and I can add a package global for it.

Contributor Author

added comments in 41e87c6

@eljobe eljobe assigned tsahee and unassigned eljobe Apr 8, 2026
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Comment thread arbos/programs/native.go Outdated
if base, err := rawdb.BaseTarget(target); err == nil {
compileTarget = base
}
}
Contributor

Instead of that, change the function PopulateStylusTargetCache so that, for every target, it sets flags for both the cranelift and the non-cranelift variant. For each we'll do: programs.SetTarget(target, effectiveStylusTarget, isNative)
Set the cranelift version before the non-cranelift one, even though it doesn't really matter.
Later, we'll be able to use that to have a different compile target/cpu/etc. for cranelift vs. singlepass.
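
Roughly, the registration loop could take the shape sketched below. Everything here is a stand-in: `targetEntry`, `craneliftVariant`, and `populateTargets` are hypothetical, the "-cranelift" suffix is inferred from the variant names in the PR description, and the recorded entries merely imitate what a call like programs.SetTarget(target, effectiveStylusTarget, isNative) would store.

```go
package main

import "fmt"

// targetEntry records one registration in this sketch.
type targetEntry struct {
	name            string
	effectiveTarget string
	isNative        bool
}

// craneliftVariant derives the cranelift target name (assumed naming scheme).
func craneliftVariant(target string) string {
	return target + "-cranelift"
}

// populateTargets registers, for every target, the cranelift variant first and
// then the plain one, both with the same effective target description.
func populateTargets(targets []string, effective func(string) string, localTarget string) []targetEntry {
	var entries []targetEntry
	for _, t := range targets {
		eff := effective(t)
		isNative := t == localTarget // assumption: native means "matches the local target"
		entries = append(entries,
			targetEntry{name: craneliftVariant(t), effectiveTarget: eff, isNative: isNative},
			targetEntry{name: t, effectiveTarget: eff, isNative: isNative},
		)
	}
	return entries
}

func main() {
	describe := func(t string) string { return "description-for-" + t }
	for _, e := range populateTargets([]string{"arm64", "amd64"}, describe, "amd64") {
		fmt.Printf("%-16s %-24s native=%t\n", e.name, e.effectiveTarget, e.isNative)
	}
}
```

Keeping the two variants as separate entries is what would later allow them to diverge in compile target or CPU features.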

Contributor Author

done in 20aea22

Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
tsahee
tsahee previously requested changes Apr 8, 2026
Comment thread arbos/programs/native.go Outdated
if newStackSize <= baseStackSize {
return false
}
if !nativeStackBaseline.CompareAndSwap(baseStackSize, 0) {
Contributor

  • Do compareAndSwap at the end of the function, that way it's impossible for one thread to return before another thread has finished setting nativeStackSize and DrainStackPool.
  • Returning a bool from this function is a little suspect due to races.. since you only use it to print a log it's o.k., but I'd rather you return nothing from the function and print the log only if/after compareAndSwap succeeded, which will make it clearer.
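
As a rough illustration of both points, the sketch below applies the idempotent side effects first and performs the CompareAndSwap last, logging only in the caller that wins it. It is a simplified stand-in, not the code in this PR: `stackSizer` and its fields are hypothetical, `current` stands in for the size handed to Wasmer, and draining the stack pool is only hinted at in a comment.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const maxNativeStackSize uint64 = 100 << 20 // cap from the PR description

// stackSizer is a hypothetical stand-in for the package-level state.
type stackSizer struct {
	baseline atomic.Uint64 // initial size; set to 0 once doubling has happened
	current  atomic.Uint64 // stand-in for the stack size handed to Wasmer
	logged   atomic.Int32  // how many callers won the CAS (expected: at most 1)
}

func newStackSizer(initial uint64) *stackSizer {
	s := &stackSizer{}
	s.baseline.Store(initial)
	s.current.Store(initial)
	return s
}

// doubleNativeStackSize returns nothing; racing callers all write the same
// value, and only the CAS winner logs.
func (s *stackSizer) doubleNativeStackSize() {
	base := s.baseline.Load()
	if base == 0 {
		return // someone already completed the doubling
	}
	newSize := maxNativeStackSize
	if base < maxNativeStackSize/2 {
		newSize = base * 2
	}
	if newSize <= base {
		return
	}
	s.current.Store(newSize) // idempotent; the real code would also drain the stack pool here
	if s.baseline.CompareAndSwap(base, 0) {
		s.logged.Add(1)
		fmt.Printf("doubled native stack size from %d to %d\n", base, newSize)
	}
}

// runConcurrently calls f from n goroutines and waits for all of them.
func runConcurrently(n int, f func()) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			f()
		}()
	}
	wg.Wait()
}

func main() {
	s := newStackSizer(1 << 20)
	runConcurrently(8, s.doubleNativeStackSize)
	fmt.Println("current:", s.current.Load(), "log lines:", s.logged.Load())
}
```

With this ordering, any caller that observes a zero baseline knows the new size has already been stored by whoever won the CAS.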

Contributor Author

moved compareAndSwap in 9dc0493

Comment thread arbos/programs/native.go Outdated
// contract gas, multi-gas, and Stylus page counters from the saved checkpoint.
// openWasmPages/everWasmPages are not journaled, so RevertToSnapshot alone
// does not restore them — we must do it explicitly.
func restoreState(scope *vm.ScopeContext, saved savedState, db vm.StateDB, snapshot int) {
Contributor

let's take the encapsulation one step further.
Let's have a function:

saveState(scope *vm.ScopeContext, db vm.StateDB)

that returns saved state, and takes a snapshot and stores it as part of savedState,
and have a

func (s *savedState) restore(scope *vm.ScopeContext, db vm.StateDB)

That way it'll be very easy to see the relation between the two and easy to modify it when needed
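
Something like the following sketch shows the shape I have in mind. The types are simplified stand-ins so the example runs on its own: `scopeContext` and `stateDB` are hypothetical replacements for vm.ScopeContext and vm.StateDB, and only two of the saved fields are modelled.

```go
package main

import "fmt"

// stateDB is a toy stand-in for vm.StateDB: journaled changes are reverted by
// RevertToSnapshot, while openWasmPages is deliberately not journaled.
type stateDB struct {
	journal       []string
	openWasmPages uint16
}

func (db *stateDB) Snapshot() int           { return len(db.journal) }
func (db *stateDB) RevertToSnapshot(id int) { db.journal = db.journal[:id] }

// scopeContext is a toy stand-in for vm.ScopeContext.
type scopeContext struct {
	gas uint64
}

// savedState bundles everything needed to undo a failed attempt, including
// the snapshot id, so save and restore stay in one place.
type savedState struct {
	gas           uint64
	openWasmPages uint16
	snapshot      int
}

// saveState captures the checkpoint and takes the snapshot itself.
func saveState(scope *scopeContext, db *stateDB) savedState {
	return savedState{
		gas:           scope.gas,
		openWasmPages: db.openWasmPages,
		snapshot:      db.Snapshot(),
	}
}

// restore reverts journaled state and then explicitly restores the
// non-journaled fields.
func (s *savedState) restore(scope *scopeContext, db *stateDB) {
	db.RevertToSnapshot(s.snapshot)
	scope.gas = s.gas
	db.openWasmPages = s.openWasmPages
}

func main() {
	db := &stateDB{journal: []string{"before"}, openWasmPages: 1}
	scope := &scopeContext{gas: 1000}
	saved := saveState(scope, db)

	// Simulate a failed first attempt.
	db.journal = append(db.journal, "during")
	db.openWasmPages = 5
	scope.gas = 10

	saved.restore(scope, db)
	fmt.Println(len(db.journal), db.openWasmPages, scope.gas)
}
```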

Contributor Author

added in 9dc0493

Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
@eljobe eljobe enabled auto-merge April 10, 2026 07:31
@eljobe eljobe added this pull request to the merge queue Apr 10, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 10, 2026
@eljobe eljobe enabled auto-merge April 10, 2026 10:26
@eljobe eljobe added this pull request to the merge queue Apr 10, 2026
Merged via the queue into master with commit f7859c3 Apr 10, 2026
25 checks passed
@eljobe eljobe deleted the nm004-config-stylus-stack-size branch April 10, 2026 11:28
6 participants