Configurable native stack size for Stylus with auto-retry on overflow#4538
Configurable native stack size for Stylus with auto-retry on overflow#4538
Conversation
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #4538 +/- ##
==========================================
- Coverage 34.07% 33.80% -0.27%
==========================================
Files 497 497
Lines 59178 59791 +613
==========================================
+ Hits 20164 20212 +48
- Misses 35455 36021 +566
+ Partials 3559 3558 -1 |
❌ 8 Tests Failed:
View the top 3 failed tests by shortest run time
📣 Thoughts on this report? Let Codecov know! | Powered by Codecov |
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
| return userNativeStackOverflow, nil | ||
| } | ||
| if !runCtx.IsExecutedOnChain() { | ||
| log.Warn("native stack overflow, no stack doubling for off-chain execution", |
There was a problem hiding this comment.
don't warn here, can log.Info
| "program", address, "module", moduleHash) | ||
| return userNativeStackOverflow, nil | ||
| } | ||
| if hasDoubledNativeStack.Load() { |
There was a problem hiding this comment.
- we still need to try again with cranelift (not sure if first attempt was cranelift or not, and there might've been a race condition where initial run was actually with less stack
- I don't hasDoubled as a separate atomic.. feels racy even if it's not too important. Can probably just have a single "doubleStackSize" function that handles setting and will never race with itself.
There was a problem hiding this comment.
Yeah. I think this Load() could (in theory) race with the Store on line 640.
In practice, though, isn't the EVM single-threaded? Also, wouldn't a race cause an idempotent "duplicate" doubling from the same baseStackSize to the same newStackSize?
| f.String(prefix+".host", DefaultStylusTargetConfig.Host, "stylus programs compilation target for system other than 64-bit ARM or 64-bit x86") | ||
| f.StringSlice(prefix+".extra-archs", DefaultStylusTargetConfig.ExtraArchs, fmt.Sprintf("Comma separated list of extra architectures to cross-compile stylus program to and cache in wasm store (additionally to local target). Currently must include at least %s. (supported targets: %s, %s, %s, %s)", rawdb.TargetWavm, rawdb.TargetWavm, rawdb.TargetArm64, rawdb.TargetAmd64, rawdb.TargetHost)) | ||
| f.Bool(prefix+".allow-fallback", DefaultStylusTargetConfig.AllowFallback, "if true, fall back to an alternative compiler when compilation of a Stylus program fails") | ||
| f.Uint64(prefix+".native-stack-size", DefaultStylusTargetConfig.NativeStackSize, "initial native stack size in bytes for Wasmer coroutines used by Stylus execution (0 = default 1MB). On native stack overflow with allow-fallback, the stack size is doubled once (capped at 100MB) and the call is retried with cranelift-compiled code") |
There was a problem hiding this comment.
nit: no need for that much info here. Enough to say this is native stack size for wasmer, no need to detail our fallback.
| "program", address, "module", moduleHash) | ||
| return userNativeStackOverflow, nil | ||
| } | ||
| if hasDoubledNativeStack.Load() { |
There was a problem hiding this comment.
Yeah. I think this Load() could (in theory) race with the Store on line 640.
In practice, though, isn't the EVM single-threaded? Also, wouldn't a race cause an idempotent "duplicate" doubling from the same baseStackSize to the same newStackSize?
|
|
||
| depth := evm.Depth() | ||
| data, msg, err := status.toResult(rustBytesIntoBytes(output), debug) | ||
| if status == userNativeStackOverflow { |
There was a problem hiding this comment.
Can we at least document why we're panic-ing here?
It makes me nervous. Wouldn't it mean that even for calls like eth_call or eth_estimateGas, we might have nodes crashing. Should we return an error for those cases where !runCtx.IsExecutedOnChain() and save the panic for when we're actually trying to execute on chain?
| // with a doubled stack size. | ||
| type savedState struct { | ||
| gas uint64 | ||
| usedMultiGas multigas.MultiGas |
There was a problem hiding this comment.
Are we sure these are the only 4 fields we need to save?
Like, one that pops into my mind is Contract.RetainedMultiGas
But, I'm not at all confident. Just want to make sure we're not going to be missing something.
There was a problem hiding this comment.
Here's some more context on a similar question #4538 (comment)
RetainedMultiGas is exclusively modified by EVM interpreter opcodes — CALL, STATICCALL, DELEGATECALL, SLOAD, etc. in go-ethereum/core/vm/instructions.go and operations_acl.go. These are EVM-level accounting entries that track gas forwarded to child calls.
In the Stylus path, the callback handlers in arbos/programs/api.go only modify scope.Contract.UsedMultiGas (via SaturatingAddInto). They never touch RetainedMultiGas. Even when a Stylus program makes subcalls, the returned multi-gas goes into UsedMultiGas, not RetainedMultiGas.
|
|
||
| // SetInitialNativeStackSize configures the Wasmer coroutine stack size and | ||
| // records it as the baseline for overflow recovery. Call once at node startup. | ||
| func SetInitialNativeStackSize(size uint64) { |
There was a problem hiding this comment.
Optional:
I could be too paranoid here. But, setters that absolutely need to be called make me nervous. I could see someone coming long later and thinking. We're actually never setting this, let's just remove the call, and allow the default of 1MB as a starting stack size.
I think it might make sense to actually record that this method has been called. Maybe in a package global or something and then check it and panic with a programming error if it wasn't called at least once at process startup.
Is this just too paranoid?
There was a problem hiding this comment.
My thinking is if someone comes alone and think this is not being used they will see compilation errors on the callsites in execution/gethexec/wasmstorerebuilder.go and execution/gethexec/executionengine.go. On another note globals to guard a usage like this almost sounds like an overkill?
- Delete the function → compile errors at the two call sites → developer investigates why they exist
- Delete the call sites → SetInitialNativeStackSize becomes unused → linters/review catch dead code, and the developer has to look at the function to understand why it existed
- Delete both → nativeStackBaseline becomes unused → same signal
Each layer references the next, so we can't cleanly remove any piece without confronting the others. Adding a runtime guard (atomic.Bool) on top of that sounds a bit redundant.
Maybe a brief comment on SetInitialNativeStackSize noting why it must be called (establishes the baseline for doubleNativeStackSize) could be enough, adding that. But let me know if you feel uneasy (strong) with this and I can add a package global for it
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
| if base, err := rawdb.BaseTarget(target); err == nil { | ||
| compileTarget = base | ||
| } | ||
| } |
There was a problem hiding this comment.
Instead of that, the function PopulateStylusTargetCache, so that for every target you'll set flags for both cranelift and non-cranelift we'll do: programs.SetTarget(target, effectiveStylusTarget, isNative)
Set the cranelift version before the non-cranelift even though it doesn't really matter.
Later, we'll be able to use that to have different compile target/cpu/etc for cranelift vs singlepass.
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
| if newStackSize <= baseStackSize { | ||
| return false | ||
| } | ||
| if !nativeStackBaseline.CompareAndSwap(baseStackSize, 0) { |
There was a problem hiding this comment.
- Do compareAndSwap at the end of the function, that way it's impossible for one thread to return before another thread has finished setting nativeStackSize and DrainStackPool.
- Returning a bool from this function is a little suspect due to races.. since you only use it to print a log it's o.k., but I'd rather you return nothing from the function and print the log only if/after compareAndSwap succeeded, which will make it clearer.
| // contract gas, multi-gas, and Stylus page counters from the saved checkpoint. | ||
| // openWasmPages/everWasmPages are not journaled, so RevertToSnapshot alone | ||
| // does not restore them — we must do it explicitly. | ||
| func restoreState(scope *vm.ScopeContext, saved savedState, db vm.StateDB, snapshot int) { |
There was a problem hiding this comment.
let's take the encapsulation one step further.
Let's have a function:
saveState(scope *vm.ScopeContext, db vm.StateDB)
that returns saved state, and takes a snapshot and stores it as part of savedState,
and have a
func (s *savedState) restore(scope *vm.ScopeContext, db vm.StateDB)
That way it'll be very easy to see the relation between the two and easy to modify it when needed
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
--stylus-target.native-stack-sizenode config to set the initial Wasmer coroutine stack size for Stylus execution (default: 0 = Wasmer's 1 MB default)NativeStackOverflowvariant toUserOutcome/UserOutcomeKindto distinguish retriable native overflows from deterministic OutOfStack (DepthChecker)--stylus-target.allow-fallback=true(default):a. double the stack size once (capped at 100 MB) and retry with Cranelift
b. Persist the Cranelift-compiled ASM to the wasm store so subsequent overflows skip recompilation
c. If we hit native stack overflow when we have already doubled the stack, we panic
--stylus-target.allow-fallback=false, no retry is attempted on overfloweth_call, gas estimation) do not trigger retries or Cranelift compilationpulls in OffchainLabs/go-ethereum#645
pulls in OffchainLabs/wasmer#37
closes NIT-4686