perf: optimize ThreadLocal overhead and JVM-compile eval STRING closures #486
Open
fglock wants to merge 9 commits into feature/multiplicity from
Cache the ThreadLocal lookup result at method entry instead of calling PerlRuntime.current() multiple times per method. This eliminates redundant ThreadLocal lookups in hot paths:
- GlobalVariable.getGlobalCodeRef(): 4 lookups → 1
- GlobalVariable.getGlobalVariable/Array/Hash(): 2 lookups → 1
- GlobalVariable.definedGlob(): 7 lookups → 1
- GlobalVariable.isPackageLoaded(): 3 lookups → 1
- InheritanceResolver.findMethodInHierarchy(): ~8 lookups → 1
- InheritanceResolver.linearizeHierarchy(): ~5 lookups → 1
- InheritanceResolver.invalidateCache(): 4 lookups → 1

Also optimized several other GlobalVariable accessors: defineGlobalCodeRef, replacePinnedCodeRef, aliasGlobalVariable, setGlobAlias, getGlobalIO, getGlobalFormatRef, definedGlobalFormatAsScalar, resetGlobalVariables, resolveStashAlias.

Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
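The caching pattern described in this commit can be illustrated with a minimal sketch. `DemoRuntime`, `GlobalsDemo`, and the `lookups` counter are hypothetical stand-ins invented for illustration; they are not the project's `PerlRuntime` or `GlobalVariable` code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-thread runtime (not the project's PerlRuntime).
class DemoRuntime {
    // Counts ThreadLocal lookups; single-threaded illustration only.
    static int lookups = 0;

    private static final ThreadLocal<DemoRuntime> CURRENT =
            ThreadLocal.withInitial(DemoRuntime::new);

    final Map<String, String> pinned = new HashMap<>();
    final Map<String, String> scalars = new HashMap<>();

    static DemoRuntime current() {
        lookups++;
        return CURRENT.get();
    }
}

class GlobalsDemo {
    // Before: every access goes through the ThreadLocal again.
    static String getUncached(String name) {
        if (DemoRuntime.current().pinned.containsKey(name)) {
            return DemoRuntime.current().pinned.get(name);
        }
        return DemoRuntime.current().scalars.get(name);
    }

    // After: one lookup at method entry, then plain field access.
    static String getCached(String name) {
        DemoRuntime rt = DemoRuntime.current();
        if (rt.pinned.containsKey(name)) {
            return rt.pinned.get(name);
        }
        return rt.scalars.get(name);
    }
}
```

Both methods return the same value; the only difference is how many times the ThreadLocal is consulted on the way there.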
Consolidate 11 separate ThreadLocal stacks from WarningBitsRegistry, HintHashRegistry, and RuntimeCode into PerlRuntime instance fields. This reduces ThreadLocal lookups per subroutine call from ~14-17 (one per ThreadLocal.get()) to 1 (PerlRuntime.current(), then direct field access).

Migrated ThreadLocals:
- WarningBitsRegistry: currentBitsStack, callSiteBits, callerBitsStack, callSiteHints, callerHintsStack, callSiteHintHash, callerHintHashStack
- HintHashRegistry: callSiteSnapshotId, callerSnapshotIdStack
- RuntimeCode: evalRuntimeContext, argsStack

The shared static ConcurrentHashMaps (WarningBitsRegistry.registry, HintHashRegistry.snapshotRegistry) remain static as they are shared across runtimes and only written at compile time.

Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
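A rough sketch of the consolidation idea follows. The class and field names (`SeparateStacks`, `RuntimeContext`, `warningBits`, `hints`) are assumptions made up for this example and do not mirror the real registries.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Before (sketch): each piece of per-thread state is its own ThreadLocal,
// so touching N of them costs N ThreadLocal.get() calls.
class SeparateStacks {
    static final ThreadLocal<Deque<String>> WARNING_BITS =
            ThreadLocal.withInitial(ArrayDeque::new);
    static final ThreadLocal<Deque<Integer>> HINTS =
            ThreadLocal.withInitial(ArrayDeque::new);
}

// After (sketch): a single ThreadLocal holds one context object, and the
// former ThreadLocal stacks become ordinary instance fields on it.
class RuntimeContext {
    private static final ThreadLocal<RuntimeContext> CURRENT =
            ThreadLocal.withInitial(RuntimeContext::new);

    final Deque<String> warningBits = new ArrayDeque<>();
    final Deque<Integer> hints = new ArrayDeque<>();

    static RuntimeContext current() {
        return CURRENT.get();
    }
}
```

With the second layout, a caller does `RuntimeContext rt = RuntimeContext.current();` once and then reads `rt.warningBits` and `rt.hints` directly, which is the "1 lookup, then direct field access" shape the commit describes.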
Add pushCallerState/popCallerState and pushSubState/popSubState batch methods to PerlRuntime, replacing 8-12 separate PerlRuntime.current() calls per subroutine call with just 2.

- Closure: 569 -> 601 ops/s (+5.6%)
- Method: 319 -> 336 ops/s (+5.3%)
- Lexical: 375K -> 458K ops/s (+22.2%)

Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
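The batching approach might look roughly like the sketch below. Only the method names `pushCallerState` and `popCallerState` come from the commit message; the parameters, the fields, and the `CallState` and `SubCallDemo` classes are hypothetical, and the real methods may be shaped differently.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntSupplier;

// Hypothetical per-thread call state (not the project's PerlRuntime).
class CallState {
    static int lookups = 0; // counts ThreadLocal lookups, illustration only

    private static final ThreadLocal<CallState> CURRENT =
            ThreadLocal.withInitial(CallState::new);

    private final Deque<String> callerBits = new ArrayDeque<>();
    private final Deque<Integer> callerHints = new ArrayDeque<>();
    private final Deque<Integer> callerSnapshotIds = new ArrayDeque<>();

    private static CallState current() {
        lookups++;
        return CURRENT.get();
    }

    // One lookup pushes all caller-side state together.
    static void pushCallerState(String bits, int hints, int snapshotId) {
        CallState st = current();
        st.callerBits.push(bits);
        st.callerHints.push(hints);
        st.callerSnapshotIds.push(snapshotId);
    }

    // One lookup pops all of it again.
    static void popCallerState() {
        CallState st = current();
        st.callerBits.pop();
        st.callerHints.pop();
        st.callerSnapshotIds.pop();
    }

    // Uncounted helper so the demo can observe the stack depth.
    static int depth() {
        return CURRENT.get().callerBits.size();
    }
}

class SubCallDemo {
    // A subroutine call costs two lookups in total: one push, one pop.
    static int call(IntSupplier body) {
        CallState.pushCallerState("bits", 0, 0);
        try {
            return body.getAsInt();
        } finally {
            CallState.popCallerState();
        }
    }
}
```

The `try`/`finally` keeps the stacks balanced even when the body throws, which matters for any push/pop scheme of this kind.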
RegexState.save() and restore() each called 13 individual RuntimeRegex static accessors, each doing its own PerlRuntime.current() ThreadLocal lookup. Replaced with a single PerlRuntime.current() call and direct field access in both constructor and dynamicRestoreState(). Eliminates 24 ThreadLocal lookups per subroutine call.

JFR profiling showed RegexState was the dominant ThreadLocal overhead source (126 of 143 PerlRuntime.current() samples in closure benchmark).

- Closure: 601 -> 814 ops/s (+35%, now -5.7% vs master)
- Method: 336 -> 399 ops/s (+19%, now -8.5% vs master)

Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
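As a simplified illustration of the snapshot pattern, the sketch below saves and restores three fields with one lookup each way. `MatchState` and `MatchSnapshot` are invented names, and the real RegexState is described above as covering 13 fields rather than three.

```java
// Hypothetical per-thread regex match state (a 3-field stand-in).
class MatchState {
    private static final ThreadLocal<MatchState> CURRENT =
            ThreadLocal.withInitial(MatchState::new);

    String lastMatch;
    int lastStart = -1;
    int lastEnd = -1;

    static MatchState current() {
        return CURRENT.get();
    }
}

// Snapshot that copies every field using a single lookup per direction,
// instead of one static accessor (and one lookup) per field.
final class MatchSnapshot {
    private final String lastMatch;
    private final int lastStart;
    private final int lastEnd;

    MatchSnapshot() {
        MatchState s = MatchState.current(); // the only lookup on save
        this.lastMatch = s.lastMatch;
        this.lastStart = s.lastStart;
        this.lastEnd = s.lastEnd;
    }

    void restore() {
        MatchState s = MatchState.current(); // the only lookup on restore
        s.lastMatch = lastMatch;
        s.lastStart = lastStart;
        s.lastEnd = lastEnd;
    }
}
```

With 13 fields, the per-accessor version would perform 26 lookups per save/restore pair against 2 here, which matches the reported saving of 24.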
…cy.md

Document the profiling findings (RegexState was dominant overhead), optimization tiers applied, benchmark results, and remaining opportunities.

Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
EmitterMethodCreator unconditionally emitted RegexState.save() at every subroutine entry, creating and pushing a 13-field snapshot even when the subroutine never uses regex. Now uses RegexUsageDetector to check the AST at compile time and only emits save/restore when the body contains regex operations or eval STRING (which may introduce regex at runtime).

This is safe because subroutines without regex don't modify regex state, and any callees that use regex do their own save/restore at their boundary.

- Closure: 814 -> 1177 ops/s (+44%, now +36% FASTER than master)
- Method: 399 -> 417 ops/s (+5%, now -4.4% vs master)

Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
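The compile-time check can be pictured as a recursive walk over the subroutine body. The sketch below uses a toy AST; `AstNode`, `RegexUsageSketch`, and the node kind strings are hypothetical and are not taken from the project's RegexUsageDetector.

```java
import java.util.Arrays;
import java.util.List;

// Toy AST node: a kind label plus child nodes (hypothetical shape).
class AstNode {
    final String kind;
    final List<AstNode> children;

    AstNode(String kind, AstNode... children) {
        this.kind = kind;
        this.children = Arrays.asList(children);
    }
}

class RegexUsageSketch {
    // Returns true if the body contains a regex operation, or an eval STRING
    // (which could introduce regex at runtime, so it is treated conservatively).
    static boolean needsRegexSave(AstNode node) {
        if (node.kind.equals("match")
                || node.kind.equals("subst")
                || node.kind.equals("evalString")) {
            return true;
        }
        for (AstNode child : node.children) {
            if (needsRegexSave(child)) {
                return true;
            }
        }
        return false;
    }
}
```

An emitter would then wrap the body in save/restore only when `needsRegexSave(body)` is true. Treating eval STRING as "may use regex" errs on the side of extra saves rather than missing state.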
Closure is now +36% faster than master (was -34%). Method is now -4.4% vs master (was -27%).

Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Migrate the following to per-PerlRuntime instance fields for multiplicity thread-safety:
- Mro caches (packageGenerations, isaRevCache, pkgGenIsaState)
- RuntimeIO.openHandles LRU cache
- RuntimeRegex.optimizedRegexCache
- OutputFieldSeparator.internalOFS
- OutputRecordSeparator.internalORS
- ByteCodeSourceMapper (all 7 fields via new State inner class)

No performance regression vs baseline benchmarks.

Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Previously, BytecodeCompiler.visitAnonymousSubroutine() always compiled anonymous sub bodies to InterpretedCode. Hot closures created via eval STRING (e.g., Benchmark.pm's timing wrapper) ran in the bytecode interpreter instead of as native JVM bytecode.

Now tries JVM compilation first via EmitterMethodCreator.createClassWithMethod(), falling back to the interpreter on any failure. A new JvmClosureTemplate class holds the JVM-compiled class and instantiates closures with captured variables via reflection.

Measured 4.5x speedup for eval STRING closures in isolation (6.4M iter/s vs 1.4M iter/s).

Updated benchmark results in concurrency.md - all previously regressed benchmarks now match or exceed master.

Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
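The "try JVM compilation, fall back to the interpreter" flow with reflective instantiation can be sketched as below. Every name in the sketch is hypothetical: `CompiledIncrement` stands in for a generated class, `InterpretedIncrement` for interpreted code, and the `Supplier` models a compile step that may fail. None of this is the actual JvmClosureTemplate API.

```java
import java.lang.reflect.Constructor;
import java.util.function.IntSupplier;
import java.util.function.Supplier;

// Stand-in for a JVM-compiled closure class: captured variables arrive
// through the constructor.
class CompiledIncrement implements IntSupplier {
    private final int[] captured;

    public CompiledIncrement(int[] captured) {
        this.captured = captured;
    }

    public int getAsInt() {
        return captured[0] + 1;
    }
}

// Stand-in for the interpreter path, with the same observable behaviour.
class InterpretedIncrement implements IntSupplier {
    private final int[] captured;

    InterpretedIncrement(int[] captured) {
        this.captured = captured;
    }

    public int getAsInt() {
        return captured[0] + 1;
    }
}

class EvalClosureSketch {
    // Try the compiled path first; on any failure use the interpreter stand-in.
    static IntSupplier makeClosure(Supplier<Class<? extends IntSupplier>> jvmCompiler,
                                   int[] captured) {
        try {
            Class<? extends IntSupplier> cls = jvmCompiler.get();
            Constructor<? extends IntSupplier> ctor =
                    cls.getDeclaredConstructor(int[].class);
            // Cast to Object so the array is passed as one constructor argument.
            return ctor.newInstance((Object) captured);
        } catch (ReflectiveOperationException | RuntimeException | LinkageError e) {
            return new InterpretedIncrement(captured);
        }
    }
}
```

Because both paths yield the same interface, callers cannot tell which one they received except by speed, which is what makes a blanket fallback workable.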
Summary
Performance optimizations for the multiplicity branch, reducing ThreadLocal routing overhead and adding JVM compilation for anonymous subs inside eval STRING.
Optimizations applied (in order)
1. Cache PerlRuntime.current() in local variables in hot paths (GlobalVariable, InheritanceResolver): eliminates ~12 ThreadLocal lookups per method cache miss.
2. Consolidate the WarningBitsRegistry, HintHashRegistry, and RuntimeCode.argsStack ThreadLocal stacks into PerlRuntime instance fields: reduces 14-17 separate ThreadLocal lookups to 1 per sub call.
3. Batch caller and sub state handling via pushCallerState/popCallerState and pushSubState/popSubState: reduces 8-12 lookups to 2 per sub call.
4. Batch RegexState save/restore into a single PerlRuntime.current() call each: eliminates 24 ThreadLocal lookups per sub call (13 getters + 13 setters, replaced by 2 lookups).
5. Skip RegexState save/restore entirely for subs that don't use regex: static analysis via RegexUsageDetector at compile time.
6. JVM-compile anonymous subs inside eval STRING: previously always compiled to InterpretedCode, now tries JVM compilation first with interpreter fallback. 4.5x speedup for eval STRING closures in isolation.
Benchmark results (vs master)
All benchmarks that originally regressed from multiplicity are now at or above master. The regex benchmark shows a small regression from ThreadLocal routing in regex-heavy paths.
Test plan
- make: all unit tests pass (BUILD SUCCESSFUL)
- prove -r src/test/resources/unit: 167 files, 7314 tests pass
- op/eval.t: 156/174 pass (18 pre-existing failures)
- op/closure.t: 246/266 pass (20 pre-existing failures)

Generated with Devin