perf: optimize ThreadLocal overhead and JVM-compile eval STRING closures by fglock · Pull Request #486 · fglock/PerlOnJava

fglock · 2026-04-10T20:41:37Z

Summary

Performance optimizations for the multiplicity branch, reducing ThreadLocal routing overhead and adding JVM compilation for anonymous subs inside eval STRING.

Optimizations applied (in order)

Tier 1: Cache PerlRuntime.current() in local variables in hot paths (GlobalVariable, InheritanceResolver) — eliminates ~12 ThreadLocal lookups per method cache miss
Tier 2: Migrate WarningBitsRegistry, HintHashRegistry, and RuntimeCode.argsStack ThreadLocal stacks into PerlRuntime instance fields — reduces 14-17 separate ThreadLocal lookups to 1 per sub call
Tier 2b: Batch pushCallerState/popCallerState and pushSubState/popSubState — reduces 8-12 lookups to 2 per sub call
Tier 2c: Batch RegexState save/restore into a single PerlRuntime.current() call each: the 26 ThreadLocal lookups per sub call (13 getters + 13 setters) drop to 2, eliminating 24
Tier 2d: Skip RegexState save/restore entirely for subs that don't use regex — static analysis via RegexUsageDetector at compile time
Tier 3: JVM-compile anonymous subs inside eval STRING — previously always compiled to InterpretedCode, now tries JVM compilation first with interpreter fallback. 4.5x speedup for eval STRING closures in isolation.

Benchmark results (vs master)

Benchmark	master (ops/s)	Current (ops/s)	vs master
closure	863	1,220	+41.4%
method	436	436	0.0%
lexical	394K	480K	+21.8%
global	78K	82K	+5.1%
eval_string	86K	89K	+3.1%
regex	51K	46K	-9.4%
string	29K	29K	+0.3%

All benchmarks that originally regressed from multiplicity are now at or above master. The regex benchmark is 9.4% below master, a regression attributed to ThreadLocal routing in regex-heavy paths.

Test plan

make — all unit tests pass (BUILD SUCCESSFUL)
prove -r src/test/resources/unit — 167 files, 7314 tests pass
op/eval.t — 156/174 pass (18 pre-existing failures)
op/closure.t — 246/266 pass (20 pre-existing failures)
Benchmark.pm works correctly with JVM-compiled eval closures

Generated with Devin

Cache the ThreadLocal lookup result at method entry instead of calling PerlRuntime.current() multiple times per method. This eliminates redundant ThreadLocal lookups in hot paths:
- GlobalVariable.getGlobalCodeRef(): 4 lookups → 1
- GlobalVariable.getGlobalVariable/Array/Hash(): 2 lookups → 1
- GlobalVariable.definedGlob(): 7 lookups → 1
- GlobalVariable.isPackageLoaded(): 3 lookups → 1
- InheritanceResolver.findMethodInHierarchy(): ~8 lookups → 1
- InheritanceResolver.linearizeHierarchy(): ~5 lookups → 1
- InheritanceResolver.invalidateCache(): 4 lookups → 1
Also optimized several other GlobalVariable accessors: defineGlobalCodeRef, replacePinnedCodeRef, aliasGlobalVariable, setGlobAlias, getGlobalIO, getGlobalFormatRef, definedGlobalFormatAsScalar, resetGlobalVariables, resolveStashAlias.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
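
The caching pattern in this commit can be illustrated with a minimal sketch. `DemoRuntime`, its `codeRefs` and `pinned` maps, and the `lookups` counter are hypothetical stand-ins for illustration only; they are not PerlOnJava's actual classes or fields.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for PerlRuntime: one instance per thread, reached via a ThreadLocal.
final class DemoRuntime {
    private static final ThreadLocal<DemoRuntime> CURRENT =
            ThreadLocal.withInitial(DemoRuntime::new);

    // Counts calls to current() so the two styles can be compared (single-threaded demo only).
    static int lookups = 0;

    final Map<String, String> codeRefs = new HashMap<>();
    final Map<String, Boolean> pinned = new HashMap<>();

    static DemoRuntime current() {
        lookups++;
        return CURRENT.get();
    }
}

final class CachedLookupDemo {
    // Before: every access goes through the ThreadLocal again.
    static String getCodeRefUncached(String name) {
        if (!DemoRuntime.current().codeRefs.containsKey(name)) {
            DemoRuntime.current().codeRefs.put(name, "CODE(" + name + ")");
        }
        DemoRuntime.current().pinned.put(name, Boolean.TRUE);
        return DemoRuntime.current().codeRefs.get(name);
    }

    // After: one lookup at method entry, then plain field access on the local.
    static String getCodeRefCached(String name) {
        DemoRuntime rt = DemoRuntime.current();
        if (!rt.codeRefs.containsKey(name)) {
            rt.codeRefs.put(name, "CODE(" + name + ")");
        }
        rt.pinned.put(name, Boolean.TRUE);
        return rt.codeRefs.get(name);
    }

    public static void main(String[] args) {
        DemoRuntime.lookups = 0;
        getCodeRefUncached("main::foo");
        System.out.println("uncached lookups: " + DemoRuntime.lookups);

        DemoRuntime.lookups = 0;
        getCodeRefCached("main::bar");
        System.out.println("cached lookups: " + DemoRuntime.lookups);
    }
}
```

For a name that is not yet defined, the uncached variant performs four lookups and the cached variant performs one. The local variable is safe here because the runtime bound to a thread does not change during the method.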

Consolidate 11 separate ThreadLocal stacks from WarningBitsRegistry, HintHashRegistry, and RuntimeCode into PerlRuntime instance fields. This reduces ThreadLocal lookups per subroutine call from ~14-17 (one per ThreadLocal.get()) to 1 (PerlRuntime.current(), then direct field access).
Migrated ThreadLocals:
- WarningBitsRegistry: currentBitsStack, callSiteBits, callerBitsStack, callSiteHints, callerHintsStack, callSiteHintHash, callerHintHashStack
- HintHashRegistry: callSiteSnapshotId, callerSnapshotIdStack
- RuntimeCode: evalRuntimeContext, argsStack
The shared static ConcurrentHashMaps (WarningBitsRegistry.registry, HintHashRegistry.snapshotRegistry) remain static as they are shared across runtimes and only written at compile time.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
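
A rough sketch of the migration described in this commit, reduced to three fields. The class names `SeparateThreadLocals` and `ConsolidatedRuntime`, the `enterSub` helper, and the element types are assumptions made for illustration; only the field names echo the ones listed in the commit message.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Before (sketch): each piece of per-thread state is its own ThreadLocal,
// so entering a sub pays one ThreadLocal.get() per field it touches.
final class SeparateThreadLocals {
    static final ThreadLocal<Deque<String>> currentBitsStack =
            ThreadLocal.withInitial(ArrayDeque::new);
    static final ThreadLocal<Deque<Object[]>> argsStack =
            ThreadLocal.withInitial(ArrayDeque::new);
    static final ThreadLocal<Deque<Integer>> callerHintsStack =
            ThreadLocal.withInitial(ArrayDeque::new);

    static void enterSub(String warningBits, Object[] args, int hints) {
        currentBitsStack.get().push(warningBits); // lookup 1
        argsStack.get().push(args);               // lookup 2
        callerHintsStack.get().push(hints);       // lookup 3
    }
}

// After (sketch): the same state lives in instance fields of one per-thread object,
// so a single ThreadLocal lookup is followed by direct field access.
final class ConsolidatedRuntime {
    private static final ThreadLocal<ConsolidatedRuntime> CURRENT =
            ThreadLocal.withInitial(ConsolidatedRuntime::new);

    final Deque<String> currentBitsStack = new ArrayDeque<>();
    final Deque<Object[]> argsStack = new ArrayDeque<>();
    final Deque<Integer> callerHintsStack = new ArrayDeque<>();

    static ConsolidatedRuntime current() {
        return CURRENT.get();
    }

    static void enterSub(String warningBits, Object[] args, int hints) {
        ConsolidatedRuntime rt = current(); // the only ThreadLocal lookup
        rt.currentBitsStack.push(warningBits);
        rt.argsStack.push(args);
        rt.callerHintsStack.push(hints);
    }
}
```

Each thread still sees its own stacks in both versions; the difference is only how many ThreadLocal lookups it takes to reach them.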

Add pushCallerState/popCallerState and pushSubState/popSubState batch methods to PerlRuntime, replacing 8-12 separate PerlRuntime.current() calls per subroutine call with just 2.
Closure: 569 -> 601 ops/s (+5.6%)
Method: 319 -> 336 ops/s (+5.3%)
Lexical: 375K -> 458K ops/s (+22.2%)
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
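
The batching idea might look roughly like the following. The method names `pushCallerState` and `popCallerState` come from the commit message, but their signatures, the three stacks shown, and the `CallerRuntime` class are assumptions for this sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical per-thread runtime with a lookup counter (single-threaded demo only).
final class CallerRuntime {
    private static final ThreadLocal<CallerRuntime> CURRENT =
            ThreadLocal.withInitial(CallerRuntime::new);
    static int lookups = 0;

    final Deque<String> callerBitsStack = new ArrayDeque<>();
    final Deque<Integer> callerHintsStack = new ArrayDeque<>();
    final Deque<Integer> callerSnapshotIdStack = new ArrayDeque<>();

    static CallerRuntime current() {
        lookups++;
        return CURRENT.get();
    }

    // Unbatched helpers: each one performs its own lookup.
    static void pushCallerBits(String bits) { current().callerBitsStack.push(bits); }
    static void pushCallerHints(int hints) { current().callerHintsStack.push(hints); }
    static void pushCallerSnapshotId(int id) { current().callerSnapshotIdStack.push(id); }
    static void popCallerBits() { current().callerBitsStack.pop(); }
    static void popCallerHints() { current().callerHintsStack.pop(); }
    static void popCallerSnapshotId() { current().callerSnapshotIdStack.pop(); }

    // Batched: one lookup covers all pushes.
    static void pushCallerState(String bits, int hints, int snapshotId) {
        CallerRuntime rt = current();
        rt.callerBitsStack.push(bits);
        rt.callerHintsStack.push(hints);
        rt.callerSnapshotIdStack.push(snapshotId);
    }

    // Batched: one lookup covers all pops.
    static void popCallerState() {
        CallerRuntime rt = current();
        rt.callerBitsStack.pop();
        rt.callerHintsStack.pop();
        rt.callerSnapshotIdStack.pop();
    }
}
```

In this sketch a push/pop pair costs six lookups through the unbatched helpers and two through the batch methods, which mirrors the "to just 2" figure in the commit message.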

RegexState.save() and restore() each called 13 individual RuntimeRegex static accessors, each doing its own PerlRuntime.current() ThreadLocal lookup. Replaced with a single PerlRuntime.current() call and direct field access in both constructor and dynamicRestoreState(). Eliminates 24 ThreadLocal lookups per subroutine call.
JFR profiling showed RegexState was the dominant ThreadLocal overhead source (126 of 143 PerlRuntime.current() samples in closure benchmark).
Closure: 601 -> 814 ops/s (+35%, now -5.7% vs master)
Method: 336 -> 399 ops/s (+19%, now -8.5% vs master)
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
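
A minimal sketch of a snapshot object that saves and restores regex state with one lookup in each direction. The three fields stand in for the 13 mentioned above and are hypothetical, as are the class names `RegexRuntime`, `RegexStateSnapshot`, and `RegexStateDemo`.

```java
// Hypothetical per-thread runtime holding a small subset of regex state.
final class RegexRuntime {
    private static final ThreadLocal<RegexRuntime> CURRENT =
            ThreadLocal.withInitial(RegexRuntime::new);

    String lastMatchedString;
    int lastMatchStart = -1;
    int lastMatchEnd = -1;

    static RegexRuntime current() {
        return CURRENT.get();
    }
}

// Snapshot taken at sub entry and restored at sub exit.
final class RegexStateSnapshot {
    private final String lastMatchedString;
    private final int lastMatchStart;
    private final int lastMatchEnd;

    // Save: one ThreadLocal lookup, then direct field reads.
    RegexStateSnapshot() {
        RegexRuntime rt = RegexRuntime.current();
        this.lastMatchedString = rt.lastMatchedString;
        this.lastMatchStart = rt.lastMatchStart;
        this.lastMatchEnd = rt.lastMatchEnd;
    }

    // Restore: one ThreadLocal lookup, then direct field writes.
    void restore() {
        RegexRuntime rt = RegexRuntime.current();
        rt.lastMatchedString = this.lastMatchedString;
        rt.lastMatchStart = this.lastMatchStart;
        rt.lastMatchEnd = this.lastMatchEnd;
    }
}

final class RegexStateDemo {
    // A callee that performs a "match" but must not leak its state to the caller.
    static void calleeThatMatches() {
        RegexStateSnapshot saved = new RegexStateSnapshot();
        try {
            RegexRuntime rt = RegexRuntime.current();
            rt.lastMatchedString = "inner";
            rt.lastMatchStart = 0;
            rt.lastMatchEnd = 5;
        } finally {
            saved.restore();
        }
    }
}
```

After `calleeThatMatches()` returns, the caller sees the same regex state it had before the call.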

…cy.md
Document the profiling findings (RegexState was dominant overhead), optimization tiers applied, benchmark results, and remaining opportunities.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

EmitterMethodCreator unconditionally emitted RegexState.save() at every subroutine entry, creating and pushing a 13-field snapshot even when the subroutine never uses regex. Now uses RegexUsageDetector to check the AST at compile time and only emits save/restore when the body contains regex operations or eval STRING (which may introduce regex at runtime).
This is safe because subroutines without regex don't modify regex state, and any callees that use regex do their own save/restore at their boundary.
Closure: 814 -> 1177 ops/s (+44%, now 36% faster than master)
Method: 399 -> 417 ops/s (+5%, now -4.4% vs master)
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
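
The compile-time check could be sketched as a recursive walk over the subroutine body. The `Node` class, the operator names in `REGEX_OPS`, and `usesRegex` are hypothetical; the real RegexUsageDetector and PerlOnJava's AST types may be structured differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified AST node: an operator name plus child nodes.
final class Node {
    final String op;
    final List<Node> children;

    Node(String op, Node... children) {
        this.op = op;
        this.children = Arrays.asList(children);
    }
}

final class RegexUsageSketch {
    // Assumed operator names. "evalString" is included because code compiled
    // at runtime may introduce regex operations the static walk cannot see.
    private static final Set<String> REGEX_OPS =
            Set.of("matchRegex", "replaceRegex", "evalString");

    // Returns true if the tree rooted at n contains any operator from REGEX_OPS.
    static boolean usesRegex(Node n) {
        if (REGEX_OPS.contains(n.op)) {
            return true;
        }
        for (Node child : n.children) {
            if (usesRegex(child)) {
                return true;
            }
        }
        return false;
    }

    // The emitter would consult the detector once per subroutine at compile time.
    static String emitPrologue(Node body) {
        return usesRegex(body) ? "save-regex-state" : "no-op";
    }
}
```

The check errs on the side of saving state: any body the walk cannot prove regex-free, such as one containing eval STRING, keeps the save/restore.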

Closure is now 36% faster than master (was -34%). Method is now -4.4% vs master (was -27%).
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Migrate Mro caches (packageGenerations, isaRevCache, pkgGenIsaState), RuntimeIO.openHandles LRU cache, RuntimeRegex.optimizedRegexCache, OutputFieldSeparator.internalOFS, OutputRecordSeparator.internalORS, and ByteCodeSourceMapper (all 7 fields via new State inner class) to per-PerlRuntime instance fields for multiplicity thread-safety.
No performance regression vs baseline benchmarks.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Previously, BytecodeCompiler.visitAnonymousSubroutine() always compiled anonymous sub bodies to InterpretedCode. Hot closures created via eval STRING (e.g., Benchmark.pm's timing wrapper) ran in the bytecode interpreter instead of as native JVM bytecode.
Now tries JVM compilation first via EmitterMethodCreator.createClassWithMethod(), falling back to the interpreter on any failure. A new JvmClosureTemplate class holds the JVM-compiled class and instantiates closures with captured variables via reflection.
Measured 4.5x speedup for eval STRING closures in isolation (6.4M iter/s vs 1.4M iter/s).
Updated benchmark results in concurrency.md: all previously regressed benchmarks now match or exceed master.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
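
The try-JVM-first strategy and the reflective instantiation can be approximated as follows. This is only a sketch: `GeneratedAdder` is a hand-written stand-in for a class that the JVM backend would generate, `ClosureTemplateSketch` is a hypothetical analogue of JvmClosureTemplate, and the `jvmBackendSucceeds` flag simulates a compilation failure. The real code catches failures from bytecode generation rather than a flag.

```java
import java.lang.reflect.Constructor;
import java.util.function.IntUnaryOperator;

// Stand-in for a generated class; the captured variable arrives via the constructor.
final class GeneratedAdder implements IntUnaryOperator {
    private final int captured;

    GeneratedAdder(Integer captured) {
        this.captured = captured;
    }

    @Override
    public int applyAsInt(int x) {
        return x + captured;
    }
}

// Hypothetical template: resolves the constructor once, then creates one
// closure instance per set of captured values.
final class ClosureTemplateSketch {
    private final Constructor<?> constructor;

    ClosureTemplateSketch(Class<?> compiledClass, Class<?>... capturedTypes)
            throws NoSuchMethodException {
        this.constructor = compiledClass.getDeclaredConstructor(capturedTypes);
    }

    IntUnaryOperator instantiate(Object... capturedValues)
            throws ReflectiveOperationException {
        return (IntUnaryOperator) constructor.newInstance(capturedValues);
    }
}

final class EvalClosureCompiler {
    // Tries the JVM path first; on failure falls back to an "interpreted" closure.
    static IntUnaryOperator compile(int captured, boolean jvmBackendSucceeds) {
        try {
            if (!jvmBackendSucceeds) {
                throw new UnsupportedOperationException("simulated JVM compilation failure");
            }
            ClosureTemplateSketch template =
                    new ClosureTemplateSketch(GeneratedAdder.class, Integer.class);
            return template.instantiate(captured);
        } catch (Exception e) {
            // Fallback: this lambda stands in for the interpreter's InterpretedCode.
            return x -> x + captured;
        }
    }
}
```

Both paths return a closure with the same behavior, so callers do not need to know which backend produced it. Resolving the constructor once per template keeps the reflective lookup out of the per-closure path.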

fglock and others added 9 commits April 10, 2026 19:11

fglock wants to merge 9 commits into feature/multiplicity from feature/multiplicity-opt
