Skip to content

perf: optimize ThreadLocal overhead and JVM-compile eval STRING closures#486

Open
fglock wants to merge 9 commits intofeature/multiplicityfrom
feature/multiplicity-opt
Open

perf: optimize ThreadLocal overhead and JVM-compile eval STRING closures#486
fglock wants to merge 9 commits intofeature/multiplicityfrom
feature/multiplicity-opt

Conversation

@fglock
Copy link
Copy Markdown
Owner

@fglock fglock commented Apr 10, 2026

Summary

Performance optimizations for the multiplicity branch, reducing ThreadLocal routing overhead and adding JVM compilation for anonymous subs inside eval STRING.

Optimizations applied (in order)

  • Tier 1: Cache PerlRuntime.current() in local variables in hot paths (GlobalVariable, InheritanceResolver) — eliminates ~12 ThreadLocal lookups per method cache miss
  • Tier 2: Migrate WarningBitsRegistry, HintHashRegistry, and RuntimeCode.argsStack ThreadLocal stacks into PerlRuntime instance fields — reduces 14-17 separate ThreadLocal lookups to 1 per sub call
  • Tier 2b: Batch pushCallerState/popCallerState and pushSubState/popSubState — reduces 8-12 lookups to 2 per sub call
  • Tier 2c: Batch RegexState save/restore into single PerlRuntime.current() call — eliminates 24 ThreadLocal lookups per sub call (13 getters + 13 setters)
  • Tier 2d: Skip RegexState save/restore entirely for subs that don't use regex — static analysis via RegexUsageDetector at compile time
  • Tier 3: JVM-compile anonymous subs inside eval STRING — previously always compiled to InterpretedCode, now tries JVM compilation first with interpreter fallback. 4.5x speedup for eval STRING closures in isolation.

Benchmark results (vs master)

Benchmark master Current vs master
closure 863 1,220 +41.4%
method 436 436 0.0%
lexical 394K 480K +21.8%
global 78K 82K +5.1%
eval_string 86K 89K +3.1%
regex 51K 46K -9.4%
string 29K 29K +0.3%

All benchmarks that originally regressed from multiplicity are now at or above master. The regex benchmark shows a small regression from ThreadLocal routing in regex-heavy paths.

Test plan

  • make — all unit tests pass (BUILD SUCCESSFUL)
  • prove -r src/test/resources/unit — 167 files, 7314 tests pass
  • op/eval.t — 156/174 pass (18 pre-existing failures)
  • op/closure.t — 246/266 pass (20 pre-existing failures)
  • Benchmark.pm works correctly with JVM-compiled eval closures

Generated with Devin

fglock and others added 9 commits April 10, 2026 19:11
Cache the ThreadLocal lookup result at method entry instead of calling
PerlRuntime.current() multiple times per method. This eliminates
redundant ThreadLocal lookups in hot paths:

- GlobalVariable.getGlobalCodeRef(): 4 lookups → 1
- GlobalVariable.getGlobalVariable/Array/Hash(): 2 lookups → 1
- GlobalVariable.definedGlob(): 7 lookups → 1
- GlobalVariable.isPackageLoaded(): 3 lookups → 1
- InheritanceResolver.findMethodInHierarchy(): ~8 lookups → 1
- InheritanceResolver.linearizeHierarchy(): ~5 lookups → 1
- InheritanceResolver.invalidateCache(): 4 lookups → 1

Also optimized several other GlobalVariable accessors:
defineGlobalCodeRef, replacePinnedCodeRef, aliasGlobalVariable,
setGlobAlias, getGlobalIO, getGlobalFormatRef, definedGlobalFormatAsScalar,
resetGlobalVariables, resolveStashAlias.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Consolidate 11 separate ThreadLocal stacks from WarningBitsRegistry,
HintHashRegistry, and RuntimeCode into PerlRuntime instance fields.
This reduces ThreadLocal lookups per subroutine call from ~14-17
(one per ThreadLocal.get()) to 1 (PerlRuntime.current(), then direct
field access).

Migrated ThreadLocals:
- WarningBitsRegistry: currentBitsStack, callSiteBits, callerBitsStack,
  callSiteHints, callerHintsStack, callSiteHintHash, callerHintHashStack
- HintHashRegistry: callSiteSnapshotId, callerSnapshotIdStack
- RuntimeCode: evalRuntimeContext, argsStack

The shared static ConcurrentHashMaps (WarningBitsRegistry.registry,
HintHashRegistry.snapshotRegistry) remain static as they are shared
across runtimes and only written at compile time.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Add pushCallerState/popCallerState and pushSubState/popSubState
batch methods to PerlRuntime, replacing 8-12 separate
PerlRuntime.current() calls per subroutine call with just 2.

Closure: 569 -> 601 ops/s (+5.6%)
Method: 319 -> 336 ops/s (+5.3%)
Lexical: 375K -> 458K ops/s (+22.2%)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
RegexState.save() and restore() each called 13 individual RuntimeRegex
static accessors, each doing its own PerlRuntime.current() ThreadLocal
lookup. Replaced with a single PerlRuntime.current() call and direct
field access in both constructor and dynamicRestoreState().

Eliminates 24 ThreadLocal lookups per subroutine call.

JFR profiling showed RegexState was the dominant ThreadLocal overhead
source (126 of 143 PerlRuntime.current() samples in closure benchmark).

Closure: 601 -> 814 ops/s (+35%, now -5.7% vs master)
Method: 336 -> 399 ops/s (+19%, now -8.5% vs master)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…cy.md

Document the profiling findings (RegexState was dominant overhead),
optimization tiers applied, benchmark results, and remaining opportunities.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
EmitterMethodCreator unconditionally emitted RegexState.save() at every
subroutine entry, creating and pushing a 13-field snapshot even when the
subroutine never uses regex. Now uses RegexUsageDetector to check the AST
at compile time and only emits save/restore when the body contains regex
operations or eval STRING (which may introduce regex at runtime).

This is safe because subroutines without regex don't modify regex state,
and any callees that use regex do their own save/restore at their boundary.

Closure: 814 -> 1177 ops/s (+44%, now +36% FASTER than master)
Method: 399 -> 417 ops/s (+5%, now -4.4% vs master)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Closure is now +36% faster than master (was -34%).
Method is now -4.4% vs master (was -27%).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Migrate Mro caches (packageGenerations, isaRevCache, pkgGenIsaState),
RuntimeIO.openHandles LRU cache, RuntimeRegex.optimizedRegexCache,
OutputFieldSeparator.internalOFS, OutputRecordSeparator.internalORS,
and ByteCodeSourceMapper (all 7 fields via new State inner class)
to per-PerlRuntime instance fields for multiplicity thread-safety.

No performance regression vs baseline benchmarks.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Previously, BytecodeCompiler.visitAnonymousSubroutine() always compiled
anonymous sub bodies to InterpretedCode. Hot closures created via eval
STRING (e.g., Benchmark.pm's timing wrapper) ran in the bytecode
interpreter instead of as native JVM bytecode.

Now tries JVM compilation first via EmitterMethodCreator.createClassWithMethod(),
falling back to the interpreter on any failure. A new JvmClosureTemplate
class holds the JVM-compiled class and instantiates closures with captured
variables via reflection.

Measured 4.5x speedup for eval STRING closures in isolation (6.4M iter/s
vs 1.4M iter/s). Updated benchmark results in concurrency.md - all
previously regressed benchmarks now match or exceed master.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant