82 changes: 82 additions & 0 deletions dev/design/concurrency.md
@@ -1485,6 +1485,88 @@ and why it was reverted.)

*None yet.*

---

### Optimization Results (2026-04-10)

**Branch:** `feature/multiplicity-opt` (created from `feature/multiplicity`)

#### JFR Profiling Findings

Profiled the closure benchmark (`dev/bench/benchmark_closure.pl`) with Java
Flight Recorder to identify the dominant overhead sources. Key findings:

| Category | JFR Samples | Source |
|----------|-------------|--------|
| `PerlRuntime.current()` / ThreadLocal | 143 | The ThreadLocal.get() call itself |
| RuntimeRegex accessors | ~126 | **Dominant source** — 13 getters + 13 setters per sub call via RegexState save/restore |
| DynamicVariableManager | ~90 | `local` variable save/restore (includes regex state push) |
| pushCallerState (batch) | 10 | Caller bits/hints/hint-hash state |
| pushSubState (batch) | 3 | Args + warning bits |
| ArrayDeque.push | 13 | Stack operations |

**Critical finding:** `RegexState` save/restore was calling 13 individual
`RuntimeRegex` static accessors (each doing its own `PerlRuntime.current()`
ThreadLocal lookup) on every subroutine entry AND exit — **26 ThreadLocal
lookups per sub call** — even when the subroutine never uses regex.
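
The pattern behind this finding can be illustrated with a minimal sketch. All names here (`DemoRuntime`, `DemoRegexState`, and the three fields) are hypothetical stand-ins, not the actual `PerlRuntime` or `RegexState` code. The sketch only shows why per-field static accessors cost one ThreadLocal lookup each, while resolving the runtime once costs a single lookup per save or restore.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a per-thread runtime object (not the real PerlRuntime).
final class DemoRuntime {
    // Counts ThreadLocal lookups so the two strategies can be compared.
    static final AtomicInteger lookups = new AtomicInteger();

    private static final ThreadLocal<DemoRuntime> CURRENT =
            ThreadLocal.withInitial(DemoRuntime::new);

    // Three illustrative regex-state fields; the document describes 13.
    String lastMatch = "";
    int lastStart = -1;
    int lastEnd = -1;

    static DemoRuntime current() {
        lookups.incrementAndGet();
        return CURRENT.get();
    }

    // Per-field static accessors: each call performs its own lookup.
    static String getLastMatch() { return current().lastMatch; }
    static int getLastStart() { return current().lastStart; }
    static int getLastEnd() { return current().lastEnd; }
}

// Hypothetical snapshot of the regex state taken on sub entry.
final class DemoRegexState {
    private final String lastMatch;
    private final int lastStart;
    private final int lastEnd;

    private DemoRegexState(String lastMatch, int lastStart, int lastEnd) {
        this.lastMatch = lastMatch;
        this.lastStart = lastStart;
        this.lastEnd = lastEnd;
    }

    // Unbatched save: one ThreadLocal lookup per field.
    static DemoRegexState saveUnbatched() {
        return new DemoRegexState(
                DemoRuntime.getLastMatch(),
                DemoRuntime.getLastStart(),
                DemoRuntime.getLastEnd());
    }

    // Batched save: resolve the runtime once, then read plain fields.
    static DemoRegexState saveBatched() {
        DemoRuntime rt = DemoRuntime.current();
        return new DemoRegexState(rt.lastMatch, rt.lastStart, rt.lastEnd);
    }

    // Batched restore: again a single lookup for all fields.
    void restoreBatched() {
        DemoRuntime rt = DemoRuntime.current();
        rt.lastMatch = lastMatch;
        rt.lastStart = lastStart;
        rt.lastEnd = lastEnd;
    }
}
```

With three fields, `saveUnbatched()` performs three lookups, while `saveBatched()` followed by `restoreBatched()` performs two in total. Scaled to 13 fields, the unbatched save plus restore is the 26 lookups per sub call described above.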

#### Optimizations Applied

| Tier | Optimization | Files Changed | ThreadLocal Lookups Eliminated |
|------|-------------|---------------|-------------------------------|
| 1 | Cache `PerlRuntime.current()` in local variables | GlobalVariable.java, InheritanceResolver.java | ~12 per method cache miss |
| 2 | Migrate WarningBits/HintHash/args stacks to PerlRuntime fields | WarningBitsRegistry.java, HintHashRegistry.java, RuntimeCode.java, PerlRuntime.java | 14-17 per sub call (separate ThreadLocals -> 1 ThreadLocal) |
| 2b | Batch pushCallerState/popCallerState and pushSubState/popSubState | PerlRuntime.java, RuntimeCode.java | 8-12 per sub call -> 2 |
| 2c | Batch RegexState save/restore | RegexState.java | **24 per sub call** (26 lookups from 13 getters + 13 setters -> 2) |
| 2d | Skip RegexState save/restore for subs without regex | EmitterMethodCreator.java, RegexUsageDetector.java | **Entire save/restore eliminated** for non-regex subs |
| 3 | JVM-compile anonymous subs inside `eval STRING` | BytecodeCompiler.java, OpcodeHandlerExtended.java, JvmClosureTemplate.java (new) | N/A (execution speedup, not ThreadLocal reduction) |
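
The idea behind Tier 2d can be sketched as a compile-time scan of the sub body. This is an illustration only: `DemoNode`, `DemoRegexUsage`, and the node kinds are hypothetical, and the real `RegexUsageDetector` may use different node types and rules.

```java
import java.util.List;

// Hypothetical, heavily simplified AST node; the real parser's node types differ.
final class DemoNode {
    final String kind;              // e.g. "block", "call", "match", "subst"
    final List<DemoNode> children;

    DemoNode(String kind, DemoNode... children) {
        this.kind = kind;
        this.children = List.of(children);
    }
}

final class DemoRegexUsage {
    // Returns true if any node in the subtree is a regex operation.
    static boolean usesRegex(DemoNode node) {
        if (node.kind.equals("match") || node.kind.equals("subst")) {
            return true;
        }
        for (DemoNode child : node.children) {
            if (usesRegex(child)) {
                return true;
            }
        }
        return false;
    }

    // Compile-time decision: plan the save/restore pair only when required.
    static List<String> planSubBody(DemoNode subBody) {
        return usesRegex(subBody)
                ? List.of("saveRegexState", "body", "restoreRegexState")
                : List.of("body");
    }
}
```

Because the decision is made once when the sub is compiled, a sub without regex operations pays nothing at call time in this sketch.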

**Tier 3: JVM compilation of eval STRING anonymous subs.** Previously,
`BytecodeCompiler.visitAnonymousSubroutine()` always compiled anonymous sub bodies
to `InterpretedCode`. This meant hot closures created via `eval STRING` (e.g.,
Benchmark.pm's `sub { for (1..$n) { &$c } }`) ran in the bytecode interpreter.
The fix tries JVM compilation first via `EmitterMethodCreator.createClassWithMethod()`,
falling back to the interpreter on any failure (e.g., ASM frame computation crash).
A new `JvmClosureTemplate` class holds the JVM-compiled class and instantiates it
with captured variables via reflection. In isolation, eval STRING closures measured
about 4.5x faster (6.4M iter/s vs 1.4M iter/s). The broader benchmark improvements
(method +36.7% and global +10.8%, both vs pre-opt) likely reflect this change speeding
up Benchmark.pm's own infrastructure, which all benchmarks use.
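
The two mechanisms in this paragraph, a template that instantiates a compiled class via reflection and a try-then-fall-back compile step, can be sketched as follows. Every name here (`DemoSub`, `DemoGeneratedSub`, `DemoClosureTemplate`, `DemoCompiler`) is hypothetical. The real `JvmClosureTemplate` works with generated classes and Perl runtime types, which this sketch replaces with a handwritten class and an `int[]` capture.

```java
import java.lang.reflect.Constructor;
import java.util.function.Supplier;

// Hypothetical callable interface; stands in for the runtime's code object.
interface DemoSub {
    int call();
}

// Stands in for a JVM-compiled closure body. A real one would be generated
// as bytecode; here it is handwritten so the sketch is runnable.
final class DemoGeneratedSub implements DemoSub {
    private final int[] counter;    // captured variable, shared by reference

    public DemoGeneratedSub(int[] counter) {
        this.counter = counter;
    }

    @Override
    public int call() {
        return ++counter[0];
    }
}

// Holds the compiled class and creates one instance per closure creation.
final class DemoClosureTemplate {
    private final Constructor<?> constructor;

    DemoClosureTemplate(Class<?> generatedClass, Class<?>... capturedTypes) {
        try {
            // Resolve the constructor once; reuse it for every instantiation.
            this.constructor = generatedClass.getDeclaredConstructor(capturedTypes);
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    DemoSub instantiate(Object[] capturedVars) {
        try {
            return (DemoSub) constructor.newInstance(capturedVars);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}

final class DemoCompiler {
    // Try the fast path first; on any runtime failure use the fallback instead.
    static DemoSub compile(Supplier<DemoSub> jvmPath, Supplier<DemoSub> interpreterPath) {
        try {
            return jvmPath.get();
        } catch (RuntimeException e) {
            return interpreterPath.get();
        }
    }
}
```

Each call to `instantiate` yields a fresh object that shares the captured variable, which is the behavior a closure needs. The constructor lookup is done once in the template so that repeated closure creation only pays for `newInstance`.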

#### Benchmark Results (updated 2026-04-10, after JVM eval STRING closures)

| Benchmark | master | branch (pre-opt) | **Current** | vs master | vs pre-opt |
|-----------|--------|-------------------|-------------|-----------|------------|
| **closure** | 863 | 569 (-34.1%) | **1,220** | **+41.4%** | +114.4% |
| **method** | 436 | 319 (-26.9%) | **436** | **0.0%** | +36.7% |
| **lexical** | 394K | 375K (-4.9%) | **480K** | **+21.8%** | +28.0% |
| global | 78K | 74K (-5.4%) | **82K** | +5.1% | +10.8% |
| eval_string | 86K | 82K (-4.8%) | **89K** | +3.1% | +8.5% |
| regex | 51K | 47K (-7.0%) | **46K** | -9.4% | -2.1% |
| string | 29K | 31K (+6.5%) | **29K** | +0.3% | -5.6% |

Every benchmark that originally regressed, except regex, is now **at or above master**.
Closure is 41% faster than master, method matches master exactly, lexical is 22% faster,
and global is 5% faster. The closure and lexical improvements come from eliminating
unnecessary RegexState save/restore for subroutines that don't use regex. That
overhead existed on master too (via separate ThreadLocals) but was masked by
lower per-lookup cost. The regex benchmark is the exception: it remains about 9% below
master, which is attributed to ThreadLocal routing overhead in regex-heavy code paths.
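
For reference, the percentage columns follow the usual relative-change formula, `(current - baseline) / baseline`. The helper below is a hypothetical sketch (`DemoRelativeChange` is not part of the project) that reproduces the closure and method rows from the table. Rows whose values are shown rounded to thousands (for example `394K`) may not reproduce to the last digit, presumably because the published percentages were computed from unrounded measurements.

```java
import java.util.Locale;

final class DemoRelativeChange {
    // Relative change of current versus baseline, in percent.
    static double percentChange(double baseline, double current) {
        return (current - baseline) / baseline * 100.0;
    }

    // Formats with an explicit sign and one decimal, like the table.
    static String format(double baseline, double current) {
        return String.format(Locale.ROOT, "%+.1f%%", percentChange(baseline, current));
    }

    public static void main(String[] args) {
        System.out.println("closure vs master:  " + format(863, 1220)); // +41.4%
        System.out.println("closure vs pre-opt: " + format(569, 1220)); // +114.4%
        System.out.println("pre-opt vs master:  " + format(863, 569));  // -34.1%
        System.out.println("method vs pre-opt:  " + format(319, 436));  // +36.7%
    }
}
```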

#### Remaining Optimization Opportunities (not yet pursued)

These are lower priority because the main goal (closure and method within 10% of master) has been exceeded:

| Option | Effort | Expected Impact | Notes |
|--------|--------|-----------------|-------|
| Pass `PerlRuntime rt` from static apply to instance apply | Low | Eliminates 1 of 2 remaining lookups per sub call | Changes method signatures |
| Cache warning bits on RuntimeCode field | Low | Avoids ConcurrentHashMap lookup per call | `getWarningBitsForCode()` in profile |
| Batch RuntimeRegex field access in match methods | Medium | Eliminates ~10-15 lookups per regex match | Profile showed many individual accessors in RuntimeRegex.java; may help the -9% regex regression |
| DynamicVariableManager.variableStack() caching | Low | 1 lookup per call eliminated | 10 samples in profile |

Note: "Lazy regex state save (skip when sub doesn't use regex)" was listed here previously
and has been implemented as Tier 2d above.
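
As an illustration of the "cache warning bits on RuntimeCode field" option, the sketch below memoizes a map lookup in a field on the code object. The names (`DemoWarningRegistry`, `DemoCode`) are hypothetical. It also assumes that the bits registered for a class never change after the first call, which would need to be verified against the real registry before applying this idea.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry keyed by class name; stands in for the real lookup.
final class DemoWarningRegistry {
    static final ConcurrentHashMap<String, String> BITS = new ConcurrentHashMap<>();
    static int lookups = 0;   // demo counter only; not thread-safe

    static String lookup(String className) {
        lookups++;
        return BITS.get(className);
    }
}

// Hypothetical code object that remembers the result of the first lookup.
final class DemoCode {
    private final String className;
    private String cachedBits;
    private boolean resolved;

    DemoCode(String className) {
        this.className = className;
    }

    String warningBits() {
        if (!resolved) {
            // First call: consult the registry and remember the answer,
            // including a null answer, so later calls skip the map entirely.
            cachedBits = DemoWarningRegistry.lookup(className);
            resolved = true;
        }
        return cachedBits;
    }
}
```

In this sketch a thousand calls to `warningBits()` cost one map lookup. The trade-off is an extra field per code object and the immutability assumption stated above.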

### Open Questions
- `runtimeEvalCounter` and `nextCallsiteId` remain static (shared across runtimes) —
acceptable for unique ID generation but may want per-runtime counters in future
143 changes: 142 additions & 1 deletion src/main/java/org/perlonjava/backend/bytecode/BytecodeCompiler.java
@@ -3,6 +3,9 @@

import org.perlonjava.backend.jvm.EmitterContext;
import org.perlonjava.backend.jvm.EmitterMethodCreator;
import org.perlonjava.backend.jvm.InterpreterFallbackException;
import org.perlonjava.backend.jvm.JavaClassInfo;
import org.perlonjava.backend.jvm.JvmClosureTemplate;
import org.perlonjava.frontend.analysis.ConstantFoldingVisitor;
import org.perlonjava.frontend.analysis.FindDeclarationVisitor;
import org.perlonjava.frontend.analysis.RegexUsageDetector;
@@ -4909,6 +4912,10 @@ private void visitNamedSubroutine(SubroutineNode node) {
      * <p>
      * Compiles the subroutine body to bytecode with closure support.
      * Anonymous subs capture lexical variables from the enclosing scope.
+     * <p>
+     * When compiled inside eval STRING with an EmitterContext available,
+     * attempts JVM compilation first for better runtime performance.
+     * Falls back to interpreter bytecode if JVM compilation fails.
      */
     private void visitAnonymousSubroutine(SubroutineNode node) {
         // Step 1: Collect closure variables.
@@ -4924,7 +4931,141 @@ private void visitAnonymousSubroutine(SubroutineNode node) {

         closureCapturedVarNames.addAll(closureVarNames);

-        // Step 3: Create a new BytecodeCompiler for the subroutine body
+        // Step 2: Try JVM compilation first if we have an EmitterContext (eval STRING path)
+        // Skip JVM attempt for defer blocks and map/grep blocks which have special control flow
+        Boolean isDeferBlock = (Boolean) node.getAnnotation("isDeferBlock");
+        Boolean isMapGrepBlock = (Boolean) node.getAnnotation("isMapGrepBlock");
+        boolean skipJvm = (isDeferBlock != null && isDeferBlock)
+                || (isMapGrepBlock != null && isMapGrepBlock);
+
+        if (this.emitterContext != null && !skipJvm) {
+            try {
+                emitJvmAnonymousSub(node, closureVarNames, closureVarIndices);
+                return; // JVM compilation succeeded
+            } catch (Exception e) {
+                // JVM compilation failed, fall through to interpreter path
+                if (System.getenv("JPERL_SHOW_FALLBACK") != null) {
+                    System.err.println("JVM compilation failed for anonymous sub in eval STRING, using interpreter: "
+                            + e.getClass().getSimpleName() + ": " + e.getMessage());
+                }
+            }
+        }
+
+        // Step 3: Interpreter compilation (existing path)
+        emitInterpretedAnonymousSub(node, closureVarNames, closureVarIndices);
+    }
+
+    /**
+     * Attempt to compile an anonymous sub body to JVM bytecode.
+     * Creates an EmitterContext, calls EmitterMethodCreator.createClassWithMethod(),
+     * and emits interpreter opcodes to instantiate the JVM class at runtime.
+     */
+    private void emitJvmAnonymousSub(SubroutineNode node,
+                                     List<String> closureVarNames,
+                                     List<Integer> closureVarIndices) {
+        // Build a ScopedSymbolTable for the sub body with captured variables
+        ScopedSymbolTable newSymbolTable = new ScopedSymbolTable();
+        newSymbolTable.enterScope();
+
+        // Add reserved variables first to occupy slots 0-2
+        // EmitterMethodCreator skips these (skipVariables=3) but they must be present
+        // in the symbol table to keep captured variable indices aligned at 3+
+        newSymbolTable.addVariable("this", "", getCurrentPackage(), null);
+        newSymbolTable.addVariable("@_", "", getCurrentPackage(), null);
+        newSymbolTable.addVariable("wantarray", "", getCurrentPackage(), null);
+
+        // Add captured variables to the symbol table
+        // They will be at indices 3, 4, 5, ... (after this/@_/wantarray)
+        for (String varName : closureVarNames) {
+            newSymbolTable.addVariable(varName, "my", getCurrentPackage(), null);
+        }
+
+        // Copy package and pragma flags from the current BytecodeCompiler state
+        newSymbolTable.setCurrentPackage(getCurrentPackage(), symbolTable.currentPackageIsClass());
+        newSymbolTable.strictOptionsStack.pop();
+        newSymbolTable.strictOptionsStack.push(symbolTable.strictOptionsStack.peek());
+        newSymbolTable.featureFlagsStack.pop();
+        newSymbolTable.featureFlagsStack.push(symbolTable.featureFlagsStack.peek());
+        newSymbolTable.warningFlagsStack.pop();
+        newSymbolTable.warningFlagsStack.push((java.util.BitSet) symbolTable.warningFlagsStack.peek().clone());
+        newSymbolTable.warningFatalStack.pop();
+        newSymbolTable.warningFatalStack.push((java.util.BitSet) symbolTable.warningFatalStack.peek().clone());
+        newSymbolTable.warningDisabledStack.pop();
+        newSymbolTable.warningDisabledStack.push((java.util.BitSet) symbolTable.warningDisabledStack.peek().clone());
+
+        // Reset variable index past the captured variables
+        String[] newEnv = newSymbolTable.getVariableNames();
+        int currentVarIndex = newSymbolTable.getCurrentLocalVariableIndex();
+        int resetTo = Math.max(newEnv.length, currentVarIndex);
+        newSymbolTable.resetLocalVariableIndex(resetTo);
+
+        // Create EmitterContext for JVM compilation
+        JavaClassInfo newJavaClassInfo = new JavaClassInfo();
+        EmitterContext subCtx = new EmitterContext(
+                newJavaClassInfo,
+                newSymbolTable,
+                null, // mv - will be set by EmitterMethodCreator
+                null, // cw - will be set by EmitterMethodCreator
+                RuntimeContextType.RUNTIME,
+                true,
+                this.errorUtil,
+                this.emitterContext.compilerOptions,
+                new RuntimeArray()
+        );
+
+        // Try JVM compilation - may throw InterpreterFallbackException or other exceptions
+        Class<?> generatedClass = EmitterMethodCreator.createClassWithMethod(
+                subCtx, node.block, false);
+
+        // Cache the generated class
+        RuntimeCode.getAnonSubs().put(subCtx.javaClassInfo.javaClassName, generatedClass);
+
+        // Emit interpreter opcodes to create the code reference at runtime
+        int codeReg = allocateRegister();
+        String packageName = getCurrentPackage();
+
+        if (closureVarIndices.isEmpty()) {
+            // No closures - instantiate JVM class at compile time
+            JvmClosureTemplate template = new JvmClosureTemplate(
+                    generatedClass, node.prototype, packageName);
+            RuntimeScalar codeScalar = template.instantiateNoClosure();
+
+            // Handle attributes
+            if (node.attributes != null && !node.attributes.isEmpty() && packageName != null) {
+                RuntimeCode code = (RuntimeCode) codeScalar.value;
+                code.attributes = node.attributes;
+                Attributes.runtimeDispatchModifyCodeAttributes(packageName, codeScalar);
+            }
+
+            int constIdx = addToConstantPool(codeScalar);
+            emit(Opcodes.LOAD_CONST);
+            emitReg(codeReg);
+            emit(constIdx);
+        } else {
+            // Has closures - store JvmClosureTemplate in constant pool
+            // CREATE_CLOSURE opcode handles both InterpretedCode and JvmClosureTemplate
+            JvmClosureTemplate template = new JvmClosureTemplate(
+                    generatedClass, node.prototype, packageName);
+            int templateIdx = addToConstantPool(template);
+            emit(Opcodes.CREATE_CLOSURE);
+            emitReg(codeReg);
+            emit(templateIdx);
+            emit(closureVarIndices.size());
+            for (int regIdx : closureVarIndices) {
+                emit(regIdx);
+            }
+        }
+
+        lastResultReg = codeReg;
+    }
+
+    /**
+     * Compile an anonymous sub to InterpretedCode (the fallback/default path).
+     * This is the original implementation of visitAnonymousSubroutine.
+     */
+    private void emitInterpretedAnonymousSub(SubroutineNode node,
+                                             List<String> closureVarNames,
+                                             List<Integer> closureVarIndices) {
         // Build a variable registry from current scope to pass to sub-compiler
         // This allows nested closures to see grandparent scope variables
         Map<String, Integer> parentRegistry = new HashMap<>();
@@ -1,5 +1,6 @@
package org.perlonjava.backend.bytecode;

import org.perlonjava.backend.jvm.JvmClosureTemplate;
import org.perlonjava.runtime.operators.*;
import org.perlonjava.runtime.perlmodule.Attributes;
import org.perlonjava.runtime.regex.RuntimeRegex;
@@ -888,35 +889,43 @@ public static int executeMatchRegexNot(int[] bytecode, int pc, RuntimeBase[] reg
     /**
      * Execute create closure operation.
      * Format: CREATE_CLOSURE rd template_idx num_captures reg1 reg2 ...
+     * <p>
+     * Supports both InterpretedCode templates (interpreter-compiled subs)
+     * and JvmClosureTemplate (JVM-compiled subs from eval STRING).
      */
     public static int executeCreateClosure(int[] bytecode, int pc, RuntimeBase[] registers, InterpretedCode code) {
         int rd = bytecode[pc++];
         int templateIdx = bytecode[pc++];
         int numCaptures = bytecode[pc++];

-        // Get the template InterpretedCode from constants
-        InterpretedCode template = (InterpretedCode) code.constants[templateIdx];
-
         // Capture the current register values
         RuntimeBase[] capturedVars = new RuntimeBase[numCaptures];
         for (int i = 0; i < numCaptures; i++) {
             int captureReg = bytecode[pc++];
             capturedVars[i] = registers[captureReg];
         }

-        // Create a new InterpretedCode with the captured variables
-        InterpretedCode closureCode = template.withCapturedVars(capturedVars);
-
-        // Wrap in RuntimeScalar and set __SUB__ for self-reference
-        RuntimeScalar codeRef = new RuntimeScalar(closureCode);
-        closureCode.__SUB__ = codeRef;
-        registers[rd] = codeRef;
+        Object template = code.constants[templateIdx];

-        // Dispatch MODIFY_CODE_ATTRIBUTES for anonymous subs with non-builtin attributes
-        // Pass isClosure=true since CREATE_CLOSURE always creates a closure
-        if (closureCode.attributes != null && !closureCode.attributes.isEmpty()
-                && closureCode.packageName != null) {
-            Attributes.runtimeDispatchModifyCodeAttributes(closureCode.packageName, codeRef, true);
+        if (template instanceof JvmClosureTemplate jvmTemplate) {
+            // JVM-compiled closure: instantiate the generated class with captured variables
+            registers[rd] = jvmTemplate.instantiate(capturedVars);
+        } else {
+            // InterpretedCode closure: create a new copy with captured variables
+            InterpretedCode interpTemplate = (InterpretedCode) template;
+            InterpretedCode closureCode = interpTemplate.withCapturedVars(capturedVars);
+
+            // Wrap in RuntimeScalar and set __SUB__ for self-reference
+            RuntimeScalar codeRef = new RuntimeScalar(closureCode);
+            closureCode.__SUB__ = codeRef;
+            registers[rd] = codeRef;
+
+            // Dispatch MODIFY_CODE_ATTRIBUTES for anonymous subs with non-builtin attributes
+            // Pass isClosure=true since CREATE_CLOSURE always creates a closure
+            if (closureCode.attributes != null && !closureCode.attributes.isEmpty()
+                    && closureCode.packageName != null) {
+                Attributes.runtimeDispatchModifyCodeAttributes(closureCode.packageName, codeRef, true);
+            }
         }
         return pc;
     }