diff --git a/dev/design/regex_preprocessing_fixes.md b/dev/design/regex_preprocessing_fixes.md new file mode 100644 index 000000000..a93c7f0a7 --- /dev/null +++ b/dev/design/regex_preprocessing_fixes.md @@ -0,0 +1,282 @@ +# Regex Preprocessing Fixes + +## Overview + +This document tracks regex preprocessing issues discovered while running `re/pat.t`, `re/pat_advanced.t`, and related tests with `JPERL_UNIMPLEMENTED=warn`. + +## Completed Fixes + +### 1. Invalid quantifier brace handling consuming regex metacharacters + +**Root cause:** `handleQuantifier()` in `RegexPreprocessor.java` used `s.indexOf('}', start)` to find the closing brace of a potential quantifier, but this search crossed character class boundaries and regex groups. For example, in `{ (?> [^{}]+ | (??{...}) )* }`, the `{` at the start was treated as a potential quantifier, and `indexOf('}')` found the `}` inside the character class `[^{}]`, consuming everything in between (including `(?>` and `[^{`) as literal text. + +**Fix:** When `handleQuantifier` determines that braces don't form a valid quantifier (content contains non-numeric characters), it now only escapes the opening `{` as `\{` and returns immediately, letting the main regex loop process subsequent characters normally. Previously it consumed and escaped the entire `{...}` range. + +**Files changed:** `RegexPreprocessor.java` — `handleQuantifier()` method + +### 2. `\x{...}` hex escape with non-hex characters + +**Root cause:** The hex escape handler used `Integer.parseInt(hexStr, 16)` which throws `NumberFormatException` for strings containing non-hex characters (e.g., `\x{9bq}`). Inside character classes, this was caught and re-thrown as a fatal `PerlCompilerException`, killing the test run. Outside character classes, the escape was passed through to Java's regex engine which also rejected it. + +**Perl behavior:** `\x{9bq}` extracts the valid hex prefix `9b` (value 0x9B) and ignores the remaining characters. 
`\x{x9b}` has no valid prefix, so the value is 0. Underscores are allowed (removed by preprocessing) but other non-hex chars terminate the hex number. + +**Fix:** All three `\x{...}` handlers now extract the valid hex prefix instead of requiring the entire content to be valid hex: +- `handleRegexCharacterClassEscape()` — inside `[...]` (was the fatal crash) +- `handleEscapeSequences()` — outside `[...]` +- Range endpoint parser — for character class ranges + +**Files changed:** `RegexPreprocessorHelper.java` + +### 3. Bare `\xNN` with non-hex characters + +**Root cause:** Bare `\x` (without braces) was passed through to Java's regex engine, which expects exactly 2 hex digits after `\x`. Patterns like `\xk` or `\x4j` caused `PatternSyntaxException`. + +**Perl behavior:** `\x` takes up to 2 hex digits. `\xk` = `\x00` followed by literal `k`. `\x4j` = `\x04` followed by literal `j`. + +**Fix:** Added explicit bare `\x` handling that parses up to 2 hex digits and emits `\x{HH}` format when fewer than 2 valid hex digits are found. + +**Files changed:** `RegexPreprocessorHelper.java` — `handleEscapeSequences()` method + +### 4. NullPointerException when regex fails with JPERL_UNIMPLEMENTED=warn + +**Root cause:** When regex compilation fails and gets downgraded to a warning, the catch block in `RuntimeRegex.compile()` set the error pattern but didn't set `regex.patternString`. Downstream code (e.g., `replaceRegex()`) checked `regex.patternString == null` and triggered recompilation with a null pattern, causing NPE in `convertPythonStyleGroups(null).replaceAll(...)`. + +**Fix:** +1. Set `regex.patternString` in the catch block when downgrading to warning +2. 
Added null guard in `preProcessRegex()` to treat null input as empty string + +**Files changed:** `RuntimeRegex.java`, `RegexPreprocessor.java` + +## Known Remaining Issues + +### Test Pass Rates (after all fixes) + +| Test | Before fixes | After fixes | Remaining failures | +|------|-------------|-------------|-------------------| +| `re/pat.t` | 428/1298 | **1077**/1298 (all run) | 221 fail | +| `re/pat_advanced.t` | 63/1298 | **1308**/1625 | 317 fail + 53 not reached | +| `re/pat_rt_report.t` | 2397/2515 | **2431**/2515 (ran 2508) | 77 fail + 7 not reached | +| `re/regexp_unicode_prop.t` | — | **1017**/1096 | 79 fail + 14 not reached | +| `re/reg_eval_scope.t` | 6/49 | 7/49 | 42 fail | +| `uni/variables.t` | 66880/66880 | 66880/66880 | 0 | + +### Early Termination (crashes blocking remaining tests) + +| Test | Crash point | Cause | Tests blocked | +|------|------------|-------|---------------| +| pat.t | **No crash** — all 1298 tests now run | N/A | 0 | +| pat_advanced.t | Line 2308 (test 1625) | `\p{Is_q}` — package-scoped user property (`Some::Is_q`) | 53 tests | +| pat_rt_report.t | Line 1158 (test 2508) | `(?1)` — numbered group recursion not supported | 7 tests | +| regexp_unicode_prop.t | Line 543 (test 1096) | `\pf`/`\Pf` invalid property generates warnings instead of errors | 14 tests | + +### Failure Categories + +#### A. `\G` anchor (26 failures in pat.t) + +The `\G` assertion (match at pos()) has significant issues: +- **Floating `\G`** patterns like `/a+\G/` fail — Java doesn't support `\G` except at pattern start +- **`\G` in loops** (`/\G.../gc` iteration) doesn't maintain position correctly +- Tests: pat.t 455-518 + +**Difficulty: Medium-High.** Requires custom `\G` tracking in the match engine; Java's `\G` only works at the start of a match attempt. + +#### B. `(?{...})` code blocks (36 failures in pat.t, 5 in pat_advanced.t, 5 in pat_rt_report.t) + +Regex embedded code blocks are replaced with no-op groups. 
This breaks: +- **`$^R`** — result of last `(?{...})` (tests 308-310) +- **`pos()` inside `(?{...})`** (tests 470-494) +- **Package/lexical variable access** inside `(?{...})` (tests 522-525) +- **Eval-group runtime checks** — "Eval-group not allowed at runtime" (tests 300-304) + +**Difficulty: Very High.** Would require integrating the Perl compiler into the regex engine to execute code at match time. + +#### C. `$^N` — last successful capture (20 failures in pat_advanced.t) + +`$^N` is not updated after successful group captures. Tests 69-88 all fail. +- Both outside regex and inside `(?{...})` usage fails +- `$^N` is automatically localized — not implemented + +**Difficulty: Medium.** Requires tracking the last successfully matched group in the match result. + +#### D. `(??{...})` recursive patterns (5 failures in pat.t) + +Non-constant recursive patterns are replaced with empty groups. Tests 293-297 (complicated backtracking, recursion with `(??{})`) all fail. + +**Difficulty: Very High.** Same as `(?{...})` — requires runtime code execution. + +#### E. `(*ACCEPT)`, `(*FAIL)` control verbs (5 failures in pat.t) + +Regex control verbs are not supported by Java's regex engine. Tests 357-373 (ACCEPT and CLOSE buffer tests). + +**Difficulty: High.** Would require a custom regex engine or post-processing layer. + +#### F. 
`@-` / `@+` / `@{^CAPTURE}` arrays (12 failures in pat.t, 5 in pat_rt_report.t) + +The match position arrays have bugs: +- **Wrong values** for capture group positions (tests 381-438) +- **Stale values** not cleared after new match (tests 439-441) +- **Read-only protection** throws wrong exception type: `UnsupportedOperationException` instead of `Modification of a read-only value attempted` (test 449) +- **Interpolation in patterns** — `@-` and `@+` should not be interpolated (pat_rt_report.t 151-154) +- **Undefined values** in `@-`/`@+` after match (pat_rt_report.t 213) + +**Difficulty: Medium.** The data is available from Java's `Matcher`; needs more careful mapping to Perl semantics. + +#### G. `qr//` stringification and modifiers (4 failures in pat.t) + +- `qr/\b\v$/xism` stringifies as `(?^imsx:\b\v$)` but should be `(?^msix:\\b\\v$)` — backslashes not escaped in stringification (test 315) +- **`/u` modifier** not tracked: `use feature 'unicode_strings'` should add `/u` flag (tests 323-327) + +**Difficulty: Low-Medium.** Stringification fix is straightforward; `/u` modifier tracking needs scope awareness. + +#### H. `\N{name}` charnames (25 failures in pat_advanced.t) + +Named character escapes have extensive issues: +- **Empty `\N{}`** not handled correctly (tests 794-809) +- **`\N{PLUS SIGN}`** — named characters not expanded in regex (tests 831-833) +- **`\N{U+0041}`** in character class — `[\N{SPACE}\N{U+0041}]` fails (test 836) +- **Charname validation** — leading digit, comma, latin1 symbol errors not produced (tests 821-828) +- **Charname caching** with `$1` — not implemented (tests 798-801) +- **Cedilla/NO-BREAK SPACE** in names — error handling missing (tests 816-819) + +**Difficulty: Medium-High.** `\N{U+XXXX}` is partially implemented; full charnames support needs the `charnames` module. + +#### I. 
Useless `(?c)` / `(?g)` / `(?o)` warnings (13 failures in pat_advanced.t) + +Perl warns about useless regex modifiers (`/c`, `/g`, `/o` are match-operator flags, not regex flags). PerlOnJava silently ignores them without producing warnings. + +**Difficulty: Low.** Add warning emission in the regex flag parser. + +#### J. Bare `\x` hex escape edge cases (5 failures in pat_advanced.t) + +Our fix handles the crash but the test strings don't match correctly: +- `\x4j` produces `\004j` but regex `[\x4j]{2}` doesn't match it (test 101) +- `\xk` produces `\000k` but regex `[\xk]{2}` doesn't match it (test 102) +- `\xx`, `\xxa`, `\x9_b` — regex character class expansion doesn't match the test string (tests 103-105) + +The issue is that the test string and the regex pattern both use `\x` escapes, but the regex preprocessor and the string processor handle them differently. The test expects both to produce the same character. + +**Difficulty: Low-Medium.** The regex-side `\x` handling needs to produce character classes that match what the string-side produces. + +#### K. Conditional `(?(1)...)` with `$` anchor — Bug 41010 (48 failures in pat_rt_report.t) + +The largest single failure category. Patterns like `/([ ]*$)(?(1))/` don't match correctly. This is a systematic issue with conditionals referencing a group that ends with `$` anchor. + +**Difficulty: Medium.** Likely a subtle difference in how Java handles the interaction between `$` anchor in a group and conditional backreference. + +#### L. `$REGMARK` / `${^PREMATCH}` etc. (6 failures in pat_rt_report.t) + +`$REGMARK` (set by `(*MARK:name)`) is not implemented. Tests 2458-2463. + +**Difficulty: High.** Requires `(*MARK)` verb support. + +#### M. `(?1)` numbered group recursion / `(?&name)` named recursion (pat_advanced.t, pat_rt_report.t) + +`(?1)` and `(?&name)` syntax for recursing into capture groups is not recognized. 
Now downgradable with `JPERL_UNIMPLEMENTED=warn` (no longer crashes tests), but the patterns silently fail to match. + +**Difficulty: Very High.** Java's regex engine has no recursion support. Would need a custom engine or PCRE/JNI bridge. + +#### N. `\p{isAlpha}` POSIX-style Unicode property (crash in pat.t, fixed) + +The POSIX-style Unicode property syntax `\p{isAlpha}`, `\p{isSpace}` was not recognized. This caused the fatal error that stopped pat.t at line 1247, blocking 666 remaining tests. Resolved by Fix 5; pat.t now runs all 1298 tests. + +**Difficulty: Low-Medium.** Map POSIX-style aliases (`isAlpha` → `Alpha`, `isSpace` → `Space`, etc.) in the Unicode property handler. + +#### O. Empty clause in alternation (3 failures in pat.t) + +Empty alternatives in patterns like `/(|a)/` or the "0 match in alternation" test don't work correctly. + +**Difficulty: Low-Medium.** Likely a regex preprocessing issue. + +#### P. Miscellaneous (small counts) + +| Issue | Tests | Difficulty | +|-------|-------|-----------| +| Look around edge cases | pat.t 332-333 | Medium | +| REG_INFTY (quantifier limit) | pat.t 250 | Low | +| POSIX class error message format | pat.t 348 | Low | +| Lookbehind limit (Java) | pat.t 252 | Hard (engine limit) | +| Empty pattern pmop flags | pat_rt_report.t 44 | Medium | +| Nested split | pat_rt_report.t 85 | Medium | +| Ill-formed UTF-8 in class | pat_rt_report.t 140 | Medium | +| Pattern in loop (prev success) | pat_rt_report.t 2469-2470 | Medium | +| Long string patterns | pat_advanced.t 805-813 | Medium | +| `/d` to `/u` modifier change | pat_advanced.t 807-808 | Low-Medium | + +#### Q. Package-scoped user-defined Unicode properties (crash in pat_advanced.t) + +`\p{Is_q}` defined in package `Some` as `Some::Is_q` is not found because user-defined property lookup only checks `main::` package. Perl uses the current package when resolving `\p{...}` names. This crashes pat_advanced.t at line 2308 (test 1625), blocking 53 tests. 
+ +**Difficulty: Medium.** Need to pass the current package context to the regex preprocessor and try the current package before falling back to `main::`. + +#### R. Invalid single-char `\pX`/`\PX` properties (crash in regexp_unicode_prop.t) + +Invalid single-character properties like `\pf`, `\Pq` are passed through to Java's regex engine which throws `PatternSyntaxException`. This is caught and wrapped as `PerlJavaUnimplementedException`, which under `JPERL_UNIMPLEMENTED=warn` generates warnings instead of proper errors. Test 1096 in regexp_unicode_prop.t expects 0 warnings but gets 8 (from `\pf`, `\Pf`, `\pq`, `\Pq`), then crashes. + +**Fix approach:** Validate single-char properties in the preprocessor (only `\pL`, `\pM`, `\pN`, etc. are valid — single Unicode general category letters). Invalid ones should throw `PerlCompilerException` (not `PerlJavaUnimplementedException`). + +**Difficulty: Low.** Add validation for single-char `\p`/`\P` properties in `RegexPreprocessorHelper`. + +#### S. `/i` flag not passed to user-defined property subs (regexp_unicode_prop.t) + +Perl calls user-defined property subs with `$caseless=1` when the `/i` flag is active, allowing subs to return a wider character set for case-insensitive matching. PerlOnJava always calls the sub with an empty argument list. This causes 2 test failures in regexp_unicode_prop.t (tests 1061, 1077) and several in pat_advanced.t. + +**Fix approach:** Pass the `/i` flag through the regex preprocessor to `tryUserDefinedProperty`, which then passes `1` as the first argument to the property sub. + +**Difficulty: Medium.** Requires threading the case-insensitive flag through several method calls in the regex preprocessing pipeline. + +### Priority Recommendations + +**Quick wins (Low difficulty, high impact):** +1. ~~**`\p{isAlpha}` aliases** — unblocks 666 pat.t tests (category N)~~ **DONE** — pat.t now runs all 1298 tests +2. 
**Invalid `\pX` single-char properties** — unblocks 14 regexp_unicode_prop.t tests (category R) +3. **Useless `(?c)`/`(?g)`/`(?o)` warnings** — fixes 13 pat_advanced.t tests (category I) +4. **POSIX class error message** — fix message format (category P) +5. **REG_INFTY error** — add quantifier limit check (category P) + +**Medium effort, significant impact:** +6. **Package-scoped user properties** — unblocks 53 pat_advanced.t tests (category Q) +7. **`/i` caseless flag for user properties** — fixes ~4 tests (category S) +8. **`(?(1)...)` with `$` anchor** — fixes 48 pat_rt_report.t tests (category K) +9. **`@-`/`@+` position arrays** — fixes 17 tests across files (category F) +10. **`$^N` last capture** — fixes 20 pat_advanced.t tests (category C) +11. **Bare `\x` edge cases** — fixes 5 pat_advanced.t tests (category J) +12. **`\N{name}` charnames** — fixes 25 pat_advanced.t tests (category H) + +**Hard / architectural (major work):** +13. **`\G` anchor** — 26 pat.t tests (category A) +14. **`(?{...})` code blocks** — 46 tests total (category B) +15. 
**`(?1)` recursion / `(?&name)` / `(*ACCEPT)` / `(*MARK)`** — engine limitations (categories E, L, M) + +## Progress Tracking + +### Current Status: Major user-defined property and regex cache fixes (2026-04-10) + +### Completed +- [x] Fix 1: handleQuantifier brace consumption (2026-04-10) +- [x] Fix 2: \x{...} hex escape with non-hex chars (2026-04-10) +- [x] Fix 3: Bare \xNN with non-hex chars (2026-04-10) +- [x] Fix 4: NPE on failed regex with JPERL_UNIMPLEMENTED=warn (2026-04-10) +- [x] Failure analysis and categorization (2026-04-10) +- [x] Fix 5: \p{isAlpha} case-insensitive Is prefix, add Space/Alnum/Punct aliases (2026-04-10) +- [x] Fix 6: \p{Property=Value} syntax (2026-04-10) +- [x] Fix 7: Named capture groups with underscores — U95 encoding (2026-04-10) +- [x] Fix 8: User-defined property resolution — refactor resolvePropertyReference to return UnicodeSet (2026-04-10) + - Properties using +utf8:: references (e.g., +utf8::Uppercase, &utf8::ASCII) were failing because + the old code returned Java regex patterns that ICU4J's UnicodeSet couldn't parse + - Created resolvePropertyReferenceAsSet() and resolveStandardPropertyAsSet() methods +- [x] Fix 9: Regex cache preventing deferred recompilation (2026-04-10) + - ensureCompiledForRuntime() now evicts stale cache entries before recompiling +- [x] Fix 10: Cache user-defined property sub results (2026-04-10) + - Matches Perl behavior of calling each property sub only once + - Fixes "Called twice" errors from subs with `state` variables +- [x] Fix 11: Titlecase/TitlecaseLetter/Lt property aliases (2026-04-10) +- [x] Fix 12: (?&name) named group recursion downgraded to regexUnimplemented (2026-04-10) +- [x] Fix 13: (?digit) numbered recursion downgraded to regexUnimplemented (2026-04-10) + +### Files Modified +- `src/main/java/org/perlonjava/runtime/regex/RegexPreprocessor.java` +- `src/main/java/org/perlonjava/runtime/regex/RegexPreprocessorHelper.java` +- 
`src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java` +- `src/main/java/org/perlonjava/runtime/regex/UnicodeResolver.java` +- `src/main/java/org/perlonjava/runtime/regex/CaptureNameEncoder.java` +- `src/main/java/org/perlonjava/runtime/runtimetypes/HashSpecialVariable.java` diff --git a/src/main/java/org/perlonjava/backend/bytecode/CompileAssignment.java b/src/main/java/org/perlonjava/backend/bytecode/CompileAssignment.java index 642384312..1153cf356 100644 --- a/src/main/java/org/perlonjava/backend/bytecode/CompileAssignment.java +++ b/src/main/java/org/perlonjava/backend/bytecode/CompileAssignment.java @@ -1189,31 +1189,54 @@ public static void compileAssignmentOperator(BytecodeCompiler bytecodeCompiler, // Handle array slice assignment: @array[1, 3, 5] = (20, 30, 40) if (leftBin.operator.equals("[") && leftBin.left instanceof OperatorNode arrayOp) { - // Must be @array (not $array) - if (arrayOp.operator.equals("@") && arrayOp.operand instanceof IdentifierNode) { - String varName = "@" + ((IdentifierNode) arrayOp.operand).name; - + // Must be @array or @$ref (not $array) + if (arrayOp.operator.equals("@")) { int arrayReg; - if (bytecodeCompiler.currentSubroutineBeginId != 0 && bytecodeCompiler.currentSubroutineClosureVars != null - && bytecodeCompiler.currentSubroutineClosureVars.contains(varName)) { + + if (arrayOp.operand instanceof IdentifierNode) { + String varName = "@" + ((IdentifierNode) arrayOp.operand).name; + + if (bytecodeCompiler.currentSubroutineBeginId != 0 && bytecodeCompiler.currentSubroutineClosureVars != null + && bytecodeCompiler.currentSubroutineClosureVars.contains(varName)) { + arrayReg = bytecodeCompiler.allocateRegister(); + int nameIdx = bytecodeCompiler.addToStringPool(varName); + bytecodeCompiler.emitWithToken(Opcodes.RETRIEVE_BEGIN_ARRAY, node.getIndex()); + bytecodeCompiler.emitReg(arrayReg); + bytecodeCompiler.emit(nameIdx); + bytecodeCompiler.emit(bytecodeCompiler.currentSubroutineBeginId); + } else if 
(bytecodeCompiler.hasVariable(varName)) { + arrayReg = bytecodeCompiler.getVariableRegister(varName); + } else { + arrayReg = bytecodeCompiler.allocateRegister(); + String globalArrayName = NameNormalizer.normalizeVariableName( + ((IdentifierNode) arrayOp.operand).name, + bytecodeCompiler.getCurrentPackage() + ); + int nameIdx = bytecodeCompiler.addToStringPool(globalArrayName); + bytecodeCompiler.emit(Opcodes.LOAD_GLOBAL_ARRAY); + bytecodeCompiler.emitReg(arrayReg); + bytecodeCompiler.emit(nameIdx); + } + } else if (arrayOp.operand instanceof OperatorNode || arrayOp.operand instanceof BlockNode) { + // @$ref[@idx] = ... or @{expr}[@idx] = ... + // Compile the scalar reference expression and dereference to array + bytecodeCompiler.compileNode(arrayOp.operand, -1, RuntimeContextType.SCALAR); + int scalarReg = bytecodeCompiler.lastResultReg; arrayReg = bytecodeCompiler.allocateRegister(); - int nameIdx = bytecodeCompiler.addToStringPool(varName); - bytecodeCompiler.emitWithToken(Opcodes.RETRIEVE_BEGIN_ARRAY, node.getIndex()); - bytecodeCompiler.emitReg(arrayReg); - bytecodeCompiler.emit(nameIdx); - bytecodeCompiler.emit(bytecodeCompiler.currentSubroutineBeginId); - } else if (bytecodeCompiler.hasVariable(varName)) { - arrayReg = bytecodeCompiler.getVariableRegister(varName); + if (bytecodeCompiler.isStrictRefsEnabled()) { + bytecodeCompiler.emitWithToken(Opcodes.DEREF_ARRAY, node.getIndex()); + bytecodeCompiler.emitReg(arrayReg); + bytecodeCompiler.emitReg(scalarReg); + } else { + int pkgIdx = bytecodeCompiler.addToStringPool(bytecodeCompiler.getCurrentPackage()); + bytecodeCompiler.emitWithToken(Opcodes.DEREF_ARRAY_NONSTRICT, node.getIndex()); + bytecodeCompiler.emitReg(arrayReg); + bytecodeCompiler.emitReg(scalarReg); + bytecodeCompiler.emit(pkgIdx); + } } else { - arrayReg = bytecodeCompiler.allocateRegister(); - String globalArrayName = NameNormalizer.normalizeVariableName( - ((IdentifierNode) arrayOp.operand).name, - bytecodeCompiler.getCurrentPackage() - ); - 
int nameIdx = bytecodeCompiler.addToStringPool(globalArrayName); - bytecodeCompiler.emit(Opcodes.LOAD_GLOBAL_ARRAY); - bytecodeCompiler.emitReg(arrayReg); - bytecodeCompiler.emit(nameIdx); + bytecodeCompiler.throwCompilerException("Array slice assignment requires identifier or reference"); + return; } // Compile indices (right side of []) diff --git a/src/main/java/org/perlonjava/backend/jvm/EmitOperator.java b/src/main/java/org/perlonjava/backend/jvm/EmitOperator.java index 9ac47f9f9..c9e919083 100644 --- a/src/main/java/org/perlonjava/backend/jvm/EmitOperator.java +++ b/src/main/java/org/perlonjava/backend/jvm/EmitOperator.java @@ -534,9 +534,28 @@ static void handleSpliceBuiltin(EmitterVisitor emitterVisitor, OperatorNode node if (first != null) { try { + MethodVisitor mv = emitterVisitor.ctx.mv; first.accept(emitterVisitor.with(RuntimeContextType.LIST)); + + // Spill the first operand before evaluating remaining args so + // non-local control flow can't jump to returnLabel with an + // extra value on the JVM operand stack. + int firstSlot = emitterVisitor.ctx.javaClassInfo.acquireSpillSlot(); + boolean pooled = firstSlot >= 0; + if (!pooled) { + firstSlot = emitterVisitor.ctx.symbolTable.allocateLocalVariable(); + } + mv.visitVarInsn(Opcodes.ASTORE, firstSlot); + // Accept the remaining arguments in LIST context. args.accept(emitterVisitor.with(RuntimeContextType.LIST)); + + mv.visitVarInsn(Opcodes.ALOAD, firstSlot); + mv.visitInsn(Opcodes.SWAP); + + if (pooled) { + emitterVisitor.ctx.javaClassInfo.releaseSpillSlot(); + } } finally { listArgs.elements.addFirst(first); } diff --git a/src/main/java/org/perlonjava/core/Configuration.java b/src/main/java/org/perlonjava/core/Configuration.java index 4dcd7910d..c4c76eb5a 100644 --- a/src/main/java/org/perlonjava/core/Configuration.java +++ b/src/main/java/org/perlonjava/core/Configuration.java @@ -33,7 +33,7 @@ public final class Configuration { * Automatically populated by Gradle/Maven during build. 
* DO NOT EDIT MANUALLY - this value is replaced at build time. */ - public static final String gitCommitId = "b8043f312"; + public static final String gitCommitId = "c6ee04074"; /** * Git commit date of the build (ISO format: YYYY-MM-DD). @@ -48,7 +48,7 @@ public final class Configuration { * Parsed by App::perlbrew and other tools via: perl -V | grep "Compiled at" * DO NOT EDIT MANUALLY - this value is replaced at build time. */ - public static final String buildTimestamp = "Apr 10 2026 11:59:23"; + public static final String buildTimestamp = "Apr 10 2026 13:40:40"; // Prevent instantiation private Configuration() { diff --git a/src/main/java/org/perlonjava/frontend/parser/OperatorParser.java b/src/main/java/org/perlonjava/frontend/parser/OperatorParser.java index 91b1bb3da..1167eaa03 100644 --- a/src/main/java/org/perlonjava/frontend/parser/OperatorParser.java +++ b/src/main/java/org/perlonjava/frontend/parser/OperatorParser.java @@ -252,6 +252,61 @@ static BinaryOperatorNode parsePrint(Parser parser, LexerToken token, int curren return new BinaryOperatorNode(token.text, handle, operand, currentIndex); } + /** + * Check if a variable name refers to a forced-global variable that cannot + * be lexicalized with 'my' or 'state'. + * + * Perl rule: the following are always global: + * - $_, @_, %_ (the underscore variables, since Perl 5.30) + * - $0, $1, $2, ... (digit-only names) + * - $!, $/, $@, $;, $,, $., $|, etc. (single punctuation character names) + * - $^W, $^H, etc. (control character / caret variable names) + */ + private static boolean isGlobalOnlyVariable(String name) { + if (name == null || name.isEmpty()) return false; + + // Underscore: $_, @_, %_ are all forced global (since Perl 5.30) + if (name.equals("_")) return true; + + // Digit-only names: $0, $1, $2, ... 
+ boolean allDigits = true; + for (int i = 0; i < name.length(); i++) { + if (!Character.isDigit(name.charAt(i))) { + allDigits = false; + break; + } + } + if (allDigits) return true; + + // Single ASCII non-alphanumeric, non-underscore character: $!, $/, $@, $;, etc. + // Only check ASCII range — Unicode characters (>= 128) may be valid identifiers + // even if Java's Character.isLetterOrDigit() doesn't recognize them. + if (name.length() == 1) { + char c = name.charAt(0); + if (c < 128 && !Character.isLetterOrDigit(c) && c != '_') return true; + } + + // Control character prefix (caret variables like $^W stored as chr(23)) + if (name.charAt(0) < 32) return true; + + return false; + } + + /** + * Format a variable name for display in error messages. + * Converts internal control character representation back to ^X form. + * E.g., chr(23) + "" becomes "^W", chr(8) + "MATCH" becomes "^HMATCH". + */ + private static String formatVarNameForDisplay(String name) { + if (name == null || name.isEmpty()) return name; + char first = name.charAt(0); + if (first < 32) { + // Control character: convert to ^X notation + return "^" + (char) (first + 'A' - 1) + name.substring(1); + } + return name; + } + private static void addVariableToScope(EmitterContext ctx, String operator, OperatorNode node) { String sigil = node.operator; if ("$@%".contains(sigil)) { @@ -260,7 +315,19 @@ private static void addVariableToScope(EmitterContext ctx, String operator, Oper if (identifierNode instanceof IdentifierNode) { // my $a String name = ((IdentifierNode) identifierNode).name; String var = sigil + name; - + + // Check for global-only variables in my/state declarations + // Perl: "Can't use global $0 in "my"" + if ((operator.equals("my") || operator.equals("state")) + && isGlobalOnlyVariable(name)) { + throw new PerlCompilerException( + node.getIndex(), + "Can't use global " + sigil + formatVarNameForDisplay(name) + + " in \"" + operator + "\"", + ctx.errorUtil + ); + } + // Check for 
redeclaration warnings if (operator.equals("our")) { // For 'our', only warn if redeclared in the same package (matching Perl behavior) diff --git a/src/main/java/org/perlonjava/runtime/perlmodule/Storable.java b/src/main/java/org/perlonjava/runtime/perlmodule/Storable.java index 0fc8197bd..7b4d10202 100644 --- a/src/main/java/org/perlonjava/runtime/perlmodule/Storable.java +++ b/src/main/java/org/perlonjava/runtime/perlmodule/Storable.java @@ -548,9 +548,12 @@ private static RuntimeScalar deepClone(RuntimeScalar scalar, IdentityHashMap 5; } - // FUTURE ENHANCEMENTS: - // - // For underscore support: (?) - // Use the same hex encoding pattern: (?) where HEX encodes "my_name" - // Then %CAPTURE decodes back to show original name to user + // UNDERSCORE ENCODING: // - // For duplicate names: (?a)|(?b) - // Encode with disambiguation: (?a)|(?b) where HEX encodes "name" - // Track mapping for proper capture group retrieval + // Java regex doesn't allow underscores in group names (only [a-zA-Z][a-zA-Z0-9]*). + // Perl allows \w+ (letters, digits, underscores) for group names. // - // The generic hex encoding pattern is reusable for all Java regex limitations! + // Encoding: Replace each underscore with "U95" (ASCII code 95 for '_') + // (?) → (?) + // (?<_>) → (?<U95>) + // (?<_foo>) → (?<U95foo>) + // + // Names starting with underscore need a letter prefix for Java, so U95 works + // since it starts with 'U'. To avoid ambiguity, literal "U95" sequences in + // names are escaped as "UU95" (the 'U' itself is escaped). + + /** + * Encodes a Perl capture group name for use in Java regex. + * Replaces underscores with "U95" and escapes literal "U95" sequences. 
+ * + * @param perlName The original Perl capture group name + * @return The encoded name safe for Java regex, or the original if no encoding needed + */ + public static String encodeGroupName(String perlName) { + if (perlName == null || (!perlName.contains("_") && !perlName.contains("U95"))) { + return perlName; + } + // First escape any existing "U95" as "UU95" to avoid ambiguity + String encoded = perlName.replace("U95", "UU95"); + // Then replace underscores with "U95" + encoded = encoded.replace("_", "U95"); + return encoded; + } + + /** + * Decodes a Java regex capture group name back to the original Perl name. + * Reverses the encoding done by encodeGroupName. + * + * @param javaName The encoded Java group name + * @return The original Perl capture group name + */ + public static String decodeGroupName(String javaName) { + if (javaName == null || !javaName.contains("U95")) { + return javaName; + } + // First restore underscores from "U95" + String decoded = javaName.replace("U95", "_"); + // Then restore literal "U95" from "U_95" (which was "UU95" before first step) + decoded = decoded.replace("U_95", "U95"); + return decoded; + } + + /** + * Checks if a capture group name is an internal name that should be hidden + * from user-visible variables like %+ and %-. 
+ * + * @param captureName The capture group name to check + * @return true if this is an internal capture (code block or \K marker) + */ + public static boolean isInternalCapture(String captureName) { + return isCodeBlockCapture(captureName) || "perlK".equals(captureName); + } } diff --git a/src/main/java/org/perlonjava/runtime/regex/RegexPreprocessor.java b/src/main/java/org/perlonjava/runtime/regex/RegexPreprocessor.java index 52961e4a3..035d8332f 100644 --- a/src/main/java/org/perlonjava/runtime/regex/RegexPreprocessor.java +++ b/src/main/java/org/perlonjava/runtime/regex/RegexPreprocessor.java @@ -93,6 +93,9 @@ static boolean hadBackslashK() { * @throws PerlCompilerException If there are unmatched parentheses in the regex. */ static String preProcessRegex(String s, RegexFlags regexFlags) { + if (s == null) { + s = ""; + } captureGroupCount = 0; deferredUnicodePropertyEncountered = false; inlinePFlagEncountered = false; @@ -997,7 +1000,7 @@ private static int handleParentheses(String s, int offset, int length, StringBui sb.append("(?:)"); // Throw error that can be caught by JPERL_UNIMPLEMENTED=warn - regexError(s, offset + 2, "Regex control verb " + verb + " not implemented"); + regexUnimplemented(s, offset + 2, "Regex control verb " + verb + " not implemented"); return verbEnd; // Skip past the entire verb construct } @@ -1019,7 +1022,7 @@ private static int handleParentheses(String s, int offset, int length, StringBui return offset; // offset points to ')', caller will increment past it } else if (c3 == '@') { // Handle (?@...) which is not implemented - regexError(s, offset + 3, "Sequence (?@...) not implemented"); + regexUnimplemented(s, offset + 3, "Sequence (?@...) 
not implemented"); } else if (c3 == '{') { // Check if this is our special unimplemented marker if (s.startsWith("(?{UNIMPLEMENTED_CODE_BLOCK})", offset)) { @@ -1095,8 +1098,8 @@ private static int handleParentheses(String s, int offset, int length, StringBui validateLookbehindLength(s, offset); sb.append("(? ... ) + } else if (c3 == '<' && (isAlphabetic(c4) || c4 == '_')) { + // Handle named capture (? ... ) - name can start with letter or underscore offset = handleNamedCapture(c3, s, offset, length, sb, regexFlags); } else if (c3 == '<') { // Invalid character after (?< @@ -1138,7 +1141,10 @@ private static int handleParentheses(String s, int offset, int length, StringBui } else if (Character.isDigit(c3)) { // Recursive subpattern reference (?1), (?2), etc. // These refer to the subpattern with that number and are recursive - regexError(s, offset + 2, "Sequence (?" + ((char) c3) + "...) not recognized"); + regexUnimplemented(s, offset + 2, "Sequence (?" + ((char) c3) + "...) not recognized"); + } else if (c3 == '&') { + // Named group recursion (?&name) - Perl feature not yet implemented + regexUnimplemented(s, offset + 2, "Sequence (?&...) 
not recognized"); } else { // Unknown sequence - show the actual character String seq = "(?"; @@ -1194,7 +1200,9 @@ private static int handleNamedCapture(int c, String s, int offset, int length, S regexError(s, offset, "Unterminated named capture in regex"); } String name = s.substring(start, end); - sb.append("(?<").append(name).append(">"); + // Encode underscores for Java regex compatibility + String encodedName = CaptureNameEncoder.encodeGroupName(name); + sb.append("(?<").append(encodedName).append(">"); captureGroupCount++; // Increment counter for capturing groups return handleRegex(s, end + 1, sb, regexFlags, true); // Process content inside the group } @@ -1523,7 +1531,7 @@ private static void validateLookbehindLength(String s, int offset) { int maxLength = calculateMaxLength(s, start); if (maxLength >= 255 || maxLength == -1) { // >= 255 means 255 or more - regexErrorSimple(s, "Lookbehind longer than 255 not implemented"); + throw new PerlJavaUnimplementedException("Lookbehind longer than 255 not implemented in regex m/" + s + "/"); } } @@ -1868,11 +1876,11 @@ private static int[] handleQuantifier(String s, int offset, StringBuilder sb) { // Valid quantifier forms: {n}, {n,}, {n,m}, {,m} // Invalid (literal): {}, {,}, {abc}, etc. if (!isValid || (!hasFirstNumber && !hasSecondNumber)) { - // Not a valid quantifier - treat braces as literal (escape for Java regex) + // Not a valid quantifier - treat opening brace as literal (escape for Java regex). + // Don't consume content up to '}' — it may contain regex metacharacters + // (like parentheses, character classes, etc.) that need proper processing. 
sb.append("\\{"); - sb.append(quantifier); - sb.append("\\}"); - return new int[]{end, 1}; // literal + return new int[]{start, 1}; // literal, offset stays at '{' so caller increments past it } // Valid quantifier - pass through to Java @@ -2217,23 +2225,23 @@ private static int handleCodeBlock(String s, int offset, int length, StringBuild // Append a named capture group that matches empty string // This allows us to store the constant value without affecting the match - sb.append("(?<").append(captureName).append(">)"); + sb.append("(?<").append(captureName).append(">"); - // Skip past '}' and ')' - the closing brace and paren of (?{...}) - // codeEnd points to the '}', so we need to skip '}' and ')' + // Return offset pointing to the ')' so handleGroup can consume it + // codeEnd points to the '}', the next char should be ')' if (codeEnd + 1 < length && s.charAt(codeEnd + 1) == ')') { - return codeEnd + 2; // Skip past both '}' and ')' + return codeEnd + 1; // Point to ')' for handleGroup to consume } return codeEnd + 1; // Just skip past '}' if no ')' found } // Non-constant code block: replace with no-op group so the regex compiles. // This allows tests that use (?{...}) in non-critical parts to continue running. 
- sb.append("(?:)"); + sb.append("(?:"); - // Skip past '}' and ')' - the closing brace and paren of (?{...}) + // Return offset pointing to the ')' so handleGroup can consume it if (codeEnd + 1 < length && s.charAt(codeEnd + 1) == ')') { - return codeEnd + 2; // Skip past both '}' and ')' + return codeEnd + 1; // Point to ')' for handleGroup to consume } return codeEnd + 1; // Just skip past '}' if no ')' found } diff --git a/src/main/java/org/perlonjava/runtime/regex/RegexPreprocessorHelper.java b/src/main/java/org/perlonjava/runtime/regex/RegexPreprocessorHelper.java index b6699dbd7..b9f0d73c9 100644 --- a/src/main/java/org/perlonjava/runtime/regex/RegexPreprocessorHelper.java +++ b/src/main/java/org/perlonjava/runtime/regex/RegexPreprocessorHelper.java @@ -83,14 +83,31 @@ static int handleEscapeSequences(String s, StringBuilder sb, int c, int offset, int endQuote = s.indexOf('\'', offset); if (endQuote != -1) { String name = s.substring(offset, endQuote); + // Encode underscores for Java regex compatibility + String encodedName = CaptureNameEncoder.encodeGroupName(name); // Convert to Java syntax \k<name> sb.setLength(sb.length() - 1); // Remove the backslash - sb.append("\\k<").append(name).append(">"); + sb.append("\\k<").append(encodedName).append(">"); return endQuote; // Return position at closing quote } else { RegexPreprocessor.regexError(s, offset - 2, "Unterminated \\k'...' 
backreference"); } } + if (nextChar == 'k' && offset + 1 < length && s.charAt(offset + 1) == '<') { + // Handle \k<name> backreference (also valid Perl syntax) + offset += 2; // Skip past \k< + int endAngle = s.indexOf('>', offset); + if (endAngle != -1) { + String name = s.substring(offset, endAngle); + // Encode underscores for Java regex compatibility + String encodedName = CaptureNameEncoder.encodeGroupName(name); + sb.setLength(sb.length() - 1); // Remove the backslash + sb.append("\\k<").append(encodedName).append(">"); + return endAngle; // Return position at closing > + } else { + RegexPreprocessor.regexError(s, offset - 2, "Unterminated \\k<...> backreference"); + } + } if (nextChar == 'g') { if (offset + 1 < length && s.charAt(offset + 1) == '{') { // Handle \g{name} backreference @@ -124,9 +141,10 @@ static int handleEscapeSequences(String s, StringBuilder sb, int c, int offset, sb.append("\\").append(groupNum); } } catch (NumberFormatException e) { - // It's a named reference + // It's a named reference - encode underscores for Java regex + String encodedRef = CaptureNameEncoder.encodeGroupName(ref); sb.setLength(sb.length() - 1); // Remove the backslash - sb.append("\\k<").append(ref).append(">"); + sb.append("\\k<").append(encodedRef).append(">"); } offset = endBrace; } @@ -343,7 +361,7 @@ static int handleEscapeSequences(String s, StringBuilder sb, int c, int offset, // But if the error already contains "in expansion of", it is a real user-property definition error // that should be reported (not deferred). 
String msg = e.getMessage(); - if (property.matches("^(.*::)?(Is|In)[A-Z].*") && (msg == null || !msg.contains("in expansion of"))) { + if (property.matches("^(.*::)?([Ii][sSNn]).+") && (msg == null || !msg.contains("in expansion of"))) { RegexPreprocessor.markDeferredUnicodePropertyEncountered(); sb.setLength(sb.length() - 1); // Remove the backslash // Placeholder: match any single character, including newline @@ -440,18 +458,69 @@ static int handleEscapeSequences(String s, StringBuilder sb, int c, int offset, sb.setLength(sb.length() - 1); // Remove the backslash sb.append(Character.toChars(c2)); } else if (c2 == 'x' && offset + 1 < length && s.charAt(offset + 1) == '{') { - // \x{...} hex escape - consume entire sequence so main loop doesn't see the braces - sb.append('x'); - sb.append('{'); + // \x{...} hex escape - parse and normalize the hex value. + // Perl stops at the first non-hex character (after removing underscores). offset += 2; // Skip past x{ - while (offset < length && s.charAt(offset) != '}') { - sb.append(s.charAt(offset)); - offset++; + int endBrace = -1; + for (int i = offset; i < length; i++) { + if (s.charAt(i) == '}') { + endBrace = i; + break; + } } - if (offset < length) { - sb.append('}'); // Append closing brace + if (endBrace != -1) { + String hexStr = s.substring(offset, endBrace).trim().replace("_", ""); + // Extract valid hex prefix + int validLen = 0; + for (int i = 0; i < hexStr.length(); i++) { + char ch = hexStr.charAt(i); + if ((ch >= '0' && ch <= '9') || (ch >= 'a' && ch <= 'f') || (ch >= 'A' && ch <= 'F')) { + validLen++; + } else { + break; + } + } + int value; + if (validLen == 0) { + value = 0; // No valid hex digits → \x00 + } else { + value = Integer.parseInt(hexStr.substring(0, validLen), 16); + } + sb.append(String.format("x{%X}", value)); + offset = endBrace; + } else { + // No closing brace - pass through as-is + sb.append('x'); } // offset now points to '}', caller will increment + } else if (c2 == 'x') { + // Bare 
\xNN (no braces) - Perl takes up to 2 hex digits. + // If fewer than 2 valid hex digits, stop at first non-hex char. + // Java's Pattern requires exactly 2 hex digits for \xHH, so normalize. + int hexVal = 0; + int hexDigits = 0; + int pos = offset + 1; // position after 'x' + while (hexDigits < 2 && pos < length) { + char ch = s.charAt(pos); + if ((ch >= '0' && ch <= '9') || (ch >= 'a' && ch <= 'f') || (ch >= 'A' && ch <= 'F')) { + hexVal = hexVal * 16 + Character.digit(ch, 16); + hexDigits++; + pos++; + } else { + break; + } + } + if (hexDigits == 2) { + // Standard \xHH - pass through (Java handles it natively) + sb.append('x'); + sb.append(s.charAt(offset + 1)); + sb.append(s.charAt(offset + 2)); + offset += 2; + } else { + // 0 or 1 hex digits - use \x{H} format for Java + sb.append(String.format("x{%X}", hexVal)); + offset = pos - 1; // -1 because caller will increment + } } else { // Other escape sequences, pass through sb.append(Character.toChars(c2)); @@ -565,12 +634,22 @@ static int handleRegexCharacterClassEscape(int offset, String s, StringBuilder s int endBrace = s.indexOf('}', nextPos + 3); if (endBrace != -1) { String hex = s.substring(nextPos + 3, endBrace).trim().replace("_", ""); - try { - nextChar = Integer.parseInt(hex, 16); - rangeEndCharCount = endBrace - nextPos + 1; - } catch (NumberFormatException e) { - nextChar = -1; + // Extract valid hex prefix (Perl stops at first non-hex char) + int vLen = 0; + for (int i = 0; i < hex.length(); i++) { + char ch = hex.charAt(i); + if ((ch >= '0' && ch <= '9') || (ch >= 'a' && ch <= 'f') || (ch >= 'A' && ch <= 'F')) { + vLen++; + } else { + break; + } + } + if (vLen == 0) { + nextChar = 0; + } else { + nextChar = Integer.parseInt(hex.substring(0, vLen), 16); } + rangeEndCharCount = endBrace - nextPos + 1; } } else if (nextChar == 'o' && nextPos + 2 < length && s.charAt(nextPos + 2) == '{') { // Parse \o{NNNN} as range endpoint @@ -736,14 +815,25 @@ static int handleRegexCharacterClassEscape(int offset, 
String s, StringBuilder s String hexStr = s.substring(offset, endBrace).trim(); // Remove underscores (Perl allows them in number literals) hexStr = hexStr.replace("_", ""); - try { - int value = Integer.parseInt(hexStr, 16); - sb.append(String.format("x{%X}", value)); - offset = endBrace; - lastChar = value; - } catch (NumberFormatException e) { - RegexPreprocessor.regexError(s, offset, "Invalid hex number in \\x{...}"); + // Extract valid hex prefix (Perl stops at first non-hex char) + int validLen = 0; + for (int i = 0; i < hexStr.length(); i++) { + char ch = hexStr.charAt(i); + if ((ch >= '0' && ch <= '9') || (ch >= 'a' && ch <= 'f') || (ch >= 'A' && ch <= 'F')) { + validLen++; + } else { + break; + } + } + int value; + if (validLen == 0) { + value = 0; // No valid hex digits → \x00 + } else { + value = Integer.parseInt(hexStr.substring(0, validLen), 16); } + sb.append(String.format("x{%X}", value)); + offset = endBrace; + lastChar = value; } else { RegexPreprocessor.regexError(s, offset, "Missing right brace on \\x{}"); } diff --git a/src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java b/src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java index 685b56255..0f2aeafad 100644 --- a/src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java +++ b/src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java @@ -227,10 +227,28 @@ public static RuntimeRegex compile(String patternString, String modifiers) { } } } catch (Exception e) { - if (GlobalVariable.getGlobalHash("main::ENV").get("JPERL_UNIMPLEMENTED").toString().equals("warn") - ) { - // Warn for unimplemented features and Java regex compilation errors - String base = (e instanceof PerlJavaUnimplementedException) ? e.getMessage() : ("Regex compilation failed: " + e.getMessage()); + // PerlJavaUnimplementedException extends PerlCompilerException, so check + // the more specific type first. Real syntax errors (PerlCompilerException + // but NOT PerlJavaUnimplementedException) are always fatal. 
+ // Java PatternSyntaxException etc. are wrapped as unimplemented. + boolean isUnimplemented = e instanceof PerlJavaUnimplementedException; + boolean isRealSyntaxError = !isUnimplemented && e instanceof PerlCompilerException; + + if (isRealSyntaxError) { + throw (PerlCompilerException) e; + } + + // Wrap non-Perl exceptions (PatternSyntaxException etc.) as unimplemented + PerlJavaUnimplementedException unimplEx; + if (isUnimplemented) { + unimplEx = (PerlJavaUnimplementedException) e; + } else { + unimplEx = new PerlJavaUnimplementedException("Regex compilation failed: " + e.getMessage()); + } + + // With JPERL_UNIMPLEMENTED=warn, downgrade to warning and use a never-matching pattern + if (GlobalVariable.getGlobalHash("main::ENV").get("JPERL_UNIMPLEMENTED").toString().equals("warn")) { + String base = unimplEx.getMessage(); // Include original and preprocessed patterns to aid debugging String patternInfo = " [pattern='" + (patternString == null ? "" : patternString) + "'" + (javaPattern != null ? ", java='" + javaPattern + "'" : "") + "]"; @@ -242,11 +260,12 @@ public static RuntimeRegex compile(String patternString, String modifiers) { WarnDie.warn(new RuntimeScalar(errorMessage), new RuntimeScalar()); regex.pattern = Pattern.compile(Character.toString(0) + "ERROR" + Character.toString(0), Pattern.DOTALL); regex.patternUnicode = regex.pattern; // Error pattern - same for both - } else { - if (e instanceof PerlCompilerException) { - throw e; + // Ensure patternString is set so downstream code doesn't NPE + if (regex.patternString == null) { + regex.patternString = patternString != null ? patternString : ""; } - throw new PerlJavaUnimplementedException("Regex compilation failed: " + e.getMessage()); + } else { + throw unimplEx; } } @@ -271,6 +290,12 @@ private static RuntimeRegex ensureCompiledForRuntime(RuntimeRegex regex) { // Recompile once, now that runtime may have defined user properties. 
// To avoid infinite loops if recompilation still can't resolve, clear the flag first. regex.deferredUserDefinedUnicodeProperties = false; + + // Evict the old cached entry so compile() will actually recompile + // instead of returning the stale regex with deferred placeholders. + String cacheKey = regex.patternString + "/" + (regex.regexFlags == null ? "" : regex.regexFlags.toFlagString()); + regexCache.remove(cacheKey); + RuntimeRegex recompiled = compile(regex.patternString, regex.regexFlags == null ? "" : regex.regexFlags.toFlagString()); regex.pattern = recompiled.pattern; regex.patternUnicode = recompiled.patternUnicode; diff --git a/src/main/java/org/perlonjava/runtime/regex/UnicodeResolver.java b/src/main/java/org/perlonjava/runtime/regex/UnicodeResolver.java index 570873797..164ac3473 100644 --- a/src/main/java/org/perlonjava/runtime/regex/UnicodeResolver.java +++ b/src/main/java/org/perlonjava/runtime/regex/UnicodeResolver.java @@ -5,10 +5,20 @@ import com.ibm.icu.text.UnicodeSet; import org.perlonjava.runtime.runtimetypes.*; +import java.util.HashMap; import java.util.HashSet; +import java.util.Map; import java.util.Set; public class UnicodeResolver { + /** + * Cache for user-defined property subroutine results. + * Perl only calls user-defined property subs once per unique name and caches the result. + * Key: fully qualified sub name (e.g., "main::IsMyUpper") + * Value: the parsed character class pattern from parseUserDefinedProperty + */ + private static final Map<String, String> userPropertyCache = new HashMap<>(); + /** * Retrieves the Unicode code point for a given character name. 
* Supports: @@ -150,24 +160,22 @@ private static String parseUserDefinedProperty(String definition, Set re if (line.startsWith("+")) { // Add another property String propName = line.substring(1).trim(); - String propPattern = resolvePropertyReference(propName, recursionSet, propertyName); - UnicodeSet propSet = new UnicodeSet(propPattern); + UnicodeSet propSet = resolvePropertyReferenceAsSet(propName, recursionSet, propertyName); resultSet.addAll(propSet); } else if (line.startsWith("-") || line.startsWith("!")) { // Remove a property String propName = line.substring(1).trim(); - String propPattern = resolvePropertyReference(propName, recursionSet, propertyName); - UnicodeSet propSet = new UnicodeSet(propPattern); + UnicodeSet propSet = resolvePropertyReferenceAsSet(propName, recursionSet, propertyName); resultSet.removeAll(propSet); } else if (line.startsWith("&")) { // Intersection with a property String propName = line.substring(1).trim(); - String propPattern = resolvePropertyReference(propName, recursionSet, propertyName); + UnicodeSet propSet = resolvePropertyReferenceAsSet(propName, recursionSet, propertyName); if (!hasIntersection) { - intersectionSet = new UnicodeSet(propPattern); + intersectionSet = propSet; hasIntersection = true; } else { - intersectionSet.retainAll(new UnicodeSet(propPattern)); + intersectionSet.retainAll(propSet); } } else { // Parse hex range - extract the hex part before any comments @@ -184,7 +192,9 @@ private static String parseUserDefinedProperty(String definition, Set re try { long codePoint = Long.parseLong(hexStr, 16); if (codePoint > 0x10FFFF) { - throw new IllegalArgumentException("Code point too large in \"" + line.trim() + "\" in expansion of " + propertyName); + // JVM only supports Unicode up to U+10FFFF; silently clamp + // (Perl supports 31-bit/32-bit code points, but Java doesn't) + codePoint = 0x10FFFF; } resultSet.add((int) codePoint); } catch (NumberFormatException e) { @@ -204,11 +214,14 @@ private static String 
parseUserDefinedProperty(String definition, Set re long start = Long.parseLong(startHex, 16); long end = Long.parseLong(endHex, 16); + // JVM only supports Unicode up to U+10FFFF; clamp values + // (Perl supports 31-bit/32-bit code points, but Java doesn't) if (start > 0x10FFFF) { - throw new IllegalArgumentException("Code point too large in \"" + line.trim() + "\" in expansion of " + propertyName); + // Entire range is beyond JVM limit; skip it + continue; } if (end > 0x10FFFF) { - throw new IllegalArgumentException("Code point too large in \"" + line.trim() + "\" in expansion of " + propertyName); + end = 0x10FFFF; } if (start > end) { throw new IllegalArgumentException("Illegal range in \"" + line.trim() + "\" in expansion of " + propertyName); @@ -231,14 +244,16 @@ private static String parseUserDefinedProperty(String definition, Set re } /** - * Resolves a property reference (like utf8::InHiragana or main::IsMyProp). + * Resolves a property reference to a UnicodeSet (like utf8::InHiragana or main::IsMyProp). + * Returns a UnicodeSet directly instead of a Java regex pattern string, so the result + * can be used with UnicodeSet set operations (addAll, removeAll, retainAll). 
* * @param propRef The property reference * @param recursionSet Set to track recursive property calls * @param parentProperty The parent property name (for error messages) - * @return A character class pattern + * @return A UnicodeSet representing the property */ - private static String resolvePropertyReference(String propRef, Set recursionSet, String parentProperty) { + private static UnicodeSet resolvePropertyReferenceAsSet(String propRef, Set recursionSet, String parentProperty) { // Check for recursion if (recursionSet.contains(propRef)) { // Build recursion chain for error message @@ -257,23 +272,166 @@ private static String resolvePropertyReference(String propRef, Set recur } // Remove utf8:: prefix if present + String propName = propRef; if (propRef.startsWith("utf8::")) { - String stdProp = propRef.substring(6); + propName = propRef.substring(6); + } + + // Try to resolve as a standard Unicode property via ICU4J + UnicodeSet result = resolveStandardPropertyAsSet(propName, recursionSet); + if (result != null) { + return result; + } + + // Try as user-defined property (calls the Perl sub) + String fallbackRef = propRef.startsWith("utf8::") ? "main::" + propRef.substring(6) : propRef; + String userProp = tryUserDefinedProperty(fallbackRef, recursionSet); + if (userProp != null) { + // userProp is a character class pattern from unicodeSetToJavaPattern + return new UnicodeSet("[" + userProp + "]"); + } + + throw new IllegalArgumentException("Invalid or unsupported Unicode property: " + propRef); + } + + /** + * Resolves a standard Unicode property name to a UnicodeSet using ICU4J directly. + * Handles the same aliases as translateUnicodeProperty but returns a UnicodeSet. 
+ * + * @param property The property name (without utf8:: prefix) + * @param recursionSet Set to track recursive property calls + * @return A UnicodeSet, or null if the property cannot be resolved + */ + private static UnicodeSet resolveStandardPropertyAsSet(String property, Set recursionSet) { + // Handle well-known Perl property aliases + switch (property) { + case "XPosixSpace": case "XPerlSpace": case "SpacePerl": + case "Space": case "White_Space": { + UnicodeSet set = new UnicodeSet(); + set.applyPropertyAlias("White_Space", "True"); + return set; + } + case "XPosixAlnum": case "Alnum": { + UnicodeSet set = new UnicodeSet(); + set.applyPropertyAlias("Alphabetic", "True"); + UnicodeSet digits = new UnicodeSet(); + digits.applyPropertyAlias("gc", "Nd"); + set.addAll(digits); + return set; + } + case "XPosixAlpha": case "Alpha": case "Alphabetic": { + UnicodeSet set = new UnicodeSet(); + set.applyPropertyAlias("Alphabetic", "True"); + return set; + } + case "XPosixUpper": case "Upper": case "Uppercase": { + UnicodeSet set = new UnicodeSet(); + set.applyPropertyAlias("Uppercase", "True"); + return set; + } + case "Titlecase": case "TitlecaseLetter": case "Titlecase_Letter": case "Lt": { + UnicodeSet set = new UnicodeSet(); + set.applyPropertyAlias("gc", "Lt"); + return set; + } + case "XPosixLower": case "Lower": case "Lowercase": { + UnicodeSet set = new UnicodeSet(); + set.applyPropertyAlias("Lowercase", "True"); + return set; + } + case "XPosixDigit": case "Decimal_Number": case "Digit": case "Nd": { + UnicodeSet set = new UnicodeSet(); + set.applyPropertyAlias("gc", "Nd"); + return set; + } + case "XPosixPunct": case "Punct": case "Punctuation": { + UnicodeSet set = new UnicodeSet(); + set.applyPropertyAlias("gc", "P"); + return set; + } + case "Dash": { + UnicodeSet set = new UnicodeSet(); + set.applyPropertyAlias("Dash", "True"); + return set; + } + case "Hex_Digit": case "Hex": case "XPosixXDigit": case "XDigit": + case "ASCII_Hex_Digit": case "AHex": { + 
UnicodeSet set = new UnicodeSet(); + set.applyPropertyAlias("ASCII_Hex_Digit", "True"); + return set; + } + case "Cn": { + UnicodeSet set = new UnicodeSet(); + set.applyPropertyAlias("gc", "Cn"); + return set; + } + case "ASCII": { + return new UnicodeSet("[\\u0000-\\u007F]"); + } + default: + break; + } + + // Strip Is/In prefix for Perl compatibility + String stripped = property; + if (property.length() > 2 + && (property.charAt(0) == 'I' || property.charAt(0) == 'i') + && (property.charAt(1) == 's' || property.charAt(1) == 'S') + && Character.isUpperCase(property.charAt(2))) { + stripped = property.substring(2); + // Recurse with stripped name + UnicodeSet result = resolveStandardPropertyAsSet(stripped, recursionSet); + if (result != null) { + return result; + } + } else if (property.length() > 2 + && (property.charAt(0) == 'I' || property.charAt(0) == 'i') + && (property.charAt(1) == 'n' || property.charAt(1) == 'N') + && Character.isUpperCase(property.charAt(2))) { + stripped = property.substring(2); + // Try as block name try { - // Try as standard property - return translateUnicodeProperty(stdProp, false, recursionSet); - } catch (IllegalArgumentException e) { - // Fall through to user-defined property lookup - propRef = "main::" + stdProp; + UnicodeSet set = new UnicodeSet(); + set.applyPropertyAlias("Block", stripped); + return set; + } catch (IllegalArgumentException ignored) { } } - // Try as user-defined property - return translateUnicodeProperty(propRef, false, recursionSet); + // Map ASCII alias to block name + if (stripped.equalsIgnoreCase("ASCII")) { + return new UnicodeSet("[\\u0000-\\u007F]"); + } + + // Try direct ICU4J lookup as general category, script, or binary property + try { + UnicodeSet set = new UnicodeSet(); + set.applyPropertyAlias(stripped, "True"); + return set; + } catch (IllegalArgumentException ignored) { + } + try { + UnicodeSet set = new UnicodeSet(); + set.applyPropertyAlias(stripped, ""); + return set; + } catch 
(IllegalArgumentException ignored) { + } + + // Try as block name + try { + UnicodeSet set = new UnicodeSet(); + set.applyPropertyAlias("Block", stripped); + return set; + } catch (IllegalArgumentException ignored) { + } + + return null; } /** * Tries to look up a user-defined property by calling a Perl subroutine. + * Results are cached per sub name, matching Perl's behavior of only calling + * user-defined property subs once per unique property name. * * @param property The property name (e.g., "IsMyUpper" or "main::IsMyUpper") * @param recursionSet Set to track recursive property calls @@ -291,6 +449,11 @@ private static String tryUserDefinedProperty(String property, Set recurs subName = "main::" + subName; } + // Check cache first — Perl only calls user-defined property subs once + if (userPropertyCache.containsKey(subName)) { + return userPropertyCache.get(subName); + } + // Look up the subroutine RuntimeScalar codeRef = GlobalVariable.getGlobalCodeRef(subName); if (codeRef == null || !codeRef.getDefinedBoolean()) { @@ -303,13 +466,17 @@ private static String tryUserDefinedProperty(String property, Set recurs RuntimeList result = RuntimeCode.apply(codeRef, args, RuntimeContextType.SCALAR); if (result.elements.isEmpty()) { - return ""; + String parsed = ""; + userPropertyCache.put(subName, parsed); + return parsed; } String definition = result.elements.getFirst().toString(); - // Parse and return the property definition - return parseUserDefinedProperty(definition, newRecursionSet, subName); + // Parse and cache the property definition + String parsed = parseUserDefinedProperty(definition, newRecursionSet, subName); + userPropertyCache.put(subName, parsed); + return parsed; } catch (PerlCompilerException e) { // Re-throw Perl exceptions (like die in IsDeath) @@ -334,7 +501,10 @@ public static String translateUnicodeProperty(String property, boolean negated) private static String translateUnicodeProperty(String property, boolean negated, Set recursionSet) { try 
{ // Check for user-defined properties (Is... or In...) - if (property.matches("^(.*::)?(Is|In)[A-Z].*")) { + // Perl treats ANY property starting with Is/In (case-insensitive prefix) + // as potentially user-defined, regardless of the character after the prefix + // (e.g., Is_q, IsMyProp, InMyBlock all trigger user-defined lookup) + if (property.matches("^(.*::)?([Ii][sSNn]).+")) { String userProp = tryUserDefinedProperty(property, recursionSet); if (userProp != null) { return wrapCharClass(userProp, negated); @@ -351,9 +521,12 @@ private static String translateUnicodeProperty(String property, boolean negated, case "XPosixSpace": case "XPerlSpace": case "SpacePerl": + case "Space": + case "White_Space": // Use ICU4J UnicodeSet for accurate XPosixSpace return getXPosixSpacePattern(negated); case "XPosixAlnum": + case "Alnum": return wrapCharClass("\\p{IsAlphabetic}\\p{IsDigit}", negated); case "XPosixAlpha": case "Alpha": @@ -384,11 +557,18 @@ private static String translateUnicodeProperty(String property, boolean negated, case "Print": return wrapCharClass("\\p{IsAlphabetic}\\p{IsDigit}\\p{IsPunctuation}\\p{IsWhite_Space}", negated); case "XPosixPunct": + case "Punct": + case "Punctuation": return wrapProperty("IsPunctuation", negated); case "XPosixUpper": case "Upper": case "Uppercase": return wrapProperty("IsUppercase", negated); + case "Titlecase": + case "TitlecaseLetter": + case "Titlecase_Letter": + case "Lt": + return wrapProperty("gc=Lt", negated); case "XPosixWord": case "Word": case "IsWord": @@ -453,9 +633,12 @@ private static String translateUnicodeProperty(String property, boolean negated, } } - // Strip 'Is' prefix for Perl compatibility (e.g., IsPrint -> Print, IsDigit -> Digit) - // ICU4J doesn't recognize Is-prefixed property names, but they're valid in Perl - if (property.startsWith("Is") && property.length() > 2 && Character.isUpperCase(property.charAt(2))) { + // Strip 'Is'/'is' prefix for Perl compatibility (e.g., IsPrint -> Print, isAlpha -> 
Alpha) + // Perl is case-insensitive for the 'Is' prefix on Unicode property names + if (property.length() > 2 + && (property.charAt(0) == 'I' || property.charAt(0) == 'i') + && (property.charAt(1) == 's' || property.charAt(1) == 'S') + && Character.isUpperCase(property.charAt(2))) { property = property.substring(2); } @@ -480,11 +663,28 @@ private static String translateUnicodeProperty(String property, boolean negated, // Standard Unicode properties UnicodeSet unicodeSet = new UnicodeSet(); - if (isBlockProperty(property)) { - unicodeSet.applyPropertyAlias("Block", property); + + // Handle Property=Value syntax (e.g., ASCII_Hex_Digit=True, gc=Ll) + String propName = property; + String propValue = ""; + int eqIdx = property.indexOf('='); + if (eqIdx > 0 && eqIdx < property.length() - 1) { + propName = property.substring(0, eqIdx); + propValue = property.substring(eqIdx + 1); + // Handle negation: Property=False means \P{Property} + if (propValue.equalsIgnoreCase("False") || propValue.equalsIgnoreCase("No") || propValue.equals("N") || propValue.equals("F")) { + negated = !negated; + propValue = "True"; + } else if (propValue.equalsIgnoreCase("True") || propValue.equalsIgnoreCase("Yes") || propValue.equals("Y") || propValue.equals("T")) { + propValue = "True"; + } + } + + if (isBlockProperty(propName)) { + unicodeSet.applyPropertyAlias("Block", propName); } else { try { - unicodeSet.applyPropertyAlias(property, ""); + unicodeSet.applyPropertyAlias(propName, propValue); } catch (IllegalArgumentException ex) { // Property not found as general category/script - try as a Unicode block name. // Perl resolves \p{Emoticons} as \p{Block=Emoticons}, etc. 
diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/HashSpecialVariable.java b/src/main/java/org/perlonjava/runtime/runtimetypes/HashSpecialVariable.java index 18c44573b..f4f69d721 100644 --- a/src/main/java/org/perlonjava/runtime/runtimetypes/HashSpecialVariable.java +++ b/src/main/java/org/perlonjava/runtime/runtimetypes/HashSpecialVariable.java @@ -1,6 +1,7 @@ package org.perlonjava.runtime.runtimetypes; import org.perlonjava.runtime.mro.InheritanceResolver; +import org.perlonjava.runtime.regex.CaptureNameEncoder; import org.perlonjava.runtime.regex.RuntimeRegex; import java.util.AbstractMap; @@ -77,6 +78,12 @@ public Set> entrySet() { if (matcher != null) { Map namedGroups = matcher.pattern().namedGroups(); for (String name : namedGroups.keySet()) { + // Skip internal captures (code blocks, \K marker) + if (CaptureNameEncoder.isInternalCapture(name)) { + continue; + } + // Decode the name back to original Perl name (reverse underscore encoding) + String perlName = CaptureNameEncoder.decodeGroupName(name); String matchedValue = matcher.group(name); if (this.mode == Id.CAPTURE_ALL) { // For %-, values are always array refs (even for non-participating groups) @@ -86,11 +93,11 @@ public Set> entrySet() { } else { arr.push(new RuntimeScalar()); // undef for non-participating groups } - entries.add(new SimpleEntry<>(name, arr.createReference())); + entries.add(new SimpleEntry<>(perlName, arr.createReference())); } else { // For %+, only include groups that actually matched if (matchedValue != null) { - entries.add(new SimpleEntry<>(name, new RuntimeScalar(matchedValue))); + entries.add(new SimpleEntry<>(perlName, new RuntimeScalar(matchedValue))); } } } @@ -177,11 +184,13 @@ public RuntimeScalar get(Object key) { if (this.mode == Id.CAPTURE_ALL || this.mode == Id.CAPTURE) { Matcher matcher = RuntimeRegex.globalMatcher; if (matcher != null && key instanceof String name) { + // Encode the Perl name to Java regex name (underscore encoding) + String 
encodedName = CaptureNameEncoder.encodeGroupName(name); // Check if this is a valid named group - if (!matcher.pattern().namedGroups().containsKey(name)) { + if (!matcher.pattern().namedGroups().containsKey(encodedName)) { return scalarUndef; } - String matchedValue = matcher.group(name); + String matchedValue = matcher.group(encodedName); if (this.mode == Id.CAPTURE_ALL) { // For %-, always return array ref (with undef for non-participating groups) RuntimeArray arr = new RuntimeArray(); @@ -220,7 +229,8 @@ public boolean containsKey(Object key) { // For %-, all named groups exist (even non-participating ones) Matcher matcher = RuntimeRegex.globalMatcher; if (matcher != null && key instanceof String name) { - return matcher.pattern().namedGroups().containsKey(name); + String encodedName = CaptureNameEncoder.encodeGroupName(name); + return matcher.pattern().namedGroups().containsKey(encodedName); } return false; } @@ -228,7 +238,8 @@ public boolean containsKey(Object key) { // For %+, only groups that actually captured Matcher matcher = RuntimeRegex.globalMatcher; if (matcher != null && key instanceof String name) { - return matcher.pattern().namedGroups().containsKey(name) && matcher.group(name) != null; + String encodedName = CaptureNameEncoder.encodeGroupName(name); + return matcher.pattern().namedGroups().containsKey(encodedName) && matcher.group(encodedName) != null; } return false; } diff --git a/src/main/perl/lib/ExtUtils/MakeMaker.pm b/src/main/perl/lib/ExtUtils/MakeMaker.pm index f00746765..95be5fc3d 100644 --- a/src/main/perl/lib/ExtUtils/MakeMaker.pm +++ b/src/main/perl/lib/ExtUtils/MakeMaker.pm @@ -579,7 +579,9 @@ pm_to_blib: $install_cmds_str # Copy to blib/lib for test compatibility (make test uses PERL5LIB=./blib/lib) +# Also create blib/arch so that "use blib" / "-Mblib" works (blib.pm requires both) pure_all: +\t\@mkdir -p blib/arch $blib_cmds_str # Process PL_FILES