diff --git a/dev/design/utf8_flag_parity.md b/dev/design/utf8_flag_parity.md new file mode 100644 index 000000000..e8c8cedc1 --- /dev/null +++ b/dev/design/utf8_flag_parity.md @@ -0,0 +1,136 @@ +# UTF-8 Flag Parity: Byte-String Preservation + +## Problem + +PerlOnJava has a systemic issue where operations that should produce byte strings +(SvUTF8=0 in Perl) instead produce UTF-8-flagged strings (STRING type). This +causes data corruption when binary data (JPEG, TIFF, PNG, GIF) is round-tripped +through ExifTool's write path, because `Encode::is_utf8()` returns true and +`Encode::encode('utf8', $data)` re-encodes bytes >127 as multi-byte sequences. + +### PerlOnJava Type Model + +| PerlOnJava type | Perl equivalent | UTF-8 flag | +|-----------------|-----------------|------------| +| `BYTE_STRING` | SvUTF8=0 | off | +| `STRING` | SvUTF8=1 | on | +| `INTEGER` | IV | N/A | +| `DOUBLE` | NV | N/A | +| `UNDEF` | undef | N/A | + +### Perl Rule + +An operation produces a UTF-8-flagged string **only** when at least one input +has the UTF-8 flag on. Types without a flag (integers, floats, undef, byte +strings) never upgrade the result to UTF-8. + +## Completed Fixes + +### 1. `join` — StringOperators.joinInternal() + +**File:** `src/main/java/org/perlonjava/runtime/operators/StringOperators.java` + +- 1-element fast path: preserve source type instead of always creating STRING +- 2+ element path: track `hasUtf8` — only set if an element is STRING type +- Non-STRING types (INTEGER, DOUBLE, UNDEF, BYTE_STRING) are byte-compatible + +### 2. String concatenation — StringOperators.stringConcat() + +**File:** `src/main/java/org/perlonjava/runtime/operators/StringOperators.java` + +- Only return STRING when at least one operand is STRING type +- When both operands are non-STRING, produce BYTE_STRING (with Latin-1 safety check) +- Previously, if neither was BYTE_STRING (e.g. INTEGER + BYTE_STRING), it fell + through to the default STRING return + +### 3. `sprintf` — SprintfOperator.sprintfInternal() + +**File:** `src/main/java/org/perlonjava/runtime/operators/SprintfOperator.java` + +- Track `hasUtf8Input` by checking format string and all argument types +- Return BYTE_STRING when no input has STRING type + +## Remaining Work + +### 4. `unpack` — Format handlers return STRING for string results + +**Status:** Not yet fixed — needs per-handler analysis + +**Problem:** `unpack` format handlers (`HexStringFormatHandler`, `StringFormatHandler`, +`NumericFormatHandler`, `BitStringFormatHandler`, etc.) create results with +`new RuntimeScalar(someString)` which defaults to STRING type. In Perl, `unpack` +returns byte strings for all formats except `U` with wide characters. + +**Impact:** ExifTool's `ImageInfo` path uses `unpack("n", ...)`, `unpack("H*", ...)`, +etc. to extract tag values. These values carry the UTF-8 flag, and when +round-tripped through SetNewValue + WriteInfo, the flag propagates to the +output buffer. + +**Approach — per-handler fixes:** + +Each handler that produces string results via `new RuntimeScalar(String)` needs +to set `type = BYTE_STRING` on the result. This is safe because: + +- Numeric formats (n, N, v, V, s, S, i, I, l, L, q, Q): return integers, already OK +- String formats (a, A, Z): should return BYTE_STRING +- Hex/bit formats (H, h, B, b): produce ASCII hex/bit strings, should be BYTE_STRING +- `U` format: returns code points — may legitimately need STRING for chars > 0xFF +- `C` format: returns byte values as integers, already OK + +**Files to audit:** +- `src/main/java/org/perlonjava/runtime/operators/unpack/HexStringFormatHandler.java` +- `src/main/java/org/perlonjava/runtime/operators/unpack/StringFormatHandler.java` +- `src/main/java/org/perlonjava/runtime/operators/unpack/BitStringFormatHandler.java` +- `src/main/java/org/perlonjava/runtime/operators/unpack/PointerFormatHandler.java` + +**NOT a blanket post-process:** A post-processing step on all unpack results +was considered but rejected as dangerous — it could break `unpack("U", $wide_char)` +which legitimately produces UTF-8 strings. + +### 5. Other potential sources + +These operations may also need auditing for byte-string preservation: + +| Operation | Risk | Notes | +|-----------|------|-------| +| `chr()` | Low | Likely OK — returns BYTE_STRING for 0-255 | +| `substr()` | Medium | Result should inherit source type | +| `lc/uc/ucfirst/lcfirst` | Medium | Should inherit source type | +| `reverse()` | Low | Should inherit source type | +| Hash/array stringification | Low | Produces addresses, should be byte | + +### 6. GPX/Geotag parsing (separate issue) + +**Status:** Not yet investigated + +ExifTool's Geotag.pm reads GPX files using `$/ = '>'` and regex with `\3` +backreference. 5 test failures in Geotag.t and Geolocation.t. This is a +separate issue from UTF-8 flag handling — likely I/O or regex related. + +## Verification + +```bash +# Quick sanity check +./jperl -e 'use Encode; print Encode::is_utf8(join("", "\xff")) ? "BAD" : "OK", "\n"' +./jperl -e 'use Encode; print Encode::is_utf8(sprintf "%d", 42) ? "BAD" : "OK", "\n"' +./jperl -e 'use Encode; print Encode::is_utf8("" . 42) ? "BAD" : "OK", "\n"' +./jperl -e 'use Encode; print Encode::is_utf8(unpack("H*", "AB")) ? "BAD" : "OK", "\n"' + +# ExifTool test suite +cd /path/to/Image-ExifTool-13.55-0 +../jperl -Ilib t/IPTC.t # Test 4 should pass after unpack fix +../jperl -Ilib t/Writer.t # Multiple write tests +../jperl -Ilib t/GIF.t # GIF header byte integrity +``` + +## Progress Tracking + +### Current Status: Fixes 1-3 completed, Fix 4 pending + +| Fix | Status | Impact | +|-----|--------|--------| +| join byte-string | Done | High — ExifTool Write path | +| stringConcat byte-string | Done | High — all concat ops | +| sprintf byte-string | Done | Medium — tag value formatting | +| unpack per-handler | Pending | High — ExifTool ImageInfo path | +| GPX/Geotag parsing | Pending | 5 test failures | diff --git a/dev/modules/exiftool_parity.md b/dev/modules/exiftool_parity.md new file mode 100644 index 000000000..67a112526 --- /dev/null +++ b/dev/modules/exiftool_parity.md @@ -0,0 +1,166 @@ +# Image::ExifTool Parity Fixes for PerlOnJava + +## Overview + +Image::ExifTool 13.55 has 113 test programs. When run under PerlOnJava, +98 pass cleanly, 4 are false-timeout (pass when run individually), and +11 have real failures totalling ~24 broken subtests across 6 root-cause +categories. + +**Branch:** `fix/http-tiny-redirect-mirror` (started with HTTP::Tiny fix) +**Module version:** Image::ExifTool 13.55 (113 test programs) + +### Results History + +| Date | Programs OK | Subtests Failed | Key Fix | +|------|-------------|-----------------|---------| +| Baseline (pre-fix) | 0/113 | N/A | HTTP::Tiny 301 redirect + binary mirror — `jcpan` couldn't download | +| After HTTP fix | 98/113 | ~24 | Redirect following + binary-safe mirror | + +### Current Failure Summary + +| Test File | Fail/Total | Category | +|-----------|-----------|----------| +| GIF.t | 3/5 | Binary write (GIF header byte corruption) | +| IPTC.t | 1/8 | `join` UTF-8 flag on byte strings | +| CanonRaw.t | 1/9 | Binary write (maker notes offsets) | +| FujiFilm.t | 1/6 | Binary write (TIFF header corruption) | +| Geolocation.t | 1/8 | GPX/XML parsing (`$/` or regex) | +| Geotag.t | 4/12 | GPX/XML parsing (`$/` or regex) | +| MIE.t | 1/6 | Binary write (thumbnail offset +26 bytes) | +| Nikon.t | 1/9 | Binary write (IFD format corruption) | +| PNG.t | 1/7 | `join` UTF-8 flag corrupts written file | +| Writer.t | 8/61 | Binary write (multiple patterns) | +| XMP.t | 2/54 | Binary write + UTF-8 encoding | + +False-timeout (pass when run individually): CanonVRD, FotoStation, Olympus, Pentax + +--- + +## Root Cause Analysis + +### RC1: `join` corrupts byte-string flag (~17 failures) + +**Impact:** Categories 1-4, 6a, 6b — the dominant root cause + +`StringOperators.joinInternal()` has three bugs: + +1. **1-element fast path** (line 650): `new RuntimeScalar(scalar.toString())` always + creates type `STRING`, discarding `BYTE_STRING`. When ExifTool does + `$$outfile .= join('', @data)` with binary data, the UTF-8 flag + propagates to the output buffer. On re-read, `Encode::is_utf8()` is true + and `Encode::encode('utf8', $$arg)` re-encodes bytes >127 as multi-byte + sequences, destroying JPEG/TIFF/GIF/PNG signatures. + +2. **0-element fast path** (line 640): `new RuntimeScalar("")` creates STRING; + should create BYTE_STRING when separator is byte-string. + +3. **2+ element path** (line 679): `isByteString` requires ALL elements to be + `BYTE_STRING`. In Perl, `join` on a mix of integers and byte-strings + should produce a byte-string (integers have no UTF-8 flag). The check + should treat non-STRING types (INTEGER, DOUBLE, UNDEF) as byte-compatible. + +**Reproducer:** +```bash +./jperl -e 'use Encode; my $b = "\xff\xd8"; print Encode::is_utf8(join("", $b)), "\n"' +# PerlOnJava: 1 (wrong) +# Perl: (empty, i.e. false) +``` + +### RC2: GPX/XML parsing failures (5 failures) + +**Impact:** Geotag.t (4 failures), Geolocation.t (1 failure) + +ExifTool's `Geotag.pm` reads GPX files using `$/ = '>'` as the input +record separator, then parses each chunk with regex. The "No track points +found" error means the parser can't match `` elements. + +Possible sub-causes: +- `File::RandomAccess::ReadLine()` not honoring custom `$/` +- Regex backreference `\3` in attribute parser not working +- Floating-point interpolation issue (Geotag test 9 — wrong coordinates) + +### RC3: UTF-8 / XML encoding (1 failure) + +**Impact:** XMP.t test 35 + +XMP structured write with `AOTitle-de=pr\xc3\xbcfung` (UTF-8 "prüfung") +differs from reference output at line 13. May be double-encoding or +incorrect byte-to-character handling. + +--- + +## Fix Plan + +### Phase 1: Fix `join` byte-string preservation (RC1) + +**File:** `src/main/java/org/perlonjava/runtime/operators/StringOperators.java` + +**Changes:** +1. In 1-element fast path: preserve `BYTE_STRING` type from input element +2. In 0-element fast path: return BYTE_STRING when separator is byte-string +3. In 2+ element path: treat INTEGER, DOUBLE, UNDEF as byte-compatible + (only STRING type upgrades to UTF-8) + +**Expected impact:** Should fix most binary write corruption (IPTC, PNG, +Writer, CanonRaw, FujiFilm, GIF, MIE, Nikon, XMP test 3). This is the +highest-value fix. + +**Verification:** +```bash +make # Unit tests pass +# Individual ExifTool tests: +cd /Users/fglock/.cpan/build/Image-ExifTool-13.55-0 +../jperl -Ilib t/IPTC.t +../jperl -Ilib t/GIF.t +../jperl -Ilib t/Writer.t +../jperl -Ilib t/PNG.t +``` + +### Phase 2: Investigate GPX/Geotag parsing (RC2) + +**Files to investigate:** +- `Image::ExifTool::Geotag` — GPX parser using `$/ = '>'` +- `File::RandomAccess` — `ReadLine()` method +- PerlOnJava regex engine — `\3` backreference support + +**Steps:** +1. Test `$/` with a simple script: `$/ = '>'; while () { ... }` +2. Test regex backreference: `'key="val"' =~ /(\w+)=(['"])(.*?)\2/` +3. If `$/` is the issue, fix in the I/O layer +4. If regex backreference, fix in `RegexPreprocessor.java` + +**Verification:** +```bash +cd /Users/fglock/.cpan/build/Image-ExifTool-13.55-0 +../jperl -Ilib t/Geotag.t +../jperl -Ilib t/Geolocation.t +``` + +### Phase 3: Investigate UTF-8 / XMP encoding (RC3) + +**Steps:** +1. Diff XMP.t test 35 output vs reference file +2. Determine if it's double-encoding or byte-handling issue +3. May be fixed by Phase 1 (join fix) + +**Verification:** +```bash +cd /Users/fglock/.cpan/build/Image-ExifTool-13.55-0 +../jperl -Ilib t/XMP.t +``` + +--- + +## Progress Tracking + +### Current Status: Phase 1 in progress + +### Completed +- [x] HTTP::Tiny redirect + binary mirror fix (PR #472) +- [x] Full failure analysis and categorization + +### Next Steps +1. Fix `joinInternal` byte-string preservation +2. Run full ExifTool test suite to measure improvement +3. Investigate remaining failures (GPX, UTF-8) diff --git a/src/main/java/org/perlonjava/backend/bytecode/BytecodeCompiler.java b/src/main/java/org/perlonjava/backend/bytecode/BytecodeCompiler.java index 30d35ccd7..56222c801 100644 --- a/src/main/java/org/perlonjava/backend/bytecode/BytecodeCompiler.java +++ b/src/main/java/org/perlonjava/backend/bytecode/BytecodeCompiler.java @@ -4307,16 +4307,21 @@ void compileVariableReference(OperatorNode node, String op) { // Also set isSymbolicReference so defined(\&stub) returns true, matching // the JVM backend's createCodeReference behavior. if (node.operand instanceof OperatorNode operandOp - && operandOp.operator.equals("&") - && operandOp.operand instanceof IdentifierNode idNode) { - // Set isSymbolicReference before loading, so defined(\&Name) returns true - String subName = NameNormalizer.normalizeVariableName( - idNode.name, getCurrentPackage()); - RuntimeScalar codeRef = GlobalVariable.getGlobalCodeRef(subName); - if (codeRef.type == RuntimeScalarType.CODE - && codeRef.value instanceof RuntimeCode rc) { - rc.isSymbolicReference = true; + && operandOp.operator.equals("&")) { + if (operandOp.operand instanceof IdentifierNode idNode) { + // \&name — regular package sub. Set isSymbolicReference before + // loading, so defined(\&Name) returns true + String subName = NameNormalizer.normalizeVariableName( + idNode.name, getCurrentPackage()); + RuntimeScalar codeRef = GlobalVariable.getGlobalCodeRef(subName); + if (codeRef.type == RuntimeScalarType.CODE + && codeRef.value instanceof RuntimeCode rc) { + rc.isSymbolicReference = true; + } } + // For both \&name and \&$var (lexical subs), the & operator + // already produces a CODE value — no CREATE_REF wrapping needed. + // This matches the JVM backend's createCodeReference behavior. node.operand.accept(this); // lastResultReg already holds the CODE scalar — no wrapping needed return; diff --git a/src/main/java/org/perlonjava/backend/bytecode/OpcodeHandlerExtended.java b/src/main/java/org/perlonjava/backend/bytecode/OpcodeHandlerExtended.java index 1bd7d2125..91e81e872 100644 --- a/src/main/java/org/perlonjava/backend/bytecode/OpcodeHandlerExtended.java +++ b/src/main/java/org/perlonjava/backend/bytecode/OpcodeHandlerExtended.java @@ -298,11 +298,31 @@ public static int executeStringConcatAssign(int[] bytecode, int pc, RuntimeBase[ registers[rd] = BytecodeInterpreter.ensureMutableScalar(registers[rd]); } RuntimeScalar target = (RuntimeScalar) registers[rd]; + // Remember if target was BYTE_STRING before concatenation. + // In PerlOnJava, "upgrading" from BYTE_STRING to STRING doesn't change bytes + // (unlike Perl where bytes > 127 get re-encoded), so we preserve BYTE_STRING + // in .= to prevent false UTF-8 flag contamination of binary buffers. + boolean wasByteString = (target.type == RuntimeScalarType.BYTE_STRING); RuntimeScalar result = StringOperators.stringConcat( target, (RuntimeScalar) registers[rs] ); target.set(result); + // Preserve BYTE_STRING type when the target was byte string and the result + // still fits in Latin-1 (all chars <= 255) + if (wasByteString && target.type == RuntimeScalarType.STRING) { + String s = target.toString(); + boolean fits = true; + for (int i = 0; i < s.length(); i++) { + if (s.charAt(i) > 255) { + fits = false; + break; + } + } + if (fits) { + target.type = RuntimeScalarType.BYTE_STRING; + } + } // Invalidate pos() - any string modification should reset pos to undef RuntimePosLvalue.invalidatePos(target); return pc; diff --git a/src/main/java/org/perlonjava/backend/jvm/EmitBinaryOperator.java b/src/main/java/org/perlonjava/backend/jvm/EmitBinaryOperator.java index 700b80f74..5d3ed0610 100644 --- a/src/main/java/org/perlonjava/backend/jvm/EmitBinaryOperator.java +++ b/src/main/java/org/perlonjava/backend/jvm/EmitBinaryOperator.java @@ -313,7 +313,12 @@ static void handleCompoundAssignment(EmitterVisitor emitterVisitor, BinaryOperat throw new RuntimeException("No operator handler found for base operator: " + baseOperator); } // assign to the Lvalue - mv.visitMethodInsn(Opcodes.INVOKEVIRTUAL, "org/perlonjava/runtime/runtimetypes/RuntimeScalar", "set", "(Lorg/perlonjava/runtime/runtimetypes/RuntimeScalar;)Lorg/perlonjava/runtime/runtimetypes/RuntimeScalar;", false); + // For .= use setPreservingByteString to prevent UTF-8 flag contamination of binary buffers + if (node.operator.equals(".=")) { + mv.visitMethodInsn(Opcodes.INVOKEVIRTUAL, "org/perlonjava/runtime/runtimetypes/RuntimeScalar", "setPreservingByteString", "(Lorg/perlonjava/runtime/runtimetypes/RuntimeScalar;)Lorg/perlonjava/runtime/runtimetypes/RuntimeScalar;", false); + } else { + mv.visitMethodInsn(Opcodes.INVOKEVIRTUAL, "org/perlonjava/runtime/runtimetypes/RuntimeScalar", "set", "(Lorg/perlonjava/runtime/runtimetypes/RuntimeScalar;)Lorg/perlonjava/runtime/runtimetypes/RuntimeScalar;", false); + } // For string concat assign (.=), invalidate pos() since string was modified if (node.operator.equals(".=")) { diff --git a/src/main/java/org/perlonjava/core/Configuration.java b/src/main/java/org/perlonjava/core/Configuration.java index 6f0cfffbe..217ab3fce 100644 --- a/src/main/java/org/perlonjava/core/Configuration.java +++ b/src/main/java/org/perlonjava/core/Configuration.java @@ -33,7 +33,7 @@ public final class Configuration { * Automatically populated by Gradle/Maven during build. * DO NOT EDIT MANUALLY - this value is replaced at build time. */ - public static final String gitCommitId = "5ee00fd77"; + public static final String gitCommitId = "f90f44dc2"; /** * Git commit date of the build (ISO format: YYYY-MM-DD). @@ -48,7 +48,7 @@ public final class Configuration { * Parsed by App::perlbrew and other tools via: perl -V | grep "Compiled at" * DO NOT EDIT MANUALLY - this value is replaced at build time. */ - public static final String buildTimestamp = "Apr 9 2026 17:13:46"; + public static final String buildTimestamp = "Apr 9 2026 18:35:44"; // Prevent instantiation private Configuration() { diff --git a/src/main/java/org/perlonjava/frontend/parser/OperatorParser.java b/src/main/java/org/perlonjava/frontend/parser/OperatorParser.java index 2819a992d..91b1bb3da 100644 --- a/src/main/java/org/perlonjava/frontend/parser/OperatorParser.java +++ b/src/main/java/org/perlonjava/frontend/parser/OperatorParser.java @@ -911,6 +911,13 @@ static BinaryOperatorNode parseSplit(Parser parser, LexerToken token, int curren ((OperatorNode) separator).operator = "quoteRegex"; } } + // If no string argument provided, default to $_ + // This is needed so both JVM and bytecode backends resolve $_ correctly + // at runtime (the bytecode backend otherwise compiles the empty ListNode + // in scalar context, producing a spurious value instead of $_ fallback) + if (operand.elements.isEmpty()) { + operand.elements.add(ParserNodeUtils.scalarUnderscore(parser)); + } return new BinaryOperatorNode(token.text, separator, operand, currentIndex); } diff --git a/src/main/java/org/perlonjava/runtime/operators/Operator.java b/src/main/java/org/perlonjava/runtime/operators/Operator.java index db433400b..50d17bf1e 100644 --- a/src/main/java/org/perlonjava/runtime/operators/Operator.java +++ b/src/main/java/org/perlonjava/runtime/operators/Operator.java @@ -284,6 +284,7 @@ private static RuntimeScalar substrImpl(int ctx, boolean warnEnabled, RuntimeBas boolean hasExplicitLength = size > 2; int length = hasExplicitLength ? ((RuntimeScalar) args[2]).getInt() : strLength - offset; String replacement = (size > 3) ? args[3].toString() : null; + RuntimeScalar replacementScalar = (size > 3) ? (RuntimeScalar) args[3] : null; // Handle negative offsets (count from the end of the string) if (offset < 0) { @@ -357,8 +358,12 @@ private static RuntimeScalar substrImpl(int ctx, boolean warnEnabled, RuntimeBas if (replacement != null) { // With replacement, still need to handle the replacement at position 0 var lvalue = new RuntimeSubstrLvalue((RuntimeScalar) args[0], "", offset, 0); - lvalue.set(replacement); - return new RuntimeScalar(""); + lvalue.set(replacementScalar); + RuntimeScalar retVal = new RuntimeScalar(""); + if (((RuntimeScalar) args[0]).type == RuntimeScalarType.BYTE_STRING) { + retVal.type = RuntimeScalarType.BYTE_STRING; + } + return retVal; } return new RuntimeSubstrLvalue((RuntimeScalar) args[0], "", offset, 0); } @@ -376,9 +381,14 @@ private static RuntimeScalar substrImpl(int ctx, boolean warnEnabled, RuntimeBas if (replacement != null) { // When replacement is provided, save the extracted substring before modifying String extractedSubstring = result; - lvalue.set(replacement); + lvalue.set(replacementScalar); // Return the extracted substring, not the lvalue (which now contains the replacement) - return new RuntimeScalar(extractedSubstring); + RuntimeScalar retVal = new RuntimeScalar(extractedSubstring); + // Preserve BYTE_STRING type from parent + if (((RuntimeScalar) args[0]).type == RuntimeScalarType.BYTE_STRING) { + retVal.type = RuntimeScalarType.BYTE_STRING; + } + return retVal; } return lvalue; diff --git a/src/main/java/org/perlonjava/runtime/operators/SprintfOperator.java b/src/main/java/org/perlonjava/runtime/operators/SprintfOperator.java index 7d7159b0d..7a628a987 100644 --- a/src/main/java/org/perlonjava/runtime/operators/SprintfOperator.java +++ b/src/main/java/org/perlonjava/runtime/operators/SprintfOperator.java @@ -48,6 +48,17 @@ private static RuntimeScalar sprintfInternal(RuntimeScalar runtimeScalar, Runtim // Expand the list to ensure all elements are available list = new RuntimeList((RuntimeBase) list); String format = runtimeScalar.toString(); + // Track if any input has UTF-8 flag — sprintf produces byte string unless + // the format or a %s argument has UTF-8 flag on + boolean hasUtf8Input = runtimeScalar.type == RuntimeScalarType.STRING; + if (!hasUtf8Input) { + for (RuntimeBase elem : list.elements) { + if (elem instanceof RuntimeScalar rs && rs.type == RuntimeScalarType.STRING) { + hasUtf8Input = true; + break; + } + } + } StringBuilder result = new StringBuilder(); int argIndex = 0; // Sequential argument index @@ -193,7 +204,11 @@ private static RuntimeScalar sprintfInternal(RuntimeScalar runtimeScalar, Runtim WarnDie.warn(new RuntimeScalar("Redundant argument in sprintf"), new RuntimeScalar("")); } - return new RuntimeScalar(result.toString()); + RuntimeScalar res = new RuntimeScalar(result.toString()); + if (!hasUtf8Input) { + res.type = RuntimeScalarType.BYTE_STRING; + } + return res; } private static void handlePercentN(FormatSpecifier spec, diff --git a/src/main/java/org/perlonjava/runtime/operators/StringOperators.java b/src/main/java/org/perlonjava/runtime/operators/StringOperators.java index 2aecf4fc9..51f81287c 100644 --- a/src/main/java/org/perlonjava/runtime/operators/StringOperators.java +++ b/src/main/java/org/perlonjava/runtime/operators/StringOperators.java @@ -335,46 +335,45 @@ public static RuntimeScalar rindex(RuntimeScalar runtimeScalar, RuntimeScalar su } public static RuntimeScalar stringConcat(RuntimeScalar runtimeScalar, RuntimeScalar b) { - String aStr = runtimeScalar.toString(); + // b.toString() may trigger FETCH for tied vars, potentially modifying runtimeScalar. + // Read b first so runtimeScalar.toString() reflects any FETCH side-effects, + // matching Perl's behavior where the left SV is read after tied-var resolution. String bStr = b.toString(); + String aStr = runtimeScalar.toString(); - if (runtimeScalar.type == RuntimeScalarType.STRING || b.type == RuntimeScalarType.STRING) { - return new RuntimeScalar(runtimeScalar + bStr); + // In Perl, concatenation produces a UTF-8 string only if at least one + // operand has the UTF-8 flag on (STRING type). Non-STRING types + // (BYTE_STRING, INTEGER, DOUBLE, UNDEF) are all byte-compatible. + boolean aIsUtf8 = runtimeScalar.type == RuntimeScalarType.STRING; + boolean bIsUtf8 = b.type == RuntimeScalarType.STRING; + + if (aIsUtf8 || bIsUtf8) { + return new RuntimeScalar(aStr + bStr); } - if (runtimeScalar.type == BYTE_STRING || b.type == BYTE_STRING) { - boolean aIsByte = runtimeScalar.type == BYTE_STRING - || runtimeScalar.type == RuntimeScalarType.UNDEF - || (aStr.isEmpty() && runtimeScalar.type != RuntimeScalarType.STRING); - boolean bIsByte = b.type == BYTE_STRING - || b.type == RuntimeScalarType.UNDEF - || (bStr.isEmpty() && b.type != RuntimeScalarType.STRING); - if (aIsByte && bIsByte) { - boolean safe = true; - for (int i = 0; safe && i < aStr.length(); i++) { - if (aStr.charAt(i) > 255) { - safe = false; - break; - } - } - for (int i = 0; safe && i < bStr.length(); i++) { - if (bStr.charAt(i) > 255) { - safe = false; - break; - } - } - if (safe) { - byte[] aBytes = aStr.getBytes(StandardCharsets.ISO_8859_1); - byte[] bBytes = bStr.getBytes(StandardCharsets.ISO_8859_1); - byte[] out = new byte[aBytes.length + bBytes.length]; - System.arraycopy(aBytes, 0, out, 0, aBytes.length); - System.arraycopy(bBytes, 0, out, aBytes.length, bBytes.length); - return new RuntimeScalar(out); - } + // Neither operand is UTF-8 — produce BYTE_STRING result + // Check if all chars fit in a byte (Latin-1) + boolean safe = true; + for (int i = 0; safe && i < aStr.length(); i++) { + if (aStr.charAt(i) > 255) { + safe = false; } } + for (int i = 0; safe && i < bStr.length(); i++) { + if (bStr.charAt(i) > 255) { + safe = false; + } + } + if (safe) { + byte[] aBytes = aStr.getBytes(StandardCharsets.ISO_8859_1); + byte[] bBytes = bStr.getBytes(StandardCharsets.ISO_8859_1); + byte[] out = new byte[aBytes.length + bBytes.length]; + System.arraycopy(aBytes, 0, out, 0, aBytes.length); + System.arraycopy(bBytes, 0, out, aBytes.length, bBytes.length); + return new RuntimeScalar(out); + } - return new RuntimeScalar(runtimeScalar + bStr); + return new RuntimeScalar(aStr + bStr); } public static RuntimeScalar stringConcatWarnUninitialized(RuntimeScalar runtimeScalar, RuntimeScalar b) { @@ -641,13 +640,20 @@ private static RuntimeScalar joinInternal(RuntimeScalar runtimeScalar, RuntimeBa } // Fast path: 1 element -> return that element (no separator evaluation needed) + // Preserve BYTE_STRING type: in Perl, join doesn't upgrade to UTF-8 unless + // an input has the UTF-8 flag on. Non-STRING types (INTEGER, DOUBLE, UNDEF) + // are byte-compatible and should not trigger UTF-8 upgrade. if (elements.size() == 1) { RuntimeScalar scalar = elements.get(0); if (warnOnUndef && !isStringInterpolation && scalar.type == RuntimeScalarType.UNDEF) { WarnDie.warnWithCategory(new RuntimeScalar("Use of uninitialized value in join or string"), RuntimeScalarCache.scalarEmptyString, "uninitialized"); } - return new RuntimeScalar(scalar.toString()); + RuntimeScalar res = new RuntimeScalar(scalar.toString()); + if (scalar.type != RuntimeScalarType.STRING) { + res.type = BYTE_STRING; + } + return res; } // 2+ elements: evaluate the separator @@ -658,7 +664,10 @@ private static RuntimeScalar joinInternal(RuntimeScalar runtimeScalar, RuntimeBa String delimiter = runtimeScalar.toString(); - boolean isByteString = runtimeScalar.type == BYTE_STRING || delimiter.isEmpty(); + // In Perl, join produces a byte-string unless one of the inputs has + // the UTF-8 flag on. Only STRING type has the flag; INTEGER, DOUBLE, + // UNDEF, and BYTE_STRING are all byte-compatible. + boolean hasUtf8 = runtimeScalar.type == RuntimeScalarType.STRING; // Join the elements StringBuilder sb = new StringBuilder(); @@ -676,11 +685,13 @@ private static RuntimeScalar joinInternal(RuntimeScalar runtimeScalar, RuntimeBa RuntimeScalarCache.scalarEmptyString, "uninitialized"); } - isByteString = isByteString && scalar.type == BYTE_STRING; + if (scalar.type == RuntimeScalarType.STRING) { + hasUtf8 = true; + } sb.append(scalar); } RuntimeScalar res = new RuntimeScalar(sb.toString()); - if (isByteString) { + if (!hasUtf8) { res.type = BYTE_STRING; } return res; diff --git a/src/main/java/org/perlonjava/runtime/operators/unpack/BitStringFormatHandler.java b/src/main/java/org/perlonjava/runtime/operators/unpack/BitStringFormatHandler.java index ecc5145bb..d81ad52e0 100644 --- a/src/main/java/org/perlonjava/runtime/operators/unpack/BitStringFormatHandler.java +++ b/src/main/java/org/perlonjava/runtime/operators/unpack/BitStringFormatHandler.java @@ -3,6 +3,7 @@ import org.perlonjava.runtime.operators.UnpackState; import org.perlonjava.runtime.runtimetypes.RuntimeBase; import org.perlonjava.runtime.runtimetypes.RuntimeScalar; +import org.perlonjava.runtime.runtimetypes.RuntimeScalarType; import java.nio.ByteBuffer; import java.util.List; @@ -58,6 +59,8 @@ public void unpack(UnpackState state, List output, int count, boole } } - output.add(new RuntimeScalar(bitString.toString())); + RuntimeScalar rs = new RuntimeScalar(bitString.toString()); + rs.type = RuntimeScalarType.BYTE_STRING; + output.add(rs); } } \ No newline at end of file diff --git a/src/main/java/org/perlonjava/runtime/operators/unpack/HexStringFormatHandler.java b/src/main/java/org/perlonjava/runtime/operators/unpack/HexStringFormatHandler.java index 223f6b55b..915fd3dfd 100644 --- a/src/main/java/org/perlonjava/runtime/operators/unpack/HexStringFormatHandler.java +++ b/src/main/java/org/perlonjava/runtime/operators/unpack/HexStringFormatHandler.java @@ -3,6 +3,7 @@ import org.perlonjava.runtime.operators.UnpackState; import org.perlonjava.runtime.runtimetypes.RuntimeBase; import org.perlonjava.runtime.runtimetypes.RuntimeScalar; +import org.perlonjava.runtime.runtimetypes.RuntimeScalarType; import java.nio.ByteBuffer; import java.util.List; @@ -78,6 +79,8 @@ public void unpack(UnpackState state, List output, int count, boole } } - output.add(new RuntimeScalar(hexString.toString())); + RuntimeScalar rs = new RuntimeScalar(hexString.toString()); + rs.type = RuntimeScalarType.BYTE_STRING; + output.add(rs); } } \ No newline at end of file diff --git a/src/main/java/org/perlonjava/runtime/operators/unpack/PointerFormatHandler.java b/src/main/java/org/perlonjava/runtime/operators/unpack/PointerFormatHandler.java index e2ec1ebed..11396807e 100644 --- a/src/main/java/org/perlonjava/runtime/operators/unpack/PointerFormatHandler.java +++ b/src/main/java/org/perlonjava/runtime/operators/unpack/PointerFormatHandler.java @@ -8,6 +8,8 @@ import java.nio.ByteBuffer; import java.util.List; +import static org.perlonjava.runtime.runtimetypes.RuntimeScalarType.BYTE_STRING; + public class PointerFormatHandler implements FormatHandler { private final boolean bigEndian; @@ -65,7 +67,9 @@ public void unpack(UnpackState state, List result, int count, boole if (str != null) { result.add(new RuntimeScalar(str)); } else { - result.add(new RuntimeScalar("")); + RuntimeScalar empty = new RuntimeScalar(""); + empty.type = BYTE_STRING; + result.add(empty); } } } diff --git a/src/main/java/org/perlonjava/runtime/operators/unpack/StringFormatHandler.java b/src/main/java/org/perlonjava/runtime/operators/unpack/StringFormatHandler.java index 1d6ded5ce..af980c2de 100644 --- a/src/main/java/org/perlonjava/runtime/operators/unpack/StringFormatHandler.java +++ b/src/main/java/org/perlonjava/runtime/operators/unpack/StringFormatHandler.java @@ -3,6 +3,7 @@ import org.perlonjava.runtime.operators.UnpackState; import org.perlonjava.runtime.runtimetypes.RuntimeBase; import org.perlonjava.runtime.runtimetypes.RuntimeScalar; +import org.perlonjava.runtime.runtimetypes.RuntimeScalarType; import java.nio.ByteBuffer; import java.nio.charset.StandardCharsets; @@ -70,10 +71,12 @@ public void unpack(UnpackState state, List output, int count, boole output.add(new RuntimeScalar(str)); } else { - // In byte mode, read from buffer + // In byte mode, read from buffer — always ISO-8859-1 (BYTE_STRING) ByteBuffer buffer = state.getBuffer(); String str = readString(buffer, count, isStarCount); - output.add(new RuntimeScalar(str)); + RuntimeScalar rs = new RuntimeScalar(str); + rs.type = RuntimeScalarType.BYTE_STRING; + output.add(rs); } } diff --git a/src/main/java/org/perlonjava/runtime/perlmodule/Encode.java b/src/main/java/org/perlonjava/runtime/perlmodule/Encode.java index 0d3c8406f..49daef958 100644 --- a/src/main/java/org/perlonjava/runtime/perlmodule/Encode.java +++ b/src/main/java/org/perlonjava/runtime/perlmodule/Encode.java @@ -815,8 +815,9 @@ public static RuntimeList is_utf8(RuntimeArray args, int ctx) { RuntimeScalar arg = args.get(0); - // Check the UTF-8 flag (type != BYTE_STRING means UTF-8 flag is on) - boolean hasUtf8Flag = (arg.type != BYTE_STRING); + // Check the UTF-8 flag: only STRING type has it set. + // INTEGER, DOUBLE, UNDEF, REFERENCE etc. don't have the UTF-8 flag in Perl. + boolean hasUtf8Flag = (arg.type == STRING); if (!hasUtf8Flag) { return scalarFalse.getList(); @@ -1040,8 +1041,8 @@ public static RuntimeList _utf8_on(RuntimeArray args, int ctx) { throw new IllegalStateException("Bad number of arguments for _utf8_on"); } RuntimeScalar arg = args.get(0); - boolean wasUtf8 = (arg.type != BYTE_STRING); - if (!wasUtf8) { + boolean wasUtf8 = (arg.type == STRING); + if (arg.type == BYTE_STRING) { // Re-decode the byte string as UTF-8 to get proper characters // e.g., bytes \xC3\xA9 -> character U+00E9 (é) String s = arg.toString(); @@ -1062,13 +1063,13 @@ public static RuntimeList _utf8_off(RuntimeArray args, int ctx) { throw new IllegalStateException("Bad number of arguments for _utf8_off"); } RuntimeScalar arg = args.get(0); - boolean wasUtf8 = (arg.type != BYTE_STRING); + boolean wasUtf8 = (arg.type == STRING); if (wasUtf8) { String s = arg.toString(); byte[] bytes = s.getBytes(StandardCharsets.UTF_8); arg.set(new String(bytes, StandardCharsets.ISO_8859_1)); - arg.type = BYTE_STRING; } + arg.type = BYTE_STRING; return new RuntimeScalar(wasUtf8).getList(); } diff --git a/src/main/java/org/perlonjava/runtime/perlmodule/HttpTiny.java b/src/main/java/org/perlonjava/runtime/perlmodule/HttpTiny.java index 6bebf0621..f5fdbaf34 100644 --- a/src/main/java/org/perlonjava/runtime/perlmodule/HttpTiny.java +++ b/src/main/java/org/perlonjava/runtime/perlmodule/HttpTiny.java @@ -86,6 +86,7 @@ public static RuntimeList request(RuntimeArray args, int ctx) throws Exception { responseMap.put("status", new RuntimeScalar(response.statusCode())); responseMap.put("reason", new RuntimeScalar(getStatusReason(response.statusCode()))); responseMap.put("content", new RuntimeScalar(response.body())); + responseMap.put("url", new RuntimeScalar(response.uri().toString())); // Collect headers RuntimeHash responseHeaders = new RuntimeHash(); @@ -101,16 +102,49 @@ public static RuntimeList request(RuntimeArray args, int ctx) throws Exception { } private static String getStatusReason(int statusCode) { - // Simple status reason mapping (you might want to expand this) return switch (statusCode) { + case 100 -> "Continue"; + case 101 -> "Switching Protocols"; case 200 -> "OK"; case 201 -> "Created"; + case 202 -> "Accepted"; + case 203 -> "Non-Authoritative Information"; case 204 -> "No Content"; + case 205 -> "Reset Content"; + case 206 -> "Partial Content"; + case 300 -> "Multiple Choices"; + case 301 -> "Moved Permanently"; + case 302 -> "Found"; + case 303 -> "See Other"; + case 304 -> "Not Modified"; + case 305 -> "Use Proxy"; + case 307 -> "Temporary Redirect"; + case 308 -> "Permanent Redirect"; case 400 -> "Bad Request"; case 401 -> "Unauthorized"; case 403 -> "Forbidden"; case 404 -> "Not Found"; + case 405 -> "Method Not Allowed"; + case 406 -> "Not Acceptable"; + case 407 -> "Proxy Authentication Required"; + case 408 -> "Request Timeout"; + case 409 -> "Conflict"; + case 410 -> "Gone"; + case 411 -> "Length Required"; + case 412 -> "Precondition Failed"; + case 413 -> "Payload Too Large"; + case 414 -> "URI Too Long"; + case 415 -> "Unsupported Media Type"; + case 416 -> "Range Not Satisfiable"; + case 417 -> "Expectation Failed"; + case 422 -> "Unprocessable Entity"; + case 429 -> "Too Many Requests"; case 500 -> "Internal Server Error"; + case 501 -> "Not Implemented"; + case 502 -> "Bad Gateway"; + case 503 -> "Service Unavailable"; + case 504 -> "Gateway Timeout"; + case 505 -> "HTTP Version Not Supported"; default -> "Unknown Status"; }; } @@ -118,7 +152,8 @@ private static String getStatusReason(int statusCode) { private static HttpClient createHttpClient(RuntimeHash options) { HttpClient.Builder builder = HttpClient.newBuilder() .connectTimeout(Duration.ofMillis(options.get("timeout").getLong() * 1000L)) - .version(HttpClient.Version.HTTP_1_1); + .version(HttpClient.Version.HTTP_1_1) + .followRedirects(HttpClient.Redirect.NORMAL); // Configure SSL context if SSL verification is disabled if (!options.get("verify_SSL").getBoolean()) { @@ -204,49 +239,88 @@ public static RuntimeList mirror(RuntimeArray args, int ctx) throws Exception { String filePath = args.get(2).toString(); RuntimeHash options = args.size() > 3 ? args.get(3).hashDeref() : new RuntimeHash(); + RuntimeHash instanceHash = self.hashDeref(); + + // Build request headers + RuntimeHash reqHeaders = options.exists("headers").getBoolean() + ? options.get("headers").hashDeref() : new RuntimeHash(); + File file = new File(filePath); if (file.exists()) { // Set If-Modified-Since header long lastModified = file.lastModified(); SimpleDateFormat dateFormat = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss z", Locale.US); dateFormat.setTimeZone(TimeZone.getTimeZone("GMT")); - String ifModifiedSince = dateFormat.format(new Date(lastModified)); - - RuntimeHash headers = options.exists("headers").getBoolean() - ? options.get("headers").hashDeref() : new RuntimeHash(); - headers.put("If-Modified-Since", new RuntimeScalar(ifModifiedSince)); - options.put("headers", headers.createReference()); + reqHeaders.put("If-Modified-Since", new RuntimeScalar(dateFormat.format(new Date(lastModified)))); } - // Perform the request - RuntimeArray requestArgs = new RuntimeArray(); - RuntimeArray.push(requestArgs, self); - RuntimeArray.push(requestArgs, new RuntimeScalar("GET")); - RuntimeArray.push(requestArgs, new RuntimeScalar(url)); - RuntimeArray.push(requestArgs, options.createReference()); - - RuntimeList response = request(requestArgs, ctx); - RuntimeHash responseHash = ((RuntimeScalar) response.elements.get(0)).hashDeref(); - - // Check if the request was successful or not modified - boolean success = responseHash.get("success").getBoolean() || responseHash.get("status").getLong() == 304; - if (success && responseHash.get("status").getLong() != 304) { - // Write response content to file - try (FileOutputStream fos = new FileOutputStream(file)) { - fos.write(responseHash.get("content").toString().getBytes()); - } + // Build HTTP request + HttpRequest.Builder requestBuilder = HttpRequest.newBuilder() + .uri(URI.create(url)) + .GET(); + requestBuilder.header("User-Agent", instanceHash.get("agent").toString()); + reqHeaders.elements.forEach((key, value) -> + requestBuilder.header(key, value.toString()) + ); + + try { + HttpClient client = createHttpClient(instanceHash); + // Use byte[] body handler to preserve binary content + HttpResponse response = client.send(requestBuilder.build(), HttpResponse.BodyHandlers.ofByteArray()); + + int statusCode = response.statusCode(); + boolean success = (statusCode >= 200 && statusCode < 300) || statusCode == 304; + + if (success && statusCode != 304) { + // Ensure parent directory exists + File parent = file.getParentFile(); + if (parent != null && !parent.exists()) { + parent.mkdirs(); + } + // Write binary content directly to file + try (FileOutputStream fos = new FileOutputStream(file)) { + fos.write(response.body()); + } - // Update file modification time if Last-Modified header is present - RuntimeHash headers = responseHash.get("headers").hashDeref(); - if (headers.exists("last-modified").getBoolean()) { - String lastModifiedStr = headers.get("last-modified").toString(); - SimpleDateFormat dateFormat = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss z", Locale.US); - dateFormat.setTimeZone(TimeZone.getTimeZone("GMT")); - Date lastModifiedDate = dateFormat.parse(lastModifiedStr); - Files.setLastModifiedTime(file.toPath(), FileTime.fromMillis(lastModifiedDate.getTime())); + // Update file modification time if Last-Modified header is present + Optional lastModified = response.headers().firstValue("last-modified"); + if (lastModified.isPresent()) { + SimpleDateFormat dateFormat = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss z", Locale.US); + dateFormat.setTimeZone(TimeZone.getTimeZone("GMT")); + Date lastModifiedDate = dateFormat.parse(lastModified.get()); + Files.setLastModifiedTime(file.toPath(), FileTime.fromMillis(lastModifiedDate.getTime())); + } } - } - return response; + // Build Perl response hash + RuntimeHash responseMap = new RuntimeHash(); + responseMap.put("success", new RuntimeScalar(statusCode >= 200 && statusCode < 300)); + responseMap.put("status", new RuntimeScalar(statusCode)); + responseMap.put("reason", new RuntimeScalar(getStatusReason(statusCode))); + responseMap.put("content", new RuntimeScalar("")); + responseMap.put("url", new RuntimeScalar(response.uri().toString())); + + RuntimeHash responseHeaders = new RuntimeHash(); + response.headers().map().forEach((key, value) -> + responseHeaders.put(key.toLowerCase(), new RuntimeScalar(String.join(", ", value))) + ); + responseMap.put("headers", responseHeaders.createReference()); + + return responseMap.createReference().getList(); + } catch (IOException | InterruptedException e) { + // Return 599 error response like HTTP::Tiny does + String errMsg = e.getMessage() != null ? e.getMessage() : e.toString(); + RuntimeHash responseMap = new RuntimeHash(); + responseMap.put("success", new RuntimeScalar("")); + responseMap.put("status", new RuntimeScalar(599)); + responseMap.put("reason", new RuntimeScalar("Internal Exception")); + responseMap.put("content", new RuntimeScalar(errMsg)); + responseMap.put("url", new RuntimeScalar(url)); + RuntimeHash responseHeaders = new RuntimeHash(); + responseHeaders.put("content-type", new RuntimeScalar("text/plain")); + responseHeaders.put("content-length", new RuntimeScalar(errMsg.length())); + responseMap.put("headers", responseHeaders.createReference()); + return responseMap.createReference().getList(); + } } } \ No newline at end of file diff --git a/src/main/java/org/perlonjava/runtime/perlmodule/Utf8.java b/src/main/java/org/perlonjava/runtime/perlmodule/Utf8.java index 1faa7423a..f07e0fc9e 100644 --- a/src/main/java/org/perlonjava/runtime/perlmodule/Utf8.java +++ b/src/main/java/org/perlonjava/runtime/perlmodule/Utf8.java @@ -322,7 +322,9 @@ public static boolean isUtf8(RuntimeScalar scalar) { if (scalar instanceof ScalarSpecialVariable sv) { scalar = sv.getValueAsScalar(); } - return scalar.type != BYTE_STRING; + // Only STRING type has the UTF-8 flag set. + // INTEGER, DOUBLE, UNDEF, REFERENCE etc. don't have the UTF-8 flag in Perl. + return scalar.type == STRING; } /** diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeHash.java b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeHash.java index bc9b49b02..6bc358e62 100644 --- a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeHash.java +++ b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeHash.java @@ -331,14 +331,33 @@ public void markKeyByte(String key, boolean isByte) { /** * Creates a RuntimeScalar for a hash key with the correct type (STRING or BYTE_STRING). + *

+ * In Perl, hash keys that are pure ASCII (all bytes 0-127) are always stored and + * returned without the UTF-8 flag, regardless of whether the original string had it. + * Only keys containing characters > 127 preserve the byte/UTF-8 distinction via byteKeys. */ RuntimeScalar createKeyScalar(String key) { + // Check if key is explicitly marked as byte string (for non-ASCII keys) if (byteKeys != null && byteKeys.contains(key)) { RuntimeScalar scalar = new RuntimeScalar(key); scalar.type = BYTE_STRING; return scalar; } - return new RuntimeScalar(key); // default STRING type + // In Perl, ASCII-only hash keys are always returned without the UTF-8 flag. + // Only non-ASCII keys that are NOT in byteKeys should be returned as STRING (UTF-8). + boolean isAscii = true; + for (int i = 0; i < key.length(); i++) { + if (key.charAt(i) > 127) { + isAscii = false; + break; + } + } + if (isAscii) { + RuntimeScalar scalar = new RuntimeScalar(key); + scalar.type = BYTE_STRING; + return scalar; + } + return new RuntimeScalar(key); // non-ASCII, not in byteKeys → STRING (UTF-8) } /** diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalar.java b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalar.java index 9f5ed1c08..c4022038b 100644 --- a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalar.java +++ b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalar.java @@ -762,6 +762,33 @@ public RuntimeScalar set(RuntimeScalar value) { return setLarge(value); } + /** + * Set value while preserving BYTE_STRING type when possible. + * Used by .= (string concat-assign) to prevent UTF-8 flag contamination + * of binary buffers. In PerlOnJava, upgrading from BYTE_STRING to STRING + * doesn't change the underlying chars (unlike Perl where bytes > 127 get + * re-encoded), so preserving BYTE_STRING is safe when all chars fit in Latin-1. + */ + public RuntimeScalar setPreservingByteString(RuntimeScalar value) { + boolean wasByteString = (this.type == BYTE_STRING); + this.set(value); + if (wasByteString && this.type == STRING) { + // Check if all chars fit in Latin-1 (single byte) + String s = this.toString(); + boolean allLatin1 = true; + for (int i = 0; i < s.length(); i++) { + if (s.charAt(i) > 255) { + allLatin1 = false; + break; + } + } + if (allLatin1) { + this.type = BYTE_STRING; + } + } + return this; + } + // Slow path for set(RuntimeScalar) private RuntimeScalar setLarge(RuntimeScalar value) { if (value == null) { diff --git a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeSubstrLvalue.java b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeSubstrLvalue.java index 99ee27ec4..9bc118c4d 100644 --- a/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeSubstrLvalue.java +++ b/src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeSubstrLvalue.java @@ -126,7 +126,16 @@ public RuntimeScalar set(RuntimeScalar value) { } // Update the parent RuntimeScalar with the modified string - lvalue.set(new RuntimeScalar(updatedValue.toString())); + RuntimeScalar updated = new RuntimeScalar(updatedValue.toString()); + // Preserve BYTE_STRING type: if the parent was a byte string and the replacement + // doesn't introduce UTF-8 characters, keep the result as BYTE_STRING. + // In Perl, substr assignment on a byte string with a byte replacement stays bytes. + if (lvalue.type == RuntimeScalarType.BYTE_STRING && + (value.type == RuntimeScalarType.BYTE_STRING || + value.type != RuntimeScalarType.STRING)) { + updated.type = RuntimeScalarType.BYTE_STRING; + } + lvalue.set(updated); return this; }