This document tracks regex preprocessing issues discovered while running re/pat.t, re/pat_advanced.t, and related tests with JPERL_UNIMPLEMENTED=warn.
Root cause: handleQuantifier() in RegexPreprocessor.java used s.indexOf('}', start) to find the closing brace of a potential quantifier, but this search crossed character class boundaries and regex groups. For example, in { (?> [^{}]+ | (??{...}) )* }, the { at the start was treated as a potential quantifier, and indexOf('}') found the } inside the character class [^{}], consuming everything in between (including (?> and [^{) as literal text.
Fix: When handleQuantifier determines that braces don't form a valid quantifier (content contains non-numeric characters), it now only escapes the opening { as \{ and returns immediately, letting the main regex loop process subsequent characters normally. Previously it consumed and escaped the entire {...} range.
Files changed: RegexPreprocessor.java — handleQuantifier() method
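The fix above can be illustrated with a minimal sketch. The class name `QuantifierSketch`, the method `handleBrace`, and its signature are hypothetical; the real `handleQuantifier()` signature is not shown in this document. The sketch also recognizes only the `{n}`, `{n,}` and `{n,m}` forms, which is an assumption.

```java
import java.util.regex.Pattern;

final class QuantifierSketch {
    // Body of a quantifier: "n", "n," or "n,m" (assumed set of valid forms).
    private static final Pattern QUANT_BODY = Pattern.compile("\\d+(,\\d*)?");

    /**
     * Handles a '{' found at index start. Appends the translated text to sb
     * and returns the index of the last character consumed.
     */
    static int handleBrace(String s, int start, StringBuilder sb) {
        int close = s.indexOf('}', start);
        if (close > start
                && QUANT_BODY.matcher(s.substring(start + 1, close)).matches()) {
            // Valid quantifier: copy "{...}" through unchanged.
            sb.append(s, start, close + 1);
            return close;
        }
        // Not a quantifier: escape only the opening brace and let the
        // caller's main loop process the following characters normally.
        sb.append("\\{");
        return start;
    }
}
```

With this shape, the `{` that opens `{ (?> [^{}]+ | (??{...}) )* }` yields only `\{`, so the group and the character class that follow are still seen by the main loop.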
Root cause: The hex escape handler used Integer.parseInt(hexStr, 16) which throws NumberFormatException for strings containing non-hex characters (e.g., \x{9bq}). Inside character classes, this was caught and re-thrown as a fatal PerlCompilerException, killing the test run. Outside character classes, the escape was passed through to Java's regex engine which also rejected it.
Perl behavior: \x{9bq} extracts the valid hex prefix 9b (value 0x9B) and ignores the remaining characters. \x{x9b} has no valid prefix, so the value is 0. Underscores are allowed (removed by preprocessing) but other non-hex chars terminate the hex number.
Fix: All three \x{...} handlers now extract the valid hex prefix instead of requiring the entire content to be valid hex:
- `handleRegexCharacterClassEscape()`: inside `[...]` (was the fatal crash)
- `handleEscapeSequences()`: outside `[...]`
- Range endpoint parser: for character class ranges
Files changed: RegexPreprocessorHelper.java
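As a rough illustration of the prefix rule described above, the following sketch parses the content between `\x{` and `}`. The names `HexPrefixSketch` and `parseHexPrefix` are hypothetical, and stripping every underscore up front is a simplification of the "underscores are allowed" rule, not a claim about the actual handlers.

```java
final class HexPrefixSketch {
    // ASCII-only hex digit value, or -1 if c is not a hex digit.
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /**
     * Returns the value of the longest valid hex prefix of content,
     * or 0 when there is no valid prefix.
     * Throws ArithmeticException if the value overflows an int.
     */
    static int parseHexPrefix(String content) {
        String digits = content.replace("_", ""); // assumption: drop all underscores
        int value = 0;
        for (int i = 0; i < digits.length(); i++) {
            int d = hexValue(digits.charAt(i));
            if (d < 0) break; // first non-hex character ends the number
            value = Math.addExact(Math.multiplyExact(value, 16), d);
        }
        return value;
    }
}
```

Under these assumptions `parseHexPrefix("9bq")` gives `0x9B` and `parseHexPrefix("x9b")` gives `0`, matching the Perl behavior listed above. Range checking against the maximum code point is left to the caller.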
Root cause: Bare \x (without braces) was passed through to Java's regex engine, which expects exactly 2 hex digits after \x. Patterns like \xk or \x4j caused PatternSyntaxException.
Perl behavior: \x takes up to 2 hex digits. \xk = \x00 followed by literal k. \x4j = \x04 followed by literal j.
Fix: Added explicit bare \x handling that parses up to 2 hex digits and emits \x{HH} format when fewer than 2 valid hex digits are found.
Files changed: RegexPreprocessorHelper.java — handleEscapeSequences() method
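A minimal sketch of the bare `\x` rule follows. `BareHexSketch` and `handleBareX` are hypothetical names, and the calling convention (position of the `x`, returning the last consumed index) is assumed for illustration only.

```java
final class BareHexSketch {
    // ASCII-only hex digit value, or -1 if c is not a hex digit.
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /**
     * s.charAt(pos) is the 'x' that follows a backslash. Appends the
     * translated escape to sb and returns the index of the last
     * character consumed.
     */
    static int handleBareX(String s, int pos, StringBuilder sb) {
        int value = 0;
        int count = 0;
        int i = pos + 1;
        while (count < 2 && i < s.length()) {
            int d = hexValue(s.charAt(i));
            if (d < 0) break;
            value = value * 16 + d;
            i++;
            count++;
        }
        if (count == 2) {
            // Two hex digits: Java accepts \xHH as-is.
            sb.append("\\x").append(s, pos + 1, pos + 3);
        } else {
            // Zero or one digit: emit the braced form Java can parse.
            sb.append(String.format("\\x{%02X}", value));
        }
        return i - 1;
    }
}
```

For `\xk` this emits `\x{00}` and leaves `k` for the main loop; for `\x4j` it emits `\x{04}` and leaves `j`.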
Root cause: When regex compilation fails and gets downgraded to a warning, the catch block in RuntimeRegex.compile() set the error pattern but didn't set regex.patternString. Downstream code (e.g., replaceRegex()) checked regex.patternString == null and triggered recompilation with a null pattern, causing NPE in convertPythonStyleGroups(null).replaceAll(...).
Fix:
- Set `regex.patternString` in the catch block when downgrading to warning
- Added null guard in `preProcessRegex()` to treat null input as empty string
Files changed: RuntimeRegex.java, RegexPreprocessor.java
| Test | Before fixes | After fixes | Remaining failures |
|---|---|---|---|
| re/pat.t | 428/1298 | 1077/1298 (all run) | 221 fail |
| re/pat_advanced.t | 63/1298 | 1308/1625 | 317 fail + 53 not reached |
| re/pat_rt_report.t | 2397/2515 | 2431/2515 (ran 2508) | 77 fail + 7 not reached |
| re/regexp_unicode_prop.t | - | 1017/1096 | 79 fail + 14 not reached |
| re/reg_eval_scope.t | 6/49 | 7/49 | 42 fail |
| uni/variables.t | 66880/66880 | 66880/66880 | 0 |
| Test | Crash point | Cause | Tests blocked |
|---|---|---|---|
| pat.t | No crash: all 1298 tests now run | N/A | 0 |
| pat_advanced.t | Line 2308 (test 1625) | `\p{Is_q}`: package-scoped user property (`Some::Is_q`) | 53 tests |
| pat_rt_report.t | Line 1158 (test 2508) | `(?1)`: numbered group recursion not supported | 7 tests |
| regexp_unicode_prop.t | Line 543 (test 1096) | `\pf`/`\Pf` invalid property generates warnings instead of errors | 14 tests |
The `\G` assertion (match at `pos()`) has significant issues:
- Floating `\G` patterns like `/a+\G/` fail: Java doesn't support `\G` except at pattern start
- `\G` in loops (`/\G.../gc` iteration) doesn't maintain position correctly
- Tests: pat.t 455-518
Difficulty: Medium-High. Requires custom \G tracking in the match engine; Java's \G only works at the start of a match attempt.
Regex embedded code blocks are replaced with no-op groups. This breaks:
- `$^R`: result of last `(?{...})` (tests 308-310)
- `pos()` inside `(?{...})` (tests 470-494)
- Package/lexical variable access inside `(?{...})` (tests 522-525)
- Eval-group runtime checks: "Eval-group not allowed at runtime" (tests 300-304)
Difficulty: Very High. Would require integrating the Perl compiler into the regex engine to execute code at match time.
`$^N` is not updated after successful group captures. Tests 69-88 all fail.
- Usage fails both outside the regex and inside `(?{...})`
- Automatic localization of `$^N` is not implemented
Difficulty: Medium. Requires tracking the last successfully matched group in the match result.
Non-constant recursive patterns are replaced with empty groups. Tests 293-297 (complicated backtracking, recursion with (??{})) all fail.
Difficulty: Very High. Same as (?{...}) — requires runtime code execution.
Regex control verbs are not supported by Java's regex engine. Tests 357-373 (ACCEPT and CLOSE buffer tests).
Difficulty: High. Would require a custom regex engine or post-processing layer.
The match position arrays have bugs:
- Wrong values for capture group positions (tests 381-438)
- Stale values not cleared after new match (tests 439-441)
- Read-only protection throws wrong exception type: `UnsupportedOperationException` instead of `Modification of a read-only value attempted` (test 449)
- Interpolation in patterns: `@-` and `@+` should not be interpolated (pat_rt_report.t 151-154)
- Undefined values in `@-`/`@+` after match (pat_rt_report.t 213)
Difficulty: Medium. The data is available from Java's Matcher; needs more careful mapping to Perl semantics.
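One possible mapping from Java's `Matcher` to `@-`/`@+` style arrays is sketched below. `MatchOffsetsSketch` and `offsets` are hypothetical names, and using `null` to stand for Perl's undef is an assumption made for the example. The sketch does not model the differing lengths of `@-` and `@+`, and Java offsets are UTF-16 code units, which may differ from Perl character offsets for supplementary characters.

```java
import java.util.regex.Matcher;

final class MatchOffsetsSketch {
    /**
     * Returns {starts, ends} for group 0 and every capture group of a
     * successful match. A null element stands for undef (group did not
     * participate in the match).
     */
    static Integer[][] offsets(Matcher m) {
        int n = m.groupCount() + 1;
        // Fresh arrays per match, so no stale values survive from an earlier match.
        Integer[] starts = new Integer[n];
        Integer[] ends = new Integer[n];
        for (int g = 0; g < n; g++) {
            int s = m.start(g); // -1 when the group did not match
            if (s >= 0) {
                starts[g] = s;
                ends[g] = m.end(g);
            }
        }
        return new Integer[][] {starts, ends};
    }
}
```

Building new arrays on every successful match is one simple way to avoid the stale-value problem listed above; whether that fits the existing special-variable code is not something this document confirms.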
- `qr/\b\v$/xism` stringifies as `(?^imsx:\b\v$)` but should be `(?^msix:\\b\\v$)`: backslashes not escaped in stringification (test 315)
- `/u` modifier not tracked: `use feature 'unicode_strings'` should add `/u` flag (tests 323-327)
Difficulty: Low-Medium. Stringification fix is straightforward; /u modifier tracking needs scope awareness.
Named character escapes have extensive issues:
- Empty `\N{}` not handled correctly (tests 794-809)
- `\N{PLUS SIGN}`: named characters not expanded in regex (tests 831-833)
- `\N{U+0041}` in character class: `[\N{SPACE}\N{U+0041}]` fails (test 836)
- Charname validation: leading digit, comma, latin1 symbol errors not produced (tests 821-828)
- Charname caching with `$1`: not implemented (tests 798-801)
- Cedilla/NO-BREAK SPACE in names: error handling missing (tests 816-819)
Difficulty: Medium-High. \N{U+XXXX} is partially implemented; full charnames support needs the charnames module.
Perl warns about useless regex modifiers (/c, /g, /o are match-operator flags, not regex flags). PerlOnJava silently ignores them without producing warnings.
Difficulty: Low. Add warning emission in the regex flag parser.
Our fix handles the crash but the test strings don't match correctly:
- `\x4j` produces `\004j` but regex `[\x4j]{2}` doesn't match it (test 101)
- `\xk` produces `\000k` but regex `[\xk]{2}` doesn't match it (test 102)
- `\xx`, `\xxa`, `\x9_b`: regex character class expansion doesn't match the test string (tests 103-105)
The issue is that the test string and the regex pattern both use \x escapes, but the regex preprocessor and the string processor handle them differently. The test expects both to produce the same character.
Difficulty: Low-Medium. The regex-side \x handling needs to produce character classes that match what the string-side produces.
The largest single failure category. Patterns like /([ ]*$)(?(1))/ don't match correctly. This is a systematic issue with conditionals referencing a group that ends with $ anchor.
Difficulty: Medium. Likely a subtle difference in how Java handles the interaction between $ anchor in a group and conditional backreference.
$REGMARK (set by (*MARK:name)) is not implemented. Tests 2458-2463.
Difficulty: High. Requires (*MARK) verb support.
(?1) and (?&name) syntax for recursing into capture groups is not recognized. Now downgradable with JPERL_UNIMPLEMENTED=warn (no longer crashes tests), but the patterns silently fail to match.
Difficulty: Very High. Java's regex engine has no recursion support. Would need a custom engine or PCRE/JNI bridge.
The POSIX-style Unicode property syntax \p{isAlpha}, \p{isSpace} was not recognized. This caused the fatal error that stopped pat.t at line 1247, blocking the 666 remaining tests. Fix 5 added these aliases, and pat.t now runs all 1298 tests.
Difficulty: Low-Medium. Map POSIX-style aliases (isAlpha → Alpha, isSpace → Space, etc.) in the Unicode property handler.
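The alias mapping can be sketched roughly as follows. `PropertyAliasSketch`, `normalize`, and the contents of the alias table are illustrative assumptions; how the resulting name is then resolved to a character set is outside the sketch.

```java
import java.util.Locale;
import java.util.Map;

final class PropertyAliasSketch {
    // Hypothetical alias table, keyed by the lowercased name after "is".
    private static final Map<String, String> ALIASES = Map.of(
            "alpha", "Alpha",
            "space", "Space",
            "alnum", "Alnum",
            "punct", "Punct");

    /**
     * Maps names such as "isAlpha" or "IsSpace" to their base property
     * name. Names without a known alias are returned unchanged, so
     * user-defined properties like "Is_q" are not touched.
     */
    static String normalize(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.length() > 2 && lower.startsWith("is")) {
            String mapped = ALIASES.get(lower.substring(2));
            if (mapped != null) return mapped;
        }
        return name;
    }
}
```

Matching the `is` prefix case-insensitively lets one table entry cover both `\p{isAlpha}` and `\p{IsAlpha}`.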
Empty alternatives in patterns like /(|a)/ or the "0 match in alternation" test don't work correctly.
Difficulty: Low-Medium. Likely a regex preprocessing issue.
| Issue | Tests | Difficulty |
|---|---|---|
| Look around edge cases | pat.t 332-333 | Medium |
| REG_INFTY (quantifier limit) | pat.t 250 | Low |
| POSIX class error message format | pat.t 348 | Low |
| Lookbehind limit (Java) | pat.t 252 | Hard (engine limit) |
| Empty pattern pmop flags | pat_rt_report.t 44 | Medium |
| Nested split | pat_rt_report.t 85 | Medium |
| Ill-formed UTF-8 in class | pat_rt_report.t 140 | Medium |
| Pattern in loop (prev success) | pat_rt_report.t 2469-2470 | Medium |
| Long string patterns | pat_advanced.t 805-813 | Medium |
| `/d` to `/u` modifier change | pat_advanced.t 807-808 | Low-Medium |
\p{Is_q} defined in package Some as Some::Is_q is not found because user-defined property lookup only checks main:: package. Perl uses the current package when resolving \p{...} names. This crashes pat_advanced.t at line 2308 (test 1625), blocking 53 tests.
Difficulty: Medium. Need to pass the current package context to the regex preprocessor and try the current package before falling back to main::.
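The lookup order described above might look like the sketch below. `UserPropertyLookupSketch`, `candidates`, and `resolve` are hypothetical names, and a plain `Set<String>` of fully qualified sub names stands in for whatever symbol table the runtime actually consults.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class UserPropertyLookupSketch {
    /** Fully qualified names to try, in order. */
    static List<String> candidates(String prop, String currentPackage) {
        if (prop.contains("::")) {
            return List.of(prop); // already qualified
        }
        if (currentPackage.equals("main")) {
            return List.of("main::" + prop);
        }
        // Current package first, then fall back to main::.
        return List.of(currentPackage + "::" + prop, "main::" + prop);
    }

    /** Returns the first candidate that names a defined sub, if any. */
    static Optional<String> resolve(String prop, String currentPackage,
                                    Set<String> definedSubs) {
        return candidates(prop, currentPackage).stream()
                .filter(definedSubs::contains)
                .findFirst();
    }
}
```

With `Some::Is_q` defined and `Some` as the current package, `resolve("Is_q", "Some", ...)` finds `Some::Is_q`, which is the case that currently fails.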
Invalid single-character properties like \pf, \Pq are passed through to Java's regex engine which throws PatternSyntaxException. This is caught and wrapped as PerlJavaUnimplementedException, which under JPERL_UNIMPLEMENTED=warn generates warnings instead of proper errors. Test 1096 in regexp_unicode_prop.t expects 0 warnings but gets 8 (from \pf, \Pf, \pq, \Pq), then crashes.
Fix approach: Validate single-char properties in the preprocessor (only \pL, \pM, \pN, etc. are valid — single Unicode general category letters). Invalid ones should throw PerlCompilerException (not PerlJavaUnimplementedException).
Difficulty: Low. Add validation for single-char \p/\P properties in RegexPreprocessorHelper.
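A minimal sketch of such a check is shown below. `SingleCharPropertySketch` is a hypothetical name, `IllegalArgumentException` stands in for `PerlCompilerException`, and the error text is a placeholder rather than Perl's actual message. The sketch accepts only the seven uppercase general category letters; whether lowercase forms should also be accepted is left open.

```java
final class SingleCharPropertySketch {
    // One-letter Unicode general categories: Other, Letter, Mark,
    // Number, Punctuation, Symbol, Separator.
    private static final String VALID = "CLMNPSZ";

    static boolean isValid(char c) {
        return VALID.indexOf(c) >= 0;
    }

    /**
     * Rejects \pX / \PX when X is not a general category letter.
     * IllegalArgumentException is a stand-in for PerlCompilerException.
     */
    static void validate(char c) {
        if (!isValid(c)) {
            throw new IllegalArgumentException(
                    "invalid single-character property: " + c); // placeholder text
        }
    }
}
```

Running the check in the preprocessor means `\pf` and `\Pq` never reach Java's regex engine, so they cannot be downgraded to warnings by the unimplemented-feature path.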
Perl calls user-defined property subs with $caseless=1 when the /i flag is active, allowing subs to return a wider character set for case-insensitive matching. PerlOnJava always calls the sub with an empty argument list. This causes 2 test failures in regexp_unicode_prop.t (tests 1061, 1077) and several in pat_advanced.t.
Fix approach: Pass the /i flag through the regex preprocessor to tryUserDefinedProperty, which then passes 1 as the first argument to the property sub.
Difficulty: Medium. Requires threading the case-insensitive flag through several method calls in the regex preprocessing pipeline.
Quick wins (Low difficulty, high impact):
1. `\p{isAlpha}` aliases: unblocks 666 pat.t tests (category N). DONE, pat.t now runs all 1298 tests
2. Invalid `\pX` single-char properties: unblocks 14 regexp_unicode_prop.t tests (category R)
3. Useless `(?c)`/`(?g)`/`(?o)` warnings: fixes 13 pat_advanced.t tests (category I)
4. POSIX class error message: fix message format (category P)
5. REG_INFTY error: add quantifier limit check (category P)
Medium effort, significant impact:
6. Package-scoped user properties — unblocks 53 pat_advanced.t tests (category Q)
7. /i caseless flag for user properties — fixes ~4 tests (category S)
8. (?(1)...) with $ anchor — fixes 48 pat_rt_report.t tests (category K)
9. @-/@+ position arrays — fixes 17 tests across files (category F)
10. $^N last capture — fixes 20 pat_advanced.t tests (category C)
11. Bare \x edge cases — fixes 5 pat_advanced.t tests (category J)
12. \N{name} charnames — fixes 25 pat_advanced.t tests (category H)
Hard / architectural (major work):
13. \G anchor — 26 pat.t tests (category A)
14. (?{...}) code blocks — 46 tests total (category B)
15. (?1) recursion / (?&name) / (*ACCEPT) / (*MARK) — engine limitations (categories E, L, M)
- Fix 1: handleQuantifier brace consumption (2026-04-10)
- Fix 2: \x{...} hex escape with non-hex chars (2026-04-10)
- Fix 3: Bare \xNN with non-hex chars (2026-04-10)
- Fix 4: NPE on failed regex with JPERL_UNIMPLEMENTED=warn (2026-04-10)
- Failure analysis and categorization (2026-04-10)
- Fix 5: \p{isAlpha} case-insensitive Is prefix, add Space/Alnum/Punct aliases (2026-04-10)
- Fix 6: \p{Property=Value} syntax (2026-04-10)
- Fix 7: Named capture groups with underscores — U95 encoding (2026-04-10)
- Fix 8: User-defined property resolution — refactor resolvePropertyReference to return UnicodeSet (2026-04-10)
  - Properties using `+utf8::` references (e.g., `+utf8::Uppercase`, `&utf8::ASCII`) were failing because the old code returned Java regex patterns that ICU4J's UnicodeSet couldn't parse
  - Created `resolvePropertyReferenceAsSet()` and `resolveStandardPropertyAsSet()` methods
- Fix 9: Regex cache preventing deferred recompilation (2026-04-10)
  - `ensureCompiledForRuntime()` now evicts stale cache entries before recompiling
- Fix 10: Cache user-defined property sub results (2026-04-10)
  - Matches Perl behavior of calling each property sub only once
  - Fixes "Called twice" errors from subs with `state` variables
- Fix 11: Titlecase/TitlecaseLetter/Lt property aliases (2026-04-10)
- Fix 12: (?&name) named group recursion downgraded to regexUnimplemented (2026-04-10)
- Fix 13: (?digit) numbered recursion downgraded to regexUnimplemented (2026-04-10)
- `src/main/java/org/perlonjava/runtime/regex/RegexPreprocessor.java`
- `src/main/java/org/perlonjava/runtime/regex/RegexPreprocessorHelper.java`
- `src/main/java/org/perlonjava/runtime/regex/RuntimeRegex.java`
- `src/main/java/org/perlonjava/runtime/regex/UnicodeResolver.java`
- `src/main/java/org/perlonjava/runtime/regex/CaptureNameEncoder.java`
- `src/main/java/org/perlonjava/runtime/runtimetypes/HashSpecialVariable.java`