This document defines the comprehensive sublanguage parser architecture for PerlOnJava, designed to provide proper semantic parsing for domain-specific languages embedded within Perl using a unified lexer→parser→AST framework.
Perl contains multiple "sublanguages" - specialized syntaxes within string literals that have their own complex parsing rules:
- Regex patterns: `/(?:pattern){3,5}/flags` - complex syntax with quantifiers, groups, and escapes
- Pack/unpack templates: `pack("A10i4l<*", ...)` - binary data formatting with endianness and counts
- Sprintf formats: `sprintf("%*.*f", width, precision, value)` - printf-style formatting with dynamic width
- Transliteration: `tr/\x00-\xFF/a-z/` - character mapping with ranges and escapes
- Double-quoted strings: `"Hello \x41\n$var"` - interpolation with escapes and variables
- Manual String Manipulation: Current parsing uses character-by-character loops
  - Result: Parsing bugs, poor maintainability, no semantic understanding
- Runtime Parsing Overhead: Patterns are parsed every time code executes
  - Result: Performance penalty, especially in loops
- Inconsistent Error Handling: Each sublanguage has different error reporting
  - Result: Poor developer experience, inconsistent error messages
- No Semantic Understanding: Direct string manipulation without an AST
  - Result: Cannot validate semantics, transform patterns, or optimize
Core Principle: All sublanguages follow the same lexer→parser→AST pattern for consistency, maintainability, and semantic understanding.
Architecture Flow:
Input String → SublanguageLexer → SublanguageToken[] → SublanguageParser → SublanguageASTNode → Validation/Transformation
Why This Approach:
- Compile-Time Parsing: Parse once during compilation, use optimized structures at runtime
- Semantic Understanding: AST enables validation, transformation, and optimization
- Consistent Error Reporting: All sublanguages use same position tracking and error formatting
- Maintainable Code: Unified patterns across all sublanguages
- Extensible: Easy to add new sublanguages following the same pattern
RULE: Every sublanguage MUST follow this exact flow:
// CORRECT: Follow the architecture
String input = "\\x41\\n";
SublanguageLexer lexer = new RegexLexer(input, sourceOffset);
List<SublanguageToken> tokens = lexer.tokenize();
SublanguageParser parser = new RegexParser(tokens);
SublanguageASTNode ast = parser.parse();
// WRONG: Utility functions that bypass architecture
String result = EscapeSequenceParser.parseEscape(input, 0); // ❌ VIOLATES ARCHITECTURE
Why: Bypassing the lexer→parser→AST flow breaks consistency, error reporting, and semantic understanding.
RULE: Reusable components MUST work within the lexer→parser→AST framework.
// CORRECT: Reusable components within architecture
public class RegexLexer extends SublanguageLexer {
protected boolean tokenizeEscape() {
// Tokenize escape sequence into SublanguageToken
// This can be shared across sublanguage lexers
return tokenizeCommonEscape(); // ✅ FOLLOWS ARCHITECTURE
}
}
// WRONG: Utility functions outside architecture
public class EscapeUtils {
public static String parseEscape(String input, int pos) { // ❌ BYPASSES ARCHITECTURE
// Direct string manipulation
}
}
Why: Utility shortcuts bypass position tracking, error handling, and AST generation.
RULE: All parsing MUST produce SublanguageASTNode objects for semantic analysis.
// CORRECT: AST-first design
SublanguageASTNode escapeNode = new SublanguageASTNode("escape", "\\n");
escapeNode.addAnnotation("resolvedValue", "\n");
escapeNode.addAnnotation("escapeType", "newline");
// WRONG: Direct string results
String result = "\\n"; // ❌ NO SEMANTIC INFORMATION
Why: AST enables semantic validation, transformation, and optimization that strings cannot provide.
- Lexer/Token System:
  - `LexerToken` with `type` and `text`
  - `LexerTokenType` enum for categorization
  - `Lexer.java` for tokenization
- Parser Infrastructure:
  - `Parser.java` base class with `EmitterContext`
  - Specialized parsers: `StringParser`, `NumberParser`, `IdentifierParser`
  - `TokenUtils` for token manipulation
  - Consistent error handling with `PerlCompilerException`
- AST Generation:
  - `Node` hierarchy for AST representation
  - Context-aware compilation with `EmitterContext`
- Inconsistent Architecture: Each sublanguage uses a different parsing approach
  - Why this matters: Developers must learn different patterns for each sublanguage
  - Impact: Higher maintenance cost, more bugs, inconsistent user experience
- Manual String Manipulation: Direct character-by-character parsing
  - Why this matters: Error-prone, hard to extend, no semantic understanding
  - Impact: Parsing bugs, poor error messages, difficulty adding new features
- Poor Error Reporting: Limited location information
  - Why this matters: Users can't easily find and fix their mistakes
  - Impact: Poor developer experience, harder debugging
- Runtime Parsing Overhead: Parsing happens every time code runs
  - Why this matters: Performance penalty for repeated pattern parsing
  - Impact: Slower execution, especially for patterns in loops
Design Philosophy:
- Compile-Time Parsing: Parse sublanguages during compilation, not at runtime
  - Why: Better performance, early error detection, semantic understanding
  - How: Integration points in `StringParser.java` call sublanguage parsers
- Generic AST Approach: Single `SublanguageASTNode` class with annotations
  - Why: Simpler than many specific subclasses, easier to extend
  - How: Node type and annotations describe structure, children hold sub-nodes
- Two-Error-Type System: Distinguish user mistakes from unimplemented features
  - Why: PerlOnJava implements Perl incrementally and needs graceful degradation
  - How: `SYNTAX_ERROR` (user mistake) vs `UNIMPLEMENTED_ERROR` (valid Perl, not supported yet)
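The two-error-type distinction can be sketched as follows. This is an illustrative standalone sketch; the class and field names below are assumptions, not the actual `SublanguageValidationResult` API.

```java
// Hedged sketch of the two-error-type system. The real SublanguageValidationResult
// API may differ; names here are illustrative assumptions.
public class TwoErrorTypeSketch {
    public enum ErrorType { SYNTAX_ERROR, UNIMPLEMENTED_ERROR }

    public static final class ValidationResult {
        public final boolean valid;
        public final ErrorType errorType; // null when valid
        public final String message;

        private ValidationResult(boolean valid, ErrorType errorType, String message) {
            this.valid = valid;
            this.errorType = errorType;
            this.message = message;
        }

        public static ValidationResult ok() { return new ValidationResult(true, null, null); }
        public static ValidationResult syntaxError(String msg) {
            return new ValidationResult(false, ErrorType.SYNTAX_ERROR, msg);
        }
        public static ValidationResult unimplemented(String msg) {
            return new ValidationResult(false, ErrorType.UNIMPLEMENTED_ERROR, msg);
        }
    }

    // Example classification for pack formats: 'w' is valid Perl (BER compressed
    // integer) but not yet implemented; '~' is not a pack format at all.
    public static ValidationResult validatePackFormat(char f) {
        if ("aAZbBhHcCsSiIlLnNvVqQjJfdFDpPuUxX@".indexOf(f) >= 0) return ValidationResult.ok();
        if (f == 'w' || f == 'W') return ValidationResult.unimplemented("pack format '" + f + "' not supported yet");
        return ValidationResult.syntaxError("Invalid type '" + f + "' in pack");
    }
}
```

A caller can then fail compilation on `SYNTAX_ERROR` while degrading gracefully (e.g., warning or falling back) on `UNIMPLEMENTED_ERROR`.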
// Comprehensive token types for all sublanguages (IMPLEMENTED)
public enum SublanguageTokenType {
// Common types
LITERAL, ESCAPE, DELIMITER, MODIFIER, COMMENT, WHITESPACE,
// Regex-specific
REGEX_ANCHOR, REGEX_QUANTIFIER, REGEX_GROUP, REGEX_CLASS,
// Pack-specific
PACK_FORMAT, PACK_COUNT, PACK_ENDIAN,
// Sprintf-specific
SPRINTF_FLAG, SPRINTF_WIDTH, SPRINTF_PRECISION, SPRINTF_CONVERSION,
// Transliteration-specific
TR_RANGE, TR_CLASS, TR_MODIFIER
}
// Base sublanguage token
public class SublanguageToken extends LexerToken {
public final SublanguageTokenType sublanguageType;
public final int position;
public final int length;
public SublanguageToken(SublanguageTokenType type, String text, int position, int length) {
super(LexerTokenType.IDENTIFIER, text); // Map to base type
this.sublanguageType = type;
this.position = position;
this.length = length;
}
}
// Base sublanguage lexer
public abstract class SublanguageLexer {
protected final String input;
protected int pos = 0;
protected final List<SublanguageToken> tokens = new ArrayList<>();
public SublanguageLexer(String input) {
this.input = input;
}
public List<SublanguageToken> tokenize() {
while (pos < input.length()) {
if (!tokenizeNext()) {
throw new PerlCompilerException("Unexpected character at position " + pos);
}
}
return tokens;
}
protected abstract boolean tokenizeNext();
protected void addToken(SublanguageTokenType type, String text) {
tokens.add(new SublanguageToken(type, text, pos - text.length(), text.length()));
}
protected char peek() { return pos < input.length() ? input.charAt(pos) : '\0'; }
protected char advance() { return pos < input.length() ? input.charAt(pos++) : '\0'; }
protected boolean match(String expected) {
if (input.startsWith(expected, pos)) {
pos += expected.length();
return true;
}
return false;
}
}
// Base sublanguage parser
public abstract class SublanguageParser<T> {
protected final List<SublanguageToken> tokens;
protected int tokenIndex = 0;
protected final EmitterContext ctx;
public SublanguageParser(List<SublanguageToken> tokens, EmitterContext ctx) {
this.tokens = tokens;
this.ctx = ctx;
}
public abstract T parse();
protected SublanguageToken peek() {
return tokenIndex < tokens.size() ? tokens.get(tokenIndex) : null;
}
protected SublanguageToken advance() {
return tokenIndex < tokens.size() ? tokens.get(tokenIndex++) : null;
}
protected boolean match(SublanguageTokenType type) {
SublanguageToken token = peek();
if (token != null && token.sublanguageType == type) {
advance();
return true;
}
return false;
}
protected void error(String message) {
SublanguageToken token = peek();
int position;
if (token != null) {
position = token.position;
} else if (!tokens.isEmpty()) {
SublanguageToken last = tokens.get(tokens.size() - 1);
position = last.position + last.length; // past end of input
} else {
position = 0;
}
throw new PerlCompilerException(message + " at position " + position);
}
}
// Regex-specific lexer
public class RegexLexer extends SublanguageLexer {
public RegexLexer(String input) {
super(input);
}
@Override
protected boolean tokenizeNext() {
char c = peek();
switch (c) {
case '\\':
return tokenizeEscape();
case '[':
return tokenizeCharacterClass();
case '(':
return tokenizeGroup();
case '*':
case '+':
case '?':
return tokenizeQuantifier();
case '^':
case '$':
return tokenizeAnchor();
default:
return tokenizeLiteral();
}
}
private boolean tokenizeEscape() {
int start = pos;
advance(); // consume '\'
char next = peek();
if (next == 'g') {
advance();
if (peek() == '{') {
// \g{...} backreference
while (peek() != '}' && peek() != '\0') advance();
if (peek() == '}') advance();
} else if (Character.isDigit(peek()) || peek() == '-') {
// \g1 or \g-1
if (peek() == '-') advance();
while (Character.isDigit(peek())) advance();
}
} else if (Character.isDigit(next)) {
// \1, \2, etc.
while (Character.isDigit(peek())) advance();
} else {
advance(); // single character escape
}
addToken(SublanguageTokenType.ESCAPE, input.substring(start, pos));
return true;
}
// ... other tokenize methods
}
// Regex-specific parser
public class RegexParser extends SublanguageParser<RegexAST> {
public RegexParser(List<SublanguageToken> tokens, EmitterContext ctx) {
super(tokens, ctx);
}
@Override
public RegexAST parse() {
return parseAlternation();
}
private RegexAST parseAlternation() {
RegexAST left = parseSequence();
// Check the '|' delimiter before consuming it; match() alone would
// advance past the token we still need to inspect.
while (peek() != null && peek().sublanguageType == SublanguageTokenType.DELIMITER
&& peek().text.equals("|")) {
advance(); // consume '|'
RegexAST right = parseSequence();
left = new AlternationNode(left, right);
}
return left;
}
private RegexAST parseSequence() {
List<RegexAST> elements = new ArrayList<>();
while (peek() != null && !peek().text.equals("|")) {
elements.add(parseElement());
}
return new SequenceNode(elements);
}
// ... other parse methods with proper semantic validation
}
// Pack-specific lexer
public class PackLexer extends SublanguageLexer {
public PackLexer(String input) {
super(input);
}
@Override
protected boolean tokenizeNext() {
char c = peek();
if (isPackFormat(c)) {
advance();
addToken(SublanguageTokenType.PACK_FORMAT, String.valueOf(c));
return true;
} else if (c == '<' || c == '>') {
advance();
addToken(SublanguageTokenType.PACK_ENDIAN, String.valueOf(c));
return true;
} else if (c == '!') {
advance();
addToken(SublanguageTokenType.MODIFIER, "!");
return true;
} else if (Character.isDigit(c) || c == '*') {
return tokenizeCount();
} else if (c == '[') {
return tokenizeBracketExpression();
} else if (c == '(') {
advance();
addToken(SublanguageTokenType.DELIMITER, "(");
return true;
}
// ... handle other cases
return false;
}
private boolean isPackFormat(char c) {
return "aAZbBhHcCsSiIlLnNvVqQjJfdFDpPuUwWxX@".indexOf(c) >= 0;
}
// ... other methods
}
// Pack-specific parser with semantic validation
public class PackParser extends SublanguageParser<PackTemplate> {
public PackParser(List<SublanguageToken> tokens, EmitterContext ctx) {
super(tokens, ctx);
}
@Override
public PackTemplate parse() {
List<PackItem> items = new ArrayList<>();
while (peek() != null) {
items.add(parsePackItem());
}
return new PackTemplate(items);
}
private PackItem parsePackItem() {
// Parse format character
if (!match(SublanguageTokenType.PACK_FORMAT)) {
error("Expected pack format character");
}
String format = tokens.get(tokenIndex - 1).text;
// Parse modifiers
List<String> modifiers = new ArrayList<>();
while (match(SublanguageTokenType.PACK_ENDIAN) ||
match(SublanguageTokenType.MODIFIER)) {
modifiers.add(tokens.get(tokenIndex - 1).text);
}
// Validate modifier combinations
validateModifiers(format, modifiers);
// Parse count
PackCount count = parseCount();
return new PackItem(format, modifiers, count);
}
private void validateModifiers(String format, List<String> modifiers) {
boolean hasLittleEndian = modifiers.contains("<");
boolean hasBigEndian = modifiers.contains(">");
if (hasLittleEndian && hasBigEndian) {
error("Cannot use both '<' and '>' modifiers");
}
// Format-specific validation
if ("aAZ".contains(format) && (hasLittleEndian || hasBigEndian)) {
error("Endianness modifiers not valid for string formats");
}
}
}
// Sprintf-specific lexer
public class SprintfLexer extends SublanguageLexer {
public SprintfLexer(String input) {
super(input);
}
@Override
protected boolean tokenizeNext() {
char c = peek();
if (c == '%') {
return tokenizeSpecifier();
} else {
return tokenizeLiteral();
}
}
private boolean tokenizeSpecifier() {
int start = pos;
advance(); // consume '%'
// Parse flags
while (pos < input.length() && "+-# 0".indexOf(peek()) >= 0) {
advance();
}
// Parse vector flag: "v" or "*v" (precedes width, e.g. %vd, %*vd)
if (peek() == 'v') {
advance();
} else if (peek() == '*' && pos + 1 < input.length() && input.charAt(pos + 1) == 'v') {
advance();
advance();
}
// Parse width
while (Character.isDigit(peek()) || peek() == '*') {
advance();
}
// Parse precision
if (peek() == '.') {
advance();
while (Character.isDigit(peek()) || peek() == '*') {
advance();
}
}
// Parse conversion
if (pos < input.length() && "diouxXeEfFgGaAcspn%".indexOf(peek()) >= 0) {
advance();
}
addToken(SublanguageTokenType.SPRINTF_CONVERSION, input.substring(start, pos));
return true;
}
}
// Add to StringParser.java
public static class SublanguageParseResult {
public final Object ast;
public final int endIndex;
public final String processedString;
public SublanguageParseResult(Object ast, int endIndex, String processedString) {
this.ast = ast;
this.endIndex = endIndex;
this.processedString = processedString;
}
}
public static SublanguageParseResult parseRegex(EmitterContext ctx, String regex, RegexFlags flags) {
RegexLexer lexer = new RegexLexer(regex);
List<SublanguageToken> tokens = lexer.tokenize();
RegexParser parser = new RegexParser(tokens, ctx);
RegexAST ast = parser.parse();
// Transform AST to Java-compatible string
RegexTransformer transformer = new RegexTransformer(flags);
String javaRegex = transformer.transform(ast);
return new SublanguageParseResult(ast, regex.length(), javaRegex);
}
// Enhanced error reporting using existing patterns
public class SublanguageError extends PerlCompilerException {
private final String input;
private final int position;
private final String sublanguage;
public SublanguageError(String message, String input, int position, String sublanguage) {
super(formatError(message, input, position, sublanguage));
this.input = input;
this.position = position;
this.sublanguage = sublanguage;
}
private static String formatError(String message, String input, int position, String sublanguage) {
String before = input.substring(0, Math.min(position, input.length()));
String after = input.substring(Math.min(position, input.length()));
return String.format("%s error: %s in %s pattern; marked by <-- HERE in /%s <-- HERE %s/",
sublanguage, message, sublanguage.toLowerCase(), before, after);
}
}
- Uses established `LexerToken`/`Parser` patterns
- Integrates with `EmitterContext` and error handling
- Follows existing code organization and naming conventions
- No external dependencies (ANTLR overhead)
- Optimized for PerlOnJava's specific needs
- Reuses existing infrastructure
- Familiar patterns for the development team
- Consistent error handling and reporting
- Easy to extend and modify
- Can be implemented one sublanguage at a time
- Maintains backward compatibility
- Allows gradual migration from current implementations
The phases below implement sublanguages following the mandatory lexer→parser→AST architecture. Violating these principles will result in:
- Broken architecture - Utility shortcuts bypass lexer/parser/AST flow
- No semantic understanding - Direct string manipulation loses AST benefits
- Inconsistent error handling - Position tracking and validation broken
- Maintenance nightmare - Code scattered across utility classes instead of unified architecture
Status: Base architecture implemented and verified
What Was Built:
- `SublanguageLexer` - Base class with tokenization patterns and position tracking
- `SublanguageParser` - Base class returning `SublanguageASTNode` objects
- `SublanguageToken` - Position-aware tokens with source offset mapping
- `SublanguageASTNode` - Generic AST nodes with annotations for all sublanguages
- `SublanguageValidationResult` - Two-error-type system (SYNTAX_ERROR vs UNIMPLEMENTED_ERROR)
Architecture Verification:
- ✅ All base classes follow lexer→parser→AST flow
- ✅ No utility shortcuts that bypass architecture
- ✅ AST-first design with semantic annotations
- ✅ Consistent error handling with position tracking
Priority: HIGH - Foundation for other sublanguages
Prerequisites: Phase 1 complete
Why Double-Quoted Strings First:
- Simpler sublanguage to establish the pattern
- Contains common elements (escapes, interpolation) used by other sublanguages
- Validates the architecture before more complex sublanguages
Implementation:
- `DoubleQuotedLexer extends SublanguageLexer`
  // Tokenize: "Hello \x41\n$var"
  // → [LITERAL:"Hello ", ESCAPE:"\x41", ESCAPE:"\n", VARIABLE:"$var"]
- `DoubleQuotedParser extends SublanguageParser`
  // Parse tokens → SublanguageASTNode tree
  // Enable semantic validation of escapes and interpolation
- Integration via static method
  // In DoubleQuotedParser.java
  public static SublanguageValidationResult validateString(String content, int sourceOffset)
Verification:
- Lexer produces proper `SublanguageToken` objects with position tracking
- Parser produces a `SublanguageASTNode` AST (not strings)
- Integration works via static method (no StringParser pollution)
- Escape sequences tokenized properly (foundation for other sublanguages)
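As a standalone illustration of the tokenization sketched above, the following simplified lexer produces the same token stream for `"Hello \x41\n$var"`. It is a sketch only: tokens are plain `"TYPE:text"` strings, whereas the real `DoubleQuotedLexer` would emit `SublanguageToken` objects with position tracking and handle many more escape forms.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified standalone sketch of double-quoted string tokenization.
public class DoubleQuotedLexerSketch {
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder literal = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == '\\' && pos + 1 < input.length()) {
                flush(tokens, literal);
                int start = pos;
                pos += 2; // consume backslash and the escape character
                if (input.charAt(start + 1) == 'x') { // \x41-style hex escape
                    while (pos < input.length() && Character.digit(input.charAt(pos), 16) >= 0) pos++;
                }
                tokens.add("ESCAPE:" + input.substring(start, pos));
            } else if (c == '$') { // simple scalar interpolation: $name
                flush(tokens, literal);
                int start = pos++;
                while (pos < input.length() && Character.isLetterOrDigit(input.charAt(pos))) pos++;
                tokens.add("VARIABLE:" + input.substring(start, pos));
            } else {
                literal.append(c);
                pos++;
            }
        }
        flush(tokens, literal);
        return tokens;
    }

    private static void flush(List<String> tokens, StringBuilder literal) {
        if (literal.length() > 0) {
            tokens.add("LITERAL:" + literal);
            literal.setLength(0);
        }
    }
}
```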
Prerequisites: Phase 2 complete (establishes escape handling pattern)
Implementation follows same lexer→parser→AST pattern:
- RegexLexer extends SublanguageLexer - Tokenize regex syntax
- RegexParser extends SublanguageParser - Build AST for semantic validation
- Static integration method: `RegexParser.validatePattern()`
Priority: HIGH - Serious test failures to resolve
Prerequisites: Phase 2 complete (establishes number/count tokenization pattern)
Prerequisites: Phase 2 complete (establishes format tokenization pattern)
Prerequisites: Phase 2 complete (establishes escape and range tokenization pattern)
BREAKTHROUGH DISCOVERY: parseRawStringWithDelimiter() already performs sophisticated token-level parsing:
- Character-by-character processing of existing `LexerToken` objects
- State machine parsing: START → STRING → ESCAPE → END_TOKEN
- Delimiter handling with nesting levels and paired delimiters
- Escape sequence detection (already has ESCAPE state!)
- Complex infrastructure: heredoc processing, position tracking
Key Realization: Instead of creating separate tokenization layers, we can enhance the existing state machine with sublanguage-specific processing.
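The delimiter-nesting part of that state machine can be sketched in isolation. This is a simplified standalone version under stated assumptions (backslash escapes the next character; only one paired delimiter); the real `parseRawStringWithDelimiter()` also handles heredocs, token streams, and position tracking.

```java
// Simplified sketch of nested paired-delimiter extraction, e.g. q{a {b} c}.
public class DelimiterSketch {
    // Assumes src.charAt(start) == open; a backslash escapes the next character.
    public static String extractDelimited(String src, int start, char open, char close) {
        StringBuilder body = new StringBuilder();
        int depth = 1;
        for (int i = start + 1; i < src.length(); i++) {
            char ch = src.charAt(i);
            if (ch == '\\' && i + 1 < src.length()) {        // ESCAPE state
                body.append(ch).append(src.charAt(++i));
            } else if (ch == open) {                          // nested opener
                depth++;
                body.append(ch);
            } else if (ch == close) {
                if (--depth == 0) return body.toString();     // END_TOKEN
                body.append(ch);
            } else {                                          // STRING state
                body.append(ch);
            }
        }
        throw new IllegalStateException("Unterminated delimiter");
    }
}
```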
No separate tokenization needed! Enhance existing parseRawStringWithDelimiter():
// BEFORE: Complex 3-layer architecture with re-tokenization
String rawString = extractFromTokens(tokens);
SublanguageLexer lexer = new SublanguageLexer(rawString);
List<SublanguageToken> newTokens = lexer.tokenize();
SublanguageParser parser = new RegexParser(newTokens);
// AFTER: Enhanced existing method
public static ParsedString parseRawStringWithDelimiterEnhanced(
EmitterContext ctx,
List<LexerToken> tokens,
int index,
boolean redo,
Parser parser,
SublanguageType sublanguageType // NEW: specify sublanguage
) {
// Existing state machine + sublanguage-specific enhancements
switch (state) {
case ESCAPE:
if (sublanguageType == REGEX) {
processRegexEscape(ch, buffer, astBuilder);
} else if (sublanguageType == PACK) {
processPackFormat(ch, buffer, astBuilder);
}
// ... existing logic
}
}
- Massive Simplification: No separate tokenization layer needed
- Better Performance: Single-pass parsing, no re-tokenization overhead
- Leverage Existing Code: Reuse 200+ lines of proven parsing logic
- Easier Integration: Minimal changes, backward compatible
- Proven Infrastructure: Delimiter handling, escape detection already work
KEY INSIGHT: No new architecture needed! The existing parseRawString() flow is perfect:
parseRawString(parser, operator)
↓
parseRawStrings() // extracts content using parseRawStringWithDelimiter()
↓
switch (operator) {
case "m", "qr", "/", "//", "/=" -> parseRegexMatch(ctx, operator, rawStr, parser);
case "s" -> parseRegexReplace(ctx, rawStr, parser);
case "\"", "qq" -> StringDoubleQuoted.parseDoubleQuotedString(...);
// etc.
}
APPROACH: Simply enhance existing operator-specific methods with AST building.
// In parseRegexMatch() - where content is already extracted
public static Node parseRegexMatch(EmitterContext ctx, String operator, ParsedString rawStr, Parser parser) {
String regexContent = rawStr.buffers.get(0);
// NEW: Build AST from extracted content
try {
SublanguageASTNode regexAST = RegexASTBuilder.buildAST(regexContent, rawStr.sourceOffset);
// Use AST for optimized code generation and validation
return generateOptimizedRegexMatch(regexAST, operator, rawStr);
} catch (Exception e) {
// Fail compilation on syntax errors
throw new PerlCompilerException(rawStr.sourceOffset, "Invalid regex: " + e.getMessage(), ctx.errorUtil);
}
}
// Where pack() and sprintf() content is extracted
String template = extractedContent;
// NEW: Build AST and validate
try {
SublanguageASTNode ast = PackASTBuilder.buildAST(template, sourceOffset);
return generateOptimizedCode(ast);
} catch (Exception e) {
throw new PerlCompilerException(sourceOffset, "Invalid template: " + e.getMessage(), ctx.errorUtil);
}
public class RegexASTBuilder {
public static SublanguageASTNode buildAST(String content, int sourceOffset) {
// Parse regex content into AST: literals, character classes, quantifiers, groups
// Fail on syntax errors with position info
}
}
public class PackASTBuilder {
public static SublanguageASTNode buildAST(String content, int sourceOffset) {
// Parse pack template into AST: format specifiers, modifiers, repeat counts
// Fail on invalid formats with position info
}
}}
public class DoubleQuotedParser extends SublanguageParser {
// Uses common library: parseEscapeSequence(), parseCaseModifier(), parseInterpolation()
// Focuses on: string interpolation, case modification state machine
}
### **🚀 Revised Implementation Plan: In-Place AST Enhancement**
**Benefits of This Approach:**
1. **No New Architecture** - Just enhance existing methods
2. **No Switch Statements** - Operator context already handled perfectly
3. **Minimal Changes** - Target specific methods where content is extracted
4. **Backward Compatible** - Existing flow unchanged, AST building added
5. **In-Place Enhancement** - AST building happens where content is already available
**Commit 1: Create AST Builders (~300 lines)**
- Create `RegexASTBuilder.buildAST()` for regex parsing and validation
- Create `PackASTBuilder.buildAST()` for pack template parsing and validation
- Create `SprintfASTBuilder.buildAST()` for sprintf format parsing and validation
- All builders use existing `SublanguageASTNode` infrastructure
- Comprehensive error handling with position tracking
**Commit 2: Enhance Regex Methods (~200 lines)**
- Enhance `parseRegexMatch()` and `parseRegexReplace()` with AST building
- Add AST-driven code generation for optimized regex handling
- Maintain backward compatibility with fallback to existing implementation
- Fail compilation on syntax errors with proper position info
**Commit 3: Enhance Pack/Sprintf Methods (~200 lines)**
- Enhance pack() and sprintf() parsing locations with AST building
- Add AST-driven code generation for optimized template handling
- Fail compilation on syntax errors with proper position info
- Integration with existing error handling patterns
**Commit 4: Testing and Optimization (~100 lines)**
- Comprehensive tests for all AST builders and enhanced methods
- Performance benchmarks vs existing implementations
- Documentation and examples
- Optional compiler flags for enabling/disabling AST validation
## 📋 **ORIGINAL COMPREHENSIVE IMPLEMENTATION PLAN (ARCHIVED)**
### **Phase 1: Foundation Infrastructure ✅ COMPLETED**
- Base classes and interfaces implemented and verified
### **Phase 2: Three-Layer Architecture Implementation (SUPERSEDED)**
#### **Commit 1: Unified SublanguageLexer (~400 lines)**
**Assignable Work Unit**: Complete tokenization layer
- Comprehensive token types for all sublanguages
- Shared tokenization: escapes, numbers, strings, variables, whitespace
- Sublanguage-specific tokens: regex operators, pack formats, sprintf flags, case modifiers
- Position tracking and source offset mapping
- **Deliverable**: Single lexer that can tokenize any sublanguage input
#### **Commit 2: Common Parser Library (~500 lines)**
**Assignable Work Unit**: Reusable parsing components
- `parseEscapeSequence()` - handles \n, \t, \x41, \x{263A}, \o{100}, \cA, \N{NAME}
- `parseNumber()` - handles 123, 0x41, 077, ranges, counts with validation
- `parseRange()` - handles A-Z, 0-9, \0-\377 with complex validation logic
- `parseCaseModifier()` - handles \U, \L, \u, \l, \E with state machine
- `parseInterpolation()` - handles $var, @array, ${complex} variable parsing
- Shared validation methods and error handling
- **Deliverable**: Rich library of parsing components usable by all specialized parsers
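A hedged sketch of what `parseEscapeSequence()` could look like, covering only `\n`, `\t`, and the two `\x` forms; the real component would also handle `\o{...}`, `\cA`, `\N{NAME}`, validate malformed input, and emit tokens rather than strings.

```java
// Hedged sketch of parseEscapeSequence(): resolves a subset of Perl escapes.
// Assumes input.charAt(pos) == '\\' and that hex escapes are well-formed.
public class EscapeSketch {
    public static String parseEscapeSequence(String input, int pos) {
        char c = input.charAt(pos + 1);
        switch (c) {
            case 'n': return "\n";
            case 't': return "\t";
            case 'x':
                if (input.charAt(pos + 2) == '{') { // \x{263A}: braced code point
                    int end = input.indexOf('}', pos + 3);
                    int cp = Integer.parseInt(input.substring(pos + 3, end), 16);
                    return new String(Character.toChars(cp));
                }
                // \x41: exactly two hex digits in this sketch
                return String.valueOf((char) Integer.parseInt(input.substring(pos + 2, pos + 4), 16));
            default:
                return String.valueOf(c); // identity escape such as \$ or \\
        }
    }
}
```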
#### **Commit 3: RegexParser (~300 lines)**
**Assignable Work Unit**: Regex-specific parsing logic
- Uses common library for escapes and ranges
- Implements precedence parsing: | > concatenation > quantifiers
- Handles groups, anchors, character classes
- Regex-specific validation and error messages
- **Deliverable**: Complete regex parser with AST generation
#### **Commit 4: PackParser (~300 lines)**
**Assignable Work Unit**: Pack/Unpack-specific parsing logic
- Uses common library for numbers and escapes
- Left-to-right format parsing: A, C, N, v, V formats
- Modifier and count validation
- Pack-specific error messages for test compliance
- **Deliverable**: Complete pack parser addressing test failures
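For illustration, the (format, modifiers, count) structure that `PackItem` models can be demonstrated with a naive regex-based splitter. This is an assumption-laden sketch for exposition only, not the planned token-based parser, and it performs no validation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive illustration: split a pack template into format/modifier/count items.
public class PackTemplateDemo {
    // format char, then optional endianness/native modifiers, then optional count
    static final Pattern ITEM =
        Pattern.compile("([aAZbBhHcCsSiIlLnNvVqQjJfdFDpPuUwWxX@])([<>!]*)(\\d+|\\*)?");

    public static List<String> split(String template) {
        List<String> items = new ArrayList<>();
        Matcher m = ITEM.matcher(template);
        while (m.find()) items.add(m.group());
        return items;
    }
}
```

For example, `"A10i4l<*"` splits into the three items `A10`, `i4`, and `l<*`, mirroring the template from the introduction.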
#### **Commit 5: SprintfParser (~250 lines)**
**Assignable Work Unit**: Sprintf-specific parsing logic
- Uses common library for number parsing
- Format specifier parsing: %, flags, width, precision, conversion
- Argument count validation
- Sprintf-specific error messages
- **Deliverable**: Complete sprintf parser with validation
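The specifier shape this parser targets (flags, optional vector, width, precision, conversion) can be illustrated by collapsing the grammar into a single regular expression. This is a simplification for exposition, ignoring positional parameters and size modifiers; the planned parser builds an AST instead.

```java
import java.util.regex.Pattern;

// Illustrative only: %[flags][vector][width][.precision]conversion as one regex.
public class SprintfSpecDemo {
    public static final Pattern SPEC =
        Pattern.compile("%[+\\- #0]*(\\*?v)?(\\d+|\\*)?(\\.(\\d+|\\*))?[diouxXeEfFgGaAcspn%]");

    public static boolean isValidSpecifier(String s) {
        return SPEC.matcher(s).matches();
    }
}
```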
#### **Commit 6: DoubleQuotedParser (~350 lines)**
**Assignable Work Unit**: Double-quoted string parsing logic
- Uses common library for escapes, case modifiers, interpolation
- String interpolation state machine
- Case modification conflict detection
- \Q...\E quotemeta handling
- **Deliverable**: Complete double-quoted string parser
### **Phase 3: Integration and Migration**
#### **Commit 7-10: Integration Layer (~200 lines each)**
**Assignable Work Units**: Backward-compatible integration
- StringDoubleQuoted integration with fallback
- RegexPreprocessor integration with fallback
- Pack/Unpack integration with fallback
- SprintfOperator integration with fallback
- **Deliverable**: New parsers integrated with existing code, zero breaking changes
#### **Commit 11+: Gradual Migration (~100-300 lines each)**
**Assignable Work Units**: Method-by-method replacement
- Replace individual methods with AST-based implementations
- Maintain string-based APIs for backward compatibility
- Performance testing and validation
- **Deliverable**: Incremental migration with continuous validation
## ✅ **Work Splitting Benefits**
### **Parallel Development Possible:**
- **Team Member A**: Unified lexer (Commit 1)
- **Team Member B**: Common parser library (Commit 2)
- **Team Member C**: RegexParser (Commit 3) - can start after Commits 1-2
- **Team Member D**: PackParser (Commit 4) - can start after Commits 1-2
- **Team Member E**: Integration work (Commits 7-10)
### **Clear Dependencies:**
- Commits 1-2 must complete before 3-6
- Commits 3-6 can be developed in parallel
- Commits 7+ can start once corresponding parser is complete
### **Measurable Progress:**
- Each commit is 200-500 lines with clear deliverables
- Each commit compiles and passes tests independently
- Progress can be tracked and validated incrementally
This comprehensive plan enables **efficient parallel development** while maintaining **architectural integrity** and **backward compatibility**! 🎉
## ⚠️ **COMPREHENSIVE RISK MITIGATION PLAN**
### **Critical Risk Assessment**
After analyzing existing error/warning systems, we identified **sophisticated error handling complexity**:
- **Sprintf**: 4 error types, overlapping specifiers, positional parameters, vector formats
- **Pack**: Runtime exceptions, Unicode state tracking, endianness conflicts
- **Regex**: 52+ error cases, Java compatibility transformations
- **All sublanguages**: Complex validation rules and edge cases
### **Risk Mitigation Strategies**
#### **Commit -1: Risk Mitigation Phase (~300 lines)**
**MANDATORY before implementation begins**
1. **Enhanced Error Compatibility**
```java
// Extend SublanguageValidationResult to support existing error types
public enum ErrorBehavior {
APPEND_INVALID, // sprintf INVALID_APPEND_ERROR
NO_APPEND, // sprintf INVALID_NO_APPEND
RUNTIME_EXCEPTION, // pack runtime exceptions
WARNING_ONLY // sprintf overlapping specifiers
}
```

2. **Error Compatibility Layer**

```java
// Map new error system to existing error messages and behavior
public class ErrorCompatibilityMapper {
    public static String mapToExistingError(SublanguageError error);
    public static boolean shouldAppendInvalid(SublanguageError error);
    public static boolean shouldThrowException(SublanguageError error);
}
```
3. **Performance Baseline Framework**

```java
// Measure current parsing performance before migration
public class ParsingBenchmark {
    public static void benchmarkSprintfParsing();
    public static void benchmarkPackParsing();
    public static void benchmarkRegexParsing();
}
```
4. **Comprehensive Test Catalog**
   - Document all existing edge cases from current parsers
   - Create test migration checklist
   - Regression test framework
5. **Feature Flag System**

```java
// Gradual rollout with monitoring
public class SublanguageFeatureFlags {
    public static boolean useNewSprintfParser();
    public static boolean useNewPackParser();
    public static boolean useNewRegexParser();
}
```
- Sprintf Overlapping Specifiers
  - Complex logic for detecting overlapping format specifiers
  - Warning generation without breaking functionality
  - Position tracking across overlaps
- Pack Unicode State Tracking
  - `byteMode` vs `hasUnicodeInNormalMode` state management
  - UTF-8 encoding error handling
  - Unicode code point validation
- Regex Java Compatibility
  - Octal sequence transformation (`\120` → `\0120`)
  - Character class mapping (`[:ascii:]` → `\p{ASCII}`)
  - Inline comment removal (`(?#...)`)
  - `\G` anchor handling
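One of these transformations, POSIX class mapping, can be sketched with naive string replacement. This is an illustration only: the real preprocessor would operate on the token stream or AST so that escaped or quoted text is not rewritten blindly.

```java
// Naive illustration of mapping POSIX character classes to Java \p{...} classes.
public class RegexCompatSketch {
    public static String mapPosixClasses(String perlRegex) {
        return perlRegex
            .replace("[:ascii:]", "\\p{ASCII}")
            .replace("[:alpha:]", "\\p{Alpha}")
            .replace("[:digit:]", "\\p{Digit}");
    }
}
```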
- Performance Gate: No regression > 10% in parsing speed
- Error Compatibility Gate: All existing error messages preserved
- Test Coverage Gate: 100% of existing edge cases covered
- Integration Gate: Fallback mechanism works correctly
- Feature flags can disable new parsers immediately
- Monitoring alerts for error rate increases
- Automated rollback if performance degrades
- Clear rollback procedures documented
- ✅ Zero breaking changes to existing functionality
- ✅ All existing error messages preserved with same behavior
- ✅ Performance maintained or improved
- ✅ 100% test coverage of existing edge cases
- ✅ Gradual rollout with monitoring and rollback capability
This risk mitigation ensures our architecture can handle the full complexity of existing error/warning systems while providing a safe migration path.
- AST generation for transformation
- Replace current regex preprocessing gradually
- Update `RegexPreprocessor.java` to use the new parser
- Maintain backward compatibility during transition
Verification for Phase 2 Completion:
- Regex parsing happens during compilation
- Debug output shows parser methods being called
- Existing regex tests still pass
Prerequisites: Phase 1 must be complete
- **Implement `PackLexer` and `PackParser`**
  - Must extend base classes from Phase 1
  - Integration through `StringParser` methods
- **Add semantic validation for modifier combinations**
  - Proper error reporting for invalid combinations
- **Integrate with existing pack/unpack operators**
  - Update `Pack.java` and `Unpack.java` to use new parser
  - Call parsing during compilation, not runtime
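As a sketch of what `PackLexer` tokenization might produce, a template like `"A10i4l<*"` can be split into (format letter, modifiers, count) triples. The regex, class name, and output shape here are illustrative assumptions, not the actual PerlOnJava API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of pack-template tokenization into
// (format letter, endianness/size modifiers, repeat count) triples.
public final class PackTemplateSketch {
    private static final Pattern ITEM =
        Pattern.compile("([aAZbBhHcCsSlLqQjJnNvViIwWuUxX@.fdFDp])([<>!]*)(\\d+|\\*)?");

    public static List<String> tokenize(String template) {
        List<String> items = new ArrayList<>();
        Matcher m = ITEM.matcher(template);
        while (m.find()) {
            String count = m.group(3) == null ? "1" : m.group(3);
            items.add(m.group(1) + m.group(2) + ":" + count);
        }
        return items;
    }
}
```

A real lexer would additionally validate modifier combinations (e.g. reject `l<>`) and report positions, per the semantic-validation step above.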
Prerequisites: Phase 1 must be complete
- **Implement `SprintfLexer` and `SprintfParser`**
  - Must extend base classes from Phase 1
  - Integration through `StringParser` methods
- **Add format specification validation**
  - Semantic validation of format specifiers
- **Replace current sprintf parsing**
  - Update `SprintfOperator.java` to use compile-time parsing
  - Remove runtime parsing logic
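A sketch of how a `SprintfLexer` might recognize a specifier such as `%*.*f`, flagging `*` width/precision as dynamic at compile time. The regex is a simplified assumption that ignores Perl's vector flag and argument indexes; validating the conversion letter itself would be a separate semantic step:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: split a sprintf specifier into flags, width,
// precision, and conversion; '*' marks a value taken from the argument list.
public final class SprintfSpecSketch {
    private static final Pattern SPEC =
        Pattern.compile("%([-+ #0]*)(\\d+|\\*)?(?:\\.(\\d+|\\*))?([a-zA-Z%])");

    public static String describe(String spec) {
        Matcher m = SPEC.matcher(spec);
        if (!m.matches()) return "syntax error";
        boolean dynamic = "*".equals(m.group(2)) || "*".equals(m.group(3));
        return m.group(4) + (dynamic ? " (dynamic width/precision)" : "");
    }
}
```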
Prerequisites: Phase 1 must be complete
- **Implement `TrLexer` and `TrParser`**
  - Must extend base classes from Phase 1
  - Integration through `StringParser` methods
- **Add pattern expansion and validation**
  - Handle escape sequences, ranges, character classes
  - Proper octal/hex parsing (learned from failed implementation)
- **Replace current transliteration parsing**
  - Update `RuntimeTransliterate.java` to use compile-time parsing
  - Remove runtime parsing logic
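The range handling above can be sketched as a small expansion pass. This hypothetical helper assumes escape sequences (`\x41`, `\101`) were already decoded in an earlier pass, and rejects reversed ranges the way the syntax-error category below requires:

```java
// Hypothetical sketch of tr/// range expansion, e.g. "a-e" -> "abcde".
// Assumes escapes were decoded earlier; reversed ranges are syntax errors.
public final class TrRangeSketch {
    public static String expandRanges(String spec) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < spec.length(); i++) {
            char c = spec.charAt(i);
            if (i + 2 < spec.length() && spec.charAt(i + 1) == '-') {
                char end = spec.charAt(i + 2);
                if (end < c) {
                    throw new IllegalArgumentException(
                        "Invalid range '" + c + "-" + end + "'");
                }
                for (char x = c; x <= end; x++) out.append(x);
                i += 2;  // consume "-end"
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```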
- Heredoc enhancement
- Other domain-specific languages as needed
The sublanguage parser architecture supports two distinct types of errors to handle PerlOnJava's incremental implementation approach:
- Definition: User mistakes - invalid syntax, malformed patterns
- Examples:
  - Pack: `"l<>4"` (invalid endianness syntax)
  - Regex: `"/[/"` (unmatched bracket)
  - Sprintf: `"%q"` (invalid conversion specifier)
  - Transliteration: `"tr/a-z-0-9//"` (invalid range)
- Handling: Should fail compilation immediately
- User Experience: Clear error message pointing to the syntax mistake
- Definition: Valid Perl syntax but not yet supported in PerlOnJava
- Examples:
  - Pack: `"w"` format (BER compressed integer - valid Perl, not implemented)
  - Regex: `"(?{code})"` (code execution in regex - valid Perl, not implemented)
  - Sprintf: `"%v"` vector flag (valid Perl, partially implemented)
- Handling: May fall back to existing behavior or provide helpful "not yet implemented" message
- User Experience: Informative message about what's not supported yet
- Clear User Feedback: Users know if they made a mistake vs. hit a limitation
- Incremental Implementation: Unimplemented features can be added gradually
- Graceful Degradation: Valid Perl can fall back to existing behavior
- Better Testing: Can distinguish between bugs and missing features
- Development Prioritization: Unimplemented errors guide feature development
```java
// Syntax error - user mistake
return SublanguageValidationResult.syntaxError("Invalid range 'z-a' in pack template", position);

// Unimplemented error - valid Perl, not supported yet
return SublanguageValidationResult.unimplementedError("BER compressed integer format 'w' not yet implemented");

// Check error types
if (result.hasSyntaxErrors()) {
    // Fail compilation immediately
    throw new PerlCompilerException(result.getFirstError());
} else if (result.hasUnimplementedErrors()) {
    // Log warning and fall back to existing behavior
    System.err.println("Warning: " + result.getFirstError());
    return fallbackBehavior();
}
```

❌ **Skipped Phase 1 Infrastructure**
- We jumped straight to implementing `TrLexer`, `TrParser`, `SprintfLexer`, etc.
- We created individual parsers without the base `SublanguageLexer`/`SublanguageParser` classes
- Result: Inconsistent, isolated parsers that weren't integrated with the main system
❌ **Wrong Execution Model**

- We put parsing logic in runtime operator classes (`RuntimeTransliterate.compile()`)
- We tried to parse patterns at runtime instead of at compile time
- Result: Parsers were never called because they weren't in the compilation pipeline
❌ **No StringParser Integration**

- We never added integration points to `StringParser.java`
- Our parsers existed in isolation, with no connection to the main parser
- Result: No way for the main compilation process to invoke our sublanguage parsers
❌ **Missing AST Generation**
- We did direct string manipulation instead of generating ASTs
- We didn't follow the plan's AST → transformation → output model
- Result: No semantic understanding, just token shuffling
- Debug output never appears - because methods aren't called during compilation
- Runtime parsing fails - because the execution model is wrong
- Isolated functionality - parsers work in isolation but not integrated
- No error integration - custom error handling instead of `PerlCompilerException`
🔑 **Phase 1 is Not Optional**

The base infrastructure (Phase 1) is not just a "nice to have" - it's the foundation that makes everything else work. Without it:
- Individual parsers have no way to integrate with the main system
- There's no consistent pattern for tokenization and parsing
- Error handling is fragmented and inconsistent
🔑 **Integration is Everything**

The most important part of the architecture is the integration with `StringParser.java`. This is where:
- Sublanguage parsing gets triggered during compilation
- The main parser delegates to specialized sublanguage parsers
- ASTs get generated and transformed into usable output
🔑 **Compile-Time vs Runtime**

Sublanguage parsing must happen during compilation, not at runtime:
- Compile-time: Patterns are parsed, validated, and transformed into efficient runtime structures
- Runtime: Pre-compiled structures are used directly without parsing overhead
- Our failed approach tried to parse at runtime, which is the wrong model
- Phase 1 Verification: Can call sublanguage parsing from `StringParser.java` and see debug output
- Compile-Time Parsing: Sublanguage parsing happens when code is compiled, not when it runs
- AST Generation: Parsers produce ASTs that get transformed, not direct string manipulation
- Error Integration: Sublanguage errors use `PerlCompilerException` with proper location info
- Consistent Patterns: All sublanguage parsers extend the same base classes and follow the same patterns
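As a concrete anchor for that shared pattern, the Phase 1 base classes might look like this minimal sketch. Only the class names `SublanguageLexer` and `SublanguageToken` come from this document; every field and method name below is an assumption:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Phase 1 base infrastructure that every
// sublanguage lexer is expected to build on.
final class SublanguageToken {
    final String type;
    final String text;
    final int position;  // offset into the source pattern, for error reporting

    SublanguageToken(String type, String text, int position) {
        this.type = type;
        this.text = text;
        this.position = position;
    }
}

abstract class SublanguageLexer {
    protected final String input;
    protected int pos = 0;

    protected SublanguageLexer(String input) { this.input = input; }

    /** Each sublanguage supplies its own tokenization rules. */
    protected abstract SublanguageToken nextToken();

    /** Shared driver: every lexer produces a uniform token stream. */
    public final List<SublanguageToken> tokenize() {
        List<SublanguageToken> tokens = new ArrayList<>();
        while (pos < input.length()) {
            tokens.add(nextToken());
        }
        return tokens;
    }
}
```

A `PackLexer` or `SprintfLexer` would then only override `nextToken()`, while position tracking and the token-stream contract stay identical across all sublanguages.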
This approach leverages PerlOnJava's existing, proven architecture while providing the benefits of proper sublanguage parsing:
- Eliminates "cheating" with semantic understanding
- Maintains performance without external dependencies
- Provides consistency across all sublanguages
- Enables incremental adoption with minimal risk
- Follows established patterns familiar to the team
The result is a clean, maintainable architecture that eliminates the parsing issues we've identified while staying true to PerlOnJava's design principles.