This document defines the comprehensive sublanguage parser architecture for PerlOnJava, designed to provide proper semantic parsing for domain-specific languages embedded within Perl using a unified lexer→parser→AST framework.
Perl contains multiple "sublanguages" - specialized syntaxes within string literals that have their own complex parsing rules:
- Regex patterns: `/(?:pattern){3,5}/flags` - complex syntax with quantifiers, groups, and escapes
- Pack/unpack templates: `pack("A10i4l<*", ...)` - binary data formatting with endianness and counts
- Sprintf formats: `sprintf("%*.*f", width, precision, value)` - printf-style formatting with dynamic width
- Transliteration: `tr/\x00-\xFF/a-z/` - character mapping with ranges and escapes
- Double-quoted strings: `"Hello \x41\n$var"` - interpolation with escapes and variables
- Manual String Manipulation: Current parsing uses character-by-character loops
  - Result: Parsing bugs, poor maintainability, no semantic understanding
- Runtime Parsing Overhead: Patterns are parsed every time code executes
  - Result: Performance penalty, especially in loops
- Inconsistent Error Handling: Each sublanguage has different error reporting
  - Result: Poor developer experience, inconsistent error messages
- No Semantic Understanding: Direct string manipulation without an AST
  - Result: Cannot validate semantics, transform patterns, or optimize
Core Principle: All sublanguages follow the same lexer→parser→AST pattern for consistency, maintainability, and semantic understanding.
Architecture Flow:
Input String → SublanguageLexer → SublanguageToken[] → SublanguageParser → SublanguageASTNode → Validation/Transformation
Why This Approach:
- Compile-Time Parsing: Parse once during compilation, use optimized structures at runtime
- Semantic Understanding: AST enables validation, transformation, and optimization
- Consistent Error Reporting: All sublanguages use same position tracking and error formatting
- Maintainable Code: Unified patterns across all sublanguages
- Extensible: Easy to add new sublanguages following the same pattern
RULE: Every sublanguage MUST follow this exact flow:
// CORRECT: Follow the architecture
String input = "\\x41\\n";
SublanguageLexer lexer = new RegexLexer(input, sourceOffset);
List<SublanguageToken> tokens = lexer.tokenize();
SublanguageParser parser = new RegexParser(tokens);
SublanguageASTNode ast = parser.parse();
// WRONG: Utility functions that bypass architecture
String result = EscapeSequenceParser.parseEscape(input, 0); // ❌ VIOLATES ARCHITECTURE
Why: Bypassing the lexer→parser→AST flow breaks consistency, error reporting, and semantic understanding.
RULE: Reusable components MUST work within the lexer→parser→AST framework.
// CORRECT: Reusable components within architecture
public class RegexLexer extends SublanguageLexer {
protected boolean tokenizeEscape() {
// Tokenize escape sequence into SublanguageToken
// This can be shared across sublanguage lexers
return tokenizeCommonEscape(); // ✅ FOLLOWS ARCHITECTURE
}
}
// WRONG: Utility functions outside architecture
public class EscapeUtils {
public static String parseEscape(String input, int pos) { // ❌ BYPASSES ARCHITECTURE
// Direct string manipulation
}
}
Why: Utility shortcuts bypass position tracking, error handling, and AST generation.
RULE: All parsing MUST produce SublanguageASTNode objects for semantic analysis.
// CORRECT: AST-first design
SublanguageASTNode escapeNode = new SublanguageASTNode("escape", "\\n");
escapeNode.addAnnotation("resolvedValue", "\n");
escapeNode.addAnnotation("escapeType", "newline");
// WRONG: Direct string results
String result = "\\n"; // ❌ NO SEMANTIC INFORMATION
Why: AST enables semantic validation, transformation, and optimization that strings cannot provide.
- Lexer/Token System:
  - `LexerToken` with `type` and `text`
  - `LexerTokenType` enum for categorization
  - `Lexer.java` for tokenization
- Parser Infrastructure:
  - `Parser.java` base class with `EmitterContext`
  - Specialized parsers: `StringParser`, `NumberParser`, `IdentifierParser`
  - `TokenUtils` for token manipulation
  - Consistent error handling with `PerlCompilerException`
- AST Generation:
  - `Node` hierarchy for AST representation
  - Context-aware compilation with `EmitterContext`
- Inconsistent Architecture: Each sublanguage uses a different parsing approach
  - Why this matters: Developers must learn different patterns for each sublanguage
  - Impact: Higher maintenance cost, more bugs, inconsistent user experience
- Manual String Manipulation: Direct character-by-character parsing
  - Why this matters: Error-prone, hard to extend, no semantic understanding
  - Impact: Parsing bugs, poor error messages, difficulty adding new features
- Poor Error Reporting: Limited location information
  - Why this matters: Users can't easily find and fix their mistakes
  - Impact: Poor developer experience, harder debugging
- Runtime Parsing Overhead: Parsing happens every time code runs
  - Why this matters: Performance penalty for repeated pattern parsing
  - Impact: Slower execution, especially for patterns in loops
Design Philosophy:
- Compile-Time Parsing: Parse sublanguages during compilation, not at runtime
  - Why: Better performance, early error detection, semantic understanding
  - How: Integration points in `StringParser.java` call sublanguage parsers
- Generic AST Approach: Single `SublanguageASTNode` class with annotations
  - Why: Simpler than many specific subclasses, easier to extend
  - How: Node type and annotations describe structure, children hold sub-nodes
- Two-Error-Type System: Distinguish user mistakes from unimplemented features
  - Why: PerlOnJava implements Perl incrementally and needs graceful degradation
  - How: `SYNTAX_ERROR` (user mistake) vs `UNIMPLEMENTED_ERROR` (valid Perl, not supported yet)
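The two-error-type distinction can be sketched as follows. This is an illustrative standalone sketch; the class and field names below are assumptions, not the actual `SublanguageValidationResult` API.

```java
// Hedged sketch of the two-error-type system. The real SublanguageValidationResult
// API may differ; names here are illustrative assumptions.
public class TwoErrorTypeSketch {
    public enum ErrorType { SYNTAX_ERROR, UNIMPLEMENTED_ERROR }

    public static final class ValidationResult {
        public final boolean valid;
        public final ErrorType errorType; // null when valid
        public final String message;

        private ValidationResult(boolean valid, ErrorType errorType, String message) {
            this.valid = valid;
            this.errorType = errorType;
            this.message = message;
        }

        public static ValidationResult ok() { return new ValidationResult(true, null, null); }
        public static ValidationResult syntaxError(String msg) {
            return new ValidationResult(false, ErrorType.SYNTAX_ERROR, msg);
        }
        public static ValidationResult unimplemented(String msg) {
            return new ValidationResult(false, ErrorType.UNIMPLEMENTED_ERROR, msg);
        }
    }

    // Example classification for pack formats: 'w' is valid Perl (BER compressed
    // integer) but not yet implemented; '~' is not a pack format at all.
    public static ValidationResult validatePackFormat(char f) {
        if ("aAZbBhHcCsSiIlLnNvVqQjJfdFDpPuUxX@".indexOf(f) >= 0) return ValidationResult.ok();
        if (f == 'w' || f == 'W') return ValidationResult.unimplemented("pack format '" + f + "' not supported yet");
        return ValidationResult.syntaxError("Invalid type '" + f + "' in pack");
    }
}
```

A caller can then fail compilation on `SYNTAX_ERROR` while degrading gracefully (e.g., warning or falling back) on `UNIMPLEMENTED_ERROR`.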
// Comprehensive token types for all sublanguages (IMPLEMENTED)
public enum SublanguageTokenType {
// Common types
LITERAL, ESCAPE, DELIMITER, MODIFIER, COMMENT, WHITESPACE,
// Regex-specific
REGEX_ANCHOR, REGEX_QUANTIFIER, REGEX_GROUP, REGEX_CLASS,
// Pack-specific
PACK_FORMAT, PACK_COUNT, PACK_ENDIAN,
// Sprintf-specific
SPRINTF_FLAG, SPRINTF_WIDTH, SPRINTF_PRECISION, SPRINTF_CONVERSION,
// Transliteration-specific
TR_RANGE, TR_CLASS, TR_MODIFIER
}
// Base sublanguage token
public class SublanguageToken extends LexerToken {
public final SublanguageTokenType sublanguageType;
public final int position;
public final int length;
public SublanguageToken(SublanguageTokenType type, String text, int position, int length) {
super(LexerTokenType.IDENTIFIER, text); // Map to base type
this.sublanguageType = type;
this.position = position;
this.length = length;
}
}
// Base sublanguage lexer
public abstract class SublanguageLexer {
protected final String input;
protected int pos = 0;
protected final List<SublanguageToken> tokens = new ArrayList<>();
public SublanguageLexer(String input) {
this.input = input;
}
public List<SublanguageToken> tokenize() {
while (pos < input.length()) {
if (!tokenizeNext()) {
throw new PerlCompilerException("Unexpected character at position " + pos);
}
}
return tokens;
}
protected abstract boolean tokenizeNext();
protected void addToken(SublanguageTokenType type, String text) {
tokens.add(new SublanguageToken(type, text, pos - text.length(), text.length()));
}
protected char peek() { return pos < input.length() ? input.charAt(pos) : '\0'; }
protected char advance() { return pos < input.length() ? input.charAt(pos++) : '\0'; }
protected boolean match(String expected) {
if (input.startsWith(expected, pos)) {
pos += expected.length();
return true;
}
return false;
}
}
// Base sublanguage parser
public abstract class SublanguageParser<T> {
protected final List<SublanguageToken> tokens;
protected int tokenIndex = 0;
protected final EmitterContext ctx;
public SublanguageParser(List<SublanguageToken> tokens, EmitterContext ctx) {
this.tokens = tokens;
this.ctx = ctx;
}
public abstract T parse();
protected SublanguageToken peek() {
return tokenIndex < tokens.size() ? tokens.get(tokenIndex) : null;
}
protected SublanguageToken advance() {
return tokenIndex < tokens.size() ? tokens.get(tokenIndex++) : null;
}
protected boolean match(SublanguageTokenType type) {
SublanguageToken token = peek();
if (token != null && token.sublanguageType == type) {
advance();
return true;
}
return false;
}
protected void error(String message) {
SublanguageToken token = peek();
int position;
if (token != null) {
position = token.position;
} else if (!tokens.isEmpty()) {
SublanguageToken last = tokens.get(tokens.size() - 1);
position = last.position + last.length; // past end of input
} else {
position = 0;
}
throw new PerlCompilerException(message + " at position " + position);
}
}
// Regex-specific lexer
public class RegexLexer extends SublanguageLexer {
public RegexLexer(String input) {
super(input);
}
@Override
protected boolean tokenizeNext() {
char c = peek();
switch (c) {
case '\\':
return tokenizeEscape();
case '[':
return tokenizeCharacterClass();
case '(':
return tokenizeGroup();
case '*':
case '+':
case '?':
return tokenizeQuantifier();
case '^':
case '$':
return tokenizeAnchor();
default:
return tokenizeLiteral();
}
}
private boolean tokenizeEscape() {
int start = pos;
advance(); // consume '\'
char next = peek();
if (next == 'g') {
advance();
if (peek() == '{') {
// \g{...} backreference
while (peek() != '}' && peek() != '\0') advance();
if (peek() == '}') advance();
} else if (Character.isDigit(peek()) || peek() == '-') {
// \g1 or \g-1
if (peek() == '-') advance();
while (Character.isDigit(peek())) advance();
}
} else if (Character.isDigit(next)) {
// \1, \2, etc.
while (Character.isDigit(peek())) advance();
} else {
advance(); // single character escape
}
addToken(SublanguageTokenType.ESCAPE, input.substring(start, pos));
return true;
}
// ... other tokenize methods
}
// Regex-specific parser
public class RegexParser extends SublanguageParser<RegexAST> {
public RegexParser(List<SublanguageToken> tokens, EmitterContext ctx) {
super(tokens, ctx);
}
@Override
public RegexAST parse() {
return parseAlternation();
}
private RegexAST parseAlternation() {
RegexAST left = parseSequence();
// Check the '|' delimiter before consuming it; match() alone would
// advance past the token we still need to inspect.
while (peek() != null && peek().sublanguageType == SublanguageTokenType.DELIMITER
&& peek().text.equals("|")) {
advance(); // consume '|'
RegexAST right = parseSequence();
left = new AlternationNode(left, right);
}
return left;
}
private RegexAST parseSequence() {
List<RegexAST> elements = new ArrayList<>();
while (peek() != null && !peek().text.equals("|")) {
elements.add(parseElement());
}
return new SequenceNode(elements);
}
// ... other parse methods with proper semantic validation
}
// Pack-specific lexer
public class PackLexer extends SublanguageLexer {
public PackLexer(String input) {
super(input);
}
@Override
protected boolean tokenizeNext() {
char c = peek();
if (isPackFormat(c)) {
advance();
addToken(SublanguageTokenType.PACK_FORMAT, String.valueOf(c));
return true;
} else if (c == '<' || c == '>') {
advance();
addToken(SublanguageTokenType.PACK_ENDIAN, String.valueOf(c));
return true;
} else if (c == '!') {
advance();
addToken(SublanguageTokenType.MODIFIER, "!");
return true;
} else if (Character.isDigit(c) || c == '*') {
return tokenizeCount();
} else if (c == '[') {
return tokenizeBracketExpression();
} else if (c == '(') {
advance();
addToken(SublanguageTokenType.DELIMITER, "(");
return true;
}
// ... handle other cases
return false;
}
private boolean isPackFormat(char c) {
return "aAZbBhHcCsSiIlLnNvVqQjJfdFDpPuUwWxX@".indexOf(c) >= 0;
}
// ... other methods
}
// Pack-specific parser with semantic validation
public class PackParser extends SublanguageParser<PackTemplate> {
public PackParser(List<SublanguageToken> tokens, EmitterContext ctx) {
super(tokens, ctx);
}
@Override
public PackTemplate parse() {
List<PackItem> items = new ArrayList<>();
while (peek() != null) {
items.add(parsePackItem());
}
return new PackTemplate(items);
}
private PackItem parsePackItem() {
// Parse format character
if (!match(SublanguageTokenType.PACK_FORMAT)) {
error("Expected pack format character");
}
String format = tokens.get(tokenIndex - 1).text;
// Parse modifiers
List<String> modifiers = new ArrayList<>();
while (match(SublanguageTokenType.PACK_ENDIAN) ||
match(SublanguageTokenType.MODIFIER)) {
modifiers.add(tokens.get(tokenIndex - 1).text);
}
// Validate modifier combinations
validateModifiers(format, modifiers);
// Parse count
PackCount count = parseCount();
return new PackItem(format, modifiers, count);
}
private void validateModifiers(String format, List<String> modifiers) {
boolean hasLittleEndian = modifiers.contains("<");
boolean hasBigEndian = modifiers.contains(">");
if (hasLittleEndian && hasBigEndian) {
error("Cannot use both '<' and '>' modifiers");
}
// Format-specific validation
if ("aAZ".contains(format) && (hasLittleEndian || hasBigEndian)) {
error("Endianness modifiers not valid for string formats");
}
}
}
// Sprintf-specific lexer
public class SprintfLexer extends SublanguageLexer {
public SprintfLexer(String input) {
super(input);
}
@Override
protected boolean tokenizeNext() {
char c = peek();
if (c == '%') {
return tokenizeSpecifier();
} else {
return tokenizeLiteral();
}
}
private boolean tokenizeSpecifier() {
int start = pos;
advance(); // consume '%'
// Parse flags
while (pos < input.length() && "+-# 0".indexOf(peek()) >= 0) {
advance();
}
// Parse vector flag: "v" or "*v" (precedes width, e.g. %vd, %*vd)
if (peek() == 'v') {
advance();
} else if (peek() == '*' && pos + 1 < input.length() && input.charAt(pos + 1) == 'v') {
advance();
advance();
}
// Parse width
while (Character.isDigit(peek()) || peek() == '*') {
advance();
}
// Parse precision
if (peek() == '.') {
advance();
while (Character.isDigit(peek()) || peek() == '*') {
advance();
}
}
// Parse conversion
if (pos < input.length() && "diouxXeEfFgGaAcspn%".indexOf(peek()) >= 0) {
advance();
}
addToken(SublanguageTokenType.SPRINTF_CONVERSION, input.substring(start, pos));
return true;
}
}
// Add to StringParser.java
public static class SublanguageParseResult {
public final Object ast;
public final int endIndex;
public final String processedString;
public SublanguageParseResult(Object ast, int endIndex, String processedString) {
this.ast = ast;
this.endIndex = endIndex;
this.processedString = processedString;
}
}
public static SublanguageParseResult parseRegex(EmitterContext ctx, String regex, RegexFlags flags) {
RegexLexer lexer = new RegexLexer(regex);
List<SublanguageToken> tokens = lexer.tokenize();
RegexParser parser = new RegexParser(tokens, ctx);
RegexAST ast = parser.parse();
// Transform AST to Java-compatible string
RegexTransformer transformer = new RegexTransformer(flags);
String javaRegex = transformer.transform(ast);
return new SublanguageParseResult(ast, regex.length(), javaRegex);
}
// Enhanced error reporting using existing patterns
public class SublanguageError extends PerlCompilerException {
private final String input;
private final int position;
private final String sublanguage;
public SublanguageError(String message, String input, int position, String sublanguage) {
super(formatError(message, input, position, sublanguage));
this.input = input;
this.position = position;
this.sublanguage = sublanguage;
}
private static String formatError(String message, String input, int position, String sublanguage) {
String before = input.substring(0, Math.min(position, input.length()));
String after = input.substring(Math.min(position, input.length()));
return String.format("%s error: %s in %s pattern; marked by <-- HERE in /%s <-- HERE %s/",
sublanguage, message, sublanguage.toLowerCase(), before, after);
}
}
- Uses established `LexerToken`/`Parser` patterns
- Integrates with `EmitterContext` and error handling
- Follows existing code organization and naming conventions
- No external dependencies (ANTLR overhead)
- Optimized for PerlOnJava's specific needs
- Reuses existing infrastructure
- Familiar patterns for the development team
- Consistent error handling and reporting
- Easy to extend and modify
- Can be implemented one sublanguage at a time
- Maintains backward compatibility
- Allows gradual migration from current implementations
The phases below implement sublanguages following the mandatory lexer→parser→AST architecture. Violating these principles will result in:
- Broken architecture - Utility shortcuts bypass lexer/parser/AST flow
- No semantic understanding - Direct string manipulation loses AST benefits
- Inconsistent error handling - Position tracking and validation broken
- Maintenance nightmare - Code scattered across utility classes instead of unified architecture
Status: Base architecture implemented and verified
What Was Built:
- `SublanguageLexer` - Base class with tokenization patterns and position tracking
- `SublanguageParser` - Base class returning `SublanguageASTNode` objects
- `SublanguageToken` - Position-aware tokens with source offset mapping
- `SublanguageASTNode` - Generic AST nodes with annotations for all sublanguages
- `SublanguageValidationResult` - Two-error-type system (SYNTAX_ERROR vs UNIMPLEMENTED_ERROR)
Architecture Verification:
- ✅ All base classes follow lexer→parser→AST flow
- ✅ No utility shortcuts that bypass architecture
- ✅ AST-first design with semantic annotations
- ✅ Consistent error handling with position tracking
Priority: HIGH - Foundation for other sublanguages
Prerequisites: Phase 1 complete
Why Double-Quoted Strings First:
- Simpler sublanguage to establish the pattern
- Contains common elements (escapes, interpolation) used by other sublanguages
- Validates the architecture before more complex sublanguages
Implementation:
- `DoubleQuotedLexer extends SublanguageLexer`
  // Tokenize: "Hello \x41\n$var"
  // → [LITERAL:"Hello ", ESCAPE:"\x41", ESCAPE:"\n", VARIABLE:"$var"]
- `DoubleQuotedParser extends SublanguageParser`
  // Parse tokens → SublanguageASTNode tree
  // Enable semantic validation of escapes and interpolation
- Integration via static method
  // In DoubleQuotedParser.java
  public static SublanguageValidationResult validateString(String content, int sourceOffset)
Verification:
- Lexer produces proper `SublanguageToken` objects with position tracking
- Parser produces a `SublanguageASTNode` AST (not strings)
- Integration works via static method (no StringParser pollution)
- Escape sequences tokenized properly (foundation for other sublanguages)
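As a standalone illustration of the tokenization sketched above, the following simplified lexer produces the same token stream for `"Hello \x41\n$var"`. It is a sketch only: tokens are plain `"TYPE:text"` strings, whereas the real `DoubleQuotedLexer` would emit `SublanguageToken` objects with position tracking and handle many more escape forms.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified standalone sketch of double-quoted string tokenization.
public class DoubleQuotedLexerSketch {
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder literal = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == '\\' && pos + 1 < input.length()) {
                flush(tokens, literal);
                int start = pos;
                pos += 2; // consume backslash and the escape character
                if (input.charAt(start + 1) == 'x') { // \x41-style hex escape
                    while (pos < input.length() && Character.digit(input.charAt(pos), 16) >= 0) pos++;
                }
                tokens.add("ESCAPE:" + input.substring(start, pos));
            } else if (c == '$') { // simple scalar interpolation: $name
                flush(tokens, literal);
                int start = pos++;
                while (pos < input.length() && Character.isLetterOrDigit(input.charAt(pos))) pos++;
                tokens.add("VARIABLE:" + input.substring(start, pos));
            } else {
                literal.append(c);
                pos++;
            }
        }
        flush(tokens, literal);
        return tokens;
    }

    private static void flush(List<String> tokens, StringBuilder literal) {
        if (literal.length() > 0) {
            tokens.add("LITERAL:" + literal);
            literal.setLength(0);
        }
    }
}
```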
Prerequisites: Phase 2 complete (establishes escape handling pattern)
Implementation follows same lexer→parser→AST pattern:
- RegexLexer extends SublanguageLexer - Tokenize regex syntax
- RegexParser extends SublanguageParser - Build AST for semantic validation
- Static integration method: `RegexParser.validatePattern()`
Priority: HIGH - Serious test failures to resolve
Prerequisites: Phase 2 complete (establishes number/count tokenization pattern)
Prerequisites: Phase 2 complete (establishes format tokenization pattern)
Prerequisites: Phase 2 complete (establishes escape and range tokenization pattern)
BREAKTHROUGH DISCOVERY: parseRawStringWithDelimiter() already performs sophisticated token-level parsing:
- Character-by-character processing of existing `LexerToken` objects
- State machine parsing: START → STRING → ESCAPE → END_TOKEN
- Delimiter handling with nesting levels and paired delimiters
- Escape sequence detection (already has ESCAPE state!)
- Complex infrastructure: heredoc processing, position tracking
Key Realization: Instead of creating separate tokenization layers, we can enhance the existing state machine with sublanguage-specific processing.
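The delimiter-nesting part of that state machine can be sketched in isolation. This is a simplified standalone version under stated assumptions (backslash escapes the next character; only one paired delimiter); the real `parseRawStringWithDelimiter()` also handles heredocs, token streams, and position tracking.

```java
// Simplified sketch of nested paired-delimiter extraction, e.g. q{a {b} c}.
public class DelimiterSketch {
    // Assumes src.charAt(start) == open; a backslash escapes the next character.
    public static String extractDelimited(String src, int start, char open, char close) {
        StringBuilder body = new StringBuilder();
        int depth = 1;
        for (int i = start + 1; i < src.length(); i++) {
            char ch = src.charAt(i);
            if (ch == '\\' && i + 1 < src.length()) {        // ESCAPE state
                body.append(ch).append(src.charAt(++i));
            } else if (ch == open) {                          // nested opener
                depth++;
                body.append(ch);
            } else if (ch == close) {
                if (--depth == 0) return body.toString();     // END_TOKEN
                body.append(ch);
            } else {                                          // STRING state
                body.append(ch);
            }
        }
        throw new IllegalStateException("Unterminated delimiter");
    }
}
```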
No separate tokenization needed! Enhance existing parseRawStringWithDelimiter():
// BEFORE: Complex 3-layer architecture with re-tokenization
String rawString = extractFromTokens(tokens);
SublanguageLexer lexer = new SublanguageLexer(rawString);
List<SublanguageToken> newTokens = lexer.tokenize();
SublanguageParser parser = new RegexParser(newTokens);
// AFTER: Enhanced existing method
public static ParsedString parseRawStringWithDelimiterEnhanced(
EmitterContext ctx,
List<LexerToken> tokens,
int index,
boolean redo,
Parser parser,
SublanguageType sublanguageType // NEW: specify sublanguage
) {
// Existing state machine + sublanguage-specific enhancements
switch (state) {
case ESCAPE:
if (sublanguageType == REGEX) {
processRegexEscape(ch, buffer, astBuilder);
} else if (sublanguageType == PACK) {
processPackFormat(ch, buffer, astBuilder);
}
// ... existing logic
}
}
- Massive Simplification: No separate tokenization layer needed
- Better Performance: Single-pass parsing, no re-tokenization overhead
- Leverage Existing Code: Reuse 200+ lines of proven parsing logic
- Easier Integration: Minimal changes, backward compatible
- Proven Infrastructure: Delimiter handling, escape detection already work
KEY INSIGHT: No new architecture needed! The existing parseRawString() flow is perfect:
parseRawString(parser, operator)
↓
parseRawStrings() // extracts content using parseRawStringWithDelimiter()
↓
switch (operator) {
case "m", "qr", "/", "//", "/=" -> parseRegexMatch(ctx, operator, rawStr, parser);
case "s" -> parseRegexReplace(ctx, rawStr, parser);
case "\"", "qq" -> StringDoubleQuoted.parseDoubleQuotedString(...);
// etc.
}
APPROACH: Simply enhance existing operator-specific methods with AST building.
// In parseRegexMatch() - where content is already extracted
public static Node parseRegexMatch(EmitterContext ctx, String operator, ParsedString rawStr, Parser parser) {
String regexContent = rawStr.buffers.get(0);
// NEW: Build AST from extracted content
try {
SublanguageASTNode regexAST = RegexASTBuilder.buildAST(regexContent, rawStr.sourceOffset);
// Use AST for optimized code generation and validation
return generateOptimizedRegexMatch(regexAST, operator, rawStr);
} catch (Exception e) {
// Fail compilation on syntax errors
throw new PerlCompilerException(rawStr.sourceOffset, "Invalid regex: " + e.getMessage(), ctx.errorUtil);
}
}
// Where pack() and sprintf() content is extracted
String template = extractedContent;
// NEW: Build AST and validate
try {
SublanguageASTNode ast = PackASTBuilder.buildAST(template, sourceOffset);
return generateOptimizedCode(ast);
} catch (Exception e) {
throw new PerlCompilerException(sourceOffset, "Invalid template: " + e.getMessage(), ctx.errorUtil);
}
public class RegexASTBuilder {
public static SublanguageASTNode buildAST(String content, int sourceOffset) {
// Parse regex content into AST: literals, character classes, quantifiers, groups
// Fail on syntax errors with position info
}
}
public class PackASTBuilder {
public static SublanguageASTNode buildAST(String content, int sourceOffset) {
// Parse pack template into AST: format specifiers, modifiers, repeat counts
// Fail on invalid formats with position info
}
}}
public class DoubleQuotedParser extends SublanguageParser {
// Uses common library: parseEscapeSequence(), parseCaseModifier(), parseInterpolation()
// Focuses on: string interpolation, case modification state machine
}
### **🚀 Revised Implementation Plan: In-Place AST Enhancement**
**Benefits of This Approach:**
1. **No New Architecture** - Just enhance existing methods
2. **No Switch Statements** - Operator context already handled perfectly
3. **Minimal Changes** - Target specific methods where content is extracted
4. **Backward Compatible** - Existing flow unchanged, AST building added
5. **In-Place Enhancement** - AST building happens where content is already available
**Commit 1: Create AST Builders (~300 lines)**
- Create `RegexASTBuilder.buildAST()` for regex parsing and validation
- Create `PackASTBuilder.buildAST()` for pack template parsing and validation
- Create `SprintfASTBuilder.buildAST()` for sprintf format parsing and validation
- All builders use existing `SublanguageASTNode` infrastructure
- Comprehensive error handling with position tracking
**Commit 2: Enhance Regex Methods (~200 lines)**
- Enhance `parseRegexMatch()` and `parseRegexReplace()` with AST building
- Add AST-driven code generation for optimized regex handling
- Maintain backward compatibility with fallback to existing implementation
- Fail compilation on syntax errors with proper position info
**Commit 3: Enhance Pack/Sprintf Methods (~200 lines)**
- Enhance pack() and sprintf() parsing locations with AST building
- Add AST-driven code generation for optimized template handling
- Fail compilation on syntax errors with proper position info
- Integration with existing error handling patterns
**Commit 4: Testing and Optimization (~100 lines)**
- Comprehensive tests for all AST builders and enhanced methods
- Performance benchmarks vs existing implementations
- Documentation and examples
- Optional compiler flags for enabling/disabling AST validation
## 📋 **ORIGINAL COMPREHENSIVE IMPLEMENTATION PLAN (ARCHIVED)**
### **Phase 1: Foundation Infrastructure ✅ COMPLETED**
- Base classes and interfaces implemented and verified
### **Phase 2: Three-Layer Architecture Implementation (SUPERSEDED)**
#### **Commit 1: Unified SublanguageLexer (~400 lines)**
**Assignable Work Unit**: Complete tokenization layer
- Comprehensive token types for all sublanguages
- Shared tokenization: escapes, numbers, strings, variables, whitespace
- Sublanguage-specific tokens: regex operators, pack formats, sprintf flags, case modifiers
- Position tracking and source offset mapping
- **Deliverable**: Single lexer that can tokenize any sublanguage input
#### **Commit 2: Common Parser Library (~500 lines)**
**Assignable Work Unit**: Reusable parsing components
- `parseEscapeSequence()` - handles \n, \t, \x41, \x{263A}, \o{100}, \cA, \N{NAME}
- `parseNumber()` - handles 123, 0x41, 077, ranges, counts with validation
- `parseRange()` - handles A-Z, 0-9, \0-\377 with complex validation logic
- `parseCaseModifier()` - handles \U, \L, \u, \l, \E with state machine
- `parseInterpolation()` - handles $var, @array, ${complex} variable parsing
- Shared validation methods and error handling
- **Deliverable**: Rich library of parsing components usable by all specialized parsers
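A hedged sketch of what `parseEscapeSequence()` could look like, covering only `\n`, `\t`, and the two `\x` forms; the real component would also handle `\o{...}`, `\cA`, `\N{NAME}`, validate malformed input, and emit tokens rather than strings.

```java
// Hedged sketch of parseEscapeSequence(): resolves a subset of Perl escapes.
// Assumes input.charAt(pos) == '\\' and that hex escapes are well-formed.
public class EscapeSketch {
    public static String parseEscapeSequence(String input, int pos) {
        char c = input.charAt(pos + 1);
        switch (c) {
            case 'n': return "\n";
            case 't': return "\t";
            case 'x':
                if (input.charAt(pos + 2) == '{') { // \x{263A}: braced code point
                    int end = input.indexOf('}', pos + 3);
                    int cp = Integer.parseInt(input.substring(pos + 3, end), 16);
                    return new String(Character.toChars(cp));
                }
                // \x41: exactly two hex digits in this sketch
                return String.valueOf((char) Integer.parseInt(input.substring(pos + 2, pos + 4), 16));
            default:
                return String.valueOf(c); // identity escape such as \$ or \\
        }
    }
}
```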
#### **Commit 3: RegexParser (~300 lines)**
**Assignable Work Unit**: Regex-specific parsing logic
- Uses common library for escapes and ranges
- Implements precedence parsing: | > concatenation > quantifiers
- Handles groups, anchors, character classes
- Regex-specific validation and error messages
- **Deliverable**: Complete regex parser with AST generation
#### **Commit 4: PackParser (~300 lines)**
**Assignable Work Unit**: Pack/Unpack-specific parsing logic
- Uses common library for numbers and escapes
- Left-to-right format parsing: A, C, N, v, V formats
- Modifier and count validation
- Pack-specific error messages for test compliance
- **Deliverable**: Complete pack parser addressing test failures
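For illustration, the (format, modifiers, count) structure that `PackItem` models can be demonstrated with a naive regex-based splitter. This is an assumption-laden sketch for exposition only, not the planned token-based parser, and it performs no validation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive illustration: split a pack template into format/modifier/count items.
public class PackTemplateDemo {
    // format char, then optional endianness/native modifiers, then optional count
    static final Pattern ITEM =
        Pattern.compile("([aAZbBhHcCsSiIlLnNvVqQjJfdFDpPuUwWxX@])([<>!]*)(\\d+|\\*)?");

    public static List<String> split(String template) {
        List<String> items = new ArrayList<>();
        Matcher m = ITEM.matcher(template);
        while (m.find()) items.add(m.group());
        return items;
    }
}
```

For example, `"A10i4l<*"` splits into the three items `A10`, `i4`, and `l<*`, mirroring the template from the introduction.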
#### **Commit 5: SprintfParser (~250 lines)**
**Assignable Work Unit**: Sprintf-specific parsing logic
- Uses common library for number parsing
- Format specifier parsing: %, flags, width, precision, conversion
- Argument count validation
- Sprintf-specific error messages
- **Deliverable**: Complete sprintf parser with validation
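The specifier shape this parser targets (flags, optional vector, width, precision, conversion) can be illustrated by collapsing the grammar into a single regular expression. This is a simplification for exposition, ignoring positional parameters and size modifiers; the planned parser builds an AST instead.

```java
import java.util.regex.Pattern;

// Illustrative only: %[flags][vector][width][.precision]conversion as one regex.
public class SprintfSpecDemo {
    public static final Pattern SPEC =
        Pattern.compile("%[+\\- #0]*(\\*?v)?(\\d+|\\*)?(\\.(\\d+|\\*))?[diouxXeEfFgGaAcspn%]");

    public static boolean isValidSpecifier(String s) {
        return SPEC.matcher(s).matches();
    }
}
```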
#### **Commit 6: DoubleQuotedParser (~350 lines)**
**Assignable Work Unit**: Double-quoted string parsing logic
- Uses common library for escapes, case modifiers, interpolation
- String interpolation state machine
- Case modification conflict detection
- \Q...\E quotemeta handling
- **Deliverable**: Complete double-quoted string parser
### **Phase 3: Integration and Migration**
#### **Commit 7-10: Integration Layer (~200 lines each)**
**Assignable Work Units**: Backward-compatible integration
- StringDoubleQuoted integration with fallback
- RegexPreprocessor integration with fallback
- Pack/Unpack integration with fallback
- SprintfOperator integration with fallback
- **Deliverable**: New parsers integrated with existing code, zero breaking changes
#### **Commit 11+: Gradual Migration (~100-300 lines each)**
**Assignable Work Units**: Method-by-method replacement
- Replace individual methods with AST-based implementations
- Maintain string-based APIs for backward compatibility
- Performance testing and validation
- **Deliverable**: Incremental migration with continuous validation
## ✅ **Work Splitting Benefits**
### **Parallel Development Possible:**
- **Team Member A**: Unified lexer (Commit 1)
- **Team Member B**: Common parser library (Commit 2)
- **Team Member C**: RegexParser (Commit 3) - can start after Commits 1-2
- **Team Member D**: PackParser (Commit 4) - can start after Commits 1-2
- **Team Member E**: Integration work (Commits 7-10)
### **Clear Dependencies:**
- Commits 1-2 must complete before 3-6
- Commits 3-6 can be developed in parallel
- Commits 7+ can start once corresponding parser is complete
### **Measurable Progress:**
- Each commit is 200-500 lines with clear deliverables
- Each commit compiles and passes tests independently
- Progress can be tracked and validated incrementally
This comprehensive plan enables **efficient parallel development** while maintaining **architectural integrity** and **backward compatibility**! 🎉
## ⚠️ **COMPREHENSIVE RISK MITIGATION PLAN**
### **Critical Risk Assessment**
After analyzing existing error/warning systems, we identified **sophisticated error handling complexity**:
- **Sprintf**: 4 error types, overlapping specifiers, positional parameters, vector formats
- **Pack**: Runtime exceptions, Unicode state tracking, endianness conflicts
- **Regex**: 52+ error cases, Java compatibility transformations
- **All sublanguages**: Complex validation rules and edge cases
### **Risk Mitigation Strategies**
#### **Commit -1: Risk Mitigation Phase (~300 lines)**
**MANDATORY before implementation begins**
1. **Enhanced Error Compatibility**
```java
// Extend SublanguageValidationResult to support existing error types
public enum ErrorBehavior {
APPEND_INVALID, // sprintf INVALID_APPEND_ERROR
NO_APPEND, // sprintf INVALID_NO_APPEND
RUNTIME_EXCEPTION, // pack runtime exceptions
WARNING_ONLY // sprintf overlapping specifiers
}
```

2. **Error Compatibility Layer**

```java
// Map new error system to existing error messages and behavior
public class ErrorCompatibilityMapper {
    public static String mapToExistingError(SublanguageError error);
    public static boolean shouldAppendInvalid(SublanguageError error);
    public static boolean shouldThrowException(SublanguageError error);
}
```
3. **Performance Baseline Framework**

```java
// Measure current parsing performance before migration
public class ParsingBenchmark {
    public static void benchmarkSprintfParsing();
    public static void benchmarkPackParsing();
    public static void benchmarkRegexParsing();
}
```
4. **Comprehensive Test Catalog**
   - Document all existing edge cases from current parsers
   - Create test migration checklist
   - Regression test framework
5. **Feature Flag System**

```java
// Gradual rollout with monitoring
public class SublanguageFeatureFlags {
    public static boolean useNewSprintfParser();
    public static boolean useNewPackParser();
    public static boolean useNewRegexParser();
}
```
- Sprintf Overlapping Specifiers
  - Complex logic for detecting overlapping format specifiers
  - Warning generation without breaking functionality
  - Position tracking across overlaps
- Pack Unicode State Tracking
  - `byteMode` vs `hasUnicodeInNormalMode` state management
  - UTF-8 encoding error handling
  - Unicode code point validation
- Regex Java Compatibility
  - Octal sequence transformation (`\120` → `\0120`)
  - Character class mapping (`[:ascii:]` → `\p{ASCII}`)
  - Inline comment removal (`(?#...)`)
  - `\G` anchor handling
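One of these transformations, POSIX class mapping, can be sketched with naive string replacement. This is an illustration only: the real preprocessor would operate on the token stream or AST so that escaped or quoted text is not rewritten blindly.

```java
// Naive illustration of mapping POSIX character classes to Java \p{...} classes.
public class RegexCompatSketch {
    public static String mapPosixClasses(String perlRegex) {
        return perlRegex
            .replace("[:ascii:]", "\\p{ASCII}")
            .replace("[:alpha:]", "\\p{Alpha}")
            .replace("[:digit:]", "\\p{Digit}");
    }
}
```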
- Performance Gate: No regression > 10% in parsing speed
- Error Compatibility Gate: All existing error messages preserved
- Test Coverage Gate: 100% of existing edge cases covered
- Integration Gate: Fallback mechanism works correctly
- Feature flags can disable new parsers immediately
- Monitoring alerts for error rate increases
- Automated rollback if performance degrades
- Clear rollback procedures documented
- ✅ Zero breaking changes to existing functionality
- ✅ All existing error messages preserved with same behavior
- ✅ Performance maintained or improved
- ✅ 100% test coverage of existing edge cases
- ✅ Gradual rollout with monitoring and rollback capability
This risk mitigation ensures our architecture can handle the full complexity of existing error/warning systems while providing a safe migration path.
- AST generation for transformation
- Replace current regex preprocessing gradually
- Update `RegexPreprocessor.java` to use the new parser
- Maintain backward compatibility during transition
Verification for Phase 2 Completion:
- Regex parsing happens during compilation
- Debug output shows parser methods being called
- Existing regex tests still pass
Prerequisites: Phase 1 must be complete
- **Implement `PackLexer` and `PackParser`**
  - Must extend base classes from Phase 1
  - Integration through `StringParser` methods
- **Add semantic validation for modifier combinations**
  - Proper error reporting for invalid combinations
- **Integrate with existing pack/unpack operators**
  - Update `Pack.java` and `Unpack.java` to use new parser
  - Call parsing during compilation, not runtime
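As a sketch of what `PackLexer` tokenization might produce, a template like `"A10i4l<*"` can be split into (format letter, modifiers, count) triples. The regex, class name, and output shape here are illustrative assumptions, not the actual PerlOnJava API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of pack-template tokenization into
// (format letter, endianness/size modifiers, repeat count) triples.
public final class PackTemplateSketch {
    private static final Pattern ITEM =
        Pattern.compile("([aAZbBhHcCsSlLqQjJnNvViIwWuUxX@.fdFDp])([<>!]*)(\\d+|\\*)?");

    public static List<String> tokenize(String template) {
        List<String> items = new ArrayList<>();
        Matcher m = ITEM.matcher(template);
        while (m.find()) {
            String count = m.group(3) == null ? "1" : m.group(3);
            items.add(m.group(1) + m.group(2) + ":" + count);
        }
        return items;
    }
}
```

A real lexer would additionally validate modifier combinations (e.g. reject `l<>`) and report positions, per the semantic-validation step above.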
Prerequisites: Phase 1 must be complete
- **Implement `SprintfLexer` and `SprintfParser`**
  - Must extend base classes from Phase 1
  - Integration through `StringParser` methods
- **Add format specification validation**
  - Semantic validation of format specifiers
- **Replace current sprintf parsing**
  - Update `SprintfOperator.java` to use compile-time parsing
  - Remove runtime parsing logic
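A sketch of how a `SprintfLexer` might recognize a specifier such as `%*.*f`, flagging `*` width/precision as dynamic at compile time. The regex is a simplified assumption that ignores Perl's vector flag and argument indexes; validating the conversion letter itself would be a separate semantic step:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: split a sprintf specifier into flags, width,
// precision, and conversion; '*' marks a value taken from the argument list.
public final class SprintfSpecSketch {
    private static final Pattern SPEC =
        Pattern.compile("%([-+ #0]*)(\\d+|\\*)?(?:\\.(\\d+|\\*))?([a-zA-Z%])");

    public static String describe(String spec) {
        Matcher m = SPEC.matcher(spec);
        if (!m.matches()) return "syntax error";
        boolean dynamic = "*".equals(m.group(2)) || "*".equals(m.group(3));
        return m.group(4) + (dynamic ? " (dynamic width/precision)" : "");
    }
}
```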
Prerequisites: Phase 1 must be complete
- **Implement `TrLexer` and `TrParser`**
  - Must extend base classes from Phase 1
  - Integration through `StringParser` methods
- **Add pattern expansion and validation**
  - Handle escape sequences, ranges, character classes
  - Proper octal/hex parsing (learned from failed implementation)
- **Replace current transliteration parsing**
  - Update `RuntimeTransliterate.java` to use compile-time parsing
  - Remove runtime parsing logic
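The range handling above can be sketched as a small expansion pass. This hypothetical helper assumes escape sequences (`\x41`, `\101`) were already decoded in an earlier pass, and rejects reversed ranges the way the syntax-error category below requires:

```java
// Hypothetical sketch of tr/// range expansion, e.g. "a-e" -> "abcde".
// Assumes escapes were decoded earlier; reversed ranges are syntax errors.
public final class TrRangeSketch {
    public static String expandRanges(String spec) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < spec.length(); i++) {
            char c = spec.charAt(i);
            if (i + 2 < spec.length() && spec.charAt(i + 1) == '-') {
                char end = spec.charAt(i + 2);
                if (end < c) {
                    throw new IllegalArgumentException(
                        "Invalid range '" + c + "-" + end + "'");
                }
                for (char x = c; x <= end; x++) out.append(x);
                i += 2;  // consume "-end"
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```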
- Heredoc enhancement
- Other domain-specific languages as needed
The sublanguage parser architecture supports two distinct types of errors to handle PerlOnJava's incremental implementation approach:
- Definition: User mistakes - invalid syntax, malformed patterns
- Examples:
  - Pack: `"l<>4"` (invalid endianness syntax)
  - Regex: `"/[/"` (unmatched bracket)
  - Sprintf: `"%q"` (invalid conversion specifier)
  - Transliteration: `"tr/a-z-0-9//"` (invalid range)
- Handling: Should fail compilation immediately
- User Experience: Clear error message pointing to the syntax mistake
- Definition: Valid Perl syntax but not yet supported in PerlOnJava
- Examples:
  - Pack: `"w"` format (BER compressed integer - valid Perl, not implemented)
  - Regex: `"(?{code})"` (code execution in regex - valid Perl, not implemented)
  - Sprintf: `"%v"` vector flag (valid Perl, partially implemented)
- Handling: May fall back to existing behavior or provide helpful "not yet implemented" message
- User Experience: Informative message about what's not supported yet
- Clear User Feedback: Users know if they made a mistake vs. hit a limitation
- Incremental Implementation: Unimplemented features can be added gradually
- Graceful Degradation: Valid Perl can fall back to existing behavior
- Better Testing: Can distinguish between bugs and missing features
- Development Prioritization: Unimplemented errors guide feature development
```java
// Syntax error - user mistake
return SublanguageValidationResult.syntaxError("Invalid range 'z-a' in pack template", position);

// Unimplemented error - valid Perl, not supported yet
return SublanguageValidationResult.unimplementedError("BER compressed integer format 'w' not yet implemented");

// Check error types
if (result.hasSyntaxErrors()) {
    // Fail compilation immediately
    throw new PerlCompilerException(result.getFirstError());
} else if (result.hasUnimplementedErrors()) {
    // Log warning and fall back to existing behavior
    System.err.println("Warning: " + result.getFirstError());
    return fallbackBehavior();
}
```

❌ **Skipped Phase 1 Infrastructure**
- We jumped straight to implementing `TrLexer`, `TrParser`, `SprintfLexer`, etc.
- We created individual parsers without the base `SublanguageLexer`/`SublanguageParser` classes
- Result: Inconsistent, isolated parsers that weren't integrated with the main system
❌ **Wrong Execution Model**

- We put parsing logic in runtime operator classes (`RuntimeTransliterate.compile()`)
- We tried to parse patterns at runtime instead of at compile time
- Result: Parsers were never called because they weren't in the compilation pipeline
❌ **No StringParser Integration**

- We never added integration points to `StringParser.java`
- Our parsers existed in isolation, with no connection to the main parser
- Result: No way for the main compilation process to invoke our sublanguage parsers
❌ **Missing AST Generation**
- We did direct string manipulation instead of generating ASTs
- We didn't follow the plan's AST → transformation → output model
- Result: No semantic understanding, just token shuffling
- Debug output never appears - because methods aren't called during compilation
- Runtime parsing fails - because the execution model is wrong
- Isolated functionality - parsers work in isolation but not integrated
- No error integration - custom error handling instead of `PerlCompilerException`
🔑 **Phase 1 is Not Optional**

The base infrastructure (Phase 1) is not just a "nice to have" - it's the foundation that makes everything else work. Without it:
- Individual parsers have no way to integrate with the main system
- There's no consistent pattern for tokenization and parsing
- Error handling is fragmented and inconsistent
🔑 **Integration is Everything**

The most important part of the architecture is the integration with `StringParser.java`. This is where:
- Sublanguage parsing gets triggered during compilation
- The main parser delegates to specialized sublanguage parsers
- ASTs get generated and transformed into usable output
🔑 **Compile-Time vs Runtime**

Sublanguage parsing must happen during compilation, not at runtime:
- Compile-time: Patterns are parsed, validated, and transformed into efficient runtime structures
- Runtime: Pre-compiled structures are used directly without parsing overhead
- Our failed approach tried to parse at runtime, which is the wrong model
- Phase 1 Verification: Can call sublanguage parsing from `StringParser.java` and see debug output
- Compile-Time Parsing: Sublanguage parsing happens when code is compiled, not when it runs
- AST Generation: Parsers produce ASTs that get transformed, not direct string manipulation
- Error Integration: Sublanguage errors use `PerlCompilerException` with proper location info
- Consistent Patterns: All sublanguage parsers extend the same base classes and follow the same patterns
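As a concrete anchor for that shared pattern, the Phase 1 base classes might look like this minimal sketch. Only the class names `SublanguageLexer` and `SublanguageToken` come from this document; every field and method name below is an assumption:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Phase 1 base infrastructure that every
// sublanguage lexer is expected to build on.
final class SublanguageToken {
    final String type;
    final String text;
    final int position;  // offset into the source pattern, for error reporting

    SublanguageToken(String type, String text, int position) {
        this.type = type;
        this.text = text;
        this.position = position;
    }
}

abstract class SublanguageLexer {
    protected final String input;
    protected int pos = 0;

    protected SublanguageLexer(String input) { this.input = input; }

    /** Each sublanguage supplies its own tokenization rules. */
    protected abstract SublanguageToken nextToken();

    /** Shared driver: every lexer produces a uniform token stream. */
    public final List<SublanguageToken> tokenize() {
        List<SublanguageToken> tokens = new ArrayList<>();
        while (pos < input.length()) {
            tokens.add(nextToken());
        }
        return tokens;
    }
}
```

A `PackLexer` or `SprintfLexer` would then only override `nextToken()`, while position tracking and the token-stream contract stay identical across all sublanguages.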
This approach leverages PerlOnJava's existing, proven architecture while providing the benefits of proper sublanguage parsing:
- Eliminates "cheating" with semantic understanding
- Maintains performance without external dependencies
- Provides consistency across all sublanguages
- Enables incremental adoption with minimal risk
- Follows established patterns familiar to the team
The result is a clean, maintainable architecture that eliminates the parsing issues we've identified while staying true to PerlOnJava's design principles.