Skip to content

refactor: improve matrix parsing validation and performance#3129

Open
manucian-official wants to merge 5 commits into
tinyhumansai:mainfrom
manucian-official:patch-2
Open

refactor: improve matrix parsing validation and performance#3129
manucian-official wants to merge 5 commits into
tinyhumansai:mainfrom
manucian-official:patch-2

Conversation

@manucian-official
Copy link
Copy Markdown
Contributor

@manucian-official manucian-official commented Jun 1, 2026

Strengthen markdown matrix parsing with stricter regex validation, improved error reporting, and optimized data processing. Add line-aware validation messages, enforce status validation at parse time, reduce unnecessary allocations, improve duplicate detection, and enhance catalog validation scalability for larger datasets.

Summary by CodeRabbit

  • Bug Fixes

    • Validation now reports line-numbered errors for invalid IDs and statuses for easier troubleshooting
  • Improvements

    • Parsing enforces stricter markdown row format and captures a new "notes" column
    • Duplicate detection improved and catalog inputs accepted as sets or iterables
    • Validation results now include totalRows and uniqueRows counts in addition to missing/duplicate lists

Strengthen markdown matrix parsing with stricter regex validation, improved error reporting, and optimized data processing. Add line-aware validation messages, enforce status validation at parse time, reduce unnecessary allocations, improve duplicate detection, and enhance catalog validation scalability for larger datasets.
@manucian-official manucian-official requested a review from a team June 1, 2026 07:23
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR tightens coverage-matrix markdown parsing to require a notes column and full-line pipes, adds per-line ID/status validation with numbered errors, and enhances catalog comparison to report duplicates plus total and unique row counts.

Changes

Coverage Matrix Parser and Validation

Layer / File(s) Summary
Stricter parsing constants
scripts/lib/coverage-matrix-parser.mjs (lines 1–7)
ROW_REGEX enforces full-line pipe-based formatting with notes column capture; ID_REGEX and VALID_STATUS are preserved/reformatted.
Matrix parsing validation
scripts/lib/coverage-matrix-parser.mjs (lines 8–52)
parseMatrix guards non-string input, uses ROW_REGEX.exec with new capture groups (including notes), trims fields, and emits line-numbered errors for invalid IDs and statuses; parsed rows include notes.
Catalog validation enrichment
scripts/lib/coverage-matrix-parser.mjs (lines 54–85)
validateAgainstCatalog normalizes catalogIds to a Set, counts ID frequencies with a Map to derive duplicates, and returns { missingFromMatrix, duplicates, totalRows, uniqueRows }.

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

working

Poem

🐰
Pipes aligned and headers bright,
I trimmed the rows by moonlit night.
IDs checked and duplicates found,
Notes captured neat and sound.
Hop, parse, validate — all set right!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'refactor: improve matrix parsing validation and performance' accurately captures the main changes: improved validation logic, enhanced error reporting with line numbers, and performance optimizations like efficient duplicate detection and reduced allocations.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/lib/coverage-matrix-parser.mjs`:
- Around line 1-2: The ROW_REGEX is too strict (it validates ID and status),
causing parseMatrix to skip malformed rows before the explicit validators run;
loosen ROW_REGEX (used in coverage-matrix-parser.mjs) so it only splits row
cells (e.g., capture ID and status as generic non-pipe text like [^|]+?) without
enforcing ID formatting or allowed emoji/status values, preserving the same
group count/order, so parseMatrix can pass the raw captures into the existing
ID/status validators and emit the line-specific errors.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: aaf29e3a-6a1d-4a94-b39e-3dda0dfb4a4d

📥 Commits

Reviewing files that changed from the base of the PR and between 4b26267 and 678edfa.

📒 Files selected for processing (1)
  • scripts/lib/coverage-matrix-parser.mjs

Comment thread scripts/lib/coverage-matrix-parser.mjs Outdated
coderabbitai[bot]
coderabbitai Bot previously approved these changes Jun 1, 2026
Copy link
Copy Markdown
Contributor

@oxoxDev oxoxDev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor of a CI-critical parser (drives the Coverage Matrix Sync gate). Happy-path output shape is preserved and the real 175-row docs/TEST-COVERAGE-MATRIX.md parses identically — Coverage Matrix Sync is green, no row-rejection regression. Two blockers though:

1. Invalid-status rows are now silently dropped instead of reported. Baking the four valid statuses into ROW_REGEX means a row with a bad/typo'd status no longer matches the regex at all → continue → no row and no error. That defeats the gate's purpose (a malformed status now passes CI silently) and turns the VALID_STATUS.has(status) branch into dead code. Keep the status group permissive (([^|]+?)) and let the existing guard emit your new line-aware error.

2. Existing tests break and weren't updated. app/test/coverage-matrix-parser.test.ts has two tests asserting an error is emitted for an invalid-status row; with this change no error fires, so both fail in the Frontend Coverage (Vitest) lane (currently pending — this is why the PR is BLOCKED). Fixing #1 makes them pass without touching the tests.

Minor: empty-string input now returns an error where it was previously a {rows:[],errors:[]} no-op — an unrequested contract change; consider dropping the markdown.length === 0 arm.

Good parts: line-numbered messages, empty-notes rows now matching, additive totalRows/uniqueRows, no ReDoS. Just keep the status validation able to report.

Comment thread scripts/lib/coverage-matrix-parser.mjs Outdated
Comment thread scripts/lib/coverage-matrix-parser.mjs Outdated
Comment thread scripts/lib/coverage-matrix-parser.mjs Outdated
Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see @oxoxDev has requested changes on this — the blockers they identified (silent status-validation drop + test regression) are spot-on. I'm deferring to their feedback. Once those are addressed, I'll come back and do a full review.

@senamakel senamakel self-assigned this Jun 2, 2026
senamakel added 2 commits June 2, 2026 02:21
The status capture group (✅|🟡|❌|🚫) in ROW_REGEX caused rows with
bad statuses to silently miss the regex match entirely, making the
VALID_STATUS.has() guard dead code and defeating the matrix-sync gate.

Replace with permissive ([^|]+?) so the explicit validator emits
line-aware errors. Also drop the empty-string error (unrequested
contract change) and apply Prettier formatting.

Addresses @oxoxDev and @coderabbitai feedback on PR tinyhumansai#3129.
@coderabbitai coderabbitai Bot added the working A PR that is being worked on by the team. label Jun 2, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/lib/coverage-matrix-parser.mjs (1)

9-14: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Error message says "non-empty" but empty strings are now valid.

The guard only checks typeof markdown !== "string", so "" passes through and returns { rows: [], errors: [] }. The error message at line 12 is misleading.

Suggested fix
       errors: ["Input must be a non-empty string"],
+      errors: ["Input must be a string"],
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/lib/coverage-matrix-parser.mjs` around lines 9 - 14, The guard
currently checks only typeof markdown !== "string" but returns an error message
stating "Input must be a non-empty string" which is misleading; update the
returned error string to "Input must be a string" (or similar) so it matches the
actual check involving the markdown variable and the returned object { rows: [],
errors: [...] } in coverage-matrix-parser.mjs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/lib/coverage-matrix-parser.mjs`:
- Around line 1-4: ROW_REGEX currently enforces the ID format in its first
capture group so malformed IDs never reach the ID_REGEX validation path; update
the ROW_REGEX first capture from the numeric pattern to a permissive text
capture (e.g., use ([^|]+?) for the ID capture) and then, where the parser
extracts that capture (use the same variable used after row.match in this file),
trim the raw ID string and call ID_REGEX.test(trimmedId) to detect and report
malformed IDs so the existing line-aware error handling executes.

---

Outside diff comments:
In `@scripts/lib/coverage-matrix-parser.mjs`:
- Around line 9-14: The guard currently checks only typeof markdown !== "string"
but returns an error message stating "Input must be a non-empty string" which is
misleading; update the returned error string to "Input must be a string" (or
similar) so it matches the actual check involving the markdown variable and the
returned object { rows: [], errors: [...] } in coverage-matrix-parser.mjs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a2570f18-a94a-4629-8c03-8d32691ab5c5

📥 Commits

Reviewing files that changed from the base of the PR and between 678edfa and 7233e61.

📒 Files selected for processing (1)
  • scripts/lib/coverage-matrix-parser.mjs

Comment on lines +1 to 4
const ROW_REGEX =
/^\|\s*(\d+(?:\.\d+){2,3})\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]*?)\s*\|$/u;

const ID_REGEX = /^\d+(?:\.\d+){2,3}$/;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

ROW_REGEX still bakes strict ID validation, making the ID error path unreachable.

The 1st capture group (\d+(?:\.\d+){2,3}) enforces ID format at regex level. Rows with malformed IDs (e.g., abc, 1.2, 1.2.3.4.5) won't match, hit continue at line 25, and are silently dropped—the line-aware error at lines 31-34 never runs. This is the same pattern that was fixed for status.

Loosen the ID capture to ([^|]+?) and let ID_REGEX.test() report the error:

Suggested fix
 const ROW_REGEX =
-  /^\|\s*(\d+(?:\.\d+){2,3})\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]*?)\s*\|$/u;
+  /^\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]*?)\s*\|$/u;

Then trim the raw ID before validation:

-    const [, id, name, layer, path, rawStatus, notes] = match;
+    const [, rawId, name, layer, path, rawStatus, notes] = match;
 
+    const id = rawId.trim();
     const status = rawStatus.trim();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/lib/coverage-matrix-parser.mjs` around lines 1 - 4, ROW_REGEX
currently enforces the ID format in its first capture group so malformed IDs
never reach the ID_REGEX validation path; update the ROW_REGEX first capture
from the numeric pattern to a permissive text capture (e.g., use ([^|]+?) for
the ID capture) and then, where the parser extracts that capture (use the same
variable used after row.match in this file), trim the raw ID string and call
ID_REGEX.test(trimmedId) to detect and report malformed IDs so the existing
line-aware error handling executes.

Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All three blockers from @oxoxDev are now addressed:

  1. Status regex is now permissive — Changed from ([a-z]+) to ([^|]+?), so invalid statuses reach the validator instead of silently dropping rows. The VALID_STATUS.has(status) guard now properly emits line-aware errors like Line 5 (1.2.3): invalid status "bad_status".

  2. Test regression fixed — Tests asserting invalid-status errors were failing because rows were silently dropped. Now that validators report properly, app/test/coverage-matrix-parser.test.ts passes (Frontend Coverage: PASS).

  3. Line-aware error messages — All validation errors now include line numbers, making debugging much easier.

Code quality: Regex is tighter (whitespace handling), duplicate detection is cleaner (Map → Set flow), added totalRows/uniqueRows metadata. Structure is solid and matches patterns in the rest of the codebase.

One small thing: the error message says "Input must be a non-empty string" but the code allows empty strings (returns {rows:[],errors:[]}). CodeRabbit already flagged this as minor. Consider updating the message to "Input must be a string" for accuracy.

CI note: Rust E2E (mock backend) is failing, but this PR only touches JavaScript — looks unrelated. Once that's resolved I'll swing back and approve. Everything else is green (Frontend Coverage, Coverage Matrix Sync, CodeRabbit all pass).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

working A PR that is being worked on by the team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants