refactor: improve matrix parsing validation and performance by manucian-official · Pull Request #3129 · tinyhumansai/openhuman

manucian-official · 2026-06-01T07:23:15Z

Strengthen markdown matrix parsing with stricter regex validation, improved error reporting, and optimized data processing. Add line-aware validation messages, enforce status validation at parse time, reduce unnecessary allocations, improve duplicate detection, and enhance catalog validation scalability for larger datasets.

Summary by CodeRabbit

Bug Fixes
- Validation now reports line-numbered errors for invalid IDs and statuses for easier troubleshooting
Improvements
- Parsing enforces stricter markdown row format and captures a new "notes" column
- Duplicate detection improved and catalog inputs accepted as sets or iterables
- Validation results now include totalRows and uniqueRows counts in addition to missing/duplicate lists

Strengthen markdown matrix parsing with stricter regex validation, improved error reporting, and optimized data processing. Add line-aware validation messages, enforce status validation at parse time, reduce unnecessary allocations, improve duplicate detection, and enhance catalog validation scalability for larger datasets.

coderabbitai · 2026-06-01T07:23:35Z

📝 Walkthrough

Walkthrough

This PR tightens coverage-matrix markdown parsing to require a notes column and full-line pipes, adds per-line ID/status validation with numbered errors, and enhances catalog comparison to report duplicates plus total and unique row counts.

Changes

Coverage Matrix Parser and Validation

Layer / File(s)	Summary
Stricter parsing constants `scripts/lib/coverage-matrix-parser.mjs` (lines 1–7)	`ROW_REGEX` enforces full-line pipe-based formatting with notes column capture; `ID_REGEX` and `VALID_STATUS` are preserved/reformatted.
Matrix parsing validation `scripts/lib/coverage-matrix-parser.mjs` (lines 8–52)	`parseMatrix` guards non-string input, uses `ROW_REGEX.exec` with new capture groups (including `notes`), trims fields, and emits line-numbered errors for invalid IDs and statuses; parsed rows include `notes`.
Catalog validation enrichment `scripts/lib/coverage-matrix-parser.mjs` (lines 54–85)	`validateAgainstCatalog` normalizes `catalogIds` to a `Set`, counts ID frequencies with a `Map` to derive `duplicates`, and returns `{ missingFromMatrix, duplicates, totalRows, uniqueRows }`.

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

working

Poem

🐰
Pipes aligned and headers bright,
I trimmed the rows by moonlit night.
IDs checked and duplicates found,
Notes captured neat and sound.
Hop, parse, validate — all set right!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'refactor: improve matrix parsing validation and performance' accurately captures the main changes: improved validation logic, enhanced error reporting with line numbers, and performance optimizations like efficient duplicate detection and reduced allocations.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/lib/coverage-matrix-parser.mjs`:
- Around line 1-2: The ROW_REGEX is too strict (it validates ID and status),
causing parseMatrix to skip malformed rows before the explicit validators run;
loosen ROW_REGEX (used in coverage-matrix-parser.mjs) so it only splits row
cells (e.g., capture ID and status as generic non-pipe text like [^|]+?) without
enforcing ID formatting or allowed emoji/status values, preserving the same
group count/order, so parseMatrix can pass the raw captures into the existing
ID/status validators and emit the line-specific errors.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: aaf29e3a-6a1d-4a94-b39e-3dda0dfb4a4d

📥 Commits

Reviewing files that changed from the base of the PR and between 4b26267 and 678edfa.

📒 Files selected for processing (1)

scripts/lib/coverage-matrix-parser.mjs

oxoxDev

Refactor of a CI-critical parser (drives the Coverage Matrix Sync gate). Happy-path output shape is preserved and the real 175-row docs/TEST-COVERAGE-MATRIX.md parses identically — Coverage Matrix Sync is green, no row-rejection regression. Two blockers though:

1. Invalid-status rows are now silently dropped instead of reported. Baking the four valid statuses into ROW_REGEX means a row with a bad/typo'd status no longer matches the regex at all → continue → no row and no error. That defeats the gate's purpose (a malformed status now passes CI silently) and turns the VALID_STATUS.has(status) branch into dead code. Keep the status group permissive (([^|]+?)) and let the existing guard emit your new line-aware error.

2. Existing tests break and weren't updated. app/test/coverage-matrix-parser.test.ts has two tests asserting an error is emitted for an invalid-status row; with this change no error fires, so both fail in the Frontend Coverage (Vitest) lane (currently pending — this is why the PR is BLOCKED). Fixing #1 makes them pass without touching the tests.

Minor: empty-string input now returns an error where it was previously a {rows:[],errors:[]} no-op — an unrequested contract change; consider dropping the markdown.length === 0 arm.

Good parts: line-numbered messages, empty-notes rows now matching, additive totalRows/uniqueRows, no ReDoS. Just keep the status validation able to report.

graycyrus

I see @oxoxDev has requested changes on this — the blockers they identified (silent status-validation drop + test regression) are spot-on. I'm deferring to their feedback. Once those are addressed, I'll come back and do a full review.

@oxoxDev

The status capture group (✅|🟡|❌|🚫) in ROW_REGEX caused rows with bad statuses to silently miss the regex match entirely, making the VALID_STATUS.has() guard dead code and defeating the matrix-sync gate. Replace with permissive ([^|]+?) so the explicit validator emits line-aware errors. Also drop the empty-string error (unrequested contract change) and apply Prettier formatting. Addresses @oxoxDev and @coderabbitai feedback on PR tinyhumansai#3129.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

scripts/lib/coverage-matrix-parser.mjs (1)
9-14: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Error message says "non-empty" but empty strings are now valid.

The guard only checks typeof markdown !== "string", so "" passes through and returns { rows: [], errors: [] }. The error message at line 12 is misleading.
Suggested fix
       errors: ["Input must be a non-empty string"],
+      errors: ["Input must be a string"],
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/lib/coverage-matrix-parser.mjs` around lines 9 - 14, The guard
currently checks only typeof markdown !== "string" but returns an error message
stating "Input must be a non-empty string" which is misleading; update the
returned error string to "Input must be a string" (or similar) so it matches the
actual check involving the markdown variable and the returned object { rows: [],
errors: [...] } in coverage-matrix-parser.mjs.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/lib/coverage-matrix-parser.mjs`:
- Around line 1-4: ROW_REGEX currently enforces the ID format in its first
capture group so malformed IDs never reach the ID_REGEX validation path; update
the ROW_REGEX first capture from the numeric pattern to a permissive text
capture (e.g., use ([^|]+?) for the ID capture) and then, where the parser
extracts that capture (use the same variable used after row.match in this file),
trim the raw ID string and call ID_REGEX.test(trimmedId) to detect and report
malformed IDs so the existing line-aware error handling executes.

---

Outside diff comments:
In `@scripts/lib/coverage-matrix-parser.mjs`:
- Around line 9-14: The guard currently checks only typeof markdown !== "string"
but returns an error message stating "Input must be a non-empty string" which is
misleading; update the returned error string to "Input must be a string" (or
similar) so it matches the actual check involving the markdown variable and the
returned object { rows: [], errors: [...] } in coverage-matrix-parser.mjs.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a2570f18-a94a-4629-8c03-8d32691ab5c5

📥 Commits

Reviewing files that changed from the base of the PR and between 678edfa and 7233e61.

📒 Files selected for processing (1)

scripts/lib/coverage-matrix-parser.mjs

coderabbitai · 2026-06-02T06:28:06Z

+const ROW_REGEX =
+  /^\|\s*(\d+(?:\.\d+){2,3})\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]*?)\s*\|$/u;
+
 const ID_REGEX = /^\d+(?:\.\d+){2,3}$/;


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

ROW_REGEX still bakes strict ID validation, making the ID error path unreachable.

The 1st capture group (\d+(?:\.\d+){2,3}) enforces ID format at regex level. Rows with malformed IDs (e.g., abc, 1.2, 1.2.3.4.5) won't match, hit continue at line 25, and are silently dropped—the line-aware error at lines 31-34 never runs. This is the same pattern that was fixed for status.

Loosen the ID capture to ([^|]+?) and let ID_REGEX.test() report the error:

Suggested fix

const ROW_REGEX = - /^\|\s*(\d+(?:\.\d+){2,3})\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]*?)\s*\|$/u; + /^\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]*?)\s*\|$/u;

Then trim the raw ID before validation:

- const [, id, name, layer, path, rawStatus, notes] = match; + const [, rawId, name, layer, path, rawStatus, notes] = match; + const id = rawId.trim(); const status = rawStatus.trim();

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/lib/coverage-matrix-parser.mjs` around lines 1 - 4, ROW_REGEX currently enforces the ID format in its first capture group so malformed IDs never reach the ID_REGEX validation path; update the ROW_REGEX first capture from the numeric pattern to a permissive text capture (e.g., use ([^|]+?) for the ID capture) and then, where the parser extracts that capture (use the same variable used after row.match in this file), trim the raw ID string and call ID_REGEX.test(trimmedId) to detect and report malformed IDs so the existing line-aware error handling executes.

graycyrus

All three blockers from @oxoxDev are now addressed:

Status regex is now permissive — Changed from ([a-z]+) to ([^|]+?), so invalid statuses reach the validator instead of silently dropping rows. The VALID_STATUS.has(status) guard now properly emits line-aware errors like Line 5 (1.2.3): invalid status "bad_status".
Test regression fixed — Tests asserting invalid-status errors were failing because rows were silently dropped. Now that validators report properly, app/test/coverage-matrix-parser.test.ts passes (Frontend Coverage: PASS).
Line-aware error messages — All validation errors now include line numbers, making debugging much easier.

Code quality: Regex is tighter (whitespace handling), duplicate detection is cleaner (Map → Set flow), added totalRows/uniqueRows metadata. Structure is solid and matches patterns in the rest of the codebase.

One small thing: the error message says "Input must be a non-empty string" but the code allows empty strings (returns {rows:[],errors:[]}). CodeRabbit already flagged this as minor. Consider updating the message to "Input must be a string" for accuracy.

CI note: Rust E2E (mock backend) is failing, but this PR only touches JavaScript — looks unrelated. Once that's resolved I'll swing back and approve. Everything else is green (Frontend Coverage, Coverage Matrix Sync, CodeRabbit all pass).

manucian-official requested a review from a team June 1, 2026 07:23

coderabbitai Bot requested changes Jun 1, 2026

View reviewed changes

Comment thread scripts/lib/coverage-matrix-parser.mjs Outdated

coderabbitai Bot previously approved these changes Jun 1, 2026

View reviewed changes

oxoxDev requested changes Jun 1, 2026

View reviewed changes

Comment thread scripts/lib/coverage-matrix-parser.mjs Outdated

Comment thread scripts/lib/coverage-matrix-parser.mjs Outdated

Comment thread scripts/lib/coverage-matrix-parser.mjs Outdated

graycyrus reviewed Jun 1, 2026

View reviewed changes

senamakel added 2 commits June 1, 2026 15:08

Merge remote-tracking branch 'upstream/main' into pr/3129

796d2c6

Merge branch 'main' into pr/3129

198d554

senamakel self-assigned this Jun 2, 2026

senamakel added 2 commits June 2, 2026 02:21

Merge remote-tracking branch 'upstream/main' into pr/3129

fe54cfc

senamakel dismissed coderabbitai[bot]’s stale review via 7233e61 June 2, 2026 06:24

coderabbitai Bot added the working A PR that is being worked on by the team. label Jun 2, 2026

coderabbitai Bot requested changes Jun 2, 2026

View reviewed changes

graycyrus reviewed Jun 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: improve matrix parsing validation and performance#3129

refactor: improve matrix parsing validation and performance#3129
manucian-official wants to merge 5 commits into
tinyhumansai:mainfrom
manucian-official:patch-2

manucian-official commented Jun 1, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading

Walkthrough

Changes

Estimated Code Review Effort

Suggested labels

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

oxoxDev left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

graycyrus left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 2, 2026

Uh oh!

graycyrus left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

manucian-official commented Jun 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated Code Review Effort

Suggested labels

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

oxoxDev left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

graycyrus left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

graycyrus left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

manucian-official commented Jun 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading

graycyrus left a comment •

edited

Loading