
Add template linting and Word artifact sanitization #19

Open
houfu wants to merge 1 commit into 16-analyzer-should-sanitize-smart-quotes-p-tags-and-undefined-filters-before-parsing from claude/review-issue-16-A5SgZ

Conversation

@houfu houfu commented May 6, 2026

Owner

Summary

This PR adds linting and sanitization to the template analyzer. The analyzer now detects and normalizes Word-introduced artifacts (smart quotes and docxtpl block prefixes inside Jinja tags) and flags undefined Jinja callables that would otherwise fail only at render time.

Key Changes

  • Sanitization pipeline: Added sanitize_for_analysis() to normalize Word autocorrect artifacts (smart quotes, docxtpl prefixes like {%p) inside Jinja tag bodies before analysis. Original template files remain unchanged; sanitization is analysis-time only.

  • Callable detection: Implemented detect_undefined_callables() to flag Jinja function calls in templates that aren't registered with the default Jinja2 + docxtpl environment. Maintains a KNOWN_CALLABLES set of built-in globals and docxtpl helpers (range, dict, RichText, Subdoc, InlineImage, etc.).

  • Lint mode: Added --lint CLI flag that runs extract → sanitize → callable-detect, prints warnings to stdout, and exits non-zero if any are found. No manifest is written in this mode, allowing the Orchestrator to check for issues without side effects.

  • Warning collection: Modified build_manifest() to accept and include warnings in the output manifest under a warnings key. Each warning is formatted as "<code>: <detail>" with codes: smart_quote, docxtpl_prefix, undefined_callable.

  • Test coverage: Added test_word_artifacts() to verify sanitization and callable detection work correctly, and test_lint_flag() to verify --lint mode behavior.

  • Documentation: Updated SKILL.md with sanitization/lint warning details and lint-only invocation instructions.
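To illustrate the lint-mode contract described above (print warnings, exit non-zero if any exist, write no manifest), here is a minimal sketch. It is not the code in this PR: `format_warning`, `run_lint`, `main`, and the `collect` callback are hypothetical names chosen for the example, and the manifest is represented as a plain dict rather than written to manifest.yaml.

```python
import argparse


def format_warning(code, detail):
    """Format a warning as "<code>: <detail>", the shape described in the PR."""
    return f"{code}: {detail}"


def run_lint(warnings):
    """Print each warning to stdout; return 1 if any were found, else 0."""
    for warning in warnings:
        print(warning)
    return 1 if warnings else 0


def main(argv=None, collect=lambda path: []):
    """Hypothetical entry point. `collect` stands in for the real
    extract -> sanitize -> callable-detect pipeline and returns warnings."""
    parser = argparse.ArgumentParser(description="Template analyzer (sketch)")
    parser.add_argument("template")
    parser.add_argument("--lint", action="store_true",
                        help="report warnings only; do not write a manifest")
    args = parser.parse_args(argv)

    warnings = collect(args.template)
    if args.lint:
        # Lint mode: no manifest and no cache lookup, only an exit code.
        return run_lint(warnings)

    # Normal mode (assumed shape): warnings travel inside the manifest.
    manifest = {"template": args.template, "warnings": warnings}
    return 0 if manifest is not None else 1
```

A caller such as the Orchestrator could then treat a non-zero return from `main([path, "--lint"])` as "this template needs attention" without any files being created.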

Implementation Details

  • Smart quote replacement is scoped to tag bodies only (via _iter_tag_bodies()), preserving legitimate curly quotes in prose.
  • Docxtpl prefix stripping ({%p, {%tr, {%tc) is regex-based and generates warnings for each occurrence.
  • Callable detection uses regex to find name( patterns in tag bodies, filtering out Jinja keywords and known globals.
  • Cache check is skipped in --lint mode to ensure fresh analysis on every invocation.
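The tag-scoped sanitization described in the first two bullets can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's implementation: the regexes, the `SMART_QUOTES` table, and the warning wording are invented for the example, and only the function name `sanitize_for_analysis` and the warning codes come from the description.

```python
import re

# Assumed tag pattern: {{ ... }} expressions and {% ... %} statements.
TAG_RE = re.compile(r"\{\{.*?\}\}|\{%.*?%\}", re.DOTALL)
# docxtpl block prefixes named in the PR: {%p, {%tr, {%tc.
PREFIX_RE = re.compile(r"\{%(p|tr|tc)\s")
# Curly quotes that Word autocorrect substitutes for straight quotes.
SMART_QUOTES = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}


def sanitize_for_analysis(text):
    """Return (sanitized_text, warnings); the input string is not mutated."""
    warnings = []

    def fix_tag(match):
        tag = match.group(0)
        prefix = PREFIX_RE.match(tag)
        if prefix:
            warnings.append("docxtpl_prefix: {%" + prefix.group(1) + " normalized to {%")
            tag = "{% " + tag[prefix.end():]
        for smart, plain in SMART_QUOTES.items():
            if smart in tag:
                warnings.append(f"smart_quote: replaced {smart!r} inside a tag")
                tag = tag.replace(smart, plain)
        return tag

    # Only matched tags are rewritten, so curly quotes in prose survive.
    return TAG_RE.sub(fix_tag, text), warnings
```

Because the substitution callback only ever sees matched tags, prose such as a quoted sentence in the document body passes through unchanged, which is the property the first bullet calls out.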

https://claude.ai/code/session_01JZSs2LFRxiQDXQbM2ErRys

Templates authored in Microsoft Word silently introduce three classes of
artifacts that broke the analyzer and renderer: smart quotes inside
Jinja tag bodies (Word autocorrect), docxtpl block prefixes
({%p / {%tr / {%tc) that the regex walker did not recognize, and
undefined Jinja callables that hard-failed only at render time. The
analyzer now runs a non-mutating sanitize-for-analysis pass before the
two-pass walker, normalizes these artifacts in the analysis-time string,
and records each correction (plus undefined callables) as a warning
string in manifest.yaml. A new --lint flag reports warnings without
writing the manifest and exits non-zero when any are found.
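
As a rough illustration of the undefined-callable check, the sketch below scans tag bodies for `name(` patterns and reports names that are neither Jinja keywords nor known globals. The regexes, the contents of `JINJA_KEYWORDS`, and the rule that skips names preceded by `.` or `|` (method calls and filters) are assumptions made for this example. `KNOWN_CALLABLES` here lists only the names mentioned in the description plus Jinja2's documented default globals; the real set may differ.

```python
import re

TAG_BODY_RE = re.compile(r"\{\{(.*?)\}\}|\{%(.*?)%\}", re.DOTALL)
CALL_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(")

# Assumed keyword list: words that can legally precede "(" in a tag.
JINJA_KEYWORDS = {"if", "elif", "else", "for", "in", "not", "and", "or", "is", "set"}
# Names from the PR description plus Jinja2's default globals.
KNOWN_CALLABLES = {"range", "dict", "lipsum", "cycler", "joiner", "namespace",
                   "RichText", "Subdoc", "InlineImage"}


def detect_undefined_callables(text):
    """Return "undefined_callable: <name>" warnings for unknown calls in tags."""
    warnings = []
    for tag in TAG_BODY_RE.finditer(text):
        body = tag.group(1) or tag.group(2) or ""
        for call in CALL_RE.finditer(body):
            name = call.group(1)
            before = body[:call.start()].rstrip()
            if before.endswith((".", "|")):
                continue  # method call or filter, not a global callable
            if name in JINJA_KEYWORDS or name in KNOWN_CALLABLES:
                continue
            warnings.append(f"undefined_callable: {name}")
    return warnings


template = ("{% for i in range(3) %}{{ format_money(total) }}"
            "{{ items|join(', ') }}{{ user.name.upper() }}{% endfor %}")
print(detect_undefined_callables(template))
# prints ['undefined_callable: format_money']
```

A regex scan like this cannot tell a call inside a string literal from a real one, so it may over-report; that trade-off seems acceptable for a lint pass whose output is advisory warnings.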

Development

Successfully merging this pull request may close these issues.

Analyzer should sanitize smart quotes, {%p ...%} tags, and undefined filters before parsing

2 participants