Analyzer should sanitize smart quotes, {%p ...%} tags, and undefined filters before parsing #16

Description

@houfu

Problem

When users author templates in Microsoft Word, three things commonly slip through that silently break both the analyzer and the renderer:

1. Smart quotes around choice literals. Word's autocorrect converts "siac" into “siac” (U+201C / U+201D). Inside a Jinja tag this is no longer valid syntax — e.g. {% if jurisdiction_type == “siac” %}. The current analyzer regex in scripts/analyze.py (RE_EQUALITY = re.compile(r"(\w+)\s*==\s*['\"](.+?)['\"]")) only recognises straight quotes, so the conditional is dropped from the manifest entirely, and docxtpl then fails or mis-renders at runtime.

2. {%p ... %} paragraph-removing tags. This is docxtpl-native syntax for "remove the containing paragraph after rendering." The analyzer regex matches \{%[-\s]*if / \{%[-\s]*for, which does not recognise the p prefix. Result: every {%p if %} / {%p for %} block is invisible to the analyzer, so the manifest shows zero conditionals/loops even though the template is full of them.

3. Undefined Jinja filters/functions. Templates occasionally reference helpers like {{ country_name(governing_law) }} that aren't registered with the environment. The analyzer doesn't warn about these, and rendering hard-fails later.
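
The first two failure modes can be reproduced in isolation. The sketch below uses the RE_EQUALITY pattern exactly as quoted above; RE_IF_OPEN is only the fragment \{%[-\s]*if quoted in item 2 (the full pattern in scripts/analyze.py may differ), and the helper name equality_pairs is hypothetical.

```python
import re

# Pattern quoted in this issue from scripts/analyze.py.
RE_EQUALITY = re.compile(r"(\w+)\s*==\s*['\"](.+?)['\"]")
# Fragment quoted in this issue; the real pattern may be longer.
RE_IF_OPEN = re.compile(r"\{%[-\s]*if")


def equality_pairs(tag):
    """Return the (variable, literal) pairs the analyzer regex can see in a tag."""
    return RE_EQUALITY.findall(tag)


straight = '{% if jurisdiction_type == "siac" %}'
curly = "{% if jurisdiction_type == \u201csiac\u201d %}"  # Word's smart quotes

print(equality_pairs(straight))  # [('jurisdiction_type', 'siac')]
print(equality_pairs(curly))     # []

# The p prefix sits between "{%" and "if", so the opening-tag fragment misses it.
print(RE_IF_OPEN.search("{% if x %}") is not None)   # True
print(RE_IF_OPEN.search("{%p if x %}") is not None)  # False
```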

Repro

Given a Word template with:

{%p if jurisdiction_type == “siac” %}
... SIAC clause ...
{%p endif %}
{%p for party in parties_list %}
{{ party.name }}, incorporated in {{ party.place_of_incorporation }}
{%p endfor %}
This Agreement is governed by {{ country_name(governing_law) }}.

Run:

python scripts/analyze.py path/to/dir

Expected: 1 conditional (equality-gated on jurisdiction_type), 1 loop with sub-variables, and a warning about the undefined country_name callable.

Actual: 0 conditionals, 0 loops, and party.name / party.place_of_incorporation end up as flat top-level dotted variables.

Suggested fix — add a lint/sanitize pass in Phase 2

Before the two-pass analyzer runs, sanitize the extracted text within Jinja tag boundaries and emit warnings:

  • Smart-quote normaliser: within every {% ... %} / {{ ... }} region, map U+201C U+201D U+2018 U+2019 → straight " / '. Warn with "replaced smart quotes in tag at offset N".
  • {%p support: either extend the regexes to \{%p?[-\s]*if\s... (and the same for for / else / endif / endfor), or strip the p in a pre-pass and restore it before rendering. Apply equivalent handling to the {%tr ...%} (table-row) and {%tc ...%} (table-cell) docxtpl tags at the same time.
  • Undefined-callable warning: scan {{ ... }} bodies for \w+\( patterns and cross-reference against the environment's globals/filters. Flag unknowns so the author can either register the helper or rewrite the expression.
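
The three bullets above could be sketched roughly as follows. This is a minimal illustration operating on already-extracted template text, not the analyzer's actual code: every function and constant name here is hypothetical, the prefix handling takes the "strip in a pre-pass" option for an analysis-only copy, and the filter scan (names after |) is an assumption added on top of the \w+\( pattern. In practice known_globals and known_filters would presumably come from the Jinja environment's globals and filters mappings; plain sets are used here to keep the sketch dependency-free.

```python
import re

# Any Jinja statement or expression tag; DOTALL because a tag may wrap across lines.
RE_TAG = re.compile(r"\{%.*?%\}|\{\{.*?\}\}", re.DOTALL)
# Curly quotes named in the issue, mapped to their straight equivalents.
SMART_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})
# docxtpl prefixes p / tr / tc directly after "{%", followed by whitespace.
RE_DOCX_PREFIX = re.compile(r"\{%(?:p|tr|tc)(?=\s)")
RE_EXPR = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)
RE_FILTER = re.compile(r"\|\s*(\w+)")
# A call not preceded by "." or a word character, so method calls are skipped.
RE_CALL = re.compile(r"(?<![\w.])(\w+)\s*\(")


def normalise_quotes(text):
    """Straighten smart quotes inside Jinja tags only; prose outside is untouched."""
    warnings = []

    def fix(match):
        fixed = match.group(0).translate(SMART_QUOTES)
        if fixed != match.group(0):
            warnings.append(f"replaced smart quotes in tag at offset {match.start()}")
        return fixed

    return RE_TAG.sub(fix, text), warnings


def strip_docx_prefixes(text):
    """Return an analysis-only copy in which {%p if ...%} reads as {% if ...%}."""
    return RE_DOCX_PREFIX.sub("{%", text)


def find_undefined_callables(text, known_globals, known_filters):
    """List call and filter names in {{ ... }} bodies that are not registered."""
    unknown = set()
    for body in RE_EXPR.findall(text):
        unknown.update(n for n in RE_FILTER.findall(body) if n not in known_filters)
        # Drop filter names first so `x | default("a")` is not read as a global call.
        calls = RE_CALL.findall(RE_FILTER.sub("|", body))
        unknown.update(n for n in calls if n not in known_globals)
    return sorted(unknown)


template = (
    "{%p if jurisdiction_type == \u201csiac\u201d %}\n"
    "... SIAC clause ...\n"
    "{%p endif %}\n"
    "This Agreement is governed by {{ country_name(governing_law) }}.\n"
)
clean, warnings = normalise_quotes(template)
clean = strip_docx_prefixes(clean)
print(clean.splitlines()[0])  # {% if jurisdiction_type == "siac" %}
print(warnings)               # ['replaced smart quotes in tag at offset 0']
print(find_undefined_callables(clean, {"range"}, {"upper"}))  # ['country_name']
```

Being regex-based, the sketch will misread a | or ( that appears inside a string literal; a parser-backed check would be more robust if that turns out to matter.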

Ideally the fixer can run in two modes: --lint (report only, non-zero exit on issues) and --fix (write a sanitized copy next to the template). The analyzer would consume the sanitized copy and cache the diff in the manifest, so the original template-of-record is never mutated.
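
One possible shape for the two modes, as a sketch only: the --lint and --fix flags come from the paragraph above, while build_parser, sanitized_path, exit_code, and the ".sanitized" file-name convention are hypothetical. Writing the sanitized .docx itself is left out.

```python
import argparse
from pathlib import Path


def build_parser():
    """CLI with exactly one of --lint / --fix required."""
    parser = argparse.ArgumentParser(description="Lint or fix Jinja tags in a template")
    parser.add_argument("template", type=Path, help="path to the template")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--lint", action="store_true", help="report only")
    mode.add_argument("--fix", action="store_true", help="write a sanitized copy")
    return parser


def sanitized_path(template):
    """Sibling path for the sanitized copy; the original is never overwritten."""
    return template.with_name(f"{template.stem}.sanitized{template.suffix}")


def exit_code(args, warnings):
    """--lint exits non-zero when any issue was found; --fix reports success."""
    return 1 if (args.lint and warnings) else 0


args = build_parser().parse_args(["--lint", "contracts/nda.docx"])
print(sanitized_path(args.template).name)  # nda.sanitized.docx
print(exit_code(args, ["replaced smart quotes in tag at offset 0"]))  # 1
```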

Why this matters

All three issues are authoring accidents that Word introduces silently. In a real session I hit all three in a single uploaded template before the analyzer produced a usable manifest. A lint/fix pass in the analyzer would have caught them upfront with readable warnings instead of leaving an empty conditionals: [] / loops: [] manifest and failing at render time.
