diff --git a/docs/_generated/COUNTS.md b/docs/_generated/COUNTS.md index e9c657f6b..4834117bd 100644 --- a/docs/_generated/COUNTS.md +++ b/docs/_generated/COUNTS.md @@ -3,7 +3,7 @@ > Auto-generated by `scripts/generate_counts.py` from live repo state. Do not edit > manually — run `uv run python scripts/generate_counts.py --write` to refresh. -**Generated from live repo state on 2026-06-11 (UTC).** Volatile literals are re-derived on every run: tracked `infrastructure/` Python-file count via `git ls-files infrastructure | grep .py` (**551**), project-scope + publishing test collection via `pytest --collect-only` (**228** / **395**), the public exemplar roster, and the importable module list. The per-exemplar test/coverage snapshot table is a measured snapshot (see Test Status). +**Generated from live repo state on 2026-06-11 (UTC).** Volatile literals are re-derived on every run: tracked `infrastructure/` Python-file count via `git ls-files infrastructure | grep .py` (**565**), project-scope + publishing test collection via `pytest --collect-only` (**228** / **395**), the public exemplar roster, and the importable module list. The per-exemplar test/coverage snapshot table is a measured snapshot (see Test Status). This file aggregates verifiable facts from discovery scripts, CI configuration, and test execution. Human-written documentation should link here rather than duplicate lists or numbers. @@ -89,7 +89,7 @@ Tracked Python modules (matches the drift gate): git ls-files infrastructure | grep -c '\.py$' ``` -(Last refreshed count: **551** on 2026-06-11 UTC — point-in-time; re-derive with the command above, the literal drifts as the tree changes.) +(Last refreshed count: **565** on 2026-06-11 UTC — point-in-time; re-derive with the command above, the literal drifts as the tree changes.) See `infrastructure/AGENTS.md` for module-specific function signatures and entry points. 
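Review note on the count derivation above: the header sentence quotes the pipeline as `grep .py`, while the drift-gate command later in the same file uses `grep -c '\.py$'`. The two are not equivalent, since an unanchored `.py` also matches paths such as `b.pyc` or `happy.md`. The sketch below is one way to reproduce the anchored count in Python. It is an illustration only: the helper names `count_python_paths` and `tracked_python_count` are hypothetical and are not part of `scripts/generate_counts.py`, whose real implementation is not shown in this diff.

```python
import re
import subprocess
from collections.abc import Iterable

# Anchored pattern: a literal ".py" at the very end of the path,
# equivalent in intent to the shell filter grep '\.py$'.
_PY_SUFFIX = re.compile(r"\.py$")


def count_python_paths(paths: Iterable[str]) -> int:
    """Count the paths that end in a literal .py suffix."""
    return sum(1 for path in paths if _PY_SUFFIX.search(path))


def tracked_python_count(subtree: str = "infrastructure") -> int:
    """Count tracked Python files under subtree via git ls-files.

    Hypothetical helper; it must be run from inside a git work tree.
    """
    result = subprocess.run(
        ["git", "ls-files", subtree],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_python_paths(result.stdout.splitlines())
```

Splitting the pure filter from the `git` call keeps the counting rule testable without a repository; `tracked_python_count()` would then be expected to agree with the literal recorded in `COUNTS.md` only when run against the same tree state.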
diff --git a/docs/maintenance/doc-mega-decomposition.md b/docs/maintenance/doc-mega-decomposition.md new file mode 100644 index 000000000..9210aee4f --- /dev/null +++ b/docs/maintenance/doc-mega-decomposition.md @@ -0,0 +1,42 @@ +# Documentation mega-file decomposition policy + +Human-authored guides above **800 lines** are tracked as **P1 watch** items in +[`infrastructure/AGENTS.md`](../../infrastructure/AGENTS.md). They are not CI +failures; decomposition is done when a guide's edit churn or navigation cost +justifies the split. + +## When to split + +Split a mega guide when **any** of the following hold: + +1. Two or more distinct audiences (operator vs author vs API consumer) share one file. +2. More than three unrelated TOC sections are edited in the same release cycle. +3. Cross-link density inside the file exceeds ~40 internal anchors (grep `](#` count). +4. A new leaf would drop the parent below **650 lines** without losing narrative flow. + +Do **not** split generated inventories (`docs/_generated/*`, `api-reference.md`); +those are refreshed by scripts and are exempt. + +## Current P1 watch list (2026-06-11) + +| Path | Lines | Suggested leaf topics | +| --- | ---: | --- | +| [`docs/reference/api-reference.md`](../reference/api-reference.md) | 3245 | Generated — no split; refresh via `scripts/generate_api_reference_doc.py` | +| [`docs/rules/manuscript_style.md`](../rules/manuscript_style.md) | 1145 | LaTeX math · citations · figures · accessibility | +| [`docs/guides/figures-and-analysis.md`](../guides/figures-and-analysis.md) | 860 | Registry figures · analysis scripts · manifest hooks | +| [`docs/rules/llm_standards.md`](../rules/llm_standards.md) | 800 | Prompt hygiene · Ollama workflow · review templates | +| [`docs/reference/common-workflows.md`](../reference/common-workflows.md) | 813 | Pipeline · validation · publishing | + +## Leaf naming + +- Place operational splits under `docs/operational//`. 
+- Place author-facing splits under `docs/guides/-*.md`. +- Keep the parent as a **hub** with a short intro + links; do not duplicate prose. + +## Verification + +After splitting: + +1. Run `uv run python scripts/lint_docs.py`. +2. Update hub links in [`docs/documentation-index.md`](../documentation-index.md). +3. Refresh measured counts: `uv run python scripts/generate_counts.py --write`. diff --git a/docs/operational/logging/output-design.md b/docs/operational/logging/output-design.md index d0d401a8c..c0c089e05 100644 --- a/docs/operational/logging/output-design.md +++ b/docs/operational/logging/output-design.md @@ -70,7 +70,9 @@ NEXT STEPS This block is the canonical end-of-run summary. It is rendered by `format_multi_project_detailed_report` in -[`infrastructure/core/pipeline/multi_project.py`](../../../infrastructure/core/pipeline/multi_project.py), +[`infrastructure/reporting/multi_project_report.py`](../../../infrastructure/reporting/multi_project_report.py) +(re-exported from +[`infrastructure/core/pipeline/multi_project.py`](../../../infrastructure/core/pipeline/multi_project.py)), emitted by the orchestrator in [`infrastructure/orchestration/pipeline_runner.py`](../../../infrastructure/orchestration/pipeline_runner.py), and persisted verbatim to `docs/_generated/last-run-summary.md` after every diff --git a/docs/operational/reporting-guide.md b/docs/operational/reporting-guide.md index 9f2fa5d71..f46f2f854 100644 --- a/docs/operational/reporting-guide.md +++ b/docs/operational/reporting-guide.md @@ -298,8 +298,9 @@ fi ## Pipeline Summary Format The end-of-run terminal summary block is rendered by **`format_multi_project_detailed_report`** -in [`infrastructure/core/pipeline/multi_project.py`](../../infrastructure/core/pipeline/multi_project.py). 
-This is the **canonical pipeline-completion reporting surface** — every full-run option +in [`infrastructure/reporting/multi_project_report.py`](../../infrastructure/reporting/multi_project_report.py) +(re-exported from [`infrastructure/core/pipeline/multi_project.py`](../../infrastructure/core/pipeline/multi_project.py) +for backward compatibility). This is the **canonical pipeline-completion reporting surface** — every full-run option (interactive menu, `./run.sh --pipeline`, and direct `infrastructure.orchestration` invocations) prints this block via the orchestrator in [`infrastructure/orchestration/pipeline_runner.py`](../../infrastructure/orchestration/pipeline_runner.py). diff --git a/docs/reference/api-reference.md b/docs/reference/api-reference.md index f0d1ac8c8..1ad01f714 100644 --- a/docs/reference/api-reference.md +++ b/docs/reference/api-reference.md @@ -3073,7 +3073,7 @@ Scan extracted text for common rendering issues. ### `validate_citations` -*function — defined in `infrastructure.validation.content.markdown_validator`* +*function — defined in `infrastructure.validation.content.validator_citations`* ```python validate_citations(md_paths: list[str], repo_root: str | Path, bib_file: str | Path | list[str | Path] | None=None) -> list[DiagnosticEvent] @@ -3103,7 +3103,7 @@ Validate figure registry against manuscript references. ### `validate_images` -*function — defined in `infrastructure.validation.content.markdown_validator`* +*function — defined in `infrastructure.validation.content.validator_images`* ```python validate_images(md_paths: list[str], repo_root: str | Path, extra_search_dirs: list[str | Path] | None=None) -> list[DiagnosticEvent] @@ -3123,7 +3123,7 @@ Validate all markdown files in a directory. 
### `validate_math` -*function — defined in `infrastructure.validation.content.markdown_validator`* +*function — defined in `infrastructure.validation.content.validator_math`* ```python validate_math(md_paths: list[str], repo_root: str | Path) -> list[DiagnosticEvent] @@ -3143,7 +3143,7 @@ Validate complete output directory structure. ### `validate_pandoc_pitfalls` -*function — defined in `infrastructure.validation.content.markdown_validator`* +*function — defined in `infrastructure.validation.content.validator_pitfalls`* ```python validate_pandoc_pitfalls(md_paths: list[str], repo_root: str | Path) -> list[DiagnosticEvent] @@ -3163,7 +3163,7 @@ Perform comprehensive validation of PDF rendering. ### `validate_refs` -*function — defined in `infrastructure.validation.content.markdown_validator`* +*function — defined in `infrastructure.validation.content.validator_refs`* ```python validate_refs(md_paths: list[str], repo_root: str | Path, labels: set[str], anchors: set[str]) -> list[DiagnosticEvent] diff --git a/infrastructure/AGENTS.md b/infrastructure/AGENTS.md index 3fdc2fa60..3884f9b01 100644 --- a/infrastructure/AGENTS.md +++ b/infrastructure/AGENTS.md @@ -103,8 +103,8 @@ Tracked after the P0 composability pass (stage registry, unified markdown discov | `validation/integrity/link_extract.py` | 446 | **Done** (2026-06-11 close-out) — path helpers in `_link_normalize.py` (96 LOC); skip policy in `link_skip_policy.py` (144 LOC) | | `validation/integrity/_link_normalize.py` | 96 | **Done** (2026-06-11 close-out) — project-root + template path resolution for link validation | | `validation/integrity/link_skip_policy.py` | 144 | **Done** (2026-06-11) — `PATH_SKIP_*` tables + `should_validate_path()` | -| `rendering/pipeline.py` | 665 | **Partial** (2026-06-11) — DOCX metadata via `build_pandoc_metadata()`; P2: `_manuscript_source.py` + `_combined_exports.py` | -| `validation/content/markdown_validator.py` | 607 | Extract image/ref/math validators + pitfalls/citations 
leaves (discovery in `content/discovery.py`) | +| `rendering/pipeline.py` | ~180 | **Done** (2026-06-11) — orchestrator; leaves `_manuscript_source.py`, `_combined_exports.py` | +| `validation/content/markdown_validator.py` | ~75 | **Done** (2026-06-11) — facade; leaves `validator_{images,refs,math,pitfalls,citations}.py` | | `search/literature/backends.py` | — | **Done** (2026-05-29 Wave 5) — package `search/literature/backends/` | | `doctor/detectors.py` | — | **Done** (2026-05-29 Wave 6) — package `doctor/detectors/` | | `reporting/_dashboard_charts.py` | 43 | **Done** (2026-05-29 Wave 7) — facade; chart families in `_dashboard_charts_*.py` | @@ -123,7 +123,10 @@ Tracked after the P0 composability pass (stage registry, unified markdown discov | `project/drift/checks_boundary.py` | 95 | **Done** (2026-06-11 v2) — src/ ↔ infrastructure import boundary | | `publishing/archival.py` | 669 | P1 watch: split provider adapters before next archival feature | | `autoresearch/validation_checks.py` | 661 | P1 watch: monitor before next autoresearch feature wave | -| `rendering/render_all_cli.py` | — | Remove `sys.path.insert`; use `--project` / discovery like other CLIs | +| `rendering/render_all_cli.py` | — | **Done** (2026-06-11) — `--project` + `resolve_project_root`; legacy CWD `manuscript/` retained | +| `documentation/generate_glossary_cli.py` | — | **Done** (2026-06-11) — top-level imports; no `sys.path.insert` | +| Doc megas (>800 LOC) | — | Policy: [`docs/maintenance/doc-mega-decomposition.md`](../docs/maintenance/doc-mega-decomposition.md) | +| Test module line count | — | Advisory: `scripts/gates/module_line_count_check.py --include-tests` (warn ≥800, no fail) | | Package barrels | — | Lazy `__getattr__` on wide `__init__.py` hubs (`validation`, `reporting`, `publishing`, `doctor`) | ## Function Signatures diff --git a/infrastructure/documentation/generate_glossary_cli.py b/infrastructure/documentation/generate_glossary_cli.py index 6bb47962a..a2671da9e 100644 
--- a/infrastructure/documentation/generate_glossary_cli.py +++ b/infrastructure/documentation/generate_glossary_cli.py @@ -9,6 +9,11 @@ from pathlib import Path from infrastructure.core.logging.utils import get_logger +from infrastructure.documentation.glossary_gen import ( + build_api_index, + generate_markdown_table, + inject_between_markers, +) logger = get_logger(__name__) @@ -67,17 +72,6 @@ def main() -> int: _ensure_glossary_file(glossary_md) - sys.path.insert(0, str(repo)) - try: - from infrastructure.documentation.glossary_gen import ( - build_api_index, - generate_markdown_table, - inject_between_markers, - ) - except Exception as exc: # noqa: BLE001 — dynamic import; any import error is handled identically - logger.error(f"Failed to import glossary_gen from infrastructure/documentation/: {exc}") - return 1 - text = glossary_md.read_text(encoding="utf-8") entries = build_api_index(str(src_dir)) diff --git a/infrastructure/rendering/_combined_exports.py b/infrastructure/rendering/_combined_exports.py new file mode 100644 index 000000000..0eb1f6a81 --- /dev/null +++ b/infrastructure/rendering/_combined_exports.py @@ -0,0 +1,217 @@ +"""Combined PDF/HTML/DOCX/EPUB export helpers for the rendering pipeline.""" + +from __future__ import annotations + +import subprocess +import traceback +from pathlib import Path + +from infrastructure.core.exceptions import RenderingError +from infrastructure.core.logging.constants import BANNER_WIDTH +from infrastructure.core.logging.diagnostic import DiagnosticReporter, DiagnosticSeverity +from infrastructure.core.logging.utils import get_logger +from infrastructure.publishing.transmission_bookends import is_transmission_bookend +from infrastructure.rendering import RenderManager + +logger = get_logger(__name__) + + +def combined_source_files(md_files: list[Path]) -> list[Path]: + """Return combined-render inputs, ignoring missing generated transmission bookends.""" + combined_files: list[Path] = [] + for path in md_files: + 
if path.exists() or not is_transmission_bookend(path): + combined_files.append(path) + return combined_files + + +html_combined_source_files = combined_source_files + + +def resolve_combined_markdown(manuscript_dir: Path) -> Path | None: + """Find the combined-manuscript markdown produced by the combined-PDF pipeline.""" + if manuscript_dir.name == "manuscript" and manuscript_dir.parent.name == "output": + project_root = manuscript_dir.parent.parent + else: + project_root = manuscript_dir.parent + candidates = [ + project_root / "output" / "pdf" / "_combined_manuscript.md", + project_root / "output" / "tex" / "_combined_manuscript.md", + ] + for candidate in candidates: + if candidate.exists() and candidate.stat().st_size > 0: + return candidate + return None + + +def resolve_bibliography(manuscript_dir: Path) -> Path | None: + """Return the first .bib in the manuscript dir, or None if not found.""" + bibs = sorted(manuscript_dir.glob("*.bib")) + return bibs[0] if bibs else None + + +def render_combined_docx( + manager: RenderManager, + manuscript_dir: Path, + project_name: str, + reporter: DiagnosticReporter, +) -> None: + """Render the combined DOCX from the preprocessed combined markdown.""" + from infrastructure.rendering.docx_renderer import render_docx + + combined_md = resolve_combined_markdown(manuscript_dir) + if combined_md is None: + logger.warning( + "[skip] DOCX rendering: no combined markdown found (combined-PDF stage may have been skipped or failed)" + ) + return + + docx_dir = Path(manager.config.docx_dir) + docx_dir.mkdir(parents=True, exist_ok=True) + out_path = docx_dir / f"{project_name}_combined.docx" + bibliography = resolve_bibliography(manuscript_dir) + + import shutil + + extra_args = [ + "--resource-path=" + str(manuscript_dir), + "--resource-path=" + str(manager.config.figures_dir), + ] + crossref = shutil.which("pandoc-crossref") + if crossref: + extra_args.extend(["--filter", crossref]) + else: + logger.warning("pandoc-crossref not on 
PATH; DOCX @fig:/@sec:/@tbl:/@eq: will not resolve.") + if bibliography is not None: + extra_args.extend(["--citeproc", f"--bibliography={bibliography}"]) + + import yaml as _yaml + from infrastructure.rendering._pdf_title_page import _load_render_config, build_pandoc_metadata + + config, _ = _load_render_config(manuscript_dir) + if isinstance(config, dict): + meta = build_pandoc_metadata(config) + if meta: + meta_path = docx_dir / "_docx_metadata.yaml" + with meta_path.open("w", encoding="utf-8") as handle: + _yaml.safe_dump(meta, handle, allow_unicode=True, sort_keys=False) + extra_args.append(f"--metadata-file={meta_path}") + + logger.debug("\n" + "=" * BANNER_WIDTH) + logger.info("Generating combined DOCX manuscript...") + try: + result = render_docx( + combined_md, + out_path, + bibliography=None, + pandoc_path=manager.config.pandoc_path, + extra_args=extra_args, + ) + logger.info(f"✅ Generated combined DOCX: {result.output_path.name} ({result.size_bytes / 1024:.1f} KB)") + except RenderingError as re: + logger.warning(f"⚠️ Rendering error generating combined DOCX: {re.message}") + reporter.record(re.to_diagnostic_event(severity=DiagnosticSeverity.WARNING)) + except (OSError, subprocess.SubprocessError, ValueError, FileNotFoundError) as e: + logger.warning(f"⚠️ Unexpected error generating combined DOCX: {e}") + + +def render_combined_epub( + manager: RenderManager, + manuscript_dir: Path, + project_name: str, + reporter: DiagnosticReporter, +) -> None: + """Render the combined EPUB from the preprocessed combined markdown.""" + from infrastructure.rendering.epub_renderer import render_epub + + combined_md = resolve_combined_markdown(manuscript_dir) + if combined_md is None: + logger.warning( + "[skip] EPUB rendering: no combined markdown found (combined-PDF stage may have been skipped or failed)" + ) + return + + epub_dir = Path(manager.config.epub_dir) + epub_dir.mkdir(parents=True, exist_ok=True) + out_path = epub_dir / f"{project_name}_combined.epub" + 
bibliography = resolve_bibliography(manuscript_dir) + + logger.debug("\n" + "=" * BANNER_WIDTH) + logger.info("Generating combined EPUB manuscript...") + try: + result = render_epub( + combined_md, + out_path, + bibliography=bibliography, + pandoc_path=manager.config.pandoc_path, + ) + logger.info(f"✅ Generated combined EPUB: {result.output_path.name} ({result.size_bytes / 1024:.1f} KB)") + except RenderingError as re: + logger.warning(f"⚠️ Rendering error generating combined EPUB: {re.message}") + reporter.record(re.to_diagnostic_event(severity=DiagnosticSeverity.WARNING)) + except (OSError, subprocess.SubprocessError, ValueError, FileNotFoundError) as e: + logger.warning(f"⚠️ Unexpected error generating combined EPUB: {e}") + + +def render_combined_outputs( + manager: RenderManager, + md_files: list[Path], + manuscript_dir: Path, + project_name: str, + reporter: DiagnosticReporter, + rendered_count: int, +) -> None: + """Generate the combined PDF / HTML / DOCX / EPUB manuscripts.""" + config = manager.config + + if config.enable_pdf: + logger.debug("\n" + "=" * BANNER_WIDTH) + logger.info("Generating combined PDF manuscript...") + try: + combined_pdf = manager.render_combined_pdf(combined_source_files(md_files), manuscript_dir, project_name) + logger.info(f"✅ Generated combined PDF: {combined_pdf.name}") + except RenderingError as re: + logger.error(f"❌ Rendering error generating combined PDF: {re.message}") + reporter.record(re.to_diagnostic_event(severity=DiagnosticSeverity.ERROR)) + if rendered_count > 0: + logger.info(f"ℹ️ Note: {rendered_count} individual PDF(s) were generated despite combined PDF failure.") + except (OSError, subprocess.SubprocessError, ValueError, TypeError) as e: + logger.error(f"❌ Unexpected error generating combined PDF: {e}") + logger.error(f" Error type: {type(e).__name__}") + logger.error(f" Full traceback:\n{traceback.format_exc()}") + if hasattr(e, "stderr") and e.stderr: + logger.error(f" Full stderr:\n{e.stderr}") + if hasattr(e, 
"stdout") and e.stdout: + logger.error(f" Full stdout:\n{e.stdout}") + try: + combined_md_path = manuscript_dir.parent / "output" / "tex" / "_combined_manuscript.md" + if combined_md_path.exists(): + logger.error(f" Combined markdown: {combined_md_path} ({combined_md_path.stat().st_size} bytes)") + except OSError as stat_err: + logger.debug(f" Could not stat combined markdown file: {stat_err}") + logger.warning(" This is an unexpected error - please report this issue") + else: + logger.info("[skip] PDF rendering disabled in config (render.formats.pdf=false)") + + if config.enable_html: + logger.debug("\n" + "=" * BANNER_WIDTH) + logger.info("Generating combined HTML manuscript...") + try: + manager.render_combined_web(combined_source_files(md_files), manuscript_dir, project_name) + except RenderingError as re: + logger.warning(f"⚠️ Rendering error generating combined HTML: {re.message}") + reporter.record(re.to_diagnostic_event(severity=DiagnosticSeverity.WARNING)) + except (OSError, subprocess.SubprocessError, ValueError) as e: + logger.warning(f"⚠️ Unexpected error generating combined HTML: {e}") + else: + logger.info("[skip] HTML rendering disabled in config (render.formats.html=false)") + + if config.enable_docx: + render_combined_docx(manager, manuscript_dir, project_name, reporter) + else: + logger.debug("[skip] DOCX rendering disabled in config (default; render.formats.docx=true to enable)") + + if config.enable_epub: + render_combined_epub(manager, manuscript_dir, project_name, reporter) + else: + logger.debug("[skip] EPUB rendering disabled in config (default; render.formats.epub=true to enable)") diff --git a/infrastructure/rendering/_manuscript_source.py b/infrastructure/rendering/_manuscript_source.py new file mode 100644 index 000000000..3956ba72a --- /dev/null +++ b/infrastructure/rendering/_manuscript_source.py @@ -0,0 +1,223 @@ +"""Manuscript source resolution and per-file rendering helpers.""" + +from __future__ import annotations + +import 
subprocess +from pathlib import Path +from typing import Any + +from infrastructure.core.exceptions import RenderingError, ValidationError +from infrastructure.core.logging.constants import BANNER_WIDTH +from infrastructure.core.logging.diagnostic import DiagnosticReporter, DiagnosticSeverity +from infrastructure.core.logging.utils import get_logger, log_success +from infrastructure.core.progress import SubStageProgress +from infrastructure.publishing.transmission_bookends import is_transmission_bookend +from infrastructure.rendering import RenderManager +from infrastructure.rendering.latex_package_validator import validate_preamble_packages +from infrastructure.rendering.latex_validation import ValidationReport + +logger = get_logger(__name__) + + +def has_generated_manuscript_ordering(config_path: Path) -> bool: + """Return True when an injected config owns generated manuscript ordering.""" + if not config_path.is_file(): + return False + return "# Generated manuscript ordering" in config_path.read_text(encoding="utf-8") + + +def resolve_manuscript_dir(project_root: Path) -> Path: + """Return the manuscript directory to render from.""" + import shutil as _shutil + + source_dir = project_root / "manuscript" + injected_dir = project_root / "output" / "manuscript" + if injected_dir.exists() and any(injected_dir.glob("*.md")): + if source_dir.is_dir(): + cfg_src = source_dir / "config.yaml" + cfg_dst = injected_dir / "config.yaml" + if cfg_src.is_file(): + if has_generated_manuscript_ordering(cfg_dst): + logger.info( + "Preserved generated config.yaml ordering in injected manuscript: %s", + cfg_dst, + ) + else: + _shutil.copy2(cfg_src, cfg_dst) + logger.info(f"Refreshed config.yaml in injected manuscript: {cfg_dst}") + for bib in sorted(source_dir.glob("*.bib")): + bib_dst = injected_dir / bib.name + _shutil.copy2(bib, bib_dst) + logger.info(f"Refreshed {bib.name} in injected manuscript: {bib_dst}") + logger.info(f"Rendering from injected manuscript directory: 
{injected_dir}") + return injected_dir + return source_dir + + +def run_override_script(project_root: Path, override_script: Path) -> int: + """Delegate rendering to a project-specific override script.""" + from infrastructure.core.runtime.environment import get_python_command + + logger.info(f"⚡ Found custom render override: {override_script.name}") + logger.info("Transferring control to project-specific renderer...") + cmd = get_python_command() + [str(override_script)] + try: + result = subprocess.run(cmd, cwd=str(project_root), check=False, timeout=300) # nosec B603 + if result.returncode == 0: + log_success("Custom PDF rendering completed successfully", logger) + else: + logger.error(f"Custom PDF rendering failed (exit code {result.returncode})") + return result.returncode + except (subprocess.SubprocessError, OSError) as e: + logger.error(f"Failed to execute custom renderer: {e}") + return 1 + + +def run_manuscript_variable_script( + project_root: Path, + template_repo_root: Path | None = None, +) -> int: + """Hydrate project manuscript variables before rendering, when available.""" + import os + + from infrastructure.core.runtime.environment import get_python_command + + script = project_root / "scripts" / "z_generate_manuscript_variables.py" + if not script.is_file(): + return 0 + + logger.info("Hydrating manuscript variables before render: %s", script.name) + cmd = get_python_command() + [str(script)] + env = os.environ.copy() + if template_repo_root is not None: + env.setdefault("TEMPLATE_REPO_ROOT", str(template_repo_root)) + try: + result = subprocess.run( # nosec B603 + cmd, + cwd=str(project_root), + env=env, + check=False, + timeout=300, + ) + except (subprocess.SubprocessError, OSError) as exc: + logger.error("Manuscript variable hydration failed to execute: %s", exc) + return 1 + + if result.returncode != 0: + logger.error("Manuscript variable hydration failed (exit code %s)", result.returncode) + return 1 + log_success("Manuscript variables 
hydrated", logger) + return 0 + + +def validate_latex_packages(report: ValidationReport | None = None) -> int: + """Run pre-flight LaTeX package validation.""" + logger.info("Running pre-flight LaTeX package validation...") + try: + if report is None: + report = validate_preamble_packages(strict=False) + if not report.all_required_available: + logger.error("❌ Missing required LaTeX packages!") + logger.error(f" Missing: {', '.join(report.missing_required)}") + logger.error(f" Install: sudo tlmgr install {' '.join(report.missing_required)}") + return 1 + if report.missing_optional: + logger.warning(f"⚠️ Missing {len(report.missing_optional)} optional package(s):") + for pkg in report.missing_optional: + logger.warning(f" - {pkg}") + logger.warning(" PDF will render with reduced functionality") + logger.info(f" To install: sudo tlmgr install {' '.join(report.missing_optional)}") + else: + logger.info("✓ All LaTeX packages available") + except ValidationError as e: + logger.error(f"❌ LaTeX package validation failed: {e}") + for suggestion in e.suggestions: + logger.error(f" {suggestion}") + return 1 + except (OSError, subprocess.SubprocessError) as e: + logger.warning(f"⚠️ Could not validate LaTeX packages: {e}") + logger.warning(" Proceeding anyway - compilation may fail if packages are missing") + return 0 + + +def log_manuscript_composition(source_files: list[Path]) -> None: + """Log the manuscript file composition summary with file sizes.""" + md_files = [f for f in source_files if f.suffix == ".md"] + tex_files = [f for f in source_files if f.suffix == ".tex"] + logger.info("\n" + "=" * BANNER_WIDTH) + logger.info(f"MANUSCRIPT COMPOSITION ({len(source_files)} files)") + logger.info("=" * BANNER_WIDTH) + if md_files: + logger.info(f"Markdown sections ({len(md_files)}):") + for f in md_files: + size_kb = f.stat().st_size / 1024 + logger.info(f" • {f.name:<40} ({size_kb:>6.1f} KB)") + total_size_kb = sum(f.stat().st_size for f in md_files) / 1024 + logger.info(f" 
{'Total markdown:':<40} ({total_size_kb:>6.1f} KB)") + if tex_files: + logger.info(f"LaTeX files ({len(tex_files)}):") + for f in tex_files: + size_kb = f.stat().st_size / 1024 + logger.info(f" • {f.name:<40} ({size_kb:>6.1f} KB)") + logger.info("=" * BANNER_WIDTH + "\n") + + +def load_project_config_yaml(manuscript_dir: Path) -> dict[str, Any] | None: + """Load the manuscript ``config.yaml`` as a plain dict for render-format toggles.""" + cfg = manuscript_dir / "config.yaml" + if not cfg.is_file(): + return None + try: + import yaml + except ImportError: + logger.debug("PyYAML not available; cannot read render.formats from config.yaml") + return None + try: + with cfg.open("r", encoding="utf-8") as fh: + data = yaml.safe_load(fh) + return data if isinstance(data, dict) else None + except (OSError, yaml.YAMLError) as exc: + logger.debug(f"Could not parse {cfg.name} for render formats: {exc}") + return None + + +def render_individual_files( + manager: RenderManager, + source_files: list[Path], + reporter: DiagnosticReporter, +) -> tuple[int, list[str]]: + """Render each source file; return (rendered_count, failed_file_names).""" + rendered_count = 0 + failed_files: list[str] = [] + progress = SubStageProgress(total=len(source_files), stage_name="Rendering Files") + for i, source_file in enumerate(source_files, 1): + progress.start_substage(i, source_file.name) + try: + if is_transmission_bookend(source_file): + logger.debug( + "Skipping per-file render for transmission bookend (combined PDF only): %s", + source_file.name, + ) + progress.complete_substage() + continue + outputs = manager.render_all(source_file) + if outputs: + for output_path in outputs: + logger.debug(f" Generated: {output_path.name}") + rendered_count += 1 + else: + logger.warning(f" No output generated for {source_file.name}") + except RenderingError as re: + logger.warning(f" ❌ Rendering error for {source_file.name}: {re.message}") + 
reporter.record(re.to_diagnostic_event(severity=DiagnosticSeverity.ERROR)) + failed_files.append(source_file.name) + except (OSError, subprocess.SubprocessError, ValueError) as e: + logger.warning(f" ❌ Unexpected error rendering {source_file.name}: {e}") + reporter.record_error( + category="UnexpectedError", + message=f"Unexpected error rendering {source_file.name}: {e}", + file_path=source_file.name, + ) + failed_files.append(source_file.name) + progress.complete_substage() + return rendered_count, failed_files diff --git a/infrastructure/rendering/pipeline.py b/infrastructure/rendering/pipeline.py index c28fc87e9..bf4cfea56 100644 --- a/infrastructure/rendering/pipeline.py +++ b/infrastructure/rendering/pipeline.py @@ -8,514 +8,44 @@ 5. Verifying output quality """ -import subprocess +from __future__ import annotations + from pathlib import Path -from typing import Any -from infrastructure.core.exceptions import RenderingError, ValidationError -from infrastructure.core.logging.constants import BANNER_WIDTH -from infrastructure.core.logging.utils import get_logger, log_success, log_live_resource_usage -from infrastructure.core.progress import SubStageProgress -from infrastructure.rendering import RenderManager -from infrastructure.rendering.config import RenderingConfig -from infrastructure.core.logging.diagnostic import DiagnosticReporter, DiagnosticSeverity -from infrastructure.rendering.manuscript_discovery import ( - discover_manuscript_files, - verify_figures_exist, -) -from infrastructure.rendering.latex_package_validator import validate_preamble_packages -from infrastructure.rendering.latex_validation import ValidationReport +from infrastructure.core.logging.utils import get_logger, log_live_resource_usage, log_success from infrastructure.project.discovery import resolve_project_root -from infrastructure.publishing.transmission_bookends import is_transmission_bookend - -# Re-exports for backwards compatibility -from 
infrastructure.rendering._pipeline_summary import ( # noqa: F401 +from infrastructure.rendering._combined_exports import ( # noqa: F401 + combined_source_files as _combined_source_files, + html_combined_source_files as _html_combined_source_files, + render_combined_docx as _render_combined_docx, + render_combined_epub as _render_combined_epub, + render_combined_outputs as _render_combined_outputs, +) +from infrastructure.rendering._manuscript_source import ( # noqa: F401 + has_generated_manuscript_ordering as _has_generated_manuscript_ordering, + load_project_config_yaml as _load_project_config_yaml, + log_manuscript_composition as _log_manuscript_composition, + render_individual_files as _render_individual_files, + resolve_manuscript_dir as _resolve_manuscript_dir, + run_manuscript_variable_script as _run_manuscript_variable_script, + run_override_script as _run_override_script, + validate_latex_packages as _validate_latex_packages, +) +from infrastructure.rendering._pipeline_summary import ( generate_rendering_summary, log_rendering_summary, verify_pdf_outputs, ) +from infrastructure.rendering.config import RenderingConfig +from infrastructure.rendering.manuscript_discovery import discover_manuscript_files, verify_figures_exist +from infrastructure.rendering import RenderManager +from infrastructure.core.logging.diagnostic import DiagnosticReporter logger = get_logger(__name__) -def _has_generated_manuscript_ordering(config_path: Path) -> bool: - """Return True when an injected config owns generated manuscript ordering.""" - if not config_path.is_file(): - return False - return "# Generated manuscript ordering" in config_path.read_text(encoding="utf-8") - - -def _resolve_manuscript_dir(project_root: Path) -> Path: - """Return the manuscript directory to render from. - - Prefers the injected output/manuscript/ directory when it exists and - contains markdown files; falls back to the source manuscript/ directory. 
- - When the injected dir is selected, this function also refreshes the - rendering-critical auxiliary files (``config.yaml`` and ``*.bib``) from - the source ``manuscript/`` directory. Without a fresh ``config.yaml``, - title-page and TOC injection can render from stale metadata. - """ - import shutil as _shutil - - source_dir = project_root / "manuscript" - injected_dir = project_root / "output" / "manuscript" - if injected_dir.exists() and any(injected_dir.glob("*.md")): - if source_dir.is_dir(): - cfg_src = source_dir / "config.yaml" - cfg_dst = injected_dir / "config.yaml" - if cfg_src.is_file(): - if _has_generated_manuscript_ordering(cfg_dst): - logger.info( - "Preserved generated config.yaml ordering in injected manuscript: %s", - cfg_dst, - ) - else: - _shutil.copy2(cfg_src, cfg_dst) - logger.info(f"Refreshed config.yaml in injected manuscript: {cfg_dst}") - for bib in sorted(source_dir.glob("*.bib")): - bib_dst = injected_dir / bib.name - _shutil.copy2(bib, bib_dst) - logger.info(f"Refreshed {bib.name} in injected manuscript: {bib_dst}") - logger.info(f"Rendering from injected manuscript directory: {injected_dir}") - return injected_dir - return source_dir - - -def _run_override_script(project_root: Path, override_script: Path) -> int: - """Delegate rendering to a project-specific override script. - - Returns the exit code from the override script (0 = success). 
- """ - from infrastructure.core.runtime.environment import get_python_command - - logger.info(f"⚡ Found custom render override: {override_script.name}") - logger.info("Transferring control to project-specific renderer...") - cmd = get_python_command() + [str(override_script)] - try: - result = subprocess.run(cmd, cwd=str(project_root), check=False, timeout=300) # nosec B603 - if result.returncode == 0: - log_success("Custom PDF rendering completed successfully", logger) - else: - logger.error(f"Custom PDF rendering failed (exit code {result.returncode})") - return result.returncode - except (subprocess.SubprocessError, OSError) as e: - logger.error(f"Failed to execute custom renderer: {e}") - return 1 - - -def _run_manuscript_variable_script( - project_root: Path, - template_repo_root: Path | None = None, -) -> int: - """Hydrate project manuscript variables before rendering, when available.""" - import os - - from infrastructure.core.runtime.environment import get_python_command - - script = project_root / "scripts" / "z_generate_manuscript_variables.py" - if not script.is_file(): - return 0 - - logger.info("Hydrating manuscript variables before render: %s", script.name) - cmd = get_python_command() + [str(script)] - env = os.environ.copy() - if template_repo_root is not None: - env.setdefault("TEMPLATE_REPO_ROOT", str(template_repo_root)) - try: - result = subprocess.run( # nosec B603 - cmd, - cwd=str(project_root), - env=env, - check=False, - timeout=300, - ) - except (subprocess.SubprocessError, OSError) as exc: - logger.error("Manuscript variable hydration failed to execute: %s", exc) - return 1 - - if result.returncode != 0: - logger.error("Manuscript variable hydration failed (exit code %s)", result.returncode) - return 1 - log_success("Manuscript variables hydrated", logger) - return 0 - - -def _validate_latex_packages(report: ValidationReport | None = None) -> int: - """Run pre-flight LaTeX package validation. 
- - Args: - report: Pre-built ValidationReport to evaluate. If None the function - calls validate_preamble_packages() to obtain one at runtime. - Passing a report directly makes the function testable with real - dataclass instances without requiring a LaTeX installation. - - Returns: - 0 if validation passed (or could not run), 1 if required packages - are missing. - """ - logger.info("Running pre-flight LaTeX package validation...") - try: - if report is None: - report = validate_preamble_packages(strict=False) - if not report.all_required_available: - logger.error("❌ Missing required LaTeX packages!") - logger.error(f" Missing: {', '.join(report.missing_required)}") - logger.error(f" Install: sudo tlmgr install {' '.join(report.missing_required)}") - return 1 - if report.missing_optional: - logger.warning(f"⚠️ Missing {len(report.missing_optional)} optional package(s):") - for pkg in report.missing_optional: - logger.warning(f" - {pkg}") - logger.warning(" PDF will render with reduced functionality") - logger.info(f" To install: sudo tlmgr install {' '.join(report.missing_optional)}") - else: - logger.info("✓ All LaTeX packages available") - except ValidationError as e: - logger.error(f"❌ LaTeX package validation failed: {e}") - for suggestion in e.suggestions: - logger.error(f" {suggestion}") - return 1 - except (OSError, subprocess.SubprocessError) as e: - logger.warning(f"⚠️ Could not validate LaTeX packages: {e}") - logger.warning(" Proceeding anyway - compilation may fail if packages are missing") - return 0 - - -def _log_manuscript_composition(source_files: list[Path]) -> None: - """Log the manuscript file composition summary with file sizes.""" - md_files = [f for f in source_files if f.suffix == ".md"] - tex_files = [f for f in source_files if f.suffix == ".tex"] - logger.info("\n" + "=" * BANNER_WIDTH) - logger.info(f"MANUSCRIPT COMPOSITION ({len(source_files)} files)") - logger.info("=" * BANNER_WIDTH) - if md_files: - logger.info(f"Markdown sections 
({len(md_files)}):") - for f in md_files: - size_kb = f.stat().st_size / 1024 - logger.info(f" • {f.name:<40} ({size_kb:>6.1f} KB)") - total_size_kb = sum(f.stat().st_size for f in md_files) / 1024 - logger.info(f" {'Total markdown:':<40} ({total_size_kb:>6.1f} KB)") - if tex_files: - logger.info(f"LaTeX files ({len(tex_files)}):") - for f in tex_files: - size_kb = f.stat().st_size / 1024 - logger.info(f" • {f.name:<40} ({size_kb:>6.1f} KB)") - logger.info("=" * BANNER_WIDTH + "\n") - - -def _load_project_config_yaml(manuscript_dir: Path) -> dict[str, Any] | None: - """Load the manuscript ``config.yaml`` as a plain dict for render-format toggles. - - Returns None when the file is missing, unparseable, or PyYAML is unavailable — - callers must fall back to defaults in that case. Best-effort by design. - """ - cfg = manuscript_dir / "config.yaml" - if not cfg.is_file(): - return None - try: - import yaml - except ImportError: - logger.debug("PyYAML not available; cannot read render.formats from config.yaml") - return None - try: - with cfg.open("r", encoding="utf-8") as fh: - data = yaml.safe_load(fh) - return data if isinstance(data, dict) else None - except (OSError, yaml.YAMLError) as exc: - logger.debug(f"Could not parse {cfg.name} for render formats: {exc}") - return None - - -def _render_individual_files( - manager: "RenderManager", - source_files: list[Path], - reporter: "DiagnosticReporter", -) -> tuple[int, list[str]]: - """Render each source file; return (rendered_count, failed_file_names). - - Per-file chatter (each output path) is emitted at DEBUG; the stage-level - progress bar carries the user-facing signal at INFO. 
- """ - rendered_count = 0 - failed_files: list[str] = [] - progress = SubStageProgress(total=len(source_files), stage_name="Rendering Files") - for i, source_file in enumerate(source_files, 1): - progress.start_substage(i, source_file.name) - try: - if is_transmission_bookend(source_file): - logger.debug( - "Skipping per-file render for transmission bookend (combined PDF only): %s", - source_file.name, - ) - progress.complete_substage() - continue - outputs = manager.render_all(source_file) - if outputs: - for output_path in outputs: - logger.debug(f" Generated: {output_path.name}") - rendered_count += 1 - else: - logger.warning(f" No output generated for {source_file.name}") - except RenderingError as re: - logger.warning(f" ❌ Rendering error for {source_file.name}: {re.message}") - reporter.record(re.to_diagnostic_event(severity=DiagnosticSeverity.ERROR)) - failed_files.append(source_file.name) - except (OSError, subprocess.SubprocessError, ValueError) as e: - logger.warning(f" ❌ Unexpected error rendering {source_file.name}: {e}") - reporter.record_error( - category="UnexpectedError", - message=f"Unexpected error rendering {source_file.name}: {e}", - file_path=source_file.name, - ) - failed_files.append(source_file.name) - progress.complete_substage() - return rendered_count, failed_files - - -def _render_combined_outputs( - manager: "RenderManager", - md_files: list[Path], - manuscript_dir: Path, - project_name: str, - reporter: "DiagnosticReporter", - rendered_count: int, -) -> None: - """Generate the combined PDF / HTML / DOCX / EPUB manuscripts. - - Each format is gated on the corresponding ``manager.config.enable_`` - boolean. Defaults preserve current behavior (PDF + HTML on; DOCX + EPUB - off — opt in via ``render.formats.{docx,epub}: true`` in - ``manuscript/config.yaml``). 
- """ - import traceback - - config = manager.config - - # ── Combined PDF ─────────────────────────────────────────────── - if config.enable_pdf: - logger.debug("\n" + "=" * BANNER_WIDTH) - logger.info("Generating combined PDF manuscript...") - try: - combined_pdf = manager.render_combined_pdf(_combined_source_files(md_files), manuscript_dir, project_name) - logger.info(f"✅ Generated combined PDF: {combined_pdf.name}") - except RenderingError as re: - logger.error(f"❌ Rendering error generating combined PDF: {re.message}") - reporter.record(re.to_diagnostic_event(severity=DiagnosticSeverity.ERROR)) - if rendered_count > 0: - logger.info(f"ℹ️ Note: {rendered_count} individual PDF(s) were generated despite combined PDF failure.") - except (OSError, subprocess.SubprocessError, ValueError, TypeError) as e: - logger.error(f"❌ Unexpected error generating combined PDF: {e}") - logger.error(f" Error type: {type(e).__name__}") - logger.error(f" Full traceback:\n{traceback.format_exc()}") - if hasattr(e, "stderr") and e.stderr: - logger.error(f" Full stderr:\n{e.stderr}") - if hasattr(e, "stdout") and e.stdout: - logger.error(f" Full stdout:\n{e.stdout}") - try: - combined_md_path = manuscript_dir.parent / "output" / "tex" / "_combined_manuscript.md" - if combined_md_path.exists(): - logger.error(f" Combined markdown: {combined_md_path} ({combined_md_path.stat().st_size} bytes)") - except OSError as stat_err: - logger.debug(f" Could not stat combined markdown file: {stat_err}") - logger.warning(" This is an unexpected error - please report this issue") - else: - logger.info("[skip] PDF rendering disabled in config (render.formats.pdf=false)") - - # ── Combined HTML ────────────────────────────────────────────── - if config.enable_html: - logger.debug("\n" + "=" * BANNER_WIDTH) - logger.info("Generating combined HTML manuscript...") - try: - manager.render_combined_web(_combined_source_files(md_files), manuscript_dir, project_name) - except RenderingError as re: - 
logger.warning(f"⚠️ Rendering error generating combined HTML: {re.message}") - reporter.record(re.to_diagnostic_event(severity=DiagnosticSeverity.WARNING)) - except (OSError, subprocess.SubprocessError, ValueError) as e: - logger.warning(f"⚠️ Unexpected error generating combined HTML: {e}") - else: - logger.info("[skip] HTML rendering disabled in config (render.formats.html=false)") - - # ── Combined DOCX (opt-in) ───────────────────────────────────── - if config.enable_docx: - _render_combined_docx(manager, manuscript_dir, project_name, reporter) - else: - logger.debug("[skip] DOCX rendering disabled in config (default; render.formats.docx=true to enable)") - - # ── Combined EPUB (opt-in) ───────────────────────────────────── - if config.enable_epub: - _render_combined_epub(manager, manuscript_dir, project_name, reporter) - else: - logger.debug("[skip] EPUB rendering disabled in config (default; render.formats.epub=true to enable)") - - -def _combined_source_files(md_files: list[Path]) -> list[Path]: - """Return combined-render inputs, ignoring missing generated transmission bookends.""" - combined_files: list[Path] = [] - for path in md_files: - if path.exists() or not is_transmission_bookend(path): - combined_files.append(path) - return combined_files - - -_html_combined_source_files = _combined_source_files - - -def _resolve_combined_markdown(manuscript_dir: Path) -> Path | None: - """Find the combined-manuscript markdown produced by the combined-PDF pipeline. - - The combined renderer writes to ``/output/pdf/_combined_manuscript.md`` - or ``/output/tex/_combined_manuscript.md`` depending on layout. - DOCX + EPUB rendering reuses this preprocessed source. - - ``manuscript_dir`` may be either the source ``/manuscript/`` or the - injected ``/output/manuscript/``. We canonicalise to ``/`` - and then probe ``output/{pdf,tex}/_combined_manuscript.md``. 
- """ - if manuscript_dir.name == "manuscript" and manuscript_dir.parent.name == "output": - project_root = manuscript_dir.parent.parent - else: - project_root = manuscript_dir.parent - candidates = [ - project_root / "output" / "pdf" / "_combined_manuscript.md", - project_root / "output" / "tex" / "_combined_manuscript.md", - ] - for candidate in candidates: - if candidate.exists() and candidate.stat().st_size > 0: - return candidate - return None - - -def _resolve_bibliography(manuscript_dir: Path) -> Path | None: - """Return the first .bib in the manuscript dir, or None if not found.""" - bibs = sorted(manuscript_dir.glob("*.bib")) - return bibs[0] if bibs else None - - -def _render_combined_docx( - manager: "RenderManager", - manuscript_dir: Path, - project_name: str, - reporter: "DiagnosticReporter", -) -> None: - """Render the combined DOCX from the preprocessed combined markdown.""" - from infrastructure.rendering.docx_renderer import render_docx - - combined_md = _resolve_combined_markdown(manuscript_dir) - if combined_md is None: - logger.warning( - "[skip] DOCX rendering: no combined markdown found (combined-PDF stage may have been skipped or failed)" - ) - return - - docx_dir = Path(manager.config.docx_dir) - docx_dir.mkdir(parents=True, exist_ok=True) - out_path = docx_dir / f"{project_name}_combined.docx" - bibliography = _resolve_bibliography(manuscript_dir) - - # Mirror the combined-PDF pandoc setup so figures embed and - # @fig:/@sec:/@tbl:/@eq: cross-references resolve in DOCX. pandoc-crossref - # MUST precede citeproc, otherwise crossref refs are consumed as unknown - # citations. Bibliography/citeproc is therefore passed via extra_args (after - # the crossref filter) rather than through render_docx's own --citeproc. 
- import shutil - - extra_args = [ - "--resource-path=" + str(manuscript_dir), - "--resource-path=" + str(manager.config.figures_dir), - ] - crossref = shutil.which("pandoc-crossref") - if crossref: - extra_args.extend(["--filter", crossref]) - else: - logger.warning("pandoc-crossref not on PATH; DOCX @fig:/@sec:/@tbl:/@eq: will not resolve.") - if bibliography is not None: - extra_args.extend(["--citeproc", f"--bibliography={bibliography}"]) - - # Inject title/author front matter from config.yaml. The PDF gets this via - # LaTeX title-page injection; the shared combined markdown carries no - # metadata block, so without this the DOCX would have no title or authors. - import yaml as _yaml - from infrastructure.rendering._pdf_title_page import _load_render_config, build_pandoc_metadata - - config, _ = _load_render_config(manuscript_dir) - if isinstance(config, dict): - meta = build_pandoc_metadata(config) - if meta: - meta_path = docx_dir / "_docx_metadata.yaml" - with meta_path.open("w", encoding="utf-8") as handle: - _yaml.safe_dump(meta, handle, allow_unicode=True, sort_keys=False) - extra_args.append(f"--metadata-file={meta_path}") - - logger.debug("\n" + "=" * BANNER_WIDTH) - logger.info("Generating combined DOCX manuscript...") - try: - result = render_docx( - combined_md, - out_path, - bibliography=None, # handled in extra_args (must follow pandoc-crossref) - pandoc_path=manager.config.pandoc_path, - extra_args=extra_args, - ) - logger.info(f"✅ Generated combined DOCX: {result.output_path.name} ({result.size_bytes / 1024:.1f} KB)") - except RenderingError as re: - logger.warning(f"⚠️ Rendering error generating combined DOCX: {re.message}") - reporter.record(re.to_diagnostic_event(severity=DiagnosticSeverity.WARNING)) - except (OSError, subprocess.SubprocessError, ValueError, FileNotFoundError) as e: - logger.warning(f"⚠️ Unexpected error generating combined DOCX: {e}") - - -def _render_combined_epub( - manager: "RenderManager", - manuscript_dir: Path, - 
project_name: str, - reporter: "DiagnosticReporter", -) -> None: - """Render the combined EPUB from the preprocessed combined markdown.""" - from infrastructure.rendering.epub_renderer import render_epub - - combined_md = _resolve_combined_markdown(manuscript_dir) - if combined_md is None: - logger.warning( - "[skip] EPUB rendering: no combined markdown found (combined-PDF stage may have been skipped or failed)" - ) - return - - epub_dir = Path(manager.config.epub_dir) - epub_dir.mkdir(parents=True, exist_ok=True) - out_path = epub_dir / f"{project_name}_combined.epub" - bibliography = _resolve_bibliography(manuscript_dir) - - logger.debug("\n" + "=" * BANNER_WIDTH) - logger.info("Generating combined EPUB manuscript...") - try: - result = render_epub( - combined_md, - out_path, - bibliography=bibliography, - pandoc_path=manager.config.pandoc_path, - ) - logger.info(f"✅ Generated combined EPUB: {result.output_path.name} ({result.size_bytes / 1024:.1f} KB)") - except RenderingError as re: - logger.warning(f"⚠️ Rendering error generating combined EPUB: {re.message}") - reporter.record(re.to_diagnostic_event(severity=DiagnosticSeverity.WARNING)) - except (OSError, subprocess.SubprocessError, ValueError, FileNotFoundError) as e: - logger.warning(f"⚠️ Unexpected error generating combined EPUB: {e}") - - def _render_pipeline_impl(project_name: str = "project", *, skip_manuscript_hydration: bool = False) -> int: - """Execute the PDF rendering pipeline using infrastructure rendering. - - This pipeline: - 1. Validates LaTeX packages (pre-flight check) - 2. Verifies figures from analysis stage - 3. Renders individual manuscript files to multiple formats - 4. Generates a combined PDF from all manuscript sections - 5. 
Reports on all generated outputs - - Args: - project_name: Name of project in projects/ directory (default: "project") - """ + """Execute the PDF rendering pipeline using infrastructure rendering.""" logger.info(f"Executing PDF rendering pipeline for project '{project_name}'...") repo_root = Path(__file__).parent.parent.parent project_root = resolve_project_root(repo_root, project_name) @@ -622,17 +152,7 @@ def _render_pipeline_impl(project_name: str = "project", *, skip_manuscript_hydr def execute_render_pipeline(project_name: str = "project", *, skip_manuscript_hydration: bool = False) -> int: - """Execute PDF rendering orchestration. - - Args: - project_name: Name of project in projects/ directory. - skip_manuscript_hydration: when True, skip the (slow) manuscript-variable - hydration step before rendering — for fast title-page/metadata - re-renders that do not need an analysis rebuild. - - Returns: - Exit code (0=success, 1=failure) - """ + """Execute PDF rendering orchestration.""" log_live_resource_usage("PDF rendering stage start", logger) try: exit_code = _render_pipeline_impl(project_name, skip_manuscript_hydration=skip_manuscript_hydration) @@ -655,10 +175,8 @@ def execute_render_pipeline(project_name: str = "project", *, skip_manuscript_hy __all__ = [ - # Re-exports from _pipeline_summary "generate_rendering_summary", "log_rendering_summary", "verify_pdf_outputs", - # Public entry point "execute_render_pipeline", ] diff --git a/infrastructure/rendering/render_all_cli.py b/infrastructure/rendering/render_all_cli.py index 555182390..ba1e3f4fb 100644 --- a/infrastructure/rendering/render_all_cli.py +++ b/infrastructure/rendering/render_all_cli.py @@ -1,29 +1,49 @@ #!/usr/bin/env python3 """Wrapper script for rendering all formats.""" -import sys -from pathlib import Path +from __future__ import annotations -# Add root to path -root_dir = Path(__file__).resolve().parent.parent -sys.path.insert(0, str(root_dir)) +import argparse +from pathlib import Path from 
infrastructure.core.logging.utils import get_logger +from infrastructure.project.discovery import resolve_project_root from infrastructure.rendering import RenderManager logger = get_logger(__name__) +REPO_ROOT = Path(__file__).resolve().parent.parent.parent + + +def _resolve_manuscript_dir(project: str | None) -> Path: + if project: + project_root = resolve_project_root(REPO_ROOT, project) + manuscript_dir = project_root / "manuscript" + if manuscript_dir.is_dir(): + return manuscript_dir + logger.error("No manuscript directory for project %r: %s", project, manuscript_dir) + raise SystemExit(1) + legacy = Path("manuscript") + if legacy.is_dir(): + return legacy -def main() -> None: + logger.error("No manuscript directory found.") + raise SystemExit(1) + + +def main(argv: list[str] | None = None) -> None: """Render all formats for all manuscript files.""" + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument( + "--project", + default=None, + help="Qualified project name (for example template_code_project or templates/template_code_project)", + ) + args = parser.parse_args([] if argv is None else argv) + + manuscript_dir = _resolve_manuscript_dir(args.project) manager = RenderManager() - # Find sources - manuscript_dir = Path("manuscript") - if not manuscript_dir.exists(): - logger.error("No manuscript directory found.") - raise SystemExit(1) - for source in manuscript_dir.glob("*.tex"): logger.info(f"Rendering {source}...") outputs = manager.render_all(source) @@ -32,4 +52,6 @@ def main() -> None: if __name__ == "__main__": - main() + import sys + + main(sys.argv[1:]) diff --git a/infrastructure/validation/content/markdown_validator.py b/infrastructure/validation/content/markdown_validator.py index 79a01321f..3ccf6d058 100644 --- a/infrastructure/validation/content/markdown_validator.py +++ b/infrastructure/validation/content/markdown_validator.py @@ -1,560 +1,39 @@ -"""Markdown validation utilities for ensuring document integrity. 
+"""Markdown validation facade — orchestrates leaf validators. -This module provides comprehensive validation of markdown files including: -- Image reference validation -- Cross-reference validation -- Mathematical equation validation -- Link and URL validation - -This is part of the infrastructure layer (generic, reusable validation). +Image, reference, math, Pandoc-pitfall, and citation checks live in sibling +``validator_*.py`` modules; this file keeps the public entry points stable. """ -import re +from __future__ import annotations + from pathlib import Path from infrastructure.core.exceptions import FileNotFoundError from infrastructure.core.logging import DiagnosticEvent, DiagnosticSeverity -from infrastructure.core.logging.utils import get_logger from infrastructure.validation.content.discovery import discover_markdown_files -from infrastructure.validation.content.diagnostic_codes import ( - BibtexCode, - MarkdownCode, -) -from infrastructure.validation.content.markdown_strip import ( - strip_code_and_math, - strip_fences, - strip_markdown_code_regions, -) from infrastructure.validation.content.symbols import collect_symbols - -logger = get_logger(__name__) - -# Regex patterns for validation -IMG_PATTERN = re.compile(r"!\[[^\]]*\]\(([^\)]+)\)") -EQ_LABEL_PATTERN = re.compile(r"\\label\{([^}]+)\}") -EQ_REF_PATTERN = re.compile(r"\\eqref\{([^}]+)\}") -ANCHOR_PATTERN = re.compile(r"\{#([^}]+)\}") -INTERNAL_LINK_PATTERN = re.compile(r"\(#([^\)]+)\)") -LINK_PATTERN = re.compile(r"\[([^\]]+)\]\((https?://[^\)]+)\)") -BARE_URL_PATTERN = re.compile(r"(? str: - """Remove triple-backtick and tilde-fenced code blocks.""" - return strip_fences(text) - - -# Pandoc converts bare ``|word|`` in prose contexts and escaped ``\|`` inside -# table cells to the math macro ``\mid``. When the surrounding text is -# rendered in text mode (e.g. 
table cell, prose, accessibility alt-text), -# ``\mid`` falls back to the text font (lmroman) which lacks U+2223 and emits -# ``Missing character`` warnings followed by ``U+FFFD`` glyphs in the PDF. -# These two patterns flag the markdown sources that trigger the conversion -# so authors can wrap them in math mode (``$|word|$`` or ``$\mid$``). -PANDOC_BARE_PIPE_PATTERN = re.compile(r"(? list[DiagnosticEvent]: - """Validate that all referenced images exist in the filesystem. - - When a relative image path fails to resolve from the markdown file's - directory, this function also checks ``extra_search_dirs`` and - auto-discovered project-level figure directories (``output/figures/`` - and ``figures/`` relative to the manuscript's project root). - - Args: - md_paths: List of markdown file paths to validate - repo_root: Root directory of the repository - extra_search_dirs: Additional directories to search for images - - Returns: - List of DiagnosticEvents for missing images. - """ - repo_root_path = Path(repo_root) - problems: list[DiagnosticEvent] = [] - - # Build search directories from the markdown directory's project context. - # Manuscript dirs follow the pattern: projects//manuscript/ - # So the project root is two levels up from the manuscript dir. 
- search_dirs: list[Path] = [] - if extra_search_dirs: - search_dirs.extend(Path(d) for d in extra_search_dirs) - - if md_paths: - md_dir = Path(md_paths[0]).parent - # Auto-discover sibling output/figures/ and figures/ dirs - project_root = md_dir.parent # parent of manuscript/ - for candidate in [ - project_root / "output" / "figures", - project_root / "figures", - ]: - if candidate.is_dir() and candidate not in search_dirs: - search_dirs.append(candidate) - - for path in md_paths: - path_obj = Path(path) - text = path_obj.read_text(encoding="utf-8") - for img in IMG_PATTERN.findall(text): - # Strip optional attributes after ) are not included by regex - img_clean = img.split()[0] - # Normalize relative paths (most are ../output/... or figures/) - abs_path = (path_obj.parent / img_clean).resolve() - if abs_path.exists(): - continue - - # Try each search directory as a fallback - img_basename = Path(img_clean).name - found = False - for search_dir in search_dirs: - if (search_dir / img_basename).exists(): - found = True - break - if not found: - display_path: Path - try: - display_path = Path(path).relative_to(repo_root_path) - except ValueError: - display_path = path_obj - problems.append( - DiagnosticEvent( - severity=DiagnosticSeverity.ERROR, - category="MARKDOWN_IMAGE", - message=f"Missing referenced image: '{img_clean}'", - code=MarkdownCode.IMG_MISSING, - file_path=str(display_path), - fix_suggestion="Ensure the image file exists in the specified relative path or figures directory.", - ) - ) - return problems - - -def validate_refs( - md_paths: list[str], repo_root: str | Path, labels: set[str], anchors: set[str] -) -> list[DiagnosticEvent]: - """Validate cross-references, internal links, and external URLs. - - Args: - md_paths: List of markdown file paths to validate - repo_root: Root directory of the repository - labels: Set of valid equation labels - anchors: Set of valid section anchors - - Returns: - List of DiagnosticEvents for reference issues. 
- """ - repo_root_path = Path(repo_root) - problems: list[DiagnosticEvent] = [] - for path in md_paths: - text = Path(path).read_text(encoding="utf-8") - try: - rel: str | Path = Path(path).relative_to(repo_root_path) - except ValueError: - rel = path - - rel_str = str(rel) - text_wo_fences = _text_without_fenced_code(text) - for ref in EQ_REF_PATTERN.findall(text_wo_fences): - if ref not in labels: - problems.append( - DiagnosticEvent( - severity=DiagnosticSeverity.ERROR, - category="MARKDOWN_REF", - message=f"Missing equation label for \\eqref{{{ref}}}", - code=MarkdownCode.REF_EQUATION_MISSING, - file_path=rel_str, - fix_suggestion=f"Verify that '\\label{{{ref}}}' exists in an equation block.", - ) - ) - for link in INTERNAL_LINK_PATTERN.findall(text_wo_fences): - if link not in anchors and link not in labels: - problems.append( - DiagnosticEvent( - severity=DiagnosticSeverity.ERROR, - category="MARKDOWN_LINK", - message=f"Missing anchor/label for internal link (#{link})", - code=MarkdownCode.LINK_ANCHOR_MISSING, - file_path=rel_str, - fix_suggestion=f"Provide a heading anchor '{{#{link}}}' or equation label.", - ) - ) - # Flag bare URLs not inside Markdown links. 
- text_no_code = re.sub(r"```[^`]*```", "", text, flags=re.DOTALL) - text_no_code = re.sub(r"`[^`]+`", "", text_no_code) # also strip inline code - for m in BARE_URL_PATTERN.finditer(text_no_code): - problems.append( - DiagnosticEvent( - severity=DiagnosticSeverity.WARNING, - category="MARKDOWN_LINK", - message=f"Bare URL found: '{m.group(0)}'", - code=MarkdownCode.LINK_BARE_URL, - file_path=rel_str, - fix_suggestion="Wrap the URL in a Markdown link with informative text: [link text](url)", - ) - ) - # Flag non-informative link text - for m in LINK_PATTERN.finditer(text): - label = m.group(1).strip() - url = m.group(2).strip() - if label == url or label.lower().startswith("http") or "/" in label: - problems.append( - DiagnosticEvent( - severity=DiagnosticSeverity.WARNING, - category="MARKDOWN_LINK", - message=f"Non-informative link text for {url}", - code=MarkdownCode.LINK_BAD_TEXT, - file_path=rel_str, - fix_suggestion=f"Replace '{label}' with descriptive text about the link destination.", - ) - ) - return problems - - -def _has_invalid_dollar_display_math(text: str) -> bool: - """Return true for inline, nested, or unbalanced ``$$`` delimiters. - - Pandoc's native ``$$`` display math is the only form that renders - faithfully to both HTML and LaTeX in the template pipeline. The - strictness here is about *shape*: display math must be isolated on - its own line(s), either as a same-line block - ``$$x = y$$`` or as paired delimiter lines. 
- """ - text = strip_markdown_code_regions(text) - open_block = False - for line in text.splitlines(): - stripped = line.strip() - if "$$" not in stripped: - continue - - delimiter_count = stripped.count("$$") - if stripped == "$$" or re.fullmatch(r"\$\$\s+\{#eq:[A-Za-z0-9_-]+(?:\s+[^}]*)?\}", stripped): - open_block = not open_block - continue - - if ( - not open_block - and delimiter_count == 2 - and stripped.startswith("$$") - and stripped.endswith("$$") - and stripped[2:-2].strip() - ): - continue - - return True - return open_block - - -def validate_math(md_paths: list[str], repo_root: str | Path) -> list[DiagnosticEvent]: - """Validate mathematical equation formatting and labeling. - - Args: - md_paths: List of markdown file paths to validate - repo_root: Root directory of the repository - - Returns: - List of DiagnosticEvents for math formatting issues. - """ - repo_root_path = Path(repo_root) - problems: list[DiagnosticEvent] = [] - eq_block = re.compile(r"\\begin\{equation\}([\s\S]*?)\\end\{equation\}", re.MULTILINE) - label_pattern = re.compile(r"\\label\{([^}]+)\}") - seen_labels: set[str] = set() - for path in md_paths: - text = Path(path).read_text(encoding="utf-8") - try: - rel: str | Path = Path(path).relative_to(repo_root_path) - except ValueError: - rel = path - - rel_str = str(rel) - - # Pandoc-native $$ display math is allowed when isolated on its - # own line(s). Inline, nested, or unbalanced $$ is fragile and - # gets flagged. Raw \[...\] is still banned because Pandoc emits - # literal brackets in HTML for that form. 
- if _has_invalid_dollar_display_math(text): - problems.append( - DiagnosticEvent( - severity=DiagnosticSeverity.WARNING, - category="MARKDOWN_MATH", - message="Use isolated $$ display blocks; inline or unbalanced $$ is not allowed", - code=MarkdownCode.MATH_DOLLAR_DISPLAY, - file_path=rel_str, - fix_suggestion=("Put display math on its own line(s), for example $$x = y$$ or a paired $$ block."), - ) - ) - if "\\[" in text or "\\]" in text: - problems.append( - DiagnosticEvent( - severity=DiagnosticSeverity.WARNING, - category="MARKDOWN_MATH", - message="Use equation environment instead of \\[ \\]", - code=MarkdownCode.MATH_BRACKET_DISPLAY, - file_path=rel_str, - fix_suggestion="Replace \\[...\\] with \\begin{equation}...\\end{equation}", - ) - ) - # Ensure each equation block carries a label and detect duplicates. - # Scan with fenced code removed so ```` ```latex ````/```` ```markdown ```` - # teaching examples in guides are not parsed as real equations - # (otherwise the non-greedy span pairs an example's - # ``\begin{equation}`` with a later prose mention, producing false - # "duplicate label" / "missing \label" findings). 
- _eq_scan_text = re.sub(r"`+[^`\n]*`+", "", _text_without_fenced_code(text)) - for m in eq_block.finditer(_eq_scan_text): - block = m.group(1) - labels_in_block = label_pattern.findall(block) - if not labels_in_block: - problems.append( - DiagnosticEvent( - severity=DiagnosticSeverity.WARNING, - category="MARKDOWN_MATH", - message="Equation missing \\label{...}", - code=MarkdownCode.MATH_LABEL_MISSING, - file_path=rel_str, - fix_suggestion="Add a \\label{eq_name} inside the \\begin{equation} block.", - ) - ) - else: - for lab in labels_in_block: - if lab in seen_labels: - problems.append( - DiagnosticEvent( - severity=DiagnosticSeverity.ERROR, - category="MARKDOWN_MATH", - message=f"Duplicate equation label '{{{lab}}}' found", - code=MarkdownCode.MATH_LABEL_DUPLICATE, - file_path=rel_str, - fix_suggestion="Rename one of the labels to be unique.", - ) - ) - seen_labels.add(lab) - return problems - - -def validate_pandoc_pitfalls(md_paths: list[str], repo_root: str | Path) -> list[DiagnosticEvent]: - """Flag markdown patterns Pandoc converts to LaTeX ``\\mid`` in text mode. - - Pandoc transforms two prose patterns into the math macro ``\\mid``: - - * Bare ``|word|`` outside math/code (e.g. figure captions, alt-text). - * Escaped ``\\|`` inside table cells (the only way to put a literal - pipe in a Markdown table). - - When the rendered context is text mode, ``\\mid`` resolves through the - text font (lmroman) which lacks U+2223 and produces visible - ``Missing character`` warnings plus ``U+FFFD`` glyphs in the PDF. - Wrapping the offending span in inline math (``$|word|$`` or - ``$\\mid$``) routes the macro through the math font where it renders - correctly. - - Args: - md_paths: Markdown source files to scan. - repo_root: Repository root for relative-path display. - - Returns: - DiagnosticEvents (severity WARNING) for each occurrence. 
- """ - repo_root_path = Path(repo_root) - problems: list[DiagnosticEvent] = [] - - for path in md_paths: - path_obj = Path(path) - if path_obj.name in NON_RENDERED_MANUSCRIPT_FILES: - continue - text = path_obj.read_text(encoding="utf-8") - try: - rel: str | Path = path_obj.relative_to(repo_root_path) - except ValueError: - rel = path_obj - rel_str = str(rel) - - prose = strip_code_and_math(text) - for m in PANDOC_BARE_PIPE_PATTERN.finditer(prose): - problems.append( - DiagnosticEvent( - severity=DiagnosticSeverity.WARNING, - category="MARKDOWN_PANDOC_MID", - message=( - f"Bare pipe pattern '|{m.group(1)}|' in prose will be " - f"converted by Pandoc to '\\mid {m.group(1)}\\mid{{}}', " - "which fails to render U+2223 in text mode." - ), - code=MarkdownCode.PANDOC_BARE_PIPE, - file_path=rel_str, - fix_suggestion=( - f"Wrap the span in inline math (e.g. '$|{m.group(1)}|$' " - f"or '${{|}}{m.group(1)}{{|}}$') so the macro renders " - "through the math font." - ), - ) - ) - - for line_no, line in enumerate(text.splitlines(), 1): - stripped = line.lstrip() - if not stripped.startswith("|"): - continue - # Strip inline math regions (where ``\|`` is the norm operator, - # not a Pandoc-converted pipe) before scanning. - line_no_math = re.sub(r"(? list[DiagnosticEvent]: - """Verify every ``[@key]`` citation resolves in the project's BibTeX file(s). - - Scans markdown for Pandoc-style citation tokens (``@key`` outside code - or math contexts) and reports any keys not present in the supplied - BibTeX file(s). Mirrors the natbib *Citation `key' undefined* warning - that would otherwise surface only after a full LaTeX render. - - Multiple bibliographies are supported because some projects split - citations across e.g. ``references.bib`` (curated) and - ``references_deep.bib`` (auto-generated from a deep-search pipeline). - Pandoc accepts repeated ``--bibliography=`` flags, so the validator - must also union keys from every sibling ``*.bib`` file by default. 
- - Args: - md_paths: Markdown source files to scan. - repo_root: Repository root for relative-path display. - bib_file: Either a single path or a list of paths to BibTeX files. - When ``None``, every ``*.bib`` file next to the first markdown - file is loaded. - - Returns: - DiagnosticEvents (severity ERROR) for each unresolved citation key. - """ - repo_root_path = Path(repo_root) - problems: list[DiagnosticEvent] = [] - if not md_paths: - return problems - - if bib_file is None: - bib_paths = sorted(Path(md_paths[0]).parent.glob("*.bib")) - elif isinstance(bib_file, (list, tuple)): - bib_paths = [Path(p) for p in bib_file if Path(p).exists()] - else: - single = Path(bib_file) - bib_paths = [single] if single.exists() else [] - - if not bib_paths: - return problems - - known_keys: set[str] = set() - bib_names: list[str] = [] - for bib_path in bib_paths: - try: - bib_text = bib_path.read_text(encoding="utf-8", errors="ignore") - except OSError as e: - logger.warning(f"Failed to read BibTeX file {bib_path}: {e}") - continue - known_keys.update(k.strip() for k in BIBTEX_KEY_PATTERN.findall(bib_text)) - bib_names.append(bib_path.name) - - if not bib_names: - return problems - - bib_label = bib_names[0] if len(bib_names) == 1 else ", ".join(bib_names) - - for path in md_paths: - path_obj = Path(path) - if path_obj.name in NON_RENDERED_MANUSCRIPT_FILES: - continue - text = path_obj.read_text(encoding="utf-8") - try: - rel: str | Path = path_obj.relative_to(repo_root_path) - except ValueError: - rel = path_obj - rel_str = str(rel) - - prose = strip_code_and_math(text) - seen_in_file: set[str] = set() - for m in CITE_KEY_PATTERN.finditer(prose): - key = m.group(1) - if key in known_keys or key in seen_in_file: - continue - # pandoc-crossref reserves the ``sec:``, ``fig:``, ``tbl:``, - # ``eq:`` and ``lst:`` prefixes for cross-references such as - # ``[@sec:methodology]`` or ``[@fig:per-source]``. 
These are - # never bibliography keys and should not trigger - # BIBTEX.UNDEFINED_KEY — Pandoc resolves them at render time - # against ``{#sec:foo}`` / ``{#fig:foo}`` anchors. - if any(key.startswith(prefix) for prefix in ("sec:", "fig:", "tbl:", "eq:", "lst:")): - continue - seen_in_file.add(key) - problems.append( - DiagnosticEvent( - severity=DiagnosticSeverity.ERROR, - category="MARKDOWN_CITATION", - message=f"Undefined citation key '@{key}' (not in {bib_label})", - code=BibtexCode.UNDEFINED_KEY, - file_path=rel_str, - fix_suggestion=( - f"Add an entry '@type{{{key}, ...}}' to {bib_names[0]} or " - "correct the citation key in the markdown." - ), - ) - ) - return problems +from infrastructure.validation.content.validator_citations import validate_citations +from infrastructure.validation.content.validator_images import validate_images +from infrastructure.validation.content.validator_math import validate_math +from infrastructure.validation.content.validator_pitfalls import validate_pandoc_pitfalls +from infrastructure.validation.content.validator_refs import validate_refs + +__all__ = [ + "collect_symbols", + "find_manuscript_directory", + "validate_citations", + "validate_images", + "validate_markdown", + "validate_math", + "validate_pandoc_pitfalls", + "validate_refs", +] def validate_markdown( markdown_dir: str | Path, repo_root: str | Path, strict: bool = False ) -> tuple[list[DiagnosticEvent], int]: - """Validate all markdown files in a directory. - - This is the main validation function that runs all checks. 
- - Args: - markdown_dir: Directory containing markdown files to validate - repo_root: Root directory of the repository - strict: If True, fail on any issues; if False, warn only - - Returns: - Tuple of (problems list, exit_code) - - problems: List of DiagnosticEvents - - exit_code: 0 for success or when strict=False; 1 only when strict=True and issues found - - Raises: - FileNotFoundError: If markdown_dir doesn't exist - """ + """Validate all markdown files in a directory.""" markdown_dir = Path(markdown_dir) repo_root = Path(repo_root) @@ -575,28 +54,14 @@ def validate_markdown( problems += validate_citations(md_paths, repo_root) if problems: - # Currently treating WARNING severity as failing if strict is True. - # But if we want only ERROR to fail, we can filter. has_errors = any(p.severity == DiagnosticSeverity.ERROR for p in problems) exit_code = 1 if (strict and has_errors) else 0 return (problems, exit_code) - else: - return ([], 0) + return ([], 0) def find_manuscript_directory(repo_root: str | Path, project_name: str = "project") -> Path: - """Find the manuscript directory for a discovered or qualified project name. 
- - Args: - repo_root: Root directory of the repository - project_name: Bare or qualified project name (default: "project") - - Returns: - Path to the project's manuscript directory - - Raises: - FileNotFoundError: If manuscript directory cannot be found - """ + """Find the manuscript directory for a discovered or qualified project name.""" from infrastructure.project.discovery import discover_projects, resolve_project_root repo_root = Path(repo_root) diff --git a/infrastructure/validation/content/validator_citations.py b/infrastructure/validation/content/validator_citations.py new file mode 100644 index 000000000..80eebd953 --- /dev/null +++ b/infrastructure/validation/content/validator_citations.py @@ -0,0 +1,91 @@ +"""Markdown citation key validation against BibTeX sources.""" + +from __future__ import annotations + +import re +from pathlib import Path + +from infrastructure.core.logging import DiagnosticEvent, DiagnosticSeverity +from infrastructure.core.logging.utils import get_logger +from infrastructure.validation.content.diagnostic_codes import BibtexCode +from infrastructure.validation.content.markdown_strip import strip_code_and_math +from infrastructure.validation.content.validator_pitfalls import NON_RENDERED_MANUSCRIPT_FILES + +logger = get_logger(__name__) + +CITE_KEY_PATTERN = re.compile(r"(? 
list[DiagnosticEvent]: + """Verify every ``[@key]`` citation resolves in the project's BibTeX file(s).""" + repo_root_path = Path(repo_root) + problems: list[DiagnosticEvent] = [] + if not md_paths: + return problems + + if bib_file is None: + bib_paths = sorted(Path(md_paths[0]).parent.glob("*.bib")) + elif isinstance(bib_file, (list, tuple)): + bib_paths = [Path(p) for p in bib_file if Path(p).exists()] + else: + single = Path(bib_file) + bib_paths = [single] if single.exists() else [] + + if not bib_paths: + return problems + + known_keys: set[str] = set() + bib_names: list[str] = [] + for bib_path in bib_paths: + try: + bib_text = bib_path.read_text(encoding="utf-8", errors="ignore") + except OSError as e: + logger.warning(f"Failed to read BibTeX file {bib_path}: {e}") + continue + known_keys.update(k.strip() for k in BIBTEX_KEY_PATTERN.findall(bib_text)) + bib_names.append(bib_path.name) + + if not bib_names: + return problems + + bib_label = bib_names[0] if len(bib_names) == 1 else ", ".join(bib_names) + + for path in md_paths: + path_obj = Path(path) + if path_obj.name in NON_RENDERED_MANUSCRIPT_FILES: + continue + text = path_obj.read_text(encoding="utf-8") + try: + rel: str | Path = path_obj.relative_to(repo_root_path) + except ValueError: + rel = path_obj + rel_str = str(rel) + + prose = strip_code_and_math(text) + seen_in_file: set[str] = set() + for m in CITE_KEY_PATTERN.finditer(prose): + key = m.group(1) + if key in known_keys or key in seen_in_file: + continue + if any(key.startswith(prefix) for prefix in ("sec:", "fig:", "tbl:", "eq:", "lst:")): + continue + seen_in_file.add(key) + problems.append( + DiagnosticEvent( + severity=DiagnosticSeverity.ERROR, + category="MARKDOWN_CITATION", + message=f"Undefined citation key '@{key}' (not in {bib_label})", + code=BibtexCode.UNDEFINED_KEY, + file_path=rel_str, + fix_suggestion=( + f"Add an entry '@type{{{key}, ...}}' to {bib_names[0]} or " + "correct the citation key in the markdown." 
+ ), + ) + ) + return problems diff --git a/infrastructure/validation/content/validator_images.py b/infrastructure/validation/content/validator_images.py new file mode 100644 index 000000000..5ea03ba9d --- /dev/null +++ b/infrastructure/validation/content/validator_images.py @@ -0,0 +1,68 @@ +"""Markdown image reference validation.""" + +from __future__ import annotations + +import re +from pathlib import Path + +from infrastructure.core.logging import DiagnosticEvent, DiagnosticSeverity +from infrastructure.validation.content.diagnostic_codes import MarkdownCode + +IMG_PATTERN = re.compile(r"!\[[^\]]*\]\(([^\)]+)\)") + + +def validate_images( + md_paths: list[str], + repo_root: str | Path, + extra_search_dirs: list[str | Path] | None = None, +) -> list[DiagnosticEvent]: + """Validate that all referenced images exist in the filesystem.""" + repo_root_path = Path(repo_root) + problems: list[DiagnosticEvent] = [] + + search_dirs: list[Path] = [] + if extra_search_dirs: + search_dirs.extend(Path(d) for d in extra_search_dirs) + + if md_paths: + md_dir = Path(md_paths[0]).parent + project_root = md_dir.parent + for candidate in [ + project_root / "output" / "figures", + project_root / "figures", + ]: + if candidate.is_dir() and candidate not in search_dirs: + search_dirs.append(candidate) + + for path in md_paths: + path_obj = Path(path) + text = path_obj.read_text(encoding="utf-8") + for img in IMG_PATTERN.findall(text): + img_clean = img.split()[0] + abs_path = (path_obj.parent / img_clean).resolve() + if abs_path.exists(): + continue + + img_basename = Path(img_clean).name + found = False + for search_dir in search_dirs: + if (search_dir / img_basename).exists(): + found = True + break + if not found: + display_path: Path + try: + display_path = Path(path).relative_to(repo_root_path) + except ValueError: + display_path = path_obj + problems.append( + DiagnosticEvent( + severity=DiagnosticSeverity.ERROR, + category="MARKDOWN_IMAGE", + message=f"Missing referenced 
image: '{img_clean}'", + code=MarkdownCode.IMG_MISSING, + file_path=str(display_path), + fix_suggestion="Ensure the image file exists in the specified relative path or figures directory.", + ) + ) + return problems diff --git a/infrastructure/validation/content/validator_math.py b/infrastructure/validation/content/validator_math.py new file mode 100644 index 000000000..ad47dbcd5 --- /dev/null +++ b/infrastructure/validation/content/validator_math.py @@ -0,0 +1,108 @@ +"""Markdown mathematical equation validation.""" + +from __future__ import annotations + +import re +from pathlib import Path + +from infrastructure.core.logging import DiagnosticEvent, DiagnosticSeverity +from infrastructure.validation.content.diagnostic_codes import MarkdownCode +from infrastructure.validation.content.markdown_strip import strip_markdown_code_regions +from infrastructure.validation.content.validator_refs import text_without_fenced_code + + +def _has_invalid_dollar_display_math(text: str) -> bool: + """Return true for inline, nested, or unbalanced ``$$`` delimiters.""" + text = strip_markdown_code_regions(text) + open_block = False + for line in text.splitlines(): + stripped = line.strip() + if "$$" not in stripped: + continue + + delimiter_count = stripped.count("$$") + if stripped == "$$" or re.fullmatch(r"\$\$\s+\{#eq:[A-Za-z0-9_-]+(?:\s+[^}]*)?\}", stripped): + open_block = not open_block + continue + + if ( + not open_block + and delimiter_count == 2 + and stripped.startswith("$$") + and stripped.endswith("$$") + and stripped[2:-2].strip() + ): + continue + + return True + return open_block + + +def validate_math(md_paths: list[str], repo_root: str | Path) -> list[DiagnosticEvent]: + """Validate mathematical equation formatting and labeling.""" + repo_root_path = Path(repo_root) + problems: list[DiagnosticEvent] = [] + eq_block = re.compile(r"\\begin\{equation\}([\s\S]*?)\\end\{equation\}", re.MULTILINE) + label_pattern = re.compile(r"\\label\{([^}]+)\}") + seen_labels: set[str] 
= set() + for path in md_paths: + text = Path(path).read_text(encoding="utf-8") + try: + rel: str | Path = Path(path).relative_to(repo_root_path) + except ValueError: + rel = path + + rel_str = str(rel) + + if _has_invalid_dollar_display_math(text): + problems.append( + DiagnosticEvent( + severity=DiagnosticSeverity.WARNING, + category="MARKDOWN_MATH", + message="Use isolated $$ display blocks; inline or unbalanced $$ is not allowed", + code=MarkdownCode.MATH_DOLLAR_DISPLAY, + file_path=rel_str, + fix_suggestion=("Put display math on its own line(s), for example $$x = y$$ or a paired $$ block."), + ) + ) + if "\\[" in text or "\\]" in text: + problems.append( + DiagnosticEvent( + severity=DiagnosticSeverity.WARNING, + category="MARKDOWN_MATH", + message="Use equation environment instead of \\[ \\]", + code=MarkdownCode.MATH_BRACKET_DISPLAY, + file_path=rel_str, + fix_suggestion="Replace \\[...\\] with \\begin{equation}...\\end{equation}", + ) + ) + _eq_scan_text = re.sub(r"`+[^`\n]*`+", "", text_without_fenced_code(text)) + for m in eq_block.finditer(_eq_scan_text): + block = m.group(1) + labels_in_block = label_pattern.findall(block) + if not labels_in_block: + problems.append( + DiagnosticEvent( + severity=DiagnosticSeverity.WARNING, + category="MARKDOWN_MATH", + message="Equation missing \\label{...}", + code=MarkdownCode.MATH_LABEL_MISSING, + file_path=rel_str, + fix_suggestion="Add a \\label{eq_name} inside the \\begin{equation} block.", + ) + ) + else: + for lab in labels_in_block: + if lab in seen_labels: + problems.append( + DiagnosticEvent( + severity=DiagnosticSeverity.ERROR, + category="MARKDOWN_MATH", + message=f"Duplicate equation label '{{{lab}}}' found", + code=MarkdownCode.MATH_LABEL_DUPLICATE, + file_path=rel_str, + fix_suggestion="Rename one of the labels to be unique.", + ) + ) + seen_labels.add(lab) + return problems diff --git a/infrastructure/validation/content/validator_pitfalls.py b/infrastructure/validation/content/validator_pitfalls.py new 
file mode 100644 index 000000000..2bf2caebc --- /dev/null +++ b/infrastructure/validation/content/validator_pitfalls.py @@ -0,0 +1,79 @@ +"""Pandoc conversion pitfall checks for markdown manuscripts.""" + +from __future__ import annotations + +import re +from pathlib import Path + +from infrastructure.core.logging import DiagnosticEvent, DiagnosticSeverity +from infrastructure.validation.content.diagnostic_codes import MarkdownCode +from infrastructure.validation.content.markdown_strip import strip_code_and_math + +PANDOC_BARE_PIPE_PATTERN = re.compile(r"(? list[DiagnosticEvent]: + """Flag markdown patterns Pandoc converts to LaTeX ``\\mid`` in text mode.""" + repo_root_path = Path(repo_root) + problems: list[DiagnosticEvent] = [] + + for path in md_paths: + path_obj = Path(path) + if path_obj.name in NON_RENDERED_MANUSCRIPT_FILES: + continue + text = path_obj.read_text(encoding="utf-8") + try: + rel: str | Path = path_obj.relative_to(repo_root_path) + except ValueError: + rel = path_obj + rel_str = str(rel) + + prose = strip_code_and_math(text) + for m in PANDOC_BARE_PIPE_PATTERN.finditer(prose): + problems.append( + DiagnosticEvent( + severity=DiagnosticSeverity.WARNING, + category="MARKDOWN_PANDOC_MID", + message=( + f"Bare pipe pattern '|{m.group(1)}|' in prose will be " + f"converted by Pandoc to '\\mid {m.group(1)}\\mid{{}}', " + "which fails to render U+2223 in text mode." + ), + code=MarkdownCode.PANDOC_BARE_PIPE, + file_path=rel_str, + fix_suggestion=( + f"Wrap the span in inline math (e.g. '$|{m.group(1)}|$' " + f"or '${{|}}{m.group(1)}{{|}}$') so the macro renders " + "through the math font." + ), + ) + ) + + for line_no, line in enumerate(text.splitlines(), 1): + stripped = line.lstrip() + if not stripped.startswith("|"): + continue + line_no_math = re.sub(r"(? 
str: + """Remove triple-backtick and tilde-fenced code blocks.""" + return strip_fences(text) + + +def validate_refs( + md_paths: list[str], repo_root: str | Path, labels: set[str], anchors: set[str] +) -> list[DiagnosticEvent]: + """Validate cross-references, internal links, and external URLs.""" + repo_root_path = Path(repo_root) + problems: list[DiagnosticEvent] = [] + for path in md_paths: + text = Path(path).read_text(encoding="utf-8") + try: + rel: str | Path = Path(path).relative_to(repo_root_path) + except ValueError: + rel = path + + rel_str = str(rel) + text_wo_fences = text_without_fenced_code(text) + for ref in EQ_REF_PATTERN.findall(text_wo_fences): + if ref not in labels: + problems.append( + DiagnosticEvent( + severity=DiagnosticSeverity.ERROR, + category="MARKDOWN_REF", + message=f"Missing equation label for \\eqref{{{ref}}}", + code=MarkdownCode.REF_EQUATION_MISSING, + file_path=rel_str, + fix_suggestion=f"Verify that '\\label{{{ref}}}' exists in an equation block.", + ) + ) + for link in INTERNAL_LINK_PATTERN.findall(text_wo_fences): + if link not in anchors and link not in labels: + problems.append( + DiagnosticEvent( + severity=DiagnosticSeverity.ERROR, + category="MARKDOWN_LINK", + message=f"Missing anchor/label for internal link (#{link})", + code=MarkdownCode.LINK_ANCHOR_MISSING, + file_path=rel_str, + fix_suggestion=f"Provide a heading anchor '{{#{link}}}' or equation label.", + ) + ) + text_no_code = re.sub(r"```[^`]*```", "", text, flags=re.DOTALL) + text_no_code = re.sub(r"`[^`]+`", "", text_no_code) + for m in BARE_URL_PATTERN.finditer(text_no_code): + problems.append( + DiagnosticEvent( + severity=DiagnosticSeverity.WARNING, + category="MARKDOWN_LINK", + message=f"Bare URL found: '{m.group(0)}'", + code=MarkdownCode.LINK_BARE_URL, + file_path=rel_str, + fix_suggestion="Wrap the URL in a Markdown link with informative text: [link text](url)", + ) + ) + for m in LINK_PATTERN.finditer(text): + label = m.group(1).strip() + url = 
m.group(2).strip() + if label == url or label.lower().startswith("http") or "/" in label: + problems.append( + DiagnosticEvent( + severity=DiagnosticSeverity.WARNING, + category="MARKDOWN_LINK", + message=f"Non-informative link text for {url}", + code=MarkdownCode.LINK_BAD_TEXT, + file_path=rel_str, + fix_suggestion=f"Replace '{label}' with descriptive text about the link destination.", + ) + ) + return problems diff --git a/infrastructure/validation/line_count.py b/infrastructure/validation/line_count.py index e52ff506f..2551f1cdd 100644 --- a/infrastructure/validation/line_count.py +++ b/infrastructure/validation/line_count.py @@ -14,6 +14,7 @@ class LineCountThresholds: DEFAULT_INFRA_THRESHOLDS = LineCountThresholds(warn_at=800, fail_at=950) DEFAULT_PROJECT_SCRIPT_THRESHOLDS = LineCountThresholds(warn_at=150, fail_at=250) +DEFAULT_TEST_THRESHOLDS = LineCountThresholds(warn_at=800, fail_at=10_000) def count_lines(path: Path) -> int: @@ -109,3 +110,35 @@ def scan_project_src( warnings.extend(part_warnings) failures.extend(part_failures) return warnings, failures + + +def scan_repository_tests( + repo_root: Path, + *, + allowlist: frozenset[str] = frozenset(), +) -> tuple[list[tuple[str, int]], list[tuple[str, int]]]: + """Advisory scan of infra and public project test modules (warn-only by default).""" + from infrastructure.project.public_scope import PUBLIC_PROJECT_NAMES + + warnings: list[tuple[str, int]] = [] + infra_warnings, _ = scan_line_counts( + repo_root, + ("tests",), + thresholds=DEFAULT_TEST_THRESHOLDS, + allowlist=allowlist, + ) + warnings.extend(infra_warnings) + + for name in PUBLIC_PROJECT_NAMES: + tests_dir = repo_root / "projects" / name / "tests" + if not tests_dir.is_dir(): + continue + rel_root = tests_dir.relative_to(repo_root).as_posix() + part_warnings, _ = scan_line_counts( + repo_root, + (rel_root,), + thresholds=DEFAULT_TEST_THRESHOLDS, + allowlist=allowlist, + ) + warnings.extend(part_warnings) + return warnings, [] diff --git 
a/projects/templates/template_autoscientists/scripts/hermes_proposer.py b/projects/templates/template_autoscientists/scripts/hermes_proposer.py index 770c9a594..4d91a6d18 100644 --- a/projects/templates/template_autoscientists/scripts/hermes_proposer.py +++ b/projects/templates/template_autoscientists/scripts/hermes_proposer.py @@ -4,13 +4,14 @@ import json from collections.abc import Sequence - -from infrastructure.llm.core.client import LLMClient -from infrastructure.llm.core.config import GenerationOptions +from typing import TYPE_CHECKING from src.agents import _extract_json from src.state import Proposal, SharedState +if TYPE_CHECKING: + from infrastructure.llm.core.client import LLMClient + class HermesProposer: """Live proposer backed by a Hermes model served through Ollama.""" @@ -22,6 +23,8 @@ def __init__(self, model: str = "hermes3", step: float = 0.5) -> None: def _ensure_client(self) -> LLMClient: # pragma: no cover - requires live Ollama if self._client is None: + from infrastructure.llm.core.client import LLMClient + self._client = LLMClient() return self._client @@ -55,6 +58,8 @@ def propose( # pragma: no cover - requires live Ollama ) -> Proposal: if not axes: raise ValueError("axes must be non-empty") + from infrastructure.llm.core.config import GenerationOptions + client = self._ensure_client() raw = client.query( self._prompt(state, axes, avoid), diff --git a/projects/templates/template_autoscientists/src/__init__.py b/projects/templates/template_autoscientists/src/__init__.py index c25c1ed82..59f535965 100644 --- a/projects/templates/template_autoscientists/src/__init__.py +++ b/projects/templates/template_autoscientists/src/__init__.py @@ -23,6 +23,7 @@ from .stagnation import StagnationDetector, reorganize_axes from .state import Champion, ExperimentOutcome, Proposal, SharedState + def __getattr__(name: str): # pragma: no cover - lazy script-layer export if name == "HermesProposer": from hermes_proposer import HermesProposer diff --git 
a/projects/templates/template_autoscientists/src/agents.py b/projects/templates/template_autoscientists/src/agents.py index 543a32464..4dc5b753a 100644 --- a/projects/templates/template_autoscientists/src/agents.py +++ b/projects/templates/template_autoscientists/src/agents.py @@ -15,7 +15,6 @@ from __future__ import annotations -import json from collections.abc import Sequence from typing import Protocol diff --git a/projects/templates/template_autoscientists/tests/test_hermes_live.py b/projects/templates/template_autoscientists/tests/test_hermes_live.py index 876e2bb83..8827c03be 100644 --- a/projects/templates/template_autoscientists/tests/test_hermes_live.py +++ b/projects/templates/template_autoscientists/tests/test_hermes_live.py @@ -12,7 +12,6 @@ import pytest -from hermes_proposer import HermesProposer from src.state import Champion, SharedState @@ -35,6 +34,8 @@ def test_hermes_proposes_in_scope_axis() -> None: pytest.importorskip("infrastructure.llm.core.config") if not _ollama_reachable(): pytest.skip("Ollama daemon not reachable; opt-in live test") + from hermes_proposer import HermesProposer + state = SharedState(champion=Champion(params=(1.5, 1.5, 1.5, 1.5), metric=-9.0, experiment_index=-1)) proposer = HermesProposer() proposal = proposer.propose(state, axes=[0, 1, 2, 3], proposer_id="hermes0") diff --git a/pyproject.toml b/pyproject.toml index d2c98cf3c..ca16a8b13 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -254,7 +254,6 @@ ignore = [ [tool.ruff.lint.per-file-ignores] "infrastructure/core/logging/*.py" = ["E402"] # Conditional imports after setup "infrastructure/publishing/publish_cli.py" = ["E402"] # sys.path manipulation before imports -"infrastructure/rendering/render_all_cli.py" = ["E402"] # sys.path manipulation before imports "scripts/*" = ["E402", "E501"] # sys.path before imports; docstring line length "projects/*/scripts/**/*.py" = ["E402", "E501"] # Path bootstrap before imports; long CLI strings "tests/**/*.py" = ["E712", "E402", 
"E501"] # Assertions, conditional imports, long fixture payloads diff --git a/scripts/gates/module_line_count_check.py b/scripts/gates/module_line_count_check.py index 1ff729195..b1ec2061c 100644 --- a/scripts/gates/module_line_count_check.py +++ b/scripts/gates/module_line_count_check.py @@ -14,6 +14,7 @@ scan_infrastructure_and_scripts, scan_project_scripts, scan_project_src, + scan_repository_tests, ) # Documented, time-boxed exceptions to the module line-count gate. These are @@ -38,6 +39,11 @@ def main(argv: list[str] | None = None) -> int: type=Path, default=REPO_ROOT, ) + parser.add_argument( + "--include-tests", + action="store_true", + help="Emit advisory WARN lines for test modules >=800 lines (never fails the gate)", + ) args = parser.parse_args(argv) warnings: list[tuple[str, int]] = [] @@ -55,6 +61,11 @@ def main(argv: list[str] | None = None) -> int: warnings.extend(src_warn) failures.extend(src_fail) + if args.include_tests: + test_warn, _ = scan_repository_tests(args.repo_root, allowlist=LINE_COUNT_ALLOWLIST) + for rel, count in sorted(test_warn): + print(f"WARN [test] {rel}: {count} lines") + for rel, count in sorted(warnings): print(f"WARN {rel}: {count} lines") for rel, count in sorted(failures): diff --git a/tests/infra_tests/rendering/test_format_toggles.py b/tests/infra_tests/rendering/test_format_toggles.py index c61d6ca27..9943ad6e0 100644 --- a/tests/infra_tests/rendering/test_format_toggles.py +++ b/tests/infra_tests/rendering/test_format_toggles.py @@ -96,11 +96,11 @@ def test_pipeline_skip_branches_present() -> None: """ from pathlib import Path - source = Path("infrastructure/rendering/pipeline.py").read_text() + source = Path("infrastructure/rendering/_combined_exports.py").read_text() assert "[skip] PDF rendering disabled" in source assert "[skip] HTML rendering disabled" in source - assert "_render_combined_docx" in source - assert "_render_combined_epub" in source + assert "render_combined_docx" in source + assert 
"render_combined_epub" in source def test_combined_html_skips_missing_transmission_bookends(tmp_path) -> None: