feat: implement high-fidelity DOCX to PDF conversion using headless LibreOffice #269

Open
ankittroy-21 wants to merge 3 commits into Durgeshwar-AI:main from ankittroy-21:main

Conversation

@ankittroy-21

This PR upgrades the DOCX to PDF conversion engine from a basic text-extraction method to a full-fidelity rendering engine using headless LibreOffice.

The Problem

The previous implementation extracted raw text from the document with python-docx and rendered that text to PDF with ReportLab. This resulted in the loss of:

  • Formatting: Bold, italics, font sizes, and colors.
  • Layout: Indents, alignment, and page breaks.
  • Media: All images and graphics.
  • Tables: Structural elements were completely omitted.

The Solution

  • LibreOffice Integration: Integrated headless LibreOffice into the backend Docker environment to serve as the conversion engine.
  • Subprocess Workflow: Refactored backend/blueprints/docx_to_pdf.py to use libreoffice --convert-to pdf via a Python subprocess.
  • Secure Temporary Processing: Utilized tempfile.TemporaryDirectory to ensure that conversion artifacts are automatically cleaned up and never persist beyond the request lifecycle.
  • Dependency Optimization: Removed the reportlab library from requirements.txt as it is no longer required for this feature.
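
The subprocess workflow described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not necessarily the code in this PR: the names `build_convert_command` and `convert_docx_bytes` and the injectable `runner` parameter are hypothetical, and the sketch assumes a `libreoffice` binary is available on the PATH.

```python
import os
import subprocess
import tempfile


def build_convert_command(input_path, outdir):
    """Build the headless LibreOffice command line for a DOCX -> PDF run."""
    return [
        "libreoffice",
        "--headless",
        "--convert-to",
        "pdf",
        "--outdir",
        outdir,
        input_path,
    ]


def convert_docx_bytes(docx_bytes, runner=subprocess.run, timeout=60):
    """Convert DOCX bytes to PDF bytes inside a self-cleaning temp directory.

    `runner` is a hypothetical injection point so the subprocess call can be
    swapped out in tests; it defaults to subprocess.run.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        input_path = os.path.join(tmp_dir, "document.docx")
        with open(input_path, "wb") as f:
            f.write(docx_bytes)

        result = runner(
            build_convert_command(input_path, tmp_dir),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        if result.returncode != 0:
            raise RuntimeError(f"LibreOffice failed: {result.stderr}")

        # LibreOffice names the output after the input file, with a .pdf extension.
        pdf_path = os.path.join(tmp_dir, "document.pdf")
        if not os.path.exists(pdf_path):
            raise RuntimeError("PDF not generated")

        # Read the bytes before the with-block exits and deletes the directory.
        with open(pdf_path, "rb") as f:
            return f.read()
```

Keeping the command construction in its own function makes it possible to check the arguments without LibreOffice installed, which may be convenient in CI.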

Changes

  • backend/Dockerfile: Installed libreoffice-writer and necessary Java runtimes for headless operation.
  • backend/blueprints/docx_to_pdf.py: Completely refactored to use LibreOffice instead of ReportLab.
  • backend/requirements.txt: Removed reportlab.
  • Readme.md: Updated Tech Stack and added local setup notes for LibreOffice.

Verification Results

  • Verified that complex DOCX documents (with tables and images) are converted to PDF with original styling preserved.
  • Confirmed that temporary files are automatically deleted after conversion.
  • Validated that the Docker build correctly installs all new system dependencies.
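
The cleanup claim above can be checked in isolation with a few lines of Python. The sketch below is illustrative only; the helper name `write_and_leave` is hypothetical, and it demonstrates the standard-library behaviour of `tempfile.TemporaryDirectory` rather than the endpoint itself.

```python
import os
import tempfile


def write_and_leave(payload):
    """Write payload into a TemporaryDirectory, then report paths after exit."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "document.docx")
        with open(path, "wb") as f:
            f.write(payload)
        existed_inside = os.path.exists(path)
    # Leaving the with-block removes the directory and everything in it.
    return tmp_dir, path, existed_inside


tmp_dir, path, existed_inside = write_and_leave(b"example bytes")
print(existed_inside, os.path.exists(path), os.path.exists(tmp_dir))
```

The file exists while the `with` block is active, and both the file and its directory are gone once the block exits.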

How to Test

  1. Rebuild the backend Docker container: docker-compose up --build backend.
  2. Upload a DOCX file containing tables, images, and varied font styles through the "DOCX to PDF" tool in the UI.
  3. Verify that the downloaded PDF matches the source document's layout and formatting exactly.
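
Before comparing layouts by eye in step 3, a quick sanity check on the downloaded file may help rule out an error page saved under a `.pdf` name. The sketch below is an assumption-laden helper, not part of this PR: `looks_like_pdf` is a hypothetical name, and it only checks the PDF header bytes, not fidelity.

```python
def looks_like_pdf(data):
    """Return True if the bytes start with the PDF header marker."""
    return data.startswith(b"%PDF-")


def looks_like_docx_container(data):
    """Return True if the bytes start with the ZIP signature used by DOCX files."""
    return data.startswith(b"PK\x03\x04")
```

For example, reading the downloaded `converted.pdf` in binary mode and passing its contents to `looks_like_pdf` should return `True`, while the original upload should satisfy `looks_like_docx_container` instead.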

@vercel

vercel Bot commented Jun 10, 2026

@ankittroy-21 is attempting to deploy a commit to the Durgeshwar's projects Team on Vercel.

A member of the Team first needs to authorize it.

@ankittroy-21

Author

@Durgeshwar-AI Look into it

@Durgeshwar-AI

Owner

@ankittroy-21 I do not want any temp files created. Please remove all such instances.

@ankittroy-21

Author

@Durgeshwar-AI Removed the temp file section

@Durgeshwar-AI

Owner

import os
import subprocess
import tempfile
import traceback
from io import BytesIO

from flask import Blueprint, request

from utils.helpers import error, send_file_and_cleanup

docx_pdf_bp = Blueprint("docx_pdf", __name__)


@docx_pdf_bp.route("/convertDocxToPdf", methods=["POST"])
def convert_docx_to_pdf():
    try:
        if "file" not in request.files:
            return error("No file provided")

        docx_file = request.files["file"]

        if docx_file.filename == "":
            return error("No file selected")

        docx_bytes = docx_file.read()

        # Use LibreOffice for high-fidelity conversion
        with tempfile.TemporaryDirectory() as tmp_dir:
            input_path = os.path.join(tmp_dir, "document.docx")
            with open(input_path, "wb") as f:
                f.write(docx_bytes)

            try:
                # Execute LibreOffice headless conversion
                # Using --convert-to pdf --outdir
                result = subprocess.run(
                    [
                        "libreoffice",
                        "--headless",
                        "--convert-to",
                        "pdf",
                        "--outdir",
                        tmp_dir,
                        input_path,
                    ],
                    capture_output=True,
                    text=True,
                    timeout=60,  # 60 second timeout
                )

                if result.returncode != 0:
                    print(f"LibreOffice error: {result.stderr}")
                    return error(f"Conversion failed: {result.stderr}", 500)

                # LibreOffice names the output file same as input but with .pdf extension
                pdf_path = os.path.join(tmp_dir, "document.pdf")

                if not os.path.exists(pdf_path):
                    return error("Conversion failed: PDF not generated", 500)

                with open(pdf_path, "rb") as f:
                    pdf_bytes = f.read()

                output = BytesIO(pdf_bytes)
                output.seek(0)

                return send_file_and_cleanup(
                    output,
                    mimetype="application/pdf",
                    as_attachment=True,
                    download_name="converted.pdf",
                )

            except subprocess.TimeoutExpired:
                return error("Conversion timed out", 500)
            except Exception as e:
                traceback.print_exc()
                return error(f"Error during conversion: {str(e)}", 500)

    except Exception as e:
        traceback.print_exc()
        return error(str(e), 500)

@ankittroy-21 Still there
