feat: implement high-fidelity DOCX to PDF conversion using headless LibreOffice by ankittroy-21 · Pull Request #269 · Durgeshwar-AI/pdfToPng

ankittroy-21 · 2026-06-10T05:25:28Z

This PR upgrades the DOCX to PDF conversion engine from a basic text-extraction method to a full-fidelity rendering engine using headless LibreOffice.

The Problem

The previous implementation extracted raw text with python-docx and rendered it to PDF with ReportLab. This resulted in the loss of:

Formatting: Bold, italics, font sizes, and colors.
Layout: Indents, alignment, and page breaks.
Media: All images and graphics.
Tables: Structural elements were completely omitted.
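To illustrate why a text-only pipeline drops formatting, here is a minimal sketch that uses only the standard library as a stand-in for python-docx (it is not the code this PR replaces). It builds a hand-written `word/document.xml` containing one bold run and one plain run, then reads back only the text nodes. The helper names `make_minimal_docx` and `extract_plain_text` are hypothetical, and the archive is deliberately incomplete (Word would not open it).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a DOCX archive.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written body: a bold run ("Important") followed by a plain run (" note").
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    "<w:r><w:rPr><w:b/></w:rPr><w:t>Important</w:t></w:r>"
    '<w:r><w:t xml:space="preserve"> note</w:t></w:r>'
    "</w:p></w:body></w:document>"
)


def make_minimal_docx() -> bytes:
    """Hypothetical helper: zip the XML above under the path a DOCX uses."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", DOCUMENT_XML)
    return buf.getvalue()


def extract_plain_text(docx_bytes: bytes) -> str:
    """Collect only the <w:t> text nodes; run properties such as <w:b/> are ignored."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    return "".join(t.text or "" for t in root.iter(f"{{{W_NS}}}t"))


if __name__ == "__main__":
    print(extract_plain_text(make_minimal_docx()))
```

The bold marker lives in `<w:rPr>`, which the extraction never looks at, so any PDF built from the returned string cannot reproduce it. Images and tables are stored in other elements and parts of the archive and are lost the same way.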

The Solution

LibreOffice Integration: Integrated headless LibreOffice into the backend Docker environment to serve as the conversion engine.
Subprocess Workflow: Refactored backend/blueprints/docx_to_pdf.py to use libreoffice --convert-to pdf via a Python subprocess.
Secure Temporary Processing: Utilized tempfile.TemporaryDirectory to ensure that conversion artifacts are automatically cleaned up and never persist beyond the request lifecycle.
Dependency Optimization: Removed the reportlab library from requirements.txt as it is no longer required for this feature.
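The subprocess workflow described above can be sketched roughly as follows. This is a simplified illustration, not the blueprint code itself: the function names `build_convert_command`, `expected_pdf_path`, and `convert_docx_bytes`, and the injectable `runner` parameter (added so the flow can be exercised without LibreOffice installed) are assumptions. The command-line flags are the ones this PR uses.

```python
import os
import subprocess
import tempfile


def build_convert_command(input_path, outdir, binary="libreoffice"):
    """Return the argument list for a headless DOCX-to-PDF conversion."""
    return [binary, "--headless", "--convert-to", "pdf", "--outdir", outdir, input_path]


def expected_pdf_path(input_path, outdir):
    """LibreOffice writes <input stem>.pdf into the output directory."""
    stem = os.path.splitext(os.path.basename(input_path))[0]
    return os.path.join(outdir, stem + ".pdf")


def convert_docx_bytes(docx_bytes, runner=subprocess.run, timeout=60):
    """Convert DOCX bytes to PDF bytes inside a self-cleaning temp directory.

    `runner` defaults to subprocess.run; it is a hypothetical hook for testing.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        input_path = os.path.join(tmp_dir, "document.docx")
        with open(input_path, "wb") as f:
            f.write(docx_bytes)

        result = runner(
            build_convert_command(input_path, tmp_dir),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        if result.returncode != 0:
            raise RuntimeError(f"Conversion failed: {result.stderr}")

        pdf_path = expected_pdf_path(input_path, tmp_dir)
        if not os.path.exists(pdf_path):
            raise RuntimeError("Conversion failed: PDF not generated")

        # Read the result into memory before the directory is removed.
        with open(pdf_path, "rb") as f:
            return f.read()
```

Reading the PDF into memory before leaving the `with` block is what allows the temporary directory to be deleted while the response is still being sent.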

Changes

backend/Dockerfile: Installed libreoffice-writer and necessary Java runtimes for headless operation.
backend/blueprints/docx_to_pdf.py: Completely refactored to use LibreOffice instead of ReportLab.
backend/requirements.txt: Removed reportlab.
Readme.md: Updated Tech Stack and added local setup notes for LibreOffice.

Verification Results

Verified that complex DOCX documents (with tables and images) are converted to PDF with original styling preserved.
Confirmed that temporary files are automatically deleted after conversion.
Validated that the Docker build correctly installs all new system dependencies.
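The cleanup claim can be checked in isolation with a small sketch like the one below. It does not run LibreOffice; it only demonstrates the `tempfile.TemporaryDirectory` behaviour the PR relies on. The function name `write_and_report` is hypothetical.

```python
import os
import tempfile


def write_and_report(payload: bytes):
    """Write a file into a temporary directory and report what exists afterwards.

    Returns (file existed inside the block, file exists after, directory exists after).
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "document.docx")
        with open(path, "wb") as f:
            f.write(payload)
        existed_inside = os.path.exists(path)
    # Leaving the with-block removes the directory and everything in it.
    return existed_inside, os.path.exists(path), os.path.exists(tmp_dir)


if __name__ == "__main__":
    print(write_and_report(b"example"))
```

Cleanup happens on normal exit and when an exception propagates out of the `with` block, so error paths in the request handler do not leave files behind either.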

How to Test

Rebuild the backend Docker container: docker-compose up --build backend.
Upload a DOCX file containing tables, images, and varied font styles through the "DOCX to PDF" tool in the UI.
Verify that the downloaded PDF matches the source document's layout and formatting exactly.
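For a quick scripted check alongside the manual UI test, something like the following sketch could be used. The `/convertDocxToPdf` route and the `file` form field come from the blueprint in this PR. The base URL `http://localhost:5000`, the upload filename, and the helper names are assumptions. It only confirms that the response is a PDF; layout fidelity still has to be checked by eye.

```python
import urllib.request
import uuid

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def encode_multipart(field, filename, data, content_type):
    """Build a single-file multipart/form-data body and its Content-Type header."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        data,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return body, f"multipart/form-data; boundary={boundary}"


def looks_like_pdf(data: bytes) -> bool:
    """PDF files start with the '%PDF-' marker."""
    return data.startswith(b"%PDF-")


def convert_via_api(docx_path, base_url="http://localhost:5000"):
    """POST a DOCX file to the backend and return the response bytes.

    base_url is an assumption; adjust it to wherever the backend is exposed.
    """
    with open(docx_path, "rb") as f:
        data = f.read()
    body, content_type = encode_multipart("file", "sample.docx", data, DOCX_MIME)
    request = urllib.request.Request(
        base_url + "/convertDocxToPdf",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.read()
```

With the container running, `looks_like_pdf(convert_via_api("sample.docx"))` should return `True` for a successful conversion.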

vercel · 2026-06-10T05:25:32Z

@ankittroy-21 is attempting to deploy a commit to the Durgeshwar's projects Team on Vercel.

A member of the Team first needs to authorize it.

ankittroy-21 · 2026-06-10T08:46:44Z

@Durgeshwar-AI Please look into this.

Durgeshwar-AI · 2026-06-11T15:51:41Z

@ankittroy-21 I do not want any temp files created. Please remove all such instances.

ankittroy-21 · 2026-06-11T16:21:56Z

@Durgeshwar-AI Removed the temp file section

Durgeshwar-AI · 2026-06-12T15:00:51Z

import os
import subprocess
import tempfile
import traceback
from io import BytesIO

from flask import Blueprint, request

from utils.helpers import error, send_file_and_cleanup

docx_pdf_bp = Blueprint("docx_pdf", __name__)


@docx_pdf_bp.route("/convertDocxToPdf", methods=["POST"])
def convert_docx_to_pdf():
    try:
        if "file" not in request.files:
            return error("No file provided")

        docx_file = request.files["file"]

        if docx_file.filename == "":
            return error("No file selected")

        docx_bytes = docx_file.read()

        # Use LibreOffice for high-fidelity conversion
        with tempfile.TemporaryDirectory() as tmp_dir:
            input_path = os.path.join(tmp_dir, "document.docx")
            with open(input_path, "wb") as f:
                f.write(docx_bytes)

            try:
                # Execute LibreOffice headless conversion
                # Using --convert-to pdf --outdir
                result = subprocess.run(
                    [
                        "libreoffice",
                        "--headless",
                        "--convert-to",
                        "pdf",
                        "--outdir",
                        tmp_dir,
                        input_path,
                    ],
                    capture_output=True,
                    text=True,
                    timeout=60,  # 60 second timeout
                )

                if result.returncode != 0:
                    print(f"LibreOffice error: {result.stderr}")
                    return error(f"Conversion failed: {result.stderr}", 500)

                # LibreOffice names the output file same as input but with .pdf extension
                pdf_path = os.path.join(tmp_dir, "document.pdf")

                if not os.path.exists(pdf_path):
                    return error("Conversion failed: PDF not generated", 500)

                # Read the PDF into memory before the temporary directory is removed
                with open(pdf_path, "rb") as f:
                    pdf_bytes = f.read()

                output = BytesIO(pdf_bytes)
                output.seek(0)

                return send_file_and_cleanup(
                    output,
                    mimetype="application/pdf",
                    as_attachment=True,
                    download_name="converted.pdf",
                )

            except subprocess.TimeoutExpired:
                return error("Conversion timed out", 500)
            except Exception as e:
                traceback.print_exc()
                return error(f"Error during conversion: {str(e)}", 500)

    except Exception as e:
        traceback.print_exc()
        return error(str(e), 500)

@ankittroy-21 Still there

feat: implement high-fidelity DOCX to PDF conversion using headless LibreOffice

e1ea759

ankittroy-21 added 2 commits June 11, 2026 21:50

Update image.py

c0da698

Merge branch 'Durgeshwar-AI:main' into main

1ea72e9

ankittroy-21 wants to merge 3 commits into Durgeshwar-AI:main from ankittroy-21:main
