1188 improve pdf markdown extraction in mediumrich profiles #1320

Merged
simcariou merged 24 commits into develop from 1188-improve-pdf-markdown-extraction-in-mediumrich-profiles on Mar 30, 2026
@simcariou
Collaborator

No description provided.

Contributor

Copilot AI left a comment

Pull request overview

This PR improves PDF→Markdown extraction for the Knowledge Flow “medium/rich” ingestion profiles by expanding Docling/RapidOCR configuration (notably OpenVINO support) and aligning deployment configs accordingly, with a small frontend UX/i18n tweak in the agent asset manager drawer.

Changes:

  • Add build-time patching to Docling so RapidOCR’s "openvino" backend can resolve default model artifact paths, and add the openvino dependency on Linux.
  • Extend PDF pipeline configuration to support ocr_backend and force_full_page_ocr, and update multiple environment/deployment YAMLs to use Docling parsing with table structure extraction and OpenVINO OCR in rich profiles.
  • Update the frontend asset manager drawer title to display agentName (fallback to agentId) and add an “Add more files” label.
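As a rough illustration of the second bullet, the sketch below shows how a PDF pipeline config could carry the two new options. Only the field names `ocr_backend` and `force_full_page_ocr` come from the summary above; the class name, the `do_table_structure` field, the set of accepted backend values, and the `rich_profile_options` helper are assumptions made for this example and are not taken from `structures.py`.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed set of accepted backend names for this sketch; the real schema
# may accept a different set.
ALLOWED_OCR_BACKENDS = {"onnxruntime", "openvino"}


@dataclass(frozen=True)
class PdfPipelineOptions:
    """Hypothetical stand-in for the PDF pipeline config schema."""

    do_table_structure: bool = False    # assumed field name
    ocr_backend: Optional[str] = None   # None means "use the OCR library default"
    force_full_page_ocr: bool = False   # OCR whole pages instead of detected regions

    def __post_init__(self) -> None:
        # Reject unknown backends early, at config load time.
        if self.ocr_backend is not None and self.ocr_backend not in ALLOWED_OCR_BACKENDS:
            raise ValueError(f"unsupported ocr_backend: {self.ocr_backend!r}")


def rich_profile_options() -> PdfPipelineOptions:
    """Hypothetical 'rich' profile: table structure plus OpenVINO OCR."""
    return PdfPipelineOptions(do_table_structure=True, ocr_backend="openvino")
```

Validating the backend name when the config is loaded means a typo in a deployment YAML would surface at startup rather than during the first PDF ingestion.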

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| scripts/patch_docling_rapidocr_openvino.py | Build-time patch script to extend Docling RapidOCR default model mapping for OpenVINO. |
| knowledge-flow-backend/pyproject.toml | Adds Linux-only openvino dependency alongside pinned Docling. |
| knowledge-flow-backend/uv.lock | Locks openvino/openvino-telemetry versions for Linux builds. |
| knowledge-flow-backend/knowledge_flow_backend/core/processors/input/pdf_markdown_processor/pdf_markdown_processor.py | Wires new RapidOCR options into Docling pipeline and switches to in-memory markdown export. |
| knowledge-flow-backend/knowledge_flow_backend/common/structures.py | Adds ocr_backend and force_full_page_ocr to PDF pipeline config schema. |
| knowledge-flow-backend/knowledge_flow_backend/application_context.py | Logs the new PDF OCR configuration fields in the config summary. |
| knowledge-flow-backend/dockerfiles/Dockerfile-prod | Runs the Docling patch script during image build. |
| knowledge-flow-backend/dockerfiles/Dockerfile-dev | Runs the Docling patch script during dev image build. |
| knowledge-flow-backend/config/configuration.yaml | Updates default profile settings and PDF processing options (Docling parse, tables, OCR tuning). |
| knowledge-flow-backend/config/configuration_prod.yaml | Aligns prod profile PDF settings (tables + OpenVINO OCR in rich). |
| knowledge-flow-backend/config/configuration_test.yaml | Aligns test profile PDF settings (tables + OpenVINO OCR in rich). |
| knowledge-flow-backend/config/configuration_gcp.yaml | Aligns GCP profile PDF settings and processor selection. |
| knowledge-flow-backend/config/configuration_worker.yaml | Aligns worker profile PDF settings and processors. |
| knowledge-flow-backend/config/configuration_bench.yaml | Adjusts benchmark defaults and PDF profile options. |
| deploy/charts/fred/values.yaml | Updates Helm values to reflect new processing profile defaults and OCR/table settings. |
| deploy/local/k3d/values-local.yaml | Updates local k3d values to include/align medium & rich profiles with Docling parse and OpenVINO OCR. |
| frontend/src/locales/en/translation.json | Updates asset manager title interpolation and adds “addMoreFiles” string. |
| frontend/src/locales/fr/translation.json | Updates asset manager title interpolation and adds “addMoreFiles” string. |
| frontend/src/components/agentHub/AgentGridManager.tsx | Passes agentName into the asset manager drawer. |
| frontend/src/components/agentHub/AgentConfigWorkspaceManagerDrawer.tsx | Uses agent name in title (fallback to id) and uses new i18n key for “add more files”. |
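The contents of `scripts/patch_docling_rapidocr_openvino.py` are not shown in this conversation. The sketch below only illustrates one common shape for a build-time source patch: insert a line after a known anchor, do nothing if the line is already there, and fail the build if the anchor has disappeared. The function name `apply_patch`, the anchor text, and the mapping entries are all hypothetical and do not reflect Docling's actual source.

```python
def apply_patch(source: str, anchor: str, addition: str) -> str:
    """Insert `addition` on the line after the first line containing `anchor`.

    Idempotent: if `addition` is already present, the source is returned
    unchanged, so re-running the patch in a cached build layer is harmless.
    """
    if addition in source:
        return source
    lines = source.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if anchor in line:
            if not line.endswith("\n"):
                lines[i] = line + "\n"  # anchor was the unterminated last line
            new_line = addition if addition.endswith("\n") else addition + "\n"
            lines.insert(i + 1, new_line)
            return "".join(lines)
    # Fail loudly: a missing anchor usually means the pinned dependency changed.
    raise RuntimeError(f"patch anchor not found: {anchor!r}")
```

Raising when the anchor is missing pairs naturally with a pinned dependency version: if the pin is bumped and the patched file changes shape, the image build stops instead of silently shipping an unpatched library.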

Comment thread knowledge-flow-backend/dockerfiles/Dockerfile-dev Outdated
Comment thread knowledge-flow-backend/dockerfiles/Dockerfile-dev Outdated
@gitguardian

gitguardian bot commented Mar 27, 2026

⚠️ GitGuardian has uncovered 3 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
| --- | --- | --- | --- | --- |
| 27366443 | Triggered | Generic Password | 830873c | deploy/local/k3d/values-local.yaml |
| 17205519 | Triggered | Generic Password | 830873c | deploy/local/k3d/values-local.yaml |
| 17205519 | Triggered | Generic Password | 3f59a2a | deploy/local/k3d/values-local.yaml |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely. Learn here the best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
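For step 2, one generic way to keep a password out of committed files is to have the application read it from the environment at startup. This is a minimal sketch, not what this repository does: the helper name `require_secret` and the variable name in the usage note are hypothetical.

```python
import os
from typing import Mapping, Optional


def require_secret(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the secret stored under `name`, or raise if it is unset or empty.

    `env` defaults to the process environment; passing a plain dict makes the
    function easy to exercise in tests.
    """
    source = os.environ if env is None else env
    value = source.get(name, "")
    if not value:
        # Never fall back to a hardcoded default for credentials.
        raise RuntimeError(f"missing required secret: {name}")
    return value
```

A caller would then write something like `password = require_secret("DB_PASSWORD")`, with the value injected by the deployment tooling rather than stored in a values file.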

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@simcariou simcariou merged commit 43bd3f7 into develop Mar 30, 2026
31 checks passed
@simcariou simcariou deleted the 1188-improve-pdf-markdown-extraction-in-mediumrich-profiles branch March 30, 2026 09:25
Development

Successfully merging this pull request may close these issues.

PDF Markdown extraction returns invalid/poor output in medium/rich profiles

3 participants