feat: adapt three operators (data synthesis, data quality evaluation, unstructuredio) #485

Open
QianqiuerQS wants to merge 1 commit into ModelEngine-Group:feat/hzjh from QianqiuerQS:feat/hzjh

Conversation

@QianqiuerQS

No description provided.

Copilot AI review requested due to automatic review settings May 17, 2026 12:26

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces three new DataMate operators for document parsing and medical data synthesis/evaluation, along with a standalone HTTP service backing the latter two.

Changes:

  • Adds unstructuredio mapper for PDF/DOCX parsing with a DOCX fast path and PDF noise suppression.
  • Adds data_synthesis and data_quality_evaluator mappers that delegate to a standalone FastAPI service (data_synthesis_service) with vLLM/Ascend backends.
  • Adds test cases, example inputs, Dockerfiles, and documentation for deployment and acceptance testing.

Reviewed changes

Copilot reviewed 71 out of 75 changed files in this pull request and generated 4 comments.

Summary per file

| File | Description |
| --- | --- |
| runtime/ops/mapper/unstructuredio/operator_src/process.py | Core unstructuredio mapper with DOCX fast path, PDF strategy handling, and runtime overrides. |
| runtime/ops/mapper/unstructuredio/operator_src/metadata.yml | Operator metadata; contains mojibake in pdfInferTableStructure settings. |
| runtime/ops/mapper/unstructuredio/operator_src/tests/*.py | Tests/checks for DOCX fast path coordinates. |
| runtime/ops/mapper/unstructuredio/test_cases/* | Public test cases and docs. |
| runtime/ops/mapper/data_synthesis/operator_src/* | Lightweight HTTP-calling operator. |
| runtime/ops/mapper/data_synthesis/service_patch/data_synthesis_service/* | FastAPI service (app/core/tests/Dockerfile). |
| runtime/ops/mapper/data_synthesis/service_patch/data_synthesis/* | Synthesizer, evaluator, metrics, benchmark/verification scripts. |
| runtime/ops/mapper/data_synthesis/service_image/* | Service image Dockerfile and README. |
| runtime/ops/mapper/data_quality_evaluator/operator_src/* | Evaluator operator wrapping /evaluate-file. |
| runtime/ops/mapper/data_quality_evaluator/service_patch/* | Duplicated service code for evaluator. |
| runtime/ops/mapper/data_quality_evaluator/test_cases/* | Public evaluation cases. |

Comments suppressed due to low confidence (9)

runtime/ops/mapper/unstructuredio/operator_src/metadata.yml:1

  • These strings are mojibake — Chinese text encoded as GBK but interpreted as another encoding. The name, description, checkedLabel, and unCheckedLabel will render as garbled characters in the DataMate UI. Re-save these values as valid UTF-8 Chinese (e.g. name: 'PDF 表格结构', checkedLabel: '开启', unCheckedLabel: '关闭') so they match the encoding used elsewhere in this file.
    runtime/ops/mapper/unstructuredio/operator_src/process.py:1
  • _classify_paragraph(chunk_text, paragraph_index, block) is called twice per chunk (once for category and once nested inside _assign_docx_coordinates). Compute it once into a local variable and reuse it to avoid duplicate work and keep the two values in sync if classification logic ever changes.
    runtime/ops/mapper/data_synthesis/service_patch/data_synthesis_service/core.py:1
  • When initialization fails, _ensure_synthesizer_initialized is called a second time immediately, but it just re-runs the same _build_synthesizer() synchronously with no backoff or state change in between — so the second call will almost always fail in the same way and only doubles the failure latency. Either drop the second call, or introduce a real retry/backoff strategy (e.g. delay, max attempts, or only retry on specific transient errors).
    runtime/ops/mapper/data_synthesis/service_patch/data_synthesis/data_evaluator.py:1
  • Bare except: catches SystemExit/KeyboardInterrupt and hides bugs. Use except Exception: (or more specifically except (json.JSONDecodeError, ValueError, TypeError):) so unrelated errors are not silently swallowed.
    runtime/ops/mapper/data_synthesis/service_patch/data_synthesis/data_evaluator.py:1
  • The docstring says 默认全部 7 个 ("all 7 by default"), but dimension_criteria now defines only 5 dimensions: 准确性 (accuracy), 相关性 (relevance), 安全性 (safety), 完整性 (completeness), and 多样性 (diversity). Update the docstring to say 5 dimensions.
    runtime/ops/mapper/data_synthesis/service_patch/data_synthesis_service/core.py:1
  • MedicalDataSynthesizer is imported from data_synthesizer, but no data_synthesizer.py is provided in service_patch/data_synthesis/ in this PR (the directory contains data_evaluator.py, requirement_metrics.py, and several scripts, but not data_synthesizer.py). Importing the service module will raise ModuleNotFoundError at runtime. Please add data_synthesizer.py (with MedicalDataSynthesizer) to the service patch directory, or adjust the import path.
    runtime/ops/mapper/data_synthesis/service_patch/data_synthesis_service/core.py:1
  • Embedding a multi-line Python program as a string and shipping it via python -c makes it hard to maintain (no syntax highlighting, no static analysis, no tests). Consider extracting _synthesize_via_subprocess/_run_subprocess_worker payloads into a small module (e.g. data_synthesis_service.worker) and invoking python -m data_synthesis_service.worker with the payload over stdin. This also avoids duplication between the synthesize and evaluate worker scripts.
    runtime/ops/mapper/unstructuredio/operator_src/process.py:1
  • original_initialize is assigned inside the try block, and the finally cleanup restores it only under if "original_initialize" in locals(). If the from transformers import ... line raises, original_initialize is set but tables_module.UnstructuredTableTransformerModel.initialize was never actually overwritten, so the restoration is fine. However, if the import succeeds and the patch on line 172 runs, a later exception before yield would still leave the class patched while the original_load_agent restoration on line 183 happens. Initialize original_initialize = None alongside the other captured originals at the top of the function, and gate restoration on it being non-None, to make the cleanup symmetric and robust.
    runtime/ops/mapper/data_synthesis/service_patch/data_synthesis_service/core.py:1
  • Each call to evaluate_text with a different backend (e.g. rule for _build_metrics, then vllm for /evaluate-file) discards the existing evaluator and instantiates MedicalDataEvaluator again — which for the vLLM path triggers a full model load. Within a single synthesize_text request this means the evaluator may be rebuilt twice. Consider caching evaluators per backend (e.g. self._evaluators: dict[str, MedicalDataEvaluator]) so switching backends does not reload the LLM.
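Two of the core.py suggestions above (a real retry/backoff strategy for initialization, and caching evaluators per backend) can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: `build_with_retry`, `EvaluatorCache`, the `factory` callable, and the choice of `ConnectionError`/`TimeoutError` as "transient" errors are all hypothetical names and assumptions introduced here.

```python
import time
from typing import Callable, Dict, Tuple, Type


def build_with_retry(
    build: Callable[[], object],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Call build() up to max_attempts times with exponential backoff.

    Only exceptions listed in retry_on (assumed transient) are retried;
    anything else propagates immediately on the first failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return build()
        except retry_on:
            if attempt == max_attempts:
                raise
            # Delay doubles each time: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


class EvaluatorCache:
    """Keeps one evaluator per backend so switching backends does not rebuild it."""

    def __init__(self, factory: Callable[[str], object]) -> None:
        # factory is a hypothetical callable that builds an evaluator
        # for a backend name such as "rule" or "vllm".
        self._factory = factory
        self._evaluators: Dict[str, object] = {}

    def get(self, backend: str) -> object:
        # Build lazily on first use, then reuse the same instance.
        if backend not in self._evaluators:
            self._evaluators[backend] = self._factory(backend)
        return self._evaluators[backend]
```

With a cache like this, alternating between the rule and vllm backends within one request would construct each evaluator at most once, at the cost of holding every built evaluator in memory for the life of the cache.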

@@ -0,0 +1,98 @@
import json
import os
from typing import Any, Dict, Iterable, List, Optional
Comment on lines +11 to +15
PROJECT_ROOT = os.path.dirname(CURRENT_DIR)
DATA_SYNTHESIS_DIR = os.path.join(PROJECT_ROOT, "data_synthesis")
if DATA_SYNTHESIS_DIR not in sys.path:
    sys.path.insert(0, DATA_SYNTHESIS_DIR)

Comment on lines +92 to +96
if response.status_code >= 400:
    raise RuntimeError(
        f"data_synthesis service failed: {response.status_code} {response.text}"
    )
sample[self.text_key] = serialize_service_response(response.json())
Comment on lines +43 to +45
# DataMate may garble non-ASCII operator params into question marks.
if items and all(set(item) <= {"?"} for item in items):
    return list(DEFAULT_DIMENSIONS)
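
To see how the question-mark fallback in the excerpt above behaves, here is a small self-contained sketch. It assumes the parameter arrives as a comma-separated string and that DEFAULT_DIMENSIONS holds the five dimension names mentioned in the review; `parse_dimensions` is a hypothetical helper name, and the operator's real parsing code is not shown on this page.

```python
from typing import List, Optional

# Assumed defaults: the five dimensions named in the review summary.
DEFAULT_DIMENSIONS = ["准确性", "相关性", "安全性", "完整性", "多样性"]


def parse_dimensions(raw: Optional[str]) -> List[str]:
    """Hypothetical helper: split a comma-separated param into dimension names."""
    if not raw:
        return list(DEFAULT_DIMENSIONS)
    items = [part.strip() for part in raw.split(",") if part.strip()]
    if not items:
        return list(DEFAULT_DIMENSIONS)
    # Same check as the excerpt: every item consists only of "?" characters,
    # which suggests the non-ASCII names were lost in transit.
    if all(set(item) <= {"?"} for item in items):
        return list(DEFAULT_DIMENSIONS)
    return items
```

Note that the check only fires when every item is garbled. A partially garbled value such as "准确性,???" passes through unchanged, so validating items against the known dimension names may be a safer complement.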

2 participants