feat: three operators for data synthesis, data quality evaluation, and unstructuredio adaptation #485
Open
QianqiuerQS wants to merge 1 commit into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR introduces three new DataMate operators for document parsing and medical data synthesis/evaluation, along with a standalone HTTP service backing the latter two.
Changes:
- Adds `unstructuredio` mapper for PDF/DOCX parsing with a DOCX fast path and PDF noise suppression.
- Adds `data_synthesis` and `data_quality_evaluator` mappers that delegate to a standalone FastAPI service (`data_synthesis_service`) with vLLM/Ascend backends.
- Adds test cases, example inputs, Dockerfiles, and documentation for deployment and acceptance testing.
Reviewed changes
Copilot reviewed 71 out of 75 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| runtime/ops/mapper/unstructuredio/operator_src/process.py | Core unstructuredio mapper with DOCX fastpath, PDF strategy handling, and runtime overrides. |
| runtime/ops/mapper/unstructuredio/operator_src/metadata.yml | Operator metadata; contains mojibake in pdfInferTableStructure settings. |
| runtime/ops/mapper/unstructuredio/operator_src/tests/*.py | Tests/checks for DOCX fastpath coordinates. |
| runtime/ops/mapper/unstructuredio/test_cases/* | Public test cases and docs. |
| runtime/ops/mapper/data_synthesis/operator_src/* | Lightweight HTTP-calling operator. |
| runtime/ops/mapper/data_synthesis/service_patch/data_synthesis_service/* | FastAPI service (app/core/tests/Dockerfile). |
| runtime/ops/mapper/data_synthesis/service_patch/data_synthesis/* | Synthesizer, evaluator, metrics, benchmark/verification scripts. |
| runtime/ops/mapper/data_synthesis/service_image/* | Service image Dockerfile and README. |
| runtime/ops/mapper/data_quality_evaluator/operator_src/* | Evaluator operator wrapping /evaluate-file. |
| runtime/ops/mapper/data_quality_evaluator/service_patch/* | Duplicated service code for evaluator. |
| runtime/ops/mapper/data_quality_evaluator/test_cases/* | Public evaluation cases. |
Comments suppressed due to low confidence (9)
runtime/ops/mapper/unstructuredio/operator_src/metadata.yml:1
- These strings are mojibake: Chinese text encoded as GBK but interpreted as another encoding. The `name`, `description`, `checkedLabel`, and `unCheckedLabel` will render as garbled characters in the DataMate UI. Re-save these values as valid UTF-8 Chinese (e.g. `name: 'PDF 表格结构'`, `checkedLabel: '开启'`, `unCheckedLabel: '关闭'`) so they match the encoding used elsewhere in this file.
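The failure mode described above can be reproduced in a few lines. This is a minimal sketch, assuming the wrong decoder was a single-byte codec (`latin-1` is chosen here only for illustration; the review does not say which encoding was actually involved). The helper names `to_mojibake` and `repair_mojibake` are hypothetical.

```python
# Assumption: GBK bytes were decoded with a single-byte codec.
# latin-1 is used because it maps every byte to a character, so the
# damage is reversible; the real culprit encoding is not known.


def to_mojibake(text: str, wrong_codec: str = "latin-1") -> str:
    """Encode as GBK, then decode with the wrong codec (produces garbage)."""
    return text.encode("gbk").decode(wrong_codec)


def repair_mojibake(garbled: str, wrong_codec: str = "latin-1") -> str:
    """Reverse the wrong decode, then decode the bytes as GBK."""
    return garbled.encode(wrong_codec).decode("gbk")


# Label values suggested in the review comment.
labels = {"checkedLabel": "开启", "unCheckedLabel": "关闭"}
garbled = {key: to_mojibake(value) for key, value in labels.items()}
repaired = {key: repair_mojibake(value) for key, value in garbled.items()}
```

If the round trip recovers the intended text, the repaired strings can be written back to `metadata.yml` as UTF-8.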
runtime/ops/mapper/unstructuredio/operator_src/process.py:1
- `_classify_paragraph(chunk_text, paragraph_index, block)` is called twice per chunk (once for `category` and once nested inside `_assign_docx_coordinates`). Compute it once into a local variable and reuse it to avoid duplicate work and keep the two values in sync if classification logic ever changes.
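A sketch of the suggested refactor, using hypothetical stand-ins (`classify_paragraph`, `assign_coordinates`, `build_element`) rather than the real functions in `process.py`, whose bodies are not shown in this review. The point is only that the category is computed once and passed along.

```python
from typing import Any, Dict

# Counts classifier calls so the single-call property can be observed.
call_count = 0


def classify_paragraph(text: str, index: int, block: Dict[str, Any]) -> str:
    """Hypothetical stand-in for _classify_paragraph."""
    global call_count
    call_count += 1
    return "Title" if index == 0 else "NarrativeText"


def assign_coordinates(block: Dict[str, Any], category: str) -> Dict[str, Any]:
    """Hypothetical stand-in that receives the category instead of recomputing it."""
    return {"category": category, "y": block.get("y", 0)}


def build_element(text: str, index: int, block: Dict[str, Any]) -> Dict[str, Any]:
    # Classify once, then reuse the result everywhere it is needed.
    category = classify_paragraph(text, index, block)
    return {
        "text": text,
        "category": category,
        "coordinates": assign_coordinates(block, category),
    }


element = build_element("Intro", 0, {"y": 12})
```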
runtime/ops/mapper/data_synthesis/service_patch/data_synthesis_service/core.py:1
- When initialization fails, `_ensure_synthesizer_initialized` is called a second time immediately, but it just re-runs the same `_build_synthesizer()` synchronously with no backoff or state change in between, so the second call will almost always fail in the same way and only doubles the failure latency. Either drop the second call, or introduce a real retry/backoff strategy (e.g. delay, max attempts, or only retry on specific transient errors).
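One possible shape for the retry option is sketched below. `build_with_retry` is a hypothetical helper, and the choice of `ConnectionError`/`TimeoutError` as the transient errors is an assumption; the service's real failure modes are not visible in this review.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def build_with_retry(
    build: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    # Assumption: only these errors are worth retrying.
    transient: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call build() with exponential backoff, retrying only transient errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return build()
        except transient:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")
```

Non-transient errors (for example a missing model path) propagate on the first attempt, so a permanent misconfiguration does not pay the retry latency.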
runtime/ops/mapper/data_synthesis/service_patch/data_synthesis/data_evaluator.py:1
- Bare `except:` catches `SystemExit`/`KeyboardInterrupt` and hides bugs. Use `except Exception:` (or more specifically `except (json.JSONDecodeError, ValueError, TypeError):`) so unrelated errors are not silently swallowed.
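A minimal sketch of the narrower handler, assuming the guarded code parses a JSON object out of model output. `parse_scores` is a hypothetical name, not a function from `data_evaluator.py`.

```python
import json
from typing import Any, Dict, Optional


def parse_scores(raw: Any) -> Optional[Dict[str, Any]]:
    """Return the parsed JSON object, or None when the input is not usable."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # JSONDecodeError: malformed text. TypeError: raw is not str/bytes.
        # KeyboardInterrupt and SystemExit are deliberately not caught.
        return None
    return parsed if isinstance(parsed, dict) else None
```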
runtime/ops/mapper/data_synthesis/service_patch/data_synthesis/data_evaluator.py:1
- The docstring says `默认全部 7 个` (default: all 7 dimensions), but `dimension_criteria` now only defines 5 dimensions (准确性、相关性、安全性、完整性、多样性, i.e. accuracy, relevance, safety, completeness, diversity). Update the docstring to reflect 5 dimensions.
runtime/ops/mapper/data_synthesis/service_patch/data_synthesis_service/core.py:1
- `MedicalDataSynthesizer` is imported from `data_synthesizer`, but no `data_synthesizer.py` is provided in `service_patch/data_synthesis/` in this PR (the directory contains `data_evaluator.py`, `requirement_metrics.py`, and several scripts, but not `data_synthesizer.py`). Importing the service module will raise `ModuleNotFoundError` at runtime. Please add `data_synthesizer.py` (with `MedicalDataSynthesizer`) to the service patch directory, or adjust the import path.
runtime/ops/mapper/data_synthesis/service_patch/data_synthesis_service/core.py:1
- Embedding a multi-line Python program as a string and shipping it via `python -c` makes it hard to maintain (no syntax highlighting, no static analysis, no tests). Consider extracting `_synthesize_via_subprocess`/`_run_subprocess_worker` payloads into a small module (e.g. `data_synthesis_service.worker`) and invoking `python -m data_synthesis_service.worker` with the payload over stdin. This also avoids duplication between the synthesize and evaluate worker scripts.
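What such a worker module might look like is sketched below. The request shape (`task` plus `payload`), the handler bodies, and the exit codes are all assumptions made for illustration; the real payloads built by `_synthesize_via_subprocess` are not shown in this review.

```python
import io
import json
import sys
from typing import Any, Callable, Dict, TextIO


def synthesize(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical handler; a real one would call the synthesizer."""
    return {"task": "synthesize", "echo": payload}


def evaluate(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical handler; a real one would call the evaluator."""
    return {"task": "evaluate", "length": len(payload.get("text", ""))}


# One dispatch table serves both tasks, so the worker code is not duplicated.
HANDLERS: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "synthesize": synthesize,
    "evaluate": evaluate,
}


def main(stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> int:
    """Read one JSON request from stdin and write one JSON response to stdout."""
    request = json.load(stdin)
    handler = HANDLERS.get(request.get("task"))
    if handler is None:
        json.dump({"error": "unknown task"}, stdout)
        return 2
    json.dump(handler(request.get("payload", {})), stdout)
    return 0
```

In a real module the file would end with an `if __name__ == "__main__": sys.exit(main())` guard so that `python -m` runs it. Because `main` takes its streams as parameters, it can be unit-tested with `io.StringIO` without spawning a process.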
runtime/ops/mapper/unstructuredio/operator_src/process.py:1
- `original_initialize` is assigned inside the `try` block but only saved into the `finally` cleanup via `if "original_initialize" in locals()`. If the `from transformers import ...` line raises, `original_initialize` is set but `tables_module.UnstructuredTableTransformerModel.initialize` was never actually overwritten, so the restoration is fine. However, if the import succeeds and the patch on line 172 runs, then a later exception before `yield` would still leave the class patched while the `original_load_agent` restoration on line 183 happens. Initialize `original_initialize = None` alongside the other captured originals at the top of the function and gate restoration on it being non-`None` to make the cleanup symmetric and robust.
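The cleanup pattern suggested here can be sketched as follows. `TableModel` and `patched_initialize` are hypothetical stand-ins for the patched class and the context manager in `process.py`; only the shape of the `None` sentinel and the `finally` restoration is the point.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


class TableModel:
    """Hypothetical stand-in for the class whose method gets patched."""

    def initialize(self) -> str:
        return "original"


@contextmanager
def patched_initialize(replacement: Callable[["TableModel"], str]) -> Iterator[None]:
    # Declare the captured original up front so the finally block can
    # test it directly instead of inspecting locals().
    original_initialize = None
    try:
        original_initialize = TableModel.initialize
        TableModel.initialize = replacement
        yield
    finally:
        # Restore only if the capture actually happened.
        if original_initialize is not None:
            TableModel.initialize = original_initialize
```

Because the restoration sits in `finally`, the class is unpatched whether the body exits normally or raises.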
runtime/ops/mapper/data_synthesis/service_patch/data_synthesis_service/core.py:1
- Each call to `evaluate_text` with a different backend (e.g. `rule` for `_build_metrics`, then `vllm` for `/evaluate-file`) discards the existing evaluator and instantiates `MedicalDataEvaluator` again, which for the vLLM path triggers a full model load. Within a single `synthesize_text` request this means the evaluator may be rebuilt twice. Consider caching evaluators per backend (e.g. `self._evaluators: dict[str, MedicalDataEvaluator]`) so switching backends does not reload the LLM.
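A minimal sketch of per-backend caching. `FakeEvaluator` stands in for `MedicalDataEvaluator` (whose constructor signature is not shown in this review), and `EvaluatorCache` is a hypothetical name; in the service the dictionary would more likely live directly on the existing core object.

```python
from typing import Callable, Dict


class FakeEvaluator:
    """Hypothetical stand-in for MedicalDataEvaluator; counts constructions."""

    instances_created = 0

    def __init__(self, backend: str) -> None:
        type(self).instances_created += 1
        self.backend = backend


class EvaluatorCache:
    """Keeps one evaluator per backend so switching does not rebuild models."""

    def __init__(self, factory: Callable[[str], FakeEvaluator] = FakeEvaluator) -> None:
        self._factory = factory
        self._evaluators: Dict[str, FakeEvaluator] = {}

    def get(self, backend: str) -> FakeEvaluator:
        # Build lazily on first use, then reuse for every later request.
        if backend not in self._evaluators:
            self._evaluators[backend] = self._factory(backend)
        return self._evaluators[backend]
```

The trade-off is memory: every backend that has been used stays resident, which may matter if more than one of them holds a large model.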
    @@ -0,0 +1,98 @@
    import json
    import os
    from typing import Any, Dict, Iterable, List, Optional
Comment on lines +11 to +15

    PROJECT_ROOT = os.path.dirname(CURRENT_DIR)
    DATA_SYNTHESIS_DIR = os.path.join(PROJECT_ROOT, "data_synthesis")
    if DATA_SYNTHESIS_DIR not in sys.path:
        sys.path.insert(0, DATA_SYNTHESIS_DIR)
Comment on lines +92 to +96

    if response.status_code >= 400:
        raise RuntimeError(
            f"data_synthesis service failed: {response.status_code} {response.text}"
        )
    sample[self.text_key] = serialize_service_response(response.json())
Comment on lines +43 to +45

    # DataMate may garble non-ASCII operator params into question marks.
    if items and all(set(item) <= {"?"} for item in items):
        return list(DEFAULT_DIMENSIONS)
No description provided.