Ingestion module for foundational DP#276
Conversation
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the cdf_ingestion_foundation module, which provides a framework for orchestrating two-phase ingestion workflows (population and contextualization) across PI, OPC-UA, and SAP source systems. It includes a Python generator script to build workflow versions from task snippets, various SQL transformations, and authorization group definitions. Review feedback primarily addresses Python style guide violations in the build script—such as import sorting, the need for typed data structures (dataclasses/Pydantic), and proper logging—as well as security recommendations to restrict overly broad wildcard scopes in the authorization group capabilities.
| import argparse | ||
| import sys | ||
| from pathlib import Path |
There was a problem hiding this comment.
Sort standard library imports alphabetically.
| import argparse | |
| import sys | |
| from pathlib import Path | |
| import argparse | |
| from pathlib import Path | |
| import sys |
References
- Sort alphabetically within groups. (link)
| def load_config() -> dict: | ||
| return yaml.safe_load(CONFIG_FILE.read_text()) |
There was a problem hiding this comment.
Use dataclasses or Pydantic models to represent the configuration instead of untyped dictionaries. This ensures type safety and adheres to the project's coding standards.
References
- Prefer dataclasses or Pydantic models over untyped dictionaries. Always parse file content into typed structures. (link)
| print(f"WARNING: Unknown dataModelVariant '{dm}'. No contextualization tasks added.") | ||
| print("Supported variants: isa_manufacturing_extension | cfihos_oil_and_gas") |
There was a problem hiding this comment.
Use the logging module instead of print for warnings and informational messages.
References
- Use the logging module with appropriate levels. (link)
| return tasks | ||
|
|
||
|
|
||
| def load_tasks(filenames: list[str]) -> list[dict]: |
There was a problem hiding this comment.
Missing docstring for load_tasks. Functions must have concise docstrings using the Args/Returns format.
| def load_tasks(filenames: list[str]) -> list[dict]: | |
| def load_tasks(filenames: list[str]) -> list[dict]: | |
| """ | |
| Load task snippets from YAML files. | |
| Args: | |
| filenames: list of task snippet filenames. | |
| Returns: | |
| list of task definitions. | |
| """ |
References
- Use concise docstrings with Args/Returns format. (link)
| - transformationsAcl: | ||
| actions: [READ, RUN] | ||
| scope: | ||
| all: {} |
There was a problem hiding this comment.
The transformationsAcl uses a wildcard scope (all: {}). This is overly broad for a service account. Restrict the scope to specific datasets or IDs where possible.
References
- Flag overly broad capabilities (e.g. wildcards '*' for dataSetScope, idScope, ACLs) in group definitions unless clearly justified. (link)
| - datasetsAcl: | ||
| actions: [READ] | ||
| scope: | ||
| all: {} |
There was a problem hiding this comment.
The datasetsAcl uses a wildcard scope (all: {}). Consider restricting this to the specific datasets required by the workflow.
References
- Flag overly broad capabilities (e.g. wildcards '*' for dataSetScope, idScope, ACLs) in group definitions unless clearly justified. (link)
7e0e65e to
7425757
Compare
0347e26 to
c6ed831
Compare
Adds the Layer 2 orchestration module for
dp:foundation. This module owns the ingestion workflow, all transformation definitions, and the auth groups needed to run them.The module implements a two-phase workflow:
Which phases and tasks are included is driven entirely by config flags (enabledSources, enabledContextualization, dataModelVariant) — no YAML editing required when toggling a source on or off.
Key files:
scripts/build_workflow.py— generates wf_ingestion_v1.WorkflowVersion.yaml from per-task snippets based on the active config. Run with --check in CI to detect drift.transformations/— 8 scaffold SQL + YAML pairs (population: PI, OPC-UA, SAP assets/equipment/orders/operations; contextualization: equipment-to-asset, operation-to-order). SQL follows naming conventiontr_{source}_{location}_to_{target}.auth/— two self-contained groups: workflow service account (execute + transform) and workflow user (read-only monitoring).