feat(warehouses): Implement merge loading of updated Opralog entries #177
Conversation
Important: Review skipped. Auto incremental reviews are disabled on this repository; check the settings in the CodeRabbit UI.

📝 Walkthrough

This pull request refactors testing infrastructure by restructuring GitHub Actions workflows, extracting reusable testing utilities into elt_common modules, and implementing lag window merge loading for Opralog incremental updates with comprehensive test coverage.
Pre-merge checks: ❌ 2 failed checks (warnings), ✅ 3 passed.
Actionable comments posted: 13
Fix all issues with AI Agents 🤖
In @.github/workflows/warehouses_e2e_tests.yml:
- Around line 44-47: The cleanup step named "Remove adp-router from /etc/hosts"
should run regardless of test outcome; modify that job step to include the
conditional "if: always()" at the step level so the sudo sed -i -e
'/adp-router/d' /etc/hosts command executes unconditionally after the test step.
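
A minimal sketch of the adjusted step, assuming the step name and command quoted in the prompt above:

```yaml
- name: Remove adp-router from /etc/hosts
  # if: always() runs this cleanup even when the preceding test step fails
  if: always()
  run: sudo sed -i -e '/adp-router/d' /etc/hosts
```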
In @elt-common/pyproject.toml:
- Around line 18-19: The pytest11 entry point points at
elt_common.testing.fixtures but setuptools won't discover packages in a src/
layout; update pyproject.toml by adding a [tool.setuptools] section that maps
the package directory (package-dir = {"" = "src"}) so setuptools can find the
elt_common package and the pytest plugin will load.
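
The suggested addition would look roughly like this (a sketch; the rest of pyproject.toml is unchanged):

```toml
[tool.setuptools]
package-dir = {"" = "src"}
```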
- Around line 24-35: Update the vulnerable/insecure dev dependencies in the
dependency-groups list: replace the "requests~=2.32.3" spec with a safe minimum
(e.g., "requests>=2.32.4" or "requests~=2.32.5") to remediate CVE-2024-47081,
bump "ruff~=0.11.11" to a secure release such as "ruff~=0.14.10", and optionally
update "pytest~=8.3.5" to "pytest~=9.0.2"; ensure these changes are made where
the dependency entries for requests, ruff, and pytest appear so the project uses
the non-vulnerable versions.
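
Applied to the dev dependency group, the bumped entries would read something like this (other entries omitted; versions taken from the suggestion above):

```toml
[dependency-groups]
dev = [
    "pytest~=9.0.2",
    "requests>=2.32.4",  # first release with the CVE-2024-47081 fix
    "ruff~=0.14.10",
]
```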
In @infra/ansible-docker/playbooks/elt/templates/secrets/envvars.j2:
- Line 7: Remove the unsupported connection parameter by deleting the
OPRALOGWEB__SOURCES__SQL_DATABASE__CREDENTIALS__SCHEME entry from the envvars
template; schema is not a valid dlt sql_database credential and must be
configured at the application level (see schema usage in extract_and_load.py),
so ensure only valid credentials remain (drivername, database, username,
password, host, port).
In @infra/local/env-local:
- Around line 52-53: The MAX_IDLE_CONNS value exceeds MAX_OPEN_CONNS which is
invalid; update OPENFGA_DATASTORE_MAX_IDLE_CONNS to be <=
OPENFGA_DATASTORE_MAX_OPEN_CONNS (e.g., change OPENFGA_DATASTORE_MAX_IDLE_CONNS
from 100 to 50) in the env-local file so the connection pool configuration is
consistent; ensure the two environment variables
OPENFGA_DATASTORE_MAX_OPEN_CONNS and OPENFGA_DATASTORE_MAX_IDLE_CONNS are kept
aligned whenever modified.
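
Assuming MAX_OPEN_CONNS currently sits at 50 (as the suggested change implies), the aligned pair in env-local would be:

```
OPENFGA_DATASTORE_MAX_OPEN_CONNS=50
OPENFGA_DATASTORE_MAX_IDLE_CONNS=50
```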
In @warehouses/accelerator/extract_load/opralogweb/.dlt/config.toml:
- Around line 24-25: The MoreEntryColumns DLT table config currently sets
write_disposition = "merge" but is missing a primary_key; update the
MoreEntryColumns configuration to include primary_key = "MoreEntryColumnId"
(matching the incremental field used for Entries) so merges can be applied
correctly for the table named "MoreEntryColumns" when using write_disposition =
"merge".
- Around line 7-9: The TOML shows write_disposition=merge for Entries (which is
fine because entries_table() already sets primary key via
dlt.sources.incremental()), but the MoreEntryColumns resource is missing a
primary key required for merge/upsert; update the Python resource definition
that uses sql_table for the "MoreEntryColumns" table to include
primary_key="MoreEntryColumnId" (i.e., set primary_key on the sql_table call
that defines MoreEntryColumns so the merge strategy/upsert will succeed).
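
Taken together, the two comments amount to a sketch like the following. The TOML section path is an assumption based on dlt's per-table config convention, and apply_hints is used in the Python half because the exact sql_table call in extract_and_load.py isn't shown here:

```toml
[sources.sql_database.MoreEntryColumns]
write_disposition = "merge"
primary_key = "MoreEntryColumnId"
```

and in the resource definition:

```python
from dlt.sources.sql_database import sql_table

# Sketch: give MoreEntryColumns a primary key so merge can upsert.
more_entry_columns = sql_table(table="MoreEntryColumns")
more_entry_columns.apply_hints(
    primary_key="MoreEntryColumnId",
    write_disposition="merge",
)
```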
In @warehouses/accelerator/extract_load/opralogweb/extract_and_load.py:
- Around line 38-47: Fix the module docstring in extract_and_load.py: correct
the typo "Unfortuantely" to "Unfortunately" and complete the trailing sentence
so it reads something like "We use the 'LastChangedDate' column of 'Entries' to
find the list of new or updated EntryId values and load these into the
MoreEntryColumns table to refresh associated records." Ensure the updated text
remains consistent with the docstring’s explanation of append-only vs
merge-tables.
- Around line 113-126: The callback more_entry_columns_query currently calls
EXTRACTED_ENTRY_IDS.to_pylist() unguarded; update it to first check that
EXTRACTED_ENTRY_IDS is not None and that the resulting list is non-empty, and
only then apply table.c.EntryId.in_(ids); if EXTRACTED_ENTRY_IDS is None or ids
is empty, return the incoming query unchanged or replace it with a safe no-rows
filter (e.g., filter a false condition) to avoid AttributeError and invalid IN
() SQL—modify more_entry_columns_query inside more_entry_columns_table
accordingly.
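
A sketch of the guarded callback, using the names from the prompt; sqlalchemy.false() provides the safe no-rows filter:

```python
import sqlalchemy as sa

def more_entry_columns_query(query: sa.Select, table: sa.Table) -> sa.Select:
    # No extraction has run yet: leave the query unchanged rather than crash.
    if EXTRACTED_ENTRY_IDS is None:
        return query
    ids = EXTRACTED_ENTRY_IDS.to_pylist()
    # An empty IN () clause is invalid SQL, so select no rows instead.
    if not ids:
        return query.where(sa.false())
    return query.where(table.c.EntryId.in_(ids))
```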
- Around line 65-69: The extraction order can be interleaved so ensure
deterministic sequencing: either set next_item_mode="fifo" in dlt/config.toml to
force FIFO extraction, or eliminate the global EXTRACTED_ENTRY_IDS dependency by
refactoring the flow—move the logic in store_extracted_entry_ids() into a
source-scoped state accessible by more_entry_columns_query() (or return the list
from entries_table() and pass it into more_entry_columns_table()), and ensure
more_entry_columns_table()/more_entry_columns_query() read that state rather
than a module-level EXTRACTED_ENTRY_IDS so the callback never sees None or
incomplete data.
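
If the config-only route is taken, the setting would live in dlt's extract section (a sketch, assuming the repository's .dlt/config.toml):

```toml
[extract]
next_item_mode = "fifo"
```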
- Around line 32-33: Replace the global EXTRACTED_ENTRY_IDS with an explicit dlt
resource dependency: stop assigning EXTRACTED_ENTRY_IDS (which is typed as
pa.Table but holds a pa.ChunkedArray) and instead make entries_table return the
EntryId collection as a resource value and have more_entry_columns_table declare
a depends_on (or bind the resource) to receive that value; update signatures for
entries_table and more_entry_columns_table to accept/return the EntryId list
(use consistent type e.g., List[str] or pa.ChunkedArray) and remove the global
variable EXTRACTED_ENTRY_IDS and any references to it.
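
A hypothetical sketch of the explicit-dependency refactor: entries_table and more_entry_columns_table are the functions named above, while load_opralog and the list-of-ids signature are illustrative assumptions, not code from the PR:

```python
import dlt

def load_opralog(pipeline: dlt.Pipeline) -> None:
    entry_ids: list[str] = []

    def collect_ids(table):
        # Gather EntryIds as Entries pages stream through the resource.
        entry_ids.extend(table["EntryId"].to_pylist())
        return table

    # Run Entries first so the id list is complete...
    pipeline.run(entries_table().add_map(collect_ids))
    # ...then filter MoreEntryColumns on exactly those ids.
    pipeline.run(more_entry_columns_table(entry_ids))
```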
In
@warehouses/accelerator/extract_load/opralogweb/tests/test_extract_and_load.py:
- Around line 18-26: The fixture fake_source_db uses tempfile.TemporaryDirectory
which already removes the temp dir on context exit, so the explicit
shutil.rmtree(tmp_dir) in the finally block is redundant and can raise
FileNotFoundError; remove the finally/shutil.rmtree call and simply yield db_url
inside the TemporaryDirectory context (keeping create_fake_source_db and db_url
logic intact) so cleanup is handled by TemporaryDirectory.
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
elt-common/src/elt_common/cli.py (1)

75-85: Stale docstring references removed parameter. The docstring at line 81 still documents `:param default_write_disposition:`, but this parameter has been removed from the function signature.

🔎 Proposed fix

```diff
 """Run a standard extract and load pipeline

 :param pipeline_name: Name of dlt pipeline
 :param data_generator: Callable returning a dlt.DltSource or dlt.DltResource
 :param dataset_name_suffix: Suffix part of full dataset name in the destination.
   The given string is prefixed with a standard string defined in constants.DATASET_NAME_PREFIX_SRCS
-:param default_write_disposition: Default mode for dlt write_disposition defaults to "append"
 :param default_destination: Default destination, defaults to "filesystem"
 :param default_loader_file_format: Default dlt loader file format, defaults to "parquet"
 :param default_progress: Default progress reporter, defaults to NULL_COLLECTOR
 """
```

elt-common/tests/e2e_tests/elt_common/dlt_destinations/pyiceberg/conftest.py (1)

5-11: Missing `warehouse` argument in `PyIcebergDestinationTestConfiguration` instantiation. Based on the `PyIcebergDestinationTestConfiguration` dataclass definition in `elt_common/testing/dlt.py`, it requires a `warehouse: Warehouse` attribute. The fixture should receive the `warehouse` fixture and pass it to the constructor.

🔎 Proposed fix

```diff
 @pytest.fixture
-def destination_config():
-    destination_config = PyIcebergDestinationTestConfiguration()
+def destination_config(warehouse):
+    destination_config = PyIcebergDestinationTestConfiguration(warehouse=warehouse)
     try:
         yield destination_config
     finally:
         destination_config.clean_catalog()
```
🧹 Nitpick comments (20)
elt-common/src/elt_common/cli.py (1)

114-116: Good addition for observability. The extracted row counts logging improves debugging capabilities. Consider adding a defensive check in case `last_normalize_info` is `None` for edge cases (e.g., empty data sources), though this may be acceptable for debug-level logging.

🔎 Optional defensive check

```diff
 LOGGER.debug(pipeline.last_trace.last_extract_info)
-LOGGER.debug(f"Extracted row counts: {pipeline.last_trace.last_normalize_info.row_counts}")
+if pipeline.last_trace.last_normalize_info:
+    LOGGER.debug(f"Extracted row counts: {pipeline.last_trace.last_normalize_info.row_counts}")
 LOGGER.debug(pipeline.last_trace.last_load_info)
```

elt-common/tests/e2e_tests/elt_common/iceberg/conftest.py (2)
30-30: Inconsistent variable reference. Line 30 uses `server.settings.openid_client_secret` whilst the rest of the fixture consistently uses `server_settings` (assigned on line 10). For consistency and readability, this should match the pattern used elsewhere.

🔎 Proposed fix for consistency

```diff
-    "iceberg.rest-catalog.oauth2.credential" = '{server_settings.openid_client_id}:{server.settings.openid_client_secret}',
+    "iceberg.rest-catalog.oauth2.credential" = '{server_settings.openid_client_id}:{server_settings.openid_client_secret}',
```

46-46: Redundant assignment can be removed. The `creds.catalog` is already set to `warehouse.name` on line 16 when the credentials are created. This reassignment appears unnecessary unless there's a specific reason to refresh it.

🔎 Proposed simplification

```diff
     # Create another connector connected to the new catalog
-    creds.catalog = warehouse.name
     trino_engine = TrinoQueryEngine(creds)
```

elt-common/src/elt_common/dlt_destinations/pyiceberg/__init__.py (1)
9-14: Minor: Redundant `cast` after `assert isinstance`. The `cast` on line 14 is unnecessary since the `assert isinstance` on line 13 already narrows the type. Consider simplifying:

🔎 Proposed simplification

```diff
 @contextmanager
 def iceberg_catalog(pipeline: dlt.Pipeline):
     """Given a pipeline return the Iceberg catalog client for this pipeline"""
     with pipeline.destination_client() as client:
         assert isinstance(client, PyIcebergClient)
-        yield cast(PyIcebergClient, client).iceberg_catalog
+        yield client.iceberg_catalog
```

elt-common/src/elt_common/testing/dlt.py (1)

94-101: Consider adding error handling for robustness. If `purge_table` fails for one table, the remaining tables and namespaces won't be cleaned up. For test cleanup, consider wrapping operations in try/except to ensure best-effort cleanup of all resources.

🔎 Proposed improvement

```diff
 def clean_catalog(self):
     """Clean the destination catalog of all namespaces and tables"""
     catalog = self.warehouse.connect()
     for ns_name in catalog.list_namespaces():
         tables = catalog.list_tables(ns_name)
         for qualified_table_name in tables:
-            catalog.purge_table(qualified_table_name)
-        catalog.drop_namespace(ns_name)
+            try:
+                catalog.purge_table(qualified_table_name)
+            except Exception:
+                pass  # Best-effort cleanup
+        try:
+            catalog.drop_namespace(ns_name)
+        except Exception:
+            pass
```

elt-common/src/elt_common/testing/__init__.py (1)
20-21: Consider validating path format in concatenation. The `__add__` method concatenates paths directly without ensuring the path starts with `/`. This could create malformed URLs if callers forget the leading slash.

🔎 Proposed improvement

```diff
 def __add__(self, path: str) -> "Endpoint":
+    if path and not path.startswith("/"):
+        path = "/" + path
     return Endpoint(self.raw_value + path, self.internal_netloc)
```

elt-common/src/elt_common/testing/lakekeeper.py (3)

58-63: Overly broad exception handling. Catching `Exception` masks unexpected errors. Consider catching only `requests.HTTPError` or `requests.RequestException` to let genuine bugs surface.

🔎 Proposed fix

```diff
 try:
     response.raise_for_status()
-except Exception:
+except requests.HTTPError:
     raise ValueError(
         f"Failed to create warehouse ({response.status_code}): {response.text}."
     )
```
66-66: Prefer logging over print statements. Using `print` bypasses the configured logging setup.

🔎 Proposed improvement

Add logging at the top of the file:

```diff
+import logging
+
+logger = logging.getLogger(__name__)
```

Then replace the print statement:

```diff
-print(f"Created warehouse {name} with ID {warehouse_id}")
+logger.info(f"Created warehouse {name} with ID {warehouse_id}")
```
88-93: Consider making timeout configurable. The 10-second timeout is hardcoded. Whilst reasonable for most cases, making it configurable would improve flexibility for slower environments.
elt-common/src/elt_common/testing/fixtures.py (2)
68-68: Prefer logging over print statements. Using `print` bypasses the configured logging setup.

89-93: Exception handling may miss other error types. Only catching `RuntimeError` means other exceptions (e.g., `requests.RequestException`, `ValueError`) from the cleanup operations won't be caught, causing test failures instead of warnings.

🔎 Proposed fix

```diff
-except RuntimeError as exc:
+except Exception as exc:
     warnings.warn(
         f"Error deleting test warehouse '{str(warehouse.project_id)}'. It may need to be removed manually."
     )
     warnings.warn(f"Error:\n{str(exc)}")
```

warehouses/accelerator/extract_load/opralogweb/extract_and_load.py (2)
89-102: Consider performance for large datasets. The `.to_pandas().apply()` pattern converts the entire column to pandas and applies the transformation row-by-row, which may be inefficient for large datasets. Consider using vectorised operations if performance becomes an issue.

105-110: Add validation and error handling. The function modifies global state without validation:

- No check if `table` contains the `EntryId` column
- No validation that `EntryId` extraction succeeded
- Silent overwrite of any existing `EXTRACTED_ENTRY_IDS` value

Consider adding validation to prevent runtime errors if the input table structure is unexpected.

🔎 Proposed enhancement

```diff
 def store_extracted_entry_ids(table: pa.Table) -> pa.Table:
+    """Store extracted EntryIds for filtering MoreEntryColumns"""
     global EXTRACTED_ENTRY_IDS
-
-    EXTRACTED_ENTRY_IDS = table["EntryId"]
-
+
+    if "EntryId" not in table.column_names:
+        raise ValueError("Table missing required 'EntryId' column")
+
+    entry_ids = table["EntryId"]
+    if len(entry_ids) == 0:
+        # Log warning but don't fail - valid case for no new entries
+        pass
+
+    EXTRACTED_ENTRY_IDS = entry_ids
     return table
```

warehouses/accelerator/extract_load/opralogweb/tests/conftest.py (1)
1-4: Consider using an editable install instead of `sys.path` manipulation. Modifying `sys.path` globally has several drawbacks:

- Inserting at position 0 takes precedence over installed packages, potentially causing import shadowing
- Makes tests environment-dependent and harder to reproduce
- Modern Python testing recommends proper package installation

Consider using an editable install instead:

```sh
uv pip install -e warehouses/accelerator/extract_load/opralogweb
```

Or set `PYTHONPATH` in the test runner configuration.

.github/actions/run-pytest-with-uv/action.yml (1)
8-10: Consider documenting or validating `install-args-extra` to prevent unintended command injection. The `install-args-extra` input is directly interpolated into the shell command. While GitHub Actions sanitises inputs, consider documenting allowed values or adding validation if this action is used in workflows triggered by untrusted PR authors.

Also applies to: 47-47

warehouses/accelerator/extract_load/opralogweb/tests/test_extract_and_load.py (1)
warehouses/accelerator/extract_load/opralogweb/tests/test_extract_and_load.py (1)
34-39: Capture stdout separately for better debugging. When the script fails, `proc.stdout` will be `None` because stdout isn't captured (only stderr is redirected to stdout). Consider capturing both streams properly.

🔎 Proposed fix

```diff
 proc = subp.run(
-    [sys.executable, EL_SCRIPT], env=subp_env, text=True, stderr=subp.STDOUT
+    [sys.executable, EL_SCRIPT], env=subp_env, text=True, capture_output=True
 )
 if proc.returncode != 0:
-    pytest.fail(reason=proc.stdout)
+    pytest.fail(reason=f"stdout:\n{proc.stdout}\nstderr:\n{proc.stderr}")
```

warehouses/accelerator/extract_load/opralogweb/tests/fake_source/__init__.py (2)
3-3: Remove unused import. The `click` module is imported but not used in this file.

🔎 Proposed fix

```diff
-import click
```

36-59: Consider batching inserts for performance. Each call to `insert_downtime_record` commits separately due to the `@commit_or_rollback` decorator. For 100 iterations, this means 100 commits. For test code this is acceptable, but could be slow for larger datasets.

warehouses/accelerator/extract_load/opralogweb/tests/fake_source/opralogmodel.py (2)
44-56: Decorator commits after rollback, which is redundant. The `finally` block executes `commit()` even after a `rollback()` in the `except` block. While this doesn't cause data corruption (committing after rollback just commits an empty transaction), it's semantically incorrect. Use an `else` clause for the commit instead.

🔎 Proposed fix

```diff
 def commit_or_rollback(func):
     """Decorator to either commit a successful ORM session or rollback on exception"""

     def impl(session: Session, *args, **kwargs):
         try:
-            return func(session, *args, **kwargs)
+            result = func(session, *args, **kwargs)
         except Exception:
             session.rollback()
             raise
-        finally:
+        else:
             session.commit()
+            return result

     return impl
```

37-37: Remove unused constant or add usage comment. The `LOGBOOKS` dictionary is defined but never used. If it's intended for future use, consider adding a comment; otherwise, remove it.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (29)
- .github/actions/run-pytest-with-uv/action.yml
- .github/workflows/elt-common_e2e_tests.yml
- .github/workflows/elt-common_unit_tests.yml
- .github/workflows/warehouses_e2e_tests.yml
- elt-common/pyproject.toml
- elt-common/src/elt_common/cli.py
- elt-common/src/elt_common/dlt_destinations/pyiceberg/__init__.py
- elt-common/src/elt_common/testing/__init__.py
- elt-common/src/elt_common/testing/dlt.py
- elt-common/src/elt_common/testing/fixtures.py
- elt-common/src/elt_common/testing/lakekeeper.py
- elt-common/tests/e2e_tests/.dlt/config.toml
- elt-common/tests/e2e_tests/conftest.py
- elt-common/tests/e2e_tests/elt_common/dlt_destinations/pyiceberg/conftest.py
- elt-common/tests/e2e_tests/elt_common/dlt_destinations/pyiceberg/test_pyiceberg_pipeline.py
- elt-common/tests/e2e_tests/elt_common/dlt_destinations/pyiceberg/utils.py
- elt-common/tests/e2e_tests/elt_common/iceberg/conftest.py
- elt-common/tests/e2e_tests/elt_common/iceberg/test_maintenance.py
- infra/ansible-docker/playbooks/elt/templates/secrets/envvars.j2
- infra/local/env-local
- warehouses/accelerator/extract_load/opralogweb/.dlt/config.toml
- warehouses/accelerator/extract_load/opralogweb/.gitignore
- warehouses/accelerator/extract_load/opralogweb/extract_and_load.py
- warehouses/accelerator/extract_load/opralogweb/tests/__init__.py
- warehouses/accelerator/extract_load/opralogweb/tests/conftest.py
- warehouses/accelerator/extract_load/opralogweb/tests/fake_source/README.md
- warehouses/accelerator/extract_load/opralogweb/tests/fake_source/__init__.py
- warehouses/accelerator/extract_load/opralogweb/tests/fake_source/opralogmodel.py
- warehouses/accelerator/extract_load/opralogweb/tests/test_extract_and_load.py
💤 Files with no reviewable changes (1)
- elt-common/tests/e2e_tests/.dlt/config.toml
🧰 Additional context used
🧬 Code graph analysis (11)
elt-common/tests/e2e_tests/elt_common/dlt_destinations/pyiceberg/test_pyiceberg_pipeline.py (4)
- elt-common/src/elt_common/dlt_destinations/pyiceberg/__init__.py (1): iceberg_catalog (10-14)
- elt-common/src/elt_common/dlt_destinations/pyiceberg/pyiceberg.py (1): iceberg_catalog (61-62)
- elt-common/src/elt_common/testing/dlt.py (1): PyIcebergDestinationTestConfiguration (33-101)
- elt-common/src/elt_common/testing/lakekeeper.py (1): Warehouse (97-167)

elt-common/src/elt_common/dlt_destinations/pyiceberg/__init__.py (2)
- elt-common/src/elt_common/dlt_destinations/pyiceberg/factory.py (1): pyiceberg (15-58)
- elt-common/src/elt_common/dlt_destinations/pyiceberg/pyiceberg.py (1): PyIcebergClient (72-463)

elt-common/src/elt_common/testing/lakekeeper.py (5)
- elt-common/src/elt_common/dlt_destinations/pyiceberg/factory.py (1): pyiceberg (15-58)
- elt-common/src/elt_common/dlt_destinations/pyiceberg/helpers.py (1): create_catalog (50-58)
- elt-common/src/elt_common/testing/__init__.py (5): Endpoint (16-32), Settings (35-96), openid_provider_uri (70-73), lakekeeper_url (66-67), storage_config (75-96)
- elt-common/src/elt_common/testing/fixtures.py (4): settings (20-21), token_endpoint (25-28), warehouse (52-93), server (47-48)
- elt-common/src/elt_common/dlt_destinations/pyiceberg/configuration.py (1): as_dict (26-46)

elt-common/tests/e2e_tests/elt_common/iceberg/test_maintenance.py (1)
- elt-common/src/elt_common/testing/lakekeeper.py (1): Warehouse (97-167)

elt-common/tests/e2e_tests/conftest.py (1)
- elt-common/src/elt_common/testing/dlt.py (1): configure_dlt_for_testing (19-29)

warehouses/accelerator/extract_load/opralogweb/tests/fake_source/__init__.py (1)
- warehouses/accelerator/extract_load/opralogweb/tests/fake_source/opralogmodel.py (5): get_latest_entry_id (119-121), metadata (40-41), insert_empty_downtime_logbook (125-134), insert_downtime_record (138-208), update_downtime_record (212-245)

elt-common/tests/e2e_tests/elt_common/iceberg/conftest.py (1)
- elt-common/src/elt_common/testing/lakekeeper.py (1): Warehouse (97-167)

elt-common/src/elt_common/testing/fixtures.py (3)
- elt-common/src/elt_common/testing/dlt.py (2): PyIcebergDestinationTestConfiguration (33-101), clean_catalog (94-101)
- elt-common/src/elt_common/testing/__init__.py (3): Settings (35-96), openid_provider_uri (70-73), storage_config (75-96)
- elt-common/src/elt_common/testing/lakekeeper.py (6): Server (18-93), Warehouse (97-167), token_endpoint (26-27), create_warehouse (45-73), purge_warehouse (76-78), delete_warehouse (81-86)

elt-common/tests/e2e_tests/elt_common/dlt_destinations/pyiceberg/utils.py (3)
- elt-common/src/elt_common/dlt_destinations/pyiceberg/factory.py (1): pyiceberg (15-58)
- elt-common/src/elt_common/dlt_destinations/pyiceberg/__init__.py (1): iceberg_catalog (10-14)
- elt-common/src/elt_common/dlt_destinations/pyiceberg/pyiceberg.py (1): iceberg_catalog (61-62)

elt-common/src/elt_common/testing/dlt.py (3)
- elt-common/src/elt_common/testing/lakekeeper.py (4): Warehouse (97-167), catalog_endpoint (29-34), token_endpoint (26-27), connect (103-113)
- elt-common/src/elt_common/testing/fixtures.py (4): warehouse (52-93), server (47-48), settings (20-21), token_endpoint (25-28)
- elt-common/tests/e2e_tests/elt_common/dlt_destinations/pyiceberg/test_pyiceberg_pipeline.py (1): pipeline_name (30-32)

warehouses/accelerator/extract_load/opralogweb/tests/test_extract_and_load.py (4)
- elt-common/src/elt_common/testing/lakekeeper.py (2): Warehouse (97-167), connect (103-113)
- elt-common/src/elt_common/testing/fixtures.py (3): destination_config (97-102), pipelines_dir (106-108), warehouse (52-93)
- elt-common/tests/e2e_tests/elt_common/dlt_destinations/pyiceberg/conftest.py (1): destination_config (6-11)
- elt-common/src/elt_common/dlt_destinations/pyiceberg/pyiceberg.py (1): run (64-69)
🪛 Ruff (0.14.10)
warehouses/accelerator/extract_load/opralogweb/tests/fake_source/__init__.py
- 70-72: Avoid specifying long messages outside the exception class (TRY003)

warehouses/accelerator/extract_load/opralogweb/tests/test_extract_and_load.py
- 34-34: `subprocess` call: check for execution of untrusted input (S603)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: warehouses end-to-end tests
- GitHub Check: elt-common end-to-end tests
🔇 Additional comments (36)
warehouses/accelerator/extract_load/opralogweb/.gitignore (1)
1-2: LGTM! The patterns appropriately exclude temporary SQLite and DuckDB database files created by test scaffolding, aligning with the fake Opralog database infrastructure introduced in this PR.

elt-common/src/elt_common/cli.py (1)

46-51: Defaulting `write-disposition` to `None` is correct for merge loading. This change allows the write disposition to be controlled by resource decorators rather than CLI defaults, which is necessary for the merge-style incremental load feature.

elt-common/tests/e2e_tests/elt_common/iceberg/conftest.py (2)

4-4: LGTM! Good refactoring to centralize test utilities. The import path update moves the `Warehouse` fixture to a centralized testing module, improving code organization and reusability across the test suite.

elt-common/pyproject.toml (1)

38-39: LGTM! The ruff configuration with a line length of 100 is a reasonable and widely-adopted standard.

elt-common/tests/e2e_tests/elt_common/iceberg/test_maintenance.py (1)

8-8: LGTM! The import path update correctly aligns with the refactored testing utilities location.

elt-common/tests/e2e_tests/conftest.py (1)

1-3: LGTM! Clean simplification that centralises the DLT testing configuration. Module-level invocation ensures the setup occurs before any tests run.

elt-common/tests/e2e_tests/elt_common/dlt_destinations/pyiceberg/test_pyiceberg_pipeline.py (1)

8-16: LGTM! Import paths correctly updated to use the centralised testing utilities from `elt_common.testing` and the new `iceberg_catalog` context manager from the pyiceberg module.

elt-common/tests/e2e_tests/elt_common/dlt_destinations/pyiceberg/utils.py (1)

17-18: LGTM! Clean refactoring that removes local duplication by importing the centralised `iceberg_catalog` context manager.

elt-common/src/elt_common/testing/dlt.py (1)

39-74: LGTM on environment setup logic. The distinction between direct assignment for essential credentials (URI, PROJECT_ID) and `setdefault` for others is a reasonable pattern that allows tests to override optional settings while ensuring core connectivity is always configured correctly.

elt-common/src/elt_common/testing/__init__.py (2)

8-12: LGTM! The retry configuration with exponential backoff and 5 attempts is appropriate for testing infrastructure that may have transient failures.

35-96: LGTM! The Settings class provides sensible defaults for local testing whilst allowing override via environment variables with the `tests_` prefix. Hard-coded test credentials are appropriate for development defaults.

elt-common/src/elt_common/testing/lakekeeper.py (5)
1-16: LGTM! Import organisation is clear and appropriate.

18-44: LGTM! The endpoint builders properly delegate to the Settings properties and construct versioned API paths correctly.

103-113: LGTM! The `connect()` method properly constructs PyIceberg catalog credentials from the server settings.

115-157: LGTM! The `create_test_tables()` context manager provides useful test scaffolding. The nested `append_snapshot` function and automatic cleanup in the finally block are appropriate for test utilities.

159-167: LGTM! The `purge()` method correctly handles cleanup order: views first, then tables, then namespaces.

elt-common/src/elt_common/testing/fixtures.py (4)

1-17: LGTM! Module docstring and imports are clear and appropriate.

19-48: LGTM! The authentication fixtures correctly implement the OpenID Connect client credentials flow. Session scope is appropriate for reusing authentication across tests.

96-102: LGTM! Function scope is appropriate for `destination_config` to ensure test isolation whilst reusing the session-scoped warehouse. The cleanup in finally ensures the catalog is purged after each test.

105-108: LGTM! The `pipelines_dir` fixture correctly uses `TemporaryDirectory` for automatic cleanup.

warehouses/accelerator/extract_load/opralogweb/extract_and_load.py (4)

15-16: LGTM! The added imports are appropriate for the new functionality: `Generator` for type hints, `datetime` for the epoch constant, `sql_table` for creating resources, and `Select` for the query adapter callback type hint.

Also applies to: 20-20, 23-23

27-30: LGTM! The constants are well-defined. `OPRALOG_EPOCH` aligns with the documented usage start date, and `SQL_TABLE_KWARGS` centralises common configuration for all table resources.

49-63: LGTM! The append-only tables are correctly configured with incremental cursors and `write_disposition="append"`.

72-86: LGTM! The `entries_table()` function correctly configures merge loading with `LastChangedDate` as the incremental cursor and applies appropriate transformations.

warehouses/accelerator/extract_load/opralogweb/tests/fake_source/README.md (1)

1-16: LGTM! The documentation clearly explains how to generate fake data and simulate updates for testing purposes. The usage examples are well-structured and easy to follow.
.github/workflows/elt-common_e2e_tests.yml (1)
46-46: LGTM! The change from `pyproject-directory` to `working-directory` is consistent with the unit tests workflow refactoring.

.github/workflows/elt-common_unit_tests.yml (1)

33-33: Parameter change is correctly supported in the action. The `.github/actions/run-pytest-with-uv/action.yml` action has been updated to accept the `working-directory` input parameter (defined at line 22 and used at lines 48 and 53), so this workflow configuration change is properly aligned.

.github/actions/run-pytest-with-uv/action.yml (1)

22-24: LGTM! The refactoring from `pyproject-directory` to `working-directory` with the addition of `install-args-extra` provides more flexibility. The structure is clean and the working directory is correctly applied to both the install and test steps.

Also applies to: 45-53

warehouses/accelerator/extract_load/opralogweb/tests/test_extract_and_load.py (2)

42-64: LGTM! The test comprehensively validates that the first run creates the expected namespace and populates all tables with correct row counts.

67-91: LGTM! The merge update test properly validates the incremental loading behaviour by running the script twice with an update in between, and verifying the expected row count changes.

warehouses/accelerator/extract_load/opralogweb/tests/fake_source/__init__.py (2)

62-100: LGTM! The update function correctly fetches the latest entry, validates existence, updates one record, and appends new records. The logic aligns with the test expectations (110 entries after update = 100 initial + 10 new).

103-110: LGTM! Clean wrapper functions that create engines and delegate to the appropriate helpers.

warehouses/accelerator/extract_load/opralogweb/tests/fake_source/opralogmodel.py (3)

59-117: LGTM! The ORM models are well-structured and correctly define the schema needed to simulate the Opralog database for testing purposes.

137-208: LGTM! The `insert_downtime_record` function correctly handles entry ID sequencing, inserts related records across multiple tables, and manages chapter creation when needed.

211-245: LGTM! The `update_downtime_record` function correctly updates the entry and related columns, and records the change in the changelog table.

.github/workflows/warehouses_e2e_tests.yml (1)

27-28: No action required. `actions/checkout@v6` is a valid and stable release with improved credential security compared to earlier versions. Ensure your GitHub Actions Runner is version 2.329.0 or later for full compatibility.

Likely an incorrect or invalid review comment.
It makes it easier to skip the value if it is not needed.
Ensures any edited entries are reloaded and merged with the current warehouse contents.
… refactor (#179)

Summary: After refactor in #177 …

Summary by CodeRabbit
- Chores: Restructured SQL credentials environment variable configuration for improved clarity and simplicity. The credential keys have been reorganised to use a flatter, more intuitive hierarchy. Users maintaining custom environment variable configurations should review the changes and update their setup accordingly to ensure continued compatibility.
Summary: Fix problems arising from removal of cli parameter in #177. Set write_dispositions explicitly on each resource.

Summary by CodeRabbit
- Documentation: Removed parameter documentation from CLI docstring.
- Refactor: Migrated write disposition configuration from global defaults to resource-level settings across data source connectors, enabling granular control over how data is written to targets (replace vs. merge operations).
Summary: Datetime handling has been overhauled and as a result databases such as mssql now properly reflect tz-naive columns. We add in a mapping for the tz-naive column to the Iceberg type. Fixes an error now raised for incremental loads in #177 for mssql dbs. See dlt-hub/dlt#3061.

Summary by CodeRabbit
- New Features: Added resource item limiting capability for enhanced testing and debugging flexibility.
- Bug Fixes: Improved timestamp handling to correctly support timezone-aware datetime values.
- Chores: Updated dlt dependency to version 1.20.0 for improved compatibility and performance.
Summary
Entries in Opralog can be updated after creation. These changes ensure both new and updated records are loaded into the warehouse.
A set of tests against a fake database have also been added.
Fixes #138
Summary by CodeRabbit — Release Notes: New Features, Tests, Chores.