NAS-140194 / 27.0.0-BETA.1 / Convert container plugin to be typesafe (#18403)
```python
if config.preferred_pool:
    await self.middleware.call(
        "container.validate_pool", verrors, "lxc_config_update.preferred_pool", config.preferred_pool
    )
```
🔴 Bug: lxc_config.py:54-55 calls container.validate_pool via middleware dispatch, but this method was deleted when container.py was replaced by init.py - it no longer exists anywhere in the codebase. Any lxc.update call with a preferred_pool value will raise a runtime CallError (method not found). The fix is to inline the pool validation here, similar to how it was done in crud.py:147 (e.g., check config.preferred_pool not in await pool_choices(self)).
Extended reasoning
What the bug is
The old ContainerService in container/container.py defined a @Private method validate_pool that was registered as container.validate_pool on the middleware dispatch:
```python
@private
async def validate_pool(self, verrors, schema, pool):
    if pool not in await self.middleware.call("container.pool_choices"):
        verrors.add(schema, "Pool not found.")
```

This PR deleted container.py entirely and replaced the ContainerService with a new implementation in init.py. The new ContainerService does not define a validate_pool method.
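Because the middleware resolves methods by name at call time, removing validate_pool produces no import-time or startup error. A toy sketch of name-based dispatch (an assumption for illustration; the real TrueNAS middleware is far richer) shows why the failure only surfaces when the method is actually called:

```python
import asyncio

# Toy name-based dispatch; MiniMiddleware is a hypothetical stand-in,
# not the real middleware class.
class MiniMiddleware:
    def __init__(self):
        self._methods: dict = {}

    def register(self, name, fn):
        self._methods[name] = fn

    async def call(self, name, *args):
        # A missing method fails only here, at dispatch time
        # (stands in for the middleware's CallError).
        if name not in self._methods:
            raise RuntimeError(f"Method {name!r} not found")
        return await self._methods[name](*args)

async def demo():
    mw = MiniMiddleware()
    # container.validate_pool was never re-registered after the refactor,
    # so this dispatch fails at runtime, not at import time.
    try:
        await mw.call("container.validate_pool", [], "lxc_config_update.preferred_pool", "tank")
    except RuntimeError as exc:
        return str(exc)

error = asyncio.run(demo())
```

This is why type-checkers and imports stay green while the user-facing call path is broken.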
The specific code path that triggers it
In lxc_config.py:53-56, when config.preferred_pool is truthy during an lxc.update call, the code dispatches:
```python
if config.preferred_pool:
    await self.middleware.call(
        "container.validate_pool", verrors, "lxc_config_update.preferred_pool", config.preferred_pool
    )
```

Since container.validate_pool no longer exists as a registered method, the middleware dispatch will raise a CallError with a method-not-found error.
Why existing code does not prevent it
The pool validation was correctly inlined in crud.py:147 for container creation (pool not in await pool_choices(self)), but the lxc_config.py update path was not updated to use the same inline approach. The unit test in test_config_network_overlap.py masks this by mocking: m["container.validate_pool"] = AsyncMock(), so the tests still pass without the method existing.
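The masking effect of the test mock can be sketched with a toy dispatch table (a simplified stand-in, not the real harness from test_config_network_overlap.py): the injected AsyncMock answers the call in tests, while production dispatch has no such entry at all.

```python
import asyncio
from unittest.mock import AsyncMock

# Toy dispatch table; in tests, a mock is injected under the method name.
methods = {"container.validate_pool": AsyncMock()}

async def call(name, *args):
    # Dispatch by name, as the real middleware does conceptually.
    return await methods[name](*args)

# In tests the mocked entry absorbs the call, so nothing fails
# even though the real method no longer exists anywhere.
asyncio.run(call("container.validate_pool", [], "schema", "tank"))
```

Once the mock entry is removed (as in production), the same call fails at dispatch, which is exactly the regression the test suite never exercises.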
Step-by-step proof
- User calls lxc.update with {"preferred_pool": "tank"}
- LXCConfigService.do_update delegates to LXCConfigServicePart.do_update in lxc_config.py:35
- At line 53, config.preferred_pool is "tank" (truthy), so the if branch is entered
- At line 54-55, self.middleware.call("container.validate_pool", ...) is invoked
- The middleware dispatch looks up container.validate_pool -- this method does not exist in the new ContainerService in init.py (confirmed by grep showing zero definitions)
- A CallError is raised: method not found
Impact
Any attempt to set or change the preferred_pool in LXC configuration will fail with a runtime error. This is a regression introduced by this PR that blocks a user-facing configuration operation.
How to fix
Inline the pool validation in lxc_config.py, matching the pattern used in crud.py:147. Replace lines 53-56 with:
```python
if config.preferred_pool:
    if config.preferred_pool not in await pool_choices(self):
        verrors.add("lxc_config_update.preferred_pool", "Pool not found.")
```

This requires importing pool_choices from .info at the top of lxc_config.py. The test mock for container.validate_pool should also be updated accordingly.
```python
def extend_context_sync(self, rows: list[dict[str, Any]], extra: dict[str, Any]) -> dict[str, Any]:
    return {
        'states': gather_pylibvirt_domains_states(
            self.middleware,
            rows,
            self.middleware.libvirt_domains_manager.containers_connection,
            lambda container: pylibvirt_container(self, container),
        ),
    }
```
🔴 Bug: extend_context_sync passes raw DB rows to pylibvirt_container() without calling extend_container() first. Raw rows have idmap_slice (the DB column name), not idmap (the API field), so pylibvirt_container() will KeyError on container["idmap"] at lifecycle.py:101. The broad except Exception in gather_pylibvirt_domains_states silently swallows this, causing all containers to incorrectly report STOPPED state even when running. Fix: change line 87 to lambda container: pylibvirt_container(self, self.extend_container(container)).
Extended reasoning
What the bug is
In crud.py:81-89, extend_context_sync() builds a lambda that is passed to gather_pylibvirt_domains_states(). This lambda receives raw datastore rows (which have idmap_slice as the DB column name) and passes them directly to pylibvirt_container(). However, pylibvirt_container() at lifecycle.py:101 accesses container["idmap"], which does not exist in raw DB rows — only idmap_slice does.
The code path that triggers it
The flow is: query() → extend_context() → extend_context_sync() → gather_pylibvirt_domains_states(). At pylibvirt.py:46, the function calls container_factory(container.copy()) for each running libvirt domain, where container_factory is the lambda from extend_context_sync. The lambda passes the raw row to pylibvirt_container(), which immediately hits container["idmap"] at lifecycle.py:101 — a key that does not exist in the raw dict.
Why existing code does not prevent it
The extend_container() method (which converts idmap_slice → idmap) is called in extend(), but extend() runs AFTER extend_context_sync() in the query pipeline (see CRUDServicePart.query: extend_context(rows, extra) is called first, then _run_extend_many(rows, ctx) calls extend() for each row). The old code correctly handled this by calling self.extend_container(container) inside the lambda before passing to pylibvirt_container. The new code omits this transformation.
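The ordering described above can be reconstructed with stubs (a toy model, not the real CRUDServicePart): extend_context sees raw DB rows, and extend() renames the column only afterwards.

```python
def extend_container(row):
    # Mirrors the described transform: DB column name -> API field name.
    row = dict(row)
    row["idmap"] = row.pop("idmap_slice")
    return row

def pylibvirt_container(container):
    # Mimics lifecycle.py:101 reading the API field on whatever it is given.
    return {"idmap": container["idmap"]}

raw_rows = [{"id": 1, "idmap_slice": 0}]

# Step 1: context building runs on RAW rows -- this is where the bug bites.
try:
    context_states = [pylibvirt_container(r) for r in raw_rows]
except KeyError:
    context_states = None  # raw rows only carry idmap_slice, not idmap

# Step 2: only now does the query pipeline run extend() on each row.
extended_rows = [extend_container(r) for r in raw_rows]
```

The raw row blows up in step 1 exactly because step 2 has not run yet, which is why the lambda must apply the transform itself.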
Step-by-step proof
- A user calls container.query while a container is running.
- extend_context_sync is called with raw DB rows. Each row has idmap_slice (e.g., 0 for DEFAULT) but no idmap key.
- gather_pylibvirt_domains_states (pylibvirt.py:38-46) matches the running domain UUID to a row, then calls container_factory(container.copy()), i.e. the lambda lambda container: pylibvirt_container(self, container).
- Inside pylibvirt_container (lifecycle.py:101), if container["idmap"]: raises KeyError, because the dict has idmap_slice, not idmap.
- The except Exception at pylibvirt.py:48-49 catches this silently. The state dict is returned empty for this container.
- get_pylibvirt_domain_state returns the default STOPPED status for all containers, even running ones.
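The silent-failure path in the last two steps can be sketched as a toy version (stubs, not the real gather_pylibvirt_domains_states): a broad except Exception around the factory call converts the KeyError into an empty state map, and the reader falls back to STOPPED.

```python
def container_factory(row):
    # Raw row lacks 'idmap' -> KeyError, as in the real code path.
    return {"idmap": row["idmap"]}

def gather_states(rows):
    states = {}
    for row in rows:
        try:
            states[row["id"]] = container_factory(row)
        except Exception:
            pass  # swallowed silently; the row never enters `states`
    return states

def get_state(states, row_id):
    # The STOPPED default masks the failure entirely.
    return states.get(row_id, "STOPPED")

states = gather_states([{"id": 1, "idmap_slice": 0}])
```

No traceback, no log line: the only observable symptom is every container reporting STOPPED, which is what makes this regression easy to miss.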
Impact
Every container.query call when containers exist and libvirt domains are active will silently fail to detect running containers. All containers will appear STOPPED regardless of their actual state. This breaks monitoring, the web UI status display, and any automation depending on container state (e.g., handle_shutdown stopping running containers).
How to fix
Change line 87 in crud.py from:
```python
lambda container: pylibvirt_container(self, container),
```

to:

```python
lambda container: pylibvirt_container(self, self.extend_container(container)),
```

This matches the pattern used by the VM plugin at vms.py:99-100:

```python
lambda vm: self.middleware.call_sync("vm.pylibvirt_vm", self.extend_vm(vm)),
```

Note: bug_006 reported separately describes the exact same issue; it is a duplicate of this bug.
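Under the same stand-in stubs as before (not the real service classes), the shape of the fix is simply to run the row through the extend transform before the factory sees it:

```python
def extend_container(row):
    # DB column name -> API field name, as extend() would do later.
    row = dict(row)
    row["idmap"] = row.pop("idmap_slice")
    return row

def pylibvirt_container(service, container):
    # Stub for the real factory; it only needs the API-shaped dict.
    return {"idmap": container["idmap"]}

# The corrected callback shape from the suggested change:
container_factory = lambda container: pylibvirt_container(None, extend_container(container))

state = container_factory({"id": 1, "idmap_slice": 0})
```

With the transform applied inside the lambda, the raw-row KeyError disappears and the factory receives the same dict shape it would get from a fully extended row.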
```python
configure_container_bridge(context)
context.middleware.libvirt_domains_manager.containers.start(
    pylibvirt_container(context, container.model_dump(), True)
)
```
🔴 ContainerNICDevice defines type_ with Field(alias="type"), so model_dump() without by_alias=True produces type_ as the dict key instead of type. This causes a KeyError in factory_utils.py:49 which reads device["attributes"]["type"]. All calls to container.model_dump() in lifecycle.py:39, lifecycle.py:45, crud.py:219, and __init__.py:165 need by_alias=True.
Extended reasoning
Bug Analysis
The ContainerNICDevice model in container_device.py:50 defines its type field as:
```python
type_: Literal["E1000", "VIRTIO"] = Field(alias="type", default="E1000")
```
In Pydantic v2, model_dump() uses field names by default, not aliases. So calling container.model_dump() produces a dict where the NIC device attributes contain the key type_ instead of type.
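This default behaviour is easy to reproduce in isolation. The snippet below rebuilds just the field definition quoted above (the class here is a minimal reproduction, not the full model from container_device.py) and contrasts the two dump modes:

```python
from typing import Literal
from pydantic import BaseModel, Field

# Minimal reproduction of the aliased field from container_device.py:50.
class ContainerNICDevice(BaseModel):
    type_: Literal["E1000", "VIRTIO"] = Field(alias="type", default="E1000")

# Input accepts the alias...
nic = ContainerNICDevice(type="VIRTIO")

# ...but output defaults to the FIELD NAME, not the alias.
by_field = nic.model_dump()                # {'type_': 'VIRTIO'}
by_alias = nic.model_dump(by_alias=True)   # {'type': 'VIRTIO'}
```

Only the by_alias=True form yields the "type" key that factory_utils.py:49 expects, which is exactly why every model_dump() call site feeding device dicts downstream must pass it.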
Code Path
In lifecycle.py, the start() function (line 39) calls:

```python
pylibvirt_container(context, container.model_dump(), True)
```

Similarly, stop() (line 45) calls:

```python
pylibvirt_container(context, container.model_dump())
```

Inside pylibvirt_container(), the container dict is passed to device_factory.get_device(device) for each device. This calls factory_utils.get_device(), which at line 49 reads:

```python
model=NICDeviceModel(device["attributes"]["type"])
```
Since the dict key is type_ (not type), this raises a KeyError.
Step-by-step proof
- A container with a NIC device is queried via get_instance(), returning a ContainerEntry Pydantic model.
- The NIC device within has type_="VIRTIO" as the field name and "type" as the alias.
- container.model_dump() is called without by_alias=True.
- The resulting dict has {"attributes": {"dtype": "NIC", "type_": "VIRTIO", ...}}.
- pylibvirt_container() iterates over devices and calls device_factory.get_device(device).
- factory_utils.py:49 accesses device["attributes"]["type"] which raises KeyError: "type".
Why existing code does not prevent this
The base model ConfigDict does not set populate_by_name=True and does not override model_dump(). The developer was clearly aware of this issue since container_device_crud.py consistently uses model_dump(by_alias=True) at lines 40, 49, 50, 57, and 82, but the lifecycle.py call sites were missed.
In the old code (before this PR), dicts came from the datastore via get_instance() which returned raw dicts with "type" as the JSON key. The new code constructs Pydantic models and dumps them, introducing this regression.
Affected call sites
- lifecycle.py:39 - start() calls container.model_dump()
- lifecycle.py:45 - stop() calls container.model_dump()
- crud.py:219 - do_delete() calls container.model_dump()
- init.py:165 - domain_event_callback() calls container.model_dump() for event fields
Fix
Change all container.model_dump() calls to container.model_dump(by_alias=True) at each affected site. This ensures dict keys use aliases (matching what downstream code expects).
Context
The container plugin has been converted to a typesafe implementation, including the other services that previously lived inside the container plugin.