NAS-140194 / 27.0.0-BETA.1 / Convert container plugin to be typesafe (#18403)
```python
if config.preferred_pool:
    await self.middleware.call(
        "container.validate_pool", verrors, "lxc_config_update.preferred_pool", config.preferred_pool
    )
```
🔴 Bug: lxc_config.py:54-55 calls container.validate_pool via middleware dispatch, but this method was deleted when container.py was replaced by init.py - it no longer exists anywhere in the codebase. Any lxc.update call with a preferred_pool value will raise a runtime CallError (method not found). The fix is to inline the pool validation here, similar to how it was done in crud.py:147 (e.g., check config.preferred_pool not in await pool_choices(self)).
Extended reasoning
What the bug is
The old ContainerService in container/container.py defined a @Private method validate_pool that was registered as container.validate_pool on the middleware dispatch:
```python
@private
async def validate_pool(self, verrors, schema, pool):
    if pool not in await self.middleware.call("container.pool_choices"):
        verrors.add(schema, "Pool not found.")
```

This PR deleted container.py entirely and replaced the ContainerService with a new implementation in init.py. The new ContainerService does not define a validate_pool method.
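Because the middleware resolves methods by name at call time, removing validate_pool produces no import-time or startup error. A toy sketch of name-based dispatch (an assumption for illustration; the real TrueNAS middleware is far richer) shows why the failure only surfaces when the method is actually called:

```python
import asyncio

# Toy name-based dispatch; MiniMiddleware is a hypothetical stand-in,
# not the real middleware class.
class MiniMiddleware:
    def __init__(self):
        self._methods: dict = {}

    def register(self, name, fn):
        self._methods[name] = fn

    async def call(self, name, *args):
        # A missing method fails only here, at dispatch time
        # (stands in for the middleware's CallError).
        if name not in self._methods:
            raise RuntimeError(f"Method {name!r} not found")
        return await self._methods[name](*args)

async def demo():
    mw = MiniMiddleware()
    # container.validate_pool was never re-registered after the refactor,
    # so this dispatch fails at runtime, not at import time.
    try:
        await mw.call("container.validate_pool", [], "lxc_config_update.preferred_pool", "tank")
    except RuntimeError as exc:
        return str(exc)

error = asyncio.run(demo())
```

This is why type-checkers and imports stay green while the user-facing call path is broken.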
The specific code path that triggers it
In lxc_config.py:53-56, when config.preferred_pool is truthy during an lxc.update call, the code dispatches:
```python
if config.preferred_pool:
    await self.middleware.call(
        "container.validate_pool", verrors, "lxc_config_update.preferred_pool", config.preferred_pool
    )
```

Since container.validate_pool no longer exists as a registered method, the middleware dispatch will raise a CallError with a method-not-found error.
Why existing code does not prevent it
The pool validation was correctly inlined in crud.py:147 for container creation (pool not in await pool_choices(self)), but the lxc_config.py update path was not updated to use the same inline approach. The unit test in test_config_network_overlap.py masks this by mocking: m["container.validate_pool"] = AsyncMock(), so the tests still pass without the method existing.
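The masking effect of the test mock can be sketched with a toy dispatch table (a simplified stand-in, not the real harness from test_config_network_overlap.py): the injected AsyncMock answers the call in tests, while production dispatch has no such entry at all.

```python
import asyncio
from unittest.mock import AsyncMock

# Toy dispatch table; in tests, a mock is injected under the method name.
methods = {"container.validate_pool": AsyncMock()}

async def call(name, *args):
    # Dispatch by name, as the real middleware does conceptually.
    return await methods[name](*args)

# In tests the mocked entry absorbs the call, so nothing fails
# even though the real method no longer exists anywhere.
asyncio.run(call("container.validate_pool", [], "schema", "tank"))
```

Once the mock entry is removed (as in production), the same call fails at dispatch, which is exactly the regression the test suite never exercises.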
Step-by-step proof
- User calls lxc.update with {"preferred_pool": "tank"}
- LXCConfigService.do_update delegates to LXCConfigServicePart.do_update in lxc_config.py:35
- At line 53, config.preferred_pool is "tank" (truthy), so the if branch is entered
- At line 54-55, self.middleware.call("container.validate_pool", ...) is invoked
- The middleware dispatch looks up container.validate_pool -- this method does not exist in the new ContainerService in init.py (confirmed by grep showing zero definitions)
- A CallError is raised: method not found
Impact
Any attempt to set or change the preferred_pool in LXC configuration will fail with a runtime error. This is a regression introduced by this PR that blocks a user-facing configuration operation.
How to fix
Inline the pool validation in lxc_config.py, matching the pattern used in crud.py:147. Replace lines 53-56 with:
```python
if config.preferred_pool:
    if config.preferred_pool not in await pool_choices(self):
        verrors.add("lxc_config_update.preferred_pool", "Pool not found.")
```

This requires importing pool_choices from .info at the top of lxc_config.py. The test mock for container.validate_pool should also be updated accordingly.
```python
def extend_context_sync(self, rows: list[dict[str, Any]], extra: dict[str, Any]) -> dict[str, Any]:
    return {
        'states': gather_pylibvirt_domains_states(
            self.middleware,
            rows,
            self.middleware.libvirt_domains_manager.containers_connection,
            lambda container: pylibvirt_container(self, container),
        ),
    }
```
🔴 Bug: extend_context_sync passes raw DB rows to pylibvirt_container() without calling extend_container() first. Raw rows have idmap_slice (the DB column name), not idmap (the API field), so pylibvirt_container() will KeyError on container["idmap"] at lifecycle.py:101. The broad except Exception in gather_pylibvirt_domains_states silently swallows this, causing all containers to incorrectly report STOPPED state even when running. Fix: change line 87 to lambda container: pylibvirt_container(self, self.extend_container(container)).
Extended reasoning
What the bug is
In crud.py:81-89, extend_context_sync() builds a lambda that is passed to gather_pylibvirt_domains_states(). This lambda receives raw datastore rows (which have idmap_slice as the DB column name) and passes them directly to pylibvirt_container(). However, pylibvirt_container() at lifecycle.py:101 accesses container["idmap"], which does not exist in raw DB rows — only idmap_slice does.
The code path that triggers it
The flow is: query() → extend_context() → extend_context_sync() → gather_pylibvirt_domains_states(). At pylibvirt.py:46, the function calls container_factory(container.copy()) for each running libvirt domain, where container_factory is the lambda from extend_context_sync. The lambda passes the raw row to pylibvirt_container(), which immediately hits container["idmap"] at lifecycle.py:101 — a key that does not exist in the raw dict.
Why existing code does not prevent it
The extend_container() method (which converts idmap_slice → idmap) is called in extend(), but extend() runs AFTER extend_context_sync() in the query pipeline (see CRUDServicePart.query: extend_context(rows, extra) is called first, then _run_extend_many(rows, ctx) calls extend() for each row). The old code correctly handled this by calling self.extend_container(container) inside the lambda before passing to pylibvirt_container. The new code omits this transformation.
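The ordering described above can be reconstructed with stubs (a toy model, not the real CRUDServicePart): extend_context sees raw DB rows, and extend() renames the column only afterwards.

```python
def extend_container(row):
    # Mirrors the described transform: DB column name -> API field name.
    row = dict(row)
    row["idmap"] = row.pop("idmap_slice")
    return row

def pylibvirt_container(container):
    # Mimics lifecycle.py:101 reading the API field on whatever it is given.
    return {"idmap": container["idmap"]}

raw_rows = [{"id": 1, "idmap_slice": 0}]

# Step 1: context building runs on RAW rows -- this is where the bug bites.
try:
    context_states = [pylibvirt_container(r) for r in raw_rows]
except KeyError:
    context_states = None  # raw rows only carry idmap_slice, not idmap

# Step 2: only now does the query pipeline run extend() on each row.
extended_rows = [extend_container(r) for r in raw_rows]
```

The raw row blows up in step 1 exactly because step 2 has not run yet, which is why the lambda must apply the transform itself.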
Step-by-step proof
- A user calls container.query while a container is running.
- extend_context_sync is called with raw DB rows. Each row has idmap_slice (e.g., 0 for DEFAULT) but no idmap key.
- gather_pylibvirt_domains_states (pylibvirt.py:38-46) matches the running domain UUID to a row, then calls container_factory(container.copy()), i.e. the lambda lambda container: pylibvirt_container(self, container).
- Inside pylibvirt_container (lifecycle.py:101), if container["idmap"]: raises KeyError, because the dict has idmap_slice, not idmap.
- The except Exception at pylibvirt.py:48-49 catches this silently. The state dict is returned empty for this container.
- get_pylibvirt_domain_state returns the default STOPPED status for all containers, even running ones.
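The silent-failure path in the last two steps can be sketched as a toy version (stubs, not the real gather_pylibvirt_domains_states): a broad except Exception around the factory call converts the KeyError into an empty state map, and the reader falls back to STOPPED.

```python
def container_factory(row):
    # Raw row lacks 'idmap' -> KeyError, as in the real code path.
    return {"idmap": row["idmap"]}

def gather_states(rows):
    states = {}
    for row in rows:
        try:
            states[row["id"]] = container_factory(row)
        except Exception:
            pass  # swallowed silently; the row never enters `states`
    return states

def get_state(states, row_id):
    # The STOPPED default masks the failure entirely.
    return states.get(row_id, "STOPPED")

states = gather_states([{"id": 1, "idmap_slice": 0}])
```

No traceback, no log line: the only observable symptom is every container reporting STOPPED, which is what makes this regression easy to miss.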
Impact
Every container.query call when containers exist and libvirt domains are active will silently fail to detect running containers. All containers will appear STOPPED regardless of their actual state. This breaks monitoring, the web UI status display, and any automation depending on container state (e.g., handle_shutdown stopping running containers).
How to fix
Change line 87 in crud.py from:
```python
lambda container: pylibvirt_container(self, container),
```

to:

```python
lambda container: pylibvirt_container(self, self.extend_container(container)),
```

This matches the pattern used by the VM plugin at vms.py:99-100:

```python
lambda vm: self.middleware.call_sync("vm.pylibvirt_vm", self.extend_vm(vm)),
```

Note: bug_006 reported separately describes the exact same issue; it is a duplicate of this bug.
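Under the same stand-in stubs as before (not the real service classes), the shape of the fix is simply to run the row through the extend transform before the factory sees it:

```python
def extend_container(row):
    # DB column name -> API field name, as extend() would do later.
    row = dict(row)
    row["idmap"] = row.pop("idmap_slice")
    return row

def pylibvirt_container(service, container):
    # Stub for the real factory; it only needs the API-shaped dict.
    return {"idmap": container["idmap"]}

# The corrected callback shape from the suggested change:
container_factory = lambda container: pylibvirt_container(None, extend_container(container))

state = container_factory({"id": 1, "idmap_slice": 0})
```

With the transform applied inside the lambda, the raw-row KeyError disappears and the factory receives the same dict shape it would get from a fully extended row.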
```python
configure_container_bridge(context)
context.middleware.libvirt_domains_manager.containers.start(
    pylibvirt_container(context, container.model_dump(), True)
)
```
🔴 ContainerNICDevice defines type_ with Field(alias="type"), so model_dump() without by_alias=True produces type_ as the dict key instead of type. This causes a KeyError in factory_utils.py:49 which reads device["attributes"]["type"]. All calls to container.model_dump() in lifecycle.py:39, lifecycle.py:45, crud.py:219, and __init__.py:165 need by_alias=True.
Extended reasoning
Bug Analysis
The ContainerNICDevice model in container_device.py:50 defines its type field as:
```python
type_: Literal["E1000", "VIRTIO"] = Field(alias="type", default="E1000")
```
In Pydantic v2, model_dump() uses field names by default, not aliases. So calling container.model_dump() produces a dict where the NIC device attributes contain the key type_ instead of type.
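This default behaviour is easy to reproduce in isolation. The snippet below rebuilds just the field definition quoted above (the class here is a minimal reproduction, not the full model from container_device.py) and contrasts the two dump modes:

```python
from typing import Literal
from pydantic import BaseModel, Field

# Minimal reproduction of the aliased field from container_device.py:50.
class ContainerNICDevice(BaseModel):
    type_: Literal["E1000", "VIRTIO"] = Field(alias="type", default="E1000")

# Input accepts the alias...
nic = ContainerNICDevice(type="VIRTIO")

# ...but output defaults to the FIELD NAME, not the alias.
by_field = nic.model_dump()                # {'type_': 'VIRTIO'}
by_alias = nic.model_dump(by_alias=True)   # {'type': 'VIRTIO'}
```

Only the by_alias=True form yields the "type" key that factory_utils.py:49 expects, which is exactly why every model_dump() call site feeding device dicts downstream must pass it.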
Code Path
In lifecycle.py, the start() function (line 39) calls:

```python
pylibvirt_container(context, container.model_dump(), True)
```

Similarly, stop() (line 45) calls:

```python
pylibvirt_container(context, container.model_dump())
```

Inside pylibvirt_container(), the container dict is passed to device_factory.get_device(device) for each device. This calls factory_utils.get_device(), which at line 49 reads:

```python
model=NICDeviceModel(device["attributes"]["type"])
```
Since the dict key is type_ (not type), this raises a KeyError.
Step-by-step proof
- A container with a NIC device is queried via get_instance(), returning a ContainerEntry Pydantic model.
- The NIC device within has type_="VIRTIO" as the field name and "type" as the alias.
- container.model_dump() is called without by_alias=True.
- The resulting dict has {"attributes": {"dtype": "NIC", "type_": "VIRTIO", ...}}.
- pylibvirt_container() iterates over devices and calls device_factory.get_device(device).
- factory_utils.py:49 accesses device["attributes"]["type"] which raises KeyError: "type".
Why existing code does not prevent this
The base model ConfigDict does not set populate_by_name=True and does not override model_dump(). The developer was clearly aware of this issue since container_device_crud.py consistently uses model_dump(by_alias=True) at lines 40, 49, 50, 57, and 82, but the lifecycle.py call sites were missed.
In the old code (before this PR), dicts came from the datastore via get_instance() which returned raw dicts with "type" as the JSON key. The new code constructs Pydantic models and dumps them, introducing this regression.
Affected call sites
- lifecycle.py:39 - start() calls container.model_dump()
- lifecycle.py:45 - stop() calls container.model_dump()
- crud.py:219 - do_delete() calls container.model_dump()
- init.py:165 - domain_event_callback() calls container.model_dump() for event fields
Fix
Change all container.model_dump() calls to container.model_dump(by_alias=True) at each affected site. This ensures dict keys use aliases (matching what downstream code expects).
Context
The container plugin has been converted to a typesafe implementation, including the other services that previously lived inside the container plugin.