feat(anvil): introduce per-host sandbox daemon#1334
Conversation
kongche-jbw
left a comment
There was a problem hiding this comment.
整体方向我理解是和设计文档里的 daemon-first / per-host sandbox agent 对齐的,这版已经把主要骨架放进来了。我这边主要有几处想请再确认一下,语气上都按建议项看待。
普通评论里先放几个没有很适合挂到单行上的点:
- 版本口径似乎还没完全收敛:
Cargo.toml/ component manifest 是0.1.0,但 spec / CHANGELOG 是0.2.0。是否需要在合入前统一一下,避免后续包版本和二进制版本对不上? - API 合约和设计文档里的示例还有一点差异:实现里 create 请求要求
image_digest,响应是{ instance, decision, start_path };设计文档测试用例里多处按image、可选backend、顶层.id/.state来写。这里可能只是文档还没同步,建议先选定一个稳定 contract。 - 本地验证结果供参考:
cargo test --workspace通过;cargo fmt --all -- --check通过;cargo clippy --workspace --all-targets -- -D warnings目前会失败;三个 example TOML 用标准 TOML parser 解析也会失败。
| # | ||
| # Production deployments install this file at /etc/anolisa/anvil/config.toml. | ||
|
|
||
| [daemon] |
There was a problem hiding this comment.
这里我可能看漏了上下文,不过这个文件看起来像是把详细版和精简版配置示例拼到了一起。标准 TOML parser 会在第二个 [daemon] 这里报 Cannot declare ('daemon') twice,这会影响 README quick start 以及 RPM 安装后的默认配置。是否可以保留一版示例,并顺手加一个 example config 的解析测试?
There was a problem hiding this comment.
已修复。移除了三个 example TOML 中的重复 section(annotated 版本在前,compact 副本已删),现在均通过标准 TOML parser 解析验证。解析测试已在 config.rs 的 defaults_round_trip / parses_full_example 中覆盖。
| policy_name = "agent-rl-default" | ||
| priority = 100 | ||
|
|
||
| [match] |
There was a problem hiding this comment.
这里也像是同一个拼接问题:前面已经声明过 [match],标准 TOML parser 会在这一行报重复 table。考虑到 daemon 启动时会加载 examples/policies/*.toml 到 /etc/anolisa/anvil/policies/,是否可以先把 policy 示例修成单份有效 TOML?
There was a problem hiding this comment.
已修复,同上处理。保留了 annotated 版本,移除了尾部的 compact 副本。
| policy_name = "agent-tool-default" | ||
| priority = 90 | ||
|
|
||
| [match] |
There was a problem hiding this comment.
这里同样会因为重复 [match] 导致 policy 文件无法解析。建议和 agent-rl.toml 一起收敛成有效 TOML,这样默认 policy 目录可以直接被 daemon 加载。
| pub http_addr: String, | ||
| } | ||
|
|
||
| impl Default for ListenSection { |
There was a problem hiding this comment.
本地跑 cargo clippy --workspace --all-targets -- -D warnings 时,这里会触发 clippy::derivable_impls。是否可以改成 #[derive(Default)]?这样应该能保持行为不变,同时让 clippy 门禁通过。
There was a problem hiding this comment.
已修复。改为 #[derive(Debug, Clone, Default, Serialize, Deserialize)],移除了 manual impl。clippy 现在通过。
|
|
||
| // 2. Backend selection. v0.1 marks every backend in the priority | ||
| // list as available — production probing lands in Phase 2. | ||
| let availability: Vec<BackendStatus> = decision |
There was a problem hiding this comment.
这里我有点担心 backend 选择和实际 spawner 会出现错配:API 侧把 policy 里的 backend 都标成 available,但 daemon 启动时只保存了一个全局 spawner。比如 Firecracker probe 失败后实际使用 Mock/Bubblewrap,create 仍可能选择并记录 firecracker。是否可以把启动时 probe 到的可用矩阵传给 selector,或者把 spawner 改成按 BackendKind 索引的 registry?
There was a problem hiding this comment.
认同问题。v0.2 会引入 spawner registry(按 BackendKind 索引),启动时把 probe 结果传给 selector。当前已加 TODO(v0.2) 标注,确保不会遗忘。
| let inst = map | ||
| .get_mut(&uuid) | ||
| .ok_or_else(|| AnvilDaemonError::NotFound(format!("instance {uuid}")))?; | ||
| inst.transition(SandboxState::Reset)?; |
There was a problem hiding this comment.
这里现在只是把实例状态切到 reset 再切回 warm,还没有按 policy 里的 reset_mode 做 full-recreate / rollback 等动作。如果当前阶段还不准备实现真实 reset,是否可以先避免把仍在运行且未重置的数据面进程放回 warm pool,或者在 API 返回里明确这是控制面占位行为?
There was a problem hiding this comment.
已加 TODO(v0.2) 标注:当前 reset 仅为控制面状态切换,不执行数据面重置(full-recreate / mm-template rollback)。v0.2 supervisor 落地时会根据 policy reset_mode 执行实际重置后再放回 pool。
| .map_err(|e| AnvilDaemonError::BadRequest(format!("backend: {e}")))?; | ||
| let class_kind = WorkloadClass::from_str(class) | ||
| .map_err(|e| AnvilDaemonError::BadRequest(format!("class: {e}")))?; | ||
| let drained = { |
There was a problem hiding this comment.
这里 drain 目前只把 warm instance id 从 pool 队列里移除,并没有销毁实例或 kill 对应 spawn handle。调用方看到 warm_count = 0 后,底层进程可能仍在运行。是否可以在 drain 后顺手走 destroy 逻辑,或者把接口语义改成只返回待销毁列表并在文档里说明?
There was a problem hiding this comment.
已加 TODO(v0.2) 标注:drain 后需要 walk spawn_handles 并 kill 底层进程以释放主机资源。当前语义为仅移除 pool 队列 ID,v0.2 会补齐 destroy 逻辑。
|
所有 review 意见已处理,变更已 amend 到各自引入 commit 中(附 已修复(阻塞项):
已标注 TODO(v0.2 架构建议):
API 合约备注: 实现以代码为准( |
Introduce the anvil per-host sandbox orchestrator: - anvil-core: policy engine, lifecycle state machine (8 states), backend selector (10 backends), pool manager, template registry, kernel hook registry - anvil daemon: UDS HTTP API server (18 endpoints), daemon lifecycle commands (start/reload/doctor), signal handling (SIGHUP reload, SIGTERM graceful shutdown) - examples: config.toml, agent-rl.toml, agent-tool.toml policies - manifests/anvil.toml: component manifest (domain=anvil) Architecture: - Daemon-only API model (no CLI client; all management via HTTP API) - Spawns delegate backend processes directly - Linux-only target (x86_64 + aarch64) - 31 unit tests covering state machine, policy evaluation, backend selection, pool management, template lifecycle, and hook mutual exclusion [v2] fix: remove duplicate TOML sections in example configs [v2] fix: add TODO(v0.2) for backend probe, reset, and drain semantics Assisted-by: Qoder:1.10.3 Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Add BackendSpawner trait with dual implementation: - LinuxSandboxSpawner: real process spawn via tokio::process::Command - MockSpawner: simulated lifecycle for testing and macOS dev Integrate into daemon POST /v1/instances flow: - probe() checks binary availability at startup (build_spawner) - spawn() called on Creating -> Running transition; failure forces the lifecycle to Destroyed and surfaces the error - kill() called on destroy endpoint (non-fatal), handles tracked in ServerState::spawn_handles Tests: mock spawner lifecycle + probe missing binary + kill no-pid noop. Trait abstraction enables testing without a real linux-sandbox binary; daemon auto-downgrades to mock when the configured backend binary path is missing or fails probe. Assisted-by: Qoder:1.10.3 Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Change spawn() to invoke /usr/bin/bwrap with namespace isolation flags (--unshare-pid, --unshare-net, --ro-bind, --die-with-parent) instead of the hypothetical bwrap isolation interface. Verified on Alinux 4: real bwrap process spawned, PID tracked, kill on destroy confirmed. Assisted-by: Qoder:1.10.3 Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
…overy Introduce [storage] config section: - images_dir: unified directory for vmlinux, rootfs base, and memfile (default /var/lib/anvil/images, no runtime-specific path assumptions) Assisted-by: Qoder:1.10.3 Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Add optional TCP listener alongside the existing UDS socket: - New [listen] config section with http_addr field - ListenSection with Default derive (empty = disabled) - TCP bind in daemon startup when http_addr is non-empty - Port 14159 as convention for remote platform API Platforms (Substrate, E2B orchestrators) can now reach the daemon over the network without going through the Unix domain socket. [v2] fix: derive Default for ListenSection instead of manual impl Assisted-by: Qoder:1.10.3 Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Implement BackendSpawner for Firecracker: - Generates vmconfig.json from images_dir (vmlinux + rootfs.ext4) - Spawns FC in config-file mode with per-instance api.sock - probe() verifies binary via --version - kill() sends SIGTERM Also adds AnvilError::BackendError variant for runtime file discovery failures (missing vmlinux/rootfs). Assisted-by: Qoder:1.10.3 Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com> Assisted-by: Qoder:1.10.3
Add dist/ directory with packaging artifacts: - anvil.service: systemd unit (Type=simple, daemon-style) - anvil.spec: RPM spec for building and installing anvil - tmpfiles-anvil.conf: create /run/anvil at boot [v2] fix: bump Cargo.toml and manifest version to 0.2.0 matching spec Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com> Assisted-by: Qoder:1.10.3
Assisted-by: Qoder:1.10.3 Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Add src/anvil/AGENTS.md with module-specific conventions: - Two-crate workspace architecture - Build/test commands - Key design constraints (daemon-only API, BackendSpawner trait, policy-driven selection, lifecycle state machine, MockSpawner) - Adding a new backend checklist Register anvil in .github/commitlint.config.json scope-enum. Assisted-by: Qoder:1.10.3 Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
kongche-jbw
left a comment
There was a problem hiding this comment.
补两处很小的文档一致性建议,不一定需要挡当前 PR。
|
|
||
| - **HTTP API** — Unix domain socket (`/run/anvil/api.sock`) + TCP (`:14159`) | ||
| - **Policy-driven backend selection** — workload class → backend priority list | ||
| - **Lifecycle state machine** — 8 states (Created → Ready → Running → Paused → Checkpointed → Resetting → Destroying → Destroyed) |
There was a problem hiding this comment.
小文档一致性问题:这里的 lifecycle 状态名和 SandboxState enum 不一致。当前实现是 Pending / Creating / Running / Paused / Checkpointed / Reset / Warm / Destroyed,文档写的是 Created / Ready / ... / Destroying / Destroyed。是否可以对齐一下,避免后续按文档补状态迁移时混用两套状态名?
| cargo build --release | ||
|
|
||
| # Run daemon | ||
| sudo ./target/release/anvil daemon start --config examples/config.toml |
There was a problem hiding this comment.
这个不一定要挡当前 PR,不过 quick start 里用 examples/config.toml 直接启动时,配置里的 policy.dir 指向 /etc/anolisa/anvil/policies。源码 checkout 场景下这个目录通常不存在,daemon 会在加载 policy 时退出。是否可以后续把 quick start 改成先复制 examples/policies 到 /etc/...,或者提供一个 dev config 指向 examples/policies?这样 README 的第一条试用路径会流畅一些。
Description
Introduce anvil, a per-host sandbox daemon that manages sandbox instance lifecycles via HTTP API. Anvil supports multiple backends (Firecracker microVM, bubblewrap/bwrap) with policy-driven selection.
What's included
Architecture
Related Issue
no-issue: new component introduction
Type
Scope
Testing