feat(anvil): introduce per-host sandbox daemon #1334

Open
casparant wants to merge 10 commits into alibaba:main from casparant:feat/anvil
Conversation

@casparant

Collaborator

Description

Introduce anvil, a per-host sandbox daemon that manages sandbox instance lifecycles via HTTP API. Anvil supports multiple backends (Firecracker microVM, bubblewrap/bwrap) with policy-driven selection.

What's included

  • anvil-core (library): policy engine, lifecycle state machine (8 states), backend selector, pool manager, template registry, kernel hook registry, config schema
  • anvil (binary): daemon HTTP server (UDS + TCP, 18 endpoints), spawner implementations (Firecracker, Bubblewrap, Mock), Prometheus metrics, CLI for daemon lifecycle
  • dist/: systemd service unit, RPM spec, tmpfiles config
  • AGENTS.md: scoped module rules for AI coding assistants
  • CHANGELOG.md: bilingual v0.2.0 release notes

Architecture

  • Daemon-only API model — no CLI client for sandbox operations
  • BackendSpawner trait abstracts process management per backend
  • Policy-driven backend selection: workload class → prioritized backend list
  • MockSpawner fallback for dev/test on non-Linux hosts

Related Issue

no-issue: new component introduction

Type

  • feat — new feature
  • docs — documentation

Scope

  • anvil

Testing

cd src/anvil && cargo test --workspace
# 31+ unit tests covering state machine, policy evaluation,
# backend selection, pool management, template lifecycle,
# hook mutual exclusion, spawner lifecycle

@github-actions github-actions Bot added the scope:documentation ./docs/|./*.md|./NOTICE label Jul 4, 2026
@casparant casparant requested a review from kongche-jbw as a code owner July 4, 2026 08:05

@kongche-jbw kongche-jbw left a comment

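Collaborator

As I understand it, the overall direction aligns with the daemon-first / per-host sandbox agent model in the design doc, and this revision already puts the main skeleton in place. I have a few points I'd like you to double-check; please treat all of them as suggestions.

First, a few points in this general comment that don't fit well on a single line:

  • The version numbers don't seem fully reconciled yet: Cargo.toml / component manifest say 0.1.0, while the spec / CHANGELOG say 0.2.0. Should these be unified before merging, so the package version and the binary version don't diverge later?
  • The API contract still differs slightly from the examples in the design doc: in the implementation, the create request requires image_digest and the response is { instance, decision, start_path }, whereas the design doc's test cases are often written against image, an optional backend, and top-level .id/.state. This may just be the docs lagging behind; I'd suggest settling on one stable contract first.
  • Local verification results, for reference: cargo test --workspace passes; cargo fmt --all -- --check passes; cargo clippy --workspace --all-targets -- -D warnings currently fails; and all three example TOML files fail to parse with a standard TOML parser.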

Comment thread src/anvil/examples/config.toml Outdated
#
# Production deployments install this file at /etc/anolisa/anvil/config.toml.

[daemon]

Collaborator

I may be missing context here, but this file looks like the detailed and the compact config examples were concatenated. A standard TOML parser reports Cannot declare ('daemon') twice at this second [daemon], which affects the README quick start and the default config installed by the RPM. Could we keep just one version of the example and, while at it, add a parse test for the example config?

Collaborator Author

Fixed. Removed the duplicate sections from all three example TOML files (the annotated version came first; the compact copy has been deleted). All three now parse with a standard TOML parser. Parse tests are covered in config.rs (defaults_round_trip / parses_full_example).
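
To make the failure mode in this thread concrete, here is a minimal sketch of a duplicate-table check. It is a line-based heuristic that uses only the standard library, not the parser-backed tests mentioned for config.rs, and the function name `first_duplicate_table` is hypothetical.

```rust
use std::collections::HashSet;

/// Returns the name of the first standard table header (`[name]`) that is
/// declared more than once. Heuristic only: it does not parse TOML, and it
/// skips array-of-tables headers (`[[name]]`), which may legitimately repeat.
fn first_duplicate_table(src: &str) -> Option<String> {
    let mut seen = HashSet::new();
    for raw in src.lines() {
        // Strip a trailing comment and surrounding whitespace.
        let line = raw.split('#').next().unwrap_or("").trim();
        if line.starts_with("[[") || !line.starts_with('[') || !line.ends_with(']') {
            continue;
        }
        let name = line[1..line.len() - 1].trim().to_string();
        // `insert` returns false when the header was already seen.
        if !seen.insert(name.clone()) {
            return Some(name);
        }
    }
    None
}
```

Run against a file where a compact copy was appended after the annotated one, a check like this would report `daemon` (or `match` for the policy files). A real parse test should still go through an actual TOML parser, since this sketch can misfire on bracketed values that occupy a whole line.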

policy_name = "agent-rl-default"
priority = 100

[match]

Collaborator

This looks like the same concatenation problem: [match] has already been declared above, so a standard TOML parser reports a duplicate table on this line. Given that the daemon loads policies at startup (examples/policies/*.toml, /etc/anolisa/anvil/policies/), could the policy examples first be fixed so that each is a single valid TOML document?

Collaborator Author

Fixed, same treatment as above. Kept the annotated version and removed the trailing compact copy.

policy_name = "agent-tool-default"
priority = 90

[match]

Collaborator

Here too, the duplicate [match] makes the policy file unparseable. I'd suggest fixing it together with agent-rl.toml so that both are valid TOML and the default policy directory can be loaded by the daemon directly.

Collaborator Author

Fixed, same as above.

pub http_addr: String,
}

impl Default for ListenSection {

Collaborator

Running cargo clippy --workspace --all-targets -- -D warnings locally triggers clippy::derivable_impls here. Could this be changed to #[derive(Default)]? That should keep the behavior unchanged while letting the clippy gate pass.

Collaborator Author

Fixed. Switched to #[derive(Debug, Clone, Default, Serialize, Deserialize)] and removed the manual impl. clippy now passes.
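
As an illustration of why the derive is behavior-preserving, the sketch below compares a derived `Default` with the kind of hand-written impl that `clippy::derivable_impls` flags. Both structs are stand-ins: the serde derives are omitted so the example needs no external crates, and the assumption that the original manual impl returned an empty `http_addr` is mine, based on the "empty = disabled" note in the commit message.

```rust
/// Stand-in for the PR's `ListenSection` (serde derives omitted).
#[derive(Debug, Clone, Default, PartialEq)]
struct ListenSection {
    /// An empty string means the TCP listener is disabled.
    http_addr: String,
}

/// Hypothetical "before" version with a hand-written impl.
#[derive(Debug, Clone, PartialEq)]
struct ManualListenSection {
    http_addr: String,
}

impl Default for ManualListenSection {
    fn default() -> Self {
        // Every field is set to its own type's default, which is exactly
        // what `#[derive(Default)]` generates.
        Self { http_addr: String::new() }
    }
}
```

If the manual impl had set a non-empty address, the derive would change behavior, so the equivalence only holds under the assumption stated above.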


// 2. Backend selection. v0.1 marks every backend in the priority
// list as available — production probing lands in Phase 2.
let availability: Vec<BackendStatus> = decision

Collaborator

I'm a bit worried that backend selection and the actual spawner can get out of sync here: the API side marks every backend in the policy as available, but the daemon stores only a single global spawner at startup. For example, if the Firecracker probe fails and Mock/Bubblewrap is actually used, create may still select and record firecracker. Could the availability matrix probed at startup be passed to the selector, or could the spawner become a registry indexed by BackendKind?

Collaborator Author

Agreed. v0.2 will introduce a spawner registry (indexed by BackendKind) and pass the probe results to the selector at startup. A TODO(v0.2) marker has been added for now so it isn't forgotten.
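
One way the registry idea from this thread could look is sketched below. It is not the planned v0.2 design: the `Spawner` trait is a trimmed-down, hypothetical stand-in for `BackendSpawner`, and `SpawnerRegistry`, `register`, and `select` are illustrative names. The point is only that selection consults what actually passed the probe.

```rust
use std::collections::HashMap;

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum BackendKind {
    Firecracker,
    Bubblewrap,
    Mock,
}

/// Hypothetical minimal spawner interface (probe only).
trait Spawner {
    fn probe(&self) -> bool;
}

/// Test double whose probe result is fixed at construction.
struct FixedProbe(bool);

impl Spawner for FixedProbe {
    fn probe(&self) -> bool {
        self.0
    }
}

/// Holds only the spawners whose probe succeeded at registration time.
struct SpawnerRegistry {
    spawners: HashMap<BackendKind, Box<dyn Spawner>>,
}

impl SpawnerRegistry {
    fn new() -> Self {
        Self { spawners: HashMap::new() }
    }

    /// Probes once; a backend that fails the probe is never selectable.
    fn register(&mut self, kind: BackendKind, spawner: Box<dyn Spawner>) -> bool {
        let ok = spawner.probe();
        if ok {
            self.spawners.insert(kind, spawner);
        }
        ok
    }

    /// Returns the first backend in the policy's priority list that is registered.
    fn select(&self, priority: &[BackendKind]) -> Option<BackendKind> {
        priority.iter().copied().find(|k| self.spawners.contains_key(k))
    }
}
```

With this shape, a failed Firecracker probe means a policy that lists Firecracker first falls through to the next available backend, and the recorded backend matches the spawner that actually runs the instance.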

let inst = map
.get_mut(&uuid)
.ok_or_else(|| AnvilDaemonError::NotFound(format!("instance {uuid}")))?;
inst.transition(SandboxState::Reset)?;

Collaborator

Right now this only switches the instance state to reset and then back to warm; it doesn't yet perform the full-recreate / rollback actions specified by the policy's reset_mode. If a real reset isn't planned for this phase, could we at least avoid returning a still-running, un-reset data-plane process to the warm pool, or state explicitly in the API response that this is a control-plane placeholder?

Collaborator Author

Added a TODO(v0.2) marker: reset is currently only a control-plane state transition and performs no data-plane reset (full-recreate / mm-template rollback). When the v0.2 supervisor lands, it will perform the actual reset according to the policy's reset_mode before returning the instance to the pool.
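
The ordering described here (reset the data plane first, only then return the instance to the pool) can be sketched as follows. Everything in the block is an assumption for illustration: the `DataPlane` trait, the `ResetMode` variants, the allowed Running → Reset transition, and the choice to mark the instance Destroyed when the reset fails are not taken from the PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SandboxState {
    Running,
    Reset,
    Warm,
    Destroyed,
}

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ResetMode {
    FullRecreate,
    Rollback,
}

/// Hypothetical data-plane hook; `Err` means the reset could not be carried out.
trait DataPlane {
    fn reset(&mut self, mode: ResetMode) -> Result<(), String>;
}

/// Moves an instance to Warm only after the data plane was actually reset.
fn reset_instance(
    state: &mut SandboxState,
    plane: &mut dyn DataPlane,
    mode: ResetMode,
) -> Result<(), String> {
    if *state != SandboxState::Running {
        return Err(format!("cannot reset from {state:?}"));
    }
    *state = SandboxState::Reset;
    match plane.reset(mode) {
        Ok(()) => {
            *state = SandboxState::Warm;
            Ok(())
        }
        Err(e) => {
            // Never hand an un-reset process back to the warm pool.
            *state = SandboxState::Destroyed;
            Err(e)
        }
    }
}
```

The design choice worth noting is the failure branch: an instance whose reset failed is retired rather than reused, which is the property the reviewer asked for.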

.map_err(|e| AnvilDaemonError::BadRequest(format!("backend: {e}")))?;
let class_kind = WorkloadClass::from_str(class)
.map_err(|e| AnvilDaemonError::BadRequest(format!("class: {e}")))?;
let drained = {

Collaborator

drain currently only removes warm instance IDs from the pool queue; it doesn't destroy the instances or kill the corresponding spawn handles. After a caller sees warm_count = 0, the underlying processes may still be running. Could drain also run the destroy logic afterwards, or could the endpoint's semantics be changed to return only the list of instances to destroy, with that documented?

Collaborator Author

Added a TODO(v0.2) marker: after drain, spawn_handles must be walked and the underlying processes killed to release host resources. The current semantics only remove IDs from the pool queue; v0.2 will add the destroy logic.
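
A possible shape for "drain, then kill" is sketched below under several assumptions: instance IDs are plain `u64` values instead of UUIDs, `SpawnHandle` is a hypothetical one-method trait, and `Pool` bundles the warm queue with the handle map, which may not match how `ServerState::spawn_handles` is organized.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical handle for a spawned backend process.
trait SpawnHandle {
    fn kill(&mut self) -> Result<(), String>;
}

struct Pool {
    /// Warm instance IDs, oldest first.
    warm: VecDeque<u64>,
    /// Live process handles keyed by instance ID.
    handles: HashMap<u64, Box<dyn SpawnHandle>>,
}

impl Pool {
    /// Empties the warm queue and kills each drained instance's process.
    /// Returns the drained IDs plus any kill failures, so the caller can
    /// report instances whose processes may still be running.
    fn drain_and_destroy(&mut self) -> (Vec<u64>, Vec<(u64, String)>) {
        let mut drained = Vec::new();
        let mut failures = Vec::new();
        while let Some(id) = self.warm.pop_front() {
            if let Some(mut handle) = self.handles.remove(&id) {
                if let Err(e) = handle.kill() {
                    failures.push((id, e));
                }
            }
            drained.push(id);
        }
        (drained, failures)
    }
}
```

Returning the failures separately keeps the kill non-fatal, in line with how the destroy endpoint treats kill() in the commit description, while still letting the API surface that warm_count = 0 does not guarantee all processes are gone.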

@casparant

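Collaborator Author

All review comments have been addressed; the changes were amended into the commits that introduced each issue (marked with [v2] fix:):

Fixed (blocking items):

  • ✅ Duplicate sections removed from the three example TOML files (all validated with tomllib)
  • ✅ ListenSection changed to #[derive(Default)]; clippy passes
  • ✅ Versions reconciled: Cargo.toml / manifest unified to 0.2.0

Marked TODO (v0.2 architecture suggestions):

  • ✅ backend probe → selector: pass in the availability matrix
  • ✅ reset must perform a data-plane reset before returning the instance to the pool
  • ✅ drain must kill the underlying processes afterwards

API contract note: the code is authoritative (image_digest request field, {instance, decision, start_path} response structure); the design doc will be updated in one pass during the v0.2 alignment.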

casparant added 10 commits July 4, 2026 18:06
Introduce the anvil per-host sandbox orchestrator:

- anvil-core: policy engine, lifecycle state machine (8 states),
  backend selector (10 backends), pool manager, template registry,
  kernel hook registry
- anvil daemon: UDS HTTP API server (18 endpoints), daemon lifecycle
  commands (start/reload/doctor), signal handling (SIGHUP reload,
  SIGTERM graceful shutdown)
- examples: config.toml, agent-rl.toml, agent-tool.toml policies
- manifests/anvil.toml: component manifest (domain=anvil)

Architecture:
- Daemon-only API model (no CLI client; all management via HTTP API)
- Spawns delegate backend processes directly
- Linux-only target (x86_64 + aarch64)
- 31 unit tests covering state machine, policy evaluation,
  backend selection, pool management, template lifecycle,
  and hook mutual exclusion

[v2] fix: remove duplicate TOML sections in example configs
[v2] fix: add TODO(v0.2) for backend probe, reset, and drain semantics

Assisted-by: Qoder:1.10.3
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Add BackendSpawner trait with dual implementation:
- LinuxSandboxSpawner: real process spawn via tokio::process::Command
- MockSpawner: simulated lifecycle for testing and macOS dev

Integrate into daemon POST /v1/instances flow:
- probe() checks binary availability at startup (build_spawner)
- spawn() called on Creating -> Running transition; failure forces
  the lifecycle to Destroyed and surfaces the error
- kill() called on destroy endpoint (non-fatal), handles tracked in
  ServerState::spawn_handles

Tests: mock spawner lifecycle + probe missing binary + kill no-pid noop.

Trait abstraction enables testing without a real linux-sandbox binary;
daemon auto-downgrades to mock when the configured backend binary path is
missing or fails probe.

Assisted-by: Qoder:1.10.3
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Change spawn() to invoke /usr/bin/bwrap with namespace isolation
flags (--unshare-pid, --unshare-net, --ro-bind, --die-with-parent)
instead of the hypothetical bwrap isolation interface.

Verified on Alinux 4: real bwrap process spawned, PID tracked,
kill on destroy confirmed.

Assisted-by: Qoder:1.10.3
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
…overy

Introduce [storage] config section:
- images_dir: unified directory for vmlinux, rootfs base, and memfile
  (default /var/lib/anvil/images, no runtime-specific path assumptions)

Assisted-by: Qoder:1.10.3
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Add optional TCP listener alongside the existing UDS socket:
- New [listen] config section with http_addr field
- ListenSection with Default derive (empty = disabled)
- TCP bind in daemon startup when http_addr is non-empty
- Port 14159 as convention for remote platform API

Platforms (Substrate, E2B orchestrators) can now reach the daemon
over the network without going through the Unix domain socket.

[v2] fix: derive Default for ListenSection instead of manual impl

Assisted-by: Qoder:1.10.3
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Implement BackendSpawner for Firecracker:
- Generates vmconfig.json from images_dir (vmlinux + rootfs.ext4)
- Spawns FC in config-file mode with per-instance api.sock
- probe() verifies binary via --version
- kill() sends SIGTERM

Also adds AnvilError::BackendError variant for runtime file
discovery failures (missing vmlinux/rootfs).

Assisted-by: Qoder:1.10.3
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Assisted-by: Qoder:1.10.3
Add dist/ directory with packaging artifacts:
- anvil.service: systemd unit (Type=simple, daemon-style)
- anvil.spec: RPM spec for building and installing anvil
- tmpfiles-anvil.conf: create /run/anvil at boot

[v2] fix: bump Cargo.toml and manifest version to 0.2.0 matching spec

Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Assisted-by: Qoder:1.10.3
Assisted-by: Qoder:1.10.3
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Add src/anvil/AGENTS.md with module-specific conventions:
- Two-crate workspace architecture
- Build/test commands
- Key design constraints (daemon-only API, BackendSpawner trait,
  policy-driven selection, lifecycle state machine, MockSpawner)
- Adding a new backend checklist

Register anvil in .github/commitlint.config.json scope-enum.

Assisted-by: Qoder:1.10.3
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>

@kongche-jbw kongche-jbw left a comment

Collaborator

Two more minor documentation-consistency suggestions; they don't necessarily need to block this PR.

Comment thread src/anvil/README.md

- **HTTP API** — Unix domain socket (`/run/anvil/api.sock`) + TCP (`:14159`)
- **Policy-driven backend selection** — workload class → backend priority list
- **Lifecycle state machine** — 8 states (Created → Ready → Running → Paused → Checkpointed → Resetting → Destroying → Destroyed)

Collaborator

Minor documentation consistency issue: the lifecycle state names here don't match the SandboxState enum. The current implementation has Pending / Creating / Running / Paused / Checkpointed / Reset / Warm / Destroyed, while the doc lists Created / Ready / ... / Destroying / Destroyed. Could these be aligned, so the two sets of state names don't get mixed up when state transitions are later filled in from the docs?
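
A small guard against this kind of drift could look like the sketch below. The variant names are the ones listed in this comment; the `ALL` constant, the `name` method, and the `states_missing_from` helper are hypothetical and not part of the PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SandboxState {
    Pending,
    Creating,
    Running,
    Paused,
    Checkpointed,
    Reset,
    Warm,
    Destroyed,
}

impl SandboxState {
    const ALL: [SandboxState; 8] = [
        SandboxState::Pending,
        SandboxState::Creating,
        SandboxState::Running,
        SandboxState::Paused,
        SandboxState::Checkpointed,
        SandboxState::Reset,
        SandboxState::Warm,
        SandboxState::Destroyed,
    ];

    fn name(self) -> &'static str {
        match self {
            SandboxState::Pending => "Pending",
            SandboxState::Creating => "Creating",
            SandboxState::Running => "Running",
            SandboxState::Paused => "Paused",
            SandboxState::Checkpointed => "Checkpointed",
            SandboxState::Reset => "Reset",
            SandboxState::Warm => "Warm",
            SandboxState::Destroyed => "Destroyed",
        }
    }
}

/// Returns the enum state names that do not appear as whole words in `doc_line`.
/// Whole-word matching matters: "Resetting" must not count as "Reset".
fn states_missing_from(doc_line: &str) -> Vec<&'static str> {
    let words: Vec<&str> = doc_line.split(|c: char| !c.is_alphanumeric()).collect();
    SandboxState::ALL
        .iter()
        .map(|s| s.name())
        .filter(|n| !words.contains(n))
        .collect()
}
```

Applied to the README line quoted above, this reports Pending, Creating, Reset, and Warm as missing, which matches the mismatch described in the comment.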

Comment thread src/anvil/README.md
cargo build --release

# Run daemon
sudo ./target/release/anvil daemon start --config examples/config.toml

Collaborator

This needn't block the current PR, but when the quick start launches the daemon directly with examples/config.toml, policy.dir in that config points to /etc/anolisa/anvil/policies. In a source checkout that directory usually doesn't exist, so the daemon exits while loading policies. Could a follow-up change the quick start to first copy examples/policies to /etc/..., or provide a dev config that points at examples/policies? That would make the README's first trial path smoother.

Labels

scope:documentation ./docs/|./*.md|./NOTICE
