feat: stage AMD SEV-SNP attestation support #703
|
SEV-SNP TCB/advisory policy slice is pushed. What changed:
Still fail-closed:
Validation:
|
|
Continued with the next quality-gate slice and pushed a small clippy cleanup commit. Commit:
What changed:
Validation now passing:
Independent review of the cleanup diff found no behavior/security regressions. |
|
Milestone 1 is done: PR #703 is now review-ready staging for AMD SEV-SNP, still without production key release. New commit:
What changed:
Validation passed after doc/proof refresh:
I am marking the PR ready for review now. Milestone 2 remains separate: production SNP key release policy + revocation/advisory collateral + guarded release enablement. |
|
Milestone 2 is now implemented and pushed. Commit:
What changed:
Validation passed:
cargo fmt --all
cargo test -p dstack-kms --all-features
cargo test -p dstack-attest --all-features
cargo test -p dstack-vmm --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check
cd kms/auth-simple && npx oxlint . && npx vitest run
Independent security review: no release-gate blockers found after the self-authorization startup-safety fix. |
SNP E2E smoke follow-up
I kept going on the manual SNP smoke on
What the smoke found/fixed:
Smoke status:
Validation passed after the fixes:
cargo fmt --all
cargo test -p dstack-vmm --all-features
cargo test -p dstack-kms --all-features
cargo test -p dstack-attest --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check
cd kms/auth-simple && npx oxlint . && npx vitest run |
AMD SEV-SNP manual E2E smoke update
I pushed a follow-up commit that completes the dstack-managed SNP smoke path:
What the smoke proved
Failure gate also exercised
The lab host reports verifier-derived
Then, with an explicit lab-only allowlist (
Fixes included
Validation run
All passed locally:
cargo fmt --all
cargo test -p dstack-attest --all-features
cargo test -p dstack-util --all-features
cargo test -p dstack-kms --all-features
cargo test -p dstack-vmm --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check
cd kms/auth-simple && npx oxlint . && npx vitest run
No secret/key material was included in logs or this comment. |
|
Thank you so much for this — it's a huge and impressively thorough piece of work. 🙏 I have some other things on my plate right now, but I'll review this once I'm through them. Thanks again! |
Summary
This PR stages AMD SEV-SNP as a first-class dstack attestation platform alongside the existing TDX/Nitro/GCP paths, and now includes the controlled/fail-closed KMS key/cert release path for SNP.
At a high level, this branch:
/dev/sev-guest ioctl fallback.
app_id launch-measured for SNP by binding app identity into the measured kernel cmdline, matching the TDX semantic that app identity is part of the launch-measured identity rather than only KMS policy metadata.
BootInfo from verified evidence: measurement, chip id, app id, compose hash, rootfs hash, TCB status, and advisory ids.
Default security posture
SNP release remains fail-closed by default.
Defaults:
Sensitive release surfaces guarded by this gate:
GetAppKey
GetKmsKey
SignCert
GetTempCaCert
Additional safety: KMS startup rejects SNP release enablement unless enforce_self_authorization = true, so the self-authorized temp-CA path cannot silently bypass the SNP release policy.
Even when local release is enabled, external auth must still allow the verified SNP BootInfo.
What was proven on hardware
Manual hardware smoke was rerun on the SNP host:
The reusable smoke script is checked in at test-scripts/snp-e2e-smoke.sh. It is intended for manual SNP hosts, not CI.
Latest sanitized result at PR head 38b02d7c:
dstack-kms.
dstack_kms_attestation_requests_total 2
dstack_kms_attestation_failures_total 0
dstack-prepare.sh.
attestation_mode: dstack-amd-sev-snp
Requesting app keys from KMS: https://10.0.2.2:15443/prpc
SNP_APP_CONTAINER_STARTED / full app-key success with the released meta-dstack v0.5.11 image.
Diagnosis: this is image/tooling skew, not a KMS release-policy bypass. The host/KMS binaries are built from PR #703, but the app VM still uses the dstack-util/dstack-attest embedded inside the released meta-dstack v0.5.11 guest image. That older guest-side verifier can reject SNP certificate collateral before the newer PR cert-chain/KDS fallback paths can help.
So the PR currently proves live SNP report handling, golden-vector measurement recomputation, fail-closed release gates, dstack-managed SNP KMS boot, and app guest key-request boundary. Full fresh-box SNP_APP_CONTAINER_STARTED / GetAppKey success requires a coherent meta-dstack guest image whose kernel/modules/initramfs/rootfs/verity metadata and guest userspace include the same PR #703 dstack-util/dstack-attest SNP fixes.
Quote / attestation proof
Earlier guest quote proof confirmed the SNP guest can produce a hardware report containing the expected challenge bytes:
The final KMS smoke additionally proves that the app guest's SNP evidence verifies through KMS and auth successfully enough to release app material under the explicit lab policy.
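As a rough illustration of what "a hardware report containing the expected challenge bytes" means, the sketch below reads the REPORT_DATA field out of a raw SNP attestation report and compares it with a challenge. It is not the PR's verifier: the function name is hypothetical, the offset and length are assumptions taken from the AMD SEV-SNP firmware ABI report layout, and the zero-padding rule is an illustrative choice. A check like this only means something after the report signature and certificate chain have been verified.

```rust
/// Offset of REPORT_DATA inside a raw SNP attestation report.
/// Assumption: value taken from the AMD SEV-SNP firmware ABI report layout.
const REPORT_DATA_OFFSET: usize = 0x50;
/// REPORT_DATA is a 64-byte guest-supplied field (same assumption as above).
const REPORT_DATA_LEN: usize = 64;

/// Hypothetical helper: returns true when REPORT_DATA starts with `challenge`
/// and the rest of the field is zero padding.
/// The caller must already have verified the report signature.
fn report_binds_challenge(report: &[u8], challenge: &[u8]) -> bool {
    // A challenge longer than the field can never match.
    if challenge.len() > REPORT_DATA_LEN {
        return false;
    }
    // Reject truncated reports instead of panicking on a bad slice.
    let Some(field) = report.get(REPORT_DATA_OFFSET..REPORT_DATA_OFFSET + REPORT_DATA_LEN) else {
        return false;
    };
    let (head, tail) = field.split_at(challenge.len());
    head == challenge && tail.iter().all(|b| *b == 0)
}
```

Requiring zero padding after the challenge is stricter than a plain prefix match; a real verifier may instead hash extra context into the field, in which case the comparison would be against that digest.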
Measurement proof
A live golden-vector test on an SNP-capable host cross-checks dstack's pure Rust SNP measurement recomputation against sev-snp-measure:
cargo test -p dstack-kms --all-features recomputation_matches_sev_snp_measure_live_golden_vector -- --ignored --nocapture
Latest recorded proof:
See docs/amd-sev-snp-review-readiness.md for the fuller proof block and review boundary.
Important implementation notes
Key fixes discovered during E2E smoke:
.sys-config.json now includes sev_snp_measurement so KMS can recompute the same SNP launch measurement used by QEMU.
rootfs_hash only in kernel cmdline (dstack.rootfs_hash=...), so VMM/KMS now preserve and use that path.
docker_compose_hash, rootfs_hash, and app_id.
EPYC-v4 and confidential virtio PCI options (disable-legacy=on, iommu_platform=true).
chip_id + reported TCB when local evidence lacks cert collateral.
mr_config_id checks while preserving non-SNP behavior.
dstack-prepare.sh handles SNP guest detection, early chronyc unavailability, and minimal smoke DNS fallback.
Validation run
All passed locally on the final branch head:
Hardware smoke portability note
The smoke reaches KMS and the app guest key-request boundary on chris@173.234.27.162 with AMDSEV QEMU 10.0.2 and the checked-in smoke script. A separate local SNP host can boot SNP Linux with a newer Lit kernel, but the stock meta-dstack v0.5.11 6.9.0-dstack kernel stops after OVMF/EFI loads kernel+initrd and QEMU reports "cpus are not resettable, terminating", even with QEMU 10.0.2.
Reviewers/testers should first run test-scripts/snp-e2e-smoke.sh unchanged and confirm SNP_KMS_CONTAINER_STARTED plus the app guest key-request boundary. If it stops at the EFI-stub/reset boundary, treat that as host/image/kernel compatibility. If it reaches "Requesting app keys from KMS" and fails with the SNP cert-chain error, use/build a coherent PR #703 meta-dstack guest image rather than changing KMS policy or debugging Chipotle.
Known limitations / follow-ups
platform = "auto" remains conservative while SNP is experimental. Operators must explicitly set platform = "amd-sev-snp".
advisory_ids is currently explicit and empty. Future advisory/revocation collateral should populate it and will be denied unless explicitly allowlisted.
test-scripts/snp-e2e-smoke.sh.
meta-dstack guest image; rebuilding only host/KMS binaries leaves the app guest on older embedded verifier code.
tcbStatus = "OutOfDate"; success required an explicit lab allowlist. Production defaults still deny this.
Human review focus
Please pay special attention to:
Fail-closed release semantics
UpToDate only by default.
Measurement / identity binding
app_id, compose hash, rootfs hash, kernel/initrd/cmdline, OVMF, vCPU model, and guest features are all part of recomputation or policy input.
app_id is launch-measured, not just auth metadata.
AMD KDS collateral fallback
Non-SNP regression risk
DstackAmdSevSnp.
Operational policy choice
UpToDate TCB in production should remain an explicit operator decision, not a default.
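For reviewers checking the fail-closed semantics described in this PR, the following is a minimal sketch of how such a gate could be shaped. It is not the PR's code: the type, field, and method names are hypothetical, and only the behaviours named in the description are modelled (release off by default, the startup check on enforce_self_authorization, UpToDate-only TCB unless allowlisted, and advisory ids denied unless allowlisted). The external auth decision on the verified BootInfo is a separate step and is not modelled.

```rust
use std::collections::HashSet;

/// Hypothetical policy type; field names are illustrative, not the PR's config schema.
#[derive(Debug, Default)]
struct SnpReleasePolicy {
    /// Release is off unless an operator turns it on (fail-closed default).
    release_enabled: bool,
    enforce_self_authorization: bool,
    /// TCB statuses accepted in addition to "UpToDate" (for example a lab allowlist).
    allowed_tcb_statuses: HashSet<String>,
    /// Advisory ids that an operator has explicitly accepted.
    allowed_advisory_ids: HashSet<String>,
}

/// Reasons the gate refuses to release key or certificate material.
#[derive(Debug, PartialEq)]
enum Deny {
    ReleaseDisabled,
    TcbStatus(String),
    Advisory(String),
}

impl SnpReleasePolicy {
    /// Startup check: refuse a config that enables release without
    /// self-authorization enforcement.
    fn validate_at_startup(&self) -> Result<(), String> {
        if self.release_enabled && !self.enforce_self_authorization {
            return Err("SNP release requires enforce_self_authorization = true".to_string());
        }
        Ok(())
    }

    /// Per-request gate over fields taken from already verified evidence.
    fn check(&self, tcb_status: &str, advisory_ids: &[String]) -> Result<(), Deny> {
        if !self.release_enabled {
            return Err(Deny::ReleaseDisabled);
        }
        if tcb_status != "UpToDate" && !self.allowed_tcb_statuses.contains(tcb_status) {
            return Err(Deny::TcbStatus(tcb_status.to_string()));
        }
        // Any advisory that is not explicitly allowlisted denies the request.
        for id in advisory_ids {
            if !self.allowed_advisory_ids.contains(id) {
                return Err(Deny::Advisory(id.clone()));
            }
        }
        Ok(())
    }
}
```

With `SnpReleasePolicy::default()`, every request is denied with `Deny::ReleaseDisabled`. An OutOfDate host, such as the lab host described above, passes only after "OutOfDate" is added to `allowed_tcb_statuses`.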