From 2d82eacd667bde3b49e8941279de617148e54e71 Mon Sep 17 00:00:00 2001
From: Kevin Wang
Date: Wed, 27 May 2026 00:36:51 -0700
Subject: [PATCH] docs: fix issues found running the guide end-to-end on H100 spot
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Walked through the guide end-to-end on a fresh us-central1-a
a3-highgpu-1g SPOT instance (boot to CC verification to remove). Fixes:

- §2.1: list gsutil explicitly, since `dstack-cloud deploy` shells out
  to `gsutil cp` for the boot/shared image upload, and a partial gcloud
  SDK layout that omits gsutil from PATH crashes deploy at the upload
  step with `FileNotFoundError: 'gsutil'`. Include the symlink snippet
  for the common case.
- §2.3: a dstack-cloud copy installed before #15 was merged still
  parses cleanly but drops the `provisioning_model` field on
  serialization, so the SPOT setting from §3.2 silently regresses to
  STANDARD. Spell out the "refresh your install" step.
- §3.4: drop the stray `chmod +x prelaunch.sh` line from inside the
  script body; it would only run inside the TEE guest where the cwd has
  no such file. `dstack-cloud new` already creates the script as 0755,
  so editing it preserves the mode bits.
- §4.1: `ls shared/` after `prepare` shows three files, not four;
  `.user-config` stays at project root until deploy's FAT image build.
- §4.3: `libspdm_check_crypto_backend: LKCA wrappers found.` does not
  reach the serial console that `dstack-cloud logs` taps. Confirming
  LKCA is healthy via `nvidia-smi conf-compute -f` (§5.3) is the
  reliable check; keep the stub-fallback warning as a 0.6.0-only
  symptom.
- README: drop dead links to guide_CN.md and workshop/ (neither exists
  in the repo).
---
 README.md   |  6 +++---
 guide_EN.md | 32 +++++++++++++++++++++++---------
 2 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index 58f5b57..33c4c28 100644
--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@ Public guide for deploying a Confidential Compute GPU workload on
 [`dstack-cloud`](https://github.com/Phala-Network/meta-dstack-cloud).
 
 - **English:** [`guide_EN.md`](guide_EN.md)
-- **中文:** [`guide_CN.md`](guide_CN.md) (coming soon)
+- **中文:** _coming soon_
 
-Architecture diagram, troubleshooting notes, and example apps live under
-[`workshop/`](workshop/).
+The architecture diagram, troubleshooting cheat sheet, and image-build
+recipe live inside `guide_EN.md` (see §1 and the appendices).
diff --git a/guide_EN.md b/guide_EN.md
index 1ff9e12..1566bc8 100644
--- a/guide_EN.md
+++ b/guide_EN.md
@@ -75,6 +75,7 @@ You'll need these on your local machine:
 | Tool | Tested version | Notes |
 | ----------------------------------------------- | -------------- | ----- |
 | [gcloud SDK](https://cloud.google.com/sdk/docs/install) | 565+ | `gcloud auth login` against your GCP account |
+| `gsutil` | bundled with gcloud SDK | `dstack-cloud deploy` shells out to `gsutil cp` to stage the boot/shared images; make sure it's on `$PATH` (`ln -s "$(gcloud info --format='value(installation.sdk_root)')/bin/gsutil" ~/.local/bin/gsutil` if missing) |
 | [`dstack-cloud`](https://github.com/Phala-Network/meta-dstack-cloud/blob/main/scripts/bin/dstack-cloud) | latest from `main` | single Python file, drop on `$PATH` |
 | `openssl` | any | SSH-over-TLS proxy command |
 | `curl`, `jq`, `tar` | any | |
@@ -126,8 +127,11 @@ dstack-cloud --help | head -5
 > being present in your `dstack-cloud` copy — it exposes
 > `gcp_config.provisioning_model` (default `STANDARD`, set to `SPOT`
 > for H100 without on-demand quota). The `curl` command above pulls
-> from `main`, so once #15 lands you're set; if you pinned an older
-> revision, refresh it before proceeding.
+> from `main`, so once #15 lands you're set; **if you already had
+> `dstack-cloud` installed before #15 was merged, re-run the `curl`
+> above to refresh it**; otherwise `dstack-cloud new` will produce an
+> `app.json` without the `provisioning_model` field and your SPOT
+> setting in §3.2 will silently fall back to `STANDARD`.
 
 ### 2.4 Configure `dstack-cloud`
 
@@ -283,13 +287,15 @@ docker run --rm --privileged --pid=host --net=host -v /:/host \
   -e SSH_GITHUB_USER="" \
   kvin/dstack-openssh-installer:latest
 echo "OpenSSH installation complete"
-chmod +x prelaunch.sh
 ```
 
 `SSH_GITHUB_USER` should be a GitHub handle whose **public keys** you
 want to allow. Keys are fetched from `https://github.com/.keys`
 at deploy time.
 
+`dstack-cloud new` already creates `prelaunch.sh` as executable
+(`0755`), so you don't need to chmod it after editing.
+
 ---
 
 ## 4. Deploy
@@ -299,9 +305,12 @@ at deploy time.
 ```bash
 dstack-cloud prepare
 ls shared/
-# .instance_info .sys-config.json .user-config app-compose.json
+# .instance_info .sys-config.json app-compose.json
 ```
 
+`.user-config` stays at the project root after `prepare`; it's only
+copied into the shared FAT image at `deploy` time.
+
 ### 4.2 Deploy
 
 ```bash
@@ -327,18 +336,23 @@ dstack-cloud logs -n 20000 | grep -iE \
 You're looking for, in order:
 
 ```
-dstack-prepare.sh ... Requesting app keys from KMS
+dstack-prepare.sh ... Requesting app keys from KMS: https://kms.tdxlab.dstack.org:13001/prpc
 dstack-prepare.sh ... Key provider info: KeyProviderInfo { name: "kms", ... }
 dstack-prepare.sh ... Setting up disk encryption
 ...
 NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 580.105.08
-libspdm_check_crypto_backend: LKCA wrappers found.
 ...
-app-compose.sh ... Container dstack-pytorch-1 Started
+app-compose.sh ... pytorch Pulled
+app-compose.sh ... Container dstack-pytorch-1 Creating
 ```
 
-If you see `libspdm expects LKCA but found stubs!`, you booted the
-wrong image (probably `0.6.0`). Re-pull `0.6.1+`.
+Note that the SPDM/LKCA handshake messages (`libspdm_check_crypto_backend:
+LKCA wrappers found.` etc.) go to the kernel/journal but are **not**
+emitted on the serial console that `dstack-cloud logs` shows. Confirm
+LKCA actually engaged via `nvidia-smi conf-compute -f` after you SSH in
+(§5.3); if it reports `CC status: ON`, you're good. If you see the
+stub-fallback warning `libspdm expects LKCA but found stubs!` here (on a
+`0.6.0` or older image, where it *does* leak to dmesg), re-pull `0.6.1+`.
 
 If you see `Failed to get app keys from KMS ... Failed to decode
 attestation`, the KMS at `:13001` is older than your guest image.
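
Reviewer note: the §2.1 and §2.3 fixes both describe failures that only show up late (at the `gsutil cp` upload step, or as a silent fallback to `STANDARD`). A preflight check along the following lines could catch them before `dstack-cloud deploy`. This is only a sketch and not part of `dstack-cloud`: the helper names `have_tool`, `missing_tools`, and `has_provisioning_model` are hypothetical, and grepping the single-file script for `provisioning_model` is an assumed heuristic for "this copy includes #15", not a documented version check.

```shell
#!/bin/sh
# Preflight sketch (hypothetical helpers, not shipped with dstack-cloud).

# Succeed if the named tool resolves on $PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Print, one per line, every tool in the argument list that is missing.
missing_tools() {
  for tool in "$@"; do
    have_tool "$tool" || printf '%s\n' "$tool"
  done
}

# Assumed heuristic: a dstack-cloud copy that includes #15 mentions
# `provisioning_model` somewhere in its source file.
has_provisioning_model() {
  grep -q 'provisioning_model' "$1"
}

# Tool list taken from the §2.1 table.
missing="$(missing_tools gcloud gsutil dstack-cloud openssl curl jq tar | tr '\n' ' ')"
if [ -n "$missing" ]; then
  echo "missing from PATH: $missing"
fi

# Only inspect dstack-cloud if it was found at all.
dc_path="$(command -v dstack-cloud || true)"
if [ -n "$dc_path" ] && ! has_provisioning_model "$dc_path"; then
  echo "dstack-cloud at $dc_path may predate #15; re-run the curl from §2.3"
fi
```

The script only reports; it never exits non-zero, so it can be sourced or run ahead of `dstack-cloud prepare` without interrupting a session.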