27 changes: 27 additions & 0 deletions .env
@@ -0,0 +1,27 @@
# Variables consumed by docker-compose.yml and docker-compose-test.yml.
# docker compose reads this file automatically from the project root.

# Ubuntu base version. Selects which base image variant we pull from the
# dockerslurmcluster registry (must match one of the published base tags),
# and is also embedded in our own published image tag.
# Currently supported values: '26.04' or '24.04'
UBUNTU_VERSION=26.04

# Slurm version baked into the base image tag we pull. Bump only when the
# base dockerslurmcluster repo publishes new tags with a different slurm.
# Does NOT appear in our own published image tag -- consumers care about
# spack-stack version, slurm is implicit via the base image.
SLURM_VERSION=25.11.5

# Spack-stack version. Drives both the git checkout branch of jcsda/spack-stack
# and the tag suffix on our published images (and the corresponding buildcache
# repo on GHCR). Bump in lockstep with a new spack-stack release.
SPACK_STACK_VERSION=2.1.0

# Composite tags produced from the above:
# base image: ghcr.io/.../slurm-<role>:ubuntu-${UBUNTU_VERSION}-slurm-${SLURM_VERSION}
# our image: ghcr.io/.../slurm-spack-stack-<role>:ubuntu-${UBUNTU_VERSION}-spack-stack-${SPACK_STACK_VERSION}
# buildcache: ghcr.io/.../buildcache-ubuntu-${UBUNTU_VERSION}-spack-stack-${SPACK_STACK_VERSION}
#
# To switch bases for one invocation without editing this file:
# UBUNTU_VERSION=24.04 docker compose build
191 changes: 121 additions & 70 deletions .github/workflows/docker.yml

Large diffs are not rendered by default.

168 changes: 140 additions & 28 deletions README.md
@@ -26,52 +26,134 @@ sizes. The cluster behaves as if it were running on multiple
nodes even if the containers are all running on the same host
machine.

# Image tags and base selection

Published images are tagged by Ubuntu version + spack-stack version:

* `ubuntu-26.04-spack-stack-2.1.0` (also published as `latest`)
* `ubuntu-24.04-spack-stack-2.1.0`

Internally each variant pulls from the
[NOAA-GSL/DockerSlurmCluster](https://github.com/NOAA-GSL/DockerSlurmCluster)
base registry at the matching `ubuntu-<UBUNTU_VERSION>-slurm-<SLURM_VERSION>` tag.
The base image's slurm version is implicit -- consumers of these images interact
with the slurm tooling that came with the base, plus the spack-stack scientific
software stack layered on top.

A separate per-(ubuntu, spack-stack) OCI buildcache repo (e.g.
`ghcr.io/noaa-gsl/dockerspackstackslurmcluster/buildcache-ubuntu-26.04-spack-stack-2.1.0`)
holds binary artifacts so rebuilds reuse cached packages instead of recompiling
from source. Caches are split per OS to prevent cross-OS spec contamination
during concretization.
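
The naming scheme above can be sketched as a few shell helpers. This is an illustration rather than code from the repo: the function names are hypothetical, and the registry prefix is taken from the buildcache and image names that appear elsewhere in this README.

```bash
#!/usr/bin/env bash
# Sketch only: compose the tag strings from the three version variables.
# The helper names below are hypothetical, not part of the repo.

# Registry prefix as it appears in the image names in this README.
REGISTRY="ghcr.io/noaa-gsl/dockerspackstackslurmcluster"

# base_image_tag UBUNTU_VERSION SLURM_VERSION
# Tag of the base image pulled from the DockerSlurmCluster registry.
base_image_tag() {
  printf 'ubuntu-%s-slurm-%s\n' "$1" "$2"
}

# image_tag UBUNTU_VERSION SPACK_STACK_VERSION
# Tag of the images published by this repo (slurm version is implicit).
image_tag() {
  printf 'ubuntu-%s-spack-stack-%s\n' "$1" "$2"
}

# buildcache_repo UBUNTU_VERSION SPACK_STACK_VERSION
# One OCI buildcache repo per (ubuntu, spack-stack) pair.
buildcache_repo() {
  printf '%s/buildcache-%s\n' "$REGISTRY" "$(image_tag "$1" "$2")"
}
```

For example, `base_image_tag 26.04 25.11.5` yields the value passed as `BASE_IMAGE_TAG` in a direct `docker buildx build` invocation.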

## Configuring versions

The project root contains a `.env` file consumed by `docker compose`:

```bash
UBUNTU_VERSION=26.04
SLURM_VERSION=25.11.5
SPACK_STACK_VERSION=2.1.0
```

To run against the 24.04 base for one invocation without editing the file:

```bash
UBUNTU_VERSION=24.04 docker compose up -d --pull never
```

# Building the Containers

To build the containers from source:
## Quickest path: docker compose

## Master and Node Containers
`docker compose build` reads `.env` and constructs the full set of build args
automatically. To build all three containers (frontend, master, node) for the
default Ubuntu version:

```bash
docker build -t ghcr.io/noaa-gsl/dockerspackstackslurmcluster/slurm-spack-stack-master:latest -f master/Dockerfile master/
docker build -t ghcr.io/noaa-gsl/dockerspackstackslurmcluster/slurm-spack-stack-node:latest -f node/Dockerfile node/
docker compose build
```

## Frontend Container
Or just one:

```bash
docker compose build slurmfrontend
```

The frontend container requires a GitHub personal access token (PAT) with package write permissions to push built packages to the GitHub Container Registry build cache. Set your token in an environment variable and pass it as a secret during build:
To build for a non-default Ubuntu version:

```bash
export GITHUB_TOKEN=your_github_pat_here
docker build --progress=plain \
--secret id=github_token,env=GITHUB_TOKEN \
-t ghcr.io/noaa-gsl/dockerspackstackslurmcluster/slurm-spack-stack-frontend:latest \
-f frontend/Dockerfile \
frontend/
UBUNTU_VERSION=24.04 docker compose build slurmfrontend
```

**Note:** The `--progress=plain` flag shows full build output. The frontend build compiles 355+ scientific software packages from source and can take several hours on first build. Subsequent builds use the cached packages from GHCR.
### GitHub PAT for buildcache push

### Configuring Parallel Build Jobs
A GitHub personal access token (PAT) is only required if you want the build to
**push** newly-built spack packages back to the OCI buildcache (autopush) --
which is what CI and the original maintainer's builds do to keep the cache
populated. For most local development, where you just want to *consume*
artifacts the cache already has, no PAT is needed.

The frontend Dockerfile uses the `SPACK_BUILD_JOBS` build argument to control the number of parallel make jobs (`-j` flag) used when building each package (default: 8). This should match the number of CPU cores available:
The frontend Dockerfile only configures autopush when the docker secret
`github_token` is present *and non-empty*. Compose accepts an unset or empty
`GITHUB_TOKEN` environment variable (the secret simply becomes an empty file
inside the build), so pull-only builds work without setting anything:

**For 8-core systems (default):**
```bash
docker build --build-arg SPACK_BUILD_JOBS=8 ...
# Pull-only build: reads from the public buildcache, never pushes
docker compose build slurmfrontend
```

**For 16-core systems:**
For push-capable builds, set the PAT before invoking compose:

```bash
docker build --build-arg SPACK_BUILD_JOBS=16 ...
export GITHUB_TOKEN=your_github_pat_here # PAT with write:packages on the GHCR registry
docker compose build slurmfrontend
```

**With Docker Compose:**
Note: this assumes the buildcache repo on GHCR is **public** (which is the
case for the upstream NOAA-GSL caches). If you maintain a fork with a private
cache, you'll need a PAT with read permission on the cache repo even for
pull-only builds.
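
The "present and non-empty" check described above can be sketched in shell. This is an illustration, not the frontend Dockerfile's actual code: the function name `buildcache_mode` is hypothetical, and `/run/secrets/github_token` is assumed to be the mount path (BuildKit's default location for a secret with `id=github_token`).

```bash
#!/usr/bin/env bash
# Sketch only: decide between a push-capable and a pull-only build based on
# whether the secret file exists and has content.

# buildcache_mode [SECRET_FILE]
# Hypothetical helper, not part of the repo. SECRET_FILE defaults to the
# assumed BuildKit mount path for the github_token secret.
buildcache_mode() {
  local secret_file="${1:-/run/secrets/github_token}"
  # -s is true only when the file exists AND is larger than zero bytes, so an
  # unset or empty GITHUB_TOKEN (an empty secret file) selects pull-only.
  if [ -s "$secret_file" ]; then
    echo "push"
  else
    echo "pull-only"
  fi
}
```

Inside a `RUN --mount=type=secret,id=github_token` step, a call such as `mode="$(buildcache_mode)"` would then select whether autopush gets configured.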

## Direct buildx invocation

Equivalent build command for the frontend, useful when you want full control
(`--no-cache`, `--progress=plain`, custom tags) without going through compose:

```bash
export GITHUB_TOKEN=your_github_pat_here
docker buildx build \
--progress=plain \
--pull \
--secret id=github_token,env=GITHUB_TOKEN \
--build-arg SPACK_BUILD_JOBS=8 \
--build-arg BASE_IMAGE_TAG=ubuntu-26.04-slurm-25.11.5 \
--build-arg UBUNTU_VERSION=26.04 \
--build-arg SPACK_STACK_VERSION=2.1.0 \
-t ghcr.io/noaa-gsl/dockerspackstackslurmcluster/slurm-spack-stack-frontend:ubuntu-26.04-spack-stack-2.1.0 \
-f frontend/Dockerfile \
frontend/
```

The frontend build compiles ~355 scientific software packages and can take
many hours on first build from an empty buildcache. Subsequent builds reuse
cached packages from GHCR and finish much faster.

## Configuring Parallel Build Jobs

`SPACK_BUILD_JOBS` controls the number of parallel make jobs (`-j` flag) used
when building each package (default: 8). Match it to the CPU count of your
build machine:

```bash
docker buildx build --build-arg SPACK_BUILD_JOBS=16 ...
# or
docker compose build --build-arg SPACK_BUILD_JOBS=16
```

You can also modify the default in `docker-compose.yml`:
You can also change the default in `docker-compose.yml`:

```yaml
services:
slurmfrontend:
@@ -80,27 +162,57 @@ services:
SPACK_BUILD_JOBS: 16 # Change from default 8
```

**Performance note:** Higher values speed up compilation of individual packages, especially large ones like ESMF, JEDI components, and NetCDF. However, on 32GB RAM systems, values above 8 may cause memory pressure during compilation of memory-intensive Fortran packages, potentially leading to swapping or OOM errors.
**Performance note:** higher values speed up compilation of individual
packages, especially large ones like ESMF, JEDI components, and NetCDF. On
systems with 32 GB of RAM, however, values above 8 may cause memory pressure
while compiling memory-intensive Fortran packages, potentially leading to
swapping or OOM errors.
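
One way to pick a value is to start from the core count and cap it when memory is limited. The sketch below is a heuristic built on the note above, not something the repo provides: `pick_build_jobs` is a hypothetical helper, and the rule "cap at 8 when the machine has 32 GB or less" is an assumption drawn from that note.

```bash
#!/usr/bin/env bash
# Sketch only: choose SPACK_BUILD_JOBS from core count and memory size.

# pick_build_jobs CORES MEM_GB
# Hypothetical helper. Returns CORES, capped at 8 on hosts with <= 32 GB RAM
# (an assumed threshold based on the performance note).
pick_build_jobs() {
  local cores="$1" mem_gb="$2" jobs
  jobs="$cores"
  if [ "$mem_gb" -le 32 ] && [ "$jobs" -gt 8 ]; then
    jobs=8
  fi
  echo "$jobs"
}

# Example on a Linux build host (commented out so the sketch has no side effects):
# cores="$(nproc)"
# mem_gb="$(awk '/MemTotal/ {print int($2 / 1024 / 1024)}' /proc/meminfo)"
# docker compose build --build-arg SPACK_BUILD_JOBS="$(pick_build_jobs "$cores" "$mem_gb")"
```

Treat the threshold as a starting point and adjust it if builds swap or get OOM-killed.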

# Quick Start

To start the slurm cluster environment:
To start the slurm cluster environment (default Ubuntu 26.04):
```
docker-compose -f docker-compose.yml up -d
docker compose -f docker-compose.yml up -d --pull never
```

For 24.04:
```
UBUNTU_VERSION=24.04 docker compose -f docker-compose.yml up -d --pull never
```

The frontend container takes several minutes on first launch (it populates the
shared `opt-vol` volume with the spack-stack install). Healthchecks ensure the
master and nodes wait for the frontend before starting.

### Switching `UBUNTU_VERSION` between runs

Docker named volumes are not auto-rebuilt when you change the image they're
attached to. To switch from 26.04 to 24.04 (or vice versa) on the same host,
you must explicitly remove the existing `home-vol` and `opt-vol` first:

```
docker compose down -v # the -v flag deletes the named volumes
UBUNTU_VERSION=24.04 docker compose up -d --pull never
```

Without `-v`, the new container will mount the previous run's `/opt`, which
contains spack-built binaries linked against the *previous* OS's glibc. The
cluster will appear to start fine but `srun` of any spack-built executable will
fail with `GLIBC_X.YZ not found`.
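
A small guard can make the volume reset harder to forget. The sketch below remembers which Ubuntu version the volumes were populated for and reports when it no longer matches. Everything here is hypothetical: the marker file `.last_ubuntu_version`, the function name `volumes_stale`, and the wrapper workflow are illustrative assumptions, not part of the repo.

```bash
#!/usr/bin/env bash
# Sketch only: detect a change of UBUNTU_VERSION between runs.

# volumes_stale MARKER_FILE WANTED_VERSION
# Hypothetical helper. Exit status 0 (stale) when the marker records a
# different version than the one requested; 1 when it matches or when no
# marker exists yet (first run, nothing to clean up).
volumes_stale() {
  local marker="$1" wanted="$2"
  [ -f "$marker" ] || return 1
  [ "$(cat "$marker")" != "$wanted" ]
}

# Example wrapper (commented out so the sketch has no side effects):
# wanted="${UBUNTU_VERSION:-26.04}"
# if volumes_stale .last_ubuntu_version "$wanted"; then
#   docker compose down -v   # drop home-vol and opt-vol from the previous base
# fi
# UBUNTU_VERSION="$wanted" docker compose up -d --pull never
# printf '%s\n' "$wanted" > .last_ubuntu_version
```

The marker lives outside the volumes on purpose, so it survives `docker compose down -v`.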

To stop the cluster:
```
docker-compose -f docker-compose.yml stop
docker compose -f docker-compose.yml stop
```
To check the cluster logs:
```
docker-compose -f docker-compose.yml logs -f
docker compose -f docker-compose.yml logs -f
```
(stop logs with CTRL-c")
(stop logs with CTRL-c)

To check status of the cluster containers:
```
docker-compose -f docker-compose.yml ps
docker compose -f docker-compose.yml ps
```
To check status of Slurm:
```