
# Containers

### Overview

The cluster supports several ways to run containerized workloads including Apptainer, Pyxis/Enroot, and Docker.

The recommended way to run containers is **Apptainer**, which was designed for HPC/Slurm environments. It integrates cleanly with `srun` across multiple nodes, caches images in the shared `/home` directory, and runs containers as the submitting user (avoiding the issues that Docker's privileged daemon introduces).

For software that does not require a container, see Modules for Lmod and SHPC. Python virtual environments (`.venv`) installed under `/home` are also a straightforward option, since `/home` is mounted consistently across login and compute pods.

***

### Apptainer

Apptainer (formerly Singularity) runs containers as the submitting user. Images can be pulled directly from a Docker registry at runtime or pre-built as `.sif` files for faster starts and multi-node use.

#### Registry login

Public images (such as `rocm/` on Docker Hub) can be pulled without authentication. Private registries require logging in first:

```bash
apptainer registry login --username <username> docker://docker.io
```

You will be prompted for your password or access token. To supply credentials non-interactively:

```bash
echo '<token>' | apptainer registry login --username <username> --password-stdin docker://docker.io
```
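
Typing the token inline with `echo` can leave it in your shell history. As a minimal sketch, the token can instead be read from a file that only you can read. The function name `registry_login_from_file` and the token file path are hypothetical and chosen for illustration; the `apptainer registry login` flags are the same ones shown above.

```bash
# Hypothetical helper: log in with a token stored in a private file.
# The function name and token file location are illustrative assumptions.
registry_login_from_file() {
  local username="$1"
  local token_file="$2"
  local registry="${3:-docker://docker.io}"

  if [ ! -r "$token_file" ]; then
    echo "Cannot read token file: $token_file" >&2
    return 1
  fi
  # Feed the token on stdin so it is not typed on the command line.
  apptainer registry login --username "$username" --password-stdin "$registry" < "$token_file"
}

# Example:
#   chmod 600 ~/.registry-token
#   registry_login_from_file <username> ~/.registry-token
```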

Many cloud OCI registries use token-based authentication. In that case, pass the token as the password; a username is still required. Consult your provider's documentation for their specific login requirements. See the [Apptainer registry login documentation](https://apptainer.org/docs/user/main/cli/apptainer_registry_login.html) for all options.

Credentials are stored under your home directory and apply to subsequent `apptainer pull`, `apptainer exec`, and `apptainer shell` calls that reference the registry. To remove stored credentials:

```bash
apptainer registry logout docker://docker.io
```

#### Pulling an image

Pull an image from a Docker registry and save it as a local `.sif` file. Running from a login pod is fine for this step since it does not require a GPU allocation:

```bash
apptainer pull rocm-pytorch.sif docker://rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.10.0
```

The resulting `.sif` file can be used in any subsequent `apptainer exec` or `apptainer shell` call and starts faster than pulling the `docker://` URI at runtime. Store it on `/home` so it is accessible from compute pods.
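
In scripts that may run more than once, it can be convenient to skip the pull when the `.sif` file already exists. The following is a minimal sketch rather than a cluster-provided tool: the function name `pull_if_missing` and the `$HOME/containers` directory are assumptions chosen for illustration.

```bash
# Hypothetical helper: pull an image to a .sif file only if it is not already present.
# SIF_DIR and the function name are illustrative assumptions, not cluster defaults.
SIF_DIR="${SIF_DIR:-$HOME/containers}"

pull_if_missing() {
  local name="$1"   # output file name without extension, e.g. rocm-pytorch
  local uri="$2"    # image URI, e.g. docker://rocm/pytorch:<tag>
  local dest="$SIF_DIR/$name.sif"

  mkdir -p "$SIF_DIR"
  if [ -f "$dest" ]; then
    echo "Using existing image: $dest" >&2
  else
    apptainer pull "$dest" "$uri" || return 1
  fi
  # Print the path on stdout so callers can capture it.
  echo "$dest"
}

# Example (run from a login pod):
#   sif=$(pull_if_missing rocm-pytorch docker://rocm/pytorch:<tag>)
#   srun -N 1 --gpus-per-node=8 --pty apptainer shell "$sif"
```

Keeping the status message on stderr means the captured output is only the image path.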

#### Single-node interactive

```bash
srun -N 1 --gpus-per-node=8 --pty apptainer shell rocm-pytorch.sif
```

You can also pass a `docker://` URI directly without pulling first:

```bash
srun -N 1 --gpus-per-node=8 --pty \
  apptainer shell docker://rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.10.0
```

> **Note**: By default, Apptainer bind-mounts `/home/$USER` and `/tmp` from the host into the container. Use `--no-home` to skip the home directory mount, or `--contain` to skip both.

#### Batch job (single node)

For single-node jobs, Apptainer can pull the image at runtime:

```bash
#!/bin/bash
#SBATCH --job-name=apptainer-job
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=48
#SBATCH --time=02:00:00

srun apptainer exec \
  docker://rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.10.0 \
  python train.py
```

#### Multi-node jobs: building a BNXT-enabled SIF

For multi-node jobs, the container image must include the correct network software for the cluster's NICs. This can either be built into the image or passed through using Apptainer's CDI (Container Device Interface) support. See Installing Network Software in Container images for details.

***

### Pyxis

Pyxis is a SPANK plugin that integrates OCI container execution directly into `srun` via flags. It uses Enroot under the hood to manage squashfs-format images (`.sqsh`).

#### Pulling and caching a container

Images can be pulled from public repos using the `--container-image` flag.

```bash
# Pull the image and stash it as a named container (run once per image per node)
srun \
  --container-writable \
  --container-name=my-pytorch \
  --container-image=rocm/pytorch:latest \
  true
```

The `--container-name` flag caches the image as a named Enroot container. Subsequent `srun` steps using the same name skip the pull and start much faster. Because the container is writable (`--container-writable`), any modifications made during one step are preserved across subsequent steps that reference the same named container. To persist the container to disk, use the `--container-save=PATH` flag, which saves the container state as a squashfs (`.sqsh`) file that can be reused in future jobs.
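
As a rough sketch of reusing a saved image across jobs, a job script can prefer a previously saved squashfs file and fall back to the registry reference when the file is missing. The function name `pyxis_image_ref` and the `$HOME/containers` path are hypothetical; only `--container-image` and `--container-save` come from Pyxis itself.

```bash
# Hypothetical helper: print the reference to pass to --container-image.
# Prefers a saved squashfs file and falls back to the registry image.
pyxis_image_ref() {
  local sqsh="$1"    # saved squashfs path (assumed location under /home)
  local image="$2"   # registry reference, e.g. rocm/pytorch:<tag>
  if [ -f "$sqsh" ]; then
    echo "$sqsh"
  else
    echo "$image"
  fi
}

# One-time save:
#   srun --container-image=<img> --container-save="$HOME/containers/my-pytorch.sqsh" true
# Later jobs:
#   srun --container-image="$(pyxis_image_ref "$HOME/containers/my-pytorch.sqsh" <img>)" <cmd>
```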

#### Running a job with a named container

```bash
#!/bin/bash
#SBATCH --job-name=pyxis-job
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --time=02:00:00

# Pull and cache on all nodes
srun \
  --no-container-remap-root \
  --container-writable \
  --container-name=my-pytorch \
  --container-image=rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.10.0 \
  true

# Run the workload
srun \
  --no-container-remap-root \
  --container-name=my-pytorch \
  /opt/rccl-tests/all_reduce_perf -b 512M -e 8G -f 2 -g 1
```

Example Pyxis scripts are available at `/opt/tw/examples/libexec/*pyxis*.sbatch`.

#### Enroot as an Escape Hatch

Pyxis relies on Enroot as its container engine. Some operations, such as pulling an image from a private repo, require using the `enroot` CLI directly.

```bash
# Login to Docker
echo "YOUR_PASSWORD_OR_TOKEN" | docker login -u YOUR_USERNAME --password-stdin
# Import the image through the `dockerd://` scheme (Docker Engine backend)
# Image unpacking needs a privileged environment, so run it through srun rather than on a login node
srun enroot import dockerd://YOUR_REPO/YOUR_IMAGE:YOUR_TAG
```

This leaves a `*.sqsh` file in your current working directory, which can be passed to `--container-image` in future Slurm jobs.
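
To keep imported images in one predictable place instead of the current working directory, `enroot import` accepts an explicit output path via `-o` (`--output`). The sketch below derives a file name from the image reference; the function name `sqsh_name_for`, the naming scheme, and the `$HOME/containers` directory are assumptions for illustration, not anything Enroot requires.

```bash
# Hypothetical helper: derive a filesystem-safe .sqsh file name from an image reference.
# Replaces "/" and ":" with "_"; this naming scheme is an illustrative choice.
sqsh_name_for() {
  printf '%s.sqsh\n' "$(printf '%s' "$1" | tr '/:' '__')"
}

# Example:
#   out="$HOME/containers/$(sqsh_name_for YOUR_REPO/YOUR_IMAGE:YOUR_TAG)"
#   srun enroot import -o "$out" dockerd://YOUR_REPO/YOUR_IMAGE:YOUR_TAG
#   srun --container-image="$out" <cmd>
```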

***

### Docker

> **Warning:** Docker on worker nodes runs as root via a privileged daemon. Use Apptainer or Pyxis instead where possible.

Docker is available on worker nodes. You can use it to run containers, build images, or pull from a registry within a job allocation.

#### Running a container

```bash
srun -N 1 --gpus-per-node=8 --pty bash -l

# Inside the allocation:
docker run --rm --gpus all rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.10.0 \
  python -c "import torch; print(torch.cuda.device_count())"
```

#### Building an image

If you need to build a custom image during a job, allocate a node and run the build from there:

```bash
srun -N 1 --pty bash

# Inside the allocation:
docker build -t my-org/my-image:latest -f Dockerfile .
docker push my-org/my-image:latest
```

#### Using a Docker image with Apptainer

Docker images can be consumed directly by Apptainer without running the Docker daemon at all, using the `docker://` URI:

```bash
srun apptainer exec docker://rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.10.0 \
  python train.py
```

This is the preferred pattern for job submission since it runs as the submitting user and integrates with Slurm resource accounting.

***

### Command Comparison

| Operation                     | Apptainer                                 | Pyxis (srun flags)                                               | Docker                                                        |
| ----------------------------- | ----------------------------------------- | ---------------------------------------------------------------- | ------------------------------------------------------------- |
| **Launch a batch job**        | `srun apptainer exec <img> <cmd>`         | `srun --container-image=<img> <cmd>`                             | `srun docker run --rm <img> <cmd>`                            |
| **Get an interactive shell**  | `srun --pty apptainer shell <img>`        | `srun --pty --container-image=<img> bash`                        | `srun --pty docker run -it --rm --entrypoint /bin/bash <img>` |
| **Download an image to disk** | `apptainer pull <dst>.sif docker://<img>` | `srun --container-image=<img> --container-save=<name>.sqsh true` | (closest equivalent) `docker pull <img>`                      |
| **Volume mount**              | `--bind <src>[:<dst>]`                    | `--container-mounts=<src>:<dst>`                                 | `-v <src>:<dst>`                                              |
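
The volume mount row can be sketched as a small helper for scripts that need to target more than one runtime. The function name `mount_flag` and the runtime labels are hypothetical; the flags it prints are the ones listed in the table.

```bash
# Hypothetical helper: print the volume-mount flag for a given container runtime.
# If no destination is given, the source path is reused inside the container.
mount_flag() {
  local runtime="$1"
  local src="$2"
  local dst="${3:-$2}"
  case "$runtime" in
    apptainer) echo "--bind $src:$dst" ;;               # goes after `apptainer exec` / `apptainer shell`
    pyxis)     echo "--container-mounts=$src:$dst" ;;   # goes on the `srun` command line
    docker)    echo "-v $src:$dst" ;;                   # goes after `docker run`
    *)         echo "unknown runtime: $runtime" >&2; return 1 ;;
  esac
}

# Example:
#   mount_flag apptainer /data /mnt/data
```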

