# Running Containerized Jobs in Pyxis

This integration lets you launch distributed AI or HPC jobs inside optimized ROCm containers while maintaining full GPU, RDMA, and filesystem performance. Containers provide environment isolation, ensure consistent environments across nodes, simplify dependency management, and let you reproduce results reliably.

***

#### **Running Your First Containerized Job**

In this example, you’ll run a **PyTorch matmul test** using Pyxis.\
This verifies that your containerized environment can access GPUs, RDMA interfaces, and Slurm’s MPI orchestration.

1. Create a Python script to measure matmul performance. Copy the following code block into a file named `torch_matmul.py`.

   <pre class="language-python" data-title="torch_matmul.py" data-line-numbers data-expandable="true"><code class="lang-python">import torch

   # ROCm builds of PyTorch expose AMD GPUs through the "cuda" device string.
   device = torch.device("cuda:0")
   dtype = torch.float16
   torch.set_default_device(device)

   print(f"Device: {torch.cuda.get_device_name(device)}")

   sizes = [1024, 2048, 4096, 8192]
   iters = 50

   for n in sizes:
       a = torch.randn((n, n), dtype=dtype)
       b = torch.randn((n, n), dtype=dtype)

       start = torch.cuda.Event(enable_timing=True)
       end = torch.cuda.Event(enable_timing=True)

       # warmup
       for _ in range(2):
           torch.matmul(a, b)
       torch.cuda.synchronize()

       start.record()
       for _ in range(iters):
           c = torch.matmul(a, b)
       end.record()
       torch.cuda.synchronize()

       elapsed_ms = start.elapsed_time(end)
       elapsed_s = elapsed_ms / 1e3

       # FLOPs for matmul ≈ 2 * n^3
       total_flops = 2 * n**3 * iters
       tflops = total_flops / elapsed_s / 1e12

       print(f"n={n:5d}  {tflops:6.2f} TFLOPs")

   </code></pre>
2. **Create a new job script** named `torch-matmul-pyxis.sbatch`:

   <pre class="language-bash" data-title="torch-matmul-pyxis.sbatch" data-line-numbers><code class="lang-bash">#!/bin/bash
   #SBATCH --job-name=torch_matmul
   #SBATCH --output=jid-%j.name-%x.log
   #SBATCH --gpus-per-node=8
   #SBATCH -N1

   # Script created in step 1.
   MATMUL_PY="$PWD/torch_matmul.py" 
   # pytorch-rocm image from Docker Hub, published by AMD
   CONTAINER_IMAGE='rocm/pytorch:rocm7.1.1_ubuntu22.04_py3.10_pytorch_release_2.9.1'
   CONTAINER_NAME="pytorch_matmul_test"

   # Download the image and instantiate the container
   srun --container-name=$CONTAINER_NAME --container-image=$CONTAINER_IMAGE true

   # Run the benchmark
   srun --container-writable \
     --container-name=$CONTAINER_NAME \
     --container-mounts="$MATMUL_PY:/root/torch_matmul.py" \
     /opt/venv/bin/python /root/torch_matmul.py

   # Save the image to disk for use later
   srun --container-name=$CONTAINER_NAME \
     --container-save=$PWD/torch-matmul.sqsh \
     true
   </code></pre>
3. **Submit the job.** Here's an example run:

   ```shellscript
   $ sbatch torch-matmul-pyxis.sbatch
   Submitted batch job 90
   $ tail -f jid-90.name-torch_matmul.log
   pyxis: importing docker image: rocm/pytorch:rocm7.1.1_ubuntu22.04_py3.10_pytorch_release_2.9.1
   pyxis: imported docker image: rocm/pytorch:rocm7.1.1_ubuntu22.04_py3.10_pytorch_release_2.9.1
   Device: AMD Instinct MI325X
   n= 1024  144.42 TFLOPs
   n= 2048  466.27 TFLOPs
   n= 4096  640.01 TFLOPs
   n= 8192  763.36 TFLOPs
   pyxis: exported container pyxis_90_pytorch_matmul_test to /home/bkitor@tensorwave.com/snpyxis/torch-matmul.sqsh
   ^C
   $ ls
   jid-90.name-torch_matmul.log  torch-matmul-pyxis.sbatch  torch-matmul.sqsh  torch_matmul.py
   ```
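
The TFLOP/s figures in the log come from the `2 * n^3` FLOP count used in `torch_matmul.py`. The same arithmetic can be checked without a GPU; this sketch simply mirrors the script's throughput calculation:

```python
# Mirror the throughput arithmetic from torch_matmul.py (no GPU needed).
def matmul_tflops(n: int, iters: int, elapsed_s: float) -> float:
    """TFLOP/s for `iters` n-by-n matmuls taking `elapsed_s` seconds total."""
    total_flops = 2 * n**3 * iters  # ~2*n^3 FLOPs per square matmul
    return total_flops / elapsed_s / 1e12

# Example: 50 matmuls at n=8192 finishing in 3.6 s
print(f"{matmul_tflops(8192, 50, 3.6):.2f} TFLOP/s")  # → 15.27 TFLOP/s
```

A run well below the figures in the log above usually points at a configuration problem (wrong GPU visibility, thermal throttling, or a CPU-only PyTorch build).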

To highlight some of the key features of the `torch-matmul-pyxis.sbatch` file:

* Container images are automatically pulled from Docker Hub
* Named containers persist across `srun` invocations:
  * On line 14, the `"pytorch_matmul_test"` container is created.
  * On line 17, the container is invoked to run the benchmark.
  * On line 23, the container is saved to disk for reuse later.
* On line 19, the `--container-mounts` flag bridges data into and out of the container.
* On line 24, the `--container-save` flag writes the `"pytorch_matmul_test"` container to a squashfs file.
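
Named containers can also be entered interactively, which is handy for debugging. A minimal sketch, assuming the `pytorch_matmul_test` container from the job above still exists on the allocated node:

```shell
# Open an interactive shell inside the named container.
# --pty allocates a pseudo-terminal; the name must match the earlier srun.
srun --container-name=pytorch_matmul_test --pty bash
```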

***

#### **Using a Pre-Staged SquashFS Image**

Instead of pulling a container from a registry, you can point Slurm directly to a **pre-staged SquashFS (`.sqsh`) image**.\
This is often faster and preferred for large models or shared environments, as using a local `.sqsh` file avoids repeated network pulls and ensures consistent environments across jobs.

To modify the `torch-matmul-pyxis.sbatch` example to use a squashfs file, modify the following line:

```bash
CONTAINER_IMAGE='<filename>.sqsh'
```

Squashfs files can be generated with the `--container-save` flag or created directly with `enroot`. See the [Managing Container Images with Enroot](/slurm/managing-container-images-with-enroot.md) page for more information on generating `.sqsh` images.
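
For example, `enroot import` can pre-stage the same PyTorch image as a `.sqsh` file. A sketch, assuming a node with enroot installed and access to Docker Hub (the output filename is illustrative):

```shell
# Pull the image from Docker Hub and write it as a local squashfs file.
# -o sets the output filename; otherwise enroot derives one from the URL.
enroot import -o pytorch-rocm.sqsh \
  docker://rocm/pytorch:rocm7.1.1_ubuntu22.04_py3.10_pytorch_release_2.9.1
```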

***

#### **Pyxis Flags**

Pyxis extends Slurm with several container-related flags that control how your job interacts with the container environment. Below are the most commonly used options:

| Flag                                         | Description                                                                                                 |
| -------------------------------------------- | ----------------------------------------------------------------------------------------------------------- |
| `--container-image`                          | Specifies the container to run. Accepts Docker/OCI URLs or local `.sqsh` images.                            |
| `--container-writable`                       | Makes the container filesystem writable during execution. Useful for logs, checkpoints, or temporary files. |
| `--container-mounts=/src:/dst[,/src2:/dst2]` | Binds local or shared directories into the container. Multiple mounts can be separated by commas.           |
| `--container-workdir=/path`                  | Sets the working directory inside the container (defaults to `/`).                                          |
| `--container-name=<name>`                    | Assigns a name to the running container instance, useful for debugging or monitoring.                       |
| `--container-save=PATH`                      | Saves the container state to a squashfs file on the remote host filesystem.                                 |
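
These flags compose on a single `srun` invocation. A sketch combining several of them (the image tag, mount paths, and script name are illustrative, not part of the example above):

```shell
# Run a script from a bind-mounted project directory, with a writable
# container filesystem and the working directory set inside the container.
srun --container-image=rocm/pytorch:latest \
  --container-writable \
  --container-mounts="$HOME/project:/workspace" \
  --container-workdir=/workspace \
  python train.py
```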

***

#### **Learn More**

For advanced configuration options and the full list of supported flags, see the official containers documentation from SchedMD:\
<https://slurm.schedmd.com/containers.html>

