Pyxis Quickstart

When you deploy Slurm on any TensorWave cluster, you get Pyxis support out of the box. This introduction shows you how to run a quick multi-node MPI example.

Pyxis lets you launch Slurm jobs inside containers. TensorWave's infrastructure removes the hassle of configuring AMD GPU nodes to properly launch multi-node training and other distributed work. Assuming you have two nodes you'd like to run the RCCL tests on, you can run the following command:

srun -N 2 --mpi=pmix --ntasks-per-node=8 --gres=gpu:amd:8 --container-image=tensorwavehq/mpi-pyxis:latest /usr/local/bin/rccl-tests/build/all_reduce_perf -b 16 -e 8g -f 2 -g 1
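
If you prefer batch submission, the same job can be written as an sbatch script. The sketch below simply mirrors the flags from the srun command above:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:amd:8

# Pyxis flags are passed to the srun invocation inside the batch script
srun --mpi=pmix --container-image=tensorwavehq/mpi-pyxis:latest \
    /usr/local/bin/rccl-tests/build/all_reduce_perf -b 16 -e 8g -f 2 -g 1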

Of course, this assumes that you want to use a publicly accessible image.

Using a Private Container Registry

To use a private container registry such as Amazon ECR, you first need to create enroot credentials. Below is an example script; you must modify it to fit your needs:

# The first machine line is for Amazon ECR.
# The second shows fetching a login token via a command; the third uses a static token/password.
# The here-doc delimiter is unquoted, so the $(...) substitutions and $VARS expand when the file is written.
sudo mkdir -p /etc/enroot
cat <<EOF > /tmp/enroot.credentials
machine 123456789012.dkr.ecr.us-west-2.amazonaws.com login AWS password $(aws ecr get-login-password --region us-west-2)
machine $SOME_OTHER_REGISTRY_ADDRESS login $PROVIDER password $($COMMAND_TO_GET_LOGIN_TOKEN)
machine $SOME_OTHER_REGISTRY_ADDRESS login $PROVIDER password $LOGIN_TOKEN_OR_PASSWORD
EOF
sudo mv /tmp/enroot.credentials /etc/enroot/.credentials
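
Note that ECR login tokens expire after 12 hours, so you may need to refresh this file periodically. A minimal refresh sketch, assuming the same ECR registry as above:

# Hypothetical refresh: rewrite the credentials file with a fresh ECR token
TOKEN=$(aws ecr get-login-password --region us-west-2)
sudo tee /etc/enroot/.credentials >/dev/null <<EOF
machine 123456789012.dkr.ecr.us-west-2.amazonaws.com login AWS password $TOKEN
EOF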

Once you have your credentials, the next step is to build a .sqsh file from the image you want to import like so:

rm -f /tmp/url_path/to/image.sqsh && enroot import --output /tmp/url_path/to/image.sqsh 'docker://123456789012.dkr.ecr.us-west-2.amazonaws.com/url_path/to/image'

You do not need to take these paths literally; you can write the .sqsh file anywhere on the system. Your home directory works fine if you prefer; we chose the tmp directory because it is usually safe to delete.
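
For repeatability, you might wrap the import in a small script. The variable names below are purely illustrative:

# Illustrative variables -- substitute your own registry, image, and output path
REGISTRY="123456789012.dkr.ecr.us-west-2.amazonaws.com"
IMAGE="url_path/to/image"
SQSH="/tmp/${IMAGE}.sqsh"

# Ensure the destination directory exists and no stale file blocks the import
mkdir -p "$(dirname "$SQSH")"
rm -f "$SQSH"
enroot import --output "$SQSH" "docker://${REGISTRY}/${IMAGE}"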

Once you've imported your image to /tmp/url_path/to/image.sqsh, you can run your Slurm job. Assuming this is also a multi-node RCCL test, the command is:

srun -N 2 --mpi=pmix --ntasks-per-node=8 --gres=gpu:amd:8 --container-image=/tmp/url_path/to/image.sqsh /usr/local/bin/rccl-tests/build/all_reduce_perf -b 16 -e 8g -f 2 -g 1

Mounting Directories in Pyxis

If you want to mount your work directory, all you need to do is add the --container-mounts= flag to your srun command. There can be only one --container-mounts= flag, but it accepts multiple paths separated by commas, like so:

--container-mounts=/usr/local:/usr/local,/example/path/one:/path/one
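
Each mount is written as source:destination. Pyxis also accepts an optional third flags field per mount, for example ro to make a mount read-only (check the pyxis documentation for the version installed on your cluster):

--container-mounts=/mnt/data:/data:ro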

In general, you can get a list of all available flags via srun --help. Further, the container filesystem in Pyxis is often read-only, so you may run into issues unless you add the --container-writable flag. An example of mounting multiple directories:

# Bind-mount your home directory and shared storage into the container
export WORK_PROJ="/home/myuser:/home/myuser"
export OUTPUT="/mnt/weka:/mnt/weka"
srun -N 2 --mpi=pmix --ntasks-per-node=8 --gres=gpu:amd:8 --verbose \
    --container-image=$SQSH_PATH/mpi.sqsh --container-writable \
    --container-mounts=${WORK_PROJ},${OUTPUT} my_training_command ...