Pyxis Quickstart
When you deploy Slurm on any Tensorwave cluster, you get Pyxis support out of the box. This introduction shows you how to run a quick multi-node MPI example.
Pyxis lets you run Slurm jobs inside containers. Tensorwave infrastructure removes the hassle of configuring AMD GPU nodes to properly launch multi-node training and other workloads. Assuming you have two nodes you'd like to run RCCL tests on, you may run the following command:
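A minimal sketch of such a launch, not a definitive command: the image reference, the path to the rccl-tests binary inside the container, the count of eight GPUs per node, and PMIx support in your Slurm build are all assumptions to replace with your own values. The script only prints the assembled command so you can review it before launching.

```shell
#!/usr/bin/env bash
# Sketch: two-node RCCL all-reduce test through Pyxis.
# ASSUMPTIONS (not from this guide): image reference, binary path,
# 8 GPUs per node, and PMIx support in the Slurm installation.
IMAGE="docker.io#example/rccl-tests:latest"        # hypothetical public image
TEST_BIN="/opt/rccl-tests/build/all_reduce_perf"   # hypothetical path inside the image
NODES=2
GPUS_PER_NODE=8

cmd=(
  srun
  --nodes="$NODES"
  --ntasks-per-node="$GPUS_PER_NODE"   # one MPI rank per GPU
  --mpi=pmix
  --container-image="$IMAGE"           # Pyxis flag: image the job runs in
  "$TEST_BIN" -b 8 -e 1G -f 2 -g 1     # sweep 8 B to 1 GB, doubling each step
)

# Print the command for review.
printf '%s\n' "${cmd[*]}"

# Uncomment to launch on the cluster:
# "${cmd[@]}"
```

Pyxis takes image references in the form [USER@][REGISTRY#]IMAGE[:TAG], which is why the registry and image are separated by a # rather than a slash.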
Of course, this assumes that you want to use a publicly accessible image.
Using a Private Container Registry
To use a private container registry such as ECR, you first need to create enroot credentials. Below is a script that gives such an example; you must modify it to work for your needs:
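As a rough sketch of what such a script might do: enroot reads registry credentials from a netrc-style file, by default $ENROOT_CONFIG_PATH/.credentials (typically ~/.config/enroot/.credentials). The helper name write_enroot_credentials is made up for illustration, and the account ID, region, and use of the AWS CLI in the commented usage are placeholder assumptions.

```shell
#!/usr/bin/env bash
# Sketch: append a netrc-style credentials entry for enroot.
# write_enroot_credentials is a hypothetical helper, not part of enroot or Pyxis.
write_enroot_credentials() {
  local registry="$1" user="$2" password="$3"
  local config_dir="${ENROOT_CONFIG_PATH:-${XDG_CONFIG_HOME:-$HOME/.config}/enroot}"
  local file="${4:-$config_dir/.credentials}"   # optional 4th argument overrides the path

  mkdir -p "$(dirname "$file")"
  # Create the file with owner-only permissions, since it holds a secret.
  ( umask 077; touch "$file" )
  chmod 600 "$file"
  printf 'machine %s login %s password %s\n' "$registry" "$user" "$password" >> "$file"
}

# Example for Amazon ECR (placeholder account ID and region;
# assumes the AWS CLI is installed and configured):
# write_enroot_credentials \
#   "123456789012.dkr.ecr.us-east-1.amazonaws.com" \
#   "AWS" \
#   "$(aws ecr get-login-password --region us-east-1)"
```

The helper appends rather than overwrites, so remove stale entries for the same registry before adding a new one. ECR passwords are short-lived tokens and need to be refreshed periodically.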
Once you have your credentials, the next step is to build a .sqsh file from the image you want to import like so:
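A minimal sketch of the import step, assuming a hypothetical registry host and image name; the output path is the example path used in this guide. The script prints the enroot command instead of running it, so you can check the URI first.

```shell
#!/usr/bin/env bash
# Sketch: import a registry image into a squashfs (.sqsh) file with enroot.
# REGISTRY and IMAGE are placeholders, not values from this guide.
REGISTRY="registry.example.com"          # hypothetical private registry host
IMAGE="team/rccl-tests:latest"           # hypothetical repository and tag
OUTPUT="/tmp/url_path/to/image.sqsh"     # example path used in this guide

# enroot expects docker://[USER@][REGISTRY#]IMAGE[:TAG]
URI="docker://${REGISTRY}#${IMAGE}"

cmd=(enroot import --output "$OUTPUT" "$URI")
printf '%s\n' "${cmd[*]}"

# Uncomment to run the import (creates the output directory first):
# mkdir -p "$(dirname "$OUTPUT")" && "${cmd[@]}"
```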
You do not need to take the paths we use literally; you can save the .sqsh file anywhere on the system. Your home directory works too. We chose /tmp only because its contents are usually safe to delete.
Once you've imported your image to /tmp/url_path/to/image.sqsh, you can run your Slurm job. We are going to assume that this is also a multi-node RCCL test, so the command to run the work is:
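A minimal sketch of that launch, assuming the same placeholder binary path, eight GPUs per node, and PMIx support; only the .sqsh path comes from this guide. Pyxis accepts a local squashfs path in --container-image in place of a registry reference.

```shell
#!/usr/bin/env bash
# Sketch: two-node RCCL test run from the imported .sqsh file.
# ASSUMPTIONS: binary path inside the image, 8 GPUs per node, PMIx support.
SQSH="/tmp/url_path/to/image.sqsh"
TEST_BIN="/opt/rccl-tests/build/all_reduce_perf"   # hypothetical path inside the image

cmd=(
  srun --nodes=2 --ntasks-per-node=8 --mpi=pmix
  --container-image="$SQSH"            # local squashfs file instead of a registry image
  "$TEST_BIN" -b 8 -e 1G -f 2 -g 1
)
printf '%s\n' "${cmd[*]}"

# Uncomment to launch on the cluster:
# "${cmd[@]}"
```

If /tmp is node-local on your cluster, the .sqsh file has to exist at the same path on every node in the job, or be placed on shared storage instead.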
Mounting Directories in Pyxis
If you want to mount your work directory, all you need to do is add the --container-mounts= flag to your srun command. There can be only one --container-mounts= flag, where you may mount multiple paths separated by a comma like so:
--container-mounts=/usr/local:/usr/local,/example/path/one:/path/one
In general, you can get a list of all available flags via srun --help. Further, the file system mounted in Pyxis is often read-only, so you may run into issues unless you add the --container-writable flag. An example of mounting multiple directories is shown here:
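A minimal sketch, assuming the two mount pairs from the example above and the .sqsh path used earlier in this guide; the ls command is only a placeholder workload. The script joins the SRC:DST pairs into a single comma-separated value, since only one --container-mounts= flag is allowed.

```shell
#!/usr/bin/env bash
# Sketch: join several SRC:DST pairs into one --container-mounts value.
# The workload (ls) is a placeholder; swap in your own command.
MOUNTS=(
  "/usr/local:/usr/local"
  "/example/path/one:/path/one"
)

# Join the array elements with commas (IFS change is confined to the subshell).
MOUNT_ARG="$(IFS=,; printf '%s' "${MOUNTS[*]}")"

cmd=(
  srun --nodes=2
  --container-image="/tmp/url_path/to/image.sqsh"
  --container-mounts="$MOUNT_ARG"
  --container-writable                 # make the container file system writable
  ls /path/one
)
printf '%s\n' "${cmd[*]}"

# Uncomment to launch on the cluster:
# "${cmd[@]}"
```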