
Slurm Quickstart

Slurm provides a multi-tenant framework for managing compute resources and scheduling jobs that span multiple nodes on HPC clusters.

What Problem Does Slurm Solve?

Imagine you're a company managing two or more nodes, possibly a cluster with thousands of GPU-accelerated nodes. You will likely have multiple teams working on the same cluster, and manually allocating shared resources in a safe and efficient manner for your team's HPC/machine learning work, while coordinating with other teams competing for the same resources, quickly becomes infeasible.

Take a specific example: a 1024-node cluster with 8 GPUs per node (8192 GPUs total) shared by several teams, where one team may want to run small preliminary training runs using only 4 nodes with 8 GPUs per node.

What's Involved In Scheduling?

If this process were not automated, you'd need to coordinate with the other teams so they don't run workloads on the nodes and GPUs you are currently using. Further, you'd have to manually set environment variables on every process so that traffic goes over the proper RDMA networks and reaches the right nodes involved in the workload, and you'd have to place your job on nodes that are topologically close to each other.

At large scale this is not practical, so the industry has turned to solutions like Slurm to automate the entire process.
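When Slurm launches a job, it picks the nodes and exports the allocation details to every task through environment variables. As a minimal sketch (the variable names are standard Slurm, but the two-node command itself is just illustrative), you could print a few of them from a job step:

tensorwave@headnode-cluster-001:~$ srun -N 2 --ntasks-per-node=1 bash -c 'echo "task $SLURM_PROCID of $SLURM_NTASKS on $SLURMD_NODENAME, nodelist: $SLURM_JOB_NODELIST"'

Each task reports its rank, the total task count, the node it landed on, and the full node list, which is exactly the bookkeeping you would otherwise have to wire up by hand.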

What Does A Simple Multi-Node Workload Look Like?

Let's say we want to run a quick multi-node RCCL test that demonstrates proper GPU + RDMA backend functionality on a 1024-node cluster with 8 GPUs per node. We will configure a job that uses only 4 nodes and 8 GPUs per node, for a total of 32 GPUs.

First, you will be assigned a "head node" where the Slurm control daemon runs; this is where you will run all of your Slurm commands.

# First query the machines available to the cluster
tensorwave@headnode-cluster-001:~$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
tensorwave*    up   infinite   1024   idle compute-[0-1023]
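sinfo gives a cluster-wide summary; if you want to inspect an individual node's resources (CPUs, memory, GPU GRES, state), scontrol shows the full record. The node name below is just taken from the example output above:

tensorwave@headnode-cluster-001:~$ scontrol show node compute-0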

Now that you've logged into your head node and queried the cluster, we can run a simple workload that is available on all TensorWave Slurm clusters.

tensorwave@headnode-cluster-001:~$ srun -N 4 --mpi=pmix --ntasks-per-node=8 --gpus-per-task=1 /usr/local/bin/rccl-tests/build/all_reduce_perf -b 32 -e 8g -f 2 -g 1 
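For anything longer-running, you'll usually want to submit the same work as a batch job rather than an interactive srun. The script below is a minimal sketch of an equivalent sbatch submission; the job name, output path, and script filename are placeholders you can change:

#!/bin/bash
#SBATCH --job-name=rccl-allreduce
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-task=1
#SBATCH --output=rccl_%j.log

# Same RCCL all-reduce test, launched across the whole allocation
srun --mpi=pmix /usr/local/bin/rccl-tests/build/all_reduce_perf -b 32 -e 8g -f 2 -g 1

Submit it and monitor it from the head node:

tensorwave@headnode-cluster-001:~$ sbatch rccl_test.sbatch
tensorwave@headnode-cluster-001:~$ squeue -u $USER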

Containers and More Info

Often, users will want to run their HPC payloads in containers. You can learn more about this in the Pyxis Quickstart section.
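As a quick preview, with the Pyxis plugin installed srun gains container flags such as --container-image, so a job step can run inside a container image pulled from a registry. The image and command below are purely illustrative; the full workflow is covered in the Pyxis Quickstart:

tensorwave@headnode-cluster-001:~$ srun -N 1 --container-image=ubuntu:22.04 grep PRETTY_NAME /etc/os-release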
