Running Jobs With Modules
Software modules are provided for commonly used software packages.
As an example, we provide a module for Hugging Face's Transformer Reinforcement Learning (TRL) package, so getting a working TRL environment is as easy as module load trl. We also provide a sample sbatch script that uses the TRL module, /opt/examples/libexec/trl-module.sbatch:
The module load trl line loads the TRL software, and the loaded environment persists through subsequent sub-shell calls, such as the srun bash heredoc later in the script.
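Before submitting a job, you can sanity-check the module interactively in a login shell. A minimal sketch, assuming the module exposes the upstream trl Python package under its usual import name:

# Load the TRL environment, then confirm the package is importable
module load trl
python -c "import trl; print(trl.__version__)"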
You can test-run the script with sbatch /opt/examples/libexec/trl-module.sbatch. It runs across 4 nodes by default, but you can scale it with sbatch --nodes=<num-nodes> /path/to/script.sbatch (command-line options override the #SBATCH directives in the script).
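The total number of ranks is the node count times the GPUs per node, which is why the sample log below reports "Rank: 0 out of 32" for the 4-node default. A quick sanity check of the world size for any node count (the 8-GPU figure comes from the script's --gpus-per-node=8):

# World size = nodes × GPUs per node
NODES=4          # value passed via --nodes (script default)
GPUS_PER_NODE=8  # matches #SBATCH --gpus-per-node=8
echo $(( NODES * GPUS_PER_NODE ))   # prints 32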
To explore available modules, run module avail or module spider. module avail gives a compact listing of what can be loaded right now, while module spider is more detailed and useful for sorting out module dependencies.
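For example (a sketch of common Lmod usage; the module names you see will vary by system):

module avail        # compact list of modules loadable right now
module spider       # full list, including modules behind dependencies
module spider trl   # details and prerequisites for a specific module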
If you need a specific piece of software, contact us and we can provide a module that fits your needs.
Lmod user guide: https://lmod.readthedocs.io/en/latest/010_user.html
#!/usr/bin/bash
#SBATCH --job-name=trl-finetuner
#SBATCH --output=jid-%j.name-%x.log
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00
#SBATCH --nodes=4
set -exuo pipefail
module load trl
# Path to the sample fine-tuning script shipped with the module examples
TRLFT_PY="/opt/examples/scripts/trl_tune/trl_tune.py"
GPUS_PER_NODE=8
# The node running this batch script acts as the rendezvous host
MASTER_ADDR=$(hostname)
MASTER_PORT=6000
# Run offline if the dataset is already in the local Hugging Face cache
if [[ -d $HOME/.cache/huggingface/datasets/mlech26l___shell-helper ]]; then
HF_OFFLINE=1
else
HF_OFFLINE=0
fi
# Launch one task per node; the heredoc below runs on every node
srun bash <<EOF
# Keep MIOpen kernel caches on node-local storage
export MIOPEN_CUSTOM_CACHE_DIR=/tmp/miopen-cache
export MIOPEN_USER_DB_PATH=/tmp/miopen-user-db
export HF_DATASETS_OFFLINE=${HF_OFFLINE}
export HF_HUB_DISABLE_PROGRESS_BARS=1
MY_IP=\$(hostname)
IS_HOST=0
REMOTE_ADDR=${MASTER_ADDR}
if [ "\$MY_IP" == "${MASTER_ADDR}" ];then
IS_HOST=1
REMOTE_ADDR=localhost
fi
export OMP_NUM_THREADS=8
echo "MY IP is \$MY_IP and am I host? \$IS_HOST and what is master addr ${MASTER_ADDR}"
USE_ROCM=1
python -u -m torch.distributed.run \
--nproc_per_node $GPUS_PER_NODE \
--nnodes $SLURM_NNODES \
--rdzv_endpoint \${REMOTE_ADDR}:${MASTER_PORT} \
--rdzv_backend c10d \
--max_restarts 0 \
--rdzv_id=1 \
--rdzv_conf=is_host=\$IS_HOST \
--local_addr "\$(hostname)" \
$TRLFT_PY
EOF

$ sbatch /opt/examples/libexec/trl-module.sbatch
Submitted batch job 268
$ tail -f jid-268.name-trl-finetuner.log
++ hostname
+ MASTER_ADDR=tus1-p13-g2
+ MASTER_PORT=6000
+ [[ -d /home/tensorwave/.cache/huggingface/datasets/mlech26l___shell-helper ]]
+ HF_OFFLINE=1
+ srun bash
MY IP is tus1-p13-g2 and am I host? 1 and what is master addr tus1-p13-g2
MY IP is tus1-p14-g24 and am I host? 0 and what is master addr tus1-p13-g2
MY IP is tus1-p14-g37 and am I host? 0 and what is master addr tus1-p13-g2
MY IP is tus1-p16-g17 and am I host? 0 and what is master addr tus1-p13-g2
Rank: 0 out of 32
Number of gpus available: 8
GPU 0: AMD Instinct MI325X
GPU 1: AMD Instinct MI325X
GPU 2: AMD Instinct MI325X
GPU 3: AMD Instinct MI325X
GPU 4: AMD Instinct MI325X
GPU 5: AMD Instinct MI325X
GPU 6: AMD Instinct MI325X
GPU 7: AMD Instinct MI325X
Loading model: LiquidAi/LFM2.5-1.2B-Instruct
...
Found the latest cached dataset configuration 'default' at /home/tensorwave/.cache/huggingface/datasets/mlech26l___shell-helper/default/0.0.0/bf4e04b465240544350f49c89cd108c35698f588 (last modified on Sat Feb 14 08:59:42 2026).
Launching training
{'loss': '1.294', 'grad_norm': '0.9996', 'learning_rate': '1.85e-05', 'entropy': '1.288', 'num_tokens': '8.023e+06', 'mean_token_accuracy': '0.7068', 'epoch': '0.3067'}
{'loss': '1.294', 'grad_norm': '0.9996', 'learning_rate': '1.85e-05', 'entropy': '1.288', 'num_tokens': '8.023e+06', 'mean_token_accuracy': '0.7068', 'epoch': '0.3067'}
{'loss': '1.294', 'grad_norm': '0.9996', 'learning_rate': '1.85e-05', 'entropy': '1.288', 'num_tokens': '8.023e+06', 'mean_token_accuracy': '0.7068', 'epoch': '0.3067'}
{'loss': '1.294', 'grad_norm': '0.9996', 'learning_rate': '1.85e-05', 'entropy': '1.288', 'num_tokens': '8.023e+06', 'mean_token_accuracy': '0.7068', 'epoch': '0.3067'}
{'loss': '0.9138', 'grad_norm': '0.837', 'learning_rate': '1.696e-05', 'entropy': '0.9295', 'num_tokens': '1.604e+07', 'mean_token_accuracy': '0.7676', 'epoch': '0.6135'}
{'loss': '0.9138', 'grad_norm': '0.837', 'learning_rate': '1.696e-05', 'entropy': '0.9295', 'num_tokens': '1.604e+07', 'mean_token_accuracy': '0.7676', 'epoch': '0.6135'}
{'loss': '0.9138', 'grad_norm': '0.837', 'learning_rate': '1.696e-05', 'entropy': '0.9295', 'num_tokens': '1.604e+07', 'mean_token_accuracy': '0.7676', 'epoch': '0.6135'}
{'loss': '0.9138', 'grad_norm': '0.837', 'learning_rate': '1.696e-05', 'entropy': '0.9295', 'num_tokens': '1.604e+07', 'mean_token_accuracy': '0.7676', 'epoch': '0.6135'}
