# Running Jobs With Modules

Software modules are provided for commonly used packages.

#### Modules Quickstart

As an example, we provide a module for [Hugging Face's Transformer Reinforcement Learning (TRL)](https://github.com/huggingface/trl/tree/v0.28.0) package, so getting a working TRL environment is as easy as `module load trl`. We also provide a sample sbatch script that uses the TRL module, `/opt/examples/libexec/trl-module.sbatch`:

{% code title="/opt/examples/libexec/trl-module.sbatch" lineNumbers="true" expandable="true" %}

```bash
#!/usr/bin/bash
#SBATCH --job-name=trl-finetuner
#SBATCH --output=jid-%j.name-%x.log
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00
#SBATCH --nodes=4

set -exuo pipefail

module load trl

TRLFT_PY="/opt/examples/scripts/trl_tune/trl_tune.py"

GPUS_PER_NODE=8
MASTER_ADDR=$(hostname)
MASTER_PORT=6000

if [[ -d $HOME/.cache/huggingface/datasets/mlech26l___shell-helper ]]; then
  HF_OFFLINE=1  # dataset already cached locally; run without Hub access
else
  HF_OFFLINE=0
fi

srun bash <<EOF

export MIOPEN_CUSTOM_CACHE_DIR=/tmp/miopen-cache
export MIOPEN_USER_DB_PATH=/tmp/miopen-user-db

export HF_DATASETS_OFFLINE=${HF_OFFLINE}
export HF_HUB_DISABLE_PROGRESS_BARS=1

# The node matching MASTER_ADDR hosts the c10d rendezvous and connects
# to itself via localhost; all other nodes connect to MASTER_ADDR.
MY_IP=\$(hostname)
IS_HOST=0
REMOTE_ADDR=${MASTER_ADDR}
if [ "\$MY_IP" == "${MASTER_ADDR}" ]; then
  IS_HOST=1
  REMOTE_ADDR=localhost
fi

export OMP_NUM_THREADS=8

echo "MY IP is \$MY_IP and am I host? \$IS_HOST and what is master addr ${MASTER_ADDR}"
USE_ROCM=1

python -u -m torch.distributed.run \
 --nproc_per_node $GPUS_PER_NODE \
 --nnodes $SLURM_NNODES \
 --rdzv_endpoint \${REMOTE_ADDR}:${MASTER_PORT} \
 --rdzv_backend c10d \
 --max_restarts 0 \
 --rdzv_id=1 \
 --rdzv_conf=is_host=\$IS_HOST \
 --local_addr "\$(hostname)" \
 $TRLFT_PY

EOF
```

{% endcode %}

Line 12 (`module load trl`) loads the TRL software, and the resulting environment persists into subsequent sub-shells, such as the `srun bash` heredoc on line 26.
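Note how the heredoc passed to `srun bash` is unquoted: plain `$VAR` references (e.g. `${MASTER_ADDR}`) are expanded by the submitting shell before the tasks launch, while escaped `\$VAR` references (e.g. `\$(hostname)`) survive into the heredoc and are expanded on each compute node. A minimal standalone sketch of the same pattern:

```bash
#!/usr/bin/env bash
# Unquoted heredoc: $VAR expands in the outer shell before the child runs,
# while \$VAR is passed through and expands inside the child shell.
SUBMIT_VAL="set-on-submit-node"

bash <<EOF
RUNTIME_VAL="set-inside-heredoc"
echo "outer: ${SUBMIT_VAL}"   # substituted by the outer shell before the child runs
echo "inner: \${RUNTIME_VAL}" # left intact for the child shell to expand
EOF
```

This is why the sbatch script can mix submit-time values (`MASTER_ADDR`, `MASTER_PORT`, `HF_OFFLINE`) with per-node values (`MY_IP`, `IS_HOST`) inside a single heredoc.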

You can test-run the script with `sbatch /opt/examples/libexec/trl-module.sbatch`. It runs across 4 nodes by default, but you can scale it with `sbatch --nodes <num-nodes> /path/to/script.sbatch`.
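With the defaults above, torchrun launches one rank per GPU on every node, so the world size reported in the sample log (`Rank: 0 out of 32`) follows directly from the sbatch header. A quick sanity check of that arithmetic:

```python
# One torchrun rank per GPU: world size = nodes * GPUs per node.
nodes = 4          # #SBATCH --nodes=4
gpus_per_node = 8  # #SBATCH --gpus-per-node=8 (GPUS_PER_NODE in the script)
world_size = nodes * gpus_per_node
print(world_size)  # 32, matching "Rank: 0 out of 32" in the sample output
```

Scaling with `--nodes` changes the world size accordingly, with no other script edits needed.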

{% code expandable="true" %}

```
$ sbatch /opt/examples/libexec/trl-module.sbatch
Submitted batch job 268
$ tail -f jid-268.name-trl-finetuner.log
++ hostname
+ MASTER_ADDR=tus1-p13-g2
+ MASTER_PORT=6000
+ [[ -d /home/tensorwave/.cache/huggingface/datasets/mlech26l___shell-helper ]]
+ HF_OFFLINE=1
+ srun bash
MY IP is tus1-p13-g2 and am I host? 1 and what is master addr tus1-p13-g2
MY IP is tus1-p14-g24 and am I host? 0 and what is master addr tus1-p13-g2
MY IP is tus1-p14-g37 and am I host? 0 and what is master addr tus1-p13-g2
MY IP is tus1-p16-g17 and am I host? 0 and what is master addr tus1-p13-g2
Rank: 0 out of 32
Number of gpus available: 8
  GPU 0: AMD Instinct MI325X
  GPU 1: AMD Instinct MI325X
  GPU 2: AMD Instinct MI325X
  GPU 3: AMD Instinct MI325X
  GPU 4: AMD Instinct MI325X
  GPU 5: AMD Instinct MI325X
  GPU 6: AMD Instinct MI325X
  GPU 7: AMD Instinct MI325X
Loading model: LiquidAi/LFM2.5-1.2B-Instruct

...

Found the latest cached dataset configuration 'default' at /home/tensorwave/.cache/huggingface/datasets/mlech26l___shell-helper/default/0.0.0/bf4e04b465240544350f49c89cd108c35698f588 (last modified on Sat Feb 14 08:59:42 2026).
Launching training
{'loss': '1.294', 'grad_norm': '0.9996', 'learning_rate': '1.85e-05', 'entropy': '1.288', 'num_tokens': '8.023e+06', 'mean_token_accuracy': '0.7068', 'epoch': '0.3067'}
{'loss': '1.294', 'grad_norm': '0.9996', 'learning_rate': '1.85e-05', 'entropy': '1.288', 'num_tokens': '8.023e+06', 'mean_token_accuracy': '0.7068', 'epoch': '0.3067'}
{'loss': '1.294', 'grad_norm': '0.9996', 'learning_rate': '1.85e-05', 'entropy': '1.288', 'num_tokens': '8.023e+06', 'mean_token_accuracy': '0.7068', 'epoch': '0.3067'}
{'loss': '1.294', 'grad_norm': '0.9996', 'learning_rate': '1.85e-05', 'entropy': '1.288', 'num_tokens': '8.023e+06', 'mean_token_accuracy': '0.7068', 'epoch': '0.3067'}
{'loss': '0.9138', 'grad_norm': '0.837', 'learning_rate': '1.696e-05', 'entropy': '0.9295', 'num_tokens': '1.604e+07', 'mean_token_accuracy': '0.7676', 'epoch': '0.6135'}
{'loss': '0.9138', 'grad_norm': '0.837', 'learning_rate': '1.696e-05', 'entropy': '0.9295', 'num_tokens': '1.604e+07', 'mean_token_accuracy': '0.7676', 'epoch': '0.6135'}
{'loss': '0.9138', 'grad_norm': '0.837', 'learning_rate': '1.696e-05', 'entropy': '0.9295', 'num_tokens': '1.604e+07', 'mean_token_accuracy': '0.7676', 'epoch': '0.6135'}
{'loss': '0.9138', 'grad_norm': '0.837', 'learning_rate': '1.696e-05', 'entropy': '0.9295', 'num_tokens': '1.604e+07', 'mean_token_accuracy': '0.7676', 'epoch': '0.6135'}
```

{% endcode %}

#### Module Management

To explore available modules, run `module avail` or `module spider`. `avail` prints a concise list of the modules you can load right now, while `spider` searches every module on the system and reports versions and prerequisites, making it the better tool for sorting out module dependencies.
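As a sketch, a typical exploration session on a login node might look like the following (these are standard Lmod subcommands; only `trl` is a module we know to exist here):

```bash
module avail        # concise table of modules loadable right now
module spider trl   # detailed description, versions, and prerequisites
module list         # modules currently loaded in this shell
module unload trl   # drop one module from the environment
module purge        # unload everything
```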

If you need a specific piece of software, contact us and we can provide a module that fits your needs.

#### Resources

Lmod user guide: <https://lmod.readthedocs.io/en/latest/010_user.html>
