
TensorWave Docs


Introduction to ROCm

ROCm (Radeon Open Compute) is an open-source GPU compute framework that enables developers to customize their GPU software and collaborate with other developers. It consists of a variety of drivers, development tools, and APIs that enable GPU programming from the low-level kernel up to end-user applications. ROCm can be deployed in several ways, including Docker containers, Spack packages, or your own build from source.

Learn more about ROCm here.
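As a quick example, if you'd rather try ROCm in a container than install it on the host, you can pull one of AMD's development images from Docker Hub (a minimal sketch; the image tag shown is an assumption and may not match the ROCm version you need):

docker pull rocm/dev-ubuntu-22.04
docker run -it --device /dev/kfd --device /dev/dri rocm/dev-ubuntu-22.04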

ROCm Quickstart Guide

Linux

Bare Metal Quickstart


Your node runs Linux (Ubuntu) and comes pre-loaded with tools to simplify your setup process.

Connecting to Your Node

When your bare metal node is ready, you will be provided its username and IP address. To connect to your node, use the following command on a device with one of the SSH keys:

This command will be your primary method of accessing and managing your node.


Node Basics

Your node comes with Ubuntu 22.04 LTS and ROCm. All SSH keys you initially provided grant root user access.
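Once connected, you can confirm what's preinstalled. A quick sketch (the version file below is the usual ROCm install convention, but its location may vary):

lsb_release -a
cat /opt/rocm/.info/version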

Adding More SSH Keys

In the event that you would like to provide more users SSH access to your node, follow these steps:

  1. Copy your key, which should be structured like:

  2. SSH into your node using:

  3. Open your authorized keys file using:

  4. Paste your key at the bottom of the file

  5. Save and exit

  6. Restart the sshd service using:

Monitoring Your GPUs

To get more information on your GPUs, run rocm-smi in your terminal. To monitor them continuously, you can run watch -n 0.5 rocm-smi, which refreshes the GPU IDs and usage information every 0.5 seconds, as shown below:

Downloading and Uploading Files

To download files from your server, use the scp command:

To upload files to your server, you may also use the scp command:

Accessing Remote Services Locally

Often, you will need to access a service exposed on your remote server from your local machine. To do so, use the following command:

For example, let's assume you want to access a Jupyter Notebook that you've exposed on port 8888. From your local command line, you'll want to use the following:

Then, you can access this in your browser at http://localhost:8888.


ssh [username]@[ip_address]
http://localhost:8888
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDxAZn... user@host
ssh [username]@[ip_address]
nano /home/[username]/.ssh/authorized_keys
sudo systemctl restart sshd
============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device  Node  IDs              Temp        Power     Partitions          SCLK    MCLK    Fan  Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Junction)  (Socket)  (Mem, Compute, ID)
==========================================================================================================================
0       2     0x74a1,   39334  45.0°C      142.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
1       3     0x74a1,   34119  42.0°C      135.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
2       4     0x74a1,   664    42.0°C      137.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
3       5     0x74a1,   33001  48.0°C      142.0W    NPS1, SPX, 0        154Mhz  900Mhz  0%   auto  750.0W  0%     0%
4       6     0x74a1,   15994  46.0°C      143.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
5       7     0x74a1,   63627  40.0°C      137.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
6       8     0x74a1,   33811  47.0°C      143.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
7       9     0x74a1,   41883  42.0°C      132.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
==========================================================================================================================
================================================== End of ROCm SMI Log ===================================================
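If you only need specific fields, rocm-smi also accepts targeted flags. For example (a brief sketch using standard ROCm SMI options):

rocm-smi --showmeminfo vram
rocm-smi --showtemp
rocm-smi --showuse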
# For individual files
scp [username]@[ip_address]:/path/to/remote/file /path/to/local/destination
# For directories
scp -r [username]@[ip_address]:/path/to/remote/directory /path/to/local/destination
# For individual files
scp /path/to/local/file [username]@[ip_address]:/path/to/remote/destination
# For directories
scp -r /path/to/local/directory [username]@[ip_address]:/path/to/remote/destination
ssh -L local_port:remote_host:remote_port [username]@[ip_address]
ssh -L 8888:localhost:8888 [username]@[ip_address]

Docker Quickstart

Estimated time: 5 minutes, 6 minutes with buffer.


Pulling a Docker Image

Your node comes with Docker Engine installed, so all Docker functionality should be available to you on your first connection. Begin by pulling the desired image:

You can verify that your image was properly pulled by running the following command and checking for your desired image:

If the pull was successful, your output should look similar to this:

TensorWave's officially supported images can be found here.


Running a Docker Container

In order to run your Docker containers with GPU acceleration, you must mount the GPU devices into the container. For certain applications, you must also add the container to the video group to utilize your GPUs.

Using the docker run Command

Here's an example command to mount the devices and configure the correct permissions:

The usage of each option is as follows:

  • --device /dev/kfd

    • This command mounts the main compute interface to your container.

  • --device /dev/dri

    • This mounts the Direct Rendering Infrastructure devices for your GPUs. To restrict access, append /renderD<node>, where <node> is the ID of the render node you want to mount.

  • --group-add video (optional)

    • This adds your container to the server's video group, which is necessary for certain applications (including PyTorch).

Using docker-compose

The following docker-compose file is equivalent to the command above:

To use it, create a docker-compose.yml file in any subdirectory, and within that subdirectory, run:

Verifying Setup

If done properly, the output of either the run or compose command should be similar to:

For other containers, to verify that your Docker container has access to your GPUs, run both rocm-smi and rocminfo. These commands will reveal information about the GPUs mounted to your container.

If one or both of these commands fails to execute successfully, please double check your running commands.
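For example, a quick one-off check from the host looks like this (a sketch; replace <your-image> with the image you're testing, and note it must include the ROCm SMI tools):

docker run --rm --device /dev/kfd --device /dev/dri --group-add video <your-image> rocm-smi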


docker pull tensorwavehq/hello_world:latest
docker images
REPOSITORY                      TAG            IMAGE ID       CREATED          SIZE
tensorwavehq/hello_world        latest         359e600f7aac   2 minutes ago   61.2GB
    docker run --device /dev/kfd --device /dev/dri --group-add video tensorwavehq/hello_world:latest
    version: '3'
    services:
      hello_world:
        image: tensorwavehq/hello_world:latest
        devices:
          - /dev/kfd
          - /dev/dri
        group_add:
          - video
    docker compose up
    CUDA available: True
    Number of GPUs: 8
    GPU 0: AMD Instinct MI300X
    GPU 1: AMD Instinct MI300X
    GPU 2: AMD Instinct MI300X
    GPU 3: AMD Instinct MI300X
    GPU 4: AMD Instinct MI300X
    GPU 5: AMD Instinct MI300X
    GPU 6: AMD Instinct MI300X
    GPU 7: AMD Instinct MI300X

    Kubernetes Quickstart

    Estimated time: 6 minutes, 8 minutes with buffer.


    Installing Tooling

    We'll start by installing kubectl and k3d. kubectl is a command-line tool for managing Kubernetes clusters, and k3d is a lightweight wrapper for running k3s in Docker. Download and install the latest release of kubectl using the following commands:

    Next, install the latest release of k3d using this command:

    Learn more about installing kubectl here, and installing k3d here.


    Creating and Configuring a Cluster

    Now, you must create a cluster:

    Then go ahead and check your context:

    This should list the cluster you just created, but if not, run the following command to switch to the needed context:

    To run pods with GPU acceleration, the cluster must also be set up with the AMD GPU device plugin and node labeller. You can install these using the following commands:

    Then, go ahead and create your deployment manifest. Make a directory for the manifest, change into it, and open up a deployment.yaml file.

    From there, paste in the following yaml:

    You'll notice there are a few extra configurations we added. These are necessary for running the pod with GPU acceleration.

      • resources.limits

        • The amd.com/gpu: 1 limit specifies that the container requires 1 AMD GPU. You must explicitly request GPU resources so that Kubernetes can schedule the pod on a node with an available AMD GPU.

      • volumes

        • These definitions allow the cluster to use the necessary volumes from the host for utilizing the AMD GPUs.

      • volumeMounts

        • These mounts correspond to the above volumes, allowing the container to access the GPU hardware.

      • securityContext

        • This security context runs the containers in the pod as group ID 110, the render group, which is necessary for PyTorch to detect the devices properly (PyTorch is used in the hello world container).

    Continue by applying the manifest using the following:

    This should take a few minutes to create the container. You can monitor the status here:

    Once this output displays that STATUS is Completed, you're ready to check output. Running:

    Should give an output of:

    You'll notice that you only have one GPU. That's because, as covered earlier, we specified a resource limit of one. You may raise or lower this number as necessary.
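    For example, requesting two GPUs per pod is just a change to the limit in your manifest (a sketch; the rest of the deployment stays the same):

    resources:
      limits:
        amd.com/gpu: 2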


    Teardown

    Navigate back to your base directory and remove your k8s-hello-world folder:

    Hugging Face Quickstart

    Estimated time: 7 minutes, 9 minutes with buffer.


    Hugging Face is an AI/ML platform for the entire model pipeline. For this quickstart, we'll walk you through accelerated inference using a pretrained model.

    Learn more about Hugging Face here.


    Installing Dependencies

    Because PyTorch with ROCm comes preloaded on your node, you will not need to install it. However, you will still need a couple of libraries in order to run our quickstart script. Begin by installing transformers using the following command:

    This should take no more than a few minutes.


    Creating and Running Inference Script

    Next, create and navigate to a new directory for your script:

    Then, create a new script:

    Within this script, paste the following code and exit:

    After doing so, you may run the script using the following:

    This runs a small model on one GPU, but feel free to swap out your model and prompts to your liking, then map to the proper devices. The output should be similar to:


    Teardown

    Navigate back to your base directory and remove your hf-hello-world folder:

    Slurm Quickstart

    Slurm provides a multi-tenant framework for managing compute resources and jobs that span large clusters.

    Overview

    TensorWave Slurm combines the power of Slurm, the industry-standard workload manager for HPC and AI, with the flexibility of a Kubernetes-native orchestration layer. This integration delivers a modern, multi-tenant environment that scales seamlessly across AMD Instinct GPU clusters—enabling teams to run distributed training, fine-tuning, and simulation workloads without managing the underlying infrastructure.

    With TensorWave, you get the familiar Slurm interface running on top of a cloud-native control plane that provides automated scheduling, easy scaling, and container-based execution.


    Why Slurm on Kubernetes?

    Traditional Slurm deployments were designed for static on-prem clusters. TensorWave modernizes that model by running Slurm inside Kubernetes, unlocking:

    • Scalable compute pools — resize your Slurm cluster within your K8s environment.

    • Container-native workflows — integrate directly with your existing Docker or Enroot environments.

    • Multi-tenant isolation — each user or team runs in a secure namespace with defined resource limits.

    The result is a unified, cloud-native scheduling experience that bridges HPC scalability with Kubernetes reliability.


    Quickstart Example

    1. Connect to Your Login Node

    Each Slurm environment provides a login node, your interactive entry point for running Slurm commands. Behind the scenes, this login node runs as a managed Kubernetes pod, with the same Slurm interface (srun, sinfo, sbatch), and the benefits of cloud-native orchestration.

    Once connected, you’ll have access to all standard Slurm utilities and your team’s partitions. From here, you can submit jobs, monitor queues, and launch multi-node workloads just as you would on a traditional HPC cluster.


    2. Inspect Available Resources

    List available partitions and node states:

    Example:

    Even though these nodes are dynamically managed by Kubernetes, the Slurm CLI remains identical to traditional HPC clusters.
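    For example, the usual commands work unchanged (a quick sketch; the partition name comes from the sinfo output above):

    squeue -u $USER
    scontrol show partition gpuworker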


    3. Launch a Multi-Node Job

    To verify connectivity and RDMA functionality, run a distributed RCCL test across four nodes (32 GPUs total):

    Slurm automatically handles:

    • GPU and node allocation

    • Network interface binding

    • MPI coordination

    Containers and More Info

    Often, users will want to run their HPC payloads in containers. You can learn more about this in the Pyxis Quickstart section.

    Enroot Containers

    TensorWave Slurm uses Enroot as its lightweight, high-performance container runtime for HPC and AI workloads.

    Unlike traditional container engines, Enroot runs entirely in user space with no privileged daemons or root access required, making it ideal for multi-tenant and secure compute environments.

    Enroot executes standard Docker or OCI images as unprivileged user processes, unpacking each image into an isolated filesystem that can be shared across nodes. It preserves direct access to GPUs, high-speed interconnects, and local storage, ensuring your jobs are performant inside containers.

    You don’t need to run Enroot commands directly; TensorWave Slurm handles that automatically through Pyxis, which integrates Enroot with familiar Slurm tools like srun and sbatch. Together, they allow you to launch containerized jobs using the same workflow you already know with the added benefits of portability and reproducibility.
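    As a quick illustration, a containerized command through Pyxis is just an ordinary srun call with a container flag added (a sketch; substitute an image your cluster can actually pull):

    srun --container-image=rocm/pytorch:latest rocm-smi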


    curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
    wget -q -O - https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
    Optional: Importing an Image with Enroot

    Although Pyxis automatically handles Enroot under the hood, you can manually import container images for debugging or pre-caching.

    For example, to pull and unpack a PyTorch ROCm image locally:

    This workflow downloads the image, converts it into an Enroot container bundle, and runs it as an unprivileged user process.

    You’ll typically never need to do this when submitting jobs through Pyxis, but it’s a useful way to verify container contents or pre-stage larger images.

    # Import a container image into Enroot format
    enroot import docker://tensorwavehq/pytorch-bnxt:rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2
    
    # Create a runnable instance
    enroot start tensorwavehq+pytorch-bnxt+rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2.sqsh
    securityContext:
        runAsGroup: 110
    k3d cluster create hello-world-cluster
    kubectl config current-context
    kubectl config use-context k3d-hello-world-cluster
    kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
    kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-labeller.yaml
    mkdir k8s-hello-world
    cd k8s-hello-world
    nano deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: hello-world
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: hello-world
      template:
        metadata:
          labels:
            app: hello-world
        spec:
          containers:
            - name: hello-world
              image: tensorwavehq/hello_world:latest
              resources:
                limits:
                  amd.com/gpu: 1
              volumeMounts:
                - name: dev-kfd
                  mountPath: /dev/kfd
                - name: dev-dri
                  mountPath: /dev/dri
              securityContext:
                runAsGroup: 110
          volumes:
            - name: dev-kfd
              hostPath:
                path: /dev/kfd
            - name: dev-dri
              hostPath:
                path: /dev/dri
    resources:
      limits:
        amd.com/gpu: 1
    volumes:
      - name: dev-kfd
        hostPath:
          path: /dev/kfd
      - name: dev-dri
        hostPath:
          path: /dev/dri
    volumeMounts:
      - name: dev-kfd
        mountPath: /dev/kfd
      - name: dev-dri
        mountPath: /dev/dri
    kubectl apply -f deployment.yaml
    kubectl get pods -l app=hello-world
    kubectl logs -l app=hello-world
    CUDA available: True
    Number of GPUs: 1
    GPU 0: AMD Instinct MI300X
    cd ~
    rm -rf k8s-hello-world/
    pip install transformers
    mkdir hf-hello-world
    cd hf-hello-world
    nano hello-world.py
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import time
    
    # Load model without quantization
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
    
    # Move model to GPU
    model = model.to("cuda")
    
    # Input text
    print("Warming up model...")
    input_text = "Hello, my name is"
    inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
    warmup = model.generate(**inputs, max_new_tokens=20)
    
    print("Preparing text...")
    input_text = "According to all known laws of aviation, there is no way that a bee should be able to fly. Its wings are too small to get its fat little body off the ground. The bee, of course, flies anyway because"
    inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
    
    print("Starting inference...")
    start = time.time()
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        no_repeat_ngram_size=2
    )
    t = time.time()-start
    print(f"inference time: {t}")
    
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    python3 hello-world.py
    Warming up model...
    Preparing text...
    Starting inference...
    inference time: 0.4770219326019287
    According to all known laws of aviation, there is no way that a bee should be able to fly. Its wings are too small to get its fat little body off the ground. The bee, of course, flies anyway because it can.
    Well, if you fly with a little fat of your body, you can fly pretty damn well.  You just have to be careful.
    cd ~
    rm -rf hf-hello-world/
    ssh <username>@<slurm-login-endpoint>
    sinfo
    PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
    gpuworker*     up   infinite   1024   idle compute-[0-1023]
    srun -N4 \
    --mpi=pmix \
    --ntasks-per-node=8 \
    --gpus-per-node=8 \
    --cpus-per-task=16 \
    /usr/local/bin/rccl-tests/build/all_reduce_perf -b 32 -e 8g -f 2 -g 1

    Running Jobs in Pyxis

    TensorWave Slurm integrates Pyxis, a container runtime plugin for Slurm that enables users to run containerized workloads directly within their jobs.

    This integration lets you launch distributed AI or HPC jobs inside optimized ROCm containers while maintaining full GPU, RDMA, and filesystem performance.

    Containers are the preferred way to run workloads in TensorWave Slurm. They ensure consistent environments across nodes, simplify dependency management, and let you reproduce results reliably.


    Running Your First Containerized Job

    In this example, you’ll run a multi-node RCCL performance test using Pyxis. This verifies that your containerized environment can access GPUs, RDMA interfaces, and Slurm’s MPI orchestration.

    1. Create a new job script named rccl-pyxis.sbatch:

    2. Submit the job to Slurm:

    3. Monitor progress:

    Once complete, your results will appear under the results/ directory, with each job’s output and error logs named using the Slurm job ID.
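    For example, you can follow a job's output while it runs (a sketch; substitute the job ID reported by squeue):

    tail -f results/rccl_multi_node-<jobid>.out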


    Using a Pre-Staged SquashFS Image

    Instead of pulling a container from a registry, you can point Slurm directly to a pre-staged SquashFS (.sqsh) image. This is often faster and preferred for large models or shared environments.

    Example:

    Using a local .sqsh file avoids repeated network pulls and ensures consistent environments across jobs.


    Pyxis Flags

    Pyxis extends Slurm with several container-related flags that control how your job interacts with the container environment. Below are the most commonly used options:

    • --container-image

      • Specifies the container to run. Accepts Docker/OCI URLs or local .sqsh images.

    • --container-writable

      • Makes the container filesystem writable during execution. Useful for logs, checkpoints, or temporary files.

    • --container-mounts=/src:/dst[,/src2:/dst2]

      • Binds local or shared directories into the container. Multiple mounts can be separated by commas.

    • --container-workdir=/path

      • Sets the working directory inside the container (defaults to /).

    • --container-name=<name>

      • Assigns a name to the running container instance, useful for debugging or monitoring.


    Learn More

    For advanced configuration options and the full list of supported flags, see the official containers documentation from SchedMD: https://slurm.schedmd.com/containers.html

    PyTorch Quickstart

    Estimated time: 2 minutes, 3 minutes with buffer


    Using ROCm Devices

    PyTorch is officially supported by AMD for ROCm, and should be plug-and-play once set up correctly.

    Learn more about installing PyTorch with ROCm here.

    AMD GPU devices are configured and accessed in exactly the same way as NVIDIA GPU devices. This means that any workflow that sets the PyTorch device with torch.device("cuda") will work out-of-the-box, assuming PyTorch can detect your GPUs.
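    For example, a minimal sketch of the usual device-selection pattern, identical to what you'd write for NVIDIA hardware:

    import torch

    device = torch.device("cuda")  # on a ROCm build, "cuda" maps to your AMD GPUs
    x = torch.randn(1024, 1024, device=device)
    print(x.device)  # cuda:0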

    Debugging

    To test whether your system is configured to use PyTorch with GPU acceleration, start a new file for a couple of debugging commands:

    The following code will return a boolean indicating whether your GPUs are being detected by PyTorch:

    Now, go ahead and run your file using:

    If this does not return True, there are a couple of things to check.

    PyTorch Setup

    One reason the above check may fail is that an incorrect (non-ROCm) build of PyTorch is installed. To check, add the following line to your debugging file:

    You should get an output similar to:

    Or:

    If this output is not a ROCm-enabled PyTorch build, you must reinstall PyTorch with the correct version. One way to do this would be:

    Checking ROCm Setup

    To ensure ROCm is properly configured, run the following command:

    The output should be similar to (depending on your number of devices):

    If this is not the case, ROCm is not properly installed. More commonly, however, you will have issues running the following command:

    The output should be of the format:

    If this command errors, it's most likely that devices are not properly mounted, or your user is not a part of the render group.
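    If it's a group-membership issue, adding your user to the render and video groups and starting a new session usually resolves it (a sketch):

    sudo usermod -aG render,video $USER
    # log out and back in (or run `newgrp render`) for the change to take effect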


    Teardown

    Navigate back to your base directory and remove your pytorch-hello-world folder:

    #!/bin/bash
    #SBATCH --job-name=rccl_multi_node
    #SBATCH --output=results/rccl_multi_node-%j.out
    #SBATCH --error=results/rccl_multi_node-%j.out
    #SBATCH --ntasks-per-node=8
    #SBATCH --gpus-per-node=8
    #SBATCH --cpus-per-task=16  
    #SBATCH -N4
    
    CONTAINER_IMAGE='tensorwavehq/pytorch-bnxt:rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2'
    
    export NCCL_IB_QPS_PER_CONNECTION=2
    export NCCL_BUFFSIZE=8388608
    export UCX_NET_DEVICES=eno0
    
    # Minimize unnecessary logs when running with Pyxis
    export OMPI_MCA_btl=^openib
    export PMIX_MCA_gds=hash
    export UCX_WARN_UNUSED_ENV_VARS=n
    
    srun --mpi=pmix \
      --container-writable \
      --container-name=rccl-pyxis-run \
      --container-image=${CONTAINER_IMAGE} \
      /usr/local/bin/rccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1
    sbatch --nodes=<number-of-nodes> rccl-pyxis.sbatch
    squeue -u $USER
    # You can set the image in the previous example to a local .sqsh file
    CONTAINER_IMAGE='tensorwavehq+pytorch-bnxt+rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2.sqsh'
    mkdir pytorch-hello-world
    cd pytorch-hello-world
    nano debug.py
    import torch
    print(torch.cuda.is_available())
    python3 debug.py
    print(torch.__version__)
    [torch_version]a0+git[hash]
    [torch_version].dev[date]+rocm[rocm_version]
    pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.1/
    rocm-smi
    ========================================= ROCm System Management Interface =========================================
    =================================================== Concise Info ===================================================
    Device  [Model : Revision]    Temp        Power     Partitions      SCLK    MCLK    Fan  Perf  PwrCap  VRAM%  GPU%  
            Name (20 chars)       (Junction)  (Socket)  (Mem, Compute)                                                  
    ====================================================================================================================
    0       [0x74a1 : 0x00]       45.0°C      142.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%    
            AMD Instinct MI300X                                                                                         
    1       [0x74a1 : 0x00]       42.0°C      135.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%    
            AMD Instinct MI300X                                                                                         
    2       [0x74a1 : 0x00]       42.0°C      137.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%    
            AMD Instinct MI300X                                                                                         
    3       [0x74a1 : 0x00]       48.0°C      141.0W    NPS1, SPX       138Mhz  900Mhz  0%   auto  750.0W    0%   0%    
            AMD Instinct MI300X                                                                                         
    4       [0x74a1 : 0x00]       46.0°C      142.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%    
            AMD Instinct MI300X                                                                                         
    5       [0x74a1 : 0x00]       40.0°C      137.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%    
            AMD Instinct MI300X                                                                                         
    6       [0x74a1 : 0x00]       47.0°C      142.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%    
            AMD Instinct MI300X                                                                                         
    7       [0x74a1 : 0x00]       42.0°C      132.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%    
            AMD Instinct MI300X                                                                                         
    ====================================================================================================================
    =============================================== End of ROCm SMI Log ================================================
    rocminfo
    ROCk module version 6.7.0 is loaded
    =====================    
    HSA System Attributes    
    =====================    
    ....
    cd ~
    rm -rf pytorch-hello-world/

    Easy Porting: NVIDIA to AMD Guide

    Introduction

    With both AMD and NVIDIA establishing themselves as top offerings for AI compute, questions have arisen over the differences in software required to run on each. Real-world workloads can run on both types of hardware with little to no code changes, and we're excited to demonstrate this further today.

    We'll start by training an image classifier on the CIFAR-10 dataset in PyTorch on both NVIDIA and AMD.

    Learn more about the CIFAR-10 dataset here.

    We'll then move on to a more practical use-case: fine-tuning Llama 3.1 8B on a corpus of SQL data.

    Learn more about Llama 3.1 here.


    Training an Image Classification Model

    To start, you're going to need to install PyTorch locally. Install the appropriate version depending on your hardware.

    This will be the only difference in process for this tutorial.

    Next, navigate to the directory you'd like to set this tutorial up in. From there, create the following Python script:

    This script loads the dataset, transforms it, then trains and evaluates a CNN model that can classify at around 80% accuracy. This model gets saved at the model_save_path, which can be configured on your own.

    You'll notice that at the top, we set our computation device via device = torch.device('cuda'). In PyTorch's ROCm builds, 'cuda' actually points to your AMD GPUs, so there is no need to change any of your scripts.
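    You can confirm this on your node with a quick sketch like the following (the reported name will be your AMD device rather than an NVIDIA one):

    import torch
    print(torch.cuda.is_available())      # True on a working ROCm setup
    print(torch.cuda.get_device_name(0))  # e.g. an AMD Instinct device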

    Next, create the following inference script:

    This script loads the model generated by the previous script, then classifies the specified image in image_url into one of the 10 categories:

    That's it!


    Fine-Tuning LLMs

    For the purposes of this tutorial, we'll be fine-tuning Facebook's OPT-350m model. We'll begin by setting up our dependencies for significantly speeding up LLM training.

    This tutorial assumes the following prerequisites. If you're using different versions, please adjust your commands accordingly.

    • Linux (Ubuntu)

    • CUDA 12.1 or ROCm 6.2

    Begin by installing the needed dependencies.

    Notice that, as above, this will be the only difference between the two training processes.

    From there, create the following script in a subfolder you'd like to work in.

    This script trains Facebook's OPT-350m model on the IMDB review dataset and saves the model for later inference. To conduct inference, use the following script:


    Accelerated Inference for Llama 3.1 (and other HF Models)

    For this section of the tutorial, we're going to use vLLM, a framework for accelerated LLM inference and serving.

    More information on vLLM can be found here.

    We're going to serve Llama 3.1 8B Instruct through Docker containers. We'll start by pulling the images and serving the endpoints from there. Note that since the Llama models are gated, we'll have to log in through huggingface-cli to use them.

    Note that you will need your Hugging Face API token for both methods.

    In a separate terminal, you can now query the endpoints!

    pip install requests
    pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.2/
    pip install requests
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torchvision import datasets, transforms
    from torch.utils.data import DataLoader
    import os
    
    device = torch.device('cuda')
    
    class SimpleCNN(nn.Module):
        def __init__(self, num_classes=10):
            super(SimpleCNN, self).__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 64, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2, 2),
                nn.Conv2d(64, 128, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2, 2),
                nn.Conv2d(128, 256, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2, 2)
            )
            self.classifier = nn.Sequential(
                nn.Dropout(0.5),
                nn.Linear(256 * 4 * 4, 512),
                nn.ReLU(inplace=True),
                nn.Dropout(0.5),
                nn.Linear(512, num_classes)
            )
    
        def forward(self, x):
            x = self.features(x)
            x = x.view(x.size(0), -1)
            x = self.classifier(x)
            return x
    
    def train_and_evaluate(model, train_loader, test_loader, num_epochs=10, learning_rate=0.01):
        model = model.to(device)
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)
    
        for epoch in range(num_epochs):
            model.train()
            for images, labels in train_loader:
                images, labels = images.to(device), labels.to(device)
                optimizer.zero_grad()
                outputs = model(images)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()
    
            model.eval()
            correct = 0
            total = 0
            with torch.no_grad():
                for images, labels in test_loader:
                    images, labels = images.to(device), labels.to(device)
                    outputs = model(images)
                    _, predicted = torch.max(outputs.data, 1)
                    total += labels.size(0)
                    correct += (predicted == labels).sum().item()
            
            accuracy = 100 * correct / total
            print(f'Epoch [{epoch+1}/{num_epochs}], Accuracy: {accuracy:.2f}%')
    
        return model
    
    def save_model(model, path):
        torch.save(model.state_dict(), path)
        print(f"Model saved to {path}")
    
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
    
    train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    test_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
    
    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
    
    model = SimpleCNN()
    trained_model = train_and_evaluate(model, train_loader, test_loader)
    
    model_save_path = 'cifar10_cnn_model.pth'
    save_model(trained_model, model_save_path)
    
    trained_model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = trained_model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    
    print(f'Final Test Accuracy: {100 * correct / total:.2f}%')
    import torch
    import torch.nn as nn
    from torchvision import transforms
    from PIL import Image
    import requests
    from io import BytesIO
    
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model_save_path = 'cifar10_cnn_model.pth'
    
    def load_model(model, path):
        model.load_state_dict(torch.load(path, map_location=device))
        model.eval()
        print(f"Model loaded from {path}")
        return model
    
    class SimpleCNN(nn.Module):
        def __init__(self, num_classes=10):
            super(SimpleCNN, self).__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 64, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2, 2),
                nn.Conv2d(64, 128, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2, 2),
                nn.Conv2d(128, 256, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2, 2)
            )
            self.classifier = nn.Sequential(
                nn.Dropout(0.5),
                nn.Linear(256 * 4 * 4, 512),
                nn.ReLU(inplace=True),
                nn.Dropout(0.5),
                nn.Linear(512, num_classes)
            )
    
        def forward(self, x):
            x = self.features(x)
            x = x.view(x.size(0), -1)
            x = self.classifier(x)
            return x
    
    def predict_image_from_url(model, image_url):
        transform = transforms.Compose([
            transforms.Resize((32, 32)),  # CIFAR10 images are 32x32
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        ])
        
        # Download the image
        response = requests.get(image_url)
        image = Image.open(BytesIO(response.content)).convert('RGB')
        
        image = transform(image).unsqueeze(0).to(device)
        model.eval()
        with torch.no_grad():
            output = model(image)
            _, predicted = torch.max(output, 1)
        
        classes = ('plane', 'car', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck')
        return classes[predicted.item()]
    
    # Initialize and load the model
    model = SimpleCNN()
    model = load_model(model, model_save_path)
    model = model.to(device)
    
    # Predict from URL
    image_url = 'https://images.twinkl.co.uk/tw1n/image/private/t_630/u/ux/frog-2_ver_1.jpg'
    predicted_class = predict_image_from_url(model, image_url)
    print(f"The image is predicted to be: {predicted_class}")
    'plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
    pip install packaging ninja accelerate wandb
    export GPU_ARCHS="gfx942"
    export ROCM_HOME="/opt/rocm"
    pip install --no-deps --force-reinstall 'https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_multi-backend-refactor/bitsandbytes-0.44.1.dev0-py3-none-manylinux_2_24_x86_64.whl' 
    pip install trl
    pip install --no-deps peft
    pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121
    pip install packaging ninja accelerate wandb bitsandbytes trl
    pip install --no-deps peft
    # imports
    from datasets import load_dataset
    from trl import SFTTrainer
    
    # get dataset
    dataset = load_dataset("imdb", split="train")
    
    # get trainer
    trainer = SFTTrainer(
        "facebook/opt-350m",
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=512,
    )
    
    # train
    trainer.train()
    
    trainer.save_model("imdb_saved")
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch
    
    # Load the model and tokenizer
    model_path = "imdb_saved" 
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    
    # Move the model to GPU if available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    
    def generate_text(prompt, max_length=150):
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        
        # Generate
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_length,
                num_return_sequences=1,
                no_repeat_ngram_size=2
            )
        
        # Decode and return the generated text
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        return generated_text
    
    # Test with a positive prompt
    positive_prompt = "This movie was amazing! The plot"
    print("Model loading...")
    positive_response = generate_text(positive_prompt)
    print("Positive prompt:")
    print(positive_response)
    
    # Test with a negative prompt
    negative_prompt = "I hated this film. The acting"
    print("\nNegative prompt:")
    print(generate_text(negative_prompt))
    
    # Test with a neutral prompt
    neutral_prompt = "This movie was okay. It had"
    print("\nNeutral prompt:")
    print(generate_text(neutral_prompt))
    docker pull rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50
    docker run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50
    huggingface-cli login #paste your token as needed
    vllm serve meta-llama/Llama-3.1-8B-Instruct
    docker pull vllm/vllm-openai:latest
    docker run --runtime nvidia --gpus all \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
        -p 8000:8000 \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model meta-llama/Llama-3.1-8B-Instruct
    curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "What is the meaning of life?",
    "max_tokens": 128,
    "top_p": 0.95,
    "top_k": 20,
    "temperature": 0.8
    }'