ROCm, otherwise known as Radeon Open Compute, is an open-source GPU compute framework that enables developers to customize their GPU software while collaborating with other developers. It consists of a variety of drivers, development tools, and APIs that enable GPU programming from the low-level kernel to end-user applications. ROCm can be deployed in many ways, including through Docker containers, Spack, or your own build from source.
Your node runs Linux (Ubuntu) and comes pre-loaded with tools to simplify your setup process.
When your bare metal node is ready, you will be provided its username and IP address. To connect to your node, use the following command on a device that holds one of the SSH keys you provided:

ssh [username]@[ip_address]

This command will be your primary method of accessing and managing your node.
Your node comes with Ubuntu 22.04 LTS and ROCm. All SSH keys you initially provided will have root user access.
If you would like to give additional users SSH access to your node, follow these steps:
1. Copy the new user's public key, which should be structured like:

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDxAZn... user@host

2. SSH into your node using:

ssh [username]@[ip_address]

3. Open your authorized keys file using:

nano /home/[username]/.ssh/authorized_keys

4. Paste the key at the bottom of the file.
5. Save and exit.
6. Restart the sshd service using:

sudo systemctl restart sshd

To get more information on your GPUs, run rocm-smi in your terminal. To monitor this continuously, you can run watch -n 0.5 rocm-smi, which refreshes the device IDs and usage every 0.5 seconds, as shown below:
============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Junction) (Socket) (Mem, Compute, ID)
==========================================================================================================================
0 2 0x74a1, 39334 45.0°C 142.0W NPS1, SPX, 0 132Mhz 900Mhz 0% auto 750.0W 0% 0%
1 3 0x74a1, 34119 42.0°C 135.0W NPS1, SPX, 0 132Mhz 900Mhz 0% auto 750.0W 0% 0%
2 4 0x74a1, 664 42.0°C 137.0W NPS1, SPX, 0 132Mhz 900Mhz 0% auto 750.0W 0% 0%
3 5 0x74a1, 33001 48.0°C 142.0W NPS1, SPX, 0 154Mhz 900Mhz 0% auto 750.0W 0% 0%
4 6 0x74a1, 15994 46.0°C 143.0W NPS1, SPX, 0 132Mhz 900Mhz 0% auto 750.0W 0% 0%
5 7 0x74a1, 63627 40.0°C 137.0W NPS1, SPX, 0 132Mhz 900Mhz 0% auto 750.0W 0% 0%
6 8 0x74a1, 33811 47.0°C 143.0W NPS1, SPX, 0 132Mhz 900Mhz 0% auto 750.0W 0% 0%
7 9 0x74a1, 41883 42.0°C 132.0W NPS1, SPX, 0 132Mhz 900Mhz 0% auto 750.0W 0% 0%
==========================================================================================================================
================================================== End of ROCm SMI Log ===================================================

To download files from your server, use the scp command:

# For individual files
scp [username]@[ip_address]:/path/to/remote/file /path/to/local/destination
# For subdirectories
scp -r [username]@[ip_address]:/path/to/remote/directory /path/to/local/destination

To upload files to your server, you may also use the scp command:

# For individual files
scp /path/to/local/file [username]@[ip_address]:/path/to/remote/destination
# For subdirectories
scp -r /path/to/local/directory [username]@[ip_address]:/path/to/remote/destination

Oftentimes, you will find yourself needing to access a service exposed on your remote server locally. To do so, use the following command:

ssh -L local_port:remote_host:remote_port [username]@[ip_address]

For example, let's assume you want to access a Jupyter Notebook that you've exposed on port 8888. From your local command line, use the following:

ssh -L 8888:localhost:8888 [username]@[ip_address]

Then, you can access it in your browser at http://localhost:8888.

Estimated time: 5 minutes, 6 minutes with buffer.
Your node comes with Docker Engine installed, so all Docker functionality should be available to you on your first connection. Begin by pulling the desired image:

docker pull tensorwavehq/hello_world:latest

You can verify that your image was properly pulled by running the following command and checking for your desired image:

docker images

If the pull was successful, your output should look similar to this:

REPOSITORY                 TAG      IMAGE ID       CREATED         SIZE
tensorwavehq/hello_world   latest   359e600f7aac   2 minutes ago   61.2GB

In order to run your Docker containers with GPU acceleration, you must mount the devices. For certain applications, you must also add the container to a group to utilize your GPUs.
Here's an example command to mount the devices and configure the correct permissions:

docker run --device /dev/kfd --device /dev/dri --group-add video tensorwavehq/hello_world:latest

The usage of each option is as follows:
--device /dev/kfd
This option mounts the main compute interface to your container.
--device /dev/dri
This option mounts the GPUs' render interfaces to your container, exposed as /dev/dri/renderD<node>, where the node is the ID of the node you want to mount.
--group-add video (optional)
This option adds your container to the server's video group, which is necessary for certain applications (including PyTorch).
The following is an equivalent docker-compose to the command above:

version: '3'
services:
  hello_world:
    image: tensorwavehq/hello_world:latest
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video

To use it, create a docker-compose.yml file in any subdirectory, and within that subdirectory, run:

docker compose up

If done properly, the output of either the run or compose command should be similar to:

CUDA available: True
Number of GPUs: 8
GPU 0: AMD Instinct MI300X
GPU 1: AMD Instinct MI300X
GPU 2: AMD Instinct MI300X
GPU 3: AMD Instinct MI300X
GPU 4: AMD Instinct MI300X
GPU 5: AMD Instinct MI300X
GPU 6: AMD Instinct MI300X
GPU 7: AMD Instinct MI300X

For other containers, to verify that your Docker container has access to your GPUs, run both rocm-smi and rocminfo. These commands will reveal information about the GPUs mounted to your container.
If one or both of these commands fails to execute successfully, please double-check the commands you ran.

Estimated time: 6 minutes, 8 minutes with buffer.
We'll start by installing kubectl and k3d. kubectl is a command-line tool for managing Kubernetes clusters, and k3d is a lightweight wrapper for running k3s in Docker. Download and install the latest release of kubectl using the following commands:
Next, install the latest release of k3d using this command:
Now, you must create a cluster:
Then go ahead and check your context:
This should list the cluster you just created, but if not, run the following command to switch to the needed context:
In order to operate with acceleration, Kubernetes must also be set up with the AMD GPU Operator and Labeler plugins. You can install these using the following commands:
Then, go ahead and make your deployment manifest. Create a directory for your manifest and change into it, then open up a deployment.yaml file.
From there, paste in the following yaml:
You'll notice there are a few extra configurations we added. These are necessary for running the pod with GPU acceleration.
This specifies that the container requires 1 AMD GPU. You must explicitly request GPU resources so that Kubernetes can schedule the pod on a node with an available AMD GPU.
These definitions allow the cluster to use the necessary volumes from the host for utilizing the AMD GPUs.
Continue by applying the manifest using the following:
This should take a few minutes to create the container. You can monitor the status here:
Once this output displays that STATUS is Completed, you're ready to check output. Running:
Should give an output of:
You'll notice that you only have one GPU. That's because, as covered earlier, we specified a resource limit of one. You may raise or lower this number as necessary.
Navigate back to your base directory and remove your k8s-hello-world folder:
Estimated time: 7 minutes, 9 minutes with buffer.
Hugging Face is an AI/ML platform for the entire model pipeline. For this quickstart, we'll walk you through accelerated inference using a pretrained model.
Because PyTorch with ROCm comes preloaded on your node, you will not need to install this dependency. However, you will still need a couple of libraries in order to run our quickstart script. Begin by installing them using the following command:
This should take no more than a few minutes.
Next, go ahead and create and navigate to a new directory to create your script in:
Then, create a new script:
Within this script, paste the following code and exit:
After doing so, you may run the script using the following:
This runs a small model on one GPU, but feel free to swap in your own model and prompts, mapping them to the proper devices. The output should be similar to:
Navigate back to your base directory and remove your hf-hello-world folder:
Slurm provides a multi-tenant framework for managing compute resources and jobs that span large clusters.
TensorWave Slurm combines the power of Slurm, the industry-standard workload manager for HPC and AI, with the flexibility of a Kubernetes-native orchestration layer. This integration delivers a modern, multi-tenant environment that scales seamlessly across AMD Instinct GPU clusters—enabling teams to run distributed training, fine-tuning, and simulation workloads without managing the underlying infrastructure.
With TensorWave, you get the familiar Slurm interface running on top of a cloud-native control plane that provides automated scheduling, easy scaling, and container-based execution.
Traditional Slurm deployments were designed for static on-prem clusters. TensorWave modernizes that model by running Slurm inside Kubernetes, unlocking:
Scalable compute pools — resize your Slurm cluster within your K8s environment.
Container-native workflows — integrate directly with your existing Docker or Enroot environments.
Multi-tenant isolation — each user or team runs in a secure namespace with defined resource limits.
The result is a unified, cloud-native scheduling experience that bridges HPC scalability with Kubernetes reliability.
Each Slurm environment provides a login node, your interactive entry point for running Slurm commands. Behind the scenes, this login node runs as a managed Kubernetes pod, with the same Slurm interface (srun, sinfo, sbatch), and the benefits of cloud-native orchestration.
Once connected, you’ll have access to all standard Slurm utilities and your team’s partitions. From here, you can submit jobs, monitor queues, and launch multi-node workloads just as you would on a traditional HPC cluster.
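For example, a few everyday commands you might run from the login node are sketched below; my-job.sbatch and <job_id> are placeholders rather than files that ship with the environment:

# Inspect partitions and node states
sinfo
# List your queued and running jobs
squeue -u $USER
# Submit a batch job (my-job.sbatch is a placeholder script name)
sbatch my-job.sbatch
# Cancel a job by ID
scancel <job_id>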
List available partitions and node states:
Example:
Even though these nodes are dynamically managed by Kubernetes, the Slurm CLI remains identical to traditional HPC clusters.
To verify connectivity and RDMA functionality, run a distributed RCCL test across four nodes (32 GPUs total):
Slurm automatically handles:
GPU and node allocation
Network interface binding
MPI coordination
Often, users will want to run their HPC payloads in containers. You can learn more about this in the Pyxis Quickstart section.
TensorWave Slurm uses Enroot as its lightweight, high-performance container runtime for HPC and AI workloads.
Unlike traditional container engines, Enroot runs entirely in user space with no privileged daemons or root access required, making it ideal for multi-tenant and secure compute environments.
Enroot executes standard Docker or OCI images as unprivileged user processes, unpacking each image into an isolated filesystem that can be shared across nodes. It preserves direct access to GPUs, high-speed interconnects, and local storage, ensuring your jobs are performant inside containers.
You don’t need to run Enroot commands directly; TensorWave Slurm handles that automatically through Pyxis, which integrates Enroot with familiar Slurm tools like srun and sbatch.
Together, they allow you to launch containerized jobs using the same workflow you already know with the added benefits of portability and reproducibility.
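As a minimal sketch of that workflow (assuming Pyxis is enabled on your partition, and using the public rocm/pytorch image as a stand-in for your own), a containerized command can be launched directly with srun:

# Run rocm-smi inside a container on a single node; swap in your own image and command
srun -N1 --gpus-per-node=8 --container-image=rocm/pytorch:latest rocm-smi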
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

wget -q -O - https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash

These mounts correspond to the above volumes, allowing the container to access the GPU hardware.
This security context runs the containers in the pod as the group ID 110, the render group, which is necessary for PyTorch to detect the devices properly (PyTorch is used in the hello world container).
Although Pyxis automatically handles Enroot under the hood, you can manually import container images for debugging or pre-caching.
For example, to pull and unpack a PyTorch ROCm image locally:
This workflow downloads the image, converts it into an Enroot container bundle, and runs it as an unprivileged user process.
You’ll typically never need to do this when submitting jobs through Pyxis, but it’s a useful way to verify container contents or pre-stage larger images.
# Import a container image into Enroot format
enroot import docker://tensorwavehq/pytorch-bnxt:rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2
# Create a runnable instance
enroot start tensorwavehq+pytorch-bnxt+rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2.sqsh

securityContext:
  runAsGroup: 110

k3d cluster create hello-world-cluster

kubectl config current-context

kubectl config use-context k3d-hello-world-cluster

kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-labeller.yaml

mkdir k8s-hello-world
cd k8s-hello-world
nano deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hello-world
  template:
    metadata:
      labels:
        app: hello-world
    spec:
      containers:
        - name: hello-world
          image: tensorwavehq/hello_world:latest
          resources:
            limits:
              amd.com/gpu: 1
          volumeMounts:
            - name: dev-kfd
              mountPath: /dev/kfd
            - name: dev-dri
              mountPath: /dev/dri
          securityContext:
            runAsGroup: 110
      volumes:
        - name: dev-kfd
          hostPath:
            path: /dev/kfd
        - name: dev-dri
          hostPath:
            path: /dev/dri

resources:
  limits:
    amd.com/gpu: 1

volumes:
  - name: dev-kfd
    hostPath:
      path: /dev/kfd
  - name: dev-dri
    hostPath:
      path: /dev/dri

volumeMounts:
  - name: dev-kfd
    mountPath: /dev/kfd
  - name: dev-dri
    mountPath: /dev/dri

kubectl apply -f deployment.yaml

kubectl get pods -l app=hello-world

kubectl logs -l app=hello-world

CUDA available: True
Number of GPUs: 1
GPU 0: AMD Instinct MI300X

cd ~
rm -rf k8s-hello-world/

pip install transformers

mkdir hf-hello-world
cd hf-hello-world

nano hello-world.py

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
# Load model without quantization
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
# Move model to GPU
model = model.to("cuda")
# Input text
print("Warming up model...")
input_text = "Hello, my name is"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
warmup = model.generate(**inputs, max_new_tokens=20)
print("Preparing text...")
input_text = "According to all known laws of aviation, there is no way that a bee should be able to fly. Its wings are too small to get its fat little body off the ground. The bee, of course, flies anyway because"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
print("Starting inference...")
start = time.time()
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=True,
temperature=0.7,
top_k=50,
top_p=0.95,
no_repeat_ngram_size=2
)
t = time.time()-start
print(f"inference time: {t}")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

python3 hello-world.py

Warming up model...
Preparing text...
Starting inference...
inference time: 0.4770219326019287
According to all known laws of aviation, there is no way that a bee should be able to fly. Its wings are too small to get its fat little body off the ground. The bee, of course, flies anyway because it can.
Well, if you fly with a little fat of your body, you can fly pretty damn well. You just have to be careful.

cd ~
rm -rf hf-hello-world/

ssh <username>@<slurm-login-endpoint>

sinfo

PARTITION    AVAIL  TIMELIMIT  NODES  STATE  NODELIST
gpuworker*   up     infinite   1024   idle   compute-[0-1023]

srun -N4 \
--mpi=pmix \
--ntasks-per-node=8 \
--gpus-per-node=8 \
--cpus-per-task=16 \
/usr/local/bin/rccl-tests/build/all_reduce_perf -b 32 -e 8g -f 2 -g 1

TensorWave Slurm integrates Pyxis, a container runtime plugin for Slurm that enables users to run containerized workloads directly within their jobs.
This integration lets you launch distributed AI or HPC jobs inside optimized ROCm containers while maintaining full GPU, RDMA, and filesystem performance.
Containers are the preferred way to run workloads in TensorWave Slurm. They ensure consistent environments across nodes, simplify dependency management, and let you reproduce results reliably.
In this example, you'll run a multi-node RCCL performance test.
Create a new job script named rccl-pyxis.sbatch:
Submit the job to Slurm:
Monitor progress:
Once complete, your results will appear under the results/ directory, with each job’s output and error logs named using the Slurm job ID.
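For example, to follow the output of a specific job (the numeric job ID reported by sbatch or squeue; 123456 below is a placeholder):

# List the result logs produced by your jobs
ls results/
# Stream the combined output/error log for one job
tail -f results/rccl_multi_node-123456.out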
Instead of pulling a container from a registry, you can point Slurm directly to a pre-staged SquashFS (.sqsh) image.
This is often faster and preferred for large models or shared environments.
Example:
Using a local .sqsh file avoids repeated network pulls and ensures consistent environments across jobs.
Pyxis extends Slurm with several container-related flags that control how your job interacts with the container environment. Below are the most commonly used options, followed by a short combined example:
--container-image
Specifies the container to run. Accepts Docker/OCI URLs or local .sqsh images.
--container-writable
Makes the container filesystem writable during execution. Useful for logs, checkpoints, or temporary files.
--container-mounts=/src:/dst[,/src2:/dst2]
Binds local or shared directories into the container. Multiple mounts can be separated by commas.
--container-workdir=/path
Sets the working directory inside the container (defaults to /).
--container-name=<name>
Assigns a name to the running container instance, useful for debugging or monitoring.
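Putting several of these flags together, a containerized run might look like the sketch below; the mount paths, working directory, and train.py script are placeholders, and the image is the same one used in the RCCL example above:

srun --mpi=pmix \
  --container-image=tensorwavehq/pytorch-bnxt:rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2 \
  --container-mounts=/data:/data,$HOME/results:/results \
  --container-workdir=/data \
  --container-name=my-training-run \
  python3 train.py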
For advanced configuration options and the full list of supported flags, see the official containers documentation from SchedMD: https://slurm.schedmd.com/containers.html
In order to test whether your system is configured to use PyTorch with GPU acceleration, begin by starting a new file to run a couple of debugging commands:
The following code will return a boolean indicating whether your GPUs are being detected by PyTorch:
Now, go ahead and run your file using:
In the event that this does not return True, there are a couple things you must check.
One reason the above command may not function properly is that the incorrect version of PyTorch is installed. To check, add the following line to your debugging file:
You should get an output similar to:
Or:
If this output is not a ROCm-enabled PyTorch build, you must reinstall PyTorch with the correct version. One way to do this would be:
To ensure ROCm is properly configured, run the following command:
The output should be similar to (depending on your number of devices):
If this is not the case, ROCm is not properly installed. You will more likely, however, have issues running the following command:
The output should be of the format:
If this command errors, it's most likely that devices are not properly mounted, or your user is not a part of the render group.
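To check both conditions, the usual diagnostics look something like this (group names can vary slightly between distributions, so treat the group list as an assumption):

# Confirm the ROCm device nodes exist and note which groups own them
ls -l /dev/kfd /dev/dri
# Confirm your user belongs to the render and video groups
groups
# If not, add yourself, then log out and back in for the change to take effect
sudo usermod -aG render,video $USER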
Navigate back to your base directory and remove your pytorch-hello-world folder:
#!/bin/bash
#SBATCH --job-name=rccl_multi_node
#SBATCH --output=results/rccl_multi_node-%j.out
#SBATCH --error=results/rccl_multi_node-%j.out
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH -N4
CONTAINER_IMAGE='tensorwavehq/pytorch-bnxt:rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2'
export NCCL_IB_QPS_PER_CONNECTION=2
export NCCL_BUFFSIZE=8388608
export UCX_NET_DEVICES=eno0
# Minimize unnecessary logs when running with Pyxis
export OMPI_MCA_btl=^openib
export PMIX_MCA_gds=hash
export UCX_WARN_UNUSED_ENV_VARS=n
srun --mpi=pmix \
--container-writable \
--container-name=rccl-pyxis-run \
--container-image=${CONTAINER_IMAGE} \
/usr/local/bin/rccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1

sbatch --nodes=<number-of-nodes> rccl-pyxis.sbatch

squeue -u $USER

# You can set the image in the previous example to a local .sqsh file
CONTAINER_IMAGE='tensorwavehq+pytorch-bnxt+rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2.sqsh'

torch.device("cuda")

mkdir pytorch-hello-world
cd pytorch-hello-world

nano debug.py

import torch
print(torch.cuda.is_available())

python3 debug.py

print(torch.__version__)

[torch_version]a0+git[hash]

[torch_version].dev[date]+rocm[rocm_version]

pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.1/

rocm-smi

========================================= ROCm System Management Interface =========================================
=================================================== Concise Info ===================================================
Device [Model : Revision] Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
Name (20 chars) (Junction) (Socket) (Mem, Compute)
====================================================================================================================
0 [0x74a1 : 0x00] 45.0°C 142.0W NPS1, SPX 132Mhz 900Mhz 0% auto 750.0W 0% 0%
AMD Instinct MI300X
1 [0x74a1 : 0x00] 42.0°C 135.0W NPS1, SPX 132Mhz 900Mhz 0% auto 750.0W 0% 0%
AMD Instinct MI300X
2 [0x74a1 : 0x00] 42.0°C 137.0W NPS1, SPX 132Mhz 900Mhz 0% auto 750.0W 0% 0%
AMD Instinct MI300X
3 [0x74a1 : 0x00] 48.0°C 141.0W NPS1, SPX 138Mhz 900Mhz 0% auto 750.0W 0% 0%
AMD Instinct MI300X
4 [0x74a1 : 0x00] 46.0°C 142.0W NPS1, SPX 132Mhz 900Mhz 0% auto 750.0W 0% 0%
AMD Instinct MI300X
5 [0x74a1 : 0x00] 40.0°C 137.0W NPS1, SPX 132Mhz 900Mhz 0% auto 750.0W 0% 0%
AMD Instinct MI300X
6 [0x74a1 : 0x00] 47.0°C 142.0W NPS1, SPX 132Mhz 900Mhz 0% auto 750.0W 0% 0%
AMD Instinct MI300X
7 [0x74a1 : 0x00] 42.0°C 132.0W NPS1, SPX 132Mhz 900Mhz 0% auto 750.0W 0% 0%
AMD Instinct MI300X
====================================================================================================================
=============================================== End of ROCm SMI Log ================================================

rocminfo

ROCk module version 6.7.0 is loaded
=====================
HSA System Attributes
=====================
....

cd ~
rm -rf pytorch-hello-world/

With both AMD and NVIDIA establishing themselves as top offerings for AI compute, questions have arisen over the differences in software required to run on each. Real-world workloads can run on both types of hardware with little to no code changes, and we're excited to demonstrate this further today.
We'll start by training an image classifier on the CIFAR-10 dataset in PyTorch on both NVIDIA and AMD.
We'll then move on to a more practical use-case: fine-tuning Llama 3.1 8B on a corpus of SQL data.
To start, you're going to need to install PyTorch locally. Install the appropriate version depending on your hardware.
Next, navigate to the directory you'd like to set this tutorial up in. From there, create the following Python script:
This script loads the dataset, transforms it, then trains and evaluates a CNN model that can classify at around 80% accuracy. This model gets saved at the model_save_path, which can be configured on your own.
You'll notice that at the top, we set our computation device via device = torch.device('cuda'). In PyTorch's ROCm installation, 'cuda' actually points to AMD GPUs, leaving no need to make any changes to any of your desired scripts.
Next, create the following inference script:
This script loads the model generated by the previous script, then classifies the specified image in image_url into one of the 10 categories:
That's it!
For the purposes of this tutorial, we'll be fine-tuning Facebook's OPT-350m model. We'll begin by setting up our dependencies for significantly speeding up LLM training.
This tutorial assumes the following prerequisites. If you're using different versions, please adjust your commands accordingly.
Linux (Ubuntu)
CUDA 12.1 or ROCm 6.2
Begin by installing the needed dependencies.
From there, make the following script in a subfolder you'd like to do your work in.
This script trains Facebook's OPT-350m model on an imdb review dataset, and saves the model for later inference. To conduct inference, use the following script:
For this section of the tutorial, we're going to use vLLM, a framework for accelerated LLM inference and serving.
We're going to serve Llama 3.1 8B Instruct through Docker containers. We'll start by pulling the images and serving the endpoints from there. Note that since the Llama models are gated, we'll have to log in through huggingface-cli to use them.
In a separate terminal, you can now query the endpoints!
pip install requests
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.2/

pip install requests
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import os

device = torch.device('cuda')

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2)
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(256 * 4 * 4, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

def train_and_evaluate(model, train_loader, test_loader, num_epochs=10, learning_rate=0.01):
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)
    for epoch in range(num_epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for images, labels in test_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        accuracy = 100 * correct / total
        print(f'Epoch [{epoch+1}/{num_epochs}], Accuracy: {accuracy:.2f}%')
    return model

def save_model(model, path):
    torch.save(model.state_dict(), path)
    print(f"Model saved to {path}")

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

model = SimpleCNN()
trained_model = train_and_evaluate(model, train_loader, test_loader)

model_save_path = 'cifar10_cnn_model.pth'
save_model(trained_model, model_save_path)

trained_model.eval()
correct = 0
total = 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = trained_model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f'Final Test Accuracy: {100 * correct / total:.2f}%')

import torch
import torch.nn as nn
from torchvision import transforms
from PIL import Image
import requests
from io import BytesIO

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_save_path = 'cifar10_cnn_model.pth'

def load_model(model, path):
    model.load_state_dict(torch.load(path, map_location=device))
    model.eval()
    print(f"Model loaded from {path}")
    return model

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2)
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(256 * 4 * 4, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

def predict_image_from_url(model, image_url):
    transform = transforms.Compose([
        transforms.Resize((32, 32)),  # CIFAR10 images are 32x32
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
    # Download the image
    response = requests.get(image_url)
    image = Image.open(BytesIO(response.content)).convert('RGB')
    image = transform(image).unsqueeze(0).to(device)
    model.eval()
    with torch.no_grad():
        output = model(image)
        _, predicted = torch.max(output, 1)
    classes = ('plane', 'car', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck')
    return classes[predicted.item()]

# Initialize and load the model
model = SimpleCNN()
model = load_model(model, model_save_path)
model = model.to(device)

# Predict from URL
image_url = 'https://images.twinkl.co.uk/tw1n/image/private/t_630/u/ux/frog-2_ver_1.jpg'
predicted_class = predict_image_from_url(model, image_url)
print(f"The image is predicted to be: {predicted_class}")'plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
pip install packaging ninja accelerate wandb
export GPU_ARCHS="gfx942"
export ROCM_HOME="/opt/rocm"
pip install --no-deps --force-reinstall 'https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_multi-backend-refactor/bitsandbytes-0.44.1.dev0-py3-none-manylinux_2_24_x86_64.whl'
pip install trl
pip install --no-deps peft

pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121
pip install packaging ninja accelerate wandb bitsandbytes trl
pip install --no-deps peft

# imports
from datasets import load_dataset
from trl import SFTTrainer
# get dataset
dataset = load_dataset("imdb", split="train")
# get trainer
trainer = SFTTrainer(
"facebook/opt-350m",
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=512,
)
# train
trainer.train()
trainer.save_model("imdb_saved")

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load the model and tokenizer
model_path = "imdb_saved"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Move the model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
def generate_text(prompt, max_length=150):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_length,
            num_return_sequences=1,
            no_repeat_ngram_size=2
        )
    # Decode and return the generated text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text
# Test with a positive prompt
positive_prompt = "This movie was amazing! The plot"
print("Model loading...")
positive_response = generate_text(positive_prompt)
print("Positive prompt:")
print(positive_response )
# Test with a negative prompt
negative_prompt = "I hated this film. The acting"
print("\nNegative prompt:")
print(generate_text(negative_prompt))
# Test with a neutral prompt
neutral_prompt = "This movie was okay. It had"
print("\nNeutral prompt:")
print(generate_text(neutral_prompt))

docker pull rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50
docker run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50
huggingface-cli login #paste your token as needed
vllm serve meta-llama/Llama-3.1-8B-Instruct

docker pull vllm/vllm-openai:latest
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct

curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "What is the meaning of life?",
"max_tokens": 128,
"top_p": 0.95,
"top_k": 20,
"temperature": 0.8
}'