
TensorWave Docs


Introduction to ROCm

ROCm (Radeon Open Compute) is an open-source GPU compute framework that enables developers to customize their GPU software and collaborate with other developers. It consists of a variety of drivers, development tools, and APIs that enable GPU programming from the low-level kernel up to end-user applications. ROCm can be deployed in several ways, including Docker containers, Spack packages, or your own build from source.

Learn more about ROCm here.
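As a quick example, if you'd rather try ROCm in a container than install it on the host, you can pull one of AMD's development images from Docker Hub (a minimal sketch; the image tag shown is an assumption and may not match the ROCm version you need):

docker pull rocm/dev-ubuntu-22.04
docker run -it --device /dev/kfd --device /dev/dri rocm/dev-ubuntu-22.04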

ROCm Quickstart Guide

Linux

Bare Metal Quickstart


Your node runs Linux (Ubuntu) and comes pre-loaded with tools to simplify your setup process.

Connecting to Your Node

When your bare metal node is ready, you will be provided its username and IP address. To connect to your node, use the following command on a device with one of the SSH keys:

This command will be your primary method of accessing and managing your node.


Node Basics

Your node comes with Ubuntu 22.04 LTS and ROCm. All SSH keys you initially provided grant root user access.
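Once connected, you can confirm what's preinstalled. A quick sketch (the version file below is the usual ROCm install convention, but its location may vary):

lsb_release -a
cat /opt/rocm/.info/version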

Adding More SSH Keys

In the event that you would like to provide more users SSH access to your node, follow these steps:

  1. Copy your key, which should be structured like:

  2. SSH into your node using:

  3. Open your authorized keys file using:

  4. Paste your key at the bottom of the file

  5. Save and exit

  6. Restart the sshd service using:

Monitoring Your GPUs

To get more information on your GPUs, run rocm-smi in your terminal. To monitor them continuously, you can run watch -n 0.5 rocm-smi, which refreshes the GPU IDs and usage information every 0.5 seconds, as shown below:

Downloading and Uploading Files

To download files from your server, use the scp command:

To upload files to your server, you may also use the scp command:

Accessing Remote Services Locally

Often, you will need to access a service exposed on your remote server from your local machine. To do so, use the following command:

For example, let's assume you want to access a Jupyter Notebook that you've exposed on port 8888. From your local command line, you'll want to use the following:

Then, you can access this in your browser at http://localhost:8888.


ssh [username]@[ip_address]
http://localhost:8888
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDxAZn... user@host
ssh [username]@[ip_address]
nano /home/[username]/.ssh/authorized_keys
sudo systemctl restart sshd
============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device  Node  IDs              Temp        Power     Partitions          SCLK    MCLK    Fan  Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Junction)  (Socket)  (Mem, Compute, ID)
==========================================================================================================================
0       2     0x74a1,   39334  45.0°C      142.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
1       3     0x74a1,   34119  42.0°C      135.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
2       4     0x74a1,   664    42.0°C      137.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
3       5     0x74a1,   33001  48.0°C      142.0W    NPS1, SPX, 0        154Mhz  900Mhz  0%   auto  750.0W  0%     0%
4       6     0x74a1,   15994  46.0°C      143.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
5       7     0x74a1,   63627  40.0°C      137.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
6       8     0x74a1,   33811  47.0°C      143.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
7       9     0x74a1,   41883  42.0°C      132.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
==========================================================================================================================
================================================== End of ROCm SMI Log ===================================================
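If you only need specific fields, rocm-smi also accepts targeted flags. For example (a brief sketch using standard ROCm SMI options):

rocm-smi --showmeminfo vram
rocm-smi --showtemp
rocm-smi --showuse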
# For individual files
scp [username]@[ip_address]:/path/to/remote/file /path/to/local/destination
# For directories
scp -r [username]@[ip_address]:/path/to/remote/directory /path/to/local/destination
# For individual files
scp /path/to/local/file [username]@[ip_address]:/path/to/remote/destination
# For directories
scp -r /path/to/local/directory [username]@[ip_address]:/path/to/remote/destination
ssh -L local_port:remote_host:remote_port [username]@[ip_address]
ssh -L 8888:localhost:8888 [username]@[ip_address]

Docker Quickstart

Estimated time: 5 minutes, 6 minutes with buffer.


Pulling a Docker Image

Your node comes with Docker Engine installed, so all Docker functionality should be available to you on your first connection. Begin by pulling the desired image:

You can verify that your image was properly pulled by running the following command and checking for your desired image:

If the pull was successful, your output should look similar to this:

TensorWave's officially supported images can be found here.


Running a Docker Container

In order to run your Docker containers with GPU acceleration, you must mount the GPU devices into the container. For certain applications, you must also add the container to the video group to utilize your GPUs.

Using the docker run Command

Here's an example command to mount the devices and configure the correct permissions:

The usage of each option is as follows:

  • --device /dev/kfd

    • This command mounts the main compute interface to your container.

  • --device /dev/dri

    • This mounts the Direct Rendering Infrastructure devices for your GPUs. To restrict access, append /renderD<node>, where <node> is the ID of the render node you want to mount.

  • --group-add video (optional)

    • This adds your container to the server's video group, which is necessary for certain applications (including PyTorch).

Using docker-compose

The following docker-compose file is equivalent to the command above:

To use it, create a docker-compose.yml file in any subdirectory, and within that subdirectory, run:

Verifying Setup

If done properly, the output of either the run or compose command should be similar to:

For other containers, to verify that your Docker container has access to your GPUs, run both rocm-smi and rocminfo. These commands will reveal information about the GPUs mounted to your container.

If one or both of these commands fails to execute successfully, please double check your running commands.
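For example, a quick one-off check from the host looks like this (a sketch; replace <your-image> with the image you're testing, and note it must include the ROCm SMI tools):

docker run --rm --device /dev/kfd --device /dev/dri --group-add video <your-image> rocm-smi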


docker pull tensorwavehq/hello_world:latest
docker images
REPOSITORY                      TAG            IMAGE ID       CREATED          SIZE
tensorwavehq/hello_world        latest         359e600f7aac   2 minutes ago   61.2GB
    docker run --device /dev/kfd --device /dev/dri --group-add video tensorwavehq/hello_world:latest
    version: '3'
    services:
      hello_world:
        image: tensorwavehq/hello_world:latest
        devices:
          - /dev/kfd
          - /dev/dri
        group_add:
          - video
    docker compose up
    CUDA available: True
    Number of GPUs: 8
    GPU 0: AMD Instinct MI300X
    GPU 1: AMD Instinct MI300X
    GPU 2: AMD Instinct MI300X
    GPU 3: AMD Instinct MI300X
    GPU 4: AMD Instinct MI300X
    GPU 5: AMD Instinct MI300X
    GPU 6: AMD Instinct MI300X
    GPU 7: AMD Instinct MI300X

    Kubernetes Quickstart

    Estimated time: 6 minutes, 8 minutes with buffer.


    Installing Tooling

    We'll start by installing kubectl and k3d. kubectl is a command-line tool for managing Kubernetes clusters, and k3d is a lightweight wrapper for running k3s in Docker. Download and install the latest release of kubectl using the following commands:

    Next, install the latest release of k3d using this command:

    Learn more about installing kubectl here, and installing k3d here.


    Creating and Configuring a Cluster

    Now, you must create a cluster:

    Then go ahead and check your context:

    This should list the cluster you just created, but if not, run the following command to switch to the needed context:

    To run pods with GPU acceleration, the cluster must also be set up with the AMD GPU device plugin and node labeller. You can install these using the following commands:

    Then, go ahead and create your deployment manifest. Make a directory for the manifest, change into it, and open up a deployment.yaml file.

    From there, paste in the following yaml:

    You'll notice there are a few extra configurations we added. These are necessary for running the pod with GPU acceleration.

      • resources.limits

        • The amd.com/gpu: 1 limit specifies that the container requires 1 AMD GPU. You must explicitly request GPU resources so that Kubernetes can schedule the pod on a node with an available AMD GPU.

      • volumes

        • These definitions allow the cluster to use the necessary volumes from the host for utilizing the AMD GPUs.

      • volumeMounts

        • These mounts correspond to the above volumes, allowing the container to access the GPU hardware.

      • securityContext

        • This security context runs the containers in the pod as group ID 110, the render group, which is necessary for PyTorch to detect the devices properly (PyTorch is used in the hello world container).

    Continue by applying the manifest using the following:

    This should take a few minutes to create the container. You can monitor the status here:

    Once this output displays that STATUS is Completed, you're ready to check output. Running:

    Should give an output of:

    You'll notice that you only have one GPU. That's because, as covered earlier, we specified a resource limit of one. You may raise or lower this number as necessary.
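    For example, requesting two GPUs per pod is just a change to the limit in your manifest (a sketch; the rest of the deployment stays the same):

    resources:
      limits:
        amd.com/gpu: 2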


    Teardown

    Navigate back to your base directory and remove your k8s-hello-world folder:

    Hugging Face Quickstart

    Estimated time: 7 minutes, 9 minutes with buffer.


    Hugging Face is an AI/ML platform for the entire model pipeline. For this quickstart, we'll walk you through accelerated inference using a pretrained model.

    Learn more about Hugging Face here.


    Installing Dependencies

    Because PyTorch with ROCm comes preloaded on your node, you will not need to install it. However, you will still need a couple of libraries in order to run our quickstart script. Begin by installing transformers using the following command:

    This should take no more than a few minutes.


    Creating and Running Inference Script

    Next, create and navigate to a new directory for your script:

    Then, create a new script:

    Within this script, paste the following code and exit:

    After doing so, you may run the script using the following:

    This runs a small model on one GPU, but feel free to swap out your model and prompts to your liking, then map to the proper devices. The output should be similar to:


    Teardown

    Navigate back to your base directory and remove your hf-hello-world folder:

    Slurm Quickstart

    Slurm provides a multi-tenant framework for managing compute resources and jobs that span large clusters.

    Overview

    TensorWave Slurm combines the power of Slurm, the industry-standard workload manager for HPC and AI, with the flexibility of a Kubernetes-native orchestration layer. This integration delivers a modern, multi-tenant environment that scales seamlessly across AMD Instinct GPU clusters—enabling teams to run distributed training, fine-tuning, and simulation workloads without managing the underlying infrastructure.

    With TensorWave, you get the familiar Slurm interface running on top of a cloud-native control plane that provides automated scheduling, easy scaling, and container-based execution.


    Why Slurm on Kubernetes?

    Traditional Slurm deployments were designed for static on-prem clusters. TensorWave modernizes that model by running Slurm inside Kubernetes, unlocking:

    • Scalable compute pools — resize your Slurm cluster within your K8s environment.

    • Container-native workflows — integrate directly with your existing Docker or Enroot environments.

    • Multi-tenant isolation — each user or team runs in a secure namespace with defined resource limits.

    The result is a unified, cloud-native scheduling experience that bridges HPC scalability with Kubernetes reliability.


    Quickstart Example

    1. Connect to Your Login Node

    Each Slurm environment provides a login node, your interactive entry point for running Slurm commands. Behind the scenes, this login node runs as a managed Kubernetes pod, with the same Slurm interface (srun, sinfo, sbatch), and the benefits of cloud-native orchestration.

    Once connected, you’ll have access to all standard Slurm utilities and your team’s partitions. From here, you can submit jobs, monitor queues, and launch multi-node workloads just as you would on a traditional HPC cluster.


    2. Inspect Available Resources

    List available partitions and node states:

    Example:

    Even though these nodes are dynamically managed by Kubernetes, the Slurm CLI remains identical to traditional HPC clusters.
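    For example, the usual commands work unchanged (a quick sketch; the partition name comes from the sinfo output above):

    squeue -u $USER
    scontrol show partition gpuworker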


    3. Launch a Multi-Node Job

    To verify connectivity and RDMA functionality, run a distributed RCCL test across four nodes (32 GPUs total):

    Slurm automatically handles:

    • GPU and node allocation

    • Network interface binding

    • MPI coordination

    Containers and More Info

    Often, users will want to run their HPC payloads in containers. You can learn more about this in the Pyxis Quickstart section.

    Enroot Containers

    TensorWave Slurm uses Enroot as its lightweight, high-performance container runtime for HPC and AI workloads.

    Unlike traditional container engines, Enroot runs entirely in user space with no privileged daemons or root access required, making it ideal for multi-tenant and secure compute environments.

    Enroot executes standard Docker or OCI images as unprivileged user processes, unpacking each image into an isolated filesystem that can be shared across nodes. It preserves direct access to GPUs, high-speed interconnects, and local storage, ensuring your jobs are performant inside containers.

    You don’t need to run Enroot commands directly; TensorWave Slurm handles that automatically through Pyxis, which integrates Enroot with familiar Slurm tools like srun and sbatch. Together, they allow you to launch containerized jobs using the same workflow you already know with the added benefits of portability and reproducibility.
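    As a quick illustration, a containerized command through Pyxis is just an ordinary srun call with a container flag added (a sketch; substitute an image your cluster can actually pull):

    srun --container-image=rocm/pytorch:latest rocm-smi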


    curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
    wget -q -O - https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
    Optional: Importing an Image with Enroot

    Although Pyxis automatically handles Enroot under the hood, you can manually import container images for debugging or pre-caching.

    For example, to pull and unpack a PyTorch ROCm image locally:

    This workflow downloads the image, converts it into an Enroot container bundle, and runs it as an unprivileged user process.

    You’ll typically never need to do this when submitting jobs through Pyxis, but it’s a useful way to verify container contents or pre-stage larger images.

    # Import a container image into Enroot format
    enroot import docker://tensorwavehq/pytorch-bnxt:rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2
    
    # Create a runnable instance
    enroot start tensorwavehq+pytorch-bnxt+rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2.sqsh
    securityContext:
        runAsGroup: 110
    k3d cluster create hello-world-cluster
    kubectl config current-context
    kubectl config use-context k3d-hello-world-cluster
    kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
    kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-labeller.yaml
    mkdir k8s-hello-world
    cd k8s-hello-world
    nano deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: hello-world
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: hello-world
      template:
        metadata:
          labels:
            app: hello-world
        spec:
          containers:
            - name: hello-world
              image: tensorwavehq/hello_world:latest
              resources:
                limits:
                  amd.com/gpu: 1
              volumeMounts:
                - name: dev-kfd
                  mountPath: /dev/kfd
                - name: dev-dri
                  mountPath: /dev/dri
              securityContext:
                runAsGroup: 110
          volumes:
            - name: dev-kfd
              hostPath:
                path: /dev/kfd
            - name: dev-dri
              hostPath:
                path: /dev/dri
    resources:
      limits:
        amd.com/gpu: 1
    volumes:
      - name: dev-kfd
        hostPath:
          path: /dev/kfd
      - name: dev-dri
        hostPath:
          path: /dev/dri
    volumeMounts:
      - name: dev-kfd
        mountPath: /dev/kfd
      - name: dev-dri
        mountPath: /dev/dri
    kubectl apply -f deployment.yaml
    kubectl get pods -l app=hello-world
    kubectl logs -l app=hello-world
    CUDA available: True
    Number of GPUs: 1
    GPU 0: AMD Instinct MI300X
    cd ~
    rm -rf k8s-hello-world/
    pip install transformers
    mkdir hf-hello-world
    cd hf-hello-world
    nano hello-world.py
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import time
    
    # Load model without quantization
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
    
    # Move model to GPU
    model = model.to("cuda")
    
    # Input text
    print("Warming up model...")
    input_text = "Hello, my name is"
    inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
    warmup = model.generate(**inputs, max_new_tokens=20)
    
    print("Preparing text...")
    input_text = "According to all known laws of aviation, there is no way that a bee should be able to fly. Its wings are too small to get its fat little body off the ground. The bee, of course, flies anyway because"
    inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
    
    print("Starting inference...")
    start = time.time()
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        no_repeat_ngram_size=2
    )
    t = time.time()-start
    print(f"inference time: {t}")
    
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    python3 hello-world.py
    Warming up model...
    Preparing text...
    Starting inference...
    inference time: 0.4770219326019287
    According to all known laws of aviation, there is no way that a bee should be able to fly. Its wings are too small to get its fat little body off the ground. The bee, of course, flies anyway because it can.
    Well, if you fly with a little fat of your body, you can fly pretty damn well.  You just have to be careful.
    cd ~
    rm -rf hf-hello-world/
    ssh <username>@<slurm-login-endpoint>
    sinfo
    PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
    gpuworker*     up   infinite   1024   idle compute-[0-1023]
    srun -N4 \
    --mpi=pmix \
    --ntasks-per-node=8 \
    --gpus-per-node=8 \
    --cpus-per-task=16 \
    /usr/local/bin/rccl-tests/build/all_reduce_perf -b 32 -e 8g -f 2 -g 1

    Running Jobs in Pyxis

    TensorWave Slurm integrates Pyxis, a container runtime plugin for Slurm that enables users to run containerized workloads directly within their jobs.

    This integration lets you launch distributed AI or HPC jobs inside optimized ROCm containers while maintaining full GPU, RDMA, and filesystem performance.

    Containers are the preferred way to run workloads in TensorWave Slurm. They ensure consistent environments across nodes, simplify dependency management, and let you reproduce results reliably.


    Running Your First Containerized Job

    In this example, you’ll run a multi-node RCCL performance test using Pyxis. This verifies that your containerized environment can access GPUs, RDMA interfaces, and Slurm’s MPI orchestration.

    1. Create a new job script named rccl-pyxis.sbatch:

    2. Submit the job to Slurm:

    3. Monitor progress:

    Once complete, your results will appear under the results/ directory, with each job’s output and error logs named using the Slurm job ID.
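    For example, you can follow a job's output while it runs (a sketch; substitute the job ID reported by squeue):

    tail -f results/rccl_multi_node-<jobid>.out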


    Using a Pre-Staged SquashFS Image

    Instead of pulling a container from a registry, you can point Slurm directly to a pre-staged SquashFS (.sqsh) image. This is often faster and preferred for large models or shared environments.

    Example:

    Using a local .sqsh file avoids repeated network pulls and ensures consistent environments across jobs.


    Pyxis Flags

    Pyxis extends Slurm with several container-related flags that control how your job interacts with the container environment. Below are the most commonly used options:

    • --container-image

      • Specifies the container to run. Accepts Docker/OCI URLs or local .sqsh images.

    • --container-writable

      • Makes the container filesystem writable during execution. Useful for logs, checkpoints, or temporary files.

    • --container-mounts=/src:/dst[,/src2:/dst2]

      • Binds local or shared directories into the container. Multiple mounts can be separated by commas.

    • --container-workdir=/path

      • Sets the working directory inside the container (defaults to /).

    • --container-name=<name>

      • Assigns a name to the running container instance, useful for debugging or monitoring.


    Learn More

    For advanced configuration options and the full list of supported flags, see the official containers documentation from SchedMD: https://slurm.schedmd.com/containers.html

    PyTorch Quickstart

    Estimated time: 2 minutes, 3 minutes with buffer


    Using ROCm Devices

    PyTorch is officially supported by AMD for ROCm, and should be plug-and-play once set up correctly.

    Learn more about installing PyTorch with ROCm here.

    AMD GPU devices are configured and accessed in exactly the same way as NVIDIA GPU devices. This means that any workflow that sets the PyTorch device with torch.device("cuda") will work out-of-the-box, assuming PyTorch can detect your GPUs.
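    For example, a minimal sketch of the usual device-selection pattern, identical to what you'd write for NVIDIA hardware:

    import torch

    device = torch.device("cuda")  # on a ROCm build, "cuda" maps to your AMD GPUs
    x = torch.randn(1024, 1024, device=device)
    print(x.device)  # cuda:0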

    Debugging

    To test whether your system is configured to use PyTorch with GPU acceleration, start a new file for a couple of debugging commands:

    The following code will return a boolean indicating whether your GPUs are being detected by PyTorch:

    Now, go ahead and run your file using:

    If this does not return True, there are a couple of things to check.

    PyTorch Setup

    One reason the above check may fail is that an incorrect (non-ROCm) build of PyTorch is installed. To check, add the following line to your debugging file:

    You should get an output similar to:

    Or:

    If this output is not a ROCm-enabled PyTorch build, you must reinstall PyTorch with the correct version. One way to do this would be:

    Checking ROCm Setup

    To ensure ROCm is properly configured, run the following command:

    The output should be similar to (depending on your number of devices):

    If this is not the case, ROCm is not properly installed. More commonly, however, you will have issues running the following command:

    The output should be of the format:

    If this command errors, it's most likely that devices are not properly mounted, or your user is not a part of the render group.
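    If it's a group-membership issue, adding your user to the render and video groups and starting a new session usually resolves it (a sketch):

    sudo usermod -aG render,video $USER
    # log out and back in (or run `newgrp render`) for the change to take effect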


    Teardown

    Navigate back to your base directory and remove your pytorch-hello-world folder:

    #!/bin/bash
    #SBATCH --job-name=rccl_multi_node
    #SBATCH --output=results/rccl_multi_node-%j.out
    #SBATCH --error=results/rccl_multi_node-%j.out
    #SBATCH --ntasks-per-node=8
    #SBATCH --gpus-per-node=8
    #SBATCH --cpus-per-task=16  
    #SBATCH -N4
    
    CONTAINER_IMAGE='tensorwavehq/pytorch-bnxt:rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2'
    
    export NCCL_IB_QPS_PER_CONNECTION=2
    export NCCL_BUFFSIZE=8388608
    export UCX_NET_DEVICES=eno0
    
    # Minimize unnecessary logs when running with Pyxis
    export OMPI_MCA_btl=^openib
    export PMIX_MCA_gds=hash
    export UCX_WARN_UNUSED_ENV_VARS=n
    
    srun --mpi=pmix \
      --container-writable \
      --container-name=rccl-pyxis-run \
      --container-image=${CONTAINER_IMAGE} \
      /usr/local/bin/rccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1
    sbatch --nodes=<number-of-nodes> rccl-pyxis.sbatch
    squeue -u $USER
    # You can set the image in the previous example to a local .sqsh file
    CONTAINER_IMAGE='tensorwavehq+pytorch-bnxt+rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0_bnxt_re23301522_v0.2.sqsh'
    mkdir pytorch-hello-world
    cd pytorch-hello-world
    nano debug.py
    import torch
    print(torch.cuda.is_available())
    python3 debug.py
    print(torch.__version__)
    [torch_version]a0+git[hash]
    [torch_version].dev[date]+rocm[rocm_version]
    pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.1/
    rocm-smi
    ========================================= ROCm System Management Interface =========================================
    =================================================== Concise Info ===================================================
    Device  [Model : Revision]    Temp        Power     Partitions      SCLK    MCLK    Fan  Perf  PwrCap  VRAM%  GPU%  
            Name (20 chars)       (Junction)  (Socket)  (Mem, Compute)                                                  
    ====================================================================================================================
    0       [0x74a1 : 0x00]       45.0°C      142.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%    
            AMD Instinct MI300X                                                                                         
    1       [0x74a1 : 0x00]       42.0°C      135.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%    
            AMD Instinct MI300X                                                                                         
    2       [0x74a1 : 0x00]       42.0°C      137.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%    
            AMD Instinct MI300X                                                                                         
    3       [0x74a1 : 0x00]       48.0°C      141.0W    NPS1, SPX       138Mhz  900Mhz  0%   auto  750.0W    0%   0%    
            AMD Instinct MI300X                                                                                         
    4       [0x74a1 : 0x00]       46.0°C      142.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%    
            AMD Instinct MI300X                                                                                         
    5       [0x74a1 : 0x00]       40.0°C      137.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%    
            AMD Instinct MI300X                                                                                         
    6       [0x74a1 : 0x00]       47.0°C      142.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%    
            AMD Instinct MI300X                                                                                         
    7       [0x74a1 : 0x00]       42.0°C      132.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%    
            AMD Instinct MI300X                                                                                         
    ====================================================================================================================
    =============================================== End of ROCm SMI Log ================================================
    rocminfo
    ROCk module version 6.7.0 is loaded
    =====================    
    HSA System Attributes    
    =====================    
    ....
    cd ~
    rm -rf pytorch-hello-world/

    Easy Porting: NVIDIA to AMD Guide

    Introduction

    With both AMD and NVIDIA establishing themselves as top offerings for AI compute, questions have arisen over the differences in software required to run on each. Real-world workloads can run on both types of hardware with little to no code changes, and we're excited to demonstrate this further today.

    We'll start by training an image classifier on the CIFAR-10 dataset in PyTorch on both NVIDIA and AMD.

    Learn more about the CIFAR-10 dataset here.

    We'll then move on to a more practical use-case: fine-tuning Llama 3.1 8B on a corpus of SQL data.

    Learn more about Llama 3.1 here.


    Training an Image Classification Model

    To start, you're going to need to install PyTorch locally. Install the appropriate version depending on your hardware.

    This will be the only difference in process for this tutorial.

    Next, navigate to the directory you'd like to set this tutorial up in. From there, create the following Python script:

    This script loads the dataset, transforms it, then trains and evaluates a CNN model that can classify at around 80% accuracy. This model gets saved at the model_save_path, which can be configured on your own.

    You'll notice that at the top, we set our computation device via device = torch.device('cuda'). In PyTorch's ROCm builds, 'cuda' actually points to your AMD GPUs, so there is no need to change any of your scripts.
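    You can confirm this on your node with a quick sketch like the following (the reported name will be your AMD device rather than an NVIDIA one):

    import torch
    print(torch.cuda.is_available())      # True on a working ROCm setup
    print(torch.cuda.get_device_name(0))  # e.g. an AMD Instinct device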

    Next, create the following inference script:

    This script loads the model generated by the previous script, then classifies the specified image in image_url into one of the 10 categories:

    That's it!


    Fine-Tuning LLMs

    For the purposes of this tutorial, we'll be fine-tuning Facebook's OPT-350m model. We'll begin by setting up our dependencies for significantly speeding up LLM training.

    This tutorial assumes the following prerequisites. If you're using different versions, please adjust your commands accordingly.

    • Linux (Ubuntu)

    • CUDA 12.1 or ROCm 6.2

    Begin by installing the needed dependencies.

    Notice that, as above, this will be the only difference between the two training processes.

    From there, create the following script in a subfolder you'd like to work in.

    This script trains Facebook's OPT-350m model on the IMDB review dataset and saves the model for later inference. To conduct inference, use the following script:


    Accelerated Inference for Llama 3.1 (and other HF Models)

    For this section of the tutorial, we're going to use vLLM, a framework for accelerated LLM inference and serving.

    More information on vLLM can be found here.

    We're going to serve Llama 3.1 8B Instruct through Docker containers. We'll start by pulling the images and serving the endpoints from there. Note that since the Llama models are gated, we'll have to log in through huggingface-cli to use them.

    Note that you will need your Hugging Face API token for both methods.

    In a separate terminal, you can now query the endpoints!

    pip install requests
    pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.2/
    pip install requests
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torchvision import datasets, transforms
    from torch.utils.data import DataLoader
    import os
    
    device = torch.device('cuda')
    
    class SimpleCNN(nn.Module):
        def __init__(self, num_classes=10):
            super(SimpleCNN, self).__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 64, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2, 2),
                nn.Conv2d(64, 128, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2, 2),
                nn.Conv2d(128, 256, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2, 2)
            )
            self.classifier = nn.Sequential(
                nn.Dropout(0.5),
                nn.Linear(256 * 4 * 4, 512),
                nn.ReLU(inplace=True),
                nn.Dropout(0.5),
                nn.Linear(512, num_classes)
            )
    
        def forward(self, x):
            x = self.features(x)
            x = x.view(x.size(0), -1)
            x = self.classifier(x)
            return x
    
    def train_and_evaluate(model, train_loader, test_loader, num_epochs=10, learning_rate=0.01):
        model = model.to(device)
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)
    
        for epoch in range(num_epochs):
            model.train()
            for images, labels in train_loader:
                images, labels = images.to(device), labels.to(device)
                optimizer.zero_grad()
                outputs = model(images)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()
    
            model.eval()
            correct = 0
            total = 0
            with torch.no_grad():
                for images, labels in test_loader:
                    images, labels = images.to(device), labels.to(device)
                    outputs = model(images)
                    _, predicted = torch.max(outputs.data, 1)
                    total += labels.size(0)
                    correct += (predicted == labels).sum().item()
            
            accuracy = 100 * correct / total
            print(f'Epoch [{epoch+1}/{num_epochs}], Accuracy: {accuracy:.2f}%')
    
        return model
    
    def save_model(model, path):
        torch.save(model.state_dict(), path)
        print(f"Model saved to {path}")
    
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
    
    train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    test_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
    
    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
    
    model = SimpleCNN()
    trained_model = train_and_evaluate(model, train_loader, test_loader)
    
    model_save_path = 'cifar10_cnn_model.pth'
    save_model(trained_model, model_save_path)
    
    trained_model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = trained_model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    
    print(f'Final Test Accuracy: {100 * correct / total:.2f}%')
    import torch
    import torch.nn as nn
    from torchvision import transforms
    from PIL import Image
    import requests
    from io import BytesIO
    
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model_save_path = 'cifar10_cnn_model.pth'
    
    def load_model(model, path):
        model.load_state_dict(torch.load(path, map_location=device))
        model.eval()
        print(f"Model loaded from {path}")
        return model
    
    class SimpleCNN(nn.Module):
        def __init__(self, num_classes=10):
            super(SimpleCNN, self).__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 64, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2, 2),
                nn.Conv2d(64, 128, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2, 2),
                nn.Conv2d(128, 256, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2, 2)
            )
            self.classifier = nn.Sequential(
                nn.Dropout(0.5),
                nn.Linear(256 * 4 * 4, 512),
                nn.ReLU(inplace=True),
                nn.Dropout(0.5),
                nn.Linear(512, num_classes)
            )
    
        def forward(self, x):
            x = self.features(x)
            x = x.view(x.size(0), -1)
            x = self.classifier(x)
            return x
    
    def predict_image_from_url(model, image_url):
        transform = transforms.Compose([
            transforms.Resize((32, 32)),  # CIFAR10 images are 32x32
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        ])
        
        # Download the image
        response = requests.get(image_url)
        image = Image.open(BytesIO(response.content)).convert('RGB')
        
        image = transform(image).unsqueeze(0).to(device)
        model.eval()
        with torch.no_grad():
            output = model(image)
            _, predicted = torch.max(output, 1)
        
        classes = ('plane', 'car', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck')
        return classes[predicted.item()]
    
    # Initialize and load the model
    model = SimpleCNN()
    model = load_model(model, model_save_path)
    model = model.to(device)
    
    # Predict from URL
    image_url = 'https://images.twinkl.co.uk/tw1n/image/private/t_630/u/ux/frog-2_ver_1.jpg'
    predicted_class = predict_image_from_url(model, image_url)
    print(f"The image is predicted to be: {predicted_class}")
    'plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
    pip install packaging ninja accelerate wandb
    export GPU_ARCHS="gfx942"
    export ROCM_HOME="/opt/rocm"
    pip install --no-deps --force-reinstall 'https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_multi-backend-refactor/bitsandbytes-0.44.1.dev0-py3-none-manylinux_2_24_x86_64.whl' 
    pip install trl
    pip install --no-deps peft
    pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121
    pip install packaging ninja accelerate wandb bitsandbytes trl
    pip install --no-deps peft
    # imports
    from datasets import load_dataset
    from trl import SFTTrainer
    
    # get dataset
    dataset = load_dataset("imdb", split="train")
    
    # get trainer
    trainer = SFTTrainer(
        "facebook/opt-350m",
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=512,
    )
    
    # train
    trainer.train()
    
    trainer.save_model("imdb_saved")
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch
    
    # Load the model and tokenizer
    model_path = "imdb_saved" 
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    
    # Move the model to GPU if available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    
    def generate_text(prompt, max_length=150):
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        
        # Generate
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_length,
                num_return_sequences=1,
                no_repeat_ngram_size=2
            )
        
        # Decode and return the generated text
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        return generated_text
    
    # Test with a positive prompt
    positive_prompt = "This movie was amazing! The plot"
    print("Model loading...")
    positive_response = generate_text(positive_prompt)
    print("Positive prompt:")
    print(positive_response)
    
    # Test with a negative prompt
    negative_prompt = "I hated this film. The acting"
    print("\nNegative prompt:")
    print(generate_text(negative_prompt))
    
    # Test with a neutral prompt
    neutral_prompt = "This movie was okay. It had"
    print("\nNeutral prompt:")
    print(generate_text(neutral_prompt))
    docker pull rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50
    docker run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50
    huggingface-cli login #paste your token as needed
    vllm serve meta-llama/Llama-3.1-8B-Instruct
    docker pull vllm/vllm-openai:latest
    docker run --runtime nvidia --gpus all \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
        -p 8000:8000 \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model meta-llama/Llama-3.1-8B-Instruct
    curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "What is the meaning of life?",
    "max_tokens": 128,
    "top_p": 0.95,
    "top_k": 20,
    "temperature": 0.8
    }'