With both AMD and NVIDIA establishing themselves as leading options for AI compute, questions have arisen over how much the software required to run on each differs. In practice, real-world workloads run on both types of hardware with little to no code changes, and we're excited to demonstrate that today.
We'll start by training an image classifier on the CIFAR-10 dataset in PyTorch on both NVIDIA and AMD.
This script loads the dataset, applies the transforms, then trains and evaluates a CNN model that classifies at around 80% accuracy. The trained model is saved to model_save_path, which you can configure to suit your setup.
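Here's a minimal sketch of what such a training script can look like, assuming a small CNN and standard torchvision transforms; the architecture, hyperparameters, and save path below are illustrative rather than prescriptive.

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

# 'cuda' resolves to ROCm on AMD GPUs and to CUDA on NVIDIA GPUs
device = torch.device('cuda')
model_save_path = 'cifar10_cnn.pth'  # adjust to wherever you want the weights saved

# Standard per-channel normalization for CIFAR-10
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False, num_workers=2)

# Small convolutional classifier for 32x32 RGB images
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(128 * 4 * 4, 256), nn.ReLU(), nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Train for a fixed number of epochs
for epoch in range(15):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# Evaluate on the test split
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        predictions = model(images).argmax(dim=1)
        correct += (predictions == labels).sum().item()
        total += labels.size(0)
print(f'Test accuracy: {correct / total:.2%}')

# Save the trained weights
torch.save(model.state_dict(), model_save_path)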
You'll notice that at the top, we set our computation device via device = torch.device('cuda'). In PyTorch's ROCm build, 'cuda' actually maps to AMD GPUs, so there's no need to change any of your scripts.
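If you want to confirm which backend your PyTorch build is using, a quick sanity check looks like this (torch.version.hip is populated only on ROCm builds, and torch.version.cuda only on CUDA builds):

import torch

print(torch.cuda.is_available())      # True on both CUDA and ROCm builds when a GPU is visible
print(torch.cuda.get_device_name(0))  # reports the NVIDIA or AMD GPU name
print(torch.version.cuda)             # version string on CUDA builds, None on ROCm builds
print(torch.version.hip)              # version string on ROCm builds, None on CUDA builds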
For this part of the tutorial, we'll be fine-tuning Facebook's OPT-350m model. We'll begin by setting up the dependencies that significantly speed up LLM training.
This tutorial assumes the following prerequisites. If you're using different versions, please adjust your commands accordingly.
Notice that, as above, this setup step will be the only difference between the two training processes.
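As a rough illustration (the exact wheel index URLs depend on your ROCm or CUDA version, so check pytorch.org before copying), the installs might look like this:

# AMD (ROCm): pick the index URL that matches your ROCm version
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2

# NVIDIA (CUDA): pick the index URL that matches your CUDA version
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu124

# Identical on both platforms
pip3 install transformers datasets trl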
From there, create the following script in the subfolder where you'd like to work.
# imports
from datasets import load_dataset
from trl import SFTTrainer

# get dataset
dataset = load_dataset("imdb", split="train")

# get trainer
trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)

# train
trainer.train()
trainer.save_model("imdb_saved")
This script fine-tunes Facebook's OPT-350m model on the IMDB review dataset and saves the model for later inference. To run inference, use the following script:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
model_path = "imdb_saved"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Move the model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

def generate_text(prompt, max_length=150):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_length,
            num_return_sequences=1,
            no_repeat_ngram_size=2
        )

    # Decode and return the generated text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

# Test with a positive prompt
positive_prompt = "This movie was amazing! The plot"
print("Model loading...")
positive_response = generate_text(positive_prompt)
print("Positive prompt:")
print(positive_response)

# Test with a negative prompt
negative_prompt = "I hated this film. The acting"
print("\nNegative prompt:")
print(generate_text(negative_prompt))

# Test with a neutral prompt
neutral_prompt = "This movie was okay. It had"
print("\nNeutral prompt:")
print(generate_text(neutral_prompt))
Accelerated Inference for Llama 3.1 (and other HF Models)
For this section of the tutorial, we're going to use vLLM, a framework for accelerated LLM inference and serving.
We're going to serve Llama 3.1 8B Instruct through a Docker container. We'll start by pulling the image, then serve the endpoint from inside it. Note that since the Llama models are gated, we'll have to log in through huggingface-cli to use them.
docker pull rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50
docker run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri \
    rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50
huggingface-cli login   # paste your token as needed
vllm serve meta-llama/Llama-3.1-8B-Instruct
Note that you will need to have your Hugging Face API token on hand for both methods.
In a separate terminal, you can now query the endpoints!
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "What is the meaning of life?",
        "max_tokens": 128,
        "top_p": 0.95,
        "top_k": 20,
        "temperature": 0.8
    }'
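Since vLLM exposes an OpenAI-compatible API, you can also query the same endpoint from Python. Here's a quick sketch, assuming you've installed the openai package; the api_key value is just a placeholder because the local server doesn't validate it:

from openai import OpenAI

# Point the OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="What is the meaning of life?",
    max_tokens=128,
    temperature=0.8,
)
print(completion.choices[0].text)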