Hugging Face Quickstart

Estimated time: 7 minutes (9 minutes with buffer).


Hugging Face is an AI/ML platform for the entire model pipeline. For this quickstart, we'll walk you through accelerated inference using a pretrained model.

Learn more about Hugging Face at huggingface.co.


Installing Dependencies

Because PyTorch with ROCm comes preloaded on your device, you do not need to install it yourself. You will, however, need the transformers library to run the quickstart script. Install it with the following command:

pip install transformers

This should take no more than a few minutes.
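Before moving on, you can optionally confirm that both dependencies are importable. This is a minimal sketch using only the standard library; the has_package helper is our own name, not part of transformers:

```python
import importlib.util

def has_package(name: str) -> bool:
    """Return True if a package with this name can be imported."""
    return importlib.util.find_spec(name) is not None

# Confirm the quickstart's dependencies are visible to Python.
for pkg in ("torch", "transformers"):
    status = "found" if has_package(pkg) else "MISSING"
    print(f"{pkg}: {status}")
```

If either package is reported missing, re-run the pip command above before continuing.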


Creating and Running Inference Script

Next, create a new directory for your script and navigate into it:

mkdir hf-hello-world
cd hf-hello-world

Then, create a new script:

nano hello-world.py

Paste the following code into the file, then save and exit:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time

# Load model without quantization
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# Move the model to the GPU. PyTorch with ROCm reuses the "cuda"
# device string, so this works on AMD GPUs as well.
model = model.to("cuda")

# Warm up the model so one-time setup costs are excluded from timing
print("Warming up model...")
input_text = "Hello, my name is"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
warmup = model.generate(**inputs, max_new_tokens=20)

print("Preparing text...")
input_text = "According to all known laws of aviation, there is no way that a bee should be able to fly. Its wings are too small to get its fat little body off the ground. The bee, of course, flies anyway because"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

print("Starting inference...")
start = time.time()
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    no_repeat_ngram_size=2
)
t = time.time()-start
print(f"inference time: {t}")

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Once saved, run the script with the following:

python3 hello-world.py

This runs a small model on a single GPU, but feel free to swap in a model and prompts of your choosing, making sure the model and input tensors are moved to the correct devices. The output should look similar to:

Warming up model...
Preparing text...
Starting inference...
inference time: 0.4770219326019287
According to all known laws of aviation, there is no way that a bee should be able to fly. Its wings are too small to get its fat little body off the ground. The bee, of course, flies anyway because it can.
Well, if you fly with a little fat of your body, you can fly pretty damn well.  You just have to be careful.
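Beyond raw wall-clock time, you can estimate decoding throughput from the run. In the script, the number of newly generated tokens is outputs.shape[1] - inputs["input_ids"].shape[1] (at most max_new_tokens, since generation can stop early at an end-of-sequence token). A minimal sketch, with a helper name of our own choosing:

```python
def tokens_per_second(n_new_tokens: int, elapsed_s: float) -> float:
    """Decoding throughput for a single generate() call."""
    return n_new_tokens / elapsed_s

# With the sample run above: about 50 new tokens in ~0.477 s.
print(f"{tokens_per_second(50, 0.477):.1f} tokens/s")
```

This gives a rough figure useful for comparing models or generation settings on the same hardware.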

Teardown

Navigate back to your home directory and remove your hf-hello-world folder:

cd ~
rm -rf hf-hello-world/
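Note that removing the project folder does not delete the downloaded model weights: by default, Hugging Face caches them under ~/.cache/huggingface (configurable via the HF_HOME environment variable). A small standard-library sketch (the dir_size_mb helper is ours) to check how much space the cache is using:

```python
from pathlib import Path

def dir_size_mb(path: Path) -> float:
    """Total size in MB of all files under path (0 if it doesn't exist)."""
    if not path.exists():
        return 0.0
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1e6

# Default Hugging Face cache location.
cache_dir = Path.home() / ".cache" / "huggingface"
print(f"{cache_dir}: {dir_size_mb(cache_dir):.1f} MB")
```

If you want to reclaim that space too, you can delete the cache directory, at the cost of re-downloading models the next time you use them.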
