Hugging Face Quickstart
Estimated time: 7 minutes (9 minutes with buffer).
Installing Dependencies
pip install transformers torch

Creating and Running the Inference Script
The script below assumes an NVIDIA GPU with a CUDA-enabled build of PyTorch, since the model is moved to "cuda".
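Before creating the script, you can optionally confirm that both libraries import and that a CUDA device is visible (a quick sanity check, not part of the original steps):

```shell
python -c "import torch, transformers; print(transformers.__version__, torch.cuda.is_available())"
```

If this prints `False` for CUDA, the script below will fail at the `model.to("cuda")` step.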
mkdir hf-hello-world
cd hf-hello-world
nano hello-world.py

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
# Load model without quantization
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
# Move model to GPU
model = model.to("cuda")
# Warm up the model: the first generate() call pays one-time CUDA and allocation costs
print("Warming up model...")
input_text = "Hello, my name is"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
warmup = model.generate(**inputs, max_new_tokens=20)
print("Preparing text...")
input_text = "According to all known laws of aviation, there is no way that a bee should be able to fly. Its wings are too small to get its fat little body off the ground. The bee, of course, flies anyway because"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
print("Starting inference...")
start = time.time()
outputs = model.generate(
    **inputs,
    max_new_tokens=50,        # generate up to 50 new tokens
    do_sample=True,           # sample from the distribution instead of greedy decoding
    temperature=0.7,          # sharpen the distribution (values < 1.0 are less random)
    top_k=50,                 # sample only from the 50 most likely tokens
    top_p=0.95,               # further restrict to the smallest set covering 95% probability
    no_repeat_ngram_size=2    # never let the same 2-gram appear twice
)
t = time.time() - start
print(f"Inference time: {t:.2f} s")
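# Optional helper (not in the original script): convert the measured time into
# throughput. generate() returns the prompt plus the new tokens in one tensor,
# so subtract the prompt length to count only newly generated tokens.
def tokens_per_second(output_ids, prompt_len, seconds):
    return (output_ids.shape[-1] - prompt_len) / seconds

# e.g. print(f"{tokens_per_second(outputs, inputs['input_ids'].shape[-1], t):.1f} tokens/s")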
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Teardown
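A minimal cleanup sketch, assuming you only want to remove the example directory; the downloaded model weights live separately in the Hugging Face cache (the commented path is the default cache location):

```shell
cd ..
rm -rf hf-hello-world
# Optional: also delete the cached opt-350m weights
# rm -rf ~/.cache/huggingface/hub/models--facebook--opt-350m
```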