Introduction

Welcome to Chapter 10, where we unlock the incredible power of Large Language Models (LLMs) by teaching them new tricks! You’ve already built a strong foundation in deep learning, understood neural network architectures, and learned how to train and evaluate models. Now, imagine taking a highly intelligent, pre-trained LLM and making it even smarter for your specific needs. That’s exactly what fine-tuning allows us to do.

In this chapter, we’ll explore the fascinating world of fine-tuning LLMs. We’ll start by understanding why it’s a game-changer for specialized applications, diverging from the general-purpose nature of base models. We’ll then dive deep into modern, efficient techniques like Parameter-Efficient Fine-Tuning (PEFT), which allows us to adapt these massive models without needing supercomputers. Get ready for hands-on exercises where you’ll take a pre-trained LLM, prepare a custom dataset, and fine-tune it to perform a new task, boosting its performance significantly.

This chapter builds directly on your knowledge from previous sections, particularly those on neural network architectures, model training workflows, and evaluation metrics. We’ll leverage the Hugging Face transformers library, which has become the industry standard for working with LLMs, and introduce you to the peft library for efficient fine-tuning. By the end, you’ll not only understand the theory but also have the practical skills to adapt LLMs for real-world scenarios.

Core Concepts

Large Language Models are powerful, but they are often trained on vast, general-purpose datasets, making them good at many things but not great at specific, niche tasks. This is where fine-tuning comes in.

What is Fine-Tuning?

Fine-tuning is the process of taking a pre-trained model (like a base LLM) and further training it on a smaller, task-specific dataset. Think of it like this: an LLM is a brilliant generalist student who knows a lot about everything. Fine-tuning is sending that student to a specialized master’s program where they focus intensely on one subject, becoming an expert in that particular domain or task.

Why Fine-Tune?

  1. Domain Adaptation: Make an LLM perform better on text from a specific industry (e.g., legal documents, medical reports, financial news) that might have unique terminology or styles.
  2. Task Specialization: Teach an LLM a new skill or improve its performance on a specific task (e.g., sentiment analysis, summarization of a particular type of content, code generation for a specific framework, answering questions about a private knowledge base).
  3. Improved Performance: Even for tasks the base LLM can do, fine-tuning can significantly boost accuracy, coherence, and relevance.
  4. Reduced Inference Cost: Sometimes, a smaller, fine-tuned model can outperform a larger, general-purpose model on a specific task, leading to more efficient deployment.

Types of Fine-Tuning

Historically, fine-tuning meant updating all the parameters of the pre-trained model. While effective, this is incredibly resource-intensive for today’s massive LLMs, requiring significant GPU memory and computational power. This led to the rise of Parameter-Efficient Fine-Tuning (PEFT) techniques.

1. Full Fine-Tuning: This involves updating all the weights of the pre-trained model.

  • Pros: Can achieve the highest performance, especially if the new task is very different from the pre-training task.
  • Cons: Extremely computationally expensive, requires large amounts of GPU memory, and can lead to “catastrophic forgetting” where the model forgets its general knowledge.

2. Parameter-Efficient Fine-Tuning (PEFT): PEFT methods focus on updating only a small fraction of the model’s parameters, or adding a few new, small trainable layers, while keeping the vast majority of the original pre-trained weights frozen. This dramatically reduces computational cost, memory usage, and storage for the fine-tuned model.

Let’s visualize the difference:

flowchart TD
    A["Pre-trained LLM"] -->|"Full Fine-Tuning"| C["Update ALL parameters"]
    C --> D["High compute/memory cost"]
    C --> E["Store full new model"]
    A -->|"PEFT (e.g., LoRA)"| F["Add/Update SMALL number of parameters"]
    F --> G["Low compute/memory cost"]
    G --> H["Store small 'adapter' weights"]
    H --> I["Combine with base model at inference"]

Key PEFT Techniques (as of 2026):

  • LoRA (Low-Rank Adaptation of Large Language Models): This is arguably the most popular and effective PEFT method. LoRA injects small, trainable low-rank matrices into layers of the pre-trained transformer architecture. During fine-tuning, only these new matrices are updated, while the original LLM weights remain frozen. This means you only need to store these small “adapter” weights, which are orders of magnitude smaller than the full model. At inference time, the adapter weights can either be merged into the base model or applied alongside the frozen weights.
  • QLoRA (Quantized LoRA): An extension of LoRA that further reduces memory usage by quantizing the base LLM weights (e.g., to 4-bit precision) and performing LoRA fine-tuning on these quantized weights. This allows fine-tuning even larger models on consumer-grade GPUs.
  • Prompt Tuning/Prefix Tuning: These methods add a small, trainable “soft prompt” or “prefix” to the input sequence, which guides the model’s behavior without modifying its internal weights.

For our hands-on example, we will focus on LoRA due to its widespread adoption, effectiveness, and ease of use with the Hugging Face peft library.
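The arithmetic behind LoRA's parameter savings is worth seeing concretely. The sketch below (plain NumPy, purely illustrative, not the peft library's implementation) shows the low-rank update W' = W + (alpha/r)·B·A and compares the trainable-parameter counts; the dimensions are representative, not tied to any particular model:

```python
import numpy as np

# Illustrative LoRA update: W' = W + (alpha / r) * B @ A
# Conceptual sketch only -- NOT how the peft library implements it internally.
d, r, alpha = 4096, 8, 16          # hidden size, LoRA rank, scaling factor

W = np.random.randn(d, d)          # frozen pre-trained weight (d*d params)
A = np.random.randn(r, d) * 0.01   # trainable down-projection (r x d)
B = np.zeros((d, r))               # trainable up-projection, initialized to zero

W_adapted = W + (alpha / r) * (B @ A)  # effective weight after adaptation

full_params = W.size
lora_params = A.size + B.size
print(f"Full fine-tuning params: {full_params:,}")
print(f"LoRA params:             {lora_params:,} "
      f"({100 * lora_params / full_params:.2f}% of full)")
```

Because B starts at zero, the adapted weight equals the original at initialization, so training begins from the pre-trained model's behavior and gradually learns the update.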

Data Preparation for Fine-Tuning

The quality and format of your fine-tuning data are paramount. For instruction-tuned models (the most common type of LLM you’ll fine-tune), your data should typically be in an “instruction-response” format.

A common format looks like this:

[
  {
    "instruction": "Summarize this article about AI trends.",
    "input": "Article text goes here...",
    "output": "A concise summary of the article."
  },
  {
    "instruction": "Translate the following sentence to French.",
    "input": "Hello, how are you?",
    "output": "Bonjour, comment allez-vous?"
  }
]

Often, the input field can be empty if the instruction alone provides enough context. The instruction and output fields are crucial. You’ll then convert these into a single text string that the model learns to complete, often using specific tokens to delineate roles (e.g., ### Instruction:, ### Input:, ### Response:).

Key Considerations:

  • Quality over Quantity: A smaller, high-quality dataset often yields better results than a large, noisy one.
  • Diversity: If your task has different variations, ensure your dataset covers them.
  • Consistency: Maintain a consistent format and style for instructions and responses.
  • Safety: Filter out harmful or biased content.

Choosing an LLM for Fine-Tuning

The choice of base LLM depends on your resources and specific needs. As of 2026, popular choices for fine-tuning include:

  • Llama 3/4 Family (Meta): Continues to be a leading open-source choice, offering various sizes (e.g., 8B, 70B parameters) with strong performance and a permissive license.
  • Mistral/Mixtral Family (Mistral AI): Known for being highly performant for their size, often outperforming larger models, with efficient architectures.
  • Gemma Family (Google): Google’s open-source models, derived from their Gemini research, offering good performance and a friendly license.
  • Other specialized models: Many smaller, task-specific models are released regularly.

For hands-on learning, starting with a smaller model (e.g., 7B or 8B parameter variants) is recommended, especially when using QLoRA on consumer GPUs.

Evaluation Metrics for Fine-Tuned LLMs

Evaluating fine-tuned LLMs is complex because their outputs are often free-form text. Traditional metrics include:

  • Perplexity: Measures how well a probability model predicts a sample. Lower perplexity generally indicates a better model, but it’s not always a direct proxy for human-perceived quality in generation.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Commonly used for summarization tasks, comparing overlap of n-grams between generated and reference summaries.
  • BLEU (Bilingual Evaluation Understudy): Primarily for machine translation, measuring the similarity between a machine-generated translation and a set of high-quality reference translations.
  • Human Evaluation: The gold standard. Human judges assess fluency, coherence, relevance, factual accuracy, and helpfulness. This is often done by comparing model outputs side-by-side.
  • Task-Specific Metrics: Depending on your specific task (e.g., F1-score for classification, custom metrics for code generation).

For instruction tuning, human evaluation or GPT-4/other LLM-based evaluations are often preferred for their ability to capture nuanced quality.
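Perplexity, the first metric above, is easy to compute once you have per-token log-probabilities from the model. A minimal sketch (pure Python; the function name is our own):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over a token sequence."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# A model that assigns probability 0.5 to every token has perplexity 2.
logprobs = [math.log(0.5)] * 10
print(perplexity(logprobs))  # -> 2.0 (up to floating-point rounding)
```

In practice you would obtain the log-probabilities from the model's output logits over a held-out evaluation set.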

Step-by-Step Implementation: Fine-Tuning an LLM with LoRA

Let’s get our hands dirty! We’ll fine-tune a small, open-source LLM using LoRA and 4-bit quantization (QLoRA) on a custom instruction dataset. This setup is designed to be runnable on a single GPU with at least 12-16GB of VRAM (e.g., NVIDIA RTX 3060/4060 or better).

Our Goal: Take a base LLM and fine-tune it to follow a very specific instruction format for a simple task, like generating short, creative descriptions based on a prompt.

Setup Your Environment

First, ensure you have Python 3.10 or newer. We’ll use pip to install the necessary libraries.

# Create a virtual environment (recommended)
python -m venv llm_finetune_env
source llm_finetune_env/bin/activate # On Windows: .\llm_finetune_env\Scripts\activate

# Install core libraries (versions as of early 2026 are stable with these commands)
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.36.2 peft==0.7.1 bitsandbytes==0.42.0 accelerate==0.25.0 datasets==2.16.1 trl==0.7.8
pip install -U scikit-learn pandas numpy
  • torch: The underlying deep learning framework. We specify cu121 for CUDA 12.1 compatibility; adjust if your CUDA version differs.
  • transformers: Hugging Face’s library for pre-trained models.
  • peft: Hugging Face’s library for Parameter-Efficient Fine-Tuning.
  • bitsandbytes: For 4-bit quantization.
  • accelerate: For easier mixed-precision training and distributed setups.
  • datasets: For loading and preparing datasets.
  • trl: (Transformer Reinforcement Learning) A library built on transformers for RLHF and supervised fine-tuning (SFT), simplifying the training loop.

Step 1: Prepare Your Custom Dataset

For this exercise, let’s create a very small synthetic dataset in a JSON Lines (.jsonl) format. Imagine we want the LLM to generate “fun facts” based on a topic.

Create a file named fun_facts_dataset.jsonl with the following content:

{"instruction": "Generate a fun fact.", "input": "Topic: Space", "output": "A day on Venus is longer than a year on Venus."}
{"instruction": "Generate a fun fact.", "input": "Topic: Animals", "output": "Octopuses have three hearts."}
{"instruction": "Generate a fun fact.", "input": "Topic: History", "output": "The shortest war in history lasted only 38 to 45 minutes."}
{"instruction": "Generate a fun fact.", "input": "Topic: Food", "output": "Strawberries are not technically berries, but bananas are."}
{"instruction": "Generate a fun fact.", "input": "Topic: Technology", "output": "The first computer mouse was made of wood."}
{"instruction": "Generate a fun fact.", "input": "Topic: Nature", "output": "Honey never spoils."}
{"instruction": "Generate a fun fact.", "input": "Topic: Science", "output": "It takes a photon up to 100,000 years to travel from the sun's core to its surface."}
{"instruction": "Generate a fun fact.", "input": "Topic: Sports", "output": "The original Olympic games did not include women."}
{"instruction": "Generate a fun fact.", "input": "Topic: Art", "output": "The Mona Lisa has no eyebrows."}
{"instruction": "Generate a fun fact.", "input": "Topic: Music", "output": "The world's oldest instrument is a flute made from a vulture's wing bone, over 40,000 years old."}

Explanation: Each line is a JSON object representing one training example. We have instruction, input, and output fields. This format is common for instruction-tuning datasets.
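Before training, it is worth sanity-checking that every line of the file parses and carries the expected fields; a malformed line will otherwise surface as a confusing error deep inside the data loader. A small sketch (the required keys match the dataset above; the helper name is our own):

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_jsonl(path):
    """Raise ValueError if any line is invalid JSON or lacks a required field."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # raises json.JSONDecodeError on bad JSON
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"Line {lineno}: missing fields {missing}")
    return True
```

Run it once on fun_facts_dataset.jsonl before launching the fine-tuning script.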

Step 2: Write Your Fine-Tuning Script

Now, let’s create a Python script named finetune_llm.py. We’ll build it piece by piece.

Part 1: Imports and Basic Setup

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer # SFTTrainer simplifies supervised fine-tuning
import os

# --- Configuration ---
# Model name (choose a small, open-source model, e.g., Mistral-7B-v0.1 or similar)
# For this example, let's use a widely available 7B model.
# NOTE: Model versions can change. Always check Hugging Face Hub for the latest.
MODEL_NAME = "mistralai/Mistral-7B-v0.1" # Example, check for latest stable 7B model
DATASET_PATH = "fun_facts_dataset.jsonl"
OUTPUT_DIR = "./results_fun_facts"

# Ensure CUDA is available
if not torch.cuda.is_available():
    raise SystemError("CUDA is not available. Fine-tuning LLMs requires a GPU.")

print(f"Using model: {MODEL_NAME}")
print(f"Output directory: {OUTPUT_DIR}")

Explanation:

  • We import all necessary classes from torch, transformers, peft, datasets, and trl.
  • os is for basic path operations.
  • MODEL_NAME: This specifies which pre-trained LLM we’re starting with. mistralai/Mistral-7B-v0.1 is a good choice for its balance of performance and size, making it feasible for QLoRA. Always check the Hugging Face Hub for the latest and most suitable base models.
  • DATASET_PATH: Points to our fun_facts_dataset.jsonl file.
  • OUTPUT_DIR: Where the fine-tuned model and training logs will be saved.
  • We add a check for CUDA availability, as GPU is essential.

Part 2: Load Model and Tokenizer with Quantization

# --- 4-bit Quantization Configuration ---
# This configuration enables 4-bit quantization using bitsandbytes,
# which greatly reduces memory footprint and allows larger models to fit on consumer GPUs.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                 # Load model weights in 4-bit
    bnb_4bit_quant_type="nf4",         # Use NF4 (NormalFloat 4) quantization type
    bnb_4bit_compute_dtype=torch.bfloat16, # Use bfloat16 for computation (if GPU supports it, otherwise float16)
    bnb_4bit_use_double_quant=True,    # Double quantization can further save memory
)

# --- Load Model and Tokenizer ---
print("Loading model and tokenizer...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",                 # Automatically map model layers to available devices
    trust_remote_code=True             # Required for some models
)
model.config.use_cache = False         # Disable cache for fine-tuning
model.config.pretraining_tp = 1        # Tensor parallelism for pretraining (keep at 1 for fine-tuning)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # Set pad token to EOS token for causal LMs
tokenizer.padding_side = "right"       # Pad on the right side
print("Model and tokenizer loaded.")

Explanation:

  • BitsAndBytesConfig: This is crucial for QLoRA.
    • load_in_4bit=True: Tells transformers to load the model in 4-bit precision.
    • bnb_4bit_quant_type="nf4": Specifies the NormalFloat 4-bit quantization, which is optimal for neural networks.
    • bnb_4bit_compute_dtype=torch.bfloat16: Sets the datatype for computations. bfloat16 (Brain Floating Point) is preferred if your GPU supports it (e.g., NVIDIA Ampere architecture and newer), as it has a wider dynamic range than float16. If not, torch.float16 can be used.
    • bnb_4bit_use_double_quant=True: Applies a second, smaller quantization pass, saving a bit more memory at minimal performance cost.
  • AutoModelForCausalLM.from_pretrained: Loads the pre-trained LLM.
    • quantization_config: Passes our bnb_config to enable 4-bit loading.
    • device_map="auto": Hugging Face accelerate handles distributing the model across available GPUs or CPU if memory is limited.
    • trust_remote_code=True: Some models require this to load custom architectures.
  • model.config.use_cache = False: Disabling attention cache is standard practice during training to save memory.
  • AutoTokenizer.from_pretrained: Loads the tokenizer corresponding to our model.
  • tokenizer.pad_token = tokenizer.eos_token: For causal language models, it’s common to use the End-Of-Sequence (EOS) token as the padding token.
  • tokenizer.padding_side = "right": For causal LM training, sequences are padded on the right so the real tokens come first. (For batched generation at inference time, you would typically pad on the left instead.)
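To build intuition for what low-bit loading does, here is a toy symmetric "absmax" quantizer in NumPy. It illustrates the round-trip error of mapping floats onto a handful of integer levels; it is NOT the NF4 algorithm bitsandbytes actually uses (NF4 relies on a non-uniform codebook tuned to normally distributed weights):

```python
import numpy as np

def quantize_absmax(w, bits=4):
    """Toy symmetric quantizer: map floats onto 2**(bits-1)-1 integer levels."""
    levels = 2 ** (bits - 1) - 1          # 7 levels per sign for 4-bit
    scale = np.max(np.abs(w)) / levels    # one scale for the whole tensor
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(1024).astype(np.float32)
q, scale = quantize_absmax(w)
w_hat = dequantize(q, scale)
print("max abs round-trip error:", np.max(np.abs(w - w_hat)))
```

The rounding error is bounded by half the scale, which is why quantization works far better combined with fine-tuning (QLoRA trains adapters in higher precision on top of the frozen 4-bit weights).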

Part 3: Configure LoRA

# --- LoRA Configuration ---
# This defines the LoRA adapter's parameters.
peft_config = LoraConfig(
    lora_alpha=16,                     # LoRA scaling factor
    lora_dropout=0.1,                  # Dropout probability for LoRA layers
    r=64,                              # LoRA attention dimension (rank)
    bias="none",                       # Whether to train bias parameters
    task_type="CAUSAL_LM",             # Specifies the task type (Causal Language Modeling)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"] # Modules to apply LoRA
)

# Prepare model for k-bit training (PEFT utility)
model = prepare_model_for_kbit_training(model)
# Get PEFT model (wraps the base model with LoRA adapters)
model = get_peft_model(model, peft_config)

# Print trainable parameters to see the massive reduction
model.print_trainable_parameters()     # prints trainable vs. total parameter counts

Explanation:

  • LoraConfig: This object holds the hyperparameters for our LoRA adapters.
    • lora_alpha: A scaling factor for the LoRA weights. Higher alpha means stronger adaptation.
    • lora_dropout: Dropout applied to the LoRA layers to prevent overfitting.
    • r: The LoRA “rank” or dimension. This is a critical hyperparameter. A higher r means more trainable parameters and potentially more expressive adapters, but also more memory/compute. Common values are 8, 16, 32, 64.
    • bias="none": We typically don’t fine-tune bias parameters with LoRA.
    • task_type="CAUSAL_LM": Essential for the peft library to correctly configure the adapters for language generation.
    • target_modules: This specifies which linear layers in the transformer architecture will have LoRA adapters applied. For Mistral-like models, q_proj, k_proj, v_proj, o_proj (query, key, value, output projections in attention) are standard. gate_proj, up_proj, down_proj are for the feed-forward network.
  • prepare_model_for_kbit_training(model): A peft utility that casts certain layers (such as the layer norms and the lm_head) to float32 for numerical stability and enables gradient checkpointing, which is crucial for memory efficiency during QLoRA training.
  • get_peft_model(model, peft_config): This function takes our base model and LoraConfig and returns a new model object where LoRA adapters are injected. When you call model.train(), only these adapters will be updated.
  • model.print_trainable_parameters(): This handy function shows you how many parameters are actually being trained vs. the total parameters. You’ll see a dramatic reduction!

Part 4: Load and Process Dataset

# --- Load and Format Dataset ---
print(f"Loading dataset from {DATASET_PATH}...")
# Load our custom JSONL dataset
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")

# Define the formatting function for instruction tuning
def formatting_prompts_func(example):
    # This function converts our instruction-input-output JSON into a single text string
    # that the model will learn to complete.
    # We use a simple template similar to Alpaca/Vicuna instruction format.
    output_texts = []
    for i in range(len(example['instruction'])):
        # Construct the prompt
        prompt = f"### Instruction:\n{example['instruction'][i]}\n"
        if example['input'][i]:
            prompt += f"### Input:\n{example['input'][i]}\n"
        prompt += f"### Response:\n{example['output'][i]}{tokenizer.eos_token}"
        output_texts.append(prompt)
    return output_texts

print("Dataset loaded and formatting function defined.")

Explanation:

  • load_dataset("json", ...): We use Hugging Face datasets to load our jsonl file. split="train" loads the entire file as the training split. For larger projects, you’d have separate train/validation/test splits.
  • formatting_prompts_func: This function is critical. It takes a dictionary of examples (from the dataset) and formats them into the specific string structure that the LLM will learn from.
    • We use ### Instruction:, ### Input:, and ### Response: to clearly delineate sections.
    • {tokenizer.eos_token} is appended to the end of the target response. This teaches the model when to stop generating. The model learns to generate the output given the instruction and input, and then emit the EOS token.
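To make the template concrete, here is the exact string one dataset example turns into. (The EOS token is shown as Mistral's "</s>" for illustration; the real script uses tokenizer.eos_token.)

```python
EOS = "</s>"  # placeholder for tokenizer.eos_token in the real script

example = {
    "instruction": "Generate a fun fact.",
    "input": "Topic: Space",
    "output": "A day on Venus is longer than a year on Venus.",
}

text = (
    f"### Instruction:\n{example['instruction']}\n"
    f"### Input:\n{example['input']}\n"
    f"### Response:\n{example['output']}{EOS}"
)
print(text)
```

During training the model learns to complete everything after "### Response:" and then emit EOS; at inference we supply the prompt up to "### Response:\n" and let it generate the rest.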

Part 5: Training Arguments and SFTTrainer

# --- Training Arguments ---
# These arguments control the training process.
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,             # Directory to save checkpoints and logs
    num_train_epochs=5,                # Number of training epochs
    per_device_train_batch_size=2,     # Batch size per GPU (adjust based on VRAM)
    gradient_accumulation_steps=2,     # Accumulate gradients over N steps to simulate larger batch size
    optim="paged_adamw_8bit",          # Optimizer (paged AdamW is memory-efficient for QLoRA)
    save_steps=100,                    # Save checkpoint every N steps
    logging_steps=10,                  # Log training metrics every N steps
    learning_rate=2e-4,                # Learning rate
    weight_decay=0.001,                # Weight decay for regularization
    fp16=False,                        # Use float16 precision (set to True if your GPU supports it and you're not using bfloat16 for compute_dtype)
    bf16=True,                         # Use bfloat16 precision (set to True if your GPU supports it, e.g., Ampere+)
    max_grad_norm=0.3,                 # Clip gradients to prevent exploding gradients
    max_steps=-1,                      # Set to -1 to train for num_train_epochs
    warmup_ratio=0.03,                 # Linear warmup over 3% of training steps
    group_by_length=True,              # Group samples by length to reduce padding and speed up training
    lr_scheduler_type="cosine",        # Learning rate scheduler
    report_to="tensorboard",           # Report metrics to TensorBoard (install tensorboard if you want to use it)
    push_to_hub=False,                 # Don't push to Hugging Face Hub automatically
)

# --- SFTTrainer ---
# SFTTrainer simplifies the supervised fine-tuning process.
trainer = SFTTrainer(
    model=model,                       # Our PEFT-wrapped model
    train_dataset=dataset,             # Our formatted training dataset
    peft_config=peft_config,           # LoRA configuration
    max_seq_length=512,                # Maximum sequence length for input (adjust based on your data)
    tokenizer=tokenizer,               # Our tokenizer
    formatting_func=formatting_prompts_func, # Our function to format prompts
    args=training_args,                # Our training arguments
)

# --- Start Training ---
print("Starting training...")
trainer.train()
print("Training complete!")

# --- Save the fine-tuned adapter ---
trainer.save_model(OUTPUT_DIR)
print(f"Fine-tuned LoRA adapter saved to {OUTPUT_DIR}")

Explanation:

  • TrainingArguments: This class from transformers is where you define all the parameters for your training run.
    • num_train_epochs, per_device_train_batch_size, learning_rate, optim, fp16/bf16 are standard deep learning hyperparameters.
    • gradient_accumulation_steps: Crucial for simulating larger batch sizes when per_device_train_batch_size is small due to memory constraints. A per_device_train_batch_size of 2 with gradient_accumulation_steps of 2 effectively means a batch size of 4 for gradient updates.
    • optim="paged_adamw_8bit": A memory-efficient AdamW variant from bitsandbytes that stores optimizer states in 8-bit and pages them between GPU and CPU memory, avoiding out-of-memory spikes during QLoRA training.
    • report_to="tensorboard": If you install tensorboard (pip install tensorboard), you can monitor training progress by running tensorboard --logdir ./results_fun_facts in a new terminal.
  • SFTTrainer: This class from the trl library is a high-level wrapper around transformers.Trainer specifically for supervised fine-tuning (SFT). It handles tokenization and formatting internally, making the setup much cleaner.
    • max_seq_length: Important. If your combined instruction, input, and output text often exceeds this, it will be truncated. Adjust based on your typical data length.
    • formatting_func: We pass our custom function here, and SFTTrainer will apply it to the dataset examples.
  • trainer.train(): Kicks off the fine-tuning process!
  • trainer.save_model(OUTPUT_DIR): After training, this saves only the LoRA adapter weights (not the full base model) to the specified directory. This is why PEFT is so efficient for storage.
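The effective batch size mentioned above is simple arithmetic, but it is easy to lose track of while juggling memory settings, so a tiny helper can be handy (a sketch; the function name is our own):

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus=1):
    """Gradients are applied once every grad_accum_steps forward passes,
    so each optimizer update 'sees' the product of all three factors."""
    return per_device_batch * grad_accum_steps * num_gpus

# The settings in training_args above: batch size 2, accumulation 2, one GPU
print(effective_batch_size(2, 2))  # -> 4
```

When you shrink per_device_train_batch_size to fit memory, raise gradient_accumulation_steps proportionally to keep this product constant.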

Part 6: Inference with the Fine-Tuned Model

After training, you’ll want to test your fine-tuned model. You’ll need to load the original base model and then load the adapter weights on top of it.

Create a new Python script, inference.py:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import os

# --- Configuration ---
MODEL_NAME = "mistralai/Mistral-7B-v0.1" # Must be the same base model used for fine-tuning
ADAPTER_PATH = "./results_fun_facts" # Path where the LoRA adapter was saved

# Ensure CUDA is available
if not torch.cuda.is_available():
    raise SystemError("CUDA is not available. Inference for LLMs usually requires a GPU.")

# --- 4-bit Quantization Configuration (same as training) ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# --- Load Base Model ---
print(f"Loading base model: {MODEL_NAME}...")
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
print("Base model loaded.")

# --- Load Tokenizer ---
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# --- Load PEFT Adapter ---
print(f"Loading PEFT adapter from: {ADAPTER_PATH}...")
model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
model = model.eval() # Set model to evaluation mode
print("PEFT adapter loaded on top of the base model.")

# --- Inference Function ---
def generate_fun_fact(instruction, input_text):
    prompt = f"### Instruction:\n{instruction}\n"
    if input_text:
        prompt += f"### Input:\n{input_text}\n"
    prompt += f"### Response:\n" # Model will complete this

    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=512).to("cuda")

    # Generate output
    with torch.no_grad(): # No need to compute gradients during inference
        outputs = model.generate(
            **inputs,
            max_new_tokens=100, # Max tokens to generate
            do_sample=True,     # Use sampling for more creative outputs
            top_p=0.9,          # Nucleus sampling (top_p)
            temperature=0.7,    # Sampling temperature
            eos_token_id=tokenizer.eos_token_id, # Stop generation at EOS token
            pad_token_id=tokenizer.pad_token_id # Use pad token for padding
        )

    # Decode the generated tokens
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the response part
    # We look for the "### Response:" tag and return everything after it,
    # and before any subsequent "###" or EOS token.
    response_start = generated_text.find("### Response:")
    if response_start != -1:
        response_text = generated_text[response_start + len("### Response:"):].strip()
        # Remove any subsequent instruction/input tags if model over-generates
        next_tag_start = response_text.find("### Instruction:")
        if next_tag_start != -1:
            response_text = response_text[:next_tag_start].strip()
        next_tag_start = response_text.find("### Input:")
        if next_tag_start != -1:
            response_text = response_text[:next_tag_start].strip()
        return response_text
    return "Failed to extract response."

# --- Test the fine-tuned model ---
print("\n--- Testing Fine-Tuned Model ---")
print("Generated Fun Fact (Topic: Oceans):")
print(generate_fun_fact("Generate a fun fact.", "Topic: Oceans"))

print("\nGenerated Fun Fact (Topic: Ancient Civilizations):")
print(generate_fun_fact("Generate a fun fact.", "Topic: Ancient Civilizations"))

print("\nGenerated Fun Fact (Topic: Inventions):")
print(generate_fun_fact("Generate a fun fact.", "Topic: Inventions"))

Explanation:

  • PeftModel.from_pretrained(base_model, ADAPTER_PATH): This is the magic line. It takes your quantized base model and loads the LoRA adapter weights on top of it, applying them alongside the frozen base weights at inference time. (For deployment, peft also offers model.merge_and_unload() to fold the adapters into the base weights permanently.)
  • model.eval(): It’s crucial to set the model to evaluation mode to disable dropout and ensure consistent behavior.
  • generate_fun_fact: This function encapsulates the inference logic.
    • It constructs the prompt using the exact same format as your training data. This consistency is vital for good performance.
    • tokenizer(...) converts the text prompt into numerical tokens.
    • model.generate(...) is the core generation method.
      • max_new_tokens: Controls the length of the generated response.
      • do_sample=True, top_p, temperature: These parameters control the creativity and randomness of the generation. do_sample=True enables sampling, top_p (nucleus sampling) focuses on a subset of high-probability tokens, and temperature scales the probability distribution (higher temperature means more random).
      • eos_token_id: Ensures the model stops generating when it produces the EOS token.
    • The post-processing logic extracts only the relevant “### Response:” part from the generated text.
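The sampling parameters are easy to demystify with a toy distribution. The sketch below (pure Python, illustrative only, not Hugging Face's implementation) shows how temperature reshapes token probabilities and how nucleus (top-p) sampling selects the candidate set:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_candidates(probs, top_p=0.9):
    """Smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return kept

logits = [3.0, 1.0, 0.2, 0.1]
print(softmax_with_temperature(logits, 0.7))   # sharper than temperature=1.0
print(top_p_candidates(softmax_with_temperature(logits), top_p=0.9))
```

Lowering the temperature concentrates probability on the top token (more deterministic output); lowering top_p shrinks the candidate pool before sampling.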

To run these scripts:

  1. Save the fun_facts_dataset.jsonl file.
  2. Save the finetune_llm.py script.
  3. Run python finetune_llm.py in your terminal (inside your activated virtual environment). This will take some time, depending on your GPU.
  4. Once training is complete, save the inference.py script.
  5. Run python inference.py to see your fine-tuned model in action!

You should observe that the model, even after training on a tiny dataset, starts generating “fun fact”-like responses, demonstrating its adaptation.

Mini-Challenge

Challenge: Experiment with LoRA hyperparameters and the dataset.

  1. Modify LoRA Rank (r): In finetune_llm.py, change the r parameter in LoraConfig (e.g., from 64 to 32, or 16).
  2. Add More Data: Add 5-10 more diverse “fun fact” examples to your fun_facts_dataset.jsonl file.
  3. Retrain and Observe: Rerun finetune_llm.py and then inference.py.

Hint:

  • A lower r means fewer trainable parameters, which might train faster but could result in less expressive adaptation. A higher r might capture more nuances but requires more memory and time.
  • More diverse, high-quality data almost always improves performance, especially for a specific task.

What to observe/learn:

  • How does changing r affect the “Trainable parameters” count printed by model.print_trainable_parameters()?
  • Does the model’s output quality or adherence to the “fun fact” style change with different r values or more data? Pay attention to coherence and relevance.
  • Note the training time difference when changing r.
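To see why lowering r shrinks the trainable-parameter count reported by model.print_trainable_parameters(), recall that LoRA replaces each full weight update with two small matrices: A of shape (r, d_in) and B of shape (d_out, r). A back-of-the-envelope calculation (the 4096×4096 projection size is illustrative, typical of ~7B-parameter models):

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters one LoRA adapter adds to a d_out x d_in
    weight matrix: A is (r, d_in) and B is (d_out, r)."""
    return r * d_in + d_out * r

full = 4096 * 4096  # parameters in the full (frozen) weight matrix
for r in (64, 32, 16):
    added = lora_params(4096, 4096, r)
    print(f"r={r:>2}: {added:>7,} trainable params "
          f"({100 * added / full:.3f}% of the full matrix)")
```

The count scales linearly with r, so halving r halves the adapter's trainable parameters (and its memory footprint) for each adapted matrix.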

Common Pitfalls & Troubleshooting

  1. Out of Memory (OOM) Errors:

    • Symptom: Your script crashes with CUDA out of memory or similar errors, especially during training.
    • Troubleshooting:
      • Reduce per_device_train_batch_size: This is the first thing to try. Start with 1 if needed.
      • Increase gradient_accumulation_steps: If you reduce batch size, compensate by increasing accumulation steps to maintain a similar effective batch size.
      • Reduce max_seq_length: Shorter sequences use less memory.
      • Reduce LoRA r: A smaller rank means fewer trainable parameters and less memory.
      • Verify bnb_4bit_compute_dtype: Ensure torch.bfloat16 is used if your GPU supports it; otherwise, fall back to torch.float16 (and set fp16=True in TrainingArguments).
      • Close other GPU applications: Ensure no other programs are consuming GPU memory.
      • Consider a smaller base model: If all else fails, you might need a base model with fewer parameters.
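The batch-size and accumulation trade-off above comes down to simple arithmetic: the optimizer sees per_device_train_batch_size × gradient_accumulation_steps (× the number of devices) examples per update, so you can shrink the per-device batch and raise accumulation to keep the effective batch constant while cutting peak memory. A minimal sketch (the numbers are illustrative):

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices

# Hypothetical original config: per-device batch 4, accumulation 2
assert effective_batch_size(4, 2) == 8
# OOM fix: per-device batch 1, accumulation 8 -> same effective batch,
# but peak activation memory is roughly that of a single example
assert effective_batch_size(1, 8) == 8
```

Gradient accumulation trades wall-clock time for memory: the forward/backward passes run on tiny batches, and gradients are summed across steps before a single optimizer update.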
  2. Poor Model Performance / Doesn’t Learn:

    • Symptom: The fine-tuned model generates irrelevant, generic, or nonsensical responses; it doesn’t seem to have learned the new task.
    • Troubleshooting:
      • Dataset Quality & Quantity: This is the most common culprit. Is your dataset large enough? Even with PEFT, you typically need hundreds to thousands of examples for a meaningful change. Is it clean, consistent, and truly representative of your desired task?
      • Formatting Issues: Double-check your formatting_prompts_func. The model must see the exact same prompt structure during training as it will during inference. A mismatch here is fatal.
      • Learning Rate: A learning rate of 2e-4 is a good starting point for LoRA, but try adjusting it slightly (e.g., 1e-4 or 5e-5).
      • num_train_epochs: Are you training for enough epochs? For small datasets, more epochs might be needed.
      • LoRA r and lora_alpha: Experiment with these. A very low r might not be expressive enough.
      • Gradient Clipping (max_grad_norm): If gradients explode, training can become unstable. 0.3 is a good default.
      • Check logs: Review the training loss in your TensorBoard logs. Is it decreasing steadily? If it’s flat or jumping, something is wrong.
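A cheap way to guard against the fatal train/inference prompt mismatch noted above is to build both prompts from one shared template and assert that the inference prompt is an exact prefix of the training text. A minimal sketch with a hypothetical template (adapt the markers to whatever your formatting_prompts_func actually emits):

```python
# Hypothetical prompt template; the "### Instruction:" / "### Response:"
# markers must match your training format exactly.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def format_training_example(instruction, response):
    # During training the model sees the instruction AND the target response.
    return TEMPLATE.format(instruction=instruction, response=response)

def format_inference_prompt(instruction):
    # At inference we stop right after the "### Response:" marker so the
    # model continues from exactly the position it was trained on.
    return TEMPLATE.format(instruction=instruction, response="")

train = format_training_example("Tell me a fun fact.", "Honey never spoils.")
infer = format_inference_prompt("Tell me a fun fact.")
# A mismatch here would silently degrade generation quality.
assert train.startswith(infer)
```

Running this check once as a unit test catches silent drift between your dataset-formatting code and your inference code.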
  3. Tokenizer Issues (Encoding/Decoding Problems):

    • Symptom: Generated text contains weird characters, or the model seems to ignore parts of the prompt.
    • Troubleshooting:
      • tokenizer.pad_token and tokenizer.padding_side: Ensure these are correctly set, especially padding_side="right" for causal LMs.
      • skip_special_tokens=True: Make sure you use this during decoding to remove [PAD], [EOS], etc., from the final output.
      • max_seq_length: If your prompts are frequently truncated, the model won’t see the full context.
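To illustrate the observable effect of skip_special_tokens=True (this toy is a stand-in, not how tokenizer.decode works internally, and the special-token set is illustrative):

```python
SPECIAL_TOKENS = {"[PAD]", "[EOS]", "[BOS]"}  # illustrative special tokens

def decode(tokens, skip_special_tokens=True):
    """Toy stand-in for tokenizer.decode: join tokens into text,
    optionally dropping special tokens first."""
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return " ".join(tokens)

tokens = ["[BOS]", "Octopuses", "have", "three", "hearts", "[EOS]", "[PAD]"]
print(decode(tokens, skip_special_tokens=False))  # markers leak into the output
print(decode(tokens))  # → Octopuses have three hearts
```

Forgetting skip_special_tokens=True is the usual cause of [PAD] or [EOS] markers appearing in your final generated text.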

Summary

Congratulations! You’ve successfully navigated the exciting world of fine-tuning Large Language Models.

Here are the key takeaways from this chapter:

  • Fine-tuning adapts pre-trained LLMs to specific domains or tasks, making them more specialized and performant.
  • Parameter-Efficient Fine-Tuning (PEFT) techniques, especially LoRA and QLoRA, are essential for fine-tuning massive LLMs on limited hardware by only updating a small fraction of parameters.
  • Data preparation is critical, requiring high-quality, instruction-response formatted examples that match the inference prompt structure.
  • The Hugging Face transformers library provides the models and tokenizers, while peft and trl simplify the efficient fine-tuning process.
  • Quantization (e.g., 4-bit with bitsandbytes) is key to reducing memory footprint, allowing larger models to be fine-tuned on consumer GPUs.
  • Evaluation of fine-tuned LLMs often relies on a combination of automated metrics and crucial human judgment.
  • Troubleshooting OOM errors involves adjusting batch size, gradient accumulation, sequence length, and LoRA parameters.

You now have a powerful new tool in your AI engineer’s toolkit! Being able to adapt and specialize LLMs opens up a vast array of practical applications, from custom chatbots to specialized content generators.

What’s Next? In the upcoming chapters, we’ll continue our journey by exploring related concepts. We’ll delve into Embeddings to understand how models represent meaning, then move to Multimodal Models that combine different data types (like text and images). We’ll also cover Inference Optimization to make your fine-tuned models run faster and more efficiently in production, and discuss advanced Hardware Considerations for scaling your AI projects.

References

  1. Hugging Face PEFT Documentation: https://huggingface.co/docs/peft/en/index
  2. Hugging Face trl Documentation (SFTTrainer): https://huggingface.co/docs/trl/en/sft_trainer
  3. Hugging Face transformers Documentation: https://huggingface.co/docs/transformers/en/index
  4. Hugging Face bitsandbytes Integration: https://huggingface.co/docs/transformers/en/main_classes/quantization
  5. PyTorch Official Website: https://pytorch.org/
