Welcome back, future LLM master! In Chapter 3, we successfully set up our Tunix environment and explored its foundational components. Now, it’s time to put that knowledge into action and perform our very first model alignment task: Supervised Fine-Tuning (SFT).
This chapter is your hands-on guide to taking a pre-trained Large Language Model (LLM) and teaching it a new, specific skill using a carefully curated dataset. We’ll walk through everything from preparing your data to configuring Tunix’s powerful Trainer and observing your model learn. By the end, you’ll have a practical understanding of SFT and the confidence to apply it to your own projects. Get ready to make some LLMs smarter!
Core Concepts: Understanding Supervised Fine-Tuning (SFT)
Before we dive into code, let’s solidify our understanding of what SFT is and why it’s such a crucial step in the LLM lifecycle.
What is Supervised Fine-Tuning?
Imagine you have a brilliant student who knows a lot about many subjects but isn’t specialized in any particular one. That’s your pre-trained LLM! It has learned the general patterns of language from vast amounts of text, making it good at predicting the next word in a sequence.
Now, you want this student to become an expert in, say, answering specific coding questions. You wouldn’t re-teach them everything from scratch. Instead, you’d provide them with many examples of coding questions and their correct answers, guiding them to focus their existing knowledge on this new task.
This process is precisely what Supervised Fine-Tuning (SFT) does for LLMs:
- Starts with a Pre-trained LLM: We leverage the immense general knowledge already encoded in a base model.
- Uses Labeled Data: We provide a dataset consisting of input-output pairs (e.g., `(prompt, desired_response)`). The “supervised” part comes from these explicit labels.
- Adapts to a Specific Task: The model’s weights are adjusted to minimize the difference between its predictions and the desired outputs in the SFT dataset. This “fine-tunes” its behavior towards the new task.
Why is SFT important? It’s often the first and most fundamental step in aligning an LLM. It allows us to:
- Make a general-purpose model follow specific instructions.
- Improve performance on domain-specific tasks (e.g., legal, medical, coding).
- Change the model’s output style or format.
The SFT Dataset: The Fuel for Fine-Tuning
The quality and format of your SFT dataset are paramount. For SFT, your data typically consists of pairs, where an “input” (or prompt) is mapped to a “target” (or completion).
Consider this example:
{"prompt": "What is the capital of France?", "completion": "The capital of France is Paris."}
Or for a more complex instruction-following scenario:
{"prompt": "Instruction: Summarize the following text.\nText: The quick brown fox jumps over the lazy dog. This is a common pangram.\n\nSummary:", "completion": "The text describes the pangram 'The quick brown fox jumps over the lazy dog'."}
Common formats for SFT datasets include JSONL (JSON Lines), where each line is a self-contained JSON object, making it easy to stream and process large datasets.
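To make the format concrete, here is a small, self-contained sketch (plain Python standard library, no Tunix required) that writes a couple of hypothetical examples to a JSONL file and then streams them back one record at a time:

```python
import json
import os
import tempfile

# Two toy SFT examples in the (prompt, completion) format described above.
examples = [
    {"prompt": "What is the capital of France?",
     "completion": "The capital of France is Paris."},
    {"prompt": "Name a pangram.",
     "completion": "The quick brown fox jumps over the lazy dog."},
]

# Write one JSON object per line -- the JSONL convention.
path = os.path.join(tempfile.gettempdir(), "sft_demo.jsonl")
with open(path, "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Read it back lazily, one record at a time. There is no need to load the
# whole file into memory, which is what makes JSONL stream-friendly.
def iter_jsonl(p):
    with open(p) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

records = list(iter_jsonl(path))
print(len(records), "records;", "first completion:", records[0]["completion"])
```

Because each line is a complete JSON object, a corrupted or oversized record affects only that one line, and large datasets can be processed without loading everything at once.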
Tunix’s Role: Efficient SFT with JAX
Tunix (Tune-in-JAX) is purpose-built to make this process efficient and scalable, especially on JAX-accelerated hardware like TPUs or powerful GPUs. It provides:
- `tunix.data.Dataset`: A flexible way to load, process, and prepare your data for training. It handles tokenization, batching, and other data transformations.
- `tunix.Trainer`: The core orchestrator of the fine-tuning process. It manages the training loop, optimizer, learning-rate schedules, checkpointing, and evaluation.
- JAX-native backend: Leveraging JAX’s `jit` compilation and `pmap` for distributed training means your SFT runs will be highly optimized.
The SFT workflow with Tunix is straightforward: a pre-trained LLM and a specialized dataset feed into the SFT process, Tunix handles data preparation and the training loop, and the output is a fine-tuned model ready for its specific task.
Step-by-Step Implementation: Your First SFT Model
It’s time to get our hands dirty! We’ll go through the process of setting up a simple SFT task using Tunix. For this example, we’ll fine-tune a small, pre-trained model to follow a specific instruction format.
Prerequisites:
- You have a working Python environment with Tunix installed (as covered in Chapter 3).
- You have `jax`, `jaxlib`, and `transformers` installed:
  - `pip install "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html` (adjust `cuda12_pip` for your CUDA version if using a GPU)
  - `pip install tunix transformers sentencepiece`
Step 1: Prepare Your SFT Dataset
First, we need some data! We’ll create a tiny dataset of instruction-response pairs. In a real-world scenario, this would be a much larger file.
Let’s create a Python script named run_sft.py.
# run_sft.py
import json
import os
# 1. Our simple SFT dataset
sft_data = [
{"prompt": "Tell me a fun fact about space.", "completion": "Did you know that a day on Venus is longer than a year on Venus?"},
{"prompt": "What is the capital of Canada?", "completion": "The capital of Canada is Ottawa."},
{"prompt": "Explain the concept of 'hello world' in programming.", "completion": "'Hello, World!' is a simple program often used to illustrate the basic syntax of a programming language. It typically just prints the text 'Hello, World!' to the console."},
{"prompt": "Who wrote 'Romeo and Juliet'?", "completion": "William Shakespeare wrote 'Romeo and Juliet'."}
]
# Define a filename for our dataset
dataset_filename = "simple_sft_dataset.jsonl"
# Save the dataset to a JSONL file
print(f"Saving dataset to {dataset_filename}...")
with open(dataset_filename, "w") as f:
for entry in sft_data:
f.write(json.dumps(entry) + "\n")
print("Dataset saved successfully.")
# We'll continue adding code to this file.
Explanation:
- We import `json` and `os` for file operations.
- `sft_data` is a Python list of dictionaries, where each dictionary represents one example. Each example has a `"prompt"` (the input) and a `"completion"` (the desired output).
- We then iterate through this list and write each dictionary as a JSON string on its own line in `simple_sft_dataset.jsonl`. This is the standard JSONL format.
Step 2: Load and Process Data with Tunix
Now that we have our dataset, we need to load it and prepare it for training using Tunix’s data utilities. This involves tokenization.
Let’s extend run_sft.py:
# run_sft.py (continued)
# ... (previous code for dataset creation) ...
import jax
import jax.numpy as jnp
from transformers import AutoTokenizer, FlaxAutoModelForCausalLM
from tunix import data as tunix_data
from tunix import Trainer, TrainState
from tunix.models.flax_llm import FlaxLLMForCausalLM
from tunix.optimizers import get_optimizer
from tunix.schedules import get_lr_schedule
# 2. Load a pre-trained model and tokenizer
# We'll use a small model for demonstration purposes to keep it fast.
# Tunix integrates well with Hugging Face models.
# Note: we need a decoder-only (causal) model here. Encoder-decoder models
# like T5 won't load via FlaxAutoModelForCausalLM. 'gpt2' is a small,
# widely available causal model with Flax weights.
model_name = "gpt2"
print(f"\nLoading model and tokenizer: {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# GPT-2 ships without a pad token; reuse the EOS token for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# We need to instantiate a Flax model for JAX compatibility
model = FlaxAutoModelForCausalLM.from_pretrained(model_name)
# Tunix often expects its own wrapper around Flax models for certain functionalities.
# For this basic SFT we can use the Flax model directly, but for advanced features
# you might wrap it: llm_model = FlaxLLMForCausalLM(model)
print("Model and tokenizer loaded.")
# A simple data processing function
def tokenize_function(examples):
# Combine prompt and completion to form the full text
full_text = [f"{p}\n{c}{tokenizer.eos_token}" for p, c in zip(examples["prompt"], examples["completion"])]
# Tokenize the combined text
tokenized = tokenizer(
full_text,
max_length=128, # A reasonable max length for our small examples
truncation=True,
padding="max_length",
return_tensors="jax", # Important for JAX compatibility
)
    # Tunix Trainer expects 'input_ids' and 'labels'.
    # For causal LMs, labels are usually a copy of input_ids; the loss
    # function shifts them by one position internally.
    tokenized["labels"] = tokenized["input_ids"]
return tokenized
# 3. Create a Tunix Dataset
print(f"\nLoading and processing dataset from {dataset_filename}...")
# Use tunix_data.load_jsonl for convenience
raw_dataset = tunix_data.load_jsonl(dataset_filename)
# Convert to a Tunix Dataset object
sft_dataset = tunix_data.Dataset(
raw_dataset,
tokenizer=tokenizer,
tokenization_fn=tokenize_function,
batch_size=2, # Small batch size for demonstration
shuffle=True,
drop_remainder=True,
num_replicas=jax.device_count(), # For multi-device training if available
)
print("Dataset loaded and configured.")
Explanation:
- We import necessary JAX, Hugging Face, and Tunix components.
- `model_name`: We’re using `gpt2`, a small decoder-only causal model from Hugging Face, perfect for quick experiments. Its Flax weights make it compatible with JAX. (Encoder-decoder models such as T5 require a different model class and training setup.)
- `tokenizer = AutoTokenizer.from_pretrained(model_name)`: Loads the appropriate tokenizer for our model. Because GPT-2 has no pad token by default, we reuse its EOS token for padding.
- `model = FlaxAutoModelForCausalLM.from_pretrained(model_name)`: Loads the pre-trained Flax model weights.
- `tokenize_function(examples)`: This is a crucial function.
  - It takes a batch of raw examples (dictionaries with “prompt” and “completion”).
  - It combines the `prompt` and `completion` into a single string, appending an `eos_token` (end-of-sequence) to the completion. This teaches the model when to stop generating.
  - It then uses the `tokenizer` to convert these strings into numerical `input_ids`, padding them to `max_length` and truncating longer sequences. `return_tensors="jax"` ensures the output is JAX arrays.
  - Crucially, for causal language modeling, the `labels` are typically the same as the `input_ids`; the loss shifts them internally so the model learns to predict each next token from the previous ones.
- `raw_dataset = tunix_data.load_jsonl(dataset_filename)`: Tunix provides a helper to load JSONL files.
- `sft_dataset = tunix_data.Dataset(...)`: This initializes Tunix’s data pipeline with:
  - `raw_dataset`: Our loaded data.
  - `tokenizer`: The tokenizer we loaded.
  - `tokenization_fn`: Our custom function to process the raw data into token IDs.
  - `batch_size`: How many examples to process at once. Keep this small for initial tests.
  - `num_replicas`: Important for JAX; it tells Tunix how many accelerators (GPUs/TPUs) are available to distribute the batch across. `jax.device_count()` detects this automatically.
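The “labels equal input_ids” point is worth making concrete. The sketch below (plain Python, hypothetical token IDs) shows the (context → target) pairs a causal LM is actually scored on: the shift happens inside the loss, not in the data.

```python
# Toy illustration of next-token supervision for a causal LM.
# The labels are the same sequence as the inputs; the loss pairs each
# position's prediction with the *next* token.
input_ids = [101, 7, 42, 9, 102]   # hypothetical token IDs
labels = input_ids                  # same sequence, no manual shifting

# Build the (context -> target) pairs the model is effectively trained on:
# at position i it sees tokens 0..i and is scored on token i+1.
pairs = [(input_ids[:i + 1], labels[i + 1]) for i in range(len(input_ids) - 1)]

for context, target in pairs:
    print(f"given {context} predict {target}")
```

A sequence of length N therefore yields N−1 supervised predictions, which is why no separate “answer” column is needed once prompt and completion are concatenated.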
Step 3: Configure Tunix Trainer
The Trainer is the heart of the fine-tuning process. It brings together the model, data, optimizer, and learning rate schedule.
Continue editing run_sft.py:
# run_sft.py (continued)
# ... (previous code for data loading and processing) ...
# 4. Configure Tunix Trainer
print("\nConfiguring Tunix Trainer...")
# Define the optimizer
# Tunix provides `get_optimizer` for common optimizers like AdamW
optimizer = get_optimizer(
name="adamw",
learning_rate=get_lr_schedule(
name="constant",
initial_learning_rate=1e-4, # A common starting point for SFT
),
weight_decay=0.01,
)
# Initialize the TrainState
# This holds the model parameters, optimizer state, and other training metadata.
# We need to provide a dummy input to initialize the model's parameters shape.
dummy_input = {
"input_ids": jnp.zeros((sft_dataset.batch_size, sft_dataset.max_length), dtype=jnp.int32),
"attention_mask": jnp.ones((sft_dataset.batch_size, sft_dataset.max_length), dtype=jnp.int32),
"labels": jnp.zeros((sft_dataset.batch_size, sft_dataset.max_length), dtype=jnp.int32),
}
# Make sure dummy_input is replicated across devices
# (jax.tree_map is deprecated; use jax.tree_util.tree_map)
dummy_input = jax.tree_util.tree_map(lambda x: jnp.array([x] * jax.device_count()), dummy_input)
# Tunix's TrainState requires a specific model wrapper or direct Flax model.
# Let's use FlaxLLMForCausalLM for full Tunix compatibility
llm_model = FlaxLLMForCausalLM(model)
# Initialize the TrainState with the model and optimizer
train_state = TrainState.create(
apply_fn=llm_model.__call__,
params=llm_model.params,
tx=optimizer,
# Need to pass a dummy input for parameter initialization
**dummy_input
)
# Initialize the Trainer
trainer = Trainer(
model=llm_model,
train_dataset=sft_dataset,
eval_dataset=None, # We're skipping evaluation for this simple example
train_state=train_state,
epochs=3, # Train for 3 epochs - a small number for quick demonstration
max_steps_per_epoch=None, # Train on all data per epoch
log_steps=1, # Log every step
output_dir="./sft_output", # Directory to save checkpoints and logs
)
print("Trainer configured.")
Explanation:
- Optimizer: We define an `AdamW` optimizer, a popular choice for deep learning, with a constant learning rate of `1e-4`. Tunix provides the `get_optimizer` and `get_lr_schedule` helpers.
- TrainState: This is a JAX/Flax concept that holds the mutable state of your training process, including the model’s parameters and the optimizer’s state (e.g., momentum buffers).
  - We create `dummy_input` to help JAX infer the shapes of the model’s parameters during initialization. This is a common JAX pattern. We also replicate it for multi-device training.
  - `llm_model = FlaxLLMForCausalLM(model)`: We wrap our Hugging Face Flax model with Tunix’s `FlaxLLMForCausalLM`. This wrapper ensures compatibility with Tunix’s training loop and methods.
  - `TrainState.create(...)`: Initializes the `TrainState` with the model’s `apply_fn` (how the model processes inputs), its initial parameters, and the optimizer (`tx`).
- Trainer: We instantiate the `Trainer` with:
  - `model`: Our Tunix-wrapped LLM.
  - `train_dataset`: Our `sft_dataset` created earlier.
  - `epochs`: How many times to iterate over the entire dataset. Three is very small, but good for a first run.
  - `log_steps`: How frequently to log training progress.
  - `output_dir`: Where checkpoints and logs will be saved.
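To build intuition for what a learning-rate schedule actually computes, here is a standalone sketch (plain Python, not Tunix’s `get_lr_schedule` implementation) of the constant schedule we use above, alongside a common variant that warms up linearly before settling at the constant value:

```python
# A constant schedule returns the same learning rate at every step.
def constant_lr(initial_lr):
    return lambda step: initial_lr

# A linear warmup ramps the rate from ~0 up to initial_lr over
# warmup_steps, then holds it constant. Warmup often stabilizes the
# first updates of fine-tuning.
def warmup_then_constant(initial_lr, warmup_steps):
    def lr(step):
        if step < warmup_steps:
            return initial_lr * (step + 1) / warmup_steps
        return initial_lr
    return lr

const = constant_lr(1e-4)
warm = warmup_then_constant(1e-4, warmup_steps=4)

print([const(s) for s in range(6)])            # flat line at 1e-4
print([round(warm(s), 7) for s in range(6)])   # ramps up, then flat
```

A schedule is just a function from step number to learning rate; the optimizer queries it before each update.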
Step 4: Run the Fine-Tuning
With everything configured, starting the training is a single line of code!
Add this to run_sft.py:
# run_sft.py (continued)
# ... (previous code for Trainer configuration) ...
# 5. Run the fine-tuning
print("\nStarting Supervised Fine-Tuning...")
trainer.train()
print("Fine-tuning complete!")
# 6. Save the fine-tuned model
output_model_path = os.path.join(trainer.output_dir, "final_sft_model")
print(f"\nSaving fine-tuned model to {output_model_path}...")
trainer.save_model(output_model_path)
print("Model saved.")
Explanation:
- `trainer.train()`: This kicks off the entire training process. You’ll see logs printed to your console showing the loss (hopefully!) decreasing.
- `trainer.save_model(output_model_path)`: After training, we save the learned model weights to a specified directory. This allows us to load and use the model later.
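Under the hood, any trainer’s `train()` boils down to the same skeleton: loop over epochs and examples, compute a loss, and nudge parameters against the gradient. The toy below (plain Python, a single parameter fit with SGD; not Tunix’s actual loop) shows that skeleton and the falling loss you should expect to see in the logs:

```python
# Fit y = w * x to data generated with w = 2, using per-example SGD.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # targets are 2 * x
w = 0.0          # single trainable "parameter"
lr = 0.05

def epoch_loss(w):
    # Mean squared error over the whole dataset.
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

losses = []
for epoch in range(20):
    for x, y in data:                  # "batch size 1"
        grad = 2 * (w * x - y) * x     # d/dw of the squared error
        w -= lr * grad                 # SGD update
    losses.append(epoch_loss(w))

print(f"final w={w:.3f}, first loss={losses[0]:.4f}, last loss={losses[-1]:.6f}")
```

An LLM has billions of parameters instead of one, and `AdamW` instead of raw SGD, but the loss curve you watch during `trainer.train()` is produced by exactly this kind of loop.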
Step 5: Inference with the Fine-Tuned Model
Now for the exciting part: let’s see if our model learned anything! We’ll load the fine-tuned model and try to generate responses.
Append to run_sft.py:
# run_sft.py (continued)
# ... (previous code for saving model) ...
# 7. Perform inference with the fine-tuned model
print("\nPerforming inference with the fine-tuned model...")
# Load the fine-tuned model
# We need to reload the Flax model and then wrap it with FlaxLLMForCausalLM
fine_tuned_model = FlaxAutoModelForCausalLM.from_pretrained(output_model_path)
fine_tuned_llm = FlaxLLMForCausalLM(fine_tuned_model)
# Function to generate a response
def generate_response(prompt_text, max_new_tokens=50):
# Prepare the prompt for the model
input_ids = tokenizer(prompt_text, return_tensors="jax").input_ids
# Generate tokens
# Note: Tunix's FlaxLLMForCausalLM might have a specific generate method,
# or you can use the underlying Hugging Face model's generate method.
# For simplicity, we'll use the Hugging Face model's method here.
output_ids = fine_tuned_model.generate(
input_ids,
max_new_tokens=max_new_tokens,
do_sample=True, # Use sampling for more varied outputs
temperature=0.7, # Controls randomness
top_k=50, # Limits the vocabulary to the top K most likely tokens
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.pad_token_id,
)
    # Decode the generated tokens. Flax's generate() returns an output
    # object whose `sequences` field holds the token IDs.
    response = tokenizer.decode(output_ids.sequences[0], skip_special_tokens=True)
# Post-process to remove the original prompt if it's included in the response
# This might depend on how the model was trained and how the prompt was formatted.
if prompt_text in response:
return response[len(prompt_text):].strip()
return response.strip()
# Test prompts
test_prompts = [
"Tell me a fun fact about the ocean.",
"What is the capital of France?",
"Explain recursion in programming.",
"Who discovered gravity?",
]
for prompt in test_prompts:
print(f"\nPrompt: {prompt}")
response = generate_response(prompt)
print(f"Response: {response}")
print("\nInference complete.")
Explanation:
- We reload the model from `output_model_path` to ensure we’re using our fine-tuned weights.
- The `generate_response` function:
  - Takes a `prompt_text` and tokenizes it.
  - Uses the `fine_tuned_model.generate()` method (from Hugging Face Transformers) to produce new tokens. We use `do_sample=True` with `temperature` and `top_k` for more varied, less deterministic outputs.
  - Decodes the generated token IDs back into human-readable text.
  - Includes a simple post-processing step to remove the original prompt from the response, which is often desirable.
- We then test our fine-tuned model on a few prompts: some close in style to our training examples (like “What is the capital of France?”) and some entirely new.
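To demystify what `do_sample`, `temperature`, and `top_k` actually do inside `generate()`, here is a standalone sketch of a single sampling step over a toy 5-token vocabulary (plain Python; the real implementation works on the model’s full logits, but the mechanics are the same):

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_k=3, rng=None):
    """One toy next-token sampling step with top-k filtering and temperature."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    # Keep only the top_k highest-scoring token ids.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # stable softmax numerators
    return rng.choices(top, weights=weights, k=1)[0]

logits = [2.0, 0.5, 1.5, -1.0, 0.0]   # pretend model scores for 5 tokens
token = sample_next_token(logits, temperature=0.7, top_k=3)
print(token)  # one of the ids {0, 2, 1} -- the three highest logits
```

Lowering the temperature toward zero makes the choice effectively greedy (always the highest logit), while raising `top_k` widens the pool of candidate tokens.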
To run the full script, simply execute:
python run_sft.py
You should observe the model providing more accurate or specific responses to the questions it was fine-tuned on, and potentially improved instruction following for similar new prompts.
Mini-Challenge: Tweak and Observe!
Great job completing your first SFT run! Now, let’s play around a bit to build intuition.
Challenge:
- Add two more unique prompt-completion pairs to your
sft_datalist inrun_sft.py. Make them very specific, perhaps about a fictional character or a niche topic. - Change the
epochsparameter in theTrainerfrom3to5. - Re-run the script.
- Observe the new responses for your added prompts and for the existing test prompts. Did the model’s behavior change? Is it more accurate on the new specific facts?
Hint: Pay close attention to the loss values during training. Does it continue to decrease? Does the model seem to “memorize” the new facts better?
What to observe/learn:
- How even a small increase in training data and epochs can influence a model’s ability to recall specific facts or follow instructions.
- The trade-off between training time and model performance.
- The general trend of loss during training (it should generally decrease, indicating learning).
Common Pitfalls & Troubleshooting
Fine-tuning can sometimes be tricky. Here are a few common issues you might encounter and how to approach them:
“Out of Memory” (OOM) Errors:
- Symptom: Your script crashes with a message like `CUDA out of memory` or `ResourceExhaustedError`.
- Cause: The model, batch size, or `max_length` is too large for your GPU/TPU’s memory.
- Solution:
  - Reduce `batch_size`: This is often the first step. Try `1` or `2`.
  - Reduce `max_length`: If your sequences are very long, shortening `max_length` in `tokenize_function` can help.
  - Use a smaller model: If you’re using a massive LLM, consider a smaller variant for initial experiments.
  - Gradient accumulation: For very large models, Tunix supports gradient accumulation (processing batches sequentially but updating weights less frequently), which can simulate larger batch sizes with less memory. (This is an advanced feature not covered in this chapter, but good to know.)
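The idea behind gradient accumulation is simple enough to show in a few lines. This sketch (plain Python on the same toy `y = 2x` problem, not Tunix’s implementation) sums gradients over several micro-batches and applies one averaged update, simulating a larger batch with the memory footprint of a small one:

```python
# Accumulate gradients over `accum_steps` micro-batches, then apply a
# single averaged update -- one update per "virtual" large batch.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # targets are 2 * x
w, lr, accum_steps = 0.0, 0.01, 2

grad_sum, updates = 0.0, 0
for step, (x, y) in enumerate(data, start=1):
    grad_sum += 2 * (w * x - y) * x        # gradient of the squared error
    if step % accum_steps == 0:
        w -= lr * grad_sum / accum_steps   # average, then update once
        grad_sum = 0.0
        updates += 1

print(f"{updates} updates, w={w:.4f}")
```

Four micro-batches with `accum_steps=2` yield only two weight updates, each equivalent to a batch of two examples; the same trick scales to any micro-batch size a device can hold.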
Model Not Learning / Underfitting:
- Symptom: The training loss doesn’t decrease significantly, or the model’s responses are still generic after fine-tuning.
- Cause:
  - Insufficient data: Your SFT dataset might be too small or not diverse enough for the task.
  - Too few epochs: The model hasn’t had enough time to learn.
  - Learning rate too low: The model’s updates are too small to make meaningful progress.
- Solution:
  - Increase `epochs`: Give the model more training time.
  - Increase the `learning_rate`: Experiment with larger values, e.g. `2e-4` or `5e-4`.
  - Improve dataset quality/quantity: This is often the most impactful solution. More high-quality, relevant data is key.
Overfitting:
- Symptom: The training loss goes very low, but the model performs poorly on new, unseen data (it just “memorizes” the training examples).
- Cause:
  - Too many epochs: The model has learned the training data too well, including its noise.
  - Learning rate too high: Updates are too aggressive.
  - Small dataset: A tiny dataset is easy for the model to memorize.
- Solution:
  - Reduce `epochs`: Stop training earlier.
  - Reduce the `learning_rate`: Make updates smaller.
  - Add regularization: Techniques like weight decay (already included in our `AdamW` optimizer) help prevent overfitting. Tunix’s `Trainer` may offer more advanced regularization options.
  - Use an `eval_dataset`: This is crucial. If you provide an `eval_dataset` to the `Trainer`, it can track validation loss, allowing you to stop training when validation loss starts to increase (early stopping).
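Early stopping is mechanical enough to sketch directly. The function below (plain Python; any trainer’s early-stopping callback follows this pattern) scans a history of validation losses and reports the epoch at which training should have stopped:

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the epoch index where training should stop: the point at
    which validation loss has failed to improve for `patience`
    consecutive evaluations."""
    best, bad = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad = loss, 0   # new best: reset the patience counter
        else:
            bad += 1
            if bad >= patience:
                return epoch      # stalled long enough -- stop here
    return len(val_losses) - 1    # patience never exhausted

# Validation loss improves, then climbs as the model starts to overfit.
history = [2.1, 1.4, 1.1, 1.2, 1.3, 1.5]
print(early_stop_epoch(history))  # prints 4
```

In practice you would also restore the checkpoint saved at the best epoch (epoch 2 here), not just halt training.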
Summary
Congratulations on completing your first Supervised Fine-Tuning with Tunix! You’ve taken a significant step in understanding how to adapt powerful LLMs to your specific needs.
Here are the key takeaways from this chapter:
- SFT’s purpose: It’s the process of teaching a pre-trained LLM specific skills or behaviors using labeled input-output examples.
- Dataset importance: A well-structured dataset (often in JSONL format with `prompt` and `completion` pairs) is crucial for effective SFT.
- Tunix data pipeline: `tunix.data.Dataset` and custom tokenization functions streamline data preparation, including tokenization, batching, and device replication.
- Tunix Trainer: The `tunix.Trainer` orchestrates the entire fine-tuning process, managing the model, optimizer, learning rate, and training loop.
- Practical application: You’ve successfully prepared data, configured a trainer, run a fine-tuning job, and performed inference with your newly specialized LLM.
In the next chapter, we’ll delve deeper into more advanced fine-tuning techniques beyond basic SFT, exploring how to further align models with human preferences and complex instructions using methods like Reinforcement Learning from Human Feedback (RLHF). Get ready for more exciting challenges!
References
- Tunix Official GitHub Repository: https://github.com/google/tunix
- Tunix Documentation: https://tunix.readthedocs.io/
- JAX Documentation: https://jax.readthedocs.io/en/latest/
- Hugging Face Transformers Documentation: https://huggingface.co/docs/transformers/index