Introduction
Welcome to Chapter 21! After exploring the theoretical foundations of deep learning, neural networks, and various architectures, it’s time to get your hands dirty with a complete, practical project. In this chapter, we’ll build a custom image classifier from scratch, leveraging the power of modern deep learning frameworks and techniques.
This project will guide you through the entire lifecycle of an image classification task: from preparing your own dataset, to selecting and modifying a pre-trained model, training it, and evaluating its performance. By the end, you’ll not only have a working image classifier but also a much deeper understanding of the practical considerations involved in real-world deep learning applications. This is a foundational skill for any aspiring AI/ML engineer or researcher, opening doors to advanced computer vision tasks.
Before we dive in, ensure you’re comfortable with Python programming, basic machine learning concepts, and the fundamentals of deep learning, including neural networks and convolutional layers, as covered in previous chapters. We’ll be using PyTorch, one of the leading deep learning frameworks, so a basic familiarity with its tensors and operations will be beneficial, though we’ll explain each step.
Understanding Image Classification
At its core, image classification is the task of assigning a label or category to an entire image. For example, given an image, a classifier might tell us if it contains a “cat,” a “dog,” or an “airplane.” This seemingly simple task is a cornerstone of many advanced AI applications, from self-driving cars recognizing pedestrians to medical imaging systems detecting diseases.
How do machines “see” and classify images? Unlike humans, who perceive objects holistically, computers process images as grids of pixel values. Deep learning, particularly with Convolutional Neural Networks (CNNs), provides a powerful way for machines to learn hierarchical features from these pixel values. CNNs can automatically detect edges, textures, shapes, and eventually entire objects, forming increasingly complex representations as data passes through their layers.
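To make the "grid of pixel values" concrete, here is a minimal sketch of how a network sees a colour image: a 3-D tensor of shape (channels, height, width). The random tensor below is just a stand-in for a real photo.

```python
import torch

# To a network, a 224x224 RGB image is just a 3-D grid of numbers:
# (channels, height, width), with pixel values typically scaled to [0.0, 1.0].
image = torch.rand(3, 224, 224)  # a random stand-in for a real photo
print(image.shape)  # torch.Size([3, 224, 224])
```

A CNN's convolutional layers slide small filters over this grid, turning raw pixels into edge, texture, and shape detectors layer by layer.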
The Problem: Limited Data and Training Time
Training a powerful CNN from scratch requires massive datasets and significant computational resources. What if you only have a few hundred or thousand images for your specific classification task? This is where a technique called Transfer Learning becomes incredibly valuable.
The Power of Transfer Learning
Imagine you’ve spent years learning to identify various animals. Now, if someone asks you to identify a new breed of dog you’ve never seen before, you don’t start from scratch. Instead, you leverage your existing knowledge of what makes a “dog” a “dog” (ears, snout, fur, etc.) and adapt it to the new breed. Transfer learning in deep learning works similarly.
Transfer learning is a technique where a model trained on a large, general dataset (like ImageNet, which contains millions of images across 1000 categories) is repurposed for a new, often smaller, related task. The idea is that the features learned by the model on the large dataset (e.g., detecting edges, corners, textures, and even parts of objects) are generic and useful for many computer vision tasks.
There are two primary ways to apply transfer learning:
- Feature Extractor: You take a pre-trained CNN, remove its final classification layer, and use the rest of the network as a fixed feature extractor. The features extracted are then fed into a new, smaller classifier (e.g., a simple fully connected layer) that you train from scratch on your specific dataset. This is efficient when your dataset is small and very different from the pre-training dataset.
- Fine-tuning: You take a pre-trained CNN and replace its final classification layer, just like with a feature extractor. However, instead of freezing all the pre-trained layers, you unfreeze some or all of them and continue training the entire network (or parts of it) on your new dataset, usually with a very low learning rate. This allows the model to adapt its learned features more closely to your specific data, often leading to better performance, especially if your dataset is larger and similar to the pre-training data.
For this project, we’ll focus on a common and highly effective approach: loading a pre-trained model, replacing its final classification layer, and fine-tuning the network on our own data.
Choosing Our Tools: PyTorch
For this project, we’ll be using PyTorch, a powerful open-source machine learning framework developed by Facebook (now Meta). PyTorch is known for its flexibility, Python-friendly interface, and dynamic computational graph, which makes debugging and experimentation intuitive.
PyTorch’s torchvision library provides convenient access to popular datasets, model architectures, and image transformations, making it ideal for computer vision tasks. This chapter pins PyTorch 2.4.0 (with the matching torchvision 0.19.0 and torchaudio 2.4.0); newer releases in the 2.x line continue to improve performance, compiler optimizations, and distributed training, so check the official PyTorch website for the current stable version.
Dataset Considerations: Custom Data
For any image classification project, your data is paramount. Here we’ll simulate a small “custom” dataset, organized in the standard layout that torchvision.datasets.ImageFolder understands:
your_custom_dataset/
├── class_a/
│ ├── image1.jpg
│ ├── image2.png
│ └── ...
├── class_b/
│ ├── imageA.jpeg
│ ├── imageB.jpg
│ └── ...
└── ...
Each subfolder name (class_a, class_b) will automatically become a class label.
Step-by-Step Implementation
Let’s get started! We’ll build our custom image classifier piece by piece.
Step 1: Setting Up Your Environment
First, open your terminal or command prompt. We need to install PyTorch and torchvision. For optimal performance, especially with deep learning, a GPU (Graphics Processing Unit) is highly recommended. If you have an NVIDIA GPU, ensure you have CUDA installed.
# For NVIDIA GPU users (CUDA 12.1 builds shown here).
# Check PyTorch's official website for the exact command if your CUDA version differs.
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
# For CPU-only users (if you don't have a compatible GPU or prefer CPU)
# This will install the CPU version of PyTorch.
# pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cpu
Explanation:
- `pip install`: the standard Python package installer.
- `torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0`: exact versions pinned for consistency; these three releases are built to work together within the PyTorch 2.x ecosystem. Always refer to the official PyTorch installation guide for the most up-to-date command for your specific system and CUDA version.
- `--index-url https://download.pytorch.org/whl/cu121`: tells pip to download packages from the index that hosts CUDA 12.1-compatible binaries. For CPU-only installs, use `/whl/cpu` instead.
Next, let’s create a Python script named image_classifier.py.
Step 2: Preparing Our Custom Dataset
For this project, we’ll create a dummy dataset structure. In a real scenario, you would replace these with your actual image files.
Action: Create a directory structure like this in the same folder as your image_classifier.py:
data/
├── train/
│ ├── cat/
│ │ ├── cat_001.jpg
│ │ ├── cat_002.jpg
│ │ └── ... (add a few more dummy images, even placeholders)
│ └── dog/
│ ├── dog_001.jpg
│ ├── dog_002.jpg
│ └── ... (add a few more dummy images)
└── val/
├── cat/
│ ├── cat_003.jpg
│ └── ...
└── dog/
├── dog_003.jpg
└── ...
You can use any small image files you have for this exercise; just make sure every file is a real, decodable image (a zero-byte .jpg placeholder will make ImageFolder's loader fail) and that there are at least 2-3 images per class in both train and val directories.
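If you don't have images handy, the following sketch generates valid placeholder JPEGs with Pillow. The `make_dummy_dataset` helper is hypothetical (not part of the project script), but it produces exactly the folder layout shown above.

```python
import os
from PIL import Image

# Generate tiny solid-colour JPEGs as placeholders so the pipeline can run
# end to end. The files must be real, decodable images: a zero-byte .jpg
# would make ImageFolder's loader raise an error.
def make_dummy_dataset(root="data", classes=("cat", "dog"), per_split=3):
    for split in ("train", "val"):
        for idx, cls in enumerate(classes):
            folder = os.path.join(root, split, cls)
            os.makedirs(folder, exist_ok=True)
            for i in range(per_split):
                # A different flat colour per class; any size PIL accepts.
                img = Image.new("RGB", (64, 64), (60 * (idx + 1), 90, 130))
                img.save(os.path.join(folder, f"{cls}_{i:03d}.jpg"))

make_dummy_dataset()
```

Run this once from the same folder as image_classifier.py and the `data/` tree is ready.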
Now, let’s add the code to load and preprocess this data.
# image_classifier.py
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, models, transforms
from torch.utils.data import DataLoader
import os
import time
import copy
# 1. Define device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# 2. Define data transformations
# These transformations ensure images are consistent (resized, converted to tensor)
# and normalized according to ImageNet's mean and standard deviation,
# which is crucial for pre-trained models.
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),  # Randomly crop and resize to 224x224
        transforms.RandomHorizontalFlip(),  # Randomly flip the image horizontally
        transforms.ToTensor(),              # Convert image to PyTorch Tensor
        transforms.Normalize([0.485, 0.456, 0.406],
                             [0.229, 0.224, 0.225])  # Normalize with ImageNet stats
    ]),
    'val': transforms.Compose([
        transforms.Resize(256),      # Resize the shorter side to 256
        transforms.CenterCrop(224),  # Crop the center to 224x224
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406],
                             [0.229, 0.224, 0.225])
    ]),
}
# 3. Load datasets
data_dir = 'data'  # Make sure your 'data' folder sits next to this script
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          data_transforms[x])
                  for x in ['train', 'val']}
# 4. Create data loaders
# DataLoader provides an iterable over the dataset, handling batching, shuffling, etc.
dataloaders = {x: DataLoader(image_datasets[x], batch_size=4,
                             shuffle=True, num_workers=2)  # tune num_workers per system
               for x in ['train', 'val']}
dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}
class_names = image_datasets['train'].classes
print(f"Detected classes: {class_names}")
print(f"Training dataset size: {dataset_sizes['train']}")
print(f"Validation dataset size: {dataset_sizes['val']}")
Explanation:
- `device`: checks whether a CUDA-enabled GPU is available. If so, it uses `cuda:0` (the first GPU); otherwise it defaults to `cpu`. Moving computations to the GPU significantly speeds up training.
- `data_transforms`: defines how our images are preprocessed.
  - `transforms.Compose`: chains multiple transformations together.
  - `transforms.RandomResizedCrop(224)` / `transforms.Resize(256)` & `transforms.CenterCrop(224)`: bring every image to a consistent 224x224 pixels, the input size expected by many pre-trained models. The random crop adds useful variability during training.
  - `transforms.RandomHorizontalFlip()`: a common data augmentation technique that randomly flips images horizontally, helping the model generalize better.
  - `transforms.ToTensor()`: converts a PIL Image (or NumPy array) to a PyTorch `Tensor` and scales pixel values to the range [0.0, 1.0].
  - `transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])`: normalizes the tensors with ImageNet's per-channel mean and standard deviation. This is crucial because the pre-trained model was trained on ImageNet with exactly this normalization.
- `image_datasets`: `datasets.ImageFolder` is a convenient `torchvision` utility that automatically loads images from a directory structure where subfolders represent classes, applying the defined transformations.
- `dataloaders`: `DataLoader` wraps each dataset, providing efficient iteration over batches of images. `batch_size` determines how many images are processed at once, `shuffle=True` reshuffles the data every epoch (important for training), and `num_workers` sets how many subprocesses handle data loading (speeds up I/O).
- `dataset_sizes` and `class_names`: the total number of images in each split and the names of the detected classes.
Step 3: Loading a Pre-trained Model
Now, let’s load a pre-trained model and modify its final layer for our specific classification task. We’ll use resnet18, a relatively small but effective Convolutional Neural Network.
# image_classifier.py (continue appending to the file)
# 5. Load a pre-trained model
model_ft = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
# Get the number of input features for the last fully connected layer
num_ftrs = model_ft.fc.in_features
# Replace the last layer with a new one that has 'len(class_names)' output features
# This new layer will be trained from scratch for our specific classes.
model_ft.fc = nn.Linear(num_ftrs, len(class_names))
# Move the model to the chosen device (GPU or CPU)
model_ft = model_ft.to(device)
print(f"Model architecture modified. Final classification layer now has {len(class_names)} outputs.")
Explanation:
- `models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)`: loads the `resnet18` architecture and, crucially, downloads the weights pre-trained on the ImageNet-1K dataset. This is the core of transfer learning!
- `num_ftrs = model_ft.fc.in_features`: inspects the model's last layer (`model_ft.fc`), a fully connected (linear) layer that originally outputs 1000 ImageNet classes, and records how many input features it expects.
- `model_ft.fc = nn.Linear(num_ftrs, len(class_names))`: replaces the original final fully connected layer with a new one. The input size stays `num_ftrs`, but the output size becomes `len(class_names)`, the number of classes in our custom dataset (e.g., 2 for "cat" and "dog"). This new layer's weights are randomly initialized and are the primary focus of the initial training.
- `model_ft = model_ft.to(device)`: moves the entire model to the chosen device (GPU if available, otherwise CPU).
Step 4: Defining Loss Function and Optimizer
For training, we need a way to measure how “wrong” our model’s predictions are (loss function) and a strategy to adjust the model’s weights to reduce that error (optimizer).
# image_classifier.py (continue appending to the file)
# 6. Define Loss Function and Optimizer
criterion = nn.CrossEntropyLoss() # Suitable for multi-class classification
# Observe that all parameters are being optimized since we haven't frozen any layers yet.
# However, the newly initialized `model_ft.fc` layer will have much larger gradients
# initially and will learn faster.
optimizer_ft = optim.Adam(model_ft.parameters(), lr=0.001)
# Optionally, you can set up a learning rate scheduler to reduce the learning rate
# as training progresses, which can help achieve better convergence.
# Here, we reduce the learning rate by a factor of 0.1 every 7 epochs.
exp_lr_scheduler = optim.lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)
print("Loss function (CrossEntropyLoss) and Optimizer (Adam) configured.")
Explanation:
- `criterion = nn.CrossEntropyLoss()`: a common and effective loss function for multi-class classification. It combines `LogSoftmax` and `NLLLoss` in one numerically stable class.
- `optimizer_ft = optim.Adam(model_ft.parameters(), lr=0.001)`: the Adam optimizer, a popular choice known for its efficiency and good performance. `model_ft.parameters()` tells the optimizer which parameters (weights and biases) to update; `lr=0.001` sets the initial learning rate.
- `exp_lr_scheduler = optim.lr_scheduler.StepLR(...)`: a learning rate scheduler that dynamically adjusts the learning rate during training. `StepLR` multiplies the learning rate by `gamma` (here, 0.1) every `step_size` (here, 7) epochs, letting the model take larger steps initially and smaller, more precise steps later.
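To see the scheduler's effect concretely, the sketch below steps `StepLR` through 15 simulated epochs with a single dummy parameter standing in for the model, and records the learning rate each epoch.

```python
import torch
import torch.optim as optim

# How StepLR(step_size=7, gamma=0.1) decays the learning rate over epochs.
param = torch.nn.Parameter(torch.zeros(1))  # dummy stand-in for the model
opt = optim.Adam([param], lr=0.001)
sched = optim.lr_scheduler.StepLR(opt, step_size=7, gamma=0.1)

lrs = []
for epoch in range(15):
    lrs.append(opt.param_groups[0]['lr'])  # lr used during this epoch
    opt.step()    # (a real training epoch would happen here)
    sched.step()  # end-of-epoch scheduler update
print(lrs[0], lrs[7], lrs[14])  # lr drops by 10x at epochs 7 and 14
```

Epochs 0-6 run at the initial rate, then the rate is cut tenfold every 7 epochs.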
Step 5: Training the Model
This is the core of the learning process. We’ll define a training function that iterates over our data, makes predictions, calculates loss, and updates the model’s weights.
# image_classifier.py (continue appending to the file)
# 7. Training function
def train_model(model, criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()
    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print(f'Epoch {epoch}/{num_epochs - 1}')
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # Zero the parameter gradients
                optimizer.zero_grad()

                # Forward pass; track gradients only in the train phase
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)  # Get the predicted class
                    loss = criterion(outputs, labels)

                    # Backward pass + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()   # Compute gradients
                        optimizer.step()  # Update model parameters

                # Statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            if phase == 'train':
                scheduler.step()  # Update learning rate scheduler

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]
            print(f'{phase} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')

            # Deep copy the model if it's the best validation accuracy so far
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print(f'Training complete in {time_elapsed // 60:.0f}m {time_elapsed % 60:.0f}s')
    print(f'Best val Acc: {best_acc:.4f}')

    # Load best model weights
    model.load_state_dict(best_model_wts)
    return model

# Start training!
print("Starting model training...")
model_ft = train_model(model_ft, criterion, optimizer_ft, exp_lr_scheduler,
                       num_epochs=10)  # Reduced epochs for a quick demo
print("Training finished.")
Explanation of the train_model function:
- Initialization: records the start time and initializes `best_model_wts` (to save the model with the highest validation accuracy) and `best_acc`.
- Epoch loop: the outer loop iterates for `num_epochs`.
- Phase loop: inside each epoch there are two phases, 'train' and 'val' (validation).
  - `model.train()` / `model.eval()`: set the model to training or evaluation mode. This is important because layers like `Dropout` or `BatchNorm` behave differently during training and inference.
  - `running_loss`, `running_corrects`: accumulate loss and correct predictions for the current phase.
- Data iteration: the inner loop iterates over batches from the `dataloaders`.
  - `inputs = inputs.to(device)`, `labels = labels.to(device)`: move the input images and their labels to the chosen device (GPU or CPU).
  - `optimizer.zero_grad()`: clears the gradients of all optimized parameters. Gradients accumulate by default, so they must be zeroed before each backward pass.
  - `with torch.set_grad_enabled(phase == 'train')`: enables gradient calculation only in the training phase. Validation doesn't need gradients, saving memory and computation.
  - `outputs = model(inputs)`: the forward pass, feeding inputs through the model to get raw predictions (logits).
  - `_, preds = torch.max(outputs, 1)`: `torch.max` returns the maximum value and its index; the index is the predicted class.
  - `loss = criterion(outputs, labels)`: computes the loss between the model's outputs and the true labels.
  - `loss.backward()`: the backward pass, computing gradients for all parameters that require them.
  - `optimizer.step()`: updates the model's parameters using the computed gradients.
- `scheduler.step()`: after each training phase, the learning rate scheduler is updated, potentially decreasing the learning rate.
- Statistics: the average loss and accuracy for the current phase are computed and printed.
- Best-model check: if the current validation accuracy beats `best_acc`, the model's current weights are saved as `best_model_wts`.
- Finalization: after all epochs, the model is loaded with `best_model_wts` (the weights that performed best on validation), and the total training time is printed.
Step 6: Saving the Model
Once trained, you’ll want to save your model so you can use it later without retraining.
# image_classifier.py (continue appending to the file)
# 8. Save the trained model
model_save_path = 'custom_image_classifier.pth'
torch.save(model_ft.state_dict(), model_save_path)
print(f"Model saved to {model_save_path}")
# To load the model later:
# loaded_model = models.resnet18(weights=None) # Load architecture without pre-trained weights
# num_ftrs_loaded = loaded_model.fc.in_features
# loaded_model.fc = nn.Linear(num_ftrs_loaded, len(class_names))
# loaded_model.load_state_dict(torch.load(model_save_path))
# loaded_model = loaded_model.to(device)
# loaded_model.eval() # Set to evaluation mode for inference
# print("Model loaded successfully for inference.")
Explanation:
- `torch.save(model_ft.state_dict(), model_save_path)`: saves only the learned parameters (weights and biases) of the model, not the entire model architecture. This is generally preferred: it keeps the file small and makes it flexible to load in different environments. The file is saved as `custom_image_classifier.pth`.
- Loading snippet: the commented-out section shows how to load the saved weights later. First instantiate the model architecture (e.g., `resnet18`), then load the state dictionary into it. Remember to call `model.eval()` after loading, before inference.
Congratulations! You’ve just built and trained a custom image classifier using a pre-trained ResNet model and PyTorch.
Mini-Challenge: Experiment with Hyperparameters and Models
Now that you have a working pipeline, it’s time to experiment and deepen your understanding.
Challenge:
- Change the pre-trained model: instead of `resnet18`, try `resnet34` or `vgg16` from `torchvision.models`. You'll need to adjust `num_ftrs` and the final `nn.Linear` layer for the new model's architecture.
- Adjust learning rate & epochs: change `lr` in the `Adam` optimizer (e.g., to `0.01` or `0.0001`), and increase `num_epochs` in the `train_model` call (e.g., to 20 or 30).
- Observe the effects: how do these changes impact training speed, final accuracy, and validation loss?

Hint:

- For `resnet34`, the modification to the `fc` layer is the same as for `resnet18`.
- For `vgg16`, the final classifier lives in `model.classifier` and contains several `nn.Linear` layers. Replace the last one, for example: `model_ft.classifier[6] = nn.Linear(model_ft.classifier[6].in_features, len(class_names))`. Always inspect the model's structure with `print(model_ft)` after loading.
What to observe/learn: This challenge will help you understand that deep learning is often an iterative process of experimentation. Different architectures and hyperparameters can significantly affect model performance and training dynamics. You’ll gain intuition about how to debug model behavior by looking at training and validation loss/accuracy curves.
Common Pitfalls & Troubleshooting
Even experienced practitioners encounter issues. Here are a few common ones you might face:
- CUDA out of memory: your GPU doesn't have enough memory to process the current batch of data or the model itself.
  - Solution: reduce `batch_size` (e.g., from 4 to 2 or even 1). If that's not enough, use a smaller model (e.g., `resnet18` instead of `resnet50`) or fall back to CPU training (slower).
- Incorrect data paths or empty classes: if `ImageFolder` can't find your images, or a class folder is empty, you may get errors or unexpected `dataset_sizes`.
  - Solution: double-check your `data_dir` and the exact spelling of subfolder names, and ensure every class subfolder contains actual image files.
- Overfitting: training accuracy is very high but validation accuracy is much lower. The model has memorized the training data and doesn't generalize to unseen data.
  - Solutions:
    - Increase data augmentation (`transforms`).
    - Reduce model complexity (e.g., use `resnet18` if you were using `resnet50`).
    - Add regularization techniques (e.g., Dropout layers, weight decay in the optimizer).
    - Reduce `num_epochs`.
    - Get more diverse training data.
- Underfitting: both training and validation accuracy are low. The model isn't learning enough from the data.
  - Solutions:
    - Increase model complexity (e.g., use a deeper ResNet).
    - Increase `num_epochs`.
    - Increase the learning rate (carefully).
    - Ensure your data is clean and properly labeled.
    - Check that the pre-trained weights are actually being loaded.
Summary
In this chapter, you’ve taken a significant step from theory to practice by building a custom image classifier. Here are the key takeaways:
- Image Classification Fundamentals: you understand the goal of image classification and how CNNs are leveraged for this task.
- Transfer Learning: you've practically applied transfer learning using a pre-trained `resnet18` model, significantly reducing the need for massive datasets and long training times.
- PyTorch Workflow: you've gained hands-on experience with a complete PyTorch workflow, including:
  - Setting up your environment and handling devices (CPU/GPU).
  - Preparing custom datasets with `ImageFolder` and `transforms`.
  - Creating efficient data pipelines with `DataLoader`.
  - Loading and modifying pre-trained models.
  - Defining loss functions (`nn.CrossEntropyLoss`) and optimizers (`optim.Adam`).
  - Implementing a full training and validation loop with learning rate scheduling.
  - Saving and loading model weights.
- Practical Problem Solving: you've encountered and considered solutions for common deep learning challenges like CUDA out-of-memory errors, overfitting, and underfitting.
This project is a fundamental building block. From here, you can explore more advanced computer vision tasks, delve into deploying your models, or even experiment with different architectures and fine-tuning strategies. The journey into becoming a proficient AI/ML engineer is paved with such hands-on experiences!
References
- PyTorch Official Website: the primary resource for PyTorch documentation, tutorials, and installation guides. Always refer here for the latest stable versions and best practices.
- PyTorch `torchvision` Documentation: detailed information on the datasets, models, and transforms used throughout this chapter.
- PyTorch Transfer Learning Tutorial: an excellent official tutorial covering transfer learning in PyTorch, from which this chapter draws its structure and concepts.
- ImageNet: the large-scale visual database many pre-trained models are trained on, providing the foundation for transfer learning.
- Kingma & Ba, "Adam: A Method for Stochastic Optimization": the original paper on the Adam optimizer used in this chapter.