Introduction

Welcome to Chapter 11! So far, you’ve mastered the fundamentals of setting up Tunix, loading models, and initiating basic post-training runs. But what if the standard tools aren’t quite enough for your specific research or application? What if you need to guide your Large Language Model (LLM) with a unique objective, fine-tune its learning process with a specialized algorithm, or automate complex actions during training?

This chapter is your gateway to unlocking the full power of Tunix customization. We’ll dive deep into how you can define and integrate your own loss functions to precisely shape your LLM’s learning objective, craft sophisticated optimizers using JAX’s powerful Optax library to control parameter updates, and implement intelligent callbacks to monitor, control, and react to your training process. By the end of this chapter, you’ll be able to tailor Tunix to virtually any LLM post-training scenario, moving beyond off-the-shelf solutions to truly bespoke training pipelines.

To get the most out of this chapter, you should be comfortable with the basic Tunix training loop concepts covered in previous chapters, have a foundational understanding of JAX, and be familiar with common machine learning concepts like loss functions and optimizers. Let’s get started on making Tunix truly yours!

Core Concepts: The Pillars of Customization

Tunix, being built on JAX, inherits JAX’s flexibility and composability. This “white-box” design, as Google describes it, means you have granular control over every aspect of your training. We’ll focus on three key areas: loss functions, optimizers, and callbacks.

Understanding Loss Functions in LLM Post-Training

At its heart, machine learning is about minimizing errors. A loss function is the mathematical engine that quantifies this error. It measures how “wrong” your model’s predictions are compared to the true targets. During training, the optimizer uses the gradient of this loss to adjust the model’s parameters, iteratively making the model “less wrong.”

What is a Loss Function?

Imagine you’re teaching a child to identify fruits. If they say “apple” when shown an orange, that’s an error. The loss function is like a scorekeeper that gives a higher penalty for bigger mistakes (e.g., saying “rock” for an orange) and a smaller penalty for minor ones (e.g., saying “tangerine” for an orange).

For LLMs, common tasks like next-token prediction often use Cross-Entropy Loss. However, in post-training, especially for alignment techniques like Reinforcement Learning from Human Feedback (RLHF), you might encounter more complex losses like Kullback-Leibler (KL) Divergence to penalize divergence from a reference model, or custom losses designed for specific safety or factual consistency objectives.
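To make the KL-divergence idea concrete, here is a minimal, framework-free sketch of a per-token KL penalty between a policy model and a frozen reference model, computed from raw logits. The function name and shapes are illustrative, not part of any Tunix API:

```python
import jax
import jax.numpy as jnp

def kl_penalty(policy_logits: jnp.ndarray, ref_logits: jnp.ndarray) -> jnp.ndarray:
    """Mean per-token KL(policy || reference), computed from raw logits.

    Both inputs are assumed to have shape (batch, seq_len, vocab_size).
    """
    policy_logp = jax.nn.log_softmax(policy_logits, axis=-1)
    ref_logp = jax.nn.log_softmax(ref_logits, axis=-1)
    # KL(p || q) = sum_v p(v) * (log p(v) - log q(v)), summed over the vocab
    kl_per_token = jnp.sum(jnp.exp(policy_logp) * (policy_logp - ref_logp), axis=-1)
    return kl_per_token.mean()

# Identical logits mean identical distributions, so the penalty is ~0.
same = jnp.zeros((2, 4, 8))
print(kl_penalty(same, same))
```

Such a term is typically added to the task loss with a small coefficient, penalizing the fine-tuned model for drifting too far from the reference.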

Why Customize Loss Functions?

  • Task Specificity: Default losses might not perfectly align with your unique post-training goal. For example, you might want to penalize certain types of hallucinations more heavily.
  • Robustness: Custom losses can be designed to be more robust to noisy data or outliers.
  • Regularization: You can add regularization terms directly into your loss function to prevent overfitting or encourage desired model properties.
  • Multi-objective Optimization: Combine multiple objectives (e.g., fluency, coherence, safety) into a single composite loss.

How Tunix Integrates Custom Losses

Tunix, like many JAX-based libraries, expects loss functions to be pure Python functions that operate on JAX arrays. The key is that they must be differentiable, as JAX will automatically compute their gradients.
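A toy example shows what this contract looks like in practice: any pure function of JAX arrays that returns a scalar can be handed to jax.grad, and JAX derives the gradient function automatically:

```python
import jax
import jax.numpy as jnp

# A pure, differentiable loss over JAX arrays: mean squared error.
def mse(params: jnp.ndarray, targets: jnp.ndarray) -> jnp.ndarray:
    return jnp.mean((params - targets) ** 2)

targets = jnp.array([1.0, 2.0, 3.0])
params = jnp.zeros(3)

# jax.grad differentiates with respect to the first argument by default.
grad_fn = jax.grad(mse)
print(grad_fn(params, targets))  # analytically: 2 * (params - targets) / 3
```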

graph TD
    A[Model Output] --> B{Loss Function}
    C[Target Label] --> B
    B --> D[Loss Value]
    D --> E[JAX Autograd]
    E --> F[Gradients]
    F --> G[Optimizer]
    G --> H[Update Model Parameters]

Figure 11.1: Simplified data flow for loss function and gradient computation.

Mastering Optimizers with Optax

An optimizer is the algorithm that adjusts your model’s internal parameters (weights and biases) based on the gradients computed from the loss function. It’s the “how” of learning.

What is an Optimizer?

If the loss function tells you how far off you are, the optimizer tells you which way to go and how big a step to take to reduce that error. Imagine you’re blindfolded on a mountain, trying to find the lowest point. The loss function tells you your current elevation. The optimizer tells you which direction is downhill and how far to step based on the steepness.

Common optimizers include Stochastic Gradient Descent (SGD), Adam, and AdamW. Optax is JAX’s library for gradient processing and optimization, providing a highly modular and composable way to build custom optimizers.

Why Customize Optimizers?

  • Learning Rate Schedules: Dynamically change the learning rate over time (e.g., warm-up, decay) for more stable and effective training.
  • Advanced Algorithms: Use specialized optimizers or combine multiple optimization techniques (e.g., gradient clipping + AdamW).
  • Memory Efficiency: Some optimizers are more memory-efficient for very large models.
  • Hyperparameter Tuning: Fine-tune optimizer behavior for optimal performance on your specific task.

How Tunix Leverages Optax

Tunix integrates seamlessly with Optax. You define your optimizer using Optax’s building blocks, and Tunix uses this Optax optimizer state to manage parameter updates within its training loop. This allows for immense flexibility without rewriting core training logic.

Enhancing Training with Callbacks

Callbacks are functions or objects that can be executed at specific points during the training process. They allow you to inject custom logic without modifying the core training loop.

What are Callbacks?

Think of callbacks as event listeners for your training process. When a certain event happens (e.g., an epoch ends, a batch finishes, training starts), your callback can “listen” for that event and perform a predefined action.
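The pattern is easy to see outside of any framework. The sketch below is purely illustrative (the class and method names are ours, not Tunix’s): a loop fires a named event, and any registered listener reacts:

```python
# A minimal event-listener sketch of the callback pattern.
class Callback:
    def on_epoch_end(self, epoch: int, logs: dict) -> None:
        pass  # default: do nothing

class PrintLossCallback(Callback):
    def on_epoch_end(self, epoch: int, logs: dict) -> None:
        print(f"epoch {epoch}: loss={logs['loss']:.4f}")

def run_training(callbacks: list, num_epochs: int = 3) -> None:
    for epoch in range(num_epochs):
        logs = {"loss": 1.0 / (epoch + 1)}  # stand-in for real metrics
        for cb in callbacks:                # fire the "epoch end" event
            cb.on_epoch_end(epoch, logs)

run_training([PrintLossCallback()])
```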

Why Use Callbacks?

  • Logging: Record metrics, gradients, or other data to a file or a visualization tool (e.g., TensorBoard, Weights & Biases).
  • Early Stopping: Automatically stop training if the model’s performance on a validation set stops improving, preventing overfitting and saving compute.
  • Model Checkpointing: Save the model’s weights at regular intervals or when a new best performance is achieved.
  • Learning Rate Scheduling: Adjust the learning rate based on validation metrics.
  • Custom Metrics: Compute and log metrics not natively handled by the training loop.
  • Dynamic Adjustments: Modify training parameters or even model architecture mid-training (though this is more advanced).

Tunix’s Callback System

Tunix provides a flexible callback system, allowing you to define classes with methods that correspond to various lifecycle events of the training process (e.g., on_train_begin, on_step_end, on_epoch_end).

Step-by-Step Implementation

Let’s put these concepts into practice. We’ll assume you have a basic Tunix setup with a model and dataset ready, similar to what you’d have from Chapter 3 or 4. For demonstration, we’ll use a simplified training loop structure.

Prerequisites: Tunix and JAX Setup (as of 2026-01-30)

First, ensure you have Tunix and its dependencies installed.

# It's always a good idea to use a virtual environment
python -m venv tunix_env
source tunix_env/bin/activate # On Windows: .\tunix_env\Scripts\activate

# Install Tunix from its official GitHub repository for the latest stable version
# As of 2026-01-30, we'll assume a stable release like v0.2.0 or newer.
# Always check the official Tunix GitHub for the absolute latest stable release.
pip install "tunix[full] @ git+https://github.com/google/tunix.git@v0.2.0"

# Verify JAX, Flax, Optax versions (these will be installed as Tunix dependencies)
# JAX: ~0.4.23 or newer
# Optax: ~0.1.7 or newer
# Flax: ~0.7.5 or newer
pip show jax flax optax

Note: The Tunix version v0.2.0 is a placeholder for a stable release by early 2026. Always refer to the official Tunix GitHub releases page for the most current stable tag or branch.

1. Defining a Custom Loss Function

Let’s create a custom loss function that modifies standard cross-entropy by adding a small L2 regularization term on the model’s weights directly within the loss calculation. (In practice, L2 is usually handled by the optimizer, e.g. AdamW’s weight decay, or as a separate regularization term; computing it inside the loss here simply illustrates a composite objective.)

We’ll define a function that takes logits, labels, and model params as input.

import jax
import jax.numpy as jnp
import optax
import flax  # needed for the flax.core.FrozenDict type annotation below
import flax.linen as nn
from tunix.trainer import Trainer # Assuming Tunix Trainer structure
from tunix.models import Transformer # Example Tunix model

# --- 1. Define a Custom Loss Function ---
def custom_llm_loss(logits: jnp.ndarray, labels: jnp.ndarray, params: flax.core.FrozenDict, l2_reg_factor: float = 1e-4) -> jnp.ndarray:
    """
    Computes a custom loss for LLM post-training, combining cross-entropy
    with L2 regularization on model parameters.

    Args:
        logits: The model's output logits (raw predictions).
        labels: The true target labels (e.g., next token IDs).
        params: The model's parameters, used for L2 regularization.
        l2_reg_factor: The strength of the L2 regularization.

    Returns:
        A scalar JAX array representing the total loss.
    """
    # Standard Cross-Entropy Loss
    # Labels are integer token IDs and logits span the vocabulary, so
    # optax.softmax_cross_entropy_with_integer_labels can consume them directly
    ce_loss = optax.softmax_cross_entropy_with_integer_labels(logits=logits, labels=labels).mean()

    # L2 Regularization on parameters
    l2_loss = 0.0
    for leaf in jax.tree_util.tree_leaves(params):
        # The leaves of the params tree are the weight and bias arrays
        l2_loss += jnp.sum(leaf**2)

    total_loss = ce_loss + l2_reg_factor * l2_loss
    return total_loss

Explanation:

  • We import jax, jax.numpy, optax, and flax.linen as these are fundamental for JAX-native operations.
  • custom_llm_loss takes logits (model predictions), labels (true values), and params (model weights for regularization) as input.
  • optax.softmax_cross_entropy_with_integer_labels: a numerically robust way to compute cross-entropy in JAX. It consumes integer labels directly, so no explicit one-hot conversion of the labels is needed. We take its mean across the batch.
  • L2 Regularization: We iterate through the params tree (which is a nested structure of model weights). For each numerical array, we compute the sum of its squares and add it to l2_loss.
  • Finally, we combine ce_loss and l2_loss using l2_reg_factor to get total_loss.

2. Implementing a Custom Optimizer with Optax

Now, let’s build a custom optimizer using Optax. We’ll combine AdamW with a linear warm-up followed by a cosine decay learning rate schedule.

# --- 2. Implement a Custom Optimizer with Optax ---
def create_custom_optimizer(
    learning_rate: float,
    total_steps: int,
    warmup_steps: int,
    weight_decay: float = 1e-1
) -> optax.GradientTransformation:
    """
    Creates a custom Optax optimizer with AdamW and a combined learning rate schedule.

    Args:
        learning_rate: The peak learning rate.
        total_steps: Total number of training steps.
        warmup_steps: Number of steps for linear warm-up.
        weight_decay: L2 regularization strength for AdamW.

    Returns:
        An optax.GradientTransformation object.
    """
    # 2.1. Define the Learning Rate Schedule
    # Linear warm-up
    warmup_fn = optax.linear_schedule(
        init_value=0.0,
        end_value=learning_rate,
        transition_steps=warmup_steps
    )

    # Cosine decay after warm-up
    decay_fn = optax.cosine_decay_schedule(
        init_value=learning_rate,
        decay_steps=total_steps - warmup_steps
    )

    # Combine the schedules: `boundaries` marks the step at which
    # `join_schedules` switches from warm-up to decay
    lr_schedule = optax.join_schedules(
        schedules=[warmup_fn, decay_fn],
        boundaries=[warmup_steps]
    )

    # 2.2. Define the Optimizer Chain
    # We use optax.chain for combining multiple transformations
    optimizer = optax.chain(
        optax.clip_by_global_norm(1.0), # Gradient clipping to prevent exploding gradients
        optax.adamw(learning_rate=lr_schedule, weight_decay=weight_decay) # AdamW with our schedule
    )
    return optimizer

# Example usage (you'd pass these to your Tunix Trainer)
# peak_lr = 1e-4
# total_training_steps = 10000
# num_warmup_steps = 1000
# custom_optim = create_custom_optimizer(peak_lr, total_training_steps, num_warmup_steps)

Explanation:

  • create_custom_optimizer takes parameters like learning_rate, total_steps, warmup_steps, and weight_decay.
  • Learning Rate Schedule:
    • optax.linear_schedule creates a schedule that linearly increases the learning rate from 0.0 to learning_rate over warmup_steps.
    • optax.cosine_decay_schedule creates a schedule that decays the learning rate from learning_rate to a small value using a cosine function over the remaining steps.
    • optax.join_schedules combines these two, switching from the warm-up to the decay schedule at warmup_steps.
  • Optimizer Chain:
    • optax.chain allows you to compose multiple gradient transformations.
    • optax.clip_by_global_norm(1.0) is a common practice for LLMs to prevent gradients from becoming too large, which can destabilize training.
    • optax.adamw is a popular optimizer. We pass our lr_schedule and weight_decay to it.

3. Creating and Using a Custom Callback

Let’s define a custom callback that logs the average loss every N steps and saves a checkpoint if the validation loss improves.

import os
import time
from typing import Any, Dict, Optional
from tunix.trainer import TrainerCallback, TrainerState # Assuming Tunix provides these base classes

# --- 3. Creating and Using a Custom Callback ---
class CustomLoggerAndCheckpointCallback(TrainerCallback):
    """
    A custom callback to log average loss periodically and save model checkpoints
    based on improved validation loss.
    """
    def __init__(self, log_interval_steps: int, checkpoint_dir: str = "./checkpoints",
                 monitor_metric: str = "val_loss", mode: str = "min"):
        super().__init__()
        self.log_interval_steps = log_interval_steps
        self.checkpoint_dir = checkpoint_dir
        self.monitor_metric = monitor_metric
        self.mode = mode
        self.best_metric_value = None
        self.step_losses = []
        os.makedirs(self.checkpoint_dir, exist_ok=True)
        print(f"CustomLoggerAndCheckpointCallback initialized. Checkpoints will be saved to: {self.checkpoint_dir}")

    def on_train_begin(self, state: TrainerState, **kwargs: Any) -> None:
        """Called at the beginning of training."""
        print("Training started! Initializing custom callback.")
        self.best_metric_value = float('inf') if self.mode == 'min' else -float('inf')

    def on_step_end(self, state: TrainerState, **kwargs: Any) -> None:
        """Called at the end of each training step."""
        # Assuming Tunix TrainerState has 'current_step' and 'loss_value'
        current_step = state.current_step
        current_loss = state.loss_value
        self.step_losses.append(current_loss)

        if (current_step + 1) % self.log_interval_steps == 0:
            avg_loss = jnp.mean(jnp.array(self.step_losses)).item()
            print(f"Step {current_step + 1}/{state.total_steps} - Average Loss ({self.log_interval_steps} steps): {avg_loss:.4f}")
            self.step_losses = [] # Reset for next interval

    def on_epoch_end(self, state: TrainerState, logs: Dict[str, Any], **kwargs: Any) -> None:
        """Called at the end of each epoch."""
        # Check if the monitored metric is available in logs
        if self.monitor_metric in logs:
            current_metric_value = logs[self.monitor_metric]
            print(f"Epoch {state.current_epoch} - {self.monitor_metric}: {current_metric_value:.4f}")

            should_save = False
            if self.mode == 'min':
                if current_metric_value < self.best_metric_value:
                    self.best_metric_value = current_metric_value
                    should_save = True
            elif self.mode == 'max':
                if current_metric_value > self.best_metric_value:
                    self.best_metric_value = current_metric_value
                    should_save = True

            if should_save:
                checkpoint_path = os.path.join(self.checkpoint_dir, f"model_epoch_{state.current_epoch:03d}_{self.monitor_metric}_{self.best_metric_value:.4f}.tunix")
                # The exact save API depends on the Tunix release; a real
                # implementation might call something like:
                #   state.trainer.save_checkpoint(state.params, state.opt_state, checkpoint_path)
                # For this demonstration we only print a placeholder.
                print(f"Would save best model checkpoint to {checkpoint_path}")
        else:
            print(f"Warning: Monitored metric '{self.monitor_metric}' not found in logs for epoch {state.current_epoch}.")

    def on_train_end(self, state: TrainerState, **kwargs: Any) -> None:
        """Called at the end of training."""
        print("Training ended! Custom callback finished.")

# --- Integration Example ---
# Assuming you have a Tunix Trainer instance setup
# trainer = Trainer(...)

# # Instantiate your custom callback
# my_callback = CustomLoggerAndCheckpointCallback(
#     log_interval_steps=50,
#     checkpoint_dir="./my_llm_checkpoints",
#     monitor_metric="val_loss", # This metric needs to be computed and logged by Tunix Trainer
#     mode="min"
# )

# # Add the callback to your Tunix Trainer
# trainer.add_callback(my_callback)
# # Then you would call trainer.train(...)

Explanation:

  • We define CustomLoggerAndCheckpointCallback which inherits from tunix.trainer.TrainerCallback.
  • __init__: Sets up logging interval, checkpoint directory, and the metric to monitor.
  • on_train_begin: A simple message indicating the start of training and initializing best_metric_value.
  • on_step_end: This method is called after each training step. We collect the loss for the current step and, if log_interval_steps have passed, calculate and print the average loss.
  • on_epoch_end: Called at the end of each epoch. It checks the monitor_metric from the logs dictionary provided by the Tunix Trainer. If the metric shows improvement (e.g., val_loss decreases), it prints a message indicating a checkpoint would be saved.
    • Note: The actual save_model or save_checkpoint call is hypothetical and depends on the exact Tunix API for saving model states. You would typically call a method on the trainer object itself or use a utility function provided by Tunix.
  • on_train_end: A final message when training concludes.

Integrating Custom Components into Tunix

Now, let’s imagine a simplified Tunix Trainer setup to see how these custom components would be plugged in.

# Assuming you have a model, dataset, and basic Tunix setup
# For this example, we'll use placeholder classes.

# Placeholder for a Tunix-compatible model
class MyTunixModel(nn.Module):
    num_heads: int = 8
    num_layers: int = 4
    vocab_size: int = 1000
    embed_dim: int = 256

    @nn.compact
    def __call__(self, x: jnp.ndarray, train: bool = True):
        # Simplified transformer block for demonstration
        x = nn.Embed(num_embeddings=self.vocab_size, features=self.embed_dim)(x)
        for _ in range(self.num_layers):
            x = nn.SelfAttention(num_heads=self.num_heads)(x)
            x = nn.Dense(features=self.embed_dim)(x)
            x = nn.LayerNorm()(x)
        logits = nn.Dense(features=self.vocab_size)(x)
        return logits

# Placeholder for a dataset (e.g., a simple iterator)
class DummyDataset:
    def __init__(self, num_batches: int = 100, batch_size: int = 4, seq_len: int = 64, vocab_size: int = 1000):
        self.num_batches = num_batches
        self.batch_size = batch_size
        self.seq_len = seq_len
        self.vocab_size = vocab_size

    def __iter__(self):
        for _ in range(self.num_batches):
            # Simulate input tokens and target labels. Note: the fixed PRNG keys
            # make every batch identical, which is fine for this plumbing demo
            inputs = jax.random.randint(jax.random.PRNGKey(0), (self.batch_size, self.seq_len), 0, self.vocab_size)
            labels = jax.random.randint(jax.random.PRNGKey(1), (self.batch_size, self.seq_len), 0, self.vocab_size)
            yield {"input_ids": inputs, "labels": labels}

    def __len__(self):
        return self.num_batches

# Initialize a JAX PRNGKey
key = jax.random.PRNGKey(42)
model_key, dropout_key = jax.random.split(key)

# Instantiate your custom components
peak_lr = 1e-4
total_training_steps = 1000 # For a short demo
num_warmup_steps = 100

custom_optim = create_custom_optimizer(peak_lr, total_training_steps, num_warmup_steps)
my_callback = CustomLoggerAndCheckpointCallback(
    log_interval_steps=10,
    checkpoint_dir="./my_llm_checkpoints_demo",
    monitor_metric="val_loss", # This metric needs to be computed and logged by Tunix Trainer
    mode="min"
)

# Initialize model and parameters
dummy_input = jnp.ones((1, 64), dtype=jnp.int32) # batch_size=1, seq_len=64
model = MyTunixModel()
params = model.init(model_key, dummy_input)['params'] # Initialize only 'params'

# Create a dummy Tunix Trainer (this is a simplified representation)
class SimplifiedTunixTrainer:
    def __init__(self, model_module: nn.Module, params: flax.core.FrozenDict,
                 optimizer: optax.GradientTransformation, loss_fn: callable,
                 train_dataset: Any, val_dataset: Any = None, callbacks: Optional[list] = None):
        self.model_module = model_module
        self.params = params
        self.optimizer = optimizer
        self.opt_state = optimizer.init(params)
        self.loss_fn = loss_fn
        self.train_dataset = train_dataset
        self.val_dataset = val_dataset
        self.callbacks = callbacks if callbacks is not None else []
        self.current_step = 0
        self.current_epoch = 0
        self.total_steps = len(train_dataset) # Simplified
        self.rng_key = jax.random.PRNGKey(0)

        # Callbacks registration
        for cb in self.callbacks:
            cb.trainer = self # Allow callbacks to interact with trainer if needed

    def train_step(self, params, opt_state, batch, rng_key):
        # Note: we deliberately do not decorate this method with @jax.jit,
        # because jit would try to trace `self`, which is not a JAX type.
        # In practice you would jit a pure function of (params, opt_state,
        # batch, rng_key), or use functools.partial(jax.jit, static_argnums=0).
        input_ids = batch["input_ids"]
        labels = batch["labels"]

        def compute_loss(params):
            logits = self.model_module.apply({'params': params}, input_ids, train=True, rngs={'dropout': rng_key})
            # Pass params to custom_llm_loss for regularization calculation
            loss = self.loss_fn(logits, labels, params)
            return loss, logits # Return logits if needed for other metrics

        # Compute gradient of the loss with respect to parameters
        (loss_value, logits), grads = jax.value_and_grad(compute_loss, has_aux=True)(params)

        # Apply gradients
        updates, opt_state = self.optimizer.update(grads, opt_state, params)
        params = optax.apply_updates(params, updates)

        return params, opt_state, loss_value, logits

    def train(self, num_epochs: int):
        trainer_state = TrainerState(
            current_step=0,
            current_epoch=0,
            total_steps=self.total_steps * num_epochs,
            params=self.params,
            opt_state=self.opt_state,
            rng_key=self.rng_key,
            loss_value=0.0 # Will be updated
        )
        # Call on_train_begin for all callbacks
        for cb in self.callbacks:
            cb.on_train_begin(trainer_state)

        for epoch in range(num_epochs):
            self.current_epoch = epoch
            print(f"\n--- Epoch {epoch + 1}/{num_epochs} ---")
            epoch_losses = []

            for i, batch in enumerate(self.train_dataset):
                trainer_state.current_step = self.current_step
                trainer_state.current_epoch = self.current_epoch

                # Split RNG key for dropout and other random operations
                step_key, dropout_key = jax.random.split(trainer_state.rng_key)
                trainer_state.rng_key = step_key # Update trainer state's rng_key

                self.params, self.opt_state, loss_value, logits = self.train_step(
                    self.params, self.opt_state, batch, dropout_key
                )
                trainer_state.params = self.params
                trainer_state.opt_state = self.opt_state
                trainer_state.loss_value = loss_value
                epoch_losses.append(loss_value)

                # Call on_step_end for all callbacks
                for cb in self.callbacks:
                    cb.on_step_end(trainer_state, logs={"loss": loss_value.item()}) # .item() to get scalar Python float
                
                self.current_step += 1

            # Simulate validation loss computation for callback
            val_loss = jnp.mean(jnp.array(epoch_losses)).item() * 0.9 # Just for demo, usually on val_dataset
            logs = {"loss": jnp.mean(jnp.array(epoch_losses)).item(), "val_loss": val_loss}

            # Call on_epoch_end for all callbacks
            for cb in self.callbacks:
                cb.on_epoch_end(trainer_state, logs=logs)

        # Call on_train_end for all callbacks
        for cb in self.callbacks:
            cb.on_train_end(trainer_state)

# Instantiate the dummy dataset
train_data = DummyDataset(num_batches=50, batch_size=4, seq_len=64, vocab_size=1000)

# Instantiate the simplified Tunix Trainer
simplified_trainer = SimplifiedTunixTrainer(
    model_module=model,
    params=params,
    optimizer=custom_optim,
    loss_fn=custom_llm_loss, # Our custom loss function
    train_dataset=train_data,
    callbacks=[my_callback] # Our custom callback
)

# Run the training
# simplified_trainer.train(num_epochs=2)
print("\n--- Custom Tunix Training Setup Complete ---")
print("To run the demo, uncomment `simplified_trainer.train(num_epochs=2)` above.")
print("Observe how the custom loss, optimizer schedule, and callback logging/checkpointing interact.")
print("The output will show step-wise average loss and epoch-end validation metric monitoring.")

Explanation of Integration:

  • We’ve defined MyTunixModel and DummyDataset as placeholders to simulate a real Tunix environment.
  • A SimplifiedTunixTrainer class is created to show how model_module, params, optimizer, loss_fn, and callbacks would be passed in.
  • The train_step method uses jax.value_and_grad to compute loss and gradients, then optimizer.update and optax.apply_updates to update parameters, demonstrating the JAX/Optax integration.
  • The train method orchestrates the training loop, calling the appropriate callback methods (on_train_begin, on_step_end, on_epoch_end, on_train_end) at their respective points.
  • Crucially, our custom_llm_loss is passed directly as loss_fn, and custom_optim is passed as optimizer. Our my_callback is added to the callbacks list.
  • Running simplified_trainer.train(num_epochs=2) would execute this custom training flow.

Mini-Challenge: Enhancing the Callback

Your turn! Let’s enhance our custom callback.

Challenge: Modify the CustomLoggerAndCheckpointCallback to also include a simple early stopping mechanism. If the monitor_metric (e.g., val_loss) does not improve for a specified number of consecutive epochs (called patience), the callback should signal the trainer to stop training.

Hint:

  1. Add patience: int and patience_counter: int = 0 to the callback’s __init__.
  2. In on_epoch_end, when the metric doesn’t improve, increment patience_counter. If it does improve, reset patience_counter to 0.
  3. If patience_counter exceeds patience, you’ll need a way to stop the training. In a real Tunix Trainer, there might be a trainer.stop_training = True flag or a similar mechanism in the TrainerState. For our SimplifiedTunixTrainer, you could raise a custom exception (StopTrainingException) that the train loop catches.
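If you take the exception route from hint 3, the mechanism looks roughly like this (StopTrainingException and the surrounding names are our own, not a Tunix API):

```python
# A sentinel exception lets a callback halt a training loop it does not own.
class StopTrainingException(Exception):
    pass

def on_epoch_end(epoch: int, patience_exceeded: bool) -> None:
    # In the real callback, `patience_exceeded` would come from comparing
    # patience_counter against patience (hints 1 and 2).
    if patience_exceeded:
        raise StopTrainingException(f"no improvement by epoch {epoch}")

completed = []
try:
    for epoch in range(10):
        completed.append(epoch)
        # Pretend the patience budget runs out at epoch 3.
        on_epoch_end(epoch, patience_exceeded=(epoch == 3))
except StopTrainingException as exc:
    print(f"Early stop: {exc}")

print(completed)  # the loop never reaches epoch 4
```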

What to Observe/Learn:

  • How callbacks can actively control the training flow, not just observe it.
  • The interplay between different callback functionalities (logging, checkpointing, early stopping).
  • The importance of TrainerState for callbacks to access and potentially modify global training state.

Common Pitfalls & Troubleshooting

Customization, while powerful, can introduce new challenges. Here are a few common pitfalls:

  1. Shape Mismatches in Custom Loss Functions:

    • Pitfall: Your custom loss function expects logits or labels of a certain shape, but the model output or dataset provides something different. This often leads to JAX ShapeError or TypeError.
    • Troubleshooting: Print the shapes of logits and labels right at the start of your custom_llm_loss function. print(f"Logits shape: {logits.shape}, Labels shape: {labels.shape}"). Ensure they are compatible with your loss calculations (e.g., (batch_size, seq_len, vocab_size) for logits and (batch_size, seq_len) for labels for token classification).
  2. Non-Differentiable Operations in Loss:

    • Pitfall: You accidentally include an operation in your custom loss that JAX cannot differentiate (e.g., converting a JAX array to a Python float in the middle of a calculation, or using certain non-JAX NumPy functions).
    • Troubleshooting: JAX will usually raise a clear error message like “Abstract tracer value encountered at …” or “Gradient of … is not defined.” Review your loss function for any non-JAX operations or explicit conversions. Stick to jax.numpy functions.
  3. Incorrect Optimizer Initialization or Schedule Logic:

    • Pitfall: Your learning rate schedule might not be applied correctly, or the optimizer’s internal state (opt_state) isn’t managed properly, leading to NaN losses or stagnant training.
    • Troubleshooting:
      • Plot your learning rate schedule: Create a dummy loop for total_steps and print lr_schedule(step) to visualize its progression.
      • Check optax.chain order: The order of transformations matters (e.g., gradient clipping usually comes before applying the main optimizer).
      • Ensure optimizer.init(params) is called once at the start and optimizer.update is called correctly in each step.
  4. Callback Side Effects and State Management:

    • Pitfall: Your callback modifies TrainerState in an unexpected way, or it relies on state that isn’t guaranteed to be present or updated at its execution point. For instance, trying to access val_loss in on_step_end when it’s only computed on_epoch_end.
    • Troubleshooting:
      • Be explicit about what TrainerState attributes your callback uses.
      • Print the logs dictionary passed to on_epoch_end to see what metrics are actually available.
      • Minimize side effects. If a callback needs to modify shared state, ensure it’s done safely and predictably.

Summary

Phew! You’ve just taken a massive leap in your Tunix journey. In this chapter, we’ve explored the critical avenues for customizing your LLM post-training pipeline:

  • Loss Functions: You learned how to define custom, differentiable loss functions in JAX, combining standard objectives like cross-entropy with custom regularization or task-specific terms to precisely guide your model’s learning.
  • Optimizers: We delved into Optax, JAX’s powerful optimizer library, demonstrating how to construct sophisticated optimizers with custom learning rate schedules (like warm-up and cosine decay) and gradient transformations (like global norm clipping).
  • Callbacks: You mastered the art of creating custom callbacks to inject logic at various points in the training lifecycle, enabling features like periodic logging, conditional checkpointing, and even early stopping.

By understanding and applying these customization techniques, you’re no longer limited to off-the-shelf solutions. You can now design and implement highly specialized post-training routines tailored to the unique demands of your LLMs and research objectives.

What’s Next?

In the next chapter, we’ll build upon this foundation by exploring advanced model architectures and integration with external JAX/Flax components. Get ready to see how you can bring even more complex and custom models into the Tunix ecosystem!

