Welcome back, intrepid Tunix explorer! So far, we’ve mastered the fundamentals of Tunix, understood its core concepts, and even applied it to fine-tune smaller language models. But what happens when our models grow to billions or even trillions of parameters? What happens when our datasets are so massive that a single GPU or even a single machine can’t handle them?
That’s where distributed training comes in! In this chapter, we’re going to dive into the exciting world of scaling our LLM post-training efforts. We’ll learn how Tunix, powered by JAX, allows us to harness the power of multiple devices – whether they’re GPUs or TPUs – to train larger models faster and more efficiently.
This chapter will equip you with the knowledge to:
- Understand why distributed training is essential for large language models.
- Grasp JAX’s powerful primitives for multi-device computation.
- Learn how Tunix seamlessly integrates with JAX’s distributed capabilities.
- Implement and conceptualize distributed post-training for your LLMs.
To get the most out of this chapter, it’s helpful to have a solid understanding of JAX basics, including jit compilation, and how Tunix’s Trainer works from previous chapters. Don’t worry if it sounds complex; we’ll break it down into the smallest, most manageable steps!
Core Concepts: Why Go Distributed?
Imagine trying to paint a massive mural all by yourself. It would take ages, right? Now imagine you have a team of painters, each tackling a different section or working on the same section simultaneously. That’s the essence of distributed training!
Large Language Models (LLMs) are, well, large. They have an enormous number of parameters, requiring significant computational power and memory. Training or post-training these models on a single device (like one GPU) quickly becomes impractical due to:
- Memory Limits: The model parameters, optimizer states, and activations for a single batch can easily exceed the memory capacity of even the most powerful GPUs.
- Training Speed: Even if a model fits, a single device might take weeks or months to complete training on a large dataset. Time is money, and faster iteration is crucial for research and development.
This is where distributed training comes to the rescue. It involves spreading the computational workload across multiple devices (e.g., multiple GPUs on one machine, or multiple machines each with multiple GPUs).
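To make the memory pressure concrete, here is a rough back-of-the-envelope calculation. The 7B parameter count and the two extra Adam states per parameter are illustrative assumptions, not figures from any particular model:

```python
# Rough memory estimate for training a 7B-parameter model in float32.
# Assumptions (illustrative): 4 bytes per float32 value, and an Adam-style
# optimizer keeping 2 extra float32 states (momentum + variance) per parameter.
num_params = 7e9
bytes_per_float32 = 4

params_gb = num_params * bytes_per_float32 / 1e9   # model weights
grads_gb = params_gb                               # one gradient per parameter
optimizer_gb = 2 * params_gb                       # Adam momentum + variance

total_gb = params_gb + grads_gb + optimizer_gb     # before activations!
print(f"Weights:   {params_gb:.0f} GB")            # 28 GB
print(f"Gradients: {grads_gb:.0f} GB")             # 28 GB
print(f"Optimizer: {optimizer_gb:.0f} GB")         # 56 GB
print(f"Total:     {total_gb:.0f} GB (activations not included)")  # 112 GB
```

Even before counting activations, roughly 112 GB dwarfs the memory of a single high-end accelerator, which is exactly why the work has to be spread across devices.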
JAX’s Philosophy on Parallelism
JAX, being designed for high-performance numerical computation on accelerators, has parallelism built into its very core. It provides elegant and powerful abstractions to manage multiple devices without getting bogged down in low-level communication details.
The key idea in JAX is “program transformation.” You write your code once, as if it were running on a single device, and then JAX transforms it to run efficiently across many. This is often referred to as “single-program, multiple-data” (SPMD).
Let’s explore JAX’s primary tools for parallelism:
1. pmap: The Pioneer for Data Parallelism
pmap (parallel map) was one of JAX’s earliest and most straightforward ways to achieve data parallelism. It maps a function over a batch of data, executing a copy of the function on each available device, with each device processing a slice of the input data.
How it works:
- You define a function.
- You decorate it with `@jax.pmap`.
- When you call the `pmap`-decorated function, JAX automatically splits the input arrays along their leading axis and sends a slice to each device.
- Each device computes its part.
- JAX can also handle reduction operations (like summing gradients) across devices.
While powerful, pmap has some limitations, especially for more complex sharding strategies beyond simple data parallelism.
2. shard_map and jax.sharding.Mesh: The Modern Approach
For more flexible and advanced sharding strategies, JAX introduced shard_map and the jax.sharding.Mesh API. These provide a more explicit and powerful way to define how arrays are sharded (split) across devices.
- `jax.sharding.Mesh`: A logical abstraction of your hardware devices. You define a multi-dimensional mesh, assigning names to each dimension (e.g., `'data'`, `'model'`). This allows you to think about sharding in terms of these logical dimensions.
- `jax.sharding.NamedSharding`: Once you have a `Mesh`, you use `NamedSharding` to specify how an array's dimensions map to the mesh's named dimensions. For example, you could shard a batch dimension across the `'data'` mesh dimension.
- `shard_map`: A lower-level primitive that lets you specify the input and output sharding for a function. It's more general than `pmap` and can handle complex combinations of model and data parallelism.
While shard_map gives immense control, for many common LLM post-training scenarios, Tunix (and the underlying Flax/JAX ecosystem) often abstracts this away, allowing you to specify sharding via configuration or a Mesh object without directly writing shard_map calls yourself.
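To make the `Mesh` and `NamedSharding` ideas tangible, here is a small sketch. The `XLA_FLAGS` trick (asking XLA for virtual CPU devices) is an assumption so the example runs on any machine; on real hardware you would drop it:

```python
import os
# Emulate 8 devices on CPU so the example runs anywhere. This flag must be
# set before JAX initializes its backend (i.e., before the first jax import).
os.environ.setdefault("XLA_FLAGS", "--xla_force_host_platform_device_count=8")

import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Build a 1D logical mesh over all devices, naming the axis 'data'
devices = mesh_utils.create_device_mesh((jax.device_count(),))
mesh = Mesh(devices, axis_names=("data",))

# Shard a batch along its leading axis across the 'data' mesh dimension
batch = jnp.arange(jax.device_count() * 4, dtype=jnp.float32).reshape(-1, 4)
sharded = jax.device_put(batch, NamedSharding(mesh, PartitionSpec("data", None)))

# Replicate an array instead: an empty PartitionSpec means "copy everywhere"
weights = jax.device_put(jnp.ones((4, 4)), NamedSharding(mesh, PartitionSpec()))

print(sharded.sharding)   # shows the per-device layout
print(weights.sharding)   # shows full replication
```

The values are unchanged by sharding; only their physical placement across devices differs, which is the whole point of the abstraction.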
Data Parallelism vs. Model Parallelism
When scaling LLMs, we typically encounter two main types of parallelism:
Data Parallelism: This is the most common and often easiest to implement.
- Each device gets a copy of the model.
- The input data is split into smaller batches, and each device processes a different batch.
- After computing gradients, the gradients from all devices are aggregated (e.g., averaged) to update the model parameters.
- This is ideal when your model fits on a single device, but your dataset is too large or you want faster training.
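The synchronization logic above can be sketched without any multi-device machinery at all: simulate N replicas, average their gradients, and observe that identical updates keep the copies identical. The replica count and gradient values here are made-up numbers for illustration:

```python
import jax.numpy as jnp

num_replicas = 4
learning_rate = 0.1

# Every replica starts from the same weights (a data-parallelism requirement)
weights = jnp.full((num_replicas, 3), 2.0)

# Each replica sees different data, so per-replica gradients differ
per_replica_grads = jnp.array([
    [0.1, 0.2, 0.3],
    [0.3, 0.2, 0.1],
    [0.2, 0.2, 0.2],
    [0.0, 0.2, 0.4],
])

# All-reduce: average gradients across replicas, then broadcast back
avg_grad = per_replica_grads.mean(axis=0)     # -> [0.15, 0.2, 0.25]
weights = weights - learning_rate * avg_grad  # same update applied everywhere

# All replicas remain identical after the synchronized update
print(weights)
```

On real hardware, the `mean(axis=0)` line becomes a cross-device collective (`jax.lax.pmean`), but the invariant is the same: averaged gradients plus deterministic updates keep every model copy in lockstep.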
Model Parallelism: This is used when the model itself is too large to fit into the memory of a single device.
- The model’s parameters are split across multiple devices. Each device holds only a part of the model.
- During a forward pass, data flows through devices sequentially, with each device computing its part of the model.
- This is more complex to implement but essential for truly enormous models.
Tunix, leveraging JAX, is primarily designed to facilitate efficient data parallelism for post-training. However, with JAX’s Mesh and NamedSharding, advanced users can also implement various forms of model parallelism if needed. For most Tunix use cases, focusing on data parallelism will be sufficient to achieve significant scaling.
In data parallelism, picture each device holding a full replica of the model, each device receiving its own slice of the batch, and the resulting gradients being averaged before every update.
A friendly note: this picture simplifies the communication. In reality, JAX handles the low-level communication and synchronization between devices very efficiently, often leveraging collective operations.
Step-by-Step Implementation: Scaling with JAX and Tunix
Let’s get our hands dirty (conceptually, for now) and see how this translates into code. We’ll start with a basic JAX example to understand pmap and then discuss how Tunix integrates.
First, ensure you have JAX installed with the appropriate backend for your hardware (CPU, GPU, or TPU). JAX releases frequently, so check the official JAX GitHub releases page for the current version and installation instructions.
# For CPU
pip install "jax[cpu]>=0.4.23"
# For GPU (CUDA 12, adjust if needed for your specific CUDA version)
# Ensure NVIDIA drivers are installed first
pip install "jax[cuda12_pip]>=0.4.23" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
# For TPU (on Google Cloud, typically pre-configured)
# pip install "jax[tpu]>=0.4.23"  # often handled by the cloud environment
Let’s assume we have multiple devices available. You can check how many JAX sees:
import jax
import jax.numpy as jnp
# Get all available devices
devices = jax.devices()
print(f"JAX sees {len(devices)} devices: {devices}")
# To restrict JAX to a subset of GPUs, set an environment variable
# e.g., export CUDA_VISIBLE_DEVICES="0,1" before running your script
Thought: If you only see one device, don’t worry! You can still follow along; `pmap` will simply run on that single device, though you won’t get an actual speedup.
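If you want to experiment with genuinely multi-device behavior on a laptop, XLA can expose multiple virtual CPU devices. A small sketch (the count of 8 is arbitrary, and the flag only takes effect if set before JAX is first imported):

```python
import os
# Ask XLA to present 8 virtual CPU devices. setdefault avoids clobbering an
# XLA_FLAGS value you may already have; this must run before importing jax.
os.environ.setdefault("XLA_FLAGS", "--xla_force_host_platform_device_count=8")

import jax

print(f"JAX now sees {jax.device_count()} devices")
for d in jax.devices():
    print(d)
```

These virtual devices share one CPU, so they are useful for testing sharding logic, not for measuring speedups.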
1. Basic Data Parallelism with jax.pmap
Let’s create a simple JAX function that performs a matrix multiplication and then apply pmap to it.
import jax
import jax.numpy as jnp

# Assume we have 2 devices for this example.
# If you have fewer, JAX will still run, just without a real speedup.
num_devices = len(jax.devices())
print(f"Running with {num_devices} devices.")

# Define a simple function that adds two arrays element-wise
def simple_add(x, y):
    """A simple function to add two arrays."""
    return x + y

# Now, let's make it parallel using pmap
@jax.pmap
def parallel_add(x, y):
    """
    This function will be mapped across all available devices.
    Each device will receive a slice of x and y.
    """
    # Note: a plain Python print here would run once at trace time, not
    # once per device. jax.debug.print runs for each device's slice.
    jax.debug.print("Executing on a device with slice x = {}", x)
    return simple_add(x, y)

# Prepare input data.
# We need to ensure the leading dimension of our input arrays
# is divisible by the number of devices.
# Let's create two arrays of shape (num_devices, some_size)
input_x = jnp.arange(num_devices * 5, dtype=jnp.float32).reshape(num_devices, 5)
input_y = jnp.ones((num_devices, 5), dtype=jnp.float32)

print("\nInput x:\n", input_x)
print("\nInput y:\n", input_y)

# Call the pmap-decorated function
output = parallel_add(input_x, input_y)
print("\nOutput from parallel_add:\n", output)
print("\nOutput shape:", output.shape)
What’s happening here?
- `input_x` and `input_y` are created with a leading dimension equal to `num_devices`. This is crucial for `pmap` because it slices arrays along their leading axis.
- When `parallel_add(input_x, input_y)` is called, JAX slices `input_x` into `num_devices` chunks: the first chunk `input_x[0]` goes to device 0, `input_x[1]` to device 1, and so on. The same happens for `input_y`.
- Each device independently executes the `simple_add` logic on its assigned slice.
- Printing from inside a compiled function requires `jax.debug.print`; it runs once per device, showing which slice of the computation each device handles.
- Finally, JAX gathers the results from all devices and stacks them back into a single array, again along the leading axis.
Notice how the output shape (num_devices, 5) is preserved. pmap maintains the “batch” dimension across devices.
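Collectives are what make `pmap` more than a parallel loop: given an `axis_name`, devices can communicate mid-computation. A small sketch that normalizes data by its global sum using `jax.lax.psum` (the virtual-device flag is an assumption so this runs on a CPU-only machine):

```python
import os
os.environ.setdefault("XLA_FLAGS", "--xla_force_host_platform_device_count=4")

import functools
import jax
import jax.numpy as jnp

n = jax.device_count()

@functools.partial(jax.pmap, axis_name="devices")
def global_normalize(x):
    # psum adds the per-device partial sums across ALL devices,
    # so every device divides by the same global total.
    total = jax.lax.psum(jnp.sum(x), axis_name="devices")
    return x / total

data = jnp.arange(1, n * 2 + 1, dtype=jnp.float32).reshape(n, 2)
out = global_normalize(data)
print(out.sum())  # the whole array now sums to 1.0
```

The same pattern, with `jax.lax.pmean` on gradients instead of `psum` on activations, is the heart of data-parallel training.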
2. Integrating with Tunix for Distributed Post-Training
Tunix is built on top of JAX and Flax, which are inherently designed for multi-device execution. When you use Tunix’s Trainer (or similar components), it often leverages these underlying JAX capabilities to distribute the workload.
The good news is that for many common data-parallel scenarios, Tunix (like its upstream libraries Flax and Orbax) often handles the complexities of pmap, shard_map, and device management for you. You typically configure your distributed environment and then use Tunix’s API as usual.
Let’s outline a conceptual Tunix distributed training setup. This won’t be runnable code without a full Tunix Trainer implementation, but it illustrates the principles.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Let's assume Tunix has a Trainer class that can take device configurations.
# This is illustrative; the actual Tunix API may vary, but it follows JAX patterns.
# from tunix import TunixModel, TunixOptimizer, TunixTrainer, TunixConfig

# --- Conceptual Tunix components (pseudo-code) ---
class TunixModel:
    def __init__(self, name="dummy_llm"):
        print(f"Initializing {name} parameters...")
        # In a real scenario, this would load pre-trained weights
        self.params = {"w": jnp.ones((10, 10)), "b": jnp.zeros((10,))}

    def apply(self, params, inputs):
        # A simple forward pass
        return jnp.dot(inputs, params["w"]) + params["b"]

class TunixOptimizer:
    def __init__(self, learning_rate=1e-4):
        self.lr = learning_rate
        self.state = {}  # e.g., Adam states

    def update(self, params, grads):
        new_params = jax.tree_util.tree_map(
            lambda p, g: p - self.lr * g, params, grads
        )
        # Update optimizer state here
        return new_params, self.state

class TunixTrainer:
    def __init__(self, model, optimizer, config):
        self.model = model
        self.optimizer = optimizer
        self.config = config
        self.params = model.params  # Initial parameters

        # --- IMPORTANT: JAX Mesh and Sharding Setup ---
        # 1. Get available devices
        local_devices = jax.devices()
        num_devices = len(local_devices)

        # 2. Create a device mesh.
        # This defines a logical arrangement of your devices.
        # Here, we assume a simple 1D mesh for data parallelism.
        # For more complex setups, you could have (num_hosts, num_devices_per_host).
        device_array = mesh_utils.create_device_mesh((num_devices,), devices=local_devices)
        self.mesh = Mesh(device_array, axis_names=('devices',))
        print(f"\nCreated JAX Mesh: {self.mesh}")

        # 3. Define the sharding strategy for model parameters.
        # We want model parameters to be replicated across all devices
        # for data parallelism, so each device has a full copy.
        self.param_sharding = NamedSharding(self.mesh, PartitionSpec())  # Empty spec means replicate

        # 4. Replicate initial parameters across devices.
        # We use jax.device_put to place params on the mesh with the specified sharding.
        self.params = jax.device_put(self.params, self.param_sharding)
        print(f"Initial parameters sharded: {self.params}")
        print(f"Example parameter 'w' sharding: {self.params['w'].sharding}")

        # 5. Compile the training step with pmap or shard_map.
        # Tunix would internally define a `train_step` function, decorated with
        # `jax.pmap` or expressed with `shard_map`, to handle data distribution
        # and gradient aggregation. For simplicity, we sketch a pmap structure here.
        # (Note: pmap expects its inputs to carry a leading device axis; in real
        # code, replicated parameters are stacked per device, e.g. with
        # flax.jax_utils.replicate, rather than placed with a NamedSharding.)
        def _train_step_fn(params, optimizer_state, batch_inputs, batch_labels):
            # Calculate loss and gradients
            def loss_fn(current_params):
                predictions = self.model.apply(current_params, batch_inputs)
                loss = jnp.mean((predictions - batch_labels) ** 2)  # Simple MSE loss
                return loss, predictions

            (loss, predictions), grads = jax.value_and_grad(loss_fn, has_aux=True)(params)

            # All-reduce gradients across devices.
            # For data parallelism, we average gradients with pmean.
            grads = jax.lax.pmean(grads, axis_name='devices')

            # Update parameters
            new_params, new_optimizer_state = self.optimizer.update(params, grads)
            return new_params, new_optimizer_state, loss

        # Apply pmap. 'devices' is the axis_name used by the collectives above.
        self._pmapped_train_step = jax.pmap(_train_step_fn, axis_name='devices')

    def train(self, dataset, num_epochs):
        print("\nStarting distributed training...")
        # Assume the dataset yields batches that are already sharded for devices
        optimizer_state = self.optimizer.state  # Initialize optimizer state
        for epoch in range(num_epochs):
            for batch_idx, (batch_inputs, batch_labels) in enumerate(dataset):
                # batch_inputs and batch_labels must have a leading dimension
                # equal to the number of devices, e.g.
                # batch_inputs.shape = (num_devices, micro_batch_size, feature_dim).
                # This is a critical part of data loading for pmap.

                # Perform the pmapped training step
                self.params, optimizer_state, current_loss = \
                    self._pmapped_train_step(self.params, optimizer_state, batch_inputs, batch_labels)

                if batch_idx % 10 == 0:
                    # current_loss is an array with one loss value per device;
                    # average it for a single metric.
                    avg_loss = jnp.mean(current_loss)
                    print(f"Epoch {epoch+1}, Batch {batch_idx}: Avg Loss = {avg_loss:.4f}")
        print("Distributed training finished!")
        return self.params

# --- Main execution block ---
if __name__ == "__main__":
    # Configuration for the Tunix Trainer
    class TunixTrainingConfig:
        learning_rate = 1e-4
        batch_size_per_device = 32  # This would be the micro-batch size
        total_batch_size = len(jax.devices()) * batch_size_per_device
        num_epochs = 2

    config = TunixTrainingConfig()

    # Initialize model and optimizer
    model = TunixModel()
    optimizer = TunixOptimizer(learning_rate=config.learning_rate)

    # Initialize the Tunix Trainer, which sets up the distributed aspects
    trainer = TunixTrainer(model, optimizer, config)

    # Prepare a dummy dataset that mimics sharded batches
    def create_dummy_dataset(num_batches=100, num_devices=len(jax.devices()),
                             batch_size_per_device=config.batch_size_per_device,
                             input_dim=10, output_dim=10):
        for _ in range(num_batches):
            # Create a batch with a leading device dimension
            inputs = jnp.ones((num_devices, batch_size_per_device, input_dim))
            labels = jnp.zeros((num_devices, batch_size_per_device, output_dim))
            yield inputs, labels

    dummy_dataset = create_dummy_dataset()

    # Start training
    final_params = trainer.train(dummy_dataset, num_epochs=config.num_epochs)
    print("\nFinal model parameters (example 'w' on device 0):\n", final_params['w'][0])
Breaking Down the Tunix Distributed Workflow (Conceptual):
- Device Mesh Creation (`mesh_utils.create_device_mesh`): The `TunixTrainer` first identifies all available JAX devices and organizes them into a logical `Mesh`. For data parallelism, a simple 1D mesh is common. This `Mesh` is a fundamental building block for JAX's advanced sharding capabilities.
- Parameter Sharding (`NamedSharding`): For data parallelism, we want each device to have a full copy of the model parameters. This is achieved by creating a `NamedSharding` with an empty `PartitionSpec()`, which tells JAX to replicate the parameters across all devices in the mesh.
- Parameter Replication (`jax.device_put`): The initial model parameters are then explicitly placed onto the `Mesh` with the specified `NamedSharding`. This ensures all devices start with identical model weights.
- `_pmapped_train_step` Function: This is the core of the distributed training loop.
  - It's decorated with `jax.pmap` (or internally uses `shard_map`). The `axis_name='devices'` names the mapped device axis so collectives can refer to it.
  - Inside this function, the `loss_fn` computes the loss and gradients per device.
  - Gradient Aggregation (`jax.lax.pmean`): After each device computes its gradients, `jax.lax.pmean(grads, axis_name='devices')` is crucial. This performs an "all-reduce" operation, summing the gradients from all devices and then averaging them. Each device then receives the averaged gradients.
  - Parameter Update: Each device uses these averaged gradients to update its local copy of the model parameters. Because gradients are averaged and updates are deterministic, all model copies remain synchronized.
- Data Loading: The `dataset` in our `train` method is expected to provide batches that are already sharded along the device dimension. If you have 8 devices, a batch might have a shape like `(8, micro_batch_size, ...)`. Your data loading pipeline (e.g., using `tf.data` or `torch.utils.data` with JAX) is responsible for pre-sharding data for efficient transfer to devices.
- `Trainer.train`: The main training loop iterates through epochs and batches, calling `_pmapped_train_step`. The trainer manages the `params` and `optimizer_state`, which are themselves sharded JAX arrays.
This conceptual setup highlights how Tunix, by leveraging JAX’s powerful primitives, simplifies distributed LLM post-training. The end-user typically interacts with high-level configuration and a Trainer API, while JAX handles the complex multi-device synchronization under the hood.
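Stripped of the Trainer scaffolding, the same workflow fits in a few lines. Here is a minimal runnable sketch of one synchronized data-parallel SGD step; the tiny linear model, the loss, and the virtual-device flag are all illustrative assumptions rather than Tunix internals:

```python
import os
os.environ.setdefault("XLA_FLAGS", "--xla_force_host_platform_device_count=4")

import functools
import jax
import jax.numpy as jnp

n = jax.device_count()
lr = 0.1

def loss_fn(w, x, y):
    # Tiny linear model with MSE loss
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

@functools.partial(jax.pmap, axis_name="devices")
def train_step(w, x, y):
    grads = jax.grad(loss_fn)(w, x, y)
    grads = jax.lax.pmean(grads, axis_name="devices")  # sync gradients
    return w - lr * grads                              # identical update everywhere

# Replicate weights for pmap: stack one copy per device along a leading axis
w = jnp.zeros((3,))
w_repl = jnp.broadcast_to(w, (n, 3))

# Each device gets a different shard of the batch
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (n, 8, 3))  # (devices, per-device batch, features)
y = jnp.ones((n, 8))

w_repl = train_step(w_repl, x, y)
# All replicas stay identical because the updates are synchronized
print(w_repl)
```

Note the replication style: for `pmap`, replicated parameters are stacked along a leading device axis, which is exactly the shape contract the conceptual Trainer's comments describe.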
Mini-Challenge: Parallelizing a JAX Function
Let’s put your pmap understanding to the test!
Challenge:
You have a simple JAX function that calculates the mean of each row in a batch of data. Your task is to modify this function and its call to run in a data-parallel fashion using jax.pmap across all available devices.
import jax
import jax.numpy as jnp

# Original function
def calculate_row_means(data_batch):
    return jnp.mean(data_batch, axis=-1)

# Prepare dummy data for a single device
single_device_data = jnp.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0]
])

print("Original data:\n", single_device_data)
print("Row means (single device):\n", calculate_row_means(single_device_data))

# --- Your task starts here ---
# 1. Decorate `calculate_row_means` with `jax.pmap`
# 2. Modify `single_device_data` to be suitable for `pmap`
#    (i.e., add a leading device dimension)
# 3. Call the pmapped function and print the result.
Hint: Remember that pmap expects the leading dimension of its inputs to correspond to the number of devices. If you have 4 devices, your input should have a shape like (4, batch_size, ...).
What to observe/learn:
- How `pmap` transparently slices and gathers data.
- The shape of the output from a `pmap`-decorated function.
- The elegance of writing single-device code and letting JAX parallelize it.
Common Pitfalls & Troubleshooting
Distributed training, while powerful, can introduce its own set of challenges. Here are a few common pitfalls and tips for troubleshooting:
Device Not Found/Configured:
- Problem: JAX only sees one device, or you get errors about device allocation.
- Troubleshooting:
- Ensure your JAX installation matches your hardware (CPU, GPU, TPU).
- Verify your drivers are up-to-date (for GPUs).
- Check environment variables like `JAX_PLATFORM_NAME` (e.g., `export JAX_PLATFORM_NAME=cuda`) or `CUDA_VISIBLE_DEVICES`.
- For multi-node setups, ensure inter-process communication is initialized correctly (e.g., via `jax.distributed.initialize` or your cluster's launcher).
- Run `jax.devices()` at the start of your script to confirm JAX sees all expected devices.
Input Data Sharding Issues (Shape Mismatches):
- Problem: `pmap` or `shard_map` complains about input shapes not matching the number of devices or sharding specifications.
- Troubleshooting:
  - For `pmap`, always ensure your primary batch dimension is the leading dimension and equals the number of devices: `input_data.shape[0] == num_devices`.
  - For `shard_map` and `NamedSharding`, carefully review your `PartitionSpec` definitions and how they align with your array dimensions.
  - Use `jax.debug.print` or `print(x.shape)` at various points inside your `pmap`-decorated function to inspect the shapes of arrays on each device.
Communication Overhead and Performance Bottlenecks:
- Problem: Your distributed training is slower than expected, or not scaling linearly with the number of devices.
- Troubleshooting:
- Batch Size: Ensure your per-device batch size is large enough to keep the devices busy and amortize communication costs. Too small a batch size can lead to communication dominating computation.
- Profiling: Use JAX's built-in profiler (`jax.profiler`) to identify bottlenecks. Is it computation, data loading, or communication?
- Data Loading: Your data loading pipeline needs to be highly optimized to feed data to all devices quickly. Pre-sharding and asynchronous loading are key.
- JIT Compilation: Ensure your entire training step is JIT-compiled (`@jax.jit`, or implicitly by `pmap`). Avoid Python loops inside the compiled function.
- `jax.lax.pmean`/all-reduce: While necessary for data parallelism, `pmean` involves communication. If you have many parameters or gradients, this can be a bottleneck. JAX's XLA backend is highly optimized for these collective operations.
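One profiling pitfall deserves a concrete example: JAX dispatches work asynchronously, so naive timing often measures only the dispatch, not the computation. A small sketch using `block_until_ready` (the matrix size is arbitrary):

```python
import time
import jax
import jax.numpy as jnp

x = jnp.ones((2000, 2000))

@jax.jit
def heavy(x):
    return (x @ x).sum()

heavy(x).block_until_ready()  # warm-up: trigger compilation first

start = time.perf_counter()
result = heavy(x)             # returns almost immediately (async dispatch)
dispatch_s = time.perf_counter() - start

result.block_until_ready()    # wait for the actual computation to finish
total_s = time.perf_counter() - start

print(f"dispatch: {dispatch_s * 1e3:.2f} ms, total: {total_s * 1e3:.2f} ms")
```

Always call `block_until_ready()` (or use `jax.profiler`) before trusting a benchmark number, or you will "optimize" the wrong thing.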
Debugging Distributed Programs:
- Problem: It’s hard to debug errors that occur across multiple devices.
- Troubleshooting:
- Start Small: If possible, test your core logic on a single device first, then scale up.
- `jax.disable_jit()`: Temporarily disable JIT to get Python tracebacks, though execution will be much slower.
- `jax.debug.print`: This is your best friend! It prints values from each device during a compiled execution, making it invaluable for inspecting intermediate states.
- `pdb` on `pmap`: Debugging with `pdb` inside `pmap` is tricky. You may need to temporarily remove `pmap` or use `jax.debug.breakpoint()` for more controlled stops.
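`jax.debug.print` works where Python's `print` does not, because it is staged into the compiled computation and fires at run time with concrete values. A quick sketch:

```python
import jax
import jax.numpy as jnp

@jax.jit
def scaled_mean(x):
    m = jnp.mean(x)
    # A plain print here would fire once at trace time with an abstract
    # tracer; jax.debug.print fires at run time with the concrete value.
    jax.debug.print("intermediate mean = {m}", m=m)
    return 2.0 * m

result = scaled_mean(jnp.arange(4.0))  # prints: intermediate mean = 1.5
print(result)  # 3.0
```

Inside `pmap`, the same call fires once per device, which is what makes it so useful for inspecting per-device state.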
Summary
Congratulations! You’ve navigated the complexities of distributed training and scaling with Tunix and JAX. This is a critical skill for working with modern LLMs.
Here are the key takeaways from this chapter:
- Necessity of Distribution: Large LLMs and datasets demand distributed training to overcome memory limits and achieve reasonable training speeds.
- JAX's Parallelism Primitives: JAX provides powerful, high-level abstractions for multi-device computation, primarily `pmap` for data parallelism and the more flexible `shard_map` with `jax.sharding.Mesh` for advanced sharding.
- Data Parallelism: The most common approach, where each device holds a copy of the model and processes a slice of the data, with gradients aggregated across devices.
- Tunix Integration: Tunix leverages JAX's underlying distributed capabilities. While you might not directly call `pmap` or `shard_map` in your Tunix code, understanding these JAX primitives explains how Tunix's `Trainer` efficiently distributes workloads, manages device meshes, and handles parameter sharding and gradient aggregation.
- Key Components: `jax.devices()`, `mesh_utils.create_device_mesh`, `jax.sharding.NamedSharding`, and `jax.lax.pmean` are fundamental building blocks for distributed JAX programs.
- Troubleshooting: Be mindful of device configuration, input data shapes, and potential communication bottlenecks, and use `jax.debug.print` for effective debugging.
You now have a solid foundation for understanding how Tunix scales to tackle even the largest LLM post-training tasks. In the next chapter, we’ll shift our focus to performance optimization techniques, further enhancing the speed and efficiency of your Tunix workflows.
References
- JAX Official Documentation - Distributed Arrays and Sharding: https://jax.readthedocs.io/en/latest/notebooks/Distributed_JAX_intro.html
- JAX Official Documentation - `pmap`: https://jax.readthedocs.io/en/latest/jax.html#pmap
- JAX Official Documentation - `shard_map`: https://jax.readthedocs.io/en/latest/_autosummary/jax.experimental.shard_map.shard_map.html
- Google Developers Blog - Introducing Tunix: A JAX-Native Library for LLM Post-Training: https://developers.googleblog.com/introducing-tunix-a-jax-native-library-for-llm-post-training/
- Tunix ReadTheDocs (Official Documentation): https://tunix.readthedocs.io/
This page is AI-assisted and reviewed. It references official documentation and recognized resources where relevant.