Welcome back, aspiring data wizards! In our journey through the fascinating world of OpenZL, we’ve explored its core concepts and seen how it intelligently handles structured data. Now, it’s time to roll up our sleeves and tackle a real-world challenge that many of you in machine learning or data science might face: efficiently archiving Machine Learning (ML) tensors.

This chapter will guide you through a hands-on project where we’ll leverage OpenZL’s unique capabilities to compress and decompress ML tensors. You’ll learn how to describe complex data structures to OpenZL, build a custom compression pipeline, and verify the integrity of your archived data. By the end, you’ll not only have a practical understanding of OpenZL but also a valuable tool for managing the ever-growing datasets in your ML projects. To make the most of this chapter, a basic grasp of OpenZL’s data description and compression graph concepts, as covered in previous chapters, will be very helpful. Familiarity with Python and the NumPy library will also be beneficial for the practical exercises.

Core Concepts: OpenZL for ML Tensors

Machine Learning models often deal with vast amounts of data, represented as multi-dimensional arrays called tensors. Whether it’s input data, model weights, or intermediate activations, these tensors can consume significant storage and bandwidth. Generic compression algorithms, while useful, often treat these tensors as mere streams of bytes, missing out on the inherent structure and redundancies within the data. This is where OpenZL shines!

What Makes ML Tensors Special for Compression?

  1. Structured Data: Tensors are inherently structured. They have a defined shape (dimensions) and dtype (data type, e.g., float32, int64). This structure is a goldmine for format-aware compression.
  2. Redundancy and Patterns: Tensors, especially those from real-world data or model outputs, often contain patterns, repeated values, or values within a certain range. For instance, image tensors might have large areas of similar pixel values, or activation tensors might have many zeros (sparsity).
  3. Specific Data Types: Unlike generic byte streams, tensors consist of specific numerical types. A compressor that understands float32 can apply floating-point specific optimizations that a byte-level compressor cannot.
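To see why this structure matters, here is a small pure-NumPy sketch, with no OpenZL involved and the standard-library zlib standing in for a real backend compressor. Regrouping the bytes of a float32 tensor by byte position (a simple format-aware transform sometimes called byte-plane transposition) typically exposes far more redundancy to a generic compressor than the raw byte stream does:

```python
import zlib
import numpy as np

# A smooth float32 tensor: values drift gradually, like an activation map.
rng = np.random.default_rng(0)
t = np.cumsum(rng.standard_normal((64, 64)).astype(np.float32), axis=1)

raw = t.tobytes()

# Format-aware transform: gather the i-th byte of every float together
# (byte-plane transposition), so similar sign/exponent bytes cluster.
planes = t.view(np.uint8).reshape(-1, 4).T.tobytes()

print("raw bytes compressed:   ", len(zlib.compress(raw, 6)))
print("byte-planes compressed: ", len(zlib.compress(planes, 6)))
```

On smooth data the byte-plane layout usually compresses noticeably smaller, because the high-order bytes of neighboring floats are nearly identical. This is exactly the kind of dtype-aware transform a byte-level compressor cannot discover on its own.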

OpenZL’s Advantage with Tensors

OpenZL’s power lies in its ability to consume a description of your data and then intelligently build a specialized compressor. For ML tensors, this means we can tell OpenZL:

  • The Tensor’s dtype: Is it a 32-bit float, a 64-bit integer, or something else? Knowing this allows OpenZL to pick codecs optimized for those specific number representations.
  • The Tensor’s shape: The dimensions of the tensor can inform codecs about potential spatial or temporal redundancies.
  • Expected Value Ranges/Distributions: Though this is a more advanced technique, OpenZL can also incorporate statistical properties of your data to optimize compression further.

By providing this metadata, OpenZL can construct a “compression graph” that applies a sequence of specialized codecs. For example, it might first transform floating-point numbers to integers, then apply a delta encoding, and finally use a general-purpose byte compressor like Zstd. This multi-stage, format-aware approach often yields significantly better compression ratios and speeds compared to generic methods.
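The multi-stage idea is easy to prototype in plain NumPy. The sketch below is our own illustration, not the OpenZL API: zlib stands in for Zstd, and the quantize and delta stages assume values fit comfortably in int16 after scaling.

```python
import zlib
import numpy as np

def compress_pipeline(t: np.ndarray, scale: float = 100.0) -> bytes:
    # Stage 1: quantize float32 -> int16 (lossy; assumes |t * scale| < 32767)
    q = np.round(t * scale).astype(np.int16).ravel()
    # Stage 2: delta-encode, since neighboring values are often close
    d = np.empty_like(q)
    d[0] = q[0]
    d[1:] = q[1:] - q[:-1]
    # Stage 3: general-purpose byte compressor (zlib standing in for Zstd)
    return zlib.compress(d.tobytes(), 6)

def decompress_pipeline(blob: bytes, shape, scale: float = 100.0) -> np.ndarray:
    d = np.frombuffer(zlib.decompress(blob), dtype=np.int16)
    q = np.cumsum(d, dtype=np.int16)  # inverts the delta step
    return (q.astype(np.float32) / scale).reshape(shape)
```

Round-tripping a tensor through these two functions recovers it to within the quantization step (0.5 / scale), which is precisely the lossy-but-bounded behavior we will verify with a tolerance check later in this chapter.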

Let’s visualize this conceptual flow:

flowchart LR
    A[Raw ML Tensor] -->|Describe Structure| B{Data Description}
    B -->|Build Pipeline| C[OpenZL Compression Graph]
    C -->|Apply Codecs| D[Specialized Compressor]
    D --> E[Compressed Data File]
    E -->|Decompress Pipeline| F[OpenZL Decompressor]
    F -->|Reconstruct Data| G[Reconstructed ML Tensor]
    G -->|Compare| H{Verification}

As you can see, the data description is the crucial first step that informs OpenZL how to build its intelligent compression and decompression pipeline.

Step-by-Step Implementation: Archiving a NumPy Tensor

Let’s get our hands dirty! We’ll use Python and NumPy to simulate an ML tensor, then use OpenZL’s (illustrative) Python bindings to compress and decompress it.

Prerequisites:

  • Python 3.9+ (as of early 2026, recommended for modern libraries)
  • NumPy 1.25.0+ (latest stable release is generally recommended)
  • OpenZL Python Bindings: You’ll need to install the openzl-py package. If not already installed, open your terminal and run:
    pip install numpy openzl-py==0.2.1  # Using an illustrative version, check official docs for latest
    
    Note: The openzl-py version 0.2.1 is illustrative for a 2026 context. Always refer to the official OpenZL documentation for the most current installation instructions and stable release versions.

Step 1: Generate a Sample ML Tensor

First, let’s create a simple Python script to generate a NumPy array that represents a typical ML tensor.

Create a file named archive_tensor.py and add the following lines:

import numpy as np
import openzl_py as openzl
import json
import os

print("--- Step 1: Generating Sample ML Tensor ---")

# Define tensor properties
tensor_shape = (100, 64, 32)
tensor_dtype = np.float32

# Create a sample tensor with some structure (e.g., a gradient or activation map)
# We'll make it somewhat compressible by having values clustered around certain points
original_tensor = np.random.rand(*tensor_shape).astype(tensor_dtype) * 100
original_tensor[20:30, :, :] = original_tensor[20:30, :, :] * 0.1 # Introduce some smaller values
original_tensor[:, 10:20, :] += 50 # Introduce a localized higher value region

print(f"Original tensor shape: {original_tensor.shape}")
print(f"Original tensor dtype: {original_tensor.dtype}")
print(f"Original tensor size (bytes): {original_tensor.nbytes}")
print(f"First 5 elements (flattened): {original_tensor.flatten()[:5]}")

Explanation:

  • We import numpy for tensor creation and openzl_py for OpenZL functionalities. json and os will be used later.
  • tensor_shape and tensor_dtype define our example tensor’s characteristics.
  • np.random.rand creates random floating-point numbers. We then introduce some artificial patterns (smaller values, higher value regions) to make it more realistic and potentially more compressible.
  • original_tensor.nbytes shows the uncompressed size of our tensor.

Step 2: Describe the Tensor’s Structure to OpenZL

OpenZL needs to understand the tensor’s metadata to apply smart compression. We’ll create a JSON-based data description.

Append the following code to archive_tensor.py:

print("\n--- Step 2: Describing Tensor Structure to OpenZL ---")

# OpenZL needs a description of the data.
# For a NumPy array, we provide its shape and dtype.
# OpenZL's description format is typically JSON-based.
tensor_description = {
    "type": "numpy_tensor",
    "shape": list(original_tensor.shape), # Convert tuple to list for JSON
    "dtype": str(original_tensor.dtype),  # Convert numpy dtype to string
    "encoding": "little_endian" # Byte order (not a text encoding); default on most hardware
}

print("OpenZL Data Description:")
print(json.dumps(tensor_description, indent=2))

Explanation:

  • tensor_description is a Python dictionary that will be serialized to JSON.
  • "type": "numpy_tensor" tells OpenZL what kind of data we’re dealing with.
  • "shape" and "dtype" are extracted directly from our NumPy tensor. We convert shape to a list and dtype to a string as JSON doesn’t directly support NumPy types.
  • "encoding": "little_endian" records the byte order in which multi-byte values are laid out in memory; little-endian is the default on virtually all modern hardware (you can confirm with Python’s sys.byteorder).
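Because a stale description is a common source of bugs (see the Pitfalls section later in this chapter), it can help to generate the description from the array itself and sanity-check it before compressing. These helper names are our own convenience functions, not part of openzl-py:

```python
import json
import numpy as np

def describe_array(arr: np.ndarray) -> dict:
    """Build the JSON-serializable description used in this chapter."""
    return {
        "type": "numpy_tensor",
        "shape": list(arr.shape),
        "dtype": str(arr.dtype),
        "encoding": "little_endian",
    }

def check_description(arr: np.ndarray, desc: dict) -> None:
    """Fail fast if the description has drifted out of sync with the array."""
    assert tuple(desc["shape"]) == arr.shape, "shape mismatch"
    assert desc["dtype"] == str(arr.dtype), "dtype mismatch"
    expected = int(np.prod(desc["shape"], dtype=np.int64)) * np.dtype(desc["dtype"]).itemsize
    assert expected == arr.nbytes, "byte count mismatch"

arr = np.zeros((4, 8), dtype=np.float32)
desc = describe_array(arr)
check_description(arr, desc)
print(json.dumps(desc, indent=2))
```

Calling check_description right before compress() catches the shape/dtype drift described in the pitfalls at the point where it is cheapest to fix.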

Step 3: Define a Compression Graph and Compress

Now, we define how OpenZL should compress this data. This involves specifying a sequence of codecs. For floating-point data, we might use a specialized float compressor followed by a general-purpose one.

Append the following code to archive_tensor.py:

print("\n--- Step 3: Defining Compression Graph and Compressing ---")

# Define the compression graph (pipeline of codecs)
# This graph is also typically defined in a structured format like JSON or YAML.
# For ML tensors, a common strategy is:
# 1. A codec that handles the specific numerical type (e.g., float_quantize, float_delta)
# 2. A general-purpose byte-level compressor (e.g., zstd, lz4)
compression_graph = {
    "input": "numpy_tensor",
    "nodes": [
        {"id": "float_codec", "codec": "openzl.codecs.FloatQuantize", "params": {"bits": 16}},
        {"id": "byte_compressor", "codec": "openzl.codecs.Zstd", "params": {"level": 3}}
    ],
    "edges": [
        {"from": "input", "to": "float_codec"},
        {"from": "float_codec", "to": "byte_compressor"},
        {"from": "byte_compressor", "to": "output"}
    ]
}

print("OpenZL Compression Graph:")
print(json.dumps(compression_graph, indent=2))

# Create an OpenZL compressor instance
# In a real scenario, you might load these from files.
compressor = openzl.Compressor(
    data_description=tensor_description,
    compression_graph=compression_graph
)

# Perform compression
compressed_data = compressor.compress(original_tensor.tobytes())

# Save compressed data to a file
output_filename = "compressed_ml_tensor.ozl"
with open(output_filename, "wb") as f:
    f.write(compressed_data)

print(f"Compressed data saved to: {output_filename}")
print(f"Compressed size (bytes): {len(compressed_data)}")
print(f"Compression ratio: {original_tensor.nbytes / len(compressed_data):.2f}x")

Explanation:

  • compression_graph defines our pipeline. It’s a dictionary with input, nodes (codecs), and edges (how data flows).
  • openzl.codecs.FloatQuantize: This is an illustrative codec that quantizes floating-point numbers, reducing their precision but often leading to better compression. Here, we’re hypothetically quantizing to 16 bits.
  • openzl.codecs.Zstd: A widely used, fast, and efficient general-purpose compressor that works on the byte stream produced by the FloatQuantize codec. The level parameter trades compression speed against ratio.
  • openzl.Compressor: We instantiate OpenZL’s compressor with our data_description and compression_graph.
  • compressor.compress(original_tensor.tobytes()): We pass the raw byte representation of our NumPy array to the compressor. OpenZL uses the description to interpret these bytes and apply the graph.
  • The compressed data is saved to a file, and we print the size and ratio.

Step 4: Decompress and Verify

The final and crucial step is to decompress the data and ensure it’s identical (or acceptably close, if lossy compression was used) to the original.

Append the following code to archive_tensor.py:

print("\n--- Step 4: Decompressing and Verifying ---")

# Load compressed data from file
with open(output_filename, "rb") as f:
    loaded_compressed_data = f.read()

# Create an OpenZL decompressor instance
decompressor = openzl.Decompressor(
    data_description=tensor_description,
    compression_graph=compression_graph # Decompression uses the same graph logic in reverse
)

# Perform decompression
decompressed_bytes = decompressor.decompress(loaded_compressed_data)

# Reconstruct the NumPy tensor
reconstructed_tensor = np.frombuffer(decompressed_bytes, dtype=tensor_dtype).reshape(tensor_shape)

print(f"Reconstructed tensor shape: {reconstructed_tensor.shape}")
print(f"Reconstructed tensor dtype: {reconstructed_tensor.dtype}")
print(f"First 5 elements (flattened): {reconstructed_tensor.flatten()[:5]}")

# Verify data integrity (for lossy compression, check for approximate equality)
# Since we used FloatQuantize, expect some precision loss.
# We'll check if the absolute difference is within a small tolerance.
tolerance = 1e-2 # A small tolerance for float comparisons

is_close = np.allclose(original_tensor, reconstructed_tensor, atol=tolerance, rtol=tolerance)

if is_close:
    print("\nVerification successful! Reconstructed tensor is approximately identical to original.")
else:
    print("\nVerification FAILED! Reconstructed tensor differs significantly from original.")
    # You might want to print differences for debugging
    diff = np.abs(original_tensor - reconstructed_tensor)
    max_diff = np.max(diff)
    print(f"Maximum absolute difference: {max_diff}")

# Clean up the compressed file
os.remove(output_filename)
print(f"\nCleaned up {output_filename}")

Explanation:

  • We load the compressed_ml_tensor.ozl file.
  • openzl.Decompressor: Similar to the compressor, it’s initialized with the same data description and compression graph. OpenZL automatically infers the reverse operations for decompression.
  • decompressor.decompress(): This reconstructs the original byte stream.
  • np.frombuffer().reshape(): We convert the decompressed bytes back into a NumPy array with the correct dtype and shape.
  • np.allclose(): This is crucial for verifying floating-point arrays. Due to the FloatQuantize codec, we expect some precision loss. np.allclose allows us to compare values within a specified absolute (atol) and relative (rtol) tolerance.
  • Finally, we clean up the temporary compressed file.
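It is worth being precise about what np.allclose checks: an element passes when |a - b| <= atol + rtol * |b|, so rtol scales with the magnitude of the reference values while atol sets an absolute floor. A tiny demonstration:

```python
import numpy as np

a = np.array([100.0, 0.001], dtype=np.float32)
b = np.array([100.5, 0.0015], dtype=np.float32)

# Loose tolerances absorb both the 0.5 absolute error and the tiny one.
print(np.allclose(a, b, atol=1e-2, rtol=1e-2))   # True

# With a strict relative tolerance and no atol, the ~0.5% error on the
# first element (0.5 / 100.5) exceeds rtol=1e-3 and the check fails.
print(np.allclose(a, b, rtol=1e-3, atol=0.0))    # False
```

Choose atol based on the quantization step of your lossy codec and rtol based on how much relative drift your downstream model can tolerate.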

Run your script: Save the entire code block as archive_tensor.py and run it from your terminal:

python archive_tensor.py

Observe the output, especially the compression ratio and the verification message!

Mini-Challenge: Experiment with a Different Data Type

You’ve seen how to compress a float32 tensor. Now, let’s challenge your understanding!

Challenge: Modify the archive_tensor.py script to compress and decompress a tensor of int64 (64-bit integer) type.

Hints:

  1. You’ll need to change tensor_dtype to np.int64.
  2. Adjust how original_tensor is generated to produce integer values.
  3. The tensor_description will need to reflect the new dtype.
  4. The compression_graph will likely need a different initial codec. openzl.codecs.FloatQuantize is for floats; OpenZL probably has an openzl.codecs.Delta or openzl.codecs.IntegerCodec that would be more appropriate for integers, especially if they are sequential or have small differences. You might also consider openzl.codecs.Varint if integer values can vary widely in magnitude.
  5. For verification, if you use a lossless integer codec, np.array_equal might be more appropriate than np.allclose. If you use a lossy integer codec, np.allclose with a tolerance might still be needed.
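As a warm-up for hint 4, here is a lossless delta-plus-byte-compressor round trip for sequential int64 data in plain NumPy. This is our own sketch (zlib standing in for a real backend codec; none of these calls are OpenZL API), but it shows why delta encoding pays off for integers with small differences:

```python
import zlib
import numpy as np

# Sequential-ish int64 data: deltas are mostly the constant 7.
ids = np.arange(0, 1_000_000, 7, dtype=np.int64)

deltas = np.diff(ids, prepend=np.int64(0))        # first delta is ids[0]
blob = zlib.compress(deltas.tobytes(), 6)

restored = np.cumsum(np.frombuffer(zlib.decompress(blob), dtype=np.int64))
assert np.array_equal(ids, restored)              # lossless: exact equality

print(f"raw: {ids.nbytes} bytes, compressed: {len(blob)} bytes")
```

Because every stage here is exactly invertible, np.array_equal is the right verification, matching hint 5.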

What to Observe/Learn:

  • How does changing the data type affect the necessary codecs in the compression graph?
  • What is the impact on the compression ratio? Do integers compress differently than floats?
  • Does the verification method need to change based on whether the compression is lossless or lossy for integers?

Take your time, experiment, and don’t be afraid to consult the (hypothetical) openzl_py documentation for available codecs!

Common Pitfalls & Troubleshooting

  1. Incorrect Data Description:
    • Pitfall: Mismatch between the actual NumPy dtype/shape and what you provide in tensor_description. OpenZL relies heavily on this.
    • Troubleshooting: Double-check that list(original_tensor.shape) and str(original_tensor.dtype) are correctly passed. If you change the tensor, ensure you update the description.
  2. Incompatible Codec Selection:
    • Pitfall: Using a codec designed for one data type (e.g., FloatQuantize) on another (e.g., int64). This will lead to incorrect compression/decompression or runtime errors.
    • Troubleshooting: Always ensure your chosen codecs are appropriate for the dtype specified in your data description. Refer to OpenZL’s codec documentation.
  3. Lossy vs. Lossless Verification:
    • Pitfall: Expecting perfect equality (np.array_equal) when using a lossy codec (like our FloatQuantize). This will incorrectly report a failed verification.
    • Troubleshooting: Understand if your selected codecs introduce loss. For floating-point data with lossy codecs, np.allclose with an appropriate tolerance is the correct way to verify. For lossless compression of integers, np.array_equal is ideal.
  4. OpenZL Installation Issues:
    • Pitfall: The openzl_py package might not be installed correctly or might have version conflicts.
    • Troubleshooting: Ensure you’re using a virtual environment. Re-run pip install openzl-py and check for any error messages. Verify the package is listed when you run pip freeze.

Summary

Congratulations! You’ve successfully completed a practical project applying OpenZL to archive Machine Learning tensors. Let’s recap the key takeaways:

  • ML Tensors are Structured Data: They possess shape and dtype metadata that OpenZL can exploit.
  • Data Description is Key: Providing OpenZL with an accurate JSON description of your tensor’s structure (type, shape, dtype) is fundamental for effective compression.
  • Compression Graphs Tailor Performance: By defining a compression_graph with a sequence of specialized codecs (e.g., FloatQuantize for floats, Zstd for general byte compression), you can optimize for specific data characteristics.
  • Format-Aware Compression Benefits: OpenZL’s ability to understand data formats leads to superior compression ratios and performance compared to generic byte-stream compressors for structured data.
  • Verification is Crucial: Always decompress and verify your data. For lossy compression, use tolerance-based comparisons like np.allclose.

This project demonstrates the power and flexibility of OpenZL in a critical domain like Machine Learning. In future chapters, we might explore integrating OpenZL into larger data pipelines, optimizing compression graphs for specific performance targets, or delving into more advanced data structures. Keep experimenting, and happy compressing!



References

  1. OpenZL GitHub Repository: The primary source for the OpenZL framework, including build instructions and core concepts. https://github.com/facebook/openzl
  2. Introducing OpenZL: An Open Source Format-Aware Compression Framework: Engineering at Meta’s announcement blog post. https://engineering.fb.com/2025/10/06/developer-tools/openzl-open-source-format-aware-compression-framework/
  3. NumPy Official Documentation: Essential for understanding array creation, manipulation, and data types in Python. https://numpy.org/doc/stable/
  4. Zstandard (Zstd) GitHub Repository: Details on the Zstd compression algorithm, often used as a backend codec in OpenZL. https://github.com/facebook/zstd