Welcome back, aspiring data wizards! In our journey through the fascinating world of OpenZL, we’ve explored its core concepts and seen how it intelligently handles structured data. Now, it’s time to roll up our sleeves and tackle a real-world challenge that many of you in machine learning or data science might face: efficiently archiving Machine Learning (ML) tensors.
This chapter will guide you through a hands-on project where we’ll leverage OpenZL’s unique capabilities to compress and decompress ML tensors. You’ll learn how to describe complex data structures to OpenZL, build a custom compression pipeline, and verify the integrity of your archived data. By the end, you’ll not only have a practical understanding of OpenZL but also a valuable tool for managing the ever-growing datasets in your ML projects. To make the most of this chapter, a basic grasp of OpenZL’s data description and compression graph concepts, as covered in previous chapters, will be very helpful. Familiarity with Python and the NumPy library will also be beneficial for the practical exercises.
Core Concepts: OpenZL for ML Tensors
Machine Learning models often deal with vast amounts of data, represented as multi-dimensional arrays called tensors. Whether it’s input data, model weights, or intermediate activations, these tensors can consume significant storage and bandwidth. Generic compression algorithms, while useful, often treat these tensors as mere streams of bytes, missing out on the inherent structure and redundancies within the data. This is where OpenZL shines!
What Makes ML Tensors Special for Compression?
- Structured Data: Tensors are inherently structured. They have a defined `shape` (dimensions) and `dtype` (data type, e.g., `float32`, `int64`). This structure is a goldmine for format-aware compression.
- Redundancy and Patterns: Tensors, especially those derived from real-world data or model outputs, often contain patterns, repeated values, or values confined to a narrow range. For instance, image tensors may have large areas of similar pixel values, and activation tensors may contain many zeros (sparsity).
- Specific Data Types: Unlike generic byte streams, tensors consist of specific numerical types. A compressor that understands `float32` can apply floating-point-specific optimizations that a byte-level compressor cannot.
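To make that last point concrete, here is a small, self-contained sketch of one classic dtype-aware trick: byte-plane shuffling (the approach used by HDF5’s shuffle filter and Blosc), applied before a generic byte compressor. It is a stand-in for what a real float codec might do, not OpenZL’s actual mechanism:

```python
import zlib

import numpy as np

# Smooth float32 data: neighboring values share their exponent and
# high mantissa bytes, but those bytes are interleaved in memory.
smooth = np.linspace(0.0, 1.0, 4096, dtype=np.float32)
raw = smooth.tobytes()

# Byte-plane shuffle: gather byte 0 of every float, then byte 1, etc.
# This dtype-aware step exposes the long constant runs that a plain
# byte-level compressor cannot see through the interleaving.
shuffled = smooth.view(np.uint8).reshape(-1, 4).T.tobytes()

plain = len(zlib.compress(raw, 9))
aware = len(zlib.compress(shuffled, 9))
print(f"byte-level only: {plain} bytes; shuffle + byte-level: {aware} bytes")
```

On smooth data like this, the shuffled stream compresses substantially better, purely because the transform knew the data was four-byte floats.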
OpenZL’s Advantage with Tensors
OpenZL’s power lies in its ability to consume a description of your data and then intelligently build a specialized compressor. For ML tensors, this means we can tell OpenZL:
- The tensor’s `dtype`: Is it a 32-bit float, a 64-bit integer, or something else? Knowing this allows OpenZL to pick codecs optimized for that specific number representation.
- The tensor’s `shape`: The dimensions of the tensor can inform codecs about potential spatial or temporal redundancies.
- Expected value ranges/distributions: While more advanced, OpenZL can even incorporate statistical properties of your data to further optimize compression.
By providing this metadata, OpenZL can construct a “compression graph” that applies a sequence of specialized codecs. For example, it might first transform floating-point numbers to integers, then apply a delta encoding, and finally use a general-purpose byte compressor like Zstd. This multi-stage, format-aware approach often yields significantly better compression ratios and speeds compared to generic methods.
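This multi-stage idea can be imitated with NumPy and the standard library alone. The sketch below uses delta encoding as the format-aware stage and zlib as a stand-in for the Zstd stage (the real pipeline is, of course, assembled by OpenZL from your description):

```python
import zlib

import numpy as np

# A slowly growing integer sequence, e.g. timestamps or sample indices.
values = np.arange(10_000, dtype=np.int32) * 4 + 100

# Stage 1 (format-aware): delta-encode, turning the ramp into near-constants.
deltas = np.diff(values, prepend=np.int32(0))

# Stage 2 (generic): byte-level compression of each representation.
raw_size = len(zlib.compress(values.tobytes(), 9))
delta_size = len(zlib.compress(deltas.tobytes(), 9))
print(f"zlib alone: {raw_size} bytes; delta + zlib: {delta_size} bytes")

# Decompression inverts the stages in reverse order: inflate, then cumsum.
restored = np.cumsum(deltas).astype(np.int32)
assert np.array_equal(values, restored)
```

The size difference comes entirely from the first stage reshaping the data into something the second stage handles well — the essence of a compression graph.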
Conceptually, the flow is: data description → specialized compression graph → compressed output. The data description is the crucial first step that informs OpenZL how to build its intelligent compression and decompression pipeline.
Step-by-Step Implementation: Archiving a NumPy Tensor
Let’s get our hands dirty! We’ll use Python and NumPy to simulate an ML tensor, then use OpenZL’s (illustrative) Python bindings to compress and decompress it.
Prerequisites:
- Python 3.9+ (as of 2026-01-26, recommended for modern libraries)
- NumPy 1.25.0+ (latest stable release is generally recommended)
- OpenZL Python Bindings: You’ll need to install the `openzl-py` package. If it is not already installed, open your terminal and run:

  ```shell
  pip install numpy openzl-py==0.2.1  # Using an illustrative version, check official docs for latest
  ```

  Note: The `openzl-py` version `0.2.1` is illustrative for a 2026 context. Always refer to the official OpenZL documentation for the most current installation instructions and stable release versions.
Step 1: Generate a Sample ML Tensor
First, let’s create a simple Python script to generate a NumPy array that represents a typical ML tensor.
Create a file named `archive_tensor.py` and add the following lines:

```python
import numpy as np
import openzl_py as openzl
import json
import os

print("--- Step 1: Generating Sample ML Tensor ---")

# Define tensor properties
tensor_shape = (100, 64, 32)
tensor_dtype = np.float32

# Create a sample tensor with some structure (e.g., a gradient or activation map).
# We'll make it somewhat compressible by having values clustered around certain points.
original_tensor = np.random.rand(*tensor_shape).astype(tensor_dtype) * 100
original_tensor[20:30, :, :] = original_tensor[20:30, :, :] * 0.1  # Introduce some smaller values
original_tensor[:, 10:20, :] += 50  # Introduce a localized higher-value region

print(f"Original tensor shape: {original_tensor.shape}")
print(f"Original tensor dtype: {original_tensor.dtype}")
print(f"Original tensor size (bytes): {original_tensor.nbytes}")
print(f"First 5 elements (flattened): {original_tensor.flatten()[:5]}")
```
Explanation:
- We import `numpy` for tensor creation and `openzl_py` for OpenZL functionality; `json` and `os` will be used later.
- `tensor_shape` and `tensor_dtype` define our example tensor’s characteristics.
- `np.random.rand` creates random floating-point numbers. We then introduce some artificial patterns (a region of smaller values, a localized higher-value region) to make the tensor more realistic and potentially more compressible.
- `original_tensor.nbytes` shows the uncompressed size of our tensor.
Step 2: Describe the Tensor’s Structure to OpenZL
OpenZL needs to understand the tensor’s metadata to apply smart compression. We’ll create a JSON-based data description.
Append the following code to `archive_tensor.py`:

```python
print("\n--- Step 2: Describing Tensor Structure to OpenZL ---")

# OpenZL needs a description of the data.
# For a NumPy array, we provide its shape and dtype.
# OpenZL's description format is typically JSON-based.
tensor_description = {
    "type": "numpy_tensor",
    "shape": list(original_tensor.shape),  # Convert tuple to list for JSON
    "dtype": str(original_tensor.dtype),   # Convert numpy dtype to string
    "encoding": "little_endian"            # Assuming common system encoding
}

print("OpenZL Data Description:")
print(json.dumps(tensor_description, indent=2))
```
Explanation:
- `tensor_description` is a Python dictionary that will be serialized to JSON.
- `"type": "numpy_tensor"` tells OpenZL what kind of data we’re dealing with.
- `"shape"` and `"dtype"` are extracted directly from our NumPy tensor. We convert `shape` to a list and `dtype` to a string, since JSON doesn’t directly support NumPy types.
- `"encoding": "little_endian"` is a common assumption for how multi-byte data types are stored in memory.
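Rather than writing the description by hand, it can be derived from any array. The helper below is a hypothetical convenience (real openzl-py bindings may expect a different schema); the point is that everything OpenZL needs falls out of NumPy metadata:

```python
import sys

import numpy as np

def describe_tensor(arr: np.ndarray) -> dict:
    """Build a JSON-serializable description of a NumPy array.

    The key names mirror this chapter's tensor_description; the schema
    itself is illustrative, not an official OpenZL format.
    """
    byteorder = arr.dtype.byteorder
    if byteorder == "=":  # native order: resolve via the running interpreter
        byteorder = "<" if sys.byteorder == "little" else ">"
    encoding = {"<": "little_endian", ">": "big_endian", "|": "not_applicable"}[byteorder]
    return {
        "type": "numpy_tensor",
        "shape": list(arr.shape),  # JSON has no tuples
        "dtype": str(arr.dtype),   # e.g. "float32"
        "encoding": encoding,
    }

desc = describe_tensor(np.zeros((2, 3), dtype=np.float32))
print(desc)
```

Deriving the description this way keeps it in sync with the tensor automatically, which sidesteps the description/tensor mismatch pitfall discussed later in the chapter.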
Step 3: Define a Compression Graph and Compress
Now, we define how OpenZL should compress this data. This involves specifying a sequence of codecs. For floating-point data, we might use a specialized float compressor followed by a general-purpose one.
Append the following code to `archive_tensor.py`:

```python
print("\n--- Step 3: Defining Compression Graph and Compressing ---")

# Define the compression graph (pipeline of codecs).
# This graph is also typically defined in a structured format like JSON or YAML.
# For ML tensors, a common strategy is:
#   1. A codec that handles the specific numerical type (e.g., float_quantize, float_delta)
#   2. A general-purpose byte-level compressor (e.g., zstd, lz4)
compression_graph = {
    "input": "numpy_tensor",
    "nodes": [
        {"id": "float_codec", "codec": "openzl.codecs.FloatQuantize", "params": {"bits": 16}},
        {"id": "byte_compressor", "codec": "openzl.codecs.Zstd", "params": {"level": 3}}
    ],
    "edges": [
        {"from": "input", "to": "float_codec"},
        {"from": "float_codec", "to": "byte_compressor"},
        {"from": "byte_compressor", "to": "output"}
    ]
}

print("OpenZL Compression Graph:")
print(json.dumps(compression_graph, indent=2))

# Create an OpenZL compressor instance.
# In a real scenario, you might load these from files.
compressor = openzl.Compressor(
    data_description=tensor_description,
    compression_graph=compression_graph
)

# Perform compression
compressed_data = compressor.compress(original_tensor.tobytes())

# Save compressed data to a file
output_filename = "compressed_ml_tensor.ozl"
with open(output_filename, "wb") as f:
    f.write(compressed_data)

print(f"Compressed data saved to: {output_filename}")
print(f"Compressed size (bytes): {len(compressed_data)}")
print(f"Compression ratio: {original_tensor.nbytes / len(compressed_data):.2f}x")
```
Explanation:
- `compression_graph` defines our pipeline. It’s a dictionary with `input`, `nodes` (codecs), and `edges` (how data flows).
- `openzl.codecs.FloatQuantize`: an illustrative codec that quantizes floating-point numbers, reducing their precision but often leading to better compression. Here, we’re hypothetically quantizing to 16 bits.
- `openzl.codecs.Zstd`: a widely used, fast, and efficient general-purpose compressor that works on the byte stream output from the `FloatQuantize` codec. The `level` parameter trades speed against compression ratio.
- `openzl.Compressor`: we instantiate OpenZL’s compressor with our `data_description` and `compression_graph`.
- `compressor.compress(original_tensor.tobytes())`: we pass the raw byte representation of our NumPy array to the compressor. OpenZL uses the description to interpret these bytes and apply the graph.
- The compressed data is saved to a file, and we print the size and compression ratio.
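To build intuition for what a 16-bit float stage could mean, here is a plain-NumPy stand-in that casts `float32` down to IEEE `float16` and measures the damage. The `FloatQuantize` codec named above is illustrative, and an actual implementation might use a different quantization scheme entirely:

```python
import numpy as np

rng = np.random.default_rng(0)
original = rng.random((100, 64), dtype=np.float32) * np.float32(100.0)

quantized = original.astype(np.float16)   # lossy: mantissa shrinks to 10 bits
restored = quantized.astype(np.float32)   # widen back for downstream use

max_err = float(np.max(np.abs(original - restored)))
print(f"payload: {original.nbytes} -> {quantized.nbytes} bytes")
print(f"max abs error after round trip: {max_err:.4f}")
```

The payload halves before the byte-level stage even runs, at the cost of a small, bounded error — exactly the trade-off the verification step must account for.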
Step 4: Decompress and Verify
The final and crucial step is to decompress the data and ensure it’s identical (or acceptably close, if lossy compression was used) to the original.
Append the following code to `archive_tensor.py`:

```python
print("\n--- Step 4: Decompressing and Verifying ---")

# Load compressed data from file
with open(output_filename, "rb") as f:
    loaded_compressed_data = f.read()

# Create an OpenZL decompressor instance
decompressor = openzl.Decompressor(
    data_description=tensor_description,
    compression_graph=compression_graph  # Decompression uses the same graph logic in reverse
)

# Perform decompression
decompressed_bytes = decompressor.decompress(loaded_compressed_data)

# Reconstruct the NumPy tensor
reconstructed_tensor = np.frombuffer(decompressed_bytes, dtype=tensor_dtype).reshape(tensor_shape)

print(f"Reconstructed tensor shape: {reconstructed_tensor.shape}")
print(f"Reconstructed tensor dtype: {reconstructed_tensor.dtype}")
print(f"First 5 elements (flattened): {reconstructed_tensor.flatten()[:5]}")

# Verify data integrity (for lossy compression, check for approximate equality).
# Since we used FloatQuantize, expect some precision loss.
# We'll check if the absolute difference is within a small tolerance.
tolerance = 1e-2  # A small tolerance for float comparisons
is_close = np.allclose(original_tensor, reconstructed_tensor, atol=tolerance, rtol=tolerance)

if is_close:
    print("\nVerification successful! Reconstructed tensor is approximately identical to original.")
else:
    print("\nVerification FAILED! Reconstructed tensor differs significantly from original.")
    # You might want to print differences for debugging
    diff = np.abs(original_tensor - reconstructed_tensor)
    max_diff = np.max(diff)
    print(f"Maximum absolute difference: {max_diff}")

# Clean up the compressed file
os.remove(output_filename)
print(f"\nCleaned up {output_filename}")
```
Explanation:
- We load the `compressed_ml_tensor.ozl` file.
- `openzl.Decompressor`: like the compressor, it is initialized with the same data description and compression graph. OpenZL automatically infers the reverse operations for decompression.
- `decompressor.decompress()`: reconstructs the original byte stream.
- `np.frombuffer(...).reshape(...)`: converts the decompressed bytes back into a NumPy array with the correct `dtype` and `shape`.
- `np.allclose()`: crucial for verifying floating-point arrays. Because of the `FloatQuantize` codec, we expect some precision loss; `np.allclose` compares values within a specified absolute (`atol`) and relative (`rtol`) tolerance.
- Finally, we clean up the temporary compressed file.
Run your script: Save the entire code block as `archive_tensor.py` and run it from your terminal:

```shell
python archive_tensor.py
```
Observe the output, especially the compression ratio and the verification message!
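Before the mini-challenge, it is worth seeing the lossless counterpart of this workflow using nothing but the standard library: zlib stands in for the byte-level stage, and because no stage loses information, exact equality via `np.array_equal` is the right verification:

```python
import zlib

import numpy as np

rng = np.random.default_rng(1)
tensor = rng.integers(0, 100, size=(10, 4), dtype=np.int64)

# Archive: serialize to bytes, then compress losslessly.
blob = zlib.compress(tensor.tobytes(), level=6)

# Restore: decompress, then rebuild the array from dtype + shape metadata.
restored = np.frombuffer(zlib.decompress(blob), dtype=tensor.dtype).reshape(tensor.shape)

# A lossless pipeline must reproduce the tensor bit for bit.
print(np.array_equal(tensor, restored))
```

Note that the dtype and shape still have to be carried alongside the blob — the same role the data description plays for OpenZL.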
Mini-Challenge: Experiment with a Different Data Type
You’ve seen how to compress a `float32` tensor. Now, let’s challenge your understanding!

Challenge: Modify the `archive_tensor.py` script to compress and decompress a tensor of `int64` (64-bit integer) type.
Hints:
- You’ll need to change `tensor_dtype` to `np.int64`.
- Adjust how `original_tensor` is generated so it produces integer values.
- The `tensor_description` will need to reflect the new `dtype`.
- The `compression_graph` will likely need a different initial codec. `openzl.codecs.FloatQuantize` is for floats; OpenZL probably has an `openzl.codecs.Delta` or `openzl.codecs.IntegerCodec` that would be more appropriate for integers, especially if they are sequential or have small differences. You might also consider `openzl.codecs.Varint` if integer values can vary widely in magnitude.
- For verification, if you use a lossless integer codec, `np.array_equal` might be more appropriate than `np.allclose`. If you use a lossy integer codec, `np.allclose` with a tolerance might still be needed.
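To see why that last hint matters, the contrast between a lossless and a lossy integer pre-pass can be prototyped directly in NumPy (the codec names in the hints are hypothetical; the transforms below are hand-rolled stand-ins):

```python
import numpy as np

values = np.arange(1_000, dtype=np.int64) * 3 + 7   # near-sequential integers

# Lossless pre-pass (delta): cumulative sum inverts it exactly,
# so np.array_equal is the right check.
restored_lossless = np.cumsum(np.diff(values, prepend=np.int64(0)))
print(np.array_equal(values, restored_lossless))    # exact equality holds

# Lossy pre-pass (drop the low 4 bits): exact equality fails,
# and only a tolerance-based check like np.allclose can pass.
restored_lossy = (values >> 4) << 4
print(np.array_equal(values, restored_lossy))
print(np.allclose(values, restored_lossy, atol=16))
```

The verification function is thus part of the pipeline design, not an afterthought: it must match the loss characteristics of the codecs you chose.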
What to Observe/Learn:
- How does changing the data type affect the necessary codecs in the compression graph?
- What is the impact on the compression ratio? Do integers compress differently than floats?
- Does the verification method need to change based on whether the compression is lossless or lossy for integers?
Take your time, experiment, and don’t be afraid to consult the (hypothetical) `openzl_py` documentation for available codecs!
Common Pitfalls & Troubleshooting
- Incorrect Data Description:
  - Pitfall: A mismatch between the actual NumPy `dtype`/`shape` and what you provide in `tensor_description`. OpenZL relies heavily on this metadata.
  - Troubleshooting: Double-check that `list(original_tensor.shape)` and `str(original_tensor.dtype)` are correctly passed. If you change the tensor, ensure you update the description.
- Incompatible Codec Selection:
  - Pitfall: Using a codec designed for one data type (e.g., `FloatQuantize`) on another (e.g., `int64`). This will lead to incorrect compression/decompression or runtime errors.
  - Troubleshooting: Always ensure your chosen codecs are appropriate for the `dtype` specified in your data description. Refer to OpenZL’s codec documentation.
- Lossy vs. Lossless Verification:
  - Pitfall: Expecting perfect equality (`np.array_equal`) when using a lossy codec (like our `FloatQuantize`). This will incorrectly report a failed verification.
  - Troubleshooting: Understand whether your selected codecs introduce loss. For floating-point data with lossy codecs, `np.allclose` with an appropriate tolerance is the correct way to verify. For lossless compression of integers, `np.array_equal` is ideal.
- OpenZL Installation Issues:
  - Pitfall: The `openzl_py` package might not be installed correctly or might have version conflicts.
  - Troubleshooting: Ensure you’re using a virtual environment. Re-run `pip install openzl-py` and check for any error messages. Verify the package is listed when you run `pip freeze`.
Summary
Congratulations! You’ve successfully completed a practical project applying OpenZL to archive Machine Learning tensors. Let’s recap the key takeaways:
- ML tensors are structured data: they possess `shape` and `dtype` metadata that OpenZL can exploit.
- The data description is key: providing OpenZL with an accurate JSON description of your tensor’s structure (`type`, `shape`, `dtype`) is fundamental for effective compression.
- Compression graphs tailor performance: by defining a `compression_graph` with a sequence of specialized codecs (e.g., `FloatQuantize` for floats, `Zstd` for general byte compression), you can optimize for specific data characteristics.
- Format-aware compression pays off: OpenZL’s ability to understand data formats leads to superior compression ratios and performance compared to generic byte-stream compressors on structured data.
- Verification is crucial: always decompress and verify your data. For lossy compression, use tolerance-based comparisons like `np.allclose`.
This project demonstrates the power and flexibility of OpenZL in a critical domain like Machine Learning. In future chapters, we might explore integrating OpenZL into larger data pipelines, optimizing compression graphs for specific performance targets, or delving into more advanced data structures. Keep experimenting, and happy compressing!
This page is AI-assisted and reviewed. It references official documentation and recognized resources where relevant.
References
- OpenZL GitHub Repository: The primary source for the OpenZL framework, including build instructions and core concepts. https://github.com/facebook/openzl
- Introducing OpenZL: An Open Source Format-Aware Compression Framework: Engineering at Meta’s announcement blog post. https://engineering.fb.com/2025/10/06/developer-tools/openzl-open-source-format-aware-compression-framework/
- NumPy Official Documentation: Essential for understanding array creation, manipulation, and data types in Python. https://numpy.org/doc/stable/
- Zstandard (Zstd) GitHub Repository: Details on the Zstd compression algorithm, often used as a backend codec in OpenZL. https://github.com/facebook/zstd