Welcome back, aspiring data compression expert! In the previous chapters, we laid the groundwork for understanding OpenZL’s architecture and setting up our environment. Now, it’s time to dive into the heart of OpenZL: building and executing compression plans. This is where OpenZL truly shines, allowing us to leverage its format-aware capabilities for superior compression of structured data.

In this chapter, we’ll walk through the complete OpenZL workflow, from describing your data’s shape to training an optimized compression plan and then using it to compress and decompress your files. Understanding this workflow is crucial, as it’s the foundation for achieving the best possible compression ratios and speeds for your specific datasets. Get ready to put your knowledge into practice and see OpenZL in action!

The OpenZL Workflow: An Overview

OpenZL isn’t a “one-size-fits-all” compressor; it’s a framework that learns how to best compress your data. This learning process is captured in what OpenZL calls a “compression plan.” The overall process can be visualized as a clear, sequential flow:

flowchart TD
    A["Structured Data"] --> B{"Define Schema<br>(SDDL)"}
    B --> C["Sample Data"]
    C --> D["Train Compression Plan"]
    D --> E["Compression Plan"]
    E & A --> F["Compress Data"]
    F --> G["Compressed Data"]
    G & E --> H["Decompress Data"]
    H --> I["Original Data"]

As you can see, the journey begins with your data and ends with compressed (and decompressed) data. The critical intermediate steps involve defining your data’s structure and training a plan. Let’s break down these core concepts.

Core Concepts: SDDL and Compression Plans

OpenZL’s power stems from its ability to understand the structure of your data. This understanding is achieved through two main components: SDDL for describing the data, and Compression Plans for orchestrating the compression process.

What is SDDL? (Simple Data Description Language)

Imagine you have a box of LEGOs. To build something specific, you first need to know what pieces you have and how they fit together, right? SDDL serves a similar purpose for OpenZL.

SDDL (Simple Data Description Language) is a domain-specific language that allows you to describe the precise structure of your structured data. Instead of OpenZL trying to guess your data’s format (like a generic compressor), you explicitly tell it. This “format-awareness” is what enables OpenZL to apply highly specialized and effective codecs to different parts of your data.

Why is it important?

  • Targeted Compression: OpenZL can apply the most efficient codec for an int to an integer field, a different one for a float, and yet another for a string.
  • Improved Ratios: By understanding boundaries and types, OpenZL avoids common pitfalls of generic compressors that might treat a number as a string, leading to suboptimal results.
  • Lossless Guarantee: When you decompress, OpenZL knows exactly how to reconstruct the original data because it understands its structure, ensuring perfect lossless recovery.

Let’s look at a simple example. Imagine we’re tracking sensor readings, each with a timestamp, temperature, and humidity.

// sensor_data.sddl
struct SensorReading {
    timestamp: uint64;     // Unix timestamp in milliseconds
    temperature: float32;  // Temperature in Celsius
    humidity: float32;     // Relative humidity percentage
}

In this SDDL snippet:

  • struct SensorReading: We’re defining a composite data type named SensorReading. It’s like a blueprint for a single sensor record.
  • timestamp: uint64;: This line declares a field named timestamp that is an unsigned 64-bit integer. OpenZL will know to use codecs optimized for large integers.
  • temperature: float32;: This declares a temperature field as a 32-bit floating-point number. Again, specific floating-point codecs can be applied.
  • humidity: float32;: Another 32-bit float for humidity.

SDDL supports various primitive types (uint8, int16, float64, bool), composite types (struct, array), and more advanced constructs. It’s designed to be simple yet expressive enough to describe complex data layouts. You can find comprehensive documentation on SDDL at the OpenZL official site (as of early 2026).
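Because each SDDL type maps onto a fixed-width binary field, you can check the record layout this schema implies with Python's struct module (assuming, as this chapter does throughout, little-endian packing with no padding):

```python
import struct

# The schema's fields, packed contiguously with no padding:
# uint64 (8 bytes) + float32 (4 bytes) + float32 (4 bytes)
RECORD_FORMAT = "<Qff"  # '<' = little-endian, Q = uint64, f = float32
record_size = struct.calcsize(RECORD_FORMAT)
print(record_size)  # -> 16 bytes per SensorReading
```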

What are Compression Plans?

Once OpenZL understands your data’s structure via SDDL, it needs a strategy to compress it. This strategy is called a Compression Plan.

A Compression Plan in OpenZL is essentially a directed acyclic graph (DAG) of codecs. Think of it as a meticulously designed pipeline where different codecs (compression algorithms) are applied in a specific order to different parts of your data.

How are plans created? OpenZL doesn’t just pick a random plan. It trains a plan. This involves:

  1. Exploration: OpenZL, using your SDDL schema and a sample of your actual data, explores various combinations of available codecs and their parameters.
  2. Optimization: It evaluates these combinations based on your specified optimization goals (e.g., prioritize maximum compression ratio, prioritize fastest compression speed, or find a balance).
  3. Selection: It selects the “best” plan that meets your criteria for your specific data.

This training process is powerful because it tailors the compression strategy to your unique dataset, often outperforming generic compressors. The output of this training is a .plan file, which is a binary file containing the optimized graph of codecs.
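To build intuition for what training does, here is a toy sketch of the explore/evaluate/select loop. It uses Python's general-purpose codecs as stand-ins for OpenZL's codec graph; this illustrates the idea only and is not OpenZL's actual trainer or API:

```python
import bz2
import lzma
import zlib

# Candidate "plans": each maps sample bytes to compressed bytes.
CANDIDATES = {
    "zlib": lambda d: zlib.compress(d, 9),
    "bz2": lambda d: bz2.compress(d, 9),
    "lzma": lambda d: lzma.compress(d),
}

def train_toy_plan(sample: bytes) -> str:
    """Evaluate every candidate on the sample and pick the one with the
    lowest compressed-size ratio (the "best plan" for a ratio target)."""
    ratios = {name: len(codec(sample)) / len(sample)
              for name, codec in CANDIDATES.items()}
    return min(ratios, key=ratios.get)

sample = bytes(range(256)) * 64  # highly regular stand-in sample data
print(train_toy_plan(sample))
```

OpenZL's real trainer searches a far richer space (per-field codecs, parameters, graph shapes), but the principle is the same: measure candidates against representative samples and keep the winner.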

Why train a plan?

  • Optimal Performance: A plan is specifically tuned for your data, leading to better compression ratios or speeds than general-purpose algorithms.
  • Adaptability: Different datasets have different characteristics. Training allows OpenZL to adapt its strategy.
  • Reproducibility: Once a plan is trained, it can be used consistently across all your data, ensuring predictable results.

Step-by-Step Implementation: Compressing Sensor Data

Let’s get hands-on and implement the OpenZL workflow to compress our SensorReading data.

Scenario: We have a stream of sensor readings that we want to store efficiently. Each reading consists of a timestamp, temperature, and humidity.

Prerequisites: You should have OpenZL installed and configured as covered in Chapter 3. We’ll be using the openzl command-line tool.

Step 1: Define the Data Schema with SDDL

First, we need to tell OpenZL about our data’s structure.

  1. Create a file named sensor_data.sddl in your working directory.

  2. Add the following content to sensor_data.sddl:

    // sensor_data.sddl
    struct SensorReading {
        timestamp: uint64;     // Unix timestamp in milliseconds
        temperature: float32;  // Temperature in Celsius
        humidity: float32;     // Relative humidity percentage
    }
    
    // We'll be compressing a stream of these readings
    array<SensorReading> readings;
    

    Explanation:

    • The struct SensorReading defines the layout of a single sensor record, as discussed before.
    • The new line array<SensorReading> readings; is crucial. It tells OpenZL that our input data will be an array (a sequence) of SensorReading structs. This is a common pattern for time-series or tabular data.

Step 2: Prepare Sample Data

OpenZL needs some real data to “learn” from during the training process. This sample data should be representative of the data you intend to compress. For our example, let’s create a small binary file containing a few SensorReading records.

Since directly writing binary data can be tricky, we’ll use a Python script to generate a sample file that strictly adheres to our SDDL schema.

  1. Create a file named generate_sample_data.py in the same directory.

  2. Add the following Python code:

    # generate_sample_data.py
    import struct
    import time
    
    def generate_sensor_data(filename="sample_sensor_data.bin", num_records=100):
        """Generates a binary file with sample SensorReading data."""
        with open(filename, "wb") as f:
            for i in range(num_records):
                timestamp = int(time.time() * 1000) + i * 1000 # Milliseconds, increasing by 1 second
                temperature = 20.0 + (i % 10) * 0.5            # 20.0, 20.5, 21.0, ...
                humidity = 50.0 + (i % 7) * 1.5                # 50.0, 51.5, 53.0, ...
    
                # Pack data according to SDDL: uint64, float32, float32
                # 'Q' for unsigned long long (uint64), 'f' for float (float32)
                packed_data = struct.pack('<Qff', timestamp, temperature, humidity)
                f.write(packed_data)
        print(f"Generated {num_records} sensor records to {filename}")
    
    if __name__ == "__main__":
        generate_sensor_data()
    

    Explanation:

    • We use the struct module to pack our Python numbers into binary formats that match our SDDL types (<Qff means little-endian, unsigned long long, float, float).
    • The script generates num_records (default 100) of plausible sensor data, with timestamp increasing and temperature/humidity showing some variation.
  3. Run the Python script from your terminal:

    python generate_sample_data.py
    

    This will create a file named sample_sensor_data.bin containing 100 sensor records. This is our sample data for training.
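Before training, it is worth sanity-checking that the generated bytes really follow the schema. This self-contained sketch rebuilds the records in memory (rather than reading the file) and verifies the 16-byte layout round-trips through struct:

```python
import struct
import time

RECORD_FORMAT = "<Qff"
base_ts = int(time.time() * 1000)

# Same value scheme as generate_sample_data.py
data = b"".join(
    struct.pack(RECORD_FORMAT,
                base_ts + i * 1000,       # timestamp
                20.0 + (i % 10) * 0.5,    # temperature
                50.0 + (i % 7) * 1.5)     # humidity
    for i in range(100)
)

# 100 records x 16 bytes = 1600 bytes, and the first record unpacks
# to exactly the values we packed (20.0 and 50.0 are float32-exact).
assert len(data) == 100 * struct.calcsize(RECORD_FORMAT)
ts, temp, hum = struct.unpack_from(RECORD_FORMAT, data, 0)
assert (ts, temp, hum) == (base_ts, 20.0, 50.0)
```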

Step 3: Train a Compression Plan

Now that we have our data schema (SDDL) and sample data, we can train OpenZL to create an optimal compression plan.

  1. Run the openzl train command:

    openzl train \
        --sddl sensor_data.sddl \
        --input sample_sensor_data.bin \
        --output sensor_data.plan \
        --target-metric ratio \
        --max-time 60s
    

    Explanation of arguments:

    • openzl train: The command to initiate plan training.
    • --sddl sensor_data.sddl: Specifies our SDDL schema file. OpenZL uses this to understand the data structure.
    • --input sample_sensor_data.bin: Provides the sample data for OpenZL to analyze and optimize against.
    • --output sensor_data.plan: The path where the generated compression plan will be saved.
    • --target-metric ratio: Tells OpenZL to prioritize the best compression ratio. You could also use speed for faster compression or balanced for a compromise.
    • --max-time 60s: Limits the training process to a maximum of 60 seconds. Training can be computationally intensive, so setting a time limit is good practice. For real-world scenarios, you might allow it to run longer for better optimization.
  2. Observe the output: OpenZL will print progress messages as it explores different codec combinations and evaluates them. After some time (up to 60 seconds), it will report the best plan found and save it to sensor_data.plan.

    You should see output similar to (details will vary):

    INFO: Training started...
    INFO: Exploring codec combinations...
    INFO: Best plan found (ratio: 0.15, speed: 1234 MB/s)
    INFO: Plan saved to sensor_data.plan
    

    Congratulations! You’ve just trained your first OpenZL compression plan.

Step 4: Compress Data using the Plan

With our sensor_data.plan in hand, we can now compress actual data. Let’s compress our sample_sensor_data.bin file using the plan we just created.

  1. Run the openzl compress command:

    openzl compress \
        --input sample_sensor_data.bin \
        --plan sensor_data.plan \
        --output compressed_sensor_data.zl
    

    Explanation of arguments:

    • openzl compress: The command to perform compression.
    • --input sample_sensor_data.bin: The data file we want to compress.
    • --plan sensor_data.plan: The pre-trained compression plan to use.
    • --output compressed_sensor_data.zl: The path where the compressed output will be saved. The .zl extension is a common convention for OpenZL compressed files.

    OpenZL will quickly process the file and save the compressed version. Compare the size of sample_sensor_data.bin with compressed_sensor_data.zl. You should see a significant reduction!
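To put that reduction in context, you can measure what a generic byte-oriented compressor achieves on the same kind of data. This sketch regenerates the sample records in memory and uses Python's zlib as the baseline to compare your OpenZL result against:

```python
import struct
import time
import zlib

# Recreate the 100 sample records in memory (same value scheme as
# generate_sample_data.py), then measure a generic-compressor baseline.
base_ts = int(time.time() * 1000)
data = b"".join(
    struct.pack("<Qff",
                base_ts + i * 1000,       # timestamp
                20.0 + (i % 10) * 0.5,    # temperature
                50.0 + (i % 7) * 1.5)     # humidity
    for i in range(100)
)

compressed = zlib.compress(data, 9)
print(f"generic zlib ratio: {len(compressed) / len(data):.2f}")
```

A format-aware plan typically beats this baseline because it compresses each field stream with a codec suited to its type, rather than treating the whole file as opaque bytes.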

Step 5: Decompress Data

The final step in our workflow is to decompress the data back to its original form, verifying that our lossless compression worked perfectly.

  1. Run the openzl decompress command:

    openzl decompress \
        --input compressed_sensor_data.zl \
        --plan sensor_data.plan \
        --output decompressed_sensor_data.bin
    

    Explanation of arguments:

    • openzl decompress: The command to perform decompression.
    • --input compressed_sensor_data.zl: The compressed file.
    • --plan sensor_data.plan: The same plan used for compression is required for decompression. This is how OpenZL knows how to reverse the process.
    • --output decompressed_sensor_data.bin: The path where the decompressed output will be saved.
  2. Verify the data: To confirm lossless compression, you can compare the original sample_sensor_data.bin with the decompressed_sensor_data.bin file. They should be byte-for-byte identical. On Linux/macOS, you can use diff:

    diff sample_sensor_data.bin decompressed_sensor_data.bin
    

    If diff returns no output, the files are identical! Mission accomplished.
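If diff is unavailable, a hash comparison works on any platform. A small sketch using hashlib (the commented file names are the ones used in this chapter):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Identical digests mean byte-for-byte identical files:
# sha256_of("sample_sensor_data.bin") == sha256_of("decompressed_sensor_data.bin")
```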

Mini-Challenge: Expanding Your Schema

You’ve successfully compressed and decompressed data using OpenZL’s core workflow. Now, let’s test your understanding with a small modification.

Challenge: Imagine our sensor also started reporting pressure as a 32-bit floating-point number.

  1. Modify sensor_data.sddl to include a pressure: float32; field within the SensorReading struct.
  2. Modify generate_sample_data.py to include a pressure value (e.g., 1013.25 + (i % 5) * 0.1) and pack it correctly into the binary data (remember to add another 'f' to the struct.pack format string).
  3. Re-generate sample_sensor_data.bin using the updated Python script.
  4. Re-train a new compression plan (you can overwrite sensor_data.plan or save it as sensor_data_v2.plan).
  5. Re-compress and decompress the new sample_sensor_data.bin using your new plan.
  6. Verify the decompressed data.

Hint: Pay close attention to the order of fields in your SDDL and how you pack them in Python. They must match exactly!
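For step 2, the pack format simply grows by one 'f'. A sketch of the updated call (the literal values here are placeholders; the field order must match your updated SDDL exactly):

```python
import struct

# uint64 timestamp, float32 temperature, float32 humidity, float32 pressure
UPDATED_FORMAT = "<Qfff"

packed = struct.pack(UPDATED_FORMAT, 1700000000000, 21.5, 53.0, 1013.25)
print(struct.calcsize(UPDATED_FORMAT))  # -> 20 bytes per record (was 16)
```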

What to Observe/Learn: This exercise reinforces the direct relationship between your data’s structure, its SDDL definition, and the training process. You’ll see how OpenZL adapts its plan when the underlying data schema changes, highlighting its flexibility.

Common Pitfalls & Troubleshooting

Even with a clear workflow, you might encounter issues. Here are a few common pitfalls:

  1. SDDL-Data Mismatch:

    • Problem: Your sample_sensor_data.bin does not actually conform to the structure defined in sensor_data.sddl. For example, you defined uint64 but packed a uint32, or you forgot a field in your Python script.
    • Symptom: openzl train or openzl compress might fail with “data parsing error,” “schema mismatch,” or “unexpected end of input.”
    • Solution: Double-check your SDDL file and your data generation script (e.g., generate_sample_data.py). Ensure every field’s type and order in the binary data exactly matches the SDDL. Use struct format codes carefully.
  2. Insufficient or Non-Representative Sample Data:

    • Problem: Your sample_sensor_data.bin is too small, or it doesn’t contain the full range of values/patterns present in your real data.
    • Symptom: The trained plan might yield poor compression ratios on your actual, larger dataset, or it might be slower than expected.
    • Solution: Provide a larger, diverse sample of your real data for training. If your data has distinct phases or outliers, try to include examples of these in your training set.
  3. Training Time vs. Quality:

    • Problem: You set --max-time too low, and OpenZL couldn’t find an optimal plan. Or you let it run too long for minimal gain.
    • Symptom: Suboptimal compression ratio/speed, or training takes an excessively long time.
    • Solution: Experiment with --max-time. For critical applications, allow more time for training on a powerful machine. For less critical data, a shorter training might be acceptable. Monitor the output of openzl train to see if the “best plan found” metrics are still improving significantly towards the end of the allocated time.
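For the first pitfall, one cheap automated check is that the file size must be an exact multiple of the packed record size; any remainder means the binary data and the schema disagree somewhere. A sketch, with RECORD_FORMAT mirroring our sensor_data.sddl:

```python
import os
import struct

RECORD_FORMAT = "<Qff"  # must mirror sensor_data.sddl field-for-field

def check_alignment(path: str) -> int:
    """Return the record count, or raise if the file size is not a
    whole multiple of the packed record size."""
    size = os.path.getsize(path)
    record_size = struct.calcsize(RECORD_FORMAT)
    if size % record_size != 0:
        raise ValueError(
            f"{path}: {size} bytes is not a multiple of {record_size}; "
            "the binary data likely does not match the SDDL schema")
    return size // record_size
```

This catches missing fields and wrong widths early, before a training run fails with a less obvious parsing error.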

Summary

In this chapter, you’ve gained a deep understanding of the core OpenZL workflow and put it into practice:

  • SDDL (Simple Data Description Language): You learned how to precisely describe the structure of your data, enabling OpenZL’s format-aware compression.
  • Compression Plans: You understood that these are optimized DAGs of codecs, tailored to your specific data through a training process.
  • Hands-on Workflow: You successfully defined an SDDL schema, generated sample data, trained a compression plan, and then used it to compress and decompress binary sensor data.
  • Troubleshooting: You’re now aware of common issues like SDDL-data mismatches and how to address them.

You’ve built a solid foundation for using OpenZL effectively. In the next chapter, we’ll explore more advanced SDDL features and delve deeper into how OpenZL integrates with existing systems, opening up even more possibilities for efficient data handling!
