Welcome back, compression explorers! In our journey through OpenZL, we’ve seen how it intelligently uses existing codecs and compression plans to optimize data storage. But what happens when your data is truly unique, with patterns that generic codecs might miss? Or when you have specific performance or compression ratio goals that require a tailor-made solution?

That’s precisely what we’ll tackle in this chapter: creating custom codecs. You’ll learn how to extend OpenZL’s capabilities by writing your own compression and decompression logic, allowing you to fine-tune the framework for your most specialized datasets. This is where OpenZL truly shines as a framework, not just a collection of compressors.

By the end of this chapter, you’ll understand the core components of a custom codec, how to define your data’s structure using OpenZL’s Simple Data Description Language (SDDL), and how to integrate your bespoke codec into a compression plan. Get ready to put on your architect’s hat – this is where you get to truly shape your compression strategy!

Before we dive in, make sure you’re comfortable with OpenZL’s basic concepts, like compression plans and how data flows through stages, which we covered in previous chapters.

Core Concepts: Building Your Own Compression Logic

At its heart, OpenZL views compression as a graph of transformations, where each node is a “codec” acting on data. When you create a custom codec, you’re essentially adding a new, specialized node to this graph, equipped with its own unique way of processing data.

Why Custom Codecs?

OpenZL comes with a suite of highly optimized, general-purpose codecs. So, why bother writing your own?

  • Unique Data Patterns: Your data might have very specific, domain-specific redundancies that a generic compressor wouldn’t recognize. Think about sensor readings that often increment by small, predictable amounts, or image data with repeating pixel blocks.
  • Performance Tuning: You might need to prioritize compression speed over ratio, or vice-versa, for a specific data type, beyond what standard codecs offer.
  • Integration with Existing Algorithms: Perhaps you already have a specialized, efficient compression algorithm developed for your specific application that you want to integrate into the OpenZL ecosystem.

The Role of SDDL (Simple Data Description Language)

When dealing with custom codecs, especially for structured data, OpenZL’s Simple Data Description Language (SDDL) becomes indispensable. SDDL is a declarative language used to describe the schema of your structured data. It’s not just documentation; it’s executable, allowing OpenZL to understand the layout of your data, enabling it to:

  1. Parse and Serialize: SDDL helps OpenZL parse raw bytes into structured objects and serialize structured objects back into bytes, which is crucial for codecs that operate on specific data fields.
  2. Guide Compression Plans: OpenZL can use SDDL definitions to automatically infer optimal compression plans or to help you manually map codecs to specific fields within your data structure.

Think of SDDL as the blueprint for your data. Just as a carpenter can build a complex structure much more efficiently with a detailed blueprint, OpenZL can apply targeted compression when it has a clear understanding of your data’s internal organization.

The Custom Codec Interface

To create a custom codec, you’ll typically implement a specific interface provided by OpenZL. This interface usually requires at least two core methods:

  • compress(data: bytes) -> bytes: This method takes raw input bytes, applies your custom compression logic, and returns the compressed bytes.
  • decompress(compressed_data: bytes) -> bytes: This method takes the bytes previously compressed by your compress method and reconstructs the original raw bytes.

It’s critical that your decompress method perfectly reverses the compress operation to ensure lossless recovery of your data!
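The interface described above can be sketched as a minimal abstract base class. Note that this is a hypothetical sketch for illustration; the actual class provided by OpenZL’s Python bindings may differ in name and signature:

```python
from abc import ABC, abstractmethod

class BaseCodec(ABC):
    """Hypothetical sketch of the codec interface described above."""

    def __init__(self, codec_id: str):
        # Unique name used to refer to this codec in compression plans.
        self.codec_id = codec_id

    @abstractmethod
    def compress(self, data: bytes) -> bytes:
        """Transform raw input bytes into a (hopefully smaller) encoded form."""

    @abstractmethod
    def decompress(self, compressed_data: bytes) -> bytes:
        """Exactly reverse compress(), reconstructing the original bytes."""
```

Any concrete codec then subclasses this and fills in the two methods, which is the pattern we follow for the rest of this chapter.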

The Custom Codec Development Workflow

Let’s visualize the steps involved in creating and integrating a custom codec:

flowchart TD
    A[Identify Unique Data Structure] --> B{Define Data with SDDL};
    B --> C[Implement Codec Interface: compress];
    C --> D[Implement Codec Interface: decompress];
    D --> E[Register Custom Codec with OpenZL];
    E --> F[Integrate into Compression Plan];
    F --> G[Achieve Optimal Compression];
  1. Identify Unique Data Structure: Pinpoint the specific data format or patterns you want to optimize.
  2. Define Data with SDDL: Describe your data’s schema using SDDL, if it’s structured. This tells OpenZL how to understand your data.
  3. Implement compress(): Write the logic to transform raw input bytes into a more compact form.
  4. Implement decompress(): Write the logic to reconstruct the original bytes from your compressed data.
  5. Register Custom Codec: Make your new codec known to the OpenZL framework.
  6. Integrate into Compression Plan: Include your custom codec as a stage in an OpenZL compression plan.
  7. Achieve Optimal Compression: Observe the benefits of your tailored compression strategy!

Step-by-Step Implementation: A Delta Encoding Codec

Let’s walk through building a simple custom codec in Python, focusing on a common pattern for time-series data: delta encoding. We’ll compress a sequence of floating-point numbers where consecutive values often have small differences.

For this example, we’ll assume you have OpenZL’s Python bindings installed via pip. The package name and version below are illustrative, so check the official OpenZL documentation for the current installation instructions:

pip install openzl-py  # illustrative package name; see the OpenZL docs

Step 1: Define Your Data with SDDL (Conceptual)

While our custom codec will directly operate on a stream of raw floats for simplicity, it’s good practice to understand how SDDL would describe such data. Imagine we’re compressing a stream of SensorReading objects, each containing a timestamp and a float value.

Here’s how you might define a SensorReading and a sequence of them using SDDL:

// sensor_data.sddl
// This file would conceptually describe the data your application uses.

struct SensorReading {
    u64 timestamp; // Unix timestamp in milliseconds
    f32 value;     // The actual sensor reading (e.g., temperature)
}

// A collection of sensor readings over time
sequence<SensorReading> SensorReadingsSequence;

In a more advanced scenario, OpenZL could use this SDDL to automatically feed the value field (as a sequence<f32>) directly to our custom codec. For now, we’ll manually prepare a bytes stream of floats.
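Preparing that stream is a one-liner with Python’s struct module, and it is the same packing we’ll use again in the full example later in this chapter:

```python
import struct

values = [21.5, 21.6, 21.4]

# '<f' packs each value as a 4-byte little-endian f32,
# so the stream is simply the concatenation of those 4-byte chunks.
stream = b''.join(struct.pack('<f', v) for v in values)

print(len(stream))  # 12 bytes: 3 floats x 4 bytes each
```

This bytes object is exactly the kind of raw input our custom codec’s compress method will receive.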

Step 2: Implement the Custom Codec

We’ll create a DeltaFloatCodec that applies delta encoding to a stream of f32 (32-bit floating-point) values. Delta encoding stores the difference between consecutive values instead of the values themselves. If values change slowly, these differences (deltas) are often small and can be compressed more effectively.
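To make the idea concrete before wiring it into a codec, here is delta encoding on a short list of plain Python floats, with no OpenZL involved:

```python
values = [100.0, 100.5, 101.0, 100.75]

# Encode: store each value's difference from the previous one
# (the first delta is taken relative to an initial 0.0).
prev = 0.0
deltas = []
for v in values:
    deltas.append(v - prev)
    prev = v
print(deltas)  # [100.0, 0.5, 0.5, -0.25]

# Decode: a running sum of the deltas recovers the original values.
prev = 0.0
restored = []
for d in deltas:
    prev += d
    restored.append(prev)
assert restored == values
```

Notice how, after the first entry, the deltas are much smaller in magnitude than the values themselves; that is the redundancy our codec will expose.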

Let’s start by defining the basic structure of our codec class. We’ll assume OpenZL provides a BaseCodec class for us to inherit from.

import struct
import openzl
from openzl.codec import BaseCodec # Assuming this interface exists

# Define our custom codec for a sequence of floating-point numbers
class DeltaFloatCodec(BaseCodec):
    """
    A custom OpenZL codec that applies simple delta encoding to a stream of f32 values.
    It stores the difference from the previous value.
    """
    def __init__(self, codec_id: str = "my_delta_float_codec"):
        # Every OpenZL codec needs a unique identifier.
        # We pass it to the base class constructor.
        super().__init__(codec_id)
        # Initialize state for delta encoding.
        # This will store the last encountered value during compression/decompression.
        self.previous_value = 0.0

    def compress(self, raw_floats_bytes: bytes) -> bytes:
        # Placeholder for compression logic
        print(f"[{self.codec_id}] Compressing {len(raw_floats_bytes)} bytes...")
        return raw_floats_bytes # For now, just pass through

Explanation:

  • We import struct for converting between Python floats and raw bytes, which is essential for low-level byte manipulation.
  • openzl.codec.BaseCodec is our assumed base class, providing the necessary interface for OpenZL to recognize our codec.
  • The __init__ method takes a codec_id (e.g., "my_delta_float_codec"), which must be unique. This ID is how you’ll refer to your codec in compression plans.
  • self.previous_value is initialized to 0.0. This will be our “state” for delta encoding, keeping track of the last float processed.

Now, let’s add the actual delta encoding logic to the compress method.

import struct
import openzl
from openzl.codec import BaseCodec

class DeltaFloatCodec(BaseCodec):
    def __init__(self, codec_id: str = "my_delta_float_codec"):
        super().__init__(codec_id)
        self.previous_value = 0.0

    def compress(self, raw_floats_bytes: bytes) -> bytes:
        print(f"[{self.codec_id}] Compressing {len(raw_floats_bytes)} bytes...")
        compressed_data = bytearray() # Use bytearray for efficient appending

        # Reset previous_value for each compression call
        # This is important if the codec is reused for separate blocks of data.
        self.previous_value = 0.0

        # Iterate through the input bytes, assuming each f32 is 4 bytes (little-endian)
        for i in range(0, len(raw_floats_bytes), 4):
            # Extract 4 bytes representing a float
            current_float_bytes = raw_floats_bytes[i:i+4]
            # Unpack the bytes into a Python float (little-endian '<f')
            current_value = struct.unpack('<f', current_float_bytes)[0]

            # Calculate the delta from the previous value
            delta = current_value - self.previous_value

            # Pack the delta back into 4 bytes and append to our compressed data
            compressed_data.extend(struct.pack('<f', delta))

            # Update the previous_value for the next iteration
            self.previous_value = current_value

        return bytes(compressed_data) # Convert bytearray to immutable bytes

Explanation of compress:

  • We use a bytearray to build the compressed data efficiently, converting it to immutable bytes at the end.
  • self.previous_value is reset to 0.0 at the start of compress. This ensures that each call to compress on a new block of data starts with a clean slate, preventing state leakage between different compression operations.
  • The loop iterates 4 bytes at a time, as f32 (single-precision float) occupies 4 bytes.
  • struct.unpack('<f', ...)[0] converts 4 bytes into a float. The < specifies little-endian byte order, and f specifies a float.
  • delta is calculated as current_value - self.previous_value.
  • struct.pack('<f', delta) converts the delta float back into 4 bytes.
  • self.previous_value is updated to current_value for the next delta calculation.

Next, we need the decompress method to reverse this process:

import struct
import openzl
from openzl.codec import BaseCodec

class DeltaFloatCodec(BaseCodec):
    def __init__(self, codec_id: str = "my_delta_float_codec"):
        super().__init__(codec_id)
        self.previous_value = 0.0

    def compress(self, raw_floats_bytes: bytes) -> bytes:
        print(f"[{self.codec_id}] Compressing {len(raw_floats_bytes)} bytes...")
        compressed_data = bytearray()
        self.previous_value = 0.0

        for i in range(0, len(raw_floats_bytes), 4):
            current_float_bytes = raw_floats_bytes[i:i+4]
            current_value = struct.unpack('<f', current_float_bytes)[0]
            delta = current_value - self.previous_value
            compressed_data.extend(struct.pack('<f', delta))
            self.previous_value = current_value

        return bytes(compressed_data)

    def decompress(self, compressed_deltas_bytes: bytes) -> bytes:
        print(f"[{self.codec_id}] Decompressing {len(compressed_deltas_bytes)} bytes...")
        decompressed_data = bytearray()
        # Reset previous_value for decompression, just like compression
        self.previous_value = 0.0

        # Iterate through the compressed deltas
        for i in range(0, len(compressed_deltas_bytes), 4):
            # Extract 4 bytes representing a delta float
            current_delta_bytes = compressed_deltas_bytes[i:i+4]
            # Unpack the bytes into a Python float
            delta = struct.unpack('<f', current_delta_bytes)[0]

            # Reconstruct the original value: original = previous + delta
            original_value = self.previous_value + delta

            # Pack the reconstructed value and append
            decompressed_data.extend(struct.pack('<f', original_value))

            # Update previous_value for the next reconstruction
            self.previous_value = original_value

        return bytes(decompressed_data)

Explanation of decompress:

  • Similar to compress, self.previous_value is reset to 0.0 to ensure correct state for the start of decompression.
  • The loop unpacks each 4-byte delta value.
  • The original_value is reconstructed by adding the delta to the self.previous_value.
  • This original_value is then packed back into bytes and added to decompressed_data.
  • self.previous_value is updated to the original_value for the next iteration.

Step 3: Register the Custom Codec with OpenZL

Now that our DeltaFloatCodec is ready, we need to tell OpenZL about it. This is typically done by registering an instance of your codec with OpenZL’s global registry.

# Assuming the DeltaFloatCodec class is defined above
# ... (DeltaFloatCodec class definition) ...

# Create an instance of our custom codec
my_custom_codec = DeltaFloatCodec("my_delta_float_codec")

# Register it with OpenZL
openzl.register_codec(my_custom_codec)

print(f"Custom codec '{my_custom_codec.codec_id}' registered successfully.")

Explanation:

  • openzl.register_codec() is a hypothetical function (but common in such frameworks) that adds your codec to OpenZL’s internal list of available codecs, making it discoverable by the CompressionPlanBuilder.

Step 4: Use the Custom Codec in a Compression Plan

Finally, let’s put our custom codec to work! We’ll generate some sample data, create a compression plan that includes my_delta_float_codec, and then compress and decompress the data.

import random
import struct
import openzl
from openzl.codec import BaseCodec

# Ensure our DeltaFloatCodec is defined and registered before this block
# (Copy-paste the DeltaFloatCodec class and registration code here for a runnable script)
class DeltaFloatCodec(BaseCodec):
    def __init__(self, codec_id: str = "my_delta_float_codec"):
        super().__init__(codec_id)
        self.previous_value = 0.0
    def compress(self, raw_floats_bytes: bytes) -> bytes:
        print(f"[{self.codec_id}] Compressing {len(raw_floats_bytes)} bytes...")
        compressed_data = bytearray()
        self.previous_value = 0.0
        for i in range(0, len(raw_floats_bytes), 4):
            current_float_bytes = raw_floats_bytes[i:i+4]
            current_value = struct.unpack('<f', current_float_bytes)[0]
            delta = current_value - self.previous_value
            compressed_data.extend(struct.pack('<f', delta))
            self.previous_value = current_value
        return bytes(compressed_data)
    def decompress(self, compressed_deltas_bytes: bytes) -> bytes:
        print(f"[{self.codec_id}] Decompressing {len(compressed_deltas_bytes)} bytes...")
        decompressed_data = bytearray()
        self.previous_value = 0.0
        for i in range(0, len(compressed_deltas_bytes), 4):
            current_delta_bytes = compressed_deltas_bytes[i:i+4]
            delta = struct.unpack('<f', current_delta_bytes)[0]
            original_value = self.previous_value + delta
            decompressed_data.extend(struct.pack('<f', original_value))
            self.previous_value = original_value
        return bytes(decompressed_data)

# Register the codec
my_custom_codec = DeltaFloatCodec("my_delta_float_codec")
openzl.register_codec(my_custom_codec)
print(f"Custom codec '{my_custom_codec.codec_id}' registered successfully.")


# 1. Generate Sample Data
# Let's create a sequence of floats that gradually increase, then jump, then gradually increase again.
raw_sensor_values = []
current_base = 100.0
for _ in range(20): # First segment: small changes
    raw_sensor_values.append(current_base + random.uniform(-0.1, 0.1))
    current_base = raw_sensor_values[-1] # Next value is close to previous

current_base = 150.0 # A significant jump
for _ in range(20): # Second segment: small changes around new base
    raw_sensor_values.append(current_base + random.uniform(-0.1, 0.1))
    current_base = raw_sensor_values[-1]


# Convert the list of floats into a single bytes object
# This simulates the raw data stream OpenZL would pass to a codec.
raw_bytes_to_compress = b''.join([struct.pack('<f', val) for val in raw_sensor_values])

print(f"\nOriginal data size: {len(raw_bytes_to_compress)} bytes")
print(f"Original values (first 5): {[round(v, 2) for v in raw_sensor_values[:5]]}...")
print(f"Original values (last 5): {[round(v, 2) for v in raw_sensor_values[-5:]]}...")


# 2. Create a Compression Plan using our custom codec
# We add our custom codec as a stage.
# In a real scenario, you might combine it with other codecs (e.g., LZ4 after delta encoding).
plan = openzl.CompressionPlanBuilder() \
            .add_stage("my_delta_float_codec") \
            .build()

print(f"\nCreated compression plan with stages: {[s.codec_id for s in plan.stages]}")


# 3. Compress the data
compressed_data = plan.compress(raw_bytes_to_compress)
print(f"Compressed data size: {len(compressed_data)} bytes")


# 4. Decompress the data
decompressed_bytes = plan.decompress(compressed_data)


# 5. Verify the results
# Convert decompressed bytes back to floats for comparison
decompressed_values = [struct.unpack('<f', decompressed_bytes[i:i+4])[0]
                       for i in range(0, len(decompressed_bytes), 4)]

print(f"Decompressed values (first 5): {[round(v, 2) for v in decompressed_values[:5]]}...")
print(f"Decompressed values (last 5): {[round(v, 2) for v in decompressed_values[-5:]]}...")

# Check for faithful reconstruction. We compare within a small tolerance:
# because each delta is rounded to f32 before being re-added, the round trip
# is not guaranteed to be bit-exact, only very close.
is_match = all(abs(o - d) < 1e-6 for o, d in zip(raw_sensor_values, decompressed_values))

if is_match:
    print("\nCompression and decompression successful! Data reconstructed within tolerance.")
else:
    print("\nError: Data mismatch after compression and decompression!")

# You'll notice that for this simple float delta encoding, the compressed size
# is often *the same* as the original size, because we're still storing
# each delta as a full 4-byte float. The real benefit comes when you
# further optimize the encoding of these deltas (e.g., variable-length integers, quantization),
# which is a great lead-in to our mini-challenge!

Explanation:

  • We generate raw_sensor_values that simulate sensor data with gradual changes.
  • These floats are packed into a bytes object using struct.pack, mimicking the raw data OpenZL would typically handle.
  • openzl.CompressionPlanBuilder().add_stage("my_delta_float_codec").build() creates a plan that uses our custom codec. If you had an SDDL definition, you could also map this codec to a specific field (e.g., plan.map_codec("SensorReading.value", "my_delta_float_codec")).
  • The plan.compress() and plan.decompress() methods orchestrate the compression/decompression using our registered codec.
  • Finally, we unpack the decompressed bytes and compare them to the original values (within a small floating-point tolerance) to confirm faithful reconstruction.

You’ll likely observe that the “compressed” data size is the same as the original. This is because our DeltaFloatCodec currently stores each delta as a full 4-byte float. While conceptually delta encoding is applied, we haven’t actually reduced the bit-width of the deltas. This is where the real art of custom codec design comes in!

Mini-Challenge: Smarter Delta Encoding

Your current DeltaFloatCodec correctly performs delta encoding, but it doesn’t actually reduce the data size because it stores each delta as a full f32.

Challenge: Enhance the DeltaFloatCodec to truly save space. Modify it to:

  1. Quantize Small Deltas: If a delta value is very small (e.g., abs(delta) < 0.01), quantize it and represent it as a more compact integer type (e.g., an i16 - 2-byte integer) after scaling.
  2. Flag Data Type: Prepend each delta with a single “type byte” to indicate whether the following data is a full f32 or a quantized i16. This allows the decompressor to correctly interpret the stream.

Hint:

  • You’ll need to define a QUANTIZATION_FACTOR (e.g., 1000) and a SMALL_DELTA_THRESHOLD (e.g., 0.01).
  • When a delta is small, calculate int(delta * QUANTIZATION_FACTOR).
  • Choose a byte value (e.g., 0x00) to signal an i16 delta and another (e.g., 0x01) for an f32 delta.
  • Remember to handle struct.pack and struct.unpack for i16 (<h for short integer) and f32.
  • The decompress method must mirror the compress method’s encoding precisely.
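As a starting point (deliberately not a full solution), the per-delta branching from the hint might look like the sketch below. QUANTIZATION_FACTOR, SMALL_DELTA_THRESHOLD, and the tag bytes are the hypothetical names suggested by the hint, not part of any OpenZL API:

```python
import struct

QUANTIZATION_FACTOR = 1000
SMALL_DELTA_THRESHOLD = 0.01
TAG_I16 = b'\x00'  # type byte: next 2 bytes are a quantized i16 delta
TAG_F32 = b'\x01'  # type byte: next 4 bytes are a full f32 delta

def encode_delta(delta: float) -> bytes:
    """Encode one delta as 3 bytes (tag + i16) when small, else 5 bytes (tag + f32)."""
    if abs(delta) < SMALL_DELTA_THRESHOLD:
        # Lossy path: value is rounded to a multiple of 1/QUANTIZATION_FACTOR.
        # The threshold guarantees the scaled delta fits comfortably in an i16.
        return TAG_I16 + struct.pack('<h', int(delta * QUANTIZATION_FACTOR))
    return TAG_F32 + struct.pack('<f', delta)

print(len(encode_delta(0.005)))  # 3 bytes instead of 4
print(len(encode_delta(2.5)))    # 5 bytes (one byte of tag overhead)
```

Your decompress side would read the tag byte first, then unpack either 2 or 4 bytes accordingly and, for the i16 path, divide by QUANTIZATION_FACTOR. Note that the quantized path is lossy by design, which is part of the trade-off the challenge asks you to evaluate.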

What to Observe/Learn:

  • How does the compressed_data size change with this improvement?
  • What is the trade-off between compression ratio and code complexity?
  • How important is precise byte-level control for effective compression?

Common Pitfalls & Troubleshooting

Creating custom codecs can be tricky due to the low-level byte manipulation involved. Here are some common issues:

  1. State Management Errors:
    • Problem: For stateful codecs like our DeltaFloatCodec, forgetting to reset self.previous_value at the start of compress or decompress can lead to incorrect results if the codec instance is reused across multiple, unrelated data blocks.
    • Solution: Always ensure your codec’s internal state is properly initialized or reset at the beginning of each compress and decompress call, unless your design explicitly requires state to persist across blocks (which is more complex).
  2. Byte Order (Endianness) Mismatches:
    • Problem: If you pack data as little-endian (<f) but unpack it as big-endian (>f), or vice-versa, your data will be corrupted.
    • Solution: Be consistent with the byte order specified in struct.pack and struct.unpack. Little-endian (<) is common for modern systems, but always verify.
  3. Off-by-One Errors in Byte Processing:
    • Problem: When iterating through byte streams (e.g., range(0, len(data), 4)), it’s easy to miscalculate indices, leading to incomplete or incorrect parsing of data units.
    • Solution: Double-check your loop bounds and slicing logic. Test with edge cases (empty data, single data unit).
  4. SDDL-Codec Mismatch:
    • Problem: If you’re using SDDL to describe your data, but your custom codec expects a different data format or structure than what the SDDL-driven OpenZL parser provides, you’ll encounter errors.
    • Solution: Ensure the data format your codec expects (e.g., a stream of f32s) aligns with how OpenZL would extract or provide that data based on your SDDL definition. For direct bytes codecs, this is less of an issue, but for field-specific codecs, it’s critical.
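The endianness pitfall above is easy to demonstrate in isolation: packing a float little-endian and unpacking it big-endian silently produces garbage, with no exception to warn you:

```python
import struct

payload = struct.pack('<f', 1.0)          # little-endian f32: b'\x00\x00\x80?'
right = struct.unpack('<f', payload)[0]   # matching byte order
wrong = struct.unpack('>f', payload)[0]   # misread as big-endian

print(right)  # 1.0
print(wrong)  # a tiny denormal value, nothing like 1.0
```

Because no error is raised, mismatches like this usually surface only as corrupted values far downstream, which is why consistency in every pack/unpack call matters.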

Summary

Phew! You’ve just taken a significant step in mastering OpenZL by learning how to craft your own custom codecs. Let’s recap the key takeaways:

  • Custom codecs allow you to tailor OpenZL’s compression capabilities to unique data structures and specific performance requirements.
  • SDDL (Simple Data Description Language) is crucial for defining structured data schemas, guiding OpenZL in parsing and organizing data for your codecs.
  • You implement a custom codec by creating a class that adheres to OpenZL’s BaseCodec interface, specifically implementing the compress() and decompress() methods.
  • State management within your codec, especially for stateful techniques like delta encoding, requires careful attention to ensure consistent and correct operation.
  • Once implemented, your custom codec must be registered with OpenZL and then can be integrated into any compression plan, just like built-in codecs.
  • Low-level byte manipulation with tools like Python’s struct module is often necessary, requiring precision in handling byte order and data types.

In the next chapter, we’ll explore advanced compression plans and how to combine multiple codecs, including your custom creations, to achieve multi-stage, highly optimized compression pipelines.

