Introduction

Welcome back, intrepid tester! So far, we’ve explored the foundational concepts of Testcontainers and used them to test single-service applications in various languages. But what about testing more complex systems, like the beating heart of many modern applications: a data pipeline?

In this chapter, we’re going to tackle a real-world scenario: building and testing a simplified data pipeline in Python. This pipeline will involve two crucial external services: Apache Kafka for message queuing and PostgreSQL for data storage. Testing such a system traditionally is a headache, requiring manual setup of these services, which leads to flaky, slow, and inconsistent tests. Thankfully, Testcontainers comes to our rescue! We’ll use testcontainers-python to spin up fresh, isolated instances of both Kafka and PostgreSQL for every test run, ensuring your tests are reliable and fast.

This chapter assumes you’re comfortable with basic Python programming, pytest for testing, and the core concepts of Testcontainers as covered in previous chapters. By the end, you’ll have a solid understanding of how to implement robust integration tests for multi-service data pipelines using Python and Testcontainers.

Core Concepts: Testing Data Pipelines with Testcontainers

Data pipelines are everywhere. From processing user analytics to handling financial transactions, they are critical components that move, transform, and store data. A typical, simplified data pipeline might look like this: a producer sends data to a message broker (like Kafka), a consumer processes that data, and then stores it in a database (like PostgreSQL).

Testing this kind of system presents unique challenges:

  1. Dependencies: Your application doesn’t run in isolation; it depends on Kafka and PostgreSQL.
  2. State Management: Kafka retains messages, and databases store data. How do you ensure a clean slate for each test?
  3. Realism vs. Mocks: While mocks can isolate units, they don’t verify integration. We need to test against real Kafka and real PostgreSQL.
  4. Environment Consistency: How do you guarantee that your development, CI/CD, and fellow developers all test against the exact same versions and configurations of these services?

This is precisely where Testcontainers shines.

The Testcontainers Solution for Data Pipelines

Testcontainers allows us to declare the services our tests need right alongside our test code. When a test runs, Testcontainers:

  1. Pulls Images: If not already present, it pulls the necessary Docker images (e.g., confluentinc/cp-kafka, postgres).
  2. Spins Up Containers: It starts new, isolated container instances for each service. Each container gets its own network space, ports, and file system.
  3. Waits for Readiness: It intelligently waits for the services inside the containers to be fully operational (e.g., Kafka broker ready to accept connections, PostgreSQL database accepting queries).
  4. Provides Connection Details: It exposes dynamic connection information (like bootstrap_servers for Kafka, or connection_url for PostgreSQL) that your test code can use.
  5. Cleans Up: After the tests complete, it automatically tears down and removes the containers, leaving no trace.

This process ensures that every test run operates in a pristine, reproducible environment. Let’s visualize this:

flowchart TD
    A[Test Execution Starts] --> B[Pytest Fixture: Kafka]
    B --> C[Pytest Fixture: PostgreSQL]
    C --> D[Testcontainers Creates Kafka Container]
    D --> E[Kafka Container Starts & Becomes Ready]
    E --> F[Testcontainers Creates PostgreSQL Container]
    F --> G[PostgreSQL Container Starts & Becomes Ready]
    G --> H[Test Code Runs]
    H --> I[Application Produces to Kafka Container]
    I --> J[Application Consumes from Kafka Container & Writes to PostgreSQL Container]
    J --> K[Test Asserts Data in PostgreSQL Container]
    K --> L[Test Execution Ends]
    L --> M[Testcontainers Stops & Removes All Containers]
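Step 3 above (waiting for readiness) is worth pausing on: under the hood, a wait strategy is essentially a polling loop against a readiness condition. The following is a toy, stdlib-only sketch of that idea — an illustration of the concept, not Testcontainers' actual implementation (the `wait_until` name and signature are our own):

```python
import time

def wait_until(probe, timeout=30.0, interval=0.5):
    """Poll `probe` until it returns True, or raise after `timeout` seconds.

    A toy version of what a container wait strategy does: repeatedly check
    a readiness condition (a log line, an open port, a successful query).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    raise TimeoutError("service did not become ready in time")
```

A real strategy probes something concrete, such as "can I open a TCP connection to the mapped port?" or "has a specific log line appeared?". Testcontainers ships tuned strategies per image, so you rarely write this yourself.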

Why Not Just Mocks or In-Memory Fakes?

  • Mocks: While useful for unit testing, mocks only simulate behavior. They cannot catch issues related to actual network protocols, data types, query syntax, or version incompatibilities between your application and the real Kafka/PostgreSQL.
  • In-Memory Fakes: Some databases offer in-memory versions (like H2 for Java). These are faster but might not behave identically to their production counterparts (e.g., slight SQL dialect differences, specific features). Kafka doesn’t have a widely used in-memory equivalent that accurately mimics its distributed nature.

Testcontainers gives us the best of both worlds: the speed and isolation of a clean environment, combined with the confidence of testing against actual, unmodified production-grade services.

testcontainers-python for Multiple Containers

When working with multiple containers, pytest fixtures are your best friend. Each fixture can manage the lifecycle of a single container. Since pytest allows fixtures to depend on other fixtures, you can easily compose your test environment. We’ll define a fixture for Kafka and another for PostgreSQL, and then our test function will simply declare these fixtures as arguments to receive their fully started and configured instances.
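Conceptually, dependent fixtures behave like nested context managers: the first resource starts first and is torn down last. The toy sketch below uses a hypothetical `fake_service` stand-in (no Docker, no pytest machinery) purely to show the start/stop ordering that pytest guarantees when fixtures depend on each other:

```python
from contextlib import contextmanager

@contextmanager
def fake_service(name, log):
    """Hypothetical stand-in for a container fixture: start on entry, stop on exit."""
    log.append(f"start {name}")
    try:
        yield f"{name}-handle"
    finally:
        log.append(f"stop {name}")

log = []
with fake_service("kafka", log) as kafka:
    with fake_service("postgres", log) as postgres:
        log.append(f"test runs with {kafka} and {postgres}")

# Teardown happens in reverse (LIFO) order, just like dependent pytest fixtures:
# start kafka, start postgres, test, stop postgres, stop kafka
print(log)
```

This is exactly why module-scoped container fixtures are safe to compose: each one is guaranteed to be fully up before anything that depends on it runs, and cleanup happens in the reverse order.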

Step-by-Step Implementation: Building Our Pipeline Test

Let’s get our hands dirty! We’ll set up a Python project, create a simple data processing logic, and then write our Testcontainers-powered integration tests.

1. Project Setup

First, create a new directory for our project and navigate into it.

mkdir python-data-pipeline-test
cd python-data-pipeline-test

Next, let’s set up a Python virtual environment and install our dependencies.

python3 -m venv .venv
source .venv/bin/activate

Now, create a requirements.txt file with the following content. We’ll use specific versions for clarity as of 2026-02-14:

# requirements.txt
pytest==8.1.0
testcontainers==4.14.1
psycopg2-binary==2.9.9
kafka-python==2.0.2

Install these packages:

pip install -r requirements.txt
  • pytest: Our testing framework.
  • testcontainers: The Python library for managing throwaway Docker containers from your tests.
  • psycopg2-binary: A robust PostgreSQL adapter for Python.
  • kafka-python: A pure-Python Kafka client; our code imports KafkaProducer and KafkaConsumer from its kafka package.

2. The Application Logic: Our Simplified Data Processor

We’ll create a very basic data processor. It will have two main functions: one to produce a message to Kafka, and another to consume a message, add a “processed” status, and store it in PostgreSQL.

Create a file named data_processor.py:

# data_processor.py
import json
import logging
from kafka import KafkaProducer, KafkaConsumer
import psycopg2

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def produce_message(bootstrap_servers: str, topic: str, message: dict):
    """Produces a single JSON message to a Kafka topic."""
    # Explanation:
    # We create a KafkaProducer instance, providing the Kafka broker's address
    # (bootstrap_servers) and a serializer to turn our Python dict into JSON bytes.
    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=lambda v: json.dumps(v).encode('utf-8')
    )
    future = producer.send(topic, message)
    try:
        record_metadata = future.get(timeout=10) # Wait for the message to be sent
        logger.info(f"Message sent to topic {record_metadata.topic} partition {record_metadata.partition} offset {record_metadata.offset}")
    except Exception as e:
        logger.error(f"Failed to send message: {e}")
        raise # Re-raise to indicate failure in testing
    finally:
        producer.close() # Always close the producer

def consume_and_process(bootstrap_servers: str, topic: str, group_id: str, db_connection_url: str):
    """
    Consumes messages from Kafka, processes them (adds a status), and stores in PostgreSQL.
    Processes only one message for simplicity in testing.
    """
    # Explanation:
    # We set up a KafkaConsumer. 'auto_offset_reset='earliest'' means it starts
    # reading from the beginning of the topic if no offset is committed.
    # 'consumer_timeout_ms' prevents it from blocking indefinitely if no messages arrive.
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        group_id=group_id,
        auto_offset_reset='earliest',
        enable_auto_commit=True, # Auto-commit offsets after processing
        value_deserializer=lambda x: json.loads(x.decode('utf-8')),
        consumer_timeout_ms=5000 # Timeout after 5 seconds if no message
    )

    try:
        # We iterate over messages, but for testing simplicity, we'll process and break after one.
        for message in consumer:
            data = message.value
            logger.info(f"Received message: {data}")

            # Simulate processing: add a status field
            processed_data = {
                "id": data.get("id"),
                "value": data.get("value"),
                "status": "processed", # This is our added value
                "original_timestamp": data.get("timestamp")
            }
            logger.info(f"Processed data: {processed_data}")

            # Store in PostgreSQL
            try:
                # We connect using the DSN provided by the caller. Note that in
                # psycopg2, 'with conn:' commits on success (or rolls back on
                # error) but does NOT close the connection, so we close it
                # explicitly in a finally block.
                conn = psycopg2.connect(db_connection_url)
                try:
                    with conn:
                        with conn.cursor() as cur:
                            cur.execute(
                                """
                                INSERT INTO processed_events (id, value, status, original_timestamp)
                                VALUES (%s, %s, %s, %s)
                                """,
                                (processed_data["id"], processed_data["value"], processed_data["status"], processed_data["original_timestamp"])
                            )
                finally:
                    conn.close()
                logger.info("Data stored in PostgreSQL successfully.")
                break # Process only one message for this simplified example and exit
            except Exception as db_e:
                logger.error(f"Error storing data in PostgreSQL: {db_e}")
                raise # Re-raise to signal failure

    finally:
        consumer.close() # Always close the consumer
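One design note before we move on: the enrichment step is easiest to unit-test when pulled out into a pure function, because it needs neither Kafka nor PostgreSQL. The sketch below is a hypothetical refactor of the dictionary-building code inside consume_and_process (the `enrich` name is ours, not part of the file above):

```python
def enrich(event: dict) -> dict:
    """Pure enrichment step: copy the fields we keep and stamp a status.

    Mirrors the processed_data construction in consume_and_process, but as a
    standalone function that can be unit-tested without any containers.
    """
    return {
        "id": event.get("id"),
        "value": event.get("value"),
        "status": "processed",
        "original_timestamp": event.get("timestamp"),
    }
```

With this split, the Testcontainers test verifies wiring (Kafka in, PostgreSQL out) while fast unit tests cover the transformation logic itself; the mini-challenge at the end of this chapter becomes a one-line change here.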

3. Writing the Test with testcontainers-python

Now for the main event! We’ll create test_pipeline.py to test our data_processor.py functions using Testcontainers.

Create a file named test_pipeline.py in the same directory:

# test_pipeline.py
import pytest
import time
import psycopg2

from testcontainers.kafka import KafkaContainer
from testcontainers.postgres import PostgresContainer

# Import our application logic from data_processor.py
from data_processor import produce_message, consume_and_process

# Define common Kafka topic and consumer group for our test
TEST_TOPIC = "pipeline-events"
CONSUMER_GROUP = "test-group"

@pytest.fixture(scope="module")
def kafka_container():
    """
    Pytest fixture to spin up a Kafka container for the test module.
    It uses a specific Confluent Kafka image.
    """
    # Explanation:
    # We instantiate KafkaContainer with a specific Docker image.
    # Entering the 'with' block starts the container (no explicit .start() is
    # needed) and guarantees it is stopped and removed on exit.
    # Testcontainers waits for Kafka to be ready before the block body runs.
    # .get_bootstrap_servers() provides the dynamically assigned host:port for Kafka.
    with KafkaContainer(image="confluentinc/cp-kafka:7.6.1") as kafka:
        print(f"\nKafka bootstrap servers: {kafka.get_bootstrap_servers()}")
        yield kafka # Yield the running container to tests

@pytest.fixture(scope="module")
def postgres_container():
    """
    Pytest fixture to spin up a PostgreSQL container for the test module.
    Initializes a simple table for processed events.
    """
    # Explanation:
    # As with Kafka, entering the 'with' block starts the container and waits
    # for it to accept connections. Passing driver=None makes
    # .get_connection_url() return a plain postgresql:// DSN rather than a
    # SQLAlchemy-style postgresql+psycopg2:// URL, which is what
    # psycopg2.connect() expects.
    with PostgresContainer(image="postgres:16", driver=None) as postgres:
        print(f"\nPostgreSQL connection URL: {postgres.get_connection_url()}")

        # Initialize the database schema needed by our application.
        # Testcontainers' wait strategy ensures the DB is already accepting
        # connections at this point.
        conn = None
        try:
            conn = psycopg2.connect(postgres.get_connection_url())
            cursor = conn.cursor()
            cursor.execute("""
                CREATE TABLE IF NOT EXISTS processed_events (
                    id VARCHAR(255) PRIMARY KEY,
                    value TEXT,
                    status VARCHAR(50),
                    original_timestamp VARCHAR(255)
                );
            """)
            conn.commit() # Commit the schema creation
            cursor.close()
            print("PostgreSQL schema initialized.")
        except Exception as e:
            print(f"Error initializing PostgreSQL schema: {e}")
            raise # Fail the fixture if schema creation fails
        finally:
            if conn:
                conn.close()
        yield postgres # Yield the running container to tests

def test_data_pipeline_end_to_end(kafka_container, postgres_container):
    """
    Tests the end-to-end data pipeline:
    1. Produces a message to Kafka.
    2. Runs the consumer/processor application logic that reads from Kafka
       and writes to PostgreSQL.
    3. Verifies the processed data in PostgreSQL.
    """
    # Explanation:
    # Our test function takes the fixtures as arguments. Pytest automatically
    # resolves and provides the running container instances.
    # We define a sample message to push through our pipeline.
    test_message = {
        "id": "event-123",
        "value": "Hello Testcontainers",
        "timestamp": "2026-02-14T10:00:00Z"
    }

    print(f"\nProducing message: {test_message}")
    # Call our application's produce_message function, passing the Kafka container's
    # bootstrap servers.
    produce_message(kafka_container.get_bootstrap_servers(), TEST_TOPIC, test_message)

    # Give Kafka a brief moment for internal routing before the consumer tries to pick it up.
    time.sleep(1)

    print(f"Starting consumer/processor for topic '{TEST_TOPIC}'...")
    # Call our application's consume_and_process logic, connecting it to both
    # the Kafka and PostgreSQL containers.
    consume_and_process(
        kafka_container.get_bootstrap_servers(),
        TEST_TOPIC,
        CONSUMER_GROUP,
        postgres_container.get_connection_url()
    )

    # Verify data in PostgreSQL
    print("Verifying data in PostgreSQL...")
    conn = None
    try:
        conn = psycopg2.connect(postgres_container.get_connection_url())
        cursor = conn.cursor()
        # Query the database to retrieve the processed event.
        cursor.execute("SELECT id, value, status, original_timestamp FROM processed_events WHERE id = %s", (test_message["id"],))
        result = cursor.fetchone() # Get the first (and only) row
        cursor.close()
        conn.close()

        # Assertions to check if the data was stored and processed correctly.
        assert result is not None, "Processed event not found in PostgreSQL."
        assert result[0] == test_message["id"]
        assert result[1] == test_message["value"]
        assert result[2] == "processed"  # This verifies our processing logic
        assert result[3] == test_message["timestamp"]

        print("Data verification successful!")

    # Catch only database errors here: a bare 'except Exception' would also
    # swallow AssertionError and obscure which assertion actually failed.
    except psycopg2.Error as e:
        pytest.fail(f"Error during PostgreSQL verification: {e}")
    finally:
        if conn:
            conn.close()
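One gotcha worth calling out: depending on your testcontainers-python version and options, the PostgreSQL container's get_connection_url() may return a SQLAlchemy-style URL such as postgresql+psycopg2://test:test@localhost:54321/test. psycopg2.connect() expects a plain postgresql:// DSN and will reject the +psycopg2 suffix. If you hit this, a tiny normalizer helps (the `to_psycopg2_dsn` helper below is our own, pure string handling, nothing Docker-specific):

```python
def to_psycopg2_dsn(url: str) -> str:
    """Strip a SQLAlchemy driver suffix (e.g. '+psycopg2') from a database URL.

    'postgresql+psycopg2://u:p@h:5432/db' -> 'postgresql://u:p@h:5432/db'
    Plain 'postgresql://' URLs pass through unchanged.
    """
    scheme, sep, rest = url.partition("://")
    return scheme.split("+", 1)[0] + sep + rest
```

You would then write psycopg2.connect(to_psycopg2_dsn(postgres_container.get_connection_url())) at each connection site.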

4. Running the Tests

Make sure your Docker daemon is running! Then, from your python-data-pipeline-test directory, run pytest:

pytest -s -v test_pipeline.py
  • -s: Allows print statements to show up in the console (useful for debugging).
  • -v: Verbose output.

You should see output similar to this (exact details may vary):

============================= test session starts ==============================
...
collecting ...
collected 1 item

test_pipeline.py::test_data_pipeline_end_to_end
Kafka bootstrap servers: localhost:52335
PostgreSQL connection URL: postgresql://test:test@localhost:52336/test
PostgreSQL schema initialized.

Producing message: {'id': 'event-123', 'value': 'Hello Testcontainers', 'timestamp': '2026-02-14T10:00:00Z'}
Message sent to topic pipeline-events partition 0 offset 0
Starting consumer/processor for topic 'pipeline-events'...
Received message: {'id': 'event-123', 'value': 'Hello Testcontainers', 'timestamp': '2026-02-14T10:00:00Z'}
Processed data: {'id': 'event-123', 'value': 'Hello Testcontainers', 'status': 'processed', 'original_timestamp': '2026-02-14T10:00:00Z'}
Data stored in PostgreSQL successfully.
Verifying data in PostgreSQL...
Data verification successful!
PASSED                                                                   [100%]
...
============================== 1 passed in 25.XXs ==============================

Congratulations! You’ve just successfully performed an end-to-end integration test of a data pipeline using Kafka and PostgreSQL, all orchestrated by Testcontainers. Notice how Testcontainers managed the startup and shutdown of both containers automatically, ensuring a clean testing environment.

Mini-Challenge: Enhance the Pipeline Logic

It’s your turn to apply what you’ve learned!

Challenge: Modify the data_processor.py file to add a new processing step. Instead of just adding status: "processed", also capitalize the value field of the message before storing it in PostgreSQL. Then, update test_pipeline.py to assert this new capitalized value.

Hint:

  1. Locate the processed_data dictionary creation within the consume_and_process function in data_processor.py.
  2. Modify processed_data["value"] to apply the capitalization.
  3. In test_pipeline.py, adjust the expected value in your assertion to match the new capitalized format.

What to Observe/Learn: This challenge reinforces how changes in your application logic are quickly reflected and verified in your integration tests, demonstrating the value of testing against real dependencies.

Common Pitfalls & Troubleshooting

Even with Testcontainers, you might run into issues. Here are some common ones and how to approach them:

  1. Docker Not Running:
    • Symptom: You’ll see errors like “Cannot connect to the Docker daemon” or DockerException.
    • Fix: Ensure your Docker Desktop application (macOS/Windows) or Docker daemon (Linux) is running before you execute tests.
  2. Container Startup Timeout:
    • Symptom: TimeoutException from Testcontainers, indicating it couldn’t detect the service as ready within the default time.
    • Fix: Some containers, especially complex ones like Kafka, might need more time on slower machines or under heavy load. testcontainers-python lets you give readiness checks more time; the exact knob depends on your version (for example, the wait_for_logs(..., timeout=...) helper in testcontainers.core.waiting_utils, or the library's configurable retry settings), so check the testcontainers-python documentation. Also inspect the container logs to see why startup is slow.
  3. Client Connection Errors (e.g., Kafka NO_BROKERS_AVAILABLE, PostgreSQL could not connect to server):
    • Symptom: Your Python application code fails to connect to the Testcontainers-managed services.
    • Fix:
      • Verify get_bootstrap_servers() / get_connection_url(): Ensure you’re passing the exact values returned by Testcontainers to your client libraries. These are dynamic and change per test run.
      • Wait a bit longer: Sometimes, the readiness probe passes, but the service is still initializing some internal components. Adding a time.sleep(1) or time.sleep(2) after container.start() or before client connection attempts can sometimes resolve flaky connection issues, especially for Kafka clients.
      • Check container logs: Use print(kafka.get_logs()) (or postgres.get_logs()) within your fixture to see if the container itself reports any startup errors or misconfigurations.
  4. Resource Exhaustion:
    • Symptom: Docker runs slowly, tests hang, or your system becomes unresponsive.
    • Fix: Running many large containers simultaneously consumes RAM and CPU.
      • scope="module" or scope="session": For pytest fixtures, using a broader scope like module or session means the containers are started once for a whole module or test session, then reused across tests. We used scope="module" in this chapter, which is a good default.
      • Review images: Are you using the smallest necessary Docker images?
      • Allocate more Docker resources: Increase RAM/CPU limits in your Docker Desktop settings.
  5. Incorrect Kafka Topic/Consumer Group:
    • Symptom: Your consumer never receives messages, even if the producer sends them successfully.
    • Fix: Double-check that the TEST_TOPIC and CONSUMER_GROUP names are consistent across your producer, consumer, and test assertions. Kafka’s consumer groups are critical for how messages are distributed.
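For pitfall 3 in particular, fixed time.sleep() calls are a blunt instrument: they pad every run and still fail on a slow day. A small retry wrapper is usually more robust. The sketch below is generic (the `with_retry` name and its defaults are our own, not part of Testcontainers or any client library):

```python
import time

def with_retry(fn, attempts=5, delay=0.5, retry_on=(ConnectionError, OSError)):
    """Call `fn`, retrying up to `attempts` times on the given exception types.

    Useful for the first client connection to a freshly started container,
    where the readiness probe can pass slightly before the service settles.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error
            time.sleep(delay)
```

For example, conn = with_retry(lambda: psycopg2.connect(db_url)); in real code you would pass your client library's specific connection exceptions as retry_on rather than the broad defaults shown here.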

Summary

Phew! You’ve come a long way. In this chapter, you’ve mastered how to perform robust integration testing for a complex data pipeline in Python, leveraging the power of Testcontainers.

Here are the key takeaways:

  • Real-World Application: You built a simple data pipeline involving Kafka and PostgreSQL.
  • Multi-Container Orchestration: You learned how to use pytest fixtures with testcontainers-python to manage the lifecycle of multiple interdependent containers (Kafka and PostgreSQL).
  • End-to-End Verification: You implemented an end-to-end test that produced messages to a temporary Kafka instance, had your application process them, and then verified the results in a temporary PostgreSQL database.
  • Version Specificity: We used specific, up-to-date versions for testcontainers-python (4.14.1), Kafka (confluentinc/cp-kafka:7.6.1), and PostgreSQL (postgres:16) for reliable testing.
  • Confidence in Integration: You experienced firsthand how Testcontainers provides isolated, reproducible, and realistic testing environments, eliminating the flakiness and setup headaches often associated with integration tests.

This project demonstrates the immense value of Testcontainers in building reliable software, especially for microservices and data-intensive applications. By integrating real dependencies into your tests, you gain true confidence that your application behaves as expected in a production-like environment.

What’s next? In upcoming chapters, we’ll delve into more advanced Testcontainers features like performance optimization, integrating with CI/CD pipelines, and perhaps exploring other language-specific patterns or complex application stacks. Stay tuned!
