CDC Pipeline Architecture

Real-Time Data Streaming

Real-time streaming is the continuous flow of data as it is generated, enabling immediate analysis and processing to extract meaningful insights.

Common data sources:

  • Website activity

  • Banking transactions

  • E-commerce purchases

  • Player activity in gaming

  • IoT sensors

Change Data Capture (CDC)

CDC is an integration pattern that identifies and captures changes made to a database (INSERT, UPDATE, DELETE) and propagates them as events to other systems in real time.

Why CDC?

Traditional approach CDC with Debezium

Periodic polling against the DB

WAL-based capture (write-ahead log)

High latency (seconds/minutes)

Sub-second latency

Extra load on the DB

Minimal impact (reads the transaction log)

May miss changes between polls

Captures all changes, including deletes

Pipeline Architecture

Diagram

Detailed data flow

  1. A user inserts or updates a row in PostgreSQL

  2. PostgreSQL writes the change to the Write-Ahead Log (WAL) with wal_level=logical

  3. Debezium (running as a KafkaConnect plugin) reads the WAL via a replication slot

  4. Debezium serializes the event as JSON (optionally Avro via Apicurio Registry)

  5. The event is published to the corresponding Kafka topic (cdc.public.customers)

  6. Apache Camel consumes the event from the topic

  7. Camel transforms the event and sends an email notification via Mailpit

  8. The entire pipeline is monitored in Grafana and visible in Kiali (Service Mesh)

How it Works

The CDC pipeline operates as a reactive chain of five stages, where each component acts independently and asynchronously:

  1. Capture — PostgreSQL writes every transaction to the Write-Ahead Log (WAL) before committing it. Debezium opens a replication slot and reads the WAL using PostgreSQL’s native logical replication protocol, without additional queries against the tables.

  2. Serialize — Debezium turns each change into a structured JSON event that includes: the before value (previous row state), after (new state), op (operation type: c`reate, `u`pdate, `d`elete, `r`ead), and `source (metadata: table, schema, WAL LSN, WAL timestamp).

  3. Stream — The event is published to the corresponding Kafka topic (cdc.public.<table>). Kafka persists it to disk, replicates it to 3 brokers (RF=3), and acknowledges the write only when at least 2 replicas have persisted the data (min.insync.replicas=2).

  4. Process — Apache Camel consumes from the topic using a consumer group. Each partition is assigned to a single consumer (partition affinity), preserving order by primary key. Camel deserializes the JSON, applies content-based routing ($.payload.op), and transforms the payload.

  5. React — The Camel processor runs the appropriate action: HTTP POST to Mailpit for notifications, production to DLQ topics for errors, or Prometheus metric emission. Everything is non-blocking — if a destination fails, the message goes to the Dead Letter Queue.

Typical end-to-end pipeline latency is sub-second from the PostgreSQL commit to the notification. The bottleneck is not Kafka (which persists in microseconds) but the HTTP POST to external services.

Namespace and resources

All CDC pipeline components are deployed in the kafka-cdc namespace:

apiVersion: v1
kind: Namespace
metadata:
  name: kafka-cdc
  labels:
    istio.io/dataplane-mode: ambient
    istio-discovery: enabled

The istio.io/dataplane-mode: ambient label automatically enrolls the namespace in the Service Mesh, making all traffic visible in Kiali without sidecars.

High Availability

The pipeline is configured with multiple replicas and persistent storage for fault tolerance.

Kafka Brokers — Persistent Storage

Brokers use persistent-claim instead of ephemeral to retain data across restarts:

spec:
  kafka:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
      deleteClaim: false
  • size: 10Gi — capacity per broker (adjust based on data volume)

  • deleteClaim: false — PVCs are preserved when scaling or deleting the cluster

KafkaConnect — 2 replicas

KafkaConnect is deployed with 2 replicas for load balancing and connector failover:

spec:
  replicas: 2

Kafka Connect automatically distributes tasks across available replicas. If one replica fails, the other takes over the tasks.

Camel CDC Processor — 2 replicas

The Camel processor is scaled to 2 replicas. Kafka balances partitions across consumers in the same group:

spec:
  replicas: 2

With 3 partitions per topic and 2 consumers, each replica processes ~1.5 partitions on average.

Official Documentation