CDC Pipeline Architecture
Real-Time Data Streaming
Real-time streaming is the continuous flow of data as it is generated, enabling immediate analysis and processing to extract meaningful insights.
Common data sources:
-
Website activity
-
Banking transactions
-
E-commerce purchases
-
Player activity in gaming
-
IoT sensors
Change Data Capture (CDC)
CDC is an integration pattern that identifies and captures changes made to a database (INSERT, UPDATE, DELETE) and propagates them as events to other systems in real time.
Why CDC?
| Traditional approach | CDC with Debezium |
|---|---|
Periodic polling against the DB |
WAL-based capture (write-ahead log) |
High latency (seconds/minutes) |
Sub-second latency |
Extra load on the DB |
Minimal impact (reads the transaction log) |
May miss changes between polls |
Captures all changes, including deletes |
Detailed data flow
-
A user inserts or updates a row in PostgreSQL
-
PostgreSQL writes the change to the Write-Ahead Log (WAL) with
wal_level=logical -
Debezium (running as a KafkaConnect plugin) reads the WAL via a replication slot
-
Debezium serializes the event as JSON (optionally Avro via Apicurio Registry)
-
The event is published to the corresponding Kafka topic (
cdc.public.customers) -
Apache Camel consumes the event from the topic
-
Camel transforms the event and sends an email notification via Mailpit
-
The entire pipeline is monitored in Grafana and visible in Kiali (Service Mesh)
How it Works
The CDC pipeline operates as a reactive chain of five stages, where each component acts independently and asynchronously:
-
Capture — PostgreSQL writes every transaction to the Write-Ahead Log (WAL) before committing it. Debezium opens a replication slot and reads the WAL using PostgreSQL’s native logical replication protocol, without additional queries against the tables.
-
Serialize — Debezium turns each change into a structured JSON event that includes: the
beforevalue (previous row state),after(new state),op(operation type:c`reate, `u`pdate, `d`elete, `r`ead), and `source(metadata: table, schema, WAL LSN, WAL timestamp). -
Stream — The event is published to the corresponding Kafka topic (
cdc.public.<table>). Kafka persists it to disk, replicates it to 3 brokers (RF=3), and acknowledges the write only when at least 2 replicas have persisted the data (min.insync.replicas=2). -
Process — Apache Camel consumes from the topic using a consumer group. Each partition is assigned to a single consumer (partition affinity), preserving order by primary key. Camel deserializes the JSON, applies content-based routing (
$.payload.op), and transforms the payload. -
React — The Camel processor runs the appropriate action: HTTP POST to Mailpit for notifications, production to DLQ topics for errors, or Prometheus metric emission. Everything is non-blocking — if a destination fails, the message goes to the Dead Letter Queue.
|
Typical end-to-end pipeline latency is sub-second from the PostgreSQL commit to the notification. The bottleneck is not Kafka (which persists in microseconds) but the HTTP POST to external services. |
Namespace and resources
All CDC pipeline components are deployed in the kafka-cdc namespace:
apiVersion: v1
kind: Namespace
metadata:
name: kafka-cdc
labels:
istio.io/dataplane-mode: ambient
istio-discovery: enabled
The istio.io/dataplane-mode: ambient label automatically enrolls the namespace in the Service Mesh, making all traffic visible in Kiali without sidecars.
High Availability
The pipeline is configured with multiple replicas and persistent storage for fault tolerance.
Kafka Brokers — Persistent Storage
Brokers use persistent-claim instead of ephemeral to retain data across restarts:
spec:
kafka:
replicas: 3
storage:
type: persistent-claim
size: 10Gi
deleteClaim: false
-
size: 10Gi— capacity per broker (adjust based on data volume) -
deleteClaim: false— PVCs are preserved when scaling or deleting the cluster
Official Documentation
-
Red Hat Streams for Apache Kafka — Deploying and configuring Kafka clusters on OpenShift
-
Red Hat build of Debezium — CDC connector configuration
-
Red Hat build of Apache Camel — Integration and processing routes
-
OpenShift Service Mesh — mTLS and traffic observability
-
Strimzi Documentation — Upstream project for Streams for Apache Kafka