Observability + Service Mesh

CDC Pipeline Observability

The full pipeline is instrumented with Prometheus metrics, Grafana dashboards, and traffic visualization through the Service Mesh.

Observability stack

Component	Role
Prometheus (via Cluster Observability Operator)	Scrapes metrics from Kafka and Debezium (JMX) and from Camel (Micrometer)
Grafana	"Kafka CDC Pipeline" dashboard with throughput, lag, and latency panels
Kiali	Visualization of traffic between services in the Service Mesh
Kafka Exporter	Exports consumer group lag metrics to Prometheus

Grafana Dashboard — Kafka CDC Pipeline

Access: https://grafana-observability.

The dashboard is bound to the Grafana instance through instanceSelector:

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: kafka-cdc-pipeline
  namespace: openshift-cluster-observability-operator
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana-observability
  json: |
    { ... }

The label dashboards: grafana-observability must match the label on the Grafana CR exactly. With any other value (for example connectivity-link), the selector matches no Grafana instance and the dashboard is never loaded.
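As an illustration of the matching rule, the fragment below sketches the metadata of a Grafana CR that the selector above would match. The resource name is an assumption, and the namespace and spec are omitted because the source does not state them.

```yaml
# Fragment only: the label is the part that instanceSelector.matchLabels compares against.
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana                        # hypothetical name
  labels:
    dashboards: grafana-observability  # must equal the value in the GrafanaDashboard selector
# spec omitted; it is not needed to illustrate label matching
```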

The "Kafka CDC Pipeline" dashboard includes the following panels:

Panel	Metric
Kafka Broker — Messages In/s	`sum(rate(kafka_server_brokertopicmetrics_messagesin_total[5m])) by (topic)`
Kafka Broker — Bytes In/Out	Byte throughput per second (in and out)
Consumer Group Lag	`kafka_consumergroup_lag > 0` — lag per partition and consumer group
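Debezium — Streaming Duration	`debezium_postgres_MilliSecondsSinceLastEvent` — milliseconds since the connector last processed a change event; it also grows while the source is idle, so it is only a rough proxy for CDC latency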
KafkaConnect — Task Status	Number of active connectors
KafkaConnect — Records Processed/s	`rate(kafka_connect_task_sink_record_send_total[5m])` — records processed per second

PodMonitors for metrics

PodMonitors are configured for each pipeline component:

apiVersion: monitoring.rhobs/v1
kind: PodMonitor
metadata:
  name: kafka-cluster-metrics
  namespace: openshift-cluster-observability-operator
spec:
  namespaceSelector:
    matchNames:
    - kafka-cdc
  podMetricsEndpoints:
  - interval: 30s
    port: tcp-prometheus
  selector:
    matchLabels:
      strimzi.io/cluster: cdc-cluster
      strimzi.io/kind: Kafka
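The other components can follow the same pattern. The sketch below shows what PodMonitors for KafkaConnect and for the Camel processor might look like. The KafkaConnect cluster name, the Camel pod label, and the Camel port name are assumptions; only the /q/metrics path and the tcp-prometheus port come from this page.

```yaml
# KafkaConnect (Debezium): Strimzi labels Connect pods with strimzi.io/kind: KafkaConnect.
apiVersion: monitoring.rhobs/v1
kind: PodMonitor
metadata:
  name: kafka-connect-metrics
  namespace: openshift-cluster-observability-operator
spec:
  namespaceSelector:
    matchNames:
    - kafka-cdc
  podMetricsEndpoints:
  - interval: 30s
    port: tcp-prometheus
  selector:
    matchLabels:
      strimzi.io/cluster: cdc-connect      # hypothetical KafkaConnect name
      strimzi.io/kind: KafkaConnect
---
# Camel: metrics are served over HTTP at /q/metrics instead of the JMX exporter port.
apiVersion: monitoring.rhobs/v1
kind: PodMonitor
metadata:
  name: camel-cdc-metrics
  namespace: openshift-cluster-observability-operator
spec:
  namespaceSelector:
    matchNames:
    - kafka-cdc
  podMetricsEndpoints:
  - interval: 30s
    port: http                             # hypothetical container port name
    path: /q/metrics
  selector:
    matchLabels:
      app.kubernetes.io/name: camel-cdc-processor   # hypothetical pod label
```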

How it Works

Metrics pipeline: from JMX to Grafana

CDC pipeline metrics pass through four layers before visualization:

Exposure (JMX → Prometheus format) — Each Kafka broker and KafkaConnect exposes internal JMX metrics. A JMX Prometheus Exporter agent (configured via metricsConfig in the Strimzi CR) converts JMX metrics to Prometheus text format on port 9404. Camel uses Micrometer to expose its metrics natively at /q/metrics.
Scraping (Prometheus) — PodMonitor CRs tell Prometheus which pods to scrape, on which port, and how often (interval: 30s). Prometheus stores time series with labels (topic, partition, consumer group, connector) that enable granular queries.
Query (PromQL → panels) — Each Grafana dashboard panel runs a PromQL query. For example, sum(rate(kafka_server_brokertopicmetrics_messagesin_total[5m])) by (topic) computes message throughput per second grouped by topic, using a 5-minute window to smooth spikes.
Alerts (PrometheusRule → Alertmanager) — Alert rules are evaluated continuously by Prometheus. When an expression stays true for the duration set in the rule's for field (e.g. kafka_consumergroup_lag > 1000 for 5 minutes), Prometheus sends the alert to Alertmanager, which can notify via email, Slack, PagerDuty, etc.
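The exposure layer can be sketched as a fragment of the Strimzi Kafka CR plus the ConfigMap it references. The ConfigMap name and key are assumptions, the Kafka CR is reduced to the metrics-related fields, and the single rule shown is modeled on the Strimzi example metrics configuration rather than taken from this pipeline.

```yaml
# Fragment: only the metrics-related part of the Kafka CR is shown.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: cdc-cluster
  namespace: kafka-cdc
spec:
  kafka:
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics              # hypothetical ConfigMap name
          key: kafka-metrics-config.yml    # hypothetical key
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-metrics
  namespace: kafka-cdc
data:
  kafka-metrics-config.yml: |
    lowercaseOutputName: true
    rules:
      # Maps per-topic "...PerSec" MBeans to counters such as
      # kafka_server_brokertopicmetrics_messagesin_total{topic="..."}
      - pattern: kafka.server<type=(.+), name=(.+)PerSec\w*, topic=(.+)><>Count
        name: kafka_server_$1_$2_total
        type: COUNTER
        labels:
          topic: "$3"
```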

Kafka Exporter: consumer lag metrics

The kafkaExporter deployed by Strimzi is a dedicated process that:

Connects to the Kafka cluster and reads offsets for all consumer groups
Computes lag per partition: lag = highWaterMark - consumerOffset
Exposes the kafka_consumergroup_lag metric that Prometheus scrapes
This helps detect bottlenecks: if lag grows, consumers are not keeping up with producers
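For reference, Kafka Exporter is enabled through the kafkaExporter section of the Strimzi Kafka CR. The fragment below is a minimal sketch; the regular expressions are illustrative values, not the ones used by this pipeline.

```yaml
# Fragment: only the exporter-related part of the Kafka CR is shown.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: cdc-cluster
  namespace: kafka-cdc
spec:
  kafkaExporter:
    topicRegex: ".*"   # illustrative: export offsets for every topic
    groupRegex: ".*"   # illustrative: export lag for every consumer group
# Worked example of the lag formula for one partition:
#   highWaterMark = 1500, consumerOffset = 1200  ->  lag = 1500 - 1200 = 300
```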

Service Mesh — Istio Ambient Mode

The kafka-cdc namespace is enrolled in the Service Mesh using Istio ambient mode (no sidecars):

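apiVersion: v1
kind: Namespace
metadata:
  name: kafka-cdc
  labels:
    istio.io/dataplane-mode: ambient   # traffic is captured by ztunnel, no sidecar injection
    istio-discovery: enabled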

Traffic visible in Kiali

Access: https://kiali-openshift-cluster-observability-operator.

In Kiali you can see the service graph for the kafka-cdc namespace:

PostgreSQL → KafkaConnect (Debezium)
KafkaConnect → Kafka brokers
Kafka → Camel CDC Processor
Camel → Mailpit (HTTP)

Ambient mode provides:

Automatic mTLS between all pods in the namespace
L4 metrics without sidecars (via ztunnel); L7 metrics additionally require a waypoint proxy
Traffic visibility in Kiali without per-sidecar CPU/RAM overhead

Alerts — PrometheusRule

Alerts are deployed as a PrometheusRule resource that Prometheus evaluates automatically:

apiVersion: monitoring.rhobs/v1
kind: PrometheusRule
metadata:
  name: kafka-cdc-alerts
  namespace: openshift-cluster-observability-operator
  labels:
    openshift.io/user-monitoring: "true"
spec:
  groups:
    - name: kafka-cdc
      rules:
        - alert: KafkaConsumerLagHigh
          expr: kafka_consumergroup_lag > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High consumer lag on {{ $labels.consumergroup }}"
        - alert: DebeziumDisconnected
          expr: debezium_postgres_Connected == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Debezium disconnected from PostgreSQL"
        - alert: KafkaConnectTaskFailed
          expr: kafka_connect_worker_connector_failed_task_count > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "KafkaConnect connector has at least one task in FAILED state"

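To verify that the rule exists and inspect its contents (the resource is qualified with its API group because the cluster also serves PrometheusRule under monitoring.coreos.com):

oc get prometheusrules.monitoring.rhobs kafka-cdc-alerts -n openshift-cluster-observability-operator -o yaml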
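Whether a rule actually fires after its for period can also be checked offline with promtool's rule unit tests. The sketch below assumes the groups: content of the PrometheusRule spec has been copied into a plain rules file named kafka-cdc-rules.yml, since promtool reads plain rule files rather than Kubernetes resources. The file name and the label values in the input series are made up for illustration.

```yaml
# kafka-cdc-rules-test.yml (hypothetical file name)
rule_files:
  - kafka-cdc-rules.yml          # hypothetical: the `groups:` section saved as a plain file
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Lag stays at 1500 for ten minutes (label values are illustrative).
      - series: 'kafka_consumergroup_lag{consumergroup="cdc-processor", topic="orders", partition="0"}'
        values: '1500x10'
    alert_rule_test:
      # After 4 minutes the alert is still pending, so nothing is expected to fire.
      - eval_time: 4m
        alertname: KafkaConsumerLagHigh
        exp_alerts: []
      # After 6 minutes the 5m `for` period has elapsed and the alert fires.
      - eval_time: 6m
        alertname: KafkaConsumerLagHigh
        exp_alerts:
          - exp_labels:
              severity: warning
              consumergroup: cdc-processor
              topic: orders
              partition: "0"
            exp_annotations:
              summary: "High consumer lag on cdc-processor"
```

With both files in the same directory, the test would be run with promtool test rules kafka-cdc-rules-test.yml.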

Official Documentation

OpenShift Monitoring — Integrated monitoring with Prometheus and Alertmanager
Cluster Observability Operator — Multi-signal observability on OpenShift
Red Hat OpenShift Service Mesh — Service Mesh with Istio, including ambient mode
Grafana Documentation — Dashboards and metric visualization
Kiali Documentation — Service Mesh observability console
Prometheus Documentation — Monitoring and alerting
Kafka Metrics and Monitoring — Export Kafka metrics to Prometheus