Troubleshooting
Production lessons from fleet GitOps, ambient mesh, and centralized observability. See also ebook Ch.15 matrix (adapted below).
RHDP fleets: Start with the RHDP install playbook for install order, spoke token anti-patterns, console-link 503s during the first hour, and operator bootstrap blockers.
Symptom matrix
| Symptom | Likely cause | Fix |
|---|---|---|
| Hub console links 503 (Developer Hub, Gitea, ODS, Skupper) | Backends still syncing or missing deps (catalog CM, SCC, Site) | Wait 60–90 min; see install playbook sections per product |
| OpenShift AI link 403 in curl | OAuth-protected dashboard | Log in with oc login; script uses oc whoami -t bearer token |
| East/west namespaces Terminating / recreating | Spoke tokens in auto-syncing field-content while import fails, or Namespace pre-created with managedCluster label before ManagedCluster | Remove tokens from GitOps values; import via ACM UI or chart order (ManagedCluster first); fleet-values-sync = domains only |
| acm-operator stuck, no MCH | CRD not ready before PostSync | helm template acm charts/all/acm-operator \| oc apply -f - |
| RHODS / COO CSV multiple operatorgroups | Duplicate OG from subscription + operatorGroup: true | Remove duplicate OG flag in hub values |
| ArgoCD apps show Unknown sync status | ACM 2.16 clusterview OpenAPI schema bug | Scale ocm-proxyserver to 0 and delete its APIServices (acmArgocdOpenapiFix); resource exclusion alone does not fix it; see below |
| upstream connect error / 503 on mesh routes | HBONE port 15008 not configured (pod before ztunnel) | Restart pods in ambient namespaces; ensure ambient labels at sync-wave 2 after Istio/ZTunnel |
| ApplicationSet Degraded: both name and server | Stale destination.server from older template (SSA) | Delete/recreate ApplicationSet or set server: "" in template |
| ACM UI: no Argo applications created | ApplicationSet missing cluster.open-cluster-management.io/placement label | Label ApplicationSet + child Apps; verify with oc get applications -n openshift-gitops \| grep spoke |
| Kiali: Unauthorized on east/west | Stale kiali-multi-cluster-secret or expired spoke token | Delete aggregate secret; run token-sync job; restart Kiali pod |
| Kafka Console: /api/kafkas 404 | External route hits UI only; Next.js does not proxy /api | Enable apiRoute in charts/all/kafka-console; verify HTTP 200 on /api/kafkas |
| Strimzi entity-operator CrashLoop | mTLS on 9091 conflicts with ztunnel | Exclude operator namespace from ambient or use documented Strimzi tuning |
| Skupper listener not Ready | Site or token not synced | Check oc get site,listener -n service-interconnect on hub and spoke |
| GitOpsCluster: legacy secret not found | ACM hasn’t created cluster secret yet | Wait 5-10 min; check klusterlet on spoke; verify ManagedCluster is Joined |
| Kuadrant /kuadrant: failed to fetch APIProducts | K8s RBAC or CRD group | Sync developer-hub; ClusterRole developer-hub-kuadrant needs devportal.kuadrant.io + gateway.networking.k8s.io gateways/httproutes |
| AuthPolicy Not Accepted / MissingDependency | Wrong ISTIO_GATEWAY_CONTROLLER_NAMES or operator started before mesh | Sync rhcl-operator subscription config; restart kuadrant-operator-controller-manager; see Connectivity Link |
| API key works in console but not httpbin | AuthPolicy selector app ≠ APIProduct name | Match app label to APIProduct (e.g. workshop-mcp-gateway, workshop-llm-tokens) |
| API Overview: Expected object at root, got string | Incomplete OpenAPI in catalog entity | Ensure API entities have valid definition with paths; fix $text file refs in reading.allow |
| TechDocs tab 404 / builder not local | techdocs.builder: external or missing mkdocs | Set builder: local in app-config; scaffolded repos need mkdocs.yml + backstage.io/techdocs-ref: dir:. |
| Quay org-setup Job failing | /version redirect, CSRF, or duplicate robot | Use GitOps setup.py with /discovery + bearer token; see Quay |
| DevSpaces link on hub 404 | DevSpaces is spoke-only | Open https://devspaces.<east-or-west-domain> from template output |
| MCP Gateway 503 / /mcp 404 | Argo Unknown: CRDs never applied | bash scripts/apply-mcp-gateway.sh |
| Developer Hub /lightspeed chat 401 | Missing MaaS key or wrong vLLM URL | bash scripts/apply-maas-secrets.sh; default model granite-3-2-8b-instruct via MaaS |
| NeuroFace /api/chat 401 | Secret neuroface-maas-api-key placeholder | RHDP litemaas.apiKey or apply-maas-secrets.sh; PostSync neuroface-maas-key-sync |
| Gitea assets 503/500 | Wrong ROOT_URL or service selector | PostSync gitea-fix-* jobs; bash scripts/apply-gitea-root-url.sh |
| Orphan apps in default | helm template \| oc apply without -n | Delete orphan stack; always sync via Argo CD (namespace in Application spec) |
| workshop-apis 401 without key | Expected (Kuadrant AuthPolicy) | Request key at Developer Hub /kuadrant |
| Developer Hub Kuadrant tab missing / catalog parse error | Catalog ConfigMap truncated to hub domain only | Re-sync developer-hub chart ≥ v1.5.1; verify oc get cm developer-hub-catalog-workshop-kuadrant-apis -n developer-hub -o yaml \| grep 'kind: API' returns 4 lines |
| Vault console link 307 | href points to route root | Use /ui/ — see install playbook |
| Camel mqtt-to-kafka Error, Kafka metadata timeout | Missing advertised EndpointSlice or ambient ztunnel on Kafka TCP | EndpointSlice + deployment trait istio.io/dataplane-mode: none; see below |
| Stormshift MirrorMaker2 CrashLoop | Empty clusterName → broker-0-. | Set clusterName: east\|west in spoke app values |
ArgoCD Unknown sync status (ACM 2.16)
Symptom: All ArgoCD applications show “Unknown” sync status in the UI, even though they are healthy and syncing correctly.
Error message:
SchemaError(github.com/stolostron/cluster-lifecycle-api/clusterview/v1alpha1.UserPermission.status):
unknown model in reference
Cause: MCE ocm-proxyserver publishes aggregated clusterview OpenAPI with a broken UserPermission.status reference. Argo CD cannot load the hub OpenAPI cache, so apps show Unknown / ComparisonError. resourceExclusions alone does not fix this.
Verification: Applications still show Healthy health status and operationState.phase: Succeeded:
# Check actual operation state (should show "Succeeded")
oc get application <app-name> -n openshift-gitops \
-o jsonpath='{.status.operationState.phase}'
# All apps healthy?
oc get applications -n openshift-gitops -o jsonpath='{range .items[*]}{.metadata.name}: {.status.health.status}{"\n"}{end}' | grep -v Healthy
Automated fix (Git): charts/all/openshift-gitops (acmArgocdOpenapiFix, enabled by default): a PostSync Job + CronJob scales ocm-proxyserver to 0, deletes its APIServices, and restarts the application controller.
Manual one-shot:
oc scale deployment/ocm-proxyserver -n multicluster-engine --replicas=0
for name in v1.clusterview.open-cluster-management.io \
v1alpha1.clusterview.open-cluster-management.io \
v1beta1.proxy.open-cluster-management.io; do
oc delete apiservice "$name" --ignore-not-found
done
oc rollout restart statefulset openshift-gitops-application-controller -n openshift-gitops
Note: Pair with acm-operator PostSync to disable cluster-proxy-addon. Success: 0 Unknown apps. Trade-off: clusterview UserPermission via proxy is unavailable; direct spoke APIs and ACM fleet inventory still work.
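The success criterion (0 Unknown apps) can be checked with a small filter. This is a minimal sketch and not part of the repository: the function name unknown_sync_apps is hypothetical, and it assumes one application per line in the form "name sync-status health-status".

```shell
# Hypothetical helper: print the names of applications whose sync status is Unknown.
# Expects "name sync-status health-status" lines on stdin.
unknown_sync_apps() {
  awk '$2 == "Unknown" { print $1 }'
}

# Usage against the hub (requires a logged-in oc session):
#   oc get applications -n openshift-gitops \
#     -o jsonpath='{range .items[*]}{.metadata.name} {.status.sync.status} {.status.health.status}{"\n"}{end}' \
#     | unknown_sync_apps
```

Empty output means no application reports Unknown. An application with an empty sync status shifts the columns, so treat this as a quick check rather than a full report.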
MCE cluster-proxy-addon (ACM 2.16+)
Symptom: Argo CD spoke apps use destination.server = cluster-proxy URL; proxy add-on conflicts with hub cluster-wide proxy or complicates GitOps debugging.
Default in ACM/MCE 2.16: cluster-proxy-addon component is enabled.
Automated fix (new installations): Chart charts/all/acm-operator runs PostSync Job + CronJob (acm-mce-disable-cluster-proxy) that sets MultiClusterEngine/spec.overrides.components[name=cluster-proxy-addon].enabled: false.
Verify:
oc get mce multiclusterengine -o jsonpath='{range .spec.overrides.components[*]}{.name}={.enabled}{"\n"}{end}' | grep cluster-proxy
# expect: cluster-proxy-addon=false
Disable automation: set mceDisableClusterProxyAddon: false in acm-operator values (hub clustergroup override if needed).
Limitation: Disabling the add-on does not always remove ocm-proxyserver in multicluster-engine — that deployment is a separate MCE component. Spoke ManagedClusterAddon/cluster-proxy on local-cluster may also need manual review if pod-log-via-proxy features are required.
Manual one-shot:
oc patch mce multiclusterengine --type=merge -p '{"spec":{"overrides":{"components":[{"name":"cluster-proxy-addon","enabled":false}]}}}'
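The merge patch above replaces the whole spec.overrides.components list, so any other component overrides are dropped. To change this one entry while keeping the rest, use the Job script in charts/all/acm-operator/files/disable-cluster-proxy-addon.py.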
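If the Job script is not at hand, one way to flip a single entry without touching the others is a JSON patch addressed by list index. The sketch below is an illustration under assumptions, not the logic of disable-cluster-proxy-addon.py: the function name disable_mce_component is hypothetical, and it assumes the component is already listed under spec.overrides.components.

```shell
# Hypothetical helper: disable one MCE component by index, leaving other overrides intact.
disable_mce_component() {
  component="$1"
  # Print one component name per line, in list order, and find the zero-based index.
  index=$(oc get mce multiclusterengine \
      -o jsonpath='{range .spec.overrides.components[*]}{.name}{"\n"}{end}' |
    awk -v want="$component" '$0 == want { print NR - 1; exit }')
  if [ -z "$index" ]; then
    echo "component ${component} not found in spec.overrides.components" >&2
    return 1
  fi
  # JSON patch "add" on an object member sets the value whether or not it already exists.
  oc patch mce multiclusterengine --type=json \
    -p "[{\"op\":\"add\",\"path\":\"/spec/overrides/components/${index}/enabled\",\"value\":false}]"
}

# Usage: disable_mce_component cluster-proxy-addon
```

The index is read and patched in two separate calls, so a concurrent edit to the list could shift it; re-run the verify command afterwards.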
HBONE port 15008 not configured
Symptom: Routes return upstream connect error or 503; ztunnel logs show missing HBONE listener for pod IP.
Cause: Workloads started before ambient enrollment or before ztunnel programmed iptables.
Fix:
- Ensure namespaces get istio.io/dataplane-mode: ambient after Istio + IstioCNI + ZTunnel (wave 2 in servicemeshoperator3, not wave 1 namespaces).
- Restart affected Deployments after mesh is Ready.
- reconcileIptablesOnStartup: true on IstioCNI helps new nodes but does not retroactively fix running pods.
# charts/all/servicemeshoperator3: ambient labels (wave 2)
metadata:
  labels:
    istio.io/dataplane-mode: ambient
  annotations:
    argocd.argoproj.io/sync-wave: "2"
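Restarting the affected Deployments can be scripted once the mesh is Ready. The following is a minimal sketch, assuming every Deployment in an ambient-labelled namespace may be restarted safely; the function name restart_ambient_deployments is hypothetical.

```shell
# Hypothetical helper: restart all Deployments in namespaces labelled for ambient mode.
restart_ambient_deployments() {
  # Space-separated list of namespaces carrying the ambient label.
  namespaces=$(oc get namespace -l istio.io/dataplane-mode=ambient \
    -o jsonpath='{.items[*].metadata.name}')
  for ns in $namespaces; do
    # Without a resource name, rollout restart applies to every Deployment in the namespace.
    oc rollout restart deployment -n "$ns"
  done
}

# Usage: restart_ambient_deployments
```

StatefulSets and DaemonSets are not covered; restart those separately if they also started before ztunnel.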
ApplicationSet: both name and server defined
Symptom:
application destination spec is invalid: application destination can't
have both name and server defined: west https://kubernetes.default.svc
Cause: Older ApplicationSet template set server; Server-Side Apply does not remove fields the new manifest omits.
Fix:
# charts/all/acm-hub-spoke/templates/applicationset.yaml
destination:
  name: ''
  namespace: openshift-gitops
  server: "" # explicit blank clears stale SSA
Then delete and let Argo CD recreate the ApplicationSet, or patch live spec to remove server.
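The "patch live spec" option can be expressed as a JSON patch that removes the stale field from the ApplicationSet template. This is a sketch under assumptions: the function name remove_stale_destination_server is hypothetical, and the ApplicationSet is assumed to live in openshift-gitops.

```shell
# Hypothetical helper: drop the stale destination.server from a live ApplicationSet template.
remove_stale_destination_server() {
  appset="$1"
  oc patch applicationset "$appset" -n openshift-gitops --type=json \
    -p '[{"op":"remove","path":"/spec/template/spec/destination/server"}]'
}

# Usage (fleet-spoke-push is the ApplicationSet named elsewhere on this page;
# substitute the one that reports the error):
#   remove_stale_destination_server fleet-spoke-push
```

A JSON patch remove fails if the path does not exist, so an error here usually means the field is already gone.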
Kiali multi-cluster Unauthorized
Symptom: Hub Kiali logs: Error fetching Namespaces for cluster [east]: Unauthorized.
Cause:
- Expired token in spoke kiali-hub-export ConfigMap.
- Legacy kiali-multi-cluster-secret still labeled kiali.io/multiCluster=true alongside kiali-remote-* secrets.
Fix:
# Hub
oc delete secret kiali-multi-cluster-secret -n openshift-cluster-observability-operator --ignore-not-found
oc create job kiali-token-refresh --from=cronjob/kiali-multicluster-token-sync \
-n openshift-cluster-observability-operator
oc delete pod -n openshift-cluster-observability-operator -l app=kiali
On spokes, confirm export ConfigMap exists:
oc get cm kiali-hub-export -n openshift-cluster-observability-operator -o jsonpath='{.data.updatedAt}'
Kafka Console 404 on /api/*
Symptom: Browser or curl to https://kafka-console.<hub-domain>/api/kafkas returns Next.js HTML 404; in-pod console-api returns 200.
Cause: Operator Service targets UI port 3000 only; external route does not split /api to port 8080.
Fix: Deploy supplemental Route (GitOps: charts/all/kafka-console/templates/api-route.yaml):
spec:
  host: kafka-console.apps.hub.example.com
  path: /api
  to:
    kind: Service
    name: kafka-console-api-service
  port:
    targetPort: http # 8080 on console-api container
Do not set haproxy.router.openshift.io/rewrite-target — the API expects the /api prefix.
Blank UI / NextAuth 404 on /api/auth/*
Symptom: Kafka Console page loads partially or stays blank; browser network tab shows 404 on /api/auth/providers; console-api logs show GET /api/auth/providers ... 404.
Cause: The supplemental /api Route sends all /api/* traffic to Quarkus. NextAuth runs in the UI container (Next.js) on port 3000, not in console-api.
Fix: Add a more specific Route /api/auth → kafka-console-console-service with port.targetPort: **3000** (not 80 — the Service’s EndpointSlice exposes pod port 3000). GitOps: charts/all/kafka-console/templates/api-route.yaml (kafka-console-ui-auth).
curl -sk -o /dev/null -w '%{http_code}\n' \
https://kafka-console.<hub-domain>/api/auth/providers
# Expect 200
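Because the /api and /api/auth Routes fail independently, it helps to probe both paths in one pass. The sketch below assumes both should return 200 once the Routes are in place; the function name check_console_paths is hypothetical.

```shell
# Hypothetical helper: print the HTTP status for each path and fail if any is not 200.
check_console_paths() {
  base_url="$1"
  shift
  rc=0
  for path in "$@"; do
    code=$(curl -sk -o /dev/null -w '%{http_code}' "${base_url}${path}")
    echo "${path} ${code}"
    [ "$code" = "200" ] || rc=1
  done
  return "$rc"
}

# Usage:
#   check_console_paths "https://kafka-console.<hub-domain>" /api/kafkas /api/auth/providers
```

A 404 on /api/auth/providers with a 200 on /api/kafkas points at the missing /api/auth Route rather than at console-api.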
JSON 404 / code 4041 on cluster detail
Symptom: UI shows {"errors":[{"title":"Resource not found","status":"404","code":"4041"}]} when opening a Kafka cluster.
Cause: Valid API route, but the cluster id is unknown or the console-api cannot reach brokers (often west spoke offline → Skupper listener has no connector).
Checks:
# List works?
curl -sk https://kafka-console.<hub-domain>/api/kafkas
# Detail per cluster (replace id from list response)
curl -sk -o /dev/null -w '%{http_code}\n' https://kafka-console.<hub-domain>/api/kafkas/<id>
# West spoke up?
oc config use-context west
oc get applications spoke-interconnect-west -n openshift-gitops
oc get link -n service-interconnect
Fix: Restore west (or east) spoke apps and Skupper link; resync field-content-kafka-console for broker DNS EndpointSlices.
industrial-edge-tst Degraded (Camel / KServe)
Symptom: Argo CD app industrial-edge-tst-east (or -west) is Degraded with:
- Integration/mqtt-to-kafka: dependency camel:mqtt not found in Camel catalog
- InferenceService/anomaly-detection: stuck Progressing; sync waits for healthy state
Causes:
- Camel K: Routes use paho: URIs; the catalog dependency is camel:paho, not camel:mqtt.
- KServe: Chart ships InferenceService only when anomalyDetection.enabled: true. Default is false because spokes need ODH RawDeployment (no Serverless Operator), a MinIO model at s3://models/anomaly-detection/model, and a Ready DataScienceCluster. Threshold alerts still work via ie-anomaly-alerter without KServe.
Fix (GitOps):
# charts/all/industrial-edge-tst/templates/camel-integrations.yaml
dependencies:
  - camel:paho
  - camel:kafka
# charts/all/industrial-edge-data-science-cluster: edge RawDeployment
kserve:
  defaultDeploymentMode: RawDeployment
  serving:
    managementState: Removed
modelmeshserving:
  managementState: Removed
Verify Camel integration:
oc get integration mqtt-to-kafka -n industrial-edge-tst-all \
-o jsonpath='{range .status.conditions[?(@.type=="Ready")]}{.status} {.message}{"\n"}{end}'
Enable ML inference later: upload model to MinIO, set anomalyDetection.enabled: true in spoke app values, sync industrial-edge-data-science-cluster then industrial-edge-tst.
Industrial Edge alerts not in Mailpit
Symptom: ie-anomaly-alerter logs show Failed to send mail: HTTP Error 503 or Mailpit UI is empty while MQTT anomalies appear in pod logs.
Causes:
- Wrong hub domain on spokes: MAILPIT_URL must be https://mailpit.<hub-apps-domain>/api/v1/send, not the spoke’s own domain. Check:
  oc get deploy ie-anomaly-alerter -n industrial-edge-tst-all \
    -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="MAILPIT_URL")].value}{"\n"}'
- ie-anomaly-alerter not deployed: Argo CD app Missing on east/west; apply with correct hubClusterDomain:
  helm template ie charts/all/ie-anomaly-alerter \
    --set hubClusterDomain=apps.cluster-<hub-id>.dynamic2.redhatworkshops.io \
    --set clusterName=east | oc apply -f -
- fleet-values-sync stale on ACM 2.16: spoke domains were not derived when the job looked for kube-apiserver instead of apiserverurl.openshift.io. Re-run after chart fix:
  oc create job --from=cronjob/fleet-values-sync fleet-values-sync-manual -n openshift-gitops
Verify: Mailpit route returns 200 on POST /api/v1/send; alerter logs Mail sent [...] -> 200.
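The hub-domain check can be turned into a pass/fail test run on each spoke. This is a minimal sketch, assuming MAILPIT_URL is set on the first container of the ie-anomaly-alerter Deployment as in the check above; the function name check_mailpit_url is hypothetical.

```shell
# Hypothetical helper: compare the deployed MAILPIT_URL with the expected hub URL.
check_mailpit_url() {
  hub_apps_domain="$1"
  expected="https://mailpit.${hub_apps_domain}/api/v1/send"
  actual=$(oc get deploy ie-anomaly-alerter -n industrial-edge-tst-all \
    -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="MAILPIT_URL")].value}')
  if [ "$actual" = "$expected" ]; then
    echo "ok: ${actual}"
  else
    echo "mismatch: got '${actual}', want '${expected}'"
    return 1
  fi
}

# Usage (on a spoke context): check_mailpit_url "<hub-apps-domain>"
```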
Camel K 401 Unauthorized / ImagePullBackOff on internal registry: The PostSync Job camel-k-registry-bootstrap creates camel-k-registry-docker from the builder SA token and patches IntegrationPlatform + pull-secret trait. If the integration kit is stuck in Error, delete the Integration and IntegrationKit, then re-sync the app.
Camel K + Istio ambient (MQTT → Kafka silent failure): With istio.io/dataplane-mode: ambient on industrial-edge-tst-all, ztunnel intercepts Kafka broker TCP and Camel cannot complete metadata fetch. Git fix: deployment trait (not pod) sets istio.io/dataplane-mode: none on the integration Deployment.
# charts/all/industrial-edge-tst/templates/camel-integrations.yaml
traits:
  deployment:
    configuration:
      metadata:
        labels:
          istio.io/dataplane-mode: none
Kafka advertised DNS (EndpointSlice)
Symptom: Camel or MirrorMaker2 logs UnknownHostException for dev-cluster-broker-0-<clusterName>.<namespace>.svc or metadata request timeout.
Cause: Strimzi Kafka CR sets advertisedHost to a custom DNS name; clients resolve it via hub EndpointSlice objects that Skupper/kafka-console charts create. If clusterName is empty in spoke values, broker hostnames are invalid (broker-0-.).
Fix:
- Set clusterName: east|west in charts/region/east|west/values.yaml for IE tst, stormshift, datalake apps.
- Verify EndpointSlices exist on hub for each broker advertised name.
- Re-sync field-content-kafka-console if west/east broker lists are stale.
oc get endpointslices -A | grep kafka-brokers-advertised
oc get kafka -n industrial-edge-tst-all -o yaml | grep -A2 advertisedHost
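An empty clusterName shows up as a hostname segment ending in "broker-N-." so it can be caught with a pattern match. The sketch below assumes hostnames follow the broker-<n>-<clusterName> pattern described above; the function name find_invalid_broker_hosts is hypothetical.

```shell
# Hypothetical helper: print lines containing a broker hostname with an empty clusterName,
# for example "dev-cluster-broker-0-.<namespace>.svc".
find_invalid_broker_hosts() {
  grep -E 'broker-[0-9]+-\.'
}

# Usage:
#   oc get kafka -n industrial-edge-tst-all -o yaml | grep advertisedHost | find_invalid_broker_hosts
```

No output (and a non-zero exit status from grep) means no hostname with an empty cluster name was found.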
MCP Gateway (Argo Unknown)
Symptom: https://mcp-gateway.<hub-domain>/mcp returns 503 or 404; Argo app mcp-gateway sync Unknown.
Cause: ACM 2.16 schema bug blocks Application sync; MCPServerRegistration CRDs and routes never land.
Fix:
bash scripts/apply-mcp-gateway.sh
curl -sk -o /dev/null -w '%{http_code}\n' https://mcp-gateway.<hub-domain>/mcp
# Expect 200
spoke-gateway Degraded (modelmesh-serving not found)
Symptom: Argo CD app spoke-gateway-east (on the east cluster) shows HTTPRoute ie-anomaly-detection Degraded.
Cause: Optional KServe/ModelMesh route points at a backend that is not Ready yet (or ML stack not installed).
Fix (GitOps): charts/all/spoke-gateway/values.yaml sets inferenceRoute.enabled: false by default. Enable only after InferenceService is Ready and set backend namespace to redhat-ods-applications when using cluster-scoped ModelMesh.
MaaS / Lightspeed / NeuroFace 401
Symptom: Developer Hub /lightspeed loads but chat fails with 401 or empty response; NeuroFace /api/chat returns 401.
Cause: MaaS API keys not injected — secrets contain CHANGEME-inject-via-RHDP or Lightspeed sync Job skipped.
Fix:
export MAAS_KEY_LLAMA='sk-...'
export MAAS_KEY_GRANITE='sk-...'
bash scripts/apply-maas-secrets.sh
oc rollout restart deployment/developer-hub -n developer-hub
oc rollout restart deployment/neuroface -n neuroface
Verify:
oc get secret kairos-ai-credentials -n kairos-system -o jsonpath='{.data.api-key}' | base64 -d | wc -c
curl -sk -X POST https://neuroface.<hub-domain>/api/chat \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"user","content":"hi"}]}'
Lightspeed model: defaults to MaaS granite-3-2-8b-instruct (plugins.lightspeed.aiModel in developer-hub chart). Requires valid key in llama-stack-secrets / Kairos sync.
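Before restarting anything, it can be worth confirming whether a secret still holds the placeholder. This is a sketch under assumptions: the function name secret_is_placeholder is hypothetical, and it treats an empty value or any value starting with CHANGEME as "not injected".

```shell
# Hypothetical helper: succeed (exit 0) if the secret key is empty or still a CHANGEME placeholder.
secret_is_placeholder() {
  namespace="$1"
  secret="$2"
  key="$3"
  # Keys containing dots would need jsonpath escaping; simple keys such as api-key work as-is.
  value=$(oc get secret "$secret" -n "$namespace" -o jsonpath="{.data.${key}}" | base64 -d)
  case "$value" in
    '' | CHANGEME*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   if secret_is_placeholder kairos-system kairos-ai-credentials api-key; then
#     echo "key not injected; run scripts/apply-maas-secrets.sh"
#   fi
```

The function never prints the decoded value, so it is safe to use in logs.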
Camel Dashboard (spoke console plugin)
Symptom: No Camel tab in the OpenShift console on east/west, or Argo app camel-dashboard-openshift-all-{east,west} OutOfSync.
GitOps: Vendored wrapper charts/all/camel-dashboard-openshift (umbrella 4.20.2 in charts/*.tgz), namespace camel-dashboard, sync wave 3 (see charts/region/east/values.yaml, charts/region/west/values.yaml). Avoids Argo DeadlineExceeded when spokes cannot reach the public Helm repo in time.
Post-sync (cluster-admin, once per spoke): Administration → Cluster settings → Console → enable the Camel Dashboard console plugin. Argo ignores ConsolePlugin.spec.enablement so manual enablement does not fight GitOps.
Camel K vs CamelApp: Industrial Edge uses Camel K Integration resources (e.g. mqtt-to-kafka). The dashboard operator primarily manages CamelApp CRs. Integrations may not appear in the Camel tab until you register them as CamelApp or add a bridge; use Topology/Kamelet views for Camel K workloads in the meantime.
Symptom: Failed to get a valid plugin manifest from /api/plugins/camel-dashboard-console/
Cause: The camel-dashboard-console Service has no endpoints — usually app.kubernetes.io/instance on the Service selector does not match the Deployment pod labels (e.g. after helm template + oc apply with release name camel-dashboard instead of camel-dashboard-openshift-all-{east,west}).
Fix:
# Endpoints must be non-empty
oc get endpointslices -n camel-dashboard -l kubernetes.io/service-name=camel-dashboard-console -o yaml | grep -A3 addresses
# Align selector with running pods (or re-sync Argo with helm.releaseName set in spoke templates)
oc get svc camel-dashboard-console -n camel-dashboard -o jsonpath='selector={.spec.selector}{"\n"}'
oc get pod -n camel-dashboard -l app=camel-dashboard-console -o jsonpath='instance={.items[0].metadata.labels.app\.kubernetes\.io/instance}{"\n"}'
# Test manifest from inside the cluster
oc run curl-camel --rm -i --restart=Never -n camel-dashboard \
--image=registry.redhat.io/ubi9/ubi-minimal:latest -- \
curl -sk https://camel-dashboard-console.camel-dashboard.svc:9443/plugin-manifest.json
Prefer Argo CD sync (not manual helm apply) so releaseName: camel-dashboard-openshift-all-{cluster} matches Service and Deployment labels.
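The selector comparison can be automated so a mismatch is obvious at a glance. The following is a minimal sketch that reuses the resource names and labels from the commands above; the function name instance_labels_match is hypothetical.

```shell
# Hypothetical helper: compare app.kubernetes.io/instance on the Service selector and the pod.
instance_labels_match() {
  svc_instance=$(oc get svc camel-dashboard-console -n camel-dashboard \
    -o jsonpath='{.spec.selector.app\.kubernetes\.io/instance}')
  pod_instance=$(oc get pod -n camel-dashboard -l app=camel-dashboard-console \
    -o jsonpath='{.items[0].metadata.labels.app\.kubernetes\.io/instance}')
  echo "service=${svc_instance} pod=${pod_instance}"
  # Fail when the selector is missing or differs from the running pod's label.
  [ -n "$svc_instance" ] && [ "$svc_instance" = "$pod_instance" ]
}

# Usage: instance_labels_match || echo "selector mismatch: re-sync via Argo CD"
```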
Checks:
oc get application camel-dashboard-openshift-all-east -n openshift-gitops -o jsonpath='{.status.sync.status}{" "}{.status.health.status}{"\n"}'
oc get deployment -n camel-dashboard
oc get consoleplugin | grep -i camel
Air-gapped spokes: mirror the Helm repo or chart tgz internally and point repoURL / targetRevision in spoke values.yaml.
Helm template error (Hawtio disabled): If Argo reports index of nil pointer on hawtio-online-console-plugin, ensure spoke valuesObject includes stub plugin.service.port and gateway.service.port (see east/templates/component-applications.yaml).
East spoke Unknown apps: If east-spoke-components was removed from the hub, re-sync field-content-acm-hub-spoke so ApplicationSet fleet-spoke-push recreates it (see GitOps deployment chain).
east-spoke-components stuck Progressing: Usually waiting on devspaces-east (CheCluster InstallOrUpdateFailed while chePhase: Active). Fixes: delete orphan east-devspaces on the spoke (duplicate of devspaces-east, often with deletionTimestamp); ensure only devspaces from charts/region/east/values.yaml exists. Git: ignoreDifferences on CheCluster status + argocd.argoproj.io/skip-health-check on the CheCluster CR. Then oc patch application east-spoke-components -n openshift-gitops --type json -p='[{"op":"remove","path":"/operation"}]' and re-sync.
Cannot find ApplicationSet in ACM UI: ACM Applications lists Application CRs only. Use oc get applicationset fleet-spoke-push -n openshift-gitops on the hub, or open OpenShift GitOps → ApplicationSets. Child apps like industrial-edge-tst on the east spoke come from charts/region/east/values.yaml (PULL), not from the ApplicationSet template directly.
Argo CD: where applications live
| Cluster | Namespace | Examples |
|---|---|---|
| Hub | openshift-gitops | field-content-*, east-spoke-components, west-spoke-components |
| East spoke | openshift-gitops | camel-dashboard-openshift-all-east, operators-east, spoke-gateway-east, spoke-interconnect-east |
| West spoke | openshift-gitops | camel-dashboard-openshift-all-west, operators-west, spoke-gateway-west, spoke-interconnect-west |
Parent apps use destination.server = cluster-proxy URL. Child apps on spokes use https://kubernetes.default.svc.
Strimzi entity-operator CrashLoopBackOff (ambient)
Symptom: entity-operator CrashLoopBackOff after enabling ambient on Kafka namespaces.
Cause: Double encryption or ztunnel intercept on internal replication port 9091.
Fix: Keep Kafka control-plane namespaces off ambient where documented, or follow Strimzi + OSSM ambient guidance for your version.
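To see which namespaces are candidates for exclusion, list those that both carry the ambient label and contain a Kafka resource. This is a sketch, assuming the Strimzi Kafka CRD is installed so that oc get kafka works; the function name kafka_namespaces_in_ambient is hypothetical.

```shell
# Hypothetical helper: print namespaces that are ambient-labelled and contain a Kafka CR.
kafka_namespaces_in_ambient() {
  ambient=$(oc get namespace -l istio.io/dataplane-mode=ambient \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')
  oc get kafka -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\n"}{end}' |
    sort -u |
    while read -r ns; do
      [ -n "$ns" ] || continue
      # Exact, fixed-string match against the ambient namespace list.
      if printf '%s\n' "$ambient" | grep -qxF "$ns"; then
        echo "$ns"
      fi
    done
}

# Usage: kafka_namespaces_in_ambient
```

Each namespace printed is one where ztunnel may intercept port 9091; whether to exclude it depends on the Strimzi and OSSM guidance for your version.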
Related docs
- Validation Guide — quick health checks and component validation
- Bill of Materials — operator versions and compatibility
- Service Mesh sync waves
- Architecture sync-wave table
- Getting Started
- Support Policy — community support channels