RHDP install playbook
Lessons from fresh RHDP fleets (hub + east + west). This guide helps you reach fleet value faster — ACM inventory, fleet Grafana/Kafka Console, Developer Hub catalog, Industrial Edge via hub-gateway, and ACS Central — without fighting GitOps churn on day one.
What “done” looks like (product outcomes)
| Outcome | Why it matters | How you know |
|---|---|---|
| Fleet GitOps | One hub controls PUSH operators + spoke PULL apps | managedclusters east/west Available; fleet-spoke-push ApplicationSet present |
| Cross-cluster observability | Hub sees east/west metrics and Kafka | Grafana + Kiali console links HTTP 200; Kafka Console lists clusters |
| Secure fleet | Central policy + spoke enforcement | ACS Central link 200; SecuredClusters join after hub domain sync |
| Developer experience | Scaffolding, catalog, workshop entry | Developer Hub link 200; hybrid-mesh-shared-demos in catalog |
| Edge reachability | Factory telemetry through one ingress | industrial-edge.<hub-domain> 200 after spokes + Skupper |
| Private mesh | Hub reaches spoke services without VPN | Skupper sitesInNetwork: 3; network-observer link 200 |
Run the platform smoke test from the hub:
oc login --token=... --server=<hub-api-url>
MIN_OK_CODE=200 bash scripts/verify-console-links.sh
bash scripts/verify-workshop-http200.sh # console links + showroom/MCP/IE/spokes
bash scripts/verify-fleet.sh
Expect 60–90 minutes after hub sync for all 19 hub console links to return HTTP 200 (operators, CRs, and routes converge gradually). Some links may show 503 while pods sync — that usually means the route exists but the backend is still starting. Full checklist: Validation guide → Hub console links.
Recommended install order
Option A — Hub first (fastest path to fleet value)
- Hub —
charts/region/hub, wait formulticlusterhubphase Running (~10–15 min). - East / West —
charts/region/eastandcharts/region/west(parallel is fine). - Import spokes — ACM UI or
auto-import-secret(see Getting started → Phase 3). - Sync domains — trigger
fleet-values-synconce (domains only; see below). - Day-2 bootstrap — when spokes are Available, run
bash scripts/apply-post-install-day2.sh(mesh, showroom, MCP gateway, HTTP 200 gate). - Verify — console links + Skupper VAN (
sitesInNetwork: 3).
Option B — Three RHDP orders in parallel
Valid for catalog provisioning. Spokes may finish before the hub; cross-cluster features (Mailpit, ACS Central endpoint, hub-gateway Industrial Edge, console links to hub services) converge when ACM import and fleet-values-sync complete. Same 60–90 min total; more time in a “partially connected” state.
ACM bootstrap
Symptom: acm-operator retries MultiClusterHub because the CRD is not installed yet (Subscription still rolling out).
Impact: Delays ACM, ApplicationSet, and fleet PUSH — blocks the core product story.
Mitigation (if Argo is stuck after ~15 min):
helm template acm charts/all/acm-operator | oc apply -f -
Wait for MCH Running before importing spokes or enabling heavy acm-hub-spoke auto-sync with tokens.
Spoke tokens and field-content
fleet-values-sync patches domains only — not API tokens. Tokens belong in RHDP-injected secrets or a one-time hub patch.
Anti-pattern: Putting managedClusters.east.token / west.token in field-content while acm-hub-spoke auto-syncs can cause east/west namespace terminate/recreate loops if import fails.
Preferred flow:
- MCH Running
- Import east/west (ACM UI, or
ManagedClusterfirst thenauto-import-secret— see chartacm-hub-spoke; do not pre-create aNamespacewithcluster.open-cluster-management.io/managedClusterlabel) - Optional: patch domains via
fleet-values-syncmanual job - Re-enable automated sync on
acm-hub-spokeonce clusters are Available
oc create job --from=cronjob/fleet-values-sync fleet-values-sync-manual -n openshift-gitops
Argo CD + ACM 2.16
New installs include resourceExclusions for clusterview.open-cluster-management.io in the openshift-gitops chart — avoids blocking sync on fresh hubs.
If apps show ComparisonError after MCH install:
# Apply chart or patch ArgoCD — see Troubleshooting
oc rollout restart statefulset openshift-gitops-application-controller -n openshift-gitops
Note: Local helm install/upgrade against the cluster may fail with the same clusterview schema error after ACM installs. Use helm template | oc apply for emergency fixes.
Full detail: Troubleshooting → ArgoCD Unknown sync status.
OperatorGroups (RHODS, observability)
Symptom: CSV Failed — csv created in namespace with multiple operatorgroups.
Impact: No Grafana/Kiali (observability story) or OpenShift AI dashboard until fixed.
Cause: Duplicate OperatorGroups when both clustergroup subscriptions and operatorGroup: true on the same namespace create two OGs.
Fix in Git: Do not set operatorGroup: true on namespaces that already receive an OG from subscriptions (e.g. redhat-ods-operator). See charts/region/hub/values.yaml.
Gitea (SCM for Developer Hub)
The upstream Gitea chart’s PostgreSQL/Valkey pods need privileged SCC on OpenShift (not anyuid — upstream sets seccomp annotations that anyuid rejects).
Chart binding: charts/all/gitea/templates/clusterrolebinding-gitea-privileged.yaml.
Route must target Service gitea-http (not field-content-gitea-chart-http).
Without Gitea, Developer Hub scaffolding and local SCM integration are degraded — the Gitea console link stays 503.
PostgreSQL HA on small hubs: chart defaults to 1 replica + PostSync PVC ownership job (job-gitea-postgres-fix-perms.yaml). Day-2: bash scripts/apply-gitea-postgres-fix.sh (scales STS to 0, chown 1000910000, restarts pgpool).
ACS Central (constrained hub CPU)
Central + Central DB default requests (~5.5 CPU) exceed small RHDP hub capacity when MaaS predictors, ODS notebooks, and NooBaa pods are scheduled.
Chart: charts/all/acs-operator — reduced Central/DB requests and scanner maxReplicas: 1.
Day-2 (before expecting ACS console 200):
bash scripts/apply-hub-resource-relief.sh
# wait 60–120s; central-db must be Running 2/2
curl -skI "https://central-stackrox.$(oc get ingresses.config/cluster -o jsonpath='{.spec.domain}')/"
Relief scales MaaS predictors and neuroface-ml-lab notebooks to 0, trims duplicate ACS scanners, and pauses heavy NooBaa backing-store pods until Central DB schedules.
Skupper network observer
Console link hostname: skupper-network-observer-service-interconnect.<domain>.
Use wrapper chart charts/all/skupper-network-observer (OCI subchart + OpenShift Route with TLS passthrough to port https). The subchart alone may land resources in default; the wrapper keeps everything in service-interconnect.
Observer pods need Skupper Site on the hub and TLS secrets from certificates.skupper.io.
OpenShift AI (RHODS)
OperatorGroup in redhat-ods-operator must be AllNamespaces (spec: {}) — not targetNamespaces: [redhat-ods-operator] (CSV fails with UnsupportedOperatorGroup).
Chart includes DSCInitialization + DataScienceCluster with defaultDeploymentMode: RawDeployment.
Dashboard returns 403 without auth — verify-console-links.sh uses oc whoami -t when logged in.
Kubecost
OperatorGroup name must match clustergroup convention: kubecost-operator-group (avoid duplicate OGs from chart + clustergroup).
Route uses global.localClusterDomain — not apps.cluster.example.com.
Developer Hub
Backstage requires catalog ConfigMaps before the hub pod starts:
developer-hub-catalog-demos(fromworkshop-demoschart)- TechDocs ConfigMap keys must not contain
/(OpenShift rejectsdocs/index.mdas a key)
TechDocs mount path is separate from the IE catalog entity mount (.../ie vs .../ie/techdocs) so Backstage can load both.
Product check: https://developer-hub.<hub-domain> returns 200 and catalog shows Industrial Edge and hybrid-mesh-shared-demos.
Vault console link
Upstream Vault chart creates vault-link pointing at the route root (HTTP 307). For strict HTTP 200 checks, use /ui/ in the href or patch post-install.
Console links verification
oc login --token=... # required for OAuth-protected links (OpenShift AI)
MIN_OK_CODE=200 bash scripts/verify-console-links.sh
| HTTP | Meaning |
|---|---|
| 200–399 | Route + backend OK — product surface reachable |
| 503 | Route exists; pods/operators still syncing (common in first hour) |
| 404 / 000 | Wrong hostname or no Route — check Troubleshooting |
Script uses cluster bearer token when logged in; excludes operator-created duplicate rhodslink ConsoleLinks.
Success criteria: Summary: 19 OK (200-399), 0 503 (route exists / pods down), 0 other with exit code 0.
Industrial Edge dashboard (hub-gateway)
Hub hub-gateway uses Istio Gateway API (Gateway + HTTPRoute) → Skupper ie-gateway-{east,west} listeners. Requires:
- OSSM 3.2 installed on hub and spokes (
servicemeshoperator3Subscription +Istio/ZTunnelCRs — sync wave 4, beforehub-gatewaywave 5) fleet-values-syncpopulatedclusters.east.domain/clusters.west.domainon hub- Skupper
sitesInNetwork: 3(hub + east + west) - Spoke apps
spoke-interconnect+spoke-gateway+industrial-edge-tstHealthy on each spoke - Listeners
ie-gateway-east|westReady (not No matching connectors)
bash scripts/verify-industrial-edge.sh
# optional: EAST_DOMAIN=apps.<east> WEST_DOMAIN=apps.<west> ...
Per-spoke direct check: https://line-dashboard-industrial-edge-tst-all.<spoke-domain>/
If Argo sync is Unknown, bootstrap mesh + Skupper manually:
bash scripts/apply-fleet-mesh.sh
Grafana fleet metrics (East/West)
Hub dashboard platform-overview uses datasources Prometheus-East / Prometheus-West → Skupper listeners prometheus-east|west.
No data on East/West panels usually means:
- Skupper links incomplete (
sitesInNetwork≠ 3) - Spoke
spoke-interconnectmissingprometheus-{east,west}Connector +prometheus-auth-proxy - Hub listeners
prometheus-east|westPending (No matching connectors) - ztunnel not Ready on ambient namespaces
oc get grafanadatasource -n openshift-cluster-observability-operator
oc get listener -n service-interconnect | grep prometheus
oc get ds -n ztunnel
Fix Skupper VAN first — same prerequisite as Industrial Edge.
Skupper Network Observer (demo, sin auth)
Console link: https://skupper-network-observer-service-interconnect.<hub-domain>/
Chart charts/all/skupper-network-observer usa auth.strategy: none (sin basic auth en /api/). Solo para demos / RHDP — no usar en producción expuesta a Internet.
Si la consola pide usuario/contraseña, re-sincronizar la app o aplicar:
helm template skupper-network-observer charts/all/skupper-network-observer \
--set global.localClusterDomain=apps.<hub-domain> \
| oc apply -n service-interconnect -f -
Service Mesh + Skupper bootstrap (spokes)
Symptom: istiod Not found, spoke Gateway/spoke-gateway stuck Waiting for controller, Skupper sitesInNetwork=1, IE 502/503.
Automated GitOps (default):
| Layer | Mechanism |
|---|---|
| Operator | servicemeshoperator3 Subscription (stable-3.2) in clustergroup subscriptions + operators-edge / operators-platform (PUSH) |
| Mesh CRs | App servicemeshoperator3 (sync wave 4) → Istio, IstioCNI, ZTunnel, ambient labels |
| Spoke ingress | App spoke-gateway (wave 6) after istiod Running |
| Hub ingress | App hub-gateway (wave 5) — Istio Gateway + HTTPRoute (no nginx fallback) |
Cause chain when broken:
servicemeshoperator3Subscription missing or mesh app synced before CSV installsspoke-interconnectrendered with emptyclusterName→ invalidSite/andConnector/ie-gateway-- Skupper tokens never create
Link/hub-linkuntil spokeSiteexists
Mitigation (after ManagedClusters Available):
bash scripts/apply-fleet-mesh.sh
Installs OSSM 3.2 (stable-3.2), applies spoke-interconnect + spoke-gateway + industrial-edge-tst on east/west, re-runs Skupper token sync, and validates IE + Prometheus paths.
Verify:
oc get pods -n istio-system -l app=istiod # east/west
oc get site hub -n service-interconnect -o jsonpath='sitesInNetwork={.status.sitesInNetwork}{"\n"}'
bash scripts/verify-industrial-edge.sh
Expect sitesInNetwork=3, istiod Running on spokes, hub IE 200, prometheus-east|west listeners Ready.
Workshop users (workshop-users IdP)
Chart platform-users creates htpasswd users user1..userN, admin, platformadmin (default password Welcome123!).
- OAuth IdP name:
workshop-users(secretworkshop-usersinopenshift-config) grantClusterReader: true— Platform Hub-Spoke menu (ACM, Kiali, Skupper, Grafana links)- Namespace
viewRoleBindings for middleware namespaces
Re-apply after rename:
oc apply -f scripts/fix-htpasswd-users-secret-job.yaml
oc wait --for=condition=complete job/htpasswd-users-secret-fix -n openshift-gitops --timeout=5m
Login: OpenShift console → workshop-users → user1 / Welcome123!
Workshop Showroom (Antora lab)
Hub apps workshop-registration (wave 4) and showroom (wave 5) live in namespace showroom. Entry from the console link Hybrid Mesh AI Workshop → registration → redirect to Antora with USER_NAME=userN.
| URL | Purpose |
|---|---|
https://workshop-registration.<hub-domain>/ | Assign userN, progress tracking |
https://showroom-showroom.<hub-domain>/ | Lab guide + embedded terminal (/terminal/) |
Symptom: showroom-showroom returns 503 — Route missing or Argo never applied the showroom chart (common when sync status is Unknown on ACM 2.16 hubs).
Mitigation (after fleet-values-sync has east/west domains, or once ManagedClusters are Available):
bash scripts/apply-workshop-showroom.sh
The script reads hub/east/west domains from the cluster, runs helm template | oc apply, and waits for the pod (Antora build init can take ~3 min).
Verify:
curl -sk -o /dev/null -w '%{http_code}\n' https://workshop-registration.apps.<hub-domain>/api/health
curl -sk -o /dev/null -w '%{http_code}\n' https://showroom-showroom.apps.<hub-domain>/
Expect 200 on both. Facilitator flow: register → Showroom with ?USER_NAME=user1 → Developer Hub catalog hybrid-mesh-shared-demos.
After pushing new Antora content to showroom-hybrid-mesh-ai:
bash scripts/sync-showroom-content.sh
# or: oc rollout restart deployment/showroom -n showroom
Post-install day-2 (automated)
When ACM import and fleet-values-sync are done but Argo CD shows Unknown sync (ACM 2.16), run one script from the hub:
export KUBECONFIG=/tmp/hub-kubeconfig # or oc login
bash scripts/apply-post-install-day2.sh
| Step | Script | What it fixes |
|---|---|---|
| Fleet mesh + Skupper | apply-fleet-mesh.sh | OSSM 3.2, spoke-interconnect, IE listeners, sitesInNetwork=3 |
| Workshop showroom | apply-workshop-showroom.sh | Registration + Antora pod when showroom app never synced |
| MCP Gateway | apply-mcp-gateway.sh | CRDs + MCPServerRegistration when /mcp returns 503 |
| Istio/Kafka monitoring | apply-istio-monitoring.sh | PodMonitors + UWM on hub/spokes (Grafana panels) |
| Kuadrant public APIs | apply-workshop-kuadrant-apis.sh | workshop-apis gateway + APIProducts |
| Gitea route/assets | apply-gitea-root-url.sh | ROOT_URL, service selector, route host (idempotent; PostSync jobs cover fresh install) |
| MaaS secrets | apply-maas-secrets.sh | Lightspeed / NeuroFace / ODS keys (env vars, optional) |
| HTTP 200 gate | verify-workshop-http200.sh | 19 console links + workshop/AI URLs |
Skip mesh if already healthy: SKIP_MESH=1 bash scripts/apply-post-install-day2.sh.
HashiCorp Vault (hub)
Console: Platform Hub-Spoke menu → Vault → https://vault-vault.<hub-domain>/ui/ (use /ui/ — route root returns HTTP 307).
Workshop users: userN has view on namespace vault (see platform-users chart).
Facilitator init: Vault chart is external VP hashicorp-vault. Create local values-secret.yaml (gitignored) for init/unseal — never commit tokens. External Secrets Operator (openshift-external-secrets) connects ESO to Vault when configured.
Demo login (userpass): chart vault-demo-auth creates workshop users for the UI. Same password as OpenShift workshop users (Welcome123! by default).
| Vault user | Password (demo) | Policy |
|---|---|---|
admin | Welcome123! | read/write secret/workshop/*, read secret/global/* |
user1 | Welcome123! | read secret/workshop/* |
Vault UI → Sign in → method Username (userpass) → admin / Welcome123!.
Re-apply: bash scripts/apply-vault-demo-auth.sh
oc get route -n vault
oc get pods -n vault
oc get cm vault-demo-login -n vault -o yaml
Target: model keys via Vault paths + ExternalSecret (see Vault & External Secrets product page and Showroom module Vault & External Secrets). Today: scripts/apply-maas-secrets.sh for RHDP day-2 injection.
MaaS API keys (Lightspeed, NeuroFace, OpenShift AI)
Never commit sk-* keys. Inject after hub sync:
export MAAS_KEY_LLAMA='sk-...'
export MAAS_KEY_GRANITE='sk-...' # optional — Lightspeed default model
export MAAS_KEY_DEEPSEEK='sk-...' # optional
bash scripts/apply-maas-secrets.sh
| Secret | Namespace | Consumers |
|---|---|---|
kairos-ai-credentials | kairos-system | Kairos, Lightspeed sync |
openshift-ai-maas-credentials | maas-workshop | ODS playground, MaaS proxies |
neuroface-maas-api-key | neuroface | NeuroFace /api/chat |
maas-granite-credentials | maas-workshop | ODS connection (optional) |
Symptom: Developer Hub /lightspeed or NeuroFace chat 401 — run apply-maas-secrets.sh and restart developer-hub / neuroface.
RHDP can inject litemaas.apiKey into clustergroup values (wired to workshop-kuadrant-apis, openshift-ai-hub, neuroface charts).
Workshop APIs (Kuadrant / Connectivity Link)
Public APIs via ExternalName + Istio ServiceEntry → hub Gateway API → Kuadrant APIProduct (auto-approval).
| URL | Purpose |
|---|---|
https://workshop-apis.<hub-domain>/httpbin/* | httpbin (PlanPolicy bronze/silver/gold) |
https://workshop-apis.<hub-domain>/countries/* | REST Countries |
https://workshop-apis.<hub-domain>/mcp | MCP Gateway (Kuadrant API key) |
https://ai-gateway.<hub-domain>/v1/chat/completions | MaaS LLM (TokenRateLimit free/gold) |
Developer Hub: /kuadrant → API Products → click product name → Request API key (auto-approved) → My API Keys → Authorization: APIKEY …. Or Catalog → API entity → Kuadrant tab.
Fresh install: Catalog ConfigMaps use piped Helm replace (not $var = replace, which truncates file content). After sync, confirm developer-hub-catalog-workshop-kuadrant-apis contains API entities before opening the UI.
Kuadrant catalog sync: APIProduct CRs need backstage.io/owner: group:default/platform-engineering (chart workshop-kuadrant-apis) so the Kuadrant provider registers entities. Static catalog entities live under System workshop-kuadrant-apis. Login via Keycloak (user1 / Welcome123! or platformadmin to edit products).
OpenShift Console: RHCL console plugin — AuthPolicy / PlanPolicy in workshop-kuadrant-apis and ai-gateway-system; APIProducts in same namespaces.
RHCL + mesh order: rhcl-operator syncWave 1, hub-gateway / mesh syncWave 5, workshop-kuadrant-apis syncWave 6. If policies stay Not Accepted, verify ISTIO_GATEWAY_CONTROLLER_NAMES on the Kuadrant operator deployment and restart the pod (see Connectivity Link).
bash scripts/apply-workshop-kuadrant-apis.sh
curl -sk -o /dev/null -w '%{http_code}\n' https://workshop-apis.<hub-domain>/httpbin/get
# Expect 401 without key
Catalog: System workshop-kuadrant-apis — Components + API entities for userN onboarding.
MCP Gateway
Console link: https://mcp-gateway.<hub-domain>/mcp (expect HTTP 200 on /mcp).
When Argo app mcp-gateway stays Unknown and the route returns 503:
bash scripts/apply-mcp-gateway.sh
Requires hub OpenShift AI / Lightspeed stack for MCPServerRegistration backends.
Kafka + Camel MQTT (spokes)
Symptom: mqtt-to-kafka Integration Error; Kafka metadata timeout; no messages on topic temperature.
Fix chain (Git + day-2):
clusterNamein spoke values —charts/region/east|west/values.yamloverrides forindustrial-edge-tst,stormshift,industrial-edge-data-lake(avoids empty broker DNSbroker-0-.).- Kafka advertised host — Strimzi
advertisedHostneeds matching EndpointSlice on hub (*-kafka-brokers-advertised); see Troubleshooting → Kafka advertised DNS. - Istio ambient + Camel K — trait
deploymentwithistio.io/dataplane-mode: noneonmqtt-to-kafka(traitpodis ignored in Camel K 2.10). - Registry 401 — refresh
camel-k-registry-docker, delete Integration + IntegrationKit, re-sync app.
oc get integration mqtt-to-kafka -n industrial-edge-tst-all -o jsonpath='{.status.phase}{"\n"}'
# Expect Running
OpenShift Virtualization (CNV)
Hub-only. Chart cnv-example installs Subscription + HyperConverged + demo VM in cnv-workshop.
Requires nested virtualization on workers (may be unavailable on some RHDP cloud flavors).
oc get csv -n openshift-cnv | grep kubevirt
oc get vm -n cnv-workshop
Related docs
- Getting started — ACM-first install phases
- RHDP field content — catalog parameters
- Validation guide — component matrix
- Troubleshooting — symptom matrix