diff --git a/content/patterns/openshift-aiops-platform/_index.adoc b/content/patterns/openshift-aiops-platform/_index.adoc new file mode 100644 index 000000000..7d89ffaad --- /dev/null +++ b/content/patterns/openshift-aiops-platform/_index.adoc @@ -0,0 +1,42 @@ +--- +title: OpenShift AIOps Self-Healing Platform +date: 2026-02-26 +tier: community +summary: This pattern provides an AI-powered self-healing platform for OpenShift clusters, combining deterministic automation with machine learning for intelligent incident response. +rh_products: +- Red Hat OpenShift Container Platform +- Red Hat OpenShift AI +- Red Hat OpenShift GitOps +- Red Hat OpenShift Pipelines +- Red Hat OpenShift Data Foundation +- Red Hat Advanced Cluster Management for Kubernetes +industries: +- General +pattern_logo: openshift-aiops-platform.png +links: + github: https://github.com/KubeHeal/openshift-aiops-platform + install: getting-started + arch: https://github.com/KubeHeal/openshift-aiops-platform/blob/main/docs/adrs/002-hybrid-self-healing-approach.md + bugs: https://github.com/KubeHeal/openshift-aiops-platform/issues + feedback: https://github.com/KubeHeal/openshift-aiops-platform/discussions +ci: openshift-aiops-platform +--- +:toc: +:imagesdir: /images +:_content-type: ASSEMBLY +include::modules/comm-attributes.adoc[] + +include::modules/oaiops-about.adoc[leveloffset=+1] + +include::modules/oaiops-solution-elements.adoc[leveloffset=+2] + +include::modules/oaiops-architecture.adoc[leveloffset=+1] + +[id="next-steps_openshift-aiops-platform-index"] +== Next steps + +* link:getting-started[Deploy the OpenShift AIOps Self-Healing Platform] +* Review the link:cluster-sizing[cluster sizing requirements] +* Explore link:ideas-for-customization[customization options] to adapt the pattern to your use case +* Read the link:{github-url}/blob/main/docs/adrs/002-hybrid-self-healing-approach.md[hybrid self-healing architecture decision record] +* Join the discussion at link:{feedback-url}[GitHub 
Discussions] diff --git a/content/patterns/openshift-aiops-platform/cluster-sizing.adoc b/content/patterns/openshift-aiops-platform/cluster-sizing.adoc new file mode 100644 index 000000000..7c590a477 --- /dev/null +++ b/content/patterns/openshift-aiops-platform/cluster-sizing.adoc @@ -0,0 +1,162 @@ +--- +title: Cluster sizing +weight: 50 +aliases: /openshift-aiops-platform/openshift-aiops-platform-cluster-sizing/ +--- + +:toc: +:imagesdir: /images +:_content-type: ASSEMBLY + +include::modules/comm-attributes.adoc[] +include::modules/openshift-aiops-platform/metadata-openshift-aiops-platform.adoc[] + +include::modules/cluster-sizing-template.adoc[] + +[id="additional-sizing-considerations-openshift-aiops"] +== Additional sizing considerations for AIOps workloads + +The OpenShift AIOps Self-Healing Platform has specific resource requirements beyond the baseline cluster sizing due to its observability, machine learning, and data storage needs. + +=== Hub cluster sizing recommendations + +The hub cluster hosts the majority of the AIOps platform components including observability aggregation, ML training and inference, and the self-healing decision engine. 
+ +The pattern supports two deployment topologies, which are automatically detected during deployment using `make show-cluster-info`: + +*Standard HighlyAvailable Topology*:: ++ +Recommended for production multi-cluster deployments: ++ +*Control Plane Nodes*:: +* Minimum: 3 nodes +* vCPUs per node: 8 +* Memory per node: 32 GB +* Sufficient for ACM, GitOps, and platform operators + +*Compute Nodes*:: +* Minimum: 6 nodes +* vCPUs per node: 16 +* Memory per node: 64 GB +* Required for OpenShift AI workloads, observability stack, and data storage + +*Total Hub Cluster Resources*:: +* Control plane: 24 vCPUs, 96 GB memory +* Compute: 96 vCPUs, 384 GB memory +* Combined: 120 vCPUs, 480 GB memory + +*Single Node OpenShift (SNO) Topology*:: ++ +Suitable for edge deployments, development, or single-cluster self-healing scenarios: ++ +*Single Node Requirements*:: +* Minimum: 1 node +* vCPUs: 8 minimum, 16+ recommended +* Memory: 32 GB minimum, 64 GB recommended +* Storage: 120 GB minimum, 250 GB recommended +* Combined control plane and compute workloads on one node ++ +[NOTE] +==== +SNO deployments have reduced high availability but are suitable for edge locations, development environments, or scenarios where a single cluster is being managed. The pattern automatically detects SNO topology and adjusts resource allocation and storage configuration accordingly. + +To verify cluster topology before deployment: +[source,terminal] +---- +make show-cluster-info +---- + +During deployment, OpenShift Data Foundation (ODF) installation is automated via `make configure-cluster`, which adjusts for SNO topology when detected. 
+==== + +=== Spoke cluster requirements + +Spoke clusters have minimal overhead from the AIOps platform since most processing occurs on the hub: + +* Standard OpenShift cluster sizing for your workloads +* Add 2 vCPUs and 4 GB memory per node for observability agents (Prometheus, Fluentd, OpenTelemetry) +* No additional nodes required specifically for AIOps + +=== Storage considerations + +The pattern requires persistent storage for several components: + +*Metrics Storage (Thanos)*:: +* 500 GB minimum for 30 days of retention +* 1 TB recommended for 60 days +* Scale based on number of clusters and metric cardinality +* Storage class: Block storage with good IOPS (gp3, Premium SSD) + +*Log Storage (Loki)*:: +* 200 GB minimum for 15 days of retention +* 500 GB recommended for 30 days +* Scale based on log volume from applications +* Storage class: Block or object storage + +*Model Storage (S3-compatible)*:: +* 50 GB minimum for model artifacts and registry +* 100 GB recommended for multiple model versions and A/B testing +* Storage class: Object storage (S3, MinIO, ODF) + +*Incident History Database*:: +* 50 GB minimum for incident data and ML training datasets +* 100 GB recommended for extended history +* Storage class: Block storage with good IOPS + +*Total Storage Requirements*:: +* Minimum: 800 GB +* Recommended: 1.7 TB +* Consider using OpenShift Data Foundation for unified storage + +=== Scaling recommendations by cluster count + +Resource requirements scale with the number of managed clusters: + +*1-5 Spoke Clusters*:: +* Use baseline hub sizing (6 compute nodes) +* 1 TB total storage +* Suitable for development and small production deployments + +*6-20 Spoke Clusters*:: +* Scale to 7-9 compute nodes +* 2 TB total storage +* Consider dedicated nodes for observability workloads +* May require metrics downsampling for cost optimization + +*21-50 Spoke Clusters*:: +* Scale to 10-15 compute nodes +* 4 TB total storage +* Use separate node pools for ML, 
observability, and data storage +* Implement metric federation and sampling strategies +* Consider dedicated Kafka or similar for event streaming + +*50+ Spoke Clusters*:: +* Enterprise deployment requiring detailed capacity planning +* Consider horizontal scaling of observability components +* Implement tiered storage with hot/warm/cold data lifecycle +* May require multiple hub clusters for geographic distribution +* Consult Red Hat for sizing recommendations + +=== Network requirements + +*Bandwidth*:: +* Each spoke cluster generates approximately 1-5 Mbps of metrics and logs +* Hub cluster needs sufficient ingress bandwidth: 50 Mbps for 10 spokes, 250 Mbps for 50 spokes +* Model inference is low bandwidth (<1 Mbps) + +*Latency*:: +* Observability can tolerate latency up to 1 second +* Real-time self-healing performs best with <200ms latency to spoke clusters +* Consider regional hub clusters for global deployments + +=== GPU requirements (Optional) + +GPU acceleration is optional but recommended for ML training: + +*ML Model Training*:: +* Not required for inference (CPU-based inference is sufficient) +* Recommended for faster model training: 1-2 NVIDIA GPUs (T4, V100, or A100) +* Reduces training time from hours to minutes for large datasets +* Use GPU node pools with taints to reserve for ML workloads + +The baseline cluster sizing includes sufficient CPU resources for inference. Add GPUs only if training time is a concern or if experimenting with larger neural network models. 
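The bandwidth and storage guidance above can be condensed into a quick capacity estimator. The sketch below is illustrative only: the per-spoke telemetry rate and the storage tiers are taken from the figures in this section, and real deployments should be validated against observed metric and log volume.

```python
def hub_ingress_mbps(spoke_clusters, mbps_per_spoke=5.0):
    """Worst-case hub ingress bandwidth from spoke telemetry.

    Each spoke generates roughly 1-5 Mbps of metrics and logs; this uses
    the upper bound, matching the 50 Mbps / 10 spokes guidance above.
    """
    return spoke_clusters * mbps_per_spoke


def storage_tier_gb(spoke_clusters):
    """Total hub storage guidance (GB) by managed-cluster count."""
    if spoke_clusters <= 5:
        return 1000   # 1 TB
    if spoke_clusters <= 20:
        return 2000   # 2 TB
    if spoke_clusters <= 50:
        return 4000   # 4 TB
    raise ValueError("50+ spokes: perform detailed capacity planning")
```

For example, a 10-spoke hub should plan for roughly 50 Mbps of ingress and 2 TB of storage under this model.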
diff --git a/content/patterns/openshift-aiops-platform/getting-started.adoc b/content/patterns/openshift-aiops-platform/getting-started.adoc new file mode 100644 index 000000000..7e8bd69d2 --- /dev/null +++ b/content/patterns/openshift-aiops-platform/getting-started.adoc @@ -0,0 +1,20 @@ +--- +title: Getting started +weight: 10 +aliases: /openshift-aiops-platform/getting-started/ +--- +:toc: +:imagesdir: /images +:_content-type: ASSEMBLY +include::modules/comm-attributes.adoc[] + +include::modules/oaiops-deploying.adoc[leveloffset=+1] + +[id="next-steps_openshift-aiops-platform-getting-started"] +== Next steps + +* Review link:../cluster-sizing[cluster sizing requirements] to ensure your infrastructure meets the pattern's needs +* Explore link:../ideas-for-customization[customization options] to adapt the self-healing platform to your environment +* Check the Grafana dashboards to monitor self-healing activity and model performance +* Review and extend the runbook library for your specific use cases +* Configure integrations with external systems like ServiceNow, PagerDuty, or Slack diff --git a/content/patterns/openshift-aiops-platform/ideas-for-customization.adoc b/content/patterns/openshift-aiops-platform/ideas-for-customization.adoc new file mode 100644 index 000000000..b869c8688 --- /dev/null +++ b/content/patterns/openshift-aiops-platform/ideas-for-customization.adoc @@ -0,0 +1,1600 @@ +--- +title: Ideas for customization +weight: 60 +aliases: /openshift-aiops-platform/ideas-for-customization/ +--- +:toc: +:imagesdir: /images +:_content-type: ASSEMBLY +include::modules/comm-attributes.adoc[] + +[id="about-customizing-openshift-aiops-pattern"] += About customizing the OpenShift AIOps Self-Healing Platform + +One of the major goals of the Validated Patterns development process is to create modular and customizable solutions. 
The OpenShift AIOps Self-Healing Platform provides a framework for intelligent, automated incident response that can be adapted to your specific infrastructure, applications, and operational requirements. + +The pattern's hybrid approach combining deterministic runbooks with machine learning predictions makes it highly customizable. You can start with runbooks for well-understood problems in your environment, then gradually introduce ML-based automation as you collect incident data and train models specific to your workload patterns. + +[id="real-world-deployment-examples"] +== Real-world deployment examples + +This section provides examples from actual production deployments to help you understand how the pattern components work together in practice. + +=== Multi-model inference serving + +A typical deployment runs multiple ML models concurrently for different prediction tasks: + +[source,terminal] +---- +oc get inferenceservice -n self-healing-platform -o wide +---- + +Example output from a running cluster: + +---- +NAME URL READY +anomaly-detector http://anomaly-detector-predictor.self-healing-platform.svc.cluster.local True +predictive-analytics http://predictive-analytics-predictor.self-healing-platform.svc.cluster.local True +---- + +The `anomaly-detector` model uses Isolation Forest for detecting unusual patterns in metrics, while `predictive-analytics` uses Random Forest Regressor for forecasting resource usage 1 hour ahead. Both models are trained weekly via automated notebooks and deployed via KServe, accessible to the Go-based coordination engine for decision-making. 
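Client code can query these models directly. The sketch below assumes the InferenceServices expose KServe's V1 REST protocol (`/v1/models/<name>:predict`) and that the anomaly detector accepts a flat feature vector and returns scikit-learn's Isolation Forest convention (-1 anomaly, 1 normal); adjust the payload to your model's actual input schema.

```python
import json
import urllib.request

V1_PREDICT = "/v1/models/{model}:predict"


def build_predict_request(base_url, model, features):
    """Assemble a KServe V1-protocol predict call for one feature vector."""
    url = base_url + V1_PREDICT.format(model=model)
    payload = {"instances": [features]}
    return url, payload


def score_anomaly(features):
    """Score one sample against the anomaly-detector InferenceService."""
    url, payload = build_predict_request(
        "http://anomaly-detector-predictor.self-healing-platform.svc.cluster.local",
        "anomaly-detector",
        features,
    )
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        # V1 protocol returns {"predictions": [...]}
        return json.load(resp)["predictions"][0]
```

The coordination engine performs the equivalent call in Go; this Python form is convenient for experimentation from the Jupyter workbench.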
+ +=== Topology-aware storage configuration + +Storage configuration adapts automatically based on cluster topology (SNO vs HA): + +[source,terminal] +---- +oc get pvc -n self-healing-platform +---- + +*Example: HA cluster storage pattern (ocs-storagecluster-cephfs):* + +---- +NAME CAPACITY ACCESS MODES STORAGECLASS +model-storage-pvc 10Gi RWX ocs-storagecluster-cephfs +model-storage-gpu-pvc 10Gi RWO gp3-csi +workbench-data-development 20Gi RWO ocs-storagecluster-cephfs +---- + +*Example: SNO cluster storage pattern (gp3-csi):* + +---- +NAME CAPACITY ACCESS MODES STORAGECLASS +model-storage-pvc 10Gi RWO gp3-csi +workbench-data-development 20Gi RWO gp3-csi +---- + +*Storage class selection by topology:* + +* *HA clusters:* Use `ocs-storagecluster-cephfs` (RWX) for shared model storage accessed by multiple pods, plus `gp3-csi` (RWO) for GPU training nodes (CephFS unavailable on GPU nodes) +* *SNO clusters:* Use `gp3-csi` (RWO) exclusively - simpler, more reliable, and works everywhere +* *S3 storage:* NooBaa S3 available on both topologies (full ODF on HA, MCG-only on SNO) + +=== Platform configuration tuning + +Customize operational parameters via the platform ConfigMap: + +[source,yaml] +---- +apiVersion: v1 +kind: ConfigMap +metadata: + name: platform-config + namespace: self-healing-platform +data: + # Cluster identification + CLUSTER_NAME: hub + + # Service ports + COORDINATION_ENGINE_PORT: "8080" + METRICS_PORT: "8090" + + # Environment settings + ENVIRONMENT: development # or production, staging + LOG_LEVEL: INFO # DEBUG for troubleshooting, WARN for production + + # Timeouts and intervals + MODEL_SERVING_TIMEOUT: 30s # Increase for complex models + PROMETHEUS_SCRAPE_INTERVAL: 30s # Balance between freshness and load + + # Optional: Custom settings + MAX_CONCURRENT_REMEDIATIONS: "5" + ENABLE_DRY_RUN_MODE: "false" +---- + +*Tuning recommendations:* + +* *Development:* `LOG_LEVEL: DEBUG`, shorter timeouts for faster iteration +* *Production:* `LOG_LEVEL: WARN`, 
longer timeouts for stability, enable rate limiting + +=== Core platform components + +The pattern deploys these core services: + +*Coordination Engine (Go-based):* +- Deployment: `quay.io/takinosh/openshift-coordination-engine:ocp-4.18-latest` +- Ports: 8080 (API), 9090 (Prometheus metrics) +- Resources: 200m CPU / 256Mi memory (requests), 500m CPU / 512Mi memory (limits) +- Function: Orchestrates remediation actions, integrates with KServe models + +*MCP Server (Go-based standalone):* +- Repository: `openshift-cluster-health-mcp` +- Protocol: Model Context Protocol (MCP) over HTTP +- Port: 8080 (ClusterIP service) +- Tools: 12+ MCP tools for cluster health, pod management, Prometheus queries +- Function: Provides MCP-compliant interface for AI assistants (OpenShift Lightspeed) + +Both components are implemented in Go for better performance and native Kubernetes client support. + +=== Jupyter notebook catalog + +The pattern includes 33 production-ready Jupyter notebooks organized in 9 directories, covering the complete self-healing workflow: + +*00-setup/* (3 notebooks) - Platform prerequisites +- Platform readiness validation +- KServe model onboarding +- Python environment setup + +*01-data-collection/* (5 notebooks) - Metrics and event collection +- Prometheus metrics collection +- OpenShift events analysis +- Log parsing analysis +- Feature store demo +- Synthetic anomaly generation + +*02-anomaly-detection/* (5 notebooks) - ML model training +- Isolation Forest implementation (primary anomaly detector) +- Time-series anomaly detection (ARIMA, Prophet) +- LSTM-based prediction (GPU-accelerated) +- Ensemble anomaly methods +- Predictive analytics with KServe (Random Forest Regressor) + +*03-self-healing-logic/* (3 notebooks) - Remediation workflows +- Rule-based remediation +- AI-driven decision making +- Hybrid healing workflows + +*04-model-serving/* (3 notebooks) - KServe deployment +- KServe model deployment +- Inference pipeline setup +- Model versioning 
and MLOps + +*05-end-to-end-scenarios/* (4 notebooks) - Complete demos +- Complete platform demo +- Pod crash loop healing +- Resource exhaustion detection +- Network anomaly response + +*06-mcp-lightspeed-integration/* (4 notebooks) - AI assistant integration +- MCP server integration +- OpenShift Lightspeed integration +- LlamaStack integration +- End-to-end troubleshooting workflow + +*07-monitoring-operations/* (3 notebooks) - Platform observability +- Prometheus metrics monitoring +- Model performance monitoring +- Healing success tracking + +*08-advanced-scenarios/* (3 notebooks) - Production use cases +- Security incident response automation +- Predictive scaling and capacity planning +- Cost optimization and resource efficiency + +These notebooks are validated as part of the deployment process and serve as customization templates for your environment. + +[id="understanding-customization-approaches"] +== Understanding customization approaches + +There are several levels at which you can customize this pattern: + +*Runbook Customization*:: +Add, modify, or remove runbooks to match the specific incidents common in your environment. Runbooks are defined as YAML ConfigMaps and can remediate infrastructure issues, application problems, or integrate with external systems. + +*AI Agent Customization*:: +Develop custom AI agents for specialized incident response scenarios. The pattern includes a comprehensive link:https://github.com/KubeHeal/openshift-aiops-platform/blob/main/AGENTS.md[AI Agent Development Guide] covering agent architecture, state management, tool integration, and best practices. Agents can be developed and tested using the included Jupyter workbench environment with interactive notebooks. + +*ML Model Customization*:: +Train custom models on your historical incident data to improve prediction accuracy for your specific workload characteristics. 
The pattern includes a complete MLOps pipeline for this purpose, with Jupyter notebooks for feature engineering, model training, and evaluation workflows. + +*Observability Customization*:: +Extend the observability stack to collect custom metrics, traces, or logs specific to your applications. Add custom Prometheus exporters, OpenTelemetry instrumentation, or log parsing rules. + +*Integration Customization*:: +Integrate with your existing tooling such as ITSM systems (ServiceNow, Jira), communication platforms (Slack, Microsoft Teams), or monitoring tools (Datadog, Splunk). + +*Policy Customization*:: +Adjust confidence thresholds, auto-remediation policies, change windows, and approval workflows to match your operational risk tolerance and compliance requirements. + +*Deployment Topology*:: +Customize for your deployment topology - the pattern supports both standard HighlyAvailable multi-node clusters and Single Node OpenShift (SNO) deployments with automatic topology detection. + +[id="industry-specific-extensions"] +== Industry-specific extensions + +The pattern can be adapted for different industries with specific operational challenges: + +=== Financial services + +Financial institutions could customize the pattern to: + +* *Detect and remediate trading platform issues*: Automatically restart hung trading engines, clear message queues, or fail over to backup systems when latency thresholds are exceeded +* *Maintain compliance during remediation*: Ensure all automated actions are logged with full audit trails, include compliance metadata in Git commits, and enforce approval workflows for production changes +* *Handle high-frequency transaction workloads*: Train models to recognize degradation patterns in transaction processing rates and automatically scale resources or reroute traffic before SLA violations occur +* *Integrate with fraud detection*: Coordinate with fraud detection systems to automatically quarantine suspicious services or scale up security 
monitoring when anomalies are detected + +=== Telecommunications + +Telco operators could extend the pattern for: + +* *Network function virtualization (NFV) self-healing*: Detect and remediate issues in virtualized network functions (vNF) such as routers, firewalls, or load balancers running on OpenShift +* *5G core network automation*: Automatically handle scaling, failure recovery, and configuration drift in 5G core components +* *Edge site management*: Manage self-healing across hundreds or thousands of edge OpenShift clusters with intermittent connectivity +* *Service Level Agreement (SLA) protection*: Proactively remediate issues before they cause SLA violations by predicting degradation based on metric trends + +=== Healthcare + +Healthcare organizations could customize for: + +* *HIPAA-compliant remediation*: Ensure all automated actions maintain data privacy, include encryption for sensitive logs, and restrict access to patient data during troubleshooting +* *Critical system prioritization*: Assign priority levels to different workloads (e.g., life-critical systems vs. 
administrative apps) and route high-priority incidents to immediate automated remediation while lower-priority issues follow approval workflows +* *Integration with medical device management*: Coordinate with IoT device management platforms to detect and remediate connectivity or data flow issues from medical devices +* *Disaster recovery automation*: Automatically fail over critical healthcare applications to disaster recovery sites when primary data center issues are detected + +=== Retail and e-commerce + +Retail organizations could adapt the pattern for: + +* *Seasonal traffic handling*: Automatically scale infrastructure during peak shopping periods (Black Friday, holiday season) and detect when auto-scaling is not keeping up with demand +* *Payment processing reliability*: Immediately detect and remediate payment gateway issues to prevent revenue loss, with escalation to humans for financial system changes +* *Inventory system resilience*: Automatically remediate database connection issues, cache failures, or API timeouts in inventory management systems +* *Point-of-sale (POS) edge management*: Manage self-healing across distributed POS systems in retail stores, handling connectivity issues and local system failures + +[id="extending-runbook-library"] +== Extending the runbook library + +[NOTE] +==== +**Status:** Coming Soon + +The declarative runbook library feature (ConfigMap-based runbooks) is planned for a future release. Currently, remediation actions are implemented through notebooks that integrate with the Coordination Engine REST API. +==== + +The pattern's hybrid self-healing approach combines deterministic automation with AI-driven analysis. While a declarative runbook format is planned, you can currently implement deterministic remediation logic using the Coordination Engine API. 
+ +=== Current approach: Coordination Engine API integration + +Notebooks can submit anomalies and trigger remediation actions programmatically: + +[source,python] +---- +import requests + +# Submit anomaly to coordination engine +response = requests.post( + 'http://coordination-engine.self-healing-platform.svc.cluster.local:8080/api/v1/anomalies', + json={ + 'timestamp': '2026-03-06T12:00:00Z', + 'type': 'resource_exhaustion', + 'severity': 'critical', + 'namespace': 'production-apps', + 'confidence_score': 0.92, + 'recommended_action': 'restart_deployment', + 'metadata': { + 'deployment': 'web-frontend', + 'memory_usage_percent': 95 + } + } +) + +# Check remediation status +anomaly_id = response.json()['id'] +status = requests.get( + f'http://coordination-engine:8080/api/v1/anomalies/{anomaly_id}' +) +print(f"Remediation status: {status.json()['status']}") +---- + +*Available Coordination Engine endpoints:* + +|=== +|Endpoint |Method |Purpose + +|`/health` +|GET +|Health check and readiness probe + +|`/api/v1/anomalies` +|POST +|Submit anomaly for processing + +|`/api/v1/anomalies/{id}` +|GET +|Get anomaly status and resolution + +|`/api/v1/remediate` +|POST +|Trigger remediation action directly + +|`/api/v1/status` +|GET +|Engine status and metrics + +|`/metrics` +|GET +|Prometheus metrics endpoint +|=== + +See the `03-self-healing-logic/` notebooks for complete examples of integrating with the coordination engine. 
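A common pattern after submitting an anomaly is to poll its status until the engine reaches a terminal state. The following helper is a sketch: the terminal status names and the backoff parameters are assumptions, not part of the documented API, so check them against your engine version.

```python
import json
import time
import urllib.request

ENGINE = "http://coordination-engine.self-healing-platform.svc.cluster.local:8080"
# Assumed terminal states; verify against your coordination engine version.
TERMINAL_STATUSES = {"resolved", "failed", "escalated"}


def backoff_schedule(attempts, base=2.0, cap=30.0):
    """Exponential backoff delays in seconds: 2, 4, 8, ... capped at `cap`."""
    return [min(base * 2 ** i, cap) for i in range(attempts)]


def wait_for_resolution(anomaly_id, attempts=8):
    """Poll GET /api/v1/anomalies/{id} until a terminal status or give up."""
    for delay in backoff_schedule(attempts):
        with urllib.request.urlopen(
            f"{ENGINE}/api/v1/anomalies/{anomaly_id}", timeout=10
        ) as resp:
            status = json.load(resp)["status"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(delay)
    return "timeout"
```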
+ +=== Planned: Declarative runbook format + +A future release will support ConfigMap-based runbook definitions: + +[source,yaml] +---- +# Future feature - not yet implemented +apiVersion: v1 +kind: ConfigMap +metadata: + name: runbook-high-memory-usage + namespace: self-healing-platform + labels: + aiops.kubeheal.io/runbook: "true" +data: + runbook.yaml: | + name: restart-high-memory-pod + description: Restart pods with memory usage above 90% + + triggers: + - type: prometheus + query: | + (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.9 + duration: 5m + + remediation: + type: coordination_engine_api + action: restart_deployment +---- + +This declarative format will allow platform administrators to define remediation procedures without writing Python code. + +[id="integrating-ai-assistants"] +== Integrating with AI assistants + +The pattern includes a Model Context Protocol (MCP) server that provides AI assistants with access to cluster health and operations data. 
+ +=== MCP Server integration + +The Go-based MCP server is pre-deployed and provides 12+ tools for cluster operations: + +*Deployed MCP Server:* +- Service: `http://mcp-server.self-healing-platform.svc.cluster.local:8080` +- Protocol: Model Context Protocol (MCP) over HTTP +- Integration: OpenShift Lightspeed uses this server automatically + +*Available MCP tools:* + +- `get-cluster-health` - Node, pod, and deployment health status +- `list-pods` - Pod listing with label selectors +- `get-prometheus-metrics` - Query Prometheus for cluster metrics +- `analyze-logs` - Container log analysis +- Additional cluster management tools + +=== OpenShift Lightspeed integration + +The pattern integrates with OpenShift Lightspeed for AI-powered cluster management: + +[source,terminal] +---- +# Access OpenShift Lightspeed in the OpenShift Console +# Navigate to: Help menu → OpenShift Lightspeed + +# Example queries: +# - "Show me pods with high memory usage in self-healing-platform namespace" +# - "What anomalies has the coordination engine detected?" +# - "Check the status of InferenceServices" +---- + +OpenShift Lightspeed automatically uses the MCP server to access real-time cluster data when answering questions. 
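Custom clients can also talk to the MCP server directly. MCP carries tool invocations as JSON-RPC 2.0 `tools/call` requests; the HTTP path, tool argument names, and response shape below are assumptions for illustration, so verify them against the `openshift-cluster-health-mcp` repository before relying on them.

```python
import json
import urllib.request

MCP_URL = "http://mcp-server.self-healing-platform.svc.cluster.local:8080"


def mcp_tool_call(name, arguments, request_id=1):
    """Build a JSON-RPC 2.0 'tools/call' request as defined by MCP."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def get_cluster_health(namespace=None):
    """Invoke the get-cluster-health tool on the deployed MCP server.

    The 'namespace' argument name is hypothetical; check the tool's schema.
    """
    args = {"namespace": namespace} if namespace else {}
    body = json.dumps(mcp_tool_call("get-cluster-health", args)).encode()
    req = urllib.request.Request(
        MCP_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=15) as resp:
        return json.load(resp)["result"]
```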
+ +=== Custom remediation logic + +For custom remediation workflows, develop Python notebooks that integrate with the Coordination Engine API: + +[source,python] +---- +# Example: Custom remediation notebook +import requests +from datetime import datetime + +def detect_custom_anomaly(): + """Implement custom anomaly detection logic.""" + # Your custom detection code + return { + 'type': 'custom_performance_degradation', + 'severity': 'warning', + 'confidence': 0.85 + } + +def submit_to_coordination_engine(anomaly): + """Submit detected anomaly for remediation.""" + response = requests.post( + 'http://coordination-engine:8080/api/v1/anomalies', + json={ + 'timestamp': datetime.utcnow().isoformat(), + **anomaly + } + ) + return response.json() + +# Detect and submit +anomaly = detect_custom_anomaly() +result = submit_to_coordination_engine(anomaly) +print(f"Submitted anomaly {result['id']}, status: {result['status']}") +---- + +See the `06-mcp-lightspeed-integration/` notebooks for complete integration examples. + +[id="development-workflow-best-practices"] +== Development workflow best practices + +The Jupyter workbench environment provides a complete development platform for customizing the self-healing platform. + +=== Accessing the development environment + +The Jupyter workbench is accessed through the OpenShift AI dashboard with OAuth authentication. + +.Procedure + +. Get the OpenShift AI dashboard URL: ++ +[source,terminal] +---- +oc get route data-science-gateway -n openshift-ingress -o jsonpath='{.spec.host}' +---- ++ +Example output: ++ +---- +data-science-gateway.apps.cluster-7r4mf.7r4mf.sandbox458.opentlc.com +---- + +. Open the dashboard URL in your browser (use `https://`): ++ +---- +https://data-science-gateway.apps. +---- + +. Authenticate with your OpenShift credentials (OAuth2 authentication) + +. Navigate to **Data Science Projects** → **self-healing-platform** + +. 
In the **Workbenches** section, click **Open** next to `self-healing-workbench` + +. The JupyterLab environment opens with: ++ +* Pre-loaded notebooks from the repository +* Persistent storage mounted at `/opt/app-root/src/data` +* Model storage mounted at `/mnt/models` +* Access to coordination engine and KServe models + +. Verify the workbench configuration: ++ +[source,terminal] +---- +# From a terminal within JupyterLab +df -h | grep -E "data|models" +---- ++ +Example output: ++ +---- +/dev/mapper/vg0-workbench 20G 2.1G 18G 11% /opt/app-root/src/data +/dev/mapper/vg0-models 10G 1.2G 8.8G 12% /mnt/models +---- + +=== Notebook organization structure + +Organize your custom notebooks following the pattern's structure: + +---- +notebooks/ +├── 01-environment-setup/ +│ ├── 01-validate-cluster-access.ipynb +│ └── 02-configure-credentials.ipynb +├── 02-anomaly-detection/ +│ ├── 01-isolation-forest-implementation.ipynb +│ ├── 02-ensemble-anomaly-methods.ipynb +│ └── 05-predictive-analytics-kserve.ipynb +├── 03-incident-response/ +│ ├── 01-rule-based-remediation.ipynb +│ ├── 02-ai-driven-decision-making.ipynb +│ └── 03-hybrid-healing-workflows.ipynb +├── 04-integration/ +│ ├── 01-mcp-server-integration.ipynb +│ ├── 02-openshift-lightspeed-integration.ipynb +│ └── 03-llamastack-integration.ipynb +└── 05-custom/ + ├── my-custom-model.ipynb + └── my-custom-runbook.ipynb +---- + +=== Validation notebook pattern + +All notebooks should include validation cells to ensure they execute correctly: + +[source,python] +---- +# Validation cell (at end of notebook) +import os +import sys + +def validate_notebook_execution(): + """Validate this notebook executed successfully.""" + required_variables = ['model', 'accuracy_score', 'inference_service_name'] + + for var in required_variables: + if var not in globals(): + print(f"❌ Validation FAILED: {var} not defined") + sys.exit(1) + + if accuracy_score < 0.70: + print(f"❌ Validation FAILED: accuracy {accuracy_score} below threshold") + 
sys.exit(1) + + print("✓ Notebook validation PASSED") + return True + +# Run validation +validate_notebook_execution() +---- + +This pattern enables automated testing via NotebookValidationJobs. + +=== Testing changes before deployment + +Before deploying custom models or remediation logic to production: + +. Test in the Jupyter environment interactively +. Create a NotebookValidationJob custom resource for automated testing: ++ +[source,yaml] +---- +apiVersion: mlops.mlops.dev/v1alpha1 +kind: NotebookValidationJob +metadata: + name: my-custom-notebook-validation + namespace: self-healing-platform + labels: + app.kubernetes.io/name: my-custom-notebook-validation + app.kubernetes.io/component: notebook-validation +spec: + notebook: + git: + url: https://github.com/KubeHeal/openshift-aiops-platform.git + ref: main + path: notebooks/05-custom/my-custom-model.ipynb + podConfig: + containerImage: image-registry.openshift-image-registry.svc:5000/self-healing-platform/notebook-validator:latest + serviceAccountName: self-healing-workbench + resources: + requests: + cpu: 2000m + memory: 4Gi + limits: + cpu: 4000m + memory: 8Gi + volumeMounts: + - name: model-storage + mountPath: /mnt/models + volumes: + - name: model-storage + persistentVolumeClaim: + claimName: model-storage-pvc + envFrom: + - secretRef: + name: model-storage-config + timeout: 45m +---- + +. Apply the NotebookValidationJob: ++ +[source,terminal] +---- +oc apply -f my-custom-notebook-validation.yaml +---- + +. Monitor validation job status: ++ +[source,terminal] +---- +oc get notebookvalidationjob my-custom-notebook-validation -n self-healing-platform -w +---- + +. 
Review validation results: ++ +[source,terminal] +---- +# Get the validation job pod name +POD=$(oc get pods -n self-healing-platform -l app.kubernetes.io/name=my-custom-notebook-validation --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1].metadata.name}') + +# View logs +oc logs -n self-healing-platform $POD +---- ++ +Successful validation shows: ++ +---- +✓ Notebook validation PASSED +All cells executed successfully +---- + +[id="training-custom-ml-models"] +== Training custom ML models + +The pattern includes an MLOps pipeline for training incident prediction models on your data. + +The Jupyter workbench environment includes pre-configured notebooks for model training workflows: + +.Procedure + +. Access Jupyter notebooks for model training: ++ +[source,terminal] +---- +oc port-forward self-healing-workbench-0 8888:8888 -n self-healing-platform +---- ++ +Open http://localhost:8888 and navigate to the `notebooks/ml-training/` directory. + +. Review the available training notebooks: ++ +* `data-preparation.ipynb` - Data collection and preprocessing +* `feature-engineering.ipynb` - Creating features from incident data +* `model-training.ipynb` - Training and hyperparameter tuning +* `model-evaluation.ipynb` - Evaluating model performance +* `model-deployment.ipynb` - Deploying models via KServe + +. Collect historical incident data from your environment. The pattern stores incidents in a database with features including: +* Alert name and labels +* Metric values at incident time +* Error messages from logs +* Remediation action taken +* Outcome (success/failure) +* Time to resolution + +. Prepare a training dataset by exporting incidents from the database: ++ +[source,terminal] +---- +oc exec -n aiops-mlops deployment/incident-collector -- \ + python export_training_data.py --start-date 2025-01-01 --output-file /data/incidents.csv +---- + +. 
Review the feature engineering steps in the Tekton Pipeline definition:
++
+[source,terminal]
+----
+oc get pipeline incident-prediction-training -n aiops-mlops -o yaml
+----
+
+. Customize feature engineering for your environment by editing the pipeline ConfigMap:
++
+[source,yaml]
+----
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: feature-engineering-config
+  namespace: aiops-mlops
+data:
+  features.yaml: |
+    # Add custom metrics as features
+    custom_metrics:
+      - name: api_latency_p95
+        query: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
+      - name: error_rate
+        query: rate(http_requests_total{status=~"5.."}[5m])
+
+    # Configure temporal features
+    temporal_features:
+      - hour_of_day
+      - day_of_week
+      - is_business_hours
+      - is_change_window
+
+    # Feature transformations
+    transformations:
+      - type: log_transform
+        columns: [api_latency_p95, memory_usage_bytes]
+      - type: one_hot_encoding
+        columns: [alert_name, namespace]
+----
+
+. Trigger a model training run by creating a PipelineRun for the training pipeline:
++
+[source,terminal]
+----
+oc create -f - <<EOF
+apiVersion: tekton.dev/v1
+kind: PipelineRun
+metadata:
+  generateName: incident-prediction-training-
+  namespace: aiops-mlops
+spec:
+  pipelineRef:
+    name: incident-prediction-training
+EOF
+----
+
+[id="notification-customization"]
+== Notification customization
+
+Customize how the platform notifies your teams about self-healing actions and escalations. Message templates support variables such as `{{ cluster_name }}` and `{{ alert_name }}`, and Slack-formatted links such as `<{{ grafana_url }}|View Dashboard>`.
+
+=== Slack integration
+
+Configure Slack notifications for escalations that require human attention:
+
+[source,yaml]
+----
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: slack-integration
+  namespace: aiops-engine
+data:
+  config.yaml: |
+    slack:
+      # Notify on escalation
+      notify_on_escalation:
+        enabled: true
+        channel: "#platform-ops-urgent"
+        mention_users: ["@oncall"]
+        message_template: |
+          :warning: *Manual Intervention Required*
+          *Cluster:* {{ cluster_name }}
+          *Issue:* {{ alert_name }}
+          *AI Suggestion:* {{ predicted_remediation }}
+          *Confidence:* {{ confidence_score | round(2) }}% (below threshold)
+          <{{ incident_url }}|View Incident Details>
+----
+
+=== PagerDuty integration
+
+Integrate with PagerDuty for on-call escalations:
+
+[source,yaml]
+----
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: pagerduty-integration
+  namespace: aiops-engine
+data:
+  config.yaml: |
+    pagerduty:
+      integration_key_secret: pagerduty-key
+
+      # Automatically resolve PD incidents when self-healing succeeds
+      auto_resolve_on_success: true
+
+      # Only create PD incident if auto-remediation fails
+      create_incident_on:
+      - 
remediation_status: "failed" + - severity: "critical" + + # De-duplicate with existing alerts + dedup_key_template: "aiops-{{ cluster_name }}-{{ namespace }}-{{ alert_name }}" +---- + +[id="confidence-threshold-tuning"] +== Confidence threshold tuning + +The confidence threshold determines when the platform executes remediation automatically versus escalating to humans. You can tune this based on your risk tolerance: + +.Procedure + +. Review current confidence score distribution in Grafana: ++ +Navigate to the "ML Model Performance" dashboard and check the "Confidence Score Distribution" panel. + +. Analyze the trade-off between automation coverage and false positive rate: ++ +[source,terminal] +---- +oc exec -n aiops-mlops deployment/model-analyzer -- \ + python analyze_confidence.py --threshold 0.80 --show-coverage +---- + +. Adjust the threshold in `values-global.yaml`: ++ +[source,yaml] +---- +global: + aiops: + # Default: 0.80 (80% confidence required for auto-remediation) + confidence_threshold: 0.90 # More conservative + + # Optionally set different thresholds by severity + confidence_thresholds_by_severity: + critical: 0.95 # Very high confidence for critical + warning: 0.80 # Lower threshold for warnings + info: 0.70 # Even lower for informational +---- + +. Commit and push changes to trigger GitOps sync: ++ +[source,terminal] +---- +git add values-global.yaml +git commit -m "Adjust confidence thresholds for production" +git push origin main +---- + +. Monitor the effect on automation rate and escalation volume. 
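The per-severity thresholds above translate into a simple decision rule. The following sketch is illustrative only (the `decide_action` helper and its return values are hypothetical, not part of the pattern's API); it mirrors the semantics of `confidence_thresholds_by_severity`, falling back to the global threshold when a severity has no override:

```python
# Hypothetical helper mirroring the values-global.yaml semantics above;
# not part of the pattern's actual API.
DEFAULT_THRESHOLD = 0.80
THRESHOLDS_BY_SEVERITY = {"critical": 0.95, "warning": 0.80, "info": 0.70}

def decide_action(severity: str, confidence: float) -> str:
    """Auto-remediate only when the model's confidence clears the
    threshold for this severity; otherwise escalate to a human."""
    threshold = THRESHOLDS_BY_SEVERITY.get(severity, DEFAULT_THRESHOLD)
    return "auto_remediate" if confidence >= threshold else "escalate"

# The same 0.92 confidence is enough for a warning but not for a critical alert.
print(decide_action("critical", 0.92))  # escalate (0.92 < 0.95)
print(decide_action("warning", 0.92))   # auto_remediate (0.92 >= 0.80)
```

Raising a threshold shrinks automation coverage but lowers the false-positive remediation rate, which is exactly the trade-off the `analyze_confidence.py` step reports on.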
+
+[id="multi-tenancy-customization"]
+== Multi-tenancy customization
+
+For environments with multiple teams or tenants, you can customize self-healing policies per namespace or team:
+
+[source,yaml]
+----
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: tenant-policies
+  namespace: aiops-engine
+data:
+  policies.yaml: |
+    # Default policy for all namespaces
+    default:
+      enable_auto_remediation: true
+      confidence_threshold: 0.80
+      max_actions_per_hour: 10
+      require_approval: false
+
+    # Policy for production namespaces
+    tenant_policies:
+      - namespace_pattern: "prod-*"
+        enable_auto_remediation: true
+        confidence_threshold: 0.90  # Higher threshold
+        max_actions_per_hour: 5     # Rate limiting
+        require_approval: true      # Always need approval
+        approval_team: "production-team"
+
+      - namespace_pattern: "dev-*"
+        enable_auto_remediation: true
+        confidence_threshold: 0.70  # Lower threshold OK
+        max_actions_per_hour: 20    # Higher limit
+        require_approval: false
+
+      - namespace_pattern: "sandbox-*"
+        enable_auto_remediation: false  # No automation
+        require_approval: true
+----
+
+[id="advanced-observability"]
+== Advanced observability customization
+
+Extend the observability stack for your specific applications:
+
+=== Custom Prometheus exporters
+
+Deploy custom exporters for application-specific metrics:
+
+[source,yaml]
+----
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: custom-app-exporter
+  namespace: monitoring
+spec:
+  selector:
+    matchLabels:
+      app: custom-app-exporter
+  template:
+    metadata:
+      labels:
+        app: custom-app-exporter
+    spec:
+      containers:
+      - name: exporter
+        image: myorg/custom-exporter:latest
+        ports:
+        - containerPort: 9090
+          name: metrics
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: custom-app-exporter
+  namespace: monitoring
+  labels:
+    app: custom-app-exporter
+spec:
+  ports:
+  - port: 9090
+    name: metrics
+  selector:
+    app: custom-app-exporter
+---
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: custom-app-exporter
+  namespace: monitoring
+spec:
+  selector:
+    matchLabels:
+      app: 
custom-app-exporter + endpoints: + - port: metrics + interval: 30s +---- + +=== Custom Grafana dashboards + +Create dashboards specific to your workloads: + +[source,terminal] +---- +oc create configmap custom-dashboard \ + -n aiops-observability \ + --from-file=my-app-dashboard.json \ + --dry-run=client -o yaml | oc apply -f - + +oc label configmap custom-dashboard \ + -n aiops-observability \ + grafana_dashboard=1 +---- + +[id="storage-configuration-patterns"] +== Storage configuration patterns + +Storage class selection significantly impacts performance, cost, and scalability of the self-healing platform. + +=== Storage class selection guide + +Different components have different storage requirements: + +[cols="3,2,2,4",options="header"] +|=== +|Component |Access Mode |Storage Class |Rationale + +|Model artifacts (shared) +|RWX +|CephFS or NFS +|Multiple pods read models, training data, notebooks + +|Jupyter workbench data +|RWO +|CephFS or EBS +|Single workbench pod, moderate I/O + +|GPU model training scratch +|RWO +|EBS gp3-csi or local SSD +|High IOPS required for training pipelines + +|Long-term model registry +|S3-compatible +|NooBaa or AWS S3 +|Versioned model storage, lifecycle policies + +|Prometheus metrics storage +|RWO +|EBS gp3-csi +|Time-series database with high write throughput + +|Log aggregation +|RWO +|EBS gp3-csi or CephFS +|High write volume, retention policies +|=== + +=== Example: Optimizing for HA clusters with ODF + +For HighlyAvailable clusters with OpenShift Data Foundation: + +[source,yaml] +---- +# values-hub.yaml excerpt +clusterGroup: + storage: + # Default storage class for most workloads + storageClass: "ocs-storagecluster-cephfs" + + # Per-component overrides + componentStorage: + # GPU training needs high IOPS + gpu-training: + storageClass: "gp3-csi" + size: "100Gi" + + # Shared model artifacts on CephFS + model-artifacts: + storageClass: "ocs-storagecluster-cephfs" + accessMode: "ReadWriteMany" + size: "50Gi" + + # Prometheus 
on fast block storage + prometheus: + storageClass: "ocs-storagecluster-ceph-rbd" + size: "50Gi" +---- + +=== Example: Optimizing for SNO with cloud storage + +For Single Node OpenShift with cloud provider storage: + +[source,yaml] +---- +# values-hub.yaml excerpt for AWS SNO +clusterGroup: + cluster: + topology: "sno" + + storage: + # Use cloud provider storage for all workloads + storageClass: "gp3-csi" + + # External S3 for model artifacts (no ODF on SNO) + externalS3: + enabled: true + endpoint: "s3.amazonaws.com" + bucket: "my-org-aiops-models" + region: "us-east-1" +---- + +=== Storage capacity planning + +Based on real-world deployments, recommended storage allocations: + +*Minimum (Development/POC):* +- Workbench data: 10-20 GB +- Model storage: 10 GB +- Model artifacts: 20 GB +- Total: ~50 GB + +*Production (HA cluster):* +- Workbench data: 50 GB (multiple data scientists) +- Model storage: 50-100 GB (model versions, A/B testing) +- Model artifacts: 100-200 GB (training datasets, experiment results) +- Prometheus metrics: 50-100 GB (30-day retention) +- Log storage: 50-100 GB (application logs, audit trails) +- Total: ~350-500 GB + +*Enterprise (Multi-cluster hub):* +- Scale linearly based on number of managed clusters +- Consider object storage (S3) for long-term artifact retention +- Implement lifecycle policies to archive old models/datasets + +[id="testing-customizations-safely"] +== Testing customizations safely + +Before deploying customizations to production: + +. Test in a non-production cluster first +. Use GitOps preview environments or ArgoCD sync waves +. Enable dry-run mode to see what actions would be taken without executing: ++ +[source,yaml] +---- +global: + aiops: + dry_run_mode: true # Log actions but don't execute +---- + +. Start with low-risk namespaces or development environments +. Monitor for false positives and adjust runbooks or thresholds +. 
Gradually roll out to production with confidence thresholds tuned conservatively + +[id="making-changes-to-see-pattern-in-action"] +== Making changes to see the pattern in action + +You can experiment with the pattern to observe self-healing in action: + +=== Simulate a failing pod + +. Deploy a test application that will fail: ++ +[source,terminal] +---- +oc create namespace aiops-test +oc run failing-pod --image=busybox --namespace=aiops-test -- sh -c 'exit 1' +---- + +. Watch the self-healing engine detect the CrashLoopBackOff: ++ +[source,terminal] +---- +oc logs -n aiops-engine deployment/decision-engine -f | grep failing-pod +---- + +. Check if a runbook triggered or ML model predicted an action: ++ +[source,terminal] +---- +oc get events -n aiops-test --field-selector involvedObject.name=failing-pod +---- + +=== Simulate high memory usage + +. Deploy a memory stress test: ++ +[source,terminal] +---- +oc run memory-hog --image=polinux/stress --namespace=aiops-test -- \ + stress --vm 1 --vm-bytes 512M --vm-hang 0 +---- + +. Watch Prometheus alerts fire: ++ +[source,terminal] +---- +oc get prometheusrule -n openshift-monitoring +---- + +. Observe if the self-healing platform triggers remediation (based on your runbooks). + +=== Adjust model serving and observe A/B testing + +. Check current model traffic distribution: ++ +[source,terminal] +---- +oc get inferenceservice incident-prediction -n aiops-inference -o yaml | grep -A 5 canary +---- + +. Shift more traffic to the canary model: ++ +[source,terminal] +---- +oc patch inferenceservice incident-prediction -n aiops-inference --type=merge -p '{ + "spec": { + "canaryTrafficPercent": 50 + } +}' +---- + +. Monitor model performance metrics in Grafana to compare the two models. + +[id="resource-planning"] +== Resource planning and sizing + +Plan cluster resources based on deployment topology and expected workload. 
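As a rough planning aid, the rules of thumb in this section can be reduced to a quick calculation. The sketch below is illustrative, not an official formula: the `hub_sizing` helper is hypothetical, the base figures approximate the HA component estimates in this document (~25 cores, ~60 GB), the per-spoke increments follow the multi-cluster guidance of roughly 1 CPU core and 2 GB RAM per managed cluster, and a 25-30% reserve for OpenShift platform services is applied:

```python
# Illustrative hub-cluster sizing helper; base figures and per-spoke
# increments are rule-of-thumb values from this document.
HA_BASE_CORES, HA_BASE_RAM_GB = 25, 60   # hub component estimate (HA)
PER_SPOKE_CORES, PER_SPOKE_RAM_GB = 1, 2 # added hub load per spoke cluster
OVERHEAD_PCT = 30                        # reserve for OpenShift services

def with_overhead(value: int) -> int:
    # Integer ceiling of value * 1.30, avoiding float rounding surprises.
    return -(-value * (100 + OVERHEAD_PCT) // 100)

def hub_sizing(spoke_count: int) -> dict:
    """Estimate hub capacity needed to manage a given number of spokes."""
    cores = HA_BASE_CORES + PER_SPOKE_CORES * spoke_count
    ram_gb = HA_BASE_RAM_GB + PER_SPOKE_RAM_GB * spoke_count
    return {"cores": with_overhead(cores), "ram_gb": with_overhead(ram_gb)}

print(hub_sizing(10))  # → {'cores': 46, 'ram_gb': 104}
```

Treat the output as a starting point for capacity planning and validate it against the measured allocations shown in the tables that follow.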
+
+=== HA cluster resource allocation
+
+Example from a production HA deployment with 7 nodes:
+
+[source,terminal]
+----
+oc get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
+----
+
+Example output:
+
+----
+NAME                                        CPU     MEMORY
+ip-10-0-18-100.us-east-2.compute.internal   15500m  63295548Ki (~62GB)
+ip-10-0-26-216.us-east-2.compute.internal   3500m   14584996Ki (~14GB)
+ip-10-0-34-148.us-east-2.compute.internal   7500m   31004812Ki (~30GB)
+ip-10-0-51-193.us-east-2.compute.internal   15500m  63295540Ki (~62GB)
+ip-10-0-54-59.us-east-2.compute.internal    7500m   31004812Ki (~30GB)
+ip-10-0-59-112.us-east-2.compute.internal   15500m  63295544Ki (~62GB)
+ip-10-0-8-246.us-east-2.compute.internal    7500m   31004812Ki (~30GB)
+
+Total allocatable across all 7 nodes: ~72 CPU cores, ~290GB RAM
+----
+
+*Component resource usage (HA cluster):*
+
+|===
+|Component |CPU Request |Memory Request |Replicas
+
+|anomaly-detector-predictor
+|500m
+|1Gi
+|1-2
+
+|predictive-analytics-predictor
+|2000m (GPU preferred)
+|4Gi
+|1-2
+
+|coordination-engine
+|1000m
+|2Gi
+|1-3 (HA)
+
+|mcp-server
+|500m
+|512Mi
+|1-2
+
+|self-healing-workbench
+|2000m
+|4Gi
+|1
+
+|Prometheus
+|2000m
+|8Gi
+|1-2 (HA)
+
+|ODF (full stack)
+|8000m
+|16Gi
+|3 nodes
+
+|*Total (estimated)*
+|*~20-25 cores*
+|*~50-60GB*
+|
+|===
+
+*Recommendations:*
+
+* HA clusters: Minimum 3 worker nodes with 16 cores and 64GB RAM each
+* Reserve 25-30% overhead for OpenShift platform services
+* For GPU workloads: Add nodes with NVIDIA GPUs (T4, V100, A100)
+* Enable auto-scaling for worker nodes in cloud environments
+
+=== SNO resource allocation
+
+For Single Node OpenShift deployments:
+
+*Minimum SNO configuration:*
+- CPU: 8 cores (16 vCPUs recommended)
+- Memory: 32 GB (64 GB recommended)
+- Storage: 120 GB root disk + additional storage for PVCs
+
+*SNO component adjustments:*
+
+- Reduce replica counts to 1 for all components
+- Use MCG-only (no Ceph) for object storage
+- Disable GPU-based predictive analytics 
unless GPU available +- Consider external S3 instead of ODF to save resources + +[source,yaml] +---- +# values-hub.yaml for SNO +clusterGroup: + cluster: + topology: "sno" + + # Resource limits for SNO + resources: + coordination-engine: + replicas: 1 + cpu: "500m" + memory: "1Gi" + + anomaly-detector: + replicas: 1 + cpu: "500m" + memory: "1Gi" + + # Disable GPU model on SNO + predictive-analytics: + enabled: false +---- + +=== Scaling for multi-cluster management + +When managing multiple clusters as spokes: + +*Hub cluster scaling formula:* +- Base: Resources listed above for HA cluster +- Per spoke cluster: Add ~1 CPU core, 2GB RAM +- Example: Managing 10 spoke clusters = Base + 10 cores + 20GB RAM + +*Storage scaling:* +- Metrics retention: ~5-10 GB per spoke cluster per 30 days +- Model storage: Shared across clusters (no linear increase) + +[id="workshop-and-learning-resources"] +== Workshop and learning resources + +The OpenShift AIOps Platform includes a comprehensive workshop with 33 Jupyter notebooks covering every aspect of self-healing automation. + +=== Self-Healing Workshop + +The official workshop provides hands-on learning modules: + +* **Module 0**: Introduction & Architecture - How the platform works +* **Module 1**: ML Model Training with Tekton - Train anomaly detection models +* **Module 2**: Deploy MCP Server & Configure Lightspeed - Setup integration +* **Module 3**: End-to-End Self-Healing with Lightspeed - Interactive AI-powered management +* **Module 4**: Extra Credit - Advanced ML (LSTM, ensemble) and custom deployment +* **Module 5**: Notebook Catalog & Use Cases - Guide to all 33+ notebooks + +Access the workshop at: https://kubeheal.github.io/self-healing-workshop/ + +=== Notebook Catalog by Category + +The platform includes 33 Jupyter notebooks organized by use case. 
Access them via the workbench: + +[source,terminal] +---- +oc port-forward self-healing-workbench-0 8888:8888 -n self-healing-platform +# Open http://localhost:8888 +---- + +*Category 00: Setup & Validation (3 notebooks)* + +* `00-platform-readiness-validation.ipynb` - Validates cluster prerequisites +* `01-kserve-model-onboarding.ipynb` - Deploy models to KServe +* `environment-setup.ipynb` - Configure Python environment + +*Category 01: Data Collection (5 notebooks)* + +* `prometheus-metrics-collection.ipynb` - Query Prometheus for training data +* `openshift-events-analysis.ipynb` - Extract Kubernetes events +* `log-parsing-analysis.ipynb` - Parse container logs for errors +* `feature-store-demo.ipynb` - Feature engineering for ML +* `synthetic-anomaly-generation.ipynb` - Generate test anomalies + +*Category 02: Anomaly Detection (5 notebooks)* + +* `01-isolation-forest-implementation.ipynb` - Isolation Forest (fast, explainable) +* `02-time-series-anomaly-detection.ipynb` - Time series methods (ARIMA, Prophet) +* `03-lstm-based-prediction.ipynb` - LSTM neural networks (requires GPU) +* `04-ensemble-anomaly-methods.ipynb` - Combine algorithms via voting +* `05-predictive-analytics-kserve.ipynb` - Deploy to KServe for inference + +*Category 03: Self-Healing Logic (3 notebooks)* + +* `rule-based-remediation.ipynb` - Deterministic remediation rules +* `ai-driven-decision-making.ipynb` - ML-based action selection +* `hybrid-healing-workflows.ipynb` - Combine rules + AI + +*Category 04: Model Serving (3 notebooks)* + +* `kserve-model-deployment.ipynb` - Full KServe deployment workflow +* `inference-pipeline-setup.ipynb` - Pre/post processing pipelines +* `model-versioning-mlops.ipynb` - Versioning and A/B testing + +*Category 05: End-to-End Scenarios (4 notebooks)* + +* `complete-platform-demo.ipynb` - Full platform demonstration +* `pod-crash-loop-healing.ipynb` - Detect and remediate CrashLoopBackOff +* `resource-exhaustion-detection.ipynb` - CPU/memory pressure 
detection +* `network-anomaly-response.ipynb` - Network anomaly detection and response + +*Category 06: MCP & Lightspeed Integration (4 notebooks)* + +* `mcp-server-integration.ipynb` - Test MCP server functionality +* `openshift-lightspeed-integration.ipynb` - Lightspeed API usage +* `llamastack-integration.ipynb` - LlamaStack for local LLMs +* `end-to-end-troubleshooting-workflow.ipynb` - AI-guided debugging + +*Category 07: Monitoring & Operations (3 notebooks)* + +* `prometheus-metrics-monitoring.ipynb` - Custom Prometheus metrics +* `model-performance-monitoring.ipynb` - Track accuracy, drift detection +* `healing-success-tracking.ipynb` - Measure self-healing effectiveness + +*Category 08: Advanced Scenarios (3 notebooks)* + +* `security-incident-response-automation.ipynb` - Security automation +* `predictive-scaling-capacity-planning.ipynb` - Predict capacity needs +* `cost-optimization-resource-efficiency.ipynb` - Resource right-sizing + +=== Algorithm Selection Guide + +Choose the right anomaly detection approach based on your use case: + +[cols="2,2,3"] +|=== +|Scenario |Recommended Notebook |Rationale + +|Quick start, simple anomalies +|`01-isolation-forest-implementation.ipynb` +|Fast training, no GPU needed, explainable results + +|Time-based patterns (daily/weekly cycles) +|`02-time-series-anomaly-detection.ipynb` +|Captures seasonality and trends + +|Complex multi-variate patterns +|`03-lstm-based-prediction.ipynb` +|Deep learning captures complex relationships (requires GPU) + +|Production deployment +|`04-ensemble-anomaly-methods.ipynb` +|Combines models for robust, low-false-positive detection +|=== + +=== Python Integration Examples + +The workshop includes production-ready Python integration examples: + +*Automated Alert Response Pattern:* + +[source,python] +---- +from lightspeed_client import LightspeedClient + +def handle_prometheus_alert(alert, server_url): + """Respond to Prometheus alert with AI analysis.""" + + # Build context from alert 
+ context = { + 'alert_name': alert['labels']['alertname'], + 'namespace': alert['labels'].get('namespace'), + 'severity': alert['labels'].get('severity'), + 'description': alert['annotations'].get('description'), + } + + # Query Lightspeed for analysis + client = LightspeedClient(server_url) + analysis = client.query( + f"Analyze this alert and suggest remediation: {context['description']}", + context=context + ) + + # Decide on action based on confidence + if analysis['confidence'] > 0.8: + return {'action': 'auto_remediate', 'confidence': analysis['confidence']} + elif analysis['confidence'] > 0.6: + return {'action': 'recommend', 'confidence': analysis['confidence']} + else: + return {'action': 'escalate', 'reason': 'low_confidence'} +---- + +*Available integration patterns:* + +* `lightspeed_client.py` - OpenShift Lightspeed client library +* `pattern_alert_response.py` - Automated alert response +* `pattern_batch_analysis.py` - Batch anomaly analysis +* `pattern_capacity_planning.py` - Capacity prediction +* `monitor_cluster.py` - Continuous cluster monitoring + +Download examples from: https://github.com/KubeHeal/self-healing-workshop/tree/main/examples/python + +=== Hybrid Healing Approach + +The platform combines deterministic rules with AI decision-making: + +[source] +---- +Incoming Anomaly + │ + ▼ +┌──────────────────┐ +│ Rule Matcher │───→ Known Issue? ───→ Apply Rule-Based Fix +│ (Deterministic) │ │ +└──────────────────┘ │ + │ No Match │ + ▼ │ +┌──────────────────┐ │ +│ AI Decision │───→ Novel Issue? 
───→ AI-Recommended Fix +│ (ML-Based) │ │ +└──────────────────┘ │ + │ │ + └───────────────────────────────────────────┘ + │ + ▼ + Coordination Engine + (Conflict Resolution) +---- + +This hybrid approach provides: + +* **Speed**: Rules provide instant response for known issues +* **Intelligence**: AI handles novel problems requiring analysis +* **Safety**: Coordination engine prevents conflicting actions +* **Learning**: New patterns from AI decisions can become rules + +[id="contributing-improvements-back"] +== Contributing improvements back + +If you develop valuable runbooks, ML model improvements, or integrations, consider contributing them back to the community: + +. Fork the pattern repository: https://github.com/KubeHeal/openshift-aiops-platform +. Create a branch for your enhancement +. Add your customization with documentation +. Submit a pull request with a clear description of the use case and benefits +. Engage with the community in GitHub Discussions + +By contributing, you help other organizations facing similar challenges and benefit from community review and improvements to your contributions. + +For questions or support, visit the link:https://github.com/KubeHeal/openshift-aiops-platform/discussions[GitHub Discussions forum]. 
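As a closing illustration, the hybrid rule-then-ML dispatch described in the Hybrid Healing Approach section can be condensed into a few lines. This sketch shows the control flow only; the rule table, the `ml_predict` stub, and the threshold value are hypothetical, not the coordination engine's actual interfaces:

```python
# Illustrative hybrid dispatch: deterministic rules first, ML fallback
# for novel issues, escalation when model confidence is too low.
RULES = {  # known issue -> known fix (deterministic path)
    "CrashLoopBackOff": "restart_pod",
    "ImagePullBackOff": "check_image_reference",
}

def ml_predict(anomaly: str):
    """Stand-in for the ML analysis layer; returns (action, confidence)."""
    return ("scale_up", 0.85)  # placeholder prediction

def dispatch(anomaly: str, confidence_threshold: float = 0.80) -> str:
    # 1. Rule matcher: instant response for known issues.
    if anomaly in RULES:
        return RULES[anomaly]
    # 2. AI decision: novel issue, ask the model.
    action, confidence = ml_predict(anomaly)
    # 3. Safety: escalate rather than act on a low-confidence prediction.
    return action if confidence >= confidence_threshold else "escalate_to_human"

print(dispatch("CrashLoopBackOff"))  # rule-based path: restart_pod
print(dispatch("NovelAnomaly"))      # ML path: scale_up
```

In the real platform, the coordination engine sits after this decision point to resolve conflicts between concurrent actions.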
diff --git a/modules/oaiops-about.adoc b/modules/oaiops-about.adoc
new file mode 100644
index 000000000..e86ab8153
--- /dev/null
+++ b/modules/oaiops-about.adoc
@@ -0,0 +1,119 @@
+:_content-type: CONCEPT
+:imagesdir: ../../images
+
+[id="about-openshift-aiops-platform"]
+= About the OpenShift AIOps Self-Healing Platform
+
+Use case::
+
+* Automatically detect and remediate common infrastructure and application issues in OpenShift clusters
+* Combine the coordination engine's REST API with machine learning models for intelligent incident response
+* Implement remediation workflows with an audit trail via the coordination engine
+* Reduce Mean Time to Resolution (MTTR) through automated self-healing
+* Enable continuous learning from incident history to improve prediction accuracy
+* Manage multi-cluster environments with centralized observability and remediation
+* Support both standard HighlyAvailable clusters and Single Node OpenShift (SNO) deployments
+
++
+Background::
+
+Organizations operating Kubernetes at scale face an increasing volume of operational incidents that require rapid response. Manual intervention becomes a bottleneck, leading to prolonged outages and increased operational costs. Traditional monitoring and alerting systems can identify problems but require human operators to diagnose and remediate issues.
+
+The OpenShift AIOps Self-Healing Platform addresses this challenge through a hybrid approach that combines the deterministic coordination engine REST API with AI-driven machine learning predictions. When an incident is detected, notebooks or the MCP server analyze the issue using ML model inference (Isolation Forest for anomaly detection, Random Forest Regressor for predictive analytics) to determine remediation actions with confidence scores. The Go-based coordination engine orchestrates remediation workflows, integrates with deployment methods (ArgoCD, Helm, Operator-managed), and provides health monitoring endpoints. 
Current capabilities include pod lifecycle management, with deployment resource updates and declarative remediation patterns under active development. + +This architecture delivers intelligent remediation through ML-driven anomaly detection while maintaining deployment method awareness (ArgoCD sync vs direct pod management). The platform uses two trained models deployed via KServe: Isolation Forest for real-time anomaly detection and Random Forest Regressor for predictive resource usage forecasting. The continuous learning pipeline includes 33 Jupyter notebooks for model training, validation, and deployment, enabling progressive automation capability expansion. + +The pattern is deployed using the Validated Patterns framework with the Validated Patterns Operator, ensuring all components are version controlled and auditable. The platform uses a hybrid management model combining cluster-scoped resource management via Ansible with namespaced resource management via ArgoCD. The platform continuously learns from incident outcomes through automated model training workflows using Tekton Pipelines integrated with Red Hat OpenShift AI. Models are trained weekly via NotebookValidationJob custom resources and deployed to KServe for low-latency inference during incident response. + +The pattern supports both standard HighlyAvailable multi-node clusters and Single Node OpenShift (SNO) deployments, with automatic topology detection and configuration during deployment. + +For a hands-on walkthrough of the complete self-healing workflow, see the link:https://kubeheal.github.io/self-healing-workshop/modules/index.html[Self-Healing Workshop] which demonstrates end-to-end scenarios including anomaly detection, model training, MCP server integration, and AI-powered remediation with OpenShift Lightspeed. 
+ +[id="about-solution"] +== About the solution + +The OpenShift AIOps Self-Healing Platform provides an end-to-end solution for automated incident response across multiple OpenShift clusters: + +* *Intelligent Self-Healing Architecture*: Combines three key components: +** *ML Analysis Layer*: Two KServe-deployed models (Isolation Forest for anomaly detection, Random Forest Regressor for predictive analytics) analyze cluster metrics and predict remediation actions +** *Jupyter Notebook Interface*: 33 production-ready notebooks organized in 9 categories provide the primary development and integration interface, with automated validation via NotebookValidationJob CRs +** *Go-based Coordination Engine*: Orchestrates remediation workflows via REST API endpoints (`/health`, `/api/v1/health`), integrates with deployment methods (ArgoCD, Helm, Operator-managed), and provides Prometheus metrics on port 9090. Current capabilities include pod lifecycle management, with deployment resource updates under active development + +* *Multi-Cluster Architecture*: Hub cluster manages observability and AI/ML operations, with support for spoke cluster integration via Red Hat Advanced Cluster Management (multi-cluster features under development) + +* *MCP Server Integration*: Go-based MCP (Model Context Protocol) server with 12+ tools provides AI assistants like OpenShift Lightspeed with cluster health data, enabling natural language interaction with the self-healing platform + +* *Continuous Learning*: Tekton Pipelines orchestrate weekly model retraining based on collected metrics and anomalies. 
Models are validated via NotebookValidationJob CRs and deployed to KServe with zero-downtime updates + +* *Topology-Aware Deployment*: Automatic detection and configuration for both HighlyAvailable (HA) multi-node clusters and Single Node OpenShift (SNO) deployments, with storage class selection (CephFS for HA, gp3-csi for SNO) and resource allocation adjusted per topology + +Benefits and capabilities of the self-healing platform: + +* *Anomaly Detection*: Isolation Forest model detects unusual patterns in cluster metrics with 100 estimators and 3% contamination rate, trained weekly on hybrid synthetic and real Prometheus data +* *Predictive Analytics*: Random Forest Regressor forecasts resource usage 1 hour ahead (12 steps at 5-minute intervals) to enable proactive scaling +* *Deployment Awareness*: Coordination engine detects how applications were deployed (ArgoCD, Helm, Operator, Manual) and applies appropriate remediation strategies +* *Production Validation*: 94%+ validation success rate across 33 notebooks on both SNO and HA cluster topologies +* *AI Assistant Integration*: OpenShift Lightspeed users can query cluster health and trigger remediation through natural language via the MCP server +* *Observable Operations*: Complete Prometheus metrics exposure, health check endpoints, and incident tracking enable monitoring and continuous improvement +* *Scalable Architecture*: Pattern supports Single Node OpenShift (edge/development) through HighlyAvailable multi-node clusters (production) with automatic topology detection + +[id="about-technology"] +== About the technology + +The following technologies are used in this solution: + +https://www.redhat.com/en/technologies/cloud-computing/openshift/try-it[Red Hat OpenShift Container Platform]:: +An enterprise-ready Kubernetes container platform built for an open hybrid cloud strategy. It provides a consistent application platform to manage hybrid cloud, public cloud, and edge deployments. 
OpenShift provides the foundation for running both the self-healing platform components and the workloads being monitored. + +https://www.redhat.com/en/technologies/cloud-computing/openshift/openshift-ai[Red Hat OpenShift AI]:: +A flexible, scalable MLOps platform with tools to build, deploy, and manage AI-enabled applications. In this pattern, OpenShift AI provides Jupyter workbench environments for model development (33 notebooks), KServe for model serving (Isolation Forest and Random Forest Regressor models), and integration with the data science gateway for OAuth-authenticated access. Models are stored in S3-compatible storage (NooBaa on HA clusters, external S3 on SNO) and deployed via KServe InferenceServices. + +https://www.redhat.com/en/technologies/management/advanced-cluster-management[Red Hat Advanced Cluster Management for Kubernetes]:: +Controls clusters and applications from a single console, with built-in security policies. Extends the value of Red Hat OpenShift by deploying apps, managing multiple clusters, and enforcing policies across multiple clusters at scale. ACM provides the multi-cluster observability capabilities that feed the self-healing engine with metrics, logs, and traces from spoke clusters. + +https://www.redhat.com/en/technologies/cloud-computing/openshift/try-it[Red Hat OpenShift GitOps]:: +A declarative application continuous delivery tool for Kubernetes based on the ArgoCD project. In this pattern, the coordination engine detects ArgoCD-managed applications and triggers ArgoCD sync operations for remediation when appropriate. The Validated Patterns framework uses GitOps for deploying all platform components with a hybrid management model (cluster-scoped via Ansible, namespaced via ArgoCD). + +https://access.redhat.com/documentation/en-us/red_hat_openshift_pipelines/1.14[Red Hat OpenShift Pipelines]:: +A cloud-native, continuous integration and continuous delivery (CI/CD) solution based on Kubernetes resources. 
It uses Tekton building blocks to automate deployments across multiple platforms. In this pattern, Tekton Pipelines orchestrate weekly model training workflows, triggered automatically via NotebookValidationJob custom resources. Trained models are validated and deployed to KServe upon successful completion. + +https://www.redhat.com/en/technologies/cloud-computing/openshift-data-foundation[Red Hat OpenShift Data Foundation]:: +Software-defined storage for containers. Supports file, block, and object storage for persistent data storage needs. In this pattern, ODF deployment varies by cluster topology: HighlyAvailable clusters use full ODF with Ceph (CephFS for shared model storage, Ceph RBD for block storage) and NooBaa for S3-compatible object storage; Single Node OpenShift uses MCG-only ODF (NooBaa S3 without Ceph) to minimize resource consumption. Storage classes are automatically selected based on topology detection during deployment. + +This solution also uses a variety of *observability tools* integrated with Red Hat Advanced Cluster Management: + +* *Thanos* for long-term metrics storage and multi-cluster query aggregation +* *Grafana* for visualization of metrics, incidents, and self-healing outcomes +* *Prometheus* for metrics collection and alerting +* *OpenTelemetry* for distributed tracing across microservices + +[id="getting-started"] +== Getting started + +To understand how these components work together in practice: + +*Self-Healing Workshop*:: +The comprehensive link:https://kubeheal.github.io/self-healing-workshop/modules/index.html[Self-Healing Workshop] provides hands-on labs demonstrating the complete platform: ++ +* *Module 0*: Introduction and Architecture - Understanding how the platform works +* *Module 1*: ML Model Training with Tekton - Train anomaly detection models using the notebooks +* *Module 2*: Deploy MCP Server & Configure Lightspeed - Setup AI assistant integration +* *Module 3*: End-to-End Self-Healing with Lightspeed - 
Interactive AI-powered cluster management +* *Module 4*: Extra Credit - Advanced ML techniques (LSTM, ensemble methods) and custom deployment +* *Module 5*: Notebook Catalog & Use Cases - Complete guide to all 33 notebooks + +*Pattern Documentation*:: +This documentation site provides: ++ +* link:/patterns/openshift-aiops-platform/architecture/[Architecture details] - Deep dive into coordination engine, ML pipeline, and deployment architecture +* link:/patterns/openshift-aiops-platform/solution-elements/[Solution elements] - Component-by-component explanation +* link:/patterns/openshift-aiops-platform/deploying/[Deployment guide] - Step-by-step installation instructions for both HA and SNO topologies +* link:/patterns/openshift-aiops-platform/ideas-for-customization/[Customization guide] - Extending the platform for your specific use cases + +*Source Code & ADRs*:: +For developers and architects: ++ +* link:https://github.com/KubeHeal/openshift-aiops-platform[Pattern Repository] - Main pattern with Helm charts, notebooks, and deployment manifests +* link:https://github.com/KubeHeal/openshift-aiops-platform/tree/main/docs/adrs[Architecture Decision Records] - Detailed ADRs documenting all design decisions +* link:https://github.com/KubeHeal/openshift-coordination-engine[Coordination Engine] - Go-based remediation orchestrator +* link:https://github.com/KubeHeal/openshift-cluster-health-mcp[MCP Server] - Model Context Protocol server for AI assistant integration diff --git a/modules/oaiops-architecture.adoc b/modules/oaiops-architecture.adoc new file mode 100644 index 000000000..02d21b3ed --- /dev/null +++ b/modules/oaiops-architecture.adoc @@ -0,0 +1,172 @@ +:_content-type: CONCEPT +:imagesdir: ../../images + +[id="architecture-openshift-aiops"] += Architecture + +The OpenShift AIOps Self-Healing Platform implements a hub-and-spoke architecture where a central hub cluster manages observability, AI/ML operations, and remediation orchestration across multiple spoke 
clusters. + +== Overview of the architecture + +The architecture consists of several layers: + +*Hub Cluster*:: +Runs Red Hat Advanced Cluster Management, Red Hat OpenShift AI, and the self-healing decision engine. Aggregates observability data from all spoke clusters, performs ML inference, and orchestrates remediation via GitOps. + +*Spoke Clusters*:: +Run production workloads and lightweight monitoring agents. Send metrics, logs, and traces to the hub cluster. Receive remediation actions via ArgoCD syncing from Git repositories. + +*Git Repositories*:: +Store cluster configurations and pattern deployment manifests. The Validated Patterns framework uses Git as the source of truth for infrastructure-as-code, with the Validated Patterns Operator synchronizing configurations via ArgoCD. + +*Model Storage & Serving*:: +Stores trained ML models in S3-compatible storage (NooBaa on HA clusters, external S3 on SNO) and serves them via two KServe InferenceServices: `anomaly-detector-predictor` (Isolation Forest model) and `predictive-analytics-predictor` (Random Forest Regressor model) for low-latency inference during incident analysis. + +.Logical architecture of OpenShift AIOps Self-Healing Platform +image::openshift-aiops-platform/aiops-logical-architecture.png[Logical Architecture] + +[NOTE] +==== +The diagram shows a hub cluster with OpenShift AI, ACM, and GitOps components, connected to multiple spoke clusters. Observability data flows from spokes to hub, while remediation actions flow from hub through Git to spokes. +==== + +== Self-Healing Data Flow + +The end-to-end self-healing workflow follows these stages: + +1. *Event Detection*: Prometheus monitors cluster metrics continuously; anomalies trigger notebook-based analysis or MCP server queries from AI assistants like OpenShift Lightspeed +2. 
*ML Analysis*: Notebooks or MCP server query KServe InferenceServices (`anomaly-detector-predictor` for real-time anomaly detection, `predictive-analytics-predictor` for 1-hour resource usage forecasting) via HTTP REST API +3. *Remediation Request*: Analysis results are submitted to the coordination engine REST API with incident details, resource identifiers, and severity levels +4. *Workflow Orchestration*: The Go-based coordination engine detects deployment method (ArgoCD, Helm, Operator, Manual) and selects appropriate remediation strategy +5. *Execution*: Current capabilities include pod lifecycle management and ArgoCD sync operations. Under development: Deployment resource updates (memory/CPU limits) +6. *Feedback*: Remediation outcomes feed weekly model retraining via Tekton Pipelines, improving detection accuracy over time + +.Self-healing data flow +image::openshift-aiops-platform/aiops-dataflow.png[Data Flow] + +This closed-loop architecture ensures the system learns from every incident, continuously improving its prediction accuracy and expanding its automation coverage. + +[NOTE] +==== +**Hands-On Workshop**: For a practical demonstration of this workflow in action, including step-by-step examples of anomaly detection, model training, and AI-powered remediation, see the link:https://kubeheal.github.io/self-healing-workshop/modules/index.html[Self-Healing Workshop]. Module 3 demonstrates end-to-end self-healing with OpenShift Lightspeed using the MCP server and coordination engine. +==== + +== Coordination Engine Architecture + +The Go-based coordination engine orchestrates intelligent remediation workflows: + +.Coordination Engine Components +image::openshift-aiops-platform/aiops-decision-flow.png[Decision Flow] + +*Deployment Detection*:: +The engine first detects how the affected application was deployed by examining resource labels, annotations, and ownership references. 
Detection strategies include checking for ArgoCD application annotations, Helm release labels, operator ownership references, or manual deployment patterns. + +*KServe Model Integration*:: +For anomaly detection, the engine can query KServe InferenceServices directly via HTTP REST API. Two models are available: `anomaly-detector-predictor` (Isolation Forest) for pattern detection and `predictive-analytics-predictor` (Random Forest) for resource usage forecasting. Currently, most integration occurs via notebooks calling both the models and coordination engine APIs. + +*Remediation Strategy Selection*:: +Based on deployment method, the engine routes to the appropriate remediator: +* *ArgoCD Remediator*: Triggers ArgoCD application sync for GitOps-managed applications +* *Helm Remediator*: Planned for Helm-deployed applications +* *Operator Remediator*: Planned for operator-managed resources +* *Manual Remediator*: **Currently implemented** for pod lifecycle management (delete/recreate). **Under development**: Deployment resource updates (memory/CPU limits per Issue #62) + +*Health Monitoring*:: +The engine exposes two health endpoints: `/health` (lightweight Kubernetes probe) and `/api/v1/health` (detailed dependency monitoring including Kubernetes API, KServe services, and RBAC permissions). Prometheus metrics on port 9090 track remediation attempts, workflow duration, and ArgoCD sync operations. 
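The engine's model integration can be sketched from a client's perspective. The endpoint URL and the `{"instances": [...]}` payload shape follow the KServe V1 inference protocol exposed by the pattern's InferenceServices; the coordination-engine request fields and its `/api/v1/remediations` path are illustrative assumptions for this sketch, not a documented API:

```python
import json
import urllib.request

# Endpoint from the pattern's anomaly-detector InferenceService (KServe V1 protocol).
ANOMALY_URL = "http://anomaly-detector-predictor:8080/v1/models/model:predict"

# Hypothetical coordination-engine endpoint; consult the
# openshift-coordination-engine repository for the real path and schema.
REMEDIATION_URL = "http://coordination-engine:8080/api/v1/remediations"


def build_predict_payload(features):
    """Wrap one feature vector in a KServe V1 predict request body."""
    return {"instances": [features]}


def build_remediation_request(resource, namespace, severity, anomaly_score):
    """Assemble an incident report for the coordination engine (field names assumed)."""
    return {
        "resource": resource,
        "namespace": namespace,
        "severity": severity,
        "anomaly_score": anomaly_score,
    }


def post_json(url, payload, timeout=10):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def report_if_anomalous(features):
    """Score one sample; if the model flags it, hand it to the coordination engine."""
    result = post_json(ANOMALY_URL, build_predict_payload(features))
    prediction = result["predictions"][0]
    if prediction == -1:  # scikit-learn IsolationForest labels anomalies as -1
        incident = build_remediation_request(
            "deployment/my-app", "self-healing-platform", "high", prediction
        )
        return post_json(REMEDIATION_URL, incident)
    return None

# In-cluster usage: report_if_anomalous([0.91, 0.12, 0.44])
```

The service hostnames resolve only inside the cluster, so the network call is left as a function rather than executed at import time.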
+ +== GitOps Integration + +The pattern leverages the proven multicloud-gitops framework: + +.GitOps architecture for self-healing +image::multicloud-gitops/spi-multi-cloud-gitops-sd-security.png[GitOps Architecture,link="/images/multicloud-gitops/spi-multi-cloud-gitops-sd-security.png"] + +All cluster configurations and remediation actions flow through Git, providing: + +* *Version control*: Every change is tracked with commit history +* *Audit trail*: Compliance and security teams can review all automated actions +* *Rollback*: Failed remediation can be reverted via Git +* *Policy enforcement*: Branch protection rules, required reviews, and signed commits + +The self-healing engine integrates with this GitOps workflow by creating commits that look identical to human-initiated changes, ensuring consistent tooling and processes. + +== Multi-Cluster Observability Architecture + +Red Hat Advanced Cluster Management provides the observability foundation: + +*Metrics Pipeline*:: +Each spoke cluster runs Prometheus to scrape local metrics. ACM's observability stack uses Thanos to aggregate metrics from all spokes into a centralized store on the hub cluster. This enables cross-cluster queries and long-term metric retention. + +*Distributed Tracing*:: +OpenTelemetry collectors on spoke clusters send traces to a central Jaeger instance on the hub. This enables correlation of incidents across microservices spanning multiple clusters. + +*Log Aggregation*:: +Fluentd or Vector collectors forward logs from spoke clusters to a central Loki instance on the hub cluster, enabling log-based incident detection and feature extraction for ML models. + +*Alerting*:: +Prometheus Alertmanager on the hub cluster receives alerts from all spoke clusters and routes them to the self-healing decision engine based on alert labels and severity. 
*Visualization*:: +Grafana dashboards on the hub cluster provide unified visibility into all clusters, including self-healing metrics such as incident volume, remediation success rate, and confidence score distributions. + +== ML Pipeline Architecture + +The MLOps pipeline implements automated model lifecycle management via Jupyter notebooks and Tekton: + +.ML pipeline architecture +image::openshift-aiops-platform/aiops-ml-pipeline.png[ML Pipeline] + +*Data Storage*:: +OpenShift Data Foundation provides topology-aware storage: +* *HA clusters*: Full ODF with CephFS (RWX for shared model storage), Ceph RBD (RWO for block storage), NooBaa S3 (object storage) +* *SNO clusters*: MCG-only ODF (NooBaa S3 without Ceph), gp3-csi (RWO for all persistent volumes) +* *Model artifacts*: `/mnt/models/` mounted from `model-storage-pvc` (10Gi CephFS on HA, 10Gi gp3-csi on SNO) +* *Notebook data*: `/opt/app-root/src/data` mounted from `workbench-data-development` (20Gi) + +*Training Workflow*:: +Tekton Pipelines orchestrate weekly model training via NotebookValidationJob custom resources: +1. **Data Collection** (notebooks/01-data-collection): Prometheus metrics query (7-day lookback for Isolation Forest, 30-day for Random Forest), synthetic anomaly generation +2. **Model Training** (notebooks/02-anomaly-detection): Isolation Forest (100 estimators, 3% contamination) and Random Forest Regressor (100 trees, max_depth=20) +3. **Validation** (papermill execution): NotebookValidationJob validates all cells execute successfully with timeout protection (45 minutes) +4. **Model Persistence**: Trained models saved to `/mnt/models/<model-name>/model.pkl` (scikit-learn pickle format) +5. **KServe Deployment**: InferenceServices restart on successful model update, zero-downtime deployment + +*Model Storage*:: +No centralized model registry (MLflow not used). 
Models stored directly in S3-compatible storage: +* Model files: `/mnt/models/anomaly-detector/model.pkl`, `/mnt/models/predictive-analytics/model.pkl` +* Credentials: `model-storage-config` Secret with AWS S3 endpoint, bucket, access keys +* Versioning: Managed via file timestamps and InferenceService rollout history + +*Model Serving*:: +KServe InferenceServices provide HTTP REST API: +* **Endpoints**: `http://anomaly-detector-predictor:8080/v1/models/model:predict`, `http://predictive-analytics-predictor:8080/v1/models/model:predict` +* **Base Image**: `quay.io/modh/odh-pytorch-notebook:latest` with scikit-learn +* **Autoscaling**: Based on CPU/memory utilization and request volume +* **Health Checks**: Readiness/liveness probes ensure models loaded before serving traffic + +*Continuous Monitoring*:: +NotebookValidationJob tracking provides production metrics: +* Validation success rate: 94%+ across 33 notebooks on both SNO and HA topologies +* Model deployment verification via KServe InferenceService ready status +* Prometheus metrics from coordination engine track model integration effectiveness + +== Deployment Architecture + +The pattern deploys using the Validated Patterns framework with the Validated Patterns Operator: + +* *Validated Patterns Operator*: Orchestrates pattern deployment and lifecycle management +* *Hybrid Management Model*: Combines cluster-scoped resource management via Ansible with namespaced resource management via ArgoCD (see link:https://github.com/KubeHeal/openshift-aiops-platform/blob/main/docs/adrs/030-hybrid-management-model.md[ADR-030]) +* *GitOps*: All components are deployed via ArgoCD applications +* *Helm Charts*: Component configurations use Helm for parameterization +* *Values Hierarchy*: Global values, site-specific values, and secret values +* *Secret Management*: External Secrets Operator integrates with HashiCorp Vault +* *Operator Lifecycle*: OpenShift Operator Lifecycle Manager (OLM) manages operator installation and 
updates +* *Topology Support*: Automatic detection and configuration for both standard HighlyAvailable and Single Node OpenShift (SNO) topologies + +For detailed architectural decisions, see the pattern's link:https://github.com/KubeHeal/openshift-aiops-platform/tree/main/docs/adrs[Architecture Decision Records (ADRs)], including: + +* link:https://github.com/KubeHeal/openshift-aiops-platform/blob/main/docs/adrs/002-hybrid-self-healing-approach.md[ADR-002: Hybrid Deterministic-AI Self-Healing Approach] +* link:https://github.com/KubeHeal/openshift-aiops-platform/blob/main/docs/adrs/019-validated-patterns-framework-adoption.md[ADR-019: Validated Patterns Framework Adoption] +* link:https://github.com/KubeHeal/openshift-aiops-platform/blob/main/docs/adrs/030-hybrid-management-model.md[ADR-030: Hybrid Management Model] + +This ensures the pattern can be easily forked, customized, and deployed across different environments while maintaining consistency with other Validated Patterns. diff --git a/modules/oaiops-deploying.adoc b/modules/oaiops-deploying.adoc new file mode 100644 index 000000000..5b08b5520 --- /dev/null +++ b/modules/oaiops-deploying.adoc @@ -0,0 +1,1123 @@ +:_content-type: PROCEDURE +:imagesdir: ../../../images + +[id="deploying-openshift-aiops-platform"] += Deploying the OpenShift AIOps Self-Healing Platform + +This procedure walks through deploying the OpenShift AIOps Self-Healing Platform on an OpenShift cluster using the Validated Patterns Operator framework. 
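Before starting, note that the single most common deployment failure addressed repeatedly in the procedure below is leaving `git.repoURL` pointed at the upstream repository instead of your fork. A minimal pre-flight sketch, assuming the `values-global.yaml` and `values-hub.yaml` files described later in this procedure, and deliberately using a plain-text scan rather than a YAML parser to avoid extra dependencies:

```python
import re

# Upstream URL that must NOT appear in your customized values files.
UPSTREAM = "https://github.com/KubeHeal/openshift-aiops-platform.git"


def find_repo_urls(values_text):
    """Extract every repoURL value via a plain-text scan of the values file."""
    return re.findall(r"^\s*repoURL:\s*(\S+)", values_text, flags=re.MULTILINE)


def check_fork(values_text):
    """Return a list of problems: no repoURL at all, or one still pointing upstream."""
    urls = find_repo_urls(values_text)
    if not urls:
        return ["no repoURL found"]
    return [f"repoURL still points at upstream: {u}" for u in urls if u == UPSTREAM]


# Usage from the repository root, after editing the values files:
# for path in ("values-global.yaml", "values-hub.yaml"):
#     for problem in check_fork(open(path).read()):
#         print(f"{path}: {problem}")
```

An empty problem list means the values files reference your fork and the GitOps synchronization described below can succeed.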
+ +== Prerequisites + +*Hub Cluster*:: +* OpenShift Container Platform 4.18 or later +* Cluster admin access +* Standard HighlyAvailable topology: At least 3 control plane nodes and 6+ compute nodes, OR +* Single Node OpenShift (SNO) topology: 1 node with 8+ vCPU, 32+ GB RAM, 120+ GB storage +* Internet connectivity for pulling container images + +*Optional Spoke Clusters*:: +* One or more OpenShift clusters to manage with the self-healing platform +* These can be added after initial hub deployment + +*Required Local Tools*:: +* `oc` CLI (matching your OpenShift version) +* `kubectl` CLI +* `helm` CLI version 3.12 or later +* `podman` for building/pulling Ansible execution environment +* `ansible-navigator` for running Ansible automation +* `yq` for YAML processing +* `git` CLI installed locally +* `make` for running deployment targets +* `jq` for JSON processing + +*Optional: RHEL Prerequisite Installer*:: +* For RHEL 9/10 systems, a prerequisite installation script is available +* Installs all required tools automatically + +*Required Access*:: +* Cluster admin credentials for the hub cluster +* GitHub account for forking the pattern repository (MANDATORY) +* Red Hat Ansible Automation Hub token (for building execution environment, if not using pre-built image) + +== Preparing for deployment + +Follow these steps to prepare your environment for deployment. + +.Procedure + +. Fork the pattern repository on GitHub: ++ +[IMPORTANT] +==== +Forking the repository is MANDATORY. The pattern will not work correctly if you clone the upstream repository directly, as GitOps requires your own repository URL for synchronization. +==== ++ +Navigate to https://github.com/KubeHeal/openshift-aiops-platform and click the *Fork* button. + +. Clone YOUR forked repository (not the upstream): ++ +[source,terminal] +---- +git clone https://github.com/<your-username>/openshift-aiops-platform.git +cd openshift-aiops-platform +---- + +. 
Set the upstream remote to track pattern updates: ++ +[source,terminal] +---- +git remote add upstream https://github.com/KubeHeal/openshift-aiops-platform.git +---- + +. (Optional) Install prerequisites using the RHEL installer script: ++ +[source,terminal] +---- +# RHEL 9 or RHEL 10 only +./scripts/install-prerequisites-rhel.sh + +# Start a new terminal session or source your shell configuration +source ~/.bashrc # or ~/.zshrc if using zsh +---- ++ +The script is *idempotent* (safe to run multiple times) and will prompt before reinstalling existing tools. ++ +*What the script installs:* ++ +* *System packages via dnf:* `podman`, `git`, `make`, `jq`, `python3-pip`, development headers +* *Python virtual environment:* Created at `~/.venv/aiops-platform` with `ansible-navigator` and `ansible-builder` +* *CLI tools installed to `/usr/local/bin/`:* +** `oc` - OpenShift CLI +** `kubectl` - Kubernetes CLI +** `helm` 3.12+ - Kubernetes package manager +** `yq` - YAML processor +** `tkn` - Tekton CLI for pipeline management ++ +[NOTE] +==== +*Ubuntu/macOS users:* The RHEL script is not compatible with other operating systems. Install the required tools manually: + +*Ubuntu:* +[source,terminal] +---- +apt install podman git make jq +pip3 install ansible-navigator ansible-builder +---- + +*macOS (using Homebrew):* +[source,terminal] +---- +brew install podman git make jq yq helm kubectl openshift-cli tektoncd-cli +pip3 install ansible-navigator ansible-builder +---- +==== ++ +[NOTE] +==== +*Fedora/CentOS Stream:* The script may work on Fedora and CentOS Stream 9+ but is officially tested on RHEL 9 and RHEL 10 only. Use at your own discretion. +==== + +. Log into your OpenShift cluster: ++ +[source,terminal] +---- +oc login --token=<token> --server=https://api.cluster.example.com:6443 +---- + +. Verify you have cluster admin privileges: ++ +[source,terminal] +---- +oc auth can-i '*' '*' --all-namespaces +---- ++ +This should return `yes`. + +. 
Verify cluster topology and version: ++ +[source,terminal] +---- +make show-cluster-info +---- ++ +This command displays: +* OpenShift version +* Cluster topology (HighlyAvailable or SingleReplica for SNO) +* Node count and resources +* Kubernetes version ++ +[NOTE] +==== +The pattern supports both standard HighlyAvailable clusters and Single Node OpenShift (SNO). The topology is automatically detected. +==== + +. Configure cluster infrastructure: ++ +[source,terminal] +---- +make configure-cluster +---- ++ +This command: +* Installs OpenShift Data Foundation (ODF) if not present +* Scales compute nodes if needed +* Configures storage classes +* Validates cluster prerequisites ++ +This step takes approximately 10-15 minutes. + +. Verify available storage classes on your cluster: ++ +[source,terminal] +---- +oc get storageclass +---- ++ +Identify the appropriate storage class for your cluster topology and cloud provider: ++ +* *AWS SNO/HA:* Use `gp3-csi` or `gp2-csi` +* *Azure SNO/HA:* Use `managed-csi` +* *GCP SNO/HA:* Use `standard-rw` +* *Local/Edge SNO:* Use `local-path` (if available) ++ +Note the storage class name for use in `values-hub.yaml` configuration. + +[WARNING] +==== +The following steps MUST be completed before running ANY `make` targets. The Makefile reads `values-global.yaml` on startup, so all make commands will fail if this file doesn't exist or is misconfigured. +==== + +. Create values files from examples: ++ +[source,terminal] +---- +# Create global values file (REQUIRED) +cp values-global.yaml.example values-global.yaml + +# Create hub values file (REQUIRED - may already exist) +cp values-hub.yaml.example values-hub.yaml +---- ++ +[IMPORTANT] +==== +Both `values-global.yaml` and `values-hub.yaml` are REQUIRED before proceeding. All subsequent `make` commands depend on these files existing and being properly configured. +==== + +. 
Edit `values-global.yaml` and update the repository URL to YOUR fork: ++ +[source,yaml] +---- +global: + pattern: openshift-aiops-platform + + git: + # CRITICAL: Update this to YOUR forked repository URL + repoURL: https://github.com/<your-username>/openshift-aiops-platform.git + branch: main + + options: + useCSV: false + syncPolicy: Automatic + installPlanApproval: Automatic +---- ++ +[WARNING] +==== +Failing to update `git.repoURL` to your fork will cause deployment failures. The pattern requires your repository for GitOps synchronization. +==== + +. Edit `values-hub.yaml` and update the repository URL: ++ +[source,yaml] +---- +clusterGroup: + name: hub + + git: + # CRITICAL: Update this to YOUR forked repository URL + repoURL: https://github.com/<your-username>/openshift-aiops-platform.git + branch: main +---- ++ +For Single Node OpenShift deployments, also update the topology and storage class: ++ +[source,yaml] +---- +clusterGroup: + name: hub + + cluster: + # IMPORTANT: Set topology to "sno" for Single Node OpenShift + topology: "sno" # Or "ha" for standard HighlyAvailable (default) + + storage: + # SNO storage class depends on your cloud provider: + # - AWS SNO: "gp3-csi" (recommended) or "gp2-csi" + # - Azure SNO: "managed-csi" + # - GCP SNO: "standard-rw" + # - Local/edge SNO: "local-path" + storageClass: "gp3-csi" # Example for AWS SNO clusters +---- + +. Copy the secrets template file: ++ +[source,terminal] +---- +cp values-secret.yaml.template values-secret.yaml +---- + +. Edit `values-secret.yaml` and configure required secrets: ++ +[source,yaml] +---- +version: "2.0" +secrets: + - name: aws-s3-secret + fields: + - name: AWS_ACCESS_KEY_ID + value: <access-key-id> + - name: AWS_SECRET_ACCESS_KEY + value: <secret-access-key> +---- ++ +[NOTE] +==== +The pattern uses S3-compatible storage for metrics, logs, and model artifacts. You can use AWS S3, MinIO, or OpenShift Data Foundation's S3 service. +==== + +. 
Commit your customized values files to your fork: ++ +[source,terminal] +---- +git add values-global.yaml values-hub.yaml +git commit -m "Configure AIOps platform for deployment" +git push origin main +---- ++ +[IMPORTANT] +==== +Do not commit `values-secret.yaml` to your fork. It contains plaintext credentials and must remain local to your workstation. +==== + +== Understanding OpenShift Data Foundation (ODF) deployment + +The pattern requires S3-compatible object storage for metrics, logs, and model artifacts. The `make configure-cluster` step automatically installs OpenShift Data Foundation with different configurations based on cluster topology. + +*Standard HighlyAvailable (HA) Clusters*:: ++ +Full ODF deployment including: ++ +* Ceph storage cluster (requires 3+ worker nodes with local disks or cloud volumes) +* Multi-Cloud Gateway (MCG/NooBaa) for S3-compatible object storage +* CephFS and RBD storage classes for persistent volumes ++ +Resource requirements: ++ +* 3+ worker nodes with dedicated storage devices or cloud volumes +* 16 vCPU and 64 GB RAM per ODF node +* 1 TB+ storage per node for Ceph OSDs ++ +Installation time: 10-15 minutes + +*Single Node OpenShift (SNO)*:: ++ +MCG-only deployment (NooBaa standalone): ++ +* No Ceph storage cluster (SNO has insufficient nodes) +* Multi-Cloud Gateway for S3-compatible object storage only +* Uses cluster's existing storage class for MCG database and cache +* Suitable for edge deployments with single-node constraints ++ +Resource requirements: ++ +* 4 vCPU and 16 GB RAM for MCG pods +* 100 GB storage for MCG database (uses cluster's default storage class) ++ +Installation time: 5-10 minutes + +.Procedure + +. Run the cluster configuration script: ++ +[source,terminal] +---- +make configure-cluster +---- ++ +This command automatically: ++ +* Detects cluster topology (HA or SNO) via OpenShift API +* Installs appropriate ODF configuration (full ODF for HA, MCG-only for SNO) +* Configures storage classes and object storage endpoints +* Scales MachineSets if needed (HA clusters only) +* Waits for ODF components to be ready + +. 
(Optional) Skip ODF installation if you already have S3-compatible storage: ++ +[source,terminal] +---- +./scripts/configure-cluster-infrastructure.sh --skip-odf +---- ++ +[NOTE] +==== +If skipping ODF, you must manually configure S3 credentials in `values-secret.yaml` to point to your external S3 service (AWS S3, MinIO, etc.). +==== + +. Verify ODF installation completed successfully: ++ +[source,terminal] +---- +# Check ODF operator installation +oc get csv -n openshift-storage | grep odf + +# Check NooBaa (MCG) status - should show "Ready" +oc get noobaa -n openshift-storage + +# For HA clusters only, also check Ceph cluster status +oc get cephcluster -n openshift-storage +---- + +. Verify S3 endpoint is available: ++ +[source,terminal] +---- +# Get S3 endpoint route +oc get route s3 -n openshift-storage -o jsonpath='{.spec.host}' + +# Verify NooBaa admin console route +oc get route noobaa-mgmt -n openshift-storage -o jsonpath='{.spec.host}' +---- ++ +Expected output: Routes should be available and healthy. + +[NOTE] +==== +The pattern automatically configures access to the ODF S3 service using Kubernetes secrets. No manual S3 configuration is needed unless using external S3 storage. + +For MCG-only (SNO) deployments, object storage functionality is provided without the full Ceph cluster, making it suitable for resource-constrained edge environments. +==== + +== Get Ansible Execution Environment + +The pattern uses Ansible for cluster-scoped resource management and requires an execution environment container image. 
+ +.Procedure + +Choose one of the following options: + +*Option A: Use pre-built execution environment (Recommended)*:: ++ +[source,terminal] +---- +podman pull quay.io/takinosh/openshift-aiops-platform-ee:latest +podman tag quay.io/takinosh/openshift-aiops-platform-ee:latest openshift-aiops-platform-ee:latest +---- + +*Option B: Build execution environment locally*:: ++ +Requires Red Hat Ansible Automation Hub token configured in `~/.ansible.cfg`: ++ +[source,terminal] +---- +make build-execution-environment +---- ++ +This takes approximately 5-10 minutes. + +== Pre-deployment validation + +Before deploying the pattern, validate your configuration files and cluster readiness to catch common errors early. + +.Procedure + +. Validate YAML syntax in values files: ++ +[source,terminal] +---- +# Validate values-global.yaml syntax +yq eval '.' values-global.yaml > /dev/null && echo "✓ values-global.yaml syntax valid" || echo "✗ values-global.yaml has syntax errors" + +# Validate values-hub.yaml syntax +yq eval '.' values-hub.yaml > /dev/null && echo "✓ values-hub.yaml syntax valid" || echo "✗ values-hub.yaml has syntax errors" +---- ++ +Both commands should output "✓" confirming valid YAML syntax. + +. Verify Git repository URLs are set correctly: ++ +[source,terminal] +---- +# Check global values repository URL +yq eval '.global.git.repoURL' values-global.yaml + +# Check hub values repository URL +yq eval '.clusterGroup.git.repoURL' values-hub.yaml + +# Both should output YOUR forked repository URL, not the upstream +---- ++ +[WARNING] +==== +If either value shows `https://github.com/KubeHeal/openshift-aiops-platform.git` (upstream), you MUST update it to your forked repository URL. Deployment will fail if this is not corrected. + +Expected format: `https://github.com/<your-username>/openshift-aiops-platform.git` +==== + +. 
Verify cluster topology configuration matches actual cluster: ++ +[source,terminal] +---- +# Check detected topology from cluster +make show-cluster-info + +# Check configured topology in values file +yq eval '.clusterGroup.cluster.topology' values-hub.yaml + +# Verify they match: +# - SNO clusters: both should show "sno" +# - HA clusters: values file should show "ha" or be empty (defaults to "ha") +---- + +. Test network connectivity from cluster to required container registries: ++ +[source,terminal] +---- +# Test connectivity to Quay.io +oc debug node/$(oc get nodes -o jsonpath='{.items[0].metadata.name}') -- \ + chroot /host curl -I https://quay.io + +# Test connectivity to Red Hat Container Registry +oc debug node/$(oc get nodes -o jsonpath='{.items[0].metadata.name}') -- \ + chroot /host curl -I https://registry.redhat.io + +# Both should return HTTP 200 or 301/302 redirects +---- + +. Verify sufficient cluster resources are available: ++ +[source,terminal] +---- +# Check total allocatable resources across all nodes +oc get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory + +# Verify resource requirements are met: +# - SNO: minimum 8 vCPU, 32 GB RAM on the single node +# - HA: minimum 120 vCPU, 480 GB RAM total across all nodes +---- ++ +If resources are insufficient, deployment may fail or components may not schedule properly. + +== Deploying the pattern + +The pattern deploys using the Validated Patterns Operator with Ansible prerequisites. + +.Procedure + +. Validate cluster prerequisites: ++ +[source,terminal] +---- +make check-prerequisites +---- ++ +This verifies: +* OpenShift version compatibility +* Required storage classes +* Cluster topology +* Network connectivity +* Available resources + +. 
Deploy Ansible prerequisites (RBAC, secrets, namespace): ++ +[source,terminal] +---- +make operator-deploy-prereqs +---- ++ +This creates: +* `self-healing-platform-hub` namespace +* Service accounts and RBAC for Ansible +* Secret management configuration + +. Deploy the pattern via Validated Patterns Operator: ++ +[source,terminal] +---- +make operator-deploy +---- ++ +This command: +* Deploys the Validated Patterns Operator +* Creates the pattern's root ArgoCD application +* Waits for the operator to initialize ++ +The deployment takes approximately 20-30 minutes depending on cluster size and network speed. + +. Wait for the root ArgoCD application to be created: ++ +[source,terminal] +---- +oc wait --for=jsonpath='{.kind}'=Application \ + application/self-healing-platform \ + -n self-healing-platform-hub \ + --timeout=120s +---- + +. Trigger ArgoCD synchronization: ++ +[source,terminal] +---- +oc annotate application self-healing-platform \ + -n self-healing-platform-hub \ + argocd.argoproj.io/refresh=hard \ + --overwrite +---- + +. Verify ArgoCD application health: ++ +[source,terminal] +---- +make argo-healthcheck +---- ++ +This checks that all ArgoCD applications are synced and healthy. + +. 
Run comprehensive deployment validation pipeline: ++ +[source,terminal] +---- +tkn pipeline start deployment-validation-pipeline --showlog +---- ++ +This automated validation pipeline runs *26 comprehensive checks* grouped into the following categories: ++ +*Prerequisite Validation (5 checks):* ++ +* Cluster version compatibility (OpenShift 4.18 or later) +* Required CLI tools availability (`oc`, `kubectl`, `helm`, `tkn`) +* Storage class availability and configuration +* Network connectivity to container registries +* Cluster resource capacity (CPU, memory, storage) ++ +*Operator Deployment Verification (6 checks):* ++ +* OpenShift GitOps operator installation and health +* OpenShift Pipelines operator installation and health +* OpenShift AI (RHODS) operator installation and health +* KServe operator installation and health +* GPU operator installation and health (if GPU nodes configured) +* OpenShift Data Foundation operator installation and health ++ +*Storage Provisioning Validation (3 checks):* ++ +* PersistentVolumeClaim binding status +* ODF/NooBaa readiness and S3 endpoint availability +* Storage class configuration for model storage ++ +*Model Serving Endpoint Validation (4 checks):* ++ +* InferenceService custom resource definitions installed +* InferenceService deployment status (anomaly-detector, predictive-analytics) +* Model serving endpoint HTTP connectivity +* Inference endpoint response validation (POST request test) ++ +*Coordination Engine Validation (3 checks):* ++ +* Coordination engine pod health and readiness +* REST API endpoint availability (`:8080/health`) +* Health check endpoint response validation ++ +*Monitoring Infrastructure Validation (5 checks):* ++ +* Prometheus availability and metric scraping +* Grafana dashboard deployment and accessibility +* Logging stack operational status (if deployed) +* Metrics collection validation for self-healing platform +* Custom ServiceMonitor resource creation + +. 
Verify model training pipeline status: ++ +After the initial deployment, ArgoCD automatically triggers training for both ML models. Verify that the PipelineRuns were created and completed successfully: ++ +[source,terminal] +---- +tkn pipelinerun list -n self-healing-platform +---- ++ +Expected output shows completed PipelineRuns for: ++ +* `anomaly-detector-training-<id>` - Anomaly detection model +* `predictive-analytics-training-<id>` - Predictive analytics model (GPU-accelerated) ++ +Both should show `Succeeded` status. If a run shows `Failed`, or remains `Running` for an unusually long time, investigate the logs. + +. If model training failed or was not automatically triggered, manually start the training pipelines: ++ +*Start anomaly detector model training:* ++ +[source,terminal] +---- +tkn pipeline start model-training-pipeline \ + -p model-name=anomaly-detector \ + -p notebook-path=notebooks/02-anomaly-detection/01-isolation-forest-implementation.ipynb \ + -p data-source=prometheus \ + -p training-hours=168 \ + -p inference-service-name=anomaly-detector \ + -p health-check-enabled=true \ + -p git-url=https://github.com/<your-username>/openshift-aiops-platform.git \ + -p git-ref=main \ + -n self-healing-platform \ + --showlog +---- ++ +*Start predictive analytics model training (requires GPU):* ++ +[source,terminal] +---- +tkn pipeline start model-training-pipeline-gpu \ + -p model-name=predictive-analytics \ + -p notebook-path=notebooks/02-anomaly-detection/05-predictive-analytics-kserve.ipynb \ + -p data-source=prometheus \ + -p training-hours=720 \ + -p inference-service-name=predictive-analytics \ + -p health-check-enabled=true \ + -p git-url=https://github.com/<your-username>/openshift-aiops-platform.git \ + -p git-ref=main \ + -n self-healing-platform \ + --showlog +---- ++ +[NOTE] +==== +* The `training-hours` parameter specifies how much historical metric data to use for training +* `anomaly-detector`: 168 hours (7 days) - runs on CPU nodes +* `predictive-analytics`: 720 hours (30 days) - requires GPU nodes +* Training time 
varies: 10-30 minutes for anomaly-detector, 1-3 hours for predictive-analytics +==== + +== Verification + +After installation completes, verify that all components are running correctly. + +.Procedure + +. Check that all required operators are installed: ++ +[source,terminal] +---- +oc get csv -n openshift-operators +---- ++ +You should see ClusterServiceVersions for: ++ +* OpenShift GitOps +* OpenShift Pipelines +* OpenShift AI (RHODS) +* KServe +* Advanced Cluster Management (if multi-cluster) +* OpenShift Data Foundation (if deployed) + +. Check all pods in the self-healing platform namespace: ++ +[source,terminal] +---- +oc get pods -n self-healing-platform +---- ++ +All pods should be in the `Running` or `Completed` state. + +. Verify InferenceServices are ready: ++ +[source,terminal] +---- +oc get inferenceservice -n self-healing-platform +---- ++ +The incident prediction model should be in the `Ready` state. + +. Access the ArgoCD UI: ++ +[source,terminal] +---- +oc get route openshift-gitops-server -n openshift-gitops +---- ++ +Open the URL in your browser and log in with your OpenShift credentials. + +. Verify all ArgoCD applications are synced and healthy: ++ +image::openshift-aiops-platform/oaiops-applications.png[ArgoCD Applications] ++ +You should see applications including: ++ +* `self-healing-platform` +* `openshift-ai-operators` +* `aiops-observability` +* `aiops-mlops-pipeline` + +. Check ArgoCD application sync status via the CLI: ++ +[source,terminal] +---- +oc get applications -n self-healing-platform-hub +---- ++ +All applications should show `Synced` and `Healthy` status. + +. Verify Tekton deployment validation pipeline status: ++ +[source,terminal] +---- +tkn pipelinerun list -n self-healing-platform +---- ++ +The `deployment-validation-pipeline` should show `Succeeded` status. + +== Post-deployment configuration + +Configure access to platform components and verify functionality. + +.Procedure + +. 
Access the Jupyter workbench for model development: ++ +[source,terminal] +---- +oc port-forward self-healing-workbench-0 8888:8888 -n self-healing-platform +---- ++ +Open http://localhost:8888 in your browser. The Jupyter environment includes notebooks for: ++ +* Model training workflows +* Feature engineering +* Incident analysis +* A/B testing configuration + +. Verify the model storage directory structure: ++ +[source,terminal] +---- +oc exec -it self-healing-workbench-0 -n self-healing-platform -- ls -la /mnt/models/ +---- ++ +Expected subdirectories created by the `init-models-job`: ++ +* `predictive-analytics/` +* `arima-predictor/` +* `prophet-predictor/` +* `lstm-predictor/` +* `ensemble-predictor/` +* `anomaly-detector/` + +. Check coordination engine status and logs: ++ +[source,terminal] +---- +# View coordination engine pod status +oc get pods -n self-healing-platform -l app.kubernetes.io/component=coordination-engine + +# Check coordination engine health endpoint +oc port-forward svc/coordination-engine 8080:8080 -n self-healing-platform +# In another terminal: +curl http://localhost:8080/health + +# View logs for self-healing decisions +oc logs -n self-healing-platform -l app.kubernetes.io/component=coordination-engine --tail=100 -f +---- + +. Verify NotebookValidationJobs: ++ +[source,terminal] +---- +oc get notebookvalidationjob -n self-healing-platform +---- ++ +These jobs validate notebook execution and model serving functionality. + +. Test model inference endpoints: ++ +[source,terminal] +---- +# List all InferenceServices +oc get inferenceservices -n self-healing-platform + +# Verify model serving endpoints are ready +oc get inferenceservices -n self-healing-platform -o wide +---- ++ +The deployment validation pipeline tests inference endpoints automatically. + +. 
Access ArgoCD UI for GitOps management: ++ +[source,terminal] +---- +# Get ArgoCD route +oc get route openshift-gitops-server -n openshift-gitops -o jsonpath='{.spec.host}' + +# Retrieve admin password +oc extract secret/openshift-gitops-cluster -n openshift-gitops --to=- +---- ++ +Log in to the ArgoCD UI to view all applications and their sync status. + +. Check MCP server status: ++ +[source,terminal] +---- +# View cluster health MCP server logs +oc logs deployment/cluster-health-mcp-server -n self-healing-platform +---- ++ +The MCP server operates via stdio transport and provides cluster health monitoring capabilities. + +. Clean up auto-generated namespaces (optional): ++ +[source,terminal] +---- +# These namespaces are created by upstream Validated Patterns defaults +# They can be safely deleted if not needed +oc delete namespace self-healing-platform-example imperative --ignore-not-found=true +---- + +. Verify external secrets synchronization: ++ +[source,terminal] +---- +# Check ExternalSecrets status +oc get externalsecrets -n self-healing-platform + +# Verify synchronized secrets exist +oc get secrets -n self-healing-platform | grep -E "git-credentials|model-storage" +---- + +== Adding spoke clusters (Optional) + +To add managed spoke clusters for multi-cluster self-healing: + +.Procedure + +. In the ACM console on the hub cluster, navigate to *Infrastructure* -> *Clusters*. + +. Click *Import cluster*. + +. Follow the import wizard to add your spoke cluster: ++ +* Enter a cluster name +* Select cluster set (create one if needed) +* Copy and run the generated `oc` command on the spoke cluster + +. After import completes, verify observability is enabled: ++ +[source,terminal] +---- +oc get managedcluster -o yaml | grep observability +---- + +. The spoke cluster will automatically start sending metrics, logs, and alerts to the hub cluster. + +. Verify spoke metrics in Grafana on the hub cluster. 
+ +For detailed multi-cluster configuration, see the link:https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes[Advanced Cluster Management documentation]. + +== Troubleshooting + +If you encounter issues during deployment: + +*ArgoCD applications not syncing*:: ++ +Check that your Git repository URL is correctly configured: ++ +[source,terminal] +---- +oc get application self-healing-platform -n self-healing-platform-hub -o yaml | grep repoURL +---- ++ +Verify it points to YOUR fork, not the upstream repository. If applications are OutOfSync, force synchronization: ++ +[source,terminal] +---- +argocd app sync self-healing-platform --force +---- + +*Pods in "Init:Error" state*:: ++ +This usually indicates missing ServiceAccounts or RBAC. Re-run the prerequisite deployment: ++ +[source,terminal] +---- +make deploy-prereqs-only +---- + +*NotebookValidationJobs fail*:: ++ +Verify the init-models-job successfully created model storage subdirectories: ++ +[source,terminal] +---- +oc get job init-models-job -n self-healing-platform +oc logs job/init-models-job -n self-healing-platform +---- + +*Operator deployment fails*:: ++ +Check operator logs: ++ +[source,terminal] +---- +oc logs -n openshift-operators deployment/validated-patterns-operator -f +---- + +*InferenceService not ready*:: ++ +Check KServe controller logs: ++ +[source,terminal] +---- +oc logs -n knative-serving deployment/controller -f +---- + +*Storage issues*:: ++ +Verify storage classes are configured: ++ +[source,terminal] +---- +oc get storageclass +---- ++ +For SNO, ensure the correct storage class is specified in `values-hub.yaml`. 
++ +Check PVC binding status: ++ +[source,terminal] +---- +oc get pvc -n self-healing-platform +---- + +*ODF installation fails during configure-cluster*:: ++ +Check ODF operator status and logs: ++ +[source,terminal] +---- +# Check ODF operator installation +oc get csv -n openshift-storage + +# Check operator controller logs +oc logs -n openshift-storage deployment/odf-operator-controller-manager --tail=100 + +# For HA clusters, check Ceph cluster status +oc get cephcluster -n openshift-storage -o yaml + +# For all deployments, check NooBaa status +oc get noobaa -n openshift-storage -o yaml +---- ++ +Common ODF issues and solutions: ++ +* *Insufficient nodes for Ceph (HA):* HA clusters need 3+ worker nodes for Ceph. With fewer nodes, use `--skip-odf` and configure external S3. +* *Storage class not found:* Verify the backing storage class exists with `oc get storageclass`. Ensure cloud provider storage drivers are installed. +* *OSD creation fails:* Check that nodes have available disks/volumes: `oc debug node/<node-name> -- lsblk` +* *NooBaa pods in CrashLoopBackOff:* Check PVC binding: `oc get pvc -n openshift-storage`. Ensure the storage class can provision volumes. + +*Model training pipeline fails*:: ++ +Check pipeline logs and resource availability: ++ +[source,terminal] +---- +# List recent pipeline runs +tkn pipelinerun list -n self-healing-platform + +# Get detailed logs for a failed pipeline run +tkn pipelinerun logs <pipelinerun-name> -n self-healing-platform + +# Check GPU availability (required for predictive-analytics model) +oc get nodes -l nvidia.com/gpu.present=true + +# Verify training data storage is available +oc exec -it self-healing-workbench-0 -n self-healing-platform -- \ + ls -lh /mnt/models/ +---- ++ +Common model training issues: ++ +* *Insufficient training data:* The pipeline requires historical Prometheus metrics. Wait 24-48 hours after deployment before training. +* *GPU not available:* The `predictive-analytics` model requires GPU nodes. 
On clusters without GPUs, use only the `anomaly-detector` model. +* *Notebook execution fails:* Check Jupyter workbench logs: `oc logs self-healing-workbench-0 -n self-healing-platform` +* *S3 storage access denied:* Verify `values-secret.yaml` has correct S3 credentials and that secrets are synced: `oc get secrets -n self-healing-platform` + +*Execution environment image issues*:: ++ +Verify the Ansible execution environment image is available: ++ +[source,terminal] +---- +# List local podman images +podman images | grep openshift-aiops-platform-ee + +# Test the execution environment image +podman run --rm openshift-aiops-platform-ee:latest ansible --version + +# If building locally, check for errors in build logs +podman build --help # Verify podman build works +---- ++ +If the image is missing, re-pull or rebuild it: ++ +[source,terminal] +---- +# Re-pull pre-built image +podman pull quay.io/takinosh/openshift-aiops-platform-ee:latest +podman tag quay.io/takinosh/openshift-aiops-platform-ee:latest openshift-aiops-platform-ee:latest + +# Or rebuild locally (requires ANSIBLE_HUB_TOKEN) +export ANSIBLE_HUB_TOKEN='your-token-here' +make build-ee +---- + +*Values file repository URL still pointing to upstream*:: ++ +Verify and fix the repository URL configuration: ++ +[source,terminal] +---- +# Check current repository URLs +yq eval '.global.git.repoURL' values-global.yaml +yq eval '.clusterGroup.git.repoURL' values-hub.yaml + +# Update if pointing to upstream (KubeHeal/openshift-aiops-platform) +yq eval '.global.git.repoURL = "https://github.com/<your-username>/openshift-aiops-platform.git"' -i values-global.yaml +yq eval '.clusterGroup.git.repoURL = "https://github.com/<your-username>/openshift-aiops-platform.git"' -i values-hub.yaml + +# Commit and push changes +git add values-global.yaml values-hub.yaml +git commit -m "Update repository URLs to forked repo" +git push origin main + +# Re-sync ArgoCD applications +oc annotate application self-healing-platform -n self-healing-platform-hub \ 
argocd.argoproj.io/refresh=hard --overwrite +---- + +*Complete cleanup and redeployment*:: ++ +If you need to clean up the pattern while preserving infrastructure operators: ++ +[source,terminal] +---- +ansible-navigator run ansible/playbooks/cleanup_pattern.yml \ + --container-engine podman \ + --execution-environment-image openshift-aiops-platform-ee:latest \ + --mode stdout +---- ++ +This removes the pattern CR, ArgoCD applications, and the application namespace while retaining GitOps, Gitea, and the Validated Patterns Operator. + +For additional troubleshooting, see: + +* link:https://github.com/KubeHeal/openshift-aiops-platform/blob/main/README.md[Pattern README] +* link:https://github.com/KubeHeal/openshift-aiops-platform/blob/main/DEPLOYMENT.md[Detailed Deployment Guide] +* link:https://github.com/KubeHeal/openshift-aiops-platform/issues[GitHub Issues] + +== Next steps + +* Explore the self-healing dashboard in Grafana to see incident detection and remediation +* Review the link:https://github.com/KubeHeal/openshift-aiops-platform/blob/main/AGENTS.md[AI Agent Development Guide] to customize agent behavior +* Access Jupyter notebooks to explore model training workflows +* Configure confidence thresholds and auto-remediation settings in `values-global.yaml` +* Train custom ML models using your historical incident data +* Set up integrations with external systems like ServiceNow or Slack for notifications + +For customization options, see link:ideas-for-customization[Ideas for Customization]. 
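As a sketch of the confidence-threshold configuration mentioned in the next steps, such settings might look like the following in `values-global.yaml`. The key names below are hypothetical illustrations, not the pattern's actual schema; check the values files in the repository for the real names.

```yaml
# Hypothetical sketch only: key names are illustrative and must be
# checked against the pattern's actual values-global.yaml schema.
global:
  selfHealing:
    autoRemediation: true        # allow the engine to act without review
    confidenceThresholds:
      execute: 0.85              # act automatically at or above this score
      review: 0.60               # open a PR for human review in this band
                                 # below 0.60: escalate to the ops team
```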
diff --git a/modules/oaiops-solution-elements.adoc b/modules/oaiops-solution-elements.adoc new file mode 100644 index 000000000..4a8055f12 --- /dev/null +++ b/modules/oaiops-solution-elements.adoc @@ -0,0 +1,87 @@ +:_content-type: CONCEPT +:imagesdir: ../../images + +[id="solution-elements-openshift-aiops"] += Solution elements + +The OpenShift AIOps Self-Healing Platform is built on several key architectural components that work together to provide intelligent, automated incident response. + +[TIP] +==== +**Learn by Doing**: The link:https://kubeheal.github.io/self-healing-workshop/modules/index.html[Self-Healing Workshop] provides hands-on labs demonstrating each solution element in practice. Workshop modules cover ML model training (Module 1), MCP server deployment (Module 2), end-to-end self-healing with Lightspeed (Module 3), and advanced ML techniques (Module 4). +==== + +== Intelligent Self-Healing Architecture + +The platform uses ML-driven analysis combined with deployment-aware remediation: + +*ML-Based Anomaly Detection*:: +Two trained models deployed via KServe analyze cluster behavior: **Isolation Forest** (100 estimators, 3% contamination rate) detects unusual patterns in real-time metrics, while **Random Forest Regressor** (100 trees, max_depth=20) forecasts resource usage 1 hour ahead. Models are trained weekly on hybrid synthetic and real Prometheus data (7-day and 30-day lookback periods respectively) via automated Tekton Pipelines triggered by NotebookValidationJob custom resources. + +*Jupyter Notebook Development Interface*:: +33 production-ready notebooks organized in 9 directories provide the primary interface for model development, testing, and remediation integration. Categories include setup (3 notebooks), data collection (5), anomaly detection (5), self-healing logic (3), model serving (3), end-to-end scenarios (4), MCP/Lightspeed integration (4), monitoring (3), and advanced scenarios (3). 
Notebooks integrate with the coordination engine REST API to submit anomalies and trigger remediation workflows. + +*Go-Based Coordination Engine*:: +Orchestrates remediation workflows via REST API endpoints. **Currently implemented**: `/health` and `/api/v1/health` endpoints for health monitoring, deployment method detection (ArgoCD, Helm, Operator, Manual), and pod lifecycle management. **Under active development**: Deployment resource updates (memory/CPU limit adjustments), `/api/v1/remediation/trigger` endpoint for programmatic remediation. **Planned**: Declarative ConfigMap-based remediation pattern definitions. The engine integrates with KServe models for anomaly detection, monitors dependencies (Kubernetes API, KServe services), and exposes Prometheus metrics on port 9090. + +== Multi-Cluster Observability + +Red Hat Advanced Cluster Management provides the observability foundation: + +*Centralized Metrics Collection*:: +Prometheus metrics from all spoke clusters are aggregated in the hub cluster using Thanos. This provides a single pane of glass for querying metrics across the entire fleet and enables correlation of incidents across multiple clusters. + +*Distributed Tracing*:: +OpenTelemetry instrumentation captures traces across microservices, allowing the self-healing engine to understand the full context of distributed system failures and identify root causes beyond simple metric thresholds. + +*Log Aggregation*:: +Centralized log collection enables the platform to analyze error patterns in application logs, correlate log events with metrics, and extract features for ML model training. + +*Custom Metrics*:: +Pattern-specific metrics track self-healing effectiveness including remediation success rate, confidence scores, time to resolution, and false positive rates. These metrics feed dashboards and drive continuous improvement. 
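The custom metrics and confidence scores above drive a simple routing policy: the confidence bands come from the pattern's hybrid decision-engine design (execute at 85% and above, human review at 60-84%, escalation below 60%). The function and metric names in this sketch are illustrative, not the coordination engine's actual API.

```python
# Illustrative sketch of confidence-based routing and the
# "remediation success rate" custom metric described above.
# Thresholds mirror the pattern's decision-flow design; the
# names are not the coordination engine's real API.

def route_remediation(confidence: float) -> str:
    """Map an ML confidence score (0.0-1.0) to a remediation action."""
    if confidence >= 0.85:
        return "execute-and-notify"       # high confidence: act automatically
    if confidence >= 0.60:
        return "create-pr-for-review"     # medium: propose change via PR
    return "escalate-to-ops"              # low: hand off to the ops team

def success_rate(outcomes: list[bool]) -> float:
    """Remediation success rate: successful runs / total runs."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

print(route_remediation(0.91))                   # execute-and-notify
print(route_remediation(0.72))                   # create-pr-for-review
print(success_rate([True, True, False, True]))   # 0.75
```

A policy like this keeps the thresholds in one place, which is what lets the false-positive and success-rate metrics feed back into threshold tuning.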
+ +== Deployment-Aware Remediation + +The coordination engine adapts remediation strategies based on how applications were deployed: + +*Deployment Method Detection*:: +The coordination engine automatically detects whether applications are managed by ArgoCD, deployed via Helm, managed by Operators, or manually deployed. This detection drives the remediation strategy selection. + +*ArgoCD Integration*:: +For ArgoCD-managed applications, the coordination engine triggers ArgoCD sync operations to reconcile application state with Git. This respects GitOps principles and maintains Git as the source of truth for configuration. + +*Direct Remediation*:: +For non-ArgoCD applications, the coordination engine performs direct Kubernetes API operations. Current capabilities include pod lifecycle management (delete/recreate). Under development: Deployment resource updates (memory limits, CPU requests, replica counts). + +*Remediation Tracking*:: +The coordination engine provides Prometheus metrics tracking remediation attempts, workflow duration, ArgoCD sync operations, and ML-enhanced detection events. Health check endpoints (`/health`, `/api/v1/health`) enable monitoring of the coordination engine and its dependencies (Kubernetes API, KServe services, RBAC permissions). + +== MLOps Pipeline + +The platform implements automated model lifecycle management: + +*Data Collection*:: +Prometheus metrics (via 7-day and 30-day lookback queries) combined with synthetic anomaly generation provide training data. Notebooks in the `01-data-collection/` directory extract features including metric time-series patterns, OpenShift event analysis, log parsing, and feature store integration. + +*Model Training*:: +Tekton Pipelines orchestrate weekly automated retraining triggered by NotebookValidationJob custom resources. 
Two models are maintained: **Isolation Forest** (notebooks/02-anomaly-detection/01-isolation-forest-implementation.ipynb) and **Random Forest Regressor** (notebooks/02-anomaly-detection/05-predictive-analytics-kserve.ipynb). Training notebooks execute in containerized environments with access to model storage PVCs and S3 credentials. + +*Model Storage*:: +Trained model artifacts are stored in S3-compatible storage: NooBaa (full ODF on HA clusters) or external S3 (SNO clusters). Models are persisted to `/mnt/models/` directories mounted from persistent volume claims (CephFS RWX on HA, gp3-csi RWO on SNO). Model files include serialized scikit-learn objects (`.pkl` format) and associated metadata. + +*Model Serving*:: +KServe InferenceServices deploy models with two endpoints: `anomaly-detector-predictor` and `predictive-analytics-predictor`. Each service provides HTTP REST API on port 8080 with `/v1/models/model:predict` endpoint for inference. The coordination engine calls these services directly via cluster-internal DNS (no external ML service wrapper). Models support autoscaling based on request volume and provide readiness/liveness probes. + +*Validation and Deployment*:: +NotebookValidationJob CRs validate notebook execution before model deployment. Successful validation triggers model updates in KServe InferenceServices. Validation results are tracked (94%+ success rate across 33 notebooks on both SNO and HA topologies). The `notebook-validator:latest` image (built from workbench base) executes notebooks via papermill with timeout protection (45 minutes default). 
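The serving contract above can be sketched as a short client. The service hostname is an assumed cluster-internal DNS name and the feature vector is illustrative; only the `/v1/models/<name>:predict` path and port 8080 come from the description above.

```python
import json

# Sketch of a request to the KServe V1 inference endpoint described
# above (/v1/models/<name>:predict on port 8080). The hostname and
# the feature vector are illustrative assumptions, not values taken
# from the deployed pattern.
MODEL = "anomaly-detector"
URL = (f"http://{MODEL}-predictor.self-healing-platform.svc.cluster.local"
       f":8080/v1/models/{MODEL}:predict")

# KServe V1 protocol: a JSON body with an "instances" list,
# one inner list of feature values per sample.
payload = json.dumps({"instances": [[0.42, 0.17, 0.93, 0.05]]})

print(URL)
print(payload)
# Inside the cluster this body would be POSTed (for example with
# urllib.request); an anomaly-flagging response typically has the
# shape {"predictions": [...]}.
```

The coordination engine issues this call directly over cluster-internal DNS, which is why no external ML service wrapper is needed.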
+ +== Technology Highlights + +The pattern demonstrates several modern operational practices: + +* *Event-Driven Architecture*: Incident detection triggers automated workflows +* *Hybrid AI*: Combines symbolic AI (rules) with machine learning for robust decision-making +* *MLOps*: Full lifecycle management for ML models in production +* *GitOps*: Declarative, auditable infrastructure and application management +* *Multi-Cluster*: Scalable architecture for managing large OpenShift fleets +* *Observability*: Comprehensive monitoring, logging, and tracing diff --git a/modules/openshift-aiops-platform/metadata-openshift-aiops-platform.adoc b/modules/openshift-aiops-platform/metadata-openshift-aiops-platform.adoc new file mode 100644 index 000000000..26f3cc127 --- /dev/null +++ b/modules/openshift-aiops-platform/metadata-openshift-aiops-platform.adoc @@ -0,0 +1,28 @@ +// This file has been generated automatically from the pattern-metadata.yaml file +// Do not edit manually! +:metadata_version: 1.0 +:name: openshift-aiops-platform +:pattern_version: 1.0 +:display_name: OpenShift AIOps Self-Healing Platform +:repo_url: https://github.com/KubeHeal/openshift-aiops-platform +:docs_repo_url: https://github.com/validatedpatterns/docs +:issues_url: https://github.com/KubeHeal/openshift-aiops-platform/issues +:docs_url: https://validatedpatterns.io/patterns/openshift-aiops-platform/ +:ci_url: https://validatedpatterns.io/ci/?pattern=openshift-aiops-platform +:tier: community +:owners: KubeHeal +:requirements_hub_compute_platform_gcp_replicas: 5 +:requirements_hub_compute_platform_gcp_type: n2-standard-16 +:requirements_hub_compute_platform_azure_replicas: 5 +:requirements_hub_compute_platform_azure_type: Standard_D16as_v4 +:requirements_hub_compute_platform_aws_replicas: 5 +:requirements_hub_compute_platform_aws_type: m5.4xlarge +:requirements_hub_controlPlane_platform_gcp_replicas: 3 +:requirements_hub_controlPlane_platform_gcp_type: n2-standard-8 
+:requirements_hub_controlPlane_platform_azure_replicas: 3 +:requirements_hub_controlPlane_platform_azure_type: Standard_D8s_v3 +:requirements_hub_controlPlane_platform_aws_replicas: 3 +:requirements_hub_controlPlane_platform_aws_type: m5.2xlarge +:extra_features_hypershift_support: false +:extra_features_spoke_support: true +:external_requirements: diff --git a/static/images/logos/README-openshift-aiops-platform.md b/static/images/logos/README-openshift-aiops-platform.md new file mode 100644 index 000000000..9d9cab66b --- /dev/null +++ b/static/images/logos/README-openshift-aiops-platform.md @@ -0,0 +1,27 @@ +# OpenShift AIOps Platform Logo + +## Status: Placeholder Logo + +The current logo (`openshift-aiops-platform.svg`) is a placeholder SVG file. + +## Required Action + +A proper pattern logo needs to be created and added as `openshift-aiops-platform.png` to match the front matter specification in `content/patterns/openshift-aiops-platform/_index.adoc`. + +## Logo Requirements + +- **Format**: PNG +- **Dimensions**: Recommended 400x400 pixels (square aspect ratio) +- **Style**: Should align with Validated Patterns branding +- **Content**: Should visually represent AI/ML-powered self-healing for OpenShift clusters + +## Design Suggestions + +- Include OpenShift or Kubernetes iconography +- Incorporate AI/ML visual elements (brain, neural network, gears for automation) +- Use Red Hat brand colors if appropriate +- Keep it simple and recognizable at small sizes + +## Contact + +For logo design, contact the KubeHeal team or the Validated Patterns community. 
diff --git a/static/images/logos/openshift-aiops-platform.svg b/static/images/logos/openshift-aiops-platform.svg new file mode 100644 index 000000000..7f83c0412 --- /dev/null +++ b/static/images/logos/openshift-aiops-platform.svg @@ -0,0 +1,28 @@ +[SVG markup not reproduced here; the placeholder logo renders the text labels "AIOps Platform" and "Self-Healing OpenShift"] diff --git a/static/images/openshift-aiops-platform/README.md b/static/images/openshift-aiops-platform/README.md new file mode 100644 index 000000000..1668c042c --- /dev/null +++ b/static/images/openshift-aiops-platform/README.md @@ -0,0 +1,134 @@ +# OpenShift AIOps Platform Images + +This directory contains diagrams and images for the OpenShift AIOps Self-Healing Platform pattern documentation. + +## Required Architecture Diagrams + +The following diagrams are referenced in the documentation but need to be created: + +### 1. `aiops-logical-architecture.png` +**Purpose**: Overview of hub-and-spoke architecture + +**Content**: +- Hub cluster with components: OpenShift AI, ACM, GitOps, Observability +- Multiple spoke clusters connected to hub +- Git repositories for configuration +- Model registry and serving infrastructure +- Data flow arrows showing observability data (spoke → hub) and remediation (hub → spoke via Git) + +**Suggested Tools**: Draw.io, Lucidchart, or diagrams.net + +--- + +### 2. `aiops-dataflow.png` +**Purpose**: End-to-end self-healing workflow + +**Content**: +Flow diagram showing: +1. Event Detection (Prometheus alerts on spoke) +2. Aggregation (ACM observability to hub) +3. Analysis (Decision engine feature extraction) +4. Decision (Hybrid engine: runbooks + ML) +5. Execution (Git commit → ArgoCD sync) +6. Feedback (Outcome recorded for ML training) + +**Style**: Sequential flow diagram with boxes and arrows + +--- + +### 3. 
`aiops-decision-flow.png` +**Purpose**: Hybrid decision engine logic + +**Content**: +Decision tree showing: +- Incident arrives +- Pattern Matching stage (check runbooks) + - Match found → Execute (confidence 100%) + - No match → ML Inference +- ML Inference stage (model prediction) + - Confidence ≥85% → Execute + Notify + - Confidence 60-84% → Create PR for review + - Confidence <60% → Escalate to ops team + +**Style**: Flowchart/decision tree + +--- + +### 4. `aiops-ml-pipeline.png` +**Purpose**: MLOps workflow for model lifecycle + +**Content**: +Pipeline showing: +1. Data Storage (incident history, metrics, logs) +2. Kubeflow Pipeline + - Data extraction + - Feature engineering + - Training + - Validation +3. Model Registry (MLflow) +4. KServe Serving (inference API) +5. Monitoring (Prometheus metrics, drift detection) +6. Feedback loop back to data storage + +**Style**: Horizontal pipeline diagram + +--- + +## Optional Screenshots (can be added post-deployment) + +### `oaiops-operators.png` +Screenshot of OpenShift Console showing installed operators: +- OpenShift GitOps +- OpenShift AI +- Advanced Cluster Management +- OpenShift Pipelines +- OpenShift Data Foundation + +### `oaiops-applications.png` +Screenshot of ArgoCD UI showing synced applications: +- openshift-aiops-hub +- openshift-ai-operators +- advanced-cluster-management +- aiops-observability +- aiops-mlops-pipeline + +--- + +## Diagram Style Guidelines + +- Use Red Hat/OpenShift brand colors where appropriate +- Keep diagrams clean and readable +- Include legends for icons and colors +- Use consistent iconography (OpenShift logo, Kubernetes icons, etc.) 
+- Ensure text is legible at standard documentation sizes +- Save as PNG with transparent background if possible + +## Image Specifications + +- **Format**: PNG +- **Resolution**: Minimum 1200px width for architecture diagrams +- **DPI**: 150+ for crisp rendering +- **File size**: Optimize to <500KB per image + +## Temporary Workarounds + +Until diagrams are created, the documentation will: +- Display broken image placeholders in browsers +- Show alt text describing the diagram +- Still build successfully in Hugo + +The pattern documentation is functional without these images but will be significantly enhanced once they are added. + +## Contributing + +If you create these diagrams, please: +1. Follow the content guidelines above +2. Add them to this directory +3. Submit a pull request to the validatedpatterns/docs repository +4. Include source files (`.drawio`, `.svg`, etc.) for future edits + +## Contact + +For questions about diagram creation, contact: +- Pattern maintainers: https://github.com/KubeHeal/openshift-aiops-platform +- Validated Patterns community: https://validatedpatterns.io