CephNodeDown
2. Overview
A storage node went down. Please check the node immediately. The alert should contain the node name.
A node running Ceph pods is down. While storage operations will continue to function, as Ceph is designed to tolerate a node failure, it is recommended to resolve the issue to minimize the risk of another node going down and affecting storage functions.
3. Prerequisites
3.1. Verify cluster access
Check the output of the command below to ensure you are in the correct context for the cluster mentioned in the alert. If not, change context and proceed.
ocm list clusters
From the list above, find the cluster ID of the cluster named in the alert. If you do not see the alerting cluster in the list above, please refer to Effective communication with SRE Platform.
ocm backplane tunnel <cluster_id>
ocm backplane login <cluster_id>
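To verify that the login succeeded and points at the right cluster, a quick sanity check (not part of the original procedure) is to inspect the API endpoint and your identity:
oc cluster-info
oc whoami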
3.2. Check Alerts
oc port-forward alertmanager-managed-ocs-alertmanager-0 9093 -n openshift-storage
curl http://localhost:9093/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
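If many alerts are firing, you can narrow the query to this alert specifically; for example (a variation on the command above, assuming the alertname label is CephNodeDown):
curl -s http://localhost:9093/api/v1/alerts | jq '.data[] | select(.labels.alertname=="CephNodeDown") | { ALERT: .labels.alertname, STATE: .status.state }'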
3.3. Check OCS Ceph Cluster Health
You may check the OCS Ceph cluster health directly by using the rook-ceph toolbox.
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
ceph status
ceph osd status
exit
If 'ceph status' does not report HEALTH_OK, please look at the Troubleshooting section to resolve the issue.
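As an alternative to an interactive shell, the same checks can be run non-interactively with oc exec, reusing the TOOLS_POD variable from above (a minimal sketch):
oc -n openshift-storage exec $TOOLS_POD -- ceph health detail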
3.4. Further info
3.4.1. OpenShift Data Foundation Dedicated Architecture
Red Hat OpenShift Data Foundation Dedicated (ODF Dedicated) is deployed in converged mode on OpenShift Dedicated Clusters by the OpenShift Cluster Manager add-on infrastructure.
1) https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/troubleshooting-ceph-osds#common-ceph-osd-error-messages-in-the-ceph-logs_diag
2) https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/troubleshooting-ceph-placement-groups#inconsistent-placement-groups_diag
4. Alert
4.1. Make changes to resolve the alert
Check the status of the pods in the openshift-storage namespace:
oc -n openshift-storage get pods
The OCS resource requirements must be met for the osd pods to be scheduled on the new node. This may take a few minutes as the Ceph cluster recovers data for the failed, now recovering, osd.
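To see what the osd pods request, and therefore what the replacement node must provide, you can inspect their resource requests. This is an illustrative check, assuming the osd pods carry the standard Rook label app=rook-ceph-osd:
oc -n openshift-storage get pods -l app=rook-ceph-osd -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.requests}{"\n"}{end}'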
To watch this recovery in action, first ensure the osd pods were actually placed on the new worker node:
oc -n openshift-storage get pods
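To confirm node placement at a glance, the wide output format shows which node each osd pod is running on (a small variation on the command above):
oc -n openshift-storage get pods -o wide | grep osd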
If the previously failing osd pods have not been scheduled, use describe and check the events for reasons the pods were not rescheduled.
Find the failing osd pod(s):
oc -n openshift-storage get pods | grep osd
Describe the failing pod(s):
oc -n openshift-storage describe pods/<osd podname from previous step>
In the events section, look for failure reasons, such as resource requirements not being met.
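You can also query the events directly instead of scanning the describe output. This is an optional shortcut, using the same placeholder pod name as above:
oc -n openshift-storage get events --field-selector involvedObject.name=<osd podname from previous step>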
In addition, you may use the rook-ceph-toolbox to watch the recovery. This step is optional but can be helpful for large Ceph clusters.
Determine the failed OCS node:
oc get nodes --selector='node-role.kubernetes.io/worker','!node-role.kubernetes.io/infra'
oc describe node <node_name>
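Within the describe output, check the node conditions for the failure. As a quick scripted alternative (an illustrative one-liner, not part of the original runbook), the Ready condition can be extracted directly:
oc get node <node_name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'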
Access the Toolbox
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
ceph status
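To follow the recovery continuously rather than re-running ceph status, ceph -w (a standard Ceph command, suggested here as an optional addition) streams cluster log messages until interrupted with Ctrl+C:
ceph -w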
(Optional log gathering)
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6
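If you need the gathered data written to a specific location, --dest-dir is a standard flag of oc adm must-gather (the directory below is a placeholder):
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6 --dest-dir=<local directory>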