CephClusterErrorState
2. Overview
Storage cluster is in error state for more than 10m.
Detailed Description: This alert indicates that the storage cluster has been in *ERROR* state for an unacceptable amount of time, which impacts storage availability. Check for other alerts that triggered before this one and troubleshoot those alerts first.
3. Prerequisites
3.1. Verify cluster access
Verify that you are in the correct context for the cluster mentioned in the alert. If not, change context and proceed.
ocm list clusters
From the list above, find the cluster ID of the cluster named in the alert. If you do not see the alerting cluster in the list, refer to Effective communication with SRE Platform.
ocm backplane tunnel <cluster_id>
ocm backplane login <cluster_id>
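The tunnel command blocks while it runs, so it is typically kept open in one terminal while the login runs in another. A minimal sketch, using a hypothetical cluster ID:
# terminal 1: open the tunnel and leave it running (1a2b3c4d is a hypothetical cluster ID)
ocm backplane tunnel 1a2b3c4d
# terminal 2: log in to the cluster through the tunnel
ocm backplane login 1a2b3c4d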
3.2. Check Alerts
oc port-forward alertmanager-managed-ocs-alertmanager-0 9093 -n openshift-storage
curl http://localhost:9093/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
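To narrow the output to the alert that paged, the same query can be filtered on the alert name (the jq filter on alertname is the only addition):
curl -s http://localhost:9093/api/v1/alerts | jq '.data[] | select(.labels.alertname=="CephClusterErrorState") | { ALERT: .labels.alertname, STATE: .status.state}'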
3.3. Check OCS Ceph Cluster Health
You may directly check OCS Ceph Cluster health by using the rook-ceph toolbox.
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
ceph status
ceph osd status
exit
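If only the status output is needed, the same checks can be run non-interactively against the toolbox pod identified above:
oc rsh -n openshift-storage $TOOLS_POD ceph status
oc rsh -n openshift-storage $TOOLS_POD ceph osd status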
If 'ceph status' does not report HEALTH_OK, see the Troubleshooting section below to resolve the issue.
3.4. Further info
3.4.1. OpenShift Data Foundation Dedicated Architecture
Red Hat OpenShift Data Foundation Dedicated (ODF Dedicated) is deployed in converged mode on OpenShift Dedicated Clusters by the OpenShift Cluster Manager add-on infrastructure.
1) https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/troubleshooting-ceph-osds#common-ceph-osd-error-messages-in-the-ceph-logs_diag
2) https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/troubleshooting-ceph-placement-groups#inconsistent-placement-groups_diag
4. Alert
4.1. Make changes to solve alert
General troubleshooting is required to determine the cause of this alert. It usually fires along with other (often many other) alerts; view and troubleshoot those alerts first.
oc project openshift-storage
oc get pod | grep rook-ceph
# Examine the output for a rook-ceph pod that is in a pending state, not running, or not ready
MYPOD=<pod identified as the problem pod>
Look for resource limitations or pending PVCs (a quick check is sketched after the commands below); otherwise, check the node assignment.
oc get pod/${MYPOD} -o wide
oc describe pod/${MYPOD}
oc logs pod/${MYPOD}
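For the resource and PVC checks mentioned above, a quick sketch (any PVC that is not Bound, and any warning events on the pod, deserve a closer look):
# PVCs in openshift-storage that are not yet Bound
oc get pvc -n openshift-storage | grep -v Bound
# recent events recorded for the problem pod
oc get events -n openshift-storage --field-selector involvedObject.name=${MYPOD}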
If a node was assigned, check the kubelet on that node.
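One way to check the kubelet, assuming the node name from the '-o wide' output above:
# kubelet journal on the node the pod was scheduled to
oc adm node-logs <node_name> -u kubelet | tail -n 100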
If the basic health of the running pods, node affinity, and resource availability on the nodes have been verified, run the Ceph tools to get the status of the storage components.
4.1.1. Troubleshooting Ceph
The first two status commands ('ceph status' and 'ceph osd status') report the overall cluster health. The normal state for cluster operations is HEALTH_OK; the cluster still functions in HEALTH_WARN, but a WARN state indicates the cluster may progress to HEALTH_ERR, at which point all disk I/O operations are halted. If HEALTH_WARN is observed, take action before the cluster reaches HEALTH_ERR.
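From the rook-ceph toolbox, 'ceph health detail' lists the individual health checks behind a WARN or ERR state and is usually the quickest way to see what is driving this alert:
ceph health detail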
- Problem 1
Ceph status shows that the OSDs are full.
Example Ceph OSD-FULL error:
ceph status
  cluster:
    id:     62661e0d-417c-485e-b01f-562e9493f121
    health: HEALTH_ERR
            3 full osd(s)
            3 pool(s) full
  services:
    mon: 3 daemons, quorum a,b,c (age 3h)
    mgr: a(active, since 3h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 3h), 3 in (since 3h)
  data:
    pools:   3 pools, 192 pgs
    objects: 223.01k objects, 870 GiB
    usage:   2.6 TiB used, 460 GiB / 3 TiB avail
    pgs:     192 active+clean
  io:
    client: 853 B/s rd, 1 op/s rd, 0 op/s wr
1) Check the Alertmanager for the read-only alert (an alternative check using the port-forward from section 3.2 is sketched below).
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
2) If the CephClusterReadOnly alert is listed by the above curl command, then see:
https://red-hat-storage.github.io/ocs-sop/sop/OSD/CephClusterReadOnly.html
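If the port-forward from section 3.2 is still active, the same read-only check can be run against it without a token; from the toolbox pod, 'ceph osd df' additionally shows per-OSD utilization. A minimal sketch:
curl -s http://localhost:9093/api/v1/alerts | jq '.data[] | select(.labels.alertname=="CephClusterReadOnly") | { ALERT: .labels.alertname, STATE: .status.state}'
# per-OSD utilization, run from the rook-ceph toolbox pod
ceph osd df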
- Problem 2
Ceph status shows an issue with OSDs, as seen in the example below.
  cluster:
    id:     263935ae-deb3-47e0-9355-d4a5c935aaf5
    health: HEALTH_ERR
            1 MDSs report slow metadata IOs
            2 osds down
            2 hosts (2 osds) down
            1 nearfull osd(s)
            3 pool(s) nearfull
            11/2142 objects unfound (0.514%)
            Reduced data availability: 237 pgs inactive, 237 pgs down
            Possible data damage: 8 pgs recovery_unfound
            Degraded data redundancy: 833/6426 objects degraded (12.963%), 24 pgs degraded, 63 pgs undersized
  services:
    mon: 3 daemons, quorum a,b,c (age 115m)
    mgr: a(active, since 112m)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 3 osds: 1 up (since 2m), 3 in (since 113m)
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/troubleshooting-ceph-osds#most-common-ceph-osd-errors
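To see which OSDs are down and where they sit in the CRUSH tree, the standard Ceph commands below (run from the toolbox pod) are a reasonable first pass:
ceph osd tree
ceph osd stat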
- Problem 3
Issues are seen with placement groups (PGs), as in the example below.
Example Ceph PG error:
  cluster:
    id:     0a1a6dcb-2146-42f7-9e6f-8b933614c45f
    health: HEALTH_ERR
            Degraded data redundancy: 126/78009 objects degraded (0.162%), 7 pgs degraded
            Degraded data redundancy (low space): 1 pg backfill_toofull
  data:
    pools:   10 pools, 80 pgs
    objects: 26.00k objects, 100 GiB
    usage:   306 GiB used, 5.7 TiB / 6.0 TiB avail
    pgs:     126/78009 objects degraded (0.162%)
             35510/78009 objects misplaced (45.520%)
             55 active+clean
             12 active+remapped+backfill_wait
             4  active+recovery_wait+undersized+degraded+remapped
             3  active+recovery_wait+degraded
             2  active+recovery_wait
             2  active+recovering+undersized+remapped
             1  active+recovering
             1  active+remapped+backfill_toofull
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/troubleshooting-ceph-placement-groups#most-common-ceph-placement-group-errors
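To list the affected placement groups from the toolbox pod (standard Ceph commands; the states map directly to the degraded and backfill_toofull entries shown above):
ceph pg stat
ceph pg dump_stuck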
(Optional log gathering)
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6