CephClusterErrorState

1. Check Description

Severity: Error

Potential Customer Impact: Critical

2. Overview

Storage cluster is in error state for more than 10m.

Detailed Description: This alert reflects that the storage cluster has been in an *ERROR* state for an unacceptable amount of time, which impacts storage availability. Check for other alerts that triggered prior to this one and troubleshoot those first.

3. Prerequisites

3.1. Verify cluster access

Check the output to ensure you are in the correct context for the cluster mentioned in the alert. If not, please change context and proceed.

List clusters you have permission to access:
ocm list clusters

From the list above, find the cluster ID of the cluster named in the alert. If you do not see the alerting cluster in the list above, please refer to Effective communication with SRE Platform.
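If the list is long, you can narrow it down by the cluster name from the alert; a minimal sketch (the grep pattern is a placeholder):
ocm list clusters | grep <cluster_name>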

Create a tunnel through backplane by providing the SSH key passphrase:
ocm backplane tunnel <cluster_id>
In a new tab, log in to the target cluster using backplane by providing 2FA:
ocm backplane login <cluster_id>
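To confirm that the login placed you in the expected context before proceeding, a quick check using standard oc commands:
oc config current-context
oc whoami --show-server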

3.2. Check Alerts

Set port-forwarding for alertmanager:
oc port-forward alertmanager-managed-ocs-alertmanager-0 9093 -n openshift-storage
Check all alerts:
curl http://localhost:9093/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
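To narrow the output to this alert only, a minimal jq filter against the same endpoint:
curl -s http://localhost:9093/api/v1/alerts | jq '.data[] | select(.labels.alertname=="CephClusterErrorState") | { ALERT: .labels.alertname, STATE: .status.state}'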

3.3. Check OCS Ceph Cluster Health

You may check the OCS Ceph cluster health directly by using the rook-ceph toolbox.

Step 1: Check and document ceph cluster health:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
Step 2: From the rsh command prompt, run the following and capture the output.
ceph status
ceph osd status
exit
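If you prefer not to open an interactive shell, the same output can be captured non-interactively; a sketch, reusing the $TOOLS_POD variable set above:
oc exec -n openshift-storage $TOOLS_POD -- ceph status
oc exec -n openshift-storage $TOOLS_POD -- ceph osd status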

If 'ceph status' is not in HEALTH_OK, please look at the Troubleshooting section to resolve the issue.

3.4. Further info

3.4.1. OpenShift Data Foundation Dedicated Architecture

Red Hat OpenShift Data Foundation Dedicated (ODF Dedicated) is deployed in converged mode on OpenShift Dedicated Clusters by the OpenShift Cluster Manager add-on infrastructure.

4. Alert

4.1. Make changes to solve alert

General troubleshooting is required to determine the cause of this alert. This alert triggers along with other (usually many other) alerts; please view and troubleshoot those alerts first.

pod status: pending → Check for resource issues, pending PVCs, node assignment, and kubelet problems.
oc project openshift-storage
oc get pod | grep rook-ceph
Set MYPOD for convenience:
# Examine the output for a rook-ceph pod that is in a pending state, not running, or not ready
MYPOD=<pod identified as the problem pod>
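Alternatively, a one-liner sketch that picks the first rook-ceph pod not in the Running phase (verify the result before using it):
# note: pods that are Running but not Ready will not be caught by this filter
MYPOD=$(oc get pods -n openshift-storage --field-selector='status.phase!=Running' --no-headers | grep rook-ceph | awk 'NR==1{print $1}')
echo ${MYPOD}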

Look for resource limitations or pending PVCs. Otherwise, check for node assignment.

oc get pod/${MYPOD} -o wide
pod status: NOT pending, running, but NOT ready → Check readiness probe.
oc describe pod/${MYPOD}
pod status: NOT pending, but NOT running → Check for app or image issues.
oc logs pod/${MYPOD}

If a node was assigned, check kubelet on the node.
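One way to inspect the kubelet without SSH access, assuming <node_name> is taken from the NODE column of the 'oc get pod -o wide' output above:
oc adm node-logs <node_name> -u kubelet | tail -n 100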

If the basic health of the running pods, node affinity, and resource availability on the nodes have been verified, run the Ceph tools to get the status of the storage components.

4.1.1. Troubleshooting Ceph

Ceph commands
Some common commands to troubleshoot a Ceph cluster:
  • ceph status

  • ceph osd status

  • ceph osd df

  • ceph osd utilization

  • ceph osd pool stats

  • ceph osd tree

  • ceph pg stat

The first two status commands provide the overall cluster health. The normal state for cluster operations is HEALTH_OK; the cluster will still function in a HEALTH_WARN state. If the cluster is in a WARN state, it is in a condition that may progress to the HEALTH_ERR state, at which point all disk I/O operations are halted. If a HEALTH_WARN state is observed, take action to prevent the cluster from entering HEALTH_ERR and halting.
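For a per-check breakdown of what is driving a WARN or ERR state, run the following from the toolbox pod (see section 3.3); it lists every failing health check with its message:
ceph health detail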

Problem 1

Ceph status shows that the OSDs are full. Example Ceph OSD-FULL error:

 ceph status
  cluster:
    id:     62661e0d-417c-485e-b01f-562e9493f121
    health: HEALTH_ERR
            3 full osd(s)
            3 pool(s) full

  services:
    mon: 3 daemons, quorum a,b,c (age 3h)
    mgr: a(active, since 3h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 3h), 3 in (since 3h)

  data:
    pools:   3 pools, 192 pgs
    objects: 223.01k objects, 870 GiB
    usage:   2.6 TiB used, 460 GiB / 3 TiB avail
    pgs:     192 active+clean

  io:
    client:   853 B/s rd, 1 op/s rd, 0 op/s wr

1) Check the Alertmanager for a read-only alert:
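The curl below expects ${MYALERTMANAGER} to hold the Alertmanager host. A sketch for setting it, assuming the default alertmanager-main route in the openshift-monitoring namespace (adjust the route name if your cluster differs):
MYALERTMANAGER=$(oc get route alertmanager-main -n openshift-monitoring -o jsonpath='{.spec.host}')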

curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'

2) If the CephClusterReadOnly alert is listed by the above curl command, then see:

https://red-hat-storage.github.io/ocs-sop/sop/OSD/CephClusterReadOnly.html
Problem 2

Ceph status shows an issue with an OSD, as seen in the example below.

Example Ceph OSD error:
cluster:
    id:     263935ae-deb3-47e0-9355-d4a5c935aaf5
    health: HEALTH_ERR
            1 MDSs report slow metadata IOs
            2 osds down
            2 hosts (2 osds) down
            1 nearfull osd(s)
            3 pool(s) nearfull
            11/2142 objects unfound (0.514%)
            Reduced data availability: 237 pgs inactive, 237 pgs down
            Possible data damage: 8 pgs recovery_unfound
            Degraded data redundancy: 833/6426 objects degraded (12.963%), 24 pgs degraded, 63 pgs undersized

  services:
    mon: 3 daemons, quorum a,b,c (age 115m)
    mgr: a(active, since 112m)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 3 osds: 1 up (since 2m), 3 in (since 113m)
Solving common OSD errors:
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/troubleshooting-ceph-osds#most-common-ceph-osd-errors
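To identify which OSDs are down and map them back to their pods, a sketch combining the commands listed above (run the ceph command from the toolbox pod) with a standard pod query, assuming the usual rook-ceph-osd label:
ceph osd tree | grep down
oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide
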
Problem 3

Issues seen with PGs. Example Ceph PG error:

  cluster:
    id:     0a1a6dcb-2146-42f7-9e6f-8b933614c45f
    health: HEALTH_ERR
            Degraded data redundancy: 126/78009 objects degraded (0.162%), 7 pgs degraded
            Degraded data redundancy (low space): 1 pg backfill_toofull

    data:
    pools:   10 pools, 80 pgs
    objects: 26.00k objects, 100 GiB
    usage:   306 GiB used, 5.7 TiB / 6.0 TiB avail
    pgs:     126/78009 objects degraded (0.162%)
             35510/78009 objects misplaced (45.520%)
             55 active+clean
             12 active+remapped+backfill_wait
             4  active+recovery_wait+undersized+degraded+remapped
             3  active+recovery_wait+degraded
             2  active+recovery_wait
             2  active+recovering+undersized+remapped
             1  active+recovering
             1  active+remapped+backfill_toofull
Solving PG errors:
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/troubleshooting-ceph-placement-groups#most-common-ceph-placement-group-errors
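To list the placement groups that are stuck, a sketch using standard Ceph commands from the toolbox pod:
ceph pg dump_stuck
ceph pg dump_stuck inactive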

(Optional log gathering)

Document Ceph Cluster health check:
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6
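To write the gathered data to a specific directory, for example when attaching it to a support case, the standard --dest-dir flag can be added (<output_dir> is a placeholder):
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6 --dest-dir=<output_dir>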

5. Troubleshooting

  • Issue encountered while following the SOP

Any issues while following the SOP should be documented here.