CephClusterCriticallyFull
2. Overview
Storage cluster utilization has crossed 80%.
Detailed Description: Storage cluster utilization has crossed 80%, and the cluster will become read-only once utilization crosses 85%. Free up space or expand the storage cluster immediately. It is common to see alerts about OSD devices being full or near full before this alert fires.
3. Prerequisites
3.1. Verify cluster access
ocm list clusters
From the list above, find the cluster ID of the cluster named in the alert, and check the output to ensure you are in the correct context for that cluster. If not, change context before proceeding. If you do not see the alerting cluster in the list above, please refer to Effective communication with SRE Platform.
ocm backplane tunnel <cluster_id>
ocm backplane login <cluster_id>
3.2. Check Alerts
oc port-forward alertmanager-managed-ocs-alertmanager-0 9093 -n openshift-storage
curl http://localhost:9093/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
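When the response is long, it can help to narrow the output to alerts that are actively firing. The snippet below demonstrates the filter on a stripped-down, hypothetical sample of the Alertmanager v1 payload (written to a temp file so no port-forward is needed); against the live endpoint you would pipe the same `curl` output into the `jq` expression instead.

```shell
# Hypothetical sample of the Alertmanager v1 response, for illustration only.
cat <<'EOF' > /tmp/alerts.json
{"data":[
  {"labels":{"alertname":"CephClusterCriticallyFull"},"status":{"state":"active"}},
  {"labels":{"alertname":"Watchdog"},"status":{"state":"suppressed"}}
]}
EOF
# Keep only alerts that are actively firing.
jq -r '.data[] | select(.status.state == "active") | .labels.alertname' /tmp/alerts.json
```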
3.3. Check OCS Ceph Cluster Health
You may check OCS Ceph cluster health directly by using the rook-ceph toolbox.
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
ceph status
ceph osd status
exit
If 'ceph status' does not report HEALTH_OK, please look at the Troubleshooting section to resolve the issue.
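Since this alert is driven by utilization, also run 'ceph df' in the toolbox pod and note the raw used percentage. The sketch below shows how that figure maps onto this alert's thresholds; the 81.4 value is hypothetical (on a live cluster, take it from the 'ceph df' RAW STORAGE summary instead).

```shell
# Hypothetical raw-used percentage; on a live cluster, read this from
# the 'ceph df' output inside the rook-ceph toolbox pod.
raw_used_pct=81.4
# Compare against this runbook's thresholds: alert at 80%, read-only at 85%.
awk -v used="$raw_used_pct" 'BEGIN {
  if (used >= 85)      print "read-only threshold crossed"
  else if (used >= 80) print "critically full"
  else                 print "ok"
}'
```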
3.4. Further info
3.4.1. OpenShift Data Foundation Dedicated Architecture
Red Hat OpenShift Data Foundation Dedicated (ODF Dedicated) is deployed in converged mode on OpenShift Dedicated Clusters by the OpenShift Cluster Manager add-on infrastructure.
1) https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/troubleshooting-ceph-osds#common-ceph-osd-error-messages-in-the-ceph-logs_diag
2) https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/troubleshooting-ceph-placement-groups#inconsistent-placement-groups_diag
4. Alert
4.1. Notify Customer: Make changes to solve the alert
Contact the customer to free up space or to expand the storage cluster.
4.1.1. Delete Data
OCS CEPH CLUSTER IS NOT IN READ-ONLY MODE. The following instructions apply only to OCS clusters that are near full or full but NOT in read-only mode. Read-only mode would prevent any changes, including deleting data (i.e. PVC/PV deletions).
The customer may delete data and the cluster will resolve the alert through self healing processes.
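To help the customer identify deletion candidates, it is useful to list PVCs with their requested sizes. The snippet below runs the listing against a stripped-down, hypothetical sample of 'oc get pvc -A -o json' output (namespaces, names, and sizes are invented); on a live cluster you would pipe the real `oc` output into the same `jq` expression.

```shell
# Hypothetical sample of 'oc get pvc -A -o json', trimmed to the fields used below.
cat <<'EOF' > /tmp/pvcs.json
{"items":[
  {"metadata":{"namespace":"app1","name":"data-0"},
   "spec":{"resources":{"requests":{"storage":"500Gi"}}}},
  {"metadata":{"namespace":"app2","name":"logs-0"},
   "spec":{"resources":{"requests":{"storage":"50Gi"}}}}
]}
EOF
# Print each PVC with its requested size so large consumers stand out.
jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.spec.resources.requests.storage)"' /tmp/pvcs.json
```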
4.1.2. Current size < 1 TB, Expand to 4 TB
Assess the ability to expand. For every 1 TB of storage added, the cluster needs 3 nodes, each with a minimum of 2 vCPUs and 8 GiB of memory available.
The customer may increase capacity via the addon and the cluster will resolve the alert through self healing processes. If the minimum vCPU and memory resource requirements are not met, the customer should be advised to add 3 additional worker nodes to the cluster.
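The sizing rule above can be sketched as simple arithmetic, assuming it scales linearly with the number of terabytes added (an assumption worth verifying against the current add-on documentation). The expansion size here is hypothetical.

```shell
# Hypothetical expansion: growing a 1 TB cluster to 4 TB.
tb_to_add=3
nodes=3
# Per this runbook's rule (per 1 TB added: 3 nodes, each with 2 spare
# vCPUs and 8 GiB spare memory), assuming linear scaling per TB.
echo "nodes required: $nodes"
echo "spare vCPUs per node: $((2 * tb_to_add))"
echo "spare memory per node: $((8 * tb_to_add)) GiB"
```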
4.1.3. Current size = 4TB
The add-on currently supports clusters of up to 4 TB. If the customer needs additional storage, log a support case to follow the exception process.
- Update the storage cluster to a size above 4 TB (log a support exception and contact Engineering as per the escalation flow).
To update the storage cluster, use the command below, replacing the <storage_size> placeholder with the desired count (int). Because spec.storageDeviceSets is a list, use a JSON patch that targets the first device set:
oc patch storagecluster ocs-storagecluster -n openshift-storage --type json -p '[{"op": "replace", "path": "/spec/storageDeviceSets/0/count", "value": <storage_size>}]'
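As a local illustration of what this update changes (no cluster access needed), the same edit can be applied to a stripped-down sample manifest with `jq`. The device-set name and counts below are hypothetical.

```shell
# Stripped-down sample of a StorageCluster spec (hypothetical values).
cat <<'EOF' > /tmp/sc.json
{"spec":{"storageDeviceSets":[{"name":"ocs-deviceset","count":1}]}}
EOF
# Apply the equivalent of the patch locally and read back the new count.
jq '.spec.storageDeviceSets[0].count = 4' /tmp/sc.json \
  | jq -r '.spec.storageDeviceSets[0].count'
```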
- If the cluster is in read-only mode and the customer wants to reclaim space by deleting PVs, follow the KCS to temporarily allow deletes.
(Optional log gathering)
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6