CephClusterCriticallyFull
2. Overview
Storage cluster utilization has crossed 80%.
Detailed Description: Storage cluster utilization has crossed 80%, and the cluster will become read-only once utilization crosses 85%. Free up space or expand the storage cluster immediately. It is common to see alerts about OSD devices being full or near full before this alert fires.
3. Prerequisites
3.1. Verify cluster access
ocm list clusters
From the list above, find the cluster ID of the cluster named in the alert, and check the output to ensure you are in the correct context for that cluster. If not, change context before proceeding. If you do not see the alerting cluster in the list above, please refer to Effective communication with SRE Platform.
ocm backplane tunnel <cluster_id>
ocm backplane login <cluster_id>
3.2. Check Alerts
oc port-forward alertmanager-managed-ocs-alertmanager-0 9093 -n openshift-storage
curl http://localhost:9093/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
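When the response is long, it can help to narrow the output to alerts that are actively firing. The snippet below demonstrates the filter on a stripped-down, hypothetical sample of the Alertmanager v1 payload (written to a temp file so no port-forward is needed); against the live endpoint you would pipe the same `curl` output into the `jq` expression instead.

```shell
# Hypothetical sample of the Alertmanager v1 response, for illustration only.
cat <<'EOF' > /tmp/alerts.json
{"data":[
  {"labels":{"alertname":"CephClusterCriticallyFull"},"status":{"state":"active"}},
  {"labels":{"alertname":"Watchdog"},"status":{"state":"suppressed"}}
]}
EOF
# Keep only alerts that are actively firing.
jq -r '.data[] | select(.status.state == "active") | .labels.alertname' /tmp/alerts.json
```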
3.3. Check OCS Ceph Cluster Health
You may check OCS Ceph cluster health directly by using the rook-ceph toolbox.
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
ceph status
ceph osd status
exit
If 'ceph status' does not report HEALTH_OK, please look at the Troubleshooting section to resolve the issue.
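Since this alert is driven by utilization, also run 'ceph df' in the toolbox pod and note the raw used percentage. The sketch below shows how that figure maps onto this alert's thresholds; the 81.4 value is hypothetical (on a live cluster, take it from the 'ceph df' RAW STORAGE summary instead).

```shell
# Hypothetical raw-used percentage; on a live cluster, read this from
# the 'ceph df' output inside the rook-ceph toolbox pod.
raw_used_pct=81.4
# Compare against this runbook's thresholds: alert at 80%, read-only at 85%.
awk -v used="$raw_used_pct" 'BEGIN {
  if (used >= 85)      print "read-only threshold crossed"
  else if (used >= 80) print "critically full"
  else                 print "ok"
}'
```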
3.4. Further info
3.4.1. OpenShift Data Foundation Dedicated Architecture
Red Hat OpenShift Data Foundation Dedicated (ODF Dedicated) is deployed in converged mode on OpenShift Dedicated Clusters by the OpenShift Cluster Manager add-on infrastructure.
1) https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/troubleshooting-ceph-osds#common-ceph-osd-error-messages-in-the-ceph-logs_diag
2) https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/troubleshooting-ceph-placement-groups#inconsistent-placement-groups_diag
4. Alert
4.1. Notify Customer: Make changes to solve the alert
Contact the customer to free up space or to expand the storage cluster.
4.1.1. Delete Data
OCS CEPH CLUSTER IS NOT IN READ-ONLY MODE. The following instructions apply only to OCS clusters that are near full or full but NOT in read-only mode. Read-only mode would prevent any changes, including deleting data (i.e. PVC/PV deletions).
The customer may delete data and the cluster will resolve the alert through self healing processes.
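To help the customer identify deletion candidates, it is useful to list PVCs with their requested sizes. The snippet below runs the listing against a stripped-down, hypothetical sample of 'oc get pvc -A -o json' output (namespaces, names, and sizes are invented); on a live cluster you would pipe the real `oc` output into the same `jq` expression.

```shell
# Hypothetical sample of 'oc get pvc -A -o json', trimmed to the fields used below.
cat <<'EOF' > /tmp/pvcs.json
{"items":[
  {"metadata":{"namespace":"app1","name":"data-0"},
   "spec":{"resources":{"requests":{"storage":"500Gi"}}}},
  {"metadata":{"namespace":"app2","name":"logs-0"},
   "spec":{"resources":{"requests":{"storage":"50Gi"}}}}
]}
EOF
# Print each PVC with its requested size so large consumers stand out.
jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.spec.resources.requests.storage)"' /tmp/pvcs.json
```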
4.1.2. Current size < 1 TB, Expand to 4 TB
Assess the ability to expand. For every 1 TB of storage added, the cluster needs 3 nodes, each with a minimum of 2 vCPUs and 8 GiB of memory available.
The customer may increase capacity via the addon and the cluster will resolve the alert through self healing processes. If the minimum vCPU and memory resource requirements are not met, the customer should be advised to add 3 additional worker nodes to the cluster.
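The sizing rule above can be sketched as simple arithmetic, assuming it scales linearly with the number of terabytes added (an assumption worth verifying against the current add-on documentation). The expansion size here is hypothetical.

```shell
# Hypothetical expansion: growing a 1 TB cluster to 4 TB.
tb_to_add=3
nodes=3
# Per this runbook's rule (per 1 TB added: 3 nodes, each with 2 spare
# vCPUs and 8 GiB spare memory), assuming linear scaling per TB.
echo "nodes required: $nodes"
echo "spare vCPUs per node: $((2 * tb_to_add))"
echo "spare memory per node: $((8 * tb_to_add)) GiB"
```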
4.1.3. Current size = 4TB
The add-on currently supports clusters of up to 4 TB. If the customer needs additional storage, log a support case to follow the exception process.
- Update the storage cluster to a size above 4 TB (log a support exception and contact Engineering as per the escalation flow).
To update the storage cluster, use the command below, replacing the <storage_size> placeholder with the desired count (int). Because spec.storageDeviceSets is a list, use a JSON patch that targets the first device set:
oc patch storagecluster ocs-storagecluster -n openshift-storage --type json -p '[{"op": "replace", "path": "/spec/storageDeviceSets/0/count", "value": <storage_size>}]'
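As a local illustration of what this update changes (no cluster access needed), the same edit can be applied to a stripped-down sample manifest with `jq`. The device-set name and counts below are hypothetical.

```shell
# Stripped-down sample of a StorageCluster spec (hypothetical values).
cat <<'EOF' > /tmp/sc.json
{"spec":{"storageDeviceSets":[{"name":"ocs-deviceset","count":1}]}}
EOF
# Apply the equivalent of the patch locally and read back the new count.
jq '.spec.storageDeviceSets[0].count = 4' /tmp/sc.json \
  | jq -r '.spec.storageDeviceSets[0].count'
```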
- If the cluster is in read-only mode and the customer wants to reclaim space by deleting PVs, follow the KCS to temporarily allow deletes.
(Optional log gathering)
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6