CephClusterReadOnly
2. Overview
Storage cluster utilization has crossed 85%.
Detailed Description: Storage cluster utilization has crossed 85%, at which point the cluster becomes read-only. Free up space or expand the storage cluster immediately, as all storage operations are impacted. If deleting data to free up space, please refer to the section below on the procedure to follow while the cluster is read-only. It is common to see alerts about OSD devices being full or near full prior to this alert.
3. Prerequisites
3.1. Verify cluster access
Check the output to ensure you are in the correct context for the cluster mentioned in the alert. If not, please change context and proceed.
ocm list clusters
From the list above, find the cluster ID of the cluster named in the alert. If you do not see the alerting cluster in the list, please refer to Effective communication with SRE Platform.
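If the list is long, a simple filter can help locate the cluster ID (an illustrative shortcut only; replace <cluster_name> with the name from the alert):
ocm list clusters | grep -i <cluster_name>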
ocm backplane tunnel <cluster_id>
ocm backplane login <cluster_id>
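As a quick sanity check after logging in, you can confirm the active context points at the expected cluster (a minimal verification step; the exact context name format depends on your backplane setup):
oc config current-context
oc whoami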
3.2. Check Alerts
oc port-forward alertmanager-managed-ocs-alertmanager-0 9093 -n openshift-storage
curl http://localhost:9093/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
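To narrow the output to this specific alert, a filtered variant of the same query can be used (illustrative; assumes the port-forward from the previous step is still running):
curl -s http://localhost:9093/api/v1/alerts | jq '.data[] | select(.labels.alertname=="CephClusterReadOnly") | { ALERT: .labels.alertname, STATE: .status.state}'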
3.3. Check OCS Ceph Cluster Health
You may directly check OCS Ceph Cluster health by using the rook-ceph toolbox.
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
ceph status
ceph osd status
exit
If 'ceph status' does not report HEALTH_OK, please look at the Troubleshooting section to resolve the issue.
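Because this alert is driven by utilization, it can also help to check overall usage and per-OSD fill levels from the same toolbox pod (a suggested check, not an exhaustive diagnosis):
ceph health detail
ceph df
ceph osd df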
3.4. Further info
3.4.1. OpenShift Data Foundation Dedicated Architecture
Red Hat OpenShift Data Foundation Dedicated (ODF Dedicated) is deployed in converged mode on OpenShift Dedicated Clusters by the OpenShift Cluster Manager add-on infrastructure.
1) https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/troubleshooting-ceph-osds#common-ceph-osd-error-messages-in-the-ceph-logs_diag
2) https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/troubleshooting-ceph-placement-groups#inconsistent-placement-groups_diag
4. Alert
4.1. Notify Customer: Make changes to solve the alert
Contact the customer to free up space or expand the storage.
4.1.1. Delete Data
The customer may NOT delete data while in read-only mode.
If the customer wants to delete data after the cluster is in read-only mode, the resolution procedure is:
- Raise the threshold for read-only
- Allow the cluster to drop out of read-only
- Instruct the customer to delete data
- Restore the original thresholds
This is essentially the same procedure stated in Scaling up (adding new disks) OCS 4 fails due to cluster state being unhealthy because of lack of capacity.
To extend the cluster full threshold ratio, follow the steps below.
Extend the Ceph cluster full threshold
Execute the following command to extend the Ceph cluster full threshold:
oc process -n openshift-storage ocs-extend-cluster | oc create -f -
This extends the Ceph cluster full threshold from 0.85 to 0.87, which brings the cluster out of the HEALTH_ERR state and read-only mode. Notify the customer once it is out of HEALTH_ERR state so they can delete some of the existing data. When the customer is done deleting data, revert the threshold to the previous value.
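To confirm the change took effect, the current ratios can be inspected from the rook-ceph toolbox (an optional verification step; full_ratio should reflect the raised value, e.g. 0.87, after the job runs):
ceph osd dump | grep full_ratio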
Before reverting the threshold, the existing job resource should be deleted:
oc delete job ocs-extend-cluster-job -n openshift-storage
Once the old job is removed, re-run the extend-cluster job to revert the threshold to its previous value:
oc process -n openshift-storage ocs-extend-cluster | oc create -f -
4.1.2. Current size < 1 TB, Expand to 4 TB
Assess the ability to expand. For every 1 TB of storage added, the cluster needs 3 nodes, each with a minimum of 2 vCPUs and 8 GiB memory available.
The customer may increase capacity via the add-on, and the cluster will resolve the alert through self-healing processes. If the minimum vCPU and memory resource requirements are not met, the customer should be advised to add 3 additional worker nodes to the cluster.
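To gauge whether the worker nodes have headroom, one approach is to list allocatable CPU and memory per worker (a rough sketch; allocatable is not the same as currently free, so also review "Allocated resources" in oc describe node):
oc get nodes -l node-role.kubernetes.io/worker -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory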
4.1.3. Current size = 4TB
The add-on currently supports clusters of size up to 4 TB. If the customer needs additional storage, log a support case to follow the exception process.
- Update storage cluster to size above 4TB (log support exception and contact Engineering as per escalation flow)
To update the storage cluster, use the command below, making sure to replace the "storage_size" placeholder with the desired size value (int).
oc patch storagecluster ocs-storagecluster -n openshift-storage --type json -p '[{"op": "replace", "path": "/spec/storageDeviceSets/0/count", "value": storage_size}]'
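To confirm the patch was applied, the current device set count can be read back (optional verification, assuming a single device set at index 0):
oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.spec.storageDeviceSets[0].count}'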
- If cluster is in read-only and customer wants to reclaim space by deleting PVs, follow the KCS to temporarily allow delete
(Optional log gathering)
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6