CephClusterReadOnly

1. Check Description

Severity: Error

Potential Customer Impact: High

2. Overview

Storage cluster utilization has crossed 85%.

Detailed Description: Storage cluster utilization has crossed 85% and the cluster will become read-only. Free up some space or expand the storage cluster immediately, as all storage operations are impacted. If deleting data to free up space, refer to the section below on the procedure to follow when the cluster is read-only. It is common to see alerts about OSD devices being full or near full prior to this alert.

3. Prerequisites

3.1. Verify cluster access

Check the output of the commands below to ensure you are in the correct context for the cluster mentioned in the alert. If not, change context and proceed.

List clusters you have permission to access:
ocm list clusters

From the list above, find the cluster ID of the cluster named in the alert. If you do not see the alerting cluster in the list, please refer to Effective communication with SRE Platform.
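
To locate the alerting cluster quickly, you can optionally filter the list by the cluster name from the alert (replace <cluster_name> with the name shown in the alert):
ocm list clusters | grep <cluster_name>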

Create a tunnel through backplane by providing the SSH key passphrase:
ocm backplane tunnel <cluster_id>
In a new tab, log in to the target cluster using backplane by providing 2FA:
ocm backplane login <cluster_id>
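
Once logged in, you can optionally confirm you are on the intended cluster before proceeding, for example by checking the API server URL and cluster version:
oc whoami --show-server
oc get clusterversion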

3.2. Check Alerts

Set port-forwarding for alertmanager:
oc port-forward alertmanager-managed-ocs-alertmanager-0 9093 -n openshift-storage
Check all alerts:
curl http://localhost:9093/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
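
Optionally, to show only this alert, the same endpoint can be filtered by alert name (a minimal variation of the command above):
curl -s http://localhost:9093/api/v1/alerts | jq '.data[] | select(.labels.alertname == "CephClusterReadOnly") | { ALERT: .labels.alertname, STATE: .status.state}'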

3.3. Check OCS Ceph Cluster Health

You may directly check OCS Ceph Cluster health by using the rook-ceph toolbox.

Step 1: Check and document Ceph cluster health:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
Step 2: From the rsh command prompt, run the following and capture the output.
ceph status
ceph osd status
exit

If 'ceph status' does not show HEALTH_OK, please look at the Troubleshooting section to resolve the issue.
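
Optionally, the same checks can be run without an interactive shell, and 'ceph df' gives a quick view of how close the cluster is to the full threshold (this reuses the TOOLS_POD variable set above):
oc exec -n openshift-storage $TOOLS_POD -- ceph health detail
oc exec -n openshift-storage $TOOLS_POD -- ceph df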

3.4. Further info

3.4.1. OpenShift Data Foundation Dedicated Architecture

Red Hat OpenShift Data Foundation Dedicated (ODF Dedicated) is deployed in converged mode on OpenShift Dedicated Clusters by the OpenShift Cluster Manager add-on infrastructure.

4. Alert

4.1. Notify Customer: Make changes to solve the alert

Contact the customer to free up space or expand the storage cluster.

4.1.1. Delete Data

The customer may NOT delete data while in read-only mode.

If the customer wants to delete data after the cluster has entered read-only mode, the resolution procedure is:

  • Raise the read-only threshold

  • Allow the cluster to drop out of read-only mode

  • Instruct the customer to delete data

  • Restore the original thresholds

To extend the cluster full threshold ratio, follow the steps below.

Extend the Ceph cluster full threshold

Execute the following command to extend the Ceph cluster full threshold:

oc process -n openshift-storage ocs-extend-cluster | oc create -f -

The Ceph cluster full threshold is now extended from 0.85 to 0.87, which takes the cluster out of the HEALTH_ERR state and read-only mode. Notify the customer once it is out of the HEALTH_ERR state so they can delete some of the existing data. When the customer is done deleting data, you need to revert the threshold to its previous value.
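
Optionally, you can verify the new threshold from the rook-ceph toolbox before notifying the customer (this reuses the TOOLS_POD variable from the prerequisites):
oc exec -n openshift-storage $TOOLS_POD -- ceph osd dump | grep full_ratio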

Before reverting the threshold, the existing job resource should be deleted:

oc delete job ocs-extend-cluster-job -n openshift-storage

Once the old job is removed, re-run the extend-cluster job to revert the threshold to its previous value:

oc process -n openshift-storage ocs-extend-cluster | oc create -f -
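
Optionally, confirm that the new job completed and that the threshold is back at its previous value (again using the TOOLS_POD variable from the prerequisites):
oc get jobs -n openshift-storage | grep ocs-extend-cluster
oc exec -n openshift-storage $TOOLS_POD -- ceph osd dump | grep full_ratio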

4.1.2. Current size < 1 TB, Expand to 4 TB

Assess the ability to expand. For every 1 TB of storage added, the cluster needs 3 nodes, each with a minimum of 2 vCPUs and 8 GiB of memory available.

The customer may increase capacity via the add-on, and the cluster will resolve the alert through self-healing processes. If the minimum vCPU and memory resource requirements are not met, the customer should be advised to add 3 additional worker nodes to the cluster.
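
To assess whether the resource requirements are met, you can optionally list the allocatable CPU and memory of the worker nodes (a minimal sketch using the standard node status fields):
oc get nodes -l node-role.kubernetes.io/worker -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory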

4.1.3. Current size = 4TB

The add-on currently supports clusters of up to 4 TB. If the customer needs additional storage, log a support case to follow the exception process.

Options for support:
  • Update the storage cluster to a size above 4 TB (log a support exception and contact Engineering per the escalation flow)

To update the storage cluster, use the command below, replacing the storage_size placeholder with the desired size value (an integer); a verification sketch follows this list.

oc patch storagecluster ocs-storagecluster -n openshift-storage --type json -p '[{"op": "replace", "path": "/spec/storageDeviceSets/0/count", "value": storage_size}]'
  • If the cluster is in read-only mode and the customer wants to reclaim space by deleting PVs, follow the KCS article to temporarily allow deletion
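
To confirm that a storage-size patch was applied, you can optionally read back the device set count from the StorageCluster resource (assuming the default ocs-storagecluster name used above):
oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.spec.storageDeviceSets[0].count}'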

(Optional log gathering)

Document the Ceph cluster health check:
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6

5. Troubleshooting

  • Issue encountered while following the SOP

Any issues while following the SOP should be documented here.