CephClusterCriticallyFull

1. Message

Storage cluster is critically full and needs immediate expansion.

2. Description

Storage cluster utilization has crossed 80%.

Detailed Description: Storage cluster utilization has crossed 80%, and the cluster will become read-only once utilization crosses 85%. Expand the storage cluster immediately. It is common to see alerts related to OSD devices being full or near full prior to this alert.

3. Severity

Critical

4. Prerequisites

To proceed with the prerequisites and resolution, you will need basic CLI tools, including:

  • oc (OpenShift CLI)

  • jq

  • curl

4.1. Verify cluster access

Verify that you are logged in as an admin and check the cluster server details:
oc whoami

Check the output to ensure you are in the correct context for the cluster mentioned in the alert. If not, please change context and proceed.
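For an additional, optional sanity check, you can display which API server you are logged in to and confirm that you have cluster-admin rights:
oc whoami --show-server
oc auth can-i '*' '*' --all-namespaces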

4.2. Check Alerts

Get the route to this cluster’s alertmanager:
MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')

Quickly view all alerts to check if your alert is still active.

Check all alerts:
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'

Continue ONLY if you want to view your specific alert or need more details.

Get your alertname from the alert, set for use in jq:
export MYALERTNAME="<alertname from alert>"
Check if the alert is still active:
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname | test(env.MYALERTNAME)) | { ALERT: .labels.alertname, STATE: .status.state}'

No entries means the alert is no longer active.

Some alerts, such as version-mismatch alerts, can occur during upgrades and resolve themselves. If this alert is not a version-mismatch alert, investigate what triggered it even if it has already resolved. Look for other active alerts or alerts with similar timing.

If you need more details run:
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname | test(env.MYALERTNAME)) | { ALERTDETAILS: .}'

More about the Prometheus Alert endpoint can be found here: https://prometheus.io/docs/prometheus/latest/querying/api/#alerts
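If the full alert list is noisy, one way to narrow it to only alerts currently firing is to filter on the same state field used above (a sketch using the same endpoint and token):
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select(.status.state=="active") | { ALERT: .labels.alertname, STATE: .status.state }'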

4.3. Check/Document OCS Ceph Cluster Health

You may directly check OCS Ceph Cluster health by using the rook-ceph toolbox.

The rook-ceph toolbox is not supported by Red Hat and is used here only to provide a quick health assessment. Do not use the toolbox to modify your Ceph cluster. Use the toolbox for querying health only.
oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch  '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
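Rather than polling manually for the Pod, you can wait for the toolbox Pod to become Ready (a convenience sketch; the 120s timeout is an arbitrary choice):
oc -n openshift-storage wait --for=condition=Ready pod -l app=rook-ceph-tools --timeout=120s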

After the rook-ceph-tools Pod is Running, access the toolbox and check/document the Ceph cluster health.

Step 1: Get the toolbox Pod name and open a remote shell into it:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
Step 2: From the rsh command prompt, run the following and capture the output:
ceph status
ceph health
ceph osd status
ceph osd df
exit

Do not forget to exit.
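Alternatively, you can capture the same output without an interactive shell (a sketch; the output filename is arbitrary):
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc -n openshift-storage exec $TOOLS_POD -- sh -c 'ceph status; ceph health; ceph osd status; ceph osd df' | tee ceph-health-check.txt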

5. Procedure for Resolution

5.1. Resolution Overview

Determine whether you will proceed with scaling *UP* or *OUT* for expansion. Choose the appropriate section for your deployment below.

5.2. Preparation for Scaling UP

The procedure for scaling UP storage requires adding more storage capacity to existing nodes. In general, this process requires 3 steps:

  • Check Ceph Cluster Status Before Recovery - Check ceph status, ceph osd status, check current alerts

  • Add Storage Capacity - Determine whether LSO is in use, and add capacity accordingly. This step is deployment specific; you will need to know how your OCS cluster was deployed (e.g. AWS, Azure, bare metal) and how to add storage for that deployment.

  • Check Ceph Cluster Status During Recovery - Essentially the same as the first item.

5.3. Assess Your Cluster State

Check the current status of the ceph cluster and OSDs using the rook-ceph toolbox:

oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch  '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'

After the rook-ceph-tools Pod is Running, access the toolbox like this:

TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD

Once inside the toolbox, run the following Ceph commands:

ceph status

In the output of ceph status, look for the cluster: health: and data: usage: fields.

Typically, OSDs running out of space to replicate are causing the problem. View current OSD status by running:

ceph osd status

In the output of ceph osd status, look for the numbers in the used and avail columns. You will track these numbers after adding storage capacity.

Do not forget to exit the pod to return back to your command prompt:

exit

In this situation, it is common to have MANY alerts firing or in a warning state all at once. If you have not already checked firing alerts, you may do so now by running:

MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')

Quickly view all alerts to check if your alert is still active.

Check all alerts:
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
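To get a quick count of how many alerts are currently firing, a sketch built on the same query:
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '[.data[] | select(.status.state=="active")] | length'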

5.4. Determine LSO or No LSO

To determine if LSO is installed and in use, run the following commands:

oc get csv -A | grep local-storage
Example output:
openshift-local-storage                local-storage-operator.4.6.0-202103130248.p0   Local Storage                 4.6.0-202103130248.p0              Succeeded

If you see output like above ending with Succeeded, LSO is installed.

Determine if LSO is in use by looking for PVs whose names begin with the "local-pv" prefix, bound to PVCs whose names contain the "ocs-deviceset-" prefix followed by the name of the storageClass used in the storage cluster definition. Run the following commands:

Look for PVCs bound to PVs with the local-pv prefix:
oc -n openshift-storage get pvc | grep local-pv
Example output:
ocs-deviceset-localblksc-demo-0-data-0-jlp6g   Bound    local-pv-f4d1075c                          50Gi       RWO            localblkscdemo               140m
ocs-deviceset-localblksc-demo-1-data-0-c8pmq   Bound    local-pv-bf5ca51e                          50Gi       RWO            localblkscdemo               140m
ocs-deviceset-localblksc-demo-2-data-0-nzzn5   Bound    local-pv-fb64d972                          50Gi       RWO            localblkscdemo               140m

Notice the storage class name embedded in the PVC name, right after "ocs-deviceset-". In this example it is "localblksc-demo". Verify that it is in use by the storage cluster by running:

Check that the storageClassName in the storage cluster matches:
oc -n openshift-storage get storagecluster -o yaml | grep storageClassName
Example output:
storageClassName: localblksc-demo

If PVs named local-pv* are present and are bound to PVCs whose names contain the storageClass used by the storagecluster definition, LSO is in use.
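To spot-check this in one command, you can list each OCS device-set PVC together with its backing PV and storage class (a sketch; LSO is in use if the VOLUME column shows local-pv names):
oc -n openshift-storage get pvc -o custom-columns=NAME:.metadata.name,VOLUME:.spec.volumeName,STORAGECLASS:.spec.storageClassName | grep ocs-deviceset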

  • LSO in use: SCALE UP by following Adding Storage Capacity - Local Storage Operator (LSO) in Use below.

  • LSO not in use: SCALE UP by following Adding Storage Capacity - No Local Storage Operator (LSO) below.

=== Adding Storage Capacity - Local Storage Operator (LSO) in Use

To determine which nodes will need additional capacity, view the localvolumeset and look for the values: listed under key: kubernetes.io/hostname by running the following command:

oc -n openshift-local-storage get localvolumeset -o yaml
Example output:
[...]
    nodeSelector:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - ip-10-0-149-49
          - ip-10-0-179-37
          - ip-10-0-209-228
    storageClassName: localblksc-demo
[...]

Add storage capacity (e.g., in AWS, attach EBS volumes) to each hostname listed.

Details of attaching additional storage capacity will depend on your particular environment and are beyond the scope of this document.
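As an illustration only (this is not a supported procedure, and every value below is hypothetical), attaching a new 50 GiB gp2 EBS volume to a node on AWS might look like:
VOL_ID=$(aws ec2 create-volume --availability-zone us-east-2a --size 50 --volume-type gp2 --query VolumeId --output text)
aws ec2 attach-volume --volume-id $VOL_ID --instance-id i-0123456789abcdef0 --device /dev/sdf
Consult your cloud or hardware documentation for the authoritative steps.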

Once you have added capacity to each node, verify that LSO has discovered the new storage by viewing the localvolumediscoveryresults details. First list the localvolumediscoveryresults, then choose a node for a spot check:

oc -n openshift-local-storage get localvolumediscoveryresults

If you do not see your new storage, rerun the command a few times to allow discovery to complete.

Choose one of the localvolumediscoveryresults from above to inspect details. When new storage has been "discovered" it will be reflected under status: discoveredDevices.

oc -n openshift-local-storage get localvolumediscoveryresults/discovery-result-ip-10-0-149-49.us-east-2.compute.internal -o yaml
Example output:
[...]
  - deviceID: /dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol0b86476e1d962d061
    fstype: ""
    model: 'Amazon Elastic Block Store              '
    path: /dev/nvme2n1
    property: NonRotational
    serial: vol0b86476e1d962d061
    size: 53687091200
    status:
      state: Available
    type: disk
    vendor: ""
[...]
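To list only the devices in the Available state for that node, a sketch using the fields shown in the output above:
oc -n openshift-local-storage get localvolumediscoveryresults/discovery-result-ip-10-0-149-49.us-east-2.compute.internal -o json | jq '.status.discoveredDevices[] | select(.status.state=="Available") | { path, size }'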

Expand the OCS cluster by directly editing the ocs-storagecluster definition: find count: under spec: storageDeviceSets: and increase the count by 1:

oc -n openshift-storage edit storagecluster/ocs-storagecluster
Example output:
spec:
  encryption: {}
  externalStorage: {}
  [...]
  monDataDirHostPath: /var/lib/rook
  storageDeviceSets:
  - config: {}
    count: 1

In the example above "count: 1" becomes "count: 2".

A successful edit of storagecluster/ocs-storagecluster will show the following:

storagecluster.ocs.openshift.io/ocs-storagecluster edited
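Shortly after the edit, new OSD Pods should be created for the added capacity. You can watch for them using the app=rook-ceph-osd label that Rook applies to OSD Pods:
oc -n openshift-storage get pods -l app=rook-ceph-osd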

=== Assess Your Cluster During Recovery

Watch the expansion progress by viewing ceph status and ceph osd status. This step is a repeat of Assess Your Cluster State; however, this time you will be specifically looking for evidence of:

  • New osds for the new storage

  • Status of recovery in io:

  • Progress of OSD rebalancing across all (existing and new) OSDs

  • Finally, ceph health: and all alerts returning to normal

In case you did not do this from the previous Assess Your Cluster section:

oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch  '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'

After the rook-ceph-tools Pod is Running, access the toolbox like this:

TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD

Once inside the toolbox, run the following Ceph commands:

ceph status
Example output:
  cluster:
    id:     4e949bf7-7e24-4c0e-898e-f5985b514a99
    health: HEALTH_ERR
            3 backfillfull osd(s)
            3 pool(s) backfillfull
            Degraded data redundancy: 18634/33627 objects degraded (55.414%), 122 pgs degraded
            Full OSDs blocking recovery: 93 pgs recovery_toofull

  services:
    mon: 3 daemons, quorum a,b,c (age 59m)
    mgr: a(active, since 59m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 6 osds: 6 up (since 43s), 6 in (since 43s)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   3 pools, 288 pgs
    objects: 11.21k objects, 43 GiB
    usage:   137 GiB used, 163 GiB / 300 GiB avail
    pgs:     18634/33627 objects degraded (55.414%)
             45/33627 objects misplaced (0.134%)
             165 active+clean
             93  active+recovery_toofull+degraded
             29  active+recovery_wait+degraded
             1   active+recovering

  io:
    client:   3.0 KiB/s rd, 75 MiB/s wr, 3 op/s rd, 38 op/s wr
    recovery: 71 MiB/s, 0 keys/s, 20 objects/s

Items to notice:

  • The cluster: health: is still HEALTH_ERR, which is expected until the cluster fully recovers.

  • services: osd: reflects the total number of desired OSDs and how many are currently up and in. Eventually, the number up should match the number desired.

  • io: recovery: being present means a recovery is taking place. This line will disappear when the recovery is complete.

To actively watch the OSD recovery, run the following:

ceph osd status

Watch the output of ceph osd status for detail on how the OSDs are rebalancing. You will see changes in the used and avail columns as Ceph moves data to achieve a healthy state.

Example output while data is being rebalanced:
+----+--------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
| id |                    host                    |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+--------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | ip-10-0-179-37.us-east-2.compute.internal  | 33.9G | 16.0G |    0   |   830k  |    0   |     0   | exists,up |
| 1  | ip-10-0-209-228.us-east-2.compute.internal | 35.0G | 14.9G |    3   |  5652k  |    0   |     0   | exists,up |
| 2  | ip-10-0-149-49.us-east-2.compute.internal  | 32.4G | 17.5G |    0   |  1638k  |    0   |     0   | exists,up |
| 3  | ip-10-0-179-37.us-east-2.compute.internal  | 13.5G | 86.4G |    3   |  8532k  |    2   |   106   | exists,up |
| 4  | ip-10-0-209-228.us-east-2.compute.internal | 12.2G | 87.7G |    8   |  8183k  |    0   |     0   | exists,up |
| 5  | ip-10-0-149-49.us-east-2.compute.internal  | 15.2G | 84.7G |    3   |  4118k  |    0   |     0   | exists,up |
+----+--------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+

Once the recovery process has completed, ceph status will show:

  • cluster: health: HEALTH_OK

  • All OSDs desired and present in service: osds:

  • io: recovery: will not be present since recovery has completed

In addition, ceph osd status will show:

  • Fairly even distribution of used/avail across all OSDs

Example output of ceph osd status after recovery:
+----+--------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
| id |                    host                    |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+--------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | ip-10-0-209-93.us-east-2.compute.internal  | 1104M | 48.9G |    0   |     0   |    0   |     0   | exists,up |
| 1  | ip-10-0-140-27.us-east-2.compute.internal  | 1116M | 48.9G |    0   |     0   |    0   |     0   | exists,up |
| 2  | ip-10-0-183-136.us-east-2.compute.internal | 1092M | 48.9G |    0   |  3276   |    0   |     0   | exists,up |
| 3  | ip-10-0-140-27.us-east-2.compute.internal  | 1110M | 48.9G |    0   |  7372   |    0   |     0   | exists,up |
| 4  | ip-10-0-183-136.us-east-2.compute.internal | 1134M | 48.8G |    0   |  2457   |    0   |     0   | exists,up |
| 5  | ip-10-0-209-93.us-east-2.compute.internal  | 1121M | 48.9G |    0   |     0   |    2   |   106   | exists,up |
+----+--------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+

Do not forget to exit the pod to return back to your command prompt:

exit

Alerts will resolve themselves as the cluster recovers.

MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')
Check all alerts. You may have to run this several times to watch the alerts resolve:
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
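If you prefer to poll until everything clears, a simple loop such as the following works (a sketch; interrupt with Ctrl-C when done):
while sleep 60; do curl -sk -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | { ALERT: .labels.alertname, STATE: .status.state }'; done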

This procedure is complete.

=== Adding Storage Capacity - No Local Storage Operator (LSO)

Follow the CLI instructions below, or the web UI instructions in OCS - Scaling up storage by adding capacity to your OpenShift Container Storage.

Expand the OCS cluster by directly editing the ocs-storagecluster definition: find count: under spec: storageDeviceSets: and increase the count by 1:

oc -n openshift-storage edit storagecluster/ocs-storagecluster
Example output:
spec:
  encryption: {}
  externalStorage: {}
  [...]
  monDataDirHostPath: /var/lib/rook
  storageDeviceSets:
  - config: {}
    count: 1

In the example above "count: 1" becomes "count: 2".

A successful edit of storagecluster/ocs-storagecluster will show the following:

storagecluster.ocs.openshift.io/ocs-storagecluster edited
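If you prefer a non-interactive edit, the same change can be applied with a JSON patch (a sketch; this assumes a single device set at index 0 and a new count of 2):
oc -n openshift-storage patch storagecluster/ocs-storagecluster --type json --patch '[{ "op": "replace", "path": "/spec/storageDeviceSets/0/count", "value": 2 }]'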

=== Assess Your Cluster During Recovery

Watch the expansion progress by viewing ceph status and ceph osd status. This step is a repeat of Assess Your Cluster State; however, this time you will be specifically looking for evidence of:

  • New osds for the new storage

  • Status of recovery in io:

  • Progress of OSD rebalancing across all (existing and new) OSDs

  • Finally, ceph health: and all alerts returning to normal

In case you did not do this from the previous Assess Your Cluster section:

oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch  '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'

After the rook-ceph-tools Pod is Running, access the toolbox like this:

TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD

Once inside the toolbox, run the following Ceph commands:

ceph status
Example output:
  cluster:
    id:     4e949bf7-7e24-4c0e-898e-f5985b514a99
    health: HEALTH_ERR
            3 backfillfull osd(s)
            3 pool(s) backfillfull
            Degraded data redundancy: 18634/33627 objects degraded (55.414%), 122 pgs degraded
            Full OSDs blocking recovery: 93 pgs recovery_toofull

  services:
    mon: 3 daemons, quorum a,b,c (age 59m)
    mgr: a(active, since 59m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 6 osds: 6 up (since 43s), 6 in (since 43s)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   3 pools, 288 pgs
    objects: 11.21k objects, 43 GiB
    usage:   137 GiB used, 163 GiB / 300 GiB avail
    pgs:     18634/33627 objects degraded (55.414%)
             45/33627 objects misplaced (0.134%)
             165 active+clean
             93  active+recovery_toofull+degraded
             29  active+recovery_wait+degraded
             1   active+recovering

  io:
    client:   3.0 KiB/s rd, 75 MiB/s wr, 3 op/s rd, 38 op/s wr
    recovery: 71 MiB/s, 0 keys/s, 20 objects/s

Items to notice:

  • The cluster: health: is still HEALTH_ERR, which is expected until the cluster fully recovers.

  • services: osd: reflects the total number of desired OSDs and how many are currently up and in. Eventually, the number up should match the number desired.

  • io: recovery: being present means a recovery is taking place. This line will disappear when the recovery is complete.

To actively watch the OSD recovery, run the following:

ceph osd status

Watch the output of ceph osd status for detail on how the OSDs are rebalancing. You will see changes in the used and avail columns as Ceph moves data to achieve a healthy state.

Example output while data is being rebalanced:
+----+--------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
| id |                    host                    |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+--------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | ip-10-0-179-37.us-east-2.compute.internal  | 33.9G | 16.0G |    0   |   830k  |    0   |     0   | exists,up |
| 1  | ip-10-0-209-228.us-east-2.compute.internal | 35.0G | 14.9G |    3   |  5652k  |    0   |     0   | exists,up |
| 2  | ip-10-0-149-49.us-east-2.compute.internal  | 32.4G | 17.5G |    0   |  1638k  |    0   |     0   | exists,up |
| 3  | ip-10-0-179-37.us-east-2.compute.internal  | 13.5G | 86.4G |    3   |  8532k  |    2   |   106   | exists,up |
| 4  | ip-10-0-209-228.us-east-2.compute.internal | 12.2G | 87.7G |    8   |  8183k  |    0   |     0   | exists,up |
| 5  | ip-10-0-149-49.us-east-2.compute.internal  | 15.2G | 84.7G |    3   |  4118k  |    0   |     0   | exists,up |
+----+--------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+

Once the recovery process has completed, ceph status will show:

  • cluster: health: HEALTH_OK

  • All OSDs desired and present in service: osds:

  • io: recovery: will not be present since recovery has completed

In addition, ceph osd status will show:

  • Fairly even distribution of used/avail across all OSDs

Example output of ceph osd status after recovery:
+----+--------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
| id |                    host                    |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+--------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | ip-10-0-209-93.us-east-2.compute.internal  | 1104M | 48.9G |    0   |     0   |    0   |     0   | exists,up |
| 1  | ip-10-0-140-27.us-east-2.compute.internal  | 1116M | 48.9G |    0   |     0   |    0   |     0   | exists,up |
| 2  | ip-10-0-183-136.us-east-2.compute.internal | 1092M | 48.9G |    0   |  3276   |    0   |     0   | exists,up |
| 3  | ip-10-0-140-27.us-east-2.compute.internal  | 1110M | 48.9G |    0   |  7372   |    0   |     0   | exists,up |
| 4  | ip-10-0-183-136.us-east-2.compute.internal | 1134M | 48.8G |    0   |  2457   |    0   |     0   | exists,up |
| 5  | ip-10-0-209-93.us-east-2.compute.internal  | 1121M | 48.9G |    0   |     0   |    2   |   106   | exists,up |
+----+--------------------------------------------+-------+-------+--------+---------+--------+---------+-----------+

Do not forget to exit the pod to return back to your command prompt:

exit

Alerts will resolve themselves as the cluster recovers.

MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')
Check all alerts. You may have to run this several times to watch the alerts resolve:
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
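To narrow the output to storage alerts while you wait, you can reuse the test() filter from the prerequisites (a sketch matching any alertname containing "Ceph"):
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select(.labels.alertname | test("Ceph")) | { ALERT: .labels.alertname, STATE: .status.state }'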

This procedure is complete.

The procedure for scaling OUT (adding nodes) is pending and not yet documented here.

6. Gathering Logs

Document the Ceph cluster health check (see Check/Document OCS Ceph Cluster Health above). For ODF-specific results, run:
oc adm must-gather --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.10
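Note that the image tag above is version-specific; match it to your installed ODF version. For general OpenShift cluster logs in addition to the ODF-specific results, you may also run the default must-gather:
oc adm must-gather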