CephNodeDown

1. Message

A storage node went down. Please check the node immediately. The alert should contain the node name.

2. Description

A storage node went down. Please check the node immediately. The alert should contain the node name.

3. Severity

Error

4. Prerequisites

To proceed with the prerequisites and resolution, you will need basic CLI tools, including:

  • oc (OpenShift CLI)

  • jq

  • curl

4.1. Verify cluster access

Verify that you are an admin and confirm the cluster server details:
oc whoami

Check the output to ensure you are in the correct context for the cluster mentioned in the alert. If not, please change context and proceed.
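
For example, one minimal way to confirm the API server and switch contexts, assuming you manage clusters through kubeconfig contexts (the context name below is a placeholder):
oc whoami --show-server
oc config get-contexts
oc config use-context <context name>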

4.2. Check Alerts

Get the route to this cluster’s alertmanager:
MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')

Quickly view all alerts to check if your alert is still active.

Check all alerts:
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'

Continue ONLY if you want to view your specific alert or need more details.

Get your alertname from the alert, set for use in jq:
export MYALERTNAME="<alertname from alert>"
Check if the alert is still active:
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname | test(env.MYALERTNAME)) | { ALERT: .labels.alertname, STATE: .status.state}'
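
For reference, if the CephNodeDown alert is still firing, the filtered output will look roughly like the illustrative snippet below (the state may also be suppressed if the alert is silenced):
{
  "ALERT": "CephNodeDown",
  "STATE": "active"
}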

No entries means the alert is no longer active.

Some alerts, such as version mismatch alerts, can occur during upgrades and resolve themselves. If this alert is not a version mismatch alert, investigate what triggered it even if it has already resolved. Look for other active alerts or alerts with similar timing.

If you need more details run:
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname | test(env.MYALERTNAME)) | { ALERTDETAILS: .}'

More about the Prometheus Alert endpoint can be found here: https://prometheus.io/docs/prometheus/latest/querying/api/#alerts

4.3. Check/Document OCS Ceph Cluster Health

You may directly check OCS Ceph Cluster health by using the rook-ceph toolbox.

The rook-ceph toolbox is not supported by Red Hat and is used here only to provide a quick health assessment. Do not use the toolbox to modify your Ceph cluster. Use the toolbox for querying health only.
oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch  '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
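To confirm the toolbox has started, you can check for the pod using the same label selector used below (a simple readiness check, not part of the patch itself):
oc -n openshift-storage get pods -l app=rook-ceph-tools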

After the rook-ceph-tools Pod is Running, check and document the Ceph cluster health:

Step 1: Access the toolbox:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD

Step 2: From the rsh command prompt, run the following and capture the output:
ceph status
ceph health
ceph osd status
ceph osd df
exit

Do not forget to exit.
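
For context, when a storage node is down, the health summary typically shows a warning similar to the illustrative line below; the exact counts and messages vary by cluster:
HEALTH_WARN 1 osds down; 1 host (1 osds) down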

5. Procedure for Resolution

5.1. Resolution Overview

If the alert is still active, check the health of the rook-ceph-mgr pod, the OCS operator pod, and the ocs-operator subscription as described in the following subsections, then gather logs.

5.2. Check Alerts

Re-check whether the alert is still active by repeating the steps in section 4.2, Check Alerts. If the alert is still active, continue with the pod and operator checks below.

Get pod status:
oc project openshift-storage
oc get pod | grep rook-ceph-mgr
# Examine the output for a rook-ceph-mgr that is in the pending state, not running or not ready
MYPOD=<pod identified as the problem pod>

pod status: pending → Check for resource issues, pending PVCs, node assignment, kubelet problems.
oc describe pod/${MYPOD}

Look for resource limitations or pending PVCs. Otherwise, check for node assignment.

oc get pod/${MYPOD} -o wide

If a node was assigned, check kubelet on the node (see the sketch after this workflow).

pod status: NOT pending, but NOT running → Check for app or image issues.
oc logs pod/${MYPOD}
oc describe pod/${MYPOD}

pod status: NOT pending, running, but NOT ready → Check the readiness probe.
oc describe pod/${MYPOD}

pod status: NOT pending, running, ready, but no access to the app → Start the debug workflow for the service.

If you are at this step, then the pod is OK. Proceed to check the service.
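
One way to check kubelet on the assigned node, assuming cluster-admin access (the node name is a placeholder):
oc debug node/<node name>
# then, from the debug pod's shell:
chroot /host
systemctl status kubelet
# exit twice to leave the chroot and the debug pod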

5.3. OCS Operator Pod Health

Check the OCS operator pod status to see if there is an OCS operator upgrade in progress.

Find and view the status of the OCS operator:
oc get pod -n openshift-storage | grep ocs-operator
OCSOP=$(oc get pod -n openshift-storage  -o custom-columns=POD:.metadata.name --no-headers | grep ocs-operator)
echo $OCSOP
oc get pod/${OCSOP} -n openshift-storage
oc describe pod/${OCSOP} -n openshift-storage

If you determine that an OCS operator update is in progress, please be patient; wait 5 minutes and this alert should resolve itself.

If you have waited, or you see a different error status condition, continue troubleshooting.

5.4. Check for OCS Operator Update

From the CLI, check whether the operator status reflects an update in progress; operator status can be viewed through OLM (see https://docs.openshift.com/container-platform/4.6/operators/admin/olm-status.html).

From the UI, the cause of any missing replicas must be determined and fixed. Basic troubleshooting, starting with the operator status and working up to overall OCP health, may be required.
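
One way to check from the CLI, assuming the operator was installed through OLM (the ClusterServiceVersion phase and install plans indicate whether an install or upgrade is in progress):
oc get csv -n openshift-storage
oc get installplan -n openshift-storage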

5.5. OCS Operator Subscription Health

Check the ocs-operator subscription status:
oc get sub ocs-operator -n openshift-storage  -o json | jq .status.conditions

Like all operators, the status condition types are:

CatalogSourcesUnhealthy, InstallPlanMissing, InstallPlanPending, InstallPlanFailed

The status for each type should be False. For example:

[
  {
    "lastTransitionTime": "2021-01-26T19:21:37Z",
    "message": "all available catalogsources are healthy",
    "reason": "AllCatalogSourcesHealthy",
    "status": "False",
    "type": "CatalogSourcesUnhealthy"
  }
]

The output above shows a False status for type CatalogSourcesUnhealthy, meaning the catalog sources are healthy.
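
As a quick, general check of cluster health and resources (a minimal sketch; extend as needed for your environment):
oc get nodes
oc get clusteroperators
oc -n openshift-storage get pods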

6. Gathering Logs

Document the Ceph cluster health check from section 4.3.

For ODF-specific results, run:
oc adm must-gather --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.10
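
If broader cluster context is also needed, the standard OpenShift must-gather can be collected in addition to the ODF-specific image above:
oc adm must-gather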