CephMgrIsAbsent

1. Check Description

Severity: Warning

Potential Customer Impact: High

2. Overview

Ceph Manager has disappeared from Prometheus target discovery.

A Ceph Manager that is not running impacts cluster monitoring as well as PVC creation and deletion requests, and should be resolved as soon as possible.
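As a quick first check, confirm whether a rook-ceph-mgr pod exists at all. This is a minimal sketch assuming the default Rook label app=rook-ceph-mgr:
# label selector is an assumption based on the default Rook naming
oc get pods -n openshift-storage -l app=rook-ceph-mgr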

3. Prerequisites

3.1. Verify cluster access

Check the output to ensure you are in the correct context for the cluster mentioned in the alert. If not, change context and proceed (a verification example is shown after the login steps below).

List clusters you have permission to access:
ocm list clusters

From the list above, find the cluster ID of the cluster named in the alert. If you do not see the alerting cluster in the list, refer to Effective communication with SRE Platform.

Create a tunnel through backplane by providing the SSH key passphrase:
ocm backplane tunnel <cluster_id>
In a new tab, log in to the target cluster using backplane by providing 2FA:
ocm backplane login <cluster_id>
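
Once logged in, verify that the current context points at the alerting cluster. A minimal check (either command should show the cluster you just logged in to):
oc config current-context
oc whoami --show-server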

3.2. Check Alerts

Set port-forwarding for alertmanager:
oc port-forward alertmanager-managed-ocs-alertmanager-0 9093 -n openshift-storage
Check all alerts:
curl http://localhost:9093/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
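To narrow the output to this specific alert, the same query can be filtered on the alert name (taken from this runbook's title):
curl -s http://localhost:9093/api/v1/alerts | jq '.data[] | select(.labels.alertname=="CephMgrIsAbsent") | { ALERT: .labels.alertname, STATE: .status.state}'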

3.3. Check OCS Ceph Cluster Health

You may directly check OCS Ceph Cluster health by using the rook-ceph toolbox.

Step 1: Check and document ceph cluster health:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
Step 2: From the rsh command prompt, run the following and capture the output.
ceph status
ceph osd status
exit
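
For a quick, non-interactive check, the same commands can be passed directly to oc rsh; ceph mgr stat additionally reports which Manager daemon is active. A minimal sketch using the toolbox pod found above:
oc rsh -n openshift-storage $TOOLS_POD ceph status
oc rsh -n openshift-storage $TOOLS_POD ceph mgr stat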

If 'ceph status' does not report HEALTH_OK, see the Troubleshooting section to resolve the issue.

3.4. Further info

3.4.1. OpenShift Data Foundation Dedicated Architecture

Red Hat OpenShift Data Foundation Dedicated (ODF Dedicated) is deployed in converged mode on OpenShift Dedicated Clusters by the OpenShift Cluster Manager add-on infrastructure.

4. Alert

4.1. Make change to solve the alert

Verify that the rook-ceph-mgr pod is failing and restart it if necessary. If the restart fails, use the general pod troubleshooting steps below to resolve the issue.

Verify the ceph mgr pod is failing:
oc get pods | grep mgr
Describe the ceph mgr pod for more detail:
oc describe pods/<rook-ceph-mgr pod name from previous step>

Analyze any errors (for example, resource issues).
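
One way to surface obvious resource constraints is to print the container resource requests and limits; this is a sketch, and the pod name placeholder is the mgr pod identified above:
oc get pod <rook-ceph-mgr pod name> -n openshift-storage -o jsonpath='{.spec.containers[*].resources}'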

Try deleting the pod and watch for a successful restart:

oc delete pod <rook-ceph-mgr pod name from previous step>
oc get pods | grep mgr

If the above fails, follow the general pod troubleshooting procedures below.

pod status: pending → Check for resource issues, pending PVCs, node assignment, kubelet problems.
oc project openshift-storage
oc get pod | grep rook-ceph-mgr
Set MYPOD for convenience:
# Examine the output for a rook-ceph-mgr that is in the pending state, not running or not ready
MYPOD=<pod identified as the problem pod>

Look for resource limitations or pending PVCs. Otherwise, check for node assignment:

oc get pod/${MYPOD} -o wide
pod status: NOT pending, running, but NOT ready → Check readiness probe.
oc describe pod/${MYPOD}
pod status: NOT pending, but NOT running → Check for app or image issues.
oc logs pod/${MYPOD}

If a node was assigned, check the kubelet on that node.
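
For example, the kubelet logs of the assigned node (node name taken from the -o wide output above) can be pulled with node-logs; this is a sketch, the node name is a placeholder:
oc adm node-logs <node name> -u kubelet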

(Optional) Gather logs to document the Ceph cluster health check:
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6

5. Troubleshooting

  • Issue encountered while following the SOP

Any issues while following the SOP should be documented here.