CephMonQuorumAtRisk
2. Overview
Storage cluster quorum is low.
Multiple Ceph monitors (mons) work together to provide redundancy, each keeping a copy of the cluster metadata. The cluster is deployed with 3 mons and requires at least 2 of them to be up and running to maintain quorum and for storage operations to run. If quorum is lost, access to data is at risk.
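The 2-of-3 requirement above follows from mons needing a strict majority to form quorum. As a minimal sketch of that arithmetic (the function names `mon_quorum_size` and `mon_failures_tolerated` are hypothetical helpers, not part of any Ceph or OpenShift tooling):

```shell
# Smallest number of mons that forms a strict majority of the total.
# $1: total number of mons deployed
mon_quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of mons that can be down while a majority still remains.
# $1: total number of mons deployed
mon_failures_tolerated() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

For the 3-mon deployment described here, `mon_quorum_size 3` gives 2 and `mon_failures_tolerated 3` gives 1, which is why the loss of a single mon puts quorum at risk.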
3. Prerequisites
3.1. Verify cluster access
Run the command below and check its output to ensure you are in the correct context for the cluster mentioned in the alert. If not, change context and proceed.
ocm list clusters
From the list above, find the cluster ID of the cluster named in the alert. If the alerting cluster does not appear in the list, refer to Effective communication with SRE Platform.
ocm backplane tunnel <cluster_id>
ocm backplane login <cluster_id>
3.2. Check Alerts
oc port-forward alertmanager-managed-ocs-alertmanager-0 9093 -n openshift-storage
curl http://localhost:9093/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
3.3. Check OCS Ceph Cluster Health
You may directly check OCS Ceph Cluster health by using the rook-ceph toolbox.
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
ceph status
ceph osd status
exit
If 'ceph status' does not report HEALTH_OK, see the Troubleshooting section to resolve the issue.
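The HEALTH_OK check above can also be scripted without opening an interactive shell in the toolbox. The sketch below assumes the one-line output of `ceph health` begins with the health state; `ceph_health_is_ok` is a hypothetical helper name.

```shell
# Succeeds (exit 0) only when the given health line starts with HEALTH_OK.
# $1: output of `ceph health`, e.g. "HEALTH_OK" or "HEALTH_WARN ..."
ceph_health_is_ok() {
  case "$1" in
    HEALTH_OK*) return 0 ;;
    *)          return 1 ;;
  esac
}
```

Assuming `TOOLS_POD` is set as shown earlier, one possible invocation is `ceph_health_is_ok "$(oc exec -n openshift-storage "$TOOLS_POD" -- ceph health)" || echo "cluster is not healthy"`. Since this alert concerns mon quorum, `ceph mon stat` or `ceph quorum_status` run the same way may show which mons are currently in quorum.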
3.4. Further info
3.4.1. OpenShift Data Foundation Dedicated Architecture
Red Hat OpenShift Data Foundation Dedicated (ODF Dedicated) is deployed in converged mode on OpenShift Dedicated Clusters by the OpenShift Cluster Manager add-on infrastructure.
1) https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/troubleshooting-ceph-osds#common-ceph-osd-error-messages-in-the-ceph-logs_diag
2) https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/troubleshooting_guide/troubleshooting-ceph-placement-groups#inconsistent-placement-groups_diag
4. Alert
Follow the linked procedures for quorum recovery.
Follow the general pod debug workflow outlined below.
oc project openshift-storage
oc get pod | grep rook-ceph-mon
# Examine the output for a rook-ceph-mon pod that is Pending, not Running, or not ready
MYPOD=<pod identified as the problem pod>
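Picking out the problem pod by eye can be sketched as a filter over the `oc get pod` table. This assumes the default column layout (NAME, READY, STATUS, ...); `unhealthy_mon_pods` is a hypothetical helper name.

```shell
# Reads `oc get pod` table output on stdin and prints the names of
# rook-ceph-mon pods that are not Running or whose READY count is short.
unhealthy_mon_pods() {
  awk '
    $1 ~ /^rook-ceph-mon/ {
      split($2, ready, "/")   # READY column, e.g. "1/2"
      if ($3 != "Running" || ready[1] != ready[2]) print $1
    }'
}
```

With the project already set to openshift-storage, `oc get pod | unhealthy_mon_pods` lists candidates for `MYPOD`. A pod that is Running and fully ready is not printed, so empty output suggests the mon pods themselves look healthy at the pod level.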
Look for resource limitations or pending PVCs. Otherwise, check for node assignment.
oc get pod/${MYPOD} -o wide
oc describe pod/${MYPOD}
oc logs pod/${MYPOD}
If a node was assigned, check kubelet on the node.
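One possible way to carry out the kubelet check is to read the node name from the pod spec and then pull the kubelet unit logs with `oc adm node-logs`. This is a sketch only: it assumes your session has permission to read node logs, and `mon_pod_kubelet_logs` is a hypothetical helper name.

```shell
# Prints kubelet logs for the node that hosts the given pod.
# $1: pod name in the openshift-storage namespace
mon_pod_kubelet_logs() {
  pod="$1"
  # An unscheduled (Pending) pod has an empty .spec.nodeName.
  node=$(oc get pod "$pod" -n openshift-storage -o jsonpath='{.spec.nodeName}')
  if [ -z "$node" ]; then
    echo "pod $pod has no node assigned" >&2
    return 1
  fi
  oc adm node-logs "$node" -u kubelet
}
```

For example, `mon_pod_kubelet_logs "${MYPOD}" | tail -n 200` shows the most recent kubelet entries. If the function reports that no node is assigned, the pod is still unscheduled and the `oc describe` events are the better place to look.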
(Optional log gathering)
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6