CephMdsMissingReplicas

1. Message

Insufficient replicas for storage metadata service.

2. Description

Minimum required replicas for the storage metadata service are not available. This might affect the working of the storage cluster.

Detailed Description: Minimum required replicas for the storage metadata service (MDS) are not available. MDS is responsible for file metadata. Degradation of the MDS service can affect the working of the storage cluster and should be fixed as soon as possible.

3. Severity

Warning

4. Prerequisites

To proceed with the prerequisites and resolution, you will need basic CLI tools, including:

  • oc (OpenShift CLI)

  • jq

  • curl

4.1. Verify cluster access

Verify that you are logged in as a user with cluster-admin privileges and that you are connected to the correct cluster:
oc whoami

Check the output to ensure you are in the correct context for the cluster mentioned in the alert. If not, please change context and proceed.
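If you are unsure which cluster or context you are connected to, the following standard oc commands can help (a quick sketch; context names will vary by cluster):
oc whoami --show-server
oc config current-context
oc config use-context <context for the cluster in the alert>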

4.2. Check Alerts

Get the route to this cluster’s alertmanager:
MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')
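Confirm the route was captured before using it in the curl commands below:
echo ${MYALERTMANAGER}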

Quickly view all alerts to check if your alert is still active.

Check all alerts
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
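If the full list is long, the same query can be narrowed to firing alerts only (a sketch reusing the command above; "active" is the state the Alertmanager API reports for firing alerts):
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select(.status.state=="active") | { ALERT: .labels.alertname, STATE: .status.state}'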

Continue ONLY if you want to view your specific alert or need more details.

Get the alertname from the alert and set it for use in jq:
export MYALERTNAME="<alertname from alert>"
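For the alert covered by this runbook, for example:
export MYALERTNAME="CephMdsMissingReplicas"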
Check if the alert is still active:
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname | test(env.MYALERTNAME)) | { ALERT: .labels.alertname, STATE: .status.state}'

No entries means the alert is no longer active.

Some alerts, such as version-mismatch alerts, can occur during upgrades and resolve themselves. If this alert is not a version-mismatch alert, investigate what triggered it even if it has already resolved. Look for other active alerts or alerts with similar timing.

If you need more details run:
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)"  https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname | test(env.MYALERTNAME)) | { ALERTDETAILS: .}'

More about the Prometheus Alert endpoint can be found here: https://prometheus.io/docs/prometheus/latest/querying/api/#alerts

4.3. Check/Document OCS Ceph Cluster Health

You may directly check OCS Ceph Cluster health by using the rook-ceph toolbox.

The rook-ceph toolbox is not supported by Red Hat and is used here only to provide a quick health assessment. Do not use the toolbox to modify your Ceph cluster. Use the toolbox for querying health only.
oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch  '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
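Before continuing, confirm that the toolbox pod is Running (same namespace and label used in the commands below):
oc -n openshift-storage get pods -l app=rook-ceph-tools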

Check and document Ceph cluster health:

Step 1: After the rook-ceph-tools Pod is Running, access the toolbox:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
Step 2: From the rsh command prompt, run the following and capture the output:
ceph status
ceph health
ceph osd status
ceph osd df
exit

Do not forget to exit.
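If you only need a quick, non-interactive snapshot, the same checks can be passed directly to oc rsh instead of opening a shell, for example:
oc rsh -n openshift-storage $TOOLS_POD ceph status
oc rsh -n openshift-storage $TOOLS_POD ceph health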

5. Procedure for Resolution

5.1. Resolution Overview
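The appropriate fix depends on why MDS replicas are missing (for example, MDS pods failing to schedule or repeatedly crashing). As a starting point for investigation (a sketch, assuming the default openshift-storage namespace and the standard app=rook-ceph-mds label applied by the Rook operator), check the MDS pods and their recent events:
oc get pods -n openshift-storage -l app=rook-ceph-mds
oc describe pods -n openshift-storage -l app=rook-ceph-mds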

6. Gathering Logs

Document the Ceph cluster health check output captured above.
For ODF-specific results, run:
oc adm must-gather --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.10
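To collect the output into a specific local directory, the standard --dest-dir option can be added, for example:
oc adm must-gather --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.10 --dest-dir=./odf-must-gather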