CephMdsMissingReplicas
2. Description
Minimum required replicas for the storage metadata service (MDS) are not available. This might affect the working of the storage cluster.
Detailed Description: Minimum required replicas for the storage metadata service (MDS) are not available. The MDS is responsible for file metadata. Degradation of the MDS service can affect the working of the storage cluster and should be fixed as soon as possible.
4. Prerequisites
To proceed with the prerequisites and resolution, you will need the following basic CLI tools:
- oc (OpenShift CLI)
- jq
- curl
4.1. Verify cluster access
oc whoami
Check the output to ensure you are in the correct context for the cluster mentioned in the alert. If not, please change context and proceed.
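If the output shows the wrong cluster, switching kubeconfig contexts along these lines usually suffices. This is a sketch: `my-cluster-context` is a placeholder, not a real context name, and the `command -v` guard simply makes the snippet a no-op on machines without the OpenShift CLI.

```shell
# Guarded so this sketch is a no-op where oc is not installed.
if command -v oc >/dev/null 2>&1; then
  # List the contexts known to your kubeconfig; the current one is starred.
  oc config get-contexts
  # Switch to the context for the cluster named in the alert.
  # NOTE: "my-cluster-context" is a placeholder; substitute your own.
  oc config use-context my-cluster-context || echo "substitute your own context name"
fi
```

Re-run `oc whoami` afterward to confirm the switch took effect.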
4.2. Check Alerts
MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')
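The same hostname can be pulled without `awk` by reading the Route's `.spec.host` field via jsonpath; this is an equivalent sketch, guarded so it does nothing without cluster access.

```shell
# Guarded sketch: only runs with a working oc login.
if command -v oc >/dev/null 2>&1 && oc whoami >/dev/null 2>&1; then
  # .spec.host holds the externally routable hostname of the Route.
  MYALERTMANAGER=$(oc -n openshift-monitoring get route alertmanager-main \
    -o jsonpath='{.spec.host}')
  echo "$MYALERTMANAGER"
fi
```

Either form should yield the same hostname; use whichever you prefer.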
Quickly view all alerts to check if your alert is still active.
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
Continue ONLY if you want to view your specific alert or need more details:
export MYALERTNAME="<alertname from alert>"
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname | test(env.MYALERTNAME)) | { ALERT: .labels.alertname, STATE: .status.state}'
No entries means the alert is no longer active.
Some alerts, such as version-mismatch alerts, can occur during upgrades and resolve themselves. If this alert is not a version-mismatch alert, then even if it has resolved, investigate what triggered it. Look for other active alerts or alerts with similar timing.
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname | test(env.MYALERTNAME)) | { ALERTDETAILS: .}'
More about the Prometheus Alert endpoint can be found here: https://prometheus.io/docs/prometheus/latest/querying/api/#alerts
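To see how the jq filters above behave, you can run them against a small, hand-written sample of the Alertmanager response; no cluster access is required. The sample payload below is illustrative, not real cluster output.

```shell
# Trimmed, hand-written sample of an Alertmanager /api/v1/alerts
# response, used only to illustrate the jq filters above.
SAMPLE='{"status":"success","data":[{"labels":{"alertname":"CephMdsMissingReplicas","severity":"warning"},"status":{"state":"active"}}]}'

# Same shape as the "all alerts" query: alert name and state.
echo "$SAMPLE" | jq '.data[] | { ALERT: .labels.alertname, STATE: .status.state }'

# Same shape as the filtered query: jq reads MYALERTNAME from the
# environment via its built-in env object.
export MYALERTNAME="CephMdsMissingReplicas"
echo "$SAMPLE" | jq '.data[] | select(.labels.alertname | test(env.MYALERTNAME)) | { ALERT: .labels.alertname, STATE: .status.state }'
```

Both invocations print the alert name and state for the matching entry; against the real endpoint, an empty result from the filtered query means the alert is no longer active.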
4.3. Check/Document OCS Ceph Cluster Health
You can check the OCS Ceph cluster health directly by using the rook-ceph toolbox.
The rook-ceph toolbox is not supported by Red Hat and is used here only to provide a quick health assessment. Do not use the toolbox to modify your Ceph cluster. Use the toolbox for querying health only.
oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
After the rook-ceph-tools Pod is Running, access the toolbox like this:
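One way to wait for the toolbox Pod to become Ready is `oc wait`; this is a sketch, and the 120-second timeout is an arbitrary choice, not a requirement. The guard makes the snippet a no-op without a working cluster login.

```shell
# Guarded sketch: only runs with a working oc login.
if command -v oc >/dev/null 2>&1 && oc whoami >/dev/null 2>&1; then
  # Block until the toolbox Pod reports the Ready condition,
  # or give up after two minutes.
  oc wait -n openshift-storage --for=condition=Ready pod \
    -l app=rook-ceph-tools --timeout=120s
fi
```

Alternatively, `oc get pods -n openshift-storage -l app=rook-ceph-tools -w` lets you watch the Pod status by hand.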
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
ceph status
ceph health
ceph osd status
ceph osd df
exit
Do not forget to exit.
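Since the toolbox is for querying only, it is reasonable to turn it back off once you are done; a sketch, mirroring the enabling patch above with the value flipped to false, guarded so it is a no-op without cluster access:

```shell
# Guarded sketch: only runs with a working oc login.
if command -v oc >/dev/null 2>&1 && oc whoami >/dev/null 2>&1; then
  # Flip enableCephTools back to false to remove the toolbox Pod.
  oc patch OCSInitialization ocsinit -n openshift-storage \
    --type json \
    --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": false }]'
fi
```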