CephClusterWarningState
2. Description
Storage cluster is in warning state for more than 10m.
The rook-ceph-mgr job has been in a warning state for an unacceptable amount of time. Check for other alerts that would have triggered prior to this one and troubleshoot those alerts first.
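If you want to see the exact Prometheus expression behind this alert on your cluster, one way is to inspect the PrometheusRule objects that OCS ships in the openshift-storage namespace (a quick sketch; the rule object name and exact expression may differ between OCS versions):
oc get prometheusrules -n openshift-storage
oc get prometheusrules -n openshift-storage -o yaml | grep -B2 -A8 "alert: CephClusterWarningState"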
4. Prerequisites
To proceed with the prerequisites and resolution, you will need basic CLI tools, including (a quick availability check is sketched after this list):
- oc (OpenShift CLI)
- jq
- curl
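If you are unsure whether these tools are available in your shell, a quick check such as the following will confirm it before you continue:
# Confirm the required tools are on the PATH
oc version --client
jq --version
curl --version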
4.1. Verify cluster access
oc whoami
Check the output to ensure you are in the correct context for the cluster mentioned in the alert. If not, please change context and proceed.
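If you need to confirm or switch the cluster you are logged in to, something along these lines can help (the context name below is a placeholder you would replace with your own):
# Show the API server of the current context to confirm it matches the alerting cluster
oc whoami --show-server
# List available contexts and switch if needed
oc config get-contexts
oc config use-context <my-cluster-context>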
4.2. Check Alerts
MYALERTMANAGER=$(oc -n openshift-monitoring get routes/alertmanager-main --no-headers | awk '{print $2}')
Quickly view all alerts to check if your alert is still active.
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
Continue ONLY if you want to view your specific alert or need more details
export MYALERTNAME="<alertname from alert>"
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname | test(env.MYALERTNAME)) | { ALERT: .labels.alertname, STATE: .status.state}'
No entries means the alert is no longer active.
Some alerts, such as version mismatch alerts, can occur during upgrades and resolve themselves. If this alert is not a version mismatch alert, then there should be an investigation into what triggered it, even if the alert has resolved. Look for other active alerts or alerts with similar timing.
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select( .labels.alertname | test(env.MYALERTNAME)) | { ALERTDETAILS: .}'
More about the Prometheus Alert endpoint can be found here: https://prometheus.io/docs/prometheus/latest/querying/api/#alerts
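If the full list is noisy, a variation of the same query that keeps only currently active alerts can be handy (this simply adds a state filter to the jq expression already used above):
curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://${MYALERTMANAGER}/api/v1/alerts | jq '.data[] | select(.status.state == "active") | { ALERT: .labels.alertname, STATE: .status.state }'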
4.3. Check/Document OCS Ceph Cluster Health
You may directly check OCS Ceph Cluster health by using the rook-ceph toolbox.
The rook-ceph toolbox is not supported by Red Hat and is used here only to provide a quick health assessment. Do not use the toolbox to modify your Ceph cluster; use it for querying health only.
oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
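After enabling the toolbox, it may take a short while for the pod to start. One way to wait for it is sketched below (the label selector matches the pod lookup used later in this section; the timeout is an arbitrary choice):
oc -n openshift-storage wait --for=condition=Ready pod -l app=rook-ceph-tools --timeout=120s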
After the rook-ceph-tools pod is Running, access the toolbox like this:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
ceph status
ceph health
ceph osd status
ceph osd df
exit
Do not forget to exit.
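If you prefer to capture the same information without an interactive shell, or want to disable the toolbox again once you are done, something like the following works (the patch simply reverses the one used to enable the toolbox; the output filename is only an example):
# Run individual ceph commands non-interactively and save output for documentation
oc exec -n openshift-storage ${TOOLS_POD} -- ceph health detail
oc exec -n openshift-storage ${TOOLS_POD} -- ceph status > ceph_status.txt
# Disable the toolbox again when finished
oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": false }]'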
5. Procedure for Resolution
5.2. rook-ceph-mgr Pod Health
oc project openshift-storage
oc get pod | grep rook-ceph-mgr
Examine the output for a rook-ceph-mgr pod that is in a pending state, not running, or not ready.
MYPOD=<pod identified as the problem pod>
oc describe pod/${MYPOD}
If the pod is pending, look for resource limitations or pending PVCs. Otherwise, check for node assignment:
oc get pod/${MYPOD} -o wide
If a node was assigned, check kubelet on the node (a sketch of this check follows at the end of this subsection).
If the pod is not running or not ready, check the events in the describe output above and the pod logs:
oc logs pod/${MYPOD}
If you are at this step, then the pod is ok. Proceed to check the service.
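For the kubelet check mentioned above, one possible approach is to pull the node conditions and recent kubelet logs for the assigned node (the node name placeholder comes from the -o wide output):
MYNODE=<node from the -o wide output above>
# Node conditions (Ready, MemoryPressure, DiskPressure, ...)
oc get node ${MYNODE} -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'
# Recent kubelet logs from that node
oc adm node-logs ${MYNODE} -u kubelet | tail -n 50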
5.3. OCS Operator Pod Health
Check the OCS operator pod status to see if there is an OCS operator upgrade in progress.
oc get pod -n openshift-storage | grep ocs-operator
OCSOP=$(oc get pod -n openshift-storage -o custom-columns=POD:.metadata.name --no-headers | grep ocs-operator)
echo $OCSOP
oc get pod/${OCSOP} -n openshift-storage
oc describe pod/${OCSOP} -n openshift-storage
If you determine that an OCS operator upgrade is in progress, please be patient; wait 5 minutes and this alert should resolve itself.
If you have waited or see a different error status condition, please continue troubleshooting.
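As an additional signal that an install or upgrade is still progressing, the ClusterServiceVersion phase can be checked (a sketch; on a healthy, settled cluster the OCS CSV phase is typically Succeeded):
# List CSVs in the storage namespace and show their install phase
oc get csv -n openshift-storage
oc get csv -n openshift-storage -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'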
5.4. Check for OCS Operator Update
WIP (CLI): Does the status of the operator reflect an update in progress? See https://docs.openshift.com/container-platform/4.6/operators/admin/olm-status.html
WIP (UI): The cause of the missing replicas must be determined and fixed. Basic troubleshooting, starting with operator status all the way to OCP health, may be required.
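As a minimal CLI starting point for the check above, the OLM install plans in the storage namespace can be listed; a plan that is not Complete can indicate an update that is in progress or blocked (standard OLM behavior, not specific to OCS):
# Pending or failed install plans can indicate an update that has not completed
oc get installplan -n openshift-storage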
5.5. OCS Operator Subscription Health
oc get sub ocs-operator -n openshift-storage -o json | jq .status.conditions
As with all operators, the status condition types are:
CatalogSourcesUnhealthy, InstallPlanMissing, InstallPlanPending, InstallPlanFailed
The status for each type should be False. For example:
[
{
"lastTransitionTime": "2021-01-26T19:21:37Z",
"message": "all available catalogsources are healthy",
"reason": "AllCatalogSourcesHealthy",
"status": "False",
"type": "CatalogSourcesUnhealthy"
}
]
The output above shows a False status for type CatalogSourcesUnhealthy, meaning the catalog sources are healthy.
WIP: insert basic cluster health/resources check?
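To check all of the condition types at once, the same subscription query can be filtered for any condition that is not False; any output from the first command below is worth investigating. The remaining commands are one possible version of the basic cluster health/resources check noted above (all are standard oc commands):
# Any subscription condition with a status other than False indicates a problem
oc get sub ocs-operator -n openshift-storage -o json | jq '.status.conditions[] | select(.status != "False")'
# Basic cluster health and resource checks
oc get clusteroperators
oc get nodes
oc adm top nodes
# Pods in openshift-storage that are not in the Running state (header line is also filtered out)
oc get pod -n openshift-storage | grep -v Running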