CephOSDVersionMisMatch

1. Check Description

Severity: Warning

Potential Customer Impact: Medium

2. Overview

There are different versions of Ceph OSD components running.

Detailed Description: Typically this alert triggers during an upgrade that is taking a long time. OSD daemons are updated in a rolling fashion, so mixed versions are expected until the rollout completes.

3. Prerequisites

3.1. Verify cluster access

Ensure you are in the correct context for the cluster mentioned in the alert. If not, please change context and proceed.

List clusters you have permission to access:
ocm list clusters

From the list above, find the cluster ID of the cluster named in the alert. If you do not see the alerting cluster in the list above, please refer to Effective communication with SRE Platform.
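For example, to locate the cluster ID by name (a sketch; substitute the cluster name from the alert):
ocm list clusters | grep -i <cluster_name>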

Create a tunnel through backplane by providing your SSH key passphrase:
ocm backplane tunnel <cluster_id>
In a new tab, log in to the target cluster using backplane by providing your 2FA code:
ocm backplane login <cluster_id>
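Once logged in, you can confirm you are on the intended cluster before proceeding (a minimal sanity check using standard oc subcommands):
oc config current-context
oc whoami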

3.2. Check Alerts

Set up port forwarding for Alertmanager:
oc port-forward alertmanager-managed-ocs-alertmanager-0 9093 -n openshift-storage
Check all alerts:
curl http://localhost:9093/api/v1/alerts | jq '.data[] | select( .labels.alertname) | { ALERT: .labels.alertname, STATE: .status.state}'
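To check whether this specific alert is firing, you can filter the same endpoint (a sketch; adjust the alert name if your environment labels it differently):
curl -s http://localhost:9093/api/v1/alerts | jq '.data[] | select(.labels.alertname=="CephOSDVersionMisMatch") | { STATE: .status.state, STARTED: .startsAt }'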

3.3. Check OCS Ceph Cluster Health

You may directly check OCS Ceph Cluster health by using the rook-ceph toolbox.

Step 1: Check and document ceph cluster health:
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD
Step 2: From the rsh command prompt, run the following and capture the output.
ceph status
ceph osd status
exit
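Because this alert concerns mismatched OSD versions, it can also help to compare the versions each daemon reports. From the same rsh prompt (ceph versions is a standard Ceph command, shown here as a supplementary check):
ceph versions
During a healthy rolling upgrade you may temporarily see two releases listed under "osd"; once the upgrade completes, a single version should remain and the alert should clear.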

If 'ceph status' does not report HEALTH_OK, please look at the Troubleshooting section to resolve the issue.

3.4. Further info

3.4.1. OpenShift Data Foundation Dedicated Architecture

Red Hat OpenShift Data Foundation Dedicated (ODF Dedicated) is deployed in converged mode on OpenShift Dedicated Clusters by the OpenShift Cluster Manager add-on infrastructure.

4. Alert

4.1. Make changes to solve alert

Check if an operator upgrade is in progress.

Checking the OCS operator status involves verifying the operator subscription status and the operator pod health.

4.1.1. OCS Operator Subscription Health

Check the ocs-operator subscription status:
oc get sub ocs-operator -n openshift-storage -o json | jq .status.conditions

As with all operators, the status condition types are:

CatalogSourcesUnhealthy, InstallPlanMissing, InstallPlanPending, InstallPlanFailed

The status for each type should be False. For example:

[
  {
    "lastTransitionTime": "2021-01-26T19:21:37Z",
    "message": "all available catalogsources are healthy",
    "reason": "AllCatalogSourcesHealthy",
    "status": "False",
    "type": "CatalogSourcesUnhealthy"
  }
]

The output above shows a status of False for type CatalogSourcesUnhealthy, meaning the catalog sources are healthy.

4.1.2. OCS Operator Pod Health

Check the OCS operator pod status to see if an OCS operator upgrade is in progress.

WIP: Find specific status for upgrade (pending?)

To find and view the status of the OCS operator:
oc get pod -n openshift-storage | grep ocs-operator
OCSOP=$(oc get pod -n openshift-storage -o custom-columns=POD:.metadata.name --no-headers | grep ocs-operator)
echo $OCSOP
oc get pod/${OCSOP} -n openshift-storage
oc describe pod/${OCSOP} -n openshift-storage
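The pod status alone may not make an upgrade obvious. As a supplementary check (a sketch; the exact CSV name varies by release), the ClusterServiceVersion phase indicates whether an install or upgrade is underway:
oc get csv -n openshift-storage
A PHASE of Installing or Replacing suggests an upgrade is in progress; Succeeded means the operator is fully installed.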

If you determine that an OCS operator upgrade is in progress, please be patient; wait 5 minutes, and this alert should resolve itself.

If you have waited and the alert persists, or you see a different error status condition, please continue troubleshooting.

(Optional log gathering)

Document Ceph Cluster health check:
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6
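If you want the gathered data written to a specific location (optional; --dest-dir is a standard oc adm must-gather flag, and the path below is only an example):
oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.6 --dest-dir=/tmp/ocs-must-gather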

5. Troubleshooting

  • Issue encountered while following the SOP

Any issues while following the SOP should be documented here.