Troubleshooting OpenEBS - cStor
General guidelines for troubleshooting#
- Contact OpenEBS Community for support.
- Search for similar issues added in this troubleshooting section.
- Search for any reported issues on StackOverflow under the OpenEBS tag.
- cStor volume goes into read-only state
- cStor pools, volumes are offline and pool manager pods are stuck in pending state
- Pool Operation Hung Due to Bad Disk
- Volume Migration when the underlying cStor pool is lost
One of the cStorVolumeReplicas (CVR) has its status as Invalid after the corresponding pool pod gets recreated#
When a user deletes a cStor pool pod, there is a high chance that the CVRs related to that pool go into the Invalid state.
The following is a sample output of kubectl get cvr -n openebs:
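The listing can be reproduced with the command below; a CVR affected by this issue shows Invalid in its status column.

```
kubectl get cvr -n openebs
```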
Troubleshooting
Sample logs of cstor-pool-mgmt when the issue happens:
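The logs can be pulled from the cstor-pool-mgmt container of the affected pool pod (the pod name below is a placeholder):

```
kubectl logs <pool-pod-name> -n openebs -c cstor-pool-mgmt
```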
From these logs, we can confirm that the cstor-pool-mgmt container in the new pod is communicating with the cstor-pool container in the old pod: the first relevant log line reports that the cStor pool was found, and the next one reports that the pool was imported.
Possible Reason:
When a cStor pool pod is deleted, there is a high chance that two pool pods of the same pool exist at the same time: the old pool pod is in Terminating state (meaning not all of its containers have fully terminated) while the new pool pod is already Running (some, but possibly not all, of its containers are running). In this window, the cstor-pool-mgmt container in the new pool pod communicates with the cstor-pool container in the old pool pod. This can cause the CVR resource to be set to Invalid.
Note: This issue has been observed in all OpenEBS versions up to 1.2.
Resolution:
Edit the Phase of the CStorVolumeReplica (CVR) from Invalid to Offline. After a few seconds, the CVR moves to the Healthy or Degraded state, depending on the rebuilding progress.
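A minimal sketch of the edit, assuming the phase field of the CVR can be modified directly (the CVR name is a placeholder):

```
kubectl edit cvr <cvr-name> -n openebs
# In the editor, change the phase under status from:
#   phase: Invalid
# to:
#   phase: Offline
```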
cStor volume goes into read-only state#
The application mount point backed by a cStor volume went into read-only state.
Possible Reason:
If the CStorVolume is Offline, or the corresponding target pod is unavailable for more than 120 seconds (the iSCSI timeout), the PV will be mounted as a read-only filesystem. More details on the different states of a cStor volume can be found here.
Troubleshooting
Check the status of the corresponding cStor volume using the following command:
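For example (cStor volumes are created in the openebs namespace by default; check the status column of the volume backing your PV):

```
kubectl get cstorvolume -n openebs
```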
If the cStor volume is in the Healthy or Degraded state, restarting the application pod alone brings the cStor volume back to RW mode. If the cStor volume is Offline, reach out to the OpenEBS Community for assistance.
cStor pools, volumes are offline and pool manager pods are stuck in pending state#
The cStor pools and volumes are offline and the pool manager pods are stuck in a Pending state, as shown below:
Sample Output:
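The state can be inspected with commands along these lines (a sketch; adjust the namespace if OpenEBS is installed elsewhere):

```
kubectl get cspc -n openebs
kubectl get cspi -n openebs
kubectl get pods -n openebs
```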
One scenario that can lead to this situation is when the nodes have been scaled down and then scaled up again. The nodes come back with a different hostName and node name, i.e., the nodes that have come up are new nodes and not the same nodes that existed earlier. Due to this, the disks that were attached to the older nodes are now attached to the newer nodes.
Troubleshooting
To bring the cStor pool back online, carry out the steps mentioned below.
Update the validatingwebhookconfiguration resource's failurePolicy: Update the validatingwebhookconfiguration resource's failure policy to Ignore. It would previously have been set to Fail. This informs the kube-apiserver to ignore the error in case the cStor admission server is not reachable. To edit, execute the command shown below and confirm that the updated resource shows failurePolicy: Ignore.
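A sketch of the edit; use kubectl to find the exact name of the cStor webhook configuration in your cluster:

```
# List the validating webhook configurations and find the cStor one
kubectl get validatingwebhookconfiguration
# Edit it and set failurePolicy: Ignore (previously Fail)
kubectl edit validatingwebhookconfiguration <webhook-config-name>
```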
Scale down the admission server: The OpenEBS admission server needs to be scaled down, as this skips the validations performed by the cStor admission server when the CSPC spec is updated with the new node details.
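A sketch of the scale-down, using the deployment name that appears later in this guide:

```
kubectl scale deploy openebs-cstor-admission-server -n openebs --replicas=0
```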
Sample Output:
Update the CSPC spec nodeSelector: The CStorPoolCluster needs to be updated with the new nodeSelector values, so that the CSPC points to the new nodes instead of the old nodeSelectors. Update kubernetes.io/hostname with the new values, as in the sample below.
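A sketch of the relevant portion of the CSPC spec; the hostname value is a placeholder and the rest of the pool spec is omitted:

```
spec:
  pools:
  - nodeSelector:
      kubernetes.io/hostname: <new-node-hostname>
    # dataRaidGroups and other pool settings remain unchanged
```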
To apply the above configuration, execute:
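Assuming the edited CStorPoolCluster was saved to a local file (the filename here is a placeholder):

```
kubectl apply -f cspc.yaml
```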
Update nodeSelectors, labels and nodeName: Next, the CSPI needs to be updated with the correct node details. Get the details of the node to which the previous blockdevice is now attached, and update the hostName, the nodeSelector values and the kubernetes.io/hostname value in the labels of the CSPI with the new details. To update, execute the command shown below.
NOTE: The same process needs to be repeated for all other CSPIs which are in the Pending state and belong to the updated CSPC.
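A sketch of the CSPI update; the CSPI name and the field paths noted in the comments are based on a typical cStor install and the hostname is a placeholder:

```
kubectl edit cspi <cspi-name> -n openebs
# Update the following with the new node's details:
#   metadata.labels["kubernetes.io/hostname"]
#   spec.hostName
#   spec.nodeSelector["kubernetes.io/hostname"]
```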
Verification: On successful completion of the above steps, the updated CSPI generates an event stating that the pool was successfully imported, which verifies that the steps have been completed successfully.
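The event can be checked by describing the CSPI (the name is a placeholder):

```
kubectl describe cspi <cspi-name> -n openebs
```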
Sample Output:
Scale up the cStor admission server and update the validatingwebhookconfiguration: This brings the cStor admission server back to the running state. The admission server is also required to validate future modifications made to the CSPC API.
$ kubectl scale deploy openebs-cstor-admission-server -n openebs --replicas=1
Sample Output:
Now, update the failurePolicy back to Fail under the validatingwebhookconfiguration, using the same kubectl edit command as in the first step.
Sample Output:
Pool Operation hung due to Bad Disk#
cStor scans all the devices on the node while it tries to import the pool whenever the pool manager pod restarts. Pools are always imported before creation: on pool creation, all of the devices are scanned and, as there is no existing pool, a new pool is created. Once the pool is created, the participating devices are cached for faster import of the pool (in case of a pool manager pod restart). If the import uses this cache, the issue described here is not hit, but an import without the cache can happen (for example, when the pool is being created for the first time).
In such cases, where the pool import happens without a cache file, if any device (even a device that is not part of the cStor pool) is bad and not responding, the command issued by cStor keeps waiting and gets stuck. As a result, the pool manager pod cannot issue any more commands to reconcile the state of the cStor pools, or even perform the I/O for the volumes that are placed on that particular pool.
Troubleshooting
This might be encountered because of one of the following situations:
- The device that has gone bad is actually a part of the cStor pool on the node. In such cases, block device replacement needs to be done; the detailed steps can be found here.
Note: Block device replacement is not supported for stripe raid configuration. Please visit this link for some use cases and solutions.
- The device that has gone bad is not part of the cStor pool on the node. In this case, removing the bad disk from the node and restarting the pool manager pod will fix the problem.
Volume Migration when the underlying cStor pool is lost#
Scenarios that can result in the loss of cStor pool(s):#
- If the node is lost.
- If one or more disks participating in the cStor pool are lost. This occurs when the pool configuration is set to stripe.
- If all the disks participating in any raid group are lost. This occurs when the pool configuration is set to mirror.
- If the cStor pool configuration is raidz and more than 1 disk in any raid group is lost.
- If the cStor pool configuration is raidz2 and more than 2 disks in any raid group are lost.
This situation is often encountered in Kubernetes clusters that have the autoscale feature enabled, where nodes scale down and scale up.
If the volume replica that resided on the lost pool was configured in high availability mode then the volume replica can be migrated to a new cStor pool.
NOTE: The CStorVolume associated with the volume replicas that have to be migrated should be in the Healthy state.
STEP 1:
Remove the cStorVolumeReplicas from the lost pool:
To remove the pool, the CStorVolumeConfig (CVC) needs to be updated: the poolName of the corresponding pool needs to be removed from replicaPoolInfo. This ensures that the admission server accepts the scale-down request.
NOTE: Ensure that the cstorvolume and target pods are in running state.
A sample CVC resource (corresponding to the volume) that has 3 pools is shown below.
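A sketch of the relevant portion of such a CVC; cstor-cspc-4tr5 is the pool referenced later in this example, the other two pool names are placeholders, and the exact field placement may differ slightly across cStor versions:

```
spec:
  policy:
    replicaPoolInfo:
    - poolName: cstor-cspc-4tr5
    - poolName: cstor-cspc-pool2   # placeholder
    - poolName: cstor-cspc-pool3   # placeholder
```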
Now edit the CVC and remove the desired poolName.
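For example (the CVC typically has the same name as the corresponding PV):

```
kubectl edit cvc <pv-name> -n openebs
# Delete the "- poolName: <lost-pool-name>" entry under replicaPoolInfo and save
```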
From the above spec, the cstor-cspc-4tr5 CSPI entry is removed. This needs to be repeated for all the volumes which have cStor volume replicas on the lost pool. To get the list of volume replicas on the lost pool, execute:
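CVR names include the pool instance name as a suffix, so the replicas on the lost pool can be listed with a filter like the following (the pool name is a placeholder):

```
kubectl get cvr -n openebs | grep <lost-cspi-name>
```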
STEP 2:
Remove the finalizer from cStor volume replicas
The CVRs need to be deleted from etcd. This requires the cstorvolumereplica.openebs.io/finalizer finalizer to be removed from the CVRs which were present on the lost cStor pool.
Usually, the finalizer is removed by the pool manager pod, but since in this case the pod is not in the running state, manual intervention is required.
To get the list of CVRs, execute:
Sample Output:
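A sketch of removing the finalizer from one CVR (the name is a placeholder); repeat this for each affected CVR:

```
kubectl edit cvr <cvr-name> -n openebs
# Remove the following entry from metadata.finalizers and save:
#   - cstorvolumereplica.openebs.io/finalizer
```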
After this step, the CStorVolume will be scaled down. To verify, execute:
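For example (the volume name is a placeholder):

```
kubectl get cstorvolume <volume-name> -n openebs -o yaml
# Check that the replica-related fields (e.g. the desired replication
# factor) reflect the reduced replica count.
```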
Sample Output:
STEP 3:
Remove the pool spec that belongs to the lost node from the CSPC
Next, the corresponding CSPC needs to be edited and the pool spec that belongs to the node which no longer exists needs to be removed. To edit the CSPC, execute:
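For example (the CSPC name is a placeholder):

```
kubectl edit cspc <cspc-name> -n openebs
# Remove the entry under spec.pools whose nodeSelector points to the lost node
```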
This updates the number of desired instances.
To verify, execute:
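For example; the listing includes the provisioned and desired instance counts for the CSPC:

```
kubectl get cspc -n openebs
```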
Sample Output:
Since the CSPI has the pool protection finalizer, i.e. openebs.io/pool-protection, the CSPC operator is unable to delete the CSPI. For this reason the count of provisioned instances still remains 3.
To fix this, the openebs.io/pool-protection finalizer must be removed from the CSPI that was present on the lost node.
To edit, execute:
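A sketch (the CSPI name is a placeholder):

```
kubectl edit cspi <cspi-name> -n openebs
# Remove the following entry from metadata.finalizers and save:
#   - openebs.io/pool-protection
```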
After the finalizer is removed, the CSPI count goes down to the desired number.
STEP 4:
Scale the cStorVolumeReplicas back to the original number
Scale the CStorVolumeReplicas back to the desired number on a new or existing cStor pool where a volume replica of the same volume does not already exist.
NOTE: A CStorVolume is a collection of 1 or more volume replicas and no two replicas of a CStorVolume should reside on the same CStorPoolInstance. CStorVolume is a custom resource and a logical aggregated representation of all the underlying cStor volume replicas for this particular volume.
To get the list of CSPIs, execute:
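For example:

```
kubectl get cspi -n openebs
```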
Sample Output:
Next, add the newly created CStorPoolInstance under the CVC spec. In this example, we are adding cstor-cspc-bf9h.
To edit, execute:
Sample YAML:
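A sketch of the updated replicaPoolInfo, with cstor-cspc-bf9h added; the other pool names are the placeholders used earlier in this example:

```
spec:
  policy:
    replicaPoolInfo:
    - poolName: cstor-cspc-pool2   # placeholder
    - poolName: cstor-cspc-pool3   # placeholder
    - poolName: cstor-cspc-bf9h
```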
The same needs to be repeated for all the scaled-down cStor volumes. Next, verify the status of the new CStorVolumeReplicas (CVRs) that are provisioned.
To get the list of CVRs, execute:
Sample Output:
To get the list of CSPIs, execute:
Sample Output: