Troubleshooting OpenEBS - cStor
General guidelines for troubleshooting
- Contact the OpenEBS Community for support.
- Search for similar issues added in this troubleshooting section.
- Search for any reported issues on StackOverflow under the OpenEBS tag.
- cStor volume goes into read-only state
- cStor pools and volumes are offline and pool manager pods are stuck in pending state
- Pool Operation Hung Due to Bad Disk
- Volume Migration when the underlying cStor pool is lost
One of the cStorVolumeReplica (CVR) will have its status as Invalid after the corresponding pool pod gets recreated
When a user deletes a cStor pool pod, there is a high chance that the corresponding pool-related CVRs go into the Invalid state.
Following is a sample output of kubectl get cvr -n openebs
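The command itself is (assuming OpenEBS is installed in the default openebs namespace):
$ kubectl get cvr -n openebs
A CVR affected by this issue would typically show Invalid in its STATUS column.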
Troubleshooting
Sample logs of cstor-pool-mgmt when the issue occurs:
From the above highlighted logs, we can confirm that cstor-pool-mgmt in the new pod is communicating with cstor-pool in the old pod: the first highlighted log line reports that the cStor pool was found, and the next one reports that the pool was actually imported.
Possible Reason:
When a cStor pool pod is deleted, there is a high chance that two pool pods of the same pool exist at the same time: the old pool pod is in the Terminating state (meaning not all of its containers have terminated completely) while the new pool pod is already in the Running state (some, but not necessarily all, of its containers are running). In this scenario, the cstor-pool-mgmt container in the new pool pod communicates with the cstor-pool container in the old pool pod. This can cause the CVR resource to be set to Invalid.
Note: This issue has been observed in OpenEBS versions up to 1.2.
Resolution:
Edit the Phase of the cStorVolumeReplica (CVR) from Invalid to Offline. After a few seconds, the CVR will move to the Healthy or Degraded state, depending on the rebuilding progress.
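A minimal sketch of this edit, assuming the affected CVR name was taken from the kubectl get cvr -n openebs output:
$ kubectl edit cvr <cvr-name> -n openebs
# in the editor, change the phase under status from Invalid to Offline and save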
cStor volume goes into read-only state
The application mount point running on a cStor volume went into read-only state.
Possible Reason:
If the cStorVolume is Offline, or the corresponding target pod is unavailable for more than 120 seconds (the iSCSI timeout), the PV will be mounted as a read-only filesystem. More details on the different states of a cStor volume can be found here.
Troubleshooting
Check the status of the corresponding cStor volume using the following command:
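For example (the CStorVolume object is assumed to live in the openebs namespace and to carry the name of the PV):
$ kubectl get cstorvolume -n openebs             # list all cStor volumes
$ kubectl get cstorvolume <pv-name> -n openebs   # check a specific volume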
If the cStor volume is in the Healthy or Degraded state, restarting the application pod alone will bring the cStor volume back to RW mode. If the cStor volume is Offline, reach out to the OpenEBS Community for assistance.
cStor pools and volumes are offline and pool manager pods are stuck in pending state
The cStor pools and volumes are offline and the pool manager pods are stuck in a Pending state, as shown below:
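For example, the pool manager pods and their state can be listed with (pod names are typically derived from the CSPC name, but will differ per cluster):
$ kubectl get pods -n openebs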
Sample Output:
One scenario that can lead to such a situation is when the nodes have been scaled down and then scaled up. This results in nodes coming up with a different hostName and node name, i.e., the nodes that have come up are new nodes and not the same nodes that existed earlier. Due to this, the disks that were attached to the older nodes now get attached to the newer nodes.
Troubleshooting
To bring the cStor pool back to the online state, carry out the steps mentioned below.
Update the validatingwebhookconfiguration resource's failurePolicy: Update the validatingwebhookconfiguration resource's failurePolicy to Ignore. It would previously have been set to Fail. This informs the kube-apiserver to ignore the error in case the cStor admission server is not reachable. To edit, execute:
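A sketch of the edit; the webhook configuration name below is an assumption, so list the configurations first to confirm the exact name in your cluster:
$ kubectl get validatingwebhookconfiguration
$ kubectl edit validatingwebhookconfiguration openebs-cstor-validation-webhook
# in the editor, change failurePolicy: Fail to failurePolicy: Ignore and save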
Sample Output with updated failurePolicy:
Scale down the admission server: The OpenEBS cStor admission server needs to be scaled down, as this skips the validations performed by the admission server while the CSPC spec is updated with the new node details. To scale down, execute:
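The deployment name below is the same admission server deployment that is scaled back up later in this procedure:
$ kubectl scale deploy openebs-cstor-admission-server -n openebs --replicas=0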
Sample Output:
Update the CSPC spec nodeSelector: The CStorPoolCluster needs to be updated with the new nodeSelector values, so that the CSPC points to the new nodes instead of the old ones. Update kubernetes.io/hostname with the new values.
Sample Output:
To apply the above configuration, execute:
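Assuming the updated CSPC spec was saved to a local file (for example cspc.yaml), it can be applied with:
$ kubectl apply -f cspc.yaml
# alternatively, edit the CSPC in place:
$ kubectl edit cspc <cspc-name> -n openebs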
Update nodeSelectors, labels and nodeName: Next, the CSPI needs to be updated with the correct node details. Get the details of the node to which the previous blockdevice is now attached, and update the hostName, the nodeSelector values and the kubernetes.io/hostname label of the CSPI with the new details.
NOTE: The same process needs to be repeated for all other CSPIs that are in the Pending state and belong to the updated CSPC.
To update, execute:
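A sketch of the edit; the exact field paths may vary slightly between versions, but hostName, nodeSelector and the kubernetes.io/hostname label are the values to change:
$ kubectl get cspi -n openebs
$ kubectl edit cspi <cspi-name> -n openebs
# update hostName, nodeSelector and the kubernetes.io/hostname label with the new node's values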
Verification: On successful implementation of the above steps, the updated CSPI generates an event saying the pool was successfully imported, which verifies that the above steps have been completed successfully.
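The event can be checked, for example, with:
$ kubectl describe cspi <cspi-name> -n openebs
# the Events section should report that the pool was imported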
Sample Output:
Scale up the cStor admission server and update the validatingwebhookconfiguration: This brings the cStor admission server back to the running state. The admission server is also required to validate any future modifications made to the CSPC API.
$ kubectl scale deploy openebs-cstor-admission-server -n openebs --replicas=1
Sample Output:
Now, update the failurePolicy back to Fail under the validatingwebhookconfiguration. To edit, execute:
Sample Output:
Pool Operation Hung Due to Bad Disk
cStor scans all the devices on the node while it tries to import the pool whenever there is a pool manager pod restart. Pool(s) are always imported before creation: on pool creation, all of the devices are scanned and, as there are no existing pool(s), a new pool is created. Once the pool is created, the participating devices are cached for faster import of the pool (in case of a pool manager pod restart). If the import uses the cache, this issue is not hit, but there is a chance of an import without the cache (when the pool is being created for the first time).
In such cases, where the pool import happens without a cache file, if any of the devices (even a device that is not part of the cStor pool) is bad and not responding, the command issued by cStor keeps waiting and gets stuck. As a result, the pool manager pod is not able to issue any more commands to reconcile the state of the cStor pools, or even to perform IO for the volumes placed on that particular pool.
Troubleshooting
This might be encountered because of one of the following situations:
- The device that has gone bad is actually a part of the cStor pool on the node. In such cases, block device replacement needs to be done; the detailed steps can be found here.
Note: Block device replacement is not supported for stripe raid configuration. Please visit this link for some use cases and solutions.
- The device that has gone bad is not part of the cStor pool on the node. In this case, removing the bad disk from the node and restarting the pool manager pod will fix the problem, as sketched below.
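For example (the pod name is illustrative; the pool manager pod is recreated automatically by its controller once deleted):
$ kubectl delete pod <pool-manager-pod-name> -n openebs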
Volume Migration when the underlying cStor pool is lost
Scenarios that can result in the loss of cStor pool(s):
- If the node is lost.
- If one or more disks participating in the cStor pool are lost. This occurs when the pool configuration is set to stripe.
- If all the disks participating in any raid group are lost. This occurs when the pool configuration is set to mirror.
- If the cStor pool configuration is raidz and more than 1 disk in any raid group is lost.
- If the cStor pool configuration is raidz2 and more than 2 disks in any raid group are lost.
This situation is often encountered in Kubernetes clusters that have the autoscale feature enabled, where nodes scale down and scale up.
If the volume replica that resided on the lost pool was configured in high availability mode then the volume replica can be migrated to a new cStor pool.
NOTE: The CStorVolume associated with the volume replicas that have to be migrated should be in the Healthy state.
STEP 1:
Remove the cStorVolumeReplicas from the lost pool:
To remove the pool, the CStorVolumeConfig (CVC) needs to be updated: the poolName entry for the corresponding pool needs to be removed from replicaPoolInfo. This ensures that the admission server accepts the scale-down request.
NOTE: Ensure that the cstorvolume and target pods are in running state.
A sample CVC resource (corresponding to the volume) that has 3 pools:
Now edit the CVC and remove the desired poolName.
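The CVC carries the same name as the PV and lives in the openebs namespace, so the edit would look like:
$ kubectl edit cvc <pv-name> -n openebs
# remove the poolName entry of the lost pool from the replicaPoolInfo list and save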
In the above spec, the cstor-cspc-4tr5 CSPI entry is removed. This needs to be repeated for all the volumes that have cStor volume replicas on the lost pool. To get the list of volume replicas on the lost pool, execute:
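One way to do this, assuming CVRs carry a cstorpoolinstance.openebs.io/name label pointing at their pool (label names may differ across versions):
$ kubectl get cvr -n openebs -l cstorpoolinstance.openebs.io/name=<lost-cspi-name>
# if the label is not present in your version, list all CVRs and filter on the pool name suffix:
$ kubectl get cvr -n openebs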
STEP 2:
Remove the finalizer from cStor volume replicas
The CVRs need to be deleted from etcd; this requires the cstorvolumereplica.openebs.io/finalizer finalizer to be removed from the CVRs that were present on the lost cStor pool.
Usually, the finalizer is removed by the pool manager pod, but since in this case the pod is not in a running state, manual intervention is required (see the sketch after the sample output below).
To get the list of CVRs, execute:
Sample Output:
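A minimal sketch of the manual finalizer removal (the CVR name is illustrative; repeat for every CVR that was on the lost pool):
$ kubectl edit cvr <cvr-name> -n openebs
# delete the cstorvolumereplica.openebs.io/finalizer entry from metadata.finalizers and save
# a blunter alternative, which clears all finalizers on the CVR:
$ kubectl patch cvr <cvr-name> -n openebs --type=merge -p '{"metadata":{"finalizers":null}}'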
After this step, CStorVolume will scale down. To verify, execute:
Sample Output:
STEP 3:
Remove the pool spec from the CSPC that belongs to the lost node
Next, the corresponding CSPC needs to be edited and the pool spec that belongs to the node, which no longer exists, needs to be removed. To edit the CSPC, execute:
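For example (the CSPC name is whatever kubectl get cspc -n openebs reports):
$ kubectl edit cspc <cspc-name> -n openebs
# delete the entry under spec.pools whose nodeSelector points at the lost node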
This updates the number of desired instances.
To verify, execute:
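For example:
$ kubectl get cspc -n openebs
# the output shows the desired and provisioned instance counts for the CSPC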
Sample Output:
Since the CSPI has the pool protection finalizer, i.e. openebs.io/pool-protection, the CSPC operator is unable to delete the CSPI. For this reason, the count of provisioned instances still remains 3.
To fix this, the openebs.io/pool-protection finalizer must be removed from the CSPI that was present on the lost node.
To edit, execute:
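A sketch of the edit (the CSPI name is the one that belonged to the lost node):
$ kubectl edit cspi <cspi-name> -n openebs
# remove the openebs.io/pool-protection entry from metadata.finalizers and save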
After the finalizer is removed, the CSPI count goes to the desired number.
STEP 4:
Scale the cStorVolumeReplicas back to the original number
Scale the CStorVolumeReplicas back to the desired number on new or existing cStor pool where a volume replica of the same volume doesn't exist.
NOTE: A CStorVolume is a collection of 1 or more volume replicas and no two replicas of a CStorVolume should reside on the same CStorPoolInstance. CStorVolume is a custom resource and a logical aggregated representation of all the underlying cStor volume replicas for this particular volume.
To get the list of CSPIs, execute:
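For example:
$ kubectl get cspi -n openebs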
Sample Output:
Next, add the newly created CStorPoolInstance under CVC.Spec. In this example, we are adding cstor-cspc-bf9h.
To edit, execute:
Sample YAML:
The same needs to be repeated for all the scaled-down cStor volumes. Next, verify the status of the new CStorVolumeReplicas (CVRs) that are provisioned.
To get the list of CVRs, execute:
Sample Output:
To get the list of CSPIs, execute:
Sample Output: