- Contact OpenEBS Community for support.
- Search for similar issues added in this troubleshooting section.
- Search for any reported issues on StackOverflow under the OpenEBS tag.
One of the cStorVolumeReplica (CVR) resources will have its status as Invalid after the corresponding pool pod gets recreated
When a user deletes a cStor pool pod, there is a high chance that the CVRs related to that pool go into the Invalid state.
Following is a sample output of kubectl get cvr -n openebs
Sample logs of cstor-pool-mgmt when the issue happens:
From the above highlighted logs, we can confirm that cstor-pool-mgmt in the new pod is communicating with cstor-pool in the old pod: the first highlighted log says cstor pool found, then the next highlighted one says pool is really
When a cStor pool pod is deleted, there is a high chance that two pool pods of the same pool are present at the same time: the old pool pod is in the Terminating state (not all of its containers have terminated yet) and the new pool pod is in the Running state (some, but possibly not all, of its containers are running). In this scenario, the cstor-pool-mgmt container in the new pool pod communicates with cstor-pool in the old pool pod. This can cause the CVR resource to be set to Invalid.
Note: This issue has been observed in all OpenEBS versions up to 1.2.
Change the phase of the cStorVolumeReplica (CVR) from Invalid to Offline. After a few seconds, the CVR moves to the Healthy or Degraded state, depending on rebuilding progress.
Application mount point running on cStor volume went into read-only state.
If the cStor volume is Offline, or the corresponding target pod is unavailable for more than 120 seconds (the iSCSI timeout), then the PV is mounted as a read-only filesystem. For understanding the different states of a cStor volume, more details can be found here.
Check the status of corresponding cStor volume using the following command:
If the cStor volume is in the Degraded state, restarting the application pod alone brings the cStor volume back to RW mode. If the cStor volume is Offline, reach out to the OpenEBS Community for assistance.
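The decision described above can be sketched as a small shell helper. This is a minimal sketch rather than an official OpenEBS tool: both function names are hypothetical, the openebs namespace is taken from the commands elsewhere on this page, and reading the phase from .status.phase of the CStorVolume resource is an assumption about the resource layout.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of OpenEBS.

# Map a cStor volume phase to the action described in the text above.
recommend_action() {
  case "$1" in
    Degraded) echo "restart the application pod to return the volume to RW mode" ;;
    Offline)  echo "reach out to the OpenEBS Community for assistance" ;;
    *)        echo "no action described for phase: $1" ;;
  esac
}

# Assumption: the phase is exposed at .status.phase of the CStorVolume.
# $1 = CStorVolume name, $2 = namespace (defaults to openebs).
volume_phase() {
  kubectl get cstorvolume "$1" -n "${2:-openebs}" -o jsonpath='{.status.phase}'
}
```

A possible invocation is recommend_action "$(volume_phase <cstorvolume_name>)", where <cstorvolume_name> is a placeholder for the volume being checked.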
The cStor pools and volumes are offline, and the pool manager pods are stuck in a pending state, as shown below:
One scenario that can lead to this situation is when the nodes have been scaled down and then scaled up. The nodes then come up with a different hostName and node name, i.e. they are new nodes and not the same as the nodes that existed earlier. As a result, the disks that were attached to the older nodes now get attached to the newer nodes.
Troubleshooting
To bring the cStor pool back to the online state, carry out the steps below.
Update validatingwebhookconfiguration resource's failurePolicy: Update the validatingwebhookconfiguration resource's failure policy to Ignore. It would previously have been set to Fail. This informs the kube-apiserver to ignore the error in case the cStor admission server is not reachable. To edit, execute:
$ kubectl edit validatingwebhookconfiguration openebs-cstor-validation-webhook
Sample Output with updated failurePolicy:
    kind: ValidatingWebhookConfiguration
    metadata:
      name: openebs-cstor-validation-webhook
      ...
      ...
    webhooks:
    - admissionReviewVersions:
      - v1beta1
      failurePolicy: Ignore
      name: admission-webhook.cstor.openebs.io
    ...
    ...
Scale down the admission server: The OpenEBS admission server needs to be scaled down, as this skips the validations performed by the cStor admission server when the CSPC spec is updated with the new node details.
$ kubectl scale deploy openebs-cstor-admission-server -n openebs --replicas=0
Sample Output:
    deployment.extensions/openebs-cstor-admission-server scaled
Update the CSPC spec nodeSelector: The CStorPoolCluster needs to be updated with the new nodeSelector values, so that the updated CSPC points to the new nodes instead of the old ones. Update kubernetes.io/hostname with the new values.
Sample Output:
    apiVersion: cstor.openebs.io/v1
    kind: CStorPoolCluster
    metadata:
      name: cstor-cspc
      namespace: openebs
    spec:
      pools:
        - nodeSelector:
            kubernetes.io/hostname: "ip-192-168-25-235"
          dataRaidGroups:
            - blockDevices:
                - blockDeviceName: "blockdevice-798dbaf214f355ada15d097d87da248c"
          poolConfig:
            dataRaidGroupType: "stripe"
        - nodeSelector:
            kubernetes.io/hostname: "ip-192-168-33-15"
          dataRaidGroups:
            - blockDevices:
                - blockDeviceName: "blockdevice-4505d9d5f045b05995a5654b5493f8e0"
          poolConfig:
            dataRaidGroupType: "stripe"
        - nodeSelector:
            kubernetes.io/hostname: "ip-192-168-75-156"
          dataRaidGroups:
            - blockDevices:
                - blockDeviceName: "blockdevice-c783e51a80bc51065402e5473c52d185"
          poolConfig:
            dataRaidGroupType: "stripe"
To apply the above configuration, execute:
$ kubectl apply -f cspc.yaml
Update nodeSelectors, labels and NodeName: Next, the CSPI needs to be updated with the correct node details. Get the details of the node to which the previous blockdevice is now attached, then update the hostName, the nodeSelector values and the kubernetes.io/hostname value in the labels of the CSPI with the new details. To update, execute:
$ kubectl edit cspi <cspi_name> -n openebs
NOTE: The same process needs to be repeated for all other CSPIs that are in the pending state and belong to the updated CSPC.
Verification: On successful implementation of the above steps, the updated CSPI generates a Pool Imported event, which confirms that the pool has been imported successfully.
$ kubectl describe cspi cstor-cspc-xs4b -n openebs
Sample Output:
    ...
    ...
    Events:
      Type    Reason         Age    From               Message
      ----    ------         ----   ----               -------
      Normal  Pool Imported  2m48s  CStorPoolInstance  Pool Import successful: cstor-07c4bfd1-aa1a-4346-8c38-f81d33070ab7
Scale up the cStor admission server and update validatingwebhookconfiguration: This brings the cStor admission server back to the running state. The admission server is also required to validate future modifications made to the CSPC API.
$ kubectl scale deploy openebs-cstor-admission-server -n openebs --replicas=1
Sample Output:
    deployment.extensions/openebs-cstor-admission-server scaled
Now, update the failurePolicy back to Fail under validatingwebhookconfiguration. To edit, execute:
$ kubectl edit validatingwebhookconfiguration openebs-cstor-validation-webhook
Sample Output:
    validatingwebhookconfiguration.admissionregistration.k8s.io/openebs-cstor-validation-webhook edited
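For a scripted recovery, the two failurePolicy changes and the two scale operations described above could be wrapped in small helpers. The following is a minimal sketch under stated assumptions: the function names are hypothetical, kubectl patch is used in place of the interactive kubectl edit shown in the steps, and the webhook to change is assumed to be the first entry (index 0) of the webhooks list, as in the sample output.

```shell
#!/bin/sh
# Resource names are taken from the commands shown in the steps above.
WEBHOOK=openebs-cstor-validation-webhook
ADMISSION_DEPLOY=openebs-cstor-admission-server

# Set failurePolicy to Ignore or Fail.
# Assumption: the cStor webhook is entry 0 of the webhooks list.
set_failure_policy() {
  case "$1" in
    Ignore|Fail) ;;
    *) echo "failurePolicy must be Ignore or Fail" >&2; return 1 ;;
  esac
  kubectl patch validatingwebhookconfiguration "$WEBHOOK" --type=json \
    -p "[{\"op\":\"replace\",\"path\":\"/webhooks/0/failurePolicy\",\"value\":\"$1\"}]"
}

# Scale the admission server deployment to the given replica count.
scale_admission_server() {
  kubectl scale deploy "$ADMISSION_DEPLOY" -n openebs --replicas="$1"
}
```

Before editing the CSPC and CSPI this would be called as set_failure_policy Ignore && scale_admission_server 0, and after the pool import has been verified as scale_admission_server 1 && set_failure_policy Fail.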
cStor scans all the devices on the node when it tries to import the pool after a pool manager pod restart. An import is always attempted before a pool is created. On pool creation, all of the devices are scanned and, as there are no existing pools, a new pool is created. When the pool is created, the participating devices are cached for faster import of the pool (in case of a pool manager pod restart). If the import uses the cache, this issue is not hit, but there is a chance of an import without the cache (when the pool is being created for the first time).
When a pool import happens without the cache file and any of the devices (even devices that are not part of the cStor pool) is bad and not responding, the command issued by cStor keeps waiting and gets stuck. As a result, the pool manager pod cannot issue any further commands to reconcile the state of the cStor pools, or even perform IO for the volumes placed on that particular pool.
Troubleshooting
This might be encountered because of one of the following situations:
- The device that has gone bad is actually part of the cStor pool on the node. In such cases, block device replacement needs to be done; the detailed steps can be found here.
Note: Block device replacement is not supported for the stripe raid configuration. Please visit this link for some use cases and solutions.
- The device that has gone bad is not part of the cStor pool on the node. In this case, removing the bad disk from the node and restarting the pool manager pod will fix the problem.
- If the node is lost.
- If one or more disks participating in the cStor pool are lost. This occurs when the pool configuration is set to stripe.
- If all the disks participating in any raid group are lost. This occurs when the pool configuration is set to mirror.
- If the cStor pool configuration is raidz and more than 1 disk in any raid group is lost.
- If the cStor pool configuration is raidz2 and more than 2 disks in any raid group are lost.
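The disk-loss conditions in the list above can be expressed as a small check for a single raid group. This is a minimal sketch: the function name and its arguments (raid type, number of disks in the raid group, number of lost disks in that group) are hypothetical and only restate the rules listed above.

```shell
#!/bin/sh
# pool_is_lost RAID_TYPE GROUP_SIZE LOST_DISKS
# Prints "lost" or "ok" for one raid group, following the list above.
pool_is_lost() {
  raid=$1; size=$2; lost=$3
  case "$raid" in
    stripe) threshold=1 ;;        # one or more disks lost
    mirror) threshold=$size ;;    # all disks in the raid group lost
    raidz)  threshold=2 ;;        # more than 1 disk lost
    raidz2) threshold=3 ;;        # more than 2 disks lost
    *) echo "unknown raid type: $raid" >&2; return 1 ;;
  esac
  if [ "$lost" -ge "$threshold" ]; then echo lost; else echo ok; fi
}
```

For example, pool_is_lost mirror 2 1 reports ok because one disk of the mirror survives, while pool_is_lost stripe 3 1 reports lost.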
This situation is often encountered in Kubernetes clusters that have the autoscale feature enabled, where nodes scale down and scale up.
If the volume replica that resided on the lost pool was configured in high availability mode, then the volume replica can be migrated to a new cStor pool.
NOTE: The CStorVolume associated with the volume replicas that have to be migrated should be in the Healthy state.
Remove the cStorVolumeReplicas from the lost pool:
To remove the pool, the CStorVolumeConfig needs to be updated. The poolName for the corresponding pool needs to be removed from replicaPoolInfo. This ensures that the admission server accepts the scale down request.
NOTE: Ensure that the cstorvolume and target pods are in the running state.
A sample CVC resource (corresponding to the volume) that has 3 pools.
Now edit the CVC and remove the desired poolName.
From the above spec, the cstor-cspc-4tr5 CSPI entry is removed. This needs to be repeated for all the volumes that have cStor volume replicas on the lost pool. To get the list of volume replicas in the lost pool, execute:
Remove the finalizer from cStor volume replicas
The CVRs need to be deleted from etcd. This requires the cstorvolumereplica.openebs.io/finalizer to be removed from the CVRs that were present on the lost cStor pool.
Usually, the finalizer is removed by the pool-manager pod, but in this case the pod is not in the running state, so manual intervention is required.
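The manual removal can be done interactively with kubectl edit, or scripted along the following lines. This is a minimal sketch with a hypothetical function name. It clears the whole metadata.finalizers list with a merge patch, which assumes that cstorvolumereplica.openebs.io/finalizer is the only finalizer on each CVR; inspect the CVR with kubectl get cvr <cvr_name> -n openebs -o yaml before relying on that.

```shell
#!/bin/sh
# Remove all finalizers from each CVR named on the command line.
# Assumption: the only finalizer present on these CVRs is
# cstorvolumereplica.openebs.io/finalizer.
remove_cvr_finalizers() {
  for cvr in "$@"; do
    kubectl patch cvr "$cvr" -n openebs --type=merge \
      -p '{"metadata":{"finalizers":null}}' || return 1
  done
}
```

The function takes the names of the CVRs that were present on the lost pool, for example remove_cvr_finalizers <cvr_name_1> <cvr_name_2>, and stops at the first patch that fails.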
To get the list of CVRs, execute:
After this step, CStorVolume will scale down. To verify, execute:
Remove the pool spec belonging to the lost node from the CSPC
Next, edit the corresponding CSPC and remove the pool spec that belongs to the nodes that no longer exist. To edit the CSPC, execute:
This updates the number of desired instances.
To verify, execute:
Since the CSPI has the pool protection finalizer, i.e. openebs.io/pool-protection, the CSPC operator is unable to delete the CSPI. For this reason, the count of provisioned instances still remains 3.
To fix this, the openebs.io/pool-protection finalizer must be removed from the CSPI that was present on the lost node.
To edit, execute:
After the finalizer is removed, the CSPI count goes to the desired number.
Scale the cStorVolumeReplicas back to the original number
Scale the CStorVolumeReplicas back to the desired number on new or existing cStor pool where a volume replica of the same volume doesn't exist.
NOTE: A CStorVolume is a collection of 1 or more volume replicas and no two replicas of a CStorVolume should reside on the same CStorPoolInstance. CStorVolume is a custom resource and a logical aggregated representation of all the underlying cStor volume replicas for this particular volume.
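The placement rule in the note above (no two replicas of a CStorVolume on the same CStorPoolInstance) can be checked before adding a pool to a volume. This is a minimal sketch: the function name and the input format, one "VOLUME CSPI" pair per line on standard input, are hypothetical, and how those pairs are collected from the cluster is left open.

```shell
#!/bin/sh
# Reads "VOLUME CSPI" pairs (one per line) on stdin and prints each pair
# that occurs more than once, i.e. two replicas of one volume on one CSPI.
find_colocated_replicas() {
  awk 'NF >= 2 { if (++seen[$1 " " $2] == 2) print $1, $2 }'
}
```

An empty result means the planned placement respects the rule; each printed line names a volume and the pool that would hold more than one of its replicas.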
To get the list of cspi execute:
Next, add the newly created CStorPoolInstance under CVC.Spec
In this example we are adding,
To edit, execute:
The same needs to be repeated for all the scaled down cStor volumes. Next, verify the status of the new CStorVolumeReplicas (CVRs) that are provisioned.
To get the list of CVR, execute:
To get the list of cspi, execute: