Troubleshooting OpenEBS - Provisioning
#
General guidelines for troubleshooting
- Contact OpenEBS Community for support.
- Search for similar issues added in this troubleshooting section.
- Search for any reported issues on StackOverflow under the OpenEBS tag.
Application complaining ReadOnly filesystem
Unable to create persistentVolumeClaim due to certificate verification error
Application pods are not running when OpenEBS volumes are provisioned on Rancher
Application pod is stuck in ContainerCreating state after deployment
Creating cStor pool fails on CentOS when there are partitions on the disk
Application pod enters CrashLoopBackOff state
cStor pool pods are not running
OpenEBS Jiva PVC is not provisioning in 0.8.0
Recovery procedure for Read-only volume where kubelet is running in a container
Recovery procedure for Read-only volume for XFS formatted volumes
Unable to clone OpenEBS volume from snapshot
Unable to mount XFS formatted volumes into Pod
Unable to create or delete a PVC
Unable to provision OpenEBS volume on DigitalOcean
Persistent volumes indefinitely remain in pending state
#
Application complaining ReadOnly filesystem
An application sometimes complains that the underlying filesystem has become ReadOnly.
Troubleshooting
This can happen for several reasons.
- The cStor target pod is evicted because of resource constraints and is not rescheduled in time.
- A node is rebooted in an ad hoc manner (an unscheduled reboot) and Kubernetes is waiting for the kubelet to respond to determine whether the node has rebooted and the pods on that node need to be rescheduled. Kubernetes can take up to 30 minutes before deciding that the node is going to stay offline and that its pods need to be rescheduled. During this time, the iSCSI initiator at the application pod times out and marks the underlying filesystem as ReadOnly.
- The cStor target has lost quorum because of underlying node losses and has marked the LUN as ReadOnly.
Go through the kubelet logs and application pod logs to find out why the filesystem was marked ReadOnly, and take appropriate action. Maintaining volume quorum is necessary during Kubernetes node reboots.
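Before digging into the logs, it can help to confirm which mounts are currently flagged read-only. The following is a minimal sketch, assuming a Linux /proc/mounts-style file is available inside the application pod or on the node. The function name list_ro_mounts is hypothetical and not part of OpenEBS.

```shell
# Print "device mountpoint" for every entry mounted read-only.
# Reads /proc/mounts by default; pass another file in the same format
# to inspect a saved copy.
list_ro_mounts() {
  mounts_file="${1:-/proc/mounts}"
  awk '{
    # Field 4 holds the comma-separated mount options.
    n = split($4, opts, ",")
    for (i = 1; i <= n; i++)
      if (opts[i] == "ro") { print $1, $2; break }
  }' "$mounts_file"
}

# Usage (commented out so that sourcing this file has no side effects):
# list_ro_mounts | grep /dev/sd
```

Some pseudo-filesystems are legitimately read-only, so filter the output for the SCSI device backing the volume.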
#
Unable to create persistentVolumeClaim due to certificate verification error
An issue can appear when creating a PersistentVolumeClaim:
Troubleshooting
By default, the OpenEBS chart generates the TLS certificates used by the openebs-admission-controller. While this is handy, it requires the admission controller to restart on each helm upgrade command. In most cases the admission controller will already have restarted to pick up the new certificate configuration; if it has not, the user gets the error mentioned above.
Workaround
This can be fixed by restarting the admission controller:
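A minimal sketch of such a restart, under two assumptions that should be checked against the cluster first: OpenEBS is installed in the openebs namespace, and the admission controller pod name contains the string "admission". The function name restart_admission_controller is hypothetical.

```shell
# Delete every pod whose name contains "admission" so that its Deployment
# recreates it and the new pod loads the regenerated certificates.
# Assumptions: namespace defaults to "openebs"; pod name contains "admission".
restart_admission_controller() {
  ns="${1:-openebs}"
  kubectl -n "$ns" get pods -o name | grep admission | while read -r pod; do
    kubectl -n "$ns" delete "$pod"
  done
}

# Usage (commented out so that sourcing this file has no side effects):
# restart_admission_controller openebs
```

Run kubectl -n openebs get pods first to confirm the pod name before deleting anything.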
#
Application pods are not running when OpenEBS volumes are provisioned on Rancher
The issue occurs in a rancher/rke environment with bare metal hosts running CentOS. After installing OpenEBS, the OpenEBS pods are running, but the application pod, which consumes a Jiva volume, is in ContainerCreating state. The output of kubectl get pods is displayed as follows.
Troubleshooting
Make sure the following prerequisites are met.
- Verify that the iSCSI initiator is installed on the nodes and its services are running.
- Add extra_binds under the kubelet service in the cluster YAML.
More details are mentioned here.
#
Application pod is stuck in ContainerCreating state after deployment
Troubleshooting
- Obtain the output of kubectl describe pod <application_pod> and check the events.
- If the error message "executable not found in $PATH" is found, check whether the iSCSI initiator utils are installed on the node or in the kubelet container (RancherOS, CoreOS). If not, install them and retry the deployment.
- If the warning message "FailedMount: Unable to mount volumes for pod <>: timeout expired waiting for volumes to attach/mount" persists, use the following procedure.
  - Check whether the Persistent Volume Claim and Persistent Volume (PVC/PV) were created successfully and the OpenEBS controller and replica pods are running. These can be verified using the kubectl get pvc,pv and kubectl get pods commands.
  - If the OpenEBS volume pods are not created and the PVC is in pending state, check whether the storageclass referenced by the application PVC is available/installed. This can be confirmed using the kubectl get sc command. If this storageclass is missing, or was created without the appropriate attributes, recreate it and re-deploy the application. Note: Ensure that the older PVC objects are deleted before re-deployment.
  - If the PV is created (in bound state) but the replicas are not running or are in pending state, run kubectl describe pod <replica_pod> and check the events. If the events indicate FailedScheduling due to Insufficient cpu, NodeUnschedulable, or MatchInterPodAffinity and PodToleratesNodeTaints, check the following:
    - the replica count is less than or equal to the number of available schedulable nodes
    - there are enough resources on the nodes to run the replica pods
    - whether nodes are tainted and, if so, whether the taints are tolerated by the OpenEBS replica pods
    Ensure that the above conditions are met and the replica rollout is successful. This lets the application enter the running state.
  - If the PV is created and the OpenEBS pods are running, use the iscsiadm -m session command on the node where the pod is scheduled to identify whether the OpenEBS iSCSI volume has been attached (logged into). If not, verify network connectivity between the nodes.
  - If the session is present, identify the SCSI device associated with the session using iscsiadm -m session -P 3. Once it is confirmed that the iSCSI device is available (check the output of fdisk -l for the mapped SCSI device), check the kubelet and system logs, including iscsid and kernel (syslog), for information on the state of this iSCSI device. If inconsistencies are observed, run a filesystem check on the device with fsck -y /dev/sd<>. This should allow the volume to be mounted on the node.
- In OpenShift deployments, you may face this issue with the OpenEBS replica pods continuously restarting, that is, they are in CrashLoopBackOff state. This is due to the default "restricted" security context settings. Edit the following settings using oc edit scc restricted to get the application pod running.
  - allowHostDirVolumePlugin: true
  - runAsUser: RunAsAny
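For the iSCSI session checks in the procedure above, the verbose session listing can be reduced to target and device pairs. This is a minimal sketch, assuming the output of iscsiadm -m session -P 3 contains "Target: <iqn>" and "Attached scsi disk <name>" lines as printed by open-iscsi; other versions may format these lines differently. The function name iscsi_attached_disks is hypothetical.

```shell
# Read `iscsiadm -m session -P 3` output on stdin and print
# "<target iqn> <device path>" for each attached SCSI disk.
# Assumption: lines look like "Target: <iqn> ..." and
# "Attached scsi disk <name> State: ...".
iscsi_attached_disks() {
  awk '
    $1 == "Target:" { target = $2 }
    $1 == "Attached" && $2 == "scsi" && $3 == "disk" { print target, "/dev/" $4 }
  '
}

# Usage (commented out so that sourcing this file has no side effects):
# iscsiadm -m session -P 3 | iscsi_attached_disks
```

The device path printed here is the one to look for in the fdisk -l output and to pass to fsck if a filesystem check is needed.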
#
Creating cStor pool fails on CentOS when there are partitions on the disk
Creating a cStor pool fails with the following error message:
In this example, sdb and sdc are used for cStor pool creation.
Troubleshooting
Clear the partitions on the partitioned disk.
Run the following command on the host machine to check any LVM handler on the device.
Output of the above command will be similar to the following.
If the output is similar to the above, you must remove the handler on the device.
#
Application pod enters CrashLoopBackOff state
This issue is due to failed application operations in the container. Typically it is caused by failed writes on the mounted PV. To confirm this, check the status of the PV mount inside the application pod.
Troubleshooting
- Run kubectl exec -it <app> -- bash (or any available shell) on the application pod and attempt writes on the volume mount. The volume mount can be obtained either from the application specification ("volumeMounts" in the container spec) or by running df -h in the container shell (the OpenEBS iSCSI device will be mapped to the volume mount).
- The writes can be attempted using a simple command like echo abc > t.out on the mount. If the writes fail with Read-only file system errors, the iSCSI connections to the OpenEBS volumes have been lost. You can confirm this by checking the node's system logs, including iscsid, kernel (syslog), and the kubelet logs (journalctl -xe, kubelet.log).
- iSCSI connections usually fail due to the following.
  - Flaky networks (can be confirmed by ping RTTs, packet loss, etc.) or failed networks between:
    - OpenEBS PV controller and replica pods
    - Application and controller pods
  - Node failures
  - OpenEBS volume replica crashes or restarts due to software bugs
- In all the above cases, loss of the device for a period greater than the node iSCSI initiator timeout causes the volumes to be re-mounted as RO.
- In certain cases, the node/replica loss can lead to the replica quorum not being met (i.e., less than 51% of replicas available) for an extended period of time, causing the OpenEBS volume to be presented as a RO device.
Workaround/Recovery
The procedure to ensure application recovery in the above cases is as follows:
1. Resolve the system issues that caused the iSCSI disruption/RO device condition. Depending on the cause, the resolution steps may include recovering the failed nodes, ensuring replicas are brought back on the same nodes as earlier, fixing the network problems, and so on.
2. Ensure that the OpenEBS volume controller and replica pods are running successfully with all replicas in RW mode. Use the command curl -X GET http://<ctrl ip>:9501/v1/replicas | grep createTypes to confirm.
3. If any one of the replicas is still in RO mode, wait for the synchronization to complete. If all the replicas are in RO mode (this may occur when all replicas re-register into the controller within short intervals), you must restart the OpenEBS volume controller using the kubectl delete pod <pvc-ctrl> command. Since it is a Kubernetes deployment, the controller pod is recreated automatically. Once done, verify that all replicas transition into RW mode.
4. Un-mount the stale iSCSI device mounts on the application node. Typically, these devices are mounted in the /var/lib/kubelet/plugins/kubernetes.io/iscsi/iface-default/<target-portal:iqn>-lun-0 path.
5. Identify whether the iSCSI session is re-established after failure. This can be verified using iscsiadm -m session, with the device mapping established using iscsiadm -m session -P 3 and fdisk -l. Note: Sometimes, stale device nodes (SCSI device names) are present on the Kubernetes node. Unless the logs confirm that a re-login has occurred after the system issues were resolved, it is recommended to perform the following step after doing a purge/logout of the existing session using iscsiadm -m node -T <iqn> -u.
6. If the device is not logged in again, ensure that the network issues/failed nodes/failed replicas are resolved, the device is discovered, and the session is re-established. This can be achieved using the commands iscsiadm -m discovery -t st -p <ctrl svc IP>:3260 and iscsiadm -m node -T <iqn> -l respectively.
7. Identify the new SCSI device name corresponding to the iSCSI session (the device name may or may not be the same as before).
8. Re-mount the new disk into the mountpoint mentioned earlier using the mount -o rw,relatime,data=ordered /dev/sd<> <mountpoint> command. If the re-mount fails due to inconsistencies on the device (unclean filesystem), perform a filesystem check with fsck -y /dev/sd<>.
9. Ensure that the application uses the newly mounted disk by forcing it to restart on the same node. Use the command docker stop <id> on the application container on the node. Kubernetes will automatically restart the pod to reach the "desired" state. While this step may not be necessary most times (as the application is already undergoing periodic restarts as part of the CrashLoop cycle), it can be performed if the application pod's next restart is scheduled with an exponential back-off delay.
Notes:
- The above procedure works for applications that are either standalone pods or deployments/statefulsets. In the latter case, the application pod can be restarted (i.e., deleted) after the iSCSI logout step, as the deployment/statefulset controller will take care of rescheduling the application on the same or a different node with the volume.
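The logout, discovery, and login commands from the recovery procedure can be chained as follows. This is a minimal sketch that uses only the iscsiadm invocations quoted above. The function name iscsi_relogin and its argument order are hypothetical, and it assumes root access on the application node.

```shell
# Log out of a stale iSCSI session, rediscover the target, and log back in.
# $1 = target portal (for example <ctrl svc IP>:3260), $2 = target IQN.
iscsi_relogin() {
  portal="$1"
  iqn="$2"
  # A failed logout is tolerated: the session may already be gone.
  iscsiadm -m node -T "$iqn" -u || true
  # Stop if the target cannot be discovered; login would fail anyway.
  iscsiadm -m discovery -t st -p "$portal" || return 1
  iscsiadm -m node -T "$iqn" -l
}

# Usage (commented out so that sourcing this file has no side effects):
# iscsi_relogin "<ctrl svc IP>:3260" "<iqn>"
```

After a successful login, check iscsiadm -m session -P 3 again, since the SCSI device name may differ from the one used before.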
#
cStor pool pods are not running
The cStor disk pods do not come up after deploying with the YAML. The pool pod logs report that /dev/xvdg is in use and contains an xfs filesystem.
Workaround:
cStor can consume disks that are attached to the nodes (visible to the OS as SCSI devices), and these disks do not need to be formatted. This means the disks should not have any filesystem and should be unmounted on the node. It is also recommended to wipe the disks if you are reusing a used disk for cStor pool creation. The following steps clear the filesystem from the disk.
The following is an example output of lsblk on the node.
The above output shows that /dev/xvdf is mounted on /home/openebs-ebs. The following commands unmount the disk first and then remove the filesystem.
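A minimal sketch of that sequence, assuming the disk is /dev/xvdf as in the example and that umount and wipefs (from util-linux) are available on the node. The function name prepare_disk_for_cstor is hypothetical, and the operation destroys all data on the device.

```shell
# Unmount a disk and erase its filesystem signatures so cStor can claim it.
# WARNING: this destroys the data on the device. Run as root.
# $1 = device path, for example /dev/xvdf.
prepare_disk_for_cstor() {
  dev="$1"
  [ -n "$dev" ] || { echo "usage: prepare_disk_for_cstor <device>" >&2; return 2; }
  # Stop if the unmount fails (for example, the device is busy).
  umount "$dev" || return 1
  wipefs -a "$dev"
}

# Usage (commented out so that sourcing this file has no side effects):
# prepare_disk_for_cstor /dev/xvdf
```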
After performing the above commands, verify the disk status using the lsblk command:
Example output:
#
OpenEBS Jiva PVC is not provisioning in 0.8.0
Even though all OpenEBS pods are in running state, Jiva volumes cannot be provisioned if OpenEBS was installed through helm.
Troubleshooting:
Check the latest entries in the OpenEBS provisioner logs. If no log entries for the particular PVC creation appear in the OpenEBS provisioner pod, restart the OpenEBS provisioner pod. From version 0.8.1, a liveness probe periodically checks the OpenEBS provisioner pod status and ensures its availability for OpenEBS PVC creation.
#
Recovery procedure for Read-only volume where kubelet is running in a container
In environments where the kubelet runs in a container, perform the following steps as part of the recovery procedure for a read-only volume issue.
- Confirm that the OpenEBS target is not presented as a read-only device by the OpenEBS controller and that all replicas are in Read/Write mode.
- Un-mount the iSCSI volume from the node on which the application pod is scheduled.
- Perform the following iSCSI operations from inside the kubelet container.
  - Logout
  - Rediscover
  - Login
- Re-mount the iSCSI device (it may appear with a new SCSI device name) on the node.
- Verify that the application pod is able to start using and writing into the newly mounted device.
- Once the application is back in "Running" state after recovery with the steps above, if existing/older data is not visible (i.e., it comes up as a fresh instance), it is possible that the application pod is using the docker container filesystem instead of the actual PV (observed sometimes due to the reconciliation attempts by Kubernetes to get the pod to a desired state in the absence of the mounted iSCSI disk). This can be checked by running a df -h or mount command inside the application pod. These commands should show the SCSI device /dev/sd* mounted on the specified mount point. If not, the application pod can be forced to use the PV by restarting it (deployment/statefulset) or performing a docker stop of the application container on the node (pod).
#
Recovery procedure for Read-only volume for XFS formatted volumes
In case of XFS formatted volumes, perform the following steps once the iSCSI target is available in RW state and logged in:
- Un-mount the iSCSI volume from the node on which the application pod is scheduled. This may cause the application to enter running state by using the local mount point.
- Mount the volume to a new (temporary) directory to replay the metadata changes in the log.
- Unmount the volume again.
- Run xfs_repair /dev/<device>. This fixes any filesystem-related errors on the device.
- Delete the application pod to facilitate a fresh mount of the volume. At this point, the app pod may be stuck in the terminating or containerCreating state. This can be resolved by deleting the volume folder (with app content) in the local directory.
#
Unable to clone OpenEBS volume from snapshot
A snapshot of a PVC was taken successfully, but the volume cannot be cloned from the snapshot.
Troubleshooting:
Logs from the snapshot-controller pods are as follows.
Resolution:
This can happen due to stale entries of snapshot and snapshot data. Deleting those entries resolves the issue.
#
Unable to mount XFS formatted volumes into Pod
A PVC was created with FSType set to xfs. The OpenEBS PV is created successfully and the iSCSI initiator is verified to be available on the application node, but the application pod is unable to mount the volume.
Troubleshooting:
Describing the application pod shows the following error:
The kubelet had the following errors during the mount process:
And dmesg showed errors like:
Resolution:
This can happen due to an xfs_repair failure on the application node. Make sure that the application node has the xfsprogs package installed.
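A quick way to check for the tool on a node is sketched below. It only tests whether xfs_repair is on the PATH; the function name check_tool is hypothetical.

```shell
# Report whether a tool is on PATH; defaults to xfs_repair,
# which is shipped in the xfsprogs package.
check_tool() {
  tool="${1:-xfs_repair}"
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing"
    return 1
  fi
}

# Usage (commented out so that sourcing this file has no side effects):
# check_tool xfs_repair
```

If the tool is reported missing, install xfsprogs with the node's package manager and retry the pod deployment.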
#
Unable to create or delete a PVC
User is unable to create a new PVC or delete an existing PVC. During either operation, the following error appears on the PVC.
Workaround:
When a user creates or deletes a PVC, validation is triggered and the request is intercepted by the admission webhook controller after authentication/authorization by kube-apiserver. By default, the admission webhook service is configured on port 443. The error above suggests that either port 443 is not allowed in the cluster or the admission webhook service has to be allowed in the k8s cluster proxy settings.
#
Unable to provision OpenEBS volume on DigitalOcean
User is unable to provision a cStor or Jiva volume on DigitalOcean and encounters an error from the iSCSI PVs:
Resolution:
To avoid this issue, the Kubelet Service needs to be updated to mount the required packages to establish iSCSI connection to the target. Kubelet Service on all the nodes in the cluster should be updated.
Note:
The exact mounts may vary depending on the OS. The following steps have been verified on:
- Digital Ocean Kubernetes Release: 1.15.3-do.2
- Nodes running OS Debian Release: 9.11
Add the below lines (volume mounts) to the file on each of the nodes:
Restart the kubelet service using the following commands:
To know more about provisioning cStor volume on DigitalOcean click here.
#
Persistent volumes indefinitely remain in pending state
If users have a strict firewall setup on their Kubernetes nodes, the provisioning of a PV from a storageclass backed by a cStor storage pool may fail. The pool and the storage class are created without any issue, but the PVs may stay in pending state indefinitely.
The output from the openebs-provisioner might look as follows:
Workaround:
This issue has so far been observed only when the underlying node uses a network bridge and the setting net.bridge.bridge-nf-call-iptables=1 is present in /etc/sysctl.conf. This setting is required in some Kubernetes installations, such as the Rancher Kubernetes Engine (RKE).
To avoid this issue, open port 5656/tcp on the nodes that run the OpenEBS API pod. Alternatively, removing the network bridge might work.
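To see whether a node matches the condition described above, the sysctl configuration can be checked as in this minimal sketch. It inspects only the file passed to it (by default /etc/sysctl.conf), not the running kernel value or drop-in files. The function name bridge_nf_enabled is hypothetical.

```shell
# Exit 0 if a sysctl.conf-style file sets net.bridge.bridge-nf-call-iptables to 1.
# $1 = file to inspect (defaults to /etc/sysctl.conf).
bridge_nf_enabled() {
  conf="${1:-/etc/sysctl.conf}"
  grep -Eq '^[[:space:]]*net\.bridge\.bridge-nf-call-iptables[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$conf"
}

# Usage (commented out so that sourcing this file has no side effects):
# bridge_nf_enabled && echo "bridge netfilter is enabled: check that 5656/tcp is open"
```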