I/O Path Description
Glossary of Terms
Replicated PV Mayastor Instance
An instance of the mayastor binary running inside an IO engine container, which is encapsulated by an IO engine pod.
Nexus
Replicated PV Mayastor terminology. A data structure instantiated within a Replicated PV Mayastor instance which performs I/O operations for a single Replicated PV Mayastor volume. Each nexus acts as an NVMe controller for the volume it exports. Logically, it is composed chiefly of a 'static' function table which determines its base I/O handling behaviour (held in common with all other nexuses in the cluster), combined with configuration information specific to the Replicated PV Mayastor volume it exports, such as the identity of its children. A nexus receives I/O requests for its exported volume on its host container's target, routes them to the underlying persistence layer via any applied transformations ("data services"), and returns responses to the calling initiator back along the same I/O path.
Pool/Storage Pool/Replicated PV Mayastor Storage Pool (MSP)
Replicated PV Mayastor's volume management abstraction. Block devices contribute storage capacity to a Replicated PV Mayastor deployment by being included in configured storage pools. Each Replicated PV Mayastor node can host zero or more pools, and each pool contains a single base block device as its member. The total capacity of a pool is therefore determined by the size of that device. Pools can only be hosted on nodes running an instance of an IO engine pod.
Multiple volumes can share the capacity of one pool, but thin provisioning is not supported. A volume cannot span multiple pools, so its size is limited by the free capacity available in a single pool.
Internally, a storage pool is an implementation of an SPDK Logical Volume Store.
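The capacity rules above can be illustrated with a minimal sketch. This is not Replicated PV Mayastor code: the `Pool` class, the `place_volume` function, and the first-fit selection are hypothetical and exist only to show that a volume must fit entirely within one pool, and that its full size is reserved up front because thin provisioning is not available.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

GiB = 1024 ** 3


@dataclass
class Pool:
    """Hypothetical model of a storage pool backed by a single block device."""
    name: str
    capacity: int       # bytes; equal to the size of the pool's one member device
    allocated: int = 0  # bytes already reserved by existing volumes

    @property
    def free(self) -> int:
        return self.capacity - self.allocated


def place_volume(pools: list[Pool], size: int) -> Optional[Pool]:
    """Return the first pool able to hold the whole volume, or None.

    A volume is never split across pools, so the combined free space
    of several pools is irrelevant. First-fit is an assumption made
    for illustration only.
    """
    for pool in pools:
        if pool.free >= size:
            pool.allocated += size  # no thin provisioning: reserve the full size
            return pool
    return None


pools = [
    Pool("pool-a", capacity=10 * GiB, allocated=4 * GiB),  # 6 GiB free
    Pool("pool-b", capacity=10 * GiB, allocated=5 * GiB),  # 5 GiB free
]

first = place_volume(pools, 5 * GiB)   # fits in pool-a, leaving it 1 GiB free
second = place_volume(pools, 6 * GiB)  # 6 GiB free in total, but in no single pool
print(first.name if first else None, second)
```

In this sketch the second request fails even though the two pools together hold enough free space, which is the practical consequence of volumes not spanning pools.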
Bdev
A code abstraction of a block-level device to which I/O requests may be sent, presenting a consistent device-independent interface. Replicated PV Mayastor's bdev abstraction layer is based upon that of Intel's Storage Performance Development Kit (SPDK).
- base bdev - Handles I/O directly, e.g. a representation of a physical SSD device
- logical volume - A bdev representing an SPDK Logical Volume ("lvol bdev")
Replica
Replicated PV Mayastor terminology. An lvol bdev (a "logical volume", created within a pool and consuming pool capacity) which is exported by a Replicated PV Mayastor instance for consumption as a "child" by a nexus, which may be local or remote to the exporting instance.
Child
Replicated PV Mayastor terminology. An NVMe controller created and owned by a given nexus, which handles I/O downstream from the nexus' target by routing it to the replica associated with that child.
A nexus has a minimum of one child, which must be local, that is, exported as a replica from a pool hosted by the same Replicated PV Mayastor instance that hosts the nexus itself. If the Replicated PV Mayastor volume being exported by the nexus is derived from a StorageClass with a replication factor greater than 1 (i.e. synchronous N-way mirroring is enabled), then the nexus will have additional children, up to the desired number of data copies.
Export
To allow a client initiator to discover a volume and submit I/O to it over a Replicated PV Mayastor target.
Basics of I/O Flow
Non-Replicated Volume I/O Path
For volumes based on a StorageClass defined as having a replication factor of one, a single data copy is maintained by Replicated PV Mayastor. The I/O path is largely (entirely, if using malloc:/// pool devices) constrained to within the bounds of a single IO engine instance, which hosts both the volume's nexus and the storage pool in use as its persistence layer.
Each IO engine instance presents a user-space storage target over NVMe-oF TCP. Worker nodes mounting a Replicated PV Mayastor volume for a scheduled application pod to consume are directed by Replicated PV Mayastor's CSI driver implementation to connect to the appropriate transport target for that volume and perform discovery, after which they are able to send I/O for that volume to the target. Regardless of how many volumes, and by extension how many nexuses, an IO engine instance hosts, all share the same target instances.
Application I/O received on a target for a volume is passed to the virtual bdev at the front-end of the nexus hosting that volume. In the case of a non-replicated volume, the nexus is composed of a single child, to which the I/O is necessarily routed. As a virtual bdev itself, the child handles the I/O by routing it to the next device, in this case the replica that was created for this child. In non-replicated scenarios, both the volume's nexus and the pool which hosts its replica are co-located within the same IO engine instance, hence the I/O is passed from child to replica using SPDK bdev routines rather than a network-level transport. At the pool layer, a blobstore maps the lvol bdev exported as the replica concerned to the base bdev on which the pool was constructed. From there, other than for malloc:/// devices, the I/O passes to the host kernel via either aio or io_uring, and then via the appropriate storage driver to the physical disk device.
The disk device's response to the I/O request is returned along the same path to the caller's initiator.
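The hops described above can be pictured as a chain of layers, each forwarding a request downstream and passing the response back up. The sketch below is purely illustrative: the `Layer` class, `build_path` helper, and layer labels are hypothetical names chosen to mirror the prose, not SPDK or Replicated PV Mayastor APIs, and the path shown assumes a pool device other than malloc:///.

```python
class Layer:
    """One hop in the I/O path: forwards downstream, returns the reply upstream."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream

    def submit(self, io, trace):
        trace.append(self.name)        # hop recorded on the way down
        if self.downstream is None:
            reply = f"completed {io}"  # the last layer services the request
        else:
            reply = self.downstream.submit(io, trace)
        trace.append(self.name)        # hop recorded again on the way back up
        return reply


def build_path(names):
    """Link the named layers so that the first name is the entry point."""
    layer = None
    for name in reversed(names):
        layer = Layer(name, layer)
    return layer


path = build_path([
    "NVMe-oF target",
    "nexus",
    "child",
    "replica (lvol bdev)",
    "blobstore",
    "base bdev",
    "host kernel (aio or io_uring)",
    "disk device",
])

trace = []
reply = path.submit("write", trace)
print(" -> ".join(trace))
```

The printed trace lists each layer once on the way down and once on the way back, showing that the response retraces the request's path in reverse.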
Replicated Volume I/O Path
If the StorageClass on which a volume is based specifies a replication factor of greater than one, then a synchronous mirroring scheme is employed to maintain multiple redundant data copies. For a replicated volume, creation and configuration of the volume's nexus requires additional orchestration steps. Prior to creating the nexus, not only must a local replica be created and exported as for the non-replicated case, but the additional remote replicas required to meet the replication factor must also be created and exported from Replicated PV Mayastor instances other than the one hosting the nexus itself. The control plane core-agent component selects appropriate pool candidates, which includes ensuring sufficient available capacity and that no two replicas are sited on the same Replicated PV Mayastor instance (which would compromise availability during coincident failures). Once suitable replicas have been successfully exported, the control plane completes the creation and configuration of the volume's nexus, with the replicas as its children. In contrast to their local counterparts, remote replicas are exported, and so connected to by the nexus, over NVMe-oF using a user-mode initiator and target implementation from the SPDK.
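The two placement constraints named above (sufficient free capacity, and at most one replica per instance) can be sketched as follows. This is not the core-agent's actual algorithm: `PoolInfo`, `select_replica_pools`, and the preference for pools with the most free space are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolInfo:
    """Hypothetical view of a pool as seen by a placement routine."""
    name: str
    instance: str  # the instance hosting the pool
    free: int      # free capacity, in arbitrary units


def select_replica_pools(pools, size, replication_factor, nexus_instance):
    """Pick one pool per instance, starting with one local to the nexus."""
    # Constraint 1: only pools with enough free capacity are candidates.
    # Sorting by free space is an assumed tie-breaker, not documented behaviour.
    fits = sorted((p for p in pools if p.free >= size),
                  key=lambda p: p.free, reverse=True)

    # The nexus needs a local child, so a local pool is chosen first.
    local = next((p for p in fits if p.instance == nexus_instance), None)
    if local is None:
        raise ValueError("no local pool has enough free capacity")

    chosen = [local]
    used_instances = {local.instance}
    for pool in fits:
        if len(chosen) == replication_factor:
            break
        # Constraint 2: never place two replicas on the same instance.
        if pool.instance not in used_instances:
            chosen.append(pool)
            used_instances.add(pool.instance)

    if len(chosen) < replication_factor:
        raise ValueError("not enough instances with a suitable pool")
    return chosen


pools = [
    PoolInfo("pool-1a", "io-engine-1", free=50),
    PoolInfo("pool-1b", "io-engine-1", free=80),
    PoolInfo("pool-2a", "io-engine-2", free=40),
    PoolInfo("pool-3a", "io-engine-3", free=5),
]

chosen = select_replica_pools(pools, size=20, replication_factor=2,
                              nexus_instance="io-engine-1")
print([p.name for p in chosen])
```

With these example pools, a replication factor of three cannot be satisfied: the third instance lacks capacity, and a second pool on the first instance is not an acceptable substitute.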
Write I/O requests to the nexus are handled synchronously; the I/O is dispatched to all (healthy) children and only when completion is acknowledged by all is the I/O acknowledged to the calling initiator via the nexus front-end. Reads are round-robin distributed across healthy children.
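The write and read behaviour described above can be sketched in a few lines. The `Child` and `Nexus` classes here are hypothetical stand-ins, and the loop dispatches writes one after another for simplicity, whereas a real nexus would issue them concurrently; the point is only that a write is acknowledged once every healthy child has acknowledged it, and that reads rotate across healthy children.

```python
class Child:
    """Hypothetical child holding one copy of the volume's data."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.blocks = {}

    def write(self, lba, data):
        self.blocks[lba] = data
        return True  # acknowledgement of completion

    def read(self, lba):
        return self.blocks.get(lba)


class Nexus:
    """Hypothetical nexus mirroring writes and rotating reads."""

    def __init__(self, children):
        self.children = children
        self._next_read = 0

    def _healthy(self):
        healthy = [c for c in self.children if c.healthy]
        if not healthy:
            raise IOError("no healthy children")
        return healthy

    def write(self, lba, data):
        # Dispatch to every healthy child; acknowledge only if all acknowledge.
        acks = [child.write(lba, data) for child in self._healthy()]
        return all(acks)

    def read(self, lba):
        # Round-robin: each read goes to the next healthy child in turn.
        healthy = self._healthy()
        child = healthy[self._next_read % len(healthy)]
        self._next_read += 1
        return child.name, child.read(lba)


a, b, c = Child("child-a"), Child("child-b", healthy=False), Child("child-c")
nexus = Nexus([a, b, c])

acknowledged = nexus.write(0, b"data")
served_by = [nexus.read(0)[0] for _ in range(4)]
print(acknowledged, served_by)
```

In this example the unhealthy child receives neither writes nor reads, and successive reads alternate between the two healthy children.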