# Monitoring
## Pool Metrics Exporter

The IO engine pool metrics exporter runs as a sidecar container within every I/O-engine pod and exposes pool usage metrics in Prometheus format. These metrics are exposed on port 9502 at the HTTP endpoint `/metrics` and are refreshed every five minutes.
## Supported Pool Metrics

Name | Type | Unit | Description |
---|---|---|---|
disk_pool_total_size_bytes | Gauge | Integer | Total size of the pool |
disk_pool_used_size_bytes | Gauge | Integer | Used size of the pool |
disk_pool_status | Gauge | Integer | Status of the pool (0, 1, 2, 3) = {"Unknown", "Online", "Degraded", "Faulted"} |
disk_pool_committed_size | Gauge | Integer | Committed size of the pool in bytes |
## Example Metrics
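The exporter emits these metrics in the standard Prometheus exposition format. An illustrative sample (the pool name, node, and values are hypothetical):

```
# TYPE disk_pool_total_size_bytes gauge
disk_pool_total_size_bytes{name="pool-1",node="worker-0"} 10737418240
# TYPE disk_pool_used_size_bytes gauge
disk_pool_used_size_bytes{name="pool-1",node="worker-0"} 3221225472
# TYPE disk_pool_status gauge
disk_pool_status{name="pool-1",node="worker-0"} 1
# TYPE disk_pool_committed_size gauge
disk_pool_committed_size{name="pool-1",node="worker-0"} 6442450944
```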
## Stats Exporter Metrics

When eventing is activated, the stats exporter operates within the obs-callhome-stats container, located in the callhome pod. The statistics are made accessible through an HTTP endpoint at port 9090, specifically using the `/stats` route.
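For example, the stats can be inspected from a workstation with a port-forward (the pod name and namespace below are assumptions; substitute your deployment's values):

```shell
# Forward local port 9090 to the callhome pod
kubectl port-forward -n mayastor pod/<callhome-pod-name> 9090:9090
# Query the stats route
curl http://localhost:9090/stats
```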
## Supported Stats Metrics

Name | Type | Unit | Description |
---|---|---|---|
pools_created | Gauge | Integer | Total successful pool creation attempts |
pools_deleted | Gauge | Integer | Total successful pool deletion attempts |
volumes_created | Gauge | Integer | Total successful volume creation attempts |
volumes_deleted | Gauge | Integer | Total successful volume deletion attempts |
## Integrating Exporter with Prometheus Monitoring Stack

- To install, add the Prometheus-stack helm chart and update the repo.
Command
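A typical form of these commands, using the upstream prometheus-community chart repository:

```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```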
Then, install the Prometheus monitoring stack and set `prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues` to `false`. This enables Prometheus to discover the custom ServiceMonitor for Replicated PV Mayastor.
Command
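A sketch of the install command (the release name and namespace here are assumptions; the `--set` flag is the one described above):

```shell
helm install prometheus-stack prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
```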
- Install the ServiceMonitor resource to select services and specify their underlying endpoint objects.
ServiceMonitor YAML
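A minimal sketch of such a ServiceMonitor; the name, namespace, labels, and port name here are assumptions and must match the labels and named port of your Replicated PV Mayastor metrics service:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mayastor-monitoring
  namespace: mayastor
spec:
  selector:
    matchLabels:
      app: mayastor        # must match the metrics service's labels
  endpoints:
    - port: metrics        # named port on the service exposing 9502
      path: /metrics
```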
info
Upon successful integration of the exporter with the Prometheus stack, the metrics will be available on port 9090 at the HTTP endpoint `/metrics`.
## CSI Metrics Exporter

Name | Type | Unit | Description |
---|---|---|---|
kubelet_volume_stats_available_bytes | Gauge | Integer | Size of the available/usable volume (in bytes) |
kubelet_volume_stats_capacity_bytes | Gauge | Integer | The total size of the volume (in bytes) |
kubelet_volume_stats_used_bytes | Gauge | Integer | Used size of the volume (in bytes) |
kubelet_volume_stats_inodes | Gauge | Integer | The total number of inodes |
kubelet_volume_stats_inodes_free | Gauge | Integer | The total number of usable inodes. |
kubelet_volume_stats_inodes_used | Gauge | Integer | The total number of inodes that have been utilized to store metadata. |
## Performance Monitoring Stack

Previously, only the pool capacity/state stats were exported; the exporter cached the metrics and returned the cached values when the Prometheus client queried, which did not ensure that the latest data was returned during the Prometheus poll cycle.
In addition to the capacity and state metrics, the metrics exporter also exports performance statistics for pools, volumes, and replicas as Prometheus counters. The exporter does not pre-fetch or cache the metrics; instead, it polls the IO engine inline with the Prometheus client polling cycle.
important
It is recommended to set the Prometheus poll interval to not less than 5 minutes.
The following sections describe the raw resource metrics counters.
## DiskPool IoStat Counters

Metric Name | Metric Type | Labels/Tags | Metric Unit | Description |
---|---|---|---|---|
diskpool_num_read_ops | Gauge | name =<pool_id>, node =<pool_node> | Integer | Number of read operations |
diskpool_bytes_read | Gauge | name =<pool_id>, node =<pool_node> | Integer | Total bytes read on the pool |
diskpool_num_write_ops | Gauge | name =<pool_id>, node =<pool_node> | Integer | Number of write operations on the pool |
diskpool_bytes_written | Gauge | name =<pool_id>, node =<pool_node> | Integer | Total bytes written on the pool |
diskpool_read_latency_us | Gauge | name =<pool_id>, node =<pool_node> | Integer | Total read latency for all IOs on Pool in usec. |
diskpool_write_latency_us | Gauge | name =<pool_id>, node =<pool_node> | Integer | Total write latency for all IOs on Pool in usec. |
## Replica IoStat Counters

Metric Name | Metric Type | Labels/Tags | Metric Unit | Description |
---|---|---|---|---|
replica_num_read_ops | Gauge | name =<replica_uuid>, pool_id =<pool_uuid>, pv_name =<pv_name>, node =<replica_node> | Integer | Number of read operations on the replica |
replica_bytes_read | Gauge | name =<replica_uuid>, pv_name =<pv_name>, node =<replica_node> | Integer | Total bytes read on the replica |
replica_num_write_ops | Gauge | name =<replica_uuid>, pv_name =<pv_name>, node =<replica_node> | Integer | Number of write operations on the replica |
replica_bytes_written | Gauge | name =<replica_uuid>, pv_name =<pv_name>, node =<replica_node> | Integer | Total bytes written on the Replica |
replica_read_latency_us | Gauge | name =<replica_uuid>, pv_name =<pv_name>, node =<replica_node> | Integer | Total read latency for all IOs on replica in usec. |
replica_write_latency_us | Gauge | name =<replica_uuid>, pv_name =<pv_name>, node =<replica_node> | Integer | Total write latency for all IOs on replica in usec. |
## Target/Volume IoStat Counters

Metric Name | Metric Type | Labels/Tags | Metric Unit | Description |
---|---|---|---|---|
volume_num_read_ops | Gauge | pv_name =<pv_name> | Integer | Number of read operations through vol target |
volume_bytes_read | Gauge | pv_name =<pv_name> | Integer | Total bytes read through vol target |
volume_num_write_ops | Gauge | pv_name =<pv_name> | Integer | Number of write operations through vol target |
volume_bytes_written | Gauge | pv_name =<pv_name> | Integer | Total bytes written through vol target |
volume_read_latency_us | Gauge | pv_name =<pv_name> | Integer | Total read latency for all IOs through vol target in usec. |
volume_write_latency_us | Gauge | pv_name =<pv_name> | Integer | Total write latency for all IOs through vol target in usec. |
note
If you require IOPS, latency, and throughput in the dashboard, use the following considerations while creating the dashboard JSON config.
## R/W IOPS Calculation

`num_read_ops` and `num_write_ops` for all resources are available in the stats response.
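For instance, read IOPS for a volume can be derived in PromQL by rating the op counter over the scrape window (the 10m range is an assumption sized to the 5-minute poll interval):

```
rate(volume_num_read_ops{pv_name="<pv_name>"}[10m])
```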
## R/W Latency Calculation

`read_latency` (sum of the read latency of all IOs) and `write_latency` (sum of the write latency of all IOs) are available.
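Average latency per IO can then be derived as the ratio of the latency delta to the op-count delta over the same window, e.g. in PromQL (the range window is an assumption):

```
rate(volume_read_latency_us{pv_name="<pv_name>"}[10m])
  / rate(volume_num_read_ops{pv_name="<pv_name>"}[10m])
```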
## R/W Throughput Calculation

`bytes_read`/`bytes_written` (total bytes read/written for a bdev) are available.
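Read throughput in bytes per second, for example, is the rate of the byte counter (PromQL sketch; the window is an assumption):

```
rate(volume_bytes_read{pv_name="<pv_name>"}[10m])
```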
## Handling Counter Reset

The performance stats are not persistent across an IO engine restart, which means the counters are reset when the IO engine restarts. After a restart, all resources residing on that IO engine report lower counter values, so the calculations above would yield negative results. When the counter value from the current poll is less than that of the previous poll, do the following:
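One common way to tolerate such a reset when computing deltas is sketched below (an illustrative helper, not part of the exporter): if the current sample is smaller than the previous one, assume the counter restarted from zero and take the current value itself as the delta.

```python
def counter_delta(prev: int, curr: int) -> int:
    """Delta between two polls of a monotonic counter that may reset to zero.

    If the counter went backwards (the IO engine restarted), assume it
    restarted from zero, so the best estimate of the delta is `curr` itself.
    """
    return curr if curr < prev else curr - prev


def iops(prev_ops: int, curr_ops: int, interval_seconds: float) -> float:
    """Approximate IOPS over one poll interval, tolerating counter resets."""
    return counter_delta(prev_ops, curr_ops) / interval_seconds
```

For example, with a 300-second poll interval, `iops(1000, 1600, 300)` yields 2.0, while after a restart `iops(1000, 200, 300)` falls back to 200/300 rather than a negative rate. PromQL's `rate()` function applies the same kind of reset compensation automatically.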