
Pool Metrics Exporter#

The IO engine pool metrics exporter runs as a sidecar container within every I/O-engine pod and exposes pool usage metrics in Prometheus format. These metrics are served over HTTP on port 9502 at the endpoint /metrics and are refreshed every five minutes.

Supported Pool Metrics#

| Metric Name | Metric Type | Metric Unit | Description |
| --- | --- | --- | --- |
| disk_pool_total_size_bytes | Gauge | Integer | Total size of the pool |
| disk_pool_used_size_bytes | Gauge | Integer | Used size of the pool |
| disk_pool_status | Gauge | Integer | Status of the pool (0 = Unknown, 1 = Online, 2 = Degraded, 3 = Faulted) |
| disk_pool_committed_size_bytes | Gauge | Integer | Committed size of the pool in bytes |
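As a minimal sketch, the disk_pool_status gauge can be decoded into a pool state using the code-to-state mapping from the table above (the function name is illustrative, not part of the exporter):

```python
# Mapping from the disk_pool_status gauge value to the pool state,
# as documented in the table above.
POOL_STATUS = {0: "Unknown", 1: "Online", 2: "Degraded", 3: "Faulted"}

def pool_state(status_value: int) -> str:
    """Translate the numeric disk_pool_status gauge into a state name."""
    return POOL_STATUS.get(status_value, "Unknown")

pool_state(1)  # -> "Online"
```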

Example Metrics

```
# HELP disk_pool_status disk-pool status
# TYPE disk_pool_status gauge
disk_pool_status{node="worker-0",name="mayastor-disk-pool"} 1
# HELP disk_pool_total_size_bytes total size of the disk-pool in bytes
# TYPE disk_pool_total_size_bytes gauge
disk_pool_total_size_bytes{node="worker-0",name="mayastor-disk-pool"} 5.360320512e+09
# HELP disk_pool_used_size_bytes used disk-pool size in bytes
# TYPE disk_pool_used_size_bytes gauge
disk_pool_used_size_bytes{node="worker-0",name="mayastor-disk-pool"} 2.147483648e+09
# HELP disk_pool_committed_size_bytes Committed size of the pool in bytes
# TYPE disk_pool_committed_size_bytes gauge
disk_pool_committed_size_bytes{node="worker-0",name="mayastor-disk-pool"} 9663676416
```

Stats Exporter Metrics#

When eventing is activated, the stats exporter operates within the obs-callhome-stats container, located in the callhome pod. The statistics are made accessible through an HTTP endpoint at port 9090, specifically using the /stats route.

Supported Stats Metrics#

| Metric Name | Metric Type | Metric Unit | Description |
| --- | --- | --- | --- |
| pools_created | Gauge | Integer | Total successful pool creation attempts |
| pools_deleted | Gauge | Integer | Total successful pool deletion attempts |
| volumes_created | Gauge | Integer | Total successful volume creation attempts |
| volumes_deleted | Gauge | Integer | Total successful volume deletion attempts |

Integrating Exporter with Prometheus Monitoring Stack#

1. Add the prometheus-community Helm chart repository and update it:

```
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```

Then, install the Prometheus monitoring stack with prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues set to false. This enables Prometheus to discover the custom ServiceMonitor for Replicated PV Mayastor.


```
helm install mayastor prometheus-community/kube-prometheus-stack -n openebs --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
```
2. Install the ServiceMonitor resource to select services and specify their underlying endpoint objects.

ServiceMonitor YAML

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mayastor-monitoring
  labels:
    app: mayastor
spec:
  selector:
    matchLabels:
      app: mayastor
  endpoints:
    - port: metrics
```

Upon successful integration of the exporter with the Prometheus stack, the metrics will be available on port 9090 at the HTTP endpoint /metrics.

CSI Metrics Exporter#

| Metric Name | Metric Type | Metric Unit | Description |
| --- | --- | --- | --- |
| kubelet_volume_stats_available_bytes | Gauge | Integer | Size of the available/usable volume (in bytes) |
| kubelet_volume_stats_capacity_bytes | Gauge | Integer | Total size of the volume (in bytes) |
| kubelet_volume_stats_used_bytes | Gauge | Integer | Used size of the volume (in bytes) |
| kubelet_volume_stats_inodes | Gauge | Integer | Total number of inodes |
| kubelet_volume_stats_inodes_free | Gauge | Integer | Total number of usable inodes |
| kubelet_volume_stats_inodes_used | Gauge | Integer | Total number of inodes used to store metadata |
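As a sketch of how these gauges are typically combined in a dashboard, the volume fill percentage can be derived from the used and capacity gauges (the function name is illustrative, not part of the exporter):

```python
def volume_usage_percent(used_bytes: int, capacity_bytes: int) -> float:
    """Percentage of the volume consumed, derived from the kubelet gauges
    kubelet_volume_stats_used_bytes and kubelet_volume_stats_capacity_bytes."""
    if capacity_bytes == 0:
        return 0.0  # avoid division by zero before the volume is reported
    return 100.0 * used_bytes / capacity_bytes

# 2 GiB used out of a 5 GiB volume
volume_usage_percent(2_147_483_648, 5_368_709_120)  # -> 40.0
```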

Performance Monitoring Stack#

Previously, only the pool capacity/state stats were exported, and the exporter cached the metrics and returned the cached values when the Prometheus client queried. This did not guarantee that the latest data was returned during each Prometheus poll cycle.

In addition to the capacity and state metrics, the metrics exporter also exports performance statistics for pools, volumes, and replicas as Prometheus counters. The exporter does not pre-fetch or cache the metrics, it polls the IO engine inline with the Prometheus client polling cycle.


It is recommended to set the Prometheus poll interval to no less than 5 minutes.

The following sections describe the raw resource metric counters.

DiskPool IoStat Counters#

| Metric Name | Metric Type | Labels/Tags | Metric Unit | Description |
| --- | --- | --- | --- | --- |
| diskpool_num_read_ops | Gauge | name=<pool_id>, node=<pool_node> | Integer | Number of read operations on the pool |
| diskpool_bytes_read | Gauge | name=<pool_id>, node=<pool_node> | Integer | Total bytes read on the pool |
| diskpool_num_write_ops | Gauge | name=<pool_id>, node=<pool_node> | Integer | Number of write operations on the pool |
| diskpool_bytes_written | Gauge | name=<pool_id>, node=<pool_node> | Integer | Total bytes written on the pool |
| diskpool_read_latency_us | Gauge | name=<pool_id>, node=<pool_node> | Integer | Total read latency for all IOs on the pool (usec) |
| diskpool_write_latency_us | Gauge | name=<pool_id>, node=<pool_node> | Integer | Total write latency for all IOs on the pool (usec) |

Replica IoStat Counters#

| Metric Name | Metric Type | Labels/Tags | Metric Unit | Description |
| --- | --- | --- | --- | --- |
| replica_num_read_ops | Gauge | name=<replica_uuid>, pool_id=<pool_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Number of read operations on the replica |
| replica_bytes_read | Gauge | name=<replica_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Total bytes read on the replica |
| replica_num_write_ops | Gauge | name=<replica_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Number of write operations on the replica |
| replica_bytes_written | Gauge | name=<replica_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Total bytes written on the replica |
| replica_read_latency_us | Gauge | name=<replica_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Total read latency for all IOs on the replica (usec) |
| replica_write_latency_us | Gauge | name=<replica_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Total write latency for all IOs on the replica (usec) |

Target/Volume IoStat Counters#

| Metric Name | Metric Type | Labels/Tags | Metric Unit | Description |
| --- | --- | --- | --- | --- |
| volume_num_read_ops | Gauge | pv_name=<pv_name> | Integer | Number of read operations through the volume target |
| volume_bytes_read | Gauge | pv_name=<pv_name> | Integer | Total bytes read through the volume target |
| volume_num_write_ops | Gauge | pv_name=<pv_name> | Integer | Number of write operations through the volume target |
| volume_bytes_written | Gauge | pv_name=<pv_name> | Integer | Total bytes written through the volume target |
| volume_read_latency_us | Gauge | pv_name=<pv_name> | Integer | Total read latency for all IOs through the volume target (usec) |
| volume_write_latency_us | Gauge | pv_name=<pv_name> | Integer | Total write latency for all IOs through the volume target (usec) |

If you require IOPS, latency, and throughput in the dashboard, use the following calculations when creating the dashboard JSON config.

R/W IOPS Calculation#

The stats response exposes num_read_ops and num_write_ops for all resources.

write_iops = (num_write_ops (current poll) - num_write_ops (previous poll)) / poll period (in sec)
read_iops = (num_read_ops (current poll) - num_read_ops (previous poll)) / poll period (in sec)
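As a sketch (function and variable names are illustrative, not part of the exporter), the IOPS calculation can be expressed as:

```python
def iops(curr_ops: int, prev_ops: int, poll_period_s: float) -> float:
    """Operations per second between two Prometheus polls.

    curr_ops / prev_ops are raw num_read_ops or num_write_ops counter
    values from the current and previous poll.
    """
    return (curr_ops - prev_ops) / poll_period_s

# e.g. 12000 writes seen now, 9000 on the previous poll, polled every 300 s
write_iops = iops(12000, 9000, 300)  # -> 10.0
```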

R/W Latency Calculation#

The cumulative read_latency (sum of the read latency of all IOs) and write_latency (sum of the write latency of all IOs) are available.

read_latency_avg = (read_latency (current poll) - read_latency (previous poll)) / (num_read_ops (current poll) - num_read_ops (previous poll))
write_latency_avg = (write_latency (current poll) - write_latency (previous poll)) / (num_write_ops (current poll) - num_write_ops (previous poll))
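A sketch of the average-latency calculation (names are illustrative; the guard for a zero IO delta is an assumption to keep the division safe):

```python
def avg_latency_us(curr_lat_us: int, prev_lat_us: int,
                   curr_ops: int, prev_ops: int) -> float:
    """Average per-IO latency (usec) over one poll interval.

    *_lat_us are the cumulative read/write latency counters;
    *_ops are the matching num_read_ops / num_write_ops counters.
    """
    delta_ops = curr_ops - prev_ops
    if delta_ops == 0:
        return 0.0  # no IO completed in this interval
    return (curr_lat_us - prev_lat_us) / delta_ops

# 300000 usec of additional latency spread over 1000 new reads
read_latency_avg = avg_latency_us(500_000, 200_000, 2_000, 1_000)  # -> 300.0
```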

R/W Throughput Calculation#

bytes_read and bytes_written (total bytes read/written for a bdev) are available.

read_throughput = (bytes_read (current poll) - bytes_read (previous poll)) / poll period (in sec)
write_throughput = (bytes_written (current poll) - bytes_written (previous poll)) / poll period (in sec)
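The throughput calculation follows the same delta-over-period pattern (names are illustrative):

```python
def throughput_bps(curr_bytes: int, prev_bytes: int, poll_period_s: float) -> float:
    """Bytes per second between two polls of bytes_read / bytes_written."""
    return (curr_bytes - prev_bytes) / poll_period_s

# 2 MiB transferred over a 2-second window -> 1 MiB/s
read_throughput = throughput_bps(3 * 1024**2, 1 * 1024**2, 2)  # -> 1048576.0
```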

Handling Counter Reset#

The performance stats are not persistent across an IO engine restart; the counters are reset when the IO engine restarts. After a reset, the counters for all resources residing on that IO engine report smaller values, so the calculations above would yield negative results whenever the current-poll counter is less than the previous-poll counter. When that happens, compute the values as follows:

iops (r/w) = num_ops (r/w) / poll cycle
latency_avg(r/w) = latency (r/w) / num_ops
throughput (r/w) = bytes_read/written / poll_cycle (in secs)
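The reset handling above can be sketched as a single rate helper (names are illustrative): when the counter goes backwards, the engine restarted and the current value is itself the delta since the restart.

```python
def rate_with_reset(curr: int, prev: int, poll_period_s: float) -> float:
    """Counter rate per second that tolerates an IO-engine restart.

    If curr < prev the counter was reset, so curr alone is the number
    of operations/bytes accumulated since the restart.
    """
    delta = curr if curr < prev else curr - prev
    return delta / poll_period_s

rate_with_reset(300, 9000, 300)   # reset detected -> 1.0
rate_with_reset(9000, 3000, 300)  # normal case    -> 20.0
```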

