Pool Metrics Exporter#

The IO engine pool metrics exporter runs as a sidecar container within every I/O-engine pod and exposes pool usage metrics in Prometheus format. These metrics are exposed on port 9502 at the HTTP endpoint /metrics and are refreshed every five minutes.
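To verify that the exporter is serving metrics, you can port-forward to an io-engine pod and scrape the endpoint directly. A minimal check, assuming the openebs namespace (the pod name is a placeholder; list the pods with kubectl get pods -n openebs):

Command

kubectl -n openebs port-forward pod/<io-engine-pod> 9502:9502 &
curl -s http://localhost:9502/metrics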

Supported Pool Metrics#

| Name | Type | Unit | Description |
| --- | --- | --- | --- |
| disk_pool_total_size_bytes | Gauge | Integer | Total size of the pool in bytes |
| disk_pool_used_size_bytes | Gauge | Integer | Used size of the pool in bytes |
| disk_pool_status | Gauge | Integer | Status of the pool (0, 1, 2, 3) = {"Unknown", "Online", "Degraded", "Faulted"} |
| disk_pool_committed_size_bytes | Gauge | Integer | Committed size of the pool in bytes |

Example Metrics

# HELP disk_pool_status disk-pool status
# TYPE disk_pool_status gauge
disk_pool_status{node="worker-0",name="mayastor-disk-pool"} 1
# HELP disk_pool_total_size_bytes total size of the disk-pool in bytes
# TYPE disk_pool_total_size_bytes gauge
disk_pool_total_size_bytes{node="worker-0",name="mayastor-disk-pool"} 5.360320512e+09
# HELP disk_pool_used_size_bytes used disk-pool size in bytes
# TYPE disk_pool_used_size_bytes gauge
disk_pool_used_size_bytes{node="worker-0",name="mayastor-disk-pool"} 2.147483648e+09
# HELP disk_pool_committed_size_bytes Committed size of the pool in bytes
# TYPE disk_pool_committed_size_bytes gauge
disk_pool_committed_size_bytes{node="worker-0",name="mayastor-disk-pool"} 9663676416

Stats Exporter Metrics#

When eventing is activated, the stats exporter runs within the obs-callhome-stats container in the callhome pod. The statistics are exposed through an HTTP endpoint on port 9090, using the /stats route.
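The stats endpoint can be verified in the same way, assuming the openebs namespace (the pod name is a placeholder):

Command

kubectl -n openebs port-forward pod/<callhome-pod> 9090:9090 &
curl -s http://localhost:9090/stats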

Supported Stats Metrics#

| Name | Type | Unit | Description |
| --- | --- | --- | --- |
| pools_created | Gauge | Integer | Total successful pool creation attempts |
| pools_deleted | Gauge | Integer | Total successful pool deletion attempts |
| volumes_created | Gauge | Integer | Total successful volume creation attempts |
| volumes_deleted | Gauge | Integer | Total successful volume deletion attempts |

Integrating Exporter with Prometheus Monitoring Stack#

  1. To install, add the Prometheus-stack helm repository and update the repo.

Command

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Then, install the Prometheus monitoring stack and set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues to false. This enables Prometheus to discover the custom ServiceMonitor for Replicated PV Mayastor.

Command

helm install mayastor prometheus-community/kube-prometheus-stack -n openebs --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
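To verify the installation, confirm that the monitoring stack pods are running:

Command

kubectl get pods -n openebs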
  2. Install the ServiceMonitor resource to select services and specify their underlying endpoint objects.

ServiceMonitor YAML

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mayastor-monitoring
  labels:
    app: mayastor
spec:
  selector:
    matchLabels:
      app: mayastor
  endpoints:
    - port: metrics
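Save the manifest (for example, as service-monitor.yaml) and apply it:

Command

kubectl apply -f service-monitor.yaml -n openebs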
info

Upon successful integration of the exporter with the Prometheus stack, the metrics will be available on port 9090 at the HTTP endpoint /metrics.
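To verify, port-forward the Prometheus service and query one of the pool metrics through the Prometheus HTTP API (the service name is a placeholder; list the services with kubectl get svc -n openebs):

Command

kubectl -n openebs port-forward svc/<prometheus-service> 9090:9090 &
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=disk_pool_used_size_bytes'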

CSI Metrics Exporter#

| Name | Type | Unit | Description |
| --- | --- | --- | --- |
| kubelet_volume_stats_available_bytes | Gauge | Integer | Size of the available/usable volume (in bytes) |
| kubelet_volume_stats_capacity_bytes | Gauge | Integer | Total size of the volume (in bytes) |
| kubelet_volume_stats_used_bytes | Gauge | Integer | Used size of the volume (in bytes) |
| kubelet_volume_stats_inodes | Gauge | Integer | Total number of inodes |
| kubelet_volume_stats_inodes_free | Gauge | Integer | Total number of usable inodes |
| kubelet_volume_stats_inodes_used | Gauge | Integer | Total number of inodes that have been utilized to store metadata |
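These volume stats are collected by the kubelet and scraped by the Prometheus stack installed above. A sample query for the used size of a volume, with the PVC name as a placeholder:

Command

curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=kubelet_volume_stats_used_bytes{persistentvolumeclaim="<pvc-name>"}'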

Performance Monitoring Stack#

Earlier, only the pool capacity/state stats were exported; the exporter cached the metrics and returned the cached values when the Prometheus client queried. This did not ensure that the latest data was returned during each Prometheus poll cycle.

In addition to the capacity and state metrics, the metrics exporter also exports performance statistics for pools, volumes, and replicas as Prometheus counters. The exporter does not pre-fetch or cache the metrics; instead, it polls the IO engine inline with the Prometheus client's polling cycle.

important

Users are recommended to set the Prometheus poll interval to not less than 5 minutes.
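With the ServiceMonitor shown earlier, this can be enforced by setting the scrape interval on the endpoint entry (interval is a standard ServiceMonitor field):

ServiceMonitor YAML

  endpoints:
    - port: metrics
      interval: 5m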

The following sections describe the raw resource metric counters.

DiskPool IoStat Counters#

| Metric Name | Metric Type | Labels/Tags | Metric Unit | Description |
| --- | --- | --- | --- | --- |
| diskpool_num_read_ops | Gauge | name=<pool_id>, node=<pool_node> | Integer | Number of read operations on the pool |
| diskpool_bytes_read | Gauge | name=<pool_id>, node=<pool_node> | Integer | Total bytes read on the pool |
| diskpool_num_write_ops | Gauge | name=<pool_id>, node=<pool_node> | Integer | Number of write operations on the pool |
| diskpool_bytes_written | Gauge | name=<pool_id>, node=<pool_node> | Integer | Total bytes written on the pool |
| diskpool_read_latency_us | Gauge | name=<pool_id>, node=<pool_node> | Integer | Total read latency for all IOs on the pool (usec) |
| diskpool_write_latency_us | Gauge | name=<pool_id>, node=<pool_node> | Integer | Total write latency for all IOs on the pool (usec) |

Replica IoStat Counters#

| Metric Name | Metric Type | Labels/Tags | Metric Unit | Description |
| --- | --- | --- | --- | --- |
| replica_num_read_ops | Gauge | name=<replica_uuid>, pool_id=<pool_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Number of read operations on the replica |
| replica_bytes_read | Gauge | name=<replica_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Total bytes read on the replica |
| replica_num_write_ops | Gauge | name=<replica_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Number of write operations on the replica |
| replica_bytes_written | Gauge | name=<replica_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Total bytes written on the replica |
| replica_read_latency_us | Gauge | name=<replica_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Total read latency for all IOs on the replica (usec) |
| replica_write_latency_us | Gauge | name=<replica_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Total write latency for all IOs on the replica (usec) |

Target/Volume IoStat Counters#

| Metric Name | Metric Type | Labels/Tags | Metric Unit | Description |
| --- | --- | --- | --- | --- |
| volume_num_read_ops | Gauge | pv_name=<pv_name> | Integer | Number of read operations through the volume target |
| volume_bytes_read | Gauge | pv_name=<pv_name> | Integer | Total bytes read through the volume target |
| volume_num_write_ops | Gauge | pv_name=<pv_name> | Integer | Number of write operations through the volume target |
| volume_bytes_written | Gauge | pv_name=<pv_name> | Integer | Total bytes written through the volume target |
| volume_read_latency_us | Gauge | pv_name=<pv_name> | Integer | Total read latency for all IOs through the volume target (usec) |
| volume_write_latency_us | Gauge | pv_name=<pv_name> | Integer | Total write latency for all IOs through the volume target (usec) |
note

If you require IOPS, latency, and throughput in the dashboard, use the following considerations while creating the dashboard JSON config.

R/W IOPS Calculation#

num_read_ops and num_write_ops are available for all resources in the stats response.

write_iops = (num_write_ops (current poll) - num_write_ops (previous poll)) / poll period (in sec)
read_iops = (num_read_ops (current poll) - num_read_ops (previous poll)) / poll period (in sec)
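When these counters are scraped by Prometheus, the same difference-over-poll-period computation can be expressed with PromQL's rate() function, queried here through the HTTP API. The pool name is a placeholder, and the 10m window spans two 5-minute polls:

Command

curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=rate(diskpool_num_write_ops{name="<pool_id>"}[10m])'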

R/W Latency Calculation#

write_latency (the sum of all IOs' write latencies) and read_latency (the sum of all IOs' read latencies) are available.

read_latency_avg = (read_latency (current poll) - read_latency (previous poll)) / (num_read_ops (current poll) - num_read_ops (previous poll))
write_latency_avg = (write_latency (current poll) - write_latency (previous poll)) / (num_write_ops (current poll) - num_write_ops (previous poll))
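In PromQL, the same average can be computed by dividing the latency rate by the ops rate, shown here for pool read latency (result in usec):

Command

curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=rate(diskpool_read_latency_us[10m]) / rate(diskpool_num_read_ops[10m])'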

R/W Throughput Calculation#

bytes_read and bytes_written (the total bytes read/written for a bdev) are available.

read_throughput = (bytes_read (current poll) - bytes_read (previous poll)) / poll period (in sec)
write_throughput = (bytes_written (current poll) - bytes_written (previous poll)) / poll period (in sec)
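The equivalent PromQL for throughput, shown here as bytes read per second through a volume target (the PV name is a placeholder):

Command

curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=rate(volume_bytes_read{pv_name="<pv_name>"}[10m])'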

Handling Counter Reset#

The performance stats are not persistent across IO engine restarts; the counters are reset when the IO engine restarts. After a restart, the counter values for all resources residing on that IO engine will be lower than those from the previous poll, so the formulas above would yield negative values. When the current poll's counter is less than the previous poll's counter, compute the following instead:

iops (r/w) = num_ops (r/w) / poll period (in sec)
latency_avg (r/w) = latency (r/w) / num_ops (r/w)
throughput (r/w) = bytes_read/written / poll period (in sec)
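A minimal shell sketch of this reset handling, with illustrative counter values; note that PromQL's rate() and increase() apply the same correction automatically when a counter decreases:

Command

prev=120000   # num_write_ops at the previous poll
curr=500      # num_write_ops at the current poll (IO engine restarted in between)
period=300    # poll period in seconds
if [ "$curr" -ge "$prev" ]; then
  write_iops=$(( (curr - prev) / period ))
else
  write_iops=$(( curr / period ))   # counter was reset: use the raw value
fi
echo "write_iops=${write_iops}"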
