Monitoring

This chapter explains the functions for monitoring and problem analysis as well as the options for incident response.

Working with the CLI

ℹ️
In general, all functionality is available via both the CLI and the API. However, this documentation describes the administration of the system via the CLI. The CLI itself is available on all system nodes (management nodes and storage nodes), but some restrictions apply to storage nodes. It is therefore recommended to use the CLI from one of the management nodes.

The CLI is available via the sbcli command. sbcli is called for a particular object type: sbcli {cluster, sn, mgmt, lvol, pool, cn, snapshot}, where each object type has its own set of commands. Each command has a list of mandatory (positional) and optional parameters, which can be given in a short form (e.g. -f) or a long form (e.g. --force). The usage of the pool and lvol commands is described in Pools and Logical Volumes; these commands are relevant for host (server) storage management outside of the Kubernetes ecosystem.
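
For example, the following invocation follows this pattern with the object type cluster, the command get-capacity, and the cluster UUID as positional parameter (the UUID shown is only a placeholder):

sbcli cluster get-capacity 3b5f9c2a-1d4e-4f6a-9b7c-2e8d1a0f3c5b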

Retrieve Capacity Information

Individual and aggregated, current and historic capacity information is available per device, storage node (sn), and cluster via the CLI, the API, and the Grafana dashboards. It shows the physical capacity as well as the absolute and relative utilization of that physical capacity.

For pools and lvols, only thin-provisioned (reserved) capacity information is available, not the actually used capacity. There are multiple capacity thresholds, which can be set with the cluster create command. These thresholds control when warnings and critical messages are issued to the cluster log and when alerts fire based on these messages. When the critical limit of relative capacity utilization is reached for a cluster, all devices are switched into a read-only state. Likewise, when the critical (over-)provisioning limit for logical volumes and pools is reached, no additional logical volumes can be provisioned. In that case it is necessary to delete or resize existing lvols or to provision additional storage capacity. At the moment, these limits cannot be changed during active cluster operation.

Option | Type | Description
cap-warn | Percentage | If this total cluster utilization threshold is exceeded, a warning is created in the event log.
cap-crit | Percentage | If this total cluster utilization threshold is exceeded, a critical message is created in the event log. This message also triggers an alert in Grafana. The cluster switches into read-only mode until the utilized capacity drops below the threshold again.
prov-cap-warn | Percentage | If this total cluster provisioning threshold is exceeded, a warning is created in the event log, and warnings are shown when creating logical volumes.
prov-cap-crit | Percentage | If this total cluster provisioning threshold is exceeded, a critical message is created in the event log, provisioning is turned off, and an alert is fired.
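
For illustration, these thresholds could be set at cluster creation as follows. The exact option spelling (--cap-warn, --cap-crit, --prov-cap-warn, --prov-cap-crit) and the percentage values are assumptions for this example:

sbcli cluster create --cap-warn 80 --cap-crit 90 --prov-cap-warn 180 --prov-cap-crit 190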

The following commands are used to retrieve physical and provisioned capacity information from the different objects and aggregation levels:

sbcli cluster get-capacity UUID
sbcli sn get-capacity UUID
sbcli sn get-capacity-device UUID
sbcli lvol get-capacity UUID
sbcli pool get-capacity UUID

It is possible to retrieve historical capacity records as well:

sbcli cluster get-capacity --history XXdYYh UUID
sbcli sn get-capacity --history XXdYYh UUID
sbcli sn get-capacity-device --history XXdYYh UUID
sbcli lvol get-capacity --history XXdYYh UUID
sbcli pool get-capacity --history XXdYYh UUID
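
Here, XXdYYh specifies the history window in days (XX) and hours (YY). For example, to retrieve the capacity records of a storage node for the last 2 days and 12 hours (the UUID is a placeholder):

sbcli sn get-capacity --history 2d12h 8f1c2d3e-7a6b-4c5d-9e0f-123456789abc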

Retrieve IO Statistics

IO statistics contain a number of relevant metrics of historic and current IO activity per device, storage node, lvol, and cluster. This includes read and write throughput (in MB/s), IO operations per second (IOPS) for read, write, and unmap, the total amount of bytes read and written as well as total IO operations (since the (re-)start of a node), latency ticks, and the actual average read, write, and unmap latencies.

sbcli cluster get-io-stats UUID
sbcli sn get-io-stats UUID
sbcli sn get-io-stats-device UUID
sbcli lvol get-io-stats UUID
sbcli pool get-io-stats UUID

The history option is also available with the get-io-stats commands.
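
For example, to retrieve the IO statistics of a logical volume for the last 12 hours, assuming the same XXdYYh format as above (the UUID is a placeholder):

sbcli lvol get-io-stats --history 0d12h 1a2b3c4d-5e6f-4a7b-8c9d-0e1f2a3b4c5d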

Retrieve the Cluster Log

The cluster log provides important information on health and status of the cluster and its objects such as storage nodes, devices, logical volumes and pools.

Important event types include:

  • Any status change of an object is recorded with the timestamp, the object type, and the object's UUID. Status changes can be caused by administrative actions (e.g. removing a device, adding a storage node, shutting down or restarting a storage node) or by a failure situation (e.g. network outage, container stop or exit, instance unavailability). Important examples: a storage node goes offline or becomes unreachable, an lvol goes offline, a device becomes unavailable or read-only, a device is removed, or a storage node, device, or lvol comes back online.
  • Any device-level or general IO errors posted by the storage plane together with the device ID (storage ID)
  • Any tasks created or task status changes in the cluster
  • Exceeding provisioning or physical capacity utilization thresholds
  • The health state of all objects, including nodes, devices, lvols, and the cluster. Health checks are performed regularly on all cluster objects; any deviation between the expected state of an object and its actual state causes a health check error, which is reported in the log.

To retrieve the cluster log, use:

sbcli cluster get-logs UUID

The cluster log is also stored in Grafana, but it is not visible on the dashboards. It can, however, be used to define customized alerts.

Cluster Incident Response

Transient failure situations of a device (device unavailable or read-only) or a node (network outage, container exit, host instance down, etc.) are automatically detected within a few seconds. The object status is then changed accordingly and the object is excluded from cluster operations by updating the cluster map and informing all lvols. The system automatically retries after failures using exponential timeouts and changes the status of the object back to online if the problem resolves (e.g. a network problem disappears). If retries are unsuccessful, automated restarts of devices or nodes are scheduled and also retried multiple times with increasing timeouts. While these restarts can resolve certain situations (e.g. a container that stopped or exited due to an out-of-memory situation, or a transient device IO error that resolves after resetting the device), they cannot mitigate other situations such as a hanging host or network, a host instance stop (host shutdown), or a hardware failure of a node, a network interface, or a device.

If the underlying problem is not resolved within 60 minutes, the object remains in its unreachable, read-only, or unavailable state. In that case the administrator must restart the device or node manually after fixing the underlying issue.

Administrators can also reset a device manually, if they suspect a transient hardware issue:

sbcli sn reset-device UUID

They can also shut down a storage node, perform necessary maintenance and repair tasks and restart the node again:

sbcli sn shutdown UUID
sbcli sn restart UUID 

The restart command can be used with different parameters such as max-lvol, max-snap and max-prov, if changing these parameters is expected to resolve the issue. If a non-working device requires physical inspection, it can be removed and inspected:

sbcli sn remove-device UUID

After inspection, it can be inserted again and restarted:

sbcli sn restart-device UUID
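
Putting this together, a complete maintenance cycle for a storage node could, for illustration, look as follows. The UUID and the parameter value are placeholders, and the exact option spelling (--max-lvol) is an assumption derived from the parameter names mentioned above:

sbcli sn shutdown 8f1c2d3e-7a6b-4c5d-9e0f-123456789abc
# perform maintenance and repair tasks on the host, then restart the node
sbcli sn restart --max-lvol 200 8f1c2d3e-7a6b-4c5d-9e0f-123456789abc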

Volume Migration

The simplyblock scheduler distributes logical volumes across the available storage nodes using a probabilistic approach that takes the recent history of IO activity (load) on the available nodes into account. The scheduler is not aware of the IO demand of the logical volume to be provisioned, nor does it consider future fluctuations in the IO demand of lvols already provisioned on a node. The scheduler also does not re-schedule volumes after initial provisioning. This can lead to a situation in which one node becomes overloaded while other nodes are under-utilized.

It is possible to perform manual volume migration between nodes. This happens almost instantly, as no data is migrated. Rather, the in-memory data structures required to connect the volume to a node are re-created. For volumes without HA, the re-scheduling nevertheless requires a re-connect of the volume and thus disrupts IO:

sbcli lvol move [--force] id node_id
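
For example, to migrate a logical volume to a specific target node (both IDs below are placeholders):

sbcli lvol move 1a2b3c4d-5e6f-4a7b-8c9d-0e1f2a3b4c5d 9f8e7d6c-5b4a-4392-8170-6e5d4c3b2a19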

Working with Grafana

Grafana retrieves metric data from Prometheus, including capacity and IO statistics as well as the cluster event log, and visualizes their dynamics on dashboards. Grafana is also used for alerting via Slack or email. It is available from all management nodes: open a browser on one of the management nodes' IP addresses on port 3000 and log in using the credentials (admin and the cluster secret). The cluster secret can be retrieved via sbcli cluster get-secret. The standard retention period for metrics is 7 days, but this value can be changed when creating a cluster.
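
For example, assuming a management node with the placeholder IP address 192.168.10.11, the dashboards are reachable at http://192.168.10.11:3000, and the admin password can be retrieved as follows (assuming the cluster UUID is passed as a positional parameter, as with the other cluster commands):

sbcli cluster get-secret 3b5f9c2a-1d4e-4f6a-9b7c-2e8d1a0f3c5b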

Dashboards

All dashboards are stored in per-cluster folders. Each cluster folder contains the following dashboards: cluster, storage node, device, lvol, and pool. By default, each of these dashboards contains data for all objects of the respective type in a cluster (e.g. all devices), but it can be filtered to particular objects (e.g. a single device).

Dashboard widgets are self-explanatory. It is possible to filter them by particular objects (e.g. devices, storage nodes, or logical volumes) and to change the timescale and window. Dashboards include physical and logical capacity utilization dynamics, IOPS, IO throughput, and latency dynamics (each separately for read, write, and unmap). While all data of the event log is currently stored in Prometheus, it is not yet used in the dashboards.

Alerting

Out-of-the-box alerting is currently supported for Slack channels. Grafana also allows alerting via email notifications, but this requires an authorized SMTP server to send the messages. An SMTP server is currently not part of the management stack and must be deployed separately. Alerts can be triggered based on one-time or interval-based thresholds on the collected statistical data (IO statistics, capacity information) or based on events from the cluster event log.

Pre-Defined Alerts

The following pre-defined alerts are available:

Alert | Trigger
device-unavailable | Device status changed from online to unavailable
device-read-only | Device status changed from online to read-only
sn-offline | Storage node status changed from online to offline
crit-cap-reached | Critical absolute capacity utilization in the cluster was reached
crit-prov-reached | Critical provisioned capacity utilization in the cluster was reached

It is possible to specify the Slack webhook for alerting during cluster create and to modify it manually later on within the cluster.

sbcli cluster create --alert ...
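
A hypothetical example, assuming the option accepts the Slack webhook URL directly (the URL below is a placeholder):

sbcli cluster create --alert https://hooks.slack.com/services/T0000000/B0000000/XXXXXXXXXXXXXXXXXXXXXXXX
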
Adding Alerts

Working with Graylog

Graylog is available from all management nodes. Open a browser on one of the management nodes' IP addresses on port 9000 and log in using the credentials (admin and the cluster secret). The cluster secret can be retrieved via sbcli cluster get-secret. Logs from all storage nodes and management nodes are consolidated. The standard retention period for logs is 7 days, but this value can be changed when creating a cluster.

To get a basic understanding of Graylog search syntax, have a look at the Graylog documentation.

To analyze issues in clusters, the most important sources are the JSON-style logs from the spdkProxy container and the plain-text logs from the spdk container. This chapter cannot describe the structure and semantics of all log records, but here are a few pointers for navigation: