Monitoring
This chapter explains the functions for monitoring and problem analysis, as well as the options for incident response.
Working with the CLI
The CLI is available via the `sbcli` command. `sbcli` is called for a particular object type: `sbcli {cluster, sn, mgmt, lvol, pool, cn, snapshot}`, where each object type has its own set of commands. Each command has a list of mandatory (positional) and optional parameters, which can be passed in a short form (e.g. `-f`) or a long form (e.g. `--force`). The usage of `pool` and `lvol` commands is described in Pools and Logical Volumes; these commands are relevant for host (server) storage management outside the Kubernetes ecosystem.
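For example, the optional force parameter of the `lvol move` command (described later in this chapter) can be passed in either form; the UUIDs below are placeholders:
# long form of the optional force parameter
sbcli lvol move --force LVOL_UUID NODE_UUID
# equivalent short form
sbcli lvol move -f LVOL_UUID NODE_UUID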
Retrieve Capacity Information
Individual and aggregated current and historic capacity information is available per device, storage node (sn) and cluster via CLI, API and Grafana dashboards. It shows the physical capacity as well as the absolute and relative utilization of that capacity.
For pools and lvols, only thinly provisioned (reserved) capacity information is available, not the actually used capacity. There are multiple capacity thresholds, which can be set via the `cluster create` command. These thresholds control when warnings and critical messages are issued to the cluster log and when alerts fire based on these messages. When the critical limit of capacity utilization (in percent) is reached for a cluster, all devices are switched into read-only state. Similarly, if the critical (over-)provisioning limit for logical volumes and pools is reached, no additional logical volumes can be provisioned. In such a case, it is required to delete or resize existing lvols or to provision additional storage capacity. At the moment, these limits cannot be changed during active cluster operations.
Option | Type | Description |
---|---|---|
cap-warn | Percentage | If this total cluster utilization threshold is exceeded, a warning is created in the event log. |
cap-crit | Percentage | If this total cluster utilization threshold is exceeded, a critical message is created in the event log. This message also triggers an alert in Grafana. The cluster switches into read-only mode until the utilized capacity drops below the threshold again. |
prov-cap-warn | Percentage | If this total cluster provisioning threshold is exceeded, a warning is created in the event log, and warnings are shown when creating logical volumes. |
prov-cap-crit | Percentage | If this total cluster provisioning threshold is exceeded, a critical message is created in the event log and provisioning is turned off. An alert is fired. |
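The thresholds are passed as options to `cluster create`. A hedged sketch (the flag spellings are assumed to match the option names above, and the percentage values are only illustrative):
sbcli cluster create --cap-warn 80 --cap-crit 90 --prov-cap-warn 180 --prov-cap-crit 190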
The following commands are used to retrieve physical and provisioned capacity information from the different objects and aggregation levels:
sbcli cluster get-capacity UUID
sbcli sn get-capacity UUID
sbcli sn get-capacity-device UUID
sbcli lvol get-capacity UUID
sbcli pool get-capacity UUID
It is possible to retrieve historical capacity records as well. The history window is specified in days and hours in the format XXdYYh:
sbcli cluster get-capacity --history XXdYYh UUID
sbcli sn get-capacity --history XXdYYh UUID
sbcli sn get-capacity-device --history XXdYYh UUID
sbcli lvol get-capacity --history XXdYYh UUID
sbcli pool get-capacity --history XXdYYh UUID
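For example, to retrieve the capacity records of the last two days and twelve hours for a cluster (the UUID is a placeholder):
sbcli cluster get-capacity --history 2d12h CLUSTER_UUID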
Retrieve IO Statistics
IO statistics contain a number of relevant metrics of historic and current IO activity per device, storage node, lvol and cluster. This includes read and write throughput (in MB/s); IO operations per second (IOPS) for read, write and unmap; the total number of bytes and IO operations read and written since the (re-)start of a node; latency ticks; and the actual average read, write and unmap latencies.
sbcli cluster get-io-stats UUID
sbcli sn get-io-stats UUID
sbcli sn get-io-stats-device UUID
sbcli lvol get-io-stats UUID
sbcli pool get-io-stats UUID
The `--history` option is also available with the `get-io-stats` commands.
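For example, to retrieve the IO statistics of the last day for a logical volume (the UUID is a placeholder and the history format follows the XXdYYh convention above):
sbcli lvol get-io-stats --history 1d0h LVOL_UUID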
Retrieve the Cluster Log
The cluster log provides important information on health and status of the cluster and its objects such as storage nodes, devices, logical volumes and pools.
Important event types include:
- Any status change of an object is recorded with the timestamp, object type and object's UUID. Status changes can be caused by administrative actions (e.g. removing a device, adding a storage node, shutting down or restarting a storage node) or a failure situation (e.g. network outage, container stop or exit, instance unavailability). Important examples: a storage node goes `offline` or becomes `unreachable`, an lvol goes `offline`, a device becomes `unavailable` or `read-only`, a device is removed, or a storage node, device or lvol becomes `online` again.
- Any device-level or general IO errors posted by the storage plane, together with the device ID (storage ID)
- Any tasks created or task status changes in the cluster
- Exceeding provisioning or physical capacity utilization thresholds
- The health state of all objects, including nodes, devices, lvols and the cluster: health checks are performed regularly on all cluster objects, and any deviation between the expected state of an object and its actual state causes a health check error to be reported in the log.
To retrieve the cluster log, use:
sbcli cluster get-logs UUID
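For ad-hoc monitoring, the log can also be polled periodically from a shell, e.g. with the standard `watch` utility (a minimal sketch; the UUID is a placeholder):
# refresh the cluster log output every 60 seconds
watch -n 60 sbcli cluster get-logs CLUSTER_UUID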
The cluster log is also stored in Grafana, but it is not visible on the dashboards. It can be used to define customized alerts.
Cluster Incident Response
Transient failure situations of a device (device `unavailable` or `read-only`) or a node (network outage, container exit, host instance down, etc.) are automatically detected within a few seconds. The object status is then changed accordingly, and the object is excluded from cluster operations by updating the cluster map and informing all lvols. The system automates certain retries after failures using exponentially increasing time-outs and can change the status of the object back into the `online` state if the problem resolves (e.g. a network problem is gone). If retries are unsuccessful, automated restarts of devices or nodes are scheduled and also retried multiple times with increasing time-outs. While these restarts can resolve certain situations (e.g. a stopped or exited container due to an out-of-memory situation, or a transient device IO error that resolves after resetting the device), they cannot mitigate other situations such as a hanging host or network, a host instance stop (host shutdown), or a hardware failure of a node, a network interface or a device.
If the underlying problem is not resolved within 60 minutes, the object remains in its `unreachable`, `read-only` or `unavailable` state. In such a case, the administrator must restart the device or node manually after fixing the underlying issue.
Administrators can also reset a device manually if they suspect a transient hardware issue:
sbcli sn reset-device UUID
They can also shut down a storage node, perform necessary maintenance and repair tasks and restart the node again:
sbcli sn shutdown UUID
sbcli sn restart UUID
The `restart` command can be used with different parameters such as `max-lvol`, `max-snap` and `max-prov`, if changing these parameters is expected to resolve the issue; a sketch of a full maintenance cycle follows below. If a non-working device requires physical inspection, it can be removed and inspected:
sbcli sn remove-device UUID
After inspection, it can be inserted again and restarted:
sbcli sn restart-device UUID
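A hedged sketch of a full node maintenance cycle combining the commands above (the UUID is a placeholder, the flag spelling `--max-lvol` is assumed from the parameter name, and the value is only illustrative):
# take the node out of operation
sbcli sn shutdown NODE_UUID
# ... perform maintenance and repairs on the host ...
# restart the node, optionally with an adjusted lvol limit
sbcli sn restart --max-lvol 200 NODE_UUID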
Volume Migration
The simplyblock scheduler distributes logical volumes across the available storage nodes using a probabilistic approach that takes into account the recent history of IO activity (load) on the available nodes. The scheduler is not aware of the IO demand of the logical volume to be provisioned, nor does it consider future fluctuations in the IO demand of already provisioned lvols on the node. The scheduler does not re-schedule volumes after initial provisioning. This can lead to a situation in which one node becomes overloaded while other nodes are under-utilized.
It is possible to perform manual volume migration between nodes. This happens almost instantly, as no data is migrated; rather, the in-memory data structures required for the connection of the volume to a node are re-created. For volumes without HA, the re-scheduling will nevertheless require a re-connect of the volume and thus disrupt IO:
sbcli lvol move [--force] id node_id
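For example (a hedged sketch; the UUIDs are placeholders, and the exact behavior of `--force` is not described here):
# move a logical volume to another storage node
sbcli lvol move LVOL_UUID NODE_UUID
# retry with the force flag if the regular move does not complete
sbcli lvol move --force LVOL_UUID NODE_UUID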
Working with Grafana
Grafana retrieves metric data from Prometheus, including capacity and IO statistics as well as the cluster event log, and visualizes their dynamics on dashboards. Grafana is also used for alerting via Slack or email. It is available from all management nodes: open a browser to one of the management node IPs on port 3000 and log in using the credentials (admin and the cluster secret). The cluster secret can be retrieved via `sbcli cluster get-secret`. The standard retention period for metrics is 7 days, but this value can be changed when creating a cluster.
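A minimal login sketch (the cluster UUID is a placeholder; it is assumed here that `get-secret` takes the cluster UUID as a positional argument):
# retrieve the Grafana admin password
sbcli cluster get-secret CLUSTER_UUID
# then browse to http://MGMT_NODE_IP:3000 and log in as admin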
Dashboards
All dashboards are stored in per-cluster folders. Each cluster folder contains the following dashboards: cluster, storage node, device, lvol and pool. By default, each of these dashboards contains data for all objects of its type in a cluster (e.g. all devices), but it can be filtered to particular objects (e.g. one device).
The dashboard widgets are self-explanatory. It is possible to filter them by particular objects (e.g. devices, storage nodes or logical volumes) and to change the timescale and window. Dashboards include physical and logical capacity utilization dynamics, IOPS, IO throughput, and latency dynamics (each separate for read, write and unmap). While all event log data is currently stored in Prometheus, it is not used on the dashboards.
Alerting
Out-of-the-box alerting is currently supported for Slack channels. Grafana also allows alerting via email notifications, but this requires an authorized SMTP server to send messages. An SMTP server is currently not part of the management stack and must be deployed separately. Alerts can be triggered based on one-time or interval-based thresholds on the collected statistical data (IO statistics, capacity information) or based on events from the cluster event log.
Pre-Defined Alerts
The following pre-defined alerts are available:
Alert | Trigger |
---|---|
device-unavailable | Device status changed from online to unavailable |
device-read-only | Device status changed from online to read-only |
sn-offline | Storage node status changed from online to offline |
crit-cap-reached | Critical absolute capacity utilization in the cluster was reached |
crit-prov-reached | Critical provisioned capacity utilization in the cluster was reached |
It is possible to specify the Slack webhook for alerting during `cluster create` and to modify it manually later on within the cluster.
sbcli cluster create --alert ...
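A hedged example (the webhook URL is a placeholder following the general Slack webhook URL scheme; further `cluster create` parameters are omitted):
sbcli cluster create --alert https://hooks.slack.com/services/XXX/YYY/ZZZ ...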
Adding Alerts
Working with Graylog
Graylog is available from all management nodes. Open a browser to one of the management node IPs on port 9000 and log in using the credentials (admin and the cluster secret). The cluster secret can be retrieved via `sbcli cluster get-secret`. Logs from all storage nodes and management nodes are consolidated. The standard retention period for logs is 7 days, but this value can be changed when creating a cluster.
To get a basic understanding of Graylog search syntax, have a look at the Graylog documentation.
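A couple of hedged search examples in Graylog's Lucene-style query syntax (field names such as `container_name` are assumptions and depend on how the log inputs are configured). To find records mentioning an unavailable device:
message:device AND message:unavailable
To narrow a search to error records of a particular container:
container_name:spdkProxy AND message:error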
To analyze issues in clusters, the most important sources are the JSON-style logs from the spdkProxy container and the plain-text logs from the spdk container. This chapter cannot describe the structure and semantics of all log records, but here are a few points for navigation: