Important Concepts

Cluster Topology and Monitoring

Each management plane maintains its cluster topology (clusters, storage nodes, devices, logical volumes (lvols), pools, caching nodes, and their relationships) and the status of each topology object in a clustered repository on the management nodes. Each object is uniquely identified by a UUID, and interacting with the cluster (e.g. sbcli cluster list, sbcli sn list, sbcli lvol list, sbcli pool list) retrieves precise information on these objects, their parameters, and their state. Permanent monitoring and health checks detect deviations of the actual from the desired state of each of these objects, report those deviations in the cluster log, and perform mitigating actions (e.g. restarting a container, destroying and recreating NVMe-oF connections, resetting devices) whenever possible.

Object State Transitions

The following important state machines are implemented:

  • A cluster can be in online or suspended state. The state is changed only by administrative action and has no further impact.
  • A storage node can be in intermediate states, such as in shutdown and restarting, as well as online (operational), offline (turned off after shutdown), unreachable (not reachable via the management interface), and removed (final state; no longer part of the cluster).
  • A device can be new (detected, but not yet part of the cluster), online, unavailable (returns IO errors on read and write), removed (gracefully soft-removed or forcefully physically removed from the cluster), read-only (returns IO errors on write, or is almost full), and failed (permanently removed from the cluster). Some state changes are automatic; others follow administrative action.
  • An lvol can be online or offline. The lvol is set online after the initial connection and set offline if it runs in single mode and its storage node is turned off, or after an unrecoverable IO error. It is switched back to online once new IO activity on the lvol is detected; without IO activity it does not automatically return to online, even if functional.
  • A pool can be enabled (open for lvol provisioning) or disabled (closed for lvol provisioning). The state is changed via CLI or API.
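The device state machine above can be sketched as a transition table. The state names are taken from the list above; the exact set of allowed transitions is an illustrative assumption, not the actual implementation.

```python
# Hypothetical sketch of the device state machine described above.
# State names come from the documentation; the transition set is an
# assumption for illustration only.
DEVICE_TRANSITIONS = {
    "new": {"online"},                        # detected device joins the cluster
    "online": {"unavailable", "read-only", "removed", "failed"},
    "unavailable": {"online", "removed", "failed"},
    "read-only": {"online", "removed", "failed"},
    "removed": {"online", "failed"},          # re-added, or permanently retired
    "failed": set(),                          # terminal state
}

def transition(state: str, target: str) -> str:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in DEVICE_TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition: {state} -> {target}")
    return target
```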

Cluster Balancing and Rebalancing

A simplyblock storage cluster is always kept in or transitioned into a perfectly balanced state. This means that all devices in the cluster store the same relative amount of data depending on the size and performance of the device (utilization percentage).

For example, if a cluster is 75% full overall, each individual device in the cluster, regardless of the storage node it resides on, can also be assumed to be about 75% full. As data placement is based on statistical methods, there may be slight deviations in utilization between individual devices.
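The balancing property can be checked numerically: every device's utilization should sit close to the cluster-wide utilization. The device sizes and fill levels below are made-up example values.

```python
# Illustrative check of the balancing property: each device's utilization
# should be close to the cluster-wide average. Numbers are made up.

def cluster_utilization(devices):
    """devices: list of (used_bytes, capacity_bytes) tuples."""
    used = sum(u for u, _ in devices)
    cap = sum(c for _, c in devices)
    return used / cap

def max_deviation(devices):
    """Largest absolute gap between a device's utilization and the average."""
    avg = cluster_utilization(devices)
    return max(abs(u / c - avg) for u, c in devices)

# A small, well-balanced cluster: all devices at roughly 75%.
devices = [(75, 100), (74, 100), (152, 200)]
```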

⚠️
As an implication, it is important that all devices in a cluster provide similar relative performance characteristics (latency and IOPS per TB) and similar capacity. Furthermore, storage nodes and network bandwidth should be designed so that they are not the limiting factor for the cumulative performance of the devices on a node.

Clusters are rebalanced on cluster expansion (adding devices or entire nodes with devices to the cluster), as well as on device or node failures or removals. In these cases, automated data migration rebalances the cluster across all nodes. Rebalancing can take from a few seconds to hours, depending on the relative amount of data that needs to be redistributed, the size of the cluster, and its performance density (per TB).

During rebalancing, the status of data migration can be viewed via sbcli cluster list-tasks UUID.

In case of temporary outages of devices or whole nodes, new write IO is redirected to alternative locations; once the outage is over, the data is migrated back to its planned locations to rebalance the cluster.

⚠️

Keep planned node outages short, as outages impact cluster performance. A single node outage reduces performance by a factor of about 1/n, where n is the number of nodes in the cluster.

If an outage is planned to last longer, it is advised to remove the node, which evacuates the entire data set from the to-be-removed node into the rest of the cluster. Once this evacuation has finished, the cluster works at full performance again. The node can be re-added later on.

Redundancy

Redundancy has two elements: (1) protection of data from loss and (2) availability of access to data.

Data Loss Protection

In simplyblock, data loss protection is implemented using erasure coding. Schemas follow the form n+k, with k = 0, 1, or 2 (k = 0 provides no redundancy). n+1 tolerates the permanent loss of a single cluster device or node, while n+2 tolerates the concurrent loss of two devices or nodes. The schema (n and k) can be selected for each logical volume individually.

If a device fails permanently, it takes some time for the cluster to rebuild in the background and reach full redundancy again. This must be distinguished from the temporary removal of a device from the cluster, as well as from a temporary planned or unplanned node outage: in those cases the data may become unavailable for the time being, but it is not lost. The time it takes to rebuild a lost device depends on the relative size of the failed device compared to the total cluster size, the performance density (throughput) of the whole cluster, and other load on the cluster. It typically ranges from a few minutes to two hours.

Remember that in case of n+1, any additional failure during a rebuild leads to permanent data loss, as the cluster cannot compensate for the permanent loss of another device.

In case of n+2, the cluster can compensate for the permanent loss of a further device during a rebuild. The failure rate of SSDs is very low (about 1 failure in 20 million hours). Therefore, n+1 in a cluster with 100 devices leads to an annual failure probability of around 4.4%. The probability of data loss, i.e. a secondary failure within the 2-hour rebuild window following a first failure, is then only about 0.000044% per year. From the perspective of data protection, n+1 can be considered safe enough for all but the most critical data use cases, as it delivers (1 - 0.000044%) = 99.99996% data durability. This assumption presumes a controlled environment in which the cluster is operated, fully compliant with SSD requirements (temperature control, humidity). If SSDs are refurbished, or are not exchanged within their healthy lifetime, the MTTF (mean time to failure) of individual SSDs can drop significantly. In such a scenario, n+2 is recommended.
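The durability estimate can be reproduced with a back-of-the-envelope calculation. It assumes an MTTF of 20 million hours per SSD, 100 devices, a 2-hour rebuild window, and statistically independent failures (a simplification).

```python
# Back-of-the-envelope reproduction of the n+1 durability estimate above.
# Assumes independent failures and a constant failure rate.
MTTF_HOURS = 20e6        # ~1 failure in 20 million device-hours
DEVICES = 100
REBUILD_HOURS = 2
HOURS_PER_YEAR = 8760

# Probability that at least one of the 100 devices fails within a year.
p_first = DEVICES * HOURS_PER_YEAR / MTTF_HOURS          # ~0.044, i.e. ~4.4%

# Probability that a second device fails during the 2-hour rebuild.
p_second = (DEVICES - 1) * REBUILD_HOURS / MTTF_HOURS    # ~1e-5

# Annual probability of data loss under n+1 (both events in sequence),
# matching the ~0.000044% figure in the text.
p_loss = p_first * p_second
```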

Erasure coding generally relates to the devices in a cluster, not to the nodes. The correlated physical damage or loss of multiple or all storage devices of an entire node is not considered, as it can only be caused by a disaster such as a terrorist attack, theft, fire, an earthquake, or similar. Protection against disasters happens through replication of storage across racks, cabinets, or sites, a feature distinct from single-site data protection (important: this feature is part of simplyblock, but not released yet).

However, the choice of the erasure coding schema affects availability. Data is distributed in a manner that compensates for single node loss as far as possible. With n+1, compensation is possible if the cluster has at least n+1 nodes. An n+2 schema, however, can compensate for a single node outage with as few as n/2+1 nodes. A double node outage can then be compensated with a minimum of n+2 nodes.

This leads to an estimation of the optimal data stripe size n. If single node outage compensation is required, as in most cases, n should be chosen for k=1 so that n+1 is at most the number of nodes in the cluster. For k=2 and single outage compensation, n/2+1 nodes are sufficient. If dual node outage compensation is required (meaning logical volumes stay online while two cluster nodes are offline), n must be chosen so that at least n+2 nodes are available.
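The node-count rules above can be collected into a small helper. Only the three combinations stated in the text are covered; `math.ceil` handles the n/2+1 rule for odd n, which is an assumption about rounding, not a documented detail.

```python
import math

# Minimum node counts for outage compensation, per the rules in the text:
#   k=1, 1 node outage  -> n+1 nodes
#   k=2, 1 node outage  -> n/2+1 nodes
#   k=2, 2 node outages -> n+2 nodes
def min_nodes(n: int, k: int, node_outages: int) -> int:
    """Minimum cluster nodes for an n+k schema to keep data available
    through the given number of concurrent node outages."""
    if k == 1 and node_outages == 1:
        return n + 1
    if k == 2 and node_outages == 1:
        return math.ceil(n / 2) + 1   # rounding up for odd n is an assumption
    if k == 2 and node_outages == 2:
        return n + 2
    raise ValueError("combination not covered by the stated rules")
```

For a three-node cluster this confirms the 2+1 recommendation below: min_nodes(2, 1, 1) is 3.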

Generally, a larger stripe size reduces erasure coding overhead (the overhead in percent is calculated as $k/n*100$), but longer stripes perform less efficiently than stripes shorter than 4 (usually n=2) for small, random IO (4 KiB). As a rule of thumb, the number of (initial) cluster nodes determines the stripe size if HA is a requirement. For three-node clusters, use 2+1 or 2+2 schemas; for larger clusters, increase n.
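The capacity trade-off is easy to tabulate with the k/n formula from the text:

```python
# Erasure coding capacity overhead in percent, per the k/n * 100 formula.
def ec_overhead_percent(n: int, k: int) -> float:
    return k / n * 100

# Examples: 2+1 -> 50%, 2+2 -> 100%, 8+2 -> 25%.
```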

For HA, a second requirement must be met: logical volumes must be created with ha-type=ha. This parameter ensures that three instances of the same logical volume are created on three different storage nodes. The NVMe-oF subsystem is then connected to the logical volume via three different paths, and NVMe multipath takes care of transparent fail-over in case of a node outage. Note that these are not three replicas of the volume's data, but three independent access points connecting the logical volume to the same underlying, erasure-coded storage.

Block and Chunk Sizes

Simplyblock can be operated with block sizes of 512 bytes or 4 KiB. The optimal choice depends on the system's limiting factor. If the network handles small packets of 512 bytes efficiently and the network is the limiting factor for overall IOPS, 512-byte blocks can make sense, as they reduce the network bandwidth consumed per IO compared to 4 KiB blocks. If the limiting factor for overall IOPS is the disks, 4 KiB blocks are the better choice, as disks deliver as many IOPS on 4 KiB blocks as on 512-byte blocks.

As for the chunk size, it depends on the workloads served. For small, random IO workloads, the best choice is 4 KiB, while for workloads dominated by larger IO request sizes, a larger chunk size optimizes for throughput. The selection criteria for the chunk size in simplyblock do not differ from other erasure-coded systems.

Performance Considerations

Due to erasure coding, journaling, and the network, maximum raw IOPS (the IOPS that can be generated by disks) cannot be delivered in full.

Read IOPS are delivered in full (assuming the network bandwidth is not a limiting factor), whereas write operations are subject to write amplification caused by the additional operations for erasure coding data protection. The exact amount of amplification depends on the erasure coding schema. For example, with n=2, k=1, the write amplification for 4 KiB random write IOPS is 2x write + 1x read at a minimum (meaning that for each effective write, two raw writes and one read are required).
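For the 2+1 example, the effective write IOPS can be estimated by dividing the raw IOPS budget by the per-write cost of three raw operations. This treats reads and writes as equally expensive, which is a simplifying assumption.

```python
# Rough effective-write-IOPS estimate for the 2+1 example: each effective
# 4 KiB random write costs at least 2 raw writes and 1 raw read. Treating
# reads and writes as equally expensive is a simplification.
def effective_write_iops(raw_iops: float, raw_writes: int = 2, raw_reads: int = 1) -> float:
    """Divide the raw IOPS budget by the per-write amplification cost."""
    return raw_iops / (raw_writes + raw_reads)

# e.g. 3,000,000 raw IOPS support at most ~1,000,000 effective write IOPS
```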

⚠️

There is a direct relationship between IO latency and IOPS throughput. As long as the system is loaded with no more than about 90% of its maximum IO capacity, latency remains stable and relatively close to the minimum. Once the system starts to saturate (90%+), latency increases significantly. It is therefore important to understand the total (maximum) IOPS throughput of a system.

IOPS can be limited by disk performance, CPU, and network. Most frequently, assuming fast NVMe devices and simplyblock, the limiting factor is the network bandwidth. The weakest component determines the maximum capacity. Design your system and assign storage workloads to stay below 90% of the maximum IOPS for network, CPU, and disks.

Due to the design of simplyblock, there are two IOPS limits: one for the overall cluster and one for each individual lvol. Logical volume IOPS limits are currently bound by the power of a single CPU core and range at about 200,000 IOPS, while clusters can provide (tens of) millions of IOPS.

To understand IOPS saturation, it is best to test the system: use fio and increase the load until latency starts to rise significantly. If testing is not possible, saturation can also be estimated. First, determine the throughput bottleneck (maximum MB/s) under the worst-case scenario, which is 4 KiB random IO. Calculate the maximum raw IOPS twice, once from the disk specifications and once from the network bandwidth; the weaker of the two determines the maximum raw IOPS possible. The final step is to apply the write amplification of the selected erasure coding schema to obtain the effective IOPS.
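The estimation steps above can be sketched as a short calculation. All input values (disk spec IOPS, network bandwidth, amplification factor) are made-up examples, and the 90% headroom reflects the saturation guidance earlier in this section.

```python
# Illustrative worst-case sizing calculation: take the smaller of the
# disk-spec and network IOPS limits, then apply write amplification and
# the ~90% saturation headroom. All inputs are example values.
BLOCK_BYTES = 4096  # worst case: 4 KiB random IO

def raw_iops_limit(disk_iops_spec: float, network_bytes_per_s: float) -> float:
    """Raw IOPS bound by the weaker of disks and network."""
    network_iops = network_bytes_per_s / BLOCK_BYTES
    return min(disk_iops_spec, network_iops)

def effective_iops(raw_iops: float, amplification: float, headroom: float = 0.9) -> float:
    """Apply erasure coding write amplification and the 90% headroom."""
    return raw_iops / amplification * headroom

# Example: 2M IOPS disk spec vs. a 25 GbE link (~3.125e9 bytes/s);
# the network (~763k IOPS at 4 KiB) is the bottleneck here.
```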

You can also measure the current IOPS, throughput and latency, as well as create alerts in case thresholds are exceeded.