Architecture

Cluster Topology

The cluster topology is stored in the control plane (management nodes).

Each object is identified and interacted with by its UUID, both in the CLI and the API. To retrieve the UUIDs of objects via the CLI, the sbcli list commands are used (sbcli cluster list, sbcli sn list, sbcli sn list-devices, sbcli pool list, sbcli lvol list). The parent object's UUID is used for the list command; for example, the UUID of a storage node is required to list its devices.
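
For example, a storage node's UUID can first be looked up and then used to list its devices (an illustrative invocation; the exact output columns may differ between versions, and <node-uuid> is a placeholder):

    sbcli sn list                      # lists storage nodes together with their UUIDs
    sbcli sn list-devices <node-uuid>  # lists the devices of the given storage node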

The basic object model is simple: clusters are comprised of storage nodes, which contain storage drives (NVMe SSDs) and run on virtual or bare-metal instances; logical volumes are provisioned against clusters and connected to a set of storage nodes; snapshots relate to logical volumes, and logical volumes may be created out of snapshots (cloning); volumes and snapshots can be managed within storage pools. Multiple clusters may be connected to a single control plane.

The data plane worker threads responsible for distributed data placement and erasure coding receive a map of the cluster together with the up-to-date state of each object. A new cluster map or a status update is sent by the control plane whenever the topology or the state of any of the map's objects changes. While the workers are able to detect new problems directly on the data path, they generally operate based on the information contained in the map. It is possible to send and retrieve cluster maps manually via the control plane CLI (sn send-cluster-map, sn get-cluster-map).
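
For instance, a node's cluster map can be retrieved or re-sent manually (a sketch; whether these subcommands take the storage node UUID as an argument is an assumption and may differ between versions):

    sbcli sn get-cluster-map <node-uuid>   # retrieve the cluster map as currently held for the node
    sbcli sn send-cluster-map <node-uuid>  # push the current cluster map from the control plane to the node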

Cluster Monitor

The cluster monitor runs in the background and tracks changes in object states. For example, if a storage node becomes unreachable, its status is changed from online to unreachable; if a device is removed or returns IO errors, its status is changed from online to removed or unavailable. In case a storage node or a device becomes unhealthy, automated restart actions may also be performed. Depending on the chosen deployment topology of a cluster and the state of the instance hosting a storage node, storage node restart actions may occur on the same instance or on another instance.

In addition, the internal state, connectivity, and configuration of each object in the data plane are observed, and regular automated health checks compare the desired state with the actual state. In case of a deviation between the desired and the actual state, this information is reported in the cluster log and in the output of a manually run health check. In many cases, the deviation can be resolved by an (automated) restart.

For example, each logical volume is represented by a number of data plane objects on multiple storage nodes, such as a storage-node-internal logical volume, an NVMe NQN, a namespace, and a listener. These objects have to be present and configured correctly for a logical volume to be healthy.

Cluster Persistence

The cluster topology is stored using a clustered persistence service based on FoundationDB. It consumes a moderate amount of storage, usually from the root partition of the management nodes.

The persistence service also stores IO statistics for each object; based on the defined retention periods, housekeeping is performed to limit the growth of the stored data.

Object State Transitions

The following important state machines are implemented (some state changes are automatic, others follow administrative action):

  • A cluster is in new, online, or suspended state. After adding the initial storage nodes, a cluster can be activated (cluster activate; see the example after this list) to become operational. If an insufficient number of storage nodes are online to remain operational (serve r/w IO), the cluster moves into the suspended state.
  • A storage node can be in intermediate states, such as in_shutdown and restarting, as well as online (operational), offline (turned off - after shutdown), unreachable (not reachable or responsive via management interface), and removed (node was permanently removed from cluster).
  • A device is in new (detected, but not added to cluster yet), online, unavailable (returns IO errors on read and write), removed (was temporarily removed from cluster - this can be a logical removal operation via cli/api or a hot-plug physical removal), read-only (returns IO errors on write, if full or worn out) or failed (permanently failed and removed from cluster).
  • A logical volume can be online or offline. The lvol is set to online after the initial connection and is set to offline if it is in single mode and its storage node is turned off, or based on IO errors. It is switched back into the online state once new IO activity on the lvol is detected; without IO activity, it will not automatically return to the online state, even if functional.
  • A pool is either enabled (open for lvol provisioning) or disabled (closed for lvol provisioning). The state is changed via cli/api.
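
For example, a new cluster with sufficient online storage nodes can be transitioned into the online state via the CLI (illustrative invocation; <cluster-uuid> is the UUID returned by sbcli cluster list):

    sbcli cluster list                     # shows clusters and their current states (e.g. new)
    sbcli cluster activate <cluster-uuid>  # activates the cluster so it becomes operational (online)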

Cluster Rebalancing

A simplyblock storage cluster is always kept in or transitioned into a balanced state. This means that all devices in the cluster store the same relative amount of data depending on the size and performance of the device (utilization percentage).

For example, if a cluster is filled to 75% in total, each individual device in the cluster, regardless of the storage node it resides on, can also be assumed to be filled to about 75%. As this data placement is based on statistical methods, there may be slight deviations in utilization between individual devices.

⚠️
As an implication, it is important that devices in a cluster provide similar relative performance characteristics (latency and IOPS per TB) and similar capacities. Furthermore, storage nodes and network bandwidth should be designed in a way that they are not the limiting factor for the cumulative performance of the devices on a node.

Clusters are re-balanced in three cases:

  • Some data written during the offline phase of a particular node has to be migrated back to this node once it becomes online again
  • When an offline node or unavailable device is permanently failed (this is currently an administrator action), the data stored on this node or device must be rebuilt and re-distributed into the remaining cluster
  • When a node is added to the cluster, some of the data of the existing cluster is moved to the new node to re-balance capacity across all nodes, including the new one
⚠️
Keep planned node outages short, as outages impact cluster performance and redundancy. If an outage is planned for a longer time, it is advised to remove the node, which will evacuate the entire data set from the to-be-removed node into the rest of the cluster. Once this evacuation is finished, the cluster will work at full performance again. The node can be added again later on.

Data re-balancing is fully automated. Whenever a rebalancing event occurs, data migration tasks are created. Their status can be viewed using sbcli cluster list-tasks. While re-balancing tasks of the different types can overlap across nodes, they are always fully serialized per node: each node can only perform one data re-balancing task at a time and a new task will only be started once the current task is completed. Multiple tasks may be queued.
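
For example, the migration tasks created by a rebalancing event can be inspected as follows (illustrative; whether the cluster UUID is passed as an argument may depend on the version):

    sbcli cluster list-tasks <cluster-uuid>   # shows queued, running, and completed data migration tasks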

Data Loss Protection

In simplyblock, data loss protection is implemented using erasure coding. The schemas follow the form n+k, with k=0, 1, or 2. k=1 allows for the permanent loss of a single storage device or an entire storage node per cluster, while k=2 allows for the concurrent or overlapping loss of two independent storage drives or storage nodes.

⚠️
The permanent loss of a storage node is a scenario that can happen in public cloud environments, in which entire compute instances can “die” together with their local instance storage. In private cloud environments, storage node hardware can be repaired or replaced, or storage drives containing data can be removed and inserted into another host. In these environments, a permanent loss of data is more likely linked to the full failure of a storage drive or a data-center-level disaster.

Simplyblock works with distributed erasure coding. Each data or parity chunk is placed not only on a separate storage drive, but also on a separate storage node as long as sufficient nodes are available.

To maintain data protection guarantees under the assumption of recoverability/replaceability of node hardware, a single node is sufficient as long as it is equipped with at least n+k storage drives.

However, to maintain data protection guarantees under the assumption that entire nodes can also be lost permanently, n+k nodes are required to be online in a cluster.

To also guarantee full HA, that is, to continue serving all r/w IO during the outage of one (k=1) or two (k=2) nodes, one or two spare nodes are required on top of n+k. Therefore, a total of n+2*k nodes is required.
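
As a worked example with purely illustrative numbers: for an n=4, k=2 schema, protection against the permanent loss of entire nodes requires n+k = 6 online nodes, while continuing to serve all r/w IO during the outage of up to two nodes requires n+2*k = 8 nodes in the cluster.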

⚠️
It is the responsibility of the administrator to add sufficient nodes to the cluster in combination with selecting the correct values for n and k when creating the cluster to achieve the required degree of data protection and availability.

Keep in mind that while the probability of a full storage drive failure is very low (about one event per 20,000,000 operating hours), the chance of an entire loss of a compute instance in the cloud may be significantly higher.

Stripe Size

A choice of a high n has a negative impact on local node affinity (a feature that allows reads from the local SSDs), but improves the raw-to-effective ratio of storage (the raw-to-effective overhead factor is described by (n+k)/n).

Longer stripes, on the other hand, have more overhead in case data needs to be recovered due to the temporary or permanent loss of a storage device while small random reads (4 KiB) are performed. The overhead to rebuild data on a degraded system for a 4 KiB read at a chunk size of 4 KiB is, in fact, n. In other words, it takes n 4 KiB reads instead of a single one to perform this operation.
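
As an illustration with assumed numbers: a 4+1 schema has a raw-to-effective overhead factor of (4+1)/4 = 1.25 and needs 4 reads to serve a degraded 4 KiB read, whereas an 8+1 schema lowers the overhead factor to (8+1)/8 ≈ 1.13 but needs 8 reads for the same degraded read.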

Block, Chunk, and Stripe Sizes

Simplyblock is operated with block sizes of 512 bytes and 4 KiB. 512-byte blocks will only perform better than 4 KiB blocks under a few specific conditions.

As for the chunk size, it depends on the workloads served. For small, random IO workloads, the best choice is 4 KiB, while with workloads that have a majority of IO requests of larger block sizes, a larger chunk size will optimize for throughput.

Write Amplification

Due to erasure coding and metadata journaling, simplyblock has a write amplification that depends on the chosen erasure coding schema. The maximum write overhead, which occurs on 4 KiB random write IO, is 2x read plus 2x write for a single write with k=1, and 2x read plus 3x write for a single write with k=2. For larger write sizes, the overhead drops accordingly. The overhead (write amplification) for writing the metadata journal is about 25 percent.
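
As a worked example (illustrative): a single 4 KiB random write with k=1 can, in the worst case, trigger 2 reads and 2 writes on the backend, i.e. 4 backend operations for one front-end write, plus roughly 25 percent additional write volume for the metadata journal; with k=2 the worst case is 2 reads and 3 writes.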

There is no read amplification as long as the cluster is not degraded.

System Latency and Throughput

⚠️
There is a direct relationship between IO latency and IOPS throughput. As long as the system is not loaded with IO by more than about 90% of its maximal capacity, the latency remains stable and relatively close to the minimum latency. However, once the system starts to saturate (90%+), the latency starts to increase significantly. It is important to understand the total (max.) IOPS throughput of a system.

To limit spill-over effects and let certain workloads operate at minimum latency, while also leveraging the maximum IOPS and throughput of a cluster, a QoS feature has been introduced.

Logical volumes in the high-priority class will receive priority serving (low access latency), while logical volumes in the standard class have to wait a bit longer.

It is also possible to limit the IOPS and throughput capacity of logical volumes by using QoS settings on those volumes and on the pools to which those volumes belong.