Requirements

This chapter explains the architecture and sizing requirements of your simplyblock deployment.

Simplyblock consists of three main components:

  • the control plane
  • the storage plane (one or multiple storage clusters)
  • k8s integrations (CSI driver and caching node services)

The control plane is deployed to plain Linux instances (VMs or bare metal). The storage plane can be deployed either to plain Linux instances or into existing Kubernetes clusters.

ℹ️
Deploying onto plain Linux instances requires Rocky Linux 9 or RHEL 9 and auto-installs a number of prerequisites, including Python 3, Docker, and Docker Swarm, as well as the simplyblock CLI (sbcli). For this reason, it is recommended to use these instances exclusively for simplyblock, even though the simplyblock services themselves are fully containerized.

Control Plane

The control plane is the first component to be deployed. Storage clusters require a running control plane to connect to. A single control plane can serve multiple storage clusters and can be deployed in an availability zone or site different from those clusters.

Sharing a VLAN (subnet) with the storage clusters is the simplest deployment variant, but it is also possible to place the control plane into a separate subnet and use VPC peering or routing in between.

Control Plane infrastructure

The control plane is implemented as a stack of containers running on one or more management nodes (by default three nodes for HA and scalability) as a set of replicated, stateful services. Internally, the control plane uses a key-value store (FoundationDB), which is itself replicated across all management nodes and runs in cluster mode.

A single node is sufficient for testing and staging environments, but for production use, a three-node, load-balanced HA cluster is recommended. A load balancer is auto-deployed when using the simplyblock auto-deployer on a public cloud (AWS, GCP). Outside of a public cloud, it has to be deployed separately (not in scope of the simplyblock deployment).

Control Plane features

The control plane hosts the API (HTTP/S, TLS-secured) and the CLI with identical features. The CLI is equally available on all management nodes. API and CLI provide services for:

  • lifecycle management of a cluster: deploy storage clusters, manage nodes and devices, resize and re-configure clusters
  • lifecycle management of logical volumes and pools (this part is entirely automated in kubernetes environments)
  • cluster operations (retrieve io and capacity statistics, alerts, logs, etc.)

The management plane also provides real-time collection and aggregation of I/O statistics (performance, capacity, utilization), proactive cluster monitoring and health checks, monitoring dashboards, alerting, a log file repository with a management interface, data migration, and automated node and device restart services.

For monitoring dashboards and alerting, Grafana and Prometheus are used. A set of standard alerts is available; they can be sent to Slack and via e-mail, and they can be combined with any custom alerts.

For log management, simplyblock uses Graylog. It collects container logs from the different storage plane and control plane services, including the RPC service (all communication between the management plane and the storage plane, as well as the data services container) and the storage plane (data) services (SPDK log).

Control Plane to Cluster Communication

The storage nodes communicate with the management plane using two containers (sNodeApi and spdkProxy). Communication includes HTTP-based RPCs for:

  • shutting down, restarting, removing and adding storage nodes
  • creating, updating and deleting objects such as logical volumes, snapshots and clones on storage nodes
  • storage drive, storage node and logical volume status monitoring
  • statistics collection from storage drives and storage nodes
  • health checking of objects
  • automated node and drive restarts
  • execution and monitoring of data migration tasks

Logs, statistics and monitoring data are removed from the repositories (located on the root partitions of the management nodes) after their retention period and can be backed up into S3 buckets using automated cluster housekeeping.

Memory and vCPU Requirements

The amount of memory and vCPUs required per management node depends on the total number of storage nodes and logical volumes managed.

The general recommendation for a three-node management plane in production is to use at least 8 GiB of RAM and at least 4 vCPUs (two cores) per node.

In addition, another 1 GiB of RAM and 1 vCPU are recommended for every 5 storage nodes and for every 500 logical volumes.
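
For illustration, a sizing sketch for a single management node, assuming a hypothetical cluster of 20 storage nodes and 2,000 logical volumes (the input values are examples only):

    # Hypothetical example values
    STORAGE_NODES=20
    LOGICAL_VOLUMES=2000

    # Base recommendation: 8 GiB RAM and 4 vCPUs per management node,
    # plus 1 GiB RAM and 1 vCPU per 5 storage nodes and per 500 logical volumes
    RAM_GIB=$(( 8 + STORAGE_NODES / 5 + LOGICAL_VOLUMES / 500 ))
    VCPUS=$((  4 + STORAGE_NODES / 5 + LOGICAL_VOLUMES / 500 ))

    echo "Recommended per management node: ${RAM_GIB} GiB RAM, ${VCPUS} vCPUs"
    # -> 16 GiB RAM, 12 vCPUs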

Root Disk Requirements

The boot disk capacity required depends on the number of objects (storage nodes, devices, logical volumes, snapshots) and on the online retention time of logs, monitoring data and statistics, as well as the statistics resolution. The retention period and data granularity for a management plane can be set at creation time. Consider the following rules (based on the standard retention time and statistics resolution):

Description                      Value
Minimum                          25 GiB
Additional per storage node      1 GiB
Additional per hundred objects¹  500 MiB

These numbers are based on the default retention period of logs, statistics and monitoring data and will deviate if other intervals are chosen. The retention period can be set when creating a cluster.
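
As a sketch, the boot disk sizing for a hypothetical management node serving 10 storage nodes and 3,000 objects (logical volumes, snapshots and clones):

    # Hypothetical example values
    STORAGE_NODES=10
    OBJECTS=3000

    # 25 GiB base + 1 GiB per storage node + 500 MiB per 100 objects
    DISK_GIB=$(( 25 + STORAGE_NODES + OBJECTS / 100 / 2 ))
    echo "Recommended boot disk capacity: ~${DISK_GIB} GiB"
    # -> 25 + 10 + 15 = 50 GiB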

Network Security

For a list of the required ports and protocols, please see the Manual Management Plane Deployment section.

Storage Cluster

Storage clusters are implemented as a set of storage nodes. Storage nodes are containerized.

Storage Cluster Containers

The hot data path, with all of its data services, runs within a privileged container named SPDK. The NVMe drives, which are physically or virtually attached to the node, are detached from the Linux kernel and attached to the user-space process running in this container at container start time.

sNodeAPI is a container used to monitor and control the other node services. It receives inbound HTTP requests from management nodes.

spdkProxy is a container which receives inbound HTTP requests and translates them into local (Unix socket) RPC calls for the data services container (SPDK).

Further containers collect and aggregate different types of statistics.

⚠️

At startup time of the SPDK container, all local, unmounted NVMe devices (found via lspci or nvme list from the nvme-cli package) are detached from the Linux kernel and attached to simplyblock exclusively.

Detected devices that are marked busy by the operating system cannot be claimed. This usually happens because they are formatted with a file system and that file system is mounted. Such devices will not be attached to the simplyblock storage node. Devices must be idle and unmounted to be claimed and usable by simplyblock.

Devices that host a boot partition are excluded from the available devices by the device detection and cannot be used for simplyblock.
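
Before deploying a storage node, it can be useful to verify which NVMe devices are idle and therefore eligible to be claimed. A minimal sketch using standard tooling (device names and output depend on the system):

    # Show block devices with mountpoints; devices with a mounted file system
    # or a boot partition will not be claimed by simplyblock
    lsblk -o NAME,TYPE,SIZE,MOUNTPOINT

    # List NVMe controllers and namespaces via the nvme-cli package
    sudo nvme list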

Networking

Simplyblock data service containers within a cluster receive inbound I/O requests via NVMe/TCP and transfer I/O between each other using the same protocol. All communication within a storage cluster is entirely NVMe over TCP based; no other networking protocol is used. TLS can be activated to secure communication between hosts and storage nodes.
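
As a generic illustration of the transport (not a simplyblock-specific command), an NVMe/TCP target can be discovered from a host using the nvme-cli package; the address below is a placeholder:

    # Discover NVMe/TCP subsystems on a storage node
    # (placeholder address; 4420 is the default NVMe/TCP port)
    sudo nvme discover -t tcp -a 192.0.2.10 -s 4420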

For additional information on port security see the Manual Storage Cluster Deployment section.

Storage Requirements

The storage nodes require at least 5 GiB of free space on the root partition for the local container stack. Logs are shipped continuously to the central log management, so there is no additional local space requirement for logs.
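
A quick check of the available space on the root partition (standard tooling):

    df -h /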

Memory Requirements

Each storage node requires a certain amount of memory to service I/O and to provide data services for logical volumes and snapshots.

Simplyblock works with two types of memory: huge-page memory, which has to be pre-allocated before the storage node services are started and is then exclusively assigned to simplyblock, and system memory, which is allocated on demand. The exact amount of huge-page memory is calculated when adding or restarting a node, based on two parameters: the maximum amount of storage available in the cluster and the maximum number of logical volumes that can be created on the node:

Unit                            RAM
Fixed (small node)              4 (2) GiB
Per logical volume              6 MiB
Per TB of max. cluster storage  256 MiB
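
For illustration, a sketch of the huge-page calculation, assuming a hypothetical node limited to 1,000 logical volumes in a cluster with a maximum of 100 TB of storage:

    # Hypothetical example parameters
    MAX_LVOLS=1000        # maximum logical volumes on this node
    MAX_CLUSTER_TB=100    # maximum cluster storage in TB

    # 4 GiB fixed + 6 MiB per logical volume + 256 MiB per TB of max. cluster storage
    HUGEPAGE_MIB=$(( 4096 + MAX_LVOLS * 6 + MAX_CLUSTER_TB * 256 ))
    echo "Huge-page memory to reserve: ${HUGEPAGE_MIB} MiB"
    # -> 4096 + 6000 + 25600 = 35696 MiB (~35 GiB)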

If not enough huge-page memory is available, the node will not start. In this case, check /proc/meminfo for the total, reserved, and available huge-page memory on the node. You can use, for example, sysctl vm.nr_hugepages=4096 to set the huge-page memory to 8 GiB. Huge-page memory can also be reserved via the GRUB bootloader to avoid losing the setting at reboot.
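
A minimal sketch of checking and reserving huge-page memory, assuming the default huge page size of 2 MiB (the value 4096 is only an example and corresponds to 8 GiB):

    # Inspect the current huge page configuration
    grep Huge /proc/meminfo

    # Reserve 4096 x 2 MiB pages = 8 GiB of huge-page memory (example value)
    sudo sysctl vm.nr_hugepages=4096

    # Persist the setting across reboots via sysctl configuration
    echo "vm.nr_hugepages=4096" | sudo tee /etc/sysctl.d/90-hugepages.conf

Alternatively, the reservation can be made on the kernel command line (e.g. hugepages=4096 in GRUB_CMDLINE_LINUX), which reserves the memory at boot time and avoids losing the setting at reboot.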

In addition, the following amounts of system memory may be allocated dynamically:

Unit                                    RAM
Per TiB of used local SSD storage       256 MiB
Per TiB of used logical volume storage  256 MiB

ℹ️
Used local SSD storage is the physically utilized capacity of the local NVMe devices on the storage node at a given point in time. Used logical volume storage is the physically utilized (not just provisioned!) capacity of all logical volumes on a specific storage node at a given point in time.

vCPU Requirements

Simplyblock storage nodes can be started with different core masks.

ℹ️

The minimum number of cores required per storage node is 4, three of which host service threads. With 4 cores, only one core is used for worker threads, delivering about 200,000 IOPS. As a rule of thumb, each additional core is used for worker threads and adds about another 200,000 IOPS (real core, not hyper-thread!), although the actual figure depends on the exact workload patterns and other limiting factors such as the network.

Always leave core 0 to the operating system and do not assign it to simplyblock, even if the entire instance is dedicated to simplyblock. The minimum number of cores required for simplyblock is currently 4, and the maximum number of cores supported is 23 (per storage node / instance).

If possible, turn off hyper-threading and use real cores!
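
Whether hyper-threading (SMT) is active can be checked with lscpu; on kernels that expose the SMT control file it can also be disabled at runtime (a sketch; disabling it permanently is usually done in the BIOS/UEFI):

    # A value of 2 for "Thread(s) per core" indicates hyper-threading is enabled
    lscpu | grep -i 'thread(s) per core'

    # Disable SMT at runtime (if supported by the kernel)
    echo off | sudo tee /sys/devices/system/cpu/smt/control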

k8s Requirements

To use simplyblock storage on k8s worker nodes co-located with compute workloads, a Linux kernel version that supports NVMe over TCP and its multipath feature is required. The respective kernel module must be loaded before NVMe/TCP storage can be used on a node. This can typically be achieved using:

sudo modprobe nvme-tcp
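
To ensure the module is loaded again after a reboot, it can be added to the modules-load configuration, a common approach on systemd-based distributions:

    echo nvme-tcp | sudo tee /etc/modules-load.d/nvme-tcp.conf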

The following components are installed on k8s workers (all via a single Helm chart):

  • CSI driver (controller): on any one worker in the cluster
  • CSI driver (node-part): on any worker, which attaches simplyblock storage to containers
  • Caching-Node pod (optional): on any worker with at least one local NVMe SSD; the SSD, or parts of it, may be used as a local write-through cache to accelerate reads and reduce network bandwidth requirements. This setup requires about 2 GiB + 0.25% of the cache size in huge-page memory (no other system memory) and at least 2 vCPUs (preferred: cores). Example: a local NVMe SSD has a size of 1.9 TB. The minimum huge-page RAM requirement is 2 GiB + 0.25% * 1900 GiB = 6.75 GiB (see the sizing sketch below). At startup, a caching node detaches the selected NVMe devices from the Linux operating system and attaches them as cache to the simplyblock cluster. While the cache is an entirely stateless write-through cache, it can significantly reduce read access latency, increase cluster IOPS, and reduce load on the network.
  • Storage-Node pod (optional): on any worker with at least one local NVMe SSD; this setup requires a minimum of 2 GiB of huge-page memory plus further RAM depending on the size of the cluster and the local SSDs (see above); it further requires at least 5 vCPUs (preferred: cores).
ℹ️
For workers running a caching node or a storage node, memory reservation is done within the pod deployment. However, it is recommended to reserve this memory as huge-page memory at instance boot time. Failing to do so may render the assignment of the desired amount of memory to the caching node unsuccessful: even when enough free system memory is available, the assignment can fail due to memory fragmentation.
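
For illustration, a sketch of the caching-node huge-page estimate from the example above (2 GiB plus 0.25% of the cache size; the 1.9 TB device is the example value):

    # Hypothetical example: local NVMe cache device of 1900 GiB
    CACHE_GIB=1900

    # 2 GiB fixed + 0.25% of the cache size, in MiB
    HUGEPAGE_MIB=$(( 2048 + CACHE_GIB * 1024 / 400 ))
    echo "Huge-page memory for the caching node: ${HUGEPAGE_MIB} MiB"
    # -> 2048 + 4864 = 6912 MiB (~6.75 GiB)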

  1. Objects include logical volumes, snapshots and clones