Requirements

This chapter explains the architecture and sizing requirements necessary when planning your simplyblock deployment.

Where necessary, we differentiate between deployments in the AWS Cloud and all other bare metal or virtualized deployments.

Simplyblock consists of three main components:

  • the management plane
  • the storage plane (one or multiple storage clusters)
  • k8s integrations (CSI driver and caching node services)

The management plane and storage plane are deployed using Docker Swarm under the hood. The use of Docker Swarm is entirely hidden from the user behind the simplyblock CLI and API, and no prior knowledge of Docker Swarm is required to operate a simplyblock storage cluster. A Kubernetes alternative to Docker Swarm is planned for the next release.

Management Plane

The management plane is the first component to be deployed. Storage clusters require a running management plane to connect to. A single management plane can serve multiple storage clusters and can be deployed within a different availability zone or site.

For single-AZ deployments, it can run in the same VLAN and IP subnet as the storage cluster(s), but for deployments spanning multiple AZs, a separate subnet is recommended. On AWS, a single VPC may be sufficient within a particular region. However, it is also possible to operate storage clusters and the management plane in separate VPCs using VPC peering.

The management plane is implemented as a stack of containers running on one or multiple management nodes (by default three nodes for HA and scalability) as a set of replicated, stateful services. The management plane internally uses a key-value store (FoundationDB), which is itself replicated across all management nodes.

A single node is sufficient for testing and staging environments, but for production use, a three-node load-balanced HA cluster is recommended. A load balancer is auto-deployed when using the simplyblock auto-deployer on AWS. Outside of AWS, it has to be deployed separately (not in the scope of the simplyblock deployment). Management nodes currently run on either Rocky 9 or RHEL 9. This restriction will be lifted in the near future.

The management plane hosts the API (HTTP/S, TLS-secured) and the CLI. The API and CLI have feature parity, and the CLI is equally available on all management nodes. API and CLI provide services for:

  • lifecycle management of a cluster: deploying storage clusters, managing nodes and devices, resizing and re-configuring clusters
  • lifecycle management of logical volumes and pools (this part is entirely automated in Kubernetes environments)
  • cluster operations (retrieving IO and capacity statistics, alerts, logs, etc.)

The management plane also provides real-time collection and aggregation of capacity, utilization and IO statistics, health checking, monitoring (with dashboarding and alerting), a log database and management interface, data migration and automated restart services. Internal monitoring of storage clusters observes and updates the status of nodes and devices to achieve a consistent state and view across the entire cluster. The internal monitoring updates the status of each node every few seconds.

Furthermore, regular additional health checks compare and identify any deviations between reported and actual state of objects such as devices, nodes, logical volumes and the entire cluster. The automated restart function implements automated retries of node and device status updates, as well as repair and restarts to mitigate “stuck” devices and nodes after situations such as container exit, transient device errors or transient network outages.

Data migration services automatically re-balance the cluster after a device or node failure, or in case of cluster expansion.

For monitoring dashboards and alerting, Grafana is used in combination with Prometheus. A few standard alerts are available and can be sent to Slack. They can be combined with any custom alerts.

For log management, simplyblock uses Graylog. It collects container logs from the various storage plane and management plane services, including the RPC service (all communication between the management plane and the storage plane, as well as with the data services container) and the storage plane (data) services (SPDK log).

The storage nodes communicate with the management plane using two containers (sNodeApi and spdkProxy). Communication includes HTTP RPCs for:

  • deployment of storage nodes and implementation of API and CLI functionality
  • device, node and lvol status monitoring and consistency
  • statistics collection and aggregation
  • health checking of objects
  • automated restarts
  • execution of data migration tasks

Logs, statistics and monitoring data are removed from the systems after their retention period and can be backed up into S3 buckets using automated cluster housekeeping.

Memory and vCPU Requirements

The amount of memory and vCPUs required per node depends on:

  • the number of managed objects (clusters, storage nodes, devices, logical volumes)
  • the number of API operations per second
  • the number of management nodes (as load is balanced across them)

The general recommendation for a three-node management plane in production is to use at least 8 GiB of RAM and at least 4 vCPUs (two cores) per node.

Boot Disk Requirements

The boot disk storage capacity required depends on the number of objects (storage nodes, devices, logical volumes, snapshots), the online retention time of logs, monitoring data and statistics, as well as the statistics resolution. The retention period and data granularity for a management plane can be set at creation time. Consider the following rules (based on the standard retention time and statistics resolution):

| Description                     | Value   |
| ------------------------------- | ------- |
| Minimum                         | 25 GiB  |
| Additional per storage node     | 1 GiB   |
| Additional per hundred objects¹ | 500 MiB |

These numbers are based on the default retention period of logs, statistics and monitoring data and will deviate if other intervals are chosen. The retention period can be set when creating a cluster.
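For example (a sketch based on the table above, assuming the default retention settings), a management plane serving 3 storage nodes and roughly 300 managed objects would need per management node:

25 GiB + 3 * 1 GiB + 3 * 500 MiB ≈ 29.5 GiB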

Network Security

For a list of the required ports and protocols, please see the Manual Management Plane Deployment section.

Storage Cluster

Storage clusters are implemented as a set of storage nodes. Storage nodes are containerized. The core data services run within the privileged SPDK container. sNodeAPI is a container used to monitor and control the other node services; it receives inbound HTTP requests from management nodes. spdkProxy is a container that receives inbound HTTP requests and translates them into local (unix-socket) RPC calls for the data services container (SPDK). In addition, statistics collection and aggregation containers are part of the storage node container stack.

⚠️

At startup time of the SPDK container, all local, unmounted NVMe devices (as found via lspci or nvme list from the nvme-cli package) are detached from the Linux kernel and attached to simplyblock exclusively.

Any detected devices that are marked busy by the operating system cannot be claimed. This usually happens because they are formatted with a file system and this file system is mounted. Those devices will not be attached to the simplyblock storage node. Devices must be idle and unmounted to be claimed and usable by simplyblock.

Devices which host a boot partition will be excluded from the available devices by the device detection and cannot be used for simplyblock.
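To verify up front that a device is idle and will be claimed, the standard Linux tooling mentioned above can be used; the following is a sketch and the device name is an example:

sudo nvme list                                 # list local NVMe devices (nvme-cli package)
lsblk -o NAME,TYPE,MOUNTPOINT /dev/nvme1n1     # check for partitions and mounted file systems
findmnt --source /dev/nvme1n1                  # no output means the device is not mounted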

ℹ️
At the moment, storage nodes run on Rocky 9 and RHEL 9. This restriction will be lifted in the near future.

Networking

Simplyblock data service containers within a cluster receive inbound IO requests via NVMe/TCP and transfer IO between each other using the same protocol. All communication within a storage cluster is fully NVMe over TCP based and no other networking protocol is used.

For additional information on port security see the Manual Storage Cluster Deployment section.

Memory Requirements

A certain amount of memory is required to service IO traffic for logical volumes and snapshots per storage node. Memory consumption depends on a number of parameters:

  • number of cores on the node (assigned to simplyblock)
  • maximum number of logical volumes provisioned on the storage node
  • maximum number of active snapshots on the storage node
  • amount of available local storage on the storage node
  • maximum amount of provisioned and utilized logical volume capacity on the storage node

| Unit                                   | RAM               |
| -------------------------------------- | ----------------- |
| Minimum (small node)                   | 4 (2) GiB         |
| Per logical volume                     | 16 MiB - 128 MiB² |
| Per active snapshot                    | 12 MiB            |
| Per TiB of local SSD storage           | 256 MiB³          |
| Per TiB of used logical volume storage | 256 MiB³          |

Example: A node with 24 cores (20 assigned to simplyblock), 4 KiB block and chunk size, 32 TB of local storage, a maximum of 100 logical volumes, 500 active snapshots (plus an arbitrary number of backed-up snapshots), and a maximum total provisioned capacity of 48 TB requires:

4 + 100 * (0.016 + 0.03) + 500 * 0.012 + 0.256 * 32 + 0.256 * 48 ≈ 35.1 GiB

(The per-volume term of 0.016 + 0.03 GiB corresponds to 16 MiB plus 20 cores * 1.5 MiB; see the table footnotes.)

Remark: If not enough system memory is available, the data service refuses to start (an error message is returned when using sbcli sn add or sbcli sn restart).

ℹ️

Used logical volume storage is the physically used capacity of a logical volume on a specific storage cluster node. Due to the statistical distribution of data, with a small number of large logical volumes this value may exceed the naively expected provisioned_capacity / number_of_storage_nodes amount. Therefore, the used logical volume storage should be assumed to be at least 1.5x that value, or better, the full provisioned capacity of a logical volume.

With an increasing number of logical volumes (often many small volumes), the value can be assumed to be closer to the naive number calculated above.
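For example (a sketch): a single 10 TiB logical volume on a 3-node cluster naively yields 10 / 3 ≈ 3.3 TiB of used logical volume storage per node; for memory sizing, assume at least 5 TiB (1.5x), or better the full 10 TiB.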

⚠️

Simplyblock reserves and allocates part of the memory as huge pages. Huge page memory is a different category of memory from the operating system's perspective, providing larger memory blocks than the default page size.

If the commands sbcli sn add or sbcli sn restart (or their respective API calls) fail despite enough free memory, a potential cause is memory fragmentation. The problem will be visible in the /proc/meminfo virtual file or in the log of the SPDK container when adding or restarting a storage node.

In such a case, it is important to reserve the required amount of huge page memory early on. This can be done by adding the necessary settings to the kernel command line via GRUB.
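The following is a sketch for Rocky 9 / RHEL 9; the huge page size and count are example values and must be sized for the specific node:

grep -i huge /proc/meminfo     # inspect the current huge page allocation

# reserve, for example, 16384 huge pages of 2 MiB (32 GiB) on the kernel command line
sudo grubby --update-kernel=ALL --args="default_hugepagesz=2M hugepagesz=2M hugepages=16384"
sudo reboot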

vCPU Requirements

Simplyblock is designed to utilize the full resources of a given node. Therefore, when starting up, a CPU mask is automatically applied by simplyblock, claiming all available CPUs. However, 20% of the available CPUs (cores) are left to the operating system by default.

In terms of sizing, for full performance, we recommend one vCPU (a vCPU being a single core thread, >= 2.1 GHz) per 150,000 IOPS (raw). Additionally, 20% of the vCPUs should be added for the operating system.

Example

In a storage cluster with 3 nodes, 25 NVMe devices and 2x25 Gbit/s of network per node (6x25 Gbit/s of total network capacity), the maximum theoretical IOPS that can be achieved in the cluster is limited by network bandwidth only:

(150 Gbit/s * 0.9) / (8 bit/byte * 4 KiB) ≈ 4.2 million IOPS at 4 KiB, with a ratio of 70% read and 30% write

If 4,200,000 IOPS (raw) are expected (the factual IOPS depend on the erasure coding schema and the read/write ratio; with a 2+1 schema and an 80%/20% mix, this results in a maximum output of about 2,450,000 IOPS), this requires at least 28 cores (4,200,000 / 150,000) assigned to simplyblock, split over the 3 nodes, plus roughly 20% (6 cores) for the operating system. A 12-core CPU per node would be sufficient in such a case.

k8s Integrations

To use simplyblock storage, a Linux kernel version supporting NVMe over TCP and its multipath feature is required. This is the case for all major distributions and reasonably recent releases (see the list of tested compatibilities). The respective kernel module must be loaded before NVMe/TCP storage can be used on a node. This can typically be achieved using

sudo modprobe nvme-tcp
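To ensure the module is also loaded after a reboot, it can additionally be registered with systemd-modules-load, a common approach on systemd-based distributions:

echo nvme-tcp | sudo tee /etc/modules-load.d/nvme-tcp.conf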

To use simplyblock storage on k8s workers, it is required to install the simplyblock CSI driver. The CSI driver consists of two parts, the CSI controller and the CSI node Pods. The CSI controller can be installed anywhere in the cluster, while the node part must run locally on all nodes to which simplyblock storage will be attached. This can be achieved by tagging the respective nodes.
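As a sketch, tagging can be done via kubectl node labels; the label key and value below are placeholders, and the actual label expected by the CSI driver depends on its deployment configuration (e.g., Helm chart values):

# hypothetical label; replace with the selector configured for your CSI node Pods
kubectl label node <worker-node> simplyblock/storage-access=true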

A second optional component, which can be installed on worker nodes with local NVMe storage attached, is the simplyblock caching node. It consists of two Pods, one of which contains a privileged container. To deploy caching nodes to k8s workers, the worker nodes must be tagged as caching nodes.

At startup, a caching node detaches selected NVMe devices from the Linux operating system and attaches them as cache to the simplyblock cluster. While the cache is an entirely stateless write-through cache, it can significantly improve (reduce) read access latency, increase cluster IOPS, and reduce load on the network.

When adding or restarting a caching node, an amount of memory and a CPU mask have to be assigned.

It is recommended to assign at least two vCPU (a single core with two threads) to a caching node Pod.

The memory assigned determines the size of the cache. Assign at least 2.2% of the size of the NVMe cache to be used, plus 50 MiB. For example, if the node has a 1.92 TB NVMe drive attached and the full device is to be used as cache, it is required to assign around 43 GiB of RAM to the caching node Pod.

1.92 TB * 0.022 = 42.24 GB => ~43 GiB (including the additional 50 MiB)

For workers running a caching node, it is recommended to reserve this memory as huge page memory at instance boot time. Failing to do so may render the assignment of the desired amount of memory to the caching node unsuccessful: even when enough free system memory is available, the assignment can fail due to memory fragmentation.
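Continuing the example above and assuming the default 2 MiB huge page size (a sketch; see the GRUB example in the storage cluster section for reserving the pages at boot):

43 GiB = 43 * 1024 MiB / 2 MiB per page = 22,016 huge pages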


  1. Objects include logical volumes and clones, storage nodes and devices ↩︎

  2. Depends on number of cores on system assigned to simplyblock - Formula: 16 MiB + 1.5 MiB per core ↩︎

  3. Based on 4 KiB block and chunk size. Larger chunk size will reduce demand proportionally, e.g. 8 KiB chunks will reduce demand to 128 MiB / TiB. ↩︎ ↩︎