Deployment

A simplyblock deployment consists of three dependent parts to be deployed in the following order:

  1. Deploy the control plane (usually deployed once per organization or region; multiple storage clusters can connect to it) - the control plane can also be hosted by simplyblock
  2. Deploy disaggregated storage clusters (multiple k8s clusters and many hosts can be connected to one control plane)
  3. Deploy the CSI driver and (optionally) Caching Nodes and co-located storage nodes per k8s cluster

Simplyblock provides CLI functions to deploy the control plane and to create one or more cluster objects. Once the control plane is deployed and a cluster object has been created, CLI and API functions can be used to deploy disaggregated storage nodes into the cluster and connect them to the control plane.

To deploy storage clusters straight into Kubernetes environments, a Helm chart is available. This Helm chart internally interacts with the control plane API for deployment and deploys the co-located storage nodes, the optional caching nodes, and the CSI driver.
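A minimal sketch of such a deployment follows; the chart repository URL, chart name, and value keys below are assumptions, so consult the Helm chart's own documentation for the actual names:

helm repo add simplyblock <CHART-REPO-URL>
helm repo update
helm install simplyblock-storage simplyblock/<CHART-NAME> \
  --set controlPlane.apiUrl=<CONTROL-PLANE-API-URL> \
  --set controlPlane.clusterId=<CLUSTER-UUID> \
  --set controlPlane.clusterSecret=<CLUSTER-SECRET>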

On top of the deployment functions, an auto-deployer is available. It can scratch-deploy entire environments (compute instances, virtual networking, control plane nodes, disaggregated storage cluster nodes, a k3s test cluster) using Terraform and is currently tested for AWS and GCP. The auto-deployer can deploy either the control plane, a storage cluster, or both: see the simplyblock auto-deployer.
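A hedged outline of an auto-deployer run is shown below; the repository location and Terraform variable names are assumptions, as the actual variables are defined by the auto-deployer's own Terraform modules:

git clone <AUTO-DEPLOYER-REPO-URL>
cd <AUTO-DEPLOYER-DIRECTORY>
terraform init
terraform plan -var="region=<REGION>" -var="deploy_control_plane=true" -var="deploy_storage_cluster=true"
terraform apply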

Manual Control Plane Deployment

The control plane can be deployed on one or three nodes. Currently, these nodes must run RHEL 9 or the Rocky 9 equivalent. The following services run on those nodes:

Control Plane Services:

fdb-server
StorageNodeMonitor
MgmtNodeMonitor
CachingNodeMonitor
LVolStatsCollector
CachedLVolStatsCollector
PortStatsCollector
HAProxy
CapacityAndStatsCollector
CapacityMonitor
HealthCheck
DeviceMonitor
LVolMonitor
CleanupFDB
TasksRunnerRestart
TasksRunnerMigration
TasksRunnerFailedMigration
TasksRunnerNewDeviceMigration
NewDeviceDiscovery
WebAppAPI
TasksNodeAddRunner

Monitoring Services:

mongodb
opensearch
graylog
promagent
pushgateway
grafana
thanos
node-exporter

Ensure that the following network ports are open on the hosts and for traffic into and out of the subnet (the subnet can only be shared if the control plane resides in the same AZ as the storage clusters):

The load balancer:

| Direction | Source or target network | Ports | Protocol |
|---|---|---|---|
| ingress | mgmt | 80 | tcp |
| ingress | mgmt | 3000 | tcp |
| ingress | mgmt | 9000 | tcp |
| egress | all | all | all |
ℹ️
The load balancer is only deployed by the auto-deployer. With a manual deployment, a load balancer (on AWS, for example, an AWS load balancer) must be deployed separately.

For Management Nodes:

| Service | Direction | Source or target network | Ports | Protocol |
|---|---|---|---|---|
| API (http/s) | ingress | load balancer | 80 | tcp |
| SSH | ingress | mgmt | 22 | tcp |
| Grafana | ingress | load balancer | 3000 | tcp |
| Graylog | ingress | load balancer | 9000 | tcp |
| EFS (AWS only) | ingress | internal subnet | 2049 | tcp |
| Docker Swarm | ingress | storage clusters, internal subnet | 2375, 2377, 7946 | tcp |
| Docker Swarm | ingress | storage clusters, internal subnet | 7946, 4789 | udp |
| Graylog | ingress | storage clusters, internal subnet | 12201 | tcp |
| Graylog | ingress | storage clusters, internal subnet | 12201 | udp |
| fdb | ingress | storage clusters, internal subnet | 4800, 4500 | tcp |
| All traffic | egress | 0.0.0.0/0 | all | all |

Important: On AWS, the API, Grafana and Graylog are accessed via the load-balanced API gateway.
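For illustration only, opening the listed management node ports with firewalld could look like the following sketch; in practice, restrict sources to the load balancer, the storage clusters, and the internal subnet as listed in the table above:

sudo firewall-cmd --permanent --add-port=80/tcp --add-port=3000/tcp --add-port=9000/tcp --add-port=2049/tcp
sudo firewall-cmd --permanent --add-port=2375/tcp --add-port=2377/tcp --add-port=7946/tcp --add-port=7946/udp --add-port=4789/udp
sudo firewall-cmd --permanent --add-port=12201/tcp --add-port=12201/udp --add-port=4800/tcp --add-port=4500/tcp
sudo firewall-cmd --reload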

To deploy the control plane, first install the initial node:

sudo yum -y install python3-pip
pip install sbcli-release --upgrade
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
sbcli cluster create 
sbcli cluster list

The following important parameters may be used with cluster create (see the example after the table):

| Parameter | Value range | Description |
|---|---|---|
| --distr-ndcs | 1, 2, 4, 8 | Data chunks per stripe. The minimum number of devices in the cluster must be ndcs + 2*npcs. A larger ndcs improves the effective-to-raw storage ratio (ndcs/(ndcs+npcs)), but reduces performance for small (4K) random writes. If ndcs > 1, local node affinity does not work. |
| --distr-npcs | 0, 1, 2 | npcs=0: only for ephemeral storage; data is lost if a single node or device is lost. npcs=1: tolerates one node failure at a time; to avoid data loss, the cluster stops writing if more than one node is down. npcs=2: tolerates up to two concurrent node failures; to avoid data loss, the cluster stops writing if more than two nodes are down. To achieve single HA, at least ndcs + 2*npcs nodes are required for npcs=1 and at least ndcs/2 + npcs nodes for npcs=2. To achieve dual HA with npcs=2, at least ndcs + 2*npcs nodes are required. |
| --enable-node-affinity | on/off | If enabled, the first data chunk of a stripe is always placed on the node to which the logical volume is local. This reduces network traffic on writes and avoids network traffic on reads. |
| --log-del-interval | 1d-355d | Default: 7d. After the retention period, container logs are emptied (AWS: backed up to S3). If increased, the root disk size on the management nodes must be increased. |
| --metrics_retention_period | 1d-355d | Default: 7d. After the retention period, hot statistics are removed from the internal database (affects API, CLI and Grafana dashboards). If increased, the root disk size on the management nodes must be increased. |
| --cap-warn | percentage | Default: 92%. Creates repeated entries in the cluster log and monitoring when reached. |
| --cap-crit | percentage | Default: 98%. Total cluster storage utilization. When reached, the cluster stops writing; data must be deleted or the cluster must be expanded. |
| --prov-cap-warn | percentage | Default: 150%. Creates repeated warnings in the cluster log when provisioned capacity / total cluster size reaches this value. |
| --prov-cap-crit | percentage | Default: 200%. New lvol provisioning stops when reached. |
| --ifname | Linux device name | Default: eth0. Default interface for data traffic on storage nodes. |
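For example, the following creates a cluster with one data chunk and one parity chunk per stripe, node affinity enabled, and a 14-day log retention (an illustrative parameter combination, not a sizing recommendation):

sbcli cluster create --distr-ndcs 1 --distr-npcs 1 --enable-node-affinity on --log-del-interval 14d --ifname eth0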

To install the second and third control plane node (for a HA cluster):

sudo yum -y install python3-pip
pip install sbcli-release --upgrade
sbcli mgmt add <FIRST-MGMT-NODE-IP> <CLUSTER-UUID> eth0

To verify the control plane is up and running, retrieve the cluster secret (sbcli cluster get-secret) and log in to the following services using the external IP of one of the three nodes. For Graylog, Grafana and Prometheus, the username is admin and the password is the cluster secret. For the API, the cluster UUID and the secret are required (see the example after the table):

| Service | Port | User | Secret |
|---|---|---|---|
| API (http/s) | 443 | uuid* | cluster secret |
| Grafana | 3000 | admin | cluster secret |
| Grafana | 3000 | uuid** | cluster secret |
| Graylog | 9000 | admin | cluster secret |
| Graylog | 9000 | uuid** | cluster secret |
| HA-Proxy | 8404 | | |

*UUID of the cluster created first
**UUID of any additionally created cluster - a separate login / access rights per cluster can be used for multi-tenancy (different tenants use the control plane for their clusters)

All services are reachable via all (external) IPs of the management nodes.
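As a quick sanity check, Grafana's health endpoint can be queried with the admin user and the cluster secret. This is only a sketch: the argument to get-secret and the node IP are placeholders.

SECRET=$(sbcli cluster get-secret <CLUSTER-UUID>)
curl -u admin:"$SECRET" http://<MGMT-NODE-IP>:3000/api/health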

Storage Plane (Cluster) Deployment

Storage nodes can be installed once the control plane is running. It is necessary to differentiate between a co-located deployment (storage nodes running on Kubernetes workers) and a disaggregated deployment. Combinations are possible (some nodes in the cluster run on workers, others are disaggregated).

⚠️
Storage Node Container Stacks (pods) run in privileged mode.
ℹ️
When combining different node types, ensure that network latency between all nodes is similar (and below 150 µs on average) and that vCPUs per TB of storage, SSD IOPS and throughput per TB of storage, and network bandwidth per TB of storage are similar. It is therefore not advisable to mix different NVMe sizes within the same cluster and tier, as larger disks provide fewer IOPS per TB; different models and vendors are usually not a problem. Keep in mind that the cluster tends to converge to the slowest of its components.
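A rough way to verify the latency requirement between two nodes is a tight-interval ping and checking the average round-trip time in the summary (sub-10 ms intervals require root; this is a coarse check, not a benchmark):

sudo ping -c 100 -i 0.01 <OTHER-NODE-IP> | tail -2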

The following ports must be opened within a storage cluster subnet (*for disaggregated nodes only, **for k8s co-located nodes only):

| Service | Direction | Source or target network | Ports | Protocol |
|---|---|---|---|---|
| Lvol Connect | ingress | compute hosts, storage cluster | 4420 | tcp |
| SNode API | ingress | Management Nodes | 5000 | tcp |
| SNode API | ingress | Management Nodes, storage cluster | - | ICMP |
| SPDK Proxy | ingress | Management Nodes | 8080 | tcp |
| Docker API | ingress | Management Nodes, storage cluster | 2375 | tcp |
| Docker Swarm* | ingress | Management Nodes, storage cluster | 2377, 7946 | tcp |
| Docker Swarm* | ingress | Management Nodes, storage cluster | 7946, 4789 | udp |
| k8s node communication** | ingress | storage cluster | 10250 | tcp |
| DNS resolution from worker nodes** | ingress | storage cluster | 53 | tcp |
| UDP traffic on ephemeral ports** | ingress | storage cluster | 1065-65535 | udp |
| SSH | ingress | mgmt | 22 | tcp |
| All traffic | egress | 0.0.0.0/0 | all | all |

Storage Nodes can be provisioned automatically in AWS against an existing control plane within the same account using the provided API Service.

Storage nodes currently require RHEL 9 or Rocky 9. The storage node installation consists of two parts. First, storage nodes are prepared for installation:

sudo yum -y install python3-pip
pip install sbcli-release --upgrade
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
sbcli sn deploy

Note the IP address and port on which the storage node (sn) is listening. Then, from one of the management nodes, storage nodes must be added to the cluster via the CLI:

sbcli sn add

Description of parameters:

| Parameter | Description | Example |
|---|---|---|
| cluster_id | Cluster UUID | 8ce9b324-d3dc-488b-ad90-e88ec7e05ca3 |
| node_ip | Storage node IP and mgmt port (SNodeAPI listener) | 172.168.0.5:5000 |
| ifname | Network interface to use for mgmt and data | eth0 |
| --number-of-distribs | Default: 4. Number of worker threads for the node; depends on node size (recommended: 4-6) | 4 |
| --max-prov | Maximum amount of GB to be provisioned via that node | 5000 |
| --max-lvol | Maximum number of lvols per node (affects memory demand) | 100 |
| --max-snap | Default: 500. Maximum number of snapshots per node | 1000 |
| --cpu-mask | Default: all cores except core 0. Core mask used to start the storage node | 0xFE |
| --partitions | Default: 1. Partitioning of devices for performance reasons. Change only for large disks. | 2 |
| --data-nics | If the storage network uses separate virtual or physical ports from the mgmt network, list them here | eth1, eth2 |
⚠️
Storage nodes currently require an amount of pre-allocated huge-page memory to start. The amount depends on the parameters max-prov, max-lvol and max-snap and is calculated when the node is added. It is recommended to reserve a reasonable amount of system memory as huge pages early after starting or rebooting an instance hosting a storage node, because at the time the storage node is added it may not be possible to claim enough huge-page memory even if enough system memory is available. This is due to system-internal memory fragmentation. Use the following command to reserve 6 GB of system memory for huge pages, which is the recommended minimum: sysctl vm.nr_hugepages=3072. See the memory requirements for details.
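Putting this together, a hedged example of reserving the recommended minimum of huge-page memory and then adding the node follows (run the sysctl on the storage node and sbcli sn add on a management node; the UUID, IP address, and sizing values are placeholders, and the positional arguments follow the parameter table above):

sudo sysctl vm.nr_hugepages=3072
sbcli sn add 8ce9b324-d3dc-488b-ad90-e88ec7e05ca3 172.168.0.5:5000 eth0 --max-prov 5000 --max-lvol 100 --max-snap 1000 --cpu-mask 0x3E --data-nics eth1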

CPU Mask

Currently, the following numbers of cores in the mask are supported per node: 4, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23. If more cores are assigned in the core mask, the next lower number from this list is used.

Remarks:

  • Avoid assigning core 0 to simplyblock; it should stay with the operating system.
  • If the host or VM is used exclusively as a storage node (disaggregated model), all cores except core 0 can be assigned to the simplyblock storage node.
  • On x86, it is recommended to turn off hyper-threading (pair hyper-threads into physical cores in the Linux operating system) before starting a storage node.

Example of core mask (assign 5 cores, leave out core 0): 0x3E (or 111110)

If it is not feasible to turn off hyper-threading because simplyblock runs co-located with other pods on Kubernetes workers, it is recommended to define the core mask so that hyper-threads belonging to the same physical core are all included in the simplyblock core mask. For example, if the system provides 24 hyper-threads (12 physical cores) and 7 hyper-threads are to be assigned to simplyblock, use the following mask:

0111 1000 0000 1110 0000 0000 (0x780E00)
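To avoid converting between binary and hexadecimal by hand, a small shell snippet can build the mask from a list of hyper-thread IDs (the IDs below are purely illustrative and correspond to the example mask above):

MASK=0
for cpu in 9 10 11 19 20 21 22; do MASK=$(( MASK | (1 << cpu) )); done
printf '0x%X\n' "$MASK"    # prints 0x780E00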

Simplyblock on Linux

To connect storage volumes to Linux hosts, no specific drivers or administration software are required. NVMe over TCP is available in all current Linux distributions.

For exact pre-requisites and a compatibility matrix, see “Using simplyblock on Linux Hosts”.
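For illustration, connecting a logical volume from a Linux host uses the in-kernel NVMe/TCP initiator together with nvme-cli. The target address and subsystem NQN below are placeholders; the actual connection details are provided when the logical volume is created, or handled automatically by the CSI driver.

sudo modprobe nvme_tcp
sudo nvme connect --transport tcp --traddr <STORAGE-NODE-IP> --trsvcid 4420 --nqn <LVOL-SUBSYSTEM-NQN>
sudo nvme list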

Simplyblock on Windows Server and VMware

NVMe over TCP will be natively supported on Windows Server 2025/vNext. For earlier versions of Windows Server, StarWind provides the necessary initiator implementation.

VMware supports NVMe over TCP in vSphere 8.0 or later. For more information, see the VMware documentation.

CSI Driver and Caching Node Deployment

The repository with the CSI driver and its documentation can be found at https://github.com/simplyblock-io/spdk-csi/tree/master/docs.