Manual installation

Overview

Due to the many options available to you for installing Kubernetes clusters, this document will not go into the specifics of setting up the cluster. Rather, it will provide you with guidance and requirements for your cluster.

Nodes

You will need to provision 2 groups of nodes:

  1. For training - These nodes will run the Composabl components, as well as the actual training processes.
  2. For simulators - These nodes will be used exclusively for simulators.

1. Sizing

Autoscaling

If your cluster supports autoscaling, you can continue to 2. Labels. If you need to size your cluster manually, follow the guide below.

Manual scaling

Nodes in each group must be sized so that they can run at least a single workload.

For training nodes, this means you need to consider the amount of resources you will request in your training config - more specifically the resources_cpus and resources_memory settings of your agent. The training node must cover these in addition to the requirements for the Composabl components (which also run on the training nodes) and some small overhead for cluster components.

Nodes that will be running simulators are easier to size - they just need to cover at least a single simulator, the requirements of which are set inside the kubernetes configuration of your agent as sim_cpu and sim_memory.

Lastly, you must also consider the workers parameter, which sets the number of workers that will be spawned. You need to provision enough resources to cover resources_cpus and resources_memory for each worker. Each worker also spawns its own simulator instance, so the simulator nodes must cover one simulator per worker as well.

Examples

Let's consider the following minimal training config:

json
{
  "target": {
    "kubernetes": {
      "image": "my-sim/image:latest",
      "sim_cpu": "2",
      "sim_memory": "2Gi"
    }
  },
  "trainer": {
    "workers": 2,
    "resources_cpus": "2",
    "resources_memory": "4Gi"
  }
}

In this configuration, we have specified the requirements for both types of workloads. Here's a step-by-step guide to calculate the required node sizing for Kubernetes:

  1. Calculate Total Resources for Simulator Pods:
    • Number of required simulators = number of workers: 2
    • Each simulator pod requires:
      • 2 CPU
      • 2 Gi of Memory
    • The total is then
      • CPU: 2 workers * 2 CPU = 4 CPU
      • Memory: 2 workers * 2 Gi = 4 Gi
  2. Calculate Total Resources for Worker Pods:
    • Number of required workers = number of defined workers: 2
    • Each worker pod requires:
      • 2 CPU
      • 4 Gi of Memory
    • The total is then
      • CPU: 2 workers * 2 CPU = 4 CPU
      • Memory: 2 workers * 4 Gi = 8 Gi
  3. Add Kubernetes Overhead:
    • Kubernetes requires a static amount of overhead per node. For simplicity, let's assume:
      • Kubernetes CPU overhead per node: 0.5 cores, so 500m
      • Kubernetes Memory overhead per node: 1Gi
  4. For training nodes only: add Composabl system components overhead
    • To run Composabl on the cluster, you need some additional resources:
      • 2 CPU
      • 4 Gi of memory
    • This is in total, not per node
  5. Combine everything:
    • Simulators:
      • 4 CPU + 0.5 CPU overhead: 4.5 CPU
      • 4 Gi + 1 Gi: 5 Gi
    • Training:
      • 4 CPU + 0.5 CPU (Kubernetes) + 2 CPU (Composabl): 6.5 CPU
      • 8 Gi + 1 Gi (Kubernetes) + 4 Gi (Composabl): 13 Gi
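
If you want to script steps 1 and 2, the pod totals can be read straight from the config. The sketch below assumes the config above is saved as training.json (a hypothetical filename), that jq is installed, and that the CPU and memory values are whole numbers; the overheads from steps 3 and 4 still need to be added on top.

bash
# Recompute the pod totals from the training config (overheads not included).
WORKERS=$(jq -r '.trainer.workers' training.json)
SIM_CPU=$(jq -r '.target.kubernetes.sim_cpu' training.json)
SIM_MEM=$(jq -r '.target.kubernetes.sim_memory | sub("Gi$"; "")' training.json)
TRAIN_CPU=$(jq -r '.trainer.resources_cpus' training.json)
TRAIN_MEM=$(jq -r '.trainer.resources_memory | sub("Gi$"; "")' training.json)

echo "Simulator pods: $((WORKERS * SIM_CPU)) CPU, $((WORKERS * SIM_MEM))Gi memory"
echo "Worker pods:    $((WORKERS * TRAIN_CPU)) CPU, $((WORKERS * TRAIN_MEM))Gi memory"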

This example assumes you want to use a single simulator node and a single training node. In a real-world scenario, you will likely want to split the workload over multiple nodes.

In that case, you need to make sure that:

  • Each node can accommodate at least one pod with its respective resources.
  • Each node must also have additional resources to cover the Kubernetes overhead; you can check each node's allocatable capacity as shown below.
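
The following command lists each node's allocatable CPU and memory, which is what pods can actually request once system reservations are subtracted:

bash
kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'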

2. Labels

Both groups of nodes must be labeled accordingly. Training nodes must be labeled as agentpool: train. Simulator nodes must be labeled as agentpool: sims. You may be able to define this during your cluster setup, but if not, use the following commands:

bash
kubectl label node <my-training-node> agentpool=train --overwrite
kubectl label node <my-simulator-node> agentpool=sims --overwrite

Replace the values between <> with the names of the nodes you'd like to assign to each pool.
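
To verify the labels were applied, list the nodes with the agentpool label shown as an extra column:

bash
kubectl get nodes -L agentpool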

Storage

The components also need access to (semi)persistent, shared storage. This section will detail the types and amount of storage needed.

Composabl needs the following PersistentVolumeClaims in the composabl-train namespace:

  1. pvc-controller-data with a size of ±1Gi and an accessMode of ReadWriteOnce (or better). When using Azure, you will need to set the nobrl mountOption for this PVC, as this is required for the Composabl controller to function.

  2. pvc-training-results with a suitable size - this is where your final agent data is stored before it is uploaded to the No-code application. Its accessMode needs to be ReadWriteMany (RWX). A good initial size is to match historian-tmp.

  3. historian-tmp is used as temporary storage for historian data. It needs an accessMode of ReadWriteOnce, and its size will depend on the length of your training sessions. We recommend starting with 5Gi.

The size of pvc-training-results and historian-tmp depends on the number and size of training jobs you want to run simultaneously on your cluster. If you plan on running long-lived training sessions with many cycles, you may want to increase the capacity for both.
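
As an illustration, the claims can be created with a manifest like the sketch below for pvc-training-results; the other two follow the same pattern with the access modes and sizes listed above. It assumes the composabl-train namespace already exists and omits storageClassName, so your cluster's default storage class is used (for this claim, pick one that supports ReadWriteMany).

bash
# Sketch: create pvc-training-results in the composabl-train namespace.
# The storage class is left to the cluster default; choose one with RWX support.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-training-results
  namespace: composabl-train
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
EOF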

Private image registry

If you want to use a private registry for simulator images, you will need to set up this registry yourself and make sure the cluster is able to pull images from it.
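
One common approach is an image pull secret in the composabl-train namespace, sketched below. This is only one option: how the secret is referenced by the simulator pods depends on your setup, and cloud providers often offer node-level registry integration instead.

bash
# Sketch: store registry credentials as an image pull secret.
# Replace the values between <> with your registry's details.
kubectl create secret docker-registry <my-registry-secret> \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password> \
  --namespace composabl-train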

Next steps

Once your cluster is running and you have verified your setup is working, you can continue to Installing Composabl.