# Compute Pools
A Compute Pool is a group of shared Compute resources used to create Kubernetes clusters where your AI/ML applications run. In the hub-spoke architecture, each Compute Pool becomes a spoke cluster on which applications and models are deployed.
## Types of Compute Pools
There are two types of Compute Pools, allowing you to deploy your AI/ML applications and models on the infrastructure that best suits your needs.
| Mode | Description | Palette Required | Use Case |
|---|---|---|---|
| Dedicated | A single cluster used to host a single App Deployment | Yes | - Isolated workloads - Specific hardware needs - Stringent compliance or security requirements |
| Shared | One or multiple clusters that share resources and are used to host multiple App Deployments | Yes | - Resource efficiency for workloads with similar needs - Maximize hardware utilization to reduce costs - Development or staging environments |
## Compute Pool Provisioning
You can create shared or dedicated Compute Pools before deploying AI/ML applications. Provisioning Kubernetes clusters in advance reduces the number of steps involved when deploying applications and models by allowing data scientists to select an existing Compute Pool for their workload rather than create a new one.
The following table shows which types of Compute Pools you can create in each workflow.
| Workflow | Dedicated | Shared |
|---|---|---|
| App Deployment | ✅ | ❌ |
| Compute Pool | ✅ | ✅ |
| Model Deployment | ✅ | ❌ |
When you create a dedicated or shared Compute Pool, PaletteAI provisions the underlying Kubernetes infrastructure:
1. PaletteAI validates the configuration against available Compute resources.
2. PaletteAI requests cluster provisioning from Palette using the Profile Bundle's Cluster Profile.
3. Palette provisions the Kubernetes cluster.
4. PaletteAI retrieves the cluster's kubeconfig from Palette.
5. PaletteAI registers the cluster as an OCM spoke with the hub.
6. PaletteAI installs the required controllers (Flux, OCM work agent) on the spoke.
Once complete, the Compute Pool is ready to receive applications using Application or Fullstack Profile Bundles.
To learn how applications are deployed to Compute Pools, refer to our App Deployments guide. For an in-depth look at the complete deployment flow, from Compute Pools to workload provisioning, refer to our Hub and Spoke Model guide.
## Worker Pool Names
Each worker pool in a Compute Pool's `nodePoolRequirements.workerPools` must include a `name` field. If you do not provide one, PaletteAI uses a mutating webhook to automatically generate a unique identifier (UUID) during validation. Worker pool names must not exceed 63 characters to comply with Kubernetes label constraints.
Worker pool names provide a stable reference for matching allocated machine pools back to their requirements. This enables accurate pool evaluation when requirements change, such as during scaling operations.
```yaml
workerPools:
  - name: gpu-workers # Automatically generated by the mutating webhook if not provided
    gpu:
      family: 'NVIDIA A100'
      gpuCount: 8
    minWorkerNodes: 2
```
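The defaulting and validation behavior described above can be sketched in a few lines. This is an illustrative Python sketch, not PaletteAI source code; the function name and input shape are assumptions for the example.

```python
import uuid

# Kubernetes label values are limited to 63 characters.
MAX_NAME_LENGTH = 63

def default_worker_pool_name(worker_pool: dict) -> dict:
    """Illustrative sketch of the mutating webhook's defaulting logic:
    assign a UUID-based name when none is provided, and reject names
    that exceed the Kubernetes label length limit."""
    name = worker_pool.get("name")
    if not name:
        # Generated UUID strings are 36 characters, well under the limit.
        worker_pool["name"] = str(uuid.uuid4())
    elif len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"worker pool name {name!r} exceeds {MAX_NAME_LENGTH} characters")
    return worker_pool

pool = default_worker_pool_name({"gpu": {"family": "NVIDIA A100", "gpuCount": 8}})
assert len(pool["name"]) <= MAX_NAME_LENGTH
```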
## Hardware Capacity and Allocation
PaletteAI tracks hardware resources and machine pool allocations in the Compute Pool CRD's `status` field. This information helps PaletteAI determine whether a Compute Pool can accept additional workloads and is used for GPU quota enforcement at the Project level.
- `status.hardwareCapacity` - Total resources available across all control plane and worker nodes in the Compute Pool's clusters. This includes CPU count, architecture, memory, and GPU family and count.
- `status.hardwareAllocation` - Resources currently allocated to App Deployments running on the Compute Pool. Allocation is summed across all applications deployed to the pool.
- `status.status` - Overall health of the Compute Pool. Possible values include `Running`, `Failed`, `Provisioning`, `Unhealthy`, `Updating`, and `Deleting`.
- `status.cloudConfigUUID` - The UUID of the cloud configuration used to provision the Compute Pool. This is used to update the machine pools of the Compute Pool.
- `status.allocatedMachinePools` - Tracks which Edge Hosts are allocated to each machine pool within the Compute Pool. Each entry includes:
  - `name` - The machine pool name (for example, `control-plane-pool` or `worker-pool-nvidia-amd64-0`).
  - `nodeType` - The type of nodes in this machine pool. Possible values:
    - `ControlPlaneOnly` - Only control plane nodes.
    - `WorkerOnly` - Only worker nodes.
    - `ControlPlaneAndWorker` - Nodes serve as both control plane and worker.
  - `hosts` - Map of Edge Host UIDs to their details (architecture, CPU count, memory, GPU count and memory by family, status, and status timestamp). Each host includes:
    - `status` - Provisioning and health status of the host. Possible values: `Initial`, `Provisioning`, `Healthy`, `Unhealthy`, `Failed`, `Deleting`, `Unknown`.
    - `statusUpdatedAt` - Timestamp of the last status change for the host.
  - `labels` - Kubernetes labels applied to nodes in this pool (for example, `["control-plane"]` or `["worker", "gpu-family-nvidia"]`).
  - `workerPoolRequirementsName` - (Worker pools only) The name of the WorkerPool requirement that this machine pool was created from. Used to match allocated pools back to their requirements for scaling operations.
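The capacity check these fields enable can be illustrated with a short sketch. The function below is hypothetical, not the actual PaletteAI implementation; it shows how free GPU capacity per family could be derived by subtracting `hardwareAllocation` from `hardwareCapacity`.

```python
def free_gpus_by_family(capacity: list, allocation: list) -> dict:
    """Hypothetical sketch: subtract allocated GPU counts from total
    capacity, per GPU family, to decide whether a pool can accept
    an additional workload."""
    free = {}
    # Sum total GPUs across all capacity entries.
    for entry in capacity:
        for gpu in entry.get("gpu", []):
            free[gpu["family"]] = free.get(gpu["family"], 0) + gpu["gpuCount"]
    # Subtract GPUs already allocated to App Deployments.
    for entry in allocation:
        for gpu in entry.get("gpu", []):
            free[gpu["family"]] = free.get(gpu["family"], 0) - gpu["gpuCount"]
    return free

capacity = [{"architecture": "AMD64", "gpu": [{"family": "NVIDIA-A100", "gpuCount": 8}]}]
allocation = [{"architecture": "AMD64", "gpu": [{"family": "NVIDIA-A100", "gpuCount": 4}]}]
print(free_gpus_by_family(capacity, allocation))  # {'NVIDIA-A100': 4}
```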
## Machine Pool Lifecycle
PaletteAI automatically reconciles machine pools to match the Compute Pool's requirements. When you update pool requirements, PaletteAI classifies the changes and performs the appropriate operations:
- Create - Adds a machine pool for a requirement that does not have an allocated pool.
- Delete - Removes a machine pool that no longer matches any requirement. PaletteAI deletes individual machines first, marking their hosts as `Deleting`. Once all machines are confirmed removed, the pool itself is deleted.
- Scale - Adds or removes hosts from a pool to match updated requirements. When removing hosts, PaletteAI waits for at least one replacement host to reach `Healthy` status before removing the old hosts, preventing downtime.
- Replace - Rebuilds a pool when all hosts are invalid (for example, due to hardware requirement changes). PaletteAI adds replacement hosts while keeping one "bridge" host from the old set. Once a replacement host reaches `Healthy` status, the bridge host is removed.
For single-node clusters (where `singleNodeCluster` is set to `true` in the control plane configuration), PaletteAI only manages the control plane machine pool. Worker pool requirements defined in the specification are not used for machine pool operations.
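The first step of this reconciliation, deciding which pools to create, delete, or re-evaluate, can be sketched as a set comparison between requirement names and allocated pool names. This is a hypothetical illustration, not PaletteAI code; the function name and return shape are assumptions.

```python
def classify_pool_operations(requirements: set, allocated: set) -> dict:
    """Hypothetical sketch of how requirement changes map to machine
    pool operations, keyed by worker pool requirement name."""
    return {
        # A requirement with no allocated pool -> create a pool.
        "create": sorted(requirements - allocated),
        # An allocated pool whose requirement was removed -> delete the pool.
        "delete": sorted(allocated - requirements),
        # Matching pools are re-evaluated for scale or replace operations.
        "evaluate": sorted(requirements & allocated),
    }

ops = classify_pool_operations({"gpu-workers", "cpu-workers"}, {"gpu-workers", "old-pool"})
print(ops)  # {'create': ['cpu-workers'], 'delete': ['old-pool'], 'evaluate': ['gpu-workers']}
```

The `workerPoolRequirementsName` field in the status is what makes the matching between requirements and allocated pools possible.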
View hardware and machine pool allocation information associated with your Compute Pool with the following command.
```shell
kubectl get computepool <pool-name> --namespace <project-namespace> --output yaml
```
```yaml
status:
  status: Running
  hardwareCapacity:
    - architecture: AMD64
      totalCPU: 32
      totalMemory: '128Gi'
      gpu:
        - family: 'NVIDIA-A100'
          gpuCount: 8
  hardwareAllocation:
    - architecture: AMD64
      gpu:
        - family: 'NVIDIA-A100'
          gpuCount: 4
  cloudConfigUUID: cloud-config-uuid-1
  allocatedMachinePools:
    - name: control-plane-pool
      nodeType: ControlPlaneAndWorker
      hosts:
        host-uid-1:
          architecture: AMD64
          cpuCount: 16
          memoryGB: 64
        host-uid-2:
          architecture: AMD64
          cpuCount: 16
          memoryGB: 64
      labels:
        - control-plane
    - name: worker-pool-nvidia-amd64-0
      nodeType: WorkerOnly
      workerPoolRequirementsName: gpu-workers
      hosts:
        host-uid-3:
          architecture: AMD64
          cpuCount: 32
          memoryGB: 128
          gpuCountByFamily:
            NVIDIA-A100: 4
          gpuMemoryGBByFamily:
            NVIDIA-A100: 160
          status: Healthy
          statusUpdatedAt: '2024-01-15T10:30:00Z'
        host-uid-4:
          architecture: AMD64
          cpuCount: 32
          memoryGB: 128
          gpuCountByFamily:
            NVIDIA-A100: 4
          gpuMemoryGBByFamily:
            NVIDIA-A100: 160
          status: Healthy
          statusUpdatedAt: '2024-01-15T10:32:00Z'
      labels:
        - worker
        - gpu-family-nvidia
  aiWorkloadRefs:
    - name: training-job-1
      namespace: project-a
```
## Autoscaling
Compute Pools support automatic scaling based on CPU and GPU utilization metrics. Autoscaling adjusts the number of machines in response to workload demand, minimizing idle resources and handling demand spikes.
To enable autoscaling, reference a ScalingPolicy resource in the Compute Pool's clusterVariant configuration. Scaling policies define utilization thresholds, scaling durations, resource bounds, and cooldown periods.
```yaml
spec:
  clusterVariant:
    dedicated:
      scalingPolicyRef:
        name: my-scaling-policy
        namespace: default
```
### Scaling Behavior
When autoscaling is enabled, PaletteAI continuously monitors resource utilization and makes scaling decisions based on sustained metric values:
- Scale Up - Triggered when the minimum average utilization over the scale-up duration exceeds the scale-up threshold. PaletteAI adds machines to the pool and waits for them to reach `Healthy` status.
- Scale Down - Triggered when the maximum average utilization over the scale-down duration falls below the scale-down threshold. PaletteAI removes machines from the pool.
- Cooldown - After a successful scaling operation, the pool enters a cooldown period to allow metrics to stabilize before the next scaling decision.
Scaling is only triggered when utilization strictly crosses the configured thresholds (not when equal to the threshold). This prevents unnecessary scaling operations when utilization is at the boundary.
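The strict-inequality rule can be demonstrated with a minimal sketch. The function below is hypothetical, not the actual controller logic; it only shows that utilization exactly at a threshold triggers no scaling.

```python
def scaling_decision(min_avg_up: float, max_avg_down: float,
                     scale_up_threshold: float, scale_down_threshold: float) -> str:
    """Hypothetical sketch of the scaling decision: thresholds must be
    strictly crossed, so equality results in no operation."""
    if min_avg_up > scale_up_threshold:
        return "scale-up"
    if max_avg_down < scale_down_threshold:
        return "scale-down"
    return "no-op"

print(scaling_decision(0.85, 0.60, 0.80, 0.30))  # scale-up: 0.85 strictly exceeds 0.80
print(scaling_decision(0.80, 0.60, 0.80, 0.30))  # no-op: utilization equal to the threshold
```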
PaletteAI tracks the status of each scaling action in the ComputePoolEvaluation resource. Host status transitions through `Provisioning`, `Healthy`, `Unhealthy`, `Failed`, or `Deleting` states during scaling operations.
If a scale-up operation does not complete within the configured abort duration, PaletteAI aborts the operation by removing pending nodes that have not reached `Healthy` status. Successfully provisioned nodes are retained, and the pool transitions to a cooldown period. Scale-down operations are not aborted and continue until all node removals complete.
Scaling policies apply to both dedicated and shared Compute Pools. For more information about configuring scaling policies, refer to the ScalingPolicy CRD documentation.
## Resource Groups
Resource groups let you restrict which machines a Compute Pool can use. This is useful when you need workloads to run on specific hardware, such as machines in a particular network zone or with high-performance storage.
To use resource groups, tag your machines in Palette with labels that begin with `palette.ai.rg/`. For example, `palette.ai.rg/network-pool: '1'` or `palette.ai.rg/storage-tier: 'high-performance'`. Then specify the same labels in the `controlPlaneResourceGroups` or `workerResourceGroups` fields in your ComputePool resource.
```yaml
spec:
  clusterVariant:
    controlPlaneResourceGroups:
      network-pool: '1'
    workerResourceGroups:
      storage-tier: 'high-performance'
```
The `palette.ai.rg/<key>: "<value>"` pair assigned to the machine must match the key-value pair defined in the Compute Pool for the machine to be added to the Compute Pool.
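The matching rule can be sketched as follows. This is an illustrative Python sketch, not PaletteAI code; the function name and input shape are assumptions for the example.

```python
RG_PREFIX = "palette.ai.rg/"

def host_matches_resource_groups(host_labels: dict, resource_groups: dict) -> bool:
    """Hypothetical sketch: a host qualifies for a pool only if every
    resource group key-value pair is present among the host's
    palette.ai.rg/ labels with the same value."""
    # Strip the prefix from the host's resource group labels.
    host_groups = {
        key[len(RG_PREFIX):]: value
        for key, value in host_labels.items()
        if key.startswith(RG_PREFIX)
    }
    return all(host_groups.get(key) == value for key, value in resource_groups.items())

labels = {"palette.ai.rg/storage-tier": "high-performance", "other-label": "x"}
print(host_matches_resource_groups(labels, {"storage-tier": "high-performance"}))  # True
print(host_matches_resource_groups(labels, {"network-pool": "1"}))  # False
```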
For single-node clusters (where `singleNodeCluster` is set to `true` in the control plane configuration), worker pool host selection uses `controlPlaneResourceGroups` instead of `workerResourceGroups`. This ensures that the single node selected for both control plane and worker roles matches the control plane resource group constraints.
## Permissions
Compute Pool operations are controlled by role-based access control (RBAC) permissions. Depending on your role, a read-only view may be displayed or certain actions may be disabled.
### Update Permissions
The `spectrocloud.com/computepools:update` permission controls your ability to:
- Edit Compute Pool settings (name, description, annotations, labels, auto-scaling policy)
- Modify the profile bundle associated with the Compute Pool
Users without this permission can view Compute Pool details and configuration but cannot modify settings or profile bundles. When viewing a Compute Pool without update permissions, a read-only view will be displayed. The Save and Discard buttons are not displayed when viewing profile bundles without the required permission.
### Delete Permissions
The `spectrocloud.com/computepools:delete` permission controls your ability to delete Compute Pools. The Delete Compute Pool option is not displayed in the Settings menu for users without this permission.
If an action or button described in the documentation is not displayed, your role does not include the required permission. Contact your administrator to request access.
## Resources
Refer to the following articles to learn more about the role Compute Pools play in PaletteAI:
- Compute - Discover hardware used to create Compute Pools
- Compute Config - Configure default cluster settings
- App Deployments - Deploy applications to Compute Pools
- Hub-Spoke Model - Learn how Compute Pools fit into PaletteAI's architecture