
Compute Profile and Compute

Compute Profile

A Compute Profile is a collection of default configurations for compute resources. You can think of it as a machine configuration or a blueprint used when PaletteAI deploys a cluster through Palette. Compute profiles are namespaced, which means you can have different compute profiles for different projects.

A project must reference a compute profile upon creation. The values configured in the compute profile are used by both methods of deploying an MLPlatform: the PaletteAI User Interface (UI) and a Kubernetes YAML manifest.

The compute profile referenced by the project plays a critical role in the UI as it allows data scientists and business users to deploy an MLPlatform without having to configure infrastructure-specific details.

A compute profile can reference the following resources:

  • Infrastructure provider-specific configurations, such as Edge.
  • A Settings resource containing Palette integration settings. If no settings are configured, PaletteAI defaults to the project's configured settingsRef.
  • SSH keys for node access.
  • ControlPlaneDefaults for control plane node configuration.
  • WorkerPoolDefaults for worker node configuration.
  • GPULimits for GPU limits per GPU family.
  • MinWorkerNodes for minimum worker node requirements.
The following is an example compute profile configuration.

apiVersion: palette.ai/v1alpha1
kind: ComputeProfile
metadata:
  name: edge-compute-profile
  namespace: project-a
spec:
  # SSH keys for node access
  sshKeys:
    - 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC7... test-user@test-machine'
    - 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC8... admin@admin-machine'

  # Edge configuration for edge deployments
  edge:
    vip: '10.10.189.24'
    ntpServers:
      - 'time.google.com'
      - 'time2.google.com'
    networkOverlayConfig:
      enabled: false
      staticIp: true
      cidr: '192.168.1.0/24'
      overlayNetworkType: 'VXLAN'
    isTwoNode: false

  # Control plane defaults - required field
  controlPlaneDefaults:
    nodeCount: 3 # Must be 1, 3, or 5 to maintain quorum
    workerNodeEligible: false
    labels:
      'palette.ai': 'true'
    annotations:
      'control-plane-annotation': 'control-plane-value'
      'kubernetes.io/description': 'Control plane node'
    taints:
      - key: 'node-role.kubernetes.io/control-plane'
        effect: 'NoSchedule'

  # Worker pool defaults
  workerPoolDefaults:
    labels:
      'palette.ai': 'true'
    annotations:
      'worker-annotation': 'worker-value'
      'kubernetes.io/description': 'Worker node'
    taints:
      - key: 'worker-taint'
        value: 'true'
        effect: 'NoExecute'

  # Minimum number of worker nodes that must be provisioned
  minWorkerNodes: 2

  # GPU limits per GPU family that may be requested by an MLPlatform
  gpuLimits:
    'NVIDIA-H100': 6
    'NVIDIA-A100': 6

Infrastructure Providers

PaletteAI supports Edge as an infrastructure provider.

The configuration settings exposed for Edge enable you to configure the same settings that you would when deploying an Edge cluster through Palette. The key difference is that the configuration settings are stored in the compute profile and are referenced by the MLPlatform during deployment. This is by design, so that data scientists, ML researchers, and business users do not have to configure infrastructure-specific settings when deploying an MLPlatform.

GPU Limits

A GPU limit is the maximum number of GPUs that can be requested by an MLPlatform. GPU limits are specified per GPU family. If you do not specify a GPU limit for a GPU family, the PaletteAI controller will not limit the number of GPUs that an MLPlatform can request.

For example, suppose a compute profile specifies a GPU limit of 4 for the NVIDIA H100 GPU family, and an MLPlatform requests 6 NVIDIA H100 GPUs.

apiVersion: palette.ai/v1alpha1
kind: ComputeProfile
metadata:
  name: edge-compute-profile
  namespace: project-a
spec:
  gpuLimits:
    'NVIDIA-H100': 4

The PaletteAI controller will reject the MLPlatform request because it exceeds the GPU limit. You can configure GPU limits for different GPU families in the compute profile. Because each project references its own compute profile, different projects can enforce different GPU limits for different GPU families.

Minimum Worker Nodes

The minimum worker nodes setting ensures that an MLPlatform deployment provisions at least the specified number of worker nodes. This is useful for workloads that require a minimum level of compute resources to function properly, such as distributed training jobs or high-availability applications.

Order of Precedence

PaletteAI resolves the minimum worker node requirement using the following order of precedence:

  1. WorkloadProfile Label: The palette.ai/min-nodes: N label on the WorkloadProfile referenced by the MLPlatform takes highest precedence.
  2. ComputeProfile Override: The minWorkerNodes field in the MLPlatform's computeProfileRef.override section.
  3. ComputeProfile: The minWorkerNodes field in the ComputeProfile referenced by the MLPlatform.

If none of these sources specify a minimum worker node requirement, no minimum is enforced.
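
For reference, the following sketch shows how the override in precedence item 2 might look on an MLPlatform. Only the computeProfileRef.override.minWorkerNodes path is taken from the description above; the apiVersion, kind, and the remaining fields are illustrative assumptions and may differ from your actual MLPlatform manifests.

apiVersion: palette.ai/v1alpha1 # assumed API group; verify against your MLPlatform definition
kind: MLPlatform
metadata:
  name: example-mlplatform # hypothetical name
  namespace: project-a
spec:
  computeProfileRef:
    name: edge-compute-profile
    # Override selected values from the referenced ComputeProfile for this deployment only.
    override:
      minWorkerNodes: 4 # takes precedence over the ComputeProfile's minWorkerNodes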

WorkloadProfile Label

You can specify the minimum number of worker nodes directly on a WorkloadProfile using the palette.ai/min-nodes label. This is useful when the minimum node requirement is tied to the specific workload being deployed rather than the infrastructure profile. Below is an example of a WorkloadProfile that specifies a minimum of three worker nodes.

apiVersion: mural.spectrocloud.com/v1alpha1
kind: WorkloadProfile
metadata:
  name: clearml-workload-profile
  namespace: project-a
  labels:
    palette.ai/platform: clearml
    palette.ai/platform-version: 7.14.5
    palette.ai/min-nodes: '3' # Requires minimum 3 worker nodes
spec:
  # WorkloadProfile specification...

ComputeProfile Configuration

For infrastructure-wide minimum requirements, configure the minWorkerNodes field directly in the ComputeProfile:

apiVersion: palette.ai/v1alpha1
kind: ComputeProfile
metadata:
  name: edge-compute-profile
  namespace: project-a
spec:
  minWorkerNodes: 2 # Requires minimum 2 worker nodes
  # Other ComputeProfile configuration...

Compute

A Compute is a resource that tracks and monitors the machines in your infrastructure environment that carry the required PaletteAI tags, and reports the healthy machines eligible for an MLPlatform deployment. A Compute can reference a Settings resource in the current namespace, or fall back to the project's configured settings if omitted. The Compute controller reconciles automatically every 30 seconds and updates the resource's status with the available compute resources.

To better illustrate how a Compute resource works, review the following example status output.

Status:
  Available Control Plane Compute:
    Architecture:  AMD64
    Available:     true
    Cpu Count:     4
    Instances:     6
    Machines:
      edge-5db0384219cfa0fa4ef97d53bf291b2e
      edge-228638428bf0078309b65730b24101ee
      edge-a94a384220dc840fa080ae0eb47894e0
      edge-2d463842d409106c40b0db6aa2cec28f
      edge-99193842f541279ef93982b6023c32b4
      edge-afac3842736c2b11fd82b93eac60e742
  Available Worker Compute:
    Architecture:  AMD64
    Available:     true
    Family:        NVIDIA H100
    Gpu Count:     8
    Instances:     3
    Machines:
      edge-e078384256765be6e92fc1118aa9f283
      edge-71bb3842c85b1731c37665c3d2ed0d10
      edge-d9673842ab48f990d65c11f504f13183
    Architecture:  AMD64
    Available:     true
    Family:        NVIDIA A100
    Gpu Count:     8
    Instances:     3
    Machines:
      edge-9baf38425dacad857c70ccdbabb48028
      edge-16a53842616769ab26a99001b0124419
      edge-a48938427145a5a6118034c9a54deb8b
    Resource Groups:
      network-pool:  "3"
      storage-tier:  "high-performance"

In this example, the following information can be extracted:

  • There are 6 available control plane compute resources, each with 4 CPU cores and the AMD64 architecture.
  • There are 3 available worker compute resources, each with 8 NVIDIA H100 GPUs and the AMD64 architecture.
  • There are 3 available worker compute resources, each with 8 NVIDIA A100 GPUs and the AMD64 architecture, assigned to the "network-pool" resource group with value "3" and the "storage-tier" resource group with value "high-performance".

PaletteAI uses this information to determine whether the requested resources are available before creating an MLPlatform deployment. GPU limits are also taken into consideration: for example, even though 24 NVIDIA H100 GPUs are physically available across the three worker machines above, a request is rejected if it exceeds the compute profile's NVIDIA-H100 limit.

Discovery

PaletteAI uses Palette to manage and track physical infrastructure devices. Once a device is added to Palette with the required tags and made available, PaletteAI will automatically discover the device and update the compute resource with the available information.

The Compute resource uses the provided Settings, or the project's configured Settings, to authenticate with Palette. PaletteAI then discovers the devices registered in that Palette tenant and project and records them in the Compute resource status.

tip

To learn how to add a bare metal device to Palette, check out the EdgeForge Workflow documentation.

Device Tags

For an Edge device to be discovered by PaletteAI, it must have the tag palette.ai: true. This tag prevents PaletteAI from finding and using Edge devices that are not intended for PaletteAI. If the Palette agent installed on the Edge device is unable to detect CPUs, GPUs, and GPU memory, platform engineers can manually add the following tags to the Edge device, and PaletteAI will fall back to using the tag values provided. The table below lists the supported tags; an example tag set follows the table.

| Tag | Required | Description | Example |
| --- | --- | --- | --- |
| palette.ai: true | Yes | Reserves the Edge device for PaletteAI usage. | palette.ai: true |
| gpus: <gpu_count> | No | Number of GPUs on the Edge device. | gpus: 8 |
| cpus: <cpu_count> | No | Number of CPUs on the Edge device. | cpus: 6 |
| gpu-memory: <gpu_memory> | No | GPU memory on the Edge device. Supported units: M, MB, MiB, G, GB, GiB. Units are case insensitive. | gpu-memory: 80G |
| gpu-family: <gpu_family> | No | GPU family on the Edge device. | gpu-family: nvidia-a100 |
| palette.ai/control-plane: bool | No | Indicates the Edge device is eligible to be a control plane node. | palette.ai/control-plane: true |
| palette.ai/worker: bool | No | Indicates the Edge device is eligible to be a worker node. | palette.ai/worker: true |
| palette.ai.rg/<group_name>: <value> | No | Assigns the Edge device to a resource group. | palette.ai.rg/network-pool: 1 |
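
For illustration, an Edge device intended as a GPU worker might carry a tag set like the following. The specific CPU count, GPU count, memory, and GPU family values are hypothetical and should be replaced with values that match your hardware.

# Reserve the device for PaletteAI and describe its hardware
palette.ai: true
cpus: 64 # hypothetical CPU count
gpus: 8 # hypothetical GPU count
gpu-memory: 80G
gpu-family: nvidia-h100
palette.ai/worker: true
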
Eligible Control Plane

By default, PaletteAI excludes an Edge device from being a Kubernetes control plane node if it has GPUs. If you want to allow an Edge device with GPUs to act as a control plane node, you can add the tag palette.ai/control-plane: true to the Edge device.

Eligible Worker

By default, PaletteAI excludes an Edge device from being a Kubernetes worker node if it has no GPUs. If you want to allow an Edge device without GPUs to act as a worker node, you can add the tag palette.ai/worker: true to the Edge device.

warning

Never tag an Edge device with both palette.ai/control-plane: true and palette.ai/worker: true. If you do, the device will be treated as a worker node, not a control plane node.

Resource Groups

Resource groups provide a way to organize and categorize Edge devices for granular resource management and allocation. Edge devices can be assigned to one or more resource groups using labels with the palette.ai.rg/ prefix.

Resource groups are useful for:

  • Network segmentation: Group devices by network zones or VLANs
  • Storage tiers: Categorize devices by storage capabilities or performance tiers
  • Geographical location: Organize devices by data center, region, or availability zone
  • Hardware specifications: Group devices with similar capabilities or configurations
  • Workload isolation: Separate devices for different types of ML workloads

Resource Group Label Format

Resource group labels follow the pattern: palette.ai.rg/<group_name>: <value>

  • <group_name>: The name of the resource group (can contain slashes for hierarchical naming)
  • <value>: The value assigned to the device within that resource group

Examples

# Network pool assignment
palette.ai.rg/network-pool: "1"

# Storage tier classification
palette.ai.rg/storage-tier: "high-performance"

# Geographical location
palette.ai.rg/region/us-west: "zone-a"

# Multiple resource group assignments
palette.ai.rg/network-pool: "1"
palette.ai.rg/storage-pool: "2"
palette.ai.rg/workload-type: "training"

Resource group information is automatically discovered by PaletteAI and included in the Compute resource status, allowing platform administrators to manage resource allocation across arbitrary device groups.

Minimum Worker Nodes with GPU Optimization

Consider a scenario where you have the following infrastructure:

  • Four available worker nodes, each equipped with 8 GPUs
  • A ComputeProfile configured with minWorkerNodes: 3
  • An MLPlatform request for 8 GPUs in dedicated mode

In this scenario, a single GPU-enabled node would be sufficient to fulfill the 8 GPU request for the MLPlatform. However, because the minimum worker node requirement is set to three, PaletteAI must provision additional worker nodes to meet this constraint.

To fulfill the three-node minimum while avoiding unnecessary GPU allocation, PaletteAI will:

  1. Allocate one node with 8 GPUs to satisfy the GPU requirement
  2. Provision two additional nodes that have the palette.ai/worker: true label, if available.
  3. If no additional nodes with the palette.ai/worker: true label are available, PaletteAI will select additional nodes that have GPUs matching one of the requested GPU families, but with the lowest possible GPU count, to satisfy the minimum worker node requirement.
  4. If it is unable to select additional nodes with GPUs matching one of the requested GPU families, PaletteAI will select additional worker nodes with GPUs of any family, but with the lowest possible GPU count, to satisfy the minimum worker node requirement.

The palette.ai/worker: true label is crucial in these scenarios because PaletteAI normally excludes nodes without GPUs from being worker nodes. Nodes tagged with palette.ai/worker: true are eligible to serve as worker nodes even if they have no GPUs, which prevents PaletteAI from allocating additional GPU-enabled nodes when only general compute capacity is needed.

Example Edge device configuration for worker-eligible nodes:

| Node Type | GPUs | Node Labels | Role in MLPlatform |
| --- | --- | --- | --- |
| gpu-node-1 | 8 | None | GPU workload node |
| cpu-node-1 | 0 | palette.ai/worker: true | General worker node |
| cpu-node-2 | 0 | palette.ai/worker: true | General worker node |
This approach ensures high availability and distributed workload capacity while efficiently utilizing your GPU resources.