
Compute Profile and Compute

Compute Profile

A Compute Profile is a collection of default configurations for compute resources. You can think of it as a machine configuration or a blueprint used when PaletteAI deploys a cluster through Palette. Compute profiles are namespaced, which means you can have different compute profiles for different projects.

A project must reference a compute profile upon creation. The values configured in the compute profile are used by both methods of deploying an MLPlatform: the PaletteAI User Interface (UI) and a Kubernetes YAML manifest.

The compute profile referenced by the project plays a critical role in the UI as it allows data scientists and business users to deploy an MLPlatform without having to configure infrastructure-specific details.

A compute profile can reference the following resources:

  • Infrastructure provider-specific configurations, such as Edge.
  • A Settings resource containing Palette integration settings. If no settings are configured, PaletteAI defaults to the project's configured settingsRef.
  • SSH keys for node access.
  • ControlPlaneDefaults for control plane node configuration.
  • WorkerPoolDefaults for worker node configuration.
  • GPULimits for GPU limits per GPU family.
  • MinWorkerNodes for minimum worker node requirements.
The following is an example compute profile configuration.

apiVersion: palette.ai/v1alpha1
kind: ComputeProfile
metadata:
  name: edge-compute-profile
  namespace: project-a
spec:
  # SSH keys for node access
  sshKeys:
    - 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC7... test-user@test-machine'
    - 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC8... admin@admin-machine'

  # Edge configuration for edge deployments
  edge:
    vip: '10.10.189.24'
    ntpServers:
      - 'time.google.com'
      - 'time2.google.com'
    networkOverlayConfig:
      enabled: false
      staticIp: true
      cidr: '192.168.1.0/24'
      overlayNetworkType: 'VXLAN'
    isTwoNode: false

  # Control plane defaults - required field
  controlPlaneDefaults:
    nodeCount: 3 # Must be 1, 3, or 5 to maintain quorum
    workerNodeEligible: false
    labels:
      'palette.ai': 'true'
    annotations:
      'control-plane-annotation': 'control-plane-value'
      'kubernetes.io/description': 'Control plane node'
    taints:
      - key: 'node-role.kubernetes.io/control-plane'
        effect: 'NoSchedule'

  # Worker pool defaults
  workerPoolDefaults:
    labels:
      'palette.ai': 'true'
    annotations:
      'worker-annotation': 'worker-value'
      'kubernetes.io/description': 'Worker node'
    taints:
      - key: 'worker-taint'
        value: 'true'
        effect: 'NoExecute'

  # Minimum number of worker nodes that must be provisioned
  minWorkerNodes: 2

  # GPU limits per GPU family that may be requested by an MLPlatform
  gpuLimits:
    'NVIDIA-H100': 6
    'NVIDIA-A100': 6

Infrastructure Providers

PaletteAI supports Edge as an infrastructure provider.

The configuration settings exposed for Edge enable you to configure the same settings that you would when deploying an Edge cluster through Palette. The key difference is that the configuration settings are stored in the compute profile and are referenced by the MLPlatform during deployment. This is by design, so that data scientists, ML researchers, and business users do not have to configure infrastructure-specific settings when deploying an MLPlatform.

GPU Limits

A GPU limit is the maximum number of GPUs that can be requested by an MLPlatform. GPU limits are specified per GPU family. If you do not specify a GPU limit for a GPU family, the PaletteAI controller will not limit the number of GPUs that an MLPlatform can request.

For example, suppose a compute profile specifies a GPU limit of 4 for the NVIDIA H100 GPU family, and an MLPlatform requests 6 NVIDIA H100 GPUs.

apiVersion: palette.ai/v1alpha1
kind: ComputeProfile
metadata:
  name: edge-compute-profile
  namespace: project-a
spec:
  gpuLimits:
    'NVIDIA-H100': 4

The PaletteAI controller will reject the MLPlatform request because it exceeds the GPU limit. You can configure GPU limits for different GPU families in the compute profile. Because each project references its own compute profile, different projects can enforce different GPU limits for different GPU families.

Minimum Worker Nodes

The minimum worker nodes setting ensures that an MLPlatform deployment provisions at least the specified number of worker nodes. This is useful for workloads that require a minimum level of compute resources to function properly, such as distributed training jobs or high-availability applications.

Order of Precedence

PaletteAI resolves the minimum worker node requirement using the following order of precedence:

  1. WorkloadProfile Label: The palette.ai/min-nodes: N label on the WorkloadProfile referenced by the MLPlatform takes highest precedence.
  2. ComputeProfile Override: The minWorkerNodes field in the MLPlatform's computeProfileRef.override section.
  3. ComputeProfile: The minWorkerNodes field in the ComputeProfile referenced by the MLPlatform.

If none of these sources specify a minimum worker node requirement, no minimum is enforced.
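
For reference, the following sketch shows how the override in precedence item 2 might look on an MLPlatform. Only the computeProfileRef.override.minWorkerNodes path is taken from the description above; the apiVersion, kind, and the remaining fields are illustrative assumptions and may differ from your actual MLPlatform manifests.

apiVersion: palette.ai/v1alpha1 # assumed API group; verify against your MLPlatform definition
kind: MLPlatform
metadata:
  name: example-mlplatform # hypothetical name
  namespace: project-a
spec:
  computeProfileRef:
    name: edge-compute-profile
    # Override selected values from the referenced ComputeProfile for this deployment only.
    override:
      minWorkerNodes: 4 # takes precedence over the ComputeProfile's minWorkerNodes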

WorkloadProfile Label

You can specify the minimum number of worker nodes directly on a WorkloadProfile using the palette.ai/min-nodes label. This is useful when the minimum node requirement is tied to the specific workload being deployed rather than the infrastructure profile. Below is an example of a WorkloadProfile that specifies a minimum of three worker nodes.

apiVersion: mural.spectrocloud.com/v1alpha1
kind: WorkloadProfile
metadata:
  name: clearml-workload-profile
  namespace: project-a
  labels:
    palette.ai/platform: clearml
    palette.ai/platform-version: 7.14.5
    palette.ai/min-nodes: '3' # Requires minimum 3 worker nodes
spec:
  # WorkloadProfile specification...

ComputeProfile Configuration

For infrastructure-wide minimum requirements, configure the minWorkerNodes field directly in the ComputeProfile:

apiVersion: palette.ai/v1alpha1
kind: ComputeProfile
metadata:
  name: edge-compute-profile
  namespace: project-a
spec:
  minWorkerNodes: 2 # Requires minimum 2 worker nodes
  # Other ComputeProfile configuration...

Compute

A Compute is a resource that tracks and monitors the machines in your infrastructure environment that carry the required PaletteAI tags, and reports the healthy machines eligible for an MLPlatform deployment. A Compute can reference a Settings resource in the current namespace, or fall back to the project's configured settings if omitted. The Compute controller reconciles automatically every 30 seconds and updates the resource's status with the available compute resources.

To better illustrate how a Compute resource works, review the following example status output.

Status:
  Available Control Plane Compute:
    Architecture:  AMD64
    Available:     true
    Cpu Count:     4
    Instances:     6
    Machines:
      edge-5db0384219cfa0fa4ef97d53bf291b2e
      edge-228638428bf0078309b65730b24101ee
      edge-a94a384220dc840fa080ae0eb47894e0
      edge-2d463842d409106c40b0db6aa2cec28f
      edge-99193842f541279ef93982b6023c32b4
      edge-afac3842736c2b11fd82b93eac60e742
  Available Worker Compute:
    Architecture:  AMD64
    Available:     true
    Family:        NVIDIA H100
    Gpu Count:     8
    Instances:     3
    Machines:
      edge-e078384256765be6e92fc1118aa9f283
      edge-71bb3842c85b1731c37665c3d2ed0d10
      edge-d9673842ab48f990d65c11f504f13183
    Architecture:  AMD64
    Available:     true
    Family:        NVIDIA A100
    Gpu Count:     8
    Instances:     3
    Machines:
      edge-9baf38425dacad857c70ccdbabb48028
      edge-16a53842616769ab26a99001b0124419
      edge-a48938427145a5a6118034c9a54deb8b
    Resource Groups:
      network-pool:  "3"
      storage-tier:  "high-performance"

In this example, the following information can be extracted:

  • There are 6 available control plane compute resources, each with 4 CPU cores and the AMD64 architecture.
  • There are 3 available worker compute resources, each with 8 NVIDIA H100 GPUs and the AMD64 architecture.
  • There are 3 available worker compute resources, each with 8 NVIDIA A100 GPUs and the AMD64 architecture, assigned to the "network-pool" resource group with value "3" and the "storage-tier" resource group with value "high-performance".

PaletteAI uses this information to determine whether the requested resources are available before creating an MLPlatform deployment. GPU limits are also taken into consideration: for example, even though 24 NVIDIA H100 GPUs are physically available across the three worker machines above, a request is rejected if it exceeds the compute profile's NVIDIA-H100 limit.

Discovery

PaletteAI uses Palette to manage and track physical infrastructure devices. Once a device is added to Palette with the required tags and made available, PaletteAI will automatically discover the device and update the compute resource with the available information.

The Compute resource uses the provided Settings, or the project's configured Settings, to authenticate with Palette. PaletteAI then discovers the devices registered in that Palette tenant and project and records them in the Compute resource status.

tip

To learn how to add a bare metal device to Palette, check out the EdgeForge Workflow documentation.

Device Tags

For an Edge device to be discovered by PaletteAI, it must have the tag palette.ai: true. This tag prevents PaletteAI from finding and using Edge devices that are not intended for PaletteAI. If the Palette agent installed on the Edge device is unable to detect CPUs, GPUs, and GPU memory, platform engineers can manually add the following tags to the Edge device, and PaletteAI will fall back to using the tag values provided. The table below lists the supported tags; an example tag set follows the table.

| Tag | Required | Description | Example |
| --- | --- | --- | --- |
| palette.ai: true | Yes | Reserves the Edge device for PaletteAI usage. | palette.ai: true |
| gpus: <gpu_count> | No | Number of GPUs on the Edge device. | gpus: 8 |
| cpus: <cpu_count> | No | Number of CPUs on the Edge device. | cpus: 6 |
| gpu-memory: <gpu_memory> | No | GPU memory on the Edge device. Supported units: M, MB, MiB, G, GB, GiB. Units are case insensitive. | gpu-memory: 80G |
| gpu-family: <gpu_family> | No | GPU family on the Edge device. | gpu-family: nvidia-a100 |
| palette.ai/control-plane: bool | No | Indicates the Edge device is eligible to be a control plane node. | palette.ai/control-plane: true |
| palette.ai/worker: bool | No | Indicates the Edge device is eligible to be a worker node. | palette.ai/worker: true |
| palette.ai.rg/<group_name>: <value> | No | Assigns the Edge device to a resource group. | palette.ai.rg/network-pool: 1 |
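
For illustration, an Edge device intended as a GPU worker might carry a tag set like the following. The specific CPU count, GPU count, memory, and GPU family values are hypothetical and should be replaced with values that match your hardware.

# Reserve the device for PaletteAI and describe its hardware
palette.ai: true
cpus: 64 # hypothetical CPU count
gpus: 8 # hypothetical GPU count
gpu-memory: 80G
gpu-family: nvidia-h100
palette.ai/worker: true
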
Eligible Control Plane

By default, PaletteAI excludes an Edge device from being a Kubernetes control plane node if it has GPUs. If you want to allow an Edge device with GPUs to act as a control plane node, you can add the tag palette.ai/control-plane: true to the Edge device.

Eligible Worker

By default, PaletteAI excludes an Edge device from being a Kubernetes worker node if it has no GPUs. If you want to allow an Edge device without GPUs to act as a worker node, you can add the tag palette.ai/worker: true to the Edge device.

warning

Never tag an Edge device with both palette.ai/control-plane: true and palette.ai/worker: true. If you do, the device will be treated as a worker node, not a control plane node.

Resource Groups

Resource groups provide a way to organize and categorize Edge devices for granular resource management and allocation. Edge devices can be assigned to one or more resource groups using labels with the palette.ai.rg/ prefix.

Resource groups are useful for:

  • Network segmentation: Group devices by network zones or VLANs
  • Storage tiers: Categorize devices by storage capabilities or performance tiers
  • Geographical location: Organize devices by data center, region, or availability zone
  • Hardware specifications: Group devices with similar capabilities or configurations
  • Workload isolation: Separate devices for different types of ML workloads

Resource Group Label Format

Resource group labels follow the pattern: palette.ai.rg/<group_name>: <value>

  • <group_name>: The name of the resource group (can contain slashes for hierarchical naming)
  • <value>: The value assigned to the device within that resource group

Examples

# Network pool assignment
palette.ai.rg/network-pool: "1"

# Storage tier classification
palette.ai.rg/storage-tier: "high-performance"

# Geographical location
palette.ai.rg/region/us-west: "zone-a"

# Multiple resource group assignments
palette.ai.rg/network-pool: "1"
palette.ai.rg/storage-pool: "2"
palette.ai.rg/workload-type: "training"

Resource group information is automatically discovered by PaletteAI and included in the Compute resource status, allowing platform administrators to manage resource allocation across arbitrary device groups.

Minimum Worker Nodes with GPU Optimization

Consider a scenario where you have the following infrastructure:

  • Four available worker nodes, each equipped with 8 GPUs
  • A ComputeProfile configured with minWorkerNodes: 3
  • An MLPlatform request for 8 GPUs in dedicated mode

In this scenario, a single GPU-enabled node would be sufficient to fulfill the 8 GPU request for the MLPlatform. However, because the minimum worker node requirement is set to three, PaletteAI must provision additional worker nodes to meet this constraint.

To fulfill the three-node minimum while avoiding unnecessary GPU allocation, PaletteAI will:

  1. Allocate one node with 8 GPUs to satisfy the GPU requirement
  2. Provision two additional nodes that have the palette.ai/worker: true label, if available.
  3. If no additional nodes with the palette.ai/worker: true label are available, PaletteAI will select additional nodes that have GPUs matching one of the requested GPU families, but with the lowest possible GPU count, to satisfy the minimum worker node requirement.
  4. If it is unable to select additional nodes with GPUs matching one of the requested GPU families, PaletteAI will select additional worker nodes with GPUs of any family, but with the lowest possible GPU count, to satisfy the minimum worker node requirement.

The palette.ai/worker: true label is crucial in these scenarios because PaletteAI normally excludes nodes without GPUs from being worker nodes. Nodes tagged with palette.ai/worker: true are eligible to serve as worker nodes even if they have no GPUs, which prevents PaletteAI from allocating additional GPU-enabled nodes when only general compute capacity is needed.

Example Edge device configuration for worker-eligible nodes:

| Node Type | GPUs | Node Labels | Role in MLPlatform |
| --- | --- | --- | --- |
| gpu-node-1 | 8 | None | GPU workload node |
| cpu-node-1 | 0 | palette.ai/worker: true | General worker node |
| cpu-node-2 | 0 | palette.ai/worker: true | General worker node |
This approach ensures high availability and distributed workload capacity while efficiently utilizing your GPU resources.