Troubleshooting Compute Pools

This page provides troubleshooting guidance for common Compute Pool issues.

info

Troubleshooting steps on this page use kubectl commands to inspect Kubernetes resources on the hub cluster. You need kubectl access to the hub cluster and the Project namespace where your Compute Pool is deployed. If you do not have kubectl access, contact your platform administrator.
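
To confirm your access before you begin, you can run a quick permission check; the command prints yes or no:

kubectl auth can-i get computepools --namespace <project-namespace>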

Compute Pool Stuck in Provisioning

Symptom: The Compute Pool status remains Provisioning.

Possible causes:

  1. No available edge hosts match the resource requirements.

  2. Palette cluster deployment fails.

  3. Network issues prevent cluster creation.

Resolution:

Start by checking the Compute Pool in the PaletteAI UI. Select the Compute Pool from the list and review the Events tab for error messages and status updates. The UI shows the current status, cluster health, and recent events that can help identify the issue without kubectl.

If the UI does not provide enough detail, use the following kubectl commands.

  1. Check Compute Pool conditions.

    kubectl describe computepool <computepool-name> --namespace <project-namespace> | grep -A 10 Conditions

    Expected output for a healthy cluster:

    Conditions:
      Type:     ClusterDeployed
      Status:   True
      Reason:   ClusterRunning
      Message:  Palette cluster is running

      Type:     Validated
      Status:   True
      Reason:   ValidationPassed

    Example failure output:

    Conditions:
      Type:     Validated
      Status:   False
      Reason:   ValidationFailed
      Message:  ComputePool validation failed

      Type:     ClusterDeployed
      Status:   False
      Reason:   ClusterDeploymentFailed
      Message:  Failed to deploy Palette cluster

    Common condition types and failure reasons:

    • Validated / ValidationFailed - ComputePool spec validation failed. Check that referenced resources (ProfileBundle, Settings) exist and are Ready.
    • ClusterDeployed / ClusterDeploymentFailed - Palette cluster deployment failed. Check Settings credentials and available edge hosts.
    • AdminKubeconfigAcquired / KubeconfigFailed - Failed to retrieve the admin kubeconfig from the Palette cluster.
    • SpokeCreated / SpokeCreateFailed - Failed to create the Spoke resource for the cluster.
    • ManagedClusterReady / ManagedClusterNotReady - The OCM ManagedCluster representing the Palette cluster is not ready.
    • EnvironmentCreated / EnvironmentFailed - The Mural Environment failed to be created or become ready.
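
    To inspect the raw conditions without grep, you can also print them as JSON (assuming they are published at the conventional .status.conditions path):

    kubectl get computepool <computepool-name> --namespace <project-namespace> --output jsonpath='{.status.conditions}' | jq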
  2. Verify Compute resources have available machines.

    kubectl get compute --namespace <project-namespace> --output yaml | grep -A 20 controlPlaneCompute

    Expected output:

    status:
      controlPlaneCompute:
      - architecture: amd64
        cpuCount: 8
        memory: '32768 MiB'
        instances: 1
        machines:
          abc123: available
        resourceGroups:
          palette.ai: 'true'
      workerCompute:
      - architecture: amd64
        cpuCount: 16
        memory: '65536 MiB'
        instances: 1
        machines:
          def456: available
        family: NVIDIA-A100
        gpuCount: 2
        gpuMemory: '40960 MiB'
        resourceGroups:
          palette.ai: 'true'

    If controlPlaneCompute or workerCompute is empty, no hosts are available with the required resources.
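
    To scan every Compute resource for available worker machines at once, you can print each machine map with a jsonpath range. This is a minimal sketch; the field paths are taken from the output above:

    kubectl get compute --namespace <project-namespace> \
      --output jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.workerCompute[*].machines}{"\n"}{end}'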

  3. Check Palette cluster status.

    kubectl get computepool <computepool-name> --namespace <project-namespace> --output jsonpath='{.status.paletteClusterStatuses}' | jq

    Expected output for a healthy cluster:

    [
      {
        "name": "ml-cluster",
        "uid": "abc123def456",
        "status": "Running",
        "conditions": [
          {
            "type": "ClusterDeployed",
            "status": "True",
            "reason": "ClusterRunning"
          }
        ]
      }
    ]

    Common status field meanings:

    • status: Cluster lifecycle state (Provisioning, Running, Deleting, Failed, Unhealthy)
    • uid: The Palette cluster UID
    • conditions: Array of condition objects tracking each reconciliation phase
    • kubeconfigSecretName: Name of the Secret containing the cluster kubeconfig
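
    If you only need the lifecycle state, you can extract the status field directly:

    kubectl get computepool <computepool-name> --namespace <project-namespace> --output jsonpath='{.status.paletteClusterStatuses[*].status}'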

    Example failure output:

    [
      {
        "name": "ml-cluster",
        "uid": "abc123def456",
        "status": "Provisioning",
        "conditions": [
          {
            "type": "ClusterDeployed",
            "status": "False",
            "reason": "ClusterDeploymentFailed",
            "message": "Failed to deploy Palette cluster"
          }
        ]
      }
    ]
  4. Verify the ProfileBundle exists.

    kubectl get profilebundle <profilebundle-name> --namespace <project-namespace>

    Expected output:

    NAME                 READY   AGE
    edge-profilebundle   True    5d

    If READY is False, describe the ProfileBundle to identify the cause:

    kubectl describe profilebundle <profilebundle-name> --namespace <project-namespace>
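
    If the ProfileBundle is still reconciling, you can wait for it to become Ready. The condition name Ready is an assumption inferred from the READY column above:

    kubectl wait profilebundle/<profilebundle-name> --namespace <project-namespace> --for=condition=Ready --timeout=5m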

VIP Already in Use

Symptom: Compute Pool creation fails due to a VIP conflict.

Cause: The VIP is assigned to another cluster or device.

Resolution:

warning

The VIP address cannot be changed after cluster creation. If you need to use a different VIP, you must delete the Compute Pool and create a new one.

  1. For a new Compute Pool (not yet created successfully):

    a. Choose a VIP that is not in use.

    b. Check VIP values in existing Compute Pools to avoid conflicts:

    kubectl get computepools --namespace <project-namespace> --output yaml | grep -A 1 "vip:"
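
    For a cleaner listing, you can print each Compute Pool's name and VIP directly, using the field path referenced in step d:

    kubectl get computepools --namespace <project-namespace> \
      --output jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.clusterVariant.dedicated.paletteClusterDeploymentConfig.edge.vip}{"\n"}{end}'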

    c. Verify the VIP is not in use on your network:

    # Option 1: Use arping (recommended, more reliable)
    # Requires root/sudo on most systems. No reply expected if VIP is free.
    sudo arping -c 3 <new-vip-address>

    # Expected output if VIP is free (no replies):
    # ARPING 10.10.162.130
    # ... (timeout, 0 packets received)

    # Option 2: Check if VIP responds to ping (less reliable)
    # Some networks may not respond to ping even if IP is in use
    ping -c 3 <new-vip-address>

    # Expected output if VIP is free:
    # ... 100% packet loss

    # Option 3: Check ARP table (only shows recently resolved IPs)
    arp -a | grep <new-vip-address>

    # Expected output if VIP is free:
    # (no output)

    d. Update your Compute Pool manifest with the new VIP and apply it:

    # Edit the manifest
    vi computepool.yaml

    # Update the VIP under spec.clusterVariant.dedicated.paletteClusterDeploymentConfig.edge.vip

    # Apply the updated manifest
    kubectl apply --filename computepool.yaml
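
    Alternatively, if you have yq v4 available, you can update the field in place instead of editing by hand:

    # Assumes yq v4 (https://github.com/mikefarah/yq) is installed
    yq -i '.spec.clusterVariant.dedicated.paletteClusterDeploymentConfig.edge.vip = "<new-vip-address>"' computepool.yaml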
  2. For an existing Compute Pool (already created with wrong VIP):

    The VIP is immutable. You must delete and recreate the Compute Pool:

    a. Back up any workloads or data from the cluster.

    b. Delete the Compute Pool:

    kubectl delete computepool <computepool-name> --namespace <project-namespace>

    c. Verify the Compute Pool is deleted:

    kubectl get computepool <computepool-name> --namespace <project-namespace>

    The cluster will be removed from Palette. Once deletion completes, the command returns a NotFound error.
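
    To block until deletion finishes instead of polling, you can use kubectl wait:

    kubectl wait --for=delete computepool/<computepool-name> --namespace <project-namespace> --timeout=10m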

    d. Update your manifest with the correct VIP.

    e. Create the Compute Pool again:

    kubectl apply --filename computepool.yaml

Insufficient GPU Resources

Symptom: Worker pool creation fails because GPU quota is exceeded.

Cause: GPU requirements exceed Project or Tenant GPU limits.

Resolution:

  1. Check Project GPU limits.

    kubectl get project <project-name> --namespace <project-namespace> --output jsonpath='{.spec.gpuResources}'
  2. Check current GPU usage.

    kubectl get project <project-name> --namespace <project-namespace> --output jsonpath='{.status.gpuUsage}'
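
    To compare the limit and current usage side by side, you can also describe the Project and filter for GPU fields (the exact field names in the describe output depend on the CRD schema):

    kubectl describe project <project-name> --namespace <project-namespace> | grep -i gpu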
  3. Reduce GPU requirements in the Compute Pool or request higher GPU limits from the Tenant admin.

Compute Pool Cannot Be Deleted

Symptom: Compute Pool deletion fails or does not complete.

Possible causes:

  1. AIWorkloads are still running on the cluster.

  2. A finalizer prevents deletion.

  3. Palette cluster deletion fails.

Resolution:

Start by checking the Compute Pool detail page in the PaletteAI UI. The Overview tab shows deployed workloads and cluster status. If workloads are still running, delete them from the Workloads section before retrying the Compute Pool deletion.

If the UI does not provide enough detail, use the following kubectl commands.

  1. List AIWorkload references.

    kubectl get computepool <computepool-name> --namespace <project-namespace> --output jsonpath='{.status.aiWorkloadRefs}'
  2. Delete AIWorkloads.

    kubectl delete aiworkload <aiworkload-name> --namespace <project-namespace>
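
    If the Compute Pool references many AIWorkloads, you can remove them all at once. Double-check the list from step 1 first, since this deletes every AIWorkload in the namespace:

    kubectl delete aiworkload --all --namespace <project-namespace>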
  3. Check finalizers.

    kubectl get computepool <computepool-name> --namespace <project-namespace> --output jsonpath='{.metadata.finalizers}'

    Example output:

    ["spectrocloud.com/computepool-finalizer"]
    warning

    Finalizers protect resources from premature deletion while cleanup occurs. Do not manually remove finalizers unless:

    • You have confirmed all AIWorkloads are deleted.

    • Controller logs show the finalizer is stuck due to a known issue.

    • You have consulted with your PaletteAI administrator or support team.

    Manually removing finalizers can leave orphaned resources in Palette or cause state inconsistencies.

    If you must remove a finalizer (after confirming with support), use:

    kubectl patch computepool <computepool-name> --namespace <project-namespace> \
    --type json --patch '[{"op": "remove", "path": "/metadata/finalizers/0"}]'
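
    If the object carries more than one finalizer, index 0 may not be the entry you intend to remove. A safer variant adds a JSON Patch test operation so the patch fails unless the first entry matches the expected value:

    kubectl patch computepool <computepool-name> --namespace <project-namespace> \
    --type json --patch '[{"op": "test", "path": "/metadata/finalizers/0", "value": "spectrocloud.com/computepool-finalizer"}, {"op": "remove", "path": "/metadata/finalizers/0"}]'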
  4. If deletion does not complete, review controller logs.

    First, find the Hue controller pods:

    kubectl get pods --namespace mural-system --selector controller.spectrocloud.com/name=hue-controllers

    Then review the logs:

    kubectl logs --namespace mural-system --selector controller.spectrocloud.com/name=hue-controllers --tail=100 --follow
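
    To narrow the output to entries that mention your Compute Pool:

    kubectl logs --namespace mural-system --selector controller.spectrocloud.com/name=hue-controllers --tail=500 | grep <computepool-name>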

    If the mural-system namespace or labels are not found, discover the controller deployment:

    kubectl get deployments --all-namespaces | grep -i hue
