Troubleshooting Compute Pools

This page provides troubleshooting guidance for common Compute Pool issues.

info

Troubleshooting steps on this page use kubectl commands to inspect Kubernetes resources on the hub cluster. You need kubectl access to the hub cluster and the Project namespace where your Compute Pool is deployed. If you do not have kubectl access, contact your platform administrator.
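
To confirm your access before you begin, you can run a quick permission check; the command prints yes or no:

kubectl auth can-i get computepools --namespace <project-namespace>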

Compute Pool Stuck in Provisioning

Symptom: The Compute Pool status remains Provisioning.

Possible causes:

  1. No available edge hosts match the resource requirements.

  2. Palette cluster deployment fails.

  3. Network issues prevent cluster creation.

Resolution:

Start by checking the Compute Pool in the PaletteAI UI. Select the Compute Pool from the list and review the Events tab for error messages and status updates. The UI shows the current status, cluster health, and recent events that can help identify the issue without kubectl.

If the UI does not provide enough detail, use the following kubectl commands.

  1. Check Compute Pool conditions.

    kubectl describe computepool <computepool-name> --namespace <project-namespace> | grep -A 10 Conditions

    Expected output for a healthy cluster:

    Conditions:
      Type:     ClusterDeployed
      Status:   True
      Reason:   ClusterRunning
      Message:  Palette cluster is running

      Type:     Validated
      Status:   True
      Reason:   ValidationPassed

    Example failure output:

    Conditions:
      Type:     Validated
      Status:   False
      Reason:   ValidationFailed
      Message:  ComputePool validation failed

      Type:     ClusterDeployed
      Status:   False
      Reason:   ClusterDeploymentFailed
      Message:  Failed to deploy Palette cluster

    Common condition types and failure reasons:

    • Validated / ValidationFailed - ComputePool spec validation failed. Check that referenced resources (ProfileBundle, Settings) exist and are Ready.
    • ClusterDeployed / ClusterDeploymentFailed - Palette cluster deployment failed. Check Settings credentials and available edge hosts.
    • AdminKubeconfigAcquired / KubeconfigFailed - Failed to retrieve the admin kubeconfig from the Palette cluster.
    • SpokeCreated / SpokeCreateFailed - Failed to create the Spoke resource for the cluster.
    • ManagedClusterReady / ManagedClusterNotReady - The OCM ManagedCluster representing the Palette cluster is not ready.
    • EnvironmentCreated / EnvironmentFailed - The Mural Environment failed to be created or become ready.
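
    To inspect the raw conditions without grep, you can also print them as JSON (assuming they are published at the conventional .status.conditions path):

    kubectl get computepool <computepool-name> --namespace <project-namespace> --output jsonpath='{.status.conditions}' | jq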
  2. Verify Compute resources have available machines.

    kubectl get compute --namespace <project-namespace> --output yaml | grep -A 20 controlPlaneCompute

    Expected output:

    status:
      controlPlaneCompute:
      - architecture: amd64
        cpuCount: 8
        memory: '32768 MiB'
        instances: 1
        machines:
          abc123: available
        resourceGroups:
          palette.ai: 'true'
      workerCompute:
      - architecture: amd64
        cpuCount: 16
        memory: '65536 MiB'
        instances: 1
        machines:
          def456: available
        family: NVIDIA-A100
        gpuCount: 2
        gpuMemory: '40960 MiB'
        resourceGroups:
          palette.ai: 'true'

    If controlPlaneCompute or workerCompute is empty, no hosts are available with the required resources.
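
    To scan every Compute resource for available worker machines at once, you can print each machine map with a jsonpath range. This is a minimal sketch; the field paths are taken from the output above:

    kubectl get compute --namespace <project-namespace> \
      --output jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.workerCompute[*].machines}{"\n"}{end}'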

  3. Check Palette cluster status.

    kubectl get computepool <computepool-name> --namespace <project-namespace> --output jsonpath='{.status.paletteClusterStatuses}' | jq

    Expected output for a healthy cluster:

    [
      {
        "name": "ml-cluster",
        "uid": "abc123def456",
        "status": "Running",
        "conditions": [
          {
            "type": "ClusterDeployed",
            "status": "True",
            "reason": "ClusterRunning"
          }
        ]
      }
    ]

    Common status field meanings:

    • status: Cluster lifecycle state (Provisioning, Running, Deleting, Failed, Unhealthy)
    • uid: The Palette cluster UID
    • conditions: Array of condition objects tracking each reconciliation phase
    • kubeconfigSecretName: Name of the Secret containing the cluster kubeconfig
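
    If you only need the lifecycle state, you can extract the status field directly:

    kubectl get computepool <computepool-name> --namespace <project-namespace> --output jsonpath='{.status.paletteClusterStatuses[*].status}'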

    Example failure output:

    [
      {
        "name": "ml-cluster",
        "uid": "abc123def456",
        "status": "Provisioning",
        "conditions": [
          {
            "type": "ClusterDeployed",
            "status": "False",
            "reason": "ClusterDeploymentFailed",
            "message": "Failed to deploy Palette cluster"
          }
        ]
      }
    ]
  4. Verify the ProfileBundle exists.

    kubectl get profilebundle <profilebundle-name> --namespace <project-namespace>

    Expected output:

    NAME                 READY   AGE
    edge-profilebundle   True    5d

    If READY is False, describe the ProfileBundle to identify the cause:

    kubectl describe profilebundle <profilebundle-name> --namespace <project-namespace>
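
    If the ProfileBundle is still reconciling, you can wait for it to become Ready. The condition name Ready is an assumption inferred from the READY column above:

    kubectl wait profilebundle/<profilebundle-name> --namespace <project-namespace> --for=condition=Ready --timeout=5m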

VIP Already in Use

Symptom: Compute Pool creation fails due to a VIP conflict.

Cause: The VIP is assigned to another cluster or device.

Resolution:

warning

The VIP address cannot be changed after cluster creation. If you need to use a different VIP, you must delete the Compute Pool and create a new one.

  1. For a new Compute Pool (not yet created successfully):

    a. Choose a VIP that is not in use.

    b. Check VIP values in existing Compute Pools to avoid conflicts:

    kubectl get computepools --namespace <project-namespace> --output yaml | grep -A 1 "vip:"
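
    For a cleaner listing, you can print each Compute Pool's name and VIP directly, using the field path referenced in step d:

    kubectl get computepools --namespace <project-namespace> \
      --output jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.clusterVariant.dedicated.paletteClusterDeploymentConfig.edge.vip}{"\n"}{end}'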

    c. Verify the VIP is not in use on your network:

    # Option 1: Use arping (recommended, more reliable)
    # Requires root/sudo on most systems. No reply expected if VIP is free.
    sudo arping -c 3 <new-vip-address>

    # Expected output if VIP is free (no replies):
    # ARPING 10.10.162.130
    # ... (timeout, 0 packets received)

    # Option 2: Check if VIP responds to ping (less reliable)
    # Some networks may not respond to ping even if IP is in use
    ping -c 3 <new-vip-address>

    # Expected output if VIP is free:
    # ... 100% packet loss

    # Option 3: Check ARP table (only shows recently resolved IPs)
    arp -a | grep <new-vip-address>

    # Expected output if VIP is free:
    # (no output)

    d. Update your Compute Pool manifest with the new VIP and apply it:

    # Edit the manifest
    vi computepool.yaml

    # Update the VIP under spec.clusterVariant.dedicated.paletteClusterDeploymentConfig.edge.vip

    # Apply the updated manifest
    kubectl apply --filename computepool.yaml
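
    Alternatively, if you have yq v4 available, you can update the field in place instead of editing by hand:

    # Assumes yq v4 (https://github.com/mikefarah/yq) is installed
    yq -i '.spec.clusterVariant.dedicated.paletteClusterDeploymentConfig.edge.vip = "<new-vip-address>"' computepool.yaml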
  2. For an existing Compute Pool (already created with wrong VIP):

    The VIP is immutable. You must delete and recreate the Compute Pool:

    a. Back up any workloads or data from the cluster.

    b. Delete the Compute Pool:

    kubectl delete computepool <computepool-name> --namespace <project-namespace>

    c. Verify the Compute Pool is deleted:

    kubectl get computepool <computepool-name> --namespace <project-namespace>

    The cluster will be removed from Palette. Once deletion completes, the command returns a NotFound error.
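
    To block until deletion finishes instead of polling, you can use kubectl wait:

    kubectl wait --for=delete computepool/<computepool-name> --namespace <project-namespace> --timeout=10m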

    d. Update your manifest with the correct VIP.

    e. Create the Compute Pool again:

    kubectl apply --filename computepool.yaml

Insufficient GPU Resources

Symptom: Worker pool creation fails because GPU quota is exceeded.

Cause: GPU requirements exceed Project or Tenant GPU limits.

Resolution:

  1. Check Project GPU limits.

    kubectl get project <project-name> --namespace <project-namespace> --output jsonpath='{.spec.gpuResources}'
  2. Check current GPU usage.

    kubectl get project <project-name> --namespace <project-namespace> --output jsonpath='{.status.gpuUsage}'
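
    To compare the limit and current usage side by side, you can also describe the Project and filter for GPU fields (the exact field names in the describe output depend on the CRD schema):

    kubectl describe project <project-name> --namespace <project-namespace> | grep -i gpu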
  3. Reduce GPU requirements in the Compute Pool or request higher GPU limits from the Tenant admin.

Compute Pool Cannot Be Deleted

Symptom: Compute Pool deletion fails or does not complete.

Possible causes:

  1. AIWorkloads are still running on the cluster.

  2. A finalizer prevents deletion.

  3. Palette cluster deletion fails.

Resolution:

Start by checking the Compute Pool detail page in the PaletteAI UI. The Overview tab shows deployed workloads and cluster status. If workloads are still running, delete them from the Workloads section before retrying the Compute Pool deletion.

If the UI does not provide enough detail, use the following kubectl commands.

  1. List AIWorkload references.

    kubectl get computepool <computepool-name> --namespace <project-namespace> --output jsonpath='{.status.aiWorkloadRefs}'
  2. Delete AIWorkloads.

    kubectl delete aiworkload <aiworkload-name> --namespace <project-namespace>
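
    If the Compute Pool references many AIWorkloads, you can remove them all at once. Double-check the list from step 1 first, since this deletes every AIWorkload in the namespace:

    kubectl delete aiworkload --all --namespace <project-namespace>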
  3. Check finalizers.

    kubectl get computepool <computepool-name> --namespace <project-namespace> --output jsonpath='{.metadata.finalizers}'

    Example output:

    ["spectrocloud.com/computepool-finalizer"]
    warning

    Finalizers protect resources from premature deletion while cleanup occurs. Do not manually remove finalizers unless:

    • You have confirmed all AIWorkloads are deleted.

    • Controller logs show the finalizer is stuck due to a known issue.

    • You have consulted with your PaletteAI administrator or support team.

    Manually removing finalizers can leave orphaned resources in Palette or cause state inconsistencies.

    If you must remove a finalizer (after confirming with support), use:

    kubectl patch computepool <computepool-name> --namespace <project-namespace> \
    --type json --patch '[{"op": "remove", "path": "/metadata/finalizers/0"}]'
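
    If the object carries more than one finalizer, index 0 may not be the entry you intend to remove. A safer variant adds a JSON Patch test operation so the patch fails unless the first entry matches the expected value:

    kubectl patch computepool <computepool-name> --namespace <project-namespace> \
    --type json --patch '[{"op": "test", "path": "/metadata/finalizers/0", "value": "spectrocloud.com/computepool-finalizer"}, {"op": "remove", "path": "/metadata/finalizers/0"}]'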
  4. If deletion does not complete, review controller logs.

    First, find the Hue controller pods:

    kubectl get pods --namespace mural-system --selector controller.spectrocloud.com/name=hue-controllers

    Then review the logs:

    kubectl logs --namespace mural-system --selector controller.spectrocloud.com/name=hue-controllers --tail=100 --follow
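
    To narrow the output to entries that mention your Compute Pool:

    kubectl logs --namespace mural-system --selector controller.spectrocloud.com/name=hue-controllers --tail=500 | grep <computepool-name>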

    If the mural-system namespace or labels are not found, discover the controller deployment:

    kubectl get deployments --all-namespaces | grep -i hue
