Troubleshooting Compute Pools
This page provides troubleshooting guidance for common Compute Pool issues.
Troubleshooting steps on this page use kubectl commands to inspect Kubernetes resources on the hub cluster. You need kubectl access to the hub cluster and the Project namespace where your Compute Pool is deployed. If you do not have kubectl access, contact your platform administrator.
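Before you begin, you can confirm that your credentials allow reading Compute Pools in the Project namespace. This is a minimal check; it assumes the resource's plural name is computepools, as used elsewhere on this page.

# Prints "yes" if you can read Compute Pools in the namespace
kubectl auth can-i get computepools --namespace <project-namespace>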
Compute Pool Stuck in Provisioning
Symptom: The Compute Pool status remains Provisioning.
Possible causes:
- No available edge hosts match the resource requirements.
- Palette cluster deployment fails.
- Network issues prevent cluster creation.
Resolution:
Start by checking the Compute Pool in the PaletteAI UI. Select the Compute Pool from the list and review the Events tab for error messages and status updates. The UI shows the current status, cluster health, and recent events that can help identify the issue without kubectl.
If the UI does not provide enough detail, use the following kubectl commands.
- Check Compute Pool conditions.

kubectl describe computepool <computepool-name> --namespace <project-namespace> | grep -A 10 Conditions

Expected output for a healthy cluster:
Conditions:
  Type:     ClusterDeployed
  Status:   True
  Reason:   ClusterRunning
  Message:  Palette cluster is running
  Type:     Validated
  Status:   True
  Reason:   ValidationPassed

Example failure output:
Conditions:
  Type:     Validated
  Status:   False
  Reason:   ValidationFailed
  Message:  ComputePool validation failed
  Type:     ClusterDeployed
  Status:   False
  Reason:   ClusterDeploymentFailed
  Message:  Failed to deploy Palette cluster

Common condition types and failure reasons:

- Validated / ValidationFailed: ComputePool spec validation failed. Check that referenced resources (ProfileBundle, Settings) exist and are Ready.
- ClusterDeployed / ClusterDeploymentFailed: Palette cluster deployment failed. Check Settings credentials and available edge hosts.
- AdminKubeconfigAcquired / KubeconfigFailed: Failed to retrieve the admin kubeconfig from the Palette cluster.
- SpokeCreated / SpokeCreateFailed: Failed to create the Spoke resource for the cluster.
- ManagedClusterReady / ManagedClusterNotReady: The OCM ManagedCluster representing the Palette cluster is not ready.
- EnvironmentCreated / EnvironmentFailed: The Mural Environment was not created or did not become ready.
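To surface only failing conditions without reading the full describe output, you can filter with jq. This is a sketch; it assumes jq is installed and that conditions are published under .status.conditions:

# Print only conditions whose status is not True
kubectl get computepool <computepool-name> --namespace <project-namespace> --output json | jq '.status.conditions[]? | select(.status != "True")'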
- Verify Compute resources have available machines.

kubectl get compute --namespace <project-namespace> --output yaml | grep -A 20 controlPlaneCompute

Expected output:
status:
  controlPlaneCompute:
  - architecture: amd64
    cpuCount: 8
    memory: '32768 MiB'
    instances: 1
    machines:
      abc123: available
    resourceGroups:
      palette.ai: 'true'
  workerCompute:
  - architecture: amd64
    cpuCount: 16
    memory: '65536 MiB'
    instances: 1
    machines:
      def456: available
    family: NVIDIA-A100
    gpuCount: 2
    gpuMemory: '40960 MiB'
    resourceGroups:
      palette.ai: 'true'

If controlPlaneCompute or workerCompute is empty, no hosts are available with the required resources.
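To check machine availability across all Compute resources at once, the following sketch prints both compute arrays for each resource; it assumes jq is installed and the status layout shown above:

# Empty or null arrays indicate no matching hosts
kubectl get compute --namespace <project-namespace> --output json | jq '.items[].status | {controlPlaneCompute, workerCompute}'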
- Check Palette cluster status.

kubectl get computepool <computepool-name> --namespace <project-namespace> --output jsonpath='{.status.paletteClusterStatuses}' | jq

Expected output for a healthy cluster:
[
  {
    "name": "ml-cluster",
    "uid": "abc123def456",
    "status": "Running",
    "conditions": [
      {
        "type": "ClusterDeployed",
        "status": "True",
        "reason": "ClusterRunning"
      }
    ]
  }
]

Common status field meanings:

- status: Cluster lifecycle state (Provisioning, Running, Deleting, Failed, Unhealthy).
- uid: The Palette cluster UID.
- conditions: Array of condition objects tracking each reconciliation phase.
- kubeconfigSecretName: Name of the Secret containing the cluster kubeconfig.
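To print just the lifecycle state of each Palette cluster, you can reduce the same output with jq. A sketch, assuming jq is installed:

# Prints one "name: status" line per cluster
kubectl get computepool <computepool-name> --namespace <project-namespace> --output jsonpath='{.status.paletteClusterStatuses}' | jq -r '.[] | "\(.name): \(.status)"'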
Example failure output:
[
  {
    "name": "ml-cluster",
    "uid": "abc123def456",
    "status": "Provisioning",
    "conditions": [
      {
        "type": "ClusterDeployed",
        "status": "False",
        "reason": "ClusterDeploymentFailed",
        "message": "Failed to deploy Palette cluster"
      }
    ]
  }
]

- Verify the ProfileBundle exists.

kubectl get profilebundle <profilebundle-name> --namespace <project-namespace>

Expected output:
NAME                 READY   AGE
edge-profilebundle   True    5d

If READY is False, describe the ProfileBundle to identify the cause:

kubectl describe profilebundle <profilebundle-name> --namespace <project-namespace>
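If the ProfileBundle is still reconciling, you can watch it until READY flips to True before retrying. This uses standard kubectl and requires no extra tooling:

# Streams updates until you interrupt with Ctrl+C
kubectl get profilebundle <profilebundle-name> --namespace <project-namespace> --watch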
VIP Already in Use
Symptom: Compute Pool creation fails due to a VIP conflict.
Cause: The VIP is assigned to another cluster or device.
Resolution:
The VIP address cannot be changed after cluster creation. If you need to use a different VIP, you must delete the Compute Pool and create a new one.
- For a new Compute Pool (not yet created successfully):

a. Choose a VIP that is not in use.

b. Check VIP values in existing Compute Pools to avoid conflicts:

kubectl get computepools --namespace <project-namespace> --output yaml | grep -A 1 "vip:"

c. Verify the VIP is not in use on your network:
# Option 1: Use arping (recommended, more reliable)
# Requires root/sudo on most systems. No reply expected if VIP is free.
sudo arping -c 3 <new-vip-address>
# Expected output if VIP is free (no replies):
# ARPING 10.10.162.130
# ... (timeout, 0 packets received)
# Option 2: Check if VIP responds to ping (less reliable)
# Some networks may not respond to ping even if IP is in use
ping -c 3 <new-vip-address>
# Expected output if VIP is free:
# ... 100% packet loss
# Option 3: Check ARP table (only shows recently resolved IPs)
arp -a | grep <new-vip-address>
# Expected output if VIP is free:
# (no output)

d. Update your Compute Pool manifest with the new VIP and apply it (a manifest excerpt follows the commands below):
# Edit the manifest
vi computepool.yaml
# Update the VIP under spec.clusterVariant.dedicated.paletteClusterDeploymentConfig.edge.vip
# Apply the updated manifest
kubectl apply --filename computepool.yaml
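For reference, the VIP field sits at the path noted in the comments above. The following manifest excerpt is a sketch; the address is illustrative:

spec:
  clusterVariant:
    dedicated:
      paletteClusterDeploymentConfig:
        edge:
          vip: 10.10.162.131  # illustrative; use an unused VIP on your network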
- For an existing Compute Pool (already created with the wrong VIP):
The VIP is immutable. You must delete and recreate the Compute Pool:
a. Back up any workloads or data from the cluster.
b. Delete the Compute Pool:
kubectl delete computepool <computepool-name> --namespace <project-namespace>

c. Verify the Compute Pool is deleted:

kubectl get computepool <computepool-name> --namespace <project-namespace>

The cluster will be removed from Palette. Once deletion completes, the command returns a NotFound error.
d. Update your manifest with the correct VIP.
e. Create the Compute Pool again:
kubectl apply --filename computepool.yaml
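After recreating the Compute Pool, you can confirm the new VIP took effect. A sketch using the field path from the previous step:

# Prints the VIP currently set on the Compute Pool
kubectl get computepool <computepool-name> --namespace <project-namespace> --output jsonpath='{.spec.clusterVariant.dedicated.paletteClusterDeploymentConfig.edge.vip}'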
Insufficient GPU Resources
Symptom: Worker pool creation fails because GPU quota is exceeded.
Cause: GPU requirements exceed Project or Tenant GPU limits.
Resolution:
- Check Project GPU limits.

kubectl get project <project-name> --namespace <project-namespace> --output jsonpath='{.spec.gpuResources}'

- Check current GPU usage.

kubectl get project <project-name> --namespace <project-namespace> --output jsonpath='{.status.gpuUsage}'

- Reduce GPU requirements in the Compute Pool or request higher GPU limits from the Tenant admin.
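To compare limits against usage side by side, you can combine the two queries above. A sketch, assuming jq is installed:

# Pretty-print limits and usage together
echo "limits:" && kubectl get project <project-name> --namespace <project-namespace> --output jsonpath='{.spec.gpuResources}' | jq .
echo "usage:" && kubectl get project <project-name> --namespace <project-namespace> --output jsonpath='{.status.gpuUsage}' | jq .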
Compute Pool Cannot Be Deleted
Symptom: Compute Pool deletion fails or does not complete.
Possible causes:
- AIWorkloads still run on the cluster.
- A finalizer prevents deletion.
- Palette cluster deletion fails.
Resolution:
Start by checking the Compute Pool detail page in the PaletteAI UI. The Overview tab shows deployed workloads and cluster status. If workloads are still running, delete them from the Workloads section before retrying the Compute Pool deletion.
If the UI does not provide enough detail, use the following kubectl commands.
- List AIWorkload references.

kubectl get computepool <computepool-name> --namespace <project-namespace> --output jsonpath='{.status.aiWorkloadRefs}'

- Delete AIWorkloads.

kubectl delete aiworkload <aiworkload-name> --namespace <project-namespace>
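If several AIWorkloads reference the pool, you can delete them in one pass. This is a sketch; it assumes jq is installed and that each entry in aiWorkloadRefs carries a name field, which may differ in your version:

# Delete every AIWorkload referenced by the Compute Pool
kubectl get computepool <computepool-name> --namespace <project-namespace> --output jsonpath='{.status.aiWorkloadRefs}' | jq -r '.[].name' | xargs -I {} kubectl delete aiworkload {} --namespace <project-namespace>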
- Check finalizers.

kubectl get computepool <computepool-name> --namespace <project-namespace> --output jsonpath='{.metadata.finalizers}'

Example output:
["spectrocloud.com/computepool-finalizer"]warningFinalizers protect resources from premature deletion while cleanup occurs. Do not manually remove finalizers unless:
- You have confirmed all AIWorkloads are deleted.
- Controller logs show the finalizer is stuck due to a known issue.
- You have consulted with your PaletteAI administrator or support team.
Manually removing finalizers can leave orphaned resources in Palette or cause state inconsistencies.
If you must remove a finalizer (after confirming with support), use:
kubectl patch computepool <computepool-name> --namespace <project-namespace> \
  --type json -p '[{"op": "remove", "path": "/metadata/finalizers/0"}]'
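The patch above removes only the first entry in the finalizers list; if more than one finalizer is present, adjust the index or repeat the patch. Afterward, confirm the list is empty:

# An empty result means all finalizers were removed
kubectl get computepool <computepool-name> --namespace <project-namespace> --output jsonpath='{.metadata.finalizers}'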
- If deletion does not complete, review controller logs.

First, find the Hue controller pods:

kubectl get pods --namespace mural-system --selector controller.spectrocloud.com/name=hue-controllers

Then review the logs:

kubectl logs --namespace mural-system --selector controller.spectrocloud.com/name=hue-controllers --tail=100 --follow

If the mural-system namespace or labels are not found, discover the controller deployment:

kubectl get deployments --all-namespaces | grep -i hue
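To narrow the logs to a single Compute Pool, filter the output by name. A sketch using standard grep:

# Show recent log lines that mention your Compute Pool
kubectl logs --namespace mural-system --selector controller.spectrocloud.com/name=hue-controllers --tail=500 | grep <computepool-name>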