Troubleshooting Tenants

This page provides troubleshooting guidance for common issues when creating and managing Tenants.

Tenant Not Ready

Symptom: The Tenant's READY column shows False.

Possible causes:

  1. The referenced Settings resource does not exist or is not ready.

  2. PaletteAI cannot create the controller-created Tenant namespace (tenant-<tenant-name>).

  3. GPU usage across the Tenant's Projects exceeds the Tenant's GPU limits.

Resolution:

  1. View Tenant conditions.

    kubectl get tenant <tenant-name> --output jsonpath='{.status.conditions}' | jq

  2. If SettingsConfigured is False, verify the Settings resource.

    Check the Settings resource in the auto-generated namespace referenced by .spec.settingsRef.namespace (must be tenant-<tenant-name>).

    # First, get the Settings namespace from the Tenant spec
    kubectl get tenant <tenant-name> --output jsonpath='{.spec.settingsRef.namespace}'

    # Then check the Settings resource
    kubectl get settings <settings-name> --namespace <settings-namespace>
    kubectl describe settings <settings-name> --namespace <settings-namespace>

    If the Settings resource shows an IntegrationsConfigured condition with status=False and reason="IntegrationsNotReady", one or more configured integrations have invalid or missing secrets. The condition message lists which integrations are affected. See Settings Integrations Not Ready for resolution steps.

  3. If TenantNamespaceCreated is False, verify the controller-created namespace and review events.

    # Check the controller-created namespace (tenant-<name>)
    kubectl get namespace tenant-<tenant-name>
    kubectl describe tenant <tenant-name>

    Debug namespace creation failures:

    First, check recent events to display the actual denial reason:

    # If the namespace exists but Tenant is not Ready
    kubectl get events --namespace tenant-<tenant-name> --sort-by=.lastTimestamp | tail -20

    # If namespace does not exist, check controller logs
    # First, find the tenant controller pod (label selectors may vary by installation)
    kubectl get pods --namespace mural-system

    # Then inspect logs (example using common label selector)
    kubectl logs --namespace mural-system --selector app=hue-controller --tail=100 | grep --ignore-case "tenant.*<tenant-name>"

    Info

    The exact pod label selector may differ in your installation. Use kubectl get pods --namespace mural-system to identify the tenant controller pod and inspect its logs directly.

    Common causes for namespace creation failure:

    • Namespace already exists with conflicting ownership.

    • Pod Security Admission (PSA) policies blocking namespace creation. Check for pod-security.kubernetes.io labels.

    • Admission webhooks rejecting the namespace. Look for webhook names in events.

    • A previous namespace with the same name is stuck in Terminating because finalizers are blocking its deletion. Check for finalizers with the following command:

      kubectl get namespace tenant-<tenant-name> --output yaml | grep finalizers

  4. If TenantOversubscribed is True, reduce GPU usage.

    kubectl get tenant <tenant-name> --output jsonpath='{.status.gpuUsage}' | jq
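The condition checks in steps 2 through 4 can be combined into a single pass. The following is a minimal sketch, not a PaletteAI tool: the function names tenant_conditions and suggest_next_step are hypothetical, and the sketch assumes each entry in .status.conditions carries type and status fields, as is conventional for Kubernetes conditions.

```shell
# Hypothetical helpers; the names are illustrative and not part of PaletteAI.

# Print one "<type><TAB><status>" line per Tenant condition.
tenant_conditions() {
  kubectl get tenant "$1" \
    --output jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'
}

# Read "<type> <status>" lines on stdin and print the matching resolution step.
suggest_next_step() {
  while read -r type status; do
    case "$type=$status" in
      SettingsConfigured=False)
        echo "Step 2: verify the Settings resource" ;;
      TenantNamespaceCreated=False)
        echo "Step 3: check the tenant namespace and events" ;;
      TenantOversubscribed=True)
        echo "Step 4: reduce GPU usage" ;;
    esac
  done
}
```

To use it against a cluster, run tenant_conditions <tenant-name> | suggest_next_step. Conditions that are not listed in the case statement are ignored.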

Cannot Delete Tenant

Symptom: The Tenant delete request fails due to child Projects.

Cause: PaletteAI blocks deletion of Tenants with child Projects.

Resolution:

  1. List the Projects under the Tenant.

    kubectl get projects --all-namespaces --selector palette.ai/tenant-name=<tenant-name>

    Info

    Projects are automatically labeled with palette.ai/tenant-name during creation. If a Project is missing this label, it may indicate a problem with the Project creation process or a Project created with an older version of PaletteAI.

  2. Delete the Projects.

    kubectl delete project <project-name> --namespace <project-namespace>

  3. Verify the child Project count is 0.

    kubectl get tenant <tenant-name> --output jsonpath='{.status.childProjectCount}'

  4. Delete the Tenant.

    kubectl delete tenant <tenant-name>
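When a Tenant has many child Projects, steps 1 and 2 can be scripted. The sketch below only prints the delete commands so they can be reviewed before anything is removed. The function names are hypothetical, and the label selector is the one shown in step 1.

```shell
# Hypothetical helpers; the names are illustrative and not part of PaletteAI.

# Print "<namespace><TAB><name>" for every Project labeled with the Tenant.
list_tenant_projects() {
  kubectl get projects --all-namespaces \
    --selector "palette.ai/tenant-name=$1" \
    --output jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'
}

# Turn "<namespace> <name>" lines into delete commands for review.
# Nothing is executed here; the commands are only printed.
print_delete_commands() {
  while read -r namespace name; do
    [ -n "$name" ] || continue
    echo "kubectl delete project $name --namespace $namespace"
  done
}
```

Run list_tenant_projects <tenant-name> | print_delete_commands, inspect the output, and execute the printed commands only after confirming that every listed Project should be deleted.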

GPU Quota Exceeded

Symptom: App Deployments or ComputePools remain pending due to GPU requests.

Cause: Total GPU usage across all Projects exceeds Tenant limits.

Resolution:

  1. Check current GPU usage.

    kubectl get tenant <tenant-name> --output jsonpath='{.status.gpuUsage}' | jq

  2. Check Tenant GPU limits.

    kubectl get tenant <tenant-name> --output jsonpath='{.spec.gpuResources}' | jq

  3. Reduce GPU usage or increase Tenant limits.

    To increase limits, update the Tenant manifest and apply it.

    spec:
      gpuResources:
        limits:
          'NVIDIA-A100': 128

    kubectl apply --filename tenant.yaml
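As an alternative to editing and reapplying the manifest, a single limit could be raised with a JSON merge patch. This is a sketch under the assumption that spec.gpuResources.limits is a map from GPU type to an integer count, as in the manifest above. The function names are hypothetical.

```shell
# Hypothetical helpers; the names are illustrative and not part of PaletteAI.

# Build a JSON merge patch that sets one GPU limit.
# $1 = GPU type key (for example NVIDIA-A100), $2 = integer limit
gpu_limit_patch() {
  printf '{"spec":{"gpuResources":{"limits":{"%s":%s}}}}\n' "$1" "$2"
}

# Apply the patch to a Tenant with kubectl patch.
# $1 = tenant name, $2 = GPU type key, $3 = integer limit
set_gpu_limit() {
  kubectl patch tenant "$1" --type merge --patch "$(gpu_limit_patch "$2" "$3")"
}
```

For example, set_gpu_limit <tenant-name> NVIDIA-A100 128. A live patch leaves tenant.yaml out of date, so update the manifest as well if it is kept in version control.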

Invalid Settings Namespace

Symptom: Tenant creation or update is rejected at admission time with an error message similar to: settingsRef.namespace must be 'tenant-<tenant-name>'.

Cause: PaletteAI's webhook validation enforces that settingsRef.namespace must equal the auto-generated tenant namespace format tenant-<tenant-name>. Tenant creation or update requests are rejected at admission time if the namespace does not match. This validation cannot be overridden.

Resolution:

Update the settingsRef.namespace field in the Tenant manifest to match the auto-generated tenant namespace format: tenant-<tenant-name>, where <tenant-name> is the name of your Tenant resource.

Example:

For a Tenant named my-tenant, the correct configuration is:

spec:
  settingsRef:
    name: my-settings
    namespace: tenant-my-tenant

Apply the corrected manifest:

kubectl apply --filename tenant.yaml
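The naming rule can also be checked before a manifest reaches the cluster, for example in a CI step. The following sketch encodes only the rule stated above (tenant-<tenant-name>). The function names are hypothetical.

```shell
# Hypothetical helpers; the names are illustrative and not part of PaletteAI.

# Print the namespace that settingsRef.namespace must use for a Tenant.
expected_settings_namespace() {
  printf 'tenant-%s\n' "$1"
}

# $1 = Tenant name, $2 = value of spec.settingsRef.namespace in the manifest
check_settings_namespace() {
  expected=$(expected_settings_namespace "$1")
  if [ "$2" = "$expected" ]; then
    echo "ok"
  else
    echo "settingsRef.namespace should be $expected, got $2"
    return 1
  fi
}
```

For a Tenant named my-tenant, check_settings_namespace my-tenant tenant-my-tenant succeeds, while any other namespace value returns a non-zero exit status.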

Project Name and Namespace Mismatch

Symptom: Project creation or update is rejected at admission time with an error message: project name must match its namespace: name="<project-name>", namespace="<namespace-name>".

Cause: PaletteAI's webhook validation enforces that a Project's metadata.name must match its metadata.namespace. Project creation or update requests are rejected at admission time if the name and namespace do not match. This validation cannot be overridden.

Resolution:

Update the Project manifest so that metadata.name matches metadata.namespace.

Example:

For a Project in namespace my-project, the correct configuration is:

apiVersion: spectrocloud.com/v1alpha1
kind: Project
metadata:
  name: my-project
  namespace: my-project
spec:
  # ... project spec

Apply the corrected manifest:

kubectl apply --filename project.yaml
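A quick local check can catch the mismatch before the webhook does. The sketch below assumes a single-document manifest in which name and namespace appear as direct children of metadata. It is a simple text scan, not a YAML parser, so a nested name key under metadata (for example inside labels) would confuse it. The function names are hypothetical.

```shell
# Hypothetical helpers; the names are illustrative and not part of PaletteAI.

# Read a Project manifest on stdin and print "<name> <namespace>" from metadata.
manifest_name_and_namespace() {
  awk '
    /^metadata:/ { in_meta = 1; next }   # entering the metadata block
    /^[^ #]/     { in_meta = 0 }         # any other top-level key ends it
    in_meta && $1 == "name:"      { name = $2 }
    in_meta && $1 == "namespace:" { ns = $2 }
    END { print name, ns }
  '
}

# Read a Project manifest on stdin and report whether name matches namespace.
check_project_manifest() {
  set -- $(manifest_name_and_namespace)
  if [ -n "$1" ] && [ "$1" = "$2" ]; then
    echo "ok: name and namespace are both $1"
  else
    echo "mismatch: name=$1 namespace=$2"
    return 1
  fi
}
```

Run check_project_manifest < project.yaml before kubectl apply.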

Settings Integrations Not Ready

Symptom: Settings resource shows IntegrationsConfigured condition with status=False and reason="IntegrationsNotReady". The condition message lists which integrations have invalid or missing secrets.

Cause: One or more configured integrations (Palette, Hugging Face, or NVIDIA) reference secrets that are missing, have incorrect names, or lack required fields.

Resolution:

  1. Check the Settings conditions to identify affected integrations.

    kubectl get settings <settings-name> --namespace <settings-namespace> --output jsonpath='{.status.conditions}' | jq

    The condition message identifies which integrations are failing. Example output:

    {
      "type": "IntegrationsConfigured",
      "status": "False",
      "reason": "IntegrationsNotReady",
      "message": "The following integrations have invalid or missing secrets: Hugging Face, NVIDIA"
    }

  2. Verify the referenced secrets exist in the correct namespace.

    kubectl get secrets --namespace <settings-namespace>

  3. For Palette integrations, verify the secret exists and matches the name in the Settings spec.

    kubectl get secret <palette-secret-name> --namespace <settings-namespace>

  4. For Hugging Face integrations, verify the secret exists and contains the required token field.

    kubectl get secret <huggingface-secret-name> --namespace <settings-namespace> --output jsonpath='{.data}' | jq

    The secret must be an Opaque Kubernetes secret with the field specified in the Settings spec.integrations.huggingFace.apiKey.key.

  5. For NVIDIA integrations, verify both the NGC API key and image pull secrets exist.

    # Check NGC API key secret
    kubectl get secret <ngc-api-key-secret-name> --namespace <settings-namespace> --output jsonpath='{.data}' | jq

    # Check NGC image pull secret
    kubectl get secret <ngc-image-pull-secret-name> --namespace <settings-namespace>

    The API key secret must contain the field specified in spec.integrations.nvidia.ngc.apiKey.key.

  6. Ensure secret names match exactly what is specified in the Settings resource.

    kubectl get settings <settings-name> --namespace <settings-namespace> --output yaml | grep --after-context=5 integrations

  7. After correcting the secrets, verify the Settings resource becomes ready.

    kubectl get settings <settings-name> --namespace <settings-namespace>

    The READY column should show True and the IntegrationsConfigured condition should have status=True with reason="IntegrationsValid".
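For steps 4 and 5, it can help to confirm that a required key is present and non-empty without printing the secret value to the terminal. The following sketch reports only the decoded length of one key. The function names are hypothetical, and the key name is whatever the Settings spec references. Keys that contain dots would need jsonpath escaping, which this sketch does not handle.

```shell
# Hypothetical helpers; the names are illustrative and not part of PaletteAI.

# Read a base64 value on stdin and print its decoded length in bytes.
# The decoded value itself is never printed.
decoded_length() {
  base64 --decode | wc -c | tr -d ' '
}

# Print the decoded length of one key in a secret.
# $1 = secret name, $2 = namespace, $3 = key inside .data
secret_key_length() {
  kubectl get secret "$1" --namespace "$2" --output "jsonpath={.data.$3}" \
    | decoded_length
}
```

A result of 0 from secret_key_length <secret-name> <settings-namespace> <key> means the key is missing or empty.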

Tenant Admin Groups Do Not Work

Symptom: Users in Tenant admin groups do not have expected permissions in Project namespaces.

Possible causes:

  1. OIDC groups do not match your identity provider configuration.

  2. PaletteAI did not create RBAC resources.

  3. When Dex static users are in use, the Dex configuration does not match the expected users or groups.

Resolution:

  1. Verify Dex static user configuration (if applicable).

    kubectl get configmap dex --namespace mural-system --output yaml | grep --after-context=10 staticClients

  2. Verify RBAC resources exist in the Project namespace.

    kubectl get roles --namespace <project-namespace> | grep tnt-adm
    kubectl describe role prj-<project-name>-tnt-adm --namespace <project-namespace>

  3. Verify the RoleBinding includes the expected groups.

    kubectl get rolebinding --namespace <project-namespace> | grep tnt-adm
    kubectl describe rolebinding <rolebinding-name> --namespace <project-namespace>

  4. Verify group membership using impersonation (if OIDC is configured).

    # Test as a user in the tenant admin group
    kubectl auth can-i create projects --as=<username> --as-group=<tenant-admin-group>
    # Expected output: yes

    If the impersonation test succeeds but a real user still lacks access after logging in, the issue is likely an identity provider misconfiguration:

    • The user's token does not include the expected group claim. Verify IdP group claim mapping.

    • The Kubernetes API server is not configured to pass through group claims. Verify OIDC configuration in API server flags.

    • Compare --as-group test results with the actual groups returned in a user's token payload.

    To diagnose, decode the user's JWT and verify that its groups claim matches your tenantRoleMapping.groups configuration.

  5. If RBAC resources are missing, delete and recreate the Project.

    Warning

    Deleting and recreating a Project will remove all workloads, deployments, and resources in that Project's namespace. Before proceeding:

    • Back up any important data.

    • Document deployed workloads.

    • Ensure you can recreate the Project configuration.

    • Verify no production workloads are running.
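For the token check in step 4, the payload of a JWT can be decoded locally with standard tools. The sketch below is an illustration only: the function name jwt_payload is hypothetical, it does not verify the token signature, and it assumes the token has the usual three dot-separated base64url segments.

```shell
# Hypothetical helper; the name is illustrative and not part of PaletteAI.

# Print the decoded JSON payload of a JWT. The signature is NOT verified.
# $1 = the raw token
jwt_payload() {
  # Take the second segment and map base64url characters back to base64.
  payload=$(printf '%s' "$1" | cut -d '.' -f 2 | tr '_-' '/+')
  # Restore the '=' padding that base64url encoding omits.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 --decode
}
```

With a token stored in a shell variable, jwt_payload "$TOKEN" | jq '.groups' shows the groups claim to compare against tenantRoleMapping.groups. Treat tokens as credentials and avoid pasting them into online decoders.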