Troubleshooting Tenants

This page provides troubleshooting guidance for common issues when creating and managing Tenants.

Tenant Not Ready

Symptom: The Tenant's Ready condition is False.

Possible causes:

  1. The referenced Settings resource does not exist or is not ready.

  2. PaletteAI cannot create the Tenant namespace (tenant-<tenant-name>).

  3. GPU usage exceeds limits.

Resolution:

  1. View Tenant conditions.

    kubectl get tenant <tenant-name> --output jsonpath='{.status.conditions}' | jq
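
    The output is a standard Kubernetes conditions array. The example below is illustrative only: the condition types appear on this page, but the exact reason and message values depend on your installation.

    [
      {
        "type": "SettingsConfigured",
        "status": "False",
        "reason": "SettingsNotFound",
        "message": "Settings resource not found in the referenced namespace",
        "lastTransitionTime": "2024-01-01T00:00:00Z"
      },
      {
        "type": "Ready",
        "status": "False",
        "reason": "SettingsNotConfigured",
        "message": "Tenant is not ready",
        "lastTransitionTime": "2024-01-01T00:00:00Z"
      }
    ]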
  2. If SettingsConfigured is False, verify the Settings resource.

    Check the Settings resource in the namespace referenced by .spec.settingsRef.namespace (typically your user namespace, not tenant-<name>).

    # First, get the Settings namespace from the Tenant spec
    kubectl get tenant <tenant-name> --output jsonpath='{.spec.settingsRef.namespace}'

    # Then check the Settings resource
    kubectl get settings <settings-name> --namespace <settings-namespace>
    kubectl describe settings <settings-name> --namespace <settings-namespace>
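
    To chain the two lookups in a single pass, use the following sketch. It assumes the Settings name is exposed at .spec.settingsRef.name, mirroring the .spec.settingsRef.namespace field used above.

    # Sketch: resolve the Settings reference, then describe the Settings resource
    # Assumes .spec.settingsRef.name exists alongside .spec.settingsRef.namespace
    SETTINGS_NAME=$(kubectl get tenant <tenant-name> --output jsonpath='{.spec.settingsRef.name}')
    SETTINGS_NS=$(kubectl get tenant <tenant-name> --output jsonpath='{.spec.settingsRef.namespace}')
    kubectl describe settings "$SETTINGS_NAME" --namespace "$SETTINGS_NS"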
  3. If TenantNamespaceCreated is False, verify the controller-created namespace and review events.

    # Check the controller-created namespace (tenant-<name>)
    kubectl get namespace tenant-<tenant-name>
    kubectl describe tenant <tenant-name>

    Debug namespace creation failures:

    First, check recent events; they usually contain the actual denial reason:

    # If the namespace exists but Tenant is not Ready
    kubectl get events --namespace tenant-<tenant-name> --sort-by=.lastTimestamp | tail -20

    # If namespace does not exist, check controller logs
    # First, find the tenant controller pod (label selectors may vary by installation)
    kubectl get pods --namespace mural-system

    # Then inspect logs (example using common label selector)
    kubectl logs --namespace mural-system --selector app=hue-controller --tail=100 | grep --ignore-case "tenant.*<tenant-name>"
    Info

    The exact pod label selector may differ in your installation. Use kubectl get pods --namespace mural-system to identify the tenant controller pod and inspect its logs directly.

    Common causes for namespace creation failure:

    • Namespace already exists with conflicting ownership.

    • Pod Security Admission (PSA) policies blocking namespace creation. Check for pod-security.kubernetes.io labels, as shown in the commands after this list.

    • Admission webhooks rejecting the namespace. Look for webhook names in events, and list the configured webhooks as shown below.

    • Finalizers blocking deletion of a previous namespace. A namespace stuck in Terminating prevents re-creation; check for finalizers with the following command:

      kubectl get namespace tenant-<tenant-name> --output yaml | grep finalizers
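
    To check for the PSA labels and admission webhooks mentioned above:

    # Inspect namespace labels for Pod Security Admission settings
    kubectl get namespace tenant-<tenant-name> --show-labels

    # List admission webhooks that could intercept namespace creation
    kubectl get validatingwebhookconfigurations
    kubectl get mutatingwebhookconfigurations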
  4. If TenantOversubscribed is True, reduce GPU usage.

    kubectl get tenant <tenant-name> --output jsonpath='{.status.gpuUsage}' | jq

Cannot Delete Tenant

Symptom: The Tenant delete request fails due to child Projects.

Cause: PaletteAI blocks deletion of Tenants with child Projects.

Resolution:

  1. List the Projects under the Tenant.

    kubectl get projects --all-namespaces --selector palette.ai/tenant-name=<tenant-name>
  2. Delete the Projects.

    kubectl delete project <project-name> --namespace <project-namespace>
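
    If the Tenant has many Projects, the following sketch removes them all in one pass using the selector from step 1:

    # Sketch: delete every Project labeled with this Tenant
    kubectl get projects --all-namespaces --selector palette.ai/tenant-name=<tenant-name> \
      --output jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' |
    while read -r namespace name; do
      kubectl delete project "$name" --namespace "$namespace"
    done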
  3. Verify the child Project count is 0.

    kubectl get tenant <tenant-name> --output jsonpath='{.status.childProjectCount}'
  4. Delete the Tenant.

    kubectl delete tenant <tenant-name>

GPU Quota Exceeded

Symptom: App Deployments or ComputePools remain pending because their GPU requests cannot be satisfied.

Cause: Total GPU usage across all Projects exceeds Tenant limits.

Resolution:

  1. Check current GPU usage.

    kubectl get tenant <tenant-name> --output jsonpath='{.status.gpuUsage}' | jq
  2. Check Tenant GPU limits.

    kubectl get tenant <tenant-name> --output jsonpath='{.spec.gpuResources}' | jq
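
    To view usage and limits side by side:

    # Compare current usage against configured limits in one view
    kubectl get tenant <tenant-name> --output json | jq '{usage: .status.gpuUsage, limits: .spec.gpuResources.limits}'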
  3. Reduce GPU usage or increase Tenant limits.

    To increase limits, update the Tenant manifest and apply it.

    spec:
      gpuResources:
        limits:
          'NVIDIA-A100': 128

    kubectl apply --filename tenant.yaml
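
    Alternatively, a merge patch can update the limit in place without editing the manifest file (the field path is taken from the manifest above):

    kubectl patch tenant <tenant-name> --type merge --patch '{"spec":{"gpuResources":{"limits":{"NVIDIA-A100":128}}}}'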

Tenant Admin Groups Do Not Work

Symptom: Users in Tenant admin groups do not have expected permissions in Project namespaces.

Possible causes:

  1. OIDC groups do not match your identity provider configuration.

  2. PaletteAI did not create RBAC resources.

  3. For static user setups, the Dex configuration does not include the expected users.

Resolution:

  1. Verify Dex static user configuration (if applicable).

    kubectl get configmap dex --namespace mural-system --output yaml | grep --after-context=10 staticPasswords
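
    A static user entry follows the standard Dex staticPasswords format; the values below are illustrative:

    staticPasswords:
      - email: admin@example.com
        # bcrypt hash of the user's password
        hash: "$2a$10$2b2cU8CPhOTaGrs1HRQuAueS7JTT5ZHsHSzYiFPm1leZck7Mc8T4W"
        username: admin
        userID: "08a8684b-db88-4b73-90a9-3cd1661f5466"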
  2. Verify RBAC resources exist in the Project namespace.

    kubectl get roles --namespace <project-namespace> | grep tnt-adm
    kubectl describe role prj-<project-name>-tnt-adm --namespace <project-namespace>
  3. Verify the RoleBinding includes the expected groups.

    kubectl get rolebinding --namespace <project-namespace> | grep tnt-adm
    kubectl describe rolebinding <rolebinding-name> --namespace <project-namespace>
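
    To extract just the subjects (including groups) from the RoleBinding:

    kubectl get rolebinding <rolebinding-name> --namespace <project-namespace> --output jsonpath='{.subjects}' | jq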
  4. Verify group membership using impersonation (if OIDC is configured).

    # Test as a user in the tenant admin group
    kubectl auth can-i create projects --as=<username> --as-group=<tenant-admin-group>
    # Expected output: yes

    If the impersonation test succeeds but real users still lack permissions, the issue is likely an identity provider misconfiguration:

    • The user's token does not include the expected group claim. Verify IdP group claim mapping.

    • The Kubernetes API server is not configured to pass through group claims. Verify OIDC configuration in API server flags.

    • Compare --as-group test results with the actual groups returned in a user's token payload.

    To diagnose, decode a user's JWT token and verify the groups claim matches your tenantRoleMapping.groups configuration.
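
    A minimal sketch for decoding the payload of an ID token stored in an ID_TOKEN shell variable (ID_TOKEN is a placeholder; JWT payloads are base64url-encoded, so the sketch normalizes the alphabet and padding before decoding):

    # Extract the payload (second dot-separated segment) and convert base64url to base64
    payload=$(echo "$ID_TOKEN" | cut --delimiter='.' --fields=2 | tr '_-' '/+')
    # Restore padding to a multiple of four characters
    while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
    # Decode the payload and inspect the groups claim
    echo "$payload" | base64 --decode | jq '.groups'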

  5. If RBAC resources are missing, delete and recreate the Project.

    Warning

    Deleting and recreating a Project will remove all workloads, deployments, and resources in that Project's namespace. Before proceeding:

    • Back up any important data.

    • Document deployed workloads.

    • Ensure you can recreate the Project configuration.

    • Verify no production workloads are running.