View Metrics in Grafana
PaletteAI exposes infrastructure and workload metrics from all clusters (hub and spokes) through a Prometheus server. By linking that Prometheus server to a Grafana instance you operate, you can open Grafana directly from the PaletteAI UI through the Metrics Dashboard panel on the Project Overview and Tenant Overview pages.
PaletteAI publishes the following metric sources to Prometheus out of the box. Exact availability depends on the spoke `agentType` and on which exporters run on each spoke cluster.

| Source | Metrics | Notes |
|---|---|---|
| `node-exporter` | CPU, memory, disk, and network metrics scraped from each spoke node. | - `prometheus-agent-minimal` ships only `node_cpu_seconds_total{mode="idle"}`.<br />- `prometheus-agent` ships all Node Exporter metrics.<br />- PaletteAI uses `node_cpu_seconds_total{mode="idle"}` for CPU autoscaling through ScalingPolicy. |
| `dcgm-exporter` | GPU utilization, memory, temperature, and other DCGM metrics scraped from spoke nodes that run the NVIDIA GPU Operator. | - `prometheus-agent-minimal` ships only `DCGM_FI_DEV_GPU_UTIL`.<br />- `prometheus-agent` ships all DCGM Exporter metrics.<br />- Refer to GPU Metrics Requirements. |
| vLLM | Request, latency, token throughput, and KV-cache utilization metrics exposed by vLLM-based Model Deployments on a Prometheus-compatible `/metrics` endpoint. | - The default `vllm-prod-stack` ProfileBundle creates a ServiceMonitor for vLLM automatically.<br />- For custom Model Deployments, configure your Prometheus server to scrape vLLM serving Pods directly through a PodMonitor or ServiceMonitor in the Model Deployment's namespace (see the sketch after this table). |
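For custom Model Deployments, a scrape configuration might look like the following minimal PodMonitor sketch. This assumes your Prometheus server uses the Prometheus Operator CRDs; the namespace, label selector, and port name are placeholders you must match to your own vLLM serving Pods.

```yaml
# Hypothetical PodMonitor for a custom vLLM Model Deployment.
# The namespace, labels, and port name below are assumptions; adjust them to your deployment.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-metrics
  namespace: my-model-deployment # the Model Deployment's namespace
spec:
  selector:
    matchLabels:
      app: vllm # label on the vLLM serving Pods
  podMetricsEndpoints:
    - port: http # name of the container port exposing /metrics
      path: /metrics
```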
Use this guide to link an existing Grafana instance to the Prometheus server PaletteAI ships metrics to, and to access spoke cluster metrics through the PaletteAI UI.
Prerequisites
- A PaletteAI installation with administrative access to the Helm values.

- A Grafana instance that you operate and that queries the same Prometheus server PaletteAI writes to. PaletteAI does not provision or manage Grafana. Set it up separately and ensure it is reachable from end-user browsers.
Enablement
- Configure the Prometheus agent and Grafana link in your Helm `values.yaml`. Set `global.metrics.agentType` to the agent that matches your use case, and set `global.metrics.grafanaUrl` to the URL the Open Grafana Dashboard button opens. `prometheus-agent-minimal` is the default.

  | Agent Type | Ships | Use Case |
  |---|---|---|
  | `prometheus-agent-minimal` | Only the metrics required for autoscaling: `node_cpu_seconds_total{mode="idle"}` and `DCGM_FI_DEV_GPU_UTIL`. | Prometheus runs on the hub cluster, or you only need autoscaling metrics. |
  | `prometheus-agent` | All `node-exporter` metrics and all `dcgm-exporter` metrics collected on the spoke. | You run an external Prometheus server and want broader node and GPU observability. |

  Example `global.metrics` configuration:

  ```yaml
  global:
    metrics:
      prometheusBaseUrl: 'https://prometheus.example.com:9090'
      grafanaUrl: 'https://grafana.example.com'
      agentType: 'prometheus-agent-minimal'
  ```

  Refer to Configure `global.metrics` for the full parameter reference, including authentication options.
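  If your Prometheus server requires basic authentication, one option is to reference a Kubernetes Secret through `global.metrics.basicAuthSecretName`. The sketch below is hypothetical: the Secret name, the namespace, and the expectation that the standard `kubernetes.io/basic-auth` keys are consumed are assumptions; refer to Authentication for the exact contract.

  ```yaml
  # Hypothetical basic-auth Secret; name, namespace, and key layout are assumptions.
  apiVersion: v1
  kind: Secret
  metadata:
    name: prometheus-basic-auth
    namespace: mural-system # namespace used by the helm upgrade command below
  type: kubernetes.io/basic-auth
  stringData:
    username: metrics-writer
    password: change-me
  ```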
- For GPU metrics, confirm the NVIDIA GPU Operator is running on each GPU-enabled spoke cluster you want metrics from. PaletteAI installs Node Exporter automatically as an Open Cluster Management (OCM) addon, but does not deploy the GPU Operator. The Prometheus agent scrapes `nvidia-dcgm-exporter` pods in the `gpu-operator` namespace wherever it finds them.

  Confirm the operator's pods are running on the spoke cluster:

  ```shell
  kubectl get pods --namespace gpu-operator
  ```

  Example output:

  ```shell
  NAME                         READY   STATUS    RESTARTS   AGE
  nvidia-dcgm-exporter-abcde   1/1     Running   0          5m
  ```

  If the operator is not present, install it through one of the following paths:
  - Palette Cluster Profile: If you manage spoke clusters through Palette Cluster Profiles, add the NVIDIA GPU Operator pack as an Add-on Cluster Profile in your Infrastructure or Fullstack Profile Bundle. Refer to Profile Bundles for guidance on attaching Add-on Profiles.

  - Upstream Helm Chart: Install the upstream `gpu-operator` Helm chart directly:

    ```shell
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    helm install gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator \
      --create-namespace
    ```

  Refer to GPU Metrics Requirements for the namespace and pod-name expectations the agent relies on.
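  To spot-check that the exporter is serving metrics before Prometheus scrapes it, you can port-forward the exporter and query its endpoint. This sketch assumes the default DaemonSet name and the DCGM Exporter's default metrics port of 9400.

  ```shell
  # Forward the exporter's metrics port to localhost, then look for a GPU metric.
  kubectl port-forward --namespace gpu-operator ds/nvidia-dcgm-exporter 9400:9400 &
  curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
  ```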
- Configure your Grafana instance to query the same Prometheus server PaletteAI sends metrics to. Provision Grafana sign-in credentials for the users who will open the Metrics Dashboard panel. A minimal data source sketch follows this note.

  :::info
  PaletteAI does not manage Grafana credentials. When a user selects Open Grafana Dashboard, PaletteAI opens the configured URL in a new browser tab and Grafana prompts for its own sign-in. Users authenticate with their Grafana credentials. The Helm values `global.metrics.username`, `global.metrics.password`, and `global.metrics.basicAuthSecretName` configure how PaletteAI authenticates to Prometheus, not Grafana. Refer to Authentication.
  :::
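  As a starting point, a file-provisioned Prometheus data source in Grafana might look like the following. The data source name and URL are placeholders; the URL should match `global.metrics.prometheusBaseUrl` from the earlier example.

  ```yaml
  # /etc/grafana/provisioning/datasources/paletteai-prometheus.yaml (hypothetical file name)
  apiVersion: 1
  datasources:
    - name: PaletteAI Prometheus # placeholder name
      type: prometheus
      access: proxy
      url: https://prometheus.example.com:9090 # must match prometheusBaseUrl
      isDefault: true
  ```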
- Apply the updated Helm values to your PaletteAI installation. The Metrics Dashboard panel appears on Project Overview and Tenant Overview.

  ```shell
  helm upgrade paletteai spectrocloud/paletteai \
    --namespace mural-system \
    --values values.yaml
  ```
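  To confirm the upgrade applied your metrics settings, you can inspect the deployed values:

  ```shell
  # Print the user-supplied values for the release, including global.metrics.
  helm get values paletteai --namespace mural-system
  ```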
Build Your Own Dashboards
Use Grafana to author dashboards that query the metrics described in the table above using PromQL.
Refer to the upstream Grafana dashboard documentation for guidance on creating panels, organizing dashboards, and importing community dashboards.
The Grafana community provides starting-point dashboards for the exporters PaletteAI uses, such as the Node Exporter Full dashboard and the NVIDIA DCGM Exporter dashboard.
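For example, the following PromQL queries are reasonable starting points for CPU and GPU panels. The first works with either agent type because both ship the idle-mode CPU metric; the `Hostname` label on the DCGM metric is typical of the DCGM Exporter but depends on your exporter configuration.

```promql
# Per-node CPU utilization (%) over a 5-minute window, derived from idle time
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Average GPU utilization per node
avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)
```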
Validate
- Confirm the Metrics Dashboard panel appears on Tenant Overview after Tenant selection, and on Project Overview after navigating into a Project. If the panel is missing on either page, `global.metrics.grafanaUrl` is unset or empty.

- From either page, select Open Grafana Dashboard and confirm Grafana opens in a new browser tab.

- Sign in to Grafana and run a PromQL query against a metric in the table above to confirm Prometheus is receiving data from PaletteAI. For example, query `node_cpu_seconds_total` to verify Node Exporter metrics are present, or `DCGM_FI_DEV_GPU_UTIL` to verify DCGM metrics are present. If you prefer to check outside Grafana, you can query the Prometheus HTTP API directly, as shown below.
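As an optional check outside Grafana, you can query the Prometheus HTTP API directly. The URL below is the placeholder from the earlier `prometheusBaseUrl` example; add `-u <user>:<password>` if your server requires basic authentication.

```shell
# Instant query for a DCGM metric through the Prometheus HTTP API.
curl -sG 'https://prometheus.example.com:9090/api/v1/query' \
  --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL'
```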
Next Steps
- Confirm dashboard variables and labels match your Prometheus configuration before relying on community dashboards or building your own. Validating that Grafana can connect to Prometheus and return data for the metrics you care about avoids debugging dashboards against an empty data source.

- Refer to Configure Prometheus Agent Monitoring for the full Helm reference and authentication options.

- Refer to Create and Manage Scaling Policies to use CPU and GPU metrics for autoscaling.