View Metrics in Grafana
PaletteAI exposes infrastructure and workload metrics from all clusters (hub and spokes) through a Prometheus server. By linking that Prometheus server to a Grafana instance you operate, you can open Grafana directly from the PaletteAI UI through the Metrics Dashboard panel on the Project Overview and Tenant Overview pages.
PaletteAI publishes the following metric sources to Prometheus out of the box. Exact availability depends on the spoke `agentType` and on which exporters run on each spoke cluster.

| Source | Metrics | Notes |
|---|---|---|
| `node-exporter` | CPU, memory, disk, and network metrics scraped from each spoke node. | - `prometheus-agent-minimal` ships only `node_cpu_seconds_total{mode="idle"}`.<br />- `prometheus-agent` ships all Node Exporter metrics.<br />- PaletteAI uses `node_cpu_seconds_total{mode="idle"}` for CPU autoscaling through ScalingPolicy. |
| `dcgm-exporter` | GPU utilization, memory, temperature, and other DCGM metrics scraped from spoke nodes that run the NVIDIA GPU Operator. | - `prometheus-agent-minimal` ships only `DCGM_FI_DEV_GPU_UTIL`.<br />- `prometheus-agent` ships all DCGM Exporter metrics.<br />- Refer to GPU Metrics Requirements. |
| vLLM | Request, latency, token throughput, and KV-cache utilization metrics exposed by vLLM-based Model Deployments on a Prometheus-compatible `/metrics` endpoint. | - The default `vllm-prod-stack` ProfileBundle creates a ServiceMonitor for vLLM automatically.<br />- For custom Model Deployments, configure your Prometheus server to scrape vLLM serving Pods directly through a PodMonitor or ServiceMonitor in the Model Deployment's namespace (see the sketch after this table). |
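For custom Model Deployments, a scrape configuration might look like the following minimal PodMonitor sketch. This assumes your Prometheus server uses the Prometheus Operator CRDs; the namespace, label selector, and port name are placeholders you must match to your own vLLM serving Pods.

```yaml
# Hypothetical PodMonitor for a custom vLLM Model Deployment.
# The namespace, labels, and port name below are assumptions; adjust them to your deployment.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-metrics
  namespace: my-model-deployment # the Model Deployment's namespace
spec:
  selector:
    matchLabels:
      app: vllm # label on the vLLM serving Pods
  podMetricsEndpoints:
    - port: http # name of the container port exposing /metrics
      path: /metrics
```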
Use this guide to link an existing Grafana instance to the Prometheus server PaletteAI ships metrics to, and to access spoke cluster metrics through the PaletteAI UI.
Prerequisites
- A PaletteAI installation with administrative access to the Helm values.

- A Grafana instance that you operate and that queries the same Prometheus server PaletteAI writes to. PaletteAI does not provision or manage Grafana. Set it up separately and ensure it is reachable from end-user browsers.
Enablement
- Configure the Prometheus agent and Grafana link in your Helm `values.yaml`. Set `global.metrics.agentType` to the agent that matches your use case, and set `global.metrics.grafanaUrl` to the URL the Open Grafana Dashboard button opens. `prometheus-agent-minimal` is the default.

  | Agent Type | Ships | Use Case |
  |---|---|---|
  | `prometheus-agent-minimal` | Only the metrics required for autoscaling: `node_cpu_seconds_total{mode="idle"}` and `DCGM_FI_DEV_GPU_UTIL`. | Prometheus runs on the hub cluster, or you only need autoscaling metrics. |
  | `prometheus-agent` | All `node-exporter` metrics and all `dcgm-exporter` metrics collected on the spoke. | You run an external Prometheus server and want broader node and GPU observability. |

  Example `global.metrics` configuration:

  ```yaml
  global:
    metrics:
      prometheusBaseUrl: 'https://prometheus.example.com:9090'
      grafanaUrl: 'https://grafana.example.com'
      agentType: 'prometheus-agent-minimal'
  ```

  Refer to Configure `global.metrics` for the full parameter reference, including authentication options.
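  If your Prometheus server requires basic authentication, one option is to reference a Kubernetes Secret through `global.metrics.basicAuthSecretName`. The sketch below is hypothetical: the Secret name, the namespace, and the expectation that the standard `kubernetes.io/basic-auth` keys are consumed are assumptions; refer to Authentication for the exact contract.

  ```yaml
  # Hypothetical basic-auth Secret; name, namespace, and key layout are assumptions.
  apiVersion: v1
  kind: Secret
  metadata:
    name: prometheus-basic-auth
    namespace: mural-system # namespace used by the helm upgrade command below
  type: kubernetes.io/basic-auth
  stringData:
    username: metrics-writer
    password: change-me
  ```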
- For GPU metrics, confirm the NVIDIA GPU Operator is running on each GPU-enabled spoke cluster you want metrics from. PaletteAI installs Node Exporter automatically as an Open Cluster Management (OCM) addon, but does not deploy the GPU Operator. The Prometheus agent scrapes `nvidia-dcgm-exporter` pods in the `gpu-operator` namespace wherever it finds them.

  Confirm the operator's pods are running on the spoke cluster:

  ```shell
  kubectl get pods --namespace gpu-operator
  ```

  Example output:

  ```shell
  NAME                         READY   STATUS    RESTARTS   AGE
  nvidia-dcgm-exporter-abcde   1/1     Running   0          5m
  ```

  If the operator is not present, install it through one of the following paths:
  - Palette Cluster Profile: If you manage spoke clusters through Palette Cluster Profiles, add the NVIDIA GPU Operator pack as an Add-on Cluster Profile in your Infrastructure or Fullstack Profile Bundle. Refer to Profile Bundles for guidance on attaching Add-on Profiles.

  - Upstream Helm Chart: Install the upstream `gpu-operator` Helm chart directly:

    ```shell
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    helm install gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator \
      --create-namespace
    ```

  Refer to GPU Metrics Requirements for the namespace and pod-name expectations the agent relies on.
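  To spot-check that the exporter is serving metrics before Prometheus scrapes it, you can port-forward the exporter and query its endpoint. This sketch assumes the default DaemonSet name and the DCGM Exporter's default metrics port of 9400.

  ```shell
  # Forward the exporter's metrics port to localhost, then look for a GPU metric.
  kubectl port-forward --namespace gpu-operator ds/nvidia-dcgm-exporter 9400:9400 &
  curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
  ```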
- Configure your Grafana instance to query the same Prometheus server PaletteAI sends metrics to. Provision Grafana sign-in credentials for the users who will open the Metrics Dashboard panel. A minimal data source sketch follows this note.

  :::info
  PaletteAI does not manage Grafana credentials. When a user selects Open Grafana Dashboard, PaletteAI opens the configured URL in a new browser tab and Grafana prompts for its own sign-in. Users authenticate with their Grafana credentials. The Helm values `global.metrics.username`, `global.metrics.password`, and `global.metrics.basicAuthSecretName` configure how PaletteAI authenticates to Prometheus, not Grafana. Refer to Authentication.
  :::
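  As a starting point, a file-provisioned Prometheus data source in Grafana might look like the following. The data source name and URL are placeholders; the URL should match `global.metrics.prometheusBaseUrl` from the earlier example.

  ```yaml
  # /etc/grafana/provisioning/datasources/paletteai-prometheus.yaml (hypothetical file name)
  apiVersion: 1
  datasources:
    - name: PaletteAI Prometheus # placeholder name
      type: prometheus
      access: proxy
      url: https://prometheus.example.com:9090 # must match prometheusBaseUrl
      isDefault: true
  ```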
- Apply the updated Helm values to your PaletteAI installation. The Metrics Dashboard panel appears on Project Overview and Tenant Overview.

  ```shell
  helm upgrade paletteai spectrocloud/paletteai \
    --namespace mural-system \
    --values values.yaml
  ```
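  To confirm the upgrade applied your metrics settings, you can inspect the deployed values:

  ```shell
  # Print the user-supplied values for the release, including global.metrics.
  helm get values paletteai --namespace mural-system
  ```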
Build Your Own Dashboards
Use Grafana to author dashboards that query the metrics described in the table above using PromQL.
Refer to the upstream Grafana dashboard documentation for guidance on creating panels, organizing dashboards, and importing community dashboards.
The Grafana community provides starting-point dashboards for the exporters PaletteAI uses, such as the Node Exporter Full dashboard and the NVIDIA DCGM Exporter dashboard.
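For example, the following PromQL queries are reasonable starting points for CPU and GPU panels. The first works with either agent type because both ship the idle-mode CPU metric; the `Hostname` label on the DCGM metric is typical of the DCGM Exporter but depends on your exporter configuration.

```promql
# Per-node CPU utilization (%) over a 5-minute window, derived from idle time
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Average GPU utilization per node
avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)
```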
Validate
- Confirm the Metrics Dashboard panel appears on Tenant Overview after Tenant selection, and on Project Overview after navigating into a Project. If the panel is missing on either page, `global.metrics.grafanaUrl` is unset or empty.

- From either page, select Open Grafana Dashboard and confirm Grafana opens in a new browser tab.

- Sign in to Grafana and run a PromQL query against a metric in the table above to confirm Prometheus is receiving data from PaletteAI. For example, query `node_cpu_seconds_total` to verify Node Exporter metrics are present, or `DCGM_FI_DEV_GPU_UTIL` to verify DCGM metrics are present. If you prefer to check outside Grafana, you can query the Prometheus HTTP API directly, as shown below.
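As an optional check outside Grafana, you can query the Prometheus HTTP API directly. The URL below is the placeholder from the earlier `prometheusBaseUrl` example; add `-u <user>:<password>` if your server requires basic authentication.

```shell
# Instant query for a DCGM metric through the Prometheus HTTP API.
curl -sG 'https://prometheus.example.com:9090/api/v1/query' \
  --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL'
```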
Next Steps
- Confirm dashboard variables and labels match your Prometheus configuration before relying on community dashboards or building your own. Validating that Grafana can connect to Prometheus and return data for the metrics you care about avoids debugging dashboards against an empty data source.

- Refer to Configure Prometheus Agent Monitoring for the full Helm reference and authentication options.

- Refer to Create and Manage Scaling Policies to use CPU and GPU metrics for autoscaling.