Configure Prometheus Agent Monitoring
PaletteAI can ship metrics from spoke clusters to a Prometheus server and use them for autoscaling decisions on the hub cluster. Configure this behavior with the global.metrics section in your Helm values.yaml.
This page explains how to:
- Choose between prometheus-agent-minimal and prometheus-agent.
- Configure the global.metrics values.
- Prepare Prometheus and Grafana for PaletteAI.
- Understand the GPU metrics and authentication requirements.
PaletteAI uses global.metrics for three separate behaviors:
- Spoke-side Prometheus agents that scrape metrics and send them to Prometheus with remote_write.
- Hub-side Prometheus queries used by ScalingPolicy autoscaling.
- The Grafana link shown on the Project Overview and Tenant Overview pages.
Prerequisites
- A running PaletteAI installation
- A Prometheus server that is reachable from the hub cluster and every spoke cluster that will run a Prometheus agent
- A Prometheus deployment that accepts remote-write traffic on /api/v1/write
- Node Exporter enabled on each spoke cluster that runs a Prometheus agent, so agents can scrape CPU metrics for autoscaling
- DCGM Exporter enabled on each GPU-enabled spoke cluster where you want GPU metrics or GPU autoscaling (for example through the NVIDIA GPU Operator)
If you use the Spectro Cloud Prometheus Operator pack, enable the remote monitoring feature so the Prometheus server can receive metrics from Prometheus agents. Enable Node Exporter and DCGM Exporter in the pack configuration so Prometheus collects the CPU and GPU metrics that PaletteAI uses. Refer to Deploy Monitoring Stack and Prometheus Operator for Prometheus server and Grafana setup guidance.
If you use a standalone Prometheus deployment, enable the remote-write receiver endpoint, for example by starting Prometheus with the --web.enable-remote-write-receiver command-line flag. Refer to the Prometheus remote_write configuration and Prometheus command-line flags.
For All-In-One (AIO) installs, Prometheus is provisioned automatically on the PaletteAI hub cluster. The PaletteAI installer creates a prometheus-basic-auth Secret in the mural-system namespace with username and password keys you can use to sign in to Prometheus.
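To sign in, you first need the values stored in that Secret. The following is a minimal sketch that reads one key at a time with kubectl. The Secret name, namespace, and key names come from the paragraph above; the helper name read_prometheus_credential is hypothetical, and the sketch assumes your kubectl context points at the hub cluster.

```shell
# Print one key (username or password) from the installer-created Secret.
# read_prometheus_credential is a hypothetical helper name, not a PaletteAI tool.
read_prometheus_credential() {
  key="$1"
  # Secret data is base64-encoded, so decode it before printing.
  kubectl get secret prometheus-basic-auth \
    --namespace mural-system \
    --output "jsonpath={.data.${key}}" | base64 --decode
}
```

For example, run read_prometheus_credential username, then read_prometheus_credential password.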
Choose a Prometheus Agent
PaletteAI supports two spoke-side agent modes.
| Agent Type | What It Ships | Recommended Use |
|---|---|---|
| prometheus-agent-minimal | Only the metrics required for autoscaling: node_cpu_seconds_total{mode="idle"} and DCGM_FI_DEV_GPU_UTIL | Use this when Prometheus runs on the hub cluster, or when you only need autoscaling metrics |
| prometheus-agent | All node-exporter metrics and all dcgm-exporter metrics collected on the spoke | Use this when you have an external Prometheus server and want broader node and GPU observability |
prometheus-agent-minimal is the default and the recommended choice when your Prometheus server runs on the hub cluster. It reduces metric volume, network traffic, and storage pressure, and is sufficient for autoscaling because PaletteAI only queries node_cpu_seconds_total{mode="idle"} and DCGM_FI_DEV_GPU_UTIL.
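To see why the idle counter alone is enough for a CPU signal, consider the common PromQL pattern that derives utilization from it: utilization is one minus the idle rate. This page does not show the query PaletteAI actually runs, so the expression below is an illustrative sketch only. The helper name cpu_utilization_query and the 5m default window are assumptions. DCGM_FI_DEV_GPU_UTIL is already reported as a utilization percentage, so it needs no such conversion.

```shell
# Build a PromQL expression for average CPU utilization (percent) from the
# idle counter. Illustrative only: the function name and the default 5m rate
# window are assumptions, not PaletteAI's documented query.
cpu_utilization_query() {
  window="${1:-5m}"
  printf '100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[%s])))\n' "$window"
}
```

Calling cpu_utilization_query 2m prints the same expression with a two-minute rate window, which you can paste into the Prometheus expression browser.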
Configure global.metrics
Set the following values in your Helm values.yaml.
| Parameter | Description | Default |
|---|---|---|
| global.metrics.prometheusBaseUrl | Base URL of the Prometheus server. Use protocol, host, and port only. Do not include /api/v1/write. | https://prometheus.mural-system.svc.cluster.local:30090 |
| global.metrics.grafanaUrl | URL opened by the Open Grafana Dashboard button on Project Overview and Tenant Overview. This does not configure Grafana itself. | "" |
| global.metrics.timeout | Timeout for Prometheus remote_write and query operations. | "5s" |
| global.metrics.scrapeInterval | How often spoke agents scrape metrics from exporters. Lower values increase metric resolution, network traffic, and storage use. | "15s" |
| global.metrics.agentType | Prometheus agent to deploy on spoke clusters. Supported values are prometheus-agent-minimal and prometheus-agent. | "prometheus-agent-minimal" |
| global.metrics.username | Basic auth username used by the Prometheus agent remote_write configuration. | "" |
| global.metrics.password | Basic auth password used by the Prometheus agent remote_write configuration. | "" |
| global.metrics.basicAuthSecretName | Secret containing username and password keys for Prometheus authentication. When set, PaletteAI uses it for hub-side Prometheus queries and spoke-side remote_write credentials. | "" |
Use a configuration similar to the following:
global:
  metrics:
    prometheusBaseUrl: 'https://prometheus.example.com:9090'
    grafanaUrl: 'https://grafana.example.com'
    timeout: '5s'
    scrapeInterval: '15s'
    agentType: 'prometheus-agent-minimal'
    username: ''
    password: ''
    basicAuthSecretName: ''
Prometheus Server Requirements
PaletteAI appends /api/v1/write to global.metrics.prometheusBaseUrl when it configures spoke-side remote_write. The value you provide must therefore be the Prometheus base URL only.
For example, if your Prometheus server receives remote-write traffic at https://prometheus.example.com:9090/api/v1/write, set:
global:
  metrics:
    prometheusBaseUrl: 'https://prometheus.example.com:9090'
Do not set:
global:
  metrics:
    prometheusBaseUrl: 'https://prometheus.example.com:9090/api/v1/write'
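You can check the base-URL rule above before you run Helm. The following is a minimal sketch of a hypothetical helper, remote_write_url, that rejects a value already ending in /api/v1/write and otherwise prints the endpoint that results from appending the path. Stripping a trailing slash is an assumption made here for convenience; this page does not say how PaletteAI treats one.

```shell
# Derive the remote_write endpoint as described above: base URL + /api/v1/write.
# Hypothetical helper for pre-flight checks; not part of PaletteAI.
remote_write_url() {
  base="${1%/}"   # drop one trailing slash, if present (assumption)
  case "$base" in
    */api/v1/write)
      echo "error: prometheusBaseUrl must not include /api/v1/write" >&2
      return 1
      ;;
  esac
  printf '%s/api/v1/write\n' "$base"
}
```

Running remote_write_url 'https://prometheus.example.com:9090' prints the full write endpoint, while passing the URL from the "Do not set" example returns a non-zero exit status.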
GPU Metrics Requirements
Both PaletteAI Prometheus agent modes look for nvidia-dcgm-exporter pods in the gpu-operator namespace on spoke clusters. If those exporters are not present, GPU metrics are not shipped to Prometheus and GPU autoscaling data is unavailable.
Install the NVIDIA GPU Operator on every GPU-enabled spoke cluster where you want GPU monitoring or GPU autoscaling.
If you manage spoke cluster software through Palette Cluster Profiles, you can include the NVIDIA GPU Operator as an Add-on Cluster Profile in an Infrastructure or Fullstack Profile Bundle. Refer to Profile Bundles for more information.
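To check whether a spoke cluster already has the exporter pods described in this section, you can list pods in the gpu-operator namespace. The following is a minimal sketch, assuming your kubectl context points at the spoke cluster. The helper name has_dcgm_exporter is hypothetical, and the check matches on pod name only; it does not verify that the pods are Running.

```shell
# Succeeds if at least one pod whose name contains nvidia-dcgm-exporter
# exists in the gpu-operator namespace of the current kubectl context.
# Hypothetical helper; it checks pod names only, not pod health.
has_dcgm_exporter() {
  kubectl get pods --namespace gpu-operator --no-headers 2>/dev/null \
    | grep -q 'nvidia-dcgm-exporter'
}
```

For example: has_dcgm_exporter && echo "DCGM exporter found" || echo "DCGM exporter missing".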
Authentication
PaletteAI supports Prometheus basic auth in two ways: inline values in global.metrics.username and global.metrics.password, or a Secret named in global.metrics.basicAuthSecretName.
- If your Prometheus server does not require basic auth, leave all three fields empty.
- If you want to provide credentials inline, set global.metrics.username and global.metrics.password.
- If you want to read credentials from a Secret, set global.metrics.basicAuthSecretName. The Secret must contain username and password keys.
- If you use global.metrics.basicAuthSecretName, PaletteAI uses those credentials for both hub-side Prometheus queries and spoke-side Prometheus agent remote_write configuration.
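A Secret with the required username and password keys can be created with kubectl. The following is a minimal sketch. The helper name create_basic_auth_secret and the example Secret name are hypothetical, and this page does not state which namespace PaletteAI reads the Secret from, so the namespace is left as a parameter.

```shell
# Create a generic Secret with the two keys PaletteAI expects.
# Hypothetical helper. Passing the password as an argument exposes it to
# shell history and process listings, so prefer a safer source in production.
create_basic_auth_secret() {
  name="$1"
  namespace="$2"
  user="$3"
  pass="$4"
  kubectl create secret generic "$name" \
    --namespace "$namespace" \
    --from-literal=username="$user" \
    --from-literal=password="$pass"
}
```

For example, create_basic_auth_secret prometheus-remote-auth mural-system alice 's3cret' creates the Secret, after which you would set global.metrics.basicAuthSecretName to prometheus-remote-auth. The mural-system namespace in this example is an assumption based on where the AIO installer places its own Secret.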
Apply the Configuration
Apply the updated Helm values to your PaletteAI installation.
helm upgrade paletteai spectrocloud/paletteai \
  --namespace mural-system \
  --values values.yaml
Validate
- Confirm the hub can query Prometheus by checking a ScalingPolicy resource.
  kubectl get scalingpolicy --namespace <project-namespace>
  Example Output
  NAME             PROMETHEUS_AVAILABLE   AGE
  example-policy   True                   2m
  If the PROMETHEUS_AVAILABLE column shows True, the hub can reach Prometheus for autoscaling queries.
- Confirm your Prometheus server is receiving metrics from spoke clusters. Spoke targets should appear with state=UP in the Prometheus Targets view, and remote-write ingestion metrics should report non-zero samples.
  Review your Prometheus targets and remote-write ingestion metrics in Prometheus or Grafana. If metrics are missing, confirm network connectivity from the spoke to global.metrics.prometheusBaseUrl, and confirm the remote-write receiver is enabled on the Prometheus server.
- Confirm the Grafana link appears in the PaletteAI UI.
  If global.metrics.grafanaUrl is set, the Project Overview and Tenant Overview pages display a Metrics Dashboard panel with an Open Grafana Dashboard button.
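Besides the Prometheus and Grafana views, you can confirm from a terminal that the autoscaling metrics have arrived by running an instant query against the Prometheus HTTP API at /api/v1/query. The following is a minimal sketch using curl. The helper name prom_query and the PROM_USER and PROM_PASS environment variables are hypothetical; leave them unset if your Prometheus server does not require basic auth.

```shell
# Run an instant query against the Prometheus HTTP API (/api/v1/query).
# prom_query, PROM_USER, and PROM_PASS are hypothetical names for this sketch.
prom_query() {
  base_url="${1%/}"   # same base URL as global.metrics.prometheusBaseUrl
  promql="$2"
  if [ -n "${PROM_USER:-}" ]; then
    # Send basic auth credentials only when a username is provided.
    curl --fail --silent --show-error --get \
      --user "${PROM_USER}:${PROM_PASS:-}" \
      --data-urlencode "query=${promql}" \
      "${base_url}/api/v1/query"
  else
    curl --fail --silent --show-error --get \
      --data-urlencode "query=${promql}" \
      "${base_url}/api/v1/query"
  fi
}
```

For example, prom_query 'https://prometheus.example.com:9090' 'DCGM_FI_DEV_GPU_UTIL' returns a JSON document. A "status" of "success" with a non-empty "result" array indicates that at least one spoke is shipping that series.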
Instance Name on the Hub and Spokes
PaletteAI installations are uniquely identified via global.instanceName. That value is surfaced to spoke-side Prometheus add-ons (for example PALETTEAI_INSTANCE_NAME) and metric selectors (for example paletteai_instance) so hub queries, Grafana, and autoscaling can line up series from the hub and from spoke clusters.
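If paletteai_instance is attached to shipped series as a label, which is an assumption inferred from the description above, you could narrow a query to one installation with a label matcher. The following is a minimal sketch; the helper name instance_selector and the example instance name are hypothetical.

```shell
# Build a PromQL selector that filters a metric by installation.
# Assumes paletteai_instance is a label on the series; hypothetical helper.
instance_selector() {
  metric="$1"
  instance="$2"
  printf '%s{paletteai_instance="%s"}\n' "$metric" "$instance"
}
```

For example, instance_selector DCGM_FI_DEV_GPU_UTIL my-instance prints a selector you can paste into the Prometheus expression browser or a Grafana panel.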
For appliance merge behavior and installer-managed fields, refer to Deploy PaletteAI. For the Helm global section, refer to the Global step in Install PaletteAI on Kubernetes.
Next Steps
- Refer to Create and Manage Scaling Policies to configure CPU or GPU autoscaling.
- Refer to View Metrics in Grafana for the user-facing workflow that opens the Metrics Dashboard panel and the metrics PaletteAI publishes.
- Refer to Helm Chart Configuration for the full Helm values reference, including global.metrics.
- Refer to the Hub and Spoke Model for topology guidance when you plan hub-only or dedicated spoke monitoring layouts.