# Configure Prometheus Agent Monitoring

PaletteAI can ship metrics from spoke clusters to a Prometheus server and use them for autoscaling decisions on the hub cluster. Configure this behavior with the `global.metrics` section in your Helm `values.yaml`.
This page explains how to:

- Choose between `prometheus-agent-minimal` and `prometheus-agent`.
- Configure the `global.metrics` values.
- Prepare Prometheus and Grafana for PaletteAI.
- Understand the GPU metrics and authentication requirements.
PaletteAI uses `global.metrics` for three separate behaviors:

- Spoke-side Prometheus agents that scrape metrics and send them to Prometheus with `remote_write`.
- Hub-side Prometheus queries used by `ScalingPolicy` autoscaling.
- The Grafana link shown on the Project Overview page.
## Prerequisites

- A running PaletteAI installation
- A Prometheus server that is reachable from the hub cluster and from every spoke cluster that will run a Prometheus agent
- A Prometheus deployment that accepts remote-write traffic on `/api/v1/write`
- Node Exporter enabled on each spoke cluster that runs a Prometheus agent, so agents can scrape CPU metrics for autoscaling
- DCGM Exporter enabled on each GPU-enabled spoke cluster where you want GPU metrics or GPU autoscaling (for example, through the NVIDIA GPU Operator)
If you use the Spectro Cloud Prometheus Operator pack, enable the remote monitoring feature so the Prometheus server can receive metrics from Prometheus agents. Enable Node Exporter and DCGM Exporter in the pack configuration so Prometheus collects the CPU and GPU metrics that PaletteAI uses. Refer to Deploy Monitoring Stack and Prometheus Operator for Prometheus server and Grafana setup guidance.
If you use a standalone Prometheus deployment, enable the remote-write receiver endpoint. Refer to the Prometheus `remote_write` configuration and Prometheus command-line flags documentation.
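On a standalone server, the receiver is enabled with a Prometheus command-line flag. The following is a minimal sketch, assuming Prometheus runs directly on a host with a config file at the path shown; adjust paths and ports to your deployment:

```shell
# Enable the remote-write receiver on a standalone Prometheus server.
# --web.enable-remote-write-receiver exposes /api/v1/write so that
# spoke-side agents can push metrics to this server.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --web.enable-remote-write-receiver \
  --web.listen-address=:9090
```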
## Choose a Prometheus Agent
PaletteAI supports two spoke-side agent modes.
| Agent Type | What It Ships | Recommended Use |
|---|---|---|
| `prometheus-agent-minimal` | Only the metrics required for autoscaling: `node_cpu_seconds_total{mode="idle"}` and `DCGM_FI_DEV_GPU_UTIL` | Use this when Prometheus runs on the hub cluster, or when you only need autoscaling metrics |
| `prometheus-agent` | All Node Exporter metrics and all DCGM Exporter metrics collected on the spoke | Use this when you have an external Prometheus server and want broader node and GPU observability |
`prometheus-agent-minimal` is the default and the recommended choice when your Prometheus server runs on the hub cluster. It reduces metric volume, network traffic, and storage pressure, and is sufficient for autoscaling because PaletteAI only queries `node_cpu_seconds_total{mode="idle"}` and `DCGM_FI_DEV_GPU_UTIL`.
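Autoscaling expressions built on these two series typically look like the following PromQL. These queries are illustrative only, not the exact expressions PaletteAI issues internally:

```promql
# Approximate per-node CPU utilization, derived from the idle counter
(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100

# Average GPU utilization across devices, as reported by DCGM Exporter
avg(DCGM_FI_DEV_GPU_UTIL)
```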
## Configure `global.metrics`

Set the following values in your Helm `values.yaml`.
| Parameter | Description | Default |
|---|---|---|
| `global.metrics.prometheusBaseUrl` | Base URL of the Prometheus server. Use protocol, host, and port only. Do not include `/api/v1/write`. | `https://prometheus.mural-system.svc.cluster.local:30090` |
| `global.metrics.grafanaUrl` | URL opened by the Open Grafana Dashboard button on Project Overview. This does not configure Grafana itself. | `""` |
| `global.metrics.timeout` | Timeout for Prometheus `remote_write` and query operations. | `"5s"` |
| `global.metrics.scrapeInterval` | How often spoke agents scrape metrics from exporters. Lower values increase metric resolution, network traffic, and storage use. | `"15s"` |
| `global.metrics.agentType` | Prometheus agent to deploy on spoke clusters. Supported values are `prometheus-agent-minimal` and `prometheus-agent`. | `"prometheus-agent-minimal"` |
| `global.metrics.username` | Basic auth username used by the Prometheus agent `remote_write` configuration. | `""` |
| `global.metrics.password` | Basic auth password used by the Prometheus agent `remote_write` configuration. | `""` |
| `global.metrics.basicAuthSecretName` | Secret containing `username` and `password` keys for Prometheus authentication. When set, PaletteAI uses it for hub-side Prometheus queries and spoke-side `remote_write` credentials. | `""` |
Use a configuration similar to the following:
```yaml title="Example global.metrics configuration"
global:
  metrics:
    prometheusBaseUrl: 'https://prometheus.example.com:9090'
    grafanaUrl: 'https://grafana.example.com'
    timeout: '5s'
    scrapeInterval: '15s'
    agentType: 'prometheus-agent-minimal'
    username: ''
    password: ''
    basicAuthSecretName: ''
```
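You can apply these values with a standard Helm upgrade. The release name, chart reference, and namespace below are placeholders; substitute the ones from your own PaletteAI installation:

```shell
# Apply the updated metrics configuration to an existing release.
# "paletteai", "paletteai/paletteai", and "paletteai-system" are
# assumptions; use your actual release name, chart, and namespace.
helm upgrade paletteai paletteai/paletteai \
  --namespace paletteai-system \
  --values values.yaml
```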
## Prometheus Server Requirements \{#prometheus-server-requirements}
PaletteAI appends `/api/v1/write` to `global.metrics.prometheusBaseUrl` when it configures spoke-side `remote_write`. The value you provide must therefore be the Prometheus base URL only.
For example, if your Prometheus server receives remote-write traffic at `https://prometheus.example.com:9090/api/v1/write`, set:
```yaml
global:
  metrics:
    prometheusBaseUrl: 'https://prometheus.example.com:9090'
```

Do not set:

```yaml
global:
  metrics:
    prometheusBaseUrl: 'https://prometheus.example.com:9090/api/v1/write'
```
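The effect of the append can be sketched with plain string concatenation. This is an illustration of the behavior described above, not PaletteAI's actual code:

```shell
base="https://prometheus.example.com:9090"
endpoint="${base%/}/api/v1/write"
echo "$endpoint"
# https://prometheus.example.com:9090/api/v1/write

# If the base already contains the path, the result is doubled and invalid:
bad_base="https://prometheus.example.com:9090/api/v1/write"
echo "${bad_base%/}/api/v1/write"
# https://prometheus.example.com:9090/api/v1/write/api/v1/write
```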
## GPU Metrics Requirements

Both PaletteAI Prometheus agent modes look for `nvidia-dcgm-exporter` pods in the `gpu-operator` namespace on spoke clusters. If those exporters are not present, GPU metrics are not shipped to Prometheus and GPU autoscaling data is unavailable.
Install the NVIDIA GPU Operator on every GPU-enabled spoke cluster where you want GPU monitoring or GPU autoscaling.
If you manage spoke cluster software through Palette Cluster Profiles, you can include the NVIDIA GPU Operator as an Add-on Cluster Profile in an Infrastructure or Fullstack Profile Bundle. Refer to Profile Bundles for more information.
## Authentication

PaletteAI supports Prometheus basic auth by either reading inline values from `global.metrics.username` and `global.metrics.password`, or by reading a Secret named in `global.metrics.basicAuthSecretName`.
- If your Prometheus server does not require basic auth, leave all three fields empty.
- If you want to provide credentials inline, set `global.metrics.username` and `global.metrics.password`.
- If you want to read credentials from a Secret, set `global.metrics.basicAuthSecretName`. The Secret must contain `username` and `password` keys.
- If you use `global.metrics.basicAuthSecretName`, PaletteAI uses those credentials for both hub-side Prometheus queries and the spoke-side Prometheus agent `remote_write` configuration.
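If you choose the Secret approach, a Secret with the two required keys can look like the following sketch. The Secret name, namespace, and credential values are placeholders; the namespace must match where your PaletteAI installation reads the Secret:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prometheus-basic-auth   # referenced by global.metrics.basicAuthSecretName
  namespace: paletteai-system   # assumption; use your installation's namespace
type: Opaque
stringData:
  username: prom-user           # placeholder credentials
  password: prom-pass
```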
## Validate

- Confirm the hub can query Prometheus by checking a `ScalingPolicy` resource.

  ```shell
  kubectl get scalingpolicy --namespace <project-namespace>
  ```

  Example output:

  ```shell
  NAME             PROMETHEUS_AVAILABLE   AGE
  example-policy   True                   2m
  ```

  If the `PROMETHEUS_AVAILABLE` column shows `True`, the hub can reach Prometheus for autoscaling queries.

- Confirm your Prometheus server is receiving metrics from spoke clusters.

  Review your Prometheus targets and remote-write ingestion metrics in Prometheus or Grafana. If metrics are missing, confirm network connectivity from the spoke to `global.metrics.prometheusBaseUrl`, and confirm the remote-write receiver is enabled on the Prometheus server.

- Confirm the Grafana link appears in the PaletteAI UI.

  If `global.metrics.grafanaUrl` is set, Project Overview displays a Metrics Dashboard section with an Open Grafana Dashboard button.
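A quick way to confirm that spoke metrics are landing in Prometheus is to query one of the autoscaling series through the Prometheus HTTP API. The server URL below is a placeholder; add `--user <username>:<password>` if basic auth is enabled:

```shell
# Query Prometheus for the CPU idle series shipped by the agents.
# A non-empty "result" array means spoke metrics are arriving.
curl --silent --get 'https://prometheus.example.com:9090/api/v1/query' \
  --data-urlencode 'query=node_cpu_seconds_total{mode="idle"}'
```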
## Next Steps

- Refer to Create and Manage Scaling Policies to configure CPU or GPU autoscaling.
- Refer to Helm Chart Configuration for the full Helm values reference, including `global.metrics`.
- Refer to the Hub and Spoke Model for topology guidance when you plan hub-only or dedicated spoke monitoring layouts.