Version: v0.2.x

# Configure Prometheus Agent Monitoring

PaletteAI can ship metrics from spoke clusters to a Prometheus server and use them for autoscaling decisions on the hub cluster. Configure this behavior with the `global.metrics` section in your Helm `values.yaml`.

This page explains how to:

- Choose between `prometheus-agent-minimal` and `prometheus-agent`.
- Configure the `global.metrics` values.
- Prepare Prometheus and Grafana for PaletteAI.
- Understand the GPU metrics and authentication requirements.

PaletteAI uses global.metrics for three separate behaviors:

- Spoke-side Prometheus agents that scrape metrics and send them to Prometheus with `remote_write`.
- Hub-side Prometheus queries used by `ScalingPolicy` autoscaling.
- The Grafana link shown on the Project Overview page.

## Prerequisites

- A running PaletteAI installation
- A Prometheus server that is reachable from the hub cluster and every spoke cluster that will run a Prometheus agent
- A Prometheus deployment that accepts remote-write traffic on `/api/v1/write`
- Node Exporter enabled on each spoke cluster that runs a Prometheus agent, so agents can scrape CPU metrics for autoscaling
- DCGM Exporter enabled on each GPU-enabled spoke cluster where you want GPU metrics or GPU autoscaling (for example, through the NVIDIA GPU Operator)

:::info

If you use the Spectro Cloud Prometheus Operator pack, enable the remote monitoring feature so the Prometheus server can receive metrics from Prometheus agents. Enable Node Exporter and DCGM Exporter in the pack configuration so Prometheus collects the CPU and GPU metrics that PaletteAI uses. Refer to Deploy Monitoring Stack and Prometheus Operator for Prometheus server and Grafana setup guidance.

If you use a standalone Prometheus deployment, enable the remote-write receiver endpoint. Refer to the Prometheus `remote_write` configuration and Prometheus command-line flags.

:::
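For a standalone deployment, the receiver is enabled with a command-line flag. A minimal launch sketch, assuming Prometheus v2.33 or later; the configuration file path is a placeholder:

```shell
# Sketch: start a standalone Prometheus with the remote-write receiver enabled.
# The receiver then accepts pushed samples at /api/v1/write on the web port.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --web.enable-remote-write-receiver
```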

## Choose a Prometheus Agent

PaletteAI supports two spoke-side agent modes.

| Agent Type | What It Ships | Recommended Use |
| --- | --- | --- |
| `prometheus-agent-minimal` | Only the metrics required for autoscaling: `node_cpu_seconds_total{mode="idle"}` and `DCGM_FI_DEV_GPU_UTIL` | Use this when Prometheus runs on the hub cluster, or when you only need autoscaling metrics |
| `prometheus-agent` | All node-exporter metrics and all dcgm-exporter metrics collected on the spoke | Use this when you have an external Prometheus server and want broader node and GPU observability |

`prometheus-agent-minimal` is the default and the recommended choice when your Prometheus server runs on the hub cluster. It reduces metric volume, network traffic, and storage pressure, and is sufficient for autoscaling because PaletteAI only queries `node_cpu_seconds_total{mode="idle"}` and `DCGM_FI_DEV_GPU_UTIL`.
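The difference between the two modes boils down to a mapping from agent type to the series it forwards. A small sketch of that mapping, taken from the table above; the helper name is hypothetical:

```shell
# Hypothetical helper: series forwarded by each agent mode, per the table above.
agent_series() {
  case "$1" in
    prometheus-agent-minimal)
      # Only the two series PaletteAI queries for autoscaling decisions.
      echo 'node_cpu_seconds_total{mode="idle"} DCGM_FI_DEV_GPU_UTIL'
      ;;
    prometheus-agent)
      echo 'all node-exporter series and all dcgm-exporter series'
      ;;
  esac
}

agent_series prometheus-agent-minimal
```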

## Configure global.metrics

Set the following values in your Helm `values.yaml`.

| Parameter | Description | Default |
| --- | --- | --- |
| `global.metrics.prometheusBaseUrl` | Base URL of the Prometheus server. Use protocol, host, and port only. Do not include `/api/v1/write`. | `https://prometheus.mural-system.svc.cluster.local:30090` |
| `global.metrics.grafanaUrl` | URL opened by the Open Grafana Dashboard button on Project Overview. This does not configure Grafana itself. | `""` |
| `global.metrics.timeout` | Timeout for Prometheus `remote_write` and query operations. | `"5s"` |
| `global.metrics.scrapeInterval` | How often spoke agents scrape metrics from exporters. Lower values increase metric resolution, network traffic, and storage use. | `"15s"` |
| `global.metrics.agentType` | Prometheus agent to deploy on spoke clusters. Supported values are `prometheus-agent-minimal` and `prometheus-agent`. | `"prometheus-agent-minimal"` |
| `global.metrics.username` | Basic auth username used by the Prometheus agent `remote_write` configuration. | `""` |
| `global.metrics.password` | Basic auth password used by the Prometheus agent `remote_write` configuration. | `""` |
| `global.metrics.basicAuthSecretName` | Secret containing `username` and `password` keys for Prometheus authentication. When set, PaletteAI uses it for hub-side Prometheus queries and spoke-side `remote_write` credentials. | `""` |
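To reason about the storage and network cost of lowering `scrapeInterval`, it helps to look at samples per series per day. A quick arithmetic sketch; the helper name is hypothetical:

```shell
# Hypothetical helper: samples per series per day for a scrape interval in seconds.
# 86400 seconds per day divided by the interval.
samples_per_day() {
  echo $(( 86400 / $1 ))
}

samples_per_day 15   # default "15s" -> 5760 samples per series per day
samples_per_day 5    # "5s" triples the volume -> 17280
```

Multiply by the number of series shipped per spoke to estimate total ingestion, which is why `prometheus-agent-minimal` keeps pressure low.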

Use a configuration similar to the following:

```yaml title="Example global.metrics configuration"
global:
  metrics:
    prometheusBaseUrl: 'https://prometheus.example.com:9090'
    grafanaUrl: 'https://grafana.example.com'
    timeout: '5s'
    scrapeInterval: '15s'
    agentType: 'prometheus-agent-minimal'
    username: ''
    password: ''
    basicAuthSecretName: ''
```


<!-- vale on -->

## Prometheus Server Requirements \{#prometheus-server-requirements}

PaletteAI appends `/api/v1/write` to `global.metrics.prometheusBaseUrl` when it configures spoke-side `remote_write`. The value you provide must therefore be the Prometheus base URL only.
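As a sketch of that behavior, the agent-side endpoint is simply the configured base URL plus the fixed path:

```shell
# PaletteAI appends the fixed remote-write path to the configured base URL.
PROMETHEUS_BASE_URL='https://prometheus.example.com:9090'
REMOTE_WRITE_URL="${PROMETHEUS_BASE_URL}/api/v1/write"
echo "$REMOTE_WRITE_URL"   # https://prometheus.example.com:9090/api/v1/write
```

If the base URL already ended in `/api/v1/write`, the path would be appended twice and remote write would fail.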

For example, if your Prometheus server receives remote-write traffic at `https://prometheus.example.com:9090/api/v1/write`, set:

<!-- vale off -->

```yaml
global:
  metrics:
    prometheusBaseUrl: 'https://prometheus.example.com:9090'
```

Do not set:

```yaml
global:
  metrics:
    prometheusBaseUrl: 'https://prometheus.example.com:9090/api/v1/write'
```

<!-- vale on -->

## GPU Metrics Requirements

Both PaletteAI Prometheus agent modes look for `nvidia-dcgm-exporter` pods in the `gpu-operator` namespace on spoke clusters. If those exporters are not present, GPU metrics are not shipped to Prometheus and GPU autoscaling data is unavailable.

Install the NVIDIA GPU Operator on every GPU-enabled spoke cluster where you want GPU monitoring or GPU autoscaling.

If you manage spoke cluster software through Palette Cluster Profiles, you can include the NVIDIA GPU Operator as an Add-on Cluster Profile in an Infrastructure or Fullstack Profile Bundle. Refer to Profile Bundles for more information.

## Authentication

PaletteAI supports Prometheus basic auth in two ways: inline values in `global.metrics.username` and `global.metrics.password`, or a Secret named in `global.metrics.basicAuthSecretName`.

- If your Prometheus server does not require basic auth, leave all three fields empty.
- To provide credentials inline, set `global.metrics.username` and `global.metrics.password`.
- To read credentials from a Secret, set `global.metrics.basicAuthSecretName`. The Secret must contain `username` and `password` keys.
- If you use `global.metrics.basicAuthSecretName`, PaletteAI uses those credentials for both hub-side Prometheus queries and spoke-side Prometheus agent `remote_write` configuration.
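If you take the Secret route, the object only needs `username` and `password` keys. A minimal sketch that renders such a manifest; the Secret name and credentials are placeholders:

```shell
# Render a minimal Secret manifest with base64-encoded credentials.
# The name "prometheus-basic-auth" and the credentials are placeholders.
PROM_USER='prom-user'
PROM_PASS='prom-pass'
cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: prometheus-basic-auth
type: Opaque
data:
  username: $(printf '%s' "$PROM_USER" | base64)
  password: $(printf '%s' "$PROM_PASS" | base64)
EOF
```

Apply the manifest with `kubectl apply` on the hub cluster, then set `global.metrics.basicAuthSecretName: 'prometheus-basic-auth'`.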

## Validate

1. Confirm the hub can query Prometheus by checking a ScalingPolicy resource.

   ```shell
   kubectl get scalingpolicy --namespace <project-namespace>
   ```

   Example output:

   ```shell
   NAME             PROMETHEUS_AVAILABLE   AGE
   example-policy   True                   2m
   ```

   If the `PROMETHEUS_AVAILABLE` column shows `True`, the hub can reach Prometheus for autoscaling queries.

2. Confirm your Prometheus server is receiving metrics from spoke clusters.

   Review your Prometheus targets and remote-write ingestion metrics in Prometheus or Grafana. If metrics are missing, confirm network connectivity from the spoke to `global.metrics.prometheusBaseUrl`, and confirm the remote-write receiver is enabled on the Prometheus server.

3. Confirm the Grafana link appears in the PaletteAI UI.

   If `global.metrics.grafanaUrl` is set, Project Overview displays a Metrics Dashboard section with an Open Grafana Dashboard button.
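Step 1 can also be scripted. A sketch that extracts the `PROMETHEUS_AVAILABLE` column from the example output shown above; in a live cluster you would pipe the real `kubectl get scalingpolicy` output instead of the hardcoded sample:

```shell
# Parse the PROMETHEUS_AVAILABLE column from the example output shown above.
kubectl_output='NAME             PROMETHEUS_AVAILABLE   AGE
example-policy   True                   2m'

# Row 2 is the first resource; column 2 is PROMETHEUS_AVAILABLE.
status="$(printf '%s\n' "$kubectl_output" | awk 'NR == 2 { print $2 }')"
echo "$status"   # True
```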

## Next Steps