Version: v1.1.x

Configure Prometheus Agent Monitoring

PaletteAI can ship metrics from spoke clusters to a Prometheus server and use them for autoscaling decisions on the hub cluster. Configure this behavior with the global.metrics section in your Helm values.yaml.

This page explains how to:

Choose between prometheus-agent-minimal and prometheus-agent.
Configure the global.metrics values.
Prepare Prometheus and Grafana for PaletteAI.
Understand the GPU metrics and authentication requirements.

PaletteAI uses global.metrics for three separate behaviors:

Spoke-side Prometheus agents that scrape metrics and send them to Prometheus with remote_write.
Hub-side Prometheus queries used by ScalingPolicy autoscaling.
The Grafana link shown on the Project Overview and Tenant Overview pages.

Prerequisites

A running PaletteAI installation
A Prometheus server that is reachable from the hub cluster and every spoke cluster that will run a Prometheus agent
A Prometheus deployment that accepts remote-write traffic on /api/v1/write
Node Exporter enabled on each spoke cluster that runs a Prometheus agent, so agents can scrape CPU metrics for autoscaling
DCGM Exporter enabled on each GPU-enabled spoke cluster where you want GPU metrics or GPU autoscaling (for example through the NVIDIA GPU Operator)

info

If you use the Spectro Cloud Prometheus Operator pack, enable the remote monitoring feature so the Prometheus server can receive metrics from Prometheus agents. Enable Node Exporter and DCGM Exporter in the pack configuration so Prometheus collects the CPU and GPU metrics that PaletteAI uses. Refer to Deploy Monitoring Stack and Prometheus Operator for Prometheus server and Grafana setup guidance.

If you use a standalone Prometheus deployment, enable the remote-write receiver endpoint by starting Prometheus with the --web.enable-remote-write-receiver flag. Refer to the Prometheus remote_write configuration and Prometheus command-line flags.

For All-In-One (AIO) installs, Prometheus is provisioned automatically on the PaletteAI hub cluster. The PaletteAI installer creates a prometheus-basic-auth Secret in the mural-system namespace with username and password keys you can use to sign in to Prometheus.
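One way to read those credentials on an AIO install is with kubectl. The following is a minimal sketch, assuming kubectl and base64 are installed and your kubeconfig points at the hub cluster. The function name get_prom_credential is illustrative and not part of PaletteAI.

```shell
# Illustrative helper (not part of PaletteAI): print one key from the
# prometheus-basic-auth Secret in the mural-system namespace.
# Kubernetes stores Secret values base64-encoded, so decode before printing.
get_prom_credential() {
  key="$1"   # "username" or "password"
  kubectl get secret prometheus-basic-auth \
    --namespace mural-system \
    --output "jsonpath={.data.${key}}" | base64 --decode
}

# Usage:
#   get_prom_credential username
#   get_prom_credential password
```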

Choose a Prometheus Agent

PaletteAI supports two spoke-side agent modes.

| Agent Type | What It Ships | Recommended Use |
| --- | --- | --- |
| `prometheus-agent-minimal` | Only the metrics required for autoscaling: `node_cpu_seconds_total{mode="idle"}` and `DCGM_FI_DEV_GPU_UTIL` | Use this when Prometheus runs on the hub cluster, or when you only need autoscaling metrics |
| `prometheus-agent` | All `node-exporter` metrics and all `dcgm-exporter` metrics collected on the spoke | Use this when you have an external Prometheus server and want broader node and GPU observability |

prometheus-agent-minimal is the default and the recommended choice when your Prometheus server runs on the hub cluster. It reduces metric volume, network traffic, and storage pressure. It is sufficient for autoscaling because PaletteAI queries only node_cpu_seconds_total{mode="idle"} and DCGM_FI_DEV_GPU_UTIL.

Configure `global.metrics`

Set the following values in your Helm values.yaml.

| Parameter | Description | Default |
| --- | --- | --- |
| `global.metrics.prometheusBaseUrl` | Base URL of the Prometheus server. Use protocol, host, and port only. Do not include `/api/v1/write`. | `https://prometheus.mural-system.svc.cluster.local:30090` |
| `global.metrics.grafanaUrl` | URL opened by the Open Grafana Dashboard button on Project Overview and Tenant Overview. This does not configure Grafana itself. | `""` |
| `global.metrics.timeout` | Timeout for Prometheus `remote_write` and query operations. | `"5s"` |
| `global.metrics.scrapeInterval` | How often spoke agents scrape metrics from exporters. Lower values increase metric resolution, network traffic, and storage use. | `"15s"` |
| `global.metrics.agentType` | Prometheus agent to deploy on spoke clusters. Supported values are `prometheus-agent-minimal` and `prometheus-agent`. | `"prometheus-agent-minimal"` |
| `global.metrics.username` | Basic auth username used by the Prometheus agent `remote_write` configuration. | `""` |
| `global.metrics.password` | Basic auth password used by the Prometheus agent `remote_write` configuration. | `""` |
| `global.metrics.basicAuthSecretName` | Secret containing `username` and `password` keys for Prometheus authentication. When set, PaletteAI uses it for hub-side Prometheus queries and spoke-side `remote_write` credentials. | `""` |

Use a configuration similar to the following:

Example global.metrics configuration
```
global:
  metrics:
    prometheusBaseUrl: 'https://prometheus.example.com:9090'
    grafanaUrl: 'https://grafana.example.com'
    timeout: '5s'
    scrapeInterval: '15s'
    agentType: 'prometheus-agent-minimal'
    username: ''
    password: ''
    basicAuthSecretName: ''
```

Prometheus Server Requirements

PaletteAI appends /api/v1/write to global.metrics.prometheusBaseUrl when it configures spoke-side remote_write. The value you provide must therefore be the Prometheus base URL only.

For example, if your Prometheus server receives remote-write traffic at https://prometheus.example.com:9090/api/v1/write, set:

```
global:
  metrics:
    prometheusBaseUrl: 'https://prometheus.example.com:9090'
```

Do not set:

```
global:
  metrics:
    prometheusBaseUrl: 'https://prometheus.example.com:9090/api/v1/write'
```
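To catch this mistake before you apply your values, a small pre-flight check can help. The sketch below is a hypothetical helper, not PaletteAI's own logic. It mirrors the documented behavior of appending /api/v1/write to the base URL and rejects a value that already contains that path. Tolerating a single trailing slash is an assumption made for convenience.

```shell
# Hypothetical helper for illustration; PaletteAI performs this step internally.
# Prints the remote-write URL for a base URL, or fails if the base URL
# already contains the /api/v1/write path.
remote_write_url() {
  base="${1%/}"   # assumption: tolerate a single trailing slash
  case "$base" in
    */api/v1/write)
      echo "error: prometheusBaseUrl must not include /api/v1/write" >&2
      return 1
      ;;
  esac
  printf '%s/api/v1/write\n' "$base"
}

# Usage:
#   remote_write_url 'https://prometheus.example.com:9090'
#   prints https://prometheus.example.com:9090/api/v1/write
```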

GPU Metrics Requirements

Both PaletteAI Prometheus agent modes look for nvidia-dcgm-exporter pods in the gpu-operator namespace on spoke clusters. If those exporters are not present, GPU metrics are not shipped to Prometheus and GPU autoscaling data is unavailable.

Install the NVIDIA GPU Operator on every GPU-enabled spoke cluster where you want GPU monitoring or GPU autoscaling.
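To check whether a spoke cluster already runs the exporter pods that the agents look for, you could list the pods in the gpu-operator namespace. This is a minimal sketch, assuming kubectl is installed and your kubeconfig points at the spoke cluster. The function name has_dcgm_exporter is illustrative. It only checks that a matching pod exists, not that the pod is ready.

```shell
# Illustrative check (not part of PaletteAI): succeed if at least one
# nvidia-dcgm-exporter pod exists in the gpu-operator namespace.
has_dcgm_exporter() {
  kubectl get pods --namespace gpu-operator --output name \
    | grep -q 'nvidia-dcgm-exporter'
}

# Usage:
#   has_dcgm_exporter && echo "DCGM Exporter found" || echo "DCGM Exporter missing"
```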

If you manage spoke cluster software through Palette Cluster Profiles, you can include the NVIDIA GPU Operator as an Add-on Cluster Profile in an Infrastructure or Fullstack Profile Bundle. Refer to Profile Bundles for more information.

Authentication

PaletteAI supports Prometheus basic auth in two ways: by reading inline values from global.metrics.username and global.metrics.password, or by reading a Secret named in global.metrics.basicAuthSecretName.

If your Prometheus server does not require basic auth, leave all three fields empty.
If you want to provide credentials inline, set global.metrics.username and global.metrics.password.
If you want to read credentials from a Secret, set global.metrics.basicAuthSecretName. The Secret must contain username and password keys.
If you use global.metrics.basicAuthSecretName, PaletteAI uses those credentials for both hub-side Prometheus queries and spoke-side Prometheus agent remote_write configuration.
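If you choose the Secret option, the Secret needs the username and password keys described above. The following sketch shows one way to create such a Secret with kubectl. The function name, the Secret name my-prometheus-auth, and the example credentials are hypothetical. This page does not state which namespace the Secret must live in, so the mural-system namespace in the usage line is an assumption based on where the AIO installer places prometheus-basic-auth.

```shell
# Illustrative helper (not part of PaletteAI): create a Secret with the
# username and password keys that global.metrics.basicAuthSecretName expects.
create_prom_auth_secret() {
  name="$1"; namespace="$2"; user="$3"; pass="$4"
  kubectl create secret generic "$name" \
    --namespace "$namespace" \
    --from-literal=username="$user" \
    --from-literal=password="$pass"
}

# Usage (Secret name and namespace are assumptions; adjust for your install):
#   create_prom_auth_secret my-prometheus-auth mural-system 'metrics-user' 'change-me'
```

You would then set global.metrics.basicAuthSecretName to the Secret name you chose.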

Apply the Configuration

Apply the updated Helm values to your PaletteAI installation.

```
helm upgrade paletteai spectrocloud/paletteai \
  --namespace mural-system \
  --values values.yaml
```
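After the upgrade completes, you may want to confirm that the release picked up your global.metrics values. The sketch below prints the user-supplied values for the release with helm get values. It assumes the release name paletteai and the mural-system namespace used in the upgrade command. The function name show_release_values is illustrative.

```shell
# Illustrative helper: print the user-supplied values of the paletteai
# release so you can inspect the global.metrics section.
show_release_values() {
  helm get values paletteai --namespace mural-system --output yaml
}

# Usage:
#   show_release_values
```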

Validate

Confirm the hub can query Prometheus by checking a ScalingPolicy resource.
```
kubectl get scalingpolicy --namespace <project-namespace>
```
Example Output
```
NAME             PROMETHEUS_AVAILABLE   AGE
example-policy   True                   2m
```
If the PROMETHEUS_AVAILABLE column shows True, the hub can reach Prometheus for autoscaling queries.
Confirm your Prometheus server is receiving metrics from spoke clusters. Spoke targets should appear with state=UP in the Prometheus Targets view, and remote-write ingestion metrics should report non-zero samples. You can review both in Prometheus or Grafana.

If metrics are missing, confirm network connectivity from the spoke to global.metrics.prometheusBaseUrl, and confirm that the remote-write receiver is enabled on the Prometheus server.
Confirm the Grafana link appears in the PaletteAI UI.

If global.metrics.grafanaUrl is set, the Project Overview and Tenant Overview pages display a Metrics Dashboard panel with an Open Grafana Dashboard button.
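To check from the command line that the two autoscaling metrics named on this page are arriving, you could query the Prometheus HTTP API directly. The sketch below wraps curl around the /api/v1/query endpoint. The function name prom_query and the example URL are illustrative, and the optional third argument supplies basic auth credentials in user:password form.

```shell
# Illustrative helper (not part of PaletteAI): run an instant query against
# the Prometheus HTTP API at <base-url>/api/v1/query.
# The optional third argument is "user:password" for basic auth.
prom_query() {
  base="${1%/}"; expr="$2"; creds="${3:-}"
  if [ -n "$creds" ]; then
    curl --silent --show-error --fail --get --user "$creds" \
      --data-urlencode "query=${expr}" "${base}/api/v1/query"
  else
    curl --silent --show-error --fail --get \
      --data-urlencode "query=${expr}" "${base}/api/v1/query"
  fi
}

# Usage: count the series for each autoscaling metric.
#   prom_query 'https://prometheus.example.com:9090' 'count(node_cpu_seconds_total{mode="idle"})'
#   prom_query 'https://prometheus.example.com:9090' 'count(DCGM_FI_DEV_GPU_UTIL)'
```

A response with "status":"success" and a non-empty result array indicates that Prometheus holds series for that metric. An empty result array suggests the metric has not arrived yet.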

Instance Name on the Hub and Spokes

Each PaletteAI installation is uniquely identified by global.instanceName. PaletteAI exposes that value to spoke-side Prometheus add-ons (for example, as PALETTEAI_INSTANCE_NAME) and in metric selectors (for example, paletteai_instance), so that hub queries, Grafana, and autoscaling can match series from the hub with series from spoke clusters.

For appliance merge behavior and installer-managed fields, refer to Deploy PaletteAI. For the Helm global section, refer to the Global step in Install PaletteAI on Kubernetes.

Next Steps

Refer to Create and Manage Scaling Policies to configure CPU or GPU autoscaling.
Refer to View Metrics in Grafana for the user-facing workflow that opens the Metrics Dashboard panel and the metrics PaletteAI publishes.
Refer to Helm Chart Configuration for the full Helm values reference, including global.metrics.
Refer to the Hub and Spoke Model for topology guidance when you plan hub-only or dedicated spoke monitoring layouts.
