Version: v0.2.x

# Configure Prometheus Agent Monitoring

PaletteAI can ship metrics from spoke clusters to a Prometheus server and use them for autoscaling decisions on the hub cluster. Configure this behavior with the `global.metrics` section in your Helm `values.yaml`.

This page explains how to:

- Choose between `prometheus-agent-minimal` and `prometheus-agent`.
- Configure the `global.metrics` values.
- Prepare Prometheus and Grafana for PaletteAI.
- Understand the GPU metrics and authentication requirements.

PaletteAI uses global.metrics for three separate behaviors:

- Spoke-side Prometheus agents that scrape metrics and send them to Prometheus with `remote_write`.
- Hub-side Prometheus queries used by `ScalingPolicy` autoscaling.
- The Grafana link shown on the Project Overview page.

## Prerequisites

- A running PaletteAI installation
- A Prometheus server that is reachable from the hub cluster and every spoke cluster that will run a Prometheus agent
- A Prometheus deployment that accepts remote-write traffic on `/api/v1/write`
- Node Exporter enabled on each spoke cluster that runs a Prometheus agent, so agents can scrape CPU metrics for autoscaling
- DCGM Exporter enabled on each GPU-enabled spoke cluster where you want GPU metrics or GPU autoscaling (for example, through the NVIDIA GPU Operator)

:::info

If you use the Spectro Cloud Prometheus Operator pack, enable the remote monitoring feature so the Prometheus server can receive metrics from Prometheus agents. Enable Node Exporter and DCGM Exporter in the pack configuration so Prometheus collects the CPU and GPU metrics that PaletteAI uses. Refer to Deploy Monitoring Stack and Prometheus Operator for Prometheus server and Grafana setup guidance.

If you use a standalone Prometheus deployment, enable the remote-write receiver endpoint. Refer to the Prometheus `remote_write` configuration and Prometheus command-line flags.

:::
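For a standalone deployment, the receiver is enabled with a command-line flag. A minimal launch sketch, assuming Prometheus v2.33 or later; the configuration file path is a placeholder:

```shell
# Sketch: start a standalone Prometheus with the remote-write receiver enabled.
# The receiver then accepts pushed samples at /api/v1/write on the web port.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --web.enable-remote-write-receiver
```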

## Choose a Prometheus Agent

PaletteAI supports two spoke-side agent modes.

| Agent Type | What It Ships | Recommended Use |
| --- | --- | --- |
| `prometheus-agent-minimal` | Only the metrics required for autoscaling: `node_cpu_seconds_total{mode="idle"}` and `DCGM_FI_DEV_GPU_UTIL` | Use this when Prometheus runs on the hub cluster, or when you only need autoscaling metrics |
| `prometheus-agent` | All node-exporter metrics and all dcgm-exporter metrics collected on the spoke | Use this when you have an external Prometheus server and want broader node and GPU observability |

`prometheus-agent-minimal` is the default and the recommended choice when your Prometheus server runs on the hub cluster. It reduces metric volume, network traffic, and storage pressure, and is sufficient for autoscaling because PaletteAI only queries `node_cpu_seconds_total{mode="idle"}` and `DCGM_FI_DEV_GPU_UTIL`.
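The difference between the two modes boils down to a mapping from agent type to the series it forwards. A small sketch of that mapping, taken from the table above; the helper name is hypothetical:

```shell
# Hypothetical helper: series forwarded by each agent mode, per the table above.
agent_series() {
  case "$1" in
    prometheus-agent-minimal)
      # Only the two series PaletteAI queries for autoscaling decisions.
      echo 'node_cpu_seconds_total{mode="idle"} DCGM_FI_DEV_GPU_UTIL'
      ;;
    prometheus-agent)
      echo 'all node-exporter series and all dcgm-exporter series'
      ;;
  esac
}

agent_series prometheus-agent-minimal
```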

## Configure global.metrics

Set the following values in your Helm `values.yaml`.

| Parameter | Description | Default |
| --- | --- | --- |
| `global.metrics.prometheusBaseUrl` | Base URL of the Prometheus server. Use protocol, host, and port only. Do not include `/api/v1/write`. | `https://prometheus.mural-system.svc.cluster.local:30090` |
| `global.metrics.grafanaUrl` | URL opened by the Open Grafana Dashboard button on Project Overview. This does not configure Grafana itself. | `""` |
| `global.metrics.timeout` | Timeout for Prometheus `remote_write` and query operations. | `"5s"` |
| `global.metrics.scrapeInterval` | How often spoke agents scrape metrics from exporters. Lower values increase metric resolution, network traffic, and storage use. | `"15s"` |
| `global.metrics.agentType` | Prometheus agent to deploy on spoke clusters. Supported values are `prometheus-agent-minimal` and `prometheus-agent`. | `"prometheus-agent-minimal"` |
| `global.metrics.username` | Basic auth username used by the Prometheus agent `remote_write` configuration. | `""` |
| `global.metrics.password` | Basic auth password used by the Prometheus agent `remote_write` configuration. | `""` |
| `global.metrics.basicAuthSecretName` | Secret containing `username` and `password` keys for Prometheus authentication. When set, PaletteAI uses it for hub-side Prometheus queries and spoke-side `remote_write` credentials. | `""` |
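To reason about the storage and network cost of lowering `scrapeInterval`, it helps to look at samples per series per day. A quick arithmetic sketch; the helper name is hypothetical:

```shell
# Hypothetical helper: samples per series per day for a scrape interval in seconds.
# 86400 seconds per day divided by the interval.
samples_per_day() {
  echo $(( 86400 / $1 ))
}

samples_per_day 15   # default "15s" -> 5760 samples per series per day
samples_per_day 5    # "5s" triples the volume -> 17280
```

Multiply by the number of series shipped per spoke to estimate total ingestion, which is why `prometheus-agent-minimal` keeps pressure low.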

Use a configuration similar to the following:

```yaml title="Example global.metrics configuration"
global:
  metrics:
    prometheusBaseUrl: 'https://prometheus.example.com:9090'
    grafanaUrl: 'https://grafana.example.com'
    timeout: '5s'
    scrapeInterval: '15s'
    agentType: 'prometheus-agent-minimal'
    username: ''
    password: ''
    basicAuthSecretName: ''
```


<!-- vale on -->

## Prometheus Server Requirements \{#prometheus-server-requirements}

PaletteAI appends `/api/v1/write` to `global.metrics.prometheusBaseUrl` when it configures spoke-side `remote_write`. The value you provide must therefore be the Prometheus base URL only.
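As a sketch of that behavior, the agent-side endpoint is simply the configured base URL plus the fixed path:

```shell
# PaletteAI appends the fixed remote-write path to the configured base URL.
PROMETHEUS_BASE_URL='https://prometheus.example.com:9090'
REMOTE_WRITE_URL="${PROMETHEUS_BASE_URL}/api/v1/write"
echo "$REMOTE_WRITE_URL"   # https://prometheus.example.com:9090/api/v1/write
```

If the base URL already ended in `/api/v1/write`, the path would be appended twice and remote write would fail.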

For example, if your Prometheus server receives remote-write traffic at `https://prometheus.example.com:9090/api/v1/write`, set:

<!-- vale off -->

```yaml
global:
  metrics:
    prometheusBaseUrl: 'https://prometheus.example.com:9090'
```

Do not set:

```yaml
global:
  metrics:
    prometheusBaseUrl: 'https://prometheus.example.com:9090/api/v1/write'
```

<!-- vale on -->

## GPU Metrics Requirements

Both PaletteAI Prometheus agent modes look for `nvidia-dcgm-exporter` pods in the `gpu-operator` namespace on spoke clusters. If those exporters are not present, GPU metrics are not shipped to Prometheus and GPU autoscaling data is unavailable.

Install the NVIDIA GPU Operator on every GPU-enabled spoke cluster where you want GPU monitoring or GPU autoscaling.

If you manage spoke cluster software through Palette Cluster Profiles, you can include the NVIDIA GPU Operator as an Add-on Cluster Profile in an Infrastructure or Fullstack Profile Bundle. Refer to Profile Bundles for more information.

## Authentication

PaletteAI supports Prometheus basic auth in two ways: inline values in `global.metrics.username` and `global.metrics.password`, or a Secret named in `global.metrics.basicAuthSecretName`.

- If your Prometheus server does not require basic auth, leave all three fields empty.
- To provide credentials inline, set `global.metrics.username` and `global.metrics.password`.
- To read credentials from a Secret, set `global.metrics.basicAuthSecretName`. The Secret must contain `username` and `password` keys.
- If you use `global.metrics.basicAuthSecretName`, PaletteAI uses those credentials for both hub-side Prometheus queries and spoke-side Prometheus agent `remote_write` configuration.
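If you take the Secret route, the object only needs `username` and `password` keys. A minimal sketch that renders such a manifest; the Secret name and credentials are placeholders:

```shell
# Render a minimal Secret manifest with base64-encoded credentials.
# The name "prometheus-basic-auth" and the credentials are placeholders.
PROM_USER='prom-user'
PROM_PASS='prom-pass'
cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: prometheus-basic-auth
type: Opaque
data:
  username: $(printf '%s' "$PROM_USER" | base64)
  password: $(printf '%s' "$PROM_PASS" | base64)
EOF
```

Apply the manifest with `kubectl apply` on the hub cluster, then set `global.metrics.basicAuthSecretName: 'prometheus-basic-auth'`.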

## Validate

1. Confirm the hub can query Prometheus by checking a ScalingPolicy resource.

   ```shell
   kubectl get scalingpolicy --namespace <project-namespace>
   ```

   Example output:

   ```shell
   NAME             PROMETHEUS_AVAILABLE   AGE
   example-policy   True                   2m
   ```

   If the `PROMETHEUS_AVAILABLE` column shows `True`, the hub can reach Prometheus for autoscaling queries.

2. Confirm your Prometheus server is receiving metrics from spoke clusters.

   Review your Prometheus targets and remote-write ingestion metrics in Prometheus or Grafana. If metrics are missing, confirm network connectivity from the spoke to `global.metrics.prometheusBaseUrl`, and confirm the remote-write receiver is enabled on the Prometheus server.

3. Confirm the Grafana link appears in the PaletteAI UI.

   If `global.metrics.grafanaUrl` is set, Project Overview displays a Metrics Dashboard section with an Open Grafana Dashboard button.
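Step 1 can also be scripted. A sketch that extracts the `PROMETHEUS_AVAILABLE` column from the example output shown above; in a live cluster you would pipe the real `kubectl get scalingpolicy` output instead of the hardcoded sample:

```shell
# Parse the PROMETHEUS_AVAILABLE column from the example output shown above.
kubectl_output='NAME             PROMETHEUS_AVAILABLE   AGE
example-policy   True                   2m'

# Row 2 is the first resource; column 2 is PROMETHEUS_AVAILABLE.
status="$(printf '%s\n' "$kubectl_output" | awk 'NR == 2 { print $2 }')"
echo "$status"   # True
```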

## Next Steps