Monitor Opaque performance

Monitoring your AI workloads ensures performance, reliability, and compliance without exposing sensitive data. This tutorial will guide you through setting up observability for your Opaque workloads using Prometheus and OpenTelemetry. You'll learn to:

  • Enable metrics for your Opaque workloads.
  • Tell your observability system how to discover these metrics.
  • Verify metrics are being collected successfully.

Who this guide is for

  • Customers with an existing Prometheus setup who want to integrate Opaque workloads.

Prerequisites

Before starting, ensure you have:

  • An Opaque deployment (self-hosted) running
  • Basic Kubernetes knowledge (for Helm-based deployment)
  • An existing Prometheus installation

Collect metrics

Opaque workloads do not send metrics to Opaque—customers own and manage their observability data. Opaque provides Prometheus-compatible metrics over HTTP endpoints. To collect metrics, you'll need to enable monitoring in your Opaque Helm chart. This will configure your observability system (Prometheus) to scrape the Opaque metrics endpoints.

Wire up monitoring

The Opaque Helm charts make use of the ServiceMonitor Custom Resource Definition (CRD), which is part of the Prometheus Operator project. Although ServiceMonitors are CRDs, they are commonly enabled as an optional component of a Helm chart to tell Prometheus how to monitor the application.

In particular, the ServiceMonitor resource tells Prometheus how to discover the metrics endpoint for a specific Opaque Service, including the named port and path on which the metrics endpoint is exposed.

Set the following configuration in your Helm values file to enable monitoring:

monitoring:
  serviceMonitor:
    enabled: true

Deploy the Opaque workload using Helm:

helm --namespace opaque upgrade ...
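For reference, a fuller command typically names the release, the chart, and a values file containing the monitoring setting above. The release name and chart reference below are placeholders, not actual Opaque chart coordinates:

helm --namespace opaque upgrade --install <release-name> <chart-reference> \
  --values values.yaml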

This will create a ServiceMonitor resource inside the cluster similar to this:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: opaque-client-api
  namespace: opaque
spec:
  attachMetadata:
    node: false
  endpoints:
  - interval: 30s
    path: /metrics
    port: http
    scheme: http
    scrapeTimeout: 3s
  selector:
    matchLabels:
      app.kubernetes.io/instance: opaque-client
      app.kubernetes.io/name: api

This ServiceMonitor resource tells Prometheus to look for a service matching the labels app.kubernetes.io/instance: opaque-client and app.kubernetes.io/name: api, and scrape the /metrics endpoint of the service every 30 seconds.
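To confirm that a Service carrying those labels exists in the namespace (and will therefore be selected by the ServiceMonitor), you can list Services filtered by the same labels:

kubectl -n opaque get svc \
  -l app.kubernetes.io/instance=opaque-client,app.kubernetes.io/name=api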

Verify setup

To verify that these metrics are accessible, you can create a port forward:

kubectl -n opaque port-forward svc/opaque-client-api 8888:http

Now run a curl command to the /metrics endpoint to see the results.

curl -s http://localhost:8888/metrics
# ...snip...
# HELP flask_http_request_total Total number of HTTP requests
# TYPE flask_http_request_total counter
flask_http_request_total{method="GET",status="500"} 79.0
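You can also confirm that Prometheus has picked up the new target. One way, assuming your Prometheus server is reachable through a service named prometheus-operated in the monitoring namespace (adjust both names to your installation), is to port-forward it and query its targets API:

kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090
curl -s http://localhost:9090/api/v1/targets | grep opaque-client-api

The target should report a health of "up" once the first scrape has completed.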

Access and view metrics

Opaque does not provide long-term storage of metrics data. Once you've configured Prometheus to scrape metrics, Opaque recommends integrating them with your existing observability tooling (e.g., Prometheus, Grafana).
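For example, a Grafana panel or ad-hoc query could chart the request rate per status code using the flask_http_request_total counter shown earlier. The query below is a sketch run against the Prometheus HTTP API (adjust the address and labels to your installation):

curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum by (status) (rate(flask_http_request_total{namespace="opaque"}[5m]))'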

Exposed metrics

The table below outlines the metrics each service exports and the mechanism by which they are exported. For example, some services export metrics on an endpoint that an OpenTelemetry Collector running on the same node can scrape; other services export metrics by writing to a file.

Note

This table is a work in progress.

The table's columns are:

  • Plane (client, data plane)
  • Service (Frontend, API, EDE, Job operator, verifier, exit handler, heartbeat, …)
  • How metrics are exposed (endpoint on a port, written to a file, etc.)
  • What metrics are exposed
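As a sketch of the endpoint-scraping case mentioned above, an OpenTelemetry Collector running on the same node could use its Prometheus receiver to scrape such an endpoint and forward the metrics to your backend. The job name, target address, and remote-write URL below are placeholders:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: opaque-service          # placeholder job name
          scrape_interval: 30s
          static_configs:
            - targets: ["localhost:8080"]   # placeholder metrics endpoint
exporters:
  prometheusremotewrite:
    endpoint: https://<your-prometheus>/api/v1/write   # placeholder remote-write URL
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]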