Monitor Opaque performance
Monitoring your AI workloads helps you maintain performance, reliability, and compliance without exposing sensitive data. This tutorial guides you through setting up observability for your Opaque workloads using Prometheus and OpenTelemetry. You'll learn to:
- Enable metrics for your Opaque workloads.
- Tell your observability system how to discover these metrics.
- Verify metrics are being collected successfully.
Who this guide is for
- Customers with an existing Prometheus setup who want to integrate Opaque workloads.
Prerequisites
Before starting, ensure you have:
- An Opaque deployment (self-hosted) running
- Basic Kubernetes knowledge (for Helm-based deployment)
- An existing Prometheus installation
Collect metrics
Opaque workloads do not send metrics to Opaque—customers own and manage their observability data. Opaque provides Prometheus-compatible metrics over HTTP endpoints. To collect metrics, you'll need to enable monitoring in your Opaque Helm chart. This will configure your observability system (Prometheus) to scrape the Opaque metrics endpoints.
Wire up monitoring
The Opaque Helm charts make use of the ServiceMonitor Custom Resource Definition (CRD), which is part of the Prometheus Operator project. Although ServiceMonitors are CRDs, they are commonly enabled as an optional component of a Helm chart to tell Prometheus to monitor the application.
In particular, the ServiceMonitor resource tells Prometheus how to discover the metrics endpoint for the specific Opaque Service, as well as which named port and path the metrics endpoint is accessible on.
Set the following configuration in your Helm values file to enable monitoring:
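The exact key names depend on your Opaque chart version; the sketch below assumes a serviceMonitor toggle under a monitoring key, so check your chart's values reference for the real setting:

# values.yaml (illustrative sketch: these key names are assumptions, not the chart's documented values)
monitoring:
  serviceMonitor:
    # Create the ServiceMonitor resource shown later in this guide
    enabled: true
    # How often Prometheus scrapes the metrics endpoint
    interval: 30s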
Deploy the Opaque workload using Helm:
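A minimal invocation might look like the following, assuming a release named opaque-client, a chart reference opaque/opaque-client, and the opaque namespace (all three are placeholders for your actual values):

helm upgrade --install opaque-client opaque/opaque-client \
  --namespace opaque \
  --values values.yaml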
This will create a ServiceMonitor resource inside the cluster similar to this:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: opaque-client-api
  namespace: opaque
spec:
  attachMetadata:
    node: false
  endpoints:
    - interval: 30s
      path: /metrics
      port: http
      scheme: http
      scrapeTimeout: 3s
  selector:
    matchLabels:
      app.kubernetes.io/instance: opaque-client
      app.kubernetes.io/name: api
This ServiceMonitor resource tells Prometheus to look for a Service matching the labels app.kubernetes.io/instance: opaque-client and app.kubernetes.io/name: api, and to scrape the /metrics endpoint of that service every 30 seconds.
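You can confirm that the resource was created with kubectl; the opaque namespace here matches the example above:

kubectl get servicemonitors --namespace opaque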
Verify setup
To verify that these metrics are accessible, you can create a port forward:
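Assuming the metrics are served by a Service named opaque-client-api in the opaque namespace on the named port http (matching the ServiceMonitor above), forward it to local port 8888:

kubectl port-forward --namespace opaque svc/opaque-client-api 8888:http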
Now run a curl command against the /metrics endpoint to see the results.
curl -s http://localhost:8888/metrics
# ...snip...
# HELP flask_http_request_total Total number of HTTP requests
# TYPE flask_http_request_total counter
flask_http_request_total{method="GET",status="500"} 79.0
Access and view metrics
Opaque does not provide long-term storage of metrics data. Once you've configured Prometheus to scrape metrics, Opaque recommends integrating them with your existing observability tooling (e.g., Prometheus, Grafana).
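For a quick check that Prometheus is collecting these metrics, you can run a simple PromQL query in the Prometheus UI or Grafana; the namespace label selector here is illustrative and depends on how your scrape targets are labelled:

# Returns 1 for each target in the opaque namespace that was scraped successfully
up{namespace="opaque"}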
Exposed metrics
The table below outlines the metrics each service exports and the mechanism by which they are exported. For example, some services expose metrics on an endpoint that an OpenTelemetry collector running on the same node can scrape, while others export metrics by writing to a file.
Note
This table is a work in progress.
| Plane (client, data plane) | Service (Frontend, API, EDE, Job operator, verifier, exit handler, heartbeat…) | How metrics are exposed (endpoint on a port, written to file, etc.) | What metrics are exposed |
| --- | --- | --- | --- |