Monitor Opaque performance
Monitoring your AI workloads ensures performance, reliability, and compliance without exposing sensitive data. This tutorial will guide you through setting up observability for your Opaque workloads using Prometheus and OpenTelemetry. You'll learn to:
- Enable metrics for your Opaque workloads.
- Tell your observability system how to discover and collect these metrics.
- Verify that metrics are being collected and visualized successfully.
Before you begin
Ensure you have the following in place:
- A self-hosted Opaque deployment running in your Kubernetes cluster
- Basic familiarity with Kubernetes and Helm
- A working Prometheus installation (Grafana optional but recommended)
During setup, you’ll work with two sets of Helm charts: the official OpenTelemetry collector chart (used for the agent collectors and the central hub) and the Helm releases for your existing Opaque workloads.
How the metrics pipeline works
Opaque’s observability setup is built around OpenTelemetry (OTel), a CNCF-supported standard for collecting telemetry from distributed systems. Each application emits metrics locally to an OTel Collector Agent. These agents forward data to a centralized OTel Hub, which exports metrics to Prometheus for long-term storage and visualization in Grafana. The following diagram illustrates how metrics flow from instrumented apps to your observability backend.
More specifically:
- Applications are instrumented to emit metrics using the OpenTelemetry Protocol (OTLP), typically over gRPC (port 4317) or HTTP (port 4318).
- OTel agent collectors, deployed as sidecars or node-level DaemonSets, receive these metrics. They process data using the resource processor (for metadata enrichment) and the batch processor (for efficient delivery). Optionally, a debug exporter can be enabled for local inspection. Metrics are then forwarded using the otlp exporter.
- The central OTel hub is a standalone OpenTelemetry collector that aggregates metrics from all agents. It can optionally perform additional processing or transformation before exporting to a backend.
- Prometheus and Grafana complete the pipeline. The hub uses the prometheusremotewrite exporter to forward data to Prometheus. Grafana then visualizes that data for dashboards and alerts.
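These stages are wired together in each collector's service.pipelines section. As a quick orientation (an illustrative fragment, not a complete configuration; the full agent config appears in Step 2 below), an agent's metrics pipeline looks like this:
service:
  pipelines:
    metrics:
      receivers: [otlp]              # metrics arrive from instrumented applications
      processors: [resource, batch]  # enrich with cluster metadata, then batch
      exporters: [otlp]              # forward to the central OTel hub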
Set up metrics for your Opaque deployment
Follow these steps to configure each part of the metrics pipeline—from emitting telemetry in your workloads to visualizing metrics in Prometheus and Grafana.
Step 1. Define an OTLP ingestion endpoint
To receive metrics from your applications, you'll need to configure an OpenTelemetry Protocol (OTLP) endpoint. This endpoint will serve as the destination for your application's Collector Agents. You’ll typically expose this endpoint using a DNS name like the following:
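For example, using a placeholder hostname (substitute whatever DNS name fits your environment; the same name is reused when exposing the hub in Step 3):
otel-hub.monitoring.example.com:4317   # OTLP over gRPC
otel-hub.monitoring.example.com:4318   # OTLP over HTTP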
Step 2. Configure the OTel Agent
Use the official OpenTelemetry Helm chart to deploy an agent for collecting local metrics.
First, add the OTel repo:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
Then define an otel-values.yaml file similar to the following example. This config sets up:
- A Deployment-mode collector
- OTLP receiver and exporter
- Cluster metadata enrichment
- Health check and telemetry metrics
mode: "deployment"
image:
  repository: "otel/opentelemetry-collector-contrib"
  pullPolicy: IfNotPresent
  tag: "" # Specify a version to ensure consistent builds (e.g., "0.97.0")
config:
  exporters:
    debug: {}
    otlp:
      endpoint: 10.22.0.105:4317 # Replace with your actual OTLP ingestion endpoint
      tls:
        insecure: true # Use for unencrypted HTTP/gRPC (no TLS); set to false for TLS or mTLS connections
  processors:
    batch:
      send_batch_size: 4096
      timeout: 10s
    memory_limiter:
      check_interval: 1s
      limit_percentage: 75
      spike_limit_percentage: 15
    resource:
      attributes:
        - action: insert
          key: customer_cluster_type
          value: "<cluster-type>" # e.g., "client"
        - action: insert
          key: customer_environment
          value: "<environment-name>" # e.g., "prod" or your cluster name
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
  extensions:
    health_check: {}
  service:
    telemetry:
      metrics:
        address: 0.0.0.0:8888
    extensions:
      - health_check
    pipelines:
      metrics:
        receivers: [otlp]
        processors: [resource, batch]
        exporters: [otlp, debug]
Deploy the agent with:
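For example, assuming the release name otel-agent and the monitoring namespace (adjust both to your environment):
helm install otel-agent open-telemetry/opentelemetry-collector \
  --namespace monitoring --create-namespace \
  --values otel-values.yaml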
Step 3. Deploy the central collector
To receive metrics from your Opaque workloads, deploy an OpenTelemetry collector that acts as the central hub. This collector receives OTLP traffic from all agent collectors and forwards metrics to Prometheus.
We recommend deploying this collector in a dedicated monitoring namespace to isolate observability components and align with standard Prometheus/Grafana setups. However, you're free to deploy it elsewhere if that better fits your infrastructure or organizational requirements.
Deploy the collector
Your central hub should expose standard OTLP receiver ports for both HTTP and gRPC. Here’s a minimal receiver configuration:
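A minimal sketch, mirroring the receiver section used by the agents above:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318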
You can run the collector as a Deployment, StatefulSet, or another topology depending on your availability and scaling needs.
Expose the service via DNS
Expose the collector using a DNS record such as:
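For example, reusing the placeholder hostname from Step 1:
otel-hub.monitoring.example.com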
This DNS entry should route to the collector’s service via a Kubernetes LoadBalancer, Ingress, or your preferred service mesh. Ensure this endpoint is reachable from all Opaque workloads sending metrics.
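As one illustration, a LoadBalancer Service exposing the OTLP ports might look like the following. The name, namespace, and selector label are assumptions; match them to what your collector deployment actually uses (the Helm chart can also create this Service for you):
apiVersion: v1
kind: Service
metadata:
  name: otel-hub
  namespace: monitoring
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: opentelemetry-collector  # assumed label; match your collector pods
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318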
Export metrics to Prometheus
Once the collector receives OTLP data, configure it to forward metrics to Prometheus using the prometheusremotewrite exporter. Here's an example of the remaining configuration:
processors:
  batch: {}
exporters:
  prometheusremotewrite:
    endpoint: "http://prometheus.monitoring.svc:9090/api/v1/write" # Replace with your Prometheus remote-write endpoint
  debug: {} # Optional; remove it here and from the pipeline below once the setup is verified
service:
  telemetry:
    logs:
      level: "debug"
    metrics:
      level: "none"
      address: "0.0.0.0:0"
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite, debug]
In this setup:
- The Collector receives OTLP metrics from all instrumented apps.
- It batches and forwards those metrics to Prometheus.
- Grafana then queries Prometheus to visualize them.
Step 4. Update Helm values in your Opaque workloads
Once your OTLP ingestion endpoint is live, update each Helm release to reference it. This ensures that metrics from your Opaque jobs and services are sent to the central collector.
Note
If you're using a shared values.yaml or common Helm chart overlay, this is often a single change.
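The exact values keys depend on each workload's chart; as a purely hypothetical sketch (key names invented for illustration, hostname is the placeholder from earlier), the override might look like:
otel:
  exporter:
    protocol: http  # or grpc, depending on the component (see the table below)
    endpoint: "http://otel-hub.monitoring.example.com:4318"  # hypothetical key; check your chart's values.yaml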
Step 5. Visualize and alert in Grafana
After metrics reach Prometheus, you can use Grafana to build dashboards or set up alerts. Opaque provides example dashboard templates, or you can create your own based on your workloads.
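If Grafana doesn't already have Prometheus configured as a data source, a standard provisioning file along these lines adds it (the URL assumes the in-cluster Prometheus service used by the hub's remote-write exporter):
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.monitoring.svc:9090  # same Prometheus instance the hub writes to
    isDefault: true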
Common metrics to track include:
- CPU and memory usage of key platform components
- Container restarts or crash loops
- Health of Opaque services (including the service host, cert manager, client API, and encryption/decryption service)
- Job execution counts, durations, and failure rates
To verify that metrics are being collected successfully, open Grafana and select your Prometheus data source. Run a basic query (such as a built-in Prometheus metric or a known workload-level signal) to confirm metric activity. If results appear, metrics are flowing correctly. You should also see Opaque-specific metrics appear under custom namespaces once workloads run.
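For example, with the default OTLP-to-Prometheus name translation, the http.server.request.duration histogram listed in the table below typically surfaces under a name like the following (exact naming depends on the exporter's translation settings):
http_server_request_duration_seconds_count
If that query, or any other known workload metric, returns series, data is flowing end to end.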
The following table summarizes the OpenTelemetry configuration and metrics emitted by each Opaque component:
| Component | Cluster | OTel Service Name | OTel Exporter Protocol | OTel Exporter Endpoint | Requires http:// Prefix | Default Export Interval | Metric Name | Type | Unit | Description |
|---|---|---|---|---|---|---|---|---|---|---|
| Flask client API | client | client-api | grpc | http://localhost:4317 | Yes | 60 s | http.server.request.duration | histogram buckets | s | Duration of HTTP server requests |
| Enc/Dec engine | client | enc-dec-engine | http | http://localhost:4318 | Yes | Per request | http.server.request.duration | histogram buckets | s | Duration of HTTP server requests |
| ATLS cert manager | client / dataplane | atls-cert-mgr | http | http://localhost:4318 | Yes | 15 s (configurable) | http.server.request.duration | histogram buckets | s | Duration of HTTP server requests |
| Service host | dataplane | service-host | http | http://localhost:4318 | Yes | 15 s (configurable) | http.server.request.duration | histogram buckets | s | Duration of HTTP server requests |
Troubleshooting your telemetry pipeline
The Opaque metrics pipeline follows a clear structure: application-level OpenTelemetry SDKs export metrics to local agent collectors, which forward data to a centralized OTel hub. From there, metrics are exposed to Prometheus and visualized in Grafana.
Troubleshooting begins when symptoms appear — such as missing metrics, delayed dashboards, or gaps in data. The most effective approach is to trace the pipeline backwards, starting with Grafana and moving toward the source.
- Start with Grafana. If dashboards are missing data, check whether Prometheus is receiving metrics. Visit the Prometheus /targets page to confirm that the OTel hub is either being scraped (pull-based) or actively writing metrics (via remote write).
- Check the OTel hub. If Prometheus seems healthy but data is still missing, the issue may lie in the hub's export pipeline. Problems like missing exporters, broken pipelines, or retry failures can silently block metric delivery. Review hub logs and built-in metrics (e.g., otelcol_exporter_*, otelcol_receiver_*) for clues.
- Inspect the agent collectors. If the hub shows no incoming data, verify that each workload-level agent is running and configured to forward metrics to the hub. Common issues include missing OTLP receivers (port 4317 for gRPC or 4318 for HTTP), misconfigured endpoints or ports, and network restrictions (like blocked egress or DNS resolution errors).
- Check the application layer. Finally, confirm that the application is generating metrics at all. SDK misconfiguration, such as missing or extra http:// prefixes or invalid OTLP endpoints, is a common cause of silent failure. Enable debug logging or use tools like otel-cli to generate and trace synthetic telemetry during testing.
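To inspect those built-in collector metrics directly, you can port-forward to a collector pod and read its internal telemetry endpoint (port 8888, as configured in the agent example above). The deployment name here is an assumption; substitute whatever your Helm release created:
kubectl -n monitoring port-forward deploy/otel-agent-opentelemetry-collector 8888:8888
curl -s http://localhost:8888/metrics | grep otelcol_exporter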
When diagnosing pipeline issues, simplify wherever possible. Swap exporters for console loggers to confirm emission or redirect to a local Prometheus instance to test ingestion. Monitoring the health of the pipeline itself—collector status, export success, and queue sizes—is just as critical as observing application-level metrics.
Approaching telemetry as a layered system makes it easier to isolate problems and restore visibility quickly.