Monitor Opaque performance

Monitoring your AI workloads ensures performance, reliability, and compliance without exposing sensitive data. This tutorial will guide you through setting up observability for your Opaque workloads using Prometheus and OpenTelemetry. You'll learn to:

  • Enable metrics for your Opaque workloads.
  • Tell your observability system how to discover and collect these metrics.
  • Verify that metrics are being collected and visualized successfully.

Before you begin

Ensure you have the following in place:

  • A self-hosted Opaque deployment running in your Kubernetes cluster
  • Basic familiarity with Kubernetes and Helm
  • A working Prometheus installation (Grafana optional but recommended)

During setup, you’ll be working primarily with the open-telemetry/opentelemetry-collector Helm chart and the Helm releases for your Opaque workloads.

How the metrics pipeline works

Opaque’s observability setup is built around OpenTelemetry (OTel), a CNCF-supported standard for collecting telemetry from distributed systems. Each application emits metrics locally to an OTel Collector Agent. These agents forward data to a centralized OTel Hub, which exports metrics to Prometheus for long-term storage and visualization in Grafana. The following diagram illustrates how metrics flow from instrumented apps to your observability backend.

Diagram of the metrics pipeline from Opaque workloads to Prometheus/Grafana

More specifically:

  • Applications are instrumented to emit metrics using the OpenTelemetry Protocol (OTLP), typically over gRPC (4317) or HTTP (4318).
  • OTel agent collectors, deployed as sidecars or node-level DaemonSets, receive these metrics. They process data using the resource processor (for metadata enrichment) and batch processor (for efficient delivery). Optionally, a debug exporter can be enabled for local inspection. Metrics are then forwarded using the otlp exporter.
  • The central OTel hub is a standalone OpenTelemetry collector that aggregates metrics from all agents. It can optionally perform additional processing or transformation before exporting to a backend.
  • Prometheus and Grafana complete the pipeline. The hub uses the prometheusremotewrite exporter to forward data to Prometheus. Grafana then visualizes that data for dashboards and alerts.

Set up metrics for your Opaque deployment

Follow these steps to configure each part of the metrics pipeline—from emitting telemetry in your workloads to visualizing metrics in Prometheus and Grafana.

Step 1. Define an OTLP ingestion endpoint

To receive metrics from your applications, you'll need to configure an OpenTelemetry Protocol (OTLP) endpoint. This endpoint is the destination that the Collector Agents running alongside your applications send metrics to, and it's where the central collector (Step 3) will listen. You’ll typically expose it using a DNS name like the following:

otlp-ingest.yourcompany.com:4317
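Once the central collector from Step 3 is running behind this name, a quick reachability check from a machine or pod with network access to the endpoint can confirm that DNS resolves and both OTLP ports accept connections (the hostname below is the placeholder used throughout this guide):

# Verify the OTLP gRPC and HTTP ports are reachable; replace the hostname
# with your actual ingestion endpoint.
nc -vz otlp-ingest.yourcompany.com 4317
nc -vz otlp-ingest.yourcompany.com 4318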

Step 2. Configure the OTel Agent

Use the official OpenTelemetry Helm chart to deploy an agent for collecting local metrics.

First, add the OTel repo:

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

Then define an otel-values.yaml file similar to the following example. This config sets up:

  • A Deployment-mode collector
  • OTLP receiver and exporter
  • Cluster metadata enrichment
  • Health check and telemetry metrics

mode: "deployment"

image:
  repository: "otel/opentelemetry-collector-contrib"
  pullPolicy: IfNotPresent
  tag: ""  # Specify a version to ensure consistent builds (e.g., "0.97.0") 

config:
  exporters:
    debug: {}
    otlp:
      endpoint: otlp-ingest.yourcompany.com:4317  # Replace with your OTLP ingestion endpoint from Step 1
      tls:
        insecure: true  # Use for unencrypted HTTP/gRPC (no TLS); set to false for TLS or mTLS connections

  processors:
    batch:
      send_batch_size: 4096
      timeout: 10s
    memory_limiter:
      check_interval: 1s
      limit_percentage: 75
      spike_limit_percentage: 15
    resource:
      attributes:
        - action: insert
          key: customer_cluster_type
          value: "<cluster-type>"  # e.g., "client"
        - action: insert
          key: customer_environment
          value: "<environment-name>" # e.g., "prod" or your cluster name

  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

  extensions:
    health_check: {}

  service:
    telemetry:
      metrics:
        address: 0.0.0.0:8888
    extensions:
      - health_check
    pipelines:
      metrics:
        receivers: [otlp]
        processors: [memory_limiter, resource, batch]
        exporters: [otlp, debug]

Deploy the agent with:

helm install \
    otel-agent open-telemetry/opentelemetry-collector \
    -f otel-values.yaml
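To confirm the agent came up cleanly, you can check its pod and logs. The label and deployment names below follow the chart's defaults for a release named otel-agent; adjust them if you used a different release name or set fullnameOverride:

# List the collector pods created by the chart
kubectl get pods -l app.kubernetes.io/instance=otel-agent

# Tail the agent logs; you should see the OTLP receivers start on 4317 and 4318
kubectl logs deploy/otel-agent-opentelemetry-collector --tail=50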

Step 3. Deploy the central collector

To receive metrics from your Opaque workloads, deploy an OpenTelemetry collector that acts as the central hub. This collector receives OTLP traffic from all agent collectors and forwards metrics to Prometheus.

We recommend deploying this collector in a dedicated monitoring namespace to isolate observability components and align with standard Prometheus/Grafana setups. However, you're free to deploy it elsewhere if that better fits your infrastructure or organizational requirements.

Deploy the collector

Your central hub should expose standard OTLP receiver ports for both HTTP and gRPC. Here’s a minimal receiver configuration:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

You can run the collector as a Deployment, StatefulSet, or another topology depending on your availability and scaling needs.
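One straightforward option is to reuse the same open-telemetry/opentelemetry-collector chart for the hub, installed into a dedicated monitoring namespace with its own values file. The file name otel-hub-values.yaml below is just an example; its config section would contain the receiver, processor, and exporter blocks shown in this step:

kubectl create namespace monitoring

helm install \
    otel-hub open-telemetry/opentelemetry-collector \
    --namespace monitoring \
    -f otel-hub-values.yaml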

Expose the service via DNS

Expose the collector using a DNS record such as:

otlp-ingest.yourcompany.com

This DNS entry should route to the collector’s service via a Kubernetes LoadBalancer, Ingress, or your preferred service mesh. Ensure this endpoint is reachable from all Opaque workloads sending metrics.
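As one possible approach, a LoadBalancer Service in front of the hub can expose both OTLP ports. The selector below assumes the hub was installed from the OpenTelemetry Collector chart as a release named otel-hub, so match it to your hub's actual pod labels:

apiVersion: v1
kind: Service
metadata:
  name: otlp-ingest
  namespace: monitoring
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/instance: otel-hub  # assumed release name; match your hub's pod labels
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318

Point the otlp-ingest.yourcompany.com record at the external address this Service receives.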

Export metrics to Prometheus

Once the collector receives OTLP data, configure it to forward metrics to Prometheus using the prometheusremotewrite exporter. Here's an example of the remaining configuration:

processors:
  batch: {}

exporters:
  debug: {}
  prometheusremotewrite:
    endpoint: "http://prometheus.monitoring.svc:9090/api/v1/write"

service:
  telemetry:
    logs:
      level: "debug"
    metrics:
      level: "none"
      address: "0.0.0.0:0"
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite, debug]

In this setup:

  • The Collector receives OTLP metrics from all instrumented apps.
  • It batches and forwards those metrics to Prometheus.
  • Grafana then queries Prometheus to visualize them.

Step 4. Update Helm values in your Opaque workloads

Once your OTLP ingestion endpoint is live, update each Helm release to reference it. This ensures that metrics from your Opaque jobs and services are sent to the central collector.

Note

If you're using a shared values.yaml or common Helm chart overlay, this is often a single change.
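The exact values key depends on the Opaque chart you're deploying, so treat the overlay below as a purely illustrative sketch; otelExporterEndpoint is a hypothetical key, not a documented Opaque value, and you should use whatever key your chart exposes for the OTLP exporter endpoint:

# values-override.yaml -- hypothetical overlay; check your Opaque chart's
# documented values for the real key that configures the OTLP endpoint.
otelExporterEndpoint: "otlp-ingest.yourcompany.com:4317"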

Step 5. Visualize and alert in Grafana

After metrics reach Prometheus, you can use Grafana to build dashboards or set up alerts. Opaque provides example dashboard templates, or you can create your own based on your workloads.
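If Grafana isn't already connected to Prometheus, a minimal datasource provisioning file is usually enough. The URL below assumes Prometheus is reachable at prometheus.monitoring.svc:9090, as in the exporter example from Step 3:

# grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.monitoring.svc:9090  # adjust to your Prometheus service
    isDefault: true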

Common metrics to track include:

  • CPU and memory usage of key platform components
  • Container restarts or crash loops
  • Health of Opaque services (including the service host, ATLS cert manager, client API, and encryption/decryption service)
  • Job execution counts, durations, and failure rates

To verify that metrics are being collected successfully, open Grafana and select your Prometheus data source. Run a basic query (such as a built-in Prometheus metric or a known workload-level signal) to confirm metric activity. If results appear, metrics are flowing correctly. You should also see Opaque-specific metrics appear under custom namespaces once workloads run.

All of the following components export the http.server.request.duration metric (histogram buckets, unit: seconds), which measures the duration of HTTP server requests:

| Component | Cluster | OTel service name | Exporter protocol | Exporter endpoint (requires http://) | Default export interval |
|---|---|---|---|---|---|
| Flask client API | client | client-api | gRPC | http://localhost:4317 | 60 s |
| Enc/Dec engine | client | enc-dec-engine | HTTP | http://localhost:4318 | Per request |
| ATLS cert manager | client / dataplane | atls-cert-mgr | HTTP | http://localhost:4318 | 15 s (configurable) |
| Service host | dataplane | service-host | HTTP | http://localhost:4318 | 15 s (configurable) |
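To spot-check these metrics directly against Prometheus, you can query its HTTP API. The metric name below assumes the collector's default OpenTelemetry-to-Prometheus naming translation (dots become underscores and the unit is appended), and the service name and namespace are placeholders; verify both against your deployment:

# Port-forward Prometheus locally (service name and namespace are assumptions)
kubectl port-forward -n monitoring svc/prometheus 9090:9090

# Then, in another terminal, confirm the request-duration histogram is present
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=http_server_request_duration_seconds_count'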

Troubleshooting your telemetry pipeline

The Opaque metrics pipeline follows a clear structure: application-level OpenTelemetry SDKs export metrics to local agent collectors, which forward data to a centralized OTel hub. From there, metrics are exposed to Prometheus and visualized in Grafana.

Troubleshooting begins when symptoms appear — such as missing metrics, delayed dashboards, or gaps in data. The most effective approach is to trace the pipeline backwards, starting with Grafana and moving toward the source.

  • Start with Grafana. If dashboards are missing data, check whether Prometheus is receiving metrics. Visit the Prometheus /targets page to confirm that the OTel hub is either being scraped (pull-based) or actively writing metrics (via remote write).
  • Check the OTel hub. If Prometheus seems healthy but data is still missing, the issue may lie in the hub’s export pipeline. Problems like missing exporters, broken pipelines, or retry failures can silently block metric delivery. Review hub logs and built-in metrics (e.g., otelcol_exporter_*, otelcol_receiver_*) for clues.
  • Inspect the agent collectors. If the hub shows no incoming data, verify that each workload-level agent is running and configured to forward metrics to the hub. Common issues include missing OTLP receivers (port 4317 for gRPC or 4318 for HTTP), misconfigured endpoints or ports, and network restrictions (such as blocked egress or DNS resolution failures). The commands after this list cover the most common hub- and agent-level checks.
  • Check the application layer. Finally, confirm that the application is generating metrics at all. SDK misconfiguration—missing or extra http:// prefixes or invalid OTLP endpoints—is a common cause of silent failure. Enable debug logging or use tools like otel-cli to generate and trace synthetic telemetry during testing.
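The release names and monitoring namespace below match the examples in this guide; adjust them to your environment:

# Hub: look for exporter errors or dropped data in the logs
kubectl logs -n monitoring deploy/otel-hub-opentelemetry-collector --tail=100

# Agent: confirm the OTLP receivers started and the otlp exporter is not retrying
kubectl logs deploy/otel-agent-opentelemetry-collector --tail=100

# Agent self-metrics (otelcol_exporter_*, otelcol_receiver_*) exposed on port 8888
kubectl port-forward deploy/otel-agent-opentelemetry-collector 8888:8888
# Then, in another terminal:
curl -s http://localhost:8888/metrics | grep otelcol_exporter

Note that the hub configuration in Step 3 sets its own telemetry metrics level to none; raise it if you also want otelcol_* self-metrics from the hub.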

When diagnosing pipeline issues, simplify wherever possible. Swap exporters for console loggers to confirm emission or redirect to a local Prometheus instance to test ingestion. Monitoring the health of the pipeline itself—collector status, export success, and queue sizes—is just as critical as observing application-level metrics.

Approaching telemetry as a layered system makes it easier to isolate problems and restore visibility quickly.