Monitor Opaque performance
Monitoring your AI workloads helps you track performance, ensure reliability, and maintain compliance without exposing sensitive data. This tutorial shows you how to set up observability for your Opaque workloads using OpenTelemetry. It uses Prometheus (for metrics) and Azure Blob Storage (for logs) as example destinations, but the same steps apply if you use other OTLP-compatible back ends. You'll learn to:
- Enable metrics and logs collection for your Opaque workloads.
- Configure your observability system to collect and forward these metrics and logs.
- Point your observability system at the right endpoints for collection.
- Verify that metrics are being collected and can be queried or visualized in your preferred time-series system.
- Verify that logs are being delivered to a storage back end such as Azure Blob Storage.
Before you begin
Ensure you have the following:
- A self-hosted Opaque deployment running in your Kubernetes cluster
- A separate Kubernetes cluster (new or existing) where you’ll deploy your observability stack (OpenTelemetry hub collector, Prometheus, etc.)
- Basic familiarity with Kubernetes and Helm
- A working Prometheus installation
- Access to an Azure Blob Storage container (required if you’re enabling log export)
- A bearer token to authenticate Opaque agent collectors with the hub collector
- A TLS certificate and private key (tls.crt and tls.key) for securing telemetry traffic into the hub collector
During setup, you'll work primarily with the OpenTelemetry Operator Helm chart, introduced in the steps below.
How observability works
An observability pipeline is the path that system data takes from where it’s generated to where it’s stored, visualized, and analyzed. At a high level, the flow looks like this:
- Applications emit telemetry. Running services generate both metrics—such as latency, CPU usage, and error rates—and logs, including event records, status updates, and error messages.
- Local collectors receive the data. Lightweight agents or sidecars capture telemetry close to the application. They may enrich the data with metadata, batch it for efficiency, or apply transformations.
- A central hub aggregates telemetry. Collectors across the environment forward their data to a centralized service that unifies streams from many applications. This hub can apply additional processing and then route telemetry to one or more destinations.
- Back ends store, visualize, and analyze. Metrics are typically sent to time-series databases and monitoring dashboards, while logs are shipped to searchable storage systems for troubleshooting and auditing.
This pipeline ensures that telemetry from distributed systems can be consistently captured, processed, and made actionable.
A generic observability pipeline.
Opaque implements this pattern using OpenTelemetry (OTel), a CNCF-supported standard for collecting telemetry from distributed systems:
- Applications emit metrics and logs over the OpenTelemetry Protocol (OTLP), typically over gRPC (4317) or HTTP (4318).
- OTel collector agents (bundled with the Opaque deployment and running as sidecars or node-level agents) receive telemetry from applications and forward it using the otlp exporter (a simplified example appears after this list). You can optionally enable a debug exporter for local inspection.
- The central OTel hub is set up and managed by you, outside of the Opaque deployment. It aggregates telemetry from all agents and forwards it to your chosen back ends. The hub can also apply optional processing, filtering, or transformation.
- Back ends for storage and analysis are also your choice. In this tutorial, metrics are exported to Prometheus and logs to Azure Blob Storage, but you can use any OTLP-compatible destinations.
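For orientation, the fragment below sketches what an agent-side exporter configuration looks like. Opaque configures the bundled agents for you, so you don't write this yourself; the endpoint and token values are placeholders.
extensions:
  bearertokenauth/otlp:
    token: SuperSecret1234                    # placeholder; must match the hub's expected token
exporters:
  otlp:
    endpoint: otlp-ingest.example.com:4317    # the hub's OTLP gRPC endpoint
    auth:
      authenticator: bearertokenauth/otlp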
The following diagram illustrates how metrics flow from instrumented apps to your observability back ends.
The Opaque observability pipeline.
Set up metrics and logs with the OpenTelemetry Operator
This guide uses the OpenTelemetry Operator Helm chart to manage collectors in your Kubernetes cluster. Opaque provides per-workload collector agents as part of the client and data plane deployment, so you don’t need to create those yourself. Your role is to deploy and configure a central hub collector using the Operator. The hub aggregates telemetry from the built-in collector agents and forwards it to the metrics and logs back ends you choose (for example, Prometheus or Azure Blob Storage).
Step 1. Decide destinations and protocols
To receive metrics and logs from your applications, you’ll need to configure an OpenTelemetry Protocol (OTLP) endpoint. This endpoint is exposed by the hub collector, and the Opaque-deployed agent collectors forward their telemetry to it before export to your chosen back end.
For example, you might expose a DNS name like:
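otlp-ingest.example.com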
This endpoint serves as the entry point in your infrastructure for both metrics and logs, typically over OTLP gRPC (4317).
Info
Metrics back ends include any OTLP-compatible time-series store, such as Prometheus (via Remote Write), Datadog, or similar systems.
Logs back ends include any OTLP-compatible log store, such as Azure Blob Storage or Datadog.
As you plan, make sure you have the following details for each back end:
- DNS/URL of the ingestion endpoint
- Protocol (gRPC)
- Authentication method (bearer token, API key, or mTLS)
- TLS requirements (certificates, CA bundle)
Step 2. Deploy the OpenTelemetry Operator Helm chart
The OpenTelemetry Operator manages the lifecycle of collectors in your Kubernetes cluster. In this setup, the Operator runs in your observability cluster, which is separate from the cluster where Opaque is deployed.
Installing this chart adds two key components:
- A custom resource definition named OpenTelemetryCollector (or otelcol for short). This allows you to define collectors declaratively as Kubernetes resources.
- A controller (pod) that watches for otelcol objects in the cluster. When a new otelcol resource is created, the controller deploys and manages the underlying collector pods on your behalf.
This makes it easy to run collectors in different modes (agent, hub) without hand-crafting deployments.
To get started, add the chart with Helm:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm install opentelemetry-operator \
open-telemetry/opentelemetry-operator \
--namespace monitoring \
--create-namespace \
--set admissionWebhooks.certManager.enabled=false \
--set admissionWebhooks.autoGenerateCert.enabled=true
We recommend installing the Operator in a dedicated monitoring namespace to keep observability components isolated. If you prefer another namespace, replace monitoring in all subsequent manifests and commands.
For additional context or troubleshooting, see the official OpenTelemetry Operator documentation.
Step 3. Configure secrets and certificates
First, create a Kubernetes secret with the bearer token that the Opaque-deployed agent collectors will use to authenticate when sending telemetry to the hub collector:
# Replace SuperSecret1234 with your own bearer token.
kubectl create secret generic otel-hub-bearer-token-secret \
  --from-literal=token=SuperSecret1234 \
  --namespace monitoring
You'll also need a TLS certificate and private key specifically for the hub collector. This certificate is separate from the ones used when deploying Opaque; it secures telemetry traffic into your observability cluster. Once you have the files (tls.crt and tls.key), create the Kubernetes TLS secret:
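# Assumes tls.crt and tls.key are in your current directory.
kubectl create secret tls otel-hub-tls \
  --cert=tls.crt \
  --key=tls.key \
  --namespace monitoring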
The certificate's DNS name must match the hostname you'll expose for the hub collector (for example, otlp-ingest.example.com).
You’ll reference these secrets in the next step to secure the receiver and configure any exporters that need credentials.
Step 4. Create your hub collector
The hub collector aggregates telemetry from the Opaque agent collectors and forwards it to your chosen back ends. Run it in the monitoring namespace and scale replicas as needed. Keep the base config simple for bring-up; you'll add exporters in a later step.
To define the hub, create a manifest named otel-hub.yaml:
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-hub
  namespace: monitoring
spec:
  mode: deployment
  image: docker.io/otel/opentelemetry-collector-contrib:0.131.0
  # Provide the bearer token for authenticated ingest (from Step 3 secret).
  env:
    - name: OTEL_RECEIVER_BEARER_TOKEN
      valueFrom:
        secretKeyRef:
          name: otel-hub-bearer-token-secret
          key: token
  # Mount TLS certs if you enabled TLS in Step 3.
  volumeMounts:
    - mountPath: /etc/ssl/certs
      name: tls-certs
      readOnly: true
  volumes:
    - name: tls-certs
      secret:
        secretName: otel-hub-tls
  config:
    extensions:
      bearertokenauth/otlp:
        scheme: Bearer
        # Token must match the value you set when launching your Azure Managed App.
        token: ${env:OTEL_RECEIVER_BEARER_TOKEN}
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            auth:
              authenticator: bearertokenauth/otlp
            tls:
              cert_file: /etc/ssl/certs/tls.crt
              key_file: /etc/ssl/certs/tls.key
    processors:
      batch:
        send_batch_size: 4096
        timeout: 10s
    exporters:
      # Debug exporter writes summaries to container stdout.
      # In Step 6 you'll replace this with your real logs back end (e.g., Azure Blob).
      debug:
        verbosity: normal
    service:
      extensions: [bearertokenauth/otlp]
      pipelines:
        logs:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug]
Then create this resource:
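kubectl apply -f otel-hub.yaml
The Operator detects the new otelcol object and deploys the hub collector pods on your behalf.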
Step 5. Expose your hub service via DNS
When the hub collector is deployed, Kubernetes creates a Service (for example, otel-hub-collector in the monitoring namespace). To make it usable, you need to expose this Service with a DNS name that matches the TLS certificate you created in Step 3.
For example:
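otlp-ingest.example.com  ->  otel-hub-collector.monitoring.svc:4317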
This DNS record should route to the collector’s Service through a Kubernetes LoadBalancer, Ingress, or your preferred service mesh.
Note
The exact setup depends on your environment, but the key requirement is that all Opaque workloads must be able to reach the DNS endpoint you configure.
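Once the DNS record exists, a quick way to confirm reachability and the certificate from a client network is to resolve the name and inspect the TLS handshake (using the example hostname from Step 3):
nslookup otlp-ingest.example.com
openssl s_client -connect otlp-ingest.example.com:4317 \
  -servername otlp-ingest.example.com </dev/null 2>/dev/null | grep -i "verif"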
Step 6. Configure your exporters
The hub collector doesn't store telemetry itself; it needs exporters to deliver data to your monitoring or storage back ends. Exporters are part of pipelines that move data from receivers, through processors, and out to exporters. You can define multiple exporters in the same otel-hub.yaml and configure separate pipelines for metrics and logs.
The following examples show how to forward metrics to Prometheus and logs to Azure Blob Storage. If you use different OTLP-compatible back ends, substitute the appropriate exporters.
processors:
  batch: {}
exporters:
  # --- Metrics exporter (Prometheus Remote Write) ---
  prometheusremotewrite:
    endpoint: "http://prometheus.monitoring.svc:9090/api/v1/write"
  # --- Logs exporter (Azure Blob Storage) ---
  azureblob:
    url: "https://<your-account>.blob.core.windows.net/"
    container:
      logs: "logs"
    auth:
      type: "connection_string"
      connection_string: "DefaultEndpointsProtocol=https;AccountName=<your-account>;AccountKey=<account-key>;EndpointSuffix=core.windows.net"
    encodings:
      logs: text_encoding
    append_blob:
      enabled: true
service:
  # Keep the auth extension from Step 4 when updating the service section.
  extensions: [bearertokenauth/otlp]
  pipelines:
    # --- Metrics pipeline ---
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite, debug]
    # --- Logs pipeline ---
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [azureblob, debug]
This configuration ensures:
- Metrics flow through the metrics pipeline and are exported to Prometheus.
- Logs flow through the logs pipeline and are exported to Azure Blob Storage (while also going to debug for troubleshooting).
Step 7. Verify your telemetry flow
How you verify depends on whether you’re sending metrics or logs.
Metrics: Confirm ingestion in your time-series store
After metrics are exported, confirm that they are being received by your Prometheus instance or another OTLP-compatible time-series database.
Common metrics to track include:
- CPU and memory usage of key platform components
- Container restarts or crash loops
- Health of Opaque services (service host, cert manager, client, and encryption/decryption service)
- Job execution counts, durations, and failure rates
To confirm that metrics are flowing:
- Query your time-series store for a built-in metric or a known workload signal (see the example query after this list).
- If results appear and update over time, metrics are being collected.
- Once workloads run, you should also see Opaque-specific metrics appear under custom namespaces.
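For example, assuming Prometheus is reachable at the in-cluster address used in Step 6, a quick spot check from your workstation might look like the following; the translated metric name is an assumption and can vary with your OTLP-to-Prometheus naming settings.
# Port-forward Prometheus locally, then query for signals.
kubectl -n monitoring port-forward svc/prometheus 9090:9090 &
sleep 2
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up'
# Once Opaque workloads run, look for a workload-level signal (name may vary):
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(http_server_request_duration_seconds_count[5m])'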
The following table lists reference signal examples.
| Component | Cluster | OTEL Service Name | OTEL Exporter Endpoint | Requires http:// | Default Export Interval (s) | Metric Name | Type | Unit | Description |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Flask client API | client | client-api | http://localhost:4317 | Yes | 60 | http.server.request.duration | histogram buckets | s | Duration of HTTP server requests |
| Enc/Dec engine | client | enc-dec-engine | http://localhost:4318 | Yes | per request | http.server.request.duration | histogram buckets | s | Duration of HTTP server requests |
| ATLS cert manager | client/dataplane | atls-cert-mgr | http://localhost:4318 | Yes | 15 / configurable | http.server.request.duration | histogram buckets | s | Duration of HTTP server requests |
| Service host | dataplane | service-host | http://localhost:4318 | Yes | 15 / configurable | http.server.request.duration | histogram buckets | s | Duration of HTTP server requests |
Logs: Confirm storage in your log back end
If you’ve configured the Azure Blob exporter, you can use the Azure portal or CLI to verify that logs are being written:
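# List recent log blobs in the container configured in Step 6.
# Replace <your-account>; use the auth flags appropriate for your environment.
az storage blob list \
  --account-name <your-account> \
  --container-name logs \
  --auth-mode login \
  --output table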
You should see new objects appear (organized by date and time) that correspond to your workload activity.
If you’re using a different OTLP-compatible log system (such as Datadog, Splunk, or Elasticsearch), use that system’s query or search interface to confirm that new log entries are arriving.
Troubleshooting your telemetry pipeline
Both metrics and logs flow through the same pipeline: applications → local agent collectors → central hub collector → export to your chosen back end.
When data goes missing, the fastest way to diagnose issues is to trace this path backward, starting from the visualization or storage layer and moving toward the source.
General approach
- Start with the back end. If dashboards (Grafana) or storage (Azure Blob) are missing data, check whether the hub is exporting correctly.
  - For metrics: open the Prometheus /targets page to confirm the hub is being scraped or writing via remote write.
  - For logs: use the Azure CLI to confirm new files are being written, for example with the az storage blob list command shown earlier. You should see recent log objects appear with timestamps matching your workload activity.
- Check the hub collector. Missing exporters, misconfigured pipelines, or retry failures often block delivery silently. Review hub logs and built-in metrics (otelcol_exporter_*, otelcol_receiver_*) for errors; see the example command after this list.
- Verify connection settings. Confirm that the endpoint URL, TLS certificate, and bearer token used by the agent collectors match what the hub expects, and use valid certificates so that TLS verification succeeds.
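For example, to scan the hub's recent logs for exporter or receiver errors (the Deployment name follows the otel-hub-collector naming used for the Service in Step 5; adjust it if yours differs):
kubectl -n monitoring logs deployment/otel-hub-collector --tail=200 | grep -iE "error|retry"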
Metrics-specific checks
- Prometheus targets. Ensure Prometheus is receiving remote writes (push mode).
- Opaque system metrics. Look for service-level signals (e.g., http.server.request.duration) under Opaque-specific namespaces once workloads are active.
Log-specific checks
- Auth and TLS. If your hub is exposed externally, confirm both the bearer token (it must match the Azure Managed App launch value) and the TLS certificate (use openssl s_client to check for Verification: OK).
- Debugging exporter output. If logs still don't appear in storage, increase verbosity on the debug exporter to confirm the hub is receiving records:
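exporters:
  debug:
    # Temporarily raise verbosity from "normal" to "detailed" to print full records to stdout.
    verbosity: detailed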
By tracing the pipeline step by step, you’ll always have a clear path to diagnose and resolve gaps in your metrics or logs.