Monitor Opaque performance
Monitoring your AI workloads helps you track performance, ensure reliability, and maintain compliance without exposing sensitive data. This tutorial shows you how to set up observability for your Opaque workloads using OpenTelemetry. It uses Prometheus (for metrics) and Azure Blob Storage (for logs) as example destinations, but the same steps apply if you use other OTLP-compatible back ends. You'll learn to:
- Enable metrics and logs collection for your Opaque workloads.
- Configure your observability system to collect and forward these metrics and logs.
- Point your observability system at the right endpoints for collection.
- Verify that metrics are being collected and can be queried or visualized in your preferred time-series system.
- Verify that logs are being delivered to a storage back end such as Azure Blob Storage.
Before you begin
Ensure you have the following:
- A self-hosted Opaque deployment running in your Kubernetes cluster
- A separate Kubernetes cluster (new or existing) where you’ll deploy your observability stack (OpenTelemetry hub collector, Prometheus, etc.)
- Basic familiarity with Kubernetes and Helm
- A working Prometheus installation
- Access to an Azure Blob Storage container (required if you’re enabling log export)
- A bearer token to authenticate Opaque agent collectors with the hub collector
- A TLS certificate and private key (tls.crt and tls.key) for securing telemetry traffic into the hub collector
During setup, you'll work primarily with the OpenTelemetry Operator Helm chart, introduced in the steps below.
How observability works
An observability pipeline is the path that system data takes from where it’s generated to where it’s stored, visualized, and analyzed. At a high level, the flow looks like this:
- Applications emit telemetry. Running services generate both metrics—such as latency, CPU usage, and error rates—and logs, including event records, status updates, and error messages.
- Local collectors receive the data. Lightweight agents or sidecars capture telemetry close to the application. They may enrich the data with metadata, batch it for efficiency, or apply transformations.
- A central hub aggregates telemetry. Collectors across the environment forward their data to a centralized service that unifies streams from many applications. This hub can apply additional processing and then route telemetry to one or more destinations.
- Back ends store, visualize, and analyze. Metrics are typically sent to time-series databases and monitoring dashboards, while logs are shipped to searchable storage systems for troubleshooting and auditing.
This pipeline ensures that telemetry from distributed systems can be consistently captured, processed, and made actionable.
A generic observability pipeline.
Opaque implements this pattern using OpenTelemetry (OTel), a CNCF-supported standard for collecting telemetry from distributed systems:
- Applications emit metrics and logs over the OpenTelemetry Protocol (OTLP), typically over gRPC (4317) or HTTP (4318).
- OTel collector agents (bundled with the Opaque deployment and running as sidecars or node-level agents) receive telemetry from applications and forward it using the otlp exporter (a simplified example appears after this list). You can optionally enable a debug exporter for local inspection.
- The central OTel hub is set up and managed by you, outside of the Opaque deployment. It aggregates telemetry from all agents and forwards it to your chosen back ends. The hub can also apply optional processing, filtering, or transformation.
- Back ends for storage and analysis are also your choice. In this tutorial, metrics are exported to Prometheus and logs to Azure Blob Storage, but you can use any OTLP-compatible destinations.
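For orientation, the fragment below sketches what an agent-side exporter configuration looks like. Opaque configures the bundled agents for you, so you don't write this yourself; the endpoint and token values are placeholders.
extensions:
  bearertokenauth/otlp:
    token: SuperSecret1234                    # placeholder; must match the hub's expected token
exporters:
  otlp:
    endpoint: otlp-ingest.example.com:4317    # the hub's OTLP gRPC endpoint
    auth:
      authenticator: bearertokenauth/otlp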
The following diagram illustrates how metrics flow from instrumented apps to your observability back ends.
The Opaque observability pipeline.
Set up metrics and logs with the OpenTelemetry Operator
This guide uses the OpenTelemetry Operator Helm chart to manage collectors in your Kubernetes cluster. Opaque provides per-workload collector agents as part of the client and data plane deployment, so you don’t need to create those yourself. Your role is to deploy and configure a central hub collector using the Operator. The hub aggregates telemetry from the built-in collector agents and forwards it to the metrics and logs back ends you choose (for example, Prometheus or Azure Blob Storage).
Step 1. Decide destinations and protocols
To receive metrics and logs from your applications, you’ll need to configure an OpenTelemetry Protocol (OTLP) endpoint. This endpoint is exposed by the hub collector, and the Opaque-deployed agent collectors forward their telemetry to it before export to your chosen back end.
For example, you might expose a DNS name like:
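otlp-ingest.example.com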
This endpoint serves as the entry point in your infrastructure for both metrics and logs, typically over OTLP gRPC (4317).
Info
Metrics back ends include any OTLP-compatible time-series store, such as Prometheus (via Remote Write), Datadog, or similar systems.
Logs back ends include any OTLP-compatible log store, such as Azure Blob Storage or Datadog.
As you plan, make sure you have the following details for each back end:
- DNS/URL of the ingestion endpoint
- Protocol (gRPC)
- Authentication method (bearer token, API key, or mTLS)
- TLS requirements (certificates, CA bundle)
Step 2. Deploy the OpenTelemetry Operator Helm chart
The OpenTelemetry Operator manages the lifecycle of collectors in your Kubernetes cluster. In this setup, the Operator runs in your observability cluster, which is separate from the cluster where Opaque is deployed.
Installing this chart adds two key components:
- A custom resource definition named OpenTelemetryCollector (or otelcol for short). This allows you to define collectors declaratively as Kubernetes resources.
- A controller (pod) that watches for otelcol objects in the cluster. When a new otelcol resource is created, the controller deploys and manages the underlying collector pods on your behalf.
This makes it easy to run collectors in different modes (agent, hub) without hand-crafting deployments.
To get started, add the chart with Helm:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm install opentelemetry-operator \
open-telemetry/opentelemetry-operator \
--namespace monitoring \
--create-namespace \
--set admissionWebhooks.certManager.enabled=false \
--set admissionWebhooks.autoGenerateCert.enabled=true
We recommend installing the Operator in a dedicated monitoring namespace to keep observability components isolated. If you prefer another namespace, replace monitoring in all subsequent manifests and commands.
For additional context or troubleshooting, see the official OpenTelemetry Operator documentation.
Step 3. Configure secrets and certificates
First, create a Kubernetes secret with the bearer token that the Opaque-deployed agent collectors will use to authenticate when sending telemetry to the hub collector:
# Replace SuperSecret1234 with your own bearer token.
kubectl create secret generic otel-hub-bearer-token-secret \
  --from-literal=token=SuperSecret1234 \
  --namespace monitoring
You'll also need a TLS certificate and private key specifically for the hub collector. This certificate is separate from the ones used when deploying Opaque; it secures telemetry traffic into your observability cluster. Once you have the files (tls.crt and tls.key), create the Kubernetes TLS secret:
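# Assumes tls.crt and tls.key are in your current directory.
kubectl create secret tls otel-hub-tls \
  --cert=tls.crt \
  --key=tls.key \
  --namespace monitoring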
The certificate's DNS name must match the hostname you'll expose for the hub collector (for example, otlp-ingest.example.com).
You’ll reference these secrets in the next step to secure the receiver and configure any exporters that need credentials.
Step 4. Create your hub collector
The hub collector aggregates telemetry from the Opaque agent collectors and forwards it to your chosen back ends. Run it in the monitoring namespace and scale replicas as needed. Keep the base config simple for bring-up; you'll add exporters in a later step.
To define the hub, create a manifest named otel-hub.yaml:
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-hub
  namespace: monitoring
spec:
  mode: deployment
  image: docker.io/otel/opentelemetry-collector-contrib:0.131.0
  # Provide the bearer token for authenticated ingest (from Step 3 secret).
  env:
    - name: OTEL_RECEIVER_BEARER_TOKEN
      valueFrom:
        secretKeyRef:
          name: otel-hub-bearer-token-secret
          key: token
  # Mount TLS certs if you enabled TLS in Step 3.
  volumeMounts:
    - mountPath: /etc/ssl/certs
      name: tls-certs
      readOnly: true
  volumes:
    - name: tls-certs
      secret:
        secretName: otel-hub-tls
  config:
    extensions:
      bearertokenauth/otlp:
        scheme: Bearer
        # Token must match the value you set when launching your Azure Managed App.
        token: ${env:OTEL_RECEIVER_BEARER_TOKEN}
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            auth:
              authenticator: bearertokenauth/otlp
            tls:
              cert_file: /etc/ssl/certs/tls.crt
              key_file: /etc/ssl/certs/tls.key
    processors:
      batch:
        send_batch_size: 4096
        timeout: 10s
    exporters:
      # Debug exporter writes summaries to container stdout.
      # In Step 6 you'll replace this with your real logs back end (e.g., Azure Blob).
      debug:
        verbosity: normal
    service:
      extensions: [bearertokenauth/otlp]
      pipelines:
        logs:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug]
Then create this resource:
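kubectl apply -f otel-hub.yaml
The Operator detects the new otelcol object and deploys the hub collector pods on your behalf.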
Step 5. Expose your hub service via DNS
When the hub collector is deployed, Kubernetes creates a Service (for example, otel-hub-collector in the monitoring namespace). To make it usable, you need to expose this Service with a DNS name that matches the TLS certificate you created in Step 3.
For example:
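otlp-ingest.example.com  ->  otel-hub-collector.monitoring.svc:4317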
This DNS record should route to the collector’s Service through a Kubernetes LoadBalancer, Ingress, or your preferred service mesh.
Note
The exact setup depends on your environment, but the key requirement is that all Opaque workloads must be able to reach the DNS endpoint you configure.
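Once the DNS record exists, a quick way to confirm reachability and the certificate from a client network is to resolve the name and inspect the TLS handshake (using the example hostname from Step 3):
nslookup otlp-ingest.example.com
openssl s_client -connect otlp-ingest.example.com:4317 \
  -servername otlp-ingest.example.com </dev/null 2>/dev/null | grep -i "verif"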
Step 6. Configure your exporters
The hub collector doesn't store telemetry itself; it needs exporters to deliver data to your monitoring or storage back ends. Exporters are part of pipelines that move data from receivers, through processors, and out to exporters. You can define multiple exporters in the same otel-hub.yaml and configure separate pipelines for metrics and logs.
The following examples show how to forward metrics to Prometheus and logs to Azure Blob Storage. If you use different OTLP-compatible back ends, substitute the appropriate exporters.
processors:
  batch: {}
exporters:
  # --- Metrics exporter (Prometheus Remote Write) ---
  prometheusremotewrite:
    endpoint: "http://prometheus.monitoring.svc:9090/api/v1/write"
  # --- Logs exporter (Azure Blob Storage) ---
  azureblob:
    url: "https://<your-account>.blob.core.windows.net/"
    container:
      logs: "logs"
    auth:
      type: "connection_string"
      connection_string: "DefaultEndpointsProtocol=https;AccountName=<your-account>;AccountKey=<account-key>;EndpointSuffix=core.windows.net"
    encodings:
      logs: text_encoding
    append_blob:
      enabled: true
service:
  # Keep the auth extension from Step 4 when updating the service section.
  extensions: [bearertokenauth/otlp]
  pipelines:
    # --- Metrics pipeline ---
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite, debug]
    # --- Logs pipeline ---
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [azureblob, debug]
This configuration ensures:
- Metrics flow through the metrics pipeline and are exported to Prometheus.
- Logs flow through the logs pipeline and are exported to Azure Blob Storage (while also going to debug for troubleshooting).
Step 7. Verify your telemetry flow
How you verify depends on whether you’re sending metrics or logs.
Metrics: Confirm ingestion in your time-series store
After metrics are exported, confirm that they are being received by your Prometheus instance or another OTLP-compatible time-series database.
Common metrics to track include:
- CPU and memory usage of key platform components
- Container restarts or crash loops
- Health of Opaque services (service host, cert manager, client, and encryption/decryption service)
- Job execution counts, durations, and failure rates
To confirm that metrics are flowing:
- Query your time-series store for a built-in metric or a known workload signal (see the example query after this list).
- If results appear and update over time, metrics are being collected.
- Once workloads run, you should also see Opaque-specific metrics appear under custom namespaces.
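For example, assuming Prometheus is reachable at the in-cluster address used in Step 6, a quick spot check from your workstation might look like the following; the translated metric name is an assumption and can vary with your OTLP-to-Prometheus naming settings.
# Port-forward Prometheus locally, then query for signals.
kubectl -n monitoring port-forward svc/prometheus 9090:9090 &
sleep 2
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up'
# Once Opaque workloads run, look for a workload-level signal (name may vary):
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(http_server_request_duration_seconds_count[5m])'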
The following table lists reference signal examples.
| Component | Cluster | OTEL Service Name | OTEL Exporter Endpoint | Requires http:// | Default Export Interval (s) | Metric Name | Type | Unit | Description |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Flask client API | client | client-api | http://localhost:4317 | Yes | 60 | http.server.request.duration | histogram buckets | s | Duration of HTTP server requests |
| Enc/Dec engine | client | enc-dec-engine | http://localhost:4318 | Yes | per request | http.server.request.duration | histogram buckets | s | Duration of HTTP server requests |
| ATLS cert manager | client/dataplane | atls-cert-mgr | http://localhost:4318 | Yes | 15 / configurable | http.server.request.duration | histogram buckets | s | Duration of HTTP server requests |
| Service host | dataplane | service-host | http://localhost:4318 | Yes | 15 / configurable | http.server.request.duration | histogram buckets | s | Duration of HTTP server requests |
Logs: Confirm storage in your log back end
If you’ve configured the Azure Blob exporter, you can use the Azure portal or CLI to verify that logs are being written:
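# List recent log blobs in the container configured in Step 6.
# Replace <your-account>; use the auth flags appropriate for your environment.
az storage blob list \
  --account-name <your-account> \
  --container-name logs \
  --auth-mode login \
  --output table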
You should see new objects appear (organized by date and time) that correspond to your workload activity.
If you’re using a different OTLP-compatible log system (such as Datadog, Splunk, or Elasticsearch), use that system’s query or search interface to confirm that new log entries are arriving.
Troubleshooting your telemetry pipeline
Both metrics and logs flow through the same pipeline: applications → local agent collectors → central hub collector → export to your chosen back end.
When data goes missing, the fastest way to diagnose issues is to trace this path backward, starting from the visualization or storage layer and moving toward the source.
General approach
- Start with the back end. If dashboards (Grafana) or storage (Azure Blob) are missing data, check whether the hub is exporting correctly.
  - For metrics: open the Prometheus /targets page to confirm the hub is being scraped or writing via remote write.
  - For logs: use the Azure CLI to confirm new files are being written, for example with the az storage blob list command shown earlier. You should see recent log objects appear with timestamps matching your workload activity.
- Check the hub collector. Missing exporters, misconfigured pipelines, or retry failures often block delivery silently. Review hub logs and built-in metrics (otelcol_exporter_*, otelcol_receiver_*) for errors; see the example command after this list.
- Verify connection settings. Confirm that the endpoint URL, TLS certificate, and bearer token used by the agent collectors match what the hub expects, and use valid certificates so that TLS verification succeeds.
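For example, to scan the hub's recent logs for exporter or receiver errors (the Deployment name follows the otel-hub-collector naming used for the Service in Step 5; adjust it if yours differs):
kubectl -n monitoring logs deployment/otel-hub-collector --tail=200 | grep -iE "error|retry"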
Metrics-specific checks
- Prometheus targets. Ensure Prometheus is receiving remote writes (push mode).
- Opaque system metrics. Look for service-level signals (e.g., http.server.request.duration) under Opaque-specific namespaces once workloads are active.
Log-specific checks
- Auth and TLS. If your hub is exposed externally, confirm both the bearer token (it must match the Azure Managed App launch value) and the TLS certificate (use openssl s_client to check for Verification: OK).
- Debugging exporter output. If logs still don't appear in storage, increase verbosity on the debug exporter to confirm the hub is receiving records:
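exporters:
  debug:
    # Temporarily raise verbosity from "normal" to "detailed" to print full records to stdout.
    verbosity: detailed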
By tracing the pipeline step by step, you’ll always have a clear path to diagnose and resolve gaps in your metrics or logs.