diff --git a/.env b/.env index f67e6c489c..cc191c9adc 100644 --- a/.env +++ b/.env @@ -9,7 +9,7 @@ OTEL_JAVA_AGENT_VERSION=2.23.0 OPENTELEMETRY_CPP_VERSION=1.24.0 # Dependent images -COLLECTOR_CONTRIB_IMAGE=ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:0.142.0 +COLLECTOR_CONTRIB_IMAGE=ghcr.io/aknuds1/otelcontribcol:postgresreceiver-uuid-v0.143.0 FLAGD_IMAGE=ghcr.io/open-feature/flagd:v0.12.9 GRAFANA_IMAGE=grafana/grafana:12.3.1 JAEGERTRACING_IMAGE=jaegertracing/jaeger:2.12.0 @@ -17,7 +17,7 @@ JAEGERTRACING_IMAGE=jaegertracing/jaeger:2.12.0 OPENSEARCH_IMAGE=opensearchproject/opensearch:3.4.0 OPENSEARCH_DOCKERFILE=./src/opensearch/Dockerfile POSTGRES_IMAGE=postgres:17.6 -PROMETHEUS_IMAGE=quay.io/prometheus/prometheus:v3.8.1 +PROMETHEUS_IMAGE=ghcr.io/aknuds1/prometheus@sha256:5daac9ac954a23b1918d2dca10c0604355b3c2c5dbf0657e5a2358adea917e5c VALKEY_IMAGE=valkey/valkey:9.0.1-alpine3.23 TRACETEST_IMAGE=kubeshop/tracetest:${TRACETEST_IMAGE_VERSION} diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000000..6179af621b --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,134 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. + +## Project Overview + +This is the **OpenTelemetry Astronomy Shop Demo** - a polyglot microservices e-commerce application showcasing OpenTelemetry instrumentation across multiple programming languages. It serves as a realistic example for demonstrating distributed tracing, metrics, and logging. 
+ +## Common Commands + +### Running the Demo +```bash +make start # Start all services (http://localhost:8080) +make start-minimal # Start minimal set of services +make stop # Stop all services +``` + +### Building +```bash +make build # Build all Docker images +make redeploy service= # Rebuild and restart a single service +``` + +### Testing +```bash +make run-tests # Run all tests (frontend + trace-based) +make run-tracetesting # Run trace-based tests only +make run-tracetesting SERVICES_TO_TEST="ad payment" # Test specific services +``` + +### Linting & Validation +```bash +make check # Run all checks (misspell, markdownlint, license, links) +make misspell # Check spelling in markdown files +make markdownlint # Lint markdown files +make checklicense # Check license headers +``` + +### Protobuf Generation +```bash +make generate-protobuf # Generate protobuf code (requires local tools) +make docker-generate-protobuf # Generate protobuf code via Docker +make clean # Remove generated protobuf files +``` + +## Architecture + +### Service Communication +- **gRPC**: Primary protocol for inter-service communication (defined in `pb/demo.proto`) +- **HTTP/REST**: Used by frontend, email service, and external-facing endpoints +- **Kafka**: Async messaging for checkout -> accounting/fraud-detection flow +- **Envoy**: Frontend proxy handling routing to all services + +### Microservices by Language + +| Language | Services | +|----------|----------| +| **Go** | checkout, product-catalog | +| **Java** | ad, fraud-detection (with OTel Java agent) | +| **.NET/C#** | accounting, cart | +| **Python** | recommendation, product-reviews, load-generator | +| **TypeScript/Node.js** | frontend (Next.js), payment | +| **Ruby** | email | +| **PHP** | quote | +| **C++** | currency | +| **Rust** | shipping | +| **Elixir** | flagd-ui | + +### Key Infrastructure Components +- **OpenTelemetry Collector**: Central telemetry pipeline (`src/otel-collector/`) +- **Jaeger**: Distributed 
tracing backend (http://localhost:8080/jaeger/ui) +- **Grafana**: Dashboards and visualization (http://localhost:8080/grafana) +- **Prometheus**: Metrics storage +- **Flagd**: Feature flags service (`src/flagd/demo.flagd.json`) +- **Kafka**: Event streaming for order processing +- **Valkey**: Cart session storage (Redis-compatible) +- **PostgreSQL**: Persistent storage for accounting + +### Directory Structure +``` +src/ +├── / # Each microservice has its own directory +│ ├── Dockerfile # Build definition +│ └── README.md # Service-specific documentation +pb/ +└── demo.proto # Shared protobuf definitions for gRPC services +test/ +└── tracetesting/ # Trace-based test definitions +``` + +## Configuration + +- **Environment variables**: Defined in `.env` (base) and `.env.override` (local customizations) +- **Docker Compose**: Main orchestration in `docker-compose.yml` +- **Feature flags**: Configured in `src/flagd/demo.flagd.json` + +**Note:** Do not commit changes to `.env.override` - it is for local customizations only. + +## Development Workflow + +1. Make code changes to a service in `src//` +2. Rebuild and restart only that service: `make redeploy service=` +3. View traces in Jaeger and logs via `docker logs ` +4. For protobuf changes, update `pb/demo.proto` then run `make docker-generate-protobuf` + +## PromQL Conventions + +### Prefer `info()` over Resource Attribute Promotion + +When writing PromQL queries that need to filter or group by OpenTelemetry resource attributes (e.g., `service_name`, `deployment_environment_name`, `k8s_cluster_name`), prefer using the experimental `info()` function over resource attribute promotion in the collector. 
+ +**Pattern:** +```promql +# Preferred: Use info() with data-label-selector +sum by (service_name) ( + info(rate(http_server_request_duration_seconds_count[$__rate_interval]), + {deployment_environment_name=~"$env", service_name="$service"}) +) + +# Avoid: Resource attributes promoted directly onto metrics +sum by (service_name) ( + rate(http_server_request_duration_seconds_count{ + deployment_environment_name=~"$env", + service_name="$service" + }[$__rate_interval]) +) +``` + +**Why:** +- Reduces metric cardinality in Prometheus +- Resource attributes are stored once in `target_info` rather than on every metric +- The `info()` function joins metrics with `target_info` at query time + +**Note:** Requires Prometheus with `--enable-feature=promql-experimental-functions`. diff --git a/docker-compose.yml b/docker-compose.yml index 441d5ecdd4..b037b253c3 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -147,7 +147,7 @@ services: - GOMEMLIMIT=16MiB - OTEL_EXPORTER_OTLP_ENDPOINT - OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE - - OTEL_RESOURCE_ATTRIBUTES + - OTEL_RESOURCE_ATTRIBUTES=${OTEL_RESOURCE_ATTRIBUTES},service.instance.id=checkout - OTEL_SERVICE_NAME=checkout depends_on: cart: @@ -500,7 +500,7 @@ services: - GOMEMLIMIT=16MiB - OTEL_EXPORTER_OTLP_ENDPOINT - OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE - - OTEL_RESOURCE_ATTRIBUTES + - OTEL_RESOURCE_ATTRIBUTES=${OTEL_RESOURCE_ATTRIBUTES},service.instance.id=product-catalog - OTEL_SERVICE_NAME=product-catalog - OTEL_SEMCONV_STABILITY_OPT_IN=database - DB_CONNECTION_STRING=postgres://otelu:otelp@${POSTGRES_HOST}/${POSTGRES_DB}?sslmode=disable @@ -669,7 +669,7 @@ services: - FLAGD_OTEL_COLLECTOR_URI=${OTEL_COLLECTOR_HOST}:${OTEL_COLLECTOR_PORT_GRPC} - FLAGD_METRICS_EXPORTER=otel - GOMEMLIMIT=60MiB - - OTEL_RESOURCE_ATTRIBUTES + - OTEL_RESOURCE_ATTRIBUTES=${OTEL_RESOURCE_ATTRIBUTES},service.instance.id=flagd - OTEL_SERVICE_NAME=flagd command: [ "start", @@ -907,6 +907,7 @@ services: - 
--web.route-prefix=/ - --web.enable-otlp-receiver - --enable-feature=exemplar-storage + - --enable-feature=promql-experimental-functions volumes: - ./src/prometheus/prometheus-config.yaml:/etc/prometheus/prometheus-config.yaml deploy: diff --git a/kubernetes/deploy-info-function.sh b/kubernetes/deploy-info-function.sh new file mode 100755 index 0000000000..8bbf495ee7 --- /dev/null +++ b/kubernetes/deploy-info-function.sh @@ -0,0 +1,72 @@ +#!/bin/bash +# Deploy OpenTelemetry Demo with experimental Prometheus info() function support +# +# This script: +# 1. Installs/upgrades the Helm chart with custom values +# 2. Deploys custom Grafana dashboards that use the info() function + +set -e + +NAMESPACE="${NAMESPACE:-otel-demo}" +RELEASE_NAME="${RELEASE_NAME:-opentelemetry-demo}" +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +REPO_ROOT="$(dirname "$SCRIPT_DIR")" + +echo "=== Deploying OpenTelemetry Demo with info() function support ===" +echo "Namespace: $NAMESPACE" +echo "Release: $RELEASE_NAME" +echo "" + +# Add Helm repo if not already added +echo "Adding Helm repository..." +helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts 2>/dev/null || true +helm repo update + +# Create namespace if it doesn't exist +kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f - + +# Install/upgrade the Helm chart +echo "" +echo "Installing/upgrading Helm chart..." +helm upgrade --install "$RELEASE_NAME" open-telemetry/opentelemetry-demo \ + --namespace "$NAMESPACE" \ + -f "$SCRIPT_DIR/values-info-function.yaml" \ + --wait + +# Deploy custom dashboards as ConfigMaps +echo "" +echo "Deploying custom Grafana dashboards..." 
+ +# APM Dashboard +echo " - APM Dashboard" +kubectl create configmap apm-dashboard \ + --from-file=apm-dashboard.json="$REPO_ROOT/src/grafana/provisioning/dashboards/demo/apm-dashboard.json" \ + --namespace "$NAMESPACE" \ + --dry-run=client -o yaml | kubectl apply -f - +kubectl label configmap apm-dashboard grafana_dashboard=1 --namespace "$NAMESPACE" --overwrite + +# PostgreSQL Dashboard +echo " - PostgreSQL Dashboard" +kubectl create configmap postgresql-dashboard \ + --from-file=postgresql-dashboard.json="$REPO_ROOT/src/grafana/provisioning/dashboards/demo/postgresql-dashboard.json" \ + --namespace "$NAMESPACE" \ + --dry-run=client -o yaml | kubectl apply -f - +kubectl label configmap postgresql-dashboard grafana_dashboard=1 --namespace "$NAMESPACE" --overwrite + +# Restart Grafana to pick up the new dashboards +echo "" +echo "Restarting Grafana to load dashboards..." +kubectl rollout restart deployment/grafana --namespace "$NAMESPACE" 2>/dev/null || \ +kubectl rollout restart deployment/"$RELEASE_NAME"-grafana --namespace "$NAMESPACE" 2>/dev/null || \ +echo " (Could not restart Grafana - dashboards will load on next restart)" + +echo "" +echo "=== Deployment complete ===" +echo "" +echo "Access the demo:" +echo " kubectl port-forward svc/frontend-proxy 8080:8080 -n $NAMESPACE" +echo " Open http://localhost:8080" +echo "" +echo "Access Grafana:" +echo " kubectl port-forward svc/grafana 3000:80 -n $NAMESPACE" +echo " Open http://localhost:3000 (admin/admin)" diff --git a/kubernetes/deploy-kind.sh b/kubernetes/deploy-kind.sh new file mode 100755 index 0000000000..120a0e8dab --- /dev/null +++ b/kubernetes/deploy-kind.sh @@ -0,0 +1,113 @@ +#!/bin/bash +# Deploy OpenTelemetry Demo to a local Kind cluster +# +# This script: +# 1. Creates a Kind cluster (if it doesn't exist) +# 2. Installs the Helm chart with info() function support +# 3. 
Deploys custom Grafana dashboards +# +# Prerequisites: +# - kind: https://kind.sigs.k8s.io/docs/user/quick-start/#installation +# - kubectl +# - helm + +set -e + +CLUSTER_NAME="${CLUSTER_NAME:-otel-demo}" +NAMESPACE="${NAMESPACE:-otel-demo}" +RELEASE_NAME="${RELEASE_NAME:-opentelemetry-demo}" +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +REPO_ROOT="$(dirname "$SCRIPT_DIR")" + +echo "=== OpenTelemetry Demo on Kind ===" +echo "Cluster: $CLUSTER_NAME" +echo "Namespace: $NAMESPACE" +echo "" + +# Check prerequisites +command -v kind >/dev/null 2>&1 || { echo "Error: kind is not installed. See https://kind.sigs.k8s.io/docs/user/quick-start/#installation"; exit 1; } +command -v kubectl >/dev/null 2>&1 || { echo "Error: kubectl is not installed."; exit 1; } +command -v helm >/dev/null 2>&1 || { echo "Error: helm is not installed."; exit 1; } + +# Create Kind cluster if it doesn't exist +if ! kind get clusters 2>/dev/null | grep -q "^${CLUSTER_NAME}$"; then + echo "Creating Kind cluster '$CLUSTER_NAME'..." + kind create cluster --config "$SCRIPT_DIR/kind-config.yaml" --name "$CLUSTER_NAME" + echo "" +else + echo "Kind cluster '$CLUSTER_NAME' already exists." + # Ensure kubectl context is set to the Kind cluster + kubectl config use-context "kind-${CLUSTER_NAME}" + echo "" +fi + +# Add Helm repo +echo "Adding Helm repository..." +helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts 2>/dev/null || true +helm repo update + +# Create namespace +kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f - + +# Install/upgrade the Helm chart +echo "" +echo "Installing OpenTelemetry Demo (this may take a few minutes)..." 
+helm upgrade --install "$RELEASE_NAME" open-telemetry/opentelemetry-demo \ + --namespace "$NAMESPACE" \ + -f "$SCRIPT_DIR/values-info-function.yaml" \ + -f "$SCRIPT_DIR/values-kind.yaml" \ + --timeout 10m \ + --wait + +# Deploy custom dashboards +echo "" +echo "Deploying custom Grafana dashboards..." + +# Delete conflicting dashboards from Helm chart that don't use info() function. +# The Helm chart bundles dashboards that query metrics directly with resource +# attributes as labels. Our custom dashboards use the info() function instead. +echo " - Removing default Helm chart dashboards..." +kubectl delete configmap grafana-dashboard-apm-dashboard --namespace "$NAMESPACE" 2>/dev/null || true +kubectl delete configmap grafana-dashboard-postgresql-dashboard --namespace "$NAMESPACE" 2>/dev/null || true + +echo " - APM Dashboard" +kubectl create configmap apm-dashboard \ + --from-file=apm-dashboard.json="$REPO_ROOT/src/grafana/provisioning/dashboards/demo/apm-dashboard.json" \ + --namespace "$NAMESPACE" \ + --dry-run=client -o yaml | kubectl apply -f - +kubectl label configmap apm-dashboard grafana_dashboard=1 --namespace "$NAMESPACE" --overwrite + +echo " - PostgreSQL Dashboard" +kubectl create configmap postgresql-dashboard \ + --from-file=postgresql-dashboard.json="$REPO_ROOT/src/grafana/provisioning/dashboards/demo/postgresql-dashboard.json" \ + --namespace "$NAMESPACE" \ + --dry-run=client -o yaml | kubectl apply -f - +kubectl label configmap postgresql-dashboard grafana_dashboard=1 --namespace "$NAMESPACE" --overwrite + +# Restart Grafana to pick up dashboards +echo "" +echo "Restarting Grafana to load dashboards..." +kubectl rollout restart deployment/grafana --namespace "$NAMESPACE" 2>/dev/null || true + +# Wait for pods +echo "" +echo "Waiting for pods to be ready..." 
+kubectl wait --for=condition=ready pod -l app.kubernetes.io/instance="$RELEASE_NAME" \ + --namespace "$NAMESPACE" --timeout=5m 2>/dev/null || true + +echo "" +echo "=== Deployment complete ===" +echo "" +echo "Access the demo:" +echo " Frontend: http://localhost:8080 (via Kind NodePort)" +echo "" +echo "For Grafana, Prometheus, Jaeger use port-forward:" +echo " kubectl port-forward svc/grafana 3000:80 -n $NAMESPACE" +echo " kubectl port-forward svc/prometheus 9090:9090 -n $NAMESPACE" +echo " kubectl port-forward svc/jaeger 16686:16686 -n $NAMESPACE" +echo "" +echo "View pods:" +echo " kubectl get pods -n $NAMESPACE" +echo "" +echo "Delete cluster when done:" +echo " kind delete cluster --name $CLUSTER_NAME" diff --git a/kubernetes/kind-config.yaml b/kubernetes/kind-config.yaml new file mode 100644 index 0000000000..f329059562 --- /dev/null +++ b/kubernetes/kind-config.yaml @@ -0,0 +1,15 @@ +# Kind cluster configuration for OpenTelemetry Demo +# Creates a cluster with port mapping for the frontend proxy +# +# Usage: +# kind create cluster --config kubernetes/kind-config.yaml --name otel-demo +# +kind: Cluster +apiVersion: kind.x-k8s.io/v1alpha4 +nodes: + - role: control-plane + extraPortMappings: + # Frontend proxy (main entry point) - exposed via NodePort + - containerPort: 30080 + hostPort: 8080 + protocol: TCP diff --git a/kubernetes/values-info-function.yaml b/kubernetes/values-info-function.yaml new file mode 100644 index 0000000000..d3819a50cc --- /dev/null +++ b/kubernetes/values-info-function.yaml @@ -0,0 +1,115 @@ +# Helm values override for testing the experimental Prometheus info() function +# This configuration uses a custom Prometheus build with the info() bug fix and +# removes workaround resource attribute promotions. 
+# +# Usage: +# helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts +# helm repo update +# helm install opentelemetry-demo open-telemetry/opentelemetry-demo \ +# --namespace otel-demo --create-namespace \ +# -f kubernetes/values-info-function.yaml +# +# Note: This requires Prometheus with the experimental info() function fix. +# See https://github.com/prometheus/prometheus/pull/XXXX + +# OTel Collector: Use custom image with PostgreSQL receiver fix +opentelemetry-collector: + image: + repository: ghcr.io/aknuds1/otelcontribcol + tag: postgresreceiver-uuid-v0.143.0 + config: + processors: + # Don't override host.name from the original service with the collector's hostname + resourcedetection: + detectors: [env, system] + override: false + # Set host.name from k8s.pod.name since the resourcedetection processor + # would otherwise set it to the collector's hostname. + # Also preserve the existing service.instance.id setting from the chart. + resource: + attributes: + - key: service.instance.id + from_attribute: k8s.pod.uid + action: insert + - key: host.name + from_attribute: k8s.pod.name + action: upsert + +prometheus: + server: + # Custom Prometheus image with info() function bug fix + image: + repository: ghcr.io/aknuds1/prometheus + # Using digest for reproducibility + digest: sha256:5daac9ac954a23b1918d2dca10c0604355b3c2c5dbf0657e5a2358adea917e5c + + # Enable experimental PromQL functions including info() + extraFlags: + - "enable-feature=exemplar-storage" + - "enable-feature=promql-experimental-functions" + - "web.enable-otlp-receiver" + + # OTLP receiver configuration + # Resource attributes are NOT promoted to labels on metrics. + # Instead, use the experimental info() PromQL function to enrich metrics with + # labels from target_info at query time. This reduces label cardinality while + # still allowing access to resource attributes in queries. 
+ otlp: + keep_identifying_resource_attributes: true + promote_resource_attributes: + # Kafka resource attributes produced by the OTel Collector Kafka receiver + # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/kafkametricsreceiver + - kafka.cluster.alias + + # See https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.142.0/connector/spanmetricsconnector/README.md#known-limitation-the-single-writer-principle + - collector.instance.id + + # host.name is needed for system/host metrics from the hostmetrics receiver, + # which cannot use info() because they lack service.name to match target_info. + - host.name + + # Allow out-of-order ingestion for metrics that may arrive late + tsdb: + out_of_order_time_window: 30m + +# Grafana configuration with custom dashboards +grafana: + # Enable the sidecar to load dashboards from ConfigMaps + sidecar: + dashboards: + enabled: true + # Label that ConfigMaps must have to be picked up + label: grafana_dashboard + # Search in all namespaces + searchNamespace: ALL + + # Dashboard providers configuration + dashboardProviders: + dashboardproviders.yaml: + apiVersion: 1 + providers: + - name: 'custom' + orgId: 1 + folder: 'Custom' + type: file + disableDeletion: false + editable: true + options: + path: /var/lib/grafana/dashboards/custom + + # Note: The APM and PostgreSQL dashboards with info() function queries + # need to be deployed as ConfigMaps. 
See the dashboards in: + # - src/grafana/provisioning/dashboards/demo/apm-dashboard.json + # - src/grafana/provisioning/dashboards/demo/postgresql-dashboard.json + # + # To deploy them, create ConfigMaps like: + # + # kubectl create configmap apm-dashboard \ + # --from-file=apm-dashboard.json=src/grafana/provisioning/dashboards/demo/apm-dashboard.json \ + # -n otel-demo + # kubectl label configmap apm-dashboard grafana_dashboard=1 -n otel-demo + # + # kubectl create configmap postgresql-dashboard \ + # --from-file=postgresql-dashboard.json=src/grafana/provisioning/dashboards/demo/postgresql-dashboard.json \ + # -n otel-demo + # kubectl label configmap postgresql-dashboard grafana_dashboard=1 -n otel-demo diff --git a/kubernetes/values-kind.yaml b/kubernetes/values-kind.yaml new file mode 100644 index 0000000000..8a52a53d15 --- /dev/null +++ b/kubernetes/values-kind.yaml @@ -0,0 +1,22 @@ +# Additional Helm values for Kind deployment +# Use with: -f values-info-function.yaml -f values-kind.yaml +# +# This configures the frontend-proxy service as NodePort for Kind access +# and increases memory limits for services that need more than the defaults + +components: + frontend-proxy: + service: + type: NodePort + nodePort: 30080 + # Increase memory limits for services that OOMKill with defaults + product-catalog: + resources: + limits: + memory: 100Mi + flagd: + resources: + limits: + memory: 500Mi + # Disable flagd-ui sidecar - it OOMKills even with 1Gi limit + sidecarContainers: [] diff --git a/src/grafana/provisioning/dashboards/demo/apm-dashboard.json b/src/grafana/provisioning/dashboards/demo/apm-dashboard.json index cbcc4a50ff..bfdb8c1719 100644 --- a/src/grafana/provisioning/dashboards/demo/apm-dashboard.json +++ b/src/grafana/provisioning/dashboards/demo/apm-dashboard.json @@ -235,7 +235,7 @@ }, "disableTextWrap": false, "editorMode": "code", - "expr": "histogram_quantile(\n 0.95,\n sum by (le, deployment_environment_name, service_namespace, service_name) (\n 
rate(\n http_server_request_duration_seconds_bucket{\n deployment_environment_name=~\"$deployment_environment_name\",\n service_namespace=~\"$service_namespace\",\n service_name=\"$service_name\"\n }[$__rate_interval]\n )\n )\n)", + "expr": "histogram_quantile(\n 0.95,\n sum by (le, deployment_environment_name, service_namespace, service_name) (\n info(rate(http_server_request_duration_seconds_bucket[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"})\n )\n)", "fullMetaSearch": false, "includeNullMetadata": true, "interval": "60s", @@ -251,7 +251,7 @@ }, "disableTextWrap": false, "editorMode": "code", - "expr": "avg by (deployment_environment_name, service_namespace, service_name) (\n rate(\n http_server_request_duration_seconds_sum{\n deployment_environment_name=~\"$deployment_environment_name\",\n service_namespace=~\"$service_namespace\",\n service_name=\"$service_name\"\n }[$__rate_interval]\n )\n)\n/\navg by (deployment_environment_name, service_namespace, service_name) (\n rate(\n http_server_request_duration_seconds_count{\n deployment_environment_name=~\"$deployment_environment_name\",\n service_namespace=~\"$service_namespace\",\n service_name=\"$service_name\"\n }[$__rate_interval]\n )\n)", + "expr": "avg by (deployment_environment_name, service_namespace, service_name) (info(rate(http_server_request_duration_seconds_sum[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"})) / avg by (deployment_environment_name, service_namespace, service_name) (info(rate(http_server_request_duration_seconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}))", "fullMetaSearch": false, "hide": false, "includeNullMetadata": true, @@ -267,7 +267,7 @@ 
"uid": "${prometheus_datasource}" }, "editorMode": "code", - "expr": "histogram_quantile(\n 0.95,\n sum by (le, deployment_environment_name, service_namespace, service_name) (\n rate(\n rpc_server_duration_milliseconds_bucket{\n deployment_environment_name=~\"$deployment_environment_name\",\n service_namespace=~\"$service_namespace\",\n service_name=\"$service_name\"\n }[$__rate_interval]\n ) / 1000\n )\n)", + "expr": "histogram_quantile(\n 0.95,\n sum by (le, deployment_environment_name, service_namespace, service_name) (\n info(rate(rpc_server_duration_milliseconds_bucket[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}) / 1000\n )\n)", "hide": false, "instant": false, "interval": "60", @@ -281,7 +281,7 @@ "uid": "${prometheus_datasource}" }, "editorMode": "code", - "expr": "avg by (deployment_environment_name, service_namespace, service_name) (\n rate(\n rpc_server_duration_milliseconds_sum{\n deployment_environment_name=~\"$deployment_environment_name\",\n service_namespace=~\"$service_namespace\",\n service_name=\"$service_name\"\n }[$__rate_interval]\n ) / 1000\n)\n/\navg by (deployment_environment_name, service_namespace, service_name) (\n rate(\n rpc_server_duration_milliseconds_count{\n deployment_environment_name=~\"$deployment_environment_name\",\n service_namespace=~\"$service_namespace\",\n service_name=\"$service_name\"\n }[$__rate_interval]\n )\n)", + "expr": "avg by (deployment_environment_name, service_namespace, service_name) (info(rate(rpc_server_duration_milliseconds_sum[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}) / 1000) / avg by (deployment_environment_name, service_namespace, service_name) (info(rate(rpc_server_duration_milliseconds_count[$__rate_interval]), {service_name=\"$service_name\", 
deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}))", "hide": false, "instant": false, "interval": "60", @@ -387,7 +387,7 @@ }, "disableTextWrap": false, "editorMode": "code", - "expr": "(\n (sum by(deployment_environment_name, service_namespace, service_name) (rate(http_server_request_duration_seconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\", http_response_status_code=~\"5..\"}[$__rate_interval])) * 100) \n / \n sum by(deployment_environment_name, service_namespace, service_name) (rate(http_server_request_duration_seconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval]))\n)\nor\n(\n 0\n * \n sum by(deployment_environment_name, service_namespace, service_name) (rate(http_server_request_duration_seconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval]))\n)", + "expr": "(\n (sum by(deployment_environment_name, service_namespace, service_name) (info(rate(http_server_request_duration_seconds_count{http_response_status_code=~\"5..\"}[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"})) * 100) \n / \n sum by(deployment_environment_name, service_namespace, service_name) (info(rate(http_server_request_duration_seconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}))\n)\nor\n(\n 0\n * \n sum by(deployment_environment_name, service_namespace, service_name) (info(rate(http_server_request_duration_seconds_count[$__rate_interval]), {service_name=\"$service_name\", 
deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}))\n)", "fullMetaSearch": false, "hide": false, "includeNullMetadata": true, @@ -404,7 +404,7 @@ "uid": "${prometheus_datasource}" }, "editorMode": "code", - "expr": "((sum without (rpc_grpc_status_code, instance) (rate(rpc_server_duration_milliseconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\", rpc_grpc_status_code!=\"0\"}[$__rate_interval])) * 100) / sum without (rpc_grpc_status_code, instance) (rate(rpc_server_duration_milliseconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval])))\nor\n(0 * sum without (rpc_grpc_status_code, instance) (rate(rpc_server_duration_milliseconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval])))", + "expr": "((sum by (deployment_environment_name, service_namespace, service_name) (info(rate(rpc_server_duration_milliseconds_count{rpc_grpc_status_code!=\"0\"}[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"})) * 100) / sum by (deployment_environment_name, service_namespace, service_name) (info(rate(rpc_server_duration_milliseconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"})))\nor\n(0 * sum by (deployment_environment_name, service_namespace, service_name) (info(rate(rpc_server_duration_milliseconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"})))", "hide": false, 
"instant": false, "interval": "60", @@ -511,7 +511,7 @@ }, "editorMode": "code", "exemplar": false, - "expr": "(sum(rate(http_server_request_duration_seconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval])) by (deployment_environment_name, service_namespace, service_name)) ", + "expr": "sum by (deployment_environment_name, service_namespace, service_name) (info(rate(http_server_request_duration_seconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}))", "hide": false, "instant": false, "interval": "60s", @@ -525,7 +525,7 @@ "uid": "${prometheus_datasource}" }, "editorMode": "code", - "expr": "(sum(rate(rpc_server_duration_milliseconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval])) by (deployment_environment_name, service_namespace, service_name)) * $__interval_ms / 1000", + "expr": "sum by (deployment_environment_name, service_namespace, service_name) (info(rate(rpc_server_duration_milliseconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"})) * $__interval_ms / 1000", "hide": false, "instant": false, "interval": "60", @@ -683,7 +683,7 @@ }, "disableTextWrap": false, "editorMode": "code", - "expr": "\n sum by (operation) (\n label_join(\n rate(http_server_request_duration_seconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval]),\n \"operation\",\n \" \",\n \"http_request_method\",\n \"http_route\"\n )\n )\n ", + "expr": "sum by (operation, deployment_environment_name, service_namespace, 
service_name) (label_join(info(rate(http_server_request_duration_seconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"operation\", \" \", \"http_request_method\", \"http_route\"))", "fullMetaSearch": false, "includeNullMetadata": true, "interval": "60s", @@ -698,7 +698,7 @@ "uid": "${prometheus_datasource}" }, "editorMode": "code", - "expr": "(\n sum by (operation) (\n label_join(\n rate(http_server_request_duration_seconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\", http_response_status_code=~\"5..\"}[$__rate_interval]),\n \"operation\",\n \" \",\n \"http_request_method\",\n \"http_route\"\n )\n )\n / \n sum by (operation) (\n label_join(\n rate(http_server_request_duration_seconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval]),\n \"operation\",\n \" \",\n \"http_request_method\",\n \"http_route\"\n )\n )\n ) or (0 * \n sum by (operation) (\n label_join(\n rate(http_server_request_duration_seconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval]),\n \"operation\",\n \" \",\n \"http_request_method\",\n \"http_route\"\n )\n )\n )", + "expr": "(sum by (operation, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(http_server_request_duration_seconds_count{http_response_status_code=~\"5..\"}[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"operation\", \" \", \"http_request_method\", \"http_route\")) / sum by (operation, deployment_environment_name, service_namespace, 
service_name) (label_join(info(rate(http_server_request_duration_seconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"operation\", \" \", \"http_request_method\", \"http_route\"))) or (0 * sum by (operation, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(http_server_request_duration_seconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"operation\", \" \", \"http_request_method\", \"http_route\")))", "hide": false, "instant": false, "interval": "60s", @@ -712,7 +712,7 @@ "uid": "webstore-metrics" }, "editorMode": "code", - "expr": "\n histogram_quantile(\n 0.99,\n sum by (le, operation) (\n label_join(\n rate(http_server_request_duration_seconds_bucket{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[5m]),\n \"operation\",\n \" \",\n \"http_request_method\",\n \"http_route\"\n )\n )\n )\n ", + "expr": "histogram_quantile(0.99, sum by (le, operation, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(http_server_request_duration_seconds_bucket[5m]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"operation\", \" \", \"http_request_method\", \"http_route\")))", "hide": false, "instant": false, "interval": "60s", @@ -873,7 +873,7 @@ }, "disableTextWrap": false, "editorMode": "code", - "expr": "\nsum by (operation) (\n label_join(\n rate(rpc_server_duration_milliseconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval]),\n \"operation\",\n \"/\",\n \"rpc_service\",\n 
\"rpc_method\"\n )\n)\n ", + "expr": "sum by (operation, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(rpc_server_duration_milliseconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"operation\", \"/\", \"rpc_service\", \"rpc_method\"))", "fullMetaSearch": false, "hide": false, "includeNullMetadata": true, @@ -889,7 +889,7 @@ "uid": "${prometheus_datasource}" }, "editorMode": "code", - "expr": "(\n sum by (operation) (\n label_join(\n rate(rpc_server_duration_milliseconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\", rpc_grpc_status_code!=\"0\"}[$__rate_interval]),\n \"operation\",\n \"/\",\n \"rpc_service\",\n \"rpc_method\"\n )\n )\n / \n sum by (operation) (\n label_join(\n rate(rpc_server_duration_milliseconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval]),\n \"operation\",\n \"/\",\n \"rpc_service\",\n \"rpc_method\"\n )\n )\n ) or (0 * \n sum by (operation) (\n label_join(\n rate(rpc_server_duration_milliseconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval]),\n \"operation\",\n \"/\",\n \"rpc_service\",\n \"rpc_method\"\n )\n )\n )\n ", + "expr": "(sum by (operation, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(rpc_server_duration_milliseconds_count{rpc_grpc_status_code!=\"0\"}[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"operation\", \"/\", \"rpc_service\", \"rpc_method\")) / sum by (operation, deployment_environment_name, 
service_namespace, service_name) (label_join(info(rate(rpc_server_duration_milliseconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"operation\", \"/\", \"rpc_service\", \"rpc_method\"))) or (0 * sum by (operation, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(rpc_server_duration_milliseconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"operation\", \"/\", \"rpc_service\", \"rpc_method\")))", "hide": false, "instant": false, "interval": "60s", @@ -903,7 +903,7 @@ "uid": "webstore-metrics" }, "editorMode": "code", - "expr": "\n histogram_quantile(\n 0.99,\n sum by (le, operation) (\n label_join(\n rate(rpc_server_duration_milliseconds_bucket{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[5m]),\n \"operation\",\n \"/\",\n \"rpc_service\",\n \"rpc_method\"\n )\n )\n )\n ", + "expr": "histogram_quantile(0.99, sum by (le, operation, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(rpc_server_duration_milliseconds_bucket[5m]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"operation\", \"/\", \"rpc_service\", \"rpc_method\")))", "hide": false, "instant": false, "interval": "60s", @@ -1071,7 +1071,7 @@ }, "disableTextWrap": false, "editorMode": "code", - "expr": "\n sum by (outbound_service) (\n label_join(\n rate(http_client_request_duration_seconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval]),\n \"outbound_service\",\n \" \",\n \"server_address\",\n 
\"http_request_method\",\n \"url_template\"\n )\n )\n ", + "expr": "sum by (outbound_service, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(http_client_request_duration_seconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"outbound_service\", \" \", \"server_address\", \"http_request_method\", \"url_template\"))", "fullMetaSearch": false, "includeNullMetadata": true, "interval": "60s", @@ -1086,7 +1086,7 @@ "uid": "${prometheus_datasource}" }, "editorMode": "code", - "expr": "(\n sum by (outbound_service) (\n label_join(\n rate(http_client_request_duration_seconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\", http_response_status_code=~\"5..\"}[$__rate_interval]),\n \"outbound_service\",\n \" \",\n \"server_address\",\n \"http_request_method\",\n \"url_template\"\n )\n )\n / \n sum by (outbound_service) (\n label_join(\n rate(http_client_request_duration_seconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval]),\n \"outbound_service\",\n \" \",\n \"server_address\",\n \"http_request_method\",\n \"url_template\"\n )\n )\n ) or (0 * \n sum by (outbound_service) (\n label_join(\n rate(http_client_request_duration_seconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval]),\n \"outbound_service\",\n \" \",\n \"server_address\",\n \"http_request_method\",\n \"url_template\"\n )\n )\n )", + "expr": "(sum by (outbound_service, deployment_environment_name, service_namespace, service_name) 
(label_join(info(rate(http_client_request_duration_seconds_count{http_response_status_code=~\"5..\"}[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"outbound_service\", \" \", \"server_address\", \"http_request_method\", \"url_template\")) / sum by (outbound_service, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(http_client_request_duration_seconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"outbound_service\", \" \", \"server_address\", \"http_request_method\", \"url_template\"))) or (0 * sum by (outbound_service, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(http_client_request_duration_seconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"outbound_service\", \" \", \"server_address\", \"http_request_method\", \"url_template\")))", "hide": false, "instant": false, "interval": "60s", @@ -1100,7 +1100,7 @@ "uid": "webstore-metrics" }, "editorMode": "code", - "expr": "\nhistogram_quantile(\n 0.99,\n sum by (le, outbound_service) (\n label_join(\n rate(http_client_request_duration_seconds_bucket{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[5m]),\n \"outbound_service\",\n \" \",\n \"server_address\",\n \"http_request_method\",\n \"url_template\"\n )\n )\n)", + "expr": "histogram_quantile(0.99, sum by (le, outbound_service, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(http_client_request_duration_seconds_bucket[5m]), {service_name=\"$service_name\", 
deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"outbound_service\", \" \", \"server_address\", \"http_request_method\", \"url_template\")))", "hide": false, "instant": false, "interval": "60s", @@ -1258,7 +1258,7 @@ }, "disableTextWrap": false, "editorMode": "code", - "expr": "sum by (database) (\n label_join(\n rate(db_client_operation_duration_seconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval]),\n \"database\",\n \"/\",\n \"server_address\",\n \"db_namespace\"\n )\n)\n ", + "expr": "sum by (database, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(db_client_operation_duration_seconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"database\", \"/\", \"server_address\", \"db_namespace\"))", "fullMetaSearch": false, "hide": false, "includeNullMetadata": true, @@ -1274,7 +1274,7 @@ "uid": "${prometheus_datasource}" }, "editorMode": "code", - "expr": "(\n sum by (database) (\n label_join(\n rate(db_client_operation_duration_seconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\", http_response_status_code=~\"5..\"}[$__rate_interval]),\n \"database\",\n \"/\",\n \"server_address\",\n \"db_namespace\"\n )\n )\n / \n sum by (database) (\n label_join(\n rate(db_client_operation_duration_seconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval]),\n \"database\",\n \"/\",\n \"server_address\",\n \"db_namespace\"\n )\n )\n ) or (0 * \n sum by (database) (\n label_join(\n 
rate(db_client_operation_duration_seconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval]),\n \"database\",\n \"/\",\n \"server_address\",\n \"db_namespace\"\n )\n )\n )", + "expr": "(sum by (database, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(db_client_operation_duration_seconds_count{http_response_status_code=~\"5..\"}[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"database\", \"/\", \"server_address\", \"db_namespace\")) / sum by (database, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(db_client_operation_duration_seconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"database\", \"/\", \"server_address\", \"db_namespace\"))) or (0 * sum by (database, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(db_client_operation_duration_seconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"database\", \"/\", \"server_address\", \"db_namespace\")))", "hide": false, "instant": false, "interval": "60s", @@ -1288,7 +1288,7 @@ "uid": "webstore-metrics" }, "editorMode": "code", - "expr": "\nhistogram_quantile(\n 0.99,\n sum by (le, database) (\n label_join(\n rate(db_client_operation_duration_seconds_bucket{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[5m]),\n \"database\",\n \"/\",\n \"server_address\",\n \"db_namespace\"\n )\n )\n)", + "expr": "histogram_quantile(0.99, sum by (le, database, 
deployment_environment_name, service_namespace, service_name) (label_join(info(rate(db_client_operation_duration_seconds_bucket[5m]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"database\", \"/\", \"server_address\", \"db_namespace\")))", "hide": false, "instant": false, "interval": "60s", @@ -1450,7 +1450,7 @@ }, "disableTextWrap": false, "editorMode": "code", - "expr": "\n sum by (outbound_service) (\n label_join(\n rate(rpc_client_duration_milliseconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval]),\n \"outbound_service\",\n \"/\",\n \"server_address\",\n \"rpc_service\",\n \"rpc_method\"\n )\n )\n ", + "expr": "sum by (outbound_service, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(rpc_client_duration_milliseconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"outbound_service\", \"/\", \"server_address\", \"rpc_service\", \"rpc_method\"))", "fullMetaSearch": false, "hide": false, "includeNullMetadata": true, @@ -1466,7 +1466,7 @@ "uid": "${prometheus_datasource}" }, "editorMode": "code", - "expr": "(\n sum by (outbound_service) (\n label_join(\n rate(rpc_client_duration_milliseconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\", http_response_status_code=~\"5..\"}[$__rate_interval]),\n \"outbound_service\",\n \"/\",\n \"server_address\",\n \"rpc_service\",\n \"rpc_method\"\n )\n )\n / \n sum by (outbound_service) (\n label_join(\n rate(rpc_client_duration_milliseconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", 
service_name=\"$service_name\"}[$__rate_interval]),\n \"outbound_service\",\n \"/\",\n \"server_address\",\n \"rpc_service\",\n \"rpc_method\"\n )\n )\n ) or (0 * \n sum by (outbound_service) (\n label_join(\n rate(rpc_client_duration_milliseconds_count{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[$__rate_interval]),\n \"outbound_service\",\n \"/\",\n \"server_address\",\n \"rpc_service\",\n \"rpc_method\"\n )\n )\n )", + "expr": "(sum by (outbound_service, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(rpc_client_duration_milliseconds_count{http_response_status_code=~\"5..\"}[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"outbound_service\", \"/\", \"server_address\", \"rpc_service\", \"rpc_method\")) / sum by (outbound_service, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(rpc_client_duration_milliseconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"outbound_service\", \"/\", \"server_address\", \"rpc_service\", \"rpc_method\"))) or (0 * sum by (outbound_service, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(rpc_client_duration_milliseconds_count[$__rate_interval]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"outbound_service\", \"/\", \"server_address\", \"rpc_service\", \"rpc_method\")))", "hide": false, "instant": false, "interval": "60s", @@ -1480,7 +1480,7 @@ "uid": "webstore-metrics" }, "editorMode": "code", - "expr": "\nhistogram_quantile(\n 0.99,\n sum by (le, outbound_service) (\n label_join(\n 
rate(rpc_client_duration_milliseconds_bucket{deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\", service_name=\"$service_name\"}[5m]),\n \"outbound_service\",\n \"/\",\n \"server_address\",\n \"rpc_service\",\n \"rpc_method\"\n )\n )\n)", + "expr": "histogram_quantile(0.99, sum by (le, outbound_service, deployment_environment_name, service_namespace, service_name) (label_join(info(rate(rpc_client_duration_milliseconds_bucket[5m]), {service_name=\"$service_name\", deployment_environment_name=~\"$deployment_environment_name\", service_namespace=~\"$service_namespace\"}), \"outbound_service\", \"/\", \"server_address\", \"rpc_service\", \"rpc_method\")))", "hide": false, "instant": false, "interval": "60s", @@ -1632,7 +1632,7 @@ "targets": [ { "editorMode": "code", - "expr": "sum by (service_instance_id_host_name) (\n label_join(\n (\n sum by (service_instance_id, host_name) (\n rate(system_cpu_time_seconds_total{job=\"\", state!=\"idle\"}[$__rate_interval])\n )\n * on(host_name) group_left(service_instance_id)\n target_info{service_name=\"${service_name}\"}\n ),\n \"service_instance_id_host_name\",\n \" / \",\n \"service_instance_id\",\n \"host_name\"\n )\n)", + "expr": "sum by (service_instance_id_host_name) (\n label_join(\n (\n sum by (service_instance_id, host_name) (\n rate(system_cpu_time_seconds_total{state!=\"idle\"}[$__rate_interval])\n )\n * on(host_name) group_left(service_instance_id)\n target_info{service_namespace=~\"$service_namespace\", service_name=\"${service_name}\"}\n ),\n \"service_instance_id_host_name\",\n \" / \",\n \"service_instance_id\",\n \"host_name\"\n )\n)", "hide": false, "legendFormat": "__auto", "range": true, @@ -1644,7 +1644,7 @@ "uid": "webstore-metrics" }, "editorMode": "code", - "expr": "sum by (service_instance_id_host_name) (\n label_join(\n (\n (\n sum by (service_instance_id, host_name) (\n system_memory_usage_bytes{job=\"\", state!=\"free\"}\n )\n /\n sum by 
(service_instance_id, host_name) (\n system_memory_usage_bytes{job=\"\"}\n )\n )\n * on(host_name) group_left(service_instance_id) (\n target_info{service_name=\"${service_name}\"}\n )\n ),\n \"service_instance_id_host_name\",\n \" / \",\n \"service_instance_id\",\n \"host_name\"\n )\n)", + "expr": "sum by (service_instance_id_host_name) (\n label_join(\n (\n (\n sum by (service_instance_id, host_name) (\n system_memory_usage_bytes{state!=\"free\"}\n )\n /\n sum by (service_instance_id, host_name) (\n system_memory_usage_bytes\n )\n )\n * on(host_name) group_left(service_instance_id) (\n target_info{service_namespace=~\"$service_namespace\", service_name=\"${service_name}\"}\n )\n ),\n \"service_instance_id_host_name\",\n \" / \",\n \"service_instance_id\",\n \"host_name\"\n )\n)", "hide": false, "instant": false, "legendFormat": "__auto", @@ -2023,7 +2023,7 @@ "text": "Prometheus", "value": "webstore-metrics" }, - "description": "OpenTelemetry metrics. \nSend metrics using the Prometheus OTLP endpoint activating `keep_identifying_resource_attributes` and resource attribute promotion (aka `promote_resource_attributes`) including `service.name`, service.namespace`, `service.instance.id`, and `deployment.environment.name`", + "description": "OpenTelemetry metrics. \nSend metrics using the Prometheus OTLP endpoint with `keep_identifying_resource_attributes` enabled. 
Resource attributes such as `service.name`, `service.namespace`, and `deployment.environment.name` are read through the experimental `info()` function, which joins each series with `target_info`.", "label": "Metrics", "name": "prometheus_datasource", "options": [], diff --git a/src/grafana/provisioning/dashboards/demo/postgresql-dashboard.json b/src/grafana/provisioning/dashboards/demo/postgresql-dashboard.json index d4f7456143..020380c820 100644 --- a/src/grafana/provisioning/dashboards/demo/postgresql-dashboard.json +++ b/src/grafana/provisioning/dashboards/demo/postgresql-dashboard.json @@ -121,7 +121,7 @@ }, "dsType": "prometheus", "editorMode": "code", - "expr": "sum(irate(postgresql_commits_total{postgresql_database_name=~\"$db\",k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\",host_name=~\"$host_name\"}[$__rate_interval])) + sum(irate(postgresql_rollbacks_total{postgresql_database_name=~\"$db\",k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\",host_name=~\"$host_name\"}[$__rate_interval]))", + "expr": "sum by (k8s_cluster_name, k8s_statefulset_name, host_name) (info(irate(postgresql_commits_total{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}[$__rate_interval]), {service_name=\"postgresql\", postgresql_database_name=~\"$db\"})) + sum by (k8s_cluster_name, k8s_statefulset_name, host_name) (info(irate(postgresql_rollbacks_total{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}[$__rate_interval]), {service_name=\"postgresql\", postgresql_database_name=~\"$db\"}))", "format": "time_series", "groupBy": [ { @@ -266,7 +266,7 @@ }, "dsType": "prometheus", "editorMode": "code", - "expr": "sum(irate(postgresql_tup_fetched_total{postgresql_database_name=~\"$db\",k8s_cluster_name=~\"$k8s_cluster_name\", 
k8s_statefulset_name=~\"$k8s_statefulset_name\",host_name=~\"$host_name\"}[$__rate_interval]))", + "expr": "sum by (k8s_cluster_name, k8s_statefulset_name, host_name) (info(irate(postgresql_tup_fetched_total{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}[$__rate_interval]), {service_name=\"postgresql\", postgresql_database_name=~\"$db\"}))", "format": "time_series", "groupBy": [ { @@ -325,7 +325,7 @@ }, "dsType": "prometheus", "editorMode": "code", - "expr": "sum(irate(postgresql_tup_returned_total{postgresql_database_name=~\"$db\",k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\",host_name=~\"$host_name\"}[$__rate_interval]))", + "expr": "sum by (k8s_cluster_name, k8s_statefulset_name, host_name) (info(irate(postgresql_tup_returned_total{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}[$__rate_interval]), {service_name=\"postgresql\", postgresql_database_name=~\"$db\"}))", "format": "time_series", "groupBy": [ { @@ -385,7 +385,7 @@ }, "dsType": "prometheus", "editorMode": "code", - "expr": "sum(irate(postgresql_tup_inserted_total{postgresql_database_name=~\"$db\",k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\",host_name=~\"$host_name\"}[$__rate_interval]))", + "expr": "sum by (k8s_cluster_name, k8s_statefulset_name, host_name) (info(irate(postgresql_tup_inserted_total{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}[$__rate_interval]), {service_name=\"postgresql\", postgresql_database_name=~\"$db\"}))", "format": "time_series", "groupBy": [ { @@ -445,7 +445,7 @@ }, "dsType": "prometheus", "editorMode": "code", - "expr": "sum(irate(postgresql_tup_updated_total{dpostgresql_database_name=~\"$db\",k8s_cluster_name=~\"$k8s_cluster_name\", 
k8s_statefulset_name=~\"$k8s_statefulset_name\",host_name=~\"$host_name\"}[$__rate_interval]))", + "expr": "sum by (k8s_cluster_name, k8s_statefulset_name, host_name) (info(irate(postgresql_tup_updated_total{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}[$__rate_interval]), {service_name=\"postgresql\", postgresql_database_name=~\"$db\"}))", "format": "time_series", "groupBy": [ { @@ -506,7 +506,7 @@ }, "dsType": "prometheus", "editorMode": "code", - "expr": "sum(irate(postgresql_tup_deleted_total{postgresql_database_name=~\"$db\",k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\",host_name=~\"$host_name\"}[$__rate_interval]))", + "expr": "sum by (k8s_cluster_name, k8s_statefulset_name, host_name) (info(irate(postgresql_tup_deleted_total{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}[$__rate_interval]), {service_name=\"postgresql\", postgresql_database_name=~\"$db\"}))", "format": "time_series", "groupBy": [ { @@ -653,7 +653,7 @@ }, "dsType": "prometheus", "editorMode": "code", - "expr": "irate(postgresql_bgwriter_buffers_allocated_total{host_name=~\"$host_name\", k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\"}[$__rate_interval])", + "expr": "info(irate(postgresql_bgwriter_buffers_allocated_total{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}[$__rate_interval]), {service_name=\"postgresql\"})", "format": "time_series", "groupBy": [ { @@ -710,7 +710,7 @@ }, "disableTextWrap": false, "editorMode": "code", - "expr": "irate(postgresql_bgwriter_buffers_writes_total{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}[$__interval])", + "expr": 
"info(irate(postgresql_bgwriter_buffers_writes_total{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}[$__interval]), {service_name=\"postgresql\"})", "fullMetaSearch": false, "hide": false, "includeNullMetadata": true, @@ -840,7 +840,7 @@ }, "dsType": "prometheus", "editorMode": "code", - "expr": "irate(postgresql_bgwriter_checkpoint_count_total{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}[$__rate_interval])", + "expr": "info(irate(postgresql_bgwriter_checkpoint_count_total{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}[$__rate_interval]), {service_name=\"postgresql\"})", "format": "time_series", "groupBy": [ { @@ -1025,7 +1025,7 @@ }, "dsType": "prometheus", "editorMode": "code", - "expr": "sum(postgresql_deadlocks_total{postgresql_database_name=~\"$db\",k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"})", + "expr": "sum by (k8s_cluster_name, k8s_statefulset_name, host_name) (info(postgresql_deadlocks_total{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}, {service_name=\"postgresql\", postgresql_database_name=~\"$db\"}))", "format": "time_series", "groupBy": [ { @@ -1082,7 +1082,7 @@ }, "dsType": "prometheus", "editorMode": "code", - "expr": "sum(postgresql_conflicts_total{postgresql_database_name=~\"$db\",k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\",host_name=~\"$host_name\"})", + "expr": "sum by (k8s_cluster_name, k8s_statefulset_name, host_name) (info(postgresql_conflicts_total{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}, {service_name=\"postgresql\", postgresql_database_name=~\"$db\"}))", "format": 
"time_series", "groupBy": [ { @@ -1225,7 +1225,7 @@ "uid": "$datasource" }, "editorMode": "code", - "expr": "round(\n sum by (postgresql_database_name) (\n rate(\n postgresql_blks_hit_total{\n postgresql_database_name=~\"$db\",\n k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\",\n host_name=~\"$host_name\"\n }[$__rate_interval]\n )\n )\n /\n (\n sum by (postgresql_database_name) (\n rate(\n postgresql_blks_hit_total{\n postgresql_database_name=~\"$db\",\n k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\",\n host_name=~\"$host_name\"\n }[$__rate_interval]\n )\n )\n +\n sum by (postgresql_database_name) (\n rate(\n postgresql_blks_read_total{\n postgresql_database_name=~\"$db\",\n k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\",\n host_name=~\"$host_name\"\n }[$__rate_interval]\n )\n )\n ) * 100,\n 0.001\n)", + "expr": "round(\n (sum by (postgresql_database_name, k8s_cluster_name, k8s_statefulset_name, host_name) (info(rate(postgresql_blks_hit_total{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}[$__rate_interval]), {service_name=\"postgresql\", postgresql_database_name=~\"$db\"})) /\n (sum by (postgresql_database_name, k8s_cluster_name, k8s_statefulset_name, host_name) (info(rate(postgresql_blks_hit_total{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}[$__rate_interval]), {service_name=\"postgresql\", postgresql_database_name=~\"$db\"})) +\n sum by (postgresql_database_name, k8s_cluster_name, k8s_statefulset_name, host_name) (info(rate(postgresql_blks_read_total{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}[$__rate_interval]), {service_name=\"postgresql\", postgresql_database_name=~\"$db\"})))) * 100,\n 0.001\n)", "format": "time_series", 
"legendFormat": "{{postgresql_database_name}} - cache hit ratio", "range": true, @@ -1323,7 +1323,7 @@ "uid": "$datasource" }, "editorMode": "code", - "expr": "postgresql_backends{postgresql_database_name=~\"$db\",k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}", + "expr": "info(postgresql_backends{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\", host_name=~\"$host_name\"}, {service_name=\"postgresql\", postgresql_database_name=~\"$db\"})", "format": "time_series", "intervalFactor": 2, "legendFormat": "{{postgresql_database_name}} - connections", @@ -1369,7 +1369,7 @@ "type": "prometheus", "uid": "webstore-metrics" }, - "definition": "label_values(postgresql_table_count,k8s_cluster_name)", + "definition": "label_values(target_info,k8s_cluster_name)", "description": "When deploying PostgreSQL on Kubernetes, name of the Kubernetes cluster. \nFor other deployments, select \"All\". ", "includeAll": true, "label": "K8s Cluster", @@ -1378,7 +1378,7 @@ "options": [], "query": { "qryType": 1, - "query": "label_values(postgresql_table_count,k8s_cluster_name)", + "query": "label_values(target_info,k8s_cluster_name)", "refId": "PrometheusVariableQueryEditor-VariableQuery" }, "refresh": 2, @@ -1392,7 +1392,7 @@ "text": "", "value": "" }, - "definition": "label_values(postgresql_table_count{k8s_cluster_name=~\"$k8s_cluster_name\"},k8s_statefulset_name)", + "definition": "label_values(target_info{k8s_cluster_name=~\"$k8s_cluster_name\"},k8s_statefulset_name)", "description": "When deploying on Kubernetes, name of the `StatefulSet` of the PostgreSQL deployment (e.g. `my-pg-cluster`).\nFor other deployments, select \"All\". 
", "includeAll": true, "label": "K8s Statefulset", @@ -1401,7 +1401,7 @@ "options": [], "query": { "qryType": 1, - "query": "label_values(postgresql_table_count{k8s_cluster_name=~\"$k8s_cluster_name\"},k8s_statefulset_name)", + "query": "label_values(target_info{k8s_cluster_name=~\"$k8s_cluster_name\"},k8s_statefulset_name)", "refId": "PrometheusVariableQueryEditor-VariableQuery" }, "refresh": 1, @@ -1419,7 +1419,7 @@ "datasource": { "uid": "$datasource" }, - "definition": "label_values(postgresql_table_count{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\"},host_name)", + "definition": "label_values(target_info{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\"},host_name)", "description": "When deploying PostgreSQL on VMs, name on the host on which the database is deployed.\nFor other deployments, select \"All\". ", "includeAll": true, "label": "Host", @@ -1428,7 +1428,7 @@ "options": [], "query": { "qryType": 1, - "query": "label_values(postgresql_table_count{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\"},host_name)", + "query": "label_values(target_info{k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\"},host_name)", "refId": "PrometheusVariableQueryEditor-VariableQuery" }, "refresh": 2, @@ -1446,14 +1446,14 @@ "type": "prometheus", "uid": "webstore-metrics" }, - "definition": "label_values(postgresql_table_count{host_name=~\"$host_name\", postgresql_database_name!~\"template.*|postgres\", k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\"},postgresql_database_name)", + "definition": "label_values(target_info{service_name=\"postgresql\", host_name=~\"$host_name\", postgresql_database_name!~\"template.*|postgres\", k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\"},postgresql_database_name)", "includeAll": true, "label": 
"Database", "name": "db", "options": [], "query": { "qryType": 1, - "query": "label_values(postgresql_table_count{host_name=~\"$host_name\", postgresql_database_name!~\"template.*|postgres\", k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\"},postgresql_database_name)", + "query": "label_values(target_info{service_name=\"postgresql\", host_name=~\"$host_name\", postgresql_database_name!~\"template.*|postgres\", k8s_cluster_name=~\"$k8s_cluster_name\", k8s_statefulset_name=~\"$k8s_statefulset_name\"},postgresql_database_name)", "refId": "PrometheusVariableQueryEditor-VariableQuery" }, "refresh": 2, diff --git a/src/otel-collector/otelcol-config.yml b/src/otel-collector/otelcol-config.yml index 709047404c..c5c147b9f6 100644 --- a/src/otel-collector/otelcol-config.yml +++ b/src/otel-collector/otelcol-config.yml @@ -160,6 +160,11 @@ processors: # could be removed when https://github.com/vercel/next.js/pull/64852 is fixed upstream - replace_pattern(name, "\\?.*", "") - replace_match(name, "GET /api/products/*", "GET /api/products/{productId}") + metric_statements: + - context: resource + statements: + # Set service.instance.id to service.name if not already set (needed for Prometheus target_info) + - set(attributes["service.instance.id"], attributes["service.name"]) where attributes["service.instance.id"] == nil and attributes["service.name"] != nil connectors: spanmetrics: @@ -172,7 +177,7 @@ service: exporters: [otlp, debug, spanmetrics] metrics: receivers: [docker_stats, httpcheck/frontend-proxy, hostmetrics, nginx, otlp, postgresql, redis, spanmetrics] - processors: [resourcedetection, memory_limiter] + processors: [resourcedetection, transform, memory_limiter] exporters: [otlphttp/prometheus, debug] logs: receivers: [otlp] diff --git a/src/prometheus/prometheus-config.yaml b/src/prometheus/prometheus-config.yaml index ddcacf236c..7b76d48b03 100644 --- a/src/prometheus/prometheus-config.yaml +++ 
b/src/prometheus/prometheus-config.yaml @@ -7,48 +7,22 @@ global: otlp: keep_identifying_resource_attributes: true + # Resource attributes are NOT promoted to labels on metrics. + # Instead, use the experimental info() PromQL function to enrich metrics with + # labels from target_info at query time. This reduces label cardinality while + # still allowing access to resource attributes in queries. + # See https://prometheus.io/blog/2025/12/16/introducing-info-function/ promote_resource_attributes: - - service.instance.id - - service.name - - service.namespace - - service.version - - cloud.availability_zone - - cloud.region - - deployment.environment.name - - # When deploying on Kubernetes, resource attributes used to identify the - # kubernetes resources in dashboards and alerts. - - k8s.cluster.name - - k8s.container.name - - k8s.cronjob.name - - k8s.daemonset.name - - k8s.deployment.name - - k8s.job.name - - k8s.namespace.name - - k8s.node.name - - k8s.pod.name - - k8s.replicaset.name - - k8s.statefulset.name - - container.name - - # When deploying on VMs, resource attributes used to identify - # the host in dashboards and alerts. - - host.name - - # PostgreSQL resource attributes produced by the OTel Collector PostgreSQL receiver - # and used in dashboards and alerts. 
- # See https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/postgresqlreceiver/metadata.yaml - - postgresql.database.name - - postgresql.schema.name - - postgresql.table.name - - postgresql.index.name - # Kafka resource attributes produced by the OTel Collector Kafka receiver # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/kafkametricsreceiver - kafka.cluster.alias # See https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.142.0/connector/spanmetricsconnector/README.md#known-limitation-the-single-writer-principle - collector.instance.id + + # host.name is needed for system/host metrics from the hostmetrics receiver, + # which cannot use info() because they lack service.name to match target_info. + - host.name storage: tsdb: out_of_order_time_window: 30m