Use PromQL info function instead of resource attribute promotion #2869
base: main
Conversation
Force-pushed from 2cb5b84 to 93d3951
- OTEL_EXPORTER_OTLP_ENDPOINT
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_RESOURCE_ATTRIBUTES
- OTEL_RESOURCE_ATTRIBUTES=${OTEL_RESOURCE_ATTRIBUTES},service.instance.id=checkout
service.instance.id is expected to be generated by SDKs or derived from the K8s environment; moreover, it should be a GUID.
Specs: https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-instance-id
K8s naming specs: https://opentelemetry.io/docs/specs/semconv/non-normative/k8s-attributes/
resource/postgresql:
  attributes:
    - key: service.name
      value: postgresql
      action: upsert
    - key: service.instance.id
      value: ${env:POSTGRES_HOST}
      action: upsert
Reading the service.name specs here: we could try to broaden the requirements in the specs to also define service.name for infrastructure monitoring use cases, and convince OTel Collector receiver maintainers to adopt this, but today no infrastructure monitoring receiver produces service.name or service.instance.id.
- context: resource
  statements:
    # Set service.instance.id to service.name if not already set (needed for Prometheus info() joins)
    - set(attributes["service.instance.id"], attributes["service.name"]) where attributes["service.instance.id"] == nil and attributes["service.name"] != nil
We would have collisions if the same service type (e.g. a Redis) is running multiple times. For infra monitoring metrics, we commonly use attributes like host.name to differentiate the instances.
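To make the collision concern concrete, here is a hedged PromQL sketch (the job and instance values are assumptions following the snippet above, where service.instance.id is copied from service.name):

```promql
# Hypothetical: if two Redis pods both get service.instance.id = "redis",
# their target_info series collide on the identifying (job, instance) pair
# and differ only in non-identifying labels, e.g.
#   target_info{job="redis", instance="redis", host_name="node-a"}
#   target_info{job="redis", instance="redis", host_name="node-b"}
# info() then fails with "duplicate series for info metric". Counting
# identities per (job, instance) surfaces such collisions:
count by (job, instance) (target_info) > 1
```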
@@ -0,0 +1,136 @@
# CLAUDE.md
not sure you want to commit this file?
Good point @jmichalek132 - it's useful while working out the PR; however, I would remove it before finalizing the PR.
Overall looks good outside of what @cyrille-leclerc already pointed out. Did you test this locally (given the large number of changes to the queries, even if just re-formatting)? Do all of the panels still show metrics? It would potentially be nice to show screenshots of it.
@jmichalek132 I did some simple testing locally, but I don't know the demo well yet, so I'm not a very effective tester :/ Do you know the demo well enough to look for discrepancies? I did fix the bugs I could find from checking the APM and PostgreSQL dashboards, on Docker Compose.
Force-pushed from 93d3951 to 43cdf83
As discussed offline with @cyrille-leclerc, it might be better to implement …
transform/postgresql:
  error_mode: ignore
  metric_statements:
    - context: resource
      statements:
        # Construct unique service.instance.id based on PostgreSQL resource scope.
        # The PostgreSQL receiver sets postgresql.database.name, postgresql.table.name,
        # postgresql.index.name as resource attributes, creating multiple target_info
        # entries with the same identifying labels. By including these in service.instance.id,
        # each scope gets a unique target_info, allowing info() to work correctly.
        - set(attributes["service.instance.id"], Concat([attributes["service.name"], "/", attributes["postgresql.database.name"], "/", attributes["postgresql.table.name"], "/", attributes["postgresql.index.name"]], "")) where attributes["postgresql.index.name"] != nil
        - set(attributes["service.instance.id"], Concat([attributes["service.name"], "/", attributes["postgresql.database.name"], "/", attributes["postgresql.table.name"]], "")) where attributes["postgresql.table.name"] != nil and attributes["postgresql.index.name"] == nil
        - set(attributes["service.instance.id"], Concat([attributes["service.name"], "/", attributes["postgresql.database.name"]], "")) where attributes["postgresql.database.name"] != nil and attributes["postgresql.table.name"] == nil
        - set(attributes["service.instance.id"], attributes["service.name"]) where attributes["postgresql.database.name"] == nil
After talking with @aknuds1, we noticed that I had a similar idea, though not using service.instance.id (or instance on the Prometheus side) but a different resource.uid.
One advantage of using service.instance.id is that it works out of the box for older Prometheus, but it adds an additional translation which might create issues. For instance, it performs a translation that may be surprising to a user, and it's not clear what will happen once the entity data model is supported. Should the user remove this translation and break the UI?
Using a different attribute has the advantage that, while not changing the status quo, it will work better with the introduction of the entity data model and is also a bit clearer. service.instance.id is not always enough; sometimes it requires service.name and service.namespace (i.e., job).
We will probably drop this client-side synthesis though, in favour of something on the Prometheus OTLP side instead.
Ok, and presumably the synthesis wouldn't be changing instance but another label, correct?
@ldufr The synthesis should generate the instance label - why not? If the OTLP endpoint cannot generate target_info because the identifying resource attribute triplet (service.namespace, service.name, service.instance.id) isn't present, the idea is to have a potential fallback for determining another identifying resource attribute subset (from which to generate target_info and the job and instance labels).
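For context, a hedged sketch of the mapping being discussed (label values are assumptions, following the usual Prometheus OTLP translation conventions): the identifying triplet becomes the job and instance labels on target_info.

```promql
# Conventionally, job = "<service.namespace>/<service.name>" and
# instance = "<service.instance.id>", so a target_info series for the demo's
# checkout service might look like (values assumed):
#   target_info{job="opentelemetry-demo/checkout", instance="checkout"}
# The fallback idea is to derive these labels from another identifying
# attribute subset when the full triplet is missing. Selecting the series:
target_info{job="opentelemetry-demo/checkout"}
```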
Force-pushed from 7210765 to a8a33a0
Add a dedicated pipeline for PostgreSQL metrics with a resource processor that sets service.name and service.instance.id. This ensures Prometheus generates target_info for PostgreSQL metrics, enabling the info() function to work correctly.

Signed-off-by: Arve Knudsen <[email protected]>
Replace info() with explicit group_left + max() joins to handle duplicate target_info series from the PostgreSQL receiver. The PostgreSQL receiver creates multiple target_info entries (one per database/table/index scope), causing info() to fail with a "duplicate series for info metric" error. The max() aggregation collapses duplicate target_info series while preserving all k8s and host variable filters for K8s compatibility.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
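For reference, a sketch of the verbose join pattern this commit describes (the metric and label names are illustrative assumptions, not lifted from the actual dashboards):

```promql
# Enrich a PostgreSQL metric with a k8s resource attribute from target_info.
# max() first collapses duplicate target_info series sharing (job, instance).
rate(postgresql_blocks_read_total[5m])
  * on (job, instance) group_left (k8s_namespace_name)
    max by (job, instance, k8s_namespace_name) (target_info)
```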
Update PostgreSQL dashboard to use the experimental Prometheus info() function for enriching metrics with resource attributes from target_info.

Changes:
- Add transform/postgresql processor to generate unique service.instance.id per PostgreSQL resource scope (database, table, index), fixing duplicate target_info entries
- Promote k8s.cluster.name and k8s.statefulset.name to metric labels to work around a Prometheus info() filtering bug when filter labels exist on the input metric
- Simplify dashboard queries from verbose group_left + max() to cleaner info() function calls

This approach works for both Kubernetes and Docker Compose deployments.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
Use a custom collector image that generates unique service.instance.id per PostgreSQL resource scope to fix duplicate target_info entries.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The custom collector image now generates unique service.instance.id per PostgreSQL resource scope natively, making this workaround unnecessary.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
- Remove resource/postgresql processor (custom receiver sets service.name)
- Remove metric_statements from transform processor (info() only needs one of service.name or service.instance.id)
- Merge PostgreSQL into main metrics pipeline
- Remove transform from metrics pipeline (only needed for traces)

Signed-off-by: Arve Knudsen <[email protected]>
Promote service.name to metrics to work around a Prometheus bug where info() filtering doesn't work when the filter label already exists on the input metric. This fixes the APM dashboard, which filters by service_name.

Update PostgreSQL dashboard queries to filter by service_name directly on the metric (first argument to info()) rather than in the target_info filter (second argument), avoiding the same Prometheus bug.

TODO: Remove service.name promotion when upgrading to Prometheus >= v3.10.x.
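A sketch of the two query shapes involved (metric and label names are assumptions for illustration): filtering on the input metric sidesteps the bug, while filtering via info()'s second argument triggered it on affected Prometheus versions.

```promql
# Workaround form: service_name is promoted onto the metric itself, so the
# filter lives on info()'s first argument.
info(rate(postgresql_blocks_read_total{service_name="postgresql"}[5m]))

# Form that triggered the bug on affected versions: filter via the
# data-label selector (second argument).
# info(rate(postgresql_blocks_read_total[5m]), {service_name="postgresql"})
```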
Force-pushed from 9fd96c8 to 2b30a83
Switch to a custom Prometheus image that includes the fix for the info() function filtering bug. Remove the workarounds that were needed:

- Remove service.name, k8s.cluster.name, k8s.statefulset.name from promote_resource_attributes in prometheus-config.yaml
- Revert PostgreSQL dashboard queries to filter by service_name in the info() second argument instead of the first argument
- Update APM dashboard queries to filter by service_name in the info() second argument instead of the first argument

The info() function now correctly filters by labels in the second argument even when those labels exist on target_info.

Signed-off-by: Arve Knudsen <[email protected]>
Add Helm values overrides and deployment scripts for testing the experimental Prometheus info() function in Kubernetes environments:

- values-info-function.yaml: Custom Prometheus image (public) and OTLP config
- values-kind.yaml: NodePort service for Kind frontend access
- kind-config.yaml: Kind cluster config with frontend port mapping
- deploy-kind.sh: All-in-one script for Kind deployment
- deploy-info-function.sh: Generic Kubernetes deployment script

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Use ghcr.io/aknuds1/otelcontribcol:postgresreceiver-uuid-v0.143.0, which includes the PostgreSQL receiver fix for proper service.name resource attributes.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
product-catalog and flagd were getting OOMKilled with the default memory limits. Increase limits to prevent crashes:

- product-catalog: 100Mi (up from the 20Mi default)
- flagd: 500Mi (up from the default)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Delete conflicting APM and PostgreSQL dashboards from the Helm chart before deploying our custom info() function versions
- Increase memory limits for product-catalog (100Mi) and flagd (500Mi) to prevent OOMKill in Kind

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Configure the OTel Collector to set host.name from k8s.pod.name using an upsert action in the resource processor. This ensures services have their pod name as host_name instead of the collector's hostname.
- Set resourcedetection override: false to preserve existing attributes
- Disable the flagd-ui sidecar, which OOMKills even with a 1Gi memory limit

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Changes

Changing the demo to use the PromQL info() function instead of configuring Prometheus to promote resource attributes, for a more lightweight approach aligning with Prometheus recommendations. Bear in mind also that Prometheus might in the future store OTel resource attributes as native metadata; this PR would prepare for that, because info() should keep working. A hedged query sketch follows.
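As a rough illustration of the approach (metric and label names are assumptions, not the demo's exact queries), enrichment happens at query time instead of via promoted labels:

```promql
# Join a k8s resource attribute from target_info onto a service metric at
# query time; requires Prometheus started with
# --enable-feature=promql-experimental-functions.
info(
  rate(http_server_request_duration_seconds_count[5m]),
  {k8s_namespace_name="opentelemetry-demo"}
)
```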
Merge Requirements
For new feature contributions, please make sure you have completed the following essential items:

- CHANGELOG.md updated to document new feature additions

Maintainers will not merge until the above have been completed. If you're unsure which docs need to be changed, ping the @open-telemetry/demo-approvers.