
Conversation


aknuds1 commented Jan 8, 2026

Changes

Changing the demo to use the PromQL info function instead of configuring Prometheus to promote resource attributes, for a more lightweight approach that aligns with Prometheus recommendations.

Bear in mind also that Prometheus might in the future store OTel resource attributes as native metadata; this PR would prepare for that, because info should keep working.
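
For context, a dashboard query using the info function looks roughly like the sketch below. This is illustrative only; the metric and label names are examples rather than the exact ones used in the demo dashboards. info() joins the input series with target_info on the identifying job and instance labels and adds the requested data labels:

info(
  rate(http_server_request_duration_seconds_count[2m]),
  {k8s_namespace_name="otel-demo"}
)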

Merge Requirements

For new feature contributions, please make sure you have completed the following
essential items:

  • CHANGELOG.md updated to document new feature additions
  • Appropriate documentation updates in the docs
  • Appropriate Helm chart updates in the helm-charts

Maintainers will not merge until the above have been completed. If you're unsure
which docs need to be changed ping the
@open-telemetry/demo-approvers.

github-actions bot added the helm-update-required ("Requires an update to the Helm chart when released") label on Jan 8, 2026
aknuds1 force-pushed the arve/prometheus-metadata branch 4 times, most recently from 2cb5b84 to 93d3951 on January 8, 2026 at 10:58
aknuds1 changed the title from "WIP: Use PromQL info function instead of resource attribute promotion" to "Use PromQL info function instead of resource attribute promotion" on Jan 8, 2026
- OTEL_EXPORTER_OTLP_ENDPOINT
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_RESOURCE_ATTRIBUTES
- OTEL_RESOURCE_ATTRIBUTES=${OTEL_RESOURCE_ATTRIBUTES},service.instance.id=checkout
Member

service.instance.id is expected to be generated by SDKs or derived from the K8s environment; moreover, it should be a GUID.
Specs: https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-instance-id
K8s naming specs: https://opentelemetry.io/docs/specs/semconv/non-normative/k8s-attributes/

Comment on lines 155 to 162
resource/postgresql:
attributes:
- key: service.name
value: postgresql
action: upsert
- key: service.instance.id
value: ${env:POSTGRES_HOST}
action: upsert
Member

Reading the service.name specs here: we could try to broaden the requirements in the specs to also define service.name for infrastructure monitoring use cases, and convince OTel Collector receiver maintainers to adopt this, but today no infrastructure monitoring receiver produces service.name or service.instance.id.

- context: resource
statements:
# Set service.instance.id to service.name if not already set (needed for Prometheus info() joins)
- set(attributes["service.instance.id"], attributes["service.name"]) where attributes["service.instance.id"] == nil and attributes["service.name"] != nil
Member

We would have collisions if the same service type (e.g. a Redis) is running multiple times. For infra monitoring metrics, we commonly use attributes like host.name... to differentiate the instances.

@@ -0,0 +1,136 @@
# CLAUDE.md

not sure you want to commit this file?

Author

Good point @jmichalek132 - it's useful while working out the PR, though; I would remove it before finalizing the PR.

@jmichalek132

Overall looks good outside of what @cyrille-leclerc already pointed out. Did you test this locally? Given the many changes to the queries (even if just re-formatting), do all of the panels still show metrics? It would potentially be nice to show screenshots of it.


aknuds1 commented Jan 8, 2026

Did you test this locally? Given the many changes to the queries (even if just re-formatting), do all of the panels still show metrics?

@jmichalek132 I did some simple testing locally, but I don't know the demo well, so I'm not a very effective tester :/ Do you know the demo well enough to look for discrepancies?

I did fix the bugs I could find by checking the APM and PostgreSQL dashboards on Docker Compose.

aknuds1 force-pushed the arve/prometheus-metadata branch from 93d3951 to 43cdf83 on January 8, 2026 at 16:37

aknuds1 commented Jan 8, 2026

As discussed offline with @cyrille-leclerc, it might be better to implement instance label synthesis in Prometheus' OTLP endpoint, based on user configuration, instead of in the OTel Collector config (since the collector-side configuration would be a hurdle for users).

Comment on lines 160 to 173
transform/postgresql:
error_mode: ignore
metric_statements:
- context: resource
statements:
# Construct unique service.instance.id based on PostgreSQL resource scope.
# The PostgreSQL receiver sets postgresql.database.name, postgresql.table.name,
# postgresql.index.name as resource attributes, creating multiple target_info
# entries with the same identifying labels. By including these in service.instance.id,
# each scope gets a unique target_info, allowing info() to work correctly.
- set(attributes["service.instance.id"], Concat([attributes["service.name"], "/", attributes["postgresql.database.name"], "/", attributes["postgresql.table.name"], "/", attributes["postgresql.index.name"]], "")) where attributes["postgresql.index.name"] != nil
- set(attributes["service.instance.id"], Concat([attributes["service.name"], "/", attributes["postgresql.database.name"], "/", attributes["postgresql.table.name"]], "")) where attributes["postgresql.table.name"] != nil and attributes["postgresql.index.name"] == nil
- set(attributes["service.instance.id"], Concat([attributes["service.name"], "/", attributes["postgresql.database.name"]], "")) where attributes["postgresql.database.name"] != nil and attributes["postgresql.table.name"] == nil
- set(attributes["service.instance.id"], attributes["service.name"]) where attributes["postgresql.database.name"] == nil

After talking with @aknuds1, we noticed that I had a similar idea, though not using service.instance.id (or instance on the Prometheus side) but a different attribute, resource.uid.

One advantage of using service.instance.id is that it works out of the box with older Prometheus, but it adds additional translation, which might create issues. For instance, the translation may be surprising to a user, and it's not clear what will happen once the entity data model is supported. Should the user remove this translation and break the UI?

Using a different attribute has the advantage that, while not changing the status quo, it will work better with the introduction of the entity data model and is also a bit clearer. service.instance.id is not always enough; sometimes it also requires service.name and service.namespace (i.e., job).

Author

We will probably drop this client-side synthesis, though, in favour of something on the Prometheus OTLP side instead.


OK, and presumably the synthesis wouldn't be changing instance, but another label, correct?

Author

@ldufr The synthesis should generate the instance label - why not? If the OTLP endpoint cannot generate target_info because the identifying resource attribute triplet (service.namespace, service.name, service.instance.id) isn't present, the idea is to have a potential fallback for determining another identifying resource attribute subset, from which to generate target_info plus the job and instance labels.
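
To make the mapping concrete, here is a schematic (not actual data) of how Prometheus' OTLP endpoint currently derives target_info from the identifying triplet: job combines service.namespace and service.name (just service.name when the namespace is absent), instance is service.instance.id, and the remaining, non-identifying resource attributes become data labels:

target_info{
  job="<service.namespace>/<service.name>",
  instance="<service.instance.id>",
  k8s_namespace_name="...",
  host_name="..."
}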

aknuds1 force-pushed the arve/prometheus-metadata branch 4 times, most recently from 7210765 to a8a33a0 on January 12, 2026 at 17:11
aknuds1 and others added 7 commits January 13, 2026 08:51
Add a dedicated pipeline for PostgreSQL metrics with a resource
processor that sets service.name and service.instance.id. This
ensures Prometheus generates target_info for PostgreSQL metrics,
enabling the info() function to work correctly.

Signed-off-by: Arve Knudsen <[email protected]>
Replace info() with explicit group_left + max() joins to handle
duplicate target_info series from the PostgreSQL receiver. The
PostgreSQL receiver creates multiple target_info entries (one per
database/table/index scope), causing info() to fail with "duplicate
series for info metric" error.

The max() aggregation collapses duplicate target_info series while
preserving all k8s and host variable filters for K8s compatibility.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
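
For reference, the explicit join described in this commit looks roughly like the following; the metric and data-label names are illustrative rather than copied from the dashboards. The max by (...) aggregation collapses target_info duplicates that differ only in non-identifying PostgreSQL scope labels, keeping the many-to-one join valid:

rate(postgresql_commits_total[2m])
* on (job, instance) group_left (k8s_namespace_name, k8s_statefulset_name)
  max by (job, instance, k8s_namespace_name, k8s_statefulset_name) (target_info)
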
Update PostgreSQL dashboard to use the experimental Prometheus info()
function for enriching metrics with resource attributes from target_info.

Changes:
- Add transform/postgresql processor to generate unique service.instance.id
  per PostgreSQL resource scope (database, table, index), fixing duplicate
  target_info entries
- Promote k8s.cluster.name and k8s.statefulset.name to metric labels to
  work around Prometheus info() filtering bug when filter labels exist
  on input metric
- Simplify dashboard queries from verbose group_left + max() to cleaner
  info() function calls

This approach works for both Kubernetes and Docker Compose deployments.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
Use a custom collector image that generates unique service.instance.id
per PostgreSQL resource scope to fix duplicate target_info entries.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The custom collector image now generates unique service.instance.id
per PostgreSQL resource scope natively, making this workaround unnecessary.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Arve Knudsen <[email protected]>
- Remove resource/postgresql processor (custom receiver sets service.name)
- Remove metric_statements from transform processor (info() only needs
  one of service.name or service.instance.id)
- Merge PostgreSQL into main metrics pipeline
- Remove transform from metrics pipeline (only needed for traces)

Signed-off-by: Arve Knudsen <[email protected]>
Promote service.name to metrics to work around a Prometheus bug where
info() filtering doesn't work when the filter label already exists on
the input metric. This fixes the APM dashboard which filters by
service_name.

Update PostgreSQL dashboard queries to filter by service_name directly
on the metric (first argument to info()) rather than in the target_info
filter (second argument), avoiding the same Prometheus bug.

TODO: Remove service.name promotion when upgrading to Prometheus >= v3.10.x.
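
Concretely, the workaround amounts to moving the service_name matcher onto the input metric rather than into the target_info filter; a sketch with an illustrative metric name:

# Workaround: filter on the (promoted) service_name label of the input metric.
info(rate(db_client_operation_duration_seconds_count{service_name="postgresql"}[2m]))

# Preferred form once running a Prometheus with the info() filtering fix
# (as the later commits revert to): filter via the second argument.
info(
  rate(db_client_operation_duration_seconds_count[2m]),
  {service_name="postgresql"}
)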
aknuds1 force-pushed the arve/prometheus-metadata branch from 9fd96c8 to 2b30a83 on January 13, 2026 at 07:57
aknuds1 and others added 4 commits January 13, 2026 11:12
Switch to custom Prometheus image that includes the fix for the info()
function filtering bug. Remove the workarounds that were needed:

- Remove service.name, k8s.cluster.name, k8s.statefulset.name from
  promote_resource_attributes in prometheus-config.yaml
- Revert PostgreSQL dashboard queries to filter by service_name in
  the info() second argument instead of the first argument
- Update APM dashboard queries to filter by service_name in the
  info() second argument instead of the first argument

The info() function now correctly filters by labels in the second
argument even when those labels exist on target_info.

Signed-off-by: Arve Knudsen <[email protected]>
Add Helm values overrides and deployment scripts for testing the
experimental Prometheus info() function in Kubernetes environments:

- values-info-function.yaml: Custom Prometheus image (public) and OTLP config
- values-kind.yaml: NodePort service for Kind frontend access
- kind-config.yaml: Kind cluster config with frontend port mapping
- deploy-kind.sh: All-in-one script for Kind deployment
- deploy-info-function.sh: Generic Kubernetes deployment script

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Use ghcr.io/aknuds1/otelcontribcol:postgresreceiver-uuid-v0.143.0
which includes the PostgreSQL receiver fix for proper service.name
resource attributes.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
product-catalog and flagd were getting OOMKilled with the default
memory limits. Increase limits to prevent crashes:
- product-catalog: 100Mi (up from 20Mi default)
- flagd: 500Mi (up from default)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
aknuds1 and others added 2 commits January 13, 2026 16:43
- Delete conflicting APM and PostgreSQL dashboards from Helm chart before
  deploying our custom info() function versions
- Increase memory limits for product-catalog (100Mi) and flagd (500Mi)
  to prevent OOMKill in Kind

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Configure OTel Collector to set host.name from k8s.pod.name using
  upsert action in the resource processor. This ensures services have
  their pod name as host_name instead of the collector's hostname.
- Set resourcedetection override: false to preserve existing attributes
- Disable flagd-ui sidecar which OOMKills even with 1Gi memory limit

Co-Authored-By: Claude Opus 4.5 <[email protected]>