Conversation

@ivoanjo ivoanjo commented Oct 31, 2025

Changes

External readers like the OpenTelemetry eBPF Profiler operate outside the instrumented process and cannot access resource attributes configured within OpenTelemetry SDKs. We propose a mechanism for OpenTelemetry SDKs to publish process-level resource attributes, through a standard format based on Linux anonymous memory mappings.

When an SDK initializes (or updates its resource attributes) it publishes this information to a small, fixed-size memory region that external processes can discover and read. The OTEL eBPF profiler will then, upon observing a previously-unseen process, probe and read this information, associating it with any profiling samples taken from a given process.
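
To give a rough idea of the writer side, here's an illustrative (non-normative) C sketch; the struct layout, field names, and helper are assumptions loosely based on the experimental demo implementation, not the final format:

```c
// Illustrative, non-normative sketch of the writer side.
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>
#include <sys/prctl.h>   // PR_SET_VMA / PR_SET_VMA_ANON_NAME need Linux >= 5.17 headers

typedef struct {
  char     signature[8];     // "OTEL_CTX", written last
  uint32_t version;          // format version
  uint64_t published_at_ns;  // publication timestamp, CLOCK_REALTIME nanoseconds
  uint32_t payload_size;     // size of the serialized Resource payload
  void    *payload;          // pointer to the serialized payload elsewhere in memory
} otel_process_ctx_header;

// Publishes a context; `payload` points at an already-serialized Resource message.
static int otel_process_ctx_publish(const void *payload, uint32_t payload_size) {
  // Anonymous mappings are zero-filled by the kernel, so readers can only ever
  // observe zeroes or a fully published header.
  otel_process_ctx_header *hdr =
      mmap(NULL, sizeof(*hdr), PROT_READ | PROT_WRITE,
           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (hdr == MAP_FAILED) return -1;

  struct timespec now;
  clock_gettime(CLOCK_REALTIME, &now);
  hdr->version         = 2;
  hdr->published_at_ns = (uint64_t)now.tv_sec * 1000000000ull + (uint64_t)now.tv_nsec;
  hdr->payload_size    = payload_size;
  hdr->payload         = (void *)payload;

  // The signature goes in last (a real implementation may want a barrier here),
  // so a partially written header never looks valid.
  memcpy(hdr->signature, "OTEL_CTX", sizeof(hdr->signature));

  // Freeze the mapping and name it so readers can find it in /proc/<pid>/maps.
  mprotect(hdr, sizeof(*hdr), PROT_READ);
  prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, hdr, sizeof(*hdr), "OTEL_CTX");
  return 0;
}
```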

Why open as draft: I'm opening this PR as a draft with the intention of sharing with the Profiling SIG for an extra round of feedback before asking for a wider review.

This OTEP is based on [Sharing Process-Level Resource Attributes with the OpenTelemetry eBPF Profiler](https://docs.google.com/document/d/1-4jo29vWBZZ0nKKAOG13uAQjRcARwmRc4P313LTbPOE/edit?tab=t.0); big thanks to everyone who provided feedback and helped refine the idea so far.


ivoanjo commented Nov 5, 2025

Marking as ready for review!

@ivoanjo ivoanjo marked this pull request as ready for review November 5, 2025 12:19
@ivoanjo ivoanjo requested review from a team as code owners November 5, 2025 12:19
@tsloughter
Member

So this would be a new requirement for eBPF profiler implementations?

My issue is the lack of safe support for Erlang/Elixir to do this, while something that could just be accessed as a file or socket wouldn't have that issue. We'd have to pull in a third-party library (or implement one ourselves) as a NIF to make these calls, and that brings in instability many would rather not have, when the goal of our SDK is to not be able to bring down a user's program if the SDK crashes -- unless they specifically configure it to do so.


ivoanjo commented Nov 6, 2025

So this would be a new requirement for eBPF profiler implementations?

No, a hard requirement should not be the goal: for starters, this is Linux-only (for now), so right out of the gate it's not going to be available everywhere.

Having this discussion is exactly why it was included as one of the open questions in the doc 👍


Our view is that we should go for "recommended to implement" and "recommended to enable by default".

In languages/runtimes where it's easy to do so (Go, Rust, Java 22+, possibly Ruby, ...etc?) we should be able to deliver this experience.

For others, such as Erlang/Elixir or Java 8-21 (which require a native library), the goal would be to make it very easy to enable/use for users that want it, but still optional so as to not impact anyone that is not interested.

We should probably record the above guidance in the OTEP, if/once we're happy with it 🤔

@carlosalberto
Contributor

cc @open-telemetry/specs-entities-approvers for extra eyes


External readers like the OpenTelemetry eBPF Profiler operate outside the instrumented process and cannot access resource attributes configured within OpenTelemetry SDKs. This creates several problems:

- **Missing cross-signal correlation identifiers**: Runtime-generated attributes ([`service.instance.id`](https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-instance-id) being a key example) are often inaccessible to external readers, making it hard to correlate profiles with other telemetry (such as traces and spans!) from the same service instance (especially in runtimes that employ multiple processes).
Contributor

Suggested change
- **Missing cross-signal correlation identifiers**: Runtime-generated attributes ([`service.instance.id`](https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-instance-id) being a key example) are often inaccessible to external readers, making it hard to correlate profiles with other telemetry (such as traces and spans!) from the same service instance (especially in runtimes that employ multiple processes).
- **Missing cross-signal correlation identifiers**: Runtime-generated attributes ([`service.instance.id`](https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-instance-id) being a key example) are often inaccessible to external readers, making it hard to correlate various signals with each other.

Author

What do you think about keeping the comment about the runtimes with multiple processes? I think that's one good use-case where it's especially hard to figure out what the multiple PIDs seen from the outside actually are.

Author

I've tweaked the description here in b1583c6

| Field | Type | Description |
|-------------------|-----------|----------------------------------------------------------------------|
| `signature` | `char[8]` | Set to `"OTEL_CTX"` when the payload is ready (written last) |
| `version` | `uint32` | Format version. Currently `2` (`1` was used for development) |
Contributor

Development versions should not matter at this point as this OTEP is the point of introduction. All previous work is just for experimentation.

Suggested change
| `version` | `uint32` | Format version. Currently `2` (`1` was used for development) |
| `version` | `uint32` | Format version. Currently `1`. |

Author

Starting at 2 would make it really easy to distinguish from the earlier experiments that we deployed in a lot of spots already...

Since there's space for uint32 different versions, do you see starting at 2 as a big blocker? (I can still remove the comment explaining what 1 was, I agree it's TMI)

Contributor

Starting at 2 is not a blocker to me. It just feels strange that this OTel protocol starts at 2.

Author

Yeah, it's slightly annoying that in most cases v0 is the development one, but in this case we are reserving 0 for "not filled in yet", which is why 1 ended up being the development version.


### Publication Protocol

Publishing the context should follow these steps:
Contributor

As context sharing also provides an opportunity for others: what is the idea for OSes other than Linux (or, more generally, OSes that don't have an mmap syscall)?

Author

For Windows, we've experimented at Datadog with using an in-memory file. For macOS it's a bit more nebulous: we can still use mmap, and maybe combine it with mach_vm_region to discover the region?

While this mechanism can be extended to other OSes in the future, our thinking so far was that since the eBPF profiler is Linux-only, the main focus should be on getting Linux support in really amazing shape and then extend as needed later.

8. **Set read-only**: Apply `mprotect(..., PROT_READ)` to mark the mapping as read-only
9. **Name mapping** (Linux ≥5.17): Use `prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ..., "OTEL_CTX")` to name the mapping

The signature MUST be written last to ensure readers never observe incomplete or invalid data. Once the signature is present and the mapping set to read-only, the entire mapping is considered valid and immutable.
Contributor

Would it simplify the publication protocol to require the writer to set published_at_ns to a time in the future, when writing the data is guaranteed to be finished?

Author

I don't think so. In theory a "malicious"/buggy/overloaded scheduler could always schedule out the thread after writing the timestamp and before it finished the rest of the steps...

One really nice property is that the pages are zeroed out by the kernel so it shouldn't be possible to observe anything else other than zeroes or valid data.
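
To illustrate the reader-side consequence, here's a rough, non-normative sketch of discovery plus validation; the header layout and names are illustrative assumptions based on the demo, not the final format:

```c
// Illustrative reader-side sketch: find the named mapping in /proc/<pid>/maps
// and accept it only once the full signature is visible.
#define _GNU_SOURCE
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

typedef struct {
  char     signature[8];
  uint32_t version;
  uint64_t published_at_ns;
  uint32_t payload_size;
  uint64_t payload_addr;   // remote pointer, read as a plain 64-bit value
} otel_process_ctx_header;

// Returns 1 and fills *out and *addr when a valid header was found, 0 otherwise.
static int otel_process_ctx_read(pid_t pid, otel_process_ctx_header *out, uint64_t *addr) {
  char path[64], line[512];
  uint64_t start = 0;

  // Locate the mapping named via PR_SET_VMA_ANON_NAME ("[anon:OTEL_CTX]").
  snprintf(path, sizeof(path), "/proc/%d/maps", (int)pid);
  FILE *maps = fopen(path, "r");
  if (!maps) return 0;
  while (fgets(line, sizeof(line), maps)) {
    if (strstr(line, "[anon:OTEL_CTX]")) { sscanf(line, "%" SCNx64, &start); break; }
  }
  fclose(maps);
  if (start == 0) return 0;

  // Copy the header out of the target process.
  struct iovec local  = { .iov_base = out,                      .iov_len = sizeof(*out) };
  struct iovec remote = { .iov_base = (void *)(uintptr_t)start, .iov_len = sizeof(*out) };
  if (process_vm_readv(pid, &local, 1, &remote, 1, 0) != (ssize_t)sizeof(*out)) return 0;

  // Pages start out zeroed, so anything other than the complete signature means
  // "no valid context here" -- either not published yet, or already replaced.
  if (memcmp(out->signature, "OTEL_CTX", 8) != 0) return 0;
  *addr = start;
  return 1;
}
```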

Co-authored-by: Florian Lehner <florianl@users.noreply.github.com>

When an SDK initializes (or updates its resource attributes) it publishes this information to a small, fixed-size memory region that external processes can discover and read.

The OTEL eBPF profiler will then, upon observing a previously-unseen process, probe and read this information, associating it with any profiling samples taken from a given process.
Member

Could you please describe how it would/could/(or won't) work when an application is instrumented with OBI (https://github.com/open-telemetry/opentelemetry-ebpf-instrumentation)?

Author

Thanks for this question!

I researched this and my conclusion is that right now this won't work with OBI.

From what I'm seeing, while it's possible for eBPF programs to write into userspace using bpf_probe_write_user (and this is already used by OBI to support Go tracing), I don't see a way to do the other things listed in the publication protocol, such as allocating (small amounts of) memory, or invoking the system calls to set up the naming and the inheritance permissions.

That said, I don't think this would necessarily be a blocker for OBI-to-OTEL eBPF Profiler communication, since we could introduce a specific out-of-band channel between them using the existing kernel eBPF primitives; but given the current limitations of eBPF I don't think we can get OBI to implement this specification on behalf of an instrumented application.

Member

Can you please document it in the OTEP?

Author

Added in 9c8d9ed

Following discussion so far, we can probably avoid having our home-grown
`OtelProcessCtx` and instead use the common OTEL `Resource` message.
ivoanjo added a commit to ivoanjo/sig-profiling that referenced this pull request Dec 1, 2025
This PR adds an experimental C/C++ implementation for the "Process
Context" OTEP being proposed in
open-telemetry/opentelemetry-specification#4719

This implementation previously lived in
https://github.com/ivoanjo/proc-level-demo/tree/main/anonmapping-clib
and as discussed during the OTEL profiling SIG meeting we want to add
it to this repository so it becomes easier to find and contribute to.

I've made sure to include a README explaining how to use it. Here's
the ultra-quick start (Linux-only):

```bash
$ ./build.sh
$ ./build/example_ctx --keep-running
Published: service=my-service, instance=123d8444-2c7e-46e3-89f6-6217880f7123, env=prod, version=4.5.6, sdk=example_ctx.c/c/1.2.3, resources=resource.key1=resource.value1,resource.key2=resource.value2
Continuing forever, to exit press ctrl+c...
TIP: You can now `sudo ./otel_process_ctx_dump.sh 267023` to see the context

 # In another shell
$ sudo ./otel_process_ctx_dump.sh 267023 # Update this to match the PID from above
Found OTEL context for PID 267023
Start address: 756f28ce1000
00000000  4f 54 45 4c 5f 43 54 58  02 00 00 00 0b 68 55 47  |OTEL_CTX.....hUG|
00000010  70 24 7d 18 50 01 00 00  a0 82 6d 7e 6a 5f 00 00  |p$}.P.....m~j_..|
00000020
Parsed struct:
  otel_process_ctx_signature       : "OTEL_CTX"
  otel_process_ctx_version         : 2
  otel_process_ctx_published_at_ns : 1764606693650819083 (2025-12-01 16:31:33 GMT)
  otel_process_payload_size        : 336
  otel_process_payload             : 0x00005f6a7e6d82a0
Payload dump (336 bytes):
00000000  0a 25 0a 1b 64 65 70 6c  6f 79 6d 65 6e 74 2e 65  |.%..deployment.e|
00000010  6e 76 69 72 6f 6e 6d 65  6e 74 2e 6e 61 6d 65 12  |nvironment.name.|
...
Protobuf decode:
attributes {
  key: "deployment.environment.name"
  value {
    string_value: "prod"
  }
}
attributes {
  key: "service.instance.id"
  value {
    string_value: "123d8444-2c7e-46e3-89f6-6217880f7123"
  }
}
attributes {
  key: "service.name"
  value {
    string_value: "my-service"
  }
}
...
```

Note that because the upstream OTEP is still under discussion, this
implementation is experimental and may need changes to match up with
the final version of the OTEP.
As pointed out during review, these don't necessarily exist for some
resources so let's streamline the spec for now.
option go_package = "go.opentelemetry.io/proto/otlp/resource/v1";
// Resource information.
message Resource {
Contributor

Sorry for the late question - but this just popped into my mind:

What is the idea, going forward, of using message Resource for sharing thread state information or more process internals?

Iirc this approach should also be used later on to provide more information about process internals. But Resource.attributes only holds information covered by OTel Semantic Convention.

Author

What is the idea, going forward, of using message Resource for sharing thread state information or more process internals?

I suspect protobuf will be a bit too heavy/awkward for the thread state payload format BUT my thinking is that anything we put there should otherwise map to/from attributes.

But Resource.attributes only holds information covered by OTel Semantic Convention.

Actually I don't think that's the case? I've seen a lot of prior art for custom attributes, so anything we don't think should end up in semantic conventions could stay as a custom attribute. I think? 👀


- **Inconsistent resource attributes across signals**: Because they are resolved in different scopes, configuration such as `service.name`, `deployment.environment.name`, and `service.version` is not always available or resolved consistently between the OpenTelemetry SDKs and external readers, leading to configuration drift and inconsistent tagging.

- **Correlation is dependent on process activity**: If a service is blocked (such as when doing slow I/O, or threads are actually deadlocked) and not emitting other signals, external readers have difficulty identifying it, since resource attributes or identifiers are only sent along when signals are reported.
Member

Is this relevant to the issue we're trying to solve with this OTEP? Meaning, isn't this problem still going to exist with the eBPF profiler even if we adopt the proposed mechanism? Maybe add a clarification that for the eBPF profiler this behavior is unaffected by the proposed mechanism?

(I don't think we should remove it as it's contextual information but as it's currently listed in Motivation there's room for misunderstanding)

If there's something else you had in mind re: different external reader, feel free to clarify.

Author

The thinking behind this point is two-fold:

  1. off-cpu/wall-time profiling

    My thinking is that, since the OTEL eBPF profiler already supports off-cpu profiles, we would add support for including the process context for such samples as well.

    +1 that indeed "can read even when there's no activity" would not impact CPU profiling, since CPU profiling is only concerned about activity.

    If, in the future, wall-time profiling (e.g. a combination of on-cpu and off-cpu) was added to the OTEL eBPF profiler, that would be another use-case for this mechanism.

  2. non-reliance on mechanisms that require activity from the application

    If we were to try to solve the process context problem with an approach of having the application call something from time to time (or once/a few times, after handshaking with the reader), such a solution would be fragile in the presence of applications that are blocked/stuck, or that for some reason stop performing those calls.

    The current solution is not affected by this since the process context setup is intended to be performed once at application start, in a fire-and-forget way, independently of what the reader is doing.


Publishing the context should follow these steps:

1. **Drop existing mapping**: If a previous context was published, unmap/free it
Member

@christos68k christos68k Dec 5, 2025

Do we need to drop the existing mapping? If we keep it fixed, the reader may cache the address for the target process which simplifies checking if the data has been updated (no overhead of re-parsing mappings, this can also help with higher-frequency updates).

Since the payload pointer can point to anywhere in target process memory, we'll never be limited by the two pages fixed mapping size (meaning we don't need to grow this mapping to span more pages either during process runtime or in the future).

Author

Do we need to drop the existing mapping?

Not strictly?

For the existing approach, it's possible to avoid polling mappings to figure out the address by:

  • Checking that published_at_ns can be read and hasn't changed and/or
  • Hooking on prctl calls

Reusing the mapping instead of dropping it does not conflict with the above approaches, but... I think it would complicate concurrency control on the reader. That is, having this invariant allows the reader to know that while the mapping is up, the payload is valid and consistent as far as the writer is concerned.

Member

@christos68k christos68k Dec 8, 2025

If we allow the mapping address to change, then checking published_at_ns is not reliable with the existing protocol (for example, we'd need to overwrite with zeroes).

But the point I'm making is more general: the current update protocol mentions that the "previous mapping should be removed" before publishing a new one. If we assume that most implementors abide by this, then the overhead of parsing mappings will be there. For a reader like the eBPF profiler that may have to manage hundreds of processes as a worst case, the overhead of constantly hitting /proc could be significant.

Can we examine the concurrency control edge cases in more detail? It should be possible to provide the same guarantees as now while keeping a fixed mapping.

We should avoid relying on hooking prctl IMO (it also doesn't solve the constant /proc access problem if most implementors change the mapping on every update)

Author

@ivoanjo ivoanjo Dec 8, 2025

If we allow the mapping address to change, then checking published_at_ns is not reliable with the existing protocol (for example, we'd need to overwrite with zeroes).

I think it is! Let me try to convince you ;)

After a context gets dropped one of two things happens:

a) The mapping becomes invalid. This would make reads return an error, which would be a clear indication of not valid.

b) A new mapping (otel or not) gets put in its place. Reads from the old location of published_at_ns would return whatever's there now. Note that this would not be published_at_ns, because the kernel zeroes out memory before mapping it (e.g. this is not regular malloc/free), and thus I don't think it's possible for leftover garbage to exist to confuse the reader. (Edit: And thus the reader will know what it read is not valid)

For a reader like eBPF profiler that may have to manage hundreds of processes as a worst case, that overhead of constantly hitting /proc could be significant.

The strategy above means that once a process context is detected, we can continue to detect its presence by reading the same address in a cheap way, even without hooking prctl.

(The reader may even choose to do time-based caching, e.g. read the context and reuse it for the next N seconds/minutes, rather than trying to always have the latest up-to-date version if it wants to even save more reads)

We should avoid relying on hooking prctl IMO (it also doesn't solve the constant /proc access problem if most implementors change the mapping on every update)

To be clear, I believe prctl is not needed at all to be able to follow invalidation of existing contexts/creation of new ones; it's a fully optional possibility.

We could even completely omit references to hooking on prctl in the current spec -- but I think it's an interesting feature to document in the spec for readers that want to use it.
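
To make the "cheap re-check" above concrete, here's a rough sketch of what I have in mind on the reader side (illustrative only; the struct layout and the read_ctx_header_at helper are assumptions, not something the spec would mandate):

```c
// Sketch only: cheap "did anything change?" check at a cached mapping address,
// with no /proc/<pid>/maps access. `read_ctx_header_at` is a hypothetical helper
// that uses process_vm_readv to copy the header struct from `cached_addr`.
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

typedef struct {
  char     signature[8];
  uint32_t version;
  uint64_t published_at_ns;
  uint32_t payload_size;
  uint64_t payload_addr;
} otel_process_ctx_header;

bool read_ctx_header_at(pid_t pid, uint64_t addr, otel_process_ctx_header *out);

static bool otel_process_ctx_still_current(pid_t pid, uint64_t cached_addr,
                                           uint64_t last_published_at_ns) {
  otel_process_ctx_header hdr;

  // One remote read covers the interesting cases:
  //  * the read fails        -> the mapping was dropped (context unpublished)
  //  * the signature is gone -> the mapping was replaced; the kernel zero-fills
  //                             new anonymous pages, so we see zeroes, not garbage
  //  * the timestamp changed -> a new context was published
  // In any of those cases the reader falls back to re-scanning /proc/<pid>/maps;
  // otherwise it keeps using its cached copy of the attributes.
  return read_ctx_header_at(pid, cached_addr, &hdr)
      && memcmp(hdr.signature, "OTEL_CTX", 8) == 0
      && hdr.published_at_ns == last_published_at_ns;
}
```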

Member

@christos68k christos68k Dec 8, 2025

The strategy above means that once a process context is detected, we can continue to detect its presence by reading the same address in a cheap way, even without hooking prctl.

True, that speeds up detection, but the overhead of parsing mappings before fetching an update is still there. It also makes for a more complicated update protocol, maybe limiting update operation frequency.

Advantages for keeping the mapping fixed:

  1. Simpler publisher logic
  2. Simpler reader logic
  3. Minimal /proc/ access and parsing overhead (non-existent after the mapping is first detected by the reader)
  4. Scales to thousands of processes
  5. Scales to higher frequency updates, minimizing possibility of stale data

Can we clarify the disadvantages?

Author

Wait, so to clarify, your concern is around the overhead of updates to the process context.

I can agree that if the writer keeps dropping the context and creating a new one, then the reader needs to do a bit more work to keep up than just "re-read the payload".

...maybe limiting update operation frequency.

Yet I'm a bit puzzled by this part of the premise: the kind of attributes we ship here are expected to change rarely, if ever. That's why my answers above focused on "is it cheap to check it hasn't changed" vs "is it cheap to handle when it has changed"...

Making updates is supported, but it's not intended to be a frequently used part of the spec, which is why I was suggesting even caching it for seconds/minutes.

Can you share more of your thoughts around why an application would change this info often?

Member

@christos68k christos68k Dec 8, 2025

The kind of attributes we ship here are expected to change rarely, if ever.

Maybe that's the case, but there is currently no mention of expected frequency in the OTEP. It's not uncommon for a general data transmission mechanism to be used in a broader way than originally conceived. My main goal with all these comments is to clarify and understand the trade-offs. If we're going to specify a less flexible, more complicated and possibly less performant mechanism, let's try to understand what we're gaining in return.

What's a strong argument for recreating the mapping on each update?
