OTEP: Process Context: Sharing Resource Attributes with External Readers #4719
Conversation
This OTEP introduces a standard mechanism for OpenTelemetry SDKs to publish process-level resource attributes for access by out-of-process readers such as the OpenTelemetry eBPF Profiler.

External readers like the OpenTelemetry eBPF Profiler operate outside the instrumented process and cannot access resource attributes configured within OpenTelemetry SDKs. We propose a mechanism for OpenTelemetry SDKs to publish process-level resource attributes, using a standard format based on Linux anonymous memory mappings. When an SDK initializes (or updates its resource attributes), it publishes this information to a small, fixed-size memory region that external processes can discover and read. The OTEL eBPF profiler will then, upon observing a previously-unseen process, probe and read this information, associating it with any profiling samples taken from that process.

_I'm opening this PR as a draft with the intention of sharing it with the Profiling SIG for an extra round of feedback before asking for a wider review._

_This OTEP is based on [Sharing Process-Level Resource Attributes with the OpenTelemetry eBPF Profiler](https://docs.google.com/document/d/1-4jo29vWBZZ0nKKAOG13uAQjRcARwmRc4P313LTbPOE/edit?tab=t.0); big thanks to everyone who provided feedback and helped refine the idea so far._
Marking as ready for review!
So this would be a new requirement for eBPF profiler implementations? My issue is the lack of safe support for Erlang/Elixir to do this, while something that could just be accessed as a file or socket wouldn't have that issue. We'd have to pull in a third-party library (or implement one ourselves) that is a NIF to make these calls, and that brings in instability many would rather not have, when the goal of our SDK is to not be able to bring down a user's program if the SDK crashes -- unless they specifically configure it to do so.
No, a hard requirement should not be the goal: for starters, this is Linux-only (for now), so right out of the gate it's not going to be available everywhere. Having this discussion is exactly why it was included as one of the open questions in the doc 👍 Our view is that we should go for recommended to implement and recommended to enable by default. In languages/runtimes where it's easy to do so (Go, Rust, Java 22+, possibly Ruby, ...etc?) we should be able to deliver this experience. For others, such as Erlang/Elixir or Java 8-21 (which requires a native library, similar to Erlang/Elixir), the goal would be to make it very easy to enable/use for users that want it, but still optional so as to not impact anyone that is not interested. We should probably record the above guidance in the OTEP, if/once we're happy with it 🤔
cc @open-telemetry/specs-entities-approvers for extra eyes
This PR was marked stale due to lack of activity. It will be closed in 7 days.
oteps/profiles/4719-process-ctx.md
Outdated
> External readers like the OpenTelemetry eBPF Profiler operate outside the instrumented process and cannot access resource attributes configured within OpenTelemetry SDKs. This creates several problems:
>
> - **Missing cross-signal correlation identifiers**: Runtime-generated attributes ([`service.instance.id`](https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-instance-id) being a key example) are often inaccessible to external readers, making it hard to correlate profiles with other telemetry (such as traces and spans!) from the same service instance (especially in runtimes that employ multiple processes).
Suggested change:

> - **Missing cross-signal correlation identifiers**: Runtime-generated attributes ([`service.instance.id`](https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-instance-id) being a key example) are often inaccessible to external readers, making it hard to correlate various signals with each other.
What do you think about keeping the comment about the runtimes with multiple processes? I think that's one good use-case where it's especially hard to map what the multiple PIDs seen from the outside actually are.
I've tweaked the description here in b1583c6
> | Field | Type | Description |
> |-------------------|-----------|----------------------------------------------------------------------|
> | `signature` | `char[8]` | Set to `"OTEL_CTX"` when the payload is ready (written last) |
> | `version` | `uint32` | Format version. Currently `2` (`1` was used for development) |
Development versions should not matter at this point as this OTEP is the point of introduction. All previous work is just for experimentation.
Suggested change:

> | `version` | `uint32` | Format version. Currently `1`. |
Starting at 2 would make it really easy to distinguish from the earlier experiments that we deployed in a lot of spots already...
Since there's space for uint32 different versions, do you see starting at 2 as a big blocker? (I can still remove the comment explaining what 1 was, I agree it's TMI)
Starting at 2 is not a blocker to me. It just feels strange that this OTel protocol starts at 2.
Yeah, it's slightly annoying that in most cases v0 is the development one, but in this case we are reserving 0 for "not filled in yet", which is why 1 ended up being the development version.
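For illustration, here's a rough C sketch of what the fixed header could look like. Only `signature` and `version` appear in the table quoted above; the remaining fields, their names, and the packed layout are assumptions taken from the example dump of the experimental C implementation later in this thread, not normative spec text.

```c
#include <stdint.h>

/* Hypothetical 32-byte header layout; only `signature` and `version` come
 * from the field table above, the rest is inferred from the example dump
 * later in this thread. A packed layout (no padding after `version`) is
 * assumed. */
typedef struct __attribute__((packed)) {
    char     signature[8];    /* "OTEL_CTX", written last; zeroes = not published yet */
    uint32_t version;         /* format version; 0 is reserved for "not filled in yet" */
    uint64_t published_at_ns; /* publication time, nanoseconds since the Unix epoch    */
    uint32_t payload_size;    /* size in bytes of the serialized payload               */
    uint64_t payload;         /* address of the payload within the target process      */
} otel_process_ctx_header;
```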
> ### Publication Protocol
>
> Publishing the context should follow these steps:
As context sharing also provides an opportunity for others: what is the idea for operating systems other than Linux (or, more generally, OSes that don't have an mmap syscall)?
For Windows, we've experimented at Datadog with using an in-memory file. For macOS it's a bit more nebulous: we can still use mmap, and maybe combine it with mach_vm_region to discover the region?
While this mechanism can be extended to other OSes in the future, our thinking so far was that since the eBPF profiler is Linux-only, the main focus should be on getting Linux support in really amazing shape and then extending later as needed.
> 8. **Set read-only**: Apply `mprotect(..., PROT_READ)` to mark the mapping as read-only
> 9. **Name mapping** (Linux ≥5.17): Use `prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ..., "OTEL_CTX")` to name the mapping
>
> The signature MUST be written last to ensure readers never observe incomplete or invalid data. Once the signature is present and the mapping set to read-only, the entire mapping is considered valid and immutable.
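To make the write ordering concrete, here's a rough, non-normative C sketch of a publisher following the quoted steps. It assumes the hypothetical header layout sketched earlier, a two-page mapping, `CLOCK_REALTIME` for the timestamp, and version `2`; error handling and the earlier protocol steps (dropping a previous mapping, serializing the payload) are omitted.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <time.h>

#ifndef PR_SET_VMA              /* fallbacks for older headers (values from linux/prctl.h) */
#define PR_SET_VMA 0x53564d41
#define PR_SET_VMA_ANON_NAME 0
#endif

typedef struct __attribute__((packed)) {  /* hypothetical layout, see earlier sketch */
    char signature[8]; uint32_t version; uint64_t published_at_ns;
    uint32_t payload_size; uint64_t payload;
} otel_process_ctx_header;

#define OTEL_CTX_MAPPING_SIZE (2 * 4096)

/* Sketch of publishing the context; error handling omitted. */
static otel_process_ctx_header *otel_ctx_publish(const void *payload, uint32_t payload_size) {
    /* Anonymous private mapping; the kernel hands it out zero-filled, so readers
     * can only ever observe all-zeroes or fully published data. */
    otel_process_ctx_header *hdr =
        mmap(NULL, OTEL_CTX_MAPPING_SIZE, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct timespec now;
    clock_gettime(CLOCK_REALTIME, &now);

    hdr->version         = 2;  /* per the field table above (still under discussion) */
    hdr->published_at_ns = (uint64_t)now.tv_sec * 1000000000ull + (uint64_t)now.tv_nsec;
    hdr->payload_size    = payload_size;
    hdr->payload         = (uint64_t)(uintptr_t)payload;

    /* The signature MUST be written last; a release fence keeps the compiler/CPU
     * from reordering it before the field writes above. */
    atomic_thread_fence(memory_order_release);
    memcpy(hdr->signature, "OTEL_CTX", sizeof(hdr->signature));

    /* 8. Set read-only */
    mprotect(hdr, OTEL_CTX_MAPPING_SIZE, PROT_READ);

    /* 9. Name the mapping (Linux >= 5.17) so it shows up in /proc/<pid>/maps
     *    as "[anon:OTEL_CTX]" for external readers to discover. */
    prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, (unsigned long)hdr, OTEL_CTX_MAPPING_SIZE, "OTEL_CTX");

    return hdr;
}
```

The payload itself can live anywhere in the process (the header only stores its address and size), which is what keeps the mapping at a small fixed size.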
Would it simplify the publication protocol to require the writer to set published_at_ns to a time in the future, when writing the data is guaranteed to be finished?
I don't think so. In theory a "malicious"/buggy/overloaded scheduler could always schedule out the thread after writing the timestamp and before it finished the rest of the steps...
One really nice property is that the pages are zeroed out by the kernel so it shouldn't be possible to observe anything else other than zeroes or valid data.
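On the reader side, here's a minimal sketch of how an external process might discover and validate the context, assuming the mapping was named as in step 9 above (so it appears as `[anon:OTEL_CTX]` in `/proc/<pid>/maps`) and the hypothetical header layout from the earlier sketch. The zeroed-or-complete property discussed above is what makes a simple signature check sufficient. Note this requires ptrace-level access to the target process.

```c
#define _GNU_SOURCE             /* for process_vm_readv */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

typedef struct __attribute__((packed)) {  /* hypothetical layout, see earlier sketch */
    char signature[8]; uint32_t version; uint64_t published_at_ns;
    uint32_t payload_size; uint64_t payload;
} otel_process_ctx_header;

/* Find the named mapping in /proc/<pid>/maps and read the header out of the
 * target process. Because the mapping starts out zero-filled and the signature
 * is written last, observing "OTEL_CTX" implies the header is complete. */
static int otel_ctx_read(pid_t pid, otel_process_ctx_header *out, uintptr_t *addr_out) {
    char path[64], line[512];
    snprintf(path, sizeof(path), "/proc/%d/maps", (int)pid);
    FILE *maps = fopen(path, "r");
    if (!maps) return -1;

    uintptr_t start = 0;
    while (fgets(line, sizeof(line), maps)) {
        if (strstr(line, "[anon:OTEL_CTX]")) {
            start = (uintptr_t)strtoull(line, NULL, 16); /* lines look like "start-end perms ..." */
            break;
        }
    }
    fclose(maps);
    if (start == 0) return -1;

    struct iovec local  = { .iov_base = out,           .iov_len = sizeof(*out) };
    struct iovec remote = { .iov_base = (void *)start, .iov_len = sizeof(*out) };
    if (process_vm_readv(pid, &local, 1, &remote, 1, 0) != (ssize_t)sizeof(*out)) return -1;

    /* All-zeroes (not yet published) or anything else fails this check. */
    if (memcmp(out->signature, "OTEL_CTX", 8) != 0 || out->version == 0) return -1;

    *addr_out = start;
    return 0; /* caller can then fetch `payload_size` bytes at `payload` the same way */
}
```

The eBPF profiler would presumably do the equivalent from its existing process-discovery path; this is just a plain userspace illustration of the handshake-free read.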
Co-authored-by: Florian Lehner <florianl@users.noreply.github.com>
oteps/profiles/4719-process-ctx.md
Outdated
> When an SDK initializes (or updates its resource attributes) it publishes this information to a small, fixed-size memory region that external processes can discover and read.
>
> The OTEL eBPF profiler will then, upon observing a previously-unseen process, probe and read this information, associating it with any profiling samples taken from a given process.
Could you please describe how it would/could/(or won't) work when an application is instrumented with OBI (https://github.com/open-telemetry/opentelemetry-ebpf-instrumentation)?
Thanks for this question!
I researched this and my conclusion is that right now this won't work with OBI.
From what I'm seeing, while it's possible for eBPF programs to write into userspace using bpf_probe_write_user (and this is already used by OBI to support Go tracing), I don't see a way to do the other things listed in the publication protocol, such as allocating (small amounts of) memory, or invoking the system calls to set up the naming and the inheritance permissions.
That said, I don't think this would necessarily be a blocker for OBI-to-OTEL-eBPF-Profiler communication, since we could introduce a specific out-of-band channel between them using the existing kernel eBPF primitives; but given the current limitations of eBPF, I don't think we can get OBI to implement this specification on behalf of an instrumented application.
Can you please document it in the OTEP?
Added in 9c8d9ed
Following discussion so far, we can probably avoid having our home-grown `OtelProcessCtx` and instead use the common OTEL `Resource` message.
This PR adds an experimental C/C++ implementation for the "Process Context" OTEP being proposed in open-telemetry/opentelemetry-specification#4719.

This implementation previously lived in https://github.com/ivoanjo/proc-level-demo/tree/main/anonmapping-clib and, as discussed during the OTEL profiling SIG meeting, we want to add it to this repository so it becomes easier to find and contribute to. I've made sure to include a README explaining how to use it. Here's the ultra-quick start (Linux-only):

```bash
$ ./build.sh
$ ./build/example_ctx --keep-running
Published: service=my-service, instance=123d8444-2c7e-46e3-89f6-6217880f7123, env=prod, version=4.5.6, sdk=example_ctx.c/c/1.2.3, resources=resource.key1=resource.value1,resource.key2=resource.value2
Continuing forever, to exit press ctrl+c...
TIP: You can now `sudo ./otel_process_ctx_dump.sh 267023` to see the context

# In another shell
$ sudo ./otel_process_ctx_dump.sh 267023 # Update this to match the PID from above
Found OTEL context for PID 267023
Start address: 756f28ce1000
00000000 4f 54 45 4c 5f 43 54 58 02 00 00 00 0b 68 55 47 |OTEL_CTX.....hUG|
00000010 70 24 7d 18 50 01 00 00 a0 82 6d 7e 6a 5f 00 00 |p$}.P.....m~j_..|
00000020
Parsed struct:
  otel_process_ctx_signature       : "OTEL_CTX"
  otel_process_ctx_version         : 2
  otel_process_ctx_published_at_ns : 1764606693650819083 (2025-12-01 16:31:33 GMT)
  otel_process_payload_size        : 336
  otel_process_payload             : 0x00005f6a7e6d82a0
Payload dump (336 bytes):
00000000 0a 25 0a 1b 64 65 70 6c 6f 79 6d 65 6e 74 2e 65 |.%..deployment.e|
00000010 6e 76 69 72 6f 6e 6d 65 6e 74 2e 6e 61 6d 65 12 |nvironment.name.|
...
Protobuf decode:
attributes { key: "deployment.environment.name" value { string_value: "prod" } }
attributes { key: "service.instance.id" value { string_value: "123d8444-2c7e-46e3-89f6-6217880f7123" } }
attributes { key: "service.name" value { string_value: "my-service" } }
...
```

Note that because the upstream OTEP is still under discussion, this implementation is experimental and may need changes to match up with the final version of the OTEP.
As pointed out during review, these don't necessarily exist for some resources so let's streamline the spec for now.
> option go_package = "go.opentelemetry.io/proto/otlp/resource/v1";
>
> // Resource information.
> message Resource {
Sorry for the late question, but this just popped into my mind:
What is the idea going forward for using message Resource for sharing thread state information or more process internals?
IIRC this approach should also be used later on to provide more information about process internals. But Resource.attributes only holds information covered by the OTel Semantic Conventions.
> What is the idea of going forward using message Resource for sharing thread state information or more process internals?

I suspect protobuf will be a bit too heavy/awkward for the thread state payload format, BUT my thinking is that anything we put there should otherwise map to/from attributes.

> But Resource.attributes only holds information covered by OTel Semantic Convention.

Actually I don't think that's the case? I've seen a lot of prior art for custom attributes, so anything we don't think should end up in semantic conventions could stay as a custom attribute. I think? 👀
> - **Inconsistent resource attributes across signals**: Running in different scopes, configuration such as `service.name`, `deployment.environment.name`, and `service.version` is not always available or resolved consistently between the OpenTelemetry SDKs and external readers, leading to configuration drift and inconsistent tagging.
>
> - **Correlation is dependent on process activity**: If a service is blocked (such as when doing slow I/O, or threads are actually deadlocked) and not emitting other signals, external readers have difficulty identifying it, since resource attributes or identifiers are only sent along when signals are reported.
Is this relevant to the issue we're trying to solve with this OTEP, meaning isn't this problem still going to exist with the eBPF profiler even if we adopt the proposed mechanism? Maybe add a clarification that for the eBPF profiler this behavior is unaffected by the proposed mechanism?
(I don't think we should remove it, as it's contextual information, but as it's currently listed in Motivation there's room for misunderstanding.)
If there's something else you had in mind re: a different external reader, feel free to clarify.
The thinking behind this point is two-fold:

- **Off-cpu/wall-time profiling**: My thinking is that since the OTEL eBPF profiler already supports off-cpu profiles, for such samples we would add support for including the process context as well. +1 that indeed "can read even when there's no activity" would not impact CPU profiling, since CPU profiling is only concerned with activity. If, in the future, wall-time profiling (e.g. a combination of on-cpu and off-cpu) was added to the OTEL eBPF profiler, that would be another use-case for this mechanism.

- **Non-reliance on mechanisms that require activity from the application**: If we were to try to solve the process context problem with an approach of having the application call something from time to time (or once/a few times, after handshaking with the reader), such a solution would be fragile in the presence of applications that are blocked/stuck, if the application for some reason stops performing those calls. The current solution is not affected by this, since the process context setup is intended to be performed once at application start, in a fire-and-forget way, independently of what the reader is doing.
> Publishing the context should follow these steps:
>
> 1. **Drop existing mapping**: If a previous context was published, unmap/free it
Do we need to drop the existing mapping? If we keep it fixed, the reader may cache the address for the target process, which simplifies checking if the data has been updated (no overhead of re-parsing mappings; this can also help with higher-frequency updates).
Since the payload pointer can point to anywhere in target process memory, we'll never be limited by the two-page fixed mapping size (meaning we don't need to grow this mapping to span more pages, either during process runtime or in the future).
> Do we need to drop the existing mapping?

Not strictly?

For the existing approach, it's possible to avoid polling mappings to figure out the address by:

- Checking that `published_at_ns` can be read and hasn't changed, and/or
- Hooking on prctl calls

Reusing the mapping instead of dropping it does not conflict with the above approaches, but... I think it would complicate concurrency control on the reader. That is, having this invariant allows the reader to know that while the mapping is up, the payload is valid and consistent as far as the writer is concerned.
If we allow the mapping address to change, then checking published_at_ns is not reliable with the existing protocol (for example, we'd need to overwrite with zeroes).
But the point I'm making is more general: The current update protocol mentions that the "previous mapping should be removed" before publishing new ones. If we assume that most implementors abide by this, then the overhead of parsing mappings will be there. For a reader like eBPF profiler that may have to manage hundreds of processes as a worst case, that overhead of constantly hitting /proc could be significant.
Can we examine the concurrency control edge cases in more detail? It should be possible to provide the same guarantees as now while keeping a fixed mapping.
We should avoid relying on hooking prctl IMO (it also doesn't solve the constant /proc access problem if most implementors change the mapping on every update)
> If we allow the mapping address to change, then checking published_at_ns is not reliable with the existing protocol (for example, we'd need to overwrite with zeroes).

I think it is! Let me try to convince you ;)

After a context gets dropped, one of two things happens:

a) The mapping becomes invalid. This would make reads return an error, which would be a clear indication of "not valid".

b) A new mapping (otel or not) gets put in its place. Reads to the old location of published_at_ns would return whatever's there now. Note that this would not be published_at_ns, because the kernel zeroes out memory before mapping it (e.g. this is not regular malloc/free) and thus I don't think it's possible for leftover garbage to exist to confuse the reader. (Edit: And thus the reader will know what it read is not valid)

> For a reader like eBPF profiler that may have to manage hundreds of processes as a worst case, that overhead of constantly hitting /proc could be significant.

The strategy above means that once a process context is detected, we can continue to detect its presence by reading the same address in a cheap way, even without hooking prctl.

(The reader may even choose to do time-based caching, e.g. read the context and reuse it for the next N seconds/minutes, rather than trying to always have the latest up-to-date version, if it wants to save even more reads)

> We should avoid relying on hooking prctl IMO (it also doesn't solve the constant /proc access problem if most implementors change the mapping on every update)

To be clear, I believe prctl is not needed at all to be able to follow invalidation of existing contexts/creation of new ones; it's a fully optional possibility.

We could even completely omit references to hooking on prctl in the current spec -- but I think it's an interesting feature to document in the spec for readers that want to use it.
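To illustrate the cheap "has anything changed?" check being described, here's a small sketch reusing the same assumed header layout, includes, and `process_vm_readv` approach as the earlier reader sketch: re-read the header at the cached address and compare the cached `published_at_ns`. A failed read corresponds to case (a) above (the mapping is gone); zeroes or a different timestamp mean the cached context should be discarded or re-discovered (case (b), or a re-publish).

```c
/* Builds on the reader sketch above (same _GNU_SOURCE/<sys/uio.h>/<string.h>
 * includes and hypothetical otel_process_ctx_header). Returns 1 if the cached
 * context is still valid, 0 if it is gone or replaced and should be re-discovered. */
static int otel_ctx_still_valid(pid_t pid, uintptr_t cached_addr,
                                uint64_t cached_published_at_ns) {
    otel_process_ctx_header hdr;
    struct iovec local  = { .iov_base = &hdr,                .iov_len = sizeof(hdr) };
    struct iovec remote = { .iov_base = (void *)cached_addr, .iov_len = sizeof(hdr) };

    /* Case (a): the old mapping was unmapped -> the read fails outright. */
    if (process_vm_readv(pid, &local, 1, &remote, 1, 0) != (ssize_t)sizeof(hdr))
        return 0;

    /* Case (b): a different (zero-filled or unrelated) mapping now lives at the
     * old address, or a new context was published -> signature/timestamp differ. */
    if (memcmp(hdr.signature, "OTEL_CTX", 8) != 0 ||
        hdr.published_at_ns != cached_published_at_ns)
        return 0;

    return 1;
}
```

Time-based caching, as suggested above, can further reduce how often this check needs to run per process.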
> The strategy above means that once a process context is detected, we can continue to detect its presence by reading the same address in a cheap way, even without hooking prctl.

True, that speeds up detection, but the overhead of parsing mappings before fetching an update is still there. It also makes for a more complicated update protocol, maybe limiting update operation frequency.

Advantages of keeping the mapping fixed:

- Simpler publisher logic
- Simpler reader logic
- Minimal (non-existent after the mapping is first detected in the reader) `/proc/` accessing and processing overhead
- Scales to thousands of processes
- Scales to higher frequency updates, minimizing the possibility of stale data

Can we clarify the disadvantages?
Wait, so to clarify: your concern is around the overhead of updates to the process context.

I can agree that if the writer keeps dropping the context and creating a new one, then the reader needs to do a bit more work to keep up than just "re-read the payload".

> ...maybe limiting update operation frequency.

Yet I'm a bit puzzled by this part of the premise: the kind of attributes we ship here are expected to change rarely, if ever. That's why my answers above focused on "is it cheap to check it hasn't changed" vs "is it cheap to handle when it has changed"...

Making updates is a supported, but not intended to be frequently used, part of the spec, which is even why I was suggesting caching it for seconds/minutes.

Can you share more of your thoughts around why an application would change this info often?
> The kind of attributes we ship here are expected to change rarely, if ever.

Maybe that's the case, but there is currently no mention of expected frequency in the OTEP. It's not uncommon for a general data transmission mechanism to be used in a broader way than originally conceived. My main goal with all these comments is to clarify and understand the trade-offs. If we're going to specify a less flexible, more complicated, and possibly less performant mechanism, let's try to understand what we're gaining in return.
What's a strong argument for recreating the mapping on each update?
Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>
…n Linux 5.17+. See open-telemetry/sig-profiling#23 for a wider discussion of this.
Changes
`CHANGELOG.md` file updated for non-trivial changes