Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -24,14 +24,25 @@ def call(_worker_class, job, _queue, _redis_pool)
}
attributes[SemanticConventions::Trace::PEER_SERVICE] = instrumentation_config[:peer_service] if instrumentation_config[:peer_service]

scheduled_at = job['at']
op = scheduled_at.nil? ? 'publish' : 'scheduled'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 This naming does not match any of the semantic convention naming for messaging.

https://github.com/open-telemetry/semantic-conventions/blob/v1.24.0/docs/messaging/messaging-spans.md#span-name

Is there something special about scheduled spans that help you disambiguate them from any other publisher?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will answer in the next comment ...


span_name = case instrumentation_config[:span_naming]
when :job_class then "#{job['wrapped']&.to_s || job['class']} publish"
else "#{job['queue']} publish"
when :job_class then "#{job['wrapped']&.to_s || job['class']} #{op}"
else "#{job['queue']} #{op}"
end

tracer.in_span(span_name, attributes: attributes, kind: :producer) do |span|
# In case this is Scheduled job, there is already context injected, so link to that context
# NOTE: :propagation_style = :child is not supported as it is quite tricky when :trace_poller_enqueue = true
extracted_context = OpenTelemetry.propagation.extract(job, context: OpenTelemetry::Context::ROOT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite interesting and do not see a case covered in the spec https://github.com/open-telemetry/semantic-conventions/blob/v1.24.0/docs/messaging/messaging-spans.md#span-name

If I understand this correctly, this is disconnecting the parent/child context and treats a Sidekiq scheduled job from the place where it was enqueued?

This seems unexpected to do from the client all cases of links because it is going to disconnect the clients, publishers and consumers from each other generating 3 traces ids instead of 2. By default, even in a link situation, I would expect the publisher span to use parent/child propagation and the consumer to generate links.

E.g. you have a web request that uses wait_until, I would expect the http.server span to have a parent/child relationship with the messaging.producer span:

def checkout
  # ... schedule housekeeping clean up
  GuestsCleanupJob.set(wait_until: Date.tomorrow.noon).perform_later(guest)
end

When the GuestsCleanupJob then executes it would create a new trace, and link the web request trace to it.

Sidekiq maps wait_until to the at attribute AFAICT: https://github.com/sidekiq/sidekiq/blob/96f867cb58b7fa0a6a832af1a732a339aa0eb61f/lib/sidekiq/job.rb#L194

With these changes, it will end up changing this expected behavior and that is undesirable in my opinion.

My recommendation here would be for you to add a utility method to your code or identify something more specific in Sidekiq that would address your use case.

Copy link
Contributor Author

@smoke smoke Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is how Sidekiq is working, to what I have found:

  1. <SomeJob>.perform_in(1.minute) pushes the Job definition to a dedicated queue (sorted set) with name schedule, using the at to sort the set
  2. Sidekiq::Scheduled::Poller runs roughly every average_scheduled_poll_interval,
  3. During the each run, it picks the items in the schedule that are due and pushes to the defined <SomeJob>.queue
  4. <SomeJob>.perform is invoked

When it comes to OpenTelemetry instrumentation of that:

Currently when when using span_naming=:job_class (default is :queue), propagation_style=:link (default is :link) this results in the following spans

  1. <SomeJob> publish and attrs {"messaging.destination.name": "high", ...} time=T, id=123, trace=ABC, parent=123
  2. If using trace_poller_enqueue: true - Sidekiq::Scheduled::Poller#enqueue time~=T+1.minute, id=567, trace=XYZ, parent=nil, having no parent and it enqueues a batch of Jobs
    • no relation / link to p.1
  3. If using trace_poller_enqueue: true - <SomeJob> publish and attrs {"messaging.destination.name": "high", ...} time~=T+1.minute, id=678, trace=XYZ, parent=567
    • no relation / link to p.1
  4. <SomeJob> process and attrs {"messaging.destination.name": "high", ...} time~=T+1.minute+operational delay, id=987, trace=DEF
    • relation / link to p.3, but only if using trace_poller_enqueue: true

Caveats:

  • If using the default trace_poller_enqueue: false - the P.2, P.3 are completely missed
  • If using the default trace_poller_enqueue: true - the P.2, P.3 and P.4 are correlated, but the Span from P.2 (Sidekiq::Scheduled::Poller#enqueue) can't be child of many parents as each there are NxP.1 each on its own Trace, thus only Link is available to correlate.
  • If using the default trace_poller_enqueue: true - the The P.3 (<SomeJob> publish) spans are put on the trace of P.2. The spans of P.2 eventually with quite some work, may be refactored to be put as child of P.1 and have links to P.2, but that is not something I am keen on doing as it is very hard. image
  • If using the default trace_poller_enqueue: false (the default) all of the P.2 and P.3 spans and correlation are missing

My goal is to have meaningful correlation from P.1 to P.4, regardless same Trace or Links, whatever is easy and feasible.
With the above caveats in place, this PR changes

  1. <SomeJob> scheduled and attrs {"messaging.destination.name": "high", ...} time=T, id=123, trace=ABC, parent=123
    • changed from publish to scheduled to distinguish from P.3
  2. If using trace_poller_enqueue: true - Sidekiq::Scheduled::Poller#enqueue time~=T+1.minute, id=567, trace=XYZ, parent=nil, having no parent and it enqueues a batch of Jobs
    • no relation / link to p.1
  3. If using trace_poller_enqueue: true - <SomeJob> publish and attrs {"messaging.destination.name": "high", ...} time~=T+1.minute, id=678, trace=XYZ, parent=567
    • if using propagation_style=:link - a Link to P.1 is added
  4. <SomeJob> process and attrs {"messaging.destination.name": "high", ...} time~=T+1.minute+operational delay, id=987, trace=DEF
    • relation / link to p.3, but only if using trace_poller_enqueue: true

Resulting in this more meaningful list and correlation with links or traces

image

I have been thinking to do some monkey patching or whatever, but it is way harder, moreover this PR has value for others.

So @arielvalentin how about, the following?

  1. Keep correlation as improved (limited Trace and added Links)
  2. If you are still very keen on obliging the standards, I can easily change P.1 spans addressing feat: sidekiq scheduled jobs improvements - scheduled operation and add links #1717 (comment)
    • from <SomeJob> **scheduled** and attrs {"messaging.destination.name": "**high**", ...} time=T, id=123, trace=ABC, parent=123
    • to <SomeJob> **publish** and attrs {"messaging.destination.name": "**schedule**", ...} time=T, id=123, trace=ABC, parent=123

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a sidekiq user so I'm thinking about this from the perspective of an active job user who may use the sidekiq adapter.

I see your point though of using non standard semconv in the cases where the naming pattern is Job Name instead of semantic conventions. I think in those cases using scheduled in the span name should be fine; however I think the semconv cases should remain untouched.

I'd need to review your comments more carefully since again, I'm not a sidekiq user so I don't quite understand the nuances of the cause of disconnect between the producer and consumer spans in the cases you outlined above.

Copy link
Contributor

@thompson-tomo thompson-tomo Nov 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also am not a sidekiq user but am coming from the sem conv side. I am wondering if the messaging conventions are the correct conventions to even be following.

When I do some searching I get the impression that it is a task/job system and the official website seems to back this up.

Scale your app with Ruby's fastest job system, up to 20x faster than the competition!

Based on that I suspect it would be highly unlikely to be able to get sidekiq listed in sem conv as a messaging system.

So with that in mind as well as the sidekiq docs I find myself asking myself how would I define this and what I come up with is:

I know this would be a significant change and additional feedback should be sought from others but i feel moving away from using messaging spans better sets up given the metric definitions available in the Workflow pr. I am more than willing to refine the attribute usage in those spans based on findings when attempting to implement it. Also in coming up with span name convention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind the conventions as long as things are well observable and correlated, however this is a huge change that I am not keen on doing as part of this PR.
This PR is intended as an unobtrusive incremental change, so it can be quickly approved and merged, so let's focus on this. Later I would love helping in the refactoring with real life usage and feedback.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it would be a big change but my concern is introducing the schedule messaging operation as a way to try and force the scenario of scheduling a task into the messaging specs. As a middle ground as if I understand correctly one of the issues is differentiating between the scheduling and publishing event. In that case why don't we if we are scheduling a task follow a definition which is not messaging. That way we aren't making the situation worse and I wouldn't have an issue if a couple of attributes were shared to assist with interoperability ie message.message.id

links = []
span_context = OpenTelemetry::Trace.current_span(extracted_context).context
links << OpenTelemetry::Trace::Link.new(span_context) if instrumentation_config[:propagation_style] == :link && span_context.valid?

tracer.in_span(span_name, attributes: attributes, links: links, kind: :producer) do |span|
OpenTelemetry.propagation.inject(job)
span.add_event('created_at', timestamp: time_from_timestamp(job['created_at']))
span.add_event('scheduled_at', timestamp: time_from_timestamp(scheduled_at)) unless scheduled_at.nil?
yield
end
end
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -32,12 +32,14 @@ def call(_worker, msg, _queue)

extracted_context = OpenTelemetry.propagation.extract(msg)
created_at = time_from_timestamp(msg['created_at'])
enqueued_at = time_from_timestamp(msg['created_at'])
enqueued_at = time_from_timestamp(msg['enqueued_at'])
scheduled_at = time_from_timestamp(msg['at']) unless msg['at'].nil?
OpenTelemetry::Context.with_current(extracted_context) do
if instrumentation_config[:propagation_style] == :child
tracer.in_span(span_name, attributes: attributes, kind: :consumer) do |span|
span.add_event('created_at', timestamp: created_at)
span.add_event('enqueued_at', timestamp: enqueued_at)
span.add_event('scheduled_at', timestamp: scheduled_at) unless scheduled_at.nil?
yield
end
else
Expand All @@ -48,6 +50,7 @@ def call(_worker, msg, _queue)
OpenTelemetry::Trace.with_span(span) do
span.add_event('created_at', timestamp: created_at)
span.add_event('enqueued_at', timestamp: enqueued_at)
span.add_event('scheduled_at', timestamp: scheduled_at) unless scheduled_at.nil?
yield
rescue Exception => e # rubocop:disable Lint/RescueException
span.record_exception(e)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -133,7 +133,7 @@
_(root_span.kind).must_equal :producer

# process span is linked to the root enqueuing job
child_span1 = spans.find { |s| s.links && s.links.first.span_context.span_id == root_span.span_id }
child_span1 = spans.find { |s| s.links&.first && s.links.first.span_context.span_id == root_span.span_id }
_(child_span1.name).must_equal 'default process'
_(child_span1.kind).must_equal :consumer

Expand All @@ -143,7 +143,7 @@
_(child_span2.kind).must_equal :producer

# last process job is linked back to the process job that enqueued it
child_span3 = spans.find { |s| s.links && s.links.first.span_context.span_id == child_span2.span_id }
child_span3 = spans.find { |s| s.links&.first && s.links.first.span_context.span_id == child_span2.span_id }
_(child_span3.name).must_equal 'default process'
_(child_span3.kind).must_equal :consumer
end
Expand Down
Loading