
Worker is trying to rescue jobs from queues that are not assigned to it #1105

Description

@NanoBjorn

A bit of context

We have two kinds of workers at the moment, sharing the same database: let's say worker-1 runs jobs from queue-1 and worker-2 runs jobs from queue-2. However, both workers are built from the same binary, and we recently tried registering jobs conditionally, i.e. registering only those jobs that a given worker is supposed to execute based on its queue selection.

Another thing to mention is that our RescueStuckJobsAfter is configured to be pretty small (30s), whereas the timeouts on the jobs themselves are in the minutes.
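
For concreteness, here is a minimal sketch of roughly what this setup looks like. The job kinds (EmailArgs, ReportArgs), the queue names, and the JobTimeout: -1 choice are hypothetical stand-ins rather than our real code; the point is the conditional registration plus the short RescueStuckJobsAfter:

    package main

    import (
        "context"
        "time"

        "github.com/jackc/pgx/v5"
        "github.com/jackc/pgx/v5/pgxpool"
        "github.com/riverqueue/river"
        "github.com/riverqueue/river/riverdriver/riverpgxv5"
    )

    // EmailArgs and ReportArgs are hypothetical job kinds standing in for our real ones.
    type EmailArgs struct{}

    func (EmailArgs) Kind() string { return "email" }

    type ReportArgs struct{}

    func (ReportArgs) Kind() string { return "report" }

    type EmailWorker struct{ river.WorkerDefaults[EmailArgs] }

    func (w *EmailWorker) Work(ctx context.Context, job *river.Job[EmailArgs]) error { return nil }

    type ReportWorker struct{ river.WorkerDefaults[ReportArgs] }

    func (w *ReportWorker) Work(ctx context.Context, job *river.Job[ReportArgs]) error { return nil }

    // newClient builds a client that works a single queue and registers only
    // the job kinds that this queue is supposed to execute.
    func newClient(pool *pgxpool.Pool, queue string) (*river.Client[pgx.Tx], error) {
        workers := river.NewWorkers()

        // Conditional registration based on queue selection.
        switch queue {
        case "queue-1":
            river.AddWorker(workers, &EmailWorker{})
        case "queue-2":
            river.AddWorker(workers, &ReportWorker{})
        }

        return river.NewClient(riverpgxv5.New(pool), &river.Config{
            Queues: map[string]river.QueueConfig{
                queue: {MaxWorkers: 10},
            },
            // Assumed: no global job timeout; per-worker Timeout() overrides
            // (in the minutes) apply instead.
            JobTimeout: -1,
            // Deliberately small relative to how long our jobs can run.
            RescueStuckJobsAfter: 30 * time.Second,
            Workers:              workers,
        })
    }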

An issue

After introducing this conditional registration, we observed the following behaviour:

  • worker-1 is fetching all jobs (for both queues) to rescue:
    -- name: JobGetStuck :many
    SELECT *
    FROM /* TEMPLATE: schema */river_job
    WHERE state = 'running'
    AND attempted_at < @stuck_horizon::timestamptz
    ORDER BY id
    LIMIT @max;
  • among others, it can fetch jobs that belong to queue-2; their kinds are not registered in worker-1, so makeRetryDecision discards them:
    workUnitFactory := s.Config.WorkUnitFactoryFunc(job.Kind)
    if workUnitFactory == nil {
        s.Logger.ErrorContext(ctx, s.Name+": Attempted to rescue unhandled job kind, discarding",
            slog.String("job_kind", job.Kind), slog.Int64("job_id", job.ID))
        return jobRetryDecisionDiscard, time.Time{}
    }

We checked the documentation in case we were missing something, but neither of the pages that could mention this behaviour says anything about our case:

  • Inserting and working jobs does not mention whether all job kinds used in a single database have to be registered in every worker
  • Multiple queues only says that "workers will only select jobs to work for queues that they're configured to handle", but it is still unclear whether we should register all job kinds or not

A very bad side effect of this for us was that the discarded job continued executing (its context wasn't cancelled) while another job with the same unique_key was scheduled, violating a uniqueness constraint that we rely on for correctness. I'm not blaming River for this; it's a consequence of our own ignorance of the rescue mechanics' nuances.
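
To illustrate the uniqueness we rely on (reusing the hypothetical types from the sketch above, not our actual code), the affected jobs are inserted with River's unique job options, roughly like this:

    // insertReport enqueues the hypothetical report job uniquely by its args,
    // so at most one such job should exist at a time. This is the guarantee
    // that broke when a discarded-but-still-running job overlapped with a
    // freshly inserted one.
    func insertReport(ctx context.Context, client *river.Client[pgx.Tx]) error {
        _, err := client.Insert(ctx, ReportArgs{}, &river.InsertOpts{
            UniqueOpts: river.UniqueOpts{ByArgs: true},
        })
        return err
    }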


Short-term, we fixed our issue by once again registering all jobs in the worker binary regardless of its queue configuration, but it would be valuable to know whether this is designed behaviour or a bug.
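
In terms of the sketch above, the workaround simply drops the conditional:

    func newWorkers() *river.Workers {
        // Workaround: register every job kind regardless of which queue this
        // binary is configured to run, so the rescuer always finds a work
        // unit factory for any stuck job it fetches.
        workers := river.NewWorkers()
        river.AddWorker(workers, &EmailWorker{})
        river.AddWorker(workers, &ReportWorker{})
        return workers
    }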
