Description
A bit of context
We currently run two kinds of workers sharing the same database: `worker-1` works jobs from `queue-1` and `worker-2` works jobs from `queue-2`. Both workers are built from the same binary, and we recently tried registering jobs conditionally, i.e. each worker only registers the job kinds it is supposed to execute, based on its queue selection.
Another thing worth mentioning: our `RescueStuckJobsAfter` is configured to be quite small (30s), whereas the timeouts on the jobs themselves are measured in minutes.
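To make the setup concrete, here is a minimal sketch of what our conditional registration looks like. The worker/args types, the `WORKER_QUEUE` environment variable, `MaxWorkers`, the `newClient` helper, and the exact timeout values are hypothetical stand-ins for our real configuration:

```go
package main

import (
	"context"
	"os"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

// Queue1Args/Queue1Worker and Queue2Args/Queue2Worker are hypothetical
// stand-ins for our real job kinds.
type Queue1Args struct{}

func (Queue1Args) Kind() string { return "queue_1_job" }

type Queue1Worker struct{ river.WorkerDefaults[Queue1Args] }

func (w *Queue1Worker) Work(ctx context.Context, job *river.Job[Queue1Args]) error {
	return nil // real work elided
}

type Queue2Args struct{}

func (Queue2Args) Kind() string { return "queue_2_job" }

type Queue2Worker struct{ river.WorkerDefaults[Queue2Args] }

func (w *Queue2Worker) Work(ctx context.Context, job *river.Job[Queue2Args]) error {
	return nil // real work elided
}

// newClient registers only the job kinds belonging to this binary's queue.
func newClient(dbPool *pgxpool.Pool) (*river.Client[pgx.Tx], error) {
	queue := os.Getenv("WORKER_QUEUE") // "queue-1" or "queue-2"

	workers := river.NewWorkers()
	switch queue {
	case "queue-1":
		river.AddWorker(workers, &Queue1Worker{})
	case "queue-2":
		river.AddWorker(workers, &Queue2Worker{})
	}

	return river.NewClient(riverpgxv5.New(dbPool), &river.Config{
		Queues: map[string]river.QueueConfig{
			queue: {MaxWorkers: 10},
		},
		Workers:              workers,
		RescueStuckJobsAfter: 30 * time.Second, // much shorter than our job timeouts
		JobTimeout:           5 * time.Minute,  // illustrative; our real timeouts vary
	})
}
```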
An issue
After introducing this conditional registration, we observed the following behaviour:
- `worker-1` is fetching all stuck jobs to rescue, for both queues (river/riverdriver/riverpgxv5/internal/dbsqlc/river_job.sql, lines 257 to 263 in eb0b985):

  ```sql
  -- name: JobGetStuck :many
  SELECT *
  FROM /* TEMPLATE: schema */river_job
  WHERE state = 'running'
      AND attempted_at < @stuck_horizon::timestamptz
  ORDER BY id
  LIMIT @max;
  ```

- Amongst others, it can fetch jobs dedicated to `queue-2`. Their kinds are not registered in `worker-1`, so `makeRetryDecision` discards the job (river/internal/maintenance/job_rescuer.go, lines 302 to 307 in eb0b985):

  ```go
  workUnitFactory := s.Config.WorkUnitFactoryFunc(job.Kind)
  if workUnitFactory == nil {
  	s.Logger.ErrorContext(ctx, s.Name+": Attempted to rescue unhandled job kind, discarding",
  		slog.String("job_kind", job.Kind), slog.Int64("job_id", job.ID))
  	return jobRetryDecisionDiscard, time.Time{}
  }
  ```
We checked the documentation to see if we were missing something, but neither of the two pages that could cover this behaviour says anything about our case:
- Inserting and working jobs does not mention whether all job kinds in a single database have to be registered in all workers
- Multiple queues only says that "workers will only select jobs to work for queues that they're configured to handle", but it is still unclear whether we should register all job kinds or not
A very bad side effect for us was that the discarded job continued executing (its context wasn't cancelled), and another job with the same `unique_key` was scheduled, violating a uniqueness guarantee we rely on for correctness. I'm not blaming river for this; it's a consequence of our own ignorance of the rescue mechanics' nuances.
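For context, the uniqueness we rely on is roughly along these lines (a sketch reusing the hypothetical `Queue1Args` and client from the snippet above; our real `UniqueOpts` settings differ). Once the rescuer discarded the still-running job, a second insert with the same unique key was accepted:

```go
// Sketch of the unique insert we rely on (hypothetical settings): with these
// options River should keep at most one non-finalized job with this kind/args.
_, err := client.Insert(ctx, Queue1Args{}, &river.InsertOpts{
	Queue: "queue-1",
	UniqueOpts: river.UniqueOpts{
		ByArgs:   true,             // unique per args
		ByPeriod: 15 * time.Minute, // hypothetical period
	},
})
// err handling elided
```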
Short-term, we fixed our issue by once again registering all job kinds in the worker binary regardless of its queue configuration, but it would be valuable to have an answer: is this designed behaviour or a bug?
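For reference, the short-term fix is essentially this variation of the first sketch (same hypothetical names): all kinds are registered unconditionally, and only the `Queues` configuration differs per binary:

```go
// Workaround: register every kind in every binary so the rescuer can always
// make a retry decision; queue selection happens only via Config.Queues.
workers := river.NewWorkers()
river.AddWorker(workers, &Queue1Worker{})
river.AddWorker(workers, &Queue2Worker{})

client, err := river.NewClient(riverpgxv5.New(dbPool), &river.Config{
	Queues: map[string]river.QueueConfig{
		os.Getenv("WORKER_QUEUE"): {MaxWorkers: 10}, // still one queue per binary
	},
	Workers:              workers,
	RescueStuckJobsAfter: 30 * time.Second,
})
// err handling elided
```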