Introduction
We have been using River for about a year, and we’ve had almost no issues that we couldn’t solve by tuning the configs or modifying the workers. Overall, we are very satisfied. However, in the last week, we encountered an issue twice that we haven’t been able to resolve yet.
We have the following versions in production:
github.com/riverqueue/river v0.20.2
github.com/riverqueue/river/riverdriver/riverpgxv5 v0.20.2
github.com/riverqueue/river/rivertype v0.20.2
Use-case
We run a Go app in Kubernetes: the REST API pods emit events, we insert those events into Postgres as River jobs, and a separate deployment running River workers processes them (it makes an HTTP call for each webhook subscription).
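For reference, the worker side is shaped roughly like the sketch below. The type and field names are simplified placeholders rather than our production code; the only detail that matters is that each job makes a single outbound HTTP call with a short client timeout.

```go
package webhook

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"time"

	"github.com/riverqueue/river"
)

// WebhookArgs is a placeholder payload; the real args carry more fields.
type WebhookArgs struct {
	URL     string `json:"url"`
	Payload []byte `json:"payload"`
}

func (WebhookArgs) Kind() string { return "webhook_delivery" }

// WebhookWorker delivers one webhook per job.
type WebhookWorker struct {
	river.WorkerDefaults[WebhookArgs]
	httpClient *http.Client
}

func NewWebhookWorker() *WebhookWorker {
	// The HTTP client carries the 2-second timeout mentioned further down.
	return &WebhookWorker{httpClient: &http.Client{Timeout: 2 * time.Second}}
}

func (w *WebhookWorker) Work(ctx context.Context, job *river.Job[WebhookArgs]) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, job.Args.URL, bytes.NewReader(job.Args.Payload))
	if err != nil {
		return err
	}

	resp, err := w.httpClient.Do(req)
	if err != nil {
		// Returning an error lets River schedule a retry with its backoff.
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 300 {
		return fmt.Errorf("webhook responded with status %d", resp.StatusCode)
	}
	return nil
}
```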
The problem
We noticed that fewer webhook calls were being made than usual.
We checked the job states in our DB with select state, count(*) from webhooks.river_job group by state; and got:
   state   |  count
-----------+--------
 available |   1404
 completed | 272082
 discarded |  44583
 retryable |  98424
 running   |   2739
Then we checked the error fields of the retryable jobs, and all of them errored with "Job rescued by rescuer".
We initially thought our workers couldn't process the jobs in time, but each job only makes a single HTTP call with a 2-second timeout, so we were fairly sure that wasn't the case.
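As far as we understand it, "Job rescued by rescuer" comes from River's rescuer maintenance service: when a job row sits in the running state longer than RescueStuckJobsAfter without being completed, the rescuer resets it to retryable. That points at completions not being written to Postgres rather than at slow workers. Below is a minimal sketch of the client setup with that knob spelled out; the values are placeholders, not our production settings.

```go
package main

import (
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

func newRiverClient(dbPool *pgxpool.Pool, workers *river.Workers) (*river.Client[pgx.Tx], error) {
	return river.NewClient(riverpgxv5.New(dbPool), &river.Config{
		Queues: map[string]river.QueueConfig{
			river.QueueDefault: {MaxWorkers: 100}, // placeholder value
		},
		Workers: workers,
		// Jobs that stay in the running state longer than this are picked up
		// by the rescuer and set back to retryable (or discarded once they
		// run out of attempts). To our knowledge the default is one hour,
		// which matches what we saw when the completer couldn't write.
		RescueStuckJobsAfter: time.Hour,
	})
}
```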
Then we found many BatchCompleter: Completer error (will retry after sleep) and BatchCompleter: Too many errors; giving up log entries, all caused by timeout: context deadline exceeded. There were also JobScheduler: Error scheduling jobs and Elector: Error attempting reelection errors with the same context deadline exceeded cause.
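While those errors were firing, a generic way to confirm that statements were genuinely slow or queued up on the Postgres side is to snapshot pg_stat_activity. The sketch below is nothing River-specific, and the threshold is an arbitrary example:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// printSlowStatements lists non-idle statements that have been running longer
// than the given threshold, so completer/scheduler timeouts can be correlated
// with what Postgres is actually busy doing.
func printSlowStatements(ctx context.Context, pool *pgxpool.Pool, threshold time.Duration) error {
	rows, err := pool.Query(ctx, `
		SELECT pid, (now() - query_start)::text AS runtime, state, left(query, 80) AS query
		FROM pg_stat_activity
		WHERE state <> 'idle'
		  AND now() - query_start > make_interval(secs => $1)
		ORDER BY now() - query_start DESC`,
		threshold.Seconds())
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var (
			pid                   int
			runtime, state, query string
		)
		if err := rows.Scan(&pid, &runtime, &state, &query); err != nil {
			return err
		}
		fmt.Printf("pid=%d runtime=%s state=%s query=%q\n", pid, runtime, state, query)
	}
	return rows.Err()
}
```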
For Postgres we have a Cloud SQL instance with 4 vCPUs and 16 GB of memory. (It is shared among a few apps, but this one is by far the most significant.)
[Cloud SQL monitoring chart: CPU utilization and IO wait over time]
Here we see a significant increase in IO wait. We eventually switched to maintenance mode around 7:30 PM and tried to fix the error, which is why the load decreased afterwards; the drop wasn't natural. (The horizontal line in the chart is a different metric; it does not mean we hit the CPU limit. CPU utilization was ~70-80%, so we never hit 100%, and when we doubled the resources it was ~60%.)
What we tried
- We deleted all the completed jobs, but the issue persisted (the rough shape of these cleanup statements is sketched after this list)
- We updated all the retryable and running jobs to the available state, but the issue persisted
- We increased the DB resources to 8 vCPUs and 32 GB of RAM, but the issue persisted
- We stopped the app completely, removed everything but the ~120k available jobs from the river_job table, and ran a VACUUM; but when we started the app again, the BatchCompleter: Completer error (will retry after sleep) errors appeared almost instantly, only ~30% of the jobs completed, and the rest stayed in running and eventually moved to retryable
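For reference, the cleanup steps above amount to statements shaped roughly like this Go sketch against the webhooks schema from the earlier query (an approximation, not the exact commands we ran):

```go
package main

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// cleanupRiverJobs approximates the manual cleanup described above.
func cleanupRiverJobs(ctx context.Context, pool *pgxpool.Pool) error {
	// Drop completed jobs to shrink the table.
	if _, err := pool.Exec(ctx,
		`DELETE FROM webhooks.river_job WHERE state = 'completed'`); err != nil {
		return err
	}

	// Push retryable/running jobs back to available so they get picked up again.
	if _, err := pool.Exec(ctx,
		`UPDATE webhooks.river_job SET state = 'available' WHERE state IN ('retryable', 'running')`); err != nil {
		return err
	}

	// Reclaim dead tuples after the bulk deletes/updates.
	if _, err := pool.Exec(ctx, `VACUUM webhooks.river_job`); err != nil {
		return err
	}
	return nil
}
```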
Now we will update River to the newest version and see if the issue appears again.
We don't expect anyone to solve our problem for us, but we are curious whether you have seen something similar before, or have any tips on what might be causing this behavior.
One more minor detail: we use River in multiple databases in this cluster, and even in multiple schemas within the same database, in case that has anything to do with our issue.