Introduction
We have been using River for about a year, and we’ve had almost no issues that we couldn’t solve by tuning the configs or modifying the workers. Overall, we are very satisfied. However, in the last week, we encountered an issue twice that we haven’t been able to resolve yet.
We have the following versions in production:
github.com/riverqueue/river v0.20.2
github.com/riverqueue/river/riverdriver/riverpgxv5 v0.20.2
github.com/riverqueue/river/rivertype v0.20.2
Use-case
We run a Go app in Kubernetes: the REST API pods emit events, we insert those events into Postgres as River jobs, and a separate deployment running River workers processes them (it makes an HTTP call for each webhook subscription).
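For reference, the worker side is shaped roughly like the sketch below. The type and field names are simplified placeholders rather than our production code; the only detail that matters is that each job makes a single outbound HTTP call with a short client timeout.

```go
package webhook

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"time"

	"github.com/riverqueue/river"
)

// WebhookArgs is a placeholder payload; the real args carry more fields.
type WebhookArgs struct {
	URL     string `json:"url"`
	Payload []byte `json:"payload"`
}

func (WebhookArgs) Kind() string { return "webhook_delivery" }

// WebhookWorker delivers one webhook per job.
type WebhookWorker struct {
	river.WorkerDefaults[WebhookArgs]
	httpClient *http.Client
}

func NewWebhookWorker() *WebhookWorker {
	// The HTTP client carries the 2-second timeout mentioned further down.
	return &WebhookWorker{httpClient: &http.Client{Timeout: 2 * time.Second}}
}

func (w *WebhookWorker) Work(ctx context.Context, job *river.Job[WebhookArgs]) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, job.Args.URL, bytes.NewReader(job.Args.Payload))
	if err != nil {
		return err
	}

	resp, err := w.httpClient.Do(req)
	if err != nil {
		// Returning an error lets River schedule a retry with its backoff.
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 300 {
		return fmt.Errorf("webhook responded with status %d", resp.StatusCode)
	}
	return nil
}
```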
The problem
We noticed that fewer webhook calls were being made than usual.
We checked the job states in our DB with select state, count(*) from webhooks.river_job group by state; and got:
   state   |  count
-----------+--------
 available |   1404
 completed | 272082
 discarded |  44583
 retryable |  98424
 running   |   2739
Then we checked the error fields of the retryable jobs, and all of them errored with "Job rescued by rescuer".
We initially thought our workers couldn't process the jobs in time, but each job only makes a single HTTP call with a 2-second timeout, so we were fairly sure that wasn't the case.
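As far as we understand it, "Job rescued by rescuer" comes from River's rescuer maintenance service: when a job row sits in the running state longer than RescueStuckJobsAfter without being completed, the rescuer resets it to retryable. That points at completions not being written to Postgres rather than at slow workers. Below is a minimal sketch of the client setup with that knob spelled out; the values are placeholders, not our production settings.

```go
package main

import (
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

func newRiverClient(dbPool *pgxpool.Pool, workers *river.Workers) (*river.Client[pgx.Tx], error) {
	return river.NewClient(riverpgxv5.New(dbPool), &river.Config{
		Queues: map[string]river.QueueConfig{
			river.QueueDefault: {MaxWorkers: 100}, // placeholder value
		},
		Workers: workers,
		// Jobs that stay in the running state longer than this are picked up
		// by the rescuer and set back to retryable (or discarded once they
		// run out of attempts). To our knowledge the default is one hour,
		// which matches what we saw when the completer couldn't write.
		RescueStuckJobsAfter: time.Hour,
	})
}
```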
Then we found many BatchCompleter: Completer error (will retry after sleep) and BatchCompleter: Too many errors; giving up log entries, all caused by timeout: context deadline exceeded. There were also JobScheduler: Error scheduling jobs and Elector: Error attempting reelection errors with the same context deadline exceeded cause.
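While those errors were firing, a generic way to confirm that statements were genuinely slow or queued up on the Postgres side is to snapshot pg_stat_activity. The sketch below is nothing River-specific, and the threshold is an arbitrary example:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// printSlowStatements lists non-idle statements that have been running longer
// than the given threshold, so completer/scheduler timeouts can be correlated
// with what Postgres is actually busy doing.
func printSlowStatements(ctx context.Context, pool *pgxpool.Pool, threshold time.Duration) error {
	rows, err := pool.Query(ctx, `
		SELECT pid, (now() - query_start)::text AS runtime, state, left(query, 80) AS query
		FROM pg_stat_activity
		WHERE state <> 'idle'
		  AND now() - query_start > make_interval(secs => $1)
		ORDER BY now() - query_start DESC`,
		threshold.Seconds())
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var (
			pid                   int
			runtime, state, query string
		)
		if err := rows.Scan(&pid, &runtime, &state, &query); err != nil {
			return err
		}
		fmt.Printf("pid=%d runtime=%s state=%s query=%q\n", pid, runtime, state, query)
	}
	return rows.Err()
}
```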
For Postgres we have a Cloud SQL instance with 4 vCPUs and 16 GB of memory. (It is shared among a few apps, but this one is by far the most significant.)
[Cloud SQL monitoring chart: CPU utilization and IO wait over time]
Here we see a significant increase in IO wait. We eventually switched to maintenance mode around 7:30 PM and tried to fix the error, which is why the load decreased afterwards; the drop wasn't natural. (The horizontal line in the chart is a different metric; it does not mean we hit the CPU limit. CPU utilization was ~70-80%, so we never hit 100%, and when we doubled the resources it was ~60%.)
What we tried
- We deleted all the completed jobs, but the issue persisted (the rough shape of these cleanup statements is sketched after this list)
- We updated all the retryable and running jobs to the available state, but the issue persisted
- We increased the DB resources to 8 vCPUs and 32 GB of RAM, but the issue persisted
- We stopped the app completely, removed everything but the ~120k available jobs from the river_job table, and ran a VACUUM; but when we started the app again, the BatchCompleter: Completer error (will retry after sleep) errors appeared almost instantly, only ~30% of the jobs completed, and the rest stayed in running and eventually moved to retryable
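For reference, the cleanup steps above amount to statements shaped roughly like this Go sketch against the webhooks schema from the earlier query (an approximation, not the exact commands we ran):

```go
package main

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// cleanupRiverJobs approximates the manual cleanup described above.
func cleanupRiverJobs(ctx context.Context, pool *pgxpool.Pool) error {
	// Drop completed jobs to shrink the table.
	if _, err := pool.Exec(ctx,
		`DELETE FROM webhooks.river_job WHERE state = 'completed'`); err != nil {
		return err
	}

	// Push retryable/running jobs back to available so they get picked up again.
	if _, err := pool.Exec(ctx,
		`UPDATE webhooks.river_job SET state = 'available' WHERE state IN ('retryable', 'running')`); err != nil {
		return err
	}

	// Reclaim dead tuples after the bulk deletes/updates.
	if _, err := pool.Exec(ctx, `VACUUM webhooks.river_job`); err != nil {
		return err
	}
	return nil
}
```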
Now we will update River to the newest version and see if the issue appears again.
We don't expect anyone to solve our problem for us, but we are curious whether you have seen something similar before, or have any tips on what might be causing this behavior.
One more minor detail: we use River in multiple databases in this cluster, and even in multiple schemas within the same database, in case that has anything to do with our issue.