From 25759c44f50db9b2926f8e3dbe6596219a219c89 Mon Sep 17 00:00:00 2001
From: Iris Ho
Date: Fri, 17 Oct 2025 16:03:30 -0700
Subject: [PATCH 01/14] add exponential backoff to withTransactions API

---
 .../tests/README.md | 6 ++++++
 .../transactions-convenient-api.md | 21 ++++++++++++++++---
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/source/transactions-convenient-api/tests/README.md b/source/transactions-convenient-api/tests/README.md
index a797a3182f..905c14a9ba 100644
--- a/source/transactions-convenient-api/tests/README.md
+++ b/source/transactions-convenient-api/tests/README.md
@@ -41,8 +41,14 @@ If possible, drivers should implement these tests without requiring the test run
 the retry timeout. This might be done by internally modifying the timeout value used by `withTransaction` with some
 private API or using a mock timer.
 
+### Retry Backoff is Enforced
+
+Drivers should test that retries within `withTransaction` do not occur immediately. Ideally, set BACKOFF_INITIAL 500ms
+and configure a failpoint that forces one retry. Ensure that the operation took more than 500ms so succeed.
+
 ## Changelog
 
+- 2025-10-17: Added Backoff test.
 - 2024-09-06: Migrated from reStructuredText to Markdown.
 - 2024-02-08: Converted legacy tests to unified format.
 - 2021-04-29: Remove text about write concern timeouts from prose test.
diff --git a/source/transactions-convenient-api/transactions-convenient-api.md b/source/transactions-convenient-api/transactions-convenient-api.md
index 7d1864391a..fa7db0a09d 100644
--- a/source/transactions-convenient-api/transactions-convenient-api.md
+++ b/source/transactions-convenient-api/transactions-convenient-api.md
@@ -99,7 +99,8 @@ has not been exceeded, the driver MUST retry a transaction that fails with an er
 "TransientTransactionError" label. Since retrying the entire transaction will entail invoking the callback again,
 drivers MUST document that the callback may be invoked multiple times (i.e. one additional time per retry attempt) and
 MUST document the risk of side effects from using a non-idempotent callback. If the retry timeout has been exceeded,
-drivers MUST NOT retry the transaction and allow `withTransaction` to propagate the error to its caller.
+drivers MUST NOT retry the transaction and allow `withTransaction` to propagate the error to its caller. When retrying,
+drivers MUST implement an exponential backoff with jitter following the algorithm described below.
 
 If an error bearing neither the UnknownTransactionCommitResult nor the TransientTransactionError label is encountered at
 any point, the driver MUST NOT retry and MUST allow `withTransaction` to propagate the error to its caller.
@@ -129,7 +130,13 @@ This method should perform the following sequence of actions:
    1. If the ClientSession is in the "starting transaction" or "transaction in progress" state, invoke
       [abortTransaction](../transactions/transactions.md#aborttransaction) on the session.
    2. If the callback's error includes a "TransientTransactionError" label and the elapsed time of `withTransaction` is
-      less than 120 seconds, jump back to step two.
+      less than 120 seconds, sleep for `jitter * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX)` where:
+      1. jitter is a random float between [0, 1)
+      2. retry is one less than the number of times Step 2 has been executed since Step 1 was executed
+      3. BACKOFF_INITIAL is 1ms
+      4. BACKOFF_MAX is 500ms
+
+      Then, jump back to step two.
    3. If the callback's error includes a "UnknownTransactionCommitResult" label, the callback must have manually
       committed a transaction, propagate the callback's error to the caller of `withTransaction` and return
       immediately.
@@ -154,11 +161,18 @@ This method should perform the following sequence of actions:
 This method can be expressed by the following pseudo-code:
 
 ```typescript
+var BACKOFF_INITIAL = 1 // 1ms initial backoff
+var BACKOFF_MAX = 500 // 500ms max backoff
 withTransaction(callback, options) {
     // Note: drivers SHOULD use a monotonic clock to determine elapsed time
     var startTime = Date.now(); // milliseconds since Unix epoch
+    var retry = 0
 
     retryTransaction: while (true) {
+        if (retry > 0):
+            sleep(Math.random() * min(BACKOFF_INITIAL * (1.25**retry),
+                BACKOFF_MAX))
+        retry += 1
         this.startTransaction(options); // may throw on error
 
         try {
@@ -324,7 +338,7 @@ exceed the user's original intention for `maxTimeMS`.
 The callback may be executed any number of times. Drivers are free to encourage their users to design idempotent
 callbacks.
 
-A previous design had no limits for retrying commits or entire transactions. The callback is always able indicate that
+A previous design had no limits for retrying commits or entire transactions. The callback is always able to indicate that
 `withTransaction` should return to its caller (without future retry attempts) by aborting the transaction directly;
 however, that puts the onus on avoiding very long (or infinite) retry loops on the application. We expect the most
 common cause of retry loops will be due to TransientTransactionErrors caused by write conflicts, as those can occur
@@ -356,6 +370,7 @@ provides an implementation of a technique already described in the MongoDB 4.0 d
 ([DRIVERS-488](https://jira.mongodb.org/browse/DRIVERS-488)).
 
 ## Changelog
+- 2025-10-17: withTransaction applies exponential backoff when retrying.
 - 2024-09-06: Migrated from reStructuredText to Markdown.
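The sleep rule added by this patch can be tried out with a small Python sketch. It is illustrative only: `backoff_ms`, `total_backoff_ms` and the module-level constants are hypothetical names rather than any driver's API. The constants mirror this patch (1ms initial backoff, 1.25 growth, 500ms cap; later patches in the series change them to 5ms and 1.5), and `random.random()` stands in for whatever jitter source a driver uses.

```python
import random
from typing import Optional

# Constants as introduced in PATCH 01/14 (illustrative names, not a driver API).
BACKOFF_INITIAL_MS = 1.0
BACKOFF_MAX_MS = 500.0
GROWTH = 1.25


def backoff_ms(retry: int, jitter: Optional[float] = None) -> float:
    """Return jitter * min(BACKOFF_INITIAL * GROWTH**retry, BACKOFF_MAX) in milliseconds."""
    if jitter is None:
        jitter = random.random()  # uniform float in [0, 1)
    return jitter * min(BACKOFF_INITIAL_MS * GROWTH**retry, BACKOFF_MAX_MS)


def total_backoff_ms(retries: int, first_retry: int = 1) -> float:
    """Worst-case sleep (jitter pinned to 1) summed over `retries` consecutive retries.

    `first_retry` is the exponent used for the first retry: the pseudocode
    increments `retry` before the first sleep (exponent 1), while the numbered
    steps define it as one less than the attempt count (exponent 0).
    """
    return sum(
        backoff_ms(r, jitter=1.0) for r in range(first_retry, first_retry + retries)
    )
```

With jitter pinned to 1, `total_backoff_ms(13)` comes to roughly 86 ms under these constants, and the 500ms cap only takes effect from retry 28 onward, so early retries stay close to immediate.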
From bdcd2ef0d8ba136d88515865c9228614dbd32173 Mon Sep 17 00:00:00 2001
From: Iris Ho
Date: Mon, 20 Oct 2025 10:10:42 -0700
Subject: [PATCH 02/14] run pre-commit

---
 .../tests/README.md | 2 +-
 .../transactions-convenient-api.md | 27 +++++++++++--------
 2 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/source/transactions-convenient-api/tests/README.md b/source/transactions-convenient-api/tests/README.md
index 905c14a9ba..dab2dfc544 100644
--- a/source/transactions-convenient-api/tests/README.md
+++ b/source/transactions-convenient-api/tests/README.md
@@ -44,7 +44,7 @@ private API or using a mock timer.
 ### Retry Backoff is Enforced
 
 Drivers should test that retries within `withTransaction` do not occur immediately. Ideally, set BACKOFF_INITIAL 500ms
-and configure a failpoint that forces one retry. Ensure that the operation took more than 500ms so succeed.
+and configure a failpoint that forces one retry. Ensure that the operation took more than 500ms so succeed.
 
 ## Changelog
 
diff --git a/source/transactions-convenient-api/transactions-convenient-api.md b/source/transactions-convenient-api/transactions-convenient-api.md
index fa7db0a09d..13c187adc5 100644
--- a/source/transactions-convenient-api/transactions-convenient-api.md
+++ b/source/transactions-convenient-api/transactions-convenient-api.md
@@ -99,7 +99,7 @@ has not been exceeded, the driver MUST retry a transaction that fails with an er
 "TransientTransactionError" label. Since retrying the entire transaction will entail invoking the callback again,
 drivers MUST document that the callback may be invoked multiple times (i.e. one additional time per retry attempt) and
 MUST document the risk of side effects from using a non-idempotent callback. If the retry timeout has been exceeded,
-drivers MUST NOT retry the transaction and allow `withTransaction` to propagate the error to its caller. When retrying,
+drivers MUST NOT retry the transaction and allow `withTransaction` to propagate the error to its caller. When retrying,
 drivers MUST implement an exponential backoff with jitter following the algorithm described below.
 
 If an error bearing neither the UnknownTransactionCommitResult nor the TransientTransactionError label is encountered at
@@ -129,17 +129,21 @@ This method should perform the following sequence of actions:
 6. If the callback reported an error:
    1. If the ClientSession is in the "starting transaction" or "transaction in progress" state, invoke
       [abortTransaction](../transactions/transactions.md#aborttransaction) on the session.
+
    2. If the callback's error includes a "TransientTransactionError" label and the elapsed time of `withTransaction` is
-      less than 120 seconds, sleep for `jitter * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX)` where:
-      1. jitter is a random float between [0, 1)
-      2. retry is one less than the number of times Step 2 has been executed since Step 1 was executed
-      3. BACKOFF_INITIAL is 1ms
-      4. BACKOFF_MAX is 500ms
-
-      Then, jump back to step two.
+      less than 120 seconds, sleep for `jitter * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX)` where:
+
+      1. jitter is a random float between \[0, 1)
+      2. retry is one less than the number of times Step 2 has been executed since Step 1 was executed
+      3. BACKOFF_INITIAL is 1ms
+      4. BACKOFF_MAX is 500ms
+
+      Then, jump back to step two.
+
    3. If the callback's error includes a "UnknownTransactionCommitResult" label, the callback must have manually
       committed a transaction, propagate the callback's error to the caller of `withTransaction` and return
       immediately.
+
    4. Otherwise, propagate the callback's error to the caller of `withTransaction` and return immediately.
 7. If the ClientSession is in the "no transaction", "transaction aborted", or "transaction committed" state, assume
    the callback intentionally aborted or committed the transaction and return immediately.
@@ -338,8 +342,8 @@ exceed the user's original intention for `maxTimeMS`.
 The callback may be executed any number of times. Drivers are free to encourage their users to design idempotent
 callbacks.
 
-A previous design had no limits for retrying commits or entire transactions. The callback is always able to indicate that
-`withTransaction` should return to its caller (without future retry attempts) by aborting the transaction directly;
+A previous design had no limits for retrying commits or entire transactions. The callback is always able to indicate
+that `withTransaction` should return to its caller (without future retry attempts) by aborting the transaction directly;
 however, that puts the onus on avoiding very long (or infinite) retry loops on the application. We expect the most
 common cause of retry loops will be due to TransientTransactionErrors caused by write conflicts, as those can occur
 regularly in a healthy application, as opposed to UnknownTransactionCommitResult, which would typically be caused by an
@@ -370,7 +374,8 @@ provides an implementation of a technique already described in the MongoDB 4.0 d
 ([DRIVERS-488](https://jira.mongodb.org/browse/DRIVERS-488)).
 
 ## Changelog
-- 2025-10-17: withTransaction applies exponential backoff when retrying.
+
+- 2025-10-17: withTransaction applies exponential backoff when retrying.
 - 2024-09-06: Migrated from reStructuredText to Markdown.
From 48890a2c5b1b9f0e3f1b7ada38bd4b85329482e6 Mon Sep 17 00:00:00 2001
From: Iris Ho
Date: Mon, 20 Oct 2025 16:07:36 -0700
Subject: [PATCH 03/14] add design rational for backoff

---
 .../transactions-convenient-api.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/source/transactions-convenient-api/transactions-convenient-api.md b/source/transactions-convenient-api/transactions-convenient-api.md
index 13c187adc5..0c32812be4 100644
--- a/source/transactions-convenient-api/transactions-convenient-api.md
+++ b/source/transactions-convenient-api/transactions-convenient-api.md
@@ -356,6 +356,16 @@ non-configurable default and is intentionally twice the value of MongoDB 4.0's d
 parameter (60 seconds). Applications that desire longer retry periods may call `withTransaction` additional times as
 needed. Applications that desire shorter retry periods should not use this method.
 
+### Backoff Benefits
+
+Previously, the driver would retry transactions immediately, which is fine for low levels of contention. But, as the
+server load increases, immediate retries can result in retry storms, unnecessarily further overloading the server.
+
+Exponential backoff is well-researched and accepted backoff strategy that is simple to implement. A low initial backoff
+(1-millisecond) and growth value (1.25x) were chosen specifically to mitigate latency in low levels of contention.
+Empirical evidence suggests that 500-millisecond max backoff ensured that a transaction did not wait so long as to
+exceed the 120-second timeout and reduced load spikes.
+
 ## Backwards Compatibility
 
 The specification introduces a new method on the ClientSession class and does not introduce any backward breaking
From 71ba1babe797f00e06b8ad98f442a097ed720e78 Mon Sep 17 00:00:00 2001
From: Iris Ho
Date: Thu, 23 Oct 2025 10:19:51 -0700
Subject: [PATCH 04/14] fix prose test

---
 source/transactions-convenient-api/tests/README.md | 8 ++++++--
 .../transactions-convenient-api.md | 11 +++++++----
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/source/transactions-convenient-api/tests/README.md b/source/transactions-convenient-api/tests/README.md
index dab2dfc544..78e3a3eafc 100644
--- a/source/transactions-convenient-api/tests/README.md
+++ b/source/transactions-convenient-api/tests/README.md
@@ -43,8 +43,12 @@ private API or using a mock timer.
 
 ### Retry Backoff is Enforced
 
-Drivers should test that retries within `withTransaction` do not occur immediately. Ideally, set BACKOFF_INITIAL 500ms
-and configure a failpoint that forces one retry. Ensure that the operation took more than 500ms so succeed.
+Drivers should test that retries within `withTransaction` do not occur immediately. Configure a fail point that forces 3
+retries. Ensure that:
+
+- 3 backoffs occurred
+- each backoff was greater than or equal to 0
+- the total operation time took more than the sum of the individual backoffs
 
 ## Changelog
 
diff --git a/source/transactions-convenient-api/transactions-convenient-api.md b/source/transactions-convenient-api/transactions-convenient-api.md
index 0c32812be4..df679859ec 100644
--- a/source/transactions-convenient-api/transactions-convenient-api.md
+++ b/source/transactions-convenient-api/transactions-convenient-api.md
@@ -138,7 +138,7 @@ This method should perform the following sequence of actions:
       3. BACKOFF_INITIAL is 1ms
       4. BACKOFF_MAX is 500ms
 
-      Then, jump back to step two.
+      Append this sleep duration to a list for testing purposes. Then, jump back to step two.
 
    3. If the callback's error includes a "UnknownTransactionCommitResult" label, the callback must have manually
       committed a transaction, propagate the callback's error to the caller of `withTransaction` and return
@@ -170,12 +170,15 @@ var BACKOFF_MAX = 500 // 500ms max backoff
 withTransaction(callback, options) {
     // Note: drivers SHOULD use a monotonic clock to determine elapsed time
     var startTime = Date.now(); // milliseconds since Unix epoch
-    var retry = 0
+    var retry = 0;
+    this._transaction_retry_backoffs = []; // for testing purposes
 
     retryTransaction: while (true) {
         if (retry > 0):
-            sleep(Math.random() * min(BACKOFF_INITIAL * (1.25**retry),
-                BACKOFF_MAX))
+            var backoff = Math.random() * min(BACKOFF_INITIAL * (1.25**retry),
+                BACKOFF_MAX)
+            this._transaction_retry_backoffs.push(backoff)
+            sleep(backoff)
         retry += 1
         this.startTransaction(options); // may throw on error
 
From b6026066bec1077b353e7a1db163c2fb14b8f6c1 Mon Sep 17 00:00:00 2001
From: Iris Ho
Date: Mon, 27 Oct 2025 15:07:36 -0700
Subject: [PATCH 05/14] fix pseudocode

---
 .../transactions-convenient-api.md | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/source/transactions-convenient-api/transactions-convenient-api.md b/source/transactions-convenient-api/transactions-convenient-api.md
index df679859ec..565523716a 100644
--- a/source/transactions-convenient-api/transactions-convenient-api.md
+++ b/source/transactions-convenient-api/transactions-convenient-api.md
@@ -131,14 +131,16 @@ This method should perform the following sequence of actions:
       [abortTransaction](../transactions/transactions.md#aborttransaction) on the session.
 
    2.
If the callback's error includes a "TransientTransactionError" label and the elapsed time of `withTransaction` is - less than 120 seconds, sleep for `jitter * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX)` where: + less than 120 seconds, calculate the backoff value to be + `jitter * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX)` where: 1. jitter is a random float between \[0, 1) 2. retry is one less than the number of times Step 2 has been executed since Step 1 was executed 3. BACKOFF_INITIAL is 1ms 4. BACKOFF_MAX is 500ms - Append this sleep duration to a list for testing purposes. Then, jump back to step two. + If the time elapsed thus far plus the backoff value would not exceed 120 seconds, then sleep for the backoff + value and jump back to step two, otherwise, raise last known error. 3. If the callback's error includes a "UnknownTransactionCommitResult" label, the callback must have manually committed a transaction, propagate the callback's error to the caller of `withTransaction` and return @@ -171,20 +173,23 @@ withTransaction(callback, options) { // Note: drivers SHOULD use a monotonic clock to determine elapsed time var startTime = Date.now(); // milliseconds since Unix epoch var retry = 0; - this._transaction_retry_backoffs = []; // for testing purposes retryTransaction: while (true) { - if (retry > 0): + if (retry > 0) { var backoff = Math.random() * min(BACKOFF_INITIAL * (1.25**retry), - BACKOFF_MAX) - this._transaction_retry_backoffs.push(backoff) - sleep(backoff) + BACKOFF_MAX); + if (Date.now() + backoff - startTime >= 120000) { + throw last_error; + } + sleep(backoff); + } retry += 1 this.startTransaction(options); // may throw on error try { callback(this); } catch (error) { + var last_error = error; if (this.transactionState == STARTING || this.transactionState == IN_PROGRESS) { this.abortTransaction(); From 057fbbf1b25cc5dc88ac1a8788ee336d2491de77 Mon Sep 17 00:00:00 2001 From: Iris Ho Date: Tue, 28 Oct 2025 14:46:54 -0700 Subject: [PATCH 
06/14] fix test

---
 source/transactions-convenient-api/tests/README.md | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/source/transactions-convenient-api/tests/README.md b/source/transactions-convenient-api/tests/README.md
index 78e3a3eafc..4e0dd59f3c 100644
--- a/source/transactions-convenient-api/tests/README.md
+++ b/source/transactions-convenient-api/tests/README.md
@@ -43,12 +43,9 @@ private API or using a mock timer.
 
 ### Retry Backoff is Enforced
 
-Drivers should test that retries within `withTransaction` do not occur immediately. Configure a fail point that forces 3
-retries. Ensure that:
-
-- 3 backoffs occurred
-- each backoff was greater than or equal to 0
-- the total operation time took more than the sum of the individual backoffs
+Drivers should test that retries within `withTransaction` do not occur immediately. Optionally, set BACKOFF_INITIAL to a
+higher value to decrease flakiness of this test. Configure a fail point that forces 30 retries. Check that the total
+time for all retries exceeded 1.25 seconds.
 
 ## Changelog
 
From 42e4d94146cee87c3ef66539479185c3d417803c Mon Sep 17 00:00:00 2001
From: Iris Ho
Date: Wed, 29 Oct 2025 15:38:06 -0700
Subject: [PATCH 07/14] account for CSOT / timeoutMS in algorithm

---
 .../transactions-convenient-api.md | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/source/transactions-convenient-api/transactions-convenient-api.md b/source/transactions-convenient-api/transactions-convenient-api.md
index 565523716a..6fb5114074 100644
--- a/source/transactions-convenient-api/transactions-convenient-api.md
+++ b/source/transactions-convenient-api/transactions-convenient-api.md
@@ -131,7 +131,7 @@ This method should perform the following sequence of actions:
       [abortTransaction](../transactions/transactions.md#aborttransaction) on the session.
 
    2. If the callback's error includes a "TransientTransactionError" label and the elapsed time of `withTransaction` is
-      less than 120 seconds, calculate the backoff value to be
+      less than 120 seconds, calculate the backoffMS to be
       `jitter * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX)` where:
 
       1. jitter is a random float between \[0, 1)
@@ -139,8 +139,8 @@ This method should perform the following sequence of actions:
       3. BACKOFF_INITIAL is 1ms
       4. BACKOFF_MAX is 500ms
 
-      If the time elapsed thus far plus the backoff value would not exceed 120 seconds, then sleep for the backoff
-      value and jump back to step two, otherwise, raise last known error.
+      If timeoutMS is set and remainingTimeMS < backoffMS or timoutMS is not set and elapsed time + backoffMS > 120
+      seconds then, raise last known error. Otherwise, sleep for backoffMS and jump back to step two.
 
    3. If the callback's error includes a "UnknownTransactionCommitResult" label, the callback must have manually
       committed a transaction, propagate the callback's error to the caller of `withTransaction` and return
@@ -178,7 +178,10 @@ withTransaction(callback, options) {
         if (retry > 0) {
             var backoff = Math.random() * min(BACKOFF_INITIAL * (1.25**retry),
                 BACKOFF_MAX);
-            if (Date.now() + backoff - startTime >= 120000) {
+            if (timeoutMS is None) {
+                timeoutMS = 120000
+            }
+            if (Date.now() + backoff - startTime >= timeoutMS) {
                 throw last_error;
             }
             sleep(backoff);
From c11aef816018cf5f2732408c9b8287088a3a8c69 Mon Sep 17 00:00:00 2001
From: Iris Ho
Date: Wed, 29 Oct 2025 17:05:00 -0700
Subject: [PATCH 08/14] add more details to tests

---
 .../tests/README.md | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/source/transactions-convenient-api/tests/README.md b/source/transactions-convenient-api/tests/README.md
index 4e0dd59f3c..1181fdbb23 100644
--- a/source/transactions-convenient-api/tests/README.md
+++ b/source/transactions-convenient-api/tests/README.md
@@ -44,8 +44,23 @@ private API or using a mock timer.
 ### Retry Backoff is Enforced
 
 Drivers should test that retries within `withTransaction` do not occur immediately. Optionally, set BACKOFF_INITIAL to a
-higher value to decrease flakiness of this test. Configure a fail point that forces 30 retries. Check that the total
-time for all retries exceeded 1.25 seconds.
+higher value to decrease flakiness of this test. Configure a fail point that forces 30 retries like so:
+
+```json
+{
+  "configureFailPoint": "failCommand",
+  "mode": {
+    "times": 30
+  },
+  "data": {
+    "failCommands": ["commitTransaction"],
+    "errorCode": 24,
+  },
+}
+```
+
+Additionally, let the callback for the transaction be a simple `insertOne` command. Check that the total time for all
+retries exceeded 1.25 seconds.
 
 ## Changelog
 
From a6b7b95d5243f2860a76d7feaa460ecdca8143b4 Mon Sep 17 00:00:00 2001
From: Iris Ho
Date: Wed, 12 Nov 2025 13:43:53 -0800
Subject: [PATCH 09/14] add second test that is deterministic

---
 .../tests/README.md | 29 ++++++++++++++++---
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/source/transactions-convenient-api/tests/README.md b/source/transactions-convenient-api/tests/README.md
index 1181fdbb23..05657b9b3c 100644
--- a/source/transactions-convenient-api/tests/README.md
+++ b/source/transactions-convenient-api/tests/README.md
@@ -41,10 +41,31 @@ If possible, drivers should implement these tests without requiring the test run
 the retry timeout. This might be done by internally modifying the timeout value used by `withTransaction` with some
 private API or using a mock timer.
 
+### Retry Backoff is Random
+
+Drivers should test that retries within `withTransaction` do not occur immediately. Configure a fail point that forces
+30 retries like so:
+
+```json
+{
+  "configureFailPoint": "failCommand",
+  "mode": {
+    "times": 30
+  },
+  "data": {
+    "failCommands": ["commitTransaction"],
+    "errorCode": 24,
+  },
+}
+```
+
+Let the callback for the transaction be a simple `insertOne` command. Check that the total time for all retries exceeded
+3.5 seconds.
+
 ### Retry Backoff is Enforced
 
-Drivers should test that retries within `withTransaction` do not occur immediately. Optionally, set BACKOFF_INITIAL to a
-higher value to decrease flakiness of this test. Configure a fail point that forces 30 retries like so:
+Drivers should test that retries within `withTransaction` do not occur immediately. Configure the random number
+generator used for jitter to always return `1`. Configure a fail point that forces 30 retries like so:
 
 ```json
 {
@@ -59,8 +80,8 @@ higher value to decrease flakiness of this test. Configure a fail point that for
 }
 ```
 
-Additionally, let the callback for the transaction be a simple `insertOne` command. Check that the total time for all
-retries exceeded 1.25 seconds.
+Let the callback for the transaction be a simple `insertOne` command. Check that the total time for all retries exceeded
+3.5 seconds.
 
 ## Changelog
 
From f40529ffd248ee7046184b70fa01abdef544b33f Mon Sep 17 00:00:00 2001
From: Iris Ho
Date: Thu, 13 Nov 2025 15:51:32 -0800
Subject: [PATCH 10/14] change test to use no backoff as baseline time and ensure with backoff takes longer

---
 .../tests/README.md | 40 ++++++-------------
 1 file changed, 13 insertions(+), 27 deletions(-)

diff --git a/source/transactions-convenient-api/tests/README.md b/source/transactions-convenient-api/tests/README.md
index 05657b9b3c..62758b57ce 100644
--- a/source/transactions-convenient-api/tests/README.md
+++ b/source/transactions-convenient-api/tests/README.md
@@ -41,47 +41,33 @@ If possible, drivers should implement these tests without requiring the test run
 the retry timeout. This might be done by internally modifying the timeout value used by `withTransaction` with some
 private API or using a mock timer.
 
-### Retry Backoff is Random
-
-Drivers should test that retries within `withTransaction` do not occur immediately. Configure a fail point that forces
Configure a fail point that forces -30 retries like so: - -```json -{ - "configureFailPoint": "failCommand", - "mode": { - "times": 30 - }, - "data": { - "failCommands": ["commitTransaction"], - "errorCode": 24, - }, -} -``` - -Let the callback for the transaction be a simple `insertOne` command. Check that the total time for all retries exceeded -3.5 seconds. - ### Retry Backoff is Enforced -Drivers should test that retries within `withTransaction` do not occur immediately. Configure the random number -generator used for jitter to always return `1`. Configure a fail point that forces 30 retries like so: +Drivers should test that retries within `withTransaction` do not occur immediately. First, run transactions without +backoff. To do so, configure the random number generator used for jitter to always return `0` -- this effectively +disables backoff. Then, configure a fail point that forces 30 retries like so: ```json { "configureFailPoint": "failCommand", "mode": { - "times": 30 + "times": 13 }, "data": { "failCommands": ["commitTransaction"], - "errorCode": 24, + "errorCode": 251, // NoSuchTransaction }, } ``` -Let the callback for the transaction be a simple `insertOne` command. Check that the total time for all retries exceeded -3.5 seconds. +Let the callback for the transaction be a simple `insertOne` command. Let `no_backoff_time` be the time it took for the +command to succeed. + +Next, we will run the transactions again with backoff. Configure the random number generator used for jitter to always +return `1`. Set the fail point to force 13 retries using the same command as before. Using the same callback as before, +check that the total time for the withTransaction command is within +/-1 second of `no_backoff_time` plus 2.2 seconds. +Note that 2.2 seconds is the sum of backoff 13 consecutive backoff values and the 1-second window is just to account for +potential networking differences between the two runs. 
 
 ## Changelog
 
From 56e88f080ba5c8461bad92cd206ac47360957cbe Mon Sep 17 00:00:00 2001
From: Iris Ho
Date: Tue, 18 Nov 2025 16:43:48 -0800
Subject: [PATCH 11/14] address comments pt 1

---
 source/transactions-convenient-api/tests/README.md | 8 ++++----
 .../transactions-convenient-api.md | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/source/transactions-convenient-api/tests/README.md b/source/transactions-convenient-api/tests/README.md
index 62758b57ce..6fd14ce962 100644
--- a/source/transactions-convenient-api/tests/README.md
+++ b/source/transactions-convenient-api/tests/README.md
@@ -60,14 +60,14 @@ disables backoff. Then, configure a fail point that forces 30 retries like so:
 }
 ```
 
-Let the callback for the transaction be a simple `insertOne` command. Let `no_backoff_time` be the time it took for the
-command to succeed.
+Let the callback for the transaction be a simple `insertOne` withTransaction call. Let `no_backoff_time` be the time it
+took for the command to succeed.
 
 Next, we will run the transactions again with backoff. Configure the random number generator used for jitter to always
 return `1`. Set the fail point to force 13 retries using the same command as before. Using the same callback as before,
 check that the total time for the withTransaction command is within +/-1 second of `no_backoff_time` plus 2.2 seconds.
-Note that 2.2 seconds is the sum of backoff 13 consecutive backoff values and the 1-second window is just to account for
-potential networking differences between the two runs.
+Note that 2.2 seconds is the sum of backoff 13 consecutive backoff values and the 1-second window accounts for potential
+variance between the two runs.
## Changelog diff --git a/source/transactions-convenient-api/transactions-convenient-api.md b/source/transactions-convenient-api/transactions-convenient-api.md index 6fb5114074..65330cff2f 100644 --- a/source/transactions-convenient-api/transactions-convenient-api.md +++ b/source/transactions-convenient-api/transactions-convenient-api.md @@ -132,11 +132,11 @@ This method should perform the following sequence of actions: 2. If the callback's error includes a "TransientTransactionError" label and the elapsed time of `withTransaction` is less than 120 seconds, calculate the backoffMS to be - `jitter * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX)` where: + `jitter * min(BACKOFF_INITIAL * (1.5**retry), BACKOFF_MAX)` where: 1. jitter is a random float between \[0, 1) 2. retry is one less than the number of times Step 2 has been executed since Step 1 was executed - 3. BACKOFF_INITIAL is 1ms + 3. BACKOFF_INITIAL is 5ms 4. BACKOFF_MAX is 500ms If timeoutMS is set and remainingTimeMS < backoffMS or timoutMS is not set and elapsed time + backoffMS > 120 From bc9153eb4375e0b3a77ba8ecccd96af4432c3907 Mon Sep 17 00:00:00 2001 From: Iris Ho Date: Tue, 18 Nov 2025 17:31:33 -0800 Subject: [PATCH 12/14] address comments pt 2 --- .../tests/README.md | 88 +++++++++++++------ .../transactions-convenient-api.md | 41 +++++---- 2 files changed, 84 insertions(+), 45 deletions(-) diff --git a/source/transactions-convenient-api/tests/README.md b/source/transactions-convenient-api/tests/README.md index 6fd14ce962..e78263dc8c 100644 --- a/source/transactions-convenient-api/tests/README.md +++ b/source/transactions-convenient-api/tests/README.md @@ -43,35 +43,71 @@ private API or using a mock timer. ### Retry Backoff is Enforced -Drivers should test that retries within `withTransaction` do not occur immediately. First, run transactions without -backoff. To do so, configure the random number generator used for jitter to always return `0` -- this effectively -disables backoff. 
Then, configure a fail point that forces 30 retries like so: - -```json -{ - "configureFailPoint": "failCommand", - "mode": { - "times": 13 - }, - "data": { - "failCommands": ["commitTransaction"], - "errorCode": 251, // NoSuchTransaction - }, -} -``` - -Let the callback for the transaction be a simple `insertOne` withTransaction call. Let `no_backoff_time` be the time it -took for the command to succeed. - -Next, we will run the transactions again with backoff. Configure the random number generator used for jitter to always -return `1`. Set the fail point to force 13 retries using the same command as before. Using the same callback as before, -check that the total time for the withTransaction command is within +/-1 second of `no_backoff_time` plus 2.2 seconds. -Note that 2.2 seconds is the sum of backoff 13 consecutive backoff values and the 1-second window accounts for potential -variance between the two runs. +Drivers should test that retries within `withTransaction` do not occur immediately. + +1. let `client` be a `MongoClient` +2. let `coll` be a collection +3. Now, run transactions without backoff: + 1. Configure the random number generator used for jitter to always return `0` -- this effectively disables backoff. + + 2. Configure a fail point that forces 13 retries like so: + + ```python + set_fail_point( + { + "configureFailPoint": "failCommand", + "mode": { + "times": 13 + }, # sufficiently high enough such that the time effect of backoff is noticeable + "data": { + "failCommands": ["commitTransaction"], + "errorCode": 251, + }, + } + ) + ``` + + > [!NOTE] + > errorCode 251 is NoSuchTransaction. + + 3. Define the callback for the transaction as follows: + + ```python + def callback(session): + coll.insert_one({}, session=session) + ``` + + 4. 
Let `no_backoff_time` be the duration of the withTransaction API call: + + ```python + start = time.monotonic() + with client.start_session() as s: + s.with_transaction(callback) + end = time.monotonic() + no_backoff_time = end - start + ``` +4. Now, run transactions with backoff: + 1. Configure the random number generator used for jitter to always return `1`. + 2. Configure a fail point that forces 13 retries as in step 3.2. + 3. Use the same callback defined in 3.3. + 4. Let `with_backoff_time` be the duration of the withTransaction API call: + ```python + start = time.monotonic() + with client.start_session() as s: + s.with_transaction(callback) + end = time.monotonic() + with_backoff_time = end - start + ``` +5. Compare the durations of the two runs: + ```python + assert abs(with_backoff_time - (no_backoff_time + 2.2)) < 1 # seconds + ``` + With jitter fixed at `1`, the sum of the 13 backoffs is roughly 2.2 seconds. There is a 1-second window to account for potential variance between + the two runs. ## Changelog -- 2025-10-17: Added Backoff test. +- 2025-11-18: Added Backoff test. - 2024-09-06: Migrated from reStructuredText to Markdown. - 2024-02-08: Converted legacy tests to unified format. - 2021-04-29: Remove text about write concern timeouts from prose test. diff --git a/source/transactions-convenient-api/transactions-convenient-api.md b/source/transactions-convenient-api/transactions-convenient-api.md index 65330cff2f..3147e06346 100644 --- a/source/transactions-convenient-api/transactions-convenient-api.md +++ b/source/transactions-convenient-api/transactions-convenient-api.md @@ -114,7 +114,11 @@ needed (e.g. user data to pass as a parameter to the callback). This method should perform the following sequence of actions: -1. Record the current monotonic time, which will be used to enforce the 120-second timeout before later retry attempts. +1. Define the following: + 1.
Record the current monotonic time, which will be used to enforce the 120-second / CSOT timeout before later retry + attempts. + 2. Set `retry` to `0`. This will be used for backoff later in step 7.2. + 3. Set `TIMEOUT_MS` to be `timeoutMS` if given, otherwise 120 seconds. 2. Invoke [startTransaction](../transactions/transactions.md#starttransaction) on the session. If TransactionOptions were specified in the call to `withTransaction`, those MUST be used for `startTransaction`. Note that `ClientSession.defaultTransactionOptions` will be used in the absence of any explicit TransactionOptions. @@ -131,16 +135,16 @@ This method should perform the following sequence of actions: [abortTransaction](../transactions/transactions.md#aborttransaction) on the session. 2. If the callback's error includes a "TransientTransactionError" label and the elapsed time of `withTransaction` is - less than 120 seconds, calculate the backoffMS to be - `jitter * min(BACKOFF_INITIAL * (1.5**retry), BACKOFF_MAX)` where: + less than `TIMEOUT_MS`, calculate `backoffMS` as `jitter * min(BACKOFF_INITIAL * (1.5**retry), BACKOFF_MAX)` + where: 1. jitter is a random float between \[0, 1) - 2. retry is one less than the number of times Step 2 has been executed since Step 1 was executed - 3. BACKOFF_INITIAL is 5ms - 4. BACKOFF_MAX is 500ms + 2. `retry` is the variable defined in step 1.2. + 3. `BACKOFF_INITIAL` is 5ms + 4. `BACKOFF_MAX` is 500ms - If timeoutMS is set and remainingTimeMS < backoffMS or timoutMS is not set and elapsed time + backoffMS > 120 - seconds then, raise last known error. Otherwise, sleep for backoffMS and jump back to step two. + If elapsed time + `backoffMS` > `TIMEOUT_MS`, then raise the last known error. Otherwise, sleep for `backoffMS`, + increment `retry`, and jump back to step two. 3.
If the callback's error includes a "UnknownTransactionCommitResult" label, the callback must have manually committed a transaction, propagate the callback's error to the caller of `withTransaction` and return @@ -152,12 +156,12 @@ This method should perform the following sequence of actions: 8. Invoke [commitTransaction](../transactions/transactions.md#committransaction) on the session. 9. If `commitTransaction` reported an error: 1. If the `commitTransaction` error includes a "UnknownTransactionCommitResult" label and the error is not - MaxTimeMSExpired and the elapsed time of `withTransaction` is less than 120 seconds, jump back to step eight. - We will trust `commitTransaction` to apply a majority write concern on retry attempts (see: + MaxTimeMSExpired and the elapsed time of `withTransaction` is less than TIMEOUT_MS, jump back to step eight. We + will trust `commitTransaction` to apply a majority write concern on retry attempts (see: [Majority write concern is used when retrying commitTransaction](#majority-write-concern-is-used-when-retrying-committransaction)). 2. If the `commitTransaction` error includes a "TransientTransactionError" label and the elapsed time of - `withTransaction` is less than 120 seconds, jump back to step two. + `withTransaction` is less than TIMEOUT_MS, jump back to step two. 3. Otherwise, propagate the `commitTransaction` error to the caller of `withTransaction` and return immediately. 10. The transaction was committed successfully. Return immediately. @@ -172,16 +176,15 @@ var BACKOFF_MAX = 500 // 500ms max backoff withTransaction(callback, options) { // Note: drivers SHOULD use a monotonic clock to determine elapsed time var startTime = Date.now(); // milliseconds since Unix epoch + var TIMEOUT_MS = timeoutMS is None ? 
120000 : TIMEOUT_MS var retry = 0; retryTransaction: while (true) { if (retry > 0) { var backoff = Math.random() * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX); - if (timeoutMS is None) { - timeoutMS = 120000 - } - if (Date.now() + backoff - startTime >= timeoutMS) { + + if (Date.now() + backoff - startTime >= TIMEOUT_MS) { throw last_error; } sleep(backoff); @@ -199,7 +202,7 @@ withTransaction(callback, options) { } if (error.hasErrorLabel("TransientTransactionError") && - Date.now() - startTime < 120000) { + Date.now() - startTime < TIMEOUT_MS) { continue retryTransaction; } @@ -227,12 +230,12 @@ withTransaction(callback, options) { */ if (!isMaxTimeMSExpiredError(error) && error.hasErrorLabel("UnknownTransactionCommitResult") && - Date.now() - startTime < 120000) { + Date.now() - startTime < TIMEOUT_MS) { continue retryCommit; } if (error.hasErrorLabel("TransientTransactionError") && - Date.now() - startTime < 120000) { + Date.now() - startTime < TIMEOUT_MS) { continue retryTransaction; } @@ -396,7 +399,7 @@ provides an implementation of a technique already described in the MongoDB 4.0 d ## Changelog -- 2025-10-17: withTransaction applies exponential backoff when retrying. +- 2025-11-18: withTransaction applies exponential backoff when retrying. - 2024-09-06: Migrated from reStructuredText to Markdown. 
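The backoff schedule in step 7.2 and the "roughly 2.2 seconds" figure in the README test can be checked numerically. The sketch below is a minimal illustration, not driver code. It assumes the constants stated in the spec text (`BACKOFF_INITIAL` = 5 ms, growth factor 1.5, `BACKOFF_MAX` = 500 ms), and the helper names (`backoff_ms`, `sample_backoff_ms`, `total_backoff_ms`) are hypothetical. With jitter fixed at `1`, 13 consecutive backoffs sum to about 1.79 s if the first retry uses `retry = 0`, as in the step 7.2 prose. They sum to about 2.28 s if the first retry uses `retry = 1`, as the pseudocode's `if (retry > 0)` guard implies when `retry` is incremented once per attempt. Both totals fall inside the test's 1-second window around 2.2 s.

```python
import random

# Constants taken from the spec text; all durations are in milliseconds.
BACKOFF_INITIAL_MS = 5
BACKOFF_MAX_MS = 500
GROWTH_FACTOR = 1.5


def backoff_ms(retry: int, jitter: float) -> float:
    """jitter * min(BACKOFF_INITIAL * 1.5**retry, BACKOFF_MAX)."""
    return jitter * min(BACKOFF_INITIAL_MS * GROWTH_FACTOR**retry, BACKOFF_MAX_MS)


def sample_backoff_ms(retry: int) -> float:
    """Production-style draw: jitter is a random float in [0, 1)."""
    return backoff_ms(retry, random.random())


def total_backoff_ms(retries: int, first_retry: int, jitter: float = 1.0) -> float:
    """Sum of `retries` consecutive backoffs, starting at retry index `first_retry`."""
    return sum(backoff_ms(r, jitter) for r in range(first_retry, first_retry + retries))


if __name__ == "__main__":
    # The 500 ms cap first applies at retry 12, where 5 * 1.5**12 is about 649 ms.
    print(round(total_backoff_ms(13, first_retry=0), 1))  # 1787.5
    print(round(total_backoff_ms(13, first_retry=1), 1))  # 2282.5
    print(total_backoff_ms(13, first_retry=0, jitter=0.0))  # 0.0
```

Forcing jitter to `0` makes every backoff zero, which is why the README's first run serves as the no-backoff baseline.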
From 5c74235a134b9465feb5b3eb242fae24dbf2f5eb Mon Sep 17 00:00:00 2001 From: Iris Ho Date: Thu, 20 Nov 2025 14:51:49 -0800 Subject: [PATCH 13/14] update constants (oops) and CSOT changes --- .../transactions-convenient-api.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/source/transactions-convenient-api/transactions-convenient-api.md b/source/transactions-convenient-api/transactions-convenient-api.md index 3147e06346..719568db5a 100644 --- a/source/transactions-convenient-api/transactions-convenient-api.md +++ b/source/transactions-convenient-api/transactions-convenient-api.md @@ -171,20 +171,21 @@ This method should perform the following sequence of actions: This method can be expressed by the following pseudo-code: ```typescript -var BACKOFF_INITIAL = 1 // 1ms initial backoff +var BACKOFF_INITIAL = 5 // 5ms initial backoff var BACKOFF_MAX = 500 // 500ms max backoff withTransaction(callback, options) { // Note: drivers SHOULD use a monotonic clock to determine elapsed time var startTime = Date.now(); // milliseconds since Unix epoch - var TIMEOUT_MS = timeoutMS is None ? 120000 : TIMEOUT_MS + // See the CSOT specification for information on calculating timeoutMS for a convenient transaction API call. + var timeout = getCSOTTimeoutIfSet() ?? 
120_000; var retry = 0; retryTransaction: while (true) { if (retry > 0) { - var backoff = Math.random() * min(BACKOFF_INITIAL * (1.25**retry), + var backoff = Math.random() * min(BACKOFF_INITIAL * (1.5**retry), BACKOFF_MAX); - if (Date.now() + backoff - startTime >= TIMEOUT_MS) { + if (Date.now() + backoff - startTime >= timeout) { throw last_error; } sleep(backoff); @@ -202,7 +203,7 @@ withTransaction(callback, options) { } if (error.hasErrorLabel("TransientTransactionError") && - Date.now() - startTime < TIMEOUT_MS) { + Date.now() - startTime < timeout) { continue retryTransaction; } @@ -230,12 +231,12 @@ withTransaction(callback, options) { */ if (!isMaxTimeMSExpiredError(error) && error.hasErrorLabel("UnknownTransactionCommitResult") && - Date.now() - startTime < TIMEOUT_MS) { + Date.now() - startTime < timeout) { continue retryCommit; } if (error.hasErrorLabel("TransientTransactionError") && - Date.now() - startTime < TIMEOUT_MS) { + Date.now() - startTime < timeout) { continue retryTransaction; } @@ -399,7 +400,7 @@ provides an implementation of a technique already described in the MongoDB 4.0 d ## Changelog -- 2025-11-18: withTransaction applies exponential backoff when retrying. +- 2025-11-20: withTransaction applies exponential backoff when retrying. - 2024-09-06: Migrated from reStructuredText to Markdown. 
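The updated retry loop can be exercised without a server, in the spirit of the README's suggestion to use a mock timer. The sketch below is a hypothetical, heavily simplified model rather than the spec's algorithm. `TransientError` stands in for an error carrying the "TransientTransactionError" label, `FakeClock` replaces the monotonic clock and `sleep`, and the session handling and commit-retry path are omitted. It follows the step 7.2 prose, so the first retry uses `retry = 0`. When no CSOT timeout is supplied, it defaults the timeout to 120 000 ms, mirroring `getCSOTTimeoutIfSet() ?? 120_000`.

```python
from typing import Callable, Optional

BACKOFF_INITIAL_MS = 5
BACKOFF_MAX_MS = 500
DEFAULT_TIMEOUT_MS = 120_000


class TransientError(Exception):
    """Hypothetical stand-in for an error labeled TransientTransactionError."""


class FakeClock:
    """Mock monotonic clock: sleep() advances time instantly."""

    def __init__(self) -> None:
        self.now_ms = 0.0

    def sleep(self, ms: float) -> None:
        self.now_ms += ms


def with_transaction(attempt: Callable[[], None], clock: FakeClock,
                     jitter: Callable[[], float],
                     timeout_ms: Optional[float] = None) -> int:
    """Run `attempt`, retrying on TransientError with jittered exponential backoff.

    Returns the number of retries performed. Re-raises the last error once
    sleeping for the next backoff would exceed the timeout.
    """
    timeout = timeout_ms if timeout_ms is not None else DEFAULT_TIMEOUT_MS
    start = clock.now_ms
    retry = 0
    while True:
        try:
            attempt()
            return retry
        except TransientError:
            elapsed = clock.now_ms - start
            backoff = jitter() * min(BACKOFF_INITIAL_MS * 1.5**retry, BACKOFF_MAX_MS)
            # Give up rather than sleep past the deadline.
            if elapsed >= timeout or elapsed + backoff > timeout:
                raise
            clock.sleep(backoff)
            retry += 1


def fail_n_times(n: int) -> Callable[[], None]:
    """Build an attempt that raises TransientError on its first n calls."""
    state = {"remaining": n}

    def attempt() -> None:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientError()

    return attempt


if __name__ == "__main__":
    for j in (0.0, 1.0):
        clock = FakeClock()
        retries = with_transaction(fail_n_times(13), clock, jitter=lambda j=j: j)
        print(j, retries, round(clock.now_ms, 1))
```

In this model, forcing jitter to `0` leaves the fake clock at 0 ms after 13 retries. Forcing it to `1` advances the clock by about 1787 ms. The README's backoff test measures the same kind of difference against a real server.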
From 1f04505c85cb72dfe921f1ee70033fc9f07ed336 Mon Sep 17 00:00:00 2001 From: Iris Ho Date: Thu, 20 Nov 2025 14:59:35 -0800 Subject: [PATCH 14/14] gh fancy alerts can't be indented ;-; --- source/transactions-convenient-api/tests/README.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/source/transactions-convenient-api/tests/README.md b/source/transactions-convenient-api/tests/README.md index e78263dc8c..c990c432d2 100644 --- a/source/transactions-convenient-api/tests/README.md +++ b/source/transactions-convenient-api/tests/README.md @@ -67,8 +67,7 @@ Drivers should test that retries within `withTransaction` do not occur immediate ) ``` - > [!NOTE] - > errorCode 251 is NoSuchTransaction. + > Note: errorCode 251 is NoSuchTransaction. 3. Define the callback for the transaction as follows: