@@ -284,6 +284,14 @@ Endpoint. The pool has the following properties:
- **Rate-limited:** A Pool MUST limit the number of [Connections](#connection) being
[established](#establishing-a-connection-internal-implementation) concurrently via the **maxConnecting**
[pool option](#connection-pool-options).
- **Backpressure-enabled:** The pool MUST add the error labels `SystemOverloadedError` and `RetryableError` to network
errors or network timeouts it encounters during connection establishment or the `hello` message. The
[SDAM error handling](../server-discovery-and-monitoring/server-discovery-and-monitoring.md#error-handling-pseudocode)
uses these labels to avoid clearing the pool. The pool MUST NOT add the backpressure error labels during an
authentication step after the `hello` message. For errors that the driver can distinguish as never occurring due to
server overload, such as DNS lookup failures, TLS-related errors, or errors encountered establishing a connection to a
SOCKS5 proxy, the driver MUST clear the connection pool and MUST mark the server Unknown.
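
The labeling rule in the backpressure bullet can be sketched roughly as follows. This is an illustrative reading of that bullet, not driver code: `Phase`, `ErrorKind`, and `backpressure_labels` are hypothetical names introduced only for this example.

```python
from enum import Enum, auto
from typing import FrozenSet


class Phase(Enum):
    """Hypothetical phases of setting up a pooled connection."""
    ESTABLISHMENT = auto()  # opening the socket
    HELLO = auto()          # the hello / legacy hello message
    AUTHENTICATION = auto() # authentication step after hello


class ErrorKind(Enum):
    """Hypothetical error classification used only in this sketch."""
    NETWORK_ERROR = auto()
    NETWORK_TIMEOUT = auto()
    DNS_FAILURE = auto()
    TLS_ERROR = auto()
    SOCKS5_PROXY_ERROR = auto()


BACKPRESSURE_LABELS = frozenset({"SystemOverloadedError", "RetryableError"})

# Errors the driver can tell are never caused by server overload.
NEVER_OVERLOAD = frozenset(
    {ErrorKind.DNS_FAILURE, ErrorKind.TLS_ERROR, ErrorKind.SOCKS5_PROXY_ERROR}
)


def backpressure_labels(phase: Phase, kind: ErrorKind) -> FrozenSet[str]:
    """Return the labels the pool would attach under this reading of the rule."""
    if kind in NEVER_OVERLOAD:
        return frozenset()  # no labels: pool is cleared, server marked Unknown
    if phase is Phase.AUTHENTICATION:
        return frozenset()  # labels MUST NOT be added during authentication
    if kind in (ErrorKind.NETWORK_ERROR, ErrorKind.NETWORK_TIMEOUT):
        return BACKPRESSURE_LABELS
    return frozenset()
```

An empty result means the error takes the ordinary SDAM path; a non-empty result tells SDAM error handling to leave the pool alone.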

```typescript
interface ConnectionPool {
@@ -461,6 +469,7 @@ try:
return connection
except error:
close connection
add `SystemOverloadedError` label if appropriate (see "backpressure-enabled" in [Connection Pool](#connection-pool))
throw error # Propagate error in manner idiomatic to language.
```

@@ -1375,6 +1384,8 @@ to close and remove from its pool a [Connection](#connection) which has unread e

## Changelog

- 2025-XX-YY: Add handling of backpressure error labels.
Contributor: changelog dates?

Member Author: Done

- 2025-01-22: Clarify durationMS in logs may be Int32/Int64/Double.

- 2024-11-27: Relaxed the WaitQueue fairness requirement.
@@ -11,7 +11,7 @@ failPoint:
mode: { times: 50 }
data:
failCommands: ["isMaster","hello"]
closeConnection: true
errorCode: 91
appName: "poolCreateMinSizeErrorTest"
poolOptions:
minPoolSize: 1
4 changes: 2 additions & 2 deletions source/load-balancers/tests/sdam-error-handling.json
4 changes: 2 additions & 2 deletions source/load-balancers/tests/sdam-error-handling.yml
@@ -153,14 +153,14 @@ tests:
mode: { times: 1 }
data:
failCommands: [isMaster, hello]
closeConnection: true
errorCode: 11600
appName: *singleClientAppName
- name: insertOne
object: *singleColl
arguments:
document: { x: 1 }
expectError:
isClientError: true
isError: true
expectEvents:
- client: *singleClient
eventType: cmap
@@ -172,3 +172,40 @@ This test requires failCommand appName support which is only available in MongoD
5. Then verify that a ServerHeartbeatSucceededEvent and a ConnectionPoolReadyEvent (CMAP) are emitted.

6. Disable the failpoint.

## Connection Pool Backpressure

This test will be used to ensure that connection establishment failures during the TLS handshake do not result in a pool
clear event. We create a setup client to enable the ingress connection establishment rate limiter, and then induce a
connection storm. After the storm, we verify that some of the connections failed to checkout, but that the pool was not
cleared.

This test requires MongoDB 7.0+.

1. Create a test client that listens to CMAP events, with maxConnecting=100. The higher maxConnecting will help ensure
contention for creating connections.

2. Run the following commands to set up the rate limiter.

```python
client.admin.command("setParameter", 1, ingressConnectionEstablishmentRateLimiterEnabled=True)
client.admin.command("setParameter", 1, ingressConnectionEstablishmentRatePerSec=20)
client.admin.command("setParameter", 1, ingressConnectionEstablishmentBurstCapacitySecs=1)
client.admin.command("setParameter", 1, ingressConnectionEstablishmentMaxQueueDepth=1)
```

3. Add a document to the test collection so that the sleep operations will actually block:
`client.test.test.insert_one({})`.

4. Run the following find command on the collection in 100 parallel threads/coroutines. Run these commands concurrently
but block on their completion, and ignore errors raised by the command.
`client.test.test.find_one({"$where": "function() { sleep(2000); return true; }"})`

5. Assert that at least 10 `ConnectionCheckOutFailedEvent` occurred.

6. Assert that 0 `PoolClearedEvent` occurred.

7. Sleep for 1 second to clear the rate limiter.

8. Ensure that the following command runs at test teardown even if the test fails.
`client.admin.command("setParameter", 1, ingressConnectionEstablishmentRateLimiterEnabled=False)`.
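
Steps 4 through 6 could be structured along these lines. This is a driver-agnostic sketch, not part of the test suite: `run_storm` and `backpressure_held` are hypothetical helpers, `operation` stands in for the `find_one` call from step 4, and `event_names` is assumed to be a list of CMAP event class names collected by whatever listener the driver under test provides.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def run_storm(operation: Callable[[], object], workers: int = 100) -> int:
    """Run `operation` once per worker thread, wait for all, ignore errors.

    Returns the number of calls that raised.
    """
    def attempt(_: int) -> bool:
        try:
            operation()
            return True
        except Exception:
            # Step 4: errors raised by the command are ignored.
            return False

    # Leaving the `with` block waits for every submitted call to finish.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(attempt, range(workers)))
    return results.count(False)


def backpressure_held(event_names: Iterable[str]) -> bool:
    """Steps 5 and 6: at least 10 failed checkouts and no pool clears."""
    counts = Counter(event_names)
    return (
        counts["ConnectionCheckOutFailedEvent"] >= 10
        and counts["PoolClearedEvent"] == 0
    )
```

The teardown in step 8 would typically go in a `finally` block or the test framework's cleanup hook so that the rate limiter is disabled even when an assertion fails.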
@@ -434,18 +434,18 @@ correspond to [replica set member states](https://www.mongodb.com/docs/manual/re
some replica set member states like STARTUP and RECOVERING are identical from the client's perspective, so they are
merged into "RSOther". Additionally, states like Standalone and Mongos are not replica set member states at all.

| State | Symptoms |
| --------------- | ------------------------------------------------------------------------------------------------------------------------- |
| Unknown | Initial, or after a network error or failed hello or legacy hello call, or "ok: 1" not in hello or legacy hello response. |
| Standalone | No "msg: isdbgrid", no setName, and no "isreplicaset: true". |
| Mongos | "msg: isdbgrid". |
| PossiblePrimary | Not yet checked, but another member thinks it is the primary. |
| RSPrimary | "isWritablePrimary: true" or "ismaster: true", "setName" in response. |
| RSSecondary | "secondary: true", "setName" in response. |
| RSArbiter | "arbiterOnly: true", "setName" in response. |
| RSOther | "setName" in response, "hidden: true" or not primary, secondary, nor arbiter. |
| RSGhost | "isreplicaset: true" in response. |
| LoadBalanced | "loadBalanced=true" in URI. |
| State | Symptoms |
| --------------- | -------------------------------------------------------------------------------------------------------- |
| Unknown | Initial, or after a failed hello or legacy hello call, or "ok: 1" not in hello or legacy hello response. |
| Standalone | No "msg: isdbgrid", no setName, and no "isreplicaset: true". |
| Mongos | "msg: isdbgrid". |
| PossiblePrimary | Not yet checked, but another member thinks it is the primary. |
| RSPrimary | "isWritablePrimary: true" or "ismaster: true", "setName" in response. |
| RSSecondary | "secondary: true", "setName" in response. |
| RSArbiter | "arbiterOnly: true", "setName" in response. |
| RSOther | "setName" in response, "hidden: true" or not primary, secondary, nor arbiter. |
| RSGhost | "isreplicaset: true" in response. |
| LoadBalanced | "loadBalanced=true" in URI. |

A server can transition from any state to any other. For example, an administrator could shut down a secondary and bring
up a mongos in its place.
@@ -1055,7 +1055,10 @@ def handleError(error):
# next full scan.
if isNotWritablePrimary(error):
check failing server
elif isNetworkError(error) or (not error.completedHandshake and (isNetworkTimeout(error) or isAuthError(error))):
elif isNetworkError(error) or (not error.completedHandshake):
# Ignore errors that have a backpressure error label applied.
if error.hasLabel("SystemOverloadedError"):
return
if type != LoadBalanced
# Mark the server Unknown
unknown = new ServerDescription(type=Unknown, error=error)
@@ -1139,16 +1142,20 @@ errors, network timeout errors, state change errors, and authentication errors.

##### Network error when reading or writing

To describe how the client responds to network errors during application operations, we distinguish two phases of
To describe how the client responds to network errors during application operations, we distinguish three phases of
connecting to a server and using it for application operations:

- *Before the handshake completes*: the client establishes a new connection to the server and completes an initial
handshake by calling "hello" or legacy hello and reading the response, and optionally completing authentication
- *Connection establishment and hello*: the client establishes a new connection to the server and completes an initial
handshake by calling "hello" or legacy hello and reading the response
- *Authentication step*: the client optionally completes an authentication step
- *After the handshake completes*: the client uses the established connection for application operations

If there is a network error or timeout on the connection before the handshake completes, the client MUST replace the
server's description with a default ServerDescription of type Unknown when the TopologyType is not LoadBalanced, and
fill the ServerDescription's error field with useful information.
If there is a network error or timeout on the connection establishment or the hello, the client MUST NOT change the
server's description.

If there is a network error or timeout during the authentication step, the client MUST replace the server's description
with a default ServerDescription of type Unknown when the TopologyType is not LoadBalanced, and fill the
ServerDescription's error field with useful information.

If there is a network error or timeout on the connection before the handshake completes, and the TopologyType is
LoadBalanced, the client MUST keep the ServerDescription as LoadBalancer.
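
Under the revised three-phase wording, the effect of a network error or timeout on the server's type before the handshake completes might be modeled as below. This is a sketch for illustration only: `Phase` and `server_type_after_network_error` are hypothetical names, and server and topology types are represented as plain strings.

```python
from enum import Enum


class Phase(Enum):
    """Hypothetical names for the two pre-handshake phases."""
    ESTABLISHMENT_AND_HELLO = "establishment_and_hello"
    AUTHENTICATION = "authentication"


def server_type_after_network_error(
    phase: Phase, topology_type: str, current_type: str
) -> str:
    """Return the server type after a network error or timeout in `phase`."""
    if topology_type == "LoadBalanced":
        # The description stays LoadBalancer in load-balanced mode.
        return "LoadBalancer"
    if phase is Phase.ESTABLISHMENT_AND_HELLO:
        # The client MUST NOT change the server's description.
        return current_type
    # Authentication step: replace with a default Unknown description.
    return "Unknown"
```

For example, a timeout while reading the `hello` reply from an `RSPrimary` leaves it as `RSPrimary`, while the same timeout during authentication yields `Unknown`.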
@@ -1253,11 +1260,12 @@ if and only if the error is "node is shutting down" or the error originated from
and [other transient errors](#other-transient-errors) and
[Why close connections when a node is shutting down?](#why-close-connections-when-a-node-is-shutting-down).)

##### Authentication and Handshake errors
##### MongoDB Handshake errors

If the driver encounters errors when establishing application connections (this includes the initial handshake and
authentication), the driver MUST mark the server Unknown and clear the server's connection pool if the TopologyType is
not LoadBalanced. (See [Why mark a server Unknown after an auth error?](#why-mark-a-server-unknown-after-an-auth-error))
If the driver encounters errors that do not have the backpressure error label (`SystemOverloadedError`) applied when
establishing application connections (this includes the initial handshake and authentication), the driver MUST mark the
server Unknown and clear the server's connection pool if the TopologyType is not LoadBalanced. (See
[Why mark a server Unknown after an auth error?](#why-mark-a-server-unknown-after-an-auth-error))

### Monitoring SDAM events

@@ -2027,6 +2035,8 @@ oversaw the specification process.
- 2025-01-22: Add error messages when a new primary is elected or a primary with a stale electionId or setVersion is
discovered.

- 2025-XX-YY: Add handling of backpressure error labels.

______________________________________________________________________

[^1]: "localThresholdMS" was called "secondaryAcceptableLatencyMS" in the Read Preferences Spec, before it was superseded
10 changes: 9 additions & 1 deletion source/server-discovery-and-monitoring/server-monitoring.md
@@ -163,7 +163,8 @@ MUST be used to satisfy the check and update the topology.
When a client successfully calls hello or legacy hello to handshake a new connection for application operations, it
SHOULD use the hello or legacy hello reply to update the ServerDescription and TopologyDescription, the same as with a
hello or legacy hello reply on a monitoring socket. If the hello or legacy hello call fails, the client SHOULD mark the
server Unknown and update its TopologyDescription, the same as a failed server check on monitoring socket.
server Unknown and update its TopologyDescription, the same as a failed server check on monitoring socket, unless the
connection pool has added the `SystemOverloadedError` label to the error.

##### Clients use the streaming protocol when supported

@@ -273,6 +274,11 @@ minHeartbeatFrequencyMS (500ms).

(See [heartbeatFrequencyMS in the main SDAM spec](server-discovery-and-monitoring.md#heartbeatFrequencyMS).)

#### Handling of backpressure labels

Because the scan may occur on an authenticated connection, the server may apply backpressure by failing the command with
a `SystemOverloadedError` label. The driver MUST NOT close the connection when this label is encountered.

### Awaitable hello or legacy hello Server Specification

As of MongoDB 4.4 the hello or legacy hello command can wait to reply until there is a topology change or a maximum time
@@ -573,6 +579,8 @@ class Monitor(Thread):
topology.onServerDescriptionChanged(description, connection pool for server)
if description.error != Null:
# Clear the connection pool only after the server description is set to Unknown.
# Note: for single-threaded monitors, only clear if the `SystemOverloadedError` label is not applied to the
# error.
clear(interruptInUseConnections: isNetworkTimeout(description.error)) connection pool for server

# Immediately proceed to the next check if the previous response