Centralize backoff logic for offline servers (!70) · Merge requests · Matrix projects / Grapevine

Olivia Lee requested to merge benjamin/refactor-backoff into main Aug 11, 2024

Context

There are several places in the current code that we implement "exponential" backoff (usually actually quadratic). These mostly don't share code, leading to multiple distinct bugs. It's not clear what the intent of the backoff is in each case, and there are multiple reasons that we might want to backoff or rate limit requests. An effective approach depends on what we're trying to guard against, and we will likely need multiple overlapping implementations.

Changes in this MR

One reason that we want to backoff outgoing federation requests is to avoid spamming offline servers. This shows up separately in our current implementations for remote device key queries, remote signing key queries, and outgoing federation transactions. Many other sources of outgoing federation requests are unprotected.

This MR centralizes backoff state intended to protect against offline servers in one place, and enables this backoff logic for all outgoing federation traffic. This implementation is better suited than the existing ones for handling offline servers with a stream of concurrent requests.

This is a first step towards #23

Other considerations

One DoS vector to keep in mind is that an attacker may generate incoming requests that cause us to hit a disproportionately large fraction of errors from some other server. For example, they might know a bug that causes another homeserver to return 500 on a federation request with specific characteristics, and generate a large volume of incoming client traffic that triggers affected federation requests from us. This may cause us to backoff all requests to that server, as if it was offline. I'm not sure it's possible to fix this, in general, but we can mitigate it by being careful to only record failures when we receive an error that we're confident shouldn't be generated under normal conditions. I'm pretty sure that synapse has the same issue.

Edited Aug 11, 2024 by Olivia Lee

Centralize backoff logic for offline servers

Context

Changes in this MR

Other considerations

Merge request reports