Grapevine returns error responses for incoming federation transactions
The spec states that the response to the /_matrix/federation/v1/send/:transaction_id endpoint should always be 200, and that "The server is to use this response even in the event of one or more PDUs failing to be processed." We seem to be making a kinda halfhearted attempt at doing this, but there are still a lot of situations where an error specific to a single PDU causes us to return an error for the entire transaction.
An example from the computer.surgery logs:
Aug 24 05:55:29 red grapevine[342632]: 2024-08-24T05:55:29.384561Z WARN grapevine::api::ruma_wrapper::axum: Failed to fetch signing keys, error: bad signature, still backing off
Aug 24 05:55:29 red grapevine[342632]: at src/api/ruma_wrapper/axum.rs:271
Aug 24 05:55:29 red grapevine[342632]: in grapevine::api::ruma_wrapper::axum::ar_from_request
Aug 24 05:55:29 red grapevine[342632]: in grapevine::http_request with otel.name: "PUT /_matrix/federation/v1/send/:transaction_id", method: PUT, endpoint: /_matrix/federation/v1/send/:transaction_id
This likely causes incoming federation traffic from some servers to get stuck, because the sending server is supposed to retry the exact same transaction repeatedly until it gets a 200 response.
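To make the expected behavior concrete, here is a rough sketch of what the spec seems to ask for: process each PDU independently, record any failure against that PDU's event ID, and still answer the transaction with 200. This is not Grapevine's actual code. `Pdu`, `PduError`, `handle_pdu`, and `handle_transaction` are all hypothetical names made up for illustration, and the signature check is a stand-in for whatever per-PDU validation can fail.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a PDU; a real one carries far more data.
struct Pdu {
    event_id: String,
    signature_ok: bool,
}

/// Hypothetical per-PDU failure type.
#[derive(Debug)]
enum PduError {
    BadSignature,
}

/// Hypothetical per-PDU handler: fails only for this one PDU.
fn handle_pdu(pdu: &Pdu) -> Result<(), PduError> {
    if pdu.signature_ok {
        Ok(())
    } else {
        Err(PduError::BadSignature)
    }
}

/// Process every PDU in the transaction. A failing PDU is recorded in the
/// per-PDU result map instead of aborting the whole transaction, so the
/// status code is 200 regardless of individual failures.
fn handle_transaction(pdus: &[Pdu]) -> (u16, BTreeMap<String, Result<(), String>>) {
    let mut results = BTreeMap::new();
    for pdu in pdus {
        // Convert the error to a string, analogous to the optional `error`
        // field of a per-PDU processing result in the response body.
        let outcome = handle_pdu(pdu).map_err(|e| format!("{:?}", e));
        results.insert(pdu.event_id.clone(), outcome);
    }
    (200, results)
}
```

With this shape, a transaction containing one good and one bad PDU still gets a 200, and the sending server can see from the result map which event was rejected instead of retrying the whole transaction.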