
bug(quinn): making a connection that never gets to a full Connection causes excessive delays in Endpoint::wait_idle #2147

ramfox opened this issue Jan 29, 2025 · 7 comments



ramfox commented Jan 29, 2025

I would appreciate any help.

The bug is illustrated in the endpoint_wait_idle test in draft PR #2146, but I will also copy the permalink:

quinn/quinn/src/tests.rs

Lines 154 to 177 in dd14b01

```rust
    // Create an endpoint that will not make a successful connection
    let endpoint = factory.endpoint();
    // Connect to un-dialable addr
    let connecting = endpoint
        .connect(
            SocketAddr::new(IpAddr::V4(Ipv4Addr::LOCALHOST), 1234),
            "localhost",
        )
        .unwrap();
    // decide you don't want to keep connecting or something goes wrong
    // while connecting:
    drop(connecting);
    // close the endpoint
    endpoint.close(0u8.into(), b"");
    let duration = Duration::from_millis(2900);
    let res = tokio::time::timeout(duration, endpoint.wait_idle()).await;
    // this fails, `wait_idle` takes around 3s
    assert!(
        res.is_ok(),
        "Endpoint took longer than {duration:?} to close"
    );
}
```

To summarize: if you have an Endpoint that attempts a connection and, for whatever reason, that connection never makes it past the Connecting stage, the Endpoint::wait_idle call (when attempting to gracefully close the endpoint) always takes ~3s. I'm assuming something like the ConnectionHandler created along with Connecting never gets cleaned up, so wait_idle only returns after a timeout.

I'm only familiar enough with quinn to know something is wrong, not how to fix it.

Ralith (Collaborator) commented Jan 29, 2025

What leads you to believe that this behavior is incorrect? My guess would be that this is exactly the draining period of the attempted connection. Waiting for that to complete is exactly the role of wait_idle.

djc (Member) commented Jan 29, 2025

I think the suggestion/surprise is that Connecting instances might not deserve the full draining period?

flub (Contributor) commented Jan 29, 2025

I assume what happens is that on dropping the Connecting a CONNECTION_CLOSE frame is sent, even though only the initial packet was ever sent. Since CONNECTION_CLOSE is ack-eliciting you then have to wait for 3*PTO to expire and the RTT for the PTO calculation will be set to the initial values?

This makes sense from a protocol point of view. From a user point of view I understand that it is a little surprising that the timeout happens at all in this case. So @djc's question is fair: would it be allowed to shorten the draining period? Would it be desirable?
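For reference, here is a rough back-of-the-envelope sketch of that arithmetic (my own numbers, assuming quinn sticks to the RFC 9002 defaults; not lifted from quinn's source), which lines up with the ~3s the test observes:

```rust
use std::time::Duration;

// With no RTT sample yet, the RFC 9002 defaults apply:
fn main() {
    let initial_rtt = Duration::from_millis(333); // kInitialRtt, RFC 9002 §6.2.2
    let smoothed_rtt = initial_rtt; // first estimate is the initial RTT
    let rttvar = initial_rtt / 2; // initial rttvar is half the initial RTT
    let granularity = Duration::from_millis(1); // kGranularity

    // PTO = smoothed_rtt + max(4 * rttvar, kGranularity); max_ack_delay is not
    // included before the handshake is confirmed.
    let pto = smoothed_rtt + (4 * rttvar).max(granularity);

    // The endpoint lingers for three times the PTO when closing (RFC 9000 §10.2).
    let drain = 3 * pto;

    println!("PTO ≈ {pto:?}, drain ≈ {drain:?}"); // PTO ≈ 999ms, drain ≈ ~3s
}
```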

ramfox (Author) commented Jan 29, 2025

> I think the suggestion/surprise is that Connecting instances might not deserve the full draining period?

Yes, thank you, this is what surprised me. We have the "knowledge" that we don't have a Connection yet, since we have not reached that point in the life cycle (we are still at Connecting), so it felt surprising that I would have to wait for it to close.

Ralith (Collaborator) commented Jan 30, 2025

You don't have to; wait_idle is strictly optional. It does serve a useful purpose: it helps the peer dispose of any state it may have already allocated in a timely manner. The connection not having been fully established doesn't preclude the peer having allocated state.
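If you decide the full drain isn't worth waiting for, you can bound it yourself on the caller side. A minimal sketch (the helper name is made up, the rest is just the existing close/wait_idle API plus a tokio timeout, as in the test above):

```rust
use std::time::Duration;

// Hypothetical helper (not part of quinn): close the endpoint and wait for the
// drain, but give up after `max_wait`, since waiting is strictly optional.
async fn close_with_deadline(endpoint: &quinn::Endpoint, max_wait: Duration) {
    endpoint.close(0u8.into(), b"shutting down");
    // If draining takes longer than we are willing to wait, just move on; the
    // peer will eventually time out any state it allocated on its own.
    let _ = tokio::time::timeout(max_wait, endpoint.wait_idle()).await;
}
```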

djc (Member) commented Feb 1, 2025

> You don't have to; wait_idle is strictly optional. It does serve a useful purpose: it helps the peer dispose of any state it may have already allocated in a timely manner. The connection not having been fully established doesn't preclude the peer having allocated state.

But if the goal is to help the peer dispose of any state, arguably we might be able to use a shorter timeout? Like, I think you mean we should allow the Connecting to yield a CONNECTION_CLOSE message, but once that is done we probably don't need to wait for anything -- so a 250-500ms timeout might be sufficient for Connecting instances?

Ralith (Collaborator) commented Feb 2, 2025

Why would we be able to use a shorter timeout? As far as I can tell, a client connection with a handshake in progress isn't in any way special as far as lifetime/connection-level resource concerns go.
