Etcd does not respect retries if the endpoint(s) cannot be reached #107
Comments
The infinite recursion
I think this is probably use-case driven. Also, we should be careful here, as most users will issue a rolling restart across their nodes when performing maintenance or changing versions. In that case, we would want to retry the set of endpoints multiple times. The retry parameter isn't used and cannot be set.
I think it makes sense to try endpoints multiple times, but it should maybe still respect the 'retry' count; right now it's ignored except for auth errors. Also, any thoughts about switching from recursion to using 'retry'?
There are two things here: retries per endpoint, and how many times we would want to cycle through all endpoints before giving up, right?
I think they're different. One is that the retries value is ignored for some errors and is not settable; I think we definitely want to change that. The other is how we want to conceptualize retries when there are multiple servers.
I don't really have a preference, as long as there is some way of controlling the number of retries when the server is down. I think the answer to the second problem needs to be figured out as part of making non-auth errors respect the retries argument.
I could see some people wanting to prioritize certain nodes over others, especially with gRPC proxies, which do some caching. Some people may also want to prioritize the leader due to latency concerns. This is probably separate functionality that we can build on later, but it may be worth considering. With that said, I think having two values may make the most sense. Thoughts?
Sorry for going quiet here! I'm a little confused about how the two values would interact with each other. Say I have three endpoints, a per-endpoint retry value of 5, and a cycle value of 2. Does it retry each endpoint 5 times and cycle through all of the endpoints 2 times, so 30 calls? Or would it cycle through the endpoints up to the retry limit, so 5 calls? The latter would be my personal preferred approach, but it kind of defeats the purpose of a second value.
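To make the two interpretations concrete, here is a minimal sketch. The `retries` and `cycles` parameter names are hypothetical (neither exists in the library today); the first function multiplies the two values, the second treats `retries` as a single attempt budget spread across the endpoints.

```python
# Hypothetical sketch of the two interpretations discussed above.
# Neither `retries` nor `cycles` is an actual parameter of this library.

def call_multiplicative(endpoints, do_request, retries=5, cycles=2):
    """Interpretation 1: retry each endpoint `retries` times per cycle,
    and cycle through the whole endpoint list `cycles` times.
    Worst case: len(endpoints) * retries * cycles attempts (3 * 5 * 2 = 30)."""
    last_error = ConnectionError("no endpoints configured")
    for _ in range(cycles):
        for endpoint in endpoints:
            for _ in range(retries):
                try:
                    return do_request(endpoint)
                except ConnectionError as err:
                    last_error = err
    raise last_error


def call_total_budget(endpoints, do_request, retries=5):
    """Interpretation 2: a single attempt budget, handed out round-robin
    over the endpoints. Worst case: exactly `retries` attempts (5)."""
    last_error = ConnectionError("no endpoints configured")
    for attempt in range(retries):
        endpoint = endpoints[attempt % len(endpoints)]
        try:
            return do_request(endpoint)
        except ConnectionError as err:
            last_error = err
    raise last_error
```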
In a 3-node cluster, this would result in 2 of the 3 endpoints retrying twice and 1 endpoint retrying once, right? Or did I misunderstand that? In any case, it may be worth seeing how other people are solving this type of thing. Sorry I haven't been super responsive, work + moving is kicking my ass.
Yup, that's right. Maybe that's strange, though?
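A quick way to see that split, assuming a single budget of 5 attempts handed out round-robin over 3 hypothetical endpoints:

```python
from collections import Counter

# 5 attempts spread round-robin over 3 endpoints: two endpoints get tried
# twice, one gets tried once. Hostnames here are purely illustrative.
endpoints = ["node-a:2379", "node-b:2379", "node-c:2379"]
retries = 5

attempts = Counter(endpoints[i % len(endpoints)] for i in range(retries))
print(attempts)
# Counter({'node-a:2379': 2, 'node-b:2379': 2, 'node-c:2379': 1})
```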
Yeah, agreed. I'll look around some. I'm particularly interested in how the proxy does this.
It seems like retry policies are coming as an official part of the protocol, which this library could make use of in the future: https://github.com/grpc/proposal/blob/master/A6-client-retries.md#maximum-number-of-retries. Something to keep an eye on, at least.
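For reference, that proposal expresses retries as a per-method policy in the channel's service config. Here is a sketch of what it could look like from Python gRPC, assuming the client were willing to pass raw channel options through; `etcdserverpb.KV` is etcd's real KV service name, everything else is illustrative.

```python
import json
import grpc

# Per-method retry policy in the A6 service-config format, applied as a
# channel argument. This retries individual RPCs on transient failures.
service_config = json.dumps({
    "methodConfig": [{
        "name": [{"service": "etcdserverpb.KV"}],
        "retryPolicy": {
            "maxAttempts": 5,
            "initialBackoff": "0.1s",
            "maxBackoff": "1s",
            "backoffMultiplier": 2,
            "retryableStatusCodes": ["UNAVAILABLE"],
        },
    }]
})

channel = grpc.insecure_channel(
    "example.com:2379",
    options=[
        ("grpc.enable_retries", 1),
        ("grpc.service_config", service_config),
    ],
)
```

One caveat: this policy only retries individual RPCs on the channel as configured; it doesn't by itself cycle through a list of endpoints, so the failover question discussed above would still need an answer.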
It looks like if the etcd cluster is unreachable, the client spins with `Failed to connect to endpoint 'example.com:2379'`. I think the desired behavior would be to try each endpoint once, and if there is only one endpoint, to raise immediately. It also looks like there is a retry parameter that a user could use to adjust this further.
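A minimal sketch of that desired behavior, assuming a hypothetical `retry` parameter and exception name (this is not the library's current code):

```python
# Sketch only: try each endpoint once per pass, and make extra passes only
# when the caller asks for them via `retry`. With one endpoint and retry=0
# this raises on the first failure instead of spinning forever.

class NoEndpointAvailableError(Exception):  # hypothetical exception name
    pass


def request_with_failover(endpoints, do_request, retry=0):
    last_error = None
    for _ in range(retry + 1):          # one pass plus `retry` extra passes
        for endpoint in endpoints:
            try:
                return do_request(endpoint)
            except ConnectionError as err:
                last_error = err        # remember why this endpoint failed
    raise NoEndpointAvailableError(
        f"Failed to connect to any endpoint: {last_error}"
    )
```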
There are two issues as I see it.
I can totally take a stab at it if you like! Just let me know, thanks!