Etcd does not respect retries if the endpoint(s) cannot be reached #107
Comments
The infinite recursion
I think this is probably use-case driven. Also, we should be careful here, as most users will issue a rolling restart across their nodes when performing maintenance or changing versions. In that case, we would want to retry the set of endpoints multiple times. The retry parameter isn't used and cannot be set.
I think it makes sense to try endpoints multiple times, but it should maybe still respect the 'retry' count; right now it's ignored except for auth errors. Also, any thoughts about switching from recursion to using 'retry'?
There are two things here: retries per endpoint, and how many times we would want to cycle through all endpoints before giving up, right?
I think they're different. One is that the retries value is ignored for some errors and is not settable; I think we definitely want to change that. The other is how we want to conceptualize retries when there are multiple servers.
I don't really have a preference, as long as there is some way of controlling the number of retries when the server is down. I think the answer to the second problem needs to be figured out as part of making non-auth errors respect the retries argument.
I could see some people wanting to prioritize certain nodes over others, especially with gRPC proxies, which do some caching. Some people may also want to prioritize the leader due to latency concerns. This is probably separate functionality that we can build on later, but it may be worth considering. With that said, I think having two values may make the most sense. Thoughts?
Sorry for going quiet here! I'm a little confused about how the two values would interact with each other. Say I have three endpoints, a per-endpoint retry value of 5, and a cycle value of 2. Does it retry each endpoint 5 times and cycle through all of the endpoints 2 times, so 30 calls? Or would it cycle through the endpoints up to the retry limit, so 5 calls? The latter would be my personal preferred approach, but it kind of defeats the purpose of a second value.
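To make the two interpretations concrete, here is a minimal sketch. The `retries` and `cycles` parameter names are hypothetical (neither exists in the library today); the first function multiplies the two values, the second treats `retries` as a single attempt budget spread across the endpoints.

```python
# Hypothetical sketch of the two interpretations discussed above.
# Neither `retries` nor `cycles` is an actual parameter of this library.

def call_multiplicative(endpoints, do_request, retries=5, cycles=2):
    """Interpretation 1: retry each endpoint `retries` times per cycle,
    and cycle through the whole endpoint list `cycles` times.
    Worst case: len(endpoints) * retries * cycles attempts (3 * 5 * 2 = 30)."""
    last_error = ConnectionError("no endpoints configured")
    for _ in range(cycles):
        for endpoint in endpoints:
            for _ in range(retries):
                try:
                    return do_request(endpoint)
                except ConnectionError as err:
                    last_error = err
    raise last_error


def call_total_budget(endpoints, do_request, retries=5):
    """Interpretation 2: a single attempt budget, handed out round-robin
    over the endpoints. Worst case: exactly `retries` attempts (5)."""
    last_error = ConnectionError("no endpoints configured")
    for attempt in range(retries):
        endpoint = endpoints[attempt % len(endpoints)]
        try:
            return do_request(endpoint)
        except ConnectionError as err:
            last_error = err
    raise last_error
```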
In a 3-node cluster, this would result in 2 of the 3 endpoints retrying twice and 1 endpoint retrying once, right? Or did I misunderstand that? In any case, it may be worth seeing how other people are solving this type of thing. Sorry I haven't been super responsive, work + moving is kicking my ass.
Yup, that's right. Maybe that's strange, though?
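A quick way to see that split, assuming a single budget of 5 attempts handed out round-robin over 3 hypothetical endpoints:

```python
from collections import Counter

# 5 attempts spread round-robin over 3 endpoints: two endpoints get tried
# twice, one gets tried once. Hostnames here are purely illustrative.
endpoints = ["node-a:2379", "node-b:2379", "node-c:2379"]
retries = 5

attempts = Counter(endpoints[i % len(endpoints)] for i in range(retries))
print(attempts)
# Counter({'node-a:2379': 2, 'node-b:2379': 2, 'node-c:2379': 1})
```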
Yeah, agreed. I'll look around some. I'm particularly interested in how the proxy does this.
It seems like retry policies are coming as an official part of the protocol, which this library could make use of in the future: https://github.com/grpc/proposal/blob/master/A6-client-retries.md#maximum-number-of-retries. Something to keep an eye on, at least.
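For reference, that proposal expresses retries as a per-method policy in the channel's service config. Here is a sketch of what it could look like from Python gRPC, assuming the client were willing to pass raw channel options through; `etcdserverpb.KV` is etcd's real KV service name, everything else is illustrative.

```python
import json
import grpc

# Per-method retry policy in the A6 service-config format, applied as a
# channel argument. This retries individual RPCs on transient failures.
service_config = json.dumps({
    "methodConfig": [{
        "name": [{"service": "etcdserverpb.KV"}],
        "retryPolicy": {
            "maxAttempts": 5,
            "initialBackoff": "0.1s",
            "maxBackoff": "1s",
            "backoffMultiplier": 2,
            "retryableStatusCodes": ["UNAVAILABLE"],
        },
    }]
})

channel = grpc.insecure_channel(
    "example.com:2379",
    options=[
        ("grpc.enable_retries", 1),
        ("grpc.service_config", service_config),
    ],
)
```

One caveat: this policy only retries individual RPCs on the channel as configured; it doesn't by itself cycle through a list of endpoints, so the failover question discussed above would still need an answer.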
It looks like if the etcd cluster is unreachable, the client spins with `Failed to connect to endpoint 'example.com:2379'`. I think the desired behavior would be to try each endpoint once, and if there is only one endpoint, to raise immediately. It also looks like there is a retry parameter that a user could use to adjust this further.
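A minimal sketch of that desired behavior, assuming a hypothetical `retry` parameter and exception name (this is not the library's current code):

```python
# Sketch only: try each endpoint once per pass, and make extra passes only
# when the caller asks for them via `retry`. With one endpoint and retry=0
# this raises on the first failure instead of spinning forever.

class NoEndpointAvailableError(Exception):  # hypothetical exception name
    pass


def request_with_failover(endpoints, do_request, retry=0):
    last_error = None
    for _ in range(retry + 1):          # one pass plus `retry` extra passes
        for endpoint in endpoints:
            try:
                return do_request(endpoint)
            except ConnectionError as err:
                last_error = err        # remember why this endpoint failed
    raise NoEndpointAvailableError(
        f"Failed to connect to any endpoint: {last_error}"
    )
```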
There are two issues as I see it.
I can totally take a stab at it if you like! Just let me know, thanks!