Socket errors when we lose a connection #208

Hi! We're seeing a weird error in prod that happens when we lose our connection to ClickHouse Cloud. We have confirmed with them that these errors coincide with issues on their end with a pod dying. We're trying to figure out if we're still connecting to the dead pod or what. This is the error we see:

We also see interspersed

This seems like an issue on the client end, like the connection has been closed but we're trying to reuse it? I'm not sure; I haven't dug too deep into the implementation yet. Looking for any guidance you might have.
👋
Ch doesn't set any socket options itself, and it doesn't seem like Mint sets sndbuf (i.e. SO_SNDBUF) either, only buffer (which is OTP-specific and controls how much to allocate for the user-level buffer used by the driver). I haven't been able to reproduce this exact error even with extreme values:

```elixir
iex> Mix.install [:mint]
iex> Mint.HTTP1.connect(:https, "google.com", 443, transport_opts: [sndbuf: -1]) # no-op?
{:ok, %Mint.HTTP1{...}}
iex> Mint.HTTP1.connect(:https, "google.com", 443, transport_opts: [sndbuf: 1]) # very slow
{:ok, %Mint.HTTP1{...}}
iex> Mint.HTTP1.connect(:https, "google.com", 443, transport_opts: [sndbuf: 1000000000000000000000000000000000000000]) # probably gets truncated
{:ok, %Mint.HTTP1{...}}
```

```elixir
iex> Mix.install [:ch]
# :ssl.connect error
iex> Ch.start_link(scheme: "https", hostname: "localhost", port: 8123, database: "default", transport_opts: [sndbuf: nil])
# 10:01:48.209 [error] Ch.Connection (#PID<0.963.0>) failed to connect: ** (Mint.TransportError) Invalid socket option: # [{sndbuf,nil},
# {packet_size,0},
# {packet,0},
# {header,0},
# {active,false},
# {mode,binary}]
# :gen_tcp.connect error
iex> Ch.start_link(scheme: "http", hostname: "localhost", port: 8123, database: "default", transport_opts: [sndbuf: nil])
# 10:01:23.915 [error] :gen_statem #PID<0.956.0> terminating
# ** (ArgumentError) argument error
# (kernel 10.0) gen_tcp.erl:578: :gen_tcp.connect/4
# (mint 1.6.2) lib/mint/core/transport/tcp.ex:45: Mint.Core.Transport.TCP.connect/3
# (mint 1.6.2) lib/mint/http1.ex:140: Mint.HTTP1.connect/4
# (ch 0.2.8) lib/ch/connection.ex:20: Ch.Connection.connect/1
# (db_connection 2.7.0) lib/db_connection/connection.ex:74: DBConnection.Connection.handle_event/4
# (stdlib 6.0) gen_statem.erl:3115: :gen_statem.loop_state_callback/11
# (stdlib 6.0) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
# Queue: [internal: {:connect, :init}]
# Postponed: []
# State: Ch.Connection
# Callback mode: &DBConnection.Connection.handle_event/4, state_enter: false
```

Maybe I could try with the exact Elixir/Erlang/Mint/Ch versions you are using?
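Incidentally, a quick way to check what the kernel actually applied for these buffer options (a sketch; the returned values vary by OS, and the kernel clamps out-of-range sndbuf requests):

```elixir
iex> {:ok, socket} = :gen_tcp.connect(~c"google.com", 443, [])
iex> :inet.getopts(socket, [:sndbuf, :buffer])
# Example output; actual numbers depend on OS defaults and clamping:
{:ok, [sndbuf: 16384, buffer: 1460]}
```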
I think this error is normal when the socket does, in fact, get closed :) But maybe it could be improved. Since we use "passive" connections, we can only find out that the socket was closed on the next call to it (a ping, every two seconds by default, or a query), which causes this error, and then Ch reconnects. If we somehow switched to "active" connections, we'd receive a message that the socket was closed right when the FIN (or whatever) arrives. I think Finch had a similar problem, though I'm not sure if or how they solved it: sneako/finch#272 and sneako/finch#273
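For illustration, roughly what that difference looks like at the :gen_tcp level (a minimal sketch, assuming something is listening on localhost:8123; this is not Ch's actual code):

```elixir
# Minimal sketch of passive vs. active socket modes (illustration only).
{:ok, socket} = :gen_tcp.connect(~c"localhost", 8123, [:binary, active: false])

# Passive mode (what Ch uses): a closed peer is only discovered on the next
# explicit call to the socket, i.e. the periodic ping or the next query.
case :gen_tcp.recv(socket, 0, 5_000) do
  {:ok, data} -> IO.inspect(data, label: "got data")
  {:error, :closed} -> IO.puts("socket closed, reconnect") # the error surfaced here
  {:error, :timeout} -> IO.puts("still alive, nothing to read")
end

# Active mode: the controlling process is notified as soon as the FIN
# arrives, without having to touch the socket first.
:ok = :inet.setopts(socket, active: true)

receive do
  {:tcp, ^socket, data} -> IO.inspect(data, label: "got data")
  {:tcp_closed, ^socket} -> IO.puts("socket closed, reconnect")
end
```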
Ok, yeah, I get that, and that's understandable. However, the behavior we're seeing in prod is that we essentially completely fail to reconnect. Enough times, in fact, that the BEAM brings our app down. We're investigating supervising the repo separately so we can control the number of restarts and the backoff there; I'm not sure that's a general fix, but it might work well enough (see the sketch below). It would be nice if this lib allowed configuring that directly, but we can handle it on the client side easily enough.
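Something like this is what we have in mind (a rough sketch; the restart limits are placeholders, not recommendations, and the supervisor name is hypothetical):

```elixir
# Rough sketch: isolate the ClickHouse connection under its own supervisor so
# a reconnect crash loop exhausts this supervisor's restart budget instead of
# rippling up and taking the whole app down.
ch_opts = [scheme: "http", hostname: "localhost", port: 8123, database: "default"]

children = [
  # explicit child spec, using Ch.start_link/1
  %{id: Ch, start: {Ch, :start_link, [ch_opts]}}
]

Supervisor.start_link(children,
  name: MyApp.CHSupervisor, # hypothetical name
  strategy: :one_for_one,
  max_restarts: 10,         # defaults are 3 restarts...
  max_seconds: 60           # ...within 5 seconds
)
```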
Ah, yeah, that old chestnut. We also see that from time to time in prod, I'm guessing likewise due to AWS NAT. We have retries built into our HTTP clients, but we still notice it. Maybe it would be a nice improvement to detect this case and retry? It seems like if we go to make a query and the connection we've checked out is closed, we should at least attempt to get a fresh one? Something like the wrapper below, for example.
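(A hypothetical sketch of that wrapper on our side; the %Mint.TransportError{reason: :closed} match is an assumption about the error shape Ch returns, and query_with_retry/4 is not part of Ch:)

```elixir
defmodule MyApp.ClickHouse do
  # Hypothetical client-side workaround: if the checked-out connection turns
  # out to be closed, call Ch.query/3 again, which checks out another
  # (possibly fresh) connection from the pool.
  def query_with_retry(conn, sql, params, attempts \\ 2) do
    case Ch.query(conn, sql, params) do
      {:error, %Mint.TransportError{reason: :closed}} when attempts > 1 ->
        query_with_retry(conn, sql, params, attempts - 1)

      other ->
        other
    end
  end
end
```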
Hm, that ... shouldn't happen. Ch should probably just keep retrying.
I would also be interested in learning the root cause of the error: where is an "invalid" sndbuf coming from?
That would solve it if we do it in Lines 243 to 252 in ca21639 and Lines 84 to 90 in ca21639:

```elixir
case DBConnection.execute(conn, query, params, opts) do
  # some "connection closed" error; exact shape to be determined
  {:error, %DBConnection.ConnectionError{}} ->
    # re-run the query with the same params on a fresh connection
    Ch.query(conn, statement, params)

  other ->
    other
end
```

But that also seems a bit like a hack.
PR: #209 (merged)
So to sum up, this issue uncovered two problems with Ch:

- with passive connections, a socket closed by the peer is only noticed on the next ping or query, which surfaces as a confusing error
- a query on an already-closed checked-out connection failed outright instead of being retried on a fresh connection
Oh fantastic! Thank you so much!
Released as v0.2.9: https://diff.hex.pm/diff/ch/0.2.8..0.2.9
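To pick up the fix, the dependency just needs to be bumped (standard mix.exs pinning):

```elixir
# mix.exs
defp deps do
  [
    {:ch, "~> 0.2.9"}
  ]
end
```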