
Socket errors when we lose a connection #208

Closed
andyleclair opened this issue Oct 30, 2024 · 7 comments
@andyleclair

Hi! We're seeing a weird error in prod that happens when we lose our connection to Clickhouse Cloud. We have confirmed with them that these errors coincide with issues on their end with a pod dying. We're trying to figure out if we're still connecting to the dead pod or what. This is the error we see:

Ch.Connection (#PID<0.200885.0>) failed to connect: ** (Mint.TransportError) Invalid transport socket option sndbuf: invalid argument

We also see interspersed

Ch.Connection (#PID<0.3291.0>) disconnected: ** (Mint.TransportError) socket closed

This seems like an issue on the client end, like the connection has been closed but we're trying to reuse it? I'm not sure, I haven't dug too deep into the implementation yet, looking for any guidance you might have

@ruslandoga
Contributor

ruslandoga commented Oct 31, 2024

👋

** (Mint.TransportError) Invalid transport socket option sndbuf: invalid argument

Ch doesn't set any socket options itself, and it doesn't seem like Mint sets sndbuf (i.e. SO_SNDBUF) either. Mint only sets buffer, an OTP-specific option that controls how much memory is allocated for :gen_tcp.recv(socket, 0, timeout) (i.e. receiving without specifying a size); it sets buffer to the max of sndbuf, recbuf (i.e. SO_RCVBUF), and buffer: https://github.com/elixir-mint/mint/blob/50b11d668b6a240b0d9b20c67fbb41a10a7410b1/lib/mint/http1.ex#L192 and https://github.com/elixir-mint/mint/blob/50b11d668b6a240b0d9b20c67fbb41a10a7410b1/lib/mint/core/util.ex#L20-L38.
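That buffer tuning can be sketched with plain :inet calls (this is my reading of the linked Mint code, not Ch's or Mint's API): after connecting, read sndbuf, recbuf, and buffer, and raise buffer to the largest of the three.

```elixir
# A throwaway loopback socket just to have something to inspect.
{:ok, listen} = :gen_tcp.listen(0, [:binary])
{:ok, port} = :inet.port(listen)
{:ok, socket} = :gen_tcp.connect(~c"localhost", port, [:binary, active: false])

# Read the kernel-negotiated buffer sizes and the current Erlang-side buffer...
{:ok, opts} = :inet.getopts(socket, [:sndbuf, :recbuf, :buffer])

# ...and raise buffer to the largest of the three, as the linked Mint code does.
buffer = opts |> Keyword.values() |> Enum.max()
:ok = :inet.setopts(socket, buffer: buffer)
IO.puts("buffer raised to #{buffer}")
```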

I haven't been able to reproduce this exact error even with extreme values:

iex> Mix.install [:mint]

iex> Mint.HTTP1.connect(:https, "google.com", 443, transport_opts: [sndbuf: -1]) # no-op?
{:ok, %Mint.HTTP1{...}}

iex> Mint.HTTP1.connect(:https, "google.com", 443, transport_opts: [sndbuf: 1]) # very slow
{:ok, %Mint.HTTP1{...}}

iex> Mint.HTTP1.connect(:https, "google.com", 443, transport_opts: [sndbuf: 1000000000000000000000000000000000000000]) # probably gets truncated
{:ok, %Mint.HTTP1{...}}
iex> Mix.install [:ch]

# :ssl.connect error
iex> Ch.start_link(scheme: "https", hostname: "localhost", port: 8123, database: "default", transport_opts: [sndbuf: nil])
# 10:01:48.209 [error] Ch.Connection (#PID<0.963.0>) failed to connect: ** (Mint.TransportError) Invalid socket option: # [{sndbuf,nil},
#                         {packet_size,0},
#                         {packet,0},
#                         {header,0},
#                         {active,false},
#                         {mode,binary}]

# :gen_tcp.connect error
iex> Ch.start_link(scheme: "http", hostname: "localhost", port: 8123, database: "default", transport_opts: [sndbuf: nil])
# 10:01:23.915 [error] :gen_statem #PID<0.956.0> terminating
# ** (ArgumentError) argument error
#     (kernel 10.0) gen_tcp.erl:578: :gen_tcp.connect/4
#     (mint 1.6.2) lib/mint/core/transport/tcp.ex:45: Mint.Core.Transport.TCP.connect/3
#     (mint 1.6.2) lib/mint/http1.ex:140: Mint.HTTP1.connect/4
#     (ch 0.2.8) lib/ch/connection.ex:20: Ch.Connection.connect/1
#     (db_connection 2.7.0) lib/db_connection/connection.ex:74: DBConnection.Connection.handle_event/4
#     (stdlib 6.0) gen_statem.erl:3115: :gen_statem.loop_state_callback/11
#     (stdlib 6.0) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
# Queue: [internal: {:connect, :init}]
# Postponed: []
# State: Ch.Connection
# Callback mode: &DBConnection.Connection.handle_event/4, state_enter: false

Maybe I could try with the exact Elixir/Erlang/Mint/Ch versions you are using?

** (Mint.TransportError) socket closed

I think this error is normal when the socket gets, in fact, closed :) But maybe it could be improved.

Since we use "passive" connections, we can only find out that the socket was closed on the next call to it (a ping, every two seconds by default, or a query), and that call causes this error and Ch gets reconnected. If we somehow switched to "active" connections, we'd receive a message that the socket was closed as soon as the FIN (or whatever) arrives. I think Finch had a similar problem, not sure if or how they solved it: sneako/finch#272 and sneako/finch#273
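The passive/active difference can be demonstrated with a bare :gen_tcp loopback socket (a minimal sketch, independent of Ch/Mint): in active mode the owning process gets a {:tcp_closed, socket} message pushed to its mailbox as soon as the peer closes, whereas a passive socket would only report :closed on the next recv.

```elixir
# Throwaway listener on an ephemeral port.
{:ok, listen} = :gen_tcp.listen(0, [:binary, active: false, reuseaddr: true])
{:ok, port} = :inet.port(listen)

# Active-mode client: socket events are delivered as messages.
{:ok, client} = :gen_tcp.connect(~c"localhost", port, [:binary, active: true])
{:ok, server_side} = :gen_tcp.accept(listen)

# The "server" side closes the connection...
:ok = :gen_tcp.close(server_side)

# ...and the active-mode client is notified without having to poll.
receive do
  {:tcp_closed, ^client} -> IO.puts("got :tcp_closed push")
after
  1_000 -> IO.puts("no notification within 1s")
end
```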

@andyleclair
Author

Ok, yeah, I get that, and that's understandable. However, the behavior we're seeing in prod is that we essentially fail to reconnect at all. Enough times, in fact, that the BEAM brings our app down. We're investigating supervising the repo separately so we can control the number of restarts and the backoff there. I'm not sure that's a general-case fix, but it might work well enough. It would be nice if this lib allowed configuring that directly, but we can handle it on the client side easily enough.

Since we use "passive" connections, we can only find out that the socket was closed on the next call to it (a ping, every two seconds by default, or a query), and that call causes this error and Ch gets reconnected. If we somehow switched to "active" connections, we'd receive a message that the socket was closed as soon as the FIN (or whatever) arrives. I think Finch had a similar problem, not sure if or how they solved it: sneako/finch#272 and sneako/finch#273

Ah, yeah, that old chestnut. We also see that one from time to time in prod, I'm guessing also due to AWS NAT. We have retries built into our HTTP clients, but we do notice it. Maybe it would be a nice improvement to detect this case and retry? It seems like if we go to make a query and the connection we've checked out is closed, we should at least attempt to get a fresh one?

@ruslandoga
Contributor

ruslandoga commented Oct 31, 2024

Enough times, in fact, that BEAM brings our app down.

Hm, that... shouldn't happen. Ch should probably just keep retrying. Would you be able to create a repro? It seems like my Ch error snippets above are enough to reproduce this, even if the error message doesn't match exactly. I'll try wrapping the connect callback in a try/catch block to rescue any error/exit/throw on invalid configuration.

We're investigating supervising the repo separately

I would also be interested in learning the root cause of the error: where is an "invalid" sndbuf option coming from?

It seems like if we go to make a query and our connection we've checked out is closed, we should at least attempt to get a fresh one?

That would solve it if we do it in

ch/lib/ch/connection.ex

Lines 243 to 252 in ca21639

def handle_execute(query, params, opts, conn) do
  {query_params, extra_headers, body} = params

  path = path(conn, query_params, opts)
  headers = headers(conn, extra_headers, opts)

  with {:ok, conn, responses} <- request(conn, "POST", path, headers, body, opts) do
    {:ok, query, responses, conn}
  end
end
but would kind of be a "hack" since we would be setting up a connection outside of DBConnection lifecycle. Maybe we can do it at the call-site, e.g. in

ch/lib/ch.ex

Lines 84 to 90 in ca21639

def query(conn, statement, params \\ [], opts \\ []) do
  query = Query.build(statement, opts)

  with {:ok, _query, result} <- DBConnection.execute(conn, query, params, opts) do
    {:ok, result}
  end
end

case DBConnection.execute(conn, query, params, opts) do
  # match whatever shape the "connection closed" error turns out to have
  {:error, _connection_closed_error} -> Ch.query(conn, statement, params, opts)
  other -> other
end

But that also seems a bit like a hack.
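For what it's worth, that call-site retry can be written library-agnostically (a sketch: RetryOnce and its helper names are mine, and matching on a :closed reason is an assumption based on how Mint.TransportError reports closed sockets):

```elixir
defmodule RetryOnce do
  # Re-run `fun` once if it returns an error that looks like a closed socket.
  def run(fun, closed? \\ &closed_error?/1) do
    case fun.() do
      {:error, reason} = error ->
        if closed?.(reason), do: fun.(), else: error

      ok ->
        ok
    end
  end

  # Assumed error shapes: a bare :closed or a struct/map with reason: :closed
  # (e.g. %Mint.TransportError{reason: :closed}).
  defp closed_error?(%{reason: :closed}), do: true
  defp closed_error?(:closed), do: true
  defp closed_error?(_), do: false
end

# Simulated flaky call: fails with :closed the first time, succeeds on retry.
{:ok, agent} = Agent.start_link(fn -> :first end)

flaky = fn ->
  case Agent.get_and_update(agent, fn state -> {state, :later} end) do
    :first -> {:error, :closed}
    :later -> {:ok, :result}
  end
end

{:ok, :result} = RetryOnce.run(flaky)
IO.puts("retried and succeeded")
```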

@ruslandoga
Contributor

ruslandoga commented Oct 31, 2024

I'll try wrapping connection in a try/catch block to rescue exits on invalid configuration.

PR: #209 (merged)

@ruslandoga
Contributor

ruslandoga commented Nov 2, 2024

So to sum up, this issue uncovered two problems with Ch:

1. an unexpected error/exit/throw during connect (e.g. from an invalid socket option) could crash the connection process instead of being logged and retried with backoff (addressed in #209);
2. a query on a connection whose socket has already been closed fails with a transport error instead of attempting a fresh connection.

@andyleclair
Author

Oh fantastic! Thank you so much!

@ruslandoga
Contributor

ruslandoga commented Nov 4, 2024

Released as v0.2.9: https://diff.hex.pm/diff/ch/0.2.8..0.2.9
