
Socket errors when we lose a connection #208

Closed
andyleclair opened this issue Oct 30, 2024 · 7 comments
@andyleclair

Hi! We're seeing a weird error in prod that happens when we lose our connection to Clickhouse Cloud. We have confirmed with them that these errors coincide with issues on their end with a pod dying. We're trying to figure out if we're still connecting to the dead pod or what. This is the error we see:

Ch.Connection (#PID<0.200885.0>) failed to connect: ** (Mint.TransportError) Invalid transport socket option sndbuf: invalid argument

We also see interspersed

Ch.Connection (#PID<0.3291.0>) disconnected: ** (Mint.TransportError) socket closed

This seems like an issue on the client end, like the connection has been closed but we're trying to reuse it? I'm not sure, I haven't dug too deep into the implementation yet, looking for any guidance you might have

@ruslandoga
Contributor

ruslandoga commented Oct 31, 2024

👋

** (Mint.TransportError) Invalid transport socket option sndbuf: invalid argument

Ch doesn't set any socket options itself, and it doesn't seem like Mint sets sndbuf (i.e. SO_SNDBUF) either. Mint only sets buffer, an OTP-specific option that controls how much memory is allocated for :gen_tcp.recv(socket, 0, timeout) (i.e. receiving without specifying a size); it sets buffer to the max of sndbuf, recbuf (i.e. SO_RCVBUF), and buffer: https://github.com/elixir-mint/mint/blob/50b11d668b6a240b0d9b20c67fbb41a10a7410b1/lib/mint/http1.ex#L192 and https://github.com/elixir-mint/mint/blob/50b11d668b6a240b0d9b20c67fbb41a10a7410b1/lib/mint/core/util.ex#L20-L38.
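That buffer tuning can be sketched with plain :inet calls (this is my reading of the linked Mint code, not Ch's or Mint's API): after connecting, read sndbuf, recbuf, and buffer, and raise buffer to the largest of the three.

```elixir
# A throwaway loopback socket just to have something to inspect.
{:ok, listen} = :gen_tcp.listen(0, [:binary])
{:ok, port} = :inet.port(listen)
{:ok, socket} = :gen_tcp.connect(~c"localhost", port, [:binary, active: false])

# Read the kernel-negotiated buffer sizes and the current Erlang-side buffer...
{:ok, opts} = :inet.getopts(socket, [:sndbuf, :recbuf, :buffer])

# ...and raise buffer to the largest of the three, as the linked Mint code does.
buffer = opts |> Keyword.values() |> Enum.max()
:ok = :inet.setopts(socket, buffer: buffer)
IO.puts("buffer raised to #{buffer}")
```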

I haven't been able to reproduce this exact error even with extreme values:

iex> Mix.install [:mint]

iex> Mint.HTTP1.connect(:https, "google.com", 443, transport_opts: [sndbuf: -1]) # no-op?
{:ok, %Mint.HTTP1{...}}

iex> Mint.HTTP1.connect(:https, "google.com", 443, transport_opts: [sndbuf: 1]) # very slow
{:ok, %Mint.HTTP1{...}}

iex> Mint.HTTP1.connect(:https, "google.com", 443, transport_opts: [sndbuf: 1000000000000000000000000000000000000000]) # probably gets truncated
{:ok, %Mint.HTTP1{...}}
iex> Mix.install [:ch]

# :ssl.connect error
iex> Ch.start_link(scheme: "https", hostname: "localhost", port: 8123, database: "default", transport_opts: [sndbuf: nil])
# 10:01:48.209 [error] Ch.Connection (#PID<0.963.0>) failed to connect: ** (Mint.TransportError) Invalid socket option: # [{sndbuf,nil},
#                         {packet_size,0},
#                         {packet,0},
#                         {header,0},
#                         {active,false},
#                         {mode,binary}]

# :gen_tcp.connect error
iex> Ch.start_link(scheme: "http", hostname: "localhost", port: 8123, database: "default", transport_opts: [sndbuf: nil])
# 10:01:23.915 [error] :gen_statem #PID<0.956.0> terminating
# ** (ArgumentError) argument error
#     (kernel 10.0) gen_tcp.erl:578: :gen_tcp.connect/4
#     (mint 1.6.2) lib/mint/core/transport/tcp.ex:45: Mint.Core.Transport.TCP.connect/3
#     (mint 1.6.2) lib/mint/http1.ex:140: Mint.HTTP1.connect/4
#     (ch 0.2.8) lib/ch/connection.ex:20: Ch.Connection.connect/1
#     (db_connection 2.7.0) lib/db_connection/connection.ex:74: DBConnection.Connection.handle_event/4
#     (stdlib 6.0) gen_statem.erl:3115: :gen_statem.loop_state_callback/11
#     (stdlib 6.0) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
# Queue: [internal: {:connect, :init}]
# Postponed: []
# State: Ch.Connection
# Callback mode: &DBConnection.Connection.handle_event/4, state_enter: false

Maybe I could try with the exact Elixir/Erlang/Mint/Ch versions you are using?

** (Mint.TransportError) socket closed

I think this error is normal when the socket gets, in fact, closed :) But maybe it could be improved.

Since we use "passive" connections, we can only find out that the socket was closed on the next call to it (a ping, every two seconds by default, or a query), and that call causes this error and Ch gets reconnected. If we somehow switched to "active" connections, we'd receive a message that the socket was closed as soon as the FIN (or whatever) arrives. I think Finch had a similar problem, not sure if or how they solved it: sneako/finch#272 and sneako/finch#273
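The passive/active difference can be demonstrated with a bare :gen_tcp loopback socket (a minimal sketch, independent of Ch/Mint): in active mode the owning process gets a {:tcp_closed, socket} message pushed to its mailbox as soon as the peer closes, whereas a passive socket would only report :closed on the next recv.

```elixir
# Throwaway listener on an ephemeral port.
{:ok, listen} = :gen_tcp.listen(0, [:binary, active: false, reuseaddr: true])
{:ok, port} = :inet.port(listen)

# Active-mode client: socket events are delivered as messages.
{:ok, client} = :gen_tcp.connect(~c"localhost", port, [:binary, active: true])
{:ok, server_side} = :gen_tcp.accept(listen)

# The "server" side closes the connection...
:ok = :gen_tcp.close(server_side)

# ...and the active-mode client is notified without having to poll.
receive do
  {:tcp_closed, ^client} -> IO.puts("got :tcp_closed push")
after
  1_000 -> IO.puts("no notification within 1s")
end
```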

@andyleclair
Author

Ok, yeah, I get that, and that's understandable. However, the behavior we're seeing in prod is that we essentially fail to reconnect at all. Enough times, in fact, that the BEAM brings our app down. We're investigating supervising the repo separately so we can control the number of restarts and the backoff there. I'm not sure that's a general-case fix, but it might work well enough. It would be nice if this lib allowed configuring that directly, but we can handle it on the client side easily enough.

Since we use "passive" connections, we can only find out that the socket was closed on the next call to it (a ping, every two seconds by default, or a query), and that call causes this error and Ch gets reconnected. If we somehow switched to "active" connections, we'd receive a message that the socket was closed as soon as the FIN (or whatever) arrives. I think Finch had a similar problem, not sure if or how they solved it: sneako/finch#272 and sneako/finch#273

Ah, yeah, that old chestnut. We also see that one from time to time in prod, I'm guessing also due to AWS NAT. We have retries built into our HTTP clients, but we do notice it. Maybe it would be a nice improvement to detect this case and retry? It seems like if we go to make a query and the connection we've checked out is closed, we should at least attempt to get a fresh one?

@ruslandoga
Contributor

ruslandoga commented Oct 31, 2024

Enough times, in fact, that BEAM brings our app down.

Hm, that... shouldn't happen. Ch should probably just keep retrying. Would you be able to create a repro? It seems like my Ch error snippets above are enough to reproduce this, even if the error message doesn't match exactly. I'll try wrapping the connect callback in a try/catch block to rescue any error/exit/throw on invalid configuration.

We're investigating supervising the repo separately

I would also be interested in learning the root cause of the error: where is an "invalid" sndbuf option coming from?

It seems like if we go to make a query and our connection we've checked out is closed, we should at least attempt to get a fresh one?

That would solve it if we do it in

ch/lib/ch/connection.ex

Lines 243 to 252 in ca21639

def handle_execute(query, params, opts, conn) do
  {query_params, extra_headers, body} = params

  path = path(conn, query_params, opts)
  headers = headers(conn, extra_headers, opts)

  with {:ok, conn, responses} <- request(conn, "POST", path, headers, body, opts) do
    {:ok, query, responses, conn}
  end
end
but would kind of be a "hack" since we would be setting up a connection outside of DBConnection lifecycle. Maybe we can do it at the call-site, e.g. in

ch/lib/ch.ex

Lines 84 to 90 in ca21639

def query(conn, statement, params \\ [], opts \\ []) do
  query = Query.build(statement, opts)

  with {:ok, _query, result} <- DBConnection.execute(conn, query, params, opts) do
    {:ok, result}
  end
end

case DBConnection.execute(conn, query, params, opts) do
  # match whatever shape the "connection closed" error turns out to have
  {:error, _connection_closed_error} -> Ch.query(conn, statement, params, opts)
  other -> other
end

But that also seems a bit like a hack.
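For what it's worth, that call-site retry can be written library-agnostically (a sketch: RetryOnce and its helper names are mine, and matching on a :closed reason is an assumption based on how Mint.TransportError reports closed sockets):

```elixir
defmodule RetryOnce do
  # Re-run `fun` once if it returns an error that looks like a closed socket.
  def run(fun, closed? \\ &closed_error?/1) do
    case fun.() do
      {:error, reason} = error ->
        if closed?.(reason), do: fun.(), else: error

      ok ->
        ok
    end
  end

  # Assumed error shapes: a bare :closed or a struct/map with reason: :closed
  # (e.g. %Mint.TransportError{reason: :closed}).
  defp closed_error?(%{reason: :closed}), do: true
  defp closed_error?(:closed), do: true
  defp closed_error?(_), do: false
end

# Simulated flaky call: fails with :closed the first time, succeeds on retry.
{:ok, agent} = Agent.start_link(fn -> :first end)

flaky = fn ->
  case Agent.get_and_update(agent, fn state -> {state, :later} end) do
    :first -> {:error, :closed}
    :later -> {:ok, :result}
  end
end

{:ok, :result} = RetryOnce.run(flaky)
IO.puts("retried and succeeded")
```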

@ruslandoga
Contributor

ruslandoga commented Oct 31, 2024

I'll try wrapping connection in a try/catch block to rescue exits on invalid configuration.

PR: #209 (merged)

@ruslandoga
Contributor

ruslandoga commented Nov 2, 2024

So to sum up, this issue uncovered two problems with Ch:

1. an unexpected error/exit/throw during connect (e.g. from an invalid socket option) could crash the connection process instead of being logged and retried with backoff (addressed in #209);
2. a query on a connection whose socket has already been closed fails with a transport error instead of attempting a fresh connection.

@andyleclair
Author

Oh fantastic! Thank you so much!

@ruslandoga
Contributor

ruslandoga commented Nov 4, 2024

Released as v0.2.9: https://diff.hex.pm/diff/ch/0.2.8..0.2.9
