Automatic reconnect clarification #1210

Open
Arttii opened this issue Feb 29, 2024 · 12 comments
Labels
feedback required (reporter response needed), stale (requires a follow-up)

Comments

Arttii commented Feb 29, 2024

Describe the bug

In our app we use a shared instance of the stdlib (database/sql) connection, and today that connection seemingly stopped being alive. Queries were retried, but nothing happened, which leads me to believe the connection was no longer up.

Restarting the app resolved the issue.

Expected behaviour

There are mentions of automatic reconnects in the lib, but I was unable to find any docs on when and how this happens. I assume it should reconnect in this case?

Configuration

Environment

  • Client version: v2.3.0
  • Language version: Go
  • OS: Linux
  • Interface: database/sql compatible driver

ClickHouse server

  • ClickHouse Server version: ClickHouse Cloud (latest)
Arttii added the bug label on Feb 29, 2024
jkaflik (Contributor) commented Feb 29, 2024

@Arttii Hi! Is there any reason to use such an old version?

Arttii (Author) commented Feb 29, 2024

@jkaflik That is a good point. We are planning to upgrade in the next few days. Do you think that could impact this?

Arttii (Author) commented Mar 3, 2024

Upgrading did not really help. The same behavior occurs after the app has been inactive for a while.

jkaflik (Contributor) commented Mar 4, 2024

@Arttii could you clarify this?

Queries were retried, but nothing was happening

What does that mean? Was the query blocked with no response coming back from the server? Is this the TCP or the HTTP protocol?
Can you try enabling debug logs? https://github.com/ClickHouse/clickhouse-go?tab=readme-ov-file#dsn (see debug)

Arttii (Author) commented Mar 4, 2024

We are using the TCP protocol, and yes, it seems no response came and the query was blocked. The connection is concurrency-safe, right? We have a pool being reused by several goroutines at the same time.

I will enable the debug logs and report back.

jkaflik (Contributor) commented Mar 4, 2024

@Arttii the connection pool is safe, but a single connection itself is not thread-safe.
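
For illustration, a minimal sketch of the safe pattern (DSN, limits and query are placeholders, not taken from the thread): share the *sql.DB pool returned by OpenDB across goroutines and let each call borrow a connection, rather than pinning a single driver connection:

package main

import (
	"context"
	"log"
	"sync"

	"github.com/ClickHouse/clickhouse-go/v2"
)

func main() {
	// Placeholder DSN; substitute the real connection string.
	opts, err := clickhouse.ParseDSN("clickhouse://default:@127.0.0.1:9000/activities")
	if err != nil {
		log.Fatal(err)
	}

	// OpenDB returns a *sql.DB: a connection pool that is safe for concurrent use.
	db := clickhouse.OpenDB(opts)
	db.SetMaxOpenConns(10)
	db.SetMaxIdleConns(5)

	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each call borrows a connection from the pool and returns it when done.
			var one uint8
			if err := db.QueryRowContext(context.Background(), "SELECT 1").Scan(&one); err != nil {
				log.Println(err)
			}
		}()
	}
	wg.Wait()
}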

Arttii (Author) commented Mar 4, 2024

I am basically doing this:

opts, err := clickhouse.ParseDSN(cfg.DataDBUrl)
if err != nil {
	return err
}

clickhouseConn := clickhouse.OpenDB(opts)
clickhouseConn.SetMaxIdleConns(5)
clickhouseConn.SetMaxOpenConns(10)

and the DSN is something like this: 'clickhouse://default:@127.0.0.1:9000/activities?dial_timeout=200ms&max_execution_time=60&compress=lz4&dial_timeout=30s&debug=true'

Is there anything obvious that is wrong?

jkaflik (Contributor) commented Mar 5, 2024

Yes, debug=true is enough to get logs written to stdout. Do you have any output?

How do you handle your connections? Are you sure you properly close all sql.Rows and so on? Connections might not be released, and then the pool blocks waiting for a free connection.
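
A minimal sketch of the pattern in question (package, helper name and query are hypothetical): every sql.Rows must be closed so its connection returns to the pool; otherwise the pool eventually blocks waiting for a free connection:

package activities

import (
	"context"
	"database/sql"
)

// queryActivities is a hypothetical helper; the important part is that
// rows.Close runs on every path so the connection returns to the pool.
func queryActivities(ctx context.Context, db *sql.DB) ([]string, error) {
	rows, err := db.QueryContext(ctx, "SELECT name FROM activities LIMIT 10")
	if err != nil {
		return nil, err
	}
	defer rows.Close() // without this, the connection stays checked out

	var names []string
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			return nil, err
		}
		names = append(names, name)
	}
	// rows.Err reports any error that ended iteration early.
	return names, rows.Err()
}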

Arttii (Author) commented Mar 5, 2024

We just pass the pointer around to the goroutines. For querying, we use https://github.com/georgysavva/scany/ to scan into structs (it seems the Rows are closed by that package, judging by the code).

What is the best way to understand which statements have an impact on this? Looking at our metrics, I think you are right: it seems it is waiting for connections.

# HELP db_sql_connections_active_ratio Count of active connections in the pool
# TYPE db_sql_connections_active_ratio gauge
db_sql_connections_active_ratio{db_name="activities",db_system="clickhouse",otel_scope_name="go.nhat.io/otelsql",otel_scope_version=""} 10
# HELP db_sql_connections_idle_closed_ratio The total number of connections closed due to SetMaxIdleConns
# TYPE db_sql_connections_idle_closed_ratio gauge
db_sql_connections_idle_closed_ratio{db_name="activities",db_system="clickhouse",otel_scope_name="go.nhat.io/otelsql",otel_scope_version=""} 4
# HELP db_sql_connections_idle_ratio Count of idle connections in the pool
# TYPE db_sql_connections_idle_ratio gauge
db_sql_connections_idle_ratio{db_name="activities",db_system="clickhouse",otel_scope_name="go.nhat.io/otelsql",otel_scope_version=""} 0
# HELP db_sql_connections_lifetime_closed_ratio The total number of connections closed due to SetConnMaxLifetime
# TYPE db_sql_connections_lifetime_closed_ratio gauge
db_sql_connections_lifetime_closed_ratio{db_name="activities",db_system="clickhouse",otel_scope_name="go.nhat.io/otelsql",otel_scope_version=""} 0
# HELP db_sql_connections_open_ratio Count of open connections in the pool
# TYPE db_sql_connections_open_ratio gauge
db_sql_connections_open_ratio{db_name="activities",db_system="clickhouse",otel_scope_name="go.nhat.io/otelsql",otel_scope_version=""} 10
# HELP db_sql_connections_wait_count_ratio The total number of connections waited for
# TYPE db_sql_connections_wait_count_ratio gauge
db_sql_connections_wait_count_ratio{db_name="activities",db_system="clickhouse",otel_scope_name="go.nhat.io/otelsql",otel_scope_version=""} 1189
# HELP db_sql_connections_wait_duration_milliseconds The total time blocked waiting for a new connection
# TYPE db_sql_connections_wait_duration_milliseconds gauge
db_sql_connections_wait_duration_milliseconds{db_name="activities",db_system="clickhouse",otel_scope_name="go.nhat.io/otelsql",otel_scope_version=""} 932384.413475

The wait duration specifically.
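
(The same pool numbers can also be read directly from database/sql; a minimal sketch, assuming db is the *sql.DB handle returned by OpenDB and the package name is a placeholder:)

package activities

import (
	"database/sql"
	"log"
)

// logPoolStats is a hypothetical helper; db is the *sql.DB returned by OpenDB.
// A growing WaitCount / WaitDuration means goroutines are queueing for connections.
func logPoolStats(db *sql.DB) {
	s := db.Stats()
	log.Printf("open=%d in_use=%d idle=%d wait_count=%d wait_duration=%s",
		s.OpenConnections, s.InUse, s.Idle, s.WaitCount, s.WaitDuration)
}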

Would setting max_execution_time perhaps fix this? We basically have several queries, always running at an interval. Maybe they are starving out the pool?

jkaflik (Contributor) commented Mar 7, 2024

Would setting max_execution_time perhaps fix this? We basically have several queries, always running at an interval. Maybe they are starving out the pool?

Does that mean these queries do not finish in time? You can experiment with a timeout context and see whether these queries are the problem.
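
A minimal sketch of what that could look like (package, query and timeout are placeholders): the context cancels the call if it runs too long, so a slow query cannot hold a pool slot indefinitely:

package activities

import (
	"context"
	"database/sql"
	"time"
)

// runWithTimeout is a hypothetical wrapper: if the query does not finish
// within 30s, the context is cancelled and the error surfaces to the caller.
func runWithTimeout(db *sql.DB, query string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	rows, err := db.QueryContext(ctx, query)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		// scan as needed
	}
	return rows.Err()
}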

Could you provide a code snippet more or less reflecting how you query?

jkaflik added the stale (requires a follow-up) and feedback required (reporter response needed) labels and removed the bug label on Mar 25, 2024
MathiasGr commented

This just happened to us this morning:

write: write tcp 172.20.2.29:40584->35.201.102.65:9440: i/o timeout

It persisted until we restarted the service.

jkaflik (Contributor) commented Nov 13, 2024

@MathiasGr what is the driver version?
