Inconsistent results in pull queries with distributed KsqlDB setup #10241

xneg · 2024-02-23T15:13:22Z

Setup

First of all, we have 6 machines, each running its own ksqlDB instance in Docker from the image confluentinc/ksqldb-server, version 0.29.0.

Second, we have this setup:

listeners=http://0.0.0.0:8088/
ksql.advertised.listener is set for each node
ksql.heartbeat.enable=true
ksql.streams.num.standby.replicas=1
ksql.query.pull.enable.standby.reads=true

We have a scenario very similar to what is described here.

We have an input topic named events with 60 partitions.
We declared a stream on it:

CREATE STREAM EVENTS (EVENT_TYPE STRING, TS STRING) WITH (CLEANUP_POLICY='delete', FORMAT='json', KAFKA_TOPIC='events', TIMESTAMP='ts', TIMESTAMP_FORMAT='yyyy-MM-dd''T''HH:mm:ssX');

We defined a table with aggregations like this:

CREATE TABLE EVENTS_HOURLY_COUNTS AS
SELECT EVENTS.ROW_PARTITION AS PARTITION,
       COUNT(*) AS EVENT_COUNT
FROM EVENTS
WINDOW TUMBLING ( SIZE 1 HOURS )
GROUP BY EVENTS.ROW_PARTITION
EMIT CHANGES;

This created 3 topics for us: 1 visible and 2 hidden (internal).

Kafka Topic                                                                                             | Partitions | Partition Replicas
-------------------------------------------------------------------------------------------------------------------------------------------
 EVENTS_HOURLY_COUNTS                                                                                      | 60         | 2
 _confluent-ksql-data_query_CTAS_EVENTS_HOURLY_COUNTS_209-Aggregate-Aggregate-Materialize-changelog        | 60         | 2
 _confluent-ksql-data_query_CTAS_EVENTS_HOURLY_COUNTS_209-Aggregate-GroupBy-repartition                    | 60         | 2

The problem

When we issue pull queries against this table, they sporadically return inconsistent results, with no errors in the logs.
Our queries look like this:

SELECT WINDOWSTART, partition, event_count FROM events_hourly_counts  WHERE WINDOWSTART >= 1708452000000 AND WINDOWEND  <= 1708509600000

We run them against periods that have already closed, so we expect that newly arriving data should not affect the results.
We expect 60 rows per hour (one per partition), but sometimes (roughly 1 query in 10) we get fewer rows, between 44 and 54, and occasionally even 61.
My guess is that in our multi-node setup some of the nodes time out and do not return results, but with no errors in the logs it is hard to investigate further.
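To narrow down which keys go missing or show up twice, the rows of one pull-query response can be grouped by window. The following is only a minimal sketch: the function name `summarize_windows`, the tuple layout, and the assumption that the keys are exactly the partition numbers 0 to 59 are illustrative, not part of ksqlDB.

```python
from collections import Counter, defaultdict

# Assumption: the table is keyed by source partition number, 0..59.
EXPECTED_KEYS = frozenset(range(60))


def summarize_windows(rows, expected_keys=EXPECTED_KEYS):
    """Report windows whose set of keys differs from the expected one.

    rows: iterable of (window_start_ms, partition, event_count) tuples,
    i.e. the three columns selected by the pull query.
    """
    seen = defaultdict(Counter)
    for window_start, partition, _event_count in rows:
        seen[window_start][partition] += 1

    report = {}
    for window_start in sorted(seen):
        counter = seen[window_start]
        present = set(counter)
        missing = sorted(expected_keys - present)
        duplicated = sorted(k for k, n in counter.items() if n > 1)
        unexpected = sorted(present - expected_keys)
        # Only windows with a discrepancy are included in the report.
        if missing or duplicated or unexpected:
            report[window_start] = {
                "missing": missing,
                "duplicated": duplicated,
                "unexpected": unexpected,
            }
    return report
```

Running this over several responses for the same closed period would show whether a short result lacks whole groups of keys (which would fit the theory of a node not answering) and whether a 61-row hour contains a duplicated key.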

If anyone could help or point us in the right direction for where to dig, that would be great. Thanks in advance!


xneg · 2024-02-27T08:50:17Z

An addition.

I tried to create a table with only one partition:

CREATE TABLE EVENTS_HOURLY_COUNTS WITH (PARTITIONS=1)
AS SELECT EVENTS.ROW_PARTITION AS PARTITION,
       COUNT(*) AS EVENT_COUNT
FROM EVENTS
WINDOW TUMBLING ( SIZE 1 HOURS )
GROUP BY EVENTS.ROW_PARTITION
EMIT CHANGES;

The table now has only one partition, but it still collects the keys from 0 to 59, and the behavior is the same. When I run a pull query over 20 hours, I expect to receive 1200 rows in the result. Most of the time it is 1200 rows, but from time to time it is 1199, 936, or even 1201 rows!
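For reference, the expected row count follows from the number of full hourly windows inside the queried range times the 60 keys. A small sketch of that arithmetic, assuming tumbling windows aligned to the epoch and at least one event per key in every window (the helper name `expected_row_count` is made up for illustration):

```python
HOUR_MS = 60 * 60 * 1000


def expected_row_count(start_ms, end_ms, keys=60, window_ms=HOUR_MS):
    """Rows expected for WINDOWSTART >= start_ms AND WINDOWEND <= end_ms.

    Assumes epoch-aligned tumbling windows and that every key has
    at least one event in every window.
    """
    # Start of the first window that begins at or after start_ms (ceiling).
    first_start = -(-start_ms // window_ms) * window_ms
    # Number of full windows that also end at or before end_ms.
    windows = max(0, (end_ms - first_start) // window_ms)
    return windows * keys


# 20 hourly windows with 60 keys each.
print(expected_row_count(0, 20 * HOUR_MS))  # 1200
```

With the bounds from the query in the first comment (1708452000000 and 1708509600000), the same function gives 16 windows, so 960 rows.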

xneg added the needs-triage label Feb 23, 2024