Stream query response for Parquet format #25955

hiltontj · 2025-02-03T13:54:39Z

Problem statement

When a query is made via the HTTP API (/api/v3/query_sql, /api/v3/query_influxql) with parquet as the output format, the record batches in the response are buffered into memory before being serialized to parquet.

This could lead to OOMs for large query responses.

Additional context

The parquet writer is not compatible with writing to a streamed Body using the current version of hyper that we are using.

We either need to upgrade to hyper 1.x, to see if its trait-based approach to Body can be used alongside the existing arrow writers, or there may be some upstream work in arrow-rs required to make this possible.

Proposed solution

We should first start by upgrading hyper to 1.x. This may need to be done upstream in IOx before it can be brought in here, since we are leveraging code from IOx that uses hyper 0.14.x.

Then we can see if implementing a streamed writer for parquet with the new Body trait is possible, or if we need to open some issues upstream in arrow-rs.

The text was updated successfully, but these errors were encountered:

alamb · 2025-02-04T13:56:38Z

The parquet writer is not compatible with writing to a streamed Body using the current version of hyper that we are using.

ArrowWriter is sync

There is an AsyncWriter -- https://docs.rs/parquet/latest/parquet/arrow/async_writer/index.html that may allow streaming responses

That isn't to say we shouldn't update hyper, but since the current hyper version also supports streaming responses I don't think the hyper upgrade is necessarily blocking this work

hiltontj · 2025-02-04T14:45:12Z

@mgattozzi was grappling with this in the issue and related PR linked above. Perhaps there is an approach we haven't tried yet.

I thought this could be done by defining a type that implements Stream and wraps the record batch stream, a parquet writer, and a buffer of Bytes. The writer can write the streamed record batches out to the buffer, which then gets flushed as output of the outer stream. It would be similar to the streaming response implementation in the /query API, but simpler because it just flushes the bytes as is, no need for any weird JSON mangling.

mgattozzi · 2025-02-04T15:03:03Z

The issue @alamb is that Body does not implement Write so you can't just pass it into the writer. Body is used by Hyper to pull bytes from to send. Hyper wants exclusive ownership of Body. If it's given a stream then that works. What we really need is for the parquet writer to be able to create a stream of Bytes for hyper to consume, not a buffer we write all the data too and then pass to hyper which is what the sync and async versions of the writer do. I don't think upgrading to hyper 1.0 solves anything here.

Hopefully this makes sense. I don't think it's impossible though.

hiltontj added the v3 label Feb 3, 2025

This was referenced Feb 3, 2025

Queries in InfluxDB Edge should stream the data back #24655

Closed

feat: stream data back for CSV and JSON queries #25927

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stream query response for Parquet format #25955

Stream query response for Parquet format #25955

hiltontj commented Feb 3, 2025 •

edited

Loading

alamb commented Feb 4, 2025

hiltontj commented Feb 4, 2025

mgattozzi commented Feb 4, 2025

Stream query response for Parquet format #25955

Stream query response for Parquet format #25955

Comments

hiltontj commented Feb 3, 2025 • edited Loading

Problem statement

Additional context

Related:

Proposed solution

alamb commented Feb 4, 2025

hiltontj commented Feb 4, 2025

mgattozzi commented Feb 4, 2025

hiltontj commented Feb 3, 2025 •

edited

Loading