Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stream query response for Parquet format #25955

Open
hiltontj opened this issue Feb 3, 2025 · 3 comments
Open

Stream query response for Parquet format #25955

hiltontj opened this issue Feb 3, 2025 · 3 comments
Labels

Comments

@hiltontj
Copy link
Contributor

hiltontj commented Feb 3, 2025

Problem statement

When a query is made via the HTTP API (/api/v3/query_sql, /api/v3/query_influxql) with parquet as the output format, the record batches in the response are buffered into memory before being serialized to parquet.

This could lead to OOMs for large query responses.

Additional context

The parquet writer is not compatible with writing to a streamed Body using the current version of hyper that we are using.

We either need to upgrade to hyper 1.x, to see if its trait-based approach to Body can be used alongside the existing arrow writers, or there may be some upstream work in arrow-rs required to make this possible.

Related:

Proposed solution

We should first start by upgrading hyper to 1.x. This may need to be done upstream in IOx before it can be brought in here, since we are leveraging code from IOx that uses hyper 0.14.x.

Then we can see if implementing a streamed writer for parquet with the new Body trait is possible, or if we need to open some issues upstream in arrow-rs.

@alamb
Copy link
Contributor

alamb commented Feb 4, 2025

The parquet writer is not compatible with writing to a streamed Body using the current version of hyper that we are using.

ArrowWriter is sync

There is an AsyncWriter -- https://docs.rs/parquet/latest/parquet/arrow/async_writer/index.html that may allow streaming responses

That isn't to say we shouldn't update hyper, but since the current hyper version also supports streaming responses I don't think the hyper upgrade is necessarily blocking this work

@hiltontj
Copy link
Contributor Author

hiltontj commented Feb 4, 2025

@mgattozzi was grappling with this in the issue and related PR linked above. Perhaps there is an approach we haven't tried yet.

I thought this could be done by defining a type that implements Stream and wraps the record batch stream, a parquet writer, and a buffer of Bytes. The writer can write the streamed record batches out to the buffer, which then gets flushed as output of the outer stream. It would be similar to the streaming response implementation in the /query API, but simpler because it just flushes the bytes as is, no need for any weird JSON mangling.

@mgattozzi
Copy link
Contributor

The issue @alamb is that Body does not implement Write so you can't just pass it into the writer. Body is used by Hyper to pull bytes from to send. Hyper wants exclusive ownership of Body. If it's given a stream then that works. What we really need is for the parquet writer to be able to create a stream of Bytes for hyper to consume, not a buffer we write all the data too and then pass to hyper which is what the sync and async versions of the writer do. I don't think upgrading to hyper 1.0 solves anything here.

Hopefully this makes sense. I don't think it's impossible though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants