Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet performance: improve performance of reading int8/int16 #7097

Open
alamb opened this issue Feb 7, 2025 · 2 comments
Open

Parquet performance: improve performance of reading int8/int16 #7097

alamb opened this issue Feb 7, 2025 · 2 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate performance

Comments

@alamb
Copy link
Contributor

alamb commented Feb 7, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

@etseidl reports #7040 (comment):

I guess I got off on a bit of a tangent here. I was asking not so much about handling the bad data, but rather the serendipitous finding that the existing code is quite slow doing casts from 32 bits down to 8 or 16. That can be raised in a separate issue.

Describe the solution you'd like
Implement the improvement suggested by @etseidl

Describe alternatives you've considered
To be honest I don't fully understand what code this affects, so perhaps @etseidl could help

Additional context

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Feb 7, 2025
@alamb alamb added parquet Changes to the parquet crate performance labels Feb 7, 2025
@etseidl
Copy link
Contributor

etseidl commented Feb 7, 2025

Thanks @alamb! This is my (perhaps incorrect) understanding of the issue.

One step in reading a Parquet file is converting the Parquet physical/logical_type into a corresponding Arrow type. For Parquet primitive types, this process begins in PrimitiveArrayReader::consume_batch, where an appropriate array reader will be constructed based on the Parquet physical type. This array will then be cast to an array with a suitable Arrow type. Several special cases (INT32|INT64 -> Decimal128|Decimal256, INT32 -> Date64) are handled in consume_batch, with all others delegated to arrow_cast::cast. In the case here (32 bit integer being cast into a small bitwidth), control ultimately passes to cast_numeric_arrays, which in turn ultimately uses PrimitiveArray::unary_opt, passing num::cast::cast as the operation to perform. Because num::cast::cast returns an option, the slower unary_opt is used rather than the usually much faster unary. Some initial testing found that detecting INT32->INT8|INT16 in consume_batch instead using unary with a straightforward cast from i32 to u8 was 30-50% faster (at least on my old Mac laptop, I still need to check on a more modern processor), with an array containing some nulls seeing the biggest improvement. (I modified arrow_reader bench to add int8 and int16 benches).

So to my mind, the question is is that enough of a performance gain to warrant making consume_batch that much more complicated? Or is it worth reexamining the use of arrow_cast::cast_numeric_arrays, at least for integer->integer casts, and perhaps create a version that can use unary? I lack historical context here, so perhaps @tustvold can explain why this might be a bad idea.

FWIW, this all grew out of trying to see if a potential fix for the issue in #7040 would negatively impact performance.

@etseidl
Copy link
Contributor

etseidl commented Feb 7, 2025

So I ran down the rabbit hole implementing the above in arrow-cast, which resulted in several failed tests where casts were expected to fail. So if we do make any change, it will likely need to be Parquet specific (i.e. stay in consume_batch).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate performance
Projects
None yet
Development

No branches or pull requests

2 participants