Allow Parquet reader to read incorrectly written (negative) uint8, uint16 values for compatibility #7040
Comments
For reference, the raw column data for columns _9 and _10 appears to be:
IMO if the column contains an out-of-range Int32 for a UInt8 or UInt16, returning null or an error is the correct behaviour. I am not sure what exactly you are proposing should be different? I therefore think the current behaviour is arguably correct... Perhaps you could clarify what behaviour you are expecting?
Let's take the example of the third value in …
To add one more thing - The parquet cli is in Java, which does not have unsigned integers, so the value that the cli displays for the column is …
Right, which is out of range for a uint8...
Can you point to where in the specification it says to do this? I can't help feeling we are exploring the undefined behaviour alluded to in the specification
TBC I have no real opinion on what the correct behaviour should or should not be, however, if we are going to make changes to the current behaviour we should make certain that we're changing it to what the behaviour should be, and not just what some other implementation happens to do. This would likely involve building consensus to make an appropriate addition to the parquet specification.

Edit: FWIW an approach that allows for non-zero padding of integers would likely cause a lot of challenges for statistics and bloom filters.
The relevant part of the spec is here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers
and later
I would interpret this to mean that only 8 bits are significant and that the bit pattern represents an unsigned value.
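For illustration, a minimal standalone Rust sketch of that reading of the spec (the example values are made up, and this is not the reader's actual code path):

```rust
fn main() {
    // Under the "only the low bits are significant" interpretation, the stored
    // Int32 bit pattern is simply reinterpreted as the unsigned logical type.
    let stored: i32 = -1; // bit pattern 0xFFFF_FFFF, out of range for UInt8
    assert_eq!(stored as u8, 255); // low 8 bits: 0xFF

    let stored: i32 = -32768; // bit pattern 0xFFFF_8000, out of range for UInt16
    assert_eq!(stored as u16, 32768); // low 16 bits: 0x8000
}
```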
I don't disagree that this is an interpretation of the specification, but it also is by no means the only one. I think the correct behaviour would need to be clarified before making any changes here.
Given it spits out an objectively invalid value, I think a perhaps more reasonable take is that the data is invalid and the behaviour undefined.
I'm not sure what parquet implementation you are referring to; parquet doesn't have a canonical implementation.
FWIW, pyarrow produces
Might be worth looking at how arrow-cpp handles this. Also, parquet-read (which uses …). One could also argue it's a bug in the Java implementation: when writing an 8-bit unsigned value, you'd expect values to be masked to the proper number of bits to avoid this issue.
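A rough sketch of the write-side masking being suggested here (a hypothetical helper, not an existing parquet-rs or parquet-java API):

```rust
/// Hypothetical helper: reduce a value destined for a UInt8 column to its low
/// 8 bits before storing it in the Int32 physical type, so the stored value is
/// always in the valid range 0..=255.
fn encode_u8_physical(value: i64) -> i32 {
    (value & 0xFF) as i32
}

fn main() {
    assert_eq!(encode_u8_physical(200), 200);
    assert_eq!(encode_u8_physical(-1), 255); // a writer without masking might store -1
    assert_eq!(encode_u8_physical(256), 0);  // wraps, but stays in range for UInt8
}
```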
Yeah. As @tustvold said, we might be looking at the undefined behavior part of the spec.

@tustvold the parquet implementation I was referring to is the one pointed to by the Apache Parquet project - https://github.com/apache/parquet-java/

@etseidl the pyarrow output matches my expectation. Afaik, pyarrow uses arrow-cpp's parquet implementation. These two implementations are probably the most widely used implementations out there.
And therein lies the rub :).
Should we consider a reader option that allows 'java compatible' behavior? After playing around with the code a bit, I'm coming to the conclusion that the current behavior is correct, but also feel that compatibility with other engines is desirable.
The parquet spec does say:
So there is no bug here, as @tustvold pointed out, but the writer that produced the file violated the spec.
I think we should figure out what we want to do with this malformed data, and then do it consistently. I'd be inclined to do as arrow-cpp and mask 8- and 16-bit integers after reading, but that might have a small performance penalty. Gating this behavior may be the way to go.
I would suggest filing a bug against whatever produced the file, as I do think it is incorrect to produce such a file. I'd also be curious if the statistics are also incorrect. Once we have some consensus on what is correct, I'm fine with working around the bug by masking, but I think it sensible to get feedback from the broader community first. Perhaps there is something we are missing.
Fair enough. I will track down where the issue originates and log an issue with the appropriate project. Also, updating the issue title to reflect this is not a bug but a compatibility issue.
I'm happy to engage with the parquet community about this if no one beats me to it.
Thanks @etseidl. The issue is in the ExampleParquetWriter (from parquet-java) used here in Spark unit tests: https://github.com/apache/spark/blob/ece14704cc083f17689d2e0b9ab8e31cf71a7a2d/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala#L871
Issue logged: apache/parquet-java#3142
Having looked more deeply into the code that created this file, I'm not sure if there are many writers out there that can produce such erroneous files. On the other hand, a whole lot of readers are able to handle the illegal values, so one could argue that we should too.
At least for C++, the writes do not appear to actually mask to the number of bits written but instead do a cast. The read side for C++ just does the opposite operation, so it isn't masking explicitly, just relying on the down-cast behavior.
The rust record reader also relies on casting behavior AFAICT. On the arrow side, we're doing array casts, which then return null (or an error, depending on the cast options) for the out-of-range values.
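A small sketch of that array-cast behaviour using the arrow cast kernel (assuming a recent arrow-rs version; this is not the reader's internal code):

```rust
use arrow::array::{Array, Int32Array, UInt8Array};
use arrow::compute::kernels::cast::{cast, cast_with_options, CastOptions};
use arrow::datatypes::DataType;

fn main() {
    // An Int32 column containing a value that is out of range for UInt8,
    // as in the file attached to this issue.
    let physical = Int32Array::from(vec![1, 2, -1]);

    // Default ("safe") cast: the out-of-range value becomes null.
    let casted = cast(&physical, &DataType::UInt8).unwrap();
    let casted_u8 = casted.as_any().downcast_ref::<UInt8Array>().unwrap();
    assert!(casted_u8.is_null(2));

    // With safe = false, the cast returns an error instead.
    let options = CastOptions { safe: false, ..Default::default() };
    assert!(cast_with_options(&physical, &DataType::UInt8, &options).is_err());
}
```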
I have a fix for this ready to go (mod tests), but want to ping the Parquet community first as @tustvold suggested. I think we'll need to add your file to parquet-testing unless I can figure out a way to coerce parquet-rs to write incorrect data like that. @parthchandra for now I'll put up a draft PR so you can test if my fix works.
Opened #7055
Tested with an arrow-rs example program and verified the fix works. Will also test with datafusion comet but am fairly certain this will address the issue.
Confirmed end-to-end with datafusion comet. (File written by parquet-java; parquet-rs produces the same output as Spark with this patch.)
I tried doing that and wasn't able to. I think that adding the file to parquet-testing is just fine.
Let me correct that - Confirmed that datafusion comet returns the expected values now. (There is still a mismatch with Spark, but that is likely a Spark issue.)
Attaching a couple more files for testing purposes - small_uint_types_32768.parquet.zip
Now I'm conflicted. On the one hand, sentiment in the Parquet community seems to be heading towards returning an error for data that is malformed as in this issue. On the other, I've done some benchmarking of the fix vs the current code and have found that the fallible cast currently used is up to 2X slower than using the infallible casting as in #7055. This is also true for signed INT8 and INT16, but not for UINT32. @tustvold and @alamb, what do you think of special casing casts to truncated integer types? Is the speedup worth the extra code? And how do we then handle detecting malformed data and returning an error? Would we add a validation step for 8 and 16 bit integers?
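To make the trade-off concrete, a minimal standalone sketch of the two strategies (illustrative only, not the benchmarked decode path):

```rust
fn fallible(values: &[i32]) -> Vec<Option<u8>> {
    // Range-checked cast: out-of-range values become None (i.e. null),
    // at the cost of a branch per element.
    values.iter().map(|&v| u8::try_from(v).ok()).collect()
}

fn infallible(values: &[i32]) -> Vec<u8> {
    // Truncating `as` cast: keeps the low 8 bits with no branching, so it
    // vectorizes well, but it silently reinterprets malformed input.
    values.iter().map(|&v| v as u8).collect()
}

fn main() {
    let data = [0, 255, -1, 300];
    assert_eq!(fallible(&data), vec![Some(0), Some(255), None, None]);
    assert_eq!(infallible(&data), vec![0, 255, 255, 44]); // -1 -> 255, 300 -> 44
}
```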
If the question is "should we slow down decoding to give more compatible results for malformed input" my answer would be not by default.
@parthchandra 's suggestion seems reasonable to me, especially if the current behavior is "correct" according to the spec.

Another potential alternative might be to provide some sort of pluggable behavior (like allowing the default conversion logic to be overridden via a template or something) -- that way downstream users who needed different edge-case behavior could implement whatever they needed without having to add such logic back up here 🤔
IMO given the following:
I'd personally be inclined to do nothing; it seems unnecessary to add complexity if this is purely a theoretical problem.
I think it is a practical issue for systems (like Comet) that are trying to be bug-for-bug compatible with Spark and other JVM based products. Of course, one might argue that bug-for-bug compatibility is a fool's errand, but I think it is a valuable one for certain people.
Right, but this isn't a bug in Spark; the consensus appears to be this is a bug in some test harness within Spark producing invalid files, which systems then handle differently. Adding complexity to work around this seems unnecessary if no real system produces such files.
If this is the case, then investing in fixing the harness seems reasonable to me (but of course, since I am not going to do the work, it would sound like a good idea :))
I guess I got off on a bit of a tangent here. I was asking not so much about handling the bad data, but rather about the serendipitous finding that the existing code is quite slow doing casts from 32 bits down to 8 or 16. That can be raised in a separate issue.
Yes, 100%. Filed this one to start the conversation
@tustvold You're right that this is not a bug in Spark, but users of Spark must have found it useful to be able to read Parquet files with unsigned int8 and unsigned int16, which is why Spark supports reading such types.
@alamb, this might be a better way than the option I suggested.
Aah, I see the confusion. I interpreted
as there being some buggy Spark test harness producing invalid files. I agree that if parquet-java can produce such invalid files, users may want to control what behaviour they get when reading them. (It should also be filed as a bug in parquet-java.) Given this behaviour is inherently not standardised, providing a way to configure it seems reasonable to me.
I've logged an issue in parquet-java already (apache/parquet-java#3142). I logged it only for the …
Describe the bug
The parquet spec says a uint8 or uint16 value must be an int32 annotated by INT(8, false) or INT(16, false). A file with such values gets read into an int32 vector and the value read may be negative. When casting these values to the unsigned type, the cast method checks if the value is outside the range of valid values for that unsigned type. Since a negative value is outside the range, the cast method will either return null or throw an error (depending on the specified cast option).

To Reproduce
I modified parquet/examples/read_parquet.rs to read columns _9 and _10 from the attached file.

The file schema and contents as dumped by the parquet cli -
Schema
Values -
Results
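For reference, a rough sketch of the kind of read_parquet.rs modification described above (the file name and the root-column indices for _9 and _10 are assumptions, not the exact code used):

```rust
use std::fs::File;

use arrow::util::pretty::print_batches;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumed local path to the attached file.
    let file = File::open("small_uint_types.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Project only the two columns of interest; the root indices below are a
    // guess -- check the printed schema to find the right ones for _9 and _10.
    let mask = ProjectionMask::roots(builder.parquet_schema(), [9, 10]);
    let reader = builder.with_projection(mask).build()?;

    // Collect the record batches (the cast to UInt8/UInt16 happens here) and
    // print them, which surfaces the null/error behaviour described above.
    let batches: Vec<_> = reader.collect::<Result<_, _>>()?;
    print_batches(&batches)?;
    Ok(())
}
```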