
Allow Parquet reader to read incorrectly written (negative) uint8, uint16 values for compatibility #7040

Open
parthchandra opened this issue Jan 28, 2025 · 37 comments · May be fixed by #7055
Labels
question Further information is requested

Comments

@parthchandra
Contributor

parthchandra commented Jan 28, 2025

Describe the bug
The Parquet spec says a uint8 or uint16 value must be stored as an int32 annotated with INT(8, false) or INT(16, false). A file with such values is read into an int32 vector, and the value read may be negative. When casting these values to the unsigned type, the cast method checks whether the value is outside the valid range for that type. Since a negative value is out of range, the cast will either return null or throw an error (depending on the specified cast option).
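For illustration, a minimal sketch of the cast behaviour described above (assuming a recent version of the arrow crate; exact module paths and CastOptions fields may differ between versions): a negative Int32 cast to UInt8 becomes null with the safe option, or produces an error without it.

```rust
use arrow::array::{Array, Int32Array};
use arrow::compute::{cast_with_options, CastOptions};
use arrow::datatypes::DataType;

fn main() {
    // Values as decoded from the file: the "uint8" column holds a negative Int32.
    let src = Int32Array::from(vec![Some(-18), Some(42), None]);

    // safe: true (the default) -> the out-of-range value is replaced by null
    let safe = CastOptions::default();
    let out = cast_with_options(&src, &DataType::UInt8, &safe).unwrap();
    assert!(out.is_null(0)); // -18 is outside 0..=255, so it is nulled

    // safe: false -> the same cast returns an error instead
    let strict = CastOptions { safe: false, ..Default::default() };
    assert!(cast_with_options(&src, &DataType::UInt8, &strict).is_err());
}
```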

To Reproduce
I modified parquet/examples/read_parquet.rs to read columns _9 and _10 from the attached file.

The file schema and contents, as dumped by the parquet CLI:

Schema

File path:  ./alltypes_extended_plain.parquet
Created by: parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba)
Properties:
  writer.model.name: example
Schema:
message root {
  optional boolean _1;
  optional int32 _2 (INTEGER(8,true));
  optional int32 _3 (INTEGER(16,true));
  optional int32 _4;
  optional int64 _5;
  optional float _6;
  optional double _7;
  optional binary _8 (STRING);
  optional int32 _9 (INTEGER(8,false));
  optional int32 _10 (INTEGER(16,false));
  optional int32 _11 (INTEGER(32,false));
  optional int64 _12 (INTEGER(64,false));
  optional binary _13 (ENUM);
  optional fixed_len_byte_array(3) _14;
  optional int32 _15 (DECIMAL(5,2));
  optional int64 _16 (DECIMAL(18,10));
  optional fixed_len_byte_array(16) _17 (DECIMAL(38,37));
  optional int64 _18 (TIMESTAMP(MILLIS,true));
  optional int64 _19 (TIMESTAMP(MICROS,true));
  optional int32 _20 (DATE);
}

Values -

{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": true, "_2": 18, "_3": 10002, "_4": 10002, "_5": 10002, "_6": 10002.0, "_7": 10002.0, "_8": "100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002", "_9": -18, "_10": -10002, "_11": -10002, "_12": -10002, "_13": "10002", "_14": [50, 50, 50], "_15": 10002, "_16": 10002, "_17": [50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50], "_18": 10002, "_19": 10002, "_20": 10002}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": true, "_2": 20, "_3": 10004, "_4": 10004, "_5": 10004, "_6": 10004.0, "_7": 10004.0, "_8": "100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004", "_9": -20, "_10": -10004, "_11": -10004, "_12": -10004, "_13": "10004", "_14": [52, 52, 52], "_15": 10004, "_16": 10004, "_17": [52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52], "_18": 10004, "_19": 10004, "_20": 10004}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": true, "_2": 24, "_3": 10008, "_4": 10008, "_5": 10008, "_6": 10008.0, "_7": 10008.0, "_8": "100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008", "_9": -24, "_10": -10008, "_11": -10008, "_12": -10008, "_13": "10008", "_14": [56, 56, 56], "_15": 10008, "_16": 10008, "_17": [56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56], "_18": 10008, "_19": 10008, "_20": 10008}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}

Results

+----+-----+
| _9 | _10 |
+----+-----+
|    |     |
|    |     |
|    |     |
|    |     |
|    |     |
|    |     |
|    |     |
|    |     |
|    |     |
|    |     |
+----+-----+

Expected behavior
Expect non-null values to be returned.

Additional context
Parquet file generated by Spark:  

[alltypes_extended_plain.parquet.zip](https://github.com/user-attachments/files/18577979/alltypes_extended_plain.parquet.zip)
@tustvold
Contributor

tustvold commented Jan 30, 2025

For reference, the raw column data appears to be:

_9

null
null
-18
null
-20
null
null
null
-24
null

_10

null
null
-10002
null
-10004
null
null
null
-10008
null

IMO, if the column contains an out-of-range Int32 for a UInt8 or UInt16, returning null or an error is the correct behaviour. I am not sure what exactly you are proposing should be different?

I therefore think the current behaviour is arguably correct... Perhaps you could clarify what behaviour you are expecting?

@tustvold tustvold added question Further information is requested and removed bug labels Jan 30, 2025
@parthchandra
Contributor Author

Let's take the example of the third value in _9. The value in the parquet file is the bit pattern 0xffffffee.
Interpreted as a signed 8-bit integer the value would be -18; interpreted as an unsigned 8-bit integer it would be 238. Note that this interpretation only looks at the lower 8 bits.
Because the annotation in the parquet schema says the value is unsigned and 8 bits we must interpret the bit pattern as an unsigned 8 bit integer and produce 238.
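A quick illustration of the two interpretations in plain Rust (nothing arrow-specific, just the stored 32-bit pattern truncated to its low 8 bits):

```rust
fn main() {
    let stored: i32 = -18;                 // bit pattern 0xFFFF_FFEE in the file
    assert_eq!(stored as u32, 0xFFFF_FFEE);
    assert_eq!(stored as u8, 238);         // keep only the low 8 bits -> unsigned 238
    assert_eq!((stored as u8) as i8, -18); // the same 8 bits read as signed -> -18
}
```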

@parthchandra
Contributor Author

To add one more thing: the parquet CLI is written in Java, which does not have unsigned integers, so the value the CLI displays for the column is -18. It is definitely not showing the correct value, since an unsigned type cannot have a negative value.

@tustvold
Contributor

tustvold commented Jan 30, 2025

The value in the parquet file is the bit pattern 0xffffffee.

Right, which is out of range for a uint8...

Because the annotation in the parquet schema says the value is unsigned and 8 bits we must interpret the bit pattern as an unsigned 8 bit integer and produce 238.

Can you point to where in the specification it says to do this? I can't help feeling we are exploring the undefined behaviour alluded to in the specification

If a stored value is larger than the maximum allowed by the annotation, the behavior is not defined and can be determined by the implementation.

TBC I have no real opinion on what the correct behaviour should or should not be, however, if we are going to make changes to the current behaviour we should make certain that we're changing it to what the behaviour should be, and not just what some other implementation happens to do. This would likely involve building consensus to make an appropriate addition to the parquet specification.

Edit: FWIW an approach that allows for non-zero padding of integers would likely cause a lot of challenges for statistics and bloom filters.

@parthchandra
Contributor Author

The relevant part of the spec is here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers
Quoting from the above -

INT annotation can be used to specify unsigned integer types, along with a maximum number of bits in the stored value.

and later

UINT_8 IntType (bitWidth = 8, isSigned = false)

I would interpret this to mean that only 8 bits are significant and that the bit pattern represents an unsigned value.
I agree we should have consensus on what the correct interpretation should be, but perhaps we could use the parquet implementation as the guide.
Irrespective of what the parquet cli displays for the value, it certainly does not display a null, implying the value is valid and within the range of legal values. Wdyt?

@tustvold
Contributor

tustvold commented Jan 30, 2025

I would interpret this to mean that only 8 bits are significant and that the bit pattern represents an unsigned value.

I don't disagree that this is an interpretation of the specification, but it also is by no means the only one. I think the correct behaviour would need to be clarified before making any changes here.

Irrespective of what the parquet cli displays for the value, it certainly does not display a null, implying the value is valid and within the range of legal values.

Given it spits out an objectively invalid value, I think a perhaps more reasonable take is that the data is invalid and the behaviour undefined.

could use the parquet implementation as the guide.

I'm not sure what parquet implementation you are referring to, parquet doesn't have a canonical implementation.

@etseidl
Contributor

etseidl commented Jan 31, 2025

FWIW, pyarrow produces

_1: [[null,null,true,null,true,null,null,null,true,null]]
_2: [[null,null,18,null,20,null,null,null,24,null]]
_3: [[null,null,10002,null,10004,null,null,null,10008,null]]
_4: [[null,null,10002,null,10004,null,null,null,10008,null]]
_5: [[null,null,10002,null,10004,null,null,null,10008,null]]
_6: [[null,null,10002,null,10004,null,null,null,10008,null]]
_7: [[null,null,10002,null,10004,null,null,null,10008,null]]
_8: [[null,null,"100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002",null,"100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004",null,null,null,"100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008",null]]
_9: [[null,null,238,null,236,null,null,null,232,null]]
_10: [[null,null,55534,null,55532,null,null,null,55528,null]]
...

Might be worth looking at how arrow-cpp handles this.

Also, parquet-read (which uses SerializedFileReader) produces output consistent with what's desired (and pyarrow/arrow-cpp).

One could also argue it's a bug in the java implementation...when writing an 8 bit unsigned, you'd expect values to be masked to the proper number of bits to avoid this issue.

@parthchandra
Contributor Author

Yeah. As @tustvold said we might be looking at the undefined behavior part of the spec.

@tustvold the parquet implementation I was referring to is the one pointed to by the Apache Parquet project - https://github.com/apache/parquet-java/

@etseidl the pyarrow output matches my expectation. AFAIK, pyarrow uses arrow-cpp's Parquet reader.

These two implementations are probably the most widely used implementations out there.

One could also argue it's a bug in the java implementation...when writing an 8 bit unsigned, you'd expect values to be masked to the proper number of bits to avoid this issue

And therein lies the rub :) .

@parthchandra
Contributor Author

Should we consider a reader option that allows 'java compatible' behavior? After playing around with the code a bit, I'm coming to the conclusion that the current behavior is correct, but also feel that compatibility with other engines is desirable.

@etseidl
Contributor

etseidl commented Jan 31, 2025

The parquet spec does say:

If a stored value is larger than the maximum allowed by the annotation, the behavior is not defined and can be determined by the implementation. Implementations must not write values that are larger than the annotation allows.

So there is no bug here, as @tustvold pointed out, but the writer that produced the file violated the spec.

Should we consider a reader option that allows 'java compatible' behavior?

I think we should figure out what we want to do with this malformed data, and then do it consistently. I'd be inclined to do as arrow-cpp does and mask 8- and 16-bit integers after reading, but that might carry a small performance penalty. Gating this behavior may be the way to go.
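As a rough sketch of that "mask after reading" idea (illustrative only, not the arrow-cpp or arrow-rs code), keeping only the low 8 or 16 bits of each stored INT32 reproduces the values pyarrow reports above:

```rust
fn mask_to_u8(values: &[i32]) -> Vec<u8> {
    values.iter().map(|&v| (v & 0xFF) as u8).collect()
}

fn mask_to_u16(values: &[i32]) -> Vec<u16> {
    values.iter().map(|&v| (v & 0xFFFF) as u16).collect()
}

fn main() {
    // Matches the pyarrow output for _9 and _10 earlier in this thread.
    assert_eq!(mask_to_u8(&[-18, -20, -24]), vec![238, 236, 232]);
    assert_eq!(mask_to_u16(&[-10002, -10004, -10008]), vec![55534, 55532, 55528]);
}
```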

@tustvold
Contributor

tustvold commented Jan 31, 2025

I would suggest filing a bug against whatever produced the file, as I do think it is incorrect to produce such a file. I'd also be curious whether the statistics are also incorrect.

Once we have some consensus on what is correct, I'm fine with working around the bug by masking, but I think it sensible to get feedback from the broader community first. Perhaps there is something we are missing

@parthchandra
Contributor Author

Fair enough. I will track down where the issue originates and log an issue with the appropriate project. Also, I'm updating the issue title to reflect that this is not a bug but a compatibility issue.

@parthchandra parthchandra changed the title Parquet reader can incorrectly return null (or error) for uint8, uint16 values Allow Parquet reader to read incorrectly written (negative) uint8, uint16 values for compatibility Jan 31, 2025
@etseidl
Contributor

etseidl commented Jan 31, 2025

I'm happy to engage with the parquet community about this if no one beats me to it.

@parthchandra
Contributor Author

Thanks @etseidl. The issue is in the ExampleParquetWriter (from parquet-java) used here in Spark unit tests : https://github.com/apache/spark/blob/ece14704cc083f17689d2e0b9ab8e31cf71a7a2d/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala#L871
(Spark itself writes only signed integers but will read the file written by the ExampleParquetWriter successfully).
I'll log an issue in Parquet-java and we can follow up there.

@parthchandra
Contributor Author

Issue logged: apache/parquet-java#3142

@parthchandra
Contributor Author

Having looked more deeply into the code that created this file, I'm not sure there are many writers out there that can produce such erroneous files. On the other hand, a whole lot of readers are able to handle the illegal values, so one could argue that we should too.
I will leave it to you, @tustvold and @etseidl, to determine the priority of addressing this issue.
And thank you for spending time on this !

@emkornfield
Contributor

emkornfield commented Jan 31, 2025

At least for C++, the writes do not appear to actually mask to the number of bits written, but instead do a cast:

https://github.com/apache/arrow/blob/f8723722341df31c0091c91ec451319ded58c214/cpp/src/parquet/column_writer.cc#L1854

The read side of C++ just does the opposite operation, so it isn't masking explicitly, just relying on the down-cast behavior.

@etseidl
Contributor

etseidl commented Jan 31, 2025

The read side of C++ just does the opposite operation, so it isn't masking explicitly, just relying on the down-cast behavior.

The Rust record reader also relies on casting behavior AFAICT. On the arrow side, we're doing array casts, which then return null if the cast can't be performed. Specifically, we're relying on num::cast::NumCast, which refuses out-of-range casts from signed to unsigned types. If we want to change behavior, we'll likely need to do special work when casting from Parquet INT32 to unsigned integers.
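For reference, a tiny example of the difference (assuming the num-traits crate): the fallible NumCast refuses the out-of-range value, while a plain `as` cast truncates to the low bits the way arrow-cpp's down-cast does.

```rust
use num_traits::NumCast;

fn main() {
    let v: i32 = -18;
    let checked: Option<u8> = NumCast::from(v); // range-checked conversion -> None
    assert_eq!(checked, None);
    assert_eq!(v as u8, 238);                   // truncating cast -> 238
}
```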

@parthchandra
Contributor Author

SerializedFileReader also seems to be doing a straight cast without any range checking.

@etseidl
Contributor

etseidl commented Jan 31, 2025

I have a fix for this ready to go (mod tests), but want to ping the Parquet community first as @tustvold suggested. I think we'll need to add your file to parquet-testing unless I can figure out a way to coerce parquet-rs to write incorrect data like that.

@parthchandra for now I'll put up a draft PR so you can test if my fix works.

@etseidl
Contributor

etseidl commented Jan 31, 2025

Opened #7055

@parthchandra
Contributor Author

Tested with the arrow-rs example program and verified the fix works. Will also test with datafusion comet, but am fairly certain this will address the issue.

@parthchandra
Contributor Author

Confirmed end-to-end with datafusion comet. (File written by parquet-java; parquet-rs produces the same output as Spark with this patch.)
I will assume that this is not the full patch and there will be an option to turn this on?

@parthchandra
Contributor Author

we'll need to add your file to parquet-testing unless I can figure out a way to coerce parquet-rs to write incorrect data like that

I tried doing that and wasn't able to. I think that adding the file to parquet-testing is just fine.

@parthchandra
Contributor Author

Confirmed end-to-end with datafusion comet. (File written by parquet-java; parquet-rs produces the same output as Spark with this patch.)

Let me correct that: confirmed that datafusion comet returns the expected values now. (There is still a mismatch with Spark, but that is likely a Spark issue.)

@parthchandra
Contributor Author

Attaching a couple more files for testing purposes:

small_uint_types_32768.parquet.zip
small_uint_types_256.parquet.zip

@etseidl
Contributor

etseidl commented Feb 4, 2025

Now I'm conflicted. On the one hand, sentiment in the Parquet community seems to be heading towards returning an error for data that is malformed as in this issue. On the other, I've done some benchmarking of the fix vs the current code and have found that the fallible cast currently used is up to 2X slower than using the infallible casting as in #7055. This is also true for signed INT8 and INT16, but not for UINT32.

@tustvold and @alamb, what do you think of special casing casts to truncated integer types? Is the speedup worth the extra code? And how do we then handle detecting malformed data and returning an error? Would we add a validation step for 8 and 16 bit integers?
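One possible shape of that "validation step" idea, purely as an illustration (the helper names here are hypothetical, not the arrow-rs implementation): validate the whole INT32 buffer up front, then use the cheap truncating cast for the conversion itself.

```rust
// Hypothetical helpers, for illustration only.
fn validate_u8_range(values: &[i32]) -> Result<(), String> {
    match values.iter().find(|&&v| v < 0 || v > u8::MAX as i32) {
        Some(v) => Err(format!("value {v} out of range for UINT_8")),
        None => Ok(()),
    }
}

fn cast_to_u8_truncating(values: &[i32]) -> Vec<u8> {
    values.iter().map(|&v| v as u8).collect() // infallible, truncating cast
}

fn main() {
    let good = [1, 200, 255];
    validate_u8_range(&good).unwrap();
    assert_eq!(cast_to_u8_truncating(&good), vec![1, 200, 255]);

    let bad = [1, -18, 255];
    assert!(validate_u8_range(&bad).is_err());
}
```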

@alamb
Contributor

alamb commented Feb 7, 2025

If the question is "should we slow down decoding to give more compatible results for malformed input" my answer would be not by default.

Should we consider a reader option that allows 'java compatible' behavior? After playing around with the code a bit, I'm coming to the conclusion that the current behavior is correct, but also feel that compatibility with other engines is desirable.

@parthchandra's suggestion seems reasonable to me, especially if the current behavior is "correct" according to the spec.

Another potential alternative might be to provide some sort of pluggable behavior (like allowing the default conversion logic to be overridden via a template or something) -- that way downstream users who need different edge-case behavior could implement whatever they need without having to add such logic back up here 🤔

@tustvold
Contributor

tustvold commented Feb 7, 2025

what do you think of special casing casts to truncated integer types? Is the speedup worth the extra code

IMO given the following:

  • The consensus appears to be that this is a writer bug
  • It doesn't appear any real system produces such files outside of tests

I'd personally be inclined to do nothing, it seems unnecessary to add complexity if this is purely a theoretical problem.

@alamb
Contributor

alamb commented Feb 7, 2025

I'd personally be inclined to do nothing, it seems unnecessary to add complexity if this is purely a theoretical problem.

I think it is a practical issue for systems (like Comet) that are trying to be bug-for-bug compatible with Spark and other JVM-based products.

Of course, one might argue that bug-for-bug compatibility is a fool's errand, but I think it is a valuable one for certain people.

@tustvold
Contributor

tustvold commented Feb 7, 2025

bug-for-bug compatible with Spark and other JVM based products

Right, but this isn't a bug in Spark; the consensus appears to be that this is a bug in some test harness within Spark producing invalid files, which systems then handle differently. Adding complexity to work around this seems unnecessary if no real system produces such files.

@alamb
Contributor

alamb commented Feb 7, 2025

Right but this isn't a bug in Spark, the consensus appears to be this is a bug in some test harness within Spark producing invalid files, which then systems handle differently

If this is the case, then investing in fixing the harness seems reasonable to me (but of course since I am not going to do the work, it would sound like a good idea :)

@etseidl
Contributor

etseidl commented Feb 7, 2025

what do you think of special casing casts to truncated integer types? Is the speedup worth the extra code

IMO given the following:

  • The consensus appears to be that this is a writer bug
  • It doesn't appear any real system produces such files outside of tests

I'd personally be inclined to do nothing, it seems unnecessary to add complexity if this is purely a theoretical problem.

I guess I got off on a bit of a tangent here. I was asking not so much about handling the bad data, but rather the serendipitous finding that the existing code is quite slow doing casts from 32 bits down to 8 or 16. That can be raised in a separate issue.

@alamb
Contributor

alamb commented Feb 7, 2025

I guess I got off on a bit of a tangent here. I was asking not so much about handling the bad data, but rather the serendipitous finding that the existing code is quite slow doing casts from 32 bits down to 8 or 16. That can be raised in a separate issue.

Yes, 100%. Filed this one to start the conversation.

@parthchandra
Contributor Author

@tustvold You're right that this is not a bug in Spark, but users of Spark must have found it useful to be able to read Parquet files with unsigned int8 and unsigned int16, which is why Spark supports reading such types.
apache/spark#31921
The actual data was written by parquet-java, which is the canonical implementation followed by pretty much every Java-based engine I am aware of. I suspect there are many files out there that have such values. I would suggest that having the flexibility to read such values will only help adoption.

Another potential alternative might be to provide some sort of pluggable behavior (like allow overriding the default conversion logic via template or something) -- that way we downstream users who needed different edge case behavior could implement whatever they needed without having to add such logic back up here

@alamb, this might be a better way than the option I suggested.

@tustvold
Contributor

tustvold commented Feb 7, 2025

Aah I see the confusion, I interpreted

ExampleParquetWriter (from parquet-java) used here in Spark unit tests : https://github.com/apache/spark/blob/ece14704cc083f17689d2e0b9ab8e31cf71a7a2d/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala#L871
(Spark itself writes only signed integers but will read the file written by the ExampleParquetWriter successfully).

as there being some buggy spark test harness producing invalid files. I agree if parquet-java can produce such invalid files, users may want to control what behaviour they get when reading them. (It should also be filed as a bug in parquet-java).

Given this is inherently not standardised, providing a way to configure it seems reasonable to me.

@parthchandra
Contributor Author

I've already logged an issue in parquet-java (apache/parquet-java#3142). I logged it only against the ExampleParquetWriter, but the issue is not in the example; it is just a thin wrapper around the Group ParquetWriter APIs.
