Different null handling behavior between Polars and Pandas validation #1835

alexismanuel · 2024-10-24T19:03:54Z

Describe the bug
When validating nullable fields with drop_invalid_rows = True, Pandera behaves differently on Polars and Pandas. Polars validation drops a row containing null values even though the fields are explicitly marked nullable=True, while Pandas validation correctly preserves it. In the example below, the affected row is the one where both col1 and col2 are null; the row with a null only in col2 is kept by both backends.

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandera.
(optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

import polars as pl
import pandera as pa
import pandera.polars as papl
from functools import partial

# Create a simple dataframe with null values and an invalid value
df = pl.DataFrame({
    "col1": ['1', '2', None, 'x'],
    "col2": ['valid', None, None, 'valid']
})

# Define a simple schema with nullable fields and invalid values check
invalids = ['x']
schema_field = partial(
    pa.Field,
    nullable=True,
    notin=invalids
)

class PolarsSchema(papl.DataFrameModel):
    col1: str = schema_field()
    col2: str = schema_field()

    class Config:
        drop_invalid_rows = True

class PandasSchema(pa.DataFrameModel):
    col1: str = schema_field()
    col2: str = schema_field()

    class Config:
        drop_invalid_rows = True

# Show the input, then validate it with each backend
print("Original DataFrame:")
print(df)

print("\nUsing Polars validation:")
print(df.pipe(PolarsSchema.validate, lazy=True))

print("\nUsing Pandas validation:")
print(
    df.to_pandas()
    .pipe(PandasSchema.validate, lazy=True)
    .pipe(pl.from_pandas)
)

Expected behavior

Both Polars and Pandas validation should handle null values the same way. Since the fields are marked as nullable=True, rows containing null values should be preserved. Only the row containing the invalid value 'x' should be dropped.
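
To make the expected result concrete, here is a minimal, backend-independent sketch of the row-level rule described above. It is not pandera's implementation; the names `INVALIDS`, `value_is_valid`, `row_is_valid`, and `expected_rows` are hypothetical and exist only for illustration. The rows mirror the dataframe from the code sample.

```python
# Hypothetical reference semantics for the expected behavior.
# This is NOT how pandera implements validation; it only states the rule.
INVALIDS = {"x"}

# Same data as the dataframe in the code sample, one dict per row.
rows = [
    {"col1": "1", "col2": "valid"},
    {"col1": "2", "col2": None},
    {"col1": None, "col2": None},
    {"col1": "x", "col2": "valid"},
]


def value_is_valid(value):
    """A null passes because the field is nullable; other values must not be in INVALIDS."""
    return value is None or value not in INVALIDS


def row_is_valid(row):
    """A row is kept only if every column value passes the check."""
    return all(value_is_valid(value) for value in row.values())


# Rows that should survive drop_invalid_rows under the expected behavior.
expected_rows = [row for row in rows if row_is_valid(row)]
print(expected_rows)
```

Under this rule three rows survive, including the all-null row, which matches the Pandas output shown under Console Outputs.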

Desktop (please complete the following information):

OS: Ubuntu 22.04
Python version: 3.12
Pandera version: 0.20.4
Polars version: 1.11.0
Pandas version: 2.2.3

Screenshots

Console Outputs:

Original DataFrame:
shape: (4, 2)
┌──────┬───────┐
│ col1 ┆ col2  │
│ ---  ┆ ---   │
│ str  ┆ str   │
╞══════╪═══════╡
│ 1    ┆ valid │
│ 2    ┆ null  │
│ null ┆ null  │
│ x    ┆ valid │
└──────┴───────┘

Using Polars validation:
shape: (2, 2)
┌──────┬───────┐
│ col1 ┆ col2  │
│ ---  ┆ ---   │
│ str  ┆ str   │
╞══════╪═══════╡
│ 1    ┆ valid │
│ 2    ┆ null  │
└──────┴───────┘

Using Pandas validation:
shape: (3, 2)
┌──────┬───────┐
│ col1 ┆ col2  │
│ ---  ┆ ---   │
│ str  ┆ str   │
╞══════╪═══════╡
│ 1    ┆ valid │
│ 2    ┆ null  │
│ null ┆ null  │
└──────┴───────┘

Additional context

The behavior is consistent: Polars validation always drops the row whose values are all null, while Pandas validation preserves it.

alexismanuel added the bug Something isn't working label Oct 24, 2024