Different null handling behavior between Polars and Pandas validation #1835

Open
2 of 3 tasks
alexismanuel opened this issue Oct 24, 2024 · 0 comments
Labels
bug Something isn't working

Describe the bug
When using Pandera with nullable fields, there's a difference in behavior between Polars and Pandas validation. The Polars validation appears to drop rows with null values even when fields are explicitly marked as nullable, while Pandas validation correctly preserves these rows.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import polars as pl
import pandera as pa
import pandera.polars as papl
from functools import partial

# Create a simple dataframe with null values and an invalid value
df = pl.DataFrame({
    "col1": ['1', '2', None, 'x'],
    "col2": ['valid', None, None, 'valid']
})

# Define a simple schema with nullable fields and invalid values check
invalids = ['x']
schema_field = partial(
    pa.Field,
    nullable=True,
    notin=invalids
)

class PolarsSchema(papl.DataFrameModel):
    col1: str = schema_field()
    col2: str = schema_field()

    class Config:
        drop_invalid_rows = True

class PandasSchema(pa.DataFrameModel):
    col1: str = schema_field()
    col2: str = schema_field()

    class Config:
        drop_invalid_rows = True

# Run both backends on the same data and compare the results
print("Original DataFrame:")
print(df)

print("\nUsing Polars validation:")
print(df.pipe(PolarsSchema.validate, lazy=True))

print("\nUsing Pandas validation:")
print(
    df.to_pandas()
    .pipe(PandasSchema.validate, lazy=True)
    .pipe(pl.from_pandas)
)

Expected behavior

Both Polars and Pandas validation should handle null values the same way. Since the fields are marked as nullable=True, rows containing null values should be preserved. Only the row containing the invalid value 'x' should be dropped.
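
The expected rule can be written out as a minimal plain-Python sketch. It does not use pandera or reflect its internals; the helper names `value_passes` and `row_passes` are hypothetical and exist only for illustration. A value passes if it is null (the field is nullable) or if it is not in the `invalids` list, and a row is kept only when every value in it passes.

```python
# Rows of the sample dataframe as (col1, col2) tuples; None stands for null.
rows = [
    ("1", "valid"),
    ("2", None),
    (None, None),
    ("x", "valid"),
]
invalids = ["x"]


def value_passes(value):
    """Hypothetical helper: nulls pass (nullable=True), otherwise apply notin."""
    return value is None or value not in invalids


def row_passes(row):
    """Hypothetical helper: keep a row only if every value in it passes."""
    return all(value_passes(v) for v in row)


kept = [row for row in rows if row_passes(row)]
print(kept)
# [('1', 'valid'), ('2', None), (None, None)]
```

Under this rule three rows survive, which matches the Pandas output shown under Console Outputs.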

Desktop (please complete the following information):

  • OS: Ubuntu 22.04
  • Python version: 3.12
  • Pandera version: 0.20.4
  • Polars version: 1.11.0
  • Pandas version: 2.2.3

Screenshots

Console Outputs:

Original DataFrame:
shape: (4, 2)
┌──────┬───────┐
│ col1 ┆ col2  │
│ ---  ┆ ---   │
│ str  ┆ str   │
╞══════╪═══════╡
│ 1    ┆ valid │
│ 2    ┆ null  │
│ null ┆ null  │
│ x    ┆ valid │
└──────┴───────┘

Using Polars validation:
shape: (2, 2)
┌──────┬───────┐
│ col1 ┆ col2  │
│ ---  ┆ ---   │
│ str  ┆ str   │
╞══════╪═══════╡
│ 1    ┆ valid │
│ 2    ┆ null  │
└──────┴───────┘

Using Pandas validation:
shape: (3, 2)
┌──────┬───────┐
│ col1 ┆ col2  │
│ ---  ┆ ---   │
│ str  ┆ str   │
╞══════╪═══════╡
│ 1    ┆ valid │
│ 2    ┆ null  │
│ null ┆ null  │
└──────┴───────┘

Additional context

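The behavior is consistent across runs. In the sample above, Polars validation drops the row where both columns are null (while keeping the row where only col2 is null), whereas Pandas validation preserves both rows.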

alexismanuel added the bug label on Oct 24, 2024