How to reduce the columns returned in the failure_cases of a dataframe check? #900

ipear3 · 2022-08-02T12:24:50Z


First, thank you contributors for this amazing library 🥇 Second, thank you anyone who contributes to this discussion.

I've been running some dataframe checks that should fail when two columns are each a specific value. Indeed, I've written several variations of the check, and they do fail as I would expect them to. However, the failure_cases returned are not as usable as the failure_cases returned by column checks. Specifically, on a dataframe with many columns, it is overwhelming to see 30+ rows for a check that only depends on two columns.

Thus, I'd like to reduce the columns returned in the failure_cases of a dataframe check. Is it possible to do so today? If not, is this a pattern contributors would be willing to accept a PR for?

Alternatively, is there a pattern like checks 2 and 4 below that would produce a single row in failure_cases, but include an index?

I've tried a handful of variations of dataframe checks, including naming the returned series after my desired column. Checks 0, 1, and 3 include an index (because the returned series has the same length as the dataframe), but produce too many rows (one per column per check). Checks 2 and 4 produce a single row, but do not include an index.

import pandas as pd
import pandera as pa

df = pd.DataFrame({
    "Record ID": ["MCOU", "FFSU", "MCOU"],
    "Period Covered": ["11999", "12010", "12010"],
    "Valid Row": [False, True, True]
}).astype({
    "Record ID": "category",
    "Period Covered": "string",
    "Valid Row": "bool"
})

schema = pa.DataFrameSchema(
    columns={
        "Record ID": pa.Column("category"),
        "Period Covered": pa.Column("string"),
        "Valid Row": pa.Column("bool")
    },
    # Desired logic: Invalid when the `Period Covered` year component is before 2010 and `Record ID` is not "FFSU".
    checks=[
        # Check 0: Take dataframe, return bool series of same length.
        pa.Check(lambda df: pd.Series(~((df["Period Covered"].str.slice(1).astype("int") < 2010) & (df["Record ID"] != "FFSU")))),
        # Check 1: Same logic, but name the series after the column we want.
        pa.Check(lambda df: pd.Series(~((df["Period Covered"].str.slice(1).astype("int") < 2010) & (df["Record ID"] != "FFSU")), name="Period Covered")),
        # Check 2: Take dataframe, return a named series filtered to the False condition.
        pa.Check(lambda df: df[df["Period Covered"].str.slice(1).astype("int") < 2010]["Record ID"] == "FFSU"),
        # Check 3: Take dataframe element-wise (series looks like df.iloc[0]), return a bool per row.
        pa.Check(lambda s: not (int(s["Period Covered"][1:]) < 2010 and s["Record ID"] != "FFSU"), element_wise=True),
        # Check 4: Same as Check 2, but using df.loc instead of df[]...
        pa.Check(lambda df: df.loc[df["Period Covered"].str.slice(1).astype("int") < 2010, "Record ID"] == "FFSU"),
    ]
)

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    failure_cases = exc.failure_cases
    failed_rows = df[df.index.isin(exc.failure_cases["index"])]
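To illustrate the shape of output I'm hoping for, here is a minimal pandas-only sketch that does not use pandera at all. It applies the same "desired logic" as the checks above and keeps only the two columns the rule depends on, with the original index preserved. `failing_rows` is a hypothetical helper name I made up for this example, not a pandera API.

```python
import pandas as pd

df = pd.DataFrame({
    "Record ID": ["MCOU", "FFSU", "MCOU"],
    "Period Covered": ["11999", "12010", "12010"],
}).astype({"Record ID": "category", "Period Covered": "string"})


def failing_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not part of pandera): return invalid rows,
    restricted to the two columns the rule actually depends on."""
    # Drop the leading character of `Period Covered` to get the year component.
    year = df["Period Covered"].str.slice(1).astype("int")
    # Invalid when the year is before 2010 and `Record ID` is not "FFSU".
    invalid = (year < 2010) & (df["Record ID"] != "FFSU")
    # .loc keeps the original index, so each failure maps back to its row.
    return df.loc[invalid, ["Record ID", "Period Covered"]]


reduced = failing_rows(df)
print(reduced)
```

With the sample data, only the first record (index 0) is invalid, so this yields one row with two columns. Ideally, failure_cases for a dataframe check could look similar.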
