First, thank you to the contributors for this amazing library 🥇 Second, thank you to anyone who contributes to this discussion.
I've been running some dataframe checks that should fail when two columns are each a specific value. Indeed, I've written several variations of the check, and they do fail as I would expect them to. However, the failure_cases returned are not as usable as the failure_cases returned by column checks. Specifically, on a dataframe with many columns, it is overwhelming to see 30+ rows for a check that only depends on two columns.
Thus, I'd like to reduce the columns returned in the failure_cases of a dataframe check. Is it possible to do so today? If not, is this a pattern contributors would be willing to accept a PR for?
Alternatively, is there a pattern like checks 2 and 4 below that would produce a single row in failure_cases, but include an index?
I've tried a handful of variations of dataframe checks, such as naming the returned series after my desired column. Checks 0, 1, and 3 include an index (because the output's shape[0] matches the dataframe's), but produce too many rows (unique by column + check). Checks 2 and 4 produce 1 row, but do not include an index.
import pandas as pd
import pandera as pa
df = pd.DataFrame({
    "Record ID": ["MCOU", "FFSU", "MCOU"],
    "Period Covered": ["11999", "12010", "12010"],
    "Valid Row": [False, True, True]
}).astype({
    "Record ID": "category",
    "Period Covered": "string",
    "Valid Row": "bool"
})
schema = pa.DataFrameSchema(
    columns={
        "Record ID": pa.Column("category"),
        "Period Covered": pa.Column("string"),
        "Valid Row": pa.Column("bool")
    },
    # Desired logic: Invalid when the `Period Covered` year component is before 2010 and `Record ID` is not "FFSU".
    checks=[
        # Check 0: Take dataframe, return bool series of same length.
        pa.Check(lambda df: pd.Series(~((df["Period Covered"].str.slice(1).astype("int") < 2010) & (df["Record ID"] != "FFSU")))),
        # Check 1: Same logic, but name the series after the column we want.
        pa.Check(lambda df: pd.Series(~((df["Period Covered"].str.slice(1).astype("int") < 2010) & (df["Record ID"] != "FFSU")), name="Period Covered")),
        # Check 2: Take dataframe, return a named series filtered to the False condition.
        pa.Check(lambda df: df[df["Period Covered"].str.slice(1).astype("int") < 2010]["Record ID"] == "FFSU"),
        # Check 3: Take dataframe element-wise (series looks like df.iloc[0]), return a bool per row.
        pa.Check(lambda s: not (int(s["Period Covered"][1:]) < 2010 and s["Record ID"] != "FFSU"), element_wise=True),
        # Check 4: Same as Check 2, but using df.loc instead of df[]...
        pa.Check(lambda df: df.loc[df["Period Covered"].str.slice(1).astype("int") < 2010, "Record ID"] == "FFSU"),
    ]
)
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    failure_cases = exc.failure_cases
    # Rows of the original dataframe whose index appears in the failure cases.
    failed_rows = df[df.index.isin(exc.failure_cases["index"])]
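To illustrate the shape of output I'm after, here is a minimal sketch in plain pandas (no pandera involved). It applies the same "desired logic" as the checks above and keeps only the two columns the rule depends on, with the original index preserved. The names `relevant`, `invalid`, and `desired` are just illustrative, not part of any pandera API, and this is not meant to represent how pandera builds failure_cases internally.

```python
import pandas as pd

df = pd.DataFrame({
    "Record ID": ["MCOU", "FFSU", "MCOU"],
    "Period Covered": ["11999", "12010", "12010"],
    "Valid Row": [False, True, True],
})

# Hypothetical: the only columns the rule actually depends on.
relevant = ["Record ID", "Period Covered"]

# Year component of `Period Covered` (everything after the first character).
year = df["Period Covered"].str.slice(1).astype("int")

# A row is invalid when the year is before 2010 and `Record ID` is not "FFSU".
invalid = (year < 2010) & (df["Record ID"] != "FFSU")

# One row per failing record, original index kept, unrelated columns dropped.
desired = df.loc[invalid, relevant]
print(desired)
```

In other words, I'd like one entry per failing row, still carrying the index, without a separate entry for every other column in the dataframe.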