Can schema models support an arbitrary number of columns with checks? #991
Basically, I have some data tables with a few fixed columns that I know about in advance, followed by any number of additional columns. I don't know the names of those additional columns, but I do know that each should be of integer type with values >= 0. Can I support this in a schema model?

I suppose that if the known columns are always at the beginning, I can add a custom data frame check that checks the dtype and elements of all remaining columns. Would that be the right way to go about it, or is there another way to approach this?

Rough example:

    import pandas as pd
    import pandera as pa
    from pandas.api.types import is_integer_dtype
    from pandera.typing import Series

    class VariableColumns(pa.SchemaModel):
        first: Series[str] = pa.Field()
        second: Series[float] = pa.Field(ge=5.0, le=10.0)

        @pa.dataframe_check
        @classmethod
        def check_tail(cls, df: pd.DataFrame) -> bool:
            # Ensure only integer columns.
            result = df.dtypes.iloc[2:].map(is_integer_dtype).all()
            if not result:
                return False
            # Ensure all elements are >= 0, reduced over both axes to one bool.
            return bool((df.iloc[:, 2:] >= 0).all(axis=None))
Hi @Midnighter, great question!

You can use regex column key matching for this. In the class-based API, a field with a regex alias and regex=True applies to every column whose name matches the pattern. The pattern has to select the unknown columns, so the known ones are excluded with a negative lookahead:

    import pandas as pd
    import pandera as pa
    from pandera.typing import Series

    class VariableColumns(pa.SchemaModel):
        first: Series[str] = pa.Field()
        second: Series[float] = pa.Field(ge=5.0, le=10.0)
        # Applies to every column not named exactly "first" or "second".
        unknown: Series[int] = pa.Field(ge=0, alias="^(?!(?:first|second)$)", regex=True)

You can generalize this by specifying the known columns in a private (or global) variable:

    import pandas as pd
    import pandera as pa
    from pandera.typing import Series

    class VariableColumns(pa.SchemaModel):
        _known_keys = ["first", "second", "..."]
        first: Series[str] = pa.Field()
        second: Series[float] = pa.Field(ge=5.0, le=10.0)
        # Build the exclusion pattern from the list of known column names.
        unknown: Series[int] = pa.Field(ge=0, alias=f"^(?!(?:{'|'.join(_known_keys)})$)", regex=True)
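Before relying on a regex alias, it can help to check which column names it would actually select. The sketch below uses only Python's re module and assumes that pandera matches regex aliases from the start of each column name (as re.match does); that assumption should be verified against your pandera version. The column names and the select helper are hypothetical. It compares a plain alternation, which selects names that start with a known key, with a negative lookahead, which selects every name that is not exactly a known key.

```python
import re

# Hypothetical column names: two known columns plus extras of unknown name.
columns = ["first", "second", "count_a", "count_b", "first_extra"]
known = ["first", "second"]

# Plain alternation: selects names that START with one of the known keys.
alternation = re.compile(f"^({'|'.join(known)})")

# Negative lookahead: selects every name that is NOT exactly a known key.
# re.escape guards against known names that contain regex metacharacters.
lookahead = re.compile(f"^(?!(?:{'|'.join(map(re.escape, known))})$)")


def select(pattern, names):
    """Return the names that the pattern matches from the start."""
    return [name for name in names if pattern.match(name)]


selected_by_alternation = select(alternation, columns)
selected_by_lookahead = select(lookahead, columns)

print(selected_by_alternation)  # ['first', 'second', 'first_extra']
print(selected_by_lookahead)    # ['count_a', 'count_b', 'first_extra']
```

Only the lookahead variant leaves first and second out, which is what a field meant for the unknown columns needs. Because the lookahead ends with $, a name such as first_extra still counts as unknown.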