Invalid index type when extracting values from a multi-index data frame #1018

Open
kubotty opened this issue Oct 21, 2024 · 1 comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves

Comments

kubotty commented Oct 21, 2024

Describe the bug
When values are extracted from a DataFrame with a MultiIndex, a tuple key should be accepted by the stubs, but the type checker rejects it with an invalid index type error.

To Reproduce

  1. Provide a minimal runnable pandas example that is not properly checked by the stubs.
from __future__ import annotations

from typing import TypeAlias
import pandas as pd

_KeyType: TypeAlias = str | list[str | bool] | slice
_MultiKeyType: TypeAlias = str | slice | tuple[_KeyType, ...] | list[str | bool | tuple[_KeyType, ...]]

def df_loc(df: pd.DataFrame, key: _MultiKeyType) -> pd.DataFrame:
    print(key)
    return df.loc[:, key]

if __name__ == "__main__":
    df_multi_columns = pd.DataFrame({
        ("A", "a"): [1, 2, 3],
        ("A", "b"): [4, 5, 6],
        ("B", "a"): [7, 8, 9],
        ("B", "b"): [10, 11, 12]
    })
    print(df_loc(df_multi_columns, "A"))
    print(df_loc(df_multi_columns, ("A", "a")))
    print(df_loc(df_multi_columns, ["A", "B"]))
    print(df_loc(df_multi_columns, pd.IndexSlice[:, "a"]))
    print(df_loc(df_multi_columns, [("A", "a"), ("B", "b")]))
  2. Indicate which type checker you are using (mypy or pyright).
    mypy
  3. Show the error message received from that type checker while checking your example.
    get_pandas_loc.py:11: error: Invalid index type "tuple[slice, slice | tuple[str | list[str | builtins.bool] | slice, ...] | list[str | builtins.bool | tuple[str | list[str | builtins.bool] | slice, ...]]]" for "_LocIndexerFrame"; expected type "slice | ndarray[Any, dtype[integer[Any]]] | Index[Any] | list[int] | Series[int] | <6 more items>" [index]
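
For context, here is a small sketch of what the five lookups from the example return at runtime, independent of any type checker. It rebuilds the same frame and records the type of each result. The names `keys` and `runtime_types` are introduced here for illustration only and are not part of the report. One thing it makes visible is that the full tuple key yields a Series, while `df_loc` is annotated as returning a DataFrame.

```python
import pandas as pd

# Same frame as in the report: two column levels, ("A" | "B") x ("a" | "b").
df = pd.DataFrame({
    ("A", "a"): [1, 2, 3],
    ("A", "b"): [4, 5, 6],
    ("B", "a"): [7, 8, 9],
    ("B", "b"): [10, 11, 12],
})

# The five column keys used in the report, with illustrative names.
keys = {
    "label": "A",
    "full tuple": ("A", "a"),
    "list of labels": ["A", "B"],
    "IndexSlice": pd.IndexSlice[:, "a"],
    "list of tuples": [("A", "a"), ("B", "b")],
}

# Record the runtime type of each lookup result.
runtime_types = {name: type(df.loc[:, key]).__name__ for name, key in keys.items()}

for name, type_name in runtime_types.items():
    print(f"{name}: {type_name}")
```

Only the full tuple key selects a single column, so it is the only lookup here that comes back as a Series.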

Please complete the following information:

  • OS: Windows
  • OS Version [e.g. 22]: 11
  • python version: 3.10.15
  • version of type checker: mypy 2.1.2
  • version of installed pandas-stubs: 2.2.3.241009

Additional context

  • version of pandas: 2.2.3
  • mypy option: strict=True

Dr-Irv (Collaborator) commented Oct 24, 2024

There are a couple of issues here. First, from a typing perspective, you are getting mismatches because your declaration of the parameter key in the function df_loc doesn't match the way that pandas-stubs declares such arguments. If you eliminate that function, and then use the following in your tests:

    df_multi_columns.loc[:,  "A"]
    df_multi_columns.loc[:,  ("A", "a")]
    df_multi_columns.loc[:,  ["A", "B"]]
    df_multi_columns.loc[:,  pd.IndexSlice[:, "a"]]
    df_multi_columns.loc[:,  [("A", "a"), ("B", "b")]]

then only the second and fourth tests fail. That is the second issue, and it is a gap in the stubs. To fix it, one needs to add

    @overload
    def __getitem__(self, idx: tuple[slice, _IndexSliceTuple]) -> Series: ...

as the last overload for __getitem__() in _LocIndexerFrame in core/frame.pyi.
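
To illustrate the general mechanism, here is a minimal sketch of how an extra overload lets a type checker accept a (row slice, tuple key) index. `ToyLocIndexer` is a hypothetical stand-in written for this example. It is not the pandas-stubs `_LocIndexerFrame`, and its tuple overload is deliberately narrower than `_IndexSliceTuple`, whose definition is not shown in this thread.

```python
from __future__ import annotations

from typing import Any, overload

import pandas as pd


class ToyLocIndexer:
    """Hypothetical stand-in for a .loc indexer; NOT the pandas-stubs class."""

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df

    # A list of column keys selects several columns, so the result is a frame.
    @overload
    def __getitem__(self, idx: tuple[slice, list[Any]]) -> pd.DataFrame: ...

    # Added as the last overload: a full two-level column key selects one column.
    @overload
    def __getitem__(self, idx: tuple[slice, tuple[str, str]]) -> pd.Series: ...

    def __getitem__(self, idx: Any) -> Any:
        # At runtime, simply delegate to the real .loc indexer.
        return self._df.loc[idx]


df = pd.DataFrame({("A", "a"): [1, 2, 3], ("A", "b"): [4, 5, 6]})
toy = ToyLocIndexer(df)

col = toy[:, ("A", "a")]                 # matches the tuple overload
both = toy[:, [("A", "a"), ("A", "b")]]  # matches the list overload
print(type(col).__name__, type(both).__name__)
```

Without the second overload, a checker would have no signature matching the tuple key and would report an invalid index type, much like the error in the report.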

PR with tests welcome.

Dr-Irv added the Indexing label Oct 24, 2024