BUG: Pandas 1.1.5 location-based indexing error with quantized pivot table #38367


Closed
2 of 3 tasks
tgaddair opened this issue Dec 8, 2020 · 7 comments · Fixed by #38532
Labels: Bug; Indexing (related to indexing on series/frames, not to indexes themselves); Regression (functionality that used to work in a prior pandas version)

tgaddair commented Dec 8, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import numpy as np
import pandas as pd

input_df = pd.DataFrame(**{
    'index': [0, 1], 
    'columns': ['loss', 'category_64973.fc_size', 'category_64973.num_fc_layers', 'training.learning_rate'], 
    'data': [[1.0549572706222534, 240, 2, 0.0014908184659929895], [1.225046157836914, 160, 2, 0.0013734204727201226]]
})

input_df['training.learning_rate'] = pd.qcut(
    input_df['training.learning_rate'],
    q=10,
    precision=3,
    duplicates='drop',
)

data = input_df.pivot_table(
    index='category_64973.fc_size',
    columns='training.learning_rate',
    values='loss',
    aggfunc='mean'
)

# Seaborn code starts here
mask = np.zeros(data.shape, bool)
mask = pd.DataFrame(mask,
                    index=data.index,
                    columns=data.columns,
                    dtype=bool)

mask | pd.isnull(data)

Problem description

An error occurs when attempting to plot a quantized pivot table using Seaborn with the latest version of Pandas (1.1.5).

The code above is a self-contained example of what Seaborn does when heatmap() is called on the input pivot table (data). See this usage in the Ludwig framework: https://github.com/uber/ludwig/blob/master/ludwig/utils/visualization_utils.py#L1392. Prior to v1.1.5, this code worked and was used to generate plots in Ludwig.

The stack trace is as follows:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/repos/ludwig/env/lib/python3.7/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
    701             try:
--> 702                 self._validate_key(k, i)
    703             except ValueError as err:

~/repos/ludwig/env/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
   1368         else:
-> 1369             raise ValueError(f"Can only index by location with a [{self._valid_types}]")
   1370 

ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-1-e654830c5b85> in <module>
     32                     dtype=bool)
     33 
---> 34 mask | pd.isnull(data)

~/repos/ludwig/env/lib/python3.7/site-packages/pandas/core/ops/__init__.py in f(self, other, axis, level, fill_value)
    638             self, other, op, axis, default_axis, fill_value, level
    639         ):
--> 640             return _frame_arith_method_with_reindex(self, other, op)
    641 
    642         if isinstance(other, ABCSeries) and fill_value is not None:

~/repos/ludwig/env/lib/python3.7/site-packages/pandas/core/ops/__init__.py in _frame_arith_method_with_reindex(left, right, op)
    572     )
    573 
--> 574     new_left = left.iloc[:, lcols]
    575     new_right = right.iloc[:, rcols]
    576     result = op(new_left, new_right)

~/repos/ludwig/env/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    871                     # AttributeError for IntervalTree get_value
    872                     pass
--> 873             return self._getitem_tuple(key)
    874         else:
    875             # we by definition only have the 0th axis

~/repos/ludwig/env/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
   1441     def _getitem_tuple(self, tup: Tuple):
   1442 
-> 1443         self._has_valid_tuple(tup)
   1444         try:
   1445             return self._getitem_lowerdim(tup)

~/repos/ludwig/env/lib/python3.7/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
    705                     "Location based indexing can only have "
    706                     f"[{self._valid_types}] types"
--> 707                 ) from err
    708 
    709     def _is_nested_tuple_indexer(self, tup: Tuple) -> bool:

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
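For reference, the indexer types named in the error message can be sketched as follows. This is a minimal illustration on a hypothetical three-column frame (`df` and its column names are not from the report); it shows what `.iloc` accepts along the column axis and does not reproduce the failing internal call.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only to illustrate the accepted indexer types.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])

by_int = df.iloc[:, 0]          # integer position, returns a Series
by_slice = df.iloc[:, 0:2]      # integer slice, start included, end excluded
by_list = df.iloc[:, [0, 2]]    # list-like of integer positions
by_mask = df.iloc[:, np.array([True, False, True])]  # boolean array
```

Anything else passed as a positional column indexer falls outside these types, which appears to be what happened to `lcols` in the traceback above.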

Note that this last mask | pd.isnull(data) operation succeeds with Pandas 1.1.4, with all other dependencies left unchanged.

Expected Output

The mask | pd.isnull(data) call should succeed.
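Until a fix is available, one possible workaround is to combine the two masks as plain NumPy arrays and reattach the labels afterwards. The sketch below is an assumption on my part, not something proposed in this thread: it uses a hypothetical stand-in for the pivot table (categorical columns with one unused category) and relies on the fact that NumPy arrays carry no column labels, so the OR never goes through pandas column alignment.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the pivot table in the report: categorical
# columns where only two of the three categories are actually used.
columns = pd.CategoricalIndex(["low", "high"], categories=["low", "mid", "high"])
data = pd.DataFrame(
    [[1.23, np.nan], [np.nan, 1.05]],
    index=[160, 240],
    columns=columns,
)

mask = np.zeros(data.shape, dtype=bool)

# OR the raw boolean arrays, then reattach the original labels.
combined = pd.DataFrame(
    mask | data.isna().to_numpy(),
    index=data.index,
    columns=data.columns,
)
```

This only sidesteps the alignment step for this one expression; code inside Seaborn itself would still hit the original path.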

Output of pd.show_versions()

INSTALLED VERSIONS

commit : b5958ee
python : 3.7.8.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Thu Jun 18 20:49:00 PDT 2020; root:xnu-6153.141.1~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.5
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.1.0
Cython : 0.29.21
pytest : 6.1.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.2
html5lib : None
pymysql : 0.10.1
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.2
bottleneck : None
fsspec : 0.8.4
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 2.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.20
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.52.0

tgaddair added the Bug and Needs Triage labels Dec 8, 2020
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 10, 2020
simonjayhawkins (Member) commented

Thanks @tgaddair for the report

Note that this last mask | pd.isnull(data) operation succeeds with Pandas 1.1.4, with all other dependencies left unchanged.

first bad commit: [e99e5ab] BUG: Fix duplicates in intersection of multiindexes (#36927) cc @phofl

simonjayhawkins added the Indexing and Regression labels and removed the Needs Triage label Dec 10, 2020
phofl (Member) commented Dec 10, 2020

Yikes, unique() on a categorical index drops unused categories, while intersection() does not.

@jbrockmendel Is this expected? If yes, any suggestions on how to handle this case? The problem lies in:

if fill_value is None and level is None and axis is default_axis:
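To make "unused categories" concrete, here is a minimal sketch on a hypothetical index (`idx` and its labels are not from the report). It uses `remove_unused_categories()` explicitly rather than `unique()`, because how `unique()` treats unused categories may differ between pandas versions. The point is only that two categorical indexes can hold the same values yet carry different dtypes, which is the kind of mismatch described above.

```python
import pandas as pd

# Hypothetical index: category "c" is declared but never used.
idx = pd.CategoricalIndex(["a", "b"], categories=["a", "b", "c"])

# Explicitly drop the unused category.
trimmed = idx.remove_unused_categories()

# Same values, but the dtypes differ because the category sets differ.
same_values = list(idx) == list(trimmed)
same_dtype = idx.dtype == trimmed.dtype
```

In the reported example, the columns produced by qcut plus pivot_table would be in the same situation: many quantile bins are declared as categories, but only the observed ones appear as columns.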

jbrockmendel (Member) commented

Would #38140 fix this?

phofl (Member) commented Dec 11, 2020

Yep, this would fix this

simonjayhawkins (Member) commented

#38140 is milestoned for 1.3

We will probably want a fix in place for 1.2, since this is a regression.

phofl (Member) commented Dec 11, 2020

@simonjayhawkins

We could apply unique() to the intersection if it is categorical. This is an ugly fix, which could be removed once #38140 is merged.

simonjayhawkins (Member) commented

@phofl if you could put together a PR for review (suitable for merging before the 1.2 release), that would be great.
