0501 Indexing and Selecting Data
INDEXING AND SELECTING DATA
LIBRARIES
In [154]:
%pylab
import pandas as pd
from pandas.io.data import DataReader  # in later pandas releases this lives in the separate pandas-datareader package
Using matplotlib backend: Qt4Agg
Populating the interactive namespace from numpy and matplotlib
Data Load
DataFrame
In [155]:
df1 = DataReader("SPY", "yahoo", "20030101", "20150612")
df1.head()
Out[155]:
Date
http://localhost:8889/notebooks/0501%20Indexing%20and%20Selecting%20Data.ipynb 1/16
2/8/2015 0501 Indexing and Selecting Data
In [156]:
df1.describe()
Out[156]:
In [157]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3133 entries, 2003‐01‐02 00:00:00 to 2015‐06‐12 00:00:00
Data columns (total 6 columns):
Open 3133 non‐null float64
High 3133 non‐null float64
Low 3133 non‐null float64
Close 3133 non‐null float64
Volume 3133 non‐null int64
Adj Close 3133 non‐null float64
dtypes: float64(5), int64(1)
memory usage: 171.3 KB
Series
For the Series we use one of the DataFrame's columns, e.g. the adjusted close ('Adj Close').
In [158]:
s = df1['Adj Close']
s.head()
Out[158]:
Date
2003‐01‐02 71.506562
2003‐01‐03 71.726413
2003‐01‐06 72.990557
2003‐01‐07 72.809968
2003‐01‐08 71.757821
Name: Adj Close, dtype: float64
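The download above needs network access, so a small hand-built stand-in can be useful for trying out the selections. The sketch below is an assumption and not part of the original notebook: the names df_small and s_small are hypothetical, and the values are the five adjusted closes printed by s.head() above. It shows that selecting one column with [] gives a Series that shares the DataFrame's row index.

```python
import pandas as pd

# Adjusted closes copied from the s.head() output above.
dates = pd.to_datetime(
    ["2003-01-02", "2003-01-03", "2003-01-06", "2003-01-07", "2003-01-08"]
)
df_small = pd.DataFrame(
    {"Adj Close": [71.506562, 71.726413, 72.990557, 72.809968, 71.757821]},
    index=dates,
)
df_small.index.name = "Date"

# Selecting a single column returns a Series named after that column.
s_small = df_small["Adj Close"]

print(type(s_small).__name__)
print(s_small.name)
print(s_small.index.equals(df_small.index))
```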
Panel
We load a second DataFrame ('TLT') and combine it with the first one ('SPY') to create a Panel.
In [159]:
df2 = DataReader("TLT", "yahoo", "20030101", "20150612")
df2.head()
Out[159]:
Date
In [160]:
p = pd.Panel({'df1': df1, 'df2': df2})
p
Out[160]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3133 (major_axis) x 6 (minor_axis)
Items axis: df1 to df2
Major_axis axis: 2003-01-02 00:00:00 to 2015-06-12 00:00:00
Minor_axis axis: Open to Adj Close
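pd.Panel was deprecated in pandas 0.20 and removed in 0.25, so the cell above only runs on older releases. One common substitute, sketched here under assumptions, is to concatenate the frames with a dict so that the item label becomes the outer level of a column MultiIndex. The frames a and b and their values are placeholders chosen to make the structure visible; they are not SPY or TLT data.

```python
import pandas as pd

dates = pd.to_datetime(["2003-01-02", "2003-01-03", "2003-01-06"])

# Placeholder values (hypothetical), not real prices.
a = pd.DataFrame({"Open": [1.0, 2.0, 3.0], "Close": [1.5, 2.5, 3.5]}, index=dates)
b = pd.DataFrame({"Open": [10.0, 20.0, 30.0], "Close": [15.0, 25.0, 35.0]}, index=dates)

# The dict keys label each frame; axis=1 puts those labels on the outer column level.
wide = pd.concat({"df1": a, "df2": b}, axis=1)

print(wide.columns.nlevels)

# One "item" comes back as an ordinary DataFrame.
item = wide["df1"]
print(list(item.columns))

# Item, row label and column label in one .loc call.
value = wide.loc["2003-01-03", ("df2", "Close")]
print(value)
```

The tuple ("df2", "Close") plays the role of the item and minor-axis labels, and the row label plays the role of the major axis.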
Index
A Series has only one index.
In this case it holds the dates on which prices were recorded.
In [161]:
s.index
Out[161]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2003‐01‐02, ..., 2015‐06‐12]
Length: 3133, Freq: None, Timezone: None
DataFrames have two indexes (the rows and the columns).
Rows: Dates
Columns: Open, High, Low, ...
In [162]:
df1.index
Out[162]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2003‐01‐02, ..., 2015‐06‐12]
Length: 3133, Freq: None, Timezone: None
In [163]:
df1.columns
Out[163]:
Index([u'Open', u'High', u'Low', u'Close', u'Volume', u'Adj Close'], dtype='object')
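As a small offline illustration of the one-index versus two-index distinction, the sketch below uses the .axes attribute, which lists every index an object carries. The names df_small and s_small are hypothetical; the 'Adj Close' values are copied from the s.head() output earlier, and the 'Change' column is derived only to give the frame a second column.

```python
import pandas as pd

dates = pd.to_datetime(["2003-01-02", "2003-01-03", "2003-01-06"])
# Adjusted closes copied from the s.head() output earlier in the notebook.
df_small = pd.DataFrame({"Adj Close": [71.506562, 71.726413, 72.990557]}, index=dates)
df_small["Change"] = df_small["Adj Close"].diff()  # derived column, for illustration only
s_small = df_small["Adj Close"]

# A Series carries one axis, a DataFrame carries two.
print(s_small.ndim, len(s_small.axes))
print(df_small.ndim, len(df_small.axes))

# For a DataFrame, axes[0] is the row index and axes[1] is the column index.
row_index, column_index = df_small.axes
print(row_index.equals(df_small.index))
print(list(column_index))
```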
Panels have three indexes, or axes (items, rows and columns).
Items: df1 and df2
Rows or major index: Dates
Columns or minor index: Open, High, Low, ...
In [164]:
p.axes
Out[164]:
[Index([u'df1', u'df2'], dtype='object'),
<class 'pandas.tseries.index.DatetimeIndex'>
[2003‐01‐02, ..., 2015‐06‐12]
Length: 3133, Freq: None, Timezone: None,
Index([u'Open', u'High', u'Low', u'Close', u'Volume', u'Adj Close'], dtype='object')]
Selection by Index
.loc
It selects based on label values:
Series: s.loc[indexer]
DataFrame: df.loc[row_indexer, column_indexer]
Panel: p.loc[item_indexer, major_indexer, minor_indexer]
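The Series and DataFrame forms can be tried offline with a small stand-in. This is a sketch under assumptions: df and s here are hand-built (hypothetical) objects holding the five adjusted closes shown earlier, and 'Change' is a derived column added only so that a list of column labels can be demonstrated.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2003-01-02", "2003-01-03", "2003-01-06", "2003-01-07", "2003-01-08"]
)
adj = [71.506562, 71.726413, 72.990557, 72.809968, 71.757821]  # from s.head() above
df = pd.DataFrame({"Adj Close": adj}, index=dates)
df["Change"] = df["Adj Close"].diff()  # derived column, for illustration only
s = df["Adj Close"]

one_value = s.loc["2003-01-07"]                 # Series: single label -> scalar
one_row = df.loc["2003-01-07"]                  # DataFrame: row label -> that row as a Series
one_cell = df.loc["2003-01-07", "Adj Close"]    # row label + column label -> scalar
some_cols = df.loc[:, ["Adj Close", "Change"]]  # all rows, list of column labels -> DataFrame

print(one_value, one_cell)
print(type(one_row).__name__, some_cols.shape)
```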
Series
In [165]:
s.head()
Out[165]:
Date
2003‐01‐02 71.506562
2003‐01‐03 71.726413
2003‐01‐06 72.990557
2003‐01‐07 72.809968
2003‐01‐08 71.757821
Name: Adj Close, dtype: float64
In [166]:
s.loc["2003-01-07"]
Out[166]:
72.809968000000012
In [167]:
s.loc[0] # Error: the label 0 is not in the index, and .loc cannot select by position
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-167-4753738e9717> in <module>()
----> 1 s.loc[0] # Error: the label 0 is not in the index, and .loc cannot select by position
C:\Anaconda\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
1200 return self._getitem_tuple(key)
1201 else:
-> 1202 return self._getitem_axis(key, axis=0)
1203
1204 def _getitem_axis(self, key, axis=0):
C:\Anaconda\lib\site-packages\pandas\core\indexing.pyc in _getitem_axis(self, key, axis)
1343
1344 # fall thru to straight lookup
-> 1345 self._has_valid_type(key, axis)
1346 return self._get_label(key, axis=axis)
1347
C:\Anaconda\lib\site-packages\pandas\core\indexing.pyc in _has_valid_type(self, key, axis)
1305 raise
1306 except:
-> 1307 error()
1308
1309 return True
C:\Anaconda\lib\site-packages\pandas\core\indexing.pyc in error()
1292 "cannot use label indexing with a null key")
1293 raise KeyError("the label [%s] is not in the [%s]" %
-> 1294 (key, self.obj._get_axis_name(axis)))
1295
1296 try:
KeyError: 'the label [0] is not in the [index]'
In [168]:
s.loc["2003-01-07":"2003-01-09"]
Out[168]:
Date
2003‐01‐07 72.809968
2003‐01‐08 71.757821
2003‐01‐09 72.872778
Name: Adj Close, dtype: float64
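Note that the label slice above returns both endpoints. The sketch below, built on a hand-made Series with the five adjusted closes shown earlier (an offline stand-in, not the downloaded data), illustrates this and also shows that on a sorted DatetimeIndex a slice endpoint does not have to be present in the index.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2003-01-02", "2003-01-03", "2003-01-06", "2003-01-07", "2003-01-08"]
)
s = pd.Series(
    [71.506562, 71.726413, 72.990557, 72.809968, 71.757821],
    index=dates,
    name="Adj Close",
)

# Both endpoints are labels, and both are included in the result.
inclusive = s.loc["2003-01-06":"2003-01-08"]
print(len(inclusive))

# 2003-01-04 was a Saturday and is not in the index; on a sorted index
# the slice starts at the next available date.
from_weekend = s.loc["2003-01-04":"2003-01-07"]
found = list(from_weekend.index.strftime("%Y-%m-%d"))
print(found)
```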
In [169]:
s.loc[::2]
Out[169]:
Date
2003‐01‐02 71.506562
2003‐01‐06 72.990557
2003‐01‐08 71.757821
2003‐01‐10 73.069074
2003‐01‐14 73.281076
2003‐01‐16 72.252483
2003‐01‐21 70.077531
2003‐01‐23 69.653531
2003‐01‐27 66.897539
2003‐01‐29 67.902578
2003‐01‐31 67.572797
2003‐02‐04 67.038873
2003‐02‐06 66.308653
2003‐02‐10 65.963176
2003‐02‐12 64.463475
...
2015‐05‐04 211.320007
2015‐05‐06 208.039993
2015‐05‐08 211.619995
2015‐05‐12 209.979996
2015‐05‐14 212.210007
2015‐05‐18 213.100006
2015‐05‐20 212.880005
2015‐05‐22 212.990005
2015‐05‐27 212.699997
2015‐05‐29 211.139999
2015‐06‐02 211.360001
2015‐06‐04 210.130005
2015‐06‐08 208.419998
2015‐06‐10 210.960007
2015‐06‐12 209.929993
Name: Adj Close, Length: 1567
DataFrames
In [170]:
df1.head()
Out[170]:
Date
In [171]:
df1.loc["2003-01-07"]
Out[171]:
Open 92.900002
High 93.370003
Low 92.199997
Close 92.730003
Volume 38640400.000000
Adj Close 72.809968
Name: 2003‐01‐07 00:00:00, dtype: float64
In [172]:
df1.loc["2003-01-07", "Adj Close"]
Out[172]:
72.809968000000012
In [173]:
df1.loc[:,"Adj Close"]
Out[173]:
Date
2003‐01‐02 71.506562
2003‐01‐03 71.726413
2003‐01‐06 72.990557
2003‐01‐07 72.809968
2003‐01‐08 71.757821
2003‐01‐09 72.872778
2003‐01‐10 73.069074
2003‐01‐13 73.045519
2003‐01‐14 73.281076
2003‐01‐15 72.550856
2003‐01‐16 72.252483
2003‐01‐17 71.184641
2003‐01‐21 70.077531
2003‐01‐22 69.229532
2003‐01‐23 69.653531
...
2015‐05‐22 212.990005
2015‐05‐26 210.699997
2015‐05‐27 212.699997
2015‐05‐28 212.460007
2015‐05‐29 211.139999
2015‐06‐01 211.570007
2015‐06‐02 211.360001
2015‐06‐03 211.919998
2015‐06‐04 210.130005
2015‐06‐05 209.770004
2015‐06‐08 208.419998
2015‐06‐09 208.449997
2015‐06‐10 210.960007
2015‐06‐11 211.649994
2015‐06‐12 209.929993
Name: Adj Close, Length: 3133
In [174]:
df1.loc[:, ["Open", "Close"]]
Out[174]:
Open Close
Date
Panels
In [175]:
p
Out[175]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3133 (major_axis) x 6 (minor_axis)
Items axis: df1 to df2
Major_axis axis: 2003-01-02 00:00:00 to 2015-06-12 00:00:00
Minor_axis axis: Open to Adj Close
In [176]:
p.loc["df1", "2003-01-07", "Adj Close"]
Out[176]:
72.809968000000012
.iloc
It selects based on integer position in the index:
Series: s.iloc[indexer]
DataFrame: df.iloc[row_indexer, column_indexer]
Panel: p.iloc[item_indexer, major_indexer, minor_indexer]
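A short offline sketch of the Series and DataFrame forms follows, again on a hand-built stand-in (df is hypothetical; the adjusted closes are the five values shown earlier and 'Change' is a derived column). It also contrasts positional slices, which exclude the stop position, with label slices, which include the last label.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2003-01-02", "2003-01-03", "2003-01-06", "2003-01-07", "2003-01-08"]
)
adj = [71.506562, 71.726413, 72.990557, 72.809968, 71.757821]  # from s.head() above
df = pd.DataFrame({"Adj Close": adj}, index=dates)
df["Change"] = df["Adj Close"].diff()  # derived column, for illustration only

by_pos = df.iloc[3, 0]    # 4th row, 1st column -> scalar
last_row = df.iloc[-1]    # negative positions count from the end
first_two = df.iloc[0:2]  # positions 0 and 1; the stop position is excluded

# The same rows by label: .loc includes the last label, so it is spelled out.
same_by_label = df.loc["2003-01-02":"2003-01-03"]

print(by_pos)
print(len(first_two), first_two.equals(same_by_label))
```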
Series
In [177]:
s.head()
Out[177]:
Date
2003‐01‐02 71.506562
2003‐01‐03 71.726413
2003‐01‐06 72.990557
2003‐01‐07 72.809968
2003‐01‐08 71.757821
Name: Adj Close, dtype: float64
In [179]:
s.iloc[4]
Out[179]:
71.757820999999993
In [180]:
s.iloc["2003-01-07"]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-180-0a927e23536a> in <module>()
----> 1 s.iloc["2003-01-07"]
C:\Anaconda\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
1200 return self._getitem_tuple(key)
1201 else:
-> 1202 return self._getitem_axis(key, axis=0)
1203
1204 def _getitem_axis(self, key, axis=0):
C:\Anaconda\lib\site-packages\pandas\core\indexing.pyc in _getitem_axis(self, key, axis)
1461
1462 else:
-> 1463 key = self._convert_scalar_indexer(key, axis)
1464
1465 if not com.is_integer(key):
C:\Anaconda\lib\site-packages\pandas\core\indexing.pyc in _convert_scalar_indexer(self, key, axis)
167 ax = self.obj._get_axis(min(axis, self.ndim - 1))
168 # a scalar
--> 169 return ax._convert_scalar_indexer(key, typ=self.name)
170
171 def _convert_slice_indexer(self, key, axis):
C:\Anaconda\lib\site-packages\pandas\core\index.pyc in _convert_scalar_indexer(self, key, typ)
641 type(self).__name__),FutureWarning)
642 return key
--> 643 return self._convert_indexer_error(key, 'label')
644
645 if is_float(key):
C:\Anaconda\lib\site-packages\pandas\core\index.pyc in _convert_indexer_error(self, key, msg)
781 msg = 'label'
782 raise TypeError("the {0} [{1}] is not a proper indexer for this index "
--> 783 "type ({2})".format(msg, key, self.__class__.__name__))
784
785 def get_duplicates(self):
TypeError: the label [2003-01-07] is not a proper indexer for this index type (DatetimeIndex)
In [181]:
s.iloc[:10:-2]
Out[181]:
Date
2015‐06‐12 209.929993
2015‐06‐10 210.960007
2015‐06‐08 208.419998
2015‐06‐04 210.130005
2015‐06‐02 211.360001
2015‐05‐29 211.139999
2015‐05‐27 212.699997
2015‐05‐22 212.990005
2015‐05‐20 212.880005
2015‐05‐18 213.100006
2015‐05‐14 212.210007
2015‐05‐12 209.979996
2015‐05‐08 211.619995
2015‐05‐06 208.039993
2015‐05‐04 211.320007
...
2003‐03‐03 66.025987
2003‐02‐27 66.222282
2003‐02‐25 66.324360
2003‐02‐21 66.881838
2003‐02‐19 66.881838
2003‐02‐14 66.073102
2003‐02‐12 64.463475
2003‐02‐10 65.963176
2003‐02‐06 66.308653
2003‐02‐04 67.038873
2003‐01‐31 67.572797
2003‐01‐29 67.902578
2003‐01‐27 66.897539
2003‐01‐23 69.653531
2003‐01‐21 70.077531
Name: Adj Close, Length: 1561
DataFrames
In [182]:
df1.head()
Out[182]:
Date
In [183]:
df1.iloc[3]
Out[183]:
Open 92.900002
High 93.370003
Low 92.199997
Close 92.730003
Volume 38640400.000000
Adj Close 72.809968
Name: 2003‐01‐07 00:00:00, dtype: float64
In [184]:
df1.iloc[3,5]
Out[184]:
72.809968000000012
Panels
In [185]:
p
Out[185]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3133 (major_axis) x 6 (minor_axis)
Items axis: df1 to df2
Major_axis axis: 2003-01-02 00:00:00 to 2015-06-12 00:00:00
Minor_axis axis: Open to Adj Close
In [186]:
p.iloc[0,3,5]
Out[186]:
72.809968000000012
.ix
When the axis index is not integer-based, it supports selection both by label and by position in
the index:
Series: s.ix[indexer]
DataFrame: df.ix[row_indexer, column_indexer]
Panel: p.ix[item_indexer, major_indexer, minor_indexer]
When the axis index is integer-based, it selects by label only (like .loc).
In [187]:
s.head()
Out[187]:
Date
2003‐01‐02 71.506562
2003‐01‐03 71.726413
2003‐01‐06 72.990557
2003‐01‐07 72.809968
2003‐01‐08 71.757821
Name: Adj Close, dtype: float64
In [188]:
s.ix["2003-01-07"]
Out[188]:
72.809968000000012
In [189]:
s.ix[3]
Out[189]:
72.809968000000012
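.ix was deprecated in pandas 0.20 and removed in 1.0, so on recent releases the two lookups above have to be written with .loc or .iloc. The sketch below shows one way to do that on a hand-built stand-in (df and s are hypothetical, holding the five adjusted closes shown earlier). It also covers the mixed case of a row position combined with a column label, by translating one side first.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2003-01-02", "2003-01-03", "2003-01-06", "2003-01-07", "2003-01-08"]
)
adj = [71.506562, 71.726413, 72.990557, 72.809968, 71.757821]  # from s.head() above
df = pd.DataFrame({"Adj Close": adj}, index=dates)
s = df["Adj Close"]

by_label = s.loc["2003-01-07"]  # explicit label lookup, as s.ix["2003-01-07"] did
by_position = s.iloc[3]         # explicit positional lookup, as s.ix[3] did here

# Row position plus column label: convert the position to a label ...
mixed_a = df.loc[df.index[3], "Adj Close"]
# ... or convert the column label to a position.
mixed_b = df.iloc[3, df.columns.get_loc("Adj Close")]

print(by_label, by_position, mixed_a, mixed_b)
```

Spelling out which kind of lookup is meant avoids the ambiguity .ix had on integer indexes.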
Attributes
In [193]:
df1.Close['2003-01-03']
Out[193]:
91.349997999999999
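Attribute access is a shorthand for selecting a column by name, so it can be compared with the bracket and .loc forms. The sketch below uses a two-row stand-in (hypothetical frame df; the 'Close' and 'Adj Close' values for these two dates are taken from the outputs above). A name such as 'Adj Close' contains a space, so df.Adj Close is not valid Python and brackets are needed.

```python
import pandas as pd

dates = pd.to_datetime(["2003-01-03", "2003-01-07"])
# Values for these two dates are copied from the outputs above.
df = pd.DataFrame(
    {"Close": [91.349998, 92.730003], "Adj Close": [71.726413, 72.809968]},
    index=dates,
)

via_attr = df.Close["2003-01-03"]        # attribute access, then label lookup on the Series
via_item = df["Close"]["2003-01-03"]     # equivalent bracket form
via_loc = df.loc["2003-01-03", "Close"]  # single-step label lookup

# A column name with a space has to be selected with brackets.
adj_close = df["Adj Close"]

print(via_attr, via_item, via_loc)
print(adj_close.name)
```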