
Select a Subset of Data Using Lexicographical Slicing in Python Pandas



Introduction

Pandas offers two ways to select a subset of data: by index position or by index label. In this post, I will show you how to select a subset of data using lexicographical slicing.

Many public datasets are available online. This post uses a movies dataset from kaggle.com; search for "movies dataset" there to find it.

How to do it

1. Import the movies dataset with only the columns required for this example.

import pandas as pd
import numpy as np
movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv",index_col="title",
usecols=["title","budget","vote_average","vote_count"])
movies.sample(n=5)


title                        budget    vote_average  vote_count
Little Voice                 0         6.6           61
Grown Ups 2                  80000000  5.8           1155
The Best Years of Our Lives  2100000   7.6           143
Tusk                         2800000   5.1           366
Operation Chromite           0         5.8           29

2. I always recommend sorting the index, especially if the index is made up of strings. With a huge dataset, you will notice the difference a sorted index makes.

What if I don't sort the index?

No problem, your code is going to run forever. Just kidding. If the index labels are unsorted, pandas has to traverse the labels one by one to match your query. Imagine an Oxford dictionary without an index page: what would you do? With the index sorted, you can jump quickly to the label you want to extract, and the same is true of pandas.
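
The dictionary analogy can be sketched with plain Python. This is only an illustration of why sorted labels help, not pandas' actual implementation; the short `titles` list is a hand-made stand-in for a sorted index, built from a few titles that appear in this post.

```python
import bisect

# Hand-made, already sorted list of titles standing in for a sorted index.
titles = ["Abandon", "Abduction", "Battleship", "Grown Ups 2", "Tusk"]

# Binary search locates both slice bounds without scanning every label.
start = bisect.bisect_left(titles, "Aa")
stop = bisect.bisect_right(titles, "Bb")

selected = titles[start:stop]
print(selected)
```

Because the list is sorted, each bound is found in a number of steps that grows with the logarithm of the list length rather than with the length itself.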

Let us check first if our index is sorted or not.

# check whether the index is sorted
movies.index.is_monotonic_increasing

False

3. Clearly, the index is unsorted. We will try to select the movies whose titles start with A. This is like writing the following SQL:

select * from movies where title like 'A%'


movies.loc["Aa":"Bb"]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4844             try:
-> 4845                 return self._searchsorted_monotonic(label, side)
   4846             except ValueError:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in _searchsorted_monotonic(self, label, side)
   4805
-> 4806         raise ValueError("index must be monotonic increasing or decreasing")
   4807

ValueError: index must be monotonic increasing or decreasing

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
in
----> 1 movies.loc["Aa": "Bb"]

~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1766
   1767             maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768             return self._getitem_axis(maybe_callable, axis=axis)
   1769
   1770     def _is_scalar_access(self, key: Tuple):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1910         if isinstance(key, slice):
   1911             self._validate_key(key, axis)
-> 1912             return self._get_slice_axis(key, axis=axis)
   1913         elif com.is_bool_indexer(key):
   1914             return self._getbool_axis(key, axis=axis)

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_slice_axis(self, slice_obj, axis)
   1794
   1795         labels = obj._get_axis(axis)
-> 1796         indexer = labels.slice_indexer(
   1797             slice_obj.start, slice_obj.stop, slice_obj.step, kind=self.name
   1798         )

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_indexer(self, start, end, step, kind)
   4711         slice(1, 3)
   4712         """
-> 4713         start_slice, end_slice = self.slice_locs(start, end, step=step, kind=kind)
   4714
   4715         # return a slice

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_locs(self, start, end, step, kind)
   4924         start_slice = None
   4925         if start is not None:
-> 4926             start_slice = self.get_slice_bound(start, "left", kind)
   4927         if start_slice is None:
   4928             start_slice = 0

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4846             except ValueError:
   4847                 # raise the original KeyError
-> 4848                 raise err
   4849
   4850         if isinstance(slc, np.ndarray):

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4840         # we need to look up the label
   4841         try:
-> 4842             slc = self.get_loc(label)
   4843         except KeyError as err:
   4844             try:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()

KeyError: 'Aa'
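
The failure can be reproduced without downloading the dataset. The sketch below assumes a small hand-made frame (`movies_small`, a hypothetical name) built from a few rows shown in this post, with the index deliberately left unsorted.

```python
import pandas as pd

# A few rows from this post; the index order is deliberately unsorted.
movies_small = pd.DataFrame(
    {
        "budget": [2800000, 25000000, 209000000, 35000000],
        "vote_average": [5.1, 4.6, 5.5, 5.6],
        "vote_count": [366, 45, 2114, 961],
    },
    index=pd.Index(["Tusk", "Abandon", "Battleship", "Abduction"], name="title"),
)

# Neither "Aa" nor "Bb" is an existing label, and the index is not sorted,
# so pandas cannot work out where the slice should start or stop.
try:
    movies_small.loc["Aa":"Bb"]
    error = None
except KeyError as exc:
    error = exc

print(type(error).__name__, error)
```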

4. Sort the index in ascending order and run the same check again, so that lexicographical slicing can take advantage of the sorting.

True
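
A minimal sketch of this step, assuming `sort_index` is used to sort the index and using a small hand-made frame (`movies_small`, a hypothetical name) rather than the full dataset:

```python
import pandas as pd

# A few rows from this post, in unsorted order.
movies_small = pd.DataFrame(
    {
        "budget": [2800000, 25000000, 209000000, 35000000],
        "vote_average": [5.1, 4.6, 5.5, 5.6],
        "vote_count": [366, 45, 2114, 961],
    },
    index=pd.Index(["Tusk", "Abandon", "Battleship", "Abduction"], name="title"),
)

# sort_index returns a new frame ordered by index label (ascending by default).
movies_sorted = movies_small.sort_index()

# The same check as before now reports a sorted index.
is_sorted = movies_sorted.index.is_monotonic_increasing
print(is_sorted)
```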

5. Now our data is ready for lexicographical slicing. Let us select all the movie titles from the letter A up to the letter B.

title                              budget     vote_average  vote_count
Abandon                            25000000   4.6           45
Abandoned                          0          5.8           27
Abduction                          35000000   5.6           961
Aberdeen                           0          7.0           6
About Last Night                   12500000   6.0           210
...                                ...        ...           ...
Battle for the Planet of the Apes  1700000    5.5           215
Battle of the Year                 20000000   5.9           88
Battle: Los Angeles                70000000   5.5           1448
Battlefield Earth                  44000000   3.0           255
Battleship                         209000000  5.5           2114
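
A selection like the one above can be sketched on a small hand-made frame. The sketch assumes the same "Aa" and "Bb" bounds that were tried in step 3; `movies_small` is a hypothetical name and holds only a few rows from this post.

```python
import pandas as pd

# A few rows from this post, sorted by title before slicing.
movies_small = pd.DataFrame(
    {
        "budget": [2800000, 80000000, 25000000, 209000000, 35000000],
        "vote_average": [5.1, 5.8, 4.6, 5.5, 5.6],
        "vote_count": [366, 1155, 45, 2114, 961],
    },
    index=pd.Index(
        ["Tusk", "Grown Ups 2", "Abandon", "Battleship", "Abduction"],
        name="title",
    ),
).sort_index()

# On a sorted index the bounds do not have to be existing labels:
# every title that sorts between "Aa" and "Bb" is returned.
subset = movies_small.loc["Aa":"Bb"]
print(subset)
```

Titles such as "Grown Ups 2" and "Tusk" sort after "Bb", so they fall outside the slice.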


With the index sorted in descending order instead, the first rows look like this:

title                    budget    vote_average  vote_count
Æon Flux                 62000000  5.4           703
xXx: State of the Union  60000000  4.7           549
xXx                      70000000  5.8           1424
eXistenZ                 15000000  6.7           475
[REC]²                   5600000   6.4           489

budget vote_average vote_count title

It is no surprise to see an empty DataFrame here, because the data is sorted in reverse order. Let us swap the two bounds and run this again.

title                    budget     vote_average  vote_count
B-Girl                   0          5.5           7
Ayurveda: Art of Being   300000     5.5           3
Away We Go               17000000   6.7           189
Awake                    86000000   6.3           395
Avengers: Age of Ultron  280000000  7.3           6767
...                      ...        ...           ...
About Last Night         12500000   6.0           210
Aberdeen                 0          7.0           6
Abduction                35000000   5.6           961
Abandoned                0          5.8           27
Abandon                  25000000   4.6           45
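
The effect of a descending sort can be sketched as follows. The frame is hand-made from a few rows in this post, and the "Aa" and "Bb" bounds are an assumption carried over from step 3, so the rows returned differ from the full-dataset output above.

```python
import pandas as pd

# A few rows from this post, sorted by title in descending order.
movies_desc = pd.DataFrame(
    {
        "budget": [2800000, 80000000, 25000000, 209000000, 35000000],
        "vote_average": [5.1, 5.8, 4.6, 5.5, 5.6],
        "vote_count": [366, 1155, 45, 2114, 961],
    },
    index=pd.Index(
        ["Tusk", "Grown Ups 2", "Abandon", "Battleship", "Abduction"],
        name="title",
    ),
).sort_index(ascending=False)

# Ascending bounds on a descending index select nothing.
empty = movies_desc.loc["Aa":"Bb"]

# Swapping the bounds matches the direction of the index.
reversed_subset = movies_desc.loc["Bb":"Aa"]
print(len(empty))
print(reversed_subset)
```

The slice bounds have to run in the same direction as the index: low to high for an ascending index, high to low for a descending one.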


Updated on: 2020-11-10T09:34:45+05:30
