Data Mining With Python (2024)
Data is everywhere and it’s growing at an unprecedented rate. But making sense of all that data
is a challenge. Data Mining is the process of discovering patterns and knowledge from large data
sets, and Data Mining with Python focuses on the hands-on approach to learning Data Mining.
It showcases how to use Python Packages to fulfil the Data Mining pipeline, which is to collect,
integrate, manipulate, clean, process, organize, and analyze data for knowledge.
The contents are organized based on the Data Mining pipeline, so readers can naturally prog-
ress step by step through the process. Topics, methods, and tools are explained in three aspects:
“What it is” as a theoretical background, “why we need it” as an application orientation, and
“how we do it” as a case study.
This book is designed to give students, data scientists, and business analysts an understanding of
Data Mining concepts in an applicable way. Through interactive tutorials that can be run, modi-
fied, and used for a more comprehensive learning experience, this book will help its readers gain
practical skills to implement Data Mining techniques in their work.
Python has been ranked as the most popular programming language, and it is widely used in
education and industry. This book series will offer a wide range of books on Python for students
and professionals. Titles in the series will help users learn the language at an introductory and
advanced level, and explore its many applications in data science, AI, and machine learning.
Series titles can also be supplemented with Jupyter notebooks.
Di Wu
First edition published 2024
by CRC Press
2385 Executive Center Drive, Suite 320, Boca Raton, FL 33431
© 2024 Di Wu
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot as-
sume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders
if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please
write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including pho-
tocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not
available on CCC please contact mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for iden-
tification and explanation without intent to infringe.
DOI: 10.1201/9781003462781
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
Part I
Data Wrangling
CHAPTER 1
Data Collection
DOI: 10.1201/9781003462781-1
Storing data in different file formats allows for versatility and compatibility with
various applications and tools.
• CSV (Comma-Separated Values): CSV files store tabular data in plain text
format, where each line represents a row, and values are separated by commas (or
other delimiters). CSV files are simple, human-readable, and widely supported.
They can be easily opened and edited using spreadsheet software or text editors.
However, CSV files may not support complex data structures, and there is no
standardized format for metadata or data types. Pandas provides the read_csv()
function, allowing you to read CSV files into a DataFrame object effortlessly.
It automatically detects the delimiter, handles missing values, and provides
convenient methods for data manipulation and analysis.
• TXT (Plain Text): TXT files contain unformatted text with no specific structure
or metadata. TXT files are lightweight, widely supported, and can be easily
opened with any text editor. However, TXT files lack a standardized structure or
format, making it challenging to handle data that requires specific organization
or metadata. Pandas offers the read_csv() function with customizable delimiters
to read text files with structured data. By specifying the appropriate delimiter,
you can read text files into a DataFrame for further analysis.
• XLSX (Microsoft Excel): XLSX is a file format used by Microsoft Excel to
store spreadsheet data with multiple sheets, formatting, formulas, and metadata.
XLSX files support complex spreadsheets with multiple tabs, cell formatting,
and formulas. They are widely used in business and data analysis scenarios.
However, XLSX files can be large, and manipulating them directly can be
memory-intensive. Additionally, XLSX files require software like Microsoft Excel
to view and edit. Pandas provides the read_excel() function, enabling the
reading of XLSX files into DataFrames. It allows you to specify the sheet name,
range of cells, and other parameters to extract data easily.
• JSON (JavaScript Object Notation): JSON is a lightweight, human-readable
data interchange format that represents structured data as key-value pairs, lists,
and nested objects. JSON is easy to read and write, supports complex nested
structures, and is widely used for data interchange between systems. However,
JSON files can be larger than their equivalent CSV representations, and handling deeply nested structures can add complexity. Pandas provides the read_json() function to read JSON data into a DataFrame.
1.1.1.1 CSV
We have done this when we learned Pandas. You can get the path of your CSV file
and feed the path to the read_csv() function.
import pandas as pd

df = pd.read_csv('/content/ds_salaries.csv')
df.head()
   Unnamed: 0  work_year experience_level employment_type  ...
0           0       2020               MI              FT
1           1       2020               SE              FT
2           2       2020               SE              FT
3           3       2020               MI              FT
4           4       2020               SE              FT
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 607 non-null int64
1 work_year 607 non-null int64
2 experience_level 607 non-null object
3 employment_type 607 non-null object
4 job_title 607 non-null object
5 salary 607 non-null int64
6 salary_currency 607 non-null object
7 salary_in_usd 607 non-null int64
8 employee_residence 607 non-null object
9 remote_ratio 607 non-null int64
10 company_location 607 non-null object
11 company_size 607 non-null object
dtypes: int64(5), object(7)
memory usage: 57.0+ KB
Customize settings: You can adjust the arguments for your specific CSV file; for example, header=None prevents the first row from being used as column names.
df = pd.read_csv('/content/ds_salaries.csv', header = None)
df.head()
     0          1                 2                3  ...                   8             9                10            11
0  NaN  work_year  experience_level  employment_type  ...  employee_residence  remote_ratio  company_location  company_size
1  0.0       2020                MI               FT  ...                  DE             0                DE             L
2  1.0       2020                SE               FT  ...                  JP             0                JP             S
3  2.0       2020                SE               FT  ...                  GB            50                GB             M
4  3.0       2020                MI               FT  ...                  HN             0                HN             S
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 607 non-null float64
1 1 608 non-null object
2 2 608 non-null object
3 3 608 non-null object
4 4 608 non-null object
5 5 608 non-null object
6 6 608 non-null object
7 7 608 non-null object
8 8 608 non-null object
9 9 608 non-null object
10 10 608 non-null object
11 11 608 non-null object
dtypes: float64(1), object(11)
memory usage: 57.1+ KB
0 1 2 3 4 5 6 7 8 9 \
0 0 2020 MI FT Data Scientist 70000 EUR 79833 DE 0
1 1 2020 SE FT Machine Learning Scientist 260000 USD 260000 JP 0
2 2 2020 SE FT Big Data Engineer 85000 GBP 109024 GB 50
3 3 2020 MI FT Product Data Analyst 20000 USD 20000 HN 0
4 4 2020 SE FT Machine Learning Engineer 150000 USD 150000 US 50
10 11
0 DE L
1 JP S
2 GB M
3 HN S
4 US L
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 607 non-null int64
1 1 607 non-null int64
2 2 607 non-null object
3 3 607 non-null object
4 4 607 non-null object
5 5 607 non-null int64
6 6 607 non-null object
7 7 607 non-null int64
8 8 607 non-null object
9 9 607 non-null int64
10 10 607 non-null object
11 11 607 non-null object
dtypes: int64(5), object(7)
memory usage: 57.0+ KB
0 1 2 3 4 5 6 7 8 9 \
0 0 2020 MI FT Data Scientist 70000 EUR 79833 DE 0
1 1 2020 SE FT Machine Learning Scientist 260000 USD 260000 JP 0
2 2 2020 SE FT Big Data Engineer 85000 GBP 109024 GB 50
3 3 2020 MI FT Product Data Analyst 20000 USD 20000 HN 0
4 4 2020 SE FT Machine Learning Engineer 150000 USD 150000 US 50
10 11
0 DE L
1 JP S
2 GB M
3 HN S
4 US L
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307 entries, 0 to 306
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 307 non-null int64
1 1 307 non-null int64
2 2 307 non-null object
3 3 307 non-null object
4 4 307 non-null object
5 5 307 non-null int64
6 6 307 non-null object
7 7 307 non-null int64
8 8 307 non-null object
9 9 307 non-null int64
10 10 307 non-null object
11 11 307 non-null object
dtypes: int64(5), object(7)
memory usage: 28.9+ KB
1.1.1.2 TXT
If the TXT file follows the CSV format, it can be read as a CSV file:
df = pd.read_csv('/content/ds_salaries.txt')
df
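If the text file uses a different delimiter, you can pass it explicitly via the sep argument. A minimal sketch, assuming a hypothetical tab-delimited file at this path:
import pandas as pd

df_tab = pd.read_csv('/content/ds_salaries_tab.txt', sep='\t')  # hypothetical tab-delimited file
df_tab.head()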
1.1.1.3 Excel
df = pd.read_excel('/content/ds_salaries.xlsx')
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 607 non-null int64
1 work_year 607 non-null int64
2 experience_level 607 non-null object
3 employment_type 607 non-null object
4 job_title 607 non-null object
5 salary 607 non-null int64
6 salary_currency 607 non-null object
7 salary_in_usd 607 non-null int64
8 employee_residence 607 non-null object
9 remote_ratio 607 non-null int64
10 company_location 607 non-null object
11 company_size 607 non-null object
dtypes: int64(5), object(7)
memory usage: 57.0+ KB
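read_excel can also target a particular sheet or cell range, as mentioned earlier. A minimal sketch; the sheet name, column range, and row limit below are assumptions for illustration:
import pandas as pd

df_sheet = pd.read_excel('/content/ds_salaries.xlsx',
                         sheet_name='Sheet1',  # assumed sheet name
                         usecols='A:E',        # read only columns A through E
                         nrows=100)            # read only the first 100 data rows
df_sheet.head()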
1.1.1.4 JSON
df = pd.read_json('/content/ds_salaries.json')
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 FIELD1 607 non-null int64
1 work_year 607 non-null int64
2 experience_level 607 non-null object
3 employment_type 607 non-null object
4 job_title 607 non-null object
5 salary 607 non-null int64
6 salary_currency 607 non-null object
7 salary_in_usd 607 non-null int64
8 employee_residence 607 non-null object
9 remote_ratio 607 non-null int64
10 company_location 607 non-null object
11 company_size 607 non-null object
dtypes: int64(5), object(7)
memory usage: 57.0+ KB
1.1.1.5 XML
df = pd.read_xml('/content/ds_salaries.xml')
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 FIELD1 607 non-null int64
1 work_year 607 non-null int64
2 experience_level 607 non-null object
3 employment_type 607 non-null object
4 job_title 607 non-null object
5 salary 607 non-null int64
6 salary_currency 607 non-null object
7 salary_in_usd 607 non-null int64
8 employee_residence 607 non-null object
9 remote_ratio 607 non-null int64
10 company_location 607 non-null object
11 company_size 607 non-null object
dtypes: int64(5), object(7)
memory usage: 57.0+ KB
1.1.1.6 HTM
df = pd.read_html('/content/ds_salaries.htm')[0]
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 FIELD1 607 non-null int64
1 work_year 607 non-null int64
2 experience_level 607 non-null object
3 employment_type 607 non-null object
4 job_title 607 non-null object
5 salary 607 non-null int64
6 salary_currency 607 non-null object
7 salary_in_usd 607 non-null int64
8 employee_residence 607 non-null object
9 remote_ratio 607 non-null int64
10 company_location 607 non-null object
11 company_size 607 non-null object
dtypes: int64(5), object(7)
memory usage: 57.0+ KB
1.1.2 Documentation
It is always good to have a reference for the file-reading functions in Pandas. You can find it at https://pandas.pydata.org/docs/reference/io.html
1.2.1.1 Wiki
Some websites maintain structured data, which is easy to read:
table = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)#Table')
for i in table:
print(type(i))
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
for i in table:
print(i.columns)
Int64Index([0], dtype='int64')
Int64Index([0, 1, 2], dtype='int64')
MultiIndex([( 'Country/Territory', 'Country/Territory'),
( 'UN Region', 'UN Region'),
( 'IMF[1][13]', 'Estimate'),
( 'IMF[1][13]', 'Year'),
( 'World Bank[14]', 'Estimate'),
( 'World Bank[14]', 'Year'),
('United Nations[15]', 'Estimate'),
('United Nations[15]', 'Year')],
)
...
Int64Index([0, 1], dtype='int64')
df = table[2]
df.head()
United Nations[15]
Estimate Year
0 85328323 2020
1 20893746 2020
2 14722801 [n 1]2020
3 5057759 2020
4 3846414 2020
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217 entries, 0 to 216
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 (Country/Territory, Country/Territory) 217 non-null object
1 (UN Region, UN Region) 217 non-null object
2 (IMF[1][13], Estimate) 217 non-null object
3 (IMF[1][13], Year) 217 non-null object
4 (World Bank[14], Estimate) 217 non-null object
5 (World Bank[14], Year) 217 non-null object
6 (United Nations[15], Estimate) 217 non-null object
7 (United Nations[15], Year) 217 non-null object
dtypes: object(8)
memory usage: 13.7+ KB
Download by requests We’ll need to first import the requests library, and then
download the page using the requests.get method:
import requests
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
page
<Response [200]>
After running our request, we get a Response object. This object has a status_code
property, which indicates if the page was downloaded successfully:
page.status_code
200
A status_code of 200 means that the page downloaded successfully. We won’t fully
dive into status codes here, but a status code starting with a 2 generally indicates
success, and a code starting with a 4 or a 5 indicates an error.
We can print out the HTML content of the page using the content property:
page.content
Parsing by BeautifulSoup As you can see above, we now have downloaded an HTML
document.
We can use the BeautifulSoup library to parse this document and extract the text from the p tag.
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')
We can now print out the HTML content of the page, formatted nicely, using the
prettify method on the BeautifulSoup object.
print(soup.prettify())
<!DOCTYPE html>
<html>
<head>
<title>
A simple example page
</title>
</head>
<body>
<p>
Here is some simple content for this page.
</p>
</body>
</html>
This step isn’t strictly necessary, and we won’t always bother with it, but it can be
helpful to look at prettified HTML to make the structure of the and where tags are
nested easier to see.
Finding Tags
Finding all instances of a tag at once: What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple. If we want to find every instance of a tag, we can use the find_all method, which will find all the instances of a tag on a page.
If we are looking for the title, we can look for the <title> tag:
soup.find_all('title')
for t in soup.find_all('title'):
print(t.get_text())
If you instead only want to find the first instance of a tag, you can use the find method,
which will return a single BeautifulSoup object:
soup.find('p').get_text()
'Here is some simple content for this page.'
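The more complex document shown below comes from downloading and parsing another sample page; a sketch of the likely steps (the URL here is an assumption based on the same tutorial site):
import requests
from bs4 import BeautifulSoup

page = requests.get("https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")  # assumed URL
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())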
<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
First paragraph.
</p>
<p class="inner-text">
Second paragraph.
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
First outer paragraph.
</b>
</p>
<p class="outer-text">
<b>
Second outer paragraph.
</b>
</p>
</body>
</html>
Now, we can use the find_all method to search for items by class or by id. In the
below example, we’ll search for any p tag that has the class outer-text:
soup.find_all('p', class_='outer-text')
In the below example, we’ll look for any tag that has the class outer-text:
soup.find_all(class_="outer-text")
import requests
from bs4 import BeautifulSoup
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=40.0466&lon=-105.2523#.YwpRBy2B1f0")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
print(forecast_items)
[<div class="tombstone-container">
<p class="period-name">Today<br/><br/></p>
<p><img alt="Today: Sunny...>
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Mostly clear...>
...
tonight = forecast_items[0]
print(tonight.prettify())
<div class="tombstone-container">
<p class="period-name">
Today
<br/>
<br/>
</p>
<p>
<img alt="Today: Sunny, with a high near 88.
Northwest wind 9 to 13 mph,
with gusts as high as 21 mph. "
class="forecast-icon" src="newimages/medium/few.png"
title="Today: Sunny, with a high near 88.
Northwest wind 9 to 13 mph,
with gusts as high as 21 mph. "/>
</p>
<p class="short-desc">
Sunny
</p>
<p class="temp temp-high">
High: 88 °F
</p>
</div>
We’ll extract the name of the forecast item, the short description, and the temperature
first, since they’re all similar:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)
Today
Sunny
High: 88 °F
Now, we can extract the title attribute from the img tag. To do this, we just treat the
BeautifulSoup object like a dictionary, and pass in the attribute we want as a key:
img = tonight.find("img")
desc = img['title']
print(desc)
Today: Sunny,
with a high near 88.
Northwest wind 9 to 13 mph,
with gusts as high as 21 mph.
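The list of period names below was likely produced with a CSS selector on the seven-day container, mirroring the selectors used for the other fields; a sketch of that cell:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods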
['Today',
'Tonight',
'Sunday',
'SundayNight',
'Monday',
'MondayNight',
'Tuesday',
'TuesdayNight',
'Wednesday']
As we can see above, our technique gets us each of the period names, in order.
We can apply the same technique to get the other three fields:
short_descs = [sd.get_text() for sd in seven_day.select(
".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(
".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(
".tombstone-container img")]
print(short_descs)
print(temps)
print(descs)
Storing data in SQL databases offers several advantages and considerations. The advantages are:
• Structured Storage: SQL databases provide a structured storage model with tables, rows, and columns, allowing for efficient organization and retrieval of data.
• Data Integrity and Consistency: SQL databases enforce data integrity through
constraints, such as primary keys, unique keys, and referential integrity, ensuring
the accuracy and consistency of the stored data.
• Querying and Analysis: SQL databases offer powerful query languages (e.g.,
SQL) that enable complex data retrieval, filtering, aggregations, and analysis
operations.
• ACID Compliance: SQL databases adhere to ACID (Atomicity, Consistency,
Isolation, Durability) properties, ensuring reliable and transactional data opera-
tions.
To collect data from a SQL database, you need to establish a connection to the database
server. This typically involves providing connection details such as server address,
port, username, and password. Once connected, you can use SQL queries to extract
data from the database. Queries can range from simple retrieval of specific records to
complex joins, aggregations, and filtering operations. Python provides several libraries
for interacting with SQL databases, such as sqlite3, psycopg2, pymysql, and pyodbc.
These libraries allow you to establish connections, execute SQL queries, and retrieve
the query results into Python data structures for further processing.
import sqlite3
connection = sqlite3.connect('/content/ds_salaries.sqlite')
cursor = connection.cursor()
query = '''
SELECT name FROM sqlite_master
WHERE type='table';
'''
cursor.execute(query)
results = cursor.fetchall()
results
[('ds_salaries',)]
# Assumed query, implied by the output below: select all rows from the table
query = '''
SELECT * FROM ds_salaries;
'''
cursor.execute(query)
results = cursor.fetchall()
results
[(None,
'work_year',
'experience_level',
'employment_type',
'job_title',
'salary',
'salary_currency',
'salary_in_usd',
'employee_residence',
'remote_ratio',
'company_location',
'company_size'),
(0,
'2020',
'MI',
'FT',
'Data Scientist',
'70000',
'EUR',
'79833',
'DE',
'0',
'DE',
'L'),
...,
(606,
'2022',
'MI',
'FT',
'AI Scientist',
'200000',
'USD',
'200000',
'IN',
'100',
'US',
'L')]
df = pd.DataFrame(results)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 607 non-null float64
1 1 608 non-null object
2 2 608 non-null object
3 3 608 non-null object
4 4 608 non-null object
5 5 608 non-null object
6 6 608 non-null object
7 7 608 non-null object
8 8 608 non-null object
9 9 608 non-null object
10 10 608 non-null object
11 11 608 non-null object
dtypes: float64(1), object(11)
memory usage: 57.1+ KB
df.iloc[0]
0 NaN
1 work_year
2 experience_level
3 employment_type
4 job_title
5 salary
6 salary_currency
7 salary_in_usd
8 employee_residence
9 remote_ratio
10 company_location
11 company_size
Name: 0, dtype: object
cols = list(df.iloc[0])
cols
[nan,
'work_year',
'experience_level',
'employment_type',
'job_title',
'salary',
'salary_currency',
'salary_in_usd',
'employee_residence',
'remote_ratio',
'company_location',
'company_size']
df.columns = cols
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 nan 607 non-null float64
1 work_year 608 non-null object
2 experience_level 608 non-null object
3 employment_type 608 non-null object
4 job_title 608 non-null object
5 salary 608 non-null object
6 salary_currency 608 non-null object
7 salary_in_usd 608 non-null object
8 employee_residence 608 non-null object
9 remote_ratio 608 non-null object
10 company_location 608 non-null object
11 company_size 608 non-null object
dtypes: float64(1), object(11)
memory usage: 57.1+ KB
4 3.0 2020 MI FT
5 4.0 2020 SE FT
.. ... ... ... ...
603 602.0 2022 SE FT
604 603.0 2022 SE FT
605 604.0 2022 SE FT
606 605.0 2022 SE FT
607 606.0 2022 MI FT
cursor.close()
connection.close()
connection = sqlite3.connect('/content/shopping.sqlite')
cursor = connection.cursor()
query = '''
SELECT name FROM sqlite_master
WHERE type='table';
'''
cursor.execute(query)
results = cursor.fetchall()
results
[('customer_shopping_data',)]
# Assumed query, implied by the output below: preview a few rows from the table
query = '''
SELECT * FROM customer_shopping_data
LIMIT 3;
'''
cursor.execute(query)
results = cursor.fetchall()
results
[('I138884',
'C241288',
'Female',
28,
'Clothing',
5,
1500.4,
'Credit Card',
'5/8/2022',
'Kanyon'),
('I317333',
'C111565',
'Male',
21,
'Shoes',
3,
1800.51,
'Debit Card',
'12/12/2021',
'Forum Istanbul'),
('I127801',
'C266599',
'Male',
20,
'Clothing',
1,
300.08,
'Cash',
'9/11/2021',
'Metrocity')]
# Assumed reconstruction: fetch all rows and take column names from the cursor metadata
query = '''
SELECT * FROM customer_shopping_data;
'''
cursor.execute(query)
results = cursor.fetchall()

import pandas as pd
df = pd.DataFrame(results, columns=[d[0] for d in cursor.description])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16029 entries, 0 to 16028
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 invoice_no 16029 non-null object
1 customer_id 16029 non-null object
2 gender 16029 non-null object
3 age 16029 non-null int64
4 category 16029 non-null object
5 quantity 16029 non-null int64
df.head()
df.to_csv('/content/shopping.csv')
• Data Quality and Reliability: API data quality and reliability depend on the data
provider. It’s important to verify the accuracy, completeness, and consistency
of the data obtained through APIs.
• API Changes and Deprecation: APIs may evolve over time, and changes to
endpoints, parameters, or authentication mechanisms can require updates in
your data collection code.
Examples of APIs are:
• Yahoo Finance API: The Yahoo Finance API provides access to financial market
data, including stock quotes, historical prices, company information, and more.
By interacting with the Yahoo Finance API, you can programmatically retrieve
financial data for analysis, investment strategies, or market monitoring.
• OpenWeatherMap API: The OpenWeatherMap API offers weather data for var-
ious locations worldwide. You can fetch weather conditions, forecasts, historical
weather data, and other meteorological information through their API.
• Twitter API: The Twitter API enables access to Twitter’s vast collection of
tweets and user data. You can use the API to retrieve tweets, monitor hashtags
or keywords, analyze sentiment, and gain insights from Twitter’s social media
data.
• Google Maps API: The Google Maps API provides access to location-based
services, including geocoding, distance calculations, routing, and map visualiza-
tion. It allows you to integrate maps and location data into your applications or
retrieve information related to places, addresses, or geographic features.
To collect data through APIs, you need to understand the API’s documentation,
authentication mechanisms, request formats (often in JSON or XML), and available
endpoints. Python provides libraries such as requests and urllib that facilitate making
HTTP requests to interact with APIs. You typically send HTTP requests with
the required parameters, handle the API responses, and process the returned data
according to your needs.
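As a minimal sketch of this request-response pattern (the endpoint, parameters, and API key below are placeholders, not a real service):
import requests

response = requests.get(
    'https://api.example.com/v1/records',             # hypothetical endpoint
    params={'q': 'data science', 'limit': 10},        # query parameters
    headers={'Authorization': 'Bearer YOUR_API_KEY'}, # many APIs require a token
    timeout=10,
)
response.raise_for_status()   # raise an error for 4xx/5xx responses
payload = response.json()     # most APIs return JSON
print(type(payload))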
1.4.1.1 Installation
!pip install yfinance
!pip install yahoofinancials
1.4.1.2 Analysis
The yfinance package can be imported into Python programs once it has been installed.
We must pass the company's ticker symbol as the argument; we will use Google as an example.
A security is given a specific set of letters called a ticker or a stock symbol for trading
purposes. For instance:
For Amazon, it is "AMZN"; for Facebook, it is "FB"; for Google, it is "GOOGL"; for Microsoft, it is "MSFT".
import yfinance as yahooFinance

# Create a Ticker object for Google (Alphabet) using its ticker symbol
GoogleInfo = yahooFinance.Ticker("GOOGL")
print(GoogleInfo.info)
The print statement produces a Python dictionary, which we can analyze and use to
get the specific financial data we’re looking for from Yahoo Finance. Let’s take a few
critical financial metrics as an example.
The info dictionary contains all firm information. As a result, we may extract the
desired elements from the dictionary by parsing it:
We can retrieve financial key metrics like Company Sector, Price Earnings Ratio, and
Company Beta from the above dictionary of items easily. Let us see the below code.
# display Company Sector
print("Company Sector : ", GoogleInfo.info['sector'])
There is a lot more information available. By printing the dictionary's keys, we can view all of them:
zip : 94043
sector : Communication Services
fullTimeEmployees : 174014
longBusinessSummary : Alphabet Inc. ... in Mountain View, California.
city : Mountain View
...
logo_url : https://logo.clearbit.com/abc.xyz
trailingPegRatio : 1.3474
We can also retrieve historical market prices and display them.
We will use historical Google stock values over the past few years as our example. It
is a relatively easy assignment to complete, as demonstrated below:
# covering the past few years.
# max->maximum number of daily prices available
# for Google.
# Valid options are 1d, 5d, 1mo, 3mo, 6mo, 1y, 2y,
# 5y, 10y and ytd.
print(GoogleInfo.history(period="max"))
import datetime

start = datetime.datetime(2012, 5, 31)
end = datetime.datetime(2013, 1, 30)
print(GoogleInfo.history(start=start, end=end))
Stock Splits
Date
2012-05-31 0
2012-06-01 0
2012-06-04 0
2012-06-05 0
2012-06-06 0
... ...
2013-01-23 0
2013-01-24 0
2013-01-25 0
2013-01-28 0
2013-01-29 0
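The progress bar and price tables below come from a bulk download of multiple tickers; a sketch of the likely call, with the tickers and date range inferred from the output:
data = yahooFinance.download(['AMZN', 'GOOGL'], start='2019-01-01', end='2020-01-01')
print(data)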
[*********************100%***********************] 2 of 2 completed
AMZN \
Open High Low Close Adj Close Volume
Date
2019-01-02 73.260002 77.667999 73.046501 76.956497 76.956497 159662000
2019-01-03 76.000504 76.900002 74.855499 75.014000 75.014000 139512000
2019-01-04 76.500000 79.699997 75.915497 78.769501 78.769501 183652000
2019-01-07 80.115501 81.727997 79.459503 81.475502 81.475502 159864000
2019-01-08 83.234497 83.830498 80.830498 82.829002 82.829002 177628000
... ... ... ... ... ... ...
2019-12-24 89.690498 89.778503 89.378998 89.460503 89.460503 17626000
2019-12-26 90.050499 93.523003 89.974998 93.438499 93.438499 120108000
2019-12-27 94.146004 95.070000 93.300499 93.489998 93.489998 123732000
2019-12-30 93.699997 94.199997 92.030998 92.344498 92.344498 73494000
2019-12-31 92.099998 92.663002 91.611504 92.391998 92.391998 50130000
GOOGL
Open High Low Close Adj Close Volume
Date
2019-01-02 51.360001 53.039501 51.264000 52.734001 52.734001 31868000
2019-01-03 52.533501 53.313000 51.118500 51.273499 51.273499 41960000
2019-01-04 52.127998 54.000000 51.842999 53.903500 53.903500 46022000
2019-01-07 54.048500 54.134998 53.132000 53.796001 53.796001 47446000
2019-01-08 54.299999 54.667500 53.417500 54.268501 54.268501 35414000
... ... ... ... ... ... ...
2019-12-24 67.510498 67.600502 67.208504 67.221497 67.221497 13468000
2019-12-26 67.327499 68.160004 67.275497 68.123497 68.123497 23662000
2019-12-27 68.199997 68.352501 67.650002 67.732002 67.732002 23212000
2019-12-30 67.840500 67.849998 66.891998 66.985497 66.985497 19994000
2019-12-31 66.789497 67.032997 66.606499 66.969498 66.969498 19514000
CHAPTER 2
Data Integration
DOI: 10.1201/9781003462781-2

Data Integration is the process of combining data from different sources into
a single, unified view. It involves the combination of data from different data
types, structures, and formats to form a single dataset that can be used for analysis
and reporting. This step is important because it allows for the analysis of data from
multiple sources, which can provide a more complete and accurate picture of the data
being analyzed.
There are several Python packages that are commonly used for data integration,
including:
• Pandas: A powerful library for data manipulation and analysis that provides
data structures such as DataFrame and Series, that allow you to combine, filter,
transform, and shape your data.
• NumPy: A powerful library for array computation that provides a high-
performance multidimensional array object and tools to work with these arrays.
Objective: Collect data from various files, an SQLite database, and webpages for a
client.
Steps to fulfill the request:
• Understand the requirements: Schedule a meeting with the client to gather de-
tailed requirements. Determine the specific files, SQLite database, and webpages
from which the client wants to collect data. Clarify the desired data format,
extraction criteria, and any specific data processing requirements.
• Data collection from files: Identify the file formats (e.g., CSV, HTML, TXT,
XLSX, JSON) and their locations. Utilize the appropriate Python libraries (e.g.,
Pandas) to read and extract data from each file format. Iterate through the
files, apply the relevant parsing techniques, and store the extracted data in a
unified format (e.g., DataFrame).
• Data collection from SQLite database: Obtain the SQLite database file and
connection details. Use a Python library (e.g., sqlite3) to establish a connection
to the SQLite database. Execute SQL queries to retrieve the desired data from
specific tables or views. Fetch the query results into a Python data structure
(e.g., DataFrame) for further processing or integration with other data.
• Data collection from webpages: Identify the target webpages and determine
the appropriate approach for data extraction. If the webpages are structured
or semi-structured, leverage Python libraries (e.g., BeautifulSoup) to parse
the HTML/XML content and extract the required data using tags or CSS
selectors. If the webpages are unstructured or require interaction, consider tools
like Selenium to automate browser interactions and extract data through web
scraping or API calls. Apply relevant data cleaning and transformation steps as
needed.
Remember to maintain clear communication with the client throughout the process,
seeking clarification when needed and delivering the final dataset according to their
specifications. Regularly document your progress and keep track of any challenges
faced or solutions implemented.
import numpy as np
import pandas as pd
df = pd.read_csv('/content/sample_data/california_housing_test.csv')
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
2.1.1.2 Concatenation
Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
df_1 = df[['longitude','latitude','median_income']].sample(n=5)
df_2 = df[['longitude','latitude','median_income']].sample(n=5)
df_3 = df[['longitude','latitude','median_income']].sample(n=5)
df_1
# Concatenate the three samples row-wise
df_cat1 = pd.concat([df_1, df_2, df_3])
df_cat1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 362 to 1671
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 15 non-null float64
1 latitude 15 non-null float64
2 median_income 15 non-null float64
dtypes: float64(3)
memory usage: 480.0 bytes
# Concatenate column-wise (axis=1); unmatched row indices produce NaN
df_cat2 = pd.concat([df_1, df_2, df_3], axis=1)
df_cat2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 362 to 1671
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 5 non-null float64
1 latitude 5 non-null float64
2 median_income 5 non-null float64
3 longitude 5 non-null float64
4 latitude 5 non-null float64
5 median_income 5 non-null float64
6 longitude 5 non-null float64
7 latitude 5 non-null float64
8 median_income 5 non-null float64
dtypes: float64(9)
memory usage: 1.2 KB
df_1 = df[['longitude']][:5]
df_2 = df[['latitude']][:5]
df_3 = df[['median_income']][:5]
df_1, df_2, df_3
( longitude
0 -122.05
1 -118.30
2 -117.81
3 -118.36
4 -119.67,
latitude
0 37.37
1 34.26
2 33.78
3 33.82
4 36.33,
median_income
0 6.6085
1 3.5990
2 5.7934
3 6.1359
4 2.9375)
2.1.1.3 Merging
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
df_1=df[['longitude','median_income']][0:5]
df_1
longitude median_income
0 -122.05 6.6085
1 -118.30 3.5990
2 -117.81 5.7934
3 -118.36 6.1359
4 -119.67 2.9375
df_2=df[['longitude','median_house_value']][0:5]
df_2
longitude median_house_value
0 -122.05 344700.0
1 -118.30 176500.0
2 -117.81 270500.0
3 -118.36 330000.0
4 -119.67 81700.0
pd.merge(df_1,df_2,on=['longitude'],how='inner')
df_3=df[['longitude','population',]][2:7]
df_3
longitude population
2 -117.81 1484.0
3 -118.36 49.0
4 -119.67 850.0
5 -119.56 663.0
6 -121.43 604.0
pd.merge(df_1,df_3,on='longitude',how='inner')
pd.merge(df_1,df_3,on='longitude',how='outer').drop_duplicates()
2.1.1.4 Joining
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
df_1=df[['longitude']][0:5]
df_1
longitude
0 -122.05
1 -118.30
2 -117.81
3 -118.36
4 -119.67
df_2=df[['latitude']][2:7]
df_2
latitude
2 33.78
3 33.82
4 36.33
5 36.51
6 38.63
df_1.join(df_2,how='left')
longitude latitude
0 -122.05 NaN
1 -118.30 NaN
2 -117.81 33.78
3 -118.36 33.82
4 -119.67 36.33
df_1.join(df_2,how='right')
longitude latitude
2 -117.81 33.78
3 -118.36 33.82
4 -119.67 36.33
5 NaN 36.51
6 NaN 38.63
df_1.join(df_2,how='inner')
longitude latitude
2 -117.81 33.78
3 -118.36 33.82
4 -119.67 36.33
df_1.join(df_2,how='outer')
longitude latitude
0 -122.05 NaN
1 -118.30 NaN
2 -117.81 33.78
3 -118.36 33.82
4 -119.67 36.33
5 NaN 36.51
6 NaN 38.63
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
company_location company_size
0 DE L
1 US L
2 RU M
3 RU L
4 US S
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245 entries, 0 to 244
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 work_year 245 non-null object
1 experience_level 245 non-null object
2 employment_type 245 non-null object
3 job_title 245 non-null object
4 salary 245 non-null int64
5 salary_currency 245 non-null object
6 salary_in_usd 245 non-null int64
7 employee_residence 245 non-null object
8 remote_ratio 245 non-null int64
9 company_location 245 non-null object
10 company_size 245 non-null object
dtypes: int64(3), object(8)
memory usage: 21.2+ KB
2.1.2.2 Concatenation
Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
df_1 = df[['company_location','job_title',
'experience_level','salary_in_usd']].sample(n=5)
df_2 = df[['company_location','job_title',
'experience_level','salary_in_usd']].sample(n=5)
df_3 = df[['company_location','job_title',
'experience_level','salary_in_usd']].sample(n=5)
df_1
# Concatenate the three samples row-wise
df_cat1 = pd.concat([df_1, df_2, df_3])
df_cat1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 16 to 84
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company_location 15 non-null object
1 job_title 15 non-null object
2 experience_level 15 non-null object
3 salary_in_usd 15 non-null int64
dtypes: int64(1), object(3)
memory usage: 600.0+ bytes
salary_in_usd
16 NaN
125 NaN
25 NaN
22 NaN
41 NaN
216 NaN
73 NaN
137 NaN
28 NaN
45 NaN
92 115000.0
70 105000.0
242 105000.0
130 71968.0
84 72625.0
# Concatenate column-wise (axis=1); unmatched row indices produce NaN
df_cat2 = pd.concat([df_1, df_2, df_3], axis=1)
df_cat2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 16 to 84
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company_location 5 non-null object
1 job_title 5 non-null object
2 experience_level 5 non-null object
3 salary_in_usd 5 non-null float64
4 company_location 5 non-null object
5 job_title 5 non-null object
6 experience_level 5 non-null object
7 salary_in_usd 5 non-null float64
8 company_location 5 non-null object
9 job_title 5 non-null object
10 experience_level 5 non-null object
11 salary_in_usd 5 non-null float64
dtypes: float64(3), object(9)
memory usage: 1.5+ KB
2.1.2.3 Merging
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
df_1=df[['company_location','experience_level','salary_in_usd']][0:5]
df_1
df_2=df[['company_location','job_title','salary_in_usd']][0:5]
df_2
pd.merge(df_1,df_2,on='company_location',how='inner')
job_title salary_in_usd_y
0 Data Science Consultant 64369
1 Data Scientist 68428
2 Machine Learning Engineer 125000
3 Data Scientist 68428
4 Machine Learning Engineer 125000
5 Head of Data Science 85000
6 Head of Data 230000
7 Head of Data Science 85000
8 Head of Data 230000
pd.merge(df_1,df_2,on='company_location',how='inner').drop_duplicates()
job_title salary_in_usd_y
0 Data Science Consultant 64369
1 Data Scientist 68428
2 Machine Learning Engineer 125000
3 Data Scientist 68428
4 Machine Learning Engineer 125000
5 Head of Data Science 85000
6 Head of Data 230000
7 Head of Data Science 85000
8 Head of Data 230000
df_3=df[['company_location','job_title','experience_level',]][2:6]
df_3
pd.merge(df_1,df_3,on='company_location',how='inner').drop_duplicates()
job_title experience_level_y
0 Machine Learning Engineer EN
1 Data Analytics Manager SE
2 Machine Learning Engineer EN
3 Data Analytics Manager SE
4 Head of Data Science EX
5 Head of Data EX
pd.merge(df_1,df_3,on='company_location',how='outer').drop_duplicates()
job_title experience_level_y
0 NaN NaN
1 Machine Learning Engineer EN
2 Data Analytics Manager SE
3 Machine Learning Engineer EN
4 Data Analytics Manager SE
5 Head of Data Science EX
6 Head of Data EX
7 Head of Data Science EX
8 Head of Data EX
2.1.2.4 Joining
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
df_1=df[['experience_level']][0:5]
df_1
experience_level
0 EN
1 SE
2 EX
3 EX
4 EN
df_2=df[['job_title']][2:7]
df_2
job_title
2 Head of Data Science
3 Head of Data
4 Machine Learning Engineer
5 Data Analytics Manager
6 Research Scientist
df_1.join(df_2,how='left').drop_duplicates()
experience_level job_title
0 EN NaN
1 SE NaN
2 EX Head of Data Science
3 EX Head of Data
4 EN Machine Learning Engineer
df_1.join(df_2,how='right').drop_duplicates()
experience_level job_title
2 EX Head of Data Science
3 EX Head of Data
4 EN Machine Learning Engineer
5 NaN Data Analytics Manager
6 NaN Research Scientist
df_1.join(df_2,how='inner').drop_duplicates()
experience_level job_title
2 EX Head of Data Science
3 EX Head of Data
4 EN Machine Learning Engineer
df_1.join(df_2,how='outer').drop_duplicates()
experience_level job_title
0 EN NaN
1 SE NaN
2 EX Head of Data Science
3 EX Head of Data
4 EN Machine Learning Engineer
5 NaN Data Analytics Manager
6 NaN Research Scientist
CHAPTER 3
Data Statistics
We begin with a thorough examination of data types. We categorize data into two
distinct groups: non-numerical and numerical. Nonnumerical data encompasses qual-
itative information, such as categories or labels, while numerical data consists of
quantitative values. Understanding the characteristics and significance of these data
types is crucial for effective data analysis.
Central tendency measures are fundamental to statistical analysis. In this section, we
delve into the heart of data summarization by introducing key measures, including
the mean, median, and mode. These measures provide insights into the central or
typical value within a dataset and are invaluable tools for data interpretation.
DOI: 10.1201/9781003462781-3
Our exploration continues with a focus on dispersion and location metrics. Dispersion
measures, such as standard deviation and variance, quantify the spread or variability
of data points. Location metrics, on the other hand, help pinpoint central positions
within a dataset. We explore how these metrics contribute to a deeper understanding
of data patterns and variability.
The Interquartile Range (IQR) is a powerful tool for understanding data variability.
We not only explain how to calculate the IQR but also provide practical guidance
on its interpretation. This measure is particularly useful for identifying outliers and
assessing data distribution.
By the end of this chapter, you will have a solid foundation in these key concepts,
making you better equipped to navigate the complexities of data analysis. These
concepts serve as the building blocks for more advanced topics in the field and will
empower you to extract meaningful insights from your datasets. As we progress
through this chapter, remember that our aim is to provide you with both theoretical
understanding and practical applications of these concepts, ensuring that you can
confidently apply them to real-world data scenarios.
import pandas as pd
import numpy as np
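The outputs below come from a Spotify/YouTube tracks dataset; a sketch of the loading step (the file name and path are assumptions):
df = pd.read_csv('/content/Spotify_Youtube.csv')  # assumed path
df.head()
df.info()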
Artist Track \
0 Gorillaz Feel Good Inc.
1 Gorillaz Rhinestone Eyes
2 Gorillaz New Gold (feat. Tame Impala and Bootie Brown)
3 Gorillaz On Melancholy Hill
4 Gorillaz Clint Eastwood
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20718 entries, 0 to 20717
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Artist 20718 non-null object
1 Track 20718 non-null object
2 Album 20718 non-null object
3 Album_type 20718 non-null object
4 Views 20248 non-null float64
5 Likes 20177 non-null float64
6 Comments 20149 non-null float64
7 Licensed 20248 non-null object
8 official_video 20248 non-null object
9 Stream 20142 non-null float64
dtypes: float64(4), object(6)
memory usage: 1.6+ MB
df.describe()

df['Artist'].value_counts()

Gorillaz 10
Die drei !!! 10
Hollywood Undead 10
Empire of the Sun 10
White Noise for Babies 10
..
NewJeans 6
Alfonso Herrera 6
Jimin 3
Stars Music Chile 1
Bootie Brown 1
Name: Artist, Length: 2079, dtype: int64
df['Artist'].unique()
array(['Gorillaz', 'Red Hot Chili Peppers', '50 Cent', ..., 'LE SSERAFIM',
'ThxSoMch', 'SICK LEGEND'], dtype=object)
df['Artist'].nunique()
2079
Artist 2079
Track 17841
Album 11937
Album_type 3
Licensed 2
official_video 2
dtype: int64
Album_type
album 14926
single 5004
compilation 788
Licensed
True 14140
False 6108
official_video
True 15723
False 4525
col: Views
min: 0.0 max: 8079649362.0 median: 14501095.0 mode: 6639.0
midrange: 4039824681.0
def getCentralTendency(col):
min = df[col].min()
max = df[col].max()
median = df[col].median()
mode = df[col].mode()[0]
midrange = (max - min)/2
print('col:',col,
'\n\tmin:', min,
'max:',max,
'median:', median,
'mode:', mode,
'midrange:', midrange)
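The summaries below come from applying this helper to each numerical column; a sketch of the likely loop:
for col in ['Views', 'Likes', 'Comments', 'Stream']:
    getCentralTendency(col)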
col: Views
min: 0.0 max: 8079649362.0
median: 14501095.0 mode: 6639.0 midrange: 4039824681.0
col: Likes
min: 0.0 max: 50788652.0
median: 124481.0 mode: 0.0 midrange: 25394326.0
col: Comments
min: 0.0 max: 16083138.0
median: 3277.0 mode: 0.0 midrange: 8041569.0
col: Stream
min: 6574.0 max: 3386520288.0
median: 49682981.5 mode: 169769959.0 midrange: 1693256857.0
Dispersion
range, quantiles, var, std
col = 'Views'
range = df[col].max() - df[col].min()
quantiles = df[col].quantile([0.25, 0.5, 0.75])
IQR = quantiles[0.75] - quantiles[0.25]
var = df[col].var()
std = df[col].std()
print('col:',col,
'\n\trange:', range,
'Q1:',quantiles[0.25],
'Q2:', quantiles[0.5],
'Q3:', quantiles[0.75],
'IQR:', IQR,
'var:', var,
'std:', std)
col: Views
range: 8079649362.0 Q1: 1826001.5 Q2: 14501095.0 Q3: 70399749.0
IQR: 68573747.5 var: 7.542950360937822e+16 std: 274644322.0046215
def getDispersion(col):
range = df[col].max() - df[col].min()
quantiles = df[col].quantile([0.25, 0.5, 0.75])
IQR = quantiles[0.75] - quantiles[0.25]
var = df[col].var()
std = df[col].std()
print('col:',col,
'\n\trange:', range,
'Q1:',quantiles[0.25],
'Q2:', quantiles[0.5],
'Q3:', quantiles[0.75],
'IQR:', IQR,
'var:', var,
'std:', std)
numericalcols = ['Views', 'Likes', 'Comments', 'Stream']
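The dispersion summaries below likely come from the same kind of loop:
for col in numericalcols:
    getDispersion(col)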
col: Views
range: 8079649362.0 Q1: 1826001.5 Q2: 14501095.0 Q3: 70399749.0
IQR: 68573747.5 var: 7.542950360937822e+16 std: 274644322.0046215
col: Likes
range: 50788652.0 Q1: 21581.0 Q2: 124481.0 Q3: 522148.0
IQR: 500567.0 var: 3201681265274.244 std: 1789324.2482217257
col: Comments
range: 16083138.0 Q1: 509.0 Q2: 3277.0 Q3: 14360.0
Correlation
df[numericalcols].corr()
3.1.2.1 Setup
import pandas as pd
import numpy as np
company_location company_size
0 DE L
1 US L
2 RU M
3 RU L
4 US S
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245 entries, 0 to 244
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 work_year 245 non-null object
1 experience_level 245 non-null object
2 employment_type 245 non-null object
3 job_title 245 non-null object
4 salary 245 non-null int64
5 salary_currency 245 non-null object
6 salary_in_usd 245 non-null int64
7 employee_residence 245 non-null object
8 remote_ratio 245 non-null int64
9 company_location 245 non-null object
10 company_size 245 non-null object
dtypes: int64(3), object(8)
memory usage: 21.2+ KB
df.describe()

df['work_year'].value_counts()

2021e 179
2020 66
Name: work_year, dtype: int64
df['experience_level'].value_counts()
MI 103
SE 77
EN 54
EX 11
Name: experience_level, dtype: int64
df['employment_type'].value_counts()

FT 231
PT 7
CT 4
FL 3
Name: employment_type, dtype: int64
df['company_location'].value_counts()
US 108
DE 19
IN 17
GB 16
FR 11
CA 11
...
CO 1
KE 1
HU 1
SG 1
MT 1
Name: company_location, dtype: int64
df['company_location'].unique()
array(['DE', 'US', 'RU', 'FR', 'AT', 'CA', 'UA', 'NG', 'IN', 'ES', 'PL',
'GB', 'PT', 'DK', 'SG', 'MX', 'TR', 'NL', 'AE', 'JP', 'CN', 'HU',
'KE', 'CO', 'NZ', 'IR', 'CL', 'PK', 'BE', 'GR', 'SI', 'BR', 'CH',
'IT', 'MD', 'LU', 'VN', 'AS', 'HR', 'IL', 'MT'], dtype=object)
df['company_location'].nunique()
41
nonnumericalcols = ['work_year',
'experience_level',
'employment_type',
'job_title',
'salary_currency',
'employee_residence',
'company_location',
'company_size']
df[nonnumericalcols].nunique()
work_year 2
experience_level 4
employment_type 4
job_title 43
salary_currency 15
employee_residence 45
company_location 41
company_size 3
dtype: int64
employment_type
FT 231
PT 7
CT 4
FL 3
salary_currency
USD 126
EUR 57
INR 21
GBP 13
CAD 10
TRY 3
PLN 2
HUF 2
SGD 2
MXN 2
DKK 2
BRL 2
CLP 1
JPY 1
CNY 1
company_size
L 132
S 58
M 55
col = 'salary_in_usd'
min = df[col].min()
max = df[col].max()
median = df[col].median()
mode = df[col].mode()[0]
midrange = (max - min)/2
print('col:',col,
'\n\tmin:', min,
'max:',max,
'median:', median,
'mode:', mode,
'midrange:', midrange)
col: salary_in_usd
min: 2876 max: 600000 median: 81000.0 mode: 150000 midrange: 298562.0
def getCentralTendency(col):
min = df[col].min()
max = df[col].max()
median = df[col].median()
mode = df[col].mode()[0]
midrange = (max - min)/2
print('col:',col,
'\n\tmin:', min,
'max:',max,
'median:', median,
'mode:', mode,
'midrange:', midrange)
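The summaries below come from applying the helper to each numerical column of this dataset; a sketch of the likely loop:
for col in ['salary', 'salary_in_usd', 'remote_ratio']:
    getCentralTendency(col)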
col: salary
min: 4000 max: 30400000 median: 103000.0 mode: 80000 midrange: 15198000.0
col: salary_in_usd
min: 2876 max: 600000 median: 81000.0 mode: 150000 midrange: 298562.0
col: remote_ratio
min: 0 max: 100 median: 100.0 mode: 100 midrange: 50.0
Dispersion
range, quantiles, var, std
col = 'salary'
range = df[col].max() - df[col].min()
quantiles = df[col].quantile([0.25, 0.5, 0.75])
IQR = quantiles[0.75] - quantiles[0.25]
var = df[col].var()
std = df[col].std()
print('col:',col,
'\n\trange:', range,
'Q1:',quantiles[0.25],
'Q2:', quantiles[0.5],
'Q3:', quantiles[0.75],
'IQR:', IQR,
'var:', var,
'std:', std)
col: salary
range: 30396000 Q1: 60000.0 Q2: 103000.0 Q3: 174000.0
IQR: 114000.0 var: 5181223548855.596 std: 2276230.117728784
def getDispersion(col):
range = df[col].max() - df[col].min()
quantiles = df[col].quantile([0.25, 0.5, 0.75])
IQR = quantiles[0.75] - quantiles[0.25]
var = df[col].var()
std = df[col].std()
print('col:',col,
'\n\trange:', range,
'Q1:',quantiles[0.25],
'Q2:', quantiles[0.5],
'Q3:', quantiles[0.75],
'IQR:', IQR,
'var:', var,
'std:', std)
numericalcols = ['salary', 'salary_in_usd', 'remote_ratio']
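The dispersion summaries below likely come from the same kind of loop:
for col in numericalcols:
    getDispersion(col)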
col: salary
range: 30396000 Q1: 60000.0 Q2: 103000.0 Q3: 174000.0
IQR: 114000.0 var: 5181223548855.596 std: 2276230.117728784
col: salary_in_usd
range: 597124 Q1: 45896.0 Q2: 81000.0 Q3: 130000.0
IQR: 84104.0 var: 7053199205.446571 std: 83983.32694914255
col: remote_ratio
range: 100 Q1: 50.0 Q2: 100.0 Q3: 100.0
IQR: 50.0 var: 1413.265306122449 std: 37.59342104840219
Correlation
df[numericalcols].corr()
CHAPTER 4
Data Visualization
There are several Python packages that are commonly used for data visualization,
including:
• Pandas: It integrates with other libraries such as Matplotlib and Seaborn, which allow you to generate various types of plots and visualizations that can help you understand the data and identify patterns and trends.
• Matplotlib: It is a 2D plotting library that provides a wide range of tools for
creating static, animated, and interactive visualizations. It is widely used as the
foundation for other libraries.
• Seaborn: It is a library built on top of Matplotlib that provides a higher-level
interface for creating more attractive and informative statistical graphics. It is
particularly useful for data visualization in statistics and data science.
• Plotly: It is a library for creating interactive and web-based visualizations and
provides a wide range of tools for creating plots, maps, and dashboards. It is
particularly useful for creating interactive visualizations that can be embedded
in web pages or apps.
• PyViz: It is a library that is composed of a set of libraries such as Holoviews,
Geoviews, Datashader and more, for creating visualizations for complex data
and large datasets.
DOI: 10.1201/9781003462781-4
Data visualization is a critical aspect of data analysis, and the Pandas library embraces
this need by providing built-in functionalities for it. In this section, we focus on
harnessing these built-in functionalities for data visualization. Whether you are new
to data visualization or looking for a quick and convenient way to explore your data,
Pandas provides a powerful toolset.
4.1.1.1 Setup
import numpy as np
import pandas as pd
df = pd.read_csv('/content/Economy_of_US.csv')
df
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 48 non-null int64
1 GDP_PPP 48 non-null float64
2 GDP_PerCapita_PPP 48 non-null float64
3 GDP_Nominal 48 non-null float64
4 GDP_PerCapita_Nominal 48 non-null float64
5 GDP_Growth 48 non-null float64
6 Inflation 48 non-null float64
7 Unemployment 48 non-null float64
8 Inflation_Change 47 non-null object
dtypes: float64(7), int64(1), object(1)
memory usage: 3.5+ KB
df.describe()
[Pandas line plots of the DataFrame columns, mostly with Year on the x-axis, appear here.]
4.1.1.6 Histograms
df['Inflation'].plot(kind = 'hist')
<Axes: ylabel='Frequency'>
df['Unemployment'].plot(kind = 'hist')
<Axes: ylabel='Frequency'>
df['Unemployment'].plot(kind = 'kde')
<Axes: ylabel='Density'>
df['Unemployment'].plot(kind = 'box')
<Axes: >
df['Inflation_Change'].value_counts()

Decrease 23
Increase 21
No change 3
Name: Inflation_Change, dtype: int64
df['Inflation_Change'].value_counts().plot(kind = 'pie')
<Axes: ylabel='Inflation_Change'>
4.1.1.10 Documentation
• You can find more details in the documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html
• Here is another useful reference: https://pandas.pydata.org/docs/user_guide/visualization.html
While the Pandas package provides certain support for basic data visualization, users
may need a more powerful tool to customize their plots. In this section, we delve
into Matplotlib, a versatile Python library that grants you complete control over
your visualizations. Whether you’re aiming for intricate, tailor-made plots or need to
visualize data in a highly specific way, Matplotlib offers the tools and flexibility to
bring your vision to life.
Our exploration begins with an introduction to Matplotlib and its capabilities. We’ll
guide you through the basics of creating plots, charts, and figures, emphasizing the
library’s flexibility in terms of customization. Matplotlib is renowned for its customiza-
tion options. We delve deep into the art of fine-tuning your visualizations. From
adjusting colors, markers, and line styles to controlling axis scales and annotations,
you’ll have the tools to craft visualizations that precisely convey your insights. Multi-
panel figures and subplots are essential when visualizing complex data. We explore
how Matplotlib allows you to create grids of subplots, enabling you to present multiple
views of your data in a single, coherent figure.
4.2.1.1 Setup
import matplotlib.pyplot as plt

df = pd.read_csv('/content/Economy_of_US.csv')
df
plt.plot(df['Year'], df['GDP_Growth'])
plt.xlabel('Year')
plt.ylabel('GDP Growth')
plt.title('Economy of US')
Change markers
plt.plot(df['Year'], df['GDP_Growth'], 'o')
plt.xlabel('Year')
plt.ylabel('GDP Growth')
plt.title('Economy of US')
markers = ['o', '*', '.', ',', 'x', 'X', '+', 'P', 's', 'D', 'd', 'p',
'H', 'h', 'v', '^', '<', '>', '1', '2', '3', '4', '|', '_']
for m in markers:
print(m)
plt.plot(df['Year'], df['GDP_Growth'], m)
plt.xlabel('Year')
plt.ylabel('GDP Growth')
plt.title('Economy of US')
plt.show()
Change color
color = ['r','g','b','c','m','y','k','w']
for c in color:
print(c)
c = 'o:' + c
plt.plot(df['Year'], df['GDP_Growth'], c)
plt.xlabel('Year')
plt.ylabel('GDP Growth')
plt.title('Economy of US')
plt.show()
plt.scatter(df['Year'], df['GDP_Growth'])
plt.xlabel('Year')
plt.ylabel('GDP Growth')
plt.title('Economy of US')
Colorbar
plt.scatter(df['Year'], df['GDP_Growth'], c=df['Inflation'], cmap='hot')
plt.colorbar()
plt.xlabel('Year')
plt.ylabel('GDP Growth')
plt.title('Economy of US')
Size
plt.scatter(df['Year'], df['GDP_Growth'], s= df['Unemployment']*1000)
plt.xlabel('Year')
plt.ylabel('GDP Growth')
plt.title('Economy of US')
plt.bar(df['Year'], df['GDP_Growth'])
plt.xlabel('Year')
plt.ylabel('GDP Growth')
plt.title('Economy of US')
plt.hist(df['GDP_Growth'])
plt.xlabel('GDP Growth')
plt.ylabel('Counts')
plt.title('Economy of US')
Decrease 23
Increase 21
No change 3
Name: Inflation_Change, dtype: int64
plt.pie(df['Inflation_Change'].value_counts(),
        labels = ['Decrease', 'Increase', 'No change'])
plt.legend()
plt.title('Economy of US')
plt.pie(df['Inflation_Change'].value_counts(),
        labels = ['Decrease', 'Increase', 'No change'], explode = [0.0, 0.2, 0])
plt.legend()
plt.title('Economy of US')
4.2.1.7 Multi-Plots
plt.plot(df['Year'], df['GDP_Growth'])
plt.plot(df['Year'], df['Inflation'])
plt.plot(df['Year'], df['Unemployment'])
plt.xlabel('Year')
plt.title('Economy of US')
Add legend
plt.plot(df['Year'], df['GDP_Growth'], label = 'GDP_Growth')
plt.plot(df['Year'], df['Inflation'], label = 'Inflation' )
plt.plot(df['Year'], df['Unemployment'], label = 'Unemployment')
plt.xlabel('Year')
plt.ylabel('Economy')
plt.grid()
plt.legend()
Change arrangement
plt.subplot(3, 1, 1)
plt.plot(df['Year'], df['GDP_Growth'], label = 'GDP_Growth')
plt.legend()
plt.xlabel('Year')
plt.ylabel('Economy')
plt.subplot(3, 1, 2)
plt.plot(df['Year'], df['Inflation'], label = 'Inflation' )
plt.legend()
plt.xlabel('Year')
plt.ylabel('Economy')
plt.subplot(3, 1, 3)
plt.plot(df['Year'], df['Unemployment'], label = 'Unemployment')
plt.legend()
plt.xlabel('Year')
plt.ylabel('Economy')
plt.grid()
plt.subplot(1, 3, 1)
plt.plot(df['Year'], df['GDP_Growth'], label = 'GDP_Growth')
plt.legend()
plt.xlabel('Year')
plt.ylabel('Economy')
plt.subplot(1, 3, 2)
plt.plot(df['Year'], df['Inflation'], label = 'Inflation' )
plt.legend()
plt.xlabel('Year')
plt.subplot(1, 3, 3)
plt.plot(df['Year'], df['Unemployment'], label = 'Unemployment')
plt.legend()
plt.xlabel('Year')
plt.grid()
4.3.1.1 Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
tips = sns.load_dataset('tips')
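As a minimal sketch, plots like the ones referenced below can be produced with Seaborn's figure-level functions; the specific columns and arguments here are illustrative assumptions, not the exact calls used for each figure:

# Illustrative Seaborn calls on the tips dataset (arguments are assumptions)
sns.relplot(data=tips, x='total_bill', y='tip', hue='smoker', size='size')       # relational plot with size differentiation
sns.displot(data=tips, x='total_bill', hue='sex', kind='kde', multiple='stack')  # stacked KDE with gender differentiation
sns.displot(data=tips, x='total_bill', y='tip', kind='kde', rug=True)            # two-attribute KDE with rug
sns.jointplot(data=tips, x='total_bill', y='tip')                                # joint distribution plot
sns.pairplot(tips)                                                               # pairwise relationships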
Figure 4.76 A Default Relational Plot with Size Differentiation and Different Dot-Sizes
Figure 4.78 A Default Relational Plot with Large Size Differentiation and Transparency
Figure 4.88 A KDE Distribution Plot with Gender Differentiation and Stacking
Figure 4.89 A KDE Distribution Plot with Gender Differentiation, Stacking in Multi-columns
Figure 4.91 A KDE Distribution Plot with Two Attributes and Gender Differentiation
Figure 4.92 A KDE Distribution Plot with Two Attributes and Rug
CHAPTER 5
Data Preprocessing
There are several Python packages that are commonly used for data preprocessing,
including:
• Pandas: It is a library for working with data in a tabular format and provides
a wide range of tools for reading, writing, and manipulating data, such as
DataFrame and series, as well as handling missing values.
• NumPy: It is a library for working with arrays and matrices of numerical data. It
provides a wide range of mathematical and statistical functions and is commonly
used as the foundation for other libraries.
• Scikit-learn: It is a library for machine learning in Python and provides a wide
range of tools for preprocessing data, such as feature scaling, normalization, and
one-hot encoding.
• NLTK: It is a library for natural language processing and provides a wide range of
tools for text preprocessing, such as tokenization, stemming, and lemmatization.
• SciPy: It is a library for scientific computing in Python and provides a wide
range of tools for data preprocessing, such as interpolation and smoothing.
Missing values are a common challenge that must be addressed to ensure the accuracy
and reliability of your results. This section is dedicated to understanding the importance
of handling missing values and equipping you with the knowledge to effectively
tance of handling missing values and equipping you with the knowledge to effectively
manage them, whether by dropping them or filling them with appropriate values. We
explore several approaches using Pandas to effectively manage missing data:
import numpy as np
import pandas as pd
df = pd.read_csv('/content/Economy_of_US_na.csv')
df
df.isnull()
df2 = df.dropna()
df2
df2 = df.dropna(axis=1)
df2
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
df2 = df.dropna(thresh=3)
df2
df2 = df.dropna(thresh=2)
df2
df2 = df.dropna(thresh=1)
df2
Year GDP_Nominal
0 1980.0 2857.3
1 1981.0 3207.0
2 1982.0 3343.8
3 1983.0 NaN
4 1984.0 4037.7
5 1985.0 4339.0
6 1986.0 NaN
7 1987.0 4855.3
8 1988.0 5236.4
9 1989.0 NaN
10 1990.0 5963.1
11 NaN NaN
12 1992.0 6520.3
Year
0 1980.0
1 1981.0
2 1982.0
3 1983.0
4 1984.0
5 1985.0
6 1986.0
7 1987.0
8 1988.0
9 1989.0
10 1990.0
11 NaN
12 1992.0
df2 = df.fillna('NA')
df2
df['Year_filled'] = df['Year'].fillna('YEAR')
df
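The forward-filled and backward-filled columns shown in the next two tables can be produced along the following lines (a sketch; the column names are taken from the output):

# Forward fill: propagate the last valid observation downward
df['GDP_filled_ffill'] = df['GDP_Nominal'].fillna(method = 'ffill')
# Backward fill: propagate the next valid observation upward
df['GDP_filled_bfill'] = df['GDP_Nominal'].fillna(method = 'bfill')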
GDP_Nominal GDP_filled_ffill
0 2857.3 2857.3
1 3207.0 3207.0
2 3343.8 3343.8
3 NaN 3343.8
4 4037.7 4037.7
5 4339.0 4339.0
6 NaN 4339.0
7 4855.3 4855.3
8 5236.4 5236.4
9 NaN 5236.4
10 5963.1 5963.1
11 NaN 5963.1
12 6520.3 6520.3
GDP_Nominal GDP_filled_bfill
0 2857.3 2857.3
1 3207.0 3207.0
2 3343.8 3343.8
3 NaN 4037.7
4 4037.7 4037.7
5 4339.0 4339.0
6 NaN 4855.3
7 4855.3 4855.3
8 5236.4 5236.4
9 NaN 5963.1
10 5963.1 5963.1
11 NaN 6520.3
12 6520.3 6520.3
df['GDP_Nominal_filled_mean'] = df['GDP_Nominal'].fillna(df['GDP_Nominal'].mean())
df[['GDP_Nominal', 'GDP_Nominal_filled_mean']]
GDP_Nominal GDP_Nominal_filled_mean
0 2857.3 2857.300000
1 3207.0 3207.000000
2 3343.8 3343.800000
3 NaN 4484.433333
4 4037.7 4037.700000
5 4339.0 4339.000000
6 NaN 4484.433333
7 4855.3 4855.300000
8 5236.4 5236.400000
9 NaN 4484.433333
10 5963.1 5963.100000
11 NaN 4484.433333
12 6520.3 6520.300000
df['GDP_Nominal_filled_mode'] = df['GDP_Nominal'].fillna(df['GDP_Nominal'].mode()[0])
df[['GDP_Nominal', 'GDP_Nominal_filled_mode']]
GDP_Nominal GDP_Nominal_filled_mode
0 2857.3 2857.3
1 3207.0 3207.0
2 3343.8 3343.8
3 NaN 2857.3
4 4037.7 4037.7
5 4339.0 4339.0
6 NaN 2857.3
7 4855.3 4855.3
8 5236.4 5236.4
9 NaN 2857.3
10 5963.1 5963.1
11 NaN 2857.3
12 6520.3 6520.3
5.1.1.9 Summary
df['GDP_Growth_fill_NA'] = df['GDP_Growth'].fillna('NA')
df['GDP_Growth_fill_0'] = df['GDP_Growth'].fillna(0)
df['GDP_Growth_fill_ffill'] = df['GDP_Growth'].fillna(method = 'ffill')
df['GDP_Growth_fill_bfill'] = df['GDP_Growth'].fillna(method = 'bfill')
df['GDP_Growth_fill_mean'] = df['GDP_Growth'].fillna(df['GDP_Growth'].mean())
df['GDP_Growth_fill_mode'] = df['GDP_Growth'].fillna(df['GDP_Growth'].mode()[0])
df[['GDP_Growth', 'GDP_Growth_fill_NA', 'GDP_Growth_fill_0',
    'GDP_Growth_fill_ffill', 'GDP_Growth_fill_bfill',
    'GDP_Growth_fill_mean', 'GDP_Growth_fill_mode']]
In the world of data analysis, outliers are data points that deviate significantly
from the typical distribution of a dataset. Detecting outliers is a crucial step in
data preprocessing and analysis, as these unusual data points can distort statistical
measures and lead to inaccurate insights. In this section, we explore two fundamental
concepts for identifying outliers.
We begin by introducing the Interquartile Range (IQR) as a robust measure of data
spread. IQR analysis provides an effective method for identifying outliers by focusing
on the middle 50% of the data. You will learn how to calculate the IQR and define
bounds to identify potential outliers that fall outside this range. Practical exercises will
guide you in applying IQR analysis to your datasets, ensuring accurate identification
of outliers.
We follow by introducing a statistical understanding of data distribution; this concept
allows you to identify outliers by examining how data points deviate from expected
distribution patterns. You will explore various visualization techniques and statistical
tests to detect outliers. These methods include visualizing data distributions, applying
statistical tests like the Z-score, and interpreting statistical measures such as skewness
and kurtosis.
More sophisticated outlier detection methods will be covered in Chapter 10.
5.2.1.1 Setup
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('/content/Nov2Temp.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118 entries, 0 to 117
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 high 118 non-null int64
1 low 118 non-null int64
dtypes: int64(2)
memory usage: 2.0 KB
df.describe()
high low
count 118.000000 118.000000
mean 56.830508 29.262712
std 17.205796 12.877084
min 15.000000 -33.000000
25% 48.250000 24.000000
50% 57.500000 31.000000
75% 66.750000 36.750000
max 127.000000 54.000000
df.shape
(118, 2)
df['low'].hist()
plt.boxplot(df['low'])
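The limits printed next follow the usual 1.5 × IQR rule; a sketch of the computation (the variable names are assumed from the print statements below):

# 1.5 * IQR rule applied to the 'low' column
Q1 = df['low'].quantile(0.25)
Q3 = df['low'].quantile(0.75)
IQR = Q3 - Q1
low_low_limit = Q1 - 1.5 * IQR    # 24.0 - 19.125 = 4.875
low_high_limit = Q3 + 1.5 * IQR   # 36.75 + 19.125 = 55.875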
print(low_low_limit)
4.875
print(low_high_limit)
55.875
high low
41 41 -2
79 18 -1
109 48 -11
110 43 -21
111 64 -33
Empty DataFrame
Columns: [high, low]
Index: []
plt.boxplot(df['low'])
5.2.1.6 Practice
Let's do the same thing for df['high'].
5.2.2.1 Setup
import numpy as np
import pandas as pd
df = pd.read_csv('/content/Nov2Temp.csv')
df
high low
0 58 25
1 26 11
2 53 24
3 60 37
4 67 42
.. ... ...
113 119 33
114 127 27
115 18 38
116 15 51
117 30 49
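The rows listed next are those whose 'low' value lies far from the mean in standard-deviation terms; a sketch of such a filter (the threshold of three standard deviations is an assumption consistent with the rows shown):

# Flag rows whose 'low' value has an absolute z-score above 3
z = (df['low'] - df['low'].mean()) / df['low'].std()
df[np.abs(z) > 3]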
high low
109 48 -11
110 43 -21
111 64 -33
df['low'].plot(kind='box')
df['low'].plot(kind = 'box')
5.2.2.4 Practice
Play with df['high'].
In the field of data analysis, dealing with large and complex datasets is a common
challenge. Data reduction techniques offer practical solutions to handle such datasets
effectively. This section introduces two fundamental concepts for data reduction:
Dimension elimination and data sampling.
Dimensionality reduction is a crucial technique for simplifying complex datasets by
reducing the number of features or variables while retaining essential information.
This concept aims to improve computational efficiency, reduce noise, and enhance
the interpretability of data. Here we will learn basic dimension elimination. We will
learn advanced techniques such as Principal Component Analysis (PCA) and Feature
Selection in Chapter 8.
Data sampling involves the selection of a subset of data points from a larger dataset.
This approach is valuable for reducing the overall dataset size while retaining its
statistical characteristics and patterns. Data sampling is particularly useful when
working with extensive datasets, as it can significantly improve analysis efficiency.
5.3.1.1 Setup
import numpy as np
import pandas as pd
df = pd.read_csv('/content/sample_data/california_housing_train.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 17000 non-null float64
1 latitude 17000 non-null float64
2 housing_median_age 17000 non-null float64
3 total_rooms 17000 non-null float64
4 total_bedrooms 17000 non-null float64
5 population 17000 non-null float64
6 households 17000 non-null float64
7 median_income 17000 non-null float64
8 median_house_value 17000 non-null float64
dtypes: float64(9)
memory usage: 1.2 MB
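The reduced summaries that follow drop columns that are not needed for the analysis; a minimal sketch of the first elimination (dropping the two geographic columns, as implied by the seven remaining columns):

# Eliminate longitude and latitude, keeping the remaining seven attributes
df_reduced = df.drop(columns=['longitude', 'latitude'])
df_reduced.info()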
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 housing_median_age 17000 non-null float64
1 total_rooms 17000 non-null float64
2 total_bedrooms 17000 non-null float64
3 population 17000 non-null float64
4 households 17000 non-null float64
5 median_income 17000 non-null float64
6 median_house_value 17000 non-null float64
dtypes: float64(7)
memory usage: 929.8 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 housing_median_age 17000 non-null float64
1 total_rooms 17000 non-null float64
2 total_bedrooms 17000 non-null float64
3 population 17000 non-null float64
4 households 17000 non-null float64
5 median_income 17000 non-null float64
6 median_house_value 17000 non-null float64
dtypes: float64(7)
memory usage: 929.8 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 4 columns):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 housing_median_age 17000 non-null float64
1 total_rooms 17000 non-null float64
2 total_bedrooms 17000 non-null float64
3 population 17000 non-null float64
4 households 17000 non-null float64
dtypes: float64(5)
memory usage: 664.2 KB
5.3.2.1 Setup
import numpy as np
import pandas as pd
df = pd.read_csv('/content/sample_data/california_housing_train.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 17000 non-null float64
1 latitude 17000 non-null float64
2 housing_median_age 17000 non-null float64
3 total_rooms 17000 non-null float64
4 total_bedrooms 17000 non-null float64
5 population 17000 non-null float64
6 households 17000 non-null float64
7 median_income 17000 non-null float64
8 median_house_value 17000 non-null float64
dtypes: float64(9)
memory usage: 1.2 MB
Sampling by numbers
df.sample(n=5)
df.sample(frac=0.001)
df.loc[:10].sample(frac=0.9)
df.loc[:10].sample(15,replace=True)
In the realm of data analysis, preparing and transforming data to ensure it is suitable
for analysis is a critical step. This section introduces two fundamental concepts for
data preprocessing: Data scaling and data discretization.
Data scaling is the process of transforming data into a consistent range to ensure
that no single feature disproportionately influences an analysis. This concept is vital
for algorithms that rely on distance calculations or gradient descent, as well as for
visualizing data with varying scales. Within this section, you will explore various data
scaling methods, including Min-Max Scaling (Normalization), Z-Score Standardization,
and Robust Scaling. These methods allow you to rescale data to specific ranges or
standardize it to a mean of zero and a standard deviation of one, making it more
amenable to analysis.
Data discretization involves the transformation of continuous data into discrete
intervals or categories. This technique is beneficial for simplifying complex data,
reducing noise, and making data more interpretable. Discretization can be based
on statistical measures like quartiles or domain knowledge. In this section, you will
explore techniques for data discretization, including Equal Width Binning and Equal
Frequency Binning. These methods enable you to partition continuous data into
predefined intervals, allowing you to study data patterns and relationships more
effectively.
5.4.1.1 Setup
import numpy as np
import pandas as pd
df = pd.read_csv('/content/sample_data/california_housing_train.csv')
df.head()
df.describe()
median_house_value
count 17000.000000
mean 207300.912353
std 115983.764387
min 14999.000000
25% 119400.000000
50% 180400.000000
75% 265000.000000
max 500001.000000
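The 'popular' and 'rooms' labels shown next come from apply-based discretizations whose exact rules are not reproduced here; the following is a purely illustrative sketch in which both the source column and the cutoff are assumptions:

# Hypothetical binary discretization; the column and cutoff are chosen for illustration only
cutoff = df['population'].quantile(0.75)
df['popular'] = df['population'].apply(lambda p: 'popular' if p > cutoff else 'not popular')
df['popular'].value_counts()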
0 not popular
1 not popular
2 not popular
3 not popular
4 not popular
...
16995 not popular
16996 not popular
16997 not popular
16998 not popular
16999 not popular
Name: popular, Length: 17000, dtype: object
df['popular'].value_counts()
0 HH
1 HH
2 LL
3 0
4 0
..
16995 HL
16996 0
16997 0
16998 0
16999 0
Name: rooms, Length: 17000, dtype: object
df['rooms'].value_counts()
0 7970
LL 3424
HH 3394
HL 1110
LH 1102
Name: rooms, dtype: int64
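The house_value function applied next is defined along the following lines; the thresholds here are assumptions based on the 25th and 75th percentiles of median_house_value, which is consistent with the category counts shown below:

# A hypothetical three-way categorization of median_house_value (thresholds assumed)
def house_value(v):
    if v < 119400:        # roughly the 25th percentile
        return 'Low'
    elif v > 265000:      # roughly the 75th percentile
        return 'High'
    else:
        return 'Medium'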
df['house_value_category'] = df['median_house_value'].apply(house_value)
df['house_value_category']
0 Low
1 Low
2 Low
3 Low
4 Low
...
16995 Low
16996 Low
16997 Low
16998 Low
16999 Low
Name: house_value_category, Length: 17000, dtype: object
df['house_value_category'].value_counts()
Medium 8510
High 4247
Low 4243
Name: house_value_category, dtype: int64
5.4.2.1 Setup
import numpy as np
import pandas as pd
df = pd.read_csv('/content/sample_data/california_housing_train.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 17000 non-null float64
1 latitude 17000 non-null float64
2 housing_median_age 17000 non-null float64
3 total_rooms 17000 non-null float64
4 total_bedrooms 17000 non-null float64
5 population 17000 non-null float64
6 households 17000 non-null float64
7 median_income 17000 non-null float64
8 median_house_value 17000 non-null float64
dtypes: float64(9)
memory usage: 1.2 MB
df['population'].describe()
count 17000.000000
mean 1429.573941
std 1147.852959
min 3.000000
25% 790.000000
50% 1167.000000
75% 1721.000000
max 35682.000000
Name: population, dtype: float64
$$v_i' = \frac{v_i - \min}{\max - \min}\,(\max' - \min') + \min'$$

where $v_i$ is the current value, $\min$ and $\max$ are the current minimum and maximum, $\min'$ and $\max'$ are the new boundaries, and $v_i'$ is the min-max scaled value.

Normally, we use the special case of $[0, 1]$ as the new scale; in this case, the formula simplifies to:

$$v_i' = \frac{v_i - \min}{\max - \min}$$
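The population_MinMax column listed next follows directly from this formula; a minimal sketch:

# Min-max scaling of population to the [0, 1] range
p = df['population']
df['population_MinMax'] = (p - p.min()) / (p.max() - p.min())
df['population_MinMax']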
0 0.028364
1 0.031559
2 0.009249
3 0.014350
4 0.017405
...
16995 0.025337
16996 0.033381
16997 0.034782
16998 0.036296
16999 0.022506
Name: population_MinMax, Length: 17000, dtype: float64
df['population_MinMax'].describe()
count 17000.000000
mean 0.039984
std 0.032172
min 0.000000
25% 0.022058
50% 0.032624
75% 0.048152
max 1.000000
Name: population_MinMax, dtype: float64
$$v_i' = \frac{v_i - \mathrm{mean}}{\mathrm{std}}$$

where $v_i$ is the current value, mean and std are the current mean and standard deviation, and $v_i'$ is the Z-score scaled value.
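A corresponding sketch for the population_Z column shown next:

# Z-score standardization of population
p = df['population']
df['population_Z'] = (p - p.mean()) / p.std()
df['population_Z']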
0 -0.361173
1 -0.261858
2 -0.955326
3 -0.796769
4 -0.701809
...
16995 -0.455262
16996 -0.205230
16997 -0.161670
16998 -0.114626
16999 -0.543252
Name: population_Z, Length: 17000, dtype: float64
df['population_Z'].describe()
count 1.700000e+04
mean 6.687461e-17
std 1.000000e+00
min -1.242819e+00
25% -5.571915e-01
50% -2.287522e-01
75% 2.538880e-01
max 2.984043e+01
Name: population_Z, dtype: float64
35682.0
df['population_decimal'] = df['population']/100000
df['population_decimal']
0 0.01015
1 0.01129
2 0.00333
3 0.00515
4 0.00624
...
16995 0.00907
16996 0.01194
16997 0.01244
16998 0.01298
16999 0.00806
Name: population_decimal, Length: 17000, dtype: float64
df['population_decimal'].describe()
count 17000.000000
mean 0.014296
std 0.011479
min 0.000030
25% 0.007900
50% 0.011670
75% 0.017210
max 0.356820
Name: population_decimal, dtype: float64
df[['population','population_MinMax',
'population_Z', 'population_decimal']].describe()
In the field of data management and analytics, a Data Warehouse serves as a central
repository for storing, organizing, and retrieving large volumes of data from various
sources. This section introduces the concept of a Data Warehouse, including the
essential components of Data Cubes and their dimensions, as well as the versatile
built-in PivotTable tool within Pandas.
A Data Warehouse is a dedicated storage system designed to consolidate data from
multiple sources, making it accessible for analytical purposes. This centralized repository
ensures data consistency and provides a platform for efficient querying and
reporting. As the bricks of a Data Warehouse, Data Cubes are multidimensional data
structures that allow you to store and analyze data in a way that provides different
perspectives or dimensions. Each dimension represents a characteristic or attribute of
the data, creating a comprehensive view for analysis. Within this section, you will
explore the concept of Data Cubes and their dimensions. You’ll learn how to structure
data into cubes to facilitate multidimensional analysis and gain insights from complex
datasets.
Pandas includes a built-in PivotTable tool. This tool enables you to summarize,
analyze, and present data in a dynamic and interactive format, all within the Pandas
framework. You will discover how to use Pandas to create PivotTables, arrange data
fields, and apply filters and calculations to gain valuable insights from your data. This
practical skill is invaluable for data analysts and professionals who need to present
data in a meaningful and customizable manner.
import pandas as pd
import atoti as tt
df = pd.read_csv('/content/Spotify_Youtube_Sample.csv')
df.head()
Artist Track \
0 Gorillaz Feel Good Inc.
1 Gorillaz Rhinestone Eyes
2 Gorillaz New Gold (feat. Tame Impala and Bootie Brown)
3 Gorillaz On Melancholy Hill
4 Gorillaz Clint Eastwood
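The read_csv call below is issued on an atoti session; a minimal sketch of starting one (configuration options omitted, and the constructor call is an assumption about the atoti version in use):

# Start an in-memory atoti session
session = tt.Session()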
views = session.read_csv(
'/content/Spotify_Youtube_Sample.csv',
keys = ['Artist', 'Track', 'Album'],
)
views.head()
Album_type \
Artist Track Album
Ryan Castro Wasa Wasa Wasa Wasa single
Avemaría Avemaría single
Omar Apollo Invincible (feat. Daniel Caesar) Ivory album
Useless Apolonio album
Endlessly Ivory (Marfil) album
Views \
Artist Track Album
Ryan Castro Wasa Wasa Wasa Wasa 115121545.0
Avemaría Avemaría 10838443.0
Omar Apollo Invincible (feat. Daniel Caesar) Ivory 1967236.0
Useless Apolonio 469551.0
Endlessly Ivory (Marfil) 210243.0
Likes \
Artist Track Album
Ryan Castro Wasa Wasa Wasa Wasa 761203.0
Avemaría Avemaría 96423.0
Omar Apollo Invincible (feat. Daniel Caesar) Ivory 38113.0
Useless Apolonio 13611.0
Endlessly Ivory (Marfil) 4704.0
Comments \
Artist Track Album
Ryan Castro Wasa Wasa Wasa Wasa 17238.0
Avemaría Avemaría 6616.0
Omar Apollo Invincible (feat. Daniel Caesar) Ivory 764.0
Useless Apolonio 405.0
Endlessly Ivory (Marfil) 123.0
Licensed \
Artist Track Album
Ryan Castro Wasa Wasa Wasa Wasa True
Avemaría Avemaría False
Omar Apollo Invincible (feat. Daniel Caesar) Ivory True
Useless Apolonio True
Endlessly Ivory (Marfil) True
official_video \
Artist Track Album
Ryan Castro Wasa Wasa Wasa Wasa True
Avemaría Avemaría True
Omar Apollo Invincible (feat. Daniel Caesar) Ivory True
Useless Apolonio True
Endlessly Ivory (Marfil) True
Stream
Artist Track Album
Ryan Castro Wasa Wasa Wasa Wasa 96300795.0
Avemaría Avemaría 9327917.0
Omar Apollo Invincible (feat. Daniel Caesar) Ivory 29596755.0
Useless Apolonio 25646394.0
Endlessly Ivory (Marfil) 10150327.0
m= cube.measures
m
l = cube.levels
l
Views.SUM
0 1,902,053,002,307.00
1D
cube.query(m['Views.SUM'], levels = [l['Artist']])
Views.SUM
Artist
$NOT 110,784,903.00
$uicideboy$ 334,135,108.00
(G)I-DLE 1,754,953,941.00
*NSYNC 1,027,832,862.00
070 Shake 96,099,359.00
... ...
will.i.am 2,831,320,166.00
Ángela Aguilar 1,385,295,291.00
Ñejo 626,680,824.00
Ñengo Flow 812,726,315.00
Øneheart 34,623,310.00
Views.SUM
Album
!Volare! The Very Best of the Gipsy Kings 5,760,198.00
"Awaken, My Love!" 694,453,372.00
"Heroes" (2017 Remaster) 29,328,667.00
"Let Go" Dj Pack 56.00
"Let's Rock" 14,005,512.00
... ...
2D
Views.SUM
Album Artist
!Volare! The Very Best of the Gipsy Kings Gipsy Kings 5,760,198.00
"Awaken, My Love!" Childish Gambino 694,453,372.00
"Heroes" (2017 Remaster) David Bowie 29,328,667.00
"Let Go" Dj Pack Dina Rae 56.00
"Let's Rock" The Black Keys 14,005,512.00
... ...
2D with slicing
cube.query(m['Views.SUM'], levels = [l['Album'], l['Track']],
filter=l['Artist'] == 'The Beatles')
cube.query(m['Views.SUM'], levels = [l['Album'], l['Track']],
filter=l['Artist'] == 'Michael Jackson')
Views.SUM
Album Track
"Miguel" Te Amaré 30,083,671.00
...
#1s ... and then some Brand New Man 33,246.00
$outh $ide $uicide Cold Turkey 214,405.00
Muddy Blunts 31,879.00
... ...
3D
cube.query(m['Views.SUM']
, levels = [l['Album'], l['Track'], l['official_video']])
3D with slicing
cube.query(m['Views.SUM']
, levels = [l['Album'], l['Track'], l['official_video']]
,filter=l['Artist'] == 'Michael Jackson')
Exercise: Experiment with other measures of interest and try different dimensions of the cube.
5.5.2.1 Setup
import numpy as np
import pandas as pd
df = pd.read_csv('/content/Spotify_Youtube_Sample.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20718 entries, 0 to 20717
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Artist 20718 non-null object
1 Track 20718 non-null object
2 Album 20718 non-null object
3 Album_type 20718 non-null object
4 Views 20248 non-null float64
5 Likes 20177 non-null float64
6 Comments 20149 non-null float64
7 Licensed 20248 non-null object
8 official_video 20248 non-null object
9 Stream 20142 non-null float64
dtypes: float64(4), object(6)
memory usage: 1.6+ MB
df['Artist'].unique()[:100]
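The per-artist, per-album summaries that follow come from a Pandas PivotTable; a sketch of its construction (the mean aggregation is an assumption consistent with the fractional values shown):

# Pivot the raw table into mean Comments/Likes/Views per (Artist, Album)
df_pivot = pd.pivot_table(df, values=['Comments', 'Likes', 'Views'],
                          index=['Artist', 'Album'], aggfunc='mean')
df_pivot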
Comments Likes \
Artist Album
$NOT - TRAGEDY + 3404.0 165966.666667
Beautiful Havoc 13900.5 371387.500000
EAT YOUR HEART OUT 735.0 19033.000000
Ethereal 8183.0 388334.000000
Fast & Furious: Drift Tape (Phonk Vol 1) 32.0 1725.000000
... ... ...
Øneheart snowfall (Slowed + Reverb) 11423.0 561165.000000
snowfall (Sped Up) 1361.0 66128.000000
this feeling 516.0 32838.000000
watching the stars 216.0 13429.000000
watching the stars (Remixes) 16.0 2145.000000
Views
Artist Album
$NOT - TRAGEDY + 9158825.0
Beautiful Havoc 14683693.5
EAT YOUR HEART OUT 681136.0
Ethereal 10114989.0
Fast & Furious: Drift Tape (Phonk Vol 1) 76559.0
... ...
Øneheart snowfall (Slowed + Reverb) 15361992.0
snowfall (Sped Up) 1707355.0
this feeling 856049.0
watching the stars 323775.0
watching the stars (Remixes) 139020.0
df_pivot.loc['Michael Jackson']
Comments \
Artist Album
Michael Jackson Bad (Remastered) 60358.50
Dangerous 127080.00
HIStory - PAST, PRESENT AND FUTURE - BOOK I 335112.00
Off the Wall 88325.00
Thriller 383891.50
XSCAPE 27103.50
Likes \
Artist Album
Michael Jackson Bad (Remastered) 1594227.0
Dangerous 3085718.0
HIStory - PAST, PRESENT AND FUTURE - BOOK I 8312571.0
Off the Wall 2262080.0
Thriller 8089856.5
XSCAPE 1064408.0
The Beatles 1 (Remastered) 368291.5
Abbey Road (Remastered) 728453.0
Help! (Remastered) 475089.0
Let It Be (Remastered) 1075941.0
Please Please Me (Remastered) 490790.0
Rubber Soul (Remastered) 463315.0
The Beatles (Remastered) 365297.0
Beyoncé 4 2623723.5
BEYONCÉ [Platinum Edition] 2906283.0
Dangerously In Love 3218858.0
I AM...SASHA FIERCE 6931695.0
Perfect Duet (Ed Sheeran & Beyoncé) 1998224.0
RENAISSANCE 112728.5
Views
Artist Album
Michael Jackson Bad (Remastered) 2.604891e+08
Dangerous 5.040573e+08
HIStory - PAST, PRESENT AND FUTURE - BOOK I 9.786800e+08
Off the Wall 3.254739e+08
Thriller 1.103212e+09
XSCAPE 1.719371e+08
The Beatles 1 (Remastered) 4.284411e+07
Abbey Road (Remastered) 8.114310e+07
Help! (Remastered) 4.285668e+07
Let It Be (Remastered) 1.319251e+08
Please Please Me (Remastered) 5.322836e+07
Rubber Soul (Remastered) 6.322312e+07
The Beatles (Remastered) 3.385295e+07
Beyoncé 4 5.347308e+08
BEYONCÉ [Platinum Edition] 6.870312e+08
Dangerously In Love 6.690844e+08
I AM...SASHA FIERCE 1.357274e+09
Comments \
Artist Album
Daddy Yankee VIDA 4252791.0
Charlie Puth See You Again (feat. Charlie Puth) 2127346.0
Wiz Khalifa See You Again (feat. Charlie Puth) 2127345.0
Mark Ronson Uptown Special 598916.0
... ...
Camila Resistiré 1.0
Peter Groeger Der Kaiser von Dallas (Die einzige Wahrheit übe... 0.0
Deep Purple Machine Head (2016 Version) 0.0
Christian Rode Auf dem hohen Küstensande (Von Meer und Strand ... 0.0
Maroon 5 Hands All Over 0.0
Likes \
Artist Album
Daddy Yankee VIDA 50788626.0
Charlie Puth See You Again (feat. Charlie Puth) 40147674.0
Wiz Khalifa See You Again (feat. Charlie Puth) 40147618.0
Mark Ronson Uptown Special 20067879.0
... ...
Camila Resistiré 9.0
Peter Groeger Der Kaiser von Dallas (Die einzige Wahrheit übe... 0.0
Deep Purple Machine Head (2016 Version) 1.0
Christian Rode Auf dem hohen Küstensande (Von Meer und Strand ... 0.0
Maroon 5 Hands All Over 0.0
Views
Artist Album
Daddy Yankee VIDA 8.079647e+09
Charlie Puth See You Again (feat. Charlie Puth) 5.773798e+09
Wiz Khalifa See You Again (feat. Charlie Puth) 5.773797e+09
Mark Ronson Uptown Special 4.821016e+09
... ...
Camila Resistiré 4.900000e+01
Peter Groeger Der Kaiser von Dallas (Die einzige Wahrheit übe... 3.688889e+01
Deep Purple Machine Head (2016 Version) 3.100000e+01
Christian Rode Auf dem hohen Küstensande (Von Meer und Strand ... 2.800000e+01
Maroon 5 Hands All Over 2.600000e+01
Comments Likes \
Artist Album
Øneheart snowfall 11423.0 561165.000000
snowfall (Slowed + Reverb) 11423.0 561165.000000
Views
Artist Album
Øneheart snowfall 15361992.0
snowfall (Slowed + Reverb) 15361992.0
snowfall (Sped Up) 1707355.0
this feeling 856049.0
apathy 597163.0
... ...
$NOT Ethereal 10114989.0
- TRAGEDY + 9158825.0
SIMPLE 1967700.0
EAT YOUR HEART OUT 681136.0
Fast & Furious: Drift Tape (Phonk Vol 1) 76559.0
Comments Likes \
Artist Album
Coldplay Memories...Do Not Open 270444.0 10282499.0
A Head Full of Dreams 377666.0 13515772.0
Mylo Xyloto 343020.0 8497224.0
A Rush of Blood to the Head 124357.0 5532787.0
Viva La Vida or Death and All His Friends 261790.0 4370461.0
Ghost Stories 79974.0 3741300.0
X&Y 114460.0 2962029.0
Parachutes 59966.5 2694138.0
Views
Artist Album
Coldplay Memories...Do Not Open 2.118019e+09
A Head Full of Dreams 1.828242e+09
Mylo Xyloto 1.665814e+09
A Rush of Blood to the Head 1.082588e+09
Viva La Vida or Death and All His Friends 7.895815e+08
Ghost Stories 7.864046e+08
X&Y 5.662392e+08
Parachutes 4.528667e+08
Music Of The Spheres 2.546560e+08
The Beatles Let It Be (Remastered) 1.319251e+08
Abbey Road (Remastered) 8.114310e+07
Rubber Soul (Remastered) 6.322312e+07
Please Please Me (Remastered) 5.322836e+07
Help! (Remastered) 4.285668e+07
1 (Remastered) 4.284411e+07
The Beatles (Remastered) 3.385295e+07
II
Data Analysis
CHAPTER 6
Classification
There are many different classification methods we use with Scikit-learn, but some of
the most common include:
• Logistic Regression: A linear model that is often used for binary classification,
where the goal is to predict one of two possible classes.
• Decision Trees: A tree-based model that uses a series of if-then rules to make
predictions.
• Random Forest: An ensemble method that combines many decision trees to
improve the accuracy of predictions.
• Naive Bayes: A probabilistic model that makes predictions based on the proba-
bility of each class given the input features.
• Support Vector Machines (SVMs): A linear model that finds the best boundary
between classes by maximizing the margin between them.
• K-Nearest Neighbors: A simple method that uses the k closest labeled examples
to the input in question to make a prediction.
• Gradient Boosting: An ensemble method that combines many weak models to
improve the accuracy of predictions.
In the field of machine learning and pattern recognition, Nearest Neighbor Classifiers
are fundamental algorithms that leverage the proximity of data points to
make predictions or classifications. This section introduces two essential Nearest
Neighbor Classifiers, K-Nearest Neighbors (KNN) and Radius Neighbors (RNN), and
demonstrates their practical implementation using the Scikit-learn package. K-Nearest
Neighbors (KNN) is a supervised machine learning algorithm used for classification
and regression tasks. It operates on the principle that data points with similar features
tend to belong to the same class or category. Throughout this section, you will explore
the KNN algorithm’s core concepts and practical implementation using Scikit-learn.
Radius Neighbors (RNN) is an extension of the KNN algorithm that focuses on data
points within a specific radius or distance from a query point. This approach is useful
when you want to identify data points that are similar to a given reference point.
Within this section, you will delve into the RNN algorithm’s fundamental concepts
and practical implementation using Scikit-learn.
6.1.1.1 Setup
Environment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
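The DataFrame used in this and the later Iris-based setup sections is built from scikit-learn's bundled Iris data; a minimal sketch (the column names follow the df.info() output below):

# Build the Iris DataFrame: four measurements plus an integer-coded species label
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=['Sepal length', 'Sepal width',
                                      'Petal length', 'Petal width'])
df['Species'] = iris.target   # 0, 1, 2 encode the three species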
df = df[df['Species'] != 0]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 50 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal length 100 non-null float64
1 Sepal width 100 non-null float64
2 Petal length 100 non-null float64
3 Petal width 100 non-null float64
4 Species 100 non-null int64
dtypes: float64(4), int64(1)
memory usage: 4.7 KB
A simple visualization
sns.relplot(data = df, x = 'Sepal length', y = 'Sepal width'
, hue = 'Species')
Figure 6.1 A Scatter Plot of Sepal Length VS Sepal Width with Species Differentiation
Train-test split
from sklearn.model_selection import train_test_split
X = df[df.columns[:2]]
y = df[df.columns[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.50)
X_train[:5]
Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
X_train[:5]
array([[-0.83327391, -0.12378458],
[ 0.1454035 , 1.33250465],
[-1.95176237, -1.58007382],
[-0.27402967, -0.8519292 ],
[-1.5323292 , -0.48785689]])
from sklearn.neighbors import KNeighborsClassifier

k = 1
classifier = KNeighborsClassifier(n_neighbors = k)
classifier.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=1)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
Confusion Matrix:
[[13 9]
[14 14]]
Classification Report:
precision recall f1-score support
accuracy 0.54 50
macro avg 0.55 0.55 0.54 50
weighted avg 0.55 0.54 0.54 50
Accuracy: 0.54
6.1.1.3 Best k
def knn_tuning(k):
classifier = KNeighborsClassifier(n_neighbors = k)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
return accuracy
knn_tuning(1)
0.54
knn_tuning(5)
0.6
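The tuning results are collected in a small DataFrame; a sketch of its construction (the K values match the column shown next):

# Candidate k values: 1, 6, 11, ..., 46
knn_results = pd.DataFrame({'K': range(1, 50, 5)})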
knn_results['K']
0 1
1 6
2 11
3 16
4 21
5 26
6 31
7 36
8 41
9 46
Name: K, dtype: int64
knn_results['Accuracy'] = knn_results['K'].apply(knn_tuning)
knn_results['Accuracy']
0 0.54
1 0.54
2 0.66
3 0.66
4 0.64
5 0.58
6 0.62
7 0.58
8 0.60
9 0.44
Name: Accuracy, dtype: float64
knn_results
K Accuracy
0 1 0.54
1 6 0.54
2 11 0.66
3 16 0.66
4 21 0.64
5 26 0.58
6 31 0.62
7 36 0.58
8 41 0.60
9 46 0.44
def knn_tuning_uniform(k):
classifier = KNeighborsClassifier(n_neighbors = k, weights= 'uniform')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
return accuracy
def knn_tuning_distance(k):
classifier = KNeighborsClassifier(n_neighbors = k, weights= 'distance')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
return accuracy
knn_results['Uniform'] = knn_results['K'].apply(knn_tuning_uniform)
knn_results['Distance'] = knn_results['K'].apply(knn_tuning_distance)
knn_results
6.1.2.1 Setup
Environment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal length 150 non-null float64
1 Sepal width 150 non-null float64
2 Petal length 150 non-null float64
3 Petal width 150 non-null float64
4 Species 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
A simple visualization
sns.relplot(data = df, x = 'Sepal length', y = 'Sepal width'
, hue = 'Species')
Figure 6.2 A Scatter Plot of Sepal Length VS Sepal Width with Species Differentiation
Train-test split
from sklearn.model_selection import train_test_split
X = df[df.columns[:4]]
y = df[df.columns[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
X_train[:5]
Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
X_train[:5]
k = 1
classifier = KNeighborsClassifier(n_neighbors = k)
classifier.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=1)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
Confusion Matrix:
[[ 8 0 0]
[ 0 8 2]
[ 0 0 12]]
Classification Report:
precision recall f1-score support
accuracy 0.93 30
macro avg 0.95 0.93 0.94 30
weighted avg 0.94 0.93 0.93 30
Accuracy: 0.9333333333333333
6.1.2.3 Best k
def knn_tuning(k):
classifier = KNeighborsClassifier(n_neighbors = k)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
return accuracy
knn_tuning(1)
0.9333333333333333
knn_tuning(5)
0.9333333333333333
knn_results['K']
0 1
1 6
2 11
3 16
4 21
5 26
6 31
7 36
8 41
9 46
10 51
11 56
12 61
13 66
14 71
15 76
16 81
17 86
18 91
19 96
20 101
21 106
22 111
23 116
Name: K, dtype: int64
knn_results['Accuracy'] = knn_results['K'].apply(knn_tuning)
knn_results['Accuracy']
0 0.933333
1 1.000000
2 1.000000
3 0.966667
4 0.966667
5 0.966667
6 0.966667
7 0.933333
8 0.933333
9 0.933333
10 0.933333
11 0.933333
12 0.833333
13 0.833333
14 0.866667
15 0.666667
16 0.600000
17 0.633333
18 0.633333
19 0.600000
20 0.600000
21 0.600000
22 0.600000
23 0.566667
Name: Accuracy, dtype: float64
knn_results
K Accuracy
0 1 0.933333
1 6 1.000000
2 11 1.000000
3 16 0.966667
4 21 0.966667
5 26 0.966667
6 31 0.966667
7 36 0.933333
8 41 0.933333
9 46 0.933333
10 51 0.933333
11 56 0.933333
12 61 0.833333
13 66 0.833333
14 71 0.866667
15 76 0.666667
16 81 0.600000
17 86 0.633333
18 91 0.633333
19 96 0.600000
20 101 0.600000
21 106 0.600000
22 111 0.600000
23 116 0.566667
def knn_tuning_uniform(k):
classifier = KNeighborsClassifier(n_neighbors = k, weights= 'uniform')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
return accuracy
def knn_tuning_distance(k):
classifier = KNeighborsClassifier(n_neighbors = k, weights= 'distance')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
return accuracy
knn_results['Uniform'] = knn_results['K'].apply(knn_tuning_uniform)
knn_results['Distance'] = knn_results['K'].apply(knn_tuning_distance)
knn_results
6.1.3.1 Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
iris = datasets.load_iris()
df = df[df['Species'] != 0]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 50 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal length 100 non-null float64
1 Sepal width 100 non-null float64
2 Petal length 100 non-null float64
3 Petal width 100 non-null float64
4 Species 100 non-null int64
dtypes: float64(4), int64(1)
memory usage: 4.7 KB
Figure 6.3 A Scatter Plot of Sepal Length VS Sepal Width with Species Differentiation
X = df[df.columns[:2]]
y = df[df.columns[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
X_train[:5]
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
X_train[:5]
array([[-0.85622399, -1.19846152],
[-1.1718826 , -1.83551748],
[-0.54056538, 1.03123433],
[-1.0140533 , -0.56140556],
[-0.22490676, -0.24287758]])
from sklearn.neighbors import RadiusNeighborsClassifier

r = 1
classifier = RadiusNeighborsClassifier(radius = r)
classifier.fit(X_train, y_train)
RadiusNeighborsClassifier(radius=1)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
Confusion Matrix:
[[4 4]
[4 8]]
Classification Report:
precision recall f1-score support
accuracy 0.60 20
macro avg 0.58 0.58 0.58 20
weighted avg 0.60 0.60 0.60 20
Accuracy: 0.6
6.1.3.3 Best r
def rnn_tuning(r):
classifier = RadiusNeighborsClassifier(radius = r)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
return accuracy
rnn_tuning(1)
0.6
rnn_tuning(5)
0.45
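As with KNN, the candidate radius values are collected in a small DataFrame; a sketch (the values match the column shown next):

# Candidate radius values: 1.0, 1.5, ..., 9.5
rnn_results = pd.DataFrame({'R': np.arange(1.0, 10.0, 0.5)})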
rnn_results['R']
0 1.0
1 1.5
2 2.0
3 2.5
4 3.0
5 3.5
6 4.0
7 4.5
8 5.0
9 5.5
10 6.0
11 6.5
12 7.0
13 7.5
14 8.0
15 8.5
16 9.0
17 9.5
Name: R, dtype: float64
rnn_results['Accuracy'] = rnn_results['R'].apply(rnn_tuning)
rnn_results['Accuracy']
0 0.60
1 0.55
2 0.55
3 0.60
4 0.55
5 0.50
6 0.45
7 0.45
8 0.45
9 0.45
10 0.40
11 0.40
12 0.40
13 0.40
14 0.40
15 0.40
16 0.40
17 0.40
Name: Accuracy, dtype: float64
rnn_results
R Accuracy
0 1.0 0.60
1 1.5 0.55
2 2.0 0.55
3 2.5 0.60
4 3.0 0.55
5 3.5 0.50
6 4.0 0.45
7 4.5 0.45
8 5.0 0.45
9 5.5 0.45
10 6.0 0.40
11 6.5 0.40
12 7.0 0.40
13 7.5 0.40
14 8.0 0.40
15 8.5 0.40
16 9.0 0.40
17 9.5 0.40
def rnn_tuning_uniform(r):
classifier = RadiusNeighborsClassifier(radius = r, weights= 'uniform')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
return accuracy
def rnn_tuning_distance(k):
classifier = RadiusNeighborsClassifier(radius = k, weights= 'distance')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
return accuracy
rnn_results['Uniform'] = rnn_results['R'].apply(rnn_tuning_uniform)
rnn_results['Distance'] = rnn_results['R'].apply(rnn_tuning_distance)
rnn_results
6.1.4.1 Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
iris = datasets.load_iris()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal length 150 non-null float64
1 Sepal width 150 non-null float64
2 Petal length 150 non-null float64
3 Petal width 150 non-null float64
4 Species 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
Figure 6.4 A Scatter Plot of Sepal Length VS Sepal Width with Species Differentiation
X = df[df.columns[:4]]
y = df[df.columns[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
X_train[:5]
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
X_train[:5]
r = 1
classifier = RadiusNeighborsClassifier(radius = r)
classifier.fit(X_train, y_train)
RadiusNeighborsClassifier(radius=1)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
Confusion Matrix:
[[ 8 0 0]
[ 0 12 0]
[ 0 0 10]]
Classification Report:
precision recall f1-score support
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
Accuracy: 1.0
6.1.4.3 Best r
def rnn_tuning(r):
classifier = RadiusNeighborsClassifier(radius = r)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
return accuracy
rnn_tuning(1)
1.0
rnn_tuning(5)
0.4
rnn_results['R']
0 1.0
1 1.5
2 2.0
3 2.5
4 3.0
5 3.5
6 4.0
7 4.5
8 5.0
9 5.5
10 6.0
11 6.5
12 7.0
13 7.5
14 8.0
15 8.5
16 9.0
17 9.5
Name: R, dtype: float64
rnn_results['Accuracy'] = rnn_results['R'].apply(rnn_tuning)
rnn_results['Accuracy']
0 1.000000
1 0.933333
2 0.866667
3 0.900000
4 0.866667
5 0.700000
6 0.600000
7 0.500000
8 0.400000
9 0.366667
10 0.300000
11 0.266667
12 0.266667
13 0.266667
14 0.266667
15 0.266667
16 0.266667
17 0.266667
Name: Accuracy, dtype: float64
rnn_results
R Accuracy
0 1.0 1.000000
1 1.5 0.933333
2 2.0 0.866667
3 2.5 0.900000
4 3.0 0.866667
5 3.5 0.700000
6 4.0 0.600000
7 4.5 0.500000
8 5.0 0.400000
9 5.5 0.366667
10 6.0 0.300000
11 6.5 0.266667
12 7.0 0.266667
13 7.5 0.266667
14 8.0 0.266667
15 8.5 0.266667
16 9.0 0.266667
17 9.5 0.266667
def rnn_tuning_uniform(r):
classifier = RadiusNeighborsClassifier(radius = r, weights= 'uniform')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
return accuracy
def rnn_tuning_distance(k):
classifier = RadiusNeighborsClassifier(radius = k, weights= 'distance')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
return accuracy
rnn_results['Uniform'] = rnn_results['R'].apply(rnn_tuning_uniform)
rnn_results['Distance'] = rnn_results['R'].apply(rnn_tuning_distance)
rnn_results
6.1.5 Case Study – Breast Cancer Classification Using Nearest Neighbor Classifiers
We will create a tutorial for the Nearest Neighbor algorithm, including K-Nearest
Neighbors (KNN) and Radius Neighbors (RNN), using the Breast Cancer dataset. We
will demonstrate how the choices of k and radius affect the classification results and
compare the performance of different models. To aid understanding, we will visualize
the prediction results.
6.1.5.1 Setup
Import necessary libraries and load the Breast Cancer dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
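The feature matrix, labels, and hyperparameter grids evaluated below are set up roughly as follows (a sketch; the k and radius values are taken from the accuracy listings later in this section):

# Load the Breast Cancer data and define the hyperparameter grids
data = load_breast_cancer()
X, y = data.data, data.target
k_values = [1, 5, 11, 15, 21]                     # K values reported below
radius_values = [350, 400, 450, 500, 550, 600]    # radius values reported below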
# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train KNN models with different k values and store the results
knn_results = {}
for k in k_values:
knn_model = KNeighborsClassifier(n_neighbors=k)
knn_model.fit(X_train, y_train)
y_pred_knn = knn_model.predict(X_test)
knn_results[k] = {
'model': knn_model,
'y_pred': y_pred_knn,
'accuracy': accuracy_score(y_test, y_pred_knn)
}
# Train RNN models with different radius values and store the results
rnn_results = {}
for radius in radius_values:
rnn_model = RadiusNeighborsClassifier(radius=radius)
rnn_model.fit(X_train, y_train)
y_pred_rnn = rnn_model.predict(X_test)
rnn_results[radius] = {
'model': rnn_model,
'y_pred': y_pred_rnn,
'accuracy': accuracy_score(y_test, y_pred_rnn)
}
KNN Accuracy:
K = 1: 0.93
K = 5: 0.96
K = 11: 0.98
K = 15: 0.96
K = 21: 0.96
RNN Accuracy:
Radius = 350: 0.94
Radius = 400: 0.94
Radius = 450: 0.94
Radius = 500: 0.91
Radius = 550: 0.90
Radius = 600: 0.90
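The plotting cells below expect plain lists of accuracies; a sketch of pulling them out of the result dictionaries built above:

# Collect accuracies in the same order as the hyperparameter grids
k_accuracies = [knn_results[k]['accuracy'] for k in k_values]
radius_accuracies = [rnn_results[r]['accuracy'] for r in radius_values]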
plt.figure(figsize=(8, 4))
plt.plot(k_values, k_accuracies, marker='o')
plt.xlabel('K Value')
plt.ylabel('Accuracy')
plt.title('Accuracy of KNN models')
plt.grid(True)
plt.show()
plt.figure(figsize=(8, 4))
plt.plot(radius_values, radius_accuracies, marker='o')
plt.xlabel('Radius Value')
plt.ylabel('Accuracy')
plt.title('Accuracy of RNN models')
plt.grid(True)
plt.show()
6.1.5.7 K and R
Feel free to experiment with different values of k and radius to observe how they affect
the accuracy of the models.
Decision Trees are powerful machine learning algorithms that are widely used for
classification tasks due to their interpretability and simplicity. This section introduces
Decision Tree Classifiers using the Scikit-learn package, covering classification,
visualization, and model tuning aspects.
Decision Tree Classifiers are versatile algorithms used for both classification and
regression tasks. They operate by recursively splitting the dataset on the feature
values that best separate the classes.
6.2.1.1 Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
df = df[df['Species'] !=0]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 50 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal length 100 non-null float64
1 Sepal width 100 non-null float64
2 Petal length 100 non-null float64
3 Petal width 100 non-null float64
4 Species 100 non-null int64
dtypes: float64(4), int64(1)
memory usage: 4.7 KB
Figure 6.7 A Scatter Plot of Sepal Length VS Sepal Width with Species Differentiation
X = df[df.columns[:4]]
y = df[df.columns[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
X_train
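The fitted estimator echoed next is produced along the following lines (a sketch; the tree module import is also needed for the export_text and plot_tree calls used later in this section):

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Fit a default (unpruned) decision tree and predict on the held-out set
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)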
DecisionTreeClassifier()
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test,y_pred))
[[10 0]
[ 1 9]]
Accuracy: 0.95
text_representation = tree.export_text(classifier)
print(text_representation)
fig = plt.figure(figsize=(10,8))
_ = tree.plot_tree(classifier,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True)
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test,y_pred))
text_representation = tree.export_text(classifier)
print(text_representation)
fig = plt.figure(figsize=(10,8))
_ = tree.plot_tree(classifier,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True)
[[10 0]
[ 1 9]]
Accuracy: 0.95
|--- feature_3 <= 1.75
| |--- feature_2 <= 4.95
| | |--- feature_3 <= 1.65
| | | |--- class: 1
| | |--- feature_3 > 1.65
| | | |--- class: 2
| |--- feature_2 > 4.95
| | |--- feature_3 <= 1.55
| | | |--- class: 2
| | |--- feature_3 > 1.55
| | | |--- class: 1
|--- feature_3 > 1.75
| |--- feature_2 <= 4.85
| | |--- feature_1 <= 3.10
| | | |--- class: 2
| | |--- feature_1 > 3.10
| | | |--- class: 1
| |--- feature_2 > 4.85
| | |--- class: 2
classifier = DecisionTreeClassifier(max_depth=1)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test,y_pred))
text_representation = tree.export_text(classifier)
print(text_representation)
fig = plt.figure()
_ = tree.plot_tree(classifier,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True)
[[10 0]
[ 1 9]]
Accuracy: 0.95
|--- feature_3 <= 1.75
| |--- class: 1
|--- feature_3 > 1.75
| |--- class: 2
classifier = DecisionTreeClassifier(max_depth=2)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test,y_pred))
text_representation = tree.export_text(classifier)
print(text_representation)
fig = plt.figure(figsize=(10,8))
_ = tree.plot_tree(classifier,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True)
[[10 0]
[ 0 10]]
Accuracy: 1.0
|--- feature_3 <= 1.75
| |--- feature_2 <= 4.95
| | |--- class: 1
| |--- feature_2 > 4.95
| | |--- class: 2
|--- feature_3 > 1.75
| |--- feature_2 <= 4.85
| | |--- class: 2
| |--- feature_2 > 4.85
| | |--- class: 2
classifier = DecisionTreeClassifier(max_depth=3)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test,y_pred))
text_representation = tree.export_text(classifier)
print(text_representation)
[[10 0]
[ 1 9]]
Accuracy: 0.95
|--- feature_3 <= 1.75
| |--- feature_2 <= 4.95
| | |--- feature_3 <= 1.65
| | | |--- class: 1
| | |--- feature_3 > 1.65
| | | |--- class: 2
| |--- feature_2 > 4.95
| | |--- feature_3 <= 1.55
| | | |--- class: 2
| | |--- feature_3 > 1.55
| | | |--- class: 1
|--- feature_3 > 1.75
| |--- feature_2 <= 4.85
| | |--- feature_1 <= 3.10
| | | |--- class: 2
| | |--- feature_1 > 3.10
| | | |--- class: 1
| |--- feature_2 > 4.85
| | |--- class: 2
def tree_depth_tuning(d):
classifier = DecisionTreeClassifier(max_depth=d)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
return accuracy
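The depths to evaluate are collected in a small DataFrame; a sketch (the values match the table below):

# Candidate maximum depths: 1 through 9
tree_results = pd.DataFrame({'D': range(1, 10)})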
tree_results['Accuracy'] = tree_results['D'].apply(tree_depth_tuning)
tree_results
D Accuracy
0 1 0.95
1 2 1.00
2 3 0.95
3 4 0.95
4 5 0.95
5 6 0.95
6 7 0.95
7 8 0.95
8 9 0.95
6.2.2.1 Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal length 150 non-null float64
1 Sepal width 150 non-null float64
2 Petal length 150 non-null float64
3 Petal width 150 non-null float64
4 Species 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
Figure 6.12 A Scatter Plot of Sepal Length VS Sepal Width with Species Differentiation
X = df[df.columns[:4]]
y = df[df.columns[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
X_train
DecisionTreeClassifier()
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test,y_pred))
[[12 0 0]
[ 0 9 1]
[ 0 1 7]]
Accuracy: 0.9333333333333333
text_representation = tree.export_text(classifier)
print(text_representation)
fig = plt.figure(figsize=(10,8))
_ = tree.plot_tree(classifier,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True)
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test,y_pred))
text_representation = tree.export_text(classifier)
print(text_representation)
fig = plt.figure(figsize=(10,8))
_ = tree.plot_tree(classifier,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True)
[[12 0 0]
[ 0 9 1]
[ 0 1 7]]
Accuracy: 0.9333333333333333
|--- feature_2 <= 2.45
| |--- class: 0
|--- feature_2 > 2.45
| |--- feature_3 <= 1.75
| | |--- feature_2 <= 4.95
| | | |--- feature_0 <= 4.95
| | | | |--- class: 2
| | | |--- feature_0 > 4.95
| | | | |--- class: 1
| | |--- feature_2 > 4.95
| | | |--- feature_1 <= 2.65
| | | | |--- class: 2
| | | |--- feature_1 > 2.65
| | | | |--- feature_2 <= 5.45
| | | | | |--- class: 1
| | | | |--- feature_2 > 5.45
| | | | | |--- class: 2
| |--- feature_3 > 1.75
| | |--- feature_2 <= 4.85
| | | |--- feature_0 <= 5.95
| | | | |--- class: 1
| | | |--- feature_0 > 5.95
| | | | |--- class: 2
| | |--- feature_2 > 4.85
| | | |--- class: 2
classifier = DecisionTreeClassifier(max_depth=1)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test,y_pred))
text_representation = tree.export_text(classifier)
print(text_representation)
fig = plt.figure()
_ = tree.plot_tree(classifier,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True)
[[12 0 0]
[ 0 0 10]
[ 0 0 8]]
Accuracy: 0.6666666666666666
|--- feature_3 <= 0.80
| |--- class: 0
|--- feature_3 > 0.80
| |--- class: 2
classifier = DecisionTreeClassifier(max_depth=2)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test,y_pred))
text_representation = tree.export_text(classifier)
print(text_representation)
fig = plt.figure(figsize=(10,8))
_ = tree.plot_tree(classifier,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True)
[[12 0 0]
[ 0 10 0]
[ 0 1 7]]
Accuracy: 0.9666666666666667
|--- feature_2 <= 2.45
| |--- class: 0
|--- feature_2 > 2.45
| |--- feature_3 <= 1.75
| | |--- class: 1
| |--- feature_3 > 1.75
| | |--- class: 2
classifier = DecisionTreeClassifier(max_depth=3)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test,y_pred))
text_representation = tree.export_text(classifier)
print(text_representation)
[[12 0 0]
[ 0 10 0]
[ 0 1 7]]
Accuracy: 0.9666666666666667
|--- feature_2 <= 2.45
| |--- class: 0
|--- feature_2 > 2.45
| |--- feature_3 <= 1.75
| | |--- feature_2 <= 5.35
| | | |--- class: 1
| | |--- feature_2 > 5.35
| | | |--- class: 2
| |--- feature_3 > 1.75
| | |--- feature_2 <= 4.85
| | | |--- class: 2
| | |--- feature_2 > 4.85
| | | |--- class: 2
def tree_depth_tuning(d):
classifier = DecisionTreeClassifier(max_depth=d)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
return accuracy
tree_results = pd.DataFrame({'D': range(1, 10)})
tree_results['Accuracy'] = tree_results['D'].apply(tree_depth_tuning)
tree_results
D Accuracy
0 1 0.666667
1 2 0.966667
2 3 0.966667
3 4 0.933333
4 5 0.933333
5 6 0.933333
6 7 0.933333
7 8 0.966667
8 9 0.933333
6.2.3.1 Setup
Import necessary libraries and load the Breast Cancer dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
6.2.3.3 Create and Train Decision Tree Models with Different Splitting Criteria and Max Depth
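As a minimal sketch (assuming the X_train, X_test, y_train, y_test split above; the depth grid and the results dictionary are illustrative choices), the models can be trained and compared like this:

# Compare Information Gain ('entropy') and Gini Index ('gini') across several depths
max_depth_values = [1, 2, 3, 4, 5, 6, 7, 8]
results = {}
for criterion in ['entropy', 'gini']:
    accuracies = []
    for d in max_depth_values:
        clf = DecisionTreeClassifier(criterion=criterion, max_depth=d, random_state=42)
        clf.fit(X_train, y_train)
        acc = accuracy_score(y_test, clf.predict(X_test))
        accuracies.append(acc)
        print(f"criterion={criterion}, max_depth={d}, accuracy={acc:.2f}")
    results[criterion] = accuracies

# Visualize accuracy vs max depth for each splitting criterion
for criterion, accuracies in results.items():
    plt.plot(max_depth_values, accuracies, marker='o', label=criterion)
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.legend()
plt.show()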
6.2.3.6 Conclusion
This tutorial covers the Decision Tree algorithm using the Breast Cancer dataset.
It demonstrates how different splitting criteria (Information Gain and Gini Index)
and tree pruning (max depth) affect the classification results. The tutorial prints the
accuracy of the models with different hyperparameters and visualizes the accuracy vs
max depth for each splitting criterion for comparison.
Feel free to adjust the max_depth_values and add other hyperparameters to explore
their effects on the decision tree’s performance.
Support Vector Machines (SVMs) are powerful and versatile machine learning al-
gorithms used for classification and regression tasks. This section introduces SVM
Classifiers using the Scikit-learn package, covering their theory, implementation, and
practical applications.
Support Vector Machine Classifiers (SVMs) are supervised learning algorithms that
excel in both linear and non-linear classification tasks. They work by finding the
optimal hyperplane that best separates data into distinct classes. Scikit-learn provides
a robust library for implementing SVM Classifiers with ease. You will explore practi-
cal implementation steps, including using the SVC (Support Vector Classification)
class in Scikit-learn to create SVM Classifier models and understanding the role of
hyperparameters such as the kernel type, regularization parameter (C), and gamma
in SVM model performance.
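For instance, a minimal sketch on a small synthetic dataset (the data and hyperparameter values here are illustrative only) looks like this:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A small synthetic binary problem, just to show the hyperparameters
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# kernel: the similarity function ('linear', 'rbf', 'poly', ...)
# C: regularization strength (smaller C gives a softer margin)
# gamma: kernel coefficient for the 'rbf' and 'poly' kernels
model = SVC(kernel='rbf', C=1.0, gamma='scale')
model.fit(X_train, y_train)
print('Test accuracy:', model.score(X_test, y_test))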
6.3.1.1 Setup
Environment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
df = df[df['Species'] !=0]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 50 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal length 100 non-null float64
1 Sepal width 100 non-null float64
2 Petal length 100 non-null float64
3 Petal width 100 non-null float64
4 Species 100 non-null int64
dtypes: float64(4), int64(1)
memory usage: 4.7 KB
A simple visualization
sns.relplot(data = df, x = 'Sepal length', y = 'Sepal width'
, hue = 'Species')
<seaborn.axisgrid.FacetGrid at 0x7f8358fbf880>
Figure 6.18 A Scatter Plot of Sepal Length VS Sepal Width with Species Differentiation
Train-test split
from sklearn.model_selection import train_test_split
X = df[df.columns[:2]]
y = df[df.columns[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
X_train[:5]
from sklearn import svm

model = svm.SVC(kernel='linear')
classifier = model.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
[[11 1]
[ 1 7]]
precision recall f1-score support
accuracy 0.90 20
macro avg 0.90 0.90 0.90 20
weighted avg 0.90 0.90 0.90 20
0.9
6.3.2.1 Setup
Environment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal length 150 non-null float64
A simple visualization
sns.relplot(data = df, x = 'Sepal length', y = 'Sepal width'
, hue = 'Species')
<seaborn.axisgrid.FacetGrid at 0x7aaccd3501f0>
Figure 6.19 A Scatter Plot of Sepal Length VS Sepal Width with Species Differentiation
Train-test split
from sklearn.model_selection import train_test_split
X = df[df.columns[:4]]
y = df[df.columns[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
X_train[:5]
from sklearn import svm

model = svm.SVC(kernel='linear')
classifier = model.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
[[12 0 0]
[ 0 8 1]
[ 0 0 9]]
0.9666666666666667
6.3.3.1 Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
6.3.3.2 Create and Train SVM Models with Different Regularization Parameters (C)
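A minimal sketch of this step (assuming the Breast Cancer split above; the list of C values is an illustrative choice, and it produces the C_values and accuracies used by the plot below):

# Train one SVM per C value and record its test accuracy
C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
accuracies = []
for C in C_values:
    model = SVC(kernel='linear', C=C)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    accuracies.append(acc)
    print(f"C = {C}, Accuracy: {acc:.4f}")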
plt.figure(figsize=(8, 4))
plt.plot(C_values, accuracies, marker='o')
plt.xscale('log')
plt.xlabel('Regularization Parameter (C)')
plt.ylabel('Accuracy')
plt.title('Accuracy vs Regularization Parameter (C) for SVM')
plt.grid(True)
plt.show()
6.3.3.5 Conclusion
This tutorial covers the SVM algorithm using the Breast Cancer dataset. It demon-
strates the difference between soft and hard margin SVM and shows how the regular-
ization parameter (C) affects the classification results. The tutorial prints the accuracy
of the models with different values of C and visualizes the accuracy vs regularization
parameter (C) for comparison.
Feel free to adjust the C_values list and try other kernel types (e.g., ’rbf’, ’poly’) to
explore their effects on SVM’s performance.
Naive Bayes classifiers are probabilistic machine learning algorithms commonly used
for classification tasks, particularly in natural language processing and text analysis.
This section introduces Naive Bayes classifiers using the Scikit-learn package, covering
their theory, implementation, and practical applications.
Naive Bayes classifiers are based on Bayes’ theorem and assume that features are
conditionally independent, hence the term “naive”. They are known for their simplicity,
efficiency, and effectiveness in various classification tasks. Scikit-learn provides a
user-friendly environment for implementing Naive Bayes classifiers. You will explore
practical implementation steps, including using the MultinomialNB, GaussianNB,
and BernoulliNB classes in Scikit-learn for different types of Naive Bayes models
and understanding the Laplace smoothing technique to handle unseen features and
improve model performance.
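As a quick, self-contained illustration (the tiny count matrix below is made up for this sketch), the three variants and the Laplace smoothing parameter alpha are used like this:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# A tiny toy dataset: word counts for four "documents" in two classes
X_counts = np.array([[2, 1, 0], [3, 0, 0], [0, 2, 3], [0, 1, 4]])
y = np.array([0, 0, 1, 1])

gnb = GaussianNB()               # continuous features (Gaussian likelihoods)
mnb = MultinomialNB(alpha=1.0)   # count features; alpha=1.0 is Laplace smoothing
bnb = BernoulliNB(alpha=1.0)     # binary/boolean features

mnb.fit(X_counts, y)
print(mnb.predict(np.array([[1, 0, 0], [0, 0, 2]])))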
6.4.1.1 Setup
Environment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
df = df[df['Species'] != 0]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 50 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal length 100 non-null float64
1 Sepal width 100 non-null float64
2 Petal length 100 non-null float64
3 Petal width 100 non-null float64
4 Species 100 non-null int64
dtypes: float64(4), int64(1)
memory usage: 4.7 KB
A simple visualization
sns.relplot(data = df, x = 'Sepal length', y = 'Sepal width'
, hue = 'Species')
<seaborn.axisgrid.FacetGrid at 0x79b596653b80>
Figure 6.21 A Scatter Plot of Sepal Length VS Sepal Width with Species Differentiation
Train-test split
from sklearn.model_selection import train_test_split
X = df[df.columns[:2]]
y = df[df.columns[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
X_train[:5]
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result1 = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result1)
report = classification_report(y_test, y_pred)
print("Classification Report:")
print(report)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)
Confusion Matrix:
[[9 2]
[1 8]]
Classification Report:
precision recall f1-score support
accuracy 0.85 20
macro avg 0.85 0.85 0.85 20
weighted avg 0.86 0.85 0.85 20
Accuracy: 0.85
6.4.2.1 Setup
Environment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal length 150 non-null float64
1 Sepal width 150 non-null float64
2 Petal length 150 non-null float64
3 Petal width 150 non-null float64
4 Species 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
A simple visualization
sns.relplot(data = df, x = 'Sepal length', y = 'Sepal width'
, hue = 'Species')
<seaborn.axisgrid.FacetGrid at 0x7e90b4584700>
Figure 6.22 A Scatter Plot of Sepal Length VS Sepal Width with Species Differentiation
Train-test split
from sklearn.model_selection import train_test_split
X = df[df.columns[:4]]
y = df[df.columns[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
X_train[:5]
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
Confusion Matrix:
[[ 7 0 0]
[ 0 8 2]
[ 0 0 13]]
Classification Report:
precision recall f1-score support
accuracy 0.93 30
macro avg 0.96 0.93 0.94 30
weighted avg 0.94 0.93 0.93 30
Accuracy: 0.9333333333333333
6.4.3.1 Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
6.4.3.4 Conclusion
This tutorial covers the Naive Bayes algorithm using the Breast Cancer dataset. It
demonstrates how to create and train the Naive Bayes model for classification and
prints the accuracy of the model on the test set.
Naive Bayes is a simple yet powerful algorithm for classification tasks, especially when
dealing with text or categorical data.
Logistic Regression is a widely used statistical and machine learning technique for
binary classification tasks. This section introduces Logistic Regression classifiers
using the Scikit-learn package, covering their theory, implementation, and practical
applications.
Logistic Regression is a fundamental classification algorithm that models the prob-
ability of a binary outcome based on one or more predictor variables. Despite its
name, it is used for classification rather than regression tasks. Scikit-learn offers a
convenient environment for implementing Logistic Regression classifiers. You will
explore practical implementation steps, including using the LogisticRegression class
in Scikit-learn to create Logistic Regression models and training Logistic Regression
models on labeled datasets and making binary classification predictions.
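A minimal sketch of that workflow on a small synthetic binary dataset (all names and values here are illustrative) looks like this:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# A small synthetic binary classification problem
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print('Accuracy:', accuracy_score(y_test, clf.predict(X_test)))
print('Predicted probabilities:\n', clf.predict_proba(X_test[:3]))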
6.5.1.1 Setup
Environment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
df = df[df['Species'] != 0]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 50 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal length 100 non-null float64
1 Sepal width 100 non-null float64
2 Petal length 100 non-null float64
3 Petal width 100 non-null float64
4 Species 100 non-null int64
dtypes: float64(4), int64(1)
memory usage: 4.7 KB
A simple visualization
sns.relplot(data = df, x = 'Sepal length', y = 'Sepal width'
, hue = 'Species')
<seaborn.axisgrid.FacetGrid at 0x77fc4c650100>
Figure 6.23 A Scatter Plot of Sepal Length VS Sepal Width with Species Differentiation
Train-test split
from sklearn.model_selection import train_test_split
X = df[df.columns[:2]]
y = df[df.columns[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
X_train[:5]
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
LogisticRegression()
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
Confusion Matrix:
[[6 1]
[4 9]]
Classification Report:
precision recall f1-score support
accuracy 0.75 20
macro avg 0.75 0.77 0.74 20
weighted avg 0.80 0.75 0.76 20
Accuracy: 0.75
6.5.2.1 Setup
Environment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal length 150 non-null float64
1 Sepal width 150 non-null float64
2 Petal length 150 non-null float64
3 Petal width 150 non-null float64
4 Species 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
A simple visualization
sns.relplot(data = df, x = 'Sepal length', y = 'Sepal width'
, hue = 'Species')
<seaborn.axisgrid.FacetGrid at 0x7e3b5deaa890>
Figure 6.24 A Scatter Plot of Sepal Length VS Sepal Width with Species Differentiation
Train-test split
from sklearn.model_selection import train_test_split
X = df[df.columns[:4]]
y = df[df.columns[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
X_train[:5]
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
LogisticRegression()
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
report = classification_report(y_test, y_pred)
print("Classification Report:")
print(report)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Confusion Matrix:
[[11 0 0]
[ 0 9 1]
[ 0 1 8]]
Classification Report:
precision recall f1-score support
accuracy 0.93 30
macro avg 0.93 0.93 0.93 30
weighted avg 0.93 0.93 0.93 30
Accuracy: 0.9333333333333333
6.5.3.1 Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
6.5.3.2 Create and Train the Logistic Regression Model with Different Regularization Parameters
# Create a list of regularization parameter values
C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
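A minimal sketch of the training loop (assuming the Breast Cancer split above; it produces the accuracies list plotted below, and max_iter is raised only to avoid convergence warnings on the unscaled data):

# Train one Logistic Regression model per C value and record its test accuracy
accuracies = []
for C in C_values:
    model = LogisticRegression(C=C, max_iter=10000)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    accuracies.append(acc)
    print(f"C = {C}, Accuracy: {acc:.4f}")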
plt.figure(figsize=(8, 4))
plt.plot(C_values, accuracies, marker='o')
plt.xscale('log')
plt.xlabel('Regularization Parameter (C)')
plt.ylabel('Accuracy')
plt.title('Accuracy vs Regularization Parameter (C)')
plt.grid(True)
plt.show()
6.5.3.5 Conclusion
This tutorial covers the Logistic Regression algorithm using the Breast Cancer dataset.
It demonstrates how to create and train the Logistic Regression model for classification
and prints the accuracy of the model with different values of the regularization
parameter (C). The tutorial also visualizes the accuracy vs regularization parameter
(C) for comparison.
Feel free to experiment with different values of C to observe how it affects the model’s
performance.
In this section, we will conduct a comprehensive case study to explore and compare
the performance of various classification methods we have introduced using a single
dataset. This hands-on approach will provide you with a practical understanding of
how different classifiers behave and perform in real-world scenarios.
The case study aims to demonstrate the strengths and weaknesses of different clas-
sification methods, allowing you to make informed choices when selecting the most
appropriate algorithm for a specific task. You will work with a dataset that is suitable
for classification and apply all classifiers we have covered. Based on the case study
results, you will gain insights into which classifier(s) perform best for the given dataset
and classification task. You will also learn how to choose the most suitable classifier
based on the specific requirements and characteristics of a problem.
We will apply each classifier with a range of hyperparameters, summarize their performance in terms of accuracy, and, at the end, visualize the results for comparison.
The Wine Dataset is a popular dataset for classification tasks, where the target class
represents the origin of different wines. It contains 13 features that describe various
properties of the wines.
6.6.1.1 Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load the Wine dataset
data = load_wine()
X, y = data.data, data.target

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
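The training loops follow the same pattern as in the earlier sections; a minimal sketch that produces the knn_results, dt_results, svm_results, nb_accuracy, and logreg_results summarized below (the hyperparameter grids are illustrative assumptions) is:

# KNN with different numbers of neighbors
knn_results = {}
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    knn_results[k] = accuracy_score(y_test, knn.predict(X_test))

# Decision Trees with different max depths
dt_results = {}
for d in [1, 3, 5, 7, 9, 11, 13, 15, 17]:
    dt = DecisionTreeClassifier(max_depth=d, random_state=42).fit(X_train, y_train)
    dt_results[d] = accuracy_score(y_test, dt.predict(X_test))

# SVM and Logistic Regression with different regularization strengths
C_values = [0.001, 0.01, 0.1, 1, 10, 100, 200]
svm_results, logreg_results = {}, {}
for C in C_values:
    svm_results[C] = accuracy_score(
        y_test, SVC(C=C).fit(X_train, y_train).predict(X_test))
    logreg_results[C] = accuracy_score(
        y_test, LogisticRegression(C=C, max_iter=10000).fit(X_train, y_train).predict(X_test))

# Naive Bayes has no hyperparameter loop here
nb_accuracy = accuracy_score(y_test, GaussianNB().fit(X_train, y_train).predict(X_test))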
print("\nDecision Trees:")
for max_depth, accuracy in dt_results.items():
print(f"Max Depth = {max_depth}, Accuracy: {accuracy:.2f}")
print("\nSVM:")
for C, accuracy in svm_results.items():
print(f"C = {C}, Accuracy: {accuracy:.2f}")
print("\nNaive Bayes:")
print(f"Accuracy: {nb_accuracy:.2f}")
print("\nLogistic Regression:")
for C, accuracy in logreg_results.items():
print(f"C = {C}, Accuracy: {accuracy:.2f}")
Decision Trees:
Max Depth = 1, Accuracy: 0.67
Max Depth = 3, Accuracy: 0.94
Max Depth = 5, Accuracy: 0.94
Max Depth = 7, Accuracy: 0.94
Max Depth = 9, Accuracy: 0.94
Max Depth = 11, Accuracy: 0.94
Max Depth = 13, Accuracy: 0.94
Max Depth = 15, Accuracy: 0.94
Max Depth = 17, Accuracy: 0.94
SVM:
C = 0.001, Accuracy: 0.39
C = 0.01, Accuracy: 0.39
C = 0.1, Accuracy: 0.78
C = 1, Accuracy: 0.81
C = 10, Accuracy: 0.78
C = 100, Accuracy: 0.83
C = 200, Accuracy: 0.83
Naive Bayes:
Accuracy: 1.00
Logistic Regression:
C = 0.001, Accuracy: 0.89
C = 0.01, Accuracy: 1.00
C = 0.1, Accuracy: 1.00
C = 1, Accuracy: 0.97
C = 10, Accuracy: 0.94
C = 100, Accuracy: 0.94
C = 200, Accuracy: 0.94
plt.subplot(2, 2, 2)
plt.plot(list(dt_results.keys()), list(dt_results.values()), marker='o')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.title('Accuracy vs Max Depth for Decision Trees')
plt.subplot(2, 2, 3)
plt.plot(C_values, list(svm_results.values()), marker='o')
plt.xscale('log')
plt.xlabel('C Value')
plt.ylabel('Accuracy')
plt.title('Accuracy vs C for SVM')
plt.subplot(2, 2, 4)
plt.plot(C_values, list(logreg_results.values()), marker='o')
plt.xscale('log')
plt.xlabel('C Value')
plt.ylabel('Accuracy')
plt.title('Accuracy vs C for Logistic Regression')
plt.tight_layout()
plt.show()
6.6.1.9 Conclusion
This tutorial covers various classification methods (KNN, Decision Trees, SVM, Naive
Bayes, and Logistic Regression) using the Wine Dataset. It demonstrates how to apply
each method with different hyperparameters, summarizes their performance in terms
of accuracy, and visualizes the accuracy for comparison.
Feel free to experiment with other classification algorithms, hyperparameters, or
additional datasets to further explore different classification techniques.
CHAPTER 7
Regression
There are many different regression methods we use with Scikit-learn, but some of
the most common include:
• Linear Regression: A simple model that finds the best linear relationship between
the input features and the target value.
• Polynomial Regression: A non-linear extension of Linear Regression that uses
polynomial functions of the input features to fit the data.
• Ridge Regression: A Linear Regression model that includes a regularization
term to prevent overfitting.
• Lasso Regression: A Linear Regression model that includes a regularization
term to shrink the coefficient of less important features to zero.
• Decision Tree Regression: A tree-based model that uses a series of if-then rules
to make predictions.
• Random Forest Regression: An ensemble method that combines many decision
trees to improve the accuracy of predictions.
• Gradient Boosting Regression: An ensemble method that combines many weak
models to improve the accuracy of predictions.
Simple Regression models the relationship between two variables: one independent variable (predictor) and one dependent variable
(outcome). Simple Regression is a powerful method for modeling linear relationships
between variables. It is commonly used for making predictions and understanding
how changes in one variable affect another. Scikit-learn provides a user-friendly
environment for implementing Simple Regression models. You will explore practical
implementation steps, including data preparation and exploration, which involve
cleaning and visualizing the dataset to identify trends and relationships, using the
LinearRegression class in Scikit-learn to fit a Linear Regression model to the data,
and assessing the goodness of fit and model performance using metrics like R-squared
and residual analysis.
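As a short preview of that workflow (on made-up data; the tutorial below uses the California Housing file), fitting a line, checking R-squared, and inspecting residuals takes only a few lines:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up data with a roughly linear relationship plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100).reshape(-1, 1)
y = 3 * x.ravel() + 5 + rng.normal(0, 2, 100)

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)
print('R-squared:', r2_score(y, y_pred))
residuals = y - y_pred   # residual analysis: plot or summarize these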
7.1.1.1 Setup
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('/content/sample_data/california_housing_train.csv')
df.head()
<seaborn.axisgrid.FacetGrid at 0x7f840ad533d0>
Prepare the independent variable
x = np.array(df['total_rooms']).reshape(-1, 1)
Prepare the dependent variable
y = np.array(df['total_bedrooms']).reshape(-1, 1)
reg = LinearRegression()
model = reg.fit(x_train, y_train)
model.coef_, model.intercept_
(array([[0.18070511]]), array([61.86755768]))
y_pred = model.predict(x_test)
mae = mean_absolute_error(y_true=y_test,y_pred=y_pred)
mse = mean_squared_error(y_true=y_test,y_pred=y_pred)
rmse = mean_squared_error(y_true=y_test,y_pred=y_pred,squared=False)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
MAE: 101.35888011777651
MSE: 23752.731604495282
RMSE: 154.1192123146731
R^2 0.8637647401970822
<matplotlib.collections.PathCollection at 0x7f840540fbe0>
<seaborn.axisgrid.FacetGrid at 0x7f84053cedc0>
reg = LinearRegression()
model = reg.fit(x_train, y_train)
y_pred = model.predict(x_test)
mae = mean_absolute_error(y_true=y_test,y_pred=y_pred)
mse = mean_squared_error(y_true=y_test,y_pred=y_pred)
rmse = mean_squared_error(y_true=y_test,y_pred=y_pred,squared=False)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
MAE: 61776.93574574111
MSE: 6756173988.71918
RMSE: 82195.94873665721
R^2 0.4940654658976973
(array([[41854.98097901]]), array([44955.95374405]))
plt.scatter(x_test, y_test)
plt.scatter(x_test, y_pred)
<matplotlib.collections.PathCollection at 0x7f8403b29f40>
<seaborn.axisgrid.FacetGrid at 0x7f8405370a00>
reg = LinearRegression()
model = reg.fit(x_train, y_train)
model.coef_, model.intercept_
(array([[2.74182763]]), array([55.92258466]))
y_pred = model.predict(x_test)
mae = mean_absolute_error(y_true=y_test,y_pred=y_pred)
mse = mean_squared_error(y_true=y_test,y_pred=y_pred)
rmse = mean_squared_error(y_true=y_test,y_pred=y_pred,squared=False)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
MAE: 270.2469967472556
MSE: 184962.76253804995
RMSE: 430.07297350339275
R^2 0.8435552238188913
<matplotlib.collections.PathCollection at 0x7f84039d3c10>
7.1.2.1 Setup
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Simple visualization
plt.scatter(x, y)
<matplotlib.collections.PathCollection at 0x79d0267957e0>
reg = LinearRegression()
model = reg.fit(x_train, y_train)
model.coef_, model.intercept_
(array([[-0.84541517]]), array([2.72343055]))
y_pred = model.predict(x_test)
mae = mean_absolute_error(y_true=y_test,y_pred=y_pred)
mse = mean_squared_error(y_true=y_test,y_pred=y_pred)
rmse = mean_squared_error(y_true=y_test,y_pred=y_pred,squared=False)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
MAE: 2.7050964291043655
MSE: 8.22603426094628
RMSE: 2.868106389405086
R^2 0.0913959729699606
<matplotlib.collections.PathCollection at 0x79d0266ba2f0>
poly = PolynomialFeatures(degree=2)
poly_train = poly.fit_transform(x_train.reshape(-1, 1))
poly_test = poly.fit_transform(x_test.reshape(-1, 1))
poly_train
poly_reg_model = LinearRegression()
poly_reg_model.fit(poly_train, y_train)
LinearRegression()
y_pred = poly_reg_model.predict(poly_test)
mae = mean_absolute_error(y_true=y_test,y_pred=y_pred)
mse = mean_squared_error(y_true=y_test,y_pred=y_pred)
rmse = mean_squared_error(y_true=y_test,y_pred=y_pred,squared=False)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
MAE: 0.39897424716689717
MSE: 0.22466684994348615
RMSE: 0.4739903479433797
R^2 0.9596978093663853
<matplotlib.collections.PathCollection at 0x79d026554760>
Much better, isn't it? But does a higher degree always give a better result? Let's try a degree of 51.
poly = PolynomialFeatures(degree=51)
poly_train = poly.fit_transform(x_train.reshape(-1, 1))
poly_test = poly.fit_transform(x_test.reshape(-1, 1))
poly_reg_model = LinearRegression()
poly_reg_model.fit(poly_train, y_train)
y_pred = poly_reg_model.predict(poly_train)
mae = mean_absolute_error(y_true=y_train,y_pred=y_pred)
mse = mean_squared_error(y_true=y_train,y_pred=y_pred)
rmse = mean_squared_error(y_true=y_train,y_pred=y_pred,squared=False)
mse = mean_squared_error(y_train, y_pred)
r2 = r2_score(y_train, y_pred)
print('''Training:
MAE: {}
MSE: {}
RMSE: {}
R^2 {}'''.format(mae, mse, rmse, r2))
y_pred = poly_reg_model.predict(poly_test)
mae = mean_absolute_error(y_true=y_test,y_pred=y_pred)
mse = mean_squared_error(y_true=y_test,y_pred=y_pred)
rmse = mean_squared_error(y_true=y_test,y_pred=y_pred,squared=False)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('''Testing:
MAE: {}
MSE: {}
RMSE: {}
R^2 {}'''.format(mae, mse, rmse, r2))
Training:
MAE: 0.9300192606660586
MSE: 1.7181920444211911
RMSE: 1.3107982470316288
R^2 0.8398154972074873
Testing:
MAE: 1.62949110030523
MSE: 4.890215068197883
RMSE: 2.2113830668154
R^2 0.12276163587346123
plt.scatter(x, y)
plt.scatter(x_test, y_pred)
<matplotlib.collections.PathCollection at 0x79d028a3d270>
<matplotlib.collections.PathCollection at 0x79d026554760>
7.2.1.1 Setup
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('/content/sample_data/california_housing_train.csv')
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 17000 non-null float64
1 latitude 17000 non-null float64
2 housing_median_age 17000 non-null float64
3 total_rooms 17000 non-null float64
4 total_bedrooms 17000 non-null float64
5 population 17000 non-null float64
6 households 17000 non-null float64
7 median_income 17000 non-null float64
8 median_house_value 17000 non-null float64
dtypes: float64(9)
7.2.1.2 Try Dependent as Median House Value, and two Independent Variables
Prepare independent variable
X = np.array(df[['total_rooms', 'median_income']]).reshape(-1,2)
reg = LinearRegression()
model = reg.fit(X_train, y_train)
model.coef_, model.intercept_
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_true=y_test,y_pred=y_pred)
mse = mean_squared_error(y_true=y_test,y_pred=y_pred)
rmse = mean_squared_error(y_true=y_test,y_pred=y_pred,squared=False)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
MAE: 82.63220433916297
MSE: 18266.31046933536
RMSE: 135.1529151344334
R^2 0.8964203264448544
reg = LinearRegression()
model = reg.fit(X_train, y_train)
model.coef_, model.intercept_
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_true=y_test,y_pred=y_pred)
mse = mean_squared_error(y_true=y_test,y_pred=y_pred)
rmse = mean_squared_error(y_true=y_test,y_pred=y_pred,squared=False)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
MAE: 37.49868335804146
MSE: 4751.900928762269
RMSE: 68.9340331676761
R^2 0.9751011683609353
7.3 REGULARIZATION
7.3.1.1 Setup
You can either use a real dataset or generate a dummy dataset for this tutorial. For
simplicity, let’s create a dummy dataset using NumPy.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score, mean_squared_error
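A dummy dataset with a clearly non-linear pattern can be generated like this (the shape of the curve and the noise level are arbitrary choices for this sketch):

# Generate a noisy, non-linear dataset
np.random.seed(0)
X = np.random.uniform(-3, 3, 100)
y = X**2 + np.random.normal(0, 2, 100)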
plt.scatter(X, y)
<matplotlib.collections.PathCollection at 0x79a048d0e680>
print(f"Linear Regression:")
print(f"Training R-squared: {linear_r2_train:.4f}
, Training MSE: {linear_mse_train:.4f}")
print(f"Testing R-squared: {linear_r2_test:.4f}
, Testing MSE: {linear_mse_test:.4f}")
Linear Regression:
Training R-squared: 0.0008, Training MSE: 3.9027
Testing R-squared: -0.0453, Testing MSE: 3.5428
# Split the high-degree polynomial features into training and testing sets
X_poly_high_degree_train, X_poly_high_degree_test, y_train, y_test = train_test_split(
    X_poly_high_degree, y, test_size=0.2, random_state=0)
# Create and fit the Ridge, Lasso, and ElasticNet regression models
ridge_model = Ridge(alpha=alpha_ridge)
ridge_model.fit(X_poly_high_degree_train, y_train)
lasso_model = Lasso(alpha=alpha_lasso)
lasso_model.fit(X_poly_high_degree_train, y_train)
elasticnet_model = ElasticNet(alpha=alpha_elasticnet
, l1_ratio=l1_ratio_elasticnet)
elasticnet_model.fit(X_poly_high_degree_train, y_train)
# Make predictions on training and testing data for all regularized models
y_train_pred_ridge = ridge_model.predict(X_poly_high_degree_train)
y_test_pred_ridge = ridge_model.predict(X_poly_high_degree_test)
y_train_pred_lasso = lasso_model.predict(X_poly_high_degree_train)
y_test_pred_lasso = lasso_model.predict(X_poly_high_degree_test)
y_train_pred_elasticnet = elasticnet_model.predict(X_poly_high_degree_train)
y_test_pred_elasticnet = elasticnet_model.predict(X_poly_high_degree_test)
print("\nRegularization:")
print(f"Ridge Regression - Training R-squared: {ridge_r2_train:.4f}
, Testing R-squared: {ridge_r2_test:.4f}")
print(f"Lasso Regression - Training R-squared: {lasso_r2_train:.4f}
, Testing R-squared: {lasso_r2_test:.4f}")
print(f"ElasticNet Regression - Training R-squared: {elasticnet_r2_
train:.4f}
, Testing R-squared: {elasticnet_r2_test:.4f}")
Regularization:
Ridge Regression - Training R-squared: 0.9627, Testing R-squared: 0.9151
Lasso Regression - Training R-squared: 0.9554, Testing R-squared: 0.9225
ElasticNet Regression - Training R-squared: 0.9552, Testing R-squared: 0.9228
This tutorial demonstrates the use of different regularization techniques (Ridge, Lasso, Elastic Net) for regression analysis on a synthetic dataset. Users
will be able to understand how regularization helps in controlling overfitting and
improving the generalization of Linear Regression models. They can further explore
other real-world datasets and apply different regularization strategies to improve the
performance of regression models effectively.
7.3.2.1 Setup
We’ll start by importing the necessary libraries for data manipulation, visualization,
and regression analysis.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
# Create and fit the polynomial regression models with different degrees
polynomial_models = []
polynomial_r2_train_scores = []
polynomial_r2_test_scores = []
model = LinearRegression()
model.fit(X_train_poly, y_train)
polynomial_models.append(model)
polynomial_r2_train_scores.append(polynomial_r2_train)
polynomial_r2_test_scores.append(polynomial_r2_test)
print(f"\nPolynomial Regression:")
print(f"Best Degree: {best_degree}")
print(f"Training R-squared Scores: {polynomial_r2_train_scores}")
print(f"Testing R-squared Scores: {polynomial_r2_test_scores}")
Polynomial Regression:
Best Degree: 2
Training R-squared Scores:
[0.6852681982344955, 0.7441415681335484, 0.7893228446487628]
Testing R-squared Scores:
[0.6456819729261878, -18.38870805843526, -11476.104183339065]
poly_features = PolynomialFeatures(degree=2)
X_train_poly = poly_features.fit_transform(X_train_scaled)
X_test_poly = poly_features.transform(X_test_scaled)
# Create and fit the Ridge regression models with different alpha values
ridge_models = []
ridge_r2_train_scores = []
ridge_r2_test_scores = []
ridge_r2_train_scores.append(ridge_r2_train)
ridge_r2_test_scores.append(ridge_r2_test)
print(f"\nRidge Regression:")
print(f"Best Alpha: {best_alpha_ridge:.4f}")
Ridge Regression:
Best Alpha: 50.0000
Training R-squared Scores:
[0.6852681982309979, 0.6852681978848241, 0.6852681633541837,
0.685264794671512, 0.6849940748677977, 0.6835523857209759,
0.6816443257072609]
Testing R-squared Scores:
[0.645683225805578, 0.6456944994375847, 0.6458070098962285,
0.6469096540341595, 0.6558501677208112, 0.6655692803642396,
0.6672535561034868]
lasso_r2_train_scores.append(lasso_r2_train)
lasso_r2_test_scores.append(lasso_r2_test)
print(f"\nLasso Regression:")
print(f"Best Alpha: {best_alpha_lasso:.4f}")
print(f"Training R-squared Scores: {lasso_r2_train_scores}")
print(f"Testing R-squared Scores: {lasso_r2_test_scores}")
Lasso Regression:
Best Alpha: 0.0010
Training R-squared Scores:
# Create and fit the models with different alpha and l1_ratio values
elasticnet_models = []
elasticnet_r2_train_scores = []
elasticnet_r2_test_scores = []
elasticnet_r2_train_scores.append(elasticnet_r2_train)
elasticnet_r2_test_scores.append(elasticnet_r2_test)
# Find the best alpha and l1_ratio based on the testing R-squared score
best_alpha_elasticnet, best_l1_ratio_elasticnet = (
    alphas_elasticnet[np.argmax(elasticnet_r2_test_scores) // len(l1_ratios)],
    l1_ratios[np.argmax(elasticnet_r2_test_scores) % len(l1_ratios)])
print(f"\nElasticNet Regression:")
print(f"Best Alpha: {best_alpha_elasticnet:.4f}
, Best l1_ratio: {best_l1_ratio_elasticnet:.1f}")
print(f"Training R-squared Scores: {elasticnet_r2_train_scores}")
print(f"Testing R-squared Scores: {elasticnet_r2_test_scores}")
ElasticNet Regression:
Best Alpha: 0.0010, Best l1_ratio: 0.9
7.3.2.8 Visualization
You can visualize the R-squared scores for different regularization techniques.
# Plotting R-squared scores for different regularization techniques
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(alphas, ridge_r2_test_scores, label="Ridge Regression")
plt.plot(alphas, lasso_r2_test_scores, label="Lasso Regression")
plt.plot(elasticnet_r2_test_scores, label="ElasticNet Regression")
plt.xlabel("Alpha (Regularization Strength)")
plt.ylabel("Testing R-squared")
plt.title("R-squared Scores for Different Regularization Techniques")
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(degrees, polynomial_r2_test_scores, marker='o')
plt.xlabel("Polynomial Degree")
plt.ylabel("Testing R-squared")
plt.title("R-squared Scores for Different Polynomial Degrees")
plt.xticks(degrees)
plt.tight_layout()
plt.show()
7.4 CROSS-VALIDATION
7.4.1.1 Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, LeaveOneOut
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
<matplotlib.collections.PathCollection at 0x7c68f21d7e50>
7.4.1.3 Cross-Validation
Now, we’ll perform cross-validation using different techniques and compare the results.
# X and y are assumed to be NumPy arrays; compare several cross-validation techniques
cv_methods = ['3-Fold', '5-Fold', '10-Fold', 'LOOCV']
mse_scores = []
for cv in [KFold(n_splits=3), KFold(n_splits=5), KFold(n_splits=10), LeaveOneOut()]:
    mse_scores_cv = []
    for train_idx, test_idx in cv.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        mse_scores_cv.append(mean_squared_error(y[test_idx], y_pred))
    mse_scores.append(np.mean(mse_scores_cv))
7.4.1.4 Visualization
We can also visualize the mean squared errors for different cross-validation techniques
using a bar plot.
# Plotting mean squared errors for different cross-validation techniques
plt.figure(figsize=(8, 6))
plt.bar(cv_methods, mse_scores)
plt.xlabel("Cross-Validation Method")
plt.ylabel("Mean Squared Error")
plt.title("Mean Squared Error for Different Cross-Validation Techniques")
plt.show()
7.4.2.1 Setup
We’ll start by importing the necessary libraries for data manipulation, cross-validation,
regression, and dataset loading.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import KFold, LeaveOneOut
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
Longitude target
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422
7.4.2.3 Cross-Validation
Now, we’ll perform cross-validation using different techniques and compare the results.
# X and y are assumed to be NumPy arrays; LeaveOneOut can be slow on large datasets
cv_methods = ['3-Fold', '5-Fold', '10-Fold', 'LOOCV']
mse_scores = []
for cv in [KFold(n_splits=3), KFold(n_splits=5), KFold(n_splits=10), LeaveOneOut()]:
    mse_scores_cv = []
    for train_idx, test_idx in cv.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        mse_scores_cv.append(mean_squared_error(y[test_idx], y_pred))
    mse_scores.append(np.mean(mse_scores_cv))
7.4.2.4 Visualization
We can visualize the mean squared errors for different cross-validation techniques
using a bar plot.
# Plotting mean squared errors for different cross-validation techniques
plt.figure(figsize=(8, 6))
plt.bar(cv_methods, mse_scores)
plt.xlabel("Cross-Validation Method")
plt.ylabel("Mean Squared Error")
plt.title("Mean Squared Error for Different Cross-Validation Techniques")
plt.show()
Ensemble methods are powerful techniques in machine learning that combine the
predictions of multiple models to improve overall predictive performance. Ensemble
methods leverage the wisdom of crowds by combining multiple models to make
predictions that are often more accurate and robust than those of individual models.
7.5.1.1 Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
df = pd.DataFrame(
{'Sepal length': iris.data[:,0],
'Sepal width': iris.data[:,1],
'Petal length':iris.data[:,2],
'Petal width':iris.data[:,3],
'Species':iris.target})
df.head()
df = df[df['Species'] !=0]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 50 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
<seaborn.axisgrid.FacetGrid at 0x7fe5f92a1d30>
Figure 7.18 A Scatter Plot of Sepal Length VS Sepal Width with Species Differentiation
Train-test split: the training and testing datasets are split with test_size as the ratio. Here we use 80% for training and 20% for testing.
from sklearn.model_selection import train_test_split
X = df[df.columns[:4]]
y = df[df.columns[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
X_train
RandomForestClassifier(n_estimators=10)
[[10 0]
[ 1 9]]
Accuracy: 0.95
7.5.2.1 Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
df = pd.DataFrame(
{'Sepal length': iris.data[:,0],
'Sepal width': iris.data[:,1],
'Petal length':iris.data[:,2],
'Petal width':iris.data[:,3],
'Species':iris.target})
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal length 150 non-null float64
1 Sepal width 150 non-null float64
2 Petal length 150 non-null float64
3 Petal width 150 non-null float64
4 Species 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
<seaborn.axisgrid.FacetGrid at 0x7fe8817e8cd0>
Figure 7.19 A Scatter Plot of Sepal Length VS Sepal Width with Species Differentiation
Train-test split: the training and testing datasets are split with test_size as the ratio. Here we use 80% for training and 20% for testing.
X = df[df.columns[:4]]
y = df[df.columns[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
X_train
RandomForestClassifier(n_estimators=10)
[[11 0 0]
[ 0 6 0]
[ 0 1 12]]
Accuracy: 0.9666666666666667
We will explore three popular ensemble techniques: Bagging, Boosting, and Stacking.
For this tutorial, we’ll use the Gradient Boosting Regressor, Random Forest Regressor,
and a Simple Linear Regression as base models.
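A minimal, self-contained sketch of the three ensembles with those base models (the synthetic data here only stands in for the tutorial's dataset):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (BaggingRegressor, GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import LinearRegression

# A small synthetic regression problem
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging: many trees fit on bootstrap samples, predictions averaged
bagging = BaggingRegressor(n_estimators=50, random_state=42)

# Boosting: models fit sequentially, each correcting the previous errors
boosting = GradientBoostingRegressor(n_estimators=100, random_state=42)

# Stacking: Random Forest and Gradient Boosting combined by a Linear Regression meta-model
stacking = StackingRegressor(
    estimators=[('rf', RandomForestRegressor(random_state=42)),
                ('gb', GradientBoostingRegressor(random_state=42))],
    final_estimator=LinearRegression())

for name, model in [('Bagging', bagging), ('Boosting', boosting), ('Stacking', stacking)]:
    model.fit(X_train, y_train)
    print(name, 'R-squared:', round(model.score(X_test, y_test), 4))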
This comprehensive tutorial covers various regression techniques, including Linear
Regression, Polynomial Regression, Polynomial Regression with regularization (Ridge,
Lasso, Elastic Net), Multivariable Regression, and Random Forest Regression using
ensemble methods. Users will be able to understand the strengths and weaknesses
of each method and how to select appropriate models for different regression tasks.
They can further explore other real-world datasets and apply these regression models
to make accurate predictions.
# Create and fit the polynomial regression models with different degrees
polynomial_models = []
polynomial_r2_train_scores = []
polynomial_r2_test_scores = []
model = LinearRegression()
model.fit(X_train_poly, y_train)
polynomial_models.append(model)
polynomial_r2_train_scores.append(polynomial_r2_train)
polynomial_r2_test_scores.append(polynomial_r2_test)
print(f"\nPolynomial Regression:")
print(f"Best Degree: {best_degree}")
print(f"Training R-squared Scores: {polynomial_r2_train_scores}")
print(f"Testing R-squared Scores: {polynomial_r2_test_scores}")
Polynomial Regression:
Best Degree: 2
Training R-squared Scores:
[0.6852681982344955, 0.7441415681335484, 0.7893228446487628]
Testing R-squared Scores:
[0.6456819729261878, -18.38870805843526, -11476.104183339065]
# Create and fit the regression models with different alpha values
ridge_models = []
ridge_r2_train_scores = []
ridge_r2_test_scores = []
lasso_models = []
lasso_r2_train_scores = []
lasso_r2_test_scores = []
elasticnet_models = []
elasticnet_r2_train_scores = []
elasticnet_r2_test_scores = []
ridge_models.append(ridge_model)
ridge_r2_train_scores.append(ridge_r2_train)
ridge_r2_test_scores.append(ridge_r2_test)
lasso_model = Lasso(alpha=alpha)
lasso_model.fit(X_train_poly, y_train)
lasso_models.append(lasso_model)
lasso_r2_train_scores.append(lasso_r2_train)
lasso_r2_test_scores.append(lasso_r2_test)
elasticnet_r2_train_scores.append(elasticnet_r2_train)
elasticnet_r2_test_scores.append(elasticnet_r2_test)
# Find the best alpha and l1_ratio based on the testing R-squared score
best_alpha_ridge = alphas[np.argmax(ridge_r2_test_scores)]
best_alpha_lasso = alphas[np.argmax(lasso_r2_test_scores)]
best_alpha_elasticnet = alphas[np.argmax(elasticnet_r2_test_scores)]
best_l1_ratio_elasticnet = [0.2, 0.5, 0.7, 0.9][np.argmax(elasticnet_r2_test_scores) % 4]
multi_model.fit(X_train_scaled, y_train)
print(f"\nMultivariable Regression:")
print(f"Training R-squared: {multi_r2_train:.4f}
, Training MSE: {multi_mse_train:.4f}")
print(f"Testing R-squared: {multi_r2_test:.4f}
, Testing MSE: {multi_mse_test:.4f}")
Multivariable Regression:
Training R-squared: 0.6126, Training MSE: 0.5179
Testing R-squared: 0.5758, Testing MSE: 0.5559
7.5.3.8 Visualization
You can visualize the R-squared scores for different models.
# Plotting R-squared scores for different models
plt.figure(figsize=(12, 6))
train_scores = [linear_r2_train
, polynomial_r2_train_scores[np.argmax(polynomial_r2_test_scores)],
max(ridge_r2_train_scores), multi_r2_train, rf_r2_train]
test_scores = [linear_r2_test
, polynomial_r2_test_scores[np.argmax(polynomial_r2_test_scores)],
max(ridge_r2_test_scores), multi_r2_test, rf_r2_test]
x = np.arange(len(models))
width = 0.35
In this section, we will conduct a comprehensive case study to explore and compare
the performance of various regression methods we have introduced using a single
dataset. This hands-on approach will provide you with a practical understanding of
how different regression techniques perform in real-world scenarios.
The case study aims to demonstrate the strengths and weaknesses of different regression
methods, allowing you to make informed choices when selecting the most appropriate
technique for a specific regression task. You will work with a dataset that is suitable
for regression and apply the regression methods we have covered. Based on the case
study results, you will gain insights into which regression method(s) perform best for
the given dataset and regression task. You will also learn how to choose the most
suitable regression technique based on specific requirements and characteristics of a
problem.
, columns=data.feature_names + ['target'])
...
target
count 442.000000
mean 152.133484
std 77.093005
min 25.000000
25% 87.000000
50% 140.500000
75% 211.500000
max 346.000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 442 non-null float64
1 sex 442 non-null float64
2 bmi 442 non-null float64
3 bp 442 non-null float64
4 s1 442 non-null float64
5 s2 442 non-null float64
6 s3 442 non-null float64
7 s4 442 non-null float64
8 s5 442 non-null float64
9 s6 442 non-null float64
10 target 442 non-null float64
dtypes: float64(11)
memory usage: 38.1 KB
None
# Create and fit the polynomial regression models with different degrees
polynomial_models = []
polynomial_r2_train_scores = []
polynomial_r2_test_scores = []
model = LinearRegression()
model.fit(X_train_poly, y_train)
polynomial_models.append(model)
polynomial_r2_train_scores.append(polynomial_r2_train)
polynomial_r2_test_scores.append(polynomial_r2_test)
# Create and fit the regression models with different alpha values
ridge_models = []
ridge_r2_train_scores = []
ridge_r2_test_scores = []
lasso_models = []
lasso_r2_train_scores = []
lasso_r2_test_scores = []
elasticnet_models = []
elasticnet_r2_train_scores = []
elasticnet_r2_test_scores = []
y_train_pred = ridge_model.predict(X_train_poly)
y_test_pred = ridge_model.predict(X_test_poly)
ridge_r2_train_scores.append(ridge_r2_train)
ridge_r2_test_scores.append(ridge_r2_test)
lasso_model = Lasso(alpha=alpha)
lasso_model.fit(X_train_poly, y_train)
lasso_models.append(lasso_model)
lasso_r2_train_scores.append(lasso_r2_train)
lasso_r2_test_scores.append(lasso_r2_test)
elasticnet_r2_train_scores.append(elasticnet_r2_train)
elasticnet_r2_test_scores.append(elasticnet_r2_test)
# Find the best alpha values based on the testing R-squared scores
best_alpha_ridge = alphas[np.argmax(ridge_r2_test_scores)]
best_alpha_lasso = alphas[np.argmax(lasso_r2_test_scores)]
best_alpha_elasticnet = alphas[np.argmax(elasticnet_r2_test_scores)]
best_l1_ratio_elasticnet = [0.2, 0.5, 0.8][np.argmax(elasticnet_r2_test_scores) // len(alphas)]
print(f"\nMultivariable Regression:")
print(f"Training R-squared: {multi_r2_train:.4f}
, Training MSE: {multi_mse_train:.4f}")
print(f"Testing R-squared: {multi_r2_test:.4f}
, Testing MSE: {multi_mse_test:.4f}")
Multivariable Regression:
Training R-squared: 0.5279, Training MSE: 2868.5497
Testing R-squared: 0.4526, Testing MSE: 2900.1936
7.6.1.6 Cross-Validation
We will perform cross-validation using 3-folds, 5-folds, and 10-folds to assess the
performance of different regression models.
# Define a function to perform cross-validation
def perform_cross_validation(model, X, y, cv):
cv_scores = cross_val_score(model, X, y, scoring='r2', cv=cv)
mean_cv_score = np.mean(cv_scores)
return mean_cv_score
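For example, the helper can be called with the three fold counts used in this case study (assuming LinearRegression and the X, y arrays prepared earlier in the case study):

for folds in [3, 5, 10]:
    score = perform_cross_validation(LinearRegression(), X, y, cv=folds)
    print(f"{folds}-fold CV mean R-squared: {score:.4f}")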
rf_model.fit(X_train, y_train)
train_r2_scores = [slr_r2_train
, polynomial_r2_train_scores[np.argmax(polynomial_r2_test_scores)]
, ridge_r2_train_scores[np.argmax(ridge_r2_test_scores)]
, lasso_r2_train_scores[np.argmax(lasso_r2_test_scores)]
, elasticnet_r2_train_scores[np.argmax(elasticnet_r2_test_scores)]
, multi_r2_train, rf_r2_train, gb_r2_train, stack_r2_train]
test_r2_scores = [slr_r2_test
, polynomial_r2_test_scores[np.argmax(polynomial_r2_test_scores)]
, ridge_r2_test_scores[np.argmax(ridge_r2_test_scores)]
, lasso_r2_test_scores[np.argmax(lasso_r2_test_scores)]
, elasticnet_r2_test_scores[np.argmax(elasticnet_r2_test_scores)]
, multi_r2_test, rf_r2_test, gb_r2_test, stack_r2_test]
Clustering
There are many different clustering methods we can use with Scikit-learn; some of the most common, illustrated with a short usage sketch after this list, include:
• K-Means: A method that partitions a dataset into k clusters, where each cluster
is defined by the mean of the points assigned to that cluster.
• Hierarchical Clustering: A method that builds a hierarchy of clusters, where
each cluster is split into smaller clusters or merged with other clusters.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise): A
density-based clustering method that groups together points that are close to
each other, while marking as outliers points that are isolated.
• Gaussian Mixture Model: A probabilistic model that represents a dataset as a
mixture of Gaussian distributions.
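As a minimal usage sketch on a toy blob dataset (all parameter values here are illustrative only):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
gmm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(X)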
8.1.1 Tutorial
We learned both K-Means and K-Medoids as basic clustering methods. Let’s play
with them and observe the differences.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids
import matplotlib.pyplot as plt
import seaborn as sns
Create three clusters of data points. Each cluster has 100 data points, and each data point has two dimensions. The data points follow normal distributions with the specified means and standard deviations.
c1 = np.random.normal(5, 3, (100, 2))
c2 = np.random.normal(15, 5, (100, 2))
c3 = np.random.normal(-5, 2, (100, 2))
d = np.concatenate((c1, c2, c3), axis = 0)
d.shape
(300, 2)
<matplotlib.axes._subplots.AxesSubplot at 0x7f547096a2d0>
<matplotlib.axes._subplots.AxesSubplot at 0x7f546ef3fa10>
<matplotlib.axes._subplots.AxesSubplot at 0x7f54679ccc50>
Conclusion: K-Means and K-Medoids did equally well for this dataset.
Create an outlier. The outlier is far away from all other data points.
outlier = np.array([[100, 100]])
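The clustering runs behind the plots below follow the same pattern as before; a minimal sketch (using the data array d and the outlier defined above) is:

# Append the outlier and cluster the data with both methods
d_out = np.concatenate((d, outlier), axis=0)

kmeans = KMeans(n_clusters=3, n_init=10).fit(d_out)
kmedoids = KMedoids(n_clusters=3).fit(d_out)

# Compare the two label assignments visually
sns.scatterplot(x=d_out[:, 0], y=d_out[:, 1], hue=kmeans.labels_)
plt.show()
sns.scatterplot(x=d_out[:, 0], y=d_out[:, 1], hue=kmedoids.labels_)
plt.show()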
<matplotlib.axes._subplots.AxesSubplot at 0x7f54678f9a90>
<matplotlib.axes._subplots.AxesSubplot at 0x7f54678f6690>
<matplotlib.axes._subplots.AxesSubplot at 0x7f5467861390>
<seaborn.axisgrid.FacetGrid at 0x7f5467792910>
<matplotlib.axes._subplots.AxesSubplot at 0x7f5467754890>
<matplotlib.axes._subplots.AxesSubplot at 0x7f5467685b90>
<seaborn.axisgrid.FacetGrid at 0x7f54676c6250>
<matplotlib.axes._subplots.AxesSubplot at 0x7f546786dbd0>
<matplotlib.axes._subplots.AxesSubplot at 0x7f5467634bd0>
<seaborn.axisgrid.FacetGrid at 0x7f54673ce1d0>
<matplotlib.axes._subplots.AxesSubplot at 0x7f54673b0890>
<matplotlib.axes._subplots.AxesSubplot at 0x7f54672d65d0>
Conclusion: K-Medoids did well even with more outliers. K-Means failed.
8.1.2.1 Setup
Dataset loading and exploration
!pip install scikit-learn-extra
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score, adjusted_rand_score
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
target
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
sepal length (cm) sepal width (cm) petal length (cm) \
count 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000
std 0.828066 0.435866 1.765298
min 4.300000 2.000000 1.000000
25% 5.100000 2.800000 1.600000
50% 5.800000 3.000000 4.350000
75% 6.400000 3.300000 5.100000
max 7.900000 4.400000 6.900000
None
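The fitting and scoring steps that produce the results below can be sketched as follows (assuming the loaded DataFrame is named df, with the feature columns and target column shown above; the variable names match the later cells):

# Separate features and true labels, then standardize the features
X = df.drop(columns=['target']).values
y = df['target'].values
X_scaled = StandardScaler().fit_transform(X)

# K-means
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)
kmeans_silhouette_score = silhouette_score(X_scaled, kmeans_labels)
kmeans_ari_score = adjusted_rand_score(y, kmeans_labels)

# K-medoids
kmedoids = KMedoids(n_clusters=3, random_state=42)
kmedoids_labels = kmedoids.fit_predict(X_scaled)
kmedoids_silhouette_score = silhouette_score(X_scaled, kmedoids_labels)
kmedoids_ari_score = adjusted_rand_score(y, kmedoids_labels)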
print(f"K-means Clustering:")
print(f"Silhouette Score: {kmeans_silhouette_score:.4f}")
print(f"Adjusted Rand Index Score: {kmeans_ari_score:.4f}")
K-means Clustering:
Silhouette Score: 0.4599
Adjusted Rand Index Score: 0.6201
print(f"\nK-medoids Clustering:")
print(f"Silhouette Score: {kmedoids_silhouette_score:.4f}")
print(f"Adjusted Rand Index Score: {kmedoids_ari_score:.4f}")
K-medoids Clustering:
Silhouette Score: 0.4590
Adjusted Rand Index Score: 0.6312
plt.subplot(1, 2, 2)
plt.scatter(X_scaled[:, 0]
, X_scaled[:, 1], c=kmedoids_labels, cmap='viridis')
plt.scatter(kmedoids.cluster_centers_[:, 0]
, kmedoids.cluster_centers_[:, 1]
, marker='X', s=200, c='red', label='Medoids')
plt.title('K-medoids Clustering')
plt.xlabel('Sepal Length (Scaled)')
plt.ylabel('Sepal Width (Scaled)')
plt.legend()
plt.tight_layout()
plt.show()
8.2.1 Tutorial
Let’s create a dummy dataset and demonstrate agglomerative hierarchical clustering
using different numbers of clusters. We will observe how the number of clusters affects
the clustering results and the dendrogram visualization.
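A dummy two-feature dataset for this demonstration can be created like this (a sketch; the column names match the plotting code below):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Two well-separated groups of points
np.random.seed(42)
df = pd.DataFrame({
    'Feature1': np.concatenate([np.random.normal(0, 1, 50), np.random.normal(5, 1, 50)]),
    'Feature2': np.concatenate([np.random.normal(0, 1, 50), np.random.normal(5, 1, 50)]),
})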
plt.scatter(df['Feature1'], df['Feature2'])
<matplotlib.collections.PathCollection at 0x7c20b88d3dc0>
8.2.2.1 Setup
Dataset loading and exploration
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
target
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
sepal length (cm) sepal width (cm) petal length (cm) \
count 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000
std 0.828066 0.435866 1.765298
min 4.300000 2.000000 1.000000
25% 5.100000 2.800000 1.600000
50% 5.800000 3.000000 4.350000
75% 6.400000 3.300000 5.100000
max 7.900000 4.400000 6.900000
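The clustering and dendrogram cell for this case study is not shown above. The following is a minimal sketch using the imports from the setup; the Ward linkage and the choice of three clusters are assumptions:
# Standardize the four Iris features (data loading mirrors the setup output above)
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# Agglomerative clustering with three clusters and Ward linkage (both assumed)
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X_scaled)

# Dendrogram built from the same linkage criterion
Z = linkage(X_scaled, method='ward')
plt.figure(figsize=(10, 5))
dendrogram(Z, truncate_mode='level', p=5)
plt.title('Dendrogram (Ward linkage)')
plt.xlabel('Samples')
plt.ylabel('Distance')
plt.show()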
8.3.1 Tutorial
8.3.1.1 Setup
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
X = StandardScaler().fit_transform(X)  # X: feature matrix created in an earlier (elided) cell
8.3.1.2 DBSCAN
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
# Plot result (simplified: all members of a cluster are drawn alike, noise in black)
import matplotlib.pyplot as plt

unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        col = [0, 0, 0, 1]  # black used for noise points
    class_member_mask = labels == k
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(
        xy[:, 0],
        xy[:, 1],
        "o",
        markerfacecolor=tuple(col),
        markeredgecolor="k",
        markersize=14,
    )
plt.title(f"Estimated number of clusters: {n_clusters_}")
plt.show()
8.3.2.1 Setup
Before applying DBSCAN, we need to preprocess the data to standardize the features.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score, adjusted_rand_score
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
target
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
sepal length (cm) sepal width (cm) petal length (cm) \
count 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000
std 0.828066 0.435866 1.765298
min 4.300000 2.000000 1.000000
8.3.2.2 DBSCAN
DBSCAN clustering with different eps and min_samples
# Define a range of `eps` and `min_samples` values to try
eps_values = [0.2, 0.3, 0.4, 0.5]
min_samples_values = [2, 3, 4]
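# The search loop itself is elided in the text; the following sketch shows one way the
# combinations could be evaluated (data preparation, plot layout, and figure size are assumptions).
X_scaled = StandardScaler().fit_transform(load_iris().data)
fig, axes = plt.subplots(len(eps_values), len(min_samples_values), figsize=(12, 14))
for i, eps in enumerate(eps_values):
    for j, min_samples in enumerate(min_samples_values):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
        axes[i, j].scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis', s=15)
        axes[i, j].set_title(f'eps={eps}, min_samples={min_samples}')
plt.tight_layout()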
plt.show()
8.4.1 Tutorial
8.4.1.1 Setup
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
8.4.1.2 STING
from sklearn_extra.cluster import KMedoids
print(f"STING Clustering:")
print(f"Cluster Assignments: {sting_labels}")
STING Clustering:
Cluster Assignments: [2 0 1 3 0 ... 2 3 0 0 1]
plt.tight_layout()
plt.show()
8.4.1.3 CLIQUE
!pip install pyclustering
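The cell that builds and runs the CLIQUE model is not shown above; a minimal sketch with pyclustering follows, where the data variable, the number of grid intervals, and the density threshold are assumptions:
from pyclustering.cluster.clique import clique

# data: 2-D sample points, e.g. an array from make_blobs in the setup above (assumed)
amount_intervals = 10   # number of grid cells per dimension (assumed)
density_threshold = 0   # cells with more points than this are considered dense
clique_instance = clique(data.tolist(), amount_intervals, density_threshold)
clique_instance.process()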
# allocated clusters
clusters = clique_instance.get_clusters()
# points that are considered as outliers (in this example should be empty)
noise = clique_instance.get_noise()
# CLIQUE blocks that forms grid
cells = clique_instance.get_cells()
Amount of clusters: 4
8.4.2.1 Setup
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
print(f"STING Clustering:")
print(f"Adjusted Rand Index Score: {ari_score_sting:.4f}")
STING Clustering:
Adjusted Rand Index Score: 0.2816
plt.tight_layout()
plt.show()
print(f"OPTICS Clustering:")
print(f"Adjusted Rand Index Score: {ari_score_optics:.4f}")
OPTICS Clustering:
Adjusted Rand Index Score: 1.0000
print(f"DBSCAN Clustering:")
print(f"Adjusted Rand Index Score: {ari_score_dbscan:.4f}")
DBSCAN Clustering:
Adjusted Rand Index Score: 1.0000
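The code behind the scores above is not shown. The following is a minimal sketch of the OPTICS and DBSCAN runs on a two-moons dataset; the grid-based STING step is omitted here, and the noise level, eps, and min_samples values are assumptions:
from sklearn.cluster import DBSCAN, OPTICS
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaving half-moons with known labels for external evaluation
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

optics_labels = OPTICS(min_samples=10).fit_predict(X_scaled)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

ari_score_optics = adjusted_rand_score(y_true, optics_labels)
ari_score_dbscan = adjusted_rand_score(y_true, dbscan_labels)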
8.5.1 Tutorial
8.5.1.1 Round 1: Using Digits Dataset
## Load dataset and preprocess it
from sklearn.datasets import load_digits
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
dataset = load_digits()
dataset.keys()
dataset.data.shape
(1797, 64)
dataset.data[0]
array([ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10.,
15., 5., 0., 0., 3., 15., 2., 0., 11., 8., 0., 0., 4.,
12., 0., 0., 8., 8., 0., 0., 5., 8., 0., 0., 9., 8.,
0., 0., 4., 11., 0., 1., 12., 7., 0., 0., 2., 14., 5.,
10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.])
dataset.data[0].reshape(8,8)
plt.gray()
plt.matshow(dataset.data[0].reshape(8,8))
for i in range(10):
plt.matshow(dataset.data[i].reshape(8,8))
dataset.target
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df.head()
df.describe()
X = df
y = dataset.target
scaler = StandardScaler()
X = scaler.fit_transform(X)
X
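The train/test split is not shown; a sketch assuming a standard holdout split (the test_size and random_state values are assumptions):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)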
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
model.score(X_test, y_test)
0.9722222222222222
Conclusion: With 100% information (64 features), we can achieve 97.2% accuracy.
X.shape
(1797, 64)
from sklearn.decomposition import PCA
pca = PCA(0.95)
X_pca = pca.fit_transform(X)
X_pca.shape
(1797, 40)
pca.explained_variance_ratio_
pca.n_components_
40
X_pca
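Likewise, the split on the PCA-transformed features is not shown. The same pattern, repeated for each PCA setting in this section, might look like the following, with test_size and random_state assumed:
X_train_pca, X_test_pca, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.2, random_state=30)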
model = LogisticRegression(max_iter=1000)
model.fit(X_train_pca, y_train)
model.score(X_test_pca, y_test)
0.9638888888888889
X.shape
(1797, 64)
pca = PCA(0.80)
X_pca = pca.fit_transform(X)
X_pca.shape
(1797, 21)
pca.explained_variance_ratio_
pca.n_components_
21
X_pca
-1.13345406, -0.52591658],
...,
[ 1.02259599, -0.14791087, 2.46997365, ..., -1.61210006,
0.18230257, 0.16666651],
[ 1.07605522, -0.38090625, -2.45548693, ..., -1.76918064,
0.77471846, -0.13566828],
[-1.25770233, -2.22759088, 0.28362789, ..., -2.43897852,
-1.13276155, -1.11458695]])
model = LogisticRegression(max_iter=1000)
model.fit(X_train_pca, y_train)
model.score(X_test_pca, y_test)
0.9472222222222222
X.shape
(1797, 64)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
X_pca.shape
(1797, 2)
pca.explained_variance_ratio_
array([0.12033916, 0.09561054])
pca.n_components_
2
X_pca
model = LogisticRegression(max_iter=1000)
model.fit(X_train_pca, y_train)
model.score(X_test_pca, y_test)
0.5666666666666667
X.shape
(1797, 64)
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)
X_pca.shape
(1797, 10)
pca.explained_variance_ratio_
pca.n_components_
10
X_pca
model = LogisticRegression(max_iter=1000)
model.fit(X_train_pca, y_train)
model.score(X_test_pca, y_test)
0.8805555555555555
from sklearn.datasets import load_wine
dataset = load_wine()
dataset.keys()
dataset.data.shape
(178, 13)
dataset.data[0]
dataset.target
array([0, 0, 0, 0, 0, ... 2, 2, 2,
2, 2])
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df.head()
df.describe()
X = df
y = dataset.target
scaler = StandardScaler()
X = scaler.fit_transform(X)
X
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
model.score(X_test, y_test)
0.9722222222222222
Conclusion: With 100% information (13 features), we can achieve 97.2% accuracy.
X.shape
(178, 13)
pca = PCA(0.95)
X_pca = pca.fit_transform(X)
X_pca.shape
(178, 10)
pca.explained_variance_ratio_
pca.n_components_
10
X_pca
model = LogisticRegression(max_iter=1000)
model.fit(X_train_pca, y_train)
model.score(X_test_pca, y_test)
0.9722222222222222
Conclusion: With 95% information (10 features), we achieved 97.2% accuracy, just as good as using all 13 features.
-1.40069891, 0.29649784],
[ 1.39508604, 1.58316512, 1.36520822, ..., -1.52437837,
-1.42894777, -0.59516041]])
X.shape
(178, 13)
pca = PCA(0.80)
X_pca = pca.fit_transform(X)
X_pca.shape
(178, 5)
pca.explained_variance_ratio_
pca.n_components_
5
X_pca
model = LogisticRegression(max_iter=1000)
model.fit(X_train_pca, y_train)
model.score(X_test_pca, y_test)
0.9722222222222222
X.shape
(178, 13)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
X_pca.shape
(178, 2)
pca.explained_variance_ratio_
array([0.36198848, 0.1920749 ])
pca.n_components_
2
X_pca
model = LogisticRegression(max_iter=1000)
model.fit(X_train_pca, y_train)
model.score(X_test_pca, y_test)
0.9722222222222222
X.shape
(178, 13)
pca = PCA(n_components=1)
X_pca = pca.fit_transform(X)
X_pca.shape
(178, 1)
pca.explained_variance_ratio_
array([0.36198848])
pca.n_components_
1
X_pca
array([[ 3.31675081],
[ 2.20946492],
[ 2.51674015],
...
[-2.38701709],
[-3.20875816]])
model = LogisticRegression(max_iter=1000)
model.fit(X_train_pca, y_train)
model.score(X_test_pca, y_test)
0.8888888888888888
In this section, we will conduct a comprehensive case study to explore and compare
the performance of various clustering methods we have introduced using a single
dataset. This hands-on approach will provide you with practical insights into how
different clustering techniques perform in real-world scenarios.
The case study aims to evaluate and compare the performance of different clustering
methods, allowing you to make informed choices when selecting the most appropriate
technique for a specific clustering task. You will work with a dataset suitable for
clustering and apply the clustering methods we have covered. Based on the case study
results, you will gain insights into which clustering method(s) perform best for the
given dataset and clustering task. You will also learn how to choose the most suitable
clustering technique based on specific requirements and characteristics of a problem.
8.6.1.1 Setup
Load and explore the “penguins” dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
penguins = sns.load_dataset('penguins')
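The exploration calls themselves are not shown; a minimal sketch that would produce the output below:
print(penguins.head())
print(penguins.info())
print(penguins.describe())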
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0
1  Adelie  Torgersen            39.5           17.4              186.0
2  Adelie  Torgersen            40.3           18.0              195.0
3  Adelie  Torgersen             NaN            NaN                NaN
4  Adelie  Torgersen            36.7           19.3              193.0
   body_mass_g     sex
0       3750.0    Male
1       3800.0  Female
2       3250.0  Female
3          NaN     NaN
4       3450.0  Female
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 species 344 non-null object
1 island 344 non-null object
2 bill_length_mm 342 non-null float64
3 bill_depth_mm 342 non-null float64
4 flipper_length_mm 342 non-null float64
5 body_mass_g 342 non-null float64
6 sex 333 non-null object
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
None
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
count 342.000000 342.000000 342.000000 342.000000
mean 43.921930 17.151170 200.915205 4201.754386
std 5.459584 1.974793 14.061714 801.954536
min 32.100000 13.100000 172.000000 2700.000000
25% 39.225000 15.600000 190.000000 3550.000000
50% 44.450000 17.300000 197.000000 4050.000000
75% 48.500000 18.700000 213.000000 4750.000000
max          59.600000      21.500000         231.000000  6300.000000
CHAPTER 9
Frequent Patterns
There are many different frequent-pattern mining methods available in Python, but some of the most common include:
• Apriori: An algorithm that generates frequent itemsets by iteratively removing
itemsets that do not meet a minimum support threshold.
• FP-Growth: An algorithm that generates frequent itemsets by building a compact
data structure called a frequent-pattern tree, which allows for efficient generation
of frequent itemsets.
Frequent itemset mining and association rule analysis are fundamental techniques in
Data Mining used to discover associations and patterns within transactional datasets.
This section introduces frequent itemsets and association rules using the mlxtend
package, a versatile library for association analysis.
Frequent itemset mining identifies sets of items that frequently co-occur in transactions,
while association rule analysis uncovers meaningful relationships and dependencies
among items. These techniques provide valuable insights into customer behavior,
product recommendations, and more. The mlxtend package offers a user-friendly
environment to perform frequent itemset mining and association rule analysis efficiently.
You will explore practical implementation steps, including using the apriori function in mlxtend to discover frequent itemsets based on specified support thresholds, extracting association rules from frequent itemsets using the association_rules function, and interpreting rule-quality metrics such as support, confidence, and lift.
9.1.1.1 Setup
Before proceeding, make sure you have the mlxtend library installed, which provides
an efficient implementation of the Apriori algorithm.
!pip install mlxtend
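The cell that builds the transaction data and mines the itemsets is not shown above; a minimal sketch follows. The transaction list here is hypothetical (it happens to reproduce the support values printed below, but the book's actual data may differ), and the 0.4 support threshold is an assumption.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Hypothetical market-basket transactions
transactions = [
    ['Bread', 'Butter', 'Eggs', 'Milk'],
    ['Bread', 'Butter', 'Eggs'],
    ['Bread', 'Butter', 'Milk'],
    ['Bread', 'Eggs', 'Milk'],
    ['Bread', 'Eggs'],
    ['Bread', 'Milk'],
    ['Eggs', 'Milk'],
]

# One-hot encode the transactions into a Boolean DataFrame
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Mine itemsets that appear in at least 40% of transactions
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)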
print(frequent_itemsets)
support itemsets
0 0.857143 (Bread)
1 0.428571 (Butter)
2 0.714286 (Eggs)
3 0.714286 (Milk)
4 0.428571 (Butter, Bread)
5 0.571429 (Eggs, Bread)
6 0.571429 (Milk, Bread)
7 0.428571 (Eggs, Milk)
9.1.2.1 Setup
Install required packages: Make sure you have the mlxtend library installed, as it
provides an efficient implementation of the Apriori algorithm.
!pip install mlxtend
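The mining cell is again not shown; a sketch reusing the encoded DataFrame from the tutorial above, with a 0.6 minimum confidence assumed to match the rules printed below:
from mlxtend.frequent_patterns import apriori, association_rules

frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)
association_rules_df = association_rules(frequent_itemsets,
                                         metric="confidence", min_threshold=0.6)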
print("Frequent Itemsets:")
print(frequent_itemsets)
Frequent Itemsets:
support itemsets
0 0.857143 (Bread)
1 0.428571 (Butter)
2 0.714286 (Eggs)
3 0.714286 (Milk)
4 0.428571 (Butter, Bread)
5 0.571429 (Eggs, Bread)
6 0.571429 (Milk, Bread)
7 0.428571 (Eggs, Milk)
print("\nAssociation Rules:")
print(association_rules_df[['antecedents', 'consequents',
'support', 'confidence','lift']])
Association Rules:
antecedents consequents support confidence lift
0 (Butter) (Bread) 0.428571 1.000000 1.166667
1 (Eggs) (Bread) 0.571429 0.800000 0.933333
2 (Bread) (Eggs) 0.571429 0.666667 0.933333
3 (Milk) (Bread) 0.571429 0.800000 0.933333
4 (Bread) (Milk) 0.571429 0.666667 0.933333
5 (Eggs) (Milk) 0.428571 0.600000 0.840000
6 (Milk) (Eggs) 0.428571 0.600000 0.840000
In this tutorial, we demonstrated how to perform association rule mining using the
Apriori algorithm in Python. The Apriori algorithm is a powerful technique for finding
frequent itemsets and generating association rules, and it is widely used for market
basket analysis and recommendation systems.
Feel free to experiment with different dataset examples and to adjust the support and confidence thresholds to discover more or fewer frequent itemsets and association rules for your specific use case.
Apriori and FP-Growth are classic algorithms used in frequent itemset mining and
association rule analysis. This section introduces these two influential algorithms using
the mlxtend package.
Apriori and FP-Growth are prominent algorithms for discovering frequent itemsets
and generating association rules from transactional data. They play a crucial role
in understanding customer behavior, product recommendations, and market basket
analysis. The mlxtend package provides user-friendly tools to implement Apriori and
FP-Growth efficiently. You will explore practical implementation steps, including using
the apriori and fpgrowth functions in mlxtend to discover frequent itemsets based
on specified support thresholds, extracting association rules from frequent itemsets
using the association_rules function, and comparing the efficiency and performance
of Apriori and FP-Growth.
9.2.1.1 Setup
Make sure you have the mlxtend library installed, as it provides an efficient imple-
mentation of the Apriori algorithm.
You can install the mlxtend library using pip:
!pip install mlxtend
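The cell that builds the synthetic transaction data is not shown; a sketch of how a 1,000-row Boolean item matrix like the one below could be generated (the item probability and the random seed are assumptions):
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

np.random.seed(42)
items = [f'Item{i}' for i in range(1, 11)]

# Each item appears in a basket with probability ~0.53, close to the supports below
data = np.random.rand(1000, len(items)) < 0.53
df = pd.DataFrame(data, columns=items)
df = df.reindex(sorted(df.columns), axis=1)   # alphabetical column order, as displayed below
print(df)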
Item1 Item10 Item2 Item3 Item4 Item5 Item6 Item7 Item8 Item9
0 True True True True False False True True False True
1 True True True True False False True True True True
2 False False False True False False False True False False
3 True True True True True True True True True True
4 True True True True True True True False True True
.. ... ... ... ... ... ... ... ... ... ...
995 True False True True False False False False False True
996 False False False False False False False True False False
997 True False False True True True False True False True
998 False True False True False False False True False False
999 False False False False True False False True True True
frequent_itemsets = apriori(df, min_support=0.35,   # threshold assumed from the smallest support below
                            use_colnames=True)
print("Frequent Itemsets:")
print(frequent_itemsets)
Frequent Itemsets:
support itemsets
0 0.530 (Item1)
1 0.546 (Item10)
2 0.543 (Item2)
3 0.545 (Item3)
4 0.527 (Item4)
5 0.540 (Item5)
6 0.533 (Item6)
7 0.532 (Item7)
8 0.533 (Item8)
9 0.545 (Item9)
10 0.358 (Item1, Item5)
11 0.358 (Item2, Item10)
12 0.359 (Item3, Item10)
13 0.351 (Item4, Item10)
14 0.353 (Item5, Item10)
15 0.364 (Item9, Item10)
16 0.351 (Item2, Item4)
17 0.357 (Item2, Item6)
18 0.352 (Item8, Item2)
19 0.362 (Item2, Item9)
20 0.354 (Item3, Item4)
21 0.362 (Item3, Item5)
22 0.357 (Item3, Item6)
23 0.360 (Item3, Item7)
24 0.362 (Item3, Item8)
25 0.356 (Item3, Item9)
26 0.358 (Item4, Item5)
27 0.350 (Item8, Item4)
28 0.352 (Item4, Item9)
29 0.352 (Item5, Item6)
30 0.352 (Item8, Item5)
31 0.356 (Item5, Item9)
32 0.354 (Item8, Item6)
33 0.355 (Item8, Item9)
print("\nAssociation Rules:")
print(association_rules_df)
Association Rules:
antecedents consequents antecedent support consequent support support \
0 (Item1) (Item5) 0.530 0.540 0.358
1 (Item4) (Item3) 0.527 0.545 0.354
2 (Item5) (Item3) 0.540 0.545 0.362
3 (Item7) (Item3) 0.532 0.545 0.360
4 (Item8) (Item3) 0.533 0.545 0.362
5 (Item4) (Item5) 0.527 0.540 0.358
9.2.2.1 Setup
Import necessary libraries.
!pip install mlxtend
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules
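The data-preparation and mining cell is not shown; a sketch that reuses the hypothetical transaction list from the Apriori tutorial above, with the 0.4 support and 0.6 confidence thresholds assumed:
# `transactions` is the same hypothetical list used in the Apriori tutorial sketch
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_itemsets = fpgrowth(df, min_support=0.4, use_colnames=True)
association_rules_df = association_rules(frequent_itemsets,
                                         metric="confidence", min_threshold=0.6)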
print(frequent_itemsets)
support itemsets
0 0.857143 (Bread)
1 0.714286 (Milk)
2 0.714286 (Eggs)
3 0.428571 (Butter)
4 0.571429 (Milk, Bread)
5 0.571429 (Eggs, Bread)
6 0.428571 (Eggs, Milk)
7 0.428571 (Butter, Bread)
print("\nAssociation Rules:")
print(association_rules_df[['antecedents', 'consequents',
'support', 'confidence','lift']])
Association Rules:
antecedents consequents support confidence lift
0 (Milk) (Bread) 0.571429 0.800000 0.933333
1 (Bread) (Milk) 0.571429 0.666667 0.933333
2 (Eggs) (Bread) 0.571429 0.800000 0.933333
3 (Bread) (Eggs) 0.571429 0.666667 0.933333
4 (Eggs) (Milk) 0.428571 0.600000 0.840000
5 (Milk) (Eggs) 0.428571 0.600000 0.840000
9.2.3.1 Setup
Make sure you have the mlxtend library installed, as it provides an efficient imple-
mentation of the Apriori algorithm.
You can install the mlxtend library using pip:
!pip install mlxtend
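The loading cell is not shown; a sketch assuming the groceries transactions are already one-hot encoded in a CSV file (the file name is hypothetical):
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical file name; the book's actual data source is not shown
df = pd.read_csv('groceries_onehot.csv').astype(bool)
print(df.info())
print(df.head(10))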
<class 'pandas.core.frame.DataFrame'>
Int64Index: 999 entries, 0 to 998
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Apple 999 non-null bool
1 Bread 999 non-null bool
2 Butter 999 non-null bool
3 Cheese 999 non-null bool
4 Corn 999 non-null bool
5 Dill 999 non-null bool
6 Eggs 999 non-null bool
7 Ice cream 999 non-null bool
8 Kidney Beans 999 non-null bool
9 Milk 999 non-null bool
10 Nutmeg 999 non-null bool
11 Onion 999 non-null bool
12 Sugar 999 non-null bool
13 Unicorn 999 non-null bool
14 Yogurt 999 non-null bool
15 chocolate 999 non-null bool
dtypes: bool(16)
Apple Bread Butter Cheese Corn Dill Eggs Ice cream Kidney Beans \
0 False True False False True True False True False
1 False False False False False False False False False
2 True False True False False True False True False
3 False False True True False True False False False
4 True True False False False False False False False
5 True True True True False True False True False
6 False False True False False False True True True
7 True False False True False False True False False
8 True False False False True True True True False
9 True False False False False True True True False
print("Frequent Itemsets:")
print(frequent_itemsets)
Frequent Itemsets:
support itemsets
0 0.383383 (Apple)
1 0.384384 (Bread)
2 0.420420 (Butter)
3 0.404404 (Cheese)
4 0.407407 (Corn)
5 0.398398 (Dill)
6 0.384384 (Eggs)
print("\nAssociation Rules:")
print(association_rules_df)
Association Rules:
antecedents consequents antecedent support consequent support support \
0 (Ice cream) (Butter) 0.410410 0.420420 0.207207
1 (Milk) (chocolate) 0.405405 0.421421 0.211211
2 (chocolate) (Milk) 0.421421 0.405405 0.211211
In this case study, we demonstrated how to perform association rule mining using the
Apriori algorithm with the “Groceries” dataset. The Apriori algorithm is a powerful
technique for finding frequent itemsets and generating association rules, and it is
widely used for market basket analysis and recommendation systems.
Feel free to experiment with different datasets and to adjust the support and confidence thresholds to discover more or fewer frequent itemsets and association rules for your specific use case.
CHAPTER 10
Outlier Detection
Outlier detection is a critical task in data analysis, aiming to identify data points that
deviate significantly from the norm. This section introduces various outlier detection
methods using the Scikit-learn package.
Outliers are data points that exhibit unusual behavior compared to the majority of
data in a dataset. Detecting outliers is essential in various domains, including fraud
detection, quality control, and anomaly detection. Scikit-learn offers a wide range of
tools to implement outlier detection methods efficiently. You will explore practical
implementation steps, including using Scikit-learn’s modules to apply various outlier
detection techniques, such as Z-score, IQR, One-Class SVM, Isolation Forest, DBSCAN,
and LOF, customizing parameters and thresholds for each method to adapt to specific
datasets, and visualizing outlier detection results to understand data anomalies.
10.1.1 Tutorial
10.1.1.1 Dataset Creation
In this code, we generate 500 normal data points following a normal distribution with
mean [5, 10] and standard deviation [1, 2]. Then, we introduce 50 outliers by adding
noise to the data with mean [20, 30] and standard deviation [5, 8]. The dataset is
then combined, and a binary target variable (Outlier) is assigned to indicate whether
a data point is an outlier (1) or not (0).
The scatter plot visualizes the dataset, where outliers are shown in a different color.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
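The generation code is not shown here; a minimal sketch that follows the description above (the random seed is an assumption):
np.random.seed(42)  # seed assumed for reproducibility

# 500 normal points and 50 outliers, as described above
normal = np.random.normal(loc=[5, 10], scale=[1, 2], size=(500, 2))
outliers = np.random.normal(loc=[20, 30], scale=[5, 8], size=(50, 2))

df = pd.DataFrame(np.vstack([normal, outliers]), columns=['Feature1', 'Feature2'])
df['Outlier'] = [0] * 500 + [1] * 50

plt.scatter(df['Feature1'], df['Feature2'], c=df['Outlier'], cmap='coolwarm')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.title('Dummy dataset with outliers')
plt.show()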
10.1.1.2 Z-score
In this code, we calculate the Z-scores for each data point with respect to both
Feature1 and Feature2. We set a threshold value (in this case, 3) to determine outliers.
Data points with a Z-score greater than the threshold in either feature are considered
outliers.
The scatter plot visualizes the dataset, where normal data points are shown in blue,
and the outliers detected by the Z-score method are shown in red.
from scipy import stats
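The detection and plotting code is not shown; a minimal sketch that follows the description above:
# Z-scores of both features; scores above the threshold of 3 mark outliers
z_scores = np.abs(stats.zscore(df[['Feature1', 'Feature2']].values))
threshold = 3
outliers_z = (z_scores > threshold).any(axis=1)

plt.scatter(df['Feature1'], df['Feature2'], c='blue', label='Normal')
plt.scatter(df[outliers_z]['Feature1'], df[outliers_z]['Feature2'],
            c='red', label='Outlier (Z-score)')
plt.legend()
plt.show()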
10.1.1.3 IQR
Let’s use the same dummy dataset and apply the Interquartile Range (IQR) method
for outlier detection. The IQR method is based on the range between the first quartile
(Q1) and the third quartile (Q3) of the data. Data points outside a specified range
(usually defined as Q1 - 1.5 * IQR and Q3 + 1.5 * IQR) are considered outliers.
In this code, we calculate the first quartile (Q1), third quartile (Q3), and IQR for
each feature (Feature1 and Feature2). We then define the outlier range as Q1 - 1.5 *
IQR and Q3 + 1.5 * IQR for each feature. Data points falling outside this range in
either feature are considered outliers.
The scatter plot visualizes the dataset, where normal data points are shown in blue,
and the outliers detected by the IQR method are shown in red.
# Calculate Q1, Q3, and IQR for each feature
Q1 = df[['Feature1', 'Feature2']].quantile(0.25)
Q3 = df[['Feature1', 'Feature2']].quantile(0.75)
IQR = Q3 - Q1
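The remaining steps are not shown; a minimal sketch that completes the method as described above:
# Points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in either feature are flagged as outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
features = df[['Feature1', 'Feature2']]
outliers_iqr = ((features < lower_bound) | (features > upper_bound)).any(axis=1)

plt.scatter(df['Feature1'], df['Feature2'], c='blue', label='Normal')
plt.scatter(df[outliers_iqr]['Feature1'], df[outliers_iqr]['Feature2'],
            c='red', label='Outlier (IQR)')
plt.legend()
plt.show()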
10.1.1.6 DBSCAN
Let’s continue with the same dummy dataset and apply the Density-Based Spatial
Clustering of Applications with Noise (DBSCAN) algorithm for outlier detection.
DBSCAN is a density-based clustering algorithm that can be used for outlier detection
by identifying points that are not part of any dense cluster.
In this code, we fit the DBSCAN model to the combined feature array X, which
contains both Feature1 and Feature2. The eps parameter defines the maximum distance
between two samples for them to be considered as part of the same cluster. The
min_samples parameter specifies the minimum number of samples in a neighborhood
for a point to be considered as a core point.
The dbscan.labels_ attribute contains the cluster labels assigned by DBSCAN. Points
that are not part of any cluster are assigned the label -1, which indicates outliers.
The scatter plot visualizes the dataset, where normal data points are shown in blue,
and the outliers detected by the DBSCAN algorithm are shown in red.
from sklearn.cluster import DBSCAN
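The fitting and plotting code is not shown; a minimal sketch, where the eps and min_samples values are assumptions rather than the book's settings:
X = df[['Feature1', 'Feature2']].values

# eps and min_samples are assumed values
dbscan = DBSCAN(eps=1.0, min_samples=5).fit(X)
outliers_dbscan = dbscan.labels_ == -1   # label -1 marks points outside any dense cluster

plt.scatter(X[~outliers_dbscan, 0], X[~outliers_dbscan, 1], c='blue', label='Normal')
plt.scatter(X[outliers_dbscan, 0], X[outliers_dbscan, 1], c='red', label='Outlier (DBSCAN)')
plt.legend()
plt.show()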
10.1.1.7 LOF
Let’s continue with the same dummy dataset and apply the Local Outlier Factor
(LOF) method for outlier detection. LOF is a density-based outlier detection method
that measures the local density deviation of a data point with respect to its neighbors.
Outliers are identified as data points with significantly lower local density compared
to their neighbors.
In this code, we fit the LOF model to the combined feature array X, which contains
both Feature1 and Feature2. The n_neighbors parameter defines the number of
neighbors considered for calculating the local density deviation of each data point.
The contamination parameter controls the expected proportion of outliers in the data.
The lof.fit_predict(X) method returns an array of predictions, where -1 indicates an
outlier and 1 indicates an inlier. We convert the predictions to Boolean values, where
True represents an outlier and False represents an inlier.
The scatter plot visualizes the dataset, where normal data points are shown in blue,
and the outliers detected by the LOF method are shown in red.
from sklearn.neighbors import LocalOutlierFactor
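The fitting and plotting code is not shown; a minimal sketch, where the n_neighbors and contamination values are assumptions:
# n_neighbors and contamination are assumed values
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
predictions = lof.fit_predict(X)       # X: the two-feature array from the DBSCAN step
outliers_lof = predictions == -1       # True marks an outlier, as described above

plt.scatter(X[~outliers_lof, 0], X[~outliers_lof, 1], c='blue', label='Normal')
plt.scatter(X[outliers_lof, 0], X[outliers_lof, 1], c='red', label='Outlier (LOF)')
plt.legend()
plt.show()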
Time V1 V2 V3 V4 V5 V6 \
128634 78785.0 -0.659717 1.183753 0.483915 1.210817 -0.035700 0.188756
224924 144024.0 1.997011 0.110559 -1.608624 0.337948 0.447722 -0.561137
266887 162525.0 2.047366 0.081031 -1.782673 0.251559 0.558326 -0.433816
72073 54554.0 1.377808 -1.975197 0.013025 -1.915389 0.035267 4.520147
145549 87041.0 -1.690957 2.297320 0.088259 -1.348462 1.239065 -1.195810
[5 rows x 31 columns]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 128634 to 133610
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
V5 V6 V7 V8 V9 \
count 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000
mean 0.000447 -0.006421 0.002671 0.008939 0.010266
std 1.368033 1.322849 1.208984 1.112960 1.103543
min -23.669726 -13.591286 -24.419483 -37.353443 -4.679402
25% -0.694127 -0.762416 -0.549120 -0.208433 -0.643822
50% -0.062717 -0.259929 0.043512 0.016011 -0.039315
75% 0.604511 0.399681 0.558851 0.321598 0.621245
max 19.180525 15.568823 28.069822 16.635979 6.965981
Class
count 10000.00000
mean 0.00160
std 0.03997
min 0.00000
25% 0.00000
50% 0.00000
75% 0.00000
max 1.00000
[8 rows x 31 columns]
df['Class'].value_counts()
0 9984
1 16
Name: Class, dtype: int64
0 1348
1 13
Name: Class, dtype: int64
plt.bar(['Normal', 'Outlier'],
        [len(df) - outliers_iqr.sum(), outliers_iqr.sum()])
plt.xlabel(f'Data Points {len(df) - outliers_iqr.sum()}:{outliers_iqr.sum()}')
plt.ylabel('Count')
plt.title('Outliers Detected by IQR Method')
plt.show()
df[outliers_iqr]['Class'].value_counts()
0 4761
1 16
Name: Class, dtype: int64
0 998
1 1
Name: Class, dtype: int64
0 190
1 10
Name: Class, dtype: int64
plt.bar(['Normal', 'Outlier'],
        [len(df) - outliers_dbscan.sum(), outliers_dbscan.sum()])
plt.xlabel(f'Data Points {len(df) - outliers_dbscan.sum()}:{outliers_dbscan.sum()}')
plt.ylabel('Count')
plt.title('Outliers Detected by DBSCAN Method')
plt.show()
df[outliers_dbscan]['Class'].value_counts()
0 1472
1 8
Name: Class, dtype: int64
plt.bar(['Normal', 'Outlier'],
        [len(df) - outliers_lof.sum(), outliers_lof.sum()])
plt.xlabel(f'Data Points {len(df) - outliers_lof.sum()}:{outliers_lof.sum()}')
plt.ylabel('Count')
plt.title('Outliers Detected by Local Outlier Factor (LOF) Method')
plt.show()
df[outliers_lof]['Class'].value_counts()
0 198
1 2
Name: Class, dtype: int64
10.1.2.8 Conclusion
In this specific context, not every method works well: some flag thousands of normal transactions as outliers, while others catch only a few of the 16 fraudulent ones.
That completes the tutorial for outlier detection using different methods on the “Credit
Card Fraud Detection” dataset. You can compare the performances of each method
and adjust the hyperparameters to fine-tune the outlier detection for your specific use
case.