Better Data Science - Make Synthetic Datasets With Python

Uploaded by

Derek Degbedzui

Better Data Science | Make Synthetic Datasets with Python

● Library imports
● rcParams is used only for plot styling (hiding the top and right axes spines)
In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False

Make a synthetic dataset

● 1000 data points described by 2 features
● Balanced (approximately 50:50) class distribution
● Binary target variable; every class has a single cluster
● Set random_state (42 here) if you want reproducible results
In [2]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)

df.columns = ['x1', 'x2', 'y']
# 5 random rows
df.sample(5)
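
A quick sanity check can confirm the properties listed above. The sketch below regenerates the dataset twice and inspects its shape, class counts, and reproducibility. The names `params`, `X_a`, `X_b`, `counts`, and `same` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same settings as the cell above, collected in a dict for reuse
params = dict(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42,
)

X_a, y_a = make_classification(**params)
X_b, y_b = make_classification(**params)

# Shape: 1000 rows, 2 feature columns
print(X_a.shape)

# Number of samples per class
counts = np.bincount(y_a)
print(counts)

# A fixed random_state should yield identical arrays on every call
same = np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b)
print(same)
```

The class counts may differ slightly from an exact 500/500 split, because flip_y defaults to 0.01 and a small share of labels is therefore reassigned at random.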

Visualization

● The plot() function draws a scatter plot of the two features, with one color per class:

In [3]:
def plot(df: pd.DataFrame, x1: str, x2: str, y: str, title: str = '',
         save: bool = False, figname: str = 'figure.png') -> None:
    plt.figure(figsize=(14, 7))
    # One scatter call per class, so each class gets its own color and legend entry
    plt.scatter(x=df[df[y] == 0][x1], y=df[df[y] == 0][x2], label='y = 0')
    plt.scatter(x=df[df[y] == 1][x1], y=df[df[y] == 1][x2], label='y = 1')
    plt.title(title, fontsize=20)
    plt.legend()
    # Optionally write the figure to disk before displaying it
    if save:
        plt.savefig(figname, dpi=300, bbox_inches='tight', pad_inches=0)
    plt.show()
In [4]:
plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes')
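
The plot() function above hardcodes the two classes 0 and 1. If you later generate data with more classes, one possible variant loops over whatever labels are present. This is only a sketch: `plot_classes` and the `toy` DataFrame are hypothetical names introduced here, and the function returns the Axes instead of calling plt.show() so it can be reused in scripts.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_classes(df: pd.DataFrame, x1: str, x2: str, y: str, title: str = ''):
    """Scatter plot of x1 vs. x2 with one color per distinct value of y."""
    fig, ax = plt.subplots(figsize=(14, 7))
    # groupby yields one (label, rows) pair per class, in sorted label order
    for label, group in df.groupby(y):
        ax.scatter(group[x1], group[x2], label=f'{y} = {label}')
    ax.set_title(title, fontsize=20)
    ax.legend()
    return ax


# Small made-up dataset with three classes, used only to exercise the function
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'x1': rng.normal(size=30),
    'x2': rng.normal(size=30),
    'y': np.repeat([0, 1, 2], 10),
})

ax = plot_classes(toy, x1='x1', x2='x2', y='y', title='Toy data with 3 classes')
n_groups = len(ax.collections)  # one scatter collection per class
```

Call plt.show() afterwards if you want the figure displayed outside a notebook.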

Adding noise

● You can use the flip_y parameter (default 0.01) to add label noise

● From the docs:
○ The fraction of samples whose class is assigned randomly. Larger
values introduce noise in the labels and make the classification
task harder. Note that the default setting flip_y > 0 might lead to
less than n_classes in y in some cases.
In [5]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    flip_y=0.15,
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)

df.columns = ['x1', 'x2', 'y']

plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Added noise')
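
Because the selected samples have their class assigned randomly rather than inverted, roughly half of them end up with their original label in a binary problem. The NumPy sketch below mimics that behavior as the quoted documentation describes it. It is an illustration under that assumption, not scikit-learn's actual implementation, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, flip_y = 1000, 0.15

# Clean, perfectly balanced binary labels
y_clean = np.repeat([0, 1], n_samples // 2)

# Pick roughly flip_y of the samples for random reassignment
flip_mask = rng.uniform(size=n_samples) < flip_y

# Assign each picked sample a class drawn uniformly from {0, 1}
y_noisy = y_clean.copy()
y_noisy[flip_mask] = rng.integers(0, 2, size=flip_mask.sum())

selected = flip_mask.mean()             # share of samples picked
changed = (y_noisy != y_clean).mean()   # share whose label really changed

print(f'selected for reassignment: {selected:.1%}')
print(f'labels actually changed:   {changed:.1%}')
```

Under this reading, flip_y=0.15 corrupts about 7.5% of the labels in a two-class dataset, not 15%.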

Add class imbalance

● A perfectly balanced (50:50) class distribution is rare in practice
● You can use the weights parameter to control the class proportions
○ weights=[0.95] assigns about 95% of the samples to y = 0, so the
y = 1 class takes roughly 5% of the data
In [6]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.95],
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)

df.columns = ['x1', 'x2', 'y']

plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Class imbalance (y = 1)')

● You can do the opposite and make y = 0 the minority class (roughly 5% of the data):

In [7]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.05],
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)

df.columns = ['x1', 'x2', 'y']

plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Class imbalance (y = 0)')
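
To confirm what the weights parameter did, you can count the labels instead of relying on the plot. The sketch below computes the share of each class for both settings used above. The helper `class_shares` is a hypothetical name, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification


def class_shares(weights):
    """Return the fraction of samples in class 0 and class 1."""
    _, y = make_classification(
        n_samples=1000,
        n_features=2,
        n_redundant=0,
        n_clusters_per_class=1,
        weights=weights,
        random_state=42,
    )
    return np.bincount(y, minlength=2) / len(y)


shares_95 = class_shares([0.95])  # y = 1 is the minority class
shares_05 = class_shares([0.05])  # y = 0 is the minority class

print('weights=[0.95]:', shares_95)
print('weights=[0.05]:', shares_05)
```

The shares are close to, but not necessarily exactly, 95:5, since the default flip_y=0.01 still reassigns a few labels at random.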

Make classification task easier/harder

● You can adjust class separation with the class_sep parameter (default 1.0)
● The higher the value, the more separated the classes are and the easier the classification task
In [8]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    class_sep=5,
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)

df.columns = ['x1', 'x2', 'y']

plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Make classification easier')
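
One rough way to quantify the effect of class_sep is to measure the distance between the two class centroids for several values. The distance measure is a choice made for this sketch, not something the original notebook uses, and `centroid_distance` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification


def centroid_distance(class_sep: float) -> float:
    """Euclidean distance between the mean points of class 0 and class 1."""
    X, y = make_classification(
        n_samples=1000,
        n_features=2,
        n_redundant=0,
        n_clusters_per_class=1,
        class_sep=class_sep,
        random_state=42,
    )
    centroid_0 = X[y == 0].mean(axis=0)
    centroid_1 = X[y == 1].mean(axis=0)
    return float(np.linalg.norm(centroid_0 - centroid_1))


# Smaller than, equal to, and larger than the default value of 1.0
distances = {sep: centroid_distance(sep) for sep in (0.5, 1.0, 5.0)}

for sep, dist in distances.items():
    print(f'class_sep={sep}: centroid distance = {dist:.2f}')
```

Values below the default of 1.0 pull the classes together and make the task harder, while larger values push them apart.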
