
UNIT - 2

Data Collection and Data Pre-Processing

What is Data Collection?

Data collection is the process of collecting, measuring, and analyzing different types of information using a set of standard, validated techniques. The main objective of data collection is to gather information-rich and reliable data and analyze it to make critical business decisions. Once the data is collected, it goes through a rigorous process of data cleaning and data processing to make it truly useful for businesses. There are two main methods of data collection in research, based on the information that is required, namely:

 Primary Data Collection
 Secondary Data Collection

Primary Data Collection Methods

Primary data refers to data collected first-hand, directly from the main source. It refers to data that has never been used in the past. The data gathered by primary data collection methods is generally regarded as the best kind of data in research. The methods of collecting primary data can be further divided into quantitative data collection methods (dealing with factors that can be counted) and qualitative data collection methods (dealing with factors that are not necessarily numerical in nature).

1. Interviews
Interviews are a direct method of data collection. It is simply a
process in which the interviewer asks questions and the interviewee
responds to them. It provides a high degree of flexibility because
questions can be adjusted and changed anytime according to the
situation.

2. Observations
In this method, researchers observe a situation around them and
record the findings. It can be used to evaluate the behaviour of
different people in controlled (everyone knows they are being
observed) and uncontrolled (no one knows they are being observed)
situations. This method is highly effective because it is
straightforward and not directly dependent on other
participants. For example, a person looks at random people that
walk their pets on a busy street, and then uses this data to decide
whether or not to open a pet food store in that area.
3. Surveys and Questionnaires
Surveys and questionnaires provide a broad perspective from large
groups of people. They can be conducted face-to-face, mailed, or
even posted on the Internet to get respondents from anywhere in
the world. The answers can be yes or no, true or false, multiple
choice, and even open-ended questions. However, a drawback of
surveys and questionnaires is delayed response and the possibility
of ambiguous answers.

4. Focus Groups
A focus group is similar to an interview, but it is conducted with a
group of people who all have something in common. The data
collected is similar to in-person interviews, but they offer a better
understanding of why a certain group of people thinks in a particular
way. However, some drawbacks of this method are lack of privacy
and domination of the interview by one or two participants. Focus
groups can also be time-consuming and challenging, but they help
reveal some of the best information for complex situations.

Secondary Data Collection Methods

Secondary data refers to data that has already been collected by someone else. It is much less expensive and easier to collect than primary data. While primary data collection provides more authentic and original data, there are numerous instances where secondary data collection provides great value to organizations.
1. Internet
The use of the Internet has become one of the most popular
secondary data collection methods in recent times. There is a large
pool of free and paid research resources that can be easily accessed
on the Internet. While this method is a fast and easy way of data
collection, you should only source from authentic sites while
collecting information.
2. Government Archives
There is a lot of data available in government archives that you can make use of. The most important advantage is that the data in government archives is authentic and verifiable. The challenge, however, is that the data is not always readily available, due to a number of factors. For example, criminal records may be classified information that is difficult for anyone to access.
3. Libraries
Most researchers donate several copies of their academic research
to libraries. You can collect important and authentic information
based on different research contexts. Libraries also serve as a
storehouse for business directories, annual reports and other similar
documents that help businesses in their research.
Data Pre-processing

Data Preprocessing can be defined as the process of converting raw data into a format that is understandable and usable for further analysis. It is an important step in the Data Preparation stage. It ensures that the outcome of the analysis is accurate, complete, and consistent.

Data Types

Data Type can be defined as labeling the values a feature can hold. The data type will also determine what kinds of relational, mathematical, or logical operations can be performed on it. A few of the most common data types include Integer, Floating Point, Character, String, Boolean, Array, Date, Time, etc.

Data Summary

Data Summary can be defined as generating descriptive or summary statistics for the
features in a given dataset. For example, for a numeric column, it will
compute mean, max, min, std, etc. For a categorical variable, it will compute the
count of unique labels, labels with the highest frequency, etc.
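
As a minimal illustrative sketch (using pandas and a small made-up DataFrame, not the notes' dataset), a data summary could be generated like this -

 Python3

import pandas as pd

# a small, hypothetical dataset used only for illustration
df = pd.DataFrame({
    'age': [25, 32, 47, 51, 38],
    'city': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Delhi']
})

# numeric column: count, mean, std, min, quartiles, max
print(df['age'].describe())

# categorical column: number of unique labels and label frequencies
print(df['city'].nunique())
print(df['city'].value_counts())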

Why is Data Preprocessing Important?

Data Preprocessing is an important step in the Data Preparation stage of a Data Science development lifecycle that will ensure reliable, robust, and consistent results. The main objective of this step is to ensure and check the quality of data before applying any Machine Learning or Data Mining methods. Let's review some of its benefits -

 Accuracy - Data Preprocessing will ensure that input data is accurate and reliable by ensuring there are no manual entry errors, no duplicates, etc.
 Completeness - It ensures that missing values are handled and data is complete for further analysis.
 Consistency - Data Preprocessing ensures that input data is consistent, i.e., the same data kept in different places should match.
 Timeliness - Whether data is updated regularly and on a timely basis or not.
 Trustworthiness - Whether data is coming from trustworthy sources or not.
 Interpretability - Raw data is generally unusable, and Data Preprocessing converts raw data into an interpretable format.
Key Steps in Data Preprocessing

Data Cleaning

Data Cleaning uses methods to handle incorrect, incomplete, inconsistent, or missing values. Some of the techniques for Data Cleaning include -

1. Handling Missing Values

Input data can contain missing or NULL values, which must be handled before applying any Machine Learning or Data Mining techniques.

Missing values can be handled by many techniques, such as removing rows/columns containing NULL values and imputing NULL values using the mean, mode, regression, etc.
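
A minimal sketch of both approaches, assuming a small hypothetical DataFrame with a numeric 'salary' column and a categorical 'gender' column (the names are chosen only for illustration) -

 Python3

import pandas as pd
import numpy as np

# hypothetical data with missing values
df = pd.DataFrame({'salary': [50000, np.nan, 62000, 58000],
                   'gender': ['M', 'F', np.nan, 'F']})

# option 1: drop rows that contain any NULL value
dropped = df.dropna()

# option 2: impute NULLs - mean for numeric, mode for categorical
df['salary'] = df['salary'].fillna(df['salary'].mean())
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])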

2. De-noising

De-noising is the process of removing noise from the data. Noisy data is meaningless data that is not interpretable or understandable by machines or humans. It can occur due to data entry errors, faulty data collection, etc.

De-noising can be performed by applying many techniques, such as binning the features, using regression to smooth the features and reduce noise, clustering to detect outliers, etc.
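
An illustrative sketch of two of these techniques with pandas (the 'price' column is hypothetical): binning groups values into coarse intervals, and a rolling mean smooths out noisy spikes -

 Python3

import pandas as pd

# hypothetical noisy feature
df = pd.DataFrame({'price': [10, 12, 11, 95, 13, 12, 14, 11]})

# binning: replace raw values by equal-width bins to reduce noise
df['price_bin'] = pd.cut(df['price'], bins=3)

# smoothing: rolling mean dampens the effect of noisy spikes
df['price_smooth'] = df['price'].rolling(window=3, min_periods=1).mean()

print(df)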

Data Integration

Data Integration can be defined as combining data from multiple sources. A few of the issues to be considered during Data Integration include the following -

Entity Identification Problem - It can be defined as identifying objects/features from multiple databases that correspond to the same entity. For example, customer_id in database A and customer_number in database B may belong to the same entity.

Schema Integration - It is used to merge two or more database schemas/metadata into a single schema. It essentially takes two or more schemas as input and determines a mapping between them. For example, the entity type CUSTOMER in one schema may be called CLIENT in another schema.

Detecting and Resolving Data Value Conflicts - The same data can be stored in different ways in different databases, and this needs to be taken care of while integrating them into a single dataset. For example, dates can be stored in various formats such as DD/MM/YYYY, YYYY/MM/DD, or MM/DD/YYYY.
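
A minimal, hypothetical sketch of these issues with pandas: the two toy "databases" below use different key names (customer_id vs. customer_number) and different date formats, which are reconciled before merging -

 Python3

import pandas as pd

# database A uses customer_id and DD/MM/YYYY dates
a = pd.DataFrame({'customer_id': [1, 2],
                  'signup': ['01/02/2021', '15/03/2021']})

# database B uses customer_number and YYYY/MM/DD dates
b = pd.DataFrame({'customer_number': [1, 2],
                  'last_order': ['2022/01/05', '2022/02/10']})

# entity identification: rename to a common key
b = b.rename(columns={'customer_number': 'customer_id'})

# data value conflicts: parse both date formats explicitly
a['signup'] = pd.to_datetime(a['signup'], format='%d/%m/%Y')
b['last_order'] = pd.to_datetime(b['last_order'], format='%Y/%m/%d')

# schema integration: merge the two sources into one dataset
merged = a.merge(b, on='customer_id')
print(merged)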

Data Reduction

Data Reduction is used to reduce the volume or size of the input data. Its main objective is to reduce storage and analysis costs and improve storage efficiency. A few of the popular techniques to perform Data Reduction include -

Dimensionality Reduction - It is the process of reducing the number of features in the input dataset. It can be performed in various ways, such as selecting the features with the highest importance, Principal Component Analysis (PCA), etc.

Numerosity Reduction - In this method, various techniques can be applied to reduce the volume of data by choosing alternative, smaller representations of the data. For example, a variable can be approximated by a regression model, and instead of storing the entire variable, we can store the regression model that approximates it.

Data Compression - In this method, data is compressed. Data Compression can be lossless or lossy, depending on whether or not information is lost during compression.
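
As a small sketch of Dimensionality Reduction with scikit-learn's PCA (the data here is random and purely illustrative) -

 Python3

import numpy as np
from sklearn.decomposition import PCA

# 100 samples with 10 features (illustrative random data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# keep the 3 principal components that explain the most variance
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 3)
print(pca.explained_variance_ratio_)   # variance explained per component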

Data Transformation

Data Transformation is the process of converting data into a format that helps in building efficient ML models and deriving better insights. A few of the most common methods for Data Transformation include -

Smoothing - Data Smoothing is used to remove noise in the dataset, and it helps identify important features and detect patterns. Therefore, it can help in predicting trends or future events.

Aggregation - Data Aggregation is the process of transforming large volumes of data into an organized and summarized format that is more understandable and comprehensive. For example, a company may look at monthly sales data for a product instead of raw sales data to understand its performance better and forecast future sales.

Discretization - Data Discretization is the process of converting numerical or continuous variables into a set of intervals/bins. This makes data easier to analyze. For example, an age feature can be converted into intervals such as (0-10, 11-20, ...) or (child, young, ...).

Normalization - Data Normalization is the process of converting a numeric variable into a specified range such as [-1, 1], [0, 1], etc. A few of the most common approaches to performing normalization are Min-Max Normalization, Data Standardization or Data Scaling, etc.
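
A minimal sketch of Aggregation, Discretization, and Min-Max Normalization with pandas (the sales and age values below are made up for illustration) -

 Python3

import pandas as pd

# hypothetical sales records
sales = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-05', '2023-01-20', '2023-02-03']),
    'amount': [100, 150, 90],
    'age': [12, 25, 47]
})

# aggregation: summarize raw sales into monthly totals
monthly = sales.groupby(sales['date'].dt.to_period('M'))['amount'].sum()

# discretization: convert the continuous age into labeled bins
sales['age_group'] = pd.cut(sales['age'], bins=[0, 10, 20, 60],
                            labels=['child', 'young', 'adult'])

# normalization: Min-Max scale the amount into the range [0, 1]
amt = sales['amount']
sales['amount_scaled'] = (amt - amt.min()) / (amt.max() - amt.min())

print(monthly)
print(sales)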

Applications of Data Preprocessing

Data Preprocessing is important in the early stages of a Machine Learning and AI application development lifecycle. A few of the most common applications include -

Improved Accuracy of ML Models - Various techniques used to preprocess data, such as Data Cleaning and Transformation, ensure that data is complete, accurate, and understandable, resulting in efficient and accurate ML models.

Reduced Costs - Data Reduction techniques can help companies save storage and compute costs by reducing the volume of the data.

Visualization - Preprocessed data is easily consumable and understandable, and can be further used to build dashboards to gain valuable insights.

Data Preprocessing is the process of converting raw datasets into a format that is consumable, understandable, and usable for further analysis. It is an important step in any Data Analysis project that will ensure the input dataset's accuracy, consistency, and completeness.

The key steps in this stage include Data Cleaning, Data Integration, Data Reduction, and Data Transformation.

It can help build accurate ML models, reduce analysis costs, and build dashboards on top of raw data.

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) refers to the method of studying and exploring datasets to understand their main characteristics, discover patterns, locate outliers, and identify relationships between variables. EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.

The Foremost Goals of EDA

1. Data Cleaning: EDA involves examining the data for errors, missing values, and inconsistencies. It includes techniques such as data imputation, handling missing data, and identifying and removing outliers.
2. Descriptive Statistics: EDA uses descriptive statistics to understand the central tendency, variability, and distribution of variables. Measures like mean, median, mode, standard deviation, range, and percentiles are commonly used.

3. Data Visualization: EDA employs visual techniques to represent the data graphically. Visualizations such as histograms, box plots, scatter plots, line plots, heatmaps, and bar charts help in identifying patterns, trends, and relationships within the data.

4. Feature Engineering: EDA allows for the exploration of different variables and their transformations to create new features or derive meaningful insights. Feature engineering can involve scaling, normalization, binning, encoding categorical variables, and creating interaction or derived variables.
5. Correlation and Relationships: EDA helps discover relationships and dependencies between variables. Techniques such as correlation analysis, scatter plots, and cross-tabulations offer insights into the strength and direction of relationships between variables.

6. Data Segmentation: EDA can involve dividing the data into meaningful segments based on certain criteria or characteristics. This segmentation helps gain insights into specific subgroups within the data and can lead to more focused analysis.

7. Hypothesis Generation: EDA aids in generating hypotheses or research questions based on the preliminary exploration of the data. It helps form the foundation for further analysis and model building.

8. Data Quality Assessment: EDA allows for assessing the quality and reliability of the data. It involves checking for data integrity, consistency, and accuracy to make certain the data is suitable for analysis.

Types of EDA

Depending on the number of columns being analyzed, EDA can be divided into univariate, bivariate, and multivariate analysis. More broadly, there are various types of EDA techniques that can be employed depending on the nature of the data and the goals of the analysis. Here are some common types of EDA:

1. Univariate Analysis: This type of analysis focuses on analyzing individual variables in the dataset. It involves summarizing and visualizing a single variable at a time to understand its distribution, central tendency, spread, and other relevant statistics. Techniques like histograms, box plots, bar charts, and summary statistics are commonly used in univariate analysis.

2. Bivariate Analysis: Bivariate analysis involves exploring the relationship between two variables. It helps find associations, correlations, and dependencies between pairs of variables. Scatter plots, line plots, correlation matrices, and cross-tabulations are commonly used techniques in bivariate analysis.

3. Multivariate Analysis: Multivariate analysis extends bivariate analysis to include more than two variables. It aims to understand the complex interactions and dependencies among multiple variables in a dataset. Techniques such as heatmaps, parallel coordinates, factor analysis, and principal component analysis (PCA) are used for multivariate analysis.

4. Time Series Analysis: This type of analysis is mainly applied to datasets that have a temporal component. Time series analysis involves examining and modeling patterns, trends, and seasonality in the data over time. Techniques like line plots, autocorrelation analysis, moving averages, and ARIMA (AutoRegressive Integrated Moving Average) models are commonly used in time series analysis.

5. Missing Data Analysis: Missing data is a common issue in datasets, and it may impact the reliability and validity of the analysis. Missing data analysis involves identifying missing values, understanding the patterns of missingness, and using suitable techniques to deal with the missing data. Techniques such as missing data pattern analysis, imputation strategies, and sensitivity analysis are employed in missing data analysis.

6. Outlier Analysis: Outliers are data points that deviate significantly from the general pattern of the data. Outlier analysis involves identifying and understanding the presence of outliers, their potential causes, and their impact on the analysis. Techniques such as box plots, scatter plots, z-scores, and clustering algorithms are used for outlier analysis.

7. Data Visualization: Data visualization is a critical part of EDA that involves creating visual representations of the data to facilitate understanding and exploration. Various visualization techniques, such as bar charts, histograms, scatter plots, line plots, heatmaps, and interactive dashboards, are used to represent different kinds of data.

These are just a few examples of the types of EDA techniques that can be employed during data analysis. The choice of techniques depends on the data characteristics, the research questions, and the insights sought from the analysis.

Exploratory Data Analysis (EDA) Using Python Libraries

For simplicity, we will use a single dataset: the employee data. It contains 8 columns, namely First Name, Gender, Start Date, Last Login Time, Salary, Bonus %, Senior Management, and Team. The dataset is available as Employees.csv.

Let's read the dataset using the Pandas read_csv() function and print the first five rows. To print the first five rows we will use the head() function.
 Python3

import pandas as pd
import numpy as np

# read dataset using pandas
df = pd.read_csv('employees.csv')
df.head()

Output:

First five rows of the dataframe

Getting Insights About The Dataset

Let's see the shape of the data using the shape attribute.

 Python3

df.shape

Output:

(1000, 8)

This means that this dataset has 1000 rows and 8 columns.

Let's get a quick summary of the dataset using the pandas describe() method. The describe() function applies basic statistical computations to the dataset, like extreme values, count of data points, standard deviation, etc. Any missing or NaN value is automatically skipped. The describe() function gives a good picture of the distribution of the data.

Example:

 Python3

df.describe()

Output:

description of the dataframe


Note: we can also get the description of the categorical columns of the dataset if we specify include='all' in the describe() function.
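
For example (an illustrative call on the same df) -

 Python3

df.describe(include='all')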

Now, let's also see the columns and their data types. For this, we will use the info() method.

 Python3

# information about the dataset
df.info()

Output:

Information about the dataset

Changing Dtype from Object to Datetime

Start Date is an important column for employees. However, it is not of much use if we cannot handle it properly. To handle this type of data, pandas provides a special function, to_datetime(), with which we can change the object type to datetime format.

 Python3

# convert "Start Date" column to datetime data type


df['Start Date'] = pd.to_datetime(df['Start Date'])

We can see the number of unique elements in our dataset. This will help us in deciding which type of encoding to choose for converting categorical columns into numerical columns.

 Python3

df.nunique()

Output:

First Name 200
Gender 2
Start Date 972
Last Login Time 720
Salary 995
Bonus % 971
Senior Management 2
Team 10
dtype: int64

Till now, we have got an idea about the dataset. Now let's see if our dataset contains any missing values or not.

Handling Missing Values

You all must be wondering why a dataset will contain any missing
values. It can occur when no information is provided for one or
more items or for a whole unit. For Example, Suppose different
users being surveyed may choose not to share their income, and
some users may choose not to share their address in this way
many datasets went missing. Missing Data is a very big problem in
real-life scenarios. Missing Data can also refer to as NA(Not
Available) values in pandas. There are several useful functions for
detecting, removing, and replacing null values in Pandas
DataFrame :

isnull()

notnull()

dropna()

fillna()

replace()

interpolate()
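
The examples that follow cover isnull(), fillna(), replace(), and dropna(); as a quick illustrative sketch of the remaining two, notnull() and interpolate(), on a small made-up Series -

 Python3

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# notnull(): boolean mask of the values that are NOT missing
print(s.notnull())

# interpolate(): fill NaNs by linear interpolation between neighbours
print(s.interpolate())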

Now let’s check if there are any missing values in our dataset or
not.

Example:

 Python3

df.isnull().sum()

Output:

Null values in dataframe

We can see that every column has a different number of missing values; for example, Gender has 145 missing values and Salary has 0. Now, for handling these missing values there can be several approaches, like dropping the rows containing NaN or replacing NaN with the mean, median, mode, or some other value.

Now, let’s try to fill in the missing values of gender with the string
“No Gender”.
Example:

 Python3

df["Gender"].fillna("No Gender", inplace = True)

df.isnull().sum()

Output:

Null values in dataframe after filling Gender column

We can see that now there is no null value for the Gender column. Now, let's fill the Senior Management column with the mode value.

Example:

 Python3

mode = df['Senior Management'].mode().values[0]
df['Senior Management'] = df['Senior Management'].replace(np.nan, mode)

df.isnull().sum()

Output:

Null values in dataframe after filling Senior Management column

Now, for First Name and Team, we cannot fill the missing values with arbitrary data, so let's drop all the rows containing these missing values.

Example:

 Python3

df = df.dropna(axis = 0, how ='any')

print(df.isnull().sum())
df.shape

Output:
Null values in dataframe after dropping all null values

We can see that our dataset is now free of all missing values, and after dropping them the number of rows has been reduced from 1000 to 899.

Note: For more information, refer to Working with Missing Data in Pandas.

After removing the missing data let’s visualize our data.

Data Encoding

Some models, like Linear Regression, do not work with categorical data, so in that case we should encode the categorical columns into numerical columns. We can use different methods for encoding, like Label Encoding or One-Hot Encoding. pandas and sklearn provide different functions for encoding; in our case, we will use the LabelEncoder class from sklearn to encode the Gender column.

 Python3

from sklearn.preprocessing import LabelEncoder

# create an instance of LabelEncoder
le = LabelEncoder()

# fit and transform the "Gender" column with LabelEncoder
df['Gender'] = le.fit_transform(df['Gender'])

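The text above also mentions One-Hot Encoding; a minimal sketch with pandas get_dummies(), applied here to the Team column purely as an illustration (not part of the original example) -

 Python3

# one-hot encode the "Team" column into 0/1 indicator columns
team_dummies = pd.get_dummies(df['Team'], prefix='Team')

# attach the indicator columns and drop the original categorical column
df = pd.concat([df.drop(columns=['Team']), team_dummies], axis=1)

df.head()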

Data visualization
Data Visualization is the process of analyzing data in the form of
graphs or maps, making it a lot easier to understand the trends or
patterns in the data.

Let’s see some commonly used graphs –

Note: We will use the Matplotlib and Seaborn libraries for data visualization. If you want to know more about these modules, refer to the articles -

 Matplotlib Tutorial
 Python Seaborn Tutorial
Histogram

It can be used for both univariate and bivariate analysis.

Example:

 Python3

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(x='Salary', data=df, )
plt.show()

Output:

Histogram plot of salary column

Boxplot

It can also be used for univariate and bivariate analyses.

Example:

 Python3

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x="Salary", y='Team', data=df)
plt.show()

Output:

Boxplot of Salary and team column

Scatter Plot For Data Visualization

It can be used for bivariate analyses.

Example:

 Python3

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x="Salary", y='Team', data=df,
                hue='Gender', size='Bonus %')

# Placing Legend outside the Figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)

plt.show()

Output:

Scatter plot of salary and Team column

For multivariate analysis, we can use the pairplot() method of the seaborn module. We can also use it to look at the multiple pairwise bivariate distributions in a dataset.

Example:

 Python3

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, hue='Gender', height=2)

Output:

Pairplot of columns of dataframe

Handling Outliers
An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Outliers can be caused by measurement or execution errors. The analysis for outlier detection is referred to as outlier mining. There are many ways to detect outliers, and the removal process for these outliers from the dataframe is the same as removing any data item from a pandas dataframe.

Let’s consider the iris dataset and let’s plot the boxplot for the
SepalWidthCm column.

Example:

 Python3

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('Iris.csv')

sns.boxplot(x='SepalWidthCm', data=df)

Output:

Boxplot of sepal width column before outlier removal

In the above graph, the values above 4 and below 2 are acting as
outliers.

Removing Outliers

To remove an outlier, one must follow the same process as removing any entry from the dataset, using its exact position in the dataset, because all of the above methods of detecting outliers produce, as their end result, the list of data items that satisfy the outlier definition according to the method used.

Example: We will detect the outliers using IQR and then we will
remove them. We will also draw the boxplot to see if the outliers
are removed or not.

 Python3

# Importing libraries (load_boston is not needed here and has been
# removed from recent scikit-learn versions)
import numpy as np
import pandas as pd
import seaborn as sns

# Load the dataset
df = pd.read_csv('Iris.csv')

# IQR (older NumPy versions used interpolation='midpoint' instead of method)
Q1 = np.percentile(df['SepalWidthCm'], 25, method='midpoint')
Q3 = np.percentile(df['SepalWidthCm'], 75, method='midpoint')
IQR = Q3 - Q1

print("Old Shape: ", df.shape)

# Upper bound
upper = np.where(df['SepalWidthCm'] >= (Q3+1.5*IQR))

# Lower bound
lower = np.where(df['SepalWidthCm'] <= (Q1-1.5*IQR))

# Removing the Outliers
df.drop(upper[0], inplace = True)
df.drop(lower[0], inplace = True)

print("New Shape: ", df.shape)

sns.boxplot(x='SepalWidthCm', data=df)

Output:

Boxplot of sepal width after outlier removal

Note: for more information, refer to Detect and Remove the Outliers using Python.

These are some of the EDA steps we perform during a data science project; however, what you do depends on your requirements and how much data analysis is needed.
