UNIT 2 Notes - Data Science
1. Interviews
Interviews are a direct method of data collection. It is simply a
process in which the interviewer asks questions and the interviewee
responds to them. It provides a high degree of flexibility because
questions can be adjusted and changed anytime according to the
situation.
2. Observations
In this method, researchers observe a situation around them and
record the findings. It can be used to evaluate the behaviour of
different people in controlled (everyone knows they are being
observed) and uncontrolled (no one knows they are being observed)
situations. This method is highly effective because it is
straightforward and not directly dependent on other
participants. For example, a person watches random people who walk their pets on a busy street and then uses this data to decide whether or not to open a pet food store in that area.
3. Surveys and Questionnaires
Surveys and questionnaires provide a broad perspective from large
groups of people. They can be conducted face-to-face, mailed, or
even posted on the Internet to get respondents from anywhere in
the world. The answers can be yes/no, true/false, multiple choice, or even open-ended questions. However, drawbacks of surveys and questionnaires include delayed responses and the possibility of ambiguous answers.
4. Focus Groups
A focus group is similar to an interview, but it is conducted with a
group of people who all have something in common. The data collected is similar to that from in-person interviews, but focus groups offer a better understanding of why a certain group of people thinks a particular way. However, drawbacks of this method include a lack of privacy and the domination of the discussion by one or two participants. Focus groups can also be time-consuming and challenging, but they help reveal some of the best information for complex situations.
Data Types
Data Type can be defined as labeling the values a feature can hold. The data type will
also determine what kinds of relational, mathematical, or logical operations can be
performed on it. A few of the most common data types include Integer, Floating,
Character, String, Boolean, Array, Date, Time, etc.
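As a small, hedged illustration (the variable names and values below are invented purely for this sketch, not taken from any dataset), these common data types look like this in Python:
Python3
from datetime import date, time

employee_id = 101                 # Integer
salary = 55750.50                 # Floating point
grade = 'A'                       # Character (a one-character string in Python)
name = "Jane Doe"                 # String
is_senior = True                  # Boolean
skills = ["Python", "SQL"]        # Array (a list in Python)
joining_date = date(2021, 1, 15)  # Date
login_time = time(9, 30)          # Time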
Data Summary
Data Summary can be defined as generating descriptive or summary statistics for the
features in a given dataset. For example, for a numeric column, it will
compute mean, max, min, std, etc. For a categorical variable, it will compute the
count of unique labels, labels with the highest frequency, etc.
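As a brief sketch (assuming a pandas DataFrame named df with a numeric 'Salary' column and a categorical 'Team' column, as in the dataset used later in these notes):
Python3
df['Salary'].describe()       # count, mean, std, min, quartiles, max for a numeric column
df['Team'].nunique()          # number of unique labels in a categorical column
df['Team'].value_counts()     # frequency of each label; the first entry is the most frequent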
Data Cleaning
Data cleaning removes errors from the raw data, for example by handling missing values and de-noising (smoothing out noisy values).
Data Integration
Data integration combines data from multiple sources into a single, consistent dataset.
Data Reduction
Data reduction decreases the volume of data, for example through feature selection or sampling, while preserving the information needed for analysis.
Data Transformation
Data transformation converts data into forms suitable for analysis, such as normalization or encoding, as sketched below.
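A minimal sketch of what these steps can look like with pandas and sklearn (the toy data, column names, and the particular cleaning and scaling choices are assumptions made purely for illustration):
Python3
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# a tiny toy DataFrame, purely for illustration
data = pd.DataFrame({'Salary': [50000, np.nan, 62000, 800000],
                     'Team': ['HR', 'IT', 'IT', 'HR']})

# Data Cleaning: fill the missing salary and clip the extreme (noisy) value
data['Salary'] = data['Salary'].fillna(data['Salary'].median())
data['Salary'] = data['Salary'].clip(upper=data['Salary'].quantile(0.95))

# Data Integration: combine with another (hypothetical) source on a shared key
# data = data.merge(other_source, on='Team', how='left')

# Data Reduction: keep only the columns needed for the analysis
reduced = data[['Salary']]

# Data Transformation: rescale the numeric column to the [0, 1] range
data['Salary_scaled'] = MinMaxScaler().fit_transform(data[['Salary']])
print(data)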
Types of EDA
These are just a few examples of the types of EDA techniques that can be employed at some stage of data analysis. The choice of techniques depends on the characteristics of the data, the research questions, and the insights sought from the analysis.
Let’s read the dataset using the pandas read_csv() function and print the first five rows. To print the first five rows we will use the head() function.
Python3
import pandas as pd
import numpy as np
# read dataset using pandas
df = pd.read_csv('employees.csv')
df.head()
Output:
Python3
df.shape
Output:
(1000, 8)
This means that this dataset has 1000 rows and 8 columns.
Example:
Python3
df.describe()
Output:
Now, let’s also see the columns and their data types. For this, we will use the info() method.
Python3
df.info()
Output:
We can also check the number of unique values in each column using the nunique() method.
Python3
df.nunique()
Output:
Till now we have got an idea about the dataset used. Now let’s see whether our dataset contains any missing values.
You may wonder why a dataset would contain missing values. They can occur when no information is provided for one or more items or for a whole unit. For example, users being surveyed may choose not to share their income, and some may choose not to share their address; in this way, many values in a dataset go missing. Missing data is a very big problem in real-life scenarios. Missing data is also referred to as NA (Not Available) values in pandas. There are several useful functions for detecting, removing, and replacing null values in a pandas DataFrame (each is sketched briefly after this list):
isnull()
notnull()
dropna()
fillna()
replace()
interpolate()
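A short hedged sketch of how these functions behave, using a toy Series invented for illustration (the notes apply them to the real dataset below):
Python3
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])   # toy data with missing values

s.isnull()                   # True where a value is missing
s.notnull()                  # True where a value is present
s.dropna()                   # drop the missing entries
s.fillna(0)                  # replace missing entries with a constant
s.replace(np.nan, s.mean())  # replace NaN with another value, here the mean
s.interpolate()              # fill missing entries by linear interpolation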
Now let’s check if there are any missing values in our dataset or
not.
Example:
Python3
df.isnull().sum()
Output:
Now, let’s try to fill in the missing values of gender with the string
“No Gender”.
Example:
Python3
# fill missing Gender values with the string "No Gender"
df["Gender"] = df["Gender"].fillna("No Gender")
df.isnull().sum()
Output:
We can see that there are now no null values in the Gender column.
Now, let’s fill the Senior Management column with its mode (most frequent) value.
Example:
Python3
# fill missing values in the Senior Management column with its most frequent value
# (the column name 'Senior Management' is assumed from the dataset's columns)
mode_value = df['Senior Management'].mode()[0]
df['Senior Management'] = df['Senior Management'].fillna(mode_value)
df.isnull().sum()
Output:
Now for the first name and team, we cannot fill the missing values with arbitrary data, so let’s drop all the rows containing these missing values.
Example:
Python3
# drop every row that still contains a missing value
df = df.dropna(axis=0, how='any')
print(df.isnull().sum())
df.shape
Output:
Null values in the DataFrame after dropping all rows with null values
We can see that our dataset is now free of all missing values, and after dropping them the number of rows was reduced from 1000 to 899.
Data Encoding
Some models, such as Linear Regression, do not work with categorical data; in that case we should encode the categorical columns into numerical ones. We can use different encoding methods, such as label encoding or one-hot encoding. Both pandas and sklearn provide functions for encoding; in our case we will use the LabelEncoder class from sklearn to encode the Gender column.
Python3
# encode the Gender labels as integers using sklearn's LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])
df.head()
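Since one-hot encoding is mentioned above as an alternative, here is a brief hedged sketch using pandas' get_dummies; it would be applied to the original (un-encoded) Gender column instead of label encoding, and it avoids implying an order between categories:
Python3
# one-hot encode Gender into separate 0/1 indicator columns (alternative to label encoding)
df_onehot = pd.get_dummies(df, columns=['Gender'], prefix='Gender')
df_onehot.head()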
Data visualization
Data Visualization is the process of analyzing data in the form of
graphs or maps, making it a lot easier to understand the trends or
patterns in the data.
For more on plotting, refer to the Matplotlib Tutorial and the Python Seaborn Tutorial.
Histogram
Example:
Python3
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(x='Salary', data=df)
plt.show()
Output:
Boxplot
Example:
Python3
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
# boxplot of Salary grouped by Team (the plotted columns are assumed for illustration)
sns.boxplot(x='Salary', y='Team', data=df)
plt.show()
Output:
Example:
Python3
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
# scatter plot of Salary against Bonus % (the plotted columns are assumed for illustration)
sns.scatterplot(x='Salary', y='Bonus %', hue='Gender', data=df)
plt.show()
Output:
Example:
Python3
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
# pairplot of the numeric columns, colored by Gender (the plot choice is assumed for illustration)
sns.pairplot(df, hue='Gender', vars=['Salary', 'Bonus %'])
plt.show()
Output:
Handling Outliers
An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Outliers can be caused by measurement or execution errors. The analysis of outliers is referred to as outlier mining. There are many ways to detect outliers, and the process of removing them from the dataframe is the same as removing any data item from a pandas DataFrame.
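As one hedged example of an alternative detection method (not the IQR approach used below), outliers are sometimes flagged with a Z-score rule, marking values that lie more than about three standard deviations from the mean:
Python3
import numpy as np

# assuming df is the iris DataFrame loaded below, with a 'SepalWidthCm' column
col = df['SepalWidthCm']
z_scores = (col - col.mean()) / col.std()
outlier_rows = df[np.abs(z_scores) > 3]   # rows flagged as outliers by the Z-score rule
print(outlier_rows)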
Let’s consider the iris dataset and let’s plot the boxplot for the
SepalWidthCm column.
Example:
Python3
# importing packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# load the iris dataset (the file name 'Iris.csv' is assumed here)
df = pd.read_csv('Iris.csv')
sns.boxplot(x='SepalWidthCm', data=df)
plt.show()
Output:
In the above graph, the values above 4 and below 2 are acting as
outliers.
Removing Outliers
To remove an outlier, one must follow the same process as removing any entry from the dataset, using its exact position, because all of the above detection methods ultimately produce a list of the data items that satisfy the outlier definition of the method used.
Example: We will detect the outliers using IQR and then we will
remove them. We will also draw the boxplot to see if the outliers
are removed or not.
Python3
# Importing
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# IQR
Q1 = np.percentile(df['SepalWidthCm'], 25, method='midpoint')
Q3 = np.percentile(df['SepalWidthCm'], 75, method='midpoint')
IQR = Q3 - Q1

# Upper bound: positions of values above Q3 + 1.5*IQR
upper = np.where(df['SepalWidthCm'] >= (Q3 + 1.5 * IQR))
# Lower bound: positions of values below Q1 - 1.5*IQR
lower = np.where(df['SepalWidthCm'] <= (Q1 - 1.5 * IQR))

# remove the rows flagged as outliers
df.drop(upper[0], inplace=True)
df.drop(lower[0], inplace=True)

# boxplot after removal to confirm the outliers are gone
sns.boxplot(x='SepalWidthCm', data=df)
plt.show()
Output:
Note: for more information, refer to Detect and Remove the Outliers using Python.
These are some of the EDA steps we perform during a data science project; however, the exact steps depend on your requirements and how much data analysis is needed.