
UNIT I - EXPLORATORY DATA ANALYSIS

EDA Fundamentals - Definition and importance of EDA in data science. Understanding and Making Sense of Data, Software Tools for EDA - Introduction to popular EDA tools (Pandas, NumPy, Matplotlib, Seaborn, and Tableau), Visual Aids for EDA - Common plots and charts used in EDA (histograms, box plots, scatter plots), Data Transformation Techniques, Grouping and Aggregation.

Lab Component: Loading datasets and basic data exploration (head, tail, describe, info).

1. EDA Fundamentals

EDA is a process of examining the available dataset to discover patterns, spot anomalies, test hypotheses, and check assumptions using statistical measures. In this chapter, we are going to discuss the steps involved in performing top-notch exploratory data analysis and get our hands dirty using some open source databases. As mentioned here and in several studies, the primary aim of EDA is to examine what data can tell us before actually going through formal modeling or hypothesis formulation.

Data Science
• Data science is at the peak of its hype, and the skills expected of data scientists are changing. Data scientists are no longer required only to build a performant model; it is also essential for them to explain the results obtained and use them for business intelligence.

• Data science involves cross-disciplinary knowledge from computer science, data, statistics, and mathematics.

• There are several phases of data analysis, including data requirements, data collection, data processing, data cleaning, exploratory data analysis, modeling and algorithms, and data product and communication.

• These phases are similar to the Cross-Industry Standard Process for Data Mining (CRISP-DM) framework.

• The main takeaway here is the stages of EDA, as it is an important aspect of data analysis and data mining. Let's understand in brief what these stages are:

i. Data requirements: There can be various sources of data for an organization. It is important to comprehend what type of data is required for the organization to collect, curate, and store.

• For example, an application tracking the sleeping pattern of patients suffering from dementia requires several types of sensor data, such as sleep data, heart rate, electro-dermal activity, and user activity patterns.

• All of these data points are required to correctly diagnose the mental state of the person. Hence, these are mandatory requirements for the application.

• In addition to this, it is necessary to categorize the data as numerical or categorical, and to define the format of storage and dissemination.

ii. Data collection: Data collected from several sources must be stored in the correct format and transferred to the right information technology personnel within a company. As mentioned previously, data can be collected from several objects during several events using different types of sensors and storage tools.

iii. Data processing: Preprocessing involves pre-curating the dataset before actual analysis.

• Common tasks involve correctly exporting the dataset, placing it under the right tables, structuring it, and exporting it in the correct format.

iv. Data cleaning: Preprocessed data is still not ready for detailed analysis. It must be correctly transformed through an incompleteness check, a duplicates check, an error check, and a missing value check.

• These tasks are performed in the data cleaning stage, which involves responsibilities such as matching the correct records, finding inaccuracies in the dataset, understanding the overall data quality, removing duplicate items, and filling in missing values.

• However, how can we identify these anomalies in any dataset? Finding such data issues requires us to perform some analytical techniques.

• Briefly, data cleaning depends on the type of data under study. Hence, it is essential for data scientists and EDA experts to comprehend different types of datasets. An example of data cleaning would be using outlier detection methods for quantitative data cleaning.
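For instance, here is a minimal sketch of quantitative outlier detection using the interquartile range (IQR) rule in pandas; the column name and values are made up for illustration:

import pandas as pd

# Illustrative quantitative data with one obvious outlier
df = pd.DataFrame({'weight_kg': [61, 64, 59, 62, 60, 180, 63]})

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df['weight_kg'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df['weight_kg'] < lower) | (df['weight_kg'] > upper)]
cleaned = df[(df['weight_kg'] >= lower) & (df['weight_kg'] <= upper)]
print(outliers)   # the 180 kg record is flagged
print(cleaned)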

v. EDA: Exploratory data analysis is the stage where we actually start to understand the message contained in the data. It should be noted that several types of data transformation techniques might be required during the process of exploration.

vi. Modeling and algorithms: From a data science perspective, generalized models or mathematical formulas can represent or exhibit relationships among different variables, such as correlation or causation.

• These models or equations involve one or more variables that depend on other variables to cause an event.

• For example, when buying, say, pens, the total price of pens (Total) = price of one pen (UnitPrice) * the number of pens bought (Quantity).

• Hence, our model would be Total = UnitPrice * Quantity. Here, the total price depends on the unit price.

• Hence, the total price is referred to as the dependent variable and the unit price is referred to as the independent variable.

• In general, a model always describes the relationship between independent and dependent variables. Inferential statistics deals with quantifying relationships between particular variables.

• The Judd model for describing the relationship between data, model, and error still holds true: Data = Model + Error.
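A tiny numeric sketch of this decomposition (the purchase figures are made up for illustration): the model predicts Total = UnitPrice * Quantity, and whatever the observed total does not match is the error.

import pandas as pd

# Hypothetical observed purchases; observed totals include small billing noise
df = pd.DataFrame({'UnitPrice': [10.0, 10.0, 12.5],
                   'Quantity': [3, 5, 2],
                   'ObservedTotal': [30.5, 49.0, 25.0]})

# Model: Total = UnitPrice * Quantity
df['ModelTotal'] = df['UnitPrice'] * df['Quantity']

# Data = Model + Error  =>  Error = Data - Model
df['Error'] = df['ObservedTotal'] - df['ModelTotal']
print(df)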

vii. Data product: Any computer software that uses data as input, produces output, and provides feedback based on the output to control the environment is referred to as a data product.

• A data product is generally based on a model developed during data analysis, for example, a recommendation model that takes a user's purchase history as input and recommends a related item that the user is highly likely to buy.

viii. Communication: This stage deals with disseminating the results to end stakeholders so that they can use the results for business intelligence.

• One of the most notable steps in this stage is data visualization. Visualization deals with information relay techniques such as tables, charts, summary diagrams, and bar charts to show the analyzed results.
2. Importance of EDA in Data Science

• Different fields of science, economics, engineering, and marketing accumulate and store data primarily in electronic databases. Appropriate and well-established decisions should be made using the data collected.

• It is practically impossible to make sense of datasets containing more than a handful of data points without the help of computer programs. To be certain of the insights that the collected data provides and to make further decisions, data mining is performed, where we go through distinctive analysis processes.

• Exploratory data analysis is key, and usually the first exercise, in data mining. It allows us to visualize data to understand it as well as to create hypotheses for further analysis. The exploratory analysis centers around creating a synopsis of the data, or insights for the next steps in a data mining project.

• EDA actually reveals the ground truth about the content without making any underlying assumptions. This is why data scientists use this process to understand what type of modeling and hypotheses can be created.

• Key components of exploratory data analysis include summarizing data, statistical analysis, and visualization of data.

• Python provides expert tools for exploratory analysis, with pandas for summarizing; scipy, along with others, for statistical analysis; and matplotlib and plotly for visualizations.
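A minimal sketch of this division of labour, using made-up measurements (the column name and numbers are illustrative):

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# Illustrative sample: 200 normally distributed measurements
values = np.random.default_rng(0).normal(loc=50, scale=5, size=200)
s = pd.Series(values, name='measurement')

print(s.describe())             # pandas: count, mean, std, quartiles
print(stats.describe(values))   # scipy: mean, variance, skewness, kurtosis

s.plot(kind='hist', bins=20, title='Distribution of measurements')
plt.show()                      # matplotlib: a quick visual check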
• After understanding the significance of EDA, let's discover the most generic steps involved in EDA.

❖ Steps in EDA

• Having understood what EDA is and its significance, let's understand the various steps involved in data analysis. Basically, it involves four different steps. Let's go through each of them to get a brief understanding of each step:

1. Problem definition: Before trying to extract useful insights from the data, it is essential to define the business problem to be solved.

• The problem definition works as the driving force for executing a data analysis plan.

• The main tasks involved in problem definition are defining the main objective of the analysis, defining the main deliverables, outlining the main roles and responsibilities, obtaining the current status of the data, defining the timetable, and performing a cost/benefit analysis. Based on such a problem definition, an execution plan can be created.

2. Data preparation: This step involves methods for preparing the dataset before actual analysis.

• In this step, we define the sources of data, define data schemas and tables, understand the main characteristics of the data, clean the dataset, delete non-relevant datasets, transform the data, and divide the data into the required chunks for analysis.

3. Data analysis: This is one of the most crucial steps; it deals with descriptive statistics and analysis of the data.

• The main tasks involve summarizing the data, finding hidden correlations and relationships in the data, developing predictive models, evaluating the models, and calculating the accuracies.

• Some of the techniques used for data summarization are summary tables, graphs, descriptive statistics, inferential statistics, correlation statistics, searching, grouping, and mathematical models.
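Summarization of this kind is also the first lab exercise of this unit: load a dataset and explore it with head, tail, describe, and info. A minimal sketch, assuming a hypothetical file named patients.csv:

import pandas as pd

# Load a dataset (patients.csv is a hypothetical file name)
df = pd.read_csv('patients.csv')

print(df.head())      # first five rows
print(df.tail())      # last five rows
print(df.describe())  # summary statistics for the numerical columns
df.info()             # column names, data types, and non-null counts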

4. Development and representation of the results: This step involves presenting the dataset to the target audience in the form of graphs, summary tables, maps, and diagrams.

• This is also an essential step, as the results analyzed from the dataset should be interpretable by the business stakeholders, which is one of the major goals of EDA.

• Most of the graphical analysis techniques include scatter plots, character plots, histograms, box plots, residual plots, mean plots, and others.
Understanding and Making Sense of Data

• It is crucial to identify the type of data under analysis. In this section, we will learn about the different types of data encountered during analysis.

• Different disciplines store different kinds of data for different purposes. For example, medical researchers store patients' data, universities store students' and teachers' data, and real estate companies store house and building datasets.

• A dataset contains many observations about a particular object. For instance, a dataset about patients in a hospital can contain many observations.

• A patient can be described by a patient identifier (ID), name, address, weight, date of birth, email, and gender. Each of these features that describes a patient is a variable, and each observation has a specific value for each of these variables.

• These datasets are stored in hospitals and are presented for analysis. Most of this data is stored in some sort of database management system in tables/schemas.

• In an example patient table, there are five observations (001, 002, 003, 004, 005), and each observation records values for the variables (PatientID, name, address, dob, email, gender, and weight).

• Most datasets broadly fall into two groups: numerical data and categorical data.

1. Numerical data

• This data has a sense of measurement involved in it; for example, a person's age, height, weight, blood pressure, heart rate, temperature, number of teeth, number of bones, and number of family members. This data is often referred to as quantitative data in statistics.

• A numerical dataset can be either discrete or continuous.

1. Discrete data

• This is data that is countable and whose values can be listed out. For example, if we flip a coin, the number of heads in 200 coin flips can take values from 0 to 200 (a finite number of cases).

• A variable that represents a discrete dataset is referred to as a discrete variable. A discrete variable takes a fixed number of distinct values.

• For example, the Country variable can have values such as Nepal, India, Norway, and Japan; it is fixed. The Rank variable of a student in a classroom can take values such as 1, 2, 3, 4, 5, and so on.

2. Continuous data

• A variable that can have an infinite number of numerical values within a specific range is classified as continuous data.

• A variable describing continuous data is a continuous variable. For example, what is the temperature of your city today? Can the possible values be listed out as a finite set? No; temperature can take any value within a range.

• Similarly, the weight variable is also a continuous variable.

2. Categorical data

➢ This type of data represents the characteristics of an object; for example, gender, marital status, type of address, or categories of movies. This data is often referred to as qualitative data in statistics.

➢ Some of the most common types of categorical data you can find are:
• Gender (Male, Female, Other, or Unknown)

• Marital Status (Divorced, Legally Separated, Married, Never Married, Unmarried, Widowed, or Unknown)

• Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy, Historical, Horror, Mystery, Philosophical, Political, Saga, Satire, Science Fiction, Social, Thriller, Urban, or Western)

• Blood type (A, B, AB, or O)

• Types of drugs (Stimulants, Depressants, Hallucinogens, Dissociatives, Opioids, Inhalants, or Cannabis)

A variable describing categorical data is referred to as a categorical variable. These types of variables can have one of a limited number of values.

There are different types of categorical variables:

• A binary categorical variable can take exactly two values and is also referred to as a dichotomous variable. For example, when we create an experiment, the result is either success or failure. Hence, the result can be understood as a binary categorical variable.

• Polytomous variables are categorical variables that can take more than two possible values. For example, marital status can have several values, such as divorced, legally separated, married, never married, unmarried, widowed, and unknown. Since marital status can take more than two possible values, it is a polytomous variable.

• Most categorical datasets follow either a nominal or an ordinal measurement scale.

Measurement scales

There are four different types of measurement scales described in statistics: nominal, ordinal, interval, and ratio. These scales are used mostly in academic and research settings. Let's understand each of them with some examples.

1. Nominal

These scales are used for labeling variables without any quantitative value. The scales are generally referred to as labels; they are mutually exclusive and do not carry any numerical importance. Let's see some examples:

• What is your gender?

• Male

• Female

• I prefer not to answer

• Other

Other examples include the following:

• The languages that are spoken in a particular country

• Biological species

• Parts of speech in grammar (noun, pronoun, adjective, and so on)

• Taxonomic ranks in biology (Archaea, Bacteria, and Eukarya)

• Nominal scales are considered qualitative scales, and measurements taken using qualitative scales are considered qualitative data.

2. Ordinal

• The main difference between the ordinal and nominal scales is the order. In ordinal scales, the order of the values is a significant factor.

• Let's check an example of an ordinal scale using the Likert scale: "WordPress is making content managers' lives easier. How do you feel about this statement?"

• The answer to this question is scaled down to five different ordinal values: Strongly Agree, Agree, Neutral, Disagree, and Strongly Disagree. Scales like these are referred to as Likert scales.

To make it easier, consider ordinal scales as an order of ranking (1st, 2nd, 3rd, 4th, and so on). The median item is allowed as the measure of central tendency; however, the average is not permitted.
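A minimal pandas sketch of this point, using made-up Likert responses: an ordered categorical supports order-aware operations, and the median can be read off the ranks, while an arithmetic mean of the labels is not meaningful.

import pandas as pd

levels = ['Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree']
responses = pd.Series(
    ['Agree', 'Neutral', 'Agree', 'Strongly Agree', 'Disagree'],
    dtype=pd.CategoricalDtype(categories=levels, ordered=True))

# Order-aware operations are allowed on ordinal data
print(responses.min(), responses.max())

# The median of the underlying ranks (codes 0..4) gives the median response level
median_code = int(responses.cat.codes.median())
print(levels[median_code])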
3. Interval

• In interval scales, both the order and the exact differences between the values are significant.

• Interval scales are widely used in statistics, for example, in the measures of central tendency: mean, median, mode, and standard deviation.

• Examples include location in Cartesian coordinates and direction measured in degrees from magnetic north. The mean, median, and mode are allowed on interval data.
4. Ratio

• Ratio scales contain order, exact values, and an absolute zero, which makes it possible to use them in descriptive and inferential statistics.

• These scales provide numerous possibilities for statistical analysis.

• Mathematical operations, the measures of central tendency, and the measures of dispersion and the coefficient of variation can also be computed from such scales (a brief sketch follows below).

• Examples include measures of energy, mass, length, duration, electrical energy, plane angle, and volume. In summary, categorical data is measured on nominal and ordinal scales, while numerical data is measured on interval and ratio scales.
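A minimal sketch of the extra statistics a ratio scale permits, using made-up lengths in metres (a true zero exists, so the coefficient of variation is meaningful):

import numpy as np

lengths = np.array([2.0, 2.5, 3.0, 3.5, 4.0])   # ratio-scale data

mean = lengths.mean()
std = lengths.std(ddof=1)   # sample standard deviation
cv = std / mean             # coefficient of variation
print(f"mean={mean:.2f} m, std={std:.2f} m, CV={cv:.2%}")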
Software tools for EDA
There are several software tools available for Exploratory Data Analysis
(EDA) that help in analyzing and visualizing data. Here are some popular
ones:

1. Python Libraries

Pandas: Used for data manipulation and analysis, providing data structures like DataFrames.

Matplotlib: A plotting library that provides static, animated, and interactive visualizations.

Seaborn: Built on Matplotlib, it offers a high-level interface for drawing attractive statistical graphics.

Plotly: An interactive graphing library that makes it easy to create plots, including line charts, scatter plots, and histograms.

NumPy: Provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions.

Scikit-learn: Offers tools for data preprocessing, as well as simple visualizations like feature importance plots.

2. R Libraries

ggplot2: A powerful data visualization package for creating complex plots.

dplyr: A grammar of data manipulation, providing a consistent set of verbs for data manipulation.

tidyverse: A collection of R packages designed for data science, including ggplot2, dplyr, tidyr, and others.

Shiny: Allows you to build interactive web apps straight from R.

3. Standalone Software

Tableau: A leading data visualization tool that allows users to create a wide variety of charts and dashboards.

Power BI: A business analytics tool by Microsoft that provides interactive visualizations and business intelligence capabilities.

Excel: A widely used spreadsheet tool that offers basic to advanced data analysis and visualization capabilities.

RapidMiner: An advanced data science platform that includes tools for data prep, machine learning, and EDA.

KNIME: A data analytics, reporting, and integration platform that supports data blending, EDA, and more.

Orange: A data visualization and analysis tool for both novices and experts in data science.

4. Big Data Tools

Apache Spark: A big data processing framework that can be used for
large-scale EDA.

Hadoop: A framework that allows for distributed processing of large data sets across clusters of computers.

5. Interactive Notebooks

Jupyter Notebook: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.

Google Colab: A free Jupyter notebook environment that runs in the cloud and supports Python code execution.

These tools enable data scientists and analysts to explore data, identify
patterns, and extract meaningful insights, often through visualizations
and statistical summaries.

Introduction to the popular EDA tools (Pandas, NumPy, Matplotlib, Seaborn, and Tableau)
1. Pandas

 Pandas is a Python library that makes it easy to work with data, especially in tables (like spreadsheets).

Advantage:

 It helps you clean, organize, and analyze data quickly. You can filter data, calculate statistics, and handle missing values with just a few commands.
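A minimal sketch of these everyday pandas commands, on a made-up table with a missing value:

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'],
                   'Age': [25, None, 35]})

print(df[df['Age'] > 30])                     # filter rows
print(df['Age'].mean())                       # a quick statistic (NaN is skipped)
print(df.fillna({'Age': df['Age'].mean()}))   # fill the missing value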

2. NumPy

 NumPy is a Python library that focuses on numerical operations, especially with large datasets.

Advantage:

 It provides support for working with arrays (lists of numbers) and matrices, making calculations faster and more efficient. It's the foundation for many other data science tools.
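A minimal sketch of NumPy's vectorised arrays, with made-up prices:

import numpy as np

prices = np.array([900.5, 912.3, 905.7, 930.1])

# Vectorised operations work on the whole array at once
print(prices.mean(), prices.max() - prices.min())
print(prices * 1.1)                     # element-wise arithmetic

matrix = np.arange(12).reshape(3, 4)    # a 3 x 4 matrix
print(matrix.sum(axis=0))               # column sums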

3. Matplotlib

 Matplotlib is a plotting library for creating static, 2D graphs and charts in Python.

Advantage:

 It allows you to create a wide variety of visualizations, like line charts, bar graphs, and scatter plots, to help you understand and present your data.
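A minimal sketch of one such plot, a histogram (one of the common EDA charts listed in this unit), using randomly generated scores for illustration:

import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: 500 scores drawn from a normal distribution
scores = np.random.default_rng(42).normal(loc=70, scale=10, size=500)

plt.hist(scores, bins=20, color='steelblue', edgecolor='black')
plt.title('Distribution of Scores')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.show()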
4. Seaborn

 Seaborn is built on top of Matplotlib and provides a higher-level interface for making attractive and informative statistical graphics.

Advantage:

 It simplifies the process of creating complex visualizations, like heatmaps and violin plots, making it easier to explore relationships between different parts of your data.
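A minimal sketch of the two plot types mentioned above, on a small made-up numeric table:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up numeric data for three related variables
rng = np.random.default_rng(1)
df = pd.DataFrame({'height': rng.normal(170, 8, 100)})
df['weight'] = 0.9 * df['height'] + rng.normal(0, 5, 100)
df['age'] = rng.integers(18, 60, 100)

# Heatmap of the correlation matrix
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

# Violin plot of one variable's distribution
sns.violinplot(y=df['weight'])
plt.title('Weight Distribution')
plt.show()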

5. Tableau

 Tableau is a powerful data visualization tool that allows you to create interactive and shareable dashboards.

Advantage:

 It is user-friendly and doesn't require coding. You can connect to different data sources, drag and drop to create visuals, and explore your data through interactive charts and graphs.

Visual Aids for EDA

1. Line chart

• We will use the matplotlib library and stock price data to plot time-series lines. First of all, let's understand the dataset.

• We have created a function using the radar and random Python libraries to generate the dataset.

• It is the simplest possible dataset, with just two columns. The first column is Date and the second column is Price, indicating the stock price on that date.

• Let's generate the dataset by calling the helper method. In addition to this, we save the data to a CSV file, which we can optionally load back using the pandas read_csv function before visualization. The generateData function is defined here:

Example Source code:

import datetime
import random

import pandas as pd
import radar

def generateData(n):
    listdata = []
    # Intended date range (the radar call below uses the same range as strings)
    start = datetime.datetime(2019, 8, 1)
    end = datetime.datetime(2019, 8, 30)
    for _ in range(n):
        # Generate a random date within the specified range
        date = radar.random_datetime(start='2019-08-01', stop='2019-08-30').strftime("%Y-%m-%d")
        # Generate a random price between 900 and 1000
        price = round(random.uniform(900, 1000), 4)
        listdata.append([date, price])
    # Create a DataFrame from the list of data
    df = pd.DataFrame(listdata, columns=['Date', 'Price'])
    # Convert the 'Date' column to datetime format
    df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
    # Group the data by date and take the mean of the prices
    df = df.groupby(by='Date').mean()
    return df
Explanation:

 Imports: The necessary libraries are imported at the beginning.

 generateData function:

o The function takes an integer n as input, which determines the number of data points to generate.

o It initializes a list listdata to store the generated data.

o A random date is generated using radar.random_datetime, and a random price is generated between 900 and 1000.

o The data is appended to the listdata list.

o The data is converted to a pandas DataFrame.

o The 'Date' column is converted to a datetime object.

o The data is grouped by date, and the average price for each date is computed.

 Output: The function returns a DataFrame with the mean price for each date.

The output of the preceding code is a DataFrame of dates and their mean prices.
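As described above, the generated dataset can be saved to CSV, reloaded with read_csv, and plotted as a time-series line. A minimal sketch, assuming the generateData function from the previous listing (and the radar package) is available:

import pandas as pd
import matplotlib.pyplot as plt

df = generateData(50)

# Save and (optionally) reload the dataset
df.to_csv('stock_prices.csv')
df = pd.read_csv('stock_prices.csv', parse_dates=['Date'], index_col='Date')

# Plot the stock price as a time-series line chart
df['Price'].plot(title='Stock price, August 2019')
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()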


Plotting a simple line chart

Sample Source Code:

import matplotlib.pyplot as plt

# Sample data

x = [1, 2, 3, 4, 5]

y = [2, 3, 5, 7, 11]

# Create the plot

plt.plot(x, y, marker='o')

# Add title and labels

plt.title("Simple Line Graph")

plt.xlabel("X-axis Label")

plt.ylabel("Y-axis Label")

# Show the plot


plt.show()

EXPLANATION:

 matplotlib.pyplot: This library is used to create static, animated, and interactive visualizations in Python.
 Data: The x and y lists represent the data points to be plotted on the
graph.
 plt.plot(): This function plots the line graph. The marker='o' argument
adds markers at each data point.
 plt.title(), plt.xlabel(), plt.ylabel(): These functions add a title and
labels to the x and y axes.
 plt.show(): This function displays the graph.
OUTPUT:

2. Bar charts

• This is one of the most common types of visualization that almost everyone must have encountered. Bars can be drawn horizontally or vertically to represent categorical variables (a horizontal variant is sketched after the example below).

• Bar charts are frequently used to distinguish objects between distinct collections in order to track variations over time.

• In most cases, bar charts are very convenient when the changes are large.

• In order to learn about bar charts, we can use the calendar Python library to keep track of the months of the year (1 to 12), corresponding to January to December:
Example Source Code:

import numpy as np

import calendar

import matplotlib.pyplot as plt

import random

# Step 2: Set up the data

months = list(range(1, 13))

sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]

# Step 3: Specify the layout of the figure and allocate space

figure, axis = plt.subplots()

# Step 4: Display the names of the months on the x-axis

plt.xticks(months, calendar.month_name[1:13], rotation=20)

# Step 5: Plot the graph

plot = axis.bar(months, sold_quantity)

# Step 6: Display data values on the head of the bars (optional)

for rectangle in plot:
    height = rectangle.get_height()
    axis.text(rectangle.get_x() + rectangle.get_width() / 2., 1.002 * height,
              '%d' % int(height), ha='center', va='bottom')


# Step 7: Display the graph on the screen

plt.show()
EXPLANATION

1. Imports:
o numpy, calendar, and matplotlib.pyplot are imported along with
random.
2. Set up Data:
o months is a list of integers from 1 to 12.
o sold_quantity generates random integers between 100 and 200 for
each month.
3. Figure and Axis Layout:
o figure, axis = plt.subplots() creates a figure and an axis for plotting.
4. Customize X-axis:
o plt.xticks() sets custom tick labels on the x-axis to display month
names.
5. Plot the Graph:
o axis.bar() creates a bar chart using the data.
6. Annotate Bars (Optional):
o A loop iterates through the bars and places the height value on top of
each bar for clarity.
7. Display the Graph:
o plt.show() renders the graph.
Running this code will display a bar chart showing the number of items sold for
each month, with the month names on the x-axis and the sold quantity on the y-
axis. The numbers on top of the bars represent the exact quantities sold.

OUTPUT:
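As noted above, bars can also be drawn horizontally. A minimal sketch of the horizontal variant (barh), regenerating the same kind of illustrative data:

import calendar
import random
import matplotlib.pyplot as plt

months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for _ in months]

figure, axis = plt.subplots()
axis.barh(months, sold_quantity, color='lightgreen')
axis.set_yticks(months)
axis.set_yticklabels(calendar.month_name[1:13])
axis.set_xlabel('Quantity sold')
axis.set_title('Items sold per month (horizontal bars)')
plt.show()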
3. Scatter plot

Scatter plots are also called scatter graphs, scatter charts, scattergrams, and scatter diagrams. They use a Cartesian coordinate system to display the values of typically two variables for a set of data.

When should we use a scatter plot? Scatter plots can be constructed in the following two situations:

• When one continuous variable is dependent on another variable, which is under the control of the observer

• When both continuous variables are independent

There are two important concepts here: the independent variable and the dependent variable.

• In statistical or mathematical modeling, the values of dependent variables rely on the values of independent variables.

• The dependent variable is the outcome variable being studied.

• The independent variables are also referred to as regressors.

• Scatter plots are used when we need to show the relationship between two variables, and hence are sometimes referred to as correlation plots.
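The strength of such a relationship can also be quantified numerically with pandas; a minimal sketch on made-up paired observations:

import pandas as pd

df = pd.DataFrame({'hours_studied': [1, 2, 3, 4, 5, 6],
                   'exam_score': [52, 58, 61, 70, 74, 80]})

# Pearson correlation coefficient between the two variables
print(df['hours_studied'].corr(df['exam_score']))

# Full correlation matrix
print(df.corr())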

Example Source Code:

import matplotlib.pyplot as plt
import pandas as pd

# Example data creation
data = {
    # Age in months
    'age': [0, 6, 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132, 144, 156, 168, 180],
    # Minimum recommended hours of sleep
    'min_recommended': [14, 14, 14, 13, 12, 12, 11, 11, 10, 10, 10, 9, 9, 8, 8, 7, 7],
    # Maximum recommended hours of sleep
    'max_recommended': [17, 17, 17, 15, 14, 14, 13, 13, 12, 12, 12, 11, 11, 10, 10, 9, 9]
}

# Creating a DataFrame
sleepDf = pd.DataFrame(data)

# Scatter plot for minimum recommended sleep hours vs age
plt.scatter(sleepDf['age'] / 12., sleepDf['min_recommended'],
            color='green', label='Min Recommended')

# Scatter plot for maximum recommended sleep hours vs age
plt.scatter(sleepDf['age'] / 12., sleepDf['max_recommended'],
            color='red', label='Max Recommended')

# Labeling the x-axis (age in years)
plt.xlabel('Age of person in Years')

# Labeling the y-axis (total hours of sleep required)
plt.ylabel('Total hours of sleep required')

# Adding a title to the plot
plt.title('Recommended Sleep Hours by Age')

# Adding a legend to distinguish between the points
plt.legend()

# Display the plot
plt.show()
Explanation:
1. Data Creation:
 The data dictionary holds age in months and the corresponding min_recommended and max_recommended sleep hours.
2. Creating the Scatter Plots:
 plt.scatter(sleepDf['age'] / 12., sleepDf['min_recommended'],
color='green', label='Min Recommended'):
Creates a scatter plot of minimum recommended sleep hours (dependent
variable) against age in years (independent variable).
The points are colored green.
 plt.scatter(sleepDf['age'] / 12., sleepDf['max_recommended'],
color='red', label='Max Recommended'):
 Creates a scatter plot of maximum recommended sleep hours
(dependent variable) against age in years (independent variable).
 The points are colored red.
3. Labeling and Titles:
 The x-axis is labeled "Age of person in Years" and the y-axis is
labeled "Total hours of sleep required."
 The plot is titled "Recommended Sleep Hours by Age."
4. Legend:
 plt.legend() adds a legend to differentiate between the minimum and
maximum recommended sleep hours.
5. Display:
 plt.show() renders the scatter plot.
Result:
This code will generate a scatter plot with two sets of points:
 Green points represent the minimum recommended sleep hours for
each age.
 Red points represent the maximum recommended sleep hours for
each age.
 The x-axis represents the age in years, and the y-axis represents the
total hours of sleep required.
This type of scatter plot is useful for observing how the recommended
sleep hours vary as a person ages, making it easy to compare the minimum
and maximum recommendations.

OUTPUT:
Data Transformation Techniques

Data transformation is a set of techniques used to convert data from one format or structure to another. The following are some examples of transformation activities:

 Data deduplication involves the identification of duplicates and their removal.

 Key restructuring involves transforming any keys with built-in meanings to generic keys.

 Data cleansing involves deleting out-of-date, inaccurate, and incomplete information from the source without changing its meaning, in order to enhance the accuracy of the source data.

 Data validation is the process of formulating rules or algorithms that help in validating different types of data against some known issues.

 Format revisioning involves converting from one format to another.

 Data derivation consists of creating a set of rules to generate more information from the data source.

 Data aggregation involves searching, extracting, summarizing, and preserving important information in different types of reporting systems.

 Data integration involves converting different data types and merging them into a common structure or schema.

 Data filtering involves identifying information relevant to a particular user.

 Data joining involves establishing a relationship between two or more tables.

The main reason for transforming the data is to get a better representation such that the transformed data is compatible with other data. In addition to this, interoperability in a system can be achieved by following a common data structure and format.

1. Data Deduplication

Explanation:

Data deduplication involves identifying and removing duplicate entries in a dataset.

Sample Source Code:

import pandas as pd

# Create a DataFrame with duplicate rows

data = {'ID': [1, 2, 2, 4, 5],
        'Name': ['Alice', 'Bob', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 30, 35, 40]}

df = pd.DataFrame(data)

# Drop duplicate rows

df_dedup = df.drop_duplicates()

print("DataFrame after deduplication:")

print(df_dedup)

OUTPUT:

DataFrame after deduplication:

ID Name Age

0 1 Alice 25

1 2 Bob 30

3 4 Charlie 35

4 5 David 40

2. Key Restructuring

Explanation:

Transforming keys with specific meanings to generic keys.

Example Python Program:

import pandas as pd
# Create a DataFrame with meaningful keys

data = {'EmployeeID': [1, 2, 3],

'EmployeeName': ['Alice', 'Bob', 'Charlie'],

'EmployeeAge': [25, 30, 35]}

df = pd.DataFrame(data)

# Rename columns to generic keys

df_restructured = df.rename(columns={'EmployeeID': 'ID',
                                     'EmployeeName': 'Name',
                                     'EmployeeAge': 'Age'})

print("DataFrame after key restructuring:")

print(df_restructured)

OUTPUT:

DataFrame after key restructuring:

ID Name Age

0 1 Alice 25

1 2 Bob 30

2 3 Charlie 35

3. Data Cleansing

Explanation: Removing inaccurate or incomplete information.


Example Python Program:

import pandas as pd

# Create a DataFrame with some incorrect data

data = {'Name': ['Alice', 'Bob', 'Charlie', None],

'Age': [25, None, 35, 40]}

df = pd.DataFrame(data)

# Drop rows with missing values

df_cleaned = df.dropna()

print("DataFrame after data cleansing:")

print(df_cleaned)

OUTPUT:

DataFrame after data cleansing:

Name Age

0 Alice 25.0

2 Charlie 35.0

4. Data Validation

Explanation: Validating data against certain rules.

Example Python Program:


import pandas as pd

# Create a DataFrame

data = {'Age': [25, 30, -5, 35]} # -5 is an invalid age

df = pd.DataFrame(data)

# Validate ages (must be non-negative)

df_validated = df[df['Age'] >= 0]

print("DataFrame after data validation:")

print(df_validated)

OUTPUT:

DataFrame after data validation:

Age

0 25

1 30

3 35

5. Format Revisioning

Explanation: Converting data from one format to another.

Example Source Code:

import pandas as pd
# Create a DataFrame

data = {'Date': ['2024-01-01', '2024-02-01']}

df = pd.DataFrame(data)

# Convert Date column from string to datetime

df['Date'] = pd.to_datetime(df['Date'])

print("DataFrame after format revisioning:")

print(df)

OUTPUT:

DataFrame after format revisioning:

Date

0 2024-01-01

1 2024-02-01

6. Data Derivation

Explanation: Creating new information from existing data.

Example Source code:

import pandas as pd

# Create a DataFrame

data = {'Sales': [100, 200, 300]}

df = pd.DataFrame(data)

# Derive a new column with a 10% increase in sales

df['SalesIncrease'] = df['Sales'] * 1.10

print("DataFrame after data derivation:")

print(df)
OUTPUT:

DataFrame after data derivation:

Sales SalesIncrease

0 100 110.0

1 200 220.0

2 300 330.0

7. Data Aggregation

Explanation: Summarizing data, e.g., calculating total or average.

Example Python Program:

import pandas as pd

# Create a DataFrame

data = {'Category': ['A', 'A', 'B', 'B'],

'Sales': [100, 150, 200, 250]}

df = pd.DataFrame(data)

# Aggregate data by category

df_aggregated = df.groupby('Category').sum()

print("DataFrame after data aggregation:")

print(df_aggregated)

Output:

DataFrame after data aggregation:

Sales

Category

A 250
B 450

8. Data Integration

Explanation: Combining different datasets into one.

Example Python Program:

import pandas as pd

# Create two DataFrames

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})

df2 = pd.DataFrame({'ID': [1, 2, 3], 'Age': [25, 30, 35]})

# Merge DataFrames on 'ID'

df_integrated = pd.merge(df1, df2, on='ID')

print("DataFrame after data integration:")

print(df_integrated)

Output:

DataFrame after data integration:

ID Name Age

0 1 Alice 25

1 2 Bob 30

2 3 Charlie 35

9. Data Filtering

Explanation: Extracting specific information based on conditions.

Example Python Program:

import pandas as pd
# Create a DataFrame

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],

'Age': [25, 30, 35, 40]}

df = pd.DataFrame(data)

# Filter data for people older than 30

df_filtered = df[df['Age'] > 30]

print("DataFrame after data filtering:")

print(df_filtered)

Output:

DataFrame after data filtering:

Name Age

2 Charlie 35

3 David 40

10. Data Joining

Explanation: Establishing relationships between tables based on common keys.

Example Python Program:

import pandas as pd

# Create two DataFrames

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})

df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 45]})

# Perform a join operation (left join)

df_joined = pd.merge(df1, df2, on='ID', how='left')


print("DataFrame after data joining:")

print(df_joined)

Output:

DataFrame after data joining:

ID Name Age

0 1 Alice 25

1 2 Bob 30

2 3 Charlie NaN

In this example, the left join includes all records from df1 and matches
records from df2 based on the ID column. Charlie does not have a
corresponding ID in df2, so Age is NaN.

These examples demonstrate various data transformation techniques and how they can be implemented using Python and pandas.

Benefits of data transformation

• Data transformation promotes interoperability between several applications. The main reason for creating a similar format and structure in the dataset is that it becomes compatible with other systems.

• Comprehensibility, for both humans and computers, is improved when using better-organized data compared to messier data.

• Data transformation ensures a higher degree of data quality and protects applications from several computational challenges such as null values, unexpected duplicates, incorrect indexing, and incompatible structures or formats.

• Data transformation ensures higher performance and scalability for modern analytical databases and dataframes.


Challenges

The process of data transformation can be challenging for several reasons:

• It requires a qualified team of experts and state-of-the-art infrastructure. The cost of acquiring such experts and infrastructure can increase the cost of the operation.

• Data transformation requires data cleaning before transformation and data migration. This process of cleansing can be expensive and time-consuming.

• Generally, data transformation activities involve batch processing. This means that sometimes we might have to wait for a day before the next batch of data is ready for cleansing. This can be very slow.

Grouping and aggregation


Grouping and aggregation are powerful techniques for summarizing and visualizing data.
These techniques help you analyze data by grouping it into categories and computing
aggregate statistics for each category. In Python, the pandas library is commonly used for
these operations, and matplotlib or seaborn is used for visualization.

Here's a detailed guide on how to perform grouping and aggregation in data visualization
using Python:

Example Dataset

Let's start with a sample dataset to demonstrate these techniques. We'll use a DataFrame
with sales data across different regions and products.

import pandas as pd

# Sample data

data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Product': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [200, 150, 300, 250, 180, 210, 320, 270],
    'Quantity': [20, 15, 30, 25, 18, 21, 32, 27]
}

df = pd.DataFrame(data)

1. Grouping and Aggregation

Group by Region and Aggregate Sales

# Group by 'Region' and aggregate sales (sum and average)

grouped_df = df.groupby('Region').agg({

'Sales': ['sum', 'mean'],

'Quantity': ['sum', 'mean']

})

print("Grouped and Aggregated DataFrame:")

print(grouped_df)

Output:

Grouped and Aggregated DataFrame:

Sales Quantity

sum mean sum mean

Region

East 620 310.0 62 31.0

North 380 190.0 38 19.0

South 360 180.0 36 18.0

West 520 260.0 52 26.0


2. Visualization with Matplotlib

Bar Plot for Total Sales by Region

import matplotlib.pyplot as plt

# Group by 'Region' and aggregate sales (sum)

sales_by_region = df.groupby('Region')['Sales'].sum()

# Plot

sales_by_region.plot(kind='bar', color='skyblue')

plt.title('Total Sales by Region')

plt.xlabel('Region')

plt.ylabel('Total Sales')

plt.show()

Explanation: This creates a bar plot showing the total sales for each region.

Bar Plot for Average Sales and Quantity by Product

# Group by 'Product' and aggregate sales and quantity (mean)

agg_by_product = df.groupby('Product').agg({

'Sales': 'mean',

'Quantity': 'mean'

})

# Plot

agg_by_product.plot(kind='bar')

plt.title('Average Sales and Quantity by Product')

plt.xlabel('Product')

plt.ylabel('Average Value')

plt.show()
Explanation: This creates a bar plot for the average sales and quantity of each product.

3. Visualization with Seaborn

Seaborn provides a high-level interface for creating attractive and informative statistical
graphics.

Sample Source code:

#Box Plot for Sales Distribution by Region

import seaborn as sns

# Plot

sns.boxplot(x='Region', y='Sales', data=df)

plt.title('Sales Distribution by Region')

plt.xlabel('Region')

plt.ylabel('Sales')

plt.show()

Explanation: This creates a box plot that shows the distribution of sales across different
regions, highlighting the spread and outliers.

 Grouping and Aggregation: Use pandas to group data and compute aggregate
statistics like sum, mean, etc.

 Visualization with Matplotlib and Seaborn: Use matplotlib and seaborn to create
visualizations that make the aggregated data easier to interpret.

These techniques are essential for understanding patterns and trends in your data,
enabling better decision-making based on summarized information.
