Notes Unit I
EXPLORATORY DATA ANALYSIS
1. EDA Fundamentals
Data Science
• Data science is at the peak of its hype, and the skills expected of
data scientists are changing. Data scientists are now required not
only to build performant models, but also to explain the results they
obtain and to use those results for business intelligence.
ix. Communication: presenting the results and their business
implications to stakeholders.
❖ Steps in EDA
EDA broadly involves defining data requirements, collecting and
cleaning the data, building datasets, exploring them through
visualizations and summary statistics, and communicating the findings.
1. Discrete data
• A variable that can take only certain distinct values, typically
counts, such as the number of students in a class.
2. Continuous data
• A variable that can have an infinite number of numerical
values within a given range, such as height or temperature.
3. Categorical data
• A variable whose values represent categories or groups, such as
blood type or country.
1. Measurement scales
Nominal
• Values are labels with no inherent order, for example:
• Male
• Female
• Other
• Biological species
Ordinal
• Values for which the order of the categories is significant, for
example ratings such as low, medium, and high, although the
differences between categories are not measurable.
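The nominal/ordinal distinction above can be made explicit in pandas. A minimal sketch (the category labels here are illustrative):

```python
import pandas as pd

# Nominal: labels with no inherent order
gender = pd.Categorical(["Male", "Female", "Other"])

# Ordinal: the order of the categories is significant
rating = pd.Categorical(
    ["low", "high", "medium"],
    categories=["low", "medium", "high"],
    ordered=True,
)

# Ordered categoricals support min/max and comparisons
print(rating.min(), rating.max())  # low high
```

Because `rating` is ordered, pandas knows that "low" < "medium" < "high"; an unordered categorical like `gender` rejects such comparisons.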
1. Python Libraries
2. R Libraries
3. Standalone Software
Orange: A data visualization and analysis tool for both novices and
experts in data science.
4. Big Data Frameworks
Apache Spark: A big data processing framework that can be used for
large-scale EDA.
5. Interactive Notebooks
These tools enable data scientists and analysts to explore data, identify
patterns, and extract meaningful insights, often through visualizations
and statistical summaries.
1. Pandas
Advantage:
It helps you clean, organize, and analyze data quickly. You can
filter data, calculate statistics, and handle missing values with just
a few commands.
2. NumPy
Advantage:
It provides fast, memory-efficient arrays and vectorized numerical
operations that underpin most other Python data libraries.
3. Matplotlib
Advantage:
It gives you fine-grained control over a wide range of static plots,
such as line, bar, and scatter charts.
4. Seaborn
Advantage:
It builds on matplotlib to produce attractive statistical graphics,
such as box plots and heatmaps, with very little code.
5. Tableau
Advantage:
It lets you build interactive dashboards and visual analyses with a
drag-and-drop interface, without writing code.
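As a small illustration of the NumPy advantage mentioned above (the prices and quantities are made-up numbers), a single vectorized expression replaces an explicit Python loop:

```python
import numpy as np

# Illustrative prices and quantities
prices = np.array([100.0, 250.0, 75.0, 310.0])
quantities = np.array([3, 1, 4, 2])

# Element-wise multiplication over whole arrays, no loop needed
revenue = prices * quantities
print(revenue.sum())  # 1470.0
```

The same computation in pure Python would need a loop or a comprehension; on large arrays the vectorized form is also dramatically faster.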
1. Line chart
• We will use the matplotlib library and the stock price data to
plot time series lines. First of all, let's understand the dataset.
import datetime
import random
import pandas as pd
import radar  # third-party library for generating random dates
def generateData(n):
listdata = []
start = datetime.datetime(2019, 8, 1)
end = datetime.datetime(2019, 8, 30)
for _ in range(n):
# Pick a random date in the window and a random price
date = radar.random_datetime(start=start, stop=end).strftime("%Y-%m-%d")
price = round(random.uniform(900, 1000), 4)
listdata.append([date, price])
df = pd.DataFrame(listdata, columns=['Date', 'Price'])
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
# Group the data by date and take the mean of the prices
df = df.groupby(by='Date').mean()
return df
Explanation:
generateData function:
o For each of the n rows, a random date within August 2019 and a
random price between 900 and 1000 are generated.
o The data is grouped by date, and the average price for each date is
computed.
Output: The function returns a DataFrame indexed by date with the mean price for each date.
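If the third-party radar package is not available, the same idea can be sketched with the standard library alone (the function name and value ranges here mirror the notes, but this is an illustrative variant, not the original code):

```python
import datetime
import random
import pandas as pd

def generate_data(n):
    start = datetime.datetime(2019, 8, 1)
    rows = []
    for _ in range(n):
        # Random date within a 30-day window and a random price
        date = start + datetime.timedelta(days=random.randrange(30))
        price = round(random.uniform(900, 1000), 4)
        rows.append([date, price])
    df = pd.DataFrame(rows, columns=['Date', 'Price'])
    # Average the prices recorded on the same date
    return df.groupby(by='Date').mean()

print(generate_data(50).head())
```

The resulting frame is indexed by date with a single Price column, ready for a time-series line plot.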
Line chart:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
plt.plot(x, y, marker='o')
plt.xlabel("X-axis Label")
plt.ylabel("Y-axis Label")
plt.show()
EXPLANATION:
plt.plot() draws the data points joined by straight line segments;
marker='o' places a circular marker at each point, and plt.show()
renders the figure.
2. Bar charts
• In most cases, bar charts are very convenient when the differences
between the plotted values are large enough to compare at a glance.
import numpy as np
import calendar
import random
import matplotlib.pyplot as plt
# Months 1..12 and a random sold quantity for each month
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for _ in months]
figure, axis = plt.subplots()
# Show month names instead of numbers on the x-axis
plt.xticks(months, calendar.month_name[1:13], rotation=20)
plot = axis.bar(months, sold_quantity)
# Annotate each bar with its height
for rectangle in plot:
height = rectangle.get_height()
axis.text(rectangle.get_x() + rectangle.get_width() / 2., height * 1.01,
str(int(height)), ha='center', va='bottom')
plt.show()
EXPLANATION
1. Imports:
o numpy, calendar, and matplotlib.pyplot are imported along with
random.
2. Set up Data:
o months is a list of integers from 1 to 12.
o sold_quantity generates random integers between 100 and 200 for
each month.
3. Figure and Axis Layout:
o figure, axis = plt.subplots() creates a figure and an axis for plotting.
4. Customize X-axis:
o plt.xticks() sets custom tick labels on the x-axis to display month
names.
5. Plot the Graph:
o axis.bar() creates a bar chart using the data.
6. Annotate Bars (Optional):
o A loop iterates through the bars and places the height value on top of
each bar for clarity.
7. Display the Graph:
o plt.show() renders the graph.
Running this code will display a bar chart showing the number of items sold for
each month, with the month names on the x-axis and the sold quantity on the y-
axis. The numbers on top of the bars represent the exact quantities sold.
OUTPUT:
1.3 Scatter plot
Scatter plots are also called scatter graphs, scatter charts, scattergrams,
and
scatter diagrams. They use a Cartesian coordinates system to display
values of typically two variables for a set of data.
Example Source :
import matplotlib.pyplot as plt
import pandas as pd
# Example data creation
data = {
'age': [0, 6, 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132, 144, 156,
168, 180],
# Age in months
'min_recommended': [14, 14, 14, 13, 12, 12, 11, 11, 10, 10, 10, 9,
9, 8, 8, 7, 7], # Min hours
'max_recommended': [17, 17, 17, 15, 14, 14, 13, 13, 12, 12, 12, 11, 11,
10, 10, 9, 9]
# Max hours
}
# Creating a DataFrame
sleepDf = pd.DataFrame(data)
# Scatter plot for minimum recommended sleep hours vs age
plt.scatter(sleepDf['age'] / 12., sleepDf['min_recommended'],
color='green', label='Min Recommended')
# Scatter plot for maximum recommended sleep hours vs age
plt.scatter(sleepDf['age'] / 12., sleepDf['max_recommended'],
color='red', label='Max Recommended')
# Labeling the x-axis (Age in years)
plt.xlabel('Age of person in Years')
# Labeling the y-axis (Total hours of sleep required)
plt.ylabel('Total hours of sleep required')
# Adding a title to the plot
plt.title('Recommended Sleep Hours by Age')
# Adding a legend to distinguish between the points
plt.legend()
# Display the plot
plt.show()
Explanation:
1. Data Creation:
The data dictionary is the same as before, with age in months and the
corresponding min_recommended and max_recommended sleep hours.
2. Creating the Scatter Plots:
plt.scatter(sleepDf['age'] / 12., sleepDf['min_recommended'],
color='green', label='Min Recommended'):
Creates a scatter plot of minimum recommended sleep hours (dependent
variable) against age in years (independent variable).
The points are colored green.
plt.scatter(sleepDf['age'] / 12., sleepDf['max_recommended'],
color='red', label='Max Recommended'):
Creates a scatter plot of maximum recommended sleep hours
(dependent variable) against age in years (independent variable).
The points are colored red.
3. Labeling and Titles:
The x-axis is labeled "Age of person in Years" and the y-axis is
labeled "Total hours of sleep required."
The plot is titled "Recommended Sleep Hours by Age."
4. Legend:
plt.legend() adds a legend to differentiate between the minimum and
maximum recommended sleep hours.
5. Display:
plt.show() renders the scatter plot.
Result:
This code will generate a scatter plot with two sets of points:
Green points represent the minimum recommended sleep hours for
each age.
Red points represent the maximum recommended sleep hours for
each age.
The x-axis represents the age in years, and the y-axis represents the
total hours of sleep required.
This type of scatter plot is useful for observing how the recommended
sleep hours vary as a person ages, making it easy to compare the minimum
and maximum recommendations.
OUTPUT:
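The downward trend visible in the scatter plot can also be quantified numerically. A sketch computing the Pearson correlation between age and the minimum-recommended series from the example above:

```python
import numpy as np

age_months = [0, 6, 12, 24, 36, 48, 60, 72, 84, 96,
              108, 120, 132, 144, 156, 168, 180]
min_recommended = [14, 14, 14, 13, 12, 12, 11, 11, 10, 10,
                   10, 9, 9, 8, 8, 7, 7]

# Pearson correlation between age and minimum recommended sleep
r = np.corrcoef(age_months, min_recommended)[0, 1]
print(round(r, 2))  # strongly negative: sleep need falls with age
```

A correlation near -1 confirms what the scatter plot suggests visually: recommended sleep decreases almost linearly with age over this range.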
2. Data Transformation Techniques
Data Transformation
The main reason for transforming data is to obtain a better
representation, so that the transformed data is compatible with other
data and easier to analyze.
1. Data Duplication
Explanation:
drop_duplicates() removes rows that are exact copies of an earlier
row, keeping the first occurrence.
import pandas as pd
# Sample data; row 2 duplicates row 1 (illustrative values)
data = {'ID': [1, 2, 2, 4, 5],
'Name': ['Alice', 'Bob', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 30, 35, 40]}
df = pd.DataFrame(data)
df_dedup = df.drop_duplicates()
print(df_dedup)
OUTPUT:
ID Name Age
0 1 Alice 25
1 2 Bob 30
3 4 Charlie 35
4 5 David 40
2. Key Restructuring
Explanation:
Key restructuring replaces opaque, system-specific keys with simple,
meaningful ones.
import pandas as pd
# Sample data with an opaque system key (illustrative values)
data = {'SysKey': ['A-001', 'A-002', 'A-003'],
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]}
df = pd.DataFrame(data)
# Replace the system key with a simple sequential ID
df_restructured = df.rename(columns={'SysKey': 'ID'})
df_restructured['ID'] = range(1, len(df_restructured) + 1)
print(df_restructured)
OUTPUT:
ID Name Age
0 1 Alice 25
1 2 Bob 30
2 3 Charlie 35
3. Data Cleansing
import pandas as pd
# Sample data with a missing value (illustrative values)
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, None, 35]}
df = pd.DataFrame(data)
# Drop rows containing missing values
df_cleaned = df.dropna()
print(df_cleaned)
OUTPUT:
Name Age
0 Alice 25.0
2 Charlie 35.0
4. Data Validation
import pandas as pd
# Sample data with one invalid age (illustrative values)
data = {'Age': [25, 30, -5, 35]}
# Create a DataFrame
df = pd.DataFrame(data)
# Keep only rows with a plausible age
df_validated = df[(df['Age'] > 0) & (df['Age'] < 120)]
print(df_validated)
OUTPUT:
Age
0 25
1 30
3 35
5. Format Revisioning
import pandas as pd
# Sample data with dates stored as strings (illustrative values)
data = {'Date': ['2024-01-01', '2024-02-01']}
# Create a DataFrame
df = pd.DataFrame(data)
# Convert the string column to a proper datetime type
df['Date'] = pd.to_datetime(df['Date'])
print(df)
OUTPUT:
Date
0 2024-01-01
1 2024-02-01
6. Data Derivation
import pandas as pd
# Sample sales figures (illustrative values)
data = {'Sales': [100, 200, 300]}
# Create a DataFrame
df = pd.DataFrame(data)
# Derive a new column: projected sales after a 10% increase
df['SalesIncrease'] = df['Sales'] * 1.10
print(df)
OUTPUT:
Sales SalesIncrease
0 100 110.0
1 200 220.0
2 300 330.0
7. Data Aggregation
import pandas as pd
# Sample sales per category (illustrative values)
data = {'Category': ['A', 'A', 'B', 'B'],
'Sales': [100, 150, 200, 250]}
# Create a DataFrame
df = pd.DataFrame(data)
df_aggregated = df.groupby('Category').sum()
print(df_aggregated)
Output:
Sales
Category
A 250
B 450
8. Data Integration
import pandas as pd
# Two sources describing the same entities (illustrative values)
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Age': [25, 30, 35]})
# Merge the two sources on the shared key
df_integrated = pd.merge(df1, df2, on='ID')
print(df_integrated)
Output:
ID Name Age
0 1 Alice 25
1 2 Bob 30
2 3 Charlie 35
9. Data Filtering
import pandas as pd
# Sample data (illustrative values)
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40]}
# Create a DataFrame
df = pd.DataFrame(data)
# Keep only rows with Age greater than 30
df_filtered = df[df['Age'] > 30]
print(df_filtered)
Output:
Name Age
2 Charlie 35
3 David 40
10. Data Joining
import pandas as pd
# Left frame has three people; right frame has ages for only two
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2], 'Age': [25, 30]})
# Left join keeps every row of df1
df_joined = pd.merge(df1, df2, on='ID', how='left')
print(df_joined)
Output:
ID Name Age
0 1 Alice 25
1 2 Bob 30
2 3 Charlie NaN
In this example, the left join includes all records from df1 and matches
records from df2 based on the ID column. Charlie does not have a
corresponding ID in df2, so Age is NaN.
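When auditing joins like the one above, pandas can also flag which side each row came from. A sketch using the merge indicator (the `indicator=True` option adds a `_merge` column):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2], 'Age': [25, 30]})

# indicator=True records whether each row matched both frames
df_joined = pd.merge(df1, df2, on='ID', how='left', indicator=True)
print(df_joined)
```

Rows present in both frames are tagged "both", while Charlie's row is tagged "left_only", making unmatched keys easy to find programmatically.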
Here's a detailed guide on how to perform grouping and aggregation in data visualization
using Python:
Example Dataset
Let's start with a sample dataset to demonstrate these techniques. We'll use a DataFrame
with sales data across different regions and products.
import pandas as pd
# Sample data (illustrative values)
data = {
'Region': ['North', 'North', 'South', 'South', 'East', 'East'],
'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
'Sales': [100, 150, 200, 250, 300, 350],
'Quantity': [10, 15, 20, 25, 30, 35]
}
df = pd.DataFrame(data)
grouped_df = df.groupby('Region').agg({
'Sales': 'sum',
'Quantity': 'sum'
})
print(grouped_df)
Output:
        Sales  Quantity
Region
East      650        65
North     250        25
South     450        45
import matplotlib.pyplot as plt
sales_by_region = df.groupby('Region')['Sales'].sum()
# Plot
sales_by_region.plot(kind='bar', color='skyblue')
plt.xlabel('Region')
plt.ylabel('Total Sales')
plt.show()
Explanation: This creates a bar plot showing the total sales for each region.
agg_by_product = df.groupby('Product').agg({
'Sales': 'mean',
'Quantity': 'mean'
})
# Plot
agg_by_product.plot(kind='bar')
plt.xlabel('Product')
plt.ylabel('Average Value')
plt.show()
Explanation: This creates a bar plot for the average sales and quantity of each product.
Seaborn provides a high-level interface for creating attractive and informative statistical
graphics.
import seaborn as sns
# Plot
sns.boxplot(x='Region', y='Sales', data=df)
plt.xlabel('Region')
plt.ylabel('Sales')
plt.show()
Explanation: This creates a box plot that shows the distribution of sales across different
regions, highlighting the spread and outliers.
Grouping and Aggregation: Use pandas to group data and compute aggregate
statistics like sum, mean, etc.
Visualization with Matplotlib and Seaborn: Use matplotlib and seaborn to create
visualizations that make the aggregated data easier to interpret.
These techniques are essential for understanding patterns and trends in your data,
enabling better decision-making based on summarized information.
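As a closing sketch tying grouping and aggregation together (the region, product, and sales values are illustrative), a pivot table summarizes both dimensions at once:

```python
import pandas as pd

# Illustrative sales data
data = {
    'Region': ['North', 'North', 'South', 'South'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 250],
}
df = pd.DataFrame(data)

# Rows = regions, columns = products, cells = total sales
summary = df.pivot_table(index='Region', columns='Product',
                         values='Sales', aggfunc='sum')
print(summary)
```

pivot_table is equivalent to a groupby over two keys followed by an unstack, and the resulting table can be passed directly to .plot(kind='bar') for a grouped bar chart.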