NumPy is a powerful Python library for numerical operations, particularly for working with
arrays and matrices. It provides a high-performance multidimensional array object and tools for
working with these arrays. Here are some key features of NumPy along with examples:
Creating Arrays: NumPy provides various ways to create arrays.
import numpy as np
# Creating a 1D array
arr_1d = np.array([1, 2, 3, 4, 5])
# Creating a 2D array
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
Array Operations: NumPy supports element-wise operations and provides functions for
common mathematical operations.
# Element-wise addition
result_addition = arr_1d + 10
# Element-wise multiplication
result_multiply = arr_2d * 2
Array Indexing and Slicing: NumPy allows you to index and slice arrays efficiently.
# Accessing elements
print(arr_1d[2]) # Output: 3
# Slicing
print(arr_2d[:, 1]) # Output: [2, 5, 8]
# Broadcasting
arr_broadcast = np.array([1, 2, 3]) + 5
Linear Algebra Operations: NumPy provides a set of linear algebra functions for matrix
operations.
# Matrix multiplication
mat1 = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
result_matmul = np.dot(mat1, mat2)
Statistical Functions: NumPy includes various statistical functions for analyzing data.
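No example accompanies this feature in the notes; a minimal sketch of a few built-in statistical functions might look like:

```python
import numpy as np

# Sample data for summary statistics
data = np.array([1, 2, 3, 4, 5])

print(np.mean(data))    # arithmetic mean
print(np.median(data))  # middle value
print(np.std(data))     # standard deviation
print(np.sum(data))     # total
```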
Random Module: NumPy provides a random module for generating random numbers and
arrays.
# Random array
random_array = np.random.rand(3, 3)
# Generate 1000 random numbers from a normal distribution with mean 0 and standard deviation 1
normal_random = np.random.randn(1000)
# Generate 1000 random numbers with a specified mean and standard deviation
desired_mean, desired_std = 10, 2  # example values; use whatever mean and std you need
random_numbers = np.random.normal(loc=desired_mean, scale=desired_std, size=1000)
# Calculate the actual mean and standard deviation of the generated numbers
actual_mean = np.mean(random_numbers)
actual_std = np.std(random_numbers)
# Reshaping arrays with reshape
arr_1d = np.arange(12)
arr_2d = arr_1d.reshape(3, 4)
print("Original 1D array:")
print(arr_1d) # [ 0 1 2 3 4 5 6 7 8 9 10 11]
print("\nReshaped 2D array:")
print(arr_2d) # [[ 0 1 2 3] [ 4 5 6 7] [ 8 9 10 11]]
Pandas:
While NumPy is a library for numerical operations and array manipulation, pandas is a library specifically designed for structured and tabular data manipulation and analysis. In pandas, reshaping is associated more with operations like pivoting, melting, and stacking than with the direct reshaping of arrays.
Pandas provides a rich set of functions for reshaping and manipulating data, which is especially useful when dealing with datasets. The examples below highlight some common operations, but pandas offers many more functionalities for reshaping and transforming data to suit your analysis needs. Keep in mind that the focus of pandas is on tabular data, so it may not directly align with the array-centric operations in NumPy.
Let's explore a few operations in pandas that are related to reshaping data:
Creating a DataFrame: In pandas, the primary data structure is the DataFrame, which is a 2-
dimensional labeled data structure.
import pandas as pd
# Creating a DataFrame from a dictionary
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
})
print("Original DataFrame:")
print(df)
# Output:
# Name Age City
# 0 Alice 25 New York
# 1 Bob 30 San Francisco
# 2 Charlie 35 Los Angeles
Pivoting: The pivot function in pandas is used to reshape the DataFrame by pivoting based on
the values in a column.
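The pivot call itself is missing from the notes; a minimal sketch using the DataFrame from the previous example (names, ages, and cities taken from its printed output):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
})

# Pivot: Names become the row index, Cities become columns, Ages fill the cells
pivoted_df = df.pivot(index='Name', columns='City', values='Age')
print(pivoted_df)
```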
Stacking and Unstacking: The stack and unstack functions are used to reshape a DataFrame by
stacking or unstacking levels.
# One way to obtain stacked_df (a reconstruction): pivot the DataFrame, then stack it
stacked_df = df.pivot(index='Name', columns='City', values='Age').stack()
unstacked_df = stacked_df.unstack()
print("\nUnstacked DataFrame:")
print(unstacked_df)
# Output:
# City Los Angeles New York San Francisco
# Name
# Alice NaN 25.0 NaN
# Bob NaN NaN 30.0
# Charlie 35.0 NaN NaN
Importing Data:
CSV:
import pandas as pd
# Import data from a CSV file
df = pd.read_csv('your_file.csv')
Excel:
# Import data from an Excel file
df = pd.read_excel('your_file.xlsx')
JSON:
# Import data from a JSON file
df = pd.read_json('your_file.json')
Exporting Data:
CSV:
# Export DataFrame to CSV
df.to_csv('output_file.csv', index=False)
Excel:
# Export DataFrame to Excel
df.to_excel('output_file.xlsx', index=False)
JSON:
# Export DataFrame to JSON
df.to_json('output_file.json', orient='records')
Text:
# Export DataFrame to a text file (tab-separated)
df.to_csv('output_file.txt', sep='\t', index=False)
These examples demonstrate basic import and export operations in pandas. Make sure to replace
'your_file.csv', 'your_file.xlsx', 'your_file.json', and 'your_file.txt' with the actual paths or URLs
of your data files.
Keep in mind that these functions have various parameters to handle different configurations of
data, such as specifying delimiters, encoding, header presence, etc. Always refer to the pandas
documentation for more details on these parameters.
Appendix:
A normal distribution, also known as a Gaussian distribution, is a symmetric bell-shaped
probability distribution. In simpler terms, it describes how data is spread around a central value.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Generate random heights for 1000 students (example values: mean 170 cm, std 10 cm)
heights = np.random.normal(loc=170, scale=10, size=1000)
# Histogram of the heights, with the fitted normal curve overlaid
plt.hist(heights, bins=30, density=True, edgecolor='black')
x = np.linspace(heights.min(), heights.max(), 200)
plt.plot(x, norm.pdf(x, loc=heights.mean(), scale=heights.std()))
plt.show()
In this example, the numpy.random.normal function is used to generate random heights for 1000 students. The resulting histogram has a bell-shaped curve, with the majority of heights centered around the mean value.
Understanding the normal distribution is fundamental in statistics, and it often appears in various
natural phenomena and measurements. The concept is widely used in fields such as physics,
biology, finance, and more.
Descriptive Statistics:
Descriptive statistics are used to summarize and describe the main features of a dataset.
import numpy as np
# Sample data
data = np.array([2, 4, 5, 7, 9, 12, 15, 18, 21])
# Mean
mean_value = np.mean(data)
print(f"Mean: {mean_value}")
# Median
median_value = np.median(data)
print(f"Median: {median_value}")
# Standard Deviation
std_dev = np.std(data)
print(f"Standard Deviation: {std_dev}")
# Variance
variance = np.var(data)
print(f"Variance: {variance}")
Correlation:
Correlation measures the strength and direction of a linear relationship between two variables.
import numpy as np
from scipy.stats import pearsonr
# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 1, 6, 8])
# Compute the Pearson correlation coefficient and p-value
corr, p_value = pearsonr(x, y)
print(f"Correlation: {corr}")
import pandas as pd
df = pd.DataFrame({'A': x, 'B': y})
To see the correlation between two variables, we can also use a scatter plot. You can create a scatter plot for two columns using the matplotlib library. Here's an example using the previously defined DataFrame (df), plotting column 'A' against column 'B':
import matplotlib.pyplot as plt
plt.scatter(df['A'], df['B'])
plt.xlabel('A')
plt.ylabel('B')
plt.show()
import pandas as pd
# Sample DataFrame
data = {
'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
'Subcategory': ['X', 'X', 'Y', 'Y', 'X', 'Y'],
'Value': [10, 15, 20, 25, 30, 35]
}
df = pd.DataFrame(data)
df
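The filtering examples this section refers to are missing from the notes; a sketch of two common approaches (boolean indexing with square brackets and the loc accessor; which two methods the notes originally showed is an assumption):

```python
import pandas as pd

data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Subcategory': ['X', 'X', 'Y', 'Y', 'X', 'Y'],
    'Value': [10, 15, 20, 25, 30, 35]
}
df = pd.DataFrame(data)

# Method 1: boolean indexing with square brackets
filtered = df[(df['Category'] == 'A') & (df['Value'] > 15)]

# Method 2: the loc accessor with the same condition
filtered_loc = df.loc[(df['Category'] == 'A') & (df['Value'] > 15)]
print(filtered)
```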
In these examples, the conditions inside the square brackets for both methods specify the filtering criteria. The `&` operator is used for the logical AND operation.
Grouping Data:
To group data in pandas, you typically use the groupby method, which allows you to split the
DataFrame into groups based on one or more columns.
Once you've grouped the data, you can apply various aggregation functions to obtain summary
statistics for each group. Let's explore different formats of aggregation:
You can use built-in aggregation functions like sum(), mean(), count(), max(), min(), etc.
# Group by 'Category' and sum the 'Value' column
result = df.groupby('Category')['Value'].sum()
result
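Several aggregations can also be applied at once with agg; a sketch using the sample Category/Value data defined above:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 15, 20, 25, 30, 35]
})

# Apply several aggregation functions to each group in one call
summary = df.groupby('Category')['Value'].agg(['sum', 'mean', 'count'])
print(summary)
```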
Merging DataFrames:
In pandas, merging refers to the combining of two or more DataFrames based on a common set
of columns or indices. There are several types of merges:
1. Inner Merge:
An inner merge returns only the rows where there is a match in both DataFrames based on the
specified columns.
# Sample DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})
# Inner merge on the 'key' column
inner_merged = pd.merge(df1, df2, on='key', how='inner')
print(inner_merged)
Output:
key value1 value2
0 B 2 4
1 C 3 5
2. Left Merge:
A left merge returns all rows from the left DataFrame and the matched rows from the right
DataFrame. If there is no match, NaN values are filled in for the columns from the right
DataFrame.
# Left merge on the 'key' column
left_merged = pd.merge(df1, df2, on='key', how='left')
print(left_merged)
Output:
key value1 value2
0 A 1 NaN
1 B 2 4.0
2 C 3 5.0
3. Right Merge:
A right merge is similar to a left merge but returns all rows from the right DataFrame and the
matched rows from the left DataFrame.
# Right merge on the 'key' column
right_merged = pd.merge(df1, df2, on='key', how='right')
print(right_merged)
Output:
key value1 value2
0 B 2.0 4
1 C 3.0 5
2 D NaN 6
4. Outer Merge:
An outer merge returns all rows from both DataFrames, filling in NaN for missing values.
# Outer merge on the 'key' column
outer_merged = pd.merge(df1, df2, on='key', how='outer')
print(outer_merged)
Output:
key value1 value2
0 A 1.0 NaN
1 B 2.0 4.0
2 C 3.0 5.0
3 D NaN 6.0
5. Concatenation:
Concatenation is a method of combining two DataFrames along a particular axis (either rows or
columns).
# Concatenate along rows, resetting the index
concatenated = pd.concat([df1, df2], ignore_index=True)
print(concatenated)
Output:
key value1 value2
0 A 1.0 NaN
1 B 2.0 NaN
2 C 3.0 NaN
3 B NaN 4.0
4 C NaN 5.0
5 D NaN 6.0
pd.concat([df1, pd.DataFrame({"key": ["E", "F"], "value1": [5, 9]})], ignore_index=True)
Output
key value1
0 A 1
1 B 2
2 C 3
3 E 5
4 F 9
These examples cover some of the basic merging and concatenation operations in pandas. Depending on your specific use case, you may choose the appropriate type of merge or concatenation.
Data Visualization
Data visualization is the technique of making inferences from data by plotting charts and graphs.
It is used to deduce trends, patterns, and correlations that might be hard to find without visualization.
Data visualization is also used to present the inferences derived from data analytics.
Type of Plots:
Throughout the upcoming slides we will cover the following plots:
1. Line Plots (trend over time)
2. Scatter (to see correlation between two columns)
3. Bar charts and Column Charts (frequency of data)
4. Histogram (frequency of data)
5. Pie charts
6. Box plot (statistical value – outliers)
7. Heatmap (comparison)
For the rest of this chapter, consider the following dataset:
import pandas as pd
# Example dataset
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [50000, 60000, 75000, 90000, 80000, 95000, 110000, 120000, 105000, 130000],
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
}
# Create a DataFrame
df = pd.DataFrame(data)
Line Plot:
In this example, we visualize the relationship between age and income using a line plot:
import matplotlib.pyplot as plt
# Create a line plot of Income over Age
plt.plot(df['Age'], df['Income'], marker='o', label='Income')
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Income over Age')
# Add legend
plt.legend()
plt.show()
In this code, plt.plot draws Income against Age as a connected line, and plt.legend adds the label to the chart.
Scatter Plot:
In the example below, we use scatter plot to visualize the relationship between age and income
for each individual in the dataset. Each point represents a unique combination of age and income,
and the color and marker style can be adjusted based on preferences.
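The basic scatter-plot code is missing from the notes; a minimal sketch using the example dataset defined earlier:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [50000, 60000, 75000, 90000, 80000, 95000, 110000, 120000, 105000, 130000]
})

# Scatter plot: one point per (Age, Income) pair
plt.scatter(df['Age'], df['Income'], color='blue', marker='o')
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Income vs Age')
plt.show()
```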
Now we want to see the scatter plot of Age vs Income, visually separated for each Category.
# Split the DataFrame by category
category_a = df[df['Category'] == 'A']
category_b = df[df['Category'] == 'B']
# Create a scatter plot for Income over Age with different colors for each category
plt.scatter(category_a['Age'], category_a['Income'], color='blue', marker='o', label='Category A')
plt.scatter(category_b['Age'], category_b['Income'], color='orange', marker='s', label='Category B')
plt.legend()
plt.show()
In this code, the DataFrame is first filtered into the two categories, and each subset is plotted with its own color and marker style.
Bar Charts:
In the example below, we want to visually compare the average income in categories A and B.
In this code:
The average income for each category is calculated using groupby and mean.
The first set of plots is a vertical bar chart (plt.bar) with the bars representing the average
income for each category.
The second set of plots is a horizontal bar chart (plt.barh) with the bars arranged
horizontally.
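The bar-chart code those bullets describe is missing from the notes; a sketch consistent with them, using the example dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Income': [50000, 60000, 75000, 90000, 80000, 95000, 110000, 120000, 105000, 130000],
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
})

# Average income per category
avg_income = df.groupby('Category')['Income'].mean()

# Vertical bar chart
plt.bar(avg_income.index, avg_income.values)
plt.xlabel('Category')
plt.ylabel('Average Income')
plt.show()

# Horizontal bar chart of the same values
plt.barh(avg_income.index, avg_income.values)
plt.xlabel('Average Income')
plt.ylabel('Category')
plt.show()
```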
Histogram:
For histograms, you typically focus on a single variable to visualize its statistical distribution.
In this code:
plt.hist is used to create histograms for both the "Age" and "Income" variables.
bins parameter controls the number of bins (intervals) in the histogram.
edgecolor is set to 'black' to add black borders around the bars.
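The histogram code those bullets describe is missing from the notes; a sketch consistent with them:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [50000, 60000, 75000, 90000, 80000, 95000, 110000, 120000, 105000, 130000]
})

# Histogram of Age
plt.hist(df['Age'], bins=5, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Histogram of Income
plt.hist(df['Income'], bins=5, edgecolor='black')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()
```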
Pie charts:
A pie chart is suitable when you want to represent the proportions of different categories in a
whole. In this example, let's create a pie chart to represent the distribution of the "Category"
variable in the given dataset:
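The pie-chart code for the "Category" variable is missing from the notes; a minimal sketch:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']})

# Count how many rows fall in each category
category_counts = df['Category'].value_counts()

# Pie chart of the category proportions
plt.pie(category_counts.values, labels=category_counts.index, autopct='%1.1f%%')
plt.title('Distribution of Category')
plt.show()
```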
This pie chart below represents the distribution of age categories in the dataset. We can create
age categories and represent the distribution of these age categories in a pie chart. Here, we'll
define age categories based on ranges (e.g., 20-30, 31-40, etc.):
Age categories are defined using pd.cut to create a new column 'AgeCategory' in the
DataFrame.
The age categories are sorted using sort_index() after calculating the distribution.
plt.pie is used to create a pie chart for the distribution of age categories.
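The code the bullets above describe is missing from the notes; a sketch consistent with them (the age bin edges are example values, not from the notes):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]})

# Define age categories with pd.cut (bin edges are example values)
bins = [20, 30, 40, 50, 60, 70]
labels = ['20-30', '31-40', '41-50', '51-60', '61-70']
df['AgeCategory'] = pd.cut(df['Age'], bins=bins, labels=labels)

# Distribution of age categories, sorted by category order
age_counts = df['AgeCategory'].value_counts().sort_index()

plt.pie(age_counts.values, labels=age_counts.index, autopct='%1.1f%%')
plt.title('Distribution of Age Categories')
plt.show()
```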
Box plot:
A boxplot, also known as a box-and-whisker plot, is a statistical visualization that provides a
summary of the distribution of a dataset. It is particularly useful for comparing distributions
between different categories or groups. A boxplot consists of several key components:
1. Box:
The box represents the interquartile range (IQR), which is the range between the first
quartile (Q1) and the third quartile (Q3).
The height of the box is the IQR, and it contains the middle 50% of the data.
2. Whiskers:
Whiskers extend from the box to the minimum and maximum values within a certain range, often determined by a multiplier of the IQR.
They provide information about the spread of the data beyond the quartiles.
3. Median (Line Inside the Box):
The line inside the box represents the median of the dataset. It divides the dataset into two halves, with 50% of the data points below and 50% above the median.
4. Outliers (Individual Points or Dots):
Individual points beyond the whiskers are considered outliers and are plotted individually.
Let's create a boxplot for the "Income" variable in the given dataset:
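The seaborn code the bullets below describe is missing from the notes; a sketch consistent with them, using the example dataset:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Income': [50000, 60000, 75000, 90000, 80000, 95000, 110000, 120000, 105000, 130000],
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
})

# Boxplot of Income for each Category
sns.boxplot(x='Category', y='Income', data=df)
plt.title('Income Distribution by Category')
plt.show()
```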
In this code:
Seaborn (sns) is used to create a boxplot. Seaborn is a statistical data visualization library
that works well with pandas DataFrames.
The x parameter specifies the categorical variable ('Category') on the x-axis, and the y
parameter specifies the numerical variable ('Income') on the y-axis.
The boxplot visually represents the distribution of income for each category.
Let's interpret a boxplot using the boxplot created for the "Income" variable in the given dataset:
1. Box (Interquartile Range):
The box represents the middle 50% of the income distribution for each category.
The bottom and top edges of the box correspond to the first quartile (Q1) and the third
quartile (Q3), respectively.
The height of the box (Q3 - Q1) provides a measure of the spread of the central portion of
the data.
2. Whiskers:
The whiskers extend from the box to the minimum and maximum values within a certain
range.
The line inside the box represents the median income for each category.
Individual points beyond the whiskers are considered outliers. They are plotted
individually to highlight values that are significantly different from the majority of the
data.
Heatmap:
A heatmap is a graphical representation of data where values in a matrix are represented as
colors. It's particularly useful for visualizing the relationships between two categorical variables.
In the given dataset, we can create a heatmap to visualize the relationship between
"AgeCategory" and "Income."
import seaborn as sns
# Build a pivot table of mean income by age category and by category
# (assumes an 'AgeCategory' column created earlier with pd.cut)
heatmap_data = df.pivot_table(values='Income', index='AgeCategory', columns='Category', aggfunc='mean')
# Create a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(heatmap_data, annot=True, cmap='YlGnBu', fmt=".0f", cbar_kws={'label': 'Mean Income'})
plt.xlabel('Category')
plt.ylabel('Age Category')
plt.title('Heatmap: Relationship between Age Category and Income by Category')
plt.show()
In this code:
df.pivot_table is used to create a pivot table that organizes the data to be suitable for a heatmap.
sns.heatmap is used to create the heatmap with annotations, a specific colormap ('YlGnBu'), and
a color bar indicating the mean income.
Darker colors generally represent higher values, and lighter colors represent lower values.
In this heatmap, each cell represents the mean income for a combination of
"AgeCategory" and "Category."
This visualization helps in identifying patterns and trends in the dataset, especially how income
varies across different age categories and categories.
Appendix: Quartiles
Q1 (first quartile): 25 percent of the data is less than Q1.
Q3 (third quartile): 75 percent of the data is less than Q3.
IQR = Q3 − Q1 (the interquartile range).
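These definitions can be checked numerically with np.percentile; a sketch using the sample data from the descriptive-statistics section:

```python
import numpy as np

data = np.array([2, 4, 5, 7, 9, 12, 15, 18, 21])

# Quartiles via percentiles
q1 = np.percentile(data, 25)  # 25% of the data lies below Q1
q3 = np.percentile(data, 75)  # 75% of the data lies below Q3
iqr = q3 - q1                 # interquartile range

print(q1, q3, iqr)
```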