Python

NumPy is a Python library for numerical operations and linear algebra. It allows efficient manipulation of arrays and matrices. Pandas is a library for data analysis and manipulation of structured data. It allows easy importing and exporting of data from various file formats and provides functions for reshaping data frames.
Numpy:

NumPy is a powerful Python library for numerical operations, particularly for working with
arrays and matrices. It provides a high-performance multidimensional array object and tools for
working with these arrays. Here are some key features of NumPy along with examples:
Creating Arrays: NumPy provides various ways to create arrays.

import numpy as np

# Creating a 1D array
arr_1d = np.array([1, 2, 3, 4, 5])

# Creating a 2D array
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
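
Beyond np.array, a few other common constructors are worth knowing; a short sketch:

# Arrays of zeros and ones with a given shape
zeros = np.zeros((2, 3))         # 2x3 array of 0.0
ones = np.ones(4)                # length-4 array of 1.0

# Evenly spaced values
evens = np.arange(0, 10, 2)      # [0 2 4 6 8]
grid = np.linspace(0.0, 1.0, 5)  # [0.   0.25 0.5  0.75 1.  ]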

Array Operations: NumPy supports element-wise operations and provides functions for
common mathematical operations.

# Element-wise addition
result_addition = arr_1d + 10

# Element-wise multiplication
result_multiply = arr_2d * 2

Array Indexing and Slicing: NumPy allows you to index and slice arrays efficiently.

# Accessing elements
print(arr_1d[2])    # Output: 3

# Slicing: all rows, column 1
print(arr_2d[:, 1]) # Output: [2 5 8]
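
Boolean masks are another common way to select elements by condition; a small sketch:

# Boolean masking: keep only the elements greater than 2
mask = arr_1d > 2
print(arr_1d[mask])        # Output: [3 4 5]

# Or in one step
print(arr_1d[arr_1d > 2])  # Output: [3 4 5]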

Broadcasting: NumPy supports broadcasting, allowing operations on arrays of different shapes and sizes.

# Broadcasting
arr_broadcast = np.array([1, 2, 3]) + 5
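
Broadcasting also works between arrays of different dimensions, not just with scalars; a small sketch using the arrays defined above:

# The 1D array [10, 20, 30] is stretched across each row of the 3x3 array
row = np.array([10, 20, 30])
result_broadcast = arr_2d + row
# [[11 22 33]
#  [14 25 36]
#  [17 28 39]]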

Linear Algebra Operations: NumPy provides a set of linear algebra functions for matrix
operations.

# Matrix multiplication
mat1 = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
result_matmul = np.dot(mat1, mat2)
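
np.dot is only one of the available routines; the np.linalg submodule covers many standard operations. A minimal sketch:

# Solve the linear system A x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)  # [2. 3.]

# Determinant, inverse, and transpose
det_A = np.linalg.det(A)   # 5.0
A_inv = np.linalg.inv(A)
A_T = A.T

# For 2D arrays, mat1 @ mat2 is equivalent to np.dot(mat1, mat2)
result_at = mat1 @ mat2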

Statistical Functions: NumPy includes various statistical functions for analyzing data.

# Mean and standard deviation
mean_value = np.mean(arr_1d)
std_deviation = np.std(arr_1d)
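
Most of these functions also accept an axis argument, which is useful on 2D arrays; a short sketch using arr_2d from above:

# Column-wise and row-wise statistics on the 3x3 array
col_means = np.mean(arr_2d, axis=0)  # [4. 5. 6.]
row_max = np.max(arr_2d, axis=1)     # [3 6 9]
median_value = np.median(arr_1d)     # 3.0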

Random Module: NumPy provides a random module for generating random numbers and
arrays.

# Random array
random_array = np.random.rand(3, 3)

# Generate 1000 random numbers from a normal distribution
# with mean 0 and standard deviation 1
normal_random = np.random.randn(1000)

# Generate 5 random integers between 1 and 10
random_integers = np.random.randint(1, 11, size=5)

# Generating random numbers with a given mean and standard deviation

# Set the desired mean and standard deviation
desired_mean = 10
desired_std = 2

# Generate 1000 random numbers with the specified mean and standard deviation
random_numbers = np.random.normal(loc=desired_mean, scale=desired_std, size=1000)

# Calculate the actual mean and standard deviation of the generated numbers
actual_mean = np.mean(random_numbers)
actual_std = np.std(random_numbers)

# Print the results
print(f"Desired Mean: {desired_mean}, Desired Standard Deviation: {desired_std}")
print(f"Actual Mean: {actual_mean:.4f}, Actual Standard Deviation: {actual_std:.4f}")


Reshaping Arrays: NumPy allows you to reshape arrays easily.

import numpy as np

# Create a 1D array with 12 elements
arr_1d = np.arange(12)

# Reshape the 1D array into a 3x4 2D array
arr_2d = arr_1d.reshape(3, 4)

print("Original 1D array:")
print(arr_1d)   # [ 0  1  2  3  4  5  6  7  8  9 10 11]
print("\nReshaped 2D array:")
print(arr_2d)   # [[ 0  1  2  3]
                #  [ 4  5  6  7]
                #  [ 8  9 10 11]]
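
reshape can also infer one dimension automatically when you pass -1, and ravel undoes the reshaping; a short sketch:

# Let NumPy infer the number of columns
arr_auto = arr_1d.reshape(3, -1)  # same 3x4 result

# Flatten back to a 1D array
flat = arr_2d.ravel()             # [ 0  1  2 ... 11]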

Pandas:
While NumPy is a library for numerical operations and array manipulation, pandas is a library specifically designed for manipulating and analyzing structured, tabular data. In pandas, reshaping is associated with operations like pivoting, melting, and stacking rather than the direct reshaping of arrays.
Pandas provides a rich set of functions for reshaping and manipulating data, which is especially useful when dealing with real datasets. The examples below highlight some common operations, but pandas offers many more functionalities for reshaping and transforming data to suit your analysis needs. Keep in mind that the focus of pandas is on tabular data, so it does not directly mirror the array-centric operations in NumPy.
Let's explore a few operations in pandas that are related to reshaping data:
Creating a DataFrame: In pandas, the primary data structure is the DataFrame, which is a 2-
dimensional labeled data structure.

import pandas as pd

# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# Original DataFrame
print("Original DataFrame:")
print(df)
# Output:
#       Name  Age           City
# 0    Alice   25       New York
# 1      Bob   30  San Francisco
# 2  Charlie   35    Los Angeles

Pivoting: The pivot function in pandas is used to reshape the DataFrame by pivoting based on
the values in a column.

# Pivoting the DataFrame
pivoted_df = df.pivot(index='Name', columns='City', values='Age')
print("\nPivoted DataFrame:")
print(pivoted_df)
# Output:
# City     Los Angeles  New York  San Francisco
# Name
# Alice            NaN      25.0            NaN
# Bob              NaN       NaN           30.0
# Charlie         35.0       NaN            NaN
Melting: The melt function is used to unpivot a DataFrame, converting it from wide format to
long format.

# Melting the DataFrame
melted_df = pd.melt(df, id_vars='Name', value_vars=['Age', 'City'])
print("\nMelted DataFrame:")
print(melted_df)
# Output:
#       Name variable          value
# 0    Alice      Age             25
# 1      Bob      Age             30
# 2  Charlie      Age             35
# 3    Alice     City       New York
# 4      Bob     City  San Francisco
# 5  Charlie     City    Los Angeles

Stacking and Unstacking: The stack and unstack functions are used to reshape a DataFrame by
stacking or unstacking levels.

# Stacking: collapse the remaining column level into the index
stacked_df = df.set_index(['Name', 'City']).stack()
print("\nStacked DataFrame:")
print(stacked_df)
# Output:
# Name     City
# Alice    New York       Age    25
# Bob      San Francisco  Age    30
# Charlie  Los Angeles    Age    35
# dtype: int64

# Unstacking: move the 'City' index level back into the columns
unstacked_df = stacked_df.unstack('City')
print("\nUnstacked DataFrame:")
print(unstacked_df)
# Output:
# City         Los Angeles  New York  San Francisco
# Name
# Alice   Age          NaN      25.0            NaN
# Bob     Age          NaN       NaN           30.0
# Charlie Age         35.0       NaN            NaN

Importing and Exporting Files

In pandas, you can easily import and export data in various file formats, including Excel (xls, xlsx), CSV, JSON, and text files. Here are examples for each:
CSV (Comma-Separated Values):

import pandas as pd

# Import CSV file
csv_data = pd.read_csv('your_file.csv')
print("CSV Data:")
print(csv_data)

Excel:

# Import Excel file
excel_data = pd.read_excel('your_file.xlsx', sheet_name='Sheet1')
print("\nExcel Data:")
print(excel_data)

JSON (JavaScript Object Notation):

# Import JSON file
json_data = pd.read_json('your_file.json')
print("\nJSON Data:")
print(json_data)

Text (Delimiter-Separated Values):

# Import text file (tab-separated values)
text_data = pd.read_csv('your_file.txt', delimiter='\t')
print("\nText Data:")
print(text_data)

Exporting Data:
CSV:

# Export DataFrame to CSV
csv_data.to_csv('output_file.csv', index=False)

Excel:

# Export DataFrame to Excel
excel_data.to_excel('output_file.xlsx', index=False, sheet_name='Sheet1')

JSON:
# Export DataFrame to JSON
json_data.to_json('output_file.json', orient='records')

Text:

# Export DataFrame to text file (tab-separated values)
text_data.to_csv('output_file.txt', sep='\t', index=False)

These examples demonstrate basic import and export operations in pandas. Make sure to replace
'your_file.csv', 'your_file.xlsx', 'your_file.json', and 'your_file.txt' with the actual paths or URLs
of your data files.
Keep in mind that these functions have various parameters to handle different configurations of
data, such as specifying delimiters, encoding, header presence, etc. Always refer to the pandas
documentation for more details on these parameters.
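
As a small illustration of those parameters, here is a sketch of a read_csv call with a few of them set explicitly; the file name and option values are placeholders to adapt to your data:

# Hypothetical file: adjust path, separator, and encoding to your data
data = pd.read_csv(
    'your_file.csv',
    sep=';',              # non-default delimiter
    encoding='utf-8',     # explicit text encoding
    header=0,             # first row contains the column names
    na_values=['NA', ''], # extra strings to treat as missing values
)
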
Appendix:
A normal distribution, also known as a Gaussian distribution, is a symmetric bell-shaped
probability distribution. In simpler terms, it describes how data is spread around a central value.

Here's a basic explanation with an example:

Characteristics of a Normal Distribution:
Symmetry: The distribution is symmetric, meaning the left and right sides of the central value
are mirror images of each other.
Bell Shape: The distribution forms a bell-shaped curve when plotted.
Central Tendency: The highest point of the curve represents the central or average value of the
data.
Standard Deviation: The spread of the data is determined by the standard deviation. A smaller
standard deviation means the data is more concentrated around the mean.
Example:
Let's consider the heights of a population of students in a school. If the heights follow a normal
distribution:

 The average height would be the central value (mean).
 Most students would have heights close to the average.
 Taller and shorter students would be less common, gradually decreasing as you move away from the average in either direction.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Parameters of the height distribution
mean_height = 160    # mean height in centimeters
std_deviation = 10   # standard deviation in centimeters
num_students = 1000

# Generate random heights following a normal distribution
heights = np.random.normal(loc=mean_height, scale=std_deviation, size=num_students)

# Plot the histogram of heights
plt.hist(heights, bins=30, density=True, alpha=0.7, color='blue', label='Height Distribution')
# Fit a normal distribution to the data
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mean_height, std_deviation)
plt.plot(x, p, 'k', linewidth=2, label='Fitted Normal Distribution')

plt.title('Height Distribution of Students with Fitted Normal Curve')
plt.xlabel('Height (cm)')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(True)
plt.show()

In this example, the numpy.random.normal function is used to generate random heights for 1000
students. The resulting histogram will have a bell-shaped curve, with the majority of heights
centered around the mean value.
Understanding the normal distribution is fundamental in statistics, and it often appears in various
natural phenomena and measurements. The concept is widely used in fields such as physics,
biology, finance, and more.
Descriptive Statistics:
Descriptive statistics are used to summarize and describe the main features of a dataset.

import numpy as np

# Sample data
data = np.array([2, 4, 5, 7, 9, 12, 15, 18, 21])

# Mean
mean_value = np.mean(data)
print(f"Mean: {mean_value}")

# Median
median_value = np.median(data)
print(f"Median: {median_value}")

# Standard Deviation
std_dev = np.std(data)
print(f"Standard Deviation: {std_dev}")

# Variance
variance = np.var(data)
print(f"Variance: {variance}")

Correlation:
Correlation measures the strength and direction of a linear relationship between two variables.

import numpy as np
from scipy.stats import pearsonr

# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 1, 6, 8])

# Pearson correlation coefficient and p-value
corr_coefficient, p_value = pearsonr(x, y)
print(f"Pearson Correlation Coefficient: {corr_coefficient}")
print(f"P-value: {p_value}")

import pandas as pd

# Create a sample DataFrame
data = {'A': [1, 2, 3, 4, 5],
        'B': [2, 4, 1, 6, 8],
        'C': [5, 3, 8, 1, 2]}

df = pd.DataFrame(data)

# Calculate the correlation matrix
correlation_matrix = df.corr()

# Display the correlation matrix
print("Correlation Matrix:")
print(correlation_matrix)

To see the correlation between two variables, we can also use a scatter plot. You can create a scatter plot for two columns using the matplotlib library. Here's an example using the previously defined DataFrame (df), plotting column 'A' against column 'B':

import matplotlib.pyplot as plt

# Scatter plot for columns 'A' and 'B'
plt.scatter(df['A'], df['B'])
plt.title('Scatter Plot of Column A vs Column B')
plt.xlabel('Column A')
plt.ylabel('Column B')
plt.show()
Filtering data:
Filtering data in pandas is typically done with boolean indexing (conditional expressions inside square brackets) or the `query` method; the `loc` accessor accepts the same boolean masks when you also want to select specific columns. Let's look at examples of filtering data based on one column and on more than one column.
Filtering Data Based on One Column:

import pandas as pd

# Sample DataFrame
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Subcategory': ['X', 'X', 'Y', 'Y', 'X', 'Y'],
    'Value': [10, 15, 20, 25, 30, 35]
}

df = pd.DataFrame(data)

df

# Filtering data where 'Category' is 'A'
filtered_data_one_column = df[df['Category'] == 'A']
filtered_data_one_column

Using the query method:

# Using the query method to filter data
filtered_data_one_column_query = df.query("Category == 'A'")
filtered_data_one_column_query

Filtering Data Based on More Than One Column:

# Filtering data where 'Category' is 'A' and 'Value' is greater than 15
filtered_data_multiple_columns = df[(df['Category'] == 'A') & (df['Value'] > 15)]
filtered_data_multiple_columns

Using the query method:

# Using the query method to filter data based on multiple columns
filtered_data_multiple_columns_query = df.query("Category == 'A' and Value > 15")
filtered_data_multiple_columns_query

In these examples, the conditions inside the square brackets specify the filtering criteria. The `&` operator performs an element-wise logical AND; note that each condition must be wrapped in parentheses.
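
As mentioned above, `loc` accepts the same boolean masks and additionally lets you pick columns in the same step; a minimal sketch on the same df:

# Rows where 'Category' is 'A', keeping only two columns
filtered_loc = df.loc[df['Category'] == 'A', ['Subcategory', 'Value']]
filtered_loc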

Grouping and Group Aggregation

In pandas, grouping data and performing aggregations is a common operation when working with tabular data. The process involves splitting the data into groups based on some criteria and then applying a function to each group independently.

Grouping Data:

To group data in pandas, you typically use the groupby method, which allows you to split the
DataFrame into groups based on one or more columns.

# Grouping by the 'Category' column
grouped = df.groupby('Category')

Aggregation with Different Formats:

Once you've grouped the data, you can apply various aggregation functions to obtain summary
statistics for each group. Let's explore different formats of aggregation:

1. Aggregating with Built-in Functions:

You can use built-in aggregation functions like sum(), mean(), count(), max(), min(), etc.

# Summing the 'Value' column for each category
result_sum = grouped['Value'].sum()
# Or simply: df.groupby('Category')['Value'].sum()
result_sum

2. Aggregating with Custom Functions:

You can define your own aggregation functions.

# Custom aggregation function to calculate the range
def custom_range(series):
    return series.max() - series.min()

# Applying the custom function to each group
result_range = grouped['Value'].agg(custom_range)
result_range

3. Aggregating with Multiple Functions:

You can apply multiple aggregation functions at once using a list.

# Applying multiple aggregation functions
result_multiple = grouped['Value'].agg(['sum', 'mean', 'count'])
print(result_multiple)

4. Grouping Data Based on Multiple Columns:

When you want to group data based on two or more columns, you can pass a list of column names to the groupby method.

# Grouping by 'Category' and 'Subcategory'
grouped = df.groupby(['Category', 'Subcategory'])

# Aggregating the sum of 'Value' for each group
result = grouped['Value'].sum()

result
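
groupby can also apply different aggregations to different columns in one call via named aggregation; a short sketch on the same df:

# One output column per (input column, function) pair
summary = df.groupby('Category').agg(
    total_value=('Value', 'sum'),
    mean_value=('Value', 'mean'),
    n_rows=('Subcategory', 'count'),
)
print(summary)
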
Merging DataFrames:
In pandas, merging refers to the combining of two or more DataFrames based on a common set
of columns or indices. There are several types of merges:

1. Inner Merge:
An inner merge returns only the rows where there is a match in both DataFrames based on the
specified columns.

# Sample DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})

# Inner merge on the 'key' column
merged_inner = pd.merge(df1, df2, on='key')
print(merged_inner)

Output:
  key  value1  value2
0   B       2       4
1   C       3       5

2. Left Merge:
A left merge returns all rows from the left DataFrame and the matched rows from the right
DataFrame. If there is no match, NaN values are filled in for the columns from the right
DataFrame.

# Left merge on the 'key' column
merged_left = pd.merge(df1, df2, on='key', how='left')
print(merged_left)

Output:
  key  value1  value2
0   A       1     NaN
1   B       2     4.0
2   C       3     5.0

3. Right Merge:
A right merge is similar to a left merge but returns all rows from the right DataFrame and the
matched rows from the left DataFrame.

# Right merge on the 'key' column
merged_right = pd.merge(df1, df2, on='key', how='right')
print(merged_right)

Output:
  key  value1  value2
0   B     2.0       4
1   C     3.0       5
2   D     NaN       6

4. Outer Merge:
An outer merge returns all rows from both DataFrames, filling in NaN for missing values.

# Outer merge on the 'key' column
merged_outer = pd.merge(df1, df2, on='key', how='outer')
print(merged_outer)

Output:
  key  value1  value2
0   A     1.0     NaN
1   B     2.0     4.0
2   C     3.0     5.0
3   D     NaN     6.0

5. Concatenation:

Concatenation is a method of combining two DataFrames along a particular axis (either rows or
columns).

# Concatenating along rows (axis=0)
concatenated_rows = pd.concat([df1, df2], ignore_index=True)
print(concatenated_rows)

Output:
  key  value1  value2
0   A     1.0     NaN
1   B     2.0     NaN
2   C     3.0     NaN
3   B     NaN     4.0
4   C     NaN     5.0
5   D     NaN     6.0

Concatenating along columns (axis=1) lines the two DataFrames up side by side on their shared index:

# Concatenating along columns (axis=1)
concatenated_cols = pd.concat([df1, df2], axis=1)
print(concatenated_cols)

Output:
  key  value1 key  value2
0   A       1   B       4
1   B       2   C       5
2   C       3   D       6

Adding Rows to a DataFrame

pd.concat([df1, pd.DataFrame({"key": ["E", "F"], "value1": [5, 9]})], ignore_index=True)

Output
  key  value1
0   A       1
1   B       2
2   C       3
3   E       5
4   F       9

These examples cover some of the basic merging and concatenation operations in pandas. Depending on your specific use case, you may choose the appropriate type of merge or concatenation.
Data Visualization
Data visualization is the technique of drawing inferences from data by plotting charts and graphs.
Data visualization is used to uncover trends, patterns, and correlations in the data that might be hard to find otherwise.
Data visualization is also used to present the inferences derived from data analytics.

Data Visualization in Python:

Python provides a variety of packages for data visualization. Some of the popular and most used ones are:
1. Matplotlib
2. Seaborn
3. Plotly (for 3D plotting)
4. Pandas Visualization
5. ggplot
For the course of this module, we will focus on Matplotlib as it forms the basis for most of the
other libraries like seaborn or pandas visualization.

Types of Plots:
Throughout the upcoming slides, we will cover the following plots:
1. Line plots (trend over time)
2. Scatter plots (correlation between two columns)
3. Bar charts and column charts (frequency of data)
4. Histograms (frequency of data)
5. Pie charts
6. Box plots (statistical summary – outliers)
7. Heatmaps (comparison)
For the rest of this chapter, consider the following dataset:

# Example dataset
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [50000, 60000, 75000, 90000, 80000, 95000, 110000, 120000, 105000, 130000],
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
df.head()

Line Plot:
In this example, we are going to look at the relationship between age and income visually. To do this, we use a line plot:

# Create a line plot for Income over Age
plt.plot(df['Age'], df['Income'], marker='o', linestyle='-', color='blue')

# Add labels and title
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Line Plot: Income over Age')

# Display the plot
plt.show()

Now let's make it fancier by adding a line for the average income and a highlighted age:

# Create a line plot for Income over Age
plt.plot(df['Age'], df['Income'], marker='o', linestyle='-', color='blue', label='Income Trend')

# Add a horizontal line for the average income
average_income = df['Income'].mean()
plt.axhline(y=average_income, color='green', linestyle='--', label='Average Income')

# Add a vertical line to highlight a specific age
highlight_age = 50
plt.axvline(x=highlight_age, color='red', linestyle='--', label='Highlight Age')

# Add labels and title
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Income over Age')

# Add legend
plt.legend()

# Display the plot
plt.show()

In this code:

 A horizontal line is added using plt.axhline to represent the average income.
 A vertical line is added using plt.axvline to highlight a specific age (set as highlight_age).
 Legends are added to distinguish between the income trend, the average income, and the highlighted age.

Scatter Plot:
In the example below, we use a scatter plot to visualize the relationship between age and income for each individual in the dataset. Each point represents a unique combination of age and income, and the color and marker style can be adjusted based on preference.

# Create a scatter plot for Income over Age
plt.scatter(df['Age'], df['Income'], color='blue', marker='o')

# Add labels and title
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Scatter Plot: Income vs. Age')

# Display the plot
plt.show()

Now we want to see the scatter plot of Age vs. Income, visually separated by Category.

# Separate data based on categories
category_a = df[df['Category'] == 'A']
category_b = df[df['Category'] == 'B']

# Create a scatter plot for Income over Age with different colors for each category
plt.scatter(category_a['Age'], category_a['Income'], color='blue', marker='o', label='Category A')
plt.scatter(category_b['Age'], category_b['Income'], color='orange', marker='s', label='Category B')

# Add labels, title, and legend
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Income vs. Age by Category')
plt.legend()
# Display the plot
plt.show()

In this code:

 Data is separated into two categories, 'A' and 'B'.
 Scatter plots are created for each category with different colors (blue for 'A', orange for 'B') and different markers (circle for 'A', square for 'B').
 Legends are added to distinguish between the categories.

Bar Charts and Column Charts

Bar charts and column charts are essentially the same, but they differ in the orientation of the bars or columns. Bar charts have horizontal bars, while column charts have vertical columns. In this example, we'll show both variations.

In the example below, we want to compare the average income of categories A and B visually:

# Calculate average income for each category
average_income_by_category = df.groupby('Category')['Income'].mean()

# Create a bar chart (vertical columns) for average income by category
plt.bar(average_income_by_category.index, average_income_by_category, color=['blue', 'orange'])
plt.xlabel('Category')
plt.ylabel('Average Income')
plt.title('Bar Chart: Average Income by Category')

# Display the plot
plt.show()

# Create a horizontal bar chart for average income by category
plt.barh(average_income_by_category.index, average_income_by_category, color=['blue', 'orange'])
plt.xlabel('Average Income')
plt.ylabel('Category')
plt.title('Horizontal Bar Chart: Average Income by Category')

# Display the plot
plt.show()

In this code:

 The average income for each category is calculated using groupby and mean.
 The first plot is a vertical bar chart (plt.bar), with the bars representing the average income for each category.
 The second plot is a horizontal bar chart (plt.barh), with the bars arranged horizontally.

Histogram:
For histograms, you typically focus on a single variable to visualize its statistical distribution.

# Create a histogram for the 'Income' variable
plt.hist(df['Income'], bins=10, color='lightcoral', edgecolor='black')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.title('Histogram: Distribution of Income')
plt.show()

In this code:

 plt.hist is used to create a histogram of the 'Income' variable.
 The bins parameter controls the number of bins (intervals) in the histogram.
 edgecolor is set to 'black' to add black borders around the bars.

Pie charts:
A pie chart is suitable when you want to represent the proportions of different categories in a
whole. In this example, let's create a pie chart to represent the distribution of the "Category"
variable in the given dataset:

# Calculate the distribution of categories
category_distribution = df['Category'].value_counts()

# Create a pie chart for the distribution of categories
plt.pie(category_distribution, labels=category_distribution.index, autopct='%1.1f%%', colors=['lightgreen', 'lightcoral'])
plt.title('Pie Chart: Category Distribution')
plt.show()

We can also create age categories and represent their distribution in a pie chart. Here, we'll define age categories based on ranges (e.g., 20-30, 31-40, etc.):

# Define age categories
age_bins = [20, 30, 40, 50, 60, 70, 80]
age_labels = ['20-30', '31-40', '41-50', '51-60', '61-70', '71+']

# Create a new column 'AgeCategory' in the DataFrame
df['AgeCategory'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=False)

# Calculate the distribution of age categories
age_category_distribution = df['AgeCategory'].value_counts()

# Sort the age categories
age_category_distribution = age_category_distribution.sort_index()

# Create a pie chart for the distribution of age categories
plt.pie(age_category_distribution, labels=age_category_distribution.index,
        autopct='%1.1f%%', colors=['lightblue', 'lightcoral', 'lightgreen',
                                   'lightyellow', 'lightpink', 'lightgrey'])
plt.title('Age Category Distribution')
plt.show()

In this code:

 Age categories are defined using pd.cut to create a new column 'AgeCategory' in the DataFrame.
 The age categories are sorted using sort_index() after calculating the distribution.
 plt.pie is used to create a pie chart for the distribution of age categories.

Box plot:
A boxplot, also known as a box-and-whisker plot, is a statistical visualization that provides a
summary of the distribution of a dataset. It is particularly useful for comparing distributions
between different categories or groups. A boxplot consists of several key components:
1. Box:

 The box represents the interquartile range (IQR), which is the range between the first
quartile (Q1) and the third quartile (Q3).

 The height of the box is the IQR, and it contains the middle 50% of the data.

2. Whiskers:

 Whiskers extend from the box to the minimum and maximum values within a certain
range, often determined by a multiplier of the IQR.

 They provide information about the spread of the data beyond the quartiles.

3. Median (Line inside the box):

 The line inside the box represents the median of the dataset. It divides the dataset into
two halves, with 50% of the data points below and 50% above the median.
4. Outliers (Individual Points or Dots):
 Individual points beyond the whiskers are considered outliers and are plotted
individually.

Let's create a boxplot for the "Income" variable in the given dataset:

import seaborn as sns

# Set the style for seaborn
sns.set(style="whitegrid")

# Create a boxplot of 'Income' by category
plt.figure(figsize=(8, 6))
sns.boxplot(x='Category', y='Income', data=df, palette=['lightblue', 'lightcoral'])
plt.xlabel('Category')
plt.ylabel('Income')
plt.title('Boxplot: Distribution of Income by Category')
plt.show()

In this code:

 Seaborn (sns) is used to create a boxplot. Seaborn is a statistical data visualization library
that works well with pandas DataFrames.
 The x parameter specifies the categorical variable ('Category') on the x-axis, and the y
parameter specifies the numerical variable ('Income') on the y-axis.
 The boxplot visually represents the distribution of income for each category.

Let's interpret a boxplot using the boxplot created for the "Income" variable in the given dataset:
1. Box (Interquartile Range):
 The box represents the middle 50% of the income distribution for each category.

 The bottom and top edges of the box correspond to the first quartile (Q1) and the third
quartile (Q3), respectively.

 The height of the box (Q3 - Q1) provides a measure of the spread of the central portion of
the data.
2. Whiskers:

 The whiskers extend from the box to the minimum and maximum values within a certain
range.

 Outliers beyond the whiskers are considered as individual points.

3. Median (Line inside the box):

 The line inside the box represents the median income for each category.

4. Outliers (Individual Points or Dots):

 Individual points beyond the whiskers are considered outliers. They are plotted
individually to highlight values that are significantly different from the majority of the
data.

Heatmap:
A heatmap is a graphical representation of data in which the values of a matrix are represented as colors. It's particularly useful for visualizing how a numerical value varies across the combinations of two categorical variables. In the given dataset, we can create a heatmap to visualize how the mean "Income" varies across "AgeCategory" and "Category".

import seaborn as sns

# Create a pivot table to organize the data for the heatmap
heatmap_data = df.pivot_table(values='Income', index='AgeCategory', columns='Category', aggfunc='mean')

# Set the style for seaborn
sns.set()

# Create a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(heatmap_data, annot=True, cmap='YlGnBu', fmt=".0f", cbar_kws={'label': 'Mean Income'})
plt.xlabel('Category')
plt.ylabel('Age Category')
plt.title('Heatmap: Relationship between Age Category and Income by Category')
plt.show()

In this code:

 df.pivot_table is used to create a pivot table that organizes the data into a form suitable for a heatmap.
 sns.heatmap is used to create the heatmap with annotations, a specific colormap ('YlGnBu'), and a color bar indicating the mean income.

Interpreting the heatmap involves looking at the colors in the cells:

 Darker colors generally represent higher values, and lighter colors represent lower values.
 In this heatmap, each cell represents the mean income for a combination of
"AgeCategory" and "Category."
This visualization helps in identifying patterns and trends in the dataset, especially how income
varies across different age categories and categories.
Appendix: Quartiles
Q1 (25th percentile): 25 percent of the data is less than Q1.
Q2 (50th percentile): 50 percent of the data is less than Q2 (the median).
Q3 (75th percentile): 75 percent of the data is less than Q3.
IQR = Q3 − Q1 (the interquartile range).
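
As a quick numeric check, the quartiles and IQR can be computed with np.percentile; a short sketch using the sample data from the descriptive-statistics section:

import numpy as np

data = np.array([2, 4, 5, 7, 9, 12, 15, 18, 21])
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 5.0 9.0 15.0 10.0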
