NumPy is a powerful Python library for numerical operations, particularly for working with
arrays and matrices. It provides a high-performance multidimensional array object and tools for
working with these arrays. Here are some key features of NumPy along with examples:
Creating Arrays: NumPy provides various ways to create arrays.
import numpy as np
# Creating a 1D array
arr_1d = np.array([1, 2, 3, 4, 5])
# Creating a 2D array
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
Array Operations: NumPy supports element-wise operations and provides functions for
common mathematical operations.
# Element-wise addition
result_addition = arr_1d + 10
# Element-wise multiplication
result_multiply = arr_2d * 2
Array Indexing and Slicing: NumPy allows you to index and slice arrays efficiently.
# Accessing elements
print(arr_1d[2]) # Output: 3
# Slicing
print(arr_2d[:, 1]) # Output: [2, 5, 8]
# Broadcasting
arr_broadcast = np.array([1, 2, 3]) + 5
Linear Algebra Operations: NumPy provides a set of linear algebra functions for matrix
operations.
# Matrix multiplication
mat1 = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
result_matmul = np.dot(mat1, mat2)
Statistical Functions: NumPy includes various statistical functions for analyzing data.
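No example accompanies this feature in the notes; a minimal sketch of a few built-in statistical functions might look like:

```python
import numpy as np

# Sample data for summary statistics
data = np.array([1, 2, 3, 4, 5])

print(np.mean(data))    # arithmetic mean
print(np.median(data))  # middle value
print(np.std(data))     # standard deviation
print(np.sum(data))     # total
```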
Random Module: NumPy provides a random module for generating random numbers and
arrays.
# Random array
random_array = np.random.rand(3, 3)
# Generate 1000 random numbers from a normal distribution with mean 0 and standard deviation 1
normal_random = np.random.randn(1000)
# Generate 1000 random numbers with a specified mean and standard deviation
desired_mean, desired_std = 10, 2  # example values; use whatever mean and std you need
random_numbers = np.random.normal(loc=desired_mean, scale=desired_std, size=1000)
# Calculate the actual mean and standard deviation of the generated numbers
actual_mean = np.mean(random_numbers)
actual_std = np.std(random_numbers)
# Reshaping arrays with reshape
arr_1d = np.arange(12)
arr_2d = arr_1d.reshape(3, 4)
print("Original 1D array:")
print(arr_1d) # [ 0 1 2 3 4 5 6 7 8 9 10 11]
print("\nReshaped 2D array:")
print(arr_2d) # [[ 0 1 2 3] [ 4 5 6 7] [ 8 9 10 11]]
Pandas:
While NumPy is a library for numerical operations and array manipulation, pandas is a library specifically designed for structured and tabular data manipulation and analysis. In pandas, reshaping is associated more with operations like pivoting, melting, and stacking than with the direct reshaping of arrays.
Pandas provides a rich set of functions for reshaping and manipulating data, which is especially useful when dealing with datasets. The examples below highlight some common operations, but pandas offers many more functionalities for reshaping and transforming data to suit your analysis needs. Keep in mind that the focus of pandas is on tabular data, so it may not directly align with the array-centric operations in NumPy.
Let's explore a few operations in pandas that are related to reshaping data:
Creating a DataFrame: In pandas, the primary data structure is the DataFrame, which is a 2-
dimensional labeled data structure.
import pandas as pd
# Creating a DataFrame from a dictionary
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
})
print("Original DataFrame:")
print(df)
# Output:
# Name Age City
# 0 Alice 25 New York
# 1 Bob 30 San Francisco
# 2 Charlie 35 Los Angeles
Pivoting: The pivot function in pandas is used to reshape the DataFrame by pivoting based on
the values in a column.
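The pivot call itself is missing from the notes; a minimal sketch using the DataFrame from the previous example (names, ages, and cities taken from its printed output):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
})

# Pivot: Names become the row index, Cities become columns, Ages fill the cells
pivoted_df = df.pivot(index='Name', columns='City', values='Age')
print(pivoted_df)
```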
Stacking and Unstacking: The stack and unstack functions are used to reshape a DataFrame by
stacking or unstacking levels.
# One way to obtain stacked_df (a reconstruction): pivot the DataFrame, then stack it
stacked_df = df.pivot(index='Name', columns='City', values='Age').stack()
unstacked_df = stacked_df.unstack()
print("\nUnstacked DataFrame:")
print(unstacked_df)
# Output:
# City Los Angeles New York San Francisco
# Name
# Alice NaN 25.0 NaN
# Bob NaN NaN 30.0
# Charlie 35.0 NaN NaN
Importing Data:
CSV:
import pandas as pd
# Import data from a CSV file
df = pd.read_csv('your_file.csv')
Excel:
# Import data from an Excel file
df = pd.read_excel('your_file.xlsx')
JSON:
# Import data from a JSON file
df = pd.read_json('your_file.json')
Exporting Data:
CSV:
# Export DataFrame to CSV
df.to_csv('output_file.csv', index=False)
Excel:
# Export DataFrame to Excel
df.to_excel('output_file.xlsx', index=False)
JSON:
# Export DataFrame to JSON
df.to_json('output_file.json', orient='records')
Text:
# Export DataFrame to a text file (tab-separated)
df.to_csv('output_file.txt', sep='\t', index=False)
These examples demonstrate basic import and export operations in pandas. Make sure to replace
'your_file.csv', 'your_file.xlsx', 'your_file.json', and 'your_file.txt' with the actual paths or URLs
of your data files.
Keep in mind that these functions have various parameters to handle different configurations of
data, such as specifying delimiters, encoding, header presence, etc. Always refer to the pandas
documentation for more details on these parameters.
Appendix:
A normal distribution, also known as a Gaussian distribution, is a symmetric bell-shaped
probability distribution. In simpler terms, it describes how data is spread around a central value.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Generate random heights for 1000 students (example values: mean 170 cm, std 10 cm)
heights = np.random.normal(loc=170, scale=10, size=1000)
# Histogram of the heights, with the fitted normal curve overlaid
plt.hist(heights, bins=30, density=True, edgecolor='black')
x = np.linspace(heights.min(), heights.max(), 200)
plt.plot(x, norm.pdf(x, loc=heights.mean(), scale=heights.std()))
plt.show()
In this example, the numpy.random.normal function is used to generate random heights for 1000 students. The resulting histogram has a bell-shaped curve, with the majority of heights centered around the mean value.
Understanding the normal distribution is fundamental in statistics, and it often appears in various
natural phenomena and measurements. The concept is widely used in fields such as physics,
biology, finance, and more.
Descriptive Statistics:
Descriptive statistics are used to summarize and describe the main features of a dataset.
import numpy as np
# Sample data
data = np.array([2, 4, 5, 7, 9, 12, 15, 18, 21])
# Mean
mean_value = np.mean(data)
print(f"Mean: {mean_value}")
# Median
median_value = np.median(data)
print(f"Median: {median_value}")
# Standard Deviation
std_dev = np.std(data)
print(f"Standard Deviation: {std_dev}")
# Variance
variance = np.var(data)
print(f"Variance: {variance}")
Correlation:
Correlation measures the strength and direction of a linear relationship between two variables.
import numpy as np
from scipy.stats import pearsonr
# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 1, 6, 8])
# Compute the Pearson correlation coefficient and p-value
corr, p_value = pearsonr(x, y)
print(f"Correlation: {corr}")
import pandas as pd
df = pd.DataFrame({'A': x, 'B': y})
To see the correlation between two variables, we can also use a scatter plot. You can create a scatter plot for two columns using the matplotlib library. Here's an example using the previously defined DataFrame (df), plotting column 'A' against column 'B':
import matplotlib.pyplot as plt
plt.scatter(df['A'], df['B'])
plt.xlabel('A')
plt.ylabel('B')
plt.show()
import pandas as pd
# Sample DataFrame
data = {
'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
'Subcategory': ['X', 'X', 'Y', 'Y', 'X', 'Y'],
'Value': [10, 15, 20, 25, 30, 35]
}
df = pd.DataFrame(data)
df
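The filtering examples this section refers to are missing from the notes; a sketch of two common approaches (boolean indexing with square brackets and the loc accessor; which two methods the notes originally showed is an assumption):

```python
import pandas as pd

data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Subcategory': ['X', 'X', 'Y', 'Y', 'X', 'Y'],
    'Value': [10, 15, 20, 25, 30, 35]
}
df = pd.DataFrame(data)

# Method 1: boolean indexing with square brackets
filtered = df[(df['Category'] == 'A') & (df['Value'] > 15)]

# Method 2: the loc accessor with the same condition
filtered_loc = df.loc[(df['Category'] == 'A') & (df['Value'] > 15)]
print(filtered)
```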
In these examples, the conditions inside the square brackets for both methods specify the filtering criteria. The `&` operator is used for the logical AND operation.
Grouping Data:
To group data in pandas, you typically use the groupby method, which allows you to split the
DataFrame into groups based on one or more columns.
Once you've grouped the data, you can apply various aggregation functions to obtain summary
statistics for each group. Let's explore different formats of aggregation:
You can use built-in aggregation functions like sum(), mean(), count(), max(), min(), etc.
# Group by 'Category' and sum the 'Value' column
result = df.groupby('Category')['Value'].sum()
result
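Several aggregations can also be applied at once with agg; a sketch using the sample Category/Value data defined above:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 15, 20, 25, 30, 35]
})

# Apply several aggregation functions to each group in one call
summary = df.groupby('Category')['Value'].agg(['sum', 'mean', 'count'])
print(summary)
```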
Merging DataFrames:
In pandas, merging refers to the combining of two or more DataFrames based on a common set
of columns or indices. There are several types of merges:
1. Inner Merge:
An inner merge returns only the rows where there is a match in both DataFrames based on the
specified columns.
# Sample DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})
# Inner merge on the 'key' column
inner_merged = pd.merge(df1, df2, on='key', how='inner')
print(inner_merged)
Output:
key value1 value2
0 B 2 4
1 C 3 5
2. Left Merge:
A left merge returns all rows from the left DataFrame and the matched rows from the right
DataFrame. If there is no match, NaN values are filled in for the columns from the right
DataFrame.
# Left merge on the 'key' column
left_merged = pd.merge(df1, df2, on='key', how='left')
print(left_merged)
Output:
key value1 value2
0 A 1 NaN
1 B 2 4.0
2 C 3 5.0
3. Right Merge:
A right merge is similar to a left merge but returns all rows from the right DataFrame and the
matched rows from the left DataFrame.
# Right merge on the 'key' column
right_merged = pd.merge(df1, df2, on='key', how='right')
print(right_merged)
Output:
key value1 value2
0 B 2.0 4
1 C 3.0 5
2 D NaN 6
4. Outer Merge:
An outer merge returns all rows from both DataFrames, filling in NaN for missing values.
# Outer merge on the 'key' column
outer_merged = pd.merge(df1, df2, on='key', how='outer')
print(outer_merged)
Output:
key value1 value2
0 A 1.0 NaN
1 B 2.0 4.0
2 C 3.0 5.0
3 D NaN 6.0
5. Concatenation:
Concatenation is a method of combining two DataFrames along a particular axis (either rows or
columns).
# Concatenate along rows, resetting the index
concatenated = pd.concat([df1, df2], ignore_index=True)
print(concatenated)
Output:
key value1 value2
0 A 1.0 NaN
1 B 2.0 NaN
2 C 3.0 NaN
3 B NaN 4.0
4 C NaN 5.0
5 D NaN 6.0
pd.concat([df1, pd.DataFrame({"key": ["E", "F"], "value1": [5, 9]})], ignore_index=True)
Output
key value1
0 A 1
1 B 2
2 C 3
3 E 5
4 F 9
These examples cover some of the basic merging and concatenation operations in pandas. Depending on your specific use case, you may choose the appropriate type of merge or concatenation.
Data Visualization
Data visualization is the technique of making inferences from data by plotting charts and graphs.
It is used to deduce trends, patterns, and correlations that might be hard to find without visualization.
Data visualization is also used to present the inferences derived from data analytics.
Type of Plots:
Throughout the upcoming slides we will cover the following plots:
1. Line Plots (trend over time)
2. Scatter (to see correlation between two columns)
3. Bar charts and Column Charts (frequency of data)
4. Histogram (frequency of data)
5. Pie charts
6. Box plot (statistical value – outliers)
7. Heatmap (comparison)
For the rest of this chapter, consider the following dataset:
import pandas as pd
# Example dataset
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [50000, 60000, 75000, 90000, 80000, 95000, 110000, 120000, 105000, 130000],
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
}
# Create a DataFrame
df = pd.DataFrame(data)
Line Plot:
In this example, we visualize the relationship between age and income using a line plot:
import matplotlib.pyplot as plt
# Create a line plot of Income over Age
plt.plot(df['Age'], df['Income'], marker='o', label='Income')
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Income over Age')
# Add legend
plt.legend()
plt.show()
In this code, plt.plot draws Income against Age as a connected line, and plt.legend adds the label to the chart.
Scatter Plot:
In the example below, we use scatter plot to visualize the relationship between age and income
for each individual in the dataset. Each point represents a unique combination of age and income,
and the color and marker style can be adjusted based on preferences.
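The basic scatter-plot code is missing from the notes; a minimal sketch using the example dataset defined earlier:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [50000, 60000, 75000, 90000, 80000, 95000, 110000, 120000, 105000, 130000]
})

# Scatter plot: one point per (Age, Income) pair
plt.scatter(df['Age'], df['Income'], color='blue', marker='o')
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Income vs Age')
plt.show()
```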
Now we want to see the scatter plot of Age vs Income, visually separated for each Category.
# Split the DataFrame by category
category_a = df[df['Category'] == 'A']
category_b = df[df['Category'] == 'B']
# Create a scatter plot for Income over Age with different colors for each category
plt.scatter(category_a['Age'], category_a['Income'], color='blue', marker='o', label='Category A')
plt.scatter(category_b['Age'], category_b['Income'], color='orange', marker='s', label='Category B')
plt.legend()
plt.show()
In this code, the DataFrame is first filtered into the two categories, and each subset is plotted with its own color and marker style.
Bar Charts:
In the example below, we want to visually compare the average income in categories A and B.
In this code:
The average income for each category is calculated using groupby and mean.
The first set of plots is a vertical bar chart (plt.bar) with the bars representing the average
income for each category.
The second set of plots is a horizontal bar chart (plt.barh) with the bars arranged
horizontally.
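The bar-chart code those bullets describe is missing from the notes; a sketch consistent with them, using the example dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Income': [50000, 60000, 75000, 90000, 80000, 95000, 110000, 120000, 105000, 130000],
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
})

# Average income per category
avg_income = df.groupby('Category')['Income'].mean()

# Vertical bar chart
plt.bar(avg_income.index, avg_income.values)
plt.xlabel('Category')
plt.ylabel('Average Income')
plt.show()

# Horizontal bar chart of the same values
plt.barh(avg_income.index, avg_income.values)
plt.xlabel('Average Income')
plt.ylabel('Category')
plt.show()
```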
Histogram:
For histograms, you typically focus on a single variable to visualize its statistical distribution.
In this code:
plt.hist is used to create histograms for both the "Age" and "Income" variables.
bins parameter controls the number of bins (intervals) in the histogram.
edgecolor is set to 'black' to add black borders around the bars.
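The histogram code those bullets describe is missing from the notes; a sketch consistent with them:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [50000, 60000, 75000, 90000, 80000, 95000, 110000, 120000, 105000, 130000]
})

# Histogram of Age
plt.hist(df['Age'], bins=5, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Histogram of Income
plt.hist(df['Income'], bins=5, edgecolor='black')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()
```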
Pie charts:
A pie chart is suitable when you want to represent the proportions of different categories in a
whole. In this example, let's create a pie chart to represent the distribution of the "Category"
variable in the given dataset:
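The pie-chart code for the "Category" variable is missing from the notes; a minimal sketch:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']})

# Count how many rows fall in each category
category_counts = df['Category'].value_counts()

# Pie chart of the category proportions
plt.pie(category_counts.values, labels=category_counts.index, autopct='%1.1f%%')
plt.title('Distribution of Category')
plt.show()
```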
This pie chart below represents the distribution of age categories in the dataset. We can create
age categories and represent the distribution of these age categories in a pie chart. Here, we'll
define age categories based on ranges (e.g., 20-30, 31-40, etc.):
Age categories are defined using pd.cut to create a new column 'AgeCategory' in the
DataFrame.
The age categories are sorted using sort_index() after calculating the distribution.
plt.pie is used to create a pie chart for the distribution of age categories.
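The code the bullets above describe is missing from the notes; a sketch consistent with them (the age bin edges are example values, not from the notes):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]})

# Define age categories with pd.cut (bin edges are example values)
bins = [20, 30, 40, 50, 60, 70]
labels = ['20-30', '31-40', '41-50', '51-60', '61-70']
df['AgeCategory'] = pd.cut(df['Age'], bins=bins, labels=labels)

# Distribution of age categories, sorted by category order
age_counts = df['AgeCategory'].value_counts().sort_index()

plt.pie(age_counts.values, labels=age_counts.index, autopct='%1.1f%%')
plt.title('Distribution of Age Categories')
plt.show()
```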
Box plot:
A boxplot, also known as a box-and-whisker plot, is a statistical visualization that provides a
summary of the distribution of a dataset. It is particularly useful for comparing distributions
between different categories or groups. A boxplot consists of several key components:
1. Box:
The box represents the interquartile range (IQR), which is the range between the first
quartile (Q1) and the third quartile (Q3).
The height of the box is the IQR, and it contains the middle 50% of the data.
2. Whiskers:
Whiskers extend from the box to the minimum and maximum values within a certain range, often determined by a multiplier of the IQR.
They provide information about the spread of the data beyond the quartiles.
3. Median (Line Inside the Box):
The line inside the box represents the median of the dataset. It divides the dataset into two halves, with 50% of the data points below and 50% above the median.
4. Outliers (Individual Points or Dots):
Individual points beyond the whiskers are considered outliers and are plotted individually.
Let's create a boxplot for the "Income" variable in the given dataset:
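The seaborn code the bullets below describe is missing from the notes; a sketch consistent with them, using the example dataset:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Income': [50000, 60000, 75000, 90000, 80000, 95000, 110000, 120000, 105000, 130000],
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
})

# Boxplot of Income for each Category
sns.boxplot(x='Category', y='Income', data=df)
plt.title('Income Distribution by Category')
plt.show()
```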
In this code:
Seaborn (sns) is used to create a boxplot. Seaborn is a statistical data visualization library
that works well with pandas DataFrames.
The x parameter specifies the categorical variable ('Category') on the x-axis, and the y
parameter specifies the numerical variable ('Income') on the y-axis.
The boxplot visually represents the distribution of income for each category.
Let's interpret a boxplot using the boxplot created for the "Income" variable in the given dataset:
1. Box (Interquartile Range):
The box represents the middle 50% of the income distribution for each category.
The bottom and top edges of the box correspond to the first quartile (Q1) and the third
quartile (Q3), respectively.
The height of the box (Q3 - Q1) provides a measure of the spread of the central portion of
the data.
2. Whiskers:
The whiskers extend from the box to the minimum and maximum values within a certain
range.
The line inside the box represents the median income for each category.
Individual points beyond the whiskers are considered outliers. They are plotted
individually to highlight values that are significantly different from the majority of the
data.
Heatmap:
A heatmap is a graphical representation of data where values in a matrix are represented as
colors. It's particularly useful for visualizing the relationships between two categorical variables.
In the given dataset, we can create a heatmap to visualize the relationship between
"AgeCategory" and "Income."
import seaborn as sns
# Build a pivot table of mean income by age category and by category
# (assumes an 'AgeCategory' column created earlier with pd.cut)
heatmap_data = df.pivot_table(values='Income', index='AgeCategory', columns='Category', aggfunc='mean')
# Create a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(heatmap_data, annot=True, cmap='YlGnBu', fmt=".0f", cbar_kws={'label': 'Mean Income'})
plt.xlabel('Category')
plt.ylabel('Age Category')
plt.title('Heatmap: Relationship between Age Category and Income by Category')
plt.show()
In this code:
df.pivot_table is used to create a pivot table that organizes the data to be suitable for a heatmap.
sns.heatmap is used to create the heatmap with annotations, a specific colormap ('YlGnBu'), and
a color bar indicating the mean income.
Darker colors generally represent higher values, and lighter colors represent lower values.
In this heatmap, each cell represents the mean income for a combination of
"AgeCategory" and "Category."
This visualization helps in identifying patterns and trends in the dataset, especially how income
varies across different age categories and categories.
Appendix: Quartiles
Q1 (first quartile): 25 percent of the data is less than Q1.
Q3 (third quartile): 75 percent of the data is less than Q3.
IQR = Q3 − Q1 (the interquartile range).
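These definitions can be checked numerically with np.percentile; a sketch using the sample data from the descriptive-statistics section:

```python
import numpy as np

data = np.array([2, 4, 5, 7, 9, 12, 15, 18, 21])

# Quartiles via percentiles
q1 = np.percentile(data, 25)  # 25% of the data lies below Q1
q3 = np.percentile(data, 75)  # 75% of the data lies below Q3
iqr = q3 - q1                 # interquartile range

print(q1, q3, iqr)
```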