BTech 5 CSE Data Analytics Using Python Unit 4 Notes
UNIT – 4
THE pandas LIBRARY
Pandas provides several methods for loading different types of data files,
including CSV, Excel, and SQL. Below are the commonly used methods for each:
1. Loading CSV Files
We can load a CSV file using the read_csv() function.
import pandas as pd
# Loading a CSV file
df = pd.read_csv('filename.csv')
We can also specify various parameters such as delimiter, encoding, and handling
missing values:
df = pd.read_csv('filename.csv', delimiter=',', encoding='utf-8',
na_values=['N/A', 'NA'])
In Pandas, when reading files like CSVs, you often use encoding='utf-8' to specify the character encoding of the
file you're reading. UTF-8 is one of the most widely used encodings for representing text in computers and is
compatible with most characters from different languages, symbols, and emojis.
Reasons for Using encoding='utf-8' in Pandas:
1. Handle Non-ASCII Characters:
o Many text files contain characters outside the standard ASCII range (e.g., accented letters,
symbols, non-Latin alphabets). UTF-8 can handle a wide variety of characters from different
languages.
2. Avoid Encoding Errors:
SRGI, BHILAI
o If the file contains non-UTF-8 characters and you don't specify the encoding, you might get
errors like UnicodeDecodeError. Specifying encoding='utf-8' ensures that Pandas knows how
to properly interpret the characters in the file.
3. Standard and Universal:
o UTF-8 is the default encoding for many web pages, applications, and databases, so it's a
common practice to use it when importing files. This makes your code more portable and
compatible across different systems and applications.
4. Prevent Misinterpretation of Data:
o If the wrong encoding is used (e.g., ASCII or ISO-8859-1), some characters may not be
interpreted correctly, leading to data corruption or unexpected symbols. By explicitly using
utf-8, you ensure proper interpretation of characters.
Example: Treating custom markers as missing values
# filename.csv contains:
# Name,Age,City
# Ram,25,N/A
# Shyam,NA,London
# Ajay,30,N/A
df = pd.read_csv('filename.csv', na_values=['N/A', 'NA'])
print(df)
# Output:
#     Name   Age    City
# 0    Ram  25.0     NaN
# 1  Shyam   NaN  London
# 2   Ajay  30.0     NaN
Benefits:
• Flexibility: Handle multiple formats of missing data in different datasets.
• Data Cleaning: Convert all recognized missing values to a standard format (NaN).
• Ease of Processing: Allows seamless use of Pandas' built-in functionality for dealing with missing data
(like .fillna() or .dropna()).
Thus, using na_values helps ensure that data is correctly interpreted and missing values are properly handled
during import.
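The .fillna() and .dropna() calls mentioned above can be sketched on a small hand-made DataFrame (the column names and values here are illustrative, not from a real file):

```python
import pandas as pd
import numpy as np

# A small DataFrame with missing values, like one produced by na_values on import
df = pd.DataFrame({'Name': ['Ram', 'Shyam', 'Ajay'],
                   'Age': [25.0, np.nan, 30.0],
                   'City': ['Delhi', 'London', np.nan]})

# Drop any row that contains a missing value (only Ram's row is complete)
dropped = df.dropna()

# Or replace missing values column by column instead of dropping rows
filled = df.fillna({'Age': df['Age'].mean(), 'City': 'Unknown'})

print(dropped)
print(filled)
```

Note that dropna() removes whole rows, so it can discard a lot of data; fillna() keeps every row but requires choosing a sensible replacement value per column.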
CSV (Comma-Separated Values) and Excel files (typically .xls or .xlsx formats) are both popular formats for
storing tabular data, but they have significant differences in structure, functionality, and usability. Here’s a detailed
comparison:
1. File Format
• CSV:
o A plain text file format that uses commas to separate values (though other delimiters like tabs
can also be used).
o Each line represents a row, and each value in the row is separated by a comma.
o File extension: .csv.
• Excel:
o A binary or XML-based file format used by Microsoft Excel to store data.
o Supports multiple sheets within a single file, allowing for more complex data organization.
o File extensions: .xls (older format) and .xlsx (newer format).
2. Data Structure
• CSV:
o Flat structure with a single table of data (no support for multiple sheets).
o Everything is stored as plain text; there is no built-in support for data types, formulas, or
formatting.
• Excel:
o Can contain multiple worksheets (tabs) within a single file.
o Supports various data types, including text, numbers, dates, and more complex objects.
o Allows for rich formatting (font styles, colors, borders, etc.), charts, graphs, and formulas.
3. Usability
• CSV:
o Simple and lightweight, making it easy to read and edit with any text editor.
o Ideal for simple data storage and exchange between applications.
o Limited in terms of features for data manipulation and visualization.
• Excel:
o User-friendly interface with advanced features for data analysis and visualization.
o Ideal for more complex datasets requiring calculations, charts, and formatting.
o Supports functionalities like pivot tables, data validation, and filtering.
4. Size and Performance
• CSV:
o Generally smaller in size compared to Excel files, as it contains only raw data without additional
formatting.
o Faster to load and process due to its simplicity.
• Excel:
o Typically larger in size because of additional features, formatting, and potential embedded
objects.
o Performance can degrade with very large datasets or complex calculations.
5. Interoperability
• CSV:
o Universally compatible with virtually any data processing software, programming languages,
and databases.
o Ideal for data exchange between different platforms and applications.
• Excel:
o Primarily designed for use with Microsoft Excel but can be opened by other spreadsheet
applications (e.g., Google Sheets, LibreOffice Calc).
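The unit's opening list also promises Excel and SQL loading. Excel files are read analogously with pd.read_excel('filename.xlsx'), which needs an engine such as openpyxl installed. For SQL, a minimal self-contained sketch using Python's built-in sqlite3 module (the table name 'students' and its rows are made up for illustration):

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory SQLite database to query against
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE students (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [('Ram', 23), ('Shyam', 25), ('Ajay', 22)])
conn.commit()

# Loading the result of a SQL query into a DataFrame
df = pd.read_sql("SELECT * FROM students", conn)
print(df)
conn.close()
```

The same pd.read_sql() call works with other database connections (e.g., SQLAlchemy engines); only the connection object changes.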
1. Pandas Series
A Series is a one-dimensional labeled array that can hold data of any type.
import pandas as pd
# Creating a Series from a list
s = pd.Series([10, 20, 30, 40, 50])
print(s)
# Output
# 0    10
# 1    20
# 2    30
# 3    40
# 4    50
# dtype: int64
Custom Indexing in Series
You can assign custom indices to the Series for better readability.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s)
# Output
# a    10
# b    20
# c    30
# dtype: int64
Series from a Dictionary
If you create a Series from a dictionary, the keys become the indices, and the
values become the data.
s = pd.Series({'a': 100, 'b': 200, 'c': 300})
print(s)
# Output
# a    100
# b    200
# c    300
# dtype: int64
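When a dictionary is combined with an explicit index, only the matching keys are kept and missing labels become NaN — a quick sketch (the label 'd' is deliberately absent from the dictionary):

```python
import pandas as pd
import numpy as np

# 'a' and 'b' match dictionary keys; 'd' has no entry, so it becomes NaN
s = pd.Series({'a': 100, 'b': 200, 'c': 300}, index=['a', 'b', 'd'])
print(s)
# Output
# a    100.0
# b    200.0
# d      NaN
# dtype: float64
```

Note the dtype becomes float64 because NaN is a floating-point value.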
2. Pandas DataFrame
A DataFrame is a two-dimensional labeled data structure that can hold data of
different types (integer, float, string, etc.) in columns. Think of it as a table (like
an Excel sheet or SQL table) where rows and columns are both labeled.
Creating a DataFrame
You can create a DataFrame from various data structures like lists of lists,
dictionaries, or NumPy arrays.
data = {'Name': ['Ram', 'Shyam', 'Ajay'],
        'Age': [23, 25, 22],
        'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data)
print(df)
# Output
#     Name  Age      City
# 0    Ram   23  New York
# 1  Shyam   25     Paris
# 2   Ajay   22    London
Custom Index in DataFrame
You can assign custom row and column labels to make the DataFrame more
descriptive.
df = pd.DataFrame({'Name': ['Ram', 'Shyam', 'Ajay'],
                   'Age': [23, 25, 22],
                   'City': ['New York', 'Paris', 'London']},
                  index=['Row1', 'Row2', 'Row3'])
print(df)
# Output
#        Name  Age      City
# Row1    Ram   23  New York
# Row2  Shyam   25     Paris
# Row3   Ajay   22    London
DataFrame from a List of Lists
data = [[23, 'Ram', 'New York'], [25, 'Shyam', 'Paris'], [22, 'Ajay', 'London']]
df = pd.DataFrame(data, columns=['Age', 'Name', 'City'])
print(df)
# Output
# Age Name City
# 0 23 Ram New York
# 1 25 Shyam Paris
# 2 22 Ajay London
DataFrame from a Dictionary of Series
You can also create a DataFrame using a dictionary of Pandas Series.
age = pd.Series([23, 25, 22], index=['Ram', 'Shyam', 'Ajay'])
city = pd.Series(['New York', 'Paris', 'London'], index=['Ram', 'Shyam', 'Ajay'])
df = pd.DataFrame({'Age': age, 'City': city})
print(df)
# Output
#        Age      City
# Ram     23  New York
# Shyam   25     Paris
# Ajay    22    London
Key Features of DataFrame:
• Heterogeneous data: Different columns can have different data types (int,
float, string, etc.).
• Size-mutable: You can add or remove rows and columns dynamically.
• Supports indexing by rows and columns (using labels or integers).
• Can handle missing data (NaN).
Key Differences between Series and DataFrame:
• Series is one-dimensional, while DataFrame is two-dimensional.
• Series can be thought of as a column, whereas a DataFrame is a collection
of multiple columns (with potentially different data types) organized into
rows and columns.
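The "a DataFrame is a collection of columns" point can be checked directly: each column pulled out of a DataFrame is itself a Series (column names below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Ram', 'Shyam'], 'Age': [23, 25]})

# Selecting one column yields a Series, not a smaller DataFrame
col = df['Age']
print(type(df))   # <class 'pandas.core.frame.DataFrame'>
print(type(col))  # <class 'pandas.core.series.Series'>
```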
Example of DataFrame Manipulation:
You can access, filter, and manipulate the data in a DataFrame like so:
# Accessing a column
print(df['Age'])
# Accessing rows by index
print(df.loc['Ram'])
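The filtering mentioned above can be sketched with boolean indexing; the DataFrame is rebuilt here so the snippet runs on its own (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Age': [23, 25, 22],
                   'City': ['New York', 'Paris', 'London']},
                  index=['Ram', 'Shyam', 'Ajay'])

# Keep only the rows where the condition holds
over_22 = df[df['Age'] > 22]
print(over_22)
```

The expression df['Age'] > 22 produces a boolean Series, and passing it back into df[...] selects only the rows marked True.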
Index Objects in Pandas
Every Series and DataFrame carries an Index object that labels its rows.
1. Default Index
By default, Pandas assigns an integer index starting from 0.
import pandas as pd
s = pd.Series([100, 200, 300])
print(s)
# Output
# 0    100
# 1    200
# 2    300
# dtype: int64
2. Custom Index
You can define custom index labels when creating a Series or DataFrame.
s = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
print(s)
# Output
# a    100
# b    200
# c    300
# dtype: int64
3. Accessing the Index Object
You can access the index of a Series or DataFrame using the .index attribute.
s = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
print(s.index)
# Output
# Index(['a', 'b', 'c'], dtype='object')
4. Indexing in DataFrame
The index in a DataFrame refers to the row labels, while the column labels are
referred to as columns.
df = pd.DataFrame({'Age': [23, 25, 22],
                   'City': ['New York', 'Paris', 'London']},
                  index=['Ram', 'Shyam', 'Ajay'])
print(df)
# Output
#        Age      City
# Ram     23  New York
# Shyam   25     Paris
# Ajay    22    London
You can access both row indices and column names:
# Accessing row index
print(df.index)
# Accessing column names
print(df.columns)
5. MultiIndex (Hierarchical Index)
A MultiIndex allows for multi-level indexing, which is helpful when you have
more complex data.
arrays = [
['A', 'A', 'B', 'B'],
['one', 'two', 'one', 'two']
]
multi_index = pd.MultiIndex.from_arrays(arrays, names=('Upper', 'Lower'))
# Creating a DataFrame with MultiIndex
df = pd.DataFrame({'Value': [10, 20, 30, 40]}, index=multi_index)
print(df)
# Output
#              Value
# Upper Lower
# A     one       10
#       two       20
# B     one       30
#       two       40
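Rows of a MultiIndexed frame can then be selected level by level with .loc — a short sketch that rebuilds the same frame so it runs on its own:

```python
import pandas as pd

arrays = [['A', 'A', 'B', 'B'], ['one', 'two', 'one', 'two']]
multi_index = pd.MultiIndex.from_arrays(arrays, names=('Upper', 'Lower'))
df = pd.DataFrame({'Value': [10, 20, 30, 40]}, index=multi_index)

# All rows under the outer label 'A' (result is indexed by 'Lower')
print(df.loc['A'])

# A single cell addressed by both levels
print(df.loc[('B', 'two'), 'Value'])
```

Selecting with only the outer label drops that level from the result; a full tuple pinpoints a single row.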
Setting a Column as the Index
An existing column can be promoted to the row index with set_index().
df = pd.DataFrame({'Name': ['Shyam', 'Ajay', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['Paris', 'London', 'New York']})
df = df.set_index('Name')
print(df)
# Output
#          Age      City
# Name
# Shyam     25     Paris
# Ajay      30    London
# Charlie   35  New York
Common Operations with the Index
1. Unsorted Index: A Series keeps its index in the order it was given, which may be unsorted.
s = pd.Series([3, 2, 1, 0], index=['c', 'b', 'a', 'd'])
print(s)
# Output
# c    3
# b    2
# a    1
# d    0
# dtype: int64
2. Sorting: You can sort a DataFrame or Series by its index using
sort_index().
df_sorted = df.sort_index()
print(df_sorted)
3. Checking for Uniqueness: You can check whether an index is unique
using .is_unique.
print(df.index.is_unique)
4. Checking for Duplicates: Duplicate labels can sometimes occur in an index, and
you can use duplicated() to flag them.
print(df.index.duplicated())
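A closely related operation is reindex(), which rearranges a Series (or DataFrame) to a new set of labels, inserting NaN for labels that did not exist before — a self-contained sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Rearrange to a new label order; 'd' is new, so it gets NaN
s2 = s.reindex(['c', 'a', 'd'])
print(s2)
# Output
# c    3.0
# a    1.0
# d    NaN
# dtype: float64
```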
Conclusion
The Index object is fundamental in Pandas for efficient data manipulation,
alignment, and access. It acts as a label or identifier for the data, allowing easy
selection, reindexing, and slicing of data. Whether you're working with simple or
hierarchical data, Pandas' Index provides a flexible and powerful tool to manage
data effectively.
a, b = 10, 3
print(a - b) # Output: 7
print(a * b) # Output: 30
print(a / b) # Output: 3.333...
1. Row-wise Operations
By default, arithmetic between a DataFrame and a Series aligns the Series' index with the
DataFrame's columns, so the operation is applied across each row.
# Creating a DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})
# Series with index matching the DataFrame's columns
series = pd.Series([1, 2, 3], index=['A', 'B', 'C'])
result = df - series
print(result)
Output:
A B C
0 0 2 4
1 1 3 5
2 2 4 6
2. Column-wise Operations
To perform operations column-wise, the index of the Series must align with the
DataFrame's index (its rows). Since the plain + operator always aligns on columns, use the
arithmetic methods such as add() with axis=0 to request alignment on the row index instead.
Example: Adding a Series to each column of a DataFrame
# Series with index matching the DataFrame's index (row-wise alignment)
series_row = pd.Series([1, 2, 3])
result = df.add(series_row, axis=0)
print(result)
Output:
A B C
0 2 5 8
1 4 7 10
2 6 9 12
3. Operations with Non-Matching Indices/Columns
If the Series and DataFrame have non-matching indices or columns, pandas will
align based on the labels and fill in NaN where data is missing.
Example: Adding a Series with non-matching columns
# Series with labels not all present in the DataFrame's columns
series_diff = pd.Series([10, 20], index=['A', 'D']) # 'D' is not in DataFrame
result = df + series_diff
print(result)
Output:
A B C D
0 11 NaN NaN NaN
1 12 NaN NaN NaN
2 13 NaN NaN NaN
4. Other Operations (Multiplication, Division, etc.)
You can perform other element-wise operations between a DataFrame and a
Series, such as multiplication, division, modulo, etc.
Example: Multiplying a DataFrame by a Series
# Multiplying a Series (aligned on the columns) into each row of the DataFrame
series = pd.Series([1, 2, 3], index=['A', 'B', 'C'])
result = df * series
print(result)
Output:
A B C
0 1 8 21
1 2 10 24
2 3 12 27
5. Using Functions like apply
You can also use the apply() function to perform more complex operations by
applying a function to each row or column.
Example: Applying a function row-wise
# Applying a lambda function row-wise
result = df.apply(lambda row: row + series, axis=1)
print(result)
Summary of Key Operations:
• Arithmetic: +, -, *, /, etc. will broadcast and align based on indices or
columns.
• Alignment: Missing labels will result in NaN.
• Control Axis: by default a Series is aligned against the DataFrame's columns; pass
axis=0 to the arithmetic methods (add(), sub(), etc.) to align against the row index instead.
Statistics Functions:
In data analytics, statistical functions are essential to describe, summarize,
and interpret data. They allow analysts to identify patterns, trends, and
relationships in datasets. Here’s a comprehensive list of the most commonly
used statistical functions in Python, along with examples.
1. Measures of Central Tendency
Mean (Average)
The mean is the average of all values in the dataset.
import numpy as np
data = [10, 15, 20, 25, 30]
mean = np.mean(data)
print("Mean:", mean)
Output:
Mean: 20.0
Median
The median is the middle value in an ordered dataset. It’s useful for skewed data,
as it isn’t affected by extreme values.
median = np.median(data)
print("Median:", median)
Output:
Median: 20.0
Mode
The mode is the most frequently occurring value. It’s especially useful for
categorical data.
from scipy import stats
data = [10, 15, 15, 20, 25, 25, 25, 30]
# keepdims=True keeps the result as an array across SciPy versions
mode = stats.mode(data, keepdims=True)
print("Mode:", mode.mode[0])
Output:
Mode: 25
2. Measures of Dispersion
Variance
Variance measures the spread of data points around the mean. A high variance
means the data is more spread out.
variance = np.var(data)
print("Variance:", variance)
Output (for the 8-value dataset defined in the mode example):
Variance: 40.234375
Standard Deviation
The standard deviation is the square root of the variance, showing the average
distance from the mean.
std_dev = np.std(data)
print("Standard Deviation:", std_dev)
Output:
Standard Deviation: 6.34
Range
The range is the difference between the maximum and minimum values.
range_value = np.max(data) - np.min(data)
print("Range:", range_value)
Output:
Range: 20
3. Measures of Position
Percentiles
Percentiles divide the data into 100 equal parts. For instance, the 90th percentile
is the value below which 90% of the data lies.
percentile_90 = np.percentile(data, 90)
print("90th Percentile:", percentile_90)
Output:
90th Percentile: 26.5
Quartiles
Quartiles divide the data into four equal parts. The 25th percentile is the first
quartile (Q1), the 50th percentile is the median (Q2), and the 75th percentile is
the third quartile (Q3). (Note: data here is still the 8-value list from the mode
example, so this median is 22.5, not the 20.0 computed earlier for the 5-value list.)
q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50) # This is the median
q3 = np.percentile(data, 75)
print("Q1:", q1, "Q2 (Median):", q2, "Q3:", q3)
Output:
Q1: 15.0 Q2 (Median): 22.5 Q3: 25.0
Interquartile Range (IQR)
The IQR is the range between the first and third quartiles. It helps detect outliers.
iqr = q3 - q1
print("Interquartile Range (IQR):", iqr)
Output:
Interquartile Range (IQR): 10.0
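The usual fence for outlier detection is 1.5×IQR beyond the quartiles. A minimal sketch on a small made-up dataset with one obvious outlier:

```python
import numpy as np

data = [10, 15, 15, 20, 25, 25, 25, 90]  # 90 is an artificial outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]

print("Bounds:", lower, upper)
print("Outliers:", outliers)
```

Because quartiles ignore extreme values, this rule is robust: the single large value 90 does not move the fences much, so it is cleanly detected.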
4. Measures of Shape
Skewness
Skewness measures the asymmetry of a distribution around its mean. A positive
skew indicates a longer right tail; a negative skew indicates a longer left tail.
from scipy.stats import skew
skewness = skew(data)
print("Skewness:", skewness)
6. Ranking
Ranking assigns a rank to each value based on its order. Ties can be handled by
averaging ranks or assigning ranks based on the order they appear.
import pandas as pd
rank_ascending = pd.Series(data).rank() # Ascending rank
rank_descending = pd.Series(data).rank(ascending=False) # Descending rank
print("Ascending Rank:\n", rank_ascending)
print("Descending Rank:\n", rank_descending)
7. Z-Scores
Z-scores indicate how many standard deviations a data point is from the mean.
It’s used to identify outliers.
z_scores = stats.zscore(data)
print("Z-Scores:", z_scores)
8. Probability Distributions
Normal Distribution
The normal distribution is a symmetrical, bell-shaped distribution. You can
generate it in Python with:
import matplotlib.pyplot as plt
data = np.random.normal(0, 1, 1000)
plt.hist(data, bins=30, alpha=0.5)
plt.show()
Binomial Distribution
The binomial distribution represents the number of successes in a fixed number
of trials.
from scipy.stats import binom
n, p = 10, 0.5 # 10 trials, 50% success probability
binom_data = binom.rvs(n, p, size=1000)
plt.hist(binom_data, bins=30, alpha=0.5)
plt.show()
Summary
This comprehensive set of statistical functions allows data analysts to gain
insights, make comparisons, detect outliers, and understand relationships within
data. These functions are essential for exploring data and forming conclusions
based on patterns and trends.
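Many of the measures above come in a single call with pandas' describe(), which reports the count, mean, standard deviation, min, quartiles, and max of a numeric Series or of each numeric DataFrame column — a quick sketch using the 8-value dataset from earlier:

```python
import pandas as pd

s = pd.Series([10, 15, 15, 20, 25, 25, 25, 30])

# One call summarizing count, mean, std, min, quartiles, and max
summary = s.describe()
print(summary)
```

This is a convenient first step in exploratory analysis before computing individual statistics by hand.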