
UNIT – 4
THE pandas LIBRARY

Introduction to pandas Library


Pandas is a powerful and widely-used open-source data analysis and
manipulation library in Python. It provides data structures like Series (1D) and
DataFrame (2D), which are designed to handle structured data intuitively and
efficiently.
Pandas is built on top of NumPy and offers functions for tasks like:
• Data cleaning
• Data wrangling
• Data manipulation
• Handling missing data
• Merging, joining, and grouping datasets
• Time series analysis
Its user-friendly interface allows easy loading, manipulation, and analysis of
datasets from various formats like CSV, Excel, SQL databases, and more. Pandas
simplifies complex tasks in data science and machine learning by providing
flexible tools to manage and transform data. It is essential for data scientists and
analysts for its versatility and ease of use.
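
As a quick taste of the grouping and merging features listed above, here is a minimal sketch (the column names and values are invented for illustration):
import pandas as pd

# A small sales table
sales = pd.DataFrame({
'Region': ['East', 'West', 'East', 'West'],
'Revenue': [100, 200, 150, 250]
})

# Grouping: total revenue per region
print(sales.groupby('Region')['Revenue'].sum())

# Merging: attach a manager name to each row by Region
managers = pd.DataFrame({'Region': ['East', 'West'], 'Manager': ['Asha', 'Ravi']})
print(sales.merge(managers, on='Region'))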

Pandas provides several methods for loading different types of data files,
including CSV, Excel, and SQL. Below are the commonly used methods for each:
1. Loading CSV Files
We can load a CSV file using the read_csv() function.
import pandas as pd
# Loading a CSV file
df = pd.read_csv('filename.csv')
We can also specify various parameters such as delimiter, encoding, and handling
missing values:
df = pd.read_csv('filename.csv', delimiter=',', encoding='utf-8',
na_values=['N/A', 'NA'])
In Pandas, when reading files like CSVs, you often use encoding='utf-8' to specify the character encoding of the
file you're reading. UTF-8 is one of the most widely used encodings for representing text in computers and is
compatible with most characters from different languages, symbols, and emojis.
Reasons for Using encoding='utf-8' in Pandas:
1. Handle Non-ASCII Characters:
o Many text files contain characters outside the standard ASCII range (e.g., accented letters,
symbols, non-Latin alphabets). UTF-8 can handle a wide variety of characters from different
languages.
2. Avoid Encoding Errors:
o If the file contains non-UTF-8 characters and you don't specify the encoding, you might get
errors like UnicodeDecodeError. Specifying encoding='utf-8' ensures that Pandas knows how
to properly interpret the characters in the file.
3. Standard and Universal:
o UTF-8 is the default encoding for many web pages, applications, and databases, so it's a
common practice to use it when importing files. This makes your code more portable and
compatible across different systems and applications.
4. Prevent Misinterpretation of Data:
o If the wrong encoding is used (e.g., ASCII or ISO-8859-1), some characters may not be interpreted correctly, leading to data corruption or unexpected symbols. By explicitly using utf-8, you ensure proper interpretation of characters.
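If you are unsure of a file's encoding, a common defensive pattern is to try UTF-8 first and fall back to a more permissive encoding. A minimal sketch (the filename is hypothetical):
import pandas as pd

try:
    df = pd.read_csv('filename.csv', encoding='utf-8')
except UnicodeDecodeError:
    # latin-1 maps every possible byte to a character, so it never fails,
    # though non-Latin text may be decoded incorrectly
    df = pd.read_csv('filename.csv', encoding='latin-1')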

Why Use delimiter=","?


1. Comma-Separated Values: The default format for a CSV file uses commas to separate columns. If you
are working with a standard CSV file, you might not need to explicitly use delimiter=",", as it's the default
separator for pd.read_csv().
2. Handling Other Delimiters: Sometimes, files might not use a comma. In such cases, you would specify
the correct delimiter. For example:
o Tab-delimited files: delimiter="\t"
o Semicolon-separated files: delimiter=";"
3. Correct Parsing: Specifying the delimiter ensures that Pandas correctly parses the columns in the file.
If the file uses a comma as a delimiter, specifying delimiter="," guarantees that Pandas splits the data at
the commas.
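For example, assuming hypothetical files data.tsv and data_semicolon.csv:
# Tab-delimited file
df_tab = pd.read_csv('data.tsv', delimiter='\t')

# Semicolon-separated file
df_semi = pd.read_csv('data_semicolon.csv', delimiter=';')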
Reasons for using na_values=['NA', 'N/A']:
1. Custom Missing Data Indicators: Different datasets use different representations for missing values.
Some use NA, N/A, or other placeholders to indicate missing data. By using the na_values parameter,
you can tell Pandas to interpret these specific strings as NaN.
2. Ensure Consistent Missing Data Handling: Inconsistent missing data formats can lead to incorrect data
analysis. Using na_values ensures that all specified missing values are treated consistently as NaN,
regardless of how they are represented in the original file.
3. Cleaner Data: By identifying specific placeholders like 'NA' or 'N/A' and converting them to NaN, you
make it easier to handle missing values later in your analysis, such as when imputing missing data or
filtering out incomplete rows.
Example:
When you import a CSV file with the na_values parameter, it automatically converts the specified values to NaN.
import pandas as pd
# Example CSV file data:
# Name, Age, City
# Ram, 25, NA
# Shyam, N/A, London
# Ajay, 30, N/A
# Loading the CSV with custom NA values


df = pd.read_csv('data.csv', na_values=['NA', 'N/A'])
print(df)

# Output:
# Name Age City
# 0 Ram 25.0 NaN
# 1 Shyam NaN London
# 2 Ajay 30.0 NaN
Benefits:
• Flexibility: Handle multiple formats of missing data in different datasets.
• Data Cleaning: Convert all recognized missing values to a standard format (NaN).
• Ease of Processing: Allows seamless use of Pandas' built-in functionality for dealing with missing data
(like .fillna() or .dropna()).
Thus, using na_values helps ensure that data is correctly interpreted and missing values are properly handled
during import.
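Continuing the example above, a short sketch of those built-in tools applied to the imported DataFrame:
# Drop any row that contains a NaN
print(df.dropna())

# Or fill NaN values column by column
print(df.fillna({'Age': 0, 'City': 'Unknown'}))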

2. Loading Excel Files


To read Excel files, Pandas provides the read_excel() function. You can read
individual sheets or multiple sheets from the Excel workbook.
# Loading a single sheet from an Excel file
df = pd.read_excel('filename.xlsx', sheet_name='Sheet1')

# Loading multiple sheets


df_dict = pd.read_excel('filename.xlsx', sheet_name=['Sheet1', 'Sheet2'])
We may need to install openpyxl for handling .xlsx files:
pip install openpyxl
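You can also load every sheet at once by passing sheet_name=None, which returns a dictionary mapping sheet names to DataFrames:
# Loading all sheets as a dict of DataFrames
all_sheets = pd.read_excel('filename.xlsx', sheet_name=None)
df_sheet1 = all_sheets['Sheet1'] # access one sheet by its name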
3. Loading Data from a SQL Database
To load data from an SQL database, Pandas uses the read_sql() function, which
requires an active database connection (using libraries like sqlite3, sqlalchemy,
or pyodbc).
First, set up a connection to your SQL database:
import pandas as pd
import sqlite3
# Connecting to a SQLite database
conn = sqlite3.connect('database_name.db')
# Loading data from a SQL table
df = pd.read_sql('SELECT * FROM table_name', conn)
Alternatively, with SQLAlchemy (for more complex databases like MySQL,
PostgreSQL):
from sqlalchemy import create_engine
# Create a connection engine
engine = create_engine('mysql://username:password@host:port/database_name')
# Loading data from SQL using SQLAlchemy


df = pd.read_sql('SELECT * FROM table_name', engine)
These methods make it easy to load data from different formats into Pandas
DataFrames for further analysis and manipulation.
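read_sql() also accepts a params argument, which is safer than building the SQL string by hand. A small sketch, assuming the hypothetical table has an age column:
# Parameterized query with sqlite3 ('?' is SQLite's placeholder style)
df = pd.read_sql('SELECT * FROM table_name WHERE age > ?', conn, params=(25,))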

Difference between CSV and Excel Files

CSV (Comma-Separated Values) and Excel files (typically .xls or .xlsx formats) are both popular formats for storing tabular data, but they have significant differences in structure, functionality, and usability. Here's a detailed comparison:
1. File Format
• CSV:
o A plain text file format that uses commas to separate values (though other delimiters like tabs
can also be used).
o Each line represents a row, and each value in the row is separated by a comma.
o File extension: .csv.
• Excel:
o A binary or XML-based file format used by Microsoft Excel to store data.
o Supports multiple sheets within a single file, allowing for more complex data organization.
o File extensions: .xls (older format) and .xlsx (newer format).
2. Data Structure
• CSV:
o Flat structure with a single table of data (no support for multiple sheets).
o Only supports text and numerical data; no built-in support for data types, formulas, or
formatting.
• Excel:
o Can contain multiple worksheets (tabs) within a single file.
o Supports various data types, including text, numbers, dates, and more complex objects.
o Allows for rich formatting (font styles, colors, borders, etc.), charts, graphs, and formulas.
3. Usability
• CSV:
o Simple and lightweight, making it easy to read and edit with any text editor.
o Ideal for simple data storage and exchange between applications.
o Limited in terms of features for data manipulation and visualization.
• Excel:
o User-friendly interface with advanced features for data analysis and visualization.
o Ideal for more complex datasets requiring calculations, charts, and formatting.
o Supports functionalities like pivot tables, data validation, and filtering.
4. Size and Performance
• CSV:
o Generally smaller in size compared to Excel files, as it contains only raw data without additional
formatting.
o Faster to load and process due to its simplicity.
• Excel:
o Typically larger in size because of additional features, formatting, and potential embedded
objects.
o Performance can degrade with very large datasets or complex calculations.
5. Interoperability
• CSV:
o Universally compatible with virtually any data processing software, programming languages,
and databases.
o Ideal for data exchange between different platforms and applications.
• Excel:
o Primarily designed for use with Microsoft Excel but can be opened by other spreadsheet
applications (e.g., Google Sheets, LibreOffice Calc).
o May require additional libraries or software to read or manipulate in programming languages.


6. Data Integrity
• CSV:
o Lack of data validation means potential issues with data integrity (e.g., inconsistent formatting).
o Doesn’t store metadata (like column data types or constraints).
• Excel:
o Can include data validation features to restrict the type of data entered in cells.
o Stores metadata about the data, such as formatting and formulas, which can help maintain data
integrity.
7. Editing and Collaboration
• CSV:
o Easier to collaborate on in terms of version control, as it’s just a text file.
o However, editing can be less intuitive for non-technical users.
• Excel:
o Provides built-in collaboration tools (e.g., track changes, comments) for multiple users.
o More suitable for users who require a visual interface for data manipulation.
Summary Table
Feature            CSV                  Excel
File Format        Plain text           Binary/XML
Data Structure     Single table         Multiple sheets
Usability          Simple               Advanced features
Size               Generally smaller    Generally larger
Interoperability   Highly compatible    Mostly Excel-based
Data Integrity     Basic                Enhanced validation
Editing            Text editor          Spreadsheet software
Conclusion
The choice between CSV and Excel depends on the specific requirements of the task at hand. CSV is best for
simple, flat data storage and transfer, while Excel is ideal for more complex datasets requiring advanced
functionalities and formatting.
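In pandas, converting between the two formats is straightforward, which is one way to combine their strengths. A minimal sketch (filenames are hypothetical; writing .xlsx requires openpyxl):
import pandas as pd

# CSV to Excel
df = pd.read_csv('data.csv')
df.to_excel('data.xlsx', sheet_name='Sheet1', index=False)

# Excel back to CSV
pd.read_excel('data.xlsx').to_csv('data_copy.csv', index=False)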

Pandas data structures:


Pandas provides two primary data structures: Series and DataFrame. Each
is designed to handle different types of data efficiently and intuitively. Below is
a detailed explanation of both, with examples.
1. Pandas Series
A Series is a one-dimensional labeled array capable of holding any data type,
such as integers, strings, floats, or even Python objects. It is like a column in a
table or a one-dimensional array with labels.
Creating a Series
You can create a Series using various data types like lists, dictionaries, or NumPy
arrays.
import pandas as pd

# Creating a Series from a list


s = pd.Series([10, 20, 30, 40, 50])
print(s)

# Output
# 0 10
# 1 20
# 2 30
# 3 40
# 4 50
# dtype: int64
Custom Indexing in Series
You can assign custom indices to the Series for better readability.

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])


print(s)

# Output
# a 10
# b 20
# c 30
# dtype: int64
Series from a Dictionary
If you create a Series from a dictionary, the keys become the indices, and the
values become the data.

data = {'a': 100, 'b': 200, 'c': 300}


s = pd.Series(data)
print(s)

# Output
# a 100
# b 200
# c 300
# dtype: int64
Key Features of Series:


• Homogeneous data: a Series has a single dtype; mixing types falls back to the generic object dtype.
• Supports indexing (custom or default numeric indices).
• Can handle missing data (NaN).
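
For instance, missing entries appear as NaN and are skipped by most Series computations:
import numpy as np

s = pd.Series([10, np.nan, 30], index=['a', 'b', 'c'])
print(s.isna()) # True only for 'b'
print(s.sum()) # 40.0 -- NaN is skipped by default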

2. Pandas DataFrame
A DataFrame is a two-dimensional labeled data structure that can hold data of
different types (integer, float, string, etc.) in columns. Think of it as a table (like
an Excel sheet or SQL table) where rows and columns are both labeled.
Creating a DataFrame
You can create a DataFrame from various data structures like lists of lists,
dictionaries, or NumPy arrays.

# Creating a DataFrame from a dictionary


data = {
'Name': ['Ram', 'Shyam', 'Ajay'],
'Age': [23, 25, 22],
'City': ['New York', 'Paris', 'London']
}
df = pd.DataFrame(data)
print(df)

# Output
# Name Age City
# 0 Ram 23 New York
# 1 Shyam 25 Paris
# 2 Ajay 22 London
Custom Index in DataFrame
You can assign custom row and column labels to make the DataFrame more
descriptive.

df = pd.DataFrame(data, index=['Row1', 'Row2', 'Row3'])


print(df)

# Output
# Name Age City
# Row1 Ram 23 New York
# Row2 Shyam 25 Paris
# Row3 Ajay 22 London
Creating a DataFrame from a List of Lists


You can create a DataFrame from a list of lists (each inner list represents a row).

data = [[23, 'Ram', 'New York'], [25, 'Shyam', 'Paris'], [22, 'Ajay', 'London']]
df = pd.DataFrame(data, columns=['Age', 'Name', 'City'])
print(df)

# Output
# Age Name City
# 0 23 Ram New York
# 1 25 Shyam Paris
# 2 22 Ajay London
DataFrame from a Dictionary of Series
You can also create a DataFrame using a dictionary of Pandas Series.

s1 = pd.Series([23, 25, 22], index=['Ram', 'Shyam', 'Ajay'])


s2 = pd.Series(['New York', 'Paris', 'London'], index=['Ram', 'Shyam', 'Ajay'])

df = pd.DataFrame({'Age': s1, 'City': s2})


print(df)

# Output
# Age City
# Ram 23 New York
# Shyam 25 Paris
# Ajay 22 London
Key Features of DataFrame:
• Heterogeneous data: Different columns can have different data types (int,
float, string, etc.).
• Size-mutable: You can add or remove rows and columns dynamically.
• Supports indexing by rows and columns (using labels or integers).
• Can handle missing data (NaN).
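
A quick illustration of size-mutability, using the df built above:
# Adding a new column
df['Country'] = ['USA', 'France', 'UK']
print(df)

# Removing it again
df = df.drop(columns=['Country'])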
Key Differences between Series and DataFrame:
• Series is one-dimensional, while DataFrame is two-dimensional.
• Series can be thought of as a column, whereas a DataFrame is a collection
of multiple columns (with potentially different data types) organized into
rows and columns.
Example of DataFrame Manipulation:
You can access, filter, and manipulate the data in a DataFrame like so:
# Accessing a column
print(df['Age'])
# Accessing rows by index
print(df.loc['Ram'])

# Filtering rows based on a condition


print(df[df['Age'] > 23])
These powerful data structures are the foundation of most data manipulation tasks
in Pandas. They are designed to be fast, flexible, and intuitive for handling
structured data.

The Index Object


The Index object in Pandas is an immutable array that holds the labels for
rows or columns of a Pandas Series or DataFrame. It plays a critical role in data
alignment and selection, allowing for efficient data access and manipulation.
Every Pandas object (Series or DataFrame) has an associated Index, which can
be customized or default (i.e., simple integer values starting from 0).
Key Features of the Index Object:
1. Immutable: The Index object is immutable, meaning its values cannot be
changed once the object is created. This ensures data integrity during
operations.
2. Supports Heterogeneous Data Types: Index can hold various types of
data, including strings, integers, datetime objects, and more.
3. Allows for Fast Lookups: Indexes are optimized for quick access to rows
and columns, enabling fast lookups and alignment in operations like
merging and joining.
4. Labels for Data: Indexes provide labels to Series and DataFrame rows or
columns, making data more accessible and meaningful. You can use these
labels to reference data rather than relying on integer-based positions.
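
Immutability can be demonstrated directly: assigning to an element of an Index raises a TypeError:
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
try:
    s.index[0] = 'z' # not allowed: Index is immutable
except TypeError as e:
    print("Error:", e)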
Types of Indexes
• Default Index: When you don't provide an explicit index, Pandas assigns
a default index that consists of integers starting from 0.
• Custom Index: You can assign custom index labels, such as strings, dates,
or a combination of data types.
• MultiIndex (Hierarchical Index): This is an advanced index type that
allows multiple levels of indexing, which is useful for handling multi-
dimensional data.
Example of the Index Object in Pandas


1. Default Index
When creating a Pandas Series or DataFrame, if no index is specified, Pandas
assigns a default index starting at 0.

import pandas as pd

# Creating a Series with a default integer index


s = pd.Series([100, 200, 300])
print(s)

# Output
# 0 100
# 1 200
# 2 300
# dtype: int64
2. Custom Index
You can define custom index labels when creating a Series or DataFrame.

# Creating a Series with a custom index


s = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
print(s)

# Output
# a 100
# b 200
# c 300
# dtype: int64
3. Accessing the Index Object
You can access the index of a Series or DataFrame using the .index attribute.

# Accessing the index


print(s.index)

# Output
# Index(['a', 'b', 'c'], dtype='object')
4. Indexing in DataFrame
The index in a DataFrame refers to the row labels, while the column labels are
referred to as columns.

# Creating a DataFrame with custom row index and column names


data = {
'Age': [23, 25, 22],
'City': ['New York', 'Paris', 'London']
}
df = pd.DataFrame(data, index=['Ram', 'Shyam', 'Ajay'])
print(df)

# Output
# Age City
# Ram 23 New York
# Shyam 25 Paris
# Ajay 22 London
You can access both row indices and column names:
# Accessing row index
print(df.index)
# Accessing column names
print(df.columns)
5. MultiIndex (Hierarchical Index)
A MultiIndex allows for multi-level indexing, which is helpful when you have
more complex data.
arrays = [
['A', 'A', 'B', 'B'],
['one', 'two', 'one', 'two']
]
multi_index = pd.MultiIndex.from_arrays(arrays, names=('Upper', 'Lower'))
# Creating a DataFrame with MultiIndex
df = pd.DataFrame({'Value': [10, 20, 30, 40]}, index=multi_index)
print(df)
# Output
#              Value
# Upper Lower
# A     one       10
#       two       20
# B     one       30
#       two       40

6. Setting and Resetting Index


You can set and reset the index of a DataFrame using set_index() and
reset_index().

# Setting a new index


df = pd.DataFrame({
'Name': ['Shyam', 'Ajay', 'Charlie'],
'Age': [25, 30, 35],
'City': ['Paris', 'London', 'New York']
})
df.set_index('Name', inplace=True)
print(df)

# Output
# Age City
# Name
# Shyam 25 Paris
# Ajay 30 London
# Charlie 35 New York

# Resetting the index back to default


df.reset_index(inplace=True)
print(df)
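
After resetting, 'Name' becomes an ordinary column again and the default integer index returns:
# Output
# Name Age City
# 0 Shyam 25 Paris
# 1 Ajay 30 London
# 2 Charlie 35 New York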

Key Index Operations:


1. Reindexing: You can change or reassign index labels using the reindex()
method.
# Reindexing a Series
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s_reindexed = s.reindex(['c', 'b', 'a', 'd'], fill_value=0)
print(s_reindexed)

# Output
# c    3
# b    2
# a    1
# d    0
# dtype: int64
2. Sorting: You can sort a DataFrame or Series by its index using
sort_index().

df_sorted = df.sort_index()
print(df_sorted)
3. Checking for Uniqueness: You can check whether an index is unique
using .is_unique.

print(df.index.is_unique)
4. Duplicate Index Values: Duplicates in the index can sometimes occur, and you
can use duplicated() to check for them.

print(df.index.duplicated())
Conclusion
The Index object is fundamental in Pandas for efficient data manipulation,
alignment, and access. It acts as a label or identifier for the data, allowing easy
selection, reindexing, and slicing of data. Whether you're working with simple or
hierarchical data, Pandas' Index provides a flexible and powerful tool to manage
data effectively.

Arithmetic and Data Alignment


In Python, arithmetic and data alignment often refers to performing
mathematical operations on data structures (such as pandas Series or DataFrame),
where the alignment of data between different elements (like indexes or labels)
plays a crucial role in how the arithmetic operations are performed.
Let’s break down both concepts:
1. Arithmetic in Python
Arithmetic operations involve basic mathematical functions like addition,
subtraction, multiplication, division, etc. These can be performed on individual
variables or on elements of data structures like lists, tuples, arrays, or pandas
objects.
Examples of arithmetic operations in Python:
# Arithmetic with integers and floats
a = 10
b = 3
print(a + b) # Output: 13
print(a - b) # Output: 7
print(a * b) # Output: 30
print(a / b) # Output: 3.333...

# Arithmetic with lists (element-wise addition using list comprehension)


list1 = [1, 2, 3]
list2 = [4, 5, 6]
result = [x + y for x, y in zip(list1, list2)] # Output: [5, 7, 9]
2. Data Alignment in Python (Pandas)
When working with pandas data structures such as Series and DataFrame,
data alignment is crucial when performing arithmetic operations. Pandas aligns
the data on index or labels before performing element-wise arithmetic. If the
indexes don’t match, it fills the unmatched elements with NaN (Not a Number).
Example 1: Arithmetic with Pandas Series
import pandas as pd

# Two pandas Series with different indexes


s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

# Element-wise addition (automatically aligns by index)


result = s1 + s2
print(result)
# Output:
# a NaN # 'a' is not in s2, so it returns NaN
# b 6.0 # 2 + 4
# c 8.0 # 3 + 5
# d NaN # 'd' is not in s1, so it returns NaN
In this example, pandas aligns the data based on the index labels ('a', 'b', 'c', 'd'),
and fills in NaN for missing values where the indexes don’t match.
Example 2: Arithmetic with Pandas DataFrame
When performing operations between DataFrame objects, pandas aligns
data by both the row index and column labels.
df1 = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({
'A': [7, 8, 9],
'C': [10, 11, 12]
}, index=['b', 'c', 'd'])

# Element-wise addition (aligns by row index and column labels)


result = df1 + df2
print(result)
# Output:
#       A   B   C
# a   NaN NaN NaN   # no row 'a' in df2
# b   9.0 NaN NaN   # 2 + 7 for 'A'; 'B' is missing in df2, 'C' in df1
# c  11.0 NaN NaN   # 3 + 8 for 'A'
# d   NaN NaN NaN   # no row 'd' in df1
In the above example:
• Columns ‘A’ are aligned and arithmetic is performed on the values.
• Since df2 has a column ‘C’ which doesn’t exist in df1, and df1 has a
column ‘B’ which doesn’t exist in df2, those result in NaN.
Handling Missing Values:
If you want to fill the NaN values with a specific value during arithmetic
operations, you can use the fill_value parameter in functions like add(), sub(), etc.
# Adding two DataFrames with fill_value
result = df1.add(df2, fill_value=0)
print(result)
# Output:
#       A    B     C
# a   1.0  4.0   NaN
# b   9.0  5.0  10.0
# c  11.0  6.0  11.0
# d   9.0  NaN  12.0
Note that entries missing from both DataFrames (such as column 'C' at row 'a', or 'B' at row 'd') remain NaN even with fill_value; the fill only applies when one of the two operands has a value.
Key Points:
• Arithmetic in Python involves basic math operations, but when used with
complex data structures (like Series or DataFrames), it operates element-
wise.
• Data alignment in pandas ensures that arithmetic operations are
performed based on matching indexes or labels. If the indexes don’t align,
pandas fills in the gaps with NaN.
This alignment behavior is very useful for working with incomplete or
mismatched datasets, as pandas handles the underlying complexity of ensuring
the data is properly aligned before performing the calculations.
Operations between DataFrame and Series


In pandas, operations between a DataFrame and a Series are quite
common. These operations are usually performed row-wise (axis 1) or column-
wise (axis 0), depending on the alignment of indices and column labels.
Key Points for DataFrame and Series Operations:
1. Broadcasting: When you perform operations between a DataFrame and a
Series, pandas will attempt to broadcast the Series to match the
DataFrame's shape, aligning on the index (rows) or columns.
2. Alignment: Operations between a DataFrame and a Series are aligned by
labels. If the labels (index or column names) don't match, pandas will insert
NaN for missing data.
Let's look at different types of operations:
1. Row-wise Operations (Default)
By default, operations between a DataFrame and a Series are performed row-
wise (along each column). This happens when the index of the Series matches the
columns of the DataFrame.
Example: Subtracting a Series from each row of a DataFrame
import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]
})

# Creating a Series to subtract from DataFrame (broadcast row-wise)


series = pd.Series([1, 2, 3], index=['A', 'B', 'C'])

# Subtract the Series from each row (Series index aligns with DataFrame columns)


result = df - series

print(result)
Output:
A B C
0 0 2 4
1 1 3 5
2 2 4 6
2. Column-wise Operations
To perform operations column-wise, the index of the Series must align with the
DataFrame’s index (rows). You can explicitly specify the operation to be
performed along the columns using axis=0.
Example: Adding a Series to each column of a DataFrame
# Series with index matching the DataFrame's index (row-wise alignment)
series_row = pd.Series([1, 2, 3])

# Adding the Series to each column (element-wise)


result = df.add(series_row, axis=0)

print(result)
Output:
A B C
0 2 5 8
1 4 7 10
2 6 9 12
3. Operations with Non-Matching Indices/Columns
If the Series and DataFrame have non-matching indices or columns, pandas will
align based on the labels and fill in NaN where data is missing.
Example: Adding a Series with non-matching columns
# Series with different columns
series_diff = pd.Series([10, 20], index=['A', 'D']) # 'D' is not in DataFrame

# Adding Series to DataFrame (results in NaN for missing 'D')


result = df + series_diff

print(result)
Output:
A B C D
0 11.0 NaN NaN NaN
1 12.0 NaN NaN NaN
2 13.0 NaN NaN NaN
4. Other Operations (Multiplication, Division, etc.)
You can perform other element-wise operations between a DataFrame and a
Series, such as multiplication, division, modulo, etc.
Example: Multiplying a DataFrame by a Series
# Multiplying a Series to each column of the DataFrame
result = df * series
print(result)
Output:
A B C
0 1 8 21
1 2 10 24
2 3 12 27
5. Using Functions like apply
You can also use the apply() function to perform more complex operations by
applying a function to each row or column.
Example: Applying a function row-wise
# Applying a lambda function row-wise
result = df.apply(lambda row: row + series, axis=1)

print(result)
Summary of Key Operations:
• Arithmetic: +, -, *, /, etc. will broadcast and align based on indices or
columns.
• Alignment: Missing labels will result in NaN.
• Control axis: for arithmetic methods such as add(), axis=0 aligns the Series with the DataFrame's row index (the default aligns on columns); for apply(), axis=0 applies the function to each column and axis=1 to each row.
Function by Element and Function by Row/Column


Pandas supports element-wise operations using applymap() and row/column-wise operations using apply().
1. Element-wise Operation: applymap()
Scenario: Suppose we have a DataFrame with numeric values, and we want to
square each element in the DataFrame.
Example:
import pandas as pd
# Creating a sample DataFrame
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]
})

# Applying a function to square each element in the DataFrame


df_squared = df.applymap(lambda x: x ** 2)

print("Original DataFrame:\n", df)


print("\nSquared DataFrame (element-wise):\n", df_squared)
Output:
Original DataFrame:
A B C
0 1 4 7
1 2 5 8
2 3 6 9

Squared DataFrame (element-wise):


A B C
0 1 16 49
1 4 25 64
2 9 36 81
Explanation:
• applymap(lambda x: x ** 2) applies the squaring function to each
individual element in the DataFrame, transforming the entire DataFrame
element-wise.
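
Note: in recent pandas releases (2.1 and later), applymap() is deprecated in favour of the equivalent DataFrame.map(); on a newer version the same operation is written as:
# Equivalent element-wise call on pandas >= 2.1
df_squared = df.map(lambda x: x ** 2)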
2. Column-wise Operation: apply()


Scenario: We want to calculate the sum of values in each column.
Example:
# Applying a sum function to each column
column_sum = df.apply(sum, axis=0)
print("\nSum of each column:\n", column_sum)
Output:
Sum of each column:
A 6
B 15
C 24
dtype: int64
Explanation:
• The apply() function with axis=0 (default) applies the sum function down
each column, calculating the sum of values in each column.

3. Row-wise Operation: apply()


Scenario: We want to calculate the sum of values in each row.
Example:
# Applying a sum function to each row
row_sum = df.apply(sum, axis=1)

print("\nSum of each row:\n", row_sum)


Output:
Sum of each row:
0 12
1 15
2 18
dtype: int64
Explanation:
• The apply() function with axis=1 applies the sum function across each row,
calculating the sum of values in each row.

4. Applying Custom Function by Row


Scenario: Suppose we want to apply a custom function that calculates the range
(difference between max and min) for each row.
Example:
# Applying a custom function to calculate range (max - min) for each row
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)
print("\nRange (max - min) for each row:\n", row_range)


Output:
Range (max - min) for each row:
0 6
1 6
2 6
dtype: int64
Explanation:
• The custom lambda function lambda x: x.max() - x.min() calculates the
range of values in each row, with axis=1 ensuring it is applied row-wise.

5. Applying Custom Function by Column


Scenario: Now, let's apply a similar range function, but this time by column.
Example:
# Applying a custom function to calculate range (max - min) for each column
column_range = df.apply(lambda x: x.max() - x.min(), axis=0)

print("\nRange (max - min) for each column:\n", column_range)


Output:
Range (max - min) for each column:
A 2
B 2
C 2
dtype: int64
Explanation:
• The lambda function computes the range for each column, with axis=0
ensuring it is applied column-wise.

Summary of Key Differences:


• applymap(): Applies element-wise transformations. Ideal for modifying
each individual element in the DataFrame.
• apply(): Applies row-wise or column-wise transformations depending on
the axis parameter. Use this when you need to compute or transform across
a full row or column.
This flexibility in Pandas allows you to perform a wide variety of data
manipulations, whether on individual elements or entire rows/columns.
Statistics Functions:
In data analytics, statistical functions are essential to describe, summarize,
and interpret data. They allow analysts to identify patterns, trends, and
relationships in datasets. Here’s a comprehensive list of the most commonly
used statistical functions in Python, along with examples.
1. Measures of Central Tendency
Mean (Average)
The mean is the average of all values in the dataset.
import numpy as np
data = [10, 15, 20, 25, 30]
mean = np.mean(data)
print("Mean:", mean)
Output:
Mean: 20.0
Median
The median is the middle value in an ordered dataset. It’s useful for skewed data,
as it isn’t affected by extreme values.
median = np.median(data)
print("Median:", median)
Output:
Median: 20.0
Mode
The mode is the most frequently occurring value. It’s especially useful for
categorical data.
from scipy import stats
data = [10, 15, 15, 20, 25, 25, 25, 30]
mode = stats.mode(data)
print("Mode:", mode.mode)  # SciPy >= 1.11 returns a scalar; older versions return a one-element array (use mode.mode[0] there)
Output:
Mode: 25
2. Measures of Dispersion
Variance
Variance measures the spread of data points around the mean. A high variance
means the data is more spread out.
variance = np.var(data)  # uses the 8-value data list defined above; np.var computes the population variance (ddof=0)
print("Variance:", variance)
Output:
Variance: 40.234375
Standard Deviation
The standard deviation is the square root of the variance, showing the average
distance from the mean.
std_dev = np.std(data)
print("Standard Deviation:", std_dev)
Output:
Standard Deviation: 6.3430572 (approx.)
Range
The range is the difference between the maximum and minimum values.
range_value = np.max(data) - np.min(data)
print("Range:", range_value)
Output:
Range: 20
3. Measures of Position
Percentiles
Percentiles divide the data into 100 equal parts. For instance, the 90th percentile
is the value below which 90% of the data lies.
percentile_90 = np.percentile(data, 90)
print("90th Percentile:", percentile_90)
Output:
90th Percentile: 26.5
Quartiles
Quartiles divide the data into four equal parts. The 25th percentile is the first
quartile (Q1), the 50th percentile is the median (Q2), and the 75th percentile is
the third quartile (Q3).
q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50) # This is the median
q3 = np.percentile(data, 75)
print("Q1:", q1, "Q2 (Median):", q2, "Q3:", q3)
Output:
Q1: 15.0 Q2 (Median): 22.5 Q3: 25.0
Interquartile Range (IQR)
The IQR is the range between the first and third quartiles. It helps detect outliers.
iqr = q3 - q1
print("Interquartile Range (IQR):", iqr)
Output:
Interquartile Range (IQR): 10.0
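A common rule of thumb treats values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as potential outliers; a short sketch using the quartiles computed above:
lower_bound = q1 - 1.5 * iqr # 15.0 - 15.0 = 0.0
upper_bound = q3 + 1.5 * iqr # 25.0 + 15.0 = 40.0
outliers = [x for x in data if x < lower_bound or x > upper_bound]
print("Outliers:", outliers) # [] -- every value lies within the bounds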
4. Measures of Shape
Skewness
Skewness measures the asymmetry of the distribution. Positive skew indicates a right-skewed distribution, while negative skew indicates a left-skewed distribution.
skewness = stats.skew(data)
print("Skewness:", skewness)
Kurtosis
Kurtosis measures the "tailedness" of the distribution. High kurtosis indicates
heavy tails, while low kurtosis indicates light tails.
kurtosis = stats.kurtosis(data)
print("Kurtosis:", kurtosis)
5. Measures of Association
Correlation
Correlation measures the linear relationship between two variables. Values close
to 1 or -1 indicate a strong relationship.
data1 = [10, 20, 30, 40, 50]
data2 = [15, 25, 35, 45, 55]
correlation = np.corrcoef(data1, data2)
print("Correlation Matrix:\n", correlation)
Output:
Correlation Matrix:
[[1. 1.]
[1. 1.]]
Covariance
Covariance measures how two variables vary together. Positive values indicate
that as one variable increases, the other tends to increase as well.
covariance = np.cov(data1, data2)
print("Covariance Matrix:\n", covariance)
Output:
Covariance Matrix:
[[250. 250.]
[250. 250.]]
6. Ranking and Sorting
Sorting
Sorting arranges the data in ascending or descending order, helpful for quickly
finding minimum or maximum values.
sorted_data_asc = np.sort(data)
sorted_data_desc = np.sort(data)[::-1]
print("Ascending Order:", sorted_data_asc)
print("Descending Order:", sorted_data_desc)
Ranking
Ranking assigns a rank to each value based on its order. Ties can be handled by
averaging ranks or assigning ranks based on the order they appear.
import pandas as pd
rank_ascending = pd.Series(data).rank() # Ascending rank
rank_descending = pd.Series(data).rank(ascending=False) # Descending rank
print("Ascending Rank:\n", rank_ascending)
print("Descending Rank:\n", rank_descending)
7. Z-Scores
Z-scores indicate how many standard deviations a data point is from the mean.
It’s used to identify outliers.
z_scores = stats.zscore(data)
print("Z-Scores:", z_scores)
8. Probability Distributions
Normal Distribution
The normal distribution is a symmetrical, bell-shaped distribution. You can
generate it in Python with:
import matplotlib.pyplot as plt
data = np.random.normal(0, 1, 1000)
plt.hist(data, bins=30, alpha=0.5)
plt.show()
Binomial Distribution
The binomial distribution represents the number of successes in a fixed number
of trials.
from scipy.stats import binom
n, p = 10, 0.5 # 10 trials, 50% success probability
binom_data = binom.rvs(n, p, size=1000)
plt.hist(binom_data, bins=30, alpha=0.5)
plt.show()
Summary
This comprehensive set of statistical functions allows data analysts to gain
insights, make comparisons, detect outliers, and understand relationships within
data. These functions are essential for exploring data and forming conclusions
based on patterns and trends.
