Chapter Notes - Data Handling Using Pandas DataFrame
1. Pandas DataFrame
Definition: A Pandas DataFrame is a two-dimensional, tabular data structure provided by the Pandas library in
Python. It is designed for efficient data manipulation and analysis, resembling a spreadsheet or a database table.
Key Features:
Rows and Columns: Each row represents a record, and each column represents a variable or feature.
Labeled Axes: Rows and columns can have labels (e.g., names or numbers).
Heterogeneous Data: Can store different data types (e.g., integers, floats, strings) in different columns.
Flexible Operations: Supports filtering, grouping, merging, and statistical computations.
Why Use Pandas?: Simplifies complex data tasks like cleaning, transforming, and analyzing large datasets.
Importing Pandas: Always import Pandas in your Python script to use its features.
import pandas as pd
Explanation:
import pandas: Imports the Pandas library.
as pd: Assigns the alias pd for convenience, so we write pd instead of pandas in our code.
1. From a Dictionary:
import pandas as pd
data = {
'Name': ['Amit', 'Neha', 'Ravi'],
'Age': [20, 19, 21],
'Marks': [85, 90, 88]
}
df = pd.DataFrame(data)
print(df)
Line-by-Line Explanation:
Line 1: import pandas as pd – Imports the Pandas library with alias pd.
Lines 2-6: data = {...} – Creates a dictionary where keys are column names (Name, Age,
Marks) and values are lists containing the data for each column.
Line 7: df = pd.DataFrame(data) – Converts the dictionary into a DataFrame, where
each key becomes a column and each list becomes the column’s data.
Line 8: print(df) – Displays the DataFrame.
Output:
The output is a table with 3 rows and 3 columns, with default row indices (0, 1, 2).
2. From a List of Lists:
data = [['Amit', 20, 85], ['Neha', 19, 90], ['Ravi', 21, 88]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Marks'])
print(df)
Line-by-Line Explanation:
Line 1: data = [[...]] – Creates a list of lists, where each inner list represents a row of
data.
Line 2: df = pd.DataFrame(data, columns=['Name', 'Age', 'Marks']) –
Converts the list into a DataFrame and assigns column names using the columns
parameter.
Line 3: print(df) – Displays the DataFrame.
Output:
   Name  Age  Marks
0  Amit   20     85
1  Neha   19     90
2  Ravi   21     88
3. From a List of Dictionaries:
data = [
{'Name': 'Amit', 'Age': 20, 'Marks': 85},
{'Name': 'Neha', 'Age': 19, 'Marks': 90},
{'Name': 'Ravi', 'Age': 21, 'Marks': 88}
]
df = pd.DataFrame(data)
print(df)
Line-by-Line Explanation:
Lines 1-5: data = [...] – Creates a list where each element is a dictionary representing a
row, with keys as column names and values as data.
Line 6: df = pd.DataFrame(data) – Converts the list of dictionaries into a DataFrame.
Line 7: print(df) – Displays the DataFrame.
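Output:
The same table as in the previous example: 3 rows and 3 columns (Name, Age, Marks) with default row indices (0, 1, 2).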
Key Points:
The columns parameter is optional when column names are already defined (e.g., in dictionaries).
The index parameter can be used to set custom row labels (e.g., index=['a', 'b', 'c']).
Always ensure the data structure is consistent (e.g., same number of elements in each list).
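The key points above can be tried out with a short sketch. The row labels 'a', 'b', 'c' come from the example in the notes; the mismatched dictionary bad_data is a made-up illustration of inconsistent input, not part of the original material.

```python
import pandas as pd

data = {
    'Name': ['Amit', 'Neha', 'Ravi'],
    'Age': [20, 19, 21],
    'Marks': [85, 90, 88]
}

# index= replaces the default 0, 1, 2 row labels with custom ones.
df = pd.DataFrame(data, index=['a', 'b', 'c'])
print(df)

# With a dictionary, columns= selects and reorders keys that already exist.
subset = pd.DataFrame(data, columns=['Marks', 'Name'])
print(subset)

# Hypothetical bad input: 'Age' has only two values for three names.
bad_data = {'Name': ['Amit', 'Neha', 'Ravi'], 'Age': [20, 19]}
try:
    pd.DataFrame(bad_data)
    error_message = None
except ValueError as exc:
    # Pandas refuses to build a DataFrame from lists of unequal length.
    error_message = str(exc)
print(error_message)
```

Catching the ValueError here is only for demonstration; in practice it is usually better to fix the input data before building the DataFrame.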
Reading a CSV File: Use the pd.read_csv() function to load a CSV file into a DataFrame.
import pandas as pd
df = pd.read_csv('students.csv')
print(df)
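Line-by-Line Explanation:
Line 1: import pandas as pd – Imports the Pandas library with alias pd.
Line 2: df = pd.read_csv('students.csv') – Reads students.csv and stores its contents in a DataFrame.
Line 3: print(df) – Displays the DataFrame.
Output:
A table with the columns from the file and default row indices (0, 1, 2).
Important Parameters:
index_col: Names the column to use as the row index instead of the default 0, 1, 2.
df = pd.read_csv('students.csv', index_col='Name')
print(df)
Explanation: index_col='Name' makes the 'Name' column the row index, so only 'Age' and 'Marks' remain as data columns.
Output:
      Age  Marks
Name
Amit   20     85
Neha   19     90
Ravi   21     88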
Note: Ensure the CSV file is in the same directory as your Python script, or provide the full file path (e.g.,
C:/data/students.csv).
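To experiment with read_csv without a file on disk, the sketch below feeds it an in-memory string. The CSV text is an assumption about what students.csv contains, based on the data used throughout these notes.

```python
import io
import pandas as pd

# Assumed contents of students.csv, matching the data used in these notes.
csv_text = """Name,Age,Marks
Amit,20,85
Neha,19,90
Ravi,21,88
"""

# read_csv accepts a file path or any file-like object.
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default)

# index_col='Name' turns the Name column into the row labels.
df_named = pd.read_csv(io.StringIO(csv_text), index_col='Name')
print(df_named)

# With Name as the index, a row can be looked up by student name.
print(df_named.loc['Neha', 'Marks'])
```

Replacing io.StringIO(csv_text) with a path such as 'students.csv' reads from a real file instead.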
4. Dimensions of a DataFrame
Definition: Dimensions refer to the number of rows and columns in a DataFrame, which helps understand its size
and structure.
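Attributes:
1. Shape: df.shape returns a tuple of the form (rows, columns).
print(df.shape)
Output:
(3, 3)
2. Size: df.size returns the total number of elements (rows × columns).
print(df.size)
Output:
9
3. Number of Dimensions: df.ndim returns the number of dimensions (always 2 for a DataFrame).
print(df.ndim)
Output:
2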
Use Case: Use these attributes to verify the DataFrame’s structure before performing operations like looping or
filtering.
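As a rough illustration of that use case, the sketch below unpacks shape and checks it before looping. The guard against an empty DataFrame is one possible pattern, not the only one.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Amit', 'Neha', 'Ravi'],
    'Age': [20, 19, 21],
    'Marks': [85, 90, 88]
})

rows, cols = df.shape          # shape is a (rows, columns) tuple
print(rows, cols)
print(df.size == rows * cols)  # size is always rows * columns
print(len(df) == rows)         # len(df) also gives the row count

# Only loop when there is at least one row to process.
if rows > 0:
    for position in range(rows):
        print(df.iloc[position]['Name'])
```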
Methods:
1. df.info(): Prints a summary of the DataFrame: index range, column names, non-null counts, data types, and
memory usage.
df.info()
Explanation: Provides a concise summary, useful for checking missing values and data types.
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    3 non-null      object
 1   Age     3 non-null      int64
 2   Marks   3 non-null      int64
dtypes: int64(2), object(1)
memory usage: 200.0+ bytes
Shows 3 columns, their names, non-null counts, and data types (object for strings, int64
for integers).
2. df.describe(): Generates descriptive statistics for numeric columns (count, mean, std, min, max,
quartiles).
df.describe()
Explanation: Summarizes numeric columns only, ignoring non-numeric ones like 'Name'.
Output:
        Age      Marks
count   3.0   3.000000
mean   20.0  87.666667
std     1.0   2.516611
min    19.0  85.000000
25%    19.5  86.500000
50%    20.0  88.000000
75%    20.5  89.000000
max    21.0  90.000000
Shows statistics like count, mean, standard deviation, and percentiles for 'Age' and 'Marks'.
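3. df.head(n): Returns the first n rows (5 if n is omitted).
df.head(2)
Output:
   Name  Age  Marks
0  Amit   20     85
1  Neha   19     90
4. df.tail(n): Returns the last n rows (5 if n is omitted).
df.tail(2)
Output:
   Name  Age  Marks
1  Neha   19     90
2  Ravi   21     88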
Use Case: Use these methods to inspect the DataFrame before performing complex operations.
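A quick inspection pass might look like the sketch below. The missing mark for Ravi is hypothetical data, added only to show how a gap appears in the summaries.

```python
import pandas as pd

# Hypothetical data: Ravi's mark is missing (None).
df = pd.DataFrame({
    'Name': ['Amit', 'Neha', 'Ravi'],
    'Age': [20, 19, 21],
    'Marks': [85, 90, None]
})

print(df.head(2))
print(df.dtypes)             # Marks is stored as float64 because of the missing value

missing = df.isna().sum()    # number of missing values in each column
print(missing)

summary = df.describe()      # the count row skips missing values
print(summary.loc['count'])
```

Comparing the count row of describe() with the number of rows is a simple way to spot columns that have gaps.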
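Methods:
1. Selecting Columns: A single column name returns a Series; a list of column names returns a DataFrame.
print(df['Name'])
Output:
0    Amit
1    Neha
2    Ravi
Name: Name, dtype: object
print(df[['Name', 'Marks']])
Output:
   Name  Marks
0  Amit     85
1  Neha     90
2  Ravi     88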
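2. Using loc (Label-based):
Example:
print(df.loc[0:1, 'Name':'Age'])
Explanation: Selects the rows labeled 0 through 1 and the columns 'Name' through 'Age'. With loc, both ends of a
slice are included.
Output:
   Name  Age
0  Amit   20
1  Neha   19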
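3. Using iloc (Position-based):
Example:
print(df.iloc[0:2, 0:2])
Explanation: Selects the rows at positions 0 and 1 and the columns at positions 0 and 1. With iloc, the end
position of a slice is excluded.
Output:
   Name  Age
0  Amit   20
1  Neha   19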
4. Conditional Slicing:
Example:
print(df[df['Marks'] > 85])
Explanation: Selects rows where the 'Marks' column value is greater than 85.
Output:
   Name  Age  Marks
1  Neha   19     90
2  Ravi   21     88
Key Points:
loc uses labels (names or indices), while iloc uses integer positions.
Conditional slicing is powerful for filtering data based on specific criteria.
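The difference between loc and iloc is easier to see when the row labels are not numbers. The sketch below assumes custom labels 'a', 'b', 'c' for illustration, and the age condition in the last step is also an invented example.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Amit', 'Neha', 'Ravi'], 'Age': [20, 19, 21], 'Marks': [85, 90, 88]},
    index=['a', 'b', 'c']   # assumed custom row labels
)

by_label = df.loc['a':'b', 'Name':'Age']  # label slice: both endpoints included
by_position = df.iloc[0:2, 0:2]           # position slice: end position excluded
print(by_label)
print(by_label.equals(by_position))

# Conditions can be combined with & (and) or | (or); each one needs parentheses.
high_and_young = df[(df['Marks'] > 85) & (df['Age'] < 21)]
print(high_and_young)
```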
7. Descriptive Statistics
Definition: Descriptive statistics summarize the central tendency, dispersion, and distribution of numeric data in a
DataFrame.
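Methods:
1. Mean:
print(df['Marks'].mean())
Explanation: Calculates the average of the 'Marks' column.
Output:
87.66666666666667
2. Median:
print(df['Marks'].median())
Explanation: Finds the middle value of the 'Marks' column when sorted.
Output:
88.0
3. Standard Deviation:
print(df['Marks'].std())
Explanation: Measures how spread out the 'Marks' values are. Pandas computes the sample standard deviation
(dividing by n - 1) by default.
Output:
2.516611478423583
4. Summary Statistics:
print(df.describe())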
Explanation: Summarizes count, mean, std, min, max, and quartiles for 'Age' and 'Marks'.
Output:
        Age      Marks
count   3.0   3.000000
mean   20.0  87.666667
std     1.0   2.516611
min    19.0  85.000000
25%    19.5  86.500000
50%    20.0  88.000000
75%    20.5  89.000000
max    21.0  90.000000
Use Case: Helps identify trends, outliers, and the overall distribution of data.
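To see where the standard deviation figure comes from, the following sketch recomputes it by hand for the three marks used in these notes. The variable names are illustrative only.

```python
import pandas as pd

marks = pd.Series([85, 90, 88], name='Marks')

mean = marks.mean()
sample_std = marks.std()             # default ddof=1: divides by n - 1
population_std = marks.std(ddof=0)   # ddof=0: divides by n

# Reproduce the sample standard deviation step by step.
squared_diffs = (marks - mean) ** 2
manual_std = (squared_diffs.sum() / (len(marks) - 1)) ** 0.5

print(sample_std, manual_std, population_std)

# agg runs several statistics in one call and returns them as a Series.
print(marks.agg(['mean', 'median', 'std', 'min', 'max']))
```

The sample version (ddof=1) is the one reported by describe(), which is why it matches the std row in the table above.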
8. Data Manipulation
Definition: Data manipulation involves modifying, adding, or removing data in a DataFrame to prepare it for
analysis.
Operations:
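1. Adding a Column:
df['Grade'] = ['A', 'A+', 'A']
print(df)
Explanation: Assigning a list to a new column name creates that column. The list needs one value per row.
Output:
   Name  Age  Marks Grade
0  Amit   20     85     A
1  Neha   19     90    A+
2  Ravi   21     88     A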
2. Modifying a Column:
df['Marks'] = df['Marks'] + 5
print(df)
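Explanation: Adds 5 to every value in the 'Marks' column and stores the result back in that column.
Output:
   Name  Age  Marks Grade
0  Amit   20     90     A
1  Neha   19     95    A+
2  Ravi   21     93     A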
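3. Deleting a Column:
df.drop('Grade', axis=1, inplace=True)
print(df)
Explanation:
drop('Grade', axis=1): Removes the 'Grade' column; axis=1 refers to columns (axis=0 refers to rows).
inplace=True: Applies the change to the original DataFrame.
Output:
   Name  Age  Marks
0  Amit   20     90
1  Neha   19     95
2  Ravi   21     93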
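4. Renaming Columns:
df.rename(columns={'Name': 'Student_Name'}, inplace=True)
print(df)
Explanation:
rename(columns={'Name': 'Student_Name'}): Changes the 'Name' column to
'Student_Name'.
inplace=True: Applies the change to the original DataFrame.
Output:
   Student_Name  Age  Marks
0          Amit   20     90
1          Neha   19     95
2          Ravi   21     93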
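5. Sorting:
print(df.sort_values('Marks', ascending=False))
Explanation:
sort_values('Marks'): Sorts the rows by the values in the 'Marks' column.
ascending=False: Orders the rows from highest to lowest. Each row keeps its original index, and df itself is not
changed because sort_values returns a new DataFrame.
Output:
   Student_Name  Age  Marks
1          Neha   19     95
2          Ravi   21     93
0          Amit   20     90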
6. Filtering:
Output:
Note: Use inplace=True to modify the original DataFrame; otherwise, operations return a new DataFrame.
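The note about inplace=True can be checked with a small sketch. The threshold of 90 in the filtering step is an assumption chosen for illustration, and the data matches the marks after the +5 adjustment above.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Amit', 'Neha', 'Ravi'],
    'Age': [20, 19, 21],
    'Marks': [90, 95, 93]
})

# Without inplace=True, rename returns a new DataFrame and df is unchanged.
renamed = df.rename(columns={'Name': 'Student_Name'})
print(list(df.columns))
print(list(renamed.columns))

# Reassigning the result is an alternative to inplace=True.
df = df.drop(columns=['Age'])

# Filter, then sort: rows with Marks above 90, highest first.
top = df[df['Marks'] > 90].sort_values('Marks', ascending=False)
print(top)
```

Reassigning the result (df = df.drop(...)) has the same visible effect as inplace=True and makes it easy to chain steps such as filtering and sorting.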
9. Writing to a CSV File
Definition: Exporting a DataFrame to a CSV file allows you to save your data for future use or sharing.
df.to_csv('students_updated.csv', index=False)
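Explanation:
to_csv('students_updated.csv'): Writes the DataFrame to a file named students_updated.csv in the current
working directory.
index=False: Leaves the row index (0, 1, 2) out of the file.
Output (contents of students_updated.csv):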
Student_Name,Age,Marks
Amit,20,90
Neha,19,95
Ravi,21,93
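Parameters:
index: If True (the default), the row index is written as the first column; index=False omits it.
sep: The character placed between values (a comma by default).
header: If True (the default), the column names are written as the first line.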
Use Case: Save processed data for use in other applications like Excel or databases.
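A write-then-read round trip is a simple way to confirm what to_csv produced. The sketch below writes into a temporary folder, which is an assumption made so that no existing file is overwritten.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({
    'Student_Name': ['Amit', 'Neha', 'Ravi'],
    'Age': [20, 19, 21],
    'Marks': [90, 95, 93]
})

# Write into a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'students_updated.csv')
    df.to_csv(path, index=False)

    # Read the first line to check the header that was written.
    with open(path) as f:
        header = f.readline().strip()

    # Load the file back into a new DataFrame.
    restored = pd.read_csv(path)

print(header)
print(restored)
```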
result = df.groupby('Age')['Marks'].mean()
print(result)
Line-by-Line Explanation:
df.groupby('Age'): Groups the DataFrame by unique values in the 'Age' column.
['Marks'].mean(): Calculates the average of the 'Marks' column for each group.
print(result): Displays the result as a Series.
Output:
Age
19    95.0
20    90.0
21    93.0
Name: Marks, dtype: float64
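Multiple Aggregations: agg() applies several functions to each group in one call.
result = df.groupby('Age').agg({'Marks': ['mean', 'max'], 'Student_Name': 'count'})
print(result)
Line-by-Line Explanation:
df.groupby('Age'): Groups the DataFrame by unique values in the 'Age' column.
.agg({...}): Calculates the mean and maximum of 'Marks' and counts the 'Student_Name' entries in each group.
print(result): Displays the result as a DataFrame with two-level column names.
Output:
    Marks     Student_Name
     mean max        count
Age
19   95.0  95            1
20   90.0  90            1
21   93.0  93            1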
Use Case: Useful for summarizing data, e.g., finding average marks per age group or total sales per region.
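Grouping is more informative when a group holds more than one row. The sketch below uses hypothetical data with repeated ages (Sara and Kiran are invented for illustration) and named aggregation, one of several ways to call agg.

```python
import pandas as pd

# Hypothetical data with repeated ages, so some groups have two rows.
df = pd.DataFrame({
    'Student_Name': ['Amit', 'Neha', 'Ravi', 'Sara', 'Kiran'],
    'Age': [20, 19, 21, 19, 20],
    'Marks': [90, 95, 93, 85, 80]
})

# Named aggregation: each keyword becomes an output column,
# given as a (source column, function) pair.
summary = df.groupby('Age').agg(
    mean_marks=('Marks', 'mean'),
    max_marks=('Marks', 'max'),
    students=('Student_Name', 'count')
)
print(summary)
```

Named aggregation gives flat, readable column names, which can be easier to work with than the two-level headers shown earlier.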
Practice Questions
1. Create a DataFrame from a dictionary with student names, ages, and marks, and display it.
2. Read a CSV file and show its first 3 rows and last 3 rows.
3. Find the shape, size, and summary statistics of a DataFrame.
4. Select rows where marks are above 90 and sort them by marks in descending order.
5. Group a DataFrame by age and compute the mean and maximum marks for each group.