Chapter Notes - Data Handling Using Pandas DataFrame

The document provides a comprehensive overview of using Pandas DataFrame for data handling in Python, including its definition, key features, and installation instructions. It covers creating DataFrames from various data sources, reading CSV files, and retrieving subsets of data through indexing and slicing. Additionally, it discusses descriptive statistics and data manipulation techniques such as adding, modifying, and deleting columns.

Uploaded by

ANE: Thundres

Chapter Notes: Data Handling Using Pandas DataFrame


Book: Informatics Practices (Code: 065)
Chapter: Data Handling Using Pandas DataFrame

1. Pandas DataFrame
Definition: A Pandas DataFrame is a two-dimensional, tabular data structure provided by the Pandas library in
Python. It is designed for efficient data manipulation and analysis, resembling a spreadsheet or a database table.

Key Features:

Rows and Columns: Each row represents a record, and each column represents a variable or feature.
Labeled Axes: Rows and columns can have labels (e.g., names or numbers).
Heterogeneous Data: Can store different data types (e.g., integers, floats, strings) in different columns.
Flexible Operations: Supports filtering, grouping, merging, and statistical computations.

Why Use Pandas? It simplifies complex data tasks such as cleaning, transforming, and analyzing large datasets.

Installation: Install Pandas using the command:

pip install pandas

Importing Pandas: Always import Pandas in your Python script to use its features.

import pandas as pd

Explanation:
import pandas: Imports the Pandas library.
as pd: Assigns the alias pd for convenience, so we write pd instead of pandas in our code.

2. Creating a Pandas DataFrame


Definition: A DataFrame can be created from various data sources like dictionaries, lists, or external files.

Methods to Create a DataFrame:

1. From a Dictionary:
import pandas as pd
data = {
'Name': ['Amit', 'Neha', 'Ravi'],
'Age': [20, 19, 21],
'Marks': [85, 90, 88]
}
df = pd.DataFrame(data)
print(df)

Line-by-Line Explanation:

Line 1: import pandas as pd – Imports the Pandas library with alias pd.
Line 2-6: data = {...} – Creates a dictionary where keys are column names (Name, Age,
Marks) and values are lists containing the data for each column.
Line 7: df = pd.DataFrame(data) – Converts the dictionary into a DataFrame, where
each key becomes a column and each list becomes the column’s data.
Line 8: print(df) – Displays the DataFrame.

Output:

   Name  Age  Marks
0  Amit   20     85
1  Neha   19     90
2  Ravi   21     88

The output is a table with 3 rows and 3 columns, with default row indices (0, 1, 2).

2. From a List of Lists:

data = [['Amit', 20, 85], ['Neha', 19, 90], ['Ravi', 21, 88]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Marks'])
print(df)

Line-by-Line Explanation:

Line 1: data = [[...]] – Creates a list of lists, where each inner list represents a row of
data.
Line 2: df = pd.DataFrame(data, columns=['Name', 'Age', 'Marks']) –
Converts the list into a DataFrame and assigns column names using the columns
parameter.
Line 3: print(df) – Displays the DataFrame.

Output:

   Name  Age  Marks
0  Amit   20     85
1  Neha   19     90
2  Ravi   21     88

3. From a List of Dictionaries:

data = [
{'Name': 'Amit', 'Age': 20, 'Marks': 85},
{'Name': 'Neha', 'Age': 19, 'Marks': 90},
{'Name': 'Ravi', 'Age': 21, 'Marks': 88}
]
df = pd.DataFrame(data)
print(df)

Line-by-Line Explanation:

Line 1-5: data = [...] – Creates a list where each element is a dictionary representing a
row, with keys as column names and values as data.
Line 6: df = pd.DataFrame(data) – Converts the list of dictionaries into a DataFrame.
Line 7: print(df) – Displays the DataFrame.

Output:

   Name  Age  Marks
0  Amit   20     85
1  Neha   19     90
2  Ravi   21     88

Key Points:

The columns parameter is optional when column names are already defined (e.g., in dictionaries).
The index parameter can be used to set custom row labels (e.g., index=['a', 'b', 'c']).
Always ensure the data structure is consistent (e.g., same number of elements in each list).
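
The last two key points can be tried out with a minimal sketch. It reuses the student data from above; the row labels 'a', 'b', 'c' and the names df_labeled and bad_data are only illustrative. The second part assumes that pandas rejects column lists of unequal length with a ValueError.

```python
import pandas as pd

# Same student data as above; the row labels are illustrative.
data = {
    'Name': ['Amit', 'Neha', 'Ravi'],
    'Age': [20, 19, 21],
    'Marks': [85, 90, 88]
}
df_labeled = pd.DataFrame(data, index=['a', 'b', 'c'])
print(df_labeled)

# Lists of unequal length cannot form a table, so pandas raises ValueError.
bad_data = {'Name': ['Amit', 'Neha', 'Ravi'], 'Age': [20, 19]}
try:
    pd.DataFrame(bad_data)
    length_error = None
except ValueError as exc:
    length_error = str(exc)
print(length_error)
```

With custom labels, a row is looked up by its label (for example df_labeled.loc['b']) rather than by 0, 1, 2.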

3. Reading from a CSV File


Definition: A CSV (Comma-Separated Values) file is a plain text file that stores tabular data, with rows separated
by newlines and columns separated by commas (or other delimiters).

Reading a CSV File: Use the pd.read_csv() function to load a CSV file into a DataFrame.

Example: Assume a file students.csv with the following content:


Name,Age,Marks
Amit,20,85
Neha,19,90
Ravi,21,88

import pandas as pd
df = pd.read_csv('students.csv')
print(df)

Line-by-Line Explanation:

Line 1: import pandas as pd – Imports the Pandas library.
Line 2: df = pd.read_csv('students.csv') – Reads the CSV file into a DataFrame, using
the first row as column names by default.
Line 3: print(df) – Displays the DataFrame.

Output:

   Name  Age  Marks
0  Amit   20     85
1  Neha   19     90
2  Ravi   21     88

Important Parameters:

sep: Specifies the delimiter (e.g., sep=';' for semicolon-separated files).
header: Specifies which row to use as column names (e.g., header=None if no header).
index_col: Sets a column as the row index (e.g., index_col='Name').

Example with Parameters:

df = pd.read_csv('students.csv', sep=',', index_col='Name')
print(df)

Explanation:

sep=',': Specifies comma as the delimiter (default).
index_col='Name': Uses the 'Name' column as the row index.

Output:

      Age  Marks
Name
Amit   20     85
Neha   19     90
Ravi   21     88

Note: A relative file name such as students.csv is looked up in the current working directory, which is usually
the folder you run the script from. If the file is elsewhere, provide the full file path (e.g.,
C:/data/students.csv).
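
The sep and header parameters can be seen together in a small sketch. It assumes a hypothetical semicolon-separated file with no header row; to stay self-contained, the file content is held in a string (raw_text) and passed through io.StringIO instead of being read from disk. The names parameter, which supplies column names when the file has none, is not covered above and is introduced here only for illustration.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data with no header row.
raw_text = "Amit;20;85\nNeha;19;90\nRavi;21;88\n"

# io.StringIO lets read_csv treat a string like an open file.
df_no_header = pd.read_csv(
    io.StringIO(raw_text),
    sep=';',                          # columns are separated by semicolons
    header=None,                      # the first line is data, not column names
    names=['Name', 'Age', 'Marks'],   # supply the column names ourselves
)
print(df_no_header)
```

Without header=None, the first data row (Amit;20;85) would be used as the column names.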

4. Dimensions of a DataFrame
Definition: Dimensions refer to the number of rows and columns in a DataFrame, which helps understand its size
and structure.

Attributes:

1. Shape: df.shape returns a tuple (rows, columns).

print(df.shape)

Explanation: Returns the number of rows and columns as a tuple.

Output (for the student DataFrame read without index_col):

(3, 3)

3 rows and 3 columns.

2. Size: df.size returns the total number of elements (rows × columns).

print(df.size)

Explanation: Multiplies the number of rows by the number of columns.

Output:

9

3 rows × 3 columns = 9 elements.

3. Number of Dimensions: df.ndim returns the number of dimensions (always 2 for a DataFrame).

print(df.ndim)

Explanation: Confirms that a DataFrame is a 2D structure.

Output:

2

Use Case: Use these attributes to verify the DataFrame’s structure before performing operations like looping or
filtering.
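
As a rough check of the three attributes, the sketch below loads the same student data twice, once with the default index and once with index_col='Name'. The variable names are illustrative, and the CSV text is kept in a string so the example runs without a file on disk.

```python
import io
import pandas as pd

csv_text = "Name,Age,Marks\nAmit,20,85\nNeha,19,90\nRavi,21,88\n"

# Default row index: 'Name' stays an ordinary column.
df_default = pd.read_csv(io.StringIO(csv_text))

# 'Name' becomes the row index, so it no longer counts as a column.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col='Name')

print(df_default.shape, df_default.size, df_default.ndim)   # (3, 3) 9 2
print(df_indexed.shape, df_indexed.size, df_indexed.ndim)   # (3, 2) 6 2
```

The index is not counted in shape or size, which is why moving 'Name' into the index reduces the column count from 3 to 2.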

5. Summary Information About a DataFrame
Definition: Summary information provides an overview of the DataFrame’s structure, data types, and content.

Methods:

1. df.info(): Displays column names, data types, and non-null counts.

df.info()

Explanation: Provides a concise summary, useful for checking missing values and data types.

Output (for the above DataFrame):

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    3 non-null      object
 1   Age     3 non-null      int64
 2   Marks   3 non-null      int64
dtypes: int64(2), object(1)
memory usage: 200.0+ bytes

Shows 3 columns, their names, non-null counts, and data types (object for strings, int64
for integers).

2. df.describe(): Generates descriptive statistics for numeric columns (count, mean, std, min, max,
quartiles).

df.describe()

Explanation: Summarizes numeric columns only, ignoring non-numeric ones like 'Name'.

Output:

        Age      Marks
count   3.0   3.000000
mean   20.0  87.666667
std     1.0   2.516611
min    19.0  85.000000
25%    19.5  86.500000
50%    20.0  88.000000
75%    20.5  89.000000
max    21.0  90.000000

Shows statistics like count, mean, standard deviation, and percentiles for 'Age' and 'Marks'.

3. df.head(n): Displays the first n rows (default n=5).

df.head(2)

Explanation: Shows the first 2 rows to preview the data.

Output:

   Name  Age  Marks
0  Amit   20     85
1  Neha   19     90

4. df.tail(n): Displays the last n rows (default n=5).

df.tail(2)

Explanation: Shows the last 2 rows.

Output:

   Name  Age  Marks
1  Neha   19     90
2  Ravi   21     88

Use Case: Use these methods to inspect the DataFrame before performing complex operations.
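
Because the example DataFrame has no missing values, the non-null counts in df.info() all equal the row count. The sketch below uses a hypothetical variant (df_missing) in which one mark is missing, to show what info() looks like in that case. The isnull().sum() call is not covered above and is added only as one possible way to count missing values per column.

```python
import pandas as pd

# Hypothetical variant of the student data with one missing mark.
df_missing = pd.DataFrame({
    'Name': ['Amit', 'Neha', 'Ravi'],
    'Age': [20, 19, 21],
    'Marks': [85, None, 88],
})

# info() now reports 2 non-null values for 'Marks' against 3 entries.
df_missing.info()

# isnull() marks missing cells as True; sum() counts them per column.
missing_per_column = df_missing.isnull().sum()
print(missing_per_column)
```

Note that the 'Marks' column becomes float64 here, because the missing value is stored as NaN, which is a floating-point value.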

6. Retrieving Subset of Data - Indexing and Slicing
Definition: Indexing and slicing allow you to select specific rows, columns, or subsets of a DataFrame.

Methods:

1. Using Column Names:

Single Column: df['column_name'] returns a Series.

print(df['Name'])

Explanation: Selects the 'Name' column as a Series (a single-column structure).

Output:

0 Amit
1 Neha
2 Ravi
Name: Name, dtype: object

Multiple Columns: df[['col1', 'col2']] returns a DataFrame.

print(df[['Name', 'Marks']])

Explanation: Selects the 'Name' and 'Marks' columns as a new DataFrame.

Output:

   Name  Marks
0  Amit     85
1  Neha     90
2  Ravi     88

2. Using loc (Label-based):

Syntax: df.loc[row_labels, column_labels].

Example:

print(df.loc[0:1, 'Name':'Age'])

Explanation:

0:1: Selects rows with index labels 0 through 1. Unlike ordinary Python slicing, a loc slice includes the end label.
'Name':'Age': Selects columns from 'Name' to 'Age' (inclusive).

Output:

   Name  Age
0  Amit   20
1  Neha   19

3. Using iloc (Integer-based):

Syntax: df.iloc[row_indices, column_indices].

Example:

print(df.iloc[0:2, 0:2])

Explanation:

0:2: Selects row positions 0 and 1 (the stop position 2 is excluded).
0:2: Selects column positions 0 and 1 ('Name' and 'Age').

Output:

   Name  Age
0  Amit   20
1  Neha   19

4. Conditional Slicing: Select rows based on conditions.

print(df[df['Marks'] > 85])

Explanation: Selects rows where the 'Marks' column value is greater than 85.

Output:

   Name  Age  Marks
1  Neha   19     90
2  Ravi   21     88

Key Points:

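loc selects by label (index labels and column names), while iloc selects by integer position. A loc slice includes its end label; an iloc slice excludes its stop position.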
Conditional slicing is powerful for filtering data based on specific criteria.
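
With the default index 0, 1, 2, labels and positions look the same, which can hide the difference between loc and iloc. The sketch below uses illustrative string labels 'a', 'b', 'c' (an assumption, not part of the examples above) so that the two are easy to tell apart.

```python
import pandas as pd

# Student data with illustrative string row labels.
df_labeled = pd.DataFrame(
    {'Name': ['Amit', 'Neha', 'Ravi'], 'Age': [20, 19, 21], 'Marks': [85, 90, 88]},
    index=['a', 'b', 'c'],
)

# loc: label-based; both 'a' and 'b' (and both 'Name' and 'Age') are included.
by_label = df_labeled.loc['a':'b', 'Name':'Age']

# iloc: position-based; positions 0 and 1 are included, position 2 is not.
by_position = df_labeled.iloc[0:2, 0:2]

# A single cell: row label 'b', column 'Marks'.
single_value = df_labeled.loc['b', 'Marks']

print(by_label)
print(by_position)
print(single_value)
```

Both slices return the same two rows and two columns, even though the loc slice ends at 'b' and the iloc slice ends at 2.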

7. Descriptive Statistics
Definition: Descriptive statistics summarize the central tendency, dispersion, and distribution of numeric data in a
DataFrame.

Methods:

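1. df.mean(): Computes the mean of numeric columns. If the DataFrame also has text columns such as 'Name', recent pandas versions (2.0 and later) raise an error for a plain df.mean(), so either select a numeric column first or call df.mean(numeric_only=True).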

print(df['Marks'].mean())

Explanation: Calculates the average of the 'Marks' column.

Output:

87.66666666666667

2. df.median(): Computes the median of numeric columns.

print(df['Marks'].median())

Explanation: Finds the middle value of the 'Marks' column when sorted.

Output:

88.0

3. df.std(): Computes the standard deviation of numeric columns.

print(df['Marks'].std())

Explanation: Measures the spread of the 'Marks' column values.


Output:

2.516611478423583

4. df.describe(): Provides a summary of statistics for all numeric columns.

print(df.describe())

Explanation: Summarizes count, mean, std, min, max, and quartiles for 'Age' and 'Marks'.

Output:

        Age      Marks
count   3.0   3.000000
mean   20.0  87.666667
std     1.0   2.516611
min    19.0  85.000000
25%    19.5  86.500000
50%    20.0  88.000000
75%    20.5  89.000000
max    21.0  90.000000

Use Case: Helps identify trends, outliers, and the overall distribution of data.
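
The methods above are applied to a single column. As a minimal sketch, the same statistics can be computed for every numeric column at once by passing numeric_only=True, which skips the text column 'Name'. The variable names and the range calculation (max minus min) are illustrative additions.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Amit', 'Neha', 'Ravi'],
    'Age': [20, 19, 21],
    'Marks': [85, 90, 88],
})

# numeric_only=True restricts the calculation to 'Age' and 'Marks'.
column_means = df.mean(numeric_only=True)
column_medians = df.median(numeric_only=True)
print(column_means)
print(column_medians)

# A simple measure of spread for one column: highest minus lowest value.
marks_range = df['Marks'].max() - df['Marks'].min()
print(marks_range)
```

Each call returns a Series with one entry per numeric column, so column_means['Marks'] gives the same value as df['Marks'].mean().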

8. Data Manipulation
Definition: Data manipulation involves modifying, adding, or removing data in a DataFrame to prepare it for
analysis.

Operations:

1. Adding a Column:

df['Grade'] = ['A', 'A+', 'A']
print(df)

Explanation:

Adds a new column 'Grade' with the specified values.
The list must have the same length as the number of rows.

Output:

   Name  Age  Marks Grade
0  Amit   20     85     A
1  Neha   19     90    A+
2  Ravi   21     88     A

2. Modifying a Column:

df['Marks'] = df['Marks'] + 5
print(df)

Explanation: Adds 5 to every value in the 'Marks' column.

Output:

   Name  Age  Marks Grade
0  Amit   20     90     A
1  Neha   19     95    A+
2  Ravi   21     93     A

3. Deleting a Column:

df.drop('Grade', axis=1, inplace=True)
print(df)

Explanation:

drop('Grade', axis=1): Removes the 'Grade' column (axis=1 indicates columns).
inplace=True: Modifies the original DataFrame instead of creating a new one.

Output:

   Name  Age  Marks
0  Amit   20     90
1  Neha   19     95
2  Ravi   21     93

4. Renaming Columns:

df.rename(columns={'Name': 'Student_Name'}, inplace=True)
print(df)

Explanation:
rename(columns={'Name': 'Student_Name'}): Changes the 'Name' column to
'Student_Name'.
inplace=True: Applies the change to the original DataFrame.

Output:

  Student_Name  Age  Marks
0         Amit   20     90
1         Neha   19     95
2         Ravi   21     93

5. Sorting:

print(df.sort_values('Marks', ascending=False))

Explanation:

sort_values('Marks'): Sorts the DataFrame by the 'Marks' column.
ascending=False: Sorts in descending order (highest to lowest).

Output:

  Student_Name  Age  Marks
1         Neha   19     95
2         Ravi   21     93
0         Amit   20     90

6. Filtering:

high_scorers = df[df['Marks'] > 90]
print(high_scorers)

Explanation: Selects rows where 'Marks' is greater than 90.

Output:

  Student_Name  Age  Marks
1         Neha   19     95
2         Ravi   21     93

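Note: Methods such as drop(), rename(), and sort_values() return a new DataFrame by default and leave the original unchanged. To keep the change, either assign the result back (e.g., df = df.drop('Grade', axis=1)) or pass inplace=True to modify the original DataFrame.
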
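
The difference between the two styles can be checked with a short sketch. It rebuilds a small version of the student data with a 'Grade' column; the variable names without_grade and returned are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Amit', 'Neha', 'Ravi'],
    'Marks': [85, 90, 88],
    'Grade': ['A', 'A+', 'A'],
})

# Without inplace, drop() returns a new DataFrame and leaves df untouched.
without_grade = df.drop('Grade', axis=1)
print(list(df.columns))             # 'Grade' is still present in df
print(list(without_grade.columns))  # 'Grade' is gone in the new DataFrame

# With inplace=True, df itself changes and the call returns None.
returned = df.drop('Grade', axis=1, inplace=True)
print(returned)
print(list(df.columns))
```

Because an inplace call returns None, writing df = df.drop('Grade', axis=1, inplace=True) would replace df with None, so the two styles should not be combined.
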
9. Writing to a CSV File
Definition: Exporting a DataFrame to a CSV file allows you to save your data for future use or sharing.

Method: Use df.to_csv().

df.to_csv('students_updated.csv', index=False)

Line-by-Line Explanation:

df.to_csv('students_updated.csv'): Saves the DataFrame to a file named students_updated.csv.
index=False: Excludes the row indices from the output file.

Output File Content (students_updated.csv):

Student_Name,Age,Marks
Amit,20,90
Neha,19,95
Ravi,21,93

Parameters:

index: Set to False to exclude the index column.
sep: Specify a delimiter (e.g., sep=';' for semicolon-separated files).

Use Case: Save processed data for use in other applications like Excel or databases.
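
To see what the index and sep parameters do without creating files, the sketch below relies on the fact that to_csv() returns the CSV text as a string when no file name is given. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Student_Name': ['Amit', 'Neha', 'Ravi'],
    'Age': [20, 19, 21],
    'Marks': [90, 95, 93],
})

# No file name given: to_csv() returns the CSV content as a string.
with_index = df.to_csv()                       # row index written as first column
without_index = df.to_csv(index=False)         # row index left out
semicolon = df.to_csv(index=False, sep=';')    # semicolon as the delimiter

print(with_index)
print(without_index)
print(semicolon)
```

In the first version every line starts with the row index (0, 1, 2) and the header line starts with an empty field, which is why index=False is usually preferred when the index carries no information.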

10. Grouping and Aggregation

Definition:

Grouping: Divides data into groups based on one or more columns.
Aggregation: Applies functions (e.g., mean, sum, count) to each group to summarize the data.

Method: Use df.groupby() followed by an aggregation function.

Example 1: Group by 'Age' and calculate the average marks.

result = df.groupby('Age')['Marks'].mean()
print(result)

Line-by-Line Explanation:
df.groupby('Age'): Groups the DataFrame by unique values in the 'Age' column.
['Marks'].mean(): Calculates the average of the 'Marks' column for each group.
print(result): Displays the result as a Series.

Output:

Age
19 95.0
20 90.0
21 93.0
Name: Marks, dtype: float64

Example 2: Multiple aggregations.

result = df.groupby('Age').agg({'Marks': ['mean', 'max'], 'Student_Name': 'count'})
print(result)

Line-by-Line Explanation:

df.groupby('Age'): Groups the DataFrame by 'Age'.
.agg({...}): Applies multiple aggregation functions:
'Marks': ['mean', 'max']: Computes the mean and maximum of 'Marks' for each
group.
'Student_Name': 'count': Counts the number of students in each group.
print(result): Displays the result as a DataFrame.

Output:

    Marks     Student_Name
     mean max        count
Age
19   95.0  95            1
20   90.0  90            1
21   93.0  93            1

Use Case: Useful for summarizing data, e.g., finding average marks per age group or total sales per region.
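
In the examples above every age occurs only once, so each group holds a single student and the mean equals that student's marks. The sketch below adds two hypothetical students ('Sara' and 'Kabir', with made-up values) so that some groups have more than one member and the aggregation becomes visible.

```python
import pandas as pd

# Student data extended with two hypothetical rows so that ages repeat.
df_more = pd.DataFrame({
    'Student_Name': ['Amit', 'Neha', 'Ravi', 'Sara', 'Kabir'],
    'Age': [20, 19, 21, 19, 20],
    'Marks': [90, 95, 93, 85, 80],
})

# For each age: average marks, highest marks, and number of students.
summary = df_more.groupby('Age')['Marks'].agg(['mean', 'max', 'count'])
print(summary)
```

Here the group for age 19 contains Neha (95) and Sara (85), so its mean is 90.0, its max is 95, and its count is 2.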

Key Points to Remember


Pandas DataFrames are ideal for handling structured data like tables.
Always use df.info() to check data types and missing values before processing.
Be cautious with inplace=True: it changes the original DataFrame directly, so the earlier version is lost unless you kept a copy.
Practice loc and iloc to master indexing and slicing.
Grouping and aggregation are powerful for summarizing large datasets.
Ensure CSV files are accessible (correct path) when reading or writing.

Practice Questions
1. Create a DataFrame from a dictionary with student names, ages, and marks, and display it.
2. Read a CSV file and show its first 3 rows and last 3 rows.
3. Find the shape, size, and summary statistics of a DataFrame.
4. Select rows where marks are above 90 and sort them by marks in descending order.
5. Group a DataFrame by age and compute the mean and maximum marks for each group.
