
Pandas


Pandas is a Python library that makes it really easy to work with data. Think of it as
a tool that helps you organize, clean, and analyze data, especially when that data is
in the form of a table, like what you'd see in Excel.
Imagine you have a table of data, like a list of people's names, ages, and where
they live. Pandas helps you work with that data efficiently. It’s very useful when you
have lots of data that you want to sort, filter, or do calculations with. You can also
use Pandas to read data from files like CSV (Comma-Separated Values), Excel
spreadsheets, and even databases.

Pandas Data Structures


 Series
A Series is like a single column of data, or you can think of it as a list with
labels. Each item in a Series has an index (like a label), which helps you
access the data easily. The data in a Series can be numbers, text, or any
other data type, just like a list in Python.
Key Points About Series:
 It is one-dimensional (like a single list or a column in a table).
 Each value has an index (which acts like a label).
 You can access data using the index.
A data structure in Pandas that has only one column of data. It is one-
dimensional, like a list or a single column in a table.
How to use Series in Pandas
import pandas as pd

# Build a Series from a plain Python list of ages
ages = [20, 52, 30, 10, 25, 22, 23]
age_series = pd.Series(ages)
print(age_series)
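The index doesn't have to be 0, 1, 2, … — you can supply your own labels and use them to access values. A small sketch (the names here are made up for illustration):

```python
import pandas as pd

# A Series with custom labels instead of the default 0, 1, 2, ...
ages = pd.Series([20, 52, 30], index=["Amit", "Neha", "Ravi"])

print(ages["Neha"])   # access a value by its label
print(ages.iloc[0])   # access a value by its position
```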

 DataFrame
A DataFrame is like a full table of data with rows and columns. If a Series is
like a single column, then a DataFrame is like several columns (each column
is a Series). It’s essentially a collection of Series that are aligned by their
index.
Key Points About DataFrame:
 It is two-dimensional (like a table).
 Each column is a Series, and the columns are aligned based on their
index (row labels).
 You can think of it as an Excel sheet where each cell belongs to a row
and a column.
A data structure in Pandas that has more than one column (or even just
one column, but it's still considered a DataFrame because it's designed to
handle multiple columns). It is two-dimensional, like a table with rows and
columns.
How to use a DataFrame in Pandas
import pandas as pd

# Build a DataFrame from a dictionary: keys become column names
data = {
    "Name": ["Ritik Kumar", "Roshan Kumar", "Jitendra Kumar",
             "Asha Devi", "Krishna Kumar", "Kajal Kumari"],
    "Age": [21, 19, 47, 38, 28, 19]
}
df = pd.DataFrame(data)
print(df)

Installation and Setup


To use Pandas, you first need to install it. You can do this with a tool called pip,
which is the package manager for Python. Simply open your command line or
terminal and type the following:
pip install pandas
This will download and install Pandas on your system. Once it’s installed, you
can start using it in your Python scripts by importing it:
import pandas as pd
Here, pd is just a shorthand name that people often use for Pandas to make
the code easier to write.
If you are using Anaconda, you don't need to install common libraries like Pandas,
NumPy, and many others separately. Anaconda comes with these popular data
science libraries pre-installed.
What is Anaconda?
Anaconda is a distribution of Python and R that is specifically designed for data
science, machine learning, and scientific computing. When you install Anaconda, it
includes:
 Python (the programming language).
 Conda (a package manager for managing environments and packages).
 Many popular libraries such as Pandas, NumPy, Matplotlib, SciPy, Jupyter
Notebook, and many more.
Loading Data into Pandas
One of the most powerful features of Pandas is that it allows you to easily load data
from different sources like CSV files, Excel spreadsheets, SQL databases, and more.
This is useful because most of the time, the data you want to analyze isn’t typed
directly into your code but rather stored in files or databases.
1. Loading Data from a CSV File
A CSV (Comma-Separated Values) file is a plain text file where each line of data is a
row, and the values are separated by commas.
Example
Let's say you have a CSV file called data.csv
You can load this data into a Pandas DataFrame using the read_csv function:
import pandas as pd
# Load data from a CSV file
df = pd.read_csv('data.csv')
print(df)
2. Loading Data from an Excel File
An Excel file is another common way data is stored. Excel files can have multiple
sheets, and Pandas makes it easy to load them.
Example
Let's say you have an Excel file called data.xlsx with the same data as above. You
can load this into Pandas using the read_excel function:
import pandas as pd
# Load data from an Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df)
3. Loading Data from Other Sources
Pandas can also load data from many other sources, such as:
 JSON: Use pd.read_json('file.json') to load JSON data.
 HTML: Use pd.read_html('file.html') to load data from an HTML table.
 Parquet: Use pd.read_parquet('file.parquet') to load Parquet files (often used
in big data).
 SQLAlchemy: For more complex database connections, you can use
SQLAlchemy to connect to SQL databases.
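As a small illustration of reading from a database, here is a sketch using Python's built-in sqlite3 module with an in-memory database (the table name and its contents are invented for the example):

```python
import sqlite3
import pandas as pd

# A hypothetical in-memory SQLite database for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.execute("INSERT INTO people VALUES ('Ritik', 25), ('Asha', 38)")

# read_sql runs the query and returns the result as a DataFrame
df = pd.read_sql("SELECT * FROM people", conn)
print(df)
```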
Writing Data to Files
After loading and processing your data, you may want to save it back to a file.
Pandas makes this easy too:
 To CSV: df.to_csv('output.csv', index=False)
 To Excel: df.to_excel('output.xlsx', index=False)
 To SQL: df.to_sql('people', conn, if_exists='replace')
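A quick round trip — writing a DataFrame to a CSV file and reading it back — looks like this (the file name output.csv is just an example):

```python
import pandas as pd

# Write a small DataFrame to CSV, then read it back
df = pd.DataFrame({"Name": ["Ritik", "Asha"], "Age": [25, 38]})
df.to_csv("output.csv", index=False)   # index=False leaves out the row labels

df_back = pd.read_csv("output.csv")
print(df_back)
```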
DataFrame Operations
1. Creating a DataFrame
You can create a DataFrame from various data sources, such as dictionaries, lists, or
other DataFrames.
Example
import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    "Name": ["Ritik", "Roshan", "Anshu", "Vishal"],
    "Age": [25, 30, 22, 35],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"]
}
df = pd.DataFrame(data)
print(df)
2. Accessing Data
You can access specific rows and columns in a DataFrame using various methods.
Access Columns
# Access a single column
print(df["Name"])
# Access multiple columns
print(df[["Name", "City"]])
Access Rows
# Access a single row by index
print(df.iloc[1]) # Access the second row
# Access multiple rows by index range
print(df.iloc[1:3]) # Access rows from index 1 to 2
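Alongside iloc (position-based), Pandas also provides loc, which selects rows by their index labels; note that loc slices include the end label. A short sketch using sample data like the frame above:

```python
import pandas as pd

# Sample data, rebuilt here so the example is self-contained
df = pd.DataFrame({
    "Name": ["Ritik", "Roshan", "Anshu", "Vishal"],
    "Age": [25, 30, 22, 35],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"]
})

# iloc is position-based; loc is label-based (here the labels are 0..3)
print(df.loc[1])             # row with label 1
print(df.loc[1:2, "Name"])   # loc slices include the end label
```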
3. Filtering Data
You can filter rows based on conditions.
Example
# Filter rows where Age is greater than 25
filtered_df = df[df["Age"] > 25]
print(filtered_df)
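Conditions can also be combined with & (and) and | (or); each condition must be wrapped in its own parentheses. A sketch on sample data like the frame above:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ritik", "Roshan", "Anshu", "Vishal"],
    "Age": [25, 30, 22, 35],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"]
})

# Keep rows where Age > 22 AND the City is not Houston
filtered = df[(df["Age"] > 22) & (df["City"] != "Houston")]
print(filtered)
```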
4. Adding and Removing Columns
You can add new columns or remove existing ones.
Add a Column
# Add a new column with default values
df["Country"] = "USA"
print(df)
Remove a Column
# Remove a column
df = df.drop(columns=["Country"])
print(df)
5. Adding and Removing Rows
You can also add new rows or remove existing ones.
Add a Row
# Add a new row
new_row = pd.DataFrame({"Name": ["Mike"], "Age": [28], "City": ["Seattle"]})
df = pd.concat([df, new_row], ignore_index=True)
print(df)
Remove a Row
# Remove a row by index
df = df.drop(index=4)
print(df)
6. Sorting Data
You can sort data based on values in columns.
Example
# Sort by Age in ascending order
sorted_df = df.sort_values(by="Age")
print(sorted_df)

# Sort by Age in descending order
sorted_df_desc = df.sort_values(by="Age", ascending=False)
print(sorted_df_desc)
7. Aggregating Data
You can perform calculations to summarize data.
1. Descriptive Statistics
Descriptive statistics summarize and describe the main features of a dataset.
1.1 describe()
 What it does: Provides a summary of descriptive statistics for numerical
columns.
 Includes: Count, mean, standard deviation, minimum, quartiles, and
maximum values
import pandas as pd

# Sample DataFrame
data = {'Math': [85, 78, 91, 95, 88],
        'English': [92, 88, 85, 90, 87]}
df = pd.DataFrame(data)

# Descriptive statistics summary
print(df.describe())
1.2 mean()
 What it does: Calculates the average (mean) of a column.
print(df['Math'].mean()) # Average of Math scores
1.3 median()
 What it does: Finds the middle value (median) when the data is sorted.
Example:
print(df['Math'].median()) # Median of Math scores
1.4 mode()
 What it does: Finds the most frequent value(s) in a column.
Example:
print(df['Math'].mode()) # Mode of Math scores
1.5 sum()
 What it does: Adds up all the values in a column.
Example:
print(df['Math'].sum()) # Sum of all Math scores
1.6 count()
 What it does: Counts the number of non-missing (non-NA) values in a
column.
Example:
print(df['Math'].count()) # Number of non-missing values in Math
1.7 min() and max()
 What they do: Find the smallest (min()) and largest (max()) values in a
column.
Example:
print(df['Math'].min()) # Minimum Math score
print(df['Math'].max()) # Maximum Math score
1.8 std() and var()
 What they do:
o std() calculates the standard deviation (a measure of how spread out
the values are).
o var() calculates the variance (the square of the standard deviation).
print(df['Math'].std()) # Standard deviation of Math scores
print(df['Math'].var()) # Variance of Math scores
1.9 quantile()
 What it does: Finds the value at a given percentile (e.g., 0.25 for the 25th
percentile, or 1st quartile).
Example:
print(df['Math'].quantile(0.25)) # 25th percentile of Math scores
2. Correlation and Covariance
These methods help understand relationships between different variables.
2.1 corr()
 What it does: Computes the correlation between columns, showing how
strongly they are related.
 Correlation ranges from -1 to 1:
o -1 indicates a strong negative relationship.
o 1 indicates a strong positive relationship.
o 0 indicates no linear relationship.
Example:
print(df.corr()) # Correlation between Math and English scores
2.2 cov()
 What it does: Computes the covariance, which measures the directional
relationship between two columns. Unlike correlation, covariance is not
standardized.
Example:
print(df.cov()) # Covariance between Math and English scores
3. Ranking and Sorting
These methods help you arrange data in a certain order.
3.1 rank()
 What it does: Assigns ranks to values, where 1 is the smallest value.
Example:
print(df['Math'].rank()) # Ranks of Math scores
3.2 sort_values()
 What it does: Sorts the values in a column in ascending or descending
order.
Example
print(df.sort_values(by='Math')) # Sorts data by Math scores in ascending order
4. Cumulative Methods
These methods calculate cumulative values across a column.
4.1 cumsum()
 What it does: Calculates the cumulative sum of a column.
Example:
print(df['Math'].cumsum()) # Cumulative sum of Math scores
4.2 cumprod()
 What it does: Calculates the cumulative product of a column.
Example:
print(df['Math'].cumprod()) # Cumulative product of Math scores
4.3 cummin() and cummax()
 What they do: Calculate the cumulative minimum (cummin()) and
cumulative maximum (cummax()).
Example:
print(df['Math'].cummin()) # Cumulative minimum of Math scores
print(df['Math'].cummax()) # Cumulative maximum of Math scores

8. Handling Missing Data


You can handle missing data (NaN) in various ways.
Example
# Create a DataFrame with missing values
df_with_nan = pd.DataFrame({
    "Name": ["Ritik", "Roshan", None],
    "Age": [25, None, 22],
    "City": ["New York", "Los Angeles", "Chicago"]
})

# Fill missing values with a default value
df_filled = df_with_nan.fillna({"Name": "Unknown", "Age": df_with_nan["Age"].mean()})
print(df_filled)

# Drop rows with missing values
df_dropped = df_with_nan.dropna()
print(df_dropped)
Data Cleaning with Pandas
1. Handling Missing Values
Missing values are common in datasets and can occur for many reasons, such as
incomplete data collection or errors during data entry.
Finding Missing Values
 isnull() and notnull():
o isnull() checks if a value is missing (NaN), and returns True for missing
values and False for non-missing values.
o notnull() is the opposite of isnull() and returns True for non-missing
values and False for missing values.
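A minimal sketch of isnull() in action (the data here is invented for the example):

```python
import pandas as pd

# Small frame with one missing value in each column
df_with_nan = pd.DataFrame({
    "Name": ["Ritik", None, "Asha"],
    "Age": [25, 30, None]
})

print(df_with_nan.isnull())        # True where a value is missing
print(df_with_nan.isnull().sum())  # missing-value count per column
```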
Dropping Missing Values
 dropna():
o dropna() removes rows or columns that contain missing values. By
default, it removes rows with any missing values.
Filling Missing Values
 fillna():
o fillna() replaces missing values with a specified value or method, such
as a constant, the mean, median, or a forward fill.

 You can also fill missing values with the mean or median of a column:

# Fill missing Age with the mean of the Age column
# (assign the result back; inplace=True on a single column is unreliable)
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
2. Handling Duplicates
Duplicate data occurs when the same data point appears more than once. It can
lead to skewed analysis.
Finding Duplicates
 duplicated():
o duplicated() returns True for duplicated rows and False for unique rows.
By default, it checks for duplicate rows based on all columns.
Removing Duplicates
 drop_duplicates():
o drop_duplicates() removes duplicate rows. You can choose to remove
duplicates based on specific columns by using the subset parameter.
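A short sketch of both methods together (the sample data is invented for the example):

```python
import pandas as pd

# The second row is an exact duplicate of the first
df = pd.DataFrame({"Name": ["Ritik", "Ritik", "Asha"],
                   "Age": [25, 25, 38]})

print(df.duplicated())         # True only for the repeated row
df_unique = df.drop_duplicates()
print(df_unique)
```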
3. Handling Incorrect Data Types
Sometimes data is in the wrong format. For example, a column meant to contain
numbers may have been stored as text, or dates may be incorrectly formatted.
Converting Data Types
 astype():
o astype() converts a column to a different data type, such as from
object (text) to int or float.
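For example, a column of numbers accidentally stored as text can be converted like this (the sample data is invented):

```python
import pandas as pd

# Ages stored as text (object dtype) rather than numbers
df = pd.DataFrame({"Age": ["25", "38", "30"]})
print(df["Age"].dtype)        # object (text)

df["Age"] = df["Age"].astype(int)
print(df["Age"].dtype)        # now an integer type
print(df["Age"].sum())        # arithmetic works after conversion
```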
Handling Dates
 to_datetime():
o to_datetime() converts text or other formats into proper date objects.
data = {'Date': ['2023-01-01', '2023-02-15', '2023-03-07']}
df = pd.DataFrame(data)

# Convert the Date column from text to proper datetime objects
df['Date'] = pd.to_datetime(df['Date'])
print(df)

4. Handling Outliers
Outliers are data points that differ significantly from other observations in the
dataset. They can distort your analysis.
Identifying Outliers
 Using Statistics:
o Outliers can be identified using summary statistics like the mean and
standard deviation or using visualizations such as box plots.
# Identify outliers using a simple method (values greater than mean + 3*std)
threshold = df['Age'].mean() + 3 * df['Age'].std()
outliers = df[df['Age'] > threshold]
print(outliers)
Removing or Transforming Outliers
 Transforming or Capping Outliers:
o You can choose to transform outliers (e.g., using log transformation) or
cap them at a certain threshold.
# Cap Age values at a maximum of 40
df['Age'] = df['Age'].apply(lambda x: min(x, 40))
print(df)
5. String Manipulation
Text data often needs cleaning to remove unwanted characters or correct
formatting.
Removing Unwanted Characters
 str.replace():
o str.replace() replaces a specific pattern or character in strings.
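A small sketch — stripping a currency symbol so the column can be converted to numbers (the data is invented; regex=False treats the pattern as plain text):

```python
import pandas as pd

# Price values with a stray currency symbol
df = pd.DataFrame({"Price": ["$10", "$20", "$35"]})

# regex=False treats "$" as a literal character, not a regex pattern
df["Price"] = df["Price"].str.replace("$", "", regex=False).astype(int)
print(df)
```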
6. Renaming Columns
Sometimes, columns may have unclear or inconsistent names that need to be
corrected.
Renaming Columns
 rename():
o rename() changes the names of columns.
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Rename columns
df = df.rename(columns={'col1': 'First Column', 'col2': 'Second Column'})
print(df)
Merging, Joining, and Concatenating DataFrames
1. Concatenating DataFrames
Concatenation means combining two or more DataFrames along a particular axis
(either rows or columns). You can think of it like stacking data on top of each other
or side by side.
 pd.concat(): This function concatenates DataFrames either vertically (by
default) or horizontally. You can concatenate along rows (axis=0) or columns
(axis=1).
Example 1: Concatenating Along Rows (Vertical Concatenation)
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice'],
                    'Age': [28, 30]})
df2 = pd.DataFrame({'Name': ['Bob', 'Eve'],
                    'Age': [35, 22]})

# Concatenating DataFrames vertically
df_vertical_concat = pd.concat([df1, df2], axis=0)
print(df_vertical_concat)
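The same function can also place DataFrames side by side. A sketch of horizontal concatenation with axis=1 (the second frame here is invented for the example):

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice'],
                    'Age': [28, 30]})
df_scores = pd.DataFrame({'Score': [88, 90]})

# axis=1 places the DataFrames side by side, aligned on the row index
df_horizontal = pd.concat([df1, df_scores], axis=1)
print(df_horizontal)
```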
2. Merging DataFrames
Merging is similar to SQL joins. You merge DataFrames based on one or more
common columns (known as "keys"). This operation allows you to combine rows
from two DataFrames where there is a match on the key(s).
 pd.merge(): This function merges DataFrames using a common column or
index. The type of merge can be controlled using the how parameter.
The main types of merges are:
 Inner Join (default): Returns rows that have matching values in both
DataFrames.
 Left Join: Returns all rows from the left DataFrame, and matching rows from
the right DataFrame.
 Right Join: Returns all rows from the right DataFrame, and matching rows
from the left DataFrame.
 Outer Join: Returns all rows from both DataFrames, filling missing values
with NaN where there is no match.
Example 1: Inner Join
df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                    'Age': [28, 30, 35]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Eve'],
                    'Score': [88, 90, 92]})

# Merging DataFrames using an inner join (default)
df_inner_merge = pd.merge(df1, df2, on='Name')
print(df_inner_merge)
Example 2: Left Join
# Merging DataFrames using a left join
df_left_merge = pd.merge(df1, df2, on='Name', how='left')
print(df_left_merge)
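An outer join, by contrast, keeps every key from both DataFrames. A sketch using the same sample frames:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                    'Age': [28, 30, 35]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Eve'],
                    'Score': [88, 90, 92]})

# how='outer' keeps every Name from both frames, with NaN where no match
df_outer = pd.merge(df1, df2, on='Name', how='outer')
print(df_outer)
```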
 Concatenation (pd.concat()): Stacking DataFrames either vertically or
horizontally.
 Merging (pd.merge()): Combining DataFrames based on one or more
common columns (like SQL joins).
 Joining (df.join()): Combining DataFrames based on their index.
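A minimal sketch of df.join(), which aligns on the row index rather than on a column (the index labels are invented for the example):

```python
import pandas as pd

# join() combines DataFrames on their row index
left = pd.DataFrame({'Age': [28, 30]}, index=['John', 'Alice'])
right = pd.DataFrame({'Score': [88, 90]}, index=['Alice', 'Eve'])

joined = left.join(right)   # left join on the index by default
print(joined)
```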
