Python Pandas

Uploaded by

Vineet Pal

The Unique Computers

Neelmatha Lucknow
Python Pandas
What is Pandas?
Pandas is a Python library designed for working with relational, or labeled,
data. It is aimed at practical, real-world data analysis, and its flexibility
makes it useful for a wide range of tasks: data manipulation, dataset
exploration, data analysis, and preparing data for machine learning.

Pandas represents data with two structures, Series and DataFrame. A Series
is a one-dimensional labeled array holding data of any type, such as
integers, strings, or Python objects. A DataFrame is a two-dimensional data
structure that holds data in tabular form, with rows and columns.

Why Pandas?
Pandas simplifies many of the time-consuming, repetitive tasks involved in
working with data, such as:

• Import datasets - available in the form of spreadsheets, comma-separated
values (CSV) files, and more.

• Data cleansing - dealing with missing values and representing them as
NaN, NA, or NaT.

• Size mutability - columns can be added and removed from DataFrame and
higher-dimensional objects.

• Data normalization - normalize the data into a suitable format for
analysis.

• Data alignment - objects can be explicitly aligned to a set of labels.

• Intuitive merging and joining of datasets - datasets can be merged and
joined.

• Reshaping and pivoting of datasets - datasets can be reshaped and pivoted
as per the need.

• Efficient manipulation and extraction - manipulation and extraction of
specific parts of extensive datasets using intelligent label-based slicing,
indexing, and subsetting techniques.

• Statistical analysis - perform statistical operations on datasets.

• Data visualization - visualize datasets and uncover insights.
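
The data alignment point in the list above can be sketched as follows. The region names and numbers are made-up illustration data, not taken from this handout; the sketch only shows that arithmetic between two Series matches values by label rather than by position.

```python
import pandas as pd

# Two Series with partly different labels (hypothetical illustration data)
q1 = pd.Series([100, 200, 300], index=["north", "south", "east"])
q2 = pd.Series([10, 20, 30], index=["south", "east", "west"])

# Addition pairs values by label; a label found on only one side gives NaN
total = q1 + q2
print(total)

# add() with fill_value treats a missing label as 0 instead of giving NaN
total_filled = q1.add(q2, fill_value=0)
print(total_filled)
```

Here "south" and "east" are summed because both Series carry those labels, while "north" and "west" stay NaN in the first result.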

Applications of Pandas
The most common applications of Pandas are as follows:

• Data Cleaning: Pandas provides functionality to clean messy data, deal
with incomplete or inconsistent data, handle missing values, remove
duplicates, and standardize formats for effective data analysis.

• Data Exploration: Pandas makes it easy to summarize statistics, find
trends, and visualize data using built-in plotting functions, Matplotlib, or
Seaborn integration.

• Data Preparation: Pandas can pivot, melt, convert variables, and merge
datasets based on common columns to prepare data for analysis.

• Data Analysis: Pandas supports descriptive statistics, time series
analysis, group-by operations, and custom functions.

• Data Visualisation: Pandas itself has basic plotting capabilities, and it
also works with data visualization libraries such as Matplotlib, Seaborn,
and Plotly.

• Time Series Analysis: Pandas supports date/time indexing, resampling,
frequency conversion, and rolling statistics for time series data.

• Data Aggregation and Grouping: The Pandas groupby() function lets you
aggregate data and compute group-wise summary statistics or apply functions
to groups.

• Data Input/Output: Pandas makes data import and export easy by reading
and writing CSV, Excel, JSON, SQL databases, and more.

• Machine Learning: Pandas works well with Scikit-learn for data
preparation, feature engineering, and model input data.

• Financial Analysis: Pandas is commonly used in finance for stock market
data analysis, financial indicator calculation, and portfolio optimization.

• Text Data Analysis: Pandas' string manipulation, regular expression, and
text mining functions help analyse textual data.

• Experimental Data Analysis: Pandas makes it easy to manipulate and
analyse large datasets, perform statistical tests, and visualize results.

Python Pandas Data Structures


Data structures in Pandas are designed to handle data efficiently. They allow
for the organization, storage, and modification of data in a way that
optimizes memory usage and computational performance. The Pandas library
provides two primary data structures for handling and analyzing data:

• Series

• DataFrame

Dimension and Description of Pandas Data Structures

Data Structure   Dimensions   Description

Series           1            A one-dimensional labeled homogeneous array,
                              size-immutable.
DataFrame        2            A two-dimensional labeled, size-mutable tabular
                              structure with potentially heterogeneously
                              typed columns.

Series

A Series is a one-dimensional labeled array that can hold any data type. It
can store integers, strings, floating-point numbers, etc. Each value in a
Series is associated with a label (index), which can be an integer or a string.

Name      Steve
Age       35
Gender    Male
Rating    3.5

Example

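Consider the following Series, which stores a name, an age, a gender, and a
rating under labeled index entries. All four values are given as strings, so
the Series is homogeneous:

import pandas as pd
# Four values, all strings, labeled through the index argument
data = ['Steve', '35', 'Male', '3.5']
series = pd.Series(data, index=['Name', 'Age', 'Gender', 'Rating'])
print(series)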

On executing the above program, you will get the following output −

Name      Steve
Age          35
Gender     Male
Rating      3.5
dtype: object

Key Points

Following are the key points related to the Pandas Series.

• Homogeneous data

• Size Immutable

• Values of Data Mutable
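
A minimal sketch of two of the key points above, using made-up numbers rather than data from this handout: a Series reports a single dtype for all of its values, and an individual value can be changed in place through its label.

```python
import pandas as pd

# A Series of integers has one dtype for all of its values
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s.dtype)

# Values are mutable: assign through the label
s["b"] = 25
print(s)

# Mixing strings and numbers still gives one dtype: the generic object type
mixed = pd.Series(["Steve", 35, 3.5])
print(mixed.dtype)
```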

DataFrame

A DataFrame is a two-dimensional labeled data structure with columns that
can hold different data types. It is similar to a table in a database or a
spreadsheet. Consider the following data representing the performance
rating of a sales team −

Name    Age   Gender   Rating
Steve   32    Male     3.45
Lia     28    Female   4.6
Vin     45    Male     3.9
Katie   38    Female   2.78

Example

The above tabular data can be represented in a DataFrame as follows −

import pandas as pd
# Data represented as a dictionary
data = {
    'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
    'Age': [32, 28, 45, 38],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Rating': [3.45, 4.6, 3.9, 2.78]
}
# Creating the DataFrame
df = pd.DataFrame(data)
print(df)

Output

On executing the above code you will get the following output −

    Name  Age  Gender  Rating
0  Steve   32    Male    3.45
1    Lia   28  Female    4.60
2    Vin   45    Male    3.90
3  Katie   38  Female    2.78


Key Points

Following are the key points related to the Pandas DataFrame −

• Heterogeneous data

• Size Mutable

• Data Mutable
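
The three key points above can be illustrated with a short sketch. It reuses the sales-team values from the example; the "Senior" column and the new age of 29 are hypothetical additions made only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Steve", "Lia", "Vin", "Katie"],
    "Age": [32, 28, 45, 38],
    "Rating": [3.45, 4.6, 3.9, 2.78],
})

# Heterogeneous data: each column has its own dtype
print(df.dtypes)

# Size mutable: add a column (hypothetical "Senior" flag), then remove one
df["Senior"] = df["Age"] > 35
df = df.drop(columns=["Rating"])

# Data mutable: change a single cell by row label and column name
df.loc[1, "Age"] = 29
print(df)
```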

Creating DataFrames

Creating a New DataFrame


import pandas as pd
data={"name":["rahul","neha","amit"],
"age":[12,15,27],
"Salary":[1200,1500,1200]
}
df=pd.DataFrame(data)
print(df)

Reading CSV File


import pandas as pd
data=pd.read_csv("book.csv")
print(data)

Reading Excel File


import pandas as pd
data=pd.read_excel("book1.xlsx")
print(data)
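
The reading examples above need the files book.csv and book1.xlsx on disk. As a sketch that runs without any file, read_csv also accepts a file-like object; the CSV text below is hypothetical and simply mirrors the name/age/Salary data used earlier.

```python
import io
import pandas as pd

# In-memory text standing in for a CSV file (hypothetical contents)
csv_text = io.StringIO(
    "name,age,Salary\n"
    "rahul,12,1200\n"
    "neha,15,1500\n"
    "amit,27,1200\n"
)

# usecols loads only the listed columns
df = pd.read_csv(csv_text, usecols=["name", "Salary"])
print(df)

# to_csv returns the CSV text as a string when no path is given
out = df.to_csv(index=False)
print(out)
```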

Exploring Data in Pandas

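Pandas provides several methods for a first look at a DataFrame:

head() - returns the first rows (five by default)

tail() - returns the last rows (five by default)

info() - prints the column names, non-null counts, and data types

describe() - returns summary statistics for the numeric columns

isnull() - returns True wherever a value is missing

isnull().sum() - counts the missing values in each column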
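
A minimal sketch of these methods on a small in-memory DataFrame. The rows below are hypothetical and include two missing values so that isnull() has something to report.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one missing salary
df = pd.DataFrame({
    "name": ["rahul", "neha", "amit", "priya", "karan"],
    "age": [12, 15, 27, np.nan, 31],
    "Salary": [1200, 1500, 1200, 1800, np.nan],
})

print(df.head(2))         # first two rows
print(df.tail(2))         # last two rows
df.info()                 # prints its report directly and returns None
print(df.describe())      # count, mean, std, min, quartiles, max
print(df.isnull())        # True/False table of missing values
print(df.isnull().sum())  # missing values per column
```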

Dealing With Duplicate Values

data.duplicated()
To flag every row that repeats an earlier row

import pandas as pd

data=pd.read_excel("salary.xlsx")

print(data.duplicated())

data["Emp_ID"].duplicated()
To flag repeated values in one column

import pandas as pd

data=pd.read_excel("salary.xlsx")

print(data["Emp_ID"].duplicated())

data["Emp_ID"].duplicated().sum()
To count the repeated values in one column

import pandas as pd

data=pd.read_excel("salary.xlsx")

print(data["Emp_ID"].duplicated().sum())

data.drop_duplicates("Emp_ID")
To drop rows whose Emp_ID repeats an earlier row

import pandas as pd

data=pd.read_excel("salary.xlsx")

print(data.drop_duplicates("Emp_ID"))
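
The examples above depend on salary.xlsx. The following sketch shows the same calls on a small in-memory DataFrame; the employee rows are hypothetical. It also shows the keep argument of drop_duplicates, which is not used in the examples above.

```python
import pandas as pd

# Hypothetical employee rows: E02 is repeated exactly, E01 with a changed name
data = pd.DataFrame({
    "Emp_ID": ["E01", "E02", "E02", "E03", "E01"],
    "Name": ["Amit", "Neha", "Neha", "Priya", "Amit K"],
})

# Whole-row check: only the exact repeat of E02/Neha is flagged
print(data.duplicated())

# Single-column check: both repeated IDs are counted
print(data["Emp_ID"].duplicated().sum())

# keep="first" (the default) keeps the earliest row for each ID
first = data.drop_duplicates(subset="Emp_ID")
# keep="last" keeps the latest row for each ID instead
last = data.drop_duplicates(subset="Emp_ID", keep="last")
print(first)
print(last)
```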

Working with missing values

data.isnull()
To print a True/False table marking the null values
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data.isnull())
data.isnull().sum()
To count the null values in each column
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data.isnull().sum())
data.dropna()
To drop every row that contains a null value
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data)
print("\n\n\n")
print(data.dropna())
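data.replace(np.nan,"hii")
To replace NaN with another value
import numpy as np
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data)
# replace() returns a new DataFrame, so print the result to see the change
print(data.replace(np.nan,"hii"))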
data["Salary"]=data["Salary"].replace(np.nan,30000)
To replace NaN in a single column
import pandas as pd
import numpy as np
data=pd.read_excel("salary.xlsx")
# Assign the result back so the change is kept in data
data["Salary"]=data["Salary"].replace(np.nan,30000)
print(data)
data["Salary"].mean()
To compute the mean of a column (null values are skipped)
import pandas as pd
import numpy as np
data=pd.read_excel("salary.xlsx")
print(data["Salary"].mean())
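data.bfill()
To fill each null value with the next valid value below it (backward fill).
Recent pandas versions deprecate fillna(method="bfill") in favour of bfill().
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data)
print("\n\n\n")
print(data.bfill())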
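data.ffill()
To fill each null value with the last valid value above it (forward fill).
Recent pandas versions deprecate fillna(method="ffill") in favour of ffill().
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data)
print("\n\n\n")
print(data.ffill())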
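
As a sketch of how these pieces fit together without salary.xlsx, the block below fills missing salaries with the column mean and then compares forward and backward fill. The four rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing salaries
data = pd.DataFrame({
    "Name": ["Amit", "Neha", "Priya", "Karan"],
    "Salary": [45000, np.nan, 30000, np.nan],
})

# mean() skips NaN, so this is (45000 + 30000) / 2
mean_salary = data["Salary"].mean()

# A dict passed to fillna sets the fill value per column
filled = data.fillna({"Salary": mean_salary})
print(filled)

# Forward fill copies the last valid value downwards
forward = data["Salary"].ffill()
# Backward fill copies the next valid value upwards; the last row has no
# later value, so it stays NaN
backward = data["Salary"].bfill()
print(forward)
print(backward)
```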
Column transformation in Pandas

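To create a new column

import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data,"\n\n")
# Rows with no bonus get the label "No Bonus" in the new GetBonus column
data.loc[data["Bonus"] == 0,"GetBonus"]="No Bonus"
# Rows with a positive bonus get the label "Bonus"
data.loc[data["Bonus"] > 0,"GetBonus"]="Bonus"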

print(data)
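
The same two-way labelling can also be written in one step with numpy.where, which picks one value where a condition is True and another where it is False. This is an alternative sketch, not the method used above, and the Bonus values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical bonus amounts
data = pd.DataFrame({
    "Name": ["Amit", "Neha", "Priya", "Karan"],
    "Bonus": [0, 500, 0, 1200],
})

# "Bonus" where the amount is positive, "No Bonus" everywhere else
data["GetBonus"] = np.where(data["Bonus"] > 0, "Bonus", "No Bonus")
print(data)
```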

To merge two columns

import pandas as pd

data=pd.read_excel("salary.xlsx")

print(data,"\n\n")

data["Full name"]=data["Name"]+" "+data["Last Name"]

print(data)

To add a calculated column

import pandas as pd

data=pd.read_excel("salary.xlsx")

print(data,"\n\n")

data["Bonus"]=(data["Salary"]/100)*20

print(data)

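To extract some letters from a DataFrame column

import pandas as pd
data={"Month":["January","February","March","April"]}
a=pd.DataFrame(data)
print(a)
# Return the first three letters of a value
def extract(value):
    return value[0:3]
# map() applies extract to every value in the Month column
a["Short_Months"]=a["Month"].map(extract)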

print(a)
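
For string columns, the .str accessor offers the same slicing without a helper function. The sketch below is an alternative to the map() approach above; the "Upper" column is a hypothetical extra added only to show a second string method.

```python
import pandas as pd

a = pd.DataFrame({"Month": ["January", "February", "March", "April"]})

# .str[:3] slices every string in the column
a["Short_Months"] = a["Month"].str[:3]

# Other string methods are available the same way
a["Upper"] = a["Month"].str.upper()
print(a)
```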

GroupBy In Pandas
Count Gender by Department

import pandas as pd
data=pd.read_excel("Salary.xlsx")
print(data)
gp=data.groupby("Department").agg({"Gender":"count"})
print(gp)
Count Emp_ID by Job Title

import pandas as pd
data=pd.read_excel("Salary.xlsx")
print(data)
gp=data.groupby("Job Title").agg({"Emp_ID":"count"})
print(gp)

Count Emp_ID by Department and Gender

import pandas as pd
data=pd.read_excel("Salary.xlsx")
print(data)
gp=data.groupby(["Department","Gender"]).agg({"Emp_ID":"count"})
print(gp)
Maximum Age by Countries
import pandas as pd
data=pd.read_excel("Salary.xlsx")
print(data)
print("\n\n\n")
a=data.groupby("Countries").agg({"Age":"max"})
print(a)
Maximum Age by Countries and Gender
import pandas as pd
data=pd.read_excel("Salary.xlsx")
print(data)
print("\n\n\n")
a=data.groupby(["Countries","Gender"]).agg({"Age":"max"})
print(a)
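
The groupby examples above each compute one statistic. As a sketch, several statistics can be computed in one call with named aggregation, where each output column is given as name=(column, function). The five rows and the output names (employees, oldest, avg_salary) are hypothetical.

```python
import pandas as pd

# Hypothetical employee data standing in for Salary.xlsx
data = pd.DataFrame({
    "Emp_ID": ["E01", "E02", "E03", "E04", "E05"],
    "Department": ["IT", "IT", "HR", "HR", "IT"],
    "Gender": ["Male", "Female", "Female", "Male", "Male"],
    "Age": [34, 28, 45, 38, 25],
    "Salary": [45000, 47000, 30000, 14200, 42300],
})

# One row per department, three statistics per row
summary = data.groupby("Department").agg(
    employees=("Emp_ID", "count"),
    oldest=("Age", "max"),
    avg_salary=("Salary", "mean"),
)
print(summary)

# reset_index() turns the group keys back into ordinary columns
flat = (
    data.groupby(["Department", "Gender"])
    .agg({"Emp_ID": "count"})
    .reset_index()
)
print(flat)
```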

Merge Join and Concatenate in Pandas


Merge
Merge two DataFrames on the basis of the shared EEID column
import pandas as pd
data1={"EEID":["A01","A02","A03","A04","A05","A06"],
"Name":["Amit","priya","Neha","Lovely","Karab","Mohit"],
"Age":[34,56,24,27,28,26]}
print(data1)
data2={"EEID":["A01","A02","A03","A04","A05","A06"],
"Salary":[45000,47000,30000,14200,42300,456600]}
print(data2)
print("\n\n\n")
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
print(df1)
print()
print(df2)
print()
print(pd.merge(df1,df2,on="EEID"))
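Use of how
The how argument selects which keys are kept: "inner" keeps the keys found
in both DataFrames, "left" keeps all keys of the first DataFrame, and
"right" keeps all keys of the second.
P1 (inner)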
import pandas as pd
data1={"EEID":["A01","A02","A03","A04","A05","A06"],
"Name":["Amit","priya","Neha","Lovely","Karab","Mohit"],
"Age":[34,56,24,27,28,26]}
print(data1)
data2={"EEID":["A01","A02","A03","A04","A05","A06"],
"Salary":[45000,47000,30000,14200,42300,456600]}
print(data2)
print("\n\n\n")
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
print(df1)
print()
print(df2)
print()
print(pd.merge(df1,df2,on="EEID", how="inner"))
P2 (left, reusing df1 and df2 from P1)
print(pd.merge(df1,df2,on="EEID", how="left"))
P3 (right, reusing df1 and df2 from P1)
print(pd.merge(df1,df2,on="EEID", how="right"))
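
In P1 to P3 both DataFrames contain the same six EEID values, so all three settings print the same rows. The sketch below uses shortened, partly different key sets (an assumption made for illustration) so that the settings differ. It also shows how="outer" and indicator=True, which are merge options not used in the examples above.

```python
import pandas as pd

# A01 appears only in df1, A04 only in df2
df1 = pd.DataFrame({"EEID": ["A01", "A02", "A03"],
                    "Name": ["Amit", "priya", "Neha"]})
df2 = pd.DataFrame({"EEID": ["A02", "A03", "A04"],
                    "Salary": [47000, 30000, 14200]})

inner = pd.merge(df1, df2, on="EEID", how="inner")  # keys in both
left = pd.merge(df1, df2, on="EEID", how="left")    # all keys of df1
right = pd.merge(df1, df2, on="EEID", how="right")  # all keys of df2

# outer keeps every key; indicator adds a _merge column naming the source
outer = pd.merge(df1, df2, on="EEID", how="outer", indicator=True)

print(inner)
print(left)
print(right)
print(outer)
```

Rows that have no partner on the other side get NaN in the columns that come from that side.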

Concatenate
import pandas as pd
data1={"EEID":["A01","A02","A03","A04","A05","A06"],
"Name":["Amit","priya","Neha","Lovely","Karan","Mohit"]}
data2={"EEID":["A07","A08","A09","A010","A11","A12"],
"Name":["Atin","Pankaj","Alia","Suman","Sanjay","Karan"]}
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
print(df1)
print(df2)
print()
ndf=pd.concat([df1,df2])
print(ndf)
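
By default concat keeps each DataFrame's original row labels, so the combined result above repeats the labels 0 to 5. A minimal sketch with shortened data shows the ignore_index option, which renumbers the rows instead.

```python
import pandas as pd

df1 = pd.DataFrame({"EEID": ["A01", "A02"], "Name": ["Amit", "priya"]})
df2 = pd.DataFrame({"EEID": ["A07", "A08"], "Name": ["Atin", "Pankaj"]})

# Original labels are kept: 0, 1, 0, 1
stacked = pd.concat([df1, df2])
print(stacked)

# ignore_index=True renumbers the rows: 0, 1, 2, 3
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered)
```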

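Join
join() combines rows by index label (here the default labels 0 to 5), not by
the ID values, and it requires that the two DataFrames have no column name
in common unless a suffix is given.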
import pandas as pd
data1={"EEI":["A01","A02","A03","A04","A05","A06"],
"Name":["Amit","priya","Neha","Lovely","Karab","Mohit"]}
print(data1)
data2={"EEID":["A09","A02","A03","A010","A05","A06"],
"Salary":[45000,47000,30000,14200,42300,456600]}
print(data2)
print("\n\n\n")
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
print(df1)
print()
print(df2)
print()
print(df1.join(df2))
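
To join on the ID values rather than on row position, one option is to move the ID column into the index of both DataFrames first. This is a sketch under the assumption that both DataFrames name the key column EEID; the rows are shortened, hypothetical versions of the data above.

```python
import pandas as pd

df1 = pd.DataFrame({"EEID": ["A01", "A02", "A03"],
                    "Name": ["Amit", "priya", "Neha"]})
df2 = pd.DataFrame({"EEID": ["A02", "A03", "A04"],
                    "Salary": [47000, 30000, 14200]})

# set_index makes EEID the row label, so join matches on the ID values.
# join is a left join by default: every row of df1 is kept.
joined = df1.set_index("EEID").join(df2.set_index("EEID"))
print(joined)
```

A01 has no salary in df2, so its Salary is NaN, and A04 is dropped because it does not appear in df1.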
