
Pandas - Data Analysis and Visualisation with Python


Basic IO operations in Python

syntax
fileobject = open(filename [, accessmode][, buffering])
○ filename is the name of the file we want to open.
○ accessmode determines the mode in which the file is opened (list of possible values given below).
○ If buffering is set to 0, no buffering happens; if set to 1, line buffering happens; if a negative value is given, the system default buffering behaviour is used.

Mode  Description
r     Read only (default)
rb    Read only in binary format
r+    Read and write both
rb+   Read and write both in binary format
w     Write only (creates the file if it does not exist)
wb    Write only in binary format (creates the file if it does not exist)
w+    Read and write both (creates the file if it does not exist)
wb+   Read and write both in binary format
a     Append; creates the file for writing if it does not exist
ab    Append in binary format; creates the file for writing if it does not exist
a+    Append and read; creates the file for reading & writing if it does not exist
ab+   Append and read in binary format; creates the file if it does not exist
How to write a path?
We can specify a relative path as the argument to the open method; alternatively, we can specify an absolute path.
To specify an absolute path,
○ In Windows: f = open('D:\\folder\\subfolder\\filename.txt')
○ In macOS & Linux: f = open('/user/folder/subfolder/filename.txt')

We are supposed to close the file once we are done using it, by calling the close() method.
closefile.py
1 f = open('college.txt')
2 data = f.read()
3 print(data)
4 f.close()
Handling errors using the “with” keyword
It is possible that we have a typo in the filename, or that the file we specified has been moved/deleted; in such cases there will be an error while running the program.
To handle such situations we can open the file using the with keyword, which guarantees that the file is closed even if an error occurs inside the block.
fileusingwith.py
1 with open('college.txt') as f :
2 data = f.read()
3 print(data)

When we open a file using with, we do not need to close it explicitly.
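Note that with by itself does not suppress the error when the file is missing; a minimal sketch (reusing the illustrative college.txt file name) that combines it with try/except could look like this:

try:
    with open('college.txt') as f:
        data = f.read()
        print(data)
except FileNotFoundError as e:
    # reached when the file name is wrong or the file was moved/deleted
    print('Could not open file:', e)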
Example : Write file in Python
The write() method writes the specified data to the file.
readdemo.py
1 with open('college.txt','a') as f :
2 f.write('Hello world')

If we open a file in ‘w’ mode, it will overwrite the existing file, or create a new file if it does not exist.
If we open a file in ‘a’ mode, it will append the data at the end of the existing file, or create a new file if it does not exist.
Reading CSV files without any library functions
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values.
Each line of the file is a data record; each record consists of one or more fields, separated by commas.
Example : Book1.csv
studentname,enrollment,cpi
abcd,123456,8.5
bcde,456789,2.5
cdef,321654,7.6

readlines.py
with open('Book1.csv') as f:
    rows = f.readlines()
    isFirstLine = True
    for r in rows:
        if isFirstLine:
            isFirstLine = False
            continue
        cols = r.split(',')
        print('Student Name = ', cols[0], end=" ")
        print('\tEn. No. = ', cols[1], end=" ")
        print('\tCPI = \t', cols[2])

We can use Microsoft Excel to access CSV files.
What is a Package?
A package contains all the files you need for a module.

Modules are Python code libraries you can include in your project.

Download a Package
Downloading a package is very easy.

Open the command line interface and tell PIP to download the package you want.

Navigate your command line to the location of Python's script directory, and type the following:

Example

Download a package named "camelcase":

C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install camelcase


Using a Package
Once the package is installed, it is ready to use.

Import the "camelcase" package into your project.

Example
Import and use "camelcase":
import camelcase

c = camelcase.CamelCase()

txt = "hello world"

print(c.hump(txt))
Output: Hello World

Find more packages at https://pypi.org/.


How to Create Package in Python?

● Create a Directory: Start by creating a directory (folder) for your package. This directory will serve as the root
of your package structure.
● Add Modules: Within the package directory, you can add Python files (modules) containing your code. Each
module should represent a distinct functionality or component of your package.
● Init File: Include an __init__.py file in the package directory. This file can be empty or can contain initialization code for your package. It signals to Python that the directory should be treated as a package.
● Subpackages: You can create sub-packages within your package by adding additional directories containing
modules, along with their own __init__.py files.
● Importing: To use modules from your package, import them into your Python scripts using dot notation. For
example, if you have a module named module1.py inside a package named mypackage, you would import its
function like this: from mypackage.module1 import greet.
● Distribution: If you want to distribute your package for others to use, you can create a setup.py file using
Python’s setuptools library. This file defines metadata about your package and specifies how it should be
installed.
Code Example

Here’s a basic code sample demonstrating how to create a simple Python package:
1. Create a directory named mypackage.
2. Inside mypackage, create two Python files: module1.py and module2.py.
3. Create an __init__.py file inside mypackage (it can be empty).
4. Add some code to the modules.
5. Finally, demonstrate how to import and use the modules from the package.

mypackage/

├── __init__.py
├── module1.py
└── module2.py
# module1.py
def greet(name):
    print(f"Hello, {name}!")

Example: Now, let’s create a Python script outside the mypackage directory to import and use these modules (see the sketch below).
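A minimal sketch of the remaining pieces; the contents of module2.py, its multiply function, and the main.py file name are assumptions made for illustration:

# module2.py (assumed content)
def multiply(a, b):
    return a * b

# main.py -- placed outside the mypackage directory
from mypackage.module1 import greet
from mypackage.module2 import multiply

greet("World")           # prints: Hello, World!
print(multiply(3, 4))    # prints: 12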
Pandas
• Pandas, like NumPy, is one of the most popular Python libraries for
data analysis.
• It is a high-level abstraction over low-level NumPy, which is written in
pure C.
• Pandas provides high-performance, easy-to-use data structures and
data analysis tools.
• There are two main structures used by pandas: data frames and series.
Pandas: series, dataframe, panel
• Created by Wes McKinney in 2008, now maintained by many others.
• Author of one of the textbooks: Python for Data Analysis

• Powerful and productive Python data analysis and Management Library


• Panel Data System
• The name is derived from the term "panel data", an econometrics term
for data sets that include both time-series and cross-sectional data
• It's an open-source product.
Power of pandas
Merge, join and concat in pandas
Pandas real world example
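These slide titles originally pointed to screenshots; as a stand-in, a minimal sketch of merge, join and concat on two small made-up DataFrames:

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['abcd', 'bcde', 'cdef']})
right = pd.DataFrame({'id': [1, 2, 4], 'cpi': [8.5, 2.5, 7.6]})

# merge: SQL-style join on a key column
print(pd.merge(left, right, on='id', how='inner'))

# join: align two DataFrames on their index
print(left.set_index('id').join(right.set_index('id'), how='left'))

# concat: stack DataFrames one after another
print(pd.concat([left, right], ignore_index=True))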
Read CSV in Pandas
read_csv() is used to read Comma Separated Values (CSV) file into a pandas
DataFrame.
Some important parameters:
• filePath : str, path object, or file-like object
• sep : separator (default is comma)
• header : row number(s) to use as the column names
• index_col : column(s) to use as the index of the data frame

readCSV.py
df = pd.read_csv('Marks.csv', index_col=0, header=0)
print(df)

Output
     PDS  Algo  SE   INS
101   50    55  60  55.0
102   70    80  61  66.0
103   55    89  70  77.0
104   58    96  85  88.0
201   77    96  63  66.0
Read Excel in Pandas
Read an Excel file into a pandas DataFrame.
Supports xls, xlsx, xlsm, xlsb, odf, ods and odt file extensions read from a local
filesystem or URL. Supports an option to read a single sheet or a list of sheets.
Some important parameters:
• excelFile : str, bytes, ExcelFile, xlrd.Book, path object, or file-like object
• sheet_name : sheet number (int) or sheet name (str); can also be a list of sheets
• index_col : column to use as the index of the data frame
import pandas as pd

df = pd.read_excel('Example.xlsx')
print(df)
Read from MySQL Database
We need two libraries for that,
• conda install sqlalchemy
• conda install pymysql

After installing both libraries, import create_engine from sqlalchemy and import pymysql.
importsForDB.py
1 from sqlalchemy import create_engine
2 import pymysql

Then, create a database connection string and create an engine using it.

createEngine.py
db_connection_str = 'mysql+pymysql://username:password@host/dbname'
db_connection = create_engine(db_connection_str)
Read from MySQL Database (Cont.)
After getting the engine, we can run any SQL query using the pd.read_sql method.
read_sql is a generic method which can be used to read from any SQL database (MySQL, MSSQL, Oracle, etc.).

readSQLDemo.py
df = pd.read_sql('SELECT * FROM cities', con=db_connection)
print(df)

Output
CityID CityName CityDescription CityCode
0 1 Rajkot Rajkot Description here RJT
1 2 Ahemdabad Ahemdabad Description here ADI
2 3 Surat Surat Description here SRT
Series
Series is a one-dimensional array with axis labels.
It supports both integer and label-based indexing, but the index must be of a hashable type.
If we do not specify an index, it will assign a zero-based integer index.

syntax
import pandas as pd
s = pd.Series(data, index, dtype)

Parameters
data = array-like Iterable
index = array-like index
dtype = data-type

pandasSeries.py
import pandas as pd
s = pd.Series([1, 3, 5, 7, 9, 11])
print(s)

Output
0     1
1     3
2     5
3     7
4     9
5    11
dtype: int64
Series (Cont.)
We can then access the elements inside a Series just like an array, using square-bracket notation.
pdSeriesEle.py
import pandas as pd
s = pd.Series([1, 3, 5, 7, 9, 11])
print("S[0] = ", s[0])
b = s[0] + s[1]
print("Sum = ", b)

Output
S[0] =  1
Sum =  4

We can specify the data type of a Series using the dtype parameter.


pdSeriesdtype.py
import pandas as pd
s = pd.Series([1, 3, 5, 7, 9, 11], dtype='str')
print("S[0] = ", s[0])
b = s[0] + s[1]
print("Sum = ", b)

Output
S[0] =  1
Sum =  13
Series (Cont.)
We can specify an index for a Series with the help of the index parameter.
pdSeriesdtype.py
import numpy as np
import pandas as pd
i = ['name', 'address', 'phone', 'email', 'website']
d = ['kunal', 'rj', '123', 'k@d.com', 'kunal.ac.in']
s = pd.Series(data=d, index=i)
print(s)

Output
name             kunal
address             rj
phone              123
email          k@d.com
website    kunal.ac.in
dtype: object
Series
• One dimensional array-like object
• It contains array of data with associated indexes.
(Indexes can be strings or integers or other data types.)
• By default, the series gets an integer index from 0 to N-1, where N is the size.

from pandas import Series, DataFrame
import pandas as pd

obj = Series([4, 7, -5, 3])
print(obj)
print(obj.index)
print(obj.values)

# Output
0    4
1    7
2   -5
3    3
dtype: int64
RangeIndex(start=0, stop=4, step=1)
[ 4  7 -5  3]
Indices in a pandas series
• A pandas series is similar to a list, but differs in the fact that a series
associates a label with each element. This makes it look like a dictionary.
• If an index is not explicitly provided by the user, pandas creates a RangeIndex
ranging from 0 to N-1.
• Each series object also has a data type.

• As you may suspect by this point, a series has ways to extract all of
the values in the series, as well as individual elements by index.

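The original slides showed screenshots here; a minimal sketch of what they likely illustrated, with made-up values:

import pandas as pd

s = pd.Series([5, 6, 7, 8, 9, 10])
print(s.index)     # RangeIndex(start=0, stop=6, step=1), created automatically
print(s.dtype)     # int64, the data type of the series
print(s.values)    # all values as a NumPy array: [ 5  6  7  8  9 10]
print(s[4])        # a single element by its index: 9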

• You can also provide an index manually.


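A sketch of providing an index manually (the labels and values are made up):

import pandas as pd

s = pd.Series([5, 6, 7, 8], index=['a', 'b', 'c', 'd'])
print(s['c'])      # access by label instead of position: 7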
• It is easy to retrieve several elements of a series by their indices or
make group assignments.

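A sketch of selecting several elements by their indices and of a group assignment (same made-up labels):

import pandas as pd

s = pd.Series([5, 6, 7, 8], index=['a', 'b', 'c', 'd'])
print(s[['a', 'd']])    # several elements selected by a list of labels
s[['a', 'b']] = 0       # group assignment to several elements at once
print(s)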
Filtering and maths operations
• Filtering and maths operations are easy with Pandas as well.

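A sketch of filtering and arithmetic on a Series (values made up):

import pandas as pd

s = pd.Series([5, 6, 7, 8], index=['a', 'b', 'c', 'd'])
print(s[s > 6])    # boolean filtering keeps only matching elements
print(s * 2)       # arithmetic is applied element-wise
print(s + 10)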
Pandas data frame
• Simplistically, a data frame is a table, with rows and columns.
• Each column in a data frame is a series object.
• Rows consist of elements inside series.

Case ID   Variable one   Variable two   Variable 3
1         123            ABC            10
2         456            DEF            20
3         789            XYZ            30
Creating a Pandas data frame
• Pandas data frames can be constructed using Python dictionaries.
• You can also create a data frame from a list.

• You can ascertain the type of a column with the type() function.

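The screenshots for the three bullets above are missing; a sketch with made-up country data shows a data frame built from a dictionary, one built from a list, and the type of a column:

import pandas as pd

# From a dictionary of equal-length lists
df = pd.DataFrame({'country': ['Kazakhstan', 'Russia', 'Belarus'],
                   'population': [17.04, 143.5, 9.5]})
print(df)

# From a list of rows, with column names supplied separately
rows = [['Kazakhstan', 17.04], ['Russia', 143.5], ['Belarus', 9.5]]
df2 = pd.DataFrame(rows, columns=['country', 'population'])
print(df2)

# Each column of a data frame is a Series
print(type(df['country']))    # <class 'pandas.core.series.Series'>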
• A Pandas data frame object has two indices: a column index and a row index.
• Again, if you do not provide one, Pandas will create a RangeIndex from 0 to N-1.
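A sketch of the two indices of a data frame (same made-up data):

import pandas as pd

df = pd.DataFrame({'country': ['Kazakhstan', 'Russia'], 'population': [17.04, 143.5]})
print(df.columns)   # Index(['country', 'population'], dtype='object')
print(df.index)     # RangeIndex(start=0, stop=2, step=1), created by default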
• There are numerous ways to provide row indices explicitly.
• For example, you could provide an index when creating a data frame:


• or do it during runtime.
• Here, I also named the index ‘country code’.
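A sketch of both options; the country codes used as row labels are assumptions for illustration:

import pandas as pd

data = {'country': ['Kazakhstan', 'Russia', 'Belarus'], 'population': [17.04, 143.5, 9.5]}

# Provide the row index when creating the data frame
df = pd.DataFrame(data, index=['KZ', 'RU', 'BY'])
print(df)

# ...or set and name it during runtime
df2 = pd.DataFrame(data)
df2.index = ['KZ', 'RU', 'BY']
df2.index.name = 'country code'
print(df2)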
• Row access using the index can be performed in several ways.
• First, you could use .loc[] and provide an index label.
• Second, you could use .iloc[] and provide an integer position.
Both are shown in the sketch below.
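A sketch of both access styles, using the same made-up data frame:

import pandas as pd

df = pd.DataFrame({'country': ['Kazakhstan', 'Russia', 'Belarus'],
                   'population': [17.04, 143.5, 9.5]}, index=['KZ', 'RU', 'BY'])

print(df.loc['KZ'])    # row selected by its index label
print(df.iloc[0])      # the same row selected by its integer position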
• A selection of particular rows and columns can be made in the same way.
• You can feed .loc[] two arguments, an index list and a column list; slicing is supported as well, as sketched below:
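A sketch of selecting particular rows and columns with .loc[], including slicing (same made-up data):

import pandas as pd

df = pd.DataFrame({'country': ['Kazakhstan', 'Russia', 'Belarus'],
                   'population': [17.04, 143.5, 9.5]}, index=['KZ', 'RU', 'BY'])

print(df.loc[['KZ', 'BY'], 'population'])         # a list of row labels, one column
print(df.loc['KZ':'RU', 'country':'population'])  # label slices are inclusive on both ends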
Filtering
• Filtering is performed using so-called Boolean arrays.
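A sketch of filtering with a Boolean array (same made-up data):

import pandas as pd

df = pd.DataFrame({'country': ['Kazakhstan', 'Russia', 'Belarus'],
                   'population': [17.04, 143.5, 9.5]})

mask = df['population'] > 10    # a Boolean array, one entry per row
print(df[mask])                 # keeps only the rows where the mask is True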
Deleting columns
• You can delete a column using the drop() function.
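A sketch of dropping a column with drop() (same made-up data); note that it returns a new data frame unless inplace=True is passed:

import pandas as pd

df = pd.DataFrame({'country': ['Kazakhstan', 'Russia'], 'population': [17.04, 143.5]})
print(df.drop(['population'], axis=1))   # drop a column: axis=1
print(df.drop([0]))                      # drop a row by its index label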
Reading from and writing to a file
• Pandas supports many popular file formats including CSV, XML, HTML,
Excel, SQL, JSON, etc.
• Out of all of these, CSV is the file format that you will work with the
most.
• You can read in the data from a CSV file using the read_csv() function.

• Similarly, you can write a data frame to a csv file with the to_csv()
function.
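A minimal round-trip sketch; the file name countries.csv is illustrative:

import pandas as pd

df = pd.DataFrame({'country': ['Kazakhstan', 'Russia'], 'population': [17.04, 143.5]})
df.to_csv('countries.csv', index=False)   # write the data frame to a CSV file
df2 = pd.read_csv('countries.csv')        # read it back into a new data frame
print(df2)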
Indexing, selection and filtering
• Series can be sliced/accessed with label-based indexes, or using
position-based indexes
S = Series(range(4), index=['zero', 'one', 'two', 'three'])
print(S['two'])
2
print(S[['zero', 'two']])
zero 0
two 2
dtype: int64
print(S[2])
2
print(S[[0,2]])
zero 0
two 2
dtype: int64
Note: pass a list inside the brackets to select more than one item.
Indexing, selection and filtering
• Series can be sliced/accessed with label-based indexes, or using
position-based indexes
S = Series(range(4), index=['zero', 'one', 'two', 'three'])
print(S[:2])
zero    0
one     1
dtype: int32

print(S['zero':'two'])      # label slicing is inclusive
zero    0
one     1
two     2
dtype: int32

print(S[S > 1])
two      2
three    3
dtype: int32

print(S[-2:])
two      2
three    3
dtype: int32
DataFrame
• A DataFrame is a tabular data structure comprised of rows and columns,
akin to a spreadsheet or database table.
• It can be treated as an ordered collection of columns
• Each column can be a different data type
• Have both row and column indices
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame)
#output
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
DataFrame – specifying columns and indices
• Order of columns/rows can be specified.
• Columns not in data will have NaN.
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['A', 'B', 'C', 'D', 'E'])

print(frame2)
   year   state  pop debt
A  2000    Ohio  1.5  NaN
B  2001    Ohio  1.7  NaN
C  2002    Ohio  3.6  NaN
D  2001  Nevada  2.4  NaN
E  2002  Nevada  2.9  NaN

The columns appear in the specified order; the 'debt' column is not present in the data, so it is initialized with NaN.

DataFrame – from nested dict of dicts
• Outer dict keys as columns and inner dict keys as row indices
pop = {'Nevada': {2001: 2.9, 2002: 2.9}, 'Ohio': {2002: 3.6, 2001: 1.7, 2000: 1.5}}
frame3 = DataFrame(pop)
print(frame3)
#output
Nevada Ohio
2000 NaN 1.5
2001 2.9 1.7
2002 2.9 3.6
The row index is the union of the inner keys (in sorted order).

Transpose:
print(frame3.T)
        2000  2001  2002
Nevada   NaN   2.9   2.9
Ohio     1.5   1.7   3.6
DataFrame – index, columns, values
frame3.index.name = 'year'
frame3.columns.name = 'state'
print(frame3)
state Nevada Ohio
year
2000 NaN 1.5
2001 2.9 1.7
2002 2.9 3.6

print(frame3.index)
Int64Index([2000, 2001, 2002], dtype='int64', name='year')

print(frame3.columns)
Index(['Nevada', 'Ohio'], dtype='object', name='state')

print(frame3.values)
[[nan 1.5]
[2.9 1.7]
[2.9 3.6]]
DataFrame – retrieving a column
• A column in a DataFrame can be retrieved as a Series by dict-like
notation or as attribute
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame['state'])
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

print(frame.state)
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object
DataFrame – getting rows
• loc for using indexes and iloc for using positions
• loc gets rows (or columns) with particular labels from the index.
• iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['A', 'B', 'C', 'D', 'E'])
print(frame2)
   year   state  pop debt
A  2000    Ohio  1.5  NaN
B  2001    Ohio  1.7  NaN
C  2002    Ohio  3.6  NaN
D  2001  Nevada  2.4  NaN
E  2002  Nevada  2.9  NaN

print(frame2.loc['A'])
year     2000
state    Ohio
pop       1.5
debt      NaN
Name: A, dtype: object

print(frame2.loc[['A', 'B']])
   year state  pop debt
A  2000  Ohio  1.5  NaN
B  2001  Ohio  1.7  NaN

print(frame2.loc['A':'E', ['state', 'pop']])
    state  pop
A    Ohio  1.5
B    Ohio  1.7
C    Ohio  3.6
D  Nevada  2.4
E  Nevada  2.9

print(frame2.iloc[1:3])
   year state  pop debt
B  2001  Ohio  1.7  NaN
C  2002  Ohio  3.6  NaN

print(frame2.iloc[:, 1:3])
    state  pop
A    Ohio  1.5
B    Ohio  1.7
C    Ohio  3.6
D  Nevada  2.4
E  Nevada  2.9
DataFrame – modifying columns
frame2['debt'] = 0
print(frame2)
   year   state  pop  debt
A  2000    Ohio  1.5     0
B  2001    Ohio  1.7     0
C  2002    Ohio  3.6     0
D  2001  Nevada  2.4     0
E  2002  Nevada  2.9     0

frame2['debt'] = range(5)
print(frame2)
   year   state  pop  debt
A  2000    Ohio  1.5     0
B  2001    Ohio  1.7     1
C  2002    Ohio  3.6     2
D  2001  Nevada  2.4     3
E  2002  Nevada  2.9     4

val = Series([10, 10, 10], index=['A', 'C', 'D'])
frame2['debt'] = val
print(frame2)
   year   state  pop  debt
A  2000    Ohio  1.5  10.0
B  2001    Ohio  1.7   NaN
C  2002    Ohio  3.6  10.0
D  2001  Nevada  2.4  10.0
E  2002  Nevada  2.9   NaN

Rows or individual elements can be modified similarly, using loc or iloc.
DataFrame – removing columns
del frame2['debt']
print(frame2)
year state pop
A 2000 Ohio 1.5
B 2001 Ohio 1.7
C 2002 Ohio 3.6
D 2001 Nevada 2.4
E 2002 Nevada 2.9
More on DataFrame indexing
import numpy as np
data = np.arange(9).reshape(3,3)
print(data)
[[0 1 2]
[3 4 5]
[6 7 8]]

frame = DataFrame(data, index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])


print(frame)
    c1  c2  c3
r1   0   1   2
r2   3   4   5
r3   6   7   8

print(frame['c1'])
r1    0
r2    3
r3    6
Name: c1, dtype: int32

print(frame.loc['r1'])
c1    0
c2    1
c3    2
Name: r1, dtype: int32

print(frame['c1']['r1'])
0

print(frame[['c1', 'c3']])
    c1  c3
r1   0   2
r2   3   5
r3   6   8

print(frame.loc[['r1', 'r3']])
    c1  c2  c3
r1   0   1   2
r3   6   7   8

print(frame.iloc[:2])    # row slice by position
    c1  c2  c3
r1   0   1   2
r2   3   4   5

print(frame[:2])         # row slice
    c1  c2  c3
r1   0   1   2
r2   3   4   5
More on DataFrame indexing - 2
print(frame.loc[['r1', 'r2'], ['c1', 'c2']])
    c1  c2
r1   0   1
r2   3   4

print(frame.loc['r1':'r3', 'c1':'c3'])
    c1  c2  c3
r1   0   1   2
r2   3   4   5
r3   6   7   8

print(frame.iloc[:2, :2])
    c1  c2
r1   0   1
r2   3   4

v = DataFrame(np.arange(9).reshape(3,3), index=['a', 'a', 'b'], columns=['c1','c2','c3'])


print(v)
   c1  c2  c3
a   0   1   2
a   3   4   5
b   6   7   8
Note the duplicated index keys ('a').

print(v.loc['a'])
c1 c2 c3
a 0 1 2
a 3 4 5
More on DataFrame indexing - 3
print(frame)
    c1  c2  c3
r1   0   1   2
r2   3   4   5
r3   6   7   8

print(frame['c1'] > 0)
r1    False
r2     True
r3     True
Name: c1, dtype: bool

print(frame[frame['c1'] > 0])
    c1  c2  c3
r2   3   4   5
r3   6   7   8

print(frame < 3)
       c1     c2     c3
r1   True   True   True
r2  False  False  False
r3  False  False  False

frame[frame<3] = 3
print(frame)
c1 c2 c3
r1 3 3 3
r2 3 4 5
r3 6 7 8
Removing rows/columns
print(frame)
c1 c2 c3
r1 0 1 2
r2 3 4 5
r3 6 7 8

print(frame.drop(['r1'])) This returns a new object


c1 c2 c3
r2 3 4 5
r3 6 7 8

print(frame.drop(['r1','r3']))
c1 c2 c3
r2 3 4 5

print(frame.drop(['c1'], axis=1))
c2 c3
r1 1 2
r2 4 5
r3 7 8
Reindexing
• Alter the order of rows/columns of a DataFrame or order of a series
according to new index
frame2 = frame.reindex(columns=['c2', 'c3', 'c1'])

print(frame2) This returns a new object


c2 c3 c1
r1 1 2 0
r2 4 5 3
r3 7 8 6

frame2 = frame.reindex(['r1', 'r3', 'r2', 'r4'])

print(frame2)
     c1   c2   c3
r1  0.0  1.0  2.0
r3  6.0  7.0  8.0
r2  3.0  4.0  5.0
r4  NaN  NaN  NaN
Function application and mapping
• DataFrame.applymap(f) applies f to every entry
• DataFrame.apply(f) applies f to every column (default) or row
print(frame)
    c1  c2  c3
r1   0   1   2
r2   3   4   5
r3   6   7   8

def square(x):
    return x**2

print(frame.applymap(square))
    c1  c2  c3
r1   0   1   4
r2   9  16  25
r3  36  49  64

def max_minus_min(x):
    return max(x) - min(x)

print(frame.apply(max_minus_min))
c1    6
c2    6
c3    6
dtype: int64

print(frame.apply(max_minus_min, axis=1))
r1    2
r2    2
r3    2
dtype: int64
Function application and mapping - 2
def max_min(x):
return Series([max(x), min(x)], index=['max', 'min'])

print(frame.apply(max_min))

c1 c2 c3
max 6 7 8
min 0 1 2
Other DataFrame functions
• sort_index()

frame.index = ['A', 'C', 'B']
frame.columns = ['b', 'a', 'c']
print(frame)
   b  a  c
A  0  1  2
C  3  4  5
B  6  7  8

print(frame.sort_index())
   b  a  c
A  0  1  2
B  6  7  8
C  3  4  5

print(frame.sort_index(axis=1))
   a  b  c
A  1  0  2
C  4  3  5
B  7  6  8

• sort_values()

frame = DataFrame(np.random.randint(0, 10, 9).reshape(3, -1), index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])
print(frame)
    c1  c2  c3
r1   6   9   0
r2   8   2   9
r3   8   0   6

print(frame.sort_values(by='c1'))
    c1  c2  c3
r1   6   9   0
r2   8   2   9
r3   8   0   6

print(frame.sort_values(axis=1, by=['r3', 'r1']))
    c2  c3  c1
r1   9   0   6
r2   2   9   8
r3   0   6   8
Other DataFrame functions
• mean()
  • mean(axis=0, skipna=True)
• sum()
• cumsum()
• describe(): return summary statistics of each column
  • for numeric data: mean, std, max, min, 25%, 50%, 75%, etc.
  • for non-numeric data: count, unique, most-frequent item, etc.
• corr(): correlation between two Series, or between columns of a DataFrame
• corrwith(): correlation between columns of a DataFrame and a Series, or between the columns of another DataFrame
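A sketch of these summary functions on the same small numeric data frame used above:

import numpy as np
from pandas import DataFrame

frame = DataFrame(np.arange(9).reshape(3, 3), index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])

print(frame.mean())                    # column means (axis=0, skipna=True by default)
print(frame.sum())                     # column sums
print(frame.cumsum())                  # cumulative sum down each column
print(frame.describe())                # summary statistics per column
print(frame['c1'].corr(frame['c2']))   # correlation between two Series
print(frame.corrwith(frame['c1']))     # correlation of each column with a Series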
