Pandas 1
Syntax
fileobject = open(filename [, accessmode][, buffering])
○ filename is the name of the file we want to open.
○ accessmode determines the mode in which the file is opened (possible values are listed
below).
○ If buffering is set to 0, no buffering happens (in Python 3 this is allowed only in binary mode); if set to 1, line
buffering is used (text mode); if a negative value is given, the system default buffering behaviour is used.
Mode  Description
r     Read only (default)
rb    Read only in binary format
r+    Read and write both
rb+   Read and write both in binary format
w     Write only (creates the file if it does not exist)
wb    Write only in binary format
w+    Read and write both
wb+   Read and write both in binary format
a     Opens the file to append; if the file does not exist, it is created for writing
ab    Append in binary format; if the file does not exist, it is created for writing
a+    Append; if the file does not exist, it is created for reading and writing both
ab+   Read and write both in binary format
How to write a path?
We can pass a relative path as the argument to open(); alternatively, we can
pass an absolute path.
To specify an absolute path,
○ In Windows, f = open('D:\\folder\\subfolder\\filename.txt')
○ In Mac & Linux, f = open('/user/folder/subfolder/filename.txt')
We should close the file once we are done using it, by calling the
close() method.
closefile.py
f = open('college.txt')
data = f.read()
print(data)
f.close()
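Handling errors using “with” keyword
An error can occur while we are working with an open file, and in such cases
the call to close() may never be reached.
To handle such situations we can open the file using the with
keyword, which closes the file automatically when the block ends, even if an error occurs inside it.
Note that with does not hide a wrong filename: if we have a typo in the filename, or the file was
moved/deleted, open() still raises FileNotFoundError when the program runs.
fileusingwith.py
with open('college.txt') as f:
    data = f.read()
    print(data)
When we open a file using with, we do not need to close it ourselves.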
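To deal with a missing or misspelled file, the with block can be wrapped in try/except. The following is a minimal sketch; the temporary directory and the variable names are assumptions made only so that the example is self-contained.

```python
import os
import tempfile

# A path inside a fresh, empty temporary directory, so the file is certain
# not to exist (stands in for a mistyped filename).
filename = os.path.join(tempfile.mkdtemp(), 'college.txt')

try:
    # with closes the file automatically, even if read() raises an error
    with open(filename) as f:
        data = f.read()
        print(data)
except FileNotFoundError:
    # open() itself failed, so no file was opened and nothing needs closing
    data = None
    print('Could not find file:', filename)
```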
Example : Write file in Python
The write() method writes the specified data to the file.
readdemo.py
with open('college.txt', 'a') as f:
    f.write('Hello world')
If we open a file with 'w' mode, it will overwrite the data in the existing file, or create
a new file if the file does not exist.
If we open a file with 'a' mode, it will append the data at the end of the existing file, or
create a new file if the file does not exist.
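The difference between the two modes can be checked with a small sketch. The scratch file name demo.txt and the temporary directory are assumptions used only for illustration.

```python
import os
import tempfile

# Scratch file in a temporary directory (hypothetical name)
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, 'w') as f:   # 'w' creates the file
    f.write('first\n')
with open(path, 'w') as f:   # 'w' again: the old content is discarded
    f.write('second\n')
with open(path, 'a') as f:   # 'a': new data goes after the existing content
    f.write('third\n')

with open(path) as f:
    content = f.read()
print(content)
```

Only the lines written by the second 'w' call and the 'a' call remain in the file.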
Reading CSV files without any library functions
A comma-separated values (CSV) file is a delimited text file that uses commas to separate values.
Each line of the file is a data record, and each record consists of one or more fields separated by commas.
Example : Book1.csv
studentname,enrollment,cpi
abcd,123456,8.5
bcde,456789,2.5
cdef,321654,7.6
readlines.py
with open('Book1.csv') as f:
    rows = f.readlines()
    isFirstLine = True
    for r in rows:
        if isFirstLine:
            # skip the header line
            isFirstLine = False
            continue
        # strip() removes the trailing newline before splitting
        cols = r.strip().split(',')
        print('Student Name = ', cols[0], end=" ")
        print('\tEn. No. = ', cols[1], end=" ")
        print('\tCPI = \t', cols[2])
Modules are Python code libraries you can include in your project.
Download a Package
Downloading a package is very easy.
Open the command line interface and tell PIP to download the package you want.
Navigate your command line to the location of Python's script directory, and type the following:
Example
pip install camelcase
Example
Import and use "camelcase":
import camelcase
c = camelcase.CamelCase()
txt = "hello world"
print(c.hump(txt))
Output: Hello World
● Create a Directory: Start by creating a directory (folder) for your package. This directory will serve as the root
of your package structure.
● Add Modules: Within the package directory, you can add Python files (modules) containing your code. Each
module should represent a distinct functionality or component of your package.
● Init File: Include an __init__.py file in the package directory. This file can be empty or can contain
initialization code for your package. It signals to Python that the directory should be treated as a package.
● Subpackages: You can create sub-packages within your package by adding additional directories containing
modules, along with their own __init__.py files.
● Importing: To use modules from your package, import them into your Python scripts using dot notation. For
example, if you have a module named module1.py inside a package named mypackage, you would import its
function like this: from mypackage.module1 import greet.
● Distribution: If you want to distribute your package for others to use, you can create a setup.py file using
Python’s setuptools library. This file defines metadata about your package and specifies how it should be
installed.
Code Example
Here’s a basic code sample demonstrating how to create a simple Python package:
1. Create a directory named mypackage.
2. Inside mypackage, create two Python files: module1.py and module2.py.
3. Create an __init__.py file inside mypackage (it can be empty).
4. Add some code to the modules.
5. Finally, demonstrate how to import and use the modules from the package.
mypackage/
│
├── __init__.py
├── module1.py
└── module2.py
Example: Now, let’s create a Python script outside the mypackage directory to import and use these modules:
# module1.py
def greet(name):
    print(f"Hello, {name}!")
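The five steps can be tried end to end with the sketch below. It builds the package in a temporary directory so that it runs as one script; the add() function in module2.py and the use of sys.path are assumptions for illustration, since the slides do not show module2.py or the calling script.

```python
import os
import sys
import tempfile

# Steps 1-4: create mypackage/ with __init__.py, module1.py and module2.py
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, 'mypackage')
os.mkdir(pkg_dir)

files = {
    '__init__.py': '',  # may be empty
    'module1.py': 'def greet(name):\n    print(f"Hello, {name}!")\n',
    # add() is hypothetical, only here to give module2 some content
    'module2.py': 'def add(a, b):\n    return a + b\n',
}
for name, source in files.items():
    with open(os.path.join(pkg_dir, name), 'w') as f:
        f.write(source)

# Step 5: make the parent directory importable, then use dot notation
sys.path.insert(0, root)

from mypackage.module1 import greet
from mypackage.module2 import add

greet('World')
total = add(2, 3)
print(total)
```

In a normal project the script would simply sit next to the mypackage directory, and the sys.path line would not be needed.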
Pandas
• Pandas, like NumPy, is one of the most popular Python libraries for
data analysis.
• It is a high-level abstraction built on top of the lower-level NumPy, whose core is written in
C.
• Pandas provides high-performance, easy-to-use data structures and
data analysis tools.
• There are two main structures used by pandas: data frames and
series.
Pandas : series, dataframe, panel (Panel was deprecated and has been removed from current pandas releases)
• Created by Wes McKinney in 2008, now maintained by many others.
• Wes McKinney is also the author of one of the textbooks: Python for Data Analysis
import pandas as pd
df = pd.read_excel('Example.xlsx')
print(df)
Read from MySQL Database
We need two libraries for that,
• conda install sqlalchemy
• conda install pymysql
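Then, create a database connection string and create an engine using it.
createEngine.py
from sqlalchemy import create_engine
# replace username, password, host and dbname with your own values
db_connection_str = 'mysql+pymysql://username:password@host/dbname'
db_connection = create_engine(db_connection_str)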
Read from MySQL Database (Cont.)
After getting the engine, we can run any SQL query using the pd.read_sql
method.
read_sql is a generic method which can be used to read from any SQL database
(MySQL, MSSQL, Oracle etc…)
readSQLDemo.py
import pandas as pd
df = pd.read_sql('SELECT * FROM cities', con=db_connection)
print(df)
Output
CityID CityName CityDescription CityCode
0 1 Rajkot Rajkot Description here RJT
1 2 Ahemdabad Ahemdabad Description here ADI
2 3 Surat Surat Description here SRT
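If no MySQL server is at hand, the same read_sql call can be tried against an in-memory SQLite database from the standard library. This is only a sketch: the reduced cities table below is an assumption modelled on the output above, and a SQLite connection is swapped in for the SQLAlchemy engine.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the MySQL server
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE cities (CityID INTEGER, CityName TEXT, CityCode TEXT)')
con.executemany(
    'INSERT INTO cities VALUES (?, ?, ?)',
    [(1, 'Rajkot', 'RJT'), (3, 'Surat', 'SRT')],
)
con.commit()

# Same call as in readSQLDemo.py, with a different connection object
df = pd.read_sql('SELECT * FROM cities', con=con)
print(df)
con.close()
```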
Series
Series is a one-dimensional array with axis labels.
It supports both integer and label-based indexes, but the index must be of a hashable
type.
If we do not specify an index, it will assign an integer zero-based index.
syntax
import pandas as pd
s = pd.Series(data, index, dtype)
Parameters
data = array-like or iterable
index = array-like index
dtype = data type
pandasSeries.py
import pandas as pd
s = pd.Series([1, 3, 5, 7, 9, 11])
print(s)
Output
0     1
1     3
2     5
3     7
4     9
5    11
dtype: int64
Series (Cont.)
We can then access the elements inside a Series just like an array, using square-bracket notation.
pdSeriesEle.py
import pandas as pd
s = pd.Series([1, 3, 5, 7, 9, 11])
print("S[0] = ", s[0])
b = s[0] + s[1]
print("Sum = ", b)
Output
S[0] =  1
Sum =  4
• As you may suspect by this point, a series has ways to extract all of
the values in the series, as well as individual elements by index.
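A short sketch of these accessors follows. The letter labels are assumptions chosen only to make label access and position access easy to tell apart.

```python
import pandas as pd

s = pd.Series([1, 3, 5, 7, 9, 11], index=['a', 'b', 'c', 'd', 'e', 'f'])

print(s.values)    # all values, as a NumPy array
print(s.index)     # all labels
print(s['c'])      # one element by label
print(s.iloc[2])   # the same element by position
```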
• It is easy to retrieve several elements of a series by their indices or
make group assignments.
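As an illustration, passing a list of labels selects several elements at once, and assigning to the same selection changes all of them. The data below is hypothetical.

```python
import pandas as pd

s = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])

# Retrieve several elements by passing a list of labels
subset = s[['a', 'c']]
print(subset)

# Group assignment: every selected element gets the new value
s[['a', 'c']] = 0
print(s)
```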
Filtering and maths operations
• Filtering and maths operations are easy with Pandas as well.
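A minimal sketch of both ideas, reusing the series from the earlier example; the variable names are assumptions.

```python
import pandas as pd

s = pd.Series([1, 3, 5, 7, 9, 11])

big = s[s > 5]     # filtering: keep only the elements greater than 5
doubled = s * 2    # maths: the operation is applied to every element
total = s.sum()

print(big)
print(doubled)
print(total)
```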
Pandas data frame
• Simplistically, a data frame is a table, with rows and columns.
• Each column in a data frame is a series object.
• Rows consist of elements inside series.
• You can also create a data frame from a list.
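The sketch below builds the same small table twice, once from a dictionary of columns and once from a list of rows. The values are borrowed from the Book1.csv sample purely for illustration.

```python
import pandas as pd

# From a dictionary: each key becomes a column
df_from_dict = pd.DataFrame({
    'studentname': ['abcd', 'bcde'],
    'enrollment': [123456, 456789],
    'cpi': [8.5, 2.5],
})

# From a list: each inner list becomes a row, column names are given separately
rows = [['abcd', 123456, 8.5], ['bcde', 456789, 2.5]]
df_from_list = pd.DataFrame(rows, columns=['studentname', 'enrollment', 'cpi'])

print(df_from_dict)
print(df_from_list)
```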
• You can ascertain the type of a column with the type() function.
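For instance, with a hypothetical two-column frame, type() shows that a column is a Series object, while the dtype attribute reports the type of the values stored in it:

```python
import pandas as pd

df = pd.DataFrame({'studentname': ['abcd', 'bcde'], 'cpi': [8.5, 2.5]})

print(type(df['cpi']))   # the column object itself is a pandas Series
print(df['cpi'].dtype)   # type of the values in one column
print(df.dtypes)         # value types of all columns at once
```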
• A Pandas data frame object has two indices: a column index and a row
index.
• Again, if you do not provide one, Pandas will create a RangeIndex from 0 to
N-1.
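A quick check of the default index, using hypothetical data with N = 3 rows:

```python
import pandas as pd

df = pd.DataFrame({'cpi': [8.5, 2.5, 7.6]})

print(df.index)     # row index created by pandas: 0 to N-1
print(df.columns)   # column index
```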
• There are numerous ways to provide row indices explicitly.
• For example, you could provide an index when creating a data frame:
• or do it during runtime.
• Here, I also named the index ‘country code’.
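Both options might look like the sketch below. The two-letter labels AA, BB and CC and the value column are placeholders, not real country codes or data from the slides.

```python
import pandas as pd

# Option 1: provide the index when creating the data frame
df = pd.DataFrame({'value': [10, 20, 30]}, index=['AA', 'BB', 'CC'])

# Option 2: assign the index later, and give it a name
df2 = pd.DataFrame({'value': [10, 20, 30]})
df2.index = ['AA', 'BB', 'CC']
df2.index.name = 'country code'

print(df)
print(df2)
```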
• Row access using the index can be performed in several ways.
• First, you could use .loc[] and provide an index label.
• Particular rows and columns can be selected this way.
• You can give .loc[] two arguments, an index list and a column list; slicing
is supported as well:
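A sketch of these three uses of loc on a small hypothetical frame (the labels and column names are placeholders):

```python
import pandas as pd

df = pd.DataFrame(
    {'value': [10, 20, 30], 'rank': [3, 2, 1]},
    index=['AA', 'BB', 'CC'],
)

row = df.loc['BB']                            # one row, returned as a Series
block = df.loc[['AA', 'CC'], ['value']]       # index list and column list
sliced = df.loc['AA':'BB', 'value':'rank']    # label slices include both ends

print(row)
print(block)
print(sliced)
```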
Filtering
• Filtering is performed using so-called Boolean arrays.
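In the sketch below, a comparison on a column produces a Boolean Series, and using it inside the brackets keeps only the rows marked True. The data is borrowed from the Book1.csv sample and the threshold of 7 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    'studentname': ['abcd', 'bcde', 'cdef'],
    'cpi': [8.5, 2.5, 7.6],
})

mask = df['cpi'] > 7    # Boolean Series, one True/False per row
passed = df[mask]       # keeps only the rows where mask is True

# Conditions are combined with & (and) or | (or), each one in parentheses
both = df[(df['cpi'] > 7) & (df['studentname'] != 'abcd')]

print(mask)
print(passed)
print(both)
```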
Deleting columns
• You can delete a column using the drop() function.
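A small sketch with hypothetical data; note that drop() returns a new data frame and leaves the original unchanged unless the result is assigned back.

```python
import pandas as pd

df = pd.DataFrame({
    'studentname': ['abcd', 'bcde'],
    'enrollment': [123456, 456789],
    'cpi': [8.5, 2.5],
})

smaller = df.drop(columns=['enrollment'])   # name the column(s) to remove
same = df.drop('enrollment', axis=1)        # equivalent form using axis=1

print(smaller)
print(df.columns)   # the original still has all three columns
```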
Reading from and writing to a file
• Pandas supports many popular file formats including CSV, XML, HTML,
Excel, SQL, JSON, etc.
• Out of all of these, CSV is the file format that you will work with the
most.
• You can read in the data from a CSV file using the read_csv() function.
• Similarly, you can write a data frame to a csv file with the to_csv()
function.
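A round trip through a CSV file could look like this. The temporary directory is an assumption to keep the sketch self-contained, and index=False is passed so that the row index is not written as an extra column.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    'studentname': ['abcd', 'bcde', 'cdef'],
    'enrollment': [123456, 456789, 321654],
    'cpi': [8.5, 2.5, 7.6],
})

# Write the data frame to a scratch CSV file, then read it back
path = os.path.join(tempfile.mkdtemp(), 'Book1.csv')
df.to_csv(path, index=False)
loaded = pd.read_csv(path)

print(loaded)
```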
Indexing, selection and filtering
• Series can be sliced/accessed with label-based indexes, or using
position-based indexes
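from pandas import Series
S = Series(range(4), index=['zero', 'one', 'two', 'three'])
print(S['two'])
2
print(S[['zero', 'two']])
zero    0
two     2
dtype: int64
# position-based access uses iloc (plain S[2] on a labelled index is deprecated in recent pandas)
print(S.iloc[2])
2
print(S.iloc[[0, 2]])
zero    0
two     2
dtype: int64
Use a list inside the brackets to select more than one item.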
Indexing, selection and filtering
S = Series(range(4), index=['zero', 'one', 'two', 'three'])
print(S[:2])
zero    0
one     1
dtype: int64
print(S['zero': 'two'])    # label slices are inclusive
zero    0
one     1
two     2
dtype: int64
print(S[S > 1])
two      2
three    3
dtype: int64
print(S[-2:])
two      2
three    3
dtype: int64
DataFrame
• A DataFrame is a tabular data structure comprised of rows and columns,
akin to a spreadsheet or database table.
• It can be treated as an ordered collection of columns
• Each column can be a different data type
• It has both row and column indices
from pandas import DataFrame
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame)
#output
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
DataFrame – specifying columns and indices
• Order of columns/rows can be specified.
• Columns not in data will have NaN.
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['A', 'B', 'C', 'D', 'E'])
print(frame2)
   year   state  pop debt
A  2000    Ohio  1.5  NaN
B  2001    Ohio  1.7  NaN
C  2002    Ohio  3.6  NaN
D  2001  Nevada  2.4  NaN
E  2002  Nevada  2.9  NaN
The columns and rows appear in the same order as given in columns and index.
print(frame3.index)
Int64Index([2000, 2001, 2002], dtype='int64', name='year')
print(frame3.columns)
Index(['Nevada', 'Ohio'], dtype='object', name='state')
print(frame3.values)
[[nan 1.5]
[2.9 1.7]
[2.9 3.6]]
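The slides do not show how frame3 was built. One way to obtain a frame with a year index and state columns from the data dictionary used earlier is pivot(); this is an assumption, not necessarily the original construction. With this data, the Nevada value for 2001 comes out as 2.4, and recent pandas versions print the index as Index([...]) rather than Int64Index([...]).

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

# Rows become years, columns become states, cells hold the 'pop' values;
# combinations that do not occur in the data (Nevada, 2000) become NaN
frame3 = frame.pivot(index='year', columns='state', values='pop')

print(frame3.index)
print(frame3.columns)
print(frame3.values)
```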
DataFrame – retrieving a column
• A column in a DataFrame can be retrieved as a Series by dict-like
notation or as attribute
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame['state'])
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object
print(frame.state)
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object
DataFrame – getting rows
• loc for using indexes and iloc for using positions
• loc gets rows (or columns) with particular labels from the index.
• iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['A', 'B', 'C', 'D', 'E'])
print(frame2)
   year   state  pop debt
A  2000    Ohio  1.5  NaN
B  2001    Ohio  1.7  NaN
C  2002    Ohio  3.6  NaN
D  2001  Nevada  2.4  NaN
E  2002  Nevada  2.9  NaN
print(frame2.loc[['A', 'B']])
   year state  pop debt
A  2000  Ohio  1.5  NaN
B  2001  Ohio  1.7  NaN
print(frame2.iloc[1:3])
   year state  pop debt
B  2001  Ohio  1.7  NaN
C  2002  Ohio  3.6  NaN
print(frame2.loc['A':'E', ['state', 'pop']])
    state  pop
A    Ohio  1.5
B    Ohio  1.7
C    Ohio  3.6
D  Nevada  2.4
E  Nevada  2.9
print(frame2.iloc[:, 1:3])
    state  pop
A    Ohio  1.5
B    Ohio  1.7
C    Ohio  3.6
D  Nevada  2.4
E  Nevada  2.9
print(frame2.loc['A'])
year     2000
state    Ohio
pop       1.5
debt      NaN
Name: A, dtype: object
DataFrame – modifying columns
frame2['debt'] = 0
print(frame2)
   year   state  pop  debt
A  2000    Ohio  1.5     0
B  2001    Ohio  1.7     0
C  2002    Ohio  3.6     0
D  2001  Nevada  2.4     0
E  2002  Nevada  2.9     0
frame2['debt'] = range(5)
print(frame2)
   year   state  pop  debt
A  2000    Ohio  1.5     0
B  2001    Ohio  1.7     1
C  2002    Ohio  3.6     2
D  2001  Nevada  2.4     3
E  2002  Nevada  2.9     4
val = Series([10, 10, 10], index=['A', 'C', 'D'])
frame2['debt'] = val
print(frame2)
   year   state  pop  debt
A  2000    Ohio  1.5  10.0
B  2001    Ohio  1.7   NaN
C  2002    Ohio  3.6  10.0
D  2001  Nevada  2.4  10.0
E  2002  Nevada  2.9   NaN
Rows or individual elements can be modified similarly, using loc or iloc.
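Modifying single elements and whole rows with loc and iloc might look like the following sketch. The new values (5.5, 1.6 and the replacement row for E) are made up for illustration.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['A', 'B', 'C', 'D', 'E'])
frame2['debt'] = 0.0

frame2.loc['B', 'debt'] = 5.5                  # one element, by labels
frame2.iloc[0, 2] = 1.6                        # one element, by position (row A, column pop)
frame2.loc['E'] = [2003, 'Nevada', 3.2, 1.0]   # a whole row at once

print(frame2)
```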
DataFrame – removing columns
del frame2['debt']
print(frame2)
year state pop
A 2000 Ohio 1.5
B 2001 Ohio 1.7
C 2002 Ohio 3.6
D 2001 Nevada 2.4
E 2002 Nevada 2.9
More on DataFrame indexing
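import numpy as np
from pandas import DataFrame
data = np.arange(9).reshape(3,3)
print(data)
[[0 1 2]
 [3 4 5]
 [6 7 8]]
# v is not defined on the slide: the repeated row label 'a' is inferred from the output below,
# and the third label 'b' is an assumption
v = DataFrame(data, index=['a', 'a', 'b'], columns=['c1', 'c2', 'c3'])
print(v.loc['a'])
   c1  c2  c3
a   0   1   2
a   3   4   5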
More on DataFrame indexing - 3
print(frame)
    c1  c2  c3
r1   0   1   2
r2   3   4   5
r3   6   7   8
print(frame['c1']>0)
r1    False
r2     True
r3     True
Name: c1, dtype: bool
print(frame[frame['c1']>0])
    c1  c2  c3
r2   3   4   5
r3   6   7   8
print(frame <3)
       c1     c2     c3
r1   True   True   True
r2  False  False  False
r3  False  False  False
frame[frame<3] = 3
print(frame)
    c1  c2  c3
r1   3   3   3
r2   3   4   5
r3   6   7   8
Removing rows/columns
print(frame)
c1 c2 c3
r1 0 1 2
r2 3 4 5
r3 6 7 8
print(frame.drop(['r1','r3']))
c1 c2 c3
r2 3 4 5
print(frame.drop(['c1'], axis=1))
c2 c3
r1 1 2
r2 4 5
r3 7 8
Reindexing
• Alter the order of rows/columns of a DataFrame or order of a series
according to new index
frame2 = frame.reindex(columns=['c2', 'c3', 'c1'])
print(frame.apply(max_min))
c1 c2 c3
max 6 7 8
min 0 1 2
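The slides use max_min without defining it. The sketch below shows one definition that reproduces the output above, together with reindex on columns and on rows. The function body and the extra row label r4 are assumptions.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['r1', 'r2', 'r3'],
                     columns=['c1', 'c2', 'c3'])

# Reindexing: reorder the columns, or reorder the rows
frame2 = frame.reindex(columns=['c2', 'c3', 'c1'])
reordered_rows = frame.reindex(['r3', 'r1', 'r4'])   # r4 does not exist, so it is filled with NaN

# Assumed definition of max_min: apply() calls it once per column
def max_min(col):
    return pd.Series([col.max(), col.min()], index=['max', 'min'])

result = frame.apply(max_min)

print(frame2)
print(reordered_rows)
print(result)
```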
Other DataFrame functions
• sort_index()
frame.index = ['A', 'C', 'B']
frame.columns = ['b', 'a', 'c']
print(frame)
   b  a  c
A  0  1  2
C  3  4  5
B  6  7  8
print(frame.sort_index())
   b  a  c
A  0  1  2
B  6  7  8
C  3  4  5
print(frame.sort_index(axis=1))
   a  b  c
A  1  0  2
C  4  3  5
B  7  6  8
• sort_values()
frame = DataFrame(np.random.randint(0, 10, 9).reshape(3,-1), index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])
print(frame)
    c1  c2  c3
r1   6   9   0
r2   8   2   9
r3   8   0   6
print(frame.sort_values(by='c1'))
    c1  c2  c3
r1   6   9   0
r2   8   2   9
r3   8   0   6
print(frame.sort_values(axis=1, by=['r3', 'r1']))
    c2  c3  c1
r1   9   0   6
r2   2   9   8
r3   0   6   8
Other DataFrame functions
• mean()
• mean(axis=0, skipna=True)
• sum()
• cumsum()
• describe(): returns summary statistics of each column
• for numeric data: count, mean, std, min, 25%, 50%, 75%, max.
• for non-numeric data: count, unique, top (most frequent item), freq.
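A brief sketch of these functions on hypothetical data; axis=0 (the default) works down each column, and axis=1 works across each row.

```python
import pandas as pd

frame = pd.DataFrame({'c1': [0, 3, 6], 'c2': [1, 4, 7]},
                     index=['r1', 'r2', 'r3'])

col_means = frame.mean()         # one mean per column
row_means = frame.mean(axis=1)   # one mean per row
col_sums = frame.sum()
running = frame.cumsum()         # running total down each column

print(col_means)
print(row_means)
print(col_sums)
print(running)
print(frame.describe())          # numeric summary of each column

# For non-numeric data, describe() reports count, unique, top and freq
words = pd.Series(['a', 'b', 'a'])
summary = words.describe()
print(summary)
```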