06 - The Basics of Python in DS

Introduction to data science

Python for data science


Introduction to Python
Basic types and control flow
Intro to Python: part 1

• Intro to IDLE, Python
• Function: print
• Types: str, int, float
• Variables
• User input
• Saving your work
• Comments
• Conditionals
• Type: bool
• Keywords: and, or, not, if, elif, else
IDLE: Integrated Development and Learning Environment

• Shell
  • Evaluates what you enter and displays output
  • Interactive
  • Type at “>>>” prompt

• Editor
  • Create and save .py files
  • Can run files and display output in the shell
A simple example: Hello world!

Syntax highlighting:
• IDLE colors code differently depending on functionality: keywords, strings, results

Some common keywords:
if else and or class while for
break elif in def not from import
(In Python 3, print is a built-in function rather than a keyword.)
Fundamental concepts in Python

• Variable: a named location in memory.
• Variable names consist of letters, digits, or underscores (_); a name cannot begin with a digit and:
  • Cannot be a keyword
  • Can use Unicode letters and digits (from Python 3)
• All variables in Python refer to objects. Each object has a type and a location in memory (id)
Fundamental concepts in Python

• Variables in Python:
  • Case sensitive
  • No need to declare in the program
  • No need to specify the data type
  • Can be rebound to a value of another data type
  • A variable must be assigned a value before it is first used
• Basic data types:
  • String: str
  • Number: integer, float, fraction and complex
  • list, tuple, dictionary
Fundamental concepts in Python

• Variables
  • To re-use a value in multiple computations, store it in a variable.
  • Python is “dynamically-typed”, so you can change the type of value stored (differs from Java, C#, C++, …).

>>> someVar = 2
>>> print(someVar) # it's an int
2
>>> someVar = "Why hello there"
>>> print(someVar) # now str
Why hello there
Fundamental concepts in Python

• Input data for variables
  • <variable name> = input("prompt")
  • input always returns a string; convert it to another data type when needed:
    <variable name> = <data type>(input("prompt"))
    a = int(input('a='))
• Display data
  • print(value1, value2, ...)
  • print("string", value)
  • To keep the output of several print calls on one line, add end='' to the print call.
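The conversion and the end='' option can be tried without typing anything at a prompt. The sketch below is only an illustration: the variable raw is a hypothetical stand-in for what input() would return, and its value is assumed.

```python
# Stand-in for text a user would type at an input() prompt (assumed value).
raw = "21"

# input() always returns a str, so convert before doing arithmetic.
a = int(raw)
doubled = a * 2

# end='' suppresses the newline, so both calls print on one line.
print("a =", a, end='')
print(", doubled =", doubled)
```

Replacing the first assignment with raw = input('a=') turns this into the interactive version.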
Data types in Python

• String
  • We’ve already seen one type in Python, used for words and phrases.
  • In general, this type is called “string”.
  • In Python, it’s referred to as str.
Data types in Python

Python also has types for numbers.
• int – integers
• float – floating point (decimal) numbers

>>> print(4) # int
4
>>> print(6.) # float
6.0
>>> print(2.3914) # float
2.3914
Data types in Python

• int
  • In Python 3.x, int has unlimited range.
  • Can be used for computations on very large numbers.
  • Common operators

Operation                                  Syntax    Example
Addition                                   x + y     20 + 3 = 23
Subtraction                                x - y     20 - 3 = 17
Multiplication                             x * y     20 * 3 = 60
Division (result is a float)               x / y     20 / 3 = 6.666...
Floor division (result rounded down)       x // y    20 // 3 = 6
Remainder                                  x % y     20 % 3 = 2
Exponent                                   x ** y    20 ** 3 = 8000
Unary operators                            +x, -x
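The operators can be checked directly in the shell. The following is a small sketch using the same operands as the examples (20 and 3); the variable names are chosen here for illustration.

```python
x, y = 20, 3

quotient = x / y        # true division always returns a float
floor_q = x // y        # floor division rounds down to an integer
remainder = x % y
power = x ** y

# divmod returns the floor quotient and the remainder together.
print(quotient, floor_q, remainder, power, divmod(x, y))

# int has no fixed upper limit, so very large results stay exact.
big = 2 ** 100
print(big)

# Floor division rounds toward negative infinity, not toward zero.
negative_floor = -x // y
print(negative_floor)
```

Note the last line: -20 // 3 gives -7 rather than -6, because the result is rounded down.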
Data types in Python

• Float
  • Values: distinguished from integers by the decimal point. The integer part and the fractional part are separated by '.'
  • Operators: +, –, *, /, ** and unary operators
  • Use the decimal module for higher precision: from decimal import *
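To see why the decimal module is suggested for higher precision, compare a float sum with the same sum in Decimal. This is a minimal sketch; the precision of 30 digits is an arbitrary choice for the illustration.

```python
from decimal import Decimal, getcontext

# Binary floats cannot represent 0.1 or 0.2 exactly.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)

# Decimal values built from strings keep the digits as written.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))

# The working precision (number of significant digits) is adjustable.
getcontext().prec = 30
third = Decimal(1) / Decimal(3)
print(third)
```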
Data types in Python - bool

Boolean values are true or false.

Python has the values True and False (note the capital letters!).

You can compare values with ==, !=, <, <=, >, >=, and the result of these expressions is a bool.

>>> a = 2
>>> b = 5
>>> a > b
False
>>> a <= b
True
>>> a == b # does a equal b?
False
>>> a != b # does a not-equal b?
True
Data types in Python - bool

When combining Boolean expressions, parentheses are your friends.

>>> a = 2
>>> b = 5
>>> False == (a > b)
True
Keywords: and, or, not

and is True if both parts evaluate to True, otherwise False

or is True if at least one part evaluates to True, otherwise False

>>> a = 2
>>> b = 5
>>> a < b and False
False
>>> a < b or a == b
True
>>> a < b and a == b
False
>>> True and False
False
>>> True and True
True
>>> True or False
True
Keywords: and, or, not

and is True if both parts evaluate to True, otherwise False

or is True if at least one part evaluates to True, otherwise False

not is the opposite of its argument

>>> not True
False
>>> not False
True
>>> True and (False or not True)
False
>>> True and (False or not False)
True
Mathematics functions in Python

Numeric and mathematical modules

• math: Mathematical functions (sin() etc.).
• cmath: Mathematical functions for complex numbers.
• decimal: Decimal fixed-point and floating-point arithmetic
• random: Generates pseudo-random numbers for various probability distributions.
• itertools: Functions that create iterators for efficient looping
• functools: Higher-order functions and operations on callable objects
• operator: The standard Python operators made available as functions
Mathematics functions in Python

• math
exp(x)  log(x[, base])  log10(x)  pow(x, y)  sqrt(x)
acos(x)  asin(x)  atan(x)  atan2(y, x)
cos(x)  sin(x)  tan(x)  hypot(x, y)
degrees(x)  radians(x)
cosh(x)  sinh(x)  tanh(x)
Constants: pi, e
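A few of the math functions listed above, called on sample arguments. The arguments are chosen only for illustration.

```python
import math

# Exponentials and logarithms
print(math.exp(1), math.log(8, 2), math.log10(1000))

# Powers, roots, and the length of a hypotenuse
print(math.pow(2, 10), math.sqrt(16), math.hypot(3, 4))

# Trigonometric functions work in radians; convert from degrees first.
angle = math.radians(90)
print(math.sin(angle), math.degrees(math.pi))

# atan2 uses the signs of both arguments to pick the quadrant.
quadrant_angle = math.degrees(math.atan2(1, -1))
print(quadrant_angle)
```

The functions return floats, so results such as math.log(8, 2) are best compared with math.isclose rather than ==.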
Built-in functions in Python
Conditionals: if, elif, else

• The keywords if, elif, and else provide a way to control the flow of your program.
• Python checks each condition in order, and executes the block (whatever’s indented) of the first one to be True.
Conditionals: if, elif, else

Indentation is important in Python!

Make sure each if, elif, and else has a colon after it, and its block is indented one tab (4 spaces by default).
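As an illustration of the colon and indentation rules, here is a small sketch of an if/elif/else chain. The function name and the temperature thresholds are made up for this example.

```python
def describe_temperature(celsius):
    """Return a label for a temperature; the thresholds are illustrative."""
    if celsius < 0:
        label = "freezing"
    elif celsius < 15:
        label = "cold"
    elif celsius < 25:
        label = "mild"
    else:
        label = "hot"
    return label

for value in (-5, 10, 20, 30):
    print(value, describe_temperature(value))
```

Only the first matching branch runs: 10 is also below 25, but it is labelled "cold" because that condition is checked first.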
Conditionals: if, elif, else

Be careful what you compare to the result of input (raw_input in Python 2). It is a string, not a number.

# The right way: str to str or int to int
>>> gradYear = input("When do you plan to graduate? ")
When do you plan to graduate? 2019
>>> gradYear == 2019 # gradYear is not an integer :(
False
>>> gradYear == "2019" # gradYear is a string
True
>>> int(gradYear) == 2019 # cast gradYear to an int :)
True
Conditionals: if, elif, else

Be careful how you compare the result of input (raw_input in Python 2). It is a string, not a number.
Converting text that is not a number leads to a ValueError:

>>> gradYear = input("When do you plan to graduate? ")
When do you plan to graduate? Sometime
>>> int(gradYear) == 2019
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    int(gradYear) == 2019
ValueError: invalid literal for int() with base 10: 'Sometime'
Nested if

• Syntax:
if condition1:
    tasks_1
    if condition2:
        tasks_2
    elif condition3:
        tasks_3
    else:
        tasks
elif condition4:
    tasks_4
else:
    tasks_5

• Example:
var = 100
if var < 200:
    print("The value of variable is less than 200")
    if var == 150:
        print("The value is 150")
    elif var == 100:
        print("The value is 100")
    elif var == 50:
        print("The value is 50")
    elif var < 50:
        print("The value of variable is less than 50")
else:
    print("There is no true condition")
Nested if

• Example:
var = int(input('Enter a value: '))
if var < 200:
    print("The value of variable is less than 200")
    if var == 150:
        print("The value is 150")
    elif var == 100:
        print("The value is 100")
    elif var == 50:
        print("The value is 50")
    elif var < 50:
        print("The value of variable is less than 50")
else:
    print("There is no true condition")
Exercise

• Enter the coordinates of 3 points A, B and C on the 2-dimensional plane. Check whether triangle ABC is an equilateral triangle.
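One possible approach to this exercise, given as a sketch rather than a reference solution: compute the three side lengths and compare them with a tolerance, because floating-point distances are rarely exactly equal. The helper names and the sample points are assumptions made for the illustration.

```python
import math

def side_length(p, q):
    """Euclidean distance between two points given as (x, y) tuples."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def is_equilateral(a, b, c):
    """True when the three sides are equal within floating-point tolerance."""
    ab, bc, ca = side_length(a, b), side_length(b, c), side_length(c, a)
    # Three identical points give zero-length sides, which is not a triangle.
    if ab == 0:
        return False
    return math.isclose(ab, bc) and math.isclose(bc, ca)

# Sample points (assumed); in the exercise they would be read with input().
A = (0.0, 0.0)
B = (2.0, 0.0)
C = (1.0, math.sqrt(3.0))
print(is_equilateral(A, B, C))
print(is_equilateral(A, B, (1.0, 1.0)))
```

Comparing with == instead of math.isclose would make the result depend on rounding in the square roots.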
WHILE, FOR loops - Syntax
WHILE loop - Examples

• Example 1: while without else

count = 0
while count < 5:
    print('Your sequence number is :', count)
    count = count + 1
WHILE loop - Examples

• Example 2: while with else

count = 0
while count < 5:
    print(count, " is less than 5")
    count = count + 1
else:
    print(count, " is not less than 5")
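The else block of a loop runs only when the loop ends because its condition became False; leaving the loop with break skips it. The sketch below illustrates this with a hypothetical search function written for this example.

```python
def find_first_multiple(numbers, divisor):
    """Return the first element divisible by divisor, or None if there is none."""
    index = 0
    found = None
    while index < len(numbers):
        if numbers[index] % divisor == 0:
            found = numbers[index]
            break                 # leaving via break skips the else block
        index = index + 1
    else:
        # Runs only when the condition became False without a break.
        print("no multiple of", divisor, "found")
    return found

print(find_first_multiple([3, 8, 10, 14], 5))
print(find_first_multiple([3, 8, 11, 14], 5))
```

The same rule applies to the else block of a for loop.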
FOR loop - Examples

• Example 1: for without else

for i in range(0, 10):
    print('The sequence number is:', i)
FOR loop - Examples

• Example 2: for with else

for i in range(0, 10):
    print('The sequence number is:', i)
else:
    print('The last number!')
Nested loops - Exercise

• Example: Find all prime numbers that are less than 100
i = 2
while i < 100:
    j = 2
    while j <= i / j:
        if not (i % j):
            break
        j = j + 1
    if j > i / j:
        print(i, " is a prime number!")
    i = i + 1
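The same trial-division idea can be written as a function that returns the primes in a list, which makes the result easier to reuse. This is an alternative sketch, not the slide's version; the function name is chosen here.

```python
def primes_below(limit):
    """Return all primes smaller than limit using trial division."""
    primes = []
    for candidate in range(2, limit):
        divisor = 2
        is_prime = True
        # Checking divisors up to the square root of candidate is enough.
        while divisor * divisor <= candidate:
            if candidate % divisor == 0:
                is_prime = False
                break
            divisor += 1
        if is_prime:
            primes.append(candidate)
    return primes

result = primes_below(100)
print(len(result), result[:5], result[-1])
```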
Lists

• A sequence of items
• Has the ability to grow (unlike a fixed-size array)
• Use indexes to access elements (array notation)
• Examples
aList = []
another = [1,2,3]
• You can print an entire list or an element
print(another)
print(another[0])
• Index -1 accesses the last element of a list
List operations

• append method to add elements (don't have to be the same type)
aList.append(42)
aList.append(another)
• del removes elements
del aList[0] # removes 42
• Concatenate lists with +
• Repeat the elements of a list with *
zerolist = [0] * 5
• Multiple assignment (unpacking)
point = [1,2]
x, y = point
• More operations can be found at
http://docs.python.org/lib/types-set.html
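The list operations above, run one after another so the effect of each is visible. The variable names follow the slide, written in snake_case; this is only a sketch.

```python
a_list = []
another = [1, 2, 3]

a_list.append(42)          # add a single element
a_list.append(another)     # the whole list becomes one nested element
print(a_list)

del a_list[0]              # remove the element at index 0 (the 42)
print(a_list)

combined = another + [4, 5]    # + builds a new, longer list
zero_list = [0] * 5            # * repeats the elements
print(combined, zero_list, combined[-1])

x, y = [1, 2]              # unpacking needs as many names as elements
print(x, y)
```

Note that append(another) nests the list instead of merging it; use + or the extend method to merge.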
Exercises

1. Enter a string and check whether it is a valid email address (a valid email can be considered as one containing the @ character).

2. Given a random sequence A of 100 integer elements (values in the range 1 to 300), separate all the odd elements into another sequence (B).
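A possible starting point for both exercises, given as a sketch under the simplified rules stated in them. The function names, the fixed seed, and the sample strings are assumptions made for the illustration.

```python
import random

def looks_like_email(text):
    """Exercise 1 rule only: the string must contain the @ character."""
    return "@" in text

def split_odds(sequence):
    """Exercise 2: collect the odd elements of sequence into a new list."""
    return [value for value in sequence if value % 2 == 1]

random.seed(0)  # fixed seed (an arbitrary choice) so runs are repeatable
A = [random.randint(1, 300) for _ in range(100)]  # randint includes both ends
B = split_odds(A)

print(looks_like_email("student@example.com"), looks_like_email("student"))
print(len(A), len(B))
```

In the interactive version the string would come from input(); a real email check needs far stricter rules than the presence of @.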
Interacting with user

• Obtaining data from a user
  • Use the function input; it returns a string, so convert the result with int() or float() for numbers (in Python 2, raw_input played this role)
  • Example
name = input("What's your name?")
• Command line arguments
  • Example:
import sys
for item in sys.argv:
    print(item)
  • Remember sys.argv[0] is the program name
Libraries in Python
• Data processing in Python
• Numpy
• Matplotlib
• Pandas
• Scikit-learn
Data processing in Python

• Basic data types: string, number, boolean
• Other data types: set, dictionary, tuple, list, file

• Errors in Python
  • Syntax error: the code violates Python's syntax, so the program cannot be compiled and does not start.
  • Exception: an abnormal situation that occurs while the program runs and that the normal flow was not designed for
Data processing in Python

Deal with exceptions: using up to 4 blocks

• “try” block: code that is likely to cause an error. When an error occurs, execution of this block stops at the line that caused the error
• “except” block: error handling code, executed only if an error occurs; otherwise it is skipped
• “else” block: can appear right after the last except block; its code is executed if no except block was triggered (the try block raised no error)
• “finally” block: also known as the clean-up block, always executed whether an error occurs or not
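The order in which the four blocks run can be made visible by recording each step. The sketch below uses a hypothetical helper, parse_year, written for this illustration; it converts text to an integer in the same way as the earlier graduation-year example.

```python
def parse_year(text):
    """Convert text to an int, recording which of the four blocks ran."""
    steps = []
    try:
        steps.append("try")
        year = int(text)          # raises ValueError for non-numeric text
    except ValueError:
        steps.append("except")
        year = None
    else:
        steps.append("else")      # reached only when int() succeeded
    finally:
        steps.append("finally")   # reached in both cases
    return year, steps

print(parse_year("2019"))
print(parse_year("Sometime"))
```

Catching the specific ValueError, rather than every exception, keeps unrelated errors visible.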
Numpy

• The main object of numpy is the homogeneous multidimensional array:
  • The data types of elements in the array must be the same
  • Data can be one-dimensional or multi-dimensional arrays
  • The dimensions (axes) are numbered from 0 onwards
  • The number of dimensions is called rank.
  • There are up to 24 different number types
• The ndarray type is the main class that handles multidimensional array data
• Lots of functions and methods for handling matrices
Numpy

• Syntax: import numpy [as <new name>]
• Create array:
<variable name> = <library name>.array(<value>)
• Access: <variable name>[<index>]
• Examples:
import numpy as np
x = np.arange(3.0)            # [0. 1. 2.]
a = np.zeros((2, 2))          # 2x2 array of zeros
b = np.ones((1, 2))           # 1x2 array of ones
c = np.full((3, 2, 2), 9)     # 3x2x2 array filled with 9
d = np.eye(2)                 # 2x2 identity matrix
e = np.random.random([3, 2])  # 3x2 array of random values in [0, 1)
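The array properties described for numpy (number of dimensions, shape, one shared element type) can be inspected on a small array. This is a minimal sketch; the array contents are arbitrary, and it assumes numpy is installed.

```python
import numpy as np

# array() builds an ndarray from nested Python lists.
m = np.array([[1, 2, 3], [4, 5, 6]])

print(m.ndim)    # number of dimensions (the "rank")
print(m.shape)   # length along axis 0 and axis 1
print(m.dtype)   # one data type shared by every element

# Mixing ints and floats upcasts every element to a common type.
mixed = np.array([1, 2.5])
print(mixed.dtype)

# Indexing uses one index per axis, counted from 0.
print(m[1, 2], m[0].tolist())
```

The exact integer type printed for m.dtype depends on the platform, so it is not shown here.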
Numpy

• Access by index (slicing)
import numpy as np
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
row_r1 = a[1, :]    # 1-dimensional array of length 4
row_r2 = a[1:2, :]  # 2-dimensional array 1x4
print(row_r1, row_r1.shape)  # Display "[5 6 7 8] (4,)"
print(row_r2, row_r2.shape)  # Display "[[5 6 7 8]] (1, 4)"

col_r1 = a[:, 1]    # 1-dimensional array of length 3
col_r2 = a[:, 1:2]  # 2-dimensional array 3x1
print(col_r1, col_r1.shape)  # Display "[ 2  6 10] (3,)"
print(col_r2, col_r2.shape)  # Display "[[ 2]
                             #          [ 6]
                             #          [10]] (3, 1)"
Numpy

import numpy as np
x = np.array([[1, 2], [3, 4]], dtype=np.float64)
y = np.array([[5, 6], [7, 8]], dtype=np.float64)

print(x + y)       # same as print(np.add(x, y))
print(x - y)       # same as print(np.subtract(x, y))
print(x * y)       # same as print(np.multiply(x, y))
print(x / y)       # same as print(np.divide(x, y))
print(np.sqrt(x))  # applied to all elements of x
print(2**x)        # applied to all elements of x
matplotlib

• “matplotlib” is a library specializing in plotting, built on numpy
• “matplotlib” aims to simplify charting work as much as possible, down to "just a few lines of code"
• “matplotlib” supports a wide variety of chart types, especially those used in research or economics, such as line charts, histograms, spectra, correlations, error charts, scatter plots, etc.
• The structure of matplotlib consists of many parts, serving different purposes
matplotlib

• Prerequisite: the data is available
• There can be 4 basic steps:
Step 1: Choose the right chart type
  • Depends a lot on the type of data
  • Depends on the user's intended use
Step 2: Set parameters for the chart
  • Parameters of the axes: meaning, scale, ...
  • Highlights on the chart
  • Perspective, fill pattern, color and other details
  • Additional information
Step 3: Draw the chart
Step 4: Save to file
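The four steps can be sketched in a few lines. This example is an illustration only: the data (y = x^2), the file name chart.png, and the use of a temporary directory are assumptions, and it uses matplotlib's Axes methods (ax.plot, ax.set_xlabel) where the slides use the equivalent plt functions. The "Agg" backend is selected so that the script runs without opening a window.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")            # draw off-screen; no display window needed
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data (assumed): y = x^2 on a few sample points.
x = np.linspace(0.0, 5.0, 26)
y = x ** 2

# Step 1: a line chart suits a continuous relationship.
fig, ax = plt.subplots()

# Step 2: set labels, title, and grid.
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("y = x^2")
ax.grid(True)

# Step 3: draw the chart; 'r--' is a red dashed line.
lines = ax.plot(x, y, "r--", label="x squared")
ax.legend()

# Step 4: save to a file (a temporary directory is used here).
out_path = os.path.join(tempfile.mkdtemp(), "chart.png")
fig.savefig(out_path)
plt.close(fig)
print(out_path, len(lines))
```

In an interactive session, plt.show() would display the figure instead of, or in addition to, saving it.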
matplotlib

• plot() draws y against x as lines and/or markers, showing the relationship between X and Y
• Syntax:
plot([x], y, [fmt], data=None, **kwargs)
plot([x], y, [fmt], [x2], y2, [fmt2], ..., **kwargs)
• “fmt” is the format string that specifies how the line is drawn
• “data” is an object with labelled data (for example a dict or a DataFrame); when it is given, x and y can be passed as labels
• **kwargs: additional line properties
• plot can be called several times to draw on the same chart
• The returned result is a list of Line2D objects
matplotlib

• fmt = '[color][marker][line]'
• [color]:
  • 'b' – blue
  • 'g' – green
  • 'r' – red
  • 'c' – cyan
  • 'm' – magenta
  • 'y' – yellow
  • 'k' – black
  • 'w' – white
  • #rrggbb – specifies the color by its RGB hex code
matplotlib

Line plot
• [marker] – the marker drawn at each data point:
  • 'o' – circle
  • 'v', '^', '<', '>' – triangles (down, up, left, right)
  • '*' – star
  • '.' – point
  • 'p' – pentagon
  • …
• [line] – line type:
  • '-' solid line
  • '--' dashed line
  • '-.' dash-dot line
  • ':' dotted line
matplotlib

Example – Line plot

import numpy as np
import matplotlib.pyplot as plt
# divide the interval 0-5 with a step of 0.2
t = np.arange(0., 5., 0.2)
# Draw three series:
# - red dashed line: y = x
# - blue square markers: y = x^2
# - green triangle markers: y = x^3
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.show()
matplotlib

Example – Bar plot

import matplotlib.pyplot as plt
D = {'MIS': 60,
     'AC': 310,
     'AAI': 360,
     'BDA': 580,
     'FDB': 340, 'MKT': 290}
plt.bar(range(len(D)), list(D.values()), align='center')
plt.xticks(range(len(D)), list(D.keys()))
plt.title('The majors in IS')
plt.show()
matplotlib

Example – Pie plot

matplotlib

Example – subplot
import numpy as np
import matplotlib.pyplot as plt
x1 = np.linspace(0.0, 5.0)
x2 = np.linspace(0.0, 2.0)
y1 = np.cos(2 * np.pi * x1) * np.exp(-x1)
y2 = np.cos(2 * np.pi * x2)
plt.subplot(2, 1, 1)  # 2 rows, 1 column, first panel
plt.plot(x1, y1, 'o-')
plt.subplot(2, 1, 2)  # second panel
plt.plot(x2, y2, '.-')
plt.show()
Pandas

• “pandas” is a library built on top of numpy, specializing in processing tabular data
• The name “pandas” is derived from “panel data”
Pandas

• Read data from multiple formats
• Data alignment and integrated handling of missing data
• Reshape and pivot data sets easily
• Label-based slicing, indexing, and subsetting of large data sets
• Data can be grouped for aggregation and transformation purposes
• Filter data and perform queries on the data
• Time series data processing and sampling
Pandas

• Pandas data has 3 main structures:
  • Series: 1-dimensional structure, an array of uniform data
  • DataFrame (frame): 2-dimensional structure, the data within each column is of one type (somewhat like a table in SQL, but with named rows)
  • Panel: 3-dimensional structure, can be viewed as a set of dataframes with additional information
• Series data is similar to the array type in numpy, but there are two important differences:
  • Accepts missing data (NaN – unknown)
  • Rich indexing system (similar to a dictionary)
Pandas

• General syntax:
pd.DataFrame(data, index, columns, dtype, copy)
• Where:
  • ‘data’ accepts values of many different types such as list, dictionary, ndarray, Series, ... and even other DataFrames
  • ‘index’ is the row labels of the dataframe
  • ‘columns’ is the column labels of the dataframe
  • ‘dtype’ is the data type to force on the data
  • ‘copy’ takes the value True/False to indicate whether the input data is copied to a new memory area (default False in older pandas versions; newer versions choose based on the input type)
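A short sketch of how the data, index, and columns arguments fit together. The row labels "a", "b", "c" and the column names are made up for the illustration; the university names are the ones used elsewhere in these slides. It assumes pandas is installed.

```python
import pandas as pd

# Illustrative values (assumed).
data = [["MIT", 1], ["Stanford", 2], ["DHTL", 200]]

df = pd.DataFrame(
    data,
    index=["a", "b", "c"],        # row labels
    columns=["name", "rank"],     # column labels
)
print(df)
print(df.shape)

# .loc selects by row label and column label.
print(df.loc["b", "name"])

# A dict maps each column label to that column's values.
df2 = pd.DataFrame({"name": ["MIT", "DHTL"], "rank": [1, 200]})
print(list(df2.columns), list(df2.index))
```

When index is omitted, as in df2, the rows are labelled 0, 1, 2, ... automatically.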
Pandas

• Syntax:
pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)
• Where:
  • ‘data’ can accept the following data types: ndarray, series, map, lists, dict, constants and other dataframes
  • ‘items’ is axis = 0
  • ‘major_axis’ is axis = 1
  • ‘minor_axis’ is axis = 2
  • ‘dtype’ is the data type of each column
  • ‘copy’ takes the value True/False to determine whether the data shares memory or not
Pandas – Series

import pandas as pd
import numpy as np

chi_so = ["KT", "KT", "CNTT", "Co khi"]  # duplicated index labels
gia_tri = [310, 360, 580, 340]
S = pd.Series(gia_tri, index=chi_so)
print(S)
print(S.index)
print(S.values)

Output:
KT        310
KT        360
CNTT      580
Co khi    340
dtype: int64
Index(['KT', 'KT', 'CNTT', 'Co khi'], dtype='object')
[310 360 580 340]
Pandas – Series

Attributes and methods of a Series
• S.axes: returns a list of the indexes of S
• S.dtype: returns the data type of S's elements
• S.empty: returns True if S is empty
• S.ndim: returns the number of dimensions of S (1)
• S.size: returns the number of elements of S
• S.values: returns the elements of S as an array
• S.head(n): returns the first n elements of S
• S.tail(n): returns the last n elements of S
Pandas – Series

Operations on Series
import pandas as pd
import numpy as np

chi_so = ["Ke toan", "KT", "CNTT", "Co khi"]
gia_tri = [310, 360, 580, 340]
# Values with the same index label are added; labels found in only one Series give NaN
S = pd.Series(gia_tri, index=chi_so)
P = pd.Series([100, 100], ['CNTT', 'PM'])
Y = S + P
print(Y)

Output:
CNTT       680.0
Co khi       NaN
KT           NaN
Ke toan      NaN
PM           NaN
dtype: float64
Pandas - Frame

• Create dataframe from list
import pandas as pd

names_rank = [['MIT', 1], ["Stanford", 2], ["DHTL", 200]]
df = pd.DataFrame(names_rank)
print(df)

Output:
          0    1
0       MIT    1
1  Stanford    2
2      DHTL  200
Pandas - Panel

• Panels are widely used in econometrics
• The data has 3 axes:
  • Items (axis 0): each item is an internal dataframe
  • Major axis (axis 1 – main axis): rows
  • Minor axis (axis 2 – minor axis): columns
• No longer developed (replaced by MultiIndex)
scikit-learn (sklearn)
Basic machine learning problem classes

• Linear regression
• Data clustering
• Data classification
Linear regression

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import linear_model, metrics

# Read data from a CSV file
df = pd.read_csv("nguoi.csv", index_col=0)
print(df)

# Draw the figure (column Cao = height, column Nang = weight)
plt.plot(df.Cao, df.Nang, 'ro')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.show()
Linear regression

• Use the previous data, with an added gender column (Nam/Nu, i.e. male/female)
• Apply the same method to see how gender affects weight
Linear regression

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import linear_model, metrics

df = pd.read_csv("nguoi2.csv", index_col=0)
print(df)

# Encode the gender column as a number: 1 for 'Nam' (male), 0 otherwise
df['GT'] = df.Gioitinh.apply(lambda x: 1 if x == 'Nam' else 0)
print(df)
Linear regression

# Train the model
X = df.loc[:, ['Cao', 'GT']].values
y = df.Nang.values
model = linear_model.LinearRegression()
model.fit(X, y)

# Show the information of the model
mse = metrics.mean_squared_error(y, model.predict(X))
print("Mean squared error: ", mse)
print("Regression coefficient : ", model.coef_)
print("Intercept: ", model.intercept_)
print(f"[weight] = {model.coef_} x [height, sex] + {model.intercept_}")
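The CSV files used in these slides are not included here, so the following sketch repeats the same training steps on made-up data. The numbers and the rule used to generate the weights (0.5 × height + 5 × gender flag − 20) are assumptions chosen so that the fitted coefficients are easy to check; they are not taken from nguoi2.csv. It assumes numpy and scikit-learn are installed.

```python
import numpy as np
from sklearn import linear_model, metrics

# Made-up training data: columns are [height in cm, gender flag (1 or 0)].
X = np.array(
    [[150, 0], [160, 1], [170, 0], [180, 1], [165, 0], [175, 1]],
    dtype=float,
)
# Targets generated from an assumed exact rule: 0.5 * height + 5 * flag - 20.
y = 0.5 * X[:, 0] + 5.0 * X[:, 1] - 20.0

model = linear_model.LinearRegression()
model.fit(X, y)

# With noise-free data the fit recovers the rule used to build y.
print(model.coef_, model.intercept_)

# mean_squared_error takes the true values first, then the predictions.
mse = metrics.mean_squared_error(y, model.predict(X))
print(mse)

# predict expects a 2-D array: one row per sample.
prediction = model.predict([[172.0, 1.0]])
print(prediction)
```

With real measurements the error would not be close to zero, and the coefficient on the gender flag would estimate the average weight difference at equal height.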
Linear regression

# Apply the model to some cases
while True:
    x = float(input("Enter the height (0 to stop): "))
    if x <= 0:
        break
    print("Male with the height ", x, " cm, will have the weight ", model.predict([[x, 1]]))
    print("Female with the height ", x, " cm, will have the weight ", model.predict([[x, 0]]))
scikit-learn (sklearn)

• Data clustering
from sklearn.cluster import KMeans
• Data classification
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
Classification
Clustering
Exercises

Choose three of the following models:
- Linear regression
- Classification based on Naïve Bayes
- Classification based on SVM
- K-means clustering
- FCM clustering
Apply the selected models to 2 datasets taken from standard data sets.
