Python for Machine Learning
Why Python?
A large community
tons of machine learning specific libraries
easy to learn
Libraries like TensorFlow make Python the leading language in the data science community.
About Python
It is case sensitive
Indentation is significant: it controls the logic flow!
Use the # symbol to add a code comment
Use ''' ''' to comment a block of code
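A minimal sketch showing comments and indentation (the variable and values here are just for illustration):
# this is a single-line comment
'''
this is a
block comment
'''
x = 5
if x > 0:
    print("positive")      # the indented line is the body of the if-block
else:
    print("not positive")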
Prepare your machine
Install Python 2.7.xx
pip install pandas
pip install matplotlib scikit-learn statistics scipy seaborn ipython
pip install --no-cache-dir pystan # force re-download of an already-installed package in case of error
Get a dataset for training from:
http://archive.ics.uci.edu/ml/datasets.html
How to Run Python ML application
Open "IDLE (Python GUI)", then File → New File
Write your code in the new window, then Run → Run Module
Common Operations
1) Set dataset column names: data.columns = ['blue', 'green', 'red']
2) Get the value of any cell using the row index and column name/index: data.loc[2, 'red'] (the older data.ix accessor is deprecated)
3) Replace NaN with zero: data = data.replace(np.nan, 0)
4) Select a subset of columns into a new DataFrame: NewDF = data[['blue', 'red']]
5) Count rows that have data in a column: data['red'].value_counts().sum()
Missing values are denoted by NaN.
6) Count distinct values in a column: data['blue'].nunique()
7) Drop duplicate rows: data = data.drop_duplicates()
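A short runnable sketch tying these operations together (the toy values are invented for illustration):
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, np.nan, 3], [1, np.nan, 3], [4, 5, 6]])
data.columns = ['blue', 'green', 'red']   # 1) set column names
print(data.loc[2, 'red'])                 # 2) value of row 2, column 'red' -> 6
data = data.replace(np.nan, 0)            # 3) replace NaN with zero
NewDF = data[['blue', 'red']]             # 4) subset of columns
print(data['red'].value_counts().sum())   # 5) rows with data in 'red'
print(data['blue'].nunique())             # 6) distinct values in 'blue'
data = data.drop_duplicates()             # 7) drop duplicate rows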
Basic commands
Reading the Data from disk into Memory
import pandas as pd
data = pd.read_csv('examples/trip.csv')
data.describe() # get a data summary: count, mean, min and max values
Original DataFrame:
   a  b
1  0  0
2  1  0
3  0  1
4  1  1
Keep a row if ANY column contains a non-zero value: df[(df.T != 0).any()] → rows 2, 3, 4
Keep a row if ALL columns contain non-zero values: df[(df.T != 0).all()] → row 4
Drop the columns where any/all elements are NaN (omit axis=1 to drop rows instead)
data.dropna(axis=1, how='any')
Consider rows duplicates if Column1 and Column3 are the same; drop duplicates except the last row; inplace=True puts the result in the same dataframe
df.drop_duplicates(subset=['Column1', 'Column3'], keep='last', inplace=True)
Scatter and bar plots for a Data array
import numpy as np
import matplotlib.pyplot as plt
plt.scatter(Data[:, Column1], Data[:, Column2], marker="x", s=150, linewidths=5, zorder=1)
plt.show()
import pandas as pd
df = pd.DataFrame(Data)
df.plot.bar() # plot the same array as a bar chart
plt.show()
Basic commands
Sort, Filter, group, and Plotting the Data
data = data.sort_values(by='birthyear')
data = data[(data['birthyear'] >= 1931) & (data['birthyear'] <= 1999)]
groupby_birthyear = data.groupby('birthyear').size()
groupby_birthyear.plot.bar(title='Distribution of birth years', figsize=(15,4))
plt.show()
Stacked column chart
data1 = data.groupby(['birthyear', 'gender']).size().unstack('gender').fillna(0)
data1.plot.bar(title ='Distribution of birth years by Gender', stacked=True, figsize = (15,4))
Convert string to DateTime
data['AppointmentRegistration'] = pd.to_datetime(data['AppointmentRegistration'])
OR extract the parts manually:
data['year'] = data['AppointmentRegistration'].apply(lambda x: int(x.split('T')[0].split('-')[0]))
data['month'] = data['AppointmentRegistration'].apply(lambda x: int(x.split('T')[0].split('-')[1]))
data['day'] = data['AppointmentRegistration'].apply(lambda x: int(x.split('T')[0].split('-')[2]))
Notes
'abcde'[0:2] → 'ab'
'abcde'[:2] → 'ab'
'abcde'[:-1] → 'abcd'
Concepts
Mean : Average
Median: the middle data point when the number of observations is odd; the average of the
two middle points when it is even.
Mode: returns the observation in the dataset with the highest frequency.
Variance: represents variability of data points about the mean.
Standard deviation: just like variance, also captures the spread of data along
the mean. The only difference is that it is a square root of the variance.
Normal distribution (Gaussian distribution): the mean lies at the center of this
distribution with a spread (i.e., standard deviation) around it. Some 68% of the
observations lie within 1 standard deviation from the mean; 95% of the
observations lie within 2 standard deviations from the mean, whereas 99.7% of
the observations lie within 3 standard deviations from the mean.
Outliers: refer to values distinct from the majority of the observations. They
occur either naturally, due to equipment failure, or because of entry mistakes.
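A small sketch computing these statistics with NumPy and SciPy (the sample values are invented for illustration):
import numpy as np
from scipy import stats

sample = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print(np.mean(sample))    # mean (average) -> 5.0
print(np.median(sample))  # median -> 4.5
print(stats.mode(sample)) # mode -> 4 (the highest-frequency value)
print(np.var(sample))     # variance -> 4.0
print(np.std(sample))     # standard deviation -> 2.0 (square root of the variance)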
Plot mean
Draw the mean of 'tripduration' grouped by 'starttime_date'
data.groupby('starttime_date')['tripduration'].mean().plot.bar(title='Distribution of Trip duration by date', figsize=(15,4))
data = pd.read_csv('examples/concrete_data.csv')
print(len(data))
data.head()
plt.figure(figsize=(15,10.5))
plot_count = 1
for feature in list(data.columns)[:-1]:
    plt.subplot(3, 3, plot_count)
    plt.scatter(data[feature], data['concrete_strength'])
    plt.xlabel(feature.replace('_', ' ').title())
    plt.ylabel('Concrete strength')
    plot_count += 1
plt.show()
# Get the correlation between the data columns
pd.set_option('display.width', 100)
pd.set_option('precision', 3)
correlations = data.corr(method='pearson')
print(correlations)
Dimensionality Reduction
PCA Dimensionality Reduction
Used when we have a big dataset and need to reduce the number of columns for fast
training (remove redundant features).
Example: the dataset (data) contains 20 features and we need to reduce them to 4 features (see the sketch below).
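A minimal scikit-learn sketch of this reduction (assuming data is a numeric dataset with 20 feature columns; the names are illustrative):
from sklearn.decomposition import PCA

pca = PCA(n_components=4)             # keep 4 components out of the 20 features
reduced = pca.fit_transform(data)     # reduced has shape (n_rows, 4)
print(pca.explained_variance_ratio_)  # how much variance each component retains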
Association Rules (Apriori Algorithm)
1) Support: how frequently the itemset appears in the transactions.
2) Confidence: how often the rule holds among transactions that contain item X.
3) Lift: This says how likely item Y is purchased when item X is purchased (relative to Y's general popularity).
from apyori import apriori
transactions = [
['beer', 'nuts'],
['beer', 'cheese'],
]
results = list(apriori(transactions))
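To inspect what apriori found (a sketch; apyori returns RelationRecord objects carrying items and support fields):
for record in results:
    print(list(record.items), record.support)  # itemset and how often it appears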
Association Rules (Example)
pip install apyori
Algorithms
1) Autoregressive Forecast Model: uses observations at previous time steps to predict observations at future time steps.
2) ARIMA Forecast Model: a linear model that predicts after removing trends and seasonality.
3) Prophet: a Facebook library for forecasting time series data; it is quick and gives very good results.
Related packages: pip install fbprophet
The input to Prophet is always a dataframe with two columns: ds and y.
The ds (datestamp) column must contain a date or datetime (either is fine).
The y column must be numeric and represents the measurement we wish to forecast.
It is good to use log(y) rather than the actual y to remove trends and noisy data.
FB Prophet
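A minimal forecasting sketch (assuming df is a dataframe with the ds and y columns described above; the 30-day horizon is illustrative):
from fbprophet import Prophet

m = Prophet()
m.fit(df)                                     # df must have the columns ds and y
future = m.make_future_dataframe(periods=30)  # extend 30 days past the history
forecast = m.predict(future)
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())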
COMPUTER VISION
Open CV for computer vision
To install OpenCV pip install opencv-python
Normal images are arrays with 4 values per pixel (Blue, Green, Red, and Alpha)
It is common to process images in grayscale for faster performance
Video is just a loop of images, so any code that processes images can process video too
adaptiveMethod
ADAPTIVE_THRESH_MEAN_C : threshold value is the mean of neighborhood area.
ADAPTIVE_THRESH_GAUSSIAN_C : threshold value is the weighted sum of neighborhood values where weights are a Gaussian window.
blockSize : An integer representing the size of the pixel neighborhood used to calculate the threshold value.
C : A double representing the constant used in both methods (subtracted from the mean or weighted mean).
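A small sketch of the call these parameters belong to, cv2.adaptiveThreshold (the file names are illustrative):
import cv2

img = cv2.imread('input.png', cv2.IMREAD_GRAYSCALE)  # adaptive thresholding works on grayscale
thresh = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 11, 2)  # blockSize=11, C=2
cv2.imwrite('output.png', thresh)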
OpenCV Video Stream Example
import cv2
#VideoStream1 = cv2.VideoCapture('c:/boxed-correct.avi') # read stream from an avi video file
VideoStream1 = cv2.VideoCapture(0) # read stream from the first connected video cam
while True:
    IsReturnFrame1, ImgFrame1 = VideoStream1.read()
    if not IsReturnFrame1: break
    GrayImgFrame1 = cv2.cvtColor(ImgFrame1, cv2.COLOR_BGR2GRAY)
    cv2.imshow('ImageTitle1', ImgFrame1)
    cv2.imshow('ImageTitle2', GrayImgFrame1)
    if cv2.waitKey(1) != -1: break  # stop on any key press; waitKey(0) would block on every frame
VideoStream1.release()
cv2.destroyAllWindows()
Image processing using OpenCV
[Figure: example images showing Original, Threshold_Color, template, and Result]
NLP
Named-entity recognition: extracts names of persons, organizations, locations, times, quantities, percentages, etc.
Example: "Jim bought 300 shares of Acme Corp. in 2006." → [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.
Part-of-speech tagging: identifies words as nouns, verbs, adjectives, adverbs, etc.
Lexicon
Gets words and their actual meanings in the context
Corpora
Bodies of text used for classification, e.g., medical journals, presidential speeches
Lemmatizing (stemming)
Returns the word to its root, e.g., (gone, went, going) → go
WordNet
Lists the different words that have the same meaning as the given word
NLTK Simple example
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."
SentencesArray = sent_tokenize(EXAMPLE_TEXT)  # split the text into sentences
print(SentencesArray)
ListOfAvailableStopWords= set(stopwords.words("english"))
print(ListOfAvailableStopWords)
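The imported word_tokenize can then be used to filter stop words out of the text (a sketch continuing the example above):
words = word_tokenize(EXAMPLE_TEXT)  # split the text into individual words
filtered = [w for w in words if w.lower() not in ListOfAvailableStopWords]
print(filtered)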
textblob
pip install textblob
python -m textblob.download_corpora
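A minimal TextBlob sketch (sentiment.polarity ranges from -1, negative, to 1, positive; the sample text is illustrative):
from textblob import TextBlob

blob = TextBlob("Python is awesome. The weather is great.")
print(blob.sentiment.polarity)  # > 0 means a positive sentiment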
Sentiment Analysis Example
We have a sample of 3000 random user comments from the imdb.com, amazon.com, and
yelp.com websites
http://archive.ics.uci.edu/ml/machine-learning-databases/00331/
Each comment has a score; the score is either 1 (positive) or 0 (negative)
Python Deep Learning
Common Deep learning fields
Main fields:
Computer vision
Speech recognition
Natural language processing
Visual examples:
Colorization of Black and White Images: https://www.youtube.com/watch?v=_MJU8VK2PI4
Adding Sounds To Silent Movies: https://www.youtube.com/watch?v=0FW99AQmMc8
Keras
Keras is a high-level API that runs on top of TensorFlow, CNTK, or Theano.
Deep Learning for Computer Vision (image processing)
3 common ways to detect objects
median based features
edge based features
threshold based features
How to use Neural Network Models in Keras
Five steps
1. Define Network.
2. Compile Network.
3. Fit Network.
4. Evaluate Network.
5. Make Predictions.
Step 1. Define Network
Neural networks are defined in Keras as a sequence of layers.
The first layer in the network must define the number of inputs to expect. For a Multilayer
Perceptron model this is specified by the input_dim attribute.
Example of a small Multilayer Perceptron model (2 inputs, one hidden layer with 5 neurons, 1 output):
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(5, input_dim=2))
model.add(Dense(1))
Re-written after adding activation functions:
model = Sequential()
model.add(Dense(5, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
Available Activation Functions
Optional; it acts like a filter on each layer's output, used to solve common predictive modeling problems and to get a significant boost in performance.
Sigmoid: used for Binary Classification (2 class), one neuron in the output layer. Whatever the input, it maps it to a value between 0 and 1.
Softmax: used for Multiclass Classification (>2 class), one output neuron per class value.
Linear: used for Regression, the number of neurons matching the number of outputs.
Tanh: whatever the input, it converts it to a number between -1 and 1.
ReLU: either 0 for a<0 or a for a>0; it removes negative values and passes positive values unchanged.
LeakyReLU: scales down negative values and passes positive values unchanged.
elu
selu
softplus
softsign
hard_sigmoid
PReLU
ELU
ThresholdedReLU
Step 2. Compile Network
Compiling requires specifying the optimization algorithm used to train the network and the loss function that the
optimization algorithm minimizes to evaluate the network.
model.compile(optimizer='sgd', loss='mse')
Optimizers are tools to minimize the loss between predictions and real values. Commonly used optimization algorithms:
'sgd' (Stochastic Gradient Descent): requires tuning of a learning rate and momentum.
ADAM: requires tuning of a learning rate.
RMSprop: requires tuning of a learning rate.
Loss functions:
Regression: Mean Squared Error or 'mse'.
Binary Classification (2 class): 'binary_crossentropy'.
Multiclass Classification (>2 class): 'categorical_crossentropy'.
Finally, you can also specify metrics to collect while fitting the model in addition to the loss function. Generally, the most useful
additional metric to collect is accuracy.
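For instance, a compile call for a binary classifier collecting accuracy (a sketch matching the choices above):
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])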
Step 3. Fit Network
The network is fit (trained) on the training data in batches, over a number of epochs; an epoch is one
pass over all training examples, and the batch size sets how many examples are processed per weight update.
Example: if you have 1000 training examples and your batch size is 500, then it will take 2
iterations to complete 1 epoch.
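A minimal fit call (the epoch and batch-size values here are illustrative):
history = model.fit(X, Y, epochs=100, batch_size=10)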
Step 4. Evaluate Network
The model evaluates the loss across all of the test patterns, as well as any other metrics
specified when the model was compiled.
For example, for a model compiled with the accuracy metric, we could evaluate it on
a new dataset as follows:
loss, accuracy = model.evaluate(X, Y)
print("Loss: %.2f, Accuracy: %.2f% %" % (loss, accuracy*100))
Step 5. Make Predictions
import numpy
probabilities = model.predict(X)
predictions = [float(round(x[0])) for x in probabilities] # each prediction is a 1-element array
accuracy = numpy.mean(predictions == Y) # count the number of True and divide by the total size
print("Prediction Accuracy: %.2f%%" % (accuracy*100))
Binary classification using Neural Network in Keras
Diabetes Data Set
Detect Diabetes Disease based on analysis
Dataset Attributes:
1. Number of times pregnant
2. Plasma glucose concentration
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
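Putting the five steps together, a minimal sketch for this dataset (the CSV file name and layer sizes are assumptions; the file is expected to hold the 9 columns above with the class variable last):
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')
X, Y = dataset[:, 0:8], dataset[:, 8]  # 8 input attributes, class variable last

model = Sequential()                   # 1. define
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])  # 2. compile
model.fit(X, Y, epochs=150, batch_size=10)  # 3. fit
loss, accuracy = model.evaluate(X, Y)       # 4. evaluate
print("Accuracy: %.2f%%" % (accuracy * 100))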
Save prediction model
After we train our model, i.e., model.fit(X_train, Y_train), we can save this training to use later.
This task can be done with the Pickle package (Python Object Serialization Library), using its dump
and load methods. Pickle can save any object, not just the prediction model.
import pickle
…………….
model.fit(X_train, Y_train)
# save the model to disk
pickle.dump(model, open("c:/data.dump", 'wb')) # wb = write bytes
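Loading the saved model back later (a sketch; the path matches the dump call above and X_test is assumed to exist):
# load the model from disk and reuse it
loaded_model = pickle.load(open("c:/data.dump", 'rb')) # rb = read bytes
predictions = loaded_model.predict(X_test)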