
Introduction to Machine Learning with Scikit-Learn


Types of Algorithms by Output
Input training data is used to fit a model, which is then
used to predict outputs for new, incoming inputs.

Type of Output                                 Algorithm Category
Output is one or more discrete classes         Classification (supervised)
Output is continuous                           Regression (supervised)
Output is membership in a similar group        Clustering (unsupervised)
Output is the distribution of inputs           Density Estimation
Output is simplified from higher dimensions    Dimensionality Reduction


Classification
Given labeled input data (with two or more labels), fit a
function that can determine, for any input, what its label is.

Regression
Given continuous input data, fit a function that can predict
a continuous output value given other input data.

Clustering
Given data, determine a pattern of associated data points,
or clusters, via their similarity or distance from one another.
Hadley Wickham (2015): “Model” is an overloaded term.
• Model family describes, at the broadest possible level,
  the connection between the variables of interest.
• Model form specifies exactly how the variables of interest
  are connected within the framework of the model family.
• A fitted model is a concrete instance of the model form
  where all parameters have been estimated from data, and
  the model can be used to generate predictions.

http://had.co.nz/stat645/model-vis.pdf
Dimensions and Features
In order to do machine learning you need a data set containing
instances (examples) that are composed of features, from which
you compose dimensions.

Instance: a single data point or example composed of fields
Feature: a quantity describing an instance
Dimension: one or more attributes that describe a property

from sklearn.datasets import load_digits

digits = load_digits()

X = digits.data    # X.shape == (n_samples, n_features)
y = digits.target  # y.shape == (n_samples,)
Feature Space
Feature space refers to the n-dimensions where your variables live (not
including a target variable or class). The term is used often in ML literature
because in ML all variables are features (usually) and feature extraction is the
art of creating a space with decision boundaries.

Target
1. Y ≡ Thickness of car tires after some testing period

Variables
1. X1 ≡ distance travelled in test
2. X2 ≡ time duration of test
3. X3 ≡ amount of chemical C in tires

The feature space is R³, or more accurately the positive orthant of R³,
since all the X variables can only be positive quantities.

http://stats.stackexchange.com/questions/46425/what-is-feature-space
Mappings
Domain knowledge about tires might suggest that the speed the vehicle was
moving at is important, hence we generate another variable, X4 (this is the
feature extraction part):

X4 = X1/X2 ≡ the average speed of the vehicle during testing.

This extends our old feature space into a new one, the positive part of R⁴.

A mapping is a function, ϕ, from R³ to R⁴:

ϕ(x1, x2, x3) = (x1, x2, x3, x1/x2)

http://stats.stackexchange.com/questions/46425/what-is-feature-space
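As a sketch of this feature-extraction step with NumPy (the numbers are made up):

import numpy as np

# Hypothetical tire-test data: rows are instances, columns are
# X1 (distance travelled), X2 (test duration), X3 (amount of chemical C)
X = np.array([
    [1200.0, 10.0, 0.3],
    [ 800.0,  8.0, 0.5],
    [1500.0, 12.0, 0.2],
])

# The mapping phi: R^3 -> R^4 appends X4 = X1 / X2 (average speed)
X4 = (X[:, 0] / X[:, 1]).reshape(-1, 1)
X_mapped = np.hstack([X, X4])  # shape (3, 4)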
Your Task
Given a data set of N instances, build a model
that is fit from the data by extracting features
and dimensions. Then use that model to predict
outcomes …
1. Data Wrangling (normalization, standardization, imputing;
   see the sketch after this list)
2. Feature Analysis/Extraction
3. Model Selection/Building
4. Model Evaluation
5. Operationalize Model
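A minimal sketch of step 1 with scikit-learn (the toy matrix and the choice of mean imputation plus standardization are illustrative):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy matrix with a missing value to impute
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

# Impute missing values with the column mean, then standardize
# each feature to zero mean and unit variance
wrangle = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
X_clean = wrangle.fit_transform(X)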
A Tour of Machine Learning
Algorithms
Models: Instance Methods
Compare instances in the data set with a similarity
measure to find the best matches.
- Suffers from the curse of dimensionality.
- Focus on feature representation and on
  similarity metrics between instances

● k-Nearest Neighbors (kNN)
● Self-Organizing Maps (SOM)
● Learning Vector Quantization (LVQ)
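A minimal kNN sketch on the digits data (k=5 and Euclidean distance are the scikit-learn defaults used here):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Classify each test point by majority vote among its 5 nearest training neighbors
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))  # mean accuracy on held-out data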
Self-Organizing Maps
Models: Regression
Model the relationship of independent variables X
to a dependent variable Y by iteratively minimizing
the error made in predictions.

● Ordinary Least Squares
● Logistic Regression
● Stepwise Regression
● Multivariate Adaptive Regression Splines (MARS)
● Locally Estimated Scatterplot Smoothing (LOESS)
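A minimal ordinary least squares sketch on synthetic data (the true slope and intercept are chosen for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y ~ 3*x + 1 plus Gaussian noise
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 1 + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)
print(ols.coef_, ols.intercept_)  # recovered slope and intercept, near 3 and 1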
Logistic Regression
Multivariate Adaptive Regression Splines (MARS)
Locally Estimated Scatterplot Smoothing (LOESS)
• Combines multiple regression models in a
  k-nearest-neighbor-based meta-model
• Fits a low-degree polynomial to a subset of
  the data close to the current point
• Requires fairly large, densely sampled data
  sets in order to produce good models
Models: Regularization Methods
Extend another method (usually regression) by
penalizing complexity to minimize overfitting.
- simple, popular, powerful
- better at generalization

● Ridge Regression
● LASSO (Least Absolute Shrinkage & Selection Operator)
● Elastic Net
Models: Regularization Methods
LASSO
• Limits the total absolute weight of the parameters
• Can be interpreted as a Laplace prior distribution on the parameters

• Ridge regression: quadratic penalty
  (equivalent to a Gaussian prior)

• Elastic Net combines both penalties (both priors)
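A minimal sketch, reusing the synthetic X and y from the OLS example above; all three estimators share the same fit/predict API and differ only in the penalty (alpha=0.1 is illustrative):

from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Ridge: quadratic (L2) penalty, Gaussian prior on weights
# Lasso: absolute (L1) penalty, Laplace prior, drives weights to exactly zero
# ElasticNet: weighted combination of both penalties
for Model in (Ridge, Lasso, ElasticNet):
    model = Model(alpha=0.1).fit(X, y)
    print(Model.__name__, model.coef_)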
Models: Decision Trees
Model of decisions based on data attributes.
Predictions are made by following forks in a
tree structure until a decision is made. Used for
classification & regression.

● Classification and Regression Trees (CART)
● Decision Stump
● Random Forest
● Multivariate Adaptive Regression Splines (MARS)
● Gradient Boosting Machines (GBM)
Models: Decision Trees

http://www.saedsayad.com/decision_tree.htm
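A minimal sketch on the iris data; export_text prints the learned forks so the decision path is visible (max_depth=2 is illustrative):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree))  # the learned forks, one test per line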
Models: Bayesian
Explicitly apply Bayes’ Theorem to classification
and regression tasks, usually by fitting a probability
function constructed via the chain rule together with
a naive conditional-independence simplification.

● Naive Bayes
● Averaged One-Dependence Estimators (AODE)
● Bayesian Belief Network (BBN)
Naive Bayes

- Used in text retrieval since the 1960s
- Assumes independence of feature values (given the class)
- Built on Bayes’ theorem

Probability distribution (the standard naive Bayes factorization):

P(y | x1, …, xn) ∝ P(y) · ∏i P(xi | y)

- Prediction votes for the class with maximum posterior probability (MAP)
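A minimal sketch with GaussianNB on the digits data (the Gaussian likelihood is an assumption here; text applications typically use a multinomial variant):

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
# Assumes pixel intensities are conditionally independent given the digit
# class; prediction picks the class with maximum posterior probability (MAP)
print(cross_val_score(GaussianNB(), X, y, cv=5).mean())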


Models: Kernel Methods
Map input data into a higher-dimensional vector
space where the problem is easier to model.
Named after the “kernel trick”, which computes
inner products between the images of data pairs
without constructing that space explicitly.

● Support Vector Machines (SVM)
● Radial Basis Function (RBF)
● Linear Discriminant Analysis (LDA)
SVM
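As a sketch, a support vector classifier with an RBF kernel on the digits data (the gamma value is illustrative):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel never builds the high-dimensional space explicitly; it only
# evaluates inner products k(x, x') = exp(-gamma * ||x - x'||^2)
clf = SVC(kernel="rbf", gamma=0.001).fit(X_train, y_train)
print(clf.score(X_test, y_test))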
Models: Clustering Methods
Organize data into groups whose members share
maximum similarity (usually defined by a distance
metric). Two main approaches: centroids and
hierarchical clustering.

● k-Means
● Affinity Propagation
● OPTICS (Ordering Points to Identify Cluster Structure)
● Agglomerative Clustering
K-means clustering
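A minimal k-means sketch on the digits data (choosing k=10 because there are ten digit classes):

from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)  # labels ignored: clustering is unsupervised
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)  # (10, 64): one centroid per cluster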
Models: Artificial Neural Networks
Inspired by biological neural networks, ANNs are
nonlinear function approximators that estimate
functions with a large number of inputs.
- System of interconnected neurons that activate in response to inputs
- Deep learning extends simple networks by stacking them into deeper architectures

● Restricted Boltzmann Machines (RBM)
● Convolutional Neural Networks (CNN)
● Recurrent Neural Networks (RNN)
● Word2Vec models
Models: Artificial Neural Networks
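A minimal sketch with scikit-learn's built-in multi-layer perceptron (the layer size and iteration count are illustrative; deep CNN/RNN work needs dedicated frameworks):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 64 neurons; max_iter kept small for speed,
# so a convergence warning is possible
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))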
Models: Ensembles
Models composed of multiple weak models that
are trained independently and whose outputs
are combined to make an overall prediction.

● Boosting
● Bootstrapped Aggregation (Bagging)
● AdaBoost
● Stacked Generalization (blending)
● Gradient Boosting Machines (GBM)
● Random Forest
AdaBoost
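A minimal sketch contrasting a bagging-style and a boosting ensemble on the digits data (hyperparameters are illustrative):

from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: many decorrelated trees trained on bootstrap samples, votes averaged
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
# Boosting: weak learners added sequentially, each reweighting the previous errors
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(rf.score(X_test, y_test), ada.score(X_test, y_test))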
Models: Other
The list above is not comprehensive; other
algorithm and model classes include:
● Conditional Random Fields (CRF)
● Markovian Models (HMMs)
● Dimensionality Reduction (PCA, PLS)
● Rule Learning (Apriori, Brill)
● More ...
What is Scikit-Learn?
Extensions to SciPy (Scientific Python) are
called SciKits. SciKit-Learn provides machine
learning algorithms.
● Algorithms for supervised & unsupervised learning
● Built on SciPy and Numpy
● Standard Python API interface
● Sits on top of C libraries: LAPACK, LibSVM, and Cython
● Open Source: BSD license (packaged in most Linux distributions)

Probably the best general ML framework out there.


Primary Features
- Generalized Linear Models
- SVMs, kNN, Bayes, Decision Trees, Ensembles
- Clustering and Density algorithms
- Cross Validation
- Grid Search
- Pipelining
- Model Evaluations
- Dataset Transformations
- Dataset Loading
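Several of these features compose; a minimal sketch combining pipelining, grid search, and cross validation on the digits data (the parameter grid is illustrative):

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Pipelining, grid search, and cross validation composed into one estimator
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)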
A Guide to Scikit-Learn
Scikit-Learn API
Object-oriented interface centered around the
concept of an Estimator:
“An estimator is any object that learns from data; it may
be a classification, regression or clustering algorithm or
a transformer that extracts/filters useful features from
raw data.”

- Scikit-Learn Tutorial
class Estimator(object):

    def fit(self, X, y=None):
        """Fits estimator to data."""
        # set state of ``self``
        return self

    def predict(self, X):
        """Predict response of ``X``."""
        # compute predictions ``pred``
        return pred

The Scikit-Learn Estimator API


Estimators
- fit(X, y) sets the state of the estimator.
- X is usually a 2D numpy array of shape
  (n_samples, n_features).
- y is a 1D array with shape (n_samples,)
- predict(X) returns the class or value
- predict_proba() returns a 2D array of
  shape (n_samples, n_classes)
from sklearn import svm

estimator = svm.SVC(gamma=0.001)
estimator.fit(X, y)
estimator.predict(x)

Basic methodology
Wrapping fit and predict
We’ve already discussed a broad workflow; the
following is a development workflow:

Raw Data → Load & Transform Data → Feature Extraction →
Feature Evaluation → Build Model → Evaluate Model
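A minimal sketch of that loop with scikit-learn (SVC on the digits data, as in the earlier example):

from sklearn.datasets import load_digits
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load & transform data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Build model, then evaluate it on held-out data
model = SVC(gamma=0.001).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))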
Task 6
- Select a dataset (wines / student performance)
- Apply various learning algorithms to the problem
- Provide the best possible prediction w.r.t. RMSE
  (see the sketch below)

Best bets to start with:
- Student performance: random forest / boosted trees
- Wines: MARS / LASSO / bagged linear models
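A hedged starting-point sketch, assuming X and y have already been loaded from the chosen dataset; cross_val_score with the RMSE scorer reports the target metric directly:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Assumes X, y are loaded from the chosen dataset (wines / student performance)
scores = cross_val_score(
    RandomForestRegressor(random_state=0), X, y,
    scoring="neg_root_mean_squared_error", cv=10,
)
print(-scores.mean())  # cross-validated RMSE; lower is better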
Semester Project – Variant 2
• Drug-target interaction prediction
  – Prediction of interactions between drug compounds and proteins
  – DTINet dataset
    • https://github.com/luoyunan/DTINet
  – Known interactions (binary)
  – Structural similarity of both drugs and targets, in [0, 1]
  – Mappings to diseases and side effects
  – Evaluation:
    • The goal is ranking prediction (ordering objects from best to worst)
    • 10-fold cross-validation (10% of the interactions are hidden at random; your task is to
      rank them as well as possible; the evaluation source code will be part of the assignment)
    • Area Under ROC Curve, Area Under Precision-Recall Curve
  – The submitted solution must beat the basic baseline: BLM
    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735674/ (source code will be
    available)
  – Typical algorithms:
    • Factorization of the interaction matrix (optionally enriched with external data)
    • Nearest neighbors and graph algorithms
    • Local models predicting edges from the properties of the interacting drug and target
