Machine Learning
Rahul Bhimakari
Machine learning has its roots in the field of artificial intelligence (AI), and its development was
inspired by the human brain's ability to learn from experience. Researchers aimed to develop
computational models that could imitate this learning process.
Machine learning is a type of artificial intelligence (AI) that enables computers to learn and
make decisions without being explicitly programmed. In basic terms, it's like teaching a
computer to recognize patterns and make predictions based on data.
Relationship with AI
Dependence on data
The fundamental idea behind machine learning being data-driven is that the learning process involves analyzing data to identify patterns, make predictions, or perform specific tasks without being explicitly programmed. The process typically involves:
• Training on Data
• Learning Patterns
• Generalization to New Data
• Iterative Improvement
Data-Driven Decision-Making
Data analysis is often a precursor to machine learning, helping to understand the structure and
characteristics of the data. Machine learning, in turn, uses these insights to build models that
can make predictions, classify data, or automate decision-making processes. The relationship
between the two emphasizes the importance of a holistic approach when working with data to
extract valuable information and derive actionable insights.
How is machine learning different from traditional programming?
Machine learning and traditional programming represent two different approaches to solving problems with computers. Key differences include:
• Problem Complexity
• Programming Paradigm
Types of algorithms
Supervised Learning Algorithms
Description: These algorithms learn from labeled training data, where the input features
are paired with corresponding target labels. The goal is to learn a mapping from inputs to
outputs.
Unsupervised Learning Algorithms
Description: Unsupervised learning algorithms operate on unlabeled data to find hidden
patterns or structures. They are used for tasks such as clustering and dimensionality
reduction.
Reinforcement Learning Algorithms
Description: Reinforcement learning involves an agent learning to make decisions by interacting
with an environment. The agent receives feedback in the form of rewards or penalties, guiding
its learning process.
Semi-Supervised Learning Algorithms
Description: These algorithms leverage a combination of labeled and unlabeled data for
training. They are particularly useful when acquiring labeled data is expensive or time-
consuming.
Regression
Linear Regression is a statistical supervised learning technique used to predict a quantitative
dependent variable by forming a linear relationship with one or more independent features.
Assumptions
* The independent variables should be linearly related to the dependent variable.
* The variance of the residuals should be the same throughout the data. This is known as
homoscedasticity.
Multiple Linear Regression is the most common form of linear regression analysis. As a
predictive analysis, multiple linear regression is used to explain the relationship between
one continuous dependent variable and two or more independent variables.
The independent variables can be continuous or categorical (dummy coded as appropriate).
We often use Multiple Linear Regression for predictive analysis because the data we get
usually has more than one independent feature.
The formula can be represented as Y = m1X1 + m2X2 + m3X3 + … + b, where each mi is the coefficient of feature Xi and b is the intercept.
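As a concrete illustration of this formula, here is a minimal sketch of fitting a multiple linear regression with scikit-learn. The feature matrix and target values are made up purely for demonstration.

```python
# Minimal multiple linear regression sketch using scikit-learn (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression

# Two independent features (X1, X2) and one continuous dependent variable y.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.0, 4.5, 10.0, 9.5, 13.0])

model = LinearRegression()
model.fit(X, y)

print("Coefficients (m1, m2):", model.coef_)   # one slope per feature
print("Intercept (b):", model.intercept_)
print("Prediction for [6, 6]:", model.predict([[6.0, 6.0]]))
```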
The main aim of gradient descent is to find the parameters of a model that give the highest
accuracy on both the training and testing datasets. In gradient descent, the gradient is a
vector that points in the direction of the steepest increase of the function at a specific point.
Moving in the opposite direction of the gradient allows the algorithm to gradually descend
towards lower values of the function, eventually reaching its minimum.
At the optimum, the gradient should be close to zero. Therefore, if the gradient is very small, it
may indicate that the algorithm is close to the optimal solution.
Adaptive learning rate schedules can be employed to dynamically adjust the learning rate
during training. Some techniques, like learning rate annealing or learning rate decay, can help
fine-tune the learning process and enhance convergence.
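To make these ideas concrete, here is a small sketch of gradient descent on a toy one-dimensional function, including a simple learning-rate decay as one possible form of the annealing mentioned above. The function, starting point, and decay factor are illustrative choices, not prescribed values.

```python
# Gradient descent sketch for f(w) = (w - 3)^2, whose minimum is at w = 3.
# The gradient is f'(w) = 2 * (w - 3); we repeatedly step in the opposite direction.

def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0                 # initial parameter guess
learning_rate = 0.1     # step size
decay = 0.99            # simple learning-rate decay (one form of annealing)

for step in range(200):
    grad = gradient(w)
    if abs(grad) < 1e-6:          # near-zero gradient -> close to the optimum
        break
    w -= learning_rate * grad     # move against the gradient
    learning_rate *= decay        # shrink the step size over time

print(w)  # approximately 3.0
```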
Classification
In machine learning, classification refers to a predictive modeling problem where a
class label is predicted for a given example of input data. Examples of classification
problems include classifying whether an email is spam or not, or, given a handwritten
character, classifying it as one of the known characters.
For example, suppose we have two classes, Class 0 and Class 1. If the value of the logistic function for an
input is greater than 0.5 (the threshold value), then the input belongs to Class 1; otherwise it belongs to
Class 0. It's referred to as regression because it is an extension of linear regression, but it is mainly used for
classification problems. The difference between linear regression and logistic regression is that
linear regression outputs a continuous value that can be anything, while logistic regression
predicts the probability that an instance belongs to a given class or not.
Sigmoid Function
Logistic Regression relies on the logistic function to convert the output into a probability score.
This score represents the probability that an observation belongs to a particular class. The S-
shaped curve assists in thresholding and categorising data into binary outcomes.
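The sketch below shows the sigmoid (logistic) function and the 0.5 threshold described above; the input score z is a hypothetical linear combination of features, chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear score (e.g. w·x + b) for one observation.
z = 1.2
probability = sigmoid(z)   # ~0.77

# Threshold at 0.5: Class 1 if probability > 0.5, otherwise Class 0.
predicted_class = 1 if probability > 0.5 else 0
print(probability, predicted_class)
```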
Precision
Ability to avoid false positives
Precision = True Positives / (True Positives + False Positives)
Recall
Ability to capture all positive instances
Recall = True Positives / (True Positives + False Negatives)
F1-Score
F1 Score = 2 * Precision * Recall / (Precision + Recall)
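The following sketch computes all three metrics from hypothetical confusion-matrix counts, just to show how the formulas above fit together.

```python
# Precision, recall, and F1 from raw counts (hypothetical confusion-matrix values).
true_positives = 40
false_positives = 10
false_negatives = 20

precision = true_positives / (true_positives + false_positives)   # 0.8
recall = true_positives / (true_positives + false_negatives)      # ~0.667
f1 = 2 * precision * recall / (precision + recall)                # ~0.727

print(precision, recall, f1)
```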
Natural Language Processing
Natural Language Processing (NLP) is a subfield of artificial intelligence that deals with the
interaction between computers and humans in natural language. It involves the use of
computational techniques to process and analyze natural language data, such as text and
speech, with the goal of understanding the meaning behind the language.
NLP techniques are widely used in a variety of applications such as search engines,
machine translation, sentiment analysis, text summarization, question answering,
and many more.
The goal of NLP is to develop algorithms and models that enable computers to understand,
interpret, generate, and manipulate human languages.
Fundamental Concepts in NLP
Tokenization
Tokenization is the process of breaking down a text into smaller units called tokens. Tokens are
the basic building blocks, which can be words, phrases, or even characters, depending on the
level of granularity required.
Tokenization is a crucial initial step in NLP, providing a structured way to analyze and
understand textual data. It helps convert unstructured text into a format that can be easily
processed by algorithms.
For the sentence "The quick brown fox jumps over the lazy dog," tokenization would result in
individual tokens such as "The," "quick," "brown," "fox," "jumps," "over," "the," "lazy," "dog."
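A minimal sketch of word-level tokenization for that sentence, using plain whitespace splitting; dedicated libraries such as NLTK or spaCy handle punctuation and edge cases more robustly.

```python
# Simple word-level tokenization of the example sentence.
sentence = "The quick brown fox jumps over the lazy dog"
tokens = sentence.split()   # whitespace splitting; NLTK/spaCy tokenizers are more robust
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```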
Stemming
Stemming is a process in which words are reduced to their root or base form by removing
suffixes. The goal is to bring related words to a common base form, simplifying analysis and
improving information retrieval.
Stemming helps in reducing the dimensionality of the feature space and treating variations of
words as a single entity. This can be beneficial in tasks like document clustering and search
engines. Also, in tasks like classification, the tense of a word is rendered irrelevant once
stemming is applied.
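A short sketch using NLTK's PorterStemmer (this assumes the nltk package is installed); the word list is illustrative, and note that stems need not be dictionary words.

```python
# Stemming with NLTK's PorterStemmer (assumes nltk is installed).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runner", "ran", "studies", "studying"]
print([stemmer.stem(w) for w in words])
# e.g. ['run', 'runner', 'ran', 'studi', 'studi'] -- stems need not be real words
```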
Lemmatization
Lemmatization is a more advanced form of reducing words to their base form, known
as a lemma. Unlike stemming, lemmatization considers the meaning of words and
aims to transform them into their dictionary or canonical form.
The lemma for words like "running," "runner," and "ran" is "run." However, lemmatization would
also consider the lemma of "better" to be "good."
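A minimal sketch using NLTK's WordNetLemmatizer, assuming nltk is installed and its WordNet data can be downloaded; the part-of-speech tags passed in are needed for the verb and adjective examples above.

```python
# Lemmatization with NLTK's WordNetLemmatizer (assumes nltk and WordNet data are available).
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # one-time download of the WordNet data
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("running", pos="v"))   # 'run'
print(lemmatizer.lemmatize("ran", pos="v"))       # 'run'
print(lemmatizer.lemmatize("better", pos="a"))    # 'good'
```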
Padding
For tasks like text classification or sentiment analysis, it's common to represent
sentences as sequences of word embeddings. However, these sentences may have
different lengths.
Padding is applied by adding zeros or a special token to the end of shorter sequences, making
all sequences equal in length. This ensures that batches of input data fed into a neural network
have consistent dimensions.
Example: If you have sentences "I love NLP" and "It's fascinating," and you're representing
them as sequences of word embeddings, you might pad the first sentence with zeros to match
the length of the second sentence.
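Here is a plain-Python sketch of zero-padding; the token IDs are made-up placeholders standing in for word embeddings or vocabulary indices.

```python
# Zero-padding token-ID sequences to a common length (illustrative IDs).
sequences = [[4, 18, 7], [12, 5]]          # two sentences encoded as word IDs
max_length = max(len(seq) for seq in sequences)

# Append zeros to every shorter sequence so all sequences share the same length.
padded = [seq + [0] * (max_length - len(seq)) for seq in sequences]
print(padded)   # [[4, 18, 7], [12, 5, 0]]
```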
Pruning
In applications like text classification or language modeling, the learned embeddings or weights
may contain information that is not crucial for the model's performance.
Pruning involves identifying and removing less important connections, neurons, or even entire
layers from the neural network, resulting in a more compact and efficient model.
In a sentiment analysis model, during training, certain words may end up having very low
weights, indicating they contribute less to the overall sentiment prediction. Pruning involves
removing or reducing the influence of these less important words, streamlining the model.
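One simple flavor of this idea is magnitude-based pruning: weights whose absolute value falls below a threshold are zeroed out. The sketch below uses made-up weight values and an arbitrary threshold purely for illustration.

```python
# Magnitude-based weight pruning sketch: zero out weights below a threshold.
import numpy as np

weights = np.array([0.80, -0.02, 0.45, 0.01, -0.60, 0.03])   # illustrative weights
threshold = 0.05

mask = np.abs(weights) >= threshold   # keep only the larger-magnitude connections
pruned_weights = weights * mask
print(pruned_weights)   # [ 0.8  -0.    0.45  0.   -0.6   0.  ]
```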
Stopwords
Stopwords are common words that are often removed from text during the pre-processing
phase in natural language processing (NLP). These words are generally the most frequently
occurring words in a language and are often considered to be of little value in terms of
conveying specific meaning. Including them in certain NLP tasks may introduce noise and
adversely affect the performance of models.
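A minimal sketch of stopword removal using a small hand-picked stopword set; libraries such as NLTK ship much fuller stopword lists.

```python
# Removing stopwords with a small hand-picked stopword set (NLTK provides a fuller list).
stopwords = {"the", "is", "a", "of", "and", "to", "in"}

tokens = ["the", "quick", "brown", "fox", "is", "in", "the", "garden"]
filtered = [t for t in tokens if t.lower() not in stopwords]
print(filtered)   # ['quick', 'brown', 'fox', 'garden']
```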
In a practical NLP pipeline, you might first tokenize a document, then apply
stemming or lemmatization to normalize the words, creating a more consistent and
manageable dataset for analysis or model training.
Word Embeddings
Machine learning models take vectors (arrays of numbers) as input. When working with text, the
first thing you must do is come up with a strategy to convert strings to numbers (or to
"vectorize" the text) before feeding it to the model. In this section, you will look at three
strategies for doing so.
The three strategies are One-Hot Encoding, TF-IDF, and Word2Vec.
One-Hot Encoding
Each word is represented as a sparse vector with a 1 in the position corresponding
to the word's index in the vocabulary and 0s elsewhere.
This approach is inefficient. A one-hot encoded vector is sparse (meaning, most indices
are zero). Imagine you have 10,000 words in the vocabulary. To one-hot encode each
word, you would create a vector where 99.99% of the elements are zero.
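A tiny sketch of one-hot encoding over a three-word vocabulary; the vocabulary is invented for illustration, but it makes the sparsity problem easy to see.

```python
# One-hot encoding over a tiny illustrative vocabulary.
import numpy as np

vocabulary = ["cat", "dog", "fish"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    vector = np.zeros(len(vocabulary))
    vector[word_to_index[word]] = 1.0   # 1 at the word's index, 0 everywhere else
    return vector

print(one_hot("dog"))   # [0. 1. 0.]
```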
TF-IDF
TF-IDF vectorization involves calculating the TF-IDF score for every word in your
corpus relative to each document and then putting that information into a vector.
Thus each document in your corpus has its own vector, and the vector has a TF-IDF score
for every single word in the entire collection of documents. Once you have these
vectors, you can apply them to various use cases, such as checking whether two documents
are similar by comparing their TF-IDF vectors using cosine similarity.
In TF-IDF, the term frequency (TF) component measures how often a term occurs in a
document, while the inverse document frequency (IDF) component penalizes terms that are
common across many documents. The resulting TF-IDF score reflects the importance of a term
in a specific document within a larger corpus.
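A minimal sketch of the document-similarity use case described above, assuming a reasonably recent scikit-learn; the two toy documents are invented for illustration.

```python
# TF-IDF vectors and cosine similarity with scikit-learn (two toy documents).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "the car is driven on the road",      # document A
    "the truck is driven on the highway", # document B
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)   # one TF-IDF vector per document

print(vectorizer.get_feature_names_out())                    # vocabulary across the corpus
print(cosine_similarity(tfidf_matrix[0], tfidf_matrix[1]))   # similarity between A and B
```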
Word2Vec
Word2Vec is a popular technique in natural language processing (NLP) that is used to represent
words as vectors in a continuous vector space. Developed by researchers at Google, Word2Vec
captures semantic relationships between words by representing them as dense vectors in a
high-dimensional space. The key idea behind Word2Vec is that words with similar meanings or
contexts are mapped to similar vectors in this space.
Architectures:
Continuous Bag of Words (CBOW):
CBOW predicts the target word (center word) based on its surrounding context (context
words). The model is trained to predict the target word given a window of context words.
The architecture uses a neural network with a hidden layer to learn the word
embeddings.
Skip-gram:
Skip-gram, on the other hand, predicts the context words based on a given target word.
The model is trained to predict the context words given a target word.
Like CBOW, it uses a neural network with a hidden layer.
In both architectures, the training objective is to adjust the model's parameters (word vectors) so that the
predicted words match the actual context words as closely as possible. The word vectors obtained after
training are dense and capture semantic relationships, making them suitable for various NLP tasks.
Context-Target Pairs:
Create context-target pairs, where the context consists of nearby words and the target is
the word to be predicted.
Neural Network Training:
Train a neural network (CBOW or Skip-gram) on the context-target pairs to learn word
embeddings.
Word Embeddings:
Extract the learned word embeddings from the neural network.
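Putting these steps together, here is a small sketch of training Word2Vec with the gensim library (assuming gensim 4.x is installed); the corpus, vector size, and window are toy values chosen only to show the API shape.

```python
# Training a small Word2Vec model with gensim (assumes gensim >= 4.0 is installed).
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["i", "love", "natural", "language", "processing"],
    ["word", "embeddings", "capture", "semantic", "relationships"],
    ["i", "love", "word", "embeddings"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # context window size
    min_count=1,      # keep every word in this tiny corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
)

vector = model.wv["love"]                # dense embedding for a word
similar = model.wv.most_similar("love")  # nearest words in the vector space
print(vector.shape, similar[:3])
```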