Module 4 - v1
CSE 3005:
Applied Artificial Intelligence
Contents
• Uncertainty in AI
2
Need for Reasoning with Uncertainty
• World is full of uncertainty
• Partial information
• Can’t encode rules for every condition
• Computers need to be able to handle uncertainty
• Probability – New foundation for AI
• Massive amounts of data are available today
• Statistics helps us learn a great deal from data
3
Probability Basics
• Begin with a set S, also called the sample space
• Eg. For a die, what is the sample space?
• x ∈ S is a sample point / atomic event
• A probability space / probability model is a sample space with an assignment P(x) for all
values of x such that:
• 0 ≤ P(x) ≤ 1
• Σₓ P(x) = 1
• An event A is a subset of S.
• Eg. A = “die roll ≥ 4”
• A random variable is a function from sample points to some range.
4
Types of Probability Spaces
• Propositional / Boolean
• Eg. Cavity (Do I have a cavity?)
• Discrete random variables (finite or infinite)
• Eg. Weather = {Sunny, Rainy, Cloudy, Snowy, …}
• Values must be exhaustive and mutually exclusive.
• Continuous random variables (bounded or unbounded)
• Temperature = 21.6
• Temperature is between 20 and 35.
• Combination of different propositions
5
Axioms of Probability Theory
• All probabilities are in the range [0,1]
• P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
6
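• A minimal Python sketch (not from the slides) checking the two axioms and the inclusion-exclusion rule above on a fair die; the events A and B are made-up examples.

```python
# A minimal sketch: the axioms and P(A v B) = P(A) + P(B) - P(A ^ B) on a fair die.
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
P = {x: Fraction(1, 6) for x in sample_space}    # 0 <= P(x) <= 1 for every x
assert sum(P.values()) == 1                      # probabilities sum to 1

def prob(event):
    """Probability of an event (a subset of the sample space)."""
    return sum(P[x] for x in event)

A = {x for x in sample_space if x >= 4}          # "die roll >= 4"
B = {x for x in sample_space if x % 2 == 0}      # "die roll is even"

# Inclusion-exclusion: P(A v B) = P(A) + P(B) - P(A ^ B)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)
print(prob(A | B))   # 2/3
```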
Prior Probability
• Prior probability or unconditional probability corresponds to belief
prior to the arrival of new evidence.
• Probability distribution gives values of all possible assignments:
• P(Weather) = [0.72 0.1 0.08 0.001 …] (elements sum to 1)
• Joint probability distribution for a set of random variables gives the
probability of every atomic event on those random variables
7
Conditional Probability
• Conditional Probability or Posterior Probability links 2 or more events that
depend on each other.
• Eg. P(cavity | toothache) = 0.8
• If we know more, the values will change.
• Calculate the values of the following:
• P(cavity | toothache, cavity) = ?
• P(cavity | toothache, sunny) = ?
• Answer:
•1
• 0.8
8
Conditional Probability
• P(A|B) = P(A ∧ B) / P(B)
• Chain Rule: P(X1, X2, …, Xn) = P(Xn|X1, …, Xn-1) · P(Xn-1|X1, …, Xn-2) ⋯ P(X1)
9
• P(X1,X2,X3)=P(X3∣X1,X2)×P(X2∣X1)×P(X1)
10
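• A small sketch verifying the three-variable chain rule above on a made-up joint distribution over three binary variables (the numbers are arbitrary, purely for illustration).

```python
# A small sketch verifying P(X1, X2, X3) = P(X3|X1, X2) * P(X2|X1) * P(X1)
# on a toy joint distribution.
from itertools import product

# Arbitrary joint distribution over three binary variables (made-up numbers, sums to 1.0).
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.20,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}

def marginal(fixed):
    """Sum the joint over all assignments consistent with `fixed` ({index: value})."""
    return sum(p for assign, p in joint.items()
               if all(assign[i] == v for i, v in fixed.items()))

for x1, x2, x3 in product([0, 1], repeat=3):
    p_x1 = marginal({0: x1})
    p_x2_given_x1 = marginal({0: x1, 1: x2}) / p_x1
    p_x3_given_x1_x2 = joint[(x1, x2, x3)] / marginal({0: x1, 1: x2})
    assert abs(joint[(x1, x2, x3)] - p_x3_given_x1_x2 * p_x2_given_x1 * p_x1) < 1e-12
print("Chain rule verified for every assignment.")
```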
Independent Events
• Two events are Independent if they do not depend on each other to occur.
• Eg. A = “A heads on the first toss” and B = “A tails on the second toss”
• For independent events,
• P(A|B) = P(A ∩ B) / P(B) = P(A)
• P(A ∩ B) = P(A) · P(B)
• A and B are independent iff:
• P(A|B) = P(A) OR
• P(B|A) = P(B) OR
• P(A,B) = P(A)*P(B)
11
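• A minimal sketch (not from the slides) checking the independence conditions above for the two coin-toss events A and B.

```python
# A minimal sketch: checking P(A ^ B) == P(A) * P(B) and P(A|B) == P(A)
# for two fair, independent coin tosses.
from fractions import Fraction
from itertools import product

sample_space = list(product("HT", repeat=2))          # outcomes of two tosses
P = {outcome: Fraction(1, 4) for outcome in sample_space}

def prob(event):
    return sum(P[o] for o in event)

A = {o for o in sample_space if o[0] == "H"}          # heads on the first toss
B = {o for o in sample_space if o[1] == "T"}          # tails on the second toss

print(prob(A & B) == prob(A) * prob(B))               # True -> independent
print(prob(A & B) / prob(B) == prob(A))               # P(A|B) == P(A) as well
```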
1. Independence of Events:
• Two events, A and B, are considered independent if the occurrence of one event doesn't affect the occurrence of the other. In simpler words, they are unrelated.
2. Probability of A given B:
• The probability of A given B (P(A|B)) is a measure of how likely A is to happen, considering that B has already occurred.
• For independent events, P(A|B) = P(A). This means knowing that B happened doesn't change the probability of A.
3. Multiplication Rule for Independent Events:
• The probability of both A and B happening (denoted P(A∩B)) for independent events is the product of their individual probabilities: P(A∩B) = P(A) × P(B).
• This is because, if A and B are independent, the occurrence of one doesn't influence the occurrence of the other.
12
1. Equivalent Conditions for Independence:
• Events A and B are independent if any of the following is true:
• P(A|B) = P(A): The probability of A given B is the same as the probability of A. Knowing B doesn't change the likelihood of A.
• P(B|A) = P(B): The probability of B given A is the same as the probability of B. Knowing A doesn't change the likelihood of B.
• P(A∩B) = P(A) × P(B): The probability of both A and B happening is the product of their individual probabilities.
13
Bayes Rule
• P(x, y) = P(x|y) · P(y) = P(y|x) · P(x)
• P(x|y) = P(y|x) · P(x) / P(y)
• Posterior = (Likelihood × Prior) / Evidence
• P(Cause|Effect) = P(Effect|Cause) · P(Cause) / P(Effect)
14
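• A minimal sketch of Bayes' rule as stated above; the numbers in the example call are hypothetical, purely for illustration (they are not from the slides).

```python
# A minimal sketch of Bayes' rule: posterior = likelihood * prior / evidence.
def bayes(likelihood, prior, evidence):
    """P(Cause | Effect) = P(Effect | Cause) * P(Cause) / P(Effect)."""
    return likelihood * prior / evidence

# Hypothetical numbers, purely for illustration:
# P(Effect | Cause) = 0.9, P(Cause) = 0.01, P(Effect) = 0.1
print(bayes(likelihood=0.9, prior=0.01, evidence=0.1))   # 0.09
```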
Sample Problem
• Any patient with a stiff neck (S) may or may not have meningitis (M).
The probability of a patient having a stiff neck, if they have meningitis,
is 0.8. However, meningitis is not widespread in the population –
having a probability of only 0.0001. On the other hand, stiff necks are
more common, with a probability of 0.1.
• What is the probability that the patient has meningitis, given that they have a
stiff neck?
15
Solution
• To be calculated in class!
16
Some Sample Practice Questions
• 1. What is the probability that you select the Queen of Hearts from a
deck of cards?
• 2. What is the probability that you select either a queen OR a heart
from a deck of cards?
• 3. What is the probability that you select first a queen AND THEN a
heart from a deck of cards?
17
1.Probability of selecting the Queen of Hearts:
•In a standard deck of 52 cards, there is only one Queen of Hearts.
•Probability = Number of favorable outcomes / Total number of possible outcomes
•Probability = 1 (Queen of Hearts) / 52 (total cards) = 1/52
2.Probability of selecting either a Queen OR a Heart:
•There are four Queens in a deck (Hearts, Diamonds, Clubs, Spades) and 13 Hearts in total (including the Queen of Hearts).
•However, we need to be careful not to double-count the Queen of Hearts, which is both a Queen and a Heart.
•Probability = (Number of Queens + Number of Hearts − Overlapping outcome) / Total number of possible outcomes
•Probability = (4 Queens + 13 Hearts − 1 Queen of Hearts) / 52 (total cards)
•Probability = (4 + 13 − 1) / 52 = 16/52 = 4/13
3.Probability of selecting first a Queen AND THEN a Heart (without replacement):
•If the first card is the Queen of Hearts (probability 1/52), only 12 Hearts remain among the 51 remaining cards.
•If the first card is one of the other three Queens (probability 3/52), all 13 Hearts remain among the 51 remaining cards.
•Probability = Probability of selecting a Queen * Probability of selecting a Heart given the Queen that was selected
•Probability = (1/52) × (12/51) + (3/52) × (13/51) = (12 + 39) / 2652 = 51/2652 = 1/52
So, to summarize:
1.Probability of selecting the Queen of Hearts: 1/52
2.Probability of selecting either a Queen OR a Heart: 16/52
3.Probability of selecting first a Queen AND THEN a Heart: 1/52
18
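• A small sketch (not from the slides) that checks the three answers above by exact enumeration over the deck.

```python
# A small sketch verifying the three card answers above by exact enumeration.
from fractions import Fraction
from itertools import permutations, product

ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["Hearts", "Diamonds", "Clubs", "Spades"]
deck = list(product(ranks, suits))                      # 52 (rank, suit) cards

# Q1: P(Queen of Hearts) on a single draw
q1 = Fraction(sum(c == ("Q", "Hearts") for c in deck), len(deck))

# Q2: P(Queen OR Heart) on a single draw
q2 = Fraction(sum(c[0] == "Q" or c[1] == "Hearts" for c in deck), len(deck))

# Q3: P(Queen first AND Heart second), drawing two cards without replacement
pairs = list(permutations(deck, 2))                     # 52 * 51 ordered pairs
q3 = Fraction(sum(a[0] == "Q" and b[1] == "Hearts" for a, b in pairs), len(pairs))

print(q1, q2, q3)   # 1/52, 4/13, 1/52
```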
Some More Sample Practice Questions
• 4. What is the probability of getting a LETTER card (A,K,Q,J), given
that you picked a NUMBER card from the same suit?
• 5. What is the probability of drawing an ACE of HEARTS, given that
you have drawn an ACE from the deck of cards?
• 6. What is the probability of drawing an ACE of HEARTS, given that
you have drawn a FACE card from the deck of cards?
19
4)Probability of getting a LETTER card (A,K,Q,J) given that you picked
a NUMBER card from the same suit:
• For each suit (Hearts, Diamonds, Clubs, Spades), there are nine
number cards (2, 3, 4, 5, 6, 7, 8, 9, 10).
• The probability of getting a LETTER card given that you picked a
NUMBER card from the same suit is zero because there are no LETTER
cards (A, K, Q, J) among the number cards.
• Therefore, the probability is 0.
• 5) 1/4
• 6) 0
20
Burglars OR Earthquakes
• You live in San Francisco, where there have been a lot of burglaries of
late, so you buy a burglar alarm. San Francisco is also a place where
earthquakes occasionally happen. The alarm is very sensitive, so both
burglars and earthquakes can set it off.
• One day, you are at a party and your neighbour John calls and tells
you that your alarm is ringing. On the other hand, your other
neighbour Mary does not call.
• Is your home being robbed?
21
• You have a new burglar alarm installed at home.
• It is fairly reliable at detecting burglary, but also sometimes responds to minor
earthquakes.
• You have two neighbors, John and Mary, who promised to call you at work
when they hear the alarm.
• John always calls when he hears the alarm, but sometimes confuses the telephone
ringing with the alarm and calls then, too.
• Mary likes loud music and sometimes misses the alarm.
• Given the evidence of who has or has not called, we would like to estimate the
probability of a burglary.
22
Burglars OR Earthquakes
• Burglary –> Alarm
• Earthquake –> Alarm
• Alarm –> JohnCalls
• Alarm –> MaryCalls
23
24
• What is the probability that the alarm has sounded but neither a
burglary nor an earthquake has occurred, and both John and
Mary call?
25
• P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) · P(m|a) · P(a|¬b, ¬e) · P(¬b) · P(¬e)
26
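• A sketch of the factored computation above. The conditional probability tables are in the slide's figure and are not reproduced in this text, so the numbers below are the usual textbook values for this alarm network and should be treated as assumptions.

```python
# A sketch of the factored computation above. The CPT values are the usual
# textbook numbers for this alarm network (assumed here, since the slide's
# tables are only in a figure).
P_b = 0.001                        # P(Burglary)
P_e = 0.002                        # P(Earthquake)
P_a_given_not_b_not_e = 0.001      # P(Alarm | ~Burglary, ~Earthquake)
P_j_given_a = 0.90                 # P(JohnCalls | Alarm)
P_m_given_a = 0.70                 # P(MaryCalls | Alarm)

# P(j ^ m ^ a ^ ~b ^ ~e) = P(j|a) * P(m|a) * P(a|~b,~e) * P(~b) * P(~e)
p = P_j_given_a * P_m_given_a * P_a_given_not_b_not_e * (1 - P_b) * (1 - P_e)
print(p)   # ~0.000628
```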
Bayesian Networks
• A Bayesian network is a probabilistic graphical model which
represents a set of variables and their conditional dependencies using
a directed acyclic graph.
• Directed Acyclic Graph
• Table of conditional probabilities.
27
Bayesian Networks
• In general, joint distribution P over the set of variables requires
exponential space for representation and inference
• Bayesian Networks provide a graphical representation of conditional
independence relations in P
• Advantages of Bayesian Networks:
• Usually quite compact
• Requires assessment of fewer parameters
• Efficient inference
28
The Cavity, Toothache, Weather Problem
• One sunny day, you go to the dentist because you have a toothache.
The dentist checks if you have a cavity, using his instruments. If his
instruments get caught in your gums, then it is likely that you have a
cavity.
• Find the probability that you have a cavity, given you have a
toothache, the instruments get caught in your gums and it is a sunny
day.
29
Representing the Variables in a Bayesian
Network
• Weather = {Sunny}
• Cavity = {Yes, No}
• Toothache = {Yes, No}
• Catch = {Yes, No}
• Weather is independent of other
variables.
• Toothache and Catch are conditionally
independent of each other given
Cavity.
30
Revisiting the Burglaries / Earthquakes
Problem
31
Inferences
• If we know Alarm, no other evidence influences our belief that
JohnCalls:
• P(JC|MC, A, E, B) = P(JC|A)
• P(MC|JC, A, E, B) = P(MC|A)
• P(E|B) = P(E), and so on for the other independence relations
• By the chain rule, we have:
• P(JC, MC, A, E, B) = P(JC|MC, A, E, B) * P(MC|A, E, B) * P(A|E, B) * P(E|B) *
P(B)
• P(JC, MC, A, E, B) = P(JC|A) * P(MC|A) * P(A|E, B) * P(E) * P(B)
• Full joint probability now requires only 10 parameters
32
Bayesian Networks – Qualitative Structure
• Graphical structure of Bayesian Network reflects conditional independence
among variables
• Each variable X is a node on the Directed Acyclic Graph
• Edges denote direct probabilistic influence
• Usually interpreted causally
• Parents of X are denoted by Par(X)
• Local semantics: X is conditionally independent of all its non-descendants
given its parents
• In general, the full joint probability of a Bayesian Network is defined as follows:
• P(X1, X2, …, Xn) = ∏ᵢ P(Xi | Par(Xi))
33
Sequence Labeling
• Consider a problem where we have a sequence of observations,
which are triggered by a sequence of states.
• The observations are known and seen, but the states are what we are
interested in.
34
Coloured Ball Choosing
• Consider a situation where we have 3 urns (Urn1, Urn2 and Urn3), each of
which has 100 balls in it, such that some are red, some are blue and
some are green.
• We select a ball from an urn, record the observation, and put it back in the
urn.
• Then, we pick a ball from another urn and continue for a sequence.
• Our task is to now find a suitable sequence of labels of states which give us
the best probability for our observation.
• Solution: Hidden Markov Model.
35
Hidden Markov Model
• Coloured Ball Choosing
• There are 3 urns, each with 100 balls of different colours.
36
Diagrammatic Representation
37
Problem Statement
• For an observation sequence, what is the sequence of states?
• Eg. Observation = RRGGBRGR
• Si = U1/U2/U3 (a particular state)
• O: Observation sequence
• S* = “best” state sequence
• Goal: choose the “best” sequence S*, i.e. the S that maximizes P(S|O)
• Assume that each urn can be chosen with equal initial probability.
38
State Transitions Probability
• P(S) = P(S1, …, S8)
• P(S) = P(S1) · P(S2|S1) · P(S3|S1,S2) · … · P(S8|S1,…,S7)
• By the Markov / Bigram Assumption (k = 1):
• P(S) = P(S1) · P(S2|S1) · P(S3|S2) · … · P(S8|S7)
39
Observation Sequence Probability
• The ball depends only on the urn chosen.
• P(O|S) = P(O1|S1)*P(O2|S2)…*P(O8|S8)
40
Grouping Terms
• P(S|O) ∝ P(S) · P(O|S)
• = [P(S0) · P(S1|S0) ⋯ P(S8|S7)] · [P(O1|S1) · P(O2|S2) ⋯ P(O8|S8)]
• = P(S0) · ∏ (i=1..8) P(Si|Si-1) · ∏ (i=1..8) P(Oi|Si)
• Where
• P(S0) is the initial probability
• P(Si|Si-1) is the transition probability
• P(Oi|Si) is the emission probability
41
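• A sketch of scoring one candidate state sequence with the factorization above. The Red emission values match the LP(Ui|R) figures quoted a couple of slides later; the transition table and the Green/Blue emission values are illustrative assumptions, not the slide's actual tables.

```python
# A sketch of the factorization above: score a candidate state sequence S for an
# observation sequence O as P(S0) * prod(transition) * prod(emission).
# Transition and Green/Blue emission values are illustrative assumptions.
initial = {"U1": 1/3, "U2": 1/3, "U3": 1/3}
transition = {                      # transition[s_prev][s_next]
    "U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
    "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
    "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3},
}
emission = {                        # emission[state][colour]
    "U1": {"R": 0.3, "G": 0.5, "B": 0.2},
    "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
    "U3": {"R": 0.6, "G": 0.1, "B": 0.3},
}

def score(states, observations):
    """P(S) * P(O|S) for one candidate labelling of the observation sequence."""
    p = initial[states[0]] * emission[states[0]][observations[0]]
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        p *= transition[prev][cur] * emission[cur][obs]
    return p

print(score(["U3", "U3", "U2"], ["R", "R", "G"]))   # probability of one candidate labelling
```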
Algorithm (Best Forward Probability)
• Calculate the intermediate probability of the sequence from the
starting node to the previous node for all previous states.
• P(current) = P(previous) * LP (lexical / emission probability) * TP (transition probability)
• Select the node with the best probability
42
In our example…
• IP = 0.33 for all values.
• Observation Sequence = RRGGBRGR
• Add 2 more observations – $ (start symbol) and ^ (end symbol)
• Observation Sequence = $RRGGBRGR^
• LP(U1|R) = 0.3, LP(U2|R) = 0.1, LP(U3|R) = 0.6
45
Final State Sequence: U3U3U2U1U2U1U2U1
• Final Probability = 1.79159 × 10⁻⁶
46
Viterbi Algorithm
• The Viterbi Algorithm was first proposed by Andrew J. Viterbi in
1967.
• The Viterbi Algorithm is a dynamic programming algorithm.
• Used for finding the most likely sequence of hidden states (called the
Viterbi Path) that results in a series of observed events.
• In PoS tagging, the hidden states correspond to the tags, and the
observed events correspond to the words / tokens.
• It is used in speech recognition, speech synthesis, sequence labelling,
etc.
47
Viterbi Algorithm – Input & Output
• Input:
• State space (S) = {S1, S2, …, S|T|}
• Observation space (O) = {O1, O2, …, O|V|}
• Transition matrix (T) of size |T| × |T|
• Emission matrix (E) of size |V| × |T|
• Initial probabilities matrix (I) of size 1 × |T|
• Sequence of observations (Y) of length N = Y1 Y2 … YN
• Output:
• Most likely hidden state sequence (X) = X1 X2 … XN
48
Viterbi(O, S, I, T, E, Y): X
• For each state s from 1 to |T| do:
• Viterbi[s, 1] = I(s) * E[s, Y1] //Probability of starting in s and emitting Y1
• BP[s, 1] = 0 //Back-pointer to keep track of the path
• For each step t from 2 to N do:
• For each state s from 1 to |T| do:
• Viterbi[s, t] = max[k in 1 to |T|](Viterbi[k, t-1] * T[k, s] * E[s, Yt])
• BP[s, t] = argmax[k in 1 to |T|](Viterbi[k, t-1] * T[k, s] * E[s, Yt])
• ZN = argmax[s in 1 to |T|](Viterbi[s, N])
• XN = S[ZN]
• For i in N … 2 do:
• Zi-1 = BP[Zi, i]
• Xi-1 = S[Zi-1]
• Return X
49
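• A runnable Python version of the pseudocode above, written as a sketch with dictionaries instead of matrices; it is reused in the Healthy/Fever example later in this module.

```python
# A runnable sketch of the Viterbi algorithm above, using dictionaries instead
# of matrices. `states`, `init`, `trans`, `emit` mirror S, I, T, E on the slide.
def viterbi(states, init, trans, emit, observations):
    """Return (best_path, best_probability) for the observation sequence."""
    V = [{}]        # V[t][s] = probability of the best path ending in state s at step t
    BP = [{}]       # BP[t][s] = predecessor of s on that best path

    for s in states:                                   # initialisation (t = 1)
        V[0][s] = init[s] * emit[s][observations[0]]
        BP[0][s] = None

    for t in range(1, len(observations)):              # recursion (t = 2 .. N)
        V.append({})
        BP.append({})
        for s in states:
            best_prev = max(states, key=lambda k: V[t - 1][k] * trans[k][s])
            V[t][s] = V[t - 1][best_prev] * trans[best_prev][s] * emit[s][observations[t]]
            BP[t][s] = best_prev

    last = max(states, key=lambda s: V[-1][s])         # termination
    path = [last]
    for t in range(len(observations) - 1, 0, -1):      # follow back-pointers
        path.append(BP[t][path[-1]])
    path.reverse()
    return path, V[-1][last]
```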
Complexity Analysis
• Time complexity = O(N * |T|2)
• Space complexity = O(N * |T|)
50
Implementation Example
• Consider a doctor who diagnoses fever by asking patients how they feel.
The patients may only answer that they feel normal (n), dizzy (d), or
cold (c).
• There are 2 states, “Healthy” (H) and “Fever” (F) but the doctor
cannot observe them directly. They are hidden states.
• Every day, the patient can report to the doctor whether they are
normal, or cold, or dizzy. Based on this (and the previous day’s
statement), the patient should be diagnosed as either H or F.
• Find out the diagnosis of a patient who reports “Normal Cold Dizzy”
across 3 days.
51
Implementation Example – Inputs
• States = {H, F}
• Observations = {c, d, n}
• Initial Probabilities: H = 0.6, F = 0.4
52
Steps – Creation of the Trellis
53
Steps – Calculation of the Probabilities on Day 1
• Initial and transition probabilities:
Transition   Healthy   Fever
START        0.6       0.4
Healthy      0.7       0.3
Fever        0.4       0.6
• [Trellis figure: on Day 1 each state's probability is its initial probability times the emission probability (the figure shows 0.6 × 0.5 for Healthy and 0.4 × 0.1 for Fever); the following animation slides repeat this for Days 2 and 3, multiplying the previous day's best value by the transition and emission probabilities, with surviving values including 0.04, 0.027 and 0.0151.]
65
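• Applying the viterbi() sketch from the pseudocode slide to this example. The initial and transition probabilities come from the table above; the emission probabilities are the usual values for this textbook example and are an assumption here, since the slide's emission table is only in a figure.

```python
# Applying the viterbi() sketch from earlier to the Healthy/Fever example.
# The emission table below is an assumption (the slide's table is in a figure);
# the initial and transition values come from the table above.
states = ["H", "F"]
init   = {"H": 0.6, "F": 0.4}
trans  = {"H": {"H": 0.7, "F": 0.3},
          "F": {"H": 0.4, "F": 0.6}}
emit   = {"H": {"n": 0.5, "c": 0.4, "d": 0.1},     # assumed emission probabilities
          "F": {"n": 0.1, "c": 0.3, "d": 0.6}}

path, prob = viterbi(states, init, trans, emit, ["n", "c", "d"])
print(path, prob)   # ['H', 'H', 'F'] with probability ~0.01512
```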
Applications of HMM in NLP
• Used in sequence labeling tasks in NLP such as part-of-speech tagging
and named entity recognition.
66
Part-of-Speech Tagging
• Involves tagging each token with a part-of-speech (Eg. noun).
• Let's say that we have only 6 tags – noun (NN), verb (VB), adjective (JJ),
adverb (RB), function word (FW) to represent all other words, and
punctuation (.) to represent all punctuation marks.
• Consider the following sentence.
• The quick brown fox jumped over the lazy dog.
• The tagged sentence is
• The_FW quick_JJ brown_JJ fox_NN jumped_VB over_FW the_FW lazy_JJ
dog_NN ._.
• Similarly, P(W|T) = ∏ (i=1..N) P(wi|ti)
• P(wi|ti) is the Lexical Probability
73
Examples of Named Entities
Class Examples
Person Sandeep Mathias
Location Bengaluru
Organization Presidency University
Geo-political Entity Prime Minister of India
74
Named Entity Tagging
• The task of Named Entity Recognition (NER):
• Find spans of text that constitute a named entity.
• Tag the entity with the proper NER class.
75
NER Input
• Citing high fuel prices, United Airlines said Friday it has increased
fares by $6 per round trip on flights to some cities also served by
lower-cost carriers.
• American Airlines, a unit of AMR Corp., immediately matched the
move, spokesman Tim Wagner said.
• United, a unit of UAL Corp., said the increase took effect Thursday
and applies to most routes where it competes against discount
carriers, such as Chicago to Dallas and Denver to San Francisco.
76
NER – Finding NER Spans
• Citing high fuel prices, [United Airlines] said [Friday] it has increased
fares by [$6] per round trip on flights to some cities also served by
lower-cost carriers.
• [American Airlines], a unit of [AMR Corp.], immediately matched the
move, spokesman [Tim Wagner] said.
• [United], a unit of [UAL Corp.], said the increase took effect
[Thursday] and applies to most routes where it competes against
discount carriers, such as [Chicago] to [Dallas] and [Denver] to [San
Francisco].
77
NER Output
• Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has
increased fares by [MONEY $6] per round trip on flights to some cities
also served by lower-cost carriers.
• [ORG American Airlines], a unit of [ORG AMR Corp.], immediately
matched the move, spokesman [PER Tim Wagner] said.
• [ORG United], a unit of [ORG UAL Corp.], said the increase took effect
[TIME Thursday] and applies to most routes where it competes against
discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver]
to [LOC San Francisco].
78
Why NER is not so easy
• Segmentation
• In PoS tagging, no segmentation, since each word gets 1 tag.
• In NER, we have to find the span before adding the tags!
• Type Ambiguity
• Multiple types can map to same span.
• [Washington] was born into slavery on the farm of James Burroughs.
• [Washington] went up 2 games to 1 in the four-game series.
• Blair arrived in [Washington] for what may well be his last state visit.
• In June, [Washington] legislators passed a primary seatbelt law.
79
Why NER is not so easy
• Segmentation
• In PoS tagging, no segmentation, since each word gets 1 tag.
• In NER, we have to find the span before adding the tags!
• Type Ambiguity
• Multiple types can map to same span.
• [PER Washington] was born into slavery on the farm of James Burroughs.
• [ORG Washington] went up 2 games to 1 in the four-game series.
• Blair arrived in [LOC Washington] for what may well be his last state visit.
• In June, [GPE Washington] legislators passed a primary seatbelt law.
80
BIO-Tagging
• BIO tagging converts NER, where one label can cover multiple words, into a
sequence labeling problem like PoS tagging, with 1 tag per word.
• Consider the sentence: “[PER Jane Villanueva] of [ORG United] , a
unit of [ORG United Airlines Holding] , said the fare applies to the
[LOC Chicago] route.”
• Instead of just marking the spans, we also mark out whether it is the
beginning (B), or inside (I) part of the span. Words outside the span
are tagged as other (O).
81
BIO Tagging
• The sentence: “[PER Jane Villanueva] of [ORG United], a unit of [ORG
United Airlines Holding] , said the fare applies to the [LOC Chicago]
route.”
• Becomes:
• “Jane_B-PER Villanueva_I-PER of_O United_B-ORG ,_O a_O unit_O
of_O United_B-ORG Airlines_I-ORG Holding_I-ORG ,_O said_O the_O
fare_O applies_O to_O the_O Chicago_B-LOC route_O ._O”
• Total Number of Tags = 2n + 1, where n is the number of entity classes.
82
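• A small illustrative helper (not from the slides) that converts labelled spans into the BIO tags shown above; the function name and input format are assumptions made for this example.

```python
# A small sketch of converting labelled entity spans into BIO tags, reproducing
# the tagging above. The helper and its input format are illustrative assumptions.
def to_bio(tokens, spans):
    """tokens: list of words; spans: list of (start, end, label) with end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label                 # first token of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + label                 # remaining tokens of the span
    return list(zip(tokens, tags))

tokens = ["Jane", "Villanueva", "of", "United", ",", "a", "unit", "of",
          "United", "Airlines", "Holding", ",", "said", "the", "fare",
          "applies", "to", "the", "Chicago", "route", "."]
spans = [(0, 2, "PER"), (3, 4, "ORG"), (8, 11, "ORG"), (18, 19, "LOC")]

for word, tag in to_bio(tokens, spans):
    print(f"{word}_{tag}", end=" ")
```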
Other BIO Tagging variants
• IO Label – I is inside the span, O is outside the span.
• BIO Label – B is beginning of the span, I is inside the span, O is
outside the span.
• BIOES Label – B is beginning of the span, I is inside the span, O is
outside the span, E is end of the span, and S is to represent a single
element tag.
83
Standard Algorithms for NER
• Many supervised sequence labeling models can be used.
• Hidden Markov Models (HMM)
• Conditional Random Fields (CRF)
• Maximum Entropy Markov Models (MEMM)
• Neural Sequence Models
• Recurrent Neural Network (RNN), Long Short Term Memory (LSTM), etc.
• Pre-trained Language Models – Eg. BERT
84