AI Important Questions
Artificial Intelligence
RCS-702
B.Tech 4th Year (7th Semester)
UNIVERSITY ACADEMY
UNIT - 1
Introduction : Introduction to Artificial Intelligence
UNIT 1 ( Introduction to AI)
Ans : AI is the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence. Some well-known definitions of AI are:
Systems that think like humans: "The exciting new effort to make computers think . . . machines with minds, in the full and literal sense." (Haugeland, 1985); "[The automation of] activities that we associate with human thinking, activities such as decision making, problem solving, learning . . ." (Bellman, 1978)
Systems that think rationally: "The study of mental faculties through the use of computational models." (Charniak and McDermott, 1985); "The study of the computations that make it possible to perceive, reason, and act." (Winston, 1992)
Ques 3. What is Weak AI and Strong AI ?
Ans : Weak AI deals with the creation of some form of computer-based artificial intelligence which can reason and solve problems in a limited domain. Some thinking-like features may be added to the machine, but true intelligence is absent. Here we have to work out the explanation of a solution ourselves, rather than depending on the machine.
Strong AI claims that computers can think at the level of human beings. A strong AI system truly reasons and solves complex problems, and in strong AI the programs are themselves explanations for any solution.
Ans : The word agent comes from the idea of an agency hiring a person to do particular work on behalf of a user. In AI terms, an agent is a program that perceives its environment through sensors and acts upon that environment through actuators. E.g.: software agents, robotic agents, nano-robots for body check-ups / biological agents, Internet search agents, etc. Software agents carry the following properties :
Mathematically, the agent function is defined as f : P* → A, a mapping from any given percept sequence to an action, where P* is a sequence of zero or more percepts and A is the action taken by the agent. Internally, the agent function is implemented by an agent program.
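The mapping f : P* → A can be illustrated with a minimal table-driven agent program. This is a sketch in Python; the percepts, actions and lookup table are hypothetical examples, not from any standard library:

```python
# A minimal table-driven agent: the table maps percept sequences (P*) to
# actions (A). Percepts, actions and the table itself are illustrative.

def make_table_driven_agent(table, default_action="noop"):
    percepts = []  # the percept sequence P* observed so far

    def agent_program(percept):
        percepts.append(percept)
        # Look up the full percept sequence; fall back to a default action.
        return table.get(tuple(percepts), default_action)

    return agent_program

# Example: a trivial vacuum-world style table.
table = {
    ("dirty",): "suck",
    ("clean",): "move",
    ("clean", "dirty"): "suck",
}

agent = make_table_driven_agent(table)
print(agent("clean"))  # move
print(agent("dirty"))  # suck  (the sequence is now ("clean", "dirty"))
```

A real agent replaces the explicit table with rules or a learned policy, since a table over all percept sequences grows without bound.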
A system is said to be rational if it does the right thing, given what it knows (irrefutable reasoning). Doing the right thing makes the agent successful, so some performance measure is required to measure the degree of success.
Rationality depends on :
Example of an NLP grammar is given below:
S   → NP VP
NP  → DET N | DET ADJ N
VP  → V NP
DET → the
ADJ → big | fat
N   → cat | rice
V   → eat
Here NP is a noun phrase, VP is a verb phrase, DET is an article, ADJ is an adjective, V is a verb and N is a noun; these are all non-terminals. the, big, fat, cat, rice and eat are terminals.
Ans : A lexicon is a dictionary of words (usually morphemes or root words with their derivatives), where each word carries some meaning and syntax. Information in the lexicon is needed to help determine the function and meanings of the words in a sentence. Entries in a lexicon can be grouped by word category, e.g. articles, nouns, verbs, pronouns, adjectives, etc.
Long Question & Answers
Ques 8. Explain Goal Based Agent and Utility based Agent architecture with proper diagram.
Ans : The job of AI is to design an agent program that implements the agent function mapping percepts to actions. This program executes on some sort of computing device with sensors and actuators; this is called the ARCHITECTURE. Agent = Architecture + Program.
(a) Goal based agent : This type of agent model incorporates desirable goals and promising directions towards goals that are easy to reach. Sometimes the action to be selected is simple, when a single action leads to the desirable goal. But when a long sequence of percepts is observed, the complexity of decision making increases. Example: an automated car-driving agent.
A goal based agent may be less efficient, but it is flexible enough given proper knowledge and decision making. E.g.: if it starts raining, the car-driving agent must be flexible enough to decide correctly when to put the brakes on.
(b) Utility based agent : In goal based agents we get only a distinction between happy and unhappy states, whereas a more general performance measure allows a comparison of different world states according to exactly how happy they would make the agent if the goal is reached. For this we require a utility measure.
A UTILITY FUNCTION maps a state onto a real number which describes the associated degree of happiness. Two cases can be considered here for rationality:
Case 1: Conflicting goals exist and only some of them can be achieved (e.g. safety and speed of a car are conflicting requirements, so we select the goal which gives the higher degree of happiness and is more useful). In car driving, safety is more essential than high speed, to avoid any accident or loss of human life.
Case 2: If several goals exist and the agent cannot reach any of them with certainty, utility provides a way in which the likelihood of success can be weighed against the importance of each goal. Example: a household robotic agent will give medicine to a person at the scheduled time even if, at the same time, another family member asks it to put his favourite sports channel on the television, because the utility of medicine consumption is higher than that of watching television.
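The selection rule above can be sketched in Python as follows; the states, actions, result model and utility numbers are illustrative assumptions, not a standard API:

```python
# Utility-based action selection: pick the action whose predicted resulting
# state has the highest utility. All names and numbers here are illustrative.

def choose_action(state, actions, result, utility):
    """result(state, action) -> next state; utility(state) -> real number."""
    return max(actions, key=lambda a: utility(result(state, a)))

# Toy household-robot example from the text: give medicine vs. switch on TV.
utilities = {"medicine_given": 0.9, "tv_on": 0.4}

action = choose_action(
    state="idle",
    actions=["give_medicine", "switch_tv"],
    result=lambda s, a: "medicine_given" if a == "give_medicine" else "tv_on",
    utility=lambda s: utilities[s],
)
print(action)  # give_medicine
```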
Ques 9. (a) What is PEAS information ? Design the PEAS information for Taxi Driver Agent and
Automated Robot in a manufacturing plant.
(b) Mention various properties of task environment.
Ans : (a) PEAS (Performance measure, Environment, Actuators, Sensors) is the acronym used to define the performance and other characteristics of a rational agent. The performance measure decides the criterion for the success of an agent's behavior. When an agent is plunked down in an environment, it generates a sequence of actions according to the percepts it receives. This sequence of actions causes the environment to go through a sequence of states. If the sequence is desirable, then the agent has performed well.
Agent Type: Taxi Driver Agent
  Performance Measure: safe, fast, legal, comfortable trip, maximum mileage
  Environment: roads, other traffic, pedestrians, customers
  Actuators: steering, accelerator, brake, signal, horn, display
  Sensors: cameras, speedometer, GPS, odometer, engine sensors

Agent Type: Robot part-picking agent
  Performance Measure: % of parts in correct bins
  Environment: conveyor belt with parts, bins
  Actuators: jointed arms and hands
  Sensors: camera, joint angle sensors
Episodic Vs Sequential : In an episodic task environment the agent's experience is divided into "atomic episodes". Each episode consists of the agent perceiving and then performing an action, and the next episode is independent of the actions taken in previous episodes. In a sequential environment, by contrast, the current decision affects all future decisions. E.g.: in a taxi-driving agent, the intensity with which the brakes are applied may have long-term consequences.
Static Vs Dynamic : If the environment can change while the agent is acting, we say it is a dynamic environment; otherwise it is static. A static environment is easy to work in, whereas a dynamic environment continuously asks the agent what to do next. E.g.: taxi driving is dynamic.
Discrete Vs Continuous : This is with respect to the states of an environment. E.g.: a chess game has a finite number of distinct states and a discrete set of percepts and actions, whereas taxi driving is a continuous-time and continuous-state problem.
Single Agent Vs Multi Agent : An agent solving a crossword puzzle alone is a single-agent environment. Chess playing is two-agent. Robot soccer is multi-agent (cooperative multi-agent).
Ques 10. What is Natural Language Processing? Mention its application domains in AI. What are some of the problems which arise in natural language understanding for autonomous machines like robots and intelligent computers?
Ans : In AI we need to think of language as a pair (source, target) for mapping between two objects. Language is a medium of communication. The most common linguistic medium of human beings is speech, but processing written language is easier than processing speech. Developing a program that understands a natural language is difficult, because natural languages are large and contain endless difficulties.
So Natural Language Processing is the task of processing speech or written text in such a way that a program transforms sentences occurring as part of a dialogue into data structures which convey the intended meaning of the sentences to a reasoning program. A reasoning program must know about :
(i) Structure of the language
(ii) Possible semantics
(iii) Beliefs and goals of the user
(iv) General knowledge of the world.
NLP = Understanding + Generation. Natural language understanding (NLU) aims at building systems that can make sense of free-form text; an NLU system converts samples of human language into more formal representations that are easier for computer programs to manipulate. Natural language generation (NLG) aims at building systems that can express their knowledge or explain their behavior in natural language; an NLG system converts information from computer databases into normal-sounding human language.
• Processing written text uses lexical, syntactic and semantic knowledge of the language as well as the required real-world information.
• Processing spoken language includes all of the above plus additional knowledge about phonology and the resolution of ambiguity.
• NLP also includes multilingual translation. E.g.: the Google search engine and various smart phones provide speech-to-text and text-to-speech conversion in different natural languages.
Block diagram (in words): an input string goes to a Parser, which consults a Dictionary; the parser's output goes to a knowledge representation / translation system, whose output is either natural-language text or computer code.
(i) Machine translation, and text-to-speech / speech-to-text conversion. E.g.: features available nowadays in Android phones as well as Windows laptops.
(ii) Information retrieval from a given collection of documents that satisfies a certain information need.
(iii) Information extraction and data mining.
(iv) Text summarization
1. Problem of Ambiguity : There are several knowledge levels at which ambiguity may occur in natural language.
(a) Syntactic Level : A sentence or phrase may be ambiguous at the syntactic level. Syntax relates to the structure of the language as per grammar rules and the way the words are put together.
Example : "I hit the man with the hammer." Was the man hit using a hammer, or was the hammer in the hand of the victim?
(b) Some sentence structures have more than one correct interpretation.
(c) Lexical Level : A sentence may be ambiguous at the lexical level, where a word can have more than one meaning. Example : "I went to the bank." The word bank can mean a river bank or a financial institution, so the same word has two meanings.
(d) Referential Level : This is concerned with what the sentence refers to; a sentence may refer to more than one thing. Example : "Ram killed Ravana because he liked Sita." Here referential ambiguity occurs for "he": does it refer to Ram or to Ravana?
(e) Semantic Level : A sentence may be ambiguous at the level of meaning (i.e. two different meanings for the same expression). Example : "He saw her duck" (both lexical and semantic ambiguity). Did he see her dip down to avoid something, or did he see her web-footed bird?
(f) Pragmatic Level : A sentence can be ambiguous at the pragmatic level, i.e. its interpretation depends on the context in which it occurs; some words have different meanings in different situations.
Example : "I went to the doctor yesterday." Which day "yesterday" was is not clear without knowing when the sentence was spoken.
2. Problem of impreciseness : very long sentences cannot be easily interpreted by machines.
3. Problem of incompleteness : an incomplete sentence may create a logical error or misinterpretation. Example : "I went there." What does "there" refer to?
4. Problem of inaccuracy may also arise in machine translation.
5. Problem of continuous change is also very common during NLU. Example : people in different parts of the world speak English with different accents.
6. Presence of noise in the input. Example : while speaking in front of a machine, background noise may hinder the clear voice input to the system.
7. The quantifier scoping problem is also very common: where to apply the existential quantifier (∃) and where to apply the universal quantifier (∀).
Ans : Parsing is a technique to check the grammatical structure of computer programs syntactically and generate a parse tree if the given input is successfully parsed by a formal context-free grammar. But in an NLP system this traditional parsing is quite difficult to analyse, understand and implement, because natural languages are inherently ambiguous at the lexical, syntactic, semantic, referential and pragmatic levels.
There are systematic patterns in sentences that emerge from the knowledge of grammar; for example, sentences have parts such as noun phrases, verb phrases, prepositional phrases, etc. Parsing is a kind of search problem where the search space is the set of trees consistent with a given grammar. Two methods of searching are the top-down and bottom-up approaches to parsing.
Top Down approach : In this technique we start searching from the root node of the parse tree and work downwards, level by level, to the leaf nodes to find the lexicons, or original words. The top-down approach is goal driven: in the task of packing bags for travel, we start with the goal in mind and make a list of items that achieve that goal.
Bottom Up approach : This is a data-driven approach in which search moves are performed upwards through the tree, starting from the leaf nodes and reaching the root node. If this is done successfully then no syntax error occurs.
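As a sketch of the top-down (goal-driven) approach, a small recursive-descent recognizer can be written for the toy grammar given earlier (S → NP VP, NP → DET N | DET ADJ N, VP → V NP). This is illustrative only, not a production parser; it does no backtracking between alternatives, which is fine for this tiny grammar:

```python
# Top-down recursive-descent recognizer for the toy grammar used earlier.
# Each parse_* function tries to match one non-terminal starting at index i
# and returns the index after the matched span, or None on failure.

DET, ADJ, N, V = {"the"}, {"big", "fat"}, {"cat", "rice"}, {"eat"}

def parse_np(words, i):
    # NP -> DET N | DET ADJ N
    if i < len(words) and words[i] in DET:
        if i + 1 < len(words) and words[i + 1] in N:
            return i + 2
        if i + 2 < len(words) and words[i + 1] in ADJ and words[i + 2] in N:
            return i + 3
    return None

def parse_vp(words, i):
    # VP -> V NP
    if i < len(words) and words[i] in V:
        return parse_np(words, i + 1)
    return None

def parse_s(words):
    # S -> NP VP, and all words must be consumed.
    j = parse_np(words, 0)
    if j is None:
        return False
    return parse_vp(words, j) == len(words)

print(parse_s("the big cat eat the rice".split()))  # True
print(parse_s("the cat the rice".split()))          # False (no verb)
```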
(iii) Computer Vision : There are many opinions about what sort of background is necessary for computer vision, but one thing is certain: inspirations for new computer vision methods have come from fields as diverse as psychology, neuroscience, physics, robotics, and statistics. Vision deals with light and its interaction with surfaces, so of course optics plays a role in understanding computer vision systems. Cameras, lenses, focusing, binocular vision, depth of field, sensor sensitivity, time of exposure, and other concepts from optics and photography are all relevant to computer vision.
Often referred to as the "inverse" of computer graphics, computer vision attempts to make inferences about the world from images. Given a picture of two objects, we would like to infer that they are roughly cubical and likely to be dice, although we can never be completely sure. A vision system may pick up important highlights to conclude that a surface is wet, transparent, or reflective, or detect features associated with living creatures rather than inanimate objects.
Neuroscience and physiology: the human eye, the central nervous system, and the brain are all marvels of complex structure and performance required for vision. Studying these systems often provides insight, inspiration, and clues about artificial vision system design.
The human visual system seems to do all of these things. Just recording the speed at which a human responds
in a particular task, like reading a word, may rule out certain theories as to how certain visual stimuli are
processed.
Probability, Statistics, and Machine Learning : The mathematical subfield of probability, the field of statistics, and the computer science discipline of machine learning have become essential tools in computer vision.
Mid-Level Vision
• Finding coherent structure so as to break the image or movie into big units
– Segmentation:
• Breaking images and videos into useful pieces
• E.g. finding video sequences that correspond to one shot
• E.g. finding image components that are coherent in internal appearance
– Tracking:
• Keeping track of a moving object through a long sequence of views
High Level Vision (Probabilistic) :The relations between object geometry and image geometry
Model based vision : find the position and orientation of known objects
Smooth surfaces and outlines : how the outline of a curved object is formed, and what it looks like
Aspect graphs : how the outline of a curved object moves around as you view it from different directions
(iv) Turing Test : This test provides an answer to the question "Can machines think like human beings?" Alan Turing, the British scientist, was a well-known computer scientist and a father of artificial intelligence. Turing left a benchmark test for an intelligent computer: it must fool a person into taking the machine for a human being. The test is performed in the following two phases:
PHASE I : An interrogator is set up in an isolated room, with a man and a woman in separate rooms. The same questions are asked of both the man and the woman through a neutral medium, such as a teletypewriter. Questions asked include calculations such as the multiplication of big numbers like 33456012 x 6754, and some questions on lyrics and English literature.
PHASE II : In this phase the man is replaced by a computer without the knowledge of the interrogator. The interrogator does not distinguish between man, woman and machine; rather, he knows them only as A and B.
Interpretation : If conversation with a computer is indistinguishable from conversation with a human, the computer is displaying intelligence. If we cannot distinguish between natural intelligence and artificial intelligence, they
must be the same. If the interrogator could not distinguish between a man imitating a woman and a computer imitating a man, the computer succeeded in passing the test. The goal of the machine is to fool the interrogator into believing that it is a person. If the computer is successful, then we can say "machines can think like humans".
UNIT-2
Introduction to Searching Methods in AI
• Searching for solutions,
• Uninformed search strategies
• Informed search strategies
• Local search algorithms and optimization problems
• Adversarial Search
• Search for games
• Alpha Beta Pruning
Short Questions & Answers
( operations + sequence)
In AI computing: Expert System = Knowledge Base + Control Strategies
Ques 2. What are the main aspects considered before solving a complex AI problem? What
is state space representation in AI?
Ans : The following aspects are considered before solving any AI problem :
(i) What are the main assumptions about the problem?
(ii) What kind of AI techniques are required to solve it (e.g. game playing, theorem proving, robotics, expert systems, NLP, etc.)?
(iii) At what level does the intelligence have to be modeled?
(iv) How will we know when we have reached the goal state of the solution?
State Space Representation : A problem-solving system that uses forward reasoning, and whose operators each work by producing a single new object (a new state) in the knowledge base, is said to represent the problem in a state space structure. In backward reasoning, we move from the goal state back towards the given set of initial states.
Constituents of a state space problem searching:
(i) Set of initial states/state
(ii) Operator function: This function when applied to any state , then it changes it to another
state.
(iii) State space: All states reachable from initial state by any sequence of actions.
(iv) Path: Sequence of steps along the state space.
(v) Path cost: Cost (in terms of time or money) incurred while traversing the state space.
(vi) Pre Conditions: Values certain attributes must have to enable operators application in a
state.
(vii) Post condition: Attributes of a state altered by an operator application.
(viii) Goal State: Final state, when problem is solved.
Ans : Beam search is a variation of breadth first search. Keeping just one state in memory (as in hill climbing) can be a problem, so to avoid this beam search keeps track of K states rather than just one.
(i) Beam search moves downwards only through the best nodes at each level, selected by a heuristic; other nodes are ignored.
(ii) The width of the beam (K) is fixed.
(iii) The K parallel search threads share useful information among themselves. Bad moves, if they occur, are halted and resources are passed to good successors.
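The idea can be sketched as follows; the successor function, heuristic h (lower is better here) and the toy numeric problem are illustrative assumptions:

```python
# Beam search sketch: expand every state in the beam, then keep only the K
# best children at each level, ranked by a heuristic (lower is better).

def beam_search(start, successors, h, is_goal, k=2, max_depth=20):
    beam = [start]
    for _ in range(max_depth):
        for state in beam:
            if is_goal(state):
                return state
        children = [c for s in beam for c in successors(s)]
        if not children:
            return None
        beam = sorted(children, key=h)[:k]   # fixed beam width K
    return None

# Toy example: reach 10 from 0 by +1 or +3 steps; h = distance to goal.
goal = 10
found = beam_search(
    start=0,
    successors=lambda s: [s + 1, s + 3],
    h=lambda s: abs(goal - s),
    is_goal=lambda s: s == goal,
    k=2,
)
print(found)  # 10
```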
Ques 4. Show that DFS is neither complete nor optimal search.
Ans : Since DFS explores depth first, if the search tree is infinite or the state space contains a cycle, DFS may not terminate, i.e. it can search for infinite time, so it is not complete. Secondly, DFS may return a solution at a certain depth when a much better solution exists at an upper level of the tree, so it is non-optimal as well.
Ques 5. Define Heuristic search and heuristic function.
Ans : Heuristic search techniques are those in which we have more information than just the initial state, the operators and the goal state. This leads to more efficient searching for complex problems; heuristics serve as a guide to reach the goal state.
Heuristic function : This is also termed the objective function in mathematical optimization. Heuristic functions map a problem state description to a measure of desirability, usually represented as a number. How states are considered and evaluated, and how weights are assigned to them, determines the selection of the current best path to reach the goal most efficiently. If the heuristic is very accurate and reflects the true merit of each node, the search proceeds almost directly to the solution.
A general heuristic function can be of the form f(n, g), where n is a node and g a goal state. Another common form is f(n) = g(n) + h(n), where g(n) is the path cost from the initial state to node n,
h(n) is the estimated cost of the cheapest path from n to the goal, and
f(n) is the estimated cost of the cheapest solution through n.
Time complexity : O(b^l), where b is the branching factor and l the depth limit. This algorithm solves the infinite path problem.
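As a concrete example of a heuristic function h(n), the Manhattan-distance heuristic for the 8-puzzle can be sketched as below; the board encoding (a tuple of 9 integers, 0 for the blank) is an assumption made for illustration:

```python
# Manhattan-distance heuristic for the 8-puzzle: h(n) sums, over the 8 tiles,
# the grid distance of each tile from its goal position (the blank is ignored).

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)

def manhattan(state, goal=GOAL):
    h = 0
    for idx, tile in enumerate(state):
        if tile == 0:               # the blank does not count
            continue
        g = goal.index(tile)
        h += abs(idx // 3 - g // 3) + abs(idx % 3 - g % 3)
    return h

print(manhattan(GOAL))                         # 0
print(manhattan((1, 2, 3, 4, 5, 6, 7, 0, 8)))  # 1 (tile 8 is one step away)
```

Such an h(n) is admissible (it never overestimates), which is what makes f(n) = g(n) + h(n) search optimal.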
Ques 8. Differentiate between Uninformed and Informed Search technique.
Ans :
Ques10. Heuristic path algorithm is best first search in which objective function is
f(n) = (2-w) g(n) +w h(n) for what values of w is this algorithm guaranteed to be optimal ?
What kind of search does this perform when w = 0? When w = 1? When w = 2?
Ans : Writing f(n) = (2 - w) g(n) + w h(n) and dividing by (2 - w), the algorithm behaves like A* with heuristic (w / (2 - w)) h(n). This scaled heuristic remains admissible (and hence the search optimal) when w / (2 - w) ≤ 1, i.e. for 0 ≤ w ≤ 1, assuming h itself is admissible.
If w = 0, f(n) = 2 g(n): the search behaves like uniform-cost search. If w = 1, f(n) = g(n) + h(n): this is A* search. If w = 2, f(n) = 2 h(n): this is greedy best-first search, which is not guaranteed to be optimal.
Long Questions & Answers
Ques 11. Explain Water Jug Problem using state space search. Generate Production rules
for this problem.
Ans : Two jugs, of 4 litre and 3 litre capacity, are given. Initially both are empty. The objective is to transfer water from one jug to another in such a way that the 4 litre jug contains exactly 2 litres of water and the 3 litre jug contains n litres, where n can be any of 0, 1, 2 or 3.
9. If x + y ≤ 4 and y > 0, then (x, y) → (x + y, 0)
10. If x + y ≤ 3 and x > 0, then (x, y) → (0, x + y)
11. (0, 2) → (2, 0)    This is a special case of rule 9.
12. (2, y) → (0, y)
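Using production rules of this kind (fill, empty, pour), the state space can be searched with a simple breadth-first search. This Python sketch encodes a state as (x, y), the litres in the 4 L and 3 L jugs; it is illustrative, not a prescribed implementation:

```python
# BFS over the water-jug state space. x = litres in the 4 L jug,
# y = litres in the 3 L jug. Goal: x == 2.

from collections import deque

def successors(x, y):
    return {
        (4, y), (x, 3),                              # fill either jug
        (0, y), (x, 0),                              # empty either jug
        (min(4, x + y), max(0, y - (4 - x))),        # pour 3 L jug into 4 L
        (max(0, x - (3 - y)), min(3, x + y)),        # pour 4 L jug into 3 L
    }

def solve(start=(0, 0), goal_x=2):
    frontier, seen = deque([[start]]), {start}
    while frontier:
        path = frontier.popleft()
        x, y = path[-1]
        if x == goal_x:
            return path                              # shortest, since BFS
        for nxt in successors(x, y):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None

path = solve()
print(path)  # a shortest solution, e.g. ending in (2, 0) or (2, 3)
```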
Ques 12. (a) How is AI related to Knowledge? Differentiate b/w Declarative and Procedural
knowledge.
(b) What are AI techniques? Explain the properties of it , and purpose of this with
example.
Ans : (a) Knowledge is defined as the body of facts and principles accumulated by humankind, or the act, fact or state of knowing. Knowledge includes familiarity with language, concepts, procedures, rules, ideas, abstractions and places, coupled with an ability to use these in modeling different aspects of a real-world problem.
Intelligence requires the possession of, and access to, knowledge; a characteristic of intelligent people is that they possess much knowledge. Knowledge has the following properties:
(i) It is voluminous
(ii) Knowledge continuously increases
(iii) Understanding the knowledge in different contexts depends on environment. E.g :
Understanding a visual scene requires knowledge of kinds of objects in scene
(iv) Solving a particular problem requires knowledge of that domain.
(v) Knowledge must be effectively represented in a meaningful way, without any sort of ambiguity.
Declarative knowledge is a passive knowledge expressed in the form of statements and facts
about the real world. Example: Employee database of any organization, Telephone directory etc.
Procedural knowledge is a compiled knowledge related to the performance of some task, i.e.
solving any problem systematically step by step. Example : steps used to solve any trigonometric
problem.
(b) AI Techniques : These are the methods used to solve various types of tasks, e.g. game playing, theorem proving, robotics, expert systems. Simple data structures like arrays and queues are unable to represent the facts of the real world, so symbolic representations are required.
An AI technique is a method that exploits knowledge, which should be represented in such a way that :
(i) The knowledge captures generalization: it is not necessary to represent each individual situation separately; rather, situations which share important properties are grouped together.
(ii) It can be understood by the people who must provide it, depending on the problem domain (e.g. robotics, the medical field, weather forecasting, biometrics, defense and military, aeronautics, space research, etc.).
(iii) It can easily be modified to correct errors.
(iv) It can be used in many situations even if it is not totally accurate or complete.
The knowledge base is linked to the inference process, through which an AI system can derive solutions which are not already in the KB. To make computers smarter we need to learn how the human brain works and what properties of the brain can be used to design an AI system; this study is called Cognitive Science.
Before solving a problem and selecting the appropriate AI domain, we consider the following points:
• What are the assumptions about the knowledge?
• Which technique is most appropriate for reaching to goal state.
• What is the level for modeling the intelligence?
• How will we know when we have succeeded in building an AI system?
Example : Translating English sentences into Japanese. (Requires technique of NLP and NLU)
Teaching a child to subtract integers. (Requires supervised learning)
Solving a cross word puzzle. (Requires logic theory.)
Ques 13. What are the characteristics of AI problems? Explain with examples.
Ans : (i) Is the problem decomposable? Is there a way to divide a large, complex problem into sub-problems and then apply a simple approach to each sub-problem to obtain the solution of the main problem?
E.g.: designing a large piece of software requires designing individual modules first, then starting the coding phase.
(ii) Can solution steps be ignored, or at least undone, if they prove to be irrelevant?
Based on this there are the following categories of problem:
Ignorable : Solve the problem using simple control strategies that never backtrack (theorem proving).
Recoverable : These problems can be solved using more complex control structures that may make mistakes; solution steps can be undone (the 8-puzzle).
Irrecoverable : Solved by a system that expends a great deal of effort in making each decision, since each decision must be final; solution steps cannot be undone (chess, card games). The control strategy requires a lot of effort, strict rules and good heuristic information.
(iii) Is the universe predictable? Can we predict in advance the plans and steps to be taken during the problem solution? E.g.: in the 8-puzzle we can predict future moves, whereas in bridge we cannot.
Certain-outcome problems: the 8-puzzle. Uncertain-outcome problems: the game of bridge.
Hardest problems: irrecoverable + uncertain outcome.
Conversational Problems : In this type there is intermediate communication between a person and the computer, either to provide additional assistance to the system or to give additional information to the user.
Ques 14. What is Control Strategy and Production System? How are these helpful in AI? Give examples and types.
Ans : A control strategy is an interpreter program used to control the order in which production rules are fired and to resolve conflicts if more than one rule becomes applicable simultaneously. The strategy repeatedly applies rules to the database until a description of the goal state is produced.
A control strategy must be systematic and must also cause motion.
Example : In the water jug problem we select rules from the given list in such a way that each application generates a new state in the state space (causes motion), and rules are selected so that duplicate states are avoided.
Types of control strategies : (a) Irrevocable : a rule is selected and applied irrevocably, without provision for reconsideration later. (b) Tentative : a rule is selected and applied with a provision to return later to this point in the computation and apply some other rule.
Production Systems : These were proposed for modeling human problem-solving behavior by Newell & Simon in 1972. They are also referred to as inferential systems or rule based systems.
Roles of a production system:
(i) A powerful knowledge representation method with actions associated with it.
(ii) A bridge between AI research and expert systems.
(iii) A strongly data-driven nature of intelligent action: when new input is given to the system, the behavior of the system changes.
(iv) New rules can easily be added to account for new situations without disturbing the rest of the system.
Expert systems have a module known as the Inference Engine, which works on the basis of a production system.
Knowledge Base : Dynamic KB + Static KB. The global database is the central data structure used by the production system. The application of a rule changes the database, so it is dynamic in nature, continuously changing as production rules are applied to system states. It is therefore also known as Working Memory or Short Term Memory.
The static KB has complete information about the facts and rules; it is fixed and never changes.
Ques 16. What is missionaries and cannibals problem? Give the production rules for its
solution.
Ans : In this problem, three missionaries and three cannibals must cross a river using a boat which can carry at most two people, under the constraint that, on both banks, the missionaries present may never be outnumbered by cannibals. The boat cannot cross the river by itself with no people on board.
Solution:
Rule     Left bank      Boat          Right bank
Rule 6   1 ML, 1 CL     2 MB, 0 CB    0 MR, 2 CR
Rule 7   1 ML, 1 CL     1 MB, 1 CB    1 MR, 1 CR
Rule 8   0 ML, 2 CL     2 MB, 0 CB    1 MR, 1 CR
Rule 9   0 ML, 2 CL     0 MB, 1 CB    3 MR, 0 CR
Rule 10  0 ML, 1 CL     0 MB, 2 CB    3 MR, 0 CR
Rule 11  0 ML, 1 CL     0 MB, 2 CB    3 MR, 0 CR
Rule 12  0 ML, 1 CL     0 MB, 1 CB    3 MR, 1 CR
Rule 13  0 ML, 0 CL     0 MB, 0 CB    3 MR, 3 CR
(M = missionaries, C = cannibals; L = left bank, B = boat, R = right bank.)
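The bank-safety constraint can be expressed as a small predicate. The state encoding (missionaries and cannibals on the left bank, with the remainder of the three of each on the right) is an illustrative assumption:

```python
# State-validity check for missionaries and cannibals: on each bank,
# missionaries (if any are present) must not be outnumbered by cannibals.

def safe(m_left, c_left):
    m_right, c_right = 3 - m_left, 3 - c_left
    left_ok = m_left == 0 or m_left >= c_left
    right_ok = m_right == 0 or m_right >= c_right
    return left_ok and right_ok

print(safe(3, 3))  # True  (initial state)
print(safe(2, 3))  # False (2 missionaries vs 3 cannibals on the left)
print(safe(0, 2))  # True  (no missionaries on the left; 3 vs 1 on the right)
```

A search over boat moves would call this predicate to prune every unsafe successor state.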
Ques 17. What are local search algorithms? Explain Hill climbing search.
Ans : Local search algorithms operate on a single current state, rather than multiple paths, and move only to neighboring states. The paths followed by the search are not retained, i.e. not kept in memory.
Local search algorithms use very little memory and can find reasonable solutions in large or infinite state spaces. They are used to solve optimization problems.
Hill climbing is an informed search strategy which works on a greedy approach. Heuristic
(informed) search sacrifices the claim of completeness. Heuristics behave like a tour guide:
they are good to the extent that they point in generally interesting directions, and bad to
the extent that they miss points of interest to individuals.
Simple Hill Climbing: This algorithm selects the FIRST BETTER MOVE (NODE) as the
new CURRENT STATE (next state).
        A (5)
   B    C    D
A is the current state with cost = 5. Compare the cost of A to its successor nodes one by one,
starting from node B. If the optimization problem is to maximize the cost, then a child node
with a higher cost than A is a better next node. Since (cost of B) < (cost of A), move on to
node C. (Cost of C) > (cost of A), and node C is the first better node, therefore simple hill
climbing makes node C the NEW CURRENT STATE (next state), and the successors of C are
generated next.
Steepest Ascent Hill Climbing: This algorithm considers all the moves from the CURRENT
STATE and selects the BEST NODE as the new CURRENT STATE (next state).
In the same search tree above, if we apply the steepest-ascent approach, then the BEST NODE
among all the successors is selected as the next state. Comparing A with its successor nodes,
D has the maximum cost among them.
Key Points:
(a) The hill climbing algorithm terminates when it reaches a peak where no neighbor has a
higher value.
(b) Hill climbing is called greedy search because it selects a good neighbor state without
thinking ahead about where to go next.
(c) An operator is applied to the current state to generate its child nodes, and an
appropriate heuristic function is used to estimate the cost of each node.
(d) Both simple and steepest-ascent hill climbing may fail to find the solution: either
algorithm may terminate not by finding a goal, but by getting to a state from which no
better states can be generated.
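Both variants can be sketched in a few lines of Python. This is a hedged sketch, not code from the source: the function names and the `score`/`successors` interfaces are illustrative assumptions.

```python
# Hedged sketches of both hill-climbing variants; score(state) is the
# heuristic to maximize, successors(state) yields neighboring states.

def simple_hill_climbing(start, score, successors):
    """Move to the FIRST successor that improves the score."""
    current = start
    while True:
        moved = False
        for child in successors(current):
            if score(child) > score(current):   # first better move wins
                current, moved = child, True
                break
        if not moved:                           # no better neighbor: a peak
            return current

def steepest_ascent_hill_climbing(start, score, successors):
    """Move to the BEST successor, if it improves the score."""
    current = start
    while True:
        children = list(successors(current))
        if not children:
            return current
        best = max(children, key=score)         # best among ALL successors
        if score(best) <= score(current):       # no uphill move: a peak
            return current
        current = best

# Maximizing f(x) = -(x - 3)^2 over integer states: both variants climb to x = 3.
print(simple_hill_climbing(0, lambda x: -(x - 3) ** 2, lambda x: [x - 1, x + 1]))  # → 3
```

On a function with a single peak, as here, both variants reach the same state; they differ only in how many successors they evaluate per step.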
Ques 18. Draw the state space graph of Hill climbing search. What are the drawbacks of
this algorithm? Also discuss the time and space complexity of this algorithm.
Ans : Hill climbing search generally falls into a trap for some of the following reasons, which
are also the drawbacks of this method:
(a) Local Maxima: A local maximum is a state in the space tree which is a peak, better
than all its neighboring states but lower than the global maximum. Local maxima are
particularly disadvantageous because they often occur almost within sight of a
solution; in this case they are called FOOTHILLS.
(b) Plateau: This is an area of the state space where the evaluation function is flat, i.e.,
a whole set of neighboring states has the same value as the current state, and the best
direction is difficult to find. From a flat local maximum no uphill move exists, whereas
from a Shoulder plateau it is possible to move forward.
(c) Ridges: This is a sequence of local maxima with some slope. The orientation of the high
region, compared to the set of available moves and the directions in which they move,
makes it impossible to traverse a ridge by single moves.
The state space diagram is a graphical representation of the set of states our search algorithm
can reach versus the value of our objective function (the function we wish to maximize).
X-axis: the state space, i.e., the states or configurations our algorithm may reach. Y-axis:
the value of the objective function for each state. The best solution is the state where the
objective function has its maximum value (global maximum).
(a) To deal with local maxima, backtrack to some earlier node and select a different path.
(b) To deal with a plateau, make a BIG HOP in some direction to try to reach a new section of
the search space. If the available rules and operators describe single small steps, apply them
several times in the same direction.
(c) To deal with ridges, apply two or more rules before testing progress. This amounts to
moving in several directions at once.
The hill climbing algorithm is not complete; whether it finds the goal state depends on the
quality of the heuristic function. If the function is good enough, the search will proceed
towards the goal state.
Space complexity: This is its strongest feature. It requires constant space, because it keeps
only a copy of the current state. It may additionally require memory for storing the previous
state and each candidate successor.
Time complexity: The time taken is proportional to the length of the steepest-gradient path.
In a finite domain, the search follows this path and terminates.
Ques 19. Explain the Blocks World problem using a heuristic function in the Hill Climbing
search strategy.
Ans : In this problem an initial arrangement of eight blocks is provided. We have to reach
the GOAL arrangement by moving blocks in a systematic order. States are evaluated using a
heuristic, so that we can pick the next best node by applying the Steepest Ascent Hill Climbing
technique.
Both functions below try to maximize the score/cost of each state.
Local Heuristic: Add 1 point for each block that is resting on the block it is supposed to rest
on (as compared with the goal state). Subtract 1 point for each block sitting on an incorrect
support. A point for the table is also considered.
Global Heuristic: For each block that has a correct support structure, add 1 point for every
block in that support structure. For each block with an incorrect support structure, subtract
1 point for every block in it. Here no point for the table is considered.
As the value of a state increases, we get nearer to the goal state.
The cost/score of the goal state is 8 (using the local heuristic), because all the blocks are
in their correct positions.
Now J is the current state with score 6 > cost of I (4). In step 2, three moves from the best
state J are possible. All the neighbors of node J have a lower score than J, so J is a local
maximum, and no further progress is possible from states K, L and M. The search therefore falls
into a TRAP.
To overcome this problem with the local function, we can apply the GLOBAL heuristic. Now the
goal state has a score/cost of 28 and the initial state has a cost of -28. Again, the best node
for the next move is the one with the maximum score/cost.
Further, from state M we can have the following moves:
Ques 20. Explain the Branch and Bound search strategy in detail with an example.
Ans : Branch-and-bound search combines the space saving of depth-first search with heuristic
information. It is particularly applicable when many paths to a goal exist and we want an
optimal path. As in A* search, we assume that h(n) is less than or equal to the cost of a
lowest-cost path from n to a goal node.
The idea of branch-and-bound search is to maintain the lowest-cost path to a goal found so
far, together with its cost. Suppose this cost is bound. If the search encounters a path p such
that cost(p) + h(p) ≥ bound, path p can be pruned. If a non-pruned path to a goal is found, it
must be better than the previous best path; this new solution is remembered and bound is set to
its cost. The search then continues looking for a better solution.
Let us take the following example to implement the Branch and Bound algorithm.
Step 3:
Hence the searching path will be A-B-D-F.
Advantages: As it finds the minimum-cost path instead of just a minimum successor, there should
not be any repetition. Its time complexity is low compared with exhaustive algorithms.
Disadvantages: (i) The load-balancing aspects of the Branch and Bound algorithm make it
difficult to parallelize.
(ii) The Branch and Bound algorithm is limited to small networks. For large networks, where
the solution search space grows exponentially with the scale of the network, the approach
becomes prohibitive.
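The pruning rule described above (discard any path p with cost(p) + h(p) ≥ bound) can be sketched as follows. Since the figure's costs are not reproduced here, the graph, edge costs and heuristic values are illustrative assumptions, chosen so that the optimal path is the A-B-D-F path reported above.

```python
import heapq

def branch_and_bound(graph, h, start, goal):
    bound = float('inf')                  # cost of the best solution so far
    best_path = None
    frontier = [(h[start], 0, [start])]   # entries are (g + h, g, path)
    while frontier:
        f, g, path = heapq.heappop(frontier)
        if f >= bound:                    # prune: cannot beat the current best
            continue
        node = path[-1]
        if node == goal:
            bound, best_path = g, path    # remember the new best solution
            continue
        for nbr, cost in graph[node].items():
            if nbr not in path:           # avoid cycles on the current path
                heapq.heappush(frontier, (g + cost + h[nbr], g + cost, path + [nbr]))
    return best_path, bound

graph = {'A': {'B': 1, 'C': 4}, 'B': {'D': 2}, 'C': {'F': 6}, 'D': {'F': 3}, 'F': {}}
h = {n: 0 for n in graph}                 # trivial (hence admissible) heuristic
print(branch_and_bound(graph, h, 'A', 'F'))   # → (['A', 'B', 'D', 'F'], 6)
```

Here the path A-C-F (cost 10) is pruned as soon as the bound 6 from A-B-D-F is known.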
Ques 21. What is the Best first search algorithm? Explain with an example. Compare best
first search with the Hill climbing approach.
Ans : BEST FIRST SEARCH: In BFS and DFS, when we are at a node, we may take any adjacent node
as the next node; both blindly explore paths without considering any cost function. The idea of
Best First Search is to use an evaluation function to decide which adjacent node is most
promising, and then explore it. Best First Search falls under the category of Heuristic Search,
or Informed Search.
We use a priority queue to store the costs of nodes, so the implementation is a variation of
BFS in which the ordinary queue is replaced by a priority queue. An evaluation function assigns
a score to each candidate node. The algorithm maintains two lists: one containing candidates
yet to be explored (OPEN), and one containing visited nodes (CLOSED). Since all unvisited
successors of every visited node are placed on the OPEN list, the algorithm is not restricted
to exploring only successors of the most recently visited node. In other words, the algorithm
always chooses the best of all unvisited nodes generated so far, rather than being restricted
to a small subset such as the immediate neighbors; other strategies, such as depth-first and
breadth-first search, do have this restriction. The advantage of this strategy is that if the
algorithm reaches a dead-end node, it will continue with other promising nodes.
The evaluation function used is f(n) = g(n) + h(n), which estimates the cost of the cheapest
path through node n to the goal. g(n) is the known cost of getting from the initial state to
the current node; h(n) is an estimate of the additional cost of getting from the current node
to the goal state.
A directed graph (OR graph) is used in which each node is a point in the problem space, and an
alternative solution path exists from each branch. A parent link in each node points to the
best node from which it came, together with the list of nodes generated from it. The parent
links make it possible to recover the path to the goal once the goal is found.
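The OPEN/CLOSED scheme with parent links can be sketched using Python's heapq as the priority queue. The graph and the f-values below are illustrative assumptions (the figure's actual costs are not reproduced here).

```python
import heapq

def best_first_search(graph, f, start, goal):
    open_list = [(f[start], start)]    # OPEN: priority queue ordered by f
    closed = set()                     # CLOSED: visited nodes
    parent = {start: None}             # parent links to recover the path
    while open_list:
        _, node = heapq.heappop(open_list)
        if node == goal:
            path = []
            while node is not None:    # follow parent links back to the start
                path.append(node)
                node = parent[node]
            return path[::-1]
        if node in closed:
            continue
        closed.add(node)
        for nbr in graph[node]:
            if nbr not in closed and nbr not in parent:
                parent[nbr] = node
                heapq.heappush(open_list, (f[nbr], nbr))
    return None

graph = {'S': ['A', 'B', 'C'], 'A': ['D', 'E'], 'B': [], 'C': [],
         'D': [], 'E': ['I'], 'I': []}
f = {'S': 10, 'A': 3, 'B': 6, 'C': 5, 'D': 9, 'E': 2, 'I': 0}
print(best_first_search(graph, f, 'S', 'I'))   # → ['S', 'A', 'E', 'I']
```

Note how B and C remain on OPEN throughout: the algorithm is free to return to them if every path through A turns out badly.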
Step 1: Start from the source node S and search for the goal I using the given costs and Best
First search.
Step 2: The priority queue OPEN initially contains S. We move S from OPEN to CLOSED and add
the unvisited neighbors of S to OPEN. OPEN = {A, B, C} and CLOSED = {S}.
Step 3: We remove A from OPEN and add the unvisited neighbors of A to OPEN.
OPEN = {C, B, E, D} and CLOSED = {S, A}.
Example 2:
Comparison of Best first search and Hill Climbing
Ques 22. Explain the A* search algorithm. Discuss the admissibility of the A* algorithm.
Ans : The A* algorithm is an extension of BEST FIRST SEARCH, proposed by Hart, Nilsson and
Raphael in 1968. It combines features of branch and bound, Dijkstra's algorithm and best first
search.
(i) At each step, the A* Search Algorithm picks the node according to a value f, which is
the sum of two other parameters, g and h. At each step it picks the node/cell having
the lowest f, and processes that node/cell.
(ii) f(n) = g(n) + h(n), where g(n) is the cost of getting from the initial state to the
current node, and h(n) is an estimate of the additional cost of getting from the current
node to the goal state. f(n) thus estimates the total cost of reaching the goal through n.
(iii) A* maintains two lists, OPEN and CLOSED, as in best first search. OPEN (a priority
queue) contains those nodes which are generated but not yet expanded. CLOSED contains
those nodes whose successors have been generated and whose cost has been evaluated
using the heuristic function.
A* Search Algorithm
1. Initialize the OPEN list with the start node, setting g = 0 and f = 0 + h; set CLOSED = Φ.
2. Repeat until a goal is reached:
   If OPEN == {Φ}, then failure occurs.
   Else select the node in OPEN with the minimum f value; call it BEST NODE for the current
   path. Remove BEST NODE from OPEN and add it to CLOSED.
   If BEST NODE == Goal Node, the search succeeds.
   Else generate the successors of BEST NODE, but do not set BEST NODE to point to them yet.
   (a) For each successor:
       (i) If the successor is the goal, stop the search; else set the successor to point
       back to BEST NODE, so that the current path can be recovered.
       (ii) If a node with the same position as the successor is in the OPEN list with a
       lower f than the successor, skip this successor.
       (iii) If a node with the same position as the successor is in the CLOSED list with a
       lower f than the successor, skip this successor; otherwise, add the successor to the
       OPEN list.
       end (for loop)
   (b) If the current path to the successor is cheaper than the current best path to the old
       node, reset the old node's parent link to point to BEST NODE, else do nothing. Record
       g(old) and update f(old).
   (c) To propagate the new cost down the graph, perform a depth-first traversal starting
       from the old node and changing each node's g value, terminating each branch when you
       reach either a node with no successors or a node for which an equivalent or better
       path has already been found.
   (d) If the successor appears in both OPEN and CLOSED, add it to OPEN and to the list of
       BEST NODE's successors.
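The steps above can be condensed into a short Python sketch. The graph and heuristic values are illustrative assumptions; h here is admissible and consistent, so the first path popped at the goal is optimal.

```python
import heapq

def a_star(graph, h, start, goal):
    open_list = [(h[start], 0, start, [start])]   # entries are (f, g, node, path)
    closed = set()
    while open_list:
        f, g, node, path = heapq.heappop(open_list)
        if node == goal:
            return path, g
        if node in closed:
            continue
        closed.add(node)
        for nbr, cost in graph[node].items():
            if nbr not in closed:
                heapq.heappush(open_list,
                               (g + cost + h[nbr], g + cost, nbr, path + [nbr]))
    return None, float('inf')

graph = {'S': {'A': 1, 'B': 4}, 'A': {'B': 2, 'G': 12}, 'B': {'G': 5}, 'G': {}}
h = {'S': 5, 'A': 4, 'B': 2, 'G': 0}   # admissible: never overestimates
print(a_star(graph, h, 'S', 'G'))      # → (['S', 'A', 'B', 'G'], 8)
```

Note that B reaches OPEN twice (via S with g = 4 and via A with g = 3); the cheaper copy is popped first, and the dearer one is then discarded by the CLOSED check.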
Admissibility of A*
An algorithm is said to be admissible if it is guaranteed to return an optimal solution when
one exists.
A* is admissible if it satisfies the following assumptions:
(a) The branching factor is finite; that is, from every node only a finite number of
alternative paths emerge.
Proof: In every cycle of the main loop, A* picks one node from OPEN and places it in CLOSED.
Since there are only a finite number of nodes, the algorithm will terminate in a finite number
of cycles, even if it never reaches the goal.
(b) The cost of each move is greater than some arbitrarily small positive value ε,
i.e., for all m, n: k(m, n) > ε.
(c) The heuristic function underestimates the cost to the goal node, i.e., for all n:
h(n) ≤ h*(n).
Proof: A* is complete and optimal. Admissible heuristics are by nature optimistic, because they
assume the cost of solving the problem is less than it actually is. Since g(n) is the exact
cost to reach n, an immediate consequence is that f(n) never overestimates the true cost of a
solution through n. Let the cost of the optimal solution be C*. Let G2 be a suboptimal goal
node; h(G2) = 0, since it is a goal node. Therefore f(G2) = g(G2) + h(G2) = g(G2) > C*. If
h(n) does not overestimate the cost of completing the solution path, then
f(n) = g(n) + h(n) ≤ C*.
A heuristic is consistent if, for each node n and each successor n' of n generated by an
operator, the estimated cost of reaching the goal from n is no greater than the step cost of
getting to n' plus the estimated cost of reaching the goal from n':
h(n) ≤ C(n, a, n') + h(n'), where a is an action/operator.
If h(n) is consistent, then the values of f(n) along any path are non-decreasing: if n' is a
successor of n, then g(n') = g(n) + C(n, a, n') and f(n') = g(n') + h(n') ≥ g(n) + h(n) = f(n).
So A* expands all nodes with f(n) < C*.
Ques 23. Mention some observations about g(n) and h(n) values in A* search. Discuss
underestimation and overestimation in the A* algorithm.
A* is optimal if h(n) is an admissible heuristic, i.e., h(n) never overestimates the cost to
reach the goal. The true values g'(n) and h'(n) may not be known; what is known is an estimate
of each.
In general the true cost g'(n) will be lower than the estimate g(n), because the algorithm may
not yet have found the optimal path to n, i.e., g(n) ≥ g'(n). The heuristic value h(n)
estimates the distance to the goal from the current state.
For the algorithm to guarantee an optimal solution, the function must underestimate the
distance to the goal, i.e., h(n) ≤ h'(n). If this condition holds, then A* is admissible.
Ques 24. What is the problem reduction technique? Using it, explain AO* search with
an example.
Ans : When a problem can be divided into a set of sub-problems, where each sub-problem can be
solved separately and a combination of their solutions solves the whole problem, AND-OR graphs
or AND-OR trees are used for representing the solution. The decomposition of the problem, or
problem reduction, generates AND arcs. One AND arc may point to any number of successor nodes,
all of which must be solved for the arc to lead to a solution. The figure shows an AND-OR graph.
(i) In the figure, the top node A has been expanded, producing two arcs: one leading to B
and one leading to C-D.
(ii) The numbers at each node represent the value of f' at that node (the cost of getting to
the goal state from the current state). For simplicity, it is assumed that every
operation (i.e., applying a rule) has unit cost, i.e., each arc with a single successor
has a cost of 1, and likewise each component of an AND arc.
(iii) With the information available so far, C appears to be the most promising node to
expand, since its f' = 3 is the lowest; but going through B would be better, since to
use C we must also use D, and the cost would be 9 = (3 + 4 + 1 + 1), whereas through B
it would be 6 = (5 + 1).
(iv) Thus the choice of the next node to expand depends not only on its f' value but also on
whether that node is part of the current best path from the initial node. Figure (b)
makes this clearer: node G appears to be the most promising node, with the least f'
value, but G is not on the current best path, since to use G we must use the arc G-H
with a cost of 9, and this in turn demands that other arcs be used (with a cost of 27).
Observation: In the AO* algorithm, we consider the best node plus the best path (a global
view), rather than the best node plus the best link (a local view).
The AO* algorithm uses a single structure, Graph, instead of the OPEN and CLOSED queues of A*.
Each node in the Graph points down to its immediate successors and up to its immediate
predecessors, and carries an h' value, the estimated cost of a path from itself to a set of
solution nodes. The cost g of getting from the start node to the current node is not stored as
in the A* algorithm, because it is not possible to compute a single such value: there may be
many paths to the same state. In AO*, h' alone serves as the estimate of the goodness of a node.
A threshold value called FUTILITY is also used. If the estimated cost of a solution exceeds
FUTILITY, the search is abandoned as too expensive to be practical.
For representing above graphs AO* algorithm is as follows :
AO* ALGORITHM
1. Let the Graph consist only of the node representing the initial state; call this node INIT.
Compute h'(INIT).
2. Until INIT is labeled SOLVED or h'(INIT) > FUTILITY, repeat the following procedure:
(I) Trace the marked arcs from INIT and select an unexpanded node NODE.
(II) Generate the successors of NODE. If there are no successors, then assign
h'(NODE) = FUTILITY; this means that NODE is not solvable. If there are successors, then for
each one, called SUCCESSOR, that is not also an ancestor of NODE, do the following:
(a) Add SUCCESSOR to the Graph.
(b) If SUCCESSOR is a terminal node, mark it SOLVED and set h'(SUCCESSOR) = 0.
(c) If SUCCESSOR is not a terminal node, compute its h'(SUCCESSOR).
(III) Propagate the newly discovered information up the graph as follows. Let S be a set of
nodes that have been marked SOLVED or whose h' values have changed; initialize S to {NODE}.
Until S is empty, repeat the following procedure:
(a) Select a node from S, call it CURRENT, and remove it from S.
(b) Compute the cost of each of the arcs emerging from CURRENT:
cost(ARC) = Σ (h' values of the nodes the arc points to) + cost of the ARC itself.
Assign the minimum of these costs as the new h'(CURRENT).
(c) Mark the minimum-cost arc as the best path out of CURRENT.
(d) Mark CURRENT SOLVED if all of the nodes connected to it through the newly marked arc
have been labeled SOLVED.
(e) If CURRENT has been marked SOLVED or its h' has just changed, its new status must be
propagated back up the graph, so all the ancestors of CURRENT are added to S.
Note: AO* will always find a minimum-cost solution. AO* is both admissible and
complete.
Unnecessary Backward Cost propagation in AO* search algorithm
Ques 26. In the graph given below, explain when the search will terminate if node F is
expanded next and its successor is node A. Give the steps of the search.
Ans: If node F is expanded next with child node A, then the cost of A will be propagated
upward in this graph with a cycle. Initially the cost of A along its AND arc is:
h(A) = h(C) + h(D) + cost of arc(A-C) + cost of arc(A-D)
h(A) = 11 + 15 + 1 + 1 = 28. Now 28 is used to evaluate the revised cost of node F as follows:
h(F) = h(A) + 1 (because of the OR path between nodes F and A). So h(F) = 28 + 1 = 29.
Now between h(E) and h(F): h(E) = 30 > h(F) = 29, so the OR path C-F is better than path C-E.
So the revised cost of C = h_new(F) + 1 = 29 + 1 = 30.
Now h_new(A) = h(C) + h(D) + 2 = 30 + 15 + 2 = 47.
So h_new(F) = 47 + 1 = 48.
Now h(E) = 30 < h_new(F) = 48, so node E is the better node and arc C-E is better now.
So h_new(C) = h(E) + 1 = 30 + 1 = 31. Again revise the cost of node A through upward cost
propagation:
h_new(A) = h_new(C) + h(D) + 2 = 31 + 15 + 2 = 48.
So h_new(F) = h_new(A) + 1 = 48 + 1 = 49.
Node E is still better than F, so h(C) = 31, and so on. The search continues with no change in
the chosen path; the cycle repeats until the cost of the search exceeds FUTILITY, at which
point the search is terminated.
Ques 27. How is AI useful in game playing techniques? Describe adversarial search.
Ans : Game playing is an important and growing field of AI, which enables machines to play
many of the games that people play. A machine may play against another machine or a robot; it
may equally play against a person. This requires a great deal of searching and knowledge.
Charles Babbage, the 19th-century computer architect, planned to program his Analytical Engine
for CHESS. In the 1960s Arthur Samuel developed the first operational game-playing program.
Mathematical game theory, a branch of economics, views any multi-agent environment as a game,
provided that the impact of each agent on the others is significant, regardless of whether the
agents are cooperative or competitive.
In AI, games are deterministic, zero-sum games: two agents whose actions alternate, and in
which the UTILITY values at the end of the game are always equal and opposite. E.g., if one
player wins a game of chess (+1), the other loses (-1). A utility function assigns a numeric
value to terminal states.
Adversarial Search: Here two opponents (agents) playing a game compete in an environment, so
each move of agent A opposes agent B; each tries to gain an advantage over the other by
maximizing its own UTILITY and minimizing the opponent's UTILITY.
E.g., chess, the bridge card game, Tic Tac Toe, etc.
Some games restrict agents to a small number of actions whose outcomes are defined by precise
rules. Physical games, like ice hockey, require a more complex rule set; a larger range of
operators/actions is needed for better efficiency and results. Robot soccer is a particularly
popular physical game among game players in AI.
UTILITY FUNCTION: This assigns a numeric value to terminal states. In chess the outcome can
be a win, loss or draw, with values +1, -1 and 0 respectively.
A game tree is generated showing the different states and the cost of each state; searching
for the goal is performed in this tree. In a game tree each half-move is called a PLY: for the
maximizing player we have MAX plies and for the minimizing player MIN plies.
In game playing, move generation and the terminal test must be fast enough for efficient
searching. We can use a Plausible Move Generator, which generates only a small number of
promising moves. Alan Turing gave a utility function based on the material advantage of pieces
in chess: add up the values of the white pieces (W) and of the black pieces (B), and compute
W/B.
Factors considered in an agent's moving criteria can be:
(i) Piece advantage. (ii) Capability of progress. (iii) Control of the center. (iv) Mobility.
(v) Threat of a fork.
TERMINAL TEST: This test determines when the game is over. States where the game is over are
called terminal states.
Ans : In MINIMAX search the score compares what is good for the MAX player against what is
good for the MIN player.
(i) Nodes for the MAX player (max nodes) take on the value of their highest-scoring
successors.
(ii) Nodes for the MIN player take on the value of their lowest-scoring successors.
(iii) The assumption is that both the MAX and MIN players play optimally to win the game.
(iv) MINIMAX search uses a simple recursive computation of the minimax values of each
successor node in the game tree, in depth-first order. The minimax values are then backed up
through the game tree.
In the game tree above, the branches of B are explored first in DFS order. Since B is on a MIN
ply (it is the MIN player's turn), find min(E, F, G) and back the value up to node B:
min(E, F, G) = min(3, 12, 8) = 3.
So 3 is backed up to node B as its score. Similarly, score(C) = min(2, 4, 6) = 2 and
score(D) = min(14, 5, 2) = 2.
Now the score of MAX player A = max(3, 2, 2) = 3, so the winning path is A - B - E.
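The backed-up values above can be reproduced with a plain recursive MINIMAX. In this sketch, leaves are integers carrying their static evaluation, and internal nodes are lists of children (a representation chosen for brevity, not taken from the source).

```python
# Plain MINIMAX on the tree described above: B, C and D are MIN nodes with
# leaf scores (3, 12, 8), (2, 4, 6) and (14, 5, 2) respectively.

def minimax_value(node, is_max):
    if not isinstance(node, list):      # leaf: return its static evaluation
        return node
    values = [minimax_value(child, not is_max) for child in node]
    return max(values) if is_max else min(values)

game_tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]   # A's three MIN children
print(minimax_value(game_tree, True))  # → 3, so MAX plays toward B
```

The recursion backs up min(3, 12, 8) = 3, min(2, 4, 6) = 2 and min(14, 5, 2) = 2, and the root takes max(3, 2, 2) = 3, matching the hand computation.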
Ques 28. (i) What is alpha-beta pruning / search?
(ii) Evaluate the winning cost of the MAX player in the following game tree using
alpha-beta pruning.
Ans : (i) Alpha-Beta pruning is not actually a new algorithm; rather, it is an optimization
technique for the MINIMAX algorithm. It reduces the computation time by a huge factor, because
the number of game states MINIMAX has to examine is exponential in the number of moves, and
pruning can effectively cut the exponent in half. This allows us to search much faster and
even go to deeper levels of the game tree. It cuts off branches of the game tree which need
not be searched because a better move is already available. It is called alpha-beta pruning
because it passes two extra parameters, alpha and beta, to the minimax function. It is also
called alpha-beta cut-off.
When applied, this technique returns the same move as MINIMAX, but prunes branches that
cannot influence the final decision. It can be applied to a tree of any depth, in depth-first
search order.
Alpha is the best (highest) value that the maximizer can currently guarantee at that level
(along the max path) or above; it is a lower bound at MAX nodes.
Beta is the best (lowest) value that the minimizer can currently guarantee at that level
(along the min path) or above; it is an upper bound at MIN nodes.
Note: Search below a MIN node can be terminated if the beta value of the MIN node is less than
any of the alpha values bound at its ancestor MAX nodes.
Search below a MAX node can be terminated if the alpha value of the MAX node is greater than
any of the beta values bound at its ancestor MIN nodes.
The pseudocode below is written as runnable Python (assuming each node carries a list
`children`, empty at leaves, and a static evaluation `value` at leaves):

def minimax(node, depth, is_maximizing, alpha, beta):
    if not node.children:                      # leaf: return static evaluation
        return node.value
    if is_maximizing:
        best_val = -float('inf')
        for child in node.children:
            value = minimax(child, depth + 1, False, alpha, beta)
            best_val = max(best_val, value)
            alpha = max(alpha, best_val)
            if beta <= alpha:                  # beta cut-off: prune remaining children
                break
        return best_val
    else:
        best_val = +float('inf')
        for child in node.children:
            value = minimax(child, depth + 1, True, alpha, beta)
            best_val = min(best_val, value)
            beta = min(beta, best_val)
            if beta <= alpha:                  # alpha cut-off: prune remaining children
                break
        return best_val
Solution (ii):
Step 1. Initially apply DFS on the path A-B-D-I. The value of I is backed up at D as
alpha = 3. Now between nodes I and J, beta = 5 belongs to node J. D can only take a backed-up
value >= 3, because D is on a MAX ply; therefore the value from node J changes alpha to 5.
Step 4. Now consider another branch of node A, via the path A-C-F-K-P (DFS order). Node K is
a MIN node, so min(P, Q) = min(0, 7) = 0; at P, alpha = 0, backed up as beta = 0 to node K.
Node F is a MAX node, so max(K, L) = max(0, 5) = 5.
Hence alpha = 0 at node F is changed to alpha = 5, and the search below Q is pruned, so node
Q is not generated: at node K, beta = 0 while K's ancestor, node F, has alpha = 5, so K's
remaining subtree is pruned.
Ques 29. What is a Constraint Satisfaction Problem?
Ans : A constraint satisfaction problem (or CSP) is defined by a set of variables, X1, X2,
. . . , Xn, and a set of constraints, C1, C2, . . . , Cm. Each variable Xi has a nonempty
domain Di of possible values. Each constraint Ci involves some subset of the variables and
specifies the allowable combinations of values for that subset. A state of the problem is
defined by an assignment of values to some or all of the variables, {Xi = vi, Xj = vj, . . .}.
An assignment that does not violate any constraints is called a consistent or legal
assignment. A complete assignment is one in which every variable is mentioned, and a solution
to a CSP is a complete assignment that satisfies all the constraints. Some CSPs also require
a solution that maximizes an objective function.
Varieties of CSPs
(A) Discrete variables
– finite domains:
• n variables with domain size d give O(d^n) complete assignments
– infinite domains:
• integers, strings, etc.
• e.g., job scheduling, where the variables are start/end days for each job
• need a constraint language, e.g., StartJob1 + 5 ≤ StartJob3
(B) Continuous variables
– e.g., start/end times for Hubble Space Telescope observations
– linear constraints are solvable in polynomial time by linear programming
Solving a CSP is a two-step process:
Algorithm for CSP:
Step 1: Propagate the available constraints. Initially set OPEN = {set of objects which must
have values assigned in a complete solution}. Then
do until {an inconsistency is detected or OPEN = NULL}:
1.1 Select an object OB from OPEN. Strengthen as much as possible the set of constraints
that apply to OB.
1.2 If this set is different from the set that was assigned the last time OB was examined,
or if this is the first time OB has been examined, then add to OPEN all objects that share
any constraints with OB.
1.3 Remove OB from OPEN.
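A minimal backtracking sketch for solving a CSP is shown below. The three-region map-colouring instance and its not-equal constraints are illustrative assumptions, not taken from the source.

```python
# Backtracking search for a CSP: assign variables one at a time, checking all
# constraints after each tentative assignment and undoing it on failure.

def backtrack(assignment, variables, domains, constraints):
    if len(assignment) == len(variables):
        return assignment                      # complete, consistent assignment
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        if all(check(assignment) for check in constraints):
            result = backtrack(assignment, variables, domains, constraints)
            if result is not None:
                return result
        del assignment[var]                    # undo and try the next value
    return None

def not_equal(a, b):
    # The constraint holds vacuously until both variables are assigned.
    return lambda asg: a not in asg or b not in asg or asg[a] != asg[b]

variables = ['WA', 'NT', 'SA']
domains = {v: ['red', 'green', 'blue'] for v in variables}
constraints = [not_equal('WA', 'NT'), not_equal('WA', 'SA'), not_equal('NT', 'SA')]
print(backtrack({}, variables, domains, constraints))
# → {'WA': 'red', 'NT': 'green', 'SA': 'blue'}
```

Checking constraints immediately after each assignment is what lets the search reject, for example, NT = red without ever trying values for SA.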
1. From column 5, M = 1, since it is the only carry-over possible from the sum of two
single-digit numbers in column 4.
2. To produce a carry from column 4 to column 5, S + M must be at least 9. So S = 8 or 9,
hence S + M = 9 or 10, and so O = 0 or 1. But M = 1, so O = 0.
3. If there is a carry from column 3 to column 4, then E = 9 and so N = 0. But O = 0, so
there is no carry, and S = 9 and c3 = 0.
4. If there is no carry from column 2 to column 3, then E = N, which is impossible; therefore
there is a carry, and N = E + 1 and c2 = 1.
5. If there is no carry from column 1 to column 2, then N + R = E mod 10 and N = E + 1, so
E + 1 + R = E mod 10, giving R = 9; but S = 9, so there must be a carry from column 1 to
column 2. Therefore c1 = 1 and R = 8.
6. To produce the carry c1 = 1 from column 1 to column 2, we must have D + E = 10 + Y. As Y
cannot be 0 or 1, D + E is at least 12. D is at most 7 and E is at least 5 (D cannot be 8 or
9, as those are already assigned). N is at most 7 and N = E + 1, so E = 5 or 6.
7. If E were 6, then with D + E at least 12, D would be 7; but N = E + 1 would make N equal
to 7 as well, which is impossible. Therefore E = 5 and N = 6.
8. D + E is at least 12, which gives D = 7 and Y = 2.
  9567
+ 1085
------
 10652
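The derived assignment can be checked mechanically:

```python
# Verify the assignment derived above: S=9, E=5, N=6, D=7, M=1, O=0, R=8, Y=2.
digits = {'S': 9, 'E': 5, 'N': 6, 'D': 7, 'M': 1, 'O': 0, 'R': 8, 'Y': 2}

def word_value(word):
    # Read a word as a decimal number under the assignment.
    value = 0
    for letter in word:
        value = value * 10 + digits[letter]
    return value

assert word_value('SEND') + word_value('MORE') == word_value('MONEY')
print(word_value('SEND'), '+', word_value('MORE'), '=', word_value('MONEY'))
# → 9567 + 1085 = 10652
```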
UNIT-3
Knowledge Representation & Reasoning:
• Propositional logic
• Theory of first order logic
• Inference in First order logic
• Forward & Backward chaining,
• Resolution.
Probabilistic reasoning
• Utility theory
• Hidden Markov Models (HMM)
• Bayesian Networks
Short Question & Answers
Ques 1. Differentiate between declarative knowledge and procedural knowledge.
• Belief: This is any meaningful and coherent expression that can be manipulated.
• Hypothesis: This is a justified belief that is not known to be true; thus a hypothesis is a
belief which is backed up by some supporting evidence.
• Knowledge: True, justified belief is called knowledge.
• Epistemology: The study of the nature of knowledge.
Ques 3. What is formal logic? Give an example.
Ans : This is a technique for interpreting some sort of reasoning process. It is a symbolic
manipulation mechanism: given a set of sentences taken to be true, the technique determines
what other sentences can be shown to be true. The logical validity of an argument depends on
the form of the argument.
Example: Consider the following two sentences: 1. All men are mortal. 2. Socrates is a man.
So we can infer that Socrates is mortal.
Ques 4. Define CNF and DNF with examples.
Ans : CNF (Conjunctive Normal Form): A formula P is said to be in CNF if it is of the form
P = P1 ˄ P2 ˄ … ˄ Pn ; n ≥ 1, where each Pi (for i = 1 to n) is a disjunction of atoms.
Example: (Q ˅ P) ˄ (T ˅ ~Q) ˄ (P ˅ ~T).
DNF (Disjunctive Normal Form): A formula P is said to be in DNF if it has the form
P = P1 ˅ P2 ˅ … ˅ Pn ; n ≥ 1, where each Pi (for i = 1 to n) is a conjunction of atoms.
Example: (Q ˄ P) ˅ (T ˄ ~Q) ˅ (P ˄ ~T).
Ques 5.What are Horn Clauses ? What is its usefulness in logic programming?
Ans : A Horn clause is a clause (disjunction of literals) with at most one positive literal. A Horn clause with
exactly one positive literal is called a definite clause. A Horn clause with no positive literals is sometimes
called a goal clause. A dual-Horn clause is a clause with at most one negative literal. Horn clauses are the
basis of logic programming: languages such as Prolog are built on definite clauses, because resolution
between Horn clauses again yields a Horn clause and inference on them is efficient.
Ques 6. Determine whether the following PL formula is (a) Satisfiable (b) contradictory
(c) Valid : (p q)→r q
T T T T T F T T
T F F T F T T T
F T F F T F F T
F F F F F T T T
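Such a classification can be checked mechanically by enumerating the full truth table; a minimal Python sketch, using (p ˄ q) → q as a stand-in formula (encoding "A → B" as "(not A) or B"):

```python
from itertools import product

def classify(formula, num_vars):
    """Classify a propositional formula as valid, satisfiable or contradictory
    by enumerating every row of its truth table."""
    rows = [formula(*vals) for vals in product([True, False], repeat=num_vars)]
    if all(rows):
        return "valid"          # true in every row (tautology)
    return "satisfiable" if any(rows) else "contradictory"

# (p ˄ q) → q is a tautology; p ˄ ~p is a contradiction
print(classify(lambda p, q: (not (p and q)) or q, 2))  # valid
print(classify(lambda p: p and not p, 1))              # contradictory
```

Any formula that is valid is also satisfiable; a contradictory formula is neither.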
Ques 7. Convert the following sentences into wff of Predicate Logic ( First order logic).
(i) Ruma dislikes children who drink tea.
(ii) Any person who is respected by every person is a king.
Ans : (i) ∀x : [ child(x) ˄ drinks(x, Tea) ] → dislikes(Ruma, x)
(ii) ∀x : [ person(x) ˄ ∀y ( person(y) → respects(y, x) ) ] → king(x)
Long Question & Answers
Ques 8 : Define the term knowledge. What is the role of knowledge in Artificial Intelligence?
Explain various techniques of knowledge representation.
Ans : Knowledge: Knowledge is just another form of data. Data is raw facts. When these raw facts are
organized systematically and are ready to be processed by a human brain or a machine, they become
knowledge. From this knowledge we can draw the desired conclusions, which can be used to solve
simple and complex real-world problems.
Example : A doctor treating a patient requires both knowledge and data. The data is the patient’s
record (the patient’s history, measurements of vital signs, diagnostic reports, response to medicines, etc.).
Knowledge is the information the doctor gained in medical college during his studies.
The cycle from data to knowledge is as follows:
(a) Raw data, when refined, processed or analyzed, yields information, which becomes useful in
answering users’ queries.
(b) With further refinement, analysis and the addition of heuristics, information may be converted into
knowledge, which is useful in problem solving and from which additional knowledge may be
inferred.
Role of Knowledge in AI : Knowledge is central to AI. The more knowledge an agent has, the better its
chances of behaving intelligently. Knowledge also improves the search efficiency of reasoning.
Knowledge is needed to support intelligence because:
(a) We can understand natural language with the help of it and use it when required.
(b) We can make decisions if we possess sufficient knowledge about the certain domain.
(c) We can recognize different objects with varying features quite easily.
(d) We can interpret various changing situations very easily and logically.
(e) We can plan strategies to solve difficult problems altogether.
(f) Knowledge is dynamic and Data is static.
An AI system must be capable of doing following three things :
(a) Store the knowledge in knowledge base(Both static and dynamic KB)
(b) Apply the knowledge stored to solve problems.
(c) Acquire new knowledge through the experience.
Three key components of an AI system.
1. Representation 2. Learning 3. Reasoning.
Various techniques of knowledge representation
1. Simple relational knowledge
2. Inheritable knowledge
3. Inferential knowledge
4. Procedural knowledge
(A) Relational Knowledge : This is the simplest way to represent knowledge in static form, stored
in a database as a set of records. Facts about a set of objects and the relationships between objects are set
out systematically in columns. This technique offers very little opportunity for inference, but it provides a
knowledge base for other, more powerful inference mechanisms. Examples: the set of records of employees
in an organization; the set of records and related information of voters for elections.
(B) Inheritable Knowledge : One of the most useful forms of inference is property inheritance. In this
method, elements of certain classes inherit attributes and values from the more general classes of which they
are members. Features of inheritable knowledge are:
➢ Property inheritance (Objects inherit values from being members of a class, data must be organized
into a hierarchy of classes.)
➢ Boxed nodes (contains objects and values of attributes of objects).
➢ Values can be objects with attributes and so on…
➢ Arrows ( point from object to its value).
➢ This structure is known as Slot and Filler Architecture, Semantic network or collection of
frames.
➢ In semantic networks, nodes for classes or objects with some inherent meaning are connected in a
network structure.
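Property inheritance in a slot-and-filler structure can be sketched in a few lines of Python; the baseball figures below are illustrative values in the style of the classic textbook example:

```python
class Frame:
    """Minimal slot-and-filler structure with property inheritance."""
    def __init__(self, name, parent=None, **slots):
        self.name, self.parent, self.slots = name, parent, slots

    def get(self, slot):
        # Look in the node's own slots first, then walk up the class hierarchy.
        if slot in self.slots:
            return self.slots[slot]
        return self.parent.get(slot) if self.parent else None

person  = Frame("Person", legs=2)
player  = Frame("Baseball-Player", parent=person, batting_avg=0.252)
pitcher = Frame("Pitcher", parent=player, batting_avg=0.106)

print(pitcher.get("legs"))         # 2      -- inherited from Person
print(pitcher.get("batting_avg"))  # 0.106  -- local value overrides the class default
```

The lookup in `get` is exactly the "arrow walking" of a semantic network: a value stored locally shadows the value inherited from a more general class.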
(C) Inferential Knowledge : Knowledge is useless unless there is some inference process that can exploit
it. The required inference process implements the standard logical rules of inference; knowledge is
represented in a form of formal logic. Example: "All dogs have tails": ∀x : dog(x) → hastail(x)
This knowledge supports automated reasoning. Advantages of this approach is:
➢ It has set of strict rules.
➢ Can be used to derive more facts.
➢ Truth of new statements can be verified.
➢ Guaranteed correctness.
(D) Procedural Knowledge : This is knowledge encoded in the form of procedures: small programs that
know how to do specific things and how to proceed. E.g., a parser in a natural language system has the
knowledge that a noun phrase may contain articles, adjectives and nouns; it is represented by calls to
routines that know how to process articles, adjectives and nouns.
Advantages :
➢ Heuristic or domain specific knowledge can be represented
➢ Extended logical inferences, like default reasoning is incorporated.
➢ Side effects of actions may be modeled.
Disadvantages :
➢ Not all the cases may be represented.
➢ Not all deductions may be correct.
➢ Modularity is sacrificed, and control information is tedious to maintain.
Ques 9 : Define the term logic. What is the role of logic in Artificial Intelligence? Compare
Propositional logic with First order logic (Predicate Calculus).
Ans : Logic is defined as the scientific study of the process of reasoning and of the system of rules and
procedures that help in the reasoning process. In AI, expressions in formal logic represent the knowledge
required, and inference rules and proof procedures can apply this knowledge to solve specific problems.
We can derive a new piece of knowledge by proving that it is a consequence of knowledge that is already
known, and we generate logical statements to prove the desired assertions.
Role of Logic in AI
➢ Computer scientists are familiar with the idea that logic provides techniques for analyzing the
inferential properties of languages. Logic can provide specification for a programming language by
characterizing a mapping from programs to the computations that they implement.
➢ A compiler that implements the language can be incomplete, as long as it approximates the logical
requirements of the given problem. This makes it possible for the role of logic in AI applications to
vary from relatively weak uses, in which logic merely informs the implementation process, to in-depth analysis.
➢ Logical theories in AI are independent from implementations. They provide insights into the
reasoning problem without directly informing the implementation.
➢ Ideas from logic theorem proving and model construction techniques are used in AI.
➢ Logic works as an analysis tool and a knowledge representation technique for automated reasoning
and for developing expert systems. It also gives the base to programming languages such as Prolog
used to develop AI software.
George Boole (1815-1864) wrote a book in 1854 named “An Investigation of the Laws of Thought”, whose
stated aim was: to investigate the fundamental laws of those operations of the mind by which reasoning is
performed; to give expression to them in the symbolical language of a calculus, and upon this
foundation to establish the science of logic and construct its method; to make this method
itself the basis of a general method; and, from the various elements of truth brought to view in the
course of these inquiries, to collect some probable intimations concerning the nature and constitution of
the human mind.
Comparison b/w Propositional Logic & First Order Predicate Logic
S.NO  PL                                                      FOPL
1.    Less declarative.                                       More declarative.
2.    Context-dependent semantics.                            Context-independent semantics.
3.    Ambiguous and less expressive.                          Unambiguous and more expressive.
4.    Propositions are used as components with                Uses predicates/relations between objects,
      logical connectives.                                    functions, variables, logical connectives and
                                                              quantifiers (existential and universal).
5.    Rules of inference such as Modus Ponens, Modus          Rules of inference are used along with
      Tollens and disjunctive syllogism are used for          the rules for quantifiers.
      deduction.
6.    Inference algorithms such as inference rules,           Inference algorithms such as unification,
      DPLL and GSAT are used.                                 resolution, and backward and forward
                                                              chaining are used.
7.    Satisfiability is NP-complete.                          Inference is semi-decidable.
Ques 10 (A) Convert the following sentences to wff in first order predicate logic.
(i) No coat is water proof unless it has been specially treated.
(ii) A drunkard is the enemy of himself.
(iii) Any teacher is better than a lawyer.
(iv) If x and y are both greater than zero, so is the product of x and y.
(v)Every one in the purchasing department over 30 years is married.
(B) Determine whether each of the following sentence is satisfiable, contradictory or valid
S1 : (p ˅ q) → (p ˅ ~q) ˅ p        S2 : p → ( q → p )
Ans : (A) (i) No coat is waterproof unless it has been specially treated:
∀x [ coat(x) ˄ ~treated(x) → ~waterproof(x) ]
(iv) If x and y are both greater than zero, so is the product of x and y.
∀x ∀y [ GT(x, 0) ˄ GT(y, 0) → GT( times(x, y), 0 ) ]
where GT is the "greater than" predicate and times(x, y), denoting "x times y", is a function;
alternatively the function product_of(x, y) may be used.
p  q  p˅q  ~q  p˅~q  (p˅q) → (p˅~q)˅p
T  T   T    F    T          T
T  F   T    T    T          T
F  T   T    F    F          F
F  F   F    T    T          T
Hence, by the last column of the truth table, the above statement is satisfiable (true in at least one
row, but not in all rows, so it is not valid).
Ques 11: Using the inference rules of Propositional logic , Prove the validity of following
axioms:
(i) If either algebra is required or geometry is required then all students will study
mathematics.
(ii) Algebra is required and trigonometry is required; therefore all students will study
mathematics.
Ans : Converting the above sentences to propositional logic and applying inference rules:
(i) (A ˅ G) → S
(ii) A ˄ T        To prove that : S is true
1. A              (simplification of (ii))
2. A ˅ G          (addition, from 1)
3. S              (modus ponens, from 2 and (i))
Hence the above axioms are valid, because S is proved to be true.
Ques 12 : Determine whether the following argument is valid or not. “If I work the whole night on this
problem, then I can solve it. If I solve the problem, then I will understand the topic.
Therefore, if I work the whole night on this problem, then I will understand the topic.”
Ans : Converting the above sentences to propositional logic and applying inference rules:
Let P: “I work the whole night on this problem”, Q: “I solve the problem”, R: “I understand the topic”.
Premises: P → Q and Q → R.  Conclusion: P → R.
By the hypothetical syllogism (chain rule), P → Q and Q → R together entail P → R;
hence the argument is valid.
Ques 14 : What is clause form of a wff (well-formed formula)? Convert the following formula into
clause form : ∀x ∀y [ ∀z P( f(x), y, z) → { ∃u Q( x , u) ˄ ∃v R( y, v) } ].
Ans : Clause Form : In Theory of logic either it is propositional logic or predicate logic , while proving the
validity of statements using resolution principle it is required to convert well-formed formula into the
clause form. Clause form is the set of axioms in which propositions or formula are connected only through
OR (˅) connector.
Step 1: Elimination of implication, applying P → Q ≡ ~P ˅ Q :
∀x ∀y ( ~∀z P( f(x), y, z ) ˅ ( ∃u Q( x , u) ˄ ∃v R( y , v) ) )
Step 2: Reduce the scope of negation, using ~∀z P ≡ ∃z ~P :
∀x ∀y ( ∃z ~P( f(x), y, z ) ˅ ( ∃u Q( x , u) ˄ ∃v R( y , v) ) )
Step 3: Standardize variables apart (the variables are already distinct here).
Step 4: Eliminate existential quantifiers by Skolemization; z, u and v each lie in the scope of ∀x ∀y,
so replace them with Skolem functions g(x, y), h(x, y) and k(x, y):
∀x ∀y ( ~P( f(x), y, g(x, y) ) ˅ ( Q( x , h(x, y) ) ˄ R( y , k(x, y) ) ) )
Step 5: Convert to prenex form.
Step 6: Drop the universal quantifiers (all remaining variables are implicitly universally quantified).
Step 7: Apply the distributive law for CNF, P ˅ ( Q ˄ R ) ≡ ( P ˅ Q ) ˄ ( P ˅ R ), and split the resulting
conjunction into separate clauses (with Skolem functions g, h, k replacing z, u, v):
C1: ~P( f(x), y, g(x, y) ) ˅ Q( x , h(x, y) )
C2: ~P( f(x), y, g(x, y) ) ˅ R( y , k(x, y) )
Ans : Resolution Principle : This is also called proof by refutation. To prove that a statement is valid,
resolution attempts to show that the negation of the statement produces a contradiction with the known
statements. At each step two clauses, called PARENT CLAUSES, are compared (resolved),
yielding a new clause that is inferred from them.
Example : Let two clauses in PL, C1 and C2, be given as:
C1 : winter ˅ summer ,  C2 : ~winter ˅ cold. The assumption is that both C1 and C2
are true. From C1 and C2 we can infer/deduce summer ˅ cold. This is the RESOLVENT CLAUSE.
The resolvent clause is obtained by combining all of the literals of the two parent clauses except the
ones that cancel. If the clause that is produced is the empty clause, then a contradiction has been found.
E.g : winter and ~winter resolve to the empty clause.
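The single resolution step described above can be sketched in Python; clauses are represented as frozensets of string literals, with negation written as a leading `~` (representation choices are illustrative):

```python
def resolve(c1, c2):
    """Return all resolvents of two propositional clauses. A clause is a
    frozenset of literals; a negated literal carries a leading '~'."""
    resolvents = []
    for lit in c1:
        complement = lit[1:] if lit.startswith("~") else "~" + lit
        if complement in c2:
            # Combine all literals of both parents except the cancelling pair.
            resolvents.append((c1 - {lit}) | (c2 - {complement}))
    return resolvents

c1 = frozenset({"winter", "summer"})
c2 = frozenset({"~winter", "cold"})
print(resolve(c1, c2))  # one resolvent: {summer, cold}
print(resolve(frozenset({"winter"}), frozenset({"~winter"})))  # the empty clause
```

An empty frozenset in the result is the empty clause, i.e. the contradiction that refutation-style proofs search for.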
Algorithm of resolution in propositional logic:
Step 1: Convert all the propositions of F to clause form, where F is set of axioms.
Step 2: Negate proposition P and convert the result to clause form. Add it to the set of clauses
obtained in step 1.
Step 3: Repeat until a contradiction is found or no progress can be made:
(a) Select two clauses (the parent clauses) that contain complementary literals.
(b) Resolve them; the resolvent is the disjunction of all literals of both parent clauses with the
complementary pair removed.
(c) If the resolvent is the empty clause, a contradiction has been found and P is proved;
otherwise add the resolvent to the set of clauses.
Ans : (B) Assume ~R is true, and add it to the set of clauses formed from the given axioms (as the set of support).
C1 : P ,  C2 : ~P ˅ ~Q ˅ R   (by eliminating the implication in (P ˄ Q) → R),
C3 : ~S ˅ Q ,  C4 : ~T ˅ Q ,  C5 : T ,  C6 : ~R.
(Eliminating the implication from (S ˅ T) → Q gives ~(S ˅ T) ˅ Q ≡ (~S ˄ ~T) ˅ Q by De Morgan’s
law; applying the distributive law we obtain (~S ˅ Q) ˄ (~T ˅ Q), which splits into the two clauses
C3 and C4 after removing the AND connector.)
Resolution proceeds as follows:
1. Resolve C2 (~P ˅ ~Q ˅ R) with C6 (~R): resolvent ~P ˅ ~Q.
2. Resolve ~P ˅ ~Q with C1 (P): resolvent ~Q.
3. Resolve ~Q with C4 (~T ˅ Q): resolvent ~T.
4. Resolve ~T with C5 (T): the empty clause — a contradiction is found.
The assumption that ~R is true is therefore false, so R is true.
Ques 16: How is resolution in first order predicate logic different from that of propositional
performed? What is Unification Algorithm & why it is required?
Ans : In FOPL, while solving through resolution, the situation is more complicated, since we must consider
all the possible ways of substituting values for variables. The presence of existential and universal
quantifiers in wffs, and of arguments in predicates, adds further complexity.
Finding a contradiction means systematically trying the possible substitutions and seeing whether each
produces a contradiction. To apply resolution in predicate logic, we first need to apply the unification
technique, because in FOPL literals with arguments are to be resolved, and matching of their arguments
is also required.
Unification Algorithm: The unification algorithm is used as a recursive procedure. Let two literals in
FOPL be P(x, x) and P(y, z). The predicate name P matches in both literals, but the arguments do
not match, so substitution is required. The first arguments, x and y, do not match, so
substitute y for x; then they match, and the procedure recurses on the remaining arguments.
iv. A variable can’t be unified with a function that has that same variable as an argument (the occurs check).
v. Two different constants can’t be unified.
vi. Predicates/literals with different numbers of arguments can’t be unified.
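The recursive procedure and the rules above can be sketched in Python; the term encoding (lowercase strings for variables, capitalized strings for constants, tuples for compound terms) is a representation choice for illustration:

```python
def is_var(t):
    # Convention for this sketch: variables are lowercase strings.
    return isinstance(t, str) and t[0].islower()

def occurs(v, t, s):
    """Occurs check (rule iv): does variable v occur inside term t under s?"""
    if v == t:
        return True
    if is_var(t) and t in s:
        return occurs(v, s[t], s)
    if isinstance(t, tuple):
        return any(occurs(v, ti, s) for ti in t)
    return False

def unify_var(v, t, s):
    if v in s:
        return unify(s[v], t, s)
    if is_var(t) and t in s:
        return unify(v, s[t], s)
    if occurs(v, t, s):
        return None              # rule iv: occurs check fails
    return {**s, v: t}

def unify(x, y, s=None):
    """Return a substitution dict unifying x and y, or None on failure."""
    if s is None:
        s = {}
    if x == y:
        return s
    if is_var(x):
        return unify_var(x, y, s)
    if is_var(y):
        return unify_var(y, x, s)
    if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
        for xi, yi in zip(x, y):
            s = unify(xi, yi, s)
            if s is None:
                return None
        return s
    return None  # rules v/vi: different constants or mismatched arities

# The P(x, x) vs P(y, z) example from the text: x/y, then y/z
print(unify(("P", "x", "x"), ("P", "y", "z")))  # {'x': 'y', 'y': 'z'}
```

Applying the resulting substitution to either literal yields the common instance P(z, z).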
Ques 17: Given the following set of facts, Prove that “ Some who are intelligent can’t read ”.
Ques 18 : Given the following set of facts :-
Ans : Converting the given statements into wffs of FOPL:
1. ∀x : Food(x) → Likes(John, x)
2. Food(Apples)
3. Food(Chicken)
4. ∀x ∀y : Eats(x, y) ˄ ~Killed(x) → Food(y)
5. Eats(Bill, Peanuts) ˄ Alive(Bill)
6. ∀x : Eats(Bill, x) → Eats(Sue, x)
To prove that : Likes(John, Peanuts).
Proof sketch by resolution: negate the goal to get ~Likes(John, Peanuts). Resolving it with clause 1
under the substitution σ = x/Peanuts yields ~Food(Peanuts). From 5 we have Eats(Bill, Peanuts) and
Alive(Bill), i.e. Bill is not killed; resolving these with 4 under x/Bill, y/Peanuts yields Food(Peanuts),
which contradicts ~Food(Peanuts). The empty clause is derived, so Likes(John, Peanuts) holds.
Ques 19 : Explain Backward and forward Chaining , with example in logic representation. Also mention
advantages and disadvantages of both the algorithms.
Ans : The process of the output of one rule activating another rule is called chaining. The chaining
technique breaks the task into small procedures and then works through each procedure in the sequence.
Two types of chaining techniques are known: forward chaining and backward chaining.
(A) Forward chaining :
➢ This is a data-driven reasoning method: it starts with the known facts and tries to match the rules
against these facts.
➢ There is a possibility that all the rules match the information (conditions). In forward chaining,
firstly the rules looking for matching facts are tested, and then the action is executed.
➢ In the next stage the working memory/term memory is updated by new facts and the matching
process all over again starts. This process is running until no more rules are left, or the goal is
reached.
➢ Forward chaining is useful when a lot of information is available. Forward chaining is useful to
be implemented if there are an infinite number of potential solutions like configuration
problems and planning.
A rule-based KB is given below, and the conclusion is to be proved:
Rule1: IF A OR B THEN C
Rule 2 : IF D AND E AND F THEN G
Rule 3: IF C AND G THEN H
The following facts are presented: B, D, E, F. Goal: prove H. The structure of a forward chaining
example is given in the following figure:
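The forward chaining run over these three rules can be sketched in Python (the rule encoding is an illustrative choice):

```python
def forward_chain(rules, facts, goal):
    """Naive forward chaining: repeatedly fire any rule whose premises hold,
    add its conclusion to working memory, and stop when the goal is derived
    or no new facts can be added."""
    facts = set(facts)
    changed = True
    while changed and goal not in facts:
        changed = False
        for premises, kind, conclusion in rules:
            if conclusion in facts:
                continue
            if kind == "AND":
                fires = all(p in facts for p in premises)
            else:  # "OR"
                fires = any(p in facts for p in premises)
            if fires:
                facts.add(conclusion)
                changed = True
    return goal in facts

rules = [
    (("A", "B"), "OR", "C"),        # Rule 1: IF A OR B THEN C
    (("D", "E", "F"), "AND", "G"),  # Rule 2: IF D AND E AND F THEN G
    (("C", "G"), "AND", "H"),       # Rule 3: IF C AND G THEN H
]
print(forward_chain(rules, {"B", "D", "E", "F"}, "H"))  # True
```

With the given facts, Rule 1 fires on B (adding C), Rule 2 fires on D, E, F (adding G), and Rule 3 then fires on C and G, deriving the goal H.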
Backward Chaining :
➢ Backward chaining is the opposite of forward chaining.
➢ In contrast to forward chaining, backward chaining is a goal-driven reasoning method. Backward
chaining starts from the goal (from the end), which is a hypothetical solution, and the
inference engine tries to find the matching evidence.
UNIVERSITY ACADEMY 34
UNIVERSITY ACADEMY 35
➢ When a rule whose conclusion (RHS) matches the goal is found, its conditions become sub-goals,
and rules are then searched to prove these sub-goals. This process continues until all the
sub-goals are proved; on failure, it backtracks to the previous step where a rule was chosen.
➢ If an individual sub-goal cannot be established by a rule, another rule is chosen.
➢ Backward chaining reasoning is good for cases where there are not many facts and the
information (facts) must be obtained from the user. Backward chaining reasoning is
also effective for application in diagnostic tasks.
In many cases the linear logic programming languages are implemented using the
backward chaining technique. The combination of backward chaining with forward
chaining provides better results in many applications.
Ques 20: What is Utility theory and its importance in AI ? Explain with the help of suitable examples.
Ans : Utility theory is concerned with people's choices and decisions. It is concerned also with people's
preferences and with judgments of preferability, worth, value, goodness or any of a number of similar
concepts. Utility means quality of being useful. So as per this each state in environment has a degree of
usefulness to an agent, that agent will prefer states with higher utility.
Interpretations of utility theory are often classified under two headings, prediction and prescription:
(i) The predictive approach is interested in the ability of a theory to predict actual choice behavior.
(ii) The prescriptive approach is interested in saying how a person ought to make a decision.
E.g : Psychologists are primarily interested in prediction.
Economists are interested in both prediction and prescription. In statistics the emphasis is on
prescription in decision making under uncertainty, and the emphasis in management science is also
prescriptive.
Sometimes it is useful to ignore uncertainty and focus on ultimate choices; at other times, uncertainty
must be modeled explicitly, as in insurance markets, financial markets and game theory. Rather than
choosing an outcome directly, the decision-maker chooses an uncertain prospect (or lottery). A lottery is
a probability distribution over outcomes.
This has two basic components; consequences (or outcomes) and lotteries.
(a) Consequences: These are what the decision-maker ultimately cares about.
Example: “I get pneumonia; my health insurance company covers most of the costs, but I have to pay
a $500 deductible.” The consumer does not choose consequences directly.
(b) Lotteries: The consumer chooses a lottery p, a probability distribution over consequences:
p : C → [0, 1] with ∑c ∈ C p(c) = 1. The set of all lotteries is denoted by P. Example: “A gold-level
health insurance plan, which covers all kinds of diseases but has a $500 deductible.” This makes
sense because the consumer is assumed to rank health insurance plans only insofar as they lead to
different probability distributions over consequences.
Utility Function : U : P → R has an expected utility form if there exists a function u : C → R such that
U(p) = ∑c ∈ C p(c) u(c) for all p ∈ P. In this case, the function U is called an expected utility function,
and the function u is called a von Neumann-Morgenstern utility function. These functions capture an
agent’s preferences between various world states: u assigns a single number expressing the desirability
of a state. Utilities are combined with the outcome probabilities of actions to give an expected utility for
each action. U(s) means the utility of state s, for the agent’s decision.
Maximum expected Utility ( MEU) : This represents that a rational agent should select an action that
maximizes the agent’s expected utility. MEU principle says “ If an agent maximizes a utility function that
correctly reflects the performance measure by which its behavior is being judged , then it will achieve the
highest possible performance score if we average over the environment of agent.”
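The expected utility formula and the MEU principle can be sketched together in Python; the insurance lottery below reuses the $500-deductible example, with hypothetical utilities and probabilities:

```python
def expected_utility(lottery, u):
    """U(p) = sum over consequences c of p(c) * u(c)."""
    return sum(prob * u[c] for c, prob in lottery.items())

def meu_action(actions, u):
    """MEU principle: choose the action whose lottery has maximum expected utility."""
    return max(actions, key=lambda a: expected_utility(actions[a], u))

# Hypothetical utilities (in dollars of loss) and outcome probabilities
u = {"healthy": 0, "pay_deductible": -500, "pay_full_cost": -20000}
actions = {
    "buy_insurance": {"healthy": 0.9, "pay_deductible": 0.1},
    "no_insurance":  {"healthy": 0.9, "pay_full_cost": 0.1},
}
print(meu_action(actions, u))  # buy_insurance  (EU = -50 versus -2000)
```

Here each action is itself a lottery over consequences; the agent ranks actions only by the expected utilities of those lotteries, exactly as the MEU principle prescribes.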
Ques 21: What are constraint notations in utility theory ? Define the term Lottery. Also mention the
following axioms of Utility Theory :
(i) Orderability (ii) Substitutability (iii) Monotonicity (iv)Decomposability.
Ans : Constraint Notations in Utility theory for two outcomes / consequences A and B are as mentioned
below :
➢ A ≻ B : A is preferred over B.
➢ A ~ B : Agent is indifferent between A and B.
➢ A ≥ B : Agent prefers A to B or is indifferent b/w them.
A lottery L with possible outcomes C1, C2, C3, …, Cn occurring with probabilities p1, …, pn is written
[ p1 , C1 ; p2 , C2 ; … ; pn , Cn ]. Each outcome of a lottery can be an atomic state or another lottery.
(i) Orderability : Given any two states , a rational agent must prefer one to other or else rate the
two as equally preferable. So agent can’t avoid the decision.
( A ≻ B ) ˅ ( B ≻ A ) ˅ ( A ~ B )
(ii) Substitutability: If an agent A is indifferent b/w two lotteries A and B , then the agent is
indifferent b/w two more complex lotteries that are same except that B is substituted for A in
one of them.
( A ~ B ) → [ p , A ; 1 – p , C ] ~ [ p , B ; 1 – p , C ]
(iii) Monotonicity: If an agent prefers A to B, then it must prefer the lottery that gives A
with the higher probability:
( A ≻ B ) → ( p ≥ q ⟺ [ p , A ; 1 – p , B ] ≥ [ q , A ; 1 – q , B ] )
(iv) Decomposability: Compound lotteries can be reduced to simpler ones using the laws of
probability:
[ p , A ; 1 – p , [ q , B ; 1 – q , C ] ] ~ [ p , A ; (1 – p) q , B ; (1 – p)(1 – q) , C ]
Ans : Probabilistic reasoning in intelligent systems gives a complete and accessible account of the
theoretical foundations and computational methods that underlie plausible reasoning under uncertainty.
Intelligent agents almost never have access to the whole truth about their environment, so agents act under
uncertainty. The agent’s knowledge can only provide a degree of belief; the main concept for dealing with
degrees of belief is PROBABILITY THEORY.
➢ If probability is 0 , then belief is that statement is false.
➢ If probability is 1 , then belief is that statement is true.
Percepts received from the environment form the evidence on which probability assertions are based.
As agent receives new percepts , its probability assessments are updated to reflect new Evidence.
➢ Before the evidence is obtained, we talk about prior (unconditional) probability.
➢ After the evidence is given, we deal with posterior (conditional) probability.
Probability associated with a proposition (sentence) P is the degree of belief associated with it in the
absence of any other information.
• In AI applications, sample points are defined by a set of random variables, which may be
Boolean, discrete or continuous.
Probability Distribution: With respect to some random variable we talk about the probabilities of all
possible outcomes of a random variable. E.g : Let weather is random variable , Given that :
➢ P( weather = sunny) = 0.7 , P( weather = rainy) = 0.2 , P( weather = cloudy) = 0.08
P( weather = snowy ) = 0.02
Joint Probability Distribution: Joint probability distribution for a set of random variables gives the
probability of every atomic event on those random variables (i.e., every sample point).In this case
P(Weather, Cavity) can be given by a 4 × 2 matrix of values
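A joint distribution and marginalization over it can be sketched in Python; the weather probabilities are the ones given above, while the cavity prior and the independence used to fill the 4 × 2 table are assumptions made purely to have concrete numbers:

```python
weather_p = {"sunny": 0.7, "rainy": 0.2, "cloudy": 0.08, "snowy": 0.02}
cavity_p  = {True: 0.2, False: 0.8}  # assumed prior, for illustration only

# The 4 x 2 joint table P(Weather, Cavity); built here assuming the two
# variables are independent, just to have a concrete table to sum over.
joint = {(w, c): pw * pc for w, pw in weather_p.items() for c, pc in cavity_p.items()}

def marginal(index, value):
    """Marginalize: sum the joint over every atomic event matching `value`."""
    return sum(p for event, p in joint.items() if event[index] == value)

print(round(marginal(0, "sunny"), 3))  # 0.7
print(round(marginal(1, True), 3))     # 0.2
```

Summing out one variable recovers its distribution, illustrating that the full joint contains the answer to any query over its variables.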
Ques 23:Explain in detail Markov Model and its applications in Artificial Intelligence.
➢ A Markov model is a stochastic model used for systems that do not have any fixed pattern of
occurrence, i.e., randomly changing systems.
➢ A Markov model assumes a random probability distribution or pattern that may
be analysed statistically but cannot be predicted precisely.
➢ In a Markov model, it is assumed that future states depend only upon the current state and not on
previously occurred states. In a first-order Markov model, the current state depends only on the
immediately previous state, i.e., the conditional probability is: P( Xt | X0:t−1 ) = P( Xt | Xt−1 )
Set of states: { S1 , S2 , S3 , …, Sn }. The process moves from one state to another, generating a sequence
of states. An observable state sequence leads to a Markov chain model; non-observable (hidden) states
lead to a hidden Markov model.
Transition Probability Matrix: Each time a new state is reached, the system has advanced one step;
each step represents a time period which may result in another possible state. Let Si be state i of the
environment, for i = 1, 2, …, n; the matrix entry aij = P( Sj | Si ) gives the probability of moving from
state Si to state Sj.
In a hidden Markov model (HMM), the states of the model are hidden; each state can emit an output which
is observed. This model is used because a simple Markov chain is too restricted for complex applications.
➢ In Hidden Markov-Model, every individual state has limited number of transitions and emissions.
State sequences are not directly observable, rather it can be recognized from the sequence of
observations produced by the system.
➢ Probability is assigned for each transition between states.
➢ Given the current state, the future states are independent of the past states (the Markov property).
➢ The model is called “hidden” because the state sequence itself is not directly observable; it is
also a memoryless process, since its future does not depend on the past given the present.
➢ This can be achieved on two algorithms called as:
(i) Forward Algorithm. (ii) Backward Algorithm.
Components of HMM :
➢ Set of states: { S1 , S2 , S3 , …, Sn }.
➢ Sequence of states generated by the system: { Si1 , Si2 , …, Sik−1 , Sik }
➢ Joint probability distribution by the Markov chain property:
P( Sik | Si1 , Si2 , …, Sik−1 ) = P( Sik | Sik−1 )
➢ Observations / visible states: { V1 , V2 , …, Vm−1 , Vm }
Ques 25 : Consider the following data provided for Weather Forecasting Scenario.
Two states (Hidden) : ‘Low’ and ‘High’ atmospheric pressure.
Two observations (Visible States) : ‘Rain’ and ‘Dry’.
Suppose we want to calculate a probability of a sequence of observations in our
example, { ‘Dry’,’ Rain’}.
Ans : Solution :
Transition probabilities:
P(‘Low’|‘Low’) = 0.3
P(‘High’|‘Low’) = 0.7,
P(‘Low ’|‘High’) = 0.2 ,
P(‘High ’|‘High’) = 0.8
Observation probabilities:
P(‘Rain ’|‘Low’) = 0.6
P(‘Dry ’|‘Low’) = 0.4
P(‘Rain ’|‘High’) = 0.4
P(‘Dry ’|‘High’) =0.3 .
P({‘Dry’, ‘Rain’}) = Σ over hidden state pairs (s1, s2) of P(‘Dry’ | s1) · P(‘Rain’ | s2) · P(s1) · P(s2 | s1),
which also requires the initial state probabilities P(s1). For example, the {‘Low’, ‘Low’} term is
P(‘Dry’ | ‘Low’) · P(‘Rain’ | ‘Low’) · P(‘Low’ | ‘Low’) · P(‘Low’) = 0.4 · 0.6 · 0.3 · P(‘Low’).
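The sum over hidden state sequences can be evaluated by enumeration; a Python sketch using the transition and observation probabilities above, with initial state probabilities P(‘Low’) = 0.4 and P(‘High’) = 0.6 that are assumed here for illustration:

```python
from itertools import product

trans = {"Low":  {"Low": 0.3, "High": 0.7},
         "High": {"Low": 0.2, "High": 0.8}}
emit  = {"Low":  {"Rain": 0.6, "Dry": 0.4},
         "High": {"Rain": 0.4, "Dry": 0.3}}
init  = {"Low": 0.4, "High": 0.6}  # assumed initial probabilities (illustrative)

def observation_prob(obs):
    """P(obs) = sum over all hidden state sequences of P(obs | states) * P(states)."""
    total = 0.0
    for states in product(trans, repeat=len(obs)):
        p = init[states[0]] * emit[states[0]][obs[0]]
        for prev, cur, o in zip(states, states[1:], obs[1:]):
            p *= trans[prev][cur] * emit[cur][o]
        total += p
    return total

print(round(observation_prob(["Dry", "Rain"]), 4))  # 0.1528 under the assumed init
```

The four terms (Low-Low, Low-High, High-Low, High-High) are 0.0288, 0.0448, 0.0216 and 0.0576 under these assumptions; for longer observation sequences the forward algorithm computes the same sum without enumerating all state sequences.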
Ques 26 : Explain in detail Bayesian Theory and its use in AI. Define Likelihood ratio.
Ans : In probabilistic reasoning our conclusions are generally based on available evidence and past
experience, and this information is mostly incomplete. When outcomes are unpredictable we use
probabilistic reasoning, e.g., in weather forecasting systems, disease diagnosis and traffic congestion
control systems.
➢ A doctor examines a patient’s history, symptoms and test results as evidence of possible disease.
➢ Weather forecasting predicts tomorrow’s cloud coverage, wind speed and direction, and sun
heat intensity.
➢ A business manager must decide, on the basis of uncertain predictions, when to launch a new
product. Factors can be: the target consumers’ lifestyle, population growth in a specific city/state,
average income of consumers, and the economic scenario of the country. All of this can depend
on past experience of the market.
From the product rule of probability theory we express the following equations:
P ( a ∧ b ) = P(a ∣ b) . P( b ) ..............Eq 1.
P( a ∧ b ) = P( b ∣ a ) P( a )............... Eq 2.
On equating both equations:  P( b | a ) = P( a | b ) · P( b ) / P( a )
Bayes’ rule is used in modern AI systems for probabilistic inference. It uses the notion of conditional
probability: P( H | E ). This expression is read as “the probability of hypothesis H given that we have
observed evidence E”. For this we require the prior probability of H (if we have no evidence) and the
extent to which E provides evidence for H.
Bayes’ theorem states:  P( Hi | E ) = P( E | Hi ) · P( Hi ) / Σn=1..K P( E | Hn ) · P( Hn )
The likelihood ratio P( E | H ) / P( E | ~H ) measures how much more probable the evidence is under
hypothesis H than under its negation; the larger the ratio, the more strongly E supports H.
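Bayes' theorem can be sketched directly from the formula; the two-hypothesis diagnosis below uses hypothetical numbers purely for illustration:

```python
def bayes(prior, likelihood):
    """Posterior P(H_i | E) = P(E | H_i) P(H_i) / sum_n P(E | H_n) P(H_n)."""
    total = sum(likelihood[h] * prior[h] for h in prior)
    return {h: likelihood[h] * prior[h] / total for h in prior}

# Hypothetical diagnosis: two competing hypotheses, evidence = a positive test
prior      = {"flu": 0.1, "cold": 0.9}   # P(H)
likelihood = {"flu": 0.9, "cold": 0.2}   # P(E | H)

posterior = bayes(prior, likelihood)
print(round(posterior["flu"], 3))                       # 0.333
print(round(likelihood["flu"] / likelihood["cold"], 2)) # 4.5 -- likelihood ratio
```

Even though the test is 4.5 times more likely under flu than under a cold, the low prior keeps the posterior probability of flu at only one third: the denominator of Bayes' rule renormalizes over all hypotheses.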
➢ Inference in Belief Networks: after constructing such a network, an inference engine
can use it to maintain and propagate beliefs. When new information is received, the effects can be
propagated throughout the network until equilibrium probabilities are reached.
(a) Diagnostic inference: symptoms to causes
(b) Causal inference: causes to symptoms
(c) Intercausal inference
(d) Mixed inference: mixes those above
➢ Inference in Multiply Connected Belief Networks
(a) Multiply connected graphs have 2 nodes connected by more than one path
(b) Techniques for handling:
✓ Clustering: Group some of the intermediate nodes into one meganode.
Short Questions & Answers
Ques 1. Name out three basic techniques of machine learning .
Ans : (a) Supervised Learning (b) Unsupervised Learning (c) Reinforcement Learning.
Ans :
Ans : These are used in the decision-tree learning technique. Each example consists of a vector of input
attributes X and a single Boolean output y, e.g., a set of examples (X1, y1), …, (X6, y6).
Positive examples are those in which the goal is true; negative examples are those in which the goal is false.
The complete set is called the training set.
Ques 4. Compare the Decision tree method with Naïve Baye’s Learning.
Ans : (i) Naïve Bayes learns a little less efficiently than decision-tree learning.
(ii) Naïve Bayes learning works well for a wide range of applications compared to decision trees.
(iii) Naïve Bayes scales well to very large problems: with n Boolean attributes, only 2n + 1
parameters are required.
Long Question & Answers
Ques 6. Explain Machine learning. Illustrate learning model? Mention some factors that affect the
learning.
Ans : Machine learning is the sub field of AI in which we try to improve decision making power of
intelligent agents. Agent has a performance element that decides what actions to take and a learning element
that modifies the performance element so that it makes better decisions. Design of learning element is
affected by following three major factors :
1) Which components of performance element are to be learned.
2) What feedback is available to learn these components.
3) What is representation method used for components.
Following are some ways of learning mostly used in machines:
(A) Logical learning (B) Inductive learning (C) Deductive learning.
Logical Learning: In this process a new concept or solution is acquired through the use of similar known
concepts. We use this type of learning when solving problems in an exam, where previously learned
examples serve as a guide, or when we learn to drive a truck using our knowledge of car driving.
Inductive Learning: This technique requires the use of inductive inference, a form of inference that is not
deductively valid but is nonetheless useful. We use inductive learning when we formulate a general concept
after seeing a number of instances or examples of the concept. E.g : when we learn the concept of color or
sweet taste after experiencing the sensations associated with several objects.
Deductive Learning: This is performed through a sequence of deductive inference steps using known facts.
From the known facts, new facts or relationships are logically derived. E.g : if we know that the weather is
Hot and Humid, then we can infer that it may Rain. Another example: let
P → Q and Q → R; then we can infer that P → R.
Environment has been included as a part of the overall learning system. It produces stimuli, which may be as
unstructured as random inputs or as organized as the carefully selected training examples a teacher provides
to the learner component. A user working on a keyboard can also be an environment for some specific
systems.
Inputs to the learning system may be physical stimuli, sounds, signals, text descriptions, or symbolic
notations. This information is used to create and modify knowledge structures in the KB. The same knowledge is
used by the performance component to carry out tasks, such as solving a problem or playing a computer
game.
The performance component produces a response/action when a task is provided. The Critic module then
evaluates this response relative to an optimal response. Feedback indicating whether or not the
performance is acceptable is then forwarded by the critic module to the learner component for subsequent use
in modifying the structures in the knowledge base.
Factors affecting the Machine Learning Process:
1) Types of training provided. E.g: Supervised technique , Unsupervised technique etc.
2) Form and extent of any initial background knowledge or past history.
3) The types of feedbacks provided.
4) Learning algorithms applied.
Ques7. Differentiate between Supervised Learning and Unsupervised Learning. Also mention some of
the application areas of both.
Ans :
S.No | Supervised Learning | Unsupervised Learning
1. | Learning of a function can be done from its inputs and outputs. | Learning is used to draw inferences from a data set containing only input data.
2. | Classifies the data on the basis of the available training set and uses that to classify new data. | Clusters the data on the basis of similarities in the characteristics found in the data, grouping similar objects into clusters.
Ques. 8 Write Short notes on the following: (a) Statistical Learning (b) Naïve Bayes Model
Ans : (a) Statistical Learning Technique: In this technique the key ideas are data and hypotheses. Here data
are evidence, i.e. instantiations of some or all of the random variables describing the domain. Bayesian learning
calculates the probability of each hypothesis given the data and makes predictions on that basis.
Let D be the data set, with observed value d. Then the probability of each hypothesis is obtained by
Bayes' rule as: P(hi | d) = α P(d | hi) P(hi).
For prediction of an unknown quantity X, the expression is:
P(X | d) = Σi P(X | d, hi) P(hi | d) = Σi P(X | hi) P(hi | d).
The prediction above is a weighted average over the predictions of the individual hypotheses. The hypotheses
are intermediaries between the raw data and the predictions. A very common approximation is to make
predictions based on a single most probable hypothesis, i.e. the hi that maximizes P(hi | d); this is called
Maximum a Posteriori (MAP).
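The posterior-plus-prediction calculation can be sketched as follows; the hypothesis set and priors are illustrative values, not from a real problem:

```python
# Bayesian learning sketch: posterior over hypotheses and a weighted prediction.
# Each hypothesis h_i fixes P(lime) for a candy bag; priors are illustrative.
p_lime_given_h = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i)
prior = [0.1, 0.2, 0.4, 0.2, 0.1]              # P(h_i)

def posterior(num_limes):
    # P(h_i | d) = alpha * P(d | h_i) * P(h_i), where d = num_limes limes observed
    unnorm = [p ** num_limes * pr for p, pr in zip(p_lime_given_h, prior)]
    alpha = 1.0 / sum(unnorm)
    return [alpha * u for u in unnorm]

def predict_lime(num_limes):
    # P(x | d) = sum_i P(x | h_i) P(h_i | d): a weighted average over hypotheses
    return sum(p * w for p, w in zip(p_lime_given_h, posterior(num_limes)))

post = posterior(10)
map_hypothesis = max(range(5), key=lambda i: post[i])  # the MAP hypothesis
```

After ten limes in a row, the all-lime hypothesis dominates the posterior, so the weighted prediction for the next candy being lime is close to 1.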
(b) Naïve Bayes Model: This is the most common Bayesian network model used in machine learning.
In this model the class variable C (to be predicted) is the root and the attribute variables Xi are the leaves.
The model is called "naïve" because it assumes that the attributes are conditionally independent of each other,
given the class. Once the model has been trained by the maximum-likelihood technique, it can be used to classify
new examples for which the class variable C is unobserved. For observed attribute values x1, x2, ……, xn, the
prediction is obtained by computing P(C | x1, ……, xn) = α P(C) Πi P(xi | C) and choosing the most likely class.
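A minimal Naïve Bayes classification sketch for a Boolean class and two Boolean attributes; the prior and conditional probabilities below are made-up illustrative parameters, not learned from real data:

```python
# Naive Bayes sketch: score each class by P(C) * prod_i P(xi | C)
# and pick the larger. With n = 2 Boolean attributes this model has
# 2n + 1 = 5 parameters, as noted earlier in these notes.
p_c = 0.6                       # P(C = true), illustrative
p_x_given_c = [0.8, 0.3]        # P(xi = true | C = true)
p_x_given_not_c = [0.1, 0.5]    # P(xi = true | C = false)

def classify(x):
    # Multiply the class prior by each attribute's conditional likelihood
    s_true, s_false = p_c, 1.0 - p_c
    for xi, pt, pf in zip(x, p_x_given_c, p_x_given_not_c):
        s_true *= pt if xi else 1.0 - pt
        s_false *= pf if xi else 1.0 - pf
    return s_true >= s_false

label = classify([True, False])
```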
Ques.9 What is learning with complete data? Explain Maximum Likelihood Parameter Learning
with Discrete Model in detail.
Ans . Statistical learning methods are based on the simple task of parameter learning with complete data.
Parameter learning involves finding the numerical parameters for a probability model with a fixed
structure. E.g : in a Bayesian network, conditional probabilities are obtained for a given scenario. Data are
complete when each data point contains values for every variable in the learning model.
Maximum Likelihood Parameter Learning: Suppose we buy a bag of lime and cherry candy from a
new manufacturer whose lime-cherry proportions are completely unknown; that is, the fraction could be
anywhere between 0 and 1. The parameter θ is the proportion of cherry candies, so the hypothesis is hθ and
the proportion of limes is 1 - θ.
If we assume that all proportions are equally likely a priori, then a maximum-likelihood approach is
reasonable. If we model the situation with a Bayesian network, we need just one random variable, Flavor
(the flavor of a randomly chosen candy from the bag). It has values cherry and lime, where the probability
of cherry is θ. Now suppose we unwrap N candies, of which c are cherries and l = N - c are limes.
The likelihood of this data set is: P(d | hθ) = Πj P(dj | hθ) = θ^c (1 - θ)^l.
So the maximum-likelihood hypothesis is given by the value of θ that maximizes this expression. Computing
the log likelihood: L(d | hθ) = log P(d | hθ) = c log θ + l log(1 - θ).
By taking logarithms, we reduce the product to a sum over the data, which is usually easier to maximize.
To find the maximum-likelihood value of θ, we differentiate L with respect to θ and set the resulting
expression to zero: dL/dθ = c/θ - l/(1 - θ) = 0, which gives θ = c/(c + l) = c/N.
The standard method for maximum-likelihood parameter learning is therefore:
1. Write down an expression for the likelihood of the data as a function of the parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.
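The three-step recipe can be checked numerically; the candy counts below are made up, and a simple grid search over θ confirms the closed-form answer θ = c/N:

```python
import math

# Maximum-likelihood sketch for the candy bag: L(theta) = theta^c * (1-theta)^l.
# Setting dL/dtheta = 0 gives theta* = c / N; a grid scan of the log
# likelihood should land on the same value.
c, l = 3, 7                       # 3 cherries and 7 limes observed (made up)
N = c + l

def log_likelihood(theta):
    return c * math.log(theta) + l * math.log(1.0 - theta)

grid = [i / 1000 for i in range(1, 1000)]       # avoid theta = 0 and 1
theta_best = max(grid, key=log_likelihood)      # numeric maximum
theta_closed_form = c / N                       # analytic maximum
```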
The learning problem decomposes into separate learning problems, one for each parameter. Also, the
parameter values for a variable given its parents are just the observed frequencies of the variable's values
for each setting of the parent values.
Let us look at another example: Suppose this new candy manufacturer wants to give a
little hint to the consumer and uses candy wrappers colored red and green. The Wrapper for
each candy is selected probabilistically, according to some unknown conditional distribution,
depending on the flavor. The corresponding probability model has three parameters: θ, θ1, and θ2, where
θ1 is the probability of a red wrapper on a cherry candy and θ2 is the probability of a red wrapper on a lime candy.
Consider, for example, a cherry candy in a green wrapper; using the joint probability distribution we have:
P(Flavor = cherry, Wrapper = green | hθ,θ1,θ2) = θ (1 - θ1).
Now, for maximum-likelihood estimation, simplify by taking the log of the likelihood to obtain a sum of terms;
then compute the first-order partial derivatives with respect to θ, θ1, and θ2, and equate them to zero to obtain
the parameter values.
Ques.10 Write short notes on
(a) Continuous model for Maximum likelihood Estimation
(b) Learning with Hidden Variables.
(c) EM Algorithm.
Ans : (a) Continuous model for Maximum Likelihood Estimation: Continuous variables are very common
in real-world applications, so it is important to know how to learn continuous models from data. The principles
of maximum-likelihood learning are identical to those of the discrete case. Consider learning the parameters of a
Gaussian density function on a single variable; that is, the data are generated as follows:
P(x) = (1 / (σ √(2π))) e^( -(x - μ)² / (2σ²) )
The parameters of this model are the mean μ and the standard deviation σ. Let the observed values be
x1, x2, ……, xN. Writing the log likelihood and setting the first-order partial derivatives with respect to μ and σ
equal to zero, we obtain: μ = (Σj xj) / N and σ = √( Σj (xj - μ)² / N ).
The maximum-likelihood value of the mean is the sample average, and the maximum-likelihood value of
the standard deviation is the square root of the sample variance.
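A quick numeric check of these two formulas on a small made-up sample:

```python
import math

# Gaussian maximum-likelihood sketch: mu* is the sample mean and
# sigma* is the square root of the (biased, divide-by-N) sample variance.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up observations

mu = sum(data) / len(data)
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))
```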
(b) Learning with Hidden Variables: Many real-world problems have hidden variables (also called
latent variables), which are not observable in the given data samples.
Examples: (i) In medical diagnosis, records mostly consist of symptoms, the treatment applied, and the outcome
of the treatment, but seldom include a direct observation of the disease itself.
(ii) In predicting traffic congestion at office hours, a hidden variable might be an unobserved
"rainy day" causing unusually light traffic at peak hours.
Example: Let the Bayesian network for heart disease (a hidden variable) be as given in the figure below:
In figure (a), each variable has three possible values and is labeled with the number of independent
parameters in its conditional distribution. In figure (b), the equivalent network with Heart Disease removed:
note that the symptom variables are no longer conditionally independent given their parents. Latent variables
can therefore dramatically reduce the number of parameters required to specify a Bayesian network, which in
turn reduces the amount of data needed to learn the parameters.
(c) EM Algorithm (Expectation-Maximization Algorithm): This algorithm is used to solve the
problems arising in learning with hidden variables. The basic idea is to pretend that we know the
parameters of the model and then infer the probability that each data point belongs to each component;
the model is then refitted to the entire data set, with each point weighted by the probability that it belongs
to each component.
➢ Expectation-maximization is a process that can be used for clustering data samples.
➢ For given data, EM can predict feature values for each class on the basis of the
classification of examples, by learning the theory that specifies it.
➢ It starts with a random theory and randomly classified data, and then iterates the following
steps: compute the expected values of the hidden variables for each example, and then re-compute
the parameters using the expected values as if they were observed values. Let X be the observed
values in all examples, Z the set of all hidden variables, and θ all the parameters of the probability
model, e.g. θ = {μ, Σ} for a Gaussian mixture.
➢ E-Step: Compute the expectation of the log likelihood of the completed data with respect to
P(Z = z | x, θi), which is the posterior over the hidden variables.
➢ M-Step: Find the new values of the parameters that maximize the log likelihood of the data,
given the expected values of the hidden indicator variables.
➢ The EM algorithm increases the log likelihood of the data at every iteration. Under certain conditions
EM can be proven to reach a local maximum in likelihood, so EM behaves like a gradient-based
hill-climbing algorithm.
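As a concrete illustration of the E and M steps, here is a sketch of EM for a two-component one-dimensional Gaussian mixture; the data, the initial means, the fixed unit variances, and the equal mixing weights are all simplifying assumptions:

```python
import math

# EM sketch: two 1-D Gaussian components with fixed unit variance and
# equal mixing weights; only the two means are learned. Data are made up.
data = [0.2, -0.1, 0.3, 4.9, 5.1, 5.3]
mu = [0.0, 1.0]                          # initial guesses for the two means

def pdf(x, m):
    # Unit-variance Gaussian density
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)

for _ in range(20):
    # E-step: responsibility r[i][k] = P(component k | data point i)
    r = []
    for x in data:
        w = [pdf(x, m) for m in mu]
        total = sum(w)
        r.append([wk / total for wk in w])
    # M-step: re-estimate each mean, weighting points by their responsibilities
    for k in range(2):
        num = sum(r[i][k] * data[i] for i in range(len(data)))
        den = sum(r[i][k] for i in range(len(data)))
        mu[k] = num / den
```

With the points well separated, the means converge to the two cluster averages (about 0.13 and 5.1).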
Ques. 11 Explain the Reinforcement learning technique in detail. Also mention its applications in the
field of Artificial Intelligence.
Ans : Reinforcement learning: This learning technique is used for agents that must learn when there is
no teacher telling the agent what action to take in each circumstance.
Example 1 : Consider a chess-playing agent trained by supervised learning on examples of game situations
along with the best moves for those situations. The agent can also try random moves, so it can eventually build
a predictive model of its environment. The issue is that "without some feedback about what is good and bad,
the agent will have no grounds for deciding which move to select." The agent needs to know that something good
has happened when it wins and that something bad has occurred when it loses. This kind of feedback is called
a Reward or Reinforcement.
A General Learning Model of Reinforcement Learning:
Architectures in Reinforcement Learning
Policy: This defines the learning agent's behavior at a particular time. It is a mapping from perceived states of
the environment to the actions to be taken when in those states. A policy can be a simple function, a lookup
table, or even a search process.
Reward Function: This is used to define the goal. It maps each perceived state (or state-action pair) of the
environment to a single number, a reward, that indicates the desirability of that state. The objective is to
maximize the total reward received in the long run. Reward functions may be stochastic.
Value Function: Whereas the reward function indicates what is good in an immediate sense, a value function
specifies what is good in the long run. The value of a state is the total amount of reward an agent can expect
to accumulate over the future, starting from that state.
Model: This represents the behavior of the environment. Models are used for planning, i.e. deciding on a
course of action by considering possible future situations.
Application areas of Reinforcement learning are as mentioned below:
1) The success of the most recent version of DeepMind's AI system for playing Go means interest in
reinforcement learning (RL) is bound to increase.
2) RL requires a lot of data, and as such, it has often been associated with domains where simulated
data is available (gameplay, robotics).
3) Automation of well-defined tasks, that would benefit from sequential decision-making that RL can
help automate (or at least, where RL can augment a human expert).
4) Industrial automation is another promising area. It appears that RL technologies from
DeepMind helped Google significantly reduce energy consumption (HVAC) in its own data centers.
5) The use of RL can lead to training systems that provide custom instruction and materials tuned to the
needs of individual students. A group of researchers is developing RL algorithms and statistical
methods that require less data for use in future tutoring systems.
6) Many RL applications in health care mostly pertain to finding optimal treatment policies.
7) Companies collect a lot of text, and good tools that can help unlock unstructured text will find users.
8) RL has been used in techniques for automatically generating summaries from text, based on content
"abstracted" from an original text document.
9) A Financial Times article described an RL-based system for optimal trade execution. The system
(dubbed “LOXM”) is being used to execute trading orders at maximum speed and at the best
possible price.
10) Many warehousing facilities used by e-commerce sites and other supermarkets use intelligent
robots for sorting millions of products every day and helping to deliver the right products to the
right people. Tesla's factory, for example, comprises more than 160 robots that do a major part of the
work on its cars to reduce the risk of defects.
11) Reinforcement learning algorithms can be built to reduce transit time for stocking as well as
retrieving products in the warehouse for optimizing space utilization and warehouse operations.
12) Reinforcement Learning and optimization techniques are utilized to assess the security of the electric
power systems and to enhance Microgrid performance. Adaptive learning methods are employed to
develop control and protection schemes.
Ques 12. Discuss Various Types of Reinforcement Learning Techniques.
Ans : Reinforcement learning techniques are of the following three types:
(a) Passive Reinforcement Learning (b) Temporal Difference Learning (c) Active Reinforcement Learning.
Passive Reinforcement Learning: In this technique the agent's policy is fixed and the task is to learn the
utilities of states (or state-action pairs). If the policy is π and the state is S, then the agent always executes
the action π(S).
➢ The goal is to learn how good the policy is, i.e. to learn the utility function Uπ(S). The passive learning
agent does not know the transition model T(S, a, S'), which specifies the probability of reaching state S'
from state S after action a.
➢ The passive learner also does not know the reward function R(S).
➢ The utility is defined to be the expected sum of discounted rewards obtained if policy π is followed:
Uπ(S) = E[ Σt=0..∞ γ^t R(St) | π, S0 = S ], where γ is a discount factor.
Temporal Difference Learning: When a transition occurs from state S to state S', we update Uπ(S) as
follows: Uπ(S) ← Uπ(S) + α ( R(S) + γ Uπ(S') - Uπ(S) ),
where α is the learning-rate parameter. Because this update rule uses the difference in utilities between
successive states, it is often called the temporal-difference (TD) equation.
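The TD update rule can be sketched directly; the utilities, reward, and parameter values below are all hypothetical:

```python
# Temporal-difference sketch of the update
#   U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))
# for a single observed transition s -> s_next. All numbers are made up.
U = {"s": 0.0, "s_next": 1.0}    # current utility estimates
R = {"s": 0.5}                   # observed reward in state s
alpha, gamma = 0.1, 0.9          # learning rate and discount factor

def td_update(s, s_next):
    U[s] += alpha * (R[s] + gamma * U[s_next] - U[s])

td_update("s", "s_next")         # U["s"] moves toward R(s) + gamma*U(s')
```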
Active Reinforcement Learning: Here the agent must also decide which actions to take, so it must learn a
complete model of the environment and balance exploration against exploitation of what it already knows.
Large state spaces are handled by function approximation: the compression achieved by a function
approximator allows the learning agent to generalize from states it has visited to states it has not visited.
E.g : an evaluation function for chess represented as a weighted linear function of a set of features or
basis functions f1, f2, ……, fn.
Ques 13. What is Decision Tree Learning? Why is it useful in AI applications?
Ans : The decision tree method is one of the simplest and yet most successful forms of learning algorithm.
Its emphasis is on inductive learning: given a collection of examples of f, return a function h that
approximates f, where an example is a pair (x, f(x)), x being the input and f(x) the output of the function
applied to x.
➢ h is the hypothesis. A good hypothesis will generalize well, i.e. will predict unseen examples correctly.
➢ A decision tree takes as input an object with a certain feature set and returns a decision, the predicted
output value. The output may be discrete or continuous.
➢ Learning a discrete-valued function is known as classification learning, whereas learning a continuous
function is termed regression.
➢ A decision tree reaches its decision by performing a sequence of tests.
➢ Each internal node tests the value of one of the properties, and the branches from the node are labeled
with the possible values of the test.
➢ Each leaf node specifies a return value.
➢ One application of decision tree learning is in designing an expert system based on a decision tree
architecture.
➢ Decision trees are fully expressive within the class of propositional languages.
➢ The various propositions (paths) are connected via the logical OR operator (∨).
Boolean Decision Trees: This technique uses a vector of input attributes X and a single Boolean
output Y.
E.g : a set of examples (X1, Y1), ……, (X6, Y6).
➢ Positive examples are those in which the goal is true.
➢ Negative examples are those in which the goal is false.
➢ The complete set is known as a TRAINING SET.
Example:
➢ Given this classifier, the analyst can predict the response of a potential customer (by sorting
it down the tree), and understand the behavioral characteristics of the entire potential
customers population regarding direct mailing.
➢ Each node is labeled with the attribute it tests, and its branches are labeled with
its corresponding values.
➢ For example, one of the paths in the figure below can be converted into the rule: "If the customer's
age is less than or equal to 30, and the customer is Male, then the customer will respond to
the mail."
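Sorting an example down such a tree amounts to a nested sequence of attribute tests; a sketch of the rule quoted above, with hypothetical attribute names and thresholds:

```python
# Decision-tree classification sketch for the direct-mail customer example.
# The tree structure (Age test at the root, Gender test below it) is a
# hypothetical reading of the rule in the text, not a learned model.
def will_respond(age, gender):
    if age <= 30:                 # root node tests the Age attribute
        return gender == "Male"   # branch for young customers tests Gender
    return False                  # leaf: older customers predicted not to respond

result = will_respond(25, "Male")
```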
Application Areas of Decision Tree Learning
1) Variable selection: The number of variables that are routinely monitored in clinical settings has
increased dramatically with the introduction of electronic data storage. Many of these variables are of
marginal relevance and, thus, should probably not be included in data mining exercises.
2) Handling of missing values: A common but incorrect method of handling missing data is to
exclude cases with missing values; this is both inefficient and risks introducing bias into the
analysis. Decision tree analysis can deal with missing data in two ways: it can either classify missing
values as a separate category that can be analyzed alongside the other categories, or it can build a
decision tree model that sets the variable with many missing values as the target variable, predicts the
missing entries, and replaces them with the predicted values.
3) Prediction: This is one of the most important usages of decision tree models. Using the tree model
derived from historical data, it’s easy to predict the result for future records.
4) Data manipulation: Too many categories of one categorical variable or heavily skewed continuous
data are common in medical research.
Ques 14 : Write Short Notes on the following : (A) Regression Trees
(B) Bayesian Parameter Learning.
Ans : Regression Trees: Regression trees are commonly used to solve problems where the target variable
is numerical/continuous instead of discrete. Regression trees possess the following properties:
a) Leaf nodes predict the average value of all instances reaching them.
b) Splitting criterion: minimize the variance of the values in each subset Si.
c) Standard Deviation Reduction: SDR(A, S) = SD(S) - Σi ( |Si| / |S| ) SD(Si).
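A sketch of the SDR computation for one candidate split; the target values and the split are made up for illustration:

```python
import math

# Standard-deviation-reduction sketch for a regression-tree split:
#   SDR(A, S) = SD(S) - sum_i (|S_i| / |S|) * SD(S_i)
def sd(values):
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

S = [10.0, 12.0, 30.0, 32.0]              # made-up target values
subsets = [[10.0, 12.0], [30.0, 32.0]]    # split induced by a candidate attribute A

sdr = sd(S) - sum(len(si) / len(S) * sd(si) for si in subsets)
```

A split that separates the low values from the high ones yields a large reduction, so it would be preferred by the splitting criterion.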
Bayesian Parameter Learning: This learning technique treats the parameters as random variables having
some prior distribution. An optimal classifier can be designed using the class-conditional densities p(x | ωi).
In a typical case we merely have some vague knowledge about the situation, together with a given number of
training samples. Observation of the samples converts the prior into a posterior density, and the estimates of
the true parameter values are revised. In Bayesian learning the posterior density function sharpens, causing it
to peak near the true values.
• We assume the priors P(ωi) are known, and the class-conditional densities are learned from the data D:
p(ωi | x, D) = p(x | ωi, D) P(ωi) / Σj p(x | ωj, D) P(ωj).
• Any information we have about the parameter vector θ prior to collecting samples is contained in the
prior density p(θ). Observation of the samples converts this to a posterior p(θ | D), which we hope is
peaked around the true value of θ.
UNIT – 5
Pattern Recognition:
• Introduction of Design principles of pattern recognition system
• Statistical Pattern recognition
• Parameter estimation methods
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA).
Classification Techniques
• Nearest Neighbor (NN) Rule
• Bayes Classifier
• Support Vector Machine (SVM)
• K – means clustering.
Short Questions & Answers
Ans. Pattern recognition is a branch of machine learning that focuses on the recognition of patterns and
regularities in data. It is the study of how machines can observe the environment intelligently, learn to
distinguish patterns of interest from their backgrounds, and make reasonable and correct decisions about the
different classes of objects. Patterns may be a fingerprint image, a handwritten cursive word, a human face,
the iris of a human eye, or a speech signal. These examples are called input stimuli. Recognition establishes a
close match between some new stimulus and previously stored stimulus patterns. Pattern recognition
systems are in many cases trained from labeled "training" data (supervised learning), but when no labeled
data are available, other algorithms can be used to discover previously unknown patterns (unsupervised
learning). At the most abstract level, patterns can also be ideas, concepts, thoughts, or procedures
activated in the human brain and body; this is studied in human psychology (cognitive science).
Example: In the automatic sorting of integrated-circuit amplifier packages, there can be three possible types:
metal-can, dual-in-line, and flat-pack. The unknown object should be classified as being one of these
types.
Ques 2. Define Measurement space and Feature space in classification process for objects.
Ans: Measurement space: This is the set of all pattern attributes, stored in vector form.
It is the range of characteristic attribute values. In vector form, measurement space is also called
observation space / data space. E.g : W = [ W1, W2, ……, Wn-1, Wn ] for n pattern classes,
where W is a pattern vector. Let X = [x1, x2]T be a pattern vector for a flower, where x1 is petal length
and x2 is petal width.
Feature Space: The range of a subset of attribute values is called the feature space F. This subset represents a
reduction of the attribute space, and the pattern classes are divided into subclasses. The feature space
contains the most important attributes of a pattern class observed in the measurement space.
Ques 3. What is dimensionality reduction problem?
Ans. In machine learning classification problems, there are often too many factors on the basis of which
the final classification is done. These factors are basically variables called features. The higher the
number of features, the harder it gets to visualize the training set and then work on it. Sometimes, most
of these features are correlated, and hence redundant. This is where dimensionality reduction algorithms
come into play. Dimensionality reduction is the process of reducing the number of random variables
under consideration by obtaining a set of principal variables. It can be divided into feature selection and
feature extraction. The methods used for dimensionality reduction include Principal Component Analysis (PCA)
and Linear Discriminant Analysis (LDA).
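As one example, Principal Component Analysis (listed in this unit's syllabus) can be sketched in two dimensions without any libraries; the data points below are illustrative:

```python
import math

# PCA sketch in 2-D: compute the 2x2 covariance matrix and take the
# eigenvector with the largest eigenvalue as the single principal component.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
# Sample covariance entries (divisor n - 1)
cxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
cyy = sum((y - my) ** 2 for _, y in data) / (n - 1)
cxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)

# Largest eigenvalue of [[cxx, cxy], [cxy, cyy]] via the quadratic formula
tr, det = cxx + cyy, cxx * cyy - cxy ** 2
lam = (tr + math.sqrt(tr ** 2 - 4 * det)) / 2
# Corresponding (unnormalised) eigenvector: the principal direction
v = (cxy, lam - cxx)
```

Projecting each centered point onto v would give the one-dimensional representation that preserves the most variance.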
Ques 7. How is K-Means different from KNN?
Ans.
S.No | K-Means Clustering | K-Nearest Neighbors (KNN)
2. | All the variables are treated as independent. | The variables may be dependent.
4. | The points in each cluster tend to be near each other. | Combines the classifications of the K nearest points.
Ans. Partitioning algorithms are clustering techniques that subdivide the data sets into a set of k groups,
where k is the number of groups pre-specified by the analyst. There are different types of partitioning
clustering methods. The most popular is the K-means clustering , in which, each cluster is represented by
the center or means of the data points belonging to the cluster. The K-means method is sensitive to
outliers.
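A minimal one-dimensional K-means sketch (k = 2) with made-up data, alternating between assigning points to the nearest center and recomputing each center as the mean of its cluster:

```python
# K-means sketch, k = 2, one dimension. Data and initial centers are made up;
# with this data both clusters stay non-empty on every iteration.
data = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centers = [0.0, 10.0]

for _ in range(10):
    # Assignment step: attach each point to its nearest center
    clusters = [[], []]
    for x in data:
        k = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
        clusters[k].append(x)
    # Update step: move each center to the mean of its cluster
    centers = [sum(c) / len(c) for c in clusters]
```

The centers settle on the means of the two natural groups. Because the mean is pulled by extreme values, a single far-away point would drag its center with it, which is the outlier sensitivity noted above.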
Long Questions & Answers
Step 1. Stimuli produced by the objects are perceived by sensory devices. Important attributes like
( shape , size , color , texture) produce the strongest inputs. Data collection involves identification
of attributes of objects and creating Measurement space.
Measurement space: This is the set of all pattern attributes, stored in vector form.
It is the range of characteristic attribute values. In vector form, measurement space is also called
observation space / data space. E.g : W = [ W1, W2, ……, Wn-1, Wn ] for n pattern classes,
where W is a pattern vector. Let X = [x1, x2]T be a pattern vector for a flower, where x1 is petal length
and x2 is petal width. Pattern classes can be W1 = Lilly, W2 = Rose, W3 = Sunflower.
Step 2. After this, features are selected and the feature space vector is designed. The range of a subset of
attribute values is called the feature space F. This subset represents a reduction of the attribute space, and the
pattern classes are divided into subclasses. The feature space contains the most important attributes of a pattern
class observed in the measurement space, and is shown smaller in size than the M-space.
Step 3. AI models based on probability theory, e.g. the Bayesian model and Hidden Markov Models, are
used for grouping or clustering the objects. The attributes selected are those which provide high intra-class
and low inter-class similarity.
Step 4. Using unsupervised (for feature extraction) or supervised (for classification) learning techniques,
training of classifiers is performed. When we present a pattern recognition system with a set of classified
patterns so that it can learn the characteristics of the set, we call it training.
Step 5. In the evaluation of the classifier, testing is performed: an unknown pattern is given to the PR
system so that it identifies the pattern's correct class. Using the selected attribute values, object/class
characterization models are learned by forming generalized prototype descriptors, classification rules, or
decision functions. The range of decision function values is known as the decision space D of r dimensions,
D = [d1, d2, d3, ……, dr]T. Recognition of familiar objects is achieved through the application of the rules
learned in step 4, by comparing and matching object features with the stored models. We also evaluate the
performance and efficiency of the classifier for further improvement.
Ques 11. What are the design principles of a Pattern Recognition System ? What are major steps
involved in this process?
Ans : Design principles of a Pattern Recognition System are as mentioned below:
i. Segmentation of large objects.
ii. Designing a PR system that is robust against variation in illumination and brightness in the
environment.
iii. Designing parameters that are invariant to translation, scaling, and rotation.
iv. Representing color and texture by histograms.
v. Designing brightness-based and feature-based PR systems.
The system comprises mainly five components, namely sensing, segmentation, feature extraction,
classification, and post-processing. Together these work as follows:
1. Sensing and Data Acquisition: The various properties that describe the object, such as its
entities and attributes, are captured using a sensing device.
2. Segmentation: Data objects are segmented into smaller segments in this step.
3. Feature Extraction and Classification: Features are extracted from the segmented data and used by
the classifier to assign each object to a class.
4. Post-Processing & Decision: Certain refinements and adjustments are made according to changes in the
features of the data objects being recognized; the decision is made once post-processing is
completed.
Need of Pattern Recognition System
A pattern recognition system is responsible for finding patterns and similarities in a given
problem/data space, which can then be used to solve complex problems effectively and
efficiently. Certain problems that humans can solve can thereby also be solved by machines. An example is
affective computing, which gives a computer the ability to recognize and express
emotions and to respond intelligently to the human emotions that contribute to rational decision making.
Ques12. Discuss about the four best approaches for a Pattern Recognition system. Also Discuss
some of the main application area with example of PR system.
Ans : Approaches of PR system are as mentioned below :
1). Template Matching 2). Statistical Approach 3). Syntactic Approach 4). ANN Approach.
TEMPLATE MATCHING: This approach to pattern recognition is based on finding the similarity
between two entities (points, curves, or shapes) of the same type. A 2-D shape or a prototype of the pattern to be
recognized is available. A template is a d x d mask or window; the pattern to be recognized is matched against
the stored templates in a knowledge base.
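A minimal one-dimensional template-matching sketch, scoring each offset by the sum of squared differences (the signal and template values are made up; lower score = better match):

```python
# Template matching sketch: slide a stored 1-D template over a signal and
# find the offset with the smallest sum of squared differences (SSD).
signal = [0, 0, 1, 3, 1, 0, 0]     # made-up input pattern
template = [1, 3, 1]               # stored prototype / mask

def ssd(offset):
    return sum((signal[offset + i] - t) ** 2 for i, t in enumerate(template))

# Try every valid placement of the template over the signal
best = min(range(len(signal) - len(template) + 1), key=ssd)
```

In 2-D the same idea applies with a d x d window slid over the image.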
STATISTICAL APPROACH: Each pattern is represented in terms of d features as a point in d-dimensional space.
The goal is to select those features that allow pattern vectors belonging to different categories to occupy
compact and disjoint regions. The separation of the pattern classes is then determined, and decision surfaces
and lines are drawn, determined by the probability distributions of the random variables for each pattern class.
Ques 13. Write short notes on :
(A) Decision theoretic classification (B) Optimum Statistical Classifier
Ans : (A) Decision theoretic classification: This is a statistical pattern recognition technique based on the
use of decision functions to classify objects. A decision function maps pattern vectors X into decision regions
of D, i.e. f : X → D. These functions are also termed Discriminant Functions.
➢ Given a set of objects O = { O1 , O2 , ..., On }, let each Oi have K observable attributes (the measurement space and relations are V = { V1 , V2 , ..., Vk }).
➢ Determine the following parameters :
a) A subset of m ≤ k of the Vi , X = [ X1 , X2 , ..., Xm ], whose values uniquely characterize Oi.
b) C ≥ 2 groupings of the Oi which exhibit high intra-class and low inter-class similarity, such that a decision function d(X) can be found which partitions D into C disjoint regions. These regions are used to classify each object Oi into some class.
➢ For W pattern classes, W = [ W1 , W2 , ..., Ww ], find W decision functions d1(x), d2(x), ..., dw(x) with the property that if a pattern X belongs to class Wi , then di(X) > dj(X), for j = 1, 2, ..., W ; j ≠ i.
➢ A linear decision function for a 2-D pattern vector takes the form of a line equation: d(X) = w1 x1 + w2 x2 + w3. If d(X) > 0, X is assigned to one class; if d(X) < 0, to the other; if d(X) = 0, the result is indeterminate (X lies on the boundary).
Fig (a): decision boundary between two classes (figure omitted).
Decision Boundary: di(x) − dj(x) = 0. The aim is to identify the decision boundary between two classes by a single function dij(x) = di(x) − dj(x) = 0.
When a line can be found that separates the classes into two or more clusters, we say the classes are Linearly Separable; otherwise they are called Non-Linearly Separable classes.
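A minimal sketch of the linear decision function described above (the weights here are hypothetical, chosen by hand so that the boundary x1 + x2 − 5 = 0 separates two toy classes):

```python
# Hypothetical weights, chosen by hand for illustration: the boundary
# x1 + x2 - 5 = 0 separates the two toy classes below.
W1, W2, W3 = 1.0, 1.0, -5.0

def d(x):
    """Linear decision function d(X) = w1*x1 + w2*x2 + w3."""
    return W1 * x[0] + W2 * x[1] + W3

def classify(x):
    value = d(x)
    if value > 0:
        return "class 1"
    if value < 0:
        return "class 2"
    return "indeterminate"   # the point lies on the decision boundary

print(classify((4, 4)))   # -> class 1        (d = 3 > 0)
print(classify((1, 1)))   # -> class 2        (d = -3 < 0)
print(classify((2, 3)))   # -> indeterminate  (d = 0)
```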
(B) Optimum Statistical Classifier: This is a pattern classification approach developed on a probabilistic basis, because of the randomness under which pattern classes are normally generated.
It is based on Bayesian theory and conditional probabilities: the probability that a particular pattern x comes from class Wi is denoted P(Wi | x). If a pattern classifier decides that x came from Wj when it actually came from Wi, it incurs a loss Lij.
The average loss incurred in assigning x to class Wj is given by the following equation:
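In the source this equation appears only as a page image; the standard decision-theoretic form of the average loss (the conditional risk), reconstructed here as an assumption, is:

```latex
r_j(\mathbf{x}) \;=\; \sum_{k=1}^{W} L_{kj}\, P(W_k \mid \mathbf{x})
\;=\; \frac{1}{p(\mathbf{x})} \sum_{k=1}^{W} L_{kj}\, p(\mathbf{x} \mid W_k)\, P(W_k)
```

The optimum statistical (Bayes) classifier assigns x to the class Wj for which the average loss rj(x) is smallest.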
Ques 14. Explain Maximum Likelihood technique under Parameter Estimation of Classification.
Ans : An estimation model consists of a number of parameters. So, in order to estimate the parameters of the model, the concept of Maximum Likelihood is used.
• Whenever the probability density function of a sample is unknown, it can be estimated by treating the parameters of the sample as quantities having unknown but fixed values.
• Suppose we want to estimate the height of the boys in a school. It would be time consuming to measure the height of all the boys. But if the heights are normally distributed with unknown mean and unknown variance, then by maximum likelihood estimation we can estimate the mean and variance by measuring the heights of only a small group of boys from the total sample.
Suppose we separate a collection of samples by class, giving C data sets D1 , D2 , ..., Dc, with the samples in Dj drawn independently according to the probability density p(x | Wj). Let this density have a known parametric form determined by the value of a parameter vector θj; e.g. p(x | Wj) ~ N(μj , Σj), where θj consists of these parameters. To show the dependence explicitly we write p(x | Wj , θj). The objective is to use the information provided by the training samples to achieve good estimates of the unknown parameter vectors θ1 , θ2 , ..., θc associated with each category. Assume the samples in Di give no information about θj if i ≠ j, i.e. the parameters of different classes are functionally independent. Let the set D have n samples X1 , X2 , ..., Xn.
∴ p(D | θ) = ∏_{k=1}^{n} p(Xk | θ).
p(D | θ) is the likelihood of θ with respect to the set of samples. The maximum likelihood estimate of θ is, by definition, the value θ̂ that maximizes p(D | θ).
Logarithmic Form : Since the logarithm turns the product into a simpler sum, the θ that maximizes the log-likelihood also maximizes the likelihood. If the number of parameters to be estimated is p, we let θ denote the p-component vector θ = (θ1 , θ2 , ..., θp)ᵗ.
Let ∇θ = [ ∂/∂θ1 , ∂/∂θ2 , ..., ∂/∂θp ]ᵗ be the gradient operator, and define l(θ) as the log-likelihood function.
Therefore, l(θ) = ln p(D | θ) ⇒ θ̂ = arg maxθ l(θ),
l(θ) = ∑_{k=1}^{n} ln p(Xk | θ) and ∇θ l = ∑_{k=1}^{n} ∇θ ln p(Xk | θ).
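The Gaussian case from the height example can be sketched as follows: setting ∇θ l = 0 for a normal density gives the familiar closed-form ML estimates (the "true" parameters below are illustrative numbers assumed for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative "true" parameters (heights in cm); assumed for the demo.
true_mu, true_var = 150.0, 25.0
samples = rng.normal(true_mu, np.sqrt(true_var), size=5000)

# Setting the gradient of l(theta) to zero for a Gaussian density
# yields the closed-form maximum likelihood estimates:
mu_hat = samples.mean()                        # (1/n) * sum_k X_k
var_hat = ((samples - mu_hat) ** 2).mean()     # (1/n) * sum_k (X_k - mu_hat)^2

print(mu_hat, var_hat)   # both close to the true values
```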
Ques 15. (A) Write down the steps for K-Nearest Neighbor estimation.
(B) Mention some of the advantages and disadvantages of the KNN technique.
Ans : (A) The steps of KNN classification are:
1. Calculate d(x, xi), i = 1, 2, ..., n, where d denotes the Euclidean distance between the points.
2. Arrange the calculated n Euclidean distances in non-decreasing order.
3. Let k be a positive integer; take the first k distances from this sorted list.
4. Find the k points corresponding to these k distances.
5. Let ki denote the number of points belonging to the ith class among the k points, i.e. ki ≥ 0.
6. If ki > kj ∀ i ≠ j, then put x in class i.
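The steps above can be sketched as a minimal pure-Python version (toy training data assumed for the demo):

```python
import math
from collections import Counter

def knn_classify(x, train, k):
    """train is a list of (point, label) pairs; x is the test pattern."""
    # Step 1: Euclidean distance to every training point.
    dists = [(math.dist(x, p), label) for p, label in train]
    # Steps 2-4: sort in non-decreasing order and keep the first k.
    nearest = sorted(dists)[:k]
    # Steps 5-6: count points per class among the k; majority class wins.
    counts = Counter(label for _, label in nearest)
    return counts.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((6, 6), "B"), ((7, 7), "B")]
print(knn_classify((2, 1), train, k=3))   # -> A
```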
( B ) Advantages of KNN :
1. Easy to understand
2. No assumptions about data
3. Can be applied to both classification and regression
4. Works easily on multi-class problems
Disadvantages are:
1. Memory intensive / computationally expensive
2. Sensitive to the scale of the data
3. Does not work well on rare-event (skewed) target variables
4. Struggles when the number of independent variables is high
Ques 16. Explain the criterion function used in clustering.
Ans. To measure the quality of any partition of a data set into clusters, a criterion function is used.
1. Consider a set B = { x1, x2, x3, ..., xn } containing n samples, partitioned into exactly t disjoint subsets B1, B2, ..., Bt.
2. The key property of these subsets is that each individual subset represents a cluster.
3. Samples inside a cluster are similar to each other and dissimilar to samples in other clusters.
4. To make this possible, criterion functions are chosen according to the situation at hand.
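As a concrete supplement (standard material, stated here in the notation above), the most widely used criterion is the sum-of-squared-error criterion:

```latex
J_e \;=\; \sum_{i=1}^{t} \sum_{x \in B_i} \lVert x - m_i \rVert^{2},
\qquad m_i \;=\; \frac{1}{|B_i|} \sum_{x \in B_i} x
```

A partition that minimizes Je has clusters whose samples lie close to their own cluster mean, which is exactly the property described in points 3 and 4.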
Ques 17. Cluster the following data set with the help of K-means clustering.
Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5
Centre points are : (1,1) and (5,7)
Ans.
This data set is to be grouped into two clusters. As a first step in finding a sensible initial partition, let the A & B values of the two individuals furthest apart (using the Euclidean distance measure) define the initial cluster means, giving:
          Individual   Mean Vector (centroid)
Group 1   1            (1.0, 1.0)
Group 2   4            (5.0, 7.0)
The remaining individuals are now examined in sequence and allocated to the cluster to which they are
closest, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each time a
new member is added. This leads to the following series of steps:
Step   Cluster 1 Individuals   Cluster 1 Mean (centroid)   Cluster 2 Individuals   Cluster 2 Mean (centroid)
1      1                       (1.0, 1.0)                  4                       (5.0, 7.0)
2      1, 2                    (1.2, 1.5)                  4                       (5.0, 7.0)
3      1, 2, 3                 (1.8, 2.3)                  4                       (5.0, 7.0)
4      1, 2, 3                 (1.8, 2.3)                  4, 5                    (4.2, 6.0)
5      1, 2, 3                 (1.8, 2.3)                  4, 5, 6                 (4.3, 5.7)
6      1, 2, 3                 (1.8, 2.3)                  4, 5, 6, 7              (4.1, 5.4)
Now the initial partition has changed, and the two clusters at this stage have the following characteristics:

            Individual    Mean Vector (centroid)
Cluster 1   1, 2, 3       (1.8, 2.3)
Cluster 2   4, 5, 6, 7    (4.1, 5.4)
But we cannot yet be sure that each individual has been assigned to the right cluster. So, we compare each individual's distance to its own cluster mean and to that of the opposite cluster. We find:
Individual   Distance to mean (centroid) of Cluster 1   Distance to mean (centroid) of Cluster 2
1            1.5                                        5.4
2            0.4                                        4.3
3            2.1                                        1.8
4            5.7                                        1.8
5            3.2                                        0.7
6            3.8                                        0.6
7            2.8                                        1.1
Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than its own (Cluster 1). In
other words, each individual's distance to its own cluster mean should be smaller than the distance to the
other cluster's mean (which is not the case with individual 3). Thus, individual 3 is relocated to Cluster 2
resulting in the new partition:
            Individual       Mean Vector (centroid)
Cluster 1   1, 2             (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7    (3.9, 5.1)
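The worked example can be reproduced with a short batch K-means (Lloyd's algorithm) sketch. Note the notes update the means sequentially, while this batch variant, with the initial tie on individual 3 broken toward Cluster 1, converges to the same final partition:

```python
import numpy as np

# The seven subjects (A, B) and the two initial centres from the question.
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
centroids = np.array([[1.0, 1.0], [5.0, 7.0]])   # individuals 1 and 4

for _ in range(20):                               # Lloyd iterations
    # Assign every point to its nearest centroid (Euclidean distance).
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dist.argmin(axis=1)                  # ties go to Cluster 1
    new = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    if np.allclose(new, centroids):               # converged
        break
    centroids = new

print(labels)     # -> [0 0 1 1 1 1 1]  i.e. {1, 2} and {3, 4, 5, 6, 7}
print(centroids)  # cluster means (1.25, 1.5) and (3.9, 5.1), as above (rounded)
```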
Ques 18. What is Feature Selection? Explain Principal Component Analysis (PCA) in detail.
Ans : A very common problem in statistical pattern recognition is Feature Selection, i.e. the process of transforming the Measurement Space into a Feature Space (the set of data that is of interest).
The transformation reduces the dimensionality of the data features. Let us have an m-dimensional vector X = [X1, X2, ..., Xm] and suppose we want to convert it to l dimensions (where l << m).
X = [ X1, X2, ..., Xm ]ᵗ; after reducing the dimensions, X = [ X1, X2, ..., Xl ]ᵗ.
This reduction causes mean square error. So we need to find whether there exists an invertible transform T such that truncation of Tx is optimal in terms of mean square error. T must have some components of low variance σ², where σ² = E[ (x − μ)² ]; E is the expectation operator, x is the random variable, and μ is the mean value, μ = (1/m) ∑_{k=1}^{m} Xk.
Definition of PCA: This is a mathematical procedure that uses an orthogonal transform to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables known as Principal Components. It preserves the most variance with reduced dimensions and minimum mean square error.
➢ The number of principal components is less than or equal to the number of original variables.
➢ The first principal component has the largest variance; the variance decreases for each successive component.
➢ The components are given by the leading eigenvectors of the covariance matrix of the vector X.
• The degree to which the variables are linearly correlated is represented by their covariance matrix S.
• The eigenvalues are the solutions of |S − λI| = 0.
• The eigenvalues λ1, λ2, ..., λp are the variances of the coordinates on each principal component axis. The coordinate of each object i on the kth principal axis, known as its score on PC k, is computed as:
zki = u1k x1i + u2k x2i + ... + upk xpi
Steps of PCA
• Let μ be the mean vector (the mean of all rows of X).
• Adjust the original data by the mean: φ = Xk − μ.
• Compute the covariance matrix C of the adjusted X.
• Find the eigenvectors and eigenvalues of C: for the matrix C, an eigenvector is a column vector e having the same direction as Ce, i.e. Ce = λe, where λ is called an eigenvalue of C. Ce = λe ⇒ (C − λI)e = 0.
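The steps above can be sketched in numpy (a minimal illustration with assumed toy data, not a production implementation):

```python
import numpy as np

def pca(X, l):
    """Project the n x m data matrix X onto its first l principal components."""
    mu = X.mean(axis=0)                     # mean vector
    phi = X - mu                            # mean-adjusted data
    C = np.cov(phi, rowvar=False)           # covariance matrix of adjusted X
    eigvals, eigvecs = np.linalg.eigh(C)    # eigen-decomposition (C symmetric)
    order = np.argsort(eigvals)[::-1]       # largest variance first
    return phi @ eigvecs[:, order[:l]], eigvals[order]

# Toy data: two strongly correlated variables, so a single principal
# component captures nearly all of the variance.
rng = np.random.default_rng(1)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2.0 * t + 0.05 * rng.normal(size=(200, 1))])
scores, variances = pca(X, l=1)
print(variances[0] / variances.sum())   # close to 1.0
```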
Applications of PCA in AI:
➢ Face recognition
➢ Image compression
➢ Gene expression analysis
Ques 19. Explain Linear Discriminant Analysis (LDA). How does it differ from PCA?
Ans : PCA finds components that are useful for representing the data, but its drawback is that it cannot discriminate between data from different classes. If we group all the samples together, the directions discarded by PCA might be exactly the directions needed for distinguishing between classes.
➢ PCA finds efficient directions for representation.
➢ LDA finds efficient directions for discrimination.
The objective of LDA is to perform dimensionality reduction while preserving as much of the class discrimination information as possible. In LDA, data is projected from d dimensions onto a line. Even if the samples form well-separated, compact clusters in d-space, projection onto an arbitrary line will usually produce poor recognition performance; by rotating the line, however, we can find an orientation for which the projected samples are well separated.
In order to find a good projection vector, we need to define a measure of separation between the
projections.
The solution proposed by Fisher is to maximize a function that represents the difference between the means, normalized by a measure of the within-class variability, the so-called scatter.
• For each class we define the scatter, an equivalent of the variance, as the sum of squared differences between the projected samples and their class mean.
The Fisher linear discriminant is defined as the linear function wᵀx that maximizes the criterion function J(w): the distance between the projected means, normalized by the within-class scatter of the projected samples.
In order to find the optimum projection w*, we need to express J(w) as an explicit function of w. We define measures of the scatter in the multivariate feature space x, denoted scatter matrices.
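A minimal sketch of the two-class Fisher discriminant, using the closed-form solution w* = Sw⁻¹(m1 − m2) that maximizes J(w) (the toy Gaussian data here is assumed for the demo):

```python
import numpy as np

def fisher_lda(X1, X2):
    """Closed-form Fisher direction w* = Sw^-1 (m1 - m2), which maximizes
    J(w): squared distance between projected means over within-class scatter."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: squared deviations about each class mean.
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(S1 + S2, m1 - m2)

# Toy two-class Gaussian data (assumed for the demo).
rng = np.random.default_rng(2)
X1 = rng.normal([0.0, 0.0], 0.5, size=(100, 2))
X2 = rng.normal([3.0, 3.0], 0.5, size=(100, 2))
w = fisher_lda(X1, X2)
p1, p2 = X1 @ w, X2 @ w        # 1-D projections of each class onto w
separated = p1.min() > p2.max() or p2.min() > p1.max()
print(separated)
```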
Ques 20 : Explain Support Vector Machines in detail. What are the advantages and disadvantages of SVM ?
Ans : Support Vector Machines: This is a linear machine for the case of separable patterns that may arise in the context of pattern classification. The idea is to construct a HYPERPLANE as a decision surface in such a way that the margin of separation between positive and negative examples is maximized.
A good example of such a system is classifying a set of new documents into positive or negative
sentiment groups, based on other documents which have already been classified as positive or
negative. Similarly, we could classify new emails into spam or non-spam, based on a large
corpus of documents that have already been marked as spam or non-spam by humans. SVMs are
highly applicable to such situations.
• SVM is an approximate implementation of Structural Risk Minimization.
• The error rate of a machine on test data is bounded by the sum of the training error rate and a term that depends on the Vapnik–Chervonenkis (VC) dimension.
• SVM sets the first term to zero and minimizes the second term. We use the SVM learning algorithm to construct the following three types of learning machines :
(i) Polynomial learning machine
(ii) Two-layer perceptron
(iii) Radial-basis-function network
The condition is : Test error rate ≤ Training error rate + f(N, h, p),
where N is the size of the training set, h is a measure of model complexity, and p is the probability that this bound fails.
If we consider an element of our p-dimensional feature space, i.e. x = (x1, ..., xp) ∈ Rᵖ, then we can mathematically define an affine hyperplane by the equation b0 + b1x1 + ... + bpxp = 0; b0 ≠ 0 gives an affine plane (i.e. one that does not pass through the origin). Using the summation sign, this is written more succinctly as b0 + ∑_{j=1}^{p} bj xj = 0. The line that maximizes the minimum margin is the better separator. The maximum-margin separator is determined by a subset of the data points; the data points in this subset are called Support Vectors. The support vectors decide which side of the separator a test case is on.
Consider a training set { ( 𝑿𝒊 , 𝒅𝒊 ,) } for i= 1 to n , where Xi is input pattern for ith example.
and di is the desired response (target output), with di = +1 for positive examples and di = −1 for negative examples; the two pattern classes are assumed linearly separable. The hyperplane decision surface is given by the equation:
WᵀX + b = 0 (a data point on the hyperplane satisfies this equation),
where W is the adjustable weight vector and b is the bias.
Therefore, WᵀXi + b ≥ 0 for di = +1 and WᵀXi + b < 0 for di = −1.
The distance from the optimal hyperplane to the closest data point is called the margin of separation, denoted by ρ. The objective of SVM is to maximize ρ to obtain the Optimal Hyperplane.
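As a supplement (the textbook primal formulation, not written out in the original notes), maximizing ρ for the separable case is equivalent to solving:

```latex
\min_{\mathbf{W},\, b}\ \tfrac{1}{2}\,\lVert \mathbf{W} \rVert^{2}
\quad \text{subject to} \quad
d_i \left( \mathbf{W}^{T}\mathbf{X}_i + b \right) \ \ge\ 1,
\qquad i = 1, \dots, n
```

since the margin of separation is ρ = 2 / ‖W‖, minimizing ‖W‖ maximizes ρ.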
Ques 21: What is the Nearest Neighbor rule of classification? Mention some of the metrics used in this method.
Ans : The nearest neighbor algorithm assigns to a test pattern the class label of its closest neighbor.
Let there be n training patterns (X1 , θ1), (X2 , θ2), ..., (Xn , θn), where Xi is of dimension d and θi is the class of the ith pattern. A test pattern P is assigned the class θk of the pattern Xk for which d(P, Xk) = min { d(P, Xi) }, i = 1 to n.
Error: The NN classifier error is at most twice the Bayes error when the number of training samples tends to infinity.
E(Bayes) ≤ E(NN) ≤ E(Bayes) [ 2 − (C / (C − 1)) E(Bayes) ], where C is the number of classes.
Distance metrics Used in Nearest Neighbor Classification: