CS 188 Fa24 HW4
Note: This is a typical exam-level question. On the exam, you would be under time pressure, and have to complete this
question on your own. We strongly encourage you to first try this on your own to help you understand where you currently
stand. Then feel free to have some discussion about the question with other students and/or staff, before independently
writing up your solution.
Note: Leave the self-assessment sections blank for the original submission of your homework. After the homework dead-
line passes, we will release the solutions. At that time, you will review the solutions, self-assess your initial response, and
complete the self-assessment sections below. The deadline for the self-assessment is 1 week after the original submission
deadline.
Your submission on Gradescope should be a PDF that matches this template. Each page of the PDF should align with the
corresponding page of the template (page 1 has name/collaborators; the question begins on page 2). Do not reorder, split,
combine, or add extra pages. The intention is that you print out the template, write on the pages in pen/pencil, and then
scan or take pictures of the pages to make your submission. You may also fill out this template digitally (e.g., using a
tablet).
First name
Last name
SID
Collaborators
CS 188, Fall 2024: Introduction to Artificial Intelligence
Written HW4
Q1. [15 pts] MDP
Pacman is in the following grid, starting at state 𝐻. He can take any one of four actions: Up, Down, Left, and Right. If he tries
to take an action that would move him toward a wall, he stays in place. For example, if he goes down from 𝐻, he will
remain at 𝐻. 𝐴, 𝐸, and 𝐹 are exit states. Once Pacman enters an exit state, he will exit immediately and receive the appropriate
reward shown in the diagram and listed below.
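Note: As a reference sketch only (not part of the answer), the backup this setup implies can be written as a few lines of value iteration. The neighbor map and exit rewards below are placeholder assumptions, since the actual layout and numbers are in the diagram; the sketch also assumes the convention that the exit reward is collected by the exit action, so a neighbor of an exit state sees that reward discounted once.

```python
# Minimal value-iteration sketch for a gridworld like the one above.
# PLACEHOLDERS: the neighbor map and exit rewards are assumptions, not the
# actual values from the diagram.

GAMMA = 1.0  # discount factor; parts (a)(i)-(iii) vary this

# Hypothetical deterministic moves: moves[state][action] -> next state.
# Moving into a wall leaves Pacman in place (e.g. Down from H).
moves = {
    "H": {"Up": "E", "Down": "H", "Left": "A", "Right": "F"},  # placeholder
    # ... remaining non-exit states would be filled in from the diagram
}
exit_rewards = {"A": 10.0, "E": 1.0, "F": 8.0}  # placeholder rewards


def value_iteration(moves, exit_rewards, gamma, iters=100):
    """Repeated Bellman backups; exit states keep their one-time reward."""
    states = set(moves) | set(exit_rewards)
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        # Non-exit states: V(s) = max over actions of gamma * V(next state).
        V = {
            s: exit_rewards[s] if s in exit_rewards
            else max(gamma * V[s2] for s2 in moves[s].values())
            for s in states
        }
    return V


print(value_iteration(moves, exit_rewards, GAMMA))  # rerun with each gamma
```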
(a) Fill in the optimal action(s) for 𝐻 for the given discount factors.
(i) [1 pt]
𝛾 = 1 □ Up □ Left □ Right
(ii) [1 pt]
𝛾 = 1/5 □ Up □ Left □ Right
(iii) [1 pt]
𝛾 = 1/10 □ Up □ Left □ Right
(b) Assume for this part only that the transition function for the MDP is deterministic for all transitions except the following:
𝑇(𝐻, Right, 𝐼) = 𝑝    𝑇(𝐻, Right, 𝐸) = 1 − 𝑝
𝑇(𝐻, Left, 𝐸) = 𝑝    𝑇(𝐻, Left, 𝐺) = 1 − 𝑝
For each given discount factor 𝛾, which values of 𝑝 would ensure that Pacman strictly prefers taking action Right from 𝐻?
(i) [2 pts]
𝛾 = 1 □ 0 □ 0.1 □ 0.5 □ 0.9 □ 1 □ None of the above
(ii) [2 pts]
𝛾 = 1/10 □ 0 □ 0.1 □ 0.5 □ 0.9 □ 1 □ None of the above
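Note: For reference, the strict preference in this part reduces to comparing two Q-values via the Bellman optimality equation; all reward and value terms below come from the diagram, and nothing new is assumed:

𝑄∗(𝐻, Right) = 𝑝 [𝑅(𝐻, Right, 𝐼) + 𝛾𝑉∗(𝐼)] + (1 − 𝑝) [𝑅(𝐻, Right, 𝐸) + 𝛾𝑉∗(𝐸)]
𝑄∗(𝐻, Left) = 𝑝 [𝑅(𝐻, Left, 𝐸) + 𝛾𝑉∗(𝐸)] + (1 − 𝑝) [𝑅(𝐻, Left, 𝐺) + 𝛾𝑉∗(𝐺)]

Right is strictly preferred from 𝐻 exactly when 𝑄∗(𝐻, Right) > 𝑄∗(𝐻, Left).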
Q1(a-b) Self-Assessment - leave this section blank for your original submission. We will release the solutions to this
problem after the deadline for this assignment has passed. After reviewing the solutions for this problem, assess your initial
response by checking one of the following options:
○ I fully solved the problem correctly, including fully correct logic and sufficient work (if applicable).
○ I got part or all of the question incorrect.
If you selected the second option, explain the mistake(s) you made and why your initial reasoning was incorrect (do not
reiterate the solution; instead, reflect on the errors in your original submission). Approximately 2-3 sentences for each
incorrect sub-question.
(c) Consider the following new grid:
Here, 𝐷, 𝐸 and 𝐹 are exit states. Pacman starts in state 𝐴. The reward for entering each state is reflected in the grid.
Assume that discount factor 𝛾 = 1.
(i) [3 pts] Write the optimal values 𝑉∗(𝑠) for 𝑠 = 𝐴 and 𝑠 = 𝐶, and the optimal policy 𝜋∗(𝑠) for 𝑠 = 𝐴.
𝑉∗(𝐴) =
𝑉∗(𝐶) =
𝜋∗(𝐴) = ○ Up ○ Down ○ Left ○ Right
(ii) [2 pts] Now, instead of Pacman, Pacbaby is travelling in this grid. Pacbaby has a more limited set of actions than
Pacman and can never go left. Hence, Pacbaby has to choose between actions: Up, Down and Right.
Pacman is rational, but Pacbaby is indecisive. If Pacbaby enters state 𝐶, Pacbaby finds the two best actions and
randomly, with equal probability, chooses between the two. Let 𝜋∗(𝑠) represent the optimal policy for Pacman. Let
𝑉(𝑠) be the values under the policy where Pacbaby acts according to 𝜋∗(𝑠) for all 𝑠 ≠ 𝐶 and follows the indecisive
policy when at state 𝐶. What are the values 𝑉(𝑠) for 𝑠 = 𝐴 and 𝑠 = 𝐶?
𝑉(𝐴) =
𝑉(𝐶) =
(iii) [3 pts] Now Pacman knows that Pacbaby is going to be indecisive at state 𝐶, and he decides to recompute the
optimal policy for Pacbaby at all other states, anticipating this indecisiveness at 𝐶. What are Pacbaby's new policy
𝜋(𝑠) and new value 𝑉(𝑠) for 𝑠 = 𝐴?
𝑉(𝐴) =
𝜋(𝐴) = ○ Up ○ Down ○ Left ○ Right
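Note: For reference, the "indecisive" backup that (ii) and (iii) describe replaces the usual max at 𝐶 with the average of the two best Q-values. Below is a minimal sketch under the same placeholder-input caveats as before (the real states, moves, and entering rewards come from the grid); part (ii) is the same computation with every state other than 𝐶 pinned to Pacman's 𝜋∗ rather than re-maximized.

```python
def anticipating_values(moves, enter_rewards, exit_states,
                        gamma=1.0, indecisive_state="C", iters=100):
    """Value iteration that anticipates indecisiveness at one state.

    Every state backs up with a max over Q-values, except the indecisive
    state, which averages its two best Q-values (a fair coin flip between
    the two best actions). Rewards follow part (c)'s convention: collected
    on entering a state, so Q(s, a) = R(s') + gamma * V(s').
    NOTE: `moves` should already exclude Left, since Pacbaby cannot go left.
    """
    states = set(moves) | set(exit_states)
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        new_V = {}
        for s in states:
            if s in exit_states:
                new_V[s] = 0.0  # the exit reward was collected on entry
                continue
            # One Q-value per available action at s.
            qs = [enter_rewards.get(s2, 0.0) + gamma * V[s2]
                  for s2 in moves[s].values()]
            if s == indecisive_state:
                best_two = sorted(qs, reverse=True)[:2]
                new_V[s] = sum(best_two) / 2  # expectation of the coin flip
            else:
                new_V[s] = max(qs)
        V = new_V
    return V
```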
Q1(c) Self-Assessment - leave this section blank for your original submission. We will release the solutions to this problem
after the deadline for this assignment has passed. After reviewing the solutions for this problem, assess your initial response
by checking one of the following options:
○ I fully solved the problem correctly, including fully correct logic and sufficient work (if applicable).
○ I got part or all of the question incorrect.
If you selected the second option, explain the mistake(s) you made and why your initial reasoning was incorrect (do not
reiterate the solution; instead, reflect on the errors in your original submission). Approximately 2-3 sentences for each
incorrect sub-question.