We get smarter by learning utilities without a world
model, and we get active by exploring to learn policies
directly.
Background
We conclude our study of learning for Markov decision processes.
First, our passive agent forgoes a world transition model, directly
estimating utilities under a fixed policy. Then our most general
agent forgoes both the model and the utilities, learning a policy
directly through exploration.
Preparation
Change to the directory containing example implementations of several
algorithms from Chapters 17 and 21.
$ cd ~weinman/courses/CSC261/code/rlc
Exercises
A: Temporal-difference learning
The program td contains a setup for running an agent that
directly learns utilities (using no world transition model) in a sequential
environment for a specified number of iterations, implementing the
PASSIVE-TD-AGENT of AIMA Figure 21.4 (p. 837).
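For intuition, the heart of this agent is a single backup applied to each observed transition. The following is a minimal Python sketch of that update, assuming dictionary tables U and Ns and a step-size function alpha of the visit count; these names are illustrative, not td's actual internals.

def td_update(U, Ns, s, r, s_next, gamma, alpha):
    # One temporal-difference backup: nudge U(s) toward the sampled
    # target r + gamma * U(s'), with a step size that shrinks as s
    # is visited more often.
    Ns[s] = Ns.get(s, 0) + 1
    step = alpha(Ns[s])
    U[s] = U.get(s, 0.0) + step * (r + gamma * U.get(s_next, 0.0) - U.get(s, 0.0))

Any schedule that decays appropriately with the visit count (e.g., alpha = lambda n: 1.0 / n) makes the estimates converge to the true utilities in the limit.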
Test the passive, model-free TD-learner using the (presumably optimal)
policy learned by policy iteration. For example,
$ ./td gamma 4x3.mdp trials < 4x3.policy
How large must trials be to make your utilities match
those of AIMA Figure 21.1(b), p. 832, to at least two decimals?
B: Q-Learning
The program qlearn contains a setup for running an agent
that directly learns a policy (using no utilities or transition model)
in a sequential environment for a specified number of iterations,
as done by the Q-LEARNING-AGENT of AIMA Figure 21.8 (p. 844).
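The corresponding Q-learning backup replaces the fixed policy's successor utility with a maximization over actions. Here is a minimal Python sketch, assuming dictionary tables Q and Nsa and a decaying step-size function alpha; again, the names are illustrative rather than qlearn's actual internals.

def q_update(Q, Nsa, s, a, r, s_next, actions, gamma, alpha):
    # One Q-learning backup: nudge Q(s,a) toward the sampled target
    # r + gamma * max_a' Q(s',a'), so no transition model is needed.
    Nsa[(s, a)] = Nsa.get((s, a), 0) + 1
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    step = alpha(Nsa[(s, a)])
    Q[(s, a)] = Q.get((s, a), 0.0) + step * (r + gamma * best_next - Q.get((s, a), 0.0))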
Test the active, model-free Q-function learner on the 4×3
grid world,
$ ./qlearn gamma reward attempts 4x3.mdp trials
where reward represents the optimistic best reward (R+) and attempts
the visit-count threshold (Ne) of the exploration function (p. 842)

f(u, n) = { R+   if n < Ne
          { u    otherwise
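In code the exploration function is a one-liner; a sketch with illustrative parameter names r_plus and n_e:

def explore_value(u, n, r_plus, n_e):
    # Treat any state-action pair tried fewer than Ne times as if it
    # were worth the optimistic reward R+; otherwise trust the estimate u.
    return r_plus if n < n_e else u

The agent then chooses the action a maximizing explore_value(Q[(s, a)], Nsa[(s, a)], r_plus, n_e), which drives it to try under-explored actions before settling on apparently good ones.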
What values for reward (R+) and attempts (Ne) are required to find the optimal policy at a given number
of trials? Why?
Can you adjust the values of reward (R+) and/or attempts (Ne)
as the number of trials increases?
For those with Extra Time
Complete Problems 5 and 6 in the Markov Decision Processes lab if
you have not already. Once you have completed them, you may continue
with the remainder of this lab.
C: Temporal-Difference Learning
Predict
Before running the TD-learner on the larger
16×4 world, record how many trials you expect your agent
to require to estimate state utilities accurately. Why?
Experiment
Run td on the 16×4 grid world using
the (presumably optimal) policy learned by policy iteration and the
same discount factor and error tolerance you used for the 4×3
grid world. How many trials does it take for the learned utilities
to match those from your prior lab as closely as possible?
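Rather than eyeballing the numbers, a short script can report the worst per-state deviation. A possible sketch, assuming you have already loaded both runs' utilities into dictionaries keyed by state (the loading itself depends on your prior lab's output format):

def max_abs_diff(learned, reference):
    # Largest per-state gap between the TD estimates and the
    # policy-iteration utilities; small values mean a close match.
    return max(abs(learned[s] - reference[s]) for s in reference)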
Reflect
Compare your predictions to your experimental outcome.
Speculate about the cause of any discrepancies.
D: Q-Learning
Predict
What values of reward (R+), attempts
(Ne), and trials do you expect your agent to require to learn
the best possible policy in the larger 16×4 world? Why?
Experiment
Run qlearn on the 16×4 grid
world using the same discount factor you used for the 4×3
grid world. What settings are required for your agent to learn the
best policy?
Reflect
Compare your predictions to your experimental outcome.
Speculate about the cause of any discrepancies.