Lab: Active Reinforcement Learning

CSC261 - Artificial Intelligence - Weinman



Summary:
We get smarter by learning utilities without a world model, and we get active by exploring the environment to learn a policy directly.

Background

We conclude our study of learning for Markov decision processes by letting our passive agent forgo a world transition model, estimating utilities directly under a fixed policy. Finally, our most general agent will learn a policy directly through exploration, using neither utilities nor a transition model.

Preparation

Change to the directory containing example implementations of several algorithms from Chapters 17 and 21.
cd ~weinman/courses/CSC261/code/rlc

Exercises

A: Temporal-Difference Learning

The program td contains a setup for running an agent that directly learns utilities (using no world transition model) in a sequential environment for a specified number of iterations by implementing PASSIVE-TD-AGENT of AIMA Figure 21.4 (p. 837).
  1. Test the passive, model-free TD-learner using the (presumably optimal) policy learned by policy iteration. For example,
    ./td gamma 4x3.mdp trials < 4x3.policy
  2. How large must trials be to make your utilities match those of AIMA Figure 21.1(b), p. 832, to at least two decimal places?
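
For reference, here is a minimal Python sketch of the temporal-difference update at the heart of such an agent. The function name, dictionary representation, and step-size schedule are illustrative assumptions, not necessarily how td implements it.

    def td_update(U, N, s, r, s_next, gamma):
        """Nudge U[s] toward the observed sample r + gamma * U[s']."""
        N[s] = N.get(s, 0) + 1         # visits to s so far
        alpha = 60.0 / (59.0 + N[s])   # decaying step size, as in AIMA's example
        u_s = U.get(s, 0.0)
        U[s] = u_s + alpha * (r + gamma * U.get(s_next, 0.0) - u_s)

Note that each observed transition updates only the state just left, using the sampled successor in place of a transition model; this is why many trials may be needed before the estimates settle.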

B: Q-Learning

The program qlearn contains a setup for running an agent that directly learns a policy (using no utilities or transition model) in a sequential environment for a specified number of iterations, as done by the Q-LEARNING-AGENT of AIMA Figure 21.8 (p. 844).
  1. Test the active, model-free Q-function learner on the 4×3 grid world,
    ./qlearn gamma reward attempts 4x3.mdp trials
    where reward represents the optimistic best reward (R+) of the exploration function (p. 842)

      f(u, n) = R+  if n < Ne,
                u   otherwise,

    and attempts represents Ne. A sketch of this function and the accompanying Q-update appears after this list.
  2. What values for reward and attempts (Ne) are required to find the optimal policy at a given number of trials? Why?
  3. Can you adjust the values of reward (R+) and/or attempts (Ne) as the number of trials increases?
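
For reference, here is a minimal Python sketch of the exploration function and the Q-update such an agent combines. The names and tabular dictionary representation are illustrative assumptions, not necessarily qlearn's internals.

    def explore_f(u, n, r_plus, n_e):
        # Optimistic exploration (p. 842): treat any state-action pair
        # attempted fewer than Ne times as if it were worth R+.
        return r_plus if n < n_e else u

    def q_update(Q, Nsa, s, a, r, s_next, actions, gamma, alpha):
        # One Q-learning step (Fig. 21.8): move Q(s,a) toward the
        # sample r + gamma * max_a' Q(s',a').
        Nsa[(s, a)] = Nsa.get((s, a), 0) + 1
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
        q_sa = Q.get((s, a), 0.0)
        Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)

    def choose_action(Q, Nsa, s, actions, r_plus, n_e):
        # Act greedily with respect to f, so undersampled actions stay
        # attractive until they have been attempted Ne times.
        return max(actions,
                   key=lambda a: explore_f(Q.get((s, a), 0.0),
                                           Nsa.get((s, a), 0),
                                           r_plus, n_e))

Because f returns R+ until an action has been attempted Ne times, larger values of reward and attempts force more exploration before the learned Q-values begin to drive behavior.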

For those with Extra Time

Complete Problems 5 and 6 in the Markov Decision Processes lab if you have not already. Once you have completed them, you may continue with the remainder of this lab.

C: Temporal-Difference Learning

Predict
Before running the TD-learner on the larger 16×4 world, record how many trials you expect your agent to require to estimate state utilities accurately. Why?
Experiment
Run td on the 16×4 grid world using the (presumably optimal) policy learned by policy iteration and the same discount factor and error tolerance you used for the 4×3 grid world. How many trials does it take for the learned utilities to match those from your prior lab as closely as possible?
Reflect
Compare your predictions to your experimental outcome. Speculate about the cause of any discrepancies.

D: Q-Learning

Predict
What values of reward (R+), attempts (Ne), and trials do you expect your agent to require to learn the best possible policy in the larger 16×4 world? Why?
Experiment
Run qlearn on the 16×4 grid world using the same discount factor you used for the 4×3 grid world. What settings are required for your agent to learn the best policy?
Reflect
Compare your predictions to your experimental outcome. Speculate about the cause of any discrepancies.
Copyright © 2011, 2013, 2015 Jerod Weinman.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License.