https://www.coursera.org/specializations/reinforcement-learning
textbook: Reinforcement Learning: An Introduction - Richard Sutton and Andrew Barto
******** Fundamentals of Reinforcement Learning **** week1
multi-armed bandit: action-value estimates, epsilon-greedy exploration
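A minimal sketch of the bandit setting with epsilon-greedy selection and incremental sample-average updates (my own illustration, not course code):

import numpy as np

def run_bandit(true_means, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy action selection with incremental sample-average value estimates."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    q = np.zeros(k)      # estimated action values
    n = np.zeros(k)      # action counts
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))         # explore
        else:
            a = int(np.argmax(q))            # exploit
        r = rng.normal(true_means[a], 1.0)   # noisy reward
        n[a] += 1
        q[a] += (r - q[a]) / n[a]            # incremental mean: Q += (R - Q) / N
        total_reward += r
    return q, total_reward

# example: 5-armed bandit
q_est, g = run_bandit([0.1, 0.5, 1.2, 0.3, 0.9])

The same incremental update pattern (new estimate = old estimate + step size * error) reappears throughout the rest of the courses with a constant step size.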
(state added) => MDP (Markov Decision Process): state-value fun., action-value fun., policy
reward hypothesis (Michael Littman)
episode, discount rate
state-value Bellman eq., action-value Bellman eq.; state-value optimality eq., action-value optimality eq.
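For reference, the four equations in the textbook's notation (p(s',r|s,a) is the environment dynamics, \gamma the discount rate):

v_\pi(s)   = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\,[\,r + \gamma\, v_\pi(s')\,]
q_\pi(s,a) = \sum_{s',r} p(s',r|s,a)\,[\,r + \gamma \sum_{a'} \pi(a'|s')\, q_\pi(s',a')\,]
v_*(s)     = \max_a \sum_{s',r} p(s',r|s,a)\,[\,r + \gamma\, v_*(s')\,]
q_*(s,a)   = \sum_{s',r} p(s',r|s,a)\,[\,r + \gamma \max_{a'} q_*(s',a')\,]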
iterative policy evaluation by the state-value Bellman eq.; policy improvement by greedification; policy iteration = policy evaluation + policy improvement (these iterations are dynamic programming)
generalized policy iteration (GPI)
(brute-force policy evaluation is practically impossible in most cases) => iterative policy evaluation with dynamic programming, which requires a model of the environment (the dynamics p(s',r|s,a))
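A minimal sketch of iterative policy evaluation plus greedification, assuming the model is given as arrays P[s,a,s'] and R[s,a] (hypothetical names); alternating the two functions is policy iteration:

import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation: sweep all states until the value change is below theta.
    P[s, a, s2] = transition probability, R[s, a] = expected reward, pi[s, a] = policy."""
    n_states = P.shape[0]
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v              # Bellman expectation backup, q[s, a]
        v_new = np.sum(pi * q, axis=1)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

def greedy_policy(P, R, v, gamma=0.9):
    """Policy improvement: act greedily with respect to the current value estimate."""
    q = R + gamma * P @ v
    pi = np.zeros_like(q)
    pi[np.arange(q.shape[0]), np.argmax(q, axis=1)] = 1.0
    return pi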
Warren Powell’s example application
******** Sample-based Learning Methods **** week1
generalized policy iteration using Monte Carlo methods: sample episodes, then average the returns
epsilon-soft policies; off-policy learning: behavior policy vs. target policy => requires importance sampling
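A minimal sketch of every-visit Monte Carlo prediction (sample episodes, then average returns); the env object with reset()/step() is a hypothetical gym-style interface, not the course environment:

from collections import defaultdict

def mc_prediction(env, policy, n_episodes=1000, gamma=0.99):
    """Every-visit Monte Carlo: estimate v_pi(s) by averaging sampled returns."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    v = defaultdict(float)
    for _ in range(n_episodes):
        # generate one episode under the target policy
        episode = []
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            episode.append((s, r))
            s = s_next
        # walk backwards, accumulating the discounted return G
        g = 0.0
        for s, r in reversed(episode):
            g = r + gamma * g
            returns_sum[s] += g
            returns_cnt[s] += 1
            v[s] = returns_sum[s] / returns_cnt[s]
    return v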
batch RL (Emma Brunskill)
temporal difference (TD) learning: update from each transition (s, a, r, s’), not from complete episodes; that is, it requires only experience (not a model of the environment, unlike dynamic programming)
Richard Sutton: prediction learning is the natural way we learn, and TD learning is a proper formalization of prediction learning
“Comparing TD and Monte Carlo” in Week 3 of Sample-based Learning Methods => also shows the effect of the step size
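A minimal TD(0) prediction sketch for comparison with the Monte Carlo version above (same hypothetical env interface; alpha is the step size discussed in that lecture):

from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=0.99):
    """TD(0): update v(s) from each (s, a, r, s') transition; no complete episode needed."""
    v = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * v[s_next])
            v[s] += alpha * (target - v[s])     # TD error times step size
            s = s_next
    return v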
**** week4
Sarsa: TD algorithm from action-value Bellman eq., policy iteration, on-policy
Q-learning: TD algorithm from the action-value Bellman optimality eq., value iteration, off-policy without importance sampling
Q-learning is faster than Sarsa; Sarsa is more reliable than Q-learning
Expected Sarsa: lower variance than Sarsa, needs more computation, off-policy without importance sampling
greedy Expected Sarsa = Expected Sarsa with a greedy target policy = Q-learning
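The three algorithms differ only in how the TD target bootstraps from the next state; a sketch assuming q is a dict of per-state numpy arrays and pi_probs(s) returns the target policy's action probabilities (hypothetical helpers):

import numpy as np

def td_target(q, r, s_next, done, gamma, method, pi_probs=None, a_next=None):
    """Bootstrapped targets for Sarsa, Q-learning, and Expected Sarsa."""
    if done:
        return r
    if method == "sarsa":            # on-policy: use the action actually taken next
        return r + gamma * q[s_next][a_next]
    if method == "q_learning":       # off-policy: bootstrap from the greedy action
        return r + gamma * np.max(q[s_next])
    if method == "expected_sarsa":   # expectation under the target policy
        return r + gamma * np.dot(pi_probs(s_next), q[s_next])
    raise ValueError(method)

# update in every case: q[s][a] += alpha * (td_target(...) - q[s][a])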
**** week5
Dyna-Q: model-based learning; its planning updates resemble off-policy learning with experience replay
Dyna-Q+: copes with an inaccurate or changing model by adding an exploration bonus for long-untried state-action pairs
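A sketch of the Dyna-Q loop: after every real step, do a direct Q-learning update, update a table model, then replay n simulated transitions from the model (assuming a tabular, deterministic environment and q as a dict of action-value arrays):

import random
import numpy as np

def dyna_q_step(q, model, s, a, r, s_next, alpha=0.1, gamma=0.95, n_planning=10):
    """One Dyna-Q step: direct RL update, model update, then n planning updates."""
    # direct RL (Q-learning) update from the real transition
    q[s][a] += alpha * (r + gamma * np.max(q[s_next]) - q[s][a])
    # model learning: remember the observed outcome (deterministic model)
    model[(s, a)] = (r, s_next)
    # planning: replay previously seen (s, a) pairs from the model
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        q[ps][pa] += alpha * (pr + gamma * np.max(q[ps_next]) - q[ps][pa])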
model-based learning (Drew Bagnell) with quadratic value function approximation for continuous states and actions
******** Prediction and Control with Function Approximation
**** week2
coarse coding, tile coding
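A simplified 1-D tile coder, just to show the idea of several offset tilings producing a sparse binary feature vector (my own toy version, not the tile-coding software used in the assignments):

import numpy as np

def tile_code(x, x_min, x_max, n_tilings=8, tiles_per_tiling=10):
    """Encode a scalar x as a binary feature vector using slightly offset 1-D tilings."""
    features = np.zeros(n_tilings * tiles_per_tiling)
    tile_width = (x_max - x_min) / (tiles_per_tiling - 1)
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings          # each tiling is shifted a little
        idx = int((x - x_min + offset) / tile_width)
        idx = min(max(idx, 0), tiles_per_tiling - 1)
        features[t * tiles_per_tiling + idx] = 1.0
    return features

# linear value estimate with weights w: v_hat = w @ tile_code(x, 0.0, 1.0)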
function approximation with NN
3 functions in RL that can be approximated: policy, value functions (state-value, action-value), model
**** week3
average reward; differential returns and differential value functions
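In the average-reward setting, the differential return and the differential TD error (with \bar{R}_t the running estimate of the average reward) replace the discounted ones:

G_t = (R_{t+1} - r(\pi)) + (R_{t+2} - r(\pi)) + (R_{t+3} - r(\pi)) + \cdots
\delta_t = R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})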
reward design: intrinsic reward (Satinder Singh)
**** week4
softmax policy (compared to epsilon-greedy)
policy gradient learning: learn the policy directly (but the policy-gradient theorem still involves the action-value function, which must be estimated)
approximating that action value with a learned state-value function gives the Critic => Actor-Critic; the Critic learns by semi-gradient TD, the Actor learns using the TD error from the Critic (see the sketch below)
Gaussian policy for continuous actions
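A one-step actor-critic sketch with linear function approximation and a softmax actor over discrete actions (my own simplification of the scheme; x(s) is an assumed feature function, theta has shape (n_actions, n_features)):

import numpy as np

def softmax(prefs):
    z = prefs - np.max(prefs)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def actor_critic_step(theta, w, x, s, a, r, s_next, done,
                      alpha_theta=0.01, alpha_w=0.1, gamma=0.99):
    """One-step actor-critic: critic v_hat = w . x(s); actor pi = softmax(theta @ x(s))."""
    # critic: semi-gradient TD(0)
    v_s = w @ x(s)
    v_next = 0.0 if done else w @ x(s_next)
    delta = r + gamma * v_next - v_s          # TD error from the critic
    w += alpha_w * delta * x(s)
    # actor: policy-gradient update driven by the TD error
    pi = softmax(theta @ x(s))
    grad_log = -np.outer(pi, x(s))            # d log pi(a|s) / d theta, all rows
    grad_log[a] += x(s)                       # extra term for the action taken
    theta += alpha_theta * delta * grad_log
    return theta, w, delta

For continuous actions (previous line), the actor would instead parameterize the mean and standard deviation of a Gaussian and sample the action from it.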
******** A Complete Reinforcement Learning System (Capstone)
**** week2
The Hedonistic Neuron (Harry Klopf); eligibility traces: contingent (Actor) or non-contingent (Critic)
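For reference, the textbook's accumulating trace for semi-gradient TD(\lambda): the trace decays past gradients, and the TD error assigns credit back through it:

\mathbf{z}_t = \gamma \lambda\, \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t), \qquad \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t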
Agnostic System Identification for Model-Based Reinforcement Learning 2012 (Drew Bagnell)
mobile health (Susan Murphy)
Adam optimizer = momentum + per-weight (vector) adaptive step sizes
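The standard Adam update for a weight w with gradient g_t, showing the momentum (first-moment) and adaptive-step-size (second-moment) pieces:

m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2
\hat{m}_t = m_t / (1-\beta_1^t), \qquad \hat{v}_t = v_t / (1-\beta_2^t)
w_t = w_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)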
data efficiency needed => experience replay with replay buffer
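A minimal replay buffer sketch: store transitions, sample uniform mini-batches for updates (hypothetical helper, not the capstone's agent code):

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions with uniform sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))               # tuples of states, actions, rewards, ...

    def __len__(self):
        return len(self.buffer)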
reproducibility crisis (Joelle Pineau)