https://www.coursera.org/specializations/reinforcement-learning
textbook: Reinforcement Learning: An Introduction - Richard Sutton and Andrew Barto
******** Fundamentals of Reinforcement Learning **** week1
multi-armed bandit: action-value estimates, epsilon-greedy exploration
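A minimal sketch of the bandit setting with epsilon-greedy selection and incremental sample-average updates (my own illustration, not course code):

import numpy as np

def run_bandit(true_means, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy action selection with incremental sample-average value estimates."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    q = np.zeros(k)      # estimated action values
    n = np.zeros(k)      # action counts
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))         # explore
        else:
            a = int(np.argmax(q))            # exploit
        r = rng.normal(true_means[a], 1.0)   # noisy reward
        n[a] += 1
        q[a] += (r - q[a]) / n[a]            # incremental mean: Q += (R - Q) / N
        total_reward += r
    return q, total_reward

# example: 5-armed bandit
q_est, g = run_bandit([0.1, 0.5, 1.2, 0.3, 0.9])

The same incremental update pattern (new estimate = old estimate + step size * error) reappears throughout the rest of the courses with a constant step size.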
(state added) => MDP (Markov Decision Process): state-value fun., action-value fun., policy
reward hypothesis (Michael Littman)
episode, discount rate
state-value Bellman eq., action-value Bellman eq.; state-value optimality eq., action-value optimality eq.
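For reference, the four equations in the textbook's notation (p(s',r|s,a) is the environment dynamics, \gamma the discount rate):

v_\pi(s)   = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\,[\,r + \gamma\, v_\pi(s')\,]
q_\pi(s,a) = \sum_{s',r} p(s',r|s,a)\,[\,r + \gamma \sum_{a'} \pi(a'|s')\, q_\pi(s',a')\,]
v_*(s)     = \max_a \sum_{s',r} p(s',r|s,a)\,[\,r + \gamma\, v_*(s')\,]
q_*(s,a)   = \sum_{s',r} p(s',r|s,a)\,[\,r + \gamma \max_{a'} q_*(s',a')\,]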
iterative policy evaluation by the state-value Bellman eq.; policy improvement by greedification; policy iteration = policy evaluation + policy improvement (these iterations are dynamic programming)
generalized policy iteration (GPI)
(brute-force policy evaluation is practically impossible in most cases) => iterative policy evaluation with dynamic programming, which requires a model of the environment (the dynamics p(s',r|s,a))
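A minimal sketch of iterative policy evaluation plus greedification, assuming the model is given as arrays P[s,a,s'] and R[s,a] (hypothetical names); alternating the two functions is policy iteration:

import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation: sweep all states until the value change is below theta.
    P[s, a, s2] = transition probability, R[s, a] = expected reward, pi[s, a] = policy."""
    n_states = P.shape[0]
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v              # Bellman expectation backup, q[s, a]
        v_new = np.sum(pi * q, axis=1)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

def greedy_policy(P, R, v, gamma=0.9):
    """Policy improvement: act greedily with respect to the current value estimate."""
    q = R + gamma * P @ v
    pi = np.zeros_like(q)
    pi[np.arange(q.shape[0]), np.argmax(q, axis=1)] = 1.0
    return pi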
Warren Powell’s example application
******** Sample-based Learning Methods **** week1
generalized policy iteration using Monte Carlo methods: sample episodes, then average the returns
epsilon-soft policies; off-policy learning: behavior policy vs. target policy => requires importance sampling
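A minimal sketch of every-visit Monte Carlo prediction (sample episodes, then average returns); the env object with reset()/step() is a hypothetical gym-style interface, not the course environment:

from collections import defaultdict

def mc_prediction(env, policy, n_episodes=1000, gamma=0.99):
    """Every-visit Monte Carlo: estimate v_pi(s) by averaging sampled returns."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    v = defaultdict(float)
    for _ in range(n_episodes):
        # generate one episode under the target policy
        episode = []
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            episode.append((s, r))
            s = s_next
        # walk backwards, accumulating the discounted return G
        g = 0.0
        for s, r in reversed(episode):
            g = r + gamma * g
            returns_sum[s] += g
            returns_cnt[s] += 1
            v[s] = returns_sum[s] / returns_cnt[s]
    return v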
batch RL (Emma Brunskill)
temporal difference (TD) learning: update from each transition (s, a, r, s’), not from complete episodes; that is, it requires only experience (not a model of the environment, unlike dynamic programming)
Richard Sutton: prediction learning is the natural way we learn, and TD learning is a proper formalization of prediction learning
“Comparing TD and Monte Carlo” in Week 3 of Sample-based Learning Methods => also shows the effect of the step size
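A minimal TD(0) prediction sketch for comparison with the Monte Carlo version above (same hypothetical env interface; alpha is the step size discussed in that lecture):

from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=0.99):
    """TD(0): update v(s) from each (s, a, r, s') transition; no complete episode needed."""
    v = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * v[s_next])
            v[s] += alpha * (target - v[s])     # TD error times step size
            s = s_next
    return v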
**** week4
Sarsa: TD algorithm from action-value Bellman eq., policy iteration, on-policy
Q-learning: TD algorithm from the action-value Bellman optimality eq., value iteration, off-policy without importance sampling
Q-learning is faster than Sarsa; Sarsa is more reliable than Q-learning
Expected Sarsa: lower variance than Sarsa, needs more computation, off-policy without importance sampling
greedy Expected Sarsa = Expected Sarsa with a greedy target policy = Q-learning
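The three algorithms differ only in how the TD target bootstraps from the next state; a sketch assuming q is a dict of per-state numpy arrays and pi_probs(s) returns the target policy's action probabilities (hypothetical helpers):

import numpy as np

def td_target(q, r, s_next, done, gamma, method, pi_probs=None, a_next=None):
    """Bootstrapped targets for Sarsa, Q-learning, and Expected Sarsa."""
    if done:
        return r
    if method == "sarsa":            # on-policy: use the action actually taken next
        return r + gamma * q[s_next][a_next]
    if method == "q_learning":       # off-policy: bootstrap from the greedy action
        return r + gamma * np.max(q[s_next])
    if method == "expected_sarsa":   # expectation under the target policy
        return r + gamma * np.dot(pi_probs(s_next), q[s_next])
    raise ValueError(method)

# update in every case: q[s][a] += alpha * (td_target(...) - q[s][a])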
**** week5
Dyna-Q: model-based learning; its planning updates resemble off-policy learning with experience replay
Dyna-Q+: copes with an inaccurate or changing model by adding an exploration bonus for long-untried state-action pairs
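A sketch of the Dyna-Q loop: after every real step, do a direct Q-learning update, update a table model, then replay n simulated transitions from the model (assuming a tabular, deterministic environment and q as a dict of action-value arrays):

import random
import numpy as np

def dyna_q_step(q, model, s, a, r, s_next, alpha=0.1, gamma=0.95, n_planning=10):
    """One Dyna-Q step: direct RL update, model update, then n planning updates."""
    # direct RL (Q-learning) update from the real transition
    q[s][a] += alpha * (r + gamma * np.max(q[s_next]) - q[s][a])
    # model learning: remember the observed outcome (deterministic model)
    model[(s, a)] = (r, s_next)
    # planning: replay previously seen (s, a) pairs from the model
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        q[ps][pa] += alpha * (pr + gamma * np.max(q[ps_next]) - q[ps][pa])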
model-based learning (Drew Bagnell) with quadratic value function approximation for continuous states and actions
******** Prediction and Control with Function Approximation
**** week2
coarse coding, tile coding
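A simplified 1-D tile coder, just to show the idea of several offset tilings producing a sparse binary feature vector (my own toy version, not the tile-coding software used in the assignments):

import numpy as np

def tile_code(x, x_min, x_max, n_tilings=8, tiles_per_tiling=10):
    """Encode a scalar x as a binary feature vector using slightly offset 1-D tilings."""
    features = np.zeros(n_tilings * tiles_per_tiling)
    tile_width = (x_max - x_min) / (tiles_per_tiling - 1)
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings          # each tiling is shifted a little
        idx = int((x - x_min + offset) / tile_width)
        idx = min(max(idx, 0), tiles_per_tiling - 1)
        features[t * tiles_per_tiling + idx] = 1.0
    return features

# linear value estimate with weights w: v_hat = w @ tile_code(x, 0.0, 1.0)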
function approximation with NN
3 functions in RL that can be approximated: policy, value functions (state-value, action-value), model
**** week3
average reward; differential returns and differential value functions
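In the average-reward setting, the differential return and the differential TD error (with \bar{R}_t the running estimate of the average reward) replace the discounted ones:

G_t = (R_{t+1} - r(\pi)) + (R_{t+2} - r(\pi)) + (R_{t+3} - r(\pi)) + \cdots
\delta_t = R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})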
reward design: intrinsic reward (Satinder Singh)
**** week4
softmax policy (compared to epsilon-greedy)
policy gradient learning: learn the policy directly (but the policy-gradient theorem still involves the action-value function, which must be estimated)
approximating that action value with a learned state-value function gives the Critic => Actor-Critic; the Critic learns by semi-gradient TD, the Actor learns using the TD error from the Critic (see the sketch below)
Gaussian policy for continuous actions
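A one-step actor-critic sketch with linear function approximation and a softmax actor over discrete actions (my own simplification of the scheme; x(s) is an assumed feature function, theta has shape (n_actions, n_features)):

import numpy as np

def softmax(prefs):
    z = prefs - np.max(prefs)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def actor_critic_step(theta, w, x, s, a, r, s_next, done,
                      alpha_theta=0.01, alpha_w=0.1, gamma=0.99):
    """One-step actor-critic: critic v_hat = w . x(s); actor pi = softmax(theta @ x(s))."""
    # critic: semi-gradient TD(0)
    v_s = w @ x(s)
    v_next = 0.0 if done else w @ x(s_next)
    delta = r + gamma * v_next - v_s          # TD error from the critic
    w += alpha_w * delta * x(s)
    # actor: policy-gradient update driven by the TD error
    pi = softmax(theta @ x(s))
    grad_log = -np.outer(pi, x(s))            # d log pi(a|s) / d theta, all rows
    grad_log[a] += x(s)                       # extra term for the action taken
    theta += alpha_theta * delta * grad_log
    return theta, w, delta

For continuous actions (previous line), the actor would instead parameterize the mean and standard deviation of a Gaussian and sample the action from it.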
******** A Complete Reinforcement Learning System (Capstone)
**** week2
The Hedonistic Neuron (Harry Klopf); eligibility traces: contingent (Actor) or non-contingent (Critic)
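For reference, the textbook's accumulating trace for semi-gradient TD(\lambda): the trace decays past gradients, and the TD error assigns credit back through it:

\mathbf{z}_t = \gamma \lambda\, \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t), \qquad \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t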
Agnostic System Identification for Model-Based Reinforcement Learning 2012 (Drew Bagnell)
mobile health (Susan Murphy)
Adam optimizer = momentum + per-weight (vector) adaptive step sizes
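The standard Adam update for a weight w with gradient g_t, showing the momentum (first-moment) and adaptive-step-size (second-moment) pieces:

m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2
\hat{m}_t = m_t / (1-\beta_1^t), \qquad \hat{v}_t = v_t / (1-\beta_2^t)
w_t = w_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)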
data efficiency needed => experience replay with replay buffer
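A minimal replay buffer sketch: store transitions, sample uniform mini-batches for updates (hypothetical helper, not the capstone's agent code):

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions with uniform sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))               # tuples of states, actions, rewards, ...

    def __len__(self):
        return len(self.buffer)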
reproducibility crisis (Joelle Pineau)