Q-learning
Reinforcement Learning
Getting from A to Z by trying again and again.
Bellman Equation
The agent learns from the reward it receives after taking an action in a given state of the environment.
s : state
a : action
R : Reward
V : Value
γ : discount factor
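With the symbols above, the Bellman equation for a deterministic environment is commonly written as below. This is the standard textbook form, not something stated in these notes; s' is an added symbol for the state reached by taking action a in state s, and γ is the discount factor.

```latex
V(s) = \max_{a} \bigl( R(s, a) + \gamma \, V(s') \bigr)
```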
Markov Decision Process (MDP)
Deterministic Search
The agent takes the action we intend 100% of the time.
Non-Deterministic Search
The agent takes the action we intend less than 100% of the time; sometimes it ends up doing something else.
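The difference between the two kinds of search can be sketched in a few lines of Python. The action names, the function sample_action, and the rule that a failed action slips to one of the two perpendicular directions are all assumptions made for illustration (a common gridworld convention), not something defined in these notes.

```python
import random

# Hypothetical gridworld actions in clockwise order; names are illustrative.
ACTIONS = ["up", "right", "down", "left"]

def sample_action(intended, p_intended=1.0, rng=random):
    """Return the action that is actually executed.

    With probability p_intended the intended action is taken. Otherwise the
    agent slips to one of the two perpendicular actions with equal chance
    (an assumed noise model).
    """
    if rng.random() < p_intended:
        return intended
    i = ACTIONS.index(intended)
    # Neighbours of the intended action in the clockwise ordering.
    return rng.choice([ACTIONS[(i - 1) % 4], ACTIONS[(i + 1) % 4]])
```

Calling sample_action("up", p_intended=1.0) models deterministic search, while a value such as p_intended=0.8 models non-deterministic search, where "up" occasionally turns into "left" or "right".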
A stochastic process has the Markov property if the conditional probability distribution of future states of the process (conditional on both past and present states) depends only upon the present state, not on the sequence of events that preceded it. A process with this property is called a Markov process.
Markov Decision Processes (MDPs) provide a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.
P : probability (of reaching a given next state after taking an action)
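Once outcomes are random, the value of an action becomes an expectation over the possible next states. A common way to write the Bellman equation for an MDP is shown below; the notation P(s, a, s') for the probability of landing in state s' after taking action a in state s is an assumption, since the notes only introduce P as a probability.

```latex
V(s) = \max_{a} \Bigl( R(s, a) + \gamma \sum_{s'} P(s, a, s') \, V(s') \Bigr)
```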
Policy vs Plan
Plan : Take the action that intuitively leads to the highest value.
Policy : Take a conservative action that reaches the highest value safely, allowing for the chance that an action does not go as intended.
Living Penalty
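Assign negative rewards to non-terminal states so that the agent ends the episode sooner.
A proper living penalty makes the agent act more intuitively.
An improper living penalty can lead the agent astray, for example by making it end the episode at a bad terminal state just to stop accumulating penalties.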
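The effect of the living penalty can be checked on a toy problem. The sketch below runs value iteration on a hypothetical five-state corridor with a pit (-1) at the left end and a goal (+1) at the right end. The environment, the names value_iteration, MOVES and TERMINAL_REWARD, and the numeric values are all assumptions chosen for illustration.

```python
# Hypothetical 1-D corridor: state 0 is a pit (-1), state 4 is the goal (+1).
N_STATES = 5
TERMINAL_REWARD = {0: -1.0, 4: 1.0}
MOVES = {"left": -1, "right": +1}

def value_iteration(living_penalty, gamma=0.9, sweeps=100):
    """Return (V, policy) for the deterministic corridor."""
    V = [0.0] * N_STATES          # terminal states keep value 0
    policy = {}
    for _ in range(sweeps):
        for s in range(1, N_STATES - 1):      # non-terminal states only
            q = {}
            for name, step in MOVES.items():
                s_next = s + step
                # Entering a terminal state pays its reward; any other
                # move pays the living penalty.
                r = TERMINAL_REWARD.get(s_next, living_penalty)
                q[name] = r + gamma * V[s_next]
            policy[s] = max(q, key=q.get)     # greedy action
            V[s] = q[policy[s]]
    return V, policy

V_mild, policy_mild = value_iteration(living_penalty=-0.1)
V_harsh, policy_harsh = value_iteration(living_penalty=-1.5)
```

Under these assumptions, the mild penalty of -0.1 yields a policy that heads right toward the goal from every state. With the harsh penalty of -1.5, the state next to the pit prefers stepping into the pit, because two more penalized steps cost more than the pit itself.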
Where’s the Q?
Q = Quality
Q(s,a) is the value of taking action a in state s.
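In the usual formulation, Q is the quantity inside the maximum of the Bellman equation, so V and Q are related as sketched below. As before, P(s, a, s') for the transition probability and a' for the action taken in the next state s' are assumed notation, not symbols defined in these notes.

```latex
Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s') \, \max_{a'} Q(s', a')
\qquad\text{and}\qquad
V(s) = \max_{a} Q(s, a)
```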