Introduction to Reinforcement Learning

[Lab] Reinforcement Learning: Code from scratch (NO ML libraries)

Remark:

– No late lab will be accepted.
– Code yourself from scratch. No lab will be considered if any ML library is used.
– Do all the required tasks thoroughly.
– Study the theory for the questions.

For this lab session, you are asked to plan the motion of a 2D mobile robot using the Markov Decision Process formalism. Consider the following 2D map for the autonomous navigation of a mobile robot

This map consists of 12 cells. The dashed cell at (x1,x2)=(2,2) represents an obstacle to be avoided. The cell with reward “+1” at (x1,x2)=(4,3) is a desired absorbing cell (the goal), while the cell with reward “-1” at (x1,x2)=(4,2) is an undesired absorbing cell (e.g., a pit). On the other hand, the mobile robot can take four actions: A={N, S, E, W}, where N, S, E, W represent north, south, east and west, respectively. If A=N, then the mobile robot behaves according to the transition probability distribution indicated in the figure above; the same holds for the rest of the actions. Further, the reward function is defined as follows:

R = +1 if (x1,x2)=(4,3); -1 if (x1,x2)=(4,2); -0.02 otherwise
Finally, assign the discount factor (γ) to be 0.99.
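Below is a minimal sketch, in plain Python, of how the map, the reward function, and the discount factor given above might be encoded without any ML library; all variable names (GAMMA, STATES, ACTIONS, reward, etc.) are illustrative choices, not part of the assignment.

# Sketch: encoding the 4x3 grid, rewards, and discount factor (illustrative names).
GAMMA = 0.99                          # discount factor given in the statement
OBSTACLE = (2, 2)                     # dashed cell to be avoided
GOAL, PIT = (4, 3), (4, 2)            # +1 and -1 absorbing cells
STATES = [(x1, x2) for x1 in range(1, 5) for x2 in range(1, 4) if (x1, x2) != OBSTACLE]
ACTIONS = ["N", "S", "E", "W"]

def reward(s):
    """Reward function R from the statement."""
    if s == GOAL:
        return +1.0
    if s == PIT:
        return -1.0
    return -0.02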

For all states, find the optimal value function V ∗(s) and the optimal policy function π∗(s) using the value iteration algorithm.

For all states, find the optimal action-value function Q∗(s,a) and the optimal policy function π∗(s) using the Q-learning algorithm.

Requirements: Make sure to comment your code, do not use any ML library, and code from scratch. Deliverable: .doc file.


Introduction to Reinforcement Learning
February 27, 2021
References: Sutton and Barto’s Reinforcement Learning book (2nd edition), Wikipedia
Reinforcement learning (RL) is an area of machine learning concerned with how software agents should take actions in an environment so as to maximize some notion of cumulative reward. The problem addressed by reinforcement learning is also studied in many other disciplines such as game theory, control theory, operations research, information theory, multi-agent systems, swarm intelligence, statistics and genetic algorithms. In operations research and control literature, reinforcement learning is called approximate dynamic programming or neuro-dynamic programming. In machine learning, the environment is typically formulated as a Markov Decision Process (MDP), as many reinforcement learning algorithms for this context utilize dynamic programming techniques. The main difference between the classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the MDP, and they target large MDPs where exact methods become infeasible.
Reinforcement learning is considered one of three machine learning paradigms, alongside supervised learning and unsupervised learning. It differs from supervised learning in that correct input/output pairs need not be presented, and sub-optimal actions need not be explicitly corrected. Instead, the focus is on performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
1 Comparison between different reinforcement learning algorithms

Figure 1: Source: Wikipedia
2 Markov Decision Process (MDP)

Basic reinforcement learning is modeled as a Markov Decision Process (MDP):

– $S$: set of states
– $A$: set of actions
– $\{P_{sa}\}$: state transition distributions, with
  $$\sum_{s'} P_{sa}(s') = 1, \qquad P_{sa}(s') \geq 0.$$
  $P_{sa}$ gives the probability distribution over the state $s'$ reached when the action $a$ is performed from the state $s$: $s \xrightarrow{a} s'$.
– $\gamma$: discount factor, where $0 \leq \gamma < 1$
– $R$: reward function, $R : S \to \mathbb{R}$
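As a rough illustration (not part of the original notes), these four ingredients can be stored in plain Python containers; the two-state MDP below is a made-up toy example, and P[(s, a)] is an assumed dictionary mapping each next state s' to its probability P_sa(s').

# Sketch: holding an MDP (S, A, {P_sa}, gamma, R) in plain Python types.
S = ["s1", "s2"]                           # set of states (toy example)
A = ["a1", "a2"]                           # set of actions
gamma = 0.99                               # discount factor, 0 <= gamma < 1
R = {"s1": -0.02, "s2": 1.0}               # reward function R: S -> real numbers

# State transition distributions: for each (s, a), a dict over next states s'
# whose probabilities are non-negative and sum to 1.
P = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 0.9, "s2": 0.1},
    ("s2", "a1"): {"s2": 1.0},
    ("s2", "a2"): {"s2": 1.0},
}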
The rules are often stochastic. The observation typically involves the scalar, immediate reward associated with the last transition. The agent is often assumed to observe the current environmental state (full observability). If not, the agent has partial observability. Sometimes the set of actions available to the agent is restricted.
A reinforcement learning agent interacts with its environment in discrete time steps. At each time $t$, the agent receives an observation $o_t$, which typically includes the reward $r_t$. It then chooses an action $a_t$ from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state $s_{t+1}$, and the reward $r_{t+1}$ associated with the transition $(s_t, a_t, s_{t+1})$ is determined. The goal of a reinforcement learning agent is to collect as much reward as possible. The agent can (possibly randomly) choose any action as a function of the history. When the agent's performance is compared to that of an agent that acts optimally, the difference in performance gives rise to the notion of regret. In order to act near-optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize future income), even though the immediate reward associated with this might be negative. Thus, reinforcement learning is particularly well suited to problems that include a long-term versus short-term reward trade-off.
Two elements make reinforcement learning powerful: the use of samples to optimize performance and the use of function approximation to deal with large environments.
2.1 Example

– 11 states (11 cells)
– 4 actions: $A = \{N, S, E, W\}$
– Because the system is noisy, the action is modeled as shown in Figure 2.

Figure 2: Action model

This is a very crude action model. That is, for example,
$$P_{(3,1),N}((3,2)) = 0.8, \quad P_{(3,1),N}((4,1)) = 0.1, \quad P_{(3,1),N}((2,1)) = 0.1, \quad P_{(3,1),N}((3,3)) = 0.0.$$

– Rewards:
  $$R(s) = \begin{cases} +1 & \text{if cell} = (4,3) \\ -1 & \text{if cell} = (4,2) \\ -0.02 & \text{otherwise} \end{cases}$$
  The reward value for any cell different from (4,3) and (4,2) can be interpreted as battery (or time) consumption before arriving at the goal cell.
– Stop condition: finish the algorithm when the robot reaches either the cell with $R = +1$ or the cell with $R = -1$.
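A sketch of how the 0.8/0.1/0.1 action model of Figure 2 could be coded for this 11-cell grid, assuming the usual convention that the intended move succeeds with probability 0.8, each perpendicular move occurs with probability 0.1, and a move into the obstacle or off the map leaves the robot where it is:

# Sketch: noisy transition model P_sa(s') for the 11-cell grid (assumed conventions).
OBSTACLE, GOAL, PIT = (2, 2), (4, 3), (4, 2)
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != OBSTACLE]
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERP = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}

def step(s, a):
    """Deterministic move; bumping into a wall or the obstacle keeps the robot in place."""
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    return nxt if nxt in STATES else s

def transition(s, a):
    """Return a dict {s': P_sa(s')}: 0.8 for the intended move, 0.1 per side move."""
    if s in (GOAL, PIT):            # absorbing cells
        return {s: 1.0}
    probs = {}
    for action, p in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]:
        s2 = step(s, action)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs

# Check against the example above: P_{(3,1),N}
print(transition((3, 1), "N"))      # expect {(3,2): 0.8, (4,1): 0.1, (2,1): 0.1}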
3 The goal of reinforcement learning

How does the MDP work? At state $s_0$, choose $a_0$. Then get to $s_1 \sim P_{s_0 a_0}$. Afterwards, choose $a_1$, then get to $s_2 \sim P_{s_1 a_1}$. And so on.

To evaluate how well the robot did by visiting the states $s_0, s_1, s_2$, etc.:
– define the reward function,
– apply it to the sequence of states,
– add up the rewards obtained along the sequence of states that the robot visits, but in a discounted manner.

Define the cumulative discounted reward, or total pay-off:
$$R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \qquad (1)$$
Since $0 \leq \gamma < 1$, the reward obtained at time $t_1$ is weighted slightly less than the reward obtained at time $t_0$, the reward obtained at time $t_2$ is weighted even less than that of $t_1$, and so on.
If this is an economic application, then the reward is either an earning or a loss, and $\gamma$ has a natural interpretation as the time value of money: a dollar today is worth slightly more than a dollar tomorrow, because a dollar in the bank earns a little interest. Conversely, having to pay out a dollar tomorrow is better than having to pay out a dollar today. In other words, the effect of the discount factor is to weight future wins and losses less than immediate wins and losses.
The goal of reinforcement learning is
– to choose actions over time ($a_0, a_1, \ldots$)
– to maximize the expected value of the total pay-off (i.e., the cumulative discounted reward):
  $$E[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots].$$

More concretely, we want our reinforcement learning algorithm to compute a policy $\pi : S \to A$. A policy is a function that maps the state space to the action space; that is, a policy recommends what action one should take in a given state.
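As a quick illustration of the total pay-off in Equation (1), the discounted sum can be computed in a couple of lines of plain Python (the reward sequence below is made up):

# Sketch: cumulative discounted reward R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([-0.02, -0.02, -0.02, 1.0], 0.99))   # short run ending at the goal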
Example:
Figure 3: Example 2
4 (State-)Value function

For any policy $\pi$, define the (state-)value function $V^{\pi} : S \to \mathbb{R}$ such that $V^{\pi}(s)$ is the expected total pay-off obtained by starting in state $s$ and executing $\pi$:
$$V^{\pi}(s) = E[R(s_0) + \gamma R(s_1) + \cdots \mid \pi, s_0 = s] \qquad (2)$$
This expression is written loosely because $\pi$ is not actually a random variable, but it is commonly used.
Example:

Figure 4: Policy and value function. (a) Policy; (b) value function.

In this example, we see that the actions corresponding to the bottom two rows are bad, because there is a high chance that the robot will end up in the cell with $-1$ reward. We can recognize this fact by looking at the negative values of the value function in the bottom two rows: the expected value of the total pay-off is negative from any state in these rows.
4.1 The Bellman equations

The value function can also be expressed recursively as follows:
$$V^{\pi}(s) = E[R(s_0) + \gamma (R(s_1) + \gamma R(s_2) + \cdots) \mid \pi, s_0 = s] \qquad (3)$$
By mapping $s_0 \to s$ and $s_1 \to s'$,
$$V^{\pi}(s) = R(s) + \gamma \sum_{s'} P_{s\pi(s)}(s') V^{\pi}(s') \qquad (4)$$
This equation is known as the Bellman equation.

It turns out that the Bellman equations give a way to solve the value function for a given policy in closed form. Now the question is: for a given policy $\pi$, how do I find its value function $V^{\pi}(s)$?

Looking at the Bellman equations, we see that they impose constraints on the value function of a given policy $\pi$: the value at a state is a constant $R(s)$ plus a linear function of the values at other states. So for any state in the MDP we can write such an equation, and together these impose a set of linear constraints on what the value function could be. By solving this system of linear equations, one can finally obtain the value function $V^{\pi}(s)$.
Example:
$$V^{\pi}((3,1)) = R((3,1)) + \gamma \left[ 0.8\,V^{\pi}((3,2)) + 0.1\,V^{\pi}((4,1)) + 0.1\,V^{\pi}((2,1)) \right]$$
From this example, we identify the unknowns of the value function: there are 11 unknowns (one value-function value for each state) and 11 equations (one equation for each state). Hence, the value function can be solved uniquely.
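Instead of solving the 11-by-11 linear system explicitly, a sketch can also obtain $V^{\pi}$ by repeatedly applying the Bellman equation (4) until the values stop changing, which converges to the same fixed point. The code below assumes the reward, transition, STATES, GOAL and PIT helpers sketched earlier, and pins the two absorbing cells to their own rewards.

# Sketch: iterative evaluation of V^pi via the Bellman equation (4).
def evaluate_policy(pi, states, gamma=0.99, tol=1e-8):
    """pi is a dict mapping each state to an action; returns a dict V[s]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in (GOAL, PIT):      # absorbing cells: value is just their reward
                v_new = reward(s)
            else:
                v_new = reward(s) + gamma * sum(
                    p * V[s2] for s2, p in transition(s, pi[s]).items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# Illustrative use: evaluate the policy that always tries to go north.
V_north = evaluate_policy({s: "N" for s in STATES}, STATES)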
5 Optimal value function and optimal policy function

5.1 Optimal value function

$$V^{*}(s) = \max_{\pi} V^{\pi}(s) \qquad (5)$$
For any given state $s$, the optimal value function gives the best expected value of the total pay-off (i.e., the sum of the discounted rewards) over all possible policies.

The version of the Bellman equation for $V^{*}(s)$ is
$$V^{*}(s) = R(s) + \max_{a} \gamma \sum_{s'} P_{sa}(s') V^{*}(s') \qquad (6)$$
This equation tells us that the optimal expected pay-off is the immediate reward plus the optimal future expected pay-off obtained by taking the best action.
5.2 Optimal policy function

$$\pi^{*}(s) = \arg\max_{a} \sum_{s'} P_{sa}(s') V^{*}(s') \qquad (7)$$
Hence, the optimal policy chooses the action that maximizes the future expected pay-off.

Then, how can we find $\pi^{*}$? The strategy will be to first find $V^{*}(s)$ and then, using the definition of the optimal policy, find $\pi^{*}(s)$. But the definition of $V^{*}(s)$ does not lead to a nice algorithm for computing it. On the other hand, we know how to compute $V^{\pi}(s)$ for every state for a given $\pi$ by solving a linear system of equations; but there is an exponentially large number of policies. If 11 states and 4 actions are considered, then there can be $4^{11}$ policies! Hence, we cannot use brute force to find the policy that maximizes the value function. Therefore, we need an efficient, intelligent algorithm to find $V^{*}$ and to compute $\pi^{*}$ afterwards.
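Equation (7) translates almost directly into code once a value function is available; a sketch, again assuming the transition helper and ACTIONS list introduced earlier:

# Sketch: extract the greedy policy pi*(s) from a value function, as in Equation (7).
def greedy_policy(V, states):
    pi = {}
    for s in states:
        pi[s] = max(ACTIONS,
                    key=lambda a: sum(p * V[s2] for s2, p in transition(s, a).items()))
    return pi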
6 Value iteration algorithm

Algorithm 1: Value iteration
  Initialize $V(s) = 0$ for all $s$;
  while not converged do
      for every $s$, update $V(s) \leftarrow R(s) + \max_{a} \gamma \sum_{s'} P_{sa}(s') V(s')$;
  return $V(s)$;
This will make $V(s) \to V^{*}(s)$. Then, using the definition of $\pi^{*}(s)$, one can compute the optimal policy from $V^{*}(s)$. There are two ways of implementing this algorithm.

6.1 Synchronous update

$$\forall s, \quad V(s) \leftarrow B(V(s))$$
where $B(\cdot)$ is the Bellman back-up operator. One calculates the right-hand side for all states and then updates the value function all at once.

6.2 Asynchronous update

Update $V(s)$ state by state. Hence, once the value is updated for one state, the new value is already used by the updates that depend on it within the same iteration.

Note: the asynchronous update is usually a bit faster than the synchronous update, but the synchronous approach is easier to analyze.
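A sketch of Algorithm 1 with the synchronous update (every state is backed up from the previous iterate, then the whole table is replaced at once). It reuses the helpers assumed earlier and pins the absorbing cells to their rewards, matching the stop condition of the lab example.

# Sketch: value iteration with synchronous Bellman back-ups (Algorithm 1).
def value_iteration(states, gamma=0.99, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        V_new = {}
        for s in states:
            if s in (GOAL, PIT):      # absorbing cells keep their own reward
                V_new[s] = reward(s)
            else:
                V_new[s] = reward(s) + gamma * max(
                    sum(p * V[s2] for s2, p in transition(s, a).items())
                    for a in ACTIONS)
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

V_star = value_iteration(STATES)
pi_star = greedy_policy(V_star, STATES)   # greedy_policy from the previous sketch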
Example: at cell F,
$$W: \quad \sum_{s'} P_{sa}(s') V(s') = 0.8 \times 0.75 + 0.1 \times 0.69 + 0.1 \times 0.71 = 0.740$$
$$N: \quad \sum_{s'} P_{sa}(s') V(s') = 0.8 \times 0.69 + 0.1 \times 0.75 + 0.1 \times 0.49 = 0.676 \qquad (8)$$
Hence, it is preferable to go W rather than N.
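The two back-ups above can be reproduced with a quick check (the value numbers 0.75, 0.69, 0.71 and 0.49 are taken from the example):

# Quick check of the two one-step back-ups computed in the example.
backup_W = 0.8 * 0.75 + 0.1 * 0.69 + 0.1 * 0.71   # approximately 0.740
backup_N = 0.8 * 0.69 + 0.1 * 0.75 + 0.1 * 0.49   # approximately 0.676
print(backup_W, backup_N)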
Figure 5: Optimal policy and optimal value function. (a) Policy; (b) value function.
7 Policy iteration algorithm

Algorithm 2: Policy iteration
  Initialize $\pi$ randomly;
  while not converged do
      let $V \leftarrow V^{\pi}$ (i.e., solve the Bellman equations);
      let $\pi(s) \leftarrow \arg\max_{a} \sum_{s'} P_{sa}(s') V(s')$;
  return $V(s)$, $\pi(s)$;

Then, it turns out that $V \to V^{*}$ and $\pi \to \pi^{*}$.

In the policy iteration algorithm, the time-consuming step is solving the Bellman equations.

Note: if the number of states is less than about 1000, policy iteration (PI) is preferred; if the number of states is larger, value iteration (VI) is preferred.
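A compact sketch of Algorithm 2 that reuses the evaluate_policy and greedy_policy helpers sketched above; it stops when the improvement step no longer changes the policy.

# Sketch: policy iteration (Algorithm 2), reusing earlier helper functions.
import random

def policy_iteration(states, gamma=0.99):
    pi = {s: random.choice(ACTIONS) for s in states}   # initialize pi randomly
    while True:
        V = evaluate_policy(pi, states, gamma)         # solve for V^pi
        pi_new = greedy_policy(V, states)              # greedy improvement step
        if pi_new == pi:                               # converged: pi is greedy w.r.t. V^pi
            return V, pi
        pi = pi_new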
8 Exploration

Reinforcement learning requires clever exploration mechanisms. Randomly selecting actions, without reference to an estimated probability distribution, shows poor performance. The case of (small) finite MDPs is relatively well understood. However, due to the lack of algorithms that scale well with the number of states (or scale to problems with infinite state spaces), simple exploration methods are the most practical.

One such method is $\epsilon$-greedy: with probability $1 - \epsilon$ the agent chooses the action that it believes has the best long-term effect, and otherwise it chooses an action uniformly at random. Here, $0 < \epsilon < 1$ is a tuning parameter, which is sometimes changed either according to a fixed schedule (making the agent explore progressively less) or adaptively based on heuristics.
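A sketch of epsilon-greedy action selection over a tabular Q function stored as a dict keyed by (state, action); the function name and signature are illustrative.

# Sketch: epsilon-greedy action selection from a Q table.
import random

def epsilon_greedy(Q, s, actions, epsilon):
    if random.random() < epsilon:                    # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])     # exploit the best known action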
9 Q-learning

Q-learning is a model-free reinforcement learning algorithm. The goal of Q-learning is to learn a policy, which tells an agent what action to take under what circumstances. It does not require a model (hence the connotation "model-free") of the environment, and it can handle problems with stochastic transitions and rewards without requiring adaptations.

For any finite MDP (FMDP), Q-learning finds a policy that is optimal in the sense that it maximizes the expected value of the total reward over any and all successive steps, starting from the current state. Q-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly random policy. "Q" names the function that returns the reward used to provide the reinforcement and can be said to stand for the "quality" of an action taken in a given state.
9.1 Algorithm

The algorithm calculates the quality of a state-action combination through the (state-action) value function, called the Q function:
$$Q : S \times A \to \mathbb{R}.$$
Before learning begins, the Q function is initialized to arbitrary or fixed values. Then, at each time $t$, the agent selects an action $a_t$ (either following a policy function $\pi(\cdot)$ or chosen randomly), observes a reward $r_t$, and enters a new state $s_{t+1}$ (which may depend on both the previous state $s_t$ and the selected action). Afterwards, $Q$ is updated. The core of the algorithm is a simple value iteration update, using the weighted average of the old value and the new information:
$$Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) \right),$$
where $r_t$ is the reward received when moving from the state $s_t$ to the state $s_{t+1}$, and $\alpha$ is the learning rate ($0 < \alpha \leq 1$).

An episode of the algorithm ends when the state $s_{t+1}$ is a final or terminal state. However, Q-learning can also learn in non-episodic tasks. If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops.

For all final states $s_f$, $Q(s_f, a)$ is never updated, but is set to the reward value $r$ observed for state $s_f$. In most cases, $Q(s_f, a)$ can be taken to equal zero.
Algorithm 3: (State-action-)value iteration (Q-learning)
  Initialize $Q(s, a) = 0$ for all $s$ and all $a$;
  while not converged do
      for every $s_t$ and $a_t$, update
      $Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha\,(r_t + \gamma \max_{a} Q(s_{t+1}, a))$;
  return $Q$;

This will make $Q(s, a) \to Q^{*}(s, a)$. Then, using the definition of $\pi^{*}(s)$, one can compute the optimal policy from $Q^{*}(s, a)$. There are two ways of implementing this algorithm.
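A sketch of tabular Q-learning for the grid example, run episode by episode with the epsilon-greedy selection above. It samples next states from the assumed transition model, uses the R(s) reward convention of these notes, fixes the Q values of the two final cells to their rewards as described above; the values of alpha, epsilon, the start cell and the episode count are illustrative choices, not values prescribed by the notes.

# Sketch: tabular Q-learning on the grid example (illustrative hyper-parameters).
import random

def sample_next_state(s, a):
    """Sample s' from the assumed transition model P_sa."""
    r, cum = random.random(), 0.0
    for s2, p in transition(s, a).items():
        cum += p
        if r <= cum:
            return s2
    return s2

def q_learning(states, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = {(s, a): 0.0 for s in states for a in ACTIONS}
    for s in (GOAL, PIT):                              # final states: Q fixed to their reward
        for a in ACTIONS:
            Q[(s, a)] = reward(s)
    for _ in range(episodes):
        s = (1, 1)                                     # start cell (illustrative choice)
        while s not in (GOAL, PIT):                    # episode ends at an absorbing cell
            a = epsilon_greedy(Q, s, ACTIONS, epsilon)
            s2 = sample_next_state(s, a)
            r = reward(s)                              # R(s) convention used in these notes
            target = r + gamma * max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
    return Q

Q = q_learning(STATES)
pi_from_Q = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}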