## 1. **What is Reinforcement Learning?**

In one line, we can say: “Reinforcement learning is a process where an agent learns from an environment by interacting with it, and receives rewards for performing specific actions”.

Let us try to understand it with a simple day-to-day analogy. If you have a pet at home, you may have used this technique with your pet.

A clicker (or whistle) is a way to let your pet know that a treat is about to be served! This essentially “reinforces” good behavior in your pet. You click the clicker and follow up with a treat. With time, your pet gets accustomed to the sound and responds every time it hears the click. With this technique, you can train your pet to do “good” deeds when required.

Now let’s make these replacements in the example:

- The pet becomes the artificial agent
- The treat becomes the reward function
- The good behavior is the resultant action

The above example is a classic illustration of what reinforcement learning looks like.

To apply this to an artificial agent, you set up a kind of feedback loop to reinforce your agent. The agent is rewarded when the action it performs is right and punished when it is wrong. The variables involved are:

- an **internal state**, which is maintained by the agent to learn about the environment.
- a **reward function**, which is used to train your agent how to behave.
- an **environment**, which is the scenario the agent has to face.
- an **action**, which is done by the agent in the environment.
- and last but not least, an **agent**, which does all the deeds!
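To make the feedback loop concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration rather than a standard API: the action names, the `reward_function`, the `Agent` class and the learning rate are all assumptions. The environment is reduced to the reward it hands back, and the agent's internal state is just a running estimate of how rewarding each action has been.

```python
import random

def reward_function(action):
    """Hypothetical reward: a treat (+1) for 'sit', a scolding (-1) for anything else."""
    return 1.0 if action == "sit" else -1.0

class Agent:
    def __init__(self, actions, learning_rate=0.5):
        self.actions = actions
        self.learning_rate = learning_rate
        # Internal state: the agent's running estimate of each action's reward.
        self.estimates = {a: 0.0 for a in actions}

    def act(self, rng, explore=0.2):
        # Mostly repeat what has worked so far; sometimes try something else.
        if rng.random() < explore:
            return rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.estimates[a])

    def learn(self, action, reward):
        # Move the estimate for this action a step toward the reward received.
        old = self.estimates[action]
        self.estimates[action] = old + self.learning_rate * (reward - old)

def train(steps=50, seed=0):
    rng = random.Random(seed)
    agent = Agent(["sit", "bark", "jump"])
    for _ in range(steps):
        action = agent.act(rng)            # the agent acts in the environment
        reward = reward_function(action)   # the environment answers with a reward
        agent.learn(action, reward)        # the feedback loop closes
    return agent

agent = train()
best = max(agent.estimates, key=agent.estimates.get)
print("Preferred action after training:", best)
```

Just like the pet, the agent ends up preferring the action that earned treats, without ever being told the rule directly.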

Reference Source: UTCS RL Reading Group

## 2. Examples of Reinforcement Learning :-

Let’s look at some real-life applications of reinforcement learning. Generally, we know the start state and the end state of an agent, but there could be multiple paths to reach the end state. Reinforcement learning finds an application in these scenarios. **This essentially means that driverless cars, self-navigating vacuum cleaners and the scheduling of elevators are all applications of reinforcement learning.**

## 3. What is a Reinforcement Learning Platform?

**Reinforcement learning environment :-**

A reinforcement learning environment is what an agent can observe and act upon. The agent's horizon can be much bigger than this, but its task is to perform actions on the environment that help it maximize its reward. As “A brief introduction to reinforcement learning” by Murphy (1998) puts it,

*“The environment is modeled as a stochastic finite state machine with inputs (actions sent from the agent) and outputs (observations and rewards sent to the agent)”*.
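As a rough sketch of that definition, the snippet below models an environment as a tiny stochastic finite state machine. The two states, the two actions, the probabilities and the rewards are all made up for illustration, and `StochasticEnvironment` is a hypothetical class name. The input is an action, and the output is an observation plus a reward.

```python
import random

class StochasticEnvironment:
    """Hypothetical two-state environment modeled as a stochastic finite state machine."""

    # TRANSITIONS[state][action] -> list of (probability, next_state, reward)
    TRANSITIONS = {
        "start": {
            "wait": [(1.0, "start", 0.0)],
            "move": [(0.8, "goal", 1.0), (0.2, "start", -1.0)],
        },
        "goal": {
            "wait": [(1.0, "goal", 0.0)],
            "move": [(1.0, "goal", 0.0)],
        },
    }

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = "start"

    def step(self, action):
        """Input: an action from the agent. Output: (observation, reward)."""
        outcomes = self.TRANSITIONS[self.state][action]
        weights = [probability for probability, _, _ in outcomes]
        # Sample one outcome according to its probability.
        _, next_state, reward = self.rng.choices(outcomes, weights=weights)[0]
        self.state = next_state
        # In this sketch the observation is simply the new state.
        return next_state, reward

env = StochasticEnvironment(seed=1)
print(env.step("wait"))
print(env.step("move"))
```

Here the agent sees the full state after every step. In a game like Mario the observation would only be the visible part of the level.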

Let’s take an example: a typical game of Mario. Remember how you played this game, and now consider that you are the “agent” who is playing it.

You have “access” to a land of opportunities, but you don’t know what will happen when you do something, say smash a brick. You can see only a limited part of the “environment”, and until you travel around the world you can’t see everything. So you move around the world, trying to perceive what lies ahead of you, and at the same time trying to increase your chances of attaining your goal.

This whole “story” is not created by itself. You have to “render” it first. That is the main task of the platform: to create everything required for a complete experience, namely the environment, the agent and the rewards.

## 4. **Major Reinforcement Learning Platforms :-**

## **DeepMind Lab**

**DeepMind Lab is a fully 3D game-like platform tailored for agent-based AI research**

A recent release by Google DeepMind, **DeepMind Lab** is an integrated agent-environment platform for general artificial intelligence research with a focus on first-person perspective games. It was built to accommodate the research done at DeepMind. DeepMind Lab is based on the open-source engine ioquake3, which was modified to provide a flexible interface for integration with artificial systems. It offers rich, realistic visuals and a close integration with the gaming environment.

## Now let us dive deep into the components of Reinforcement Learning.

## **Markov Decision Process :-**

The mathematical framework for defining a solution in a reinforcement learning scenario is called a **Markov Decision Process**. It can be described by:

- **Set of states, S**
- **Set of actions, A**
- **Reward function, R**
- **Policy, π**
- **Value, V**

We have to take actions (A) to transition from our start state to our end state (both in *S*). In return, we get a reward (R) for each action we take. Our actions can lead to a positive reward or a negative reward.

The set of actions we take defines our policy (π), and the rewards we get in return define our value (V). Our task here is to maximize our rewards by choosing the correct policy, that is, to maximize the expected cumulative reward over all possible policies.
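One common way to write this objective is shown below. The notation is an assumption made for illustration: $r_t$ is the reward received at step $t$, $T$ is the last step of the episode, and $s_0$ is the state we start from. The value of a policy is the cumulative reward we expect when following it, and the best policy is the one with the highest value.

```latex
V^{\pi}(s) = \mathbb{E}\left[\, \sum_{t=0}^{T} r_t \;\middle|\; s_0 = s,\ \pi \right],
\qquad
\pi^{*} = \arg\max_{\pi} V^{\pi}(s)
```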

## a) Problem-1 : **Shortest Path Problem** :-

Let me take you through another example to make it clear.

Consider a shortest path problem. The task is to go from place A to place F at as low a cost as possible. The number on each edge between two places represents the cost of traversing that distance. A negative cost is actually an earning on the way. We define “Value” as the total cumulative reward you get when you follow a policy.

Here,

- The set of states are the nodes, viz {A, B, C, D, E, F}
- The action to take is to go from one place to other, viz {A -> B, C -> D, etc}
- The reward function is the value represented by edge, i.e. cost
- The policy is the “way” to complete the task, viz {A -> C -> F}

Now suppose you are at place A. The only thing visible is your next destination, and anything beyond that is not known at this stage.

You can take a greedy approach and take the best possible next step, which is {A -> D} out of the options {A -> (B, C, D, E)}. Similarly, now that you are at place D and want to go to place F, you can choose from {D -> (B, C, F)}. We see that {D -> F} has the lowest cost, and hence we take that path.

So here, our policy was to take {A -> D -> F} and our Value is -120.
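The greedy walk above can be sketched in a few lines of Python. Please note that the edge costs below are hypothetical: they were picked only so that the greedy route comes out as A -> D -> F with a value of -120, and they are not the real numbers of the problem. The function names `greedy_path` and `best_path` are also just illustrative. The second function tries every route, so we can compare the greedy result with the best possible one.

```python
# Hypothetical edge costs, chosen only for illustration.
COSTS = {
    "A": {"B": 100, "C": 60, "D": 50, "E": 90},
    "B": {"F": 40},
    "C": {"F": 30},
    "D": {"B": 90, "C": 80, "F": 70},
    "E": {"F": 50},
    "F": {},
}

def greedy_path(costs, start, goal):
    """Always take the cheapest edge visible from the current place."""
    path, total_cost = [start], 0
    while path[-1] != goal:
        edges = costs[path[-1]]
        next_place = min(edges, key=edges.get)
        total_cost += edges[next_place]
        path.append(next_place)
    return path, -total_cost  # value = negative of the total cost

def best_path(costs, start, goal):
    """Try every route and keep the one with the highest value.

    Assumes the graph has no cycles and every place can reach the goal.
    """
    if start == goal:
        return [start], 0
    best = None
    for next_place, cost in costs[start].items():
        sub_path, sub_value = best_path(costs, next_place, goal)
        candidate = ([start] + sub_path, sub_value - cost)
        if best is None or candidate[1] > best[1]:
            best = candidate
    return best

print(greedy_path(COSTS, "A", "F"))
print(best_path(COSTS, "A", "F"))
```

With these made-up costs, the greedy route is not the best one: looking only one step ahead misses a cheaper route through C.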

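This is a **greedy** approach: it always takes the best-looking next step. It is closely related to **epsilon-greedy**, which behaves the same way most of the time but takes a random step with a small probability epsilon in order to explore. With the purely greedy approach, if you (the “salesman”) want to go from place A to place F again, you would always choose the same policy.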
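Here is a minimal sketch of epsilon-greedy action selection. The function name `epsilon_greedy`, its parameters and the example values are assumptions for illustration. With probability epsilon it picks a random action (explore), otherwise it picks the action with the highest value (exploit).

```python
import random

def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the best-valued one."""
    actions = list(action_values)
    if rng.random() < epsilon:
        return rng.choice(actions)               # explore
    return max(actions, key=action_values.get)   # exploit

# Hypothetical values of the next steps from place A (negative costs).
values_from_a = {"B": -100, "C": -60, "D": -50, "E": -90}

print(epsilon_greedy(values_from_a, epsilon=0.0))  # no exploration: purely greedy
print(epsilon_greedy(values_from_a, epsilon=0.3))  # sometimes tries another place
```

Setting epsilon to 0 gives back the purely greedy behavior, which is why it keeps choosing the same route every time.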

**Other ways of travelling?**

Notice that the policy we took is not an optimal policy. We would have to “explore” a little bit to find the optimal policy. The approach we took here is “policy based learning”, and our task is to find the optimal policy among all the possible policies. There are different ways to solve this problem. I’ll briefly list the major categories:

- **Policy based**, where our focus is to find the optimal policy.
- **Value based**, where our focus is to find the optimal value, i.e. the cumulative reward.
- **Action based**, where our focus is on what optimal action to take at each step.

## **5. An implementation of Reinforcement Learning :-**

We will be using the Deep Q-learning algorithm. Q-learning is a value based learning algorithm, and Deep Q-learning uses a neural network as its function approximator. This algorithm was used by Google DeepMind to reach human-level play on Atari games!

Let’s see the pseudocode of Q-learning:

- Initialize the values table **Q(s, a)**.
- Observe the current state **s**.
- Choose an action **a** for that state based on one of the action selection policies (e.g. epsilon-greedy).
- Take the action, and observe the reward **r** as well as the new state **s'**.
- Update the value for the state and action using the observed reward and the maximum value possible for the next state. The update follows the Q-learning update formula and its parameters (the learning rate and the discount factor).
- Set the state to the new state, and repeat the process until a terminal state is reached.
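The steps above can be sketched with a plain table instead of a neural network. This is only a minimal sketch under stated assumptions: the corridor environment (states 0 to 4, with the goal at 4), the reward of +1 at the goal, and the values of `alpha` (learning rate), `gamma` (discount factor) and `epsilon` are all hypothetical choices. The update used is the standard Q-learning rule, Q(s, a) ← Q(s, a) + α [r + γ max Q(s', a') − Q(s, a)].

```python
import random

# Hypothetical corridor: states 0..4, start at 0, terminal state 4.
N_STATES, GOAL = 5, 4
ACTIONS = ["left", "right"]

def step(state, action):
    """Move one cell left or right; reward 1.0 only when the goal is reached."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # Initialize the values table Q(s, a).
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0  # observe the current state
        while state != GOAL:
            # Choose an action with epsilon-greedy, breaking ties at random.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            # Take the action, observe the reward and the new state.
            next_state, reward = step(state, action)
            # Update Q(s, a) toward reward + discounted best value of the next state.
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            # Set the state to the new state and repeat until the terminal state.
            state = next_state
    return q

q_table = q_learning()
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(GOAL)}
print(policy)
```

Deep Q-learning keeps the same loop but replaces the table with a neural network that predicts the Q-values from the state.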


## **Problem -2 : The Cartpole problem.**

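When I was a kid, I remember I would pick up a stick and try to balance it on one hand. My friends and I used to have a competition where whoever balanced it for the longest time would get a “reward”, a chocolate! The cartpole problem is the same game played by an agent: a pole stands on a cart, and the agent has to keep it upright by moving the cart left or right.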


Now that you have seen a basic implementation of reinforcement learning, let us move on to a few more problems, increasing the complexity a little bit every time.

## **Problem-3 : Towers of Hanoi** :-

“The Towers of Hanoi” was invented in 1883 and consists of 3 rods along with a number of sequentially sized disks (3 in this example) starting at the leftmost rod. The objective is to move all the disks from the leftmost rod to the rightmost rod **with the least number of moves**. (You can read more on Wikipedia.)

If we have to map this problem, let us start with states:

- **Starting state**: All 3 disks on the leftmost rod (in order 1, 2 and 3 from top to bottom)
- **End state**: All 3 disks on the rightmost rod (in order 1, 2 and 3 from top to bottom)

**All possible states:**

Here are our 27 possible states:

| All disks on one rod | One disk on each rod | Disks 1 and 3 on one rod | Disks 2 and 3 on one rod | Disks 1 and 2 on one rod |
| --- | --- | --- | --- | --- |
| `(123)**` | `321` | `(13)2*` | `(23)1*` | `(12)3*` |
| `*(123)*` | `312` | `(13)*2` | `(23)*1` | `(12)*3` |
| `**(123)` | `231` | `2(13)*` | `1(23)*` | `3(12)*` |
| | `132` | `*(13)2` | `*(23)1` | `*(12)3` |
| | `213` | `2*(13)` | `1*(23)` | `3*(12)` |
| | `123` | `*2(13)` | `*1(23)` | `*3(12)` |

Where `(12)3*` represents disks 1 and 2 on the leftmost rod (top to bottom), disk 3 on the middle rod, and `*` denotes an empty rightmost rod.
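If you want to check the count yourself, here is a small sketch that generates every state. The encoding is an assumption made for illustration: a state is a tuple giving the rod (0, 1 or 2, left to right) of disks 1, 2 and 3, and the helper `show` prints it in the notation used above. Since bigger disks always sit below smaller ones, the rod of each disk is enough to describe a state, which gives 3 x 3 x 3 = 27 states.

```python
from itertools import product

RODS = 3
DISKS = (1, 2, 3)  # disk 1 is the smallest

def all_states():
    """Every way of assigning each disk to a rod; the order on a rod is forced by size."""
    return list(product(range(RODS), repeat=len(DISKS)))

def show(state):
    """Render a state in the notation used above, e.g. (12)3* for state (0, 0, 1)."""
    parts = []
    for rod in range(RODS):
        disks = "".join(str(d) for d, r in zip(DISKS, state) if r == rod)
        if not disks:
            parts.append("*")            # empty rod
        elif len(disks) == 1:
            parts.append(disks)          # a single disk
        else:
            parts.append(f"({disks})")   # several disks, top to bottom
    return "".join(parts)

states = all_states()
print(len(states))
print([show(s) for s in states])
```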

**Numerical Reward:**

Since we want to solve the problem in the least number of steps, we can attach a reward of -1 to each step.

**Policy:**

Now, without going into any technical details, we can map the possible transitions between the above states. For example, `(123)**` -> `(23)1*` with reward -1. It can also go to `(23)*1`.

If you can now see a parallel, these 27 states form a graph similar to that of the shortest path problem above, and we can find the optimal solution by experimenting with various states and paths.
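To make the parallel concrete, the sketch below treats the states as nodes of a graph and searches it. It is one possible way to do this, not a reinforcement learning algorithm: it uses a plain breadth-first search, and the state encoding (a tuple with the rod of each disk, smallest disk first) and the function names are assumptions for illustration. Since every step has a reward of -1, the route with the fewest moves has the highest total reward.

```python
from collections import deque

def moves(state):
    """Yield the states reachable in one legal move.

    state[i] is the rod (0, 1 or 2) holding disk i, where disk 0 is the smallest.
    """
    for src in range(3):
        on_src = [d for d, r in enumerate(state) if r == src]
        if not on_src:
            continue
        disk = min(on_src)  # only the top (smallest) disk on a rod can move
        for dst in range(3):
            on_dst = [d for d, r in enumerate(state) if r == dst]
            # Legal if the target rod is different and holds no smaller disk.
            if dst != src and all(d > disk for d in on_dst):
                yield state[:disk] + (dst,) + state[disk + 1:]

def best_total_reward(start, goal):
    """Breadth-first search over the state graph, with a reward of -1 per step."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        state, steps = queue.popleft()
        if state == goal:
            return -steps
        for next_state in moves(state):
            if next_state not in seen:
                seen.add(next_state)
                queue.append((next_state, steps + 1))
    return None  # goal not reachable

start, goal = (0, 0, 0), (2, 2, 2)
print(list(moves(start)))
print(best_total_reward(start, goal))
```

From the starting state only the smallest disk can move, to the middle rod or to the rightmost rod, which matches the two transitions mentioned above.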

**Recent Advancements in Reinforcement Learning :-**

As you would realize, the complexity of a Rubik's Cube is many times higher than that of the Towers of Hanoi, and the number of possible options grows accordingly. Now think of the number of states and options in a game of chess, and then in Go (a Chinese board game with a lot of strategy). Google DeepMind created AlphaGo, a deep reinforcement learning program that defeated Lee Sedol, one of the world's best Go players.

With the recent success of deep learning, the focus is now slowly shifting to applying deep learning to solve reinforcement learning problems. Similar breakthroughs are being seen in video games, where the algorithms developed are achieving human-level performance and beyond. Research is still ongoing, with both industrial and academic masterminds working together to accomplish the goal of building better self-learning robots.

Some major domains where RL has been applied are as follows:

- Game Theory and Multi-Agent Interaction
- Robotics
- Computer Networking
- Vehicular Navigation
- Medicine
- Industrial Logistics

All the best!

**“Technology is just a tool. In terms of getting the kids working together and motivating them, the teacher is the most important.” - Bill Gates**
