<aside> ✨
To decide its actions, the agent follows a strategy that balances exploration and exploitation when choosing its next move.
</aside>
<aside> ✨
$\epsilon$ = exploration rate
</aside>
initially set to 1
this is the probability that the agent will explore the environment rather than exploit it
with time this $\epsilon$ value decays at a certain rate that we set
and the focus shifts more towards the exploitation side
the agent becomes greedy in a sense → once it knows enough about the environment, it begins to exploit it
to choose between the two, the agent generates a random number at each step, let’s call it $r$, and compares it to $\epsilon$
<aside> ✨
$r \gt \epsilon = 0.42$
the agent exploits the environment → it picks the action with the highest Q-value for the current state
</aside>
<aside> ✨
$r \lt \epsilon = 0.42$
the agent explores the environment → it picks a random action
</aside>
<aside> ✨
$r \lt \epsilon = 1$
indicates a 100% probability that the agent will explore the environment during the first episode (ref. prev notes) </aside>
so initially the agent will randomly choose an action, and then, based on the reward it got (positive or negative), it will update the Q-value.
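A minimal sketch of this epsilon-greedy selection step, assuming a small tabular setup (the Q-table shape, the decay rate, and the minimum exploration rate below are illustrative assumptions, not values from these notes):

```python
import numpy as np

# hypothetical setup: a small environment with 9 states and 4 actions (e.g. up, down, left, right)
num_states, num_actions = 9, 4
q_table = np.zeros((num_states, num_actions))  # all Q-values start at 0

epsilon = 1.0        # exploration rate, initially set to 1 -> pure exploration
min_epsilon = 0.01   # assumed floor so exploration never fully stops
decay_rate = 0.001   # assumed per-episode decay rate

def choose_action(state, epsilon):
    r = np.random.uniform(0, 1)                 # the random number r from above
    if r > epsilon:
        return int(np.argmax(q_table[state]))   # exploit: action with the highest Q-value
    return np.random.randint(num_actions)       # explore: random action

def decay_epsilon(episode):
    # epsilon decays toward min_epsilon at the rate we set, shifting focus to exploitation
    return min_epsilon + (1.0 - min_epsilon) * np.exp(-decay_rate * episode)
```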
To update the Q-value for the action of moving right taken from the previous state, we use the Bellman equation that we highlighted previously
<aside> ✨
$q_*(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} q_*(s', a') \right]$
</aside>
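As a concrete illustration (continuing the q_table sketch above), the right-hand side of this equation can be estimated from a single observed transition; the state indices, reward, and discount rate here are hypothetical:

```python
gamma = 0.99  # discount rate (assumed value)

# hypothetical transition: moving right from `state` landed the agent in `next_state`
# and the environment returned `reward`
state, action, reward, next_state = 0, 3, -1, 1

# one-sample estimate of R_{t+1} + gamma * max_a' q(s', a')
bellman_target = reward + gamma * np.max(q_table[next_state])
```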
the goal is to converge to the optimal Q-value $q_*(s, a)$
for this we need to find the loss
<aside> ✨
$q_*(s, a) - q(s, a) = \text{loss}$
</aside>
plugging in the Bellman expression for $q_*(s, a)$ and the expected discounted return for $q(s, a)$
<aside> ✨
$\mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} q_*(s', a') \right] - \mathbb{E} \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \right] = \text{loss}$
</aside>
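Since $q_*(s, a)$ isn’t known ahead of time, in practice this loss is approximated by comparing the current Q-value against the one-step Bellman target computed above (this difference is usually called the temporal-difference error):

```python
# approximate loss: bootstrapped Bellman target minus the current Q-value estimate
td_error = bellman_target - q_table[state, action]
```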
<aside> ✨
we can also specify a max number of steps that our agent can take before the episode auto-terminates. With the way the game is set up right now, termination will only occur if the lizard reaches the state with five crickets or the state with the bird.
We could define some condition that states if the lizard hasn't reached termination by either one of these two states after $100$ steps, then terminate the game after the $100^{th}$ step.
</aside>
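A rough sketch of how that step cap could look in the episode loop; the env.reset() / env.step() interface is an assumed gym-style API, not something defined in these notes:

```python
max_steps_per_episode = 100  # auto-terminate after the 100th step

def run_episode(env, epsilon):
    state = env.reset()  # assumed API: returns the starting state
    for step in range(max_steps_per_episode):
        action = choose_action(state, epsilon)
        next_state, reward, done = env.step(action)  # assumed API: done = reached the crickets or the bird
        # ... Q-value update would go here ...
        state = next_state
        if done:
            break  # natural termination
    # if the loop finishes without `done`, the episode auto-terminates after 100 steps
```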
<aside> ✨
Learning rate (symbol: $\alpha$)
<aside> ✨
The learning rate is a number between 0 and 1; it can be thought of as how quickly the agent abandons the previous Q-value in the Q-table for a given state-action pair in favor of the newly computed one.
</aside>
</aside>
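Putting it together, the Q-learning update blends the old Q-value and the Bellman target, weighted by $\alpha$: $q^{\text{new}}(s, a) = (1 - \alpha)\, q(s, a) + \alpha \left( R_{t+1} + \gamma \max_{a'} q(s', a') \right)$. A sketch continuing the hypothetical setup above ($\alpha = 0.1$ is an assumed value):

```python
alpha = 0.1  # learning rate (assumed value)

# keep (1 - alpha) of the old estimate and move alpha of the way toward the Bellman target
q_table[state, action] = (1 - alpha) * q_table[state, action] + alpha * bellman_target

# equivalently: q_table[state, action] += alpha * td_error
```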