<aside> ✨
To decide its actions, the agent follows a strategy that balances exploration and exploitation when choosing its next move.
</aside>
<aside> ✨
$\epsilon$ = exploration rate
</aside>
initially set to 1
this is the probability that the agent will explore the environment rather than exploit it
with time this $\epsilon$ value decays at a certain rate that we set
and the focus shifts more towards the exploitation side
the agent becomes greedy in a sense → once it knows enough about the environment, it begins to exploit it
to choose between the two, the agent generates a random number at each step, let’s call it $r$, and compares it to $\epsilon$
<aside> ✨
$r \gt \epsilon = 0.42$
the agent exploits the environment → it picks the action with the highest Q-value for the current state
</aside>
<aside> ✨
$r \lt \epsilon = 0.42$
the agent explores the environment → it picks a random action
</aside>
<aside> ✨
$r \lt \epsilon = 1$
indicates a 100% probability that the agent will explore the environment during the first episode (ref. prev notes) </aside>
so initially the agent will randomly choose an action, and then, based on the reward it got (positive or negative), it will update the Q-value.
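A minimal sketch of this epsilon-greedy selection step, assuming a small tabular setup (the Q-table shape, the decay rate, and the minimum exploration rate below are illustrative assumptions, not values from these notes):

```python
import numpy as np

# hypothetical setup: a small environment with 9 states and 4 actions (e.g. up, down, left, right)
num_states, num_actions = 9, 4
q_table = np.zeros((num_states, num_actions))  # all Q-values start at 0

epsilon = 1.0        # exploration rate, initially set to 1 -> pure exploration
min_epsilon = 0.01   # assumed floor so exploration never fully stops
decay_rate = 0.001   # assumed per-episode decay rate

def choose_action(state, epsilon):
    r = np.random.uniform(0, 1)                 # the random number r from above
    if r > epsilon:
        return int(np.argmax(q_table[state]))   # exploit: action with the highest Q-value
    return np.random.randint(num_actions)       # explore: random action

def decay_epsilon(episode):
    # epsilon decays toward min_epsilon at the rate we set, shifting focus to exploitation
    return min_epsilon + (1.0 - min_epsilon) * np.exp(-decay_rate * episode)
```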
To update the Q-value for the action of moving right taken from the previous state, we use the Bellman equation that we highlighted previously
<aside> ✨
$q_*(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} q_*(s', a') \right]$
</aside>
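As a concrete illustration (continuing the q_table sketch above), the right-hand side of this equation can be estimated from a single observed transition; the state indices, reward, and discount rate here are hypothetical:

```python
gamma = 0.99  # discount rate (assumed value)

# hypothetical transition: moving right from `state` landed the agent in `next_state`
# and the environment returned `reward`
state, action, reward, next_state = 0, 3, -1, 1

# one-sample estimate of R_{t+1} + gamma * max_a' q(s', a')
bellman_target = reward + gamma * np.max(q_table[next_state])
```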
the goal is to converge to the optimal Q-value $q_*(s, a)$
for this we need to find the loss
<aside> ✨
$q_*(s, a) - q(s, a) = \text{loss}$
</aside>
plugging in the Bellman expression for $q_*(s, a)$ and the expected discounted return for $q(s, a)$
<aside> ✨
$\mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} q_*(s', a') \right] - \mathbb{E} \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \right] = \text{loss}$
</aside>
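Since $q_*(s, a)$ isn’t known ahead of time, in practice this loss is approximated by comparing the current Q-value against the one-step Bellman target computed above (this difference is usually called the temporal-difference error):

```python
# approximate loss: bootstrapped Bellman target minus the current Q-value estimate
td_error = bellman_target - q_table[state, action]
```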
<aside> ✨
we can also specify a max number of steps that our agent can take before the episode auto-terminates. With the way the game is set up right now, termination will only occur if the lizard reaches the state with five crickets or the state with the bird.
We could define some condition that states if the lizard hasn't reached termination by either one of these two states after $100$ steps, then terminate the game after the $100^{th}$ step.
</aside>
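A rough sketch of how that step cap could look in the episode loop; the env.reset() / env.step() interface is an assumed gym-style API, not something defined in these notes:

```python
max_steps_per_episode = 100  # auto-terminate after the 100th step

def run_episode(env, epsilon):
    state = env.reset()  # assumed API: returns the starting state
    for step in range(max_steps_per_episode):
        action = choose_action(state, epsilon)
        next_state, reward, done = env.step(action)  # assumed API: done = reached the crickets or the bird
        # ... Q-value update would go here ...
        state = next_state
        if done:
            break  # natural termination
    # if the loop finishes without `done`, the episode auto-terminates after 100 steps
```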
<aside> ✨
Learning rate (symbol: $\alpha$)
<aside> ✨
The learning rate is a number between 0 and 1; it can be thought of as how quickly the agent abandons the previous Q-value in the Q-table for a given state-action pair in favor of the newly computed one.
</aside>
</aside>
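Putting it together, the Q-learning update blends the old Q-value and the Bellman target, weighted by $\alpha$: $q^{\text{new}}(s, a) = (1 - \alpha)\, q(s, a) + \alpha \left( R_{t+1} + \gamma \max_{a'} q(s', a') \right)$. A sketch continuing the hypothetical setup above ($\alpha = 0.1$ is an assumed value):

```python
alpha = 0.1  # learning rate (assumed value)

# keep (1 - alpha) of the old estimate and move alpha of the way toward the Bellman target
q_table[state, action] = (1 - alpha) * q_table[state, action] + alpha * bellman_target

# equivalently: q_table[state, action] += alpha * td_error
```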