Basic Info
<aside>
✨
An RL technique used to find the optimal policy in an MDP (Markov Decision Process).
</aside>
How does Q-Learning work? (cricket example)
<aside>
✨
- Iteratively updates the Q-value for each state-action pair using the Bellman equation
- until the Q-function converges to the optimal Q-function ($q_*$)
</aside>
➡️ This approach is called value iteration.
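For reference, the standard Q-learning update behind this value iteration can be written as follows (a general formulation, not specific to this example; $\alpha$ is the learning rate and $\gamma$ the discount factor):

$$
q^{new}(s, a) = (1 - \alpha)\, q(s, a) + \alpha \Big( r + \gamma \max_{a'} q(s', a') \Big)
$$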
Example
- lizard → wants to eat as many crickets as possible while avoiding the bird
- the lizard has a few actions available (move left, right, up, or down)
- rewards (encoded as a small mapping in the sketch after this list)
    - for a tile with 1 cricket → +1
    - for a tile with 0 crickets → -1
    - for a tile with 5 crickets → +10
    - for a tile with the bird → -10

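A minimal sketch of how these rewards could be encoded, assuming tiles are keyed by label (the names here are illustrative, not from the original notes):

```python
# Hypothetical reward mapping for the lizard / cricket grid.
# Tile labels mirror the ones used in the Q-table below.
REWARDS = {
    "1 cricket": +1,    # tile with 1 cricket
    "empty": -1,        # tile with 0 crickets
    "5 crickets": +10,  # tile with 5 crickets
    "bird": -10,        # tile with the bird (episode ends here)
}
```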
➡️ INITIALLY
- the Q-values are all initially 0
- since the lizard knows nothing at the start
- throughout the process, value iteration keeps updating the Q-values.
➡️ Q-Table
- INITIAL TABLE

| States | Left | Right | Up | Down |
| --- | --- | --- | --- | --- |
| 1 cricket | 0 | 0 | 0 | 0 |
| Empty 1 | 0 | 0 | 0 | 0 |
| Empty 2 | 0 | 0 | 0 | 0 |
| Empty 3 | 0 | 0 | 0 | 0 |
| Bird | 0 | 0 | 0 | 0 |
| Empty 4 | 0 | 0 | 0 | 0 |
| Empty 5 | 0 | 0 | 0 | 0 |
| Empty 6 | 0 | 0 | 0 | 0 |
| 5 crickets | 0 | 0 | 0 | 0 |
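A minimal sketch of initialising this all-zero Q-table in Python (assuming NumPy; the state and action names simply mirror the table above):

```python
import numpy as np

# Illustrative setup: 9 tiles (states) x 4 actions (Left, Right, Up, Down).
states = ["1 cricket", "Empty 1", "Empty 2", "Empty 3", "Bird",
          "Empty 4", "Empty 5", "Empty 6", "5 crickets"]
actions = ["Left", "Right", "Up", "Down"]

# All Q-values start at 0 because the lizard knows nothing yet.
q_table = np.zeros((len(states), len(actions)))
print(q_table)
```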
- FINAL TABLE

| State | Left | Right | Up | Down |
| --- | --- | --- | --- | --- |
| 0 (R) | - | 0.3 | - | 0.2 |
| 1 | 0.1 | 0.2 | - | 0.1 |
| 2 | 0.1 | - | - | 0.4 |
| 3 | - | - | 0.1 | - |
| 4 (Bird) | - | - | - | - |
| 5 | - | - | 0.1 | 0.3 |
| 6 | - | 0.1 | 0.2 | - |
| 7 (1🦗) | 0.2 | 0.3 | - | 0.4 |
| 8 (5🦗) | - | - | - | - |
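For completeness, a rough sketch of the training loop that would gradually fill in a table like the one above. The environment interface (`env.reset()` / `env.step()`) and the hyper-parameter values are assumptions for illustration, not part of the original notes:

```python
import numpy as np

alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration (illustrative values)

def train(env, q_table, episodes=1000):
    """Tabular Q-learning; `env` is assumed to expose reset()/step() like a Gym-style environment."""
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection: mostly exploit, sometimes explore.
            if np.random.rand() < epsilon:
                action = np.random.randint(q_table.shape[1])
            else:
                action = int(np.argmax(q_table[state]))
            next_state, reward, done = env.step(action)
            # Bellman update: blend the old estimate with the bootstrapped target.
            target = reward + gamma * np.max(q_table[next_state])
            q_table[state, action] = (1 - alpha) * q_table[state, action] + alpha * target
            state = next_state
    return q_table
```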
Fin
<aside>
✨
Notes by: Mehul (mehul.xyz)
</aside>