Wednesday, October 3, 2018

Understanding the Q-learning algorithm with an example using Python

Before diving into the Q-learning algorithm, let's first understand the motivation behind it: what kind of questions can Q-learning answer?

Problem: 
Suppose we have a house with 7 rooms, numbered 0 to 6. The area outside the house is numbered 7. One can leave the house, i.e. reach 7, only through rooms 0, 2, and 6. Suppose a robot is in room 1 and needs to get out. The Q-learning algorithm can be applied to find a path from room 1 to the outside of the house.

Q-learning approach:

1. The first step is to convert the room diagram into a graph of states (nodes) and actions (edges). For the above house outline, we can derive a graph in which rooms 0-6 and the outside area 7 are the nodes, adjacent rooms are connected by edges (0-1, 1-2, 2-3, 3-4, 4-5, 5-6), and rooms 0, 2, and 6 are additionally connected to the outside node 7:


The number of nodes/states is equal to the number of rooms plus one for the outside area. An edge represents an action, i.e. a movement from one room to another. The value on an edge is the reward the robot receives for that movement. The robot gets a reward of 100 only when it moves to the outside of the house (i.e. state 7) from room 0, 2, or 6. For a move to any state other than 7, the reward is 0.

2. The second step is to convert the graph into a matrix, called the R (reward) matrix. The R matrix has dimension [# of states x # of actions]; in our case an action is simply a move to a target state, so with 8 states and 8 actions the R matrix is 8x8. For the above graph, the R matrix is (rows: current state 0-7, columns: next state 0-7):

    R = [ -1    0   -1   -1   -1   -1   -1  100 ]
        [  0   -1    0   -1   -1   -1   -1   -1 ]
        [ -1    0   -1    0   -1   -1   -1  100 ]
        [ -1   -1    0   -1    0   -1   -1   -1 ]
        [ -1   -1   -1    0   -1    0   -1   -1 ]
        [ -1   -1   -1   -1    0   -1    0   -1 ]
        [ -1   -1   -1   -1   -1    0   -1  100 ]
        [  0   -1    0   -1   -1   -1    0  100 ]


The value in a cell represents the reward on the edge from the row state to the column state. A cell value can be:
        -1, if no edge is present
        0, if an edge is present from the row state to the column state and its reward is 0
        100, if an edge is present from the row state to the column state and its reward is 100
For example, the value in the last column of the first row is 100, because the edge from state 0 to state 7 has reward 100 in the graph from the first step.
The cell value in the 5th column of the second row is -1, because there is no edge from state 1 to state 4 in the graph from the first step.
The cell value in the 4th column of the 5th row is 0, because an edge from state 4 to state 3 is present and its reward is 0 in the graph from the first step. A short sketch for building this matrix from the edge list of the graph is given below.
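The following is a minimal sketch of how the R matrix can be built programmatically. The edges variable and the loop are illustrative bookkeeping (not part of the original listing), but the edge list is read directly off the graph above, and the resulting matrix matches the one hard-coded in the full listing at the end of this post.

import numpy as np

# (from_state, to_state, reward) triples read off the graph above;
# each edge can be traversed in both directions.
edges = [(0, 1, 0), (1, 2, 0), (2, 3, 0), (3, 4, 0),
         (4, 5, 0), (5, 6, 0),
         (0, 7, 100), (2, 7, 100), (6, 7, 100)]

nos = 8                        # number of states
R = -1 * np.ones((nos, nos))   # -1 marks "no edge"

for s, t, reward in edges:
    R[s, t] = reward           # moving towards the exit earns the edge reward
    R[t, s] = 0                # moving back the other way earns 0

R[7, 7] = 100                  # staying outside keeps the goal reward
print(R)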

3. The third step is to create a Q matrix with the same dimensions as the R matrix (8x8 in our case) and initialize all of its cells to 0.



Decide the value of gamma, the discount factor. Let gamma = 0.8.

Now we have two stages: a training stage and a testing stage.

4. The fourth step is the training stage:
In this stage we train the robot by iteratively updating the Q matrix.
The steps of the training stage are as follows:
4.1.   Randomly select a state, called current_state.
4.2.   Find all possible next states, i.e. the states to which the robot can move from current_state.
4.3.   Randomly select one of the states found in step 4.2, called next_state.
4.4.   Apply the following equation to update the Q matrix (a worked example follows this list):
        Find the maximum value in the (next_state)th row of the Q matrix; call it max_value.
        Q[ current_state , next_state ] = R[ current_state , next_state ] + gamma * max_value
4.5.   Go to step 4.1 and repeat this process for a certain number of iterations, say 10000.
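As a worked example of the update rule, assume Q starts at all zeros and gamma = 0.8. If the random choice picks current_state = 6 and next_state = 7, then max_value = max of row 7 of Q = 0, so Q[6, 7] = R[6, 7] + 0.8 * 0 = 100. If a later iteration picks current_state = 5 and next_state = 6, then max_value = max of row 6 of Q = 100, so Q[5, 6] = R[5, 6] + 0.8 * 100 = 0 + 80 = 80. Over many episodes, the goal reward propagates backwards through the graph in this way, discounted by gamma at each step.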

After 10000 iterations (also called episodes), the training loop yields the trained Q matrix; the code below prints it in normalized form, scaled so that the largest entry is 100.


This Q matrix will now be used by the robot to find a path from any room to the outside of the house.

5. The fifth step is the testing stage:
In this step we use the trained Q matrix obtained from the previous step.
The steps of the testing stage are as follows (see the worked example after this list):
5.1.  Set cs = initial_state.
5.2.  Find the state s with the maximum value in the (cs)th row of the Q matrix.
5.3.  If s is the goal state (in this example, state 7), return the sequence of all visited states;
        otherwise, set cs = s and repeat from step 5.2.
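For example, with initial_state = 1 the robot's neighbours are rooms 0 and 2, and both open directly to the outside, so after sufficient training Q[1, 0] and Q[1, 2] should converge to the same value; the greedy walk then returns either 1 -> 0 -> 7 or 1 -> 2 -> 7, both of which are shortest paths out of the house.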

Below is the Python code for the above example. The code prints the Q matrix after every 1000 iterations/episodes. The Python file (Q-learning_example.py) can be downloaded from here.
====================================

import numpy as np

nos = 8                            # number of states
goal_state = 7                     # the goal state
intermediate_Q_mat_at_step = 1000  # print the Q matrix every this many iterations

# R matrix (rows: current state, columns: next state; -1 means no edge)
R = np.matrix([[-1,  0, -1, -1, -1, -1, -1, 100],
               [ 0, -1,  0, -1, -1, -1, -1,  -1],
               [-1,  0, -1,  0, -1, -1, -1, 100],
               [-1, -1,  0, -1,  0, -1, -1,  -1],
               [-1, -1, -1,  0, -1,  0, -1,  -1],
               [-1, -1, -1, -1,  0, -1,  0,  -1],
               [-1, -1, -1, -1, -1,  0, -1, 100],
               [ 0, -1,  0, -1, -1, -1,  0, 100]])

# Q matrix, initialized to zeros
Q = np.matrix(np.zeros([nos, nos]))

# Gamma (discount factor)
gamma = 0.8


# Return all states reachable from the given state (cells with reward >= 0)
def available_actions(state):
    current_state_row = R[state, ]
    av_act = np.where(current_state_row >= 0)[1]
    return av_act


# A random action is selected from the available ones
def sample_next_action(available_actions_range):
    next_action = int(np.random.choice(available_actions_range))
    return next_action


# Update the Q matrix for one training step
def update(current_state, action, gamma):

    # Column index of the maximum value in the (action)th row of Q
    max_index = np.where(Q[action, ] == np.max(Q[action, ]))[1]

    if max_index.shape[0] > 1:
        # Break ties between equally good actions randomly
        max_index = int(np.random.choice(max_index))
    else:
        max_index = int(max_index[0])

    # Q-learning formula
    Q[current_state, action] = R[current_state, action] + gamma * Q[action, max_index]
    


############ Training stage ###################################


# Train for 10000 iterations
for i in range(10000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    update(current_state, action, gamma)
    # Print the intermediate Q matrix (normalized, guarding against max(Q) == 0)
    if (i % intermediate_Q_mat_at_step) == 0:
        print("Q-matrix at step", i)
        print(Q / np.max(Q) * 100 if np.max(Q) > 0 else Q)
    
# Print the normalized "trained" Q matrix
print("Trained Q matrix (normalized to 100):")
print(Q / np.max(Q) * 100)


###################### Testing Stage  ###################################


current_state = 4
steps = [current_state]

while current_state != goal_state:

    # Greedily choose the next state with the maximum Q value
    next_step_index = np.where(Q[current_state, ] == np.max(Q[current_state, ]))[1]

    if next_step_index.shape[0] > 1:
        # Break ties randomly
        next_step_index = int(np.random.choice(next_step_index))
    else:
        next_step_index = int(next_step_index[0])

    steps.append(next_step_index)
    current_state = next_step_index

# Print selected sequence of steps
print("Selected path:")
print(steps)
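A note on the expected output: the test starts from room 4, which is equally far from the exit through rooms 3 and 2 as through rooms 5 and 6, so the printed path should be either [4, 3, 2, 7] or [4, 5, 6, 7], depending on how ties in the Q matrix are broken.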


