Tuesday, October 9, 2018

Q-learning: solving the Knight and Princess problem from scratch with the Q-learning algorithm


The Problem:

The problem is to find a path for the Knight to reach the Princess while avoiding the enemies. In the figure below, "K" represents the Knight, "P" represents the Princess, and "E" represents an Enemy. A blank cell means the Knight can move into that cell; for example, from its starting position the Knight can only move right or down. The Knight is killed if it moves into an enemy cell.


Solution:

*You can download the entire Python code here.

We will solve this with the Q-learning algorithm and implement it in Python, without using any other library such as gym.
To implement it from scratch, we need to understand both the Q-learning algorithm and the problem itself. It is recommended to go through the theory of the Q-learning reinforcement learning algorithm first.

Step-1: Represent the problem mathematically.
As we know, the Q-learning algorithm is all about states and actions. In our problem, let the current position of the Knight be the "state". So let us number all the possible positions as states (starting from state #0), as shown in the figure below.


So, the state of the Knight is 0, the state of the Princess is 22, and the states of the enemies are 6, 8, 12, 15, and 19. The objective is to move the Knight from state 0 to state 22 while avoiding states 6, 8, 12, 15, and 19.
The possible actions from each state are UP (action 0), DOWN (action 1), LEFT (action 2), and RIGHT (action 3). For each movement, the Knight receives a reward.
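
To make the numbering concrete, here is a small sketch (not part of the original code; the helper names are only for illustration) that maps a (row, column) position on the 5x5 board to its state number and back:

def position_to_state(row, col):
    # states are numbered row by row, left to right, starting at 0
    return row * 5 + col

def state_to_position(state):
    return state // 5, state % 5

print(position_to_state(0, 0))  # 0  -> starting position of the Knight
print(position_to_state(4, 2))  # 22 -> position of the Princess
print(state_to_position(12))    # (2, 2) -> one of the enemy states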


Step-2: Create the reward matrix


import numpy as np
# reward matrix: 25 states x 4 actions, every move rewarded -1 by default
R = np.full((25,4),-1)
 
For each legal movement, the Knight is rewarded -1 (the default).
However, if the next state is an enemy state, the reward is -100.

# reward for the knight to reach state 6 (enemy) is -100
R[1][1] = -100.0
R[5][3] = -100.0
R[7][2] = -100.0
R[11][0] = -100.0
# reward for the knight to reach state 8 (enemy) is -100
R[3][1] = -100.0
R[7][3] = -100.0
R[9][2] = -100.0
R[13][0] = -100.0
# reward for the knight to reach state 12 (enemy) is -100
R[7][1] = -100.0
R[11][3] = -100.0
R[13][2] = -100.0
R[17][0] = -100.0
# reward for the knight to reach state 15 (enemy) is -100
R[10][1] = -100.0
R[16][2] = -100.0
R[20][0] = -100.0
# reward for the knight to reach state 19 (enemy) is -100
R[14][1] = -100.0
R[18][3] = -100.0
R[24][0] = -100.0
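
As a side note, because each enemy cell can only be entered from its four grid neighbours, the same -100 entries could also be generated in a short loop instead of being written out by hand. A minimal sketch, assuming the 5x5 layout and the action numbering above:

# set -100 for every (state, action) pair that moves the knight onto an enemy
enemy_states = [6, 8, 12, 15, 19]
for e in enemy_states:
    if e + 5 < 25:    # moving UP (action 0) from the cell below e lands on e
        R[e + 5][0] = -100
    if e - 5 >= 0:    # moving DOWN (action 1) from the cell above e lands on e
        R[e - 5][1] = -100
    if e % 5 != 4:    # moving LEFT (action 2) from the cell to the right of e lands on e
        R[e + 1][2] = -100
    if e % 5 != 0:    # moving RIGHT (action 3) from the cell to the left of e lands on e
        R[e - 1][3] = -100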

And if the next state is 22 (i.e. the state of the Princess), then the reward is 100.

# reward for the knight to reach the Princess at state 22 is 100
R[17][1] = 100.0
R[21][3] = 100.0
R[23][2] = 100.0


Illegal movements are marked with a reward value of -2. For example, the Knight cannot move UP from state 0.

# -2 represents an illegal movement
illegal = -2
# illegal movement from states in FIRST ROW
R[0][0]=illegal
R[1][0]= illegal
R[2][0]= illegal
R[3][0]= illegal
R[4][0]= illegal
# illegal movement from states in FIRST COLUMN
R[0][2]= illegal
R[5][2]= illegal
R[10][2]= illegal
R[15][2]= illegal
R[20][2]= illegal
# illegal movement from states in LAST ROW
R[20][1]= illegal
R[21][1]= illegal
R[22][1]= illegal
R[23][1]= illegal
R[24][1]= illegal
# illegal movement from states in LAST COLUMN
R[4][3]= illegal
R[9][3]= illegal
R[14][3]= illegal
R[19][3]= illegal
R[24][3]= illegal
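
Optionally, print the finished reward matrix once and compare a few rows against the figure before training; each row lists the rewards for [UP, DOWN, LEFT, RIGHT]:

# quick sanity check of the reward matrix
for s in range(25):
    print(s, "  ", R[s,])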

Step-3: Create the Q matrix

noOfState = 25
noOfAction = 4

# Q matrix
Q = np.zeros((noOfState,noOfAction))
Set the learning rate (alpha) to 0.8 and the discount factor (gamma) to 0.2.

gamma = 0.2
alpha = 0.8
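
For reference, these two parameters feed into the standard Q-learning update rule,

Q(s, a) <- Q(s, a) + alpha * ( R(s, a) + gamma * max_a' Q(s', a') - Q(s, a) )

where s' is the state reached by taking action a in state s. This is the rule implemented in updateQ below.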
Step-4: Create the supporting functions
Function to return all possible actions that can be taken from any state
def available_directions_to_go(state):
    curr_state_row = R[state,]
    # legal actions are the ones whose reward is not -2
    act = np.where(curr_state_row != -2)
    return act[0]
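
For example, from state 0 the Knight can only move DOWN or RIGHT, so the function returns those action indices:

print(available_directions_to_go(0))  # [1 3] -> DOWN and RIGHT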


Function to select and return one action at random from all possible actions
def next_action_dir(all_psbl_directions):
    next_action_dir = int(np.random.choice(all_psbl_directions))
    return next_action_dir


Function to return the next state after applying a specific action

def next_state(current_state, direction):
    if direction == 0: # UP direction
        st= current_state-5
    if direction == 1: # DOWN direction
        st = current_state+5
    if direction == 2: # LEFT direction
        st = current_state-1
    if direction == 3: # RIGHT direction
        st = current_state+1
    if verify_state(st) == False:
        print("6. Something wrong: (current_state, direction) ", current_state, direction)
    return st
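
The helper functions verify_state and verify_action used above (and again in the training loop below) are not listed in this post; they are presumably part of the downloadable code. A minimal version that simply checks the bounds could look like this:

def verify_state(state):
    # a valid state is one of the 25 board positions
    return 0 <= state < noOfState

def verify_action(action):
    # a valid action is UP(0), DOWN(1), LEFT(2) or RIGHT(3)
    return 0 <= action < noOfAction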


Function to update the Q matrix

def updateQ(cur_state, action_dir, gamma):
    if verify_state(cur_state) == False or verify_action(action_dir) == False:
        print("3. something is wrong")
    next_st = next_state(cur_state, action_dir)
    if verify_state(next_st) == False:
        print("4. something is wrong: (next_st, cur_state, action_dir)", next_st, cur_state, action_dir)
        print(Q)
    # best Q value among the legal actions available from the next state
    best_next_q = np.max(Q[next_st, available_directions_to_go(next_st)])
    # standard Q-learning update:
    # Q(s,a) <- Q(s,a) + alpha * (R(s,a) + gamma * max_a' Q(s',a') - Q(s,a))
    Q[cur_state, action_dir] = Q[cur_state, action_dir] + alpha*(R[cur_state, action_dir] + gamma*best_next_q - Q[cur_state, action_dir])

Step-5: Training
Now it's time to train the Q matrix.


for i in range(50000):
    # pick a random state and a random legal action, then update Q
    cur_state = np.random.randint(0, noOfState)
    if verify_state(cur_state) == False:
        print("1. something is wrong")
    next_act_dir = next_action_dir(available_directions_to_go(cur_state))
    if verify_action(next_act_dir) == False:
        print("2. something is wrong")
    updateQ(cur_state, next_act_dir, gamma)
    if i % 10000 == 0:
        print(i, "iterations finished.")


Step-6: Testing
After training, the resulting Q matrix can be used to find the path from state 0 to state 22.



print("After training below is the Q matrix: ")
for i in range(Q.shape[0]):
    print(i,"  ",Q[i,])
start_st = 0
goal_st = 22
path = [start_st]
while start_st != goal_st:
    tmp = Q[start_st,]
    row_max = np.max(tmp[tmp != 0])
    max_reward_act_dir = np.where(Q[start_st,]==row_max)[0][0]
    next_st =  next_state(start_st, max_reward_act_dir)
    path.append(next_st)
    start_st = next_st
print("Path from state ",start_st," to state ",goal_st,": ",path)


The procedure to find the path is as follows:
6.a.  Start from the state of the Knight; in our case it is state 0.
6.b.  In that row of the Q matrix, find the maximum value that is not 0. A value of 0 represents an illegal movement, so such movements are ignored.
6.c.  Find the corresponding action direction. For example, if the maximum value is at index 2, the Knight should move LEFT.
6.d.  Go to the next state and make it the current state.
6.e.  If the current state is not the goal state (the state of the Princess), repeat from Step-6.b.
6.f.  If the current state is the goal state, print the sequence of states the Knight has visited.

Change the values of start_st and goal_st to change the positions of the Knight and the Princess, respectively. Play around with the values of gamma and alpha to train the Q matrix.
You can download the entire Python code here.


That's all for now.

-Thanks

