The Problem:
The problem is to find a path to the Princess while avoiding the enemies. In the figure below, "K" represents the Knight, "P" represents the Princess, and "E" represents an Enemy. A blank cell means the knight can move into that cell; for example, from its starting cell the knight can move only right or down. The knight is killed if it moves into an enemy cell.
Solution:
You can download the entire Python code here.
We will solve this with the Q-learning algorithm, implemented in Python from scratch, without using any additional library such as gym.
To implement it from scratch, we need to understand both the Q-learning algorithm and the problem itself. It is recommended to go through the theory of the Q-learning reinforcement learning algorithm first.
Step-1: Represent the problem mathematically.
As we know, the Q-learning algorithm is all about states and actions. In our problem, let the current position of the knight be the "state". We number all possible positions as states (starting from state #0), as shown in the figure below.
So the state of the knight is 0, the state of the Princess is 22, and the states of the enemies are 6, 8, 12, 15, and 19. The objective is to move the knight from state 0 to state 22 while avoiding states 6, 8, 12, 15, and 19.
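For reference (in case the figure is not visible), the 5x5 grid of state numbers looks like this, with the knight (K), Princess (P), and enemies (E) marked:

 0(K)   1      2      3      4
 5      6(E)   7      8(E)   9
10     11     12(E)  13     14
15(E)  16     17     18     19(E)
20     21     22(P)  23     24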
The possible actions from each state are UP (action 0), DOWN (action 1), LEFT (action 2), and RIGHT (action 3). For each movement, the knight receives a reward.
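Because the states are numbered row by row on a 5x5 board, each action simply shifts the state number by a fixed amount. The short sketch below makes that arithmetic explicit (the names GRID_SIZE, ACTION_DELTA, and state_to_cell are illustrative and not part of the original code):

GRID_SIZE = 5  # 5 x 5 board, states 0..24 numbered row by row

# action index -> change in the state number
ACTION_DELTA = {0: -GRID_SIZE,  # UP: one row up
                1: +GRID_SIZE,  # DOWN: one row down
                2: -1,          # LEFT: one column left
                3: +1}          # RIGHT: one column right

def state_to_cell(state):
    # convert a state number into (row, column) coordinates on the board
    return divmod(state, GRID_SIZE)

print(state_to_cell(22))        # (4, 2): the Princess sits in the last row, middle column
print(10 + ACTION_DELTA[3])     # moving RIGHT from state 10 leads to state 11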
Step-2: Create the reward matrix
import numpy as np

R = np.full((25, 4), -1)
For each legal movement the knight is rewarded -1 (the default value). However, if the next state is an enemy state, the reward is -100.
# reward for the knight to reach enemy state 6 is -100
R[1][1] = -100.0
R[5][3] = -100.0
R[7][2] = -100.0
R[11][0] = -100.0

# reward for the knight to reach enemy state 8 is -100
R[3][1] = -100.0
R[7][3] = -100.0
R[9][2] = -100.0
R[13][0] = -100.0

# reward for the knight to reach enemy state 12 is -100
R[7][1] = -100.0
R[11][3] = -100.0
R[13][2] = -100.0
R[17][0] = -100.0

# reward for the knight to reach enemy state 15 is -100
R[10][1] = -100.0
R[16][2] = -100.0
R[20][0] = -100.0

# reward for the knight to reach enemy state 19 is -100
R[14][1] = -100.0
R[18][3] = -100.0
R[24][0] = -100.0
And if the next state is 22 (the state of the Princess), then the reward is 100.
# reward for the knight to reach the castle (Princess) at state 22 is 100
R[17][1] = 100.0
R[21][3] = 100.0
R[23][2] = 100.0
Illegal movements are marked with a reward value of -2. For example, the knight cannot move UP from state 0.
# -2 represents an illegal movement
illegal = -2

# illegal movements from states in the FIRST ROW
R[0][0] = illegal
R[1][0] = illegal
R[2][0] = illegal
R[3][0] = illegal
R[4][0] = illegal

# illegal movements from states in the FIRST COLUMN
R[0][2] = illegal
R[5][2] = illegal
R[10][2] = illegal
R[15][2] = illegal
R[20][2] = illegal

# illegal movements from states in the LAST ROW
R[20][1] = illegal
R[21][1] = illegal
R[22][1] = illegal
R[23][1] = illegal
R[24][1] = illegal

# illegal movements from states in the LAST COLUMN
R[4][3] = illegal
R[9][3] = illegal
R[14][3] = illegal
R[19][3] = illegal
R[24][3] = illegal
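Hand-coding every border case works for a 5x5 board, but the same entries could also be generated in a loop. The sketch below is an equivalent alternative, not taken from the original post:

# equivalent loop-based version: mark moves that would leave the 5x5 board as illegal (-2)
for s in range(25):
    row, col = divmod(s, 5)
    if row == 0: R[s][0] = illegal   # cannot move UP from the first row
    if row == 4: R[s][1] = illegal   # cannot move DOWN from the last row
    if col == 0: R[s][2] = illegal   # cannot move LEFT from the first column
    if col == 4: R[s][3] = illegal   # cannot move RIGHT from the last column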
Step-3: Create the Q matrix
noOfState = 25
noOfAction = 4

# Q matrix
Q = np.zeros((noOfState, noOfAction))
Set the learning rate (alpha) to 0.8 and the discount rate (gamma) to 0.2.
gamma = 0.2
alpha = 0.8
Step-4: Create the supporting functions
Function to return all possible actions that can be taken from any state
Function to select and return one action from all possible actions
Function to return the next state value after applying any specific action
Function to update the Q matrix
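Several of the functions below also call two small validity checkers, verify_state and verify_action, which are not shown in the post. A minimal sketch of what they might look like (assuming states 0-24 and actions 0-3):

# minimal guesses at the two checkers used below (the originals are not shown in the post)
def verify_state(state):
    # a state is valid if it is one of the 25 board cells
    return 0 <= state < 25

def verify_action(action):
    # an action is valid if it is UP (0), DOWN (1), LEFT (2) or RIGHT (3)
    return 0 <= action < 4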
Function to return all possible actions that can be taken from any state
def available_directions_to_go(state):
    curr_state_row = R[state,]
    act = np.where(curr_state_row != -2)
    return act[0]
Function to select and return one action from all possible actions
def next_action_dir(all_psbl_directions):
    next_action_dir = int(np.random.choice(all_psbl_directions))
    return next_action_dir
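Note that next_action_dir explores purely at random, which is fine for this small grid. A common variation is epsilon-greedy selection, which mostly follows the best Q value learned so far. A rough sketch, not part of the original code (the name next_action_dir_eps_greedy is mine):

def next_action_dir_eps_greedy(state, all_psbl_directions, epsilon=0.1):
    # with probability epsilon explore at random, otherwise exploit the current Q estimates
    if np.random.rand() < epsilon:
        return int(np.random.choice(all_psbl_directions))
    q_vals = Q[state, all_psbl_directions]
    return int(all_psbl_directions[int(np.argmax(q_vals))])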
Function to return the next state value after applying any specific action
def next_state(current_state, direction):
    if direction == 0:  # UP direction
        st = current_state - 5
    if direction == 1:  # DOWN direction
        st = current_state + 5
    if direction == 2:  # LEFT direction
        st = current_state - 1
    if direction == 3:  # RIGHT direction
        st = current_state + 1
    if verify_state(st) == False:
        print("6. Something wrong: (current_state, direction) ", current_state, direction)
    return st
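For example, from the knight's starting cell:

print(next_state(0, 3))   # RIGHT from state 0 gives state 1
print(next_state(7, 1))   # DOWN from state 7 gives state 12, an enemy cell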
Function to update the Q matrix
def updateQ(cur_state, action_dir, gamma):
    if verify_state(cur_state) == False or verify_action(action_dir) == False:
        print("3. something is Wrong")
    next_st = next_state(cur_state, action_dir)
    if verify_state(next_st) == False:
        print("4. something is Wrong: (next_st, cur_state, action_dir)", next_st, cur_state, action_dir)
        print(Q)
    # best legal action from the next state according to the current Q estimates
    legal_dirs = available_directions_to_go(next_st)
    index_of_max_val = legal_dirs[np.argmax(Q[next_st, legal_dirs])]
    if verify_action(index_of_max_val) == False:
        print("5. something is Wrong")
    # Q-learning update: Q(s,a) += alpha * (reward + gamma * max Q(s',a') - Q(s,a))
    Q[cur_state, action_dir] = Q[cur_state, action_dir] + alpha * (R[cur_state, action_dir] + gamma * Q[next_st, index_of_max_val] - Q[cur_state, action_dir])
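For example (assuming Q is still all zeros), calling updateQ(17, 1, gamma) moves the knight DOWN from state 17 into state 22, so R[17][1] is 100 and the best Q value in state 22 is still 0. The update then gives Q[17][1] = 0 + 0.8 * (100 + 0.2 * 0 - 0) = 80.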
Step-5: Training
Now it's time to train the Q matrix.
for i in range(50000):
    cur_state = np.random.randint(0, noOfState)
    if verify_state(cur_state) == False:
        print("1. something is Wrong")
    next_act_dir = next_action_dir(available_directions_to_go(cur_state))
    if verify_action(next_act_dir) == False:
        print("2. something is Wrong")
    updateQ(cur_state, next_act_dir, gamma)
    if i % 10000 == 0:
        print(i, "iterations finished.")
Step-6: Testing
After training, the resulting Q matrix can be used to find the path from state 0 to state 22.
print("Q matrix after training:")
for i in range(Q.shape[0]):
    print(i, " ", Q[i,])

start_st = 0
goal_st = 22
path = [start_st]
while start_st != goal_st:
    tmp = Q[start_st,]
    row_max = np.max(tmp[tmp != 0])
    max_reward_act_dir = np.where(Q[start_st,] == row_max)[0][0]
    next_st = next_state(start_st, max_reward_act_dir)
    path.append(next_st)
    start_st = next_st

print("Path from state ", path[0], " to state ", goal_st, ": ", path)
The procedure to find the path is as follows:
6.a. Start from the state of the knight. In our case it is state 0.
6.b. Find the maximum value in that state's row of the Q matrix, ignoring 0. A cell value of 0 represents an illegal movement, so such movements should be ignored.
6.c. Find the corresponding action direction. For example, if the maximum value is at index 2, the knight should move in the LEFT direction.
6.d. Go to the next state and make that state the current state.
6.e. If the current state is not the goal state (the state of the Princess), repeat from Step-6.b.
6.f. If the current state is the goal state, print the sequence of states the knight has visited.
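As a quick sanity check, the resulting path can also be drawn on the 5x5 grid. The helper below is not part of the original code, just a small sketch that reuses the path list produced above:

def print_path_on_grid(path, enemies=(6, 8, 12, 15, 19), goal=22):
    # draw the 5x5 board: '*' = visited state, 'E' = enemy, 'P' = Princess, '.' = empty
    for row in range(5):
        cells = []
        for col in range(5):
            s = row * 5 + col
            if s == goal:
                cells.append('P')
            elif s in enemies:
                cells.append('E')
            elif s in path:
                cells.append('*')
            else:
                cells.append('.')
        print(' '.join(cells))

print_path_on_grid(path)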
Change the values of start_st and goal_st to change the positions of the Knight and the Princess, respectively. Play around with the values of gamma and alpha to train the Q matrix.
You can download the entire Python code here.
That's all for now.
-Thanks