Tuesday, October 9, 2018

Q-learning: solving the Knight and Princess problem from scratch with the Q-learning algorithm


The Problem:

The problem is about finding a path to the Princess while avoiding the enemies. In the figure below, "K" represents the Knight, "P" the Princess, and "E" an Enemy. A blank cell means the knight can move into that cell; e.g., from its starting cell the knight can move only right or down. The knight is killed if it moves into an enemy cell.


Solution:

*You can download the entire Python code here.

We will solve this using the Q-learning algorithm and implement it in Python, without using any reinforcement learning library such as gym.
To implement it from scratch, we need to understand both the Q-learning algorithm and the problem itself. It is recommended to go through the theory of the Q-learning reinforcement learning algorithm first.

Step-1: Represent the problem mathematically.
As we know, the Q-learning algorithm is all about states and actions. In our problem, the current position of the knight is the "state". Let us number all possible positions as states (starting from state 0), as shown in the figure below.


So, the state of the knight is 0, the state of the Princess is 22, and the states of the enemies are 6, 8, 12, 15, and 19. The objective is to move the knight from state 0 to state 22 while avoiding states 6, 8, 12, 15, and 19.
The possible actions from each state are UP (action 0), DOWN (action 1), LEFT (action 2), and RIGHT (action 3). For each movement, the knight receives a reward.
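
Since there are 25 states and the actions move by -5/+5 (up/down) or -1/+1 (left/right), the board is effectively a 5x5 grid read row by row. A minimal sketch of this mapping (the helper names here are only illustrative, they are not part of the original code):

# assumed 5x5 grid: state = row*5 + col, rows and columns numbered from 0
def to_state(row, col):
    return row * 5 + col

def to_cell(state):
    return divmod(state, 5)   # (row, col)

print(to_state(0, 0))   # 0  -> knight's starting state
print(to_state(4, 2))   # 22 -> state of the Princess
print(to_cell(12))      # (2, 2) -> one of the enemy cells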


Step-2: Create the reward matrix


import numpy as np
R = np.full((25,4),-1)
 
For each legal movement the knight receives a reward of -1 (the default value).
However, if the next state is an enemy state, the reward is -100.

# reward for moving into the enemy at state 6 is -100
R[1][1] = -100.0
R[5][3] = -100.0
R[7][2] = -100.0
R[11][0] = -100.0
# reward for moving into the enemy at state 8 is -100
R[3][1] = -100.0
R[7][3] = -100.0
R[9][2] = -100.0
R[13][0] = -100.0
# reward for moving into the enemy at state 12 is -100
R[7][1] = -100.0
R[11][3] = -100.0
R[13][2] = -100.0
R[17][0] = -100.0
# reward for moving into the enemy at state 15 is -100
R[10][1] = -100.0
R[16][2] = -100.0
R[20][0] = -100.0
# reward for moving into the enemy at state 19 is -100
R[14][1] = -100.0
R[18][3] = -100.0
R[24][0] = -100.0

And if the next state is 22 (i.e., the state of the Princess), then the reward value is 100.

# reward for reaching the Princess at state 22 is 100
R[17][1] = 100.0
R[21][3] = 100.0
R[23][2] = 100.0


Illegal movements are marked with a reward value of -2. For example, the knight cannot move UP from state 0.

# -2 represents an illegal movement
illegal = -2
# illegal movement from states in FIRST ROW
R[0][0]=illegal
R[1][0]= illegal
R[2][0]= illegal
R[3][0]= illegal
R[4][0]= illegal
# illegal movement from states in FIRST COLUMN
R[0][2]= illegal
R[5][2]= illegal
R[10][2]= illegal
R[15][2]= illegal
R[20][2]= illegal
# illegal movement from states in LAST ROW
R[20][1]= illegal
R[21][1]= illegal
R[22][1]= illegal
R[23][1]= illegal
R[24][1]= illegal
# illegal movement from states in LAST COLUMN
R[4][3]= illegal
R[9][3]= illegal
R[14][3]= illegal
R[19][3]= illegal
R[24][3]= illegal

Step-3: Create the Q matrix

noOfState = 25
noOfAction = 4

# Q matrix
Q = np.zeros((noOfState,noOfAction))
Set the learning rate (alpha) to 0.8 and the discount rate (gamma) to 0.2.

gamma = 0.2
alpha = 0.8
Step-4: Create the supporting functions
Function to return all possible actions that can be taken from a given state:
def available_directions_to_go(state):
    curr_state_row = R[state,]
    act = np.where(curr_state_row != -2)
    return act[0]
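
For example, for state 0 (the top-left corner) only DOWN and RIGHT are legal, so the call should return actions 1 and 3:

print(available_directions_to_go(0))   # expected: [1 3]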


Function to select and return one action at random from all possible actions:
def next_action_dir(all_psbl_directions):
    next_action_dir = int(np.random.choice(all_psbl_directions))
    return next_action_dir


Function to return the next state after applying a specific action:

def next_state(current_state, direction):
    if direction == 0:   # UP direction
        st = current_state - 5
    elif direction == 1: # DOWN direction
        st = current_state + 5
    elif direction == 2: # LEFT direction
        st = current_state - 1
    elif direction == 3: # RIGHT direction
        st = current_state + 1
    if verify_state(st) == False:
        print("6. Something wrong: (current_state, direction) ", current_state, direction)
    return st
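
Note that next_state and updateQ call two helper functions, verify_state and verify_action, which are not listed in this post (they are part of the downloadable file). A minimal sketch of what they presumably do, namely range-check a state or action index, could look like this:

def verify_state(state):
    # a valid state is one of the 25 board positions (0..24)
    return 0 <= state < noOfState

def verify_action(action):
    # a valid action is one of UP/DOWN/LEFT/RIGHT (0..3)
    return 0 <= action < noOfAction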


Function to update the Q matrix

def updateQ(cur_state, action_dir, gamma):
    if verify_state(cur_state) == False or verify_action(action_dir) == False:
        print("3. something is Wrong")
    next_st = next_state(cur_state, action_dir)
    if verify_state(next_st) == False:
        print("4. something is Wrong: (next_st, cur_state, action_dir)", next_st, cur_state, action_dir)
        print(Q)
    # index of the best (maximum-Q) action available from the next state
    index_of_max_val = np.where(Q[next_st,] == np.max(Q[next_st,]))[0][0]
    if verify_action(index_of_max_val) == False:
        print("5. something is Wrong")
    # Q-learning update: Q(s,a) <- Q(s,a) + alpha*(R(s,a) + gamma*max_a' Q(s',a') - Q(s,a))
    Q[cur_state, action_dir] = Q[cur_state, action_dir] + alpha*(R[cur_state, action_dir] + gamma*Q[next_st, index_of_max_val] - Q[cur_state, action_dir])
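
For reference, the update applied in updateQ above is the standard Q-learning rule:

Q(s, a) = Q(s, a) + alpha * ( R(s, a) + gamma * max over a' of Q(s', a') - Q(s, a) )

where s is the current state, a the action taken, s' the resulting next state, alpha the learning rate, and gamma the discount factor.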

Step-5: Training
Now it's time to train the Q matrix.


for i in range(50000):
    cur_state = np.random.randint(0,noOfState)
    if verify_state(cur_state) == False:
        print("1. something is Wrong")
    next_act_dir = next_action_dir(available_directions_to_go(cur_state))
    if verify_action(next_act_dir) == False:
        print("2. something is Wrong")
    updateQ(cur_state, next_act_dir, gamma)
    if i % 10000 == 0:
        print(i, "iterations finished.")


Step-6: Testing
After training, the resulting Q matrix can be used to find the path from state 0 to state 22.



print("After training below is the Q matrix: ")
for i in range(Q.shape[0]):
    print(i,"  ",Q[i,])
start_st = 0
goal_st = 22
path = [start_st]
while start_st != goal_st:
    tmp = Q[start_st,]
    row_max = np.max(tmp[tmp != 0])
    max_reward_act_dir = np.where(Q[start_st,]==row_max)[0][0]
    next_st =  next_state(start_st, max_reward_act_dir)
    path.append(next_st)
    start_st = next_st
print("Path from state ",start_st," to state ",goal_st,": ",path)


The procedure to find the path is as follows:
6.a.  Start from the state of the knight; in our case it is state 0.
6.b.  Find the maximum value in that state's row of the Q matrix, ignoring entries equal to 0. A cell value of 0 represents an illegal movement, so such moves are skipped.
6.c.  Find the corresponding action direction. For example, if the maximum value is at index 2, the knight should move LEFT.
6.d.  Go to the next state and make it the current state.
6.e.  If the current state is not the goal state (the state of the Princess), repeat from step 6.b.
6.f.  If the current state is the goal state, print the sequence of states the knight has visited.

Change the values of start_st and goal_st to change the positions of the Knight and the Princess, respectively. Play around with the values of gamma and alpha to see how they affect training; note that with a small discount factor such as 0.2 the reward from the Princess decays quickly over long paths, while a larger gamma (e.g., 0.8) lets the goal reward propagate further back through the states.
You can download the entire Python code here.


That's all for now.

-Thanks


Wednesday, October 3, 2018

Understanding Q-learning algorithm with example using Python

Before getting into the Q-learning algorithm, let's first understand the motivation behind it: what kind of questions can be answered by the Q-learning algorithm?

Problem: 
Suppose we have a house with 7 rooms, numbered 0 to 6. The area outside the house is numbered 7. One can go out of the house (i.e., reach 7) only through rooms 0, 2, and 6. Suppose a robot is in room 1 and needs to go outside. The Q-learning algorithm can be applied to find a path from room 1 to the outside of the house.

Q-learning approach:

1. The first step is to convert the above room diagram into a graph of "states" (nodes) and "actions" (edges). For the above house layout, we can derive the following graph:


The number of nodes/states equals the number of rooms plus the outside area. An edge represents an action, i.e., a movement from one room to another. The value on an edge is the reward the robot receives for that move. The robot gets a reward of 100 only when it moves to the outside of the house (i.e., state 7) from room 0, 2, or 6. Moving to any state other than 7 gives a reward of 0.

2. The second step is to convert the graph into a matrix called the R matrix. For the above graph, the R matrix is shown below.
The R matrix has dimension [number of states x number of actions]; in our case it is 8x8, since we have 8 states and 8 actions.


The value in a cell represents the reward on the edge between the corresponding states. A cell value can be:
        -1, if no edge is present;
        0, if there is an edge from the state in the row to the state in the column and the reward value is 0;
        100, if there is an edge from the state in the row to the state in the column and the reward value is 100.
For example, the value in the last column of the first row is 100, because the reward for moving from state 0 to state 7 is 100 in the graph from the first step.
The cell in the 5th column of the second row is -1, because there is no edge from state 1 to state 4 in the graph from the first step.
The cell in the 4th column of the 5th row is 0, because there is an edge from state 4 to state 3 and its reward value is 0 in the graph from the first step.
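
These cell values can be checked directly against the R matrix defined in the Python code at the end of this post, for example:

print(R[0, 7])   # 100: edge from state 0 to state 7 with reward 100
print(R[1, 4])   # -1: no edge from state 1 to state 4
print(R[4, 3])   # 0: edge from state 4 to state 3 with reward 0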

3. The third step is to create a Q matrix with the same dimensions as the R matrix, initialized with zeros (i.e., all cell values are 0).



Decide the value of gamma (the discount factor). Let gamma = 0.8.

Now we have two stages: Training stage and Testing stage

4. The fourth step is the training stage:
In this stage we will train the robot by iteratively updating the Q matrix.
The steps of the training stage are as follows:
4.1.   Randomly select any state; call it current_state.
4.2.   Find all the possible next states, i.e., the states to which the robot can move from current_state.
4.3.   Select one state at random from those found in step 4.2; call it next_state.
4.4.   Now apply the following equation to update the Q matrix (a worked example follows this list):
        Find the maximum value in the (next_state)th row of the Q matrix. Let max_value be that maximum value.
        Q[ current_state , next_state ] = R[ current_state , next_state ] + gamma * max_value
4.5.   Go to Step 4.1 and repeat this process for a certain number of iterations, e.g. 10000.
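
As a worked example (using the R matrix defined in the code below, with gamma = 0.8): suppose current_state = 1 and the randomly chosen next_state = 2, and suppose the Q matrix already holds Q[2][7] = 100 as the maximum value in row 2. Then

Q[1][2] = R[1][2] + 0.8 * max_value = 0 + 0.8 * 100 = 80

so the entry for moving from room 1 to room 2 now reflects that room 2 leads toward the exit.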

Following is the Q matrix after 10000 iterations (also called episodes):


This Q matrix will now be used by the robot to find the path from any room to the outside of the house.

5. The fifth step is the testing stage:
In this step we use the updated Q matrix obtained from the previous step.
Following are the steps for the testing stage:
5.1.  Set cs = initial_state.
5.2.  Find the state s with the maximum value in the (cs)th row of the Q matrix.
5.3.  If s is the goal state (in this example the goal state is 7), return the sequence of all visited states;
        else set cs = s and repeat from step 5.2.

Below is the Python code for the above example. The code prints the Q matrix after every 1000 iterations/episodes. The Python file (Q-learning_example.py) can be downloaded from here.
====================================

import numpy as np

nos = 8 # number of states
goal_state = 7 # the goal state
intermediate_Q_mat_at_step=1000

# R matrix
R = np.matrix([ [-1, 0,-1,-1,-1,-1,-1,100],
                [ 0,-1, 0,-1,-1,-1,-1, -1],
                [-1, 0,-1, 0,-1,-1,-1,100],
                [-1,-1, 0,-1, 0,-1,-1, -1],
                [-1,-1,-1, 0,-1, 0,-1, -1],
                [-1,-1,-1,-1, 0,-1, 0, -1],
                [-1,-1,-1,-1,-1, 0,-1,100],
                [ 0,-1, 0,-1,-1,-1, 0,100] ])

# Q matrix
Q = np.matrix(np.zeros([nos,8]))

# Gamma (discount factor)
gamma = 0.8


def available_actions(state):
    current_state_row = R[state,]
    av_act = np.where(current_state_row >= 0)[1]
    return av_act


# A random action is selected from the available actions
def sample_next_action(available_actions_range):
    next_action = int(np.random.choice(available_actions_range, 1))
    return next_action


# update the Q-matrix for training phase
def update(current_state, action, gamma):
    
    max_index = np.where(Q[action,] == np.max(Q[action,]))[1]

    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index, size = 1))
    else:
        max_index = int(max_index)
        
    # Q learning formula
    Q[current_state, action] = R[current_state, action] + gamma *  Q[action, max_index]
    


############ Training stage ###################################


# Train for 10000 iterations.
for i in range(10000): 
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    update(current_state,action,gamma)
    # print the intermediate (normalized) Q matrix; skip normalization while Q is still all zeros
    if((i % intermediate_Q_mat_at_step) == 0):
        print("Q-matrix at step ", i)
        if np.max(Q) > 0:
            print(Q/np.max(Q)*100)
        else:
            print(Q)
    
# Normalize the "trained" Q matrix
print("Trained Q matrix:")
print(Q/np.max(Q)*100)


###################### Testing Stage  ###################################


current_state = 4  # starting room for the test run; the problem statement uses room 1, so feel free to change this
steps = [current_state]

while current_state != goal_state:

    next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,]))[1]
    
    if next_step_index.shape[0] > 1:
        next_step_index = int(np.random.choice(next_step_index, size = 1))
    else:
        next_step_index = int(next_step_index)
    
    steps.append(next_step_index)
    current_state = next_step_index

# Print selected sequence of steps
print("Selected path:")
print(steps)



Tuesday, October 2, 2018

Regression vs Classification in Machine Learning

REGRESSION :
  • The output variable takes continuous values
  • Regression predicts a value from a continuous set
  • Regression involves estimating or predicting a response
  • e.g. The price of a house, depending on its 'size' and 'location', is some numerical value, and this numerical value can be continuous.
  • Given  f : x -> y,  if y is a real number / continuous, then this is a regression problem.
  • Regression models make predictions that answer questions like the following:
    • What is the value of a house in Taipei, Taiwan?
    • What is the probability that a user clicks on this ad?
  • Different types of regression:
    • Linear regression
    • Logistic regression
    • Polynomial regression
    • Stepwise regression
    • Ridge regression
    • Lasso regression
    • ElasticNet regression
    • and many more ....

CLASSIFICATION :
  • The output variable takes class labels
  • Classification predicts the "belonging" to a class
  • Classification is identifying group membership
  • e.g. The prediction of a house price can be classified into 'very costly', 'costly', 'affordable', 'cheap', and 'very cheap'. Here each class may correspond to some range of values.
  • Given  f : x -> y,  if y is a discrete/categorical variable, then this is a classification problem.
  • Classification models predict discrete values that answer questions like the following:
    • Is the given email message spam or not spam?
    • Is this an image of a dog, a cat, or a hamster?
  • Different classification methods:
    • Linear classifier: Logistic regression, Naive Bayes classifier
    • Support Vector Machines
    • Decision Trees
    • Boosted Trees
    • Random forest
    • Neural Networks
    • Nearest neighbor
    • and many more ....
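
To make the distinction concrete, here is a minimal sketch using scikit-learn (the library and the toy numbers are assumptions of this example, not part of the post): the same feature data is fed to a regressor, which predicts a number, and to a classifier, which predicts a label.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# toy feature: house size in square meters
X = np.array([[50], [80], [120], [200]])

# regression target: a continuous price (in arbitrary currency units)
y_price = np.array([100.0, 160.0, 240.0, 400.0])
reg = LinearRegression().fit(X, y_price)
print(reg.predict([[100]]))        # -> a continuous number, here about 200

# classification target: a discrete label ('cheap' vs 'costly')
y_label = np.array(['cheap', 'cheap', 'costly', 'costly'])
clf = LogisticRegression().fit(X, y_label)
print(clf.predict([[100]]))        # -> a class label, 'cheap' or 'costly'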


