Tuesday, October 9, 2018

Q-learning: solving the Knight and Princess problem from scratch with the Q-learning algorithm


The Problem:

The problem is about finding a path to the Princess while avoiding the enemies. In the figure below, "K" represents the Knight, "P" the Princess, and "E" an Enemy. A blank cell means the knight can move into that cell; e.g., from its starting cell the knight can move only right or down. The knight is killed if it moves into an enemy cell.


Solution:

*You can download the entire Python code here.

We will solve this using the Q-learning algorithm and implement it in Python, without using any reinforcement learning library such as gym.
To implement it from scratch, we need to understand both the Q-learning algorithm and the problem itself. It is recommended to go through the theory of the Q-learning reinforcement learning algorithm first.

Step-1: Represent the problem mathematically.
As we know, the Q-learning algorithm is all about states and actions. In our problem, the current position of the knight is the "state". Let us number all possible positions as states (starting from state 0), as shown in the figure below.


So, the state of the knight is 0, the state of the Princess is 22, and the states of the enemies are 6, 8, 12, 15, and 19. The objective is to move the knight from state 0 to state 22 while avoiding states 6, 8, 12, 15, and 19.
The possible actions from each state are UP (action 0), DOWN (action 1), LEFT (action 2), and RIGHT (action 3). For each movement, the knight receives a reward.
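
Since there are 25 states and the actions move by -5/+5 (up/down) or -1/+1 (left/right), the board is effectively a 5x5 grid read row by row. A minimal sketch of this mapping (the helper names here are only illustrative, they are not part of the original code):

# assumed 5x5 grid: state = row*5 + col, rows and columns numbered from 0
def to_state(row, col):
    return row * 5 + col

def to_cell(state):
    return divmod(state, 5)   # (row, col)

print(to_state(0, 0))   # 0  -> knight's starting state
print(to_state(4, 2))   # 22 -> state of the Princess
print(to_cell(12))      # (2, 2) -> one of the enemy cells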


Step-2: Create the reward matrix


import numpy as np
R = np.full((25,4),-1)
 
For each legal movement the knight receives a reward of -1 (the default value).
However, if the next state is an enemy state, the reward is -100.

# reward for moving into the enemy at state 6 is -100
R[1][1] = -100.0
R[5][3] = -100.0
R[7][2] = -100.0
R[11][0] = -100.0
# reward for moving into the enemy at state 8 is -100
R[3][1] = -100.0
R[7][3] = -100.0
R[9][2] = -100.0
R[13][0] = -100.0
# reward for moving into the enemy at state 12 is -100
R[7][1] = -100.0
R[11][3] = -100.0
R[13][2] = -100.0
R[17][0] = -100.0
# reward for moving into the enemy at state 15 is -100
R[10][1] = -100.0
R[16][2] = -100.0
R[20][0] = -100.0
# reward for moving into the enemy at state 19 is -100
R[14][1] = -100.0
R[18][3] = -100.0
R[24][0] = -100.0

And if the next state is 22 (i.e., the state of the Princess), then the reward value is 100.

# reward for reaching the Princess at state 22 is 100
R[17][1] = 100.0
R[21][3] = 100.0
R[23][2] = 100.0


Illegal movements are marked with a reward value of -2. For example, the knight cannot move UP from state 0.

# -2 represents an illegal movement
illegal = -2
# illegal movement from states in FIRST ROW
R[0][0]=illegal
R[1][0]= illegal
R[2][0]= illegal
R[3][0]= illegal
R[4][0]= illegal
# illegal movement from states in FIRST COLUMN
R[0][2]= illegal
R[5][2]= illegal
R[10][2]= illegal
R[15][2]= illegal
R[20][2]= illegal
# illegal movement from states in LAST ROW
R[20][1]= illegal
R[21][1]= illegal
R[22][1]= illegal
R[23][1]= illegal
R[24][1]= illegal
# illegal movement from states in LAST COLUMN
R[4][3]= illegal
R[9][3]= illegal
R[14][3]= illegal
R[19][3]= illegal
R[24][3]= illegal

Step-3: Create the Q matrix

noOfState = 25
noOfAction = 4

# Q matrix
Q = np.zeros((noOfState,noOfAction))
Set the learning rate (alpha) to 0.8 and the discount rate (gamma) to 0.2.

gamma = 0.2
alpha = 0.8
Step-4: Create the supporting functions
Function to return all possible actions that can be taken from a given state:
def available_directions_to_go(state):
    curr_state_row = R[state,]
    act = np.where(curr_state_row != -2)
    return act[0]
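
For example, for state 0 (the top-left corner) only DOWN and RIGHT are legal, so the call should return actions 1 and 3:

print(available_directions_to_go(0))   # expected: [1 3]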


Function to select and return one action at random from all possible actions:
def next_action_dir(all_psbl_directions):
    next_action_dir = int(np.random.choice(all_psbl_directions))
    return next_action_dir


Function to return the next state after applying a specific action:

def next_state(current_state, direction):
    if direction == 0:   # UP direction
        st = current_state - 5
    elif direction == 1: # DOWN direction
        st = current_state + 5
    elif direction == 2: # LEFT direction
        st = current_state - 1
    elif direction == 3: # RIGHT direction
        st = current_state + 1
    if verify_state(st) == False:
        print("6. Something wrong: (current_state, direction) ", current_state, direction)
    return st
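
Note that next_state and updateQ call two helper functions, verify_state and verify_action, which are not listed in this post (they are part of the downloadable file). A minimal sketch of what they presumably do, namely range-check a state or action index, could look like this:

def verify_state(state):
    # a valid state is one of the 25 board positions (0..24)
    return 0 <= state < noOfState

def verify_action(action):
    # a valid action is one of UP/DOWN/LEFT/RIGHT (0..3)
    return 0 <= action < noOfAction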


Function to update the Q matrix

def updateQ(cur_state, action_dir, gamma):
    if verify_state(cur_state) == False or verify_action(action_dir) == False:
        print("3. something is Wrong")
    next_st = next_state(cur_state, action_dir)
    if verify_state(next_st) == False:
        print("4. something is Wrong: (next_st, cur_state, action_dir)", next_st, cur_state, action_dir)
        print(Q)
    # index of the best (maximum-Q) action available from the next state
    index_of_max_val = np.where(Q[next_st,] == np.max(Q[next_st,]))[0][0]
    if verify_action(index_of_max_val) == False:
        print("5. something is Wrong")
    # Q-learning update: Q(s,a) <- Q(s,a) + alpha*(R(s,a) + gamma*max_a' Q(s',a') - Q(s,a))
    Q[cur_state, action_dir] = Q[cur_state, action_dir] + alpha*(R[cur_state, action_dir] + gamma*Q[next_st, index_of_max_val] - Q[cur_state, action_dir])
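
For reference, the update applied in updateQ above is the standard Q-learning rule:

Q(s, a) = Q(s, a) + alpha * ( R(s, a) + gamma * max over a' of Q(s', a') - Q(s, a) )

where s is the current state, a the action taken, s' the resulting next state, alpha the learning rate, and gamma the discount factor.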

Step-5: Training
Now it's time to train the Q matrix.


for i in range(50000):
    cur_state = np.random.randint(0,noOfState)
    if verify_state(cur_state) == False:
        print("1. something is Wrong")
    next_act_dir = next_action_dir(available_directions_to_go(cur_state))
    if verify_action(next_act_dir) == False:
        print("2. something is Wrong")
    updateQ(cur_state, next_act_dir, gamma)
    if i % 10000 == 0:
        print(i, "iterations finished.")


Step-6: Testing
After training, the resulting Q matrix can be used to find the path from state 0 to state 22.



print("After training below is the Q matrix: ")
for i in range(Q.shape[0]):
    print(i,"  ",Q[i,])
start_st = 0
goal_st = 22
path = [start_st]
while start_st != goal_st:
    tmp = Q[start_st,]
    row_max = np.max(tmp[tmp != 0])
    max_reward_act_dir = np.where(Q[start_st,]==row_max)[0][0]
    next_st =  next_state(start_st, max_reward_act_dir)
    path.append(next_st)
    start_st = next_st
print("Path from state ",start_st," to state ",goal_st,": ",path)


The procedure to find the path is as follows:
6.a.  Start from the state of the knight; in our case it is state 0.
6.b.  Find the maximum value in that state's row of the Q matrix, ignoring entries equal to 0. A cell value of 0 represents an illegal movement, so such moves are skipped.
6.c.  Find the corresponding action direction. For example, if the maximum value is at index 2, the knight should move LEFT.
6.d.  Go to the next state and make it the current state.
6.e.  If the current state is not the goal state (the state of the Princess), repeat from step 6.b.
6.f.  If the current state is the goal state, print the sequence of states the knight has visited.

Change the values of start_st and goal_st to change the positions of the Knight and the Princess, respectively. Play around with the values of gamma and alpha to see how they affect training; note that with a small discount factor such as 0.2 the reward from the Princess decays quickly over long paths, while a larger gamma (e.g., 0.8) lets the goal reward propagate further back through the states.
You can download the entire Python code here.


That's all for now.

-Thanks


Wednesday, October 3, 2018

Understanding Q-learning algorithm with example using Python

Before getting into the Q-learning algorithm, let's first understand the motivation behind it: what kind of questions can be answered by the Q-learning algorithm?

Problem: 
Suppose we have a house with 7 rooms, numbered 0 to 6. The area outside the house is numbered 7. One can go out of the house (i.e., reach 7) only through rooms 0, 2, and 6. Suppose a robot is in room 1 and needs to go outside. The Q-learning algorithm can be applied to find a path from room 1 to the outside of the house.

Q-learning approach:

1. The first step is to convert the above room diagram into a graph of "states" (nodes) and "actions" (edges). For the above house layout, we can derive the following graph:


The number of nodes/states equals the number of rooms plus the outside area. An edge represents an action, i.e., a movement from one room to another. The value on an edge is the reward the robot receives for that move. The robot gets a reward of 100 only when it moves to the outside of the house (i.e., state 7) from room 0, 2, or 6. Moving to any state other than 7 gives a reward of 0.

2. The second step is to convert the graph into a matrix called the R matrix. For the above graph, the R matrix is shown below.
The R matrix has dimension [number of states x number of actions]; in our case it is 8x8, since we have 8 states and 8 actions.


The value in a cell represents the reward on the edge between the corresponding states. A cell value can be:
        -1, if no edge is present;
        0, if there is an edge from the state in the row to the state in the column and the reward value is 0;
        100, if there is an edge from the state in the row to the state in the column and the reward value is 100.
For example, the value in the last column of the first row is 100, because the reward for moving from state 0 to state 7 is 100 in the graph from the first step.
The cell in the 5th column of the second row is -1, because there is no edge from state 1 to state 4 in the graph from the first step.
The cell in the 4th column of the 5th row is 0, because there is an edge from state 4 to state 3 and its reward value is 0 in the graph from the first step.
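
These cell values can be checked directly against the R matrix defined in the Python code at the end of this post, for example:

print(R[0, 7])   # 100: edge from state 0 to state 7 with reward 100
print(R[1, 4])   # -1: no edge from state 1 to state 4
print(R[4, 3])   # 0: edge from state 4 to state 3 with reward 0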

3. The third step is to create a Q matrix with the same dimensions as the R matrix, initialized with zeros (i.e., all cell values are 0).



Decide the value of gamma (the discount factor). Let gamma = 0.8.

Now we have two stages: Training stage and Testing stage

4. The fourth step is the training stage:
In this stage we will train the robot by iteratively updating the Q matrix.
The steps of the training stage are as follows:
4.1.   Randomly select any state; call it current_state.
4.2.   Find all the possible next states, i.e., the states to which the robot can move from current_state.
4.3.   Select one state at random from those found in step 4.2; call it next_state.
4.4.   Now apply the following equation to update the Q matrix (a worked example follows this list):
        Find the maximum value in the (next_state)th row of the Q matrix. Let max_value be that maximum value.
        Q[ current_state , next_state ] = R[ current_state , next_state ] + gamma * max_value
4.5.   Go to Step 4.1 and repeat this process for a certain number of iterations, e.g. 10000.
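
As a worked example (using the R matrix defined in the code below, with gamma = 0.8): suppose current_state = 1 and the randomly chosen next_state = 2, and suppose the Q matrix already holds Q[2][7] = 100 as the maximum value in row 2. Then

Q[1][2] = R[1][2] + 0.8 * max_value = 0 + 0.8 * 100 = 80

so the entry for moving from room 1 to room 2 now reflects that room 2 leads toward the exit.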

Following is the Q matrix after 10000 iterations (also called episodes):


This Q matrix will now be used by the robot to find the path from any room to the outside of the house.

5. The fifth step is the testing stage:
In this step we use the updated Q matrix obtained from the previous step.
Following are the steps for the testing stage:
5.1.  Set cs = initial_state.
5.2.  Find the state s with the maximum value in the (cs)th row of the Q matrix.
5.3.  If s is the goal state (in this example the goal state is 7), return the sequence of all visited states;
        else set cs = s and repeat from step 5.2.

Below is the Python code for the above example. The code prints the Q matrix after every 1000 iterations/episodes. The Python file (Q-learning_example.py) can be downloaded from here.
====================================

import numpy as np

nos = 8 # number of states
goal_state = 7 # the goal state
intermediate_Q_mat_at_step=1000

# R matrix
R = np.matrix([ [-1, 0,-1,-1,-1,-1,-1,100],
                [ 0,-1, 0,-1,-1,-1,-1, -1],
                [-1, 0,-1, 0,-1,-1,-1,100],
                [-1,-1, 0,-1, 0,-1,-1, -1],
                [-1,-1,-1, 0,-1, 0,-1, -1],
                [-1,-1,-1,-1, 0,-1, 0, -1],
                [-1,-1,-1,-1,-1, 0,-1,100],
                [ 0,-1, 0,-1,-1,-1, 0,100] ])

# Q matrix
Q = np.matrix(np.zeros([nos,8]))

# Gamma (discount factor)
gamma = 0.8


def available_actions(state):
    current_state_row = R[state,]
    av_act = np.where(current_state_row >= 0)[1]
    return av_act


# A random action is selected from the available actions
def sample_next_action(available_actions_range):
    next_action = int(np.random.choice(available_actions_range, 1))
    return next_action


# update the Q-matrix for training phase
def update(current_state, action, gamma):
    
    max_index = np.where(Q[action,] == np.max(Q[action,]))[1]

    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index, size = 1))
    else:
        max_index = int(max_index)
        
    # Q learning formula
    Q[current_state, action] = R[current_state, action] + gamma *  Q[action, max_index]
    


############ Training stage ###################################


# Train for 10000 iterations.
for i in range(10000): 
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    update(current_state,action,gamma)
    # print the intermediate (normalized) Q matrix; skip normalization while Q is still all zeros
    if((i % intermediate_Q_mat_at_step) == 0):
        print("Q-matrix at step ", i)
        if np.max(Q) > 0:
            print(Q/np.max(Q)*100)
        else:
            print(Q)
    
# Normalize the "trained" Q matrix
print("Trained Q matrix:")
print(Q/np.max(Q)*100)


###################### Testing Stage  ###################################


current_state = 4  # starting room for the test run; the problem statement uses room 1, so feel free to change this
steps = [current_state]

while current_state != goal_state:

    next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,]))[1]
    
    if next_step_index.shape[0] > 1:
        next_step_index = int(np.random.choice(next_step_index, size = 1))
    else:
        next_step_index = int(next_step_index)
    
    steps.append(next_step_index)
    current_state = next_step_index

# Print selected sequence of steps
print("Selected path:")
print(steps)



Tuesday, October 2, 2018

Regression vs Classification in Machine Learning

REGRESSION :
  • The output variable takes continuous values
  • Regression predicts a value from a continuous set
  • Regression involves estimating or predicting a response
  • e.g. The price of a house, depending on its 'size' and 'location', is some numerical value, and this numerical value can be continuous.
  • Given  f : x -> y,  if y is a real number / continuous, then this is a regression problem.
  • Regression models make predictions that answer questions like the following:
    • What is the value of a house in Taipei, Taiwan?
    • What is the probability that a user clicks on this ad?
  • Different types of regression:
    • Linear regression
    • Logistic regression
    • Polynomial regression
    • Stepwise regression
    • Ridge regression
    • Lasso regression
    • ElasticNet regression
    • and many more ....

CLASSIFICATION :
  • The output variable takes class labels
  • Classification predicts the "belonging" to a class
  • Classification is identifying group membership
  • e.g. The prediction of a house price can be classified into 'very costly', 'costly', 'affordable', 'cheap', and 'very cheap'. Here each class may correspond to some range of values.
  • Given  f : x -> y,  if y is a discrete/categorical variable, then this is a classification problem.
  • Classification models predict discrete values that answer questions like the following:
    • Is the given email message spam or not spam?
    • Is this an image of a dog, a cat, or a hamster?
  • Different classification methods:
    • Linear classifier: Logistic regression, Naive Bayes classifier
    • Support Vector Machines
    • Decision Trees
    • Boosted Trees
    • Random forest
    • Neural Networks
    • Nearest neighbor
    • and many more ....
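
To make the distinction concrete, here is a minimal sketch using scikit-learn (the library and the toy numbers are assumptions of this example, not part of the post): the same feature data is fed to a regressor, which predicts a number, and to a classifier, which predicts a label.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# toy feature: house size in square meters
X = np.array([[50], [80], [120], [200]])

# regression target: a continuous price (in arbitrary currency units)
y_price = np.array([100.0, 160.0, 240.0, 400.0])
reg = LinearRegression().fit(X, y_price)
print(reg.predict([[100]]))        # -> a continuous number, here about 200

# classification target: a discrete label ('cheap' vs 'costly')
y_label = np.array(['cheap', 'cheap', 'costly', 'costly'])
clf = LogisticRegression().fit(X, y_label)
print(clf.predict([[100]]))        # -> a class label, 'cheap' or 'costly'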


