
Logistic Regression Training: Gradient Descent

  • Author: Shin Yoonah
  • Jul 29, 2022
  • 3 min read

Last updated: Aug 8, 2022

The previous post talked about the decision plane; today I will show you how to determine that plane


Basically, we use a dataset of images to train the classifier

=> Once trained, when we have an unknown sample, we can classify the image


Cost and Loss

Training is where you find the best learnable parameters of the decision boundary


We need a way to determine how good our decision boundary is


Loss: Classification Error

Loss tells you how good your prediction is


=> Classification loss

First column: output of the loss function

- Each time our prediction is correct, the loss function will output a zero

- Each time our prediction is incorrect, the loss function will output a one
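
The zero-one classification loss described above can be sketched as a tiny function (the labels below are hypothetical values, just for illustration):

```python
# Zero-one classification loss: 0 when the prediction is correct, 1 when it is wrong.

def zero_one_loss(y_true, y_pred):
    """y_true and y_pred are class labels (0 or 1)."""
    return 0 if y_true == y_pred else 1

print(zero_one_loss(1, 1))  # correct prediction   -> 0
print(zero_one_loss(1, 0))  # incorrect prediction -> 1
```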


Cost: Classification Error

Cost is the sum of the losses over all samples

It tells us how good our learnable parameters are doing on the dataset


Each incorrectly classified sample has a loss of one, increasing the cost

Each correctly classified sample has a loss of zero, so it does not change the cost


The cost is a function of the learnable parameters
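
Summing the per-sample losses over the dataset can be sketched as follows (the labels and predictions are made-up illustrative values):

```python
# The cost is the sum of the zero-one losses over the whole dataset.

def zero_one_loss(y_true, y_pred):
    return 0 if y_true == y_pred else 1

def cost(y_true_list, y_pred_list):
    return sum(zero_one_loss(t, p) for t, p in zip(y_true_list, y_pred_list))

labels      = [0, 1, 1, 0, 1]
predictions = [0, 1, 0, 1, 1]     # two samples are misclassified
print(cost(labels, predictions))  # -> 2
```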


Let's see the relationship between cost and the decision boundary


First Line

The first line misclassifies the following points => the value of the cost is 3


Second Line

The second line misclassifies the following one point => the value of the cost is 1



Third Line

The final line performs perfectly => the value of the cost is 0


In reality, the cost is a function of multiple parameters w and b

cost (w,b)


*Even in simple 2D examples, it has too many parameters to plot*


Cost: Cross Entropy

It uses the cross-entropy loss, which takes the output of the logistic function (a probability) as opposed to the hard prediction ŷ (y-hat)


Cross entropy deals with how likely the image belongs to a specific class

- If the likelihood of belonging to an incorrect class is large, the cross entropy loss in turn will be large

- If the likelihood of belonging to the correct class is large, the cross entropy is small, but not zero
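
A minimal sketch of the binary cross-entropy loss, assuming a label y of 0 or 1 and a logistic output p = P(y = 1 | x):

```python
import math

def cross_entropy_loss(y_true, p):
    """y_true is 0 or 1; p is the predicted probability that y_true == 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# High probability on the correct class -> small loss, but not zero.
print(cross_entropy_loss(1, 0.99))
# High probability on the wrong class -> large loss.
print(cross_entropy_loss(1, 0.01))
```

Unlike the zero-one loss, this loss is never exactly zero for a probability strictly between 0 and 1, and it grows without bound as the probability of the wrong class approaches 1.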


Gradient Descent

It is a method to find the best learnable parameters


Training: Logistic

If we find the minimum of the cost, we can find the best parameter


The gradient gives you the slope of a function


At any point on the cost curve, we can compute the gradient

The purpose of gradient descent is to find the minimum of the cost function


Let's see how gradient descent works!!

If we start off with a random guess for the bias parameter, we use a superscript to indicate the iteration number



In this case, we have to move our guess in the positive direction

Move the parameter in that direction by adding a positive number


Examining the sign of the gradient, it has the opposite sign of the number we need to add

--> Therefore, we can add an amount proportional to the negative of the gradient


Subtracting the gradient also works if we are on the other side of the minimum

--> There we would like to move in the negative direction

We can move the parameter value in the negative direction by adding a negative number to the parameter

=> Examining the sign of the gradient, it again has the opposite sign of the number we need to add

Therefore, we can add an amount proportional to the negative of the gradient


Final equation of gradient descent

We update the parameter by adding an amount proportional to the negative of the gradient; the superscript i indicates the iteration:

b⁽ⁱ⁺¹⁾ = b⁽ⁱ⁾ − η · (d cost/db at b⁽ⁱ⁾)

Eta (η), the learning rate, dictates how far we move in the direction of the negative gradient

η is usually a small number

The new values for the bias parameter decrease the cost

When we reach the minimum, the gradient is zero


However, η should have a proper rate to reach the minimum

- when η is too large, the parameter may oscillate and never reach the minimum

- when η is too small, it may take so many iterations that it never reaches the minimum in practice


The learning rate is a hyperparameter; we select it by finding the value that gives the best accuracy on validation data
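
The effect of the learning rate can be sketched on the same illustrative quadratic cost (b − 3)²; the rates below are hypothetical, the point is the qualitative behaviour:

```python
# Run several gradient descent steps on cost(b) = (b - 3)**2 with a given eta.

def descend(eta, steps=20, b=0.0):
    for _ in range(steps):
        b = b - eta * 2 * (b - 3)   # b^(i+1) = b^(i) - eta * gradient
    return b

print(descend(0.1))     # reasonable rate: converges close to the minimum at 3
print(descend(1.1))     # too large: overshoots, oscillates, and diverges
print(descend(0.001))   # too small: barely moves away from 0 in 20 steps
```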


Relationship between the cost function and the decision plane

Each iteration of gradient descent finds a parameter b that decreases the cost, and the decision plane does a better job of separating the classes


Training: Threshold

It's challenging to perform gradient descent on the threshold function

--> If we get stuck at a point where the slope is zero, the gradient will be zero and the parameter will not update


Using gradient descent to minimize the cost

=> the decision plane has multiple parameters

Therefore, the gradient is a vector of partial derivatives


We update each parameter using its own partial derivative, so the update is a vector operation


We can plot the cost as a surface over the two parameters (a bowl shape)

When we update the parameter, it will find the minimum

We plot the cost with respect to each iteration i => this is called the learning curve
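
The whole procedure above can be sketched as a small training loop on a tiny made-up 1-D dataset (the feature values and labels are hypothetical), recording the cost at each iteration to form the learning curve:

```python
import math

# Training logistic regression with gradient descent on an illustrative dataset.
X = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]   # hypothetical feature values
Y = [0, 0, 0, 1, 1, 1]                   # class labels

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, eta = 0.0, 0.0, 0.5                # initial guesses and learning rate
learning_curve = []

for i in range(200):
    # Average cross-entropy cost and its gradients with respect to w and b.
    cost = grad_w = grad_b = 0.0
    for x, y in zip(X, Y):
        p = sigmoid(w * x + b)           # logistic output, a probability
        cost   += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        grad_w += (p - y) * x            # d(loss)/dw for this sample
        grad_b += (p - y)                # d(loss)/db for this sample
    n = len(X)
    cost, grad_w, grad_b = cost / n, grad_w / n, grad_b / n
    learning_curve.append(cost)
    w -= eta * grad_w                    # gradient descent updates
    b -= eta * grad_b

# The learning curve decreases as the parameters improve.
print(learning_curve[0], learning_curve[-1])
```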



The more parameters you have, the more images and iterations you need to make the model work!!


Copyright Coursera All rights reserved


