Logistic Regression Training: Gradient Descent
- Shin Yoonah
- July 29, 2022
- 3 min read
Last modified: August 8, 2022
The previous post talked about the decision plane; today I will show you how to determine that plane
Basically, we use a dataset of images to train the classifier
=> When we have an unknown sample, we can classify the image
Cost and Loss
Training is where you find the best learnable parameters of the decision boundary
We need a way to determine how good our decision boundary is
Loss: Classification Error
Loss tells you how good your prediction is
=> Classification loss

First column: output of the loss function
- Each time our prediction is correct, the loss function will output a zero
- Each time our prediction is incorrect, the loss function will output a one
Cost: Classification Error
Cost is the sum of the losses over all samples

It tells us how well our learnable parameters are doing on the dataset
Each incorrectly classified sample has a loss of one, increasing the cost
Each correctly classified sample has a loss of zero and does not change the cost
The cost is a function of the learnable parameters
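The 0-1 loss and the cost above can be sketched in a few lines (a minimal illustration; the sample labels are made up):

```python
def zero_one_loss(y_hat, y):
    """0-1 loss: 0 when the prediction is correct, 1 when it is incorrect."""
    return int(y_hat != y)

def cost(predictions, labels):
    """Cost: the sum of the per-sample losses over the dataset."""
    return sum(zero_one_loss(p, t) for p, t in zip(predictions, labels))

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 0]     # two incorrect predictions
print(cost(y_pred, y_true))  # → 2
```

Each mistake adds one to the cost, so a cost of zero means the decision boundary classifies every sample correctly.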
Let's see the relationship between cost and the decision boundary
First Line

The first line misclassifies the following three points => the value of the cost is 3
Second Line

The second line misclassifies the following one point => the value of the cost is 1
Third Line

The final line performs perfectly => the value of the cost is 0
In reality, the cost is a function of multiple parameters w and b
cost(w, b)
*Even in simple 2D examples, it has too many parameters to plot*
Cost: Cross Entropy
It uses the cross entropy loss, which works with the output of the logistic function rather than the thresholded prediction ŷ
Cross entropy measures how likely it is that the image belongs to a specific class

- If the likelihood of belonging to an incorrect class is large, the cross entropy loss will in turn be large
- If the likelihood of belonging to the correct class is large, the cross entropy is small, but not zero
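Both properties above can be checked with the standard binary cross entropy formula (a minimal sketch; the probability values are made up):

```python
import math

def cross_entropy(p, y):
    """Binary cross entropy for one sample.
    p: output of the logistic function = predicted probability that y == 1.
    y: true label, 0 or 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# High probability for the correct class: small loss, but not zero
print(round(cross_entropy(0.9, 1), 4))  # → 0.1054
# High probability for the incorrect class: large loss
print(round(cross_entropy(0.9, 0), 4))  # → 2.3026
```

Because the logistic output is a probability strictly between 0 and 1, the loss is never exactly zero, but it shrinks as the prediction approaches the true label.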
Gradient Descent
It is a method to find the best learnable parameters
Training: Logistic
If we find the minimum of the cost, we can find the best parameters
The gradient gives you the slope of the function at any point, and gradient descent follows it

The purpose of gradient descent is to find the minimum of the cost function
Let's see how gradient descent works!!
We start off with a random guess for the bias parameter, using a superscript to indicate the iteration number

In this case, we have to move our guess in the positive direction
We move the parameter in that direction by adding a positive number
Examining the sign of the gradient, it is the opposite of the sign of the step we need
--> Therefore, we can add a number proportional to the negative of the gradient
Subtracting the gradient also works if we are on the other side of the minimum
--> There, we would like to move in the negative direction
We can move the parameter value in the negative direction by adding a negative number to the parameter
=> Examining the sign of the gradient, it is again the opposite of the sign of the step we need
Therefore, we can add an amount proportional to the negative of the gradient
Final equation of gradient descent

We update the parameter by adding a number proportional to the negative of the gradient; the variable i indicates the iteration
Eta (η) dictates how far we move in the direction opposite to the gradient
η is usually a small number
The new values for the bias parameter decrease the cost
When we reach the minimum, the gradient is zero
However, η should have a proper value to reach the minimum
- when η is too large, the parameter may oscillate and never reach the minimum
- when η is too small, progress is so slow that the parameter may never reach the minimum
The learning rate is a hyperparameter that we select by finding the value that gives the best accuracy on validation data
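The update rule above can be sketched on a simple quadratic cost (a stand-in for the real cost; the toy function, starting point, and learning rate are made up for illustration):

```python
def gradient_descent(grad, b0, eta, steps):
    """Repeatedly apply b_{i+1} = b_i - eta * grad(b_i):
    move by an amount proportional to the negative of the gradient."""
    b = b0
    for _ in range(steps):
        b = b - eta * grad(b)
    return b

# Toy cost: cost(b) = (b - 3)^2, minimum at b = 3; its gradient is 2*(b - 3)
grad = lambda b: 2 * (b - 3)
print(round(gradient_descent(grad, b0=0.0, eta=0.1, steps=100), 4))  # → 3.0
```

With η = 0.1 the guess converges to the minimum; try η = 1.1 on the same toy cost and the parameter oscillates with growing amplitude and diverges, which is the large-learning-rate failure described above.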
Relationship between the cost function and the decision plane
Each iteration of gradient descent finds a parameter b that decreases the cost, and the decision plane does a better job of separating the classes
Training: Threshold
It's challenging to perform gradient descent on the threshold function
--> If we get stuck at a point where the slope is zero, the gradient will be zero and the parameter will not update
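This flat-gradient problem can be verified numerically (a small sketch; the test point z = 2.0 and the central-difference helper are my own choices):

```python
import math

def numeric_grad(f, x, h=1e-5):
    """Central-difference estimate of the slope f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

threshold = lambda z: 1.0 if z > 0 else 0.0
logistic = lambda z: 1 / (1 + math.exp(-z))

# Away from z = 0 the threshold function is flat: zero gradient, no update signal
print(numeric_grad(threshold, 2.0))           # → 0.0
# The logistic function always has a nonzero slope to follow
print(round(numeric_grad(logistic, 2.0), 4))  # → 0.105
```

This is why training uses the logistic function: its smooth slope gives gradient descent a direction everywhere, while the threshold function's gradient is zero almost everywhere.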
We use gradient descent to minimize the cost
=> the decision plane has multiple parameters
Therefore, the gradient is a vector
We update the parameters with this vector of partial derivatives
Over two parameters, we can plot the cost as a surface (bowl shape)

As we update the parameters, they approach the minimum
We plot the cost with respect to each iteration i => this is the learning curve
The more parameters you have, the more images and iterations you need to make the model work!!
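Putting the pieces together, here is a minimal end-to-end sketch: logistic regression trained by gradient descent on a tiny 2D dataset, recording the cost at every iteration to form the learning curve (the dataset, learning rate, and iteration count are all made up for illustration):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train(X, y, eta=0.5, iters=200):
    """Gradient descent on the averaged cross entropy cost.
    Returns the weights, bias, and the learning curve (cost per iteration)."""
    w, b = [0.0, 0.0], 0.0
    curve = []
    n = len(X)
    for _ in range(iters):
        grad_w, grad_b, c = [0.0, 0.0], 0.0, 0.0
        for x, t in zip(X, y):
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            c += -(t * math.log(p) + (1 - t) * math.log(1 - p)) / n
            for j in range(2):           # the gradient is a vector:
                grad_w[j] += (p - t) * x[j] / n
            grad_b += (p - t) / n
        for j in range(2):               # vector update, opposite the gradient
            w[j] -= eta * grad_w[j]
        b -= eta * grad_b
        curve.append(c)
    return w, b, curve

# Tiny linearly separable dataset: class depends only on the first feature
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [0, 0, 1, 1]
w, b, curve = train(X, y)
print(curve[0] > curve[-1])  # → True: the cost decreases over the iterations
```

Plotting `curve` against the iteration index gives exactly the learning curve described above; on real image data the loop is the same, only with many more parameters, samples, and iterations.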
Copyright Coursera All rights reserved