Logistic Regression Training: Gradient Descent
- Shin Yoonah
- July 29, 2022
- 3 min read
Last modified: August 8, 2022
The previous post talked about the decision plane; today I will show you how to determine that plane
Basically, we use a dataset of images to train the classifier
=> When we have an unknown sample, we can classify the image
Cost and Loss
Training is where you find the best learnable parameters of the decision boundary
We need a way to determine how good our decision boundary is
Loss: Classification Error
Loss tells you how good your prediction is
=> Classification loss

First column: output of the loss function
- Each time our prediction is correct, the loss function will output a zero
- Each time our prediction is incorrect, the loss function will output a one
Cost: Classification Error
Cost is the sum of the losses over all samples

It tells us how well our learnable parameters are doing on the dataset
Each incorrectly classified sample has a loss of one, increasing the cost
Each correctly classified sample has a loss of zero and does not change the cost
The cost is a function of the learnable parameters
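The 0-1 loss and the cost above can be sketched in a few lines (a minimal illustration; the sample labels are made up):

```python
def zero_one_loss(y_hat, y):
    """0-1 loss: 0 when the prediction is correct, 1 when it is incorrect."""
    return int(y_hat != y)

def cost(predictions, labels):
    """Cost: the sum of the per-sample losses over the dataset."""
    return sum(zero_one_loss(p, t) for p, t in zip(predictions, labels))

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 0]     # two incorrect predictions
print(cost(y_pred, y_true))  # → 2
```

Each mistake adds one to the cost, so a cost of zero means the decision boundary classifies every sample correctly.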
Let's see the relationship between cost and the decision boundary
First Line

The first line misclassifies the following three points => the value of the cost is 3
Second Line

The second line misclassifies the following one point => the value of the cost is 1
Third Line

The final line performs perfectly => the value of the cost is 0
In reality, the cost is a function of multiple parameters w and b
cost(w, b)
*Even in simple 2D examples, it has too many parameters to plot*
Cost: Cross Entropy
It uses the cross entropy loss, which works with the output of the logistic function rather than the thresholded prediction ŷ
Cross entropy measures how likely it is that the image belongs to a specific class

- If the likelihood of belonging to an incorrect class is large, the cross entropy loss will in turn be large
- If the likelihood of belonging to the correct class is large, the cross entropy is small, but not zero
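Both properties above can be checked with the standard binary cross entropy formula (a minimal sketch; the probability values are made up):

```python
import math

def cross_entropy(p, y):
    """Binary cross entropy for one sample.
    p: output of the logistic function = predicted probability that y == 1.
    y: true label, 0 or 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# High probability for the correct class: small loss, but not zero
print(round(cross_entropy(0.9, 1), 4))  # → 0.1054
# High probability for the incorrect class: large loss
print(round(cross_entropy(0.9, 0), 4))  # → 2.3026
```

Because the logistic output is a probability strictly between 0 and 1, the loss is never exactly zero, but it shrinks as the prediction approaches the true label.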
Gradient Descent
It is a method to find the best learnable parameters
Training: Logistic
If we find the minimum of the cost, we can find the best parameters
The gradient gives you the slope of the function at any point, and gradient descent follows it

The purpose of gradient descent is to find the minimum of the cost function
Let's see how gradient descent works!!
We start off with a random guess for the bias parameter, using a superscript to indicate the iteration number

In this case, we have to move our guess in the positive direction
We move the parameter in that direction by adding a positive number
Examining the sign of the gradient, it is the opposite of the sign of the step we need
--> Therefore, we can add a number proportional to the negative of the gradient
Subtracting the gradient also works if we are on the other side of the minimum
--> There, we would like to move in the negative direction
We can move the parameter value in the negative direction by adding a negative number to the parameter
=> Examining the sign of the gradient, it is again the opposite of the sign of the step we need
Therefore, we can add an amount proportional to the negative of the gradient
Final equation of gradient descent

We update the parameter by adding a number proportional to the negative of the gradient; the variable i indicates the iteration
Eta (η) dictates how far we move in the direction opposite to the gradient
η is usually a small number
The new values for the bias parameter decrease the cost
When we reach the minimum, the gradient is zero
However, η should have a proper value to reach the minimum
- when η is too large, the parameter may oscillate and never reach the minimum
- when η is too small, progress is so slow that the parameter may never reach the minimum
The learning rate is a hyperparameter that we select by finding the value that gives the best accuracy on validation data
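The update rule above can be sketched on a simple quadratic cost (a stand-in for the real cost; the toy function, starting point, and learning rate are made up for illustration):

```python
def gradient_descent(grad, b0, eta, steps):
    """Repeatedly apply b_{i+1} = b_i - eta * grad(b_i):
    move by an amount proportional to the negative of the gradient."""
    b = b0
    for _ in range(steps):
        b = b - eta * grad(b)
    return b

# Toy cost: cost(b) = (b - 3)^2, minimum at b = 3; its gradient is 2*(b - 3)
grad = lambda b: 2 * (b - 3)
print(round(gradient_descent(grad, b0=0.0, eta=0.1, steps=100), 4))  # → 3.0
```

With η = 0.1 the guess converges to the minimum; try η = 1.1 on the same toy cost and the parameter oscillates with growing amplitude and diverges, which is the large-learning-rate failure described above.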
Relationship between the cost function and the decision plane
Each iteration of gradient descent finds a parameter b that decreases the cost, and the decision plane does a better job of separating the classes
Training: Threshold
It's challenging to perform gradient descent on the threshold function
--> If we get stuck at a point where the slope is zero, the gradient will be zero and the parameter will not update
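This flat-gradient problem can be verified numerically (a small sketch; the test point z = 2.0 and the central-difference helper are my own choices):

```python
import math

def numeric_grad(f, x, h=1e-5):
    """Central-difference estimate of the slope f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

threshold = lambda z: 1.0 if z > 0 else 0.0
logistic = lambda z: 1 / (1 + math.exp(-z))

# Away from z = 0 the threshold function is flat: zero gradient, no update signal
print(numeric_grad(threshold, 2.0))           # → 0.0
# The logistic function always has a nonzero slope to follow
print(round(numeric_grad(logistic, 2.0), 4))  # → 0.105
```

This is why training uses the logistic function: its smooth slope gives gradient descent a direction everywhere, while the threshold function's gradient is zero almost everywhere.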
We use gradient descent to minimize the cost
=> the decision plane has multiple parameters
Therefore, the gradient is a vector
We update the parameters with this vector of partial derivatives
Over two parameters, we can plot the cost as a surface (bowl shape)

As we update the parameters, they approach the minimum
We plot the cost with respect to each iteration i => this is the learning curve
The more parameters you have, the more images and iterations you need to make the model work!!
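Putting the pieces together, here is a minimal end-to-end sketch: logistic regression trained by gradient descent on a tiny 2D dataset, recording the cost at every iteration to form the learning curve (the dataset, learning rate, and iteration count are all made up for illustration):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train(X, y, eta=0.5, iters=200):
    """Gradient descent on the averaged cross entropy cost.
    Returns the weights, bias, and the learning curve (cost per iteration)."""
    w, b = [0.0, 0.0], 0.0
    curve = []
    n = len(X)
    for _ in range(iters):
        grad_w, grad_b, c = [0.0, 0.0], 0.0, 0.0
        for x, t in zip(X, y):
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            c += -(t * math.log(p) + (1 - t) * math.log(1 - p)) / n
            for j in range(2):           # the gradient is a vector:
                grad_w[j] += (p - t) * x[j] / n
            grad_b += (p - t) / n
        for j in range(2):               # vector update, opposite the gradient
            w[j] -= eta * grad_w[j]
        b -= eta * grad_b
        curve.append(c)
    return w, b, curve

# Tiny linearly separable dataset: class depends only on the first feature
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [0, 0, 1, 1]
w, b, curve = train(X, y)
print(curve[0] > curve[-1])  # → True: the cost decreases over the iterations
```

Plotting `curve` against the iteration index gives exactly the learning curve described above; on real image data the loop is the same, only with many more parameters, samples, and iterations.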
Copyright Coursera All rights reserved