This is a continuation of our previous article: https://codegonewild.net/2020/07/11/logistic-regression-part-1/
5.4 Gradient descent
Before diving into the details, let's take a quick detour:
It's all about the slope:
The smaller the change Δx, the more precise the slope will be.
We can see that:
x changes from x to x + Δx
y changes from f(x) to f(x + Δx)
If we substitute these into the slope formula, and keep in mind that Δx tends to 0, we get the derivative of f at x.
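Written out, this is the standard limit definition of the derivative:

\[ f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \]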
So if we have a function f(x) = x², its rate of change is 2x:
when x=2 the slope is 4
if x = 5 the slope is 10
Interesting fact: this slope that we discussed is also called a gradient.
The gradient is a vector that points in the direction of the steepest ascent of a function.
For f(x, y) = 4x² + y², the gradient would be ∇f = [8x, 2y].
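As a quick sanity check, here is a minimal Python sketch (the function names, step size, and test point are just illustrative, not from the article) that approximates this gradient with finite differences; the result should come out close to [8x, 2y]:

```python
import numpy as np

def f(x, y):
    return 4 * x**2 + y**2

def numerical_gradient(x, y, h=1e-6):
    # Approximate each partial derivative with a central difference
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([df_dx, df_dy])

print(numerical_gradient(2.0, 3.0))  # close to [8*2, 2*3] = [16, 6]
```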
5.4.2 The gradient descent
We will try to find the optimal weights that minimise the cross-entropy loss. Gradient descent is a method for finding the minimum of a function by figuring out in which direction the function's slope is rising most steeply and moving in the opposite direction. The idea is that if you are on a mountain and trying to get down, you follow the path where the ground slopes most steeply.
Luckily for us, the cross-entropy loss for LR is convex; in other words, there are no local minima, it has only one minimum (in contrast, the loss function for a multilayer neural network is non-convex and gradient descent might get stuck in a local minimum).
Gradient descent answers the question of which way to move by finding the gradient of the loss function at the current point and moving in the opposite direction.
The magnitude of the move in gradient descent is the value of the slope weighted by a learning rate. A higher learning rate means we move w more on each step; the change we make is the gradient times the learning rate.
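In symbols, for a parameter w at time step t and learning rate η, the standard update is:

\[ w^{t+1} = w^{t} - \eta \, \frac{d}{dw} L\big(f(x; w), y\big) \]

(For a weight vector, the single derivative becomes the gradient ∇_w L.)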
In an actual logistic regression, the parameter vector w is much longer than 1, since the input vector can be quite long, and we need a weight w_i for each feature x_i.
5.4.3 The gradient for LR
The loss function of LR is the cross-entropy loss.
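In its standard form, writing ŷ = σ(w·x + b) for the model's prediction on a single example (x, y), it is:

\[ L_{CE}(\hat{y}, y) = -\big[\, y \log \sigma(w \cdot x + b) + (1 - y) \log\big(1 - \sigma(w \cdot x + b)\big) \,\big] \]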
It turns out its gradient has a very simple form.
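Following the standard derivation, the partial derivative for each weight w_j is the difference between the model's prediction and the true label, scaled by the corresponding feature:

\[ \frac{\partial L_{CE}(\hat{y}, y)}{\partial w_j} = \big[\sigma(w \cdot x + b) - y\big]\, x_j \]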
5.4.4 Stochastic gradient descent algorithm
Stochastic gradient descent is an online algorithm that minimises the loss function by calculating the gradient after each training example.
The learning rate is a parameter that must be adjusted: if it's too high we might overshoot the minimum, and if it's too small, converging will take a long time. It is common to begin with a large learning rate and slowly decrease it.
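To make this concrete, here is a minimal Python sketch of plain SGD for binary logistic regression; the function names, toy data, and the fixed learning rate are illustrative assumptions, not the article's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, learning_rate=0.1, epochs=100):
    """Train binary LR with plain SGD: one example per weight update."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n_samples):  # visit examples in random order
            y_hat = sigmoid(np.dot(w, X[i]) + b)    # prediction for this example
            error = y_hat - y[i]                    # sigma(w.x + b) - y
            w -= learning_rate * error * X[i]       # gradient step for the weights
            b -= learning_rate * error              # gradient step for the bias
    return w, b

# Tiny usage example on a toy dataset
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([0, 0, 1, 1])
w, b = sgd_logistic_regression(X, y)
print(w, b)
```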
5.4.5 Mini-batch training
Stochastic gradient descent is called stochastic because it chooses a single random example at a time, moving the weights so as to improve performance on that single example. That can result in choppy movements, so it's common to compute the gradient over batches of examples.
Let's extend the cross-entropy loss to support a mini-batch of size m.
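Following the standard formulation, the cost over a mini-batch is just the average of the per-example losses:

\[ Cost(w, b) = \frac{1}{m} \sum_{i=1}^{m} L_{CE}\big(\hat{y}^{(i)}, y^{(i)}\big) \]

where each L_CE term is the single-example cross-entropy loss from above.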
5.6 Multinomial logistic regression
Most of the time we want our classification to yield more than 2 classes; in such cases we use multinomial logistic regression, where the target y is a variable that ranges over more than 2 classes.
It uses a generalisation of the sigmoid called the softmax function: it takes a vector z = [z_1, z_2, …, z_K] and maps each value to the range [0, 1], with the sum of the mapped values equal to 1.
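Here is a minimal Python sketch of the softmax (subtracting the maximum is a common numerical-stability trick and an implementation choice, not part of the definition):

```python
import numpy as np

def softmax(z):
    """Map a score vector z to a probability distribution over K classes."""
    z = z - np.max(z)          # shift by the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())      # values in [0, 1] that sum to 1
```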
5.6.1 Features in multinomial logistic regression
Suppose we are doing classification on an NLP task where we need to classify into 3 classes: +, -, 0 (neutral); this will look like this:
5.6.2 Learning in multinomial logistic regression
It has a slightly different loss function than binary logistic regression because it uses the softmax rather than the sigmoid classifier. The loss function for a single example x is the sum of the logs of the K output class probabilities, each weighted by whether that class is the true one (so only the correct class's term survives).
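Written out in the standard form, with 1{y = k} equal to 1 when k is the true class and 0 otherwise:

\[ L_{CE}(\hat{y}, y) = -\sum_{k=1}^{K} \mathbf{1}\{y = k\} \log p(y = k \mid x) = -\sum_{k=1}^{K} \mathbf{1}\{y = k\} \log \frac{e^{\,w_k \cdot x + b_k}}{\sum_{j=1}^{K} e^{\,w_j \cdot x + b_j}} \]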
The gradient for a single example turns out to be very similar to binary logistic regression.
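For the weight vector w_k of class k, the standard form is again a difference between the true (0 or 1) value and the model's probability, scaled by the input:

\[ \frac{\partial L_{CE}}{\partial w_k} = -\big(\mathbf{1}\{y = k\} - p(y = k \mid x)\big)\, x \]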
5.7 A few more notes
Logistic regression can be combined with statistical tests; this helps investigate whether a particular feature is significant and what its magnitude is (how large the weight is).
Logistic regression in NLP and many other fields is widely used as an analytical tool for testing hypotheses about the effect of features.
Perhaps we want to know if logically negative words (no, not, never) are more likely to be associated with negative sentiment, or if negative reviews of movies are more likely to
discuss the cinematography.