Logistic regression (LR) is used for discovering the link between features and an outcome. It is one of the most important tools in the social and natural sciences, it is used as the baseline for supervised classification in NLP, and it has a strong relation to neural networks: a NN can be viewed as stacked layers of LR.
LR can be used to classify observations into two classes or into multiple classes.
Generative and discriminative: these are two different frameworks for building ML models. Imagine we try to learn how to differentiate two types of images, cats and dogs:
Generative: tries to understand what a dog and a cat look like; you might even ask the model to generate an image of a dog.
Discriminative: only tries to distinguish the classes; if the dogs wear a collar and the cats don’t, it will use this feature to tell them apart.
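For a classifier like Naive Bayes we use Bayes' rule and pick the class c that maximises P(d|c) P(c), the likelihood times the prior.
We don’t try to compute P(c|d) directly; we make use of the likelihood P(d|c), which expresses how to generate the features of a document if we know it belongs to class c, thus it is generative.
By contrast, a discriminative approach tries to learn P(c|d) directly; it will assign higher weights to the document features that discriminate between the classes.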
Components of a probabilistic machine learning classifier: like Naive Bayes, LR is a probabilistic classifier, and it requires a training corpus of observations. It has the following components:
- Feature representation of the input => [x1, x2, …, xn]
- Classification function that estimates p(y|x) => Softmax/Sigmoid
- Objective function for learning => Cross-entropy loss function
- Algorithm for optimising the objective function => Stochastic gradient descent
Let’s dive into each of these:
1. Classification function: the sigmoid
The goal of binary LR is to train a classifier that can make a binary decision, and to achieve that we will introduce the sigmoid.
We have an input [x1, …, xn] and we want to know whether y = 0 or y = 1. LR solves this by learning a vector of weights and a bias.
Each wi is a real number that represents the weight of xi, that is, the importance of that feature for the decision (it can be positive or negative).
b is another real number that is added to the weighted inputs.
Together they give z = w . x + b, where w . x is the dot product (the sum of all wi xi). z is used to make the decision, but sadly it is not a probability since it can be any real number, so we pass it through the sigmoid function σ(z) = 1 / (1 + e^(-z)) (named because its curve is in the form of an S).
The sigmoid has a number of advantages: it takes a real-valued number and maps it into the probability range (0, 1), and it tends to squash outlier values towards 1 or 0, as seen in figure 1.
If we apply the sigmoid to z we get P(y = 1) = σ(w . x + b). We need to make sure that P(y = 1) + P(y = 0) = 1, so P(y = 0) = 1 - σ(w . x + b).
Now we have an algorithm that, given an instance x, gives us the class it belongs to: we predict y = 1 if P(y = 1) > 0.5 and y = 0 otherwise.
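The decision procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not a reference implementation: the function names `sigmoid` and `classify` and the 0.5 threshold are assumptions chosen for the example, and the code is not hardened against overflow for very large negative scores.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, w, b, threshold=0.5):
    """Return (label, P(y=1|x)) for features x, weights w and bias b.

    The names and the default threshold are illustrative assumptions.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # z = w . x + b
    p1 = sigmoid(z)
    return (1 if p1 > threshold else 0), p1

# Large positive or negative scores are squashed towards 1 or 0.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, round(sigmoid(z), 4))
```

Because P(y = 0) is computed as 1 - P(y = 1), the two probabilities sum to 1 by construction.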
Let’s take an example, suppose we are trying to do some sentiment analysis classification:
Here is the text:
And the related features:
Let’s assume we learned 6 weights, w = [2.5, -5, -1.2, 0.5, 2.0, 0.07], and b = 0.1.
The count of negative words is the most important feature (its weight, -5, is the largest in magnitude).
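To make the computation concrete, here is a small sketch that plugs the weights and bias above into z = w . x + b and the sigmoid. The feature vector `x` is a hypothetical placeholder (the feature values of the example are not reproduced here), so the resulting probability is only illustrative.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.07]  # weights from the text
b = 0.1                                # bias from the text
x = [3.0, 2.0, 1.0, 3.0, 0.0, 4.0]     # HYPOTHETICAL feature values, for illustration only

z = sum(wi * xi for wi, xi in zip(w, x)) + b  # weighted sum plus bias
p_pos = sigmoid(z)      # P(y = 1 | x)
p_neg = 1.0 - p_pos     # P(y = 0 | x)

print(f"z = {z:.2f}, P(y=1) = {p_pos:.3f}, P(y=0) = {p_neg:.3f}")
```

With these made-up features the second feature (weight -5) dominates the sum, which pulls z below zero and the positive-class probability below 0.5.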
Designing features: features are generally designed by examining the training set with some intuition. It might be useful to build complex features that are a combination of other features, for example for period disambiguation, or bigrams and so on. For LR and NB this needs to be done by hand.
To avoid the extensive human effort of feature design, NLP has focused on representation learning, where the features are learned automatically.
Choosing a classifier: NB treats all features as independent, so it overestimates the evidence from correlated features. LR splits the weight among correlated features, and thus works better on large documents.
NB works better on small documents and is easy and fast to implement.
2. Learning in LR
How are w and b learned? In this section we will answer this question.
LR is a supervised classifier, and thus we need:
- A metric: how close the classifier's output is to the true gold label. We will introduce the cross-entropy loss, which we call the cost function.
- An optimisation algorithm: to optimise w and b. The standard algorithm for this is gradient descent; we will introduce stochastic gradient descent.
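As a rough preview of how the two ingredients fit together, the sketch below trains w and b with stochastic gradient descent on a tiny made-up dataset. It relies on one standard result that is not derived in these notes: for the cross-entropy loss with a sigmoid output, the gradient with respect to z is (ŷ - y). The function name `sgd_train`, the learning rate, the number of epochs and the toy data are all assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_train(data, n_features, lr=0.1, epochs=200, seed=0):
    """Fit w and b by SGD. data: list of (x, y), x a list of floats, y in {0, 1}."""
    rng = random.Random(seed)
    w = [0.0] * n_features
    b = 0.0
    data = list(data)  # copy so the caller's list is not shuffled
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in random order
        for x, y in data:
            y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            error = y_hat - y  # assumed gradient of the loss w.r.t. z
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
            b -= lr * error
    return w, b

# Toy, made-up data: the label is 1 when the single feature is positive.
toy = [([-2.0], 0), ([-1.0], 0), ([1.0], 1), ([2.0], 1)]
w, b = sgd_train(toy, n_features=1)
print(w, b)
```

Each update uses a single example, which is what makes the procedure "stochastic"; plain gradient descent would average the gradient over the whole training set before each step.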
3. Cross-entropy loss
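We need a loss function that represents how far the classifier output ŷ = σ(w . x + b), where σ is the sigmoid, is from the gold label y.
We use a loss function that prefers the correct class label. This is called conditional maximum likelihood estimation: we choose the parameters w, b that maximise the log probability of the true y labels in the training data given the observations x. The resulting loss function is the negative log likelihood loss, known as the cross-entropy loss.
Let’s derive it:
P(y|x) has only 2 possible outcomes, y = 0 or y = 1, so it is a Bernoulli distribution and can be written:
P(y|x) = ŷ^y (1 - ŷ)^(1 - y)
If y = 1, P(y|x) = ŷ and we are back to our sigmoid function; if y = 0, P(y|x) = 1 - ŷ.
As a reminder, a Bernoulli variable takes the value 1 with probability p and the value 0 with probability 1 - p.
We take the log of both sides, a very handy mathematical trick that doesn’t hurt (the log is monotonic, so the parameters that maximise it are the same) and makes the equation simpler:
log P(y|x) = y log ŷ + (1 - y) log(1 - ŷ)
The previous equation represents something that needs to be maximised; by flipping the sign we get a loss that we will try to minimise:
L_CE(ŷ, y) = -[y log ŷ + (1 - y) log(1 - ŷ)]
Finally we replace ŷ by the sigmoid function definition:
L_CE(w, b) = -[y log σ(w . x + b) + (1 - y) log(1 - σ(w . x + b))]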
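The final expression translates almost directly into code. The following is a minimal sketch for a single training example; the function name `cross_entropy_loss` is an assumption, and the code does not guard against log(0) when the sigmoid saturates.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy_loss(y, x, w, b):
    """Negative log likelihood of the gold label y (0 or 1) for one example.

    Computes -[y log(y_hat) + (1 - y) log(1 - y_hat)] with y_hat = sigmoid(w . x + b).
    """
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

# An uninformative model (y_hat = 0.5) pays log(2) whatever the label is.
print(cross_entropy_loss(1, [1.0], [0.0], 0.0))
# A confident model pays little when it is right and a lot when it is wrong.
print(cross_entropy_loss(1, [1.0], [3.0], 0.0))
print(cross_entropy_loss(0, [1.0], [3.0], 0.0))
```

The loss is small when the model assigns high probability to the gold label and grows quickly as that probability approaches 0, which is the behaviour we want from a function that "prefers the correct class label".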