Understanding Maximum Likelihood Estimation in Supervised Learning

Nikolas Adaloglou on 2022-02-10 · 5 mins
Machine Learning

This article demystifies the ML learning modeling process through the prism of statistics. We will see how our assumptions on the data enable us to create meaningful optimization problems. In fact, we will derive commonly used criteria such as cross-entropy in classification and mean squared error in regression. Finally, I will try to answer an interview question that I once encountered: what would happen if we used MSE on binary classification?

Likelihood vs. probability and probability density

To begin, let's start with a fundamental question: what is the difference between likelihood and probability? The data $x$ are connected to the possible models $\theta$ by means of a probability $P(x,\theta)$ or a probability density function (pdf) $p(x,\theta)$.

In short, a pdf describes how probable different possible values are; the probability of any single exact value is infinitesimally small. We'll stick with the pdf notation here. For any given set of parameters $\theta$, $p(x,\theta)$ is intended to be the probability density function of $x$.

The likelihood $p(x,\theta)$ is defined as the joint density of the observed data viewed as a function of model parameters. That means, for any given $x$, $p(x=\text{fixed},\theta)$ can be viewed as a function of $\theta$. Thus, the likelihood function is a function of the parameters $\theta$ only, with the data held as a fixed constant.
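To make the distinction concrete, here is a minimal numpy sketch (the Gaussian family and the numbers are an assumed toy example, not something from the derivation): the same function $p(x,\theta)$ acts as a density when $\theta$ is held fixed and $x$ varies, and as a likelihood when the observed $x$ is held fixed and $\theta$ varies.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma=1.0):
    # p(x, theta) for a Gaussian with parameters theta = (mu, sigma)
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Density: theta fixed (mu=0), x varies
xs = np.linspace(-3, 3, 7)
print(gaussian_pdf(xs, mu=0.0))       # density over possible data values

# Likelihood: the observed data point is fixed, theta (mu) varies
x_observed = 1.3
mus = np.linspace(-3, 3, 7)
print(gaussian_pdf(x_observed, mus))  # likelihood of each candidate mu for the same x
```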

Notations

We will consider the case where we are given a set $X$ of $m$ data instances $X = \{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(m)}\}$ that follow the empirical training data distribution $p_{data}^{train}(\mathbf{x}) = p_{data}(\mathbf{x})$, which is assumed to be a good and representative sample of the unknown and broader data distribution $p_{data}^{real}(\mathbf{x})$.

The Independent and identically distributed assumption

This brings us to the most fundamental assumption of ML: independent and identically distributed (IID) data (random variables). Statistical independence means that for random variables $A$ and $B$, the joint distribution factors into the product of their marginal distributions: $P_{A,B}(a,b) = P_{A}(a) P_{B}(b)$. That is how multi-variable joint distributions are turned into products over individual samples. Note that the product can be turned into a sum by taking the logarithm, $\log \prod_i x_i = \sum_i \log x_i$. Since $\log(x)$ is monotonic, it does not change the optimization problem.
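To see why the log matters in practice, here is a toy numpy sketch (assuming a standard normal model, purely for illustration): the product of many per-sample densities underflows to zero, while the sum of log-densities stays perfectly usable.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=10_000)   # m IID samples

def log_gaussian_pdf(x, mu=0.0, sigma=1.0):
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

likelihood = np.prod(np.exp(log_gaussian_pdf(x)))  # product of m densities -> underflows to 0.0
log_likelihood = np.sum(log_gaussian_pdf(x))       # sum of m log-densities -> finite and usable

print(likelihood, log_likelihood)
```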

Our estimator (model) will have some learnable parameters $\boldsymbol{\theta}$ that define another probability distribution $p_{model}(\mathbf{x}, \boldsymbol{\theta})$. Ideally, $p_{model}(\mathbf{x}, \boldsymbol{\theta}) \approx p_{data}(\mathbf{x})$.

The essence of ML is to pick a good initial model that exploits the assumptions and the structure of the data. Put differently, a model with a decent inductive bias. As the parameters are iteratively optimized, $p_{model}(\mathbf{x}, \boldsymbol{\theta})$ gets closer to $p_{data}(\mathbf{x})$.

In neural networks, because the iterations happen in a mini-batch fashion instead of over the whole dataset, $m$ will be the mini-batch size.

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is simply a common, principled method with which we can derive good estimators, hence picking $\boldsymbol{\theta}$ such that it fits the data.

To disentangle this concept, let's observe the formula in the most intuitive form:

$$\boldsymbol{\theta}_{\mathrm{MLE}} = \underset{\text{params}}{\arg\max}\; p_{model}(\text{output} \mid \text{inputs}, \text{params})$$

The optimization problem is maximizing the likelihood of the given data. In the conditional form above, the outputs (labels) are modeled conditioned on the inputs. Unconditional MLE means there is no such conditioning, i.e., no labels:

$$\begin{aligned} \boldsymbol{\theta}_{\mathrm{MLE}} &= \underset{\boldsymbol{\theta}}{\arg\max}\; p_{model}(X, \boldsymbol{\theta}) \\ &= \underset{\boldsymbol{\theta}}{\arg\max} \prod_{i=1}^{m} p_{model}\left(\mathbf{x}^{(i)}, \boldsymbol{\theta}\right). \end{aligned}$$
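As a toy illustration of the unconditional case (assuming a Bernoulli model for coin flips, which is my own example rather than the article's), the $\boldsymbol{\theta}$ that maximizes the summed log-likelihood lands on the empirical frequency of heads:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.binomial(n=1, p=0.7, size=200)   # observed coin flips (1 = heads)

thetas = np.linspace(0.01, 0.99, 99)     # candidate parameters
# log p_model(x | theta), summed over the IID samples, for every candidate theta
log_lik = np.array([np.sum(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas])

theta_mle = thetas[np.argmax(log_lik)]
print(theta_mle, x.mean())               # the argmax matches the sample frequency (up to grid resolution)
```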

In a supervised ML context, the outputs are simply the data labels:

$$\boldsymbol{\theta}_{\mathrm{ML}} = \underset{\boldsymbol{\theta}}{\arg\max} \sum_{i=1}^{m} \log p_{model}\left(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\theta}\right)$$

Quantifying distribution closeness: KL-div

One way to interpret MLE is to view it as minimizing the "closeness" between the training data distribution $p_{data}(\mathbf{x})$ and the model distribution $p_{model}(\mathbf{x}, \boldsymbol{\theta})$. A natural way to quantify this "closeness" between distributions is the KL divergence, defined as:

$$D_{KL}(p_{data} \| p_{model}) = E_{x \sim p_{data}}\left[\log \frac{p_{data}(\mathbf{x})}{p_{model}(\mathbf{x}, \boldsymbol{\theta})}\right] = E_{x \sim p_{data}}\left[\log p_{data}(\mathbf{x}) - \log p_{model}(\mathbf{x}, \boldsymbol{\theta})\right],$$

where $E$ denotes the expectation over all possible training data. In general, the expected value $E$ is a weighted average of all possible outcomes. We will replace the expectation with a sum, weighting each term by its probability of occurring, that is $p_{data}$.

Figure: Illustration of the relative entropy for two normal distributions; the typical asymmetry is clearly visible. By Mundhenk at English Wikipedia, CC BY-SA 3.0.

Notice that I intentionally avoided using the term distance. Why? Because a distance function is defined to be symmetric. KL-div, on the other hand, is asymmetric, meaning $D_{KL}(p_{data} \| p_{model}) \neq D_{KL}(p_{model} \| p_{data})$.

Figure: the asymmetry of the KL divergence. Source: Datumorphism.

Intuitively, you can think of $p_{data}$ as a static "source" of information that sends (passes) batches of data to $p_{model}$, the "receiver". Since information is passed one way only, that is from $p_{data}$ to $p_{model}$, it would make no sense to compute the divergence with $p_{model}$ as the reference source. You can observe this in practice by swapping the target and the model prediction in the cross-entropy or KL-div loss function in your code.
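You can also verify the asymmetry numerically in a couple of lines of numpy (the two discrete distributions below are made up for illustration):

```python
import numpy as np

def kl_div(p, q):
    # D_KL(p || q) for discrete distributions, with p as the reference "source"
    return np.sum(p * np.log(p / q))

p_data  = np.array([0.7, 0.2, 0.1])   # "target" distribution
p_model = np.array([0.4, 0.4, 0.2])   # model prediction

print(kl_div(p_data, p_model))   # D_KL(p_data || p_model)
print(kl_div(p_model, p_data))   # D_KL(p_model || p_data) -> a different number
```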

By replacing the expectation $E$ with our lovely sum:

$$\begin{aligned} D_{KL}(p_{data} \| p_{model}) &= \sum_{x=1}^{N} p_{data}(\mathbf{x}) \log \frac{p_{data}(\mathbf{x})}{p_{model}(\mathbf{x}, \boldsymbol{\theta})} \\ &= \sum_{x=1}^{N} p_{data}(\mathbf{x})\left[\log p_{data}(\mathbf{x}) - \log p_{model}(\mathbf{x}, \boldsymbol{\theta})\right] \end{aligned}$$

When we minimize the KL divergence with respect to the parameters of our estimator, the term $\log p_{data}(\mathbf{x})$ does not depend on $\boldsymbol{\theta}$ and drops out, which brings us to:

$$\nabla_{\theta} D_{KL}(p_{data} \| p_{model}) = -\nabla_{\theta} \sum_{x=1}^{N} p_{data}(\mathbf{x}) \log p_{model}(\mathbf{x}, \boldsymbol{\theta}).$$

In other words, minimizing KL-div is mathematically equivalent to minimizing cross-entropy ($H(P, Q) = -\sum_{x} P(x) \log Q(x)$):

$$\begin{aligned} H\left(p_{data}, p_{model}\right) &= H(p_{data}) + D_{KL}\left(p_{data} \| p_{model}\right) \\ \nabla_{\theta} H\left(p_{data}, p_{model}\right) &= \nabla_{\theta}\left(H(p_{data}) + D_{KL}\left(p_{data} \| p_{model}\right)\right) \\ &= \nabla_{\theta} D_{KL}\left(p_{data} \| p_{model}\right) \end{aligned}$$

The optimal parameters $\boldsymbol{\theta}$ will, in principle, be the same. Even though the optimization landscapes (as defined by the objective functions) differ, maximizing the likelihood is equivalent to minimizing the KL divergence. Here, the entropy of the data $H(p_{data})$ merely shifts the landscape, while a scalar multiplication would only rescale it. Sometimes I find it helpful to imagine the landscape as a mountain we are descending. Practically, both are framed as minimizing an objective cost function.

From the statistical point of view, it is more natural to think of it as bringing the distributions close together, hence KL divergence. From the information-theoretic point of view, cross-entropy might make more sense to you.
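Here is a quick numerical sanity check of the identity $H(p_{data}, p_{model}) = H(p_{data}) + D_{KL}(p_{data} \| p_{model})$, reusing the same made-up discrete distributions (again, just an illustrative sketch):

```python
import numpy as np

p_data  = np.array([0.7, 0.2, 0.1])
p_model = np.array([0.4, 0.4, 0.2])

entropy       = -np.sum(p_data * np.log(p_data))            # H(p_data), independent of theta
cross_entropy = -np.sum(p_data * np.log(p_model))           # H(p_data, p_model)
kl            =  np.sum(p_data * np.log(p_data / p_model))  # D_KL(p_data || p_model)

print(np.isclose(cross_entropy, entropy + kl))              # True
```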

MLE in Linear regression

Let's consider linear regression. Imagine that each single prediction $\hat{y}$ produces a "conditional" distribution $p_{model}(\hat{y} \mid \mathbf{x})$, given a sufficiently large training set. The goal of the learning algorithm is again to match the distribution $p_{data}(y \mid \mathbf{x})$.


Now we need an assumption. We hypothesize that the neural network, or any estimator $f$, gives $\hat{y} = f(\mathbf{x}, \boldsymbol{\theta})$. The estimator approximates the mean of the normal distribution $\mathcal{N}(\mu, \sigma)$ that we choose to parametrize $p_{data}$ with. Specifically, in the simplest case of linear regression we have $\mu = \boldsymbol{\theta}^T \mathbf{x}$. We also assume a fixed standard deviation $\sigma$ for the normal distribution. These assumptions immediately cause MLE to become Mean Squared Error (MSE) optimization. Let's see how.

$$\begin{aligned} \hat{y} &= f(\mathbf{x}, \boldsymbol{\theta}) \\ y &\sim \mathcal{N}\left(\mu = \hat{y}, \sigma^{2}\right) \\ p(y \mid \mathbf{x}, \boldsymbol{\theta}) &= \frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-(y - \hat{y})^{2}}{2\sigma^{2}}\right) \end{aligned}$$

In terms of log-likelihood we can form a loss function:

$$\begin{aligned} L &= \sum_{i=1}^{m} \log p\left(y^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\theta}\right) \\ &= \sum_{i=1}^{m} \log \frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-\left(\hat{y}^{(i)} - y^{(i)}\right)^{2}}{2\sigma^{2}}\right) \\ &= \sum_{i=1}^{m} -\log(\sigma\sqrt{2\pi}) - \frac{\left(\hat{y}^{(i)} - y^{(i)}\right)^{2}}{2\sigma^{2}} \\ &= \sum_{i=1}^{m} -\log(\sigma) - \frac{1}{2}\log(2\pi) - \frac{\left(\hat{y}^{(i)} - y^{(i)}\right)^{2}}{2\sigma^{2}} \\ &= -m\log(\sigma) - \frac{m}{2}\log(2\pi) - \sum_{i=1}^{m} \frac{\left(\hat{y}^{(i)} - y^{(i)}\right)^{2}}{2\sigma^{2}} \end{aligned}$$

Since $\operatorname{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left\|\hat{y}^{(i)} - y^{(i)}\right\|^{2}$, the log-likelihood can be written as

$$L = -m\log(\sigma) - \frac{m}{2}\log(2\pi) - \frac{m}{2\sigma^{2}}\operatorname{MSE}.$$

By taking the gradient with respect to the parameters, the terms that do not depend on $\boldsymbol{\theta}$ vanish and we are left with the desired MSE:

$$\nabla_{\theta} L = -\nabla_{\theta} \sum_{i=1}^{m} \frac{\left\|\hat{y}^{(i)} - y^{(i)}\right\|^{2}}{2\sigma^{2}} = -\frac{m}{2\sigma^{2}}\, \nabla_{\theta} \operatorname{MSE}.$$

Maximizing the log-likelihood is therefore equivalent to minimizing the MSE.
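As a small sanity check (a sketch with a synthetic 1D dataset and the fixed $\sigma$ assumed above; the numbers are made up), the Gaussian negative log-likelihood and the MSE are indeed minimized by the same $\theta$ in linear regression:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = 2.5 * x[:, 0] + rng.normal(scale=0.3, size=100)   # "true" slope 2.5 plus noise
sigma = 1.0                                           # fixed, assumed model standard deviation

thetas = np.linspace(0.0, 5.0, 501)                   # candidate slopes
preds = thetas[None, :] * x                           # y_hat for every candidate theta
sq_err = (preds - y[:, None]) ** 2

mse = sq_err.mean(axis=0)
nll = 0.5 * np.log(2 * np.pi * sigma ** 2) * len(y) + sq_err.sum(axis=0) / (2 * sigma ** 2)

print(thetas[np.argmin(mse)], thetas[np.argmin(nll)])  # same minimizer
```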

MLE in supervised classification

In linear regression, we parametrized $p_{model}(y \mid \mathbf{x}, \boldsymbol{\theta})$ as a normal distribution. More precisely, we parametrized the mean to be $\mu = \boldsymbol{\theta}^T \mathbf{x}$.

It is possible to convert linear regression to a classification problem. All we need to do is encode the ground truth as a one-hot vector:

$$p_{data}\left(y \mid \mathbf{x}_{i}\right) = \begin{cases} 1 & \text{if } y = y_{i} \\ 0 & \text{otherwise} \end{cases},$$

where $i$ refers to a single data instance.

$$\begin{aligned} H_{i}\left(p_{data}, p_{model}\right) &= -\sum_{y \in Y} p_{data}\left(y \mid \mathbf{x}_{i}\right) \log p_{model}\left(y \mid \mathbf{x}_{i}\right) \\ &= -\log p_{model}\left(y_{i} \mid \mathbf{x}_{i}\right) \end{aligned}$$
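With a one-hot target, the full cross-entropy sum collapses to the negative log-probability assigned to the correct class. A minimal numpy illustration (the 3-class probabilities are made up):

```python
import numpy as np

p_model = np.array([0.1, 0.7, 0.2])   # predicted class probabilities for one sample x_i
y_i = 1                               # ground-truth class index
p_data = np.eye(3)[y_i]               # one-hot encoding: [0., 1., 0.]

full_sum = -np.sum(p_data * np.log(p_model))   # cross-entropy over all classes
shortcut = -np.log(p_model[y_i])               # only the correct-class term survives

print(np.isclose(full_sum, shortcut))          # True
```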

For simplicity let's consider the binary case of two labels, 0 and 1.

$$\begin{aligned} L &= \sum_{i=1}^{n} H_{i}\left(p_{data}, p_{model}\right) \\ &= \sum_{i=1}^{n} -\log p_{model}\left(y_{i} \mid \mathbf{x}_{i}\right) \\ &= -\sum_{i=1}^{n} \log p_{model}\left(y_{i} \mid \mathbf{x}_{i}\right) \end{aligned}$$

$$\underset{\boldsymbol{\theta}}{\arg\min}\, L = \underset{\boldsymbol{\theta}}{\arg\min} \left(-\sum_{i=1}^{n} \log p_{model}\left(y_{i} \mid \mathbf{x}_{i}\right)\right)$$

This is in line with our definition of conditional MLE:

$$\boldsymbol{\theta}_{\mathrm{ML}} = \underset{\boldsymbol{\theta}}{\arg\max} \sum_{i=1}^{m} \log p_{model}\left(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\theta}\right)$$

Broadly speaking, MLE can be applied to most (supervised) learning problems by specifying a parametric family of (conditional) probability distributions.

Another way to achieve this in a binary classification problem is to take the scalar output $y$ of the linear layer and pass it through a sigmoid function. The output will be in the range $[0,1]$, and we define it as the probability $p(y = 1 \mid \mathbf{x}, \boldsymbol{\theta})$.

$$p(y = 1 \mid \mathbf{x}, \boldsymbol{\theta}) = \sigma\left(\boldsymbol{\theta}^T \mathbf{x}\right) = \operatorname{sigmoid}\left(\boldsymbol{\theta}^T \mathbf{x}\right) \in [0,1]$$

Consequently, $p(y = 0 \mid \mathbf{x}, \boldsymbol{\theta}) = 1 - p(y = 1 \mid \mathbf{x}, \boldsymbol{\theta})$. In this case, binary cross-entropy is used in practice. No closed-form solution exists here; one approximates the optimum with gradient descent. For reference, this approach is, perhaps surprisingly, known as "logistic regression".
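Here is a minimal gradient-descent sketch of this setup (synthetic data, no bias term, plain numpy; an illustration under these assumptions rather than a reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_theta = np.array([2.0, -1.0])
y = (X @ true_theta + 0.1 * rng.normal(size=200) > 0).astype(float)  # binary labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(2)
lr = 0.1
for _ in range(500):
    p = sigmoid(X @ theta)             # p(y = 1 | x, theta)
    grad = X.T @ (p - y) / len(y)      # gradient of the mean binary cross-entropy
    theta -= lr * grad

print(theta)   # points roughly in the direction of true_theta
```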

Bonus: What would happen if we use MSE on binary classification?

So far, I have presented the basics. This is a bonus question that I was asked during an ML interview: what if we use MSE on binary classification?

When the label $y^{(i)} = 0$:

$$\operatorname{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left\|\hat{y}^{(i)}\right\|^{2} = \frac{1}{m}\sum_{i=1}^{m}\left\|\sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)\right\|^{2}$$

When the label $y^{(i)} = 1$:

$$\operatorname{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left\|1 - \hat{y}^{(i)}\right\|^{2} = \frac{1}{m}\sum_{i=1}^{m}\left\|1 - \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)\right\|^{2}$$

One intuitive way to guess what is happening without diving into the math is this: at the beginning of training, the network outputs something very close to 0.5, which gives roughly the same (weak) signal for both classes. Below is a more principled derivation, proposed after the initial release of the article by Jonas Maison.

Proposed demonstration by Jonas Maison

Let's assume that we have a simple neural network with weights $\theta$ such that $z = \theta^\intercal x$, and output $\hat{y} = \sigma(z)$ with a sigmoid activation.

$$\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z} \frac{\partial z}{\partial \theta}$$

MSE Loss

$$L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$$

$$\frac{\partial L}{\partial \theta} = -(y - \hat{y})\,\sigma(z)(1 - \sigma(z))\,x$$

$$\frac{\partial L}{\partial \theta} = -(y - \hat{y})\,\hat{y}(1 - \hat{y})\,x$$

The factor $\sigma(z)(1 - \sigma(z))$ makes the gradient vanish whenever $\sigma(z)$ is close to 0 or 1, even if the prediction is completely wrong. Thus, the neural net can barely train once the sigmoid saturates.

Binary Cross Entropy (BCE) Loss

$$L(y, \hat{y}) = -y\log(\hat{y}) - (1 - y)\log(1 - \hat{y})$$

For $y = 0$:

$$\frac{\partial L}{\partial \theta} = \frac{1 - y}{1 - \hat{y}}\,\sigma(z)(1 - \sigma(z))\,x$$

$$\frac{\partial L}{\partial \theta} = \frac{1 - y}{1 - \hat{y}}\,\hat{y}(1 - \hat{y})\,x$$

$$\frac{\partial L}{\partial \theta} = (1 - y)\,\hat{y}\,x$$

$$\frac{\partial L}{\partial \theta} = \hat{y}\,x$$

If the network is right, i.e. $\hat{y} = 0$, the gradient is zero.

For $y = 1$:

$$\frac{\partial L}{\partial \theta} = -\frac{y}{\hat{y}}\,\sigma(z)(1 - \sigma(z))\,x$$

$$\frac{\partial L}{\partial \theta} = -\frac{y}{\hat{y}}\,\hat{y}(1 - \hat{y})\,x$$

$$\frac{\partial L}{\partial \theta} = -y(1 - \hat{y})\,x$$

$$\frac{\partial L}{\partial \theta} = -(1 - \hat{y})\,x$$

If the network is right, i.e. $\hat{y} = 1$, the gradient is zero.
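To tie the two derivations together, here is a small numpy sketch (with toy numbers of my own) for a confidently wrong prediction ($y = 1$ but $\sigma(z) \approx 0$): the MSE gradient is nearly zero, while the BCE gradient keeps pushing the parameters in the right direction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 1.0, 1.0          # single input feature, positive label
theta = -6.0             # confidently wrong: sigmoid(theta * x) ~ 0.0025
z = theta * x
y_hat = sigmoid(z)

grad_mse = -(y - y_hat) * y_hat * (1 - y_hat) * x   # dL/dtheta for L = 0.5 (y - y_hat)^2
grad_bce = -(1 - y_hat) * x                         # dL/dtheta for binary cross-entropy, y = 1

print(grad_mse)   # ~ -2.5e-3 -> almost no learning signal
print(grad_bce)   # ~ -1.0   -> strong, useful gradient
```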

Conclusion and References

This short analysis sheds light on the objective functions, such as cross-entropy, that we often choose to minimize without a second thought. MLE is a principled way to define an optimization problem, and I find it a common discussion topic for backing up design choices during interviews.
