Sunday, December 22, 2024

Grasping Maximum Likelihood Estimation in Supervised Learning


Demystifying the Machine Learning Modeling Process Through Statistics

In the realm of machine learning (ML), the modeling process can often seem enigmatic, especially for those who are new to the field. However, by examining this process through the lens of statistics, we can uncover the underlying principles that guide our assumptions about data and how they help us formulate meaningful optimization problems. This article will explore key concepts such as likelihood versus probability, the Independent and Identically Distributed (IID) assumption, Maximum Likelihood Estimation (MLE), and the implications of using Mean Squared Error (MSE) in binary classification.

Likelihood vs. Probability and Probability Density

To begin, it is essential to clarify the distinction between likelihood and probability. Given a dataset \( x \) and a set of possible models parameterized by \( \theta \), the relationship between data and model is expressed through a probability \( P(x, \theta) \) or a probability density function (pdf) \( p(x, \theta) \).

A pdf describes the relative density of different possible values: the probability of any single exact value is infinitesimally small, and actual probabilities are obtained by integrating the density over a region. For a given set of parameters \( \theta \), \( p(x, \theta) \), viewed as a function of \( x \), serves as the probability density function of the data.

On the other hand, the likelihood is the same quantity \( p(x, \theta) \), the joint density of the observed data, viewed as a function of the model parameters. For a fixed dataset \( x \), \( p(x = \text{fixed}, \theta) \) is a function of \( \theta \) alone, with the data held constant. The likelihood function is therefore parameter-centric: it measures how well each candidate model explains the observed data.
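To make the distinction concrete, here is a minimal sketch using a Gaussian density; the particular values of \( \mu \), \( \sigma \), and the observation are arbitrary, illustrative choices. The same expression is evaluated first as a density over \( x \) with \( \theta \) fixed, and then as a likelihood over \( \theta \) with \( x \) fixed.

```python
# Probability density vs. likelihood for a Gaussian (illustrative values only).
import numpy as np
from scipy.stats import norm

# Fixed parameters theta = (mu, sigma): p(x, theta) as a function of x
# is a probability density over possible observations.
mu, sigma = 0.0, 1.0
xs = np.linspace(-3, 3, 7)
density_over_x = norm.pdf(xs, loc=mu, scale=sigma)

# Fixed observation x: the same expression, now viewed as a function of
# theta (here, of mu) with the data held constant, is the likelihood.
x_observed = 1.2
mus = np.linspace(-3, 3, 7)
likelihood_over_mu = norm.pdf(x_observed, loc=mus, scale=sigma)

print(density_over_x)      # integrates to 1 over x
print(likelihood_over_mu)  # need not integrate to 1 over mu
```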

Notations

In our discussion, we will consider a dataset \( X = \{ \mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(m)} \} \) consisting of \( m \) data instances that follow the empirical training data distribution \( p_{\text{data}}^{\text{train}}(x) = p_{\text{data}}(x) \). This distribution serves as a representative sample of the broader, unknown data distribution \( p_{\text{data}}^{\text{real}}(x) \).

The Independent and Identically Distributed Assumption

One of the foundational assumptions in machine learning is that the data is Independent and Identically Distributed (IID): each instance is drawn independently from the same underlying distribution \( p_{\text{data}} \). Statistical independence implies that for random variables \( A \) and \( B \), the joint distribution can be factored into the product of their marginal distributions:

\[
P_{A,B}(a,b) = P_{A}(a)\, P_{B}(b)
\]

This factorization lets us write the joint distribution of the entire dataset as a product of per-example terms. Taking the logarithm then converts that product into a sum, which is far more convenient (and numerically stable) for optimization, as shown below.
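The following minimal sketch (with synthetic, illustrative data) shows the IID factorization in practice: the joint density of the dataset becomes a product of per-example densities, and the logarithm turns that product into a sum.

```python
# IID assumption: joint density = product of per-example densities;
# the log turns the product into a sum.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=100)  # assumed IID samples

mu, sigma = 2.0, 1.0  # one candidate setting of the parameters theta
per_example_density = norm.pdf(data, loc=mu, scale=sigma)

joint_likelihood = np.prod(per_example_density)          # product: shrinks toward 0, underflows for large m
log_likelihood = np.sum(norm.logpdf(data, mu, sigma))     # sum of logs: numerically stable

print(joint_likelihood, log_likelihood)
```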

Our model will have learnable parameters \( \theta \) that define another probability distribution \( p_{\text{model}}(x, \theta) \). Ideally, we want \( p_{\text{model}}(x, \theta) \approx p_{\text{data}}(x) \). The essence of machine learning lies in selecting a model that effectively exploits the assumptions and structure of the data, leading to a good inductive bias.

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is a principled method for deriving estimators that fit the data well. The goal is to choose \( \theta \) such that the likelihood of the observed data is maximized:

\[
\theta_{\text{MLE}} = \arg\max_{\theta} \; p_{\text{model}}(X, \theta)
\]

In a supervised learning context, applying the IID assumption and taking the logarithm, this can be expressed as:

\[
\theta_{\text{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\text{model}}(y^{(i)} \mid x^{(i)}, \theta)
\]

This optimization problem maximizes the likelihood of the given data, which can be interpreted as making the model distribution as close as possible to the training data distribution.
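As a minimal sketch of the MLE recipe, consider the simplest possible model: an unconditional Bernoulli distribution fit to synthetic binary data. The setup (data, parameter names, optimizer choice) is illustrative; the same recipe extends to conditional models \( p_{\text{model}}(y \mid x, \theta) \) by summing log-probabilities over \( (x, y) \) pairs.

```python
# MLE for a Bernoulli parameter: numerical optimization vs. closed form.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(42)
y = rng.binomial(n=1, p=0.3, size=500)  # observed IID binary data

def negative_log_likelihood(theta):
    # NLL of the data under Bernoulli(theta); minimizing it maximizes the likelihood
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(theta + eps) + (1 - y) * np.log(1 - theta + eps))

result = minimize_scalar(negative_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x, y.mean())  # numerical MLE vs. closed-form MLE (the sample mean)
```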

Quantifying Distribution Closeness: KL Divergence

One effective way to quantify the "closeness" between distributions is through Kullback-Leibler (KL) divergence, defined as:

\[
D_{\text{KL}}(p_{\text{data}} \,\|\, p_{\text{model}}) = \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log \frac{p_{\text{data}}(x)}{p_{\text{model}}(x, \theta)} \right]
\]

Because the entropy of \( p_{\text{data}} \) does not depend on \( \theta \), minimizing the KL divergence with respect to the parameters of our estimator is mathematically equivalent to minimizing the cross-entropy \( -\mathbb{E}_{x \sim p_{\text{data}}} \left[ \log p_{\text{model}}(x, \theta) \right] \), which is in turn equivalent to maximizing the likelihood. This relationship highlights the connection between statistical principles and optimization in machine learning.
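A quick numerical check of this equivalence, on small made-up discrete distributions: KL divergence and cross-entropy differ only by the entropy of \( p_{\text{data}} \), which is constant with respect to the model parameters.

```python
# KL divergence = cross-entropy - entropy(p_data); the entropy term is constant in theta.
import numpy as np

p_data = np.array([0.1, 0.6, 0.3])    # fixed "true" distribution
p_model = np.array([0.2, 0.5, 0.3])   # one candidate model distribution

kl = np.sum(p_data * np.log(p_data / p_model))        # D_KL(p_data || p_model)
cross_entropy = -np.sum(p_data * np.log(p_model))     # H(p_data, p_model)
entropy = -np.sum(p_data * np.log(p_data))            # H(p_data), constant in theta

print(np.isclose(kl, cross_entropy - entropy))  # True: both objectives share the same minimizer
```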

MLE in Linear Regression

In the context of linear regression, we model the conditional distribution \( p_{\text{model}}(y \mid x, \theta) \) as a normal distribution whose mean is the model's prediction \( \hat{y} \) and whose variance \( \sigma^2 \) is fixed. Under this assumption, maximizing the likelihood leads directly to the Mean Squared Error (MSE) optimization problem.

The log-likelihood can be expressed as:

\[
L = \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}, \theta) = -\frac{m}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left( y^{(i)} - \hat{y}^{(i)} \right)^2
\]

Since the first term and the factor \( \frac{1}{2\sigma^2} \) do not depend on \( \theta \), maximizing this log-likelihood with respect to the parameters is equivalent to minimizing the sum of squared errors, which is exactly the MSE formulation.
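The following minimal sketch (synthetic data, noise scale \( \sigma \) assumed known) checks this equivalence numerically: maximizing the Gaussian log-likelihood of a linear model recovers essentially the same parameters as ordinary least squares, i.e., minimizing MSE.

```python
# Gaussian MLE for linear regression vs. closed-form least squares.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])  # bias + one feature
true_theta = np.array([1.0, 2.5])
y = X @ true_theta + rng.normal(scale=0.5, size=200)

sigma = 0.5  # assumed known noise scale

def negative_log_likelihood(theta):
    residuals = y - X @ theta
    m = len(y)
    return 0.5 * m * np.log(2 * np.pi * sigma**2) + np.sum(residuals**2) / (2 * sigma**2)

theta_mle = minimize(negative_log_likelihood, x0=np.zeros(2)).x
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # closed-form least squares

print(theta_mle, theta_ols)  # essentially identical
```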

MLE in Supervised Classification

In supervised classification, we encode the ground truth as a one-hot vector (or a single label in \( \{0, 1\} \) for the binary case) so that the model can output class probabilities. For binary classification, we define the model's output as the probability of belonging to class 1:

\[
p(y = 1 \mid x, \theta) = \sigma(\theta^{T} x)
\]

Treating \( y \) as a Bernoulli random variable with parameter \( \hat{y} = \sigma(\theta^{T} x) \), the negative log-likelihood is \( -\sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right] \), i.e., the binary cross-entropy loss, which is thus derived directly from the MLE framework.
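Here is a minimal sketch of this idea: logistic regression trained by plain gradient descent on the binary cross-entropy (equivalently, the Bernoulli negative log-likelihood). The data, step size, and iteration count are illustrative choices.

```python
# Binary classification by MLE: BCE is the Bernoulli negative log-likelihood.
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(300), rng.normal(size=300)])
true_theta = np.array([-0.5, 2.0])
y = (rng.uniform(size=300) < 1 / (1 + np.exp(-(X @ true_theta)))).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

theta = np.zeros(2)
for _ in range(2000):                   # plain gradient descent on the BCE loss
    y_hat = sigmoid(X @ theta)
    grad = X.T @ (y_hat - y) / len(y)   # gradient of the mean binary cross-entropy
    theta -= 0.5 * grad

print(theta)  # roughly recovers the parameters that generated the labels
```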

Bonus: What Would Happen If We Use MSE on Binary Classification?

A common interview question is: what happens if we use MSE for binary classification? With a sigmoid output, the gradient of the MSE loss with respect to the logit contains the factor \( \sigma(z)(1 - \sigma(z)) \), which vanishes as the output approaches the extremes of 0 or 1, even when the prediction is confidently wrong. This can hinder the training process, making it difficult for the model to learn effectively.

In contrast, the gradient of the binary cross-entropy loss with respect to the logit is simply \( \hat{y} - y \), so the model receives a strong, informative gradient whenever its prediction is wrong, even when the sigmoid output is saturated.
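A short numerical comparison (illustrative logit values, label fixed at \( y = 1 \)) makes the difference visible: the MSE gradient collapses when the sigmoid saturates on a wrong prediction, while the cross-entropy gradient stays close to \( -1 \).

```python
# Gradients of MSE vs. binary cross-entropy w.r.t. the logit z, for y = 1.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

y = 1.0
for z in [-8.0, -4.0, 0.0]:
    y_hat = sigmoid(z)
    # d/dz of 0.5 * (y_hat - y)^2 contains the sigmoid derivative, which vanishes
    grad_mse = (y_hat - y) * y_hat * (1 - y_hat)
    # d/dz of binary cross-entropy is simply (y_hat - y): no vanishing factor
    grad_bce = y_hat - y
    print(f"z={z:+.1f}  MSE grad={grad_mse:+.6f}  BCE grad={grad_bce:+.6f}")
```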

Conclusion

This exploration of the machine learning modeling process through the lens of statistics reveals the importance of understanding the underlying principles that guide our assumptions about data. By leveraging concepts such as likelihood, MLE, and KL divergence, we can formulate meaningful optimization problems that enhance our modeling efforts. Understanding these principles not only aids in the design of effective models but also serves as a valuable discussion point in interviews and professional settings.

If you found this article informative, consider supporting our work to continue providing valuable insights into machine learning and data science.
