Understanding Latent Variable Models and Variational Autoencoders
In recent years, the field of machine learning has witnessed a significant shift towards generative models and unsupervised learning techniques. Among the most prominent architectures in this domain are Generative Adversarial Networks (GANs) and Latent Variable Models (LVMs). This article delves into the workings of latent variable models and their core principles, and then derives one of their most popular representatives: the Variational Autoencoder (VAE).
Discriminative vs Generative Models
Machine learning models can generally be categorized into two types: discriminative and generative models. This distinction is rooted in the probabilistic formulations used to build and train these models.
Discriminative Models
Discriminative models focus on learning the conditional probability of a label ( y ) given a data point ( x ), denoted as ( p(y | x) ). The goal here is to establish a mapping between the data and the classes, which can be represented as a probability distribution. In this framework, each label competes with others for probability density over a specific data point.
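As a minimal illustration of this competition (a toy sketch, not tied to any particular dataset or model), a softmax over class scores for a single input produces a distribution over labels that sums to one:
import tensorflow as tf
# Hypothetical unnormalized scores for three labels given one data point x
logits = tf.constant([[2.0, 0.5, -1.0]])
p_y_given_x = tf.nn.softmax(logits, axis=-1)  # p(y | x): the labels share one unit of probability
print(p_y_given_x.numpy(), float(tf.reduce_sum(p_y_given_x)))  # probabilities sum to 1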
Generative Models
Conversely, generative models learn the probability distribution over the data points themselves, without relying on external labels. This is mathematically expressed as ( p(x) ). In this case, the data points compete for probability density, allowing the model to generate new data points by sampling from the learned distribution.
Conditional Generative Models
A subset of generative models, known as conditional generative models, learns the distribution of data ( x ) conditioned on labels ( y ), represented as ( p(x | y) ). Here, the data competes for density, but within the context of specific labels.
The Interconnectedness of Models
It’s important to note that these model types are interconnected through Bayes’ rule, which allows for the construction of one type of model from another. However, this article will focus specifically on generative models, leading us to the derivation of the Variational Autoencoder.
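Concretely, Bayes' rule ties the three formulations together; for instance, a conditional generative model ( p(x | y) ) combined with a label prior ( p(y) ) yields a discriminative model:
[
p(y | x) = \frac{p(x | y)\, p(y)}{p(x)}, \qquad p(x) = \sum_{y} p(x | y)\, p(y)
]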
Generative Models
The primary objective of generative models is to learn the probability density function ( p(x) ), which describes the behavior of training data and enables the generation of novel data by sampling from the distribution. Ideally, we want our model to learn a probability density ( p(x) ) that closely resembles the true data distribution ( p_{\text{data}}(x) ).
Explicit vs Implicit Density Models
Generative models can be classified into two categories:
- Explicit Density Models: These models can compute the density function ( p(x) ) explicitly. After training, they can output the likelihood of a given data point.
- Implicit Density Models: These models do not compute ( p(x) ) directly but can sample from the underlying distribution once trained. Generative Adversarial Networks (GANs) are a prime example of implicit density models.
Latent Variable Models
Latent variable models aim to model the probability distribution of the data through latent variables: unobserved representations of the data points that live in a continuous, typically lower-dimensional space. These latent variables simplify the representation of the data.
Key Terms in Latent Variable Models
- Prior Distribution ( p(z) ): Models the behavior of latent variables.
- Likelihood ( p(x | z) ): Defines how latent variables map to data points.
- Joint Distribution ( p(x, z) = p(x | z)p(z) ): Describes the model by combining likelihood and prior.
- Marginal Distribution ( p(x) ): Represents the distribution of the original data.
- Posterior Distribution ( p(z | x) ): Describes the latent variables produced by a specific data point.
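These quantities are tied together by marginalization and Bayes' rule: the marginal is obtained by integrating the joint over the latent variables, and the posterior follows from the likelihood and the prior:
[
p(x) = \int p(x | z)\, p(z)\, dz, \qquad p(z | x) = \frac{p(x | z)\, p(z)}{p(x)}
]
The integral over all latent configurations is what makes the marginal, and therefore the posterior, hard to compute in general.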
Generation and Inference
- Generation: To generate a data point, we sample ( z ) from ( p(z) ) and then sample ( x ) from ( p(x | z) ) (a toy sketch follows this list).
- Inference: To infer the latent variable for a given data point ( x ), we sample ( z ) from the posterior ( p(z | x) ).
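The generation step is just ancestral sampling. The sketch below is a toy illustration, assuming a standard Gaussian prior and a hypothetical linear decoder standing in for ( p(x | z) ):
import tensorflow as tf
latent_dim, data_dim = 2, 5
W = tf.random.normal([latent_dim, data_dim])        # hypothetical decoder weights
z = tf.random.normal([1, latent_dim])               # z ~ p(z) = N(0, I)
x_mean = tf.matmul(z, W)                            # decoder maps z to the mean of p(x | z)
x = x_mean + 0.1 * tf.random.normal([1, data_dim])  # x ~ N(x_mean, 0.1^2 I)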
The fundamental challenge in latent variable models is determining how to find these distributions.
Training a Latent Variable Model with Maximum Likelihood
Maximum likelihood estimation (MLE) is a technique for estimating the parameters of a probability distribution so that it fits the observed data. The likelihood function measures the goodness of fit of a statistical model to a sample of data, and the maximum-likelihood parameters are obtained by maximizing the log-likelihood over the training set:
[
\theta_{\text{ML}} = \arg\max_{\theta} \sum_{i=1}^{N} \log p_{\theta}(x_i)
]
However, to apply gradient descent for MLE, we need to compute the gradient of the marginal log-likelihood function, which requires knowledge of the posterior distribution ( p(z | x) ). This leads us back to the inference problem.
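To see why, note that the gradient of the marginal log-likelihood can be written as an expectation over the posterior:
[
\nabla_{\theta} \log p_{\theta}(x) = \mathbb{E}_{p_{\theta}(z | x)} \left[ \nabla_{\theta} \log p_{\theta}(x, z) \right]
]
so estimating it by Monte Carlo requires samples from ( p_{\theta}(z | x) ).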
Computing the Posterior Distribution: Solving the Inference Problem
Inference problems can be categorized into tractable and intractable models. While some models allow for closed-form solutions, many do not, necessitating approximate inference methods.
Variational Inference
Variational inference approximates the intractable posterior distribution with a tractable one, referred to as the variational posterior ( q_{\phi}(z | x) ). The goal is then to maximize a lower bound on the marginal log-likelihood, known as the Evidence Lower Bound (ELBO):
[
L_{\theta, \phi}(x) = \mathbb{E}_{q_{\phi}(z | x)} \left[ \log \frac{p_{\theta}(x, z)}{q_{\phi}(z | x)} \right] \leq \log p_{\theta}(x)
]
This formulation allows us to maximize the lower bound with respect to both the model parameters ( \theta ) and the variational parameters ( \phi ).
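Expanding the joint as ( p_{\theta}(x, z) = p_{\theta}(x | z) p(z) ) gives the form of the ELBO that is usually implemented in practice, a reconstruction term plus a KL regularizer:
[
L_{\theta, \phi}(x) = \mathbb{E}_{q_{\phi}(z | x)} \left[ \log p_{\theta}(x | z) \right] - D_{KL}\left( q_{\phi}(z | x) \,\|\, p(z) \right)
]
The gap between the ELBO and ( \log p_{\theta}(x) ) is exactly ( D_{KL}(q_{\phi}(z | x) \,\|\, p_{\theta}(z | x)) ), so maximizing the bound with respect to ( \phi ) also pushes the variational posterior towards the true one.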
Amortized Variational Inference
To streamline the inference process, amortized variational inference employs an external neural network to predict the variational parameters for each data point, rather than optimizing ELBO for each instance. This inference network is trained simultaneously with the main model.
Computing the Gradient of ELBO
Maximizing the ELBO requires computing gradients with respect to both model and variational parameters. For the model parameters, we can use Monte Carlo sampling to estimate the gradients. For the variational parameters, we employ the reparameterization trick, which expresses samples from the variational posterior as a deterministic, differentiable function of the parameters and an auxiliary noise variable.
Reparameterization Trick
The reparameterization trick involves transforming a sample from a fixed, known distribution into a sample from ( q_{\phi}(z | x) ). For instance, if we assume a Gaussian distribution, we can express ( z ) as:
[
z = \mu + \sigma \epsilon \quad \text{where } \epsilon \sim N(0, 1)
]
This allows us to compute gradients and perform backpropagation effectively.
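A minimal TensorFlow sketch of this transformation, assuming a diagonal Gaussian variational posterior parameterized by a mean and a log-variance (the convention used by the encoder below); a helper of this form is what model.reparameterize is assumed to do in the training code later on:
import tensorflow as tf
def reparameterize(mean, logvar):
    # Draw epsilon from a fixed N(0, I) and shift/scale it into a sample from q(z | x)
    eps = tf.random.normal(shape=tf.shape(mean))
    return mean + tf.exp(0.5 * logvar) * eps  # sigma = exp(0.5 * logvar)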
Variational Autoencoders (VAEs)
Now that we have established the foundational concepts, we can construct the Variational Autoencoder. The VAE consists of two neural networks: the Encoder (inference network) and the Decoder (generative model).
Encoder
The Encoder parameterizes the variational posterior ( q_{\phi}(z | x) ):
# Inference network (encoder): maps a 28x28x1 image to the parameters of q(z | x)
self.encoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(filters=32, kernel_size=3, strides=(2, 2), activation='relu'),
    tf.keras.layers.Conv2D(filters=64, kernel_size=3, strides=(2, 2), activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(latent_dim + latent_dim),  # mean and log-variance, concatenated
])
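The final Dense layer outputs latent_dim + latent_dim values because it predicts both the mean and the log-variance of the variational posterior. A small helper (assumed here, matching the model.encode call used in the loss below) splits them apart:
def encode(self, x):
    # The encoder outputs [mean, logvar] concatenated along the feature axis
    mean, logvar = tf.split(self.encoder(x), num_or_size_splits=2, axis=1)
    return mean, logvar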
Decoder
The Decoder parameterizes the likelihood ( p_{\theta}(x | z) ):
# Generative network (decoder): maps a latent vector z to the logits of p(x | z)
self.decoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(latent_dim,)),
    tf.keras.layers.Dense(units=7*7*32, activation=tf.nn.relu),
    tf.keras.layers.Reshape(target_shape=(7, 7, 32)),
    tf.keras.layers.Conv2DTranspose(filters=64, kernel_size=3, strides=2, padding='same', activation='relu'),
    tf.keras.layers.Conv2DTranspose(filters=32, kernel_size=3, strides=2, padding='same', activation='relu'),
    tf.keras.layers.Conv2DTranspose(filters=1, kernel_size=3, strides=1, padding='same'),  # no activation: outputs logits
])
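Both networks are assumed to live inside a small tf.keras.Model subclass. Since the last decoder layer has no activation, it returns logits; a sketch of a decode helper (matching the model.decode call in the loss below) that optionally squashes them into pixel probabilities:
def decode(self, z, apply_sigmoid=False):
    logits = self.decoder(z)
    if apply_sigmoid:
        return tf.sigmoid(logits)  # map logits to pixel probabilities in [0, 1]
    return logits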
Training the VAE
The VAE is trained by maximizing the ELBO, which consists of two terms: the reconstruction error and the KL divergence between the variational posterior and the prior.
def compute_loss(model, x):
    mean, logvar = model.encode(x)
    z = model.reparameterize(mean, logvar)
    x_logit = model.decode(z)
    # Reconstruction term: Bernoulli log-likelihood, computed from logits for numerical stability
    cross_ent = tf.nn.sigmoid_cross_entropy_with_logits(logits=x_logit, labels=x)
    marginal_likelihood = -tf.reduce_sum(cross_ent, axis=[1, 2, 3])
    # KL divergence between q(z | x) = N(mean, exp(logvar)) and the prior p(z) = N(0, I), in closed form
    KL_divergence = 0.5 * tf.reduce_sum(tf.exp(logvar) + tf.square(mean) - 1.0 - logvar, axis=1)
    ELBO = tf.reduce_mean(marginal_likelihood - KL_divergence)
    return -ELBO  # minimizing the negative ELBO maximizes the bound
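A single training step then minimizes the negative ELBO with a stochastic gradient optimizer. The sketch below assumes the model has already been constructed and uses Adam, a common choice for VAEs:
import tensorflow as tf
optimizer = tf.keras.optimizers.Adam(1e-4)
@tf.function
def train_step(model, x, optimizer):
    # One gradient step on the negative ELBO
    with tf.GradientTape() as tape:
        loss = compute_loss(model, x)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss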
Generating New Data Points
To generate new data points, we sample from the prior distribution and pass the latent variables through the decoder:
sample = tf.random.normal(shape=[num_examples_to_generate, latent_dim])  # z ~ p(z) = N(0, I)
generated_images = tf.sigmoid(model.decode(sample))                      # decoder logits -> pixel probabilities
Conclusion
In this article, we explored the intricacies of latent variable models and detailed the formulation of Variational Autoencoders. Understanding these concepts requires a solid foundation in probability and statistics, but the insights gained are invaluable for anyone looking to delve into generative modeling. For those interested in further exploration, consider checking out additional resources on VAEs and related topics.
References
- Kingma, D., & Welling, M. (2013). Auto-Encoding Variational Bayes.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- Johnson, J. (2020). Deep Learning for Computer Vision. University of Michigan.
- Mnih, A. (2020). DeepMind x UCL, Deep Learning Lectures.
- Weng, L. (2018). From Autoencoder to Beta-VAE.
- Jordan, J. (2018). Variational Autoencoders.
- Rocca, J. (2019). Understanding Variational Autoencoders (VAEs).
- Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786), 504-507.
- Blei, D., Kucukelbir, A., & McAuliffe, J. (2018). Variational Inference: A Review for Statisticians.