Variational Inference I 🤖

In the next few posts I'm going to talk about probabilistic machine learning. Over the last few months I have been learning about this field, specifically about a type of inference on these models known as Variational Inference, and I think it could be interesting to write a series of posts summarizing my experience, with thanks to my teacher Joan Capdevila.

Probabilistic Machine Learning

Over the last decades, research on machine learning has produced a wide variety of algorithms that solve a large set of problems, covering areas such as autonomous driving, medical diagnosis, speech recognition or user ranking for marketing campaigns. These algorithms are mainly based on the construction of a model that describes the data as closely as possible.

A model is a compact description of a set of data that allows us to make predictions about future samples. The main difference between a conventional machine learning model and a probabilistic one is that the latter models uncertainty; in other words, it tells us how likely a prediction is to be fulfilled. This can be very valuable in areas such as medicine or economics, where the risk of making a bad decision can be detrimental to a person's health or lead to financial losses.

Probabilistic Machine Learning situation
Where probabilistic machine learning sits within the artificial intelligence field

These models use probability theory to encode prior information, so the algorithm is not based solely on sample data. They make it possible to combine different datasets during learning (which is useful when we have a small quantity of data) and to define complex models with as many random variables as required. They also support online learning: you don't need to retrain the full model each time you obtain new data, you just need to update some probabilities. They are also very useful in decision-making, when a robust explanation of a model is required. Finally, they are generative models: thanks to the distributions they infer, they can generate new data by simulating values of any random variable of the model.

Unlike discriminative models, which only model the probability of the variable to be predicted, a generative model is a complete model that uses all the variables (observed and latent), allowing multiple questions to be answered.

The construction of this type of model with latent variables follows the Box's Loop philosophy, named after the statistician George E. P. Box. The loop is iterated several times during the construction of a probabilistic model.

Box loop
Graphical scheme of Box's Loop
  • First, a probabilistic model is built based on the knowledge we already have about the environment.
  • Then, patterns are discovered by applying an inference method to the previously defined model.
  • Finally, the model is validated. If it is not good enough, we go back to step 1; otherwise, it is used to describe or predict new data.

Bayesian inference

Bayesian inference tries to reveal the hidden structure in data that cannot be directly observed. In traditional machine learning methods, parameters are values determined by optimization algorithms that minimize an error function. The Bayesian point of view is a little different: for a Bayesian, all unknown parameters are described by probability distributions, and observing evidence allows us to update these distributions using Bayes' rule.

Bayes rule

The first thing you need to understand about Bayesian inference is Bayes' rule. Bayes' rule indicates how the prior probability of an event has to be updated after observing evidence about it. From a Bayesian point of view there is no difference between parameters and observed variables: both are random variables. I'm going to use x to refer to observed variables and θ to refer to latent variables.

Bayes' rule:

p(θ|x) = p(x|θ) p(θ) / p(x)

The following explains each term of the formula, where x and θ are the data and the model parameters respectively:

  • Posterior p(θ|x): the probability of the parameters given the data, i.e., how likely it is that a model with parameters θ has generated the data x.
  • Likelihood p(x|θ): the probability of the data assuming it is modeled by a distribution parametrized by θ. How it is calculated depends on the model. It is usually used to assess the quality of a model.
  • Prior p(θ): the probability of the parameters. This factor of the formula is where prior knowledge is reflected: the information we have before observing any data.
  • Evidence p(x): the marginal probability of the data, calculated as ∫ p(x,θ)dθ. It usually can't be computed, but since it is a normalizing constant of the model this does not matter: when two parameter configurations are compared, the important factors are the ones that depend on θ, because p(x) is the same in both cases, as it only depends on the data.

The product p(x|θ) p(θ) is also known as the joint probability, p(θ,x).
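To see how the terms fit together, here is a minimal sketch (my own illustration, not from the original material) that computes the evidence and the posterior on a grid for a one-dimensional θ; the Normal prior and unit-variance likelihood are assumptions chosen for simplicity:

```python
import numpy as np

# Bayes' rule on a grid: for a 1-D parameter the evidence
# p(x) = ∫ p(x|θ) p(θ) dθ can be computed numerically, which lets us
# normalize the posterior exactly.
theta = np.linspace(-5, 5, 1001)                      # grid over θ
prior = np.exp(-0.5 * theta**2) / np.sqrt(2 * np.pi)  # p(θ) = N(0, 1)

x = np.array([1.2, 0.7, 1.5])                         # toy observed data
# Likelihood p(x|θ) with unit observation noise, evaluated on the grid
# (constant factors are dropped; they cancel in the normalization).
lik = np.prod(np.exp(-0.5 * (x[:, None] - theta) ** 2), axis=0)

joint = lik * prior                                   # p(x|θ) p(θ) = p(θ,x)
evidence = np.sum(joint) * (theta[1] - theta[0])      # p(x) by quadrature
posterior = joint / evidence                          # p(θ|x), integrates to 1
print(np.sum(posterior) * (theta[1] - theta[0]))      # ≈ 1.0
```

This only works because θ is one-dimensional; the same integral over many latent variables is exactly what becomes intractable later on.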

Using the likelihood to estimate the parameters θ is known as Maximum Likelihood Estimation (MLE), while taking the prior into account as well is known as Maximum A Posteriori (MAP) estimation. MLE and MAP are equivalent under a uniform prior. However, these methods only provide point estimates of the posterior (such as its mode), and maybe your goal requires modeling uncertainty or generating new data. In those cases we need to know the full posterior distribution. As we will see, methods such as Variational Inference (VI) and Markov Chain Monte Carlo (MCMC) allow us to infer this distribution; taking the normalizing constant p(x) of Bayes' rule into account is what allows these methods to recover a full posterior distribution.
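To make the MLE/MAP distinction concrete, here is a minimal sketch for a Bernoulli coin-flip model; the data and the Beta(2, 2) prior are assumptions chosen for illustration:

```python
import numpy as np

# Estimate the bias θ of a coin from observed tosses (1 = heads).
x = np.array([1, 1, 0, 1, 1, 0, 1, 1])
heads, n = x.sum(), len(x)

# MLE: maximize p(x|θ); for a Bernoulli model it is the sample frequency.
theta_mle = heads / n

# MAP with a Beta(a, b) prior: the mode of the posterior Beta(a + heads, b + tails).
a, b = 2.0, 2.0
theta_map = (heads + a - 1) / (n + a + b - 2)

# With a uniform prior, Beta(1, 1), MAP reduces to MLE.
theta_map_uniform = (heads + 1 - 1) / (n + 1 + 1 - 2)

print(theta_mle, theta_map, theta_map_uniform)  # 0.75 0.7 0.75
```

Both are single numbers: neither tells us how confident we should be about θ, which is what the full posterior provides.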

A summary of this formula would be: at the beginning we have a belief (prior) about an event θ, p(θ); for example, that the height of Barcelona's population is described by a Normal distribution. Then we observe evidence x: a sample of heights from Barcelona's population. With this evidence our belief about θ has to be updated; in other words, the Normal distribution describing the height of Barcelona's population has to be updated. This change is reflected in the posterior distribution p(θ|x). This example also shows the support for online learning that this kind of model offers.

Online learning

In the end, it is an iterative process of updating beliefs (priors) based on evidence (x), where the posterior of one iteration becomes the prior of the next one.
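The Barcelona heights example can be sketched in a few lines; here the conjugate Normal-Normal model with known observation noise gives a closed-form update, and the prior values and noise level are assumptions for illustration:

```python
import numpy as np

# Sequentially infer the mean height θ; the posterior after each batch
# becomes the prior for the next one.
obs_var = 7.0 ** 2          # known variance of individual heights (cm²)
mu, var = 170.0, 10.0 ** 2  # prior belief: θ ~ N(170, 10²)

rng = np.random.default_rng(0)
for batch in np.split(rng.normal(175.0, 7.0, size=90), 3):
    n = len(batch)
    # Closed-form posterior for a Normal likelihood with known variance.
    post_var = 1.0 / (1.0 / var + n / obs_var)
    post_mu = post_var * (mu / var + batch.sum() / obs_var)
    mu, var = post_mu, post_var  # today's posterior is tomorrow's prior
    print(f"θ ~ N({mu:.2f}, {var:.3f})")
```

Each print shows the belief tightening around the true mean, without ever reprocessing earlier batches.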

Posterior inference algorithms let us analyze data under certain assumptions (priors), discovering the hidden structure that best fits our data. When all the relations between the model's random variables are conjugate, that is, when the posterior has the same functional form as the prior, the posterior can be calculated analytically. This is the case for models like Dirichlet-Categorical or Normal-Inverse-Wishart/Normal. Otherwise, the problem with this formula lies in the calculation of the evidence p(x): for many models of interest (non-conjugate ones), computing the posterior is intractable because of the integral over all latent variables required to calculate the evidence. For these models another strategy is required, and the calculation of the posterior becomes an approximation problem.
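As an example of how cheap the conjugate case is, here is the Dirichlet-Categorical update; the prior concentrations and the counts are toy assumptions:

```python
import numpy as np

# Dirichlet-Categorical conjugacy: the posterior is again a Dirichlet,
# obtained by adding the observed counts to the prior concentrations.
alpha_prior = np.array([1.0, 1.0, 1.0])  # symmetric Dirichlet prior
counts = np.array([12, 5, 3])            # observed category counts
alpha_post = alpha_prior + counts        # closed-form posterior parameters
print(alpha_post / alpha_post.sum())     # posterior mean of category probs
```

No integral is ever evaluated; this is exactly what is lost in the non-conjugate case.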

Posterior approximation

Probabilistic machine learning uses latent variables to describe the hidden structure of the data: the relations between observed and latent variables are modeled using probability distributions, and inference algorithms are used to estimate the posterior, that is, the conditional distribution of the latent variables given the observed ones. Since in most cases we are working in high-dimensional spaces, computing expectations with respect to the posterior p(θ|x) is analytically and computationally intractable, which is why inference methods have been created to approximate this distribution. Approximating the posterior is one of the central problems in Bayesian statistics, and the concept of Bayesian inference encompasses the set of tools developed for it. Nowadays there are two main algorithmic branches:

On the one hand, MCMC is based on the construction of a Markov chain over all the latent variables whose stationary distribution is the posterior. The chain is run until it reaches equilibrium; the samples obtained from the Markov chain in its stationary regime are then posterior samples. The best-known algorithms of this family are Metropolis-Hastings, Gibbs sampling and Hamiltonian Monte Carlo.
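To get a feel for this branch, here is a minimal random-walk Metropolis-Hastings sketch; the Normal prior/likelihood target and the proposal width are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=50)  # toy observed data

def log_joint(theta):
    # log p(θ,x) = log p(θ) + log p(x|θ); the evidence p(x) is never
    # needed, since it cancels in the acceptance ratio.
    log_prior = -0.5 * theta ** 2               # θ ~ N(0, 1)
    log_lik = -0.5 * np.sum((x - theta) ** 2)   # xᵢ ~ N(θ, 1)
    return log_prior + log_lik

samples, theta = [], 0.0
for _ in range(5000):
    proposal = theta + rng.normal(0.0, 0.5)     # symmetric random walk
    # Accept with probability min(1, p(proposal, x) / p(θ, x)).
    if np.log(rng.uniform()) < log_joint(proposal) - log_joint(theta):
        theta = proposal
    samples.append(theta)

print(np.mean(samples[1000:]))  # ≈ posterior mean, after discarding burn-in
```

Note that the chain only ever evaluates the unnormalized joint, which is how MCMC sidesteps the intractable evidence.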

On the other hand, Variational Inference approximates the posterior with an analytical approximation, the variational model, which is adjusted in order to reduce its distance to the posterior. In this family the inference problem is transformed into an optimization one.

Probabilistic Graphical Models

In the Bayesian setting, a model represents a joint probability distribution over all the random variables of the problem.

Joint distribution:

p(x, y) = p(x|y) p(y)

A very practical way of representing these models is with probabilistic graphical models. A probabilistic graphical model is a directed graph where nodes are random variables and edges are dependency relations between those variables. For example, the probabilistic graphical model of the joint distribution above is:

Probabilistic graphical model of the joint distribution

In this context, p(x|y) represents the conditional probability of the variable x given the value of y. In this kind of diagram there is also another component, called a plate.

Plate

This notation indicates a vector of n random variables x.

Local and global variables

In a probabilistic model, two types of random variables can be distinguished: global and local. A global variable is shared by all the examples in the dataset, while a local variable belongs to a single example. For example, in the following probabilistic graphical model, the variable y is local while the variable z is global. When a node appears shaded, it means that it is an observed variable.

Model example

Variational inference

In the next posts we will focus on the different variational strategies for posterior approximation.

Strategy

As already mentioned, Variational Inference consists in defining a distribution q(θ|λ) whose parameters λ are optimized in order to reduce the difference with the posterior p(θ|x). This new distribution is known as the variational model, and the posterior as the probabilistic model. To summarize, the goal of VI is to optimize the λ parameters of the variational model in order to reduce its distance to the probabilistic model p(θ|x). The λ parameters are also known as variational parameters.

Variational inference
Kullback-Leibler divergence

Using the Euclidean distance between distribution parameters to establish the similarity between two distributions is an imperfect measure, since we are comparing distributions and not points. Imagine a Normal distribution with mean 0 and variance 5, N(0, 5), and another one with mean 5 and variance 5, N(5, 5). These two distributions are very similar in shape, but their parameters are separated by a Euclidean distance of 5. If we now compare the first Normal distribution, N(0, 5), with a Normal distribution with mean 2 and variance 7, N(2, 7), they look more different, yet the Euclidean distance between their parameters is only about 2.8. For this reason we have to use another measure: the Kullback-Leibler divergence (KL).

KL is a divergence, in other words, a non-symmetric distance: computing KL[p(θ|x) || q(θ|λ)] (forward KL) is not the same as computing KL[q(θ|λ) || p(θ|x)] (reverse KL). Using one or the other gives rise to different algorithms: Expectation Propagation uses forward KL, while VI uses reverse KL. In general terms, Expectation Propagation is more computationally expensive. KL quantifies the information lost when you approximate one distribution with another. It is based on the concept of entropy, which measures the amount of information data carries and is defined as follows:

Entropy:

H(p) = −∑ p(x) log p(x)

The KL definition consists in modifying the entropy formula to take a second distribution q into account.

Kullback-Leibler divergence:

KL[p || q] = ∑ p(x) log ( p(x) / q(x) )

If we adapt this formula to the VI problem and apply the properties of logarithms, we have:

KL[q(θ|λ) || p(θ|x)] = E_q[log q(θ|λ)] − E_q[log p(θ,x)] + log p(x)

This divergence lets us measure the real similarity between two probability distributions, and it is the measure minimized by the VI algorithm.
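Coming back to the earlier Normal distributions, the KL between univariate Gaussians has a closed form, which makes both points (parameter distances mislead; KL is asymmetric) easy to check; this is a toy sketch, not part of the original material:

```python
import numpy as np

def kl_gauss(mu0, var0, mu1, var1):
    # KL[N(mu0, var0) || N(mu1, var1)] in closed form.
    return 0.5 * (np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1)

print(kl_gauss(0, 5, 5, 5))  # ≈ 2.50: far apart despite equal variances
print(kl_gauss(0, 5, 2, 7))  # ≈ 0.31: much closer, despite "different" parameters
print(kl_gauss(2, 7, 0, 5))  # ≈ 0.43: swapping arguments changes the value
```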

Kullback-Leibler divergence
Comparison of forward and reverse KL when approximating a bimodal distribution. The blue curve is the distribution to be approximated and the red one is the approximation. Image (a) is an approximation with forward KL; (b) and (c) are approximations with reverse KL.
Kullback-Leibler visualization
Comparison of forward and reverse KL when approximating a unimodal distribution. The blue curve is the distribution to be approximated and the red one is the approximation. Image (a) is an approximation with forward KL and (b) with reverse KL.
Mean-Field assumption

In order to define a tractable distribution over all latent variables to approximate the posterior, we can simplify the optimization of the variational model by assuming that it factorizes. That is, we suppose that q(θ|λ) is composed of a set of distributions qi(θi|λi) (from the exponential family), each with its own parameters λi, which can be optimized individually.

Mean-Field assumption:

q(θ|λ) = ∏ᵢ qᵢ(θᵢ|λᵢ)
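For Gaussians the effect of this factorization can be worked out in closed form; the sketch below (toy numbers, my own illustration) approximates a correlated 2-D Gaussian "posterior" with a factorized q, where minimizing reverse KL matches the mean and sets each factor's precision to the corresponding diagonal entry of the target's precision matrix:

```python
import numpy as np

# Target posterior: a correlated 2-D Gaussian.
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])   # covariance of p(θ|x)
Lambda = np.linalg.inv(Sigma)    # precision matrix

# Optimal mean-field q(θ|λ) = q1(θ1|λ1) q2(θ2|λ2) under reverse KL.
q_var = 1.0 / np.diag(Lambda)
print(q_var)                     # [0.36 0.36], narrower than the true 1.0
```

The factorized approximation underestimates the marginal variances, the same behavior shown by the reverse-KL panels in the figures above.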

The main goal of this post was simply to present the basic idea of Variational Inference and its main players. In the following posts we will go deeper into these algorithms, and programming tools to code these models will be presented.

References

  • Journal of the American Statistical Association (George E. P. Box)
  • Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models (David M. Blei)
  • Probabilistic graphical models: principles and techniques (Koller, Daphne, and Nir Friedman)
  • Model-based Machine Learning (Christopher M. Bishop)
  • Machine Learning. A probabilistic perspective (Kevin P. Murphy)

Write a comment! 😀
