Variational Inference II ⛰️

In this post I'm going to continue explaining concepts that I introduced in my previous post about Variational Inference (VI). First, I'm going to show that finding the best posterior approximation requires minimizing the Kullback-Leibler (KL) divergence between the variational model q(θ|λ) and the true posterior p(θ|x). Starting from Bayes' rule we have:

ELBO derivation
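A standard way to write this derivation, in the same notation as above, is the following sketch:

```latex
\begin{aligned}
\log p(x) &= \mathbb{E}_{q(\theta|\lambda)}\!\left[\log p(x)\right]
           = \mathbb{E}_{q}\!\left[\log \frac{p(x,\theta)}{p(\theta|x)}\right] \\
          &= \underbrace{\mathbb{E}_{q}\!\left[\log \frac{p(x,\theta)}{q(\theta|\lambda)}\right]}_{\mathrm{ELBO}(q(\theta|\lambda),\,p(x,\theta))}
           + \underbrace{\mathbb{E}_{q}\!\left[\log \frac{q(\theta|\lambda)}{p(\theta|x)}\right]}_{\mathrm{KL}\left[q(\theta|\lambda)\,\|\,p(\theta|x)\right]}
\end{aligned}
```

Since the KL term is non-negative and log p(x) does not depend on λ, pushing the ELBO up necessarily pushes the KL down.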

This expression gives us a more convenient way to work with the evidence of the model, where each factor is:

ELBO explanation

With this derivation we see that minimizing KL[q(θ|λ)||p(θ|x)] is equivalent to maximizing ELBO(q(θ|λ), p(x,θ)), which is easier to evaluate. An intractable integral has been transformed into an expectation over a known distribution.

ELBO
Maximizing the ELBO is equivalent to minimizing the distance between the variational model and the probabilistic model.

Variational inference uses the ELBO as the algorithm's stopping condition. When several iterations pass without the ELBO increasing, it means that good values for the variational parameters λ have been found: values of λ that bring the variational model closer to the probabilistic model (the posterior). We can rewrite the ELBO as:

Rewritten ELBO

Where each factor is:

ELBO factors

Now, if we take into account the Mean-Field assumption discussed in the previous post, the ELBO ends up as follows:

ELBO taking into account Mean Field assumption
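One standard way to write this decomposition and its Mean-Field version is the following sketch: the expectation of the joint log-probability minus the expectation of the log variational model, which factorizes under Mean-Field.

```latex
\begin{aligned}
\mathrm{ELBO}(q) &= \mathbb{E}_{q}\!\left[\log p(x,\theta)\right]
                  - \mathbb{E}_{q}\!\left[\log q(\theta|\lambda)\right] \\
q(\theta|\lambda) = \prod_i q(\theta_i|\lambda_i)
\;\Longrightarrow\;
\mathrm{ELBO}(q) &= \mathbb{E}_{q}\!\left[\log p(x,\theta)\right]
                  - \sum_i \mathbb{E}_{q_i}\!\left[\log q(\theta_i|\lambda_i)\right]
\end{aligned}
```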

Variational Inference algorithm

Variational Inference algorithm

This algorithm captures the basic idea of VI. In practice, more things must be taken into account for it to work correctly. First, the variational model may contain both local and global variables, and each kind has to be updated in a specific way. You also have to choose an inference method: Coordinate Ascent, Gradient Ascent, Stochastic Gradient Descent, ... The chosen inference method doesn't change the basic structure of the VI algorithm; it only changes how the new values λ of the variational model are obtained in each iteration.
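As a minimal sketch of that basic structure (the compute_elbo and update_lambda functions here are hypothetical placeholders; the real ones depend on the model and on the chosen inference method):

```python
def variational_inference(data, lam_init, update_lambda, compute_elbo,
                          tol=1e-4, max_iters=1000):
    """Generic VI loop: update lambda until the ELBO stops increasing."""
    lam = lam_init
    elbo_old = -float("inf")
    for _ in range(max_iters):
        lam = update_lambda(lam, data)      # CAVI, GAVI, SGD, ... plug in here
        elbo_new = compute_elbo(lam, data)  # ELBO(q(theta|lambda), p(x, theta))
        if abs(elbo_new - elbo_old) < tol:  # stopping condition on the ELBO
            break
        elbo_old = elbo_new
    return lam
```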

Coordinate Ascent Variational Inference

The most traditional method for the inference of probabilistic models is Coordinate Ascent Variational Inference (CAVI). Implementing this kind of inference requires knowledge of Bayesian statistics, because closed-form analytical formulas have to be derived for the update of each variational parameter λ and for the ELBO. As previously mentioned, when the model is fully conjugate (such as the Dirichlet-Categorical model or the Normal-Inverse-Wishart-Normal model) the posterior can be computed analytically, that is, without the need to approximate it. However, if we have a vast amount of data this analytical computation becomes impracticable due to the operations with very large matrices in memory. For this reason, in that case and in the case of non-conjugate models, approximating the posterior with VI is a good option.

The derivation of the analytical updates for the variational model parameters can be done in two ways: generic derivation or using the properties of the Exponential Family.

The generic derivation is based on the following formula (under the Mean-Field assumption):

Generic derivation

This derivation has to be done for each variable θi of the variational model; from the result, a statistician can deduce the distribution type of the variational factor and how to update its parameter.
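In standard notation, the formula referred to above is usually written as the following coordinate update (a sketch, where the expectation E_{-i} is taken over all variational factors except q(θi)):

```latex
q^{*}(\theta_i) \;\propto\; \exp\!\left\{ \mathbb{E}_{-i}\!\left[\log p(x,\theta)\right] \right\}
```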

Another way to obtain the variational parameter updates is to derive them using the properties of the Exponential Family. This family contains all the distributions that can be written in the form:

Exponential family
  • h(x): Base measure.
  • η(θ): Natural parameters (they depend only on the parameters).
  • t(x): Sufficient statistics (they depend only on the data). They determine the shape of the distribution and describe the possible space of the distribution parameters.
  • a(η(θ)): Cumulant. It is a normalizer.
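Putting these components together, the exponential-family form is usually written as (a sketch in standard notation):

```latex
p(x \mid \theta) = h(x)\,\exp\!\left\{ \eta(\theta)^{\top} t(x) - a(\eta(\theta)) \right\}
```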

This family makes it possible to establish conjugacy relations between distributions. When building the joint distribution from two distributions, if the natural parameters of one, together with the sufficient statistics of the other, allow some simplification in the formulation, we say that the first distribution is conjugate to the second. As already mentioned, conjugate models allow the posterior to be computed analytically thanks to these simplifications.

Coordinate Ascent Variational Inference algorithm
Coordinate Ascent Variational Inference Algorithm

In this version of the CAVI algorithm, the distinction between updating the local and the global variables of the model has already been taken into account.
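A schematic sketch of that local/global structure might look like this (the update and ELBO functions are hypothetical placeholders; the actual closed-form updates depend on the model):

```python
def cavi(data, local_lam, global_lam, update_local, update_global,
         compute_elbo, tol=1e-4, max_iters=1000):
    """CAVI loop: alternate closed-form updates of local and global parameters."""
    elbo_old = -float("inf")
    for _ in range(max_iters):
        # Closed-form update of the local variational parameters (one per data point)
        local_lam = [update_local(x_n, global_lam) for x_n in data]
        # Closed-form update of the global variational parameters given the local ones
        global_lam = update_global(data, local_lam)
        elbo_new = compute_elbo(data, local_lam, global_lam)
        if abs(elbo_new - elbo_old) < tol:  # stop when the ELBO no longer increases
            break
        elbo_old = elbo_new
    return local_lam, global_lam
```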

Gradient Ascent Variational Inference

A more naive inference alternative is Gradient Ascent Variational Inference (GAVI). The difference lies in how the variational parameters are updated: instead of using formulas derived analytically by a statistician, the update is an exploratory process. GAVI is based on the Gradient Descent/Ascent algorithm.

Gradient Ascent

Gradient Ascent aims to maximize an objective function C(λ) parameterized by the model parameters λ. The algorithm optimizes these parameters in the direction of the gradient of the objective function, ∇λC(λ) (in the case of Gradient Descent, in the opposite direction of the gradient). In our case we need Gradient Ascent because we want to maximize the ELBO. The learning rate η>0 determines the size of the step taken towards the local maximum. Gradient Ascent explores the latent-variable space of the model and moves in the direction of maximum slope (indicated by the gradient of the function) until it finds a local maximum. The variational model parameters are updated as follows:

Gradient application
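In standard notation, this update at iteration t reads (a sketch consistent with the description above):

```latex
\lambda^{(t+1)} = \lambda^{(t)} + \eta \,\nabla_{\lambda}\,\mathrm{ELBO}\!\left(\lambda^{(t)}\right)
```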

Over the last years, optimized variants of this algorithm have appeared: Momentum, Adagrad, Adadelta, RMSprop, Adam, ... The improvements they offer are based on ideas such as giving each parameter λi its own learning rate ηi, or taking into account the gradients of previous iterations to compute the next one.

Gradient Ascent Variational Inference algorithm
Gradient Ascent Variational Inference algorithm
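Plugged into the generic VI loop shown earlier, a single GAVI update step could be sketched as follows (grad_elbo is a hypothetical function returning ∇λELBO, for instance obtained by automatic differentiation or a Monte Carlo estimator):

```python
import numpy as np

def gavi_step(lam, data, grad_elbo, learning_rate=0.01):
    """One GAVI iteration: move lambda a small step along the gradient of the ELBO."""
    lam = np.asarray(lam, dtype=float)
    return lam + learning_rate * grad_elbo(lam, data)
```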

A problem of this algorithm when approximating the posterior (which leads to less accurate convergence) is the use of the plain gradient to optimize the variational parameters. The gradient assumes that the latent-variable space is a Euclidean space, which implies assuming that the distance between distributions is measured by the Euclidean distance between their parameters. The solution to this problem, capturing the real distance between two distributions, is to use the natural gradient.

Natural gradient definition

The natural gradient indicates the direction of maximum slope in a different space, a Riemannian space, where the real distance between distributions is taken into account. It can be calculated by premultiplying the ordinary gradient by the inverse of the Fisher information matrix, G(λ).

Natural gradient definition
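In standard notation this reads (a sketch, with G(λ) the Fisher information matrix of the variational model q(θ|λ)):

```latex
\hat{\nabla}_{\lambda}\,\mathrm{ELBO}(\lambda)
  = G(\lambda)^{-1}\,\nabla_{\lambda}\,\mathrm{ELBO}(\lambda),
\qquad
G(\lambda) = \mathbb{E}_{q(\theta|\lambda)}\!\left[
  \nabla_{\lambda}\log q(\theta|\lambda)\,
  \big(\nabla_{\lambda}\log q(\theta|\lambda)\big)^{\top}
\right]
```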

In the case of CAVI, when the analytical updates of each variational parameter are derived, the shape of the distributions is already taken into account when measuring the distance between them.

Efficiency problems

Nowadays, in what is already known as the information age, machine learning algorithms work with huge volumes of data. This forces programmers to scale their algorithms or to design less computationally expensive alternatives. CAVI and GAVI have to pass through all the data in each iteration, a procedure that is intractable for massive datasets. In the next post I'm going to explain the measures you can take in these cases and how to solve the scalability problem.

References

  • Journal of the American Statistical Association (George E. P. Box)
  • Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models (David M. Blei)
  • Probabilistic Graphical Models: Principles and Techniques (Daphne Koller and Nir Friedman)
  • Model-based Machine Learning (Christopher M. Bishop)
  • Machine Learning. A probabilistic perspective (Kevin P. Murphy)
  • An overview of gradient descent optimization algorithms (Sebastian Ruder)
  • Variational Bayesian inference (slides) (Kay H. Brodersen)
  • Variational Inference (David M. Blei)
