vi(m, q_init, 1000) # perform VI with the default algorithm on `m` for 1000 iterations
```
Thus, it's no more work than standard MCMC sampling in Turing.
The default algorithm uses stochastic gradient descent to minimise the (exclusive) [KL divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence).
This approach is commonly referred to as *automatic differentiation variational inference* (ADVI)[^KTRGB2017], *stochastic gradient VI*[^TL2014], and *black-box variational inference*[^RGB2014] with the reparameterization gradient[^KW2014][^RMW2014][^TL2014].
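To make the reparameterization gradient concrete, here is a minimal, self-contained sketch of a one-sample gradient estimate of the ELBO for a one-dimensional Gaussian approximation. The target `logp`, the parameter vector `λ`, and the use of ForwardDiff are purely illustrative and are not the estimator that `vi` uses internally.

```{julia}
#| eval: false
using Distributions, ForwardDiff

# Hypothetical unnormalised target log-density (a stand-in for a model's log-joint).
logp(z) = logpdf(Normal(1.0, 2.0), z)

# Mean-field Gaussian q parameterised by λ = (mean, log-scale).
reparam(λ, ε) = λ[1] + exp(λ[2]) * ε          # z = μ + σ·ε with ε ~ N(0, 1)
entropy_q(λ) = 0.5 * (1 + log(2π)) + λ[2]     # entropy of Normal(μ, σ)

# One-sample Monte Carlo estimate of the ELBO: E_q[log p(z)] + H(q).
elbo_estimate(λ, ε) = logp(reparam(λ, ε)) + entropy_q(λ)

λ = [0.0, 0.0]                                # initial (μ, log σ)
ε = randn()                                   # noise fixed for this gradient step
g = ForwardDiff.gradient(l -> elbo_estimate(l, ε), λ)

λ .+= 1e-2 .* g   # one ascent step on the ELBO ≡ one descent step on the exclusive KL
```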
Our implementation supports any optimiser that implements the [Optimisers.jl](https://github.com/FluxML/Optimisers.jl) interface.
For instance, let's try using `Optimisers.Adam`[^KB2014], which is a popular choice.
Since `AdvancedVI` does not implement a proximal operator for `Optimisers.Adam`, we must use the `AdvancedVI.ClipScale()` projection operator, which ensures that the scale matrix of the variational approximation is positive definite.
(See the paper by J. Domke 2020[^D2020] for more detail about the use of a projection operator.)
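A sketch of what such a call could look like is shown below. The `optimizer` and `operator` keyword arguments are assumptions based on `AdvancedVI`'s optimisation interface rather than a confirmed Turing signature, so consult the current documentation before copying this verbatim.

```{julia}
#| eval: false
using Optimisers

# Keyword names below (`optimizer`, `operator`) are assumed to be forwarded to AdvancedVI.
q_adam = vi(
    m, q_init, 1000;
    optimizer = Optimisers.Adam(3e-3),
    operator = AdvancedVI.ClipScale(),
)
```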
Compared to the default optimiser, `AdvancedVI.DoWG()`, we can see that `Optimisers.Adam(3e-3)` converges more slowly.
With more step-size tuning, it is possible that `Optimisers.Adam` could perform as well as, or better than, the default.
That is, most common optimisers require some degree of tuning to perform comparably to, or better than, `AdvancedVI.DoWG()` or `AdvancedVI.DoG()`, which require little to no tuning.
Due to this fact, they are referred to as parameter-free optimisers.
So far, we have only used the mean-field Gaussian family.
This, however, approximates the posterior covariance with a diagonal matrix.
To model the full covariance matrix, we can use the *full-rank* Gaussian family[^TL2014][^KTRGB2017]:
```{julia}
q_init_fr = q_fullrank_gaussian(m);
```
The term *full-rank* might seem a bit peculiar since covariance matrices are always full-rank.
This term, however, traditionally comes from the fact that full-rank families use full-rank factors in addition to the diagonal of the covariance.
In contrast to the mean-field family, the full-rank family will often result in more computation per optimisation step and slower convergence, especially in high dimensions:
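A minimal sketch of such a run simply reuses the call pattern from before, starting from the full-rank initialisation (the code that records and compares the ELBO traces is omitted here):

```{julia}
# Same call as for the mean-field family, but starting from the full-rank initialisation.
q_fr = vi(m, q_init_fr, 1000);
```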
However, we can see that the full-rank family achieves a higher ELBO in the end.
Due to the relationship between the ELBO and the Kullback-Leibler divergence, this indicates that the full-rank approximation fits the posterior more accurately.
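Concretely, writing $x$ for the observed data and $z$ for the latent variables, the log marginal likelihood decomposes as

$$
\log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big),
$$

so for a fixed $\log p(x)$, a higher ELBO corresponds exactly to a smaller (exclusive) KL divergence between the variational approximation and the posterior.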
This trade-off between statistical accuracy and optimisation speed is often referred to as the *statistical-computational trade-off*.
The fact that we can control this trade-off through the choice of variational family is a strength, rather than a limitation, of variational inference.