
Commit 071d5a3

Update VI tutorial
1 parent 88e1161 commit 071d5a3


tutorials/variational-inference/index.qmd

Lines changed: 14 additions & 3 deletions
@@ -20,12 +20,12 @@ Let's start with a minimal example.
 Consider a `Turing.Model`, which we denote as `model`.
 Approximating the posterior associated with `model` via VI is as simple as
 
-```{julia}
-#| eval: false
+```julia
 m = model(data...) # instantiate model on the data
 q_init = q_fullrank_gaussian(m) # initial variational approximation
 vi(m, q_init, 1000) # perform VI with the default algorithm on `m` for 1000 iterations
 ```
+
 Thus, it's no more work than standard MCMC sampling in Turing.
 The default algorithm uses stochastic gradient descent to minimise the (exclusive) [KL divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence).
 This approach is commonly referred to as *automatic differentiation variational inference* (ADVI)[^KTRGB2017], *stochastic gradient VI*[^TL2014], and *black-box variational inference*[^RGB2014] with the reparameterization gradient[^KW2014][^RMW2014][^TL2014].
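
For reference, minimising the exclusive KL divergence is equivalent to maximising the evidence lower bound (ELBO), the quantity tracked in the plots further down this diff. A standard way to write the objective, with `x` the data and `z` the latent variables, is

```latex
\mathrm{ELBO}(q)
  = \mathbb{E}_{z \sim q}\left[ \log p(x, z) - \log q(z) \right]
  = \log p(x) - \mathrm{KL}\left( q(z) \,\|\, p(z \mid x) \right),
```

so maximising the ELBO over `q` is the same as minimising the KL divergence from `q` to the posterior.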
@@ -240,10 +240,16 @@ Our implementation supports any optimiser that implements the [Optimisers.jl](ht
 For instance, let's try using `Optimisers.Adam`[^KB2014], which is a popular choice.
 Since `AdvancedVI` does not implement a proximal operator for `Optimisers.Adam`, we must use the `AdvancedVI.ClipScale()` projection operator, which ensures that the scale matrix of the variational approximation is positive definite.
 (See the paper by J. Domke 2020[^D2020] for more detail about the use of a projection operator.)
+
 ```{julia}
 using Optimisers
 
-_, _, info_adam, _ = vi(m, q_init, n_iters; show_progress=false, callback=callback, optimizer=Optimisers.Adam(3e-3), operator=ClipScale());
+_, _, info_adam, _ = vi(
+    m, q_init, n_iters;
+    show_progress=false,
+    callback=callback,
+    algorithm=KLMinRepGradDescent(AutoForwardDiff(); optimizer=Optimisers.Adam(3e-3), operator=ClipScale())
+);
 ```
 
 ```{julia}
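
As a rough illustration of why a projection is needed with `Optimisers.Adam`: after an unconstrained gradient step, nothing keeps the scale factor of the Gaussian approximation positive definite. Below is a hypothetical sketch of a scale-clipping projection, not the actual `AdvancedVI.ClipScale` implementation, with an assumed floor `ε`:

```julia
using LinearAlgebra

# Hypothetical sketch (not AdvancedVI's code): clamp the diagonal of a
# lower-triangular scale factor L to a small positive floor ε, so that the
# implied covariance L * L' stays positive definite after a gradient step.
function clip_scale_sketch(L::LowerTriangular, ε::Real=1e-5)
    Lc = Matrix(L)
    for i in axes(Lc, 1)
        Lc[i, i] = max(Lc[i, i], ε)
    end
    return LowerTriangular(Lc)
end
```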
@@ -252,6 +258,7 @@ elbo_adam = [i.elbo_avg for i in info_adam[iters]]
 Plots.plot(iters, elbo_mf, xlabel="Iterations", ylabel="ELBO", label="DoWG")
 Plots.plot!(iters, elbo_adam, xlabel="Iterations", ylabel="ELBO", label="Adam")
 ```
+
 Compared to the default option `AdvancedVI.DoWG()`, we can see that `Optimisers.Adam(3e-3)` converges more slowly.
 With more step-size tuning, it is possible that `Optimisers.Adam` could perform comparably or even better.
 That is, most common optimisers require some degree of tuning to perform comparably to or better than `AdvancedVI.DoWG()` or `AdvancedVI.DoG()`, which require little tuning at all.
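
For comparison with the parameter-free optimisers discussed above, the same call can be made with `AdvancedVI.DoG()` swapped in; this is a sketch assuming the keyword interface shown in the Adam example of this diff (the `info_dog` name is illustrative):

```julia
# Sketch: the Adam call above with the parameter-free DoG optimiser instead,
# which requires no step-size choice (keyword interface assumed unchanged).
_, _, info_dog, _ = vi(
    m, q_init, n_iters;
    show_progress=false,
    callback=callback,
    algorithm=KLMinRepGradDescent(AutoForwardDiff(); optimizer=AdvancedVI.DoG(), operator=ClipScale())
);
```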
@@ -261,6 +268,7 @@ Due to this fact, they are referred to as parameter-free optimizers.
 So far, we have only used the mean-field Gaussian family.
 This, however, approximates the posterior covariance with a diagonal matrix.
 To model the full covariance matrix, we can use the *full-rank* Gaussian family[^TL2014][^KTRGB2017]:
+
 ```{julia}
 q_init_fr = q_fullrank_gaussian(m);
 ```
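
For context, the two families differ only in how the Gaussian covariance is parameterised; a standard way to write them is

```latex
q_{\text{mean-field}}(z) = \mathcal{N}\left( z;\, \mu,\, \mathrm{diag}(\sigma^2) \right),
\qquad
q_{\text{full-rank}}(z) = \mathcal{N}\left( z;\, \mu,\, L L^{\top} \right),
```

where `L` is a dense lower-triangular (Cholesky) factor, so the full-rank family can also represent correlations between parameters.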
@@ -273,6 +281,7 @@ The term *full-rank* might seem a bit peculiar since covariance matrices are alw
 This term, however, traditionally comes from the fact that full-rank families use full-rank factors in addition to the diagonal of the covariance.
 
 In contrast to the mean-field family, the full-rank family will often result in more computation per optimisation step and slower convergence, especially in high dimensions:
+
 ```{julia}
 q_fr, _, info_fr, _ = vi(m, q_init_fr, n_iters; show_progress=false, callback)
 ```
@@ -281,12 +290,14 @@ Plots.plot(elbo_mf, xlabel="Iterations", ylabel="ELBO", label="Mean-Field", ylim
 elbo_fr = [i.elbo_avg for i in info_fr[iters]]
 Plots.plot!(elbo_fr, xlabel="Iterations", ylabel="ELBO", label="Full-Rank", ylims=(-200, Inf))
 ```
+
 However, we can see that the full-rank family achieves a higher ELBO in the end.
 Due to the relationship between the ELBO and the Kullback-Leibler divergence, this indicates that the full-rank covariance is much more accurate.
 This trade-off between statistical accuracy and optimisation speed is often referred to as the *statistical-computational trade-off*.
 The fact that we can control this trade-off through the choice of variational family is a strength, rather than a limitation, of variational inference.
 
 We can also visualise the covariance matrix.
+
 ```{julia}
 heatmap(cov(rand(q_fr, 100_000), dims=2))
 ```
