|
78 | 78 | 2, |
79 | 79 | None, |
80 | 80 | 'positive-and-negative-phases'), |
81 | | - ('Gradient examples', 2, None, 'gradient-examples'), |
82 | | - ('Kullback-Leibler relative entropy', |
83 | | - 2, |
84 | | - None, |
85 | | - 'kullback-leibler-relative-entropy'), |
86 | | - ('Kullback-Leibler divergence', |
87 | | - 2, |
88 | | - None, |
89 | | - 'kullback-leibler-divergence'), |
90 | | - ('Maximizing log-likelihood', |
91 | | - 2, |
92 | | - None, |
93 | | - 'maximizing-log-likelihood'), |
94 | | - ('More on the partition function', |
95 | | - 2, |
96 | | - None, |
97 | | - 'more-on-the-partition-function'), |
98 | | - ('Setting up for gradient descent calculations', |
99 | | - 2, |
100 | | - None, |
101 | | - 'setting-up-for-gradient-descent-calculations'), |
102 | | - ('Difference of moments', 2, None, 'difference-of-moments'), |
103 | | - ('More observations', 2, None, 'more-observations'), |
104 | | - ('Adding hyperparameters', 2, None, 'adding-hyperparameters'), |
105 | 81 | ('Theory of Variational Autoencoders', |
106 | 82 | 2, |
107 | 83 | None, |
|
231 | 207 | <!-- navigation toc: --> <li><a href="#explicit-expression-for-the-derivative" style="font-size: 80%;">Explicit expression for the derivative</a></li> |
232 | 208 | <!-- navigation toc: --> <li><a href="#final-expression" style="font-size: 80%;">Final expression</a></li> |
233 | 209 | <!-- navigation toc: --> <li><a href="#positive-and-negative-phases" style="font-size: 80%;">Positive and negative phases</a></li> |
234 | | - <!-- navigation toc: --> <li><a href="#gradient-examples" style="font-size: 80%;">Gradient examples</a></li> |
235 | | - <!-- navigation toc: --> <li><a href="#kullback-leibler-relative-entropy" style="font-size: 80%;">Kullback-Leibler relative entropy</a></li> |
236 | | - <!-- navigation toc: --> <li><a href="#kullback-leibler-divergence" style="font-size: 80%;">Kullback-Leibler divergence</a></li> |
237 | | - <!-- navigation toc: --> <li><a href="#maximizing-log-likelihood" style="font-size: 80%;">Maximizing log-likelihood</a></li> |
238 | | - <!-- navigation toc: --> <li><a href="#more-on-the-partition-function" style="font-size: 80%;">More on the partition function</a></li> |
239 | | - <!-- navigation toc: --> <li><a href="#setting-up-for-gradient-descent-calculations" style="font-size: 80%;">Setting up for gradient descent calculations</a></li> |
240 | | - <!-- navigation toc: --> <li><a href="#difference-of-moments" style="font-size: 80%;">Difference of moments</a></li> |
241 | | - <!-- navigation toc: --> <li><a href="#more-observations" style="font-size: 80%;">More observations</a></li> |
242 | | - <!-- navigation toc: --> <li><a href="#adding-hyperparameters" style="font-size: 80%;">Adding hyperparameters</a></li> |
243 | 210 | <!-- navigation toc: --> <li><a href="#theory-of-variational-autoencoders" style="font-size: 80%;">Theory of Variational Autoencoders</a></li> |
244 | 211 | <!-- navigation toc: --> <li><a href="#the-autoencoder-again" style="font-size: 80%;">The Autoencoder again</a></li> |
245 | 212 | <!-- navigation toc: --> <li><a href="#schematic-image-of-an-autoencoder" style="font-size: 80%;">Schematic image of an Autoencoder</a></li> |
@@ -548,149 +515,6 @@ <h2 id="positive-and-negative-phases" class="anchor">Positive and negative phase |
548 | 515 | their probability). |
549 | 516 | </p> |
550 | 517 |
|
551 | | -<!-- !split --> |
552 | | -<h2 id="gradient-examples" class="anchor">Gradient examples </h2> |
553 | | -<p>The gradient of the negative log-likelihood cost function of a Binary-Binary RBM is then</p> |
554 | | -$$ |
555 | | -\begin{align*} |
556 | | - \frac{\partial \mathcal{C} (w_{ij}, a_i, b_j)}{\partial w_{ij}} =& \langle x_i h_j \rangle_{data} - \langle x_i h_j \rangle_{model} \\ |
557 | | -    \frac{\partial \mathcal{C} (w_{ij}, a_i, b_j)}{\partial a_{i}} =& \langle x_i \rangle_{data} - \langle x_i \rangle_{model} \\
558 | | -    \frac{\partial \mathcal{C} (w_{ij}, a_i, b_j)}{\partial b_{j}} =& \langle h_j \rangle_{data} - \langle h_j \rangle_{model}.
559 | | -\end{align*} |
560 | | -$$ |
561 | | - |
562 | | -<p>To get the expectation values with respect to the <em>data</em>, we set the visible units to each of the observed samples in the training data, then update the hidden units according to the conditional probability found before. We then average over all samples in the training data to calculate expectation values with respect to the data. </p> |
563 | | - |
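<p>To make the data-dependent averages above concrete, the following is a minimal NumPy sketch (the function name and array shapes are illustrative assumptions, not taken from these notes) that estimates \( \langle x_i h_j \rangle_{data} \), \( \langle x_i \rangle_{data} \) and \( \langle h_j \rangle_{data} \) for a binary-binary RBM from the conditional probability of the hidden units:</p>

<pre><code>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def positive_phase(X, W, b):
    """Data-dependent (positive phase) averages for a binary-binary RBM.

    Assumed shapes: X is (n_samples, n_visible) binary training data,
    W is (n_visible, n_hidden) and b is the hidden bias; the visible
    bias is not needed for these particular averages.
    """
    p_h = sigmoid(X @ W + b)            # p(h_j = 1 | x) for every sample
    xh_data = X.T @ p_h / X.shape[0]    # average of x_i h_j over the data
    x_data = X.mean(axis=0)             # average of x_i over the data
    h_data = p_h.mean(axis=0)           # average of h_j over the data
    return xh_data, x_data, h_data
</code></pre>

<p>Using the conditional probabilities \( p(h_j=1|\boldsymbol{x}) \) instead of sampled hidden units is a common choice, since it lowers the variance of the estimate.</p>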
564 | | -<!-- !split --> |
565 | | -<h2 id="kullback-leibler-relative-entropy" class="anchor">Kullback-Leibler relative entropy </h2> |
566 | | - |
567 | | -<p>When the goal of the training is to approximate a probability |
568 | | -distribution, as it is in generative modeling, another relevant |
569 | | -measure is the <b>Kullback-Leibler divergence</b>, also known as the |
570 | | -relative entropy. It is a non-symmetric measure of the
571 | | -dissimilarity between two probability density functions \( p \) and |
572 | | -\( q \). If \( p \) is the unknown probability which we approximate with \( q \),
573 | | -we can measure the difference by |
574 | | -</p> |
575 | | -$$ |
576 | | -\begin{align*} |
577 | | - \text{KL}(p||q) = \int_{-\infty}^{\infty} p (\boldsymbol{x}) \log \frac{p(\boldsymbol{x})}{q(\boldsymbol{x})} d\boldsymbol{x}. |
578 | | -\end{align*} |
579 | | -$$ |
580 | | - |
581 | | - |
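<p>As a small illustration (the function and the example distributions below are ours, not from the notes), the discrete analogue of this integral can be evaluated directly; it also shows that the measure is not symmetric in \( p \) and \( q \):</p>

<pre><code>
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete Kullback-Leibler divergence: sum over x of p(x) log(p(x)/q(x)).
    The small eps only guards against log(0) and is an illustrative choice."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, p))                       # 0.0 for identical distributions
print(kl_divergence(p, q), kl_divergence(q, p))  # two different values: not symmetric
</code></pre>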
582 | | -<!-- !split --> |
583 | | -<h2 id="kullback-leibler-divergence" class="anchor">Kullback-Leibler divergence </h2> |
584 | | - |
585 | | -<p>Thus, the Kullback-Leibler divergence between the distribution of the |
586 | | -training data \( f(\boldsymbol{x}) \) and the model distribution \( p(\boldsymbol{x}| |
587 | | -\boldsymbol{\Theta}) \) is |
588 | | -</p> |
589 | | - |
590 | | -$$ |
591 | | -\begin{align*} |
592 | | - \text{KL} (f(\boldsymbol{x})|| p(\boldsymbol{x}| \boldsymbol{\Theta})) =& \int_{-\infty}^{\infty} |
593 | | - f (\boldsymbol{x}) \log \frac{f(\boldsymbol{x})}{p(\boldsymbol{x}| \boldsymbol{\Theta})} d\boldsymbol{x} \\ |
594 | | - =& \int_{-\infty}^{\infty} f(\boldsymbol{x}) \log f(\boldsymbol{x}) d\boldsymbol{x} - \int_{-\infty}^{\infty} f(\boldsymbol{x}) \log |
595 | | - p(\boldsymbol{x}| \boldsymbol{\Theta}) d\boldsymbol{x} \\ |
596 | | - %=& \mathbb{E}_{f(\boldsymbol{x})} (\log f(\boldsymbol{x})) - \mathbb{E}_{f(\boldsymbol{x})} (\log p(\boldsymbol{x}| \boldsymbol{\Theta})) |
597 | | - =& \langle \log f(\boldsymbol{x}) \rangle_{f(\boldsymbol{x})} - \langle \log p(\boldsymbol{x}| \boldsymbol{\Theta}) \rangle_{f(\boldsymbol{x})} \\ |
598 | | - =& \langle \log f(\boldsymbol{x}) \rangle_{data} + \langle E(\boldsymbol{x}) \rangle_{data} + \log Z \\ |
599 | | - =& \langle \log f(\boldsymbol{x}) \rangle_{data} + \mathcal{C}_{LL} . |
600 | | -\end{align*} |
601 | | -$$ |
602 | | - |
603 | | - |
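<p>The last two equalities above simply use the Boltzmann form of the model distribution, \( p(\boldsymbol{x}|\boldsymbol{\Theta}) = e^{-E(\boldsymbol{x};\boldsymbol{\Theta})}/Z \), which gives</p>
$$
\begin{align*}
 -\langle \log p(\boldsymbol{x}| \boldsymbol{\Theta}) \rangle_{data} = \langle E(\boldsymbol{x}) \rangle_{data} + \log Z = \mathcal{C}_{LL}.
\end{align*}
$$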
604 | | -<!-- !split --> |
605 | | -<h2 id="maximizing-log-likelihood" class="anchor">Maximizing log-likelihood </h2> |
606 | | - |
607 | | -<p>The first term is constant with respect to \( \boldsymbol{\Theta} \) since |
608 | | -\( f(\boldsymbol{x}) \) is independent of \( \boldsymbol{\Theta} \). Thus the Kullback-Leibler |
609 | | -divergence is minimal when the second term is minimal. The second term
610 | | -is the negative log-likelihood cost function \( \mathcal{C}_{LL} \), hence minimizing the
611 | | -Kullback-Leibler divergence is equivalent to maximizing the |
612 | | -log-likelihood. |
613 | | -</p> |
614 | | - |
615 | | -<p>To further understand generative models it is useful to study the |
616 | | -gradient of the cost function which is needed in order to minimize it |
617 | | -using methods like stochastic gradient descent. |
618 | | -</p> |
619 | | - |
620 | | -<!-- !split --> |
621 | | -<h2 id="more-on-the-partition-function" class="anchor">More on the partition function </h2> |
622 | | - |
623 | | -<p>The partition function is the generating function of |
624 | | -expectation values; in particular, there are mathematical relationships
625 | | -between expectation values and the log-partition function. In this |
626 | | -case we have |
627 | | -</p> |
628 | | -$$ |
629 | | -\begin{align*} |
630 | | - \langle \frac{ \partial E(\boldsymbol{x}; \Theta_i) } { \partial \Theta_i} \rangle_{model} |
631 | | - = \int p(\boldsymbol{x}| \boldsymbol{\Theta}) \frac{ \partial E(\boldsymbol{x}; \Theta_i) } { \partial \Theta_i} d\boldsymbol{x} |
632 | | - = -\frac{\partial \log Z(\Theta_i)}{ \partial \Theta_i} . |
633 | | -\end{align*} |
634 | | -$$ |
635 | | - |
636 | | -<p>Here \( \langle \cdot \rangle_{model} \) is the expectation value over the model probability distribution \( p(\boldsymbol{x}| \boldsymbol{\Theta}) \).</p> |
637 | | - |
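<p>The identity can be verified by differentiating \( \log Z \) directly, with \( Z(\Theta_i) = \int e^{-E(\boldsymbol{x}; \Theta_i)} d\boldsymbol{x} \) (spelled out here as a short intermediate calculation):</p>
$$
\begin{align*}
 \frac{\partial \log Z(\Theta_i)}{ \partial \Theta_i}
 = \frac{1}{Z} \int \frac{\partial e^{-E(\boldsymbol{x}; \Theta_i)}}{\partial \Theta_i} d\boldsymbol{x}
 = - \int \frac{e^{-E(\boldsymbol{x}; \Theta_i)}}{Z} \frac{ \partial E(\boldsymbol{x}; \Theta_i) } { \partial \Theta_i} d\boldsymbol{x}
 = - \langle \frac{ \partial E(\boldsymbol{x}; \Theta_i) } { \partial \Theta_i} \rangle_{model} .
\end{align*}
$$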
638 | | -<!-- !split --> |
639 | | -<h2 id="setting-up-for-gradient-descent-calculations" class="anchor">Setting up for gradient descent calculations </h2> |
640 | | - |
641 | | -<p>Using the previous relationship we can express the gradient of the cost function as</p> |
642 | | - |
643 | | -$$ |
644 | | -\begin{align*} |
645 | | - \frac{\partial \mathcal{C}_{LL}}{\partial \Theta_i} |
646 | | - =& \langle \frac{ \partial E(\boldsymbol{x}; \Theta_i) } { \partial \Theta_i} \rangle_{data} + \frac{\partial \log Z(\Theta_i)}{ \partial \Theta_i} \\ |
647 | | - =& \langle \frac{ \partial E(\boldsymbol{x}; \Theta_i) } { \partial \Theta_i} \rangle_{data} - \langle \frac{ \partial E(\boldsymbol{x}; \Theta_i) } { \partial \Theta_i} \rangle_{model} \\ |
648 | | - %=& \langle O_i(\boldsymbol{x}) \rangle_{data} - \langle O_i(\boldsymbol{x}) \rangle_{model} |
649 | | -\end{align*} |
650 | | -$$ |
651 | | - |
652 | | - |
653 | | -<!-- !split --> |
654 | | -<h2 id="difference-of-moments" class="anchor">Difference of moments </h2> |
655 | | - |
656 | | -<p>This expression shows that the gradient of the log-likelihood cost |
657 | | -function is a <b>difference of moments</b>, with one calculated from |
658 | | -the data and one calculated from the model. The data-dependent term is |
659 | | -called the <b>positive phase</b> and the model-dependent term is |
660 | | -called the <b>negative phase</b> of the gradient. We see now that |
661 | | -minimizing the cost function results in lowering the energy of |
662 | | -configurations \( \boldsymbol{x} \) near points in the training data and |
663 | | -increasing the energy of configurations not observed in the training |
664 | | -data. That means we increase the model's probability of configurations |
665 | | -similar to those in the training data. |
666 | | -</p> |
667 | | - |
668 | | -<!-- !split --> |
669 | | -<h2 id="more-observations" class="anchor">More observations </h2> |
670 | | - |
671 | | -<p>The gradient of the cost function also demonstrates why gradients of |
672 | | -unsupervised, generative models must be computed differently from
673 | | -those of, for example, FNNs. While the data-dependent expectation value
674 | | -is easily calculated based on the samples \( \boldsymbol{x}_i \) in the training |
675 | | -data, we must sample from the model in order to generate samples from |
676 | | -which to calculate the model-dependent term. We sample from the model
677 | | -by using MCMC-based methods. We cannot sample from the model directly
678 | | -because the partition function \( Z \) is generally intractable. |
679 | | -</p> |
680 | | - |
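<p>A standard MCMC-based choice is block Gibbs sampling, often combined with the contrastive divergence (CD-k) heuristic of starting the chain from the training data. The sketch below is an assumed, illustrative implementation for the binary-binary RBM, not code from these notes:</p>

<pre><code>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_negative_phase(X, W, a, b, k=1, rng=None):
    """Estimate the model-dependent (negative phase) averages with k steps
    of block Gibbs sampling started from the data (the CD-k heuristic)."""
    if rng is None:
        rng = np.random.default_rng(0)
    v = X.astype(float)
    for _ in range(k):
        p_h = sigmoid(v @ W + b)                  # p(h = 1 | v)
        h = rng.binomial(1, p_h).astype(float)    # sample the hidden layer
        p_v = sigmoid(h @ W.T + a)                # p(v = 1 | h)
        v = rng.binomial(1, p_v).astype(float)    # sample the visible layer
    p_h = sigmoid(v @ W + b)
    xh_model = v.T @ p_h / v.shape[0]             # estimate of the model average of x_i h_j
    return xh_model, v.mean(axis=0), p_h.mean(axis=0)
</code></pre>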
681 | | -<!-- !split --> |
682 | | -<h2 id="adding-hyperparameters" class="anchor">Adding hyperparameters </h2> |
683 | | - |
684 | | -<p>As in supervised machine learning problems, the goal is also here to |
685 | | -perform well on <b>unseen</b> data, that is, to have good
686 | | -generalization from the training data. The distribution \( f(\boldsymbol{x}) \) we
687 | | -approximate is not the <b>true</b> distribution we wish to estimate;
688 | | -it is limited to the training data. Hence, in unsupervised training as
689 | | -well it is important to prevent overfitting to the training data. Thus
690 | | -it is common to add regularizers to the cost function in the same
691 | | -manner as we discussed for, say, linear regression.
692 | | -</p> |
693 | | - |
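<p>As a hypothetical example of such a regularizer, an \( L_2 \) (weight-decay) penalty on the weights only adds a term proportional to \( W \) to the update. The function below combines it with the difference of moments from the sketches above; the learning rate eta and the decay strength lmbda are illustrative values, not taken from these notes:</p>

<pre><code>
def regularized_update(W, a, b, pos, neg, eta=0.1, lmbda=1e-4):
    """One parameter update: difference of moments (positive minus negative
    phase) combined with L2 weight decay on W."""
    xh_data, x_data, h_data = pos      # e.g. from positive_phase(...) above
    xh_model, x_model, h_model = neg   # e.g. from gibbs_negative_phase(...) above
    W = W + eta * (xh_data - xh_model - lmbda * W)
    a = a + eta * (x_data - x_model)
    b = b + eta * (h_data - h_model)
    return W, a, b
</code></pre>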
694 | 518 | <!-- !split --> |
695 | 519 | <h2 id="theory-of-variational-autoencoders" class="anchor">Theory of Variational Autoencoders </h2> |
696 | 520 |
|
|