
Commit c1148ea

update

1 parent 9af26e2 commit c1148ea

File tree

8 files changed, +146 -1206 lines changed


doc/pub/week12/html/week12-bs.html

Lines changed: 0 additions & 176 deletions
@@ -78,30 +78,6 @@
 2,
 None,
 'positive-and-negative-phases'),
-('Gradient examples', 2, None, 'gradient-examples'),
-('Kullback-Leibler relative entropy',
- 2,
- None,
- 'kullback-leibler-relative-entropy'),
-('Kullback-Leibler divergence',
- 2,
- None,
- 'kullback-leibler-divergence'),
-('Maximizing log-likelihood',
- 2,
- None,
- 'maximizing-log-likelihood'),
-('More on the partition function',
- 2,
- None,
- 'more-on-the-partition-function'),
-('Setting up for gradient descent calculations',
- 2,
- None,
- 'setting-up-for-gradient-descent-calculations'),
-('Difference of moments', 2, None, 'difference-of-moments'),
-('More observations', 2, None, 'more-observations'),
-('Adding hyperparameters', 2, None, 'adding-hyperparameters'),
 ('Theory of Variational Autoencoders',
 2,
 None,
@@ -231,15 +207,6 @@
 <!-- navigation toc: --> <li><a href="#explicit-expression-for-the-derivative" style="font-size: 80%;">Explicit expression for the derivative</a></li>
 <!-- navigation toc: --> <li><a href="#final-expression" style="font-size: 80%;">Final expression</a></li>
 <!-- navigation toc: --> <li><a href="#positive-and-negative-phases" style="font-size: 80%;">Positive and negative phases</a></li>
-<!-- navigation toc: --> <li><a href="#gradient-examples" style="font-size: 80%;">Gradient examples</a></li>
-<!-- navigation toc: --> <li><a href="#kullback-leibler-relative-entropy" style="font-size: 80%;">Kullback-Leibler relative entropy</a></li>
-<!-- navigation toc: --> <li><a href="#kullback-leibler-divergence" style="font-size: 80%;">Kullback-Leibler divergence</a></li>
-<!-- navigation toc: --> <li><a href="#maximizing-log-likelihood" style="font-size: 80%;">Maximizing log-likelihood</a></li>
-<!-- navigation toc: --> <li><a href="#more-on-the-partition-function" style="font-size: 80%;">More on the partition function</a></li>
-<!-- navigation toc: --> <li><a href="#setting-up-for-gradient-descent-calculations" style="font-size: 80%;">Setting up for gradient descent calculations</a></li>
-<!-- navigation toc: --> <li><a href="#difference-of-moments" style="font-size: 80%;">Difference of moments</a></li>
-<!-- navigation toc: --> <li><a href="#more-observations" style="font-size: 80%;">More observations</a></li>
-<!-- navigation toc: --> <li><a href="#adding-hyperparameters" style="font-size: 80%;">Adding hyperparameters</a></li>
 <!-- navigation toc: --> <li><a href="#theory-of-variational-autoencoders" style="font-size: 80%;">Theory of Variational Autoencoders</a></li>
 <!-- navigation toc: --> <li><a href="#the-autoencoder-again" style="font-size: 80%;">The Autoencoder again</a></li>
 <!-- navigation toc: --> <li><a href="#schematic-image-of-an-autoencoder" style="font-size: 80%;">Schematic image of an Autoencoder</a></li>
@@ -548,149 +515,6 @@ <h2 id="positive-and-negative-phases" class="anchor">Positive and negative phase
 their probability).
 </p>

-<!-- !split -->
-<h2 id="gradient-examples" class="anchor">Gradient examples </h2>
-<p>The gradient of the negative log-likelihood cost function of a Binary-Binary RBM is then</p>
-$$
-\begin{align*}
-\frac{\partial \mathcal{C} (w_{ij}, a_i, b_j)}{\partial w_{ij}} =& \langle x_i h_j \rangle_{data} - \langle x_i h_j \rangle_{model} \\
-\frac{\partial \mathcal{C} (w_{ij}, a_i, b_j)}{\partial a_{i}} =& \langle x_i \rangle_{data} - \langle x_i \rangle_{model} \\
-\frac{\partial \mathcal{C} (w_{ij}, a_i, b_j)}{\partial b_{j}} =& \langle h_j \rangle_{data} - \langle h_j \rangle_{model}. \\
-\end{align*}
-$$
-
-<p>To get the expectation values with respect to the <em>data</em>, we set the visible units to each of the observed samples in the training data, then update the hidden units according to the conditional probability found before. We then average over all samples in the training data to calculate expectation values with respect to the data. </p>
-
-<!-- !split -->
-<h2 id="kullback-leibler-relative-entropy" class="anchor">Kullback-Leibler relative entropy </h2>
-
-<p>When the goal of the training is to approximate a probability
-distribution, as it is in generative modeling, another relevant
-measure is the <b>Kullback-Leibler divergence</b>, also known as the
-relative entropy. It is a non-symmetric measure of the
-dissimilarity between two probability density functions \( p \) and
-\( q \). If \( p \) is the unknown probability which we approximate with \( q \),
-we can measure the difference by
-</p>
-$$
-\begin{align*}
-\text{KL}(p||q) = \int_{-\infty}^{\infty} p (\boldsymbol{x}) \log \frac{p(\boldsymbol{x})}{q(\boldsymbol{x})} d\boldsymbol{x}.
-\end{align*}
-$$
-
-
-<!-- !split -->
-<h2 id="kullback-leibler-divergence" class="anchor">Kullback-Leibler divergence </h2>
-
-<p>Thus, the Kullback-Leibler divergence between the distribution of the
-training data \( f(\boldsymbol{x}) \) and the model distribution \( p(\boldsymbol{x}|
-\boldsymbol{\Theta}) \) is
-</p>
-
-$$
-\begin{align*}
-\text{KL} (f(\boldsymbol{x})|| p(\boldsymbol{x}| \boldsymbol{\Theta})) =& \int_{-\infty}^{\infty}
-f (\boldsymbol{x}) \log \frac{f(\boldsymbol{x})}{p(\boldsymbol{x}| \boldsymbol{\Theta})} d\boldsymbol{x} \\
-=& \int_{-\infty}^{\infty} f(\boldsymbol{x}) \log f(\boldsymbol{x}) d\boldsymbol{x} - \int_{-\infty}^{\infty} f(\boldsymbol{x}) \log
-p(\boldsymbol{x}| \boldsymbol{\Theta}) d\boldsymbol{x} \\
-%=& \mathbb{E}_{f(\boldsymbol{x})} (\log f(\boldsymbol{x})) - \mathbb{E}_{f(\boldsymbol{x})} (\log p(\boldsymbol{x}| \boldsymbol{\Theta}))
-=& \langle \log f(\boldsymbol{x}) \rangle_{f(\boldsymbol{x})} - \langle \log p(\boldsymbol{x}| \boldsymbol{\Theta}) \rangle_{f(\boldsymbol{x})} \\
-=& \langle \log f(\boldsymbol{x}) \rangle_{data} + \langle E(\boldsymbol{x}) \rangle_{data} + \log Z \\
-=& \langle \log f(\boldsymbol{x}) \rangle_{data} + \mathcal{C}_{LL} .
-\end{align*}
-$$
-
-
-<!-- !split -->
-<h2 id="maximizing-log-likelihood" class="anchor">Maximizing log-likelihood </h2>
-
-<p>The first term is constant with respect to \( \boldsymbol{\Theta} \) since
-\( f(\boldsymbol{x}) \) is independent of \( \boldsymbol{\Theta} \). Thus the Kullback-Leibler
-divergence is minimal when the second term is minimal. The second term
-is the log-likelihood cost function, hence minimizing the
-Kullback-Leibler divergence is equivalent to maximizing the
-log-likelihood.
-</p>
-
-<p>To further understand generative models it is useful to study the
-gradient of the cost function, which is needed in order to minimize it
-using methods like stochastic gradient descent.
-</p>
-
-<!-- !split -->
-<h2 id="more-on-the-partition-function" class="anchor">More on the partition function </h2>
-
-<p>The partition function is the generating function of
-expectation values; in particular, there are mathematical relationships
-between expectation values and the log-partition function. In this
-case we have
-</p>
-$$
-\begin{align*}
-\langle \frac{ \partial E(\boldsymbol{x}; \Theta_i) } { \partial \Theta_i} \rangle_{model}
-= \int p(\boldsymbol{x}| \boldsymbol{\Theta}) \frac{ \partial E(\boldsymbol{x}; \Theta_i) } { \partial \Theta_i} d\boldsymbol{x}
-= -\frac{\partial \log Z(\Theta_i)}{ \partial \Theta_i} .
-\end{align*}
-$$
-
-<p>Here \( \langle \cdot \rangle_{model} \) is the expectation value over the model probability distribution \( p(\boldsymbol{x}| \boldsymbol{\Theta}) \).</p>
-
-<!-- !split -->
-<h2 id="setting-up-for-gradient-descent-calculations" class="anchor">Setting up for gradient descent calculations </h2>
-
-<p>Using the previous relationship we can express the gradient of the cost function as</p>
-
-$$
-\begin{align*}
-\frac{\partial \mathcal{C}_{LL}}{\partial \Theta_i}
-=& \langle \frac{ \partial E(\boldsymbol{x}; \Theta_i) } { \partial \Theta_i} \rangle_{data} + \frac{\partial \log Z(\Theta_i)}{ \partial \Theta_i} \\
-=& \langle \frac{ \partial E(\boldsymbol{x}; \Theta_i) } { \partial \Theta_i} \rangle_{data} - \langle \frac{ \partial E(\boldsymbol{x}; \Theta_i) } { \partial \Theta_i} \rangle_{model} \\
-%=& \langle O_i(\boldsymbol{x}) \rangle_{data} - \langle O_i(\boldsymbol{x}) \rangle_{model}
-\end{align*}
-$$
-
-
-<!-- !split -->
-<h2 id="difference-of-moments" class="anchor">Difference of moments </h2>
-
-<p>This expression shows that the gradient of the log-likelihood cost
-function is a <b>difference of moments</b>, with one calculated from
-the data and one calculated from the model. The data-dependent term is
-called the <b>positive phase</b> and the model-dependent term is
-called the <b>negative phase</b> of the gradient. We see now that
-minimizing the cost function results in lowering the energy of
-configurations \( \boldsymbol{x} \) near points in the training data and
-increasing the energy of configurations not observed in the training
-data. That means we increase the model's probability of configurations
-similar to those in the training data.
-</p>
-
-<!-- !split -->
-<h2 id="more-observations" class="anchor">More observations </h2>
-
-<p>The gradient of the cost function also demonstrates why gradients of
-unsupervised, generative models must be computed differently from
-those of, for example, FNNs. While the data-dependent expectation value
-is easily calculated based on the samples \( \boldsymbol{x}_i \) in the training
-data, we must sample from the model in order to generate samples from
-which to calculate the model-dependent term. We sample from the model
-by using MCMC-based methods. We cannot sample from the model directly
-because the partition function \( Z \) is generally intractable.
-</p>
-
-<!-- !split -->
-<h2 id="adding-hyperparameters" class="anchor">Adding hyperparameters </h2>
-
-<p>As in supervised machine learning problems, the goal here is also to
-perform well on <b>unseen</b> data, that is, to have good
-generalization from the training data. The distribution \( f(x) \) we
-approximate is not the <b>true</b> distribution we wish to estimate;
-it is limited to the training data. Hence, in unsupervised training as
-well it is important to prevent overfitting to the training data. Thus
-it is common to add regularizers to the cost function in the same
-manner as we discussed for, say, linear regression.
-</p>
-
 <!-- !split -->
 <h2 id="theory-of-variational-autoencoders" class="anchor">Theory of Variational Autoencoders </h2>
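
The "Gradient examples" and "Difference of moments" slides removed in this commit express the RBM gradient as a difference of moments, <x_i h_j>_data - <x_i h_j>_model: a positive phase obtained by clamping the visible units to the training samples and a negative phase that has to be estimated by sampling from the model. A minimal NumPy sketch of that computation, using a single Gibbs step (CD-1) as the sampling scheme, is given below; the function and variable names are illustrative only and are not taken from the course code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_moment_differences(X, W, a, b):
    """One-step contrastive-divergence estimate of <.>_data - <.>_model
    for a binary-binary RBM (illustrative sketch).
    X : (n_samples, n_visible) binary data
    W : (n_visible, n_hidden) weights, a : (n_visible,) visible bias,
    b : (n_hidden,) hidden bias.
    """
    n = X.shape[0]

    # Positive phase: clamp visible units to the data and use p(h_j=1|x).
    p_h_data = sigmoid(X @ W + b)
    pos_W = X.T @ p_h_data / n
    pos_a = X.mean(axis=0)
    pos_b = p_h_data.mean(axis=0)

    # Negative phase: one Gibbs step x -> h -> x' -> p(h|x').
    h_sample = (rng.random(p_h_data.shape) < p_h_data).astype(float)
    p_x_model = sigmoid(h_sample @ W.T + a)
    x_sample = (rng.random(p_x_model.shape) < p_x_model).astype(float)
    p_h_model = sigmoid(x_sample @ W + b)
    neg_W = x_sample.T @ p_h_model / n
    neg_a = x_sample.mean(axis=0)
    neg_b = p_h_model.mean(axis=0)

    # Difference of moments for W, a and b.
    return pos_W - neg_W, pos_a - neg_a, pos_b - neg_b

# Toy usage: 4 visible and 3 hidden units, random binary data.
X = rng.integers(0, 2, size=(100, 4)).astype(float)
W = 0.01 * rng.standard_normal((4, 3))
a = np.zeros(4)
b = np.zeros(3)
dW, da, db = cd1_moment_differences(X, W, a, b)
```

A training loop would use these differences of moments to update W, a and b, with the sign chosen according to whether one ascends the log-likelihood or descends the cost function.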

doc/pub/week12/html/week12-reveal.html

Lines changed: 0 additions & 159 deletions
@@ -476,165 +476,6 @@ <h2 id="positive-and-negative-phases">Positive and negative phases </h2>
 </p>
 </section>

-<section>
-<h2 id="gradient-examples">Gradient examples </h2>
-<p>The gradient of the negative log-likelihood cost function of a Binary-Binary RBM is then</p>
-<p>&nbsp;<br>
-$$
-\begin{align*}
-\frac{\partial \mathcal{C} (w_{ij}, a_i, b_j)}{\partial w_{ij}} =& \langle x_i h_j \rangle_{data} - \langle x_i h_j \rangle_{model} \\
-\frac{\partial \mathcal{C} (w_{ij}, a_i, b_j)}{\partial a_{i}} =& \langle x_i \rangle_{data} - \langle x_i \rangle_{model} \\
-\frac{\partial \mathcal{C} (w_{ij}, a_i, b_j)}{\partial b_{j}} =& \langle h_j \rangle_{data} - \langle h_j \rangle_{model}. \\
-\end{align*}
-$$
-<p>&nbsp;<br>
-
-<p>To get the expectation values with respect to the <em>data</em>, we set the visible units to each of the observed samples in the training data, then update the hidden units according to the conditional probability found before. We then average over all samples in the training data to calculate expectation values with respect to the data. </p>
-</section>
-
-<section>
-<h2 id="kullback-leibler-relative-entropy">Kullback-Leibler relative entropy </h2>
-
-<p>When the goal of the training is to approximate a probability
-distribution, as it is in generative modeling, another relevant
-measure is the <b>Kullback-Leibler divergence</b>, also known as the
-relative entropy. It is a non-symmetric measure of the
-dissimilarity between two probability density functions \( p \) and
-\( q \). If \( p \) is the unknown probability which we approximate with \( q \),
-we can measure the difference by
-</p>
-<p>&nbsp;<br>
-$$
-\begin{align*}
-\text{KL}(p||q) = \int_{-\infty}^{\infty} p (\boldsymbol{x}) \log \frac{p(\boldsymbol{x})}{q(\boldsymbol{x})} d\boldsymbol{x}.
-\end{align*}
-$$
-<p>&nbsp;<br>
-</section>
-
-<section>
-<h2 id="kullback-leibler-divergence">Kullback-Leibler divergence </h2>
-
-<p>Thus, the Kullback-Leibler divergence between the distribution of the
-training data \( f(\boldsymbol{x}) \) and the model distribution \( p(\boldsymbol{x}|
-\boldsymbol{\Theta}) \) is
-</p>
-
-<p>&nbsp;<br>
-$$
-\begin{align*}
-\text{KL} (f(\boldsymbol{x})|| p(\boldsymbol{x}| \boldsymbol{\Theta})) =& \int_{-\infty}^{\infty}
-f (\boldsymbol{x}) \log \frac{f(\boldsymbol{x})}{p(\boldsymbol{x}| \boldsymbol{\Theta})} d\boldsymbol{x} \\
-=& \int_{-\infty}^{\infty} f(\boldsymbol{x}) \log f(\boldsymbol{x}) d\boldsymbol{x} - \int_{-\infty}^{\infty} f(\boldsymbol{x}) \log
-p(\boldsymbol{x}| \boldsymbol{\Theta}) d\boldsymbol{x} \\
-%=& \mathbb{E}_{f(\boldsymbol{x})} (\log f(\boldsymbol{x})) - \mathbb{E}_{f(\boldsymbol{x})} (\log p(\boldsymbol{x}| \boldsymbol{\Theta}))
-=& \langle \log f(\boldsymbol{x}) \rangle_{f(\boldsymbol{x})} - \langle \log p(\boldsymbol{x}| \boldsymbol{\Theta}) \rangle_{f(\boldsymbol{x})} \\
-=& \langle \log f(\boldsymbol{x}) \rangle_{data} + \langle E(\boldsymbol{x}) \rangle_{data} + \log Z \\
-=& \langle \log f(\boldsymbol{x}) \rangle_{data} + \mathcal{C}_{LL} .
-\end{align*}
-$$
-<p>&nbsp;<br>
-</section>
-
-<section>
-<h2 id="maximizing-log-likelihood">Maximizing log-likelihood </h2>
-
-<p>The first term is constant with respect to \( \boldsymbol{\Theta} \) since
-\( f(\boldsymbol{x}) \) is independent of \( \boldsymbol{\Theta} \). Thus the Kullback-Leibler
-divergence is minimal when the second term is minimal. The second term
-is the log-likelihood cost function, hence minimizing the
-Kullback-Leibler divergence is equivalent to maximizing the
-log-likelihood.
-</p>
-
-<p>To further understand generative models it is useful to study the
-gradient of the cost function, which is needed in order to minimize it
-using methods like stochastic gradient descent.
-</p>
-</section>
-
-<section>
-<h2 id="more-on-the-partition-function">More on the partition function </h2>
-
-<p>The partition function is the generating function of
-expectation values; in particular, there are mathematical relationships
-between expectation values and the log-partition function. In this
-case we have
-</p>
-<p>&nbsp;<br>
-$$
-\begin{align*}
-\langle \frac{ \partial E(\boldsymbol{x}; \Theta_i) } { \partial \Theta_i} \rangle_{model}
-= \int p(\boldsymbol{x}| \boldsymbol{\Theta}) \frac{ \partial E(\boldsymbol{x}; \Theta_i) } { \partial \Theta_i} d\boldsymbol{x}
-= -\frac{\partial \log Z(\Theta_i)}{ \partial \Theta_i} .
-\end{align*}
-$$
-<p>&nbsp;<br>
-
-<p>Here \( \langle \cdot \rangle_{model} \) is the expectation value over the model probability distribution \( p(\boldsymbol{x}| \boldsymbol{\Theta}) \).</p>
-</section>
-
-<section>
-<h2 id="setting-up-for-gradient-descent-calculations">Setting up for gradient descent calculations </h2>
-
-<p>Using the previous relationship we can express the gradient of the cost function as</p>
-
-<p>&nbsp;<br>
-$$
-\begin{align*}
-\frac{\partial \mathcal{C}_{LL}}{\partial \Theta_i}
-=& \langle \frac{ \partial E(\boldsymbol{x}; \Theta_i) } { \partial \Theta_i} \rangle_{data} + \frac{\partial \log Z(\Theta_i)}{ \partial \Theta_i} \\
-=& \langle \frac{ \partial E(\boldsymbol{x}; \Theta_i) } { \partial \Theta_i} \rangle_{data} - \langle \frac{ \partial E(\boldsymbol{x}; \Theta_i) } { \partial \Theta_i} \rangle_{model} \\
-%=& \langle O_i(\boldsymbol{x}) \rangle_{data} - \langle O_i(\boldsymbol{x}) \rangle_{model}
-\end{align*}
-$$
-<p>&nbsp;<br>
-</section>
-
-<section>
-<h2 id="difference-of-moments">Difference of moments </h2>
-
-<p>This expression shows that the gradient of the log-likelihood cost
-function is a <b>difference of moments</b>, with one calculated from
-the data and one calculated from the model. The data-dependent term is
-called the <b>positive phase</b> and the model-dependent term is
-called the <b>negative phase</b> of the gradient. We see now that
-minimizing the cost function results in lowering the energy of
-configurations \( \boldsymbol{x} \) near points in the training data and
-increasing the energy of configurations not observed in the training
-data. That means we increase the model's probability of configurations
-similar to those in the training data.
-</p>
-</section>
-
-<section>
-<h2 id="more-observations">More observations </h2>
-
-<p>The gradient of the cost function also demonstrates why gradients of
-unsupervised, generative models must be computed differently from
-those of, for example, FNNs. While the data-dependent expectation value
-is easily calculated based on the samples \( \boldsymbol{x}_i \) in the training
-data, we must sample from the model in order to generate samples from
-which to calculate the model-dependent term. We sample from the model
-by using MCMC-based methods. We cannot sample from the model directly
-because the partition function \( Z \) is generally intractable.
-</p>
-</section>
-
-<section>
-<h2 id="adding-hyperparameters">Adding hyperparameters </h2>
-
-<p>As in supervised machine learning problems, the goal here is also to
-perform well on <b>unseen</b> data, that is, to have good
-generalization from the training data. The distribution \( f(x) \) we
-approximate is not the <b>true</b> distribution we wish to estimate;
-it is limited to the training data. Hence, in unsupervised training as
-well it is important to prevent overfitting to the training data. Thus
-it is common to add regularizers to the cost function in the same
-manner as we discussed for, say, linear regression.
-</p>
-</section>
-
 <section>
 <h2 id="theory-of-variational-autoencoders">Theory of Variational Autoencoders </h2>
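
The "More on the partition function" slides removed above rely on the identity <dE/dTheta_i>_model = -d log Z / d Theta_i. For a model small enough to enumerate all states, this identity can be checked numerically. The sketch below does so for a binary-binary RBM with two visible and two hidden units, comparing the model expectation of dE/dw_00 with a finite-difference derivative of log Z; the use of the joint energy over (x, h), the parameter values, and the step size are illustrative choices, not taken from the notes.

```python
import numpy as np
from itertools import product

def energy(x, h, W, a, b):
    # Binary-binary RBM energy E(x, h) = -a.x - b.h - x^T W h
    return -(a @ x) - (b @ h) - x @ W @ h

def log_Z(W, a, b, nv, nh):
    # Brute-force log partition function over all 2**(nv+nh) states
    E = [energy(np.array(x), np.array(h), W, a, b)
         for x in product([0.0, 1.0], repeat=nv)
         for h in product([0.0, 1.0], repeat=nh)]
    return np.log(np.sum(np.exp(-np.array(E))))

def model_expectation_dE_dW00(W, a, b, nv, nh):
    # <dE/dW_00>_model = <-x_0 h_0> under p(x, h) = exp(-E)/Z
    weights, values = [], []
    for x in product([0.0, 1.0], repeat=nv):
        for h in product([0.0, 1.0], repeat=nh):
            x_arr, h_arr = np.array(x), np.array(h)
            weights.append(np.exp(-energy(x_arr, h_arr, W, a, b)))
            values.append(-x_arr[0] * h_arr[0])   # dE/dW_00
    weights = np.array(weights) / np.sum(weights)
    return np.sum(weights * np.array(values))

rng = np.random.default_rng(1)
nv, nh = 2, 2
W = rng.standard_normal((nv, nh))
a = rng.standard_normal(nv)
b = rng.standard_normal(nh)

# Compare <dE/dW_00>_model with -d log Z / d W_00 (central finite difference)
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
lhs = model_expectation_dE_dW00(W, a, b, nv, nh)
rhs = -(log_Z(Wp, a, b, nv, nh) - log_Z(Wm, a, b, nv, nh)) / (2 * eps)
print(lhs, rhs)   # the two values agree up to finite-difference error
```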
