
Commit 09cd3a8

committed
added more rnn material in case we get there
1 parent 8b2a918 commit 09cd3a8

File tree

8 files changed: +1556 −173 lines changed


doc/pub/week5/html/week5-bs.html

Lines changed: 248 additions & 1 deletion
@@ -216,7 +216,55 @@
('RNNs in more detail, part 7',
 2,
 None,
 'rnns-in-more-detail-part-7')]}
 'rnns-in-more-detail-part-7'),
('Backpropagation through time',
 2,
 None,
 'backpropagation-through-time'),
('The backward pass is linear',
 2,
 None,
 'the-backward-pass-is-linear'),
('The problem of exploding or vanishing gradients',
 2,
 None,
 'the-problem-of-exploding-or-vanishing-gradients'),
('Mathematical setup', 2, None, 'mathematical-setup'),
('Back propagation in time through figures, part 1',
 2,
 None,
 'back-propagation-in-time-through-figures-part-1'),
('Back propagation in time, part 2',
 2,
 None,
 'back-propagation-in-time-part-2'),
('Back propagation in time, part 3',
 2,
 None,
 'back-propagation-in-time-part-3'),
('Back propagation in time, part 4',
 2,
 None,
 'back-propagation-in-time-part-4'),
('Back propagation in time in equations',
 2,
 None,
 'back-propagation-in-time-in-equations'),
('Chain rule again', 2, None, 'chain-rule-again'),
('Gradients of loss functions',
 2,
 None,
 'gradients-of-loss-functions'),
('Summary of RNNs', 2, None, 'summary-of-rnns'),
('Summary of a typical RNN',
 2,
 None,
 'summary-of-a-typical-rnn'),
('Four effective ways to learn an RNN and preparing for next '
 'week',
 2,
 None,
 'four-effective-ways-to-learn-an-rnn-and-preparing-for-next-week')]}
end of tocinfo -->

<body>
@@ -326,6 +374,20 @@
<!-- navigation toc: --> <li><a href="#rnns-in-more-detail-part-5" style="font-size: 80%;"><b>RNNs in more detail, part 5</b></a></li>
<!-- navigation toc: --> <li><a href="#rnns-in-more-detail-part-6" style="font-size: 80%;"><b>RNNs in more detail, part 6</b></a></li>
<!-- navigation toc: --> <li><a href="#rnns-in-more-detail-part-7" style="font-size: 80%;"><b>RNNs in more detail, part 7</b></a></li>
<!-- navigation toc: --> <li><a href="#backpropagation-through-time" style="font-size: 80%;"><b>Backpropagation through time</b></a></li>
<!-- navigation toc: --> <li><a href="#the-backward-pass-is-linear" style="font-size: 80%;"><b>The backward pass is linear</b></a></li>
<!-- navigation toc: --> <li><a href="#the-problem-of-exploding-or-vanishing-gradients" style="font-size: 80%;"><b>The problem of exploding or vanishing gradients</b></a></li>
<!-- navigation toc: --> <li><a href="#mathematical-setup" style="font-size: 80%;"><b>Mathematical setup</b></a></li>
<!-- navigation toc: --> <li><a href="#back-propagation-in-time-through-figures-part-1" style="font-size: 80%;"><b>Back propagation in time through figures, part 1</b></a></li>
<!-- navigation toc: --> <li><a href="#back-propagation-in-time-part-2" style="font-size: 80%;"><b>Back propagation in time, part 2</b></a></li>
<!-- navigation toc: --> <li><a href="#back-propagation-in-time-part-3" style="font-size: 80%;"><b>Back propagation in time, part 3</b></a></li>
<!-- navigation toc: --> <li><a href="#back-propagation-in-time-part-4" style="font-size: 80%;"><b>Back propagation in time, part 4</b></a></li>
<!-- navigation toc: --> <li><a href="#back-propagation-in-time-in-equations" style="font-size: 80%;"><b>Back propagation in time in equations</b></a></li>
<!-- navigation toc: --> <li><a href="#chain-rule-again" style="font-size: 80%;"><b>Chain rule again</b></a></li>
<!-- navigation toc: --> <li><a href="#gradients-of-loss-functions" style="font-size: 80%;"><b>Gradients of loss functions</b></a></li>
<!-- navigation toc: --> <li><a href="#summary-of-rnns" style="font-size: 80%;"><b>Summary of RNNs</b></a></li>
<!-- navigation toc: --> <li><a href="#summary-of-a-typical-rnn" style="font-size: 80%;"><b>Summary of a typical RNN</b></a></li>
<!-- navigation toc: --> <li><a href="#four-effective-ways-to-learn-an-rnn-and-preparing-for-next-week" style="font-size: 80%;"><b>Four effective ways to learn an RNN and preparing for next week</b></a></li>

</ul>
</li>
@@ -5049,6 +5111,191 @@ <h2 id="rnns-in-more-detail-part-7" class="anchor">RNNs in more detail, part 7
</center>
<br/><br/>

<!-- !split -->
<h2 id="backpropagation-through-time" class="anchor">Backpropagation through time </h2>

<div class="panel panel-default">
<div class="panel-body">
<!-- subsequent paragraphs come in larger fonts, so start with a paragraph -->
<p>We can think of the recurrent net as a layered feed-forward
net with shared weights, and then train this feed-forward net
with weight constraints.
</p>
</div>
</div>

<p>We can also think of this training algorithm in the time domain:</p>
<ol>
<li> The forward pass builds up a stack of the activities of all the units at each time step.</li>
<li> The backward pass peels activities off the stack to compute the error derivatives at each time step.</li>
<li> After the backward pass we add together the derivatives at all the different times for each weight (see the sketch after this list).</li>
</ol>
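<p>To make the unrolled view concrete, here is a minimal NumPy skeleton (not part of the original slides; the function names <code>step_forward</code> and <code>step_backward</code> are hypothetical placeholders). It mirrors the three steps above: the forward pass pushes the per-step activities onto a stack, the backward pass pops them in reverse order, and the per-step derivatives are summed because the weights are shared across time.</p>

<pre><code class="language-python">
import numpy as np

def bptt_skeleton(x_seq, h0, W, step_forward, step_backward, dL_dh_final):
    """Generic backpropagation-through-time skeleton with shared weights W.

    step_forward(x_t, h_prev, W) is assumed to return the next hidden state h_t.
    step_backward(dL_dh_t, x_t, h_prev, h_t, W) is assumed to return
    (dL_dh_prev, dL_dW_t): the gradient passed back in time and the
    contribution of this time step to the weight gradient.
    """
    # 1) Forward pass: build up a stack of activities, one entry per time step.
    stack = []
    h = h0
    for x_t in x_seq:
        h_next = step_forward(x_t, h, W)
        stack.append((x_t, h, h_next))   # store what the backward pass needs
        h = h_next

    # 2) Backward pass: peel activities off the stack in reverse time order.
    dL_dh = dL_dh_final
    dL_dW = np.zeros_like(W)
    while stack:
        x_t, h_prev, h_t = stack.pop()
        dL_dh, dL_dW_t = step_backward(dL_dh, x_t, h_prev, h_t, W)
        # 3) Add together the derivatives from all time steps (shared weights).
        dL_dW += dL_dW_t
    return dL_dW
</code></pre>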

<!-- !split -->
<h2 id="the-backward-pass-is-linear" class="anchor">The backward pass is linear </h2>

<ol>
<li> There is a big difference between the forward and backward passes.</li>
<li> In the forward pass we use squashing functions (like the logistic) to prevent the activity vectors from exploding.</li>
<li> The backward pass is completely linear. If you double the error derivatives at the final layer, all the error derivatives will double.</li>
</ol>
<p>The forward pass determines the slope of the linear function used for
backpropagating through each neuron.
</p>
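<p>A small numerical check of this linearity (a sketch, not part of the original slides): once the forward pass is fixed, each backward step multiplies by the same matrices and slopes, so scaling the error derivatives at the final step scales all earlier derivatives by the same factor. The logistic activation is assumed here purely for illustration.</p>

<pre><code class="language-python">
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 6
W = rng.normal(scale=0.7, size=(n, n))

# Forward pass with a squashing (logistic) activation; it fixes the slopes
# sigma'(a) = h*(1-h) that the backward pass will use.
h = rng.normal(size=n)
slopes = []
for t in range(T):
    h = 1.0 / (1.0 + np.exp(-(W @ h)))
    slopes.append(h * (1.0 - h))

def backprop(delta_final):
    """Propagate an error derivative backwards through the fixed forward pass."""
    delta = delta_final
    for t in reversed(range(T)):
        delta = W.T @ (slopes[t] * delta)   # a linear map once the slopes are fixed
    return delta

d1 = backprop(np.ones(n))
d2 = backprop(2.0 * np.ones(n))
print(np.allclose(d2, 2.0 * d1))   # True: doubling the final error doubles all derivatives
</code></pre>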

<!-- !split -->
<h2 id="the-problem-of-exploding-or-vanishing-gradients" class="anchor">The problem of exploding or vanishing gradients </h2>
<ul>
<li> What happens to the magnitude of the gradients as we backpropagate through many layers?
<ol type="a">
<li> If the weights are small, the gradients shrink exponentially.</li>
<li> If the weights are big, the gradients grow exponentially.</li>
</ol>
</li>
<li> Typical feed-forward neural nets can cope with these exponential effects because they only have a few hidden layers.</li>
<li> In an RNN trained on long sequences (e.g. 100 time steps) the gradients can easily explode or vanish.
<ol type="a">
<li> We can avoid this by initializing the weights very carefully.</li>
</ol>
</li>
<li> Even with good initial weights, it is very hard to detect that the current target output depends on an input from many time steps ago.</li>
</ul>
<p>RNNs therefore have difficulty dealing with long-range dependencies; the short numerical illustration below shows both regimes.</p>
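<p>The following NumPy sketch (not from the original slides) illustrates the two regimes: an error derivative propagated back through \( T \) identical recurrent steps either decays or blows up roughly like the \( T \)-th power of the scale of the recurrent weights.</p>

<pre><code class="language-python">
import numpy as np

def backprop_norm(weight_scale, T=100, n=8, seed=1):
    """Norm of an error derivative propagated back through T recurrent steps.

    For simplicity the recurrence is taken to be linear, so each backward step
    just multiplies by W^T; with tanh or logistic units the slopes sigma'(a),
    which are at most one, make the shrinking even faster.
    """
    rng = np.random.default_rng(seed)
    # Scale so that the spectral radius of W is roughly weight_scale.
    W = weight_scale * rng.normal(size=(n, n)) / np.sqrt(n)
    delta = np.ones(n)              # error derivative at the final time step
    for t in range(T):              # backward pass through T shared-weight steps
        delta = W.T @ delta
    return np.linalg.norm(delta)

print("small weights:", backprop_norm(0.5))   # shrinks roughly like 0.5**100
print("large weights:", backprop_norm(1.5))   # grows roughly like 1.5**100
</code></pre>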

<!-- !split -->
<h2 id="mathematical-setup" class="anchor">Mathematical setup </h2>

<p>The expression for the simplest recurrent network resembles that of a
regular feed-forward neural network, but now with
temporal dependencies between the hidden states,
</p>

$$
\begin{align*}
\mathbf{a}^{(t)} & = U \mathbf{x}^{(t)} + W \mathbf{h}^{(t-1)} + \mathbf{b}, \notag \\
\mathbf{h}^{(t)} &= \sigma_h(\mathbf{a}^{(t)}), \notag\\
\mathbf{y}^{(t)} &= V \mathbf{h}^{(t)} + \mathbf{c}, \notag\\
\mathbf{\hat{y}}^{(t)} &= \sigma_y(\mathbf{y}^{(t)}).
\end{align*}
$$
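<p>A direct NumPy transcription of these four equations (a minimal sketch, not part of the original slides; \( \sigma_h=\tanh \) and an identity \( \sigma_y \) are assumed for concreteness, and the dimensions in the toy usage are chosen only for illustration) could look as follows.</p>

<pre><code class="language-python">
import numpy as np

def rnn_forward(x_seq, U, W, V, b, c, h0):
    """Forward pass of the simplest RNN, one time step per input vector."""
    h = h0
    hs, y_hats = [], []
    for x_t in x_seq:
        a_t = U @ x_t + W @ h + b        # a^(t) = U x^(t) + W h^(t-1) + b
        h = np.tanh(a_t)                 # h^(t) = sigma_h(a^(t)), here tanh
        y_t = V @ h + c                  # y^(t) = V h^(t) + c
        y_hat = y_t                      # yhat^(t) = sigma_y(y^(t)), identity here
        hs.append(h)
        y_hats.append(y_hat)
    return hs, y_hats

# Toy usage with random parameters.
rng = np.random.default_rng(0)
n_in, n_h, n_out, T = 3, 5, 2, 4
U = rng.normal(size=(n_h, n_in))
W = rng.normal(size=(n_h, n_h))
V = rng.normal(size=(n_out, n_h))
b = np.zeros(n_h)
c = np.zeros(n_out)
x_seq = [rng.normal(size=n_in) for t in range(T)]
hs, y_hats = rnn_forward(x_seq, U, W, V, b, c, h0=np.zeros(n_h))
print(len(y_hats), y_hats[0].shape)   # 4 outputs, each of dimension 2
</code></pre>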

<!-- !split -->
<h2 id="back-propagation-in-time-through-figures-part-1" class="anchor">Back propagation in time through figures, part 1 </h2>

<br/><br/>
<center>
<p><img src="figslides/RNN9.png" width="700" align="bottom"></p>
</center>
<br/><br/>

<!-- !split -->
<h2 id="back-propagation-in-time-part-2" class="anchor">Back propagation in time, part 2 </h2>

<br/><br/>
<center>
<p><img src="figslides/RNN10.png" width="700" align="bottom"></p>
</center>
<br/><br/>

<!-- !split -->
<h2 id="back-propagation-in-time-part-3" class="anchor">Back propagation in time, part 3 </h2>

<br/><br/>
<center>
<p><img src="figslides/RNN11.png" width="700" align="bottom"></p>
</center>
<br/><br/>

<!-- !split -->
<h2 id="back-propagation-in-time-part-4" class="anchor">Back propagation in time, part 4 </h2>

<br/><br/>
<center>
<p><img src="figslides/RNN12.png" width="700" align="bottom"></p>
</center>
<br/><br/>
<!-- !split -->
<h2 id="back-propagation-in-time-in-equations" class="anchor">Back propagation in time in equations </h2>

<p>To derive the expressions for the gradients of \( \mathcal{L} \) for
the RNN, we start recursively from the nodes closest to the
output layer in the temporal unrolling scheme, that is \( \mathbf{y} \)
and \( \mathbf{h} \) at the final time \( t = \tau \),
</p>

$$
\begin{align*}
(\nabla_{\mathbf{y}^{(t)}} \mathcal{L})_{i} &= \frac{\partial \mathcal{L}}{\partial L^{(t)}}\frac{\partial L^{(t)}}{\partial y_{i}^{(t)}}, \notag\\
\nabla_{\mathbf{h}^{(\tau)}} \mathcal{L} &= \mathbf{V}^\mathsf{T}\nabla_{\mathbf{y}^{(\tau)}} \mathcal{L}.
\end{align*}
$$

<!-- !split -->
<h2 id="chain-rule-again" class="anchor">Chain rule again </h2>
<p>For the hidden nodes at earlier times we have to iterate backwards through time, so by the chain rule,</p>

$$
\begin{align*}
\nabla_{\mathbf{h}^{(t)}} \mathcal{L} &= \left(\frac{\partial\mathbf{h}^{(t+1)}}{\partial\mathbf{h}^{(t)}}\right)^\mathsf{T}\nabla_{\mathbf{h}^{(t+1)}}\mathcal{L} + \left(\frac{\partial\mathbf{y}^{(t)}}{\partial\mathbf{h}^{(t)}}\right)^\mathsf{T}\nabla_{\mathbf{y}^{(t)}} \mathcal{L}.
\end{align*}
$$

<!-- !split -->
<h2 id="gradients-of-loss-functions" class="anchor">Gradients of loss functions </h2>
<p>Similarly, the gradients of \( \mathcal{L} \) with respect to the weights and biases follow,</p>

$$
\begin{align*}
\nabla_{\mathbf{c}} \mathcal{L} &=\sum_{t}\left(\frac{\partial \mathbf{y}^{(t)}}{\partial \mathbf{c}}\right)^\mathsf{T} \nabla_{\mathbf{y}^{(t)}} \mathcal{L}, \notag\\
\nabla_{\mathbf{b}} \mathcal{L} &=\sum_{t}\left(\frac{\partial \mathbf{h}^{(t)}}{\partial \mathbf{b}}\right)^\mathsf{T} \nabla_{\mathbf{h}^{(t)}} \mathcal{L}, \notag\\
\nabla_{\mathbf{V}} \mathcal{L} &=\sum_{t}\sum_{i}\left(\frac{\partial \mathcal{L}}{\partial y_i^{(t)}}\right)\nabla_{\mathbf{V}^{(t)}}y_i^{(t)}, \notag\\
\nabla_{\mathbf{W}} \mathcal{L} &=\sum_{t}\sum_{i}\left(\frac{\partial \mathcal{L}}{\partial h_i^{(t)}}\right)\nabla_{\mathbf{W}^{(t)}} h_i^{(t)}, \notag\\
\nabla_{\mathbf{U}} \mathcal{L} &=\sum_{t}\sum_{i}\left(\frac{\partial \mathcal{L}}{\partial h_i^{(t)}}\right)\nabla_{\mathbf{U}^{(t)}}h_i^{(t)}.
\label{eq:rnn_gradients3}
\end{align*}
$$
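<p>These recursions translate almost line by line into code. The following NumPy sketch (not part of the original slides) assumes the forward pass of the Mathematical setup slide with \( \sigma_h=\tanh \), an identity \( \sigma_y \), and a quadratic loss summed over time; with those choices \( \partial\mathbf{h}^{(t+1)}/\partial\mathbf{h}^{(t)} = \mathrm{diag}\big(1-(\mathbf{h}^{(t+1)})^2\big)\,W \).</p>

<pre><code class="language-python">
import numpy as np

def rnn_bptt(x_seq, y_targets, U, W, V, b, c, h0):
    """Backpropagation through time for a tanh RNN with quadratic loss."""
    # Forward pass, storing the hidden states needed by the backward pass.
    hs, y_hats, h = [h0], [], h0
    for x_t in x_seq:
        h = np.tanh(U @ x_t + W @ h + b)
        hs.append(h)
        y_hats.append(V @ h + c)

    # Gradients of L with respect to the parameters, accumulated over time.
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    dh_next = np.zeros_like(h0)            # gradient carried back from the future, zero at t = tau
    for t in reversed(range(len(x_seq))):
        dy = y_hats[t] - y_targets[t]      # nabla_{y^(t)} L for the quadratic loss
        # nabla_{h^(t)} L: term from y^(t) plus the term carried back from t+1.
        dh = V.T @ dy + dh_next
        da = (1.0 - hs[t + 1] ** 2) * dh   # through the tanh: diag(1 - h^2) dh
        dV += np.outer(dy, hs[t + 1])
        dc += dy
        dU += np.outer(da, x_seq[t])
        dW += np.outer(da, hs[t])
        db += da
        dh_next = W.T @ da                 # (dh^(t)/dh^(t-1))^T nabla_{h^(t)} L
    return dU, dW, dV, db, dc
</code></pre>

<p>A finite-difference check on a small example is a convenient way to validate such a hand-written backward pass.</p>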

<!-- !split -->
<h2 id="summary-of-rnns" class="anchor">Summary of RNNs </h2>

<p>Recurrent neural networks (RNNs) have in general no probabilistic component
in the model. With a given fixed input and target from the data, the RNN learns the intermediate
associations between the various layers.
The inputs, outputs, and internal representations (hidden states) are all
real-valued vectors.
</p>

<p>In a traditional NN, it is assumed that all inputs are
independent of each other. With sequential data, however, the input at a given stage \( t \) depends on the input from the previous stage \( t-1 \).
</p>

<!-- !split -->
<h2 id="summary-of-a-typical-rnn" class="anchor">Summary of a typical RNN </h2>

<ol>
<li> Weight matrices \( U \), \( W \) and \( V \) that connect the input layer at a stage \( t \) with the hidden layer \( h_t \), the previous hidden layer \( h_{t-1} \) with \( h_t \), and the hidden layer \( h_t \) with the output layer at the same stage, producing an output \( \tilde{y}_t \), respectively.</li>
<li> The output from the hidden layer \( h_t \) is often modulated by a \( \tanh{} \) function, \( h_t=\sigma_h(x_t,h_{t-1})=\tanh{(Ux_t+Wh_{t-1}+b)} \), with \( b \) a bias value.</li>
<li> The output from the hidden layer produces \( \tilde{y}_t=\sigma_y(Vh_t+c) \), where \( c \) is a new bias parameter.</li>
<li> The output from the training at a given stage is in turn compared with the observation \( y_t \) through a chosen cost function.</li>
</ol>
<p>The output activation function \( \sigma_y \) can be any of the standard activation functions, that is, a sigmoid, a softmax, a ReLU and so on.
The parameters are trained through the so-called back-propagation through time (BPTT) algorithm.
</p>

<!-- !split -->
<h2 id="four-effective-ways-to-learn-an-rnn-and-preparing-for-next-week" class="anchor">Four effective ways to learn an RNN and preparing for next week </h2>
<ol>
<li> Long Short Term Memory: make the RNN out of little modules that are designed to remember values for a long time.</li>
<li> Hessian Free Optimization: deal with the vanishing-gradients problem by using a fancy optimizer that can detect directions with a tiny gradient but even smaller curvature.</li>
<li> Echo State Networks: initialize the input-to-hidden, hidden-to-hidden and output-to-hidden connections very carefully so that the hidden state has a huge reservoir of weakly coupled oscillators which can be selectively driven by the input.
<ul>
<li> ESNs only need to learn the hidden-to-output connections.</li>
</ul>
</li>
<li> Good initialization with momentum: initialize as in Echo State Networks, but then learn all of the connections using momentum.</li>
</ol>
<!-- ------------------- end of main content --------------- -->
</div> <!-- end container -->
<!-- include javascript, jQuery *first* -->
