|
216 | 216 | ('RNNs in more detail, part 7', |
217 | 217 | 2, |
218 | 218 | None, |
219 | | - 'rnns-in-more-detail-part-7')]} |
| 219 | + 'rnns-in-more-detail-part-7'), |
| 220 | + ('Backpropagation through time', |
| 221 | + 2, |
| 222 | + None, |
| 223 | + 'backpropagation-through-time'), |
| 224 | + ('The backward pass is linear', |
| 225 | + 2, |
| 226 | + None, |
| 227 | + 'the-backward-pass-is-linear'), |
| 228 | + ('The problem of exploding or vanishing gradients', |
| 229 | + 2, |
| 230 | + None, |
| 231 | + 'the-problem-of-exploding-or-vanishing-gradients'), |
| 232 | + ('Mathematical setup', 2, None, 'mathematical-setup'), |
| 233 | + ('Back propagation in time through figures, part 1', |
| 234 | + 2, |
| 235 | + None, |
| 236 | + 'back-propagation-in-time-through-figures-part-1'), |
| 237 | + ('Back propagation in time, part 2', |
| 238 | + 2, |
| 239 | + None, |
| 240 | + 'back-propagation-in-time-part-2'), |
| 241 | + ('Back propagation in time, part 3', |
| 242 | + 2, |
| 243 | + None, |
| 244 | + 'back-propagation-in-time-part-3'), |
| 245 | + ('Back propagation in time, part 4', |
| 246 | + 2, |
| 247 | + None, |
| 248 | + 'back-propagation-in-time-part-4'), |
| 249 | + ('Back propagation in time in equations', |
| 250 | + 2, |
| 251 | + None, |
| 252 | + 'back-propagation-in-time-in-equations'), |
| 253 | + ('Chain rule again', 2, None, 'chain-rule-again'), |
| 254 | + ('Gradients of loss functions', |
| 255 | + 2, |
| 256 | + None, |
| 257 | + 'gradients-of-loss-functions'), |
| 258 | + ('Summary of RNNs', 2, None, 'summary-of-rnns'), |
| 259 | + ('Summary of a typical RNN', |
| 260 | + 2, |
| 261 | + None, |
| 262 | + 'summary-of-a-typical-rnn'), |
| 263 | + ('Four effective ways to learn an RNN and preparing for next ' |
| 264 | + 'week', |
| 265 | + 2, |
| 266 | + None, |
| 267 | + 'four-effective-ways-to-learn-an-rnn-and-preparing-for-next-week')]} |
220 | 268 | end of tocinfo --> |
221 | 269 |
|
222 | 270 | <body> |
|
326 | 374 | <!-- navigation toc: --> <li><a href="#rnns-in-more-detail-part-5" style="font-size: 80%;"><b>RNNs in more detail, part 5</b></a></li> |
327 | 375 | <!-- navigation toc: --> <li><a href="#rnns-in-more-detail-part-6" style="font-size: 80%;"><b>RNNs in more detail, part 6</b></a></li> |
328 | 376 | <!-- navigation toc: --> <li><a href="#rnns-in-more-detail-part-7" style="font-size: 80%;"><b>RNNs in more detail, part 7</b></a></li> |
| 377 | + <!-- navigation toc: --> <li><a href="#backpropagation-through-time" style="font-size: 80%;"><b>Backpropagation through time</b></a></li> |
| 378 | + <!-- navigation toc: --> <li><a href="#the-backward-pass-is-linear" style="font-size: 80%;"><b>The backward pass is linear</b></a></li> |
| 379 | + <!-- navigation toc: --> <li><a href="#the-problem-of-exploding-or-vanishing-gradients" style="font-size: 80%;"><b>The problem of exploding or vanishing gradients</b></a></li> |
| 380 | + <!-- navigation toc: --> <li><a href="#mathematical-setup" style="font-size: 80%;"><b>Mathematical setup</b></a></li> |
| 381 | + <!-- navigation toc: --> <li><a href="#back-propagation-in-time-through-figures-part-1" style="font-size: 80%;"><b>Back propagation in time through figures, part 1</b></a></li> |
| 382 | + <!-- navigation toc: --> <li><a href="#back-propagation-in-time-part-2" style="font-size: 80%;"><b>Back propagation in time, part 2</b></a></li> |
| 383 | + <!-- navigation toc: --> <li><a href="#back-propagation-in-time-part-3" style="font-size: 80%;"><b>Back propagation in time, part 3</b></a></li> |
| 384 | + <!-- navigation toc: --> <li><a href="#back-propagation-in-time-part-4" style="font-size: 80%;"><b>Back propagation in time, part 4</b></a></li> |
| 385 | + <!-- navigation toc: --> <li><a href="#back-propagation-in-time-in-equations" style="font-size: 80%;"><b>Back propagation in time in equations</b></a></li> |
| 386 | + <!-- navigation toc: --> <li><a href="#chain-rule-again" style="font-size: 80%;"><b>Chain rule again</b></a></li> |
| 387 | + <!-- navigation toc: --> <li><a href="#gradients-of-loss-functions" style="font-size: 80%;"><b>Gradients of loss functions</b></a></li> |
| 388 | + <!-- navigation toc: --> <li><a href="#summary-of-rnns" style="font-size: 80%;"><b>Summary of RNNs</b></a></li> |
| 389 | + <!-- navigation toc: --> <li><a href="#summary-of-a-typical-rnn" style="font-size: 80%;"><b>Summary of a typical RNN</b></a></li> |
| 390 | + <!-- navigation toc: --> <li><a href="#four-effective-ways-to-learn-an-rnn-and-preparing-for-next-week" style="font-size: 80%;"><b>Four effective ways to learn an RNN and preparing for next week</b></a></li> |
329 | 391 |
|
330 | 392 | </ul> |
331 | 393 | </li> |
@@ -5049,6 +5111,191 @@ <h2 id="rnns-in-more-detail-part-7" class="anchor">RNNs in more detail, part 7 |
5049 | 5111 | </center> |
5050 | 5112 | <br/><br/> |
5051 | 5113 |
|
| 5114 | +<!-- !split --> |
| 5115 | +<h2 id="backpropagation-through-time" class="anchor">Backpropagation through time </h2> |
| 5116 | + |
| 5117 | +<div class="panel panel-default"> |
| 5118 | +<div class="panel-body"> |
| 5119 | +<!-- subsequent paragraphs come in larger fonts, so start with a paragraph --> |
| 5120 | +<p>We can think of the recurrent net as a layered, feed-forward |
| 5121 | +net with shared weights and then train the feed-forward net |
| 5122 | +with weight constraints. |
| 5123 | +</p> |
| 5124 | +</div> |
| 5125 | +</div> |
| 5126 | + |
| 5127 | + |
| 5128 | +<p>We can also think of this training algorithm in the time domain:</p> |
| 5129 | +<ol> |
| 5130 | +<li> The forward pass builds up a stack of the activities of all the units at each time step.</li> |
| 5131 | +<li> The backward pass peels activities off the stack to compute the error derivatives at each time step.</li> |
| 5132 | +<li> After the backward pass we add together the derivatives at all the different time steps for each weight, as illustrated in the small numerical check after this list.</li> |
| 5133 | +</ol> |
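|  | +<p>As a small illustration of point 3 (summing the per-time-step derivatives of a shared weight), here is a minimal numerical check in Python. The toy two-step map \( L = w\,(w\,x) \), the values of \( x \) and \( w \), and the finite-difference step are illustrative assumptions, not part of the lecture material.</p> |
|  | +<pre><code class="language-python"># The gradient of a shared weight equals the sum of the per-time-step |
|  | +# derivatives, shown here for the toy two-step map L = w*(w*x). |
|  | +x, w, eps = 1.5, 0.8, 1e-6          # illustrative values (assumed) |
|  | + |
|  | +h1 = w * x                          # step 1 uses the shared weight w |
|  | +L  = w * h1                         # step 2 reuses the same weight w |
|  | + |
|  | +dL_dw_step2 = h1                    # treat the second use of w as its own copy |
|  | +dL_dw_step1 = w * x                 # contribution from the first use of w |
|  | +summed = dL_dw_step1 + dL_dw_step2  # add the derivatives over all time steps |
|  | + |
|  | +# central finite difference as an independent check |
|  | +numerical = ((w + eps) * ((w + eps) * x) - (w - eps) * ((w - eps) * x)) / (2 * eps) |
|  | +print(summed, numerical)            # both equal 2*w*x = 2.4 |
|  | +</code></pre> |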
| 5134 | +<!-- !split --> |
| 5135 | +<h2 id="the-backward-pass-is-linear" class="anchor">The backward pass is linear </h2> |
| 5136 | + |
| 5137 | +<ol> |
| 5138 | +<li> There is a big difference between the forward and backward passes.</li> |
| 5139 | +<li> In the forward pass we use squashing functions (like the logistic) to prevent the activity vectors from exploding.</li> |
| 5140 | +<li> The backward pass is completely linear. If you double the error derivatives at the final layer, all the error derivatives will double.</li> |
| 5141 | +</ol> |
| 5142 | +<p>The forward pass determines the slope of the linear function used for |
| 5143 | +backpropagating through each neuron; a small numerical check of this linearity follows below. |
| 5144 | +</p> |
| 5145 | + |
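|  | +<p>To make the linearity concrete, a tiny NumPy check: backpropagating an upstream gradient through one logistic layer multiplies it by the fixed slope \( \sigma'(a)=\sigma(a)(1-\sigma(a)) \) set by the forward pass, so doubling the upstream derivatives exactly doubles the result. The layer sizes and random values are illustrative assumptions.</p> |
|  | +<pre><code class="language-python">import numpy as np |
|  | + |
|  | +rng = np.random.default_rng(0) |
|  | +W, b = rng.normal(size=(4, 3)), np.zeros(4)   # assumed toy layer |
|  | +x = rng.normal(size=3) |
|  | + |
|  | +# the forward pass fixes the slope of the locally linear backward map |
|  | +a = W @ x + b |
|  | +h = 1.0 / (1.0 + np.exp(-a))                  # logistic squashing |
|  | +slope = h * (1.0 - h)                         # sigma'(a), fixed after the forward pass |
|  | + |
|  | +g = rng.normal(size=4)                        # upstream error derivatives dL/dh |
|  | +delta1 = slope * g                            # backpropagated dL/da |
|  | +delta2 = slope * (2.0 * g)                    # doubled upstream derivatives |
|  | +print(np.allclose(delta2, 2.0 * delta1))      # True: the backward pass is linear |
|  | +</code></pre> |
|  | + |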
| 5146 | +<!-- !split --> |
| 5147 | +<h2 id="the-problem-of-exploding-or-vanishing-gradients" class="anchor">The problem of exploding or vanishing gradients </h2> |
| 5148 | +<ul> |
| 5149 | +<li> What happens to the magnitude of the gradients as we backpropagate through many layers? |
| 5150 | +<ol type="a"></li> |
| 5151 | + <li> If the weights are small, the gradients shrink exponentially.</li> |
| 5152 | + <li> If the weights are big the gradients grow exponentially.</li> |
| 5153 | +</ol> |
| 5154 | +<li> Typical feed-forward neural nets can cope with these exponential effects because they only have a few hidden layers.</li> |
| 5155 | +<li> In an RNN trained on long sequences (e.g. 100 time steps) the gradients can easily explode or vanish. |
| 5156 | +<ol type="a"></li> |
| 5157 | + <li> We can avoid this by initializing the weights very carefully.</li> |
| 5158 | +</ol> |
| 5159 | +<li> Even with good initial weights, its very hard to detect that the current target output depends on an input from many time-steps ago.</li> |
| 5160 | +</ul> |
| 5161 | +<p>RNNs have difficulty dealing with long-range dependencies. </p> |
| 5162 | + |
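|  | +<p>A short numerical sketch of the effect: repeatedly multiplying a gradient vector by \( W^\mathsf{T}\,\mathrm{diag}(\sigma_h'(a^{(t)})) \) over 100 steps makes its norm shrink or grow exponentially, depending on the scale of \( W \). The matrix size, the two scaling factors and the use of \( \tanh \) are illustrative assumptions.</p> |
|  | +<pre><code class="language-python">import numpy as np |
|  | + |
|  | +rng = np.random.default_rng(0) |
|  | +n_h, T = 20, 100                                 # assumed toy sizes |
|  | + |
|  | +for scale in (0.5, 3.0):                         # "small" versus "big" weights |
|  | +    W = scale * rng.normal(0, 1.0 / np.sqrt(n_h), (n_h, n_h)) |
|  | +    g = np.ones(n_h)                             # gradient arriving at the last step |
|  | +    for t in range(T): |
|  | +        a = rng.normal(size=n_h)                 # stand-in pre-activations |
|  | +        g = W.T @ ((1.0 - np.tanh(a) ** 2) * g)  # one step of backpropagation in time |
|  | +    print(scale, np.linalg.norm(g))              # vanishingly small vs. huge norm |
|  | +</code></pre> |
|  | + |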
| 5163 | +<!-- !split --> |
| 5164 | +<h2 id="mathematical-setup" class="anchor">Mathematical setup </h2> |
| 5165 | + |
| 5166 | +<p>The expression for the simplest recurrent network resembles that of a |
| 5167 | +regular feed-forward neural network, but now with |
| 5168 | +temporal dependencies between consecutive time steps, |
| 5169 | +</p> |
| 5170 | + |
| 5171 | +$$ |
| 5172 | +\begin{align*} |
| 5173 | +  \mathbf{a}^{(t)} & = U \mathbf{x}^{(t)} + W \mathbf{h}^{(t-1)} + \mathbf{b}, \notag \\ |
| 5174 | +  \mathbf{h}^{(t)} &= \sigma_h(\mathbf{a}^{(t)}), \notag\\ |
| 5175 | +  \mathbf{y}^{(t)} &= V \mathbf{h}^{(t)} + \mathbf{c}, \notag\\ |
| 5176 | + \mathbf{\hat{y}}^{(t)} &= \sigma_y(\mathbf{y}^{(t)}). |
| 5177 | +\end{align*} |
| 5178 | +$$ |
| 5179 | + |
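|  | +<p>As a concrete illustration of these equations, here is a minimal forward pass in NumPy. The dimensions, the random initialization, and the choices \( \sigma_h=\tanh \) and \( \sigma_y=\mathrm{softmax} \) are assumptions made only for this sketch.</p> |
|  | +<pre><code class="language-python">import numpy as np |
|  | + |
|  | +rng = np.random.default_rng(0) |
|  | +n_x, n_h, n_y, T = 3, 4, 2, 5                 # assumed toy dimensions |
|  | +U = rng.normal(0, 0.1, (n_h, n_x))            # input  -> hidden |
|  | +W = rng.normal(0, 0.1, (n_h, n_h))            # hidden -> hidden |
|  | +V = rng.normal(0, 0.1, (n_y, n_h))            # hidden -> output |
|  | +b, c = np.zeros(n_h), np.zeros(n_y) |
|  | +xs = rng.normal(size=(T, n_x))                # a toy input sequence |
|  | + |
|  | +def softmax(z): |
|  | +    e = np.exp(z - z.max()) |
|  | +    return e / e.sum() |
|  | + |
|  | +h = np.zeros(n_h)                             # h^{(0)} = 0 (assumed initial state) |
|  | +for t in range(T): |
|  | +    a    = U @ xs[t] + W @ h + b              # a^{(t)} |
|  | +    h    = np.tanh(a)                         # h^{(t)} = sigma_h(a^{(t)}) |
|  | +    y    = V @ h + c                          # y^{(t)} |
|  | +    yhat = softmax(y)                         # hat{y}^{(t)} = sigma_y(y^{(t)}) |
|  | +    print(t, yhat) |
|  | +</code></pre> |
|  | + |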
| 5180 | + |
| 5181 | +<!-- !split --> |
| 5182 | +<h2 id="back-propagation-in-time-through-figures-part-1" class="anchor">Back propagation in time through figures, part 1 </h2> |
| 5183 | + |
| 5184 | +<br/><br/> |
| 5185 | +<center> |
| 5186 | +<p><img src="figslides/RNN9.png" width="700" align="bottom"></p> |
| 5187 | +</center> |
| 5188 | +<br/><br/> |
| 5189 | + |
| 5190 | +<!-- !split --> |
| 5191 | +<h2 id="back-propagation-in-time-part-2" class="anchor">Back propagation in time, part 2 </h2> |
| 5192 | + |
| 5193 | +<br/><br/> |
| 5194 | +<center> |
| 5195 | +<p><img src="figslides/RNN10.png" width="700" align="bottom"></p> |
| 5196 | +</center> |
| 5197 | +<br/><br/> |
| 5198 | + |
| 5199 | +<!-- !split --> |
| 5200 | +<h2 id="back-propagation-in-time-part-3" class="anchor">Back propagation in time, part 3 </h2> |
| 5201 | + |
| 5202 | +<br/><br/> |
| 5203 | +<center> |
| 5204 | +<p><img src="figslides/RNN11.png" width="700" align="bottom"></p> |
| 5205 | +</center> |
| 5206 | +<br/><br/> |
| 5207 | + |
| 5208 | +<!-- !split --> |
| 5209 | +<h2 id="back-propagation-in-time-part-4" class="anchor">Back propagation in time, part 4 </h2> |
| 5210 | + |
| 5211 | +<br/><br/> |
| 5212 | +<center> |
| 5213 | +<p><img src="figslides/RNN12.png" width="700" align="bottom"></p> |
| 5214 | +</center> |
| 5215 | +<br/><br/> |
| 5216 | + |
| 5217 | +<!-- !split --> |
| 5218 | +<h2 id="back-propagation-in-time-in-equations" class="anchor">Back propagation in time in equations </h2> |
| 5219 | + |
| 5220 | +<p>To derive the expressions for the gradients of \( \mathcal{L} \) for |
| 5221 | +the RNN, we start recursively from the nodes closest to the |
| 5222 | +output layer in the temporal unrolling scheme, such as \( \mathbf{y} \) |
| 5223 | +and \( \mathbf{h} \) at the final time \( t = \tau \), |
| 5224 | +</p> |
| 5225 | + |
| 5226 | +$$ |
| 5227 | +\begin{align*} |
| 5228 | + (\nabla_{ \mathbf{y}^{(t)}} \mathcal{L})_{i} &= \frac{\partial \mathcal{L}}{\partial L^{(t)}}\frac{\partial L^{(t)}}{\partial y_{i}^{(t)}}, \notag\\ |
| 5229 | + \nabla_{\mathbf{h}^{(\tau)}} \mathcal{L} &= \mathbf{V}^\mathsf{T}\nabla_{ \mathbf{y}^{(\tau)}} \mathcal{L}. |
| 5230 | +\end{align*} |
| 5231 | +$$ |
| 5232 | + |
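|  | +<p>As a concrete special case (an assumption for illustration, not part of the general derivation): if \( \mathcal{L}=\sum_t L^{(t)} \), \( \sigma_y \) is a softmax and each \( L^{(t)} \) is the cross entropy with a one-hot target \( \mathbf{d}^{(t)} \) (notation introduced only for this example), the two expressions above reduce to</p> |
|  | + |
|  | +$$ |
|  | +\begin{align*} |
|  | + (\nabla_{ \mathbf{y}^{(t)}} \mathcal{L})_{i} &= \hat{y}_{i}^{(t)} - d_{i}^{(t)}, \notag\\ |
|  | + \nabla_{\mathbf{h}^{(\tau)}} \mathcal{L} &= \mathbf{V}^\mathsf{T}\left(\mathbf{\hat{y}}^{(\tau)} - \mathbf{d}^{(\tau)}\right). |
|  | +\end{align*} |
|  | +$$ |
|  | + |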
| 5233 | + |
| 5234 | +<!-- !split --> |
| 5235 | +<h2 id="chain-rule-again" class="anchor">Chain rule again </h2> |
| 5236 | +<p>For the earlier hidden nodes, with \( t < \tau \), we have to iterate backwards through time, and the chain rule gives </p> |
| 5237 | + |
| 5238 | +$$ |
| 5239 | +\begin{align*} |
| 5240 | + \nabla_{\mathbf{h}^{(t)}} \mathcal{L} &= \left(\frac{\partial\mathbf{h}^{(t+1)}}{\partial\mathbf{h}^{(t)}}\right)^\mathsf{T}\nabla_{\mathbf{h}^{(t+1)}}\mathcal{L} + \left(\frac{\partial\mathbf{y}^{(t)}}{\partial\mathbf{h}^{(t)}}\right)^\mathsf{T}\nabla_{ \mathbf{y}^{(t)}} \mathcal{L}. |
| 5241 | +\end{align*} |
| 5242 | +$$ |
| 5243 | + |
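|  | +<p>Writing out the two Jacobians with the definitions from the mathematical setup, \( \mathbf{h}^{(t+1)}=\sigma_h(\mathbf{a}^{(t+1)}) \) and \( \mathbf{y}^{(t)}=V\mathbf{h}^{(t)}+\mathbf{c} \), the recursion can be stated more explicitly as</p> |
|  | + |
|  | +$$ |
|  | +\begin{align*} |
|  | + \nabla_{\mathbf{h}^{(t)}} \mathcal{L} &= \mathbf{W}^\mathsf{T}\mathrm{diag}\left(\sigma_h'(\mathbf{a}^{(t+1)})\right)\nabla_{\mathbf{h}^{(t+1)}}\mathcal{L} + \mathbf{V}^\mathsf{T}\nabla_{ \mathbf{y}^{(t)}} \mathcal{L}, |
|  | +\end{align*} |
|  | +$$ |
|  | + |
|  | +<p>which shows that the gradient is multiplied by \( \mathbf{W}^\mathsf{T} \) and by the activation derivatives once per time step, the mechanism behind the exploding and vanishing gradients discussed above.</p> |
|  | + |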
| 5244 | + |
| 5245 | +<!-- !split --> |
| 5246 | +<h2 id="gradients-of-loss-functions" class="anchor">Gradients of loss functions </h2> |
| 5247 | +<p>Similarly, the gradients of \( \mathcal{L} \) with respect to the weights and biases follow,</p> |
| 5248 | + |
| 5249 | +$$ |
| 5250 | +\begin{align*} |
| 5251 | + \nabla_{\mathbf{c}} \mathcal{L} &=\sum_{t}\left(\frac{\partial \mathbf{y}^{(t)}}{\partial \mathbf{c}}\right)^\mathsf{T} \nabla_{\mathbf{y}^{(t)}} \mathcal{L} \notag\\ |
| 5252 | + \nabla_{\mathbf{b}} \mathcal{L} &=\sum_{t}\left(\frac{\partial \mathbf{h}^{(t)}}{\partial \mathbf{b}}\right)^\mathsf{T} \nabla_{\mathbf{h}^{(t)}} \mathcal{L} \notag\\ |
| 5253 | + \nabla_{\mathbf{V}} \mathcal{L} &=\sum_{t}\sum_{i}\left(\frac{\partial \mathcal{L}}{\partial y_i^{(t)} }\right)\nabla_{\mathbf{V}^{(t)}}y_i^{(t)} \notag\\ |
| 5254 | +  \nabla_{\mathbf{W}} \mathcal{L} &=\sum_{t}\sum_{i}\left(\frac{\partial \mathcal{L}}{\partial h_i^{(t)}}\right)\nabla_{\mathbf{W}^{(t)}} h_i^{(t)} \notag\\ |
| 5255 | + \nabla_{\mathbf{U}} \mathcal{L} &=\sum_{t}\sum_{i}\left(\frac{\partial \mathcal{L}}{\partial h_i^{(t)}}\right)\nabla_{\mathbf{U}^{(t)}}h_i^{(t)}. |
| 5256 | + \label{eq:rnn_gradients3} |
| 5257 | +\end{align*} |
| 5258 | +$$ |
| 5259 | + |
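|  | +<p>Putting the recursions together, here is a minimal NumPy sketch of one full backward pass through time that accumulates the five gradients above. The toy dimensions, the random data, and the choices \( \sigma_h=\tanh \) with a softmax/cross-entropy output are assumptions made for this sketch only.</p> |
|  | +<pre><code class="language-python">import numpy as np |
|  | + |
|  | +rng = np.random.default_rng(0) |
|  | +n_x, n_h, n_y, T = 3, 4, 2, 5           # assumed toy sizes |
|  | +U = rng.normal(0, 0.1, (n_h, n_x)) |
|  | +W = rng.normal(0, 0.1, (n_h, n_h)) |
|  | +V = rng.normal(0, 0.1, (n_y, n_h)) |
|  | +b, c = np.zeros(n_h), np.zeros(n_y) |
|  | +xs = rng.normal(size=(T, n_x)) |
|  | +targets = rng.integers(0, n_y, size=T)  # one class label per step (assumed) |
|  | + |
|  | +def softmax(z): |
|  | +    e = np.exp(z - z.max()) |
|  | +    return e / e.sum() |
|  | + |
|  | +# forward pass: store the hidden states and predictions for every step |
|  | +hs = {-1: np.zeros(n_h)}                # h^{(-1)} = 0 |
|  | +yhats = {} |
|  | +for t in range(T): |
|  | +    a = U @ xs[t] + W @ hs[t - 1] + b |
|  | +    hs[t] = np.tanh(a) |
|  | +    yhats[t] = softmax(V @ hs[t] + c) |
|  | + |
|  | +# backward pass: accumulate the gradients over all time steps |
|  | +dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V) |
|  | +db, dc = np.zeros_like(b), np.zeros_like(c) |
|  | +dh_next = np.zeros(n_h)                 # gradient flowing in from step t+1 |
|  | +for t in reversed(range(T)): |
|  | +    dy = yhats[t].copy() |
|  | +    dy[targets[t]] -= 1.0               # softmax + cross-entropy gradient |
|  | +    dV += np.outer(dy, hs[t]) |
|  | +    dc += dy |
|  | +    dh = V.T @ dy + dh_next             # chain rule: output path + future path |
|  | +    da = (1.0 - hs[t] ** 2) * dh        # tanh'(a) = 1 - h^2 |
|  | +    dU += np.outer(da, xs[t]) |
|  | +    dW += np.outer(da, hs[t - 1]) |
|  | +    db += da |
|  | +    dh_next = W.T @ da                  # future-path term passed to step t-1 |
|  | + |
|  | +print(dW) |
|  | +</code></pre> |
|  | + |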
| 5260 | + |
| 5261 | +<!-- !split --> |
| 5262 | +<h2 id="summary-of-rnns" class="anchor">Summary of RNNs </h2> |
| 5263 | + |
| 5264 | +<p>Recurrent neural networks (RNNs) have, in general, no probabilistic component |
| 5265 | +in the model. With a given fixed input and target from the data, an RNN learns the intermediate |
| 5266 | +associations between the various layers. |
| 5267 | +The inputs, outputs, and internal representation (hidden states) are all |
| 5268 | +real-valued vectors. |
| 5269 | +</p> |
| 5270 | + |
| 5271 | +<p>In a traditional NN, the inputs are assumed to be |
| 5272 | +independent of each other. With sequential data, however, the input at a given stage \( t \) depends on the input from the previous stage \( t-1 \). |
| 5273 | +</p> |
| 5274 | + |
| 5275 | +<!-- !split --> |
| 5276 | +<h2 id="summary-of-a-typical-rnn" class="anchor">Summary of a typical RNN </h2> |
| 5277 | + |
| 5278 | +<ol> |
| 5279 | +<li> Weight matrices \( U \), \( W \) and \( V \): \( U \) connects the input at stage \( t \) to the hidden layer \( h_t \), \( W \) connects the previous hidden layer \( h_{t-1} \) to \( h_t \), and \( V \) connects \( h_t \) to the output layer at the same stage, producing the output \( \tilde{y}_t \) (see the shape check after this summary).</li> |
| 5280 | +<li> The output from the hidden layer \( h_t \) is often modulated by a \( \tanh{} \) function, \( h_t=\sigma_h(x_t,h_{t-1})=\tanh{(Ux_t+Wh_{t-1}+b)} \), with \( b \) a bias vector.</li> |
| 5281 | +<li> The output from the hidden layer produces \( \tilde{y}_t=\sigma_y(Vh_t+c) \), where \( c \) is a new bias parameter.</li> |
| 5282 | +<li> The output from the training at a given stage is in turn compared with the observation \( y_t \) through a chosen cost function.</li> |
| 5283 | +</ol> |
| 5284 | +<p>The activation functions \( \sigma_h \) and \( \sigma_y \) can be any of the standard choices, such as a Sigmoid, a Softmax or a ReLU. |
| 5285 | +The parameters are trained through the so-called back-propagation through time (BPTT) algorithm. |
| 5286 | +</p> |
| 5287 | + |
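|  | +<p>As a quick check of the roles and shapes of the three weight matrices in point 1, here is a single time step in NumPy; the dimensions below are illustrative assumptions.</p> |
|  | +<pre><code class="language-python">import numpy as np |
|  | + |
|  | +n_x, n_h, n_y = 3, 5, 2                     # assumed toy dimensions |
|  | +U = np.zeros((n_h, n_x))                    # input  x_t     -> hidden h_t |
|  | +W = np.zeros((n_h, n_h))                    # hidden h_{t-1} -> hidden h_t |
|  | +V = np.zeros((n_y, n_h))                    # hidden h_t     -> output |
|  | +b, c = np.zeros(n_h), np.zeros(n_y) |
|  | + |
|  | +x_t, h_prev = np.ones(n_x), np.zeros(n_h) |
|  | +h_t = np.tanh(U @ x_t + W @ h_prev + b)     # point 2 of the summary |
|  | +y_t = V @ h_t + c                           # point 3, before applying sigma_y |
|  | +assert h_t.shape == (n_h,) and y_t.shape == (n_y,) |
|  | +</code></pre> |
|  | + |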
| 5288 | +<!-- !split --> |
| 5289 | +<h2 id="four-effective-ways-to-learn-an-rnn-and-preparing-for-next-week" class="anchor">Four effective ways to learn an RNN and preparing for next week </h2> |
| 5290 | +<ol> |
| 5291 | +<li> Long Short-Term Memory (LSTM): make the RNN out of little modules that are designed to remember values for a long time (a minimal cell sketch follows this list).</li> |
| 5292 | +<li> Hessian-Free Optimization: deal with the vanishing-gradient problem by using a fancy optimizer that can detect directions with a tiny gradient but even smaller curvature.</li> |
| 5293 | +<li> Echo State Networks: initialize the input-to-hidden, hidden-to-hidden and output-to-hidden connections very carefully so that the hidden state has a huge reservoir of weakly coupled oscillators which can be selectively driven by the input. |
| 5294 | +<ul> |
| 5295 | +  <li> ESNs only need to learn the hidden-to-output connections.</li> |
| 5296 | +</ul></li> |
| 5297 | +<li> Good initialization with momentum: initialize like in Echo State Networks, but then learn all of the connections using momentum.</li> |
| 5298 | +</ol> |
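|  | +<p>As a preview of next week, here is a minimal sketch of the LSTM module mentioned in point 1. It is not the full training algorithm, and the gate layout, dimensions and initialization below are assumptions made only for this illustration.</p> |
|  | +<pre><code class="language-python">import numpy as np |
|  | + |
|  | +def sigmoid(z): |
|  | +    return 1.0 / (1.0 + np.exp(-z)) |
|  | + |
|  | +def lstm_step(x, h_prev, c_prev, Wf, Wi, Wo, Wg, bf, bi, bo, bg): |
|  | +    """One LSTM step on the concatenated vector z = [h_prev, x] (assumed layout).""" |
|  | +    z = np.concatenate([h_prev, x]) |
|  | +    f = sigmoid(Wf @ z + bf)          # forget gate: what to keep in the cell |
|  | +    i = sigmoid(Wi @ z + bi)          # input gate: what to write |
|  | +    o = sigmoid(Wo @ z + bo)          # output gate: what to expose |
|  | +    g = np.tanh(Wg @ z + bg)          # candidate cell update |
|  | +    c = f * c_prev + i * g            # cell state: additive update |
|  | +    h = o * np.tanh(c)                # hidden state passed to the next step |
|  | +    return h, c |
|  | + |
|  | +# toy usage with assumed sizes |
|  | +n_x, n_h = 3, 4 |
|  | +rng = np.random.default_rng(1) |
|  | +Ws = [rng.normal(0, 0.1, (n_h, n_h + n_x)) for _ in range(4)] |
|  | +bs = [np.zeros(n_h) for _ in range(4)] |
|  | +h, c = np.zeros(n_h), np.zeros(n_h) |
|  | +for x in rng.normal(size=(5, n_x)): |
|  | +    h, c = lstm_step(x, h, c, *Ws, *bs) |
|  | +print(h) |
|  | +</code></pre> |
|  | +<p>The additive update of the cell state \( c \) is what lets gradients survive over many time steps, which is exactly the difficulty identified for plain RNNs above.</p> |
|  | + |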
5052 | 5299 | <!-- ------------------- end of main content --------------- --> |
5053 | 5300 | </div> <!-- end container --> |
5054 | 5301 | <!-- include javascript, jQuery *first* --> |
|