
Commit 0b0b0cd

lstm additions
1 parent 915db6e commit 0b0b0cd


8 files changed: +131 -1382 lines changed


doc/pub/week8/html/week8-bs.html

Lines changed: 1 addition & 208 deletions
@@ -63,31 +63,6 @@
('Input gate', 2, None, 'input-gate'),
('Forget and input', 2, None, 'forget-and-input'),
('Output gate', 2, None, 'output-gate'),
-('Example: Solving Differential equations',
-2,
-None,
-'example-solving-differential-equations'),
-('Lorenz attractor', 2, None, 'lorenz-attractor'),
-('Generating data', 2, None, 'generating-data'),
-('Training and testing', 2, None, 'training-and-testing'),
-('Computationally expensive',
-2,
-None,
-'computationally-expensive'),
-('Choice of training data', 2, None, 'choice-of-training-data'),
-('Cost/Loss function', 2, None, 'cost-loss-function'),
-('Modifying the cost/loss function, adding more info',
-2,
-None,
-'modifying-the-cost-loss-function-adding-more-info'),
-('Changing the function to optimize',
-2,
-None,
-'changing-the-function-to-optimize'),
-('Adding more information to the loss function',
-2,
-None,
-'adding-more-information-to-the-loss-function'),
('Autoencoders: Overarching view',
2,
None,
@@ -206,16 +181,6 @@
<!-- navigation toc: --> <li><a href="#input-gate" style="font-size: 80%;">Input gate</a></li>
<!-- navigation toc: --> <li><a href="#forget-and-input" style="font-size: 80%;">Forget and input</a></li>
<!-- navigation toc: --> <li><a href="#output-gate" style="font-size: 80%;">Output gate</a></li>
-<!-- navigation toc: --> <li><a href="#example-solving-differential-equations" style="font-size: 80%;">Example: Solving Differential equations</a></li>
-<!-- navigation toc: --> <li><a href="#lorenz-attractor" style="font-size: 80%;">Lorenz attractor</a></li>
-<!-- navigation toc: --> <li><a href="#generating-data" style="font-size: 80%;">Generating data</a></li>
-<!-- navigation toc: --> <li><a href="#training-and-testing" style="font-size: 80%;">Training and testing</a></li>
-<!-- navigation toc: --> <li><a href="#computationally-expensive" style="font-size: 80%;">Computationally expensive</a></li>
-<!-- navigation toc: --> <li><a href="#choice-of-training-data" style="font-size: 80%;">Choice of training data</a></li>
-<!-- navigation toc: --> <li><a href="#cost-loss-function" style="font-size: 80%;">Cost/Loss function</a></li>
-<!-- navigation toc: --> <li><a href="#modifying-the-cost-loss-function-adding-more-info" style="font-size: 80%;">Modifying the cost/loss function, adding more info</a></li>
-<!-- navigation toc: --> <li><a href="#changing-the-function-to-optimize" style="font-size: 80%;">Changing the function to optimize</a></li>
-<!-- navigation toc: --> <li><a href="#adding-more-information-to-the-loss-function" style="font-size: 80%;">Adding more information to the loss function</a></li>
<!-- navigation toc: --> <li><a href="#autoencoders-overarching-view" style="font-size: 80%;">Autoencoders: Overarching view</a></li>
<!-- navigation toc: --> <li><a href="#powerful-detectors" style="font-size: 80%;">Powerful detectors</a></li>
<!-- navigation toc: --> <li><a href="#first-introduction-of-aes" style="font-size: 80%;">First introduction of AEs</a></li>
@@ -294,7 +259,6 @@ <h2 id="plans-for-the-week-march-10-14" class="anchor">Plans for the week March
<!-- subsequent paragraphs come in larger fonts, so start with a paragraph -->
<ol>
<li> RNNs and discussion of Long Short-Term Memory</li>
-<li> Example of application of RNNs to differential equations</li>
<li> Start discussion of Autoencoders (AEs)</li>
<li> Links between Principal Component Analysis (PCA) and AE
<!-- o Discussion of specific examples relevant for project 1, <a href="https://github.com/CompPhysics/AdvancedMachineLearning/blob/main/doc/Projects/2023/ProjectExamples/RNNs.pdf" target="_self">see project from last year by Daniel and Keran</a> --></li>
@@ -312,7 +276,7 @@ <h2 id="reading-recommendations-rnns-and-lstms" class="anchor">Reading recommend
<ol>
<li> For RNNs see Goodfellow et al chapter 10, see <a href="https://www.deeplearningbook.org/contents/rnn.html" target="_self"><tt>https://www.deeplearningbook.org/contents/rnn.html</tt></a></li>
<li> Reading suggestions for implementation of RNNs in PyTorch: Raschka et al's text, chapter 15</li>
-<li> RNN video at URL":https://youtu.be/PCgrgHgy26c?feature=shared"</li>
+<li> RNN video at <a href="https://youtu.be/PCgrgHgy26c?feature=shared" target="_self"><tt>https://youtu.be/PCgrgHgy26c?feature=shared</tt></a></li>
<li> New xLSTM, see Beck et al <a href="https://arxiv.org/abs/2405.04517" target="_self"><tt>https://arxiv.org/abs/2405.04517</tt></a>. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.</li>
</ol>
</div>
@@ -484,177 +448,6 @@ <h2 id="output-gate" class="anchor">Output gate </h2>

<p>where \( \mathbf{W_o,U_o} \) are the weights of the output gate and \( \mathbf{b_o} \) is the bias of the output gate.</p>

-<!-- !split -->
-<h2 id="example-solving-differential-equations" class="anchor">Example: Solving Differential equations </h2>
-
-<p>The dynamics of a stable spiral evolve in such a way that the system's
-trajectory converges to a fixed point while spiraling inward. These
-oscillations around the fixed point are gradually damped until the
-system reaches a steady state. Suppose we have a
-two-dimensional system of coupled differential equations of the form
-</p>
-$$
-\begin{align*}
-\frac{dx}{dt} &= ax + by \notag,\\
-\frac{dy}{dt} &= cx + dy.
-\end{align*}
-$$
-
-<p>The choice of \( a,b,c,d \in \mathbb{R} \) completely determines the
-behavior of the solution, and for some of these values, albeit not
-all, the system is said to be a stable spiral. This condition is
-satisfied when the eigenvalues of the matrix formed by the
-coefficients are complex conjugates with a negative real part.
-</p>
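As a minimal sketch (not part of the original notes), this eigenvalue condition can be checked directly with NumPy; the parameter values below are illustrative choices that satisfy the stable-spiral condition, not values taken from the notes.

import numpy as np

# Stable-spiral condition: the eigenvalues of the coefficient matrix should
# form a complex-conjugate pair with negative real part.
# a, b, c, d are illustrative values, not taken from the notes.
a, b, c, d = -0.2, -1.0, 1.0, -0.2
A = np.array([[a, b],
              [c, d]])
eigvals = np.linalg.eigvals(A)
is_stable_spiral = np.all(eigvals.real < 0) and np.any(eigvals.imag != 0)
print(eigvals, is_stable_spiral)   # eigenvalues -0.2 +/- 1j, so the condition holds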
-
-<!-- !split -->
-<h2 id="lorenz-attractor" class="anchor">Lorenz attractor </h2>
-
-<p>A Lorenz attractor presents some added complexity. It exhibits what is called chaotic
-behavior and is extremely sensitive to initial conditions.
-</p>
-
-<p>The expression for the Lorenz attractor evolution consists of a set of three coupled nonlinear differential equations given by</p>
-
-$$
-\begin{align*}
-\frac{dx}{dt} &= \sigma (y-x), \notag\\
-\frac{dy}{dt} &= x(\rho -z) - y, \notag\\
-\frac{dz}{dt} &= xy - \beta z.
-\end{align*}
-$$
-
-<p>For this problem, \( (x,y,z) \) are the variables that determine the state
-of the system in space, while \( \sigma, \rho \) and \( \beta \) are,
-similarly to the constants \( a,b,c,d \) of the stable spiral, parameters
-that largely influence how the system evolves.
-</p>
-
-<!-- !split -->
-<h2 id="generating-data" class="anchor">Generating data </h2>
-
-<p>Both of the above-mentioned systems are governed by differential
-equations, and as such, they can be solved numerically through some
-integration scheme such as forward-Euler or fourth-order
-Runge-Kutta.
-</p>
-
-<p>We use the common choice of parameters \( \sigma =10 \), \( \rho =28 \),
-\( \beta =8/3 \). This choice generates complex and aesthetic
-trajectories that have been extensively investigated and benchmarked
-in the literature of numerical simulations.
-</p>
-
-<p>For the stable spiral, we employ \( a = 0.2 \), \( b = -1.0 \), \( c = 1.0 \), \( d = 0.2 \).
-This gives a good number of oscillations before reaching a steady state.
-</p>
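A sketch of the data generation described above, assuming NumPy and a hand-written fourth-order Runge-Kutta step for the Lorenz system with the quoted parameters; the step size and initial condition are illustrative choices, not the notes' settings.

import numpy as np

# Lorenz right-hand side with the parameters quoted above.
sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0

def lorenz(state):
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def rk4_step(f, state, dt):
    # classical fourth-order Runge-Kutta step
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

dt, n_steps = 0.01, 800                 # illustrative step size; 800 points per trajectory
state = np.array([1.0, 1.0, 1.0])       # illustrative initial condition
trajectory = np.empty((n_steps, 3))
for i in range(n_steps):
    trajectory[i] = state
    state = rk4_step(lorenz, state, dt)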
-
-<!-- !split -->
-<h2 id="training-and-testing" class="anchor">Training and testing </h2>
-
-<p>Training and testing procedures in recurrent neural networks follow
-what is usual for regular FNNs, but some special consideration needs
-to be taken into account due to the sequential character of the
-data. <b>Training and testing batches must not be randomly shuffled</b>, since
-shuffling would decorrelate the time-series points and leak future
-information into present or past points of the model.
-</p>
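A minimal sketch of this point, assuming PyTorch: windows are built in time order and the split is a plain chronological cut, with shuffle=False in the DataLoader so that no future information leaks backwards. The shapes and window length are illustrative.

import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

trajectory = np.random.randn(800, 3).astype(np.float32)   # stand-in for a simulated trajectory
window = 20                                                # illustrative window length

# (input window, next point) pairs, kept in chronological order
X = np.stack([trajectory[i:i + window] for i in range(len(trajectory) - window)])
y = trajectory[window:]

split = int(0.8 * len(X))                                  # chronological cut, no random shuffling
train_ds = TensorDataset(torch.from_numpy(X[:split]), torch.from_numpy(y[:split]))
test_ds = TensorDataset(torch.from_numpy(X[split:]), torch.from_numpy(y[split:]))
train_loader = DataLoader(train_ds, batch_size=32, shuffle=False)  # keep the time ordering
test_loader = DataLoader(test_ds, batch_size=32, shuffle=False)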
-
-<!-- !split -->
-<h2 id="computationally-expensive" class="anchor">Computationally expensive </h2>
-
-<p>The training algorithm can become computationally
-costly, especially if the losses are evaluated for all previous time
-steps. While other architectures such as that of LSTMs can be used to
-mitigate this, it is also possible to introduce another hyperparameter
-that controls how much of the network is unfolded
-in the training process, adjusting how much the network remembers
-from previous points in time. Similarly, the number of steps the network
-predicts into the future per iteration greatly influences the assessment of the
-loss function.
-</p>
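The two knobs mentioned here can be made explicit in the window construction; in the small sketch below, lookback (how far back the network is unfolded) and horizon (how many future steps are predicted per sample) are hypothetical parameter names, not the notes' own.

import numpy as np

def make_windows(trajectory, lookback, horizon):
    # lookback: how many past points each training sample contains
    # horizon: how many future points the network predicts per sample
    X, Y = [], []
    for i in range(len(trajectory) - lookback - horizon + 1):
        X.append(trajectory[i:i + lookback])
        Y.append(trajectory[i + lookback:i + lookback + horizon])
    return np.array(X), np.array(Y)

trajectory = np.random.randn(800, 3)        # stand-in for a simulated trajectory
X, Y = make_windows(trajectory, lookback=50, horizon=5)
print(X.shape, Y.shape)                     # (746, 50, 3) (746, 5, 3)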
-
-<!-- !split -->
-<h2 id="choice-of-training-data" class="anchor">Choice of training data </h2>
-
-<p>The training and testing batches were separated into whole
-trajectories. This means that instead of training and testing on
-different fractions of the same trajectory, all trajectories that were
-tested had completely new initial conditions. In this sense, from a
-total of 10 initial conditions (independent trajectories), 9 were used
-for training and 1 for testing. Each trajectory consisted of 800
-points in each space coordinate.
-</p>
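A short sketch of this trajectory-level split; random data stands in for the 10 simulated trajectories of 800 points each.

import numpy as np

trajectories = np.random.randn(10, 800, 3)   # stand-in for 10 independent trajectories
train_trajs = trajectories[:9]               # 9 trajectories used for training
test_traj = trajectories[9]                  # 1 trajectory with unseen initial conditions for testing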
-
-<!-- !split -->
-<h2 id="cost-loss-function" class="anchor">Cost/Loss function </h2>
-
-<p>The problem we have is a time-series forecasting problem, so we are
-free to choose the loss function among the large collection of
-regression losses. Using the mean-squared error of the predicted
-versus actual trajectories of the dynamical systems is a natural choice.
-</p>
-
-<p>It is a convex function of the predictions, so, given sufficient time and
-appropriate learning rates, gradient-based optimization behaves well
-irrespective of the weights' random initializations.
-</p>
-
-$$
-\begin{align}
-\mathcal{L}_{MSE} = \frac{1}{N}\sum_{i=1}^N (y(\mathbf{x}_i) - \hat{y}(\mathbf{x}_i, \mathbf{\theta}))^2
-\label{_auto1}
-\end{align}
-$$
-
-<p>where \( \mathbf{\theta} \) represents the set of all parameters of the network, and \( \mathbf{x}_i \) are the input values.</p>
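In code the MSE above is a one-liner; a NumPy sketch with placeholder arrays:

import numpy as np

def mse(y_true, y_pred):
    # mean-squared error between true and predicted trajectory points
    return np.mean((y_true - y_pred) ** 2)

y_true = np.random.randn(100, 3)                  # placeholder true trajectory points
y_pred = y_true + 0.1 * np.random.randn(100, 3)   # placeholder predictions
print(mse(y_true, y_pred))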
-
-<!-- !split -->
-<h2 id="modifying-the-cost-loss-function-adding-more-info" class="anchor">Modifying the cost/loss function, adding more info </h2>
-
-<p>A cost/loss function that is based only on the observational and
-predicted data is normally referred to as a purely data-driven approach.
-</p>
-
-<p>While this is a
-well-established way of assessing regressions, it does not make use of
-other intuitions we might have about the problem we are trying to
-solve. At the same time, it is a well-established fact that neural
-network models are data-greedy: they need large amounts of data to be
-able to generalize predictions outside the training set. One way to
-mitigate this is to use physics-informed neural networks
-(PINNs) when possible.
-</p>
-
-<!-- !split -->
-<h2 id="changing-the-function-to-optimize" class="anchor">Changing the function to optimize </h2>
-
-<p>To improve the performance of our model beyond the training set,
-PINNs add physics-informed penalties to the loss function. In
-essence, this means that we assign a worse evaluation score to
-predictions that do not respect physical laws we think our real data
-should obey. This procedure often has the advantage of trimming the
-parameter space without adding bias to the model if the constraints
-imposed are correct, but the choice of the physical laws can be a
-delicate one.
-</p>
-
-<!-- !split -->
-<h2 id="adding-more-information-to-the-loss-function" class="anchor">Adding more information to the loss function </h2>
-
-<p>A general way of expressing this added penalty to the loss function is shown here:</p>
-$$
-\begin{align*}
-\mathcal{L} = w_{MSE}\mathcal{L}_{MSE} + w_{PI}\mathcal{L}_{PI}.
-\end{align*}
-$$
-
-<p>Here, the weights \( w_{MSE} \) and \( w_{PI} \) explicitly mediate how much
-each part of the total loss function contributes.
-</p>
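A sketch of this combined loss for the Lorenz case, assuming PyTorch: the physics term penalizes predicted trajectories whose finite-difference time derivative deviates from the Lorenz right-hand side. The residual construction, dt and the weight values are illustrative choices, not the notes' implementation.

import torch

sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0

def lorenz_rhs(traj):
    # Lorenz right-hand side evaluated along a trajectory of shape (T, 3)
    x, y, z = traj[..., 0], traj[..., 1], traj[..., 2]
    return torch.stack([sigma * (y - x), x * (rho - z) - y, x * y - beta * z], dim=-1)

def total_loss(pred, target, dt=0.01, w_mse=1.0, w_pi=0.1):
    l_mse = torch.mean((pred - target) ** 2)                    # data-driven term
    d_pred = (pred[1:] - pred[:-1]) / dt                        # finite-difference time derivative
    l_pi = torch.mean((d_pred - lorenz_rhs(pred[:-1])) ** 2)    # physics-informed penalty
    return w_mse * l_mse + w_pi * l_pi

pred = torch.randn(100, 3, requires_grad=True)   # placeholder network output
target = torch.randn(100, 3)                     # placeholder reference trajectory
loss = total_loss(pred, target)
loss.backward()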
-
<!-- !split -->
<h2 id="autoencoders-overarching-view" class="anchor">Autoencoders: Overarching view </h2>
