and continue till we have solved all $n$ sets of linear equations.

!split
===== Basic math of the SVD =====


From standard linear algebra we know that a square matrix $\bm{X}$ can be diagonalized by a unitary (orthogonal) transformation if and only if it is
a so-called "normal matrix":"https://en.wikipedia.org/wiki/Normal_matrix", that is, if $\bm{X}\in {\mathbb{R}}^{n\times n}$
we have $\bm{X}\bm{X}^T=\bm{X}^T\bm{X}$, or if $\bm{X}\in {\mathbb{C}}^{n\times n}$ we have $\bm{X}\bm{X}^{\dagger}=\bm{X}^{\dagger}\bm{X}$.
The matrix then has a set of eigenpairs

!bt
\[
(\lambda_1,\bm{u}_1),\dots, (\lambda_n,\bm{u}_n),
\]
!et
and the eigenvalues are given by the diagonal matrix
!bt
\[
\bm{\Sigma}=\mathrm{Diag}(\lambda_1, \dots,\lambda_n).
\]
!et
The matrix $\bm{X}$ can then be written in terms of an orthogonal/unitary transformation $\bm{U}$ as
!bt
\[
\bm{X} = \bm{U}\bm{\Sigma}\bm{U}^T,
\]
!et
with $\bm{U}\bm{U}^T=\bm{I}$ or $\bm{U}\bm{U}^{\dagger}=\bm{I}$.

Not all square matrices are diagonalizable. A matrix like the one discussed above,
!bt
\[
\bm{X} = \begin{bmatrix}
1& -1 \\
1& -1\\
\end{bmatrix}
\]
!et
is not diagonalizable; it is a so-called "defective matrix":"https://en.wikipedia.org/wiki/Defective_matrix". It is easy to see that the condition
$\bm{X}\bm{X}^T=\bm{X}^T\bm{X}$ is not fulfilled.
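
Indeed, a direct computation for this matrix shows that the two products differ,

!bt
\[
\bm{X}\bm{X}^T = \begin{bmatrix} 2& 2 \\ 2& 2\\ \end{bmatrix}
\ne
\begin{bmatrix} 2& -2 \\ -2& 2\\ \end{bmatrix} = \bm{X}^T\bm{X},
\]
!et
so the matrix is not normal.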


!split
===== The SVD, a Fantastic Algorithm =====


However, and this is the strength of the SVD algorithm, any general
matrix $\bm{X}$ can be decomposed in terms of a diagonal matrix and
two orthogonal/unitary matrices. The "Singular Value Decomposition
(SVD) theorem":"https://en.wikipedia.org/wiki/Singular_value_decomposition"
states that a general $m\times n$ matrix $\bm{X}$ can be written in
terms of a diagonal matrix $\bm{\Sigma}$ of dimensionality $m\times n$
and two orthogonal matrices $\bm{U}$ and $\bm{V}$, where the first has
dimensionality $m \times m$ and the second dimensionality $n\times n$.
We then have

!bt
\[
\bm{X} = \bm{U}\bm{\Sigma}\bm{V}^T.
\]
!et

As an example, the above defective matrix can be decomposed as

!bt
\[
\bm{X} = \frac{1}{\sqrt{2}}\begin{bmatrix}  1& 1 \\ 1& -1\\ \end{bmatrix} \begin{bmatrix}  2& 0 \\ 0& 0\\ \end{bmatrix}    \frac{1}{\sqrt{2}}\begin{bmatrix}  1& -1 \\ 1& 1\\ \end{bmatrix}=\bm{U}\bm{\Sigma}\bm{V}^T,
\]
!et

with singular values $\sigma_1=2$ and $\sigma_2=0$.
The SVD always exists!

The SVD
decomposition yields singular values ordered as
$\sigma_i\geq\sigma_{i+1}$ for all $i$; for indices larger than $i=p$, the
singular values are zero.

In the general case, where our design matrix $\bm{X}$ has dimension
$n\times p$, the matrix is thus decomposed into an $n\times n$
orthogonal matrix $\bm{U}$, a $p\times p$ orthogonal matrix $\bm{V}$
and a diagonal matrix $\bm{\Sigma}$ with $r=\mathrm{min}(n,p)$
singular values $\sigma_i\geq 0$ on the main diagonal and zeros filling
the rest of the matrix. There are at most $p$ singular values
assuming that $n > p$. In our regression examples for the nuclear
masses and the equation of state this is indeed the case, while for
the Ising model we have $p > n$. These are often cases that lead to
near singular or singular matrices.
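
A small sketch (with an assumed random matrix, purely for illustration) shows why the case $p > n$ is problematic: $\bm{X}^T\bm{X}$ is then a $p\times p$ matrix of rank at most $n < p$, and hence singular.

!bc pycod
import numpy as np
# Assumed random data: when p > n the matrix X^T X is rank deficient
np.random.seed(0)
n, p = 5, 10                       # fewer data points than features
X = np.random.randn(n, p)
XTX = X.T @ X                      # p x p matrix
print(np.linalg.matrix_rank(XTX))  # at most n = 5, hence singular
U, S, VT = np.linalg.svd(X)
print(S)                           # only min(n, p) = 5 singular values
!ec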

The columns of $\bm{U}$ are called the left singular vectors while the columns of $\bm{V}$ are the right singular vectors.

!split
===== Economy-size SVD =====

If we assume that $n > p$, then our matrix $\bm{U}$ has dimension $n
\times n$. The last $n-p$ columns of $\bm{U}$ then become
irrelevant in our calculations since they are multiplied with the
zeros in $\bm{\Sigma}$.

The economy-size decomposition removes extra rows or columns of zeros
from the diagonal matrix of singular values, $\bm{\Sigma}$, along with the columns
in either $\bm{U}$ or $\bm{V}$ that multiply those zeros in the expression.
Removing these zeros and columns can improve execution time
and reduce storage requirements without compromising the accuracy of
the decomposition.

If $n > p$, we keep only the first $p$ columns of $\bm{U}$ and $\bm{\Sigma}$ has dimension $p\times p$.
If $p > n$, then only the first $n$ columns of $\bm{V}$ are computed and $\bm{\Sigma}$ has dimension $n\times n$.
The $n=p$ case is obvious; there we retain the full SVD.
In general the economy-size SVD leads to fewer floating-point operations while preserving the desired accuracy.
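
As a minimal sketch (the small matrix below is assumed purely for illustration), NumPy computes the economy-size SVD when we switch off the full-matrices option:

!bc pycod
import numpy as np
# Compare the full and the economy-size SVD for a tall matrix (n > p)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])    # n = 3, p = 2
U, S, VT = np.linalg.svd(X, full_matrices=True)
print(U.shape, S.shape, VT.shape)      # (3, 3) (2,) (2, 2)
Ue, Se, VTe = np.linalg.svd(X, full_matrices=False)
print(Ue.shape, Se.shape, VTe.shape)   # (3, 2) (2,) (2, 2)
# Both variants reconstruct X to numerical precision
print(np.allclose(X, Ue @ np.diag(Se) @ VTe))
!ec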

!split
===== Codes for the SVD =====

!bc pycod
import numpy as np
# Decompose a matrix with the SVD and reconstruct it from the factors
def SVD(A):
    ''' Takes as input a numpy matrix A and returns the reconstruction
    U @ D @ VT obtained from its singular value decomposition (SVD).
    SVD-based methods are numerically more stable than the inversion
    algorithms provided by numpy and scipy.linalg, at the cost of being slower.
    '''
    U, S, VT = np.linalg.svd(A, full_matrices=True)
    # Check the orthogonality of U and VT (both differences should vanish)
    print('test U')
    print( (np.transpose(U) @ U - U @ np.transpose(U)))
    print('test VT')
    print( (np.transpose(VT) @ VT - VT @ np.transpose(VT)))
    print(U)
    print(S)
    print(VT)

    # S is returned as a vector; embed it in a diagonal matrix of matching shape
    D = np.zeros((len(U), len(VT)))
    for i in range(0, len(S)):
        D[i, i] = S[i]
    return U @ D @ VT


X = np.array([ [1.0, -1.0], [1.0, -1.0]])
#X = np.array([[1, 2], [3, 4], [5, 6]])

print(X)
C = SVD(X)
# Print the difference between the original matrix and the reconstructed one
print(C - X)
!ec

The matrix $\bm{X}$ in this example has linearly dependent columns: the second
column is simply $-1$ times the first. The rank of a
matrix (the column rank) is the dimension of the space spanned by its
column vectors, that is, the number of linearly
independent columns, in this case just $1$. We see this from the
singular values when running the above code: one of them is zero. Running the standard
inversion algorithm for matrix inversion on $\bm{X}^T\bm{X}$ results
in the program terminating due to a singular matrix.
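
A short check (a sketch, using the same matrix as in the code above) makes both points explicit:

!bc pycod
import numpy as np
X = np.array([[1.0, -1.0], [1.0, -1.0]])
print(np.linalg.matrix_rank(X))    # 1, since one singular value is zero
try:
    np.linalg.inv(X.T @ X)         # X^T X is exactly singular here
except np.linalg.LinAlgError as err:
    print("Inversion fails:", err)
!ec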


!split
===== Note about SVD Calculations =====

The $U$, $S$, and $V$ matrices returned from the _svd()_ function
cannot be multiplied directly.

As you can see from the code, the $S$ vector must be converted into a
diagonal matrix. This may still cause a problem, as the sizes of the matrices
do not fit the rules of matrix multiplication, where the number of
columns in a matrix must match the number of rows in the subsequent
matrix.

If you wish to include the zero singular values, you will need to
resize the matrices and set up a diagonal matrix as done in the above
example.
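
The following sketch (using an assumed $3\times 2$ matrix) illustrates the point: with the full SVD, $U$ is $3\times 3$ while $S$ holds only two values, so a $3\times 2$ diagonal matrix must be built before the factors can be multiplied.

!bc pycod
import numpy as np
# Assumed 3x2 example matrix: S is returned as a length-2 vector
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
U, S, VT = np.linalg.svd(X, full_matrices=True)   # U: 3x3, S: (2,), VT: 2x2
# U @ np.diag(S) @ VT would not conform; build a full 3x2 diagonal matrix instead
D = np.zeros(X.shape)
D[:len(S), :len(S)] = np.diag(S)
print(np.allclose(X, U @ D @ VT))   # True
!ec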


!split
===== Mathematics of the SVD and implications =====

Let us take a closer look at the mathematics of the SVD and the various implications for machine learning studies.

Our starting point is our design matrix $\bm{X}$ of dimension $n\times p$
!bt
\[
\bm{X}=\begin{bmatrix}
x_{0,0} & x_{0,1} & x_{0,2}& \dots & \dots & x_{0,p-1}\\
x_{1,0} & x_{1,1} & x_{1,2}& \dots & \dots & x_{1,p-1}\\
x_{2,0} & x_{2,1} & x_{2,2}& \dots & \dots & x_{2,p-1}\\
\dots & \dots & \dots & \dots & \dots & \dots \\
x_{n-2,0} & x_{n-2,1} & x_{n-2,2}& \dots & \dots & x_{n-2,p-1}\\
x_{n-1,0} & x_{n-1,1} & x_{n-1,2}& \dots & \dots & x_{n-1,p-1}\\
\end{bmatrix}.
\]
!et

We can SVD decompose our matrix as
!bt
\[
\bm{X}=\bm{U}\bm{\Sigma}\bm{V}^T,
\]
!et
where $\bm{U}$ is an orthogonal matrix of dimension $n\times n$, meaning that $\bm{U}\bm{U}^T=\bm{U}^T\bm{U}=\bm{I}_n$. Here $\bm{I}_n$ is the unit matrix of dimension $n \times n$.

Similarly, $\bm{V}$ is an orthogonal matrix of dimension $p\times p$, meaning that $\bm{V}\bm{V}^T=\bm{V}^T\bm{V}=\bm{I}_p$. Here $\bm{I}_p$ is the unit matrix of dimension $p \times p$.

Finally $\bm{\Sigma}$ contains the singular values $\sigma_i$. This matrix has dimension $n\times p$ and the singular values $\sigma_i$ are all non-negative. They are ordered in descending order, that is

!bt
\[
\sigma_0 \geq \sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_{p-1} \geq 0.
\]
!et

All entries of $\bm{\Sigma}$ beyond the first $p$ diagonal elements are zero.

!split
===== Example Matrix =====

As an example, consider the following $3\times 2$ matrix $\bm{\Sigma}$

!bt
\[
\bm{\Sigma}=
\begin{bmatrix}
2& 0 \\
0 & 1 \\
0 & 0 \\
\end{bmatrix}
\]
!et

The singular values are $\sigma_0=2$ and $\sigma_1=1$. It is common to rewrite the matrix $\bm{\Sigma}$ as

!bt
\[
\bm{\Sigma}=
\begin{bmatrix}
\bm{\tilde{\Sigma}}\\
\bm{0}\\
\end{bmatrix},
\]
!et

where
!bt
\[
\bm{\tilde{\Sigma}}=
\begin{bmatrix}
2& 0 \\
0 & 1 \\
\end{bmatrix},
\]
!et
contains only the singular values. Note also (and we will use this below) that

!bt
\[
\bm{\Sigma}^T\bm{\Sigma}=
\begin{bmatrix}
4& 0 \\
0 & 1 \\
\end{bmatrix},
\]
!et
which is a $2\times 2$ matrix, while
!bt
\[
\bm{\Sigma}\bm{\Sigma}^T=
\begin{bmatrix}
4& 0 & 0\\
0 & 1 & 0\\
0 & 0 & 0\\
\end{bmatrix},
\]
!et
is a $3\times 3$ matrix. The last row and column of this last matrix
contain only zeros. This will have important consequences for our SVD
decomposition of the design matrix.


!split
===== Setting up the Matrix to be inverted =====

The matrix that may cause problems for us is $\bm{X}^T\bm{X}$. Using the SVD we can rewrite this matrix as

!bt
\[
\bm{X}^T\bm{X}=\bm{V}\bm{\Sigma}^T\bm{U}^T\bm{U}\bm{\Sigma}\bm{V}^T,
\]
!et
and using the orthogonality of the matrix $\bm{U}$ we have

!bt
\[
\bm{X}^T\bm{X}=\bm{V}\bm{\Sigma}^T\bm{\Sigma}\bm{V}^T.
\]
!et
We define $\bm{\Sigma}^T\bm{\Sigma}=\tilde{\bm{\Sigma}}^2$, which is a diagonal matrix containing only the singular values squared. It has dimensionality $p \times p$.
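
As a quick numerical sketch (with an assumed random design matrix) we can verify the identity $\bm{X}^T\bm{X}=\bm{V}\tilde{\bm{\Sigma}}^2\bm{V}^T$:

!bc pycod
import numpy as np
# Assumed random design matrix: check that X^T X = V diag(sigma_i^2) V^T
np.random.seed(1)
n, p = 6, 3
X = np.random.randn(n, p)
U, S, VT = np.linalg.svd(X, full_matrices=True)
lhs = X.T @ X
rhs = VT.T @ np.diag(S**2) @ VT
print(np.allclose(lhs, rhs))   # True
!ec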


We can now insert the result for the matrix $\bm{X}^T\bm{X}$ into our equation for ordinary least squares, where

!bt
\[
\tilde{\bm{y}}_{\mathrm{OLS}}=\bm{X}\left(\bm{X}^T\bm{X}\right)^{-1}\bm{X}^T\bm{y},
\]
!et
and using our SVD decomposition of $\bm{X}$ we have

!bt
\[
\tilde{\bm{y}}_{\mathrm{OLS}}=\bm{U}\bm{\Sigma}\bm{V}^T\left(\bm{V}\tilde{\bm{\Sigma}}^{2}\bm{V}^T\right)^{-1}\bm{V}\bm{\Sigma}^T\bm{U}^T\bm{y},
\]
!et
which gives us, using the orthogonality of the matrix $\bm{V}$,

!bt
\[
\tilde{\bm{y}}_{\mathrm{OLS}}=\bm{U}\bm{\Sigma}\tilde{\bm{\Sigma}}^{-2}\bm{\Sigma}^T\bm{U}^T\bm{y}=\sum_{i=0}^{p-1}\bm{u}_i\bm{u}^T_i\bm{y}.
\]
!et

It means that the ordinary least squares model (with the optimal
parameters) $\bm{\tilde{y}}$ corresponds to a projection of the output
(or target) vector $\bm{y}$ onto the space spanned by the first $p$
columns of the matrix $\bm{U}$. _Note that the summation ends at_
$p-1$, that is, $\bm{\tilde{y}}\ne \bm{y}$ in general. We can thus not use the full
orthogonality relation $\bm{U}\bm{U}^T=\bm{I}_n$ for the $n\times n$ matrix $\bm{U}$. This can already be
seen when we multiply out the matrices $\bm{\Sigma}\tilde{\bm{\Sigma}}^{-2}\bm{\Sigma}^T$, which yield an $n\times n$ matrix with ones in the first $p$ diagonal entries and zeros everywhere else.
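
A small numerical sketch (with assumed random data, so that $\bm{X}$ has full column rank) confirms that the OLS prediction is exactly this projection onto the first $p$ left singular vectors:

!bc pycod
import numpy as np
# Assumed random data for illustration; X then has full column rank
np.random.seed(2021)
n, p = 20, 4
X = np.random.randn(n, p)
y = np.random.randn(n)
# OLS prediction from the normal equations
ytilde_ols = X @ np.linalg.inv(X.T @ X) @ X.T @ y
# Projection of y onto the first p left singular vectors of X
U, S, VT = np.linalg.svd(X, full_matrices=True)
ytilde_svd = U[:, :p] @ (U[:, :p].T @ y)
print(np.allclose(ytilde_ols, ytilde_svd))   # True
!ec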

!split
===== Further properties (important for our analyses later) =====

Let us study again $\bm{X}^T\bm{X}$ in terms of our SVD,
!bt
\[
\bm{X}^T\bm{X}=\bm{V}\bm{\Sigma}^T\bm{U}^T\bm{U}\bm{\Sigma}\bm{V}^T=\bm{V}\bm{\Sigma}^T\bm{\Sigma}\bm{V}^T.
\]
!et

If we now multiply from the right with $\bm{V}$ (using the orthogonality of $\bm{V}$) we get
!bt
\[
\left(\bm{X}^T\bm{X}\right)\bm{V}=\bm{V}\bm{\Sigma}^T\bm{\Sigma}.
\]
!et
This means the vectors $\bm{v}_i$ of the orthogonal matrix $\bm{V}$ are the eigenvectors of the matrix $\bm{X}^T\bm{X}$
with eigenvalues given by the singular values squared, that is
!bt
\[
\left(\bm{X}^T\bm{X}\right)\bm{v}_i=\bm{v}_i\sigma_i^2.
\]
!et

Similarly, if we use the SVD decomposition for the matrix $\bm{X}\bm{X}^T$, we have
!bt
\[
\bm{X}\bm{X}^T=\bm{U}\bm{\Sigma}\bm{V}^T\bm{V}\bm{\Sigma}^T\bm{U}^T=\bm{U}\bm{\Sigma}\bm{\Sigma}^T\bm{U}^T.
\]
!et

If we now multiply from the right with $\bm{U}$ (using the orthogonality of $\bm{U}$) we get
!bt
\[
\left(\bm{X}\bm{X}^T\right)\bm{U}=\bm{U}\bm{\Sigma}\bm{\Sigma}^T.
\]
!et
This means the vectors $\bm{u}_i$ of the orthogonal matrix $\bm{U}$ are the eigenvectors of the matrix $\bm{X}\bm{X}^T$
with eigenvalues given by the singular values squared, that is
!bt
\[
\left(\bm{X}\bm{X}^T\right)\bm{u}_i=\bm{u}_i\sigma_i^2.
\]
!et
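
Both statements are easy to verify numerically; the sketch below uses an assumed random matrix purely for illustration:

!bc pycod
import numpy as np
# Assumed random matrix: the singular vectors are eigenvectors of
# X^T X and X X^T with eigenvalues sigma_i^2
np.random.seed(10)
X = np.random.randn(5, 3)
U, S, VT = np.linalg.svd(X, full_matrices=True)
for i, sigma in enumerate(S):
    v = VT[i, :]   # i-th right singular vector (row of V^T)
    u = U[:, i]    # i-th left singular vector (column of U)
    print(np.allclose((X.T @ X) @ v, sigma**2 * v),
          np.allclose((X @ X.T) @ u, sigma**2 * u))
!ec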

_Important note_: we have defined our design matrix $\bm{X}$ to be an
$n\times p$ matrix. In most supervised learning cases we have that $n
\ge p$, and quite often we have $n \gg p$. For linear algebra based methods like ordinary least squares or Ridge regression, this leads to a matrix $\bm{X}^T\bm{X}$ which is small and thereby easier to handle from a computational point of view (in terms of the number of floating point operations).

In our lectures, the number of columns will
always refer to the number of features in our data set, while the
number of rows represents the number of data inputs. Note that in
other texts you may find the opposite notation. This has consequences
for the definition of, for example, the covariance matrix and its relation to the SVD.
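
As a small sketch of that last point (assuming our convention of rows as data points and a column-centered design matrix), the sample covariance matrix is $\bm{X}^T\bm{X}/(n-1)$ and therefore has the right singular vectors $\bm{v}_i$ as eigenvectors, with eigenvalues $\sigma_i^2/(n-1)$:

!bc pycod
import numpy as np
# Assumed random data; columns are centered so that the covariance is X^T X/(n-1)
np.random.seed(3)
n, p = 100, 4
X = np.random.randn(n, p)
X = X - X.mean(axis=0)
C = X.T @ X / (n - 1)
print(np.allclose(C, np.cov(X, rowvar=False)))    # agrees with numpy's covariance
U, S, VT = np.linalg.svd(X, full_matrices=False)
print(np.allclose(np.sort(np.linalg.eigvalsh(C)), np.sort(S**2 / (n - 1))))
!ec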