and continue till we have solved all $n$ sets of linear equations.

!split
===== Basic math of the SVD =====


From standard linear algebra we know that a square matrix $\bm{X}$ can be diagonalized by a unitary (orthogonal) transformation if and only if it is
a so-called "normal matrix":"https://en.wikipedia.org/wiki/Normal_matrix", that is, if $\bm{X}\in {\mathbb{R}}^{n\times n}$
we have $\bm{X}\bm{X}^T=\bm{X}^T\bm{X}$, or if $\bm{X}\in {\mathbb{C}}^{n\times n}$ we have $\bm{X}\bm{X}^{\dagger}=\bm{X}^{\dagger}\bm{X}$.
The matrix then has a set of eigenpairs

!bt
\[
(\lambda_1,\bm{u}_1),\dots, (\lambda_n,\bm{u}_n),
\]
!et
and the eigenvalues are given by the diagonal matrix
!bt
\[
\bm{\Sigma}=\mathrm{Diag}(\lambda_1, \dots,\lambda_n).
\]
!et
The matrix $\bm{X}$ can then be written in terms of an orthogonal/unitary transformation $\bm{U}$ as
!bt
\[
\bm{X} = \bm{U}\bm{\Sigma}\bm{U}^T,
\]
!et
with $\bm{U}\bm{U}^T=\bm{I}$ or $\bm{U}\bm{U}^{\dagger}=\bm{I}$.

Not all square matrices are diagonalizable. A matrix like the one discussed above,
!bt
\[
\bm{X} = \begin{bmatrix}
1& -1 \\
1& -1\\
\end{bmatrix}
\]
!et
is not diagonalizable; it is a so-called "defective matrix":"https://en.wikipedia.org/wiki/Defective_matrix". It is easy to see that the condition
$\bm{X}\bm{X}^T=\bm{X}^T\bm{X}$ is not fulfilled.
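
Indeed, a direct computation for this matrix shows that the two products differ,

!bt
\[
\bm{X}\bm{X}^T = \begin{bmatrix} 2& 2 \\ 2& 2\\ \end{bmatrix}
\ne
\begin{bmatrix} 2& -2 \\ -2& 2\\ \end{bmatrix} = \bm{X}^T\bm{X},
\]
!et
so the matrix is not normal.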


!split
===== The SVD, a Fantastic Algorithm =====


However, and this is the strength of the SVD algorithm, any general
matrix $\bm{X}$ can be decomposed in terms of a diagonal matrix and
two orthogonal/unitary matrices. The "Singular Value Decomposition
(SVD) theorem":"https://en.wikipedia.org/wiki/Singular_value_decomposition"
states that a general $m\times n$ matrix $\bm{X}$ can be written in
terms of a diagonal matrix $\bm{\Sigma}$ of dimensionality $m\times n$
and two orthogonal matrices $\bm{U}$ and $\bm{V}$, where the first has
dimensionality $m \times m$ and the second dimensionality $n\times n$.
We then have

!bt
\[
\bm{X} = \bm{U}\bm{\Sigma}\bm{V}^T.
\]
!et

As an example, the above defective matrix can be decomposed as

!bt
\[
\bm{X} = \frac{1}{\sqrt{2}}\begin{bmatrix}  1& 1 \\ 1& -1\\ \end{bmatrix} \begin{bmatrix}  2& 0 \\ 0& 0\\ \end{bmatrix}    \frac{1}{\sqrt{2}}\begin{bmatrix}  1& -1 \\ 1& 1\\ \end{bmatrix}=\bm{U}\bm{\Sigma}\bm{V}^T,
\]
!et

with singular values $\sigma_1=2$ and $\sigma_2=0$.
The SVD always exists!

The SVD
decomposition yields singular values ordered as
$\sigma_i\geq\sigma_{i+1}$ for all $i$; for indices larger than $i=p$, the
singular values are zero.

In the general case, where our design matrix $\bm{X}$ has dimension
$n\times p$, the matrix is thus decomposed into an $n\times n$
orthogonal matrix $\bm{U}$, a $p\times p$ orthogonal matrix $\bm{V}$
and a diagonal matrix $\bm{\Sigma}$ with $r=\mathrm{min}(n,p)$
singular values $\sigma_i\geq 0$ on the main diagonal and zeros filling
the rest of the matrix. There are at most $p$ singular values
assuming that $n > p$. In our regression examples for the nuclear
masses and the equation of state this is indeed the case, while for
the Ising model we have $p > n$. These are often cases that lead to
near singular or singular matrices.
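
A small sketch (with an assumed random matrix, purely for illustration) shows why the case $p > n$ is problematic: $\bm{X}^T\bm{X}$ is then a $p\times p$ matrix of rank at most $n < p$, and hence singular.

!bc pycod
import numpy as np
# Assumed random data: when p > n the matrix X^T X is rank deficient
np.random.seed(0)
n, p = 5, 10                       # fewer data points than features
X = np.random.randn(n, p)
XTX = X.T @ X                      # p x p matrix
print(np.linalg.matrix_rank(XTX))  # at most n = 5, hence singular
U, S, VT = np.linalg.svd(X)
print(S)                           # only min(n, p) = 5 singular values
!ec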

The columns of $\bm{U}$ are called the left singular vectors while the columns of $\bm{V}$ are the right singular vectors.

!split
===== Economy-size SVD =====

If we assume that $n > p$, then our matrix $\bm{U}$ has dimension $n
\times n$. The last $n-p$ columns of $\bm{U}$ then become
irrelevant in our calculations since they are multiplied with the
zeros in $\bm{\Sigma}$.

The economy-size decomposition removes extra rows or columns of zeros
from the diagonal matrix of singular values, $\bm{\Sigma}$, along with the columns
in either $\bm{U}$ or $\bm{V}$ that multiply those zeros in the expression.
Removing these zeros and columns can improve execution time
and reduce storage requirements without compromising the accuracy of
the decomposition.

If $n > p$, we keep only the first $p$ columns of $\bm{U}$ and $\bm{\Sigma}$ has dimension $p\times p$.
If $p > n$, then only the first $n$ columns of $\bm{V}$ are computed and $\bm{\Sigma}$ has dimension $n\times n$.
The $n=p$ case is obvious; there we retain the full SVD.
In general the economy-size SVD leads to fewer floating-point operations while preserving the desired accuracy.
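
As a minimal sketch (the small matrix below is assumed purely for illustration), NumPy computes the economy-size SVD when we switch off the full-matrices option:

!bc pycod
import numpy as np
# Compare the full and the economy-size SVD for a tall matrix (n > p)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])    # n = 3, p = 2
U, S, VT = np.linalg.svd(X, full_matrices=True)
print(U.shape, S.shape, VT.shape)      # (3, 3) (2,) (2, 2)
Ue, Se, VTe = np.linalg.svd(X, full_matrices=False)
print(Ue.shape, Se.shape, VTe.shape)   # (3, 2) (2,) (2, 2)
# Both variants reconstruct X to numerical precision
print(np.allclose(X, Ue @ np.diag(Se) @ VTe))
!ec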

!split
===== Codes for the SVD =====

!bc pycod
import numpy as np
# Decompose a matrix with the SVD and reconstruct it from the factors
def SVD(A):
    ''' Takes as input a numpy matrix A and returns the reconstruction
    U @ D @ VT obtained from its singular value decomposition (SVD).
    SVD-based methods are numerically more stable than the inversion
    algorithms provided by numpy and scipy.linalg, at the cost of being slower.
    '''
    U, S, VT = np.linalg.svd(A, full_matrices=True)
    # Check the orthogonality of U and VT (both differences should vanish)
    print('test U')
    print( (np.transpose(U) @ U - U @ np.transpose(U)))
    print('test VT')
    print( (np.transpose(VT) @ VT - VT @ np.transpose(VT)))
    print(U)
    print(S)
    print(VT)

    # S is returned as a vector; embed it in a diagonal matrix of matching shape
    D = np.zeros((len(U), len(VT)))
    for i in range(0, len(S)):
        D[i, i] = S[i]
    return U @ D @ VT


X = np.array([ [1.0, -1.0], [1.0, -1.0]])
#X = np.array([[1, 2], [3, 4], [5, 6]])

print(X)
C = SVD(X)
# Print the difference between the original matrix and the reconstructed one
print(C - X)
!ec

The matrix $\bm{X}$ in this example has linearly dependent columns: the second
column is simply $-1$ times the first. The rank of a
matrix (the column rank) is the dimension of the space spanned by its
column vectors, that is, the number of linearly
independent columns, in this case just $1$. We see this from the
singular values when running the above code: one of them is zero. Running the standard
inversion algorithm for matrix inversion on $\bm{X}^T\bm{X}$ results
in the program terminating due to a singular matrix.
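
A short check (a sketch, using the same matrix as in the code above) makes both points explicit:

!bc pycod
import numpy as np
X = np.array([[1.0, -1.0], [1.0, -1.0]])
print(np.linalg.matrix_rank(X))    # 1, since one singular value is zero
try:
    np.linalg.inv(X.T @ X)         # X^T X is exactly singular here
except np.linalg.LinAlgError as err:
    print("Inversion fails:", err)
!ec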


!split
===== Note about SVD Calculations =====

The $U$, $S$, and $V$ matrices returned from the _svd()_ function
cannot be multiplied directly.

As you can see from the code, the $S$ vector must be converted into a
diagonal matrix. This may still cause a problem, as the sizes of the matrices
do not fit the rules of matrix multiplication, where the number of
columns in a matrix must match the number of rows in the subsequent
matrix.

If you wish to include the zero singular values, you will need to
resize the matrices and set up a diagonal matrix as done in the above
example.
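
The following sketch (using an assumed $3\times 2$ matrix) illustrates the point: with the full SVD, $U$ is $3\times 3$ while $S$ holds only two values, so a $3\times 2$ diagonal matrix must be built before the factors can be multiplied.

!bc pycod
import numpy as np
# Assumed 3x2 example matrix: S is returned as a length-2 vector
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
U, S, VT = np.linalg.svd(X, full_matrices=True)   # U: 3x3, S: (2,), VT: 2x2
# U @ np.diag(S) @ VT would not conform; build a full 3x2 diagonal matrix instead
D = np.zeros(X.shape)
D[:len(S), :len(S)] = np.diag(S)
print(np.allclose(X, U @ D @ VT))   # True
!ec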


!split
===== Mathematics of the SVD and implications =====

Let us take a closer look at the mathematics of the SVD and the various implications for machine learning studies.

Our starting point is our design matrix $\bm{X}$ of dimension $n\times p$
!bt
\[
\bm{X}=\begin{bmatrix}
x_{0,0} & x_{0,1} & x_{0,2}& \dots & \dots & x_{0,p-1}\\
x_{1,0} & x_{1,1} & x_{1,2}& \dots & \dots & x_{1,p-1}\\
x_{2,0} & x_{2,1} & x_{2,2}& \dots & \dots & x_{2,p-1}\\
\dots & \dots & \dots & \dots & \dots & \dots \\
x_{n-2,0} & x_{n-2,1} & x_{n-2,2}& \dots & \dots & x_{n-2,p-1}\\
x_{n-1,0} & x_{n-1,1} & x_{n-1,2}& \dots & \dots & x_{n-1,p-1}\\
\end{bmatrix}.
\]
!et

We can SVD decompose our matrix as
!bt
\[
\bm{X}=\bm{U}\bm{\Sigma}\bm{V}^T,
\]
!et
where $\bm{U}$ is an orthogonal matrix of dimension $n\times n$, meaning that $\bm{U}\bm{U}^T=\bm{U}^T\bm{U}=\bm{I}_n$. Here $\bm{I}_n$ is the unit matrix of dimension $n \times n$.

Similarly, $\bm{V}$ is an orthogonal matrix of dimension $p\times p$, meaning that $\bm{V}\bm{V}^T=\bm{V}^T\bm{V}=\bm{I}_p$. Here $\bm{I}_p$ is the unit matrix of dimension $p \times p$.

Finally $\bm{\Sigma}$ contains the singular values $\sigma_i$. This matrix has dimension $n\times p$ and the singular values $\sigma_i$ are all non-negative. They are ordered in descending order, that is

!bt
\[
\sigma_0 \geq \sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_{p-1} \geq 0.
\]
!et

All entries of $\bm{\Sigma}$ beyond the first $p$ diagonal elements are zero.

!split
===== Example Matrix =====

As an example, consider the following $3\times 2$ matrix $\bm{\Sigma}$

!bt
\[
\bm{\Sigma}=
\begin{bmatrix}
2& 0 \\
0 & 1 \\
0 & 0 \\
\end{bmatrix}
\]
!et

The singular values are $\sigma_0=2$ and $\sigma_1=1$. It is common to rewrite the matrix $\bm{\Sigma}$ as

!bt
\[
\bm{\Sigma}=
\begin{bmatrix}
\bm{\tilde{\Sigma}}\\
\bm{0}\\
\end{bmatrix},
\]
!et

where
!bt
\[
\bm{\tilde{\Sigma}}=
\begin{bmatrix}
2& 0 \\
0 & 1 \\
\end{bmatrix},
\]
!et
contains only the singular values. Note also (and we will use this below) that

!bt
\[
\bm{\Sigma}^T\bm{\Sigma}=
\begin{bmatrix}
4& 0 \\
0 & 1 \\
\end{bmatrix},
\]
!et
which is a $2\times 2$ matrix, while
!bt
\[
\bm{\Sigma}\bm{\Sigma}^T=
\begin{bmatrix}
4& 0 & 0\\
0 & 1 & 0\\
0 & 0 & 0\\
\end{bmatrix},
\]
!et
is a $3\times 3$ matrix. The last row and column of this last matrix
contain only zeros. This will have important consequences for our SVD
decomposition of the design matrix.


!split
===== Setting up the Matrix to be inverted =====

The matrix that may cause problems for us is $\bm{X}^T\bm{X}$. Using the SVD we can rewrite this matrix as

!bt
\[
\bm{X}^T\bm{X}=\bm{V}\bm{\Sigma}^T\bm{U}^T\bm{U}\bm{\Sigma}\bm{V}^T,
\]
!et
and using the orthogonality of the matrix $\bm{U}$ we have

!bt
\[
\bm{X}^T\bm{X}=\bm{V}\bm{\Sigma}^T\bm{\Sigma}\bm{V}^T.
\]
!et
We define $\bm{\Sigma}^T\bm{\Sigma}=\tilde{\bm{\Sigma}}^2$, which is a diagonal matrix containing only the singular values squared. It has dimensionality $p \times p$.
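
As a quick numerical sketch (with an assumed random design matrix) we can verify the identity $\bm{X}^T\bm{X}=\bm{V}\tilde{\bm{\Sigma}}^2\bm{V}^T$:

!bc pycod
import numpy as np
# Assumed random design matrix: check that X^T X = V diag(sigma_i^2) V^T
np.random.seed(1)
n, p = 6, 3
X = np.random.randn(n, p)
U, S, VT = np.linalg.svd(X, full_matrices=True)
lhs = X.T @ X
rhs = VT.T @ np.diag(S**2) @ VT
print(np.allclose(lhs, rhs))   # True
!ec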


We can now insert the result for the matrix $\bm{X}^T\bm{X}$ into our equation for ordinary least squares, where

!bt
\[
\tilde{\bm{y}}_{\mathrm{OLS}}=\bm{X}\left(\bm{X}^T\bm{X}\right)^{-1}\bm{X}^T\bm{y},
\]
!et
and using our SVD decomposition of $\bm{X}$ we have

!bt
\[
\tilde{\bm{y}}_{\mathrm{OLS}}=\bm{U}\bm{\Sigma}\bm{V}^T\left(\bm{V}\tilde{\bm{\Sigma}}^{2}\bm{V}^T\right)^{-1}\bm{V}\bm{\Sigma}^T\bm{U}^T\bm{y},
\]
!et
which gives us, using the orthogonality of the matrix $\bm{V}$,

!bt
\[
\tilde{\bm{y}}_{\mathrm{OLS}}=\bm{U}\bm{\Sigma}\tilde{\bm{\Sigma}}^{-2}\bm{\Sigma}^T\bm{U}^T\bm{y}=\sum_{i=0}^{p-1}\bm{u}_i\bm{u}^T_i\bm{y}.
\]
!et

It means that the ordinary least squares model (with the optimal
parameters) $\bm{\tilde{y}}$ corresponds to a projection of the output
(or target) vector $\bm{y}$ onto the space spanned by the first $p$
columns of the matrix $\bm{U}$. _Note that the summation ends at_
$p-1$, that is, $\bm{\tilde{y}}\ne \bm{y}$ in general. We can thus not use the full
orthogonality relation $\bm{U}\bm{U}^T=\bm{I}_n$ for the $n\times n$ matrix $\bm{U}$. This can already be
seen when we multiply out the matrices $\bm{\Sigma}\tilde{\bm{\Sigma}}^{-2}\bm{\Sigma}^T$, which yield an $n\times n$ matrix with ones in the first $p$ diagonal entries and zeros everywhere else.
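
A small numerical sketch (with assumed random data, so that $\bm{X}$ has full column rank) confirms that the OLS prediction is exactly this projection onto the first $p$ left singular vectors:

!bc pycod
import numpy as np
# Assumed random data for illustration; X then has full column rank
np.random.seed(2021)
n, p = 20, 4
X = np.random.randn(n, p)
y = np.random.randn(n)
# OLS prediction from the normal equations
ytilde_ols = X @ np.linalg.inv(X.T @ X) @ X.T @ y
# Projection of y onto the first p left singular vectors of X
U, S, VT = np.linalg.svd(X, full_matrices=True)
ytilde_svd = U[:, :p] @ (U[:, :p].T @ y)
print(np.allclose(ytilde_ols, ytilde_svd))   # True
!ec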

!split
===== Further properties (important for our analyses later) =====

Let us study again $\bm{X}^T\bm{X}$ in terms of our SVD,
!bt
\[
\bm{X}^T\bm{X}=\bm{V}\bm{\Sigma}^T\bm{U}^T\bm{U}\bm{\Sigma}\bm{V}^T=\bm{V}\bm{\Sigma}^T\bm{\Sigma}\bm{V}^T.
\]
!et

If we now multiply from the right with $\bm{V}$ (using the orthogonality of $\bm{V}$) we get
!bt
\[
\left(\bm{X}^T\bm{X}\right)\bm{V}=\bm{V}\bm{\Sigma}^T\bm{\Sigma}.
\]
!et
This means the vectors $\bm{v}_i$ of the orthogonal matrix $\bm{V}$ are the eigenvectors of the matrix $\bm{X}^T\bm{X}$
with eigenvalues given by the singular values squared, that is
!bt
\[
\left(\bm{X}^T\bm{X}\right)\bm{v}_i=\bm{v}_i\sigma_i^2.
\]
!et

Similarly, if we use the SVD decomposition for the matrix $\bm{X}\bm{X}^T$, we have
!bt
\[
\bm{X}\bm{X}^T=\bm{U}\bm{\Sigma}\bm{V}^T\bm{V}\bm{\Sigma}^T\bm{U}^T=\bm{U}\bm{\Sigma}\bm{\Sigma}^T\bm{U}^T.
\]
!et

If we now multiply from the right with $\bm{U}$ (using the orthogonality of $\bm{U}$) we get
!bt
\[
\left(\bm{X}\bm{X}^T\right)\bm{U}=\bm{U}\bm{\Sigma}\bm{\Sigma}^T.
\]
!et
This means the vectors $\bm{u}_i$ of the orthogonal matrix $\bm{U}$ are the eigenvectors of the matrix $\bm{X}\bm{X}^T$
with eigenvalues given by the singular values squared, that is
!bt
\[
\left(\bm{X}\bm{X}^T\right)\bm{u}_i=\bm{u}_i\sigma_i^2.
\]
!et
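
Both statements are easy to verify numerically; the sketch below uses an assumed random matrix purely for illustration:

!bc pycod
import numpy as np
# Assumed random matrix: the singular vectors are eigenvectors of
# X^T X and X X^T with eigenvalues sigma_i^2
np.random.seed(10)
X = np.random.randn(5, 3)
U, S, VT = np.linalg.svd(X, full_matrices=True)
for i, sigma in enumerate(S):
    v = VT[i, :]   # i-th right singular vector (row of V^T)
    u = U[:, i]    # i-th left singular vector (column of U)
    print(np.allclose((X.T @ X) @ v, sigma**2 * v),
          np.allclose((X @ X.T) @ u, sigma**2 * u))
!ec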

_Important note_: we have defined our design matrix $\bm{X}$ to be an
$n\times p$ matrix. In most supervised learning cases we have that $n
\ge p$, and quite often we have $n \gg p$. For linear algebra based methods like ordinary least squares or Ridge regression, this leads to a matrix $\bm{X}^T\bm{X}$ which is small and thereby easier to handle from a computational point of view (in terms of the number of floating point operations).

In our lectures, the number of columns will
always refer to the number of features in our data set, while the
number of rows represents the number of data inputs. Note that in
other texts you may find the opposite notation. This has consequences
for the definition of, for example, the covariance matrix and its relation to the SVD.
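
As a small sketch of that last point (assuming our convention of rows as data points and a column-centered design matrix), the sample covariance matrix is $\bm{X}^T\bm{X}/(n-1)$ and therefore has the right singular vectors $\bm{v}_i$ as eigenvectors, with eigenvalues $\sigma_i^2/(n-1)$:

!bc pycod
import numpy as np
# Assumed random data; columns are centered so that the covariance is X^T X/(n-1)
np.random.seed(3)
n, p = 100, 4
X = np.random.randn(n, p)
X = X - X.mean(axis=0)
C = X.T @ X / (n - 1)
print(np.allclose(C, np.cov(X, rowvar=False)))    # agrees with numpy's covariance
U, S, VT = np.linalg.svd(X, full_matrices=False)
print(np.allclose(np.sort(np.linalg.eigvalsh(C)), np.sort(S**2 / (n - 1))))
!ec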