!split
===== Basic math of the SVD =====

From standard linear algebra we know that a square matrix $\bm{X}$ can be diagonalized if and only if it is
a so-called "normal matrix":"https://en.wikipedia.org/wiki/Normal_matrix", that is, if $\bm{X}\in {\mathbb{R}}^{n\times n}$
we have $\bm{X}\bm{X}^T=\bm{X}^T\bm{X}$, or if $\bm{X}\in {\mathbb{C}}^{n\times n}$ we have $\bm{X}\bm{X}^{\dagger}=\bm{X}^{\dagger}\bm{X}$.
The matrix then has a set of eigenpairs

!bt
\[
(\lambda_1,\bm{u}_1),\dots, (\lambda_n,\bm{u}_n),
\]
!et
and the eigenvalues are given by the diagonal matrix
!bt
\[
\bm{\Sigma}=\mathrm{Diag}(\lambda_1, \dots,\lambda_n).
\]
!et
The matrix $\bm{X}$ can then be written in terms of an orthogonal/unitary transformation $\bm{U}$ as
!bt
\[
\bm{X} = \bm{U}\bm{\Sigma}\bm{U}^T,
\]
!et
with $\bm{U}\bm{U}^T=\bm{I}$ or $\bm{U}\bm{U}^{\dagger}=\bm{I}$.

Not all square matrices are diagonalizable. A matrix like the one discussed above
!bt
\[
\bm{X} = \begin{bmatrix}
1& -1 \\
1& -1\\
\end{bmatrix}
\]
!et
is not diagonalizable; it is a so-called "defective matrix":"https://en.wikipedia.org/wiki/Defective_matrix". It is easy to see that the condition
$\bm{X}\bm{X}^T=\bm{X}^T\bm{X}$ is not fulfilled.

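As a quick numerical check (a small sketch added for illustration, not part of the derivation), we can verify with NumPy that the two products indeed differ:

!bc pycod
import numpy as np

# The defective matrix from above
X = np.array([[1.0, -1.0],
              [1.0, -1.0]])

# For a normal matrix these two products would be equal
print(X @ X.T)   # [[2. 2.] [2. 2.]]
print(X.T @ X)   # [[ 2. -2.] [-2.  2.]]
!ec
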
!split
===== The SVD, a Fantastic Algorithm =====

However, and this is the strength of the SVD algorithm, any general
matrix $\bm{X}$ can be decomposed in terms of a diagonal matrix and
two orthogonal/unitary matrices. The "Singular Value Decomposition
(SVD) theorem":"https://en.wikipedia.org/wiki/Singular_value_decomposition"
states that a general $m\times n$ matrix $\bm{X}$ can be written in
terms of a diagonal matrix $\bm{\Sigma}$ of dimensionality $m\times n$
and two orthogonal matrices $\bm{U}$ and $\bm{V}$, where the first has
dimensionality $m \times m$ and the second $n\times n$.
We then have

!bt
\[
\bm{X} = \bm{U}\bm{\Sigma}\bm{V}^T.
\]
!et

As an example, the above defective matrix can be decomposed as

!bt
\[
\bm{X} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1& 1 \\ 1& -1\\ \end{bmatrix} \begin{bmatrix} 2& 0 \\ 0& 0\\ \end{bmatrix} \frac{1}{\sqrt{2}}\begin{bmatrix} 1& -1 \\ 1& 1\\ \end{bmatrix}=\bm{U}\bm{\Sigma}\bm{V}^T,
\]
!et

with singular values $\sigma_1=2$ and $\sigma_2=0$.
The SVD always exists!

The singular values of the SVD are ordered so that
$\sigma_i\geq\sigma_{i+1}$ for all $i$, and for indices larger than $i=p$ the
singular values are zero.

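As a small illustration (a minimal sketch, not part of the original notes), we can verify the decomposition of the defective matrix numerically with NumPy; note that np.linalg.svd may return the singular vectors with different signs, but the product reproduces $\bm{X}$:

!bc pycod
import numpy as np

X = np.array([[1.0, -1.0],
              [1.0, -1.0]])

# Compute the SVD of the defective matrix
U, S, VT = np.linalg.svd(X)
print(S)                  # singular values [2., 0.]

# Reconstruct X from the three factors
print(U @ np.diag(S) @ VT)
!ec
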
In the general case, where our design matrix $\bm{X}$ has dimension
$n\times p$, the matrix is thus decomposed into an $n\times n$
orthogonal matrix $\bm{U}$, a $p\times p$ orthogonal matrix $\bm{V}$
and a diagonal matrix $\bm{\Sigma}$ with $r=\mathrm{min}(n,p)$
singular values $\sigma_i\geq 0$ on the main diagonal and zeros filling
the rest of the matrix. There are at most $p$ singular values
assuming that $n > p$. In our regression examples for the nuclear
masses and the equation of state this is indeed the case, while for
the Ising model we have $p > n$. These are often cases that lead to
near singular or singular matrices.

The columns of $\bm{U}$ are called the left singular vectors while the columns of $\bm{V}$ are the right singular vectors.

!split
===== Economy-size SVD =====

If we assume that $n > p$, then our matrix $\bm{U}$ has dimension $n
\times n$. The last $n-p$ columns of $\bm{U}$ become, however,
irrelevant in our calculations since they are multiplied with the
zeros in $\bm{\Sigma}$.

The economy-size decomposition removes extra rows or columns of zeros
from the diagonal matrix of singular values, $\bm{\Sigma}$, along with the columns
in either $\bm{U}$ or $\bm{V}$ that multiply those zeros in the expression.
Removing these zeros and columns can improve execution time
and reduce storage requirements without compromising the accuracy of
the decomposition.

If $n > p$, we keep only the first $p$ columns of $\bm{U}$ and $\bm{\Sigma}$ has dimension $p\times p$.
If $p > n$, then only the first $n$ columns of $\bm{V}$ are computed and $\bm{\Sigma}$ has dimension $n\times n$.
The $n=p$ case is obvious; we retain the full SVD.
In general, the economy-size SVD requires fewer floating-point operations while preserving the desired accuracy.

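In NumPy the economy-size decomposition is obtained by passing full_matrices=False to np.linalg.svd. The short sketch below (added for illustration, with a randomly generated matrix) compares the shapes of the full and economy-size factors for an $n>p$ case:

!bc pycod
import numpy as np

# A tall matrix with n=4 rows and p=2 columns (random numbers, illustration only)
A = np.random.randn(4, 2)

# Full SVD: U is 4x4, S has 2 entries, VT is 2x2
U, S, VT = np.linalg.svd(A, full_matrices=True)
print(U.shape, S.shape, VT.shape)

# Economy-size SVD: U is 4x2, so np.diag(S) can be used directly
U_econ, S_econ, VT_econ = np.linalg.svd(A, full_matrices=False)
print(U_econ.shape, S_econ.shape, VT_econ.shape)

# Both variants reconstruct A; the economy version needs no zero padding
print(np.allclose(U_econ @ np.diag(S_econ) @ VT_econ, A))
!ec
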
!split
===== Codes for the SVD =====

!bc pycod
import numpy as np
# SVD decomposition and reconstruction
def SVD(A):
    ''' Takes as input a numpy matrix A and returns the product U D V^T of its
    singular value decomposition (SVD), which reproduces A. The SVD is
    numerically more stable than the inversion algorithms provided by
    numpy and scipy.linalg, at the cost of being slower.
    '''
    U, S, VT = np.linalg.svd(A, full_matrices=True)
    # Sanity checks: for orthogonal U and VT these differences are zero matrices
    print('test U')
    print(np.transpose(U) @ U - U @ np.transpose(U))
    print('test VT')
    print(np.transpose(VT) @ VT - VT @ np.transpose(VT))
    print(U)
    print(S)
    print(VT)

    # Build the rectangular diagonal matrix D from the vector of singular values S
    D = np.zeros((len(U), len(VT)))
    for i in range(len(S)):
        D[i, i] = S[i]
    return U @ D @ VT


X = np.array([[1.0, -1.0], [1.0, -1.0]])
#X = np.array([[1, 2], [3, 4], [5, 6]])

print(X)
C = SVD(X)
# Print the difference between the original matrix and the reconstructed one
print(C - X)
!ec

The matrix $\bm{X}$ defined above has linearly dependent columns: the
second column is simply the negative of the first. The rank of a
matrix (the column rank) is the dimension of the space spanned by its
column vectors, that is, the number of linearly independent columns,
in this case just $1$. We see this from the singular values when
running the above code, since one of them is zero. Running a standard
inversion algorithm on $\bm{X}^T\bm{X}$ results
in the program terminating due to a singular matrix.

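To see that failure explicitly, here is a small sketch (added for illustration) that tries to invert $\bm{X}^T\bm{X}$ directly:

!bc pycod
import numpy as np

X = np.array([[1.0, -1.0], [1.0, -1.0]])

# X^T X is singular, so the standard inversion raises an exception
try:
    np.linalg.inv(X.T @ X)
except np.linalg.LinAlgError as err:
    print('Inversion failed:', err)
!ec
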
!split
===== Note about SVD Calculations =====

The $U$, $S$, and $V$ matrices returned from the _svd()_ function
cannot be multiplied directly.

As you can see from the code, the $S$ vector must be converted into a
diagonal matrix. Otherwise the sizes of the matrices
do not fit the rules of matrix multiplication, where the number of
columns in a matrix must match the number of rows in the subsequent
matrix.

If you wish to include the zero singular values, you will need to
resize the matrices and set up a diagonal matrix as done in the above
example.

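The following sketch (added for illustration, using the commented-out $3\times 2$ matrix from the code above) shows the shape mismatch and how the zero-padded $\bm{\Sigma}$ resolves it:

!bc pycod
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
U, S, VT = np.linalg.svd(A, full_matrices=True)   # U: 3x3, S: (2,), VT: 2x2

# U @ np.diag(S) @ VT would fail: the shapes (3,3), (2,2) and (2,2) do not match.
# Instead, embed the singular values in a rectangular 3x2 matrix first:
Sigma = np.zeros(A.shape)
Sigma[:len(S), :len(S)] = np.diag(S)
print(np.allclose(U @ Sigma @ VT, A))   # True
!ec
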
!split
===== Mathematics of the SVD and implications =====

Let us take a closer look at the mathematics of the SVD and the various implications for machine learning studies.

Our starting point is our design matrix $\bm{X}$ of dimension $n\times p$
!bt
\[
\bm{X}=\begin{bmatrix}
x_{0,0} & x_{0,1} & x_{0,2}& \dots & \dots x_{0,p-1}\\
x_{1,0} & x_{1,1} & x_{1,2}& \dots & \dots x_{1,p-1}\\
x_{2,0} & x_{2,1} & x_{2,2}& \dots & \dots x_{2,p-1}\\
\dots & \dots & \dots & \dots \dots & \dots \\
x_{n-2,0} & x_{n-2,1} & x_{n-2,2}& \dots & \dots x_{n-2,p-1}\\
x_{n-1,0} & x_{n-1,1} & x_{n-1,2}& \dots & \dots x_{n-1,p-1}\\
\end{bmatrix}.
\]
!et

We can SVD decompose our matrix as
!bt
\[
\bm{X}=\bm{U}\bm{\Sigma}\bm{V}^T,
\]
!et
where $\bm{U}$ is an orthogonal matrix of dimension $n\times n$, meaning that $\bm{U}\bm{U}^T=\bm{U}^T\bm{U}=\bm{I}_n$. Here $\bm{I}_n$ is the unit matrix of dimension $n \times n$.

Similarly, $\bm{V}$ is an orthogonal matrix of dimension $p\times p$, meaning that $\bm{V}\bm{V}^T=\bm{V}^T\bm{V}=\bm{I}_p$. Here $\bm{I}_p$ is the unit matrix of dimension $p \times p$.

Finally, $\bm{\Sigma}$ contains the singular values $\sigma_i$. This matrix has dimension $n\times p$ and the singular values $\sigma_i$ are all non-negative. They are ordered in descending order, that is

!bt
\[
\sigma_0 \geq \sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_{p-1} \geq 0.
\]
!et

All values beyond index $p-1$ are zero.

!split
===== Example Matrix =====

As an example, consider the following $3\times 2$ matrix $\bm{\Sigma}$

!bt
\[
\bm{\Sigma}=
\begin{bmatrix}
2& 0 \\
0 & 1 \\
0 & 0 \\
\end{bmatrix}
\]
!et

The singular values are $\sigma_0=2$ and $\sigma_1=1$. It is common to rewrite the matrix $\bm{\Sigma}$ as

!bt
\[
\bm{\Sigma}=
\begin{bmatrix}
\bm{\tilde{\Sigma}}\\
\bm{0}\\
\end{bmatrix},
\]
!et

where
!bt
\[
\bm{\tilde{\Sigma}}=
\begin{bmatrix}
2& 0 \\
0 & 1 \\
\end{bmatrix}
\]
!et
contains only the singular values. Note also (and we will use this below) that

!bt
\[
\bm{\Sigma}^T\bm{\Sigma}=
\begin{bmatrix}
4& 0 \\
0 & 1 \\
\end{bmatrix},
\]
!et
which is a $2\times 2$ matrix, while
!bt
\[
\bm{\Sigma}\bm{\Sigma}^T=
\begin{bmatrix}
4& 0 & 0\\
0 & 1 & 0\\
0 & 0 & 0\\
\end{bmatrix}
\]
!et
is a $3\times 3$ matrix. The last row and column of this last matrix
contain only zeros. This will have important consequences for our SVD
decomposition of the design matrix.

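To make the dimensions explicit, here is a small numerical check (added for illustration) of the two products for this example matrix:

!bc pycod
import numpy as np

# The 3x2 example matrix Sigma from above
Sigma = np.array([[2.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])

print(Sigma.T @ Sigma)   # 2x2 matrix diag(4, 1)
print(Sigma @ Sigma.T)   # 3x3 matrix with a zero last row and column
!ec
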
!split
===== Setting up the Matrix to be inverted =====

The matrix that may cause problems for us is $\bm{X}^T\bm{X}$. Using the SVD we can rewrite this matrix as

!bt
\[
\bm{X}^T\bm{X}=\bm{V}\bm{\Sigma}^T\bm{U}^T\bm{U}\bm{\Sigma}\bm{V}^T,
\]
!et
and using the orthogonality of the matrix $\bm{U}$ we have

!bt
\[
\bm{X}^T\bm{X}=\bm{V}\bm{\Sigma}^T\bm{\Sigma}\bm{V}^T.
\]
!et
We define $\bm{\Sigma}^T\bm{\Sigma}=\tilde{\bm{\Sigma}}^2$, which is a diagonal matrix containing only the singular values squared. It has dimensionality $p \times p$.

We can now insert the result for the matrix $\bm{X}^T\bm{X}$ into our equation for ordinary least squares, where

!bt
\[
\tilde{\bm{y}}_{\mathrm{OLS}}=\bm{X}\left(\bm{X}^T\bm{X}\right)^{-1}\bm{X}^T\bm{y},
\]
!et
and using our SVD decomposition of $\bm{X}$ we have

!bt
\[
\tilde{\bm{y}}_{\mathrm{OLS}}=\bm{U}\bm{\Sigma}\bm{V}^T\left(\bm{V}\tilde{\bm{\Sigma}}^{2}\bm{V}^T\right)^{-1}\bm{V}\bm{\Sigma}^T\bm{U}^T\bm{y},
\]
!et
which gives us, using the orthogonality of the matrix $\bm{V}$,

!bt
\[
\tilde{\bm{y}}_{\mathrm{OLS}}=\sum_{i=0}^{p-1}\bm{u}_i\bm{u}^T_i\bm{y}.
\]
!et

It means that the ordinary least squares model (with the optimal
parameters) $\bm{\tilde{y}}$ corresponds to an orthogonal
transformation of the output (or target) vector $\bm{y}$ by the
vectors of the matrix $\bm{U}$. _Note that the summation ends at_
$p-1$, that is $\bm{\tilde{y}}\ne \bm{y}$. We can thus not use the
orthogonality relation for the matrix $\bm{U}$. This can already be
seen when we multiply the matrices $\bm{\Sigma}^T\bm{U}^T$.

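The equivalence of the two expressions can be checked numerically. The sketch below (added for illustration, with a randomly generated full-rank design matrix) compares the prediction from the normal equations with the truncated sum over the first $p$ left singular vectors:

!bc pycod
import numpy as np

# Synthetic example: n = 6 data points, p = 3 features (random numbers, illustration only)
rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# OLS prediction via the normal equations
ytilde_normal = X @ np.linalg.inv(X.T @ X) @ X.T @ y

# OLS prediction via the SVD: only the first p left singular vectors contribute
U, S, VT = np.linalg.svd(X, full_matrices=True)
ytilde_svd = U[:, :p] @ (U[:, :p].T @ y)

print(np.allclose(ytilde_normal, ytilde_svd))   # True
!ec
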
!split
===== Further properties (important for our analyses later) =====

Let us study again $\bm{X}^T\bm{X}$ in terms of our SVD,
!bt
\[
\bm{X}^T\bm{X}=\bm{V}\bm{\Sigma}^T\bm{U}^T\bm{U}\bm{\Sigma}\bm{V}^T=\bm{V}\bm{\Sigma}^T\bm{\Sigma}\bm{V}^T.
\]
!et

If we now multiply from the right with $\bm{V}$ (using the orthogonality of $\bm{V}$) we get
!bt
\[
\left(\bm{X}^T\bm{X}\right)\bm{V}=\bm{V}\bm{\Sigma}^T\bm{\Sigma}.
\]
!et
This means that the vectors $\bm{v}_i$ of the orthogonal matrix $\bm{V}$ are the eigenvectors of the matrix $\bm{X}^T\bm{X}$,
with eigenvalues given by the singular values squared, that is
!bt
\[
\left(\bm{X}^T\bm{X}\right)\bm{v}_i=\bm{v}_i\sigma_i^2.
\]
!et

Similarly, if we use the SVD decomposition for the matrix $\bm{X}\bm{X}^T$, we have
!bt
\[
\bm{X}\bm{X}^T=\bm{U}\bm{\Sigma}\bm{V}^T\bm{V}\bm{\Sigma}^T\bm{U}^T=\bm{U}\bm{\Sigma}\bm{\Sigma}^T\bm{U}^T.
\]
!et

If we now multiply from the right with $\bm{U}$ (using the orthogonality of $\bm{U}$) we get
!bt
\[
\left(\bm{X}\bm{X}^T\right)\bm{U}=\bm{U}\bm{\Sigma}\bm{\Sigma}^T.
\]
!et
This means that the vectors $\bm{u}_i$ of the orthogonal matrix $\bm{U}$ are the eigenvectors of the matrix $\bm{X}\bm{X}^T$,
with eigenvalues given by the singular values squared, that is
!bt
\[
\left(\bm{X}\bm{X}^T\right)\bm{u}_i=\bm{u}_i\sigma_i^2.
\]
!et

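These eigenvector relations are easy to verify numerically. A short sketch (added for illustration, with a random matrix) checks that the right singular vectors are eigenvectors of $\bm{X}^T\bm{X}$ with eigenvalues $\sigma_i^2$:

!bc pycod
import numpy as np

# Random design matrix, illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))

U, S, VT = np.linalg.svd(X, full_matrices=True)
V = VT.T

for i in range(len(S)):
    lhs = (X.T @ X) @ V[:, i]
    rhs = S[i]**2 * V[:, i]
    print(np.allclose(lhs, rhs))   # True for each i
!ec
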
_Important note_: we have defined our design matrix $\bm{X}$ to be an
$n\times p$ matrix. In most supervised learning cases we have that $n
\ge p$, and quite often we have $n \gg p$. For linear algebra based methods like ordinary least squares or Ridge regression, this leads to a matrix $\bm{X}^T\bm{X}$ which is small and thereby easier to handle from a computational point of view (in terms of the number of floating point operations).

In our lectures, the number of columns will
always refer to the number of features in our data set, while the
number of rows represents the number of data inputs. Note that in
other texts you may find the opposite notation. This has consequences
for the definition of, for example, the covariance matrix and its relation to the SVD.
