
Commit f5df0b2

Update week13.do.txt
1 parent 4bab81a commit f5df0b2

File tree

1 file changed: +105 -29 lines changed


doc/src/week13/week13.do.txt

Lines changed: 105 additions & 29 deletions
Original file line number | Diff line number | Diff line change
@@ -210,6 +210,9 @@ The case with two well-separated classes only can be understood in an
210210
intuitive way in terms of lines in a two-dimensional space separating
211211
the two classes (see figure below).
212212

213+
!split
214+
===== Basic mathematics =====
215+
213216
The basic mathematics behind the SVM is, however, less familiar to most of us.
214217
It relies on the definition of hyperplanes and the
215218
definition of a _margin_ which separates classes (in case of
@@ -317,9 +320,14 @@ The aim of the SVM algorithm is to find a hyperplane in a
317320
$p$-dimensional space, where $p$ is the number of features that
318321
distinctly classifies the data points.
319322

320-
In a $p$-dimensional space, a hyperplane is what we call an affine subspace of dimension of $p-1$.
321-
As an example, in two dimension, a hyperplane is simply as straight line while in three dimensions it is
322-
a two-dimensional subspace, or stated simply, a plane.
323+
In a $p$-dimensional space, a hyperplane is what we call an affine
324+
subspace of dimension $p-1$. As an example, in two dimensions, a
325+
hyperplane is simply a straight line, while in three dimensions it is
326+
a two-dimensional subspace, or stated simply, a plane.
327+
328+
329+
!split
330+
===== Two-dimensional case =====
323331

324332
In two dimensions, with the variables $x_1$ and $x_2$, the hyperplane is defined as
325333
!bt
@@ -357,6 +365,10 @@ of dimension $n\times p$, where $n$ represents the observations for each feature
357365
\bm{x}_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \dots \\ \dots \\ x_{ip} \end{bmatrix}.
358366
\]
359367
!et
368+
369+
!split
370+
===== More details =====
371+
360372
If the above condition is not met for a given vector $\bm{x}_i$ we have
361373
!bt
362374
\[
@@ -380,7 +392,9 @@ y_i\left(b+w_1x_{i1}+w_2x_{i2}+\dots +w_px_{ip}\right) > 0.
380392
\]
381393
!et
382394

383-
When we try to separate hyperplanes, if it exists, we can use it to construct a natural classifier: a test observation is assigned a given class depending on which side of the hyperplane it is located.
395+
If a separating hyperplane exists, we can use it to
396+
construct a natural classifier: a test observation is assigned a given
397+
class depending on which side of the hyperplane it is located.
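As a minimal illustration of this rule (a sketch of our own with a hypothetical weight vector $\bm{w}$ and intercept $b$, not part of the lecture code), the predicted class is simply the sign of $b+\bm{w}^T\bm{x}$:
!bc pycod
# Minimal sketch of the natural classifier: assign a test point to a class
# according to which side of the hyperplane b + w^T x = 0 it falls on.
# The weights w and intercept b below are hypothetical values.
import numpy as np

def predict(x, w, b):
    return np.sign(b + w @ x)

w = np.array([1.0, 1.0])   # hypothetical hyperplane x_1 + x_2 = 0
b = 0.0
print(predict(np.array([1.0, 2.0]), w, b))    # +1, above the line
print(predict(np.array([-2.0, 0.5]), w, b))   # -1, below the line
!ec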
384398

385399
!split
386400
===== The two-dimensional case =====
@@ -394,6 +408,9 @@ data points of both classes. Maximizing the margin distance provides
394408
some reinforcement so that future data points can be classified with
395409
more confidence.
396410

411+
!split
412+
===== Linear classifier =====
413+
397414
What a linear classifier attempts to accomplish is to split the
398415
feature space into two half spaces by placing a hyperplane between the
399416
data points. This hyperplane will be our decision boundary. All
@@ -438,7 +455,10 @@ C(\bm{w},b) = -\sum_{i\in M} y_i(\bm{w}^T\bm{x}_i+b).
438455
\]
439456
!et
440457

441-
We could now for example define all values $y_i =1$ as misclassified in case we have $\bm{w}^T\bm{x}_i+b < 0$ and the opposite if we have $y_i=-1$. Taking the derivatives gives us
458+
We could now, for example, define all observations with $y_i =1$ as misclassified
459+
if $\bm{w}^T\bm{x}_i+b < 0$, and the opposite if we have
460+
$y_i=-1$. Taking the derivatives gives us
461+
442462
!bt
443463
\[
444464
\frac{\partial C}{\partial b} = -\sum_{i\in M} y_i,
@@ -469,22 +489,18 @@ and
469489
where $\eta$ is our by now well-known learning rate.
470490

471491

472-
!split
473-
===== Code Example =====
474-
475-
The equations we discussed above can be coded rather easily (the
476-
framework is similar to what we developed for logistic
477-
regression). We are going to set up a simple case with two classes only and we want to find a line which separates them the best possible way.
478-
!bc pycod
479492

480-
!ec
481493

482494
!split
483495
===== Problems with the Simpler Approach =====
484496

497+
The equations we discussed above can be coded rather easily (the
498+
framework is similar to what has been developed for, say, logistic
499+
regression).
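A minimal sketch of such a code could look as follows (our own illustration of the gradient expressions above, using synthetic data with two well-separated classes; it is not the course implementation):
!bc pycod
# Sketch: gradient descent on the misclassification cost
# C(w,b) = -sum_{i in M} y_i (w^T x_i + b), with M the set of misclassified points.
import numpy as np

rng = np.random.default_rng(0)
# two well-separated classes in the plane, labels +1 and -1
X = np.vstack([rng.normal([2, 2], 0.5, size=(50, 2)),
               rng.normal([-2, -2], 0.5, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

w, b, eta = np.zeros(2), 0.0, 0.01       # eta is the learning rate
for epoch in range(100):
    M = y * (X @ w + b) <= 0             # currently misclassified points
    if not M.any():
        break                            # all points on the correct side
    # gradient steps from dC/dw = -sum y_i x_i and dC/db = -sum y_i over M
    w += eta * (y[M][:, None] * X[M]).sum(axis=0)
    b += eta * y[M].sum()
print("w =", w, "b =", b)
!ec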
500+
485501

486502
There are however problems with this approach, although it looks
487-
pretty straightforward to implement. When running the above code, we see that we can easily end up with many diffeent lines which separate the two classes.
503+
pretty straightforward to implement. When running such code, we see that we can easily end up with many different lines that separate the two classes.
488504

489505

490506
For small
@@ -509,7 +525,11 @@ y_i(\bm{w}^T\bm{x}_i+b) \geq M \hspace{0.1cm}\forall i=1,2,\dots, p.
509525
!et
510526
All points are thus at a signed distance from the decision boundary defined by the line $L$. The parameters $b$, $w_1$, and $w_2$ define this line.
511527

512-
We seek thus the largest value $M$ defined by
528+
!split
529+
===== Largest value $M$ =====
530+
531+
532+
We seek the largest value $M$ defined by
513533
!bt
514534
\[
515535
\frac{1}{\vert \vert \bm{w}\vert\vert}y_i(\bm{w}^T\bm{x}_i+b) \geq M \hspace{0.1cm}\forall i=1,2,\dots, n,
@@ -556,6 +576,10 @@ due to
556576
df = \frac{\partial f}{\partial x}dx+\frac{\partial f}{\partial y}dy+\frac{\partial f}{\partial z}dz.
557577
\]
558578
!et
579+
580+
!split
581+
===== Not all variables are independent of each other =====
582+
559583
In many problems the variables $x,y,z$ are often subject to constraints (such as those above for the margin)
560584
so that they are no longer all independent. It is possible at least in principle to use each
561585
constraint to eliminate one variable
@@ -575,6 +599,10 @@ the variables $x,y,z$
575599
d\phi = \frac{\partial \phi}{\partial x}dx+\frac{\partial \phi}{\partial y}dy+\frac{\partial \phi}{\partial z}dz =0.
576600
\]
577601
!et
602+
603+
!split
604+
===== Only two independent variables =====
605+
578606
Now we can no longer set
579607
!bt
580608
\[
@@ -610,6 +638,10 @@ Our multiplier is chosen so that
610638
\]
611639
!et
612640

641+
642+
!split
643+
===== More details =====
644+
613645
We need to remember that we took $dx$ and $dy$ to be arbitrary and thus we must have
614646
!bt
615647
\[
@@ -634,7 +666,7 @@ If we have a set of constraints $\phi_k$ we have the equations
634666
!et
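As a concrete toy illustration of these conditions (our own example, not part of the notes), we can let _sympy_ solve the equations $\partial f/\partial x_n+\lambda\partial \phi/\partial x_n=0$ together with the constraint for a simple choice of $f$ and $\phi$:
!bc pycod
# Toy example of the Lagrange-multiplier conditions:
# extremize f(x,y) = x*y subject to phi(x,y) = x + y - 4 = 0.
import sympy as sp

x, y, lam = sp.symbols('x y lambda_')
f = x * y
phi = x + y - 4

eqs = [sp.diff(f, x) + lam * sp.diff(phi, x),   # df/dx + lambda dphi/dx = 0
       sp.diff(f, y) + lam * sp.diff(phi, y),   # df/dy + lambda dphi/dy = 0
       phi]                                     # the constraint itself
print(sp.solve(eqs, [x, y, lam]))               # {x: 2, y: 2, lambda_: -2}
!ec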
635667

636668
!split
637-
===== Setting up the Problem =====
669+
===== Setting up the problem =====
638670
In order to solve the above problem, we define the following Lagrangian function to be minimized
639671
!bt
640672
\[
@@ -643,6 +675,10 @@ In order to solve the above problem, we define the following Lagrangian function
643675
!et
644676
where $\lambda_i$ is a so-called Lagrange multiplier subject to the condition $\lambda_i \geq 0$.
645677

678+
679+
!split
680+
===== Setting up derivatives =====
681+
646682
Taking the derivatives with respect to $b$ and $\bm{w}$ we obtain
647683
!bt
648684
\[
@@ -662,6 +698,10 @@ Inserting these constraints into the equation for ${\cal L}$ we obtain
662698
\]
663699
!et
664700
subject to the constraints $\lambda_i\geq 0$ and $\sum_i\lambda_iy_i=0$.
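A sketch of how this dual problem can be solved numerically (our own illustration with the _CVXOPT_ library, which is also mentioned later in these notes; the helper name solve_dual is hypothetical): with $P_{ij}=y_iy_j\bm{x}_i^T\bm{x}_j$ and $q_i=-1$ it becomes a standard quadratic program.
!bc pycod
# Sketch: the dual problem as the QP  min 1/2 lambda^T P lambda + q^T lambda
# with P_ij = y_i y_j x_i^T x_j and q = -1,
# subject to lambda_i >= 0 and sum_i lambda_i y_i = 0.
import numpy as np
from cvxopt import matrix, solvers

def solve_dual(X, y):
    n = len(y)
    P = matrix(np.outer(y, y) * (X @ X.T))
    q = matrix(-np.ones(n))
    G = matrix(-np.eye(n))                       # -lambda_i <= 0, i.e. lambda_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))   # sum_i lambda_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])                    # the Lagrange multipliers lambda_i
!ec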
701+
702+
!split
703+
===== Karush-Kuhn-Tucker condition =====
704+
665705
We must in addition satisfy the "Karush-Kuhn-Tucker":"https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions" (KKT) condition
666706
!bt
667707
\[
@@ -724,6 +764,9 @@ or if we write it out in terms of the support vectors only, with $N_s$ being the
724764
b = \frac{1}{N_s}\sum_{j\in N_s}\left(y_j-\sum_{i=1}^n\lambda_iy_i\bm{x}_i^T\bm{x}_j\right).
725765
\]
726766
!et
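Once the multipliers $\lambda_i$ are available (for instance from a quadratic-programming solver as sketched above), the coefficients follow directly from the formulas above. A small sketch with helper names of our own choosing:
!bc pycod
# Sketch: recover w and b from the Lagrange multipliers and classify new points.
import numpy as np

def svm_coefficients(lambdas, y, X, tol=1e-8):
    w = (lambdas * y) @ X              # w = sum_i lambda_i y_i x_i
    sv = lambdas > tol                 # support vectors have lambda_i > 0
    b = np.mean(y[sv] - X[sv] @ w)     # average over the N_s support vectors
    return w, b

def classify(x, w, b):
    return np.sign(w @ x + b)
!ec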
767+
768+
!split
769+
===== Classifier equations =====
727770
With our hyperplane coefficients we can use our classifier to assign any observation by simply using
728771
!bt
729772
\[
@@ -756,9 +799,15 @@ y_i(\bm{w}^T\bm{x}_i+b)=1-\xi_i,
756799
\]
757800
!et
758801
with the requirement $\xi_i\geq 0$. The total violation is now $\sum_i\xi_i$.
759-
The value $\xi_i$ in the constraint the last constraint corresponds to the amount by which the prediction
760-
$y_i(\bm{w}^T\bm{x}_i+b)=1$ is on the wrong side of its margin. Hence by bounding the sum $\sum_i \xi_i$,
761-
we bound the total amount by which predictions fall on the wrong side of their margins.
802+
803+
!split
804+
===== Misclassification =====
805+
806+
The value $\xi_i$ in the last constraint corresponds to
807+
the amount by which the prediction $y_i(\bm{w}^T\bm{x}_i+b)$ is on
808+
the wrong side of its margin. Hence by bounding the sum $\sum_i
809+
\xi_i$, we bound the total amount by which predictions fall on the
810+
wrong side of their margins.
762811

763812
Misclassifications occur when $\xi_i > 1$. Thus bounding the total sum by some value $C$ bounds in turn the total number of
764813
misclassifications.
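In practice, soft-margin implementations expose a related penalty parameter, also called C, for instance in Scikit-Learn's svm.SVC; a small C tolerates more margin violations while a large C penalizes them heavily. A minimal sketch of our own, with synthetic overlapping classes:
!bc pycod
# Sketch: effect of the soft-margin penalty C on overlapping classes.
import numpy as np
from sklearn import svm

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.5, 1.0, size=(50, 2)),
               rng.normal(-1.5, 1.0, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

for C in (0.01, 1.0, 100.0):
    clf = svm.SVC(kernel='linear', C=C).fit(X, y)
    # typically fewer support vectors as C grows and the margin tightens
    print(C, clf.n_support_)
!ec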
@@ -781,6 +830,9 @@ y_i(\bm{w}^T\bm{x}_i+b)=1-\xi_i \hspace{0.1cm}\forall i,
781830
!et
782831
with the requirement $\xi_i\geq 0$.
783832

833+
!split
834+
===== Derivatives with respect to $b$ and $\bm{w}$ =====
835+
784836
Taking the derivatives with respect to $b$ and $\bm{w}$ we obtain
785837
!bt
786838
\[
@@ -799,6 +851,10 @@ and
799851
\lambda_i = C-\gamma_i \hspace{0.1cm}\forall i.
800852
\]
801853
!et
854+
855+
!split
856+
===== New constraints =====
857+
802858
Inserting these constraints into the equation for ${\cal L}$ we obtain the same equation as before
803859
!bt
804860
\[
@@ -897,7 +953,8 @@ plt.show()
897953
!split
898954
===== The equations =====
899955

900-
Suppose we define a polynomial transformation of degree two only (we continue to live in a plane with $x_i$ and $y_i$ as variables)
956+
Suppose we define a polynomial transformation of degree two only (we
957+
continue to live in a plane with $x_i$ and $y_i$ as variables)
901958
!bt
902959
\[
903960
z = \phi(x_i) =\left(x_i^2, y_i^2, \sqrt{2}x_iy_i\right).
@@ -917,6 +974,10 @@ y_i(\bm{w}^T\bm{z}_i+b)= 1 \hspace{0.1cm}\forall i,
917974
\]
918975
!et
919976
from which we also find $b$.
977+
978+
!split
979+
===== Defining the kernel =====
980+
920981
To compute $\bm{z}_i^T\bm{z}_j$ we define the kernel $K(\bm{x}_i,\bm{x}_j)$ as
921982
!bt
922983
\[
@@ -930,6 +991,9 @@ K(\bm{x}_i,\bm{x}_j)=[x_i^2, y_i^2, \sqrt{2}x_iy_i]^T\begin{bmatrix} x_j^2 \\ y_
930991
\]
931992
!et
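A quick numerical check of this definition (a sketch of our own): the explicit degree-two feature map $\phi$ gives the same number as the kernel evaluated directly in the original variables.
!bc pycod
# Check that phi(x_i)^T phi(x_j) equals (x_i^T x_j)^2
# for phi(x) = (x_1^2, x_2^2, sqrt(2) x_1 x_2).
import numpy as np

def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])
print(phi(xi) @ phi(xj))   # 1.0, dot product in feature space
print((xi @ xj) ** 2)      # 1.0, kernel in the original space
!ec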
932993

994+
!split
995+
===== Kernel trick =====
996+
933997
We note that this is nothing but the squared dot product of the two original
934998
vectors, $(\bm{x}_i^T\bm{x}_j)^2$. Instead of computing the
935999
product $\bm{z}_i^T\bm{z}_j$ in the Lagrangian, we simply compute
@@ -965,6 +1029,8 @@ subject to $\bm{y}^T\bm{\lambda}=0$. Here we defined the vectors $\bm{\lambda} =
9651029
$\bm{y}=[y_1,y_2,\dots,y_n]$.
9661030
If we add the slack constants this leads to the additional constraint $0\leq \lambda_i \leq C$.
9671031

1032+
!split
1033+
===== Convex optimization =====
9681034
We can rewrite this (see the solutions below) in terms of a convex optimization problem of the type
9691035
!bt
9701036
\begin{align*}
@@ -978,7 +1044,7 @@ $0\leq \lambda_i$ and $\lambda_i \leq C$. These two inequalities define then the
9781044

9791045

9801046
!split
981-
===== Different kernels and Mercer's theorem =====
1047+
===== Different kernels =====
9821048

9831049
There are several popular kernels being used. These are
9841050
o Linear: $K(\bm{x},\bm{y})=\bm{x}^T\bm{y}$,
@@ -987,6 +1053,10 @@ o Gaussian Radial Basis Function: $K(\bm{x},\bm{y})=\exp{\left(-\gamma\vert\vert
9871053
o Tanh: $K(\bm{x},\bm{y})=\tanh{(\bm{x}^T\bm{y}+\gamma)}$,
9881054
and many other ones.
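As a small sketch of our own (parameter names such as gamma follow common conventions rather than the notes), the kernels listed above are straightforward to write as functions:
!bc pycod
# Straightforward implementations of the kernels listed above.
import numpy as np

def linear_kernel(x, y):
    return x @ y

def rbf_kernel(x, y, gamma=1.0):
    # Gaussian radial basis function, exp(-gamma ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

def tanh_kernel(x, y, gamma=1.0):
    return np.tanh(x @ y + gamma)
!ec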
9891055

1056+
1057+
!split
1058+
===== Mercer's theorem =====
1059+
9901060
An important theorem for us is "Mercer's
9911061
theorem":"https://en.wikipedia.org/wiki/Mercer%27s_theorem". The
9921062
theorem states that if a kernel function $K$ is symmetric, continuous
@@ -1218,8 +1288,9 @@ subject to some constraints for say a selected set $i=1,2,\dots, n$.
12181288
In our case we are optimizing with respect to the Lagrangian multipliers $\lambda_i$, and the
12191289
vector $\bm{\lambda}=[\lambda_1, \lambda_2,\dots, \lambda_n]$ is the optimization variable we are dealing with.
12201290

1221-
In our case we are particularly interested in a class of optimization problems called convex optmization problems.
1222-
In our discussion on gradient descent methods we discussed at length the definition of a convex function.
1291+
In our case we are particularly interested in a class of optimization
1292+
problems called convex optimization problems.
1293+
12231294

12241295
Convex optimization problems play a central role in applied mathematics and we strongly recommend "Boyd and Vandenberghe's text on the topic":"http://web.stanford.edu/~boyd/cvxbook/".
12251296

@@ -1268,6 +1339,10 @@ Let us show how to perform the optmization using a simple case. Assume we want t
12681339
&3x+4y \leq 80. \\ \nonumber
12691340
\end{align*}
12701341
!et
1342+
1343+
!split
1344+
===== Rewriting in terms of vectors and matrices =====
1345+
12711346
The minimization problem can be rewritten in terms of vectors and matrices as (with $x$ and $y$ being the unknowns)
12721347
!bt
12731348
\[
@@ -1280,6 +1355,10 @@ Similarly, we can now set up the inequalities (we need to change $\geq$ to $\leq
12801355
\begin{bmatrix} -1 & 0 \\ 0 & -1 \\ -1 & -3 \\ 2 & 5 \\ 3 & 4\end{bmatrix}\begin{bmatrix} x \\ y\end{bmatrix} \preceq \begin{bmatrix}0 \\ 0\\ -15 \\ 100 \\ 80\end{bmatrix}.
12811356
\]
12821357
!et
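As a side note (a sketch of our own), this inequality block maps directly onto the calling convention of the _CVXOPT_ library used below; the objective here is assumed purely for illustration, since the actual objective of the example enters elsewhere in the notes.
!bc pycod
# Sketch: passing the constraint block G [x, y]^T <= h above to CVXOPT.
# The linear objective min -x - y is an assumption made only for this illustration.
from cvxopt import matrix, solvers

G = matrix([[-1.0, 0.0, -1.0, 2.0, 3.0],    # CVXOPT matrices are filled column by column
            [ 0.0, -1.0, -3.0, 5.0, 4.0]])
h = matrix([0.0, 0.0, -15.0, 100.0, 80.0])
c = matrix([-1.0, -1.0])                    # assumed objective: minimize -x - y

sol = solvers.lp(c, G, h)
print(sol['x'])                             # optimal (x, y) for this toy objective
!ec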
1358+
1359+
!split
1360+
===== Rewriting inequalities =====
1361+
12831362
We have collapsed all the inequalities into a single matrix $\bm{G}$. We see also that our matrix
12841363
!bt
12851364
\[
@@ -1337,9 +1416,6 @@ Using the _CVXOPT_ library, the matrix $P$ would then be defined by the above ma
13371416

13381417

13391418
!split
1340-
===== Support vector machines for regression =====
1341-
1342-
Material will be added here.
1343-
1344-
1345-
1419+
===== Plans for next week =====
1420+
o Discussion of quantum support vector machines
1421+
o Introducing quantum neural networks
