If a separating hyperplane exists, we can use it to construct a natural classifier: a test observation is assigned a given class depending on which side of the hyperplane it is located.
!split
===== The two-dimensional case =====
Our objective is to find a plane that has the maximum margin, that is the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
!split
===== Linear classifier =====
What a linear classifier attempts to accomplish is to split the feature space into two half spaces by placing a hyperplane between the data points. This hyperplane will be our decision boundary. All points on one side of the hyperplane will belong to one class and all points on the other side to the other class.
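As a small illustration of this decision rule (the values of $\bm{w}$ and $b$ below are chosen arbitrarily, just to have something concrete to evaluate), the class assignment is simply the sign of $\bm{w}^T\bm{x}+b$:

!bc pycod
import numpy as np

# an illustrative hyperplane w^T x + b = 0 with arbitrarily chosen parameters
w = np.array([1.0, -2.0])
b = 0.5

def predict(x):
    # class +1 on one side of the hyperplane, class -1 on the other
    return np.sign(x @ w + b)

print(predict(np.array([3.0, 1.0])))   # 1.0
print(predict(np.array([-1.0, 2.0])))  # -1.0
!ec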
We could now for example define all values $y_i =1$ as misclassified in case we have $\bm{w}^T\bm{x}_i+b < 0$ and the opposite if we have $y_i=-1$. With $M$ denoting the set of misclassified points, we can define the cost function $C(\bm{w},b)=-\sum_{i\in M} y_i(\bm{w}^T\bm{x}_i+b)$. Taking the derivatives gives us
!bt
\[
\frac{\partial C}{\partial b} = -\sum_{i\in M} y_i,
\]
!et
and
!bt
\[
\frac{\partial C}{\partial \bm{w}} = -\sum_{i\in M} y_i\bm{x}_i.
\]
!et
Stepping through the misclassified points, the gradient descent updates read
!bt
\[
\begin{bmatrix} b \\ \bm{w} \end{bmatrix} \leftarrow \begin{bmatrix} b \\ \bm{w} \end{bmatrix} + \eta \begin{bmatrix} y_i \\ y_i\bm{x}_i \end{bmatrix},
\]
!et
where $\eta$ is our by now well-known learning rate.
!split
===== Problems with the Simpler Approach =====
The equations we discussed above can be coded rather easily (the framework is similar to what has been developed for say logistic regression).
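As a minimal sketch of such a code, we set up a simple case with two classes only and look for a line which separates them. The synthetic data set, the learning rate and the number of epochs below are chosen purely for illustration:

!bc pycod
import numpy as np

np.random.seed(1)
# two (linearly separable) classes in two dimensions, labels +1 and -1
n = 50
X = np.vstack([np.random.randn(n, 2) + 2.0, np.random.randn(n, 2) - 2.0])
y = np.hstack([np.ones(n), -np.ones(n)])

w = np.zeros(2)
b = 0.0
eta = 0.1   # learning rate
for epoch in range(100):
    for xi, yi in zip(X, y):
        # a point is misclassified when y_i (w^T x_i + b) <= 0
        if yi * (xi @ w + b) <= 0:
            # gradient step from the derivatives over the misclassified points
            w += eta * yi * xi
            b += eta * yi

print("w =", w, ", b =", b)
print("misclassified points:", np.sum(np.sign(X @ w + b) != y))
!ec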
There are however problems with this approach, although it looks pretty straightforward to implement. When running such a code, we see that we can easily end up with many different lines which separate the two classes.
With the conditions
!bt
\[
y_i(\bm{w}^T\bm{x}_i+b) \geq M \hspace{0.1cm}\forall i=1,2,\dots, p,
\]
!et
all points are thus at a signed distance from the decision boundary defined by the line $L$. The parameters $b$, $w_1$ and $w_2$ define this line.
!split
===== Largest value $M$ =====

We seek the largest value $M$ defined by
!bt
\[
\frac{1}{\vert \vert \bm{w}\vert\vert}y_i(\bm{w}^T\bm{x}_i+b) \geq M \hspace{0.1cm}\forall i=1,2,\dots, n,
\]
!et
!split
===== The kernel trick =====

We note that the product $\bm{z}_i^T\bm{z}_j$ is nothing but the dot product of the two original vectors squared, $(\bm{x}_i^T\bm{x}_j)^2$. Instead of thus computing the product in the Lagrangian of $\bm{z}_i^T\bm{z}_j$, we simply compute the dot product $(\bm{x}_i^T\bm{x}_j)^2$.
The optimization problem for the multipliers is subject to the constraint $\bm{y}^T\bm{\lambda}=0$. Here we defined the vectors $\bm{\lambda} =[\lambda_1,\lambda_2,\dots,\lambda_n]$ and $\bm{y}=[y_1,y_2,\dots,y_n]$. If we add the slack constants this leads to the additional constraint $0\leq \lambda_i \leq C$.
!split
===== Convex optimization =====

We can rewrite this (see the solutions below) in terms of a convex optimization problem of the type
!bt
\begin{align*}
\min_{\bm{\lambda}}\hspace{0.2cm} & \frac{1}{2}\bm{\lambda}^T\bm{P}\bm{\lambda}-\sum_i\lambda_i, \\ \nonumber
\mathrm{subject\hspace{0.1cm}to}\hspace{0.2cm} & \bm{y}^T\bm{\lambda}=0,
\end{align*}
!et
where the matrix $\bm{P}$ has elements $P_{ij}=y_iy_j\bm{x}_i^T\bm{x}_j$, together with the inequality constraints
$0\leq \lambda_i$ and $\lambda_i \leq C$. These two inequalities define then the feasible region for the multipliers $\lambda_i$.
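In practice we seldom solve this quadratic program by hand. As an illustration (the synthetic data set and the parameter values below are made up for the example), Scikit-Learn's support vector classifier solves the dual problem for us and exposes the support vectors and the signed multipliers, with the parameter $C$ playing exactly the role of the upper bound on the $\lambda_i$:

!bc pycod
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# C is the upper bound on the Lagrange multipliers, 0 <= lambda_i <= C
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("number of support vectors:", len(clf.support_))
print("dual coefficients y_i*lambda_i:\n", clf.dual_coef_)
print("w =", clf.coef_, ", b =", clf.intercept_)
!ec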
!split
===== Different kernels =====
There are several popular kernels being used. These are
o Linear: $K(\bm{x},\bm{y})=\bm{x}^T\bm{y}$,
o Polynomial: $K(\bm{x},\bm{y})=(\bm{x}^T\bm{y}+\gamma)^d$,
o Gaussian Radial Basis Function: $K(\bm{x},\bm{y})=\exp{\left(-\gamma\vert\vert\bm{x}-\bm{y}\vert\vert^2\right)}$,
o Tanh: $K(\bm{x},\bm{y})=\tanh{(\bm{x}^T\bm{y}+\gamma)}$,
and many other ones.
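As a simple sketch (the helper function names, the value of $\gamma$ and the degree $d$ are chosen only for illustration), the kernels listed above and the corresponding kernel (Gram) matrix can be coded directly as

!bc pycod
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, gamma=1.0, d=2):
    return (x @ y + gamma) ** d

def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def tanh_kernel(x, y, gamma=1.0):
    return np.tanh(x @ y + gamma)

def gram_matrix(X, kernel):
    # K_ij = K(x_i, x_j) for all pairs of data points
    n = X.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

np.random.seed(0)
X = np.random.randn(5, 2)
print(gram_matrix(X, lambda x, y: rbf_kernel(x, y, gamma=0.5)))
!ec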
!split
===== Mercer's theorem =====

An important theorem for us is "Mercer's theorem":"https://en.wikipedia.org/wiki/Mercer%27s_theorem". The theorem states that if a kernel function $K$ is symmetric, continuous and leads to a positive semi-definite kernel matrix, then there exists a function $\phi$ that maps $\bm{x}_i$ and $\bm{x}_j$ into another (possibly much higher-dimensional) space such that $K(\bm{x}_i,\bm{x}_j)=\phi(\bm{x}_i)^T\phi(\bm{x}_j)$.
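We can illustrate the positive semi-definiteness numerically. In this small sketch (random data and an arbitrary $\gamma$), the kernel matrix built from the Gaussian RBF kernel should have no negative eigenvalues beyond round-off errors:

!bc pycod
import numpy as np

np.random.seed(0)
X = np.random.randn(10, 3)
gamma = 0.5

# Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2) for the RBF kernel
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)

print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())
!ec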
!split
===== The optimization problem =====

The problems we need to solve are of the type where we optimize a given objective function subject to some constraints for say a selected set $i=1,2,\dots, n$. In our case we are optimizing with respect to the Lagrangian multipliers $\lambda_i$, and the vector $\bm{\lambda}=[\lambda_1, \lambda_2,\dots, \lambda_n]$ is the optimization variable we are dealing with.
In our case we are particularly interested in a class of optimization problems called convex optimization problems.
Convex optimization problems play a central role in applied mathematics and we recommend strongly "Boyd and Vandenberghe's text on the topics":"http://web.stanford.edu/~boyd/cvxbook/".
Let us show how to perform the optimization using a simple case. Assume we want to minimize a given objective function subject to a set of linear inequality constraints of the type
!bt
\begin{align*}
&3x+4y \leq 80. \\ \nonumber
\end{align*}
!et
!split
===== Rewriting in terms of vectors and matrices =====

The minimization problem can be rewritten in terms of vectors and matrices, with $x$ and $y$ being the unknowns. Similarly, we can now set up the inequalities (we need to change $\geq$ to $\leq$) in the same matrix-vector form.
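As a minimal sketch of how such a problem is set up in practice with the CVXOPT library, assume for illustration the quadratic objective $f(x,y)=(x-10)^2+(y-20)^2$ (an assumed example objective, chosen only to make the constraint active) together with the inequality constraint $3x+4y\leq 80$ from above. In the standard form $\frac{1}{2}z^T P z + q^T z$ subject to $Gz\leq h$, with $z=(x,y)$, this becomes

!bc pycod
import numpy as np
from cvxopt import matrix, solvers

# objective (x-10)^2 + (y-20)^2 written as 0.5*z^T P z + q^T z (constant dropped)
P = matrix([[2.0, 0.0], [0.0, 2.0]])
q = matrix([-20.0, -40.0])

# inequality constraint 3x + 4y <= 80 written as G z <= h
G = matrix([[3.0], [4.0]])
h = matrix([80.0])

solvers.options['show_progress'] = False
sol = solvers.qp(P, q, G, h)
print(np.array(sol['x']).flatten())   # approximately [6.4, 15.2]
!ec

The matrices $P$, $q$, $G$ and $h$ would of course have to be adapted to the actual objective function and the full set of constraints of the problem at hand.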