
Commit 35c4569

committed: updates
1 parent d4db310 commit 35c4569

37 files changed: +5815 −5344 lines

doc/BookChapters/chapter1.dlog

Lines changed: 10 additions & 0 deletions
@@ -87,3 +87,13 @@ found info about 5 exercises
 
 *** warning: latex envir \begin{bmatrix} does not work well in Markdown. Stick to \[ ... \], equation, equation*, align, or align* environments in math environments.
 output in chapter1.ipynb
+Translating doconce text in chapter1.do.txt to ipynb
+*** replacing \bm{...} by \boldsymbol{...} (\bm is not supported by MathJax)
+found info about 5 exercises
+
+*** warning: latex envir \begin{bmatrix} does not work well in Markdown. Stick to \[ ... \], equation, equation*, align, or align* environments in math environments.
+
+*** warning: latex envir \begin{bmatrix} does not work well in Markdown. Stick to \[ ... \], equation, equation*, align, or align* environments in math environments.
+
+*** warning: latex envir \begin{bmatrix} does not work well in Markdown. Stick to \[ ... \], equation, equation*, align, or align* environments in math environments.
+output in chapter1.ipynb

doc/BookChapters/chapter1.do.txt

Lines changed: 100 additions & 115 deletions
Large diffs are not rendered by default.

doc/BookChapters/chapter2.do.txt

Lines changed: 188 additions & 188 deletions
Large diffs are not rendered by default.

doc/BookChapters/chapter3.do.txt

Lines changed: 129 additions & 129 deletions
Large diffs are not rendered by default.

doc/BookChapters/chapter4.do.txt

Lines changed: 32 additions & 32 deletions
@@ -12,7 +12,7 @@ independent variables $x_i$. Linear regression resulted in
 analytical expressions for standard ordinary Least Squares or Ridge
 regression (in terms of matrices to invert) for several quantities,
 ranging from the variance and thereby the confidence intervals of the
-optimal parameters $\hat{\beta}$ to the mean squared error. If we can invert
+optimal parameters $\hat{\theta}$ to the mean squared error. If we can invert
 the product of the design matrices, linear regression gives then a
 simple recipe for fitting our data.
 
@@ -37,7 +37,7 @@ failure etc.
 Logistic regression will also serve as our stepping stone towards
 neural network algorithms and supervised deep learning. For logistic
 learning, the minimization of the cost function leads to a non-linear
-equation in the parameters $\hat{\beta}$. The optimization of the
+equation in the parameters $\hat{\theta}$. The optimization of the
 problem calls therefore for minimization algorithms. This forms the
 bottle neck of all machine learning algorithms, namely how to find
 reliable minima of a multi-variable function. This leads us to the
@@ -86,11 +86,11 @@ We would then have our
 weighted linear combination, namely
 !bt
 \begin{equation}
-\bm{y} = \bm{X}^T\bm{\beta} + \bm{\epsilon},
+\bm{y} = \bm{X}^T\bm{\theta} + \bm{\epsilon},
 \end{equation}
 !et
 where $\bm{y}$ is a vector representing the possible outcomes, $\bm{X}$ is our
-$n\times p$ design matrix and $\bm{\beta}$ represents our estimators/predictors.
+$n\times p$ design matrix and $\bm{\theta}$ represents our estimators/predictors.
 
 
 The main problem with our function is that it takes values on the
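As a quick illustration of the weighted linear combination in the hunk above, here is a minimal Python sketch that builds a small polynomial design matrix and forms $\bm{y}$ from $\bm{\theta}$ plus a noise term. The sizes, parameter values and noise level are illustrative assumptions, not taken from the book's code, and the product is written as X @ theta for an $n\times p$ design matrix.

!bc pycod
import numpy as np

# Illustrative sketch of y = X theta + epsilon (values are not from the chapter)
n, p = 100, 3
x = np.linspace(0, 1, n)
X = np.column_stack([x**j for j in range(p)])   # design matrix columns: 1, x, x^2
theta = np.array([1.0, -2.0, 0.5])              # hypothetical parameters
epsilon = 0.1 * np.random.randn(n)              # Gaussian noise term
y = X @ theta + epsilon                         # vector of outcomes
!ec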
@@ -186,7 +186,7 @@ We are now trying to find a function $f(y\vert x)$, that is a function which giv
 In standard linear regression with a linear dependence on $x$, we would write this in terms of our model
 !bt
 \[
-f(y_i\vert x_i)=\beta_0+\beta_1 x_i.
+f(y_i\vert x_i)=\theta_0+\theta_1 x_i.
 \]
 !et
 
@@ -291,19 +291,19 @@ plt.show()
 
 
 
-We assume now that we have two classes with $y_i$ either $0$ or $1$. Furthermore we assume also that we have only two parameters $\beta$ in our fitting of the Sigmoid function, that is we define probabilities
+We assume now that we have two classes with $y_i$ either $0$ or $1$. Furthermore we assume also that we have only two parameters $\theta$ in our fitting of the Sigmoid function, that is we define probabilities
 !bt
 \begin{align*}
-p(y_i=1|x_i,\bm{\beta}) &= \frac{\exp{(\beta_0+\beta_1x_i)}}{1+\exp{(\beta_0+\beta_1x_i)}},\nonumber\\
-p(y_i=0|x_i,\bm{\beta}) &= 1 - p(y_i=1|x_i,\bm{\beta}),
+p(y_i=1|x_i,\bm{\theta}) &= \frac{\exp{(\theta_0+\theta_1x_i)}}{1+\exp{(\theta_0+\theta_1x_i)}},\nonumber\\
+p(y_i=0|x_i,\bm{\theta}) &= 1 - p(y_i=1|x_i,\bm{\theta}),
 \end{align*}
 !et
-where $\bm{\beta}$ are the weights we wish to extract from data, in our case $\beta_0$ and $\beta_1$.
+where $\bm{\theta}$ are the weights we wish to extract from data, in our case $\theta_0$ and $\theta_1$.
 
 Note that we used
 !bt
 \[
-p(y_i=0\vert x_i, \bm{\beta}) = 1-p(y_i=1\vert x_i, \bm{\beta}).
+p(y_i=0\vert x_i, \bm{\theta}) = 1-p(y_i=1\vert x_i, \bm{\theta}).
 \]
 !et
 
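The two-parameter Sigmoid probabilities defined in this hunk can be written as a short Python sketch; the function names p1 and p0 and the parameter values theta0=0, theta1=1 below are purely illustrative.

!bc pycod
import numpy as np

def p1(x, theta0, theta1):
    # p(y=1 | x, theta) for the two-parameter Sigmoid model
    t = theta0 + theta1 * x
    return np.exp(t) / (1.0 + np.exp(t))

def p0(x, theta0, theta1):
    # p(y=0 | x, theta) = 1 - p(y=1 | x, theta)
    return 1.0 - p1(x, theta0, theta1)

# Illustrative parameters; the probabilities approach 0 and 1 at the ends of the grid
x = np.linspace(-5, 5, 11)
print(p1(x, theta0=0.0, theta1=1.0))
!ec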
@@ -316,86 +316,86 @@ the probability of seeing the observed data. We can then approximate the
 likelihood in terms of the product of the individual probabilities of a specific outcome $y_i$, that is
 !bt
 \begin{align*}
-P(\mathcal{D}|\bm{\beta})& = \prod_{i=1}^n \left[p(y_i=1|x_i,\bm{\beta})\right]^{y_i}\left[1-p(y_i=1|x_i,\bm{\beta}))\right]^{1-y_i}\nonumber \\
+P(\mathcal{D}|\bm{\theta})& = \prod_{i=1}^n \left[p(y_i=1|x_i,\bm{\theta})\right]^{y_i}\left[1-p(y_i=1|x_i,\bm{\theta}))\right]^{1-y_i}\nonumber \\
 \end{align*}
 !et
 from which we obtain the log-likelihood and our _cost/loss_ function
 !bt
 \[
-\mathcal{C}(\bm{\beta}) = \sum_{i=1}^n \left( y_i\log{p(y_i=1|x_i,\bm{\beta})} + (1-y_i)\log\left[1-p(y_i=1|x_i,\bm{\beta}))\right]\right).
+\mathcal{C}(\bm{\theta}) = \sum_{i=1}^n \left( y_i\log{p(y_i=1|x_i,\bm{\theta})} + (1-y_i)\log\left[1-p(y_i=1|x_i,\bm{\theta}))\right]\right).
 \]
 !et
 
 
 Reordering the logarithms, we can rewrite the _cost/loss_ function as
 !bt
 \[
-\mathcal{C}(\bm{\beta}) = \sum_{i=1}^n \left(y_i(\beta_0+\beta_1x_i) -\log{(1+\exp{(\beta_0+\beta_1x_i)})}\right).
+\mathcal{C}(\bm{\theta}) = \sum_{i=1}^n \left(y_i(\theta_0+\theta_1x_i) -\log{(1+\exp{(\theta_0+\theta_1x_i)})}\right).
 \]
 !et
 
-The maximum likelihood estimator is defined as the set of parameters that maximize the log-likelihood where we maximize with respect to $\beta$.
+The maximum likelihood estimator is defined as the set of parameters that maximize the log-likelihood where we maximize with respect to $\theta$.
 Since the cost (error) function is just the negative log-likelihood, for logistic regression we have that
 !bt
 \[
-\mathcal{C}(\bm{\beta})=-\sum_{i=1}^n \left(y_i(\beta_0+\beta_1x_i) -\log{(1+\exp{(\beta_0+\beta_1x_i)})}\right).
+\mathcal{C}(\bm{\theta})=-\sum_{i=1}^n \left(y_i(\theta_0+\theta_1x_i) -\log{(1+\exp{(\theta_0+\theta_1x_i)})}\right).
 \]
 !et
 This equation is known in statistics as the _cross entropy_. Finally, we note that just as in linear regression,
 in practice we often supplement the cross-entropy with additional regularization terms, usually $L_1$ and $L_2$ regularization as we did for Ridge and Lasso regression.
 
 
-The cross entropy is a convex function of the weights $\bm{\beta}$ and,
+The cross entropy is a convex function of the weights $\bm{\theta}$ and,
 therefore, any local minimizer is a global minimizer.
 
 
 Minimizing this
-cost function with respect to the two parameters $\beta_0$ and $\beta_1$ we obtain
+cost function with respect to the two parameters $\theta_0$ and $\theta_1$ we obtain
 
 !bt
 \[
-\frac{\partial \mathcal{C}(\bm{\beta})}{\partial \beta_0} = -\sum_{i=1}^n \left(y_i -\frac{\exp{(\beta_0+\beta_1x_i)}}{1+\exp{(\beta_0+\beta_1x_i)}}\right),
+\frac{\partial \mathcal{C}(\bm{\theta})}{\partial \theta_0} = -\sum_{i=1}^n \left(y_i -\frac{\exp{(\theta_0+\theta_1x_i)}}{1+\exp{(\theta_0+\theta_1x_i)}}\right),
 \]
 !et
 and
 !bt
 \[
-\frac{\partial \mathcal{C}(\bm{\beta})}{\partial \beta_1} = -\sum_{i=1}^n \left(y_ix_i -x_i\frac{\exp{(\beta_0+\beta_1x_i)}}{1+\exp{(\beta_0+\beta_1x_i)}}\right).
+\frac{\partial \mathcal{C}(\bm{\theta})}{\partial \theta_1} = -\sum_{i=1}^n \left(y_ix_i -x_i\frac{\exp{(\theta_0+\theta_1x_i)}}{1+\exp{(\theta_0+\theta_1x_i)}}\right).
 \]
 !et
 
 
 Let us now define a vector $\bm{y}$ with $n$ elements $y_i$, an
 $n\times p$ matrix $\bm{X}$ which contains the $x_i$ values and a
-vector $\bm{p}$ of fitted probabilities $p(y_i\vert x_i,\bm{\beta})$. We can rewrite in a more compact form the first
+vector $\bm{p}$ of fitted probabilities $p(y_i\vert x_i,\bm{\theta})$. We can rewrite in a more compact form the first
 derivative of cost function as
 
 !bt
 \[
-\frac{\partial \mathcal{C}(\bm{\beta})}{\partial \bm{\beta}} = -\bm{X}^T\left(\bm{y}-\bm{p}\right).
+\frac{\partial \mathcal{C}(\bm{\theta})}{\partial \bm{\theta}} = -\bm{X}^T\left(\bm{y}-\bm{p}\right).
 \]
 !et
 
 If we in addition define a diagonal matrix $\bm{W}$ with elements
-$p(y_i\vert x_i,\bm{\beta})(1-p(y_i\vert x_i,\bm{\beta})$, we can obtain a compact expression of the second derivative as
+$p(y_i\vert x_i,\bm{\theta})(1-p(y_i\vert x_i,\bm{\theta})$, we can obtain a compact expression of the second derivative as
 
 !bt
 \[
-\frac{\partial^2 \mathcal{C}(\bm{\beta})}{\partial \bm{\beta}\partial \bm{\beta}^T} = \bm{X}^T\bm{W}\bm{X}.
+\frac{\partial^2 \mathcal{C}(\bm{\theta})}{\partial \bm{\theta}\partial \bm{\theta}^T} = \bm{X}^T\bm{W}\bm{X}.
 \]
 !et
 
 
 Within a binary classification problem, we can easily expand our model to include multiple predictors. Our ratio between likelihoods is then with $p$ predictors
 !bt
 \[
-\log{ \frac{p(\bm{\beta}\bm{x})}{1-p(\bm{\beta}\bm{x})}} = \beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_px_p.
+\log{ \frac{p(\bm{\theta}\bm{x})}{1-p(\bm{\theta}\bm{x})}} = \theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_px_p.
 \]
 !et
-Here we defined $\bm{x}=[1,x_1,x_2,\dots,x_p]$ and $\bm{\beta}=[\beta_0, \beta_1, \dots, \beta_p]$ leading to
+Here we defined $\bm{x}=[1,x_1,x_2,\dots,x_p]$ and $\bm{\theta}=[\theta_0, \theta_1, \dots, \theta_p]$ leading to
 !bt
 \[
-p(\bm{\beta}\bm{x})=\frac{ \exp{(\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_px_p)}}{1+\exp{(\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_px_p)}}.
+p(\bm{\theta}\bm{x})=\frac{ \exp{(\theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_px_p)}}{1+\exp{(\theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_px_p)}}.
 \]
 !et
 
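To make the cross-entropy cost, its gradient $-\bm{X}^T(\bm{y}-\bm{p})$ and the Hessian $\bm{X}^T\bm{W}\bm{X}$ from this hunk concrete, here is a hedged Python sketch with a single Newton-Raphson step. The synthetic data, the starting point and the helper names (sigmoid, cost, gradient, hessian) are illustrative assumptions, not the book's implementation.

!bc pycod
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cost(X, y, theta):
    # negative log-likelihood (cross entropy): -sum(y_i t_i - log(1 + exp(t_i)))
    t = X @ theta
    return -np.sum(y * t - np.log(1.0 + np.exp(t)))

def gradient(X, y, theta):
    # compact first derivative: -X^T (y - p)
    p = sigmoid(X @ theta)
    return -X.T @ (y - p)

def hessian(X, theta):
    # compact second derivative: X^T W X with W_ii = p_i (1 - p_i)
    p = sigmoid(X @ theta)
    W = np.diag(p * (1.0 - p))
    return X.T @ W @ X

# Synthetic data for illustration: design matrix [1, x_i] and binary outcomes
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = (rng.random(50) < sigmoid(X @ np.array([0.5, -1.0]))).astype(float)

# One Newton-Raphson step from theta = 0
theta = np.zeros(2)
theta = theta - np.linalg.solve(hessian(X, theta), gradient(X, y, theta))
print(theta, cost(X, y, theta))
!ec

When $\bm{X}^T\bm{W}\bm{X}$ is invertible, repeating this update is the standard Newton-Raphson iteration for minimizing the cross entropy, and the convexity noted above guarantees that any minimum it reaches is the global one.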
@@ -406,19 +406,19 @@ of simplicity assume we have only two predictors. We have then following model
 
 !bt
 \[
-\log{\frac{p(C=1\vert x)}{p(K\vert x)}} = \beta_{10}+\beta_{11}x_1,
+\log{\frac{p(C=1\vert x)}{p(K\vert x)}} = \theta_{10}+\theta_{11}x_1,
 \]
 !et
 and
 !bt
 \[
-\log{\frac{p(C=2\vert x)}{p(K\vert x)}} = \beta_{20}+\beta_{21}x_1,
+\log{\frac{p(C=2\vert x)}{p(K\vert x)}} = \theta_{20}+\theta_{21}x_1,
 \]
 !et
 and so on till the class $C=K-1$ class
 !bt
 \[
-\log{\frac{p(C=K-1\vert x)}{p(K\vert x)}} = \beta_{(K-1)0}+\beta_{(K-1)1}x_1,
+\log{\frac{p(C=K-1\vert x)}{p(K\vert x)}} = \theta_{(K-1)0}+\theta_{(K-1)1}x_1,
 \]
 !et
 
@@ -437,18 +437,18 @@ Bayes classifiers, and artificial neural networks. Specifically, in
 multinomial logistic regression and linear discriminant analysis, the
 input to the function is the result of $K$ distinct linear functions,
 and the predicted probability for the $k$-th class given a sample
-vector $\bm{x}$ and a weighting vector $\bm{\beta}$ is (with two
+vector $\bm{x}$ and a weighting vector $\bm{\theta}$ is (with two
 predictors):
 
 !bt
 \[
-p(C=k\vert \mathbf {x} )=\frac{\exp{(\beta_{k0}+\beta_{k1}x_1)}}{1+\sum_{l=1}^{K-1}\exp{(\beta_{l0}+\beta_{l1}x_1)}}.
+p(C=k\vert \mathbf {x} )=\frac{\exp{(\theta_{k0}+\theta_{k1}x_1)}}{1+\sum_{l=1}^{K-1}\exp{(\theta_{l0}+\theta_{l1}x_1)}}.
 \]
 !et
 It is easy to extend to more predictors. The final class is
 !bt
 \[
-p(C=K\vert \mathbf {x} )=\frac{1}{1+\sum_{l=1}^{K-1}\exp{(\beta_{l0}+\beta_{l1}x_1)}},
+p(C=K\vert \mathbf {x} )=\frac{1}{1+\sum_{l=1}^{K-1}\exp{(\theta_{l0}+\theta_{l1}x_1)}},
 \]
 !et
 
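The two class-probability formulas in this final hunk can be evaluated with a short Python sketch for one predictor $x_1$ and a reference class $K$; the function name class_probabilities, the array Theta of parameters $[\theta_{k0}, \theta_{k1}]$ and the example values are hypothetical.

!bc pycod
import numpy as np

def class_probabilities(x1, Theta):
    # Theta is a (K-1) x 2 array of [theta_k0, theta_k1]; class K is the reference
    scores = np.exp(Theta[:, 0] + Theta[:, 1] * x1)   # K-1 exponentiated linear functions
    denom = 1.0 + scores.sum()
    # p(C=k|x) for k = 1..K-1, followed by p(C=K|x) = 1/denom
    return np.append(scores / denom, 1.0 / denom)

# Example with K = 3 classes and illustrative parameters
Theta = np.array([[0.2, 1.0], [-0.5, 0.3]])
print(class_probabilities(0.7, Theta))
!ec

By construction the $K$ probabilities sum to one, which is what makes this a proper generalization of the binary Sigmoid model above.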