
Commit 35c4569

committed: updates
1 parent d4db310 commit 35c4569

37 files changed: +5815 −5344 lines

doc/BookChapters/chapter1.dlog

Lines changed: 10 additions & 0 deletions
@@ -87,3 +87,13 @@ found info about 5 exercises
 
 *** warning: latex envir \begin{bmatrix} does not work well in Markdown. Stick to \[ ... \], equation, equation*, align, or align* environments in math environments.
 output in chapter1.ipynb
+Translating doconce text in chapter1.do.txt to ipynb
+*** replacing \bm{...} by \boldsymbol{...} (\bm is not supported by MathJax)
+found info about 5 exercises
+
+*** warning: latex envir \begin{bmatrix} does not work well in Markdown. Stick to \[ ... \], equation, equation*, align, or align* environments in math environments.
+
+*** warning: latex envir \begin{bmatrix} does not work well in Markdown. Stick to \[ ... \], equation, equation*, align, or align* environments in math environments.
+
+*** warning: latex envir \begin{bmatrix} does not work well in Markdown. Stick to \[ ... \], equation, equation*, align, or align* environments in math environments.
+output in chapter1.ipynb

doc/BookChapters/chapter1.do.txt

Lines changed: 100 additions & 115 deletions
Large diffs are not rendered by default.

doc/BookChapters/chapter2.do.txt

Lines changed: 188 additions & 188 deletions
Large diffs are not rendered by default.

doc/BookChapters/chapter3.do.txt

Lines changed: 129 additions & 129 deletions
Large diffs are not rendered by default.

doc/BookChapters/chapter4.do.txt

Lines changed: 32 additions & 32 deletions
@@ -12,7 +12,7 @@ independent variables $x_i$. Linear regression resulted in
 analytical expressions for standard ordinary Least Squares or Ridge
 regression (in terms of matrices to invert) for several quantities,
 ranging from the variance and thereby the confidence intervals of the
-optimal parameters $\hat{\beta}$ to the mean squared error. If we can invert
+optimal parameters $\hat{\theta}$ to the mean squared error. If we can invert
 the product of the design matrices, linear regression gives then a
 simple recipe for fitting our data.
 
@@ -37,7 +37,7 @@ failure etc.
 Logistic regression will also serve as our stepping stone towards
 neural network algorithms and supervised deep learning. For logistic
 learning, the minimization of the cost function leads to a non-linear
-equation in the parameters $\hat{\beta}$. The optimization of the
+equation in the parameters $\hat{\theta}$. The optimization of the
 problem calls therefore for minimization algorithms. This forms the
 bottle neck of all machine learning algorithms, namely how to find
 reliable minima of a multi-variable function. This leads us to the
@@ -86,11 +86,11 @@ We would then have our
 weighted linear combination, namely
 !bt
 \begin{equation}
-\bm{y} = \bm{X}^T\bm{\beta} + \bm{\epsilon},
+\bm{y} = \bm{X}^T\bm{\theta} + \bm{\epsilon},
 \end{equation}
 !et
 where $\bm{y}$ is a vector representing the possible outcomes, $\bm{X}$ is our
-$n\times p$ design matrix and $\bm{\beta}$ represents our estimators/predictors.
+$n\times p$ design matrix and $\bm{\theta}$ represents our estimators/predictors.
 
 
 The main problem with our function is that it takes values on the
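As a quick illustration of the weighted linear combination in the hunk above, here is a minimal Python sketch that builds a small polynomial design matrix and forms $\bm{y}$ from $\bm{\theta}$ plus a noise term. The sizes, parameter values and noise level are illustrative assumptions, not taken from the book's code, and the product is written as X @ theta for an $n\times p$ design matrix.

!bc pycod
import numpy as np

# Illustrative sketch of y = X theta + epsilon (values are not from the chapter)
n, p = 100, 3
x = np.linspace(0, 1, n)
X = np.column_stack([x**j for j in range(p)])   # design matrix columns: 1, x, x^2
theta = np.array([1.0, -2.0, 0.5])              # hypothetical parameters
epsilon = 0.1 * np.random.randn(n)              # Gaussian noise term
y = X @ theta + epsilon                         # vector of outcomes
!ec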
@@ -186,7 +186,7 @@ We are now trying to find a function $f(y\vert x)$, that is a function which giv
 In standard linear regression with a linear dependence on $x$, we would write this in terms of our model
 !bt
 \[
-f(y_i\vert x_i)=\beta_0+\beta_1 x_i.
+f(y_i\vert x_i)=\theta_0+\theta_1 x_i.
 \]
 !et
 
@@ -291,19 +291,19 @@ plt.show()
 
 
 
-We assume now that we have two classes with $y_i$ either $0$ or $1$. Furthermore we assume also that we have only two parameters $\beta$ in our fitting of the Sigmoid function, that is we define probabilities
+We assume now that we have two classes with $y_i$ either $0$ or $1$. Furthermore we assume also that we have only two parameters $\theta$ in our fitting of the Sigmoid function, that is we define probabilities
 !bt
 \begin{align*}
-p(y_i=1|x_i,\bm{\beta}) &= \frac{\exp{(\beta_0+\beta_1x_i)}}{1+\exp{(\beta_0+\beta_1x_i)}},\nonumber\\
-p(y_i=0|x_i,\bm{\beta}) &= 1 - p(y_i=1|x_i,\bm{\beta}),
+p(y_i=1|x_i,\bm{\theta}) &= \frac{\exp{(\theta_0+\theta_1x_i)}}{1+\exp{(\theta_0+\theta_1x_i)}},\nonumber\\
+p(y_i=0|x_i,\bm{\theta}) &= 1 - p(y_i=1|x_i,\bm{\theta}),
 \end{align*}
 !et
-where $\bm{\beta}$ are the weights we wish to extract from data, in our case $\beta_0$ and $\beta_1$.
+where $\bm{\theta}$ are the weights we wish to extract from data, in our case $\theta_0$ and $\theta_1$.
 
 Note that we used
 !bt
 \[
-p(y_i=0\vert x_i, \bm{\beta}) = 1-p(y_i=1\vert x_i, \bm{\beta}).
+p(y_i=0\vert x_i, \bm{\theta}) = 1-p(y_i=1\vert x_i, \bm{\theta}).
 \]
 !et
 
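The two-parameter Sigmoid probabilities defined in this hunk can be written as a short Python sketch; the function names p1 and p0 and the parameter values theta0=0, theta1=1 below are purely illustrative.

!bc pycod
import numpy as np

def p1(x, theta0, theta1):
    # p(y=1 | x, theta) for the two-parameter Sigmoid model
    t = theta0 + theta1 * x
    return np.exp(t) / (1.0 + np.exp(t))

def p0(x, theta0, theta1):
    # p(y=0 | x, theta) = 1 - p(y=1 | x, theta)
    return 1.0 - p1(x, theta0, theta1)

# Illustrative parameters; the probabilities approach 0 and 1 at the ends of the grid
x = np.linspace(-5, 5, 11)
print(p1(x, theta0=0.0, theta1=1.0))
!ec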
@@ -316,86 +316,86 @@ the probability of seeing the observed data. We can then approximate the
 likelihood in terms of the product of the individual probabilities of a specific outcome $y_i$, that is
 !bt
 \begin{align*}
-P(\mathcal{D}|\bm{\beta})& = \prod_{i=1}^n \left[p(y_i=1|x_i,\bm{\beta})\right]^{y_i}\left[1-p(y_i=1|x_i,\bm{\beta}))\right]^{1-y_i}\nonumber \\
+P(\mathcal{D}|\bm{\theta})& = \prod_{i=1}^n \left[p(y_i=1|x_i,\bm{\theta})\right]^{y_i}\left[1-p(y_i=1|x_i,\bm{\theta}))\right]^{1-y_i}\nonumber \\
 \end{align*}
 !et
 from which we obtain the log-likelihood and our _cost/loss_ function
 !bt
 \[
-\mathcal{C}(\bm{\beta}) = \sum_{i=1}^n \left( y_i\log{p(y_i=1|x_i,\bm{\beta})} + (1-y_i)\log\left[1-p(y_i=1|x_i,\bm{\beta}))\right]\right).
+\mathcal{C}(\bm{\theta}) = \sum_{i=1}^n \left( y_i\log{p(y_i=1|x_i,\bm{\theta})} + (1-y_i)\log\left[1-p(y_i=1|x_i,\bm{\theta}))\right]\right).
 \]
 !et
 
 
 Reordering the logarithms, we can rewrite the _cost/loss_ function as
 !bt
 \[
-\mathcal{C}(\bm{\beta}) = \sum_{i=1}^n \left(y_i(\beta_0+\beta_1x_i) -\log{(1+\exp{(\beta_0+\beta_1x_i)})}\right).
+\mathcal{C}(\bm{\theta}) = \sum_{i=1}^n \left(y_i(\theta_0+\theta_1x_i) -\log{(1+\exp{(\theta_0+\theta_1x_i)})}\right).
 \]
 !et
 
-The maximum likelihood estimator is defined as the set of parameters that maximize the log-likelihood where we maximize with respect to $\beta$.
+The maximum likelihood estimator is defined as the set of parameters that maximize the log-likelihood where we maximize with respect to $\theta$.
 Since the cost (error) function is just the negative log-likelihood, for logistic regression we have that
 !bt
 \[
-\mathcal{C}(\bm{\beta})=-\sum_{i=1}^n \left(y_i(\beta_0+\beta_1x_i) -\log{(1+\exp{(\beta_0+\beta_1x_i)})}\right).
+\mathcal{C}(\bm{\theta})=-\sum_{i=1}^n \left(y_i(\theta_0+\theta_1x_i) -\log{(1+\exp{(\theta_0+\theta_1x_i)})}\right).
 \]
 !et
 This equation is known in statistics as the _cross entropy_. Finally, we note that just as in linear regression,
 in practice we often supplement the cross-entropy with additional regularization terms, usually $L_1$ and $L_2$ regularization as we did for Ridge and Lasso regression.
 
 
-The cross entropy is a convex function of the weights $\bm{\beta}$ and,
+The cross entropy is a convex function of the weights $\bm{\theta}$ and,
 therefore, any local minimizer is a global minimizer.
 
 
 Minimizing this
-cost function with respect to the two parameters $\beta_0$ and $\beta_1$ we obtain
+cost function with respect to the two parameters $\theta_0$ and $\theta_1$ we obtain
 
 !bt
 \[
-\frac{\partial \mathcal{C}(\bm{\beta})}{\partial \beta_0} = -\sum_{i=1}^n \left(y_i -\frac{\exp{(\beta_0+\beta_1x_i)}}{1+\exp{(\beta_0+\beta_1x_i)}}\right),
+\frac{\partial \mathcal{C}(\bm{\theta})}{\partial \theta_0} = -\sum_{i=1}^n \left(y_i -\frac{\exp{(\theta_0+\theta_1x_i)}}{1+\exp{(\theta_0+\theta_1x_i)}}\right),
 \]
 !et
 and
 !bt
 \[
-\frac{\partial \mathcal{C}(\bm{\beta})}{\partial \beta_1} = -\sum_{i=1}^n \left(y_ix_i -x_i\frac{\exp{(\beta_0+\beta_1x_i)}}{1+\exp{(\beta_0+\beta_1x_i)}}\right).
+\frac{\partial \mathcal{C}(\bm{\theta})}{\partial \theta_1} = -\sum_{i=1}^n \left(y_ix_i -x_i\frac{\exp{(\theta_0+\theta_1x_i)}}{1+\exp{(\theta_0+\theta_1x_i)}}\right).
 \]
 !et
 
 
 Let us now define a vector $\bm{y}$ with $n$ elements $y_i$, an
 $n\times p$ matrix $\bm{X}$ which contains the $x_i$ values and a
-vector $\bm{p}$ of fitted probabilities $p(y_i\vert x_i,\bm{\beta})$. We can rewrite in a more compact form the first
+vector $\bm{p}$ of fitted probabilities $p(y_i\vert x_i,\bm{\theta})$. We can rewrite in a more compact form the first
 derivative of cost function as
 
 !bt
 \[
-\frac{\partial \mathcal{C}(\bm{\beta})}{\partial \bm{\beta}} = -\bm{X}^T\left(\bm{y}-\bm{p}\right).
+\frac{\partial \mathcal{C}(\bm{\theta})}{\partial \bm{\theta}} = -\bm{X}^T\left(\bm{y}-\bm{p}\right).
 \]
 !et
 
 If we in addition define a diagonal matrix $\bm{W}$ with elements
-$p(y_i\vert x_i,\bm{\beta})(1-p(y_i\vert x_i,\bm{\beta})$, we can obtain a compact expression of the second derivative as
+$p(y_i\vert x_i,\bm{\theta})(1-p(y_i\vert x_i,\bm{\theta})$, we can obtain a compact expression of the second derivative as
 
 !bt
 \[
-\frac{\partial^2 \mathcal{C}(\bm{\beta})}{\partial \bm{\beta}\partial \bm{\beta}^T} = \bm{X}^T\bm{W}\bm{X}.
+\frac{\partial^2 \mathcal{C}(\bm{\theta})}{\partial \bm{\theta}\partial \bm{\theta}^T} = \bm{X}^T\bm{W}\bm{X}.
 \]
 !et
 
 
 Within a binary classification problem, we can easily expand our model to include multiple predictors. Our ratio between likelihoods is then with $p$ predictors
 !bt
 \[
-\log{ \frac{p(\bm{\beta}\bm{x})}{1-p(\bm{\beta}\bm{x})}} = \beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_px_p.
+\log{ \frac{p(\bm{\theta}\bm{x})}{1-p(\bm{\theta}\bm{x})}} = \theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_px_p.
 \]
 !et
-Here we defined $\bm{x}=[1,x_1,x_2,\dots,x_p]$ and $\bm{\beta}=[\beta_0, \beta_1, \dots, \beta_p]$ leading to
+Here we defined $\bm{x}=[1,x_1,x_2,\dots,x_p]$ and $\bm{\theta}=[\theta_0, \theta_1, \dots, \theta_p]$ leading to
 !bt
 \[
-p(\bm{\beta}\bm{x})=\frac{ \exp{(\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_px_p)}}{1+\exp{(\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_px_p)}}.
+p(\bm{\theta}\bm{x})=\frac{ \exp{(\theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_px_p)}}{1+\exp{(\theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_px_p)}}.
 \]
 !et
 
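To make the cross-entropy cost, its gradient $-\bm{X}^T(\bm{y}-\bm{p})$ and the Hessian $\bm{X}^T\bm{W}\bm{X}$ from this hunk concrete, here is a hedged Python sketch with a single Newton-Raphson step. The synthetic data, the starting point and the helper names (sigmoid, cost, gradient, hessian) are illustrative assumptions, not the book's implementation.

!bc pycod
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cost(X, y, theta):
    # negative log-likelihood (cross entropy): -sum(y_i t_i - log(1 + exp(t_i)))
    t = X @ theta
    return -np.sum(y * t - np.log(1.0 + np.exp(t)))

def gradient(X, y, theta):
    # compact first derivative: -X^T (y - p)
    p = sigmoid(X @ theta)
    return -X.T @ (y - p)

def hessian(X, theta):
    # compact second derivative: X^T W X with W_ii = p_i (1 - p_i)
    p = sigmoid(X @ theta)
    W = np.diag(p * (1.0 - p))
    return X.T @ W @ X

# Synthetic data for illustration: design matrix [1, x_i] and binary outcomes
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = (rng.random(50) < sigmoid(X @ np.array([0.5, -1.0]))).astype(float)

# One Newton-Raphson step from theta = 0
theta = np.zeros(2)
theta = theta - np.linalg.solve(hessian(X, theta), gradient(X, y, theta))
print(theta, cost(X, y, theta))
!ec

When $\bm{X}^T\bm{W}\bm{X}$ is invertible, repeating this update is the standard Newton-Raphson iteration for minimizing the cross entropy, and the convexity noted above guarantees that any minimum it reaches is the global one.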
@@ -406,19 +406,19 @@ of simplicity assume we have only two predictors. We have then following model
 
 !bt
 \[
-\log{\frac{p(C=1\vert x)}{p(K\vert x)}} = \beta_{10}+\beta_{11}x_1,
+\log{\frac{p(C=1\vert x)}{p(K\vert x)}} = \theta_{10}+\theta_{11}x_1,
 \]
 !et
 and
 !bt
 \[
-\log{\frac{p(C=2\vert x)}{p(K\vert x)}} = \beta_{20}+\beta_{21}x_1,
+\log{\frac{p(C=2\vert x)}{p(K\vert x)}} = \theta_{20}+\theta_{21}x_1,
 \]
 !et
 and so on till the class $C=K-1$ class
 !bt
 \[
-\log{\frac{p(C=K-1\vert x)}{p(K\vert x)}} = \beta_{(K-1)0}+\beta_{(K-1)1}x_1,
+\log{\frac{p(C=K-1\vert x)}{p(K\vert x)}} = \theta_{(K-1)0}+\theta_{(K-1)1}x_1,
 \]
 !et
 
@@ -437,18 +437,18 @@ Bayes classifiers, and artificial neural networks. Specifically, in
 multinomial logistic regression and linear discriminant analysis, the
 input to the function is the result of $K$ distinct linear functions,
 and the predicted probability for the $k$-th class given a sample
-vector $\bm{x}$ and a weighting vector $\bm{\beta}$ is (with two
+vector $\bm{x}$ and a weighting vector $\bm{\theta}$ is (with two
 predictors):
 
 !bt
 \[
-p(C=k\vert \mathbf {x} )=\frac{\exp{(\beta_{k0}+\beta_{k1}x_1)}}{1+\sum_{l=1}^{K-1}\exp{(\beta_{l0}+\beta_{l1}x_1)}}.
+p(C=k\vert \mathbf {x} )=\frac{\exp{(\theta_{k0}+\theta_{k1}x_1)}}{1+\sum_{l=1}^{K-1}\exp{(\theta_{l0}+\theta_{l1}x_1)}}.
 \]
 !et
 It is easy to extend to more predictors. The final class is
 !bt
 \[
-p(C=K\vert \mathbf {x} )=\frac{1}{1+\sum_{l=1}^{K-1}\exp{(\beta_{l0}+\beta_{l1}x_1)}},
+p(C=K\vert \mathbf {x} )=\frac{1}{1+\sum_{l=1}^{K-1}\exp{(\theta_{l0}+\theta_{l1}x_1)}},
 \]
 !et
 
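The two class-probability formulas in this final hunk can be evaluated with a short Python sketch for one predictor $x_1$ and a reference class $K$; the function name class_probabilities, the array Theta of parameters $[\theta_{k0}, \theta_{k1}]$ and the example values are hypothetical.

!bc pycod
import numpy as np

def class_probabilities(x1, Theta):
    # Theta is a (K-1) x 2 array of [theta_k0, theta_k1]; class K is the reference
    scores = np.exp(Theta[:, 0] + Theta[:, 1] * x1)   # K-1 exponentiated linear functions
    denom = 1.0 + scores.sum()
    # p(C=k|x) for k = 1..K-1, followed by p(C=K|x) = 1/denom
    return np.append(scores / denom, 1.0 / denom)

# Example with K = 3 classes and illustrative parameters
Theta = np.array([[0.2, 1.0], [-0.5, 0.3]])
print(class_probabilities(0.7, Theta))
!ec

By construction the $K$ probabilities sum to one, which is what makes this a proper generalization of the binary Sigmoid model above.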