MachineLearning
In the context of mathematical algorithms, space and time complexity can be used to compare algorithms. In the context of machine learning algorithms, the ability to model complex decision boundaries can be used to compare them.
Random Forest uses the bootstrapping method for training/testing and decision trees for prediction. Bootstrapping simply means generating random samples from the dataset with replacement. Every bootstrapped sample has a corresponding hold-out or 'out-of-bag' sample which is used for testing. This means that if you generate 100 bootstrapped samples, each one yields its own set of predictions, and the final prediction is simply the average of all 100 predictions.
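A minimal sketch of one bootstrapped sample and its out-of-bag counterpart at the row-index level, using only NumPy (the dataset size of 10 is an arbitrary illustration, not part of any real forest implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                   # pretend the dataset has 10 rows, indexed 0..9
all_rows = np.arange(n)

# One bootstrapped sample: draw n row indices with replacement
boot_idx = rng.choice(n, size=n, replace=True)

# The 'out-of-bag' sample is every row that was never drawn into the bootstrap
oob_idx = np.setdiff1d(all_rows, boot_idx)

print("bootstrapped rows:", boot_idx)
print("out-of-bag rows:  ", oob_idx)
```

Repeating this 100 times gives 100 different training sets, one tree per set, and an out-of-bag test set for each tree.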
When using a boosting technique, we are free to use any base algorithm: a decision tree, a neural network, KNN, or SVM. Now let's look at how training/testing is done here. Boosting methods also produce multiple random samples, but they do so more thoughtfully: each subsequent sample depends on the weights given to records in the previous sample that were not predicted correctly, which is why the individual models are called weak learners. The final prediction is also not a simple average of all 100 predictions, but a weighted average.
The key difference is in the sampling technique. In bagging, uniform random samples are taken with replacement, and the individual classifiers are independent of each other. In boosting, data points are weighted based on the performance of the previous classifier on each point. Boosting is sequential, whereas bagging is parallel.
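A sketch of the AdaBoost-style weight update behind this idea, assuming binary labels; it is a simplified illustration of how misclassified records get heavier, not a full boosting implementation:

```python
import numpy as np

def update_weights(weights, y_true, y_pred):
    """AdaBoost-style reweighting: records the previous learner got wrong
    become heavier, so the next sample focuses on them."""
    miss = (y_true != y_pred).astype(float)
    err = np.sum(weights * miss) / np.sum(weights)
    err = np.clip(err, 1e-10, 1 - 1e-10)          # avoid division by zero / log(0)
    alpha = 0.5 * np.log((1 - err) / err)         # this learner's weight in the final vote
    new_weights = weights * np.exp(alpha * miss)  # up-weight the misclassified records
    return new_weights / new_weights.sum(), alpha

# Toy example: the second record was misclassified, so its weight grows
w, a = update_weights(np.full(4, 0.25), np.array([1, 1, 0, 0]), np.array([1, 0, 0, 0]))
print(w, a)   # weights shift toward the misclassified record; alpha is the learner's vote
```

The alpha values are what make the final prediction a weighted average rather than a simple one.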
Best answer from Stack Overflow:
Parametric methods have a set of fixed parameters that determine a probability model. For example, in linear regression, identifying the slope and intercept is enough to define the regression.
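A tiny illustration of that point: a straight-line fit is completely described by two numbers, no matter how many points it is trained on (NumPy only, made-up data):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Two fixed numbers fully describe the fitted model, regardless of dataset size
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)   # roughly 2.0 and 1.0
```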
The term non-parametric (roughly) refers to the fact that the amount of information we need to keep in order to represent the hypothesis h grows linearly with the size of the training set. A non-parametric model can capture more subtle aspects of the data: it allows more information to pass from the data currently attached to the model into predictions about future data. The parameters are often said to be infinite in dimension, so such a model can express the characteristics of the data much better than parametric models; it has more degrees of freedom and is more flexible. A Gaussian mixture model, for example, has the flexibility to express the data as multiple Gaussian distributions; another example is locally weighted regression.
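A minimal sketch of the Gaussian mixture example, assuming scikit-learn is available; the two-component choice and the simulated data are illustrative only:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Data drawn from two different Gaussians; a single global Gaussian would describe it poorly
x = np.concatenate([rng.normal(-3, 1.0, 200),
                    rng.normal(4, 0.5, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
print(gmm.means_.ravel())   # roughly -3 and 4, in some order
print(gmm.weights_)         # roughly [0.5, 0.5]
```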
Given an executable of a trained logistic regression model, how do you find the coefficients? You can run the model with any input and observe the output, but you cannot get at the internals.
The logistic regression hypothesis is h_θ(x) = 1 / (1 + e^(−θᵀx)). Set all elements of x to zero and observe the output to solve for the intercept θ_0, then set all elements of x to zero except x_i and solve for θ_i; repeat that step for every element of x.
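A minimal sketch of this probing procedure, assuming the executable can be wrapped as a function `predict_proba(x)` that returns P(y=1 | x) for a single feature vector (that wrapper is hypothetical; the black box below is simulated from known coefficients to check the idea):

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def recover_coefficients(predict_proba, n_features):
    """Recover theta_0..theta_n from a black box that returns P(y=1 | x)."""
    theta0 = logit(predict_proba(np.zeros(n_features)))    # all-zeros input isolates the intercept
    thetas = np.empty(n_features)
    for i in range(n_features):
        x = np.zeros(n_features)
        x[i] = 1.0                                          # probe one feature at a time
        thetas[i] = logit(predict_proba(x)) - theta0
    return theta0, thetas

# Simulated 'executable': a logistic model with known coefficients
true_theta0, true_theta = -0.5, np.array([1.2, -2.0, 0.7])
black_box = lambda x: 1.0 / (1.0 + np.exp(-(true_theta0 + true_theta @ x)))

print(recover_coefficients(black_box, 3))   # recovers -0.5 and [1.2, -2.0, 0.7]
```

This works because the logit of the predicted probability is linear in x, so each probe isolates exactly one coefficient.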
http://imonad.com/rbm/restricted-boltzmann-machine/
http://blog.echen.me/2011/07/18/introduction-to-restricted-boltzmann-machines/
Real approaches to handling missing data do not use data points with missing values in the evaluation of a split. However, when the child nodes are created and trained, those instances are distributed in some way.
I know about the following approaches to distribute the missing-value instances to child nodes:
- all instances go to the child node that already has the largest number of instances (CART, though this is not its primary rule)
- distribute to all children, but with diminished weights, proportional to the number of instances in each child node (C4.5 and others; see the sketch after this list)
- distribute randomly to a single child node, possibly according to a categorical distribution (I have seen this in various implementations of C4.5 and CART for faster running time)
- build, sort and use surrogates to distribute instances to a child node, where surrogates are input features that best resemble how the test feature sends data instances to the left or right child node (CART; if that fails, the majority rule is used)
https://stats.stackexchange.com/a/96458
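A sketch of the second (C4.5-style, fractional-weight) approach, written from the description above rather than from any particular library's internals:

```python
import numpy as np

def split_with_fractional_weights(feature, threshold, weights):
    """C4.5-style distribution: rows with a known value go to one child as usual;
    rows with a missing value go to BOTH children with diminished weights,
    proportional to the weight each child already received."""
    missing = np.isnan(feature)
    go_left = ~missing & (feature <= threshold)
    go_right = ~missing & (feature > threshold)

    w_left = weights * go_left
    w_right = weights * go_right
    p_left = w_left.sum() / (w_left.sum() + w_right.sum())

    w_left = w_left + weights * missing * p_left            # fractional copy to the left child
    w_right = w_right + weights * missing * (1.0 - p_left)  # and to the right child
    return w_left, w_right

feature = np.array([0.2, 0.9, np.nan, 0.4, np.nan])
w_left, w_right = split_with_fractional_weights(feature, 0.5, np.ones(5))
print(w_left)    # [1.   0.   0.67 1.   0.67] -- missing rows carry ~2/3 of their weight left
print(w_right)   # [0.   1.   0.33 0.   0.33]
```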
Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated as the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
For a single node j, the node importance is

ni_j = w_j * C_j - w_left(j) * C_left(j) - w_right(j) * C_right(j)

where:
- ni_j = the importance of node j
- w_j = the weighted number of samples reaching node j
- C_j = the impurity value of node j
- left(j) = the child node from the left split on node j
- right(j) = the child node from the right split on node j
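A sketch that applies this formula to a fitted scikit-learn decision tree, reading the node statistics from its tree_ attribute (the iris dataset is just a convenient example):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y).tree_

# Normalising by the root gives w_j as the probability of reaching node j
w = tree.weighted_n_node_samples / tree.weighted_n_node_samples[0]
C = tree.impurity
left, right = tree.children_left, tree.children_right

importances = np.zeros(X.shape[1])
for j in range(tree.node_count):
    if left[j] == -1:            # leaf node: no split, no importance contribution
        continue
    # ni_j = w_j * C_j - w_left(j) * C_left(j) - w_right(j) * C_right(j)
    ni_j = w[j] * C[j] - w[left[j]] * C[left[j]] - w[right[j]] * C[right[j]]
    importances[tree.feature[j]] += ni_j

importances /= importances.sum()  # normalise so the importances sum to 1
print(importances)                # should match DecisionTreeClassifier.feature_importances_
```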
