MachineLearning

Explain the Assumptions and violations of k-means clustering


Give use cases of k-means and k-medoids


How to compute correlation with categorical variables?


How do you know if one algorithm is better than another?

In the context of mathematical algorithms, space and time complexity can be used to compare them. In the context of machine learning algorithms, their ability to model complex decision boundaries, as measured by performance on held-out data, can be used to compare them.
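
A minimal sketch of the empirical side of such a comparison, assuming scikit-learn; the dataset and the two model choices are illustrative, not part of the original answer:

```python
# Compare two classifiers on held-out data via 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
for model in (LogisticRegression(max_iter=5000), RandomForestClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)            # accuracy on each held-out fold
    print(type(model).__name__, round(scores.mean(), 3))
```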


What is the difference between Gradient Boosting and Random Forest?

Random Forest uses bootstrapping for sampling and decision trees for prediction. Bootstrapping simply means generating random samples from the dataset with replacement. Every bootstrapped sample has a corresponding hold-out or 'out-of-bag' sample which is used for testing. This means that if you generate 100 bootstrapped samples, you train 100 trees and get 100 sets of predictions. The final prediction is simply the average (or majority vote) of all 100 predictions.

When using a boosting technique, we are free to use any base learner: a decision tree, a neural network, k-NN or an SVM. Now let's look at how training/testing is done here. Boosting methods also produce multiple samples, but this is done more deliberately: each subsequent sample depends on the weights given to records that the previous learner (a so-called weak learner) predicted incorrectly. The final prediction is also not a simple average of all 100 predictions, but a weighted average.
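
A minimal sketch of the contrast, assuming scikit-learn; the synthetic dataset and hyperparameters are illustrative:

```python
# Random Forest: bagged trees averaged, with an out-of-bag score for free.
# Gradient Boosting: trees fitted sequentially and combined with a learning rate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0).fit(X_tr, y_tr)
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0).fit(X_tr, y_tr)

print("RF  test accuracy:", rf.score(X_te, y_te), "OOB score:", rf.oob_score_)
print("GBM test accuracy:", gb.score(X_te, y_te))
```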

Analytics Vidhya Reference


Explain the difference between bagged and boosted models.

The key difference is the sampling technique. In bagging, uniform random samples are drawn with replacement and the individual classifiers are independent of each other. In boosting, data points are re-weighted based on how the previous classifier performed on them. Boosting is sequential, whereas bagging is parallel.
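
A minimal sketch of the boosting-side re-weighting, using only numpy; the labels and weak-learner predictions are hypothetical placeholders and the update follows the AdaBoost scheme:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=10)            # hypothetical labels
pred = rng.choice([-1, 1], size=10)         # hypothetical predictions of the current weak learner
weights = np.full(10, 1 / 10)               # start uniform; bagging would instead redraw a bootstrap

err = np.clip(np.sum(weights * (pred != y)), 1e-10, 1 - 1e-10)   # weighted error rate
alpha = 0.5 * np.log((1 - err) / err)       # this learner's vote in the final weighted average
weights *= np.exp(-alpha * y * pred)        # up-weight mistakes, down-weight correct points
weights /= weights.sum()
print(round(float(alpha), 3), weights.round(3))
```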


What are cross entropy, information gain and the Gini index?

Best answer from stackoverflow
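
A minimal sketch of the three quantities, using only numpy; the label array and split mask are hypothetical:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def cross_entropy(p, q):
    # bits needed to encode samples drawn from p using a code optimised for q
    return -np.sum(np.asarray(p) * np.log2(q))

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])                   # hypothetical labels at a node
split = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)   # hypothetical binary split

# information gain = parent entropy - weighted average of the child entropies
gain = entropy(y) - (split.mean() * entropy(y[split]) + (~split).mean() * entropy(y[~split]))
print(entropy(y), gini(y), gain, cross_entropy([0.5, 0.5], [0.9, 0.1]))
```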


Define parametric and non-parametric methods. Give some examples

Parametric methods have a fixed set of parameters that determine a probability model. For example, in linear regression, identifying the slope and intercept is enough to define the model.

The term non-parametric (roughly) refers to the fact that the amount of information we need to keep in order to represent the hypothesis h grows (typically linearly) with the size of the training set. A non-parametric model can capture more subtle aspects of the data: it allows more information to pass from the current data attached to the model to predictions about future data. The parameters are often said to be infinite in dimension, so such models can express the characteristics of the data better than parametric models; they have more degrees of freedom and are more flexible. A Gaussian mixture model, for example, has the flexibility to express the data as a combination of several Gaussian distributions; another example is locally weighted regression.
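
A minimal sketch of the contrast, assuming scikit-learn; the data is synthetic and the pairing of linear regression with k-nearest-neighbour regression is illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

lin = LinearRegression().fit(X, y)                    # parametric: keeps only coef_ and intercept_
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)    # non-parametric: keeps all 200 training points

print("parametric:", lin.coef_, lin.intercept_)
print("non-parametric prediction at x=2.0:", knn.predict([[2.0]]))
```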


What is overfitting?


What are the degrees of freedom for the lasso?


Given an executable of a trained logistic regression model, how do you find its coefficients? You can run the model with any input and observe the output, but cannot inspect the internals.

The logistic regression hypothesis is

h_theta(x) = sigma(theta^T x) = 1 / (1 + exp(-(theta_0 + theta_1 x_1 + ... + theta_n x_n)))

Query the model with the all-zeros input: the output is sigma(theta_0), so applying the logit function log(p / (1 - p)) recovers the intercept theta_0. Then set x_i = 1 and every other element of x to zero: the output is sigma(theta_0 + theta_i), and its logit minus theta_0 gives theta_i. Repeat for every element of x.
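
A minimal sketch of the probing procedure; black_box is a hypothetical stand-in for the executable, implemented here only so the recovery step can be demonstrated:

```python
import numpy as np

theta = np.array([0.5, -1.2, 2.0, 0.3])     # hidden parameters: [intercept, theta_1, theta_2, theta_3]

def black_box(x):
    # hypothetical executable we can only call: returns the predicted probability for input x
    return 1.0 / (1.0 + np.exp(-(theta[0] + theta[1:] @ np.asarray(x))))

def logit(p):
    return np.log(p / (1 - p))

n = 3
intercept = logit(black_box(np.zeros(n)))                                # sigma(theta_0) -> theta_0
coefs = [logit(black_box(np.eye(n)[i])) - intercept for i in range(n)]   # sigma(theta_0 + theta_i) -> theta_i
print(intercept, np.round(coefs, 6))                                     # recovers 0.5 and [-1.2, 2.0, 0.3]
```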


What is an RBM and what are its applications?

http://imonad.com/rbm/restricted-boltzmann-machine/

http://blog.echen.me/2011/07/18/introduction-to-restricted-boltzmann-machines/


Will it be of any use if PCA is applied to data which is not linearly separable?


What are the assumptions of decision tree?


How are missing values handled in a decision tree?

Practical approaches to handling missing data do not use data points with missing values when evaluating a split. However, when child nodes are created and trained, those instances must be distributed to them somehow.

The following approaches are used to distribute instances with missing values to child nodes (a small sketch of the C4.5-style rule follows the reference below):

  • all go to the child node which already has the largest number of instances (CART, though this is not its primary rule)
  • distribute to all children, but with diminished weights, proportional to the number of instances in each child node (C4.5 and others)
  • distribute randomly to a single child node, possibly according to a categorical distribution (seen in various implementations of C4.5 and CART for a faster running time)
  • build, rank and use surrogates to distribute instances to a child node, where surrogates are input features that best resemble how the test feature sends data instances to the left or right child node (CART; if that fails, the majority rule is used)

https://stats.stackexchange.com/a/96458
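
A minimal sketch of the C4.5-style fractional-weight rule from the list above, with hypothetical child-node counts:

```python
# An instance whose split feature is missing is sent to every child with a weight
# proportional to the number of observed instances that went to that child.
left_count, right_count = 70, 30                       # hypothetical counts with the feature observed
total = left_count + right_count

instance_weight = 1.0                                  # weight of the instance with the missing value
left_share = instance_weight * left_count / total      # 0.7 of the instance goes to the left child
right_share = instance_weight * right_count / total    # 0.3 of the instance goes to the right child
print(left_share, right_share)
```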


What are surrogate splits in a decision tree?


How do you compute the importance of a variable in a decision tree?

Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated as the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.

ni_j = w_j · C_j − w_{left(j)} · C_{left(j)} − w_{right(j)} · C_{right(j)}

  • ni_j = the importance of node j
  • w_j = the weighted number of samples reaching node j
  • C_j = the impurity value of node j
  • left(j) = the child node from the left split on node j
  • right(j) = the child node from the right split on node j
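
A minimal sketch, assuming scikit-learn: the node importances are computed by hand from a fitted tree using the expression above and checked against the library's feature_importances_:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
t = clf.tree_

w = t.weighted_n_node_samples / t.weighted_n_node_samples[0]   # w_j, normalised by the root
C = t.impurity                                                  # C_j
left, right = t.children_left, t.children_right

node_importance = np.zeros(t.node_count)
for j in range(t.node_count):
    if left[j] != -1:                                           # internal (split) node
        node_importance[j] = w[j] * C[j] - w[left[j]] * C[left[j]] - w[right[j]] * C[right[j]]

# Sum node importances per split feature and normalise to sum to 1
feat_importance = np.zeros(X.shape[1])
for j in range(t.node_count):
    if left[j] != -1:
        feat_importance[t.feature[j]] += node_importance[j]
feat_importance /= feat_importance.sum()

print(np.allclose(feat_importance, clf.feature_importances_))   # expected: True
```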

How do you evaluate a decision tree's results?
