Release Date: 4/30/2013
Author: Warren Winter
This script uses the Orange machine learning library to implement an algorithm that extracts and ranks the features carrying the most discriminative or predictive power for an observation's class membership. It can be used to improve classifier performance and to aid biomarker discovery.
It is based on the SVM-RFE algorithm first developed by Guyon et al., 2002, critiqued by Ambroise et al., 2002, and refined by Duan et al., 2005.
In outline:
- I. Rank the dataset's features
  - Divide the data into multiple (e.g., 10) "external" folds
  - Within each external fold:
    - Until all features have been ranked:
      - Divide the data into 10 "internal" folds
      - Within each internal fold:
        - Perform grid search (with 4-fold internal CV) to optimize a linear SVM's C parameter for the internal fold's data
        - Train a linear SVM on the internal fold's observations
        - Normalize the vector of each feature's coefficients in the model
        - Calculate weights for each feature
      - For each feature, calculate a signal-to-noise ratio (SNR) that stabilizes its weight estimate across the internal folds: mean(weights) / SD(weights)
      - Record the ranks of some number of the lowest-weighted features, then eliminate them from the external fold's set of features
  - Calculate the mean of each feature's rank across all external folds
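
The ranking stage above can be sketched in plain Python. This is a minimal illustration, not the script's actual code: the names snr_scores, rank_features, mean_ranks and weights_per_fold are hypothetical, the SVM training step is replaced by a caller-supplied function, and weights are assumed to be non-negative with at least two internal folds.

```python
from statistics import mean, stdev


def snr_scores(fold_weights):
    """Collapse per-fold weights into one score per feature: mean / SD."""
    scores = {}
    for feature in fold_weights[0]:
        values = [weights[feature] for weights in fold_weights]
        avg, sd = mean(values), stdev(values)
        if sd > 0:
            scores[feature] = avg / sd
        else:
            # Identical weight in every fold: treat a non-zero weight as
            # perfectly stable and a zero weight as uninformative.
            scores[feature] = float("inf") if avg > 0 else 0.0
    return scores


def rank_features(features, weights_per_fold, drop_per_step=1):
    """Recursively eliminate features; returns {feature: rank}, 1 = best.

    weights_per_fold is a HYPOTHETICAL callable standing in for the SVM
    training step: given the surviving features, it returns one
    {feature: weight} dict per internal fold.
    """
    remaining = list(features)
    ranks = {}
    next_rank = len(remaining)  # the first feature eliminated ranks last
    while remaining:
        scores = snr_scores(weights_per_fold(remaining))
        worst_first = sorted(remaining, key=lambda f: scores[f])
        for feature in worst_first[:drop_per_step]:
            ranks[feature] = next_rank
            next_rank -= 1
            remaining.remove(feature)
    return ranks


def mean_ranks(per_fold_ranks):
    """Average each feature's rank over the external folds."""
    return {
        feature: mean([ranks[feature] for ranks in per_fold_ranks])
        for feature in per_fold_ranks[0]
    }


# Toy stand-in for SVM training: fixed weights from three internal folds.
TOY_WEIGHTS = [
    {"a": 0.9, "b": 0.6, "c": 0.05},
    {"a": 1.1, "b": 0.2, "c": 0.35},
    {"a": 1.0, "b": 0.4, "c": 0.20},
]


def toy_weights_per_fold(remaining):
    return [{f: fold[f] for f in remaining} for fold in TOY_WEIGHTS]


# "c" has the lowest SNR and is eliminated first, then "b", leaving "a".
ranks = rank_features(["a", "b", "c"], toy_weights_per_fold)
```

Dividing by the standard deviation penalizes features whose weight swings widely between internal folds, so a feature with a moderate but consistent weight can outrank one with a large but erratic weight.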
--
- II. Determine the optimal number of highest-ranked features for classification accuracy
  - For each candidate number of top-ranked features, ranging from 1 to n:
    - Restrict the dataset to those top features
    - Divide the data into multiple (e.g., 10) folds
    - Within each fold:
      - Perform grid search (with 4-fold internal CV) to optimize an RBF-kernel SVM's C and gamma parameters for the fold's data
      - Train an SVM with an RBF kernel on the fold's observations
      - Test the SVM on the held-out observations and record performance metrics
    - Calculate the across-folds mean of each performance metric of the SVM trained on the restricted feature set
  - Find the feature set that trained the SVM to classify best
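
The selection stage above amounts to a search over nested top-k feature sets. The sketch below shows that loop under stated assumptions: best_feature_count and score_fold are hypothetical names, and score_fold stands in for the whole "tune, train and test an RBF SVM on one fold" step, which is not reproduced here.

```python
from statistics import mean


def best_feature_count(ranked_features, max_features, n_folds, score_fold):
    """Return (best_k, mean_scores) for top-k subsets, k = 1..max_features.

    score_fold is a HYPOTHETICAL callable standing in for "tune, train and
    test an RBF SVM on one fold"; it takes (feature_subset, fold_index)
    and returns one performance number, such as accuracy.
    """
    limit = min(max_features, len(ranked_features))
    mean_scores = {}
    for k in range(1, limit + 1):
        subset = ranked_features[:k]  # the k best-ranked features
        fold_scores = [score_fold(subset, fold) for fold in range(n_folds)]
        mean_scores[k] = mean(fold_scores)
    # max() keeps the first maximum it meets, so ties go to the smaller set.
    best_k = max(mean_scores, key=mean_scores.get)
    return best_k, mean_scores


# Toy scorer: accuracy depends only on how many features are used.
TOY_ACCURACY = {1: 0.70, 2: 0.85, 3: 0.85, 4: 0.80}


def toy_score_fold(subset, fold):
    return TOY_ACCURACY[len(subset)]


best_k, mean_scores = best_feature_count(
    ["a", "b", "c", "d"], max_features=4, n_folds=10, score_fold=toy_score_fold
)
```

Breaking ties in favor of the smaller feature set is a design choice of this sketch, not something the outline specifies; it simply prefers the more compact set when two sets score the same.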
Usage:
- I. To run the first (ranking) stage, call the mSVM_RFE() function
  - mSVM_RFE() takes 2 parameters:
    - dataset (a string specifying the path to your Orange-ready, tab-delimited data file)
    - folds (either an int specifying how many folds to divide the data into, or the string "n" to specify n-fold, i.e. leave-one-out, cross-validation)
  - mSVM_RFE() saves the output to a JSON file in the same folder as the .tab data file
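
To make the two accepted forms of the folds parameter concrete, here is one plausible way such a value could be interpreted. It is a sketch only: resolve_folds and fold_indices are hypothetical helpers, not functions from the script, and a real cross-validation split would normally shuffle or stratify the observations first.

```python
def resolve_folds(folds, n_observations):
    """Turn the folds argument into a concrete number of folds."""
    if folds == "n":
        # n-fold CV with n = number of observations is leave-one-out.
        return n_observations
    is_int = isinstance(folds, int) and not isinstance(folds, bool)
    if is_int and 2 <= folds <= n_observations:
        return folds
    raise ValueError('folds must be "n" or an int between 2 and %d'
                     % n_observations)


def fold_indices(n_observations, folds):
    """Deal observation indices round-robin into the requested folds."""
    k = resolve_folds(folds, n_observations)
    # Fold sizes differ by at most one observation.
    return [list(range(i, n_observations, k)) for i in range(k)]


two_folds = fold_indices(5, 2)
leave_one_out = fold_indices(3, "n")
```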
--
- II. To run the second (performance testing) stage, call the test_best_features() function
  - test_best_features() takes 4 parameters:
    - dataset (a string containing the path to your data file)
    - rankjson (a string containing the path to the JSON output from mSVM_RFE())
    - maxfeatures (an int specifying the number of features in the largest top-feature set you want to train the SVM on)
    - folds (either an int specifying how many folds to divide the data into, or the string "n" to specify n-fold, i.e. leave-one-out, cross-validation)
  - test_best_features() saves the output to another JSON file in the same folder
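
The relationship between rankjson and maxfeatures can be illustrated as follows. The layout of the ranking file is not documented here, so this sketch ASSUMES a flat JSON object mapping each feature name to its mean rank (lower is better); top_feature_sets and the feature names are hypothetical.

```python
import json


def top_feature_sets(rank_json_text, maxfeatures):
    """Build the nested top-1, top-2, ... feature sets from a ranking.

    ASSUMPTION: the ranking is a flat {feature_name: mean_rank} object.
    The real output of mSVM_RFE() may be laid out differently.
    """
    mean_ranks = json.loads(rank_json_text)
    # Sort by mean rank, using the name to break ties deterministically.
    ordered = sorted(mean_ranks, key=lambda name: (mean_ranks[name], name))
    limit = min(maxfeatures, len(ordered))
    return [ordered[:k] for k in range(1, limit + 1)]


example_ranking = '{"geneA": 2.4, "geneB": 1.1, "geneC": 3.0}'
feature_sets = top_feature_sets(example_ranking, maxfeatures=2)
```

With maxfeatures set to 2, only the top-1 and top-2 sets are produced, so the third feature is never used for training.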
To come: plotting number of features x accuracy, permutation test to assess significance, and some further documentation.