Adding Logistic Regression class implementation & utils #41
Conversation
Pull Request Test Coverage Report for Build 3568976682

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls
@gonuke This PR should be reviewed and merged first because it includes ... Coveralls says that coverage dropped because the SSML functionality of ...
gonuke
left a comment
This looks correct enough, but a few suggestions for maintainability.
Some good discussion points for a S/W meeting, too.
Co-authored-by: Paul Wilson <paul.wilson@wisc.edu>
stompsjo
left a comment
@gonuke I have addressed some of the comments from our S/W review. Here are a few lingering comments, pending a re-review from you.
scripts/utils.py
Outdated
if stratified:
    cv = StratifiedKFold(n_splits=n_splits, random_state=random_state,
                         shuffle=shuffle)
There is currently no test for stratified KFold, but the only difference between this and standard KFold (which is tested) is a different scikit-learn class. I could devise a second test dataset with a few extra instances of one class (which would be balanced by StratifiedKFold) or we could ignore testing this portion. Thoughts?
In light of # pragma: no cover from Coveralls, I am adding it to this if/else, since we are not testing StratifiedKFold.
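For concreteness, a sketch of what that exclusion might look like. The make_cv wrapper is hypothetical; the else branch with plain KFold is assumed from the surrounding code, and the pragma itself is interpreted by coverage.py, which Coveralls reports from:

from sklearn.model_selection import KFold, StratifiedKFold

def make_cv(n_splits, random_state, shuffle, stratified=False):
    # hypothetical wrapper mirroring the diff above
    if stratified:  # pragma: no cover
        # excluded from coverage: the untested StratifiedKFold branch
        cv = StratifiedKFold(n_splits=n_splits, random_state=random_state,
                             shuffle=shuffle)
    else:
        cv = KFold(n_splits=n_splits, random_state=random_state,
                   shuffle=shuffle)
    return cv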
scripts/utils.py
Outdated
def pca(Lx, Ux, n):
    '''
    A function for computing n-component PCA.
    Inputs:
        Lx: labeled feature data.
        Ux: unlabeled feature data.
        n: number of singular values to include in PCA analysis.
    '''

    pcadata = np.append(Lx, Ux, axis=0)
    normalizer = StandardScaler()
    x = normalizer.fit_transform(pcadata)
    logging.info('%s %s', np.mean(pcadata), np.std(pcadata))
    logging.info('%s %s', np.mean(x), np.std(x))

    pca = PCA(n_components=n)
    principalComponents = pca.fit_transform(x)
    logging.info(pca.explained_variance_ratio_)
    logging.info(pca.singular_values_)
    logging.info(pca.components_)

    return principalComponents


def _plot_pca_data(principalComponents, Ly, Uy, ax):
    '''
    Helper function for plot_pca that plots data for a given axis.
    Inputs:
        principalComponents: ndarray of shape (n_samples, n_components).
        Ly: class labels for labeled data.
        Uy: labels for unlabeled data (all labels should be -1).
        ax: matplotlib axis to plot on.
    '''

    # only saving colors for binary classification with unlabeled instances
    col_dict = {-1: 'tab:gray', 0: 'tab:orange', 1: 'tab:blue'}

    for idx, color in col_dict.items():
        indices = np.where(np.append(Ly, Uy, axis=0) == idx)[0]
        ax.scatter(principalComponents[indices, 0],
                   principalComponents[indices, 1],
                   c=color,
                   label='class '+str(idx))
    return ax


def plot_pca(principalComponents, Ly, Uy, filename, n=2):
    '''
    A function for plotting n-dimensional PCA results.
    Inputs:
        principalComponents: ndarray of shape (n_samples, n_components).
        Ly: class labels for labeled data.
        Uy: labels for unlabeled data (all labels should be -1).
        filename: filename for the saved plot.
            The file is saved with extension .png,
            appended to filename if not included as input.
        n: number of singular values included in the PCA analysis.
    '''

    plt.rcParams.update({'font.size': 20})

    alph = ["A", "B", "C", "D", "E", "F", "G", "H",
            "I", "J", "K", "L", "M", "N", "O", "P",
            "Q", "R", "S", "T", "U", "V", "W", "X",
            "Y", "Z"]
    jobs = alph[:n]

    # only one plot is needed for n=2
    if n == 2:
        fig, ax = plt.subplots(figsize=(10, 8))
        ax.set_xlabel('PC '+jobs[0], fontsize=15)
        ax.set_ylabel('PC '+jobs[1], fontsize=15)
        ax = _plot_pca_data(principalComponents, Ly, Uy, ax)
        ax.grid()
        ax.legend()
    else:
        fig, axes = plt.subplots(n, n, figsize=(15, 15))
        for row in range(axes.shape[0]):
            for col in range(axes.shape[1]):
                ax = axes[row, col]
                # blank label plot on the diagonal
                if row == col:
                    ax.tick_params(
                        axis='both', which='both',
                        bottom=False, top=False,
                        labelbottom=False,
                        left=False, right=False,
                        labelleft=False
                    )
                    ax.text(0.5, 0.5, jobs[row], horizontalalignment='center')
                # PCA results off the diagonal
                else:
                    ax = _plot_pca_data(principalComponents, Ly, Uy, ax)

    if filename[-4:] != '.png':
        filename += '.png'
    fig.tight_layout()
    fig.savefig(filename)


def plot_cf(testy, predy, title, filename):
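For context, a hypothetical invocation of the pca and plot_pca helpers above, assuming they are in scope; the data shapes, seed, and filename are illustrative:

import numpy as np

rng = np.random.default_rng(0)
Lx = rng.random((50, 10))            # labeled feature data
Ux = rng.random((30, 10))            # unlabeled feature data
Ly = rng.integers(0, 2, 50)          # binary class labels
Uy = np.full(30, -1)                 # unlabeled instances carry label -1

pcs = pca(Lx, Ux, n=2)               # 2-component PCA on the combined data
plot_pca(pcs, Ly, Uy, 'pca_plot')    # saves pca_plot.png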
@gonuke, per our conversations from S/W, do you still concur that we can ignore testing these PCA/plotting functions?
Yes - but is there a way to exclude them from the denominator of the coverage testing?
Turns out, Coveralls supports adding # pragma: no cover to blocks of code and will ignore them. Adding this to the plotting functions above.
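As a sketch, placing the pragma on the def line excludes the entire function body under coverage.py (which Coveralls reports from), removing it from the coverage denominator:

def plot_pca(principalComponents, Ly, Uy, filename, n=2):  # pragma: no cover
    # entire function body is now excluded from coverage measurement
    ...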
# since the test data used here is synthetic/toy data (i.e. uninteresting),
# the trained model should be at least better than a 50-50 guess
# if it was worse, something would be wrong with the ML class
assert acc > 0.5
I added some clarification to explain our "good-enough" testing of trained ML models.
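As a sketch, such a "good-enough" test might look like the following, built on scikit-learn's stock LogisticRegression and synthetic data; the project's actual test fixtures and class interface are assumptions here:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def test_logreg_beats_random_guess():
    # synthetic/toy binary classification data
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    trainx, testx, trainy, testy = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(trainx, trainy)
    acc = model.score(testx, testy)
    # the trained model should be at least better than a 50-50 guess;
    # anything worse suggests a bug in the ML class
    assert acc > 0.5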
gonuke
left a comment
Still some questions about the right test for parameters passed into the LogReg initializer...
models/LogReg.py
Outdated
        random_state=self.random_state
    )
else:
    if all(key in params.keys() for key in keys):
I'm not sure if this is the most correct/robust way to do this. The LogisticRegression model has defaults for these parameters, so it may be OK if some are missing. You just need to make sure they exist if you want to pass them along. Right now, you only allow 0 parameters or all 3 parameters, but maybe it's OK for just 1 or 2?
One way to manage this is with the **kwargs object that you can pass through, perhaps?
This is my first time using **kwargs, but I saw a recommendation to use kwargs.pop('key', default_value) to pull optional args from the input. This setup should support any combination of input parameters, including gracefully handling ones that are not supported. I have updated the __init__ and its relevant unit test. Let me know if you have feedback!
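A minimal sketch of that pattern; the parameter names (max_iter, tol, C) and defaults here are illustrative, not the PR's actual signature:

from sklearn.linear_model import LogisticRegression

class LogReg:
    def __init__(self, **kwargs):
        # pop each optional arg with a default, so any combination of
        # 0-3 supported parameters works
        max_iter = kwargs.pop('max_iter', 100)
        tol = kwargs.pop('tol', 1e-4)
        C = kwargs.pop('C', 1.0)
        # anything left in kwargs is an unsupported parameter and is ignored
        self.model = LogisticRegression(max_iter=max_iter, tol=tol, C=C)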
@gonuke I addressed your comments, thanks!
gonuke
left a comment
Looks good!
This introduces an ML class, LogReg, that can be used for supervised logistic regression. It includes typical scikit-learn-esque methods like train and predict, as well as methods for hyperparameter optimization and saving the model to file.
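A hypothetical usage sketch: the train and predict method names come from the summary above, and the module path matches models/LogReg.py from the diff, but the save method name, hyperparameters, and filename are assumptions:

from sklearn.datasets import make_classification
from models.LogReg import LogReg

X, y = make_classification(n_samples=100, random_state=0)
model = LogReg(max_iter=1000)   # hyperparameters are illustrative
model.train(X, y)               # supervised training, scikit-learn style
predy = model.predict(X)        # class predictions
model.save('logreg.joblib')     # persist the trained model (method name assumed)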