Adding Logistic Regression class implementation & utils #41
Conversation
Pull Request Test Coverage Report for Build 3568976682

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls
@gonuke This PR should be reviewed and merged first because it includes ... Coveralls says that coverage dropped because the SSML functionality of ...
gonuke
left a comment
This looks correct enough, but a few suggestions for maintainability.
Some good discussion points for a S/W meeting, too.
Co-authored-by: Paul Wilson <paul.wilson@wisc.edu>
stompsjo
left a comment
@gonuke I have addressed some of the comments from our S/W review. Here are a few lingering comments, pending a re-review from you.
scripts/utils.py
Outdated
if stratified:
    cv = StratifiedKFold(n_splits=n_splits, random_state=random_state,
                         shuffle=shuffle)
There is currently no test for stratified KFold, but the only difference between this and standard KFold (which is tested) is a different scikit-learn class. I could devise a second test dataset with a few extra instances of one class (which would be balanced by StratifiedKFold) or we could ignore testing this portion. Thoughts?
In light of # pragma: no cover from Coveralls, I am adding it to this if/else, since we are not testing StratifiedKFold.
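For concreteness, a sketch of what that exclusion might look like. The make_cv wrapper is hypothetical; the else branch with plain KFold is assumed from the surrounding code, and the pragma itself is interpreted by coverage.py, which Coveralls reports from:

from sklearn.model_selection import KFold, StratifiedKFold

def make_cv(n_splits, random_state, shuffle, stratified=False):
    # hypothetical wrapper mirroring the diff above
    if stratified:  # pragma: no cover
        # excluded from coverage: the untested StratifiedKFold branch
        cv = StratifiedKFold(n_splits=n_splits, random_state=random_state,
                             shuffle=shuffle)
    else:
        cv = KFold(n_splits=n_splits, random_state=random_state,
                   shuffle=shuffle)
    return cv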
scripts/utils.py
Outdated
def pca(Lx, Ux, n):
    '''
    A function for computing n-component PCA.
    Inputs:
        Lx: labeled feature data.
        Ux: unlabeled feature data.
        n: number of singular values to include in PCA analysis.
    '''

    pcadata = np.append(Lx, Ux, axis=0)
    normalizer = StandardScaler()
    x = normalizer.fit_transform(pcadata)
    logging.info('%s %s', np.mean(pcadata), np.std(pcadata))
    logging.info('%s %s', np.mean(x), np.std(x))

    pca = PCA(n_components=n)
    principalComponents = pca.fit_transform(x)
    logging.info(pca.explained_variance_ratio_)
    logging.info(pca.singular_values_)
    logging.info(pca.components_)

    return principalComponents


def _plot_pca_data(principalComponents, Ly, Uy, ax):
    '''
    Helper function for plot_pca that plots data for a given axis.
    Inputs:
        principalComponents: ndarray of shape (n_samples, n_components).
        Ly: class labels for labeled data.
        Uy: labels for unlabeled data (all labels should be -1).
        ax: matplotlib axis to plot on.
    '''

    # only saving colors for binary classification with unlabeled instances
    col_dict = {-1: 'tab:gray', 0: 'tab:orange', 1: 'tab:blue'}

    for idx, color in col_dict.items():
        indices = np.where(np.append(Ly, Uy, axis=0) == idx)[0]
        ax.scatter(principalComponents[indices, 0],
                   principalComponents[indices, 1],
                   c=color,
                   label='class '+str(idx))
    return ax


def plot_pca(principalComponents, Ly, Uy, filename, n=2):
    '''
    A function for plotting n-dimensional PCA results.
    Inputs:
        principalComponents: ndarray of shape (n_samples, n_components).
        Ly: class labels for labeled data.
        Uy: labels for unlabeled data (all labels should be -1).
        filename: filename for the saved plot.
            The file is saved with extension .png,
            appended to filename if not included as input.
        n: number of singular values included in the PCA analysis.
    '''

    plt.rcParams.update({'font.size': 20})

    alph = ["A", "B", "C", "D", "E", "F", "G", "H",
            "I", "J", "K", "L", "M", "N", "O", "P",
            "Q", "R", "S", "T", "U", "V", "W", "X",
            "Y", "Z"]
    jobs = alph[:n]

    # only one plot is needed for n=2
    if n == 2:
        fig, ax = plt.subplots(figsize=(10, 8))
        ax.set_xlabel('PC '+jobs[0], fontsize=15)
        ax.set_ylabel('PC '+jobs[1], fontsize=15)
        ax = _plot_pca_data(principalComponents, Ly, Uy, ax)
        ax.grid()
        ax.legend()
    else:
        fig, axes = plt.subplots(n, n, figsize=(15, 15))
        for row in range(axes.shape[0]):
            for col in range(axes.shape[1]):
                ax = axes[row, col]
                # blank label plot on the diagonal
                if row == col:
                    ax.tick_params(
                        axis='both', which='both',
                        bottom=False, top=False,
                        labelbottom=False,
                        left=False, right=False,
                        labelleft=False
                    )
                    ax.text(0.5, 0.5, jobs[row], horizontalalignment='center')
                # PCA results off the diagonal
                else:
                    ax = _plot_pca_data(principalComponents, Ly, Uy, ax)

    if filename[-4:] != '.png':
        filename += '.png'
    fig.tight_layout()
    fig.savefig(filename)


def plot_cf(testy, predy, title, filename):
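For context, a hypothetical invocation of the pca and plot_pca helpers above, assuming they are in scope; the data shapes, seed, and filename are illustrative:

import numpy as np

rng = np.random.default_rng(0)
Lx = rng.random((50, 10))            # labeled feature data
Ux = rng.random((30, 10))            # unlabeled feature data
Ly = rng.integers(0, 2, 50)          # binary class labels
Uy = np.full(30, -1)                 # unlabeled instances carry label -1

pcs = pca(Lx, Ux, n=2)               # 2-component PCA on the combined data
plot_pca(pcs, Ly, Uy, 'pca_plot')    # saves pca_plot.png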
@gonuke, per our conversations from S/W, do you still concur that we can ignore testing these PCA/plotting functions?
Yes - but is there a way to exclude them from the denominator of the coverage testing?
Turns out, Coveralls supports adding # pragma: no cover to blocks of code and will ignore them. Adding this to the plotting functions above.
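As a sketch, placing the pragma on the def line excludes the entire function body under coverage.py (which Coveralls reports from), removing it from the coverage denominator:

def plot_pca(principalComponents, Ly, Uy, filename, n=2):  # pragma: no cover
    # entire function body is now excluded from coverage measurement
    ...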
# since the test data used here is synthetic/toy data (i.e. uninteresting),
# the trained model should be at least better than a 50-50 guess
# if it was worse, something would be wrong with the ML class
assert acc > 0.5
I added some clarification to explain our "good-enough" testing of trained ML models.
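As a sketch, such a "good-enough" test might look like the following, built on scikit-learn's stock LogisticRegression and synthetic data; the project's actual test fixtures and class interface are assumptions here:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def test_logreg_beats_random_guess():
    # synthetic/toy binary classification data
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    trainx, testx, trainy, testy = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(trainx, trainy)
    acc = model.score(testx, testy)
    # the trained model should be at least better than a 50-50 guess;
    # anything worse suggests a bug in the ML class
    assert acc > 0.5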
gonuke
left a comment
Still some questions about the right test for parameters passed into the LogReg initializer...
models/LogReg.py
Outdated
        random_state=self.random_state
    )
else:
    if all(key in params.keys() for key in keys):
I'm not sure if this is the most correct/robust way to do this. The LogisticRegression model has defaults for these parameters, so it may be OK if some are missing. You just need to make sure they exist if you want to pass them along. Right now, you only allow 0 parameters or all 3 parameters, but maybe it's OK for just 1 or 2?
One way to manage this is with the **kwargs object that you can pass through, perhaps?
This is my first time using **kwargs, but I saw a recommendation to use kwargs.pop('key', default_value) to pull optional args from the input. This setup should support any combination of input parameters, including gracefully handling ones that are not supported. I have updated the __init__ and its relevant unit test. Let me know if you have feedback!
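A minimal sketch of that pattern; the parameter names (max_iter, tol, C) and defaults here are illustrative, not the PR's actual signature:

from sklearn.linear_model import LogisticRegression

class LogReg:
    def __init__(self, **kwargs):
        # pop each optional arg with a default, so any combination of
        # 0-3 supported parameters works
        max_iter = kwargs.pop('max_iter', 100)
        tol = kwargs.pop('tol', 1e-4)
        C = kwargs.pop('C', 1.0)
        # anything left in kwargs is an unsupported parameter and is ignored
        self.model = LogisticRegression(max_iter=max_iter, tol=tol, C=C)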
@gonuke I addressed your comments, thanks!
gonuke
left a comment
Looks good!
This introduces an ML class, LogReg, that can be used for supervised logistic regression. It includes typical scikit-learn-esque methods like train and predict, as well as methods for hyperparameter optimization and saving the model to file.
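A hypothetical usage sketch: the train and predict method names come from the summary above, and the module path matches models/LogReg.py from the diff, but the save method name, hyperparameters, and filename are assumptions:

from sklearn.datasets import make_classification
from models.LogReg import LogReg

X, y = make_classification(n_samples=100, random_state=0)
model = LogReg(max_iter=1000)   # hyperparameters are illustrative
model.train(X, y)               # supervised training, scikit-learn style
predy = model.predict(X)        # class predictions
model.save('logreg.joblib')     # persist the trained model (method name assumed)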