
Commit c3f6670

committed
wk3
1 parent 5c4aa1b commit c3f6670

2 files changed

Lines changed: 26 additions & 28 deletions


02-Regression.qmd

Lines changed: 4 additions & 6 deletions
@@ -294,15 +294,15 @@ for p_degree in [4, 10]:
     plt.clf()
     fig, (ax1, ax2) = plt.subplots(2, layout='constrained')
 
-    ax1.scatter(X_train_BMI, y_train, alpha=.5, color="brown", label="Training set")
-    ax1.scatter(X_train_BMI, y_train_predicted, label="fitted line")
+    ax1.scatter(X_train[X_train.columns[1]], y_train, alpha=.5, color="brown", label="Training set")
+    ax1.scatter(X_train[X_train.columns[1]], y_train_predicted, label="fitted line")
     ax1.set(xlabel='BMI', ylabel='Mean Blood Pressure')
     ax1.set_title('Training Error: ' + str(round(train_err, 2)))
     ax1.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
     ax1.set_ylim(np.min(nhanes_tiny.MeanBloodPressure), np.max(nhanes_tiny.MeanBloodPressure))
 
-    ax2.scatter(X_test_BMI, y_test, alpha=.5, color="brown", label="Testing set")
-    ax2.scatter(X_test_BMI, y_test_predicted, label="fitted line")
+    ax2.scatter(X_test[X_test.columns[1]], y_test, alpha=.5, color="brown", label="Testing set")
+    ax2.scatter(X_test[X_test.columns[1]], y_test_predicted, label="fitted line")
     ax2.set(xlabel='BMI', ylabel='Mean Blood Pressure')
     ax2.set_title('Testing Error: ' + str(round(test_err, 2)))
     ax2.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
@@ -340,8 +340,6 @@ plt.xlabel('Polynomial Degree')
 plt.ylabel('Error')
 plt.legend()
 plt.show()
-
-
 
 ```
 
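The error-vs-degree plot closed out above is the heart of the lesson's underfitting/overfitting story. A minimal standalone sketch of that pattern, assuming synthetic data in place of the NHANES columns and `np.polyfit` as a hypothetical stand-in for the notebook's model pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: a noisy linear signal (think BMI vs. Mean Blood Pressure,
# rescaled to [-1, 1] for numerical stability at high polynomial degrees)
x = rng.uniform(-1, 1, 200)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, 200)

x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

errors = {}
for p_degree in [1, 4, 10]:
    coefs = np.polyfit(x_train, y_train, deg=p_degree)
    errors[p_degree] = (mse(y_train, np.polyval(coefs, x_train)),
                        mse(y_test, np.polyval(coefs, x_test)))

# Training error can only fall as the degree grows; testing error need not
for d, (tr, te) in sorted(errors.items()):
    print(f"degree {d}: train {tr:.3f}, test {te:.3f}")
```

Because each lower-degree polynomial is nested inside the higher-degree family, training error is non-increasing in degree, while testing error is free to rise once the fit starts chasing noise.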

03-Classification.qmd

Lines changed: 22 additions & 22 deletions
@@ -49,7 +49,7 @@ ax.set_ylabel('')
 plt.show()
 ```
 
-Great, there seems to be an association. However, recall that our classification model is going to be making predictions of probability on a continuous scale of 0 to 1 before we classify it into two categories. Therefore, it makes sense to examine the relationship between BMI and empirical Hypertension probability. To do so, we will need to *bin* our data by small chunks of BMI values and calculate the empirical Hypertension probability for that bin. We plot the midpoint binned BMI value vs. empirical Hypertension probability for 20 bins:
+Great, there seems to be an association. However, recall that our classification model is going to be *making predictions of probability* on a continuous scale of 0 to 1 before we classify it into two categories. Therefore, it makes sense to examine the relationship between BMI and empirical Hypertension probability in our data exploration. To do so, we will need to *bin* our data into small chunks of BMI values and calculate the empirical Hypertension probability for each bin. We plot the midpoint of each binned BMI value vs. the empirical Hypertension probability for 20 bins:
 
 ```{python}
 nhanes_train['bins'] = pd.cut(nhanes_train['BMI'], bins=20)
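The `pd.cut` binning step above can be sketched end to end on synthetic data (column names follow the notebook; the BMI/Hypertension values here are simulated, not the real NHANES data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000

df = pd.DataFrame({'BMI': rng.uniform(15, 50, n)})
# Simulate hypertension probability rising with BMI
p = 1 / (1 + np.exp(-(df['BMI'] - 35) / 5))
df['Hypertension'] = rng.random(n) < p

# Bin BMI into 20 intervals; the per-bin mean of a 0/1 label
# is exactly the empirical probability for that bin
df['bins'] = pd.cut(df['BMI'], bins=20)
emp = df.groupby('bins', observed=True)['Hypertension'].mean()
midpoints = emp.index.map(lambda iv: iv.mid)

print(emp.head())
```

The bin midpoints from `Interval.mid` are what get plotted on the x-axis against `emp` on the y-axis.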
@@ -75,7 +75,7 @@ plt.show()
 
 ```
 
-Great, looks like we have a strong relationship, but it doesn't encompass the full spectrum of probabilities.
+Great, looks like we have a relationship, but it doesn't encompass the full spectrum of probabilities.
 
 ## Logistic Regression
 
@@ -143,7 +143,7 @@ print('Accuracy = ', accuracy_score(y_test, logit_model.predict(X_test)))
 
 Okay, that's a starting point!
 
-However, we need to be mindful of the class imbalance we saw in the dataset at the beginning of the lesson. Recall we roughly have 88% of our data as No Hypertension. If we have a classifier that *always* predicted No Hypertension, then we achieve a 88% accuracy rate, but this model is not particularly novel.
+However, we need to be mindful of the class imbalance we saw in the dataset at the beginning of the lesson. Recall that roughly 88% of our data is No Hypertension. If we had a classifier that *always* predicted No Hypertension, we would achieve an 88% accuracy rate, but this model is not particularly useful, and it raises the question of whether our model's 76% accuracy is meaningful.
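The majority-class baseline described above can be verified numerically; a sketch with simulated labels matching the stated class balance (the real nhanes labels are assumed unavailable here):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated labels matching the stated class balance: 1 = Hypertension (~12%)
y_true = (rng.random(10_000) < 0.12).astype(int)

# A "classifier" that always predicts the majority class, No Hypertension (0)
y_majority = np.zeros_like(y_true)

baseline_acc = float(np.mean(y_true == y_majority))
print(round(baseline_acc, 2))
```

Any model worth keeping has to beat this do-nothing baseline, which is why raw accuracy alone can mislead under class imbalance.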
 
 We can break down classification accuracy to four additional results, via a table called the **Confusion Matrix**:
 
@@ -156,9 +156,9 @@ plt.show()
 
 The top left hand corner is the number of True Negatives (1128), the top right hand corner is the number of False Positives (24), the bottom left corner is the number of False Negatives (325), and the bottom right corner is the number of True Positives (15).
 
-Our Sensitivity (accuracy of Hypertension events) is defined as: $\frac{TP}{TP+FN}$, which is 15/(15+325) = 4%
+Our **Sensitivity** (accuracy of Hypertension events) is defined as $\frac{TP}{TP+FN}$, which is 15/(15+325) = 4%.
 
-Our Specificity (accuracy of No Hypertension events) is defined as: $\frac{TN}{TN+FP}$, which is 1128/(1128+24) = 98%.
+Our **Specificity** (accuracy of No Hypertension events) is defined as $\frac{TN}{TN+FP}$, which is 1128/(1128+24) = 98%.
 
 Therefore, we do a pretty terrible job of predicting the Hypertension cases!
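Plugging the counts read off the confusion matrix above into these formulas (plus overall accuracy) as a quick check:

```python
# Counts from the confusion matrix above
TN, FP, FN, TP = 1128, 24, 325, 15

sensitivity = TP / (TP + FN)               # accuracy on Hypertension cases
specificity = TN / (TN + FP)               # accuracy on No Hypertension cases
accuracy = (TP + TN) / (TN + FP + FN + TP)

print(round(sensitivity, 2), round(specificity, 2), round(accuracy, 2))  # 0.04 0.98 0.77
```

Note that the 76% overall accuracy is almost entirely driven by the specificity of the dominant No Hypertension class.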

@@ -176,23 +176,6 @@ disp.plot()
 plt.show()
 ```
 
-ROC Curve
-
-```{python}
-
-from sklearn.metrics import RocCurveDisplay
-from sklearn.metrics import roc_curve, roc_auc_score
-
-# Compute ROC curve
-fpr, tpr, thresholds = roc_curve(y_test, logit_model.predict(X_test), drop_intermediate=False, pos_label)
-print(f"FPR: {fpr}")
-print(f"TPR: {tpr}")
-print(f"Thresholds: {thresholds}")
-
-display = RocCurveDisplay.from_predictions(y_test, logit_model.predict(X_test))
-plt.show()
-```
-
 ## Assumptions of logistic regression
 
 ### Linearity of log odds - predictor relationship
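A common informal check of this assumption is to bin the predictor, compute the empirical log odds per bin, and see whether the points fall on a line. A minimal sketch, assuming synthetic data generated from a truly linear log-odds model (so the check should pass by construction):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 5000

bmi = rng.uniform(15, 50, n)
# True model is linear in the log odds with slope 0.2
log_odds = -7 + 0.2 * bmi
y = rng.random(n) < 1 / (1 + np.exp(-log_odds))

df = pd.DataFrame({'BMI': bmi, 'Hypertension': y})
df['bins'] = pd.cut(df['BMI'], bins=15)

# Per-bin empirical probability, clipped away from 0/1 so the log odds are finite
p_hat = df.groupby('bins', observed=True)['Hypertension'].mean().clip(1e-3, 1 - 1e-3)
emp_log_odds = np.log(p_hat / (1 - p_hat))

# A straight-line fit of bin midpoints vs. empirical log odds should
# recover a slope near the true 0.2 when the assumption holds
mids = p_hat.index.map(lambda iv: iv.mid).astype(float)
slope = np.polyfit(mids, emp_log_odds, 1)[0]
print(round(slope, 2))
```

With real data, curvature in the scatter of midpoints vs. empirical log odds is the warning sign that the linearity assumption is violated.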
@@ -240,6 +223,23 @@ logit_model = sm.Logit(y_train, X_train).fit()
 logit_model.summary()
 ```
 
+## Appendix: ROC Curve
+
+```{python}
+from sklearn.metrics import RocCurveDisplay
+from sklearn.metrics import roc_curve, roc_auc_score
+
+# Compute ROC curve (with 0/1 labels, scikit-learn treats 1 as the positive class)
+fpr, tpr, thresholds = roc_curve(y_test, logit_model.predict(X_test), drop_intermediate=False)
+print(f"FPR: {fpr}")
+print(f"TPR: {tpr}")
+print(f"Thresholds: {thresholds}")
+
+display = RocCurveDisplay.from_predictions(y_test, logit_model.predict(X_test))
+plt.show()
+```
+
 ##
 
 ```{python}
