
Commit c3f6670

committed
wk3
1 parent 5c4aa1b commit c3f6670

2 files changed

Lines changed: 26 additions & 28 deletions


02-Regression.qmd

Lines changed: 4 additions & 6 deletions
@@ -294,15 +294,15 @@ for p_degree in [4, 10]:
     plt.clf()
     fig, (ax1, ax2) = plt.subplots(2, layout='constrained')
 
-    ax1.scatter(X_train_BMI, y_train, alpha=.5, color="brown", label="Training set")
-    ax1.scatter(X_train_BMI, y_train_predicted, label="fitted line")
+    ax1.scatter(X_train[X_train.columns[1]], y_train, alpha=.5, color="brown", label="Training set")
+    ax1.scatter(X_train[X_train.columns[1]], y_train_predicted, label="fitted line")
     ax1.set(xlabel='BMI', ylabel='Mean Blood Pressure')
     ax1.set_title('Training Error: ' + str(round(train_err, 2)))
     ax1.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
     ax1.set_ylim(np.min(nhanes_tiny.MeanBloodPressure), np.max(nhanes_tiny.MeanBloodPressure))
 
-    ax2.scatter(X_test_BMI, y_test, alpha=.5, color="brown", label="Testing set")
-    ax2.scatter(X_test_BMI, y_test_predicted, label="fitted line")
+    ax2.scatter(X_test[X_test.columns[1]], y_test, alpha=.5, color="brown", label="Testing set")
+    ax2.scatter(X_test[X_test.columns[1]], y_test_predicted, label="fitted line")
     ax2.set(xlabel='BMI', ylabel='Mean Blood Pressure')
     ax2.set_title('Testing Error: ' + str(round(test_err, 2)))
     ax2.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
@@ -340,8 +340,6 @@ plt.xlabel('Polynomial Degree')
 plt.ylabel('Error')
 plt.legend()
 plt.show()
-
-
 
 ```
 
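The error-vs-degree plot closed out above is the heart of the lesson's underfitting/overfitting story. A minimal standalone sketch of that pattern, assuming synthetic data in place of the NHANES columns and `np.polyfit` as a hypothetical stand-in for the notebook's model pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: a noisy linear signal (think BMI vs. Mean Blood Pressure,
# rescaled to [-1, 1] for numerical stability at high polynomial degrees)
x = rng.uniform(-1, 1, 200)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, 200)

x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

errors = {}
for p_degree in [1, 4, 10]:
    coefs = np.polyfit(x_train, y_train, deg=p_degree)
    errors[p_degree] = (mse(y_train, np.polyval(coefs, x_train)),
                        mse(y_test, np.polyval(coefs, x_test)))

# Training error can only fall as the degree grows; testing error need not
for d, (tr, te) in sorted(errors.items()):
    print(f"degree {d}: train {tr:.3f}, test {te:.3f}")
```

Because each lower-degree polynomial is nested inside the higher-degree family, training error is non-increasing in degree, while testing error is free to rise once the fit starts chasing noise.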

03-Classification.qmd

Lines changed: 22 additions & 22 deletions
@@ -49,7 +49,7 @@ ax.set_ylabel('')
 plt.show()
 ```
 
-Great, there seems to be an association. However, recall that our classification model is going to be making predictions of probability on a continuous scale of 0 to 1 before we classify it into two categories. Therefore, it makes sense to examine the relationship between BMI and empirical Hypertension probability. To do so, we will need to *bin* our data by small chunks of BMI values and calculate the empirical Hypertension probability for that bin. We plot the midpoint binned BMI value vs. empirical Hypertension probability for 20 bins:
+Great, there seems to be an association. However, recall that our classification model is going to be *making predictions of probability* on a continuous scale of 0 to 1 before we classify it into two categories. Therefore, it makes sense to examine the relationship between BMI and empirical Hypertension probability in our data exploration. To do so, we will need to *bin* our data into small chunks of BMI values and calculate the empirical Hypertension probability for each bin. We plot the midpoint of each binned BMI value vs. the empirical Hypertension probability for 20 bins:
 
 ```{python}
 nhanes_train['bins'] = pd.cut(nhanes_train['BMI'], bins=20)
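The `pd.cut` binning step above can be sketched end to end on synthetic data (column names follow the notebook; the BMI/Hypertension values here are simulated, not the real NHANES data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000

df = pd.DataFrame({'BMI': rng.uniform(15, 50, n)})
# Simulate hypertension probability rising with BMI
p = 1 / (1 + np.exp(-(df['BMI'] - 35) / 5))
df['Hypertension'] = rng.random(n) < p

# Bin BMI into 20 intervals; the per-bin mean of a 0/1 label
# is exactly the empirical probability for that bin
df['bins'] = pd.cut(df['BMI'], bins=20)
emp = df.groupby('bins', observed=True)['Hypertension'].mean()
midpoints = emp.index.map(lambda iv: iv.mid)

print(emp.head())
```

The bin midpoints from `Interval.mid` are what get plotted on the x-axis against `emp` on the y-axis.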
@@ -75,7 +75,7 @@ plt.show()
 
 ```
 
-Great, looks like we have a strong relationship, but it doesn't encompass the full spectrum of probabilities.
+Great, looks like we have a relationship, but it doesn't encompass the full spectrum of probabilities.
 
 ## Logistic Regression
 
@@ -143,7 +143,7 @@ print('Accuracy = ', accuracy_score(y_test, logit_model.predict(X_test)))
 
 Okay, that's a starting point!
 
-However, we need to be mindful of the class imbalance we saw in the dataset at the beginning of the lesson. Recall we roughly have 88% of our data as No Hypertension. If we have a classifier that *always* predicted No Hypertension, then we achieve a 88% accuracy rate, but this model is not particularly novel.
+However, we need to be mindful of the class imbalance we saw in the dataset at the beginning of the lesson. Recall that roughly 88% of our data is No Hypertension. If we had a classifier that *always* predicted No Hypertension, we would achieve an 88% accuracy rate, but this model is not particularly useful, and it raises the question of whether our model's 76% accuracy is meaningful.
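The majority-class baseline described above can be verified numerically; a sketch with simulated labels matching the stated class balance (the real nhanes labels are assumed unavailable here):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated labels matching the stated class balance: 1 = Hypertension (~12%)
y_true = (rng.random(10_000) < 0.12).astype(int)

# A "classifier" that always predicts the majority class, No Hypertension (0)
y_majority = np.zeros_like(y_true)

baseline_acc = float(np.mean(y_true == y_majority))
print(round(baseline_acc, 2))
```

Any model worth keeping has to beat this do-nothing baseline, which is why raw accuracy alone can mislead under class imbalance.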
 
 We can break down classification accuracy to four additional results, via a table called the **Confusion Matrix**:
 
@@ -156,9 +156,9 @@ plt.show()
 
 The top left hand corner is the number of True Negatives (1128), the top right hand corner is the number of False Positives (24), the bottom left corner is the number of False Negatives (325), and the bottom right corner is the number of True Positives (15).
 
-Our Sensitivity (accuracy of Hypertension events) is defined as: $\frac{TP}{TP+FN}$, which is 15/(15+325) = 4%
+Our **Sensitivity** (accuracy of Hypertension events) is defined as $\frac{TP}{TP+FN}$, which is 15/(15+325) = 4%.
 
-Our Specificity (accuracy of No Hypertension events) is defined as: $\frac{TN}{TN+FP}$, which is 1128/(1128+24) = 98%.
+Our **Specificity** (accuracy of No Hypertension events) is defined as $\frac{TN}{TN+FP}$, which is 1128/(1128+24) = 98%.
 
 Therefore, we do a pretty terrible job of predicting the Hypertension cases!
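Plugging the counts read off the confusion matrix above into these formulas (plus overall accuracy) as a quick check:

```python
# Counts from the confusion matrix above
TN, FP, FN, TP = 1128, 24, 325, 15

sensitivity = TP / (TP + FN)               # accuracy on Hypertension cases
specificity = TN / (TN + FP)               # accuracy on No Hypertension cases
accuracy = (TP + TN) / (TN + FP + FN + TP)

print(round(sensitivity, 2), round(specificity, 2), round(accuracy, 2))  # 0.04 0.98 0.77
```

Note that the 76% overall accuracy is almost entirely driven by the specificity of the dominant No Hypertension class.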

@@ -176,23 +176,6 @@ disp.plot()
 plt.show()
 ```
 
-ROC Curve
-
-```{python}
-
-from sklearn.metrics import RocCurveDisplay
-from sklearn.metrics import roc_curve, roc_auc_score
-
-# Compute ROC curve
-fpr, tpr, thresholds = roc_curve(y_test, logit_model.predict(X_test), drop_intermediate=False, pos_label)
-print(f"FPR: {fpr}")
-print(f"TPR: {tpr}")
-print(f"Thresholds: {thresholds}")
-
-display = RocCurveDisplay.from_predictions(y_test, logit_model.predict(X_test))
-plt.show()
-```
-
 ## Assumptions of logistic regression
 
 ### Linearity of log odds - predictor relationship
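A common informal check of this assumption is to bin the predictor, compute the empirical log odds per bin, and see whether the points fall on a line. A minimal sketch, assuming synthetic data generated from a truly linear log-odds model (so the check should pass by construction):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 5000

bmi = rng.uniform(15, 50, n)
# True model is linear in the log odds with slope 0.2
log_odds = -7 + 0.2 * bmi
y = rng.random(n) < 1 / (1 + np.exp(-log_odds))

df = pd.DataFrame({'BMI': bmi, 'Hypertension': y})
df['bins'] = pd.cut(df['BMI'], bins=15)

# Per-bin empirical probability, clipped away from 0/1 so the log odds are finite
p_hat = df.groupby('bins', observed=True)['Hypertension'].mean().clip(1e-3, 1 - 1e-3)
emp_log_odds = np.log(p_hat / (1 - p_hat))

# A straight-line fit of bin midpoints vs. empirical log odds should
# recover a slope near the true 0.2 when the assumption holds
mids = p_hat.index.map(lambda iv: iv.mid).astype(float)
slope = np.polyfit(mids, emp_log_odds, 1)[0]
print(round(slope, 2))
```

With real data, curvature in the scatter of midpoints vs. empirical log odds is the warning sign that the linearity assumption is violated.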
@@ -240,6 +223,23 @@ logit_model = sm.Logit(y_train, X_train).fit()
 logit_model.summary()
 ```
 
+## Appendix: ROC Curve
+
+```{python}
+from sklearn.metrics import RocCurveDisplay
+from sklearn.metrics import roc_curve, roc_auc_score
+
+# Compute ROC curve (with 0/1 labels, scikit-learn treats 1 as the positive class)
+fpr, tpr, thresholds = roc_curve(y_test, logit_model.predict(X_test), drop_intermediate=False)
+print(f"FPR: {fpr}")
+print(f"TPR: {tpr}")
+print(f"Thresholds: {thresholds}")
+
+display = RocCurveDisplay.from_predictions(y_test, logit_model.predict(X_test))
+plt.show()
+```
+
 ##
 
 ```{python}
