rt.github.io/03_Section_3.qmd at main · jaylim95/rt.github.io · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
---
title: Baseline
format: html
filters:
  - pyodide
---


## Building Baseline

A baseline model serves as a simple reference point for evaluating the performance of more complex models. It represents the minimum level of performance that any model should achieve. Common baselines include predicting the most frequent class or random guessing.


**Establishing a baseline is crucial because it provides a benchmark to compare against.** It helps in understanding whether the added complexity of a model is actually improving performance. If a model does not perform better than the baseline, it indicates that the model might not be useful.

This is some starter code to build the baseline models. **Do not change the following code**. Just run it! :)

```{pyodide-python}
import pandas as pd
from sklearn.model_selection import train_test_split

data_url = "https://raw.githubusercontent.com/zadchin/RT-test/master/final_dataset.csv"
data = pd.read_csv(data_url)
features = data.drop("diagnosis", axis = 1)
target = data.diagnosis
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state=42)
```


## Understanding the metrics

We are going to investigate 4 metrics to evalaute the performance of our machine learning algorithms, namely, accuracy, precision, recall and F1 score. We will go through it one-by-one in the following:

### Accuracy

The ratio of correctly predicted instances to the total instances. Mathematically it is defined as $$\text{Accuracy} =\frac{TP+TN}{\text{Total}}$$


::: {.callout-note collapse="true" appearance="simple"}

## Why can't we blindly trust accuracy?

In our dataset, 30% of the people are diagnosed with pancreatic cancer, and 70% are not. If we rely only on accuracy, our model could just predict everyone as not having cancer and still be 70% accurate. This doesn't help us identify those who actually have cancer. That's why we need to look at other metrics like precision and recall to ensure our model is correctly identifying both those with and without the disease.

:::

### Precision

Before we move on to talk about precision, recall & F1 score, we would like you to think about the following problem:

::: {.callout-note collapse="true" appearance="simple"}

## What is the impact of false positive and false negative in diagnosing pancreas cancer?

**False Positive:** The patient doesn't have pancreatic cancer, but the AI incorrectly diagnoses them with it. This can lead to unnecessary tests and procedures, causing stress and potential harm.

**False Negative:** The patient has pancreatic cancer, but the AI incorrectly diagnoses them as not having it. This is very dangerous as the patient misses the chance for early treatment, leading to potentially severe consequences.

:::


Thus, **precision, recall and F1 score are important metrics to help us to get a hollistic picture of our model performance.**


Precision is defined as the ratio of true positive predictions to the total predicted positives. Mathematically,
$$\text{Precision} =\frac{TP}{TP+FP}$$

Precision tells us how many of the patients predicted to have PDAC actually have the disease. High precision means fewer false positives.


### Recall

Recall is defined as the ratio of true positive predictions to the total actual positives. Mathematically, it is defined as

$$\text{Recall} = \frac{TP}{TP+FN}$$

Recall (or sensitivity) indicates how many patients with PDAC are correctly identified by the model. High recall means fewer false negatives, which is crucial for medical diagnosis.


### F1 Score

F1 Score is the harmonic mean of precision and recall. Mathematically, it is defined as

$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

The F1 Score provides a balance between precision and recall. It is particularly useful when the class distribution is imbalanced, as it considers both false positives and false negatives.

## Baseline 1: All False

First, we will create a baseline around the most frequent class, which in our case, is 0, where we assume all patients do not have cnacer.This baseline mimics a current system where we assume that all patients do not have Pancreatic ductal adenocarcinoma (PDAC).


```{pyodide-python}
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np
y_pred_baseline = [0] * len(y_test)
print("Baseline (All Negative) Model")
print("Accuracy: ", np.round(accuracy_score(y_test, y_pred_baseline), 2))
print("Precision: ", np.round(precision_score(y_test, y_pred_baseline, zero_division=0),2))
print("Recall: ", np.round(recall_score(y_test, y_pred_baseline, zero_division=0),2))
print("F1 Score: ", np.round(f1_score(y_test, y_pred_baseline, zero_division=0),2))
```


::: {.callout-note collapse="true" appearance="simple"}

## Why is precision, recall and F1 score are all 0 in our case?

Precision, recall, and F1 score are all zero because the baseline model predicts all patients as negative (no PDAC), resulting in no true positives!

**Did you realize the importance of different metrics in judging performance?**

:::

## Baseline 2: Random Guess

Random guessing is a baseline method where predictions are made by randomly assigning classes (positive or negative) without any consideration of the input features. As an analogy, it is like you flipping a coin and randomly assign a patient of its status based on the coins (for example, has cancer if the coin is head, and has no cancer if the coin is tail).

Random guessing sets a very low bar for model performance. If a machine learning model cannot outperform random guessing, it indicates that the model has not captured any meaningful patterns from the data.  Beating random guessing demonstrates that the model has predictive power and can provide reliable results.


```{pyodide-python}
np.random.seed(42)
y_pred_random = np.random.randint(2, size=len(y_test))
y_pred_baseline = [0] * len(y_test)
print("Baseline (Random Guessing) Model")
print("Accuracy: ", np.round(accuracy_score(y_test, y_pred_random), 2))
print("Precision: ", np.round(precision_score(y_test, y_pred_random, zero_division=0), 2))
print("Recall: ", np.round(recall_score(y_test, y_pred_random, zero_division=0), 2))
print("F1 Score: ", np.round(f1_score(y_test, y_pred_random, zero_division=0), 2))
```


$\,$