rt.github.io/05_Section_5.qmd at main · jaylim95/rt.github.io · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
---
title: Decision Tree
format: html
filters:
  - pyodide
---

## Decision Tree

[Fill description]{style="color:red;"}.


## Train-Test Split

As we have done previosuly under logistic regression, we will first perform a train-test split.

```{pyodide-python}
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import precision_score, accuracy_score, recall_score, f1_score, roc_auc_score, confusion_matrix

data_url = "https://raw.githubusercontent.com/zadchin/RT-test/master/final_dataset.csv"
data = pd.read_csv(data_url)
features = data.drop("diagnosis", axis = 1)
target = data.diagnosis
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state=42)
print("X train shape: ", X_train.shape)
print("X test shape: ", X_test.shape)
```


## Algorithm


Next, we will run decision treeon our training dataset, then compute the performance of the metrics using our dataset and then


```{pyodide-python}
import numpy as np
DT_model = tree.DecisionTreeClassifier(max_leaf_nodes=3, max_depth=None,
                                  class_weight="balanced")
DT_model = DT_model.fit(X_train, y_train)
y_pred_DT = DT_model.predict(X_test)
print("Accuracy: ", np.round(accuracy_score(y_test, y_pred_DT), 2))
print("Precision: ",np.round(precision_score(y_test, y_pred_DT), 2))
print("Recall: ",np.round(recall_score(y_test, y_pred_DT), 2))
print("F1 Score: ", np.round(f1_score(y_test, y_pred_DT), 2))
```

::: {.callout-note collapse="true" appearance="simple"}

## Try Removing Class Weight, how does the metrics changes?

Accuracy stays the same at 0.89. However, we observe the precision increased and recall decreased. The F1 score stays approximately the same.

**Why Precision Improved?**

Precision increases when the class weight is removed, suggesting that the model makes fewer false positive predictions. This might be because, without class balancing, the model is biased towards the majority class.

**Why Recall Decreased?**

Recall decreases when the class weight is removed, indicating that the model misses more true positive instances. This is critical in medical diagnoses where missing a positive case (false negative) can be detrimental.

:::

::: {.callout-note collapse="true" appearance="simple"}

## Assume `class_weight="balanced"`, first test with `max_depth=None`, and then test with `max_depth=1`, observe the difference in the performance, and also in visualization of the tree?

Reducing `max_depth` from `None` to `1` decreases the recall and F1 score, indicating the model becomes less capable of identifying true positives, likely due to oversimplification. The oversimplication is reflected in the visualization of the tree.

:::

## Visualizing the tree

```{pyodide-python}
feature_names = list(X_train.columns)
text_representation = tree.export_text(DT_model, feature_names=feature_names)
print(text_representation)
```

$\,$