Commit f101591

Minor logic fixes
1 parent 714b39f commit f101591

7 files changed

Lines changed: 203 additions & 179 deletions

docs/07-functions.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -56,7 +56,7 @@ def get_gene(data, gene):
5656
"""
5757

5858
g_idx = data["genes"].index.get_loc(gene)
59-
return data["expression"][:, g_idx]
59+
return data["expression"][g_idx]
6060
```
6161

6262
Try it:
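The fix above makes sense if rows of `data["expression"]` are genes and columns are samples, which is what the rest of the course material suggests. A minimal sketch under that assumption (the `toy` dict, gene names, and values are made up for illustration and are not course data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data model: 2 genes (rows) x 3 samples (columns)
toy = {
    "expression": np.array([[1.0, 2.0, 3.0],
                            [4.0, 5.0, 6.0]]),
    "genes": pd.DataFrame(index=["GeneA", "GeneB"]),
    "samples": pd.DataFrame(index=["s1", "s2", "s3"]),
}


def get_gene(data, gene):
    # Look up the row position of the gene, then return that row
    g_idx = data["genes"].index.get_loc(gene)
    return data["expression"][g_idx]


gene_b = get_gene(toy, "GeneB")
print(gene_b)        # one value per sample
print(gene_b.shape)  # length equals the number of samples
```

With the old indexing `[:, g_idx]` the same call would return a column, that is, one value per gene for a single sample, which is not what `get_gene` promises.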

docs/08-io.md

Lines changed: 40 additions & 40 deletions
@@ -134,46 +134,46 @@ It should:
134134
---
135135

136136
??? exercise "Solution: load_data(folder)"
137-
```python
138-
def load_data(folder):
139-
in_dir = Path(folder)
140-
141-
# Folder must exist
142-
if not in_dir.exists():
143-
raise FileNotFoundError(f"Folder not found: {in_dir}")
144-
145-
if not in_dir.is_dir():
146-
raise NotADirectoryError(f"Not a folder: {in_dir}")
147-
148-
# Required files (fixed)
149-
paths = {
150-
"expression": in_dir / "expression.tsv",
151-
"genes": in_dir / "genes.tsv",
152-
"samples": in_dir / "samples.tsv",
153-
}
154-
155-
# Check missing files
156-
missing = [name for name, p in paths.items() if not p.exists()]
157-
if len(missing) > 0:
158-
raise FileNotFoundError(
159-
f"Missing file(s) in {in_dir}: " + ", ".join(missing)
160-
)
161-
162-
# Read tables
163-
expr = np.loadtxt(paths["expression"], delimiter="\t")
164-
genes = pd.read_csv(paths["genes"], sep="\t", header=0, index_col=0)
165-
samples = pd.read_csv(paths["samples"], sep="\t", header=0, index_col=0)
166-
167-
data = {
168-
"expression": expr,
169-
"genes": genes,
170-
"samples": samples
171-
}
172-
173-
check_data_model( data )
174-
175-
return data
176-
```
137+
```python
138+
def load_data(folder):
139+
in_dir = Path(folder)
140+
141+
# Folder must exist
142+
if not in_dir.exists():
143+
raise FileNotFoundError(f"Folder not found: {in_dir}")
144+
145+
if not in_dir.is_dir():
146+
raise NotADirectoryError(f"Not a folder: {in_dir}")
147+
148+
# Required files (fixed)
149+
paths = {
150+
"expression": in_dir / "expression.tsv",
151+
"genes": in_dir / "genes.tsv",
152+
"samples": in_dir / "samples.tsv",
153+
}
154+
155+
# Check missing files
156+
missing = [name for name, p in paths.items() if not p.exists()]
157+
if len(missing) > 0:
158+
raise FileNotFoundError(
159+
f"Missing file(s) in {in_dir}: " + ", ".join(missing)
160+
)
161+
162+
# Read tables
163+
expr = np.loadtxt(paths["expression"], delimiter="\t")
164+
genes = pd.read_csv(paths["genes"], sep="\t", header=0, index_col=0)
165+
samples = pd.read_csv(paths["samples"], sep="\t", header=0, index_col=0)
166+
167+
data = {
168+
"expression": expr,
169+
"genes": genes,
170+
"samples": samples
171+
}
172+
173+
check_data_model( data )
174+
175+
return data
176+
```
177177

178178

179179
---
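The missing-file check in `load_data` can be tried in isolation. The sketch below is an assumption-laden illustration: `find_missing` is a hypothetical helper (not part of the course code) that repeats only the folder and file checks, and it runs against a temporary folder containing a single file.

```python
import tempfile
from pathlib import Path


def find_missing(folder, required=("expression.tsv", "genes.tsv", "samples.tsv")):
    """Hypothetical helper: return the required file names absent from folder."""
    in_dir = Path(folder)
    if not in_dir.is_dir():
        raise NotADirectoryError(f"Not a folder: {in_dir}")
    return [name for name in required if not (in_dir / name).exists()]


with tempfile.TemporaryDirectory() as tmp:
    # Create only one of the three required files
    (Path(tmp) / "genes.tsv").write_text("gene\tsymbol\nGeneA\tA\n")
    missing = find_missing(tmp)

print(missing)
```

Reporting all missing files at once, as `load_data` does, saves the user from fixing one file only to hit the next error.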

docs/09-performance.md

Lines changed: 10 additions & 6 deletions
@@ -21,6 +21,7 @@ Writing code that *runs* is not enough — it must also run *fast enough*.
2121
```python
2222
import numpy as np
2323
import pandas as pd
24+
import math
2425

2526
url = "https://raw.githubusercontent.com/shambam/R_programming_1/main/Mouse_HSPC_reduced.txt"
2627
hspc_data = pd.read_csv(
@@ -120,7 +121,7 @@ def subset_genes(data, gene_idx):
120121
return {"expression": X2, "genes": genes2, "samples": samples2}
121122
hspc_data_tiny = subset_genes( hspc_data , np.arange(200) )
122123
check_data_model( hspc_data_tiny )
123-
hspc_data_tiny.shape
124+
hspc_data_tiny['expression'].shape
124125
```
125126

126127
```python
@@ -161,22 +162,23 @@ from scipy.spatial.distance import pdist, squareform
161162

162163
def vectorized_dist(data):
163164
"""
164-
data: a dist with "expression" - a numpy array (rows = features, cols = observations)
165+
data: a numpy array (rows = features, cols = observations)
165166
returns: full (nrow x nrow) Euclidean distance matrix as numpy array
166167
"""
167168
# upper triangle (condensed form)
168-
upper = pdist(data['expression'], metric="euclidean")
169+
upper = pdist(data, metric="euclidean")
169170

170171
# convert to full symmetric matrix
171172
D = squareform(upper)
172-
173+
173174
return D
174175
```
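To see what `pdist` and `squareform` return, here is a small sketch on three made-up points (the array `X` is an assumption chosen so the distances are easy to verify by hand):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three rows; each row is treated as one point
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

# Condensed form: distances for the row pairs (0,1), (0,2), (1,2)
upper = pdist(X, metric="euclidean")

# Full symmetric matrix with zeros on the diagonal
D = squareform(upper)

print(upper)
print(D)
```

The condensed vector holds n*(n-1)/2 values, while the full matrix holds n*n, so the square form is the part that becomes expensive in memory for many rows.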
175176

176177
I am sure by now you can check the run time without my help.
177178

178179
Would it be feasible to process all 4170 rows with this fastest function?
179180

181+
---
180182

181183
## The key lesson
182184

@@ -204,8 +206,10 @@ Why? Because even a for loop calls Python code repeatedly whereas the vectorized
204206

205207
# Exercise
206208

207-
Take the function ``zscore_rows`` and convert it from using a numpy ndarray to using our own data structure.
208-
While doing that change tha action to modifying the data in place.
209+
Take the function ``vectorized_dist`` and convert it from using a numpy ndarray to using our own data structure.
210+
While doing that we should think about storing this data in our object.
211+
Currently this neighbor graph is not something we need repeatedly, but it is rather costly to create.
212+
Instead of only returning the neighbor graph, store it in the dict, too.
209213

210214
**Note:** Mutable objects (like lists, dictionaries, and arrays) can be changed in place inside a function, and the caller sees the change; immutable objects (like numbers and strings) cannot. A useful rule of thumb: small objects like numbers or strings are cheap to copy, but potentially large ones like matrices or dictionaries should be modified in place rather than copied.
211215
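The difference between mutable and immutable arguments can be checked directly. This is one possible sketch, not the reference solution: `add_distances`, `bump`, and the dict key `"distances"` are hypothetical names chosen for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def add_distances(data):
    """Hypothetical helper: compute row distances and store them in the dict."""
    data["distances"] = squareform(pdist(data["expression"], metric="euclidean"))
    return data["distances"]


def bump(n):
    # Rebinding n only affects the local name, not the caller's variable
    n = n + 1
    return n


toy = {"expression": np.array([[0.0, 0.0], [3.0, 4.0]])}
D = add_distances(toy)

count = 1
bump(count)

print("distances" in toy)  # the dict was changed inside the function
print(count)               # the integer in the caller is unchanged
```

Because the dict is modified in place, a later call can check for the stored key and skip the costly recomputation.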

docs/10-variable-genes.md

Lines changed: 130 additions & 3 deletions
@@ -13,10 +13,134 @@ By the end of this section you will be able to:
1313

1414
This is a common preprocessing step before clustering and heatmaps.
1515

16+
---
17+
18+
# Improved start-up
19+
20+
So far we have written many small helper functions directly inside our notebooks.
21+
That works for experiments, but it quickly becomes messy:
22+
23+
* You need to copy-paste functions between notebooks
24+
* It is hard to reuse code
25+
* Mistakes fixed in one notebook do not automatically get fixed in others
26+
27+
A better approach is to store your functions in a **separate Python file** and import them when needed.
28+
29+
---
30+
31+
## Step 1: Create a functions file
32+
33+
In the Jupyter interface:
34+
35+
1. Look at the **file browser panel** on the left.
36+
2. Click the blue **“+”** button.
37+
3. Choose:
38+
39+
* **Other**
40+
* **Python File**
41+
4. Name the file:
42+
43+
```
44+
functions.py
45+
```
46+
47+
---
48+
49+
## Step 2: Collect your functions
50+
51+
Open `functions.py` and move all the following into it:
52+
53+
* All `import` statements
54+
* All helper functions you created in previous sections
55+
56+
For example:
57+
58+
```python
59+
import numpy as np
60+
import pandas as pd
61+
import matplotlib.pyplot as plt
62+
63+
64+
def dist(vec1, vec2):
65+
"""
66+
Euclidean distance between two vectors
67+
"""
68+
diff = vec1 - vec2
69+
return np.sum(diff**2) ** 0.5
70+
71+
72+
def get_top_variable_genes(data, top_n=500):
73+
"""
74+
Return top N most variable genes
75+
"""
76+
var = data.var(axis=1)
77+
top = var.sort_values(ascending=False).head(top_n)
78+
return data.loc[top.index]
79+
```
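A quick way to convince yourself that `get_top_variable_genes` does what it says is to run it on a tiny table. Note that, as written above, it expects a pandas DataFrame with genes as rows, not the dict data model. The `toy` table and its row names are made up for this sketch:

```python
import pandas as pd


def get_top_variable_genes(data, top_n=500):
    """Return top N most variable genes (rows) of a DataFrame."""
    var = data.var(axis=1)
    top = var.sort_values(ascending=False).head(top_n)
    return data.loc[top.index]


# Hypothetical toy table: 3 genes (rows) x 3 samples (columns)
toy = pd.DataFrame(
    {"s1": [1.0, 5.0, 0.0], "s2": [1.0, 5.5, 10.0], "s3": [1.0, 6.0, 20.0]},
    index=["flat", "mild", "wild"],
)

top2 = get_top_variable_genes(toy, top_n=2)
print(top2)
```

The constant row `flat` has zero variance and is dropped; the remaining rows come back ordered from most to least variable.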
80+
81+
You can keep adding functions to this file as the course continues.
82+
83+
---
84+
85+
## Step 3: Import the functions into your notebook
86+
87+
At the top of your notebook, simply write:
88+
89+
```python
90+
from functions import *
91+
```
92+
93+
This will:
94+
95+
* Load all libraries imported in `functions.py`
96+
* Load all functions you defined there
97+
* Make them available in your notebook
98+
99+
---
100+
101+
## Important note (very useful in Jupyter)
102+
103+
If you change `functions.py`, Jupyter **does not automatically reload it**.
104+
105+
To reload it, run:
106+
107+
```python
108+
import importlib
109+
import functions
110+
importlib.reload(functions)
111+
from functions import *
112+
```
113+
114+
Or simply restart the kernel.
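The reason the reload recipe ends with a second `from functions import *` is that names imported earlier keep pointing at the old function objects. The sketch below demonstrates this with a throwaway module written to a temporary folder; the module name `demo_functions_reload` and the function `answer` are assumptions made up for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Avoid stale bytecode caches interfering with this quick demo
sys.dont_write_bytecode = True

# Write a throwaway module into a temporary folder and make it importable
tmp = Path(tempfile.mkdtemp())
module_file = tmp / "demo_functions_reload.py"
module_file.write_text("def answer():\n    return 1\n")
sys.path.insert(0, str(tmp))
importlib.invalidate_caches()

import demo_functions_reload
from demo_functions_reload import answer as old_answer

# Simulate editing the file in the Jupyter editor
module_file.write_text("def answer():\n    return 2  # edited version\n")

importlib.reload(demo_functions_reload)
from demo_functions_reload import answer as new_answer

print(old_answer())  # still the function imported before the edit
print(new_answer())  # the reloaded function
```

Restarting the kernel sidesteps this entirely, at the cost of rerunning the notebook.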
115+
116+
---
117+
118+
## Why this is good practice
119+
120+
This approach:
121+
122+
* Keeps notebooks clean
123+
* Encourages code reuse
124+
* Makes debugging easier
125+
* Is closer to how real projects are structured
126+
127+
Later, this idea naturally grows into:
128+
129+
* Python modules
130+
* Packages
131+
* Reusable analysis libraries
132+
133+
134+
16135
---
17136

18137
# Import libraries
19138

139+
These are the libraries used by the functions described here.
140+
When you put a new function into ``functions.py``, make sure that
141+
all necessary libraries are also loaded in the file.
142+
143+
20144
```python
21145
import numpy as np
22146
import pandas as pd
@@ -139,7 +263,7 @@ print(hspc_zs.std(axis=1, ddof=1).head())
139263

140264
# Visual check: boxplots before and after scaling
141265

142-
There is a boxplot function in pandas, but our dict strores the data as numpy array.
266+
There is a boxplot function in pandas, but our dict stores the data as a numpy array.
143267
The easiest approach here is to define one more function that converts our dict to a pandas DataFrame, the reverse of our earlier from_df function:
144268

145269
```python
@@ -151,7 +275,7 @@ def expression_df(data):
151275
)
152276
```
153277

154-
Before:
278+
Before z-scoring:
155279

156280
```python
157281
expression_df(hspc_var).T.boxplot(rot=90)
@@ -172,7 +296,10 @@ plt.show()
172296
# Exercise
173297

174298
We used ``expression_df(hspc_var).T`` above, but we will likely also need a transpose for our own data structure.
175-
Implement a ``def transform()`` that returns a transformed data structure.
299+
300+
1. Implement a ``def transpose()`` that returns a transposed data structure.
301+
2. Change the ``zscore_rows()`` function to accept our dict.
302+
Also change the function to z-score in place. We could store the mean and std per gene in the data dict, too.
176303

177304
---
178305
