Commit d0ab147

Data internals changed
1 parent e9b5bd2 commit d0ab147

2 files changed

Lines changed: 13 additions & 34 deletions

docs/06-flow.md

Lines changed: 7 additions & 7 deletions
@@ -60,7 +60,7 @@ This prints numbers from 0 to 4.
 Print each gene name from the expression table:

 ```python
-for gene in data["expression"].index:
+for gene in data["genes"].index:
     print(gene)
 ```

@@ -71,7 +71,7 @@ for gene in data["expression"].index:
 Print each sample name:

 ```python
-for sample in data["expression"].columns:
+for sample in data["samples"].index:
     print(sample)
 ```

@@ -85,7 +85,7 @@ Print the first 2 gene rows (as a reminder of indexing):
 expr = data["expression"]

 for i in range(2):
-    print(expr.iloc[i, :])
+    print(expr[i, :])
 ```

 ---
@@ -128,8 +128,8 @@ expr = data["expression"]

 gene_means = []

-for gene in expr.index:
-    m = expr.loc[gene, :].mean()
+for gene in range(len(expr)):
+    m = expr[gene].mean()
     gene_means.append(m)

 gene_means
@@ -150,8 +150,8 @@ expr = data["expression"]

 sample_means = []

-for sample in expr.columns:
-    m = expr.loc[:, sample].mean()
+for sample in range(expr.shape[1]):
+    m = expr[:, sample].mean()
     sample_means.append(m)

 sample_means
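
Reviewer note: the hunks in `docs/06-flow.md` swap label-based pandas access (`.iloc`, `.loc`, `.index`, `.columns`) for plain positional indexing, which suggests that `data["expression"]` is now a NumPy array with genes as rows and samples as columns. That layout is an assumption read off the diff, not something the commit message confirms. A minimal sketch under that assumption, using a made-up 2 x 3 array, shows the new indexing and the vectorized equivalents of the two loops:

```python
import numpy as np

# Hypothetical stand-in for data["expression"]: 2 genes (rows) x 3 samples (columns).
expr = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# Positional row access, as in the new tutorial code (an ndarray has no .iloc).
first_row = expr[0, :]

# Loop versions from the changed hunks.
gene_means = []
for gene in range(len(expr)):          # len() of a 2-D array is the row count
    gene_means.append(expr[gene].mean())

sample_means = []
for sample in range(expr.shape[1]):    # shape[1] is the column count
    sample_means.append(expr[:, sample].mean())

# Vectorized equivalents: axis=1 averages across columns (one value per gene),
# axis=0 averages across rows (one value per sample).
gene_means_vec = expr.mean(axis=1)
sample_means_vec = expr.mean(axis=0)

print(gene_means, sample_means)
```

If `data["expression"]` were still a pandas DataFrame, `expr[i, :]` would raise an error, so the surrounding text in the tutorial may need to state the new data type explicitly.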

docs/09-performance.md

Lines changed: 6 additions & 27 deletions
@@ -150,8 +150,10 @@ def vectorized_dist_mat(mat):

 I am sure by now you can check the run time without my help.

+Would it be feasible to process all 4170 rows with this fastest function?

-# The key lesson
+
+## The key lesson

 For numerical work:

@@ -197,6 +199,8 @@ arr = np.array(hspc_data)
 print("numpy:", t_fast)
 ```

+**Take Home:** If NumPy has a function for your problem, use it!
+
 ---

 # Interpreting the result
@@ -224,31 +228,6 @@ But for large biological matrices (genes x samples), prefer vectorization.

 # Exercise

-1. Time these two operations on `gmp_data`:
-
-- `calc_mean_and_var_slow(gmp_data)`
-- `calc_mean_and_var_fast(gmp_data)`
-
-2. Write down the speedup factor:
-
-```python
-t_slow / t_fast
-```
-
-3. Explain in one sentence why the fast method is faster.
-
----
-
-# Why this matters for bioinformatics
-
-Bioinformatics data matrices can contain:
-
-- 20,000 genes
-- 10,000+ cells or samples
-
-A slow method can take minutes or hours.
-A vectorized method can take seconds.
-
-Understanding this difference is one of the most valuable skills in scientific programming.
+Just keep the ``timed`` function and apply it later on whenever you like ;-)

 In the next section, we will apply these ideas to real expression data: selecting variable genes and scaling (z-scores).
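
Reviewer note: the new exercise text refers to a ``timed`` function that is not visible in this diff. The sketch below is a hypothetical stand-in (the tutorial's own `timed` may have a different signature), with made-up data and two illustrative functions, `row_means_loop` and `row_means_numpy`, to show the "use the NumPy function" take-home in practice:

```python
import time
import numpy as np

def timed(func, *args, repeats=5):
    """Hypothetical helper: return (result, best wall-clock seconds over `repeats` runs)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best

def row_means_loop(mat):
    """Mean of each row, computed with explicit Python loops."""
    means = []
    for row in mat:
        total = 0.0
        for value in row:
            total += value
        means.append(total / len(row))
    return means

def row_means_numpy(mat):
    """The same computation delegated to NumPy."""
    return mat.mean(axis=1)

rng = np.random.default_rng(0)
mat = rng.random((200, 500))   # made-up data, not the tutorial's matrices

slow_result, t_slow = timed(row_means_loop, mat)
fast_result, t_fast = timed(row_means_numpy, mat)
print("loop:", t_slow)
print("numpy:", t_fast)
```

Taking the best of several runs reduces noise from other processes; the actual timings depend on the machine, so no expected numbers are given here.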
