Data internals changed

stela2502 · stela2502 · commit d0ab147bfa6a · 2026-02-10T14:51:22.000+01:00
diff --git a/docs/06-flow.md b/docs/06-flow.md
@@ -60,7 +60,7 @@ This prints numbers from 0 to 4.
 Print each gene name from the expression table:
 
 ```python
-for gene in data["expression"].index:
+for gene in data["genes"].index:
     print(gene)
 ```
 
@@ -71,7 +71,7 @@ for gene in data["expression"].index:
 Print each sample name:
 
 ```python
-for sample in data["expression"].columns:
+for sample in data["samples"].index:
     print(sample)
 ```
 
@@ -85,7 +85,7 @@ Print the first 2 gene rows (as a reminder of indexing):
 expr = data["expression"]
 
 for i in range(2):
-    print(expr.iloc[i, :])
+    print(expr[i, :])
 ```
 
 ---
@@ -128,8 +128,8 @@ expr = data["expression"]
 
 gene_means = []
 
-for gene in expr.index:
-    m = expr.loc[gene, :].mean()
+for gene in range(len(expr)):
+    m = expr[gene].mean()
     gene_means.append(m)
 
 gene_means
@@ -150,8 +150,8 @@ expr = data["expression"]
 
 sample_means = []
 
-for sample in expr.columns:
-    m = expr.loc[:, sample].mean()
+for sample in range(expr.shape[1]):
+    m = expr[:, sample].mean()
     sample_means.append(m)
 
 sample_means
diff --git a/docs/09-performance.md b/docs/09-performance.md
@@ -150,8 +150,10 @@ def vectorized_dist_mat(mat):
 
 I am sure by now you can check the run time without my help.
 
+Would it be feasible to process all 4170 rows with this fastest function?
 
-# The key lesson
+
+## The key lesson
 
 For numerical work:
 
@@ -197,6 +199,8 @@ arr = np.array(hspc_data)
 print("numpy:", t_fast)
 ```
 
+**Take-home message:** If NumPy has a function for your problem, use it!
+
 ---
 
 # Interpreting the result
@@ -224,31 +228,6 @@ But for large biological matrices (genes x samples), prefer vectorization.
 
 # Exercise
 
-1. Time these two operations on `gmp_data`:
-
-   - `calc_mean_and_var_slow(gmp_data)`
-   - `calc_mean_and_var_fast(gmp_data)`
-
-2. Write down the speedup factor:
-
-   ```python
-   t_slow / t_fast
-   ```
-
-3. Explain in one sentence why the fast method is faster.
-
----
-
-# Why this matters for bioinformatics
-
-Bioinformatics data matrices can contain:
-
-- 20,000 genes
-- 10,000+ cells or samples
-
-A slow method can take minutes or hours.  
-A vectorized method can take seconds.
-
-Understanding this difference is one of the most valuable skills in scientific programming.
+Just keep the `timed` function and apply it later on whenever you like ;-)
 
 In the next section, we will apply these ideas to real expression data: selecting variable genes and scaling (z-scores).