sccbioinformatics
diff --git a/docs/07-functions.md b/docs/07-functions.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/08-io.md b/docs/08-io.md
Lines changed: 40 additions & 40 deletions
diff --git a/docs/09-performance.md b/docs/09-performance.md
Lines changed: 10 additions & 6 deletions
diff --git a/docs/10-variable-genes.md b/docs/10-variable-genes.md
Lines changed: 130 additions & 3 deletions
@@ -56,7 +56,7 @@ def get_gene(data, gene):
     """
 
     g_idx = data["genes"].index.get_loc(gene)
-    return data["expression"][:, g_idx]
+    return data["expression"][g_idx]
 ```
 
 Try it:
 
@@ -134,46 +134,46 @@ It should:
 ---
 
 ??? exercise "Solution: load_data(folder)"
- ```python
- def load_data(folder):
-    in_dir = Path(folder)
-
-    # Folder must exist
-    if not in_dir.exists():
-        raise FileNotFoundError(f"Folder not found: {in_dir}")
-
-    if not in_dir.is_dir():
-        raise NotADirectoryError(f"Not a folder: {in_dir}")
-
-    # Required files (fixed)
-    paths = {
-        "expression": in_dir / "expression.tsv",
-        "genes": in_dir / "genes.tsv",
-        "samples": in_dir / "samples.tsv",
-    }
-
-    # Check missing files
-    missing = [name for name, p in paths.items() if not p.exists()]
-    if len(missing) > 0:
-        raise FileNotFoundError(
-            f"Missing file(s) in {in_dir}: " + ", ".join(missing)
-        )
-
-    # Read tables
-    expr = np.loadtxt(paths["expression"], delimiter="\t")
-    genes = pd.read_csv(paths["genes"], sep="\t", header=0, index_col=0)
-    samples = pd.read_csv(paths["samples"], sep="\t", header=0, index_col=0)
-
-    data = {
-        "expression": expr,
-        "genes": genes,
-        "samples": samples
-    }
-
-    check_data_model( data )
-
-    return data
- ```
+     ```python
+     def load_data(folder):
+        in_dir = Path(folder)
+
+        # Folder must exist
+        if not in_dir.exists():
+            raise FileNotFoundError(f"Folder not found: {in_dir}")
+
+        if not in_dir.is_dir():
+            raise NotADirectoryError(f"Not a folder: {in_dir}")
+
+        # Required files (fixed)
+        paths = {
+            "expression": in_dir / "expression.tsv",
+            "genes": in_dir / "genes.tsv",
+            "samples": in_dir / "samples.tsv",
+        }
+
+        # Check missing files
+        missing = [name for name, p in paths.items() if not p.exists()]
+        if len(missing) > 0:
+            raise FileNotFoundError(
+                f"Missing file(s) in {in_dir}: " + ", ".join(missing)
+            )
+
+        # Read tables
+        expr = np.loadtxt(paths["expression"], delimiter="\t")
+        genes = pd.read_csv(paths["genes"], sep="\t", header=0, index_col=0)
+        samples = pd.read_csv(paths["samples"], sep="\t", header=0, index_col=0)
+
+        data = {
+            "expression": expr,
+            "genes": genes,
+            "samples": samples
+        }
+
+        check_data_model( data )
+
+        return data
+     ```
 
 
 ---
 
@@ -21,6 +21,7 @@ Writing code that *runs* is not enough — it must also run *fast enough*.
 ```python
 import numpy as np
 import pandas as pd
+import math
 
 url = "https://raw.githubusercontent.com/shambam/R_programming_1/main/Mouse_HSPC_reduced.txt"
 hspc_data = pd.read_csv(
@@ -120,7 +121,7 @@ def subset_genes(data, gene_idx):
     return {"expression": X2, "genes": genes2, "samples": samples2}
 hspc_data_tiny = subset_genes(  hspc_data , np.arange(200) )
 check_data_model( hspc_data_tiny )
-hspc_data_tiny.shape
+hspc_data_tiny['expression'].shape
 ```
 
 ```python
@@ -161,22 +162,23 @@ from scipy.spatial.distance import pdist, squareform
 
 def vectorized_dist(data):
     """
-    data: a dist with "expression" - a numpy array (rows = features, cols = observations)
+    data: a numpy array (rows = features, cols = observations)
     returns: full (nrow x nrow) Euclidean distance matrix as numpy array
     """
     # upper triangle (condensed form)
-    upper = pdist(data['expression'], metric="euclidean")
+    upper = pdist(data, metric="euclidean")
 
     # convert to full symmetric matrix
     D = squareform(upper)
-
+    
     return D
 ```
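
As a small illustration of the two steps inside `vectorized_dist`, the sketch below runs `pdist` and `squareform` on a made-up 3 x 2 array (the values are invented for this example and are not course data). `pdist` returns only the n(n-1)/2 pairwise distances in condensed form, and `squareform` expands them into the full symmetric matrix with zeros on the diagonal.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy input (assumed values): 3 rows, 2 columns
toy = np.array([[0.0, 0.0],
                [3.0, 4.0],
                [6.0, 8.0]])

# Condensed form: one distance per row pair, in the order (0,1), (0,2), (1,2)
condensed = pdist(toy, metric="euclidean")

# Full form: symmetric 3 x 3 matrix with zeros on the diagonal
full = squareform(condensed)

print(condensed)     # distances 5, 10 and 5
print(full.shape)    # (3, 3)
```

The condensed vector holds each pair once, so it needs roughly half the memory of the full matrix.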
 
 I am sure by now you can check the run time without my help.
 
 Would it be feasible to process all 4170 rows with this fastest function?
 
+---
 
 ## The key lesson
 
@@ -204,8 +206,10 @@ Why? Because even a for loop calls Python code repeatedly whereas the vectorized
 
 # Exercise
 
-Take the function ``zscore_rows`` and convert it from using a numpy ndarray to using our own data structure.
-While doing that change tha action to modifying the data in place. 
+Take the function ``vectorized_dist`` and convert it from using a numpy ndarray to using our own data structure.
+While doing that, think about storing the result in our object.
+Currently we do not need this neighbor graph repeatedly, but it is rather costly to create.
+Instead of only returning the neighbor graph, store it in the dict, too.
 
 **Note:** Mutable objects (like lists, dictionaries, and arrays) can be changed inside a function, while immutable objects (like numbers and strings) cannot. Think of it this way: small objects like numbers or strings can be copied, but potentially large ones like matrices or dictionaries should not be copied.
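
A minimal sketch of the difference described in the note. The helper functions `add_one` and `add_key` and the placeholder values are made up for this illustration and are not part of the course code.

```python
def add_one(number):
    # Rebinds the local name only; the caller's integer stays unchanged
    number = number + 1
    return number


def add_key(data):
    # Modifies the caller's dict in place; nothing is copied
    data["dist"] = "stored result"


n = 5
add_one(n)          # return value is ignored on purpose

d = {"expression": "matrix"}
add_key(d)          # no return value needed

print(n)   # 5
print(d)   # {'expression': 'matrix', 'dist': 'stored result'}
```

The dict gained a new key without being returned, which is the behavior an in-place function relies on.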
 
 
@@ -13,10 +13,134 @@ By the end of this section you will be able to:
 
 This is a common preprocessing step before clustering and heatmaps.
 
+---
+
+# Improved start-up
+
+So far we have written many small helper functions directly inside our notebooks.
+That works for experiments, but it quickly becomes messy:
+
+* You need to copy-paste functions between notebooks
+* It is hard to reuse code
+* Mistakes fixed in one notebook do not automatically get fixed in others
+
+A better approach is to store your functions in a **separate Python file** and import them when needed.
+
+---
+
+## Step 1: Create a functions file
+
+In the Jupyter interface:
+
+1. Look at the **file browser panel** on the left.
+2. Click the blue **“+”** button.
+3. Choose:
+
+   * **Other**
+   * **Python File**
+4. Name the file:
+
+```
+functions.py
+```
+
+---
+
+## Step 2: Collect your functions
+
+Open `functions.py` and move all the following into it:
+
+* All `import` statements
+* All helper functions you created in previous sections
+
+For example:
+
+```python
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+
+
+def dist(vec1, vec2):
+    """
+    Euclidean distance between two vectors
+    """
+    diff = vec1 - vec2
+    return np.sum(diff**2) ** 0.5
+
+
+def get_top_variable_genes(data, top_n=500):
+    """
+    Return top N most variable genes
+    """
+    var = data.var(axis=1)
+    top = var.sort_values(ascending=False).head(top_n)
+    return data.loc[top.index]
+```
+
+You can keep adding functions to this file as the course continues.
+
+---
+
+## Step 3: Import the functions into your notebook
+
+At the top of your notebook, simply write:
+
+```python
+from functions import *
+```
+
+This will:
+
+* Load all libraries defined in `functions.py`
+* Load all functions you defined there
+* Make them available in your notebook
+
+---
+
+## Important note (very useful in Jupyter)
+
+If you change `functions.py`, Jupyter **does not automatically reload it**.
+
+To reload it, run:
+
+```python
+import importlib
+import functions
+importlib.reload(functions)
+from functions import *
+```
+
+Or simply restart the kernel.
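
The following self-contained sketch shows why the `from functions import *` line is repeated after the reload. It writes a throwaway module into a temporary folder; the module name `demo_functions` and the function `version` are hypothetical and only used for this demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip bytecode caching so a quick rewrite of the file is always picked up
sys.dont_write_bytecode = True

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "demo_functions.py"   # hypothetical module
    module_path.write_text("def version():\n    return 'first'\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()

    import demo_functions
    from demo_functions import version
    before = version()

    # Edit the file, as you would edit functions.py in the file browser
    module_path.write_text("def version():\n    return 'second edit'\n")

    importlib.reload(demo_functions)
    still_old = version()                  # this name still points to the old function

    from demo_functions import version     # re-import to pick up the new one
    after = version()

    sys.path.remove(tmp)

print(before, "|", still_old, "|", after)
```

Reloading updates the module object, but names that were imported earlier keep pointing at the old functions until they are imported again.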
+
+---
+
+## Why this is good practice
+
+This approach:
+
+* Keeps notebooks clean
+* Encourages code reuse
+* Makes debugging easier
+* Is closer to how real projects are structured
+
+Later, this idea naturally grows into:
+
+* Python modules
+* Packages
+* Reusable analysis libraries
+
+
+
 ---
 
 # Import libraries
 
+These are the libraries used by the functions described here.
+When you add a new function to ``functions.py``, make sure that
+all libraries it needs are also imported in that file.
+
+
 ```python
 import numpy as np
 import pandas as pd
@@ -139,7 +263,7 @@ print(hspc_zs.std(axis=1, ddof=1).head())
 
 # Visual check: boxplots before and after scaling
 
-There is a boxplot function in pandas, but our dict strores the data as numpy array.
+There is a boxplot function in pandas, but our dict stores the data as a numpy array.
 The easiest solution here is to define one more function that converts our dict to a pandas DataFrame - the reverse of our from_df function earlier:
 
 ```python
@@ -151,7 +275,7 @@ def expression_df(data):
     )
 ```
 
-Before:
+Before z-scoring:
 
 ```python
 expression_df(hspc_var).T.boxplot(rot=90)
@@ -172,7 +296,10 @@ plt.show()
 # Exercise
 
 We used ``expression_df(hspc_var).T`` there, but we likely also need a transform for our own data.
-Implement a ``def transform()`` that returns a transformed data structure. 
+
+ 1. Implement a ``def transpose()`` that returns a transposed data structure.
+ 2. Change the ``zscore_rows()`` function to accept our dict.
+    Also change the function to z-score in place. We could store the mean and std per gene in the data dict, too.
 
 ---