We analyse Zebrafish LD, NMDA and control scRNASeq samples using scanpy.
These violin plots visualize cell-level quality metrics before and after filtering low-quality cells.
This plot shows raw quality control metrics (e.g., number of genes per cell, mitochondrial percentage) before filtering. It helps ide.pngy thresholds for removing low-quality or dead cells.
This plot shows the same QC metrics after filtering. It confirms that poor-quality cells were successfully removed based on chosen thresholds.
This UMAP plot shows zebrafish single-cell data after batch effect correction using Harmony.
By aligning shared biological structure across different batches or samples, Harmony improves the clustering and visualization. Cells are now grouped primarily by biological similarity rather than technical batch differences, enabling more accurate downstream analysis.
Below are the UMAP plots split by individual samples after reclustering:
Below are the UMAP plots split by individual samples after reclustering:
ππππππππππππππππππππππππππππππππππππππππππππ
ππππππππππππππππππππππππππππππππππππππππππππ
| Sample | MG | MGPC | PR precursors | Rod | Cones | BC | AC | HC | RGC | Microglia_ImmuneCells | RPE | Melanocyte | Endothelial | Perycites | Oligodenrocyte |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Control | 6.49 | 1.13 | 0.88 | 4.76 | 7.54 | 46.85 | 10.24 | 13.40 | 3.20 | 1.74 | 1.34 | 0.27 | 0.57 | 1.09 | 0.51 |
| LD | 43.12 | 15.54 | 14.44 | 7.90 | 3.32 | 4.86 | 3.05 | 6.17 | 1.40 | 0.14 | 0.03 | 0.00 | 0.00 | 0.02 | 0.02 |
| NMDA | 29.89 | 14.74 | 13.87 | 13.74 | 2.75 | 5.50 | 9.23 | 6.94 | 2.96 | 0.22 | 0.00 | 0.02 | 0.02 | 0.05 | 0.07 |
| renamed_samples | Number of cells |
|---|---|
| Control | 134,237 |
| NMDA | 15,424 |
| LD | 11,214 |
The following UMAP plots illustrate the clustering of subtypes within each major retinal cell type. These visualizations highlight transcriptional diversity captured by single-cell RNA sequencing.
View stats summary for AC here
view top 50 in AC subtypes DGE for 0-44 clusters
View top 50 in AC subtypes DGE for annotated clusters
| cluster | Control | LD | NMDA |
|---|---|---|---|
| 0 | 318 (0.08) | 18 (0.12) | 68 (0.158) |
| 1 | 293 (0.073) | 8 (0.053) | 25 (0.058) |
| 2 | 255 (0.064) | 18 (0.12) | 33 (0.077) |
| 3 | 199 (0.05) | 1 (0.007) | 23 (0.053) |
| 4 | 181 (0.045) | 11 (0.073) | 16 (0.037) |
| 5 | 161 (0.04) | 12 (0.08) | 24 (0.056) |
| 6 | 178 (0.044) | 2 (0.013) | 13 (0.03) |
| 7 | 153 (0.038) | 10 (0.067) | 24 (0.056) |
| 8 | 123 (0.031) | 13 (0.087) | 18 (0.042) |
| 9 | 124 (0.031) | 0 (0.0) | 4 (0.009) |
| 10 | 111 (0.028) | 5 (0.033) | 7 (0.016) |
| 11 | 110 (0.028) | 3 (0.02) | 10 (0.023) |
| 12 | 117 (0.029) | 0 (0.0) | 2 (0.005) |
| 13 | 99 (0.025) | 4 (0.027) | 15 (0.035) |
| 14 | 109 (0.027) | 1 (0.007) | 6 (0.014) |
| 15 | 108 (0.027) | 2 (0.013) | 4 (0.009) |
| 16 | 104 (0.026) | 1 (0.007) | 5 (0.012) |
| 17 | 94 (0.024) | 1 (0.007) | 10 (0.023) |
| 18 | 97 (0.024) | 1 (0.007) | 5 (0.012) |
| 19 | 82 (0.02) | 4 (0.027) | 8 (0.019) |
| 20 | 79 (0.02) | 3 (0.02) | 11 (0.026) |
| 21 | 85 (0.021) | 0 (0.0) | 5 (0.012) |
| 22 | 82 (0.02) | 0 (0.0) | 5 (0.012) |
| 23 | 69 (0.017) | 3 (0.02) | 6 (0.014) |
| 24 | 65 (0.016) | 3 (0.02) | 7 (0.016) |
| 25 | 65 (0.016) | 0 (0.0) | 8 (0.019) |
| 26 | 70 (0.018) | 0 (0.0) | 3 (0.007) |
| 27 | 59 (0.015) | 1 (0.007) | 6 (0.014) |
| 28 | 62 (0.016) | 2 (0.013) | 1 (0.002) |
| 29 | 49 (0.012) | 6 (0.04) | 10 (0.023) |
| 30 | 36 (0.009) | 1 (0.007) | 27 (0.063) |
| 31 | 45 (0.011) | 3 (0.02) | 4 (0.009) |
| 32 | 46 (0.012) | 0 (0.0) | 2 (0.005) |
| 33 | 32 (0.008) | 8 (0.053) | 7 (0.016) |
| 34 | 41 (0.01) | 2 (0.013) | 2 (0.005) |
| 35 | 41 (0.01) | 1 (0.007) | 2 (0.005) |
| 36 | 31 (0.008) | 0 (0.0) | 1 (0.002) |
| 37 | 27 (0.007) | 2 (0.013) | 3 (0.007) |
This volcano plot visualizes all genes, showing the relationship between the magnitude of change (log2 fold change) and statistical significance (-log10 adjusted p-value) for each gene. The top 5 genes with the most pronounced differences between Control and the rest are highlighted with arrows for clarity.
-
Red line on volcano plot = p-value of 0.05 (standard significance)
-
Grey line = no change (fold change = 0)
Diffusion map is a dimensionality reduction method that identifies continuous axes of variation in single-cell data based on cell-to-cell similarity, capturing trajectories or state changes rather than discrete clusters.
Figure 1: Diffusion Map (DC1 vs DC2)
- Cells from Control, LD, and NMDA conditions are largely mixed together.
- No distinct separation is observed along DC1 or DC2.
- Interpretation: The dominant transcriptional variation in this cell type is not driven by LD or NMDA.
Figure 2: Violin Plot of DC1
- DC1 distributions for Control, LD, and NMDA overlap almost completely.
- Medians and spreads are similar across conditions.
- Interpretation: Confirms the diffusion map result β neither LD nor NMDA induce a strong shift along the main diffusion axis.
Conclusion: In this cell type, the treatments do not drive major transcriptional differences detectable by diffusion map.
This plot shows cell density along pseudotime, creating a temporal pseudo-surface of the cell population.
Its purpose is to identify:
- Where cells accumulate along a biological trajectory
- Potential stable states
- Possible transition checkpoints
- Regions where cells progress slowly or rapidly
In short:
It shows which transcriptional states are most populated along the inferred trajectory.
Represents progression through a biological process (e.g., differentiation).
Used to vertically separate cells for visualization; not biologically meaningful.
It allows the density of cells along pseudotime to form a smooth surface.
- Brighter regions = higher cell density
- Darker regions = lower cell density
There is a very strong accumulation of cells early in pseudotime, with the highest density around approximately 0.05β0.1. This suggests that many cells occupy an early transcriptional state.
A secondary, lower-density enrichment appears around approximately 0.2β0.25, indicating another populated intermediate state.
There is a much weaker and more diffuse signal around approximately 0.5, suggesting relatively few cells reach mid-to-late pseudotime states.
After approximately 0.6, cell density becomes very low, indicating that only a small fraction of cells progress to late pseudotime states.
Overall, this pattern suggests:
- A large early population
- One or two intermediate states
- Progressive depletion of cells along the trajectory
- Late states that may be transient or reached by only a subset of cells
This plot shows a pseudo-spatial distribution of cells using known marker genes for specific anatomical axes.
Its purpose is to:
- Estimate relative positioning of cells along Nasal β Temporal (X-axis) and Ventral β Dorsal (Y-axis) axes
- Visualize cell density across inferred pseudo-regions
- Provide a spatial context for transcriptional states, even without true spatial coordinates
Note: This is an inferred pseudo-spatial map based on gene expression patterns; it does not represent actual measured cell positions.
--nasal_genes Rbpms Pou4f1 neurolin brn3b isl1 atoh7 foxg1a dmbx1 lhx2 ebf2 eomesa barhl2 --temporal_genes Pou4f2 Thy1 ephb4a ephb4b cox1 cyp26a1 epha4b nr2f2 nr2f6b bmp4 --ventral_genes Tubb3 Nefl barhl1a cxcr4b vsx2 pitx2 bmp7 shh ptc1 gli1 vax2 tbx5 --dorsal_genes Sncg Gap43 epha7a epha7b wnt2ba wnt4a lhx9 nr2e1 tcf7l2 pax6a pax6b
-
X-axis (Pseudo Nasal β Temporal)
Derived from the difference in expression between genes enriched in nasal vs temporal regions.- Cells with X β 0 are more βnasal-likeβ
- Cells with X β 1 are more βtemporal-likeβ
-
Y-axis (Pseudo Ventral β Dorsal)
Derived from the difference in expression between ventral vs dorsal markers.- Cells with Y β 0 are more βventral-likeβ
- Cells with Y β 1 are more βdorsal-likeβ
-
Color / intensity (Cell Density)
The plot uses a 2D hexbin to show where cells accumulate on the pseudo-spatial surface.
Brighter regions indicate higher density.
- Clusters of cells in the plot suggest groups with similar inferred spatial identity.
- Gradients along X and Y axes indicate relative spatial transitions inferred from gene expression.
- This allows for visual comparison of cell distributions across pseudo-anatomical axes.
- Not true spatial positions: Cells were dissociated, so their real Nasal / Temporal / Ventral / Dorsal locations are lost.
- Accuracy depends on the quality and specificity of the chosen marker genes.
- Only provides relative, inferred positions along the axes, not absolute anatomical coordinates.
Conceptually:
- Input: Each cell is represented by its gene expression profile (many genes per cell).
- Training data: Control RGCs only (unlesioned, healthy baseline).
- Goal: Measure how Control-like each cell is, rather than explicitly classifying conditions.
- Approach (one-class model):
- Learns the normal expression manifold of Control RGCs.
- Does not use LD or NMDA cells during training.
- Assigns a fidelity score to every cell based on how well it fits the Control distribution.
This is a novelty / normality detection problem, not a binary classification problem.
- The model learns the structure and variability of Control cells only.
- For each cell (Control, LD, or NMDA), it computes a decision score:
- High score β cell lies close to the Control expression manifold.
- Low score β cell deviates from Control (lesioned / abnormal).
- Scores are normalized to 0β1 for visualization:
- 1 = highly Control-like
- 0 = strongly non-Control-like
Geometrically, you can think of this as learning a boundary around Control cells in high-dimensional gene-expression space, rather than drawing a line between two classes.
-
SVM vs Logistic Regression:
- Logistic regression draws one straight line between two labeled classes β it needs examples from both classes to learn, so it cannot be used if the goal is to measure deviation from Control without labeling LD or NMDA.
- One-Class SVM keeps drawing flexible boundaries (lines or surfaces) around the Control data to capture its full shape and variation β it learns the normal pattern of Control cells and does not require other classes.
-
Why SVM is better when training on Control only:
- It learns the normal distribution of Control cells, capturing the natural variation among healthy cells.
- Assigns a fidelity score showing how Control-like each cell is.
- Automatically flags any unknown or abnormal cells (LD/NMDA) as deviations, without needing labels for them.
-
Why training on Control only is better:
- The goal is to measure deviation from normal, so the model should only learn what βnormalβ looks like.
- Including LD or NMDA in training would blur the boundary, because the model would start treating abnormal patterns as normal.
- By training only on Control, the SVM keeps the boundary tight around healthy cells, ensuring fidelity scores are meaningful and deviations are accurately detected.
Here you go β the Markdown explanation in a code block:
It internally loops over all cell types. For each cell type, it performs the following steps:
-
Subset that cell type
- Extracts only the cells belonging to the current cell type from the full dataset.
-
Train a separate model (scaler + SVM) on Control of that cell type
- Scaling (mean=0, variance=1) is computed using only Control cells of that cell type.
- One-Class SVM is trained only on these Control cells.
-
Score treated cells of that cell type
- LD or NMDA cells of the same cell type are scored relative to their Control model.
-
Write results back
- The fidelity scores are inserted back into the global dataset, preserving separation between cell types.
-
Respects cell type biology
- Different cell types have different gene expression profiles. A single model across all cell types would produce meaningless scores.
-
Correct scaling and HVG selection
- Each cell type has its own variance structure and highly variable genes. Modeling per cell type ensures HVGs and scaling are accurate.
-
Independent, interpretable scores
- Fidelity scores always mean: βsimilar to Control of the same cell typeβ.
-
Error containment
- If one cell type fails or has too few cells, the other cell types are unaffected.
- Different cell types would be mixed in scaling and HVG selection, distorting the feature space.
- One SVM would learn a mixture of distributions, which is biologically meaningless.
- Fidelity scores would not reflect cell-type-specific similarity, making downstream analysis invalid.
Bottom line: Looping per cell type ensures biologically accurate, interpretable, and robust results.
Controls how complex the Control boundary is:
- Guess wrong = either too simple (miss real differences) or too complex (fit noise)
Controls expected % of outliers in Control:
- Guess wrong = either include weird Controls or exclude real Controls
You have to guess BOTH correctly without knowing your data's true structure. Isolation Forest avoids this guessing game.
-
Train on Control cells only
- Build hundreds of random trees
- Each tree: pick random gene β random threshold β split β repeat until single cells
- Trees know NOTHING about "Control" vs "anomaly"
-
Run ALL cells through trained trees
- Control cells (training data)
- LD cells
- NMDA cells
-
Measure path length for each cell
- Count how many splits until cell is isolated
- Average across all trees
-
Convert to fidelity score
- Long average path = hard to isolate = looks like Control = HIGH fidelity
- Short average path = easy to isolate = different from Control = LOW fidelity
Cosine similarity is not invariant to gene selection.
HVG-based analyses answer a different biological question than all-gene analyses.
Both approaches are valid β but they tell different stories, and that choice must be stated explicitly.
Watson, E. R., Mora, A., Taherian Fard, A., & Mar, J. C. (2022). How does the structure of data impact cellβcell similarity? Evaluating how structural properties influence the performance of proximity metrics in single cell RNA-seq data. Briefings in bioinformatics, 23(6), bbac387.
-
The script first calculates the average expression of each gene within each cell type, separately for the two conditions youβre comparing. So for every gene, you have an average expression value for each cell type under condition 1 and condition 2.
-
Once these averages are calculated, the script looks across all cell types to see how highly each gene is expressed in at least one cell type. It takes the maximum of these average expressions for each gene. This gives a single value per gene representing its highest mean expression in any cell type.
-
Then, the script filters out genes whose maximum mean expression is below the threshold you set. In other words, genes that are not expressed strongly in any cell type are removed from the analysis. Only genes that have at least one cell type where their mean expression exceeds the threshold are kept.
-
This way, the correlation analysis later is done only on genes that are meaningfully expressed, avoiding noise from genes that are essentially silent across all cell types.
python pearson_heatmap.py annotated_clustered_corrected_doubletRemoved_Zebrafishes.h5ad renamed_samples Control LD control_vs_ld_heatmap.png --min_mean_expr n| Cell Type | Control | LD |
|---|---|---|
| AC | β | β |
| BC | β | β |
| Cones | β | β |
| HC | β | β |
| MG | β | β |
| MGPC | β | β |
| Microglia_ImmuneCells | β | β |
| Oligodenrocyte | β | β |
| PR precursors | β | β |
| Perycites | β | β |
| RGC | β | β |
| RPE | β | β |
| Rod | β | β |
| Melanocyte | β | β |
| Endothelial | β | β |
sc.tl.rank_genes_groups(
adata_for_dge,
groupby=groupby, #groupby can be celltype or leiden ..etc
method="wilcoxon",
reference="rest",
use_raw=True
)
# Name Version Build Channel
scanpy 1.9.3 pypi_0 pypi
python print_h5ad.py annotated_clustered_corrected_doubletRemoved_Zebrafishes.h5ad
Loading AnnData object from: annotated_clustered_corrected_doubletRemoved_Zebrafishes.h5ad
==================================================
BASIC AnnData INFORMATION
==================================================
Overall shape: (160875, 24597) (cells: 160875, genes: 24597)
Observation names (cells): ['AAACCAAAGCTTAACG-1', 'AAACCAAAGTACGCCG-1', 'AAACCATTCGCTCATT-1', 'AAACCCGCACCTCACG-1', 'AAACCCGCATCGGGAC-1']...
Variable names (genes): ['fgfr1op2', 'si:dkey-21h14.12', 'si:dkey-285e18.2', 'znf1114', 'si:dkey-21h14.10']...
=== adata.X (Primary Expression Matrix) ===
Shape: (160875, 24597) (cells: 160875, genes: 24597)
Type: dense
Data type: float32
Min: -6.042159
Max: 381.777557
Mean: -0.001693
Std: 0.995661
Non-zero values: 3957042375
Zero values: 0
Sparsity: 0.0000 (0.00%)
Top 5 genes (first 5 rows):
fgfr1op2: min=-0.2939, max=7.9207, mean=-0.0006
si:dkey-21h14.12: min=-0.0168, max=101.5491, mean=-0.0000
si:dkey-285e18.2: min=-0.0079, max=182.2779, mean=0.0001
znf1114: min=-0.1110, max=18.6632, mean=0.0004
si:dkey-21h14.10: min=-0.0139, max=136.0288, mean=-0.0019
Top 5 cells (first 5 columns):
AAACCAAAGCTTAACG-1: min=-2.3688, max=124.4653, mean=-0.0308
AAACCAAAGTACGCCG-1: min=-3.3848, max=44.7331, mean=-0.0216
AAACCATTCGCTCATT-1: min=-2.0438, max=22.0059, mean=0.0395
AAACCCGCACCTCACG-1: min=-2.7271, max=35.6094, mean=0.0179
AAACCCGCATCGGGAC-1: min=-2.5089, max=22.6247, mean=0.0050
adata.raw type: <class 'anndata._core.raw.Raw'>
=== adata.raw.X (Raw Expression Matrix) ===
Shape: (160875, 24597) (cells: 160875, genes: 24597)
Type: sparse
Data type: float32
Min: 0.000000
Max: 8.389225
Mean: 0.113462
Std: 0.443252
Non-zero values: 294862321
Zero values: 3662180054
Sparsity: 0.9255 (92.55%)
Top 5 genes (first 5 rows):
fgfr1op2: min=0.0000, max=3.0251, mean=0.1080
si:dkey-21h14.12: min=0.0000, max=1.9556, mean=0.0003
si:dkey-285e18.2: min=0.0000, max=1.9756, mean=0.0001
znf1114: min=0.0000, max=2.6757, mean=0.0159
si:dkey-21h14.10: min=0.0000, max=2.1449, mean=0.0002
Top 5 cells (first 5 columns):
AAACCAAAGCTTAACG-1: min=0.0000, max=4.7681, mean=0.0959
AAACCAAAGTACGCCG-1: min=0.0000, max=5.3671, mean=0.0970
AAACCATTCGCTCATT-1: min=0.0000, max=5.0932, mean=0.1313
AAACCCGCACCTCACG-1: min=0.0000, max=6.1345, mean=0.1182
AAACCCGCATCGGGAC-1: min=0.0000, max=5.2011, mean=0.1099
==================================================
COMPARISON: adata.X vs adata.raw.X
==================================================
Same shape: True
Number of different elements: 3957042359
Identical matrices: False
==================================================
CATEGORICAL OBSERVATIONS
==================================================
Found 4 categorical columns:
Column: sample (8 unique values)
Values: ['Zebra', 'TH115', 'TH44', 'TH54', 'TH55', 'TH56', 'TH57', 'TH71']
---
Column: renamed_samples (3 unique values)
Values: ['LD', 'NMDA', 'Control']
---
Column: leiden (92 unique values)
Values: ['3', '57', '13', '0', '56', '51', '12', '25', '35', '11', '87', '8', '30', '18', '28', '10', '33', '52', '61', '6', '5', '89', '16', '4', '42', '24', '44', '32', '72', '48', '70', '41', '9', '1', '45', '47', '93', '17', '86', '26', '29', '59', '94', '23', '15', '31', '78', '14', '34', '20', '27', '22', '92', '91', '54', '7', '85', '67', '21', '60', '37', '64', '2', '55', '19', '46', '75', '68', '43', '38', '36', '65', '63', '39', '53', '88', '90', '84', '83', '50', '69', '40', '82', '71', '77', '62', '49', '73', '74', '79', '81', '95']
---
Column: celltype (15 unique values)
Values: ['PR precursors', 'HC', 'MG', 'MGPC', 'Rod', 'BC', 'Cones', 'AC', 'RGC', 'RPE', 'Microglia_ImmuneCells', 'Oligodenrocyte', 'Perycites', 'Endothelial', 'Melanocyte']
---
ADDITIONAL METADATA
==================================================
adata.obs columns: ['sample', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_genes', 'renamed_samples', 'predicted_doublets', 'leiden', 'celltype']
adata.var columns: ['mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std']
No additional layers found.
## How to run Snakemake
For dry run to check everythign before actual run:
snakemake -j1 -p --configfile config.yaml -n
For Actual run:
##### Update the configfile
The config files has the sample names with the suffix in samples.tsv
and the annotations in annotations.txt. Edit both files correspondingly. You can edit the config file to change their names.






































































































































































































































































































































































































































































































