Zebrafish Project

We analyse Zebrafish LD, NMDA and control scRNASeq samples using scanpy.

Quality Control Plots

These violin plots visualize cell-level quality metrics before and after filtering low-quality cells.

`violin_QC.png`

This plot shows raw quality control metrics (e.g., number of genes per cell, mitochondrial percentage) before filtering. It helps ide.pngy thresholds for removing low-quality or dead cells.

`violin_AfterQC.png`

This plot shows the same QC metrics after filtering. It confirms that poor-quality cells were successfully removed based on chosen thresholds.

UMAP After Batch Effect Removal (Harmony)

`umap_sHarmonyzebrafishes.png`

This UMAP plot shows zebrafish single-cell data after batch effect correction using Harmony.

By aligning shared biological structure across different batches or samples, Harmony improves the clustering and visualization. Cells are now grouped primarily by biological similarity rather than technical batch differences, enabling more accurate downstream analysis.

UMAP per sample

Below are the UMAP plots split by individual samples after reclustering:

UMAP per sample

Below are the UMAP plots split by individual samples after reclustering:

Clustering

UMAP after clustering (resolution: default =1)

Dot plot for Marker Genes

Marker Genes UMAP

After annotation using marker genes

We need to increase the resolution to 2.5 to seperate the tiny branch of Rod Cells

🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑

🚨 Starting from here, we have plots of resolution 2.5 🚨

🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑

UMAP after clustering

After re-Clustering annotation using marker genes

Remove Celltype Cones_MG_MGPC_PostMitotic as likely doublet

Cell ratio after increasing resolution and re-annotations

Sample	MG	MGPC	PR precursors	Rod	Cones	BC	AC	HC	RGC	Microglia_ImmuneCells	RPE	Melanocyte	Endothelial	Perycites	Oligodenrocyte
Control	6.49	1.13	0.88	4.76	7.54	46.85	10.24	13.40	3.20	1.74	1.34	0.27	0.57	1.09	0.51
LD	43.12	15.54	14.44	7.90	3.32	4.86	3.05	6.17	1.40	0.14	0.03	0.00	0.00	0.02	0.02
NMDA	29.89	14.74	13.87	13.74	2.75	5.50	9.23	6.94	2.96	0.22	0.00	0.02	0.02	0.05	0.07

Umap for Control,LD, and NMDA merged

Umap per sample

No. of Cells per Sample

renamed_samples	Number of cells
Control	134,237
NMDA	15,424
LD	11,214

Mini Dotplot

Selected Markers UMAP per request

More Markers per request

Selected Markers Violin per request

Selected genes Comparisons between groups

elavl3/gad2/slc6a9 in AC

cabp5a/cabp5b in BC

prdm1a/nr2e3 in Rod and PR precursors

gnat2/arr3a in Cones

isl2b/rbpms2b in RGC

ompa/gad2 in HC

rho/guca1b in Rod and PR precursors

gap43/marcksa/marcksb/stmn2a/stmn2b/klf4/klf17/klf9/atoh7/pou4f2 in RGC

More gene comparisons

six7/pde6c in Cones

pde6a/pde6b/gnat1 in Rod

stmn4/tuba1b in RGC

per1b/nr1d1 in BC

UMAPs of Major Retinal Cell Subtypes

The following UMAP plots illustrate the clustering of subtypes within each major retinal cell type. These visualizations highlight transcriptional diversity captured by single-cell RNA sequencing.

Cones Photoreceptor Subtypes before batch correction

Cones Photoreceptor Subtypes after batch correction (STRICT PARAMETERS)

QC after batch correction (STRICT PARAMETERS)

Cones Marker Genes after batch correction

Amacrine Cell (AC) Subtypes

AC Subtypes after batch correction (STRICT PARAMETERS)

QC after batch correction (STRICT PARAMETERS)

AC Marker Genes after batch correction

AC clusters after removing clusters 11, 48, and 49

Recombining subclusters 1 and 45 in AC

Dotplot for subclusters in AC

More dotplots

vesicular glutamate transporters

mGlu receptors

monoamines

gasotransmitters

neuropeptide extended

Feature plots for subclusters in AC after recombining 1 and 45

More feature plots

Cell counts and cell ratio in AC subset

View stats summary for AC here

AC subtypes annotations

AC subtypes DGE

view top 50 in AC subtypes DGE for 0-44 clusters

Dot plot for top 2 in 0-44 clusters

Same plot but different order

View top 50 in AC subtypes DGE for annotated clusters

Retinal Ganglion Cell (RGC) Subtypes

RGC Subtypes after batch correction (STRICT PARAMETERS)

QC after batch correction (STRICT PARAMETERS)

RGC Marker Genes after batch correction

Horizontal Cell (HC) Subtypes

HC Subtypes after batch correction (STRICT PARAMETERS)

QC after batch correction (STRICT PARAMETERS)

HC Marker Genes after batch correction

HC clusters after removing clusters 14, 15, and 16

Bipolar Cell (BC) Subtypes

BC Marker Genes without batch correction

MG Subtypes

MG Subtypes after batch correction (STRICT PARAMETERS)

QC after batch correction (STRICT PARAMETERS)

MG marker Genes after batch correction

MG clusters after removing clusters 11 and 13

More DGE

Cones top 50 per subtype

Control vs LD

Control vs NMDA

LD vs NMDA

Cones top 50 not per cluster

Control vs LD

Control vs NMDA

LD vs NMDA

Dotplots

Cones Control vs NMDA

Cones Control vs LD

Cones LD vs NMDA

RGC top 50 per subtype

Control vs LD

Control vs NMDA

LD vs NMDA

RGC top 50 not per cluster

Control vs LD

Control vs NMDA

LD vs NMDA

Dotplots

RGC Control vs NMDA

RGC Control vs LD

RGC LD vs NMDA

RGC top 50 per subtype

RGC per subtype

RGC per subtype dotplot top marker

RGC Cell counts and sample-centered ratios per cluster per sample (count (ratio))

cluster	Control	LD	NMDA
0	318 (0.08)	18 (0.12)	68 (0.158)
1	293 (0.073)	8 (0.053)	25 (0.058)
2	255 (0.064)	18 (0.12)	33 (0.077)
3	199 (0.05)	1 (0.007)	23 (0.053)
4	181 (0.045)	11 (0.073)	16 (0.037)
5	161 (0.04)	12 (0.08)	24 (0.056)
6	178 (0.044)	2 (0.013)	13 (0.03)
7	153 (0.038)	10 (0.067)	24 (0.056)
8	123 (0.031)	13 (0.087)	18 (0.042)
9	124 (0.031)	0 (0.0)	4 (0.009)
10	111 (0.028)	5 (0.033)	7 (0.016)
11	110 (0.028)	3 (0.02)	10 (0.023)
12	117 (0.029)	0 (0.0)	2 (0.005)
13	99 (0.025)	4 (0.027)	15 (0.035)
14	109 (0.027)	1 (0.007)	6 (0.014)
15	108 (0.027)	2 (0.013)	4 (0.009)
16	104 (0.026)	1 (0.007)	5 (0.012)
17	94 (0.024)	1 (0.007)	10 (0.023)
18	97 (0.024)	1 (0.007)	5 (0.012)
19	82 (0.02)	4 (0.027)	8 (0.019)
20	79 (0.02)	3 (0.02)	11 (0.026)
21	85 (0.021)	0 (0.0)	5 (0.012)
22	82 (0.02)	0 (0.0)	5 (0.012)
23	69 (0.017)	3 (0.02)	6 (0.014)
24	65 (0.016)	3 (0.02)	7 (0.016)
25	65 (0.016)	0 (0.0)	8 (0.019)
26	70 (0.018)	0 (0.0)	3 (0.007)
27	59 (0.015)	1 (0.007)	6 (0.014)
28	62 (0.016)	2 (0.013)	1 (0.002)
29	49 (0.012)	6 (0.04)	10 (0.023)
30	36 (0.009)	1 (0.007)	27 (0.063)
31	45 (0.011)	3 (0.02)	4 (0.009)
32	46 (0.012)	0 (0.0)	2 (0.005)
33	32 (0.008)	8 (0.053)	7 (0.016)
34	41 (0.01)	2 (0.013)	2 (0.005)
35	41 (0.01)	1 (0.007)	2 (0.005)
36	31 (0.008)	0 (0.0)	1 (0.002)
37	27 (0.007)	2 (0.013)	3 (0.007)

Rod top 50 not per cluster

Control vs NMDA

Control vs LD

LD vs NMDA

Dotplots

Volcano plots

This volcano plot visualizes all genes, showing the relationship between the magnitude of change (log2 fold change) and statistical significance (-log10 adjusted p-value) for each gene. The top 5 genes with the most pronounced differences between Control and the rest are highlighted with arrows for clarity.

Red line on volcano plot = p-value of 0.05 (standard significance)
Grey line = no change (fold change = 0)

Diffusion Map Analysis of Cell Type Across Treatments

Diffusion map is a dimensionality reduction method that identifies continuous axes of variation in single-cell data based on cell-to-cell similarity, capturing trajectories or state changes rather than discrete clusters.

RGC example

Figure 1: Diffusion Map (DC1 vs DC2)

Cells from Control, LD, and NMDA conditions are largely mixed together.
No distinct separation is observed along DC1 or DC2.
Interpretation: The dominant transcriptional variation in this cell type is not driven by LD or NMDA.

Figure 2: Violin Plot of DC1

DC1 distributions for Control, LD, and NMDA overlap almost completely.
Medians and spreads are similar across conditions.
Interpretation: Confirms the diffusion map result — neither LD nor NMDA induce a strong shift along the main diffusion axis.

Conclusion: In this cell type, the treatments do not drive major transcriptional differences detectable by diffusion map.

Temporal Pseudo-Surface Map of Retinal Cells Along Pseudotime

Purpose of the Plot

This plot shows cell density along pseudotime, creating a temporal pseudo-surface of the cell population.

Its purpose is to identify:

Where cells accumulate along a biological trajectory
Potential stable states
Possible transition checkpoints
Regions where cells progress slowly or rapidly

In short:

It shows which transcriptional states are most populated along the inferred trajectory.

Example in RGC

What the Plot Shows

X-axis (Pseudotime 0 → 1)

Represents progression through a biological process (e.g., differentiation).

Y-axis (Visual Jitter / Surface)

Used to vertically separate cells for visualization; not biologically meaningful.
It allows the density of cells along pseudotime to form a smooth surface.

Color Intensity (Density)

Brighter regions = higher cell density
Darker regions = lower cell density

Interpretation of the Observed Pattern

There is a very strong accumulation of cells early in pseudotime, with the highest density around approximately 0.05–0.1. This suggests that many cells occupy an early transcriptional state.

A secondary, lower-density enrichment appears around approximately 0.2–0.25, indicating another populated intermediate state.

There is a much weaker and more diffuse signal around approximately 0.5, suggesting relatively few cells reach mid-to-late pseudotime states.

After approximately 0.6, cell density becomes very low, indicating that only a small fraction of cells progress to late pseudotime states.

Overall, this pattern suggests:

A large early population
One or two intermediate states
Progressive depletion of cells along the trajectory
Late states that may be transient or reached by only a subset of cells

Pseudo-Spatial Surface Map Using Marker Genes

This plot shows a pseudo-spatial distribution of cells using known marker genes for specific anatomical axes.

Its purpose is to:

Estimate relative positioning of cells along Nasal → Temporal (X-axis) and Ventral → Dorsal (Y-axis) axes
Visualize cell density across inferred pseudo-regions
Provide a spatial context for transcriptional states, even without true spatial coordinates

Note: This is an inferred pseudo-spatial map based on gene expression patterns; it does not represent actual measured cell positions.

Used the following spatial Markers

--nasal_genes Rbpms Pou4f1 neurolin brn3b isl1 atoh7 foxg1a dmbx1 lhx2 ebf2 eomesa barhl2 --temporal_genes Pou4f2 Thy1 ephb4a ephb4b cox1 cyp26a1 epha4b nr2f2 nr2f6b bmp4 --ventral_genes Tubb3 Nefl barhl1a cxcr4b vsx2 pitx2 bmp7 shh ptc1 gli1 vax2 tbx5 --dorsal_genes Sncg Gap43 epha7a epha7b wnt2ba wnt4a lhx9 nr2e1 tcf7l2 pax6a pax6b

What the Plot Shows

X-axis (Pseudo Nasal → Temporal)
Derived from the difference in expression between genes enriched in nasal vs temporal regions.
- Cells with X ≈ 0 are more “nasal-like”
- Cells with X ≈ 1 are more “temporal-like”
Y-axis (Pseudo Ventral → Dorsal)
Derived from the difference in expression between ventral vs dorsal markers.
- Cells with Y ≈ 0 are more “ventral-like”
- Cells with Y ≈ 1 are more “dorsal-like”
Color / intensity (Cell Density)
The plot uses a 2D hexbin to show where cells accumulate on the pseudo-spatial surface.
Brighter regions indicate higher density.

Interpretation of the Observed Pattern

Clusters of cells in the plot suggest groups with similar inferred spatial identity.
Gradients along X and Y axes indicate relative spatial transitions inferred from gene expression.
This allows for visual comparison of cell distributions across pseudo-anatomical axes.

Limitations

Not true spatial positions: Cells were dissociated, so their real Nasal / Temporal / Ventral / Dorsal locations are lost.
Accuracy depends on the quality and specificity of the chosen marker genes.
Only provides relative, inferred positions along the axes, not absolute anatomical coordinates.

One-Class Support Vector Machine (SVM) Control Fidelity Model

Conceptually:

Input: Each cell is represented by its gene expression profile (many genes per cell).
Training data: Control RGCs only (unlesioned, healthy baseline).
Goal: Measure how Control-like each cell is, rather than explicitly classifying conditions.
Approach (one-class model):
- Learns the normal expression manifold of Control RGCs.
- Does not use LD or NMDA cells during training.
- Assigns a fidelity score to every cell based on how well it fits the Control distribution.

This is a novelty / normality detection problem, not a binary classification problem.

How the model works

The model learns the structure and variability of Control cells only.
For each cell (Control, LD, or NMDA), it computes a decision score:
- High score → cell lies close to the Control expression manifold.
- Low score → cell deviates from Control (lesioned / abnormal).
Scores are normalized to 0–1 for visualization:
- 1 = highly Control-like
- 0 = strongly non-Control-like

Geometrically, you can think of this as learning a boundary around Control cells in high-dimensional gene-expression space, rather than drawing a line between two classes.

🚨🚨 Why SVM and not Logistic regression

SVM vs Logistic Regression:
- Logistic regression draws one straight line between two labeled classes — it needs examples from both classes to learn, so it cannot be used if the goal is to measure deviation from Control without labeling LD or NMDA.
- One-Class SVM keeps drawing flexible boundaries (lines or surfaces) around the Control data to capture its full shape and variation — it learns the normal pattern of Control cells and does not require other classes.
Why SVM is better when training on Control only:
- It learns the normal distribution of Control cells, capturing the natural variation among healthy cells.
- Assigns a fidelity score showing how Control-like each cell is.
- Automatically flags any unknown or abnormal cells (LD/NMDA) as deviations, without needing labels for them.
Why training on Control only is better:
- The goal is to measure deviation from normal, so the model should only learn what “normal” looks like.
- Including LD or NMDA in training would blur the boundary, because the model would start treating abnormal patterns as normal.
- By training only on Control, the SVM keeps the boundary tight around healthy cells, ensuring fidelity scores are meaningful and deviations are accurately detected.

SVM classifier RGC example

SVM classifier Rod example

SVM classifier Cones example

SVM classifier BC example

SVM classifier MG example

SVM classifier HC example

SVM classifier for all samples

How this script is different:

Here you go — the Markdown explanation in a code block:

It internally loops over all cell types. For each cell type, it performs the following steps:

Subset that cell type
- Extracts only the cells belonging to the current cell type from the full dataset.
Train a separate model (scaler + SVM) on Control of that cell type
- Scaling (mean=0, variance=1) is computed using only Control cells of that cell type.
- One-Class SVM is trained only on these Control cells.
Score treated cells of that cell type
- LD or NMDA cells of the same cell type are scored relative to their Control model.
Write results back
- The fidelity scores are inserted back into the global dataset, preserving separation between cell types.

Why this approach is good

Respects cell type biology
- Different cell types have different gene expression profiles. A single model across all cell types would produce meaningless scores.
Correct scaling and HVG selection
- Each cell type has its own variance structure and highly variable genes. Modeling per cell type ensures HVGs and scaling are accurate.
Independent, interpretable scores
- Fidelity scores always mean: “similar to Control of the same cell type”.
Error containment
- If one cell type fails or has too few cells, the other cell types are unaffected.

Why not to run one model on all cell types

Different cell types would be mixed in scaling and HVG selection, distorting the feature space.
One SVM would learn a mixture of distributions, which is biologically meaningless.
Fidelity scores would not reflect cell-type-specific similarity, making downstream analysis invalid.

Bottom line: Looping per cell type ensures biologically accurate, interpretable, and robust results.

Why SVM Parameters Are Tricky

Gamma

Controls how complex the Control boundary is:

Guess wrong = either too simple (miss real differences) or too complex (fit noise)

Nu

Controls expected % of outliers in Control:

Guess wrong = either include weird Controls or exclude real Controls

The Problem

You have to guess BOTH correctly without knowing your data's true structure. Isolation Forest avoids this guessing game.

Isolation Forest instead of SVM

Train on Control cells only
- Build hundreds of random trees
- Each tree: pick random gene → random threshold → split → repeat until single cells
- Trees know NOTHING about "Control" vs "anomaly"
Run ALL cells through trained trees
- Control cells (training data)
- LD cells
- NMDA cells
Measure path length for each cell
- Count how many splits until cell is isolated
- Average across all trees
Convert to fidelity score
- Long average path = hard to isolate = looks like Control = HIGH fidelity
- Short average path = easy to isolate = different from Control = LOW fidelity

Isolation Forest Rod example

Isolation Forest RGC example

Isolation Forest BC example

Isolation Forest Cones example

Isolation Forest MG example

Isolation Forest HC example

Isolation Forest classifier for all samples

Using Random Forest:

BC

AC

RGC

HC

MG

Cones

Cosine Similarity

Cosine similarity is not invariant to gene selection.
HVG-based analyses answer a different biological question than all-gene analyses.

Both approaches are valid — but they tell different stories, and that choice must be stated explicitly.

Citation

Watson, E. R., Mora, A., Taherian Fard, A., & Mar, J. C. (2022). How does the structure of data impact cell–cell similarity? Evaluating how structural properties influence the performance of proximity metrics in single cell RNA-seq data. Briefings in bioinformatics, 23(6), bbac387.

Pearson correlation

The script first calculates the average expression of each gene within each cell type, separately for the two conditions you’re comparing. So for every gene, you have an average expression value for each cell type under condition 1 and condition 2.
Once these averages are calculated, the script looks across all cell types to see how highly each gene is expressed in at least one cell type. It takes the maximum of these average expressions for each gene. This gives a single value per gene representing its highest mean expression in any cell type.
Then, the script filters out genes whose maximum mean expression is below the threshold you set. In other words, genes that are not expressed strongly in any cell type are removed from the analysis. Only genes that have at least one cell type where their mean expression exceeds the threshold are kept.
This way, the correlation analysis later is done only on genes that are meaningfully expressed, avoiding noise from genes that are essentially silent across all cell types.

python pearson_heatmap.py annotated_clustered_corrected_doubletRemoved_Zebrafishes.h5ad renamed_samples Control LD control_vs_ld_heatmap.png --min_mean_expr n

Cell Type	Control	LD
AC	✅	✅
BC	✅	✅
Cones	✅	✅
HC	✅	✅
MG	✅	✅
MGPC	✅	✅
Microglia_ImmuneCells	✅	✅
Oligodenrocyte	✅	✅
PR precursors	✅	✅
Perycites	✅	✅
RGC	✅	✅
RPE	✅	✅
Rod	✅	✅
Melanocyte	✅	❌
Endothelial	✅	❌

Control vs LD

> 0.05

> 0.1

> 0.2

> 0.5

Control vs NMDA

> 0.05

> 0.1

> 0.2

> 0.5

LD vs NMDA

> 0.05

> 0.1

> 0.2

> 0.5

Differential gene expression: per celltype

 sc.tl.rank_genes_groups(
        adata_for_dge,
        groupby=groupby, #groupby can be celltype or leiden ..etc  
        method="wilcoxon",
        reference="rest",
        use_raw=True
    )

groupby = celltype

Download all DGE here

Heatmap by Wilcoxon score

Differential gene expression per subset

AC: groubpy = leiden

AC DGE Top 50 Wilcoxon score

AC DGE Top 50 logFC

AC DGE Top 50 adj Pvalue

Warning: very big file

Download AC DGE here

Heatmap by Wilcoxon score

🚨 Debugging section

For testing purposes: Control vs Control

🐍 For Debugging Issues

💾 Command & Output

# Name                     Version          Build            Channel
scanpy                     1.9.3            pypi_0           pypi

python print_h5ad.py annotated_clustered_corrected_doubletRemoved_Zebrafishes.h5ad 
Loading AnnData object from: annotated_clustered_corrected_doubletRemoved_Zebrafishes.h5ad

==================================================
BASIC AnnData INFORMATION
==================================================
Overall shape: (160875, 24597) (cells: 160875, genes: 24597)
Observation names (cells): ['AAACCAAAGCTTAACG-1', 'AAACCAAAGTACGCCG-1', 'AAACCATTCGCTCATT-1', 'AAACCCGCACCTCACG-1', 'AAACCCGCATCGGGAC-1']...
Variable names (genes): ['fgfr1op2', 'si:dkey-21h14.12', 'si:dkey-285e18.2', 'znf1114', 'si:dkey-21h14.10']...

=== adata.X (Primary Expression Matrix) ===
Shape: (160875, 24597) (cells: 160875, genes: 24597)
Type: dense
Data type: float32
Min: -6.042159
Max: 381.777557
Mean: -0.001693
Std: 0.995661
Non-zero values: 3957042375
Zero values: 0
Sparsity: 0.0000 (0.00%)

Top 5 genes (first 5 rows):
  fgfr1op2: min=-0.2939, max=7.9207, mean=-0.0006
  si:dkey-21h14.12: min=-0.0168, max=101.5491, mean=-0.0000
  si:dkey-285e18.2: min=-0.0079, max=182.2779, mean=0.0001
  znf1114: min=-0.1110, max=18.6632, mean=0.0004
  si:dkey-21h14.10: min=-0.0139, max=136.0288, mean=-0.0019

Top 5 cells (first 5 columns):
  AAACCAAAGCTTAACG-1: min=-2.3688, max=124.4653, mean=-0.0308
  AAACCAAAGTACGCCG-1: min=-3.3848, max=44.7331, mean=-0.0216
  AAACCATTCGCTCATT-1: min=-2.0438, max=22.0059, mean=0.0395
  AAACCCGCACCTCACG-1: min=-2.7271, max=35.6094, mean=0.0179
  AAACCCGCATCGGGAC-1: min=-2.5089, max=22.6247, mean=0.0050

adata.raw type: <class 'anndata._core.raw.Raw'>

=== adata.raw.X (Raw Expression Matrix) ===
Shape: (160875, 24597) (cells: 160875, genes: 24597)
Type: sparse
Data type: float32
Min: 0.000000
Max: 8.389225
Mean: 0.113462
Std: 0.443252
Non-zero values: 294862321
Zero values: 3662180054
Sparsity: 0.9255 (92.55%)

Top 5 genes (first 5 rows):
  fgfr1op2: min=0.0000, max=3.0251, mean=0.1080
  si:dkey-21h14.12: min=0.0000, max=1.9556, mean=0.0003
  si:dkey-285e18.2: min=0.0000, max=1.9756, mean=0.0001
  znf1114: min=0.0000, max=2.6757, mean=0.0159
  si:dkey-21h14.10: min=0.0000, max=2.1449, mean=0.0002

Top 5 cells (first 5 columns):
  AAACCAAAGCTTAACG-1: min=0.0000, max=4.7681, mean=0.0959
  AAACCAAAGTACGCCG-1: min=0.0000, max=5.3671, mean=0.0970
  AAACCATTCGCTCATT-1: min=0.0000, max=5.0932, mean=0.1313
  AAACCCGCACCTCACG-1: min=0.0000, max=6.1345, mean=0.1182
  AAACCCGCATCGGGAC-1: min=0.0000, max=5.2011, mean=0.1099

==================================================
COMPARISON: adata.X vs adata.raw.X
==================================================
Same shape: True
Number of different elements: 3957042359
Identical matrices: False

==================================================
CATEGORICAL OBSERVATIONS
==================================================
Found 4 categorical columns:

Column: sample (8 unique values)
Values: ['Zebra', 'TH115', 'TH44', 'TH54', 'TH55', 'TH56', 'TH57', 'TH71']
---
Column: renamed_samples (3 unique values)
Values: ['LD', 'NMDA', 'Control']
---
Column: leiden (92 unique values)
Values: ['3', '57', '13', '0', '56', '51', '12', '25', '35', '11', '87', '8', '30', '18', '28', '10', '33', '52', '61', '6', '5', '89', '16', '4', '42', '24', '44', '32', '72', '48', '70', '41', '9', '1', '45', '47', '93', '17', '86', '26', '29', '59', '94', '23', '15', '31', '78', '14', '34', '20', '27', '22', '92', '91', '54', '7', '85', '67', '21', '60', '37', '64', '2', '55', '19', '46', '75', '68', '43', '38', '36', '65', '63', '39', '53', '88', '90', '84', '83', '50', '69', '40', '82', '71', '77', '62', '49', '73', '74', '79', '81', '95']
---
Column: celltype (15 unique values)
Values: ['PR precursors', 'HC', 'MG', 'MGPC', 'Rod', 'BC', 'Cones', 'AC', 'RGC', 'RPE', 'Microglia_ImmuneCells', 'Oligodenrocyte', 'Perycites', 'Endothelial', 'Melanocyte']
---

ADDITIONAL METADATA
==================================================
adata.obs columns: ['sample', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_genes', 'renamed_samples', 'predicted_doublets', 'leiden', 'celltype']
adata.var columns: ['mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std']

No additional layers found.



## How to run Snakemake 

For dry run to check everythign before actual run:

    snakemake -j1 -p --configfile config.yaml -n 

For Actual run: 
    

##### Update the configfile 

The config files has the sample names with the suffix in samples.tsv 
and the annotations in annotations.txt. Edit both files correspondingly. You can edit the config file to change their names.

Name		Name	Last commit message	Last commit date
Latest commit History 281 Commits
figures		figures
src		src
AC_combined_annotations.txt		AC_combined_annotations.txt
Cosine_Ctrl_LD.png		Cosine_Ctrl_LD.png
Cosine_Ctrl_NMDA.png		Cosine_Ctrl_NMDA.png
Cosine_LD_NMDA.png		Cosine_LD_NMDA.png
Makefile		Makefile
README.md		README.md
Snakefile		Snakefile
annotations.txt		annotations.txt
config.yaml		config.yaml
control_vs_control_heatmap_0.0.png		control_vs_control_heatmap_0.0.png
control_vs_ld_heatmap_0.05.png		control_vs_ld_heatmap_0.05.png
control_vs_ld_heatmap_0.1.png		control_vs_ld_heatmap_0.1.png
control_vs_ld_heatmap_0.2.png		control_vs_ld_heatmap_0.2.png
control_vs_ld_heatmap_0.5.png		control_vs_ld_heatmap_0.5.png
control_vs_nmda_heatmap_0.05.png		control_vs_nmda_heatmap_0.05.png
control_vs_nmda_heatmap_0.1.png		control_vs_nmda_heatmap_0.1.png
control_vs_nmda_heatmap_0.2.png		control_vs_nmda_heatmap_0.2.png
control_vs_nmda_heatmap_0.5.png		control_vs_nmda_heatmap_0.5.png
ld_vs_nmda_heatmap_0.05.png		ld_vs_nmda_heatmap_0.05.png
ld_vs_nmda_heatmap_0.1.png		ld_vs_nmda_heatmap_0.1.png
ld_vs_nmda_heatmap_0.2.png		ld_vs_nmda_heatmap_0.2.png
ld_vs_nmda_heatmap_0.5.png		ld_vs_nmda_heatmap_0.5.png
samples.tsv		samples.tsv
umap_all_conditions_Cones_IF.png		umap_all_conditions_Cones_IF.png
umap_all_conditions_Cones_split.png		umap_all_conditions_Cones_split.png
umap_all_conditions_HC_IF.png		umap_all_conditions_HC_IF.png
umap_all_conditions_MG_IF.png		umap_all_conditions_MG_IF.png
umap_all_conditions_MG_split.png		umap_all_conditions_MG_split.png
umap_all_conditions_RGC_IF.png		umap_all_conditions_RGC_IF.png
umap_all_conditions_RGC_split.png		umap_all_conditions_RGC_split.png
umap_all_conditions_Rod_IF.png		umap_all_conditions_Rod_IF.png
umap_all_conditions_Rod_split.png		umap_all_conditions_Rod_split.png
umap_all_conditions_zebrafish_split.png		umap_all_conditions_zebrafish_split.png
violin_fidelity_all_groups_AC_RF.png		violin_fidelity_all_groups_AC_RF.png
violin_fidelity_all_groups_BC_IF.png		violin_fidelity_all_groups_BC_IF.png
violin_fidelity_all_groups_BC_RF.png		violin_fidelity_all_groups_BC_RF.png
violin_fidelity_all_groups_Cones_IF.png		violin_fidelity_all_groups_Cones_IF.png
violin_fidelity_all_groups_Cones_RF.png		violin_fidelity_all_groups_Cones_RF.png
violin_fidelity_all_groups_Cones_split.png		violin_fidelity_all_groups_Cones_split.png
violin_fidelity_all_groups_HC_IF.png		violin_fidelity_all_groups_HC_IF.png
violin_fidelity_all_groups_HC_RF.png		violin_fidelity_all_groups_HC_RF.png
violin_fidelity_all_groups_HC_split.png		violin_fidelity_all_groups_HC_split.png
violin_fidelity_all_groups_MG_IF.png		violin_fidelity_all_groups_MG_IF.png
violin_fidelity_all_groups_MG_RF.png		violin_fidelity_all_groups_MG_RF.png
violin_fidelity_all_groups_MG_split.png		violin_fidelity_all_groups_MG_split.png
violin_fidelity_all_groups_RGC_IF.png		violin_fidelity_all_groups_RGC_IF.png
violin_fidelity_all_groups_RGC_RF.png		violin_fidelity_all_groups_RGC_RF.png
violin_fidelity_all_groups_RGC_split.png		violin_fidelity_all_groups_RGC_split.png
violin_fidelity_all_groups_Rod_IF.png		violin_fidelity_all_groups_Rod_IF.png
violin_fidelity_all_groups_Rod_RF.png		violin_fidelity_all_groups_Rod_RF.png
violin_fidelity_all_groups_Rod_split.png		violin_fidelity_all_groups_Rod_split.png
violin_fidelity_all_groups_zebrafish_IF.png		violin_fidelity_all_groups_zebrafish_IF.png
violin_fidelity_all_groups_zebrafish_split.png		violin_fidelity_all_groups_zebrafish_split.png

Folders and files

Latest commit

History

Repository files navigation

Zebrafish Project

Quality Control Plots

violin_QC.png

violin_AfterQC.png

UMAP After Batch Effect Removal (Harmony)

umap_sHarmonyzebrafishes.png

UMAP per sample

UMAP per sample

Clustering

UMAP after clustering (resolution: default =1)

Dot plot for Marker Genes

Marker Genes UMAP

After annotation using marker genes

We need to increase the resolution to 2.5 to seperate the tiny branch of Rod Cells

🚨 Starting from here, we have plots of resolution 2.5 🚨

UMAP after clustering

After re-Clustering annotation using marker genes

Remove Celltype Cones_MG_MGPC_PostMitotic as likely doublet

Cell ratio after increasing resolution and re-annotations

Umap for Control,LD, and NMDA merged

Umap per sample

No. of Cells per Sample

Mini Dotplot

Selected Markers UMAP per request

More Markers per request

Selected Markers Violin per request

Selected genes Comparisons between groups

elavl3/gad2/slc6a9 in AC

cabp5a/cabp5b in BC

prdm1a/nr2e3 in Rod and PR precursors

gnat2/arr3a in Cones

isl2b/rbpms2b in RGC

ompa/gad2 in HC

rho/guca1b in Rod and PR precursors

gap43/marcksa/marcksb/stmn2a/stmn2b/klf4/klf17/klf9/atoh7/pou4f2 in RGC

More gene comparisons

six7/pde6c in Cones

pde6a/pde6b/gnat1 in Rod

stmn4/tuba1b in RGC

per1b/nr1d1 in BC

UMAPs of Major Retinal Cell Subtypes

Cones Photoreceptor Subtypes before batch correction

Cones Photoreceptor Subtypes after batch correction (STRICT PARAMETERS)

QC after batch correction (STRICT PARAMETERS)

Cones Marker Genes after batch correction

Amacrine Cell (AC) Subtypes

AC Subtypes after batch correction (STRICT PARAMETERS)

QC after batch correction (STRICT PARAMETERS)

AC Marker Genes after batch correction

AC clusters after removing clusters 11, 48, and 49

Recombining subclusters 1 and 45 in AC

Dotplot for subclusters in AC

More dotplots

vesicular glutamate transporters

mGlu receptors

monoamines

gasotransmitters

neuropeptide extended

Feature plots for subclusters in AC after recombining 1 and 45

More feature plots

Cell counts and cell ratio in AC subset

AC subtypes annotations

AC subtypes DGE

Dot plot for top 2 in 0-44 clusters

Same plot but different order

Retinal Ganglion Cell (RGC) Subtypes

RGC Subtypes after batch correction (STRICT PARAMETERS)

QC after batch correction (STRICT PARAMETERS)

RGC Marker Genes after batch correction

Horizontal Cell (HC) Subtypes

HC Subtypes after batch correction (STRICT PARAMETERS)

QC after batch correction (STRICT PARAMETERS)

HC Marker Genes after batch correction

HC clusters after removing clusters 14, 15, and 16

Bipolar Cell (BC) Subtypes

BC Marker Genes without batch correction

`violin_QC.png`

`violin_AfterQC.png`

`umap_sHarmonyzebrafishes.png`