Skip to content

SherineAwad/Zebrafish_scRNASeq_scanpy

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

281 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Zebrafish Project

We analyse Zebrafish LD, NMDA and control scRNASeq samples using scanpy.

Quality Control Plots

These violin plots visualize cell-level quality metrics before and after filtering low-quality cells.


violin_QC.png

This plot shows raw quality control metrics (e.g., number of genes per cell, mitochondrial percentage) before filtering. It helps ide.pngy thresholds for removing low-quality or dead cells.

violin_QC.png


violin_AfterQC.png

This plot shows the same QC metrics after filtering. It confirms that poor-quality cells were successfully removed based on chosen thresholds.

violin_AfterQC.png

UMAP After Batch Effect Removal (Harmony)

umap_sHarmonyzebrafishes.png

This UMAP plot shows zebrafish single-cell data after batch effect correction using Harmony.

By aligning shared biological structure across different batches or samples, Harmony improves the clustering and visualization. Cells are now grouped primarily by biological similarity rather than technical batch differences, enabling more accurate downstream analysis.

umap_sHarmonyzebrafishes.png

UMAP per sample

Below are the UMAP plots split by individual samples after reclustering:

UMAP per sample

Below are the UMAP plots split by individual samples after reclustering:

UMAP LD sampleUMAP NMDA sampleUMAP Control sample

Clustering

UMAP after clustering (resolution: default =1)

Umap Annotations

Dot plot for Marker Genes

Umap Annotations

Marker Genes UMAP

acta2apoeaapoeb aqp4arr3aascl1a atoh7bhlhe23cabp5a cabp5bcalb1calb2a calb2bcbln4ccnd1 cdh2chatacdk1 crxcsf2rbdla ebf3aelavl3fgf19 gad1agad1bgad2 gfapgli1gnat2 guca1bher12her4.2 her4.3her4.4her6 igf2aigf2binsm1a isl1isl2aisl2b kcnj8lhx1ambpa mpeg1.1neflbnefla nefmanefmbneurod1 notch1anotch1bnr2e3 nrlompaopn1mw1 opn1mw2opn1mw3opn1mw4 opn1sw1opn1sw2pax2a

After annotation using marker genes

Umap Annotations

Umap Annotations ON

We need to increase the resolution to 2.5 to seperate the tiny branch of Rod Cells

πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘

🚨 Starting from here, we have plots of resolution 2.5 🚨

πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘πŸ›‘

UMAP after clustering

Umap Annotations

After re-Clustering annotation using marker genes

Umap Annotations

Umap Annotations ON

Remove Celltype Cones_MG_MGPC_PostMitotic as likely doublet

Umap Annotations

Umap Annotations ON

Cell ratio after increasing resolution and re-annotations

Sample MG MGPC PR precursors Rod Cones BC AC HC RGC Microglia_ImmuneCells RPE Melanocyte Endothelial Perycites Oligodenrocyte
Control 6.49 1.13 0.88 4.76 7.54 46.85 10.24 13.40 3.20 1.74 1.34 0.27 0.57 1.09 0.51
LD 43.12 15.54 14.44 7.90 3.32 4.86 3.05 6.17 1.40 0.14 0.03 0.00 0.00 0.02 0.02
NMDA 29.89 14.74 13.87 13.74 2.75 5.50 9.23 6.94 2.96 0.22 0.00 0.02 0.02 0.05 0.07

Umap for Control,LD, and NMDA merged

Umap per sample

No. of Cells per Sample

renamed_samples Number of cells
Control 134,237
NMDA 15,424
LD 11,214

Mini Dotplot

Selected Markers UMAP per request

More Markers per request

Selected Markers Violin per request

Selected genes Comparisons between groups

elavl3/gad2/slc6a9 in AC

cabp5a/cabp5b in BC

prdm1a/nr2e3 in Rod and PR precursors

gnat2/arr3a in Cones

isl2b/rbpms2b in RGC

ompa/gad2 in HC

rho/guca1b in Rod and PR precursors

gap43/marcksa/marcksb/stmn2a/stmn2b/klf4/klf17/klf9/atoh7/pou4f2 in RGC

More gene comparisons

six7/pde6c in Cones

pde6a/pde6b/gnat1 in Rod

stmn4/tuba1b in RGC

per1b/nr1d1 in BC

UMAPs of Major Retinal Cell Subtypes

The following UMAP plots illustrate the clustering of subtypes within each major retinal cell type. These visualizations highlight transcriptional diversity captured by single-cell RNA sequencing.

Cones Photoreceptor Subtypes before batch correction

Cones Photoreceptor Subtypes after batch correction (STRICT PARAMETERS)

QC after batch correction (STRICT PARAMETERS)

Cones Marker Genes after batch correction

Amacrine Cell (AC) Subtypes

AC Subtypes after batch correction (STRICT PARAMETERS)

QC after batch correction (STRICT PARAMETERS)

AC Marker Genes after batch correction

AC clusters after removing clusters 11, 48, and 49

Recombining subclusters 1 and 45 in AC

Dotplot for subclusters in AC

Dotplot

Dotplot Full

More dotplots

vesicular glutamate transporters

Dotplot vesicular_glutamate_transporters

mGlu receptors

Dotplot mGlu_receptors

monoamines

Dotplot monoamines

gasotransmitters

Dotplot gasotransmitters

neuropeptide extended

Dotplot neuropeptide_extended

Feature plots for subclusters in AC after recombining 1 and 45

More feature plots

Cell counts and cell ratio in AC subset

View stats summary for AC here

AC subtypes annotations

AC subtypes AC subtypes ON

AC subtypes DGE

view top 50 in AC subtypes DGE for 0-44 clusters

Dot plot for top 2 in 0-44 clusters

Dotplot 0-44

Same plot but different order

Dotplot 0-44 reordered

View top 50 in AC subtypes DGE for annotated clusters

Retinal Ganglion Cell (RGC) Subtypes

RGC Subtypes after batch correction (STRICT PARAMETERS)

QC after batch correction (STRICT PARAMETERS)

RGC Marker Genes after batch correction

Horizontal Cell (HC) Subtypes

HC Subtypes after batch correction (STRICT PARAMETERS)

QC after batch correction (STRICT PARAMETERS)

HC Marker Genes after batch correction

HC clusters after removing clusters 14, 15, and 16

Bipolar Cell (BC) Subtypes

BC Marker Genes without batch correction

MG Subtypes

MG Subtypes after batch correction (STRICT PARAMETERS)

QC after batch correction (STRICT PARAMETERS)

MG marker Genes after batch correction

MG clusters after removing clusters 11 and 13

More DGE

Cones top 50 per subtype

Control vs LD

Control vs NMDA

LD vs NMDA

Cones top 50 not per cluster

Control vs LD

Control vs NMDA

LD vs NMDA

Dotplots

Cones Control vs NMDA

Cones Control vs LD

Cones LD vs NMDA

RGC top 50 per subtype

Control vs LD

Control vs NMDA

LD vs NMDA

RGC top 50 not per cluster

Control vs LD

Control vs NMDA

LD vs NMDA

Dotplots

RGC Control vs NMDA

RGC Control vs LD

RGC LD vs NMDA

RGC top 50 per subtype

RGC per subtype

RGC per subtype dotplot top marker

RGC Cell counts and sample-centered ratios per cluster per sample (count (ratio))

cluster Control LD NMDA
0 318 (0.08) 18 (0.12) 68 (0.158)
1 293 (0.073) 8 (0.053) 25 (0.058)
2 255 (0.064) 18 (0.12) 33 (0.077)
3 199 (0.05) 1 (0.007) 23 (0.053)
4 181 (0.045) 11 (0.073) 16 (0.037)
5 161 (0.04) 12 (0.08) 24 (0.056)
6 178 (0.044) 2 (0.013) 13 (0.03)
7 153 (0.038) 10 (0.067) 24 (0.056)
8 123 (0.031) 13 (0.087) 18 (0.042)
9 124 (0.031) 0 (0.0) 4 (0.009)
10 111 (0.028) 5 (0.033) 7 (0.016)
11 110 (0.028) 3 (0.02) 10 (0.023)
12 117 (0.029) 0 (0.0) 2 (0.005)
13 99 (0.025) 4 (0.027) 15 (0.035)
14 109 (0.027) 1 (0.007) 6 (0.014)
15 108 (0.027) 2 (0.013) 4 (0.009)
16 104 (0.026) 1 (0.007) 5 (0.012)
17 94 (0.024) 1 (0.007) 10 (0.023)
18 97 (0.024) 1 (0.007) 5 (0.012)
19 82 (0.02) 4 (0.027) 8 (0.019)
20 79 (0.02) 3 (0.02) 11 (0.026)
21 85 (0.021) 0 (0.0) 5 (0.012)
22 82 (0.02) 0 (0.0) 5 (0.012)
23 69 (0.017) 3 (0.02) 6 (0.014)
24 65 (0.016) 3 (0.02) 7 (0.016)
25 65 (0.016) 0 (0.0) 8 (0.019)
26 70 (0.018) 0 (0.0) 3 (0.007)
27 59 (0.015) 1 (0.007) 6 (0.014)
28 62 (0.016) 2 (0.013) 1 (0.002)
29 49 (0.012) 6 (0.04) 10 (0.023)
30 36 (0.009) 1 (0.007) 27 (0.063)
31 45 (0.011) 3 (0.02) 4 (0.009)
32 46 (0.012) 0 (0.0) 2 (0.005)
33 32 (0.008) 8 (0.053) 7 (0.016)
34 41 (0.01) 2 (0.013) 2 (0.005)
35 41 (0.01) 1 (0.007) 2 (0.005)
36 31 (0.008) 0 (0.0) 1 (0.002)
37 27 (0.007) 2 (0.013) 3 (0.007)

Rod top 50 not per cluster

Control vs NMDA

Control vs LD

LD vs NMDA

Dotplots

Volcano plots

This volcano plot visualizes all genes, showing the relationship between the magnitude of change (log2 fold change) and statistical significance (-log10 adjusted p-value) for each gene. The top 5 genes with the most pronounced differences between Control and the rest are highlighted with arrows for clarity.

  • Red line on volcano plot = p-value of 0.05 (standard significance)

  • Grey line = no change (fold change = 0)

Rod

Top 50 Rod

Cones

Top 50 Cones

BC

Top 50 BC

AC

Top 50 AC

HC

Top 50 HC

RGC

Top 50 RGC

Diffusion Map Analysis of Cell Type Across Treatments

Diffusion map is a dimensionality reduction method that identifies continuous axes of variation in single-cell data based on cell-to-cell similarity, capturing trajectories or state changes rather than discrete clusters.

RGC example

1

2

Figure 1: Diffusion Map (DC1 vs DC2)

  • Cells from Control, LD, and NMDA conditions are largely mixed together.
  • No distinct separation is observed along DC1 or DC2.
  • Interpretation: The dominant transcriptional variation in this cell type is not driven by LD or NMDA.

Figure 2: Violin Plot of DC1

  • DC1 distributions for Control, LD, and NMDA overlap almost completely.
  • Medians and spreads are similar across conditions.
  • Interpretation: Confirms the diffusion map result β€” neither LD nor NMDA induce a strong shift along the main diffusion axis.

Conclusion: In this cell type, the treatments do not drive major transcriptional differences detectable by diffusion map.

Temporal Pseudo-Surface Map of Retinal Cells Along Pseudotime

Purpose of the Plot

This plot shows cell density along pseudotime, creating a temporal pseudo-surface of the cell population.

Its purpose is to identify:

  • Where cells accumulate along a biological trajectory
  • Potential stable states
  • Possible transition checkpoints
  • Regions where cells progress slowly or rapidly

In short:

It shows which transcriptional states are most populated along the inferred trajectory.

Example in RGC

What the Plot Shows

X-axis (Pseudotime 0 β†’ 1)

Represents progression through a biological process (e.g., differentiation).

Y-axis (Visual Jitter / Surface)

Used to vertically separate cells for visualization; not biologically meaningful.
It allows the density of cells along pseudotime to form a smooth surface.

Color Intensity (Density)

  • Brighter regions = higher cell density
  • Darker regions = lower cell density

Interpretation of the Observed Pattern

There is a very strong accumulation of cells early in pseudotime, with the highest density around approximately 0.05–0.1. This suggests that many cells occupy an early transcriptional state.

A secondary, lower-density enrichment appears around approximately 0.2–0.25, indicating another populated intermediate state.

There is a much weaker and more diffuse signal around approximately 0.5, suggesting relatively few cells reach mid-to-late pseudotime states.

After approximately 0.6, cell density becomes very low, indicating that only a small fraction of cells progress to late pseudotime states.

Overall, this pattern suggests:

  • A large early population
  • One or two intermediate states
  • Progressive depletion of cells along the trajectory
  • Late states that may be transient or reached by only a subset of cells

Pseudo-Spatial Surface Map Using Marker Genes

This plot shows a pseudo-spatial distribution of cells using known marker genes for specific anatomical axes.

Its purpose is to:

  • Estimate relative positioning of cells along Nasal β†’ Temporal (X-axis) and Ventral β†’ Dorsal (Y-axis) axes
  • Visualize cell density across inferred pseudo-regions
  • Provide a spatial context for transcriptional states, even without true spatial coordinates

Note: This is an inferred pseudo-spatial map based on gene expression patterns; it does not represent actual measured cell positions.

Used the following spatial Markers

--nasal_genes Rbpms Pou4f1 neurolin brn3b isl1 atoh7 foxg1a dmbx1 lhx2 ebf2 eomesa barhl2 --temporal_genes Pou4f2 Thy1 ephb4a ephb4b cox1 cyp26a1 epha4b nr2f2 nr2f6b bmp4 --ventral_genes Tubb3 Nefl barhl1a cxcr4b vsx2 pitx2 bmp7 shh ptc1 gli1 vax2 tbx5 --dorsal_genes Sncg Gap43 epha7a epha7b wnt2ba wnt4a lhx9 nr2e1 tcf7l2 pax6a pax6b

What the Plot Shows

  • X-axis (Pseudo Nasal β†’ Temporal)
    Derived from the difference in expression between genes enriched in nasal vs temporal regions.

    • Cells with X β‰ˆ 0 are more β€œnasal-like”
    • Cells with X β‰ˆ 1 are more β€œtemporal-like”
  • Y-axis (Pseudo Ventral β†’ Dorsal)
    Derived from the difference in expression between ventral vs dorsal markers.

    • Cells with Y β‰ˆ 0 are more β€œventral-like”
    • Cells with Y β‰ˆ 1 are more β€œdorsal-like”
  • Color / intensity (Cell Density)
    The plot uses a 2D hexbin to show where cells accumulate on the pseudo-spatial surface.
    Brighter regions indicate higher density.

Interpretation of the Observed Pattern

  • Clusters of cells in the plot suggest groups with similar inferred spatial identity.
  • Gradients along X and Y axes indicate relative spatial transitions inferred from gene expression.
  • This allows for visual comparison of cell distributions across pseudo-anatomical axes.

Limitations

  • Not true spatial positions: Cells were dissociated, so their real Nasal / Temporal / Ventral / Dorsal locations are lost.
  • Accuracy depends on the quality and specificity of the chosen marker genes.
  • Only provides relative, inferred positions along the axes, not absolute anatomical coordinates.

One-Class Support Vector Machine (SVM) Control Fidelity Model

Conceptually:

  • Input: Each cell is represented by its gene expression profile (many genes per cell).
  • Training data: Control RGCs only (unlesioned, healthy baseline).
  • Goal: Measure how Control-like each cell is, rather than explicitly classifying conditions.
  • Approach (one-class model):
    • Learns the normal expression manifold of Control RGCs.
    • Does not use LD or NMDA cells during training.
    • Assigns a fidelity score to every cell based on how well it fits the Control distribution.

This is a novelty / normality detection problem, not a binary classification problem.

How the model works

  • The model learns the structure and variability of Control cells only.
  • For each cell (Control, LD, or NMDA), it computes a decision score:
    • High score β†’ cell lies close to the Control expression manifold.
    • Low score β†’ cell deviates from Control (lesioned / abnormal).
  • Scores are normalized to 0–1 for visualization:
    • 1 = highly Control-like
    • 0 = strongly non-Control-like

Geometrically, you can think of this as learning a boundary around Control cells in high-dimensional gene-expression space, rather than drawing a line between two classes.

🚨🚨 Why SVM and not Logistic regression

  • SVM vs Logistic Regression:

    • Logistic regression draws one straight line between two labeled classes β€” it needs examples from both classes to learn, so it cannot be used if the goal is to measure deviation from Control without labeling LD or NMDA.
    • One-Class SVM keeps drawing flexible boundaries (lines or surfaces) around the Control data to capture its full shape and variation β€” it learns the normal pattern of Control cells and does not require other classes.
  • Why SVM is better when training on Control only:

    • It learns the normal distribution of Control cells, capturing the natural variation among healthy cells.
    • Assigns a fidelity score showing how Control-like each cell is.
    • Automatically flags any unknown or abnormal cells (LD/NMDA) as deviations, without needing labels for them.
  • Why training on Control only is better:

    • The goal is to measure deviation from normal, so the model should only learn what β€œnormal” looks like.
    • Including LD or NMDA in training would blur the boundary, because the model would start treating abnormal patterns as normal.
    • By training only on Control, the SVM keeps the boundary tight around healthy cells, ensuring fidelity scores are meaningful and deviations are accurately detected.

SVM classifier RGC example

SVM classifier Rod example

SVM classifier Cones example

SVM classifier BC example

SVM classifier MG example

SVM classifier HC example

SVM classifier for all samples

How this script is different:

Here you go β€” the Markdown explanation in a code block:

It internally loops over all cell types. For each cell type, it performs the following steps:

  1. Subset that cell type

    • Extracts only the cells belonging to the current cell type from the full dataset.
  2. Train a separate model (scaler + SVM) on Control of that cell type

    • Scaling (mean=0, variance=1) is computed using only Control cells of that cell type.
    • One-Class SVM is trained only on these Control cells.
  3. Score treated cells of that cell type

    • LD or NMDA cells of the same cell type are scored relative to their Control model.
  4. Write results back

    • The fidelity scores are inserted back into the global dataset, preserving separation between cell types.

Why this approach is good

  • Respects cell type biology

    • Different cell types have different gene expression profiles. A single model across all cell types would produce meaningless scores.
  • Correct scaling and HVG selection

    • Each cell type has its own variance structure and highly variable genes. Modeling per cell type ensures HVGs and scaling are accurate.
  • Independent, interpretable scores

    • Fidelity scores always mean: β€œsimilar to Control of the same cell type”.
  • Error containment

    • If one cell type fails or has too few cells, the other cell types are unaffected.

Why not to run one model on all cell types

  • Different cell types would be mixed in scaling and HVG selection, distorting the feature space.
  • One SVM would learn a mixture of distributions, which is biologically meaningless.
  • Fidelity scores would not reflect cell-type-specific similarity, making downstream analysis invalid.

Bottom line: Looping per cell type ensures biologically accurate, interpretable, and robust results.

Why SVM Parameters Are Tricky

Gamma

Controls how complex the Control boundary is:

  • Guess wrong = either too simple (miss real differences) or too complex (fit noise)

Nu

Controls expected % of outliers in Control:

  • Guess wrong = either include weird Controls or exclude real Controls

The Problem

You have to guess BOTH correctly without knowing your data's true structure. Isolation Forest avoids this guessing game.

Isolation Forest instead of SVM

  1. Train on Control cells only

    • Build hundreds of random trees
    • Each tree: pick random gene β†’ random threshold β†’ split β†’ repeat until single cells
    • Trees know NOTHING about "Control" vs "anomaly"
  2. Run ALL cells through trained trees

    • Control cells (training data)
    • LD cells
    • NMDA cells
  3. Measure path length for each cell

    • Count how many splits until cell is isolated
    • Average across all trees
  4. Convert to fidelity score

    • Long average path = hard to isolate = looks like Control = HIGH fidelity
    • Short average path = easy to isolate = different from Control = LOW fidelity

Isolation Forest Rod example

Isolation Forest RGC example

Isolation Forest BC example

Isolation Forest Cones example

Isolation Forest MG example

Isolation Forest HC example

Isolation Forest classifier for all samples

Using Random Forest:

BC

AC

RGC

HC

MG

Cones

Cosine Similarity

Cosine similarity is not invariant to gene selection.
HVG-based analyses answer a different biological question than all-gene analyses.

Both approaches are valid β€” but they tell different stories, and that choice must be stated explicitly.

Cosine Ctrl vs LD

Cosine Ctrl vs NMDA

Cosine LD vs NMDA

Citation

Watson, E. R., Mora, A., Taherian Fard, A., & Mar, J. C. (2022). How does the structure of data impact cell–cell similarity? Evaluating how structural properties influence the performance of proximity metrics in single cell RNA-seq data. Briefings in bioinformatics, 23(6), bbac387.

Pearson correlation

  • The script first calculates the average expression of each gene within each cell type, separately for the two conditions you’re comparing. So for every gene, you have an average expression value for each cell type under condition 1 and condition 2.

  • Once these averages are calculated, the script looks across all cell types to see how highly each gene is expressed in at least one cell type. It takes the maximum of these average expressions for each gene. This gives a single value per gene representing its highest mean expression in any cell type.

  • Then, the script filters out genes whose maximum mean expression is below the threshold you set. In other words, genes that are not expressed strongly in any cell type are removed from the analysis. Only genes that have at least one cell type where their mean expression exceeds the threshold are kept.

  • This way, the correlation analysis later is done only on genes that are meaningfully expressed, avoiding noise from genes that are essentially silent across all cell types.

python pearson_heatmap.py annotated_clustered_corrected_doubletRemoved_Zebrafishes.h5ad renamed_samples Control LD control_vs_ld_heatmap.png --min_mean_expr n
Cell Type Control LD
AC βœ… βœ…
BC βœ… βœ…
Cones βœ… βœ…
HC βœ… βœ…
MG βœ… βœ…
MGPC βœ… βœ…
Microglia_ImmuneCells βœ… βœ…
Oligodenrocyte βœ… βœ…
PR precursors βœ… βœ…
Perycites βœ… βœ…
RGC βœ… βœ…
RPE βœ… βœ…
Rod βœ… βœ…
Melanocyte βœ… ❌
Endothelial βœ… ❌

Control vs LD

> 0.05

pearson correlation Heatmap

> 0.1

pearson correlation Heatmap

> 0.2

pearson correlation Heatmap

> 0.5

pearson correlation Heatmap

Control vs NMDA

> 0.05

control vs nmda 0.05

> 0.1

control vs nmda 0.1

> 0.2

control vs nmda 0.2

> 0.5

control vs nmdaI 0.5

LD vs NMDA

> 0.05

ld vs nmda 0.05

> 0.1

ld vs nmda 0.1

> 0.2

ld vs nmda 0.2

> 0.5

ld vs nmda 0.5

Differential gene expression: per celltype

 sc.tl.rank_genes_groups(
        adata_for_dge,
        groupby=groupby, #groupby can be celltype or leiden ..etc  
        method="wilcoxon",
        reference="rest",
        use_raw=True
    )

groupby = celltype

Download all DGE here

Heatmap by Wilcoxon score

Zebrafish DGE heatmap

Differential gene expression per subset

AC: groubpy = leiden

AC DGE Top 50 Wilcoxon score

AC DGE Top 50 logFC

AC DGE Top 50 adj Pvalue

Warning: very big file

Download AC DGE here

Heatmap by Wilcoxon score

AC heatmap

🚨 Debugging section

For testing purposes: Control vs Control

control vs control

🐍 For Debugging Issues

πŸ’Ύ Command & Output

# Name                     Version          Build            Channel
scanpy                     1.9.3            pypi_0           pypi
python print_h5ad.py annotated_clustered_corrected_doubletRemoved_Zebrafishes.h5ad 
Loading AnnData object from: annotated_clustered_corrected_doubletRemoved_Zebrafishes.h5ad

==================================================
BASIC AnnData INFORMATION
==================================================
Overall shape: (160875, 24597) (cells: 160875, genes: 24597)
Observation names (cells): ['AAACCAAAGCTTAACG-1', 'AAACCAAAGTACGCCG-1', 'AAACCATTCGCTCATT-1', 'AAACCCGCACCTCACG-1', 'AAACCCGCATCGGGAC-1']...
Variable names (genes): ['fgfr1op2', 'si:dkey-21h14.12', 'si:dkey-285e18.2', 'znf1114', 'si:dkey-21h14.10']...

=== adata.X (Primary Expression Matrix) ===
Shape: (160875, 24597) (cells: 160875, genes: 24597)
Type: dense
Data type: float32
Min: -6.042159
Max: 381.777557
Mean: -0.001693
Std: 0.995661
Non-zero values: 3957042375
Zero values: 0
Sparsity: 0.0000 (0.00%)

Top 5 genes (first 5 rows):
  fgfr1op2: min=-0.2939, max=7.9207, mean=-0.0006
  si:dkey-21h14.12: min=-0.0168, max=101.5491, mean=-0.0000
  si:dkey-285e18.2: min=-0.0079, max=182.2779, mean=0.0001
  znf1114: min=-0.1110, max=18.6632, mean=0.0004
  si:dkey-21h14.10: min=-0.0139, max=136.0288, mean=-0.0019

Top 5 cells (first 5 columns):
  AAACCAAAGCTTAACG-1: min=-2.3688, max=124.4653, mean=-0.0308
  AAACCAAAGTACGCCG-1: min=-3.3848, max=44.7331, mean=-0.0216
  AAACCATTCGCTCATT-1: min=-2.0438, max=22.0059, mean=0.0395
  AAACCCGCACCTCACG-1: min=-2.7271, max=35.6094, mean=0.0179
  AAACCCGCATCGGGAC-1: min=-2.5089, max=22.6247, mean=0.0050

adata.raw type: <class 'anndata._core.raw.Raw'>

=== adata.raw.X (Raw Expression Matrix) ===
Shape: (160875, 24597) (cells: 160875, genes: 24597)
Type: sparse
Data type: float32
Min: 0.000000
Max: 8.389225
Mean: 0.113462
Std: 0.443252
Non-zero values: 294862321
Zero values: 3662180054
Sparsity: 0.9255 (92.55%)

Top 5 genes (first 5 rows):
  fgfr1op2: min=0.0000, max=3.0251, mean=0.1080
  si:dkey-21h14.12: min=0.0000, max=1.9556, mean=0.0003
  si:dkey-285e18.2: min=0.0000, max=1.9756, mean=0.0001
  znf1114: min=0.0000, max=2.6757, mean=0.0159
  si:dkey-21h14.10: min=0.0000, max=2.1449, mean=0.0002

Top 5 cells (first 5 columns):
  AAACCAAAGCTTAACG-1: min=0.0000, max=4.7681, mean=0.0959
  AAACCAAAGTACGCCG-1: min=0.0000, max=5.3671, mean=0.0970
  AAACCATTCGCTCATT-1: min=0.0000, max=5.0932, mean=0.1313
  AAACCCGCACCTCACG-1: min=0.0000, max=6.1345, mean=0.1182
  AAACCCGCATCGGGAC-1: min=0.0000, max=5.2011, mean=0.1099

==================================================
COMPARISON: adata.X vs adata.raw.X
==================================================
Same shape: True
Number of different elements: 3957042359
Identical matrices: False

==================================================
CATEGORICAL OBSERVATIONS
==================================================
Found 4 categorical columns:

Column: sample (8 unique values)
Values: ['Zebra', 'TH115', 'TH44', 'TH54', 'TH55', 'TH56', 'TH57', 'TH71']
---
Column: renamed_samples (3 unique values)
Values: ['LD', 'NMDA', 'Control']
---
Column: leiden (92 unique values)
Values: ['3', '57', '13', '0', '56', '51', '12', '25', '35', '11', '87', '8', '30', '18', '28', '10', '33', '52', '61', '6', '5', '89', '16', '4', '42', '24', '44', '32', '72', '48', '70', '41', '9', '1', '45', '47', '93', '17', '86', '26', '29', '59', '94', '23', '15', '31', '78', '14', '34', '20', '27', '22', '92', '91', '54', '7', '85', '67', '21', '60', '37', '64', '2', '55', '19', '46', '75', '68', '43', '38', '36', '65', '63', '39', '53', '88', '90', '84', '83', '50', '69', '40', '82', '71', '77', '62', '49', '73', '74', '79', '81', '95']
---
Column: celltype (15 unique values)
Values: ['PR precursors', 'HC', 'MG', 'MGPC', 'Rod', 'BC', 'Cones', 'AC', 'RGC', 'RPE', 'Microglia_ImmuneCells', 'Oligodenrocyte', 'Perycites', 'Endothelial', 'Melanocyte']
---

ADDITIONAL METADATA
==================================================
adata.obs columns: ['sample', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_genes', 'renamed_samples', 'predicted_doublets', 'leiden', 'celltype']
adata.var columns: ['mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std']

No additional layers found.



## How to run Snakemake 

For dry run to check everythign before actual run:

    snakemake -j1 -p --configfile config.yaml -n 

For Actual run: 
    

##### Update the configfile 

The config files has the sample names with the suffix in samples.tsv 
and the annotations in annotations.txt. Edit both files correspondingly. You can edit the config file to change their names. 



About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors