Genomic Data Science

Welcome to the Genomic Data Science repository. In this repository, you'll find algorithms and best practices for Genomic Data Analysis, ranging from basic DNA string manipulation to advanced sequence matching and genome assembly.

Part 1.1: DNA Strings Notebook
Part 1.2: DNA Sequence Notebook
Part 1.3: DNA Sequence Matching Notebook
Part 1.4: Lambda Virus Genome Notebook
Part 2.1: Approx Boyer Moore Notebook
Part 2.2: DNA Indexing Notebook
Part 2.3: Approximate Matching Notebook
Part 2.4: Homo Sapiens Chromosome 1 GRCh38 Notebook
Part 3.1: Dynamic Programming and Distances Notebook
Paret 3.2: Dynamic Programming Application Notebook
Part 4: Shortest Common Superstrings

Part 1.1: DNA Strings Notebook

This section covers the basics of DNA strings and fundamental algorithms for their manipulation.

Topics Covered

Basics of DNA Strings
Understand the properties and representation of DNA sequences.
Longest Common Sequence
Determine the longest common subsequence between two strings.
Exact String Matching
Check if two given strings are identical.
Complementary Strand Generation
Obtain the complementary strand of a given DNA sequence.
Virus Sequence Retrieval
Extract the sequence of a virus using the lambda_virus.fa file.
Nucleobase Counting
Count the individual nucleobases (A, T, G, and C) in a genome.
Histogram Plotting
Use Python's matplotlib to plot a histogram of base counts for A, T, G, and C.

Part 1.2: DNA Sequence Notebook

Focuses on working with DNA sequence data in FASTQ format and performing quality control.

Topics Covered

Data Acquisition
Download model DNA sequence data in FASTQ format.
FASTQ File Parsing
Read FASTQ files to generate sequences along with their associated quality scores.
Quality Score Conversion
Convert quality score characters (Phred+33 encoded) to numerical values.
Quality Score Visualization
Plot quality scores to assess the sequencing quality.
GC Content Analysis
Determine the GC content of the species based on the sequencing data.

Part 1.3: DNA Sequence Matching Notebook

Develop and implement algorithms for DNA sequence matching.

Topics Covered

Phix Genome Retrieval
Obtain the Phix genome file.
Genome Reading and Parsing
Read the genome and implement a matching algorithm.
Sequence Matching
Download a sequence and apply the matching algorithm.
Complementary Strand Matching
Enhance matching by considering the complementary strand of DNA.
Histogram of Matching Regions
Plot a histogram showing regions in the genome with matching sequences.

Part 1.4: Lambda Virus Genome Notebook

Apply genomic data science techniques on the Lambda virus genome and extend these methods to the Human genome.

Topics Covered

Lambda Virus Genome Analysis
Implement and test algorithms on the Lambda virus genome.
Human Genome Analysis
Extend the concepts and algorithms to analyze the Human genome.

Part 2.1: Approx Boyer Moore Notebook

While exact matching algorithms have been extensively studied, real-world genomic applications often demand approximate matching methods. However, when exact matching is appropriate, the Boyer-Moore algorithm stands out due to its exceptional efficiency in scanning massive DNA sequences. By leveraging heuristics like the bad-character and good-suffix rules, Boyer-Moore frequently skips large sections of the text, dramatically reducing the number of comparisons compared to naive methods.

Advantages of the Boyer-Moore Algorithm in Genomic Data Science

Efficiency:
Operates in sublinear time on average by quickly eliminating unlikely match positions, making it ideal for handling vast genomic datasets.
Reduced Comparisons:
Employs heuristics that minimize unnecessary character comparisons, significantly speeding up the search process.
Effective Preprocessing:
Preprocesses the pattern in linear time, a beneficial feature when the same pattern is used to search across extensive genomic data.
Scalability:
Its computational efficiency allows it to scale effectively to large DNA sequences, a critical requirement in genomic research.
Adaptability:
The core principles of Boyer-Moore can be extended to support approximate matching, addressing biological variability where perfect matches are rare.

Overall, the Boyer-Moore algorithm is a robust and adaptable tool for efficient and effective pattern matching in genomic research.

More algorithms

Raita Algorithm

The Raita algorithm is a variant of the Boyer-Moore string matching algorithm that improves performance by performing quick pre-checks. Specifically, it first compares the last, first, and middle characters of the pattern against the corresponding characters in the text. Only if these key positions match does it proceed to a full comparison of the entire pattern. This strategy reduces the number of unnecessary character comparisons, making the algorithm more efficient when scanning large texts, such as genomic sequences.

Apostolico–Giancarlo Algorithm

The Apostolico–Giancarlo algorithm is an optimized string matching method that refines the Boyer–Moore approach. It introduces a skip mechanism that avoids rechecking characters that have already been matched in previous alignments, thereby reducing redundant comparisons. This makes the algorithm particularly efficient in scenarios with overlapping or repetitive pattern occurrences, offering improved performance in worst-case cases while maintaining the benefits of Boyer–Moore’s heuristics such as the bad-character and good-suffix rules.

Knuth–Morris–Pratt Algorithm

The Knuth–Morris–Pratt (KMP) algorithm is an efficient string matching method that preprocesses the pattern to create a "failure function" (or prefix function). This function indicates the longest proper prefix of the pattern that is also a suffix, enabling the algorithm to bypass redundant comparisons when a mismatch occurs. By doing so, KMP achieves a worst-case time complexity of O(n + m), where n is the length of the text and m is the length of the pattern, making it highly effective for searching within large texts.

Part 2.2: DNA Indexing Notebook

Part 2.3: Approximate Matching Notebook

Part 2.4: Homo Sapiens Notebook

Part 3.1: Dynamic Programming and Distances Notebook

Part 3.2: Dynamic Programming Application Notebook

This repository is designed to guide you through the core algorithms and methodologies used in Genomic Data Science. Each notebook builds upon previous exercises to help you gain a comprehensive understanding of the field. Enjoy exploring and learning!

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
__pycache__		__pycache__
Approx Boyer Moore.ipynb		Approx Boyer Moore.ipynb
Approximate Matching.ipynb		Approximate Matching.ipynb
Boyer_Moore.py		Boyer_Moore.py
DNA Indexing.ipynb		DNA Indexing.ipynb
DNA Sequence Matching.ipynb		DNA Sequence Matching.ipynb
DNA Sequence.ipynb		DNA Sequence.ipynb
DNA Strings.ipynb		DNA Strings.ipynb
Dynamic Programming Application.ipynb		Dynamic Programming Application.ipynb
Dynamic Programming and Distances.ipynb		Dynamic Programming and Distances.ipynb
ERR037900_1.first1000.fastq		ERR037900_1.first1000.fastq
ERR266411_1.first1000.fastq		ERR266411_1.first1000.fastq
Homo sapiens chromosome 1, GRCh38.ipynb		Homo sapiens chromosome 1, GRCh38.ipynb
LICENSE		LICENSE
Lambda Virus Genome.ipynb		Lambda Virus Genome.ipynb
README.md		README.md
SRR835775_1.first1000.fastq		SRR835775_1.first1000.fastq
Shortest Common Superstrings.ipynb		Shortest Common Superstrings.ipynb
chr1.GRCh38.excerpt.fasta		chr1.GRCh38.excerpt.fasta
lambda_virus.fa		lambda_virus.fa
lambda_virus.fa.1		lambda_virus.fa.1
optimized_boyer_moore.py		optimized_boyer_moore.py
phix.fa		phix.fa

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Genomic Data Science

Table of Contents

Part 1.1: DNA Strings Notebook

Topics Covered

Part 1.2: DNA Sequence Notebook

Topics Covered

Part 1.3: DNA Sequence Matching Notebook

Topics Covered

Part 1.4: Lambda Virus Genome Notebook

Topics Covered

Part 2.1: Approx Boyer Moore Notebook

Advantages of the Boyer-Moore Algorithm in Genomic Data Science

More algorithms

Raita Algorithm

Apostolico–Giancarlo Algorithm

Knuth–Morris–Pratt Algorithm

Part 2.2: DNA Indexing Notebook

Part 2.3: Approximate Matching Notebook

Part 2.4: Homo Sapiens Notebook

Part 3.1: Dynamic Programming and Distances Notebook

Part 3.2: Dynamic Programming Application Notebook

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Genomic Data Science

Table of Contents

Part 1.1: DNA Strings Notebook

Topics Covered

Part 1.2: DNA Sequence Notebook

Topics Covered

Part 1.3: DNA Sequence Matching Notebook

Topics Covered

Part 1.4: Lambda Virus Genome Notebook

Topics Covered

Part 2.1: Approx Boyer Moore Notebook

Advantages of the Boyer-Moore Algorithm in Genomic Data Science

More algorithms

Raita Algorithm

Apostolico–Giancarlo Algorithm

Knuth–Morris–Pratt Algorithm

Part 2.2: DNA Indexing Notebook

Part 2.3: Approximate Matching Notebook

Part 2.4: Homo Sapiens Notebook

Part 3.1: Dynamic Programming and Distances Notebook

Part 3.2: Dynamic Programming Application Notebook

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages