Scripts to convert raw DNA data (from 23AndMe, etc.) to VCF
The generated support files for 23AndMe v4 are included in this repo. The scripts here should be able to convert AncestryDNA and FTdna to 23AndMe format and generate the support files. The entire workflow has only been tested for 23AndMe v4.
1. Use make_filter.py to create a filter that lists all markers tested by a given vendor. It takes a list of raw data files in 23AndMe format; several files are used so that no marker is missed. filter/23andme_v4.tsv lists all markers tested by 23AndMe v4 and was generated by make_filter.py.
2. Use make_map.py to build a mapping file between the vendor's IDs and positions and those used in dbSNP. It compares the filter file from step 1 to a VCF that lists ALL markers in dbSNP. map/23andme_v4.map was generated using GRCh37p13 b150 (WARNING: 7GB download - ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/All_20170710.vcf.gz).
3. Use make_template.py to build a template that is used to convert 23AndMe formatted data files to VCF. The template is essentially a master VCF filtered down to only the needed lines, so the conversion only has to parse a ~30MB file instead of a ~7GB file. template/23andme_v4.vcf.gz was generated from GRCh37p13 b150.
4. Use 23andme_to_vcf.py to convert a raw data file in 23AndMe format to VCF with the help of the mapping and template files made in steps 2 and 3.
ls genome*.txt | xargs -n 1 ./make_filter.py > filter/23andme_v4.tsv
Scans several raw data files and lists every variant reported in any of them. It is not certain that multiple files are actually needed; the idea is that a single file might not contain all the tested variants (for example, if some were not called).
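As a rough illustration of the idea (not the actual make_filter.py), the sketch below takes the union of markers across several raw data files. It assumes the usual 23AndMe layout of tab-separated rsid, chromosome, position and genotype columns with '#' comment lines. The function names and the three-column output are hypothetical and may not match filter/23andme_v4.tsv.

```python
import io


def read_markers(handle):
    """Yield (rsid, chromosome, position) from a 23AndMe-style raw data stream."""
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # not a data line
        yield fields[0], fields[1], int(fields[2])


def union_markers(handles):
    """Collect every marker seen in at least one input (hypothetical helper)."""
    markers = set()
    for handle in handles:
        markers.update(read_markers(handle))
    # Sort by chromosome name (as text), then position, then ID
    return sorted(markers, key=lambda m: (m[1], m[2], m[0]))


# Two tiny in-memory "files"; real input would be opened from disk
file_a = io.StringIO("# rsid\tchromosome\tposition\tgenotype\n"
                     "rs1\t1\t100\tAA\nrs2\t1\t200\tAG\n")
file_b = io.StringIO("rs2\t1\t200\t--\nrs3\t2\t50\tCT\n")
merged = union_markers([file_a, file_b])
for rsid, chrom, pos in merged:
    print(f"{rsid}\t{chrom}\t{pos}")
```

A marker that appears in only one of the files (rs1 or rs3 here) still ends up in the merged list, which is the reason for feeding in more than one file.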
./make_map.py map/23andme_v4.map filter/23andme_v4.tsv All_20170710.vcf.gz
Makes a file that maps between 23AndMe rsIDs and dbSNP rsIDs.
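One way such a mapping could be built is sketched below; this is an assumption about the approach, not a description of make_map.py. Vendor markers are held in memory keyed by (chromosome, position), and the large dbSNP VCF is streamed once. The name map_markers and the dictionary output are hypothetical, and the sketch assumes the VCF uses the same chromosome names as the raw data.

```python
def map_markers(markers, vcf_lines):
    """Map vendor IDs to dbSNP IDs by matching chromosome and position.

    markers:   iterable of (vendor_id, chromosome, position)
    vcf_lines: iterable of VCF text lines (CHROM, POS, ID are the first columns)
    Limitation: if two vendor markers share a position, only the last is kept.
    """
    by_position = {(chrom, pos): vendor_id for vendor_id, chrom, pos in markers}
    mapping = {}
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # VCF meta and header lines
        chrom, pos, dbsnp_id = line.split("\t", 3)[:3]
        vendor_id = by_position.get((chrom, int(pos)))
        if vendor_id is None:
            continue  # position not tested by the vendor
        # Keep the first record seen, but prefer an exact ID match
        if vendor_id not in mapping or dbsnp_id == vendor_id:
            mapping[vendor_id] = dbsnp_id
    return mapping


markers = [("rs1", "1", 100), ("i500", "1", 200), ("rs9", "3", 5)]
vcf_lines = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\trs77\tA\tG\t.\t.\t.\n",
    "1\t100\trs1\tA\tT\t.\t.\t.\n",
    "1\t200\trs2\tC\tT\t.\t.\t.\n",
]
mapping = map_markers(markers, vcf_lines)
for vendor_id, dbsnp_id in mapping.items():
    print(f"{vendor_id}\t{dbsnp_id}")
```

Streaming the VCF rather than indexing it keeps memory use proportional to the number of vendor markers, which matters for a multi-gigabyte input.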
./make_template.py template/23andme_v4.vcf map/23andme_v4.map All_20170710.vcf.gz; bgzip template/23andme_v4.vcf; tabix template/23andme_v4.vcf.gz
Filters the giant dbSNP VCF down to something that's easier to handle, keeping only the entries needed for 23AndMe. The result is then compressed with bgzip and indexed with tabix.
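The filtering step can be pictured with the minimal sketch below. It is not the code in make_template.py: the function filter_vcf and the idea of passing a plain set of wanted IDs are assumptions made for illustration, since the layout of the map file is not described here.

```python
def filter_vcf(lines, wanted_ids):
    """Yield header lines plus the records whose ID column is in wanted_ids."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep all meta and header lines
            continue
        fields = line.split("\t", 3)
        # In a VCF record the ID is the third column
        if len(fields) > 2 and fields[2] in wanted_ids:
            yield line


vcf_lines = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\trs1\tA\tG\t.\t.\t.\n",
    "1\t150\trs5\tC\tT\t.\t.\t.\n",
    "1\t200\trs2\tC\tT\t.\t.\t.\n",
]
kept = list(filter_vcf(vcf_lines, {"rs1", "rs2"}))
print("".join(kept), end="")
```

For a real run the input lines would come from the compressed dbSNP file, for example via gzip.open(path, "rt"), and the output would be written to the template path before running bgzip and tabix.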
./23andme_to_vcf.py genome_Philip_Baltar.txt Philip_Baltar map/23andme_v4.map template/23andme_v4.vcf.gz Philip_Baltar.vcf
This is the main script. It converts the raw data file (first argument) to the output VCF (last argument) using the mapping and template files.
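A core part of any such conversion is turning a reported genotype into a VCF GT value relative to the REF and ALT alleles of the matching template record. The sketch below shows one simple way to do that and is an assumption, not the logic of 23andme_to_vcf.py. The function genotype_to_gt is hypothetical, and it deliberately ignores strand flips and insertion/deletion calls.

```python
def genotype_to_gt(genotype, ref, alts):
    """Translate a genotype such as 'AG' into a VCF GT string such as '0/1'.

    ref:  the REF allele of the template record
    alts: list of ALT alleles of the template record
    Alleles that are not REF or ALT (including no-calls like '--') give a
    missing genotype.
    """
    if not genotype:
        return "."
    alleles = [ref] + list(alts)
    indexes = []
    for base in genotype:
        if base not in alleles:
            # Diploid calls become './.', single-letter (haploid) calls '.'
            return "./." if len(genotype) == 2 else "."
        indexes.append(alleles.index(base))
    return "/".join(str(i) for i in sorted(indexes))


for call in ("AA", "AG", "GA", "GG", "--", "G"):
    print(call, genotype_to_gt(call, "A", ["G"]))
```

Sorting the allele indexes makes 'AG' and 'GA' produce the same unphased genotype.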
./AncestryDNA_to_23andme.awk data.txt > AncestryDNA_23andme.txt
Script to convert AncestryDNA files to 23AndMe format
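For readers who prefer Python to awk, the same kind of conversion can be sketched as follows. This is not a port of AncestryDNA_to_23andme.awk. It assumes AncestryDNA files are tab-separated with rsid, chromosome, position, allele1 and allele2 columns, that no-calls are written as 0, and that chromosomes 23 to 26 stand for X, Y, the pseudoautosomal region and MT. The CHROM_NAMES table, including the choice to report the pseudoautosomal region as X, is an assumption.

```python
# Assumed numeric chromosome codes; adjust if the input uses other values
CHROM_NAMES = {"23": "X", "24": "Y", "25": "X", "26": "MT"}


def ancestry_to_23andme(lines):
    """Yield 23AndMe-style lines (rsid, chromosome, position, genotype)."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 5 or fields[0] == "rsid":
            continue  # skip the column header and malformed lines
        rsid, chrom, pos, allele1, allele2 = fields
        chrom = CHROM_NAMES.get(chrom, chrom)
        # A 0 in either allele is treated as a no-call
        genotype = "--" if "0" in (allele1, allele2) else allele1 + allele2
        yield "\t".join((rsid, chrom, pos, genotype))


sample = [
    "#AncestryDNA raw data download\n",
    "rsid\tchromosome\tposition\tallele1\tallele2\n",
    "rs1\t1\t100\tA\tG\n",
    "rs2\t23\t200\t0\t0\n",
    "rs3\t26\t300\tC\tC\n",
]
converted = list(ancestry_to_23andme(sample))
print("\n".join(converted))
```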
./FTdna_to_23andme.awk FTdna.csv > FTdna_23andme.txt
Script to convert FTdna files to 23AndMe format
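A comparable Python sketch for the FTdna_to_23andme.awk step is given below. It is not a port of that script and assumes that the input is a CSV file with quoted RSID, CHROMOSOME, POSITION and RESULT columns, where RESULT already holds the two-letter genotype. The function name ftdna_to_23andme is hypothetical.

```python
import csv


def ftdna_to_23andme(lines):
    """Yield 23AndMe-style tab-separated lines from FTdna-style CSV lines."""
    for row in csv.reader(lines):
        if len(row) != 4 or row[0].upper() == "RSID":
            continue  # skip the column header and malformed rows
        rsid, chrom, pos, result = row
        yield "\t".join((rsid, chrom, pos, result))


sample = [
    "RSID,CHROMOSOME,POSITION,RESULT\n",
    '"rs1","1","100","AG"\n',
    '"rs2","X","200","--"\n',
]
converted = list(ftdna_to_23andme(sample))
print("\n".join(converted))
```

Using the csv module rather than splitting on commas takes care of the quoting around each field.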
This work was inspired by Giulio Genovese (http://apol1.blogspot.com/2013/08/impute-apoe-and-apol1-with-23andme.html)