Genotype Calling with Uncertainty from Sequencing Data in Polyploids and Diploids.
Python helper scripts for polyRAD
Using Python
If you are new to running Python from the command line, and are using RStudio, the simplest thing to do is click on the "Terminal" tab to get to your operating system's terminal/shell/command prompt. Python scripts should be run from there, not from the R Console.
Python 3 is required for the scripts in this directory to function. However, you may have Python 2 as the default Python on your computer, since many computer programs and even some operating systems depend on it. To check, run
python --version
You should see something like Python 3.x.x, where there are numbers in place of x. If you see something else, you may need to install Python 3 and/or make sure the path to Python 3 is in your system's PATH variable. If you need help with that, your department's IT person can probably get it done in five or ten minutes (on CentOS it was more of a challenge, but on Windows it should be quick). On some operating systems, instead of typing python you can type python3 or python36 to specify the version of Python to use, if multiple versions are installed.
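The same check can also be made from inside Python, which shows which interpreter actually ran. The following is a minimal sketch; the function name is_python3 is hypothetical and is not part of the scripts in this directory.

```python
import sys

def is_python3(version_info=sys.version_info):
    """Return True if the given version tuple belongs to Python 3 or later."""
    # The first element of the version tuple is the major version number.
    return version_info[0] >= 3

if __name__ == "__main__":
    # The first token of sys.version is the version number, e.g. 3.x.x
    print("Running Python", sys.version.split()[0])
    if not is_python3():
        sys.exit("Python 3 is required for these scripts.")
```

If this is saved to a file and run with each of python, python3, and so on, the printed version shows which command points to Python 3.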
Find GBS/RAD tags associated with alleles from TASSEL
If you used VCF2RADdata, with phaseSNPs = TRUE and a non-null refgenome argument, to import a VCF that was generated by the TASSEL-GBSv2 pipeline, the script tassel_vcf_tags.py can help you find the full tag sequence(s) associated with each allele.
If obj is the name of a RADdata object in your R environment, from R run
cat(GetAlleleNames(obj), sep = "\n", file = "myalleles.txt")
Then in the Terminal, run
python tassel_vcf_tags.py -a myalleles.txt -s alignment_from_tassel.sam -o mytags.txt
where alignment_from_tassel.sam is the SAM file created by Bowtie2 or BWA as part of the TASSEL-GBSv2 pipeline.
The file mytags.txt is tab-delimited. The first column contains the allele names from polyRAD. The second column contains the tag sequences, starting at the restriction cut site. If multiple tag sequences matched the allele, they are separated by a semicolon (;). Note that if a tag aligned to the bottom strand, the sequence seen in the allele name may be the reverse complement of the sequence seen in the tag.
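As a sketch of how this output could be read back into Python, the code below assumes the layout described above: an allele name, a tab, then zero or more tag sequences separated by semicolons. It also assumes the file has no header line, which is not confirmed here. The helper names read_tags and reverse_complement are hypothetical, and the allele names and sequences in the example are made up.

```python
import csv
import io

# Complement table for DNA bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def read_tags(handle):
    """Map each allele name to a list of its tag sequences."""
    tags = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        allele = row[0]
        field = row[1] if len(row) > 1 else ""
        # Multiple tags for one allele are separated by semicolons;
        # an allele with no matching tag gets an empty list.
        tags[allele] = [t for t in field.split(";") if t]
    return tags

# Made-up content standing in for a real mytags.txt (assumed: no header line)
example = io.StringIO("locus1_AC\tTGCAGAAC;TGCAGAACT\nlocus2_GT\t\n")
tags = read_tags(example)
```

For a real file, the handle could come from open("mytags.txt", newline=""). When comparing a tag with the sequence in an allele name, checking both the tag and reverse_complement(tag) accounts for tags that aligned to the bottom strand.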
In my own data, this script has been successful in identifying tags for about 90% of alleles. The rest can be attributed to quirks in how TASSEL determines SNP locations, as well as errors in the phasing performed by VCF2RADdata.
Adjust tag alignments in highly duplicated genomes
The files process_sam_multi.py, isoloci_fun.py, and process_isoloci.py are intended to assist with assigning tags to correct genomic locations in highly duplicated reference genomes, such as those of recent or ancient allopolyploids. See the vignette "Variant and Genotype Calling in Highly Duplicated Genomes" (isolocus_sorting.Rmd) for more information.