Description

Informatic Sequence Classification Trees.

Provides tools for probabilistic taxon assignment with informatic sequence classification trees. See Wilkinson et al (2018) <doi:10.7287/peerj.preprints.26812v1>.

Informatic sequence classification trees

insect is an R package for taxonomic identification of amplicon sequence variants generated by DNA meta-barcoding analysis. The learning and classification algorithms implemented in the package are based on full probabilistic models (profile hidden Markov models) and offer highly accurate taxon IDs, albeit at a relatively high computational cost.

The package also contains functions for searching and downloading reference sequences and taxonomic information from NCBI, a "virtual PCR" tool for sequence trimming, a function for purging erroneously labeled reference sequences, and several other tools.

insect is designed to be used in conjunction with the dada2 pipeline or other de-noising tools that produce a list of amplicon sequence variants (ASVs). While unfiltered sequences can also be processed with high accuracy, the insect classification algorithm is relatively slow, since it uses a computationally intensive dynamic programming algorithm to find the likelihood of each sequence given the models at each node of the classification tree. Hence, filtered input datasets are generally much faster to process.
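
To give a feel for why this step is costly, the following is a minimal sketch of the forward algorithm, a standard dynamic programming recursion for computing the likelihood of a sequence under a hidden Markov model. It uses a plain two-state HMM rather than the profile HMMs used by the package, and the function name forward_loglik and all parameter values are made up for illustration; it is not the package's implementation.

```r
# Forward algorithm for a toy two-state HMM over the DNA alphabet.
# Assumption: all states and probabilities below are invented for illustration.
forward_loglik <- function(obs, init, trans, emit) {
  # alpha[k] is proportional to P(obs[1..t], state at t = k);
  # it is rescaled at every step to avoid numerical underflow.
  alpha <- init * emit[, obs[1]]
  loglik <- log(sum(alpha))
  alpha <- alpha / sum(alpha)
  for (t in seq_along(obs)[-1]) {
    alpha <- as.vector(alpha %*% trans) * emit[, obs[t]]
    loglik <- loglik + log(sum(alpha))
    alpha <- alpha / sum(alpha)
  }
  loglik
}

states <- c("AT_rich", "GC_rich")
init  <- c(0.5, 0.5)
trans <- matrix(c(0.9, 0.1,
                  0.2, 0.8), nrow = 2, byrow = TRUE,
                dimnames = list(states, states))
emit  <- matrix(c(0.4, 0.1, 0.1, 0.4,
                  0.1, 0.4, 0.4, 0.1), nrow = 2, byrow = TRUE,
                dimnames = list(states, c("A", "C", "G", "T")))

obs <- strsplit("ATTAGCGC", "")[[1]]
forward_loglik(obs, init, trans, emit)  # log-likelihood of the sequence
```

The work grows with the sequence length times the square of the number of states, and in a classification tree this kind of calculation is repeated for the models at each node visited, which is why fewer input sequences translate directly into shorter run times.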

Installation

To download insect from CRAN and load the package, run:

install.packages("insect")
library(insect)

To download the latest development version from GitHub, run:

devtools::install_github("shaunpwilkinson/insect", build_vignettes = TRUE) 
library(insect)

Classifying sequences

Classifiers for some of the more commonly used metabarcoding primer sets are available here:

Marker | Target | Primers | Source | Version | Date | Download
12S | Fish | MiFishUF/MiFishUR (Miya et al 2015) | GenBank | 1 | 20181111 | RDS (9 MB)
16S | Marine crustaceans | Crust16S_F/Crust16S_R (Berry et al 2017) | GenBank | 4 | 20180626 | RDS (7.1 MB)
16S | Marine fish | Fish16sF/16s2R (Berry et al 2017; Deagle et al 2007) | GenBank | 4 | 20180627 | RDS (6.8 MB)
18S | Marine eukaryotes | 18S_1F/18S_400R (Pochon et al 2017) | SILVA_132_LSUParc, GenBank | 5 | 20180709 | RDS (11.8 MB)
18S | Marine eukaryotes | 18S_V4F/18S_V4R (Stat et al 2017) | GenBank | 4 | 20180525 | RDS (11.5 MB)
23S | Algae | p23SrV_f1/p23SrV_r1 (Sherwood & Presting 2007) | SILVA_132_LSUParc | 1 | 20180715 | RDS (26.9 MB)
COI | Metazoans | mlCOIintF/jgHCO2198 (Leray et al 2013) | Midori, GenBank | 5 | 20181124 | RDS (140 MB)
ITS2 | Cnidarians and sponges | scl58SF/scl28SR (Wilkinson et al in prep) | GenBank | 5 | 20180920 | RDS (6.6 MB)

To classify a sequence or set of sequences, first read them into R as a "DNAbin" list object. FASTA files can be parsed as follows:

x <- readFASTA("<path-to-file>.fasta")

Alternatively, users may wish to assign taxon IDs to the output from the DADA2 pipeline, in which case the column names of the output table can be parsed as in the following example:

data("samoa") 
x <- char2dna(colnames(samoa))
## name the sequences sequentially
names(x) <- paste0("ASV", seq_along(x))
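
For readers without the samoa dataset to hand, the same idea can be sketched in base R on a toy table. The matrix below is a made-up stand-in for a DADA2 sequence table (samples in rows, ASV sequences as column names), and asv_lookup is a hypothetical name used only for this illustration.

```r
# Toy stand-in for a DADA2 sequence table: rows are samples, columns are ASVs,
# and the column names are the ASV sequences themselves (all values made up).
seqtab <- matrix(c(10L, 0L, 3L,
                    2L, 7L, 0L),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("sample1", "sample2"),
                                 c("ACGTACGT", "TTGACCAA", "GGCATCGA")))

# Keep a lookup from short labels to the full sequences, then relabel the
# columns so that the table and the classifier output share the same names.
asv_lookup <- setNames(colnames(seqtab), paste0("ASV", seq_len(ncol(seqtab))))
colnames(seqtab) <- names(asv_lookup)

asv_lookup[["ASV2"]]  # recover the sequence behind a label
seqtab
```

Keeping the lookup vector means the short ASV labels in the classification output can always be traced back to the original sequences and read counts.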

The next step is to download and read in the classifier. It is important to ensure that the classifier was trained using the same primer set as that used to generate the query data. In this example the data were generated from autonomous reef monitoring structures (ARMS) in American Samoa using the COI metabarcoding primers mlCOIintF and jgHCO2198 (Leray et al 2013), and de-noised, filtered and merged following the DADA2 tutorial.

The COI classifier was created using the MIDORI UNIQUE 20180221 training set, supplemented with around 14,000 non-metazoan COI sequences downloaded from GenBank.

The 140 MB classifier can be downloaded to the current working directory and read into R as follows:

download.file("https://www.dropbox.com/s/dvnrhnfmo727774/classifier.rds?dl=1", 
              destfile = "classifier.rds", mode = "wb")
classifier <- readRDS("classifier.rds")
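
Since the file is large, it may be worth skipping the download when a copy already exists in the working directory. The sketch below shows one way to do that; fetch_once is a hypothetical helper written for this example and is not part of insect.

```r
# Hypothetical helper (not part of insect): download a file only if it is
# not already present at `destfile`. The `fetch` argument defaults to
# utils::download.file and exists so the helper can be tried without a network.
fetch_once <- function(url, destfile, fetch = utils::download.file) {
  if (!file.exists(destfile)) {
    fetch(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Usage (commented out to avoid triggering a 140 MB download):
# fetch_once("https://www.dropbox.com/s/dvnrhnfmo727774/classifier.rds?dl=1",
#            "classifier.rds")
# classifier <- readRDS("classifier.rds")
```

Note that this only checks that the file exists; a partially downloaded file would need to be deleted by hand before retrying.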

There is an option to perform a nearest-neighbor search prior to the computationally expensive recursive model test procedure, which can save time and improve resolution ('recall') at lower taxonomic ranks. Note that this can be a double-edged sword; if multiple species share an identical or near-identical sequence, and the true taxon of the query sequence is missing from the training set, the algorithm may over-classify the sequence and return a congeneric taxon. To perform a nearest-neighbor search with a similarity threshold of 0.99 (meaning any sequence in the training set with a similarity greater than or equal to 99% is considered a match), set ping = 0.99. To stay on the safe side, we will set ping = 1 (i.e. only sequences with 100% identity are considered matches).

out <- classify(x, classifier, threshold = 0.8, ping = 1)
representative | taxID | taxon | rank | score | kingdom | phylum | class | order | family | genus | species
ASV1 | 2806 | Florideophyceae | class | 0.9981 | | | Florideophyceae | | | |
ASV2 | 6379 | Chaetopterus | genus | 1.0000 | Metazoa | Annelida | Polychaeta | Spionida | Chaetopteridae | Chaetopterus |
ASV3 | 2806 | Florideophyceae | class | 0.9989 | | | Florideophyceae | | | |
ASV4 | 2172821 | Multicrustacea | superclass | 1.0000 | Metazoa | Arthropoda | | | | |
ASV5 | 131567 | cellular organisms | no rank | 0.9952 | | | | | | |
ASV6 | 2806 | Florideophyceae | class | 0.9981 | | | Florideophyceae | | | |
ASV7 | 39820 | Nereididae | family | 1.0000 | Metazoa | Annelida | Polychaeta | Phyllodocida | Nereididae | |
ASV8 | 116571 | Podoplea | superorder | 0.9995 | Metazoa | Arthropoda | Hexanauplia | | | |
ASV9 | 2806 | Florideophyceae | class | 0.9482 | | | Florideophyceae | | | |
ASV10 | 1 | root | no rank | NA | | | | | | |
ASV11 | 115834 | Hesionidae | family | 1.0000 | Metazoa | Annelida | Polychaeta | Phyllodocida | Hesionidae | |
ASV12 | 1443949 | Corallinophycidae | subclass | 0.9910 | | | Florideophyceae | | | |
ASV13 | 33213 | Bilateria | no rank | 1.0000 | Metazoa | | | | | |
ASV14 | 131567 | cellular organisms | no rank | 0.9952 | | | | | | |
ASV15 | 2806 | Florideophyceae | class | 0.9993 | | | Florideophyceae | | | |
ASV16 | 39820 | Nereididae | family | 1.0000 | Metazoa | Annelida | Polychaeta | Phyllodocida | Nereididae | |
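
A table like this is usually summarized before further analysis. The sketch below rebuilds a few of the rows above by hand as a plain data frame (so it runs without the classifier) and counts assignments per rank; the object names are illustrative, and the real object returned by classify has more columns than shown here.

```r
# A few rows copied from the output above, with columns trimmed for brevity.
out <- data.frame(
  representative = c("ASV1", "ASV2", "ASV5", "ASV7", "ASV10"),
  taxon = c("Florideophyceae", "Chaetopterus", "cellular organisms",
            "Nereididae", "root"),
  rank  = c("class", "genus", "no rank", "family", "no rank"),
  score = c(0.9981, 1.0000, 0.9952, 1.0000, NA),
  stringsAsFactors = FALSE
)

# Count how many ASVs were assigned at each taxonomic rank.
rank_counts <- table(out$rank)
rank_counts

# Keep only the ASVs that were resolved to a named rank.
resolved <- out[out$rank != "no rank", ]
resolved
```

Sequences returned as "root" or "cellular organisms" could not be placed with confidence at any lower rank, so they are often set aside at this stage.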

Further reading

A more detailed overview of the package and its functions can be found in the package vignette, which can be opened by running:

vignette("insect-vignette")

Issues

If you experience a problem using this software please feel free to raise it as an issue on GitHub.

Acknowledgements

This software was developed at Victoria University of Wellington with funding from a Rutherford Foundation Postdoctoral Research Fellowship award from the Royal Society of New Zealand. Unpublished COI data care of Molly Timmers (NOAA).

Metadata

Version

1.4.2

License

Unknown

Platforms (75)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
  • aarch64-darwin
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-darwin
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-darwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows