MyNixOS website logo
Description

Calculate F Statistics Using Mixed Haploid and Diploid Organism Data.

Provides functions to estimate population genetics summary statistics from haplo-diploid systems, where one sex is haploid and the other diploid (e.g. Hymenoptera insects). It implements a theoretical model assuming equal sex ratio, random mating, no selection, no mutation, and no gene flow, deriving expected genotype frequencies for both sexes under these equilibrium conditions. The package includes windowed calculations (operating over genomic sliding windows from VCF input) for allele and genotype frequencies, the inbreeding coefficient (Fis), pairwise Fst, Nei's H (gene diversity), Watterson's Theta, and sex-specific reference allele frequencies. Most statistics are agnostic to ploidy, allowing the package to be applied to both strictly haplo-diploid and fully diploid systems.

R-CMD-check

Haplo-Diploid Equilibrium

Currently there are no user-friendly software that can handle mixed-ploidy datasets for haplo-diploid animals (e.g. ants, bees, wasps) aiming to compute basic population genetic statistics. Researchers working with such animals have either 1) sequenced only females or 2) transform the male genotypes from haploid to diploid. These solutions do not come without challenges, focusing only in females is easier when working with social organisms (e.g. honeybees, ants, social wasps) but more difficult when working with solitary ones (e.g. 70% of all bee species are solitary). Furthermore neglecting males can hinder the capacity to better understand some questions such as sex-biased dispersion. The second solution involves changing the genotype and allele frequencies, biasing any population genetic statistic.

We developed a R framework to compute some population genetic statistics of haploid-diploid data without any data transformation. Furthermore we relaxed Hardy-Weinber model in order to accommodate for the differences in ploidy in the population and the the particular reproductive cycle of haplo-diploids (diploids mate with haploids to produce other diploids but produce haploids through asexual reproduction).

This repository contains:

  1. R scripts to estimate population genetic summary statistics from haplo-diploid systems (Script_folder)
  2. Fake data (.vcf and .txt) to test scripts (Fake_data)
  3. A theoretical explanation of the model (see below)

List of scripts

  • R/HaploDip_GenoFreq.R: Functions to calculate genotype frequencies and FI
    /sub
    (not agnostic to ploidy)
  • R/Fst_HapDip.R: Functions to calculate FST from allele frequency data (agnostic to ploidy)
  • R/Haplo-Dip_Sex_Ref_Allele.R: Functions to calculate reference allele frequencies (0) in each sex across populations (agnostic to ploidy)
  • R/HaploDip_Neis_H.R: Functions to calculate Nei's H, a genetic diversity metric (agnostic to ploidy)
  • unfinished/Watternson_Theta.R: script to calculate Watterson's Theta (agnostic to ploidy) unfinished
  • R/helper_functions.R: Helper function for loading data

Running example (e.g. Nei's H)

  • Import vcf data
    vcf.data <- vcf2GT("path/to/input.vcf")

  • Create matrix with sites as rows and samples as columns
    gt_matrix <- vcf.data$gt_matrix

  • contig/scafold/chromosome names
    contigs <- vcf.data$contig_vector

  • site positions
    pos <- vcf.data$positions

  • Import population file with two columns: 1st with sample names (equal to gt_matrix columns) and 2nd with population names. Column names must be ID and Pop
    Pop.File <- read.csv("path/to/input.txt")

  • sanity check
    colnames(gt_matrix) == Pop.File$ID

  • Calculate Nei's H by population within an adjustable (e.g. 10k bp) window size, within each contig
    Neis_h <- compute_Hs_W(geno.data = gt_matrix, pop.file = Pop.File, contigs = contigs, positions = pos, window.size = 10000)

  • Summarise Nei's H. For each population returns mean and sd, weighted by the number of sites at each window
    summary_H_pop <- summarize_NeisH(Neis_h)

Theoretical explanation of the model

In a haplo-diploid system where one sex is haploid and another diploid (e.g. hymentoptera insects) and assuming

  1. equal sex ratio
  2. random mating
  3. infinite population size (no drift)
  4. equal probability of reproduction (no selection),
  5. alleles don`t change from individual to gene pool (no mutation)
  6. no addition of gametes from other gene pools (no gene flow)
  7. equal allele frequencies across sexes
  8. no overlapping generations

Let p be the frequency of the allele "A" and q be the frequency of the allele "a" of a bi-allelic marker so that p + q = 1, then:
f(AA) = p² x 1/2 = p²/2
f(aa) = q² x 1/2 = q²/2
f(Aa) = 2pq x 1/2 = pq
f(A) = p x 1/2 = p/2
f(a) = q x 1/2 = q/2

f(AA) + f(Aa) + f(aa) + f(A) + f(a) = p²/2 + pq + q²/2 + p/2 + q/2 =
= 1/2 x (p² + 2pq + q² + p + q) =
= 1/2 x (p² + 2pq + q²) + 1/2 x (p + q) =
= 1/2 x (p + q)² + 1/2 x (1) =
= 1/2 x (1)² + 1/2 =
= 1/2 + 1/2 =
= 1

Let N(X) be the number of individuals with genotype X, and N.dip and N.haplo being the number of diploid and haploid individuals respectively, then:
p = ( 2x N(AA) + N(Aa) + N(A) ) / (2x N.dip + N.haplo)
q = ( 2x N(aa) + N(Aa) + N(a) ) / (2x N.dip + N.haplo

In order to calculate genotype frequencies on next generation:
if offspring is diploid, then: \

Mating pairOffspring genotype distributionMating pair frequency
AA x AAA (100%)2x ( f(AA) x f(A) )
AA x aAa (100%)2x ( f(AA) x f(a) )
Aa x AAa (50%) + AA (50%)2x ( f(Aa) x f(A) )
Aa x aAa (50%) + aa (50%)2x ( f(Aa) x f(a) )
aa x AAa (100%)2x ( f(aa) x f(A) )
aa x aaa (100%)2x ( f(aa) x f(a) )

if offspring is haploid, then: \

Parent genotypeOffspring genotype distributionParent frequency
AAA (100%)f(AA)
AaA (50%) + a (50%)f(Aa)
aaa (100%)f(aa)

The frequency of each genotype of next generation are:
f(AA)t+1 = 2x (f(AA) x f(A)) + 2x (f(Aa) x f(A))/2 = 2x (p²/2 x p/2) + (pq x p/2) = 2x p³/4 + p²q/2 = p³/2 + p²q/2 = p²/2 x (p + q) = p²/2 x 1 = p²/2

f(aa)t+1 = 2x (f(aa) x f(a)) + 2x (f(Aa) x f(a))/2 = 2x (q²/2 x q/2) + (pq x q/2) = 2x q³/4 + q²p/2 = q³/2 + q²p/2 = q²/2 x (q + p) = q²/2 x 1 = q²/2

f(Aa)t+1 = 2x (f(AA) x f(a)) + 2x (f(Aa) x f(a))/2 + 2x (f(Aa) x f(A))/2 + 2x(f(aa) x f(a)) = 2x(p²/2 x q/2) + (pq x q/2) + (pq x p/2) + 2x(q²/2 + p/2) = = p²q/2 + pq²/2 + p²q/2 + pq²/2 = 1/2 x (p²q + pq² + p²q + pq²) = 1/2 x (2p²q + 2pq²) = 2pq/2 x (p + q) = pq x (1) = pq

f(A)t+1 = f(AA) + f(Aa)/2 = p²/2 + pq/2 = p/2 x (p + q) = p/2 x (1) = p/2

f(a)t+1 = f(aa) + f(Aa)/2 = q²/2 + pq/2 = q/2 x (p + q) = q/2 x (1) = q/2

GenotypeFrequency at t0Frequency at t+1
AAp²/2p²/2
Aapqpq
aaq²/2q²/2
Ap/2p/2
aq/2q/2

Since, the genotype frequencies in t+1 are the same as in t0, no evolution as occurred.

if inbreeding is present:

Let f be the probability of a individual's allele be identical by descent with its other allele and (1-f) the probability that is not. Let XX' be the genotype frequency of XX under inbreeding

AA' = fp + (1-f) x p²/2 = fp + p²/2 -fp²/2 = p²/2 + fp(1-p/2) aa' = fq + (1-f) x q²/2 = fq + p²/2 -fq²/2 = q²/2 + fq(1-q/2) A' = p/2 and a' = q/2 Aa' = 1 - (AA'+ aa' + A' + a') = 1 - [p²/2 + fp(1-p/2) + q²/2 + fq(1-q/2) + p/2 + q/2] = 1 - [p²/2 + q²/2 + p/2 + q/2 + fp + fq - fp²/2 - fq²/2] = 1 - [(1 - pq) + f(-p²/2 -q²/2 + p + q)] = 1 - [1- pq + f(1 - p²/2 - q²/2)] = pq - f(1 - p²/2 - q²/2)

GenotypeFrequency with inbreeding
AAp²/2 + fp(1-p/2)
Aapq - f(1- p²/2 -q²/2)
aaq²/2 + fq(1-p/2)
Ap/2
aq/2
Genotypef = 0f = 1
AAp²/2 + 0p
Aapq - 0-p/2 -q/2
aaq²/2 + 0q
Ap/2p/2
aq/2q/2
Metadata

Version

0.1.0

License

Unknown

Platforms (78)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
Show all
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows