Analysis of Massive SNP Arrays.
bigsnpr
{bigsnpr} is an R package for the analysis of massive SNP arrays, primarily designed for human genetics. It enhances the features of package {bigstatsr} for the purpose of analyzing genotype data.
To get you started:
List of functions from bigsnpr and from bigstatsr
Extended documentation with more examples + course recording
Installation
In R, run
# install.packages("remotes")
remotes::install_github("privefl/bigsnpr")
or for the CRAN version
install.packages("bigsnpr")
Input formats
This package reads bed/bim/fam files (PLINK's preferred format) using the functions snp_readBed() and snp_readBed2(). Before reading into this package's special format, quality control and conversion can be done using PLINK, which can be called directly from R using snp_plinkQC() and snp_plinkKINGQC().
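A minimal sketch of this workflow, assuming {bigsnpr} is installed and that a hypothetical "data.bed" (with its .bim and .fam companions) exists in the working directory:

```r
library(bigsnpr)

# Read the PLINK files once; this writes "data.rds" and "data.bk"
# (the backing file) and returns the path to the .rds file.
rds <- snp_readBed("data.bed")

# In this and any later R session, attach the bigSNP object from the .rds file.
obj.bigSNP <- snp_attach(rds)
```

Note that snp_readBed() only needs to be run once per dataset; afterwards, snp_attach() is enough.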
This package can also read UK Biobank BGEN files using the function snp_readBGEN(). This function takes around 40 minutes to read 1M variants for 400K individuals using 15 cores.
This package uses a class called bigSNP for representing SNP data. A bigSNP object is a list with these elements:
$genotypes: an FBM.code256. Rows are samples and columns are variants. This stores genotype calls or dosages (rounded to 2 decimal places).
$fam: a data.frame with some information on the individuals.
$map: a data.frame with some information on the variants.
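As a sketch of how these elements are accessed (assuming {bigsnpr} is installed; snp_attachExtdata() attaches a small example dataset shipped with the package):

```r
library(bigsnpr)

# Attach the example bigSNP object bundled with the package
obj <- snp_attachExtdata()

G <- obj$genotypes   # FBM.code256: rows = samples, columns = variants
dim(G)               # number of individuals x number of variants

head(obj$fam)        # data.frame with information on the individuals
head(obj$map)        # data.frame with information on the variants

G[1:5, 1:5]          # subsetting returns a standard R matrix of genotypes
```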
Note that most of the algorithms of this package don't handle missing values. You can use snp_fastImpute() (taking a few hours for a chip of 15K x 300K) and snp_fastImputeSimple() (taking a few minutes only) to impute missing values of genotyped variants.
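A sketch of the fast simple imputation, again using the bundled example data (the choice of method = "mean2", imputing by the variant mean rounded to 2 decimal places, is one of several documented options):

```r
library(bigsnpr)

obj <- snp_attachExtdata()   # example dataset shipped with the package
G <- obj$genotypes

# Impute missing genotypes by the variant mean (rounded to 2 decimals);
# this returns a new FBM.code256 with no missing values.
G2 <- snp_fastImputeSimple(G, method = "mean2")
```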
Package {bigsnpr} also provides functions that work directly on bed files with a few missing values (the bed_*() functions). See the paper "Efficient toolkit implementing best practices for principal component analysis of population genetic data".
Polygenic scores
Polygenic scores are one of the main focuses of this package. There are 3 main methods currently available:
Penalized regressions with individual-level data (see paper and tutorial)
Clumping and Thresholding (C+T) and Stacked C+T (SCT) with summary statistics and individual-level data (see paper and tutorial)
LDpred2 with summary statistics (see paper)
Possible upcoming features
Multiple imputation for GWAS (https://doi.org/10.1371/journal.pgen.1006091).
More interactive (visual) QC.
You can request a feature by opening an issue.
Bug report / Support
How to make a great R reproducible example?
Please open an issue if you find a bug.
If you want help using {bigstatsr} (the big_*() functions), please open an issue on {bigstatsr}'s repo, or post on Stack Overflow with the tag bigstatsr.
I will always redirect you to GitHub issues if you email me, so that others can benefit from our discussion.
References
Privé, Florian, et al. "Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr." Bioinformatics 34.16 (2018): 2781-2787.
Privé, Florian, et al. "Efficient implementation of penalized regression for genetic risk prediction." Genetics 212.1 (2019): 65-74.
Privé, Florian, et al. "Making the most of Clumping and Thresholding for polygenic scores." The American Journal of Human Genetics 105.6 (2019): 1213-1221.
Privé, Florian, et al. "Efficient toolkit implementing best practices for principal component analysis of population genetic data." Bioinformatics 36.16 (2020): 4449-4457.
Privé, Florian, et al. "LDpred2: better, faster, stronger." Bioinformatics 36.22-23 (2020): 5424-5431.
Privé, Florian. "Optimal linkage disequilibrium splitting." Bioinformatics 38.1 (2022): 255-256.
Privé, Florian. "Using the UK Biobank as a global reference of worldwide populations: application to measuring ancestry diversity from GWAS summary statistics." Bioinformatics 38.13 (2022): 3477-3480.
Privé, Florian, et al. Inferring disease architecture and predictive ability with LDpred2-auto. bioRxiv (2022).