Description

1000 Genomes Project Metadata.

Metadata about populations and data about samples from the 1000 Genomes Project, including the 2,504 samples sequenced for the Phase 3 release and the expanded collection of 3,202 samples with 602 additional trios. The data is described in Auton et al. (2015) <doi:10.1038/nature15393> and Byrska-Bishop et al. (2022) <doi:10.1016/j.cell.2022.08.004>, and raw data is available at <http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/>. See Turner (2022) <doi:10.48550/arXiv.2210.00539> for more details.

kgp


This kgp data package provides metadata about populations and data about samples from the 1000 Genomes Project, including the 2,504 samples sequenced for the Phase 3 release and the expanded collection of 3,202 samples with 602 additional trios.

Installation

You can install the released version of kgp from CRAN with:

install.packages("kgp")

You can install the development version of kgp from GitHub with:

# install.packages("devtools")
devtools::install_github("stephenturner/kgp")

About the data

The 1000 Genomes Project Phase 3 data contains 2,504 samples with sequence data available. The collection was later expanded to 3,202 samples sequenced at high coverage, adding 602 trios. Data is available through the 1000 Genomes FTP site and GitHub.

There are three data sets available in the kgp package.

library(kgp)
data(kgp)

The kgp3 data contains pedigree and population information for the 2,504 samples included in the Phase 3 release of the 1000 Genomes Project data.

kgp3
#> # A tibble: 2,504 × 10
#>    fid     id      pid   mid     sex sexf   pop   reg   population        region
#>    <chr>   <chr>   <chr> <chr> <int> <fct>  <chr> <chr> <chr>             <chr> 
#>  1 HG00096 HG00096 0     0         1 male   GBR   EUR   British in Engla… Europe
#>  2 HG00097 HG00097 0     0         2 female GBR   EUR   British in Engla… Europe
#>  3 HG00099 HG00099 0     0         2 female GBR   EUR   British in Engla… Europe
#>  4 HG00100 HG00100 0     0         2 female GBR   EUR   British in Engla… Europe
#>  5 HG00101 HG00101 0     0         1 male   GBR   EUR   British in Engla… Europe
#>  6 HG00102 HG00102 0     0         2 female GBR   EUR   British in Engla… Europe
#>  7 HG00103 HG00103 0     0         1 male   GBR   EUR   British in Engla… Europe
#>  8 HG00105 HG00105 0     0         1 male   GBR   EUR   British in Engla… Europe
#>  9 HG00106 HG00106 0     0         2 female GBR   EUR   British in Engla… Europe
#> 10 HG00107 HG00107 0     0         1 male   GBR   EUR   British in Engla… Europe
#> # … with 2,494 more rows
#> # ℹ Use `print(n = ...)` to see more rows

The kgpe data contains pedigree and population information for all 3,202 samples included in the expanded 1000 Genomes Project data, which includes 602 trios.

kgpe
#> # A tibble: 3,202 × 11
#>    fid     id      pid   mid     sex sexf   pop   reg   population region phase3
#>    <chr>   <chr>   <chr> <chr> <int> <fct>  <chr> <chr> <chr>      <chr>  <lgl> 
#>  1 HG00096 HG00096 0     0         1 male   GBR   EUR   British i… Europe TRUE  
#>  2 HG00097 HG00097 0     0         2 female GBR   EUR   British i… Europe TRUE  
#>  3 HG00099 HG00099 0     0         2 female GBR   EUR   British i… Europe TRUE  
#>  4 HG00100 HG00100 0     0         2 female GBR   EUR   British i… Europe TRUE  
#>  5 HG00101 HG00101 0     0         1 male   GBR   EUR   British i… Europe TRUE  
#>  6 HG00102 HG00102 0     0         2 female GBR   EUR   British i… Europe TRUE  
#>  7 HG00103 HG00103 0     0         1 male   GBR   EUR   British i… Europe TRUE  
#>  8 HG00105 HG00105 0     0         1 male   GBR   EUR   British i… Europe TRUE  
#>  9 HG00106 HG00106 0     0         2 female GBR   EUR   British i… Europe TRUE  
#> 10 HG00107 HG00107 0     0         1 male   GBR   EUR   British i… Europe TRUE  
#> # … with 3,192 more rows
#> # ℹ Use `print(n = ...)` to see more rows
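
The phase3 column printed above can be used to separate the samples that were part of the Phase 3 release from those added in the expanded collection. The following is a minimal sketch, assuming phase3 is a logical flag as the tibble header suggests; the object names phase3_counts and kgpe_added are illustrative and not part of the package.

```r
library(dplyr)
library(kgp)
data(kgp)

# Tally samples by whether they were part of the Phase 3 release
phase3_counts <- kgpe %>%
  count(phase3)

# Keep only the samples that were not part of the Phase 3 release
# (assumes phase3 is TRUE/FALSE, as shown in the printed tibble)
kgpe_added <- kgpe %>%
  filter(!phase3)
```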

The kgpmeta data contains population metadata for the 26 populations across five continental regions.

kgpmeta
#> # A tibble: 26 × 7
#>    pop   population                          reg   region   regco…¹   lat    lng
#>    <chr> <chr>                               <chr> <chr>    <chr>   <dbl>  <dbl>
#>  1 CDX   Chinese Dai in Xishuangbanna, China EAS   East As… #778500  22   101.  
#>  2 CHB   Han Chinese in Bejing, China        EAS   East As… #778500  39.9 116.  
#>  3 JPT   Japanese in Tokyo, Japan            EAS   East As… #778500  35.7 140.  
#>  4 KHV   Kinh in Ho Chi Minh City, Vietnam   EAS   East As… #778500  10.8 107.  
#>  5 CHS   Southern Han Chinese, China         EAS   East As… #778500  23.1 113.  
#>  6 BEB   Bengali in Bangladesh               SAS   South A… #c44cfd  23.7  90.4 
#>  7 GIH   Gujarati Indian in Houston,TX       SAS   South A… #c44cfd  29.8 -95.4 
#>  8 ITU   Indian Telugu in the UK             SAS   South A… #c44cfd  52.5  -1.89
#>  9 PJL   Punjabi in Lahore,Pakistan          SAS   South A… #c44cfd  31.6  74.4 
#> 10 STU   Sri Lankan Tamil in the UK          SAS   South A… #c44cfd  52.5  -1.89
#> # … with 16 more rows, and abbreviated variable name ¹​regcolor
#> # ℹ Use `print(n = ...)` to see more rows
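
Because kgpmeta and the sample tables share the pop column, population-level metadata such as coordinates can be attached to per-population summaries with a join. The following is a sketch under that assumption; pop_counts is an illustrative name.

```r
library(dplyr)
library(kgp)
data(kgp)

# Count Phase 3 samples per population code, then attach the
# coordinates and region color for each population from kgpmeta
pop_counts <- kgp3 %>%
  count(pop) %>%
  left_join(
    kgpmeta %>% select(pop, lat, lng, regcolor),
    by = "pop"
  )
```

The result has one row per population, which is a convenient shape for sizing map markers by sample count.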

Examples

library(dplyr)
library(ggplot2)
library(kgp)
data(kgp)

Count the number of samples in each region, or in each population:

kgp3 %>% 
  count(region) %>% 
  knitr::kable()

| region     |   n |
|:-----------|----:|
| Africa     | 661 |
| America    | 347 |
| East Asia  | 504 |
| Europe     | 503 |
| South Asia | 489 |

kgp3 %>% 
  count(region, population) %>% 
  knitr::kable()

| region     | population                                                 |   n |
|:-----------|:-----------------------------------------------------------|----:|
| Africa     | African Ancestry in Southwest US                           |  61 |
| Africa     | African Caribbean in Barbados                              |  96 |
| Africa     | Esan in Nigeria                                            |  99 |
| Africa     | Gambian in Western Division, The Gambia                    | 113 |
| Africa     | Luhya in Webuye, Kenya                                     |  99 |
| Africa     | Mende in Sierra Leone                                      |  85 |
| Africa     | Yoruba in Ibadan, Nigeria                                  | 108 |
| America    | Colombian in Medellin, Colombia                            |  94 |
| America    | Mexican Ancestry in Los Angeles, California                |  64 |
| America    | Peruvian in Lima, Peru                                     |  85 |
| America    | Puerto Rican in Puerto Rico                                | 104 |
| East Asia  | Chinese Dai in Xishuangbanna, China                        |  93 |
| East Asia  | Han Chinese in Bejing, China                               | 103 |
| East Asia  | Japanese in Tokyo, Japan                                   | 104 |
| East Asia  | Kinh in Ho Chi Minh City, Vietnam                          |  99 |
| East Asia  | Southern Han Chinese, China                                | 105 |
| Europe     | British in England and Scotland                            |  91 |
| Europe     | Finnish in Finland                                         |  99 |
| Europe     | Iberian populations in Spain                               | 107 |
| Europe     | Toscani in Italy                                           | 107 |
| Europe     | Utah residents with Northern and Western European ancestry |  99 |
| South Asia | Bengali in Bangladesh                                      |  86 |
| South Asia | Gujarati Indian in Houston,TX                              | 103 |
| South Asia | Indian Telugu in the UK                                    | 102 |
| South Asia | Punjabi in Lahore,Pakistan                                 |  96 |
| South Asia | Sri Lankan Tamil in the UK                                 | 102 |

kgp3 %>% 
  count(region, population) %>% 
  arrange(region, n) %>% 
  mutate(population=forcats::fct_inorder(population)) %>% 
  ggplot(aes(population, n)) + 
  geom_col(aes(fill=region)) + 
  labs(fill=NULL, x=NULL, y="N") + 
  coord_flip() + 
  theme_bw() + 
  theme(legend.position="bottom")

The latitude and longitude coordinates in kgpmeta can be used to plot a map of the locations of the 1000 Genomes populations. There is also a column for region color, which provides a hexadecimal color code to enable reproduction of the population data map as shown on the IGSR population data page. The figure below shows a static map produced using ggplot2, but interactive maps such as that shown on the IGSR population data portal can be created with the leaflet package.

pal <- kgpmeta %>% distinct(reg, regcolor) %>% tibble::deframe()
ggplot() + 
  geom_polygon(data=map_data("world"), 
               aes(long, lat, group=group), 
               col="gray30", fill="gray95", lwd=.2, alpha=.5) + 
  geom_point(data=kgpmeta, aes(lng, lat, col=reg), size=4) + 
  scale_colour_manual(values=pal) +
  theme_minimal() + 
  theme(axis.ticks = element_blank(), 
        axis.text = element_blank(), 
        axis.title = element_blank(), 
        legend.title = element_blank(),
        panel.grid = element_blank(),
        legend.position = "bottom")

Figure: Map showing locations of the 1000 Genomes Phase 3 populations.
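
For the interactive variant mentioned above, a rough sketch with the leaflet package might look like the following. It assumes leaflet is installed, and the marker styling values (radius, opacity) are arbitrary choices rather than anything taken from the IGSR portal.

```r
library(leaflet)
library(kgp)
data(kgp)

# One circle marker per population, colored with the hex codes in
# regcolor and labeled on hover with the full population name
m <- leaflet(kgpmeta) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~lng, lat = ~lat,
    color = ~regcolor,
    label = ~population,
    radius = 6, stroke = FALSE, fillOpacity = 0.8
  )

m
```

Printing m in an interactive session or an R Markdown HTML document renders the map as an HTML widget.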

The table below shows pedigree information for a selection of samples from kgpe. This pedigree information could be used in downstream analysis to filter out related individuals, select only trios, or visualize family structure.

kgpe %>% 
  filter(pid!="0" & mid!="0") %>% 
  group_by(pop) %>% 
  slice(1) %>% 
  head(12) %>% 
  arrange(reg, pop) %>% 
  select(fid:reg) %>% 
  select(-sexf) %>% 
  knitr::kable()
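
| fid    | id      | pid     | mid     | sex | pop | reg |
|:-------|:--------|:--------|:--------|----:|:----|:----|
| BB01   | HG01881 | HG01879 | HG01880 |   2 | ACB | AFR |
| 2367   | NA19702 | NA19700 | NA19701 |   1 | ASW | AFR |
| NG06   | HG02924 | HG02923 | HG02922 |   1 | ESN | AFR |
| GB15   | HG02463 | HG02461 | HG02462 |   1 | GWD | AFR |
| SL02   | HG03056 | HG03054 | HG03055 |   1 | MSL | AFR |
| CLM03  | HG01114 | HG01112 | HG01113 |   2 | CLM | AMR |
| SH001  | HG00405 | HG00403 | HG00404 |   2 | CHS | EAS |
| VN046  | HG02015 | HG02017 | HG02016 |   1 | KHV | EAS |
| 1341   | NA06991 | NA06993 | NA06985 |   2 | CEU | EUR |
| IBS001 | HG01502 | HG01500 | HG01501 |   1 | IBS | EUR |
| BD01   | HG03008 | HG03006 | HG03007 |   1 | BEB | SAS |
| IT002  | HG03719 | HG03725 | HG03722 |   2 | ITU | SAS |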
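
The filtering uses described above can be sketched with the pid and mid columns, where "0" marks a missing parent. This is a simplified illustration, not a full relatedness filter: samples without recorded parents may still be related to one another (for example, the two parents of a trio are unrelated to each other but related to their child). The names founders and trio_children are illustrative.

```r
library(dplyr)
library(kgp)
data(kgp)

# Samples with no recorded father or mother
founders <- kgpe %>%
  filter(pid == "0" & mid == "0")

# Children whose father and mother are both present as samples
# in kgpe, i.e. candidates for complete trios
trio_children <- kgpe %>%
  filter(pid %in% kgpe$id & mid %in% kgpe$id)
```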

The figure below shows an example of a pedigree plot made by parsing the pedigree information using skater and plotting using kinship2. The skater package provides documentation, examples, and a vignette demonstrating how to iteratively plot all pedigrees in a given data set.

kgpe %>% 
  filter(fid=="13291") %>% 
  transmute(fid, id, dadid=pid, momid=mid, sex, affected=1) %>% 
  skater::fam2ped() %>% 
  pull(ped) %>% 
  purrr::pluck(1) %>% 
  kinship2::plot.pedigree(mar=c(4,2,4,2), cex=.8)

Figure: Trios in 1000 Genomes Project family 13291.

Metadata

Version

1.1.1

License

Unknown

Platforms (75)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
  • aarch64-darwin
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-darwin
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-darwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows