A 'Sparklyr' Extension for 'Hail'.
A sparklyr extension for Hail
Hail is an open-source, general-purpose, Python-based data analysis tool with additional data types and methods for working with genomic data. Hail is built to scale and has first-class support for multi-dimensional structured data, like the genomic data in a genome-wide association study (GWAS). Hail is exposed as a Python library, using primitives for distributed queries and linear algebra implemented in Scala, Spark, and increasingly C++.
The sparkhail
is a R extension using sparklyr package. The idea is to help R users to use Hail functionalities with the well know tidyverse sintax. In this README we are going to reproduce the GWAS tutorial using sparkhail
, sparklyr
, dplyr
and ggplot2
.
Installation
To upgrade to the latest version of sparkhail, run the following command and restart your R session:
install.packages("devtools")
devtools::install_github("r-spark/sparkhail")
You can install Hail manually or using hail_install()
.
sparkhail::hail_install()
Read a matrix table
The data in Hail is naturally represented as a Hail MatrixTable. The sparkhail
converts the MatrixTable to dataframe, in this way is easier to manipulate the data using dplyr
.
library(sparkhail)
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.4", config = hail_config())
hl <- hail_context(sc)
mt <- hail_read_matrix(hl, system.file("extdata/1kg.mt", package = "sparkhail"))
Convert to spark Data Frame as follows
df <- hail_dataframe(mt)
Getting to know our data
You can see the data structure using glimpse()
.
library(dplyr)
glimpse(df)
## Observations: ??
## Variables: 7
## Database: spark_connection
## $ locus <chr> "1:904165", "1:909917", "1:986963", "1:1563691", "1:1707…
## $ alleles <list> [["G", "A"], ["G", "A"], ["C", "T"], ["T", "G"], ["T", …
## $ rsid <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ qual <dbl> 52346.37, 1576.94, 398.06, 1090.75, 93517.82, 736.40, 14…
## $ filter <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ info <list> [[[518], [0.103], 5020, -3.394, -0.17, 17827, FALSE, 2.…
## $ entries <list> [[[4, [4, 0], 4, 12, [0, 12, 147]], [4, [8, 0], 8, 24, …
It’s important to have easy ways to slice, dice, query, and summarize a dataset. The conversion to dataframe is a good way to use dplyr
verbs. Let’s see some examples.
df %>%
dplyr::select(locus, alleles) %>%
head(5)
## Warning: `overscope_eval_next()` is deprecated as of rlang 0.2.0.
## Please use `eval_tidy()` with a data mask instead.
## This warning is displayed once per session.
## Warning: `overscope_clean()` is deprecated as of rlang 0.2.0.
## This warning is displayed once per session.
## # Source: spark<?> [?? x 2]
## locus alleles
## <chr> <list>
## 1 1:904165 <list [2]>
## 2 1:909917 <list [2]>
## 3 1:986963 <list [2]>
## 4 1:1563691 <list [2]>
## 5 1:1707740 <list [2]>
Here is how to peek at the first few sample IDs:
s <- hail_ids(mt)
s
## # Source: spark<s> [?? x 1]
## s
## <chr>
## 1 HG00096
## 2 HG00099
## 3 HG00105
## 4 HG00118
## 5 HG00129
## 6 HG00148
## 7 HG00177
## 8 HG00182
## 9 HG00242
## 10 HG00254
## # … with more rows
The genotype calls are in entries
column and we can see it using hail_entries()
function. This function selects and explodes the data frame using sparklyr.nested
.
hail_entries(df)
## # Source: spark<?> [?? x 5]
## call ad dp gq pl
## <int> <list> <int> <int> <list>
## 1 4 <list [284]> 4 12 <list [284]>
## 2 4 <list [284]> 4 24 <list [284]>
## 3 4 <list [284]> 4 23 <list [284]>
## 4 4 <list [284]> 4 21 <list [284]>
## 5 4 <list [284]> 4 15 <list [284]>
## 6 4 <list [284]> 4 11 <list [284]>
## 7 4 <list [284]> 4 6 <list [284]>
## 8 4 <list [284]> 4 14 <list [284]>
## 9 4 <list [284]> 4 15 <list [284]>
## 10 4 <list [284]> 4 39 <list [284]>
## # … with more rows
Adding column fields
A Hail MatrixTable can have any number of row fields and column fields for storing data associated with each row and column. Annotations are usually a critical part of any genetic study. Column fields are where you’ll store information about sample phenotypes, ancestry, sex, and covariates. Row fields can be used to store information like gene membership and functional impact for use in QC or analysis.
The file provided contains the sample ID, the population and “super-population” designations, the sample sex, and two simulated phenotypes (one binary, one discrete).
This file is a standard text file and can be imported using sparklyr
.
annotations <- spark_read_csv(sc, "table",
path = system.file("extdata/1kg_annotations.txt",
package = "sparkhail"),
overwrite = TRUE,
delimiter = "\t")
A good way to peek at the structure of a Table is to look at its schema.
glimpse(annotations)
## Observations: ??
## Variables: 6
## Database: spark_connection
## $ Sample <chr> "NA19784", "NA19102", "HG00141", "HG01890", …
## $ Population <chr> "MXL", "YRI", "GBR", "ACB", "GBR", "GIH", "S…
## $ SuperPopulation <chr> "AMR", "AFR", "EUR", "AFR", "EUR", "SAS", "S…
## $ isFemale <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE,…
## $ PurpleHair <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALS…
## $ CaffeineConsumption <int> 8, 6, 6, 8, 6, 9, 9, 5, 6, 5, 5, 6, 9, 3, 5,…
Now we’ll use this table to add sample annotations to our dataset. To merge these data we can use joins.
annotations_sample <- inner_join(s, annotations, by = c("s" = "Sample"))
## Warning: `chr_along()` is deprecated as of rlang 0.2.0.
## This warning is displayed once per session.
Query functions
We will start by looking at some statistics of the information in our data. We can aggregate using group_by()
and count the number of occurrences using tally()
.
annotations %>%
group_by(SuperPopulation) %>%
tally()
## # Source: spark<?> [?? x 2]
## SuperPopulation n
## <chr> <dbl>
## 1 AFR 1018
## 2 AMR 535
## 3 SAS 661
## 4 EUR 669
## 5 EAS 617
We can use sdf_describe()
to see the summary statistics of the data.
sdf_describe(annotations)
## # Source: spark<?> [?? x 5]
## summary Sample Population SuperPopulation CaffeineConsumption
## <chr> <chr> <chr> <chr> <chr>
## 1 count 3500 3500 3500 3500
## 2 mean <NA> <NA> <NA> 6.219714285714286
## 3 stddev <NA> <NA> <NA> 1.93905718305461
## 4 min HG00096 ACB AFR 3
## 5 max NA21144 YRI SAS 10
However, these metrics aren’t perfectly representative of the samples in our dataset. Here’s why:
sdf_nrow(annotations)
## [1] 3500
sdf_nrow(annotations_sample)
## [1] 284
Since there are fewer samples in our dataset than in the full thousand genomes cohort, we need to look at sample annotations on the dataset.
annotations_sample %>%
group_by(SuperPopulation) %>%
tally()
## # Source: spark<?> [?? x 2]
## SuperPopulation n
## <chr> <dbl>
## 1 AFR 76
## 2 AMR 34
## 3 SAS 55
## 4 EAS 72
## 5 EUR 47
sdf_describe(annotations)
## # Source: spark<?> [?? x 5]
## summary Sample Population SuperPopulation CaffeineConsumption
## <chr> <chr> <chr> <chr> <chr>
## 1 count 3500 3500 3500 3500
## 2 mean <NA> <NA> <NA> 6.219714285714286
## 3 stddev <NA> <NA> <NA> 1.93905718305461
## 4 min HG00096 ACB AFR 3
## 5 max NA21144 YRI SAS 10
Let’s see another example, now we are going to calculate the counts of each of the 12 possible unique SNPs (4 choices for the reference base * 3 choices for the alternate base). To do this, we need to get the alternate allele of each variant and then count the occurences of each unique ref/alt pair. The alleles column is nested, because of this, we need to separete this column using sdf_separate_column()
.
df %>%
sdf_separate_column("alleles") %>%
group_by(alleles_1, alleles_2) %>%
tally() %>%
arrange(-n)
## # Source: spark<?> [?? x 3]
## # Groups: alleles_1
## # Ordered by: -n
## alleles_1 alleles_2 n
## <chr> <chr> <dbl>
## 1 C T 2436
## 2 G A 2387
## 3 A G 1944
## 4 T C 1879
## 5 C A 496
## 6 G T 480
## 7 T G 468
## 8 A C 454
## 9 C G 150
## 10 G C 112
## # … with more rows
It’s nice to see that we can actually uncover something biological from this small dataset: we see that these frequencies come in pairs. C/T and G/A are actually the same mutation, just viewed from from opposite strands. Likewise, T/A and A/T are the same mutation on opposite strands. There’s a 30x difference between the frequency of C/T and A/T SNPs. Why?
The last example, what about genotypes? Hail can query the collection of all genotypes in the dataset, and this is getting large even for our tiny dataset. Our 284 samples and 10,000 variants produce 10 million unique genotypes. Let’s plot this using the ggplot2
and dbplot
package.
library(dbplot)
library(ggplot2)
library(sparklyr.nested) # to access DP nested in info column
df %>%
sdf_select(DP = info.DP) %>%
dbplot_histogram(DP) +
labs(title = "Histogram for DP", y = "Frequency")
Cleanup
Then disconenct from Spark and Hail,
spark_disconnect(sc)