Tidy Verbs for Dealing with Genomic Data Frames.
tidygenomics
Tidy Verbs for Dealing with Genomic Data Frames
Description
Handle genomic data within data frames just as you would with GRanges
. This packages provides method to deal with genomics intervals the "tidy-way" which makes it simpler to integrate in the the general data munging process. The API is inspired by the popular bedtools and the genome_join() method from the fuzzyjoin package.
Installation
install.packages("tidygenomics")
# Or to get the latest development version
devtools::install_github("const-ae/tidygenomics")
Documentation
genome_intersect
Joins 2 data frames based on their genomic overlap. Unlike the genome_join
function it updates the boundaries to reflect the overlap of the regions.
x1 <- data.frame(id = 1:4,
chromosome = c("chr1", "chr1", "chr2", "chr2"),
start = c(100, 200, 300, 400),
end = c(150, 250, 350, 450))
x2 <- data.frame(id = 1:4,
chromosome = c("chr1", "chr2", "chr2", "chr1"),
start = c(140, 210, 400, 300),
end = c(160, 240, 415, 320))
genome_intersect(x1, x2, by=c("chromosome", "start", "end"), mode="both")
id.x | chromosome | id.y | start | end |
---|---|---|---|---|
1 | chr1 | 1 | 140 | 150 |
4 | chr2 | 3 | 400 | 415 |
genome_subtract
Subtracts one data frame from the other. This can be used to split the x data frame into smaller areas.
x1 <- data.frame(id = 1:4,
chromosome = c("chr1", "chr1", "chr2", "chr1"),
start = c(100, 200, 300, 400),
end = c(150, 250, 350, 450))
x2 <- data.frame(id = 1:4,
chromosome = c("chr1", "chr2", "chr1", "chr1"),
start = c(120, 210, 300, 400),
end = c(125, 240, 320, 415))
genome_subtract(x1, x2, by=c("chromosome", "start", "end"))
id | chromosome | start | end |
---|---|---|---|
1 | chr1 | 100 | 119 |
1 | chr1 | 126 | 150 |
2 | chr1 | 200 | 250 |
3 | chr2 | 300 | 350 |
4 | chr1 | 416 | 450 |
genome_join_closest
Joins 2 data frames based on their genomic location. If no exact overlap is found the next closest interval is used.
x1 <- data_frame(id = 1:4,
chr = c("chr1", "chr1", "chr2", "chr3"),
start = c(100, 200, 300, 400),
end = c(150, 250, 350, 450))
x2 <- data_frame(id = 1:4,
chr = c("chr1", "chr1", "chr1", "chr2"),
start = c(220, 210, 300, 400),
end = c(225, 240, 320, 415))
genome_join_closest(x1, x2, by=c("chr", "start", "end"), distance_column_name="distance", mode="left")
id.x | chr.x | start.x | end.x | id.y | chr.y | start.y | end.y | distance |
---|---|---|---|---|---|---|---|---|
1 | chr1 | 100 | 150 | 2 | chr1 | 210 | 240 | 59 |
2 | chr1 | 200 | 250 | 1 | chr1 | 220 | 225 | 0 |
2 | chr1 | 200 | 250 | 2 | chr1 | 210 | 240 | 0 |
3 | chr2 | 300 | 350 | 4 | chr2 | 400 | 415 | 49 |
4 | chr3 | 400 | 450 | NA | NA | NA | NA | NA |
genome_cluster
Add a new column with the cluster if 2 intervals are overlapping or are within the max_distance
.
x1 <- data.frame(id = 1:4, bla=letters[1:4],
chromosome = c("chr1", "chr1", "chr2", "chr1"),
start = c(100, 120, 300, 260),
end = c(150, 250, 350, 450))
genome_cluster(x1, by=c("chromosome", "start", "end"))
id | bla | chromosome | start | end | cluster_id |
---|---|---|---|---|---|
1 | a | chr1 | 100 | 150 | 0 |
2 | b | chr1 | 120 | 250 | 0 |
3 | c | chr2 | 300 | 350 | 2 |
4 | d | chr1 | 260 | 450 | 1 |
genome_cluster(x1, by=c("chromosome", "start", "end"), max_distance=10)
id | bla | chromosome | start | end | cluster_id |
---|---|---|---|---|---|
1 | a | chr1 | 100 | 150 | 0 |
2 | b | chr1 | 120 | 250 | 0 |
3 | c | chr2 | 300 | 350 | 1 |
4 | d | chr1 | 260 | 450 | 0 |
genome_complement
Calculates the complement of a genomic region.
x1 <- data.frame(id = 1:4,
chromosome = c("chr1", "chr1", "chr2", "chr1"),
start = c(100, 200, 300, 400),
end = c(150, 250, 350, 450))
genome_complement(x1, by=c("chromosome", "start", "end"))
chromosome | start | end |
---|---|---|
chr1 | 1 | 99 |
chr1 | 151 | 199 |
chr1 | 251 | 399 |
chr2 | 1 | 299 |
genome_join
Classical join function based on the overlap of the interval. Implemented and maintained in the fuzzyjoin package and documented here only for completeness.
x1 <- data_frame(id = 1:4,
chr = c("chr1", "chr1", "chr2", "chr3"),
start = c(100, 200, 300, 400),
end = c(150, 250, 350, 450))
x2 <- data_frame(id = 1:4,
chr = c("chr1", "chr1", "chr1", "chr2"),
start = c(220, 210, 300, 400),
end = c(225, 240, 320, 415))
fuzzyjoin::genome_join(x1, x2, by=c("chr", "start", "end"), mode="inner")
id.x | chr.x | start.x | end.x | id.y | chr.y | start.y | end.y |
---|---|---|---|---|---|---|---|
2 | chr1 | 200 | 250 | 1 | chr1 | 220 | 225 |
2 | chr1 | 200 | 250 | 2 | chr1 | 210 | 240 |
fuzzyjoin::genome_join(x1, x2, by=c("chr", "start", "end"), mode="left")
id.x | chr.x | start.x | end.x | id.y | chr.y | start.y | end.y |
---|---|---|---|---|---|---|---|
1 | chr1 | 100 | 150 | NA | NA | NA | NA |
2 | chr1 | 200 | 250 | 1 | chr1 | 220 | 225 |
2 | chr1 | 200 | 250 | 2 | chr1 | 210 | 240 |
3 | chr2 | 300 | 350 | NA | NA | NA | NA |
4 | chr3 | 400 | 450 | NA | NA | NA | NA |
fuzzyjoin::genome_join(x1, x2, by=c("chr", "start", "end"), mode="anti")
id | chr | start | end |
---|---|---|---|
1 | chr1 | 100 | 150 |
3 | chr2 | 300 | 350 |
4 | chr3 | 400 | 450 |
Inspiration
If you have any additional questions or encounter issues please raise them on the github page.