A Unified Data Layer for Single-Cell, Spatial and Bulk Immunomics.
immundata-rlang
Installation
Quick start
library(immundata)
library(duckplyr)
md_path <- system.file("extdata", "metadata_samples.tsv", package = "immundata")
samples <- c(
system.file("extdata", "sample_0_1k.tsv", package = "immundata"),
system.file("extdata", "sample_1k_2k.tsv", package = "immundata")
)
md <- load_metadata(md_path)
imdata <- load_repertoires(samples, c("cdr3_aa", "v_call"), md)
Input / output
Supported formats
parquet etc.
Read one or multiple AIRR files into immundata
Suppose you have several files. How to read them?
1. Pass a singular file name
2. Pass a vector of file names
3. Pass a glob of files
Working with the repertoire metadata file
immundata
modularizes different parts to make sure ??? (modularity / one big function is bad). Henceforth, immundata
splits the repertoire dataset loading into three steps:
Optionally, load the metadata via
load_metadata
Load the repertoire files from the disk via
load_repertoires
and convert them intoimmundata
files.Load the ImmunData files from the converted files via
load_immundata
as the final step ofload_repertoires
.
After converting the files to the immundata
format, you can load them directly with load_immundata
.
Re-aggregating repertoires using receptor and repertoire schemas
This is the key concept that distinguished immundata
from DataFrame-based libraries.
- people analyse a specific receptors
- data lineage is crucial for full reproducibility
Modalities of the data source
Bulk -- RepSeq, AIRRSeq
Single-cell -- scRNAseq, scVDJseq, scTCRseq, scBCRseq
- load annotation data
- do something
- write the annotation data back
- visualize AIRR with annotations data
- visualize SC with annotation data
Paired-chain -- scVDJseq or other technologies
???
Spatial -- spatial transcriptomics and cell coordinates
- load annotation data
- do something
- write the annotation data back
- visualize AIRR with annotations data
- visualize SC with annotation data
Immunogenicity -- annotations from external tools
...
Hybrid datasets
Multi-locus data
...
Multiple contigs for TCR
...
BCR-heavy chains with multiple light chains
...
Bulk and single-cell data integration
...
Preprocessing strategies
- filtering non productive
- double contigs
- double BCR chains
- locus
Data manipulation
Filtering
Analyse the data
Immunarch
Advanced topics
Integrate into your package
Take a look at immunarch
.
Change RAM limits to accelerate the backend computations
...
Caching strategies
...
About
Citation
License
Author and contributors
Commercial usage
immundata
is free to use for commercial usage. However, corporate users will not get a prioritized support for immundata
-related issues, immune repertoire analysis questions or data engineering questions, related to building scalable immune repertoire and other -omics pipelines. The priority of open-source tool immundata
is open-source science.
If you are looking for prioritized support and setting up your data pipelines, consider contacting Vadim Nazarov for commercial consulting and support options.
FAQ
Q: Why all the function names or ImmunData fields are so long? I want to write
imdata$rec
instead ofimdata$receptors
.A: Two major reasons - improving the code readability and motivation to leverage the autocomplete tools.
Q: How does
immundata
works under the hood, in simpler terms?A:
immundata
uses the fantasticduckplyr
packageReferences:
Q: Why do you need to create Parquet files with receptors and annotations?
A: First of all, you can turn it off. Second, those are intermediate files, optimized for future data operations, and working with them significantly accelerates
immundata
. Take a look at our benchmark page to learn more:link
Q: Why does
immundata
support only the AIRR standard?!A: Because standards, but
immundata
allows some level of optionality - you can provide column names for barcodes, etc.Q: Why is it so complex? Why do we need to use
dplyr
instead of plain R?A: The short answer is:
- faster computations,
- code, that is easy to maintain and support by other humans,
- and better data skills.
For the long answer, let me give you more details on each of the bullepoint.
Q: How do I get to use all operations from
dplyr
?duckplyr
doesn't support some operations, which I need.A: Let's consider several use cases.
Case 0. You are missing
group_by
fromdplyr
.Case 1. Your data can fit into RAM.
Case 2. Your data won't fit into RAM, and you really need to work on all of this data.
Case 3. Your data won't fit into RAM, but before running intensive computations, you are open to working with smaller dataset first.
Q: You filter out non-productive receptors. How do I explore them?
A: option for saving non-productive chains to a separate file
Q: Why does
immundata
have its own column names for receptors and repertoires? Could you just use the AIRR format - repertoire_id etc.?A: The power of
immundata
lies in the fast re-aggregation of the data, that allows to work with whatever you define as a repertoire on the fly viaImmunData$build_repertoires(schema = ...)
Q: What do I do with following error: "Error in
compute_parquet()
at [...]: ! {"exception_type":"IO","exception_message":"Failed to write [...]: Failed to read file [...]: schema mismatch in glob: column [...] was read from the original file [...], but could not be found in file [...] If you are trying to read files with different schemas, try setting union_by_name=True"}?*A: It means that your repertoire files have different schemas, i.e., different column names. You have two options.
Option 1: Check the data and fix the schema. Explore the reason why the data have different schemas. Remove wrong files. Change column names. And try again.
Option 2: If you know what you are doing, pass argument
enforce_schema = FALSE
toload_repertoires
. The resultant table will have NAs in the place of missing values. But don't use it without considering the first option. Broken schema usually means that there are some issues in the how the data were processed.