A Unified Data Layer for Single-Cell, Spatial and Bulk Immunomics.
immundata-rlang
Installation
Quick start
library(immundata)
library(duckplyr)

# Paths to the example metadata and repertoire files shipped with the package
md_path <- system.file("extdata", "metadata_samples.tsv", package = "immundata")
samples <- c(
  system.file("extdata", "sample_0_1k.tsv", package = "immundata"),
  system.file("extdata", "sample_1k_2k.tsv", package = "immundata")
)

# Load the metadata first, then the repertoire files together with it
md <- load_metadata(md_path)
imdata <- load_repertoires(samples, c("cdr3_aa", "v_call"), md)
Input / output
Supported formats
parquet etc.
Read one or multiple AIRR files into immundata
Suppose you have several files. There are three ways to read them:
1. Pass a single file name
2. Pass a vector of file names
3. Pass a glob of files
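The three kinds of input listed above can be sketched in base R as follows. This is only an illustration: the temporary files and their contents are made up, and `Sys.glob()` is a base-R way to expand a pattern yourself rather than a statement about how immundata handles globs internally. The final call is left commented out and simply mirrors the Quick start.

```r
# Create two tiny TSV files in a temporary folder (hypothetical demo data).
tmp_dir <- tempfile("airr_demo_")
dir.create(tmp_dir)
for (name in c("sample_1.tsv", "sample_2.tsv")) {
  write.table(
    data.frame(cdr3_aa = "CASSL", v_call = "TRBV5"),
    file.path(tmp_dir, name),
    sep = "\t", quote = FALSE, row.names = FALSE
  )
}

# 1. A single file name
one_file <- file.path(tmp_dir, "sample_1.tsv")

# 2. A vector of file names
many_files <- file.path(tmp_dir, c("sample_1.tsv", "sample_2.tsv"))

# 3. A glob pattern, expanded here with base R to see which files it matches
glob_pattern <- file.path(tmp_dir, "*.tsv")
globbed_files <- Sys.glob(glob_pattern)
print(basename(globbed_files))

# Any of these would take the place of `samples` in the Quick start, e.g.:
# imdata <- load_repertoires(globbed_files, c("cdr3_aa", "v_call"), md)
```

Expanding the glob yourself before loading is a cheap way to confirm that the pattern matches exactly the files you expect.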
Working with the repertoire metadata file
immundata keeps data loading modular instead of hiding everything inside one big function. Hence, immundata splits repertoire dataset loading into three steps:
1. Optionally, load the metadata via load_metadata.
2. Load the repertoire files from the disk via load_repertoires and convert them into immundata files.
3. Load the ImmunData files from the converted files via load_immundata as the final step of load_repertoires.
After converting the files to the immundata format, you can load them directly with load_immundata.
Re-aggregating repertoires using receptor and repertoire schemas
This is the key concept that distinguishes immundata from DataFrame-based libraries.
- people analyse specific receptors
- data lineage is crucial for full reproducibility
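To make the idea of receptor and repertoire schemas more concrete, here is a conceptual sketch in plain dplyr. It is not how immundata implements re-aggregation internally; the toy table, the `receptor_id` column, and the choice of `tissue` as the repertoire schema are all assumptions made for illustration.

```r
library(dplyr)

# Toy annotation table: one row per sequenced chain (hypothetical data).
chains <- tibble(
  sample  = c("S1", "S1", "S1", "S2", "S2"),
  tissue  = c("blood", "blood", "blood", "tumor", "tumor"),
  cdr3_aa = c("CASSL", "CASSL", "CASSQ", "CASSL", "CASSF"),
  v_call  = c("TRBV5", "TRBV5", "TRBV7", "TRBV5", "TRBV9")
)

# A receptor schema says which columns define "the same receptor";
# a repertoire schema says which columns define "the same repertoire".
receptor_schema   <- c("cdr3_aa", "v_call")
repertoire_schema <- "tissue"

# Step 1: assign an identifier to every unique receptor.
receptors <- chains |>
  distinct(across(all_of(receptor_schema))) |>
  mutate(receptor_id = row_number())

# Step 2: count how many chains of each receptor fall into each repertoire.
counts <- chains |>
  inner_join(receptors, by = receptor_schema) |>
  count(across(all_of(repertoire_schema)), receptor_id, name = "n_chains")

print(receptors)
print(counts)
```

Changing `repertoire_schema` to `"sample"` re-aggregates the same chains into a different set of repertoires without touching the input table, which is the kind of flexibility this section is about.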
Modalities of the data source
Bulk -- RepSeq, AIRRSeq
Single-cell -- scRNAseq, scVDJseq, scTCRseq, scBCRseq
- load annotation data
- do something
- write the annotation data back
- visualize AIRR with annotation data
- visualize SC with annotation data
Paired-chain -- scVDJseq or other technologies
???
Spatial -- spatial transcriptomics and cell coordinates
- load annotation data
- do something
- write the annotation data back
- visualize AIRR with annotation data
- visualize SC with annotation data
Immunogenicity -- annotations from external tools
...
Hybrid datasets
Multi-locus data
...
Multiple contigs for TCR
...
BCR-heavy chains with multiple light chains
...
Bulk and single-cell data integration
...
Preprocessing strategies
- filtering non-productive receptors
- double contigs
- double BCR chains
- locus
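Two of the strategies above, dropping non-productive sequences and keeping a single locus, can be sketched with plain dplyr. The sketch assumes AIRR-style `productive` (logical) and `locus` columns and uses a made-up table; immundata's own preprocessing may work differently.

```r
library(dplyr)

# Toy AIRR-like annotation table (hypothetical data).
annotations <- tibble(
  sequence_id = c("s1", "s2", "s3", "s4"),
  locus       = c("TRB", "TRB", "TRA", "IGH"),
  productive  = c(TRUE, FALSE, TRUE, TRUE),
  cdr3_aa     = c("CASSL", "CASS*", "CAVRD", "CARDY")
)

# Keep only productive sequences from the TRB locus.
productive_trb <- annotations |>
  filter(productive, locus == "TRB")

# Keep the non-productive sequences aside so they can still be explored later.
non_productive <- annotations |>
  filter(!productive)

print(productive_trb)
print(non_productive)
```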
Data manipulation
Filtering
Analyse the data
Immunarch
Advanced topics
Integrate into your package
Take a look at immunarch.
Change RAM limits to accelerate the backend computations
...
Caching strategies
...
About
Citation
License
Author and contributors
Commercial usage
immundata is free for commercial use. However, corporate users will not get prioritized support for immundata-related issues, immune repertoire analysis questions, or data engineering questions related to building scalable immune repertoire and other -omics pipelines. As an open-source tool, immundata prioritizes open-source science.
If you are looking for prioritized support or help with setting up your data pipelines, consider contacting Vadim Nazarov about commercial consulting and support options.
FAQ
Q: Why are all the function names and ImmunData fields so long? I want to write imdata$rec instead of imdata$receptors.
A: Two major reasons: better code readability and an incentive to use autocomplete tools.
Q: How does immundata work under the hood, in simpler terms?
A: immundata uses the fantastic duckplyr package.
References:
Q: Why do you need to create Parquet files with receptors and annotations?
A: First of all, you can turn it off. Second, those are intermediate files optimized for future data operations, and working with them significantly accelerates immundata. Take a look at our benchmark page to learn more: link
Q: Why does immundata support only the AIRR standard?!
A: Because standards matter. Still, immundata allows some flexibility: you can provide your own column names for barcodes, etc.
Q: Why is it so complex? Why do we need to use dplyr instead of plain R?
A: The short answer is:
- faster computations,
- code that is easy for other humans to maintain and support,
- and better data skills.
For the long answer, let me give you more details on each of the bullet points.
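As a small illustration of the maintainability point, here is the same task written in plain R and in dplyr: keep receptors with a CDR3 of at least five amino acids and count them per V gene. The table is made up for the example, and the sketch says nothing about speed; it only compares how the two versions read.

```r
library(dplyr)

# Toy receptor table (hypothetical data).
receptors <- data.frame(
  v_call  = c("TRBV5", "TRBV5", "TRBV7", "TRBV9", "TRBV7"),
  cdr3_aa = c("CASSL", "CASSQ", "CASSF", "CAS", "CASSLG"),
  stringsAsFactors = FALSE
)

# Plain R: subset, tabulate, convert back to a data frame, then sort.
long_cdr3 <- receptors[nchar(receptors$cdr3_aa) >= 5, ]
base_counts <- as.data.frame(
  table(v_call = long_cdr3$v_call),
  responseName = "n",
  stringsAsFactors = FALSE
)
base_counts <- base_counts[order(-base_counts$n, base_counts$v_call), ]

# dplyr: the same steps, read top to bottom as a pipeline.
dplyr_counts <- receptors |>
  filter(nchar(cdr3_aa) >= 5) |>
  count(v_call, sort = TRUE)

print(base_counts)
print(dplyr_counts)
```

Both versions produce the same counts; the dplyr pipeline names each step with a verb, which tends to be easier for a new reader to follow.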
Q: How can I use all operations from dplyr? duckplyr doesn't support some operations that I need.
A: Let's consider several use cases.
Case 0. You are missing group_by from dplyr.
Case 1. Your data can fit into RAM.
Case 2. Your data won't fit into RAM, and you really need to work on all of this data.
Case 3. Your data won't fit into RAM, but before running intensive computations, you are open to working with a smaller dataset first.
Q: You filter out non-productive receptors. How do I explore them?
A: option for saving non-productive chains to a separate file
Q: Why does immundata have its own column names for receptors and repertoires? Could you just use the AIRR format - repertoire_id etc.?
A: The power of immundata lies in the fast re-aggregation of the data, which lets you work with whatever you define as a repertoire on the fly via ImmunData$build_repertoires(schema = ...).
Q: What do I do with the following error: "Error in compute_parquet() at [...]: ! {"exception_type":"IO","exception_message":"Failed to write [...]: Failed to read file [...]: schema mismatch in glob: column [...] was read from the original file [...], but could not be found in file [...] If you are trying to read files with different schemas, try setting union_by_name=True"}"?
A: It means that your repertoire files have different schemas, i.e., different column names. You have two options.
Option 1: Check the data and fix the schema. Find out why the files have different schemas, remove the wrong files or rename the mismatched columns, and try again.
Option 2: If you know what you are doing, pass the argument enforce_schema = FALSE to load_repertoires. The resulting table will have NAs in place of the missing values. But don't use it without considering the first option. A broken schema usually means that there are issues in how the data were processed.
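For Option 1, one way to find the offending files is to compare their header lines before loading anything. The following is a base-R sketch, not part of immundata; the temporary files and the column names (`v_call` versus `v_gene`) are made up for the example.

```r
# Write two small TSV files whose columns differ (hypothetical demo data).
tmp_dir <- tempfile("schema_check_")
dir.create(tmp_dir)
write.table(
  data.frame(cdr3_aa = "CASSL", v_call = "TRBV5"),
  file.path(tmp_dir, "sample_a.tsv"),
  sep = "\t", quote = FALSE, row.names = FALSE
)
write.table(
  data.frame(cdr3_aa = "CASSQ", v_gene = "TRBV7"),
  file.path(tmp_dir, "sample_b.tsv"),
  sep = "\t", quote = FALSE, row.names = FALSE
)

files <- Sys.glob(file.path(tmp_dir, "*.tsv"))

# Read only the first line of each file and split it into column names.
headers <- lapply(files, function(f) strsplit(readLines(f, n = 1), "\t")[[1]])
names(headers) <- basename(files)

# Columns shared by every file, and the columns each file has beyond those.
common_cols <- Reduce(intersect, headers)
extra_cols  <- lapply(headers, setdiff, y = common_cols)

print(common_cols)
print(extra_cols)
```

Files with a non-empty entry in `extra_cols` are the ones to inspect first: a renamed or missing column there is the likely cause of the schema mismatch.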