r-net4pg

Description

Handle Ambiguity of Protein Identifications from Shotgun Proteomics.

In shotgun proteomics, shared peptides (i.e., peptides that might originate from different proteins sharing homology, from different proteoforms due to alternative mRNA splicing, post-translational modifications, proteolytic cleavages, and/or allelic variants) represent a major source of ambiguity in protein identifications. The 'net4pg' package helps assess and handle this ambiguity. It implements methods for two main applications.

First, it represents and quantifies ambiguity of protein identifications by means of graph connected components (CCs). In graph theory, CCs are the maximal subgraphs in which any two vertices are connected by a path and no vertex is connected to any vertex outside the subgraph. Here, proteins sharing one or more peptides are gathered in the same CC (multi-protein CC), while unambiguous protein identifications constitute CCs with a single protein vertex (single-protein CCs). Therefore, the proportion of single-protein CCs and the size of multi-protein CCs can be used to measure the level of ambiguity of protein identifications. The package implements a strategy to efficiently calculate graph connected components on large datasets and lets users inspect them visually.

Second, the 'net4pg' package exploits the increasing availability of matched transcriptomic and proteomic datasets to reduce ambiguity of protein identifications. More precisely, it implements a transcriptome-based filtering strategy that removes those proteins whose corresponding transcript is not expressed in the sample-matched transcriptome. The underlying assumption is that, according to the central dogma of biology, there can be no protein without the corresponding transcript. Most importantly, the package makes it possible to visually inspect the effect of the filtering on protein identifications and to quantify ambiguity before and after filtering by means of graph connected components. As such, it constitutes a reproducible and transparent method to exploit transcriptome information to enhance protein identifications. All methods implemented in the 'net4pg' package are fully described in Fancello and Burger (2022) <doi:10.1186/s13059-022-02701-2>.

README.md

cran.r-project.org

Handle Ambiguity of Protein Identifications from Shotgun Proteomics

Analyze ambiguous protein identifications using graph connected components (CCs)

Protein inference is a central issue in proteomics because of shared peptides (*i.e.*, peptides that might originate from different proteins sharing homology, from different proteoforms due to alternative mRNA splicing, post-translational modifications, proteolytic cleavages, and/or allelic variants). Indeed, in bottom-up mass spectrometry-based proteomics, the most widely used proteomic approach, peptide-protein connectivity is lost for experimental reasons and protein identifications must be inferred from peptide identifications. Shared peptides can generate quite complex peptide-to-protein mapping structures, but these can be efficiently represented using bipartite graphs, with peptides and proteins as vertices and with edges representing peptide-to-protein membership. Graph connected components (CCs) (*i.e.*, the maximal subgraphs in which any two vertices are connected by a path and no vertex is connected to any vertex outside the subgraph) can be used as a measure of the level of ambiguity in protein identifications. Proteins sharing one or more peptides are gathered in the same CC (multi-protein CCs), while unambiguous protein identifications are represented by CCs with a single protein vertex (single-protein CCs). CCs represent a peptide-centric strategy to group proteins, independent of the variety of protein-centric strategies for protein grouping and protein inference. As such, this strategy does not require protein inference and is widely applicable, reproducible and transparent.

The CCs4prot package builds a graph from shotgun proteomic identifications and calculates its connected components.
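To make the idea of connected components as an ambiguity measure concrete, here is a minimal base R sketch on a toy peptide-to-protein mapping. It does not use the package's own functions: the identifiers, the `connected_components` helper and the label-merging approach are all assumptions made for illustration, and the loop is only suitable for small examples, not for the large datasets the package targets.

```r
# Toy peptide-to-protein mapping (hypothetical identifiers).
# Each row is one edge of the bipartite peptide-protein graph.
edges <- data.frame(
  peptide = c("pep1", "pep2", "pep2", "pep3", "pep4", "pep5"),
  protein = c("protA", "protA", "protB", "protB", "protC", "protD"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of any package): label connected components.
# Every vertex starts with its own label; the two ends of each edge are then
# repeatedly given the smaller of their labels until nothing changes.
# Assumes peptide and protein identifiers never collide.
connected_components <- function(edges) {
  vertices <- unique(c(edges$peptide, edges$protein))
  label <- setNames(seq_along(vertices), vertices)
  changed <- TRUE
  while (changed) {
    changed <- FALSE
    for (i in seq_len(nrow(edges))) {
      pep <- edges$peptide[i]
      prot <- edges$protein[i]
      lowest <- min(label[[pep]], label[[prot]])
      if (label[[pep]] != lowest || label[[prot]] != lowest) {
        label[[pep]] <- lowest
        label[[prot]] <- lowest
        changed <- TRUE
      }
    }
  }
  label
}

label <- connected_components(edges)

# Number of protein vertices in each connected component
proteins_per_cc <- table(label[unique(edges$protein)])

# Ambiguity summary: share of single-protein CCs and size of multi-protein CCs
n_single <- sum(proteins_per_cc == 1)
prop_single <- n_single / length(proteins_per_cc)
multi_sizes <- proteins_per_cc[proteins_per_cc > 1]
```

In this toy mapping, `protA` and `protB` share `pep2` and therefore fall into one multi-protein CC, while `protC` and `protD` each form a single-protein CC.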

Reduce ambiguity of protein identifications by transcriptome-informed filtering

The availability of an increasing number of sample-matched proteomic and transcriptomic datasets can be exploited to reduce ambiguity of protein identifications. Indeed, according to the central dogma of biology, there can be no protein without the corresponding transcript. Following this, protein identifications for which the corresponding transcript is identified in the sample-matched transcriptome are more likely to be correct than those with no expressed transcript.

The CCs4prot package implements a transcriptome-informed filtering strategy to reduce ambiguity of protein identifications. It measures the impact of the filtering on ambiguity by assessing the proportion and size of multi-protein CCs and by visual inspection of peptide-to-protein mappings for ambiguous identifications.
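The filtering principle can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the package's implementation: the identifiers, the `protein_to_transcript` lookup, the `expressed_transcripts` vector and the `count_shared` helper are hypothetical, and the sketch simply drops every protein without an expressed transcript, without addressing how peptides left with no protein should be handled.

```r
# Toy peptide-to-protein mapping (hypothetical identifiers)
edges <- data.frame(
  peptide = c("pep1", "pep2", "pep2", "pep3", "pep4"),
  protein = c("protA", "protA", "protB", "protB", "protC"),
  stringsAsFactors = FALSE
)

# Hypothetical protein-to-transcript lookup and the transcripts found
# expressed in the sample-matched transcriptome
protein_to_transcript <- c(protA = "txA", protB = "txB", protC = "txC")
expressed_transcripts <- c("txA", "txC")

# Keep only edges whose protein has an expressed transcript
keep <- protein_to_transcript[edges$protein] %in% expressed_transcripts
filtered_edges <- edges[keep, , drop = FALSE]

# Illustrative ambiguity count: peptides that map to more than one protein
count_shared <- function(e) sum(table(e$peptide) > 1)

shared_before <- count_shared(edges)
shared_after <- count_shared(filtered_edges)
```

Here `pep2` is shared between `protA` and `protB` before filtering. Because `txB` is not expressed, `protB` is removed and `pep2` becomes specific to `protA`.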

Install the CCs4prot R package

Download the package with the git clone command:

git clone https://github.com/laurafancello/CCs4prot.git

Start R and install the package using devtools (devtools must be installed as well):

library("devtools")
devtools::install("CCs4prot")

Usage

To learn how to use CCs4prot, please refer to the introductory vignette posted at this link:

https://github.com/laurafancello/CCs4prot/blob/main/vignettes/IntroToCCs4prot.Rmd

License

Distributed under the GPL-3 License.

Contact

Laura Fancello - [email protected]
Thomas Burger - [email protected]
