PINOT: an intuitive resource for integrating protein-protein interactions

Background: The past decade has seen the rise of omics data for the understanding of biological systems in health and disease. This wealth of information includes protein-protein interaction (PPI) data derived from both low- and high-throughput assays, which are curated into multiple databases that capture the extent of available information from the peer-reviewed literature. Although these curation efforts are extremely useful, reliably downloading and integrating PPI data from the variety of available repositories is challenging and time consuming.
Methods: Here we present a novel user-friendly web-resource called PINOT (Protein Interaction Network Online Tool; available at http://www.reading.ac.uk/bioinf/PINOT/PINOT_form.html) to optimise the collection and processing of PPI data from IMEx consortium associated repositories (members and observers) and WormBase, for constructing, respectively, human and Caenorhabditis elegans PPI networks.
Results: Users submit a query containing a list of proteins of interest for which PINOT extracts data describing PPIs. At every query submission PPI data are downloaded, merged and quality assessed. Each PPI is then confidence scored based on the number of distinct methods used for interaction detection and the number of publications that report the specific interaction. Examples of how PINOT can be applied are provided to highlight the performance, ease of use and potential utility of this tool.
Conclusions: PINOT is a tool that allows users to survey the curated literature, extracting PPI data in relation to a list of proteins of interest. PINOT extracts a similar number of PPIs to other analogous tools and incorporates a set of innovative features. PINOT is able to process large queries, it downloads human PPIs live through PSICQUIC and it applies quality control filters on the downloaded PPI data (i.e. removing the need for manual inspection by the user). PINOT provides the user with information on detection methods and publication history for each downloaded interaction data entry and outputs the results in a table format that can be straightforwardly further customised and/or directly uploaded into network visualization software.


Keywords: Protein interaction, Protein network, Network, Data mining, Protein database, Data integration

Background
During the past two decades the use of omics data to understand biological systems has become an increasingly valued approach [1]. This includes extensive efforts to detect protein-protein interactions (PPIs) on an almost proteome-wide scale [2,3]. The utility of such data has been greatly supported by primary database curation and the International Molecular Exchange (IMEx) Consortium, which promotes collaborative efforts in standardising and maintaining high quality data curation across the major molecular interaction data repositories [4]. The primary databases, such as IntAct [5] and BioGRID [6], are rich data resources providing a comprehensive record of published PPI literature. PPI data are critical to describe connections among proteins, which in turn supports both inference of new functions for proteins (based on the guilt by association principle [7]) and visualization of protein connectivity via shared interactors. This enables identification of potential communal pathways involving proteins of interest [8-10]. Additionally, literature extracted PPI data can support the prioritization of interactions from high-throughput experiments (which generate large lists of potential PPI hits), assisting the selection of candidates for further analysis/validation [11].
However, the process of collating PPI data is currently hampered by the fact that no single data source encompasses the full extent of PPIs reported in literature; hence, users are required to merge (partial) information mined from multiple different primary databases. Merging such data is not straightforward due to inconsistencies in data format and differences in data curation across the PPI databases (e.g. IMEx member vs nonmember databases).
To optimize the use of PPI data within the public domain, we developed a user-friendly tool that assists PPI data extraction and processing: the Protein Interaction Network Online Tool (PINOT). This tool represents the development (and automation) of our previous PPI analysis framework, termed Weighted Protein-Protein Interaction Network Analysis (WPPINA) [9,11-15]. Through PINOT, PPI data are downloaded directly (i.e. downloaded "live" at the time of the query) from seven databases using the Proteomics Standard Initiative Common Query Interface (PSICQUIC) and integrated to ensure a wide coverage of the PPIs available from these repositories [16]. These data are scored through a simple and transparent procedure based on 'method detection' and 'publication records', and the user can further apply customized confidence thresholds. PINOT is fully automated and available online as an open access resource. Output data are provided as a summary table (directly online or emailed to the user), which summarizes the most comprehensive current knowledge of the PPI landscape for the protein(s)-of-interest submitted in the query list. Additionally, the R scripts that underlie PINOT can be freely downloaded from the help-page.

Protein interaction network online tool (PINOT)
PINOT can be run automatically at http://www.reading.ac.uk/bioinf/PINOT/PINOT_form.html (hereafter referred to as "web tool"). A choice of parameters is integrated by default as explained further below and in Supplementary Materials S1. Alternatively, R scripts can be downloaded from the help-page and in this way PINOT can be used as a "standalone tool" whereby parameters can be modified as per user choice.
A list of proteins of interest (seeds) can be queried to identify their literature-reported interactors that have been curated into PPI databases (Fig. 1).
For Homo sapiens (taxonomy ID: 9606) the seed identifiers submitted into the query field must be in an approved HUGO Gene Nomenclature Committee (HGNC) gene symbol [17] or valid Swiss-Prot UniProt ID format. Upon query submission, PPI data are extracted directly (via API: Shannon, P. (2020) PSICQUIC R package, DOI: https://doi.org/10.18129/B9.bioc.PSICQUIC [18]) from seven primary databases, all of which directly annotate PPI data from peer-reviewed literature: bhf-ucl, BioGRID [6], InnateDB [19], IntAct [5], MBInfo (https://www.mechanobio.info), MINT [20] and UniProt [21]. The downloaded protein interaction data are then parsed, merged, filtered and scored (Fig. 2) automatically by PINOT. A detailed description of the PINOT pipeline can be found in the Supplementary Materials S1. The user can select to run PINOT with lenient or stringent filter parameters. The output of PINOT (Fig. 1c-e) consists of: i) a network file (final_network.txt), which is a tab-separated text file containing the processed PPI data in relation to the seeds in the initial query list; ii) a log file (final_network_log.txt) reporting proteins that have been discarded within the data processing procedure; and iii) a further log file (final_network_providers.txt) indicating the PPI databases used by the API for data acquisition. The output dataset is available for download and/or emailed to the user.
For Caenorhabditis elegans (taxonomy ID: 6239) the seed identifiers must be in an approved WormBase gene ID [22,23] format, "WBGene" followed by 8 numerical digits. Upon submission, PPI data are downloaded from an internal network stored within PINOT and created (following similar criteria applied for the human PPIs; details in Supplementary Materials S1) based on the WormBase PPI catalogue (Alliance_molecular_interactions.tar file downloaded from the Alliance of Genome Resources [version 2.1] on 15th April 2019) [22,24]. The user can apply stringent or lenient filtering options. The output of PINOT for a C. elegans query consists of: i) a network file (final_network.txt), which is a tab-separated text file containing the processed PPIs for the seeds in the initial query list; and ii) a log file (final_network_log.txt) reporting proteins that have been discarded within the data processing procedure.
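The WormBase identifier format stated above ("WBGene" followed by 8 digits) is easy to check before submitting a query. The following is a minimal Python sketch of such a pre-check; it is not part of PINOT, the function name `split_wormbase_seeds` is hypothetical, and the example identifiers are made up purely to illustrate the format.

```python
import re

# Format taken from the text: "WBGene" followed by exactly 8 numerical digits.
WBGENE_PATTERN = re.compile(r"WBGene[0-9]{8}")


def split_wormbase_seeds(seeds):
    """Split a seed list into well-formed and malformed WormBase gene IDs.

    Hypothetical helper: it only checks the identifier format, not whether
    the ID actually exists in WormBase.
    """
    valid, invalid = [], []
    for seed in seeds:
        cleaned = seed.strip()  # tolerate stray whitespace from pasted lists
        if WBGENE_PATTERN.fullmatch(cleaned):
            valid.append(cleaned)
        else:
            invalid.append(cleaned)
    return valid, invalid


# Made-up identifiers, used only to illustrate the format check.
ok, rejected = split_wormbase_seeds(["WBGene00000001", "WBGene123", "wbgene00000001"])
```

A check of this kind only catches formatting mistakes; seeds that are well formed but have no curated interactions would still be reported in the PINOT log file.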

Software
The PINOT pipeline is coded in R and runs on a Linux server at the University of Reading, with Java servlets processing user submissions via the web interface.

PINOT quality control
We have tested the PINOT pipeline using multiple input query lists structured as follows: i) small input lists (fewer than 20 seed proteins), selected randomly or in association with typical processes suspected to be functionally relevant for Parkinson's Disease (PD); and ii) a large input list of 941 proteins, the human mitochondrial proteome as reported by MitoCarta2.0 [25].
PINOT was compared to two alternative yet related online PPI query tools, Human Integrated Protein-Protein Interaction Reference (HIPPIE; for human data only) and Molecular Interaction Search Tool (MIST; for human and C. elegans data). For this analysis, query parameters were selected (where possible) to maximize the extraction of protein interactions: HIPPIE was used with confidence score = 0 and no filters on confidence level, interaction type or tissue expression; and MIST was used with no filter by rank parameter to download all PPIs regardless of the assigned confidence score. It is noteworthy to highlight that files from HIPPIE and MIST required manual parsing after download to remove entries that do not include a PMID and/or conversion method code (incomplete entries). Data were downloaded on 18th September 2019 (H. sapiens) and on 24th September 2019 (C. elegans).
Fig. 1 PINOT user interface. a. Screenshot of the PINOT webpage. b. Examples of the text file to be uploaded or list to be populated into the text box of query seeds (i.e. proteins for which protein interactors will be extracted from primary databases that manually curate the literature). c. Example result output file from PINOT, containing the extracted and processed PPI data (only the file header is reported as an example). d. Example of the discarded proteins log file from PINOT, a text file reporting all the seeds for which interactions are not returned to the user. e. Example of the network providers log file from PINOT containing a list of active databases that were utilised for downloading PPI data

Results
PINOT is a web tool that takes a list of proteins/genes (seeds) as input, queries PSICQUIC at submission and returns an up-to-date table containing a comprehensive list of PPIs (published in peer-reviewed literature) centred upon the seeds. This table consists of a variable number of rows and 11 columns (Fig. 1c and Fig. 3c). Each row represents a binary interaction between one of the seeds (interactor A) and one of its specific protein interactors (interactor B). The 11 columns contain: the gene name, the Swiss-Prot protein ID and the Entrez gene ID for interactor A and B ("NameA", "SwissA", "EntrezA", "NameB", "SwissB", "EntrezB"); the number and type of different methods through which the interaction has been identified ("Method.Score", "Method"); and the number of different publications reporting the interaction and the corresponding PubMed IDs ("Publication.Score", "PMIDS"). The final column ("Final.Score") contains a confidence score calculated as the number of different methods plus the number of different publications reporting the interaction. PPIs with a final score of 2 are reported in literature by 1 publication and detected by 1 technique; these PPIs should be considered with caution since the available data suggest they are not robustly replicated. They might be either: i) false positives, ii) true novel interactions that have not yet been replicated in additional studies, or iii) true interactions that have been replicated in additional studies but which have not yet been incorporated into any of the seven primary databases used for data acquisition. A final score > 2 suggests a degree of replication, arising from multiple publications reporting the PPI, multiple distinct techniques used to detect the interaction, or both.
It is not possible to obtain a final score < 2 since every PPI annotation, to be retained in PINOT, has to be supported by at least 1 interaction detection method and 1 PMID; if this condition is not met, the PPI is discarded by PINOT and not shown in the output file.
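The scoring rule described above can be illustrated with a short Python sketch. This is not the PINOT R code: the function name `score_interactions`, the tuple layout of the input and the placeholder protein names, methods and PMIDs are all assumptions made for illustration. Only the arithmetic (final score = number of distinct methods + number of distinct publications, with a minimum of 2) comes from the text.

```python
from collections import defaultdict


def score_interactions(entries):
    """Score seed-interactor pairs from (name_a, name_b, method, pmid) tuples.

    Illustrative only: entries lacking a method or a PMID are skipped,
    mirroring the rule that every retained PPI needs at least one of each.
    """
    methods = defaultdict(set)
    pmids = defaultdict(set)
    for name_a, name_b, method, pmid in entries:
        if not method or not pmid:
            continue  # incomplete annotation: not retained
        pair = (name_a, name_b)
        methods[pair].add(method)
        pmids[pair].add(pmid)

    scored = {}
    for pair in methods:
        method_score = len(methods[pair])      # distinct detection methods
        publication_score = len(pmids[pair])   # distinct publications
        scored[pair] = {
            "Method.Score": method_score,
            "Publication.Score": publication_score,
            "Final.Score": method_score + publication_score,
        }
    return scored


# Placeholder data (not real interactions or PMIDs).
example = [
    ("SEED1", "PROT_X", "Two Hybrid", "pmid-1"),
    ("SEED1", "PROT_X", "Coimmunoprecipitation", "pmid-2"),
    ("SEED1", "PROT_Y", "Two Hybrid", "pmid-3"),
]
scores = score_interactions(example)
```

In this placeholder example the SEED1/PROT_X pair has two methods and two publications (final score 4), whereas SEED1/PROT_Y has one of each (final score 2), which is the lowest score a retained interaction can have.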
The PINOT output can be imported into Cytoscape [26] for network visualization by selecting the "NameA" and "NameB" columns as source and target nodes, respectively.
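For users who prefer to inspect the output programmatically before loading it into Cytoscape, the table can also be read with a few lines of Python. The sketch below assumes that final_network.txt is tab-delimited with a header row carrying exactly the column names listed above; the function name `read_edges` and the in-memory sample row are hypothetical placeholders.

```python
import csv
import io


def read_edges(handle):
    """Return (source, target, final_score) triples from a PINOT-style table.

    Assumes a tab-delimited file whose header contains the "NameA",
    "NameB" and "Final.Score" columns described in the text.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["NameA"], row["NameB"], int(row["Final.Score"]))
        for row in reader
    ]


# In-memory stand-in for final_network.txt (placeholder values only).
header = ["NameA", "SwissA", "EntrezA", "NameB", "SwissB", "EntrezB",
          "Method.Score", "Method", "Publication.Score", "PMIDS", "Final.Score"]
row = ["SEED1", "idA", "1", "PROT_X", "idB", "2", "1", "2Hyb", "1", "pmid-1", "2"]
sample = "\t".join(header) + "\n" + "\t".join(row) + "\n"

edges = read_edges(io.StringIO(sample))
```

With a real download, the same function could be called on an open file handle, for example `open("final_network.txt", newline="")`.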

PINOT: example of application
In Fig. 3, PINOT has been used to download PPIs for a limited selection of human protein products of genes mutated in familial PD: ATP13A2, FBXO7, GBA, PINK1, SMPD1 and VPS35 (seeds). PINOT quickly retrieved a table containing 327 interactions from peer-reviewed curated literature (with associated PMIDs) thus supporting and simplifying otherwise time-consuming classical literature mining. The PINOT output was imported into Cytoscape and PPIs were visualized in a network ("NameA" = source and "NameB" = target), the seeds were highlighted in dark-red and the edges (interactions between each protein pair) were coded based on the "Final.Score" field, thus highlighting the confidence (number of methods + number of publications) of the interaction, which positively correlated with the thickness of the edge. Since we were interested in interactors that were common to the seeds (and not exclusive interactors of just one seed) the network was filtered retaining only the nodes (interactors) that bridged two or more seeds. The obtained core-network revealed that among the common interactors of the seeds (6 PD proteins) there were 2 proteins (SNCA and PRKN) which are products of 2 additional genes known to be mutated in familial PD. Thus, the analysis pointed towards the involvement of SNCA and PRKN in PD even if they were initially excluded from the list of seeds. Additionally, topological analysis (based on the number and thickness of the edges) suggested that the core network could be subdivided into 2 distinct clusters: PINK1, FBXO7 and the newly identified PRKN and SNCA in the first cluster, while ATP13A2, VPS35 and SMPD1 were more closely associated in the second cluster, with GBA a bridge seed between the 2 clusters. This observation suggested a dichotomy, based on the protein interactomes of the seeds included in the initial input list. Based on the guilt-by-association principle we hypothesised that the proteins contributing to these clusters could be associated with different cellular functions and components. We therefore performed functional enrichment analysis (based on Gene Ontology (GO) Cellular Component (CC) annotations) [27,28] using g:Profiler [29] revealing that indeed, clusters 1 and 2 are associated with mitochondria and vacuoles/lysosomes/endosomes, respectively.
Fig. 3 PINOT: An example application. A stepwise insight into the potential use of PINOT. a. A submission list is created as a text file using gene names as per HGNC approved symbols or Swiss-Prot UniProt IDs; the submission list can be uploaded as a file or pasted into the text box within the PINOT interface. b. PINOT downloads, from PSICQUIC, the human PPIs (in this example, stringent filter applied). c. PPIs are provided back to the user via email or from the webpage; results are in a parsable file that can be opened by a text reader application and/or imported into spreadsheet software such as Microsoft Excel. d. The interactions can be visualized in a network format by opening the PINOT output in network visualisation software, such as Cytoscape. Connections between nodes (edges) are coded with increased line width based on the final score that interaction was assigned by PINOT. The wider the edge, the more confident PINOT is about the interactions. e. The interactions can be further processed according to the user's research question, in this case, only interactors that are communal to at least 2 of the initial query proteins have been retained, generating a core network (in dark-red the initial seeds; in bright-green the identified common interactors that are proteins mutated in PD). Based on the network topology the seeds and their interactors can be visually clustered into group 1 (depicted in gold) and group 2 (depicted in blue). f. Specific functional enrichment (GO CC terms) for groups 1 and 2 after filtering out the less represented terms. Analyses performed on the 22nd August 2019 (g:Profiler)
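The core-network filter used in this example (retaining only interactors that bridge two or more seeds) can be sketched in Python as follows. This is an illustrative reimplementation, not the procedure the authors ran in Cytoscape: the function name `core_network`, the choice to ignore direct seed-seed edges and the placeholder node names are all assumptions.

```python
from collections import defaultdict


def core_network(edges, seeds, min_seeds=2):
    """Keep only interactors linked to at least `min_seeds` distinct seeds.

    edges: iterable of (name_a, name_b) pairs.
    Returns (shared_interactors, retained_edges). For simplicity, seeds are
    never counted as shared interactors and seed-seed edges are dropped
    (an assumption of this sketch).
    """
    seeds = set(seeds)
    edges = list(edges)

    # Map each node to the set of seeds it interacts with.
    linked_seeds = defaultdict(set)
    for a, b in edges:
        if a in seeds and a != b:
            linked_seeds[b].add(a)
        if b in seeds and a != b:
            linked_seeds[a].add(b)

    shared = {
        node for node, partners in linked_seeds.items()
        if node not in seeds and len(partners) >= min_seeds
    }
    retained = [
        (a, b) for a, b in edges
        if (a in seeds and b in shared) or (b in seeds and a in shared)
    ]
    return shared, retained


# Placeholder network: X bridges S1 and S2, Y is exclusive to S1.
shared, retained = core_network(
    [("S1", "X"), ("S2", "X"), ("S1", "Y")], seeds=["S1", "S2"]
)
```

In this placeholder network only X is retained as a common interactor; Y is removed because it is linked to a single seed.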

H. sapiens -PINOT performance
The performance of PINOT was compared to that of alternative resources for both small and large lists of seeds. Regarding the former, five different small seed lists were used as input for PPI query in HIPPIE [30] and MIST [31], two alternative online and freely available resources. It should be noted that, despite apparent similarities, each of these tools has been developed differently. All three resources (PINOT, HIPPIE and MIST) have distinguishing features for addressing different research questions (Table 1). The results of the queries to the different online platforms were compared by evaluating the total number of interactors provided in the output (Fig. 4a).
PINOT, HIPPIE and MIST retrieved a similar number of PPIs. PINOT, with stringent filtering applied, consistently extracted fewer interactions; this is an expected outcome since this filter option is built with the purpose of retaining only annotations that have survived stringent screening, largely based on completeness of curated data entries.
Results for the large input list query were compared between PINOT and HIPPIE. These two web tools allowed for easy processing of more than 900 seeds within the submission list. The number of retrieved interactors was slightly higher for HIPPIE in comparison with PINOT when the stringent quality control (QC) filter was applied; however, PINOT retrieved more interactions than HIPPIE when the lenient filter option was selected (Fig. 4b). Furthermore, the vast majority of downloaded interactions were shared between the two resources, suggesting that PINOT is able to confidently extract specific interactions from the curated PPI literature (Fig. 4c).

C. elegans -PINOT performance
The performance of PINOT for querying C. elegans PPI data was tested alongside the C. elegans query option in MIST, assessing interaction networks of different dimensions (Fig. 5). The data acquisition strategy underlying these two resources differs slightly: PINOT extracts data from the latest release of WormBase molecular interaction data, whereas MIST utilises data from numerous sources, including WormBase, BioGRID and IMEx associated repositories.
As in the human PPI comparisons, PINOT and MIST performed comparably in terms of the number of PPI data entries extracted. More specifically, and as previously described for human data, PINOT extracted slightly fewer entries across these test query cases. However, upon assessing the completeness of these extracted data entries, in terms of interaction detection method and/or PMID annotations, there was a striking difference in performance. Since the PINOT pipeline focusses on the QC of data, all data entries within the output dataset were complete, whereas incomplete data entries persisted in the MIST output dataset thus requiring manual inspection. In the more abundant PPI data pools, for example when querying the ATP and CED C. elegans proteins, incomplete data entries accounted for the majority of the output dataset in MIST.

Discussion
PINOT can be used as a tool to quickly and effectively survey the curated literature and download the most up-to-date PPI data available for a given set of proteins/genes of interest. This is particularly useful for anyone mining an overwhelmingly abundant literature on certain proteins/genes in order to identify reported PPIs. The PPI data downloaded through PINOT can be used as a literature reference list for experimental PPI data resulting from high-throughput experiments (protein microarrays, yeast two-hybrid screens, etc.), facilitating the prioritisation of experimental results for validation. PINOT is also useful to evaluate interactors of different proteins/genes of interest within an input seed list simultaneously. The analysis of the combined interactomes of such seeds can reveal the existence of communal interactors, can provide a base to cluster the seeds into groups and can support further functional analysis to better characterize the functional landscape of seeds of interest.
Alternative tools that appear to be similar to PINOT are HIPPIE and MIST. STRING [32] is a conceptually different tool; it does not report 'interaction detection methods' nor 'Publication IDs' for PPIs, which are crucial pieces of information for the evaluation and interpretation of PPI data. Additionally, the reported interactions are not focused only on the proteins in the input list; interactions of interactors are also reported. Distinguishing features of PINOT, HIPPIE and MIST include the implementation of a tailored confidence score for different methodological approaches (as well as a tissue filtering option) in HIPPIE; MIST provides a valuable resource for users interested in mapping PPIs across species (i.e. interologs); PINOT focusses on high quality PPI data output by implementing multiple QC steps to remove problematic or non-univocal annotations. The performance of PINOT was comparable to that of HIPPIE and MIST both in terms of number and identity of downloaded interactions. However, there are some unique features of PINOT that are not, at the moment, integrated within the other data resources. Human PPIs in PINOT are directly downloaded from PSICQUIC at every query submission. In contrast, PPIs in HIPPIE and MIST are recovered from a central built-in repository within the servers. This difference is clearly demonstrated by searching for interactors of LRRK2, where (at the time of analysis) one high-throughput publication was updated in PSICQUIC, while both HIPPIE and MIST did not contain this full annotation yet (Fig. 6).
PINOT has access to the most up-to-date interactions that could be retrieved at a given time from PSICQUIC (however, it has to be considered that each database is responsible for updating their PSICQUIC service and therefore discrepancies might exist with the central databases).
PINOT implements QC filtering, which involves discarding PPI data entries that are curated without a PMID or with multiple PMIDs and/or without an interaction detection method annotation. Therefore the output file from PINOT does not require any further QC by the user, while output data from HIPPIE and MIST require manual parsing and inspection before analysis to remove incomplete data entries through a time-consuming, post-hoc processing procedure.
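As a rough illustration of this QC rule, the Python sketch below keeps an entry only if it carries exactly one PMID and a non-empty interaction detection method. The dictionary layout (`"pmids"`, `"method"` keys), the function name `qc_filter` and the placeholder values are hypothetical; the actual PINOT filters are implemented in R and described in Supplementary Materials S1.

```python
def qc_filter(entries):
    """Split entries into (kept, discarded) following the stated QC rule.

    An entry is kept only if it has exactly one PMID and a non-empty
    detection method annotation. The entry layout is an assumption made
    for this sketch.
    """
    kept, discarded = [], []
    for entry in entries:
        pmids = [p for p in entry.get("pmids", []) if p]   # drop empty IDs
        method = (entry.get("method") or "").strip()
        if len(pmids) == 1 and method:
            kept.append(entry)
        else:
            discarded.append(entry)
    return kept, discarded


# Placeholder entries (not real annotations).
kept, discarded = qc_filter([
    {"pair": ("SEED1", "PROT_X"), "pmids": ["pmid-1"], "method": "Two Hybrid"},
    {"pair": ("SEED1", "PROT_Y"), "pmids": [], "method": "Two Hybrid"},
    {"pair": ("SEED1", "PROT_Z"), "pmids": ["pmid-2", "pmid-3"], "method": "Two Hybrid"},
])
```

Returning the discarded entries alongside the retained ones mirrors the idea of a log file, so that a user can see what was removed and why.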
Another distinctive feature of the PINOT pipeline is the implementation of a unique interaction detection method conversion step. During this step, the interaction detection method annotation for each downloaded interaction data entry is converted based on a conversion table (Supplementary Materials S2) that is available from the PINOT web-portal. During this conversion, technically similar methods are grouped together. For example: "Two Hybrid - MI:0018", "Two Hybrid Array - MI:0397" and "Two Hybrid Pooling Approach - MI:0398" are grouped together into the "Two Hybrid (2Hyb)" method category. This step of 'method clustering and reassignment' is critical to assess the actual number of distinct methods used to detect a particular interaction and to reduce the bias caused when the same technique is annotated under slightly different method codes in different PPI databases.
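The effect of the conversion step on method counting can be shown with a small Python sketch. Only the three two-hybrid codes named above are taken from the text; the full conversion table is in Supplementary Materials S2 and is not reproduced here. The function names and the fallback behaviour for unmapped codes (returning the code unchanged) are assumptions of this sketch, and "MI:9999" is a made-up placeholder code.

```python
# Partial conversion table: only the three codes quoted in the text.
METHOD_CONVERSION = {
    "MI:0018": "Two Hybrid (2Hyb)",  # Two Hybrid
    "MI:0397": "Two Hybrid (2Hyb)",  # Two Hybrid Array
    "MI:0398": "Two Hybrid (2Hyb)",  # Two Hybrid Pooling Approach
}


def convert_method(mi_code):
    """Map a method code to its grouped category.

    Unmapped codes are returned unchanged; this fallback is an assumption
    of the sketch, not documented PINOT behaviour.
    """
    return METHOD_CONVERSION.get(mi_code, mi_code)


def distinct_methods(mi_codes):
    """Return the set of distinct method categories after conversion."""
    return {convert_method(code) for code in mi_codes}


# Three differently coded two-hybrid annotations collapse into one category.
grouped = distinct_methods(["MI:0018", "MI:0397", "MI:0398"])
```

Without the conversion, the three annotations in this example would be counted as three distinct methods; after grouping they contribute a single method to the score.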
Interaction scores are provided in different formats by HIPPIE, MIST and PINOT. HIPPIE incorporates a filtering system based on a confidence score between 0 and 1 that can be set either before or after the analysis. This is a complex scoring system, which takes into consideration multiple parameters, such as the number of publications that report a specific interaction and a semi-computational quality score based on the experimental approach (for example, imaging techniques would score differently from direct interactions etc.) [33]. MIST similarly has an option for filtering interactions pre- or post-analysis; however, this is based on fixed ranking values defined as low, medium (interaction supported by other species), or high (supported by multiple experimental methods and/or reported in multiple publications). In the case of PINOT, two different scores are provided: the interaction detection method score (MS), which reports the number of different methods used (after method annotation reassignment), and the publication score (PS), which counts the number of different publications that report the interaction. Finally, the coding scripts which underlie the human PPI data PINOT pipeline are fully available for download. They are coded in R to make them accessible to a large research audience and a read-me text file helps customization of the scripts according to the users' needs. Some of the divergent features across PINOT, HIPPIE, MIST and STRING are reported in Table 1.