This document lists algorithms and features which are currently available in CBDD R package.
The functionality of CBDD R package can be roughly classified as follows:
Within each functionality block, algorithms are split into ‘areas’ based on their principal goal and annotated with primary references and input requirements. Other features, such as visualizations, are briefly described as well.
All implemented algorithms are listed alongside with their inputs and outputs. The legend for the input columns reads as follows:
There are several major input formats for CBDD algorithms. The algorithms differ in their input requirements (some require inputs to have specific properties, while some are very general). In most common case, two main inputs are required:
The ‘omics’ data formats are:
The basic network format is simply an adjacency list - two-column matrix showing who interacts with whom. Network edges may have additional attributes:
Pathways are small subnetworks, typically much better studied than the rest of the network. The same network attributes described above are fully applicable in them.
There are two major approaches to create large-scale networks describing molecular relationships in biological systems:
Both approaches have their advantages. The CBDD provides infrastructure to load pre-existing networks from text files and from Clarivate Analytics’ MetaBase. Also, following algorithms are available for data-driven generation and modification of networks:
Area | Description |
---|---|
Data-driven network | De novo generation of relationships between biological entities (e.g. from similarities in gene expression profiles). |
Network adjustment | Weighting and adjustment of networks (e.g. based on tissue expression data), making them more specific to a particular biological context. |
There are three approaches for data-driven reconstruction of networks. The simplest one is conventional co-expression analysis (edges are correlations between matrix rows - i.e. gene expression profiles - which are higher than a threshold).
The latter two methods are more sophisticated. ARACNE uses mutual information to infer expression profile similarities and prunes networks to drive down false positive edges. networkAnova uses analysis of variance to infer edges from patterns of expression change across a range of phenotypes.
Algorithm | Data input | Matrix phenotypes | Reference |
---|---|---|---|
dataDrivenNetwork | Matrix | ||
ARACNE | Matrix | 16723010 | |
networkAnova | Matrix | R | 22467911 |
All these approaches have average performance (minutes).
The network analysis algorithms do different things and might be utilized for different purposes. It is often hard to classify the algorithms into precisely defined categories. CBDD uses a classification based upon the end goal (roughly corresponding to the typical research needs in drug development):
Area | Description | Example purpose |
---|---|---|
Node prioritization | Learn which nodes in the network are well connected to the nodes of interest and might regulate the phenotype. | Drug target identification |
Subnetwork prioritization | Find modules in networks which are associated with phenotype. | Mechanism reconstruction; Biomarker discovery |
Pathway prioritization | Learning which of the canonical signaling pathways are associated with phenotype provides good clues to the molecular mechanisms behind it and may help with biomarker search. | Mechanism reconstruction; Biomarker discovery |
Unsupervised analysis | Learn how patients stratify into the subtypes and which networks and pathways drive this stratification. | Patient stratification |
Integrative analysis | Many omics data types are routinely available, and there is often need to understand how they talk to one another. Networks can be utilized for answering questions such as ‘which mutations in my dataset affect the differential expression in disease?’. | Any purpose |
Almost all of these algorithhms are based on the guilt-by-association principle, ranking all nodes in the networks by topological closeness to the set of start nodes (the definitions of topological closeness vary). All of these algorithms are pretty fast and can operate on large networsk. They can be roughly split into three subclasses
Algorithm | Func. input | Edge dir. | Edge weight | Edge type | Edge mech. | Data input | St.node weights | St.node p-values | Reference |
---|---|---|---|---|---|---|---|---|---|
overconnectivity | Network | A | Start nodes | 19010930 | |||||
guiltByAssociation | Network | Start nodes | 11101803 | ||||||
neighborhoodScoring | Network | A | Start nodes | R | 20840752 | ||||
interconnectivity | Network | A | Start nodes | 22369140 | |||||
hiddenNodes | Network | A | A | Start nodes | 19309513 | ||||
randomWalk | Network | A | A | Start nodes | A | 18371930 | |||
networkPropagation | Network | A | A | Start nodes | A | 20090828 | |||
toppNetHITS | Network | R | A | Start nodes | A | 19245720 | |||
toppNetKM | Network | R | A | Start nodes | A | 19245720 | |||
GeneMania | Network | A | Start nodes | 18613948 | |||||
causalReasoning | Network | R | R | R | Start nodes | R | 22355083 | ||
SigNet | Network | R | R | R | Start nodes | R | 24518063 |
Pathway prioritization approaches will be added soon
Subnetwork algorithms look for parts of global network which are enriched with the start nodes or somehow associated with input data. Pathway algorithms, similarly, prioritize the small networks from a collection (e.g. canonical signaling pathways) which are associated with the data.
There are several flavors of these algorithms:
Algorithm | Func. input | Edge dir. | Edge weight | Edge type | Edge mech. | Data input | St.node weights | St.node p-values | Matrix phenotypes | Reference |
---|---|---|---|---|---|---|---|---|---|---|
activeModules | Network | Start nodes | R | 12169552 | ||||||
pathwayInference | Network | Start nodes | R | 15509611 | ||||||
HotNet | Network | Start nodes | A | A | 22174262 | |||||
HotNet2 | Network | Start nodes | A | A | 25501392 | |||||
DEGAS | Network | Matrix | R | 20976054 | ||||||
DENSE | Network | Start nodes | 22024446 | |||||||
MCODE | Network | Start nodes | 12525261 | |||||||
pathwayActivityInference | Pathways | R | Matrix | R | 19997592 | |||||
subnetworkMarkers | Network | Matrix | R | 17940530 | ||||||
SPIA | Pathways | R | R | R | A | Start nodes | R | 18990722 | ||
CASNet | Network | A | R | Start nodes | R | R | 23368093 |
Most of these methods are reasonably fast and can handle large input networks and data sets. Exceptions are DENSE and HotNet (high memory requirements). HotNet may run for more than 1 hour on laptop for network with ~10K nodes.
These approaches seek to integrate several input sets of nodes (typically coming from different omics data types, like mutations and expression profiles) and infer subnetworks connecting them.
Algorithm | Func. input | Edge dir. | Edge weight | Edge type | Edge mech. | Data input | St.node weights | St.node p-values | Matrix phenotypes | Reference |
---|---|---|---|---|---|---|---|---|---|---|
TieDie | Network | A | A | A | Start nodes (2) | A | 23986566 | |||
PCST | Network | A | A | Start nodes (2) | A | 19638617 | ||||
PCSF | Network | A | A | Start nodes (2) | A | 23383998 | ||||
eQED | Network | A | A | A | Start nodes (2 or more) | 18319721 | ||||
PARADIGM | Network | R | R | A | Matrix (1 or more) | 20529912 |
Of these approaches, TieDie works fast; PCST / PCSF is slower and more eager for memory (although network with 10K nodes can be handled reasonably fast), and eQED is very sensitive to the input size (likely running into problems in networks as large as 5K nodes)
There are two approaches for patient stratification / unsupervised clustering in the toolkit. Both utilize non-negative matrix factorization supplied with network-based constraints.
For NCIS, this is simply upweighting nodes which are well connected with high variance nodes in the network. For NBS, the constraints actually force node clusters to contain topologically close nodes (although they do not have to be all connected to each other)
Algorithm | Func. input | Edge dir. | Edge weight | Edge type | Edge mech. | Data input | Matrix phenotypes | Reference |
---|---|---|---|---|---|---|---|---|
NCIS | Network | A | A | Matrix | 24491042 | |||
NBS | Network | Matrix | 24037242 |
Both approaches have average performance (minutes); assessing cluster stability (performing multiple runs of algorithms and checking how pairs of samples co-cluster) may take a long time
This block includes functionality intended to make sense of the results produced by the algorithms, namely:
dE-MAP is a differential network analysis approach, which highlights differences between two input networks.
Algorithm | Func. input | Edge dir. | Edge weight | Edge type | Edge mech. | Reference |
---|---|---|---|---|---|---|
dEMAP | Network (2) | A | 21127252 | |||
editDistance | Network (2) | A | A | A | A | doi:10.1109/TSMC.1983.6313167 |
Visualization is crucial aspect of network analysis; especially when it concerns the subnetwork identification task. In order to gain insight from the subnetwork analysis result, analyst should be able to create a network graph and overlay his or her data on top of it. Unfortunately, base R lacks effective tools for creating network-based visualizations.
CBDD includes basic interactive network viewer, which renders subnetwork as an HTML widget. The widget can be explored in RStudio, embedded into the R markdown documents or simply saved as HTML web page.
The new network viewer function has the following features:
There are functions allowing comparisons of the most popular kinds of results the CBDD algorithms yield: subnetworks and ranked lists of entities (e.g. network nodes). Three possible use cases are defined for comparing the results of algorithms:
For each use case, interactive visualizations are available.