This document lists algorithms and features which are currently available in CBDD R package.

The functionality of CBDD R package can be roughly classified as follows:

Within each functionality block, algorithms are split into ‘areas’ based on their principal goal and annotated with primary references and input requirements. Other features, such as visualizations, are briefly described as well.

Data types

All implemented algorithms are listed alongside with their inputs and outputs. The legend for the input columns reads as follows:

There are several major input formats for CBDD algorithms. The algorithms differ in their input requirements (some require inputs to have specific properties, while some are very general). In most common case, two main inputs are required:

The ‘omics’ data formats are:

The basic network format is simply an adjacency list - two-column matrix showing who interacts with whom. Network edges may have additional attributes:

Pathways are small subnetworks, typically much better studied than the rest of the network. The same network attributes described above are fully applicable in them.

Build network

There are two major approaches to create large-scale networks describing molecular relationships in biological systems:

Both approaches have their advantages. The CBDD provides infrastructure to load pre-existing networks from text files and from Clarivate Analytics’ MetaBase. Also, following algorithms are available for data-driven generation and modification of networks:

Area Description
Data-driven network De novo generation of relationships between biological entities (e.g. from similarities in gene expression profiles).
Network adjustment Weighting and adjustment of networks (e.g. based on tissue expression data), making them more specific to a particular biological context.

Data-driven network reconstruction (to be added)

There are three approaches for data-driven reconstruction of networks. The simplest one is conventional co-expression analysis (edges are correlations between matrix rows - i.e. gene expression profiles - which are higher than a threshold).

The latter two methods are more sophisticated. ARACNE uses mutual information to infer expression profile similarities and prunes networks to drive down false positive edges. networkAnova uses analysis of variance to infer edges from patterns of expression change across a range of phenotypes.

Algorithm Data input Matrix phenotypes Reference
dataDrivenNetwork Matrix
ARACNE Matrix 16723010
networkAnova Matrix R 22467911

All these approaches have average performance (minutes).

Analysis

The network analysis algorithms do different things and might be utilized for different purposes. It is often hard to classify the algorithms into precisely defined categories. CBDD uses a classification based upon the end goal (roughly corresponding to the typical research needs in drug development):

Area Description Example purpose
Node prioritization Learn which nodes in the network are well connected to the nodes of interest and might regulate the phenotype. Drug target identification
Subnetwork prioritization Find modules in networks which are associated with phenotype. Mechanism reconstruction; Biomarker discovery
Pathway prioritization Learning which of the canonical signaling pathways are associated with phenotype provides good clues to the molecular mechanisms behind it and may help with biomarker search. Mechanism reconstruction; Biomarker discovery
Unsupervised analysis Learn how patients stratify into the subtypes and which networks and pathways drive this stratification. Patient stratification
Integrative analysis Many omics data types are routinely available, and there is often need to understand how they talk to one another. Networks can be utilized for answering questions such as ‘which mutations in my dataset affect the differential expression in disease?’. Any purpose

Node prioritization

Almost all of these algorithhms are based on the guilt-by-association principle, ranking all nodes in the networks by topological closeness to the set of start nodes (the definitions of topological closeness vary). All of these algorithms are pretty fast and can operate on large networsk. They can be roughly split into three subclasses

Algorithm Func. input Edge dir. Edge weight Edge type Edge mech. Data input St.node weights St.node p-values Reference
overconnectivity Network A Start nodes 19010930
guiltByAssociation Network Start nodes 11101803
neighborhoodScoring Network A Start nodes R 20840752
interconnectivity Network A Start nodes 22369140
hiddenNodes Network A A Start nodes 19309513
randomWalk Network A A Start nodes A 18371930
networkPropagation Network A A Start nodes A 20090828
toppNetHITS Network R A Start nodes A 19245720
toppNetKM Network R A Start nodes A 19245720
GeneMania Network A Start nodes 18613948
causalReasoning Network R R R Start nodes R 22355083
SigNet Network R R R Start nodes R 24518063

Subnetwork and pathway prioritization

Pathway prioritization approaches will be added soon

Subnetwork algorithms look for parts of global network which are enriched with the start nodes or somehow associated with input data. Pathway algorithms, similarly, prioritize the small networks from a collection (e.g. canonical signaling pathways) which are associated with the data.

There are several flavors of these algorithms:

Algorithm Func. input Edge dir. Edge weight Edge type Edge mech. Data input St.node weights St.node p-values Matrix phenotypes Reference
activeModules Network Start nodes R 12169552
pathwayInference Network Start nodes R 15509611
HotNet Network Start nodes A A 22174262
HotNet2 Network Start nodes A A 25501392
DEGAS Network Matrix R 20976054
DENSE Network Start nodes 22024446
MCODE Network Start nodes 12525261
pathwayActivityInference Pathways R Matrix R 19997592
subnetworkMarkers Network Matrix R 17940530
SPIA Pathways R R R A Start nodes R 18990722
CASNet Network A R Start nodes R R 23368093

Most of these methods are reasonably fast and can handle large input networks and data sets. Exceptions are DENSE and HotNet (high memory requirements). HotNet may run for more than 1 hour on laptop for network with ~10K nodes.

Integration approaches

These approaches seek to integrate several input sets of nodes (typically coming from different omics data types, like mutations and expression profiles) and infer subnetworks connecting them.

Algorithm Func. input Edge dir. Edge weight Edge type Edge mech. Data input St.node weights St.node p-values Matrix phenotypes Reference
TieDie Network A A A Start nodes (2) A 23986566
PCST Network A A Start nodes (2) A 19638617
PCSF Network A A Start nodes (2) A 23383998
eQED Network A A A Start nodes (2 or more) 18319721
PARADIGM Network R R A Matrix (1 or more) 20529912

Of these approaches, TieDie works fast; PCST / PCSF is slower and more eager for memory (although network with 10K nodes can be handled reasonably fast), and eQED is very sensitive to the input size (likely running into problems in networks as large as 5K nodes)

Unsupervised analysis

There are two approaches for patient stratification / unsupervised clustering in the toolkit. Both utilize non-negative matrix factorization supplied with network-based constraints.

For NCIS, this is simply upweighting nodes which are well connected with high variance nodes in the network. For NBS, the constraints actually force node clusters to contain topologically close nodes (although they do not have to be all connected to each other)

Algorithm Func. input Edge dir. Edge weight Edge type Edge mech. Data input Matrix phenotypes Reference
NCIS Network A A Matrix 24491042
NBS Network Matrix 24037242

Both approaches have average performance (minutes); assessing cluster stability (performing multiple runs of algorithms and checking how pairs of samples co-cluster) may take a long time

Interpretation of results

This block includes functionality intended to make sense of the results produced by the algorithms, namely:

Network comparison tools

dE-MAP is a differential network analysis approach, which highlights differences between two input networks.

Algorithm Func. input Edge dir. Edge weight Edge type Edge mech. Reference
dEMAP Network (2) A 21127252
editDistance Network (2) A A A A doi:10.1109/TSMC.1983.6313167

Visualization and interpretation

Visualization is crucial aspect of network analysis; especially when it concerns the subnetwork identification task. In order to gain insight from the subnetwork analysis result, analyst should be able to create a network graph and overlay his or her data on top of it. Unfortunately, base R lacks effective tools for creating network-based visualizations.

CBDD includes basic interactive network viewer, which renders subnetwork as an HTML widget. The widget can be explored in RStudio, embedded into the R markdown documents or simply saved as HTML web page.

The new network viewer function has the following features:

Comparisons

There are functions allowing comparisons of the most popular kinds of results the CBDD algorithms yield: subnetworks and ranked lists of entities (e.g. network nodes). Three possible use cases are defined for comparing the results of algorithms:

For each use case, interactive visualizations are available.