CBDD algorithms and functions

This document lists algorithms and features which are currently available in CBDD R package.

Within each functionality block, algorithms are split into ‘areas’ based on their principal goal and annotated with primary references and input requirements. Other features, such as visualizations, are briefly described as well.

Data types

All implemented algorithms are listed alongside with their inputs and outputs. The legend for the input columns reads as follows:

R - Required; algorithm will not work if this input / attribute is not provided
A - Allowed; algorithm will work with this attribute and take it into account
Blank - Ignored (algorithm will not use this attribute)

There are several major input formats for CBDD algorithms. The algorithms differ in their input requirements (some require inputs to have specific properties, while some are very general). In most common case, two main inputs are required:

‘Functional’ input: global network of interactions between molecular entities (genes or proteins), or collection of molecular pathways
‘Data’, or ‘omics’ input: molecular profiling data set user wants to interpret (e.g. differentially expressed gene list)

The ‘omics’ data formats are:

Start nodes: a list of genes somehow associated with the phenotype of interest. The start nodes may have associated activity / abundance change values (e.g. expression fold changes) and also p-values showing significance of their phenotype association.
Matrix: a matrix of whole genome measurements, e.g. gene expression signals, plus typically the vector of phenotypes for samples to facilitate inter-group comparison.

The basic network format is simply an adjacency list - two-column matrix showing who interacts with whom. Network edges may have additional attributes:

Direction. Some algorithms require directed network and will imply that signal goes from node 1 to node 2.
Edge type (effect) Some algorithm utilize information on whether interactions result in the activation or inhibition of their target nodes
Mechanism. Some algorithms may differentiate between different types of interactions (e.g. physical binding vs. transcription regulation links)
Weight. Edges may be weighted by confidence.

Pathways are small subnetworks, typically much better studied than the rest of the network. The same network attributes described above are fully applicable in them.

Build network

There are two major approaches to create large-scale networks describing molecular relationships in biological systems:

Scaffold network: experimental detection of physical interactions between genes, proteins and other molecules, or curation of such findings from literature;
Data-driven network: de novo prediction of relationships between molecules based on pattern mining in diverse data sets (for example, search for expression similarities, typically indicative of some sort of functional relationships between genes; or text mining for co-citations of genes in compendia of literature)

Both approaches have their advantages. The CBDD provides infrastructure to load pre-existing networks from text files and from Clarivate Analytics’ MetaBase. Also, following algorithms are available for data-driven generation and modification of networks:

Area	Description
Data-driven network	De novo generation of relationships between biological entities (e.g. from similarities in gene expression profiles).
Network adjustment	Weighting and adjustment of networks (e.g. based on tissue expression data), making them more specific to a particular biological context.

Data-driven network reconstruction (to be added)

There are three approaches for data-driven reconstruction of networks. The simplest one is conventional co-expression analysis (edges are correlations between matrix rows - i.e. gene expression profiles - which are higher than a threshold).

The latter two methods are more sophisticated. ARACNE uses mutual information to infer expression profile similarities and prunes networks to drive down false positive edges. networkAnova uses analysis of variance to infer edges from patterns of expression change across a range of phenotypes.

Algorithm	Data input	Matrix phenotypes	Reference
dataDrivenNetwork	Matrix
ARACNE	Matrix		16723010
networkAnova	Matrix	R	22467911

All these approaches have average performance (minutes).

Analysis

The network analysis algorithms do different things and might be utilized for different purposes. It is often hard to classify the algorithms into precisely defined categories. CBDD uses a classification based upon the end goal (roughly corresponding to the typical research needs in drug development):

Area	Description	Example purpose
Node prioritization	Learn which nodes in the network are well connected to the nodes of interest and might regulate the phenotype.	Drug target identification
Subnetwork prioritization	Find modules in networks which are associated with phenotype.	Mechanism reconstruction; Biomarker discovery
Pathway prioritization	Learning which of the canonical signaling pathways are associated with phenotype provides good clues to the molecular mechanisms behind it and may help with biomarker search.	Mechanism reconstruction; Biomarker discovery
Unsupervised analysis	Learn how patients stratify into the subtypes and which networks and pathways drive this stratification.	Patient stratification
Integrative analysis	Many omics data types are routinely available, and there is often need to understand how they talk to one another. Networks can be utilized for answering questions such as ‘which mutations in my dataset affect the differential expression in disease?’.	Any purpose

Node prioritization

Almost all of these algorithhms are based on the guilt-by-association principle, ranking all nodes in the networks by topological closeness to the set of start nodes (the definitions of topological closeness vary). All of these algorithms are pretty fast and can operate on large networsk. They can be roughly split into three subclasses

Local approaches (assessing one or several steps away neighborhood of start nodes)
Global approaches (taking into account the entire network topology): random walk, network propagation, toppNet
Causal reasoning and SigNet: local algorithms taking advantage of rich network annotation to make inference about causal regulators of input node set and their possible activation/repression. These methods are more demanding in terms of memory and are a little slower.

Algorithm	Func. input	Edge dir.	Edge weight	Edge type	Edge mech.	Data input	St.node weights	Reference
overconnectivity	Network	A				Start nodes		19010930
guiltByAssociation	Network					Start nodes		11101803
neighborhoodScoring	Network	A				Start nodes	R	20840752
interconnectivity	Network		A			Start nodes		22369140
hiddenNodes	Network	A	A			Start nodes		19309513
randomWalk	Network	A	A			Start nodes	A	18371930
networkPropagation	Network	A	A			Start nodes	A	20090828
toppNetHITS	Network	R	A			Start nodes	A	19245720
toppNetKM	Network	R	A			Start nodes	A	19245720
GeneMania	Network		A			Start nodes		18613948
causalReasoning	Network	R		R	R	Start nodes	R	22355083
SigNet	Network	R		R	R	Start nodes	R	24518063

Subnetwork and pathway prioritization

Pathway prioritization approaches will be added soon

Subnetwork algorithms look for parts of global network which are enriched with the start nodes or somehow associated with input data. Pathway algorithms, similarly, prioritize the small networks from a collection (e.g. canonical signaling pathways) which are associated with the data.

There are several flavors of these algorithms:

Enriched subnetwork finders: activeModules, pathwayInference, HotNet/HotNet2, DEGAS
Dense subnetwork finders (they can also prioritize their findings by association with data): MCODE, DENSE. These approaches are better suited for data-driven networks than for scaffold networks like MetaBase
Biomarker finders, seeking pathways/subnetworks discriminating between two phenotypes in a data set: pathwayActivityInference, subnetworkMarkers
Methods looking for subnetworks/pathways where perturbed nodes are in agreement with network topology and activation/inhibition pathways: CASNet, SPIA

Algorithm	Func. input	Edge dir.	Edge weight	Edge type	Edge mech.	Data input	St.node weights	St.node p-values	Matrix phenotypes	Reference
activeModules	Network					Start nodes		R		12169552
pathwayInference	Network					Start nodes		R		15509611
HotNet	Network					Start nodes	A	A		22174262
HotNet2	Network					Start nodes	A	A		25501392
DEGAS	Network					Matrix			R	20976054
DENSE	Network					Start nodes				22024446
MCODE	Network					Start nodes				12525261
pathwayActivityInference	Pathways	R				Matrix			R	19997592
subnetworkMarkers	Network					Matrix			R	17940530
SPIA	Pathways	R	R	R	A	Start nodes	R			18990722
CASNet	Network		A	R		Start nodes	R	R		23368093

Most of these methods are reasonably fast and can handle large input networks and data sets. Exceptions are DENSE and HotNet (high memory requirements). HotNet may run for more than 1 hour on laptop for network with ~10K nodes.

Integration approaches

These approaches seek to integrate several input sets of nodes (typically coming from different omics data types, like mutations and expression profiles) and infer subnetworks connecting them.

Cause/consequence algorithms: use random walk-like approaches or prize-collecting Steiner tree algorithm to devise pathways connecting nodes from ‘cause’ start set with nodes from ‘consequences’ start set.

Algorithm	Func. input	Edge dir.	Edge weight	Edge type	Edge mech.	Data input	St.node weights	Reference
TieDie	Network	A	A	A		Start nodes (2)	A	23986566
PCST	Network		A		A	Start nodes (2)	A	19638617
PCSF	Network		A		A	Start nodes (2)	A	23383998
eQED	Network	A	A		A	Start nodes (2 or more)		18319721
PARADIGM	Network	R		R	A	Matrix (1 or more)		20529912

Of these approaches, TieDie works fast; PCST / PCSF is slower and more eager for memory (although network with 10K nodes can be handled reasonably fast), and eQED is very sensitive to the input size (likely running into problems in networks as large as 5K nodes)

Unsupervised analysis

There are two approaches for patient stratification / unsupervised clustering in the toolkit. Both utilize non-negative matrix factorization supplied with network-based constraints.

For NCIS, this is simply upweighting nodes which are well connected with high variance nodes in the network. For NBS, the constraints actually force node clusters to contain topologically close nodes (although they do not have to be all connected to each other)

Algorithm	Func. input	Edge dir.	Edge weight	Edge type	Edge mech.	Data input	Matrix phenotypes	Reference
NCIS	Network	A	A			Matrix		24491042
NBS	Network					Matrix		24037242

Both approaches have average performance (minutes); assessing cluster stability (performing multiple runs of algorithms and checking how pairs of samples co-cluster) may take a long time

Interpretation of results

This block includes functionality intended to make sense of the results produced by the algorithms, namely:

Visualization
Biological interpretation (e.g. enrichment analysis to reveal impacted biological processes)
Comparison of algorithm results
Network comparison

Network comparison tools

dE-MAP is a differential network analysis approach, which highlights differences between two input networks.

Algorithm	Func. input	Edge dir.	Edge weight	Edge type	Edge mech.	Reference
dEMAP	Network (2)		A			21127252
editDistance	Network (2)	A	A	A	A	doi:10.1109/TSMC.1983.6313167

Visualization and interpretation

Visualization is crucial aspect of network analysis; especially when it concerns the subnetwork identification task. In order to gain insight from the subnetwork analysis result, analyst should be able to create a network graph and overlay his or her data on top of it. Unfortunately, base R lacks effective tools for creating network-based visualizations.

CBDD includes basic interactive network viewer, which renders subnetwork as an HTML widget. The widget can be explored in RStudio, embedded into the R markdown documents or simply saved as HTML web page.

The new network viewer function has the following features:

Basic features
- Interactive visualization produced by an ordinary R function call and rendered as an HTML page. Users can zoom, pan, select and move nodes around. Nodes and their neighbors are highlighted to help understand the local structure of the network;
- Full control of aesthetics of the network (node shapes, sizes, colors etc.) Default visualization aesthetics would depend on network attributes (e.g. edge width corresponding to edge weight in weighted networks).;
- Nodes can be relabeled to provide human-readable names in the visualization
- Export to image, HTML page, XGMML data format (with preservation of visual attributes and layout) for Cytoscape.
Network and data interpretation tools
- Data overlay: mapping data to network’s visual attributes (e.g. node color); support for multiple data layers.
- Display of experimental data and annotations for nodes and edges on the pop-up tooltips
- Enrichment of selected node sets with supplied ontology; depicting top processes in the separate panel of the visualization page. Highlight the nodes for selected process
- Grouping and collapsing nodes to simplify the network.

Comparisons

There are functions allowing comparisons of the most popular kinds of results the CBDD algorithms yield: subnetworks and ranked lists of entities (e.g. network nodes). Three possible use cases are defined for comparing the results of algorithms:

Pairwise comparison of result sets. For example, how different are score discributions for two node prioritization algorithms?
Overview of nodes (edges) relevance for the multiple result sets. For example, what are nodes consistently ranked high by node prioritization algorithms? Are there any patterns in node scores distributions?
Functional overview multiple result sets. For example, what are GO processes associated with subnetworks found by different algorithms for my data set? Do they make sense?

For each use case, interactive visualizations are available.