CBDD Overview

Overview

CBDD stands for ‘Computational Biology for Drug Discovery’. It is a pre-competitive consortium between Clarivate Analytics (formerly the IP & Science business of Thomson Reuters) and several leading pharmaceutical companies, aimed at the evaluation and uniform implementation of algorithms for network and pathway analysis of molecular data sets.

Selecting an appropriate network analysis algorithm is a challenge. In the last few years, dozens of network and pathway analysis approaches have been published in the peer-reviewed literature. However, using them requires understanding their required inputs (data and network), the assumptions they make about the biology, and the goals they are intended to achieve. Furthermore, each new algorithm comes with its own learning curve, and implementations are not always readily available.

The goal of CBDD is, first, to select the state-of-the-art algorithms that could be beneficial for drug development research and, second, to implement them in a uniform fashion, providing a robust and easy-to-use software package.

The CBDD toolkit is designed to enable seamless analysis of diverse data sets within the context of any networks available to the user. A typical design of a computational workflow in systems biology looks like the following:

Gather molecular data to analyze;
Collect an appropriate molecular network;
Select the algorithm intended to solve a particular research problem;
Run the algorithm;
Interpret the results to get insight about the data.

The molecular data for analysis are typically readily available (either from the public domain or generated in-house) and have a relatively simple structure. CBDD helps with the two more complex steps, namely network generation and algorithm selection.

The collection of algorithms implemented during the CBDD program is available first and foremost as an R package. A web application is available for users who prefer a visual interface over the command line.

Functionality

The functionality of the CBDD R package can be roughly classified as follows:

Build network: tools to retrieve, create or modify molecular networks;
Analyze data: algorithms that generate insight from ‘omics’ data sets in the context of networks and pathways;
Interpret results: tools to gain biological insight from algorithm outputs.

CBDD does not end with the R package. Deliverables include a GUI version of CBDD, which gives users access to the algorithms via a web interface.

In addition, the collaborative nature of the CBDD program provides opportunities for members to exchange ideas and best practices in network analysis.

The following sections describe each of these pieces of functionality. A more detailed description of algorithm areas and other features is available here.

Build network

Network generation is one of the more challenging steps. There are two major approaches to creating large-scale networks that describe relationships in biological systems:

Scaffold network: experimental detection of physical interactions between genes, proteins and other molecules, or curation of such findings from the literature;
Data-driven network: de novo prediction of relationships between molecules based on pattern mining in diverse data sets (for example, a search for expression similarities, which typically indicate some functional relationship between genes, or text mining for co-citations of genes in compendia of literature).

Both approaches have their advantages. CBDD provides infrastructure to load pre-existing networks from text files and from Clarivate Analytics’ MetaBase. In addition, several algorithms are available for data-driven generation and modification of networks:

Area	Description
Data-driven network	De novo generation of relationships between biological entities (e.g. from similarities in gene expression profiles).
Network adjustment	Weighting and adjustment of networks (e.g. based on tissue expression data), making them more specific to a particular biological context.
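To make the idea of a data-driven network concrete, the following is a minimal, language-neutral sketch in Python of one common approach: connecting genes whose expression profiles are strongly correlated. It is not the CBDD R API; the function names (`pearson`, `coexpression_network`), the gene identifiers, the toy profiles and the 0.8 threshold are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpression_network(profiles, threshold=0.8):
    """Return {(gene1, gene2): r} for gene pairs with |r| >= threshold.

    `profiles` maps a gene identifier to its expression values across
    the same ordered set of samples (hypothetical input layout).
    """
    edges = {}
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= threshold:
            edges[(g1, g2)] = r
    return edges


# Toy expression profiles over four samples (made-up numbers).
profiles = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.1, 5.9, 8.0],
    "GENE_C": [4.0, 1.0, 3.5, 1.5],
}
network = coexpression_network(profiles, threshold=0.8)
for (g1, g2), r in network.items():
    print(f"{g1} -- {g2}  r={r:.3f}")
```

In this toy input only GENE_A and GENE_B pass the threshold. The stored correlation can double as an edge weight, which is also the kind of value a network adjustment step would rescale for a particular biological context.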

Analyze data

Network analysis algorithms serve different purposes, and it is often hard to classify them into precisely defined categories. CBDD uses a classification based on the end goal (roughly corresponding to typical research needs in drug development):

Area	Description	Example purpose
Node prioritization	Learn which nodes in the network are well connected to the nodes of interest and might regulate the phenotype.	Drug target identification
Subnetwork prioritization	Find modules in networks which are associated with phenotype.	Mechanism reconstruction; Biomarker discovery
Pathway prioritization (coming soon)	Learn which canonical signaling pathways are associated with phenotype; this gives clues to the underlying molecular mechanisms and may help with biomarker search.	Mechanism reconstruction; Biomarker discovery
Unsupervised analysis (coming soon)	Learn how patients stratify into the subtypes and which networks and pathways drive this stratification.	Patient stratification
Integrative analysis	Many omics data types are routinely available, and there is often a need to understand how they relate to one another. Networks can be used to answer questions such as ‘which mutations in my dataset affect the differential expression in disease?’.	Any purpose
Network comparison	Compare mechanisms underlying different diseases or disease models.	Mechanism reconstruction
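As an illustration of the node prioritization area, the sketch below uses random walk with restart, a widely used technique for scoring how well each node is connected to a set of nodes of interest. The source does not state which algorithms CBDD implements, so this is an assumption chosen for illustration and not the CBDD R API; the function name, parameters and toy network are hypothetical.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3,
                             tol=1e-10, max_iter=1000):
    """Score nodes by proximity to `seeds` in an undirected network.

    `adjacency` maps each node to the list of its neighbors. At every
    step the walker returns to a seed with probability `restart`;
    otherwise it moves to a uniformly chosen neighbor.
    Assumes every node has at least one neighbor.
    """
    nodes = sorted(adjacency)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(start)
    for _ in range(max_iter):
        updated = {n: restart * start[n] for n in nodes}
        for n in nodes:
            neighbors = adjacency[n]
            share = (1.0 - restart) * scores[n] / len(neighbors)
            for m in neighbors:
                updated[m] += share
        change = sum(abs(updated[n] - scores[n]) for n in nodes)
        scores = updated
        if change < tol:
            break
    return scores


# Toy chain network A - B - C - D - E with a single node of interest, A.
adjacency = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
scores = random_walk_with_restart(adjacency, seeds={"A"})
candidates = sorted((n for n in scores if n != "A"),
                    key=scores.get, reverse=True)
print(candidates)
```

Nodes closer to the seed receive higher scores, so the ranked candidates are the ones a prioritization step would suggest for follow-up, for example as potential drug targets.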

Interpret results

This block of features in the R package helps to make sense of the results produced by the algorithms. It includes:

Interactive visualization of analysis results;
Biological interpretation (e.g. enrichment analysis to reveal impacted biological processes);
Functions to compare algorithm results;
Network comparison algorithms.
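Enrichment analysis, mentioned above as one form of biological interpretation, is commonly based on a one-sided hypergeometric test: given the genes an algorithm selected, how surprising is their overlap with a set of genes annotated to a biological process? The sketch below is a minimal Python illustration of that test, not the CBDD implementation; the function names, gene identifiers and gene sets are hypothetical.

```python
from math import comb


def hypergeom_tail(hits, selected, annotated, universe):
    """P(X >= hits) when `selected` genes are drawn without replacement
    from `universe` genes, of which `annotated` belong to the gene set."""
    total = comb(universe, selected)
    max_hits = min(selected, annotated)
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


def enrich(selected, gene_sets, universe):
    """Return (set name, overlap size, p-value) sorted by p-value."""
    universe = set(universe)
    selected = set(selected) & universe
    results = []
    for name, members in gene_sets.items():
        members = set(members) & universe
        hits = len(selected & members)
        p = hypergeom_tail(hits, len(selected), len(members), len(universe))
        results.append((name, hits, p))
    return sorted(results, key=lambda row: row[2])


# Toy universe of 20 genes and two made-up gene sets.
universe = {f"G{i}" for i in range(1, 21)}
gene_sets = {
    "process_X": {f"G{i}" for i in range(1, 6)},    # G1..G5
    "process_Y": {f"G{i}" for i in range(6, 16)},   # G6..G15
}
selected = {"G1", "G2", "G3", "G6"}

results = enrich(selected, gene_sets, universe)
for name, hits, p in results:
    print(f"{name}: overlap={hits}, p={p:.4f}")
```

In practice the p-values would also be corrected for multiple testing across all gene sets; that step is omitted here to keep the sketch short.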

Beyond the R package

Community

The CBDD program includes leadership events dedicated to the exchange of ideas on how best to apply network analysis tools to drug development tasks.

In addition, a tutorials section has been added to the CBDD website, allowing registered CBDD users to share their work on different applications of network analysis and to provide useful tips and tricks to the rest of the community.

GUI

In addition to the R package, CBDD functionality is also available as a web service that provides a GUI for the algorithms. The web service is not hosted by Clarivate Analytics; instead, it is deployed on the members’ own IT infrastructure. The main purpose of this software extension is to enable more users (beyond R users) to use the CBDD network algorithms and to explore the results in a more familiar graphical interface.

The GUI has the following features:

Load data and networks, store them on the server side and manage the files in the data and network navigators. Share data with other users;
Run one or more algorithms which are applicable to selected data sets with full control of options;
Save / load workflows (particular data and particular settings of algorithms used in analysis);
Explore, save, export results of algorithms;
Compare results of different algorithm runs.

The interactive network visualizations available in the CBDD R package are plugged into the GUI in appropriate places (e.g. to view subnetworks from the algorithm result page).

Algorithm selection and implementation principles

Algorithms and other features of CBDD are selected by a collective vote of CBDD members.

The priorities in the development of CBDD are as follows:

Convenience. CBDD is implemented in the conventional form of an R package, making it accessible to a wide range of bioinformaticians without extensive computer science training. Additional functions for importing networks from different sources and for exploring analysis results will be added.
Generality. CBDD focuses on implementing a wide range of general-purpose tools rather than algorithms tied to a specific experimental design or data type.
Reliability. Algorithms are extensively tested; detailed descriptions of the steps, with toy inputs and outputs, are provided.
Performance. Computationally intensive parts of the algorithms are implemented in Java behind the scenes for improved overall performance. The aim is to keep the runtime low even for computationally complex algorithms applied to biologically relevant input data sizes, whilst still keeping as much code in R as possible.
Modularity. The algorithms are often modular in nature [1], and users might want to tweak the modules. CBDD encourages this by modularizing the algorithms as much as possible, providing a separate R function for each step so that users can reuse these parts and build custom analysis workflows.