Overview and R functions
The original rationale for SPRINT comes from microarray analysis, which allows the simultaneous measurement of thousands to millions of genes or sequences across tens to thousands of different samples.
The analysis of the resulting data tests the limits of existing bioinformatics computing infrastructure. A solution to this issue is to use High Performance Computing (HPC) systems, which contain many processors and more memory than desktop computer systems.
Although microarray data formed the original basis of SPRINT, most of our parallelised functions are in principle useful in any situation where a very large number of computations are carried out or where the computations result in very large memory use.
We have designed and built a framework that allows the addition of parallelised functions to R to enable the easy exploitation of HPC systems. The Simple Parallel R INTerface (SPRINT) is a wrapper around such parallelised functions. Their use requires very little modification to existing sequential R scripts and no expertise in parallel computing.
SPRINT allows R users to concentrate on the research problems rather than the computation, while still allowing exploitation of HPC systems. It is easy to use and with further development will become more useful as more functions are added to the framework.
- Hamming distance for pairs of character strings. Used, for example, in measuring the distance between nucleotide sequences. Based on function stringdist() in package stringdist, created by Mark van der Loo, 2013. Cited source: Hamming RW. Error detecting and error correcting codes. The Bell System Technical Journal 29, 147-160, 1950.
- Apply any function to each row/column in a matrix. A generic function useful in many situations where for-loops may be slower. Based on function apply() in R base package.
- Bootstrap estimates of any given statistic. Based on boot() function in boot package. Cited source: Angelo Canty and Brian Ripley. "boot: Bootstrap R (S-PLUS) Functions", 2013.
- Pearson correlation for pairs of numeric variables. For example used in obtaining gene adjacency networks through measured gene-gene similarities across a range of samples or conditions. Based on cor() function in package stats: Becker et al. The New S Language. Wadsworth & Brooks/Cole 1988.
- Permutation-adjusted p-values. Used in statistical testing of inference hypotheses (e.g. is a gene differentially expressed between two conditions) to provide robustly estimated p-values that are adjusted for multiple testing. Based on function mt.maxT() in package multtest, created by Yongchao Ge and Sandrine Dudoit. Cited source: Dudoit S et al. Multiple hypothesis testing in microarray experiments [Submitted].
- Partitioning-Around-Medoids clustering. Used in identifying and grouping patterns in data, e.g. gene expression profiles in expression studies. Based on pam() function in package cluster, created by Martin Maechler. Cited source: Reynolds A et al. Clustering rules: A comparison of partitioning and hierarchical clustering algorithms. Journal of Mathematical Modelling and Algorithms 5, 475-504, 2006.
- Random Forest classification algorithm. Used in classifying samples in a data set (predicting their biological class or medical status) by constructing a large number of decision trees and aggregating their outcomes. Based on function randomForest() in package randomForest, created by Andy Liaw and Matthew Wiener. Cited source: Breiman L. Random Forests. Machine Learning 45(1), 5-32, 2001.
- Rank-Product statistical testing. This non-parametric permutation-based statistical test is used in similar circumstances to parametric tests (e.g. the t-test) but is more robust for small sample sizes and focuses on between-sample ratios rather than per-group means. Based on function RP() in package RankProd, created by Fangxin Hong. Cited source: Breitling R et al. Rank Products: A simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Letters 573, 83-92, 2004.
- Support-Vector-Machine classification (parallelisation of cross-validation only). Used in classifying samples in a data set (predicting their biological class or medical status) by determining a 'hyperplane' that minimises differences between members of a class and maximises differences between members of different classes. Based on function svm() in package e1071, created by David Meyer. Cited source: Chang CC. LIBSVM: a library for Support Vector Machines.
- A simple function for testing that SPRINT has been installed correctly.
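To make the first item in the list above concrete, here is a minimal sketch of a pairwise Hamming distance matrix for nucleotide sequences. It is plain Python for illustration only, not SPRINT or stringdist code; the names `hamming` and `hamming_matrix` and the example sequences are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_matrix(seqs):
    """Symmetric matrix of pairwise Hamming distances (hypothetical helper)."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Each (i, j) pair is independent of every other pair,
            # which is what makes this computation easy to parallelise.
            dist[i][j] = dist[j][i] = hamming(seqs[i], seqs[j])
    return dist


# Made-up example sequences, not real data
seqs = ["GATTACA", "GACTATA", "CATTACA"]
matrix = hamming_matrix(seqs)
```

The number of pairs grows quadratically with the number of sequences, which is why this kind of all-pairs computation quickly exceeds what a desktop machine can handle.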
SPRINT on HECToR
SPRINT has been ported to HECToR, the Cray XT system of the UK National Supercomputing Service. The code has been analysed and optimised to enable it to scale to 512 worker processes and beyond. This new code has been successfully tested and benchmarked on HECToR.
NOTE: In 2014, HECToR will be replaced by ARCHER and we are currently preparing SPRINT for use on this increased performance architecture.
The SPRINT framework is made of two core components:
- An intelligent parallel harness that manages all access to the HPC resources, hiding the complexity from the user.
- A flexible library of parallelised R functions that can be easily extended by adding more functions.
The parallelisation model adopted is a task farm, with a Master process controlling the execution of many Worker processes. All nodes run R, execute the R script and load the SPRINT library. Once the SPRINT library has been loaded, the Master node takes charge of the execution. When the Master node encounters a parallel function, it distributes the work to be done amongst the available processes. The parallel harness uses C and MPI. The Master process coordinates reading from and writing to files, which all Worker processes perform simultaneously and in parallel using MPI/IO.
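The control flow of such a task farm can be sketched as follows. This is an illustration only: SPRINT's harness is written in C and MPI, whereas this sketch uses Python threads as stand-ins for Worker processes, and the names `master`, `worker` and `pearson` are hypothetical. The example task is an all-pairs Pearson correlation, split into blocks of rows.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def worker(task):
    """Worker: correlate its assigned rows against every row of the data."""
    rows, data = task
    return [(i, [pearson(data[i], other) for other in data]) for i in rows]


def master(data, n_workers=2):
    """Master: split rows into blocks, farm them out, collect the results."""
    n = len(data)
    # One strided block of row indices per worker
    blocks = [range(w, n, n_workers) for w in range(n_workers)]
    result = [None] * n
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for block_result in pool.map(worker, [(b, data) for b in blocks]):
            for i, row in block_result:
                result[i] = row
    return result


# Made-up expression values: three "genes" measured across four samples
genes = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
corr = master(genes)
```

Because each block of rows can be processed without reference to any other block, the Master only has to hand out work and assemble the returned rows; the Workers never need to communicate with each other.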
The first full release of the SPRINT R package (currently at version 1.0.5) includes 8 parallelised functions (see top of this page).
Since version 1.0.4, SPRINT can also be run on Mac OS X computers to make use of multi-processor desktop and laptop setups.
SPRINT v 1.0 Functionality & Performance
A clustering function that performs parallel Partitioning Around Medoids (PAM) clustering, based on the pam() function from the cluster R package, has been included in the SPRINT function library.
SPRINT Beta 3 Functionality & Performance
A parallel permutation test function based on the mt.maxT() function from the multtest R package has been added to the SPRINT library of parallelised functions.
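The idea behind permutation-adjusted p-values can be sketched in a few lines. The sketch below makes several simplifying assumptions that differ from mt.maxT(): it uses an absolute difference of group means as the test statistic, it applies the simpler single-step adjustment rather than a step-down procedure, and it enumerates every relabelling exhaustively, which is only feasible for tiny data. The function names and the data are hypothetical.

```python
from itertools import combinations


def mean_diff(row, group1):
    """Absolute difference in means between group1 columns and the rest."""
    g1 = [row[i] for i in group1]
    g2 = [v for i, v in enumerate(row) if i not in group1]
    return abs(sum(g1) / len(g1) - sum(g2) / len(g2))


def maxt_single_step(data, group1):
    """Single-step maxT adjusted p-value for each row of data."""
    group1 = set(group1)
    n_samples = len(data[0])
    observed = [mean_diff(row, group1) for row in data]
    # Every way of relabelling the samples with the same group size
    relabellings = [set(g) for g in combinations(range(n_samples), len(group1))]
    # For each relabelling, keep the largest statistic over all rows
    max_stats = [max(mean_diff(row, g) for row in data) for g in relabellings]
    eps = 1e-12  # tolerance for floating-point ties
    return [
        sum(m >= obs - eps for m in max_stats) / len(max_stats)
        for obs in observed
    ]


# Made-up data: two "genes", six samples; samples 0-2 form the first group
data = [
    [10.0, 11.0, 12.0, 1.0, 2.0, 3.0],  # clearly different between groups
    [5.0, 6.0, 5.0, 6.0, 5.0, 6.0],     # no real group difference
]
adjusted = maxt_single_step(data, group1=[0, 1, 2])
```

Each relabelling is evaluated independently of the others, so the permutations can be divided among processes and the counts combined at the end, which is what makes this kind of test a natural candidate for parallelisation.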
SPRINT Beta 2 Functionality & Performance
This new version of the framework includes an improved HPC harness which is fully scalable, removing any limits on the size of data that can be processed and on the number of cores that can be used, while also improving performance.
SPRINT Beta 1 Functionality & Performance
Demonstrator application for the Pearson correlation function.
This work was supported by the Wellcome Trust grant [086696/Z/08/Z].