SPRINT was originally motivated by microarray analysis, which allows the simultaneous measurement of thousands to millions of genes or sequences across tens to thousands of different samples.
The analysis of the resulting data tests the limits of existing bioinformatics computing infrastructure. A solution to this issue is to use High Performance Computing (HPC) systems, which contain many processors and more memory than desktop computer systems.
Although microarray data form the original basis of SPRINT, most of our parallelised functions are in principle useful in any situation where a very large number of computations are carried out or where computations result in very large memory use.
We have designed and built a framework that allows the addition of parallelised functions to R to enable the easy exploitation of HPC systems. The Simple Parallel R INTerface (SPRINT) is a wrapper around such parallelised functions. Their use requires very little modification to existing sequential R scripts and no expertise in parallel computing.
SPRINT allows R users to concentrate on the research problems rather than the computation, while still allowing exploitation of HPC systems. It is easy to use and with further development will become more useful as more functions are added to the framework.
SPRINT has been ported to the UK National Supercomputing Service, HECToR. The code has been analysed and optimised to enable it to scale to 512 worker processes and beyond. This new code has been successfully tested and benchmarked on the HECToR Cray XT system.
NOTE: In 2014, HECToR will be replaced by ARCHER and we are currently preparing SPRINT for use on this increased performance architecture.
The SPRINT framework is made of two core components:
The parallelisation model adopted is a task farm, with a Master process controlling the execution of many Worker processes. All nodes run R, execute the R script and load the SPRINT library. Once the SPRINT library has been loaded, the Master node takes charge of the execution. When the Master node encounters a parallel function, it distributes the work to be done amongst the available processes. The parallel harness uses C and MPI. The Master process coordinates the reading and writing of files, which is performed simultaneously and in parallel by all Worker processes using MPI-IO.
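The task farm pattern described above can be illustrated with a minimal sketch. This is not SPRINT's harness, which is written in C and MPI. Here Python threads and a queue stand in for MPI processes, and the names `run_task_farm` and `chunk_sum` are hypothetical, chosen only for this illustration.

```python
import queue
import threading

def run_task_farm(tasks, work_fn, n_workers=4):
    """Master hands tasks to workers through a queue; results keep task order."""
    task_q = queue.Queue()
    results = [None] * len(tasks)

    def worker():
        while True:
            item = task_q.get()
            if item is None:  # sentinel: the master has no more work
                break
            idx, payload = item
            results[idx] = work_fn(payload)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()

    # Master: enqueue every task, then one sentinel per worker.
    for idx, payload in enumerate(tasks):
        task_q.put((idx, payload))
    for _ in threads:
        task_q.put(None)
    for t in threads:
        t.join()
    return results

def chunk_sum(chunk):
    """Hypothetical unit of work: sum one chunk of the data."""
    return sum(chunk)

data = list(range(100))
chunks = [data[i:i + 10] for i in range(0, len(data), 10)]
partial = run_task_farm(chunks, chunk_sum, n_workers=3)
total = sum(partial)  # the master combines the partial results
```

Workers pull tasks as they become free, so the load balances itself when some chunks take longer than others. This is the main appeal of a task farm over a fixed, up-front split of the work.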
The first full release of the SPRINT R package (currently version 1.0.5) includes 8 parallelised functions (see top of this page).
Since version 1.0.4, SPRINT can also be run on Mac OS X computers to make use of multi-processor desktops and laptops.
SPRINT v 1.0 Functionality & Performance
A clustering function that performs a parallel Partitioning Around Medoids (PAM), based on the pam() function from the cluster R package, has been included in the SPRINT function library.
SPRINT Beta 3 Functionality & Performance
A parallel permutation test function based on the mt.maxT() function from the multtest R package has been added to the SPRINT library of parallelised functions.
SPRINT Beta 2 Functionality & Performance
This new version of the framework includes an improved, fully scalable HPC harness, which removes any limits on the size of data that can be processed and on the number of cores that can be used, while also improving performance.
SPRINT Beta 1 Functionality & Performance
Demonstrator application for the Pearson correlation function.
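To show why a Pearson correlation matrix lends itself to this kind of parallelisation, the following is a minimal sketch and not SPRINT's implementation. It assumes the rows of the input matrix are the variables to correlate. The helper names `pearson`, `split_rows` and `correlate_rows` are hypothetical. Each worker's share is computed in a plain loop here, whereas a real task farm would run the shares concurrently.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def split_rows(n_rows, n_workers):
    """Assign row indices to workers round-robin."""
    return [list(range(w, n_rows, n_workers)) for w in range(n_workers)]

def correlate_rows(matrix, row_indices):
    """Compute one worker's share: the requested rows of the correlation matrix."""
    return {i: [pearson(matrix[i], row) for row in matrix] for i in row_indices}

# Tiny made-up input: three variables measured on four samples.
expression = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]

shares = split_rows(len(expression), n_workers=2)
corr = {}
for share in shares:  # a real task farm would process these shares in parallel
    corr.update(correlate_rows(expression, share))
```

Each output row depends only on the input matrix, so the shares need no communication with each other. The full result grows with the square of the number of variables, which is why memory, and not only compute time, becomes a limit for large gene sets.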
This work was supported by the Wellcome Trust grant [086696/Z/08/Z].