EIDF GPU Service launched
The EIDF GPU Service supports scalable data processing and artificial intelligence workloads.
This containerised platform, available as part of EIDF’s portfolio of resources, gives users the power of NVIDIA A100 GPUs to drive their own bespoke software environments, accelerating and deepening research insights. By using containers, researchers can define their application runtime environment in a lightweight, portable, and distributable form that runs on scalable platforms.
About the service
The EIDF GPU Service comprises 20 Apollo 6500 GPU servers containing NVIDIA A100 GPUs. It has a total of 160 GPUs, and individual projects have access to up to 12 GPUs.
The service is accessed through a Virtual Machine (VM) set up for each project within the EIDF Data Science Cloud and is operated via Kubernetes, an open-source system for automating the management of containerised applications. Under Kubernetes, users submit jobs directly to one or more GPUs, or use a fraction of a single GPU for an individual job. Jobs using full GPUs can run on pods of up to eight GPUs per job. By default, each GPU allocated to a user has roughly 100GB of host memory and 8 CPU cores associated with it.
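As a rough illustration of the figures above, the following Python sketch builds a minimal Kubernetes Job manifest that requests whole GPUs. It is a sketch under stated assumptions, not the service's documented configuration: the job name, the container image, and the decision to state CPU and memory limits explicitly are hypothetical, and the service may require different or additional settings. `nvidia.com/gpu` is the resource name exposed by the standard NVIDIA device plugin for Kubernetes.

```python
import json

# Figures quoted in the article.
MAX_GPUS_PER_JOB = 8
HOST_MEMORY_GB_PER_GPU = 100  # "roughly 100GB" of host memory per GPU
CPU_CORES_PER_GPU = 8


def build_gpu_job(name: str, image: str, num_gpus: int) -> dict:
    """Return a minimal Kubernetes Job manifest requesting whole GPUs."""
    if not 1 <= num_gpus <= MAX_GPUS_PER_JOB:
        raise ValueError(f"num_gpus must be between 1 and {MAX_GPUS_PER_JOB}")

    # Assumption: CPU and memory are stated explicitly and scale linearly
    # with the GPU count, mirroring the per-GPU defaults in the article.
    limits = {
        "nvidia.com/gpu": num_gpus,
        "cpu": num_gpus * CPU_CORES_PER_GPU,
        "memory": f"{num_gpus * HOST_MEMORY_GB_PER_GPU}G",
    }

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "resources": {"limits": limits},
                        }
                    ],
                }
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical job name and image, used only for illustration.
    job = build_gpu_job("example-job", "example.org/my-image:latest", 2)
    print(json.dumps(job, indent=2))
```

The printed JSON could be saved to a file and submitted with a Kubernetes client, since Kubernetes accepts manifests in JSON as well as YAML.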
Sub-GPU-scale jobs use NVIDIA Multi-Instance GPU (MIG) technology, which allows multiple users to run jobs on a single GPU in complete isolation from other jobs. MIG can partition a GPU to run between two and seven jobs, depending on the target workload. By default, these jobs have memory and CPU scaled to the fraction of the GPU used.
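To make the scaling concrete, here is a small Python sketch of how the default host memory and CPU might shrink for a sub-GPU job. It assumes the scaling is linear in the fraction of the GPU used, starting from the per-GPU defaults of roughly 100GB and 8 cores; the article does not state the exact rule, so the function name and the linear model are illustrative assumptions only.

```python
# Per-GPU defaults quoted in the article.
HOST_MEMORY_GB_PER_GPU = 100
CPU_CORES_PER_GPU = 8


def scaled_defaults(gpu_fraction: float) -> tuple[float, float]:
    """Return (host memory in GB, CPU cores) for a fraction of one GPU.

    Assumption: resources scale linearly with the GPU fraction.
    """
    if not 0 < gpu_fraction <= 1:
        raise ValueError("gpu_fraction must be in the range (0, 1]")
    return (
        HOST_MEMORY_GB_PER_GPU * gpu_fraction,
        CPU_CORES_PER_GPU * gpu_fraction,
    )


if __name__ == "__main__":
    # Two jobs sharing a GPU equally, and seven jobs sharing a GPU equally.
    for jobs_per_gpu in (2, 7):
        memory_gb, cpu_cores = scaled_defaults(1 / jobs_per_gpu)
        print(f"{jobs_per_gpu} jobs per GPU: "
              f"{memory_gb:.1f} GB memory, {cpu_cores:.2f} CPU cores each")
```

Under this assumption, a job using half of a GPU would receive about 50GB of host memory and 4 CPU cores.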
The GPU Service in action
The University of Edinburgh’s School of Informatics was an early adopter of the EIDF GPU Service and helped to shape and validate it. Here users describe two projects which it has enabled.
The rapid increase in the size of Large Language Models (LLMs) leads to substantial computational requirements for their adaptation to specific domains, such as healthcare. Our study introduced a two-step Parameter-Efficient Fine-Tuning framework, which achieves state-of-the-art performance while significantly alleviating the computational requirements for domain adaptation and downstream fine-tuning. Our proposed approach can be implemented by care providers to facilitate automated outcome prediction for clinical triage, workflow optimization, and resource management.
The EIDF GPU Service offers a large number of computing resources, enabling me to execute computationally demanding experiments in parallel. This resource is particularly advantageous when undertaking extensive hyperparameter searches, which are crucial for obtaining conclusive answers to our research questions.
For full details of the project see: https://arxiv.org/abs/2307.03042v2
The problem we are attempting to solve involves proposing a standardised evaluation methodology for comparing methods for the efficient training of large language models. The training of state-of-the-art transformer-based language models currently requires considerable computational resources. The process requires hundreds of thousands of GPU hours, incurs millions of dollars in costs, and consumes as much energy as several average US households use in a year. We aim to make this process more efficient, which could yield substantial cost and energy savings and make these models accessible to a wider range of researchers and practitioners.
The EIDF GPU Service has been indispensable to our research. It has equipped us with the power needed to train these computationally demanding models and to evaluate different efficiency strategies. Compared to earlier or alternative services, the EIDF GPU Service provides superior hardware in the form of NVIDIA A100 GPUs, which significantly expedite the research process.
For full details of the project see: https://arxiv.org/abs/2307.06440
“The EIDF GPU Service is enabling us to explore new methods for training more explainable, robust, and trustworthy AI systems; to design and experiment with models that can learn to search for the information they need for solving arbitrary knowledge-intensive tasks; and to design statistical models for solving challenging biomedical and clinical problems.”
Zero Carbon contribution
The Edinburgh International Data Facility and the GPU Service are operated by EPCC at its state-of-the-art data centre, the Advanced Computing Facility (ACF). The purpose-built power and cooling systems at the ACF, along with the HPE “Advanced Rack Cooling System” (ARCS) racks which contain the GPU Service, ensure that the GPU servers are cooled as efficiently as possible, thereby reducing the overall energy required to make these resources available. In addition, as a shared resource, the GPU Service is more resource-efficient than individual services hosted by individual users.
Both EIDF and the ACF use entirely renewable electricity, contributing towards the University of Edinburgh’s commitment to become zero carbon by 2040.
Access to the EIDF GPU Service
Access routes to EIDF: https://www.ed.ac.uk/edinburgh-international-data-facility/access
To apply to join EIDF and access the GPU Service and other resources, please see the EIDF Portal: https://portal.eidf.ac.uk/
Existing EIDF users should submit a support request for the EIDF GPU Service: https://portal.eidf.ac.uk/queries/submit
GPU Service technical information: https://www.ed.ac.uk/edinburgh-international-data-facility/services/computing/gpu-service
GPU Service user documentation and tutorial: https://epcced.github.io/eidf-docs/services/gpuservice/training/L1_getting_started/
The University of Edinburgh Climate Strategy: https://www.ed.ac.uk/sustainability/programmes-and-projects/climate-strategy/zero-by-2040
EPCC Advanced Computing Facility: www.epcc.ed.ac.uk/hpc-services/advanced-computing-facility