Enhancing The Data Infrastructure
The third work package of the Advanced Care Research Centre.
Research use of routine health and social care data is constrained by its current reliance on data recorded in structured fields. However, information about physical and mental function, care needs, and social support is often recorded in free text. Building on our £4.3M UoE/City Deal investment in the regional DataLoch™, we will develop and deploy tailored Natural Language Processing and AI-based classification methods to enhance existing routine data with functional and other data extracted from free-text clinical records. This will have a wide range of potential applications and benefits for research and for health and social care.
Dissemination Outputs and Impact
WP3 aims to develop, evaluate and routinely implement mining of free-text health and social care records to obtain a complete and deep understanding of people’s medical profiles and circumstances (including diagnoses, social and family history, the presence of geriatric syndromes, functional deficits and frailty markers), place of residence (home, extra-care housing, care home) and household composition (living alone, fitness and frailty of other members of the household). Specifically, we have the following objectives:
- Understand requirements and datasets
- Develop standardised terminologies, a geriatric syndrome ontology and computable phenotypes
- Surface deep data using natural language processing and machine learning
- Build a collaborative community with academics, geriatricians, primary care physicians, palliative care physicians, nurses, allied health professionals, social carers, and regional/national health data initiatives
The work proposed in WP3 will be published in international peer-reviewed conferences and journals.
We have access to world-class linked routine health care data in Scotland and in other UK countries through our leadership role in Health Data Research UK. The £4.3M City Deal investment in the ‘DataLoch’ will enhance access, linkage, data security and the core analytical platform for the regional population of 1.3 million people. This provides a superb foundation to exploit the potential of quantitative data to inform our understanding of health and care, to underpin new prediction tools, and to support the implementation and evaluation of new models of care with the potential to scale and spread across the UK. However, even in centres of excellence, existing routine data is suboptimal for research in later life because critically important data for this context is often recorded only in free-text fields. This is a problem in NHS data, and particularly so in social care data.
Figure 1. WP3 Architecture of Research Design
WP3 will provide a data infrastructure to support data-driven research activities across the ACRC. It comprises four interrelated task areas, as depicted in Figure 1. Lower components provide the essential basis for upper ones, and components on the right provide key support to those on the left.
The first task, probably the most important at the initial stage of WP3, is to understand the ‘data infrastructure’-level requirements for realising ACRC goals. This will be achieved via a forum that brings together all stakeholders. It will have two deliverables: (a) Data infrastructure requirement specifications. Two iterations are planned for this deliverable: the first specification is due at month 6 (M6), and it will be revisited from M30, with the second specification concluded at M42; (b) Dataset identification and access. This deliverable identifies relevant datasets for ACRC and gains access to them. It will be a joint effort with other ACRC collaborators and will be updated periodically.
Task 3.2 is to work on terminology standardisation (ontologies) and computable phenotypes. Standardisation is essential in health and social care data research, and several levels of data standardisation are relevant to the ACRC data infrastructure. It will deliver: (a) A terminology for late life health and social care; (b) A geriatric syndrome ontology; and (c) A phenotype library for geriatric syndrome and frailty.
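As a rough illustration of what an entry in a computable phenotype library might look like, the Python sketch below represents each phenotype as a named set of structured codes plus a set of concept labels surfaced from free text, and reports which phenotypes a record satisfies. Everything in it is an assumption made for illustration: the `Phenotype` class, the `LIBRARY` contents, the placeholder codes such as `X-FALL-1` (not real terminology codes) and the "any code or any concept" matching rule are hypothetical and do not describe the WP3 deliverables.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Phenotype:
    """A hypothetical computable phenotype definition."""
    name: str
    structured_codes: FrozenSet[str]  # placeholder codes, not a real terminology
    text_concepts: FrozenSet[str]     # concept labels surfaced from free text

    def matches(self, codes: Iterable[str], concepts: Iterable[str]) -> bool:
        # Assumed rule: one matching code OR one matching text concept suffices.
        return bool(self.structured_codes & set(codes)) or bool(
            self.text_concepts & set(concepts)
        )


# Illustrative library only; real definitions would come from clinical experts.
LIBRARY = [
    Phenotype("falls", frozenset({"X-FALL-1", "X-FALL-2"}), frozenset({"falls"})),
    Phenotype(
        "delirium",
        frozenset({"X-DELIR-1"}),
        frozenset({"delirium", "acute_confusion"}),
    ),
]


def matching_phenotypes(codes: Iterable[str], concepts: Iterable[str]) -> List[str]:
    """Return the sorted names of all library phenotypes the record satisfies."""
    codes, concepts = list(codes), list(concepts)
    return sorted(p.name for p in LIBRARY if p.matches(codes, concepts))


if __name__ == "__main__":
    # One structured code and one free-text concept, each hitting a phenotype.
    print(matching_phenotypes(["X-FALL-2"], ["acute_confusion"]))
```

Keeping definitions as plain data makes them easy to version, review and share; a real phenotype would usually also need temporal and logical conditions that this sketch leaves out.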
Task 3.3 is to use Natural Language Processing (NLP) to surface deep data from various unstructured data sources to complement structured datasets. Building on the team’s current NLP work, it will deliver: (a) Adapted NLP models on structured reports of medical imaging data for geriatric medicine; (b) New NLP models for late life health and social care; (c) Transfer learning NLP and machine learning models for ACRC research.
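To make the idea of surfacing data from free text concrete, the Python sketch below pulls a few functional and social concepts out of a clinical-style note using a small lexicon and a simple check for a negation cue earlier in the same sentence. It is a deliberately simple rule-based stand-in and not the team's NLP models: the `LEXICON` entries, the negation cue list and the `Mention` structure are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical lexicon: concept label -> surface pattern (illustrative only).
LEXICON = {
    "falls": r"\b(?:fall|falls|fell)\b",
    "lives_alone": r"\blives? alone\b",
    "mobility_aid": r"\b(?:zimmer frame|walking stick|walking frame)\b",
}

# Assumed negation cues; a cue negates a mention if it appears earlier in the
# same sentence with no intervening '.', ';' or ','.
NEGATION = re.compile(r"\b(?:no|not|denies|without)\b[^.;,]*$", re.IGNORECASE)


@dataclass(frozen=True)
class Mention:
    concept: str
    text: str
    negated: bool


def extract_mentions(note: str) -> List[Mention]:
    """Find lexicon concepts in a note and flag those preceded by a negation cue."""
    mentions = []
    # Naive sentence split: break after '.' or ';' followed by whitespace.
    for sentence in re.split(r"(?<=[.;])\s+", note):
        for concept, pattern in LEXICON.items():
            for m in re.finditer(pattern, sentence, re.IGNORECASE):
                negated = bool(NEGATION.search(sentence[: m.start()]))
                mentions.append(Mention(concept, m.group(0), negated))
    return mentions


if __name__ == "__main__":
    note = "Lives alone. No falls in the last year. Uses a zimmer frame."
    for mention in extract_mentions(note):
        print(mention)
```

The extracted mentions could then be stored alongside structured fields for the same person. Real clinical text needs far more than this (misspellings, abbreviations, temporality, who the statement refers to), which is where trained models come in.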
Task 3.4 is to establish an active research community for co-design and collaboration on the technical work in WP3. The community will comprise leads of other ACRC work packages, health and social care professionals, NLP research groups, national health data initiatives (HDR UK), regional health data initiatives, and the biomedical AI / MRC precision medicine CDTs.
- We work closely with UK clinical NLP groups under the HDR UK text analytics project, including King’s College London, University College London, University of Birmingham, Cambridge University, Swansea University, Manchester University and University of Sheffield.
- As part of Edinburgh Clinical NLP Group’s collaborations, we work closely with the Mayo Clinic.
- We will seek collaborations with other top NLP groups, such as the Stanford NLP group, and in particular establish connections with industry players in clinical NLP such as DeepMind, Facebook and Amazon.
Core Team Members
WP3 is led by Drs Honghan Wu and Beatrice Alex, who both lead the Clinical NLP Group at the University of Edinburgh. Dr Wu is a Lecturer in Health Informatics at UCL, London and a Rutherford Research Fellow working on clinical data science at the Usher Institute, University of Edinburgh. Dr Alex is a Chancellor’s Fellow at the Edinburgh Futures Institute and the School of Informatics at the University of Edinburgh, and a Turing Fellow at the Alan Turing Institute.