Asthma UK Centre for Applied Research


Data-driven approaches to identifying asthma subtypes - Elsie Horne

Elsie Horne explains the rationale and some key findings behind a review looking at the challenges of clustering multimodal clinical data

In May 2020 a group of AUKCAR researchers from the University of Edinburgh had their review ‘Challenges of clustering multimodal clinical data: a review of applications in asthma subtyping’ published in JMIR Medical Informatics. Elsie Horne, Asthma UK Centre for Applied Research PhD student and first author of the paper, explains the rationale for this review and some key findings in this blog.

Asthma subtypes

It is widely accepted that asthma is a variable disease with multiple underlying subtypes. Despite decades of research in the area, there is still no consensus on these subtypes, the underlying changes they cause to the body or how to diagnose them. At the time of writing, the Asthma UK website lists 10 different types of asthma, but cautions the reader that ‘it can be difficult to know which type of asthma you have’.

Can we take a data-driven approach?

As with many unanswered clinical questions, researchers are investigating whether answers lie in existing data, and can be teased out using sophisticated data-driven techniques. The techniques that have been used previously in efforts to identify asthma subtypes are generally described as unsupervised, so-called because the pattern that they seek to identify is currently unknown. Our review focuses on a technique called cluster analysis.

What is cluster analysis?

Cluster analysis is a technique for identifying groups in data. It’s a large area of machine learning, with thousands of existing algorithms and new ones frequently emerging from the research community. Despite this, the application of cluster analysis to clinical problems is typically restricted to two popular methods: k-means and Ward’s method.
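To give a flavour of what such a method does, here is a minimal sketch of Ward’s method in plain Python. It is a naive illustration rather than the implementation used in any of the reviewed studies: every point starts as its own cluster, and the pair of clusters whose merge least increases the within-cluster sum of squares is merged until k clusters remain. The function names and the toy points are made up for this example.

```python
from itertools import combinations


def centroid(cluster):
    """Mean of a list of equal-length numeric tuples."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_clusters(points, k):
    """Naive agglomerative clustering: merge the cheapest pair until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # combinations() yields index pairs with i < j, so deleting j is safe.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up continuous measurements on a common scale (illustration only).
points = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
groups = ward_clusters(points, 2)
for group in groups:
    print(sorted(group))
```

Note that the merge cost is built entirely from squared Euclidean distances between cluster means, which only makes sense when every variable is numeric and comparably scaled.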

Why did we carry out this review?

There is a key issue here: both k-means and Ward’s are best suited to continuous data (data which can be measured on a common scale: for example, weight, height etc.). As we explain in the paper, clinical data are often mixed (containing both continuous variables and categorical variables, such as sex or age group) and are rarely measured on a common scale. We describe such data as multimodal. The aim of our review was to investigate this mismatch between the methods and the data to which they are applied.
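One way to see the mismatch is to compare records that mix variable types. The sketch below uses Gower’s dissimilarity, a widely used measure for mixed data; it is shown here as an illustration only and is not named in this blog as the review’s recommendation. Continuous variables contribute their absolute difference divided by the observed range, and categorical variables contribute 0 for a match and 1 for a mismatch. The records, variable names and the `gower` helper are all made up for this example, and the code assumes every continuous variable has a non-zero range.

```python
def gower(a, b, ranges, categorical):
    """Gower dissimilarity between two records (dicts sharing the same keys).

    Continuous variables contribute |a - b| / range; categorical variables
    contribute 0 if equal and 1 otherwise. The result is the mean over variables.
    """
    total = 0.0
    for key in a:
        if key in categorical:
            total += 0.0 if a[key] == b[key] else 1.0
        else:
            total += abs(a[key] - b[key]) / ranges[key]
    return total / len(a)


# Made-up records for illustration only (not real patient data).
patients = [
    {"weight_kg": 60.0, "height_m": 1.60, "sex": "F"},
    {"weight_kg": 80.0, "height_m": 1.80, "sex": "M"},
    {"weight_kg": 70.0, "height_m": 1.60, "sex": "F"},
]
categorical = {"sex"}

# Observed range of each continuous variable (assumed to be non-zero).
ranges = {
    key: max(p[key] for p in patients) - min(p[key] for p in patients)
    for key in patients[0]
    if key not in categorical
}

d01 = gower(patients[0], patients[1], ranges, categorical)  # 1.0: maximally different on every variable
d02 = gower(patients[0], patients[2], ranges, categorical)  # about 0.167: differs only in weight
print(d01, d02)
```

A plain Euclidean distance on the raw values could not use the sex variable at all without first recoding it, and would be dominated by weight simply because kilograms produce larger numbers than metres.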

What were the key findings?

We identified 63 studies that had used cluster analysis to identify subtypes of asthma using multimodal clinical data. This demonstrates that cluster analysis is frequently applied in this context, and should therefore be a topic of interest for the Asthma UK Centre for Applied Research methodology group. Our three key findings from the paper were:

  1. Of the 63 studies, 55 used either Ward’s method or k-means. There are certain pre-processing measures which may make multimodal clinical data more compatible with methods such as k-means and Ward’s. Our review found that these were rarely done or documented in practice.
  2. Small sample size was a common issue. Unlike with many statistical methods applied in medicine, there is no well-established rule for finding an appropriate sample size for a cluster analysis. However, as we explain in the paper, a small sample size in cluster analysis can greatly compromise the reliability of the results.
  3. Few studies appropriately tested the quality of their results. Most methods of cluster analysis will always find some way to group the data, regardless of whether these groups are reproducible, stable or clinically meaningful. Testing the quality of the results is therefore crucial.
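The third point can be illustrated with a small sketch. Below, a tiny one-dimensional k-means (k = 2) is run on two made-up datasets: one with two genuinely separate groups, and one drawn evenly at random with no group structure at all. The algorithm returns two clusters in both cases. The average silhouette width, one of several possible internal quality measures and chosen here purely as an example, is then used to compare the results. The helper names are hypothetical, and the code assumes the data contain at least two distinct values.

```python
import random


def two_means_1d(xs, iters=50):
    """Tiny 1-D k-means with k=2, initialised at the minimum and maximum values."""
    lo, hi = min(xs), max(xs)
    labels = []
    for _ in range(iters):
        # Assign each point to the nearer centre, then recompute both centres.
        labels = [0 if abs(x - lo) <= abs(x - hi) else 1 for x in xs]
        left = [x for x, lab in zip(xs, labels) if lab == 0]
        right = [x for x, lab in zip(xs, labels) if lab == 1]
        lo, hi = sum(left) / len(left), sum(right) / len(right)
    return labels


def mean_silhouette(xs, labels):
    """Average silhouette width for a two-cluster labelling of 1-D data."""
    scores = []
    for i, (x, lab) in enumerate(zip(xs, labels)):
        own = [abs(x - y) for j, (y, other_lab) in enumerate(zip(xs, labels))
               if other_lab == lab and j != i]
        other = [abs(x - y) for y, other_lab in zip(xs, labels) if other_lab != lab]
        if not own:  # a singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a, b = sum(own) / len(own), sum(other) / len(other)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
structured = [rng.gauss(0, 0.5) for _ in range(30)] + [rng.gauss(10, 0.5) for _ in range(30)]
structureless = [rng.uniform(0, 10) for _ in range(60)]

labels_structured = two_means_1d(structured)
labels_structureless = two_means_1d(structureless)
sil_structured = mean_silhouette(structured, labels_structured)
sil_structureless = mean_silhouette(structureless, labels_structureless)

print("structured:", len(set(labels_structured)), "clusters, silhouette", round(sil_structured, 2))
print("structureless:", len(set(labels_structureless)), "clusters, silhouette", round(sil_structureless, 2))
```

Both runs report two clusters, so the number of clusters found says nothing about whether real groups exist. A single internal score like this one is only a starting point: it does not address whether the groups are reproducible, stable or clinically meaningful.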

Final comments

For people who do not consider themselves methodologists, I appreciate that this paper might not be top of your to-do list! However, I hope that this blog has made you aware of some of the challenges in identifying asthma subtypes, so that you can take a more critical approach when interpreting studies like these.

For methodologists, I hope you find the time to read the paper. I welcome feedback and comments: clustering of multimodal clinical data comprises a large component of my PhD, so this is ongoing work.

Read the article

Read the article in JMIR Medical Informatics.

Cite as

Horne E, Tibble H, Sheikh A, Tsanas A, Challenges of Clustering Multimodal Clinical Data: Review of Applications in Asthma Subtyping, JMIR Med Inform 2020;8(5):e16452 DOI: 10.2196/16452 PMID: 32463370 PMCID: 7290450

See Elsie Horne's Student profile