Ethnic misclassification and risk of severe COVID-19 in minority ethnic groups in Scotland

October 2023: A study published in the Journal of Public Health examines the quality of ethnicity coding within national public health datasets in Scotland.

Quality of ethnicity data within Scottish health records and implications of misclassification for ethnic inequalities in severe COVID-19: a national linked data study

Amele, S; McCabe, R; Kibuchi, E; Pearce, A; et al.

Journal of Public Health

Published online on 19 October 2023

Available online at: https://doi.org/10.1093/pubmed/fdad196

Summary in plain English


Since the start of the COVID-19 pandemic, some minority ethnic groups have been disproportionately affected, drawing greater attention to ethnic inequalities in health. However, to date, ethnicity data have had limited use in research because of poor quality and completeness, specifically in how ethnic groups are classified in the UK.

Why did we carry out this work?

Previous research has reported poorer quality ethnicity data in health records for minority ethnic groups than for the White majority. However, that research combined different ethnic groups into broad categories, which can mask important differences between these groups.

We used severe COVID-19 (i.e. COVID-19-related hospitalisation or death) as an example to explore the quality and accuracy of ethnicity classification, or coding, in health records. Classification involves giving specific codes to ‘groups’ of patients based on common characteristics (e.g. ethnicity). We wanted to find out how ethnicity coding affects what we observe about ethnic health inequalities. The findings should help improve understanding of health inequalities among minority groups in Scotland, particularly for the Gypsy/Traveller group.

What data did we use?

We compared data from the EAVE II platform with data from the 2011 Scottish Census. From EAVE II, the following datasets were used:

  • Public Health Scotland Ethnicity Look-up (PHS-EL) for classification of ethnicity
  • Electronic Communication of Surveillance in Scotland (ECOSS) for SARS-CoV-2 testing data
  • Scottish Morbidity Record 01 (SMR-01) for hospitalisations
  • National Records of Scotland death registry (NRS deaths)
  • Accident & Emergency (A&E) services.

We included data on individuals who:

  • were 16 years of age and older,
  • had a health record in the Census and Community Health Index (CHI) registers, and
  • lived in Scotland on 1 March 2020, the date of the first laboratory-confirmed case of COVID-19 in Scotland.

We defined COVID-19 based on a positive PCR laboratory test result and classified ethnicity using 16 subgroups, which we reduced to five broad ethnicity groups or categories as follows:

  • White: includes White Scottish, White Other British, White Irish, White Gypsy/Traveller, White Polish, Other White subgroups
  • Mixed or Multiple Ethnicity: hereafter referred to as Mixed
  • Asian: Pakistani (includes Pakistani Scottish, Pakistani British), Indian (includes Scottish Indian, Indian British), Bangladeshi (includes Bangladeshi Scottish, Bangladeshi British), Chinese (includes Chinese Scottish, Chinese British), Other Asian
  • African, Caribbean, or Black: African (includes African Scottish, African British), Caribbean (includes Caribbean Scottish, Caribbean British), Black (includes Black Scottish, Black British)
  • Other ethnicity: Arab (includes Arab Scottish, Arab British), Other ethnicity  
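
The reduction from subgroups to broad groups amounts to a lookup table. The following is a minimal Python sketch, not the study's own code. It assumes the subgroup labels listed above and treats Caribbean and Black as one combined subgroup, an assumption made here so that the list totals 16, in line with the 'Caribbean or Black' subgroup reported in the findings. The names BROAD_GROUP and to_broad_group are hypothetical.

```python
# Illustrative lookup from ethnicity subgroup to broad group.
# Labels follow the list above; "Caribbean or Black" is combined by assumption.
BROAD_GROUP = {
    "White Scottish": "White",
    "White Other British": "White",
    "White Irish": "White",
    "White Gypsy/Traveller": "White",
    "White Polish": "White",
    "Other White": "White",
    "Mixed or Multiple Ethnicity": "Mixed",
    "Pakistani": "Asian",
    "Indian": "Asian",
    "Bangladeshi": "Asian",
    "Chinese": "Asian",
    "Other Asian": "Asian",
    "African": "African, Caribbean, or Black",
    "Caribbean or Black": "African, Caribbean, or Black",
    "Arab": "Other ethnicity",
    "Other ethnicity": "Other ethnicity",
}


def to_broad_group(subgroup):
    """Return the broad group for a subgroup, or None if it is missing or unknown."""
    return BROAD_GROUP.get(subgroup)
```

For example, to_broad_group("White Gypsy/Traveller") gives "White", which is how a small subgroup ends up inside a much larger broad group.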

Using the above data, we measured quality in EAVE II data on:

  • Level of missingness (missing, unknown, or not provided data)
  • Misclassification (incorrect coding of an individual’s ethnicity, compared to the Census)
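
As a rough illustration of these two measures, the Python sketch below computes them for a handful of made-up records. It is not the study's code. It assumes that a missing code is represented as None and that misclassification is calculated only among people with a code in both sources; the summary does not state how the study handled this, so treat that choice as an assumption.

```python
def missingness(health_codes):
    """Share of individuals with no usable ethnicity code in the health dataset."""
    missing = sum(1 for code in health_codes if code is None)
    return missing / len(health_codes)


def misclassification(census_codes, health_codes):
    """Share of individuals whose health-record code differs from the Census code.

    Assumption: only individuals with a recorded code in both sources are compared.
    """
    pairs = [(c, h) for c, h in zip(census_codes, health_codes)
             if c is not None and h is not None]
    wrong = sum(1 for c, h in pairs if c != h)
    return wrong / len(pairs)


# Made-up records for five people (not study data).
census = ["White Scottish", "Pakistani", "White Gypsy/Traveller", "Indian", "White Scottish"]
health = ["White Scottish", "Pakistani", "White Scottish", None, "White Scottish"]

print(missingness(health))                # 1 of 5 codes is missing
print(misclassification(census, health))  # 1 of 4 comparable pairs disagrees
```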

Individuals were followed from 1 March 2020 and up until either:

  • their first experience of severe COVID-19,
  • death from any cause, or
  • 1 March 2022
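
The follow-up rule above means each person's follow-up ends at the earliest of the three dates. A minimal Python sketch under that reading, assuming any event dates fall on or after 1 March 2020; the function names are hypothetical and are not taken from the study:

```python
from datetime import date

STUDY_START = date(2020, 3, 1)
STUDY_END = date(2022, 3, 1)


def follow_up_end(severe_covid_date=None, death_date=None):
    """Earliest of first severe COVID-19, death from any cause, and study end."""
    candidates = [d for d in (severe_covid_date, death_date) if d is not None]
    candidates.append(STUDY_END)
    return min(candidates)


def follow_up_days(severe_covid_date=None, death_date=None):
    """Days of follow-up counted from the study start date."""
    return (follow_up_end(severe_covid_date, death_date) - STUDY_START).days
```

A person with no severe COVID-19 and no recorded death is followed for the whole period, so follow_up_end() returns 1 March 2022.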

Statistical analysis was used to assess the level of misclassification in datasets compared to the Census and estimate the risk of severe COVID-19 by ethnic group. Characteristics including age and sex were considered. 

What did we find?

We found that missingness and misclassification of ethnicity data were higher for all minority ethnic subgroups compared to the White Scottish majority subgroup. We also found the data underestimated the risk of severe COVID-19. These findings are important because poor quality ethnicity data in health records can create bias and reduce the ability to review, understand and address ethnic health inequalities.

Misclassification and Missingness (Census vs PHS-EL)

  • 8.5% of individuals in the Census were misclassified by the PHS-EL data.
  • Misclassification was higher for all minority ethnic subgroups (12.5% to 69.1%) compared to the White Scottish majority (5.1%), with the highest in Caribbean or Black (49.6%) and White Gypsy/Traveller (69.1%) subgroups.
  • When grouped into the five broad categories, misclassification was highest among the Mixed group (44.2%) and lowest among the White group (0.3%).
  • 30% of individuals had missing ethnicity data in the PHS-EL, with the highest among the White Other British subgroup (39%) and lowest among the Pakistani subgroup (17%).
  • When grouped into the five broad categories, missingness was over 22% for all groups.

Risk of severe COVID-19

  • Using Census coding, the White Gypsy/Traveller subgroup had an increased risk of severe COVID-19 but we did not find this within PHS-EL data.
  • The Bangladeshi subgroup had an elevated risk using the Census coding. However, the estimated risk was lower when using PHS-EL data.
  • PHS-EL data often overestimated severe COVID-19 risk compared to the Census, particularly for the Caribbean or Black subgroups.
  • We observed no major differences in risk for the five broad groups, except for the Other Ethnicity group whose risk was increased for PHS-EL compared to the Census.

Sensitivity and positive predictive value (PPV) dataset comparisons (Census vs SMR-01; Census vs A&E; Census vs NRS)

  • The White Scottish subgroup had the highest sensitivity and PPV for all dataset comparisons.
  • The Mixed group had the lowest sensitivity in the NRS deaths comparison, while the White Gypsy/Traveller group had the lowest sensitivity in SMR-01 and A&E.
  • The Mixed group had the lowest PPV for both NRS deaths and A&E.
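
Sensitivity and PPV can be read as two different questions about the same pair of codes. The Python sketch below uses the standard definitions with the Census as the reference; it is an illustration on made-up records, not the study's code, and it assumes people with a missing code have already been excluded.

```python
def sensitivity_and_ppv(census_codes, dataset_codes, group):
    """Sensitivity and PPV for one ethnic group, with the Census as reference.

    Sensitivity: of people the Census records as `group`, the share the
    dataset also records as `group`.
    PPV: of people the dataset records as `group`, the share the Census
    also records as `group`.
    Returns None for a measure whose denominator is zero.
    """
    pairs = list(zip(census_codes, dataset_codes))
    both = sum(1 for c, d in pairs if c == group and d == group)
    in_census = sum(1 for c, _ in pairs if c == group)
    in_dataset = sum(1 for _, d in pairs if d == group)
    sens = both / in_census if in_census else None
    ppv = both / in_dataset if in_dataset else None
    return sens, ppv


# Made-up records for five people (not study data).
census = ["Mixed", "Mixed", "White Scottish", "White Scottish", "White Scottish"]
dataset = ["Mixed", "White Scottish", "White Scottish", "White Scottish", "Mixed"]

print(sensitivity_and_ppv(census, dataset, "Mixed"))
```

In this toy example, one of the two people the Census records as Mixed is also coded Mixed in the dataset, and one of the two people the dataset codes as Mixed is Mixed in the Census, so both measures are one half.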

Why are these results important?

Our study highlights the uncertain quality of ethnicity coding and how this affects the observation of ethnic health inequalities. We demonstrated that combining ethnic groups carries a higher risk of misclassification, particularly for the White Gypsy/Traveller group, because when this group is included in the broad ‘White’ group, outcomes are dominated by the White Scottish majority subgroup. This is also the first study to show that this minority ethnic group has an increased risk of severe COVID-19 in Scotland.

The main limitation of the study is that some data may be out of date as we used the 2011 Census. Some individuals’ ethnicity may be more accurately represented in PHS-EL and other datasets.

Overall, our study provides useful insight into ethnicity coding in health records and suggests linkage across datasets may offer improved ethnicity data quality in Scotland.


A Patient and Public Involvement Coordinator and Public Advisory Group Co-Lead collaborated with the research team to help with research questions and interpretation of findings. This summary was also reviewed by Patient Advisory Group members and team members directly involved in the study.