Balancing data power with data privacy
Slamming sets of public information into one another can reveal powerful patterns and insights. But can it be done without damaging the public's trust that their data is being used sensitively? An Edinburgh team has some ingenious solutions.
Imagine a doctor wants to carry out research into the relationship between respiratory illness and air quality. The benefits of investigating this are obvious, but accessing and analysing the necessary datasets is far from easy.
Our hypothetical medic would need access to population healthcare data including sensitive details about where patients live. They would then need to cross-reference this with environmental data about air quality. And to be thorough, the researcher might want to check for sources of pollution in an area, such as traffic congestion or industrial premises.
Such datasets may be owned by different organisations and hosted at different physical sites. They may be subject to different regulations around access and use.
In short, combining datasets to draw new research insights is, well, complicated.
It is a challenge that Professor Chris Dibben and his team at the University of Edinburgh’s Administrative Data Research – Scotland (ADR-S) have been addressing for the last fifteen years.
The team has established six research programmes that give researchers novel ways to access data and new approaches to linking sensitive personal information, in support of professional training and evidence-based policymaking.
These programmes include establishing the legal concept of functional anonymization, which means determining whether data are personal or non-personal; SafePODs, a physical solution for digital security; and the memorably titled Synthpop, software and statistical methods for producing synthetic population data.
It may sound esoteric and niche, but ADR-S’s work is keeping the public’s data safe while opening up new vistas for researchers. This careful balancing act has potentially huge benefits for all.
To begin with the first of this trio of breakthroughs, how does one go about creating a legal concept? Professor Dibben explains. “It is establishing a concept that captures an aspect of the law, that then can be used to argue for the legality of an approach to data holding and management – for example a system making data available to researchers. We produced descriptions of the concept and approach in papers and reports and this allows organisations to use the approach, arguing its legality. Our work has also featured in guidance by networks advising on the ‘safe’ research.”
A safe space
To create a physical safe haven for accessing sensitive data, Professor Dibben’s team came up with SafePODs. These secure cubicles work on a similar principle to embassies, which countries use as diplomatic spaces on foreign soil. They allow a data centre to have control over a space, or a ‘pod’, in another location. For example, a data centre in London could let a researcher access files via a SafePOD in Inverness.
Prof Dibben explains: “SafePODs are about enabling the data controllers to be confident that legitimate end users are doing only what they expect them to do. This means the provision of assurances that users are not re-identifying individuals, or taking data out of the environment or bringing in new data.”
SafePOD users are prohibited from taking phones into a pod and must sign up to a written agreement beforehand. The obvious advantage of the pods is that users do not need to travel as far to access the safe setting. This goes some way to addressing inequalities in access to, and use of, data in different parts of the country.
Top of the pops
Another key part of ADR-S’s work is ‘Synthpop’, which can be used for any type of sensitive data about real individuals, such as data from the criminal justice system.
Synthpop mimics the broad population characteristics from a real dataset in a way that preserves the ability to draw accurate conclusions but without disclosing real data to researchers. For example, there may be a certain number of men and women, or a particular age range, within a Synthpop dataset, but the other details will be artificial.
“It’s very useful in training or teaching situations,” says Prof Dibben. “For example, it avoids having to ask fifty people to sign a data sharing agreement for a training seminar. There aren’t many disadvantages, though we do make clear to researchers that they should not use it as a basis for publication of final results in their research papers.”
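To give a flavour of how synthetic population data can mimic broad characteristics without handing over real records, here is a minimal Python sketch. It is an illustration only and not Synthpop's own method: the column names, the toy records and the functions `fit_sequential` and `synthesise` are all hypothetical. It builds each artificial record one column at a time, sampling every value according to how often it appears in the real data alongside the values already drawn. A toy like this offers no formal protection against disclosure.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy "real" dataset; column names and values are invented.
COLUMNS = ["sex", "age_band", "asthma"]
real = (
    [{"sex": "F", "age_band": "16-39", "asthma": "no"}] * 3
    + [{"sex": "F", "age_band": "16-39", "asthma": "yes"}] * 1
    + [{"sex": "F", "age_band": "40+", "asthma": "yes"}] * 2
    + [{"sex": "M", "age_band": "16-39", "asthma": "no"}] * 3
    + [{"sex": "M", "age_band": "40+", "asthma": "yes"}] * 1
)


def fit_sequential(records, columns):
    """For each column, count its values given the values of earlier columns."""
    tables = []
    for i, col in enumerate(columns):
        table = defaultdict(Counter)
        for rec in records:
            context = tuple(rec[c] for c in columns[:i])
            table[context][rec[col]] += 1
        tables.append(table)
    return tables


def synthesise(tables, columns, n, seed=0):
    """Draw n artificial records, one column at a time, from the fitted counts."""
    rng = random.Random(seed)
    synthetic_records = []
    for _ in range(n):
        rec = {}
        for i, col in enumerate(columns):
            # The context was itself sampled from observed values,
            # so it always has counts in the fitted table.
            context = tuple(rec[c] for c in columns[:i])
            counts = tables[i][context]
            values = list(counts)
            weights = [counts[v] for v in values]
            rec[col] = rng.choices(values, weights=weights, k=1)[0]
        synthetic_records.append(rec)
    return synthetic_records


tables = fit_sequential(real, COLUMNS)
synthetic = synthesise(tables, COLUMNS, n=1000, seed=42)

# Compare the share of women in the real and synthetic data.
real_share = sum(r["sex"] == "F" for r in real) / len(real)
synth_share = sum(r["sex"] == "F" for r in synthetic) / len(synthetic)
print(f"share of women: real={real_share:.2f}, synthetic={synth_share:.2f}")
```

The synthetic rows follow the overall patterns of the toy data, such as the balance of men and women, while each row is a random draw and not a copy of a particular person's record.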
Of course, researchers at the University of Edinburgh have been using and innovating with data for as long as it has been a centre for academic research. But data became a major strategic focus for the University in 2018, when it launched the Data-Driven Innovation initiative as part of the City Region Deal, with its ambition to position Edinburgh as the data capital of Europe.
Even before this, the advent of supercomputing and artificial intelligence in recent decades had vastly increased the potential of data analytics. Unsurprisingly, the need for robust standards and regulations to underpin it has grown too.
The research that eventually led to the creation of ADR-S began in the mid-2000s with funding from the Economic and Social Research Council. Since then, the widening use of public data has come under increased scrutiny, both from campaign groups and because scandals such as Facebook/Cambridge Analytica have brought the issue into the media spotlight.
“We focus on building trust and confidence with the public and stakeholders,” says Prof Dibben. “It is very important to engage with and have discussion with such groups.” ADR-S has always had a panel of members of the public to explore issues and to help researchers understand what the public are comfortable with in this area of research.
Overall, he believes the political mood for such work is positive. Governments see the potential applications for it in improving the provision of services and for policymaking. UK government agencies, the Scottish government, police forces and the National Health Service have all made use of the work done by ADR-S, including in the response to Covid-19.
There is even wider potential for addressing a range of interesting social questions, according to the team. For example, how well are children doing in schools and what is the relationship between the school system and social mobility?
For the moment, at least, it seems that most of the innovation in this area is with public data. “My centres manage a lot of data – and it’s all from public institutions,” says Prof Dibben. “There is no data from private organisations at the moment, however this is something that we may explore in the future. Importantly we only, legally, do ‘public benefit research’ and we cannot, therefore, support work that is purely in support of a for-profit organisation.”
Where next for a project that has already achieved so much? “For the UK, we want to support bringing data together for greater public benefit and to explore whether SafePODs can make European data more accessible to UK researchers and vice versa,” says Prof Dibben.
ADR-S looks set to continue innovating and improving standards and regulation in the access and linking of datasets for some time to come.
Image credits: padlock - Yuichiro Chino/Getty