Institute of Genetics and Cancer

Need to derive more representative and annotated molecular datasets from diverse populations

Edinburgh researchers led a study that emphasises the need for more diverse and better annotated cancer transcriptomics datasets: August 2020

Gene expression heat map
Heatmap identification of gene co-expression patterns across different samples. Each column contains the measurements for gene expression change for a single sample. Relative gene expression is indicated by colour: high-expression (red), median-expression (white) and low-expression (blue). Genes and samples with similar expression profiles can be automatically grouped (left and bottom trees). Samples may be different individuals, tissues, environments or health conditions. In this example, expression of gene set 1 is high and expression of gene set 2 is low in samples 1, 2, and 3. [Thomas Shafee, CC BY 4.0, modified by A. Welman]

Transcription is the first of several steps of DNA-based gene expression, in which a particular segment of DNA is copied into RNA (especially mRNA) by the enzyme RNA polymerase. Quantitative analysis of RNA molecules in the cell can provide valuable information on the “activity state” and “expression patterns” of different genes. In fact, analysis of cellular RNA plays an increasingly important role in anticancer drug discovery, cancer diagnosis and the design of personalised therapies.

As pools of RNA transcripts in cells are also known as transcriptomes, the techniques used to study them are often referred to as transcriptomics technologies or simply transcriptomics. Transcriptomics is a powerful research tool. It is frequently used to analyse clinical samples from cancer patients, providing important insights into the molecular characteristics of their tumours.

An ever-increasing number of cancer transcriptomics datasets are now available, enabling researchers to perform highly informative retrospective gene expression analyses. Thus, gene expression data from studies utilising cancer cell lines or animal models can be compared with clinical datasets to evaluate how reliably the model systems recapitulate the disease. The clinical datasets can also be used to assess associations between putative oncogenes or tumour suppressors and different signalling pathways or clinical characteristics, and so to examine whether certain subgroups of tumours have elevated or reduced expression of particular genes. It is therefore important that the spectrum of available clinical cancer transcriptomics datasets accurately reflects the spectrum of tumours at the population level, and that these clinical datasets are annotated with the information needed to maximise their utility.

Yanping Xie, Jonine Figueroa, Andrew Sims - First author and corresponding authors of the study published in npj Breast Cancer

In a recent study titled “Breast cancer gene expression datasets do not reflect the disease at the population level”, published in the journal npj Breast Cancer, investigators from the University of Edinburgh, UK and the National Cancer Institute, USA describe their conclusions from analysing 70 breast cancer datasets accounting for 16,130 patients from 20 countries across 5 continents. The work, led by Doctor Andrew Sims and Doctor Jonine Figueroa from Edinburgh Cancer Research Centre, demonstrated that publicly available breast cancer gene expression datasets tend to be enriched for high-grade, estrogen receptor-negative (ER-) tumours from patients of European ancestry. The results of the study emphasise the need to derive more representative and better annotated molecular datasets from diverse populations. The paper also suggests possible ways to achieve this.

The work was supported by funding from Cancer Research UK, Breast Cancer Now, Wellcome Trust, UKRI Global Challenges Research Fund and the National Cancer Institute.

Related Links:

Article in npj Breast Cancer:

Doctor Andrew Sims Group website:

Doctor Jonine Figueroa Group website:

Information about breast cancer:

Facing breast cancer:

Information about transcription:
