Clinical Natural Language Processing Research Group

7th Dec 2021 - 2pm - Cameron Fairfield

Title: ToKSA: Tokenised Key Sentence Annotation - An Efficient Method to Approximate Ground Truth for NLP

Abstract: Identifying phenotypes and pathology from free text is an essential task for clinical work and research. Natural language processing (NLP) is a key tool for processing free text at scale. Developing and validating NLP models requires labelled data. Labels are generated through time-consuming and repetitive manual annotation and are hard to obtain for sensitive clinical data. To enable efficient annotation, we describe a novel approach, tokenised key sentence annotation (ToKSA), for annotating radiology reports. Firstly, individual sentences are grouped together into a term-frequency matrix. Annotation of key (i.e. the most frequently occurring) sentences is then used to generate labels for multiple reports simultaneously. We compare ToKSA-derived labels to those generated from annotating full reports and demonstrate that labels from ToKSA can be used to train an accurate document classifier using convolutional neural networks. Using ultrasonography reports and gallstone disease, we demonstrate that by annotating only 2,000 frequent sentences we are able to generate labels for 85,177 reports with an accuracy of 99.2%.
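
The general idea in the abstract (count sentences across reports, annotate only the most frequent ones, then propagate those labels to whole reports) can be illustrated with a minimal Python sketch. This is not the speaker's implementation. The function names, the naive full-stop sentence splitter, and the rule that a report is positive if any of its sentences is positive are all assumptions made for illustration, and the reports are toy data.

```python
import re
from collections import Counter


def split_sentences(report):
    # Assumption: a naive splitter on full stops and newlines, lowercased.
    # A real pipeline would use a proper clinical sentence tokeniser.
    return [s.strip().lower() for s in re.split(r"[.\n]+", report) if s.strip()]


def key_sentences(reports, top_n):
    # Count in how many reports each distinct sentence appears and
    # return the top_n most frequent ones as candidates for annotation.
    counts = Counter(s for r in reports for s in set(split_sentences(r)))
    return [sentence for sentence, _ in counts.most_common(top_n)]


def label_reports(reports, sentence_labels):
    # Assumed aggregation rule (hypothetical, not taken from the talk):
    #   1    if any annotated sentence in the report is positive,
    #   0    if every sentence is annotated and none is positive,
    #   None if the report cannot be labelled from the annotated sentences.
    labels = []
    for report in reports:
        sentences = split_sentences(report)
        if any(sentence_labels.get(s) == 1 for s in sentences):
            labels.append(1)
        elif all(s in sentence_labels for s in sentences):
            labels.append(0)
        else:
            labels.append(None)
    return labels


# Toy reports, invented purely for illustration.
reports = [
    "Gallstones seen. Normal liver.",
    "Normal liver. No gallstones.",
    "Normal liver. Kidneys unremarkable.",
]

# Sentence-level annotations a human might supply (1 = gallstone disease).
sentence_labels = {"gallstones seen": 1, "normal liver": 0, "no gallstones": 0}

print(key_sentences(reports, 1))
print(label_reports(reports, sentence_labels))
```

In this sketch a single annotated sentence labels every report that contains it, which is why annotating a small number of frequent sentences can cover a much larger number of reports.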

Speaker: Dr Cameron Fairfield, Clinical Research Fellow, Usher Institute, University of Edinburgh

The meeting will be held online via Zoom. Please contact Dr Hang Dong if you would like to attend, and we will send you the Zoom link.
