Connecting the Dots in Clinical NLP using Relation Extraction Models in Spark NLP

Veysel Kocaman
John Snow Labs
Jan 18, 2023

Relation extraction is the process of identifying and extracting relationships between named entities in a text. In the clinical domain, relation extraction is particularly important because it can help extract valuable information from clinical documents and enable various use cases, such as improving patient care, enabling better clinical decision making, and facilitating clinical research. Examples of clinical named entities that may be related to each other include diseases, symptoms, treatments, and medications.

By extracting the relationships between these entities, it is possible to gain a better understanding of the connections between different aspects of a patient’s health and the potential impact of different interventions. Relation extraction can be performed using a variety of techniques, including rule-based systems, machine learning approaches, and hybrid systems that combine both.

Introduction

In this blog post, I will go over the clinical and biomedical relation extraction (RE) architectures and models provided by the Spark NLP for Healthcare library and elaborate on how to run Spark NLP pipelines to extract relations across clinical named entities in a single pipeline at scale. For the sake of brevity, this blog post will be limited to pretrained models and how to build pipelines to extract relations. For more details on the algorithms behind the relation extraction models provided, please check the following peer-reviewed papers written by me and my colleagues at John Snow Labs.

Deeper Clinical Document Understanding Using Relation Extraction

Connecting the dots in clinical document understanding with Relation Extraction at scale

Spark NLP for Healthcare

Downloaded more than 100K times per day, with a total of 45 million downloads, Spark NLP is one of the fastest-growing NLP libraries, available in popular programming languages like Python, R, Scala and Java. It provides simple, performant and accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment.

Spark NLP comes with more than 12 thousand pretrained models, and its licensed extension, Spark NLP for Healthcare, comes with more than 860 pretrained clinical models, all developed and trained with the latest state-of-the-art algorithms to solve real-world problems in the healthcare domain at scale. For more information and sample Colab notebooks, we highly suggest that you check our workshop repo.
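
All of the pipelines below assume a Spark session started with the licensed library and the relevant annotators imported. Here is a minimal sketch, assuming your license secret is available in an environment variable (the JSL_SECRET name is just a placeholder):

import os

import sparknlp_jsl
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from pyspark.ml import Pipeline

# The license secret that ships with Spark NLP for Healthcare; reading it from
# an environment variable named JSL_SECRET is just one possible convention.
SECRET = os.environ["JSL_SECRET"]

spark = sparknlp_jsl.start(SECRET)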

Basic Components of Relation Extraction in Spark NLP for Healthcare

Relation extraction between two named entities has several stages and needs inputs (features) from certain NLP modules, namely:

  • Embeddings (vector representation of tokens)
  • Named Entity Recognition (NER)
  • Part-of-speech (POS) tagging
  • Dependency parsing

Since I am focusing only on Relation Extraction models in this post, I will not get into the details of other stages.

Pretrained Relation Extraction Models in Spark NLP for Healthcare

Spark NLP for Healthcare has more than 40 pretrained RE models that can be used effortlessly with a few lines of code.

List of pretrained RE models in Spark NLP for Healthcare

Using any of these models can be as simple as follows:

# loading a pretrained RE model to extract relations between body part and problem entities
reModel = RelationExtractionModel()\
.pretrained("re_bodypart_problem", "en", "clinical/models")\
.setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
.setOutputCol("relations")

text = "Some numbness in his left hand noted, no other neurologic deficits."

Output:

In simple terms, this RE model takes in two named entities (numbness and hand) and returns the relation between them, if there is any. In the example above, numbness is related to hand with a confidence score of 1.0.

Let's see how we can build an end-to-end pipeline that extracts everything we need.

documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")

sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")

tokenizer = Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")

words_embedder = WordEmbeddingsModel()\
.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("embeddings")

pos_tagger = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")

dependency_parser = DependencyParserModel()\
.pretrained("dependency_conllu", "en")\
.setInputCols(["sentences", "pos_tags", "tokens"])\
.setOutputCol("dependencies")

# get pretrained ner model
clinical_ner_tagger = MedicalNerModel()\
.pretrained('ner_jsl','en','clinical/models')\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")

ner_chunker = NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ner_tags"])\
.setOutputCol("ner_chunks")

reModel = RelationExtractionModel.pretrained("re_bodypart_problem","en","clinical/models")\
.setInputCols(["embeddings","ner_chunks","pos_tags","dependencies"])\
.setOutputCol("relations") \
.setRelationPairs(['symptom-external_body_part_or_region'])

nlp_pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, words_embedder, pos_tagger, clinical_ner_tagger, ner_chunker, dependency_parser, reModel])

sample_df = spark.createDataFrame([["No neurologic deficits other than some numbness in his left hand."]]).toDF("text")

result_df = nlp_pipeline.fit(sample_df).transform(sample_df)
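
The relations column of the transformed data frame is an array of Spark NLP annotations, so a convenient way to inspect the predictions is to explode that column and pull the relation label together with its metadata (chunks, entity types, confidence) into a flat table. A minimal sketch on the result_df produced above:

result_df.selectExpr("explode(relations) as rel")\
    .selectExpr("rel.result as relation",
                "rel.metadata['chunk1'] as chunk1",
                "rel.metadata['entity1'] as entity1",
                "rel.metadata['chunk2'] as chunk2",
                "rel.metadata['entity2'] as entity2",
                "rel.metadata['confidence'] as confidence")\
    .show(truncate=False)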

This pipeline is also compatible with LightPipeline, which makes inference up to 10x faster without rewriting the entire pipeline from scratch. A one-liner .annotate() (or .fullAnnotate()) accepts a string or a list of strings, so no Spark data frame is needed.

model = nlp_pipeline.fit(sample_df)

light_model = LightPipeline(model)

text = "No neurologic deficits other than some numbness in his left hand."

results = light_model.fullAnnotate(text)
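
fullAnnotate returns one dictionary per input text, keyed by the output column names of the pipeline. Each relation annotation carries the predicted label in its result field and the paired chunks, entity types and confidence score in its metadata, so the relations can be read out as in the sketch below:

# each element of `results` corresponds to one input text
for rel in results[0]["relations"]:
    print(rel.result,                              # predicted relation label
          rel.metadata["chunk1"], rel.metadata["entity1"],
          rel.metadata["chunk2"], rel.metadata["entity2"],
          rel.metadata["confidence"])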

Available relation extraction models in Spark NLP for Healthcare, their labels, the optimal NER models that accompany them, and meaningful relation pairs are illustrated in a table in the official documentation. Note that the table shared there is just to give you a rough idea of which pretrained models can be used together; you may get better or worse performance by experimenting with different model combinations.

You can train your own RE model using this Colab notebook:

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb

You can even fine-tune an existing RE model with your own data:

https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.4.Resume_RelationExtractionApproach_Training.ipynb

Generic Relation Extraction

In Spark NLP for Healthcare, there are already more than 40 relation extraction (RE) models that can extract relations between certain named entities. Nevertheless, for some rare entities or use cases you may not find the right RE model, or the one you find may not work as expected due to the nature of your dataset. In order to ease this burden, we released a generic RE model (generic_re) that can be used between any named entities, using the syntactic distances, POS tags and dependency tree between the entities. You can tune this model with the setMaxSyntacticDistance param.

reModel = RelationExtractionModel()\
.pretrained("generic_re")\
.setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
.setOutputCol("relations")\
.setRelationPairs(["Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene", "Pathology_Test-Pathology_Result", "Pathology_Result-Pathology_Test"]) \
.setMaxSyntacticDistance(4)
text = "Pathology showed tumor cells, which were positive for estrogen and progesterone receptors."

Output:

|sentence |entity1_begin |entity1_end | chunk1 | entity1 |entity2_begin |entity2_end | chunk2 | entity2 | relation |confidence|
|--------:|-------------:|-----------:|:----------|:-----------------|-------------:|-----------:|:-----------------------|:-----------------|:--------------------------------|----------|
| 0 | 1 | 9 | Pathology | Pathology_Test | 18 | 28 | tumor cells | Pathology_Result | Pathology_Test-Pathology_Result | 1|
| 0 | 42 | 49 | positive | Biomarker_Result | 55 | 62 | estrogen | Biomarker | Biomarker_Result-Biomarker | 1|
| 0 | 42 | 49 | positive | Biomarker_Result | 68 | 89 | progesterone receptors | Biomarker | Biomarker_Result-Biomarker | 1|
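
Note that the Biomarker, Oncogene and Pathology_* labels used in the relation pairs above must come from the NER stage of the pipeline, so the NER model should be swapped for one that produces those labels. The snippet below uses ner_oncology as an example of such a model (check the Models Hub for the model that best fits your entities):

# an oncology NER model that outputs labels such as Biomarker, Biomarker_Result,
# Oncogene, Pathology_Test and Pathology_Result
clinical_ner_tagger = MedicalNerModel.pretrained("ner_oncology", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("ner_tags")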

Transformer-based Relation Extraction Model

Spark NLP for Healthcare has another RE model architecture called RelationExtractionDLModel. In contrast with RelationExtractionModel, RelationExtractionDLModel is based on BERT.

It is an end-to-end architecture that uses its own embeddings, coming from the model itself, rather than the embeddings used for the NER stage in the same pipeline. That is why it is larger in size and might be a bit slower, but its accuracy might be much better than that of the classic RelationExtractionModel.

It needs another annotator called RENerChunksFilter that filters the named entities that are then used for RE. It generates all pairs of entities and checks whether each pair is whitelisted (i.e. listed in setRelationPairs) and whether the syntactic distance between the entities in the pair is below the desired threshold (setMaxSyntacticDistance). It outputs the NER chunks annotated accordingly (paired_to is added to the metadata), which the RE model uses to select relation candidates. Basically, it determines which entities to pair up before the pairs are passed to the RE model for classification.
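
Conceptually, the filtering works like the rough sketch below. This is only an illustration of the idea, not the library's actual implementation; the chunk objects are assumed to expose an entity label, and distance_fn stands for whatever measure of syntactic (dependency tree) distance is used between two chunks:

from itertools import combinations

def filter_candidate_pairs(chunks, allowed_pairs, max_syntactic_distance, distance_fn):
    """Keep only whitelisted, syntactically close entity pairs as RE candidates."""
    allowed = {pair.lower() for pair in allowed_pairs}
    candidates = []
    for chunk1, chunk2 in combinations(chunks, 2):
        pair = f"{chunk1.entity.lower()}-{chunk2.entity.lower()}"
        reverse_pair = f"{chunk2.entity.lower()}-{chunk1.entity.lower()}"
        # a pair survives only if it is whitelisted and close enough in the dependency tree
        if {pair, reverse_pair} & allowed and distance_fn(chunk1, chunk2) <= max_syntactic_distance:
            candidates.append((chunk1, chunk2))
    return candidates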

See the official documentation for details and the Models Hub for the available pretrained models.

As of now, there is no support for training a RelationExtractionDLModel from scratch with your own data. Fine-tuning will be available soon in NLP Lab, a no-code annotation and NLP platform.

Here is the code snippet:

documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")

sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")

tokenizer = Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")

words_embedder = WordEmbeddingsModel()\
.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("embeddings")

ner_tagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")

# convert NER tags into chunks (referenced as ner_chunker in the pipeline below)
ner_chunker = NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ner_tags"])\
.setOutputCol("ner_chunks")

pos_tagger = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")

dependency_parser = DependencyParserModel()\
.pretrained("dependency_conllu", "en")\
.setInputCols(["sentences", "pos_tags", "tokens"])\
.setOutputCol("dependencies")

ade_re_ner_chunk_filter = RENerChunksFilter() \
.setInputCols(["ner_chunks", "dependencies"])\
.setOutputCol("re_ner_chunks")\
.setMaxSyntacticDistance(10)\
.setRelationPairs(["drug-ade, ade-drug"])

ade_re_model = RelationExtractionDLModel()\
.pretrained('redl_ade_biobert', 'en', "clinical/models") \
.setPredictionThreshold(0.5)\
.setInputCols(["re_ner_chunks", "sentences"]) \
.setOutputCol("relations")

ade_pipeline = Pipeline(stages=[
documenter,
sentencer,
tokenizer,
words_embedder,
pos_tagger,
ner_tagger,
ner_chunker,
dependency_parser,
ade_re_ner_chunk_filter,
ade_re_model
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

ade_model = ade_pipeline.fit(empty_data)
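
Once fitted on the empty data frame, the pipeline can be wrapped in a LightPipeline and run directly on plain strings. A minimal usage sketch (the sample sentence is only illustrative):

ade_light = LightPipeline(ade_model)

ade_results = ade_light.fullAnnotate("I experienced fatigue and muscle cramps after taking Lipitor.")

for rel in ade_results[0]["relations"]:
    print(rel.result,                              # predicted relation label
          rel.metadata["chunk1"], rel.metadata["entity1"],
          rel.metadata["chunk2"], rel.metadata["entity2"],
          rel.metadata["confidence"])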

Conclusion

In conclusion, this blog post has provided an overview of the clinical and biomedical relation extraction architectures and models offered by the Spark NLP for Healthcare library. The ability to extract relations across clinical named entities in a single pipeline at scale is a powerful tool for those working in healthcare.

The pretrained models discussed in this post make it easy to get started with relation extraction, and for those looking for more detailed information about the algorithms behind the models, the peer-reviewed papers by the author and colleagues at John Snow Labs linked above offer a wealth of detail. Overall, Spark NLP for Healthcare is a valuable resource for extracting meaningful relationships from clinical and biomedical text, with its more than 860 pretrained clinical models and pipelines.
