
Named Entity Recognition (NER) Annotation for Clinical NLP

Well-annotated, gold-standard clinical text data to train and develop clinical NLP for the next version of a Healthcare API

The importance of clinical Natural Language Processing (NLP) has been increasingly recognized in recent years and has led to transformative advances. Clinical NLP allows computers to understand the rich meaning behind a doctor's written analysis of a patient. Its use cases range from population health analytics and clinical documentation improvement to speech recognition and clinical trial matching.

To develop and train clinical NLP models, you need accurate, unbiased, and well-annotated datasets in large volumes. Diverse, gold-standard data helps improve the precision and recall of NLP engines.
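As a rough illustration of the two metrics mentioned above, the sketch below scores predicted entity spans against gold-standard annotations using exact matching. The function name `precision_recall`, the `(start, end, label)` span format, and the sample spans are assumptions made for this example; they are not taken from the project.

```python
def precision_recall(gold, predicted):
    """Exact-match precision and recall over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    # Precision: share of predicted spans that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold spans that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up spans for illustration only.
gold = {
    (0, 12, "Medical Condition"),
    (29, 39, "Medicine"),
    (45, 54, "Medical Device"),
}
predicted = {
    (0, 12, "Medical Condition"),
    (29, 39, "Medicine"),
    (60, 66, "Medical Procedure"),
}

p, r = precision_recall(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f}")
```

Exact matching is the strictest choice; partial-overlap scoring is a common alternative when span boundaries are fuzzy.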




The client wanted to train and develop its Natural Language Processing (NLP) Platform with new entity types and to identify the relationships among those types. It was also evaluating vendors that offered high accuracy, complied with local laws, and had the medical knowledge required to annotate a large dataset.

The task was to label and annotate up to 20,000 records: up to 15,000 from inpatient and outpatient electronic health record (EHR) data and up to 5,000 from transcribed medical dictations, equally distributed across (1) geographical provenances and (2) available medical specialties.
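To make the distribution requirement concrete, here is a minimal sketch of splitting each record budget evenly across groups. The source does not list the actual provenances or specialties, so the specialty names below are placeholders, and `even_split` is a hypothetical helper. The totals are the upper bounds stated above.

```python
def even_split(total, groups):
    """Split `total` records as evenly as possible across `groups`."""
    base, remainder = divmod(total, len(groups))
    # The first `remainder` groups receive one extra record.
    return {g: base + (1 if i < remainder else 0) for i, g in enumerate(groups)}


EHR_RECORDS = 15_000        # upper bound stated in the task
DICTATION_RECORDS = 5_000   # upper bound stated in the task

# Placeholder specialties; the real list is not given in the source.
specialties = ["Cardiology", "Oncology", "Orthopedics"]

ehr_plan = even_split(EHR_RECORDS, specialties)
dictation_plan = even_split(DICTATION_RECORDS, specialties)

ehr_share = EHR_RECORDS / (EHR_RECORDS + DICTATION_RECORDS)
print(ehr_plan, dictation_plan, f"EHR share: {ehr_share:.0%}")
```

The same helper could be applied a second time per geographical provenance if both dimensions need to be balanced.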

So, to summarize the challenges:

  • Organize heterogeneous clinical data to train the NLP Platform
  • Identify the relationships between different entities to derive critical information
  • Bring the ability and expertise to label and annotate a broad set of complex clinical documents
  • Keep costs under control while annotating a large volume of data within the stipulated time frame
  • Annotate entities in a clinical dataset consisting of 75% EHR and 25% dictation records
  • De-identify the data at the time of delivery
Other Challenges in Natural Language Understanding


Ambiguity

Words are unique but can have different meanings depending on the context, resulting in ambiguity at the lexical, syntactic, and semantic levels.


Synonymy

The same idea can be expressed with different terms that are synonyms: "big" and "large" mean the same thing when describing an object.


Coreference

The process of finding all expressions that refer to the same entity in a text is called coreference resolution.

Intention, Emotions

Depending on the speaker's personality, intentions and emotions may be expressed differently for the same idea.


A large volume of medical data and knowledge is available in the form of medical documents, but most of it is unstructured. With Medical Entity Annotation / Named Entity Recognition (NER) Annotation, Insights AI was able to convert unstructured data into a structured format by annotating useful information from diverse types of clinical records. Once the entities were identified, the relationships among them were also mapped to identify critical information.

Scope of Work: Healthcare Entity Mention Annotation

9 Entity Types

  1. Medical Condition
  2. Medical Procedure
  3. Anatomical Structure
  4. Medicine
  5. Medical Device
  6. Body Measurement
  7. Substance Abuse
  8. Laboratory Data
  9. Body Function

17 Modifiers

  1. Medication Modifiers: Strength, Unit, Dose, Form, Frequency, Route, Duration, Status
  2. Body Measurement Modifiers: Value, Unit, Result
  3. Procedure Modifiers: Method
  4. Laboratory Data Modifiers: Lab Value, Lab Unit, Lab Result
  5. Severity
  6. Procedure Result
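The entity types and modifiers listed above can be captured as a small schema that annotation tooling could validate labels against. This is only a sketch: the variable names, the "Standalone" grouping key, and the `is_known_modifier` helper are assumptions for illustration, and "Form" is assumed as the dosage-form modifier for medications.

```python
# The 9 entity types from the scope of work.
ENTITY_TYPES = [
    "Medical Condition",
    "Medical Procedure",
    "Anatomical Structure",
    "Medicine",
    "Medical Device",
    "Body Measurement",
    "Substance Abuse",
    "Laboratory Data",
    "Body Function",
]

# The 17 modifiers, grouped as in the list above.
# "Standalone" is a hypothetical key for the two ungrouped modifiers.
MODIFIERS = {
    "Medication": ["Strength", "Unit", "Dose", "Form",
                   "Frequency", "Route", "Duration", "Status"],
    "Body Measurement": ["Value", "Unit", "Result"],
    "Procedure": ["Method"],
    "Laboratory Data": ["Lab Value", "Lab Unit", "Lab Result"],
    "Standalone": ["Severity", "Procedure Result"],
}


def is_known_modifier(group, name):
    """Return True if `name` is a modifier defined for `group`."""
    return name in MODIFIERS.get(group, [])


total_modifiers = sum(len(names) for names in MODIFIERS.values())
print(len(ENTITY_TYPES), "entity types,", total_modifiers, "modifiers")
```

Keeping the schema in one place makes it easy to check that the counts stay in line with the stated scope as guidelines evolve.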

27 Relationships & Patient Status


The annotated data would be used to develop and train the client's clinical NLP Platform, which would be incorporated into the next version of its Healthcare API. The benefits the client derived were:

  1. The labeled and annotated data met the client's standard data annotation guidelines.
  2. Heterogeneous datasets were used to train the NLP Platform for greater accuracy.
  3. Relationships between different entities (e.g., Anatomical Structure <> Medical Device, Medical Condition <> Medical Device, Medical Condition <> Medication, Medical Condition <> Procedure) were identified to derive critical medical information.
  4. The broad set of data that was labeled and annotated was also de-identified at the time of delivery.
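The entity relationships named above can be sketched as a check on which pairs of entity labels may be linked. Only the four pairs mentioned in this case study are included, so this is a partial view of the 27 relationships in scope. The `Entity` class, the helper functions, and the sample sentence are hypothetical and for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int   # character offset where the span begins
    end: int     # character offset just past the span
    label: str   # one of the entity types in scope


# Undirected pairs (the "<>" in the text), mapped to entity type names.
ALLOWED_PAIRS = {
    frozenset({"Anatomical Structure", "Medical Device"}),
    frozenset({"Medical Condition", "Medical Device"}),
    frozenset({"Medical Condition", "Medicine"}),
    frozenset({"Medical Condition", "Medical Procedure"}),
}


def make_entity(text, phrase, label):
    """Build an Entity from the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Entity(start, start + len(phrase), label)


def can_relate(a, b):
    """Return True if the two entities' labels form an allowed pair."""
    return frozenset({a.label, b.label}) in ALLOWED_PAIRS


# Made-up sentence, not taken from any clinical record.
text = "Hypertension is managed with lisinopril."
condition = make_entity(text, "Hypertension", "Medical Condition")
medicine = make_entity(text, "lisinopril", "Medicine")

print(can_relate(condition, medicine))
```

Using unordered pairs reflects the bidirectional notation in the text; a directed scheme would need separate head and tail roles.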

Our collaboration with Insights AI significantly advanced our project in Ambient Technology and Conversational AI within healthcare. Their expertise in creating and transcribing synthetic healthcare dialogues provided a solid foundation, showcasing the potential of synthetic data in overcoming regulatory challenges. With Insights AI, we navigated these hurdles and are now a step closer to realizing our vision of intuitive healthcare solutions.
