Entity Linking System

Given a text and a knowledge base, find all the entity mentions in the text (Recognize) and then link them to the corresponding correct entry in the knowledge base (Disambiguate).

A typical interview question is: design an entity linking system that

  • Identifies potential named entity mentions in the text

  • Searches for possible corresponding entities in the target knowledge base for disambiguation

  • Returns either the best-matching candidate entity or nil

0. Introduction

Named entity linking (NEL) is the process of detecting and linking entity mentions in a given text to corresponding entities in a target knowledge base.

  • Named-entity recognition (NER)

    NER detects and classifies potential named entities in the text into predefined categories such as person, organization, location, medical code, time expression, etc. (multi-class prediction).

  • Disambiguation

    Disambiguation links each recognized mention to the single correct entity in the knowledge base, resolving cases where one surface form could refer to several entries.

For example,

  1. The text is “Michael Jordan is a machine learning professor at UC Berkeley.”

  2. NER detects and classifies the named entities Michael Jordan and UC Berkeley as person and organization, respectively.

  3. Disambiguation takes place. Assume there are two ‘Michael Jordan’ entities in the given knowledge base: the UC Berkeley professor and the athlete. Michael Jordan in the text is linked to the professor at the University of California, Berkeley entity in the knowledge base (the one the text is referring to). Similarly, UC Berkeley in the text is linked to the University of California, Berkeley entity in the knowledge base.
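As a quick illustration of the recognition step, here is a minimal sketch using spaCy with its small English model (an illustrative tooling choice, not one prescribed above); the exact entities and labels produced depend on the pretrained model:

```python
# Minimal NER sketch (recognition only, no disambiguation).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Michael Jordan is a machine learning professor at UC Berkeley.")

for ent in doc.ents:
    # Label set depends on the model, e.g. PERSON, ORG, GPE
    print(ent.text, ent.label_)
```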

1. Problem Statement

"Given a text and knowledge base, find all the entity mentions in the text(Recognize) and then link them to the corresponding correct entry in the knowledge base(Disambiguate)."

Interview questions:

  1. How would you build an entity recognizer system?

  2. How would you build a disambiguation system?

  3. Given a piece of text, how would you extract all persons, countries, and businesses mentioned in it?

  4. How would you measure the performance of a disambiguator/entity recognizer/entity linker?

  5. Given multiple disambiguators/recognizers/linkers, how would you figure out which is the best one?

2. Metrics

Offline metrics

1. Named Entity Recognition

Possible predictions by NER for a given sentence
  • For the text "Michael Jordan is the best professor at UC Berkeley", the two entities are (1) Michael Jordan, (2) UC Berkeley

  • NER should detect both entities correctly, but it may detect

    • Both correctly

    • One correctly

    • None correctly (wrongly detect a non-entity as an entity)

    • Correct entity but with the wrong type

    • No entity, i.e., altogether miss the entities in the sentence

  • Therefore, we want to use precision, recall and F1:

    • $\text{Precision} = \frac{\text{no. of correctly recognized named entities}}{\text{no. of total recognized named entities}}$

    • $\text{Recall} = \frac{\text{no. of correctly recognized named entities}}{\text{no. of named entities in the corpus}}$

    • $\text{F1-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$
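A minimal sketch of computing these three metrics, assuming gold and predicted entities are represented as (mention, type) pairs (hypothetical data for illustration):

```python
# NER evaluation sketch: an entity counts as correctly recognized only if
# both its mention and its predicted type match the gold annotation.
gold = {("Michael Jordan", "PER"), ("UC Berkeley", "ORG")}
pred = {("Michael Jordan", "PER"), ("UC Berkeley", "LOC")}  # wrong type

correct = len(gold & pred)        # correctly recognized named entities
precision = correct / len(pred)   # denominator: total recognized entities
recall = correct / len(gold)      # denominator: entities in the corpus
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.50 R=0.50 F1=0.50
```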

2. Disambiguation

Possible disambiguation outputs for recognized entity mentions in text
  • The disambiguation layer receives the recognized entity mentions in the text and links them to entities in the knowledge base. It may:

    • Link the mention to the correct entity

    • Link the mention to the wrong entity

    • Not link the mention to any entity

  • Since the disambiguation layer links every recognized mention to some entry in the knowledge base, recall does not apply here; we should use only precision as the disambiguation metric:

    • $\text{Precision} = \frac{\text{no. of mentions correctly linked}}{\text{no. of total mentions}}$
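A corresponding sketch for disambiguation precision, with hypothetical mention-to-entity links:

```python
# Disambiguation precision sketch: compare predicted KB links against gold.
gold_links = {
    "Michael Jordan": "Michael_I._Jordan",
    "UC Berkeley": "University_of_California,_Berkeley",
}
pred_links = {
    "Michael Jordan": "Michael_Jordan_(athlete)",  # wrong entity
    "UC Berkeley": "University_of_California,_Berkeley",
}

correct = sum(pred_links[m] == gold_links.get(m) for m in pred_links)
precision = correct / len(pred_links)
print(f"Disambiguation precision: {precision:.2f}")  # 0.50
```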

Named-entity linking component

Combining the metrics for named-entity recognition and disambiguation, we can use the F1-score as an end-to-end metric:

  • True positive: an entity has been correctly recognized and linked

  • True negative: a non-entity has been correctly recognized as a non-entity

  • False positive: a non-entity has been wrongly recognized as an entity, or an entity has been wrongly linked.

  • False negative: an entity is wrongly recognized as a non-entity, or an entity that has a corresponding entity in the knowledge base is not linked.
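Following these definitions, the end-to-end score can be computed over (mention, linked entity) pairs; a minimal sketch with hypothetical annotations:

```python
# End-to-end entity linking sketch: a prediction is a true positive only if
# the mention was both recognized and linked to the correct KB entity.
gold = {
    ("Michael Jordan", "Michael_I._Jordan"),
    ("UC Berkeley", "University_of_California,_Berkeley"),
}
pred = {
    ("Michael Jordan", "Michael_Jordan_(athlete)"),  # recognized, mislinked
    ("UC Berkeley", "University_of_California,_Berkeley"),
}

tp = len(gold & pred)
fp = len(pred - gold)  # wrongly recognized or wrongly linked
fn = len(gold - pred)  # missed or incorrectly handled gold entities
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # all 0.50 here
```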

Micro vs. macro metrics

  • Macro-averaging computes the metric on each document separately, then averages over documents

    • $\text{Macro-averaged precision} = \frac{\sum_{i=1}^{n} P_{d_i}}{n}$, where $P_{d_i}$ is the precision over document $i$

    • $\text{Macro-averaged recall} = \frac{\sum_{i=1}^{n} R_{d_i}}{n}$, where $R_{d_i}$ is the recall over document $i$

    • $\text{Macro-averaged F1} = 2 \times \frac{\text{macro precision} \times \text{macro recall}}{\text{macro precision} + \text{macro recall}}$

  • Micro-averaging pools the raw counts across all documents before computing the metric, effectively weighting each document by its number of mentions

    • $\text{Micro-averaged precision} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} TP_i + \sum_{i=1}^{n} FP_i}$

    • $\text{Micro-averaged recall} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} TP_i + \sum_{i=1}^{n} FN_i}$

    • $\text{Micro-averaged F1} = 2 \times \frac{\text{micro precision} \times \text{micro recall}}{\text{micro precision} + \text{micro recall}}$
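A short sketch contrasting the two averaging schemes over hypothetical per-document counts; note how micro-averaging lets the large document dominate:

```python
# Micro vs. macro averaged precision over hypothetical per-document counts.
docs = [  # (TP, FP) per document
    (8, 2),  # large document, high precision (0.80)
    (1, 3),  # small document, low precision (0.25)
]

# Macro: compute precision per document, then average the results.
macro_p = sum(tp / (tp + fp) for tp, fp in docs) / len(docs)

# Micro: pool the counts across documents, then compute precision once.
tp_sum = sum(tp for tp, _ in docs)
fp_sum = sum(fp for _, fp in docs)
micro_p = tp_sum / (tp_sum + fp_sum)

print(f"macro precision = {macro_p:.3f}")  # (0.80 + 0.25) / 2 = 0.525
print(f"micro precision = {micro_p:.3f}")  # 9 / 14 ≈ 0.643
```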

Online metrics

Even with good offline performance, we still need to check the model's performance online.

  • We can conduct A/B testing for the overall system

  • Then measure general user satisfaction

Two examples:

  • Search engine.

    • Entity linking allows the engine to answer the user’s query directly by returning the entity or the entity properties the user wants to know. The user no longer needs to open search results and look for the required information.

    • User satisfaction lies in the query being properly answered, which can be measured by session success rate, i.e., % of sessions with user intent satisfied.

  • Virtual assistants

    • Helps perform tasks for a person based on commands or questions.

    • The evaluation metric for the VA would be user satisfaction (percentage of questions successfully answered).

3. Architectural Components

Architectural diagram for entity linking

Model generation path

  • Begin by gathering training data for entity linking from open-source datasets

  • Pass the training data to the named-entity recognition (NER) model

    • It is used to recognize entities, such as persons, organizations, etc., in a given input

  • Pass the result of NER to the named-entity disambiguation (NED) model, which has two steps (sketched in code after this list)

    • Candidate generation

      • It finds potential matches for the entity mentions by narrowing the knowledge base down to a smaller subset of candidate documents/entities.

    • Linking

      • It selects the exact corresponding entry in the knowledge base for each recognized entity.
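A minimal sketch of these two NED steps, using a hypothetical alias table for candidate generation and entity priors for the linking score (real systems would combine priors with context features such as embedding similarity):

```python
# NED sketch: candidate generation via an alias table, linking via a prior.
# The alias table and prior values below are hypothetical.
ALIAS_TABLE = {
    "michael jordan": ["Michael_Jordan_(athlete)", "Michael_I._Jordan"],
    "uc berkeley": ["University_of_California,_Berkeley"],
}
# P(entity | mention), e.g. estimated from Wikipedia anchor-link statistics.
PRIOR = {
    "Michael_Jordan_(athlete)": 0.85,
    "Michael_I._Jordan": 0.15,
    "University_of_California,_Berkeley": 1.0,
}

def candidates(mention):
    """Candidate generation: shrink the KB to a handful of candidates."""
    return ALIAS_TABLE.get(mention.lower(), [])

def link(mention):
    """Linking: pick the highest-scoring candidate, or None (nil)."""
    cands = candidates(mention)
    return max(cands, key=PRIOR.get) if cands else None

print(link("Michael Jordan"))  # Michael_Jordan_(athlete) under the prior alone
print(link("Metaverse"))       # None -- no alias-table entry
```

Note that the prior alone links "Michael Jordan" to the athlete; it is the context-aware part of the linking score that lets the system pick the professor in the earlier example sentence.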

Model execution path

  • It begins with an input sentence that is fed to the NER component.

  • NER identifies the entity mentions in the sentence, along with their types, and sends this information to the NED component.

  • This component then links each entity mention to its corresponding entity in the knowledge base (if it exists).

4. Training data generation

Open-source datasets:

  • NER: the CoNLL-2003 dataset for named-entity recognition

  • Named-entity disambiguation: the AIDA CoNLL-YAGO dataset, which annotates the CoNLL-2003 documents with entities from the YAGO knowledge base and thus provides a direct mapping between the two

5. Modeling

Represent words by contextual embedding vectors, so the same surface form gets different representations in different contexts

  • ELMo

  • BERT
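A minimal sketch of extracting contextual BERT embeddings with the Hugging Face transformers library (one possible tooling choice; ELMo would be used analogously):

```python
# Contextual token embeddings from BERT for downstream NER/NED features.
# Requires: pip install torch transformers
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Michael Jordan is a machine learning professor at UC Berkeley."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per (sub)token; the same surface form gets different vectors
# in different sentences, which is what makes these features useful here.
token_embeddings = outputs.last_hidden_state  # shape: (1, num_tokens, 768)
print(token_embeddings.shape)
```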
