Entity Linking System

Given a text and a knowledge base, find all the entity mentions in the text (Recognize) and then link them to the corresponding correct entry in the knowledge base (Disambiguate).

A typical interview question is: design an entity linking system that

  • Identifies potential named entity mentions in the text

  • Searches for possible corresponding entities in the target knowledge base for disambiguation

  • Returns either the best-matching candidate entity or nil

0. Introduction

Named entity linking (NEL) is the process of detecting and linking entity mentions in a given text to corresponding entities in a target knowledge base.

  • Named-entity recognition (NER)

    NER detects and classifies potential named entities in the text into predefined categories such as person, organization, location, medical code, time expression, etc. (multi-class prediction).

  • Disambiguation

    Disambiguation links each recognized mention to the single correct entity in the knowledge base, resolving cases where one surface form could refer to several entries.

For example,

  1. The text is “Michael Jordan is a machine learning professor at UC Berkeley.”

  2. NER detects and classifies the named entities Michael Jordan and UC Berkeley as person and organization, respectively.

  3. Disambiguation takes place. Assume there are two ‘Michael Jordan’ entities in the given knowledge base: the UC Berkeley professor and the athlete. Michael Jordan in the text is linked to the professor at the University of California, Berkeley entity in the knowledge base (the one the text is referring to). Similarly, UC Berkeley in the text is linked to the University of California, Berkeley entity in the knowledge base.
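As a quick illustration of the recognition step, here is a minimal sketch using spaCy with its small English model (an illustrative tooling choice, not one prescribed above); the exact entities and labels produced depend on the pretrained model:

```python
# Minimal NER sketch (recognition only, no disambiguation).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Michael Jordan is a machine learning professor at UC Berkeley.")

for ent in doc.ents:
    # Label set depends on the model, e.g. PERSON, ORG, GPE
    print(ent.text, ent.label_)
```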

1. Problem Statement

"Given a text and knowledge base, find all the entity mentions in the text(Recognize) and then link them to the corresponding correct entry in the knowledge base(Disambiguate)."

Interview questions:

  1. How would you build an entity recognizer system?

  2. How would you build a disambiguation system?

  3. Given a piece of text, how would you extract all persons, countries, and businesses mentioned in it?

  4. How would you measure the performance of a disambiguator/entity recognizer/entity linker?

  5. Given multiple disambiguators/recognizers/linkers, how would you figure out which is the best one?

2. Metrics

Offline metrics

1. Named Entity Recognition

Possible predictions by NER for a given sentence
  • For the text "Michael Jordan is the best professor at UC Berkeley", the two entities are (1) Michael Jordan, (2) UC Berkeley

  • NER should detect both entities correctly, but it may detect

    • Both correctly

    • One correctly

    • None correctly (wrongly detect a non-entity as an entity)

    • Correct entity but with the wrong type

    • No entity, i.e., altogether miss the entities in the sentence

  • Therefore, we want to use precision, recall and F1:

    • $\text{Precision} = \frac{\text{no. of correctly recognized named entities}}{\text{no. of total recognized named entities}}$

    • $\text{Recall} = \frac{\text{no. of correctly recognized named entities}}{\text{no. of named entities in the corpus}}$

    • $\text{F1-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$
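A minimal sketch of computing these three metrics, assuming gold and predicted entities are represented as (mention, type) pairs (hypothetical data for illustration):

```python
# NER evaluation sketch: an entity counts as correctly recognized only if
# both its mention and its predicted type match the gold annotation.
gold = {("Michael Jordan", "PER"), ("UC Berkeley", "ORG")}
pred = {("Michael Jordan", "PER"), ("UC Berkeley", "LOC")}  # wrong type

correct = len(gold & pred)        # correctly recognized named entities
precision = correct / len(pred)   # denominator: total recognized entities
recall = correct / len(gold)      # denominator: entities in the corpus
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.50 R=0.50 F1=0.50
```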

2. Disambiguation

Possible disambiguation outputs for recognized entity mentions in text
  • The disambiguation layer receives the recognized entity mentions in the text and links them to entities in the knowledge base. It may:

    • Link the mention to the correct entity

    • Link the mention to the wrong entity

    • Not link the mention to any entity

  • Since the disambiguation layer links every recognized mention to some entry in the knowledge base, recall does not apply here; we should use only precision as the disambiguation metric:

    • $\text{Precision} = \frac{\text{no. of mentions correctly linked}}{\text{no. of total mentions}}$
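A corresponding sketch for disambiguation precision, with hypothetical mention-to-entity links:

```python
# Disambiguation precision sketch: compare predicted KB links against gold.
gold_links = {
    "Michael Jordan": "Michael_I._Jordan",
    "UC Berkeley": "University_of_California,_Berkeley",
}
pred_links = {
    "Michael Jordan": "Michael_Jordan_(athlete)",  # wrong entity
    "UC Berkeley": "University_of_California,_Berkeley",
}

correct = sum(pred_links[m] == gold_links.get(m) for m in pred_links)
precision = correct / len(pred_links)
print(f"Disambiguation precision: {precision:.2f}")  # 0.50
```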

Named-entity linking component

Combining the metrics for named-entity recognition and disambiguation, we can use the F1-score as an end-to-end metric:

  • True positive: an entity has been correctly recognized and linked

  • True negative: a non-entity has been correctly recognized as a non-entity

  • False positive: a non-entity has been wrongly recognized as an entity, or an entity has been wrongly linked.

  • False negative: an entity is wrongly recognized as a non-entity, or an entity that has a corresponding entity in the knowledge base is not linked.
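Following these definitions, the end-to-end score can be computed over (mention, linked entity) pairs; a minimal sketch with hypothetical annotations:

```python
# End-to-end entity linking sketch: a prediction is a true positive only if
# the mention was both recognized and linked to the correct KB entity.
gold = {
    ("Michael Jordan", "Michael_I._Jordan"),
    ("UC Berkeley", "University_of_California,_Berkeley"),
}
pred = {
    ("Michael Jordan", "Michael_Jordan_(athlete)"),  # recognized, mislinked
    ("UC Berkeley", "University_of_California,_Berkeley"),
}

tp = len(gold & pred)
fp = len(pred - gold)  # wrongly recognized or wrongly linked
fn = len(gold - pred)  # missed or incorrectly handled gold entities
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # all 0.50 here
```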

Micro vs. macro metrics

  • Macro-averaging computes the metric on each document separately, then averages over documents

    • $\text{Macro-averaged precision} = \frac{\sum_{i=1}^{n} P_{d_i}}{n}$, where $P_{d_i}$ is the precision over document $i$

    • $\text{Macro-averaged recall} = \frac{\sum_{i=1}^{n} R_{d_i}}{n}$, where $R_{d_i}$ is the recall over document $i$

    • $\text{Macro-averaged F1} = 2 \times \frac{\text{macro precision} \times \text{macro recall}}{\text{macro precision} + \text{macro recall}}$

  • Micro-averaging pools the raw counts across all documents before computing the metric, effectively weighting each document by its number of mentions

    • $\text{Micro-averaged precision} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} TP_i + \sum_{i=1}^{n} FP_i}$

    • $\text{Micro-averaged recall} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} TP_i + \sum_{i=1}^{n} FN_i}$

    • $\text{Micro-averaged F1} = 2 \times \frac{\text{micro precision} \times \text{micro recall}}{\text{micro precision} + \text{micro recall}}$
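A short sketch contrasting the two averaging schemes over hypothetical per-document counts; note how micro-averaging lets the large document dominate:

```python
# Micro vs. macro averaged precision over hypothetical per-document counts.
docs = [  # (TP, FP) per document
    (8, 2),  # large document, high precision (0.80)
    (1, 3),  # small document, low precision (0.25)
]

# Macro: compute precision per document, then average the results.
macro_p = sum(tp / (tp + fp) for tp, fp in docs) / len(docs)

# Micro: pool the counts across documents, then compute precision once.
tp_sum = sum(tp for tp, _ in docs)
fp_sum = sum(fp for _, fp in docs)
micro_p = tp_sum / (tp_sum + fp_sum)

print(f"macro precision = {macro_p:.3f}")  # (0.80 + 0.25) / 2 = 0.525
print(f"micro precision = {micro_p:.3f}")  # 9 / 14 ≈ 0.643
```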

Online metrics

Even with good offline performance, we still need to check the model's performance online.

  • We can conduct A/B testing for the overall system

  • Then measure general user satisfaction

Two examples:

  • Search engine.

    • Entity linking allows the engine to answer the user’s query directly by returning the entity or the entity properties the user wants to know. The user no longer needs to open search results and look for the required information.

    • User satisfaction lies in the query being properly answered, which can be measured by session success rate, i.e., % of sessions with user intent satisfied.

  • Virtual assistants

    • Helps perform tasks for a person based on commands or questions.

    • The evaluation metric for the VA would be user satisfaction (percentage of questions successfully answered).

3. Architectural Components

Architectural diagram for entity linking

Model generation path

  • Begin by gathering training data for entity linking from open-source datasets

  • Pass the training data to the named-entity recognition (NER) model

    • It is used to recognize entities, such as persons, organizations, etc., in a given input

  • Pass the result of NER to the named-entity disambiguation (NED) model, which has two steps (sketched in code after this list)

    • Candidate generation

      • It finds potential matches for the entity mentions by narrowing the knowledge base down to a smaller subset of candidate documents/entities.

    • Linking

      • It selects the exact corresponding entry in the knowledge base for each recognized entity.
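A minimal sketch of these two NED steps, using a hypothetical alias table for candidate generation and entity priors for the linking score (real systems would combine priors with context features such as embedding similarity):

```python
# NED sketch: candidate generation via an alias table, linking via a prior.
# The alias table and prior values below are hypothetical.
ALIAS_TABLE = {
    "michael jordan": ["Michael_Jordan_(athlete)", "Michael_I._Jordan"],
    "uc berkeley": ["University_of_California,_Berkeley"],
}
# P(entity | mention), e.g. estimated from Wikipedia anchor-link statistics.
PRIOR = {
    "Michael_Jordan_(athlete)": 0.85,
    "Michael_I._Jordan": 0.15,
    "University_of_California,_Berkeley": 1.0,
}

def candidates(mention):
    """Candidate generation: shrink the KB to a handful of candidates."""
    return ALIAS_TABLE.get(mention.lower(), [])

def link(mention):
    """Linking: pick the highest-scoring candidate, or None (nil)."""
    cands = candidates(mention)
    return max(cands, key=PRIOR.get) if cands else None

print(link("Michael Jordan"))  # Michael_Jordan_(athlete) under the prior alone
print(link("Metaverse"))       # None -- no alias-table entry
```

Note that the prior alone links "Michael Jordan" to the athlete; it is the context-aware part of the linking score that lets the system pick the professor in the earlier example sentence.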

Model execution path

  • It begins with an input sentence that is fed to the NER component.

  • NER identifies the entity mentions in the sentence, along with their types, and sends this information to the NED component.

  • This component then links each entity mention to its corresponding entity in the knowledge base (if it exists).

4. Training data generation

Open-source datasets:

  • NER: the CoNLL-2003 dataset for named-entity recognition

  • Named-entity disambiguation: the AIDA CoNLL-YAGO dataset, which annotates the CoNLL-2003 documents with entities from the YAGO knowledge base and thus provides a direct mapping between the two

5. Modeling

Represent words by contextual embedding vectors, so the same surface form gets different representations in different contexts

  • ELMo

  • BERT
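A minimal sketch of extracting contextual BERT embeddings with the Hugging Face transformers library (one possible tooling choice; ELMo would be used analogously):

```python
# Contextual token embeddings from BERT for downstream NER/NED features.
# Requires: pip install torch transformers
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Michael Jordan is a machine learning professor at UC Berkeley."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per (sub)token; the same surface form gets different vectors
# in different sentences, which is what makes these features useful here.
token_embeddings = outputs.last_hidden_state  # shape: (1, num_tokens, 768)
print(token_embeddings.shape)
```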
