Entity Linking System
Given a text and a knowledge base, find all the entity mentions in the text (Recognize) and then link each mention to the corresponding correct entry in the knowledge base (Disambiguate).
A typical interview question: design an entity linking system that
Identifies potential named entity mentions in the text
Searches for possible corresponding entities in the target knowledge base for disambiguation
Returns either the best candidate corresponding entity or nil
0. Introduction
Named entity linking (NEL) is the process of detecting and linking entity mentions in a given text to corresponding entities in a target knowledge base.
Named-entity recognition (NER)
NER detects and classifies potential named entities in the text into predefined categories such as person, organization, location, medical code, time expression, etc. (multi-class prediction).
Disambiguation
For example,
The text is “Michael Jordan is a machine learning professor at UC Berkeley.”
NER detects the named entities Michael Jordan and UC Berkeley and classifies them as person and organization, respectively.
Disambiguation then takes place. Assume there are two ‘Michael Jordan’ entities in the given knowledge base: the UC Berkeley professor and the athlete. The mention Michael Jordan in the text is linked to the professor at the University of California, Berkeley entity in the knowledge base (the one the text is referring to). Similarly, UC Berkeley in the text is linked to the University of California, Berkeley entity in the knowledge base.
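The disambiguation step above can be sketched with a toy scorer (not a real linker): pick the knowledge-base entry whose description shares the most words with the mention's surrounding context. The `KB` entries and descriptions here are illustrative, not from a real knowledge base.

```python
# Toy disambiguation: choose the candidate entity whose description has the
# largest word overlap with the mention's context. Entries are made up.
KB = {
    "Michael Jordan (professor)": "machine learning professor statistician UC Berkeley",
    "Michael Jordan (athlete)": "basketball player NBA Chicago Bulls",
}

def disambiguate(mention_context: str, candidates: dict) -> str:
    context_words = set(mention_context.lower().split())
    def overlap(entry: str) -> int:
        return len(context_words & set(candidates[entry].lower().split()))
    return max(candidates, key=overlap)

text = "Michael Jordan is a machine learning professor at UC Berkeley."
print(disambiguate(text, KB))  # → Michael Jordan (professor)
```

A real system would replace word overlap with learned context/entity representations, but the interface (context in, best candidate out) is the same.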

1. Problem Statement
"Given a text and knowledge base, find all the entity mentions in the text (Recognize) and then link them to the corresponding correct entry in the knowledge base (Disambiguate)."
Interview questions:
How would you build an entity recognizer system?
How would you build a disambiguation system?
Given a piece of text, how would you extract all persons, countries, and businesses mentioned in it?
How would you measure the performance of a disambiguator/entity recognizer/entity linker?
Given multiple disambiguators/recognizers/linkers, how would you figure out which is the best one?
2. Metrics
Offline metrics
1. Named Entity Recognition

For the text "Michael Jordan is the best professor at UC Berkeley", the two entities are (1) Michael Jordan and (2) UC Berkeley.
NER should detect both entities correctly, but it may detect
Both correctly
One correctly
None correctly (wrongly detect non-entity as an entity)
Correct entity but with the wrong type
No entity, i.e., altogether miss the entities in the sentence
Therefore, we want to use precision, recall, and F1:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
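A minimal sketch of NER evaluation under these definitions, scoring exact matches of (span, type) pairs against gold annotations. The spans and labels below are illustrative.

```python
# NER evaluation sketch: a prediction counts as a true positive only if both
# the span offsets and the entity type exactly match a gold annotation.
def ner_prf1(gold: set, predicted: set):
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(0, 14, "PER"), (40, 51, "ORG")}  # Michael Jordan, UC Berkeley
pred = {(0, 14, "PER"), (40, 51, "LOC")}  # correct span, wrong type on the second
print(ner_prf1(gold, pred))  # → (0.5, 0.5, 0.5)
```

Note that the "correct entity but wrong type" case from the list above counts as both a false positive and a false negative under exact matching.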
2. Disambiguation

The disambiguation layer receives the recognized entity mentions in the text and links them to entities in the knowledge base. For each mention, it may:
Link the mention to the correct entity
Link the mention to the wrong entity
Not link the mention to any entity
Since the disambiguation layer links every recognized mention to some entry in the knowledge base, recall does not apply here. We should use precision alone as the disambiguation metric:
Precision = correctly linked mentions / total linked mentions
Named-entity linking component
Combining the metrics for named entity recognition and disambiguation, we can use an F1-score as the end-to-end metric:
True positive: an entity has been correctly recognized and linked
True negative: a non-entity has been correctly recognized as a non-entity
False positive: a non-entity has been wrongly recognized as an entity or an entity has been wrongly linked.
False negative: an entity is wrongly recognized as a non-entity, or an entity that has a corresponding entity in the knowledge base is not linked.
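The end-to-end definitions above can be sketched as a small scorer: a prediction is a true positive only if both the mention and the linked entity match gold. The mention strings and entity ids below are illustrative stand-ins.

```python
# End-to-end linking F1 sketch: gold_links / pred_links map each mention to a
# knowledge-base entity id. A wrong link counts as both FP and FN.
def linking_f1(gold_links: dict, pred_links: dict) -> float:
    tp = sum(1 for m, e in pred_links.items() if gold_links.get(m) == e)
    fp = len(pred_links) - tp  # spurious mention or wrong entity
    fn = sum(1 for m in gold_links if gold_links[m] != pred_links.get(m))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = {"Michael Jordan": "professor_mj", "UC Berkeley": "uc_berkeley"}
pred = {"Michael Jordan": "athlete_mj", "UC Berkeley": "uc_berkeley"}  # wrong MJ
print(linking_f1(gold, pred))  # → 0.5
```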
Micro vs. macro metrics
Macro-averaging computes the metric for each document (or class) separately and then averages:
Precision_macro = (1/N) × Σ_i Precision_i, where Precision_i is the precision over document i
Recall_macro = (1/N) × Σ_i Recall_i, where Recall_i is the recall over document i
Micro-averaging is a weighted average of the performance: it pools the raw counts (TP, FP, FN) across all documents before computing the metric, so larger documents contribute more.
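The difference between the two averages can be shown with made-up per-document counts (the tp/fp numbers below are purely illustrative):

```python
# Macro vs. micro precision across documents.
docs = [  # (true positives, false positives) per document
    (8, 2),  # doc 1: precision 0.8
    (1, 3),  # doc 2: precision 0.25
]

# Macro: average the per-document precisions, weighting each document equally.
macro_p = sum(tp / (tp + fp) for tp, fp in docs) / len(docs)

# Micro: pool the counts first, so the larger document dominates.
micro_p = sum(tp for tp, _ in docs) / sum(tp + fp for tp, fp in docs)

print(macro_p)  # → 0.525
print(micro_p)  # → 0.6428571428571429  (9/14)
```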
Online metrics
Even with good offline performance, we still need to check the model's performance online.
We can conduct A/B testing for the overall system,
then measure general user satisfaction.
Two examples:
Search engine.
Entity linking allows us to answer the user’s query directly by returning the entity, or its properties, that the user wants to know. The user no longer needs to open search results and look for the required information.
User satisfaction lies in the query being properly answered, which can be measured by session success rate, i.e., % of sessions with user intent satisfied.
Virtual assistants
A virtual assistant helps perform tasks for a person based on commands or questions.
The evaluation metric for the VA would be user satisfaction (percentage of questions successfully answered).
3. Architectural Components

Model generation path
Begin by gathering training data for entity linking from open-source datasets
Pass the training data to the named entity recognition (NER) model
It is used to recognize entities, such as a person or organization, in a given input
Pass the result of NER to the named entity disambiguation (NED) model, which has two stages:
Candidate generation
It finds potential matches for each entity mention by reducing the knowledge base to a smaller subset of candidate documents/entities.
Linking
It selects the exact corresponding entry in the knowledge base for each recognized entity.
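A common way to implement candidate generation is an inverted index from surface forms (aliases) to entity ids; linking then ranks only that small candidate set. A minimal sketch, with made-up aliases and entity ids:

```python
# Candidate generation sketch: map normalized surface forms to the set of
# knowledge-base entity ids they can refer to. All entries are illustrative.
from collections import defaultdict

ALIASES = {
    "professor_mj": ["michael jordan", "michael i. jordan"],
    "athlete_mj":   ["michael jordan", "mj", "air jordan"],
    "uc_berkeley":  ["uc berkeley", "university of california, berkeley"],
}

index = defaultdict(set)
for entity_id, names in ALIASES.items():
    for name in names:
        index[name].add(entity_id)

def generate_candidates(mention: str) -> set:
    # Normalize the mention and look it up; unknown mentions get no candidates.
    return index.get(mention.lower(), set())

print(sorted(generate_candidates("Michael Jordan")))  # → ['athlete_mj', 'professor_mj']
print(sorted(generate_candidates("UC Berkeley")))     # → ['uc_berkeley']
```

Production systems typically add fuzzy matching and alias statistics (e.g., how often a surface form refers to each entity), but the lookup structure is the same idea.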
Model execution path
It begins with an input sentence that is fed to the NER component.
NER identifies the entity mentions in the sentence, along with their types, and sends this information to the NED component.
This component then links each entity mention to its corresponding entity in the knowledge base (if it exists).
4. Training data generation
Open-source datasets:
NER: CoNLL-2003 for named-entity recognition

Named-entity disambiguation: AIDA CoNLL-YAGO Dataset


5. Modeling
Represent words by embedding vectors
ELMo
BERT
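With contextual embeddings from a model like ELMo or BERT, disambiguation can rank candidates by cosine similarity between the mention's context embedding and each candidate entity's embedding. The 3-dimensional vectors below are hand-made stand-ins for real embeddings, purely for illustration:

```python
# Embedding-based candidate ranking sketch: score candidates by cosine
# similarity to the mention's context vector. Vectors here are toy stand-ins
# for what ELMo/BERT would produce.
import math

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

context_vec = [0.9, 0.1, 0.2]  # embedding of "...machine learning professor..."
entity_vecs = {
    "professor_mj": [0.8, 0.2, 0.1],
    "athlete_mj":   [0.1, 0.9, 0.3],
}

best = max(entity_vecs, key=lambda e: cosine(context_vec, entity_vecs[e]))
print(best)  # → professor_mj
```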