Search Ranking
Design a search relevance system for a search engine.
1. Problem Statement
Problem:
Please design a search relevance system for a search engine
Clarifying questions
Clarify the requirements along three aspects: scope, scale, and personalization
Problem scope
Is it a general search engine like Google or Bing or a specialized search engine like Amazon products search?
==> Build a generic search engine that returns relevant results for queries like "Richard Nixon", "Programming languages" etc.
Scale
How many websites (documents) do you want to make searchable through this search engine?
How many requests per second do you anticipate handling?
Assume there are billions of documents to search from, and getting around 10K queries per second (QPS)
==> Use a funnel (multi-stage) approach to deal with this large-scale search system
Personalization
Is the searcher a logged-in user or not?
Assume the user is logged in and we have access to their profile and historical search data
2. Metrics
There are two types of metrics to evaluate the success of a search query: (1) Online metrics; (2) Offline metrics.
Online metrics
Online metrics are those computed from user interactions in a live system.
Click-through rate (CTR)
Successful session rate (SSR)
CTR may count unsuccessful clicks as successes, for example, short clicks where the user only glances at the resulting document and clicks back immediately.
==> Successful sessions can be defined as the ones that have a click with 10s or longer dwell time. Dwell time is the length of time a searcher spends viewing a webpage after they’ve clicked a link on a search engine result page (SERP).
Caveat
Zero-click searches: A SERP may answer the searcher’s query right at the top such that the searcher doesn’t need any further clicks to complete the search.
To deal with zero-click searches, we can use Time To Success.
In a query session, the user may try several keywords to find the result they want. Time to success measures how long (or how many queries) it takes the user to reach the ideal result.
==> Our goal is to help the user issue as few queries as possible and to rank the desired answer as high on the results page as possible (a small sketch of these online metrics follows).
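A minimal sketch of computing the successful session rate from session logs, assuming a hypothetical log schema with per-click dwell times and the 10-second dwell-time threshold described above:

```python
# Hypothetical session log: each session lists its clicks with dwell times in seconds.
sessions = [
    {"query": "richard nixon", "clicks": [{"dwell_time_s": 42}]},
    {"query": "programming languages", "clicks": [{"dwell_time_s": 3}]},  # short click
    {"query": "weather today", "clicks": []},                             # zero-click search
]

DWELL_THRESHOLD_S = 10  # a click counts as successful only past this dwell time

def successful_session_rate(sessions):
    successes = sum(
        any(c["dwell_time_s"] >= DWELL_THRESHOLD_S for c in s["clicks"])
        for s in sessions
    )
    return successes / len(sessions)

print(successful_session_rate(sessions))  # 1/3 here; the zero-click case needs time to success
```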
Offline metrics
The offline methods for measuring a successful search session make use of trained human raters.
NDCG
NDCG is an improvement over cumulative gain (CG):

$CG_p = \sum_{i=1}^{p} rel_i$

where $rel_i$ is the relevance rating of the document at position $i$ (assigned by a human rater) and $p$ is the position up to which the documents are ranked.

DCG discounts each document's gain by its position in the human-rated list:

$DCG_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i + 1)}$

The intuition behind the discount is that the search engine will not be of much use if it doesn't show the most relevant documents at the top of search result pages.

$NDCG_p = \frac{DCG_p}{IDCG_p}$, where IDCG is the ideal discounted cumulative gain, i.e., the DCG of an ideal ordering of the search engine's result list.

For N queries, the NDCG is averaged:

$NDCG = \frac{1}{N} \sum_{k=1}^{N} NDCG_k$
Note:
NDCG does not penalize irrelevant search results. If a document has zero relevance according to the human rater, NDCG just assigns 0 to it.
To remedy this, the human rater can assign a negative relevance score to irrelevant documents.
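A minimal sketch of the DCG/NDCG computation above, using hypothetical human ratings on a 0-3 scale:

```python
import math

def dcg(rels, p=None):
    """Discounted cumulative gain: rel_i / log2(i + 1) summed over ranked positions."""
    rels = rels if p is None else rels[:p]
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))

def ndcg(rels, p=None):
    """DCG normalized by the ideal DCG (documents sorted by descending relevance)."""
    ideal = dcg(sorted(rels, reverse=True), p)
    return dcg(rels, p) / ideal if ideal > 0 else 0.0

# Human ratings (0-3) of the documents in ranked order, for two queries:
queries = [[3, 2, 3, 0], [1, 3, 2, 0]]
print(sum(ndcg(q) for q in queries) / len(queries))  # mean NDCG over N queries
```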
3. Architecture Components
Architecture

Query rewriting
Queries are often poorly worded and far from describing the searcher's actual needs
Spell checker: fix the typo
Query expansion: improve search results by adding terms to the user's query
Query relaxation: remove some words to improve search results (a small sketch of these rewriting steps follows this list)
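A toy sketch of the three rewriting steps, assuming a hypothetical vocabulary and synonym table; a real system would use learned models and much larger dictionaries:

```python
import difflib

VOCAB = {"italian", "restaurants", "food", "gas", "stations"}   # hypothetical dictionary
SYNONYMS = {"restaurants": ["food"]}                            # hypothetical expansion table

def spell_check(query):
    """Map each term to its closest dictionary word (simple edit-distance correction)."""
    fixed = []
    for term in query.lower().split():
        match = difflib.get_close_matches(term, VOCAB, n=1, cutoff=0.8)
        fixed.append(match[0] if match else term)
    return " ".join(fixed)

def expand(query):
    """Add synonym terms so more relevant documents can match."""
    terms = query.split()
    return terms + [s for t in terms for s in SYNONYMS.get(t, [])]

def relax(query, optional_terms={"best", "cheap"}):
    """Drop low-information terms when the full query returns too few results."""
    return [t for t in query.split() if t not in optional_terms]

print(spell_check("italain restuarants"))   # -> "italian restaurants"
print(expand("italian restaurants"))        # -> ['italian', 'restaurants', 'food']
print(relax("best italian restaurants"))    # -> ['italian', 'restaurants']
```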
Query understanding
Figuring out the main intent behind the query
Query "gas stations" -> local intent
"Earthquake" -> newsy intent
Document selection
Sift through billions of documents to retrieve millions of relevant documents
Usually, this is done with a simple query against the index (see Document Selection below) rather than an ML model
Ranker
Use ML to find the best order of documents
If the number of documents is significantly large (>10K) and the amount of traffic is also huge (~10K QPS)
==> Use multiple stages of ranking with varying degrees of complexity and model sizes
Blender
Gives relevant results from various search verticals, like images, videos, news, local results...
==> It helps improve the diversity of results.
The final output is the Search Engine Result Page (SERP) in response to the searcher's query (a minimal blending sketch follows)
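One simple way a blender can mix verticals is round-robin interleaving; the sketch below is illustrative and the vertical names and result ids are assumptions:

```python
from itertools import zip_longest

def blend(verticals):
    """Round-robin interleave results from each vertical (web, images, news, ...)."""
    blended = []
    for group in zip_longest(*verticals.values()):
        blended.extend(item for item in group if item is not None)
    return blended

verticals = {
    "web":    ["web1", "web2", "web3"],
    "images": ["img1", "img2"],
    "news":   ["news1"],
}
print(blend(verticals))  # ['web1', 'img1', 'news1', 'web2', 'img2', 'web3']
```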
Training data generation
It takes online user engagement data from the SERPs shown in response to queries and generates positive and negative training examples.
The training data generated is fed to the machine learning models for training.
Layered model approach

Across the layers, we use simple models at the beginning for coarse filtering and more complex models later for fine-grained document ranking (a toy funnel sketch follows)
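A toy sketch of the layered (funnel) idea: a cheap stage-1 score prunes the candidate pool, then a more expensive stage-2 score reranks the survivors. The scoring functions and cutoffs here are stand-ins, not the actual models; real cutoffs would be on the order of 100K and a few hundred:

```python
def cheap_score(doc, query_terms):
    # Stage 1: lightweight signal, e.g., raw term overlap.
    return len(doc["terms"] & query_terms)

def expensive_score(doc, query_terms):
    # Stage 2: stand-in for a heavier learned ranker.
    return cheap_score(doc, query_terms) + doc["popularity"]

def funnel_rank(docs, query_terms, stage1_keep=3, stage2_keep=2):
    stage1 = sorted(docs, key=lambda d: cheap_score(d, query_terms), reverse=True)[:stage1_keep]
    return sorted(stage1, key=lambda d: expensive_score(d, query_terms), reverse=True)[:stage2_keep]

docs = [
    {"id": 1, "terms": {"italian", "restaurants"}, "popularity": 0.9},
    {"id": 2, "terms": {"italian", "food"},        "popularity": 0.5},
    {"id": 3, "terms": {"mexican", "food"},        "popularity": 0.7},
    {"id": 4, "terms": {"weather"},                "popularity": 0.2},
]
print([d["id"] for d in funnel_rank(docs, {"italian", "restaurants"})])  # [1, 2]
```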
4. Document Selection
From the one-hundred billion documents on the internet, let's retrieve the top one-hundred thousand that are relevant to the searcher's query.
Inverted Index
An index data structure that stores a mapping from content, such as words or numbers, to its location in a set of documents.

Document selection process
Searcher input query:
"Italian restaurants"
Selection criteria:
Query expansion to "Italian food"
==> Retrieve documents that (match "Italian" AND (match "restaurants" OR match "food")); a small inverted-index sketch follows
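A minimal sketch of building an inverted index and applying the AND/OR selection criteria above, using a toy three-document corpus:

```python
from collections import defaultdict

docs = {
    1: "best italian restaurants in town",
    2: "italian food recipes at home",
    3: "mexican food near me",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Selection criteria: match "italian" AND (match "restaurants" OR match "food").
selected = index["italian"] & (index["restaurants"] | index["food"])
print(selected)  # {1, 2}
```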
Relevance scoring scheme
For the selected documents, we need to rank them based on a relevance score, which can combine the factors below (a weighted-sum sketch follows this list):
Terms match:
Use the IDF of each term to weigh the match; matches on important (rare) query terms weigh higher
Document match:
Use a query-independent document signal, such as the document's popularity score
Query intent match:
How well the document matches the query intent (e.g., local, news, photo). For a restaurant query, documents with local intent match best
Personalization match:
How well a document meets the searchers' individual requirements
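A sketch of combining these factors into a single relevance score as a weighted sum; the weights and field names are purely illustrative (in practice they would be tuned or learned):

```python
import math

def idf(term, index, num_docs):
    """Inverse document frequency: rarer terms carry more weight in the match score."""
    df = len(index.get(term, ()))
    return math.log((num_docs + 1) / (df + 1))

def relevance_score(doc, query_terms, index, num_docs,
                    w_terms=0.5, w_pop=0.2, w_intent=0.2, w_personal=0.1):
    """Weighted combination of the four factors; the weights are illustrative."""
    term_match = sum(idf(t, index, num_docs) for t in query_terms if t in doc["terms"])
    return (w_terms * term_match
            + w_pop * doc["popularity"]
            + w_intent * doc["intent_match"]
            + w_personal * doc["personalization_match"])

# Toy example reusing the "italian restaurants" query.
index = {"italian": {1, 2}, "restaurants": {1}, "food": {2, 3}}
doc = {"terms": {"italian", "restaurants"}, "popularity": 0.8,
       "intent_match": 1.0, "personalization_match": 0.3}
print(relevance_score(doc, ["italian", "restaurants"], index, num_docs=3))
```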
5. Feature Engineering
There are mainly four actors in a search from which we derive features:
Searcher
Query
Document
Context

Searcher-specific features
Age
Gender
Interests
Query-specific features
Query historical engagement
Historical engagement with a query helps capture users' preferences among the results
Query Intent
Identify the kind of information the searcher is looking for, e.g., news, local, commerce
Document-specific features
Page rank
Document engagement radius
Context-specific features
Time of search
Recent events
Searcher-document features
Distance
Historical engagement
Query-document features
Text Match
Unigram or bigram
Query-document historical engagement
Click rate
Embedding
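A sketch of assembling one training/serving feature row from the four actors; every field name below is illustrative, not a fixed schema:

```python
def build_features(searcher, query, document, context):
    """Assemble one feature row from the four actors; all field names are illustrative."""
    return {
        # searcher-specific
        "searcher_age": searcher["age"],
        # query-specific
        "query_intent_local": int(query["intent"] == "local"),
        "query_historical_ctr": query["historical_ctr"],
        # document-specific
        "page_rank": document["page_rank"],
        # searcher-document
        "distance_km": document["distance_km"],
        "prior_engagements": document["prior_engagements"],
        # query-document
        "bigram_match": document["bigram_match"],
        "query_doc_click_rate": document["click_rate"],
        "embedding_similarity": document["embedding_sim"],
        # context-specific
        "hour_of_day": context["hour"],
    }

row = build_features(
    searcher={"age": 35},
    query={"intent": "local", "historical_ctr": 0.12},
    document={"page_rank": 0.8, "distance_km": 2.5, "prior_engagements": 3,
              "bigram_match": 1, "click_rate": 0.2, "embedding_sim": 0.7},
    context={"hour": 19},
)
print(row)
```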
6. Training Data Generation
Training data generation for pointwise approach
Pointwise approach: In this approach of model training, the training data consists of relevance scores for each document. The loss function looks at the score of one document at a time as an absolute ranking.
Generating positive and negative training examples
From a user's SERP, we pick documents the user clicked as positive examples and documents the user did not click as negative examples.
To remedy the shortage of negative examples, we can add random negative examples, e.g., documents from the 50th page of the results.
Train/test split: be careful with the features; if historical logs are used, make sure there is no data leakage (e.g., split by time so that training data precedes test data). A sketch of generating these examples follows.
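A sketch of turning one SERP's engagement log into pointwise examples, with random negatives drawn from far down the results; the log schema and document ids are assumed:

```python
import random

def pointwise_examples(impressions, tail_doc_ids, n_random_negatives=2):
    """
    impressions: rows for one SERP, e.g. {"query": q, "doc_id": d, "clicked": True/False}.
    tail_doc_ids: ids from far down the results (e.g., around page 50), used as
                  extra random negatives when clicked-but-negative data is scarce.
    """
    examples = []
    for row in impressions:
        examples.append((row["query"], row["doc_id"], 1 if row["clicked"] else 0))
    query = impressions[0]["query"]
    for doc_id in random.sample(tail_doc_ids, k=min(n_random_negatives, len(tail_doc_ids))):
        examples.append((query, doc_id, 0))
    return examples

serp = [
    {"query": "italian restaurants", "doc_id": 11, "clicked": True},
    {"query": "italian restaurants", "doc_id": 12, "clicked": False},
]
print(pointwise_examples(serp, tail_doc_ids=[501, 502, 503]))
```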
Training data generation for pairwise approach
Pairwise approach: This approach differs from the pointwise approach during model training. Here, the loss function looks at the scores of document pairs as an inequality instead of the score of a single document. This enables the model to learn to rank documents according to their relative order which is closer to the nature of ranking.
There are two methods for data generation:
Human raters (offline method)
Rate a list of results based on the search queries
User-engagement (online method)
For the results on a SERP:
Label a result positive if the user clicked it or spent a longer time (dwell time) on the page.
Assign negative labels to randomly selected, non-engaged results.
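A sketch of the pairwise idea: build (preferred, other) pairs from relevance labels and score them with a RankNet-style pairwise logistic loss. The labels and document ids below are illustrative:

```python
import math

def pairwise_pairs(labeled_docs):
    """labeled_docs: [(doc_id, relevance_label), ...] for one query; higher label = more relevant."""
    pairs = []
    for i, (d_i, r_i) in enumerate(labeled_docs):
        for d_j, r_j in labeled_docs[i + 1:]:
            if r_i > r_j:
                pairs.append((d_i, d_j))      # d_i should be ranked above d_j
            elif r_j > r_i:
                pairs.append((d_j, d_i))
    return pairs

def pairwise_logistic_loss(score_preferred, score_other):
    """RankNet-style loss: small when the preferred document scores higher."""
    return math.log(1 + math.exp(-(score_preferred - score_other)))

print(pairwise_pairs([("a", 3), ("b", 1), ("c", 2)]))   # [('a','b'), ('a','c'), ('c','b')]
print(round(pairwise_logistic_loss(2.0, 0.5), 3))
```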
7. Ranking
For a large scale search engine, we need a multi-layer funnel approach. You want to build a model that will display the most relevant results for a searcher’s query. Extra emphasis should be applied to the relevance of the top few results since a user generally focuses on them more on the SERP.
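As one concrete option for the later (fine-ranking) stage, a LambdaRank-style gradient-boosted ranker can be trained on graded labels; it optimizes an NDCG-like objective, which puts extra weight on getting the top positions right. The sketch below assumes LightGBM is available and uses random toy features:

```python
import numpy as np
import lightgbm as lgb  # assumes LightGBM is installed

# Toy data: 2 queries with 3 candidate documents each (features are random stand-ins).
X = np.random.rand(6, 4)
y = np.array([3, 1, 0, 2, 2, 0])   # graded relevance labels, e.g., from human raters
group = [3, 3]                     # number of documents per query, in row order

# The lambdarank objective optimizes an NDCG-like target, so errors near the
# top of the ranking cost more than errors further down.
ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50, min_child_samples=1)
ranker.fit(X, y, group=group)

scores = ranker.predict(X[:3])     # score the first query's candidates
print(np.argsort(-scores))         # their ranked order, best first
```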
8. Filtering Results
Remove inappropriate results
We can use an ML-based classifier
Training data
Human raters
Online user feedback
Features
Use the same features as the ranker model uses
Building a classifier
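A minimal sketch of such a classifier reusing the ranker's feature vectors, assuming scikit-learn is available; the features and labels are purely synthetic stand-ins for human-rater and user-report data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # assumes scikit-learn is installed

# Synthetic stand-ins: ranker feature vectors plus inappropriate/OK labels
# (in practice from human raters and online user feedback).
X = np.random.rand(200, 4)
y = np.random.randint(0, 2, size=200)

clf = LogisticRegression().fit(X, y)

# At serving time, drop any ranked document whose predicted probability of being
# inappropriate exceeds a chosen threshold.
probabilities = clf.predict_proba(X[:5])[:, 1]
keep_mask = probabilities < 0.5
print(keep_mask)
```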