Search Ranking
Design a search relevance system for a search engine.
1. Problem Statement
Problem:
Please design a search relevance system for a search engine
Clarifying questions
Clarify the requirements along three aspects: scope, scale, and personalization
Problem scope
Is it a general search engine like Google or Bing or a specialized search engine like Amazon products search?
==> Build a generic search engine that returns relevant results for queries like "Richard Nixon", "Programming languages" etc.
Scale
How many websites (documents) do you want to make searchable through this search engine?
How many requests per second do you anticipate handling?
Assume there are billions of documents to search from, and getting around 10K queries per second (QPS)
==> Use a funnel (multi-stage) approach to deal with this large-scale search system
Personalization
Is the searcher a logged-in user or not?
Assume the user is logged in and we have access to their profile and historical search data
2. Metrics
There are two types of metrics to evaluate the success of a search query: (1) Online metrics; (2) Offline metrics.
Online metrics
Online metrics are those computed from user interactions in a live system.
Click-through rate (CTR)
Successful session rate (SSR)
CTR may count unsuccessful clicks as successes, for example, short clicks where the user only glances at the resulting document and clicks back immediately.
==> Successful sessions can be defined as the ones that have a click with 10s or longer dwell time. Dwell time is the length of time a searcher spends viewing a webpage after they’ve clicked a link on a search engine result page (SERP).
Caveat
Zero-click searches: A SERP may answer the searcher’s query right at the top such that the searcher doesn’t need any further clicks to complete the search.
To deal with zero-click searches, we can use Time To Success.
In a query session, the user may try several keywords to find the result they want. Time to success measures how long (or how many queries) it takes the user to reach the ideal result.
==> Our goal is to help the user issue as few queries as possible and to rank the desired answer as high on the results page as possible (a small sketch of these online metrics follows).
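A minimal sketch of computing the successful session rate from session logs, assuming a hypothetical log schema with per-click dwell times and the 10-second dwell-time threshold described above:

```python
# Hypothetical session log: each session lists its clicks with dwell times in seconds.
sessions = [
    {"query": "richard nixon", "clicks": [{"dwell_time_s": 42}]},
    {"query": "programming languages", "clicks": [{"dwell_time_s": 3}]},  # short click
    {"query": "weather today", "clicks": []},                             # zero-click search
]

DWELL_THRESHOLD_S = 10  # a click counts as successful only past this dwell time

def successful_session_rate(sessions):
    successes = sum(
        any(c["dwell_time_s"] >= DWELL_THRESHOLD_S for c in s["clicks"])
        for s in sessions
    )
    return successes / len(sessions)

print(successful_session_rate(sessions))  # 1/3 here; the zero-click case needs time to success
```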
Offline metrics
The offline methods for measuring a successful search session make use of trained human raters.
NDCG
NDCG is an improvement over cumulative gain (CG):

$CG_p = \sum_{i=1}^{p} rel_i$

where $rel_i$ is the relevance rating of the document at position $i$ (assigned by a human rater) and $p$ is the position up to which the documents are ranked.

DCG discounts each document's gain by its position in the human-rated list:

$DCG_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i + 1)}$

The intuition behind the discount is that the search engine will not be of much use if it doesn't show the most relevant documents at the top of search result pages.

$NDCG_p = \frac{DCG_p}{IDCG_p}$, where IDCG is the ideal discounted cumulative gain, i.e., the DCG of an ideal ordering of the search engine's result list.

For N queries, the NDCG is averaged:

$NDCG = \frac{1}{N} \sum_{k=1}^{N} NDCG_k$
Note:
NDCG does not penalize irrelevant search results. If a document has zero relevance according to the human rater, NDCG just assigns 0 to it.
To remedy this, the human rater can assign a negative relevance score to irrelevant documents.
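A minimal sketch of the DCG/NDCG computation above, using hypothetical human ratings on a 0-3 scale:

```python
import math

def dcg(rels, p=None):
    """Discounted cumulative gain: rel_i / log2(i + 1) summed over ranked positions."""
    rels = rels if p is None else rels[:p]
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))

def ndcg(rels, p=None):
    """DCG normalized by the ideal DCG (documents sorted by descending relevance)."""
    ideal = dcg(sorted(rels, reverse=True), p)
    return dcg(rels, p) / ideal if ideal > 0 else 0.0

# Human ratings (0-3) of the documents in ranked order, for two queries:
queries = [[3, 2, 3, 0], [1, 3, 2, 0]]
print(sum(ndcg(q) for q in queries) / len(queries))  # mean NDCG over N queries
```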
3. Architecture Components
Architecture

Query rewriting
Queries are often poorly worded and far from describing the searcher's actual needs
Spell checker: fix the typo
Query expansion: improve search results by adding terms to the user's query
Query relaxation: remove some words to improve search results (a small sketch of these rewriting steps follows this list)
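A toy sketch of the three rewriting steps, assuming a hypothetical vocabulary and synonym table; a real system would use learned models and much larger dictionaries:

```python
import difflib

VOCAB = {"italian", "restaurants", "food", "gas", "stations"}   # hypothetical dictionary
SYNONYMS = {"restaurants": ["food"]}                            # hypothetical expansion table

def spell_check(query):
    """Map each term to its closest dictionary word (simple edit-distance correction)."""
    fixed = []
    for term in query.lower().split():
        match = difflib.get_close_matches(term, VOCAB, n=1, cutoff=0.8)
        fixed.append(match[0] if match else term)
    return " ".join(fixed)

def expand(query):
    """Add synonym terms so more relevant documents can match."""
    terms = query.split()
    return terms + [s for t in terms for s in SYNONYMS.get(t, [])]

def relax(query, optional_terms={"best", "cheap"}):
    """Drop low-information terms when the full query returns too few results."""
    return [t for t in query.split() if t not in optional_terms]

print(spell_check("italain restuarants"))   # -> "italian restaurants"
print(expand("italian restaurants"))        # -> ['italian', 'restaurants', 'food']
print(relax("best italian restaurants"))    # -> ['italian', 'restaurants']
```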
Query understanding
Figuring out the main intent behind the query
Query "gas stations" -> local intent
"Earthquake" -> newsy intent
Document selection
Sift through billions of documents to retrieve millions of relevant documents
Usually, this is done with a simple query against the index (see Document Selection below) rather than an ML model
Ranker
Use ML to find the best order of documents
If the number of documents is significantly large (>10K) and the amount of traffic is also huge (~10K QPS)
==> Use multiple stages of ranking with varying degrees of complexity and model sizes
Blender
Gives relevant results from various search verticals, like images, videos, news, local results...
==> It helps improve the diversity of results.
The final output is the Search Engine Result Page (SERP) in response to the searcher's query (a minimal blending sketch follows)
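One simple way a blender can mix verticals is round-robin interleaving; the sketch below is illustrative and the vertical names and result ids are assumptions:

```python
from itertools import zip_longest

def blend(verticals):
    """Round-robin interleave results from each vertical (web, images, news, ...)."""
    blended = []
    for group in zip_longest(*verticals.values()):
        blended.extend(item for item in group if item is not None)
    return blended

verticals = {
    "web":    ["web1", "web2", "web3"],
    "images": ["img1", "img2"],
    "news":   ["news1"],
}
print(blend(verticals))  # ['web1', 'img1', 'news1', 'web2', 'img2', 'web3']
```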
Training data generation
It takes online user engagement data from the SERPs shown in response to queries and generates positive and negative training examples.
The training data generated is fed to the machine learning models for training.
Layered model approach

Across the layers, we use simple models at the beginning for coarse filtering and more complex models later for fine-grained document ranking (a toy funnel sketch follows)
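A toy sketch of the layered (funnel) idea: a cheap stage-1 score prunes the candidate pool, then a more expensive stage-2 score reranks the survivors. The scoring functions and cutoffs here are stand-ins, not the actual models; real cutoffs would be on the order of 100K and a few hundred:

```python
def cheap_score(doc, query_terms):
    # Stage 1: lightweight signal, e.g., raw term overlap.
    return len(doc["terms"] & query_terms)

def expensive_score(doc, query_terms):
    # Stage 2: stand-in for a heavier learned ranker.
    return cheap_score(doc, query_terms) + doc["popularity"]

def funnel_rank(docs, query_terms, stage1_keep=3, stage2_keep=2):
    stage1 = sorted(docs, key=lambda d: cheap_score(d, query_terms), reverse=True)[:stage1_keep]
    return sorted(stage1, key=lambda d: expensive_score(d, query_terms), reverse=True)[:stage2_keep]

docs = [
    {"id": 1, "terms": {"italian", "restaurants"}, "popularity": 0.9},
    {"id": 2, "terms": {"italian", "food"},        "popularity": 0.5},
    {"id": 3, "terms": {"mexican", "food"},        "popularity": 0.7},
    {"id": 4, "terms": {"weather"},                "popularity": 0.2},
]
print([d["id"] for d in funnel_rank(docs, {"italian", "restaurants"})])  # [1, 2]
```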
4. Document Selection
From the one-hundred billion documents on the internet, let's retrieve the top one-hundred thousand that are relevant to the searcher's query.
Inverted Index
An index data structure that stores a mapping from content, such as words or numbers, to its location in a set of documents.

Document selection process
Searcher input query:
"Italian restaurants"
Selection criteria:
Query expansion to "Italian food"
==> Retrieve documents that (match "Italian" AND (match "restaurants" OR match "food")); a small inverted-index sketch follows
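A minimal sketch of building an inverted index and applying the AND/OR selection criteria above, using a toy three-document corpus:

```python
from collections import defaultdict

docs = {
    1: "best italian restaurants in town",
    2: "italian food recipes at home",
    3: "mexican food near me",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Selection criteria: match "italian" AND (match "restaurants" OR match "food").
selected = index["italian"] & (index["restaurants"] | index["food"])
print(selected)  # {1, 2}
```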
Relevance scoring scheme
For the selected documents, we need to rank them based on a relevance score, which can combine the factors below (a weighted-sum sketch follows this list):
Terms match:
Use the IDF of each term to weigh the match; matches on important (rare) query terms weigh higher
Document match:
Use a query-independent document signal, such as the document's popularity score
Query intent match:
How well the document matches the query intent (e.g., local, news, photo). For a restaurant query, documents with local intent match best
Personalization match:
How well a document meets the searchers' individual requirements
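A sketch of combining these factors into a single relevance score as a weighted sum; the weights and field names are purely illustrative (in practice they would be tuned or learned):

```python
import math

def idf(term, index, num_docs):
    """Inverse document frequency: rarer terms carry more weight in the match score."""
    df = len(index.get(term, ()))
    return math.log((num_docs + 1) / (df + 1))

def relevance_score(doc, query_terms, index, num_docs,
                    w_terms=0.5, w_pop=0.2, w_intent=0.2, w_personal=0.1):
    """Weighted combination of the four factors; the weights are illustrative."""
    term_match = sum(idf(t, index, num_docs) for t in query_terms if t in doc["terms"])
    return (w_terms * term_match
            + w_pop * doc["popularity"]
            + w_intent * doc["intent_match"]
            + w_personal * doc["personalization_match"])

# Toy example reusing the "italian restaurants" query.
index = {"italian": {1, 2}, "restaurants": {1}, "food": {2, 3}}
doc = {"terms": {"italian", "restaurants"}, "popularity": 0.8,
       "intent_match": 1.0, "personalization_match": 0.3}
print(relevance_score(doc, ["italian", "restaurants"], index, num_docs=3))
```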
5. Feature Engineering
There are mainly four actors in a search from which we derive features:
Searcher
Query
Document
Context

Searcher-specific features
Age
Gender
Interests
Query-specific features
Query historical engagement
Historical engagement with a query helps capture users' preferences among the results
Query Intent
Identify the kind of information the searcher is looking for, e.g., news, local, commerce
Document-specific features
Page rank
Document engagement radius
Context-specific features
Time of search
Recent events
Searcher-document features
Distance
Historical engagement
Query-document features
Text Match
Unigram or bigram
Query-document historical engagement
Click rate
Embedding
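A sketch of assembling one training/serving feature row from the four actors; every field name below is illustrative, not a fixed schema:

```python
def build_features(searcher, query, document, context):
    """Assemble one feature row from the four actors; all field names are illustrative."""
    return {
        # searcher-specific
        "searcher_age": searcher["age"],
        # query-specific
        "query_intent_local": int(query["intent"] == "local"),
        "query_historical_ctr": query["historical_ctr"],
        # document-specific
        "page_rank": document["page_rank"],
        # searcher-document
        "distance_km": document["distance_km"],
        "prior_engagements": document["prior_engagements"],
        # query-document
        "bigram_match": document["bigram_match"],
        "query_doc_click_rate": document["click_rate"],
        "embedding_similarity": document["embedding_sim"],
        # context-specific
        "hour_of_day": context["hour"],
    }

row = build_features(
    searcher={"age": 35},
    query={"intent": "local", "historical_ctr": 0.12},
    document={"page_rank": 0.8, "distance_km": 2.5, "prior_engagements": 3,
              "bigram_match": 1, "click_rate": 0.2, "embedding_sim": 0.7},
    context={"hour": 19},
)
print(row)
```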
6. Training Data Generation
Training data generation for pointwise approach
Pointwise approach: In this approach of model training, the training data consists of relevance scores for each document. The loss function looks at the score of one document at a time as an absolute ranking.
Generating positive and negative training examples
From a user's SERP, we pick documents the user clicked as positive examples and documents the user did not click as negative examples.
To remedy the shortage of negative examples, we can add random negative examples, e.g., documents from the 50th page of the results.
Train/test split: be careful with the features; if historical logs are used, make sure there is no data leakage (e.g., split by time so that training data precedes test data). A sketch of generating these examples follows.
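A sketch of turning one SERP's engagement log into pointwise examples, with random negatives drawn from far down the results; the log schema and document ids are assumed:

```python
import random

def pointwise_examples(impressions, tail_doc_ids, n_random_negatives=2):
    """
    impressions: rows for one SERP, e.g. {"query": q, "doc_id": d, "clicked": True/False}.
    tail_doc_ids: ids from far down the results (e.g., around page 50), used as
                  extra random negatives when clicked-but-negative data is scarce.
    """
    examples = []
    for row in impressions:
        examples.append((row["query"], row["doc_id"], 1 if row["clicked"] else 0))
    query = impressions[0]["query"]
    for doc_id in random.sample(tail_doc_ids, k=min(n_random_negatives, len(tail_doc_ids))):
        examples.append((query, doc_id, 0))
    return examples

serp = [
    {"query": "italian restaurants", "doc_id": 11, "clicked": True},
    {"query": "italian restaurants", "doc_id": 12, "clicked": False},
]
print(pointwise_examples(serp, tail_doc_ids=[501, 502, 503]))
```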
Training data generation for pairwise approach
Pairwise approach: This approach differs from the pointwise approach during model training. Here, the loss function looks at the scores of document pairs as an inequality instead of the score of a single document. This enables the model to learn to rank documents according to their relative order which is closer to the nature of ranking.
There are two methods for data generation:
Human raters (offline method)
Rate a list of results based on the search queries
User-engagement (online method)
For the results on a SERP:
Label a result positive if the user clicked it or spent a longer time (dwell time) on the page.
Assign negative labels to randomly selected, non-engaged results.
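A sketch of the pairwise idea: build (preferred, other) pairs from relevance labels and score them with a RankNet-style pairwise logistic loss. The labels and document ids below are illustrative:

```python
import math

def pairwise_pairs(labeled_docs):
    """labeled_docs: [(doc_id, relevance_label), ...] for one query; higher label = more relevant."""
    pairs = []
    for i, (d_i, r_i) in enumerate(labeled_docs):
        for d_j, r_j in labeled_docs[i + 1:]:
            if r_i > r_j:
                pairs.append((d_i, d_j))      # d_i should be ranked above d_j
            elif r_j > r_i:
                pairs.append((d_j, d_i))
    return pairs

def pairwise_logistic_loss(score_preferred, score_other):
    """RankNet-style loss: small when the preferred document scores higher."""
    return math.log(1 + math.exp(-(score_preferred - score_other)))

print(pairwise_pairs([("a", 3), ("b", 1), ("c", 2)]))   # [('a','b'), ('a','c'), ('c','b')]
print(round(pairwise_logistic_loss(2.0, 0.5), 3))
```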
7. Ranking
For a large scale search engine, we need a multi-layer funnel approach. You want to build a model that will display the most relevant results for a searcher’s query. Extra emphasis should be applied to the relevance of the top few results since a user generally focuses on them more on the SERP.
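As one concrete option for the later (fine-ranking) stage, a LambdaRank-style gradient-boosted ranker can be trained on graded labels; it optimizes an NDCG-like objective, which puts extra weight on getting the top positions right. The sketch below assumes LightGBM is available and uses random toy features:

```python
import numpy as np
import lightgbm as lgb  # assumes LightGBM is installed

# Toy data: 2 queries with 3 candidate documents each (features are random stand-ins).
X = np.random.rand(6, 4)
y = np.array([3, 1, 0, 2, 2, 0])   # graded relevance labels, e.g., from human raters
group = [3, 3]                     # number of documents per query, in row order

# The lambdarank objective optimizes an NDCG-like target, so errors near the
# top of the ranking cost more than errors further down.
ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50, min_child_samples=1)
ranker.fit(X, y, group=group)

scores = ranker.predict(X[:3])     # score the first query's candidates
print(np.argsort(-scores))         # their ranked order, best first
```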
8. Filtering Results
Remove inappropriate results
We can use an ML-based classifier
Training data
Human raters
Online user feedback
Features
Use the same features as the ranker model uses
Building a classifier
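A minimal sketch of such a classifier reusing the ranker's feature vectors, assuming scikit-learn is available; the features and labels are purely synthetic stand-ins for human-rater and user-report data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # assumes scikit-learn is installed

# Synthetic stand-ins: ranker feature vectors plus inappropriate/OK labels
# (in practice from human raters and online user feedback).
X = np.random.rand(200, 4)
y = np.random.randint(0, 2, size=200)

clf = LogisticRegression().fit(X, y)

# At serving time, drop any ranked document whose predicted probability of being
# inappropriate exceeds a chosen threshold.
probabilities = clf.predict_proba(X[:5])[:, 1]
keep_mask = probabilities < 0.5
print(keep_mask)
```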