Ad Prediction System
The interviewer has asked you to build a system to show relevant ads to users.
The interview questions can be:
How would you build an ML system to predict the probability of engagement for Ads?
How would you build an Ads relevance system for a search engine?
How would you build an Ads relevance system for a social network?
1. Problem Statement
Predict the probability of engagement of an ad for a given user and context (query, device, etc.)
Categories of ad prediction application
Search Engine (Google)
The query will be part of the context along with the searcher.
The system will display ads based on the search query issued by the searcher
Social media platforms (Facebook)
User information (location, demographics, interest hierarchies) is part of the context and is used to select ads
Automatically detect user interest based on the user's historical interactions and display ads accordingly
==> The main difference:
The context that is used to select and predict ad engagement
2. Metrics
The metrics used in our ad prediction system will help select the best machine-learned models to show relevant ads to the user. They should also ensure that these models help the overall improvement of the platform, increase revenue, and provide value for the advertisers.
Offline metrics
Log Loss
AUC is not suitable for our task:
AUC does not penalize how far off the predicted score is from the actual label. For example, take two positive examples (i.e., with actual label 1) that have predicted scores of 0.51 and 0.7 at a threshold of 0.5. Both contribute equally to AUC, even though one score is much closer to the actual label than the other.
AUC is insensitive to well-calibrated probabilities.
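The calibration point can be made concrete with a toy example: two score sets that rank the examples identically (so their AUC is the same) but differ sharply in log loss. A minimal sketch, with illustrative numbers:

```python
import math

def log_loss(labels, preds):
    # Mean negative log-likelihood of the true labels; lower is better.
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, preds)) / len(labels)

# Both score sets rank the two positives above the negative, so AUC is
# identical (1.0) for both -- but the poorly calibrated scores sit far
# from the true labels, and log loss penalizes that gap.
labels = [1, 1, 0]
well_calibrated = [0.9, 0.8, 0.1]
poorly_calibrated = [0.55, 0.52, 0.45]
```

Here `log_loss(labels, well_calibrated)` is about 0.14, while the poorly calibrated scores yield roughly 0.62, even though AUC cannot tell the two models apart.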
Online metrics
Overall revenue
Revenue is basically computed as the sum of the winning bid value (as selected by auction) when the predicted event happens.
If the bid amounts to $0.5 and the user clicks on the ad, the advertiser will be charged $0.5
The business won’t be charged if the user doesn’t click on the ad.
Measuring revenue alone is a short-term view
If we don't provide enough value to advertisers, they will move away from the system
But revenue is still one critical metric to track
Overall ads engagement rate
Click rate
Measure the ratio of user clicks to ad impressions
Downstream action rate
Measure the action rate of a particular action targeted by the advertiser e.g. add to cart rate, purchase rate, message rate etc.
Counter metrics
Track counter metrics to see if the ads are negatively impacting the platform
Track online ads experiments:
Check whether session success rate drops significantly because of ads
Check whether the average number of queries per user is negatively impacted
Check whether the number of returning users on the platform drops
Track users' direct negative feedback:
Hide ads
Never see this ad
Report ad as inappropriate
3. Architectural Components
There are two main actors in the ad prediction system, advertisers and users, giving two flows: (1) advertiser flow; (2) user flow
Advertiser flow
Advertisers create ads containing their content as well as targeting, i.e., scenarios in which they want to trigger their ads.
Query-based targeting:
This method shows ads to the user based on the query terms. The query terms can be a partial match, full match, expansion, etc.
User-based targeting
Ads are shown to users matching specific attributes such as region, demographic segment, gender, age, etc.
Interest-based targeting
This method shows interest-based ads. For example, on Facebook an advertiser might want to show ads based on certain interest hierarchies.
Set-based targeting
This type shows ads to a set of users selected by the advertisers. For example, showing an ad to people who were previous buyers or have spent more than ten minutes on the website.
User flow

When a user queries the system, there are mainly two steps in the system:
Advertisers create ads providing targeting information, and the ads are stored in the ads index.
When a user queries the platform, ads can be selected from the index based on their information (e.g., demographics, interests, etc.) and run through our ads prediction system.
For each of the components:
Ad selection
Fetch the top k ads based on relevance (subject to the user context) and bid from the ads index.
Ad prediction
Predict user engagement with the ad (the probability that an action will be taken on the ad if it is shown), given the ad, advertiser, user, and context. Then, it will rank ads based on relevance score and bid.
Auction
An auction takes place to determine which ads to show. The top relevant ads selected by the ad prediction system are given as input to Auction. Auction then looks at total value based on an ad’s bid as well as its relevance score. ==> Ad with the highest value is the winner
The auction's total value depends on the following factors:
Bid:
The amount an advertiser places for that ad; in other words, the amount the advertiser is willing to pay for a given action such as a click or purchase.
User engagement rate
An estimate of user engagement with the ad.
Ad quality score
An assessment of the quality of the ad by taking into account feedback from people viewing or hiding the ad.
Budget
The advertiser’s budget for an ad
The ad predicted score is the result of combining the user engagement rate and the ad quality score
The ad rank score is calculated from the ad predicted score and the bid
The ad with the highest ad rank score wins the auction. Its cost per engagement (CPE) or cost per click (CPC) will depend on its ad rank score
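A minimal sketch of the ranking and pricing step. The rank score follows the notes (bid times predicted score); the second-price-style pricing rule, where the winner pays just enough to beat the runner-up, is an assumption, since the notes only say CPE depends on the ad rank score:

```python
def ad_rank_score(bid, predicted_score):
    # Rank score combines willingness to pay with predicted engagement/quality.
    return bid * predicted_score

ads = [
    # (ad_id, bid in $, ad predicted score)
    ("A", 0.50, 0.12),
    ("B", 0.80, 0.05),
    ("C", 0.40, 0.20),
]
ranked = sorted(ads, key=lambda a: ad_rank_score(a[1], a[2]), reverse=True)
winner, runner_up = ranked[0], ranked[1]

# Second-price-style CPE (assumed): pay just enough to beat the runner-up.
cpe = ad_rank_score(runner_up[1], runner_up[2]) / winner[2]
```

Note that ad C wins despite having the lowest bid, because its high predicted score gives it the best rank score, and its charged CPE ($0.30) stays below its bid ($0.40).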
Pacing
Pacing an ad means evenly spending the ad budget over the selected time period rather than trying to spend all of it at the start of the campaign.
Without pacing, an ad set could exhaust its budget in the first hours of the day; pacing avoids this by dynamically changing the bid so that spend is spread evenly throughout the day
The advertiser gets maximum ROI on their campaign
It prevents the overall system from having significantly high load at the start of the day
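Pacing can be sketched as a simple schedule check; the specific throttling rule below is an assumption (production systems typically use feedback controllers), but it captures the idea of comparing actual spend to an even-spend target:

```python
def paced_bid(base_bid, daily_budget, spent_so_far, day_fraction_elapsed):
    # Target an even spend: by fraction t of the day, roughly t * budget
    # should have been spent. If actual spend runs ahead of that schedule,
    # throttle the bid; otherwise bid normally.
    target_spend = daily_budget * day_fraction_elapsed
    if spent_so_far >= target_spend:
        return base_bid * 0.5  # ahead of schedule: slow down spending
    return base_bid

# Halfway through the day with 80% of a $100 budget gone -> throttled bid.
throttled = paced_bid(1.0, 100.0, 80.0, 0.5)
# Halfway through the day with only 30% spent -> bid normally.
normal = paced_bid(1.0, 100.0, 30.0, 0.5)
```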
Training data generation
We need to record the action taken on an ad. This component takes user action on the ads (displayed after the auction) and generates positive and negative training examples for the ad prediction component.
Positive: clicked the ad, viewed the ad
Negative: hid the ad, or no engagement after several (e.g., five) impressions
Funnel model approach
As we move down the funnel, the models become more complex and the set of ads they run on becomes smaller. For example:
A 30-year-old male user issues a query "machine learning"
Ads selection: select all ads that match the targeting criteria, using a simple model to compute each ad's relevance score
Ads selection: rank the ads by r = bid * relevance score and keep the top ones
Ads prediction: go over the selected ads and use a highly optimized ML model to compute an ad predicted score
Ads auction: run the auction algorithm on the bids and predicted scores to select the top most relevant ads
4. Feature Engineering

Ad specific features
ad_id
ad-content-raw-terms
historical-engagement-rate
engagement-last-24-hours
engagement-last-7-days
ad-impression
ad-negative-engagement-rate
ad-embedding
ad-age
ad-bid
Advertiser specific features
advertiser-domain
historical-engagement-rate
region-wise-engagement
User specific features
user-previous-search-term
user-search-term
age
gender
language
embedding-last-k-ads
engagement-content-type
engagement-days
platform-time-spent
region
Context specific features
current-region
time
device
User-ad cross features
embedding-similarity
region-wise-engagement
user-ad-categorical-histogram
user-ad-subcategory-histogram
user-gender-ad-histogram
user-age-ad-histogram
User-advertiser cross features
embedding similarity
user-gender-advertiser-histogram
user-age-advertiser-histogram
Key consideration for historical engagement features
Give the histogram data along with the key value to the model for learning.
For example, in the case of the daily historical engagement rate, we will give the model a histogram of seven-day user engagement from Sunday to Saturday, which looks like [0.4, 0.2, 0.21, 0.25, 0.27, 0.38, 0.42]. In addition, we will pass the current day as a feature as well, e.g., if the day is Sunday the feature value will be 1, and for Saturday it will be 7. So, the model will see eight features in total: seven raw features for each day's historical engagement rate and one feature for the current day of the week.
An alternative to giving the model this histogram is to create a single feature carrying the current day's engagement rate as a value
For the data above, this feature value would be 0.4 on Sunday and 0.38 on Friday
It is better to feed the model both the histogram and this single feature
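As a small sketch, the histogram features and the derived current-day feature can be assembled like this, using the day encoding from the notes (1 = Sunday … 7 = Saturday):

```python
def engagement_features(weekly_rates, day_of_week):
    # weekly_rates: seven-day engagement histogram, Sunday..Saturday.
    # day_of_week: 1 = Sunday ... 7 = Saturday.
    current_day_rate = weekly_rates[day_of_week - 1]
    # Feed both: the raw histogram plus the day index (8 features), and
    # the derived current-day rate (1 feature) -> 9 features in total.
    return weekly_rates + [day_of_week, current_day_rate]

# On Sunday (day 1), the current-day engagement rate feature is 0.4.
feats = engagement_features([0.4, 0.2, 0.21, 0.25, 0.27, 0.38, 0.42], 1)
```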
5. Training Data Generation
Training data generation through online user engagement
The advertiser specifies which user actions count as positive; for example:
Any user engagement with the ad is positive

Clicking an ad is positive

Adding to cart is positive

Balancing the positive and negative examples

Model recalibration
Negative downsampling can accelerate training while enabling us to learn from both positive and negative examples. However, our predicted model output will now be in the downsampled space. For instance, if our engagement rate is 5% and we keep only 10% of the negative samples, positives and negatives appear in training at a ratio of 5 : 9.5, so our average predicted engagement rate will be around 35% rather than 5%.
==> Auction uses this predicted rate to determine order and pricing; therefore it's critical that we recalibrate our scores before sending them to auction:
q = p / (p + (1 - p) / w), where
q is the re-calibrated prediction score
p is the prediction in the downsampled space
w is the negative downsampling rate
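A runnable sketch of the recalibration, using the standard mapping q = p / (p + (1 - p) / w), with w the fraction of negatives kept:

```python
def recalibrate(p, w):
    # Map a prediction p made in the negative-downsampled space back to
    # the original space; w is the negative downsampling rate.
    return p / (p + (1 - p) / w)

# With a 5% true engagement rate and 10% of negatives kept, positives and
# negatives appear roughly 5 : 9.5 in training, so the model predicts
# about 0.345 in the downsampled space; recalibration maps it back to the
# original ~5% rate. With no downsampling (w = 1), scores are unchanged.
q = recalibrate(0.345, 0.1)
```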
Train/test split
We need to be mindful of the fact that user engagement patterns may differ throughout the week. Hence we will use a week’s engagement to capture all patterns during training data generation.
6. Ad Selection
For the funnel-based modeling approach, there are three phases:

Phase 1: Selection of ads
Quickly select ads for the given query and user context according to the selection criteria. The key requirement for quick selection is to store the ads in a system that is fast and supports complex selection criteria.
Build an in-memory index to store the ads. It allows us to fetch ads quickly based on different targeting and relevance information from the user
So we index ads on all possible indexing terms that can be used for selection (like targeted terms, city, state, country, region, age)
For example, a male user, aged twenty-five, located in San Francisco, California, searches for content related to machine learning.
For a search system, the selection will be based on the search terms
For a feed-based system, the selection will be based on the user's interests
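The notes do not spell out the concrete selection queries, so the sketch below shows hypothetical AND-of-ORs matching semantics over the indexed fields (all field names and values are illustrative assumptions):

```python
def matches(ad_targeting, query):
    # AND across fields, OR within each field's accepted values; fields
    # an ad does not target default to the wildcard "unknown".
    return all(ad_targeting.get(field, "unknown") in accepted
               for field, accepted in query.items())

ad = {"term": "machine learning", "age": "25-30",
      "gender": "male", "region": "california"}

# Search system: selection keyed on the query terms plus user context.
search_query = {
    "term": ["machine learning"],
    "age": ["25-30", "unknown"],
    "gender": ["male", "unknown"],
    "region": ["san francisco", "california", "usa", "unknown"],
}

# Feed-based system: selection keyed on user interests instead of terms.
feed_query = {
    "interest": ["machine learning", "technology"],
    "age": ["25-30", "unknown"],
    "gender": ["male", "unknown"],
}
```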
Phase 2: Narrow down selection set to top relevant ads

To narrow down the selection set from phase 1, we use a simple scoring method to keep only the top relevant ads:
==> (bid) * (prior CPE score)
bid: the amount the advertiser is willing to pay for each engagement
prior CPE (cost per engagement) score: based on the ad, advertiser, and ad type
We simply reuse the historical score here rather than computing a new one
For a new ad or advertiser, we need to assign a slightly higher score, called a score boost.
boost = 0
if ad_age < 48:
    boost = 0.1 * (1 - ad_age / 48)
ad_score = bid * (1 + boost) * CPE
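The boost rule above, wrapped as a runnable function (ad age is assumed to be in hours, matching the 48-hour window):

```python
def new_ad_score(bid, prior_cpe, ad_age_hours):
    # Boost decays linearly from 10% at age 0 to 0% at 48 hours, giving
    # new ads with no engagement history a chance to be selected.
    boost = 0.0
    if ad_age_hours < 48:
        boost = 0.1 * (1 - ad_age_hours / 48)
    return bid * (1 + boost) * prior_cpe
```

A brand-new ad gets the full 10% boost, a 24-hour-old ad gets 5%, and an ad older than 48 hours scores on its history alone.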
Phase 3: Ranking of selected ads using a simplistic model

We can run a simplistic and efficient model on ads selected in phase 2 to further narrow it down. The top ads from this phase will be passed to our ad prediction stage to run more complex and accurate models for better user engagement prediction.
For this step, we can use simple models like logistic regression, random forest, boosted decision trees.
After this step,
we will get a new CPE (cost per engagement) score from the model.
Then, the ads are ranked on the new score: bid*(new CPE)
How does the system scale?
Note that our index is sharded
It runs on multiple machines, and every machine selects the top k ads based on the prior score
Each machine then runs a logistic regression model built on dense features to rank ads
There are no sparse features, which keeps the model size small

In this example:
The sharded index has four partitions
Each partition selects the top five hundred ads based on the prior score and then runs logistic regression to select its top fifty ads
The top fifty ads from each partition are fed to the ad prediction component
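That scatter-gather flow can be sketched as follows; the dummy scores stand in for the prior score and the per-shard logistic regression output:

```python
import heapq

def select_candidates(shards, k_prior=500, k_model=50):
    # Scatter: each partition picks its top-k ads by prior score, reranks
    # them with a light model score, and returns its local winners.
    # Gather: per-shard winners are concatenated for ad prediction.
    candidates = []
    for shard in shards:
        top_prior = heapq.nlargest(k_prior, shard,
                                   key=lambda ad: ad["prior_score"])
        top_model = heapq.nlargest(k_model, top_prior,
                                   key=lambda ad: ad["lr_score"])
        candidates.extend(top_model)
    return candidates

# Four partitions of 1,000 dummy ads each -> 4 * 50 = 200 candidates.
shards = [[{"id": f"{s}-{i}",
            "prior_score": ((s * 7 + i) % 97) / 97,
            "lr_score": ((s * 13 + i) % 89) / 89}
           for i in range(1000)] for s in range(4)]
candidates = select_candidates(shards)
```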
7. Ad prediction
Ads are generally short-lived.
=> our predictive model is going to be deployed in a dynamic environment where the ad set is continuously changing over time.
=> a more practical and efficient approach is to keep refreshing the model with the latest impressions and engagements at regular intervals (e.g., every 15 or 30 minutes)
==> online learning
The high-level idea:
Train a base model and keep adding new examples on top of it with every ad impression and engagement
Design a mechanism to generate the latest training examples via an online joiner.
The model trainer will then receive these new examples to refresh the model using stochastic gradient descent.
This forms a tightly closed loop where changes in the feature distribution or model output can be detected, learned from, and improved upon in quick succession.
Note:
The refresh of the model doesn't have to be instantaneous; we can do it in small batches at a certain frequency, e.g., every 30 or 60 minutes.

Model for online learning
We need to select models that can update themselves through stochastic gradient descent. Candidate models include:
Logistic regression
XGBoost
Neural Network
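A minimal numpy sketch of such an online-updatable model: logistic regression refreshed with one SGD step per incoming mini-batch. The hyperparameters and synthetic data are illustrative assumptions:

```python
import numpy as np

class OnlineLogisticRegression:
    """Logistic regression that refreshes itself on incoming mini-batches."""

    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))

    def partial_fit(self, X, y):
        # One SGD step on the log-loss gradient for this mini-batch.
        err = self.predict_proba(X) - y
        self.w -= self.lr * (X.T @ err) / len(y)
        self.b -= self.lr * err.mean()

rng = np.random.default_rng(0)
model = OnlineLogisticRegression(n_features=4)
# Base model, then periodic refreshes as freshly joined impression and
# engagement examples arrive (e.g., every 30 minutes).
for _ in range(200):
    X = rng.normal(size=(64, 4))
    y = (X[:, 0] + 0.1 * rng.normal(size=64) > 0).astype(float)
    model.partial_fit(X, y)
```

In a real system each `partial_fit` call would consume the latest batch from the online joiner rather than synthetic data.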
Neural Network for feature generation

A deep neural network can be trained on the raw input features to predict ad engagement, and its learned representations can then be used as features for our logistic regression model
Finally, we combine the neural-network-generated features with the other features as input to the logistic regression model
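One common pattern for this (a sketch; the architecture and shapes are assumptions): run the trained network forward, take the last hidden layer's activations as dense features, and concatenate them with the raw features for logistic regression:

```python
import numpy as np

rng = np.random.default_rng(0)

def nn_hidden_features(X, W1, W2):
    # Forward pass of a (notionally pre-trained) two-layer network; the
    # final hidden layer's activations serve as learned dense features.
    h1 = np.maximum(X @ W1, 0)  # ReLU
    return np.maximum(h1 @ W2, 0)

# Stand-ins for pre-trained weights (random here, for shape illustration).
W1 = rng.normal(size=(16, 32))
W2 = rng.normal(size=(32, 8))

X_raw = rng.normal(size=(100, 16))            # raw ad/user/context features
X_nn = nn_hidden_features(X_raw, W1, W2)      # 8 NN-generated features
X_lr = np.concatenate([X_raw, X_nn], axis=1)  # input to logistic regression
```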

8. Retraining plan
We may also need to consider retraining the model based on monitoring of its user engagement prediction performance