Ad Prediction System

The interviewer has asked you to build a system to show relevant ads to users.

Typical interview questions include:

  • How would you build an ML system to predict the probability of engagement for Ads?

  • How would you build an Ads relevance system for a search engine?

  • How would you build an Ads relevance system for a social network?

1. Problem Statement

Predict the probability of engagement of an ad for a given user and context (query, device, etc.).

Categories of ad prediction application

  • Search Engine (Google)

    • The query will be part of the context along with the searcher.

    • The system will display ads based on the search query issued by the searcher

  • Social media platforms (Facebook)

    • User information (location, demographics, interest hierarchy) is part of the context and is used to select ads

    • Automatically detect user interest based on the user's historical interactions and display ads accordingly

  • ==> The main difference:

    • The context that is used to select ads and predict engagement

2. Metrics

The metrics used in our ad prediction system will help select the best machine-learned models to show relevant ads to the user. They should also ensure that these models help the overall improvement of the platform, increase revenue, and provide value for the advertisers.

Offline metrics

Log Loss

Loss = -\frac{1}{N}\sum^N_{i=1}[y_i \log p_i + (1-y_i)\log(1-p_i)]
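
A minimal sketch of computing this loss with NumPy (the function name and example values are illustrative):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Binary cross-entropy averaged over N examples."""
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    y = np.asarray(y_true, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(log_loss([1, 0, 1], [0.9, 0.2, 0.7]))  # ~0.228
```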

AUC is not suitable for our task:

  • AUC does not penalize for “how far off” a predicted score is from the actual label. For example, take two positive examples (i.e., with actual label 1) that have predicted scores of 0.51 and 0.7 at threshold 0.5. Both count as correct and contribute equally to AUC, even though 0.51 is much farther from the actual label (log loss, in contrast, would penalize it more heavily).

  • AUC is insensitive to well-calibrated probabilities.

Online metrics

Overall revenue

  • Revenue is basically computed as the sum of the winning bid value (as selected by auction) when the predicted event happens.

    • If the bid amounts to $0.5 and the user clicks on the ad, the advertiser will be charged $0.5

    • The advertiser won’t be charged if the user doesn’t click on the ad.

  • Measuring revenue alone is a very short-term approach

    • If we don’t provide enough value to advertisers, they will move away from the system

    • Still, revenue is a critical metric to track

Overall ads engagement rate

  • Click rate

    Measure the ratio of user clicks to ad impressions

  • Downstream action rate

    Measure the rate of a particular action targeted by the advertiser, e.g., add-to-cart rate, purchase rate, message rate, etc.

Counter metrics

Track counter metrics to see if the ads are negatively impacting the platform

  • Track online ads experiments:

    • Examine whether session success rate goes down significantly because of ads

    • Examine whether the average number of queries per user is impacted

    • Examine the number of returning users on the platform

  • Track users' direct negative feedback:

    • Hide ads

    • Never see this ad

    • Report ad as inappropriate

3. Architectural Components

There are mainly two actors in the ad prediction system, giving two flows: (1) the advertiser flow; (2) the user flow

Advertiser flow

Advertisers create ads containing their content as well as targeting, i.e., scenarios in which they want to trigger their ads.

  • Query-based targeting:

    • This method shows ads to the user based on the query terms. The query terms can be a partial match, full match, expansion, etc.

  • User-based targeting

    • The ads are targeted to users based on a specific region, demographic, gender, age, etc.

  • Interest-based targeting

    • This method shows interest-based ads. For example, on Facebook, an advertiser might want to show ads based on certain interest hierarchies.

  • Set-based targeting

    • This type shows ads to a set of users selected by the advertisers. For example, showing an ad to people who were previous buyers or have spent more than ten minutes on the website.

User flow

Architectural diagram of ad prediction system

At a high level, the system works in two main steps:

  • Advertisers create ads providing targeting information, and the ads are stored in the ads index.

  • When a user queries the platform, ads are selected from the index based on the user's information (e.g., demographics, interests, etc.) and run through our ads prediction system.

For each of the components:

  • Ad selection

    • Fetch the top k ads based on relevance (subject to the user context) and bid from the ads index.

  • Ad prediction

    • Predict user engagement with the ad (the probability that an action will be taken on the ad if it is shown), given the ad, advertiser, user, and context. Then, it will rank ads based on relevance score and bid.

  • Auction

    • An auction takes place to determine which ads to show. The top relevant ads selected by the ad prediction system are given as input to the auction, which then computes a total value based on each ad’s bid as well as its relevance score. ==> The ad with the highest value is the winner

    • The auction total value depends on the following factors:

      • Bid:

        The amount an advertiser places for that ad; in other words, the amount the advertiser is willing to pay for a given action such as a click or purchase.

      • User engagement rate

        An estimate of user engagement with the ad.

      • Ad quality score

        An assessment of the quality of the ad by taking into account feedback from people viewing or hiding the ad.

      • Budget

        The advertiser’s budget for an ad

    • The ad predicted score is a combination of the user engagement rate and the ad quality score

    • The ad rank score is calculated from the ad predicted score and the bid:

      Ad\;rank\;score = Ad\;predicted\;score \times bid

    • The ad with the highest ad rank score wins the auction. Its cost per engagement (CPE) or cost per click (CPC) will depend on the ad rank score of the ad below it:

      • CPE = \frac{Ad\;rank\;of\;ad\;below}{Ad\;rank\;score} + 0.01
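
To make the ranking and pricing concrete, here is a minimal sketch with hypothetical bids and predicted scores; it ranks ads by ad rank score and prices each ad against the one below it using the CPE formula above:

```python
# (ad name, bid in $, ad predicted score); all numbers are hypothetical
ads = [("A", 2.0, 0.05), ("B", 1.0, 0.12), ("C", 3.0, 0.03)]

# rank by ad rank score = ad predicted score * bid
ranked = sorted(ads, key=lambda a: a[1] * a[2], reverse=True)

# the last-ranked ad has no ad below it, so it is not priced here
for (name, bid, score), below in zip(ranked, ranked[1:]):
    rank_score = bid * score
    cpe = (below[1] * below[2]) / rank_score + 0.01  # second-price style pricing
    print(f"{name}: rank score {rank_score:.2f}, CPE ${cpe:.2f}")
```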

Pacing

Pacing an ad means evenly spending the ad budget over the selected time period rather than trying to spend all of it at the start of the campaign.

Without pacing, an ad campaign could exhaust most of its budget early in the day. Pacing overcomes this by dynamically changing the bid so that the ad set's spend is spread evenly throughout the day

  • The advertiser gets maximum ROI on their campaign

  • It prevents the overall system from having a significantly high load at the start of the day
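
A minimal sketch of one simple pacing heuristic (purely illustrative; production systems typically use feedback controllers, and all names and numbers here are assumptions):

```python
def pacing_multiplier(spent, budget, elapsed_hours, total_hours=24):
    """Throttle the bid when spend runs ahead of an even-spend schedule."""
    target_spend = budget * elapsed_hours / total_hours
    if spent <= target_spend:
        return 1.0                         # on or behind schedule: bid normally
    return max(0.1, target_spend / spent)  # ahead of schedule: scale the bid down

base_bid = 0.50  # dollars per engagement
effective_bid = base_bid * pacing_multiplier(spent=60, budget=100, elapsed_hours=6)
print(effective_bid)  # 0.5 * (25 / 60) ≈ 0.21
```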

Training data generation

We need to record the action taken on an ad. This component takes user action on the ads (displayed after the auction) and generates positive and negative training examples for the ad prediction component.

  • Positive: clicked the ad, viewed the ad

  • Negative: hid the ad, or the ad was shown five times with no engagement
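
A minimal sketch of how logged impressions might be mapped to labels (the field names are hypothetical):

```python
def label(impression):
    """Map a logged ad impression to a training label (None = skip)."""
    if impression["action"] in ("click", "view"):
        return 1  # positive example
    if impression["action"] == "hide" or impression["unviewed_impressions"] >= 5:
        return 0  # negative example
    return None   # outcome not determined yet

print(label({"action": "click", "unviewed_impressions": 0}))  # 1
```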

Funnel model approach

As we go down the funnel, the complexity of the models increases and the set of ads they run on shrinks. For example:

  • A 30-year-old male user issues a query "machine learning"

  • Ads selection: select all ads whose targeting criteria match and use a simple model to compute each ad's relevance score

  • Ads selection: rank the ads according to r = bid * relevance score and keep the top ads

  • Ads prediction: go over the selected ads and use a highly optimized ML model to compute an ad predicted score

  • Ads auction: run the auction algorithm on the bids and predicted scores to select the top most relevant ads

4. Feature Engineering

  • Ad specific features

    • ad_id

    • ad-content-raw-terms

    • historical-engagement-rate

      • engagement-last-24-hours

      • engagement-last-7-days

    • ad-impression

    • ad-negative-engagement-rate

    • ad-embedding

    • ad-age

    • ad-bid

  • Advertiser specific features

    • advertiser-domain

    • historical-engagement-rate

    • region-wise-engagement

  • User specific features

    • user-previous-search-term

    • user-search-term

    • age

    • gender

    • language

    • embedding-last-k-ads

    • engagement-content-type

    • engagement-days

    • platform-time-spent

    • region

  • Context specific features

    • current-region

    • time

    • device

  • User-ad cross features

    • embedding-similarity

    • region-wise-engagement

    • user-ad-categorical-histogram

    • user-ad-subcategory-histogram

    • user-gender-ad-histogram

    • user-age-ad-histogram

  • User-advertiser cross features

    • embedding similarity

    • user-gender-advertiser-histogram

    • user-age-advertiser-histogram

Key consideration for historical engagement features

  • Give the histogram data along with the key value to the model for learning.

    For example, in the case of daily historical engagement rate, we will give the model a histogram of seven-day user engagement from Sunday to Saturday, which looks like [0.4, 0.2, 0.21, 0.25, 0.27, 0.38, 0.42]. In addition, we will pass the current day as a feature as well, e.g., if the day is Sunday the feature value of day will be 1, and for Saturday the feature value will be 7. So, the model will see eight features in total: seven raw features for each day's historical engagement rate and one feature for the current day of the week.

  • A simpler way of giving this information to our model is to create a single feature that holds the current day's engagement rate as its value.

    The feature value on Sunday will be 0.4, and on Friday it will be 0.38 for the above data.

  • It would be better to feed both of those features to the model
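
A minimal sketch of constructing both representations for the example above (feature names are illustrative):

```python
# Seven-day engagement-rate histogram, Sunday (day 1) through Saturday (day 7)
weekly_rates = [0.4, 0.2, 0.21, 0.25, 0.27, 0.38, 0.42]
day_of_week = 6  # Friday

# Representation 1: seven raw features plus the current day of the week
features = {f"engagement_rate_day_{i + 1}": r for i, r in enumerate(weekly_rates)}
features["day_of_week"] = day_of_week

# Representation 2: a single feature holding the current day's engagement rate
features["engagement_rate_current_day"] = weekly_rates[day_of_week - 1]  # 0.38
```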

5. Training Data Generation

Training data generation through online user engagement

The advertiser specifies which user actions count as positive, for example:

  • Any user action is positive

  • Clicking the ad is positive

  • Adding to cart is positive

Balancing the positive and negative examples

Model recalibration

Negative downsampling can accelerate training while enabling us to learn from both positive and negative examples. However, our model's predictions will now live in the downsampled space. For instance, if our engagement rate is 5% and we keep only 10% of the negative samples, the average predicted engagement rate will be inflated to roughly 34% (5 positives against 9.5 sampled negatives per 100 original examples).

==> The auction uses this predicted rate to determine order and pricing; therefore, it’s critical that we recalibrate our scores before sending them to the auction.

q = \frac{p}{p + (1-p)/w}
  • q is the re-calibrated prediction score

  • p is the prediction in downsampling space

  • w is the negative downsampling rate (the fraction of negative examples kept)
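
A minimal sketch of this recalibration, with a check that it maps a downsampled-space prediction back near the original 5% engagement rate:

```python
def recalibrate(p, w):
    """Map a prediction p made after negative downsampling (rate w)
    back to the original probability space."""
    return p / (p + (1 - p) / w)

# With w = 0.1 (10% of negatives kept), a downsampled-space prediction
# of ~0.34 recalibrates back to roughly the true 5% rate:
print(recalibrate(0.34, 0.1))  # ~0.049
```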

Train/test split

We need to be mindful of the fact that user engagement patterns may differ throughout the week. Hence, we will use a full week's engagement data to capture all such patterns during training data generation.

6. Ad Selection

For the funnel-based approach to modeling, there are three phases:

Phases in ML systems

Phase 1: Selection of ads

Quickly select ads for the given query and user context according to the selection criteria. The key requirement for quick selection is to store the ads in a system that is fast and enables complex selection criteria.

  • Build an in-memory index to store the ads. It allows us to fetch ads quickly based on different targeting and relevance information from the user

  • So we index ads on all possible indexing terms that can be used for selection (like targeted terms, city, state, country, region, age)

For example, suppose a male user, aged twenty-five, located in San Francisco, California, searches for content related to machine learning.

  • For a search system, the selection will be based on the search term; the selection query will look like the first sketch below

  • For a feed-based system, the selection will be based on the user's interests; the selection query will look like the second sketch below
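
Hedged sketches of what the two selection queries might look like, written here as Python filter specs (the real index query language will differ; "ANY" marks attributes the ad did not restrict):

```python
# Search system: selection driven by the query term plus user context
search_selection = {
    "term":   {"machine learning"},
    "age":    {25, "ANY"},
    "gender": {"male", "ANY"},
    "region": {"San Francisco", "ANY"},
}

# Feed-based system: selection driven by user interests instead of a term
feed_selection = {
    "interest": {"machine learning", "technology", "ANY"},
    "age":      {25, "ANY"},
    "gender":   {"male", "ANY"},
    "region":   {"San Francisco", "ANY"},
}
```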

Phase 2: Narrow down selection set to top relevant ads

Phase 1 on left with one million ads selected. Phase 2 on the right using bid * prior CPM to narrow it down to top ten-thousand ads

To narrow down the set selected in phase 1, we use a simple scoring method to keep only the top relevant ads:

==> (bid) * (prior CPE score)

  • bid: the amount the advertiser is willing to pay for each engagement

  • prior CPE (Cost per engagement) score: based on the ad, advertiser, ad type.

    • Basically, we use the score from historical engagement rather than computing it online

    • For a new ad or advertiser, we need to assign a slightly higher score, which is called a score boost:

      # give ads younger than 48 hours a decaying score boost
      boost = 0
      if ad_age < 48:
          boost = 0.1 * (1 - ad_age / 48)
      ad_score = bid * (1 + boost) * CPE
Phase 3: Ranking of selected ads using a simplistic model

Ranking of ads in phase 3 from an efficient ML model of results coming from phase 2

We can run a simplistic and efficient model on ads selected in phase 2 to further narrow it down. The top ads from this phase will be passed to our ad prediction stage to run more complex and accurate models for better user engagement prediction.

For this step, we can use simple models such as logistic regression, random forests, or boosted decision trees.

After this step,

  • we will get a new CPE (cost per engagement) score from the model.

  • Then, the ads are ranked on the new score: bid*(new CPE)

How does the system scale?

Note that our index is sharded

  • It runs on multiple machines and every machine selects the top k ads based on the prior score

  • Each machine runs a logistic regression model built on dense features to rank ads.

    • Only dense features are used (no sparse features), to keep the model size small

In this case,

  • Sharded index with four partitions

  • Each partition selects the top five-hundred ads based on the prior score and then runs logistic regression to select the top fifty ads

  • The top fifty ads from each partition are fed to ad prediction components
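
A minimal sketch of this sharded flow, with toy tuples standing in for indexed ads and a dummy score standing in for the logistic regression model:

```python
import random

random.seed(0)
# Toy index: four partitions of (ad_id, prior_score, model_score) tuples
partitions = [[(f"ad_{p}_{i}", random.random(), random.random())
               for i in range(2000)] for p in range(4)]

def select_from_partition(partition, k_prior=500, k_model=50):
    """Per-partition funnel: prior-score top-k, then model re-rank top-k."""
    by_prior = sorted(partition, key=lambda ad: ad[1], reverse=True)[:k_prior]
    return sorted(by_prior, key=lambda ad: ad[2], reverse=True)[:k_model]

# Each partition contributes its top fifty ads to the ad prediction component
candidates = [ad for part in partitions for ad in select_from_partition(part)]
print(len(candidates))  # 200
```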

7. Ad Prediction

Ads are generally short-lived.

  • => our predictive model is going to be deployed in a dynamic environment where the ad set is continuously changing over time.

  • => a more practical and efficient approach would be to keep refreshing the model with the latest impressions and engagements at regular intervals (e.g., every 15 or 30 minutes)

    ==> online learning

The high-level idea:

  • Train a base model and keep adding new examples on top of it with every ad impression and engagement

  • Design a mechanism that generates the latest training examples via an online joiner.

  • The model trainer will then receive these new examples to refresh the model using stochastic gradient descent.

  • This forms a tightly closed loop where changes in the feature distribution or model output can be detected, learned from, and improved upon in short succession.

    • Note:

      The refresh of the model doesn’t have to be instantaneous; we can do it in small batches at a certain frequency, e.g., every 30 or 60 minutes.
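
A minimal sketch of this refresh loop using scikit-learn's SGDClassifier (logistic regression trained via SGD); the data here is random stand-in data, whereas a real system would pull each batch from the online joiner:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_base = rng.normal(size=(1000, 20))    # stand-in for historical features
y_base = rng.integers(0, 2, size=1000)  # stand-in for engagement labels

# Base model trained offline; "log_loss" = logistic regression via SGD
# (scikit-learn >= 1.1; older versions call this loss "log")
model = SGDClassifier(loss="log_loss")
model.partial_fit(X_base, y_base, classes=[0, 1])

# Each refresh interval (e.g., every 30 min), update the same model in place
for _ in range(3):                      # three mock refresh cycles
    X_new = rng.normal(size=(100, 20))  # stand-in for freshly joined examples
    y_new = rng.integers(0, 2, size=100)
    model.partial_fit(X_new, y_new)
```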

Flow of events for ad prediction system

Model for online learning

We need to select models that can update themselves incrementally, e.g., through stochastic gradient descent. Candidate models include:

  • Logistic regression

  • XGBoost

  • Neural Network

Neural Network for feature generation

Last hidden layer of neural networks provides features for the logistic regression model

A deep neural network can be trained on the raw input features to predict ad engagement; the activations of its last hidden layer can then be used as generated features for our logistic regression model.

Finally, we can combine the neural-network features with the raw features as input to the logistic regression model, as in the sketch below.
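
A minimal PyTorch sketch of this idea; the network shape is arbitrary, and the final prediction head is sliced off to expose the last hidden layer:

```python
import torch
import torch.nn as nn

# Hypothetical engagement-prediction network over 100 raw features
net = nn.Sequential(
    nn.Linear(100, 64), nn.ReLU(),
    nn.Linear(64, 16), nn.ReLU(),    # last hidden layer: 16 learned features
    nn.Linear(16, 1), nn.Sigmoid(),  # prediction head (dropped below)
)

def nn_features(x):
    """Run everything except the prediction head to get hidden-layer features."""
    with torch.no_grad():
        return net[:-2](x)

x = torch.randn(1, 100)                           # one example, 100 raw features
lr_input = torch.cat([x, nn_features(x)], dim=1)  # raw + learned features for LR
print(lr_input.shape)                             # torch.Size([1, 116])
```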

8. Retraining Plan

We may also need to consider retraining the model, guided by monitoring of the user engagement prediction performance over time.
