Feed Based System

Design a Twitter feed system.

1. Problem Statement

Problem statement:

Design a Twitter feed system that will show the most relevant tweets for a user based on their social graph.

Previous method for feed based system

User A is connected with other people and businesses on Twitter
The user wants to know about the activity of their connections
Originally, the feed showed all tweets generated by the user's followees since the user's last visit, in reverse chronological order
This does not work well, since a potentially engaging Tweet may be pushed further down the feed just because it is older
==> We need the feed based on relevance ranking

Scale of the problem

Define the scope of the problem:

Assume there are 500 million daily active users
On average, every user is connected to 100 users
Every user fetches their feed ten times a day
==> 500 million users * 10 feed fetches/day = 5 billion calls/day to run the model
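
As a rough sanity check on this estimate, the arithmetic can be written out as below. This is a back-of-the-envelope sketch that assumes traffic is spread evenly over the day; real traffic is not, so peak load would be higher than the average shown.

```python
SECONDS_PER_DAY = 24 * 60 * 60

daily_active_users = 500_000_000
feed_fetches_per_user_per_day = 10

# Every feed fetch triggers one call to the ranking model.
calls_per_day = daily_active_users * feed_fetches_per_user_per_day

# Assumption: uniform traffic over the day (peaks will be higher).
average_calls_per_second = calls_per_day / SECONDS_PER_DAY

print(f"{calls_per_day:,} ranking calls/day")
print(f"about {average_calls_per_second:,.0f} ranking calls/second on average")
```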

Finally, the ML problem is:

Given a list of tweets, train an ML model that predicts the probability of engagement of tweets and orders them based on that score

2. Metrics

The feed ranking system aims to maximize user engagement. The relevant user actions are listed below:

Positive actions:
- Time spent viewing the tweet
- Liking a Tweet
- Retweeting
- Commenting on a Tweet
Negative actions:
- Hiding a Tweet
- Reporting Tweets as inappropriate

User engagement metrics

There are different kinds of positive and negative engagement on a Twitter feed.

Selecting feed optimization metric

We can select different user engagement metrics based on the business needs

The business wants to focus on more activity in a dialogue:
==> The model focuses more on the number of comments
The business wants to focus on overall engagement:
==> The model focuses on overall engagement: comments, likes, retweets
The business needs to optimize for time spent on the application:
==> The model focuses on time spent on Twitter

Negative engagement or counter metric

A user may perform multiple negative actions, such as reporting an inappropriate tweet, blocking a user, or hiding a tweet. ==> It is crucial to measure and track these actions, for example as the average number of negative actions per user

Weighted engagement

We can also define different weights for each action and track the overall engagement as a weighted combination of the actions.

Assign different weights to the different user actions
For a group of users, count the different user actions and multiply by the weights
Sum up the weighted impacts of user actions
Normalize the weighted score
$Normalized \; Score = \frac{Weighted \; impact}{Total \; number \; of \; users}$

With weighted engagement, a higher normalized score means higher user engagement.

The weights of each user action can be tweaked to balance the business needs.
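
The steps above can be sketched in a few lines. The weights and action counts here are made-up values for illustration, and giving negative actions a negative weight is an assumption rather than something the formula requires.

```python
# Hypothetical weights: tune these to match the business needs.
# Assumption: negative actions carry negative weights.
ACTION_WEIGHTS = {
    "like": 1.0,
    "comment": 3.0,
    "retweet": 2.0,
    "hide": -2.0,
    "report": -5.0,
}


def normalized_engagement_score(action_counts, num_users, weights=ACTION_WEIGHTS):
    """Weighted impact of all actions divided by the number of users."""
    if num_users <= 0:
        raise ValueError("num_users must be positive")
    weighted_impact = sum(weights[action] * count for action, count in action_counts.items())
    return weighted_impact / num_users


# Made-up action counts for a group of 1,000 users.
counts = {"like": 2000, "comment": 500, "retweet": 300, "hide": 50, "report": 10}
score = normalized_engagement_score(counts, num_users=1000)
print(score)
```

With these numbers the weighted impact is 2000 + 1500 + 600 - 100 - 50 = 3950, so the normalized score is 3.95 per user.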

3. Architectural Components

There are mainly three components:

Tweet selection
- It fetches a pool of tweets from the user's network since the last login
Training data generation
- The user's engagement actions are used to generate positive and negative samples for the Twitter feed prediction model
Ranker
- We can train a single model to predict the overall engagement on the tweet
- We can also train separate models. Each model focuses on predicting the probability that a certain user action occurs for the tweet.
- Separate models allow us to have greater control over the importance of each engagement action.
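
To illustrate how separate models give control over the importance of each action, the sketch below combines per-action probabilities into one ranking score with a weighted sum. The function names, probabilities, and weights are hypothetical, and a weighted sum is only one possible way to combine the outputs.

```python
def combined_engagement_score(action_probabilities, action_weights):
    """Weighted sum of the per-action probabilities for one tweet."""
    return sum(action_weights[action] * p for action, p in action_probabilities.items())


def rank_tweets(predictions, action_weights):
    """Order tweet ids by combined score, highest first.

    predictions maps tweet id -> {action: predicted probability}.
    """
    return sorted(
        predictions,
        key=lambda tweet_id: combined_engagement_score(predictions[tweet_id], action_weights),
        reverse=True,
    )


# Hypothetical outputs of three separate models (like, comment, retweet).
predictions = {
    "t1": {"like": 0.30, "comment": 0.05, "retweet": 0.10},
    "t2": {"like": 0.10, "comment": 0.20, "retweet": 0.05},
}

# Emphasizing comments puts t2 first; emphasizing likes puts t1 first.
comment_focused = rank_tweets(predictions, {"like": 1.0, "comment": 3.0, "retweet": 2.0})
like_focused = rank_tweets(predictions, {"like": 3.0, "comment": 1.0, "retweet": 1.0})
print(comment_focused, like_focused)
```

Changing only the weights reorders the feed without retraining any of the models.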

4. Tweet Selection

Due to the nature of Twitter, there are different types of Tweets that we need to select

New Tweets

Select all tweets posted since the user's previous login.
Select not-so-new tweet: Consider a Tweet that user A has viewed previously. However, by the time the user logs in again, this Tweet has received a much bigger engagement and/or A’s network has interacted with it.

Unseen Tweets

When the user logs in at 9 pm, they read 100 tweets
When the user logs in again at 10 pm, there are new tweets posted between 9 pm and 10 pm. There are also tweets from before 9 pm that they did not read.
==> Both the unseen tweets and the new tweets should be considered

User returning after a while (2 weeks)

Plenty of tweets have been generated in the meantime
There should be a limit on the number of fetched tweets. For example, we can set it to 500 tweets
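
The selection of new and unseen tweets with a cap can be sketched as follows. The `Tweet` type and `select_candidate_tweets` function are hypothetical, and keeping the most recent tweets when the cap is exceeded is an assumption; the sketch also leaves out previously seen tweets that have since gained engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    tweet_id: str
    created_at: float  # posting time, in seconds since the epoch


def select_candidate_tweets(network_tweets, seen_tweet_ids, last_login, max_tweets=500):
    """Return new and unseen tweets from the user's network, newest first."""
    candidates = []
    for tweet in network_tweets:
        is_new = tweet.created_at > last_login
        is_unseen = tweet.tweet_id not in seen_tweet_ids
        if is_new or is_unseen:
            candidates.append(tweet)
    # Assumption: when there are too many candidates, keep the most recent ones.
    candidates.sort(key=lambda t: t.created_at, reverse=True)
    return candidates[:max_tweets]


tweets = [Tweet("t1", 50.0), Tweet("t2", 60.0), Tweet("t3", 150.0)]
# The user last logged in at time 100 and had already read t1.
selected = select_candidate_tweets(tweets, seen_tweet_ids={"t1"}, last_login=100.0)
print([t.tweet_id for t in selected])
```

Here `t3` is new, `t2` is older but unseen, and `t1` is dropped because it was already read.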

Interest/ popularity-based tweets

Tweets do not need to be limited to the user's network. A tweet from OUTSIDE the user's network can be selected if it:

Aligns with the user's interests
Is locally/globally trending
Engages the user's network

5. Feature engineering

There are mainly four aspects of features in a Twitter feed:

The user
The tweet
Tweet's author
The context

User-author features
Capture the social relationship between the user and the author
- User-author historical interactions
  - Author-liked-posts-3months: The percentage of an author's tweets that are liked by the user
  - Normalized Author-liked-posts-count-1year: the normalized number of the author's tweets that the user interacted with in the last year
- User-author similarity
  - common-followees
  - topic-similarity:
    Use TF-IDF to measure the similarity between the hashtags or topics:
    Followed by the logged-in user and the author
    Present in the posts that the user and the author have interacted with
    Used by the author and the user in their past posts
  - tweet-content-embedding-similarity
    Use word embeddings to build content embeddings of the user's and the author's tweets, then compare them
  - Social-embedding-similarity
    Generate embeddings from the social graph around the user and the author, then compare them.
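
Two of the user-author similarity features can be sketched with the standard library only. The sketch uses Jaccard similarity for common-followees and cosine similarity over raw hashtag counts for topic-similarity. Note that this is a simplification: the raw counts stand in for TF-IDF weights, which would additionally down-weight hashtags that are common across all users. All names and data are hypothetical.

```python
import math
from collections import Counter


def jaccard_similarity(followees_a, followees_b):
    """Share of followees the two accounts have in common."""
    a, b = set(followees_a), set(followees_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two sparse count vectors."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[key] * counts_b[key] for key in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


common_followees = jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"})

# Hashtags used by the user and by the author in their past posts.
user_hashtags = Counter(["ml", "ml", "nlp"])
author_hashtags = Counter(["ml", "sports"])
topic_similarity = cosine_similarity(user_hashtags, author_hashtags)

print(common_followees, round(topic_similarity, 4))
```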
Author features
- Author's degree of influence
  - is_verified
  - author-social rank
  - author-num-follower
  - follower-to-following ratio
- Historical trend of interactions on the author's tweets
  - author-engagement-rate-3months:
    $engagement \; rate = \frac{Tweet \; interactions}{Tweet \; views}$
  - author-topic-engagement-rate-3months:
    The topic can be identified:
    By the hashtags used
    By predicting it from the tweet's content
User-Tweet features
- topic-similarity
  - The similarity between the hashtags or content of the tweet and those of the tweets the user has posted or interacted with
- embedding-similarity
  Similarity between the embedding of the tweet and the embedding of the user
Tweet features
- Features based on Tweet's content
  - Tweet-length
  - Tweet-recency
  - is-image-video
  - is_URL
- Features based on Tweet's interaction
  - num-total-interactions
    We need to apply a time decay model so that recent interactions are weighted more than older ones
  - interaction-in-last-1hour
  - interaction-in-last-day
  - interaction-in-last-3day
  - interaction-in-last-week
- Separate features for different engagements
  - likes-in-last-3-days
  - comments-in-last-3-days
  - reshares-in-last-2-hours
    ==> We can also compute these features over the user's network only
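
The time decay mentioned for num-total-interactions can be sketched as below. The notes do not say which decay model to use; exponential decay with a 24-hour half-life is an assumption chosen for illustration, and the function name is hypothetical.

```python
def time_decayed_interaction_count(interaction_ages_hours, half_life_hours=24.0):
    """Sum of interactions where each one loses half its weight per half-life.

    Assumption: exponential decay; other decay models would also fit the notes.
    """
    return sum(0.5 ** (age / half_life_hours) for age in interaction_ages_hours)


# Three interactions: just now, one day ago, and two days ago.
decayed_count = time_decayed_interaction_count([0.0, 24.0, 48.0])
print(decayed_count)
```

The three interactions contribute 1 + 0.5 + 0.25 = 1.75, instead of the undecayed count of 3.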
Context-based features
- day-of-week
- time-of-day
- current-user-location
- season
- latest-k-tags-interactions
- approaching-holiday
Sparse features
- unigram/bigram of a tweet
- user-id
- tweet-id

6. Training Data Generation

Training data generation through online user engagement

(1) Any user engagement counts as a positive example

The remaining impressions (tweets that were shown without any engagement) are negative samples

(2) Separate models to predict each type of user engagement

Balance the positive and negative samples

Since most of the samples are negative, we can downsample the negative samples to reduce their number.
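
A minimal sketch of negative downsampling, assuming each sample is a dictionary with a binary `label` field (a hypothetical format). Every positive is kept, and each negative is kept with a fixed probability.

```python
import random


def downsample_negatives(samples, negative_keep_rate, seed=0):
    """Keep all positives and a random fraction of the negatives."""
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    return [
        sample
        for sample in samples
        if sample["label"] == 1 or rng.random() < negative_keep_rate
    ]


# Made-up data: 2 positives and 98 negatives.
samples = [{"label": 1}] * 2 + [{"label": 0}] * 98
balanced = downsample_negatives(samples, negative_keep_rate=0.1)

num_positives = sum(s["label"] for s in balanced)
print(num_positives, len(balanced) - num_positives)
```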

Train test split

Since user engagement differs between weekdays and weekends, we need to include data from the whole week for training.

Since the engagement on previous tweets is used as a feature in our data, we need to split the data based on time, so that the model is trained on earlier data and evaluated on later data.
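
A time-based split can be sketched as follows, assuming each sample carries a `timestamp` field (a hypothetical format). Everything before the cutoff goes to training and everything from the cutoff onward goes to testing, so no future engagement leaks into the training features.

```python
def time_based_split(samples, split_timestamp):
    """Train on samples before the cutoff, test on samples at or after it."""
    train = [s for s in samples if s["timestamp"] < split_timestamp]
    test = [s for s in samples if s["timestamp"] >= split_timestamp]
    return train, test


# Made-up samples with timestamps 1 through 5 (for example, day numbers).
samples_by_time = [{"timestamp": day, "label": day % 2} for day in range(1, 6)]
train_set, test_set = time_based_split(samples_by_time, split_timestamp=4)
print(len(train_set), len(test_set))
```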

7. Ranking

It is a binary classification problem. The model predicts whether a user will engage with a tweet. There are mainly two types of models:

Train a single model to predict overall engagement
Train separate models to predict the different engagement actions

Logistic regression
Random forest
Neural network with multiple tasks

Stacking models and online learning

One way to outperform the "single model" is to use multiple models to utilize the power of different techniques.

Advantages of using stacking models:

It gives us all the learning power of deep neural networks and tree-based models, along with the flexibility of training a logistic regression model.
==> The model can be kept refreshed in near real time with online learning
Real-time online learning with logistic regression enables the use of sparse features to learn the interactions
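
The online logistic regression layer can be sketched as below. This is a minimal illustration, not a production design: the class is hypothetical, it uses plain stochastic gradient descent with one update per example, and the `nn_score` feature stands in for the output of an upstream neural network or tree-based model. Sparse features such as user ids are stored in a dictionary, so only the features present in an example are touched.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression over sparse features, updated one example at a time."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}  # feature name -> weight; absent features have weight 0
        self.bias = 0.0

    def predict_proba(self, features):
        z = self.bias + sum(self.weights.get(name, 0.0) * value for name, value in features.items())
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, label):
        """One SGD step on the log loss for a single (features, label) example."""
        error = self.predict_proba(features) - label
        self.bias -= self.learning_rate * error
        for name, value in features.items():
            self.weights[name] = self.weights.get(name, 0.0) - self.learning_rate * error * value


model = OnlineLogisticRegression()
# Made-up stream: a dense upstream score plus a sparse user-id feature.
engaged = {"nn_score": 0.9, "user_id=42": 1.0}
ignored = {"nn_score": 0.1, "user_id=7": 1.0}
for _ in range(200):
    model.update(engaged, 1)
    model.update(ignored, 0)

print(model.predict_proba(engaged) > 0.5, model.predict_proba(ignored) > 0.5)
```

Because each update touches only the features present in the example, the model can be refreshed as engagement events arrive rather than in periodic batch retraining.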