Practical ML Techniques/Concepts
1. Performance and Capacity Considerations
Major performance and capacity discussions come up during the following two phases of building an ML system:
Training time: how much training data and capacity are needed to build the model
Evaluation time: what service level agreements (SLAs) we need to meet while serving the model, and the capacity needed to do so
Complexity considerations for an ML system
There are three types of complexity:
Training complexity
The time taken to train a model for a given task
Evaluation complexity
The time taken to evaluate an input at test time
Sample complexity
The total number of training samples required to learn the target function successfully.
Sample complexity changes if the model capacity changes. For example, a neural network generally needs more training examples than a decision tree model.
Performance and capacity considerations in large-scale systems
When designing an ML system, we want the optimal result while meeting the system's constraints, expressed as Service Level Agreements (SLAs). For an ML system, (1) performance and (2) capacity are the most important to think about.
Performance-based SLA: ensures we return results within a given time frame for 99% of queries
Capacity-based SLA: specifies the load our system can handle, e.g., the system can support 1,000 QPS (queries per second)
For a search system:
Suppose we use a tree-based model that takes 1µs to process one sample
For 100 million documents, the model will take 100s to process them all
==> Therefore, we need a distributed system
We can have 1,000 machines execute our model over the 100 million documents in parallel
==> It will take 100s / 1000 = 100ms
However, if we use a deep learning model, which is much slower than the tree-based model, the same machines are no longer enough to meet the SLA. We can either:
Keep adding machines to meet the requirements, or
Use a funnel-based modeling approach
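The arithmetic above can be sanity-checked with a quick back-of-the-envelope script (the 1µs/document figure and 1,000-machine count are the assumptions from the example):

```python
# Back-of-the-envelope latency math for the search example above.
docs = 100_000_000        # 100 million documents
per_doc_s = 1e-6          # assumed cost: 1 microsecond per document

single_machine_s = docs * per_doc_s        # ~100 s on one machine
machines = 1000
sharded_s = single_machine_s / machines    # ~0.1 s = 100 ms per shard

print(single_machine_s, sharded_s)
```

This kind of estimate is usually the first step in deciding whether a model family can meet a latency SLA at all.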
Layered/funnel-based modeling approach
To manage both the performance and capacity of a system, one reasonable approach is to:
Start with a relatively fast model at the stage where you have the largest number of documents
In later stages, keep increasing model complexity and execution time as the number of documents is reduced
In the final stage, use a deep neural network on the smallest candidate set
==> This way, we can fulfill the SLA requirements
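A minimal sketch of the funnel, assuming a cheap stand-in scorer for the fast model and a slower stand-in for the deep model (both scoring functions and the document fields are hypothetical):

```python
import random

# Hypothetical funnel: a cheap scorer prunes 100k candidates to 500,
# then a more expensive model re-ranks only the survivors.
def cheap_score(doc):        # stand-in for a fast tree-based model
    return doc["quality"]

def expensive_score(doc):    # stand-in for a slow deep model
    return doc["quality"] * 0.9 + doc["freshness"] * 0.1

def funnel_rank(docs, stage1_keep=500, final_keep=10):
    # Stage 1: fast model over all documents
    stage1 = sorted(docs, key=cheap_score, reverse=True)[:stage1_keep]
    # Stage 2: expensive model only over the survivors
    return sorted(stage1, key=expensive_score, reverse=True)[:final_keep]

random.seed(0)
docs = [{"quality": random.random(), "freshness": random.random()}
        for _ in range(100_000)]
top = funnel_rank(docs)
print(len(top))  # 10
```

The expensive model runs on 500 documents instead of 100,000, which is what keeps the overall latency inside the SLA.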
2. Training Data Collection Strategies
Collection techniques
- User's interaction with the pre-existing system (online)
The user's interactions with the pre-existing system can generate good-quality training data.
For example, when building a movie recommendation system, we can treat items the user liked as positive examples and items the user ignored as negative examples.
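As a sketch, turning such interaction logs into labeled training pairs might look like this (the log schema and field names are hypothetical):

```python
# Sketch: deriving labeled examples from interaction logs for a
# movie recommender. Log schema is made up for illustration.
logs = [
    {"user": "u1", "movie": "m1", "action": "liked"},
    {"user": "u1", "movie": "m2", "action": "ignored"},
    {"user": "u2", "movie": "m1", "action": "ignored"},
]

def to_examples(logs):
    # liked -> positive (1), ignored -> negative (0)
    return [(r["user"], r["movie"], 1 if r["action"] == "liked" else 0)
            for r in logs]

print(to_examples(logs))
# [('u1', 'm1', 1), ('u1', 'm2', 0), ('u2', 'm1', 0)]
```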

- Human labelers (offline)
In some cases, users of the system cannot generate the training data. Here, you will utilize human labelers to generate good-quality training data.
For example, image segmentation of the surroundings of a self-driving vehicle cannot be generated from a pre-existing system.
There are mainly three ways to obtain human-labeled data:
Crowdsourcing: e.g., Amazon Mechanical Turk
Specialized labelers
Open-source datasets
Train, test, & validation splits

Points to consider during splitting:
The size of each split will depend on your particular scenario. Common ratios for training, validation, and test splits are 60%/20%/20% or 70%/15%/15%.
Ensure each split captures all kinds of patterns present in the data.
For time-series data, split based on time.
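The time-based split above can be sketched as follows (field names and the 70/15/15 ratios are illustrative):

```python
# Sketch: chronological split for time-series data, so validation
# and test always come after the training period (70/15/15 here).
def time_split(rows, train=0.70, val=0.15):
    rows = sorted(rows, key=lambda r: r["ts"])   # order by timestamp
    n = len(rows)
    n_train = int(n * train)
    n_val = int(n * val)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

data = [{"ts": t, "x": t * 2} for t in range(100)]
tr, va, te = time_split(data)
print(len(tr), len(va), len(te))  # 70 15 15
```

A random shuffle here would leak future information into training, which is exactly what the time-based split avoids.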
Quantity of training data
To find the optimal amount of training data, you can plot the model's performance against the number of training samples. After a certain quantity of training data, you will observe that there isn't any further gain in the model's performance.
Training data filtering
Cleaning up data
Handling missing data
Remove outliers
Remove duplicates
Dropping out irrelevant features
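The cleanup steps above can be sketched on a toy record set (the threshold and field names are illustrative):

```python
# Sketch of the filtering steps: drop missing values, remove
# outliers, deduplicate, and keep only relevant fields.
def clean(rows, outlier_max=1000):
    seen, out = set(), []
    for r in rows:
        if r.get("value") is None:        # handle missing data: drop row
            continue
        if r["value"] > outlier_max:      # remove outliers
            continue
        key = (r["id"], r["value"])       # remove duplicates
        if key in seen:
            continue
        seen.add(key)
        out.append({"id": r["id"], "value": r["value"]})  # drop irrelevant fields
    return out

rows = [{"id": 1, "value": 5, "junk": "x"},
        {"id": 1, "value": 5},
        {"id": 2, "value": None},
        {"id": 3, "value": 10**6}]
print(clean(rows))  # [{'id': 1, 'value': 5}]
```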
Removing bias
Due to the pre-existing recommendation technique, popular items get a higher chance of being recommended and liked again.
==> The rich get richer
Instead of basing all recommendations on popularity, we need to randomize recommendations, i.e., randomly display some items from the whole list.
==> This helps reduce the bias in the data
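One simple way to randomize, sketched below, is epsilon-style exploration: with some probability, show a random catalog item instead of the top popularity-ranked one (the function names and epsilon value are illustrative):

```python
import random

# Sketch: with probability eps, serve a random item instead of the
# popularity-ranked one, reducing popularity bias in logged data.
def recommend(ranked_items, all_items, eps=0.1, rng=random):
    if rng.random() < eps:
        return rng.choice(all_items)      # explore: random slot
    return ranked_items[0]                # exploit: ranked slot

rng = random.Random(42)
catalog = ["a", "b", "c", "d"]
shown = [recommend(["a"], catalog, eps=0.5, rng=rng) for _ in range(1000)]
# With eps=0.5, roughly half the impressions are random picks,
# so unpopular items also collect interaction data.
print(len(set(shown)) > 1)  # True
```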
Bootstrapping new items (cold start)
We can bootstrap recommendations for new items based on their similarity to existing ones.
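A minimal sketch of similarity-based bootstrapping, assuming items have content feature vectors (the vectors and item IDs here are made up):

```python
import math

# Sketch: bootstrap a new item by finding the most similar existing
# item via cosine similarity over content features.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

existing = {"m1": [1.0, 0.0, 1.0], "m2": [0.0, 1.0, 0.0]}
new_item = [0.9, 0.1, 0.8]

# Recommend the new item wherever the most similar existing item does well.
best = max(existing, key=lambda k: cosine(existing[k], new_item))
print(best)  # m1
```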
3. Online Experiment
Hypothesis and metrics intuition
The team can have multiple hypotheses that need to be validated via experimentation; to test them, we use online experimentation.
==> It allows us to conduct controlled experiments that provide a valuable way to assess the impact of new features on customer behavior.
Running an online experiment
A/B testing is very beneficial for gauging the impact of new features or changes in the system on the user experience. It is a method of comparing two versions of a webpage or app against each other simultaneously to determine which one performs better.
The two hypotheses for the A/B test:
The null hypothesis, H0: the design change will have no effect on the variation.
The alternative hypothesis, H1: the design change will have an effect on the variation.
Measuring results
Computing statistical significance
The p-value is used to determine the statistical significance of the results. When interpreting the p-value of a significance test, a significance level (alpha) must be specified.
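As a sketch, a two-proportion z-test is one common way to compute such a p-value for conversion-rate A/B tests (the normal approximation is assumed, and the counts below are made up):

```python
import math

# Sketch: two-sided p-value for an A/B test on conversion rates,
# via a two-proportion z-test (normal approximation).
def ab_test_p_value(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)      # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

p = ab_test_p_value(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(p < 0.05)  # True: reject H0 at alpha = 0.05
```

If p is below the chosen alpha (commonly 0.05), we reject H0 and treat the observed difference as statistically significant.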
Measuring long-term effects
Back Testing
When we do A/B testing and see a clear improvement, we are still not sure whether it was caused by the changes we made on purpose or by some other factor. To confirm the hypothesis and be more confident in the results, we can perform a backtest: we swap the criteria, so system A is now the previous system B, and vice versa.

If the backtest shows a clear decrease in the metric, it confirms that our changes indeed improved system performance.
Long-running A/B testing
A long-running experiment, which measures long-term behaviors, can also be done via a backtest. We can launch the change based on initial positive results while continuing to run a long-running backtest to measure any potential long-term effects. If we notice any significant negative behavior, we can revert the launched change.

4. Model Debugging and Testing
There are mainly two phases of developing an ML model:
Building the first version of the model and ML system
Iterative improvements on top of the first version as well as debugging issues in large scale ML systems.
Building model v1

Steps on building the first model
Identify the business problem and map it to a machine learning problem
Explore the training data and ML techniques for this problem
Train the model with available data and features, play around with hyper-parameters
Once the model is set up and we have offline metrics (accuracy, precision/recall, AUC, etc.), we can continue iterating on features and training data strategies to improve those metrics
If there is already an existing system, our objective is for the offline model to perform at least as well as the current system.
Note:
We want to get the v1 model launched to the real system quickly rather than spending too much time optimizing it. The reason is primarily that model improvement is an iterative process, and we want validation from real traffic and data along with offline validation.
Deploying and debugging v1 model
When moving from an offline model to online serving, the results may not look as good as we anticipated offline. Here are a few failure modes and how to debug them:
Change in feature distribution

A change in feature distribution between the training and evaluation sets can negatively affect model performance.
Solutions:
Retrain the model on data reflecting the new distribution
Add features built from the new dataset
Feature logging issues
Some features available offline cannot be generated (or are computed differently) online, causing a mismatch between training and serving.
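One common mitigation, sketched below, is to log the exact feature values used at serving time, so that training later reads them back instead of recomputing them (the schema and field names are hypothetical):

```python
import io
import json

# Sketch: log serving-time feature values so training can reuse the
# exact same values, avoiding training/serving feature mismatch.
def serve_and_log(request, model_score, features, log_file):
    record = {"request_id": request["id"],
              "features": features,       # the values actually served
              "score": model_score}
    log_file.write(json.dumps(record) + "\n")

# Later, training reads the logged features instead of recomputing them.
buf = io.StringIO()
serve_and_log({"id": "r1"}, 0.7, {"ctr_7d": 0.12}, buf)
print(buf.getvalue().strip())
```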

Iterative model improvement

Missing important features
For example, consider a scenario where a movie actually liked by the user was ranked very low by our recommendation system. On debugging, we figure out that the user has previously watched two movies by the same actor, so adding a feature on previous ratings by the user for this movie actor can help our model perform better in this case.
Insufficient training examples
We may also find that we are lacking training examples in cases where the model isn’t performing well.
Debugging large scale systems
When debugging large-scale systems with multiple components (or models), we need to identify which part of the overall system is not working correctly.
Identify the failing component
Find the architectural component responsible for the highest number of failures in our failure set.
Start debugging with the component that accounts for the most failures.
Improve the quality of that component
Apply the same iterative improvement ideas as above (adding missing features, gathering more training examples, etc.) to the failing component.