Data Definition and Baseline
0. Learning Objectives
List the questions you need to answer in the process of data definition.
Compare and contrast the types of data problems you need to solve for structured vs. unstructured and big vs. small data.
Explain why label consistency is important and how you can improve it.
Explain why beating human level performance is not always indicative of the success of an ML model.
Make a case for improving human level performance rather than beating it.
Identify how much training data you should gather given time and resource constraints.
Describe the key steps in a data pipeline.
Compare and contrast the proof of concept vs. production phases on an ML project.
Explain the importance of keeping track of data provenance and lineage.
1. Define Data and Establish Baseline
1.1 Why is data definition hard?
The fundamental reason is label ambiguity: different labelers can reasonably label the same example x in different ways (e.g., drawing different bounding boxes around the same defect, or transcribing the same audio clip slightly differently)
1.2 Major types of data problems

Unstructured vs structured data
Unstructured data
May or may not have a huge collection of unlabeled examples x
Humans can label more data
Data augmentation more likely to be helpful
Structured data
May be more difficult to obtain more data
Human labeling may not be possible (with some exceptions)
Small data (< 10,000 examples) vs. big data (> 10,000 examples)
Small data
Clean labels are critical
Can manually look through dataset and fix labels
Can get all the labelers to talk to each other
Big data
Emphasis on data processes
1.3 Small data and label consistency
Why label consistency is important

Big data problems can have small data challenges too
Problems with a large dataset but a long tail of rare events in the input will have small data challenges too
Web search
Self-driving cars
Product recommendation systems
1.4 Improving label consistency
Have multiple labelers label same example
When there is disagreement, have the MLEs, subject matter experts, and/or labelers discuss the definition of y to reach agreement
If labelers believe that x doesn't contain enough information, consider changing x
Iterate until it is hard to significantly increase agreement
Have a class/label to capture uncertainty
Small data vs big data (unstructured data)
Small data
Usually small number of labelers
Can ask labelers to discuss specific labels
Big data
Get consistent definition with small group
Then send labeling instructions to labelers
Can consider having multiple labelers label every example and using voting to increase accuracy
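A minimal Python sketch of the voting idea (the example IDs and labels below are hypothetical, not from the course): each example gets labels from several labelers, the majority label is kept, and the agreement rate flags examples worth discussing.

```python
# Sketch: majority-vote multiple labels per example and track labeler agreement,
# so low-agreement examples can be surfaced for discussion or re-labeling.
from collections import Counter

def majority_vote(labels):
    """Return the most common label and the fraction of labelers who chose it."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

# Hypothetical data: three labelers labeling the same phone-defect images.
labels_per_example = {
    "img_001": ["scratch", "scratch", "scratch"],
    "img_002": ["scratch", "dent", "scratch"],
    "img_003": ["none", "dent", "scratch"],  # clear disagreement: refine the definition of y
}

for example_id, labels in labels_per_example.items():
    label, agreement = majority_vote(labels)
    print(example_id, label, f"agreement={agreement:.2f}")
```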
1.5 Human Level Performance (HLP)
Why measure HLP
Estimate Bayes error/irreducible error to help with error analysis and prioritization
Other uses of HLP
In academia, establish and beat a respectable benchmark to support publication
Business or product owner asks for 99% accuracy. HLP helps establish a more reasonable target
"Prove" the ML system is superior to humans doing the job and thus the business or product owner should adopt it. This rarely works
Problem with beating HLP as "proof" of ML superiority: when the ground truth labels are themselves produced by humans and labeled inconsistently, an ML system can beat HLP simply by matching the more common labeling convention on ambiguous examples, without actually doing better on the application; the apparent gain can also mask worse errors on other, more important cases
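A minimal sketch of how HLP is typically estimated (the transcriptions below are made up for illustration): measure how often a human labeler agrees with the reference label, keeping in mind that the reference is often just another human's label.

```python
# Sketch: estimate HLP as the fraction of examples where a human labeler matches
# the reference ("ground truth") label. Because the reference label is often itself
# human-generated, HLP << 100% can reflect labeling ambiguity rather than true
# irreducible error.

def human_level_performance(human_labels, reference_labels):
    assert len(human_labels) == len(reference_labels)
    matches = sum(h == r for h, r in zip(human_labels, reference_labels))
    return matches / len(reference_labels)

# Hypothetical speech-transcription labels.
reference = ["um, nearest gas station", "nearest gas station", "um... nearest gas station"]
human     = ["um, nearest gas station", "um, nearest gas station", "nearest gas station"]
print(f"Estimated HLP: {human_level_performance(human, reference):.2f}")
```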
1.6 Raising HLP
When the ground truth label y is itself defined by a human, HLP << 100% may indicate ambiguous labeling instructions
Improving label consistency will raise HLP
This makes it harder for ML to beat HLP. But the more consistent labels will raise ML performance, which is ultimately likely to benefit the actual application performance
HLP on structured data
Structured data problems are less likely to involve human labelers ==> HLP is less frequently used
Some exceptions
User ID merging: Same person?
Based on network traffic, is the computer hacked?
Is the transaction fraudulent?
Spam account? Bot?
From GPS, what is the mode of transportation: on foot, bike, car, or bus?
2. Label and Organize Data
2.1 Obtaining data
How long should you spend obtaining data?
Get into the iteration loop (train model -> error analysis -> improve the data) as quickly as possible
Instead of asking: how long would it take to obtain m examples?
Ask: how much data can we obtain in k days?
Exceptions: If you have worked on the problem before and from experience you know you need m examples
Inventory data: Brainstorm list of data sources
Labeling data
Options: In-house vs outsourced vs. crowdsourced
Having MLEs label data is expensive. But doing this for just a few days is usually fine
Don't increase data by more than 10x at a time
2.2 Data pipeline

POC (Proof-of-concept)
Goal is to decide if the application is workable and worth deploying
Focus on getting the prototype to work
It is OK if data pre-processing is manual. But take extensive notes/comments
Production phase
After project utility is established, use more sophisticated tools to make sure the data pipeline is replicable
E.g., TensorFlow Transform, Apache Beam, Airflow
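A minimal Apache Beam sketch of what a replicable pre-processing step could look like in the production phase (the input records, cleaning logic, and output path are placeholders, not from the course):

```python
# Sketch: move the manual pre-processing from the POC phase into a small,
# replicable pipeline so the same steps run identically every time.
import apache_beam as beam

def clean_record(line: str) -> str:
    # Stand-in for whatever pre-processing was done manually during the POC.
    return line.strip().lower()

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadRaw" >> beam.Create(["  Scratch on CASING ", " Dent near camera "])  # stand-in for a real source
        | "Clean" >> beam.Map(clean_record)
        | "Write" >> beam.io.WriteToText("cleaned_data")
    )
```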
2.3 Meta-data, data provenance and lineage
Data pipeline example

Meta-data
Examples:
Manufacturing visual inspection: time, factory, line #, camera settings, phone model, inspector ID
Speech recognition: Device type, Labeler ID, VAD model ID
Useful for:
Error analysis; spotting unexpected effects
Keeping track of data provenance (where the data came from) and lineage (the sequence of steps that produced it)
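A minimal sketch of recording meta-data alongside each example for the visual inspection case (the schema and file paths are assumptions, not prescribed by the course):

```python
# Sketch: persist meta-data next to each raw example so error analysis can slice
# by factory, line, camera settings, etc., and provenance is not lost.
from dataclasses import dataclass, asdict
import json

@dataclass
class InspectionExample:
    image_path: str
    label: str
    timestamp: str
    factory: str
    line_number: int
    camera_settings: str
    inspector_id: str

example = InspectionExample(
    image_path="images/img_001.png",  # hypothetical path
    label="scratch",
    timestamp="2021-04-01T08:30:00",
    factory="factory_A",
    line_number=3,
    camera_settings="exposure=1/60,iso=400",
    inspector_id="inspector_17",
)

# Write the meta-data as a sidecar file next to the image.
with open("img_001.meta.json", "w") as f:
    json.dump(asdict(example), f, indent=2)
```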
3. Scoping
3.1 What is scoping
Scoping example: E-commerce retailer looking to increase sales
Better recommender system
Better search
Improve catalog data
Inventory management
Price optimization
Questions:
What project should we work on?
What are the metrics for success?
What are the resources (data, time, people) needed?
3.2 Scoping process
Brainstorm business problems (not AI problems)
Brainstorm AI solutions
Assess the feasibility and value of potential solutions
Determine milestones
Budget for resources
What are the top 3 things you wish were working better?
Increase conversion
Reduce inventory
Increase margin (profit per item)
Separating problem identification from solution

3.3 Diligence on feasibility and value
Feasibility: Is this project technically feasible?
Use external benchmark (literature, other company, competitor)
Why use HLP to benchmark
People are very good at unstructured data tasks
Criteria: Can a human, given the same data, perform the task?
Do we have features that are predictive?
Value
Ethical considerations
Is this project creating net positive social value?
Is this project reasonably fair and free from bias?
Have any ethical concerns been openly aired and debated?
3.4 Milestones and resourcing
Key specifications:
ML metrics (accuracy, precision, F1)
Software metrics (latency, throughput, given compute resources)
Business metrics (revenue)
Resources needed (data, personnel, help from other teams)
Timeline
If unsure, consider benchmarking against other projects, or building a POC first
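For reference, a minimal sketch of computing the ML metrics listed above with scikit-learn (the labels and predictions below are made up):

```python
# Sketch: compute accuracy, precision, recall, and F1 on a held-out set.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical ground truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```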