Data Definition and Baseline
0. Learning Objectives
List the questions you need to answer in the process of data definition.
Compare and contrast the types of data problems you need to solve for structured vs. unstructured and big vs. small data.
Explain why label consistency is important and how you can improve it.
Explain why beating human level performance is not always indicative of the success of an ML model.
Make a case for improving human level performance rather than beating it.
Identify how much training data you should gather given time and resource constraints.
Describe the key steps in a data pipeline.
Compare and contrast the proof of concept vs. production phases on an ML project.
Explain the importance of keeping track of data provenance and lineage.
1. Define Data and Establish Baseline
1.1 Why is data definition hard?
The fundamental reason is label ambiguity: different labelers can reasonably label the same example x in different ways (e.g., drawing different bounding boxes around the same defect, or transcribing the same audio clip slightly differently)
1.2 Major types of data problems

Unstructured vs structured data
Unstructured data
May or may not have a huge collection of unlabeled examples x
Humans can label more data
Data augmentation more likely to be helpful
Structured data
May be more difficult to obtain more data
Human labeling may not be possible (with some exceptions)
Small data (< 10,000 examples) vs. big data (> 10,000 examples)
Small data
Clean labels are critical
Can manually look through dataset and fix labels
Can get all the labelers to talk to each other
Big data
Emphasis on data processes
1.3 Small data and label consistency
Why label consistency is important

Big data problems can have small data challenges too
Problems with a large dataset but a long tail of rare events in the input will have small data challenges too
Web search
Self-driving cars
Product recommendation systems
1.4 Improving label consistency
Have multiple labelers label same example
When there is disagreement, have the MLEs, subject matter experts, and/or labelers discuss the definition of y to reach agreement
If labelers believe that x doesn't contain enough information, consider changing x
Iterate until it is hard to significantly increase agreement
Have a class/label to capture uncertainty
Small data vs big data (unstructured data)
Small data
Usually small number of labelers
Can ask labelers to discuss specific labels
Big data
Get consistent definition with small group
Then send labeling instructions to labelers
Can consider having multiple labelers label every example and using voting to increase accuracy
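A minimal Python sketch of the voting idea (the example IDs and labels below are hypothetical, not from the course): each example gets labels from several labelers, the majority label is kept, and the agreement rate flags examples worth discussing.

```python
# Sketch: majority-vote multiple labels per example and track labeler agreement,
# so low-agreement examples can be surfaced for discussion or re-labeling.
from collections import Counter

def majority_vote(labels):
    """Return the most common label and the fraction of labelers who chose it."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

# Hypothetical data: three labelers labeling the same phone-defect images.
labels_per_example = {
    "img_001": ["scratch", "scratch", "scratch"],
    "img_002": ["scratch", "dent", "scratch"],
    "img_003": ["none", "dent", "scratch"],  # clear disagreement: refine the definition of y
}

for example_id, labels in labels_per_example.items():
    label, agreement = majority_vote(labels)
    print(example_id, label, f"agreement={agreement:.2f}")
```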
1.5 Human Level Performance (HLP)
Why measure HLP
Estimate Bayes error/irreducible error to help with error analysis and prioritization
Other uses of HLP
In academia, establish and beat a respectable benchmark to support publication
Business or product owner asks for 99% accuracy. HLP helps establish a more reasonable target
"Prove" the ML system is superior to humans doing the job and thus the business or product owner should adopt it. This rarely works
Problem with beating HLP as "proof" of ML superiority: when the ground truth labels are themselves produced by humans and labeled inconsistently, an ML system can beat HLP simply by matching the more common labeling convention on ambiguous examples, without actually doing better on the application; the apparent gain can also mask worse errors on other, more important cases
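A minimal sketch of how HLP is typically estimated (the transcriptions below are made up for illustration): measure how often a human labeler agrees with the reference label, keeping in mind that the reference is often just another human's label.

```python
# Sketch: estimate HLP as the fraction of examples where a human labeler matches
# the reference ("ground truth") label. Because the reference label is often itself
# human-generated, HLP << 100% can reflect labeling ambiguity rather than true
# irreducible error.

def human_level_performance(human_labels, reference_labels):
    assert len(human_labels) == len(reference_labels)
    matches = sum(h == r for h, r in zip(human_labels, reference_labels))
    return matches / len(reference_labels)

# Hypothetical speech-transcription labels.
reference = ["um, nearest gas station", "nearest gas station", "um... nearest gas station"]
human     = ["um, nearest gas station", "um, nearest gas station", "nearest gas station"]
print(f"Estimated HLP: {human_level_performance(human, reference):.2f}")
```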
1.6 Raising HLP
When the ground truth label y is itself defined by a human, HLP << 100% may indicate ambiguous labeling instructions
Improving label consistency will raise HLP
This makes it harder for ML to beat HLP. But the more consistent labels will raise ML performance, which is ultimately likely to benefit the actual application performance
HLP on structured data
Structured data problems are less likely to involve human labelers ==> HLP is less frequently used
Some exceptions
User ID merging: Same person?
Based on network traffic, is the computer hacked?
Is the transaction fraudulent?
Spam account? Bot?
From GPS, what is the mode of transportation: on foot, bike, car, or bus?
2. Label and Organize Data
2.1 Obtaining data
How long should you spend obtaining data?
Get into the iteration loop (train model -> error analysis -> improve the data) as quickly as possible
Instead of asking: how long would it take to obtain m examples?
Ask: how much data can we obtain in k days?
Exceptions: If you have worked on the problem before and from experience you know you need m examples
Inventory data: Brainstorm list of data sources
Labeling data
Options: In-house vs outsourced vs. crowdsourced
Having MLEs label data is expensive. But doing this for just a few days is usually fine
Don't increase data by more than 10x at a time
2.2 Data pipeline

POC (Proof-of-concept)
Goal is to decide if the application is workable and worth deploying
Focus on getting the prototype to work
It is OK if data pre-processing is manual. But take extensive notes/comments
Production phase
After project utility is established, use more sophisticated tools to make sure the data pipeline is replicable
E.g., TensorFlow Transform, Apache Beam, Airflow
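A minimal Apache Beam sketch of what a replicable pre-processing step could look like in the production phase (the input records, cleaning logic, and output path are placeholders, not from the course):

```python
# Sketch: move the manual pre-processing from the POC phase into a small,
# replicable pipeline so the same steps run identically every time.
import apache_beam as beam

def clean_record(line: str) -> str:
    # Stand-in for whatever pre-processing was done manually during the POC.
    return line.strip().lower()

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadRaw" >> beam.Create(["  Scratch on CASING ", " Dent near camera "])  # stand-in for a real source
        | "Clean" >> beam.Map(clean_record)
        | "Write" >> beam.io.WriteToText("cleaned_data")
    )
```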
2.3 Meta-data, data provenance and lineage
Data pipeline example

Meta-data
Examples:
Manufacturing visual inspection: time, factory, line #, camera settings, phone model, inspector ID
Speech recognition: Device type, Labeler ID, VAD model ID
Useful for:
Error analysis; spotting unexpected effects
Keeping track of data provenance (where the data came from) and lineage (the sequence of steps that produced it)
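A minimal sketch of recording meta-data alongside each example for the visual inspection case (the schema and file paths are assumptions, not prescribed by the course):

```python
# Sketch: persist meta-data next to each raw example so error analysis can slice
# by factory, line, camera settings, etc., and provenance is not lost.
from dataclasses import dataclass, asdict
import json

@dataclass
class InspectionExample:
    image_path: str
    label: str
    timestamp: str
    factory: str
    line_number: int
    camera_settings: str
    inspector_id: str

example = InspectionExample(
    image_path="images/img_001.png",  # hypothetical path
    label="scratch",
    timestamp="2021-04-01T08:30:00",
    factory="factory_A",
    line_number=3,
    camera_settings="exposure=1/60,iso=400",
    inspector_id="inspector_17",
)

# Write the meta-data as a sidecar file next to the image.
with open("img_001.meta.json", "w") as f:
    json.dump(asdict(example), f, indent=2)
```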
3. Scoping
3.1 What is scoping
Scoping example: E-commerce retailer looking to increase sales
Better recommender system
Better search
Improve catalog data
Inventory management
Price optimization
Questions:
What project should we work on?
What are the metrics for success?
What are the resources (data, time, people) needed?
3.2 Scoping process
Brainstorm business problems (not AI problems)
Brainstorm AI solutions
Assess the feasibility and value of potential solutions
Determine milestones
Budget for resources
What are the top 3 things you wish were working better?
Increase conversion
Reduce inventory
Increase margin (profit per item)
Separating problem identification from solution

3.3 Diligence on feasibility and value
Feasibility: Is this project technically feasible?
Use external benchmark (literature, other company, competitor)
Why use HLP to benchmark
People are very good at unstructured data tasks
Criteria: Can a human, given the same data, perform the task?
Do we have features that are predictive?
Value
Ethical considerations
Is this project creating net positive social value?
Is this project reasonably fair and free from bias?
Have any ethical concerns been openly aired and debated?
3.4 Milestones and resourcing
Key specifications:
ML metrics (accuracy, precision, F1)
Software metrics (latency, throughput, given compute resources)
Business metrics (revenue)
Resources needed (data, personnel, help from other teams)
Timeline
If unsure, consider benchmarking against other projects, or building a POC first
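For reference, a minimal sketch of computing the ML metrics listed above with scikit-learn (the labels and predictions below are made up):

```python
# Sketch: compute accuracy, precision, recall, and F1 on a held-out set.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical ground truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```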