Data Definition and Baseline

0. Learning Objective

  • List the questions you need to answer in the process of data definition.

  • Compare and contrast the types of data problems you need to solve for structured vs. unstructured and big vs. small data.

  • Explain why label consistency is important and how you can improve it.

  • Explain why beating human level performance is not always indicative of an ML model's success.

  • Make a case for improving human level performance rather than beating it.

  • Identify how much training data you should gather given time and resource constraints.

  • Describe the key steps in a data pipeline.

  • Compare and contrast the proof of concept vs. production phases on an ML project.

  • Explain the importance of keeping track of data provenance and lineage.

1. Define Data and Establish Baseline

1.1 Why is data definition hard?

The fundamental reason is label ambiguity: for many inputs x, reasonable labelers can disagree on the correct label y.

1.2 Major types of data problems

  • Unstructured vs structured data

    • Unstructured data

      • May or may not have a huge collection of unlabeled examples x

      • Humans can label more data

      • Data augmentation is more likely to be helpful

    • Structured data

      • May be more difficult to obtain more data

      • Human labeling may not be possible (with some exceptions)

  • Small data (<10K examples) vs. big data (>10K examples)

    • Small data

      • Clean labels are critical

      • Can manually look through dataset and fix labels

      • Can get all the labelers to talk to each other

    • Big data

      • Emphasis is on data processes (e.g., consistent labeling instructions), since no one can manually review every label

1.3 Small data and label consistency

Why label consistency is important: with a small dataset, even a handful of inconsistently labeled examples is a large fraction of the training set, leaving the model little chance to learn the intended concept.

Big data problems can have small data challenges too

  • Problems with a large dataset but a long tail of rare events in the input will have small data challenges too

    • Web search

    • Self-driving cars

    • Product recommendation systems

1.4 Improving label consistency

  • Have multiple labelers label same example

  • When there is disagreement, have the MLE, subject matter expert, and labelers discuss the definition of y to reach agreement

  • If labelers believe that x doesn't contain enough information, consider changing x (e.g., improving the lighting for images)

  • Iterate until it is hard to significantly increase agreement

  • Have a class/label to capture uncertainty (e.g., a "borderline" class)

  • Small data vs big data (unstructured data)

    • Small data

      • Usually small number of labelers

      • Can ask labelers to discuss specific labels

    • Big data

      • Get consistent definition with small group

      • Then send labeling instructions to labelers

      • Can consider having multiple labelers label every example and taking a majority vote to increase accuracy (see the sketch below)
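
A minimal sketch of the majority-vote idea above; the `merge_labels` helper and the "borderline" fallback label are illustrative assumptions, not part of the notes:

```python
# Hypothetical helper: merge labels from multiple labelers by majority vote.
from collections import Counter

def merge_labels(labels: list[str], min_agreement: float = 0.5) -> str:
    """Return the majority label, or fall back to a 'borderline' class
    when no label wins a clear majority (capturing uncertainty)."""
    label, count = Counter(labels).most_common(1)[0]
    if count / len(labels) > min_agreement:
        return label
    return "borderline"

# Three labelers labeled the same example:
print(merge_labels(["scratch", "scratch", "dent"]))  # -> scratch
print(merge_labels(["scratch", "dent", "none"]))     # -> borderline
```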

1.5 Human Level Performance (HLP)

  • Why measure HLP

    • Estimate Bayes error/irreducible error to help with error analysis and prioritization (see the sketch after this list)

  • Other uses of HLP

    • In academia, establish and beat a respectable benchmark to support publication

    • Business or product owner asks for 99% accuracy. HLP helps establish a more reasonable target

    • "Prove" the ML system is superior to humans doing the job and thus the business or product owner should adopt it. -> Rarely work

      • Problem with beating HLP as a "proof" of ML superiority: when the ground truth is itself just another human label, an ML system can beat HLP by matching inconsistent labeling conventions, without actually being more useful in the application
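
One common way to operationalize HLP is the rate at which a human labeler agrees with a reference (ground-truth) label. A minimal sketch, assuming reference labels are available and using made-up values:

```python
# Estimate HLP as the fraction of examples where a human labeler's output
# matches the reference (ground-truth) label.
def estimate_hlp(human_labels: list[str], reference_labels: list[str]) -> float:
    matches = sum(h == r for h, r in zip(human_labels, reference_labels))
    return matches / len(reference_labels)

human     = ["defect", "ok", "defect", "ok", "defect"]
reference = ["defect", "ok", "ok",     "ok", "defect"]
print(f"Estimated HLP: {estimate_hlp(human, reference):.2f}")  # 0.80
```

Note that when the reference label is itself a human label, this number really measures human-human agreement, which is why inconsistent labeling instructions drive HLP below 100% (see 1.6).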

1.6 Raising HLP

  • When the label y comes from a human labeler, HLP << 100% may indicate ambiguous labeling instructions

  • Improving label consistency will raise HLP

  • This makes it harder for ML to beat HLP. But more consistent labels will also raise ML performance, which is what ultimately benefits the actual application

  • HLP on structured data

    Structured data problems are less likely to involve human labelers, so HLP is less frequently used

    • Some exceptions

      • User ID merging: Same person?

      • Based on network traffic, is the computer hacked?

      • Is the transaction fraudulent?

      • Spam account? Bot?

      • From GPS, what is the mode of transportation: on foot, bike, car, or bus?

2. Label and Organize Data

2.1 Obtaining data

  • How long should you spend obtaining data?

    • Get into the iteration loop (train model -> error analysis -> improve data) as quickly as possible

    • Instead of asking "How long would it take to obtain m examples?", ask "How much data can we obtain in k days?"

    • Exception: if you have worked on the problem before and know from experience that you need m examples

  • Inventory data: Brainstorm list of data sources

  • Labeling data

    • Options: in-house vs. outsourced vs. crowdsourced

    • Having MLEs label data is expensive. But doing this for just a few days is usually fine

    • Don't increase data by more than 10x at a time; beyond that, it is hard to predict how model performance and error patterns will change

2.2 Data pipeline

  • POC (Proof-of-concept)

    • Goal is to decide if the application is workable and worth deploying

    • Focus on getting the prototype to work

    • It is OK if data pre-processing is manual, but take extensive notes/comments so the steps can be replicated later

  • Production phase

    • After project utility is established, use more sophisticated tools to make sure the data pipeline is replicable

    • E.g., TensorFlow Transform, Apache Beam, Airflow
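
As a minimal sketch of what a replicable preprocessing step might look like in Apache Beam (one of the tools named above); the file paths and the `clean_text` step are hypothetical:

```python
import apache_beam as beam

def clean_text(line: str) -> str:
    """Example preprocessing step: normalize whitespace and casing."""
    return " ".join(line.lower().split())

# Each step is named, so the pipeline documents itself and can be re-run
# identically in production, unlike ad hoc manual preprocessing in a POC.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read raw data" >> beam.io.ReadFromText("raw_data.txt")
        | "Preprocess" >> beam.Map(clean_text)
        | "Drop empty lines" >> beam.Filter(lambda line: len(line) > 0)
        | "Write clean data" >> beam.io.WriteToText("clean_data")
    )
```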

2.3 Meta-data, data provenance and lineage

Data provenance refers to where the data (and code) came from; data lineage refers to the sequence of steps that produced the data at the end of the pipeline.

  • Meta-data

    • Examples:

      • Manufacturing visual inspection: time, factory, line #, camera settings, phone model, inspector ID

      • Speech recognition: device type, labeler ID, VAD model ID

    • Useful for:

      • Error analysis and spotting unexpected effects

      • Keeping track of data provenance and lineage (see the sketch below)
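
A minimal sketch of one way to keep metadata attached to each example so provenance can be traced later; the `LabeledExample` schema below simply mirrors the visual-inspection fields above and is an illustration, not a prescribed format:

```python
from dataclasses import dataclass

@dataclass
class LabeledExample:
    image_path: str       # raw input (x)
    label: str            # label (y)
    timestamp: str        # when the image was captured
    factory: str          # which factory produced it
    line_number: int      # which line within the factory
    camera_settings: str  # capture conditions, useful in error analysis
    inspector_id: str     # who labeled it (provenance)

example = LabeledExample(
    image_path="images/phone_0001.png",
    label="scratch",
    timestamp="2021-03-01T08:30:00",
    factory="factory_a",
    line_number=2,
    camera_settings="exposure=0.8",
    inspector_id="inspector_17",
)
```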

3. Scoping

3.1 What is scoping

Scoping example: E-commerce retailer looking to increase sales

  • Better recommender system

  • Better search

  • Improve catalog data

  • Inventory management

  • Price optimization

Questions:

  • What project should we work on?

  • What are the metrics for success?

  • What are the resources (data, time, people) needed?

3.2 Scoping process

  1. Brainstorm business problems (not AI problems)

  2. Brainstorm AI solutions

  3. Assess the feasibility and value of potential solutions

  4. Determine milestones

  5. Budget for resources

Example question to ask the business owner: What are the top 3 things you wish were working better?

  • Increase conversion

  • Reduce inventory

  • Increase margin (profit per item)

Separating problem identification from the solution lets you brainstorm business problems first, without prematurely committing to one AI solution.

3.3 Diligence on feasibility and value

  • Feasibility: Is this project technically feasible?

    Use an external benchmark (literature, other companies, competitors)

    • Why use HLP to benchmark

      People are very good at unstructured data tasks

      • Criterion: can a human, given the same data, perform the task?

    • Do we have features that are predictive? (Especially relevant for structured data.)

  • Value

    • Ethical considerations

      • Is this project creating net positive social value?

      • Is this project reasonably fair and free from bias?

      • Have any ethical concerns been openly aired and debated?

3.4 Milestones and resourcing

  • Key specifications:

    • ML metrics (accuracy, precision, F1; see the sketch after this list)

    • Software metrics (latency, throughput, given compute resources)

    • Business metrics (revenue)

    • Resources needed (data, personnel, help from other teams)

    • Timeline

  • If unsure, consider benchmarking against other projects, or building a POC first
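
As a minimal illustration of the ML metrics named above, computed with scikit-learn on made-up binary predictions:

```python
from sklearn.metrics import accuracy_score, precision_score, f1_score

# Toy ground-truth labels and model predictions (1 = positive class).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 0.75
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 0.75
print(f"F1:        {f1_score(y_true, y_pred):.2f}")         # 0.75
```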
