Collecting, Labeling and Validating Data

0. Overview

0.0 Learning Objectives

  • Describe the differences between ML modeling and a production ML system

  • Identify responsible data collection for building a fair production ML system

  • Discuss data and concept change and how to address it by annotating new training data with direct labeling and/or human labeling

  • Address training data issues by generating dataset statistics and creating, comparing and updating data schemas

0.1 Outline

  • Machine Learning (ML) engineering for production: Overview

  • Production ML = ML development + software development

  • Challenges in production ML

0.2 Traditional ML modeling vs Production ML systems

0.2.1 Traditional ML

0.2.2 Production ML systems require so much more

0.2.3 ML modeling vs production ML

0.2.4 As an ML Engineer

  • Managing the entire life cycle of data

    • Labeling

    • Feature space coverage

    • Minimal dimensionality

    • Maximum predictive data

    • Fairness

    • Rare conditions

  • Modern software development

    • Scalability

    • Extensibility

    • Configuration

    • Consistency & reproducibility

    • Safety & security

    • Modularity

    • Testability

    • Monitoring

    • Best practices

0.2.5 Production machine learning system

0.3 Challenges in production grade ML

  • Build integrated ML

  • Continuously operate it in production

  • Handle continuously changing data

  • Optimize compute resource costs

1. ML Pipelines

Infrastructure for automating, monitoring, and maintaining model training and deployment

1.1 Production ML infrastructure

1.2 Pipeline orchestration frameworks

  • Responsible for scheduling the components of an ML pipeline based on their DAG dependencies (see the sketch after this list)

  • Help with pipeline automation

  • Examples: Airflow, Argo, Celery, Luigi, Kubeflow
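
As a sketch of what "scheduling components based on DAG dependencies" looks like in practice, here is a minimal Airflow 2.x DAG; the task functions (ingest_data, validate_data, train_model) are placeholders for illustration only, not components of any particular framework.

```python
# Minimal sketch of an ML pipeline expressed as an Airflow DAG.
# The task functions below are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_data():
    """Pull the latest raw data from its source (placeholder)."""


def validate_data():
    """Check the new data against the expected schema (placeholder)."""


def train_model():
    """Retrain the model on the validated data (placeholder)."""


with DAG(
    dag_id="ml_pipeline_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # run once per day
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_data", python_callable=ingest_data)
    validate = PythonOperator(task_id="validate_data", python_callable=validate_data)
    train = PythonOperator(task_id="train_model", python_callable=train_model)

    # DAG dependencies: ingest -> validate -> train
    ingest >> validate >> train
```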

1.3 TensorFlow Extended (TFX)

End-to-end platform for deploying production ML pipelines

A sequence of components designed for scalable, high-performance machine learning tasks

1.3.1 TFX production components

1.3.2 TFX Hello World
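
A minimal "hello world" sketch of a TFX pipeline using the standard data components (CsvExampleGen, StatisticsGen, SchemaGen, ExampleValidator) and the local orchestrator; the paths are placeholders and the component set is deliberately small.

```python
# Minimal TFX pipeline sketch: ingest CSV data, compute statistics,
# infer a schema, and validate the data against it. Paths are placeholders.
from tfx import v1 as tfx

DATA_ROOT = "data/"            # directory containing CSV files
PIPELINE_ROOT = "pipeline/"    # artifact output location
METADATA_PATH = "metadata.db"  # ML Metadata store (SQLite)


def create_pipeline() -> tfx.dsl.Pipeline:
    example_gen = tfx.components.CsvExampleGen(input_base=DATA_ROOT)
    statistics_gen = tfx.components.StatisticsGen(
        examples=example_gen.outputs["examples"])
    schema_gen = tfx.components.SchemaGen(
        statistics=statistics_gen.outputs["statistics"])
    example_validator = tfx.components.ExampleValidator(
        statistics=statistics_gen.outputs["statistics"],
        schema=schema_gen.outputs["schema"])

    return tfx.dsl.Pipeline(
        pipeline_name="hello_world",
        pipeline_root=PIPELINE_ROOT,
        components=[example_gen, statistics_gen, schema_gen, example_validator],
        metadata_connection_config=(
            tfx.orchestration.metadata.sqlite_metadata_connection_config(
                METADATA_PATH)),
    )


if __name__ == "__main__":
    # Run the pipeline locally; production deployments would typically use
    # an orchestrator such as Kubeflow Pipelines or Airflow instead.
    tfx.orchestration.LocalDagRunner().run(create_pipeline())
```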

Key points

  • Production ML pipelines: automating, monitoring, and maintaining end-to-end process

  • Production ML is much more than just ML code

    • ML development + software development

  • TFX is an open-source end-to-end ML platform

2. Collecting Data

2.1 Importance of Data

Outline

  • Importance of data quality

  • Data pipeline: data collection, ingestion and preparation

  • Data collection and monitoring

2.1.1 ML: Data is a first-class citizen

  • Software 1.0

    • Explicit instructions to the computer

  • Software 2.0

    • Specify some goal on the behavior of a program

    • Find solution using optimization techniques

    • Good data is key for success

    • Code in Software = Data in ML

Everything starts with data

  • Models aren't magic

  • Meaningful data:

    • Maximize predictive content

    • Remove non-informative data

    • Feature space coverage

Key Points:

  • Understand users, translate user needs into data problems

  • Ensure data coverage and high predictive signal

  • Source, store and monitor quality data responsibly

2.2 Example Application: Suggesting Runs

  • Key considerations

    • Data availability and collection

      • What kind of/how much data is available

      • How often does the new data come in

      • Is it annotated?

        • If not, how hard/expensive is it to get it labeled

    • Translate user needs into data needs

      • Data needed

      • Features needed

      • Labels needed

Example data

  • Get to know your data

    • Identify data sources

    • Check if they are refreshed

    • Consistency for values, units, & data types

    • Monitor outliers and errors

  • Dataset issues

    • Inconsistent formatting

      • Is zero "0", "0.0", or an indicator of a missing measurement?

    • Compounding errors from other ML models

    • Monitor data sources for system issues and outages

  • Measure data effectiveness

    • Intuition about data value can be misleading

      • Which features have predictive value and which ones do not?

      • Feature engineering helps to maximize the predictive signals

      • Feature selection helps to measure the predictive signal (see the sketch after this list)

  • Translate user needs into data needs

    • Data needs

      • Running data from the app

      • Demographic data

      • Local geographic data

    • Feature needs

      • Runner demographics

      • Time of day

      • Run completion rate

      • Pace

      • Distance run

      • Elevation gained

      • Heart rate

    • Label needs

      • Runner acceptance or rejection of app suggestions

      • User-generated feedback regarding why a suggestion was rejected

      • User rating of enjoyment of recommended runs
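
To make "measure the predictive signal" concrete, here is a small sketch, assuming scikit-learn and a pandas DataFrame whose columns correspond to the hypothetical running features above, that ranks features by estimated mutual information with the label.

```python
# Sketch: rank candidate features by their predictive signal for the label
# using mutual information. Column names are hypothetical examples taken
# from the feature list above; assumes numeric, non-missing feature columns.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def rank_features(df: pd.DataFrame,
                  label_col: str = "suggestion_accepted") -> pd.Series:
    """Return features sorted by estimated mutual information with the label."""
    features = df.drop(columns=[label_col])
    labels = df[label_col]
    scores = mutual_info_classif(features, labels, random_state=0)
    return pd.Series(scores, index=features.columns).sort_values(ascending=False)


# Example usage with made-up column names:
# df = pd.read_csv("runs.csv")  # e.g. pace, distance, elevation_gain,
#                               # heart_rate, time_of_day, suggestion_accepted
# print(rank_features(df))
```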

Key points

  • Understand your users and translate their needs into data problems

    • What kind of/how much data is available

    • What are the details and issues of your data

    • What are your predictive features

    • What are the labels you are tracking

    • What are your metrics

2.3 Responsible Data: Security, Privacy & Fairness

Source Data Responsibly

  • Data security and privacy

    • Data collection and management isn't just about your model

      • Give users control of what data can be collected

      • Is there a risk of inadvertently revealing user data?

    • Compliance with regulations and policies (e.g. GDPR)

  • User Privacy

    • Protect personally identifiable information (PII)

      • Aggregation - replace unique values with a summary value

      • Redaction - remove some data to create a less complete picture (see the sketch after this list)

  • How ML systems can fail users

    • Representational harm

    • Opportunity denial

    • Disproportionate product failure

    • Harm by disadvantage

  • Commit to fairness

    • Make sure your models are fair

      • Group fairness, equal accuracy

    • Bias in human-labeled and/or collected data

    • ML models can amplify biases

  • Reducing bias: design fair labeling systems

    • Accurate labels are necessary for supervised learning

    • Labeling can be done by:

      • Automation (logging or weak supervision)

      • Humans (aka "raters", often semi-supervised)

  • Types of human raters

    • Generalists: crowdsourcing tools

    • Subject matter experts: specialized tools, e.g. medical image labeling

    • Your users: Derived labels, e.g. tagged photos
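
A minimal pandas sketch of the two privacy techniques named above, aggregation and redaction; the column names (name, email, exact_location, age) are hypothetical.

```python
# Sketch of simple privacy-preserving transforms applied before data is
# stored or shared. Column names are hypothetical.
import pandas as pd


def redact(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Redaction: remove fields that directly identify a user."""
    return df.drop(columns=columns, errors="ignore")


def aggregate_age(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregation: replace an exact age with a coarse age bucket."""
    out = df.copy()
    out["age_bucket"] = pd.cut(out["age"], bins=[0, 18, 30, 45, 60, 120],
                               labels=["<18", "18-29", "30-44", "45-59", "60+"])
    return out.drop(columns=["age"])


# Example usage:
# df = redact(df, columns=["name", "email", "exact_location"])
# df = aggregate_age(df)
```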

Key points

  • Ensure rater pool diversity

  • Investigate rater context and incentives

  • Evaluate rater tools

  • Manage cost

  • Determine freshness requirements

3. Labeling Data

3.1 Data and Concept Changes in Production ML

  • Detecting problems with deployed models

    • Data and scope changes

    • Monitor models and validate data to find problems early

    • Changing ground truth: label new training data

  • Easy problems

    • Ground truth changes slowly (months, years)

    • Model retraining driven by:

      • Model improvements, better data

      • Changes in software and/or systems

    • Labeling

      • Curated datasets

      • Crowd-based

  • Harder problems

    • Ground truth changes faster (weeks)

    • Model retraining driven by:

      • Declining model performance

      • Model improvements, better data

      • Changes in software and/or system

    • Labeling

      • Direct feedback

      • Crowd-based

  • Really hard problems

    • Ground truth changes very fast (days, hours, mins)

    • Model retraining driven by:

      • Declining model performance (see the sketch after this list)

      • Model improvements, better data

      • Changes in software and/or system

    • Labeling

      • Direct feedback

      • Weak supervision
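
As an illustration of retraining driven by declining model performance, here is a hedged sketch of a simple monitoring check; the metric window, baseline, tolerance, and the retrain() hook are all placeholder assumptions, not part of any particular system.

```python
# Sketch: trigger retraining when recent performance decays past a tolerance.
# baseline_metric, tolerance, and the retrain() hook are placeholders.
from statistics import mean


def should_retrain(recent_metrics: list[float],
                   baseline_metric: float,
                   tolerance: float = 0.02) -> bool:
    """Return True if the average recent metric drops below baseline - tolerance."""
    if not recent_metrics:
        return False
    return mean(recent_metrics) < baseline_metric - tolerance


# Example usage:
# daily_auc = [0.91, 0.89, 0.87]          # metric computed from fresh labels
# if should_retrain(daily_auc, baseline_metric=0.92):
#     retrain()                            # placeholder for the retraining job
```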

Key points

  • Model performance decays over time

    • Data and Concept drift

  • Model retraining helps to improve performance

    • Data labeling for changing ground truth and scarce labels

3.2 Process feedback and human labeling

  • Data labeling

    • Process feedback (Direct Labeling)

      Example: Actual vs predicted click-through

    • Human Labeling

      Example: Cardiologists labeling MRI images

    • Semi-supervised Labeling (advanced method)

    • Active Learning (advanced method)

    • Weak Supervision (advanced method)

  • Why is labeling important in production ML?

    • Using data available to the business/organization

    • Frequent model retraining

    • Ongoing, newly collected data requires new labels

  • Direct labeling: continuous creation of training datasets (see the sketch at the end of this list)

    • Process feedback - Advantages

      • Training dataset continuous creation

      • Labels evolve quickly

      • Captures strong label signals

    • Process feedback - Disadvantages

      • Hindered by the inherent nature of the problem

      • Failure to capture ground truth

      • Largely bespoke design

    • Open-source log analysis tools

      • Logstash

        • Free and open source data processing pipeline

          • Ingests data from a multitude of sources

          • Transforms it

          • Sends it to your favorite "stash"

      • Fluentd

        • Open source data collector

        • Unifies data collection and consumption

      • Cloud log analytics

        • Google Cloud Logging

        • AWS ElasticSearch

        • Azure Monitor

  • Human labeling

    People examine the data and assign labels manually

    • Unlabeled data is collected

    • Human "raters" are recruited

    • Instructions to guide raters are created

    • Data is divided and assigned to raters

    • Labels are collected and conflicts resolved

    • Advantages:

      • More labels

      • Pure supervised learning

    • Disadvantages

      • Quality consistency: many datasets are difficult for humans to label

      • Slow

      • Expensive

      • Small dataset curation

    • Why is human labeling a problem?

      • MRI: high cost specialist labeling

      • Single rater: limited # examples per day

      • Recruitment is slow and expensive
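
A small pandas sketch of process feedback (direct labeling) for the click-through example above: predictions logged at serving time are joined with later click events, and "clicked or not" becomes the label. Table and column names are hypothetical.

```python
# Sketch of direct labeling from process feedback: join logged predictions
# with subsequent click events so the click outcome becomes the label.
# Table and column names are hypothetical.
import pandas as pd


def build_training_examples(predictions: pd.DataFrame,
                            clicks: pd.DataFrame) -> pd.DataFrame:
    """predictions: one row per served recommendation (request_id, features...).
    clicks: one row per click event (request_id)."""
    clicked_ids = set(clicks["request_id"])
    labeled = predictions.copy()
    labeled["label"] = labeled["request_id"].isin(clicked_ids).astype(int)
    return labeled


# Example usage:
# predictions = pd.read_parquet("serving_logs.parquet")
# clicks = pd.read_parquet("click_events.parquet")
# training_data = build_training_examples(predictions, clicks)
```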

Key points

  • Various methods of data labeling

    • Process feedback

    • Human labeling

  • Pros and Cons of both

4. Validating data

4.1 Detecting Data Issues

  • Data issues

    • Drift and skew

      • Drift: Changes in data over time, such as data collected once a day ==> Model decay

        • Performance decay: concept drift

      • Skew: Difference between two static versions, or different sources, such as the training set and the serving set

    • Detecting data issues

      • Detecting schema skew

        • Training and serving data do not conform to the same schema

      • Detecting distribution skew

        • Dataset shift --> covariate or concept drift

      • Requires continuous evaluation

Skew detection workflow

4.2 Tensorflow Data Validation

  • TensorFlow Data Validation (TFDV)

    • Understand, validate, and monitor ML data at scale

    • Used to analyze and validate petabytes of data at Google every day

    • Proven track record in helping TFX users maintain the health of their ML pipelines

  • TFDV capabilities

    • Generates data statistics and browser visualization

    • Infers the data schema

    • Performs validity checks against schema

    • Detects training/serving skews

  • TFDV can detect skews

    • Schema skew

    • Feature skew

    • Distribution skew

  • Skew - TFDV

    • Supported for categorical features

    • Expressed in terms of L-infinity distance (Chebyshev Distance)

      • D_{Chebyshev}(x, y) = \max_i(|x_i - y_i|)

    • Set a threshold to receive warnings (see the sketch after this list)

  • Schema skew

    • Serving and training data don't conform to the same schema, e.g. int != float

  • Feature skew

    • Training feature values are different from the serving feature values

      • Feature values are modified between training and serving time

      • Transformation applied only in one of the two instances

  • Distribution skew

    • Distribution of the serving and training datasets is significantly different

      • Faulty sampling method during training

      • Different data sources for training and serving data

      • Trend, seasonality, changes in data over time
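
A hedged sketch of the TFDV workflow described above: generate statistics, infer a schema, set an L-infinity skew threshold on a categorical feature, and validate serving data against the training baseline. The file paths and the feature name ("payment_type") are placeholders.

```python
# Sketch of the TFDV workflow: compute statistics, infer a schema, set an
# L-infinity (Chebyshev) skew threshold on a categorical feature, and
# validate serving data against the training baseline.
# File paths and the feature name are placeholders.
import tensorflow_data_validation as tfdv

train_stats = tfdv.generate_statistics_from_csv(data_location="train.csv")
serving_stats = tfdv.generate_statistics_from_csv(data_location="serving.csv")

# Infer a schema from the training statistics and review it.
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema)

# Flag skew on a categorical feature when the L-infinity distance between
# training and serving value distributions exceeds the threshold.
tfdv.get_feature(schema, "payment_type").skew_comparator.infinity_norm.threshold = 0.01

# Compare serving data against the training baseline and report anomalies.
anomalies = tfdv.validate_statistics(statistics=train_stats,
                                     schema=schema,
                                     serving_statistics=serving_stats)
tfdv.display_anomalies(anomalies)
```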

Key points

  • TFDV: descriptive statistics at scale with embedded Facets visualizations

  • It provides insight into:

    • What are the underlying statistics of your data

    • How do your training, evaluation, and serving dataset statistics compare

    • How can you detect and fix data anomalies
