Collecting, Labeling and Validating Data

0. Overview

0.0 Learning Objectives

  • Describe the differences between ML modeling and a production ML system

  • Identify responsible data collection for building a fair production ML system

  • Discuss data and concept change and how to address it by annotating new training data with direct labeling and/or human labeling

  • Address training data issues by generating dataset statistics and creating, comparing and updating data schemas

0.1 Outline

  • Machine Learning (ML) engineering for production: Overview

  • Production ML = ML development + software development

  • Challenges in production ML

0.2 Traditional ML modeling vs Production ML systems

0.2.1 Traditional ML

0.2.2 Production ML systems require so much more

0.2.3 ML modeling vs production ML

0.2.4 As an ML Engineer

  • Managing the entire life cycle of data

    • Labeling

    • Feature space coverage

    • Minimal dimensionality

    • Maximum predictive data

    • Fairness

    • Rare conditions

  • Modern software development

    • Scalability

    • Extensibility

    • Configuration

    • Consistency & reproducibility

    • Safety & security

    • Modularity

    • Testability

    • Monitoring

    • Best practices

0.2.5 Production machine learning system

0.3 Challenges in production grade ML

  • Build integrated ML

  • Continuously operate it in production

  • Handle continuously changing data

  • Optimize compute resource costs

1. ML Pipelines

Infrastructure for automating, monitoring, and maintaining model training and deployment

1.1 Production ML infrastructure

1.2 Pipeline orchestration frameworks

  • Responsible for scheduling the components of an ML pipeline based on their DAG dependencies (see the sketch after this list)

  • Help with pipeline automation

  • Examples: Airflow, Argo, Celery, Luigi, Kubeflow
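
As a sketch of what "scheduling components based on DAG dependencies" looks like in practice, here is a minimal Airflow 2.x DAG; the task functions (ingest_data, validate_data, train_model) are placeholders for illustration only, not components of any particular framework.

```python
# Minimal sketch of an ML pipeline expressed as an Airflow DAG.
# The task functions below are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_data():
    """Pull the latest raw data from its source (placeholder)."""


def validate_data():
    """Check the new data against the expected schema (placeholder)."""


def train_model():
    """Retrain the model on the validated data (placeholder)."""


with DAG(
    dag_id="ml_pipeline_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # run once per day
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_data", python_callable=ingest_data)
    validate = PythonOperator(task_id="validate_data", python_callable=validate_data)
    train = PythonOperator(task_id="train_model", python_callable=train_model)

    # DAG dependencies: ingest -> validate -> train
    ingest >> validate >> train
```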

1.3 TensorFlow Extended (TFX)

End-to-end platform for deploying production ML pipelines

A sequence of components designed for scalable, high-performance machine learning tasks

1.3.1 TFX production components

1.3.2 TFX Hello World
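
A minimal "hello world" sketch of a TFX pipeline using the standard data components (CsvExampleGen, StatisticsGen, SchemaGen, ExampleValidator) and the local orchestrator; the paths are placeholders and the component set is deliberately small.

```python
# Minimal TFX pipeline sketch: ingest CSV data, compute statistics,
# infer a schema, and validate the data against it. Paths are placeholders.
from tfx import v1 as tfx

DATA_ROOT = "data/"            # directory containing CSV files
PIPELINE_ROOT = "pipeline/"    # artifact output location
METADATA_PATH = "metadata.db"  # ML Metadata store (SQLite)


def create_pipeline() -> tfx.dsl.Pipeline:
    example_gen = tfx.components.CsvExampleGen(input_base=DATA_ROOT)
    statistics_gen = tfx.components.StatisticsGen(
        examples=example_gen.outputs["examples"])
    schema_gen = tfx.components.SchemaGen(
        statistics=statistics_gen.outputs["statistics"])
    example_validator = tfx.components.ExampleValidator(
        statistics=statistics_gen.outputs["statistics"],
        schema=schema_gen.outputs["schema"])

    return tfx.dsl.Pipeline(
        pipeline_name="hello_world",
        pipeline_root=PIPELINE_ROOT,
        components=[example_gen, statistics_gen, schema_gen, example_validator],
        metadata_connection_config=(
            tfx.orchestration.metadata.sqlite_metadata_connection_config(
                METADATA_PATH)),
    )


if __name__ == "__main__":
    # Run the pipeline locally; production deployments would typically use
    # an orchestrator such as Kubeflow Pipelines or Airflow instead.
    tfx.orchestration.LocalDagRunner().run(create_pipeline())
```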

Key points

  • Production ML pipelines: automating, monitoring, and maintaining end-to-end process

  • Production ML is much more than just ML code

    • ML development + software development

  • TFX is an open-source end-to-end ML platform

2. Collecting Data

2.1 Importance of Data

Outline

  • Importance of data quality

  • Data pipeline: data collection, ingestion and preparation

  • Data collection and monitoring

2.1.1 ML: Data is a first-class citizen

  • Software 1.0

    • Explicit instructions to the computer

  • Software 2.0

    • Specify some goal on the behavior of a program

    • Find solution using optimization techniques

    • Good data is key for success

    • Code in Software = Data in ML

Everything starts with data

  • Models aren't magic

  • Meaningful data:

    • Maximize predictive content

    • Remove non-informative data

    • Feature space coverage

Key Points:

  • Understand users, translate user needs into data problems

  • Ensure data coverage and high predictive signal

  • Source, store and monitor quality data responsibly

2.2 Example Application: Suggesting Runs

  • Key considerations

    • Data availability and collection

      • What kind of/how much data is available

      • How often does the new data come in

      • Is it annotated?

        • If not, how hard/expensive is it to get it labeled

    • Translate user needs into data needs

      • Data needed

      • Features needed

      • Labels needed

Example data

  • Get to know your data

    • Identify data sources

    • Check if they are refreshed

    • Consistency for values, units, & data types

    • Monitor outliers and errors

  • Dataset issues

    • Inconsistent formatting

      • Is zero "0", "0.0", or an indicator of a missing measurement?

    • Compounding errors from other ML models

    • Monitor data sources for system issues and outages

  • Measure data effectiveness

    • Intuition about data value can be misleading

      • Which features have predictive value and which ones do not?

      • Feature engineering helps to maximize the predictive signals

      • Feature selection helps to measure the predictive signal (see the sketch after this list)

  • Translate user needs into data needs

    • Data needs

      • Running data from the app

      • Demographic data

      • Local geographic data

    • Feature needs

      • Runner demographics

      • Time of day

      • Run completion rate

      • Pace

      • Distance run

      • Elevation gained

      • Heart rate

    • Label needs

      • Runner acceptance or rejection of app suggestions

      • User-generated feedback regarding why a suggestion was rejected

      • User rating of enjoyment of recommended runs
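
To make "measure the predictive signal" concrete, here is a small sketch, assuming scikit-learn and a pandas DataFrame whose columns correspond to the hypothetical running features above, that ranks features by estimated mutual information with the label.

```python
# Sketch: rank candidate features by their predictive signal for the label
# using mutual information. Column names are hypothetical examples taken
# from the feature list above; assumes numeric, non-missing feature columns.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def rank_features(df: pd.DataFrame,
                  label_col: str = "suggestion_accepted") -> pd.Series:
    """Return features sorted by estimated mutual information with the label."""
    features = df.drop(columns=[label_col])
    labels = df[label_col]
    scores = mutual_info_classif(features, labels, random_state=0)
    return pd.Series(scores, index=features.columns).sort_values(ascending=False)


# Example usage with made-up column names:
# df = pd.read_csv("runs.csv")  # e.g. pace, distance, elevation_gain,
#                               # heart_rate, time_of_day, suggestion_accepted
# print(rank_features(df))
```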

Key points

  • Understand your users and translate their needs into data problems

    • What kind of/how much data is available

    • What are the details and issues of your data

    • What are your predictive features

    • What are the labels you are tracking

    • What are your metrics

2.3 Responsible Data: Security, Privacy & Fairness

Source Data Responsibly

  • Data security and privacy

    • Data collection and management isn't just about your model

      • Give users control of what data can be collected

      • Is there a risk of inadvertently revealing user data?

    • Compliance with regulations and policies (e.g. GDPR)

  • User Privacy

    • Protect personally identifiable information (PII)

      • Aggregation - replace unique values with a summary value

      • Redaction - remove some data to create a less complete picture (see the sketch after this list)

  • How ML systems can fail users

    • Representational harm

    • Opportunity denial

    • Disproportionate product failure

    • Harm by disadvantage

  • Commit to fairness

    • Make sure your models are fair

      • Group fairness, equal accuracy

    • Bias in human-labeled and/or collected data

    • ML models can amplify biases

  • Reducing bias: design fair labeling systems

    • Accurate labels are necessary for supervised learning

    • Labeling can be done by:

      • Automation (logging or weak supervision)

      • Humans (aka "raters", often semi-supervised)

  • Types of human raters

    • Generalists: crowdsourcing tools

    • Subject matter experts: specialized tools, e.g. medical image labeling

    • Your users: Derived labels, e.g. tagged photos
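
A minimal pandas sketch of the two privacy techniques named above, aggregation and redaction; the column names (name, email, exact_location, age) are hypothetical.

```python
# Sketch of simple privacy-preserving transforms applied before data is
# stored or shared. Column names are hypothetical.
import pandas as pd


def redact(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Redaction: remove fields that directly identify a user."""
    return df.drop(columns=columns, errors="ignore")


def aggregate_age(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregation: replace an exact age with a coarse age bucket."""
    out = df.copy()
    out["age_bucket"] = pd.cut(out["age"], bins=[0, 18, 30, 45, 60, 120],
                               labels=["<18", "18-29", "30-44", "45-59", "60+"])
    return out.drop(columns=["age"])


# Example usage:
# df = redact(df, columns=["name", "email", "exact_location"])
# df = aggregate_age(df)
```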

Key points

  • Ensure rater pool diversity

  • Investigate rater context and incentives

  • Evaluate rater tools

  • Manage cost

  • Determine freshness requirements

3. Labeling Data

3.1 Data and Concept Changes in Production ML

  • Detecting problems with deployed models

    • Data and scope changes

    • Monitor models and validate data to find problems early

    • Changing ground truth: label new training data

  • Easy problems

    • Ground truth changes slowly (months, years)

    • Model retraining driven by:

      • Model improvements, better data

      • Changes in software and/or systems

    • Labeling

      • Curated datasets

      • Crowd-based

  • Harder problems

    • Ground truth changes faster (weeks)

    • Model retraining driven by:

      • Declining model performance

      • Model improvements, better data

      • Changes in software and/or system

    • Labeling

      • Direct feedback

      • Crowd-based

  • Really hard problems

    • Ground truth changes very fast (days, hours, mins)

    • Model retraining driven by:

      • Declining model performance (see the sketch after this list)

      • Model improvements, better data

      • Changes in software and/or system

    • Labeling

      • Direct feedback

      • Weak supervision
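
As an illustration of retraining driven by declining model performance, here is a hedged sketch of a simple monitoring check; the metric window, baseline, tolerance, and the retrain() hook are all placeholder assumptions, not part of any particular system.

```python
# Sketch: trigger retraining when recent performance decays past a tolerance.
# baseline_metric, tolerance, and the retrain() hook are placeholders.
from statistics import mean


def should_retrain(recent_metrics: list[float],
                   baseline_metric: float,
                   tolerance: float = 0.02) -> bool:
    """Return True if the average recent metric drops below baseline - tolerance."""
    if not recent_metrics:
        return False
    return mean(recent_metrics) < baseline_metric - tolerance


# Example usage:
# daily_auc = [0.91, 0.89, 0.87]          # metric computed from fresh labels
# if should_retrain(daily_auc, baseline_metric=0.92):
#     retrain()                            # placeholder for the retraining job
```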

Key points

  • Model performance decays over time

    • Data and Concept drift

  • Model retraining helps to improve performance

    • Data labeling for changing ground truth and scarce labels

3.2 Process feedback and human labeling

  • Data labeling

    • Process feedback (Direct Labeling)

      Example: Actual vs predicted click-through

    • Human Labeling

      Example: Cardiologists labeling MRI images

    • Semi-supervised Labeling (advanced method)

    • Active Learning (advanced method)

    • Weak Supervision (advanced method)

  • Why is labeling important in production ML?

    • Using data available to the business/organization

    • Frequent model retraining

    • Ongoing, newly collected data requires new labels

  • Direct labeling: continuous creation of training datasets (see the sketch at the end of this list)

    • Process feedback - Advantages

      • Training dataset continuous creation

      • Labels evolve quickly

      • Captures strong label signals

    • Process feedback - Disadvantages

      • Hindered by the inherent nature of the problem

      • Failure to capture ground truth

      • Largely bespoke design

    • Open-source log analysis tools

      • Logstash

        • Free and open source data processing pipeline

          • Ingests data from a multitude of sources

          • Transforms it

          • Sends it to your favorite "stash"

      • Fluentd

        • Open source data collector

        • Unifies data collection and consumption

      • Cloud log analytics

        • Google Cloud Logging

        • AWS ElasticSearch

        • Azure Monitor

  • Human labeling

    People examine the data and assign labels manually

    • Unlabeled data is collected

    • Human "raters" are recruited

    • Instructions to guide raters are created

    • Data is divided and assigned to raters

    • Labels are collected and conflicts resolved

    • Advantages:

      • More labels

      • Pure supervised learning

    • Disadvantages

      • Quality consistency: many datasets are difficult for humans to label

      • Slow

      • Expensive

      • Small dataset curation

    • Why is human labeling a problem?

      • MRI: high cost specialist labeling

      • Single rater: limited # examples per day

      • Recruitment is slow and expensive
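
A small pandas sketch of process feedback (direct labeling) for the click-through example above: predictions logged at serving time are joined with later click events, and "clicked or not" becomes the label. Table and column names are hypothetical.

```python
# Sketch of direct labeling from process feedback: join logged predictions
# with subsequent click events so the click outcome becomes the label.
# Table and column names are hypothetical.
import pandas as pd


def build_training_examples(predictions: pd.DataFrame,
                            clicks: pd.DataFrame) -> pd.DataFrame:
    """predictions: one row per served recommendation (request_id, features...).
    clicks: one row per click event (request_id)."""
    clicked_ids = set(clicks["request_id"])
    labeled = predictions.copy()
    labeled["label"] = labeled["request_id"].isin(clicked_ids).astype(int)
    return labeled


# Example usage:
# predictions = pd.read_parquet("serving_logs.parquet")
# clicks = pd.read_parquet("click_events.parquet")
# training_data = build_training_examples(predictions, clicks)
```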

Key points

  • Various methods of data labeling

    • Process feedback

    • Human labeling

  • Pros and Cons of both

4. Validating data

4.1 Detecting Data Issues

  • Data issues

    • Drift and skew

      • Drift: Changes in data over time, such as data collected once a day ==> Model decay

        • Performance decay: concept drift

      • Skew: Difference between two static versions, or different sources, such as the training set and the serving set

    • Detecting data issues

      • Detecting schema skew

        • Training and serving data do not conform to the same schema

      • Detecting distribution skew

        • Dataset shift --> covariate or concept drift

      • Requires continuous evaluation

Skew detection workflow

4.2 Tensorflow Data Validation

  • TensorFlow Data Validation (TFDV)

    • Understand, validate, and monitor ML data at scale

    • Used to analyze and validate petabytes of data at Google every day

    • Proven track record in helping TFX users maintain the health of their ML pipelines

  • TFDV capabilities

    • Generates data statistics and browser visualization

    • Infers the data schema

    • Performs validity checks against schema

    • Detects training/serving skews

  • TFDV can detect skews

    • Schema skew

    • Feature skew

    • Distribution skew

  • Skew - TFDV

    • Supported for categorical features

    • Expressed in terms of L-infinity distance (Chebyshev Distance)

      • D_{Chebyshev}(x, y) = \max_i(|x_i - y_i|)

    • Set a threshold to receive warnings (see the sketch after this list)

  • Schema skew

    • Serving and training data don't conform to the same schema, e.g. int != float

  • Feature skew

    • Training feature values are different from the serving feature values

      • Feature values are modified between training and serving time

      • Transformation applied only in one of the two instances

  • Distribution skew

    • Distribution of the serving and training datasets is significantly different

      • Faulty sampling method during training

      • Different data sources for training and serving data

      • Trend, seasonality, changes in data over time
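
A hedged sketch of the TFDV workflow described above: generate statistics, infer a schema, set an L-infinity skew threshold on a categorical feature, and validate serving data against the training baseline. The file paths and the feature name ("payment_type") are placeholders.

```python
# Sketch of the TFDV workflow: compute statistics, infer a schema, set an
# L-infinity (Chebyshev) skew threshold on a categorical feature, and
# validate serving data against the training baseline.
# File paths and the feature name are placeholders.
import tensorflow_data_validation as tfdv

train_stats = tfdv.generate_statistics_from_csv(data_location="train.csv")
serving_stats = tfdv.generate_statistics_from_csv(data_location="serving.csv")

# Infer a schema from the training statistics and review it.
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema)

# Flag skew on a categorical feature when the L-infinity distance between
# training and serving value distributions exceeds the threshold.
tfdv.get_feature(schema, "payment_type").skew_comparator.infinity_norm.threshold = 0.01

# Compare serving data against the training baseline and report anomalies.
anomalies = tfdv.validate_statistics(statistics=train_stats,
                                     schema=schema,
                                     serving_statistics=serving_stats)
tfdv.display_anomalies(anomalies)
```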

Key points

  • TFDV: descriptive statistics at scale with embedded Facets visualizations

  • It provides insight into:

    • What are the underlying statistics of your data

    • How do your training, evaluation, and serving dataset statistics compare

    • How can you detect and fix data anomalies
