Collecting, Labeling and Validating Data
0. Overview
0.0 Learning Objective
Describe the differences between ML modeling and a production ML system
Identify responsible data collection for building a fair production ML system
Discuss data and concept change and how to address it by annotating new training data with direct labeling and/or human labeling
Address training data issues by generating dataset statistics and creating, comparing and updating data schemas
0.1 Outline
Machine Learning (ML) engineering for production: Overview
Production ML = ML development + software development
Challenges in production ML
0.2 Traditional ML modeling vs Production ML systems
0.2.1 Traditional ML

0.2.2 Production ML systems require so much more

0.2.3 ML modeling vs production ML

0.2.4 As an ML Engineer
Managing the entire life cycle of data
Labeling
Feature space coverage
Minimal dimensionality
Maximum predictive data
Fairness
Rare conditions
Modern software development
Scalability
Extensibility
Configuration
Consistency & reproducibility
Safety & security
Modularity
Testability
Monitoring
Best practices
0.2.5 Production machine learning system

0.3 Challenges in production grade ML
Build integrated ML
Continuously operate it in production
Handle continuously changing data
Optimize compute resource costs
1. ML Pipelines

Infrastructure for automating, monitoring, and maintaining model training and deployment
1.1 Production ML infrastructure

1.2 Pipeline orchestration frameworks

Responsible for scheduling the various components of an ML pipeline based on their DAG dependencies
Help with pipeline automation
Examples: Airflow, Argo, Celery, Luigi, Kubeflow
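As a concrete illustration of orchestration, here is a minimal sketch of a three-step pipeline expressed as an Airflow DAG, assuming Airflow 2.x; the DAG name and task bodies are placeholders, not part of the original notes. Kubeflow, Argo, and the other frameworks listed above express the same dependency graph in their own ways.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest():
    print("ingest data")      # placeholder for real ingestion logic


def validate():
    print("validate data")    # placeholder for real validation logic


def train():
    print("train model")      # placeholder for real training logic


with DAG(
    dag_id="ml_pipeline_sketch",        # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    validate_task = PythonOperator(task_id="validate", python_callable=validate)
    train_task = PythonOperator(task_id="train", python_callable=train)

    # The orchestrator resolves these dependencies and runs tasks in order.
    ingest_task >> validate_task >> train_task
```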
1.3 TensorFlow Extended (TFX)

Sequence of components that are designed for scalable, high-performance machine learning tasks
1.3.1 TFX production components

1.3.2 TFX Hello World

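A minimal "hello world" TFX pipeline might look like the sketch below, run with the LocalDagRunner; the data and metadata paths are assumptions.

```python
import os

from tfx import v1 as tfx

# Hypothetical paths; substitute your own locations.
DATA_ROOT = "data"          # directory containing CSV training data
PIPELINE_ROOT = os.path.join("pipelines", "hello_world")
METADATA_PATH = os.path.join("metadata", "hello_world", "metadata.db")

# Ingest raw CSVs and emit tf.Example records.
example_gen = tfx.components.CsvExampleGen(input_base=DATA_ROOT)

# Compute descriptive statistics over the examples.
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs["examples"])

# Infer a data schema from those statistics.
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs["statistics"])

pipeline = tfx.dsl.Pipeline(
    pipeline_name="hello_world",
    pipeline_root=PIPELINE_ROOT,
    components=[example_gen, statistics_gen, schema_gen],
    metadata_connection_config=(
        tfx.orchestration.metadata.sqlite_metadata_connection_config(
            METADATA_PATH)),
)

# Run the components locally, in dependency order.
tfx.orchestration.LocalDagRunner().run(pipeline)
```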
Key points
Production ML pipelines: automating, monitoring, and maintaining end-to-end process
Production ML is much more than just ML code
ML development + software development
TFX is an open-source end-to-end ML platform
2. Collecting Data
2.1 Importance of Data
Outline
Importance of data quality
Data pipeline: data collection, ingestion and preparation
Data collection and monitoring
2.1.1 ML: Data is a first-class citizen
Software 1.0
Explicit instruction to the computer
Software 2.0
Specify some goal on the behavior of a program
Find solution using optimization techniques
Good data is key for success
Code in Software = Data in ML
Everything starts with data
Models aren't magic
Meaningful data:
Maximize predictive content
Remove non-informative data
Feature space coverage
Key Points:
Understand users, translate user needs into data problems
Ensure data coverage and high predictive signal
Source, store and monitor quality data responsibly
2.2 Example Application: Suggesting Runs

Key considerations
Data availability and collection
What kind of/how much data is available
How often does the new data come in
Is it annotated?
If not, how hard/expensive is it to get it labeled
Translate user needs into data needs
Data needed
Features needed
Labels needed

Get to know your data
Identify data sources
Check if they are refreshed
Consistency for values, units, & data types
Monitor outliers and errors
Dataset issues
Inconsistent formatting
Is zero "0", "0.0", or an indicator of a missing measurement?
Compounding errors from other ML Models
Monitor data sources for system issues and outages
Measure data effectiveness
Intuition about data value can be misleading
Which features have predictive value and which ones do not?
Feature engineering helps to maximize the predictive signals
Feature selection helps to measure the predictive signal
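To make the feature-selection point above concrete, the sketch below scores hypothetical running-app features by mutual information with the label; the file path and column names are assumptions.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Hypothetical running-app dataset with a binary label column.
df = pd.read_csv("runs.csv")                                   # assumed path
X = df[["pace", "distance", "elevation_gain", "heart_rate"]]   # candidate features
y = df["suggestion_accepted"]                                  # 1 = accepted, 0 = rejected

# Estimate how much predictive signal each feature carries about the label.
scores = mutual_info_classif(X, y, random_state=0)
for name, score in sorted(zip(X.columns, scores), key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")
```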
Translate user needs into data needs
Data needed
Running data from the app
Demographic data
Local geographic data
Features needed
Runner demographics
Time of day
Run completion rate
Pace
Distance run
Elevation gained
Heart rate
Labels needed
Runner acceptance or rejection of app suggestions
User-generated feedback regarding why a suggestion was rejected
User rating of enjoyment of recommended runs
Key points
Understand your user, translate their needs into data problems
What kind of/how much data is available
What are the details and issues of your data
What are your predictive features
What are the labels you are tracking
What are your metrics
2.3 Responsible Data: Security, Privacy & Fairness

Data security and privacy
Data collection and management isn't just about your model
Give users control over what data can be collected
Is there a risk of inadvertently revealing user data?
Compliance with regulations and policies (e.g. GDPR)
User Privacy
Protect personally identifiable information (PII)
Aggregation - replace unique values with summary value
Redaction - remove some data to create a less complete picture
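A minimal pandas sketch of the two techniques above, using a hypothetical user table; the column names are assumptions:

```python
import pandas as pd

# Hypothetical user table; column names are assumptions.
users = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "zip_code": ["94107", "94110", "94107"],
    "age": [34, 41, 29],
})

# Redaction: drop direct identifiers to leave a less complete picture.
redacted = users.drop(columns=["name"])

# Aggregation: replace unique values with a summary value per group.
aggregated = redacted.groupby("zip_code", as_index=False)["age"].mean()
print(aggregated)
```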
How ML systems can fail users
Representational harm
Opportunity denial
Disproportionate product failure
Harm by disadvantage
Commit to fairness
Make sure your models are fair
Group fairness, equal accuracy
Bias in human-labeled and/or collected data
ML models can amplify biases
Reducing bias: design fair labeling systems
Accurate labels are necessary for supervised learning
Labeling can be done by:
Automation (logging or weak supervision)
Humans (aka "raters", often semi-supervised)
Types of human raters
Generalists: crowdsourcing tools
Subject matter experts: specialized tools, e.g. medical image labeling
Your users: Derived labels, e.g. tagged photos
Key points
Ensure rater pool diversity
Investigate rater context and incentive
Evaluate rater tools
Manage cost
Determine freshness requirements
3. Labeling Data
3.1 Data and Concept Changes in Production ML
Detecting problems with deployed models
Data and scope changes
Monitor models and validate data to find problems early
Changing ground truth: label new training data
Easy problems
Ground truth changes slowly (months, years)
Model retraining driven by:
Model improvements, better data
Changes in software and/or systems
Labeling
Curated datasets
Crowd-based
Harder problems
Ground truth changes faster (weeks)
Model retraining driven by:
Declining model performance
Model improvements, better data
Changes in software and/or system
Labeling
Direct feedback
Crowd-based
Really hard problems
Ground truth changes very fast (days, hours, mins)
Model retraining driven by:
Declining model performance
Model improvements, better data
Changes in software and/or system
Labeling
Direct feedback
Weak supervision
Key points
Model performance decays over time
Data and Concept drift
Model retraining helps to improve performance
Data labeling for changing ground truth and scarce labels
3.2 Process feedback and human labeling
Data labeling
Process feedback (Direct Labeling)
Example: actual vs. predicted click-through (see the click-log join sketch after this list)
Human Labeling
Cardiologists labeling MRI images
Semi-supervised labeling (advanced method)
Active learning (advanced method)
Weak supervision (advanced method)
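A minimal sketch of process feedback (direct labeling) for the click-through example: join served predictions with observed clicks so that each match becomes a positive label. The log layout and column names here are assumptions.

```python
import pandas as pd

# Hypothetical serving logs: what the model recommended, and what users clicked.
predictions = pd.DataFrame({
    "request_id": [1, 2, 3],
    "item_id": ["run_a", "run_b", "run_c"],
    "predicted_ctr": [0.30, 0.12, 0.55],
})
clicks = pd.DataFrame({"request_id": [1, 3], "item_id": ["run_a", "run_c"]})

# Join served predictions with observed clicks; a match becomes a positive label.
labeled = predictions.merge(
    clicks.assign(clicked=1), on=["request_id", "item_id"], how="left")
labeled["label"] = labeled["clicked"].fillna(0).astype(int)

# `labeled` can now feed the next retraining run without any human rater.
print(labeled[["request_id", "item_id", "predicted_ctr", "label"]])
```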
Why is labeling important in production ML?
Using data available to the business/organization
Frequent model retraining
Labeling is an ongoing process: new datasets require labels
Direct labeling: continuous creation of training dataset
Process feedback - Advantages
Training dataset continuous creation
Labels evolve quickly
Captures strong label signals
Process feedback - Disadvantages
Hindered by inherent nature of the problem
Failure to capture ground truth
Largely bespoke design
Open-source log analysis tools
Logstash
Free and open source data processing pipeline
Ingests data from a multitude of sources
Transforms it
Sends it to your favorite "stash"
Fluentd
Open source data collector
Unifies data collection and consumption
Cloud log analytics
Google Cloud Logging
AWS ElasticSearch
Azure Monitor
Human labeling
People to examine data and assign labels manually
Unlabeled data is collected
Human "raters" are recruited
Instructions to guide raters are created
Data is divided and assigned to raters
Labels are collected and conflicts resolved (a majority-vote sketch follows this section)
Advantages:
More labels
Pure supervised learning
Disadvantages
Quality consistency: many datasets are difficult for human labeling
Slow
Expensive
Small dataset curation
Why is human labeling a problem?
MRI: high cost specialist labeling
Single rater: limited # examples per day
Recruitment is slow and expensive
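One common way to handle the "conflicts resolved" step above is majority voting with an agreement threshold; below is a minimal sketch using hypothetical example IDs and labels.

```python
from collections import Counter

# Hypothetical rater output: example_id -> labels assigned by different raters.
rater_labels = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "dog"],          # a tie; flag for adjudication
}

def resolve(labels, min_agreement=0.5):
    """Return the majority label, or None if agreement is too low."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > min_agreement else None

resolved = {ex: resolve(labels) for ex, labels in rater_labels.items()}
print(resolved)  # {'img_001': 'cat', 'img_002': 'dog', 'img_003': None}
```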
Key points
Various methods of data labeling
Process feedback
Human labeling
Pros and Cons of both
4. Validating data
4.1 Detecting Data Issues
Data issues
Drift and skew
Drift: changes in data over time, e.g. in data collected once a day ==> model decay
Concept drift: changes in the relationship between features and labels over time ==> performance decay
Skew: Difference between two static version, or different sources, such as training set and serving set
Detecting data issues
Detecting schema skew
Training and serving data do not conform to the same schema
Detecting distribution skew
Dataset shift --> covariate or concept drift
Requires continuous evaluation

Skew detection workflow

4.2 TensorFlow Data Validation
TensorFlow Data Validation (TFDV)
Understand, validate, and monitor ML data at scale
Used to analyze and validate petabytes of data at Google every day
Proven track record in helping TFX users maintain the health of their ML pipelines
TFDV capabilities
Generates data statistics and browser visualization
Infers the data schema
Performs validity checks against schema
Detects training/serving skews
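A minimal TFDV sketch covering the capabilities above: generate statistics, infer a schema, and validate new data against it. The CSV paths are assumptions.

```python
import tensorflow_data_validation as tfdv

# Compute and visualize descriptive statistics over the training data.
train_stats = tfdv.generate_statistics_from_csv(data_location="train.csv")
tfdv.visualize_statistics(train_stats)

# Infer a schema from the training statistics and inspect it.
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema)

# Validate a newer split (e.g. evaluation data) against that schema.
eval_stats = tfdv.generate_statistics_from_csv(data_location="eval.csv")
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```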
TFDV can detect skews
Schema skew
Feature skew
Distribution skew
Skew - TFDV
Supported for categorical features
Expressed in terms of L-infinity distance (Chebyshev Distance)
Set a threshold to receive warnings
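To illustrate the metric, the sketch below computes the L-infinity (Chebyshev) distance between the categorical distributions of a hypothetical feature in training vs. serving data; the category values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical feature observed in training vs. serving data.
train = pd.Series(["walk", "run", "run", "bike", "run"])
serving = pd.Series(["walk", "walk", "run", "bike", "walk"])

# Normalized value counts give each split's categorical distribution.
categories = sorted(set(train) | set(serving))
p = train.value_counts(normalize=True).reindex(categories, fill_value=0)
q = serving.value_counts(normalize=True).reindex(categories, fill_value=0)

# L-infinity (Chebyshev) distance: the largest per-category probability gap.
l_inf = np.max(np.abs(p.values - q.values))
print(f"L-infinity distance: {l_inf:.2f}")  # compare against a chosen threshold
```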
Schema skew
Serving and training data don't conform to same schema: e.g. int != float
Feature skew
Training feature values are different from the serving feature values
Feature values are modified between training and serving time
Transformation applied only in one of the two instances
Distribution skew
Distribution of the serving and training datasets is significantly different
Faulty sampling method during training
Different data sources for training and serving data
Trend, seasonality, changes in data over time
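Putting the pieces together, a hedged sketch of configuring skew detection in TFDV; it assumes the schema and statistics objects from the earlier TFDV sketch, and the feature name is only an example.

```python
import tensorflow_data_validation as tfdv

# Assumes `schema`, `train_stats`, and serving-side statistics already exist,
# e.g. serving_stats = tfdv.generate_statistics_from_csv(data_location="serving.csv")
serving_stats = tfdv.generate_statistics_from_csv(data_location="serving.csv")

# Flag a (hypothetical) categorical feature whose training/serving distributions
# differ by more than an L-infinity distance of 0.01.
feature = tfdv.get_feature(schema, "payment_type")   # feature name is an example
feature.skew_comparator.infinity_norm.threshold = 0.01

skew_anomalies = tfdv.validate_statistics(
    statistics=train_stats,
    schema=schema,
    serving_statistics=serving_stats)
tfdv.display_anomalies(skew_anomalies)
```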
Key points
TFDV: descriptive statistics at scale with the embedded Facets visualizations
It provides insight into:
What are the underlying statistics of your data
How do your training, evaluation, and serving dataset statistics compare
How can you detect and fix data anomalies