As machine learning evolves from experimentation to serving production workloads, so does the need to effectively manage the end-to-end training and production workflow including model management, versioning, and serving. Clemens Mewald offers an overview of TensorFlow Extended (TFX), the end-to-end machine learning platform for TensorFlow that powers products across all of Alphabet. Many TFX components rely on the Beam SDK to define portable data processing workflows. This talk motivates the development of a Spark runner for Beam Python.
2. TensorFlow Extended (TFX) is an end-to-end ML pipeline for TensorFlow.
[Architecture diagram] Components: Data Ingestion; Data Analysis + Validation; Feature Engineering; Trainer; Tuner; Model Evaluation and Validation; Serving; Logging. Shared infrastructure: Pipeline Storage; Shared Utilities for Garbage Collection and Data Access Controls; Shared Configuration Framework and Job Orchestration; Integrated Frontend for Job Management, Monitoring, Debugging, and Data/Model/Evaluation Visualization.
3. TFX powers our most important bets and products...
[Logos of major products and Alphabet bets]
4. [Repeat of the TFX architecture diagram from slide 2.]
6. What is Apache Beam?
- A unified batch and stream distributed processing API
- A set of SDK frontends: Java, Python, Go, Scala, SQL
- A set of runners that can execute Beam jobs on various backends: local, Apache Flink, Apache Spark, Apache Gearpump, Apache Samza, Apache Hadoop, Google Cloud Dataflow, …
7. Building Components out of Libraries
Each TFX component wraps a library:
- Data ingestion → ExampleGen
- TensorFlow Data Validation → StatisticsGen, SchemaGen, ExampleValidator
- TensorFlow Transform → Transform (powered by Beam)
- Estimator model → Trainer
- TensorFlow Model Analysis → Evaluator (powered by Beam)
- Honoring validation outcomes → ModelValidator
- TensorFlow Serving → Pusher, Model Server
8. What makes a Component: a packaged binary or container (e.g., the Model Validator).
9. What makes a Component: well-defined inputs and outputs. E.g., the Model Validator takes the last validated model and a new (candidate) model as inputs and produces a validation outcome.
10. What makes a Component: well-defined configuration. The same Model Validator also takes a config that controls how validation is performed.
11. What makes a Component: context from the Metadata Store. The Model Validator reads and records its context (config, inputs, and validation outcome) in the Metadata Store.
12. What makes a Component: components communicate through the Metadata Store. The Trainer publishes a new (candidate) model; the Model Validator compares it against the last validated model and publishes a validation outcome; the Pusher reads the candidate model and the validation outcome and deploys blessed models to deployment targets: TensorFlow Serving, TensorFlow Lite, TensorFlow.js, TensorFlow Hub.
15. Metadata Store? That’s new
Task-aware pipelines run a fixed sequence: Input Data → Transform → Trainer → Deployment.
Task- and data-aware pipelines add pipeline + metadata storage, so every Transform and Trainer run also records its training data, transformed data, and trained models.
16-18. What’s in the Metadata Store?
- Type definitions of artifacts and their properties (e.g., models, data, evaluation metrics)
- Execution records (runs) of components (e.g., runtime configuration, inputs + outputs)
- Lineage tracking across all executions (e.g., to recurse back to all inputs of a specific artifact)
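Those three kinds of records can be sketched as a toy in-memory store. This is a simplified illustration only; the class and method names below are invented and are not the real ML Metadata (MLMD) API.

```python
# Toy metadata store: artifact types, execution records, lineage tracking.
# Purely illustrative; not the MLMD API.

class MetadataStore:
    def __init__(self):
        self.artifact_types = {}   # type name -> list of property names
        self.artifacts = {}        # artifact id -> (type name, properties)
        self.executions = []       # (component, config, input ids, output ids)

    def register_type(self, name, properties):
        self.artifact_types[name] = properties

    def put_artifact(self, artifact_id, type_name, properties):
        assert type_name in self.artifact_types, "unknown artifact type"
        self.artifacts[artifact_id] = (type_name, properties)

    def record_execution(self, component, config, input_ids, output_ids):
        self.executions.append((component, config, input_ids, output_ids))

    def lineage(self, artifact_id):
        """Recurse back to all inputs that produced a specific artifact."""
        ancestors = set()
        frontier = [artifact_id]
        while frontier:
            current = frontier.pop()
            for _, _, inputs, outputs in self.executions:
                if current in outputs:
                    for i in inputs:
                        if i not in ancestors:
                            ancestors.add(i)
                            frontier.append(i)
        return ancestors

store = MetadataStore()
store.register_type("Data", ["span"])
store.register_type("Model", ["accuracy"])
store.put_artifact("data/1", "Data", {"span": 1})
store.put_artifact("transformed/1", "Data", {"span": 1})
store.put_artifact("model/1", "Model", {"accuracy": 0.93})
store.record_execution("Transform", {}, ["data/1"], ["transformed/1"])
store.record_execution("Trainer", {"steps": 1000}, ["transformed/1"], ["model/1"])

print(store.lineage("model/1"))  # model/1 traces back through transformed/1 to data/1
```

The lineage query walks the execution records backwards, which is exactly the "recurse back to all inputs" use case above.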
28-29. Examples of Metadata-Powered Functionality
Use cases enabled by lineage tracking:
- Compare previous model runs
- Carry over state from previous models
- Re-use previously computed outputs
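One of those use cases, re-using previously computed outputs, can be sketched as caching keyed on a component's name, configuration, and inputs. The helper names here are invented for illustration and are not TFX code.

```python
import hashlib
import json

_cache = {}

def cache_key(component, config, input_ids):
    # Same component + same config + same inputs => same outputs,
    # so a previously recorded result can be reused.
    payload = json.dumps([component, config, sorted(input_ids)], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_component(component, config, input_ids, executor):
    key = cache_key(component, config, input_ids)
    if key in _cache:
        return _cache[key]          # reuse previously computed output
    result = executor(input_ids)    # the expensive work happens only once
    _cache[key] = result
    return result

calls = []
def transform_executor(inputs):
    calls.append(inputs)            # track how often real work runs
    return "transformed/" + "+".join(inputs)

a = run_component("Transform", {"module": "taxi"}, ["data/1"], transform_executor)
b = run_component("Transform", {"module": "taxi"}, ["data/1"], transform_executor)
print(a == b, len(calls))  # True 1 -- the second run hit the cache
```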
30-32. How do we orchestrate TFX?
At its simplest, a pipeline is a sequence of components: Component → Component → Component.
Each component is wrapped with a Driver and a Publisher that talk to the Metadata Store: the Driver resolves inputs and decides whether the component needs to run, and the Publisher records its outputs.
Inside each component, an Executor performs the actual work.
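The driver/executor/publisher split can be sketched as follows (class and method names invented for illustration, not real TFX code): the driver checks the metadata store for a matching prior run, the executor does the processing, and the publisher records the result.

```python
class Component:
    """Illustrative sketch of the TFX component pattern (not real TFX code)."""

    def __init__(self, name, executor):
        self.name = name
        self.executor = executor  # the actual data-processing logic

    def driver(self, store, input_ids):
        # Resolve against the metadata store: has this run already happened?
        return [(c, cfg, i, o) for (c, cfg, i, o) in store["executions"]
                if c == self.name and i == input_ids]

    def publisher(self, store, input_ids, outputs):
        # Record the execution so later runs (and other components) can see it.
        store["executions"].append((self.name, {}, input_ids, outputs))

    def run(self, store, input_ids):
        previous = self.driver(store, input_ids)
        if previous:
            return previous[0][3]           # reuse the recorded outputs
        outputs = self.executor(input_ids)  # executor does the real work
        self.publisher(store, input_ids, outputs)
        return outputs

store = {"executions": []}
trainer = Component("Trainer", lambda inputs: ["model/1"])
out1 = trainer.run(store, ["examples/1"])
out2 = trainer.run(store, ["examples/1"])   # driver finds the prior run
print(out1, out2, len(store["executions"]))
```

Because the driver and publisher only talk to the metadata store, the surrounding orchestrator (Airflow, Kubeflow, or anything else) only needs to decide *when* to call `run`.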
39. … Back to orchestration: a TFX Config defines the pipeline of components (each with its Driver, Executor, and Publisher) and connects them to the Metadata Store.
40. Bring your very own favorite orchestrator: the same TFX Config can be executed by an Airflow runtime, a Kubeflow runtime, or your own runtime.
46. Component: ExampleGen
Configuration:

import os
from tfx.components import CsvExampleGen
from tfx.utils.dsl_utils import csv_input

examples = csv_input(os.path.join(data_root, 'simple'))
example_gen = CsvExampleGen(input_base=examples)

Inputs and Outputs: ExampleGen reads raw data (e.g., CSV) and emits TF Record files split into training and eval sets.
54. Component: SchemaGen
Inputs and Outputs: SchemaGen consumes statistics from StatisticsGen and produces a schema.
● High-level description of the data
○ Expected features
○ Expected value domains
○ Expected constraints
○ and much more!
● Codifies expectations of “good” data
● Initially inferred, then user-curated
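"Initially inferred, then user-curated" can be sketched as: infer a schema (feature names, types, value domains) from a first batch of data, then validate later batches against it. The functions below are a simplified stand-in, not the TensorFlow Data Validation API.

```python
def infer_schema(examples):
    """Infer expected feature names, types, and value domains from data."""
    schema = {}
    for row in examples:
        for name, value in row.items():
            entry = schema.setdefault(
                name, {"type": type(value).__name__, "domain": set()})
            if isinstance(value, str):
                entry["domain"].add(value)  # collect the categorical domain
    return schema

def validate(examples, schema):
    """Return anomalies: unexpected features, wrong types, out-of-domain values."""
    anomalies = []
    for row in examples:
        for name, value in row.items():
            if name not in schema:
                anomalies.append(f"unexpected feature: {name}")
            elif type(value).__name__ != schema[name]["type"]:
                anomalies.append(f"wrong type for {name}")
            elif isinstance(value, str) and value not in schema[name]["domain"]:
                anomalies.append(f"unexpected value for {name}: {value}")
    return anomalies

train = [{"payment_type": "cash", "fare": 7.5},
         {"payment_type": "card", "fare": 12.0}]
schema = infer_schema(train)

serving = [{"payment_type": "crypto", "fare": 9.0}]
print(validate(serving, schema))  # ['unexpected value for payment_type: crypto']
```

In practice the inferred schema is only a starting point; a human then curates it (e.g., widens domains that the first batch under-sampled).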
66. Component: Transform
Inputs and Outputs: Transform consumes data from ExampleGen, a schema from SchemaGen, and user code. It produces:
- a transform graph, applied at training time and embedded in the serving graph
- transformed data (optional, for performance optimization), consumed by the Trainer
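The reason the transform graph is embedded in the serving graph is to avoid training/serving skew: the exact same preprocessing function, with statistics analyzed once over the training data, runs in both paths. A minimal sketch in plain Python (not tf.Transform):

```python
def make_transform(train_values):
    """Analyze once over training data; apply the same transform everywhere."""
    mean = sum(train_values) / len(train_values)
    var = sum((v - mean) ** 2 for v in train_values) / len(train_values)
    std = var ** 0.5

    def scale_to_z_score(v):
        # The analyzed mean/std are baked in, like constants in a graph.
        return (v - mean) / std

    return scale_to_z_score

train_fares = [4.0, 6.0, 8.0, 10.0]
transform = make_transform(train_fares)

# Training path: materialize transformed data once, for performance.
transformed = [transform(v) for v in train_fares]

# Serving path: the very same function is applied to each request,
# so training and serving can never disagree on preprocessing.
print(transform(7.0))  # 0.0 -- 7.0 is the training mean
```

Materializing `transformed` corresponds to the "optional, for performance optimization" output above; the closure itself corresponds to the transform graph.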
67. Component: Transform
Configuration:

transform = Transform(
    input_data=example_gen.outputs.examples,
    schema=infer_schema.outputs.output,
    module_file=taxi_module_file)

Code (from the module file):

for key in _DENSE_FLOAT_FEATURE_KEYS:
    outputs[_transformed_name(key)] = transform.scale_to_z_score(
        _fill_in_missing(inputs[key]))
# ...
outputs[_transformed_name(_LABEL_KEY)] = tf.where(
    tf.is_nan(taxi_fare),
    tf.cast(tf.zeros_like(taxi_fare), tf.int64),
    # Test if the tip was > 20% of the fare.
    tf.cast(
        tf.greater(tips, tf.multiply(taxi_fare, tf.constant(0.2))), tf.int64))
# ...
78. Component: ModelValidator
Configuration:

model_validator = ModelValidator(
    examples=example_gen.outputs.output,
    model=trainer.outputs.output,
    eval_spec=taxi_mv_spec)

Configuration options:
- Validate using current eval data
- "Next-day eval": validate using unseen data

Inputs and Outputs: the Model Validator consumes data from ExampleGen and two models (the new candidate from the Trainer and the last validated model) and produces a validation outcome.
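The validation gate, comparing the new candidate against the last validated model and only pushing on improvement, can be sketched as follows (function and field names invented for illustration, not the ModelValidator implementation):

```python
def validate_model(candidate_metrics, blessed_metrics, threshold=0.0):
    """Bless the candidate only if it beats the last validated model."""
    if blessed_metrics is None:
        return True  # first model ever: nothing to compare against
    return candidate_metrics["accuracy"] >= blessed_metrics["accuracy"] + threshold

def maybe_push(candidate, blessed):
    blessed_metrics = blessed["metrics"] if blessed else None
    if validate_model(candidate["metrics"], blessed_metrics):
        return candidate   # the Pusher would deploy the newly blessed model
    return blessed         # keep serving the previous model

blessed = {"name": "model/1", "metrics": {"accuracy": 0.91}}
candidate = {"name": "model/2", "metrics": {"accuracy": 0.89}}
print(maybe_push(candidate, blessed)["name"])  # model/1 -- regression blocked
```

Running the same check against next-day (unseen) data rather than the current eval split is the "next-day eval" option above; only the data fed into the metrics changes.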
84. Beam Vision
Provide a comprehensive portability framework
for data processing pipelines, one that allows you
to write your pipeline once in your language of
choice and run it with minimal effort on the
execution engine of choice.
85. Apache Beam: Sum Per Key across SDKs
Python: input | Sum.PerKey()
Java: input.apply(Sum.integersPerKey())
Go: stats.Sum(s, input)
SQL: SELECT key, SUM(value) FROM input GROUP BY key
Runners: Cloud Dataflow, Apache Spark, Apache Flink, Apache Apex, Gearpump, Apache Samza, Apache Nemo (incubating), IBM Streams
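All four snippets express the same logical operation; a runner then decides how to execute it. Its semantics in pure Python are (illustrative only, not the Beam implementation):

```python
from collections import defaultdict

def sum_per_key(pairs):
    """What Sum.PerKey computes: group (key, value) pairs and sum each group."""
    totals = defaultdict(int)
    for key, value in pairs:   # a runner shards and parallelizes this per key
        totals[key] += value
    return dict(totals)

print(sum_per_key([("a", 1), ("b", 2), ("a", 3)]))  # {'a': 4, 'b': 2}
```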
87. How does Beam (Java) map to Spark?
Beam Java: Already runs on Spark!
88. How does Beam (Python) map to Spark?
Beam Portability (Python, …)
• Active work in progress!
• Several PRs are already in!
– Supports: Impulse, ParDo, GroupByKey,
Combine, Flatten, PAssert, Metrics, Side
inputs, ...
– Missing: State/Timers, SDF, ...
89. How does Beam map to Spark?
Call to action!
• Help with code, reviews, testing.
• Tracking JIRA(s)
– BEAM-2891
– BEAM-2590
90. Get started with TensorFlow Extended
(TFX)
An End-to-End ML Platform
github.com/tensorflow/tfx
tensorflow.org/tfx