The document summarizes a meetup on data streaming and machine learning with Google Cloud Platform. The meetup consisted of two presentations:
1. The first presentation discussed using Apache Beam (Dataflow) on Google Cloud Platform to parallelize machine learning training for improved performance. It showed how Dataflow was used to reduce training time from 12 hours to under 30 minutes.
2. The second presentation demonstrated building a streaming pipeline for sentiment analysis on Twitter data using Dataflow. It covered streaming patterns, batch vs streaming processing, and a demo that ingested tweets from PubSub and analyzed them using Cloud NLP API and BigQuery.
6. WHAT’S OUR AIM? Predict how well an item will sell based on past performance of
similar items.
7. HOW DO WE PROCEED? Machine Learning
1. Train on existing data to understand which attributes (features) have what impact on item performance
2. Use the trained model to predict how unknown items would perform based on their attributes (features)
8. MACHINE LEARNING
Training Process
● Generate features for every item (e.g. color, brand, pattern)
● Shuffle data
● Split data into training entries (training set + test set)
● Generate hyperparameters from a predefined search space
● Combine hyperparameters and training entries
● Train a model for each combination
○ Evaluate the out-of-sample error
● Combine the errors for the same hyperparameters
● Rank the errors per hyperparameter set to select the best
● Train on all data and keep the trained model for prediction
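The training process above can be sketched in plain Python. This is a minimal illustration of the fan-out/combine/rank pattern, not the presenters' actual code: `train_and_score`, the search space, and the fold values are hypothetical stand-ins, and a thread pool stands in for Dataflow workers.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from statistics import mean

# Hypothetical search space and train/test splits ("training entries").
SEARCH_SPACE = [{"depth": d, "lr": lr} for d in (3, 5, 8) for lr in (0.1, 0.01)]
FOLDS = [0, 1, 2]

def train_and_score(task):
    """Stand-in for one training run; returns an out-of-sample error."""
    params, fold = task
    return params["lr"] * 10 + params["depth"] * 0.01 + fold * 0.001

def best_hyperparameters():
    # One task per (hyperparameters, training entry) combination.
    tasks = list(product(SEARCH_SPACE, FOLDS))
    with ThreadPoolExecutor(max_workers=8) as pool:  # run trainings concurrently
        errors = list(pool.map(train_and_score, tasks))
    # Combine the errors for the same hyperparameters.
    by_params = {}
    for (params, _fold), err in zip(tasks, errors):
        by_params.setdefault(tuple(sorted(params.items())), []).append(err)
    # Rank by mean error and select the best hyperparameters.
    ranked = sorted(by_params.items(), key=lambda kv: mean(kv[1]))
    return dict(ranked[0][0])
```

In the real setup each combination becomes one element of a Beam PCollection, so the runner, not a local pool, handles the fan-out.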
10. WHAT’S THE PROBLEM?
Multiple combinations to train
Ex.: For 700 combinations of hyperparameters and training entries,
training took around 12 hours to complete.
11. HOW TO SOLVE IT?
- Parallelize by running all training runs concurrently
- Needs a lot of processing power
- Needs a lot of memory
12. APACHE BEAM
- Implementation-agnostic & open source
- Java & Python SDKs
- Lets you build pipelines
https://beam.apache.org/documentation/runners/capability-matrix/
13. PIPELINE BASICS 1. Create streaming or batch pipeline
2. Read data from various sources
○ Files
○ Databases
○ Cloud solutions
○ Code-generated data
○ ...
3. Apply transforms to process data
4. Write or output final pipeline data
5. Run on a specific runner
○ Direct (locally on your machine)
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ Apache Gearpump
https://beam.apache.org/documentation/pipelines/create-your-pipeline/
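The five steps above can be mimicked with a toy, in-memory pipeline. This is not the Beam API, just a plain-Python sketch of the create / read / transform / write / run shape; `ToyPipeline` and its methods are invented for illustration.

```python
# Toy illustration of the pipeline basics: create, read, transform, write, run.
class ToyPipeline:
    def __init__(self):
        self.steps = []

    def read(self, source):            # step 2: read data from a source
        self.source = list(source)
        return self

    def apply(self, transform):        # step 3: apply a transform
        self.steps.append(transform)
        return self

    def write(self, sink):             # step 4: output final pipeline data
        self.sink = sink
        return self

    def run(self):                     # step 5: run on a "runner" (here: locally)
        data = self.source
        for step in self.steps:
            data = [step(x) for x in data]
        self.sink.extend(data)

out = []
(ToyPipeline()
    .read(["beam", "dataflow", "flink"])
    .apply(str.upper)
    .write(out)
    .run())
# out == ["BEAM", "DATAFLOW", "FLINK"]
```

In Beam itself the same shape appears as `Pipeline`, IO transforms, `PTransform`s, and a runner chosen at execution time.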
15. RESULTS Testing 700 combinations of hyperparameters and training entries
The previous sequential execution took 12 hours to complete
The whole Dataflow job took a bit less than 28 minutes
16. Topic #2
Building a Dataflow streaming pipeline for sentiment analysis on Twitter
Data Science | Design | Technology
(Arsho)
18. Confidential & Proprietary
Agenda
● Trade-offs and challenges in Big Data
● Apache Beam SDK
● Cloud Dataflow Service
● Batch or Stream
● Demo/Code
● Getting Started Resources
20. Introduction: Google Managed Services Toolbox for Big Data
Fully Managed, No-Ops Services
● BigQuery: ingest data at 100,000 rows per second
● Dataflow: stream & batch processing, unified and simplified
● Pub/Sub: scalable, flexible, and globally available messaging
21. Introduction: Google Cloud Dataflow
Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines.
Cloud Dataflow is also a fully managed service for executing optimized parallelized data processing pipelines.
23. Cloud Dataflow SDK: Dataflow Benefits
❯ Unified programming model for both batch & stream processing
• Independent from the execution back-end, aka the “runner”
❯ Google driven & open sourced
• Java 7
• Python 2 (streaming is in Alpha)
24. Cloud Dataflow SDK: Logical Model
Pipeline {
  Who   => Inputs
  What  => Transforms            <- Aggregations, Filters, Joins, ...
  Where => Windows
  When  => Watermarks + Triggers <- Completeness
  To    => Outputs               <- Exactly-once guarantee (modulo completeness thresholds)
}
25. Cloud Dataflow SDK: Pipeline
• A Directed Acyclic Graph of data processing transformations
• Can be submitted to the Dataflow Service for optimization and execution, or executed on an alternate runner, e.g. Spark
• May include multiple inputs and multiple outputs
• May encompass many logical MapReduce operations
• PCollections flow through the pipeline
26. Cloud Dataflow SDK: Inputs & Outputs
❯ Read from standard Google Cloud Platform data sources
• GCS, Pub/Sub, BigQuery, Datastore
❯ Write your own custom source by teaching Dataflow how to read it in parallel
• Currently for bounded sources only
❯ Write to GCS, BigQuery, Pub/Sub
• More coming…
❯ Can use a combination of text, JSON, XML, Avro formatted data
27. Cloud Dataflow SDK: ParDo (“Parallel Do”)
❯ Processes each element of a PCollection independently using a user-provided DoFn
❯ Elements are processed in arbitrary ‘bundles’, e.g. “shards”
• startBundle(), processElement() - N times, finishBundle()
❯ Corresponds to both the Map and Reduce phases in Hadoop, i.e. ParDo -> GBK -> ParDo
Example (KeyBySessionId): {Seahawks, NFC, Champions, Seattle, ...}
-> {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
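The ParDo semantics can be illustrated in plain Python (this is not the Beam API): a user-provided DoFn is applied to each element independently, elements arrive in arbitrary bundles, and a DoFn may emit zero or more outputs per element. `par_do` and `key_by_initial` are invented names for the sketch.

```python
# Plain-Python sketch of ParDo semantics.
def par_do(elements, do_fn, bundle_size=2):
    output = []
    for i in range(0, len(elements), bundle_size):  # arbitrary 'bundles' / "shards"
        bundle = elements[i:i + bundle_size]
        # startBundle() would run here
        for element in bundle:                      # processElement() - N times
            output.extend(do_fn(element))           # a DoFn may emit 0..N outputs
        # finishBundle() would run here
    return output

def key_by_initial(word):
    """DoFn for the keying example on this slide: emit a (key, value) pair."""
    yield (word[0], word)

pairs = par_do(["Seahawks", "NFC", "Champions", "Seattle"], key_by_initial)
# pairs == [("S", "Seahawks"), ("N", "NFC"), ("C", "Champions"), ("S", "Seattle")]
```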
28. Cloud Dataflow SDK: GroupByKey
• Takes a PCollection of key-value pairs and gathers up all values with the same key
• Corresponds to the shuffle phase in Hadoop
{KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
-> {KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champions, …}>}
Wait a minute… how do you do a GroupByKey on an unbounded PCollection?
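For a bounded collection, GroupByKey's semantics are simple to show in plain Python (a sketch of the behavior, not the Beam API):

```python
# Plain-Python sketch of GroupByKey: gather all values sharing a key
# (the shuffle phase, in Hadoop terms).
def group_by_key(pairs):
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped

grouped = group_by_key([("S", "Seahawks"), ("C", "Champions"),
                        ("S", "Seattle"), ("N", "NFC")])
# grouped == {"S": ["Seahawks", "Seattle"], "C": ["Champions"], "N": ["NFC"]}
```

On an unbounded PCollection this loop could never finish, which is exactly why windows are needed next.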
29. Cloud Dataflow SDK: Windows
❯ Logically divide up or group the elements of a PCollection into finite windows
• Fixed Windows: hourly, daily, …
• Sliding Windows
• Sessions
❯ Required for GroupByKey-based transforms on an unbounded PCollection, but can also be used for bounded PCollections
❯ Window.into() can be called at any point in the pipeline and will be applied when needed
❯ Can be tied to arrival time or custom event time
❯ Watermarks + Triggers enable robust completeness
(Figure: session windows, e.g. Nighttime / Mid-Day / Nighttime)
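Fixed windows are the simplest case and can be sketched in plain Python: each element carries an event timestamp and is assigned to the fixed-size window containing it. `window_into` is an invented name for this illustration, not Beam's `Window.into()`.

```python
# Plain-Python sketch of fixed (hourly) event-time windows.
WINDOW_SECONDS = 3600

def window_into(timestamped, window_size=WINDOW_SECONDS):
    windows = {}
    for timestamp, value in timestamped:
        # Snap each timestamp down to the start of its fixed window.
        window_start = timestamp - (timestamp % window_size)
        windows.setdefault(window_start, []).append(value)
    return windows

# Events at 09:15, 09:45 and 13:05 (as seconds since midnight):
events = [(9 * 3600 + 900, "a"), (9 * 3600 + 2700, "b"), (13 * 3600 + 300, "c")]
windows = window_into(events)
# windows == {32400: ["a", "b"], 46800: ["c"]}
```

With windows in place, GroupByKey-style aggregations run per (key, window) instead of over the whole unbounded stream.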
30. Cloud Dataflow SDK: Composite PTransforms
❯ Define new PTransforms by building up subgraphs of existing transforms
• e.g. Count = Pair With Ones -> GroupByKey -> Sum Values
❯ Some utilities are included in the SDK
• Count, RemoveDuplicates, Join, Min, Max, Sum, ...
❯ You can define your own:
• DoSomething, DoSomethingElse, etc.
❯ Why bother?
• Code reuse
• Better monitoring experience
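The Count composite's subgraph (Pair With Ones -> GroupByKey -> Sum Values) can be traced in plain Python; this is a sketch of the composition idea, not Beam's `Count` implementation:

```python
# Plain-Python sketch of the Count composite, built from the same subgraph.
def count(elements):
    paired = [(e, 1) for e in elements]              # Pair With Ones
    grouped = {}                                     # GroupByKey
    for key, one in paired:
        grouped.setdefault(key, []).append(one)
    return {key: sum(ones) for key, ones in grouped.items()}  # Sum Values

counts = count(["beam", "spark", "beam", "beam"])
# counts == {"beam": 3, "spark": 1}
```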
31. Cloud Dataflow SDK: Cloud Dataflow Runners
Run the same code in multiple modes using different runners
❯ Direct Runner
• For local, in-memory execution
• Great for development and unit tests
❯ Cloud Dataflow Service Runner
• Runs on the fully managed Dataflow Service
• Your code runs distributed across GCE instances
❯ Community sourced
• Spark runner @ github.com/cloudera/spark-dataflow - Thanks Josh!
• Flink runner coming soon from dataArtisans
32. Cloud Dataflow Service: Life of a Pipeline
(Diagram: User Code & SDK is deployed and scheduled onto the GCP Managed Service, where a Work Manager and Job Manager run it; progress & logs feed the Monitoring UI)
33. Cloud Dataflow Service: Benefits
• Exactly-once processing*
• Graph optimization (ref. FlumeJava)
• Worker lifecycle management
• Worker resource scaling
• RESTful management API and CLI
• Real-time job monitoring, Cloud Debugger & Cloud Logging integration
• Project-based security with auto wipeout
* no enforcement on external service idempotency; dependent upon correctness thresholds
39. Cloud Dataflow Service: Optimizing Your Time
Typical data processing: programming, plus resource provisioning, performance tuning, monitoring, reliability, deployment & configuration, handling growing scale, and utilization improvements.
Data processing with Cloud Dataflow: programming, leaving more time to dig into your data.
41. Batch / Streaming: Considerations
● Boundedness
○ Bounded data: a finite data set, fixed in schema, complete regardless of time, typically at rest in a common durable store
○ Unbounded data: infinite, with a potentially changing schema, never complete, typically not at rest and stored in multiple temporary yet durable stores
● Time to answer
○ Batch processing risks increased cost (under-utilized resources), increased time to answer, and decreased correctness (late-arriving events)
42. Batch / Streaming: Batch failure mode #1 (time to answer)
There are situations where batch processing of growing datasets breaks down. The first is latency-sensitive processing: you can't use an hourly or daily batch job to do low-latency fraud, abuse, or anomaly detection.
43. Batch / Streaming: Batch failure mode #2 (sessions)
(Figure: MapReduce over per-user activity (Jose, Lisa, Ingo, Asha, Cheryl, Ari) spanning the Tuesday/Wednesday batch boundary)
The second is sessions: batch processing of individual chunks doesn't account for sessions across batch boundaries. This is a real problem if you cannot afford to miss or duplicate important sessions, or generally need to do any cross-chunk analysis. It also gets worse as you decrease the chunk size.
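The boundary problem can be made concrete in plain Python: the same activity produces one session when computed over the unified stream, but two when computed per daily batch. The 30-minute gap and the timestamps are hypothetical values for the sketch.

```python
# Sketch of sessions (30-minute gap) split by a daily batch boundary.
GAP = 30 * 60
DAY = 24 * 3600

def sessions(timestamps, gap=GAP):
    """Group sorted timestamps into sessions separated by more than `gap` seconds."""
    result = []
    for t in sorted(timestamps):
        if result and t - result[-1][-1] <= gap:
            result[-1].append(t)   # continues the current session
        else:
            result.append([t])     # starts a new session
    return result

# One user active at 23:50 on day 2 and 00:10 on day 3 (one real session):
activity = [DAY * 2 - 600, DAY * 2 + 600]

unified = sessions(activity)                                # whole stream
batched = sessions([t for t in activity if t < DAY * 2]) \
        + sessions([t for t in activity if t >= DAY * 2])   # per-day batches
# len(unified) == 1, but len(batched) == 2: the session was split in half.
```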
44. Batch / Streaming: Streaming Patterns - Element-wise transformations
(Figure: elements flowing along a processing-time axis, 8:00-14:00)
A streaming pipeline naturally handles unbounded, infinite collections of data. Element-wise transformations like filtering can simply be applied as elements flow past.
45. Batch / Streaming: Streaming Patterns - Aggregating Time-Based Windows
(Figure: elements grouped into fixed processing-time windows, 8:00-14:00)
However, for aggregations that require combining multiple elements together, we need to divide the infinite stream of elements into finite-sized chunks that can be processed independently. The simplest way to do this is just to take whatever elements we see in a fixed time period. But elements often get delayed, so this might mean we're processing a batch of events where most occurred between 1 and 2 pm, while a few stragglers from 9 am are still showing up.
46. Batch / Streaming: Streaming Patterns - Event-Time Based Windows
(Figure: input elements on a processing-time axis vs. output windows on an event-time axis, 10:00-15:00)
48. Demo/Code: Demo Architecture Overview
a. A Python script on GKE listens for #cloud status updates and pushes them to Pub/Sub
b. Dataflow:
i. Pulls from Pub/Sub
ii. Sends the text of matching mentions to the GCP NLP API
iii. Loads the output into BigQuery
c. Data Studio connects to the BigQuery data source
d. Data Studio generates the visualization
(This demo is not a PROD-ready implementation)
(Diagram: GKE -> Cloud Pub/Sub -> Cloud Dataflow (+ NLP API) -> BigQuery -> Data Studio)
49. Demo/Code: Demo pipeline (Python SDK)
(Diagram: Pipeline IO (Text) reads from Pub/Sub -> NLP analysis PTransform -> Pipeline IO (Text) writes to BigQuery with the specified schema)
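The per-element logic of the demo pipeline can be sketched in plain Python. The real pipeline uses the Beam Python SDK with Pub/Sub and BigQuery IO; here `score_sentiment` is a hypothetical stand-in for the Cloud NLP API call, and `tweet_to_row` plays the role of the NLP-analysis PTransform.

```python
import json

def score_sentiment(text):
    """Hypothetical stand-in: the demo calls the GCP NLP API here (score in [-1, 1])."""
    return 0.8 if "love" in text.lower() else -0.2

def tweet_to_row(message):
    """PTransform-style step: Pub/Sub message (JSON) -> BigQuery row dict."""
    tweet = json.loads(message)
    return {"text": tweet["text"], "sentiment": score_sentiment(tweet["text"])}

row = tweet_to_row('{"text": "I love #cloud"}')
# row == {"text": "I love #cloud", "sentiment": 0.8}
```

In the real pipeline this function body would live inside a DoFn between the Pub/Sub read and the BigQuery write.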
54. ● Last meetup for 2017… Next meetup in January 2018
● 1000 members and counting...
● 2018: More co-presentations. Your meetup, your topics
● Special thanks to speakers, hosts, sponsors, members
55. Merci / Thank You
@jdalabsmtl
(Check for next DSDT meetup at https://www.meetup.com/DSDTMTL)