Data Science | Design | Technology
(November 21, 2017)
https://www.meetup.com/DSDTMTL
Agenda
6:00 - 6:15: Welcome
6:15 - 6:50: Machine Learning through Dataflow
6:50 - 7:30: Building a Streaming Pipeline
7:30 - 8:00: Q&A + Networking
Data Streaming and Machine Learning
with Google Cloud Platform
Maxime Legault-Venne
Software Architect, JDA Labs
Arsho Toubi
Customer Engineer, Google Cloud
Topic #1
Machine Learning
through Google
Cloud Platform’s
Dataflow
(Max)
Machine Learning Through
Google Cloud Platform’s Dataflow
WHAT’S OUR AIM? Predict how well an item will sell based on past performance of
similar items.
HOW DO WE PROCEED? Machine Learning
1. Train on existing data to learn how each
attribute (feature) affects the
performance of items
2. Use the trained model to predict how unknown
items would perform based on their attributes
(features)
MACHINE LEARNING
Training Process
● Generate features for every item (ex. color, brand, pattern)
● Shuffle data
● Split data into training entries (each a training set + test set)
● Generate hyperparameters from a predefined search space
● Combine hyperparameters and training entries
● Train a model for each combination
○ Evaluate error out of sample
● Combine the errors for the same hyperparameters
● Rank the hyperparameters by error to select the best
● Train on all data and keep the trained model for prediction
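The training process above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the speakers' code: the toy `train` function, the `shrink` hyperparameter, and the sample numbers are hypothetical stand-ins for a real model and dataset.

```python
import itertools
import statistics


def train(train_set, hyper):
    """Toy 'model' (hypothetical): predict a scaled mean of the training targets."""
    return hyper["shrink"] * statistics.mean(train_set)


def out_of_sample_error(prediction, test_set):
    """Mean absolute error of a constant prediction on the held-out test set."""
    return statistics.mean(abs(y - prediction) for y in test_set)


def rank_hyperparameters(entries, search_space):
    errors = {}
    # Combine every hyperparameter set with every training entry.
    for hyper, (train_set, test_set) in itertools.product(search_space, entries):
        key = tuple(sorted(hyper.items()))
        model = train(train_set, hyper)
        errors.setdefault(key, []).append(out_of_sample_error(model, test_set))
    # Combine the errors for the same hyperparameters, then rank (lowest first).
    mean_errors = {key: statistics.mean(errs) for key, errs in errors.items()}
    return sorted(mean_errors.items(), key=lambda item: item[1])


# Each training entry is a (training set, test set) pair; values are made up.
entries = [([10, 12, 11], [11, 12]), ([9, 10, 11], [10, 10])]
search_space = [{"shrink": s} for s in (0.5, 1.0, 1.5)]

ranking = rank_hyperparameters(entries, search_space)
best_hyper = dict(ranking[0][0])
print(best_hyper, ranking[0][1])
```

Every (hyperparameters, training entry) pair is independent of the others, which is what makes the inner loop a candidate for parallel execution.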
TRAINING ENTRIES
Splitting Data
Training Set
Test Set
3 months
Training entries to get model errors
Final training data set
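One way to build such training entries is sketched below. The slide only states a 3-month test set; the assumption here (not confirmed by the deck) is that each entry holds out a different 3-month window and trains on the months before it. The function name and month labels are hypothetical.

```python
def make_training_entries(months, test_len=3, n_entries=2):
    """Return (training_set, test_set) pairs whose test windows slide back in time."""
    entries = []
    for i in range(n_entries):
        test_end = len(months) - i * test_len
        test_start = test_end - test_len
        if test_start <= 0:
            break  # not enough history left to train on
        entries.append((months[:test_start], months[test_start:test_end]))
    return entries


months = [f"2017-{m:02d}" for m in range(1, 13)]
entries = make_training_entries(months)
for train_set, test_set in entries:
    print(len(train_set), "training months, test on", test_set)
```

The final model would then be trained on all of `months`, matching the "final training data set" in the diagram.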
WHAT’S THE PROBLEM?
Multiple combinations to train
Ex.: For 700 combinations of hyperparameters and training entries,
training took around 12 hours to complete.
HOW TO SOLVE IT? - Parallelize by running all trainings concurrently
- Need a lot of processing power
- Need a lot of memory
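On a single machine, the idea of running all trainings concurrently can be sketched with the standard library. This is only an illustration of the structure: `train_and_score` is a hypothetical placeholder, and a thread pool will not speed up CPU-bound training in Python, which is part of why a distributed runner is attractive.

```python
from concurrent.futures import ThreadPoolExecutor


def train_and_score(combination):
    """Hypothetical placeholder for one training run; returns a fake error."""
    hyper, entry_id = combination
    return (hyper, entry_id, abs(hyper - 0.5) + 0.01 * entry_id)


# One work item per (hyperparameter, training entry) combination.
combinations = [(h, e) for h in (0.1, 0.5, 0.9) for e in range(3)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() keeps results in the same order as the inputs.
    results = list(pool.map(train_and_score, combinations))

print(len(results), "trainings finished")
```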
APACHE BEAM - Implementation-agnostic & open source
- Java & Python SDKs
- Lets you build pipelines
https://beam.apache.org/documentation/runners/capability-matrix/
PIPELINE BASICS 1. Create streaming or batch pipeline
2. Read data from various sources
○ Files
○ Databases
○ Cloud solutions
○ Code generated data
○ ...
3. Apply transforms to process data
4. Write or output final pipeline data
5. Run on a specific runner
○ Direct (locally on your machine)
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ Apache Gearpump
https://beam.apache.org/documentation/pipelines/create-your-pipeline/
EXECUTING IN DATAFLOW
RESULTS Testing 700 combinations of hyperparameters and training entries
Previous execution took 12 hours to process sequentially
Whole dataflow job took a bit less than 28 minutes
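As a quick check of what these numbers imply, the arithmetic can be written out. The 28-minute figure is an upper bound ("a bit less than"), so the real speedup is slightly higher than computed here.

```python
sequential_minutes = 12 * 60   # previous sequential execution
dataflow_minutes = 28          # upper bound for the Dataflow job
combinations = 700

speedup = sequential_minutes / dataflow_minutes
minutes_per_training = sequential_minutes / combinations

print(f"speedup: about {speedup:.1f}x")
print(f"average sequential cost: about {minutes_per_training:.2f} min per combination")
```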
Topic #2
Building a Dataflow
streaming pipeline
for sentiment
analysis on Twitter
(Arsho)
Streaming pipelines with
Google Cloud Dataflow
Confidential & Proprietary
Trade-offs and challenges in Big Data
Apache Beam SDK
Cloud Dataflow Service
Batch or Stream
Demo/Code
Getting Started Resources
Agenda
Accuracy Speed
Cost Control Complexity
Time to Answer
Introduction
The Tense Quadrachotomy of Big Data
BigQuery
Ingest data at 100,000
rows per second
Dataflow
Stream & batch
processing, unified and
simplified
Pub/Sub
Scalable, flexible, and
globally available messaging
Fully Managed, No-Ops Services
Introduction
Google Managed Services Toolbox for Big Data
Cloud Dataflow is a collection
of SDKs for building batch or
streaming parallelized data
processing pipelines.
Cloud Dataflow is a fully managed
service for executing optimized
parallelized data processing
pipelines.
Introduction
Google Cloud Dataflow
Cloud Dataflow SDK
Cloud Dataflow SDK
Dataflow Benefits
❯ Unified programming model for both batch & stream processing
• Independent from the execution back-end aka “runner”
❯ Google driven & open sourced
• Java 7
• Python 2 (streaming is in Alpha)
<- At once guarantee (modulo completeness
thresholds)
Cloud Dataflow SDK
<- Aggregations, Filters, Joins, ...
<- Completeness
Pipeline{
Who => Inputs
What => Transforms
Where => Windows
When => Watermarks + Triggers
To => Outputs
}
Cloud Dataflow SDK - Logical Model
• A Directed Acyclic Graph of data
processing transformations
• Can be submitted to the Dataflow
Service for optimization and execution
or executed on an alternate runner e.g.
Spark
• May include multiple inputs and multiple
outputs
• May encompass many logical
MapReduce operations
• PCollections flow through the pipeline
Pipeline
Cloud Dataflow SDK
❯ Read from standard Google Cloud
Platform data sources
• GCS, Pub/Sub, BigQuery, Datastore
❯ Write your own custom source by
teaching Dataflow how to read it in
parallel
• Currently for bounded sources only
❯ Write to GCS, BigQuery, Pub/Sub
• More coming…
❯ Can use a combination of text, JSON,
XML, Avro formatted data
Cloud Dataflow SDK
Inputs & Outputs
Your
Source/Sink
Here
{Seahawks, NFC, Champions, Seattle, ...}
{KV<S, Seahawks>, KV<C, Champions>,
KV<S, Seattle>, KV<N, NFC>, ...}
❯ Processes each element of a PCollection
independently using a user-provided
DoFn
❯ Elements are processed in arbitrary
‘bundles’ e.g. “shards”
• startBundle(), processElement() - N
times, finishBundle()
❯ Corresponds to both the Map and Reduce
phases in Hadoop i.e.
ParDo->GBK->ParDo
KeyBySessionId
ParDo (“Parallel Do”)
Cloud Dataflow SDK
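The bundle lifecycle on this slide can be modelled in plain Python. This is a conceptual sketch, not SDK code: `par_do`, the `KeyByFirstLetter` class, and the fixed bundle size are assumptions made for illustration (a real runner chooses bundle boundaries itself and processes bundles in parallel).

```python
class KeyByFirstLetter:
    """DoFn-like object: emits (first letter, word) for every input word."""

    def start_bundle(self):
        self.seen = 0  # per-bundle state is reset here

    def process(self, element):
        self.seen += 1
        yield (element[0], element)

    def finish_bundle(self):
        pass  # a real DoFn might flush buffered output here


def par_do(elements, do_fn, bundle_size=2):
    """Apply do_fn to every element independently, one bundle at a time."""
    output = []
    for start in range(0, len(elements), bundle_size):
        bundle = elements[start:start + bundle_size]
        do_fn.start_bundle()
        for element in bundle:
            output.extend(do_fn.process(element))
        do_fn.finish_bundle()
    return output


words = ["Seahawks", "NFC", "Champions", "Seattle"]
keyed = par_do(words, KeyByFirstLetter())
print(keyed)
```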
Wait a minute…
How do you do a GroupByKey on an unbounded PCollection?
{KV<S, Seahawks>, KV<C, Champions>,
KV<S, Seattle>, KV<N, NFC>, ...}
GroupByKey
• Takes a PCollection of key-value
pairs and gathers up all values with
the same key
• Corresponds to the shuffle phase in
Hadoop
Cloud Dataflow SDK
GroupByKey
{KV<S, {Seahawks, Seattle, …}>,
KV<N, {NFC, …}>,
KV<C, {Champions, …}>}
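For a bounded collection, the grouping shown here can be modelled with a dictionary. This is a plain-Python sketch of the semantics only; `group_by_key` is a hypothetical helper, not the SDK transform.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Gather all values that share a key, like the shuffle phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


pairs = [("S", "Seahawks"), ("C", "Champions"), ("S", "Seattle"), ("N", "NFC")]
grouped = group_by_key(pairs)
print(grouped)
```

The function can only return once it has seen every pair, which is exactly why an unbounded collection needs windowing before it can be grouped.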
❯ Logically divides up or groups the elements of a
PCollection into finite windows
• Fixed Windows: hourly, daily, …
• Sliding Windows
• Sessions
❯ Required for GroupByKey-based transforms on an
unbounded PCollection, but can also be used for
bounded PCollections
❯ Window.into() can be called at any point in the
pipeline and will be applied when needed
❯ Can be tied to arrival time or custom event time
❯ Watermarks + Triggers enable robust
completeness
Windows
Cloud Dataflow SDK
Nighttime Mid-Day Nighttime
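Window assignment can be illustrated with a few lines of plain Python. This is a conceptual sketch, not the SDK's windowing API: the helper names are hypothetical, timestamps are plain numbers (for example seconds), and windows are half-open intervals `[start, end)`.

```python
from collections import defaultdict


def fixed_window(timestamp, size):
    """Return the single fixed window that contains the timestamp."""
    start = timestamp - timestamp % size
    return (start, start + size)


def sliding_windows(timestamp, size, period):
    """Return every sliding window (one starts each `period`) containing the timestamp."""
    windows = []
    start = timestamp - timestamp % period
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return windows


def windowed_group_by_key(events, size):
    """Group values per (key, fixed window), so each group is finite."""
    grouped = defaultdict(list)
    for key, value, timestamp in events:
        grouped[(key, fixed_window(timestamp, size))].append(value)
    return dict(grouped)


events = [("S", "Seahawks", 10), ("S", "Seattle", 70), ("N", "NFC", 65)]
grouped = windowed_group_by_key(events, size=60)
print(grouped)
print(sliding_windows(125, size=60, period=30))
```

Whether the timestamp is the arrival time or a custom event time only changes which number is passed in; the assignment logic is the same.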
GroupByKey
Pair With Ones
Sum Values
Count
❯ Define new PTransforms by building up
subgraphs of existing transforms
❯ Some utilities are included in the SDK
• Count, RemoveDuplicates, Join, Min, Max,
Sum, ...
❯ You can define your own:
• DoSomething, DoSomethingElse, etc.
❯ Why bother?
• Code reuse
• Better monitoring experience
Composite PTransforms
Cloud Dataflow SDK
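The Count example on this slide (Pair With Ones, GroupByKey, Sum Values) can be modelled as a composition of three small functions. This is a plain-Python sketch of the idea, with hypothetical helper names, not the SDK's `Count` transform.

```python
from collections import defaultdict


def pair_with_ones(elements):
    return [(element, 1) for element in elements]


def group_by_key(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def sum_values(grouped):
    return {key: sum(values) for key, values in grouped.items()}


def count(elements):
    """Composite transform: a named subgraph built from existing transforms."""
    return sum_values(group_by_key(pair_with_ones(elements)))


counts = count(["Seahawks", "NFC", "Seahawks", "Seattle"])
print(counts)
```

Callers only see `count`, which is the code-reuse benefit; a monitoring UI can likewise show one named step instead of three.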
Run the same code in multiple modes using different runners
❯ Direct Runner
• For local, in-memory execution.
• Great for developing and unit tests
❯ Cloud Dataflow Service Runner
• Runs on the fully managed Dataflow Service
• Your code runs distributed across GCE instances
❯ Community sourced
• Spark runner @ github.com/cloudera/spark-dataflow - Thanks Josh!
• Flink runner coming soon from dataArtisans
Cloud Dataflow Runners
Cloud Dataflow SDK
GCP
Managed Service
User Code & SDK
Work Manager
Deploy & Schedule
Monitoring UI
Job Manager
Life of a Pipeline
Cloud Dataflow Service
Progress & Logs
• At-once processing*
• Graph optimization (ref. FlumeJava)
• Worker lifecycle management
• Worker resource scaling
• Worker scaling
• Restful management API and CLI
• Real-time job monitoring, Cloud Debugger & Cloud Logging integration
• Project based security with auto wipeout
* no enforcement on external service idempotency, dependent upon correctness thresholds
Cloud Dataflow Service Benefits
Cloud Dataflow Service
❯ ParDo fusion
• Producer Consumer
• Sibling
• Intelligent fusion boundaries
❯ Combiner lifting e.g. partial aggregations before reduction
❯ Flatten unzipping
❯ Reshard placement
...
Graph Optimization
Cloud Dataflow Service
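Combiner lifting can be illustrated with a per-key sum. The sketch below is a plain-Python model under stated assumptions (two hypothetical worker bundles, made-up data), not the service's optimizer: it shows that pre-aggregating on each worker gives the same totals while sending fewer records through the shuffle.

```python
from collections import defaultdict


def sum_per_key(pairs):
    """Sum the values of (key, value) pairs per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def without_lifting(bundles):
    # Every raw pair crosses the shuffle before being summed.
    shuffled = [pair for bundle in bundles for pair in bundle]
    return sum_per_key(shuffled), len(shuffled)


def with_lifting(bundles):
    # Partial aggregation per bundle first; only partial sums are shuffled.
    shuffled = [pair for bundle in bundles for pair in sum_per_key(bundle).items()]
    return sum_per_key(shuffled), len(shuffled)


bundles = [
    [("S", 1), ("S", 1), ("N", 1)],
    [("S", 1), ("C", 1), ("C", 1)],
]
plain_totals, plain_shuffled = without_lifting(bundles)
lifted_totals, lifted_shuffled = with_lifting(bundles)
print(plain_totals, plain_shuffled, lifted_shuffled)
```

This only works because addition is associative and commutative, so partial results can be merged in any order.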
= ParallelDo
GBK = GroupByKey
+
=
CombineValues
C
consumer-producer sibling
Optimiser: ParallelDo Fusion
Cloud Dataflow Service
D C D
C+D
C+D
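Producer-consumer fusion can be thought of as function composition: two element-wise steps become one, so the intermediate collection is never materialized. The sketch below is a plain-Python analogy with hypothetical step functions, not the optimizer's actual implementation.

```python
def split_words(line):
    """Producer step: one line in, many words out."""
    yield from line.split()


def keep_hashtags(word):
    """Consumer step: keep only hashtags, lower-cased."""
    if word.startswith("#"):
        yield word.lower()


def fuse(producer, consumer):
    """Return a single step that feeds producer output straight into consumer."""
    def fused(element):
        for intermediate in producer(element):
            yield from consumer(intermediate)
    return fused


lines = ["Loving #Cloud today", "no tags here", "#cloud #GCP"]

# Unfused: the intermediate list of all words is built first.
all_words = [word for line in lines for word in split_words(line)]
unfused = [tag for word in all_words for tag in keep_hashtags(word)]

# Fused: each line flows through both steps in one pass.
fused_step = fuse(split_words, keep_hashtags)
fused = [tag for line in lines for tag in fused_step(line)]
print(fused)
```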
Deploy Schedule & Monitor Tear Down
Worker Lifecycle Management
Cloud Dataflow Service
800 RPS 1,200 RPS 5,000 RPS 50 RPS
Worker Scaling
Cloud Dataflow Service
100 mins. vs. 65 mins.
Worker Optimization
Cloud Dataflow Service
Optimizing Your Time
More time to dig
into your data
Programming
Resource
provisioning
Performance
tuning
Monitoring
Reliability
Deployment & configuration
Handling
growing scale
Utilization
improvements
Typical Data Processing
Programming
Cloud Dataflow Service
Data Processing with Cloud Dataflow
Batch or Streaming
● Boundedness
○ Bounded data - finite data set, fixed in schema, is complete regardless of
time, typically at rest in a common durable store
○ Unbounded data - infinite, potentially changing schema, is never complete,
typically not at rest and stored in multiple temporary yet durable stores
● Time to answer
○ Batch processing presents risks of increased cost (under-utilized resources),
increased time to answer and decrease of correctness (late arriving events)
Considerations
Batch / Streaming
Latency
There are situations where batch
processing of growing datasets
breaks down.
Batch failure mode #1: time to answer
Batch / Streaming
The first is latency-sensitive
processing. You can't use an hourly
or daily batch job to do low-latency
fraud, abuse or anomaly detection.
MapReduce
Tuesday Wednesday
Jose
Lisa
Ingo
Asha
Cheryl
Ari
Wednesday Tuesday
The second is sessions: batch processing of individual chunks doesn't account for sessions
across batch boundaries. This is a real problem if you cannot afford to miss or duplicate
important sessions, or generally need to do any cross-chunk analysis. It also gets worse as you
decrease the chunk size.
Batch failure mode #2: Sessions
Batch / Streaming
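The session problem can be reproduced with a small example. The sketch below assumes a hypothetical user who is active from 23:00 on Tuesday to 01:00 on Wednesday, with timestamps in hours since Tuesday 00:00 and a one-hour inactivity gap; the `sessionize` helper is illustrative only.

```python
def sessionize(timestamps, gap):
    """Group timestamps into sessions separated by more than `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


events = [23.0, 23.5, 24.5, 25.0]  # activity straddles midnight

# Processing the whole stream: one session.
whole = sessionize(events, gap=1.0)

# Daily batches: the same activity is cut at the batch boundary.
tuesday = sessionize([t for t in events if t < 24], gap=1.0)
wednesday = sessionize([t for t in events if t >= 24], gap=1.0)

print(len(whole), "session vs", len(tuesday) + len(wednesday), "sessions")
```

Shrinking the chunk size adds more boundaries, so more sessions get split, which matches the slide's remark that the problem gets worse with smaller chunks.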
8:00 9:00 10:00 11:00 12:00 13:00 14:00
Processing
Time
Streaming Patterns: Element-wise transformations
Batch / Streaming
A streaming pipeline naturally handles unbounded, infinite collections of data.
Element-wise transformations like filtering can simply be applied as elements flow past.
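Python generators give a compact picture of this pattern: a filter can run over a source that never ends because it handles one element at a time. The source below is a hypothetical stand-in that produces made-up readings forever.

```python
import itertools


def sensor_readings():
    """Hypothetical unbounded source: yields readings forever."""
    for i in itertools.count():
        yield {"id": i, "value": (i * 37) % 100}


def only_high(readings, threshold=90):
    """Element-wise filter: needs no knowledge of past or future elements."""
    for reading in readings:
        if reading["value"] >= threshold:
            yield reading


# Take just the first three matches; the stream itself never terminates.
first_three = list(itertools.islice(only_high(sensor_readings()), 3))
print(first_three)
```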
8:00 9:00 10:00 11:00 12:00 13:00 14:00
Processing
Time
Streaming Patterns: Aggregating Time Based Windows
Batch / Streaming
However, for aggregations that require combining multiple elements together, we need to divide the infinite stream of elements
into finite sized chunks that can be processed independently.
The simplest way to do this is just to take whatever elements we see in a fixed time period.
But, elements often get delayed, so this might mean we’re processing a bunch of events where most occurred between 1 and
2pm, but there are still a few stragglers from 9am showing up.
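The straggler effect can be shown by grouping the same events twice, once by arrival (processing) time and once by the time they actually occurred. The event records and field names below are hypothetical; times are minutes since midnight.

```python
from collections import defaultdict


def hourly_windows(events, time_field):
    """Group event names into one-hour windows keyed by the hour of `time_field`."""
    windows = defaultdict(list)
    for event in events:
        windows[event[time_field] // 60].append(event["name"])
    return dict(windows)


# "c" happened at 9:10 but only arrived at 13:20.
events = [
    {"name": "a", "event_time": 13 * 60 + 5, "arrival_time": 13 * 60 + 6},
    {"name": "b", "event_time": 13 * 60 + 40, "arrival_time": 13 * 60 + 41},
    {"name": "c", "event_time": 9 * 60 + 10, "arrival_time": 13 * 60 + 20},
]

by_arrival = hourly_windows(events, "arrival_time")
by_event_time = hourly_windows(events, "event_time")
print(by_arrival)
print(by_event_time)
```

Grouping by arrival puts the 9am straggler into the 1pm window; grouping by event time places it in the 9am window where it belongs.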
Event Time
Processing
Time
10:00 11:00 12:00 13:00 14:00 15:00
10:00 11:00 12:00 13:00 14:00 15:00
Input
Output
Streaming Patterns: Event-Time Based Windows
Batch / Streaming
Demo/Code
Demo Architecture Overview
Demo/Code
a. Python script on GKE listens for #cloud status updates and pushes them to Pub/Sub
b. Dataflow:
i. Pulls from Pub/Sub
ii. Sends text of matching mentions to GCP NLP API
iii. Loads output into BigQuery
c. Data Studio connects to BigQuery data source
d. Data Studio generates visualization
(This demo is not a PROD-ready implementation)
GKE | Cloud Pub/Sub | Cloud Dataflow | BigQuery | Data Studio
NLP API
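The data flow of this architecture can be walked through with a plain-Python mock. Nothing here calls a real Google Cloud service: `pull_messages` and `analyze_sentiment` are hypothetical stand-ins for Pub/Sub and the NLP API, and the word-list scoring is a toy replacement for real sentiment analysis.

```python
def pull_messages():
    """Hypothetical stand-in for pulling status updates from Pub/Sub."""
    return [
        "Loving the #cloud today",
        "Lunch time",
        "This #cloud outage is terrible",
    ]


def analyze_sentiment(text):
    """Hypothetical stand-in for the NLP API call: a toy word-list score."""
    positive, negative = {"loving", "great"}, {"terrible", "outage"}
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)


def run_pipeline(messages, hashtag="#cloud"):
    """Filter matching mentions, score them, and build rows for the sink."""
    rows = []
    for text in messages:
        if hashtag in text.lower():
            rows.append({"text": text, "score": analyze_sentiment(text)})
    return rows  # in the demo, rows like these are loaded into BigQuery


rows = run_pipeline(pull_messages())
for row in rows:
    print(row)
```

In the actual demo these steps run as a streaming Dataflow pipeline, so each message is handled as it arrives rather than as a list in memory.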
Pipeline IO
(Text)
Pub/Sub
PTransform
Pipeline IO
(Text)
BigQuery
NLP analysis
Read from Pub/Sub
Write to BigQuery
with specified schema
Demo pipeline (Python SDK)
Demo/Code
Getting Started Resources
❯ cloud.google.com/dataflow
❯ stackoverflow.com/questions/tagged/google-cloud-dataflow
❯ github.com/GoogleCloudPlatform/DataflowJavaSDK
Resources
Getting Started
Thank you
Wrap-up
(JL)
● Last meetup for 2017… Next meetup in January 2018
● 1000 members and counting...
● 2018: More co-presentations. Your meetup, your topics
● Special thanks to speakers, hosts, sponsors, members
Merci / Thank You
@jdalabsmtl
(Check for next DSDT meetup at https://www.meetup.com/DSDTMTL)

Weitere ähnliche Inhalte

Was ist angesagt?

Deploy data analysis pipeline with mesos and docker
Deploy data analysis pipeline with mesos and dockerDeploy data analysis pipeline with mesos and docker
Deploy data analysis pipeline with mesos and dockerVu Nguyen Duy
 
Big Data Redis Mongodb Dynamodb Sharding
Big Data Redis Mongodb Dynamodb ShardingBig Data Redis Mongodb Dynamodb Sharding
Big Data Redis Mongodb Dynamodb ShardingAraf Karsh Hamid
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data ProcessingCloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data ProcessingDoiT International
 
Resilient microservices with Kubernetes - Mete Atamel
Resilient microservices with Kubernetes - Mete AtamelResilient microservices with Kubernetes - Mete Atamel
Resilient microservices with Kubernetes - Mete AtamelITCamp
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
GCP Meetup #3 - Approaches to Cloud Native Architectures
GCP Meetup #3 - Approaches to Cloud Native ArchitecturesGCP Meetup #3 - Approaches to Cloud Native Architectures
GCP Meetup #3 - Approaches to Cloud Native Architecturesnine
 
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsDatabricks
 
Data Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudDataData Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudDataWeCloudData
 
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...Databricks
 
How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud DataflowHow to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud DataflowLucas Arruda
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkDatabricks
 
Google Cloud: Data Analysis and Machine Learningn Technologies
Google Cloud: Data Analysis and Machine Learningn Technologies Google Cloud: Data Analysis and Machine Learningn Technologies
Google Cloud: Data Analysis and Machine Learningn Technologies Andrés Leonardo Martinez Ortiz
 
From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...Neville Li
 
Streaming Analytics for Financial Enterprises
Streaming Analytics for Financial EnterprisesStreaming Analytics for Financial Enterprises
Streaming Analytics for Financial EnterprisesDatabricks
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksDatabricks
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Tugdual Grall
 

Was ist angesagt? (20)

Deploy data analysis pipeline with mesos and docker
Deploy data analysis pipeline with mesos and dockerDeploy data analysis pipeline with mesos and docker
Deploy data analysis pipeline with mesos and docker
 
Big Data Redis Mongodb Dynamodb Sharding
Big Data Redis Mongodb Dynamodb ShardingBig Data Redis Mongodb Dynamodb Sharding
Big Data Redis Mongodb Dynamodb Sharding
 
Big Data Tools in AWS
Big Data Tools in AWSBig Data Tools in AWS
Big Data Tools in AWS
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
DataFlow & Beam
DataFlow & BeamDataFlow & Beam
DataFlow & Beam
 
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data ProcessingCloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing
 
Resilient microservices with Kubernetes - Mete Atamel
Resilient microservices with Kubernetes - Mete AtamelResilient microservices with Kubernetes - Mete Atamel
Resilient microservices with Kubernetes - Mete Atamel
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
GCP Meetup #3 - Approaches to Cloud Native Architectures
GCP Meetup #3 - Approaches to Cloud Native ArchitecturesGCP Meetup #3 - Approaches to Cloud Native Architectures
GCP Meetup #3 - Approaches to Cloud Native Architectures
 
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale Platforms
 
Data Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudDataData Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudData
 
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...
Continuous Applications at Scale of 100 Teams with Databricks Delta and Struc...
 
How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud DataflowHow to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache Spark
 
Google Cloud: Data Analysis and Machine Learningn Technologies
Google Cloud: Data Analysis and Machine Learningn Technologies Google Cloud: Data Analysis and Machine Learningn Technologies
Google Cloud: Data Analysis and Machine Learningn Technologies
 
From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...
 
Streaming Analytics for Financial Enterprises
Streaming Analytics for Financial EnterprisesStreaming Analytics for Financial Enterprises
Streaming Analytics for Financial Enterprises
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
 

Ähnlich wie Data Streaming and Machine Learning with Google Cloud Platform

Google Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneGoogle Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneDataWorks Summit
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...Cisco DevNet
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Alluxio, Inc.
 
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward
 
AWS Webcast - Build Agile Applications in AWS Cloud for Government
AWS Webcast - Build Agile Applications in AWS Cloud for GovernmentAWS Webcast - Build Agile Applications in AWS Cloud for Government
AWS Webcast - Build Agile Applications in AWS Cloud for GovernmentAmazon Web Services
 
Solving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute finalSolving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute finalAvere Systems
 
PartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC SolutionPartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC SolutionTimothy Spann
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPDaniel Zivkovic
 
The Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and StreamingThe Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and StreamingTimothy Spann
 
Azure satpn19 time series analytics with azure adx
Azure satpn19   time series analytics with azure adxAzure satpn19   time series analytics with azure adx
Azure satpn19 time series analytics with azure adxRiccardo Zamana
 
AWS Webcast - Build Agile Applications in AWS Cloud for Government
AWS Webcast - Build Agile Applications in AWS Cloud for GovernmentAWS Webcast - Build Agile Applications in AWS Cloud for Government
AWS Webcast - Build Agile Applications in AWS Cloud for GovernmentAmazon Web Services
 
Session 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers ProgramSession 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers ProgramFIWARE
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoophuguk
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...DataWorks Summit
 
Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataprocAlluxio, Inc.
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table NotesTimothy Spann
 
Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformDATAVERSITY
 

Ähnlich wie Data Streaming and Machine Learning with Google Cloud Platform (20)

Google Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneGoogle Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better One
 
Google Cloud Dataflow
Google Cloud DataflowGoogle Cloud Dataflow
Google Cloud Dataflow
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
 
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
 
AWS Webcast - Build Agile Applications in AWS Cloud for Government
AWS Webcast - Build Agile Applications in AWS Cloud for GovernmentAWS Webcast - Build Agile Applications in AWS Cloud for Government
AWS Webcast - Build Agile Applications in AWS Cloud for Government
 
Solving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute finalSolving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute final
 
PartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC SolutionPartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC Solution
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Data Streaming and Machine Learning with Google Cloud Platform

  • 1. Data Science | Design | Technology (November 21, 2017) https://www.meetup.com/DSDTMTL
  • 2. Agenda: 6:00 - 6:15: Welcome; 6:15 - 6:50: Machine Learning through Dataflow; 6:50 - 7:30: Building a Streaming Pipeline; 7:30 - 8:00: Q&A + Networking
  • 3. Maxime Legault-Venne (Software Architect, JDA Labs); Arsho Toubi (Customer Engineer, Google Cloud)
  • 4. Topic #1: Machine Learning through Google Cloud Platform’s Dataflow (Max)
  • 5. Machine Learning Through Google Cloud Platform’s Dataflow
  • 6. WHAT’S OUR AIM? Predict how well an item will sell based on past performance of similar items.
  • 7. HOW DO WE PROCEED? Machine Learning: 1. Train on existing data to understand which attributes (features) have which impact on the performance of items. 2. Use the trained model to predict how unknown items would perform based on their attributes (features).
  • 8. MACHINE LEARNING: Training Process ● Generate features for every item (e.g. color, brand, pattern) ● Shuffle data ● Split data into training entries (training set + test set) ● Generate hyperparameters from a predefined search space ● Combine hyperparameters and training entries ● Train a model for each combination ○ Evaluate out-of-sample error ● Combine the errors for the same hyperparameters ● Rank the errors of every hyperparameter set to select the best ● Train on all data and keep the trained model for prediction
  • 9. TRAINING ENTRIES: Splitting Data (diagram labels: Training Set, Test Set, 3 months, Training entries to get model errors, Final training data set)
  • 10. WHAT’S THE PROBLEM? Multiple combinations to train. Ex.: For 700 combinations of hyperparameters and training entries, training took around 12 hours to complete.
  • 11. HOW TO SOLVE IT? - Parallelize by running all trainings concurrently - Need a lot of processing power - Need a lot of memory
  • 12. APACHE BEAM - Implementation agnostic & open source - Java & Python SDKs - Lets you build pipelines https://beam.apache.org/documentation/runners/capability-matrix/
  • 13. PIPELINE BASICS 1. Create streaming or batch pipeline 2. Read data from various sources ○ Files ○ Databases ○ Cloud solutions ○ Code generated data ○ ... 3. Apply transforms to process data 4. Write or output final pipeline data 5. Run on a specific runner ○ Direct (locally on your machine) ○ Apache Apex ○ Apache Flink ○ Apache Spark ○ Google Cloud Dataflow ○ Apache Gearpump https://beam.apache.org/documentation/pipelines/create-your-pipeline/
  • 15. RESULTS: Testing 700 combinations of hyperparameters and training entries. The previous execution took 12 hours to process sequentially. The whole Dataflow job took a bit less than 28 minutes.
  • 16. Topic #2: Building a Dataflow streaming pipeline for sentiment analysis on Twitter (Arsho)
  • 18. Confidential & Proprietary. Agenda: Trade-offs and challenges in Big Data; Apache Beam SDK; Cloud Dataflow Service; Batch or Stream; Demo/Code; Getting Started Resources
  • 19. Introduction: The Tense Quadrachotomy of Big Data (diagram labels: Accuracy Speed Cost Control Complexity Time to Answer)
  • 20. Introduction: Google Managed Services Toolbox for Big Data. Fully Managed, No-Ops Services: BigQuery (ingest data at 100,000 rows per second); Dataflow (stream & batch processing, unified and simplified); Pub/Sub (scalable, flexible, and globally available messaging)
  • 21. Introduction: Google Cloud Dataflow. Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines. Cloud Dataflow is a fully managed service for executing optimized parallelized data processing pipelines.
  • 23. Cloud Dataflow SDK: Dataflow Benefits ❯ Unified programming model for both batch & stream processing • Independent from the execution back-end aka “runner” ❯ Google driven & open sourced • Java 7 • Python 2 (streaming is in Alpha)
  • 24. Cloud Dataflow SDK - Logical Model: Pipeline { Who => Inputs; What => Transforms; Where => Windows; When => Watermarks + Triggers; To => Outputs }. Annotations: At once guarantee (modulo completeness thresholds); Aggregations, Filters, Joins, ...; Completeness
  • 25. Cloud Dataflow SDK: Pipeline • A Directed Acyclic Graph of data processing transformations • Can be submitted to the Dataflow Service for optimization and execution, or executed on an alternate runner, e.g. Spark • May include multiple inputs and multiple outputs • May encompass many logical MapReduce operations • PCollections flow through the pipeline
  • 26. Cloud Dataflow SDK: Inputs & Outputs ❯ Read from standard Google Cloud Platform data sources • GCS, Pub/Sub, BigQuery, Datastore ❯ Write your own custom source by teaching Dataflow how to read it in parallel • Currently for bounded sources only ❯ Write to GCS, BigQuery, Pub/Sub • More coming… ❯ Can use a combination of text, JSON, XML, Avro formatted data (diagram label: Your Source/Sink Here)
  • 27. Cloud Dataflow SDK: ParDo (“Parallel Do”) ❯ Processes each element of a PCollection independently using a user-provided DoFn ❯ Elements are processed in arbitrary ‘bundles’ e.g. “shards” • startBundle(), processElement() - N times, finishBundle() ❯ Corresponds to both the Map and Reduce phases in Hadoop i.e. ParDo->GBK->ParDo. Example (KeyBySessionId): {Seahawks, NFC, Champions, Seattle, ...} -> {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
  • 28. Cloud Dataflow SDK: GroupByKey • Takes a PCollection of key-value pairs and gathers up all values with the same key • Corresponds to the shuffle phase in Hadoop. Example: {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...} -> {KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champions, …}>}. Wait a minute… How do you do a GroupByKey on an unbounded PCollection?
  • 29. Cloud Dataflow SDK: Windows ❯ Logically divides up or groups the elements of a PCollection into finite windows • Fixed Windows: hourly, daily, … • Sliding Windows • Sessions ❯ Required for GroupByKey-based transforms on an unbounded PCollection, but can also be used for bounded PCollections ❯ Window.into() can be called at any point in the pipeline and will be applied when needed ❯ Can be tied to arrival time or custom event time ❯ Watermarks + Triggers enable robust completeness (diagram labels: Nighttime, Mid-Day, Nighttime)
  • 30. Cloud Dataflow SDK: Composite PTransforms ❯ Define new PTransforms by building up subgraphs of existing transforms ❯ Some utilities are included in the SDK • Count, RemoveDuplicates, Join, Min, Max, Sum, ... ❯ You can define your own: • DoSomething, DoSomethingElse, etc. ❯ Why bother? • Code reuse • Better monitoring experience (diagram labels: Count, Pair With Ones, GroupByKey, Sum Values)
  • 31. Cloud Dataflow SDK: Cloud Dataflow Runners. Run the same code in multiple modes using different runners ❯ Direct Runner • For local, in-memory execution • Great for development and unit tests ❯ Cloud Dataflow Service Runner • Runs on the fully managed Dataflow Service • Your code runs distributed across GCE instances ❯ Community sourced • Spark runner @ github.com/cloudera/spark-dataflow - Thanks Josh! • Flink runner coming soon from dataArtisans
  • 32. Cloud Dataflow Service: Life of a Pipeline (diagram labels: User Code & SDK, GCP Managed Service, Deploy & Schedule, Job Manager, Work Manager, Progress & Logs, Monitoring UI)
  • 33. Cloud Dataflow Service: Cloud Dataflow Service Benefits • At-once processing* • Graph optimization (ref. FlumeJava) • Worker lifecycle management • Worker resource scaling • Worker scaling • RESTful management API and CLI • Real-time job monitoring, Cloud Debugger & Cloud Logging integration • Project based security with auto wipeout. * No enforcement of external service idempotency; dependent upon correctness thresholds
  • 34. Cloud Dataflow Service: Graph Optimization ❯ ParDo fusion • Producer Consumer • Sibling • Intelligent fusion boundaries ❯ Combiner lifting e.g. partial aggregations before reduction ❯ Flatten unzipping ❯ Reshard placement ...
  • 35. Cloud Dataflow Service: Optimiser: ParallelDo Fusion (diagram legend: ParallelDo; GBK = GroupByKey; + = CombineValues. Diagram labels: consumer-producer, sibling, C, D, C+D)
  • 36. Cloud Dataflow Service: Worker Lifecycle Management (diagram labels: Deploy, Schedule & Monitor, Tear Down)
  • 37. Cloud Dataflow Service: Worker Scaling (diagram labels: 800 RPS, 1,200 RPS, 5,000 RPS, 50 RPS)
  • 38. Cloud Dataflow Service: Worker Optimization (100 mins. vs. 65 mins.)
  • 39. Cloud Dataflow Service: Optimizing Your Time. Typical Data Processing: Programming, Resource provisioning, Performance tuning, Monitoring, Reliability, Deployment & configuration, Handling growing scale, Utilization improvements. Data Processing with Cloud Dataflow: Programming. More time to dig into your data
  • 41. Batch / Streaming: Considerations ● Boundedness ○ Bounded data - finite data set, fixed in schema, is complete regardless of time, typically at rest in a common durable store ○ Unbounded data - infinite, potentially changing schema, is never complete, typically not at rest and stored in multiple temporary yet durable stores ● Time to answer ○ Batch processing presents risks of increased cost (under-utilized resources), increased time to answer and decreased correctness (late arriving events)
  • 42. Batch / Streaming: Batch failure mode #1: time to answer (Latency). There are situations where batch processing of growing datasets breaks down. The first is latency-sensitive processing. You can't use an hourly or daily batch job to do low-latency fraud, abuse or anomaly detection.
  • 43. Batch / Streaming: Batch failure mode #2: Sessions. The second is sessions: batch processing of individual chunks doesn't account for sessions across batch boundaries. This is a real problem if you cannot afford to miss or duplicate important sessions, or generally need to do any cross-chunk analysis. It also gets worse as you decrease the chunk size. (diagram labels: MapReduce; Tuesday, Wednesday; Jose, Lisa, Ingo, Asha, Cheryl, Ari)
  • 44. Batch / Streaming: Streaming Patterns: Element-wise transformations. A streaming pipeline naturally handles unbounded, infinite collections of data. Element-wise transformations like filtering can simply be applied as elements flow past. (diagram axis: Processing Time, 8:00 to 14:00)
  • 45. Batch / Streaming: Streaming Patterns: Aggregating Time Based Windows. However, for aggregations that require combining multiple elements together, we need to divide the infinite stream of elements into finite sized chunks that can be processed independently. The simplest way to do this is just to take whatever elements we see in a fixed time period. But elements often get delayed, so this might mean we’re processing a bunch of events where most occurred between 1 and 2pm, but there are still a few stragglers from 9am showing up. (diagram axis: Processing Time, 8:00 to 14:00)
  • 46. Batch / Streaming: Streaming Patterns: Event-Time Based Windows (diagram: Input and Output plotted over Event Time and Processing Time, 10:00 to 15:00)
  • 48. Demo/Code: Demo Architecture Overview a. Python script on GKE listens for #cloud status updates and pushes them to Pub/Sub b. Dataflow: i. Pulls from Pub/Sub ii. Sends text of matching mentions to GCP NLP API iii. Loads output into BigQuery c. Data Studio connects to the BigQuery data source d. Data Studio generates visualization (This demo is not a PROD-ready implementation) (diagram labels: GKE, Cloud Pub/Sub, Cloud Dataflow, NLP API, BigQuery, Data Studio)
  • 49. Demo/Code: Demo pipeline (Python SDK) (diagram labels: Pipeline IO (Text), Pub/Sub, Read from Pub/Sub; PTransform, NLP analysis; Pipeline IO (Text), BigQuery, Write to BigQuery with specified schema)
  • 51. Getting Started: Resources ❯ cloud.google.com/dataflow ❯ stackoverflow.com/questions/tagged/google-cloud-dataflow ❯ github.com/GoogleCloudPlatform/DataflowJavaSDK
  • 53. Wrap-up (JL)
  • 54. ● Last meetup for 2017… Next meetup in January 2018 ● 1000 members and counting... ● 2018: More co-presentations. Your meetup, your topics ● Special thanks to speakers, hosts, sponsors, members
  • 55. Merci / Thank You. @jdalabsmtl (Check for next DSDT meetup at https://www.meetup.com/DSDTMTL)
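
Note on slide 8 (Training Process): the steps "combine hyperparameters and training entries, train a model for each combination, combine the errors for the same hyperparameters, rank" can be sketched in plain Python as below. This is only an illustration, not the speakers' code. The search space, the training entry names, and the `train_and_evaluate` stand-in (which returns a toy deterministic error instead of training a model) are all hypothetical, and a local thread pool merely stands in for the Dataflow workers used in the talk.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from statistics import mean

# Hypothetical search space and training entries (not taken from the slides).
SEARCH_SPACE = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4]}
TRAINING_ENTRIES = ["split_a", "split_b", "split_c"]


def train_and_evaluate(combination):
    """Stand-in for real training: returns (hyperparameters, out-of-sample error)."""
    hyperparameters, entry = combination
    learning_rate, depth = hyperparameters
    # Toy deterministic "error" so the sketch runs without data or a model.
    error = abs(learning_rate - 0.1) + abs(depth - 4) * 0.05 + 0.001 * len(entry)
    return hyperparameters, error


def rank_hyperparameters(search_space, training_entries, max_workers=4):
    # Generate hyperparameters from the predefined search space.
    hyperparameters = list(product(*search_space.values()))
    # Combine hyperparameters and training entries.
    combinations = list(product(hyperparameters, training_entries))
    # Train one model per combination, concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(train_and_evaluate, combinations))
    # Combine the errors for the same hyperparameters.
    errors = defaultdict(list)
    for params, error in results:
        errors[params].append(error)
    # Rank by mean error, lowest (best) first.
    averaged = [(params, mean(values)) for params, values in errors.items()]
    return sorted(averaged, key=lambda item: item[1])


ranking = rank_hyperparameters(SEARCH_SPACE, TRAINING_ENTRIES)
best_hyperparameters, best_error = ranking[0]
print(best_hyperparameters, best_error)
```

In a Beam pipeline the same shape would roughly map onto a ParDo for the per-combination training followed by a GroupByKey on the hyperparameters, which is what lets a runner such as Dataflow spread the combinations over many workers.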