This document contains Confidential and Proprietary Information of AllCloud Ltd. that may not be redistributed or disclosed, at any time, to any third party, without AllCloud prior written consent. © 2017, AllCloud Ltd. All rights reserved.
Big Data Best Practices
Real Time Analytics
Lior Hipsh
10/7/17
www.allcloud.io
AllCloud in a Nutshell
● 9 years of cloud experience
● 1500+ successful deployments
● 1000+ customers
● 3 operating centers
Agenda
Big Data Introduction
Real Time Analytics
GCP DataFlow
Big Data
Volume - DWHs and storage, shard-based DBs (NoSQL & SQL), parallel processing of unstructured data
Velocity - real-time analytics, quick response & reduced DWH
Variety - schemaless: flexibility of the data (document DB); flexibility of relations (graph DB)
Data...
...can be big...
...really, really big... [figure: data split by day: Tuesday, Wednesday, Thursday]
...maybe infinitely big... [figure: timeline from 8:00 to 14:00]
...with unknown delays. [figure: same 8:00 to 14:00 timeline, with 8:00 events marked]
“Historical” Pattern
High Volume Store Structured Data
Ingestion (transport, capture)
DWH, BI
Structured data
ETL steps created OLAP cubes and any processed, digested data
ETL (SQL)
Today's Common Pattern
High volume (batch)
Ingestion (transport, capture)
Data Processing (batch)
DWH/SQL, BI
Multi-step/pipes processing. Best to pass temporary data via the transport.
Multi-step/pipes processing may also be required on digested data, for additional analysis.
ETL
Analytical data
Transformed data
Unstructured & structured data
Analytics data processing is typically done with MapReduce-style engines such as Spark or Hadoop, over files or NoSQL.
ETL can also be done with MapReduce, but is mostly done with ETL tools.
Real Time Analytics
Simplest
Ingestion
Data Processing (streaming / rule-based engine / CEP)
BI (visual + small-size DB)
Action
Analytical data
RT vs. batch: latency of 2-3 seconds and below
Data may not be ETL'd to the DWH after the analytics have been produced
Is batch just real time with skew parameter = 1h?...
Real Time Analytics
In practice
Ingestion (capture)
Data Processing
BI
Database (digested output)
In memory:
● MapReduce
● In-memory SQL DB
● In-memory NoSQL (e.g. Redis)
● Transport/queues
Rules access memory
GCP Simple Pattern
Pubsub
DataFlow
BigQuery
Bigtable
BI Tool (e.g. Tableau)
Cloud SQL
Multi-step/pipes processing
Case where the processed analytics output is still high volume
Real Time Analytics with File Archive
Ingestion
Data Processing
BI
Low-cost bucket
Database
In memory
Real Time Analytics w/ Retroactive Batch Processing
Ingestion
Data Processing (streaming)
BI
Low-cost bucket
Data Processing (batch)
Database
Real Time Analytics with DWH for “offline”
Ingestion
Data Processing
BI
SQL Database
DWH
ETL
Analytical data
Raw data
On GCP
Pubsub
DataFlow (streaming)
BigQuery (3 months of raw data)
BI Tool (e.g. Tableau)
Cloud SQL (digested data)
Multi-step/pipes processing
Low-cost bucket, full history
Analytical data
BigQuery supports internal aging, which can make the low-cost bucket unnecessary
CEP over GCP Stack
[Diagram stages: Capture, Store, Process, Analyze, Use]
Components shown: Google Analytics Premium, Firebase, Google Stackdriver, Cloud Pub/Sub, Storage Transfer Service, Cloud Storage (files), BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Dataflow, Cloud Dataproc, BigQuery Analytics, Cloud Datalab, Cloud ML
Processing modes: stream, batch
Outputs: real-time analytics, real-time dashboard, real-time alerts
Users: data scientists, business analysts
Real Time Analytics Stack
Architectural decisions to make:
Ingestion
● Options: message bus, files, database
● Examples: Pubsub, Kafka, bucket, HBase, HDFS, Bigtable, etc.
Data Processing
● Options: SQL rules or programmatic; shared batch & real-time pipe, or separate
● Examples: GCP DataFlow (Apache Beam), Apache Flink, Spark Streaming, Drools, SQLstream, Tibco Streaming Analytics, IBM Streams
DWH
● Options: columnar DWH; low-cost SQL DB (if possible)
● Examples: BigQuery, IBM Netezza, Vertica, InfoBright, Teradata
BI
● Options: OLAP, report generator
● Examples: Tableau, Looker, Data Studio (free…), BO
Data Processing - CEP engines
Complex Event Processing
GCP DataFlow - programmatic (Apache Beam), Python/Java. The same code framework is used for both batch and real-time processing.
Apache Flink - programmatic.
Spark Streaming - micro-batches.
Kafka Streams - programmatic.
Drools (JBoss)
SQLstream (SQL rules)
Esper (SQL-like “EPL”, Event Processing Language)
Cisco Stream Analytics (SQL)
Real Time Pipes Technology Decision Guide
Pipe logic should be able to use data from outside the streams (external DB).
Pipe code should also be testable outside the cloud (and be cloud agnostic).
Day 0 decision: do we need the real-time pipe for batch as well?
Extensibility to unmanaged pipes, e.g. C++ code that performs one of the steps.
Ecosystem/libraries, i.e. does the pipe need scientific or ML libraries as well?
Beam=Batch+Stream
Apache Beam (incubating)
Cloud Dataflow
Based on Apache Beam. Pipelines are portable to your favorite runtime.
Where might you use Cloud Dataflow?
ETL
• Movement
• Filtering
• Enrichment
• Shaping
Analysis
• Reduction
• Batch computation
• Continuous computation
Orchestration
• Composition
• External orchestration
• Simulation
“Windowing”
Data is typically an infinite time series.
Rules need to be matched per event, using historical data from the last X minutes.
The framework works by defining windows, mainly sliding windows.
Windows can be tied to arrival time or to a custom event time.
Watermarks + triggers enable robust completeness.
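A per-event rule over "the last X minutes" can be sketched in plain Java as below. This is only an illustration of the sliding-window idea, not the Beam or Dataflow API: the class name `SlidingWindowRule`, its parameters, and the in-order arrival assumption are all hypothetical simplifications. A real engine would rely on watermarks and triggers to handle late data.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Matches when at least `threshold` events fall inside a sliding event-time window. */
final class SlidingWindowRule {
    private final long windowMillis;
    private final int threshold;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    SlidingWindowRule(long windowMillis, int threshold) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
    }

    /**
     * Feeds one event and reports whether the rule matches.
     * Assumes events arrive in event-time order (no late data).
     */
    boolean onEvent(long eventTimeMillis) {
        timestamps.addLast(eventTimeMillis);
        // Evict events that are a full window (or more) older than the newest event.
        while (!timestamps.isEmpty()
                && timestamps.peekFirst() <= eventTimeMillis - windowMillis) {
            timestamps.removeFirst();
        }
        return timestamps.size() >= threshold;
    }

    /** Number of events currently held in the window. */
    int eventsInWindow() {
        return timestamps.size();
    }
}
```

For example, `new SlidingWindowRule(60_000, 3)` matches once three events have been seen within the same 60-second span of event time.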
Dataflow -> Apache Beam
Batch and real time become the two end points of a scale of processing definitions: the delay-from-real-time “factor” is a parameter in the processing code.
By default, with no parameter set, processing is batch.
Full control (per processing of a data collection) over windowing and over the time shift from event to processing.
Full streaming control.
Python or Java.
Open source. Can run in the cloud or locally.
Code can run on Spark or Flink.
Dynamic work rebalancing.
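The idea that batch and real time differ only by a parameter can be sketched in plain Java. This is not Beam code: `WindowedSum`, `Event`, and `sumPerWindow` are hypothetical names, and fixed (non-overlapping) windows are used instead of sliding ones to keep the sketch short. Passing no window puts everything into one global group, which is the batch case. Shrinking the window moves the same code toward the real-time end of the scale.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WindowedSum {
    /** Hypothetical stand-in for a timestamped pipeline element. */
    static final class Event {
        final long eventTimeMillis;
        final long value;

        Event(long eventTimeMillis, long value) {
            this.eventTimeMillis = eventTimeMillis;
            this.value = value;
        }
    }

    /**
     * Sums values per fixed window, keyed by window start in milliseconds.
     * A null window means no windowing: one global group under key 0 (batch).
     * Assumes a non-null window has a positive length.
     */
    static Map<Long, Long> sumPerWindow(List<Event> events, Duration window) {
        Map<Long, Long> sums = new TreeMap<>();
        for (Event e : events) {
            long key = (window == null)
                    ? 0L
                    : Math.floorDiv(e.eventTimeMillis, window.toMillis()) * window.toMillis();
            sums.merge(key, e.value, Long::sum);
        }
        return sums;
    }
}
```

Calling `sumPerWindow(events, null)` gives a single total, while `sumPerWindow(events, Duration.ofMinutes(1))` gives one total per minute of event time from the same code path.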
Multi Pipes Flows
The data processing engine should support stream processing (pushing/routing output to the next stream/pipe).
Multi-step processing should be supported without going via the transport.
Monitoring is a must.
Recovery is a must.
Auto-scaling (cloud…) of each step. Assume peaks.
Cross-cloud and hybrid.
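Multi-step processing without a trip through the transport amounts to handing each step's output directly to the next step inside the engine. A minimal plain-Java sketch of that composition follows. `PipeChain` and `chain` are hypothetical names, and monitoring, recovery, and scaling are deliberately left out.

```java
import java.util.List;
import java.util.function.Function;

final class PipeChain {
    /**
     * Chains steps so that each output is passed in memory to the next step,
     * with no queue or message bus between them.
     */
    static <T> Function<T, T> chain(List<Function<T, T>> steps) {
        Function<T, T> pipe = Function.identity();
        for (Function<T, T> step : steps) {
            pipe = pipe.andThen(step);
        }
        return pipe;
    }
}
```

A pipe built from a trim step, an upper-casing step, and a tagging step would then process each element in a single call to `apply`.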
Ingestion guide
Use a bucket as the “first line” if possible.
This can work in many systems, including some IoT, where devices can upload short batches rather than a single event at a time.
React to files by moving them to Pubsub; this flattens peaks.
Invest time in sharding design (good for any sharded system…).
Not needed on GCP! (Beam has Partitions, but those are for application logic needs.)
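The "file first, then queue" pattern can be sketched as follows. The in-memory queue is only a stand-in for a Pub/Sub topic, not the real client library. `FileToQueue`, `onFileUploaded`, and `pull` are hypothetical names. The point is that a burst of uploaded lines lands in the queue, while the consumer drains it at its own bounded pace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class FileToQueue {
    /** In-memory stand-in for a message topic (hypothetical, not a Pub/Sub client). */
    private final Queue<String> topic = new ConcurrentLinkedQueue<>();

    /** Called when a short batch file is uploaded: publishes one message per non-blank line. */
    int onFileUploaded(String fileContents) {
        int published = 0;
        for (String line : fileContents.split("\\R")) {
            String event = line.trim();
            if (!event.isEmpty()) {
                topic.add(event);
                published++;
            }
        }
        return published;
    }

    /** Pulls at most maxMessages per call, so upload peaks are absorbed by the queue. */
    List<String> pull(int maxMessages) {
        List<String> out = new ArrayList<>();
        String msg;
        while (out.size() < maxMessages && (msg = topic.poll()) != null) {
            out.add(msg);
        }
        return out;
    }

    /** Messages published but not yet pulled. */
    int backlog() {
        return topic.size();
    }
}
```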
Thank you!
Backup slides
Scenario
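// OptionsBuilder, PlaybackEvent and the *Transform classes are application code not shown on the slide.
Pipeline p = Pipeline.create(
    OptionsBuilder.RunOnService(true, false));

// Read the raw dump from Cloud Storage, one String per line.
PCollection<String> rawData = p.begin().apply(
    TextIO.Read.from(OptionsBuilder.GCS_RAWDUMP_URI));

// Parse each raw line into a PlaybackEvent.
PCollection<PlaybackEvent> events = rawData.apply(new ParseTransform());

// Fan out: three independent branches consume the same parsed collection.
events.apply(new ArchiveTransform());
events.apply(new SessionAnalysisTransform());
events.apply(new AssetTransform());

p.run();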
Java 7 Implementation
Cloud Pub/Sub
Fast, reliable, event delivery. Serverless, autoscaling, pay for what you use.

Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 

Kürzlich hochgeladen (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

Big Data Best Practices on GCP

  • 1. Big Data Best Practices: Real Time Analytics. Lior Hipsh, 10/7/17. This document contains Confidential and Proprietary Information of AllCloud Ltd. that may not be redistributed or disclosed, at any time, to any third party, without AllCloud prior written consent. © 2017, AllCloud Ltd. All rights reserved.
  • 2. AllCloud in a Nutshell (www.allcloud.io). 9 years of cloud experience; 1500+ successful deployments; 1000+ customers; 3 operating centers.
  • 3. Agenda: Big Data Introduction; Real Time Analytics; GCP Dataflow.
  • 4. Big Data. Volume: DWHs and storage, shard-based DBs (NoSQL and SQL), parallel processing of unstructured data. Velocity: real-time analytics, quick response and a reduced DWH. Variety: schemaless data, meaning flexibility of the data (document DB) and flexibility of relations (graph DB).
  • 8. … maybe infinitely big... [Figure: timeline running from 8:00 to 14:00]
  • 9. … with unknown delays. [Figure: timeline running from 8:00 to 14:00, with three events stamped 8:00 placed at different points along it]
  • 10. “Historical” Pattern. Diagram components: Ingestion (transport, capture); High Volume Store; ETL (SQL); DWH; BI. Data labels: structured data. Note: ETL steps create the OLAP cubes and any other processed, digested data.
  • 11. Today's Common Pattern. Diagram components: Ingestion (transport, capture); High Volume (batch); Data Processing (batch); ETL; DWH/SQL; BI. Data labels: unstructured and structured data; transformed data; analytical data. Notes: Multi-step (piped) processing; it is best to pass temporary data via the transport. Multi-step processing may also be required on digested data for additional analysis. Analytics data processing is typically done with MapReduce-style engines such as Spark or Hadoop, over files or NoSQL. ETL can also be done with MapReduce, but is mostly done with ETL tools.
  • 12. Real Time Analytics: Simplest. Diagram components: Ingestion; Data Processing (streaming / rule-based engine / CEP); BI (visualization plus a small database); Action. Data label: analytical data. Notes: Real time vs. batch: latency at the level of 2-3 seconds and below. Data may not be ETL'd into a DWH after the analytics have been produced. Is batch just real time with a skew parameter of 1h?
  • 13. Real Time Analytics in Practice. Diagram components: Ingestion (capture); Data Processing; In Memory; Database (digested output); BI. In-memory options: MapReduce; SQL in-memory DB; NoSQL in-memory store (e.g. Redis); transport/queues. The rules access the in-memory data.
  • 14. GCP Simple Pattern. Diagram components: Pub/Sub; Dataflow (multi-step/piped processing); BigQuery; Cloud SQL; Bigtable; BI tool (e.g. Tableau). Note: covers the case where the processed analytics output is still high volume.
  • 15. Real Time Analytics with File Archive. Diagram components: Ingestion; Data Processing; In Memory; Database; BI; Low-cost Bucket.
  • 16. Real Time Analytics with Retroactive Batch Processing. Diagram components: Ingestion; Data Processing (streaming); Database; BI; Low-cost Bucket; Data Processing (batch).
  • 17. Real Time Analytics with DWH for “Offline”. Diagram components: Ingestion; Data Processing; SQL Database; BI; ETL; DWH. Data labels: analytical data; raw data.
  • 18. On GCP. Diagram components: Pub/Sub; Dataflow (streaming, multi-step/piped processing); BigQuery (3 months of raw data); Cloud SQL (digested data); BI tool (e.g. Tableau); Low-cost Bucket (full history). Data label: analytical data. Note: BigQuery supports internal aging, which can remove the need for the low-cost bucket.
  • 19. CEP over GCP Stack. Stages: Capture; Process (Stream, Batch); Store; Analyze; Use. Components: Google Analytics Premium; Firebase; Google Stackdriver; Storage Transfer Service; Cloud Pub/Sub; Cloud Dataflow; Cloud Dataproc; BigQuery Storage (tables); Cloud Bigtable (NoSQL); Cloud Storage (files); BigQuery Analytics; Cloud Datalab; Cloud ML. Users: data scientists; business analysts. Outputs: real-time analytics; real-time dashboard; real-time alerts.
  • 20. Real Time Analytics Stack: architecture decisions to make. Ingestion (message bus, files, database): Pub/Sub, Kafka, bucket, HBase, HDFS, Bigtable, etc. Data Processing (SQL rules or programmatic; shared batch and real-time pipe, or separate): GCP Dataflow (Apache Beam), Apache Flink, Spark Streaming, Drools, SQLstream, TIBCO Streaming Analytics, IBM Streams. DWH (columnar DWH, or a low-cost SQL DB if possible): BigQuery, IBM Netezza, Vertica, Infobright, Teradata. BI (OLAP, report generator): Tableau, Looker, Data Studio (free…), BO.
  • 21. Data Processing: CEP (Complex Event Processing) engines. GCP Dataflow: programmatic (Apache Beam), Python/Java; the same code framework is used for both batch and real-time processing. Apache Flink: programmatic. Spark Streaming: micro-batches. Kafka Streams: programmatic. Drools (JBoss). SQLstream (SQL rules). Esper (SQL-like “EPL”, Event Processing Language). Cisco Stream Analytics (SQL).
  • 22. Real Time Pipes: technology decision guide. Pipe logic should be able to use data from outside the streams (external DB). Pipe code should also be testable outside the cloud (and be cloud agnostic). Day 0 decision: do we need the real-time pipe for batch as well? Extensibility to unmanaged pipes, e.g. C++ code that performs one of the steps. Ecosystem/libraries, i.e. does the pipe need scientific or ML libraries as well?
  • 23. Beam = Batch + Stream. Apache Beam (incubating). Cloud Dataflow: based on Apache Beam; pipelines are portable to your favorite runtime.
  • 24. Where might you use Cloud Dataflow? ETL: movement, filtering, enrichment, shaping. Analysis: reduction, batch computation, continuous computation. Orchestration: composition, external orchestration, simulation. (Slide footer: Confidential & Proprietary, Google Cloud Platform.)
  • 25. “Windowing”. Data is typically an infinite time series. Each event must be checked for a rule match using historical data from the last X minutes. The framework works by defining windows, mainly sliding windows. Windows can be tied to arrival time or to a custom event time. Watermarks plus triggers enable robust completeness.
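The sliding-window idea on this slide can be sketched in plain Java, without the Beam SDK. This is only an illustration: the class name `SlidingWindowSketch`, the method `windowStartsFor`, and the 10-minute window sliding every 5 minutes are assumptions, not part of the deck or of the Beam API. The sketch lists the start of every window that a given event timestamp falls into.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Beam or Dataflow.
final class SlidingWindowSketch {

    // Returns, in ascending order, the start times of every sliding window
    // [start, start + sizeMs) that contains the event timestamp.
    // Windows start at non-negative multiples of periodMs.
    static List<Long> windowStartsFor(long eventTimeMs, long sizeMs, long periodMs) {
        List<Long> starts = new ArrayList<Long>();
        // Latest window start at or before the event.
        long lastStart = eventTimeMs - (eventTimeMs % periodMs);
        for (long start = lastStart;
             start > eventTimeMs - sizeMs && start >= 0;
             start -= periodMs) {
            starts.add(0, start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        // An event at minute 12, with 10-minute windows sliding every 5 minutes.
        System.out.println(windowStartsFor(12 * minute, 10 * minute, 5 * minute));
    }
}
```

Because the windows overlap, one event contributes to several of them; that is what lets a rule see "the last X minutes" at each evaluation point.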
  • 26. Dataflow -> Apache Beam. Batch and real time become the two end points on a scale of processing definitions; the delay-from-real-time “factor” is a parameter in the processing code. By default, with no parameter set, processing is batch. Full control (per processing of a data collection) over the windowing and the time shift from event to processing. Full streaming control. Python or Java. Open source: it can run in the cloud or at home, and the code can run on Spark or Flink. Dynamic work rebalancing.
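The claim that batch and real time differ only by a parameter can be illustrated with a small plain-Java sketch. It does not use the Beam API; `WindowedCountSketch`, `countPerWindow`, and the `GLOBAL` sentinel are hypothetical names chosen for this example. The same counting code runs in both modes: with no window size it treats the whole input as one collection (batch), and with a window size it produces one result per fixed window.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Beam or Dataflow.
final class WindowedCountSketch {

    // Sentinel meaning "no windowing": every event lands in one global window.
    static final long GLOBAL = 0L;

    // Counts events per fixed window. Keys are window start times in ms;
    // with GLOBAL there is a single key, 0.
    static Map<Long, Integer> countPerWindow(long[] eventTimesMs, long windowSizeMs) {
        Map<Long, Integer> counts = new TreeMap<Long, Integer>();
        for (long t : eventTimesMs) {
            long key = (windowSizeMs == GLOBAL) ? 0L : t - (t % windowSizeMs);
            Integer old = counts.get(key);
            counts.put(key, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] events = {1_000L, 2_000L, 61_000L, 125_000L};
        System.out.println(countPerWindow(events, GLOBAL));   // batch-style: one result
        System.out.println(countPerWindow(events, 60_000L));  // one result per minute
    }
}
```

Shrinking the window size moves the same logic from the batch end of the scale toward the real-time end.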
  • 27. Multi-Pipe Flows. The data processing engine should support stream processing (pushing/routing output to the next stream/pipe). Multi-step processing should be supported without going via the transport. Monitoring is a must. Recovery is a must. Auto-scaling (cloud…) of each step; assume peaks. Cross-cloud and hybrid.
  • 28. Ingestion guide. Use a bucket as the “first line” if possible. This can work in many systems, including some IoT, where devices can upload short batches rather than a single event at a time. React to files by moving them to Pub/Sub, which flattens the peaks issue. Invest time in sharding design (good on any sharded system…). No need on GCP! (There are Partitions in Beam, but they serve application logic needs.)
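Why putting a queue between the uploaded files and the processing step flattens peaks can be shown with a toy simulation. This is a plain-Java sketch under stated assumptions: `PeakFlatteningSketch` and `backlogPerTick` are hypothetical names, the queue is just a counter, and the consumer rate of 30 messages per tick is made up. It does not call Pub/Sub.

```java
// Hypothetical simulation for illustration; it does not use the Pub/Sub client.
final class PeakFlatteningSketch {

    // Returns the backlog left in the queue after each tick, when the
    // consumer pulls at most ratePerTick messages per tick.
    static int[] backlogPerTick(int[] publishedPerTick, int ratePerTick) {
        int[] backlog = new int[publishedPerTick.length];
        int queued = 0;
        for (int i = 0; i < publishedPerTick.length; i++) {
            queued += publishedPerTick[i];              // burst from an uploaded file
            queued -= Math.min(queued, ratePerTick);    // steady consumption
            backlog[i] = queued;
        }
        return backlog;
    }

    public static void main(String[] args) {
        // One file with 100 events arrives at tick 0; nothing afterwards.
        int[] backlog = backlogPerTick(new int[] {100, 0, 0, 0}, 30);
        System.out.println(java.util.Arrays.toString(backlog));
    }
}
```

The processing step never sees more than its own pull rate per tick; the burst turns into a backlog that drains over the following ticks.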
  • 29. Thank you!
  • 30. Backup slides
  • 31. Scenario
  • 32. Java 7 Implementation
        // Create the pipeline from the options built by the talk's OptionsBuilder helper.
        Pipeline p = Pipeline.create(
            OptionsBuilder.RunOnService(true, false));

        // Read the raw dump as lines of text.
        PCollection<String> rawData = p.begin().apply(TextIO.Read
            .from(OptionsBuilder.GCS_RAWDUMP_URI));

        // Parse each line into a PlaybackEvent.
        PCollection<PlaybackEvent> events = rawData.apply(
            new ParseTransform());

        // Apply three separate transforms to the same parsed collection.
        events.apply(new ArchiveTransform());
        events.apply(new SessionAnalysisTransform());
        events.apply(new AssetTransform());

        p.run();
  • 33. Cloud Pub/Sub. Fast, reliable event delivery. Serverless, autoscaling, pay for what you use.

Editor's notes

  1. Who uses Dataflow today?
  2. Who uses Dataflow today?
  3. Here are gaming logs. Each square represents an event where a user scored some points for their team.
  4. The game gets popular.
  5. We start organizing it into a repeated structure.
  6. The repetitive structure is just a cheap way of representing an infinite data source. Game logs are continuous. Distributed systems can cause ambiguity...
  7. Let's look at some points that were scored at 8am. <animate> The red score happened at 8am and was received quickly. <animate> The yellow score also happened at 8am, but was received at 8:30 due to network congestion. <animate> The green element was hours late: this was someone playing in airplane mode on a plane, and it had to wait for the plane to land. So now we've got an unordered, infinite data set. How do we process it...
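The three 8am scores in note 7 can be turned into a small plain-Java sketch of event time versus arrival time. Several things here are assumptions made for illustration: the names `LateDataSketch`, `windowStart` and `isLate`, the hourly window, the 30-minute allowed lateness, and 13:00 as the arrival time of the "hours late" element. The sketch also uses arrival time as a stand-in for the watermark, which is a simplification of how a streaming engine tracks completeness.

```java
// Hypothetical helper for illustration; not part of Beam or Dataflow.
final class LateDataSketch {

    // Start (in minutes since midnight) of the fixed window that owns the event,
    // based on when the event happened, not when it arrived.
    static int windowStart(int eventMinute, int windowMinutes) {
        return eventMinute - (eventMinute % windowMinutes);
    }

    // True if the event arrived after its window closed plus the allowed lateness.
    // Arrival time stands in for the watermark here, which is a simplification.
    static boolean isLate(int eventMinute, int arrivalMinute,
                          int windowMinutes, int allowedLatenessMinutes) {
        int windowEnd = windowStart(eventMinute, windowMinutes) + windowMinutes;
        return arrivalMinute > windowEnd + allowedLatenessMinutes;
    }

    public static void main(String[] args) {
        int eightAm = 8 * 60;
        // All three scores happened at 8:00; they arrived right away, at 8:30,
        // and hours later (13:00 is an assumed value).
        int[] arrivals = {eightAm, eightAm + 30, 13 * 60};
        for (int arrival : arrivals) {
            System.out.println("window=" + windowStart(eightAm, 60)
                    + " arrival=" + arrival
                    + " late=" + isLate(eightAm, arrival, 60, 30));
        }
    }
}
```

All three scores belong to the same 8:00 window by event time; only the airplane-mode score misses the cutoff, and how to treat such late elements is the decision that watermarks and triggers express.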