You want to implement a Big Data/IoT solution and need to decide whether it should run in the cloud or on-premises. You are interested in the vendors' cloud offerings, the benefits they provide, and whether a similar solution would also be possible on-premises.
This presentation deals with these and other questions. Starting from a vendor-independent reference architecture and corresponding design patterns, cloud solutions from various vendors are compared and rated. Additionally, it shows how such a solution could be implemented on-premises and what a hybrid Big Data/IoT solution could look like.
Big Data - in the cloud or rather on-premises?
1. Big Data - in the cloud or rather on-premises?
BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH
Guido Schmutz
Kassel, 21.9.2017
@gschmutz guidoschmutz@wordpress.com
2. Guido Schmutz
Working at Trivadis for more than 20 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
3. Agenda
1. Cloud Primer
2. Big Data and IoT Architecture
3. Big Data in the Cloud
4. Various Models for Big Data in Cloud
5. Big Data On-Premises
6. Hybrid Big Data Solutions
5. Cloud Primer
Instance
• the thing running in the cloud provider’s infrastructure
• can be a VM but does not have to be
Instance Type
• the size of the instance (Combination of CPU, Memory, Disk Storage => Cost)
• Azure: Instance sizes
Instance Control
• lifecycle of an instance
• Instances can be stopped or terminated (deleted)
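The instance lifecycle above can be pictured as a small state machine. The following is a minimal sketch, assuming three hypothetical state names (`RUNNING`, `STOPPED`, `TERMINATED`); real providers define more states and use their own names.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    STOPPED = "stopped"
    TERMINATED = "terminated"


# Allowed transitions (hypothetical and simplified): a stopped instance can be
# started again, a terminated (deleted) instance cannot.
TRANSITIONS = {
    State.RUNNING: {State.STOPPED, State.TERMINATED},
    State.STOPPED: {State.RUNNING, State.TERMINATED},
    State.TERMINATED: set(),
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target
```

The point of the sketch is the asymmetry: stopping is reversible, terminating is not.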
6. Cloud Primer
Images
• the template used for provisioning an instance
Serverless
• run code “without” servers => only specify functions (Java, C#, Python, Node.js)
• pay only for the compute time you consume
• easy scale-out
• management and capacity-planning decisions are made by the provider
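To illustrate what "only specify functions" means, here is a minimal sketch of a serverless function in Python. The `handler(event, context)` shape follows the convention of AWS Lambda's Python runtime; the event field `temperature_c` and the conversion logic are hypothetical and only serve the illustration.

```python
import json


def handler(event, context=None):
    """Hypothetical function: convert a sensor reading from Celsius to Fahrenheit."""
    celsius = float(event["temperature_c"])  # assumed input field
    return {
        "statusCode": 200,
        "body": json.dumps({"temperature_f": celsius * 9 / 5 + 32}),
    }
```

There is no server, port, or process management in the code; the provider decides where and how often the function runs.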
Regions and Availability Zones
• represent the geographic distribution of a cloud provider
• Regions are the geographic areas where a service is offered
• Availability Zones (AZ) add high availability within a Region
• communication within the AZs of the same Region costs less than across Regions
7. Cloud Primer – Specific Instances
On-Demand Instance
• flexible, on-demand usage
• billing increment dependent on provider
Temporary Instance
• can disappear at any time (bid price)
• are charged significantly less
• well suited for Hadoop workloads (if storage and compute are separated)
• AWS: spot instances
Reserved Instance
• reserved capacity in advance
• reduced pricing (up to 75% compared to on-demand)
Dedicated Instance
• pay for instances
• run on hardware dedicated to you
• Amazon decides placement
Dedicated Host
• pay for entire physical server
• full flexibility of placement of instances (VM)
• solves issues with existing server-bound licenses
Bare Metal
• bare hardware resources, no virtualization by cloud provider
• full flexibility / full control
• almost no automation provided
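The trade-off between on-demand and reserved pricing can be checked with simple arithmetic. This is a sketch under stated assumptions: the hourly rate of 0.40 is a made-up number, the 75% discount is the best case mentioned above, and the function name `yearly_cost` is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def yearly_cost(hourly_rate: float, discount: float = 0.0,
                utilization: float = 1.0) -> float:
    """Yearly cost of one instance.

    hourly_rate: on-demand price per hour (assumed value)
    discount:    fraction taken off the on-demand price (0.75 = 75% cheaper)
    utilization: fraction of the year for which the instance is billed
    """
    return hourly_rate * (1.0 - discount) * HOURS_PER_YEAR * utilization


on_demand_rate = 0.40  # assumed price per hour, not a real quote

always_on = yearly_cost(on_demand_rate)                    # on-demand, 24x7
reserved = yearly_cost(on_demand_rate, discount=0.75)      # reserved, billed 24x7
part_time = yearly_cost(on_demand_rate, utilization=0.20)  # on-demand, 20% of the time
```

With these assumptions a reserved instance is much cheaper than an always-on on-demand instance, but an on-demand instance that runs only 20% of the time is cheaper still. In this model the break-even utilization is `1 - discount`, here 25%.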
8. Cloud Primer - Storage
Block Storage
• most common type offered by a cloud provider
• disk-like storage
• comes with each instance when provisioned
• accessed as filesystem mounts => volumes, disks
• persistent volumes survive beyond the lifetime of the instance that spawned them
• ephemeral volumes are limited to the life of the instance to which they are attached
• AWS: EBS
• Azure: VHDs & Azure File Storage
• Oracle: Block Storage
Object Storage
• each chunk of data is treated as its own entity, independent of any instance
• content of each object is opaque to the provider
• API or URL is used to access data (no mount)
• well suited for Big Data
• hot and cold storage options
• AWS: S3 & Glacier
• Azure: Azure Blob Storage
• Oracle: Object Storage & Archive Storage
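The access model of object storage (key-based API, opaque content, no mount) can be sketched with a toy in-memory class. `ObjectStore` and its methods are hypothetical stand-ins for illustration and do not mirror any provider's real API.

```python
class ObjectStore:
    """Toy in-memory stand-in for an object store (not a real provider API)."""

    def __init__(self):
        self._buckets = {}

    def put(self, bucket: str, key: str, data: bytes) -> None:
        # The store keeps the bytes as-is; it never interprets the content.
        self._buckets.setdefault(bucket, {})[key] = bytes(data)

    def get(self, bucket: str, key: str) -> bytes:
        return self._buckets[bucket][key]

    def list_keys(self, bucket: str, prefix: str = "") -> list:
        # Keys are flat strings; "directories" are only a naming convention.
        return sorted(k for k in self._buckets.get(bucket, {}) if k.startswith(prefix))


store = ObjectStore()
store.put("raw", "2017/09/21/a.json", b"{}")
store.put("raw", "2017/09/22/b.json", b"[]")
september = store.list_keys("raw", prefix="2017/09/")
```

Objects are addressed by bucket and key only, which is why any instance or cluster with access rights can read them without attaching a volume.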
9. Cloud Primer – Usage Patterns
Short Lived (Transient)
👍 Minimal maintenance, high efficiency
👎 spin up time, higher resource demand
👎 data transfer to permanent storage
Self-Service
👍 efficiency of on-demand creation
👎 need to maintain tooling
Cloud-Only
👍 data transfers stay within the cloud, minimal on-premises costs, integration with provider
👎 higher cloud expenditure
Long lived (Long Running)
👍 less time waiting for clusters to start/stop
👍 lower resource demand
👎 wasted idle time (if any)
👎 maintenance burden, growing size over time
Managed
👍 easy alignment with budget constraints
👎 waiting time for usage, admin effort
Hybrid
👍 lower cloud expenditure, local resources available
👎 complex workflows, data transfer costs
20. Big Data in the Cloud – two usage patterns
Short Lived Cluster (Transient)
data is repurposed and used for a specific use case in a specific workload
Cluster spun up only when needed
Flexibility
• spin up arbitrary number of nodes quickly
• Expand quickly from very small to very large
Simplicity
• use as is, solve problem and move on
Long Lived Cluster (Long Running)
data is acquired and augmented continuously
cluster is in permanent use for mixed workloads
Performance
• raw compute performance across a wide range of workloads
• time of availability
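Which of the two patterns is cheaper depends on how busy the cluster is. The following sketch compares both under simple assumptions: all numbers (node count, hourly rate, spin-up time) are made up, and the function names are hypothetical.

```python
def transient_cost(jobs: int, job_hours: float, spinup_hours: float,
                   nodes: int, rate: float) -> float:
    """Cluster is created per job and deleted afterwards; spin-up time is billed too."""
    return jobs * (job_hours + spinup_hours) * nodes * rate


def long_lived_cost(period_hours: float, nodes: int, rate: float) -> float:
    """Cluster runs for the whole period, busy or idle."""
    return period_hours * nodes * rate


# Assumed scenario: 10 nodes at 0.50 per node hour over a 30-day month (720 hours).
always_on = long_lived_cost(720, nodes=10, rate=0.5)
few_jobs = transient_cost(jobs=30, job_hours=4, spinup_hours=0.25, nodes=10, rate=0.5)
many_jobs = transient_cost(jobs=330, job_hours=2, spinup_hours=0.25, nodes=10, rate=0.5)
```

In this model the transient cluster wins while the jobs keep it busy for only a fraction of the month. Once the summed job and spin-up hours exceed the hours in the period, the long-lived cluster becomes the cheaper option.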
21. BDaaS – Possible Cost Optimizations
Autoscaling
• scale up when a query comes in
• scale down when jobs finish
• match utilization with job demand
• benchmark: auto-scaling saves 33% in compute costs compared to a static-sized cluster
Excess capacity
• Hadoop is fault tolerant and can take advantage of unreliable instances such as temporary instances
• benchmark: with 50% of the work done on spot nodes, save 80% compared to normal nodes
Common workload distribution with Big Data applications
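The effect of mixing regular and temporary instances can be estimated with a blended-cost calculation. This is a sketch, not a reproduction of the benchmark quoted above: it assumes that the 80% figure is the price discount of a single spot node, which is only one possible reading, and the function names are hypothetical.

```python
def blended_cost(node_hours: float, on_demand_rate: float,
                 spot_fraction: float, spot_discount: float) -> float:
    """Cost when `spot_fraction` of the node hours runs on temporary instances."""
    spot_hours = node_hours * spot_fraction
    regular_hours = node_hours - spot_hours
    spot_rate = on_demand_rate * (1.0 - spot_discount)
    return regular_hours * on_demand_rate + spot_hours * spot_rate


def savings(baseline: float, optimized: float) -> float:
    """Relative savings, e.g. 0.33 means 33% cheaper than the baseline."""
    return 1.0 - optimized / baseline


# Assumed workload: 1000 node hours at a rate of 1.0 per node hour.
baseline = blended_cost(1000, 1.0, spot_fraction=0.0, spot_discount=0.8)
mixed = blended_cost(1000, 1.0, spot_fraction=0.5, spot_discount=0.8)
mixed_savings = savings(baseline, mixed)
```

Under this reading, running half of the node hours on spot nodes reduces the total bill by about 40%. The remaining half still pays the full rate.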
22. Data Locality vs. Compute/Storage Separation
[Diagram: two cluster layouts side by side]
Data Local Compute
• Master Node and Workers #1 to #3, connected by the Network
• each Worker has its own Disk and Processing
Separate Compute and Storage
• Master Node and Compute nodes #1 to #3 with Processing only
• Storage (Disks) is reached over the Network
Separation of compute and storage – the fundamental difference
• store data in Object Storage instead of DFS
• bring up Compute nodes only for data processing
• multiple workloads on separate clusters can access the same data
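The separation of compute and storage can be sketched in a few lines. In this toy model a plain dictionary stands in for the object store, and `TransientCluster` is a hypothetical compute-only cluster that holds no data of its own.

```python
from collections import Counter

# Shared storage (stand-in for an object store): it outlives any cluster.
shared_storage = {
    "events/part-0": ["login", "click", "click"],
    "events/part-1": ["click", "logout"],
}


class TransientCluster:
    """Hypothetical compute-only cluster: keeps a reference to shared data, no copy."""

    def __init__(self, name: str, storage: dict):
        self.name = name
        self.storage = storage

    def run(self, job):
        # Read all records from shared storage, then apply the job to them.
        records = [r for key in sorted(self.storage) for r in self.storage[key]]
        return job(records)


# Two independent workloads on separate clusters read the same data.
etl = TransientCluster("etl", shared_storage)
adhoc = TransientCluster("adhoc", shared_storage)

counts = etl.run(Counter)  # event counts per type
total = adhoc.run(len)     # total number of events

# Deleting the clusters does not touch the data.
del etl, adhoc
```

With data-local compute, removing the workers would also remove the disks that hold the data. Here the clusters can come and go while the data stays in place.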
23. A new way to Manage Big Data
Big Data: Traditional Assumptions
• Bare-metal
• Data Locality
• HDFS on local disks
Big Data: A New Approach
• Containers and VMs
• Compute and storage separation
• Shared storage
Benefits and Value
• Big-Data-as-a-Service
• Agility and cost savings
• Faster time-to-insights
24. 5 ½ ways to get Big Data in the Cloud
1. “Bring your own Hadoop” (MapR, Cloudera, Hortonworks) on Bare Metal
2. “Bring your own Hadoop” (MapR, Cloudera, Hortonworks) on VM
3. Hadoop PaaS from Cloud Provider’s Marketplace
4. Dedicated (Long-Running) Big-Data-as-a-Service
5. Elastic (Transient) Big-Data-as-a-Service (storage and compute separated)
6. “Cloud on Premises” (Cloud Stack from Vendors on Premises)
26. Various Models for Big Data in Cloud
1. Bare Metal Cloud (Bring Your Own Hadoop - BYOH)
2. IaaS with any Hadoop Distribution (Bring Your Own Hadoop)
3. PaaS with Hadoop (from Marketplace)
4. Dedicated (Long-Running) BDaaS
5. Elastic (Transient) BDaaS
6. BDaaS + Analytics SaaS
27. 1) Bare Metal Cloud (BYOH)
Compute (Bare Metal)
• Amazon: n.a. (Dedicated Host is close, but runs VMs)
• Azure: n.a.
• Oracle: Oracle Compute
Storage (Bare Metal)
• Amazon: n.a. (Dedicated Host is close, but runs VMs)
• Azure: n.a.
• Oracle: Oracle Block Volume & Object Storage, Data Transfer Service
Big Data (Custom)
• Bring Your Own Hadoop (BYOH)
Analytics (Custom)
• Custom (SQL, Machine Learning, ..)
Intelligence (Custom)
• Custom (Image-, Speech-Recognition, Bots, …)
28. 2) IaaS (Bring Your Own Hadoop)
Compute
• Amazon: Amazon EC2 & EC2
• Azure: Azure VM
• Oracle: General Purpose Compute & Dedicated Compute
Storage
• Amazon: S3, EBS, Glacier, Snowball, Snowball Edge, Snowmobile
• Azure: Storage (Blob), Data Lake Store, Import/Export
• Oracle: Oracle Object & Archive Storage, Data Transfer Service
Big Data (Custom)
• Amazon, Azure, Oracle: Bring Your Own Hadoop (BYOH)
Analytics (Custom)
• Amazon, Azure, Oracle: Custom (SQL, Machine Learning, ..)
Intelligence (Custom)
• Amazon, Azure, Oracle: Custom (Image-, Speech-Recognition, Bots, …)
29. 3) PaaS (Hadoop from Marketplace)
Compute
• Amazon: Amazon EC2
• Azure: Azure VM
• Oracle: General Purpose Compute & Dedicated Compute
Storage
• Amazon: S3, EBS, Glacier, Snowball, Snowball Edge, Snowmobile
• Azure: Azure Storage (Blob, Block, Disk, File), Azure Data Lake Store
• Oracle: Oracle Object & Archive Storage, Data Transfer Service
Big Data
• Amazon: Hadoop (Hortonworks, MapR)
• Azure: Hadoop (Cloudera, Hortonworks, MapR)
• Oracle: n.a.
Analytics (Custom)
• Amazon, Azure: Custom (SQL, Machine Learning, ..)
Intelligence (Custom)
• Amazon, Azure: Custom (Image-, Speech-Recognition, Bots, …)
31. 5) Elastic BDaaS
Compute
• Amazon: Amazon EC2
• Azure: Azure VM
• Oracle: General Purpose Compute & Dedicated Compute
Storage
• Amazon: S3, EBS, Glacier
• Azure: Azure Storage (Blob, Block, Disk, File), Azure Data Lake Store
• Oracle: Oracle Object & Archive Storage, Data Transfer Service
Big Data
• Amazon: Amazon EMR
• Azure: Azure HDInsight (Hortonworks)
• Oracle: Big Data CS Compute Edition (Hortonworks)
Analytics (Custom)
• Amazon, Azure, Oracle: Custom (SQL, Machine Learning, ..)
Intelligence
• Amazon, Azure, Oracle: Image-, Speech-Recognition, Bots, …
32. 6) BDaaS + Analytics SaaS
Compute
• Amazon: Amazon EC2 & EC2 Dedicated Hosts
• Azure: Azure VM
• Oracle: General Purpose Compute & Dedicated Compute
Storage
• Amazon: S3, EBS, Glacier
• Azure: Azure Storage (Blob, Block, Disk, File), Azure Data Lake Store
• Oracle: Oracle Object & Archive Storage, Data Transfer Service
Big Data
• Amazon: Amazon EMR
• Azure: Azure HDInsight (Hortonworks)
• Oracle: Big Data CS Compute Edition / Big Data CS
Analytics
• Amazon: Machine Learning, Polly, …
• Azure: Machine Learning, Data Lake Analytics, …
• Oracle: Big Data Discovery CS, Analytics Cloud, Data Spatial & Graph
Intelligence
• Amazon: Alexa, Lex, Polly
• Azure: Cortana, Speech API, Computer Vision API, Video API, ...
• Oracle: n.a.
33. Oracle Cloud
[Diagram: Big Data/IoT reference architecture mapped to Oracle Cloud services]
Reference architecture building blocks
• Sources: Bulk Source, Event Source (DB, File, Location, Weather, IoT Data, Mobile Apps)
• Ingestion: Extract, Import, CDC, Event Hub, Edge Cluster, Edge Node, Rule Engine
• Processing and storage: Core Processing, Stream Processing, Stream Analytics, Batch Analytics, Parallel Processing, Serverless, Storage (Raw, Refined), Results, State / Results, Reference / Models
• Consumers: BI Tools, Enterprise Data Warehouse, Search / Explore, Enterprise Apps
• Access paths: SQL / Stream, Search, SQL / Export, Service / Stream / Export
Oracle Cloud services shown
• IoT CS, Event Hub CS, Stream Analytics, Big Data CS, Big Data CS – Compute, NoSQL CS, Big Data Discovery CS, Big Data Preparation CS, BigData SQL, Data Spatial & Graph
• Object Storage, Archive Storage, Block Storage, Data Transfer Service
• Integration CS, Messaging CS, Process CS, Mobile CS, Container CS, Application Container CS, GoldenGate, Oracle Data Integrator CS, Visual Builder
• BI CS, Analytics CS, Data Visualization CS
34. Amazon AWS
[Diagram: Big Data/IoT reference architecture mapped to AWS services]
AWS services shown
• Elastic MapReduce (EMR), Athena, Redshift, Glue, Data Pipeline, QuickSight, Elasticsearch, TensorFlow
• Kinesis Streams, Kinesis Firehose, Kinesis Analytics, AWS IoT Platform, Greengrass, Rules Engine, Lambda
• S3, Glacier, EBS, EFS, DynamoDB, ElastiCache
• Snowball, Snowball Edge, Snowmobile, Direct Connect
• EC2 Auto Scaling, EC2 Container Service, EC2 Container Registry
• ML, Polly, Lex, Rekognition, Alexa
• Mobile Hub, Mobile SDK, SQS, SNS, Email, Pinpoint, API Gateway
35. Microsoft Azure
[Diagram: Big Data/IoT reference architecture mapped to Microsoft Azure services]
Azure services shown
• HDInsight, Data Lake Analytics, Machine Learning, SQL Data Warehouse, Power BI, Time Series Insight
• Event Hub, Event Grid, Stream Analytics, IoT Suite, IoT Hub, IoT Edge, Functions
• Storage Blob, Storage Block, Data Lake Store, Table Storage, Cosmos DB, Redis Cache, Import/Export
• Container Service, Container Registry, Container Instances
• Speech API, Vision API, Cortana, Bot Service
• Service Bus, Notification Hub, API Management, BizTalk Services
37. On-Premises – Oracle Cloud Machine
[Diagram: Big Data/IoT reference architecture mapped to services on Oracle Cloud Machine]
Services shown
• IoT CS, Event Hub CS, Stream Analytics, Big Data CS, Big Data CS – Compute, NoSQL CS, Big Data Discovery CS, BigData SQL, Data Spatial & Graph
• Object Storage, Archive Storage, Block Storage, Data Transfer Service
• Integration CS, Messaging CS, BI CS, Process CS, Mobile CS, Container CS, Application Container CS
38. On Premises – Open Source
[Diagram: Big Data/IoT reference architecture for an on-premises implementation with open source components]