Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp Data Fabric and NetApp Private Storage with Nilesh Bagad and Karthikeyan Nagalingam
This session explains how NetApp simplifies the analysis of IoT data with Apache Spark clusters across data centers and the cloud, using NetApp Private Storage (NPS) for AWS/Azure, the NetApp Data Fabric, and the NetApp connectors for NFS and S3. IoT data originates at the edge in different geographical locations, and it can arrive at different data centers or in the cloud depending on sensor location. The challenge is how to combine these different data streams across data centers to generate wider insights.
Learn how the NetApp Data Fabric helps solve this challenge. In the Data Fabric architecture, IoT data is ingested via Kafka into an Apache Spark cluster running in AWS/Azure, but the data is stored in an NPS-provisioned NFS share through the NFS Connector. The IoT data in NPS can then be moved to on-premises data centers, or on-premises IoT data can be moved to NPS or ONTAP Cloud for processing in AWS/Azure using NetApp SnapMirror, FlexClone, or the NFS Connector. We'll also review how NetApp StorageGRID object storage maintains IoT data for archival purposes using an S3 target. These options allow you to analyze IoT data from AWS, StorageGRID, HDFS, or NFS, providing a feasible solution for deploying Spark clusters across data centers.
Takeaways will include identifying Spark challenges that can be remedied by extending your Spark environment to take advantage of NPS; understanding how NPS and StorageGRID can provide a cost-effective alternative for dev/test and DR in Spark analytics; and understanding Spark architectures and deployment options that use data from multiple locations, including on-premises and cloud-based repositories.
4. IoT Data Flow
[Diagram: data flows from multiple Edge sites through the Core (data center and cloud) into a geo-distributed data lake]
Edge:
1. Data is created
2. Data is analyzed in real time
3. Data is aggregated and sent to the Core
Core:
1. Data is stored
2. Data is analyzed
3. Data is protected
5. IoT Data Management Challenges
Stages: Collect → Transport → Store → Analyze → Protect
• Flexibility and Agility
• Cost
• Data Protection
7. The NetApp Data Fabric
Helping customers unleash data to address their business imperatives:
• HARNESS the power of the hybrid cloud
• BUILD a next-generation data center
• MODERNIZE storage through data management
[Diagram: the NetApp Data Fabric spanning Public Cloud, Multi-Cloud, Next-Gen Data Center, and Enterprise IT]
12. Customer Scenario: Broadcasting Provider
• IoT data is received in AWS and analyzed using an Apache Spark cluster in AWS
• Data Management Challenge:
  – How to back up 10 TB of data without loading the cluster?
  – How to protect the data on premises?
14. Customer Scenario: IT Service Provider
• IoT data is received in AWS and analyzed using an Apache Spark cluster in AWS
• Data Management Challenge:
  – How to reduce the solution cost?
  – How to consume analytics services in the data center and in multiple clouds?
16. Customer Scenario: Insurance Company
• IoT data is received on premises and analyzed using a Cloudera Spark cluster across the data center and the cloud
• Data Management Challenge:
  – How to leverage cloud computation for analytics?
  – How to consume legacy data (7 PB) for analytics?
18. Customer Scenario: Large Bank
• IoT data is received on premises and stored in a Hadoop data lake; the data needs to be backed up for compliance
• Data Management Challenge:
  – How to reduce the backup window and optimize solution cost?
  – How to minimize the impact on analytics performance during backup?
19. Use Case: Backup for IoT Data
[Diagram: three backup architectures, summarized below]
Typical (current): HDFS (DAS) at the edge is backed up with `distcp -update` to a second HDFS cluster (DAS or SAN), or to tape for archive.
NetApp Solution A (proposed): HDFS snapshots are copied incrementally with `distcp -update -diff` to a NetApp NFS export; FlexClone provides space-efficient copies, and NetBackup archives the files.
NetApp Solution B: the NFS Connector writes data to NetApp NFS; SnapMirror replicates it, and FlexClone copies feed lots of parallel mappers.
NetApp Backup Solution A
• NetApp Snapshot backup
• Backup archival
• Cloud compatible
NetApp Backup Solution B
• Hadoop-native support
• Offloads the backup operation
• Enterprise management
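On the Hadoop side, the incremental flow above relies on HDFS snapshots (`hdfs dfs -createSnapshot /path s1`) followed by a snapshot-diff copy (`hadoop distcp -update -diff s1 s2 <src> <dst>`), which transfers only what changed between the two snapshots. The sketch below is a minimal plain-Python illustration of that idea, not the actual distcp implementation; the helper names and directory layout are hypothetical.

```python
import shutil
from pathlib import Path

def snapshot(root: Path) -> dict:
    """Record each file's size and mtime, like an HDFS snapshot listing."""
    return {p.relative_to(root): (p.stat().st_size, p.stat().st_mtime_ns)
            for p in root.rglob("*") if p.is_file()}

def incremental_backup(root: Path, target: Path,
                       snap_old: dict, snap_new: dict) -> list:
    """Copy only files that are new or changed between two snapshots,
    mirroring what `distcp -update -diff` achieves for HDFS."""
    changed = [rel for rel, meta in snap_new.items()
               if snap_old.get(rel) != meta]
    for rel in changed:
        dest = target / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(root / rel, dest)
    return changed
```

Because unchanged files are skipped entirely, the backup window scales with the change rate rather than the data-lake size, which is the point of the snapshot-based approach on slide 19.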
20. Customer Scenario: Online Music Distribution
• Large on-premises Hadoop data lake implementation with multiple Spark clusters
• Data Management Challenge:
  – How to make data available for dev/test teams?
  – How to build a new cluster in minutes from an existing cluster?
22. Customer Scenario: Online Marketplace
• Run analytics on archival data in the object store
• Data Management Challenge:
  – How to run Hadoop jobs against the object store?
  – How to archive the Hadoop data on the primary or a secondary site?
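Running Hadoop or Spark jobs against an S3-compatible object store such as StorageGRID typically means pointing Hadoop's S3A connector at the grid's S3 endpoint. The `core-site.xml` fragment below is a sketch under that assumption; the endpoint URL, bucket, and credentials are placeholders, not real StorageGRID values.

```xml
<!-- core-site.xml: point the S3A connector at a StorageGRID S3 endpoint.
     Endpoint and credentials below are placeholders. -->
<configuration>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>https://storagegrid.example.com:8082</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
  <property>
    <!-- Most on-premises S3 endpoints need path-style addressing. -->
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>
```

With this in place a job can read the archive directly, e.g. `hadoop fs -ls s3a://archive-bucket/iot/`, so analytics run against the archived data without first restoring it to HDFS.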
24. Spark Performance
• Input dataset size: 1.5 TB
• ONTAP: 52% better than JBOD
Test bed (HiBench, Spark Scala WordCount):
Type  | Hadoop Worker Nodes | Drives per Node | Number of Storage Arrays
JBOD  | 6                   | 12              | N/A
ONTAP | 6                   | 6               | 1
[Charts: Spark Scala WordCount throughput in MB/s (higher is better) and duration in seconds (lower is better), JBOD vs. ONTAP]
25. Key Takeaways
Flexibility and Agility
• On-demand analytics with hybrid-cloud/multi-cloud deployments
• Rapid provisioning of clusters for test/dev environments
Lower Cost
• Add storage capacity without adding compute nodes
• One copy of data vs. three copies for HDFS
• Data tiering with FabricPool
Enterprise Data Protection
• Efficient backup, DR, and archival
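The "one copy vs. three copies" point is simple arithmetic: HDFS default replication turns N TB of data into 3N TB of raw capacity, while a shared array holds a single copy (its own RAID and snapshot overhead aside). A sketch, using an illustrative 100 TB dataset rather than any figure from the deck:

```python
def raw_capacity_tb(dataset_tb: float, replication_factor: int) -> float:
    """Raw storage needed to hold a dataset at a given replication factor."""
    return dataset_tb * replication_factor

dataset = 100.0                           # illustrative dataset size in TB
hdfs_das = raw_capacity_tb(dataset, 3)    # HDFS default 3x replication
shared_nfs = raw_capacity_tb(dataset, 1)  # single copy on the array
savings = 1 - shared_nfs / hdfs_das
print(hdfs_das, shared_nfs, f"{savings:.0%}")  # 300.0 100.0 67%
```

The same arithmetic explains why capacity can be added without adding compute nodes: storage grows on the array independently of the worker count.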