Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp Data Fabric and NetApp Private Storage with Nilesh Bagad and Karthikeyan Nagalingam
This session explains how NetApp simplifies the analysis of IoT data with Apache Spark clusters across data centers and the cloud, using NetApp Private Storage (NPS) for AWS/Azure, the NetApp Data Fabric, and the NetApp connectors for NFS and S3. IoT data originates at the edge in different geographical locations, and it can arrive at different data centers or in the cloud depending on sensor location. The challenge is how to combine these different data streams across data centers to generate wider insights.
Learn how the NetApp Data Fabric helps solve this challenge. In the Data Fabric architecture, IoT data is ingested via Kafka into an Apache Spark cluster running in AWS/Azure, but the data is stored in an NPS-provisioned NFS share through the NFS Connector. The IoT data in NPS can then be moved to on-premises data centers, or on-premises IoT data can be moved to NPS or ONTAP Cloud for processing in AWS/Azure using NetApp SnapMirror, FlexClone, or the NFS Connector. We'll also review how NetApp StorageGRID object storage maintains IoT data for archival purposes using an S3 target. These options allow you to analyze IoT data from AWS, StorageGRID, HDFS, or NFS, providing a feasible solution for deploying Spark clusters across data centers.
Takeaways will include identifying Spark challenges that can be remedied by extending your Spark environment to take advantage of NPS; understanding how NPS and StorageGRID can provide a cost-effective alternative for dev/test and DR in Spark analytics; and understanding Spark architectures and deployment options that use data from multiple locations, including on-premises and cloud-based repositories.
4. IoT Data Flow
[Diagram: data flows from multiple Edge sites through the Core (data center and cloud) into a geo-distributed data lake]
Edge:
1. Data is created
2. Data is analyzed in real time
3. Data is aggregated and sent to the Core
Core:
1. Data is stored
2. Data is analyzed
3. Data is protected
5. IoT Data Management Challenges
Stages: Collect → Transport → Store → Analyze → Protect
• Flexibility and Agility
• Cost
• Data Protection
7. The NetApp Data Fabric
Helping customers unleash data to address their business imperatives:
• HARNESS the power of the hybrid cloud
• BUILD a next-generation data center
• MODERNIZE storage through data management
[Diagram: the NetApp Data Fabric spanning Public Cloud, Multi-Cloud, Next-Gen Data Center, and Enterprise IT]
12. Customer Scenario: Broadcasting Provider
• IoT data is received in AWS and analyzed using an Apache Spark cluster in AWS
• Data Management Challenge:
  – How to back up 10 TB of data without loading the cluster?
  – How to protect the data on premises?
14. Customer Scenario: IT Service Provider
• IoT data is received in AWS and analyzed using an Apache Spark cluster in AWS
• Data Management Challenge:
  – How to reduce the solution cost?
  – How to consume analytics services in the data center and in multiple clouds?
16. Customer Scenario: Insurance Company
• IoT data is received on premises and analyzed using a Cloudera Spark cluster across the data center and the cloud
• Data Management Challenge:
  – How to leverage cloud computation for analytics?
  – How to consume legacy data (7 PB) for analytics?
18. Customer Scenario: Large Bank
• IoT data is received on premises and stored in a Hadoop data lake; the data needs to be backed up for compliance
• Data Management Challenge:
  – How to reduce the backup window and optimize solution cost?
  – How to minimize the impact on analytics performance during backup?
19. Use Case: Backup for IoT Data
[Diagram: three backup architectures, summarized below]
Typical (current): HDFS (DAS) at the edge is backed up with `distcp -update` to a second HDFS cluster (DAS or SAN), or to tape for archive.
NetApp Solution A (proposed): HDFS snapshots are copied incrementally with `distcp -update -diff` to a NetApp NFS export; FlexClone provides space-efficient copies, and NetBackup archives the files.
NetApp Solution B: the NFS Connector writes data to NetApp NFS; SnapMirror replicates it, and FlexClone copies feed lots of parallel mappers.
NetApp Backup Solution A
• NetApp Snapshot backup
• Backup archival
• Cloud compatible
NetApp Backup Solution B
• Hadoop-native support
• Offloads the backup operation
• Enterprise management
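On the Hadoop side, the incremental flow above relies on HDFS snapshots (`hdfs dfs -createSnapshot /path s1`) followed by a snapshot-diff copy (`hadoop distcp -update -diff s1 s2 <src> <dst>`), which transfers only what changed between the two snapshots. The sketch below is a minimal plain-Python illustration of that idea, not the actual distcp implementation; the helper names and directory layout are hypothetical.

```python
import shutil
from pathlib import Path

def snapshot(root: Path) -> dict:
    """Record each file's size and mtime, like an HDFS snapshot listing."""
    return {p.relative_to(root): (p.stat().st_size, p.stat().st_mtime_ns)
            for p in root.rglob("*") if p.is_file()}

def incremental_backup(root: Path, target: Path,
                       snap_old: dict, snap_new: dict) -> list:
    """Copy only files that are new or changed between two snapshots,
    mirroring what `distcp -update -diff` achieves for HDFS."""
    changed = [rel for rel, meta in snap_new.items()
               if snap_old.get(rel) != meta]
    for rel in changed:
        dest = target / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(root / rel, dest)
    return changed
```

Because unchanged files are skipped entirely, the backup window scales with the change rate rather than the data-lake size, which is the point of the snapshot-based approach on slide 19.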
20. Customer Scenario: Online Music Distribution
• Large on-premises Hadoop data lake implementation with multiple Spark clusters
• Data Management Challenge:
  – How to make data available for dev/test teams?
  – How to build a new cluster in minutes from an existing cluster?
22. Customer Scenario: Online Marketplace
• Run analytics on archival data in the object store
• Data Management Challenge:
  – How to run Hadoop jobs against the object store?
  – How to archive the Hadoop data on the primary or a secondary site?
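Running Hadoop or Spark jobs against an S3-compatible object store such as StorageGRID typically means pointing Hadoop's S3A connector at the grid's S3 endpoint. The `core-site.xml` fragment below is a sketch under that assumption; the endpoint URL, bucket, and credentials are placeholders, not real StorageGRID values.

```xml
<!-- core-site.xml: point the S3A connector at a StorageGRID S3 endpoint.
     Endpoint and credentials below are placeholders. -->
<configuration>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>https://storagegrid.example.com:8082</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
  <property>
    <!-- Most on-premises S3 endpoints need path-style addressing. -->
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>
```

With this in place a job can read the archive directly, e.g. `hadoop fs -ls s3a://archive-bucket/iot/`, so analytics run against the archived data without first restoring it to HDFS.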
24. Spark Performance
• Input dataset size: 1.5 TB
• ONTAP: 52% better than JBOD
Test bed (HiBench, Spark Scala WordCount):
Type  | Hadoop Worker Nodes | Drives per Node | Number of Storage Arrays
JBOD  | 6                   | 12              | N/A
ONTAP | 6                   | 6               | 1
[Charts: Spark Scala WordCount throughput in MB/s (higher is better) and duration in seconds (lower is better), JBOD vs. ONTAP]
25. Key Takeaways
Flexibility and Agility
• On-demand analytics with hybrid-cloud/multi-cloud deployments
• Rapid provisioning of clusters for test/dev environments
Lower Cost
• Add storage capacity without adding compute nodes
• One copy of data vs. three copies for HDFS
• Data tiering with FabricPool
Enterprise Data Protection
• Efficient backup, DR, and archival
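The "one copy vs. three copies" point is simple arithmetic: HDFS default replication turns N TB of data into 3N TB of raw capacity, while a shared array holds a single copy (its own RAID and snapshot overhead aside). A sketch, using an illustrative 100 TB dataset rather than any figure from the deck:

```python
def raw_capacity_tb(dataset_tb: float, replication_factor: int) -> float:
    """Raw storage needed to hold a dataset at a given replication factor."""
    return dataset_tb * replication_factor

dataset = 100.0                           # illustrative dataset size in TB
hdfs_das = raw_capacity_tb(dataset, 3)    # HDFS default 3x replication
shared_nfs = raw_capacity_tb(dataset, 1)  # single copy on the array
savings = 1 - shared_nfs / hdfs_das
print(hdfs_das, shared_nfs, f"{savings:.0%}")  # 300.0 100.0 67%
```

The same arithmetic explains why capacity can be added without adding compute nodes: storage grows on the array independently of the worker count.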