Aerospike Meetup - Real Time Insights using Spark with Aerospike - Zohar - 04 March 2020
2. 2 Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc.
▪ Where the Aerospike Spark Connector fits in the ecosystem
▪ A Quick Overview of the Aerospike Spark Connector
▪ Some Code Examples
▪ Scaling up: A Customer Story
Agenda
Aerospike Simplifies Real-time Architecture at any Scale
[Architecture diagram. Labels: Aerospike Database at SoE Locations 1, 2 and 3, linked by XDR; Transactional Systems; Enterprise Environment; Legacy Database (Mainframe); RDBMS Database; Data Warehouse (Legacy RDBMS); Data Lake (HDFS Based); System of Record; Real-time Data Warehouse; Query & Reporting Store.]
Delivering Extreme Scalability:
✓ Simplicity
✓ Maintainability
✓ Durability
✓ Strong Consistency
✓ Scalability
✓ Low Cost ($)
✓ Less Data Drag
Aerospike Connect for Spark
Example Use Cases
✓ Fraud prevention: streaming transaction data must be analyzed against historical data in real time
✓ Recommendation Engines: real-time recommendations and targeting based on user behavior
✓ Ad Tech: ad fraud detection and real-time retargeting based on user behavior
✓ Digital Identity Management
✓ Industrial Internet of Things (IIoT): real-time, closed-loop business decisions
• Spark connector for Aerospike: load Aerospike data and use it as a DataFrame (i.e. Spark SQL) or as streamed data
• Supports Scala (spark-shell) for all Aerospike Spark operations
• Supports Python (pyspark) for some operations; Dataset operations are not supported
• Guide: https://www.aerospike.com/docs/connectors/enterprise/spark/index.html
Aerospike Connect for Spark
• Use Spark SQL to fetch data from Aerospike
  • Aerospike Connect for Spark lets you use Spark SQL to query records from an Aerospike cluster.
• Load Aerospike data into Spark for processing
  • Load data from Aerospike into DataFrames for processing
  • The connector supports scans and queries (secondary indexes)
• Save data from a DataFrame back into Aerospike
  • A DataFrame can be saved in Aerospike by specifying a column in the DataFrame as the Primary Key or the Digest.
• Join data using Aerospike [Scala only]
  • Provides an AeroJoin function, which reads records from Aerospike given a Dataset containing the keys of the records of interest.
  • This operation takes advantage of Aerospike's batch read functionality.
Aerospike Spark Operations
Aerospike Spark Example: Spark SQL
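The code shown on this slide did not survive extraction. As a stand-in, here is a minimal pyspark sketch of loading one Aerospike set into a DataFrame and querying it with Spark SQL. The source name `"aerospike"` and the option key `"aerospike.set"` are assumptions recalled from the connector documentation and may differ between connector versions, so check them against the guide linked earlier. The function names are hypothetical, and the session is assumed to already have the connector JAR and its connection settings configured.

```python
def load_aerospike_set(spark, set_name, schema=None):
    """Return a DataFrame backed by one Aerospike set.

    `spark` is an existing SparkSession; nothing is imported here so the
    sketch stays independent of how the session was created.
    """
    reader = spark.read.format("aerospike")  # assumed data source name
    if schema is not None:
        # An explicit schema avoids relying on schema inference.
        reader = reader.schema(schema)
    # "aerospike.set" is an assumed option key naming the set to read.
    return reader.option("aerospike.set", set_name).load()


def query_with_spark_sql(spark, df, view_name, sql):
    """Expose `df` as a temporary view and run a Spark SQL statement on it."""
    df.createOrReplaceTempView(view_name)
    return spark.sql(sql)
```

A call might look like `query_with_spark_sql(spark, load_aerospike_set(spark, "users"), "users", "SELECT * FROM users WHERE age > 30")`, where the set name and column are made up for illustration.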
Save DataFrame to Aerospike (by Key, with schema)
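The original code for this slide is also missing. The sketch below shows one way a DataFrame might be written back with one of its columns used as the primary key. The option keys `"aerospike.writeset"` and `"aerospike.updateByKey"` and the source name `"aerospike"` are assumptions recalled from the connector documentation and should be verified for the connector version in use. The function name is hypothetical. The save mode is left to the caller because how the connector treats each Spark save mode is not covered by the slides.

```python
def save_by_key(df, set_name, key_column, mode):
    """Write `df` to an Aerospike set, keyed by the values in `key_column`.

    `mode` is a standard Spark save mode string such as "overwrite".
    """
    (
        df.write
        .mode(mode)
        .format("aerospike")                          # assumed data source name
        .option("aerospike.writeset", set_name)       # assumed option key: target set
        .option("aerospike.updateByKey", key_column)  # assumed option key: key column
        .save()
    )
```

A DataFrame created with an explicit schema, for example one with an `id` column, could then be saved with `save_by_key(df, "users", "id", "overwrite")`; both names are illustrative.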
Aerospike Spark Example: AeroJoin
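AeroJoin itself is Scala only and its code is not reproduced in this text. The plain-Python model below only illustrates the idea described earlier: given a collection of keys, fetch the matching records in batches rather than one at a time. It does not call the connector or the Aerospike client; `store` is a dictionary standing in for a cluster, and `batch_read` and `keyed_join` are hypothetical names.

```python
def batch_read(store, batch):
    """Simulated batch read: one call returns a record (or None) per key."""
    return [store.get(key) for key in batch]


def keyed_join(keys, store, batch_size=100):
    """Pair each key with its record, reading `batch_size` keys per call.

    Keys with no matching record are dropped, as in an inner join.
    """
    keys = list(keys)
    joined = []
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        records = batch_read(store, batch)
        joined.extend((k, r) for k, r in zip(batch, records) if r is not None)
    return joined


store = {"u1": {"age": 30}, "u2": {"age": 41}}
print(keyed_join(["u1", "u3", "u2"], store, batch_size=2))
# [('u1', {'age': 30}), ('u2', {'age': 41})]
```

Grouping the lookups is what lets a join over many keys avoid one network round trip per record.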
• Spark partitions data for workers, supervised by an executor (one per Spark node)
• An Aerospike scan (pre-4.9) scans data by Aerospike node (one per Aerospike node)
• That means there is a mismatch in parallelization between the number of cores on the Spark side and the number of nodes on the Aerospike side
Customer Story: Is Scaling an Issue?
Data is distributed evenly across nodes in a cluster using the Aerospike Smart
Partitions™ algorithm.
▪ Automatic Sharding
▪ 4096 Data Partitions
▪ Even distribution of
  ▪ Partitions across nodes
  ▪ Records across Partitions
  ▪ Data across Flash devices
▪ Primary and Replica Partitions
Aerospike Partitions: Even Data Distribution
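To make "even distribution of partitions across nodes" concrete, the arithmetic below splits the 4096 partitions over a hypothetical 5-node cluster. This is only a counting sketch of what an even split of primary partitions looks like. It is not the Smart Partitions algorithm, which the slides do not describe, and the function name is made up.

```python
def partitions_per_node(total_partitions, node_count):
    """Return how many partitions each node holds under an even split."""
    base, extra = divmod(total_partitions, node_count)
    # The first `extra` nodes hold one additional partition each.
    return [base + 1 if i < extra else base for i in range(node_count)]


counts = partitions_per_node(4096, 5)
print(counts)       # [820, 819, 819, 819, 819]
print(sum(counts))  # 4096
```

No node ends up with more than one partition above any other, whatever the cluster size.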
• Customer Environment:
  • 33 Aerospike nodes
  • Over 10B objects, over 125TB of unique data
  • ~200 Spark nodes with 36 cores each (~7200 total cores/workers)
• The Problem: Less than 1 percent utilization on the Spark side during data load operations.
• The Change: Aerospike 4.9 will allow scanning by partition instead of by node, so a scan can be split across 4096 partitions. Aerospike Spark Connector 2.0 supports partition scans.
• The Result:
  • The customer received a release candidate (RC) of Aerospike 4.9 + Spark Connector 2.0
  • Over 10B unique records (125TB of unique data) were scanned, loaded and filtered in ~45 minutes.
Customer Story: Scaling Things Up (With 4.9 RC Access)
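The figures on this slide can be checked with a little arithmetic. The sketch below assumes that each scan unit (a node before 4.9, a partition afterwards) keeps at most one Spark core busy, which is a simplification and not something the slides state. It also treats "over 10B", "125TB" (taken as decimal terabytes) and "~45 minutes" as exact, so the throughput numbers are rough estimates.

```python
AEROSPIKE_NODES = 33
PARTITIONS = 4096
SPARK_CORES = 200 * 36  # ~7200 cores in total


def core_utilization(scan_units, cores):
    """Fraction of cores that can be busy if each scan unit feeds one core."""
    return min(scan_units, cores) / cores


print(f"{core_utilization(AEROSPIKE_NODES, SPARK_CORES):.2%}")  # 0.46%
print(f"{core_utilization(PARTITIONS, SPARK_CORES):.2%}")       # 56.89%

# Rough throughput implied by the reported result.
seconds = 45 * 60
print(f"{125e12 / seconds / 1e9:.1f} GB/s")           # 46.3 GB/s
print(f"{10e9 / seconds / 1e6:.1f} M records/s")      # 3.7 M records/s
```

Under that assumption, 33 node-level scans can occupy well under 1 percent of the cores, which matches the reported problem, while 4096 partition-level scans can occupy more than half of them.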
Time for Q&A!
Thank You!
zelkayam@aerospike.com