Real-Time Insights by Leveraging Spark with Aerospike
Aerospike Spark Connector
Zohar Elkayam, Aerospike
2 Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc.
▪ Where the Aerospike Spark Connector Fits in the Ecosystem
▪ A Quick Overview of the Aerospike Spark Connector
▪ Some Code Examples
▪ Scaling up: A Customer Story
Agenda
Aerospike Simplifies Real-time Architecture at any Scale
[Architecture diagram. Labels: Transactional Systems; Aerospike Database at SoE Locations 1, 2 and 3, connected by XDR; Enterprise Environment with Legacy Database (Mainframe) and RDBMS Database; Real-time System of Record; Query & Reporting Store with Legacy RDBMS, Data Warehouse and HDFS-based Data Lake]
Delivering Extreme Scalability:
✓ Simplicity
✓ Maintainability
✓ Durability
✓ Strong Consistency
✓ Scalability
✓ Low Cost ($)
✓ Less Data Drag
Aerospike Connect for Spark
Aerospike Connect for Spark
Example Use Cases
✓ Fraud prevention: analyze streaming transaction data against historical data in real time
✓ Recommendation engines: real-time recommendations and targeting based on user behavior
✓ Ad tech: ad fraud and real-time retargeting based on user behavior
✓ Digital Identity Management
✓ Industrial Internet of Things (IIoT): Real-time &
closed loop business decisions
• Spark connection for Aerospike: load Aerospike data and use it either as a DataFrame (i.e. through Spark SQL) or as streamed data
• Supports Scala (spark-shell) for all of Aerospike's Spark operations
• Supports Python (pyspark) for some operations; Dataset operations are not supported
• Guide: https://www.aerospike.com/docs/connectors/enterprise/spark/index.html
Aerospike Connect for Spark
• Use Spark SQL to fetch data from Aerospike
  • Aerospike Connect for Spark lets you use Spark SQL to query records from an Aerospike cluster.
• Load Aerospike data into Spark for processing
  • Load data from Aerospike into DataFrames for processing
  • The connector supports scans and queries (secondary indexes)
• Save data from a DataFrame back into Aerospike
  • A DataFrame can be saved in Aerospike by specifying a column in the DataFrame as the primary key or the digest.
• Join data using Aerospike [Scala only]
  • Provides an AeroJoin function, which lets you read records from Aerospike given a Dataset that contains the keys of the records of interest.
  • This operation takes advantage of Aerospike's batch read functionality.
Aerospike Spark Operations
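The key-based join described above can be pictured with a small, purely conceptual sketch. It does not use the connector or the Aerospike client: `InMemoryStore`, `batch_get` and `join_by_key` are hypothetical names, the store is a plain dictionary, and the inner-join behavior (rows whose key is missing are dropped) is an assumption made for illustration.

```python
from typing import Dict, Iterable, List, Optional

Record = Dict[str, object]


class InMemoryStore:
    """Hypothetical stand-in for a key-value database (not the Aerospike client API)."""

    def __init__(self, records: Dict[str, Record]) -> None:
        self._records = records

    def batch_get(self, keys: List[str]) -> List[Optional[Record]]:
        """Return one result per requested key, or None where the key is missing."""
        return [self._records.get(k) for k in keys]


def join_by_key(rows: Iterable[Record], key_column: str,
                store: InMemoryStore, batch_size: int = 2) -> List[Record]:
    """Enrich each row with the stored record for its key, one batch at a time."""
    rows = list(rows)
    joined: List[Record] = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # One batch read per group of keys instead of one lookup per row.
        found = store.batch_get([str(r[key_column]) for r in batch])
        for row, rec in zip(batch, found):
            if rec is not None:  # assumed inner-join semantics: skip missing keys
                joined.append({**row, **rec})
    return joined


if __name__ == "__main__":
    store = InMemoryStore({"u1": {"country": "IL"}, "u3": {"country": "US"}})
    clicks = [{"user": "u1", "clicks": 3}, {"user": "u2", "clicks": 1},
              {"user": "u3", "clicks": 7}]
    for row in join_by_key(clicks, "user", store):
        print(row)
```

The point of batching is that the number of round trips grows with the number of batches rather than the number of rows, which is the property the slide attributes to Aerospike's batch reads.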
Aerospike Spark Example: Spark SQL
Save DataFrame to Aerospike (by Key, with schema)
Aerospike Spark Example: AeroJoin
• Spark partitions data across workers, supervised by an executor (one per Spark node)
• An Aerospike scan (pre-4.9) scans data by Aerospike node (one per Aerospike node)
• That means there is a mismatch in parallelization between the number of cores on the Spark side and the number of nodes on the Aerospike side
Customer Story: Is Scaling an Issue?
Data is distributed evenly across nodes in a cluster using the Aerospike Smart
Partitions™ algorithm.
▪ Automatic Sharding
▪ 4096 Data Partitions
▪ Even distribution of
  ▪ Partitions across nodes
  ▪ Records across Partitions
  ▪ Data across Flash devices
▪ Primary and Replica Partitions
Aerospike Partitions: Even Data Distribution
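As a rough illustration of how hashing keys into a fixed set of 4096 partitions spreads records evenly, here is a minimal sketch. SHA-1 is used only as a widely available stand-in hash and is an assumption of this example, not a statement about the digest Aerospike actually uses; `partition_id` and `partition_histogram` are hypothetical helper names.

```python
import hashlib
from collections import Counter

N_PARTITIONS = 4096  # partition count stated on the slide (2**12)


def partition_id(set_name: str, user_key: str) -> int:
    """Map a record key to a partition ID in the range 0..4095.

    Assumption: SHA-1 is a stand-in hash chosen for this sketch only.
    """
    digest = hashlib.sha1(f"{set_name}:{user_key}".encode("utf-8")).digest()
    # Keep 12 bits of the hash, since 2**12 == 4096 partitions.
    return int.from_bytes(digest[:2], "little") & (N_PARTITIONS - 1)


def partition_histogram(n_keys: int) -> Counter:
    """Count how many of n_keys synthetic keys land in each partition."""
    return Counter(partition_id("users", f"user-{i}") for i in range(n_keys))


if __name__ == "__main__":
    counts = partition_histogram(400_000)
    print("partitions used:", len(counts))
    print("records per partition (min, max):",
          min(counts.values()), max(counts.values()))
```

Because the partition is derived from a hash rather than from the key's value, sequential or skewed keys still spread across all partitions, which is what makes a partition a convenient unit of parallel work.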
• Customer environment:
  • 33 Aerospike nodes
  • Over 10B objects, over 125TB of unique data
  • ~200 Spark nodes with 36 cores each (~7200 total cores/workers)
• The problem: less than 1 percent utilization on the Spark side during the data load operation.
• The change: Aerospike 4.9 will allow scanning by partition instead of by node, so a scan can be split across 4096 partitions; Aerospike Spark Connector 2.0 supports partition scans.
• The result:
  • The customer got a release candidate (RC) of Aerospike 4.9 + Spark Connector 2.0
  • Over 10B unique records (125TB of unique data) were scanned, loaded and filtered in ~45 minutes.
Customer Story: Scaling Things Up (With 4.9 RC Access)
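The utilization figures follow from simple arithmetic, sketched below under one simplifying assumption: each scan unit (an Aerospike node before 4.9, a partition from 4.9 on) keeps exactly one Spark core busy. The helper names `utilization` and `split_partitions` are hypothetical and only illustrate the reasoning; they are not part of the connector.

```python
from typing import List

AEROSPIKE_NODES = 33     # from the customer environment
SPARK_CORES = 200 * 36   # ~200 Spark nodes x 36 cores = 7200
N_PARTITIONS = 4096      # Aerospike data partitions


def utilization(scan_units: int, cores: int) -> float:
    """Fraction of cores kept busy, assuming one core per scan unit."""
    return min(scan_units, cores) / cores


def split_partitions(n_partitions: int, n_tasks: int) -> List[range]:
    """Split partition IDs 0..n_partitions-1 into contiguous, near-equal ranges."""
    base, extra = divmod(n_partitions, n_tasks)
    ranges, start = [], 0
    for i in range(n_tasks):
        size = base + (1 if i < extra else 0)  # first `extra` tasks get one more
        ranges.append(range(start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    # Node-level scan: 33 / 7200, i.e. under 1 percent of the cores.
    print(f"per-node scan:      {utilization(AEROSPIKE_NODES, SPARK_CORES):.2%}")
    # Partition-level scan: 4096 / 7200, i.e. more than half of the cores.
    print(f"per-partition scan: {utilization(N_PARTITIONS, SPARK_CORES):.2%}")
    # Example split of the 4096 partitions over 512 tasks (8 partitions each).
    print(split_partitions(N_PARTITIONS, 512)[:2])
```

With 33 scan units the ceiling is 33 / 7200, about 0.46 percent, which is consistent with the "less than 1 percent" reported above; with 4096 scan units the same ceiling rises to roughly 57 percent.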
Time for Q&A!
Thank You!
zelkayam@aerospike.com

Aerospike Meetup - Real Time Insights using Spark with Aerospike - Zohar - 04 March 2020
