4. Data Flow Models
Restrict the programming interface so that the
system can do more automatically
Express jobs as graphs of high-level operators
» System picks how to split each operator into tasks
and where to run each task
» Run parts twice for fault recovery
New example: Pregel (Google's parallel graph processing system)
7. Pregel Data Flow
[Diagram: the input graph, vertex states 1, and messages 1 feed superstep 1, which groups messages by vertex ID to produce vertex states 2 and messages 2; superstep 2 groups by vertex ID again, and so on.]
8. Simple Pregel in Spark
Separate RDDs for the immutable graph state and for the vertex states and messages at each iteration
Use groupByKey to perform each step
Cache the resulting vertex and message RDDs
Optimization: co-partition the input graph and vertex state RDDs to reduce communication
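The superstep loop above can be sketched in plain Python. This is only an illustrative stand-in for the Spark version: the `superstep` and `compute` names are hypothetical, and a local dict-based group-by plays the role of groupByKey on (vertex ID, message) pairs.

```python
from collections import defaultdict

def superstep(vertex_state, messages, compute):
    """One Pregel superstep: group messages by vertex ID, then run the
    user's compute function to get each vertex's new state and the
    messages it sends. (Local stand-in for groupByKey.)"""
    grouped = defaultdict(list)
    for vid, msg in messages:
        grouped[vid].append(msg)
    new_state, out_msgs = {}, []
    for vid, state in vertex_state.items():
        new_state[vid], sent = compute(vid, state, grouped[vid])
        out_msgs.extend(sent)
    return new_state, out_msgs

# Example: propagate the minimum vertex ID around a 3-cycle.
edges = {1: [2], 2: [3], 3: [1]}
state = {1: 1, 2: 2, 3: 3}
msgs = [(v, state[v]) for v in state]  # initial self-messages

def compute(vid, s, incoming):
    new = min([s] + incoming)
    return new, [(nbr, new) for nbr in edges[vid]]

for _ in range(3):
    state, msgs = superstep(state, msgs, compute)
# state == {1: 1, 2: 1, 3: 1}
```

Keeping the vertex-state and message collections separate, as the slide suggests, is what lets the graph structure stay immutable (and cacheable) across iterations.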
9. Example: PageRank
Update ranks in parallel
Iterate until convergence
R[i] = 0.15 + Σ_{j ∈ Nbrs(i)} w_ji R[j]
(rank of user i = a constant plus the weighted sum of its neighbors' ranks)
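A minimal sketch of that update rule, assuming (as in the standard PageRank formulation, though the slide leaves w_ji unspecified) that w_ji = 0.85 / outdegree(j):

```python
def pagerank(links, iters=50):
    """Iterate R[i] = 0.15 + sum over in-neighbors j of w_ji * R[j],
    with w_ji assumed to be 0.85 / outdegree(j)."""
    ranks = {v: 1.0 for v in links}
    for _ in range(iters):
        contribs = {v: 0.0 for v in links}
        for j, nbrs in links.items():
            for i in nbrs:
                contribs[i] += 0.85 * ranks[j] / len(nbrs)
        ranks = {v: 0.15 + contribs[v] for v in links}
    return ranks

# Two pages linking to each other sit at the fixed point
# R = 0.15 + 0.85 * R, i.e. R = 1.0 for both.
ranks = pagerank({"a": ["b"], "b": ["a"]})
```

Each iteration is exactly one superstep of the Pregel data flow: ranks are sent along edges as contributions, then grouped and summed per vertex.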
10. PageRank in Pregel
[Diagram: the input graph and vertex ranks 1 produce contributions 1; superstep 1 groups and adds contributions by vertex to yield vertex ranks 2 and contributions 2; superstep 2 groups and adds by vertex again, and so on.]
14. Map Reduce Triplets
Map-Reduce for each vertex
[Diagram: on a graph with vertices A–F, mapF runs on the edges (A, B) and (A, C), emitting messages A1 and A2; reduceF(A1, A2) combines them into a single value at vertex A.]
15. Example: Oldest Follower
What is the age of the oldest follower for each user?

val oldestFollowerAge = graph
  .mrTriplets(
    e => (e.dst.id, e.src.age), // Map
    (a, b) => max(a, b)         // Reduce
  )
  .vertices

[Diagram: a follower graph over vertices A–F with user ages 23, 42, 30, 19, 75, and 16.]
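The mrTriplets pattern can be sketched in plain Python. This is not the GraphX API, just a local illustration: edges are (src, dst, src_age) triplets in a hypothetical follower graph where src follows dst.

```python
def mr_triplets(edges, map_f, reduce_f):
    """Sketch of the mapReduceTriplets pattern: map_f turns each edge
    into a (vertex id, message) pair; reduce_f combines all messages
    targeting the same vertex."""
    acc = {}
    for e in edges:
        vid, msg = map_f(e)
        acc[vid] = msg if vid not in acc else reduce_f(acc[vid], msg)
    return acc

# Hypothetical follower graph: (src, dst, src_age), src follows dst.
edges = [("B", "A", 42), ("C", "A", 30), ("D", "B", 19)]
oldest = mr_triplets(edges,
                     lambda e: (e[1], e[2]),  # Map: send follower's age to dst
                     max)                     # Reduce: keep the oldest
# oldest == {"A": 42, "B": 19}
```

Note that vertices with no followers (here C and D) simply get no entry, matching the behavior of a message-aggregation step.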
16. Summary of Operators
All operations:
https://spark.apache.org/docs/latest/graphx-programming-guide.html#summary-list-of-operators
Pregel API:
https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api
22. Distributing Matrices
How to distribute a matrix across machines?
» By Entries (CoordinateMatrix)
» By Rows (RowMatrix)
» By Blocks (BlockMatrix)
All of linear algebra is being rebuilt using these partitioning schemes (as of Spark 1.3)
23. Distributing Matrices
Even the simplest operations require thinking
about communication, e.g. multiplication
How many different matrix multiplies are needed?
» At least one per pair of {Coordinate, Row,
Block, LocalDense, LocalSparse} = 10
» More, because multiplication is not commutative (A×B and B×A need separate implementations)
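To see where the communication comes from, here is a small sketch of a BlockMatrix-style multiply. The block layout and function name are illustrative, not Spark's implementation: A is stored as blocks keyed by (i, k), B by (k, j), and every pair of blocks sharing an inner index k must meet on the same machine, which is exactly the shuffle step on a cluster.

```python
import numpy as np

def block_multiply(A_blocks, B_blocks):
    """Multiply two block-partitioned matrices. Pairing blocks on the
    shared inner index k is the communication (shuffle) step."""
    C = {}
    for (i, k), a in A_blocks.items():
        for (k2, j), b in B_blocks.items():
            if k == k2:
                C[(i, j)] = C.get((i, j), 0) + a @ b
    return C

# 2x2 matrices stored as four 1x1 blocks each.
A = {(i, k): np.array([[v]]) for (i, k), v in
     {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}.items()}
B = {(k, j): np.array([[v]]) for (k, j), v in
     {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}.items()}
C = block_multiply(A, B)  # same result as [[1,2],[3,4]] @ [[5,6],[7,8]]
```

A Coordinate-by-Row or Row-by-Block multiply would pair partitions differently, which is why each storage-format combination needs its own implementation.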
26. Singular Value Decomposition
Two cases
» Tall and Skinny
» Short and Fat (not really a separate case: transpose it into Tall and Skinny)
» Roughly Square
The SVD method on RowMatrix takes care of
deciding which one to call.
28. Tall and Skinny SVD
Gets
us
V
and
the
singular
values
Gets
us
U
by
one
matrix
multiplication
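A minimal numpy sketch of the Gram-matrix approach (function name is illustrative; assumes A has full column rank so no singular value is zero):

```python
import numpy as np

def tall_skinny_svd(A):
    """SVD of tall-and-skinny A (n >> k) via the k x k Gram matrix:
    eigendecompose A^T A to get V and the singular values, then recover
    U with one matrix multiplication, U = A V diag(1/s)."""
    G = A.T @ A                       # small k x k; cheap to centralize
    evals, V = np.linalg.eigh(G)      # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]   # sort descending, as SVD convention
    s = np.sqrt(np.maximum(evals[order], 0.0))
    V = V[:, order]
    U = A @ V / s                     # the one matrix multiplication
    return U, s, V

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))      # tall and skinny
U, s, V = tall_skinny_svd(A)          # A == U @ diag(s) @ V.T
```

In the distributed setting only AᵀA needs a cluster-wide reduction; the eigendecomposition itself runs on the driver because the k×k matrix is tiny.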
29. Square SVD
ARPACK: very mature Fortran77 package for
computing eigenvalue decompositions
JNI interface available via netlib-java
Distributed using Spark – how?
30. Square SVD via ARPACK
ARPACK only interfaces with the distributed matrix via
matrix-vector multiplies
The result of a matrix-vector multiply is small, and
the multiplication itself can be distributed.
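The pattern can be sketched as follows (names hypothetical; lists stand in for RDD partitions). Each worker multiplies its rows by the broadcast vector x, and only the small result vector is collected back for ARPACK:

```python
def distributed_matvec(row_partitions, x):
    """Sketch of the ARPACK-on-Spark pattern: the matrix lives as row
    partitions across workers; each partition computes its slice of A*x
    locally, and only the small result vector is collected."""
    result = []
    for rows in row_partitions:  # conceptually a map over workers
        result.extend(sum(a * b for a, b in zip(row, x)) for row in rows)
    return result

# Two partitions of a 3x2 matrix, multiplied by x = [2, 3].
partitions = [[[1, 0], [0, 1]], [[1, 1]]]
y = distributed_matvec(partitions, [2, 3])  # y == [2, 3, 5]
```

Since ARPACK only ever asks for such products, the big matrix never has to leave the cluster.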
31. Square SVD
With 68 executors, each with 8 GB of memory,
looking for the top 5 singular vectors