FastR + Apache Flink
Enabling distributed data
processing in R
<juan.fumero@ed.ac.uk>
Disclaimer
This is a work in progress
Outline
1. Introduction and motivation
○ Flink and R
○ Oracle Truffle + Graal
2. FastR and Flink
○ Execution Model
○ Memory Management
3. Preliminary Results
4. Conclusions and Future Work
5. [DEMO] within a distributed configuration
Introduction and
motivation
Why R?
http://spectrum.ieee.org
R is one of the most popular languages for data analytics
R is:
Functional
Lazy
Object Oriented
Dynamic
But, R is slow
[Figure: benchmark slowdown of R compared to C]
Problem
It is neither fast nor distributed
Apache Flink
● Framework for distributed stream and batch data processing
● Custom scheduler
● Custom optimizer for operations
● Hadoop/Amazon plugins
● YARN connector for cluster execution
● It runs on laptops, clusters and supercomputers with the same code
● High level APIs: Java, Scala and Python
Flink Example, Word Count
public class LineSplit … {
    public void flatMap(...) {
        // split on runs of non-word characters
        for (String token : value.split("\\W+"))
            if (token.length() > 0) {
                out.collect(new Tuple2<String, Integer>(token, 1));
            }
    }
…
DataSet<String> textInput = env.readTextFile(input);
DataSet<Tuple2<String, Integer>> counts = textInput.flatMap(new LineSplit())
    .groupBy(0)
    .sum(1);
[Figure: Flink architecture, showing the Flink client and optimizer, the JobManager, and two TaskManagers]
We propose a compilation approach
built within Truffle/Graal for
distributed computing on top of
Apache Flink.
Our Solution
How can we enable distributed data
processing in R exploiting Flink?
FastR on Top of Flink
* Flink Material
FastR
Truffle/Graal Overview
How to Implement Your Own Language?
Oracle Labs material
Truffle/Graal Infrastructure
Oracle Labs material
Truffle node specialization
Oracle Labs material
Deoptimization and rewriting
Oracle Labs material
More Details about Truffle/Graal
https://wiki.openjdk.java.net/display/Graal/Publications+and+Presentations
[Thomas Würthinger, Christian Wimmer, Andreas Wöß, Lukas Stadler, Gilles
Duboscq, Christian Humer, Gregor Richards, Doug Simon, Mario Wolczko]
One VM to Rule Them All
In Proceedings of Onward!, 2013.
OpenJDK Graal - New JIT for Java
● Graal Compiler is a JIT compiler for Java which is written in Java
● The Graal VM is a modification of the HotSpot VM: it replaces the client and server compilers with the Graal compiler.
● Open Source: http://openjdk.java.net/projects/graal/
FastR + Flink:
implementation
Our Solution: FastR - Flink Compiler
● Our goal:
○ Run R data processing applications in a distributed system as easily as possible
● Approach
○ Custom FastR-Flink compiler builtins (library)
○ Minimal R/Flink setup and configuration
○ “Offload” hard computation if the cluster is available
FastR-Flink API Example - Word Count
# Word Count in R
bigText <- flink.readTextFile("hdfs://192.168.1.10/user/juan/quijote.txt")
createTuples <- function(text) {
    words <- strsplit(text, " ")[[1]]
    tuples = list()
    for (w in words) {
        tuples[[length(tuples)+1]] = list(w, 1)
    }
    return (tuples)
}
splitText <- flink.flatMap(bigText, createTuples)
groupBy <- flink.groupBy(splitText, 0)
count <- flink.sum(groupBy, 1)
hdfsPath <- "hdfs://192.168.1.10/user/juan/newQUIJOTE.txt"
flink.writeAsText(count, hdfsPath)
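As a rough local analogy for what the three operators compute (not how FastR-Flink executes them), the same flatMap, groupBy(0), sum(1) sequence can be sketched in plain Java on one machine. The class and method names are illustrative only and belong to neither FastR-Flink nor Flink; unlike the R code above, this sketch skips empty tokens.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local sketch: mirrors flatMap -> groupBy(0) -> sum(1) without Flink.
final class LocalWordCount {

    // flatMap step: one (word, 1) tuple per word, like createTuples in the R example.
    static List<Map.Entry<String, Integer>> createTuples(String text) {
        List<Map.Entry<String, Integer>> tuples = new ArrayList<>();
        for (String w : text.split(" ")) {
            if (!w.isEmpty()) {
                tuples.add(new SimpleEntry<>(w, 1));
            }
        }
        return tuples;
    }

    // groupBy(0) + sum(1): group on the word (field 0) and add up the counts (field 1).
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> t : createTuples(line)) {
                counts.merge(t.getKey(), t.getValue(), Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {Mancha=1, de=1, en=1, la=1, lugar=2, un=2}
        System.out.println(count(Arrays.asList("en un lugar", "un lugar de la Mancha")));
    }
}
```

In the distributed version the grouping and summing would happen across TaskManagers; here everything runs in a single loop.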
Supported Operations
Operation                   Description
flink.sapply                Local and remote apply (blocking)
flink.execute               Execute previous operation (blocking)
flink.collect               Execute the operations (blocking)
flink.map                   Apply local/remote map (non-blocking)
flink.sum                   Sum arrays (non-blocking)
flink.groupBy               Group tuples (non-blocking)
flink.reduce                Reduction operation (blocking and non-blocking versions)
flink.readTextFile          Read from HDFS (non-blocking)
flink.filter                Filter data according to the function (blocking and non-blocking)
flink.arrayMap              Map with bigger chunks per Flink thread
flink.connectToJobManager   Establish the connection with the main server
flink.setParallelism        Set the parallelism degree for future Flink operations
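One way to read the blocking / non-blocking distinction is deferred execution: non-blocking operations only record work, and a blocking operation runs everything recorded so far. The slides do not spell out the mechanism, so the Java sketch below is an assumption, and `LazyPipeline` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch of deferred execution; not the FastR-Flink implementation.
final class LazyPipeline<T> {
    private final List<T> input;
    // Recorded steps; each one transforms the whole list when the pipeline finally runs.
    private final List<Function<List<T>, List<T>>> pending = new ArrayList<>();
    int executions = 0; // how many times the pipeline actually ran

    LazyPipeline(List<T> input) {
        this.input = input;
    }

    // Non-blocking: records the map step and returns immediately.
    LazyPipeline<T> map(Function<T, T> f) {
        pending.add(data -> data.stream().map(f).collect(Collectors.toList()));
        return this;
    }

    // Non-blocking: records the filter step and returns immediately.
    LazyPipeline<T> filter(Predicate<T> p) {
        pending.add(data -> data.stream().filter(p).collect(Collectors.toList()));
        return this;
    }

    // Blocking: runs every recorded step in order and returns the result.
    List<T> collect() {
        List<T> data = input;
        for (Function<List<T>, List<T>> step : pending) {
            data = step.apply(data);
        }
        executions++;
        return data;
    }

    public static void main(String[] args) {
        LazyPipeline<Integer> p = new LazyPipeline<Integer>(Arrays.asList(1, 2, 3, 4))
                .map(x -> x * x)
                .filter(x -> x > 4);
        System.out.println(p.executions); // 0, nothing has run yet
        System.out.println(p.collect());  // [9, 16]
    }
}
```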
Assumptions
● Distributed code with no side effects allowed → better for parallelism
● R is mostly a functional programming language
alpha <- 0.01652
g <- function(x) {
    …
    if (x < 10) {
        alpha <- 123.46543
    }
    …
}
Restrictions
● Applying an R function that changes the data type
f: Integer → (String | Double)
f <- function(x) {
    if (x < 10) {
        return("YES")
    }
    else {
        return(0.123)
    }
}
> f(2)
[1] "YES" # String
> f(100)
[1] 0.123 # Double
In FastR + Flink we infer the data type from the first execution.
The type cannot change afterwards with Flink.
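The restriction can be illustrated with a small Java sketch: the first call fixes the output type, and any later call that returns a different type is rejected. This is only a model of the rule stated above; the class and method names are hypothetical, and the real check in FastR-Flink may look different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: fix the output type from the first call, then enforce it.
final class FirstCallTypeInference {

    static <T> List<Object> mapWithInferredType(List<T> input, Function<T, Object> f) {
        List<Object> out = new ArrayList<>();
        if (input.isEmpty()) {
            return out;
        }
        // The first execution decides the output type for the whole data set.
        Object first = f.apply(input.get(0));
        Class<?> inferred = first.getClass();
        out.add(first);
        for (T x : input.subList(1, input.size())) {
            Object y = f.apply(x);
            if (y.getClass() != inferred) {
                throw new IllegalStateException("expected " + inferred.getSimpleName()
                        + " but got " + y.getClass().getSimpleName() + " for input " + x);
            }
            out.add(y);
        }
        return out;
    }

    public static void main(String[] args) {
        // Same shape as f on the slide: a String below 10, a Double otherwise.
        Function<Integer, Object> f = x -> {
            if (x < 10) {
                return "YES";
            }
            return 0.123;
        };
        System.out.println(mapWithInferredType(Arrays.asList(2, 3), f)); // [YES, YES]
        try {
            mapWithInferredType(Arrays.asList(2, 100), f);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // expected String but got Double for input 100
        }
    }
}
```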
FastR + Flink:
Execution model
Cool, so how does it work?
Where R you?
ipJobManager <- "192.168.1.10"
flink.connectToJobManager(ipJobManager)
flink.setParallelism(2048)
userFunction <- function(x) {
    x * x
}
result <- flink.sapply(1:10000, userFunction)
VMs cluster organization
[Figure: cluster diagram with one JobManager and two TaskManagers; labels: FastR + GraalVM (three times), R Source Code (twice)]
FastR-Flink Execution Workflow
1a) Identify the skeleton
R User Program:
map <- flink.sapply(input, f)
FastR-Flink Compiler:
switch (builtin) {
    …
    case "flink.sapply":
        return new FlinkFastRMap();
    …
}
Parallel Map Node
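A self-contained Java sketch of the same idea: the builtin name selects a node, and the node carries the parallel behaviour. `Node`, `ParallelMapNode` and `lookup` are hypothetical stand-ins, not FastR classes, and the node here uses a local parallel stream rather than Flink.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical sketch of skeleton identification; names are illustrative only.
final class SkeletonDispatch {

    // A "node" is whatever the compiler substitutes for the builtin call.
    interface Node {
        List<Integer> execute(List<Integer> input, UnaryOperator<Integer> f);
    }

    // Stand-in for a parallel map node: applies f to every element, keeping order.
    static final class ParallelMapNode implements Node {
        @Override
        public List<Integer> execute(List<Integer> input, UnaryOperator<Integer> f) {
            return input.parallelStream().map(f).collect(Collectors.toList());
        }
    }

    // The builtin name selects the node, as in the switch on the slide.
    static Node lookup(String builtin) {
        switch (builtin) {
            case "flink.sapply":
                return new ParallelMapNode();
            default:
                throw new IllegalArgumentException("unsupported builtin: " + builtin);
        }
    }

    public static void main(String[] args) {
        Node node = lookup("flink.sapply");
        System.out.println(node.execute(Arrays.asList(1, 2, 3), x -> x * x)); // [1, 4, 9]
    }
}
```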
1b) Create the JAR File
Jar compilation for User Defined Function:
public static ArrayList<String> compileDefaultUserFunctions() {
    for (String javaSource : ClassesToCompile) {
        JavaCompileUtils.compileJavaClass(...);
        String jarFileName = ...
        String jarPath = FastRFlinkUtils.buildTmpPathForJar(jarFileName);
        jarFiles.add(jarPath);
        JavaCompileUtils.createJarFile(...);
    }
    jarFiles.add("fastr.jar");
    return jarFiles;
}
2-3. Data Type Inference and Flink Objects
private static <T> MapOperator<?, ?> remoteApply(Object input, RFunction function, RAbstractVector[] args, String[] scopeVars, String returns) {
    Object[] argsPackage = FastRFlinkUtils.getArgsPackage(nArgs, function, (RAbstractVector) input, args, argsName);
    firstOutput = function.getTarget().call(argsPackage);
    returnsType = inferOutput(firstOutput);
    ArrayList<T> arrayInput = prepareArrayList(returnsType, input, args);
    MapOperator<?, ?> mapOperator = computation(arrayInput, codeSource, args, argsName, returnsType, scopeVars, frame);
    MetadataPipeline mapPending = new MetadataPipeline(input, firstOutput, returns);
    mapPending.setIntermediateReturnType(returnsType);
    FlinkPromises.insertOperation(Operation.REMOTE_MAP, mapPending);
    return mapOperator;
}
4. Code Distribution
Send the JAR files:
ArrayList<String> compileUDF = FlinkUDFCompilation.compileDefaultUserFunctions();
RemoteEnvironment fastREnv = new RemoteEnvironment();
fastREnv.createRemoteEnvironment(FlinkJobManager.jobManagerServer, FLINK_PORT, jarFiles);
The Flink UDF class also receives the string that represents the R user code:
inputSet.map(new FastRFlinkMap(codeSource)).setParallelism(parallelism).returns(returns);
5. Data Transformation
RVector → Flink DataSets
Flink DataSets preparation
RVectors in FastR are not serializable → boxing and unboxing are needed
Paper: “Runtime Code Generation and Data Management for Heterogeneous Computing in Java”
Creating a view for the data:
6. FastR-Flink Execution, JIT Compilation
@Override
public R map(T t) throws Exception {
    if (!setup) {
        setup();
    }
    Object[] argsFunction = composeArgsForRFunction(t);
    Object[] argumentsPackage = RArguments.create(userFunction, null, null, 0, argsFunction, rArguments, null);
    Object result = target.call(argumentsPackage);
    return (R) new Tuple2<>(t.getField(0), result);
}
● Each Flink execution thread has a copy of the R code
● It builds the AST and starts the interpretation using Truffle
● After a while it will reach the compiled code (JIT)
● The language context (RContext) is shared among the threads
● Scope variables are passed by broadcast
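The "ship the source string, set up lazily on first call" pattern from the map function above can be sketched in plain Java. The object that travels carries only the code string; the executable form is a transient field rebuilt on the receiving side. Everything here is a stand-in: `LazySetupMap` is a hypothetical name, the lookup table in `setup()` replaces parsing R and building a Truffle AST, and Java serialization stands in for sending the UDF to a worker.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch: ship source text, rebuild the executable form lazily per worker.
final class LazySetupMap implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String codeSource;              // travels with the serialized object
    private transient DoubleUnaryOperator target; // rebuilt on each worker, never shipped
    private transient int setups;                 // how often setup() ran on this copy

    LazySetupMap(String codeSource) {
        this.codeSource = codeSource;
    }

    // Stand-in for parsing the R source and building its AST.
    private void setup() {
        switch (codeSource) {
            case "function(x) x * x":
                target = x -> x * x;
                break;
            case "function(x) x + 1":
                target = x -> x + 1;
                break;
            default:
                throw new IllegalArgumentException("unknown source: " + codeSource);
        }
        setups++;
    }

    // Runs setup only on the first call, then reuses the prepared target.
    double map(double x) {
        if (target == null) {
            setup();
        }
        return target.applyAsDouble(x);
    }

    int setups() {
        return setups;
    }

    // Serialize and deserialize, standing in for sending the object to a worker.
    static LazySetupMap shipToWorker(LazySetupMap m) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(m);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (LazySetupMap) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        LazySetupMap worker = shipToWorker(new LazySetupMap("function(x) x * x"));
        System.out.println(worker.map(3.0) + " " + worker.map(4.0)); // 9.0 16.0
        System.out.println(worker.setups());                          // 1
    }
}
```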
7. Data Transformation - Get the result
RVector ← Flink DataSets
Build RVectors:
The R user gets the final results within the right R data representation
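Steps 5 and 7 can be pictured as a boxing / unboxing round trip. The Java sketch below boxes a primitive vector into (index, value) tuples and rebuilds the vector from them. Pairing each value with its index is an assumption suggested by the `Tuple2<>(t.getField(0), result)` in the map function; `VectorBoxing` is a hypothetical name and uses neither FastR nor Flink types.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two data transformations; not FastR or Flink code.
final class VectorBoxing {

    // Step 5 analogue: box a primitive vector into serializable (index, value) tuples.
    static List<Map.Entry<Integer, Double>> box(double[] vector) {
        List<Map.Entry<Integer, Double>> tuples = new ArrayList<>(vector.length);
        for (int i = 0; i < vector.length; i++) {
            tuples.add(new SimpleEntry<>(i, vector[i]));
        }
        return tuples;
    }

    // Step 7 analogue: unbox tuples back into a primitive vector, using the index
    // so the result is correct even if the tuples arrive out of order.
    static double[] unbox(List<Map.Entry<Integer, Double>> tuples) {
        double[] vector = new double[tuples.size()];
        for (Map.Entry<Integer, Double> t : tuples) {
            vector[t.getKey()] = t.getValue();
        }
        return vector;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, Double>> tuples = box(new double[] {1.5, 2.5, 3.5});
        Collections.reverse(tuples); // simulate results coming back in another order
        System.out.println(Arrays.toString(unbox(tuples))); // [1.5, 2.5, 3.5]
    }
}
```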
Flink Web Interface
● It works with FastR!!!
● Track Jobs
● Check history
● Check configuration
But, Java trace.
● Maybe not very useful for R users.
Memory Management
What about memory management on the TaskManager?
By default, Apache Flink reserves 70% of the heap for hashing.
Solution: set taskmanager.memory.fraction: 0.5 in the config file.
User Space:
● Graal Compilation Threads
● Truffle Compilation Threads
● R Interpreter
● R memory space
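A back-of-the-envelope Java sketch of what the fraction means for the user space. It treats the fraction as a share of the whole heap and ignores network buffers and JVM overhead, so the numbers are rough; the 12 GB heap is borrowed from the evaluation setup purely as an example, and `HeapSplit` is a hypothetical helper.

```java
// Hypothetical helper for rough estimates; ignores network buffers and JVM overhead.
final class HeapSplit {

    // Heap reserved by Flink for its own managed memory, in megabytes.
    static long managedMb(long heapMb, double fraction) {
        return Math.round(heapMb * fraction);
    }

    // What remains for the user space: compilation threads, R interpreter, R data.
    static long userMb(long heapMb, double fraction) {
        return heapMb - managedMb(heapMb, fraction);
    }

    public static void main(String[] args) {
        long heapMb = 12 * 1024; // example value: a 12 GB heap
        for (double fraction : new double[] {0.7, 0.5}) {
            System.out.println("fraction " + fraction
                    + ": managed " + managedMb(heapMb, fraction) + " MB"
                    + ", user " + userMb(heapMb, fraction) + " MB");
        }
        // fraction 0.7: managed 8602 MB, user 3686 MB
        // fraction 0.5: managed 6144 MB, user 6144 MB
    }
}
```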
Some Questions
● Is there any heuristic to know the ideal heap distribution?
● Is there any way to format the names of the operations for the Flink Web Interface, for example with R function names or custom names for easy tracking?
● Is there any way to attach custom initialization code into TaskManager?
Preliminary Results
Evaluation Setup
● Shared memory:
○ Big server machine
■ Xeon(R) 32 real cores
■ Heap: 12GB
■ FastR compiled with OpenJDK 1.8_60
● 4 Benchmarks:
○ PI approximation
○ Binary Trees
○ Binomial
○ Nested-Loops (high computation)
Preliminary Results
And Distributed Memory?
● It works (I will show you in a demo in a bit)
● Some performance issues concerning:
○ Multi-thread context issues.
○ Each Truffle language has a context initialized within the application
○ It will need pre-initialization within the Flink TaskManager
● This is work in progress
Conclusions and
Future work
Conclusions
● Prototype: an R implementation with some Flink operations included in the compiler
● FastR-Flink in the compiler
● Easy installation and easy programmability
● Good speedups in shared memory for high computation
● Investigating distributed memory to increase speedups
Future Work
● A world to explore:
○ Solve distributed performance issues
○ More realistic benchmarks (big data)
○ Distributed benchmarks
○ Support more Flink operations
○ Support Flink for other Truffle languages using the same approach
○ Define a stable API for high level programming languages
○ Optimisations (caching)
○ GPU support based on our previous work [1]
[1] Juan Fumero, Toomas Remmelg, Michel Steuwer and Christophe Dubach. Runtime Code Generation and Data Management for Heterogeneous Computing in Java. 2015 International Conference on Principles and Practices of Programming on the Java Platform (PPPJ).
Check it out! It is Open Source
Clone and compile:
$ mx sclone https://bitbucket.org/allr/fastr-flink
$ hg update -r rflink-0.1
$ mx build
Run:
$ mx R # start the interpreter with Flink local environment
Thanks for your attention
> flink.ask(questions)
AND DEMO !
Contact: Juan Fumero <juan.fumero@ed.ac.uk>
Thanks to Oracle Labs, TU Berlin and
The University of Edinburgh

Architecture decision records - How not to get lost in the pastPapp Krisztián
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburgmasabamasaba
 
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...masabamasaba
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...masabamasaba
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 

Kürzlich hochgeladen (20)

%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open SourceWSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security Program
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
 
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 

FastR+Apache Flink

  • 1. FastR + Apache Flink Enabling distributed data processing in R <juan.fumero@ed.ac.uk>
  • 2. Disclaimer This is a work in progress
  • 3. Outline 1. Introduction and motivation ○ Flink and R ○ Oracle Truffle + Graal 2. FastR and Flink ○ Execution Model ○ Memory Management 3. Preliminary Results 4. Conclusions and Future Work 5. [DEMO] within a distributed configuration
  • 5. Why R? R is one of the most popular languages for data analytics (source: http://spectrum.ieee.org). R is: functional, lazy, object oriented, dynamic.
  • 6. But, R is slow. Benchmark: slowdown compared to C
  • 7. Problem It is neither fast nor distributed
  • 9. Apache Flink ● Framework for distributed stream and batch data processing ● Custom scheduler ● Custom optimizer for operations ● Hadoop/Amazon plugins ● YARN connector for cluster execution ● It runs on laptops, clusters and supercomputers with the same code ● High level APIs: Java, Scala and Python
  • 10. Flink Example, Word Count
        public class LineSplit … {
            public void flatMap(...) {
                // Split each line on runs of non-word characters
                for (String token : value.split("\\W+"))
                    if (token.length() > 0) {
                        out.collect(new Tuple2<String, Integer>(token, 1));
                    }
            }
        …
        DataSet<String> textInput = env.readTextFile(input);
        DataSet<Tuple2<String, Integer>> counts = textInput.flatMap(new LineSplit())
            .groupBy(0)
            .sum(1);
    Diagram labels: Flink Client and Optimizer, JobManager, TaskManager, TaskManager
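The dataflow on this slide (split into tokens, group by word, sum the counts) can be tried without a cluster. The following is a minimal sketch in plain Java streams, not the Flink API; the class name `LocalWordCount` is hypothetical and only mirrors the flatMap / groupBy(0) / sum(1) steps on a single machine.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical local stand-in for the Flink word count; no Flink classes are used.
class LocalWordCount {
    // Split on runs of non-word characters, drop empty tokens,
    // then sum a count of 1 per occurrence of each word.
    static Map<String, Integer> count(String text) {
        return Arrays.stream(text.split("\\W+"))
                .filter(token -> token.length() > 0)
                .collect(Collectors.toMap(token -> token, token -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        // Prints {be=2, not=1, or=1, to=2}
        System.out.println(count("to be or not to be"));
    }
}
```

The merge function `Integer::sum` plays the role of `.sum(1)`, and the map key plays the role of `.groupBy(0)`.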
  • 11. Our Solution. How can we enable distributed data processing in R exploiting Flink? We propose a compilation approach built within Truffle/Graal for distributed computing on top of Apache Flink.
  • 12. FastR on Top of Flink (diagram based on Flink material)
  • 14. How to Implement Your Own Language? Oracle Labs material
  • 18. More Details about Truffle/Graal: https://wiki.openjdk.java.net/display/Graal/Publications+and+Presentations. Thomas Würthinger, Christian Wimmer, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Christian Humer, Gregor Richards, Doug Simon, Mario Wolczko. One VM to Rule Them All. In Proceedings of Onward!, 2013.
  • 19. OpenJDK Graal - New JIT for Java ● Graal Compiler is a JIT compiler for Java which is written in Java ● The Graal VM is a modification of the HotSpot VM -> It replaces the client and server compilers with the Graal compiler. ● Open Source: http://openjdk.java.net/projects/graal/
  • 21. Our Solution: FastR - Flink Compiler ● Our goal: ○ Run R data processing applications in a distributed system as easily as possible ● Approach ○ Custom FastR-Flink compiler builtins (library) ○ Minimal R/Flink setup and configuration ○ “Offload” hard computation if the cluster is available
  • 22. FastR-Flink API Example - Word Count
        # Word Count in R
        bigText <- flink.readTextFile("hdfs://192.168.1.10/user/juan/quijote.txt")
        createTuples <- function(text) {
            words <- strsplit(text, " ")[[1]]
            tuples = list()
            for (w in words) {
                tuples[[length(tuples)+1]] = list(w, 1)
            }
            return (tuples)
        }
        splitText <- flink.flatMap(bigText, createTuples)
        groupBy <- flink.groupBy(splitText, 0)
        count <- flink.sum(groupBy, 1)
        hdfsPath <- "hdfs://192.168.1.10/user/juan/newQUIJOTE.txt"
        flink.writeAsText(count, hdfsPath)
  • 23. Supported Operations
        Operation        Description
        flink.sapply     Local and remote apply (blocking)
        flink.execute    Execute previous operation (blocking)
        flink.collect    Execute the operations (blocking)
        flink.map        Apply local/remote map (non-blocking)
        flink.sum        Sum arrays (non-blocking)
        flink.groupBy    Group tuples (non-blocking)
  • 24. Supported Operations
        Operation                    Description
        flink.reduce                 Reduction operation (blocking and non-blocking versions)
        flink.readTextFile           Read from HDFS (non-blocking)
        flink.filter                 Filter data according to the function (blocking and non-blocking)
        flink.arrayMap               Map with bigger chunks per Flink thread
        flink.connectToJobManager    Establish the connection with the main server
        flink.setParallelism         Set the parallelism degree for future Flink operations
  • 25. Assumptions
        ● Distributed code with no side effects allowed → better for parallelism
        ● R is mostly a functional programming language
        alpha <- 0.01652
        g <- function(x) {
            …
            if (x < 10) {
                alpha <- 123.46543
            }
            …
        }
  • 26. Restrictions
        ● Applying R functions that change the data type, f: Integer → (String | Double)
        f <- function(x) {
            if (x < 10) {
                return("YES")
            } else {
                return(0.123)
            }
        }
        > f(2)
        [1] "YES"    # String
        > f(100)
        [1] 0.123    # Double
        In FastR + Flink the data type is inferred from the first execution. The type cannot be changed afterwards with Flink.
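The restriction above can be illustrated with a small guard that fixes the result type on the first call and rejects any later result of a different type. This is a minimal sketch in plain Java under that assumption; `FirstCallTypeGuard` is a hypothetical name and is not part of FastR or Flink.

```java
import java.util.function.Function;

// Hypothetical illustration: the result type is inferred from the first execution
// and every later result must have the same type.
class FirstCallTypeGuard<T> {
    private final Function<T, Object> userFunction;
    private Class<?> inferredType; // set by the first result, never changed afterwards

    FirstCallTypeGuard(Function<T, Object> userFunction) {
        this.userFunction = userFunction;
    }

    Object apply(T input) {
        Object result = userFunction.apply(input);
        if (inferredType == null) {
            inferredType = result.getClass();
        } else if (!inferredType.equals(result.getClass())) {
            throw new IllegalStateException("Expected " + inferredType.getSimpleName()
                    + " but got " + result.getClass().getSimpleName());
        }
        return result;
    }

    Class<?> getInferredType() {
        return inferredType;
    }

    public static void main(String[] args) {
        // Same shape as f on the slide: String for x < 10, Double otherwise.
        FirstCallTypeGuard<Integer> guard = new FirstCallTypeGuard<>(x -> {
            if (x < 10) {
                return "YES";
            }
            return 0.123;
        });
        System.out.println(guard.apply(2)); // first call fixes the type to String
        try {
            guard.apply(100);               // returns a Double, so it is rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```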
  • 27. FastR + Flink : Execution model
  • 28. Cool, so how does it work? Where R you?
        ipJobManager <- "192.168.1.10"
        flink.connectToJobManager(ipJobManager)
        flink.setParallelism(2048)
        userFunction <- function(x) {
            x * x
        }
        result <- flink.sapply(1:10000, userFunction)
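For comparison, the same computation (square every element of 1..10000, keeping the input order) can be written as a local parallel map. This is a sketch with plain Java streams and only shows what the `flink.sapply` call computes, not how FastR-Flink distributes it; `LocalSapply` is a hypothetical name.

```java
import java.util.stream.LongStream;

// Hypothetical local counterpart of flink.sapply(1:10000, userFunction).
class LocalSapply {
    // Apply x * x to 1..n in parallel; toArray() preserves the input order.
    static long[] squares(int n) {
        return LongStream.rangeClosed(1, n)
                .parallel()
                .map(x -> x * x)
                .toArray();
    }

    public static void main(String[] args) {
        long[] result = squares(10000);
        System.out.println(result[0] + " ... " + result[result.length - 1]);
    }
}
```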
  • 29. VMs cluster organization. Diagram labels: JobManager, TaskManager (×2), FastR + GraalVM (×3), R Source Code (×2)
  • 31. 1a) Identify the skeleton
        R User Program:
            map <- flink.sapply(input, f)
        FastR-Flink Compiler:
            switch (builtin) {
                …
                case "flink.sapply":
                    return new FlinkFastRMap();
                …
            }
        Parallel Map Node
  • 32. 1b) Create the JAR File
        Jar compilation for User Defined Function:
        public static ArrayList<String> compileDefaultUserFunctions() {
            for (String javaSource : ClassesToCompile) {
                JavaCompileUtils.compileJavaClass(...);
                String jarFileName = ...
                String jarPath = FastRFlinkUtils.buildTmpPathForJar(jarFileName);
                jarFiles.add(jarPath);
                JavaCompileUtils.createJarFile(...);
            }
            jarFiles.add("fastr.jar");
            return jarFiles;
        }
  • 33. 2-3. Data Type Inference and Flink Objects
        private static <T> MapOperator<?, ?> remoteApply(Object input, RFunction function,
                RAbstractVector[] args, String[] scopeVars, String returns) {
            Object[] argsPackage = FastRFlinkUtils.getArgsPackage(nArgs, function,
                    (RAbstractVector) input, args, argsName);
            firstOutput = function.getTarget().call(argsPackage);
            returnsType = inferOutput(firstOutput);
            ArrayList<T> arrayInput = prepareArrayList(returnsType, input, args);
            MapOperator<?, ?> mapOperator = computation(arrayInput, codeSource, args, argsName,
                    returnsType, scopeVars, frame);
            MetadataPipeline mapPending = new MetadataPipeline(input, firstOutput, returns);
            mapPending.setIntermediateReturnType(returnsType);
            FlinkPromises.insertOperation(Operation.REMOTE_MAP, mapPending);
            return mapOperator;
        }
  • 34. 4. Code Distribution
        Send the JAR files:
        ArrayList<String> compileUDF = FlinkUDFCompilation.compileDefaultUserFunctions();
        RemoteEnvironment fastREnv = new RemoteEnvironment();
        fastREnv.createRemoteEnvironment(FlinkJobManager.jobManagerServer, FLINK_PORT, jarFiles);
        The Flink UDF class also receives the String that represents the R user code:
        inputSet.map(new FastRFlinkMap(codeSource)).setParallelism(parallelism).returns(returns);
  • 35. 5. Data Transformation: RVector → Flink DataSets (Flink DataSets preparation). RVectors in FastR are not serializable, so boxing and unboxing are needed. Creating a view for the data: see the paper “Runtime Code Generation and Data Management for Heterogeneous Computing in Java”.
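The boxing and unboxing step can be pictured as a round trip between a primitive array and a list of boxed objects. This is a minimal sketch in plain Java in which a `double[]` stands in for an R numeric vector; it does not use FastR's `RVector` classes, and `VectorBoxing` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of boxing data before it is handed to a dataset
// and unboxing it again when the results come back.
class VectorBoxing {
    // Box a primitive array into a serializable list of Double objects.
    static ArrayList<Double> box(double[] values) {
        ArrayList<Double> boxed = new ArrayList<>(values.length);
        for (double v : values) {
            boxed.add(v);
        }
        return boxed;
    }

    // Unbox the list back into a primitive array.
    static double[] unbox(List<Double> boxed) {
        double[] values = new double[boxed.size()];
        for (int i = 0; i < values.length; i++) {
            values[i] = boxed.get(i);
        }
        return values;
    }

    public static void main(String[] args) {
        double[] roundTrip = unbox(box(new double[] {1.5, 2.5, 3.5}));
        System.out.println(Arrays.toString(roundTrip));
    }
}
```

Each conversion copies every element, which is the cost that a view over the original data is meant to avoid.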
  • 36. 6. FastR-Flink Execution, JIT Compilation
        @Override
        public R map(T t) throws Exception {
            if (!setup) {
                setup();
            }
            Object[] argsFunction = composeArgsForRFunction(t);
            Object[] argumentsPackage = RArguments.create(userFunction, null, null, 0, argsFunction, rArguments, null);
            Object result = target.call(argumentsPackage);
            return (R) new Tuple2<>(t.getField(0), result);
        }
  • 37. 6. FastR-Flink Execution, JIT Compilation ● Each Flink execution thread has a copy of the R code ● It builds the AST and starts the interpretation using Truffle ● After a while it will reach the compiled code (JIT) ● The language context (RContext) is shared among the threads ● Scope variables are passed by broadcast
  • 38. 7. Data Transformation - Get the result: Flink DataSets → RVector. Build RVectors: the R user gets the final results in the right R data representation
  • 39. Flink Web Interface ● It works with FastR! ● Track jobs ● Check history ● Check configuration ● But it shows Java traces, which may not be very useful for R users.
  • 41. And Memory Management on the TaskManager? By default, Apache Flink uses 70% of the heap for hashing. Solution: set taskmanager.memory.fraction: 0.5 in the config file. User space: ● Graal Compilation Threads ● Truffle Compilation Threads ● R Interpreter ● R memory space
  • 42. Some Questions ● Is there any heuristic to know the ideal heap distribution? ● Is there any way to format the names of the operations in the Flink Web Interface (R function names, custom names for easy tracking)? ● Is there any way to attach custom initialization code to the TaskManager?
  • 44. Evaluation Setup ● Shared memory: ○ Big server machine ■ Xeon(R) 32 real cores ■ Heap: 12GB ■ FastR compiled with OpenJDK 1.8_60 ● 4 Benchmarks: ○ PI approximation ○ Binary Trees ○ Binomial ○ Nested-Loops (high computation)
  • 47. And Distributed Memory? ● It works (I will show you in the demo in a bit) ● Some performance issues concerning: ○ Multi-thread context issues. ○ Each Truffle language has a context initialized within the application ○ It will need pre-initialization within the Flink TaskManager ● This is work in progress
  • 49. Conclusions ● Prototype: an R implementation with some Flink operations included in the compiler ● FastR-Flink in the compiler ● Easy installation and easy programmability ● Good speedups in shared memory for high computation ● Investigating distributed memory to increase speedups
  • 50. Future Work ● A world to explore: ○ Solve distributed performance issues ○ More realistic benchmarks (big data) ○ Distributed benchmarks ○ Support more Flink operations ○ Support Flink for other Truffle languages using the same approach ○ Define a stable API for high level programming languages ○ Optimisations (caching) ○ GPU support based on our previous work [1] [1] Juan Fumero, Toomas Remmelg, Michel Steuwer and Christophe Dubach. Runtime Code Generation and Data Management for Heterogeneous Computing in Java. 2015 International Conference on Principles and Practices of Programming on the Java Platform
  • 51. Check it out! It is Open Source
        Clone and compile:
        $ mx sclone https://bitbucket.org/allr/fastr-flink
        $ hg update -r rflink-0.1
        $ mx build
        Run:
        $ mx R    # start the interpreter with Flink local environment
  • 52. Thanks for your attention > flink.ask(questions) AND DEMO ! Contact: Juan Fumero <juan.fumero@ed.ac.uk> Thanks to Oracle Labs, TU Berlin and The University of Edinburgh