SlideShare ist ein Scribd-Unternehmen logo
1 von 33
Aljoscha Krettek / Till Rohrmann
Flink committers
Co-founders @ data Artisans
aljoscha@apache.org / trohrmann@apache.org
Data Analysis With
Apache Flink
What is Apache Flink?
1
Functional
API
Relational
API
Graph API
Machine
Learning
…
Iterative Dataflow Engine
Apache Flink Stack
2
Python
Gelly
Table
FlinkML
SAMOA
Batch Optimizer
DataSet (Java/Scala) DataStream (Java/Scala)
Stream Builder
Hadoop
M/R
Distributed Runtime
Local Remote Yarn Tez Embedded
Dataflow
Dataflow
*current Flink master + few PRs
Table
Example Use Case: Log
Analysis
3
What Seems to be the Problem?
 Collect clicks from a
webserver log
 Find interesting URLs
 Combine with user
data
4
Web server
log
user
data base
Interesting
User Data
Extract
Clicks
Combine
Massage
The Execution Environment
 Entry point for all Flink programs
 Creates DataSets from data sources
5
ExecutionEnvironment env =
ExecutionEnvironment.getExecutionEnvironment();
Getting at Those Clicks
6
DataSet<String> log = env.readTextFile("hdfs:///log");
DataSet<Tuple2<String, Integer>> clicks = log.flatMap(
(String line, Collector<Tuple2<String, Integer>> out) ->
String[] parts = in.split("*magic regex*");
if (isClick(parts)) {
out.collect(new Tuple2<>(parts[1],Integer.parseInt(parts[2])));
}
}
)
post /foo/bar… 313
get /data/pic.jpg 128
post /bar/baz… 128
post /hello/there… 42
The Table Environment
 Environment for dealing with Tables
 Converts between DataSet and Table
7
TableEnvironment tableEnv = new TableEnvironment();
Counting those Clicks
8
Table clicksTable = tableEnv.toTable(clicks, "url, userId");
Table urlClickCounts = clicksTable
.groupBy("url, userId")
.select("url, userId, url.count as count");
Getting the User Information
9
Table userInfo = tableEnv.toTable(…, "name, id, …");
Table resultTable = urlClickCounts.join(userInfo)
.where("userId = id && count > 10")
.select("url, count, name, …");
The Final Step
10
class Result {
public String url;
public int count;
public String name;
…
}
DataSet<Result> set =
tableEnv.toSet(resultTable, Result.class);
DataSet<Result> result =
set.groupBy("url").reduceGroup(new ComplexOperation());
result.writeAsText("hdfs:///result");
env.execute();
API in a Nutshell
 Element-wise
• map, flatMap, filter
 Group-wise
• groupBy, reduce, reduceGroup, combineGroup,
mapPartition, aggregate, distinct
 Binary
• join, coGroup, union, cross
 Iterations
• iterate, iterateDelta
 Physical re-organization
• rebalance, partitionByHash, sortPartition
 Streaming
• window, windowMap, coMap, ...
11
What happens under the
hood?
12
From Program to Dataflow
13
Flink Program
Dataflow Plan
Optimized Plan
Distributed Execution
14
Orchestration
Recovery
Master
Memory
Management
Serialization
Worker
Streaming
Network
Advanced Analysis:
Website Recommendation
15
Going Further
 Log analysis result:
Which user visited how
often which web site
 Which other websites
might they like?
 Recommendation by
collaborative filtering
16
Collaborative Filtering
 Recommend items based on users with
similar preferences
 Latent factor models capture underlying
characteristics of items and preferences of
user
 Predicted preference:
17
ˆru,i = xu
T
yi
Matrix Factorization
18
minX,Y ru,i - xu
T
yi( )
2
+l nu xu
2
+ ni yi
2
i
å
u
å
æ
è
ç
ö
ø
÷
ru,i¹0
å
R » XT
Y
Alternating least squares
 Iterative approximation
1. Fix X and optimize Y
2. Fix Y and optimize X
 Communication and
computation intensive
19
R=YX x
R=YX x
Matrix Factorization Pipeline
20
val featureExtractor = HashingFT()
val factorizer = ALS()
val pipeline = featureExtractor.chain(factorizer)
val clickstreamDS =
env.readCsvFile[(String, String, Int)](clickStreamData)
val parameters = ParameterMap()
.add(HashingFT.NumFeatures, 1000000)
.add(ALS.Iterations, 10)
.add(ALS.NumFactors, 50)
.add(ALS.Lambda, 1.5)
val factorization = pipeline.fit(clickstreamDS, parameters)
Clickstream
Data
Hashing
Feature
Extractor
ALS
Matrix
factorization
Does it Scale?
21
• 40 node GCE cluster, highmem-8
• 10 ALS iteration with 50 latent factors
• Based on Spark MLlib’s implementation
Scale of Netflix or Spotify
What Else Can You Do?
 Classification using SVMs
• Conversion goal prediction
 Clustering
• Visitor segmentation
 Multiple linear regression
• Visitor prediction
22
Closing
23
What Have You Seen?
 Flink is a general-purpose analytics system
 Highly expressive Table API
 Advanced analysis with Flink’s machine learning
library
 Jobs are executed on powerful distributed
dataflow engine
24
Flink Roadmap for 2015
 Additions to Machine Learning library
 Streaming Machine Learning
 Support for interactive programs
 Optimization for Table API queries
 SQL on top of Table API
25
26
flink.apache.org
@ApacheFlink
Backup Slides
28
WordCount in DataSet API
29
case class Word (word: String, frequency: Int)
val env = ExecutionEnvironment.getExecutionEnvironment()
val lines = env.readTextFile(...)
lines
.flatMap {line => line.split(" ").map(word => Word(word,1))}
.groupBy("word").sum("frequency”)
.print()
env.execute()
Java and Scala APIs offer the same functionality.
Log Analysis Code
30
ExecutionEnvironment env = TableEnvironment tableEnv = new TableEnvironment();
TableEnvironment tableEnv = new TableEnvironment();
DataSet<String> log = env.readTextFile("hdfs:///log");
DataSet<Tuple2<String, Integer>> clicks = log.flatMap(
new FlatMapFunction<String, Tuple2<String, Integer>>() {
public void flatMap(String in, Collector<Tuple2<>> out) {
String[] parts = in.split("*magic regex*");
if (parts[0].equals("click")) {
out.collect(new Tuple2<>(parts[1], Integer.parseInt(parts[4])));
}
}
});
Table clicksTable = tableEnv.toTable(clicks, "url, userId");
Table urlClickCounts = clicksTable
.groupBy("url, userId")
.select("url, userId, url.count as count");
Table userInfo = tableEnv.toTable(…, "name, id, …");
Table resultTable = urlClickCounts.join(userInfo)
.where("userId = id && count > 10")
.select("url, count, name, …");
DataSet<Result> result = tableEnv.toSet(resultTable, Result.class);
result.writeAsText("hdfs:///result");
env.execute();
Log Analysis Dataflow Graph
31
Log
Map
AggUsers
Join
Result
Group
Log
Map
AggUsers
Join
combine
partition
sort
merge
sort
Result
Group
partition
sort
Pipelined Execution
32
Only 1 Stage
(depending on join strategy)
Data transfer in-memory
and disk if needed
Note: Intermediate DataSets
are not necessarily “created”!

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Robert Metzger
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Stephan Ewen
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkKostas Tzoumas
 
Don't Cross The Streams - Data Streaming And Apache Flink
Don't Cross The Streams  - Data Streaming And Apache FlinkDon't Cross The Streams  - Data Streaming And Apache Flink
Don't Cross The Streams - Data Streaming And Apache FlinkJohn Gorman (BSc, CISSP)
 
Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015Henry Saputra
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Robert Metzger
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapKostas Tzoumas
 
Apache Flink: Past, Present and Future
Apache Flink: Past, Present and FutureApache Flink: Past, Present and Future
Apache Flink: Past, Present and FutureGyula Fóra
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream ProcessingGyula Fóra
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon PresentationGyula Fóra
 
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Vasia Kalavri
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Fabian Hueske
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupTill Rohrmann
 
Apache Flink: Streaming Done Right @ FOSDEM 2016
Apache Flink: Streaming Done Right @ FOSDEM 2016Apache Flink: Streaming Done Right @ FOSDEM 2016
Apache Flink: Streaming Done Right @ FOSDEM 2016Till Rohrmann
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonStephan Ewen
 
Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleFlink Forward
 
FlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache FlinkFlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache FlinkTheodoros Vasiloudis
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used forAljoscha Krettek
 

Was ist angesagt? (20)

Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Flink Streaming
Flink StreamingFlink Streaming
Flink Streaming
 
Don't Cross The Streams - Data Streaming And Apache Flink
Don't Cross The Streams  - Data Streaming And Apache FlinkDon't Cross The Streams  - Data Streaming And Apache Flink
Don't Cross The Streams - Data Streaming And Apache Flink
 
Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmap
 
Apache Flink: Past, Present and Future
Apache Flink: Past, Present and FutureApache Flink: Past, Present and Future
Apache Flink: Past, Present and Future
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream Processing
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon Presentation
 
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning Group
 
Apache Flink: Streaming Done Right @ FOSDEM 2016
Apache Flink: Streaming Done Right @ FOSDEM 2016Apache Flink: Streaming Done Right @ FOSDEM 2016
Apache Flink: Streaming Done Right @ FOSDEM 2016
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World London
 
Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at Scale
 
FlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache FlinkFlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache Flink
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 

Andere mochten auch

Data Stream Management
Data Stream ManagementData Stream Management
Data Stream Managementk_tauhid
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingRobert Metzger
 
Flink Batch Processing and Iterations
Flink Batch Processing and IterationsFlink Batch Processing and Iterations
Flink Batch Processing and IterationsSameer Wadkar
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep DiveVasia Kalavri
 
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015Till Rohrmann
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph ProcessingVasia Kalavri
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flinkdatamantra
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Slim Baltagi
 
ちょっと理解に自信がないな という皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)
ちょっと理解に自信がないなという皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)ちょっと理解に自信がないなという皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)
ちょっと理解に自信がないな という皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)hamaken
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Anton Kirillov
 

Andere mochten auch (14)

Data Stream Management
Data Stream ManagementData Stream Management
Data Stream Management
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer Checkpointing
 
Flink Batch Processing and Iterations
Flink Batch Processing and IterationsFlink Batch Processing and Iterations
Flink Batch Processing and Iterations
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015
 
Unified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache FlinkUnified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache Flink
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph Processing
 
Dbms vs dsms
Dbms vs dsmsDbms vs dsms
Dbms vs dsms
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink
 
ちょっと理解に自信がないな という皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)
ちょっと理解に自信がないなという皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)ちょっと理解に自信がないなという皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)
ちょっと理解に自信がないな という皆さまに贈るHadoop/Sparkのキホン (IBM Datapalooza Tokyo 2016講演資料)
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 
Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 

Ähnlich wie Data Analysis With Apache Flink

Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System OverviewFlink Forward
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksSlim Baltagi
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
 
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processing
January 2015 HUG: Apache Flink:  Fast and reliable large-scale data processingJanuary 2015 HUG: Apache Flink:  Fast and reliable large-scale data processing
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processingYahoo Developer Network
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flinkmxmxm
 
Overview of VS2010 and .NET 4.0
Overview of VS2010 and .NET 4.0Overview of VS2010 and .NET 4.0
Overview of VS2010 and .NET 4.0Bruce Johnson
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Tapping into Scientific Data with Hadoop and Flink
Tapping into Scientific Data with Hadoop and FlinkTapping into Scientific Data with Hadoop and Flink
Tapping into Scientific Data with Hadoop and FlinkMichael Häusler
 
Linaro Connect 2016 (BKK16) - Introduction to LISA
Linaro Connect 2016 (BKK16) - Introduction to LISALinaro Connect 2016 (BKK16) - Introduction to LISA
Linaro Connect 2016 (BKK16) - Introduction to LISAPatrick Bellasi
 
Why and how to leverage the power and simplicity of SQL on Apache Flink
Why and how to leverage the power and simplicity of SQL on Apache FlinkWhy and how to leverage the power and simplicity of SQL on Apache Flink
Why and how to leverage the power and simplicity of SQL on Apache FlinkFabian Hueske
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowKaxil Naik
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Operating system concepts
Operating system conceptsOperating system concepts
Operating system conceptsGreen Ecosystem
 
RDF Stream Processing: Let's React
RDF Stream Processing: Let's ReactRDF Stream Processing: Let's React
RDF Stream Processing: Let's ReactJean-Paul Calbimonte
 
First Flink Bay Area meetup
First Flink Bay Area meetupFirst Flink Bay Area meetup
First Flink Bay Area meetupKostas Tzoumas
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CARobert Metzger
 

Ähnlich wie Data Analysis With Apache Flink (20)

Flink internals web
Flink internals web Flink internals web
Flink internals web
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks
 
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processing
January 2015 HUG: Apache Flink:  Fast and reliable large-scale data processingJanuary 2015 HUG: Apache Flink:  Fast and reliable large-scale data processing
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processing
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
Overview of VS2010 and .NET 4.0
Overview of VS2010 and .NET 4.0Overview of VS2010 and .NET 4.0
Overview of VS2010 and .NET 4.0
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Tapping into Scientific Data with Hadoop and Flink
Tapping into Scientific Data with Hadoop and FlinkTapping into Scientific Data with Hadoop and Flink
Tapping into Scientific Data with Hadoop and Flink
 
Linaro Connect 2016 (BKK16) - Introduction to LISA
Linaro Connect 2016 (BKK16) - Introduction to LISALinaro Connect 2016 (BKK16) - Introduction to LISA
Linaro Connect 2016 (BKK16) - Introduction to LISA
 
Why and how to leverage the power and simplicity of SQL on Apache Flink
Why and how to leverage the power and simplicity of SQL on Apache FlinkWhy and how to leverage the power and simplicity of SQL on Apache Flink
Why and how to leverage the power and simplicity of SQL on Apache Flink
 
Apache flink
Apache flinkApache flink
Apache flink
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
Operating system concepts
Operating system conceptsOperating system concepts
Operating system concepts
 
RDF Stream Processing: Let's React
RDF Stream Processing: Let's ReactRDF Stream Processing: Let's React
RDF Stream Processing: Let's React
 
First Flink Bay Area meetup
First Flink Bay Area meetupFirst Flink Bay Area meetup
First Flink Bay Area meetup
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
 

Mehr von DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 

Kürzlich hochgeladen (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

Data Analysis With Apache Flink

  • 1. Aljoscha Krettek / Till Rohrmann Flink committers Co-founders @ data Artisans aljoscha@apache.org / trohrmann@apache.org Data Analysis With Apache Flink
  • 2. What is Apache Flink? 1 Functional API Relational API Graph API Machine Learning … Iterative Dataflow Engine
  • 3. Apache Flink Stack 2 Python Gelly Table FlinkML SAMOA Batch Optimizer DataSet (Java/Scala) DataStream (Java/Scala) Stream Builder Hadoop M/R Distributed Runtime Local Remote Yarn Tez Embedded Dataflow Dataflow *current Flink master + few PRs Table
  • 4. Example Use Case: Log Analysis 3
  • 5. What Seems to be the Problem?  Collect clicks from a webserver log  Find interesting URLs  Combine with user data 4 Web server log user data base Interesting User Data Extract Clicks Combine Massage
  • 6. The Execution Environment  Entry point for all Flink programs  Creates DataSets from data sources 5 ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
  • 7. Getting at Those Clicks 6 DataSet<String> log = env.readTextFile("hdfs:///log"); DataSet<Tuple2<String, Integer>> clicks = log.flatMap( (String line, Collector<Tuple2<String, Integer>> out) -> String[] parts = in.split("*magic regex*"); if (isClick(parts)) { out.collect(new Tuple2<>(parts[1],Integer.parseInt(parts[2]))); } } ) post /foo/bar… 313 get /data/pic.jpg 128 post /bar/baz… 128 post /hello/there… 42
  • 8. The Table Environment  Environment for dealing with Tables  Converts between DataSet and Table 7 TableEnvironment tableEnv = new TableEnvironment();
  • 9. Counting those Clicks 8 Table clicksTable = tableEnv.toTable(clicks, "url, userId"); Table urlClickCounts = clicksTable .groupBy("url, userId") .select("url, userId, url.count as count");
  • 10. Getting the User Information 9 Table userInfo = tableEnv.toTable(…, "name, id, …"); Table resultTable = urlClickCounts.join(userInfo) .where("userId = id && count > 10") .select("url, count, name, …");
  • 11. The Final Step 10 class Result { public String url; public int count; public String name; … } DataSet<Result> set = tableEnv.toSet(resultTable, Result.class); DataSet<Result> result = set.groupBy("url").reduceGroup(new ComplexOperation()); result.writeAsText("hdfs:///result"); env.execute();
  • 12. API in a Nutshell  Element-wise • map, flatMap, filter  Group-wise • groupBy, reduce, reduceGroup, combineGroup, mapPartition, aggregate, distinct  Binary • join, coGroup, union, cross  Iterations • iterate, iterateDelta  Physical re-organization • rebalance, partitionByHash, sortPartition  Streaming • window, windowMap, coMap, ... 11
  • 13. What happens under the hood? 12
  • 14. From Program to Dataflow 13 Flink Program Dataflow Plan Optimized Plan
  • 17. Going Further  Log analysis result: Which user visited how often which web site  Which other websites might they like?  Recommendation by collaborative filtering 16
  • 18. Collaborative Filtering  Recommend items based on users with similar preferences  Latent factor models capture underlying characteristics of items and preferences of user  Predicted preference: 17 ˆru,i = xu T yi
  • 19. Matrix Factorization 18 minX,Y ru,i - xu T yi( ) 2 +l nu xu 2 + ni yi 2 i å u å æ è ç ö ø ÷ ru,i¹0 å R » XT Y
  • 20. Alternating least squares  Iterative approximation 1. Fix X and optimize Y 2. Fix Y and optimize X  Communication and computation intensive 19 R=YX x R=YX x
  • 21. Matrix Factorization Pipeline 20 val featureExtractor = HashingFT() val factorizer = ALS() val pipeline = featureExtractor.chain(factorizer) val clickstreamDS = env.readCsvFile[(String, String, Int)](clickStreamData) val parameters = ParameterMap() .add(HashingFT.NumFeatures, 1000000) .add(ALS.Iterations, 10) .add(ALS.NumFactors, 50) .add(ALS.Lambda, 1.5) val factorization = pipeline.fit(clickstreamDS, parameters) Clickstream Data Hashing Feature Extractor ALS Matrix factorization
  • 22. Does it Scale? 21 • 40 node GCE cluster, highmem-8 • 10 ALS iteration with 50 latent factors • Based on Spark MLlib’s implementation Scale of Netflix or Spotify
  • 23. What Else Can You Do?  Classification using SVMs • Conversion goal prediction  Clustering • Visitor segmentation  Multiple linear regression • Visitor prediction 22
  • 25. What Have You Seen?  Flink is a general-purpose analytics system  Highly expressive Table API  Advanced analysis with Flink’s machine learning library  Jobs are executed on powerful distributed dataflow engine 24
  • 26. Flink Roadmap for 2015  Additions to Machine Learning library  Streaming Machine Learning  Support for interactive programs  Optimization for Table API queries  SQL on top of Table API 25
  • 27. 26
  • 30. WordCount in DataSet API 29 case class Word (word: String, frequency: Int) val env = ExecutionEnvironment.getExecutionEnvironment() val lines = env.readTextFile(...) lines .flatMap {line => line.split(" ").map(word => Word(word,1))} .groupBy("word").sum("frequency”) .print() env.execute() Java and Scala APIs offer the same functionality.
  • 31. Log Analysis Code 30 ExecutionEnvironment env = TableEnvironment tableEnv = new TableEnvironment(); TableEnvironment tableEnv = new TableEnvironment(); DataSet<String> log = env.readTextFile("hdfs:///log"); DataSet<Tuple2<String, Integer>> clicks = log.flatMap( new FlatMapFunction<String, Tuple2<String, Integer>>() { public void flatMap(String in, Collector<Tuple2<>> out) { String[] parts = in.split("*magic regex*"); if (parts[0].equals("click")) { out.collect(new Tuple2<>(parts[1], Integer.parseInt(parts[4]))); } } }); Table clicksTable = tableEnv.toTable(clicks, "url, userId"); Table urlClickCounts = clicksTable .groupBy("url, userId") .select("url, userId, url.count as count"); Table userInfo = tableEnv.toTable(…, "name, id, …"); Table resultTable = urlClickCounts.join(userInfo) .where("userId = id && count > 10") .select("url, count, name, …"); DataSet<Result> result = tableEnv.toSet(resultTable, Result.class); result.writeAsText("hdfs:///result"); env.execute();
  • 32. Log Analysis Dataflow Graph 31 Log Map AggUsers Join Result Group Log Map AggUsers Join combine partition sort merge sort Result Group partition sort
  • 33. Pipelined Execution 32 Only 1 Stage (depending on join strategy) Data transfer in-memory and disk if needed Note: Intermediate DataSets are not necessarily “created”!

Hinweis der Redaktion

  1. Engine is Batch or Streaming
  2. Works also with Scala API
  3. Visualization of program to plan to optimized plan to JobGraph What you see is not what you get.
  4. Pipelined Execution
  5. Algorithms: Decision trees and random forests PCA CCA More transformers: Scaler, Centering, Whitening Feature extractor Count vectorizer Outlier detector Support for cross validation Improved pipeline support Automatic pre- and post-processing pipeline SAMOA support Pending PR which will be merged with the upcoming milestone release Integration with Zeppelin, a IPython Notebook-like web interface for explorative data analysis
  6. Visualization of JobGraph to ExecutionGraph