SlideShare ist ein Scribd-Unternehmen logo
1 von 66
To Petascale and Beyond:
Apache Flink in the Clouds
Greg Hogan
greg@apache.org
My Project Background
• Apache Flink Committer since 2016
• Progressing from Hadoop MapReduce
enticed by tuples, fluent API, and scalability
enduring for community and growth
• Commits for performance and scalability
• Contributed Gelly graph generators and native algorithms
Agenda
Graphs with Gelly
Native algorithms
Configuring at scale
Sort benchmark
Profiling and benchmarking
In development
Why batch? Why Gelly?
• Batch
strong similarities between batch and streaming
benchmarking is in essence batch
• Gelly
diverse, scalable algorithms
that stress distributed architectures
What is Gelly?
Graph API:
Graph, Vertex, Edge, … BipartiteGraph, BipartiteEdge (FLINK-2254)
from*, getTriplets, groupReduceOn*, joinWith*, map*, reduceOn*
public class Graph<K, VV, EV> {

private final ExecutionEnvironment context;

private final DataSet<Vertex<K, VV>> vertices;

private final DataSet<Edge<K, EV>> edges;
public class Vertex<K, V> extends Tuple2<K, V> {
public class Edge<K, V> extends Tuple3<K, K, V>{
What is Gelly?
Iterative graph algorithm models:
vertex-centric (pregel)
scatter-gather (spargel)
gather-sum-apply
public abstract void compute(Vertex<K, VV> vertex, MessageIterator<Message> messages);
public abstract void sendMessages(Vertex<K, VV> vertex) throws Exception;
public abstract void updateVertex(Vertex<K, VV> vertex, MessageIterator<Message> inMessages);
public abstract M gather(Neighbor<VV, EV> neighbor);
public abstract M sum(M arg0, M arg1);
public abstract void apply(M newValue, VV currentValue);
What is Gelly?
• Generate scale-free graphs independent of parallelism
Complete, Cycle, Grid, Hypercube, Path, RMat, Star
• Composable building blocks for directed and undirected graphs
degree annotate, simplify, transform, …
• Algorithms (DataSet) and analytics (accumulators)
• Checksum validation
R-MAT: Recursive Matrix
• CMU: Chakrabarti, Zhan, Faloutsos
power-law degree, “community” structure, small diameter
parsimony, realism, generation speed
• vertex count: 2scale
• edge count : 2scale
* edge factor
• bit probabilities: a + b + c + d = 1
• random noise, clip-and-flip, bipartite, …
Glossary
V SVertex Edge
Open Triplet Closed Triplet / Triangle
T
U
V W
U
V W
Triads
Graph500 Metrics
Scale Vertices Edges Max Degree △ Triplets Triplets
16 46,826 955,507 9,608 33,319,057 619,718,882
20 646,203 16,084,983 64,779 1,225,406,293 35,034,305,809
24 8,869,117 263,432,942 406,382 41,744,066,881 1,782,644,764,135
28 121,233,960 4,259,425,690 2,457,580 1,357,806,114,141 84,905,324,934,756
32 1,652,616,155 68,460,935,436 14,455,055 43,031,369,386,717 3,870,783,341,130,190
Agenda
Graphs with Gelly
Native algorithms
Configuring at scale
Sort benchmark
Profiling and benchmarking
In development
• Measures connectedness of a vertex’s neighborhood
• Ranges from 0.0 to 1.0:
0.0: no edges connecting neighbors (star graph or tree)
1.0: each pair of neighbors is connected (clique)
• Each connection between neighbors forms a triangle
Clustering Coefficient
Clustering Coefficient
• Local Clustering Coefficient is fraction of edges between neighbors
• Global Clustering Coefficient is fraction of connected triplets
• Thomas Schank: Algorithmic Aspects of Triangle-Based Network Analysis
• Join all edges with all (open) triplets
• Naive algorithm discovers each triangle at all three vertices;
instead look for triangles at the vertex with lowest degree
Enumerating Triangles
V W
U
V W
Implicit Object Reuse
Implicit Operator Reuse
Global Clustering Coefficient
Implicit Operator Reuse
Local Clustering Coefficient
Implicit Operator Reuse
Local Clustering Coefficient
Global Clustering Coefficient
Implicit Operator Reuse
Local Clustering Coefficient
Global Clustering Coefficient
gcc = graph

.run(new GlobalClusteringCoefficient<LongValue, NullValue, NullValue>());

lcc = graph

.run(new LocalClusteringCoefficient<LongValue, NullValue, NullValue>());
Graph
Implicit Operator Reuse
• Three generations of operator reuse
1. explicit reuse via get/set on intermediate DataSets
2. Delegate using code generation
3. NoOpOperator as placeholder in Flink 1.2
• Algorithms extend GraphAlgorithmWrappingDataSet/Graph
and define and implement a mergeable configuration
GraphAnalytic
• Terminal; results retrieved via accumulators
• Defers execution to the user to allow composing multiple analytics
and algorithms into a single program
public interface GraphAnalytic<K, VV, EV, T> {



T getResult();

T execute() throws Exception;

T execute(String jobName) throws Exception;



GraphAnalytic<K, VV, EV, T> run(Graph<K, VV, EV> input);
}
Similarity Measures
• Common Neighbors
• Jaccard Index is common neighbors / distinct neighbors
• Adamic-Adar is sum over weighted-degree of common neighbors
Common Neighbors
graph.getEdges()
.groupBy(0).crossGroup(…)
.groupBy(0, 1).sum(2)
C D
B
A
CNB,C=1
CNC,D=1
CNB,D=1
GroupCross Input GroupCross Input
<A,B>, <A,C>, <A,D> <B,C>, <B,D>, <C,D>
<B,A>
<C,A>
<D,A>
GroupedCross
• FLINK-1267: pair-wise compare or combine all elements of a group;
currently can
groupReduce: consume and store the complete iterator in
memory and build all pairs
self-Join: the engine builds all pairs of the full symmetric
Cartesian product. u
v
GroupedCross
• Hashes evenly partition linear outputs, but here quadratic
u
v
GroupedCross
• Hashes evenly partition linear outputs, but here quadratic
• As linear operators:
• process fixed width slices
u
u } 64
} 64
} 64
} 64
GroupedCross
• Hashes evenly partition linear outputs, but here quadratic
• As linear operators:
• process fixed width slices
… generating slices is still quadratic!
u
u } 64
} 64
} 64
} 64
GroupedCross
• Hashes evenly partition linear outputs, but here quadratic
• As linear operators:
3. process fixed width slices
u
u } 64
} 64
} 64
} 64
GroupedCross
• Hashes evenly partition linear outputs, but here quadratic
• As linear operators:
1. emit # slices for each element
3. process fixed width slices
u
u } 64
} 64
} 64
} 64
GroupedCross
• Hashes evenly partition linear outputs, but here quadratic
• As linear operators:
1. emit # slices for each element
2. emit 1..# slice for each element
3. process fixed width slices
u
u } 64
} 64
} 64
} 64
Jaccard Index / Adamic-Adar
• Jaccard Index edges annotated with target degree: <s, t, d(t)>
• Adamic-Adar edges annotated with inverse logarithm of source
degree: <s, t, 1/log(d(s))>
• groupBy, crossGroup, groupBy, sum scores as Common Neighbors
Flink Serialization
• “Think like an element”
• Flink aggressively serializes even when data fits in memory
MapDriver:
• Complex types are antithetical to performance
don’t stream ArrayList, HashSet, …
do use Flink Value types and ValueArray types (FLINK-3695)
while (this.running && ((record = input.next(record)) != null)) {

numRecordsIn.inc();

output.collect(function.map(record));

}
Modern Memory Model
• Processor registers – the fastest possible access (usually 1 CPU cycle). A few thousand bytes in size
• Cache
• Level 0 (L0) Micro operations cache – 6 KiB in size
• Level 1 (L1) Instruction cache – 128 KiB in size
• Level 1 (L1) Data cache – 128 KiB in size. Best access speed is around 700 GiB/second
• Level 2 (L2) Instruction and data (shared) – 1 MiB in size. Best access speed is around 200 GiB/second
• Level 3 (L3) Shared cache – 6 MiB in size. Best access speed is around 100 GB/second
• Level 4 (L4) Shared cache – 128 MiB in size. Best access speed is around 40 GB/second
• Main memory (Primary storage) – Gigabytes in size. Best access speed is around 10 GB/second.
• Disk storage (Secondary storage) – Terabytes in size. As of 2013, best access speed is from a solid state drive is about 600 MB/second
• Nearline storage (Tertiary storage) – Up to exabytes in size. As of 2013, best access speed is about 160 MB/second
• Offline storage
https://en.wikipedia.org/wiki/Memory_hierarchy
Object Reuse
• Performance boost when serializing/deserializing simple types
• Inputs may be modified between function calls;
if storing inputs must make a copy
• Output may be modified pre-serialization by chained operators
• Configured globally; open discussion for using annotations
// Set up the execution environment

final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

env.getConfig().enableObjectReuse();
CopyableValue
• Implemented by Flink’s value types (LongValue, StringValue, …)
• Create new or copy to list of cached objects
if (visitedCount == visited.size()) {

visited.add(edge.f1.copy());

} else {

edge.f1.copyTo(visited.get(visitedCount));

}
public interface CopyableValue<T> extends Value {

int getBinaryLength();


void copyTo(T target);

T copy();

void copy(DataInputView source, DataOutputView target) throws IOException;

}
Hints
• CombineHint.[HASH|SORT|NONE]
• JoinHint.[BROADCAST/REPARTITION_HASH_FIRST/SECOND
|REPARTITION_SORT_MERGE]
• @ForwardedFields[|First|Second]
Agenda
Graphs with Gelly
Native algorithms
Configuring at scale
Sort benchmark
Profiling and benchmarking
In development
Amazon Web Services
• c4.8xlarge:
• 36 vcores
• 60 GiB memory
• 2 NUMA cores (FLINK-3163)
• 5 x 200 GiB EBS volumes; kernel config: xen_blkfront.max=256
• Hourly Spot pricing ~$0.30 - $0.36 plus $0.14/hr EBS
• 800 MB/s EBS
• 10 Gbps networking + 4 Gbps dedicated EBS
Memory Allocation
• User defined functions
• TaskManager managed memory
• Network buffers
• Readhead buffers for spilled files in merge-sort
My flink-conf.yaml
jobmanager.rpc.address: localhost
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 18000
taskmanager.numberOfTaskSlots: 18
taskmanager.memory.preallocate: true
parallelism.default: 36
jobmanager.web.port: 8081
webclient.port: 8080
taskmanager.network.numberOfBuffers: 25920
taskmanager.tmp.dirs: /volumes/xvdb/tmp:/volumes/xvdc/tmp:/volumes/xvdd/tmp:…
akka.ask.timeout: 1000 s
akka.lookup.timeout: 100 s
jobmanager.web.history: 25
taskmanager.memory.fraction: 0.9
taskmanager.memory.off-heap: true
taskmanager.runtime.hashjoin-bloom-filters: true
taskmanager.runtime.max-fan: 8192
Configuring Network Buffers
• Ignore heuristic as recommended in documentation
“#slots-per-TM^2 * #TMs * 4”
• Instead introspect with statsd reporter
nc -l -u -k 0.0.0.0 <port> | grep “AvailableMemorySegments”
AvailableMemorySegments, TotalMemorySegments in Flink 1.2
• May show up in the web UI
Python statsd server
#!/usr/bin/env python
import socket
import sys
import time
UDP_IP = ''
UDP_PORT = int(sys.argv[1])
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind((UDP_IP, UDP_PORT))
while True:
data, addr = sock.recvfrom(4096)
print('{:.6f} {}'.format(time.time(), data))
“Little” Parallelism
• Number of buffers per TM increases linearly with size of cluster so
can run out of buffers or consume too much memory
18 * (36 * 64) * 32 KiB = 1.36 GB for a single shuffle
• Typically have small tasks early and late and large tasks in middle
• Flink runs slowly when parallelism mismatched to size of data
• Must be paired with a scheduler which evenly allocates slots
among TaskManagers
Agenda
Graphs with Gelly
Native algorithms
Configuring at scale
Sort benchmark
Profiling and benchmarking
In development
Cloudsort
• sortbenchmark.org: lowest cost to sort 100 TB in a public cloud
an “I/O” benchmark, no compression permitted
• Input and output on persistent storage
replicated storage outside the cluster
• No discounts!
• Amazon has better compute, Google has better networking
SchnellSort
• Everything at https://github.com/greghogan/flink-cloudsort
• Custom IndyRecord type
• 1 x c4.4xlarge master, 128 x c4.4xlarge workers
• 3 x 255 GiB EBS for tmp directories
• 1 GB input files in S3
• 256 MB output files in S3
Sort Cost
Java SSL Performance
• Cannot choose SSLEngine for HDFS stack
solved with intrinsics in jdk8-b112 and Skylake’s Intel SHA Extensions
• Galois/Counter Mode (GCM) can be disabled
https://docs.oracle.com/javase/8/docs/technotes/guides/security/PolicyFiles.html
• SHA-1 or MD5 are integral to SSL and cannot be disabled
compare with `openssl speed md5 sha`
used AWS CLI
Netty Configuration
• Linux TCP/IP default buffers
• Netty configuration in flink-conf.yaml
$ cat /proc/sys/net/ipv4/tcp_[rw]mem
4096 87380 3813280
4096 20480 3813280
taskmanager.net.num-arenas: 16
taskmanager.net.server.numThreads: 16
taskmanager.net.client.numThreads: 16
taskmanager.net.sendReceiveBufferSize: 67108864
Slow uploads
• “Counting Triangles and the Curse of the Last Reducer”
kill and retry slow downloads and uploads
write files to /dev/shm and upload using “thread-pool”
• Configurable timeout
120 seconds for input
60 seconds for output
Agenda
Graphs with Gelly
Native algorithms
Configuring at scale
Sort benchmark
Profiling and benchmarking
In development
Java Flight Recorder
• Must edit flink-daemon.sh in order to rotate log file
• Can concatenate files from multiple TaskManagers
• Open jfr output files in JDK-provided jmc
jfr=“${FLINK_LOG_DIR}/flink-${FLINK_IDENT_STRING}-${DAEMON}-${id}-${HOSTNAME}.jfr"
JFR_ARGS=“-XX:+UnlockCommercialFeatures -XX:+FlightRecorder 
-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints 
-XX:FlightRecorderOptions=defaultrecording=true,maxage=0,disk=true,dumponexit=true,dumponexitpath=${jfr}”
…
rotateLogFile $jfr
…
$JAVA_RUN $JFR_ARGS …
java oracle.jrockit.jfr.tools.ConCatRepository . -o <filename>.jfr
Macro Benchmarks
• Performance test new features and stress test Flink releases
https://github.com/greghogan/flink-benchmark
multiple time samples on increasing scale; time-weighted
• Profiling drives benchmarking: chasing the “1%”
• What metrics can be collected and analyzed?
{“algorithm”:”AdamicAdar","idType":"STRING","scale":12,"runtime_ms":634,"accumulators":{...}
{“algorithm":"TriangleListingUndirected","idType":"INT","scale":16,"runtime_ms":813,"accumulator
{“algorithm":"AdamicAdar","idType":"LONG","scale":12,"runtime_ms":410,"accumulators":{...}
{“algorithm":"HITS","idType":"STRING","scale":16,"runtime_ms":1693,"accumulators":{...}
{“algorithm”:”JaccardIndex”,”idType":"LONG","scale":12,"runtime_ms":431,"accumulators":{...}
Agenda
Graphs with Gelly
Native algorithms
Configuring at scale
Sort benchmark
Profiling and benchmarking
In development
Photo: Nico Trinkhaus - Eternal Flame, Olympic Stadium Berlin - CC-BY-NC
Custom Types
• FLINK-3042: Define a way to let types create their own TypeInformation
• @TypeInfo
-> TypeInfoFactory
-> TypeInformation
-> serializer/comparator and more
• Efficient serialization of arrays in FLINK-3695: ValueArray types
ValueArray<T>
@TypeInfo(ValueArrayTypeInfoFactory.class)

public interface ValueArray<T> …
public class ValueArrayTypeInfoFactory<T> extends
TypeInfoFactory<ValueArray<T>> {



@Override

public TypeInformation<ValueArray<T>> createTypeInfo(…) {

return new ValueArrayTypeInfo(genericParameters.get("T"));

}

}
ValueArrayTypeInfo<T>
public class ValueArrayTypeInfo<T extends ValueArray> extends TypeInformation<T> implements AtomicType<T> {



private final Class<T> typeClass;


@Override

@SuppressWarnings("unchecked")

public TypeSerializer<T> createSerializer(ExecutionConfig executionConfig) {

if (IntValue.class.isAssignableFrom(typeClass)) {

return (TypeSerializer<T>) new CopyableValueSerializer(IntValueArray.class);

} else…



@SuppressWarnings({ "unchecked", "rawtypes" })

@Override

public TypeComparator<T> createComparator(boolean sortOrderAscending, ExecutionConfig executionConfig) {

if (IntValue.class.isAssignableFrom(typeClass)) {

return (TypeComparator<T>) new CopyableValueComparator(sortOrderAscending, IntValueArray.class);

} else …
ArrayableValue
• LongValueArray, StringValueArray, …
• Initial use cases:
maximum value or top-K for pairwise similarity scores
hybrid adjacency-list model:
u v
u w
u x
y z
u v, w, x
y z
u v, w
u x
y z
Edge List Adjacency List Hybrid
Iteration Intermediate Output
• For many iterative algorithms the output is ∅
algorithm output is the output of intermediate operator
breadth-first search, K-Core decomposition, K-Truss, …
• Delta iteration solution set is blocking, doesn’t scale
Iteration
Head
Iteration
Tail
Discarding
Output
Format
Operator
Coding to Interfaces
• Vertex/Edge as interface or abstract class
subclasses define value fields
eliminate “value packing”
• Requires code generation; must track fields from input to output
• Make algorithms truly modular
• Optimizer can prune unused fields
Why batch? Why Gelly?
• A framework for building great algorithms
• Scaling to 100s of algorithms
• As an out-of-the-box tool for graph analysis
?
Images
• Squirrel: https://flink.apache.org/img/logo/png/1000/flink_squirrel_1000.png
• Clouds: http://www.publicdomainpictures.net/pictures/120000/velka/large-
cumulus-clouds-1.jpg
• Tree: https://motivatedbynature.com/wp-content/uploads/2016/04/
tree_PNG2517.png
• RMat: http://www.cs.cmu.edu/~christos/PUBLICATIONS/siam04.pdf
• Triads: http://www.educa.fmf.uni-lj.si/datana/pub/networks/pajek/triade.pdf
Agenda Images
• Fernsehturm Berlin: http://readysettrek.com/wp-content/uploads/2015/09/berlin-
attractions.jpg
• Reichstag: https://upload.wikimedia.org/wikipedia/commons/c/c7/
Reichstag_building_Berlin_view_from_west_before_sunset.jpg
• Brandenburg Gate: https://images2.alphacoders.com/479/479507.jpg
• Schloss Charlottenburg : https://upload.wikimedia.org/wikipedia/commons/e/ec/
Le_château_de_Charlottenburg_(Berlin)_(6340508573).jpg
• Memorial: https://s-media-cache-ak0.pinimg.com/736x/45/fa/
85/45fa85018367cd88d4ffc0f62416c804.jpg
• Olympic Stadium: http://sumfinity.com/wp-content/uploads/2014/03/Olympic-Stadium-
Berlin.jpg

Weitere ähnliche Inhalte

Was ist angesagt?

Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenDatabricks
 
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JSuneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JFlink Forward
 
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...Flink Forward
 
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch -  Dynami...Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch -  Dynami...
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...Flink Forward
 
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Flink Forward
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkFlink Forward
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit
 
Sourabh Bajaj - Big data processing with Apache Beam
Sourabh Bajaj - Big data processing with Apache BeamSourabh Bajaj - Big data processing with Apache Beam
Sourabh Bajaj - Big data processing with Apache BeamPyData
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingDatabricks
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Spark Summit
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit
 
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...Flink Forward
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersJen Aman
 
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Spark Summit
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Databricks
 

Was ist angesagt? (20)

Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
 
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JSuneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4J
 
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...Flink Forward SF 2017: Dean Wampler -  Streaming Deep Learning Scenarios with...
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...
 
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch -  Dynami...Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch -  Dynami...
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...
 
DataFlow & Beam
DataFlow & BeamDataFlow & Beam
DataFlow & Beam
 
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan Pu
 
Sourabh Bajaj - Big data processing with Apache Beam
Sourabh Bajaj - Big data processing with Apache BeamSourabh Bajaj - Big data processing with Apache Beam
Sourabh Bajaj - Big data processing with Apache Beam
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena Lazovik
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of Parameters
 
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
 

Andere mochten auch

Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...Flink Forward
 
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with FlinkSanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with FlinkFlink Forward
 
Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Zoltán Zvara - Advanced visualization of Flink and Spark jobs
Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Zoltán Zvara - Advanced visualization of Flink and Spark jobs
Flink Forward
 
Kamal Hakimzadeh – Reproducible Distributed Experiments
Kamal Hakimzadeh – Reproducible Distributed ExperimentsKamal Hakimzadeh – Reproducible Distributed Experiments
Kamal Hakimzadeh – Reproducible Distributed ExperimentsFlink Forward
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Flink Forward
 
Matthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and StormsMatthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and StormsFlink Forward
 
Ufuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemUfuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemFlink Forward
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionFlink Forward
 
Apache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsApache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsFlink Forward
 
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...Flink Forward
 
Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Flink Forward
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System OverviewFlink Forward
 
Michael Häusler – Everyday flink
Michael Häusler – Everyday flinkMichael Häusler – Everyday flink
Michael Häusler – Everyday flinkFlink Forward
 
Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Flink Forward
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeFlink Forward
 
Apache Flink Training: DataStream API Part 1 Basic
 Apache Flink Training: DataStream API Part 1 Basic Apache Flink Training: DataStream API Part 1 Basic
Apache Flink Training: DataStream API Part 1 BasicFlink Forward
 
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinFlink Forward
 
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkMaximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkFlink Forward
 
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkTill Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkFlink Forward
 
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-ComposeSimon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-ComposeFlink Forward
 

Andere mochten auch (20)

Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
 
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with FlinkSanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink
 
Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Zoltán Zvara - Advanced visualization of Flink and Spark jobs
Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Zoltán Zvara - Advanced visualization of Flink and Spark jobs

 
Kamal Hakimzadeh – Reproducible Distributed Experiments
Kamal Hakimzadeh – Reproducible Distributed ExperimentsKamal Hakimzadeh – Reproducible Distributed Experiments
Kamal Hakimzadeh – Reproducible Distributed Experiments
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
 
Matthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and StormsMatthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and Storms
 
Ufuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemUfuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one System
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
 
Apache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API BasicsApache Flink Training: DataSet API Basics
Apache Flink Training: DataSet API Basics
 
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
 
Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 
Michael Häusler – Everyday flink
Michael Häusler – Everyday flinkMichael Häusler – Everyday flink
Michael Häusler – Everyday flink
 
Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
 
Apache Flink Training: DataStream API Part 1 Basic
 Apache Flink Training: DataStream API Part 1 Basic Apache Flink Training: DataStream API Part 1 Basic
Apache Flink Training: DataStream API Part 1 Basic
 
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
 
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkMaximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
 
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkTill Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
 
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-ComposeSimon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose
 

Ähnlich wie Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds

Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifyNeville Li
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADtab0ris_1
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiDatabricks
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph ProcessingVasia Kalavri
 
GraphChi big graph processing
GraphChi big graph processingGraphChi big graph processing
GraphChi big graph processinghuguk
 
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamScio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamNeville Li
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark DownscalingDatabricks
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Jen Aman
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Datio Big Data
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Ontico
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Alexey Zinoviev
 

Ähnlich wie Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds (20)

Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADta
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph Processing
 
Data Pipeline at Tapad
Data Pipeline at TapadData Pipeline at Tapad
Data Pipeline at Tapad
 
GraphChi big graph processing
GraphChi big graph processingGraphChi big graph processing
GraphChi big graph processing
 
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamScio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
 
Spark etl
Spark etlSpark etl
Spark etl
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Overview of the Hive Stinger Initiative
Overview of the Hive Stinger InitiativeOverview of the Hive Stinger Initiative
Overview of the Hive Stinger Initiative
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
 

Mehr von Flink Forward

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkFlink Forward
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink Forward
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022Flink Forward
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink Forward
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 

Mehr von Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 

Kürzlich hochgeladen

Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 

Kürzlich hochgeladen (20)

Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 

Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds

  • 1. To Petascale and Beyond: Apache Flink in the Clouds Greg Hogan greg@apache.org
  • 2. My Project Background • Apache Flink Committer since 2016 • Progressing from Hadoop MapReduce enticed by tuples, fluent API, and scalability enduring for community and growth • Commits for performance and scalability • Contributed Gelly graph generators and native algorithms
  • 3. Agenda Graphs with Gelly Native algorithms Configuring at scale Sort benchmark Profiling and benchmarking In development
  • 4. Why batch? Why Gelly? • Batch strong similarities between batch and streaming benchmarking is in essence batch • Gelly diverse, scalable algorithms that stress distributed architectures
  • 5. What is Gelly? Graph API: Graph, Vertex, Edge, … BipartiteGraph, BipartiteEdge (FLINK-2254) from*, getTriplets, groupReduceOn*, joinWith*, map*, reduceOn* public class Graph<K, VV, EV> {
 private final ExecutionEnvironment context;
 private final DataSet<Vertex<K, VV>> vertices;
 private final DataSet<Edge<K, EV>> edges; public class Vertex<K, V> extends Tuple2<K, V> { public class Edge<K, V> extends Tuple3<K, K, V>{
  • 6. What is Gelly? Iterative graph algorithm models: vertex-centric (pregel) scatter-gather (spargel) gather-sum-apply public abstract void compute(Vertex<K, VV> vertex, MessageIterator<Message> messages); public abstract void sendMessages(Vertex<K, VV> vertex) throws Exception; public abstract void updateVertex(Vertex<K, VV> vertex, MessageIterator<Message> inMessages); public abstract M gather(Neighbor<VV, EV> neighbor); public abstract M sum(M arg0, M arg1); public abstract void apply(M newValue, VV currentValue);
  • 7. What is Gelly? • Generate scale-free graphs independent of parallelism Complete, Cycle, Grid, Hypercube, Path, RMat, Star • Composable building blocks for directed and undirected graphs degree annotate, simplify, transform, … • Algorithms (DataSet) and analytics (accumulators) • Checksum validation
  • 8. R-MAT: Recursive Matrix • CMU: Chakrabarti, Zhan, Faloutsos power-law degree, “community” structure, small diameter parsimony, realism, generation speed • vertex count: 2scale • edge count : 2scale * edge factor • bit probabilities: a + b + c + d = 1 • random noise, clip-and-flip, bipartite, …
  • 9. Glossary V SVertex Edge Open Triplet Closed Triplet / Triangle T U V W U V W
  • 11. Graph500 Metrics Scale Vertices Edges Max Degree △ Triplets Triplets 16 46,826 955,507 9,608 33,319,057 619,718,882 20 646,203 16,084,983 64,779 1,225,406,293 35,034,305,809 24 8,869,117 263,432,942 406,382 41,744,066,881 1,782,644,764,135 28 121,233,960 4,259,425,690 2,457,580 1,357,806,114,141 84,905,324,934,756 32 1,652,616,155 68,460,935,436 14,455,055 43,031,369,386,717 3,870,783,341,130,190
  • 12. Agenda Graphs with Gelly Native algorithms Configuring at scale Sort benchmark Profiling and benchmarking In development
  • 13. • Measures connectedness of a vertex’s neighborhood • Ranges from 0.0 to 1.0: 0.0: no edges connecting neighbors (star graph or tree) 1.0: each pair of neighbors is connected (clique) • Each connection between neighbors forms a triangle Clustering Coefficient
  • 14. Clustering Coefficient • Local Clustering Coefficient is fraction of edges between neighbors • Global Clustering Coefficient is fraction of connected triplets
  • 15. • Thomas Schank: Algorithmic Aspects of Triangle-Based Network Analysis • Join all edges with all (open) triplets • Naive algorithm discovers each triangle at all three vertices; instead look for triangles at the vertex with lowest degree Enumerating Triangles V W U V W
  • 17. Implicit Operator Reuse Global Clustering Coefficient
  • 18. Implicit Operator Reuse Local Clustering Coefficient
  • 19. Implicit Operator Reuse Local Clustering Coefficient Global Clustering Coefficient
  • 20. Implicit Operator Reuse Local Clustering Coefficient Global Clustering Coefficient gcc = graph
 .run(new GlobalClusteringCoefficient<LongValue, NullValue, NullValue>());
 lcc = graph
 .run(new LocalClusteringCoefficient<LongValue, NullValue, NullValue>()); Graph
  • 21. Implicit Operator Reuse • Three generations of operator reuse 1. explicit reuse via get/set on intermediate DataSets 2. Delegate using code generation 3. NoOpOperator as placeholder in Flink 1.2 • Algorithms extend GraphAlgorithmWrappingDataSet/Graph and define and implement a mergeable configuration
  • 22. GraphAnalytic • Terminal; results retrieved via accumulators • Defers execution to the user to allow composing multiple analytics and algorithms into a single program public interface GraphAnalytic<K, VV, EV, T> {
 
 T getResult();
 T execute() throws Exception;
 T execute(String jobName) throws Exception;
 
 GraphAnalytic<K, VV, EV, T> run(Graph<K, VV, EV> input); }
  • 23. Similarity Measures • Common Neighbors • Jaccard Index is common neighbors / distinct neighbors • Adamic-Adar is sum over weighted-degree of common neighbors
  • 24. Common Neighbors graph.getEdges() .groupBy(0).crossGroup(…) .groupBy(0, 1).sum(2) C D B A CNB,C=1 CNC,D=1 CNB,D=1 GroupCross Input GroupCross Input <A,B>, <A,C>, <A,D> <B,C>, <B,D>, <C,D> <B,A> <C,A> <D,A>
  • 25. GroupedCross • FLINK-1267: pair-wise compare or combine all elements of a group; currently can groupReduce: consume and store the complete iterator in memory and build all pairs self-Join: the engine builds all pairs of the full symmetric Cartesian product. u v
  • 26. GroupedCross • Hashes evenly partition linear outputs, but here quadratic u v
  • 27. GroupedCross • Hashes evenly partition linear outputs, but here quadratic • As linear operators: • process fixed width slices u u } 64 } 64 } 64 } 64
  • 28. GroupedCross • Hashes evenly partition linear outputs, but here quadratic • As linear operators: • process fixed width slices … generating slices is still quadratic! u u } 64 } 64 } 64 } 64
  • 29. GroupedCross • Hashes evenly partition linear outputs, but here quadratic • As linear operators: 3. process fixed width slices u u } 64 } 64 } 64 } 64
  • 30. GroupedCross • Hashes evenly partition linear outputs, but here quadratic • As linear operators: 1. emit # slices for each element 3. process fixed width slices u u } 64 } 64 } 64 } 64
  • 31. GroupedCross • Hashes evenly partition linear outputs, but here quadratic • As linear operators: 1. emit # slices for each element 2. emit 1..# slice for each element 3. process fixed width slices u u } 64 } 64 } 64 } 64
  • 32. Jaccard Index / Adamic-Adar • Jaccard Index edges annotated with target degree: <s, t, d(t)> • Adamic-Adar edges annotated with inverse logarithm of source degree: <s, t, 1/log(d(s))> • groupBy, crossGroup, groupBy, sum scores as Common Neighbors
  • 33. Flink Serialization • “Think like an element” • Flink aggressively serializes even when data fits in memory MapDriver: • Complex types are antithetical to performance don’t stream ArrayList, HashSet, … do use Flink Value types and ValueArray types (FLINK-3695) while (this.running && ((record = input.next(record)) != null)) {
 numRecordsIn.inc();
 output.collect(function.map(record));
 }
  • 34. Modern Memory Model • Processor registers – the fastest possible access (usually 1 CPU cycle). A few thousand bytes in size • Cache • Level 0 (L0) Micro operations cache – 6 KiB in size • Level 1 (L1) Instruction cache – 128 KiB in size • Level 1 (L1) Data cache – 128 KiB in size. Best access speed is around 700 GiB/second • Level 2 (L2) Instruction and data (shared) – 1 MiB in size. Best access speed is around 200 GiB/second • Level 3 (L3) Shared cache – 6 MiB in size. Best access speed is around 100 GB/second • Level 4 (L4) Shared cache – 128 MiB in size. Best access speed is around 40 GB/second • Main memory (Primary storage) – Gigabytes in size. Best access speed is around 10 GB/second. • Disk storage (Secondary storage) – Terabytes in size. As of 2013, best access speed is from a solid state drive is about 600 MB/second • Nearline storage (Tertiary storage) – Up to exabytes in size. As of 2013, best access speed is about 160 MB/second • Offline storage https://en.wikipedia.org/wiki/Memory_hierarchy
  • 35. Object Reuse • Performance boost when serializing/deserializing simple types • Inputs may be modified between function calls; if storing inputs must make a copy • Output may be modified pre-serialization by chained operators • Configured globally; open discussion for using annotations // Set up the execution environment
 final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
 env.getConfig().enableObjectReuse();
  • 36. CopyableValue • Implemented by Flink’s value types (LongValue, StringValue, …) • Create new or copy to list of cached objects if (visitedCount == visited.size()) {
 visited.add(edge.f1.copy());
 } else {
 edge.f1.copyTo(visited.get(visitedCount));
 } public interface CopyableValue<T> extends Value {
 int getBinaryLength(); 
 void copyTo(T target);
 T copy();
 void copy(DataInputView source, DataOutputView target) throws IOException;
 }
  • 38. Agenda Graphs with Gelly Native algorithms Configuring at scale Sort benchmark Profiling and benchmarking In development
  • 39. Amazon Web Services • c4.8xlarge: • 36 vcores • 60 GiB memory • 2 NUMA cores (FLINK-3163) • 5 x 200 GiB EBS volumes; kernel config: xen_blkfront.max=256 • Hourly Spot pricing ~$0.30 - $0.36 plus $0.14/hr EBS • 800 MB/s EBS • 10 Gbps networking + 4 Gbps dedicated EBS
  • 40. Memory Allocation • User defined functions • TaskManager managed memory • Network buffers • Readhead buffers for spilled files in merge-sort
  • 41. My flink-conf.yaml jobmanager.rpc.address: localhost jobmanager.rpc.port: 6123 jobmanager.heap.mb: 1024 taskmanager.heap.mb: 18000 taskmanager.numberOfTaskSlots: 18 taskmanager.memory.preallocate: true parallelism.default: 36 jobmanager.web.port: 8081 webclient.port: 8080 taskmanager.network.numberOfBuffers: 25920 taskmanager.tmp.dirs: /volumes/xvdb/tmp:/volumes/xvdc/tmp:/volumes/xvdd/tmp:… akka.ask.timeout: 1000 s akka.lookup.timeout: 100 s jobmanager.web.history: 25 taskmanager.memory.fraction: 0.9 taskmanager.memory.off-heap: true taskmanager.runtime.hashjoin-bloom-filters: true taskmanager.runtime.max-fan: 8192
  • 42. Configuring Network Buffers • Ignore heuristic as recommended in documentation “#slots-per-TM^2 * #TMs * 4” • Instead introspect with statsd reporter nc -l -u -k 0.0.0.0 <port> | grep “AvailableMemorySegments” AvailableMemorySegments, TotalMemorySegments in Flink 1.2 • May show up in the web UI
  • 43. Python statsd server #!/usr/bin/env python import socket import sys import time UDP_IP = '' UDP_PORT = int(sys.argv[1]) sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM) sock.bind((UDP_IP, UDP_PORT)) while True: data, addr = sock.recvfrom(4096) print('{:.6f} {}'.format(time.time(), data))
  • 44. “Little” Parallelism • Number of buffers per TM increases linearly with size of cluster so can run out of buffers or consume too much memory 18 * (36 * 64) * 32 KiB = 1.36 GB for a single shuffle • Typically have small tasks early and late and large tasks in middle • Flink runs slowly when parallelism mismatched to size of data • Must be paired with a scheduler which evenly allocates slots among TaskManagers
  • 45. Agenda Graphs with Gelly Native algorithms Configuring at scale Sort benchmark Profiling and benchmarking In development
  • 46. Cloudsort • sortbenchmark.org: lowest cost to sort 100 TB in a public cloud an “I/O” benchmark, no compression permitted • Input and output on persistent storage replicated storage outside the cluster • No discounts! • Amazon has better compute, Google has better networking
  • 47. SchnellSort • Everything at https://github.com/greghogan/flink-cloudsort • Custom IndyRecord type • 1 x c4.4xlarge master, 128 x c4.4xlarge workers • 3 x 255 GiB EBS for tmp directories • 1 GB input files in S3 • 256 MB output files in S3
  • 49. Java SSL Performance • Cannot choose SSLEngine for HDFS stack solved with intrinsics in jdk8-b112 and Skylake’s Intel SHA Extensions • Galois/Counter Mode (GCM) can be disabled https://docs.oracle.com/javase/8/docs/technotes/guides/security/PolicyFiles.html • SHA-1 or MD5 are integral to SSL and cannot be disabled compare with `openssl speed md5 sha` used AWS CLI
  • 50. Netty Configuration • Linux TCP/IP default buffers • Netty configuration in flink-conf.yaml $ cat /proc/sys/net/ipv4/tcp_[rw]mem 4096 87380 3813280 4096 20480 3813280 taskmanager.net.num-arenas: 16 taskmanager.net.server.numThreads: 16 taskmanager.net.client.numThreads: 16 taskmanager.net.sendReceiveBufferSize: 67108864
  • 51. Slow uploads • “Counting Triangles and the Curse of the Last Reducer” kill and retry slow downloads and uploads write files to /dev/shm and upload using “thread-pool” • Configurable timeout 120 seconds for input 60 seconds for output
  • 52. Agenda Graphs with Gelly Native algorithms Configuring at scale Sort benchmark Profiling and benchmarking In development
  • 53. Java Flight Recorder • Must edit flink-daemon.sh in order to rotate log file • Can concatenate files from multiple TaskManagers • Open jfr output files in JDK-provided jmc jfr=“${FLINK_LOG_DIR}/flink-${FLINK_IDENT_STRING}-${DAEMON}-${id}-${HOSTNAME}.jfr" JFR_ARGS=“-XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -XX:FlightRecorderOptions=defaultrecording=true,maxage=0,disk=true,dumponexit=true,dumponexitpath=${jfr}” … rotateLogFile $jfr … $JAVA_RUN $JFR_ARGS … java oracle.jrockit.jfr.tools.ConCatRepository . -o <filename>.jfr
  • 54.
  • 55. Macro Benchmarks • Performance test new features and stress test Flink releases https://github.com/greghogan/flink-benchmark multiple time samples on increasing scale; time-weighted • Profiling drives benchmarking: chasing the “1%” • What metrics can be collected and analyzed? {“algorithm”:”AdamicAdar","idType":"STRING","scale":12,"runtime_ms":634,"accumulators":{...} {“algorithm":"TriangleListingUndirected","idType":"INT","scale":16,"runtime_ms":813,"accumulator {“algorithm":"AdamicAdar","idType":"LONG","scale":12,"runtime_ms":410,"accumulators":{...} {“algorithm":"HITS","idType":"STRING","scale":16,"runtime_ms":1693,"accumulators":{...} {“algorithm”:”JaccardIndex”,”idType":"LONG","scale":12,"runtime_ms":431,"accumulators":{...}
  • 56. Agenda Graphs with Gelly Native algorithms Configuring at scale Sort benchmark Profiling and benchmarking In development Photo: Nico Trinkhaus - Eternal Flame, Olympic Stadium Berlin - CC-BY-NC
  • 57. Custom Types • FLINK-3042: Define a way to let types create their own TypeInformation • @TypeInfo -> TypeInfoFactory -> TypeInformation -> serializer/comparator and more • Efficient serialization of arrays in FLINK-3695: ValueArray types
  • 58. ValueArray<T> @TypeInfo(ValueArrayTypeInfoFactory.class)
 public interface ValueArray<T> … public class ValueArrayTypeInfoFactory<T> extends TypeInfoFactory<ValueArray<T>> {
 
 @Override
 public TypeInformation<ValueArray<T>> createTypeInfo(…) {
 return new ValueArrayTypeInfo(genericParameters.get("T"));
 }
 }
  • 59. ValueArrayTypeInfo<T> public class ValueArrayTypeInfo<T extends ValueArray> extends TypeInformation<T> implements AtomicType<T> {
 
 private final Class<T> typeClass; 
 @Override
 @SuppressWarnings("unchecked")
 public TypeSerializer<T> createSerializer(ExecutionConfig executionConfig) {
 if (IntValue.class.isAssignableFrom(typeClass)) {
 return (TypeSerializer<T>) new CopyableValueSerializer(IntValueArray.class);
 } else…
 
 @SuppressWarnings({ "unchecked", "rawtypes" })
 @Override
 public TypeComparator<T> createComparator(boolean sortOrderAscending, ExecutionConfig executionConfig) {
 if (IntValue.class.isAssignableFrom(typeClass)) {
 return (TypeComparator<T>) new CopyableValueComparator(sortOrderAscending, IntValueArray.class);
 } else …
  • 60. ArrayableValue • LongValueArray, StringValueArray, … • Initial use cases: maximum value or top-K for pairwise similarity scores hybrid adjacency-list model: u v u w u x y z u v, w, x y z u v, w u x y z Edge List Adjacency List Hybrid
  • 61. Iteration Intermediate Output • For many iterative algorithms the output is ∅ algorithm output is the output of intermediate operator breadth-first search, K-Core decomposition, K-Truss, … • Delta iteration solution set is blocking, doesn’t scale Iteration Head Iteration Tail Discarding Output Format Operator
  • 62. Coding to Interfaces • Vertex/Edge as interface or abstract class subclasses define value fields eliminate “value packing” • Requires code generation; must track fields from input to output • Make algorithms truly modular • Optimizer can prune unused fields
  • 63. Why batch? Why Gelly? • A framework for building great algorithms • Scaling to 100s of algorithms • As an out-of-the-box tool for graph analysis
  • 64. ?
  • 65. Images • Squirrel: https://flink.apache.org/img/logo/png/1000/flink_squirrel_1000.png • Clouds: http://www.publicdomainpictures.net/pictures/120000/velka/large- cumulus-clouds-1.jpg • Tree: https://motivatedbynature.com/wp-content/uploads/2016/04/ tree_PNG2517.png • RMat: http://www.cs.cmu.edu/~christos/PUBLICATIONS/siam04.pdf • Triads: http://www.educa.fmf.uni-lj.si/datana/pub/networks/pajek/triade.pdf
  • 66. Agenda Images • Fernsehturm Berlin: http://readysettrek.com/wp-content/uploads/2015/09/berlin- attractions.jpg • Reichstag: https://upload.wikimedia.org/wikipedia/commons/c/c7/ Reichstag_building_Berlin_view_from_west_before_sunset.jpg • Brandenburg Gate: https://images2.alphacoders.com/479/479507.jpg • Schloss Charlottenburg : https://upload.wikimedia.org/wikipedia/commons/e/ec/ Le_château_de_Charlottenburg_(Berlin)_(6340508573).jpg • Memorial: https://s-media-cache-ak0.pinimg.com/736x/45/fa/ 85/45fa85018367cd88d4ffc0f62416c804.jpg • Olympic Stadium: http://sumfinity.com/wp-content/uploads/2014/03/Olympic-Stadium- Berlin.jpg