24.11.2014 
uweseiler 
Apache Spark
About me
Big Data Nerd 
Hadoop Trainer NoSQL Fan Boy 
Photography Enthusiast Travelpirate
About us
specializes in...
Big Data Nerds Agile Ninjas Continuous Delivery Gurus 
Enterprise Java Specialists Performance Geeks 
Join us!
Agenda
• Why? 
• How? 
• What else? 
• Who? 
• Future?
Spark: In a tweet
“Spark … is what you might call a Swiss Army knife of Big Data analytics tools”
– Reynold Xin (@rxin), Berkeley AmpLab Shark Development Lead
Spark: In a nutshell
• Fast and general engine for large-scale data processing
• Advanced DAG execution engine with support for
  – in-memory storage
  – data locality
  – (micro) batch & streaming support
• Improves usability via
  – Rich APIs in Scala, Java, Python
  – Interactive shell
• Runs standalone, on YARN, on Mesos, and on Amazon EC2
Spark is also…
• Came out of AMPLab at UCB in 2009
• A top-level Apache project as of 2014
  – http://spark.apache.org
• Backed by a commercial entity: Databricks
• A toolset for data scientists / analysts
• Implementation of Resilient Distributed Datasets (RDDs) in Scala
• Hadoop compatible
Spark: Trends
(Chart: Google Trends search interest for Apache Drill, Apache Storm, Apache Spark, Apache YARN and Apache Tez)
Generated using http://www.google.com/trends/
Spark: Community
https://github.com/apache/spark/pulse
Spark: Performance
Sorted 100 TB 3X faster using 10X fewer machines than the previous Hadoop MapReduce record
http://finance.yahoo.com/news/apache-spark-beats-world-record-130000796.html 
http://www.wired.com/2014/10/startup-crunches-100-terabytes-data-record-23-minutes/
Spark: Ecosystem
• Spark Core, with the following on top:
  – Spark SQL: SQL
  – Spark Streaming: Streaming
  – MLlib: Machine Learning
  – SparkR: R on Spark
  – GraphX: Graph Computation
  – BlinkDB
• MapReduce: Cluster resource mgmt. + data processing
• HDFS: Redundant, reliable storage
Spark: Core Concept
• Resilient Distributed Dataset (RDD)
  Conceptually, RDDs can be roughly viewed as partitioned, locality-aware distributed vectors
  (Diagram: an RDD split into partitions A11, A12, A13)
• Read-only collection of objects spread across a cluster
• Built through parallel transformations & actions
• Computation can be represented by lazily evaluated lineage DAGs composed of connected RDDs
• Automatically rebuilt on failure
• Controllable persistence
Spark: RDD Example
// Base RDD from HDFS
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("Error"))
val messages = errors.map(_.split('\t')(2))
// RDD in memory
messages.cache()
// Iterative processing
for (str <- Array("foo", "bar"))
  messages.filter(_.contains(str)).count()
Spark: Transformations
Transformations - 
Create new datasets from existing ones 
map(func) 
filter(func) 
flatMap(func) 
mapPartitions(func) 
mapPartitionsWithIndex(func) 
union(otherDataset) 
intersection(otherDataset) 
distinct([numTasks])
groupByKey([numTasks]) 
sortByKey([ascending], [numTasks]) 
reduceByKey(func, [numTasks]) 
aggregateByKey(zeroValue)(seqOp, 
combOp, [numTasks]) 
join(otherDataset, [numTasks]) 
cogroup(otherDataset, [numTasks]) 
cartesian(otherDataset) 
pipe(command, [envVars]) 
coalesce(numPartitions) 
sample(withReplacement,fraction, seed) 
repartition(numPartitions)
Spark: Actions
Actions - 
Return a value to the client after running a 
computation on the dataset 
reduce(func) 
collect() 
count() 
first() 
countByKey() 
foreach(func) 
take(n) 
takeSample(withReplacement,num, [seed]) 
takeOrdered(n, [ordering]) 
saveAsTextFile(path) 
saveAsSequenceFile(path) (Java and Scala only)
saveAsObjectFile(path) (Java and Scala only)
Spark: Dataflow
All transformations in Spark are lazy and are only computed when an action requires it.
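The laziness described above can be illustrated without a cluster. The following is a minimal local sketch that uses a plain Scala Iterator as a stand-in for an RDD (an analogy only, not how Spark is implemented): map merely records the function, and nothing runs until a terminal operation, which plays the role of an action, consumes the iterator. The object name LazyDemo and its method run are hypothetical.

```scala
object LazyDemo {
  /** Returns (function calls before the "action", calls after it, result). */
  def run(): (Int, Int, Int) = {
    var calls = 0                               // counts how often the mapping function runs
    val doubled = Iterator(1, 2, 3).map { x =>  // "transformation": nothing is computed yet
      calls += 1
      x * 2
    }
    val before = calls                          // still 0, because Iterator.map is lazy
    val total = doubled.sum                     // "action": forces the computation
    (before, calls, total)
  }

  def main(args: Array[String]): Unit =
    println(run())
}
```

In Spark the same pattern applies to RDDs: a chain of map and filter calls only builds up the lineage, and an action such as count() or collect() triggers the actual job.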
Spark: Persistence
One of the most important capabilities in Spark is caching a dataset in memory across operations
• cache(): always MEMORY_ONLY
• persist(): MEMORY_ONLY by default
Spark: Storage Levels
• persist(StorageLevel)
Storage Level: Meaning
• MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
• MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
• MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
• MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
• DISK_ONLY: Store the RDD partitions only on disk.
• MEMORY_ONLY_2, MEMORY_AND_DISK_2, …: Same as the levels above, but replicate each partition on two cluster nodes.
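As a rough illustration of choosing a storage level explicitly, here is a minimal sketch. It assumes Spark is on the classpath and runs in local mode; the object name PersistDemo, the application name and the sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistDemo {
  /** Returns (element count, sum of squares), both computed from one persisted RDD. */
  def run(): (Long, Long) = {
    // Assumption: local mode with two threads, for experimentation only
    val conf = new SparkConf().setAppName("persist-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val squares = sc.parallelize(1 to 1000).map(x => x.toLong * x)
      // Keep serialized partitions in memory and spill to disk whatever does not fit
      squares.persist(StorageLevel.MEMORY_AND_DISK_SER)
      val n = squares.count()          // first action: computes the partitions and stores them
      val sum = squares.reduce(_ + _)  // second action: can read the stored partitions
      squares.unpersist()              // release the storage once it is no longer needed
      (n, sum)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Persisting only pays off when an RDD is used by more than one action, as with count() and reduce() here. For a single pass, the extra storage work is wasted.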
Spark: Parallelism
Can be specified in a number of different ways
• RDD partition number
  – sc.textFile(input, minPartitions = 10)
  – sc.parallelize(1 to 10000, numSlices = 10)
• Mapper-side parallelism
  – Usually inherited from parent RDD(s)
• Reducer-side parallelism
  – rdd.reduceByKey(_ + _, numPartitions = 10)
  – rdd.reduceByKey(partitioner = p, _ + _)
• “Zoom in/out”
  – rdd.repartition(numPartitions: Int)
  – rdd.coalesce(numPartitions: Int, shuffle: Boolean)
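The effect of these settings can be observed by inspecting the number of partitions at each stage. The sketch below assumes Spark is on the classpath and runs in local mode; the object name ParallelismDemo and the concrete partition counts are hypothetical choices for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark versions before 1.3

object ParallelismDemo {
  /** Returns the partition count observed after each step. */
  def run(): Seq[Int] = {
    // Assumption: local mode with two threads, for experimentation only
    val conf = new SparkConf().setAppName("parallelism-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val base = sc.parallelize(1 to 10000, numSlices = 10)      // RDD partition number: 10
      val pairs = base.map(x => (x % 100, 1))                    // mapper side: inherited from parent
      val reduced = pairs.reduceByKey(_ + _, numPartitions = 4)  // reducer side: set explicitly
      val wider = reduced.repartition(8)                         // "zoom out" (always shuffles)
      val narrower = wider.coalesce(2)                           // "zoom in" (no shuffle by default)
      Seq(
        base.partitions.length,
        pairs.partitions.length,
        reduced.partitions.length,
        wider.partitions.length,
        narrower.partitions.length
      )
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run().mkString(", "))
}
```

Note the asymmetry between the last two calls: repartition always shuffles, whereas coalesce merges existing partitions locally unless shuffle = true is passed.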
Spark: Example
Text Processing Example 
Top words by frequency
Spark: Frequency Example
Create RDD from external data
• Data sources supported by Hadoop: HDFS, S3, HBase, Cassandra, ElasticSearch, MongoDB, …
• I/O via Hadoop optional
// Step 1. Create RDD from Hadoop text files
val docs = spark.textFile("hdfs://docs/")
Spark: Frequency Example
Function map
RDD[String]: "Hello World", "This is Spark", "Spark", "The end"
.map(line => line.toLowerCase), or equivalently .map(_.toLowerCase)
RDD[String]: "hello world", "this is spark", "spark", "the end"
// Step 2. Convert lines to lower case
val lower = docs.map(line => line.toLowerCase)
Spark: Frequency Example
map vs. flatMap
RDD[String]: "hello world", "this is spark", "spark", "the end"
.map(_.split("\\s+")) yields RDD[Array[String]]: [hello, world], [this, is, spark], [spark], [the, end]
.flatten* then yields RDD[String]: hello, world, this, …, end
.flatMap(line => line.split("\\s+")) does both in one step
// Step 3. Split lines into words
val words = lower.flatMap(line => line.split("\\s+"))
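The difference between map and flatMap can be tried out locally with ordinary Scala collections, which offer the same two operators. This is a minimal sketch on a plain Seq rather than an RDD; the object name MapVsFlatMap and the sample lines are hypothetical.

```scala
object MapVsFlatMap {
  // Hypothetical sample lines, already lower-cased
  val lines: Seq[String] = Seq("hello world", "this is spark", "spark", "the end")

  // map: exactly one result per line, so the outcome is a sequence of word arrays
  val nested: Seq[Array[String]] = lines.map(_.split("\\s+"))

  // flatMap: splits each line and concatenates all pieces into one flat sequence
  val words: Seq[String] = lines.flatMap(_.split("\\s+").toSeq)

  def main(args: Array[String]): Unit = {
    println(nested.map(_.mkString("[", ", ", "]")).mkString(" "))
    println(words.mkString(" "))
  }
}
```

Four input lines give four arrays under map, but eight individual words under flatMap, which is the shape the word count needs.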
Spark: Frequency Example
Key-Value Pairs
RDD[String]: hello, world, spark, spark, end
.map(word => Tuple2(word, 1)), or equivalently .map(word => (word, 1))
RDD[(String, Int)]: (hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1)
// Step 4. Convert into tuples
val counts = words.map(word => (word, 1))
Spark: Frequency Example
Shuffling
RDD[(String, Int)]: (hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1)
.groupByKey yields RDD[(String, Iterable[Int])]: (end, [1]), (hello, [1]), (spark, [1, 1]), (world, [1])
.mapValues(_.reduce((a, b) => a + b)) yields RDD[(String, Int)]: (end, 1), (hello, 1), (spark, 2), (world, 1)
.reduceByKey((a, b) => a + b) does both in one step
// Step 5. Count all words
val freq = counts.reduceByKey(_ + _)
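That the two routes give the same counts can be checked on a local collection. The sketch below mimics both variants with plain Scala collections rather than RDDs; the object name CountByKey and the function names viaGrouping and viaReduce are hypothetical.

```scala
object CountByKey {
  // Hypothetical (word, 1) pairs, as produced by step 4
  val pairs: Seq[(String, Int)] =
    Seq("hello" -> 1, "world" -> 1, "spark" -> 1, "spark" -> 1, "end" -> 1)

  // Variant 1: collect all values per key first, then reduce each group
  // (mirrors groupByKey followed by mapValues(_.reduce(_ + _)))
  def viaGrouping(ps: Seq[(String, Int)]): Map[String, Int] =
    ps.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(_ + _) }

  // Variant 2: fold every value into a running total per key
  // (mirrors reduceByKey(_ + _), which never materializes the groups)
  def viaReduce(ps: Seq[(String, Int)]): Map[String, Int] =
    ps.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0) + v)
    }

  def main(args: Array[String]): Unit = {
    println(viaGrouping(pairs))
    println(viaReduce(pairs))
  }
}
```

On a cluster the second shape is usually preferable: reduceByKey combines values within each partition before the shuffle, so less data crosses the network than with groupByKey.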
Spark: Frequency Example
Top N (Prepare data)
RDD[(String, Int)]: (end, 1), (hello, 1), (spark, 2), (world, 1)
.map(_.swap) yields RDD[(Int, String)]: (1, end), (1, hello), (2, spark), (1, world)
// Step 6. Swap tuples (Partial code)
freq.map(_.swap)
Spark: Frequency Example
Top N (First Attempt)
RDD[(Int, String)]: (1, end), (1, hello), (2, spark), (1, world)
.sortByKey(false) yields RDD[(Int, String)]: (2, spark), (1, end), (1, hello), (1, world)
Spark: Frequency Example
Top N
RDD[(Int, String)]: (1, end), (1, hello), (2, spark), (1, world)
.top(N) computes a local top N on each partition, then merges the local results in a reduction
Array[(Int, String)]: (2, spark), (1, end)
// Step 6. Swap tuples (Complete code)
val top = freq.map(_.swap).top(N)
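The "local top N, then reduction" idea can be sketched with plain Scala collections, where each inner Seq stands in for one partition. This is an illustration of the principle only, not Spark's actual implementation; the object name TopNDemo, the function topN and the sample counts are hypothetical.

```scala
object TopNDemo {
  // Two hypothetical partitions of (count, word) pairs
  val partitions: Seq[Seq[(Int, String)]] = Seq(
    Seq((1, "end"), (4, "hello")),
    Seq((7, "spark"), (2, "world"))
  )

  def topN(parts: Seq[Seq[(Int, String)]], n: Int): Seq[(Int, String)] = {
    val descending = Ordering[(Int, String)].reverse
    // Local top N: each partition keeps only its n largest pairs
    val localTops = parts.map(_.sorted(descending).take(n))
    // Reduction: merge the small local results and keep the overall n largest
    localTops.flatten.sorted(descending).take(n)
  }

  def main(args: Array[String]): Unit =
    println(topN(partitions, 2))
}
```

Because every partition forwards at most n elements, the final merge stays small regardless of how large the dataset is. A full sortByKey, by contrast, has to shuffle and order everything.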
Spark: Frequency Example
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD operations such as reduceByKey
val spark = new SparkContext()
// Create RDD from Hadoop text file
val docs = spark.textFile("hdfs://docs/")
// Split lines into words and process
val lower = docs.map(line => line.toLowerCase)
val words = lower.flatMap(line => line.split("\\s+"))
val counts = words.map(word => (word, 1))
// Count all words
val freq = counts.reduceByKey(_ + _)
// Swap tuples and get top results (N is the number of top words wanted)
val top = freq.map(_.swap).top(N)
top.foreach(println)
Spark: Streaming
• Real-time computation
• Similar to Apache Storm…
• Streaming input split into sliding windows of RDDs
• Input distributed to memory for fault tolerance
• Supports input from Kafka, Flume, ZeroMQ, HDFS, S3, Kinesis, Twitter, …
Spark: Streaming
Discretized Stream 
Windowed Computations
Spark: Streaming
TwitterUtils.createStream()
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5))
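What a windowed count over micro-batches computes can be sketched locally. In the following analogy each inner Seq stands for one micro-batch of tweet texts, and a window spans a fixed number of consecutive batches. It uses plain Scala collections, not the Spark Streaming API; the object name WindowDemo, the function countByWindow and the sample texts are hypothetical.

```scala
object WindowDemo {
  // Hypothetical micro-batches of tweet texts, in arrival order
  val batches: Seq[Seq[String]] = Seq(
    Seq("Spark rocks", "hello"),
    Seq("more Spark"),
    Seq("nothing here"),
    Seq("Spark", "Spark again")
  )

  /** Counts matching texts in every window of `windowLength` consecutive batches. */
  def countByWindow(bs: Seq[Seq[String]], windowLength: Int): List[Int] =
    bs.sliding(windowLength)                        // windows advance one batch at a time
      .map(_.flatten.count(_.contains("Spark")))   // filter and count inside each window
      .toList

  def main(args: Array[String]): Unit =
    println(countByWindow(batches, 2))
}
```

Consecutive windows overlap, so a single tweet can contribute to more than one window's count.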
Spark: SQL
• Spark SQL allows relational queries expressed in SQL, HiveQL or Scala
• Uses SchemaRDDs composed of Row objects (= table in a traditional RDBMS)
• A SchemaRDD can be created
  – from an existing RDD
  – from a Parquet file
  – from a JSON dataset
  – by running HiveQL against data stored in Apache Hive
• Supports a domain-specific language for writing queries
Spark: SQL
registerFunction("LEN", (_: String).length)
val queryRdd = sql("""
  SELECT * FROM counts
  WHERE LEN(word) = 10
  ORDER BY total DESC
  LIMIT 10
""")
queryRdd
  .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
  .collect()
  .foreach(println)
Spark: GraphX
• GraphX is the Spark API for graphs and graph-parallel computation
• APIs to join and traverse graphs
• Optimally partitions and indexes vertices & edges (represented as RDDs)
• Supports PageRank, connected components, triangle counting, …
Spark: GraphX
val graph = Graph(userIdRDD, assocRDD)
val ranks = graph.pageRank(0.0001).vertices
val userRDD = sc.textFile("graphx/data/users.txt")
val users = userRDD.map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
Spark: MLlib
• Machine learning library similar to 
Apache Mahout 
• Supports statistics, regression, decision 
trees, clustering, PCA, gradient 
descent, … 
• Iterative algorithms much faster due to 
in-memory processing
Spark: MLlib
val data = sc.textFile("data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(
    parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
val model = LinearRegressionWithSGD.train(parsedData, 100)
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds
  .map { case (v, p) => math.pow(v - p, 2) }.mean()
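Written out, the last expression is the mean squared error over all (label, prediction) pairs. The symbols n, v_i and p_i are notation introduced here, following the v and p in the code: n is the number of points, v_i the label of the i-th point and p_i the model's prediction for it.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( v_i - p_i \right)^2
```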
Use Case: Yahoo Native Ads
Logistic regression algorithm
• 120 LOC in Spark/Scala
• 30 min. on model creation for 100M samples and 13K features
Initial version launched within 2 hours after Spark-on-YARN announcement
• Compared: Several days on hardware acquisition, system setup and data movement
http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
Use Case: Yahoo Mobile Ads
Learn from mobile search ads clicks data
• 600M labeled examples on HDFS
• 100M sparse features
Spark programs for Gradient Boosting Decision Trees
• 6 hours for model training with 100 workers
• Model with accuracy very close to heavily-manually-tuned Logistic Regression models
http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
Spark-on-YARN (Current)
Hadoop 2: Spark as YARN App
• Applications: Pig, Hive, …
• Running on YARN: MapReduce (Execution Engine), Tez (Execution Engine), Spark (In-Memory), Storm (Streaming), …
• YARN: Cluster resource management
• HDFS: Redundant, reliable storage
Spark-on-YARN (Future)
Hadoop 2: Spark as Execution Engine
• Applications: Hive, Pig, Mahout, …
• Execution engines: MapReduce, Tez, Spark
• Also on YARN: Storm (Streaming), Slider, …
• YARN
• HDFS
Spark: Future work
• Spark Core
  – Focus on maturity, optimization & pluggability
  – Enable long-running services (Slider)
  – Give resources back to cluster when idle
  – Integrate with Hadoop enhancements
      – Timeline server
      – ORC File Format
• Spark Eco System
  – Focus on adding capabilities
One more thing…
Let’s get started with 
Spark!
Hortonworks Sandbox 2.2
http://hortonworks.com/hdp/downloads/
Hortonworks Sandbox 2.2
# 1. Download
wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/spark/1.1.1/spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz
# 2. Untar
tar xvfz spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz
# 3. Start Spark Shell (assumes the archive unpacks into a directory of the same name)
cd spark-1.1.0.2.1.5.0-701-bin-2.4.0
./bin/spark-shell
Thanks for listening
Twitter: @uweseiler
Mail: uwe.seiler@codecentric.de
XING: https://www.xing.com/profile/Uwe_Seiler

Weitere ähnliche Inhalte

Was ist angesagt?

Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data ScienceErik Bernhardsson
 
Luigi presentation OA Summit
Luigi presentation OA SummitLuigi presentation OA Summit
Luigi presentation OA SummitOpen Analytics
 
Diving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka ConnectDiving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka Connectconfluent
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Stamatis Zampetakis
 
Better APIs with GraphQL
Better APIs with GraphQL Better APIs with GraphQL
Better APIs with GraphQL Josh Price
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query LanguageJulian Hyde
 
Benchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on SparkBenchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on SparkXiaoqian Liu
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeperSaurav Haloi
 
Query and audit logging in cassandra
Query and audit logging in cassandraQuery and audit logging in cassandra
Query and audit logging in cassandraVinay Kumar Chella
 
The Scala Programming Language
The Scala Programming LanguageThe Scala Programming Language
The Scala Programming LanguageHaim Michael
 
Node.js Express
Node.js  ExpressNode.js  Express
Node.js ExpressEyal Vardi
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiGrowth Intelligence
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Battle of the frameworks : Quarkus vs SpringBoot
Battle of the frameworks : Quarkus vs SpringBootBattle of the frameworks : Quarkus vs SpringBoot
Battle of the frameworks : Quarkus vs SpringBootChristos Sotiriou
 

Was ist angesagt? (20)

Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Apache ZooKeeper
Apache ZooKeeperApache ZooKeeper
Apache ZooKeeper
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data Science
 
Luigi presentation OA Summit
Luigi presentation OA SummitLuigi presentation OA Summit
Luigi presentation OA Summit
 
Diving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka ConnectDiving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka Connect
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
 
Better APIs with GraphQL
Better APIs with GraphQL Better APIs with GraphQL
Better APIs with GraphQL
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
 
Testing in airflow
Testing in airflowTesting in airflow
Testing in airflow
 
Benchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on SparkBenchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on Spark
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Query and audit logging in cassandra
Query and audit logging in cassandraQuery and audit logging in cassandra
Query and audit logging in cassandra
 
Apache Zookeeper
Apache ZookeeperApache Zookeeper
Apache Zookeeper
 
The Scala Programming Language
The Scala Programming LanguageThe Scala Programming Language
The Scala Programming Language
 
Node.js Express
Node.js  ExpressNode.js  Express
Node.js Express
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
 
Graph database
Graph databaseGraph database
Graph database
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Battle of the frameworks : Quarkus vs SpringBoot
Battle of the frameworks : Quarkus vs SpringBootBattle of the frameworks : Quarkus vs SpringBoot
Battle of the frameworks : Quarkus vs SpringBoot
 

Andere mochten auch

Hadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldHadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldUwe Printz
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureUwe Printz
 
Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionEmanuele Bezzi
 
Big Data Asset Maturity Model
Big Data Asset Maturity ModelBig Data Asset Maturity Model
Big Data Asset Maturity Modelnoahwong
 
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...StampedeCon
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
 
Deep dive into spark streaming
Deep dive into spark streamingDeep dive into spark streaming
Deep dive into spark streamingTao Li
 
Big data, Analytics and Beyond
Big data, Analytics and BeyondBig data, Analytics and Beyond
Big data, Analytics and BeyondQuantUniversity
 
Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos Rahul Kumar
 
Hadoop security landscape
Hadoop security landscapeHadoop security landscape
Hadoop security landscapeSujee Maniyam
 
Contexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti
 
Energy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshopEnergy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshopQuantUniversity
 
Hadoop Security Now and Future
Hadoop Security Now and FutureHadoop Security Now and Future
Hadoop Security Now and Futuretcloudcomputing-tw
 
Launching your career in Big Data
Launching your career in Big DataLaunching your career in Big Data
Launching your career in Big DataSujee Maniyam
 
HDP Advanced Security: Comprehensive Security for Enterprise Hadoop
HDP Advanced Security: Comprehensive Security for Enterprise HadoopHDP Advanced Security: Comprehensive Security for Enterprise Hadoop
HDP Advanced Security: Comprehensive Security for Enterprise HadoopHortonworks
 
Apache Sentry for Hadoop security
Apache Sentry for Hadoop securityApache Sentry for Hadoop security
Apache Sentry for Hadoop securitybigdatagurus_meetup
 

Andere mochten auch (20)

Hadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldHadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the field
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, Future
 
Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
 
Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an Introduction
 
Big Data Asset Maturity Model
Big Data Asset Maturity ModelBig Data Asset Maturity Model
Big Data Asset Maturity Model
 
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
 
Deep dive into spark streaming
Deep dive into spark streamingDeep dive into spark streaming
Deep dive into spark streaming
 
Big data, Analytics and Beyond
Big data, Analytics and BeyondBig data, Analytics and Beyond
Big data, Analytics and Beyond
 
Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos Fully fault tolerant real time data pipeline with docker and mesos
Fully fault tolerant real time data pipeline with docker and mesos
 
Hadoop security landscape
Hadoop security landscapeHadoop security landscape
Hadoop security landscape
 
Contexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to Production
 
Energy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshopEnergy analytics with Apache Spark workshop
Energy analytics with Apache Spark workshop
 
Hadoop Security Now and Future
Hadoop Security Now and FutureHadoop Security Now and Future
Hadoop Security Now and Future
 
Launching your career in Big Data
Launching your career in Big DataLaunching your career in Big Data
Launching your career in Big Data
 
Hadoop bootcamp getting started
Hadoop bootcamp getting startedHadoop bootcamp getting started
Hadoop bootcamp getting started
 
HDP Advanced Security: Comprehensive Security for Enterprise Hadoop
HDP Advanced Security: Comprehensive Security for Enterprise HadoopHDP Advanced Security: Comprehensive Security for Enterprise Hadoop
HDP Advanced Security: Comprehensive Security for Enterprise Hadoop
 
Unicom Big Data Conference
Unicom  Big Data ConferenceUnicom  Big Data Conference
Unicom Big Data Conference
 
Apache Sentry for Hadoop security
Apache Sentry for Hadoop securityApache Sentry for Hadoop security
Apache Sentry for Hadoop security
 

Ähnlich wie Apache Spark

Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...Lucidworks
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2Fabio Fumarola
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptbhargavi804095
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Apache Spark and DataStax Enablement
Apache Spark

  • 1. Apache Spark (uweseiler, 24.11.2014)
  • 2. About me: Big Data Nerd, Hadoop Trainer, NoSQL Fan Boy, Photography Enthusiast, Travelpirate
  • 3. About us: specializes in... Big Data Nerds, Agile Ninjas, Continuous Delivery Gurus, Enterprise Java Specialists, Performance Geeks. Join us!
  • 4. Agenda: Why? How? What else? Who? Future?
  • 6. Spark: In a tweet
      “Spark … is what you might call a Swiss Army knife of Big Data analytics tools” – Reynold Xin (@rxin), Berkeley AmpLab Shark Development Lead
  • 7. Spark: In a nutshell
      • Fast and general engine for large-scale data processing
      • Advanced DAG execution engine with support for in-memory storage, data locality, and (micro-)batch and streaming support
      • Improves usability via rich APIs in Scala, Java, and Python, plus an interactive shell
      • Runs standalone, on YARN, on Mesos, and on Amazon EC2
  • 8. Spark is also…
      • Came out of AMPLab at UCB in 2009
      • A top-level Apache project as of 2014 – http://spark.apache.org
      • Backed by a commercial entity: Databricks
      • A toolset for data scientists / analysts
      • Implementation of Resilient Distributed Dataset (RDD) in Scala
      • Hadoop compatible
  • 9. Spark: Trends
      [Chart comparing Apache Drill, Apache Storm, Apache Spark, Apache YARN, and Apache Tez. Generated using http://www.google.com/trends/]
  • 10. Spark: Community
      https://github.com/apache/spark/pulse
  • 11. Spark: Performance
      3X faster using 10X fewer machines
      http://finance.yahoo.com/news/apache-spark-beats-world-record-130000796.html
      http://www.wired.com/2014/10/startup-crunches-100-terabytes-data-record-23-minutes/
  • 12. Spark: Ecosystem
      [Diagram of the stack]
      • On top of Spark Core: Spark SQL (SQL), Spark Streaming (streaming), MLlib (machine learning), SparkR (R on Spark), GraphX (graph computation), BlinkDB
      • Spark Core
      • MapReduce: cluster resource mgmt. + data processing
      • HDFS: redundant, reliable storage
  • 13. Agenda: Why? How? What else? Who? Future?
  • 14. Spark: Core Concept
      Resilient Distributed Dataset (RDD): conceptually, RDDs can be roughly viewed as partitioned, locality-aware distributed vectors
      • Read-only collection of objects spread across a cluster
      • Built through parallel transformations & actions
      • Computation can be represented by lazily evaluated lineage DAGs composed of connected RDDs
      • Automatically rebuilt on failure
      • Controllable persistence
  • 15. Spark: RDD Example
      Base RDD from HDFS:
        val lines = spark.textFile("hdfs://...")
        val errors = lines.filter(_.startsWith("Error"))
        val messages = errors.map(_.split('\t')(2))
      RDD in memory:
        messages.cache()
      Iterative processing:
        for (str <- Array("foo", "bar"))
          messages.filter(_.contains(str)).count()
  • 17. Spark: Transformations
      Transformations create new datasets from existing ones:
      • map(func), filter(func), flatMap(func), mapPartitions(func), mapPartitionsWithIndex(func)
      • union(otherDataset), intersection(otherDataset), distinct([numTasks])
      • groupByKey([numTasks]), sortByKey([ascending], [numTasks]), reduceByKey(func, [numTasks]), aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
      • join(otherDataset, [numTasks]), cogroup(otherDataset, [numTasks]), cartesian(otherDataset)
      • pipe(command, [envVars]), coalesce(numPartitions), sample(withReplacement, fraction, seed), repartition(numPartitions)
  • 19. Spark: Actions
      Actions return a value to the client after running a computation on the dataset:
      • reduce(func), collect(), count(), first(), countByKey(), foreach(func)
      • take(n), takeSample(withReplacement, num, [seed]), takeOrdered(n, [ordering])
      • saveAsTextFile(path)
      • saveAsSequenceFile(path) (only Java and Scala)
      • saveAsObjectFile(path) (only Java and Scala)
  • 20. Spark: Dataflow
      All transformations in Spark are lazy and are only computed when an action requires it.
  • 21. Spark: Persistence
      One of the most important capabilities in Spark is caching a dataset in memory across operations.
      • cache() uses MEMORY_ONLY
      • persist() defaults to MEMORY_ONLY
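The lazy dataflow can be mimicked without a cluster. The following is a minimal sketch that uses a plain Scala collection view as a stand-in for an RDD; this is an analogy only, no Spark API is involved, and `LazyDemo`, `double`, and `run` are hypothetical names. Building the pipeline does no work, and only the final `sum`, which plays the role of an action, triggers evaluation.

```scala
object LazyDemo {
  // Counts how often the mapping function has actually been invoked.
  var evaluations = 0

  def double(x: Int): Int = { evaluations += 1; x * 2 }

  // Returns (evaluations before the "action", evaluations after it, result).
  def run(): (Int, Int, Int) = {
    evaluations = 0
    val data = Vector(1, 2, 3, 4)
    // "Transformations": defining the view performs no work yet.
    val pipeline = data.view.map(double).filter(_ > 2)
    val before = evaluations
    // "Action": forcing the view runs the whole pipeline once.
    val total = pipeline.sum
    (before, evaluations, total)
  }
}
```

Calling `LazyDemo.run()` shows that the mapping function has not run at all before the sum is requested.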
  • 22. Spark: Storage Levels
      persist(StorageLevel) accepts the following levels:
      • MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
      • MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
      • MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
      • MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
      • DISK_ONLY: Store the RDD partitions only on disk.
      • MEMORY_ONLY_2, MEMORY_AND_DISK_2, …: Same as the levels above, but replicate each partition on two cluster nodes.
  • 23. Spark: Parallelism
      Can be specified in a number of different ways:
      • RDD partition number
          sc.textFile(input, minPartitions = 10)
          sc.parallelize(1 to 10000, numSlices = 10)
      • Mapper-side parallelism: usually inherited from parent RDD(s)
      • Reducer-side parallelism
          rdd.reduceByKey(_ + _, numPartitions = 10)
          rdd.reduceByKey(p, _ + _)   // p is a Partitioner
      • “Zoom in/out”
          rdd.repartition(numPartitions: Int)
          rdd.coalesce(numPartitions: Int, shuffle: Boolean)
  • 24. Spark: Example
      Text processing example: top words by frequency
  • 25. Spark: Frequency Example
      Create RDD from external data
      • Data sources supported by Hadoop: Cassandra, ElasticSearch, HDFS, S3, HBase, MongoDB, …
      • I/O via Hadoop is optional
        // Step 1. Create RDD from Hadoop text files
        val docs = spark.textFile("hdfs://docs/")
  • 28. Spark: Frequency Example (Function map)
      RDD[String]: Hello World This is Spark Spark The end
      .map(line => line.toLowerCase), or equivalently .map(_.toLowerCase)
      RDD[String]: hello world this is spark spark the end
        // Step 2. Convert lines to lower case
        val lower = docs.map(line => line.toLowerCase)
  • 32. Spark: Frequency Example (map vs. flatMap)
      RDD[String]: hello world this is spark spark the end
      .map(_.split("\\s+")) gives RDD[Array[String]]: one array of words per line
      .flatten* on that result gives RDD[String]: hello, world, this, …, end
      .flatMap(line => line.split("\\s+")) gives the same RDD[String] in a single step
        // Step 3. Split lines into words
        val words = lower.flatMap(line => line.split("\\s+"))
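The difference between map and flatMap can be tried on a local collection. This is a minimal sketch in which a `Seq[String]` stands in for the RDD; the sample lines and the name `MapVsFlatMap` are assumptions for illustration, not part of the slides.

```scala
object MapVsFlatMap {
  // Hypothetical sample input; a local Seq stands in for an RDD[String].
  val lines: Seq[String] = Seq("hello world", "spark spark", "the end")

  // map keeps one result per input line: a sequence of word arrays.
  val nested: Seq[Array[String]] = lines.map(_.split("\\s+"))

  // flatMap splits and flattens in a single step: a sequence of words.
  val words: Seq[String] = lines.flatMap(_.split("\\s+"))
}
```

Here `nested` has three elements (one array per line), while `words` has six (one per word).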
  • 35. Spark: Frequency Example (Key-Value Pairs)
      RDD[String]: hello, world, spark, spark, end
      .map(word => Tuple2(word, 1)), or equivalently .map(word => (word, 1))
      RDD[(String, Int)]: (hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1)
        // Step 4. Convert into tuples
        val counts = words.map(word => (word, 1))
  • 38. Spark: Frequency Example (Shuffling)
      RDD[(String, Int)]: (hello, 1), (world, 1), (spark, 1), (spark, 1), (end, 1)
      .groupByKey gives RDD[(String, Iterable[Int])]: (end, [1]), (hello, [1]), (spark, [1, 1]), (world, [1])
      .mapValues(_.reduce((a, b) => a + b)) gives RDD[(String, Int)]: (end, 1), (hello, 1), (spark, 2), (world, 1)
      Both steps in one: .reduceByKey((a, b) => a + b)
  • 39. Spark: Frequency Example (Shuffling)
        // Step 5. Count all words
        val freq = counts.reduceByKey(_ + _)
      Result, RDD[(String, Int)]: (end, 1), (hello, 1), (spark, 2), (world, 1)
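Both routes give the same counts, but reduceByKey combines values inside each partition before the shuffle, so fewer pairs have to move. The following is a rough local sketch of that idea, with nested `Seq`s standing in for partitions; the data and the names `ShuffleSketch` and `sumByKey` are hypothetical, and no Spark API is used.

```scala
object ShuffleSketch {
  type Pair = (String, Int)

  // Hypothetical input, already split into two "partitions".
  val partitions: Seq[Seq[Pair]] = Seq(
    Seq("hello" -> 1, "spark" -> 1, "spark" -> 1),
    Seq("world" -> 1, "spark" -> 1, "end" -> 1)
  )

  // Sums the values of each key within one collection of pairs.
  def sumByKey(pairs: Seq[Pair]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => word -> ps.map(_._2).sum }

  // groupByKey style: every single pair crosses the "shuffle", then values are summed.
  val shuffledAll: Seq[Pair] = partitions.flatten
  val viaGroup: Map[String, Int] = sumByKey(shuffledAll)

  // reduceByKey style: each partition is pre-aggregated, so fewer pairs are shuffled.
  val preAggregated: Seq[Pair] = partitions.flatMap(p => sumByKey(p).toSeq)
  val viaReduce: Map[String, Int] = sumByKey(preAggregated)
}
```

In this toy input the pre-aggregated route moves five pairs instead of six; the saving grows with the number of repeated keys per partition.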
  • 40. Spark: Frequency Example (Top N: prepare data)
      RDD[(String, Int)]: (end, 1), (hello, 1), (spark, 2), (world, 1)
      .map(_.swap) gives RDD[(Int, String)]: (1, end), (1, hello), (2, spark), (1, world)
        // Step 6. Swap tuples (partial code)
        freq.map(_.swap)
  • 41. Spark: Frequency Example (Top N: first attempt)
      RDD[(Int, String)]: (1, end), (1, hello), (2, spark), (1, world)
      .sortByKey gives RDD[(Int, String)]: (2, spark), (1, end), (1, hello), (1, world)
  • 44. Spark: Frequency Example (Top N)
      RDD[(Int, String)]: (1, end), (1, hello), (2, spark), (1, world)
      .top(N) computes a local top N on each partition, followed by a reduction
      Array[(Int, String)]: (2, spark), (1, end)
        // Step 6. Swap tuples (complete code)
        val top = freq.map(_.swap).top(N)
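The "local top N, then reduction" scheme can be sketched on local collections. This is an illustration only: nested `Seq`s stand in for partitions, and `TopNSketch`, `localTop`, and `top` are hypothetical helpers rather than Spark API. Ties between equal counts are broken by the word, because tuples are compared element by element.

```scala
object TopNSketch {
  type Count = (Int, String)

  // Top n of one collection, largest first (tuples order by count, then word).
  def localTop(part: Seq[Count], n: Int): Seq[Count] =
    part.sorted(Ordering[Count].reverse).take(n)

  // Each partition reports only its own top n; the small results are then merged.
  def top(partitions: Seq[Seq[Count]], n: Int): Seq[Count] =
    localTop(partitions.flatMap(localTop(_, n)), n)

  // Hypothetical partitioning of the swapped (count, word) pairs.
  val partitions: Seq[Seq[Count]] = Seq(
    Seq(1 -> "end", 1 -> "hello"),
    Seq(2 -> "spark", 1 -> "world")
  )
}
```

Because every partition contributes at most n candidates, the merge step handles only a small amount of data, and the result matches a global sort followed by `take(n)`.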
  • 45. Spark: Frequency Example
        import org.apache.spark.SparkContext
        import org.apache.spark.SparkContext._  // pair-RDD implicits (needed before Spark 1.3)

        val spark = new SparkContext()
        // Create RDD from Hadoop text file
        val docs = spark.textFile("hdfs://docs/")
        // Split lines into words and process
        val lower = docs.map(line => line.toLowerCase)
        val words = lower.flatMap(line => line.split("\\s+"))
        val counts = words.map(word => (word, 1))
        // Count all words
        val freq = counts.reduceByKey(_ + _)
        // Swap tuples and get top results (N is the number of words to keep)
        val top = freq.map(_.swap).top(N)
        top.foreach(println)
  • 46. 2 Agenda 24.11.2014 • Why? • How? • What else? • Who? • Future?
  • 47. 2 Spark: Streaming 24.11.2014 • Real-time computation • Similar to Apache Storm… • Streaming input split into sliding windows of RDD‘s • Input distributed to memory for fault tolerance • Supports input from Kafka, Flume, ZeroMQ, HDFS, S3, Kinesis, Twitter, …
  • 48. Spark: Streaming [Figures: Discretized Stream; Windowed Computations]
  • 49. Spark: Streaming
      TwitterUtils.createStream()
        .filter(_.getText.contains("Spark"))
        .countByWindow(Seconds(5))
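The idea behind a windowed count over a discretized stream can be sketched without Spark Streaming. In the sketch below each inner Seq stands in for one micro-batch; WindowSketch and countMatchesPerWindow are hypothetical names, and window length and slide are given in numbers of batches rather than as durations.

```scala
object WindowSketch {
  // batches:   one Seq of messages per micro-batch, in arrival order
  // keyword:   only messages containing this text are counted
  // windowLen: how many consecutive batches form one window
  // slide:     how many batches the window advances each step
  def countMatchesPerWindow(batches: Seq[Seq[String]], keyword: String,
                            windowLen: Int, slide: Int): List[Int] = {
    // Filter and count inside each batch first ...
    val perBatch = batches.map(_.count(_.contains(keyword)))
    // ... then sum the per-batch counts over each sliding window.
    perBatch.sliding(windowLen, slide).map(_.sum).toList
  }
}
```

Counting per batch before summing per window mirrors the way a stream of small RDDs lets a windowed result be assembled from results that were already computed for individual batches.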
  • 50. Spark: SQL • Spark SQL allows relational queries expressed in SQL, HiveQL or Scala • Uses SchemaRDDs composed of Row objects (= table in a traditional RDBMS) • A SchemaRDD can be created from • an existing RDD • a Parquet file • a JSON dataset • a HiveQL query against data stored in Apache Hive • Supports a domain-specific language for writing queries
  • 51. Spark: SQL
      registerFunction("LEN", (_: String).length)
      val queryRdd = sql("""
        SELECT * FROM counts
        WHERE LEN(word) = 10
        ORDER BY total DESC
        LIMIT 10
      """)
      queryRdd
        .map(c => s"word: ${c(0)} \t| total: ${c(1)}")
        .collect()
        .foreach(println)
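To make clear what the query on slide 51 computes, here is the same logic written against a plain Scala collection. This is only a sketch: the case class WordCount, the object QuerySketch and the method longWordsByTotal are assumptions introduced for illustration; only the column names word and total come from the slide.

```scala
object QuerySketch {
  // One row of the "counts" table from the slide.
  final case class WordCount(word: String, total: Int)

  // WHERE LEN(word) = len  ORDER BY total DESC  LIMIT limit
  def longWordsByTotal(counts: Seq[WordCount], len: Int = 10, limit: Int = 10): Seq[WordCount] =
    counts
      .filter(_.word.length == len) // WHERE LEN(word) = len
      .sortBy(c => -c.total)        // ORDER BY total DESC
      .take(limit)                  // LIMIT limit
}
```

The SQL form has the advantage that the engine, rather than the programmer, decides how to execute the filter, sort and limit.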
  • 52. Spark: GraphX • GraphX is the Spark API for graphs and graph-parallel computation • APIs to join and traverse graphs • Optimally partitions and indexes vertices and edges (represented as RDDs) • Supports PageRank, connected components, triangle counting, …
  • 53. Spark: GraphX
      val graph = Graph(userIdRDD, assocRDD)
      val ranks = graph.pageRank(0.0001).vertices
      val userRDD = sc.textFile("graphx/data/users.txt")
      val users = userRDD.map { line =>
        val fields = line.split(",")
        (fields(0).toLong, fields(1))
      }
      val ranksByUsername = users.join(ranks).map {
        case (id, (username, rank)) => (username, rank)
      }
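The nested pattern case (id, (username, rank)) on slide 53 follows from the shape of a key-based join: each result pairs the key with a tuple of the left and right values. A minimal sketch on plain Scala collections, where JoinSketch and joinByKey are illustrative names and not part of Spark or GraphX:

```scala
object JoinSketch {
  // Inner join of two keyed collections:
  // (K, A) joined with (K, B) gives (K, (A, B)) for every pair of matching keys.
  def joinByKey[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
    for {
      (k, a)  <- left
      (k2, b) <- right
      if k == k2
    } yield (k, (a, b))
}
```

With made-up data, joinByKey(Seq((1L, "alice")), Seq((1L, 1.3))) produces Seq((1L, ("alice", 1.3))), which is the shape the slide's pattern match takes apart to keep only the username and rank.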
  • 54. Spark: MLlib • Machine learning library similar to Apache Mahout • Supports statistics, regression, decision trees, clustering, PCA, gradient descent, … • Iterative algorithms run much faster due to in-memory processing
  • 55. Spark: MLlib
      val data = sc.textFile("data.txt")
      val parsedData = data.map { line =>
        val parts = line.split(',')
        LabeledPoint(
          parts(0).toDouble,
          Vectors.dense(parts(1).split(' ').map(_.toDouble))
        )
      }
      val model = LinearRegressionWithSGD.train(parsedData, 100)
      val valuesAndPreds = parsedData.map { point =>
        val prediction = model.predict(point.features)
        (point.label, prediction)
      }
      val MSE = valuesAndPreds
        .map { case (v, p) => math.pow(v - p, 2) }
        .mean()
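The last step on slide 55 computes the mean squared error of the model: the average of the squared differences between each label v and its prediction p. The same formula on a local collection, as a small sketch (MseSketch and meanSquaredError are names chosen here for illustration):

```scala
object MseSketch {
  // Mean of (label - prediction)^2 over all (label, prediction) pairs.
  def meanSquaredError(valuesAndPreds: Seq[(Double, Double)]): Double = {
    require(valuesAndPreds.nonEmpty, "need at least one (label, prediction) pair")
    valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.sum / valuesAndPreds.size
  }
}
```

For the pairs (1.0, 1.0) and (2.0, 4.0) the squared errors are 0 and 4, so the result is 2.0. A lower value means the predictions sit closer to the labels.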
  • 56. Agenda • Why? • How? • What else? • Who? • Future?
  • 57. Use Case: Yahoo Native Ads. Logistic regression algorithm • 120 LOC in Spark/Scala • 30 min. of model creation for 100M samples and 13K features. Initial version launched within 2 hours of the Spark-on-YARN announcement • Compared with several days for hardware acquisition, system setup and data movement. http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
  • 58. Use Case: Yahoo Mobile Ads. Learn from mobile search ad click data • 600M labeled examples on HDFS • 100M sparse features. Spark programs for Gradient Boosting Decision Trees • 6 hours for model training with 100 workers • Model accuracy very close to that of heavily hand-tuned logistic regression models. http://de.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster
  • 59. Agenda • Why? • How? • What else? • Who? • Future?
  • 60. Spark-on-YARN (Current) [Diagram: Hadoop 2 stack with Spark as a YARN app. Components shown: HDFS (redundant, reliable storage); YARN (cluster resource management); MapReduce execution engine; Tez execution engine; Pig, Hive, …; Spark (in-memory); Storm (streaming); …]
  • 61. Spark-on-YARN (Future) [Diagram: Hadoop 2 stack with Spark as an execution engine. Components shown: HDFS; YARN; MapReduce execution engine; Tez execution engine; Spark execution engine; Slider; Pig, Hive, Mahout, …; Storm (streaming); …]
  • 62. Spark: Future work • Spark Core • Focus on maturity, optimization and pluggability • Enable long-running services (Slider) • Give resources back to the cluster when idle • Integrate with Hadoop enhancements • Timeline server • ORC file format • Spark ecosystem • Focus on adding capabilities
  • 63. One more thing… Let’s get started with Spark!
  • 64. Hortonworks Sandbox 2.2 http://hortonworks.com/hdp/downloads/
  • 65. Hortonworks Sandbox 2.2
      # 1. Download
      wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/spark/1.1.1/spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz
      # 2. Untar
      tar xvfz spark-1.1.0.2.1.5.0-701-bin-2.4.0.tgz
      # 3. Start Spark Shell
      ./bin/spark-shell
  • 66. Thanks for listening. Twitter: @uweseiler Mail: uwe.seiler@codecentric.de XING: https://www.xing.com/profile/Uwe_Seiler