Andrei Avramescu
Radu Chilom
xPatterns on Spark, Tachyon and
Mesos
2
• Introduction
• Spark
• SparkSQL
• Tachyon
• Mesos
• Lessons learned
• Demo: xPatterns
• Q & A
Agenda
3
o Big data analytics / machine learning
o Offices in Seattle and Timisoara
o 5+ years with Hadoop ecosystem
o 1 year with Spark
4
• “fast and general engine for large-scale data
processing”
• open source
• APIs for Java/Scala/Python (80 operators)
• not bound to the map-reduce paradigm
• powers a stack of high-level tools including
Spark SQL, MLlib and Spark Streaming
Apache Spark
5
• Main entry point to Spark
• SparkConf: spark.app.name, spark.master, spark.serializer,
spark.cores.max, spark.task.cpus
SparkContext
val sc = new SparkContext("url", "name", "sparkHome", Seq("app.jar"))
o url: cluster URL, or local / local[N]
o name: app name
o sparkHome: Spark install path on cluster
o Seq("app.jar"): list of JARs with app code (to ship)
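The same context can be built through a SparkConf carrying the settings listed above. A minimal sketch; the master URL, core count and jar name are placeholder values, not settings from the talk:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: placeholder values for the configuration keys listed above
val conf = new SparkConf()
  .setAppName("name")                             // spark.app.name
  .setMaster("spark://host:7077")                 // spark.master (or "local[N]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.cores.max", "16")
  .set("spark.task.cpus", "1")
  .setJars(Seq("app.jar"))                        // app code to ship to the cluster
val sc = new SparkContext(conf)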
6
Resilient Distributed Dataset
• Immutable collection of elements partitioned
across the cluster, stored in RAM or on Disk
• Built through parallel transformations
• Automatically rebuilt on failure ( lineage )
Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
Key Concept: RDDs
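Transformations are lazy and only run when an action is called; a small sketch (hypothetical HDFS path, reusing the sc from the previous slide) showing the two operation types:

val lines  = sc.textFile("hdfs://...")              // base RDD, nothing computed yet
val errors = lines.filter(_.contains("ERROR"))      // transformation: lazy
val count  = errors.count()                         // action: triggers the actual job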
7
Parallelize collection into an RDD
> sc.parallelize(List(1, 2, 3))
Load text file from local FS, HDFS, or S3
> sc.textFile("test.txt")
> sc.textFile("textDir/*.txt")
> sc.textFile("hdfs://...")
Use existing Hadoop InputFormat (Java/Scala only)
> sc.hadoopFile(keyClass, valClass, inputFmt, conf)
Creating RDDs
8
> val nums = sc.parallelize(List(1, 2, 3))
Pass each element through a function
> val squares = nums.map(x => x * x) // {1, 4, 9}
Keep elements passing a predicate
> val even = squares.filter(x => x % 2 == 0) // {4}
Retrieve RDD contents as a local collection
> nums.collect() // => Array(1, 2, 3)
Return first K elements
> nums.take(2) // => Array(1, 2)
Count number of elements
> nums.count() // => 3
Basic Transformations
Basic Actions
9
RDD Persistence
• persist() or cache()
• MEMORY_ONLY, MEMORY_AND_DISK
• MEMORY_ONLY_SER, MEMORY_AND_DISK_SER
• DISK_ONLY
• OFF_HEAP
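A small sketch (hypothetical input path) of choosing an explicit storage level instead of the MEMORY_ONLY default used by cache():

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://...")               // hypothetical input
logs.persist(StorageLevel.MEMORY_AND_DISK_SER)     // keep serialized partitions, spill to disk when RAM is short
logs.count()                                       // first action materializes the cache
logs.unpersist()                                   // drop it when no longer needed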
10
RDD Fault Tolerance
• RDDs maintain lineage information that can be used
to reconstruct lost partitions
• Ex: val cachedMsgs = sc.textFile(...).filter(_.contains("error"))
.map(_.split('\t')(2))
.cache()
Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD
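The lineage chain of any RDD can be inspected from the shell; a quick sketch using the cachedMsgs RDD from the example above:

// Print the chain of parent RDDs (the lineage) Spark would replay to rebuild a lost partition
println(cachedMsgs.toDebugString)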
11
Example: Log Mining
• Load error messages from a log into memory,
then interactively search for various patterns
val lines = sc.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()
cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .
[Diagram: the driver sends tasks to the workers; each worker reads one HDFS block, caches its partition of the transformed RDD (Cache 1–3), and returns results for every subsequent parallel operation on the cached RDD]
12
Shared Variables
• Broadcast Variables
• Accumulators
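A brief sketch of both kinds of shared variables with the Spark 1.x API (the lookup map and the "bad records" counter are hypothetical examples):

// Broadcast: read-only data shipped once per worker instead of once per task
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: tasks only add to it, the driver reads the final value
val badRecords = sc.accumulator(0)

val data = sc.parallelize(Seq("a", "b", "x"))
val resolved = data.map { k =>
  if (!lookup.value.contains(k)) badRecords += 1
  lookup.value.getOrElse(k, -1)
}
resolved.collect()
println(badRecords.value)   // number of keys missing from the broadcast map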
13
• Data Serialization
o Java serialization (default)
o Kryo serialization
Spark tuning
14
Spark tuning
Kryo serialization: spark.kryoserializer.buffer.mb (see the sketch below)
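A configuration sketch for switching to Kryo; the 64 MB buffer is an arbitrary example value:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "64")   // must be large enough for the biggest object you serialize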
15
Spark tuning
• Level of Parallelism
o number of partitions
o "reduce" operations default to the largest parent RDD's number of
partitions
o spark.default.parallelism
• Memory Usage of Reduce Tasks
• Broadcasting Large Variables
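A sketch of the two common ways to raise parallelism on wide operations; the partition count of 200 and the word-count pipeline are example values:

// Per-operation: pass an explicit partition count to the shuffle operator
val pairs  = sc.textFile("hdfs://...").flatMap(_.split(" ")).map((_, 1))   // hypothetical (word, 1) pairs
val counts = pairs.reduceByKey(_ + _, 200)                                 // 200 reduce partitions for this shuffle

// Cluster-wide default for shuffles that don't get an explicit count
val conf = new org.apache.spark.SparkConf().set("spark.default.parallelism", "200")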
16
• Improved running time, especially for iterative and interactive workloads (data cached in memory)
Spark vs Hadoop
17
Spark vs Hadoop (word count)
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
18
val sc = new SparkContext("spark://...", "MyJob", home, jars)
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
LOC : 6(11) vs 35
Spark vs Hadoop (word count)
19
• Many more operators than map-reduce
• Hadoop has a bigger and older community
• They happily coexist
Spark vs Hadoop
20
• Shark modified the Hive backend to run over
Spark, but had drawbacks:
 Limited integration with Spark programs
 Hive optimizer not designed for Spark
Spark SQL (alpha)
21
• Spark SQL reuses the best parts of Shark
 Hive data loading
 In-memory column store
Spark SQL (alpha)
22
• Adds
 Support for multiple input formats
 Rich language interfaces
 RDD-aware optimizer
Spark SQL (alpha)
23
• SchemaRDD
 Row objects
 Schema
• Row objects can be:
 Case Classes (Scala)
 Beans (Java)
Spark SQL (alpha)
24
Create SQL Context
val sqlContext = new SQLContext(sparkContext)
Create people RDD and register table
import sqlContext.createSchemaRDD   // implicit conversion RDD[Person] -> SchemaRDD
// assumes: case class Person(name: String, age: Int, height: Double)
val people = sparkContext
.textFile("examples/src/main/resources/people.txt").map(_.split(","))
.map(p => Person(p(0), p(1).trim.toInt, p(2).trim.toDouble))
people.registerTempTable("people")
Query table
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 23")
Spark SQL (alpha)
people.txt:
Radu,24,1.70
Andrei,23,1.88
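The query result is itself a SchemaRDD, so it can be consumed with normal RDD operations; a short follow-up sketch:

teenagers.map(t => "Name: " + t(0)).collect().foreach(println)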
25
• Running time improvement of 4x – 30x
• Bucketing
• Bucket Joins
• Skew Joins
• Partial DAG Execution
SparkSQL vs Hive
26
• 0.8.0 – first POC … lots of OOM
• 0.8.1 – first production deployment, still lots of OOM
 20 billion healthcare records, 200 TB of compressed HDFS data
 Hadoop MR: 100 m1.xlarge (4c x 15 GB)
 BDAS: 20 cc2.8xlarge (32c x 60.8 GB), still lots of OOM on the map & reducer side
 Perf gains of 4x to 40x, required individual dataset and query fine-tuning
 Mixed Hive & Shark workloads where it made sense
 Daily processing reduced from 14 hours to 1.5 hours!
• 0.9.0 – fixed many of the problems, but still required patches; spilling on the
reducer side fixed (fewer OOMs)
• 1.0.2 – in production today
• 1.1 upgrade in progress
Spark 0.8.0 to 1.1
27
• Cluster resource manager
• Multi-resource scheduling (memory, CPU, disk, and
ports)
• Scalability to 10,000s of nodes
• Fault-tolerant replicated master and slaves using
ZooKeeper
Mesos (0.20)
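A configuration sketch, under the assumption of a Spark 1.x driver pointed at a Mesos master; host, port, app name and the executor URI are placeholders, not values from the talk:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("mesos://mesos-master:5050")                   // Mesos master instead of a standalone Spark master
  .setAppName("xpatterns-job")                              // hypothetical app name
  .set("spark.executor.uri", "hdfs://.../spark-1.0.2.tgz")  // Spark distribution fetched by each Mesos slave
  .set("spark.mesos.coarse", "false")                       // fine-grained mode: one Mesos task per Spark task
val sc = new SparkContext(conf)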
28
• Memory-centric distributed file system enabling
reliable file sharing at memory speed across cluster
frameworks
• Pluggable under-layer file system: HDFS, S3, local file
system, …
Tachyon (v0.5)
29
• Java like File API / FileSystem API
• Configurable block size
• Memory management
Tachyon (v0.5)
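Two ways Spark 1.x could use Tachyon, sketched below with placeholder host, port and paths: reading files through the tachyon:// scheme, and keeping RDD blocks off-heap in Tachyon via the OFF_HEAP storage level (configured through spark.tachyonStore.url on the SparkConf before the context starts):

import org.apache.spark.storage.StorageLevel

// Files go through Tachyon's Hadoop-compatible client via the tachyon:// scheme
val events = sc.textFile("tachyon://tachyon-master:19998/data/events")

// RDD blocks stored off-heap in Tachyon: they survive executor crashes and reduce GC pressure
events.persist(StorageLevel.OFF_HEAP)
events.count()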
30
• Jaws, the xPatterns HTTP Spark SQL server!
http://github.com/Atigeo/http-spark-sql-server
 Backward compatible with Shark
 Backend in spray.io (REST on Akka)
• Spark Job Server
 multiple Spark contexts in the same JVM, job submission in Java + Scala
 https://github.com/Atigeo/spark-job-rest
• Mesos framework starvation bug
• *SchedulerBackend update due to race conditions, Spark 0.9.0
patches
Community contribution
31
• Read the papers
• Fine-tuning can really boost your running time
• When using Spark, don't think in map-reduce terms
Lessons learned
32
Demo
33
Q & A
34
Apache Spark
https://spark.apache.org/
Parallel programming with Spark
http://ampcamp.berkeley.edu/wp-
content/uploads/2013/02/Parallel-Programming-With-Spark-
Matei-Zaharia-Strata-2013.pdf
Introduction to spark internals
spark.apache.org/talks/dev-meetup-dec-2012.pptx
Bibliography
© 2013 Atigeo, LLC. All rights reserved. Atigeo and the xPatterns logo are trademarks of Atigeo. The information herein is for informational purposes only and represents the current view of Atigeo as of the date of this
presentation. Because Atigeo must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Atigeo, and Atigeo cannot guarantee the accuracy of any information provided
after the date of this presentation. ATIGEO MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Editor's notes

  1. Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. Lineage: logging the transformations used to build a dataset. HDFS: one block per partition. Storage: in memory serialized (Tachyon) / deserialized (in the JVM) / on disk (HDFS).