SlideShare ist ein Scribd-Unternehmen logo
1 von 31
Spark in Action
Fast Big Data Analytics using Scala


Matei Zaharia
University of California, Berkeley

www.spark-project.org
                                     UC BERKELEY
My Background
Grad student in the AMP Lab at UC Berkeley
 » 50-person lab focusing on big data

Committer on Apache Hadoop
Started Spark in 2009 to provide a richer,
Hadoop-compatible computing engine
Spark Goals
Extend the MapReduce model to support more
types of applications efficiently
 » Spark can run 40x faster than Hadoop for iterative
   and interactive applications

Make jobs easier to program
 »Language-integrated API in Scala
 »Interactive use from Scala interpreter
Why go Beyond MapReduce?
MapReduce simplified big data analysis by giving
a reliable programming model for large clusters
But as soon as it got popular, users wanted more:
 » More complex, multi-stage applications
 » More interactive ad-hoc queries
Why go Beyond MapReduce?
Complex jobs and interactive queries both need
one thing that MapReduce lacks:
         Efficient primitives for data sharing

                                             Query 1
                          Stage 3
               Stage 2
     Stage 1




                                             Query 2

                                             Query 3

    Iterative algorithm             Interactive data mining
Why go Beyond MapReduce?
Complex jobs and interactive queries both need
one thing that MapReduce lacks:
         Efficient primitives for data sharing

                                             Query 1
                          Stage 3
               Stage 2
     Stage 1




In MapReduce, the only way to shareQuery 2 across
                                        data
    jobs is stable storage (e.g. HDFS) Query 3
                                       -> slow!
    Iterative algorithm             Interactive data mining
Examples
          HDFS             HDFS             HDFS             HDFS
          read             write            read             write
                 iter. 1                           iter. 2               . . .

  Input

           HDFS                    query 1                    result 1
           read
                                   query 2                    result 2


                                   query 3                    result 3
  Input
                                    . . .

 I/O and serialization can take 90% of the time
Goal: In-Memory Data Sharing

                iter. 1         iter. 2     . . .

  Input

                                query 1
           one-time
          processing
                                query 2

                                query 3
  Input           Distributed
                   memory        . . .

     10-100× faster than network and disk
Solution: Resilient
Distributed Datasets (RDDs)
Distributed collections of objects that can be
stored in memory for fast reuse
Automatically recover lost data on failure
Support a wide range of applications
Outline
Spark programming model
User applications
Implementation
Demo
What’s next
Programming Model
Resilient distributed datasets (RDDs)
 » Immutable, partitioned collections of objects
 » Can be cached in memory for efficient reuse

Transformations (e.g. map, filter, groupBy, join)
 » Build RDDs from other RDDs

Actions (e.g. count, collect, save)
 » Return a result or write it to storage
Example: Log Mining
 Load error messages from a log into memory, then
 interactively search for various patterns
                                          BaseTransformed RDD
                                               RDD                           Cache 1
lines = spark.textFile(“hdfs://...”)                                   Worker
                                                         results
errors = lines.filter(_.startsWith(“ERROR”))
messages = errors.map(_.split(„t‟)(2))                        tasks    Block 1
                                                  Driver
cachedMsgs = messages.cache()
                                                  Action
cachedMsgs.filter(_.contains(“foo”)).count
cachedMsgs.filter(_.contains(“bar”)).count                                  Cache 2
                                                                       Worker
. . .
                                                    Cache 3
                                               Worker                  Block 2
 Result: scaled tosearch of Wikipedia
         full-text 1 TB data in 5-7 sec
 in <1 sec (vs 20 for on-disk data)
     (vs 170 sec sec for on-disk data)         Block 3
RDD Fault Tolerance
RDDs track the series of transformations used to
build them (their lineage) to recompute lost data
E.g: messages      = textFile(...).filter(_.contains(“error”))
                                  .map(_.split(„t‟)(2))




    HadoopRDD               FilteredRDD             MappedRDD
     path = hdfs://…       func = _.contains(...)   func = _.split(…)
Example: Logistic Regression
Goal: find best line separating two sets of points

                              random initial line




                               target
Example: Logistic Regression
val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
Logistic Regression Performance
                   4000
                   3500
                                                             110 s / iteration
Running Time (s)




                   3000
                   2500
                   2000                                     Hadoop
                   1500                                     Spark
                   1000
                    500
                      0                                    first iteration 80 s
                                                          further iterations 6 s
                          1    5      10       20    30
                              Number of Iterations
Spark Users
User Applications
In-memory analytics on Hive data (Conviva)
Interactive queries on data streams (Quantifind)
Exploratory log analysis (Foursquare)
Traffic estimation w/ GPS data (Mobile Millennium)
Algorithms for DNA sequence analysis (SNAP)
...
Conviva GeoReport
 Hive                                 20

Spark       0.5
                                          Time (hours)
        0         5   10     15      20

Group aggregations on many keys with the same
WHERE clause
40× gain over Apache Hive comes from
avoiding repeated reading, deserialization and
filtering
Quantifind Stream Analysis

               Parsed    Extracted    Relevant
Data Feeds                                         Insights
             Documents    Entities   Time Series




Load new documents every few minutes
Compute an in-memory table of time series
Let users query interactively via web app
Implementation
Runs on Apache Mesos cluster
                                Spark Hadoop     MPI
manager to coexist w/                                  …
Hadoop                                   Mesos

Supports any Hadoop storage     Node     Node     Node
system (HDFS, HBase, …)
Easy local mode and EC2 launch scripts
No changes to Scala
Task Scheduler
Runs general DAGs          A:                  B:

Pipelines functions                                                  G:
within a stage        Stage 1           groupBy

Cache-aware data      C:           D:           F:

reuse & locality                 map
                                   E:                             join
Partitioning-aware
to avoid shuffles      Stage 2              union                    Stage 3


                                        = cached data partition
Language Integration
Scala closures are Serializable Java objects
 » Serialize on master, load & run on workers
Not quite enough
 » Nested closures may reference entire outer scope,
   pulling in non-Serializable variables not used inside
 » Solution: bytecode analysis + reflection
Interpreter integration
 » Some magic tracks variables, defs, etc that each line
   depends on and automatically ships them to workers
Demo
What’s Next?
Hive on Spark (Shark)
Compatible port of the SQL-on-Hadoop engine
that can run 40x faster on existing Hive data
Scala UDFs for statistics and machine learning
Alpha coming really soon
Streaming Spark
Extend Spark to perform streaming computations
Run as a series of small (~1 s) batch jobs, keeping
state in memory as fault-tolerant RDDs
Alpha expected by June
                                     map   reduceByWindow

tweetStream                    T=1
 .flatMap(_.toLower.split)
 .map(word => (word, 1))
 .reduceByWindow(5, _ + _)
                               T=2

                                              …
Conclusion
Spark offers a simple, efficient and powerful
programming model for a wide range of apps
Shark and Spark Streaming coming soon
Download and docs: www.spark-project.org


       @matei_zaharia / matei@berkeley.edu
Related Work
DryadLINQ
 » Build queries through language-integrated SQL
   operations on lazy datasets
 » Cannot have a dataset persist across queries
Relational databases
 » Lineage/provenance, logical logging, materialized views
Piccolo
 » Parallel programs with shared distributed hash tables;
   similar to distributed shared memory
Iterative MapReduce (Twister and HaLoop)
 » Cannot define multiple distributed datasets, run different
   map/reduce pairs on them, or query data interactively
Related Work
Distributed shared memory (DSM)
 » Very general model allowing random reads/writes, but hard
   to implement efficiently (needs logging or checkpointing)
RAMCloud
 » In-memory storage system for web applications
 » Allows random reads/writes and uses logging like DSM
Nectar
 » Caching system for DryadLINQ programs that can reuse
   intermediate results across jobs
 » Does not provide caching in memory, explicit support over
   which data is cached, or control over partitioning
SMR (functional Scala API for Hadoop)
Behavior with Not Enough RAM
                     100
                              68.8
Iteration time (s)




                                      58.1
                     80




                                                 40.7
                     60




                                                            29.7
                     40




                                                                     11.5
                     20
                      0
                            Cache     25%       50%        75%      Fully
                           disabled                                cached
                                      % of working set in cache

Weitere ähnliche Inhalte

Was ist angesagt?

dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...Bikash Chandra Karmokar
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementKyong-Ha Lee
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingMohammad Mustaqeem
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabVijay Srinivas Agneeswaran, Ph.D
 
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...Big Data Spain
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.Kyong-Ha Lee
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceHortonworks
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
Parallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyParallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyKyong-Ha Lee
 
Database Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureDatabase Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureKyong-Ha Lee
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide trainingSpark Summit
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingEd Kohlwey
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
Hot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkHot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkSupriya .
 

Was ist angesagt? (20)

dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud Computing
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
 
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduce
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Parallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyParallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A Survey
 
Database Research on Modern Computing Architecture
Database Research on Modern Computing ArchitectureDatabase Research on Modern Computing Architecture
Database Research on Modern Computing Architecture
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Map/Reduce intro
Map/Reduce introMap/Reduce intro
Map/Reduce intro
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel Processing
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Hot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkHot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark framework
 

Andere mochten auch

Introduction to scala for a c programmer
Introduction to scala for a c programmerIntroduction to scala for a c programmer
Introduction to scala for a c programmerGirish Kumar A L
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Denny Lee
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Databricks
 
Scala - A Scalable Language
Scala - A Scalable LanguageScala - A Scalable Language
Scala - A Scalable LanguageMario Gleichmann
 
Scala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and ImplementationsScala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and ImplementationsMICHRAFY MUSTAFA
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaAlexander Dean
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFramesJen Aman
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
A Basic Hive Inspection
A Basic Hive InspectionA Basic Hive Inspection
A Basic Hive InspectionLinda Tillman
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 
Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Martin Odersky
 
Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for TrainingBryan Yang
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 

Andere mochten auch (20)

Introduction to scala for a c programmer
Introduction to scala for a c programmerIntroduction to scala for a c programmer
Introduction to scala for a c programmer
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)
 
Apache hive
Apache hiveApache hive
Apache hive
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
 
Python to scala
Python to scalaPython to scala
Python to scala
 
Scala - A Scalable Language
Scala - A Scalable LanguageScala - A Scalable Language
Scala - A Scalable Language
 
Indexed Hive
Indexed HiveIndexed Hive
Indexed Hive
 
Fun[ctional] spark with scala
Fun[ctional] spark with scalaFun[ctional] spark with scala
Fun[ctional] spark with scala
 
Scala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and ImplementationsScala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and Implementations
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in Scala
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFrames
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
A Basic Hive Inspection
A Basic Hive InspectionA Basic Hive Inspection
A Basic Hive Inspection
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009
 
Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for Training
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 

Ähnlich wie Zaharia spark-scala-days-2012

20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305mjfrankli
 
BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012
BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012
BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012Amazon Web Services
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptbhargavi804095
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part IIArjen de Vries
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEPaco Nathan
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 

Ähnlich wie Zaharia spark-scala-days-2012 (20)

20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
 
BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012
BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012
BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Spark training-in-bangalore
Spark training-in-bangaloreSpark training-in-bangalore
Spark training-in-bangalore
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
 
Scala+data
Scala+dataScala+data
Scala+data
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
hadoop.ppt
hadoop.ppthadoop.ppt
hadoop.ppt
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
 
Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013Big data analytics_beyond_hadoop_public_18_july_2013
Big data analytics_beyond_hadoop_public_18_july_2013
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 

Mehr von Skills Matter Talks

Jordan west real workscalazfinal2
Jordan west   real workscalazfinal2Jordan west   real workscalazfinal2
Jordan west real workscalazfinal2Skills Matter Talks
 
(Oleg zhurakousky)spring integration-scala-intro
(Oleg zhurakousky)spring integration-scala-intro(Oleg zhurakousky)spring integration-scala-intro
(Oleg zhurakousky)spring integration-scala-introSkills Matter Talks
 
SCALA DAYS 2012: Ben Parker on Interactivity - Anti-XML in Anger
SCALA DAYS 2012: Ben Parker on Interactivity - Anti-XML in AngerSCALA DAYS 2012: Ben Parker on Interactivity - Anti-XML in Anger
SCALA DAYS 2012: Ben Parker on Interactivity - Anti-XML in AngerSkills Matter Talks
 
Project kepler compile time metaprogramming for scala
Project kepler compile time metaprogramming for scalaProject kepler compile time metaprogramming for scala
Project kepler compile time metaprogramming for scalaSkills Matter Talks
 
CukeUp! 2012: Michael Nacos on Just enough infrastructure for product develop...
CukeUp! 2012: Michael Nacos on Just enough infrastructure for product develop...CukeUp! 2012: Michael Nacos on Just enough infrastructure for product develop...
CukeUp! 2012: Michael Nacos on Just enough infrastructure for product develop...Skills Matter Talks
 
Martin sustrik future_of_messaging
Martin sustrik future_of_messagingMartin sustrik future_of_messaging
Martin sustrik future_of_messagingSkills Matter Talks
 

Mehr von Skills Matter Talks (15)

Couch db skillsmatter-prognosql
Couch db skillsmatter-prognosqlCouch db skillsmatter-prognosql
Couch db skillsmatter-prognosql
 
Jordan west real workscalazfinal2
Jordan west   real workscalazfinal2Jordan west   real workscalazfinal2
Jordan west real workscalazfinal2
 
Cnc scala-presentation
Cnc scala-presentationCnc scala-presentation
Cnc scala-presentation
 
Arvindsujeeth scaladays12
Arvindsujeeth scaladays12Arvindsujeeth scaladays12
Arvindsujeeth scaladays12
 
(Oleg zhurakousky)spring integration-scala-intro
(Oleg zhurakousky)spring integration-scala-intro(Oleg zhurakousky)spring integration-scala-intro
(Oleg zhurakousky)spring integration-scala-intro
 
SCALA DAYS 2012: Ben Parker on Interactivity - Anti-XML in Anger
SCALA DAYS 2012: Ben Parker on Interactivity - Anti-XML in AngerSCALA DAYS 2012: Ben Parker on Interactivity - Anti-XML in Anger
SCALA DAYS 2012: Ben Parker on Interactivity - Anti-XML in Anger
 
Scala days mizushima
Scala days mizushimaScala days mizushima
Scala days mizushima
 
Real World Scalaz
Real World ScalazReal World Scalaz
Real World Scalaz
 
Project kepler compile time metaprogramming for scala
Project kepler compile time metaprogramming for scalaProject kepler compile time metaprogramming for scala
Project kepler compile time metaprogramming for scala
 
Test driven infrastructure
Test driven infrastructureTest driven infrastructure
Test driven infrastructure
 
CukeUp! 2012: Michael Nacos on Just enough infrastructure for product develop...
CukeUp! 2012: Michael Nacos on Just enough infrastructure for product develop...CukeUp! 2012: Michael Nacos on Just enough infrastructure for product develop...
CukeUp! 2012: Michael Nacos on Just enough infrastructure for product develop...
 
Prediction suretogowrong
Prediction suretogowrongPrediction suretogowrong
Prediction suretogowrong
 
Tmt predictions 2011
Tmt predictions 2011Tmt predictions 2011
Tmt predictions 2011
 
Martin sustrik future_of_messaging
Martin sustrik future_of_messagingMartin sustrik future_of_messaging
Martin sustrik future_of_messaging
 
Marek pubsubhuddle realtime_web
Marek pubsubhuddle realtime_webMarek pubsubhuddle realtime_web
Marek pubsubhuddle realtime_web
 

Kürzlich hochgeladen

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 

Kürzlich hochgeladen (20)

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 

Zaharia spark-scala-days-2012

  • 1. Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark-project.org UC BERKELEY
  • 2. My Background Grad student in the AMP Lab at UC Berkeley » 50-person lab focusing on big data Committer on Apache Hadoop Started Spark in 2009 to provide a richer, Hadoop-compatible computing engine
  • 3. Spark Goals Extend the MapReduce model to support more types of applications efficiently » Spark can run 40x faster than Hadoop for iterative and interactive applications Make jobs easier to program »Language-integrated API in Scala »Interactive use from Scala interpreter
  • 4. Why go Beyond MapReduce? MapReduce simplified big data analysis by giving a reliable programming model for large clusters But as soon as it got popular, users wanted more: » More complex, multi-stage applications » More interactive ad-hoc queries
  • 5. Why go Beyond MapReduce? Complex jobs and interactive queries both need one thing that MapReduce lacks: Efficient primitives for data sharing Query 1 Stage 3 Stage 2 Stage 1 Query 2 Query 3 Iterative algorithm Interactive data mining
  • 6. Why go Beyond MapReduce? Complex jobs and interactive queries both need one thing that MapReduce lacks: Efficient primitives for data sharing Query 1 Stage 3 Stage 2 Stage 1 In MapReduce, the only way to shareQuery 2 across data jobs is stable storage (e.g. HDFS) Query 3 -> slow! Iterative algorithm Interactive data mining
  • 7. Examples HDFS HDFS HDFS HDFS read write read write iter. 1 iter. 2 . . . Input HDFS query 1 result 1 read query 2 result 2 query 3 result 3 Input . . . I/O and serialization can take 90% of the time
  • 8. Goal: In-Memory Data Sharing iter. 1 iter. 2 . . . Input query 1 one-time processing query 2 query 3 Input Distributed memory . . . 10-100× faster than network and disk
  • 9. Solution: Resilient Distributed Datasets (RDDs) Distributed collections of objects that can be stored in memory for fast reuse Automatically recover lost data on failure Support a wide range of applications
  • 10. Outline Spark programming model User applications Implementation Demo What’s next
  • 11. Programming Model Resilient distributed datasets (RDDs) » Immutable, partitioned collections of objects » Can be cached in memory for efficient reuse Transformations (e.g. map, filter, groupBy, join) » Build RDDs from other RDDs Actions (e.g. count, collect, save) » Return a result or write it to storage
  • 12. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns BaseTransformed RDD RDD Cache 1 lines = spark.textFile(“hdfs://...”) Worker results errors = lines.filter(_.startsWith(“ERROR”)) messages = errors.map(_.split(„t‟)(2)) tasks Block 1 Driver cachedMsgs = messages.cache() Action cachedMsgs.filter(_.contains(“foo”)).count cachedMsgs.filter(_.contains(“bar”)).count Cache 2 Worker . . . Cache 3 Worker Block 2 Result: scaled tosearch of Wikipedia full-text 1 TB data in 5-7 sec in <1 sec (vs 20 for on-disk data) (vs 170 sec sec for on-disk data) Block 3
  • 13. RDD Fault Tolerance RDDs track the series of transformations used to build them (their lineage) to recompute lost data E.g: messages = textFile(...).filter(_.contains(“error”)) .map(_.split(„t‟)(2)) HadoopRDD FilteredRDD MappedRDD path = hdfs://… func = _.contains(...) func = _.split(…)
  • 14. Example: Logistic Regression Goal: find best line separating two sets of points random initial line target
  • 15. Example: Logistic Regression val data = spark.textFile(...).map(readPoint).cache() var w = Vector.random(D) for (i <- 1 to ITERATIONS) { val gradient = data.map(p => (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x ).reduce(_ + _) w -= gradient } println("Final w: " + w)
  • 16. Logistic Regression Performance 4000 3500 110 s / iteration Running Time (s) 3000 2500 2000 Hadoop 1500 Spark 1000 500 0 first iteration 80 s further iterations 6 s 1 5 10 20 30 Number of Iterations
  • 18. User Applications In-memory analytics on Hive data (Conviva) Interactive queries on data streams (Quantifind) Exploratory log analysis (Foursquare) Traffic estimation w/ GPS data (Mobile Millennium) Algorithms for DNA sequence analysis (SNAP) ...
  • 19. Conviva GeoReport Hive 20 Spark 0.5 Time (hours) 0 5 10 15 20 Group aggregations on many keys with the same WHERE clause 40× gain over Apache Hive comes from avoiding repeated reading, deserialization and filtering
  • 20. Quantifind Stream Analysis Parsed Extracted Relevant Data Feeds Insights Documents Entities Time Series Load new documents every few minutes Compute an in-memory table of time series Let users query interactively via web app
  • 21. Implementation Runs on Apache Mesos cluster Spark Hadoop MPI manager to coexist w/ … Hadoop Mesos Supports any Hadoop storage Node Node Node system (HDFS, HBase, …) Easy local mode and EC2 launch scripts No changes to Scala
  • 22. Task Scheduler Runs general DAGs A: B: Pipelines functions G: within a stage Stage 1 groupBy Cache-aware data C: D: F: reuse & locality map E: join Partitioning-aware to avoid shuffles Stage 2 union Stage 3 = cached data partition
  • 23. Language Integration Scala closures are Serializable Java objects » Serialize on master, load & run on workers Not quite enough » Nested closures may reference entire outer scope, pulling in non-Serializable variables not used inside » Solution: bytecode analysis + reflection Interpreter integration » Some magic tracks variables, defs, etc that each line depends on and automatically ships them to workers
  • 24. Demo
  • 26. Hive on Spark (Shark) Compatible port of the SQL-on-Hadoop engine that can run 40x faster on existing Hive data Scala UDFs for statistics and machine learning Alpha coming really soon
  • 27. Streaming Spark Extend Spark to perform streaming computations Run as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs Alpha expected by June map reduceByWindow tweetStream T=1 .flatMap(_.toLower.split) .map(word => (word, 1)) .reduceByWindow(5, _ + _) T=2 …
  • 28. Conclusion Spark offers a simple, efficient and powerful programming model for a wide range of apps Shark and Spark Streaming coming soon Download and docs: www.spark-project.org @matei_zaharia / matei@berkeley.edu
  • 29. Related Work DryadLINQ » Build queries through language-integrated SQL operations on lazy datasets » Cannot have a dataset persist across queries Relational databases » Lineage/provenance, logical logging, materialized views Piccolo » Parallel programs with shared distributed hash tables; similar to distributed shared memory Iterative MapReduce (Twister and HaLoop) » Cannot define multiple distributed datasets, run different map/reduce pairs on them, or query data interactively
  • 30. Related Work Distributed shared memory (DSM) » Very general model allowing random reads/writes, but hard to implement efficiently (needs logging or checkpointing) RAMCloud » In-memory storage system for web applications » Allows random reads/writes and uses logging like DSM Nectar » Caching system for DryadLINQ programs that can reuse intermediate results across jobs » Does not provide caching in memory, explicit support over which data is cached, or control over partitioning SMR (functional Scala API for Hadoop)
  • 31. Behavior with Not Enough RAM 100 68.8 Iteration time (s) 58.1 80 40.7 60 29.7 40 11.5 20 0 Cache 25% 50% 75% Fully disabled cached % of working set in cache

Hinweis der Redaktion

  1. Point out that Scala is a modern PL etcMention DryadLINQ (but we go beyond it with RDDs)Point out that interactive use and iterative use go hand in hand because both require small tasks and dataset reuse
  2. Each iteration is, for example, a MapReduce job
  3. RDDs = first-class way to manipulate and persist intermediate datasets
  4. You write a single program  similar to DryadLINQDistributed data sets with parallel operations on them are pretty standard; the new thing is that they can be reused across opsVariables in the driver program can be used in parallel ops; accumulators useful for sending information back, cached vars are an optimizationMention cached vars useful for some workloads that won’t be shown hereMention it’s all designed to be easy to distribute in a fault-tolerant fashion
  5. Key idea: add “variables” to the “functions” in functional programming
  6. Note that dataset is reused on each gradient computation
  7. Key idea: add “variables” to the “functions” in functional programming
  8. 100 GB of data on 50 m1.xlarge EC2 machines
  9. Mention it’s designed to be fault-tolerant
  10. NOT a modified versionof Hadoop