UC BERKELEY
It’s All Happening On-line

Every:
    Click
    Ad impression
    Billing event
    Fast Forward, pause,…
    Friend Request
    Transaction
    Network message
    Fault
    …

User Generated (Web, Social & Mobile)
Internet of Things / M2M
Scientific Computing
Volume:     Petabytes+
Variety:    Unstructured
Velocity:   Real-Time

Our view: More data should mean better answers


    • Must balance Cost, Time, and Answer Quality



Algorithms: Machine Learning and Analytics
Machines: Cloud Computing
People: CrowdSourcing & Human Computation

Massive and Diverse Data

throughout the entire analytics lifecycle
Alex Bayen (Mobile Sensing)       Anthony Joseph (Sec./ Privacy)
   Ken Goldberg (Crowdsourcing)      Randy Katz (Systems)
   *Michael Franklin (Databases)     Dave Patterson (Systems)
   Armando Fox (Systems)             *Ion Stoica (Systems)
   *Mike Jordan (Machine Learning)   Scott Shenker (Networking)



Organized for Collaboration:




> 450,000 downloads
• Sequencing costs (150X): Big Data
  [Chart: $K per genome, log scale from $0.1 to $100,000.0, 2001 - 2014]

• UCSF cancer researchers + UCSC cancer genetic database + AMP Lab + Intel Cluster
    @TCGA: 5 PB = 20 cancers x 1000 genomes

• See Dave Patterson’s Talk: Thursday 3-4, BDT205
        David Patterson, “Computer Scientists May Have What It Takes to Help Cure Cancer,” New York Times, 12/5/2011
[Diagram: software stack, listed top to bottom]

    MLBase (Declarative Machine Learning)
    BlinkDB (approx QP)
    Shark (SQL) + Streaming
    Spark | Streaming | Hadoop MR, MPI, Graphlab, etc.
    Shared RDDs (distributed memory)
    Mesos (cluster resource manager)
    HDFS

Legend: 3rd party, AMPLab (released), AMPLab (in progress)

Lightning-Fast Cluster Computing
    lines = spark.textFile("hdfs://...")            // Base RDD
    errors = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.cache()

    cachedMsgs.filter(_.contains("foo")).count      // Action
    cachedMsgs.filter(_.contains("bar")).count

[Diagram: the Driver sends tasks to three Workers and collects results; each Worker reads one block (Block 1-3) and keeps its part of the data in memory (Cache 1-3)]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
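
The caching behaviour in the log-mining example can be imitated in a few lines of plain Python. This is a minimal sketch, not Spark's API: the `LazyDataset` class, its `evaluations` counter, and the sample log lines are all hypothetical, and everything runs in one process. It only shows that transformations are deferred until an action such as `count` runs, and that a cached dataset is computed once and then reused by later queries.

```python
# Illustrative only: a tiny in-process stand-in for an RDD, not Spark's API.
class LazyDataset:
    def __init__(self, compute):
        self._compute = compute      # zero-argument function returning a list
        self._cached = None
        self._use_cache = False
        self.evaluations = 0         # how many times this dataset was computed

    def filter(self, pred):
        # Transformation: nothing runs until an action calls collect().
        return LazyDataset(lambda: [x for x in self.collect() if pred(x)])

    def map(self, fn):
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        self._use_cache = True       # only marks the dataset; no work yet
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached      # served from memory
        self.evaluations += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result

    def count(self):
        # Action: forces evaluation.
        return len(self.collect())


# Hypothetical log lines: level, timestamp, message, separated by tabs.
log = [
    "ERROR\t10:01\tfoo failed",
    "INFO\t10:02\tall good",
    "ERROR\t10:03\tbar timed out",
    "ERROR\t10:04\tfoo and bar both failed",
]

lines = LazyDataset(lambda: log)
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
cached_msgs = messages.cache()

foo_count = cached_msgs.filter(lambda m: "foo" in m).count()
bar_count = cached_msgs.filter(lambda m: "bar" in m).count()
print(foo_count, bar_count, cached_msgs.evaluations)
```

The second query never touches the source data again, which is the effect the slide attributes to keeping the messages in worker memory.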
    messages = textFile(...).filter(_.contains("error"))
                            .map(_.split('\t')(2))

[Diagram: lineage of messages]

    HadoopRDD              ->  FilteredRDD                 ->  MappedRDD
    path = hdfs://…            func = _.contains(...)          func = _.split(…)
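
The lineage chain above can be sketched as a set of small objects that each remember their parent and their function. This is an illustration under stated assumptions, not Spark internals: the `SourceNode`, `FilteredNode` and `MappedNode` classes, the `lineage` helper, and the sample partitions are hypothetical. The point is that any single partition can be rebuilt by replaying the recorded functions on the matching source partition, without storing intermediate results.

```python
# Illustrative only: hypothetical classes, not Spark internals.
class SourceNode:
    def __init__(self, partitions):
        self.partitions = partitions      # stands in for blocks of a file
        self.parent = None

    def compute(self, i):
        return list(self.partitions[i])

    def describe(self):
        return "Source"


class FilteredNode:
    def __init__(self, parent, func):
        self.parent, self.func = parent, func

    def compute(self, i):
        # Recompute partition i from the parent's partition i.
        return [x for x in self.parent.compute(i) if self.func(x)]

    def describe(self):
        return "Filtered"


class MappedNode:
    def __init__(self, parent, func):
        self.parent, self.func = parent, func

    def compute(self, i):
        return [self.func(x) for x in self.parent.compute(i)]

    def describe(self):
        return "Mapped"


def lineage(node):
    """Walk parent links and return the chain as a readable string."""
    chain = []
    while node is not None:
        chain.append(node.describe())
        node = node.parent
    return " <- ".join(chain)


# Hypothetical input split into two partitions.
partitions = [
    ["error\ta\tdisk full", "info\tb\tok"],
    ["error\tc\tnet down"],
]
source = SourceNode(partitions)
messages = MappedNode(
    FilteredNode(source, lambda s: "error" in s),
    lambda s: s.split("\t")[2],
)

print(lineage(messages))
# If partition 1 were lost, only that partition needs to be recomputed.
print(messages.compute(1))
```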
[Figure: a random initial line and the target line]

Load data in memory once:
    map readPoint, cache
Initial parameter vector (w)
Repeated MapReduce steps to do gradient descent:
    map p => (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
    reduce _ + _
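
The map and reduce expressions above compute the gradient of the logistic loss for labels in {-1, +1}. A minimal plain-Python sketch of the same loop follows. It is not the talk's code: the toy data, the `train` function, the iteration count, and the implicit step size of 1 are assumptions made for illustration, and the data lives in an ordinary list rather than a cached distributed dataset.

```python
import math
import random
from functools import reduce


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def gradient(w, point):
    # Same expression as on the slide:
    # (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    x, y = point
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def train(points, iterations=100, seed=0):
    rng = random.Random(seed)
    dims = len(points[0][0])
    w = [rng.uniform(-1, 1) for _ in range(dims)]   # initial parameter vector
    for _ in range(iterations):
        # The dataset is reused on every gradient computation.
        grad = reduce(add, map(lambda p: gradient(w, p), points))
        w = [wi - gi for wi, gi in zip(w, grad)]    # assumed step size of 1
    return w


# Hypothetical, linearly separable toy data: (features, label in {-1, +1}).
points = [
    ([1.0, 2.0], 1),
    ([2.0, 1.5], 1),
    ([-1.0, -1.5], -1),
    ([-2.0, -1.0], -1),
]
w = train(points)
predictions = [1 if dot(w, x) > 0 else -1 for x, _ in points]
print(w, predictions)
```

Because `points` is read again in every iteration, keeping it in memory is what makes the later iterations cheap.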
[Chart: Running Time (min), 0 to 60, vs Number of Iterations, 1 to 30]
    Hadoop: 110 s / iteration
    Spark: first iteration 80 s, further iterations 1 s
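
The per-iteration figures on the chart translate into totals with simple arithmetic. The sketch below assumes a constant cost per iteration, which the chart labels suggest but do not state; the function names are made up for illustration.

```python
def hadoop_total_seconds(iterations, per_iteration=110):
    # Every iteration costs the same 110 s.
    return per_iteration * iterations


def spark_total_seconds(iterations, first=80, later=1):
    # The first iteration loads the data (80 s); later ones take 1 s each.
    return first + later * (iterations - 1)


for n in (1, 10, 20, 30):
    hadoop_min = hadoop_total_seconds(n) / 60
    spark_min = spark_total_seconds(n) / 60
    print(f"{n:2d} iterations: Hadoop {hadoop_min:5.1f} min, Spark {spark_min:4.1f} min")
```

At 30 iterations this gives 3300 s (55 min) for Hadoop against 109 s for Spark, consistent with a y-axis that tops out at 60 minutes.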
Java API (out now)

    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      public Boolean call(String s) {
        return s.contains("error");
      }
    }).count();

PySpark (coming soon)

    lines = sc.textFile(...)
    lines.filter(lambda x: 'error' in x).count()
[Chart: Time (hours), axis 0 to 20]
    Hive: 20
    Spark: 0.5
Client
                                 CLI          JDBC

                               Driver

Meta store      SQL       Query         Physical Plan
               Parser    Optimizer       Execution


                            MapReduce

                        HDFS
Client
                                 CLI          JDBC

                               Driver     Cache Mgr.

Meta store      SQL       Query         Physical Plan
               Parser    Optimizer       Execution


                               Spark

                        HDFS
Row Storage          Column Storage
1   john    4.1      1      2      3
2   mike    3.5      john   mike   sally
3   sally   6.4      4.1    3.5    6.4
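
The two layouts can be compared with a short Python sketch. The field names (`ids`, `names`, `scores`) are assumptions, since the slide shows no column headers, and plain lists stand in for whatever in-memory representation the real system uses.

```python
# Row layout: one tuple per record, as in the left-hand table.
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]

# Column layout: one list per field, as in the right-hand table.
ids, names, scores = (list(col) for col in zip(*rows))


def average_from_rows(records):
    # Has to walk every record even though only one field is needed.
    return sum(r[2] for r in records) / len(records)


def average_from_columns(score_column):
    # Reads just the one column it needs.
    return sum(score_column) / len(score_column)


print(names)
print(average_from_rows(rows), average_from_columns(scores))
```

Both functions return the same average; the column version simply reads one contiguous list of numbers instead of unpacking whole records.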
[Chart: Selection. Series: Shark, Shark (disk), Hive. Axis 0 to 100; labeled value: 1.1]
100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al)
[Chart: Group By. Series: Shark, Shark (disk), Hive. Axis 0 to 600; labeled value: 32]
100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al)
[Chart: Join. Series: Shark (copartitioned), Shark, Shark (disk), Hive. Axis 0 to 1800; labeled value: 105]
100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al)
[Charts: Query 1, Query 2, Query 3 on the Conviva dataset. Series: Shark, Shark (disk), Hive. Labeled values: 0.8 (Query 1), 0.7 (Query 2), 1.0 (Query 3)]
100 m2.4xlarge nodes, 1.7 TB
spark-project.org
amplab.cs.berkeley.edu

We are sincerely eager to hear your feedback on this presentation and on re:Invent.
Please fill out an evaluation form when you have a chance.

Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305


Editor's Notes

  1. Add “variables” to the “functions” in functional programming
  2. Note that dataset is reused on each gradient computation
  3. Key idea: add “variables” to the “functions” in functional programming
  4. This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)
  5. Query planning is also better in Shark due to (1) more optimizations and (2) use of more optimized Spark operators such as hash-based join
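
Note 5 mentions hash-based join as one of the more optimized operators. A minimal single-machine sketch of the general technique follows. It is not Shark's or Spark's implementation: the `hash_join` function and the sample `users` and `visits` tables are hypothetical.

```python
from collections import defaultdict


def hash_join(left, right):
    """Inner-join two lists of (key, value) pairs on key."""
    table = defaultdict(list)
    for key, value in left:              # build phase: hash the left side
        table[key].append(value)
    joined = []
    for key, value in right:             # probe phase: look up each right row
        for match in table.get(key, []):
            joined.append((key, match, value))
    return joined


# Hypothetical tables keyed by user id.
users = [(1, "john"), (2, "mike"), (3, "sally")]
visits = [(1, "/home"), (3, "/cart"), (3, "/pay"), (4, "/home")]

result = hash_join(users, visits)
print(result)
```

Each input is scanned once and neither side is sorted, which is the usual reason a hash join is preferred when one side fits in memory.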