MapReduce: A Useful Parallel Tool
that Still Has Room for Improvement


                   January 5, 2012

                Kyong-Ha Lee
         bart7449@gmail.com



Outline
Three topics that I will discuss :
♦   Anatomy of the MapReduce framework
    –   Basic principles about the MapReduce framework
    –   Not much discussion on implementation details, but will be
        happy to discuss them if there are any questions.
♦   A brief survey on the study of improving the
    conventional MapReduce framework
♦   Research projects ongoing at KAIST



Big Data
  ♦     A data set too large to work with using an on-hand DBMS
        on a single node


  ♦     Data growth challenges are defined as*
        –    Volume (increasing amount of data)
        –    Velocity (speed of data in/out)
        –    Variety (range of data types and sources)




* Doug Laney, "3D Data Management: Controlling Data Volume, Velocity and Variety", 2001

Importance and Impact
♦   "The data center is the computer. If MapReduce is the first
    instruction of the data center computer, I can't wait to
    see the rest of the instruction set, as well as the data
    center programming language, the data center operating
    system, the data center storage systems, and more."
     - David A. Patterson. Technical perspective: the data center is the
    computer. CACM, 51(1):105, 2008.
♦   A list of institutions that are using Hadoop, an open-source
    Java implementation of MapReduce
♦   Its scholarly impact!




[Figures as of Dec 31, 2011]
Usage Statistics Over Time at Google
                              Aug '04    Mar '06    Sep '07    Sep '09
Number of jobs                    29K       171K     2,217K     3,467K
Average completion time (s)       634        874        395        475
Machine years used                217      2,002     11,081     25,562
Input data read (TB)            3,288     52,254    403,152    544,130
Intermediate data (TB)            758      6,743     34,774     90,120
Output data written (TB)          193      2,970     14,018     57,520
Average worker machines           157        268        394        488

*source: J. Dean, Designs, Lessons and Advice from Building Large Distributed Systems, Keynote, LADIS 2009.



* Hadoop won first place in the GraySort benchmark, sorting 100 TB with over
3,800 nodes – "Winning a 60 Second Dash with a Yellow Elephant", http://sortbenchmark.org/Yahoo2009.pdf
Single Node Architecture


[Figure: a single node with its own CPU, memory, and disk]
Commodity Clusters
♦   Web data sets can be very large
    –   Tens to hundreds of terabytes
    –   At Facebook, almost 6TB of new log data is collected every day,
        with 1.7PB of log data accumulated over time*
        *source: A comparison of join algorithms for log processing in MapReduce, SIGMOD’10

♦   We cannot store and process data of that size on a single
    machine in a timely manner
♦   Standard architecture emerging:
    –   Cluster of commodity Linux nodes
    –   Gigabit Ethernet interconnects
♦   How to organize computations on this architecture?
    –   Mask issues such as hardware failure
Cluster Architecture
[Figure: two-level tree topology: nodes within a rack are connected by a
rack switch (1 Gbps between any pair of nodes in a rack), and racks are
connected by a backbone switch (8 Gbps backbone between racks); each node
has its own CPU, memory, and disk]

   Yahoo! cluster used for GraySort:
   •   Each rack contains 40 nodes
   •   2 quad-core Xeons @ 2.5 GHz per node
   •   8 GB RAM, 4 SATA HDDs per node
The Need for Stable Storage
♦   Problem: if nodes can fail, how can we store data
    persistently?
    –   Cheap nodes fail frequently, if you have many
        »   MTBF for 1 node = 3 years
        »   MTBF for 1,000 nodes = 1 day on average (see the sketch below)
    –   Putting fault-tolerance into the system
♦   Answer: Distributed File System
    –   Provides global file namespace
    –   Google GFS; Hadoop HDFS
    –   Typical usage pattern
        »   Huge files (100s of GB to TB)
        »   Data is rarely updated in place
        »   Reads and appends are common I/O patterns
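
A back-of-the-envelope check of the MTBF arithmetic above (a minimal Python
sketch, assuming independent node failures):

    # With a per-node MTBF of 3 years, a 1,000-node cluster sees
    # roughly one failure per day.
    node_mtbf_days = 3 * 365            # ~1,095 days per node
    nodes = 1000
    cluster_mtbf_days = node_mtbf_days / nodes
    print(f"~{cluster_mtbf_days:.1f} days between failures")   # ~1.1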
GFS Design

[Figure: GFS architecture: clients ask a single master for metadata and
exchange data directly with chunkservers]
♦   Master manages metadata
♦   Data transfers happen directly between clients/chunk servers
♦   Files broken into chunks (typically 64 MB)
♦   Data replication (typically 3 replicas, Primary copy)
♦   Immutable data blocks
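
A minimal Python sketch of the chunking and replication idea (illustrative
only; the chunkserver names are hypothetical, and real GFS placement is
rack-aware rather than random):

    import random

    CHUNK_SIZE = 64 * 1024 * 1024   # typical GFS chunk size: 64 MB
    REPLICAS = 3                    # typical replication factor

    def assign_chunks(file_size, chunkservers):
        # Split a file into fixed-size chunks and place each chunk's
        # replicas on distinct chunkservers.
        n_chunks = (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE
        return [(i, random.sample(chunkservers, REPLICAS))
                for i in range(n_chunks)]

    # A 200 MB file occupies 4 chunks, each stored on 3 of 5 servers:
    print(assign_chunks(200 * 2**20, ["cs1", "cs2", "cs3", "cs4", "cs5"]))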
Google Cluster Environment
♦   A cluster consists of 1000s of machines, typically with one or
    a handful of hardware configurations
♦   File system (GFS) + cluster scheduling system are core services
♦   Typically 100s to 1000s of active jobs (some w/1 task, some
    w/1000s)
♦   Mix of batch and low-latency, user-facing production jobs




Motivation behind MapReduce's Design
♦   Large-Scale Data Processing
    –   Want to use 1,000s of CPUs
        »   But don’t want hassle of managing things


♦   MapReduce Architecture provides
    –   Automatic parallelization & distribution
    –   Fault tolerance
    –   I/O scheduling
    –   Monitoring & status updates




What is MapReduce?
♦   Both a programming model and a framework for massively
    parallel processing of large datasets across many low-end nodes
    –   Popularized and controversially patented by Google Inc.
    –   Analogous to group-by aggregation in a DBMS
♦   Easy to distribute a job across nodes
    –   Implements data parallelism
♦   No hassle of managing jobs across nodes
♦   Nice retry/failure semantics
♦   Runtime scheduling with speculative execution

Programming model : Map/Reduce
♦   Input: a set of key/value pairs
♦   A user implements two functions:
    –   map(key1, value1) → (key2, value2)
    –   reduce(key2, list(value2)) → (key3, value3)
♦   (key2, value2) is an intermediate key/value pair
♦   Output is the set of (k3,v3) pairs

♦   Many problems can be phrased in this way
    –   but not all of them.



Data
♦   Input and final output are stored on DFS
    –   Scheduler tries to schedule map tasks "close" to the physical
        storage location of the input data
♦   Intermediate results are stored on local disks of map
    and reduce workers
♦   Outputs of a MR job often become inputs of another
    MR job




Parallel Execution across Nodes
1.   Partition input key/value pairs into chunks and then
     run map() tasks in parallel
2.   After all map()s are complete, consolidate all emitted
     values for each unique emitted key
3.   Now partition space of output map keys, and run
     reduce() in parallel
4.   In reduce(), values for each key are grouped together and
     then aggregated; the reduced output is stored on DFS
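
The partitioning in step 3 is typically done by hashing the intermediate
key; a minimal Python sketch (Hadoop's default HashPartitioner works
analogously on the key's hashCode):

    def partition(key, num_reducers):
        # All values with the same key land on the same reducer.
        return hash(key) % num_reducers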



Example : Word Count
map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
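
To make the pseudocode concrete, here is a minimal single-node Python
simulation of the map, group-by-key, and reduce phases (an illustrative
sketch, not the Hadoop API):

    from collections import defaultdict

    def run_mapreduce(map_fn, reduce_fn, inputs):
        # Map every input pair, group intermediate pairs by key
        # (the "shuffle"), then reduce each group.
        groups = defaultdict(list)
        for key, value in inputs:               # map phase
            for k2, v2 in map_fn(key, value):
                groups[k2].append(v2)           # group by key
        return [out for k2, vs in sorted(groups.items())
                    for out in reduce_fn(k2, vs)]   # reduce phase

    def wc_map(doc_name, text):
        for word in text.split():
            yield (word, 1)

    def wc_reduce(word, counts):
        yield (word, sum(counts))

    print(run_mapreduce(wc_map, wc_reduce, [("d1", "to be or not to be")]))
    # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]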

Execution: The Map Step
[Figure: map tasks transform input key-value pairs (k, v) into intermediate
key-value pairs]
Execution: The Reduce Step
[Figure: intermediate key-value pairs are grouped by key into key and
value-list groups; reduce tasks then turn each group into output
key-value pairs]
Combiner
♦   Often a map task will produce many pairs of the form
    (k,v1), (k,v2), … for the same key k
    –   E.g., popular words in Word Count

♦   It can save network time by pre-aggregating at the mapper
    –   combine(k1, list(v1)) → v2
    –   Usually the same as the reduce function

♦   Works only if the reduce function is commutative and
    associative
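
A minimal Python sketch of local pre-aggregation (illustrative; for Word
Count, the combiner reuses the reducer's logic):

    from collections import defaultdict

    def wc_combine(word, counts):
        # Safe as a combiner: addition is commutative and associative.
        yield (word, sum(counts))

    def combine_locally(mapper_output, combine_fn):
        # Pre-aggregate one mapper's output before it crosses the network.
        local = defaultdict(list)
        for k, v in mapper_output:
            local[k].append(v)
        return [out for k, vs in local.items() for out in combine_fn(k, vs)]

    # ("to", 1) appears twice, but only ("to", 2) is shuffled:
    print(combine_locally([("to", 1), ("be", 1), ("to", 1)], wc_combine))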



Example: Building an Inverted Index
♦   Input: (filename, text) records
♦   Output: list of files containing each word

♦   Map:
        for each word in text.split():
            emit(word, filename)

♦   Combine: uniquify filenames for each word

♦   Reduce:
        def reduce(word, filenames):
            output(word, sort(filenames))


Example with two documents:
•   hamlet.txt ("to be or not to be") → map emits:
    (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt)
•   12th.txt ("be not afraid of greatness") → map emits:
    (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt),
    (greatness, 12th.txt)
•   Reduce output: (afraid, (12th.txt)), (be, (12th.txt, hamlet.txt)),
    (greatness, (12th.txt)), (not, (12th.txt, hamlet.txt)), (of, (12th.txt)),
    (or, (hamlet.txt)), (to, (hamlet.txt))

*source: PARLab Parallel Boot Camp, Matei Zaharia
Distributed Execution Review
[Figure: input blocks (Block 1 ... Block n) are processed by parallel
Mappers, each running map, a local sort, and an optional Combiner;
intermediate results are materialized on local disks; after this barrier,
Reducers pull their partitions (copy/shuffle), merge them, and run reduce;
the output is written to DFS]
System Behavior on a Single Node




[Figure: system behavior of a MapReduce job on a single node over time]

*source: A Comparison of Join Algorithms for Log Processing in MapReduce, SIGMOD'10
Experimental Results




[Figure: experimental results]

*source: A Platform for Scalable One-Pass Analytics using MapReduce, SIGMOD'11
Fault Tolerance
♦   If tasks fail, they are executed again on another node
    –   Detect failure via periodic heartbeats
    –   Re-execute in-progress map tasks
    –   Re-execute in-progress reduce tasks
♦   If a node crashes:
    –   Re-launch its current tasks on other nodes
    –   Re-run any maps the node previously ran
        » Necessary because their output files were lost along with the
           crashed node
♦   If a task is going slowly (straggler):
    –   Launch a second copy of the task on another node ("speculative
        execution")
    –   Take the output of whichever copy finishes first, and kill the
        other
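
A minimal Python sketch of the speculative-execution idea (illustrative
only; a real scheduler such as Hadoop's launches a backup copy based on
task progress, not unconditionally):

    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    def run_speculatively(task, *args):
        # Run two copies of a suspected straggler and take whichever
        # finishes first; cancel() is best-effort and only stops a
        # copy that has not started running yet.
        with ThreadPoolExecutor(max_workers=2) as pool:
            copies = [pool.submit(task, *args) for _ in range(2)]
            done, pending = wait(copies, return_when=FIRST_COMPLETED)
            for f in pending:
                f.cancel()
            return next(iter(done)).result()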

Criticism
♦   D. DeWitt and M. Stonebraker harshly criticized MapReduce,
    calling it "a major step backwards"[5].
    –   They first regarded it as a simple Extract-Transform-Load tool.
♦   A technical comparison was done by Pavlo et al.[6]
    –   MapReduce was compared with a commercial row-wise DBMS and
        with Vertica
    –   This triggered technical debates between researchers and
        practitioners
♦   CACM welcomed the technical debate, inviting both sides in
    Communications of the ACM, Jan 2010[7,8]

[Figure: performance comparison]

*source: A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD'09
Advantages
♦   Simple and easy to use
    –   Users code only Map() and Reduce()
    –   Users need not consider how to distribute their job
♦   Flexible
    –   No data model, no schema
    –   Users can handle any irregular data with MapReduce
♦   Independent of the storage
♦   Fault tolerance
    –   Users need not worry about faults during a run
    –   On failure, a job does not restart from Map()
♦   High scalability
    –   Easy to scale out

Caveats
♦   A single fixed dataflow
♦   Lack of schema, index, and high-level language
    –   Requires data parsing and a full scan
    –   No separation of the schema from applications
♦   Sacrifice of disk I/O for fault-tolerance
    –   Materialization of intermediate results on local disks
    –   Three replicas on DFS
    –   I/O inefficient!
♦   Blocking operators
    –   Caused by merge-sort for grouping values
    –   Reduce begins only after all map tasks end
♦   A simple heuristic runtime scheduling with speculative execution
♦   Very young!
    –   Few third-party tools and low efficiency
A Short List of Related Studies
♦    Sacrifice of disk I/O for fault-tolerance
     –    Main difference against DBMS
♦    A single fixed dataflow
     –    Dryad, SCOPE, Nephele/PACT
     –    Map-Reduce-Merge for binary operators
     –    Twister and HaLoop for iterative workloads
     –    Map-Join-Reduce and some join techniques
♦    No schema
     –    Protocol Buffers, JSON, XML, ...
♦    No indexing
     –    HadoopDB, Hadoop++
♦    No high-level language
     –    Hive, Sawzall, SCOPE, Pig Latin, ..., Jaql, DryadLINQ
♦    Blocking operators
     –    MapReduce Online, Mortar
♦    A simple heuristic scheduling
     –    LATE, ...
♦    Relatively poor performance
     –    Adaptive and automatic performance tuning
     –    Work sharing/multiple jobs
          »    MRShare: multi-query processing
          »    Hive, Pig Latin
          »    fair/capacity sharing, ParaTimer
     –    Map-Join-Reduce
     –    Join algorithms in MapReduce [Blanas, SIGMOD'10]
♦    Co-work with other tools
     –    SQL/MapReduce, HadoopDB, Teradata EDW's Hadoop integration, ...
♦    DBMS based on MR
     –    Cheetah, Osprey, RICARDO (analytic tool)
♦    Other complements
     –    DREMEL, ...
A Brief Bibliographic Survey

•   We intend to assist DB and open source communities in
    understanding various technical aspects of the MapReduce
    framework
•   SIGMOD Record 40(4):11-20, Dec 2011
Summary
♦   MR is simple, but provides good scalability and fault-
    tolerance for massive data processing
♦   MR is unlikely to substitute for DBMSs
♦   MR complements DBMSs with scalable and flexible parallel
    processing for various data analyses
♦   The I/O efficiency of MapReduce still needs to be addressed
    for more successful applications
    –   due to sort-merge based grouping and frequent checkpoints
♦   Many application domains and room for improvement

Other Research Challenges and Issues
♦   Parallelizing conventional algorithms
    –   those that require filtering-then-aggregation
        »   But not good for ad-hoc queries
♦   Performance improvements
    –   Modern HW features are not yet well utilized
        »   Multi-core, GPGPU, SSD, etc.
    –   Some caveats still exist in the model
        »   iterative and incremental processing
    –   Self-tuning
        »   150+ tuning knobs in Hadoop
        »   Long-running analysis and batch processing


Thank you!
Questions or comments?




References
1.     David A. Patterson. Technical perspective: the data center is the computer. Communications of the ACM, 51(1):105,
       2008.
2.     Hadoop users list; http://wiki.apache.org/hadoop/PoweredBy
3.     Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, In Proceedings of
       OSDI 2004 and CACM, Vol. 51, No. 1, pp. 107-113, 2008
4.     S. Ghemawat et al. The Google File System, ACM SIGOPS Operating Systems Review, Vol. 37, No. 5, pp. 29-43,
       2003
5.     David J. DeWitt and Michael Stonebraker, MapReduce: a major step backwards, Database Column blog, 2008
6.     Andrew Pavlo et al. A Comparison of Approaches to Large-Scale Data Analysis, In Proceedings of SIGMOD 2009
7.     Michael Stonebraker et al. MapReduce and Parallel DBMSs: Friends or Foes?, Communications of the ACM, Vol.
       53, No. 1, pp. 64-71, Jan 2010
8.     Jeffrey Dean and Sanjay Ghemawat, MapReduce: A Flexible Data Processing Tool, Communications of the ACM,
       Vol. 53, No. 1, pp. 72-77, Jan 2010
9.     M. Stonebraker, The case for shared-nothing, Data Engineering Bulletin, 9(1):4-9, 1986
10.    D. DeWitt and J. Gray, Parallel database systems: the future of high performance database systems,
       Communications of the ACM, 35(6):85-98, 1992
11.    B. Schroeder et al. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?, In
       Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST), pages 1-16, 2007.
12.    B. Schroeder et al. DRAM errors in the wild: a large-scale field study. In Proceedings of the eleventh
       international joint conference on Measurement and modeling of computer systems, pages 193-204. ACM New York,
       NY, USA, 2009
13.   G.M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In
      Proceedings of the April 18-20, 1967, spring joint computer conference, pages 483-485. ACM, 1967.
14.   J.L. Gustafson. Reevaluating Amdahl's law. Communications of the ACM, 31(5):532-533, 1988.
15.   A.H. Karp and H.P. Flatt. Measuring parallel processor performance. Communications of the ACM,
      33(5):539-543, 1990.
16.   Apache Foundation, MapReduce V0.21.0
      Tutorial, http://hadoop.apache.org/mapreduce/docs/r0.21.0/mapred_tutorial.html, 2010
17.   Incremental MapReduce, TV's cobweb blog, http://eagain.net/articles/incremental-mapreduce/
18.   Y. Bu et al. HaLoop: Efficient Iterative Data Processing on Large Clusters, In Proceedings of VLDB'10
19.   J. Ekanayake et al. Twister: A Runtime for Iterative MapReduce, In Proceedings of ACM HPDC'10, pp.
      810-818, 2010
20.   M. Isard et al. Dryad: Distributed data-parallel programs from sequential building blocks. In
      Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, page
      72. ACM, 2007.
21.   R. Chaiken et al. SCOPE: easy and efficient parallel processing of massive data sets. PVLDB:
      Proceedings of the Very Large Data Base Endowment, 1(2):1265-1276, 2008.
22.   C. Olston et al. Pig Latin: a not-so-foreign language for data processing. In SIGMOD '08: Proceedings
      of the ACM SIGMOD Conference, pages 1099-1110, 2008.
23.   A. Gates et al. Building a high level dataflow system on top of MapReduce: The Pig experience.
      PVLDB: In Proceedings of VLDB, 2(2):1414-1425, 2009.
24.   R. Pike et al. Interpreting the Data: Parallel Analysis with Sawzall, Scientific Programming, Vol. 13, No.
      4, pp. 277-298, 2005
25.   A. Thusoo et al. Hive - A Warehousing Solution over a Map-Reduce Framework. PVLDB: Proceedings
      of the Very Large Data Base Endowment, 2009
26.   A. Thusoo et al. Hive - a petabyte scale data warehouse using Hadoop. In Proceedings of ICDE 2010
27.   Y. Yu et al. DryadLINQ: A system for general-purpose distributed data-parallel computing using a
      high-level language. In OSDI '08: Proceedings of the Symposium on Operating System Design and
      Implementation, 2008
28.   M. Isard et al. Distributed Data-Parallel Computing Using a High-Level Programming Language, In
      Proceedings of SIGMOD 2009
29.   D. Logothetis et al. Ad-Hoc Data Processing in the Cloud, In Proceedings of VLDB'08
30.   T. Condie et al. MapReduce Online, In Proceedings of USENIX NSDI, 2010
31.   A. Alexandrov et al. Massively Parallel Data Analysis with PACTs on Nephele, In Proceedings of
      VLDB, Vol. 3, No. 2, 2010
32.   D. Battré et al. Nephele/PACTs: a programming model and execution framework for web-scale
      analytical processing, In Proceedings of SoCC 2010
33.   Eric Friedman et al. SQL/MapReduce: A practical approach to self-describing, polymorphic, and
      parallelizable user defined functions. PVLDB: Proceedings of the Very Large Data Base Endowment,
      2(2):1402-1413, 2009.
34.   A. Abouzeid et al. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for
      analytical workloads. VLDB'09: Proceedings of the Very Large Data Base Endowment, pages 1084-1095,
      2009.
35.   Y. Xu et al. Integrating Hadoop and Parallel DBMS, In Proceedings of ACM SIGMOD, pp. 969-974,
      2010
36.   S. Das et al. Ricardo: Integrating R and Hadoop, In Proceedings of ACM SIGMOD, pp. 987-998, 2010
37.   J. Dittrich et al. Hadoop++: Making a Yellow Elephant Run like a Cheetah (Without it Even Noticing), In
      Proceedings of VLDB'10
38.   S. Chen, Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce, In Proceedings
      of VLDB, Vol. 3, No. 2, 2010
39.   S. Melnik et al. Dremel: Interactive Analysis of Web-Scale Datasets, In Proceedings of VLDB, Vol. 3,
      No. 1, 2010
40.   C. Yang et al. Osprey: Implementing MapReduce-Style Fault Tolerance in a Shared-Nothing
      Distributed Database, In Proceedings of IEEE ICDE, pp. 657-668, 2010
41.   M. Zaharia et al. Improving MapReduce Performance in Heterogeneous Environments, In
      Proceedings of USENIX OSDI'08
42.   H. Yang et al. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters, In
      Proceedings of SIGMOD'07
43.   D. Jiang et al. Map-Join-Reduce: Towards Scalable and Efficient Data Analysis on Large
      Clusters, IEEE Transactions on Knowledge and Data Engineering, preprint
44.   S. Blanas et al. A Comparison of Join Algorithms for Log Processing in MapReduce, In Proceedings
      of SIGMOD'10
45.   F. N. Afrati et al. Optimizing Joins in a Map-Reduce Environment, In Proceedings of EDBT 2010
46.   R. Vernica et al. Efficient Parallel Set-Similarity Joins Using MapReduce, In Proceedings of
      SIGMOD'10
47.   T. Nykiel et al. MRShare: Sharing Across Multiple Queries in MapReduce, In Proceedings of VLDB'10
48.   K. Morton et al. Estimating the progress of MapReduce Pipelines, In Proceedings of IEEE ICDE, pp.
      681-684, 2010
49.   K. Morton et al. ParaTimer: A Progress Indicator for MapReduce DAGs, In Proceedings of ACM
      SIGMOD, pp. 507-518, 2010
50.   S. Papadimitriou et al. DisCo: Distributed Co-clustering with Map-Reduce, In Proceedings of IEEE
      ICDM, pp. 512-521, 2009
51.   C. Wang et al. MapDupReducer: detecting near duplicates over massive datasets, In Proceedings of
      ACM SIGMOD, pp. 1119-1122, 2010
52.   S. Babu, Towards Automatic Optimization of MapReduce Programs, In Proceedings of ACM SoCC'10
53.   D. Jiang et al. The Performance of MapReduce: An In-depth Study, In Proceedings of VLDB'10
54.   E. Jahani et al. Automatic Optimization for MapReduce Programs, In Proceedings of VLDB, Vol. 4, No.
      6, 2011
55.   B. Catanzaro et al. A Map Reduce Framework for Programming Graphics Processors, In Proceedings
      of the Workshop on Software Tools for Multicore Systems, 2008
56.   B. He et al. Mars: A MapReduce framework on graphics processors, In Proceedings of PACT'08, pp.
      260-269, 2008
57.   W. Jiang et al. A Map-Reduce System with an Alternate API for Multi-Core Environments, In
      Proceedings of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010
58.   Jeff Dean, Designs, Lessons and Advice from Building Large Distributed Systems, Keynote, LADIS 2009.
59.   Willis Lang et al. Energy Management for MapReduce Clusters, In Proceedings of VLDB, Vol. 3, No. 1,
      2010
60.   W. Xiong et al. Energy Efficient Data Intensive Distributed Computing, Data Engineering Bulletin, Vol.
      34, No. 1, pp. 24-33, March 2011
61.   E. Anderson et al. Efficiency Matters!, ACM SIGOPS Operating Systems Review, 44(1):40-45, 2010
62.   Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, book
63.   G. Malewicz et al. Pregel: A System for Large-Scale Graph Processing, In Proceedings of PODC'09
64.   J. Ekanayake et al. MapReduce for Data Intensive Scientific Analyses, In Proceedings of IEEE
      eScience'08
65.   K. B. Hall et al. MapReduce/Bigtable for Distributed Optimization, NIPS LCCC Workshop 2010
66.   PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce
67.   M.C. Schatz, CloudBurst: Highly sensitive read mapping with MapReduce, Bioinformatics, Vol. 25, No. 11
68.   B. Fan et al. DiskReduce: RAID for data-intensive scalable computing, In Proceedings of the 4th
      Annual Workshop on Petascale Data Storage, pp. 6-10, 2009
69.   K. Lee et al. Parallel data processing with MapReduce: a survey, ACM SIGMOD Record, Vol. 40, No. 4,
      pp. 11-20, 2011

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 
Shark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleShark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleDataWorks Summit
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsJason Shao
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsLeila panahi
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1Stefanie Zhao
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Yahoo Developer Network
 
Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Shivkumar Babshetty
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesDataWorks Summit/Hadoop Summit
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part IMarin Dimitrov
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tipsSubhas Kumar Ghosh
 
dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...Bikash Chandra Karmokar
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2Tianwei Liu
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Cloudera, Inc.
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 
Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014Ryu Kobayashi
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingMohammad Mustaqeem
 

Was ist angesagt? (20)

Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Shark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleShark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at Scale
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling Algorithms
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
 
Map Reduce basics
Map Reduce basicsMap Reduce basics
Map Reduce basics
 
Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with Dependencies
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part I
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...dmapply: A functional primitive to express distributed machine learning algor...
dmapply: A functional primitive to express distributed machine learning algor...
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014
 
myHadoop 0.30
myHadoop 0.30myHadoop 0.30
myHadoop 0.30
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud Computing
 

Ähnlich wie MapReduce: A useful parallel tool that still has room for improvement

MapReduce on Zero VM
MapReduce on Zero VM MapReduce on Zero VM
MapReduce on Zero VM Joy Rahman
 
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...
Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...Xavier Llorà
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor DesignSri Prasanna
 
Hanborq Optimizations on Hadoop MapReduce
Hanborq Optimizations on Hadoop MapReduceHanborq Optimizations on Hadoop MapReduce
Hanborq Optimizations on Hadoop MapReduceHanborq Inc.
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkThoughtWorks
 
Hanborq optimizations on hadoop map reduce 20120221a
Hanborq optimizations on hadoop map reduce 20120221aHanborq optimizations on hadoop map reduce 20120221a
Hanborq optimizations on hadoop map reduce 20120221aSchubert Zhang
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profilepramodbiligiri
 
MySQL Cluster Scaling to a Billion Queries
MySQL Cluster Scaling to a Billion QueriesMySQL Cluster Scaling to a Billion Queries
MySQL Cluster Scaling to a Billion QueriesBernd Ocklin
 
The Right Data for the Right Job
The Right Data for the Right JobThe Right Data for the Right Job
The Right Data for the Right JobEmily Curtin
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over YarnInMobi Technology
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine ParallelismSri Prasanna
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsFlink Forward
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Jen Aman
 
Graph processing
Graph processingGraph processing
Graph processingyeahjs
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 

Ähnlich wie MapReduce: A useful parallel tool that still has room for improvement (20)

MapReduce on Zero VM
MapReduce on Zero VM MapReduce on Zero VM
MapReduce on Zero VM
 
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...
Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study us...
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us...
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor Design
 
Hanborq Optimizations on Hadoop MapReduce
Hanborq Optimizations on Hadoop MapReduceHanborq Optimizations on Hadoop MapReduce
Hanborq Optimizations on Hadoop MapReduce
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop - Introduction to HDFS
Hadoop - Introduction to HDFSHadoop - Introduction to HDFS
Hadoop - Introduction to HDFS
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
 
Hanborq optimizations on hadoop map reduce 20120221a
Hanborq optimizations on hadoop map reduce 20120221aHanborq optimizations on hadoop map reduce 20120221a
Hanborq optimizations on hadoop map reduce 20120221a
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
MySQL Cluster Scaling to a Billion Queries
MySQL Cluster Scaling to a Billion QueriesMySQL Cluster Scaling to a Billion Queries
MySQL Cluster Scaling to a Billion Queries
 
The Right Data for the Right Job
The Right Data for the Right JobThe Right Data for the Right Job
The Right Data for the Right Job
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
Graph processing
Graph processingGraph processing
Graph processing
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 

Mehr von Kyong-Ha Lee

SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...Kyong-Ha Lee
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceKyong-Ha Lee
 
좋은 논문 찾기
좋은 논문 찾기좋은 논문 찾기
좋은 논문 찾기Kyong-Ha Lee
 
A poster version of HadoopXML
A poster version of HadoopXMLA poster version of HadoopXML
A poster version of HadoopXMLKyong-Ha Lee
 
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...Kyong-Ha Lee
 
Bitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query ProcessingBitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query ProcessingKyong-Ha Lee
 
Bitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query ProcessingBitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query ProcessingKyong-Ha Lee
 

Mehr von Kyong-Ha Lee (7)

SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduce
 
좋은 논문 찾기
좋은 논문 찾기좋은 논문 찾기
좋은 논문 찾기
 
A poster version of HadoopXML
A poster version of HadoopXMLA poster version of HadoopXML
A poster version of HadoopXML
 
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
 
Bitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query ProcessingBitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query Processing
 
Bitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query ProcessingBitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query Processing
 

Kürzlich hochgeladen

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Kürzlich hochgeladen (20)

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

MapReduce: A useful parallel tool that still has room for improvement

  • 1. MapReduce: A Useful Parallel Tool that Still Has Room for Improvement January 5, 2012 Kyong-Ha Lee bart7449@gmail.com Copyright © KAIST Database Lab. All Rights Reserved.
  • 2. Outline Three topics that I will discuss : ♦ Anatomy of the MapReduce framework – Basic principles about the MapReduce framework – Not much discussion on implementation details, but will be happy to discuss them if there are any questions. ♦ A brief survey on the study of improving the conventional MapReduce framework ♦ Research projects on going at KAIST Copyright © KAIST Database Lab. All Rights Reserved.
  • 3. Big Data ♦ A data set too large to work with using an on-hand DBMS on a single node ♦ Data growth challenges are defined as* – increasing volume (amount of data), – velocity (speed of data in/out), – variety (range of data types and sources) * Doug Laney, "3D Data Management: Controlling Data Volume, Velocity and Variety", 2001 Copyright © KAIST Database Lab. All Rights Reserved.
  • 4. Importance and Impact ♦ "Data center is the computer. If MapReduce is the first instruction of the data center computer, I can't wait to see the rest of the instruction set, as well as the data center programming language, the data center operating system, the data center storage systems, and more." – David A. Patterson. Technical perspective: the data center is the computer. CACM, 51(1):105, 2008. ♦ A list of institutions that are using Hadoop, an open-source Java implementation of MapReduce ♦ Its scholastic impact! (as of Dec 31, 2011) Copyright © KAIST Database Lab. All Rights Reserved.
  • 5. Usage Statistics Over Time at Google*
                                   Aug '04    Mar '06    Sep '07    Sep '09
      Number of jobs               29K        171K       2,217K     3,467K
      Average completion (secs)    634        874        395        475
      Machine years used           217        2,002      11,081     25,562
      Input data read (TB)         3,288      52,254     403,152    544,130
      Intermediate data (TB)       758        6,743      34,774     90,120
      Output data written (TB)     193        2,970      14,018     57,520
      Average worker machines      157        268        394        488
  * source: J. Dean, Designs, Lessons and Advice from Building Large Distributed Systems, keynote, LADIS 2009.
  * Hadoop won first place in the GraySort benchmark, sorting 100 TB with over 3,800 nodes – Winning a 60 Second Dash with a Yellow Elephant, http://sortbenchmark.org/Yahoo2009.pdf
  Copyright © KAIST Database Lab. All Rights Reserved.
  • 6. Single Node Architecture [Figure: a single node – CPU, memory, and disk.] Copyright © KAIST Database Lab. All Rights Reserved.
  • 7. Commodity Clusters ♦ Web data sets can be very large – tens to hundreds of terabytes – At Facebook, almost 6 TB of new log data is collected every day, with 1.7 PB of log data accumulated over time* *source: A Comparison of Join Algorithms for Log Processing in MapReduce, SIGMOD'10 ♦ Data of this size cannot be stored and processed on a single machine in reasonable time ♦ Standard architecture emerging: – cluster of commodity Linux nodes – Gigabit Ethernet interconnects ♦ How to organize computations on this architecture? – Mask issues such as hardware failure Copyright © KAIST Database Lab. All Rights Reserved.
  • 8. Cluster Architecture [Figure: two-level cluster network – 1 Gbps between any pair of nodes in a rack, with an 8 Gbps backbone between racks; each node is a commodity machine with its own CPU, memory, and disk.] The Yahoo cluster used for GraySort: each rack contains 40 nodes; 2 quad-core Xeons @ 2.5 GHz per node; 8 GB RAM; 4 SATA HDDs. Copyright © KAIST Database Lab. All Rights Reserved.
  • 9. The Need for Stable Storage ♦ Problem: if nodes can fail, how can we store data persistently? – Cheap nodes fail frequently if you have many of them » MTBF for 1 node = 3 years » MTBF for 1,000 nodes = 1 day on average – so fault tolerance must be built into the system ♦ Answer: a distributed file system – provides a global file namespace – Google GFS; Hadoop HDFS – Typical usage pattern » huge files (100s of GB to TB) » data is rarely updated in place » reads and appends are the common I/O patterns Copyright © KAIST Database Lab. All Rights Reserved.
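  A quick sanity check of the failure-rate claim above (a back-of-the-envelope sketch using the slide's own numbers): if one node has an MTBF of 3 years, a 1,000-node cluster sees a failure roughly every MTBF_node / N = (3 × 365 days) / 1,000 ≈ 1.1 days, i.e., about one node failure per day on average.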
  • 10. GFS Design ♦ Master manages metadata ♦ Data transfers happen directly between clients/chunk servers ♦ Files broken into chunks (typically 64 MB) ♦ Data replication (typically 3 replicas, Primary copy) ♦ Immutable data blocks Copyright © KAIST Database Lab. All Rights Reserved.
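  To make the chunking-and-replication idea concrete, here is a minimal Python sketch (not GFS or HDFS code; the round-robin placement policy and all names are illustrative assumptions) that splits a file into fixed-size chunks and assigns each chunk to three distinct nodes:

      import itertools

      CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, the typical GFS chunk size
      REPLICAS = 3                    # typical replication factor

      def place_chunks(file_size, nodes):
          # Split a file of file_size bytes into fixed-size chunks and
          # give each chunk REPLICAS distinct nodes (round-robin here;
          # real systems also weigh rack topology and free space).
          num_chunks = -(-file_size // CHUNK_SIZE)  # ceiling division
          ring = itertools.cycle(range(len(nodes)))
          placement = {}
          for chunk_id in range(num_chunks):
              start = next(ring)
              placement[chunk_id] = [nodes[(start + i) % len(nodes)]
                                     for i in range(REPLICAS)]
          return placement

      # Example: a 200 MB file over five nodes -> 4 chunks, 3 replicas each
      print(place_chunks(200 * 1024 * 1024, ["n1", "n2", "n3", "n4", "n5"]))

  Because chunks are immutable and replicated, losing any single node loses no data, which is what lets the MapReduce layer above simply re-run tasks on failure.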
  • 11. Google Cluster Environment ♦ Cluster is 1000s of machines, typically one or handful of configurations ♦ File system (GFS) + cluster scheduling system are core services ♦ Typically 100s to 1000s of active jobs (some w/1 task, some w/1000s) ♦ Mix of batch and low-latency, user-facing production jobs Copyright © KAIST Database Lab. All Rights Reserved.
  • 12. Motivation of MapReduce’s Design ♦ Large-Scale Data Processing – Want to use 1,000s of CPUs » But don’t want hassle of managing things ♦ MapReduce Architecture provides – Automatic parallelization & distribution – Fault tolerance – I/O scheduling – Monitoring & status updates Copyright © KAIST Database Lab. All Rights Reserved.
  • 13. What is MapReduce? ♦ Both a programming model and a framework for massively parallel processing of large datasets across many low-end nodes – Popularized and controversially patented by Google Inc. – Analogous to group-by aggregation in a DBMS ♦ Easy to distribute a job across nodes – implements data parallelism ♦ No hassle of managing jobs across nodes ♦ Nice retry/failure semantics ♦ Runtime scheduling with speculative execution Copyright © KAIST Database Lab. All Rights Reserved.
  • 14. Programming model: Map/Reduce ♦ Input: a set of key/value pairs ♦ A user implements two functions: – map(key1, value1) → list(key2, value2) – reduce(key2, list(value2)) → (key3, value3) ♦ (key2, value2) is an intermediate key/value pair ♦ Output is the set of (key3, value3) pairs ♦ Many problems can be phrased this way – but not all. Copyright © KAIST Database Lab. All Rights Reserved.
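  To make the two-function contract concrete, the following is a minimal single-process Python sketch of the model (run_mapreduce and its argument names are illustrative, not part of any real MapReduce API):

      from collections import defaultdict

      def run_mapreduce(records, mapper, reducer):
          # Toy in-process MapReduce: apply mapper to each (key1, value1)
          # record, group the intermediate (key2, value2) pairs by key2,
          # then apply reducer to each (key2, [value2, ...]) group.
          groups = defaultdict(list)
          for key1, value1 in records:
              for key2, value2 in mapper(key1, value1):
                  groups[key2].append(value2)
          output = []
          for key2, values in groups.items():
              output.extend(reducer(key2, values))
          return output

  A real framework runs the map loop and the reduce loop on different machines and replaces the in-memory grouping with a distributed sort/shuffle, but the data contract is exactly the one on this slide.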
  • 15. Data ♦ Input and final output are stored on the DFS – The scheduler tries to schedule map tasks "close" to the physical storage location of their input data ♦ Intermediate results are stored on the local disks of map and reduce workers ♦ Outputs of one MR job often become inputs of another MR job Copyright © KAIST Database Lab. All Rights Reserved.
  • 16. Parallel Execution across Nodes 1. Partition input key/value pairs into chunks and then run map() tasks in parallel 2. After all map()s are complete, consolidate all emitted values for each unique emitted key 3. Now partition the space of output map keys (see the sketch below), and run reduce() in parallel 4. In reduce(), values for each key are grouped together and then aggregated; the reduced outputs are stored on the DFS Copyright © KAIST Database Lab. All Rights Reserved.
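  Step 3's split of the intermediate key space is usually just a hash of the key modulo the number of reduce tasks; Hadoop's default HashPartitioner computes the equivalent of (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. A minimal Python sketch of the idea:

      def partition(key, num_reducers):
          # Deterministic within one run: every occurrence of the same
          # key is routed to the same reduce task.
          return hash(key) % num_reducers

      # Example: spread word-count keys over 4 reducers
      for word in ["to", "be", "or", "not", "to", "be"]:
          print(word, "-> reducer", partition(word, 4))

  (Python randomizes string hashing across interpreter runs; within a single run, which is all this sketch needs, the routing is stable.)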
  • 17. Example: Word Count
      map(key, value):
        // key: document name; value: text of document
        for each word w in value:
          emit(w, 1)

      reduce(key, values):
        // key: a word; values: an iterator over counts
        result = 0
        for each count v in values:
          result += v
        emit(key, result)
  Copyright © KAIST Database Lab. All Rights Reserved.
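  The pseudocode above runs as-is once the grouping step is simulated in-process; a self-contained Python sketch (the document names and contents are toy inputs, borrowed from the inverted-index example on slide 22):

      from collections import defaultdict

      documents = {"hamlet.txt": "to be or not to be",
                   "12th.txt": "be not afraid of greatness"}

      # Map phase: emit (word, 1) for every word of every document
      intermediate = defaultdict(list)
      for name, text in documents.items():
          for word in text.split():
              intermediate[word].append(1)

      # Reduce phase: sum the counts collected for each word
      counts = {word: sum(ones) for word, ones in intermediate.items()}
      print(counts["be"])  # 3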
  • 18. Execution: The Map Step [Figure: map tasks turn input key-value pairs into intermediate key-value pairs; several intermediate pairs may carry the same key.] Copyright © KAIST Database Lab. All Rights Reserved.
  • 19. Execution: The Reduce Step [Figure: intermediate key-value pairs are grouped by key into key-value groups; reduce tasks turn each group into output key-value pairs.] Copyright © KAIST Database Lab. All Rights Reserved.
  • 20. Combiner ♦ Often a map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k – e.g., popular words in Word Count ♦ It can save network time by pre-aggregating at the mapper – combine(k1, list(v1)) → v2 – usually the same as the reduce function ♦ Works only if the reduce function is commutative and associative Copyright © KAIST Database Lab. All Rights Reserved.
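  A sketch of the saving on toy data (the combiner here is the same local sum the reduce function would apply, which is safe because addition is commutative and associative):

      from collections import Counter

      # One mapper's raw output for a word-count job
      mapper_output = [("to", 1), ("be", 1), ("or", 1),
                       ("not", 1), ("to", 1), ("be", 1)]

      # Combiner = run the reduce-style sum locally, before the shuffle
      combined = Counter()
      for key, value in mapper_output:
          combined[key] += value

      print(len(mapper_output))  # 6 pairs would cross the network
      print(len(combined))       # only 4 pairs after combining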
  • 21. Example: Building an Inverted Index
  ♦ Input: (filename, text) records
  ♦ Output: list of files containing each word
  ♦ Map:
      foreach word in text.split():
        emit(word, filename)
  ♦ Combine: uniquify filenames for each word
  ♦ Reduce:
      def reduce(word, filenames):
        output(word, sort(filenames))
  Copyright © KAIST Database Lab. All Rights Reserved.
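  A self-contained in-process sketch of the same job (toy inputs; a set plays the role of the uniquifying combiner):

      from collections import defaultdict

      files = {"hamlet.txt": "to be or not to be",
               "12th.txt": "be not afraid of greatness"}

      # Map: emit (word, filename); combine: de-duplicate per word
      postings = defaultdict(set)
      for filename, text in files.items():
          for word in text.split():
              postings[word].add(filename)

      # Reduce: sort the file list for each word
      index = {word: sorted(names) for word, names in postings.items()}
      print(index["be"])  # ['12th.txt', 'hamlet.txt']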
  • 22. [Figure: worked inverted-index example – mapping hamlet.txt ("to be or not to be") and 12th.txt ("be not afraid of greatness") emits (word, filename) pairs, which reduce to postings such as be → (12th.txt, hamlet.txt), not → (12th.txt, hamlet.txt), and afraid → (12th.txt). *source: PARLab Parallel Boot Camp, Matei Zaharia] Copyright © KAIST Database Lab. All Rights Reserved.
  • 23. Distributed Execution Review [Figure: input blocks 1…n feed parallel mappers; each mapper locally sorts (and optionally combines) its output into intermediate results; after a barrier, reducers pull the data (copy/shuffle), merge it, and run reduce to produce the output.] Copyright © KAIST Database Lab. All Rights Reserved.
  • 24. System Behavior on a Single Node [Figure. *Source: A Comparison of Join Algorithms for Log Processing in MapReduce, SIGMOD'10] Copyright © KAIST Database Lab. All Rights Reserved.
  • 25. Experimental Results [Figure. *Source: A Platform for Scalable One-Pass Analytics using MapReduce, SIGMOD'11] Copyright © KAIST Database Lab. All Rights Reserved.
  • 26. Fault Tolerance ♦ If tasks fail, they are executed again on another node – detect failure via periodic heartbeats – re-execute in-progress map tasks – re-execute in-progress reduce tasks ♦ If a node crashes: – re-launch its current tasks on other nodes – re-run any maps the node previously ran » necessary because their output files were lost along with the crashed node ♦ If a task is going slowly (a straggler): – launch a second copy of the task on another node ("speculative execution") – take the output of whichever copy finishes first, and kill the other Copyright © KAIST Database Lab. All Rights Reserved.
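  A sketch of the straggler decision (the threshold and all names are illustrative assumptions, not Hadoop's exact policy, which also weighs elapsed time and caps the number of speculative tasks):

      def pick_stragglers(progress, slack=0.2):
          # progress: task id -> fraction complete in [0, 1].
          # Flag tasks lagging the mean progress by more than `slack`
          # as candidates for speculative re-execution.
          if not progress:
              return []
          mean = sum(progress.values()) / len(progress)
          return [t for t, p in progress.items() if p < mean - slack]

      print(pick_stragglers({"m1": 0.90, "m2": 0.85, "m3": 0.30}))  # ['m3']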
  • 27. Criticism ♦ D. DeWitt and M. Stonebraker sharply criticized MapReduce as "a major step backwards" [5]. – They initially regarded it as a simple Extract-Transform-Load tool. ♦ A technical comparison was done by Pavlo et al. [6] – compared Hadoop with a commercial row-wise DBMS and with Vertica – After that, technical debates between researchers and practitioners were triggered ♦ CACM welcomed the debate, inviting both sides to the Communications of the ACM, Jan 2010 [7,8] Copyright © KAIST Database Lab. All Rights Reserved.
  • 28. [Figure. *Source: A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD'09] Copyright © KAIST Database Lab. All Rights Reserved.
  • 29. Advantages ♦ Simple and easy to use – users code only Map() and Reduce() – users need not consider how to distribute their job ♦ Flexible – no data model, no schema – users can process any irregular data with MapReduce ♦ Independent of the underlying storage ♦ Fault tolerance – users need not worry about faults during a run – a failed run does not restart from Map() ♦ High scalability – easy to scale out Copyright © KAIST Database Lab. All Rights Reserved.
  • 30. Caveats ♦ A single fixed dataflow ♦ Lack of schema, index, and a high-level language – requires data parsing and full scans – no separation of the schema from applications ♦ Sacrifice of disk I/O for fault tolerance – materialization of intermediate results on local disks – three replicas on the DFS – I/O inefficient! ♦ Blocking operators – caused by the merge-sort used for grouping values – reduce begins only after all map tasks end ♦ A simple heuristic runtime scheduling with speculative execution ♦ Very young! – few third-party tools and low efficiency Copyright © KAIST Database Lab. All Rights Reserved.
  • 31. A Short List of Related Studies
  ♦ Sacrifice of disk I/O for fault tolerance – the main difference against DBMSs
  ♦ A single fixed dataflow – Dryad, SCOPE, Nephele/PACT – Map-Reduce-Merge for binary operators – Twister and HaLoop for iterative workloads – Map-Join-Reduce and some join techniques – join algorithms in MapReduce [Blanas, SIGMOD'10]
  ♦ No schema – Protocol Buffers, JSON, XML, …
  ♦ No indexing – HadoopDB, Hadoop++
  ♦ No high-level language – Hive, Sawzall, SCOPE, Pig Latin, Jaql, DryadLINQ, …
  ♦ Blocking operators – MapReduce Online, Mortar
  ♦ A simple heuristic scheduling – LATE, …
  ♦ Relatively poor performance – adaptive and automatic performance tuning – work sharing across multiple jobs » MRShare: multi-query processing » Hive, Pig Latin » fair/capacity sharing, ParaTimer – Map-Join-Reduce
  ♦ Cowork with other tools – SQL/MapReduce, HadoopDB, Teradata EDW's Hadoop integration, …
  ♦ DBMS based on MR – Cheetah, Osprey, RICARDO (analytic tool)
  ♦ Other complements – DREMEL, …
  Copyright © KAIST Database Lab. All Rights Reserved.
  • 32. A Brief Bibliographic Survey • We intend to assist the DB and open-source communities in understanding various technical aspects of the MapReduce framework • SIGMOD Record, 40(4):11-20, Dec 2011 Copyright © KAIST Database Lab. All Rights Reserved.
  • 33. Summary ♦ MR is simple, but provides good scalability and fault tolerance for massive data processing ♦ MR is unlikely to substitute for DBMSs ♦ MR complements DBMSs with scalable and flexible parallel processing for various data analyses ♦ The I/O efficiency of MapReduce still needs to be addressed for broader success – sort-merge based grouping and frequent checkpoints are the main costs ♦ Many application domains and much room for improvement Copyright © KAIST Database Lab. All Rights Reserved.
  • 34. Other Research Challenges and Issues ♦ Parallelizing conventional algorithms – those that require filtering-then-aggregation » but this pattern is not good for ad-hoc queries ♦ Performance improvements – MapReduce does not yet make good use of modern hardware features » multi-core, GPGPU, SSDs, etc. – some caveats still exist in the model » iterative and incremental processing – self-tuning » 150+ tuning knobs in Hadoop » long-running analysis and batch processing Copyright © KAIST Database Lab. All Rights Reserved.
  • 35. Thank you! Questions or comments? Copyright © KAIST Database Lab. All Rights Reserved.
  • 36. References
  1. D. A. Patterson. Technical perspective: the data center is the computer. Communications of the ACM, 51(1):105, 2008.
  2. Hadoop users list: http://wiki.apache.org/hadoop/PoweredBy
  3. J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of OSDI 2004; also CACM, 51(1):107-113, 2008.
  4. S. Ghemawat et al. The Google File System. ACM SIGOPS Operating Systems Review, 37(5):29-43, 2003.
  5. D. J. DeWitt and M. Stonebraker. MapReduce: a major step backwards. The Database Column blog, 2008.
  6. A. Pavlo et al. A Comparison of Approaches to Large-Scale Data Analysis. In Proceedings of SIGMOD 2009.
  7. M. Stonebraker et al. MapReduce and Parallel DBMSs: Friends or Foes? Communications of the ACM, 53(1):64-71, Jan 2010.
  8. J. Dean and S. Ghemawat. MapReduce: A Flexible Data Processing Tool. Communications of the ACM, 53(1):72-77, Jan 2010.
  9. M. Stonebraker. The case for shared-nothing. Data Engineering Bulletin, 9(1):4-9, 1986.
  10. D. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. Communications of the ACM, 35(6):85-98, 1992.
  11. B. Schroeder et al. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST), pages 1-16, 2007.
  12. B. Schroeder et al. DRAM errors in the wild: a large-scale field study. In Proceedings of the International Joint Conference on Measurement and Modeling of Computer Systems, pages 193-204, 2009.
  Copyright © KAIST Database Lab. All Rights Reserved.
  • 37. (References, continued)
  13. G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, pages 483-485. ACM, 1967.
  14. J. L. Gustafson. Reevaluating Amdahl's law. Communications of the ACM, 31(5):532-533, 1988.
  15. A. H. Karp and H. P. Flatt. Measuring parallel processor performance. Communications of the ACM, 33(5):539-543, 1990.
  16. Apache Foundation. MapReduce v0.21.0 tutorial. http://hadoop.apache.org/mapreduce/docs/r0.21.0/mapred_tutorial.html, 2010.
  17. Incremental MapReduce. TV's cobweb blog, http://eagain.net/articles/incremental-mapreduce/
  18. Y. Bu et al. HaLoop: Efficient Iterative Data Processing on Large Clusters. In Proceedings of VLDB 2010.
  19. J. Ekanayake et al. Twister: A Runtime for Iterative MapReduce. In Proceedings of ACM HPDC 2010, pages 810-818.
  20. M. Isard et al. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, 2007.
  21. R. Chaiken et al. SCOPE: easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265-1276, 2008.
  22. C. Olston et al. Pig Latin: a not-so-foreign language for data processing. In Proceedings of SIGMOD 2008, pages 1099-1110.
  23. A. Gates et al. Building a high level dataflow system on top of MapReduce: The Pig experience. PVLDB, 2(2):1414-1425, 2009.
  24. R. Pike et al. Interpreting the Data: Parallel Analysis with Sawzall. Scientific Programming, 13(4):277-298, 2005.
  25. A. Thusoo et al. Hive: A Warehousing Solution over a Map-Reduce Framework. PVLDB, 2009.
  26. A. Thusoo et al. Hive: a petabyte scale data warehouse using Hadoop. In Proceedings of ICDE 2010.
  Copyright © KAIST Database Lab. All Rights Reserved.
  • 38. (References, continued)
  27. Y. Yu et al. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of OSDI 2008.
  28. M. Isard et al. Distributed Data-Parallel Computing Using a High-Level Programming Language. In Proceedings of SIGMOD 2009.
  29. D. Logothetis et al. Ad-Hoc Data Processing in the Cloud. In Proceedings of VLDB 2008.
  30. T. Condie et al. MapReduce Online. In Proceedings of USENIX NSDI 2010.
  31. A. Alexandrov et al. Massively Parallel Data Analysis with PACTs on Nephele. PVLDB, 3(2), 2010.
  32. D. Battré et al. Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. In Proceedings of SoCC 2010.
  33. E. Friedman et al. SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions. PVLDB, 2(2):1402-1413, 2009.
  34. A. Abouzeid et al. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In Proceedings of VLDB 2009, pages 1084-1095.
  35. Y. Xu et al. Integrating Hadoop and Parallel DBMS. In Proceedings of SIGMOD 2010, pages 969-974.
  36. S. Das et al. Ricardo: Integrating R and Hadoop. In Proceedings of SIGMOD 2010, pages 987-998.
  37. J. Dittrich et al. Hadoop++: Making a Yellow Elephant Run like a Cheetah (Without It Even Noticing). In Proceedings of VLDB 2010.
  38. S. Chen. Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce. PVLDB, 3(2), 2010.
  39. S. Melnik et al. Dremel: Interactive Analysis of Web-Scale Datasets. PVLDB, 3(1), 2010.
  Copyright © KAIST Database Lab. All Rights Reserved.
  • 39. (References, continued)
  40. C. Yang et al. Osprey: Implementing MapReduce-Style Fault Tolerance in a Shared-Nothing Distributed Database. In Proceedings of ICDE 2010, pages 657-668.
  41. M. Zaharia et al. Improving MapReduce Performance in Heterogeneous Environments. In Proceedings of USENIX OSDI 2008.
  42. H. Yang et al. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. In Proceedings of SIGMOD 2007.
  43. D. Jiang et al. Map-Join-Reduce: Towards Scalable and Efficient Data Analysis on Large Clusters. IEEE Transactions on Knowledge and Data Engineering, preprint.
  44. S. Blanas et al. A Comparison of Join Algorithms for Log Processing in MapReduce. In Proceedings of SIGMOD 2010.
  45. F. N. Afrati et al. Optimizing Joins in a Map-Reduce Environment. In Proceedings of EDBT 2010.
  46. R. Vernica et al. Efficient Parallel Set-Similarity Joins Using MapReduce. In Proceedings of SIGMOD 2010.
  47. T. Nykiel et al. MRShare: Sharing Across Multiple Queries in MapReduce. In Proceedings of VLDB 2010.
  48. K. Morton et al. Estimating the Progress of MapReduce Pipelines. In Proceedings of ICDE 2010, pages 681-684.
  49. K. Morton et al. ParaTimer: A Progress Indicator for MapReduce DAGs. In Proceedings of SIGMOD 2010, pages 507-518.
  50. S. Papadimitriou et al. DisCo: Distributed Co-clustering with Map-Reduce. In Proceedings of ICDM 2009, pages 512-521.
  51. C. Wang et al. MapDupReducer: detecting near duplicates over massive datasets. In Proceedings of SIGMOD 2010, pages 1119-1122.
  52. S. Babu. Towards Automatic Optimization of MapReduce Programs. In Proceedings of ACM SoCC 2010.
  53. D. Jiang et al. The Performance of MapReduce: An In-depth Study. In Proceedings of VLDB 2010.
  54. E. Jahani et al. Automatic Optimization for MapReduce Programs. PVLDB, 4(6), 2011.
  Copyright © KAIST Database Lab. All Rights Reserved.
  • 40. (References, continued)
  55. B. Catanzaro et al. A Map Reduce Framework for Programming Graphics Processors. In Proceedings of the Workshop on Software Tools for Multicore Systems, 2008.
  56. B. He et al. Mars: A MapReduce framework on graphics processors. In Proceedings of PACT 2008, pages 260-269.
  57. W. Jiang et al. A Map-Reduce System with an Alternate API for Multi-Core Environments. In Proceedings of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010.
  58. J. Dean. Designs, Lessons and Advice from Building Large Distributed Systems. Keynote, LADIS 2009.
  59. W. Lang et al. Energy Management for MapReduce Clusters. PVLDB, 3(1), 2010.
  60. W. Xiong et al. Energy Efficient Data Intensive Distributed Computing. Data Engineering Bulletin, 34(1):24-33, March 2011.
  61. E. Anderson et al. Efficiency Matters! ACM SIGOPS Operating Systems Review, 44(1):40-45, 2010.
  62. J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce. Morgan & Claypool, 2010.
  63. G. Malewicz et al. Pregel: A System for Large-Scale Graph Processing. In Proceedings of PODC 2009.
  64. J. Ekanayake et al. MapReduce for Data Intensive Scientific Analyses. In Proceedings of IEEE eScience 2008.
  65. K. B. Hall et al. MapReduce/Bigtable for Distributed Optimization. NIPS LCCC Workshop, 2010.
  66. PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce.
  67. M. C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11), 2009.
  68. B. Fan et al. DiskReduce: RAID for data-intensive scalable computing. In Proceedings of the 4th Annual Workshop on Petascale Data Storage, pages 6-10, 2009.
  69. K. Lee et al. Parallel data processing with MapReduce: a survey. SIGMOD Record, 40(4):11-20, 2011.
  Copyright © KAIST Database Lab. All Rights Reserved.