



Emergent Distributed Data Storages for Big Data, Storage, and Analysis


Woohyun Kim
The creator of open source “Coord”
(http://www.coordguru.com)


2010-01-27
Contents




The Advent of Big Data
 • Noah's Ark Problem
 • Key Issues with "Big Data"
 • How to deal with "Big Data"

Hadoop Revolution
 • Best Practice in Hadoop
 • Hadoop is changing the Game
 • Big Data goes well with Hadoop
 • Case Study: Parallel Join
 • Case Study: Further Study in Parallel Join
 • Case Study: Improvements in Parallel Join

MapReduce Debates
 • MapReduce is just A Major Step Backwards!!!
 • RDB experts Jump the MR Shark
 • DBs are hammers; MR is a screwdriver
 • MR is a Step Backwards, but some Steps Forward

A Hybrid of MapReduce and RDBMS
 • Integrate MapReduce into RDBMS
 • In-Database MapReduce vs. File-only MapReduce

Non-Relational Data Storages
 • Throw 'Relational' Away, and Take 'Schema-Free'
 • A Comparison of Non-Relational Data Storages
 • Emergent Document-oriented Storages
 • Document-oriented vs. RDBMS




The Advent of Big Data


Noah’s Ark Problem
• Did Noah take dinosaurs on the Ark?
    • The Ark was a very large ship designed especially for its important purpose
    • It was so large and complex that it took Noah 120 years to build

• How do you fit such a big thing?
    • Diet?
    • Split?
        • Differentiate
        • Put
        • Integrate
    • Scale Up?
    • Scale Out?
• The "Big Data" problem is just like that


Key Issues with "Big Data"
•   Lookup
       •   Metadata server -> centralized or distributed -> partitioned replicas to avoid a single point of failure

•   Partition
       •   Data locality -> network bandwidth reduction -> putting the computation near the data

•   Replication
       •   Hardware failure -> data loss -> availability from redundant copies of the data

•   Load-balanced Parallel Processing
       •   Corrupt Data or Remote process failure -> speculative execution or rescheduling

•   Ad-hoc Analysis
       •   Some partitioned data may need to be combined with another data


How to deal with "Big Data"
 Struggling to STORE and ANALYZE "Big Data"


Appendix: What is ETL?
  ETL(Extract, Transform, and Load)
  • A process in database usage and especially in data warehousing that involves:
        • Extracting data from outside sources (such as different data
           organizations/formats, or non-relational database structures)
        • Transforming it to fit operational needs (which can include quality levels)
              • Selection, translation, encoding, calculation, filtering, sorting, joining,
                 aggregation, transposing or pivoting, splitting, disaggregation
        • Loading it into the end target (database or data warehouse)
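The extract → transform → load steps above can be sketched in a few lines of Python. The CSV input, table name, and field names here are hypothetical examples for illustration, not from the slides:

```python
# A minimal ETL sketch: extract from a CSV source, transform/clean,
# load into an end target (an in-memory SQLite database).
import csv
import io
import sqlite3

raw = io.StringIO("id,amount\n1, 10.5 \n2,3.25\n")   # hypothetical source

# Extract: read records from the outside source
rows = list(csv.DictReader(raw))

# Transform: strip whitespace, convert types to fit operational needs
records = [(int(r["id"]), round(float(r["amount"].strip()), 2)) for r in rows]

# Load: insert into the end target
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)", records)
total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 13.75
```

Real ETL tools like those listed below add scheduling, connectors, and error handling around this same three-step skeleton.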

  ETL Open Sources
  • Talend Open Studio
  • Pentaho Data Integration (Kettle)
  • RapidMiner
  • Jitterbit 2.0
  • Apatar
  • Clover.ETL
  • Scriptella




Hadoop Revolution


Best Practice in Hadoop
• Software Stack in Google/Hadoop
• Cookbook for "Big Data"




                                • Structured Data Storage for "Big Data"
                                     [Figure: a Bigtable-style table – each structured-data
                                      value is addressed by a row key, a column key
                                      (column family:qualifier), and a timestamp]
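The row-key / column-key / timestamp model sketched in the figure can be modeled as a sparse nested map. The row and column names below are illustrative assumptions:

```python
# A sketch of Bigtable's data model: a sparse map from
# (row key, "family:qualifier" column key, timestamp) to an uninterpreted value.
table = {}

def put(row, family, qualifier, ts, value):
    # Store one versioned cell; rows and columns exist only when written (sparse).
    table.setdefault(row, {}).setdefault(f"{family}:{qualifier}", {})[ts] = value

def get_latest(row, column):
    versions = table[row][column]   # {timestamp: value}
    return versions[max(versions)]  # newest timestamp wins

put("com.cnn.www", "contents", "html", 1, "<html>v1</html>")
put("com.cnn.www", "contents", "html", 2, "<html>v2</html>")
print(get_latest("com.cnn.www", "contents:html"))  # <html>v2</html>
```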


Appendix: What is MapReduce?
  Map
  • Reads a set of "records" from an input file, applying filtering or transformations
  • Outputs a set of (key, data) pairs, partitioned into R disjoint buckets by
    the key
  Reduce
  • Reads a set of (key, list of data) pairs from the R disjoint buckets
  • Each bucket of the map outputs is shuffled and aggregated into its
    corresponding reduce task, ordered by the key
  • Outputs a new set of records

                 [Figure: map tasks (group-by/filter) feeding reduce tasks
                  (aggregate) through the shuffle]
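As an illustration of this map → shuffle → reduce flow, here is a minimal single-process word-count sketch; it mimics the model described above, not Hadoop's actual API:

```python
# Word count in the MapReduce style: map emits (key, data) pairs,
# a sort simulates the shuffle/grouping, and reduce aggregates per key.
from itertools import groupby

def map_fn(record):
    # Filtering/transformation step: emit (word, 1) for each word.
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Aggregation step: sum all counts for one key.
    return (key, sum(values))

records = ["big data", "big storage"]
pairs = sorted(kv for r in records for kv in map_fn(r))  # "shuffle": order by key
result = dict(
    reduce_fn(key, [v for _, v in group])
    for key, group in groupby(pairs, key=lambda kv: kv[0])
)
print(result)  # {'big': 2, 'data': 1, 'storage': 1}
```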


Hadoop is changing the Game
• Hadoop, DW, and BI


Big Data goes well with Hadoop
• Parallelize Relational Algebra Operations using MapReduce


Case Study: Parallel Join
• A Parallel Join Example using MapReduce
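The standard way to express a parallel join in MapReduce is a reduce-side (repartition) join: mappers tag each record with its source relation, the shuffle groups records by join key, and each reducer combines the tagged lists. A single-process simulation, with hypothetical movie/rating tables:

```python
# Reduce-side (repartition) join of two relations on movie_id.
from collections import defaultdict

movies = [(1, "Alien"), (2, "Brazil")]   # (movie_id, title)
ratings = [(1, 5), (1, 4), (2, 3)]       # (movie_id, stars)

buckets = defaultdict(list)              # simulated shuffle: group by key
for mid, title in movies:
    buckets[mid].append(("M", title))    # tag with source relation "M"
for mid, stars in ratings:
    buckets[mid].append(("R", stars))    # tag with source relation "R"

joined = []
for mid, tagged in buckets.items():      # one reducer invocation per key
    titles = [v for tag, v in tagged if tag == "M"]
    stars = [v for tag, v in tagged if tag == "R"]
    joined += [(mid, t, s) for t in titles for s in stars]

print(sorted(joined))  # [(1, 'Alien', 4), (1, 'Alien', 5), (2, 'Brazil', 3)]
```

Note that all records for one key land on one reducer, which is exactly the skew and tagging overhead discussed on the next slide.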


Case Study: Further Study in Parallel Join
  Problems
  • Need to sort
  • Move the partitioned data across the network
      • Due to shuffling, must send the whole data
  • Skewed by popular keys
      • All records for a particular key are sent to the same reducer
  • Overhead by tagging

  Alternatives
   • Map-side Join
       • Mapper-only job to avoid sort and to reduce data movement across the
         network
   • Semi-Join
       • Shrink the data size through a semi-join (by preprocessing)


Case Study: Improvements in Parallel Join
  Map-Side Join
  • Replicate a relatively smaller input source to the cluster
       • Put the replicated dataset into a local hash table
  • Join a relatively larger input source with each local hash table
       • Mapper: do the map-side join

  Semi-Join
  • Extract – the unique IDs referenced in a larger input source (A)
       • Mapper: extract Movie IDs from Ratings records
       • Reducer: accumulate all unique Movie IDs
  • Filter – the other larger input source (B) with the referenced unique IDs
       • Mapper: filter the referenced Movie IDs from the full Movie dataset
  • Join – the larger input source (A) with the filtered datasets
       • Mapper: do the map-side join
            •   Ratings records & the filtered Movie IDs dataset
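The map-side join above can be simulated in-process as follows; the datasets are hypothetical, and a real Hadoop job would ship the small side to every node (e.g. via the distributed cache) rather than share a Python dict:

```python
# Map-side join: the smaller input (movies) is replicated to every mapper
# as a local hash table, so the larger input (ratings) is joined with a
# hash lookup – no sort, no shuffle, no reduce phase.
movies = {1: "Alien", 2: "Brazil"}   # small side, loaded into a hash table
ratings = [(1, 5), (2, 3), (3, 4)]   # large side, streamed through mappers

def mapper(record):
    mid, stars = record
    if mid in movies:                # hash lookup replaces the reduce step
        yield (mid, movies[mid], stars)

joined = [row for record in ratings for row in mapper(record)]
print(joined)  # [(1, 'Alien', 5), (2, 'Brazil', 3)]
```

This only works when one side fits in memory on each node; the semi-join preprocessing above exists precisely to shrink a side down to that size.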




MapReduce Debates


MapReduce is just A Major Step Backwards!!!
                                                      DeWitt and Stonebraker, January 17, 2008


  • A giant step backward in the programming paradigm for
   large-scale data intensive applications
     • Schemas are good
        • Types are checked at load time, so no garbage gets in
     • Separation of the schema from the application is good
        • The schema is stored in catalogs, so it can be queried (in SQL)
     • High-level access languages are good
        • Present what you want rather than an algorithm for how to get it
     • No schema??!
        • At least one data field is required, by specifying the key as input
        • For Bigtable/HBase, different tuples within the same table can
          actually have different schemas
        • There is not even support for logical schema changes such as
          views


MapReduce is just A Major Step Backwards!!! (cont’d)
                                                                 DeWitt and Stonebraker, January 17, 2008

  • A sub-optimal implementation, in that it uses brute force instead of
    indexing
      • Indexing
          • All modern DBMSs use hash or B-tree indexes to accelerate access to data
          • In addition, there is a query optimizer to decide whether to use an index or
               perform a brute-force sequential search
           • However, MapReduce has no indexes, so it processes data only in brute-force fashion
      • Automatic parallel execution
           • In the 1980s, the DBMS research community explored this in systems such as
                Gamma, Bubba, and Grace, and even the commercial Teradata
      • Skew
           • The distribution of records with the same key can be skewed in the map
                phase, causing some reduce tasks to take much longer than others
      • Intermediate data pulling
           • In the reduce phase, two or more reduce tasks attempt to read input files from the
                same map node simultaneously


MapReduce is just A Major Step Backwards!!! (cont’d)
                                                              DeWitt and Stonebraker, January 17, 2008

  • Not novel at all – it represents a specific implementation of well
    known techniques developed nearly 25 years ago
      • Partitioning for join
          • Application of Hash to Data Base Machine and its Architecture, 1983
      • Joins in parallel on a shared-nothing
          • Multiprocessor Hash-based Join Algorithms, 1985
          • The Case for Shared-Nothing, 1986
      • Aggregates in parallel
          • The Gamma Database Machine Project, 1990
          • Parallel Database System: The Future of High Performance Database Systems,
             1992
          • Adaptive Parallel Aggregation Algorithms, 1995
      • Teradata has been selling a commercial DBMS utilizing all of these
        techniques for more than 20 years
      • PostgreSQL supported user-defined functions and user-defined
        aggregates in the mid 1980s


MapReduce is just A Major Step Backwards!!! (cont’d)
                                                                                  DeWitt and Stonebraker, January 17, 2008

  • Missing most of the features that are routinely included in current DBMS
      • MapReduce provides only a sliver of the functionality found in modern DBMSs
            •   Bulk loader – transform input data in files into a desired format and load it into a DBMS
            •   Indexing – hash or B-tree indexes
            •   Updates – change the data in the database
            •   Transactions – support parallel updates and recovery from failures during update
            •   Integrity constraints – help keep garbage out of the database
            •   Referential integrity – again, help keep garbage out of the database
            •   Views – so the schema can change without having to rewrite the application program

  • Incompatible with all of the tools DBMS users have come to depend on
      • MapReduce cannot use the tools available in a modern SQL DBMS, and has none of
         its own
            •   Report writers (Crystal Reports) – prepare reports for human visualization
            •   Business intelligence tools (Business Objects or Cognos) – enable ad-hoc
                querying of large data warehouses
            •   Data mining tools (Oracle Data Mining or IBM DB2 Intelligent Miner) – allow
                a user to discover structure in large data sets
            •   Replication tools (Golden Gate) – allow a user to replicate data from one
                DBMS to another
            •   Database design tools (Embarcadero) – assist the user in constructing a
                database




What the !@# MapReduce?


RDB experts Jump the MR Shark
                                                                     Greg Jorgensen in January 17, 2008

  • Arg1: MapReduce is a step backwards in database access
      • MapReduce is not a database, a data store, or a data management system
      • MapReduce is an algorithmic technique for the distributed processing of large
        amounts of data
  • Arg2: MapReduce is a poor implementation
      • MapReduce is one way to generate indexes from a large volume of data, but it’s not
        a data storage and retrieval system
  • Arg3: MapReduce is not novel
      • Hashing, parallel processing, data partitioning, and user-defined functions are all old
        hat in the RDBMS world, but so what?
      • The big innovation MapReduce enables is distributing data processing across a
        network of cheap and possibly unreliable computers
  • Arg4: MapReduce is missing features
  • Arg5: MapReduce is incompatible with the DBMS tools
      • The ability to process a huge volume of data quickly such as web crawling and log
        analysis is more important than guaranteeing 100% data integrity and completeness


DBs are hammers; MR is a screwdriver
                                                        Mark C. Chu-Carroll


  • RDBs don’t parallelize very well
     • How many RDBs do you know that can efficiently split a
       task among 1,000 cheap computers?
  • RDBs don’t handle non-tabular data well
     • RDBs are notorious for doing a poor job on recursive data
       structures
  • MapReduce isn’t intended to replace relational
    databases
     • It’s intended to provide a lightweight way of programming
       things so that they can run fast by running in parallel on a
       lot of machines


MR is a Step Backwards, but some Steps Forward
                                                                                Eugene Shekita

  • Arg1: Data Models, Schemas, and Query Languages
       • Semi-structured data models and high-level parallel data-flow query languages are
         built on top of MapReduce
            •   Pig, Hive, Jaql, Cascading, Cloudbase
       • Hadoop will eventually have a real data model, schema, catalogs, and query
         language
       • Moreover, Pig, Jaql, and Cascading are some steps forward
            • Support semi-structured data
            • Support high-level parallel data-flow languages rather than declarative query
                languages
       • Greenplum and Aster Data support MapReduce, but look more limited than Pig, Jaql,
         and Cascading
            • The calls to MapReduce functions wrapped in SQL queries will make it difficult
                to work with semi-structured data and to program multi-step dataflows
  • Arg3: Novelty
      • Teradata was doing parallel group-by 20 years ago
      • UDAs and UDFs appeared in PostgreSQL in the mid 80s
       • And yet, MapReduce is much more flexible and fault-tolerant
           • Support semi-structured data types, customizable partitioning


Lessons Learned from the Debates
  Who Moved My Cheese?
  • Speed
       • The seek times of physical storage are not keeping pace with improvements
          in network speeds
  • Scale
      • The difficulty of scaling the RDBMS out efficiently
            • Clustering beyond a handful of servers is notoriously hard
  • Integration
      • Today’s data processing tasks increasingly have to access and combine
         data from many different non-relational sources, often over a network
  • Volume
      • Data volumes have grown from tens of gigabytes in the 1990s to
         hundreds of terabytes and often petabytes in recent years

                                         Stolen from 10 Ways to Complement the Enterprise RDBMS Using Hadoop




A Hybrid of MapReduce and RDBMS


Integrate MapReduce into RDBMS
                                 RDBMS                                MapReduce
   Data size       Gigabytes                            Petabytes
    Updates        Read and write(Mutable)              Write once, read many times(Immutable)
     Latency       Low                                  High
     Access        Interactive(point query) and batch   Batch(ad-hoc query in brute-force)
   Structure       Fixed schema                         Semi-structured schema
   Language        SQL                                  Procedural (Java, C++, etc)
    Integrity      High                                 Low
     Scaling       Nonlinear                            Linear




         Hybrid systems: HadoopDB, Greenplum, Aster Data


 In-Database MapReduce vs. File-only MapReduce
      • In-Database MapReduce
           • Greenplum, Aster Data, HadoopDB
      • File-only MapReduce
           • Pig, Hive, Cloudbase

                                       In-Database MapReduce          File-Only MapReduce
             Target User               Analyst, DBA, Data Miner       Computer Science Engineer
       Scale & Performance             High                           High
          Hardware Costs               Low                            Low
        Analytical Insights            High                           High
       Failover & Recovery             High                           High
      Use: Ad-Hoc Queries              Easy (seamless)                Harder (custom)
       Use: UI, Client Tools           BI Tool (GUI), SQL (CLI)       Developer Tool (Java)
          Use: Ecosystem               High (JDBC, ODBC)              Lower (custom)
     Protect: Data Integrity           High (ACID, schema)            Lower (no transaction guarantees)
         Protect: Security             High (roles, privileges)       Lower (custom)
      Protect: Backup & DR             High (database backup/DR)      Lower (custom)
   Performance: Mixed Workloads        High (workload/QoS mgmt)       Lower (limited concurrency)
   Performance: Network Bottleneck     No (optimized partitioning)    Higher (network inefficient)
         Operational Cost              Low (1 DBA)                    Higher (several engineers)




Non-Relational Data Storages


Throw ‘Relational’ Away, and Take ‘Schema-Free’
  The new face of data
  • Scale out, not up
  • Online load balancing, cluster growth
  • Flexible schema
        •   Some data have sparse attributes and do not need the 'relational' property
               •   Document/Term vector, User/Item matrix, Log-structured data
  • Key-oriented queries
       •   Some data are stored and retrieved mainly by primary key, without complex joins
  • Trade-off of Consistency, Availability, and Partition Tolerance

  Two Feasible Approaches
  • Bigtable
       •   How can we build a distributed DB on top of GFS?
       •   B+-tree-style lookup, synchronized consistency
              •   Memtable / commit log / immutable SSTables / indexes, compaction

  • Dynamo
       •   How can we build a distributed hash table appropriate for the data center?
       •   DHT-style lookup, eventual consistency
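Dynamo-style DHT lookup can be sketched with consistent hashing: keys and nodes hash onto a ring, and a key lives on the first node clockwise from its position. Node names are hypothetical, and real Dynamo adds virtual nodes and replica preference lists on top of this:

```python
# Consistent-hashing placement on a 2**32 ring.
import bisect
import hashlib

def ring_pos(s):
    # Hash a string to a position on the ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

nodes = ["node1", "node2", "node3"]
ring = sorted((ring_pos(n), n) for n in nodes)
points = [p for p, _ in ring]

def lookup(key):
    # First node clockwise from the key's position (wrapping past the top).
    i = bisect.bisect(points, ring_pos(key)) % len(ring)
    return ring[i][1]

owner = lookup("user:42")
assert owner in nodes
```

The appeal over mod-N hashing is that adding or removing a node only moves the keys in one arc of the ring, not the whole keyspace.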


 A Comparison of Non-Relational Data Storages
  Name          Language          Fault-tolerance                      Persistence                Client Protocol        Data model Docs    Community
  Hbase        Java      Replication, partitioning              Custom on-disk               Custom API, Thrift, Rest   Bigtable     A   Apache, yes
                                                                                                                                         Zvents, Baidu, y
Hypertable     C++          Replication, partitioning           Custom on-disk               Thrift, other              Bigtable     A
                                                                                                                                         es
 Neptune       Java         Replication, partitioning           Custom on-disk               Custom API, Thrift, Rest   Bigtable     A   NHN, some
                            partitioned, replicated, read-rep   Pluggable: BerkleyDB, Mys                               Structured /
 Voldemort     Java                                                                          Java API                                A   Linkedin, no
                            air                                 ql                                                      blob / text
                            partitioned, replicated, immutab    Custom on-disk (append o
   Ringo       Erlang                                                                        HTTP                       blob            B      Nokia, no
                            le                                  nly log)
  Scalaris     Erlang       partitioned, replicated, paxos      In-memory only               Erlang, Java, HTTP         blob            B      OnScale, no
    Kai        Erlang       partitioned, replicated?            On-disk Dets file            Memcached                  blob            C      no
 Dynomite      Erlang       partitioned, replicated             Pluggable: couch, dets       Custom ascii, Thrift       blob            D+     Powerset, no
MemcacheDB     C            replication                         BerkleyDB                    Memcached                  blob            B      some
                                                                Pluggable: BerkleyDB, Cust                              Document or            Third rail, unsur
  ThruDB       C++          Replication                                                      Thrift                                     C+
                                                                om, Mysql, S3                                           iented                 e
                                                                                                                        Document or
 CouchDB       Erlang       Replication, partitioning?          Custom on-disk               HTTP, json                                 A      Apache, yes
                                                                                                                        iented (json)
                                                                                                                        Bigtable me
 Cassandra     Java         Replication, partitioning           Custom on-disk               Thrift                                     F      Facebook, no
                                                                                                                        ets Dynamo
                                                                Pluggable: in-memory, Luc
  Coord        C++          Replication?, partitioning                                    Custom API, Thrift            text / blob     A      NHN, some
                                                                ene, BerkelyDB, Mysql
                                                                        Stolen from Anti-RDBMS - A list of distributed key-value stores by Richard Jones
   [Diagram: on-going classification by Woohyun Kim, placing HBase, Hypertable,
    Cassandra, Dynamo, Voldemort, Dynomite, KAI, SimpleDB, Chordless, CouchDB,
    Tokyo Cabinet, MongoDB, MemcacheDB, ThruDB, and Scalaris along the axes
    Bigtable, DHT, Key-Value, and Document-oriented]


Emergent Document-oriented Storages
  Why Document-oriented?
  • All fields become optional
  • All relationships become Many-to-Many
  • Chatter always expands


  Key Features
  • Schema-Free
  • Straightforward Data Model
  • Full Text Indexing
  • RESTful HTTP/JSON API
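A small sketch of what "schema-free" and "all fields become optional" mean in practice: documents in one collection need not share fields, and queries cope with the gaps. The field names below are illustrative assumptions, not any particular store's API:

```python
# Schema-free documents: no table definition, heterogeneous fields,
# and a minimal query-by-predicate over the collection.
collection = [
    {"_id": 1, "name": "Ada", "tags": ["db", "mapreduce"]},
    {"_id": 2, "name": "Ben"},                                 # "tags" is optional
    {"_id": 3, "name": "Eve", "tags": ["dht"], "email": "eve@example.com"},
]

def find(coll, predicate):
    # Filter documents with a function; missing fields are just absent keys.
    return [doc for doc in coll if predicate(doc)]

with_db_tag = find(collection, lambda d: "db" in d.get("tags", []))
print([d["name"] for d in with_db_tag])  # ['Ada']
```

Real document stores (CouchDB, MongoDB) replace the predicate with map/reduce views or an object-based query language, but the underlying model is this one.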


Document-oriented vs. RDBMS
                          CouchDB                              MongoDB                                MySQL
 Data Model               Document-oriented (JSON)             Document-oriented (BSON)               Relational
 Data Types               ?                                    string, int, double, boolean, date,    Link
                                                               bytearray, object, array, others
 Large Objects (Files)    Yes (attachments)                    Yes (GridFS)                           No???
 Replication              Master-master (with developer-       Master-slave                           Master-slave
                          supplied conflict resolution)
 Object (row) Storage     One large repository                 Collection-based                       Table-based
 Query Method             Map/reduce of JavaScript functions   Dynamic; object-based query language   Dynamic; SQL
                          to lazily build an index per query
 Secondary Indexes        Yes                                  Yes                                    Yes
 Atomicity                Single document                      Single document                        Yes – advanced
 Interface                REST                                 Native drivers                         Native drivers
 Server-side batch data   ?                                    Yes, via JavaScript                    Yes (SQL)
 manipulation
 Written in               Erlang                               C++                                    C
 Concurrency Control      MVCC                                 Update in place                        Update in place




Thank you.


Appendix: What is Coord?
  Architectural Components
  • dust: a distributed file system based on DHT
  • coord spaces: a resource sharable store system based on SBA
  • coord mapreduce: a simplified large-scale data processing framework
  • warp: a scalable remote/parallel execution system
  • graph: a large-scale distributed graph search system


Appendix: Coord Internals
 A space-based architecture built on distributed hash tables
    SBA (Space-Based Architecture)
        Processes communicate with each other only through spaces
    DHT (Distributed Hash Tables)
        Data identified by hash functions is placed on numerically near nodes
 A computing platform that projects a single address space onto
  distributed memories
    As if users worked in a single computing environment
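The write/read/take operations shown in the diagram can be sketched as a toy, single-process tuple space; this is an illustration of the SBA idea, not Coord's actual API:

```python
# A minimal tuple space: processes never talk to each other directly,
# they only write, read, or take tuples in a shared space.
class Space:
    def __init__(self):
        self.tuples = []

    def write(self, t):
        # Publish a tuple into the space.
        self.tuples.append(t)

    def read(self, match):
        # Return a copy of the first matching tuple (non-destructive).
        return next(t for t in self.tuples if match(t))

    def take(self, match):
        # Remove and return the first matching tuple (destructive read).
        t = self.read(match)
        self.tuples.remove(t)
        return t

space = Space()
space.write(("task", 7))                       # producer side
task = space.take(lambda t: t[0] == "task")    # consumer side
print(task)  # ('task', 7)
```

In Coord the space itself is distributed: the DHT decides which node holds each tuple, so producers and consumers stay decoupled across machines.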

                  [Diagram: an application issues write/read/take operations on a
                   space projected over a hash ring (0 … 2^m-1) spanning node 1
                   through node n]

More Related Content

What's hot

Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopFebiyan Rachman
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Jonathan Seidman
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
BigData & CDN - OOP2011 (Pavlo Baron)
BigData & CDN - OOP2011 (Pavlo Baron)BigData & CDN - OOP2011 (Pavlo Baron)
BigData & CDN - OOP2011 (Pavlo Baron)Pavlo Baron
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP vinoth kumar
 
Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An OverviewC. Scyphers
 
Introduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache HadoopIntroduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache HadoopAvkash Chauhan
 
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...Dataconomy Media
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1Abbas Maazallahi
 
Hadoop core concepts
Hadoop core conceptsHadoop core concepts
Hadoop core conceptsMaryan Faryna
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyonddatasalt
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in detailsMahmoud Yassin
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop BasicsSonal Tiwari
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Emergent Distributed Data Storage

  • 1. http://www.coordguru.com Emergent Distributed Data Storages for Big Data, Storage, and Analysis Woohyun Kim The creator of open source “Coord” (http://www.coordguru.com) 2010-01-27
  • 2. Contents http://www.coordguru.com The Advent of Big Data • Noah’s Ark Problem • Key Issues with "Big Data" • How to deal with "Big Data" Hadoop Revolution • Best Practice in Hadoop • Hadoop is changing the Game • Big Data goes well with Hadoop • Case Study: Parallel Join • Case Study: Further Study in Parallel Join • Case Study: Improvements in Parallel Join MapReduce Debates • MapReduce is just A Major Step Backwards!!! • RDB experts Jump the MR Shark • DBs are hammers; MR is a screwdriver • MR is a Step Backwards, but some Steps Forward A Hybrid of MapReduce and RDBMS • Integrate MapReduce into RDBMS • In-Database MapReduce vs. File-only MapReduce Non-Relational Data Storages • Throw ‘Relational’ Away, and Take ‘Schema-Free’ • A Comparison of Non-Relational Data Storages • Emergent Document-oriented Storages • Document-oriented vs. RDBMS
  • 4. http://www.coordguru.com Noah’s Ark Problem • Did Noah take dinosaurs on the Ark? • The Ark was a very large ship designed especially for its important purpose • It was so large and complex that it took Noah 120 years to build • How to put such a big thing • Diet? • Split? • Differentiate • Put • Integrate • Scale Up? • Scale Out? • The "Big Data" problem is just like that
  • 5. http://www.coordguru.com Key Issues with "Big Data" • Lookup • Metadata server -> centralized or distributed -> partitioned replicas to avoid a single point of failure • Partition • Data locality -> network bandwidth reduction -> putting the computation near the data • Replication • Hardware failure -> data loss -> availability from redundant copies of the data • Load-balanced Parallel Processing • Corrupt data or remote process failure -> speculative execution or rescheduling • Ad-hoc Analysis • Some partitioned data may need to be combined with other data
  • 6. http://www.coordguru.com How to deal with "Big Data" Struggling to STORE and ANALYZE "Big Data"
  • 7. http://www.coordguru.com Appendix: What is ETL? ETL (Extract, Transform, and Load) • A process in database usage, and especially in data warehousing, that involves: • Extracting data from outside sources (such as different data organizations/formats, non-relational database structures) • Transforming it to fit operational needs (which can include quality levels) • Selection, translation, encoding, calculation, filtering, sorting, joining, aggregation, transposing or pivoting, splitting, disaggregation • Loading it into the end target (database or data warehouse) ETL Open Sources • Talend Open Studio • Pentaho Data Integration (Kettle) • RapidMiner • Jitterbit 2.0 • Apatar • CloverETL • Scriptella
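The three ETL stages named above can be sketched as a minimal pipeline; the sample rows, the field names, and the 100-unit filter below are hypothetical stand-ins for a real outside source and warehouse target.

```python
# Hypothetical sketch of the ETL stages: extract from an outside source,
# transform to fit operational needs, load into the end target.

source_rows = ["1,alice, 250", "2,bob,  90", "3,carol,410"]

def extract(rows):
    # Extract: parse raw records from an outside source.
    return [r.split(",") for r in rows]

def transform(records):
    # Transform: translation (type conversion), cleaning, and filtering.
    cleaned = [(int(i), name.strip(), int(amount)) for i, name, amount in records]
    return [r for r in cleaned if r[2] >= 100]  # filter out small amounts

def load(records, target):
    # Load: write into the end target (a list standing in for a warehouse table).
    target.extend(records)

warehouse = []
load(transform(extract(source_rows)), warehouse)
```

Each stage is a pure function over the previous stage's output, which is the same selection/translation/filtering decomposition the slide lists.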
  • 9. http://www.coordguru.com Best Practice in Hadoop • Software Stack in Google/Hadoop • Cookbook for "Big Data" • Structured Data Storage for "Big Data": a table indexed by (row key, column family:column key, timestamp)
  • 10. http://www.coordguru.com Appendix: What is MapReduce? Map • Reads a set of records from an input file, acting as a filter or transformation • Outputs a set of (key, data) pairs, partitioning them into R disjoint buckets by key Reduce • Reads a set of (key, list of data) pairs from the R disjoint buckets • Each of the R buckets from the map outputs is shuffled and aggregated into its corresponding reduce task, ordered by key • Outputs a new set of records • (Diagram: map tasks act as group-by/filter, reduce tasks act as aggregators)
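The map/shuffle/reduce flow described above can be sketched in-process; `map_fn`, `reduce_fn`, and the word-count example are illustrative stand-ins, not the Hadoop API.

```python
from itertools import groupby

def map_fn(record):
    # Map: filter/transform one input record into (key, value) pairs.
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: aggregate all values that share the same key.
    return (key, sum(values))

def map_reduce(records, map_fn, reduce_fn):
    pairs = [kv for r in records for kv in map_fn(r)]
    pairs.sort(key=lambda kv: kv[0])            # shuffle: order by key
    return [reduce_fn(k, [v for _, v in grp])   # group-by / aggregate
            for k, grp in groupby(pairs, key=lambda kv: kv[0])]

counts = dict(map_reduce(["big data", "big storage"], map_fn, reduce_fn))
```

In a real cluster the sort/groupby step is the distributed shuffle, with each of the R key buckets routed to one reduce task.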
  • 11. http://www.coordguru.com Hadoop is changing the Game • Hadoop, DW, and BI
  • 12. http://www.coordguru.com Big Data goes well with Hadoop • Parallelize Relational Algebra Operations using MapReduce
  • 13. http://www.coordguru.com Case Study: Parallel Join • A Parallel Join Example using MapReduce
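A reduce-side join of the kind this case study illustrates can be sketched as follows; the `movies`/`ratings` relations and the "M"/"R" tags are hypothetical, but the pattern (tag records by source in the map phase, group by join key in the shuffle, combine in the reduce phase) is the standard one.

```python
from itertools import groupby

movies = [(1, "Alien"), (2, "Brazil")]   # (movie_id, title)
ratings = [(1, 5), (1, 4), (2, 3)]       # (movie_id, stars)

def map_phase():
    # Tag each record with its source relation so the reducer can tell them apart.
    for mid, title in movies:
        yield (mid, ("M", title))
    for mid, stars in ratings:
        yield (mid, ("R", stars))

pairs = sorted(map_phase(), key=lambda kv: kv[0])  # shuffle by join key

joined = []
for mid, grp in groupby(pairs, key=lambda kv: kv[0]):
    tagged = [v for _, v in grp]
    titles = [v for t, v in tagged if t == "M"]
    stars = [v for t, v in tagged if t == "R"]
    for title in titles:                 # cross-product within one key
        for s in stars:
            joined.append((mid, title, s))
```

The tagging, the full sort, and the per-key buffering are exactly the costs the next slide lists as problems with this approach.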
  • 14. http://www.coordguru.com Case Study: Further Study in Parallel Join Problems • Need to sort • Move the partitioned data across the network • Due to shuffling, the whole dataset must be sent • Skewed by popular keys • All records for a particular key are sent to the same reducer • Overhead of tagging Alternatives • Map-side Join • Mapper-only job to avoid the sort and to reduce data movement across the network • Semi-Join • Shrink the data size through a semi-join (by preprocessing)
  • 15. http://www.coordguru.com Case Study: Improvements in Parallel Join Map-Side Join • Replicate a relatively smaller input source to the cluster • Put the replicated dataset into a local hash table • Join – a relatively larger input source with each local hash table • Mapper: do Mapper-side Join Semi-Join • Extract – unique IDs referenced in a larger input source(A) • Mapper: extract Movie IDs from Ratings records • Reducer: accumulate all unique Movie IDs • Filter – the other larger input source(B) with the referenced unique IDs • Mapper: filter the referenced Movie IDs from full Movie dataset • Join - a larger input source(A) with the filtered datasets • Mapper: do Mapper-side Join • Ratings records & the filtered movie IDs dataset
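The map-side join above can be sketched as a mapper-only pass; the `movies` hash table and `ratings` stream are hypothetical stand-ins for the replicated smaller input and the larger input.

```python
# Hypothetical sketch of a map-side join: the smaller relation is replicated
# to every mapper and loaded into a local hash table, so the larger relation
# is joined during the map phase with no sort and no shuffle.

movies = {1: "Alien", 2: "Brazil"}           # small side, fits in memory
ratings = [(1, 5), (1, 4), (2, 3), (9, 2)]   # large side, streamed

def mapper(record):
    mid, stars = record
    title = movies.get(mid)                  # local hash-table lookup
    if title is not None:                    # drop unmatched keys
        yield (mid, title, stars)

joined = [row for rec in ratings for row in mapper(rec)]
```

A semi-join would first run a job like this in reverse: extract the unique IDs from the large side, use them to filter the other input, then feed the shrunken dataset into the same mapper-only join.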
  • 17. http://www.coordguru.com MapReduce is just A Major Step Backwards!!! DeWitt and Stonebraker on January 17, 2008 • A giant step backward in the programming paradigm for large-scale data-intensive applications • Schemas are good • Types are checked at runtime, so no garbage gets in • Separation of the schema from the application is good • Schemas are stored in catalogs, so they can be queried (in SQL) • High-level access languages are good • Present what you want rather than an algorithm for how to get it • No schema??! • At least one data field by specifying the key as input • For Bigtable/HBase, different tuples within the same table can actually have different schemas • There is not even support for logical schema changes such as views
  • 18. http://www.coordguru.com MapReduce is just A Major Step Backwards!!! (cont’d) DeWitt and Stonebraker on January 17, 2008 • A sub-optimal implementation, in that it uses brute force instead of indexing • Indexing • All modern DBMSs use hash or B-tree indexes to accelerate access to data • In addition, there is a query optimizer to decide whether to use an index or perform a brute-force sequential search • However, MapReduce has no indexes, so it processes only in brute-force fashion • Automatic parallel execution • In the 1980s, the DBMS research community explored it in systems such as Gamma, Bubba, and Grace, and even the commercial Teradata • Skew • The distribution of records with the same key is skewed in the map phase, which causes some reduce tasks to take much longer than others • Intermediate data pulling • In the reduce phase, two or more reduce tasks attempt to read input files from the same map node simultaneously
  • 19. http://www.coordguru.com MapReduce is just A Major Step Backwards!!! (cont’d) Dewitt and StoneBraker in January 17, 2008 • Not novel at all – it represents a specific implementation of well known techniques developed nearly 25 years ago • Partitioning for join • Application of Hash to Data Base Machine and its Architecture, 1983 • Joins in parallel on a shared-nothing • Multiprocessor Hash-based Join Algorithms, 1985 • The Case for Shared-Nothing, 1986 • Aggregates in parallel • The Gamma Database Machine Project, 1990 • Parallel Database System: The Future of High Performance Database Systems, 1992 • Adaptive Parallel Aggregation Algorithms, 1995 • Teradata has been selling a commercial DBMS utilizing all of these techniques for more than 20 years • PostgreSQL supported user-defined functions and user-defined aggregates in the mid 1980s
  • 20. http://www.coordguru.com MapReduce is just A Major Step Backwards!!! (cont’d) DeWitt and Stonebraker on January 17, 2008 • Missing most of the features that are routinely included in current DBMSs • MapReduce provides only a sliver of the functionality found in modern DBMSs • Bulk loader – transform input data in files into a desired format and load it into a DBMS • Indexing – hash or B-tree indexes • Updates – change the data in the database • Transactions – support parallel updates and recovery from failures during updates • Integrity constraints – help keep garbage out of the database • Referential integrity – again, help keep garbage out of the database • Views – so the schema can change without having to rewrite the application program • Incompatible with all of the tools DBMS users have come to depend on • MapReduce cannot use the tools available in a modern SQL DBMS, and has none of its own • Report writers (Crystal Reports) • Prepare reports for human visualization • Business intelligence tools (Business Objects or Cognos) • Enable ad-hoc querying of large data warehouses • Data mining tools (Oracle Data Mining or IBM DB2 Intelligent Miner) • Allow a user to discover structure in large data sets • Replication tools (Golden Gate) • Allow a user to replicate data from one DBMS to another • Database design tools (Embarcadero) • Assist the user in constructing a database
  • 22. http://www.coordguru.com RDB experts Jump the MR Shark Greg Jorgensen in January 17, 2008 • Arg1: MapReduce is a step backwards in database access • MapReduce is not a database, a data storage, or management system • MapReduce is an algorithmic technique for the distributed processing of large amounts of data • Arg2: MapReduce is a poor implementation • MapReduce is one way to generate indexes from a large volume of data, but it’s not a data storage and retrieval system • Arg3: MapReduce is not novel • Hashing, parallel processing, data partitioning, and user-defined functions are all old hat in the RDBMS world, but so what? • The big innovation MapReduce enables is distributing data processing across a network of cheap and possibly unreliable computers • Arg4: MapReduce is missing features • Arg5: MapReduce is incompatible with the DBMS tools • The ability to process a huge volume of data quickly such as web crawling and log analysis is more important than guaranteeing 100% data integrity and completeness
  • 23. http://www.coordguru.com DBs are hammers; MR is a screwdriver Mark C. Chu-Carroll • RDBs don’t parallelize very well • How many RDBs do you know that can efficiently split a task among 1,000 cheap computers? • RDBs don’t handle non-tabular data well • RDBs are notorious for doing a poor job on recursive data structures • MapReduce isn’t intended to replace relational databases • It’s intended to provide a lightweight way of programming things so that they can run fast by running in parallel on a lot of machines
  • 24. http://www.coordguru.com MR is a Step Backwards, but some Steps Forward Eugene Shekita • Arg1: Data Models, Schemas, and Query Languages • A semi-structured data model and high-level parallel data-flow query languages are built on top of MapReduce • Pig, Hive, Jaql, Cascading, Cloudbase • Hadoop will eventually have a real data model, schema, catalogs, and query language • Moreover, Pig, Jaql, and Cascading are some steps forward • Support semi-structured data • Support parallel data-flow languages at a higher level than declarative query languages • Greenplum and Aster Data support MapReduce, but look more limited than Pig, Jaql, and Cascading • The calls to MapReduce functions wrapped in SQL queries will make it difficult to work with semi-structured data and to program multi-step dataflows • Arg3: Novelty • Teradata was doing parallel group-by 20 years ago • UDAs and UDFs appeared in PostgreSQL in the mid 80s • And yet, MapReduce is much more flexible and fault-tolerant • Supports semi-structured data types and customizable partitioning
  • 26. http://www.coordguru.com Lessons Learned from the Debates Who Moved My Cheese? • Speed • The seek times of physical storage is not keeping pace with improvements in network speeds • Scale • The difficulty of scaling the RDBMS out efficiently • Clustering beyond a handful of servers is notoriously hard • Integration • Today’s data processing tasks increasingly have to access and combine data from many different non-relational sources, often over a network • Volume • Data volumes have grown from tens of gigabytes in the 1990s to hundreds of terabytes and often petabytes in recent years Stolen from 10 Ways To complement the Enterprise RDBMS using Hadoop
  • 28. http://www.coordguru.com Integrate MapReduce into RDBMS (RDBMS vs. MapReduce) • Data size: Gigabytes vs. Petabytes • Updates: Read and write (mutable) vs. Write once, read many times (immutable) • Latency: Low vs. High • Access: Interactive (point query) and batch vs. Batch (ad-hoc query in brute force) • Structure: Fixed schema vs. Semi-structured schema • Language: SQL vs. Procedural (Java, C++, etc.) • Integrity: High vs. Low • Scaling: Nonlinear vs. Linear • Hybrid systems: HadoopDB, Greenplum, Aster Data
• 29. In-Database MapReduce vs. File-only MapReduce
  • In-Database MapReduce: Greenplum, Aster Data, HadoopDB
  • File-only MapReduce: Pig, Hive, Cloudbase
  • In-Database vs. File-only
    • Target user: Analyst, DBA, data miner vs. Computer-science engineer
    • Scale & performance: High vs. High
    • Hardware costs: Low vs. Low
    • Analytical insights: High vs. High
    • Failover & recovery: High vs. High
    • Use, ad-hoc queries: Easy (seamless) vs. Harder (custom)
    • Use, UI and client tools: BI tool (GUI), SQL (CLI) vs. Developer tool (Java)
    • Use, ecosystem: High (JDBC, ODBC) vs. Lower (custom)
    • Protect, data integrity: High (ACID, schema) vs. Lower (no transaction guarantees)
    • Protect, security: High (roles, privileges) vs. Lower (custom)
    • Protect, backup & DR: High (database backup/DR) vs. Lower (custom)
    • Performance, mixed workloads: High (workload/QoS mgmt) vs. Lower (limited concurrency)
    • Performance, network bottleneck: No (optimized partitioning) vs. Higher (network-inefficient)
    • Operational cost: Low (1 DBA) vs. Higher (several engineers)
• 31. Throw 'Relational' Away, and Take 'Schema-Free'
  • The new face of data
    • Scale out, not up
    • Online load balancing and cluster growth
    • Flexible schema
      • Some data have sparse attributes and do not need the 'relational' property
      • Document/term vectors, user/item matrices, log-structured data
    • Key-oriented queries
      • Some data are stored and retrieved mainly by primary key, without complex joins
    • Trade-off among Consistency, Availability, and Partition tolerance
  • Two feasible approaches
    • Bigtable: how can we build a distributed DB on top of GFS?
      • B+-tree-style lookup, synchronized consistency
      • Memtable / commit log / immutable SSTables / indexes, compaction
    • Dynamo: how can we build a distributed hash table appropriate for the data center?
      • DHT-style lookup, eventual consistency
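Dynamo's DHT-style lookup can be illustrated with a minimal consistent-hashing sketch (class and node names are made up for this example; real Dynamo additionally uses virtual nodes and vector clocks): keys and nodes hash onto the same ring, and a key lives on the first node clockwise from its hash, with replicas on the next distinct nodes.

```python
# Minimal consistent-hashing sketch of a Dynamo-style DHT lookup.
import bisect
import hashlib

def ring_hash(s):
    """Hash a string onto the ring's integer space."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        # Nodes are placed on the ring at their own hash positions.
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def nodes_for(self, key):
        """Return the first `replicas` distinct nodes clockwise from the key."""
        idx = bisect.bisect(self.ring, (ring_hash(key),))
        chosen = []
        for i in range(len(self.ring)):
            node = self.ring[(idx + i) % len(self.ring)][1]  # wrap around
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == self.replicas:
                break
        return chosen

ring = HashRing(["node1", "node2", "node3"], replicas=2)
print(ring.nodes_for("user:42"))  # two distinct nodes; deterministic per key
```

Because placement depends only on the hash function, any client can locate a key without a central catalog, which is what makes key-oriented queries scale out.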
• 32. A Comparison of Non-Relational Data Storages
  • (columns: language; fault-tolerance; persistence; client protocol; data model; docs; community)
  • HBase: Java; replication, partitioning; custom on-disk; custom API, Thrift, REST; Bigtable; docs A; Apache, yes
  • Hypertable: C++; replication, partitioning; custom on-disk; Thrift, other; Bigtable; docs A; Zvents, Baidu, yes
  • Neptune: Java; replication, partitioning; custom on-disk; custom API, Thrift, REST; Bigtable; docs A; NHN, some
  • Voldemort: Java; partitioned, replicated, read-repair; pluggable (BerkeleyDB, MySQL); Java API; structured/blob/text; docs A; LinkedIn, no
  • Ringo: Erlang; partitioned, replicated, immutable; custom on-disk (append-only log); HTTP; blob; docs B; Nokia, no
  • Scalaris: Erlang; partitioned, replicated, Paxos; in-memory only; Erlang, Java, HTTP; blob; docs B; OnScale, no
  • Kai: Erlang; partitioned, replicated?; on-disk Dets file; Memcached; blob; docs C; no
  • Dynomite: Erlang; partitioned, replicated; pluggable (couch, dets); custom ASCII, Thrift; blob; docs D+; Powerset, no
  • MemcacheDB: C; replication; BerkeleyDB; Memcached; blob; docs B; some
  • ThruDB: C++; replication; pluggable (BerkeleyDB, custom, MySQL, S3); Thrift; document-oriented; docs C+; Third Rail, unsure
  • CouchDB: Erlang; replication, partitioning?; custom on-disk; HTTP, JSON; document-oriented (JSON); docs A; Apache, yes
  • Cassandra: Java; replication, partitioning; custom on-disk; Thrift; Bigtable meets Dynamo; docs F; Facebook, no
  • Coord: C++; replication?, partitioning; pluggable (in-memory, Lucene, BerkeleyDB, MySQL); custom API, Thrift; text/blob; docs A; NHN, some
  • Stolen from "Anti-RDBMS: A list of distributed key-value stores" by Richard Jones
  • On-going classification by Woohyun Kim
    • Bigtable: HBase, Hypertable
    • DHT: Cassandra, Dynamo, Voldemort, Dynomite, KAI, SimpleDB, Chordless
    • Document-oriented: CouchDB, MongoDB, ThruDB
    • Key-value: Tokyo Cabinet, MemcacheDB, Scalaris
• 33. Emergent Document-oriented Storages
  • Why document-oriented?
    • All fields become optional
    • All relationships become many-to-many
    • Chatter always expands
  • Key features
    • Schema-free
    • Straightforward data model
    • Full-text indexing
    • RESTful HTTP/JSON API
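The first two bullets can be sketched in a few lines (the documents and field names here are invented for illustration): in a schema-free store, documents in one collection need not share fields, and a many-to-many relationship is just a list embedded in the document rather than a join table.

```python
# Sketch of the schema-free, many-to-many document model.
docs = {}  # key -> document (a plain dict; no fixed columns)

def put(key, doc):
    docs[key] = doc

put("u1", {"name": "kim", "tags": ["search", "dht"]})
put("u2", {"name": "lee"})  # no 'tags' field: every field is optional
put("u3", {"name": "park", "tags": ["dht"], "homepage": "http://example.org"})

# Many-to-many without a join table: invert the embedded tag lists
# to query the relationship from the other side.
by_tag = {}
for key, doc in docs.items():
    for tag in doc.get("tags", []):  # absent fields are simply skipped
        by_tag.setdefault(tag, []).append(key)

print(by_tag)  # {'search': ['u1'], 'dht': ['u1', 'u3']}
```

Adding a new attribute later (as "chatter expands") requires no schema migration: new documents just carry the extra field.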
• 34. Document-oriented vs. RDBMS (CouchDB / MongoDB / MySQL)
  • Data model: Document-oriented (JSON) / Document-oriented (BSON) / Relational
  • Data types: ? / string, int, double, boolean, date, byte array, object, array, others / Link
  • Large objects (files): Yes (attachments) / Yes (GridFS) / no???
  • Replication: Master-master (with developer-supplied conflict resolution) / Master-slave / Master-slave
  • Object (row) storage: One large repository / Collection-based / Table-based
  • Query method: Map/reduce of JavaScript functions to lazily build an index per query / Dynamic; object-based query language / Dynamic; SQL
  • Secondary indexes: Yes / Yes / Yes
  • Atomicity: Single document / Single document / Yes (advanced)
  • Interface: REST / Native drivers / Native drivers
  • Server-side batch data manipulation: ? / Yes, via JavaScript / Yes (SQL)
  • Written in: Erlang / C++ / C
  • Concurrency control: MVCC / Update in place / Update in place
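The CouchDB "query method" row deserves a sketch. This is Python standing in for CouchDB's JavaScript view functions (the document shapes are invented): a map function emits (key, value) rows per document, the sorted rows form the view's index, and a reduce function folds the rows per key.

```python
# Sketch of a CouchDB-style map/reduce view, in Python rather than JavaScript.
documents = [
    {"_id": "a", "type": "post", "author": "kim"},
    {"_id": "b", "type": "post", "author": "lee"},
    {"_id": "c", "type": "comment", "author": "kim"},
]

def map_fn(doc):
    """Emit one (author, 1) row per post; views can filter as well as project."""
    if doc["type"] == "post":
        yield (doc["author"], 1)

def reduce_fn(values):
    return sum(values)

# Build the view's sorted index (CouchDB does this lazily and incrementally,
# reusing the index across queries until documents change).
rows = sorted(row for doc in documents for row in map_fn(doc))
view = {}
for key, value in rows:
    view.setdefault(key, []).append(value)
result = {key: reduce_fn(vals) for key, vals in view.items()}

print(result)  # {'kim': 1, 'lee': 1}
```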
• 36. Appendix: What is Coord? Architectural Comparison
  • dust: a distributed file system based on DHT
  • coord spaces: a resource-sharable store system based on SBA
  • coord mapreduce: a simplified large-scale data processing framework
  • warp: a scalable remote/parallel execution system
  • graph: a large-scale distributed graph search system
• 37. Appendix: Coord Internals
  • A space-based architecture built on distributed hash tables
    • SBA (Space-Based Architecture): processes communicate with one another only through spaces
    • DHT (Distributed Hash Tables): data identified by hash functions are placed on numerically near nodes
  • A computing platform that projects a single address space onto distributed memories
    • As if users worked in a single computing environment
  • (diagram: an application issues take/write/read operations against a hash ring of 0 to 2^m - 1 spanning node 1 through node n)
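The take/write/read operations in the diagram can be sketched as follows. This is a toy model of the space-based style the slide describes, not Coord's actual API (the class, method names, and hash-based placement are assumptions for illustration): processes never address each other directly; they only write, read, or take entries in a shared space whose keys are hashed onto nodes.

```python
# Toy sketch of a space-based architecture over hash-partitioned nodes.
import hashlib

class Space:
    """One logical address space partitioned over nodes by key hash."""
    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node(self, key):
        # A datum identified by its hash is placed on a fixed node.
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def write(self, key, value):
        self._node(key)[key] = value

    def read(self, key):
        """Non-destructive read: the entry stays in the space."""
        return self._node(key).get(key)

    def take(self, key):
        """Destructive read: the entry is removed from the space."""
        return self._node(key).pop(key, None)

space = Space(num_nodes=4)
space.write("job:1", "render page")
print(space.read("job:1"))  # 'render page' (still in the space)
print(space.take("job:1"))  # 'render page' (now removed)
print(space.read("job:1"))  # None
```

To the application the space looks like a single address, which is the "single computing environment" illusion the slide refers to.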