HBase and Accumulo
 Washington DC Hadoop User Group
          Jan 25th, 2012

            Todd Lipcon
     Software Engineer, Cloudera
   todd@cloudera.com / @tlipcon


     Copyright 2011 Cloudera Inc. All rights reserved
Background – Overview

• HBase and Accumulo are both open-source, Apache
  2.0 licensed implementations of Google’s BigTable
  infrastructure, running on Apache Hadoop
• Scalable, distributed storage
   • Scalable data storage at petabyte scale, storing trillions of
     rows distributed across hundreds or thousands of machines
   • Automatic fault tolerance and data distribution as machines
     crash or rejoin the cluster
   • Linear scaling of IOPS and data capacity by adding servers
• Data model is a big sorted hierarchical map

Sorted Map Datastores
• Each row has a row key (like a Primary Key in RDBMS
  terms)
   • Users may query by exact row key or by range of row keys
   • Data is always stored and returned in sorted order
• Each row has some number of columns
   • Each column has a qualifier and a value, like an entry in a
     Map<byte[], byte[]>
   • Different rows may have different sets of columns
   • Each cell has an associated timestamp and may retain a
     history of previous values
• Columns are grouped into column families and locality
  groups
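
The data model on this slide can be sketched as a small in-memory table. This is only an illustration in Python (both projects are written in Java), and the SortedMapTable class and its method names are hypothetical, not part of either system's API. It shows exact lookup by row key, range scans in sorted order, and versioned cells.

```python
from bisect import bisect_left


class SortedMapTable:
    """Toy table: row key -> {(family, qualifier): {timestamp: value}}."""

    def __init__(self):
        self._rows = {}          # row key -> cells of that row
        self._sorted_keys = []   # row keys, always kept in sorted order

    def put(self, row, family, qualifier, timestamp, value):
        if row not in self._rows:
            self._rows[row] = {}
            self._sorted_keys.insert(bisect_left(self._sorted_keys, row), row)
        versions = self._rows[row].setdefault((family, qualifier), {})
        versions[timestamp] = value  # older versions are retained

    def get(self, row):
        """Exact lookup: the newest value of every column in the row."""
        cells = self._rows.get(row, {})
        return {col: versions[max(versions)] for col, versions in cells.items()}

    def scan(self, start, stop):
        """Range query: rows with start <= key < stop, in sorted order."""
        lo = bisect_left(self._sorted_keys, start)
        hi = bisect_left(self._sorted_keys, stop)
        return [(key, self.get(key)) for key in self._sorted_keys[lo:hi]]


table = SortedMapTable()
table.put(b"tlipcon", b"roles", b"Hadoop", 2010, b"Committer")
table.put(b"tlipcon", b"roles", b"Hadoop", 2011, b"PMC")
table.put(b"cutting", b"info", b"state", 2010, b"CA")

latest = table.get(b"tlipcon")                            # newest version wins
rows_a_to_m = [key for key, _ in table.scan(b"a", b"m")]  # only b"cutting"
```

Rows with different column sets cost nothing extra here, which is the sense in which the table is sparse.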
Sorted Map Datastore
   (logical view as “records”)

   Row key    Data
   cutting    info:  { ‘height’: ‘9ft’, ‘state’: ‘CA’ }
              roles: { ‘ASF’: ‘Director’, ‘Hadoop’: ‘Founder’ }
   tlipcon    info:  { ‘height’: ‘5ft7’, ‘state’: ‘CA’ }
              roles: { ‘Hadoop’: ‘Committer’@ts=2010,
                       ‘Hadoop’: ‘PMC’@ts=2011,
                       ‘Hive’: ‘Contributor’ }

• The row key is an implicit PRIMARY KEY in RDBMS terms
• Data is all byte[] in HBase
• Different types of data are separated into different “column families”
• Different rows may have different sets of columns (table is sparse)
• A single cell might have different values at different timestamps
• Useful for *-To-Many mappings
Locality Groups
• Different sets of columns may have different properties
  and access patterns
   • Perhaps a few columns are accessed all the time, whereas
     others are large and rarely needed
   • For example, a user’s metadata (1 KB, accessed frequently) and
     their photo (1 MB, cached by CDN and accessed rarely)
• Put metadata in one locality group and photos in
  another
• Locality groups stored separately on disk: access just the
  metadata without reading the photo
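
A rough sketch of why separate storage helps, assuming a hypothetical LocalityGroupStore class in which each locality group is its own "file" (a plain dict here). It is not how either system lays out data on disk. It only shows that a metadata read never touches the photo bytes.

```python
class LocalityGroupStore:
    """Toy store that keeps each locality group in its own 'file' (a dict)."""

    def __init__(self, column_to_group):
        self.column_to_group = column_to_group
        self.files = {group: {} for group in set(column_to_group.values())}
        self.bytes_read = 0  # bytes touched by reads so far

    def put(self, row, column, value):
        group = self.column_to_group[column]
        self.files[group].setdefault(row, {})[column] = value

    def read_group(self, row, group):
        """Read only the columns of one locality group for a row."""
        cells = self.files[group].get(row, {})
        self.bytes_read += sum(len(value) for value in cells.values())
        return dict(cells)


store = LocalityGroupStore(
    {"name": "metadata", "email": "metadata", "photo": "photos"}
)
store.put("tlipcon", "name", b"Todd")
store.put("tlipcon", "email", b"todd@cloudera.com")
store.put("tlipcon", "photo", b"\x00" * 1_000_000)  # stand-in for a 1 MB photo

metadata = store.read_group("tlipcon", "metadata")
```

After the read, store.bytes_read counts only the few metadata bytes, not the megabyte of photo data.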
Sorted Map Datastore
   (physical view as “cells”)

info Column Family / Locality Group
   Row key    Column key      Timestamp        Cell value
   cutting    info:height     1273516197868    9ft
   cutting    info:state      1043871824184    CA
   tlipcon    info:height     1273878447049    5ft7
   tlipcon    info:state      1273616297446    CA

roles Column Family / Locality Group
   Row key    Column key      Timestamp        Cell value
   cutting    roles:ASF       1273871823022    Director
   cutting    roles:Hadoop    1183746289103    Founder
   tlipcon    roles:Hadoop    1300062064923    PMC
   tlipcon    roles:Hadoop    1293388212294    Committer
   tlipcon    roles:Hive      1273616297446    Contributor

• Sorted on disk by row key, column key, descending timestamp
• Timestamps are milliseconds since the Unix epoch
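
The sort order on this slide can be reproduced with an ordinary sort key. A minimal sketch using the roles cells from the slide; the function names are illustrative only. Negating the timestamp is one simple way to get descending order, so the newest version of a cell is the first one a reader meets.

```python
cells = [
    ("tlipcon", "roles:Hadoop", 1293388212294, "Committer"),
    ("cutting", "roles:Hadoop", 1183746289103, "Founder"),
    ("tlipcon", "roles:Hive", 1273616297446, "Contributor"),
    ("tlipcon", "roles:Hadoop", 1300062064923, "PMC"),
    ("cutting", "roles:ASF", 1273871823022, "Director"),
]


def cell_sort_key(cell):
    row, column, timestamp, _value = cell
    # Negating the timestamp makes newer versions sort first.
    return (row, column, -timestamp)


on_disk = sorted(cells, key=cell_sort_key)


def latest_value(sorted_cells, row, column):
    """Return the first (newest) value for a row/column in sorted cells."""
    for r, c, _ts, value in sorted_cells:
        if (r, c) == (row, column):
            return value
    return None


current_role = latest_value(on_disk, "tlipcon", "roles:Hadoop")
```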
(image from Accumulo manual)
Accumulo/HBase Terminology
Accumulo            HBase               Definition
Tablet              Region              A partition of a table (e.g. email inboxes starting with ‘a’-‘c’)
TabletServer        RegionServer        A server in the cluster which hosts a number of tablets/regions, providing read/write access
Log/WAL             HLog/WAL            Write-ahead log – used for durably logging edits
Minor compaction    Flush               Writing data from memory to disk
Major compaction    Minor compaction    Merging several on-disk files into a larger one
Major compaction    Major compaction    Merging all of the on-disk files into a larger one
with all files
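
The flush and compaction rows of the table can be sketched with sorted lists standing in for on-disk files. This is a simplified illustration, not either project's implementation; the function names are made up, and it assumes that on duplicate keys the newest file wins.

```python
import heapq


def flush(memstore):
    """Accumulo 'minor compaction' / HBase 'flush': memory -> one sorted file."""
    return sorted(memstore.items())


def compact(files):
    """Merge sorted files into one larger sorted file.

    `files` is ordered oldest to newest. Passing a subset of the files
    corresponds to Accumulo 'major compaction' / HBase 'minor compaction';
    passing all of them corresponds to the last row of the table.
    """
    # Tag each entry with its file's age so that, for equal keys,
    # the entry from the newest file sorts first.
    tagged = [
        [(key, -age, value) for key, value in f]
        for age, f in enumerate(files)
    ]
    merged, last_key = [], object()
    for key, _neg_age, value in heapq.merge(*tagged):
        if key != last_key:  # keep only the newest entry per key
            merged.append((key, value))
            last_key = key
    return merged


older = flush({"cutting": "Founder", "tlipcon": "Committer"})
newer = flush({"tlipcon": "PMC"})
compacted = compact([older, newer])
```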

That’s all the intro we have time for…

• Check out the excellent Accumulo manual at
  http://incubator.apache.org/accumulo
• And the HBase manual at
  http://hbase.apache.org/book.html
• Also some longer intro videos on Cloudera’s website,
  and an excellent O’Reilly book




Commonalities (the non-controversial stuff)

• Both systems scale well
   • Clusters with >1000 nodes, >1PB
   • Example HBase users: StumbleUpon, TrendMicro, Facebook,
     eBay, Flurry, ngmoco, Mozilla, Adobe, etc.
   • Example Accumulo users: ??????? (I don’t have clearance but
     I’m told they’re big and important)
• Both systems perform well
   • Depending on tuning, one might beat the other at any given
     benchmark, but overall results seem comparable
• Both open source with active development

Commonalities (the non-controversial stuff)

• Storage formats are very similar
   • Used to be the same, then diverged, then re-converged!
   • Multi-level BTrees, bloom filters, compression
   • Prefix compression currently missing in HBase, 95% complete
     for 0.94.0
• Caching code very similar
   • Accumulo uses an older version of HBase’s LRUBlockCache
   • HBase has some recent improvements (off-heap cache), but I
     imagine Accumulo will grab them soon enough.



General features

• Both have good MapReduce integration
• Both have a command-line shell
• Both have a pretty good test suite
   • Accumulo used to be ahead here, but we traded off some
     ideas and use similar testing strategies now
• Both use ZooKeeper for fault tolerant metadata storage,
  and support failover Masters




Now for the fun part… BigTable shootout 2012

• Warning: I am necessarily biased as an HBase
  committer.
• I will be comparing the very latest versions
   • HBase 0.92.0 (released only 2 days ago!)
   • Accumulo 1.4 (not yet released, due out mid Feb?)
• Please feel free to loudly disagree after the talk during
  the time allotted for questions – I am happy to be
  proven wrong! I’ll invite Aaron Cordova and John Vines
  up to help answer questions.


Differences – Active contributors and users




                                                             (plus various contractors thereof)
       (I ran out of space)


Differences – User Mailing list activity




   500-600 messages                                      50-100 messages
   per month (peak                                       per month (peak
   1088)                                                 105)

                                                         [but it’s new at Apache]



     Winner:
Differences – Access Control

• Accumulo has per-cell visibility labels as well as table
  ACLs
   • Each cell carries a visibility expression that determines which
     users may see it, e.g. (TS|(SECRET&PROJECTX))
   • Users who don’t have access can’t tell the cell even exists
   • Very useful for classified information!
• HBase has column family ACLs but no built-in per-cell
  visibility support
   • Some early work to add visibility labels, but not done yet

   Winner:
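
To make the label example concrete, here is a small evaluator for expressions like (TS|(SECRET&PROJECTX)). It is a sketch under stated assumptions, not Accumulo's parser: labels are alphanumeric words, '&' requires both sides, '|' accepts either side, and '&' binds tighter than '|' when parentheses are omitted. The names can_see and visible_cells are hypothetical.

```python
import re


def can_see(expression, authorizations):
    """Return True if a user holding `authorizations` satisfies `expression`."""
    tokens = re.findall(r"[A-Za-z0-9_]+|[()&|]", expression)
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            right = parse_and()
            result = result or right
        return result

    def parse_and():
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            right = parse_atom()
            result = result and right
        return result

    def parse_atom():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            pos += 1  # skip the closing ')'
            return result
        return token in authorizations  # a bare label

    return parse_or()


def visible_cells(cells, authorizations):
    """Drop cells the user may not see; hidden cells leave no trace."""
    return [value for value, label in cells if can_see(label, authorizations)]


cells = [("launch plan", "(TS|(SECRET&PROJECTX))"), ("lunch menu", "PUBLIC")]
seen_by_public_user = visible_cells(cells, {"PUBLIC"})
```

A user holding only PUBLIC gets back just the lunch menu, with nothing in the result to indicate that another cell exists.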
Differences – Authentication

• Accumulo has a built-in user database
   • Users are authenticated by username/password
   • Passed in plaintext over the wire
• HBase optionally uses Kerberos
   • Central administration (e.g. via Active Directory)
   • Key-based secure credential exchange
   • Temporary delegation tokens are created for MR jobs, so even
     if a job’s data leaks, credentials are not compromised
   • Consistent with rest of Hadoop ecosystem

   Winner:
Differences – Locality Groups

• HBase has a 1:1 correspondence of Column Families
  and Locality Groups
   • Moving columns from one locality group to another after data
     has been inserted is impossible
• Accumulo has a proper distinction and allows online
  reassignment of column-to-locality-group mappings



Winner:


Differences – extensibility frameworks
• Accumulo has iterators
   • Allows custom processing to be inserted in the read path as
     well as into the table maintenance code. Provides neat
     features like automatic summary maintenance, for example.
• HBase has coprocessors
   • Much more general framework that also subsumes triggers,
     stored procedures, and cluster management hooks (e.g.
     access control is an HBase coprocessor).
   • Generality has its cost: very difficult to do some things that
     are simple with iterators
   • Some iterator use cases can be done with HBase filters
• I’ll call this one a tie
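
The iterator idea, processing stacked onto the read path, can be sketched with Python generators. This mirrors the concept only and does not use the Accumulo iterator API; summing_iterator and prefix_filter are made-up names, and the input is assumed to arrive sorted by key, as it would from a sorted map datastore.

```python
from itertools import groupby
from operator import itemgetter


def summing_iterator(source):
    """Wrap a sorted (key, value) stream and emit one summed value per key."""
    for key, group in groupby(source, key=itemgetter(0)):
        yield key, sum(value for _key, value in group)


def prefix_filter(source, prefix):
    """Another stage: pass through only keys that start with `prefix`."""
    return ((key, value) for key, value in source if key.startswith(prefix))


# Sorted cells as they might come off disk: several counts per key.
cells = [
    ("page:/about", 1),
    ("page:/home", 3),
    ("page:/home", 4),
    ("user:tlipcon", 2),
]

# Stages compose: each one consumes the stream produced by the one below it.
summaries = list(summing_iterator(prefix_filter(iter(cells), "page:")))
```

Running the same summing stage during compaction instead of at read time is the sort of thing the "automatic summary maintenance" bullet refers to: the stored data shrinks to one running total per key.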
Differences – Web UI and Monitoring




    Winner:
Differences – Write-ahead logging

• HBase uses HDFS files as a WAL
  • Takes advantage of HDFS performance improvements as they
    are developed
  • Same trusted replication and checksumming schemes as HDFS
• Accumulo has its own Logger implementation
  • Extra daemons to run
  • Does not leverage improvements in HDFS
  • Won’t re-replicate if loggers go down


  Winner:
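
Whichever storage backs it, the write-ahead log follows the same discipline: record the edit durably first, apply it in memory second, and replay the log after a crash. A minimal sketch with a hypothetical LoggedStore class, where a Python list stands in for the durable, replicated log:

```python
class LoggedStore:
    """Toy store that appends each edit to a log before applying it in memory."""

    def __init__(self, log):
        self.log = log     # stands in for a durable, replicated log
        self.memory = {}   # in-memory state, lost when the server crashes

    def put(self, key, value):
        self.log.append((key, value))  # 1. log the edit durably first
        self.memory[key] = value       # 2. only then apply it in memory

    @classmethod
    def recover(cls, log):
        """Rebuild in-memory state after a crash by replaying the log in order."""
        store = cls(log)
        for key, value in log:
            store.memory[key] = value
        return store


durable_log = []
server = LoggedStore(durable_log)
server.put("tlipcon/roles:Hadoop", "Committer")
server.put("tlipcon/roles:Hadoop", "PMC")
del server  # simulate a crash: memory is gone, the log survives

recovered = LoggedStore.recover(durable_log)
```

The sketch leaves out exactly what the slide compares: how the log itself is replicated, checksummed, and re-replicated when a node holding it fails.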
Differences – Other features

• Accumulo has a nice mock Accumulo implementation
   • Nice for testing user software
• Accumulo supports isolated scans on super-wide rows
   • HBase supports wide rows but isolation properties are lost
• Accumulo supports tablet merging
   • If tablets get too small, they’ll merge with neighbors
• Accumulo supports table snapshotting/cloning
• Other sundry features: logical clocks, RPC tracing, RPC
  wire compatibility, and more.

Differences – Other features
• HBase has RPM and Debian packages as part of Apache
  Bigtop
   • Integrated (and integration-tested) with Hive, Pig, and others
• HBase has commercial support available from Cloudera,
  as well as several vendors and other projects building
  on top (Lily, OMID, etc)
• HBase has first-class support for REST clients and thin
  Thrift clients
• HBase has inter-cluster wide-area replication
• HBase has significantly more advanced bloom filters
  and other such optimizations (thanks Facebook!)
Summary

• Neither system is better!
• One system may very well be better for your use case,
  or for the community you want to interact with
• Over time, the feature sets are converging
   • RFile vs HFile v2, Security, Caching, Compaction policies,
     Iterators/Coprocessors
• Now that both projects are in Apache, open dialogue,
  code sharing, and friendly competition will help make
  both projects better!


Thanks!

Aaron Cordova and John Vines
(Accumulo committers) will now join
me for some discussion / questions



          Email: todd@cloudera.com
          Twitter: @tlipcon

Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

HBase and Accumulo | Washington DC Hadoop User Group

  • 1. HBase and Accumulo Washington DC Hadoop User Group Jan 25th, 2012 Todd Lipcon Software Engineer, Cloudera todd@cloudera.com / @tlipcon Copyright 2011 Cloudera Inc. All rights reserved
  • 2. Background – Overview
      • HBase and Accumulo are both open-source, Apache 2.0 licensed implementations of Google’s BigTable infrastructure, running on Apache Hadoop
      • Scalable, distributed storage
         • Scalable data storage at petabyte scale, storing trillions of rows distributed across hundreds or thousands of machines
         • Automatic fault tolerance and data distribution as machines crash or rejoin the cluster
         • Linear scaling of IOPS and data capacity by adding servers
      • Data model is a big sorted hierarchical map
  • 3. Sorted Map Datastores
      • Each row has a row key (like a Primary Key in RDBMS terms)
         • Users may query by exact row key or by range of row keys
         • Data is always stored and returned in sorted order
      • Each row has some number of columns
         • Each column has a qualifier and some piece of data. Like a Map<byte[], byte[]>
         • Different rows may have different sets of columns
         • Each cell has an associated timestamp and may retain a history of previous values
      • Columns are grouped into column families and locality groups
  • 4. Sorted Map Datastore (logical view as “records”)
      Row key    Data
      cutting    info: { ‘height’: ‘9ft’, ‘state’: ‘CA’ }
                 roles: { ‘ASF’: ‘Director’, ‘Hadoop’: ‘Founder’ }
      tlipcon    info: { ‘height’: ‘5ft7’, ‘state’: ‘CA’ }
                 roles: { ‘Hadoop’: ‘Committer’@ts=2010, ‘Hadoop’: ‘PMC’@ts=2011, ‘Hive’: ‘Contributor’ }
      • The row key is an implicit PRIMARY KEY in RDBMS terms
      • Data is all byte[] in HBase
      • Different types of data are separated into different “column families”
      • Different rows may have different sets of columns (table is sparse)
      • A single cell might have different values at different timestamps
      • Useful for *-To-Many mappings
  • 5. Locality Groups
      • Different sets of columns may have different properties and access patterns
         • Perhaps a few columns are accessed all the time, whereas others are large and rarely needed
         • For example, a user’s metadata (1kb, accessed frequently) and their photo (1MB, cached by CDN and accessed rarely)
      • Put metadata in one locality group and photos in another
      • Locality groups stored separately on disk: access just the metadata without reading the photo
  • 6. Sorted Map Datastore (physical view as “cells”)
      info Column Family / Locality Group
      Row key    Column key      Timestamp        Cell value
      cutting    info:height     1273516197868    9ft
      cutting    info:state      1043871824184    CA
      tlipcon    info:height     1273878447049    5ft7
      tlipcon    info:state      1273616297446    CA

      roles Column Family / Locality Group
      Row key    Column key      Timestamp        Cell value
      cutting    roles:ASF       1273871823022    Director
      cutting    roles:Hadoop    1183746289103    Founder
      tlipcon    roles:Hadoop    1300062064923    PMC
      tlipcon    roles:Hadoop    1293388212294    Committer
      tlipcon    roles:Hive      1273616297446    Contributor

      • Sorted on disk by row key, column key, descending timestamp
      • Timestamps are milliseconds since the Unix epoch
  • 7. (image from Accumulo manual)
  • 8. Accumulo/HBase Terminology
      Accumulo                           HBase               Definition
      Tablet                             Region              A partition of a table (e.g. email inboxes starting with ‘a’-‘c’)
      TabletServer                       RegionServer        A server in the cluster which hosts a number of tablets/regions, providing read/write access
      Log/WAL                            HLog/WAL            Write-ahead log – used for durably logging edits
      Minor compaction                   Flush               Writing data from memory to disk
      Major compaction                   Minor compaction    Merging several on-disk files into a larger one
      Major compaction with all files    Major compaction    Merging all of the on-disk files into a larger one
  • 9. That’s all the intro we have time for…
      • Check out the excellent Accumulo manual at http://incubator.apache.org/accumulo
      • And the HBase manual at http://hbase.apache.org/book.html
      • Also some longer intro videos on Cloudera’s website, and an excellent O’Reilly book
  • 10. Commonalities (the non-controversial stuff)
      • Both systems scale well
         • Clusters with >1000 nodes, >1PB
         • Example HBase users: StumbleUpon, TrendMicro, Facebook, eBay, Flurry, ngmoco, Mozilla, Adobe, etc.
         • Example Accumulo users: ??????? (I don’t have clearance but I’m told they’re big and important)
      • Both systems perform well
         • Depending on tuning, one might beat the other at any given benchmark, but overall results seem comparable
      • Both open source with active development
  • 11. Commonalities (the non-controversial stuff)
      • Storage formats are very similar
         • Used to be the same, then diverged, then re-converged!
         • Multi-level BTrees, bloom filters, compression
         • Prefix compression currently missing in HBase, 95% complete for 0.94.0
      • Caching code very similar
         • Accumulo uses an older version of HBase’s LRUBlockCache
         • HBase has some recent improvements (off-heap cache), but I imagine Accumulo will grab them soon enough.
  • 12. General features
      • Both have good MapReduce integration
      • Both have a command-line shell
      • Both have a pretty good test suite
         • Accumulo used to be ahead here, but we traded off some ideas and use similar testing strategies now
      • Both use ZooKeeper for fault tolerant metadata storage, and support failover Masters
  • 13. Now for the fun part… BigTable shootout 2012
      • Warning: I am necessarily biased as an HBase committer.
      • I will be comparing the very latest versions
         • HBase 0.92.0 (released only 2 days ago!)
         • Accumulo 1.4 (not yet released, due out mid Feb?)
      • Please feel free to loudly disagree after the talk during the time allotted for questions – I am happy to be proven wrong! I’ll invite Aaron Cordova and John Vines up to help answer questions.
  • 14. Differences – Active contributors and users (plus various contractors thereof) (I ran out of space)
  • 15. Differences – User Mailing list activity
      • 500-600 messages per month (peak 1088)
      • 50-100 messages per month (peak 105) [but it’s new at Apache]
      • Winner:
  • 16. Differences – Access Control
      • Accumulo has per-cell visibility labels as well as table ACLs
         • Each cell has an ACL specifying which users may see it (e.g. (TS|(SECRET&PROJECTX)))
         • Users who don’t have access can’t tell the cell even exists
         • Very useful for classified information!
      • HBase has column family ACLs but no built-in per-cell visibility support
         • Some early work to add visibility labels, but not done yet
      • Winner:
  • 17. Differences – Authentication
      • Accumulo has a built-in user database
         • Users are authenticated by username/password
         • Passed in plaintext over the wire
      • HBase optionally uses Kerberos
         • Central administration (e.g. via Active Directory)
         • Key-based secure credential exchange
         • Temporary delegation tokens are created for MR jobs, so even if a job’s data leaks, credentials are not compromised
         • Consistent with rest of Hadoop ecosystem
      • Winner:
  • 18. Differences – Locality Groups
      • HBase has a 1:1 correspondence of Column Families and Locality Groups
         • Moving columns from one locality group to another after data has been inserted is impossible
      • Accumulo has a proper distinction and allows online reassignment of column-to-locality-group mappings
      • Winner:
  • 19. Differences – extensibility frameworks
      • Accumulo has iterators
         • Allows custom processing to be inserted in the read path as well as into the table maintenance code. Provides neat features like automatic summary maintenance, for example.
      • HBase has coprocessors
         • Much more general framework that also subsumes triggers, stored procedures, and cluster management hooks (e.g. Access Control is an HBase coprocessor).
         • Generality has its cost: very difficult to do some things that are simple with iterators
         • Some iterator use cases can be done with HBase filters
      • I’ll call this one a tie
  • 20. Differences – Web UI and Monitoring
      • Winner:
  • 21. Differences – Write-ahead logging
      • HBase uses HDFS files as a WAL
         • Takes advantage of HDFS performance improvements as they are developed
         • Same trusted replication and checksumming schemes as HDFS
      • Accumulo has its own Logger implementation
         • Extra daemons to run
         • Does not leverage improvements in HDFS
         • Won’t re-replicate if loggers go down
      • Winner:
  • 22. Differences – Other features
      • Accumulo has a nice mock Accumulo implementation
         • Nice for testing user software
      • Accumulo supports isolated scans on super-wide rows
         • HBase supports wide rows but isolation properties are lost
      • Accumulo supports tablet merging
         • If tablets get too small, they’ll merge with neighbors
      • Accumulo supports table snapshotting/cloning
      • Other sundry features: logical clocks, RPC tracing, RPC wire compatibility, and more.
  • 23. Differences – Other features
      • HBase has RPM and Debian packages as part of Apache BigTop
         • Integrated (and integration-tested) with Hive, Pig, and others
      • HBase has commercial support available from Cloudera, as well as several vendors and other projects building on top (Lily, OMID, etc)
      • HBase has first-class support for REST clients and thin Thrift clients
      • HBase has inter-cluster wide-area replication
      • HBase has significantly more advanced bloom filters and other such optimizations (thanks Facebook!)
  • 24. Summary
      • Neither system is better!
         • One system may very well be better for your use case, or for the community you want to interact with
      • Over time, the feature sets are converging
         • RFile vs HFile v2, Security, Caching, Compaction policies, Iterators/Coprocessors
      • Now that both projects are in Apache, open dialogue, code sharing, and friendly competition will help make both projects better!
  • 25. Thanks!
      • Aaron Cordova and John Vines (Accumulo committers) will now join me for some discussion / questions
      • Email: todd@cloudera.com
      • Twitter: @tlipcon
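
The on-disk ordering from slide 6 (row key, column key, descending timestamp) and the compaction rows of the slide 8 terminology table can be illustrated with a small sketch. This is not HBase or Accumulo code: `Cell`, `sort_key`, and `compact` are hypothetical names chosen for illustration, Python strings stand in for byte arrays, and two in-memory lists stand in for sorted on-disk files.

```python
import heapq
from typing import NamedTuple


class Cell(NamedTuple):
    row: str
    column: str      # "family:qualifier"
    timestamp: int   # milliseconds since the Unix epoch
    value: str


def sort_key(cell):
    # Row key ascending, column key ascending, timestamp descending,
    # so the newest version of a cell comes first.
    return (cell.row, cell.column, -cell.timestamp)


def compact(*files):
    """Merge already-sorted 'files' into one sorted list (assumed toy model)."""
    return list(heapq.merge(*files, key=sort_key))


# Two sorted runs, as if each had been flushed from memory separately.
older_file = sorted([
    Cell("cutting", "roles:Hadoop", 1183746289103, "Founder"),
    Cell("tlipcon", "roles:Hadoop", 1293388212294, "Committer"),
    Cell("tlipcon", "roles:Hive", 1273616297446, "Contributor"),
], key=sort_key)
newer_file = sorted([
    Cell("cutting", "roles:ASF", 1273871823022, "Director"),
    Cell("tlipcon", "roles:Hadoop", 1300062064923, "PMC"),
], key=sort_key)

merged = compact(older_file, newer_file)
for cell in merged:
    print(cell.row, cell.column, cell.timestamp, cell.value)
```

Because every input is already sorted, the merge only ever compares the head of each run, which is why merging files is much cheaper than re-sorting all cells.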

Editor's notes

  1. Earlier, I said that HBase is a big sorted map. Here is an example of a table. The map key is (row key + column + timestamp). The value is the cell contents. The rows in the map are sorted by key. In this example, Row1 has 3 columns in the "info" column family. Row2 only has a single column. A column can also be empty. Each cell has a timestamp. By default, the timestamp is set to the current time (in milliseconds since the Unix epoch, January 1st 1970) when the cell is written. A client can specify a timestamp when inserting or retrieving data, and specify how many versions of each cell should be maintained. Data in HBase is non-typed; everything is an array of bytes. Rows are sorted lexicographically. This order is maintained on disk, so Row1 and Row2 can be read together in just one disk seek.
  2. Given that HBase stores a large sorted map, the API looks similar to a map. You can get or put individual rows, or scan a range of rows. There is also a very efficient way of incrementing a particular cell, which can be useful for maintaining high-performance counters or statistics. Lastly, it’s possible to write MapReduce jobs that analyze the data in HBase.
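
The map-like API described in these notes (put, get, scan over a row range, increment) can be sketched with a toy in-memory table. This is an illustrative model only, not the HBase client API: `ToySortedTable` and its method names are hypothetical, strings stand in for byte arrays, and the increment here is a plain read-modify-write rather than the atomic server-side operation a real cluster would provide.

```python
import bisect
import time


class ToySortedTable:
    """Toy sorted map keyed by (row, column, timestamp); assumed names throughout."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # key -> cell value

    def put(self, row, column, value, timestamp=None):
        # Default to the current time in milliseconds since the Unix epoch.
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        key = (row, column, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column, versions=1):
        # Newest versions come first because the timestamp is negated.
        start = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        found = []
        for key in self._keys[start:]:
            if key[:2] != (row, column) or len(found) == versions:
                break
            found.append(self._values[key])
        return found

    def scan(self, start_row, stop_row):
        # Yields cells with start_row <= row < stop_row, in sorted order.
        start = bisect.bisect_left(self._keys, (start_row,))
        for key in self._keys[start:]:
            row, column, neg_ts = key
            if row >= stop_row:
                break
            yield (row, column, -neg_ts, self._values[key])

    def increment(self, row, column, amount=1, timestamp=None):
        current = self.get(row, column)
        new_value = (current[0] if current else 0) + amount
        self.put(row, column, new_value, timestamp)
        return new_value


table = ToySortedTable()
table.put("tlipcon", "roles:Hadoop", "Committer", timestamp=1293388212294)
table.put("tlipcon", "roles:Hadoop", "PMC", timestamp=1300062064923)
table.put("cutting", "roles:ASF", "Director", timestamp=1273871823022)

print(table.get("tlipcon", "roles:Hadoop", versions=2))
print(list(table.scan("a", "d")))
```

Keeping the keys in one sorted list is what makes the range scan a single seek followed by a sequential read, which mirrors the point in note 1 about adjacent rows being read together.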