HBase
In Practice




Ravi Veeramachaneni
Topics
           Why HBase?
           HBase Use Cases – HBase @ NAVTEQ
                Design Considerations
                Hardware/Deployment Considerations
                Practical Tips (Tuning/Optimization)
                Wanted Features




Ravi Veeramachaneni                    HBase – In Practice   2
Hadoop Benefits
          • Stores (HDFS) and processes (MR) large amounts of data
          • Scales (100s and 1000s of nodes)
          • Inexpensive (no license cost, low cost hardware)
          • Fast (1TB sort in 62s, 1PB in 16.25h*)
          • Availability (failover built into the platform)
          • Data Recoverability (failure should not result in any data
            loss)
          • Replication (out-of-the-box 3-way replication and
            configurable)
          • Better Throughput (Time to read the whole dataset is more
                 important than latency in reading the first record)
          • Write once and read-many-times pattern
          • Works well with structured, unstructured or semi-structured
            data
              *YDN Blog: Jim Gray’s Benchmark @ http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/

But …
           Not so good or does not support
                      • Random access
                      • Updating the data and/or file (writes are always
                        at the EOF)
                      • Apps that require low latency access to data
                      • Does not support lots of small files
                      • Does not support multiple writers
                      • Not a solution for every Data problem



Featuring HBase
           HBase Scales (runs on top of Hadoop)
           HBase provides fast table scans for time ranges and
            fast key based lookups
           HBase stores null values for free
                      • Saves both disk space and disk IO time
           HBase supports unstructured/semi-structured data through
            column families
           HBase has built-in version management
           Map Reduce data input
                      • Tables are sorted and have unique keys
                          • Reducer is often optional
                          • Combiner not needed
           Strong community support and wider adoption
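
The points above about sparse data, built-in versions, and sorted unique keys can be pictured with a toy in-memory model. This is only a sketch: `ToyTable` and its methods are hypothetical names, not part of any HBase client API, and real HBase persists cells in files on HDFS rather than in Python dicts.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Toy model of a table: row key -> column -> {timestamp: value}."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.rows = {}          # row key -> {column: {timestamp: value}}
        self.sorted_keys = []   # row keys kept in byte-lexicographic order

    def put(self, row, column, value, ts):
        if row not in self.rows:
            bisect.insort(self.sorted_keys, row)
            self.rows[row] = defaultdict(dict)
        versions = self.rows[row][column]
        versions[ts] = value
        # Keep only the newest max_versions cells for this column.
        for old in sorted(versions)[:-self.max_versions]:
            del versions[old]

    def get(self, row, column, ts=None):
        """Return the newest value at or before ts (latest if ts is None)."""
        versions = self.rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if ts is None or t <= ts]
        return versions[max(candidates)] if candidates else None

    def scan(self, start, stop):
        """Return row keys in [start, stop), relying on the sorted order."""
        lo = bisect.bisect_left(self.sorted_keys, start)
        hi = bisect.bisect_left(self.sorted_keys, stop)
        return self.sorted_keys[lo:hi]
```

A column that was never written simply has no entry, which is the sense in which a null costs nothing; a range scan is a slice of the sorted key list.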

HBase Use Cases
                To solve Big Data problems
                Sparse data (un- or semi-structured)
                Cost effectively Scalable
                Versioned data
                Some other features that may interest you
                         Linear distribution of data across the data nodes
                         Rows are stored in byte-lexicographic sorted order
                         Atomic Read/Write/Update
                         Data Access – Random, Sequential reads and writes
                         Automatic replication of Data for HA
           But not for every Data problem

NAVTEQ's Use Case
           Content is
                      –   Constantly growing (into the high TBs)
                      –   Sparse and unstructured
                      –   Provided in multiple data formats
                      –   Ingested, processed and delivered in transactional and batch mode


           Content Breadth
                      – 100s of millions of content records
                      – 100s of content suppliers + community input


           Content Depth
                      – On average, a content record has 120 attributes
                      – Certain types of content have more than 400 attributes
                      – Content classified across 270+ categories

Content Processing High-level Overview
           [Diagram. Recoverable labels: Batch and Transactional API; Bulk Content Sources; Customer and Community UGC; Merchant Data; Community, User and Merchant Media; Place ID from Place Registry; Location ID from Location Referencing; Source & Blended Record Management; Tiered Quality System; Publishing (real-time, on-demand; Bulk Content delivery; Search, and other mobile devices)]


HBase @ NAVTEQ
           Started in 2009, HBase 0.19.x (Apache)
                      • 8-node VMware Sandbox Cluster
                      • Flaky, unstable, RS failures
                      • Switched to CDH
           Early 2010, HBase 0.20.x (CDH2)
                      • 10-node Physical Sandbox Cluster
                      • Still had a lot of challenges: RS failures, META corruption
                      • Cluster expanded significantly with multiple environments
           Current (HBase 0.90.3)
                      • Moved to CDH3u1 official release
                      • Multiple teams/projects using the Hadoop/HBase implementation
                      • Working on Hive/HBase integration, Oozie, Lucene/Solr
                        integration, Cloudera Enterprise and a few others


Measured Business Value
           Scalability & Deployment
                      •   Spikes are handled by simply adding nodes
                      •   No code changes or deployment needed
                      •   From 15 to 30 to 60 nodes and more, as data grows
                      •   Deployments are well managed and controlled (from 12-16
                          hours down to < 2 hours)
           Speed to Market
                      • By supporting real-time transactions (instead of quarterly
                        updates)
                      • Batch updates are handled more efficiently (from days to
                        hours)
           Faster Supplier On-boarding
                      • Flexible and externally managed Business Rules
           Cheaper than the existing solution
                       <$2m vs. $12m (based on projected growth)
HBase & Zookeeper
           ZK – Distributed coordination service
                      • Coordinates messages sent across the network between nodes
                        (network fails, etc.)
           HBase depends on ZK and relies on it as the authority on cluster state

           HBase hosts key info on ZK
                      • Location of root catalog table
                      • Address of the current cluster master
                      • Bootstrapping a client connection to an HBase cluster

           Client connects to ZK quorum first
                      • To learn the location of -ROOT-
                      • Clients consult -ROOT- to find the location of the .META. region
                      • The client then looks up the .META. region to find the user-space
                        region hosting the row, and that region's location
                      • Clients cache all of the above for future lookups
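
The lookup chain above (ZK, then -ROOT-, then .META., then the hosting region server, with caching) can be sketched as a small simulation. Everything here is a stand-in: `ToyCluster`, `ToyClient`, the server names, and the per-row cache are hypothetical simplifications, not the HBase client's actual classes (a real client caches region locations, not individual rows).

```python
class ToyCluster:
    """Stand-in for ZK, -ROOT-, and .META.; counts remote lookups."""

    def __init__(self):
        self.zk = {"root-region-server": "rs1"}   # ZK knows where -ROOT- lives
        self.root = {".META.": "rs2"}             # -ROOT- knows where .META. lives
        # .META. maps each region's start key to its hosting server.
        self.meta = [(b"", "rs3"), (b"m", "rs4")]
        self.remote_calls = 0

    def ask_zk(self):
        self.remote_calls += 1
        return self.zk["root-region-server"]

    def ask_root(self, root_server):
        self.remote_calls += 1
        return self.root[".META."]

    def ask_meta(self, meta_server, row):
        self.remote_calls += 1
        # The hosting region is the last one whose start key <= row.
        server = None
        for start, region_server in self.meta:
            if start <= row:
                server = region_server
        return server


class ToyClient:
    def __init__(self, cluster):
        self.cluster = cluster
        self.meta_server = None   # cached location of .META.
        self.region_cache = {}    # cached row -> region server

    def locate(self, row):
        if row in self.region_cache:
            return self.region_cache[row]
        if self.meta_server is None:
            root_server = self.cluster.ask_zk()
            self.meta_server = self.cluster.ask_root(root_server)
        server = self.cluster.ask_meta(self.meta_server, row)
        self.region_cache[row] = server
        return server
```

In this model the first lookup costs three remote calls, a repeat lookup costs none, and a lookup for a new row costs only the .META. call.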

Design Considerations
           Database/schema design
                      • Transition to Column-oriented or flat schema
           Understand your access pattern
           Row-key design/implementation
                      • Sequential keys
                          • Load is poorly distributed (hot regions), but the block cache is used well
                          • Can be addressed by pre-splitting the regions
                      • Randomize keys to get better distribution
                          • Achieved through hashing on Key Attributes – SHA1 or MD5
                          • Range scans suffer
           Too many Column Families (NOT Good)
                      • Initially we had about 30 or so, now reduced to 8
           Compression
                      • LZO or Snappy (20% better than LZO) – Block (default)
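
The row-key trade-off above can be sketched in a few lines. This is an illustration under assumptions, not the scheme NAVTEQ used: the slides only say keys were hashed with SHA1 or MD5, so the choice to prefix the natural key with a short SHA-1 digest, and the helper names `hashed_key`, `split_points`, and `region_for`, are hypothetical.

```python
import bisect
import hashlib


def hashed_key(natural_key, prefix_len=4):
    """Prefix the natural key with the first hex digits of its SHA-1 digest."""
    digest = hashlib.sha1(natural_key.encode("utf-8")).hexdigest()
    return (digest[:prefix_len] + "-" + natural_key).encode("utf-8")


def split_points(num_regions):
    """Evenly spaced two-hex-digit split keys (assumes num_regions divides 256)."""
    step = 256 // num_regions
    return [format(i * step, "02x").encode("ascii") for i in range(1, num_regions)]


def region_for(row_key, splits):
    """Index of the pre-split region whose [start, end) range holds row_key."""
    return bisect.bisect_right(splits, row_key[:2])
```

Sequential natural keys such as `place-000001`, `place-000002` land in different regions once hashed, which spreads write load; the cost is that neighbouring natural keys are no longer adjacent, so a range scan over them has to touch every region.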

Design Considerations
           Serialization
                      • Avro didn’t work well – deserialization issue
                      • Developed configurable serialization mechanism that uses JSON
                        except Date type
           Secondary Indexes
                      • We were using ITHBase and IHBase from contrib – they didn’t work well
                      • Redesigned schema without need for index
                      • We still need it though
           Performance
                      • Several tunable parameters
                          • Hadoop, HBase, OS, JVM, Networking, Hardware
           Scalability
                      • Interfacing with real-time (interactive) systems from batch oriented
                        system


Hadoop/HBase Processes




Hardware/Deployment Considerations
           Hardware (Hadoop+HBase)
                      • Data Node - 24GB RAM, 8 Cores, 4x1TB (64GB, 24 Cores, 8x2TB)
                      • 6 mappers and 6 reducers per node (16 mappers, 4 reducers)
                      • Memory allocation by process
                          •   Data Node – 1GB (2GB)
                          •   Task Tracker – 1GB (2GB)
                          •   Map Tasks – 6x1GB (16x1.5GB)
                          •   Reduce Tasks – 6x1GB (4x1.5GB)
                          •   Region Server – 8GB (24GB)
                          •   Total Allocation: 24GB (64GB)
           Deployment
                      • Do not run ZK instances on DN, have a separate ZK quorum (3
                        minimum)
                      • Do not run HMaster on NN
                      • Avoid SPOF for HMaster (run additional master(s))
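
As a quick check on the memory figures above, the per-process numbers can be summed for both the first and the parenthesized configuration. The function name is hypothetical, and the reading of the leftover as room for the operating system is an assumption; the slides do not say what the remainder is for.

```python
def allocated_gb(datanode, tasktracker, map_slots, reduce_slots, task_heap, regionserver):
    """Sum the per-process heap figures for one worker node, in GB."""
    return datanode + tasktracker + (map_slots + reduce_slots) * task_heap + regionserver


# First configuration: 24GB node, 6 mappers and 6 reducers at 1GB each.
first = allocated_gb(1, 1, 6, 6, 1.0, 8)
# Parenthesized configuration: 64GB node, 16 mappers and 4 reducers at 1.5GB each.
second = allocated_gb(2, 2, 16, 4, 1.5, 24)

print(first, 24 - first)     # allocated GB, GB left over on the 24GB node
print(second, 64 - second)   # allocated GB, GB left over on the 64GB node
```

The listed processes add up to 22GB and 58GB, so the stated totals of 24GB and 64GB match the node's physical RAM rather than the sum of the heaps.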


HBase Configuration/Tuning
           Configuring HBase
                      • Configuration is the key
                      • Many moving parts – typos and out-of-sync settings across nodes
                      • Operating System
                          • Number of open files (ulimit) to 32K or even higher (/etc/security/limits.conf)
                          • Lower vm.swappiness, or set it to 0
                      • HDFS
                          • Adjust block size based on the use case
                          • Increase xceivers to 2047 (dfs.datanode.max.xcievers; the property name is misspelled in Hadoop itself)
                          • Set socket timeout to 0 (dfs.datanode.socket.write.timeout)
                      • HBase
                          • Needs more memory
                          • No swapping – JVM hates it
                          • GC pauses could cause timeouts or RS failures (read article posted by
                            Todd Lipcon on avoiding full GC)


HBase Configuration/Tuning
           HBase
                      • Per-cluster
                          • Turn off the block cache if the hit ratio is low (hfile.block.cache.size, default
                            20%)
                      • Per-table
                          • MemStore flush Size (hbase.hregion.memstore.flush.size, default 64MB and
                            hbase.hregion.memstore.block.multiplier, default 2)
                          • Max File Size (hbase.hregion.max.filesize, default 256MB)
                      • Per-CF
                          • Compression
                          • Bloom Filter
                      • Per-RS
                          • Amount of heap in each RS to reserve for all MemStores
                            (hbase.regionserver.global.memstore.upperLimit, default 0.4)
                          • MemStore flush size
                          • Max file size
                      • Per-SF
                          • Maximum number of SFs per store to allow
                            (hbase.hstore.blockingStoreFiles, default 7)
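
The MemStore settings above interact through simple arithmetic, sketched here with the defaults quoted on this slide and the 8GB Region Server heap from the hardware slide. The helper names are hypothetical; the formulas reflect the usual reading of these properties (a region blocks updates once its MemStore reaches flush size times the block multiplier, and the upper limit is a fraction of the Region Server heap).

```python
MB = 1024 * 1024


def region_block_threshold(flush_size, block_multiplier):
    """MemStore size at which a single region starts blocking updates."""
    return flush_size * block_multiplier


def global_memstore_limit(heap_bytes, upper_limit):
    """Heap reserved for all MemStores on one Region Server."""
    return int(heap_bytes * upper_limit)


flush_size = 64 * MB                                   # hbase.hregion.memstore.flush.size
per_region = region_block_threshold(flush_size, 2)     # block.multiplier default 2
limit = global_memstore_limit(8 * 1024 * MB, 0.4)      # 8GB heap, upperLimit 0.4

print(per_region // MB)        # MB at which one region blocks writes
print(limit // flush_size)     # full MemStores that fit under the global limit
```

With these numbers a region blocks at 128MB, and roughly 51 regions can each hold a full 64MB MemStore before the global limit forces flushes, which is one reason the number of actively written regions per server matters.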

HBase Configuration/Tuning
                      • HBase
                          • Write (puts) optimization (Ryan Rawson HUG8 presentation – HBase
                            importing)
                               –   hbase.regionserver.global.memstore.upperLimit=0.3
                               –   hbase.regionserver.global.memstore.lowerLimit=0.15
                               –   hbase.regionserver.handler.count=256
                               –   hbase.hregion.memstore.block.multiplier=8
                               –   hbase.hstore.blockingStoreFiles=25
                          • Control number of store files (hbase.hregion.max.filesize)
           Security
                      • Still in flux, need robust RBAC
           Reliability
                      • Name Node is SPOF
                      • HBase is sensitive
                          • Region Server Failures
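
The write-optimization values listed above would normally live in hbase-site.xml. As a sketch, the snippet below renders them in the standard Hadoop configuration layout (`configuration` / `property` / `name` / `value`); the `to_site_xml` helper is a hypothetical convenience, and the values are the ones quoted on the slide, not general recommendations.

```python
import xml.etree.ElementTree as ET

# Values quoted on the slide for write-heavy (put) workloads.
WRITE_TUNING = {
    "hbase.regionserver.global.memstore.upperLimit": "0.3",
    "hbase.regionserver.global.memstore.lowerLimit": "0.15",
    "hbase.regionserver.handler.count": "256",
    "hbase.hregion.memstore.block.multiplier": "8",
    "hbase.hstore.blockingStoreFiles": "25",
}


def to_site_xml(props):
    """Render a dict of properties as a Hadoop-style configuration document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


print(to_site_xml(WRITE_TUNING))
```

Generating the file from one dictionary is one way to reduce the typo and out-of-sync risks mentioned earlier, since every node receives the same rendered output.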

Desired Features
           Better operational tools for using Hadoop and HBase
                      • Job management, backup, restore, user provisioning, general
                        administrative tasks, etc.
                Support for Secondary Indexes
                Full-text Indexes and Searching (Lucene/Solr integration?)
                HA support for Name Node
                Need Data Replication for HA & DR
                Security at Table, CF and Row level
                Good documentation (it’s getting better though) – now that
                 Lars's book is out



Thank you





Hadoop World 2011: Practical HBase - Ravi Veeramachaneni, Informatica

 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Mehr von Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Kürzlich hochgeladen

ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 

Kürzlich hochgeladen (20)

ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 

Hadoop World 2011: Practical HBase - Ravi Veeramachaneni, Informatica

  • 2. Topics
      • Why HBase?
      • HBase Use Cases – HBase @Navteq
          - Design Considerations
          - Hardware/Deployment Considerations
          - Practical Tips (Tuning/Optimization)
          - Wanted Features
  • 3. Hadoop Benefits
      • Stores (HDFS) and processes (MR) large amounts of data
      • Scales (100s and 1000s of nodes)
      • Inexpensive (no license cost, low-cost hardware)
      • Fast (1TB sort in 62s, 1PB in 16.25h*)
      • Availability (failover built into the platform)
      • Data recoverability (failure should not result in any data loss)
      • Replication (3-way replication out of the box, configurable)
      • Better throughput (time to read the whole dataset is more important than latency in reading the first record)
      • Write-once, read-many-times pattern
      • Works well with structured, unstructured or semi-structured data
      * YDN Blog: Jim Gray’s Benchmark @ http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/
  • 4. But …
      • Not so good at, or does not support:
          - Random access
          - Updating the data and/or file (writes are always at the EOF)
          - Apps that require low-latency access to data
          - Does not support lots of small files
          - Does not support multiple writers
          - Not a solution for every data problem
  • 5. Featuring HBase
      • HBase scales (runs on top of Hadoop)
      • HBase provides fast table scans for time ranges and fast key-based lookups
      • HBase stores null values for free
          - Saves both disk space and disk I/O time
      • HBase supports unstructured/semi-structured data through column families
      • HBase has built-in version management
      • MapReduce data input
          - Tables are sorted and have unique keys
          - Reducer is often optional
          - Combiner not needed
      • Strong community support and wider adoption
  • 6. HBase Use Cases
      • To solve Big Data problems
      • Sparse data (un- or semi-structured)
      • Cost-effectively scalable
      • Versioned data
      • Some other features that may interest you
      • Linear distribution of data across the data nodes
      • Rows are stored in byte-lexicographic sorted order
      • Atomic read/write/update
      • Data access: random and sequential reads and writes
      • Automatic replication of data for HA
      • But not for every data problem
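The byte-lexicographic ordering mentioned on this slide has a practical consequence: numeric identifiers used as row keys sort as strings unless they are padded. The following is a minimal Python sketch of that ordering only, not HBase client code; the key values and the `padded` helper are made up for illustration.

```python
# Row keys compare as raw bytes, so Python's bytes ordering mirrors the idea.
keys = [b"row10", b"row2", b"row1", b"row-9"]
sorted_keys = sorted(keys)  # byte-by-byte comparison, shorter prefix first


def padded(n, width=10):
    """Hypothetical helper: zero-pad a numeric id so byte order matches numeric order."""
    return str(n).zfill(width).encode("ascii")


ids = [10, 2, 1, 100]
naive = sorted(str(n).encode("ascii") for n in ids)   # string order, not numeric
fixed = sorted(padded(n) for n in ids)                # numeric order preserved

if __name__ == "__main__":
    print(sorted_keys)
    print(naive)
    print(fixed)
```

A scan over a key range walks rows in exactly this order, which is why key layout decides whether related rows end up next to each other.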
  • 7. Navteq’s Use Case
      • Content is
          - Constantly growing (in the higher TB range)
          - Sparse and unstructured
          - Provided in multiple data formats
          - Ingested, processed and delivered in transactional and batch mode
      • Content breadth
          - 100s of millions of content records
          - 100s of content suppliers + community input
      • Content depth
          - On average, a content record has 120 attributes
          - Certain types of content have more than 400 attributes
          - Content classified across 270+ categories
  • 8. Content Processing High-level Overview
      [Diagram. Labels: Batch and Transactional API; Bulk Content Sources; Customer and Community UGC; Community, User and Merchant Data; Merchant Media; Place ID from Place Registry; Location ID from Location Referencing; Source & Blended Record Management; Tiered Quality System; Publishing (real-time, on-demand delivery and bulk content delivery; Place ID and Location ID; search and other mobile devices)]
  • 9. HBase @ NAVTEQ
      • Started in 2009, HBase 0.19.x (Apache)
          - 8-node VMware sandbox cluster
          - Flaky, unstable, RS failures
          - Switched to CDH
      • Early 2010, HBase 0.20.x (CDH2)
          - 10-node physical sandbox cluster
          - Still had a lot of challenges: RS failures, META corruption
          - Cluster expanded significantly with multiple environments
      • Current (HBase 0.90.3)
          - Moved to CDH3u1 official release
          - Multiple teams/projects using the Hadoop/HBase implementation
          - Working on Hive/HBase integration, Oozie, Lucene/Solr integration, Cloudera Enterprise and a few others
  • 10. Measured Business Value
      • Scalability & deployment
          - Spikes are handled by simply adding nodes
          - No code changes or deployment needed
          - From 15 to 30 to 60 nodes and more, as data grows
          - Deployments are well managed and controlled (from 12-16 hours to < 2 hours)
      • Speed to market
          - By supporting real-time transactions (instead of quarterly updates)
          - Batch updates are handled more efficiently (from days to hours)
      • Faster supplier on-boarding
          - Flexible and externally managed business rules
      • Cheaper than the existing solution
      • <$2m vs. $12m (based on projected growth)
  • 11. HBase & ZooKeeper
      • ZK: distributed coordination service
          - Coordinates messages sent across the network between nodes (network failures, etc.)
      • HBase depends on ZK and relies on it to manage cluster state
      • HBase hosts key info on ZK
          - Location of the root catalog table
          - Address of the current cluster master
          - Bootstrapping a client connection to an HBase cluster
      • Client connects to the ZK quorum first
          - To learn the location of -ROOT-
          - Clients consult -ROOT- to find the location of the .META. region
          - The client then does a lookup against that .META. region to find the hosting user-space region and its location
          - Clients cache all of the above for future lookups
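The lookup chain on this slide (ZK quorum, then -ROOT-, then .META., then the user region, with caching) can be sketched as a toy simulation. This is not the HBase client API: the dictionaries, server addresses, and the `Client` class below are all hypothetical stand-ins, and caching the whole .META. listing is a simplification.

```python
import bisect

# Toy stand-ins for the real services; every name and address is made up.
ZOOKEEPER = {"root-region-server": "rs1:60020"}
ROOT_TABLE = {"rs1:60020": {".META.": "rs2:60020"}}        # -ROOT- points at .META.
# .META. lists (region start key, hosting server), sorted by start key.
META_TABLE = {"rs2:60020": [(b"", "rs3:60020"), (b"m", "rs4:60020")]}


class Client:
    def __init__(self):
        self.meta_rows = None   # cached region map, filled on first use
        self.bootstraps = 0     # how many times the full chain was walked

    def _bootstrap(self):
        self.bootstraps += 1
        root_server = ZOOKEEPER["root-region-server"]       # 1. ask ZK where -ROOT- is
        meta_server = ROOT_TABLE[root_server][".META."]     # 2. -ROOT- locates .META.
        self.meta_rows = META_TABLE[meta_server]            # 3. read region map from .META.

    def locate(self, row_key):
        """Return the server hosting the region whose key range contains row_key."""
        if self.meta_rows is None:                          # cache miss: walk the chain
            self._bootstrap()
        starts = [start for start, _ in self.meta_rows]
        idx = bisect.bisect_right(starts, row_key) - 1      # last start key <= row_key
        return self.meta_rows[idx][1]


if __name__ == "__main__":
    c = Client()
    print(c.locate(b"apple"), c.locate(b"zebra"), c.bootstraps)
```

The second call reuses the cached region map, so the chain through ZK and -ROOT- is walked only once.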
  • 12. Design Considerations
      • Database/schema design
          - Transition to a column-oriented or flat schema
      • Understand your access pattern
      • Row-key design/implementation
          - Sequential keys
              · Suffer from poor load distribution but make use of the block caches
              · Can be addressed by pre-splitting the regions
          - Randomize keys to get better distribution
              · Achieved through hashing on key attributes (SHA1 or MD5)
              · Range scans suffer
      • Too many column families (NOT good)
          - Initially we had about 30 or so, now reduced to 8
      • Compression
          - LZO or Snappy (20% better than LZO), block (default)
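The two row-key techniques on this slide, hashing key attributes and pre-splitting regions, can be sketched in a few lines of Python. The helper names `hashed_key` and `split_points`, the 4-character prefix, and the sample key are assumptions for illustration; the slide does not say how the keys were actually built.

```python
import hashlib


def hashed_key(natural_key, prefix_len=4):
    """Prefix the key with part of its MD5 digest so writes spread across regions.

    The natural key is kept as a suffix so the row can still be identified.
    """
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{natural_key}".encode("utf-8")


def split_points(num_regions):
    """Evenly spaced split keys over a two-hex-digit prefix space (00..ff)."""
    step = 256 // num_regions
    return [f"{i * step:02x}".encode("ascii") for i in range(1, num_regions)]


if __name__ == "__main__":
    print(hashed_key("place-000123"))
    print(split_points(4))
```

The trade-off named on the slide shows up directly: consecutive natural keys get unrelated prefixes, so a range scan over them no longer reads adjacent rows.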
  • 13. Design Considerations
      • Serialization
          - Avro didn’t work well (deserialization issue)
          - Developed a configurable serialization mechanism that uses JSON, except for the Date type
      • Secondary indexes
          - Were using ITHBase and IHBase from contrib; they did not work well
          - Redesigned the schema so that no index is needed
          - We still need it though
      • Performance
          - Several tunable parameters
          - Hadoop, HBase, OS, JVM, networking, hardware
      • Scalability
          - Interfacing with real-time (interactive) systems from a batch-oriented system
  • 14. Hadoop/HBase Processes
  • 15. Hardware/Deployment Considerations
      • Hardware (Hadoop + HBase)
          - Data Node: 24GB RAM, 8 cores, 4x1TB (64GB, 24 cores, 8x2TB)
          - 6 mappers and 6 reducers per node (16 mappers, 4 reducers)
          - Memory allocation by process
              · Data Node: 1GB (2GB)
              · Task Tracker: 1GB (2GB)
              · Map tasks: 6x1GB (16x1.5GB)
              · Reduce tasks: 6x1GB (4x1.5GB)
              · Region Server: 8GB (24GB)
              · Total allocation: 24GB (64GB)
      • Deployment
          - Do not run ZK instances on DN; have a separate ZK quorum (3 minimum)
          - Do not run HMaster on NN
          - Avoid SPOF for HMaster (run additional master(s))
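Adding up the per-process figures on this slide is a useful sanity check. The sketch below sums them for both node configurations (the plain and the parenthesized figures). The listed processes total less than the stated 24GB and 64GB; treating the remainder as headroom for the OS and other daemons is an assumption, since the slide does not say so. The `headroom` function is hypothetical.

```python
# Per-process memory in GB, copied from the slide.
small_node = {
    "DataNode": 1,
    "TaskTracker": 1,
    "Map tasks": 6 * 1,
    "Reduce tasks": 6 * 1,
    "RegionServer": 8,
}
large_node = {
    "DataNode": 2,
    "TaskTracker": 2,
    "Map tasks": 16 * 1.5,
    "Reduce tasks": 4 * 1.5,
    "RegionServer": 24,
}


def headroom(ram_gb, allocations):
    """Return (GB allocated to the listed processes, GB left over on the node)."""
    used = sum(allocations.values())
    return used, ram_gb - used


if __name__ == "__main__":
    print(headroom(24, small_node))
    print(headroom(64, large_node))
```

Running the same sum before changing mapper or reducer counts shows quickly whether a node would be oversubscribed and start swapping.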
  • 16. HBase Configuration/Tuning
      • Configuring HBase
          - Configuration is the key
          - Many moving parts (typos, settings out of sync)
          - Operating system
              · Raise the number of open files (ulimit) to 32K or even higher (/etc/security/limits.conf)
              · Lower vm.swappiness, or set it to 0
          - HDFS
              · Adjust block size based on the use case
              · Increase xceivers to 2047 (dfs.datanode.max.xcievers; the property name itself is misspelled in Hadoop releases of this era)
              · Set socket timeout to 0 (dfs.datanode.socket.write.timeout)
          - HBase
              · Needs more memory
              · No swapping – JVM hates it
              · GC pauses could cause timeouts or RS failures (read the article posted by Todd Lipcon on avoiding full GC)
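The operating-system settings on this slide are easy to verify on a node before starting a Region Server. A minimal sketch, assuming a Linux host: `check_os_settings` and `read_current` are hypothetical helpers, the 32768 threshold is the slide's "32K", and the zero-swappiness default is the stricter of the slide's two options.

```python
def check_os_settings(nofile_soft, swappiness, min_nofile=32768, max_swappiness=0):
    """Return a list of warnings for values that miss the slide's recommendations."""
    warnings = []
    if nofile_soft < min_nofile:
        warnings.append(f"open-file limit {nofile_soft} is below {min_nofile}")
    if swappiness > max_swappiness:
        warnings.append(f"vm.swappiness is {swappiness}, expected <= {max_swappiness}")
    return warnings


def read_current():
    """Read the live values; Linux-specific, so it is kept out of the pure check."""
    import resource  # Unix-only standard-library module

    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:          # treat "unlimited" as very large
        soft = 2**31 - 1
    with open("/proc/sys/vm/swappiness") as f:  # Linux-specific path
        swappiness = int(f.read().strip())
    return soft, swappiness


if __name__ == "__main__":
    for warning in check_os_settings(*read_current()):
        print("WARNING:", warning)
```

The check should run as the user that starts the Hadoop and HBase daemons, because limits from /etc/security/limits.conf are applied per user.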
  • 17. HBase Configuration/Tuning
      • HBase
          - Per-cluster
              · Turn off the block cache if the hit ratio is low (hfile.block.cache.size, default 20%)
          - Per-table
              · MemStore flush size (hbase.hregion.memstore.flush.size, default 64MB, and hbase.hregion.memstore.block.multiplier, default 2)
              · Max file size (hbase.hregion.max.filesize, default 256MB)
          - Per-CF
              · Compression
              · Bloom filter
          - Per-RS
              · Amount of heap in each RS to reserve for all MemStores (hbase.regionserver.global.memstore.upperLimit, default 0.4)
              · MemStore flush size
              · Max file size
          - Per-SF (store file)
              · Maximum number of SFs per store to allow (hbase.hstore.blockingStoreFiles, default 7)
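The MemStore settings on this slide interact, and a little arithmetic makes that concrete. The sketch below uses the slide's defaults (flush size 64MB, block multiplier 2, global upper limit 0.4). The 8GB heap is the Region Server figure from the hardware slide. `memstore_budget` is a hypothetical helper, and the region count is a rough upper bound that assumes every region's MemStore sits exactly at the flush size.

```python
def memstore_budget(heap_mb, upper_limit=0.4, flush_mb=64, block_multiplier=2):
    """Derive MemStore thresholds from the settings named on the slide."""
    global_mb = heap_mb * upper_limit
    return {
        # Heap reserved for all MemStores on one Region Server.
        "global_memstore_mb": global_mb,
        # A region blocks updates once its MemStore reaches flush size x multiplier.
        "block_updates_at_mb": flush_mb * block_multiplier,
        # How many regions could hold a full, unflushed MemStore at once.
        "regions_at_flush_size": int(global_mb // flush_mb),
    }


if __name__ == "__main__":
    print(memstore_budget(8 * 1024))
```

With these inputs, about 3.2GB of heap goes to MemStores, a region blocks writes at 128MB, and roughly 51 regions can sit at the flush size before the global limit forces flushes.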
  • 18. HBase Configuration/Tuning
      • HBase
          - Write (puts) optimization (Ryan Rawson HUG8 presentation – HBase importing)
              · hbase.regionserver.global.memstore.upperLimit=0.3
              · hbase.regionserver.global.memstore.lowerLimit=0.15
              · hbase.regionserver.handler.count=256
              · hbase.hregion.memstore.block.multiplier=8
              · hbase.hstore.blockingStoreFiles=25
          - Control number of store files (hbase.hregion.max.filesize)
      • Security
          - Still in flux, need robust RBAC
      • Reliability
          - Name Node is SPOF
          - HBase is sensitive
          - Region Server failures
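The write-optimization values on this slide are set as properties in `hbase-site.xml`. As a hedged sketch, the Python below renders them in the standard Hadoop `<configuration>`/`<property>` layout. The `render_site_xml` helper is hypothetical, and the values are the slide's, intended for a bulk-import workload rather than as general defaults.

```python
import xml.etree.ElementTree as ET

# Property names and values exactly as listed on the slide.
WRITE_TUNING = {
    "hbase.regionserver.global.memstore.upperLimit": "0.3",
    "hbase.regionserver.global.memstore.lowerLimit": "0.15",
    "hbase.regionserver.handler.count": "256",
    "hbase.hregion.memstore.block.multiplier": "8",
    "hbase.hstore.blockingStoreFiles": "25",
}


def render_site_xml(props):
    """Render a dict of properties as a Hadoop-style site XML string."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_site_xml(WRITE_TUNING))
```

Generating the file from one dictionary and pushing the same output to every node is one way to avoid the typos and out-of-sync settings that the earlier configuration slide warns about.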
  • 19. Desired Features
      • Better operational tools for using Hadoop and HBase
          - Job management, backup, restore, user provisioning, general administrative tasks, etc.
      • Support for secondary indexes
      • Full-text indexes and searching (Lucene/Solr integration?)
      • HA support for Name Node
      • Need data replication for HA & DR
      • Security at table, CF and row level
      • Good documentation (it’s getting better though, now that Lars’ book is out)
  • 20. Thank you