SlideShare ist ein Scribd-Unternehmen logo
1 von 34
November 2011 – Hadoop World NYC

Advanced HBase Schema Design
Lars George, Solutions Architect
Agenda

1    Intro to HBase Architecture
2    Schema Design
3    Examples
4    Wrap up




2                    ©2011 Cloudera, Inc. All Rights Reserved. Confidential.
                    Reproduction or redistribution without written permission is
                                            prohibited.
About Me

• Solutions Architect @ Cloudera
• Apache HBase & Whirr Committer
• Author of
  HBase – The Definitive Guide
• Working with HBase since end
  of 2007
• Organizer of the Munich OpenHUG
• Speaker at Conferences (Fosdem, Hadoop
  World)
Overview

• Schema design is vital
• Needs to be done eventually
  – Same for RDBMS
• Exposes architecture and implementation
  “features”
• Might be handled in storage layer
  – eg. MegaStore, Percolator
Configuration Layers   (aka “OSI for HBase”)
HBase Architecture
HBase Architecture
• HBase uses HDFS (or similar) as its reliable
  storage layer
  – Handles checksums, replication, failover
• Native Java API, Gateway for REST, Thrift,
  Avro
• Master manages cluster
• RegionServer manage data
• ZooKeeper is used the “neural network”
  – Crucial for HBase
  – Bootstraps and coordinates cluster
Auto Sharding
Distribution
Auto Sharding and Distribution

• Unit of scalability in HBase is the Region
• Sorted, contiguous range of rows
• Spread “randomly” across RegionServer
• Moved around for load balancing and
  failover
• Split automatically or manually to scale
  with growing data
• Capacity is solely a factor of cluster nodes
  vs. regions per node
Column Families
Storage Separation

• Column Families allow for separation of data
  – Used by Columnar Databases for fast analytical
    queries, but on column level only
  – Allows different or no compression depending on
    the content type
• Segregate information based on access
  pattern
• Data is stored in one or more storage file,
  called HFiles
Merge Reads
Bloom Filter

• Bloom Filters are generated when HFile is
  persisted
  – Stored at the end of each HFile
  – Loaded into memory
• Allows check on row or row+column level
• Can filter entire store files from reads
  – Useful when data is grouped
• Also useful when many misses are
  expected during reads (non existing keys)
Bloom Filter
Key Cardinality
Key Cardinality

• The best performance is gained from using
  row keys
• Time range bound reads can skip store files
  – So can Bloom Filters
• Selecting column families reduces the
  amount of data to be scanned
• Pure value based filtering is a full table scan
  – Filters often are too, but reduce network traffic
Fold, Store, and Shift
Fold, Store, and Shift

• Logical layout does not match physical
  one
• All values are stored with the full
  coordinates, including: Row Key, Column
  Family, Column Qualifier, and Timestamp
• Folds columns into “row per column”
• NULLs are cost free as nothing is stored
• Versions are multiple “rows” in folded table
Tall-Narrow vs. Flat-Wide Tables
• Rows do not split
  – Might end up with one row per region
• Same storage footprint
• Put more details into the row key
  – Sometimes dummy column only
  – Make use of partial key scans
• Tall with Scans, Wide with Gets
  – Atomicity only on row level
• Example: Large graphs, stored as adjacency
  matrix
Example: Mail Inbox

        <userId> : <colfam> : <messageId> : <timestamp> : <email-message>

12345   :   data   :   5fc38314-e290-ae5da5fc375d      :   1307097848   :   "Hi Lars, ..."
12345   :   data   :   725aae5f-d72e-f90f3f070419      :   1307099848   :   "Welcome, and ..."
12345   :   data   :   cc6775b3-f249-c6dd2b1a7467      :   1307101848   :   "To Whom It ..."
12345   :   data   :   dcbee495-6d5e-6ed48124632c      :   1307103848   :   "Hi, how are ..."


                                              or
12345-5fc38314-e290-ae5da5fc375d        :   data   :   :   1307097848   :   "Hi Lars, ..."
12345-725aae5f-d72e-f90f3f070419        :   data   :   :   1307099848   :   "Welcome, and ..."
12345-cc6775b3-f249-c6dd2b1a7467        :   data   :   :   1307101848   :   "To Whom It ..."
12345-dcbee495-6d5e-6ed48124632c        :   data   :   :   1307103848   :   "Hi, how are ..."


                              Same Storage Requirements
Partial Key Scans
Key                                          Description
<userId>                                     Scan over all
                                             messages for a given
                                             user ID
<userId>-<date>                              Scan over all
                                             messages on a given
                                             date for the given user
                                             ID
<userId>-<date>-<messageId>                  Scan over all parts of a
                                             message for a given
                                             user ID and date
<userId>-<date>-<messageId>-<attachmentId>   Scan over all
                                             attachments of a
                                             message for a given
                                             user ID and date
Sequential Keys
    <timestamp><more key>: {CF: {CQ: {TS : Val}}}

• Hotspotting on Regions: bad!
• Instead do one of the following:
  – Salting
     • Prefix <timestamp> with distributed value
     • Binning or bucketing rows across regions
  – Key field swap/promotion
     • Move <more key> before the timestamp (see
       OpenTSDB later)
  – Randomization
     • Move <timestamp> out of key
Key Design
Key Design

• Based on access pattern, either use
  sequential or random keys
• Often a combination of both is needed
  – Overcome architectural limitations
• Neither is necessarily bad
  – Use bulk import for sequential keys and reads
  – Random keys are good for random access
    patterns
Example: Facebook Insights

• > 20B Events per Day
• 1M Counter Updates per Second
  – 100 Nodes Cluster
  – 10K OPS per Node
• ”Like” button triggers AJAX request
• Event written to log file
• 30mins current for website owner
  Web ➜ Scribe ➜ Ptail ➜ Puma ➜ HBase
HBase Counters
• Store counters per Domain and per URL
  – Leverage HBase increment (atomic read-modify-
    write) feature
• Each row is one specific Domain or URL
• The columns are the counters for specific
  metrics
• Column families are used to group counters
  by time range
  – Set time-to-live on CF level to auto-expire
    counters by age to save space, e.g., 2 weeks on
    “Daily Counters” family
Key Design
• Reversed Domains
  – Examples: “com.cloudera.www”, “com.cloudera.blog”
  – Helps keeping pages per site close, as HBase efficiently
    scans blocks of sorted keys
• Domain Row Key =
MD5(Reversed Domain) +
  Reversed Domain
  – Leading MD5 hash spreads keys randomly across all regions
    for load balancing reasons
  – Only hashing the domain groups per site (and per
    subdomain if needed)
• URL Row Key =
MD5(Reversed Domain) + Reversed
  Domain + URL ID
  – Unique ID per URL already available, make use of it
Insights Schema
Example: OpenTSDB




• Metric Type, Tags are stored as IDs
• Periodically rolled up
Example: HBlobStore

• Store binary data in HBase
  – Documents, Videos, Audio, Images
• Choices:
  – Store data in one row
    • Rows are not splitable ➜ does not scale
  – Store data in chunks across rows
    • Allows to prefetch blocks
  – Store data in HDFS and HBase
    • Based on size defer to HDFS
Summary

• Design for Use-Case
    – Read, Write, or Both?
•   Avoid Hotspotting
•   Consider using IDs instead of full text
•   Leverage Column Family to HFile relation
•   Shift details to appropriate position
    – Composite Keys
    – Column Qualifiers
Summary (cont.)

• Schema design is a combination of
  – Designing the keys (row and column)
  – Segregate data into column families
  – Choose compression and block sizes
• Similar techniques are needed to scale most
  systems
  – Add indexes, partition data, consistent hashing
• Denormalization, Duplication, and Intelligent
  Keys (DDI)
Questions?




Email:     lars@cloudera.com
Twitter:   @larsgeorge

Weitere ähnliche Inhalte

Was ist angesagt?

Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Hadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignCloudera, Inc.
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesDataStax Academy
 
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersHBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersCloudera, Inc.
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performanceDataWorks Summit
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?DataWorks Summit
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 

Was ist angesagt? (20)

Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Hadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema Design
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersHBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 

Ähnlich wie Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera

Hbase schema design and sizing apache-con europe - nov 2012
Hbase schema design and sizing   apache-con europe - nov 2012Hbase schema design and sizing   apache-con europe - nov 2012
Hbase schema design and sizing apache-con europe - nov 2012Chris Huang
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practicelarsgeorge
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsKeeyong Han
 
HBase Advanced Schema Design - Berlin Buzzwords - June 2012
HBase Advanced Schema Design - Berlin Buzzwords - June 2012HBase Advanced Schema Design - Berlin Buzzwords - June 2012
HBase Advanced Schema Design - Berlin Buzzwords - June 2012larsgeorge
 
Dynamodb Presentation
Dynamodb PresentationDynamodb Presentation
Dynamodb Presentationadvaitdeo
 
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)Amazon Web Services
 
HBaseCon 2015: HBase @ Flipboard
HBaseCon 2015: HBase @ FlipboardHBaseCon 2015: HBase @ Flipboard
HBaseCon 2015: HBase @ FlipboardHBaseCon
 
HBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ FlipboardHBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ FlipboardMatthew Blair
 
Making Session Stores More Intelligent
Making Session Stores More IntelligentMaking Session Stores More Intelligent
Making Session Stores More IntelligentKyle Davis
 
Cassandra
CassandraCassandra
Cassandraexsuns
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiridatastack
 
Zing Database – Distributed Key-Value Database
Zing Database – Distributed Key-Value DatabaseZing Database – Distributed Key-Value Database
Zing Database – Distributed Key-Value Databasezingopen
 
Zing Database
Zing Database Zing Database
Zing Database Long Dao
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.pptvijayapraba1
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache KuduAndriy Zabavskyy
 

Ähnlich wie Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera (20)

Hbase schema design and sizing apache-con europe - nov 2012
Hbase schema design and sizing   apache-con europe - nov 2012Hbase schema design and sizing   apache-con europe - nov 2012
Hbase schema design and sizing apache-con europe - nov 2012
 
L6.sp17.pptx
L6.sp17.pptxL6.sp17.pptx
L6.sp17.pptx
 
No sql databases
No sql databasesNo sql databases
No sql databases
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practice
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data Analytics
 
HBase Advanced Schema Design - Berlin Buzzwords - June 2012
HBase Advanced Schema Design - Berlin Buzzwords - June 2012HBase Advanced Schema Design - Berlin Buzzwords - June 2012
HBase Advanced Schema Design - Berlin Buzzwords - June 2012
 
Dynamodb Presentation
Dynamodb PresentationDynamodb Presentation
Dynamodb Presentation
 
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
 
HBaseCon 2015: HBase @ Flipboard
HBaseCon 2015: HBase @ FlipboardHBaseCon 2015: HBase @ Flipboard
HBaseCon 2015: HBase @ Flipboard
 
HBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ FlipboardHBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ Flipboard
 
NoSql
NoSqlNoSql
NoSql
 
Making Session Stores More Intelligent
Making Session Stores More IntelligentMaking Session Stores More Intelligent
Making Session Stores More Intelligent
 
REDIS327
REDIS327REDIS327
REDIS327
 
Apache HBase Workshop
Apache HBase WorkshopApache HBase Workshop
Apache HBase Workshop
 
Cassandra
CassandraCassandra
Cassandra
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
 
Zing Database – Distributed Key-Value Database
Zing Database – Distributed Key-Value DatabaseZing Database – Distributed Key-Value Database
Zing Database – Distributed Key-Value Database
 
Zing Database
Zing Database Zing Database
Zing Database
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 

Mehr von Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Mehr von Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Kürzlich hochgeladen

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 

Kürzlich hochgeladen (20)

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 

Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera

  • 1. November 2011 – Hadoop World NYC Advanced HBase Schema Design Lars George, Solutions Architect
  • 2. Agenda 1 Intro to HBase Architecture 2 Schema Design 3 Examples 4 Wrap up 2 ©2011 Cloudera, Inc. All Rights Reserved. Confidential. Reproduction or redistribution without written permission is prohibited.
  • 3. About Me • Solutions Architect @ Cloudera • Apache HBase & Whirr Committer • Author of HBase – The Definitive Guide • Working with HBase since end of 2007 • Organizer of the Munich OpenHUG • Speaker at Conferences (Fosdem, Hadoop World)
  • 4. Overview • Schema design is vital • Needs to be done eventually – Same for RDBMS • Exposes architecture and implementation “features” • Might be handled in storage layer – eg. MegaStore, Percolator
  • 5. Configuration Layers (aka “OSI for HBase”)
  • 7. HBase Architecture • HBase uses HDFS (or similar) as its reliable storage layer – Handles checksums, replication, failover • Native Java API, Gateway for REST, Thrift, Avro • Master manages cluster • RegionServer manage data • ZooKeeper is used the “neural network” – Crucial for HBase – Bootstraps and coordinates cluster
  • 10. Auto Sharding and Distribution • Unit of scalability in HBase is the Region • Sorted, contiguous range of rows • Spread “randomly” across RegionServer • Moved around for load balancing and failover • Split automatically or manually to scale with growing data • Capacity is solely a factor of cluster nodes vs. regions per node
  • 12. Storage Separation • Column Families allow for separation of data – Used by Columnar Databases for fast analytical queries, but on column level only – Allows different or no compression depending on the content type • Segregate information based on access pattern • Data is stored in one or more storage file, called HFiles
  • 14. Bloom Filter • Bloom Filters are generated when HFile is persisted – Stored at the end of each HFile – Loaded into memory • Allows check on row or row+column level • Can filter entire store files from reads – Useful when data is grouped • Also useful when many misses are expected during reads (non existing keys)
  • 17. Key Cardinality • The best performance is gained from using row keys • Time range bound reads can skip store files – So can Bloom Filters • Selecting column families reduces the amount of data to be scanned • Pure value based filtering is a full table scan – Filters often are too, but reduce network traffic
  • 19. Fold, Store, and Shift • Logical layout does not match physical one • All values are stored with the full coordinates, including: Row Key, Column Family, Column Qualifier, and Timestamp • Folds columns into “row per column” • NULLs are cost free as nothing is stored • Versions are multiple “rows” in folded table
  • 20. Tall-Narrow vs. Flat-Wide Tables • Rows do not split – Might end up with one row per region • Same storage footprint • Put more details into the row key – Sometimes dummy column only – Make use of partial key scans • Tall with Scans, Wide with Gets – Atomicity only on row level • Example: Large graphs, stored as adjacency matrix
  • 21. Example: Mail Inbox <userId> : <colfam> : <messageId> : <timestamp> : <email-message> 12345 : data : 5fc38314-e290-ae5da5fc375d : 1307097848 : "Hi Lars, ..." 12345 : data : 725aae5f-d72e-f90f3f070419 : 1307099848 : "Welcome, and ..." 12345 : data : cc6775b3-f249-c6dd2b1a7467 : 1307101848 : "To Whom It ..." 12345 : data : dcbee495-6d5e-6ed48124632c : 1307103848 : "Hi, how are ..." or 12345-5fc38314-e290-ae5da5fc375d : data : : 1307097848 : "Hi Lars, ..." 12345-725aae5f-d72e-f90f3f070419 : data : : 1307099848 : "Welcome, and ..." 12345-cc6775b3-f249-c6dd2b1a7467 : data : : 1307101848 : "To Whom It ..." 12345-dcbee495-6d5e-6ed48124632c : data : : 1307103848 : "Hi, how are ..."  Same Storage Requirements
  • 22. Partial Key Scans Key Description <userId> Scan over all messages for a given user ID <userId>-<date> Scan over all messages on a given date for the given user ID <userId>-<date>-<messageId> Scan over all parts of a message for a given user ID and date <userId>-<date>-<messageId>-<attachmentId> Scan over all attachments of a message for a given user ID and date
  • 23. Sequential Keys <timestamp><more key>: {CF: {CQ: {TS : Val}}} • Hotspotting on Regions: bad! • Instead do one of the following: – Salting • Prefix <timestamp> with distributed value • Binning or bucketing rows across regions – Key field swap/promotion • Move <more key> before the timestamp (see OpenTSDB later) – Randomization • Move <timestamp> out of key
  • 25. Key Design • Based on access pattern, either use sequential or random keys • Often a combination of both is needed – Overcome architectural limitations • Neither is necessarily bad – Use bulk import for sequential keys and reads – Random keys are good for random access patterns
  • 26. Example: Facebook Insights • > 20B Events per Day • 1M Counter Updates per Second – 100 Nodes Cluster – 10K OPS per Node • ”Like” button triggers AJAX request • Event written to log file • 30mins current for website owner Web ➜ Scribe ➜ Ptail ➜ Puma ➜ HBase
  • 27. HBase Counters • Store counters per Domain and per URL – Leverage HBase increment (atomic read-modify- write) feature • Each row is one specific Domain or URL • The columns are the counters for specific metrics • Column families are used to group counters by time range – Set time-to-live on CF level to auto-expire counters by age to save space, e.g., 2 weeks on “Daily Counters” family
  • 28. Key Design • Reversed Domains – Examples: “com.cloudera.www”, “com.cloudera.blog” – Helps keeping pages per site close, as HBase efficiently scans blocks of sorted keys • Domain Row Key =
MD5(Reversed Domain) + Reversed Domain – Leading MD5 hash spreads keys randomly across all regions for load balancing reasons – Only hashing the domain groups per site (and per subdomain if needed) • URL Row Key =
MD5(Reversed Domain) + Reversed Domain + URL ID – Unique ID per URL already available, make use of it
  • 30. Example: OpenTSDB • Metric Type, Tags are stored as IDs • Periodically rolled up
  • 31. Example: HBlobStore • Store binary data in HBase – Documents, Videos, Audio, Images • Choices: – Store data in one row • Rows are not splitable ➜ does not scale – Store data in chunks across rows • Allows to prefetch blocks – Store data in HDFS and HBase • Based on size defer to HDFS
  • 32. Summary • Design for Use-Case – Read, Write, or Both? • Avoid Hotspotting • Consider using IDs instead of full text • Leverage Column Family to HFile relation • Shift details to appropriate position – Composite Keys – Column Qualifiers
  • 33. Summary (cont.) • Schema design is a combination of – Designing the keys (row and column) – Segregate data into column families – Choose compression and block sizes • Similar techniques are needed to scale most systems – Add indexes, partition data, consistent hashing • Denormalization, Duplication, and Intelligent Keys (DDI)
  • 34. Questions? Email: lars@cloudera.com Twitter: @larsgeorge