Copyright © 2013 Segel & Associates
All Rights Reserved.
HBase 101:
Schema Design Basics
Chicago area Hadoop User Group (C.H.U.G)
Monday, July 15, 13
Agenda
Introduction:
About the speaker
(blah blah blah)
What is Hadoop/HBase.
Why HBase?
RDBMS vs HBase
What HBase isn’t.
Schema Design
HBase Schema Components
Design walk through I
Design walk through II
Secondary Indexing
Summary / Questions
About the Speaker
Michael has been working with Hadoop related
technologies since 2009. Currently an Independent
Consultant working for Segel & Associates.
Likes long walks off of short piers. ;-)
Founding member of Big Data Anonymous
(It’s a real disease, people!)
A Quick Review of
Hadoop
Big data is becoming an increasingly important part of
every business. (1300+ members in CHUG!)
MapReduce is a distributed programming model that makes
it easier for developers to analyze massive datasets.
Hadoop is a distributed computing framework and has
many components such as MapReduce.
HDFS has historically been a “write once, read
many” (WORM) file system.
Implementing create-read-update-delete (CRUD) operations is a
challenge.
The Hadoop Ecosystem
HDFS - Distributed File System
MapReduce - A distributed framework for executing
work in parallel
Hive - A SQL like syntax with a meta store to allow
SQL manipulation of data stored on HDFS.
Pig - A top down scripting language to manipulate data
HBase - A NoSQL, non-sequential data store
What is HBase?
A NoSQL database
‘Column Oriented’ (At the storage layer)
Highly distributed
Highly scalable
A non-relational persistent object store...
What HBase Is Not
HBase is not...
A relational database
A system with transactional support
Built on a traditional, updatable file system
(updates are via cells)
A stand-alone system (but could be...)
The only NoSQL game in town
RDBMS vs HBase
RDBMS | HBase
ACID compliant | No ACID compliance
Sharding/Partitions | Distributed Regions
SQL | Key lookup/key range scans
Triggers/Stored Procedures | Coprocessors
Indexes (B+Tree, R-Tree) | No indexing
Highly Normalized | Denormalized
Primitive data types | Byte arrays
In-place updates | Cell versioning
Audience Participation:
Questions?
How many students have a strong RDBMS
background?
Data Modeling?
Use of a non relational engine like Pick?
(Revelation, U2, ... COBOL)?
Why HBase?
Handles unstructured or semi-structured data.
Handles enormous data volumes.
Flexible. Ad-hoc access as well as full or partial
table scans.
Cost-effective scalability.
Near linear scalability.
Part of the Hadoop ecosystem.
Let’s take a look at
SCHEMAS
Schema Design
HBase is not an RDBMS
No SQL language or syntax.
Joins in HBase are expensive
HBase can store complex types within a single
column.
Column Families are a factor in designs.
Schema Design is one of the more difficult things
when working with HBase.
What makes up HBase
Tables
Column Families
Columns
cells (versions)
(from a conceptual point of view)
(more on this later ... )
A PRACTICAL EXAMPLE...
Sales Receipt
Acme Supply You want it, we’ve got it!
Date 12/22/2012
Receipt # abc123
Acme Supply
310 Erie St.
Chicago, IL 60654
Phone (312) 555-12112
Fax (312) 555-1214
sales@acme.com
SOLD
TO
Wylie Coyote
P.O.Box 123
Rock River Canyon, AZ
Payment Method | Check No. | Job
Qty | Item # | Description | Unit Price | Discount | Line total
10 | 12345677 | 250 LB. Steel Anvils | $150.00 | 10% | $1350.00
Shipping $550.00
Total Discount
Subtotal $1900.00
Sales Tax
Total $1900.00
Thank you for your business!
Acme Supply
• Order Entry System
  • Order Entry
  • Pick Slip
  • Shipping
  • Invoice
RDBMS
[Figure: entity-relationship diagram of the normalized order entry schema. Tables and their columns:]
Customer: Customer_ID, First_Name, Last_Name, Company, email_addr
Cust_Phone: Customer_ID, Phone_ID, Phone_Type, Country_Code, Area_Code, Number, Extension
Address: Customer_ID, Address_ID, Adress_Type, Street 1, Street 2, City, State, ZIP, Ext_Zip
Invoice: Invoice_ID, Date, Customer_ID, Billing_Address, Ship_to_Address, Shipping, Taxes, SubTotal, Total
Invoice_Line: Invoice_ID, Line_ID, Line_number, product_id, qty, sale_unit_price
Product: product_id, sku_code, upc_code, description, manufacturer, unit_price, unit_cost
(The diagram also shows four unlabeled "Attribute" placeholders.)
These table structures (highlighted on the slide) contain the Detail
information in a Master/Detail relationship.
A FOR EACH loop is required to pull the data for an invoice.
RDBMS:
What did we see?
Highly normalized design
Lots of joins to get a single record.
Requires a FOR EACH loop to fetch
the detail records. (Or a join against
the master record.)
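The master/detail fetch described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the two tables are cut-down, hypothetical versions of Invoice and Invoice_Line, and the sample row is taken from the Acme receipt (10 anvils at $150.00 less 10%).

```python
import sqlite3

# Hypothetical, cut-down versions of the Invoice and Invoice_Line tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoice (invoice_id TEXT PRIMARY KEY, customer_id TEXT, total REAL);
    CREATE TABLE invoice_line (
        invoice_id TEXT, line_number INTEGER, product_id TEXT,
        qty INTEGER, sale_unit_price REAL
    );
    INSERT INTO invoice VALUES ('abc123', 'cust-001', 1900.00);
    INSERT INTO invoice_line VALUES ('abc123', 1, '12345677', 10, 135.00);
""")


def fetch_invoice(invoice_id):
    """Read the master row, then loop over its detail rows."""
    master = conn.execute(
        "SELECT invoice_id, customer_id, total FROM invoice WHERE invoice_id = ?",
        (invoice_id,),
    ).fetchone()
    lines = []
    # The "FOR EACH" step: one detail row per line item.
    for row in conn.execute(
        "SELECT line_number, product_id, qty, sale_unit_price "
        "FROM invoice_line WHERE invoice_id = ? ORDER BY line_number",
        (invoice_id,),
    ):
        lines.append(row)
    return master, lines


master, lines = fetch_invoice("abc123")
```

Customer, address and product data would each need a further join or query, which is the cost the slide is pointing at.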
Now Let’s Look at
HBase!
The defined schema contains the table name and column
families. Columns are not defined. (The number of versions
per column is!)
Data is stored in key-value pairs. The key is the primary
index. (There really is no index per se.)
The value is a set of one or more columns which contain
Java byte arrays.
Each table should be considered stand alone. The only
exception would be indices created on non-key fields.
(Secondary Indexes... More on this later...)
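As a rough mental model of the points above, a table can be pictured as a map from row key to cells, where only the column families and the version limit are declared up front. The class below is a toy stand-in written for this illustration, not the HBase client API; `ToyTable`, `put` and `get` are made-up names.

```python
import time


class ToyTable:
    """In-memory stand-in for an HBase table (illustrative only)."""

    def __init__(self, name, families, max_versions=3):
        self.name = name
        self.families = set(families)     # declared with the table
        self.max_versions = max_versions  # versions kept per cell
        self.rows = {}                    # row key -> {(family, qualifier): [(ts, value), ...]}

    def put(self, row_key, family, qualifier, value, ts=None):
        # Families must exist; qualifiers (columns) are created on first write.
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        ts = time.time_ns() if ts is None else ts
        versions = self.rows.setdefault(row_key, {}).setdefault((family, qualifier), [])
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest first
        del versions[self.max_versions:]                 # drop versions beyond the limit

    def get(self, row_key, family, qualifier):
        """Return the newest value of the cell, or None if it does not exist."""
        versions = self.rows.get(row_key, {}).get((family, qualifier))
        return versions[0][1] if versions else None


t = ToyTable("invoice", families=["hdr"], max_versions=2)
t.put(b"abc123", "hdr", "total", b"1900.00", ts=1)
t.put(b"abc123", "hdr", "total", b"1950.00", ts=2)
t.put(b"abc123", "hdr", "total", b"2000.00", ts=3)
latest = t.get(b"abc123", "hdr", "total")
```

Every value is a byte string, and an update adds a new version instead of overwriting in place.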
How to translate the model
to HBase:
Traditional Entity Relation Design tools are of limited use.
Each table should be atomic, with only weak relationships to
other tables.
A record should be completely contained in a single row.
Define your record based on the data present.
Encapsulate data by grouping related data within the
record.
Remember our Invoice?
Company
Company Address
Customer Name
Ship To: Address
Date(s)
Phone #(s)
Line Items:
SKU
Description
Qty
Unit Price
Total Price
SubTotal
Tax(s)
Total
The slide highlights a couple of groupings in this list that encapsulate data.
HBASE SCHEMA
Table: Invoice
invoice_id
company_name
inv_date
company_address
customer_shipTo_address
customer_billing_address
customer_phone
customer_name
customer_id
reference_docs
line_items
total
discount
taxes
sub_total
This is just an example.
There are other options!
Question(s) for the Audience:
Should this be a table?
What about a column family?
[Hint: Think of the entire system e.g. Order Entry... ]
LET’S TAKE A CLOSER LOOK
The invoice_id is our row key. This is a
unique ID specific to one and only one
invoice.
Bonus question:
Is this really a good row key?
These (the fields highlighted on the slide) represent structures which
encapsulate primitive data types.
Let’s Look at our Key
Fastest access is when accessing a single record based on the full key.
Scans are more expensive; full table scans are very expensive.
invoice_id is unique
How do we access invoices?
What are our use cases?
Why is it not a good key?
So what makes a good key?
Uniqueness
Easy access for known and predominant use cases
You need to know your data.
You need to know your access patterns.
Not in sort order: sequential numbers and sorted input are bad for
inserts and table splits. (Hot spots)
So how do we access
Orders/Invoices/etc...
By specific Invoice #
Pick Slips
Customer inquiries
By customer number/id Then Order/Invoice#
Customer inquiries
∴ The key should be: customer_id | invoice_id
? But what about a Timestamp?
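A composite key of this shape supports both access paths listed above. The sketch below keeps row keys in a sorted list to mimic how a table orders them; the `|` delimiter, the sample ids and the helper names `make_key` and `scan_prefix` are assumptions made for the illustration.

```python
import bisect

SEP = b"|"  # assumed delimiter; it must never appear inside customer_id


def make_key(customer_id: str, invoice_id: str) -> bytes:
    """Build the composite row key customer_id | invoice_id."""
    return customer_id.encode() + SEP + invoice_id.encode()


# Row keys kept in sorted byte order, as they would be in a table.
keys = sorted(
    make_key(c, i)
    for c, i in [
        ("cust-002", "inv-9"),
        ("cust-001", "inv-7"),
        ("cust-001", "inv-3"),
        ("cust-003", "inv-1"),
    ]
)


def scan_prefix(sorted_keys, prefix: bytes):
    """Partial-key scan: return every key that starts with prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    # Stop row: the prefix with its last byte incremented (assumes that byte is < 0xFF).
    stop_key = prefix[:-1] + bytes([prefix[-1] + 1])
    stop = bisect.bisect_left(sorted_keys, stop_key)
    return sorted_keys[start:stop]


customer_invoices = scan_prefix(keys, b"cust-001" + SEP)
```

All invoices for one customer come back from a single short range scan, and a specific invoice is a direct lookup on the full key. The trade-off is that a lookup by invoice number alone needs the customer id as well.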
SEQUENTIAL KEYS AND TABLE SPLITS
Pre-Split
To become Region A:
Row ID | Column Data
1 | something
2 | Cat
3 | Dog
4 | word
7 | silly
11 | newspaper
17 | git
27 | snarf
43 | pink
68 | brain
107 | takeover
169 | Something else
To become Region B:
Row ID | Column Data
267 | More Data
421 | And some more data.
665 | Snark
1050 | Lewis Carroll
1657 | Alf
2616 | POJO
4129 | Busted
6517 | Some more data
10286 | Time to Split
New Data (arrives after the last row)
Just before the table splits we can
see how the data is going to be
split into regions.
TABLE SPLITS
Post-Split
Region A:
Row ID | Column Data
1 | something
2 | Cat
3 | Dog
4 | word
7 | silly
11 | newspaper
17 | git
27 | snarf
43 | pink
68 | brain
107 | takeover
169 | Something else
Region B:
Row ID | Column Data
267 | More Data
421 | And some more data.
665 | Snark
1050 | Lewis Carroll
1657 | Alf
2616 | POJO
4129 | Busted
6517 | Some more data
10286 | Time to Split
New Data (appended to Region B):
16236 | Some new data
25628 | Whinnie
40452 | Pooh
63850 | Davincci
100784 | Code
159081 | Tigger is Awsome
As we can see, the region splits into A and B. As new rows
are added, they are added to the right. Region B will grow
until it splits and then...
Putting this in Perspective
When the region splits, it will split in half.
All of the new writes will continue to be appended to
the right side of the youngest region.
Then that region will split.
Over time you will have many regions that each contain
approximately 1/2 of the region’s max file size.
In the end, all inserts will ‘hot spot’ to the last
region, and each time it splits it leaves two half-filled regions.
So you have two bad side effects: ‘hot spotting’ and
lots of half-filled regions.
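The effect can be reproduced with a small simulation. It is a simplified model rather than HBase's actual split policy: a region is a sorted list of keys, `MAX_ROWS` stands in for the maximum file size, and a region splits in half as soon as it exceeds that limit.

```python
import bisect

MAX_ROWS = 100  # stand-in for the region's maximum file size


def load(keys):
    """Insert keys in the given order, splitting a region in half when it exceeds MAX_ROWS.

    Returns (rows per region, list of flags telling whether each write hit the last region).
    """
    regions = [[]]   # each region is a sorted list of row keys
    hit_last = []
    for key in keys:
        starts = [r[0] for r in regions[1:]]      # start key of every region after the first
        idx = bisect.bisect_right(starts, key)    # region whose key range covers this key
        bisect.insort(regions[idx], key)
        hit_last.append(idx == len(regions) - 1)
        if len(regions[idx]) > MAX_ROWS:
            mid = len(regions[idx]) // 2
            regions[idx:idx + 1] = [regions[idx][:mid], regions[idx][mid:]]
    return [len(r) for r in regions], hit_last


sequential = [f"{i:08d}".encode() for i in range(1000)]
sizes, hit_last = load(sequential)
```

With sequential keys every write lands in the last region, and every region left behind by a split stays at half of `MAX_ROWS` because nothing is ever written to it again.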
Alternatives:
Hashing the row key...
If we hash the row key using a SHA-1 or MD5 hash, the keys
will be inserted in a ‘random’ order.
No regional ‘hotspotting’.
In order to fetch rows efficiently, you need to know your entire key,
hash it, then use get() to fetch the specific row.
Hashing works great if you know your entire row key.
Hashing kills if you want to do partial key scans.
Some truncate the hash and prepend it to the key to get the
desired distribution and also guarantee uniqueness. (This is different
from using a salt.)
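Both variants can be sketched with Python's hashlib. The function names and the 4-byte prefix length are choices made for this illustration; SHA-1 is used here, and MD5 would work the same way.

```python
import hashlib


def hashed_key(row_key: str) -> bytes:
    """Full digest as the row key: even spread, but the original sort order is lost."""
    return hashlib.sha1(row_key.encode()).digest()


def prefixed_key(row_key: str, prefix_len: int = 4) -> bytes:
    """Truncated hash prepended to the original key: spread plus guaranteed uniqueness."""
    return hashlib.sha1(row_key.encode()).digest()[:prefix_len] + row_key.encode()


sequential = [f"{i:08d}" for i in range(1000)]
stored = sorted(prefixed_key(k) for k in sequential)

# Which quarter of the key space (by the top two bits of the first byte) each row lands in.
quarters = {k[0] >> 6 for k in stored}

# The order in which the original keys now sit in the table.
stored_order = [k[4:].decode() for k in stored]
```

A get() still works because the prefix can be recomputed from the full original key. A scan over a range of original keys does not, since neighbouring keys are no longer stored next to each other.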
Keep to a low Salt diet!
http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
The concept is to use a ‘prefix-salt’ and ‘round robin’ the inserts.
While it solves the issue of a single region hotspotting,
it has some nasty side effects:
Complicates scan()
Complicates get()
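The side effects show up as soon as the read path is written down. The sketch below follows the round-robin description on this slide and is not the API of the library in the linked post; the bucket count and the helper names are assumptions.

```python
import itertools

N_BUCKETS = 4  # assumed number of salt buckets

_counter = itertools.count()


def salted_key(row_key: bytes) -> bytes:
    """Round-robin salt: the prefix depends on insert order, not on the key itself."""
    bucket = next(_counter) % N_BUCKETS
    return bytes([bucket]) + row_key


table = {}
for i in range(8):
    table[salted_key(f"{i:08d}".encode())] = b"row %d" % i


def get(row_key: bytes):
    """The bucket is unknown at read time, so every bucket has to be probed."""
    for bucket in range(N_BUCKETS):
        value = table.get(bytes([bucket]) + row_key)
        if value is not None:
            return value
    return None


def scan(start: bytes, stop: bytes):
    """One range scan per bucket, merged back into original key order."""
    hits = []
    for bucket in range(N_BUCKETS):
        lo, hi = bytes([bucket]) + start, bytes([bucket]) + stop
        hits += [(k[1:], v) for k, v in table.items() if lo <= k < hi]
    return sorted(hits)


found = get(b"00000005")
in_range = [k for k, _ in scan(b"00000002", b"00000005")]
```

A single get() turns into up to `N_BUCKETS` lookups, and every scan fans out into `N_BUCKETS` scans followed by a merge.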
Key Design Summary
Think about your data.
Think about how you access the data and
what should be in the key.
Avoid sequential keys if possible, understand
the issues with hashing the key.
Keys are sorted in Byte[] order
Keep it simple. (KISS)
There are always alternatives so YMMV.
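The byte-order point is easy to trip over with numeric ids. A small sketch, using a few row ids from the split example and two assumed encodings (decimal text and a fixed-width big-endian integer):

```python
import struct

ids = [1, 2, 11, 107, 1050, 10286]

# Decimal text sorts byte by byte, not numerically.
as_text = sorted(str(i).encode() for i in ids)
# → [b'1', b'10286', b'1050', b'107', b'11', b'2']

# Fixed-width big-endian encoding keeps byte order equal to numeric order.
as_fixed = sorted(struct.pack(">Q", i) for i in ids)
decoded = [struct.unpack(">Q", b)[0] for b in as_fixed]
```

Zero-padded text of a fixed width would preserve numeric order as well; which encoding to pick depends on what else goes into the key.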
Column Families
The Good, The Bad and the Ugly
The Good: Column families allow you to partially
segregate related data with the same key.
The Bad: All actions on a table occur on all
column families at the same time. The more
column families, the longer splits and
compactions take. Rule of thumb: no more than
3-5 column families per table.
The Ugly: The Bad gets worse when you
consider that when you split a region of a table,
all of the Column Families’ regions are split too.
This can lead to lots of small regions.
Column Family
Use Case
Customer Orders
Order Entry
Pick Slips
Shipping Slips
Invoices
All column families use the same row key.
Each column family contains unique data, specific to a step
in the order process.
Denormalize the data. Repeat information as needed.
Column family rows are roughly the same size.
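One way to picture this use case is a single row whose families map to the steps of the order process. The family names, qualifiers and values below are made up for the illustration, and nested dictionaries stand in for the table.

```python
FAMILIES = ("order", "pick", "ship", "invoice")  # hypothetical family names

row_key = b"cust-001|inv-3"
table = {row_key: {family: {} for family in FAMILIES}}


def put(key, family, qualifier, value):
    """Write one cell into the given family of the row."""
    table[key][family][qualifier] = value


# Each step writes its own family; shared facts are repeated, not joined.
put(row_key, "order", "customer_name", b"Wylie Coyote")
put(row_key, "order", "line_items", b"10 x 12345677")
put(row_key, "ship", "customer_name", b"Wylie Coyote")
put(row_key, "ship", "ship_to", b"P.O.Box 123, Rock River Canyon, AZ")


def read_family(key, family):
    """A reader for one step only touches that step's family."""
    return table[key][family]


shipping_view = read_family(row_key, "ship")
```

The customer name is deliberately stored twice so that the shipping step never has to read the order family.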
Schema Summary
Review
Consider column families, but use sparingly.
Focus on data access patterns.
The key to success is in the key itself.
Secondary indexing is always an option. (We
will talk about this in the next section.)
Advanced
Schema Design
Complex Data Types
Columns are Byte[]
It’s now possible to think of storing data in 3
dimensions.
Row x Column (2D)
Row x Column x Structured Blob (3D)
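The third dimension is simply a structure serialized into the bytes of one cell. In the sketch below JSON stands in for the serializer to keep the example dependency-free; the column names come from the invoice schema, and the helper names are made up.

```python
import json


def to_cell(value) -> bytes:
    """Serialize a structure into the bytes that a single column would hold."""
    return json.dumps(value, sort_keys=True).encode("utf-8")


def from_cell(cell: bytes):
    """Decode a structured cell back into Python objects."""
    return json.loads(cell.decode("utf-8"))


row = {
    b"company_name": b"Acme Supply",             # plain cell: Row x Column
    b"customer_shipTo_address": to_cell({        # structured cell: a record inside one column
        "street1": "P.O.Box 123",
        "street2": None,
        "city": "Rock River Canyon",
        "state": "AZ",
    }),
    b"line_items": to_cell([                     # a repeating group inside one column
        {"sku": "12345677", "description": "250 LB. Steel Anvils", "qty": 10},
    ]),
}

city = from_cell(row[b"customer_shipTo_address"])["city"]
first_item = from_cell(row[b"line_items"])[0]
```

The whole address or the whole list of line items travels with the row, so reading the invoice never needs a second table.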
Storing Structure
Options
Byte[] are essentially blobs. Almost anything goes.
Types of Objects
Strings (String.getBytes())
Custom Java Structures
Avro
Java Serialization
Custom
toString().getBytes()
Avro
Avro
We Like Avro!
Avro Makes Life Easier
Avro is a data serializer and RPC system
Created by Doug Cutting (also of Hadoop fame)
Language independent (Java, C, C++, C#, others) APIs
Schema based, defined with JSON
Supports dynamic typing
Untagged data results in smaller serialized size.
No manually assigned field IDs. When schemas change, old and new schema are present.
Avro can serialize to both a binary and JSON format
Splittable and compressible
Avro is both a service and a class library
Focus on Java APIs
http://avro.apache.org/docs/current/api/java/index.html
Avro relies on UTF-8 for Strings
Schemas
Avro Schema for the Address field
address-rec.avpr
{
    "namespace": "com.CHUG",
    "name": "AddressRecord",
    "type": "record",
    "fields": [
        {"name": "street1", "type": "string", "comment": "First street address."},
        {"name": "street2", "type": ["null", "string"], "comment": "Second street address."},
        {"name": "city", "type": "string"},
        {"name": "state", "type": "string"},
        {"name": "zip", "type": "string"}
    ]
}
This is just one example.
Secondary
Indexing
Adding an Index
HBase does not natively support indices.
Which type of index to implement is up to you:
Inverted Table
Lucene
Solr
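Of these options, the inverted table is the simplest to sketch: a second table whose row key starts with the indexed value and ends with the base table's row key. The dictionaries, the `\x00` separator, the sample rows and the function names below are assumptions made for the illustration.

```python
base = {}    # base table: row key -> columns
index = {}   # inverted table: indexed value + separator + base row key -> empty marker


def put_invoice(row_key: bytes, columns: dict):
    """Write the base row, then the matching index row."""
    base[row_key] = columns
    # Appending the base row key keeps index keys unique when values repeat.
    index[columns[b"customer_name"] + b"\x00" + row_key] = b""


def find_by_customer_name(name: bytes):
    """Scan the index by value prefix, then fetch each base row by its key."""
    prefix = name + b"\x00"
    row_keys = sorted(k[len(prefix):] for k in index if k.startswith(prefix))
    return [base[k] for k in row_keys]


put_invoice(b"cust-001|inv-3", {b"customer_name": b"Wylie Coyote", b"total": b"1900.00"})
put_invoice(b"cust-001|inv-7", {b"customer_name": b"Wylie Coyote", b"total": b"250.00"})
put_invoice(b"cust-002|inv-9", {b"customer_name": b"Another Customer", b"total": b"75.00"})

matches = find_by_customer_name(b"Wylie Coyote")
```

The two writes in `put_invoice` are separate operations, so keeping them consistent (including removing the old index row when an indexed value changes) is left to the application.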
Indexing Issues
HBase writes are atomic.
Index maintenance has to be synchronized with the
base table.
Co-Processors are still not fully baked.
Writes to the index will incur additional costs. (The index
region will most likely not be on the same RegionServer (RS) as the
base table’s region.)
All code will be custom. You are on your own.
YMMV in terms of Performance.
So What have we learned?
Row key design is everything.
Denormalizing data is critical.
HBase is not relational, so to keep your sanity, forget
your relational modeling techniques.
Avro is a powerful feature you can use to add another
dimension to your schema.
Secondary indexing is possible!
Know your Data!
Some Key Take Aways
Pearls of Wisdom
There is more than one way to skin a cat, so you need to take what was
presented with a grain of Kosher salt. (The grains are bigger.)
You need to experiment on your own. Reading someone’s blog or slide deck is no
replacement for hands-on experience.
Just because Facebook does something doesn’t mean it’s a good idea. They have a
different way of looking at problems, and what works for them will not necessarily
work for you.
It takes years of experience to know when to break the rules and which rules
you can break.
Good clean code will always outperform the rest. Always stick to the KISS
methodology.
Questions?
Thank you for coming and we hope
that you’ve liked the show.
Congratulations
Chicago Blackhawks!
2013 Stanley Cup
Champions
Monday, July 15, 13

Weitere ähnliche Inhalte

Ähnlich wie June 2013 CHUG: HBase Schema Design Basics

Data Warehouse
Data WarehouseData Warehouse
Data Warehouse
ganblues
 

Ähnlich wie June 2013 CHUG: HBase Schema Design Basics (20)

Why AWS's Redshift is a Game Changer
Why AWS's Redshift is a Game ChangerWhy AWS's Redshift is a Game Changer
Why AWS's Redshift is a Game Changer
 
Beyond Relational Databases
Beyond Relational DatabasesBeyond Relational Databases
Beyond Relational Databases
 
The Challenges of SQL on Hadoop
The Challenges of SQL on HadoopThe Challenges of SQL on Hadoop
The Challenges of SQL on Hadoop
 
Oracle to Amazon Aurora Migration, Step by Step - AWS Online Tech Talks
Oracle to Amazon Aurora Migration, Step by Step - AWS Online Tech TalksOracle to Amazon Aurora Migration, Step by Step - AWS Online Tech Talks
Oracle to Amazon Aurora Migration, Step by Step - AWS Online Tech Talks
 
DAT310_Which Database to Use When
DAT310_Which Database to Use WhenDAT310_Which Database to Use When
DAT310_Which Database to Use When
 
SRV307 Applying AWS Purpose-Built Database Strategy: Match Your Workload to ...
 SRV307 Applying AWS Purpose-Built Database Strategy: Match Your Workload to ... SRV307 Applying AWS Purpose-Built Database Strategy: Match Your Workload to ...
SRV307 Applying AWS Purpose-Built Database Strategy: Match Your Workload to ...
 
Unleashing the Power of your Data
Unleashing the Power of your DataUnleashing the Power of your Data
Unleashing the Power of your Data
 
Building with AWS Databases: Match Your Workload to the Right Database (DAT30...
Building with AWS Databases: Match Your Workload to the Right Database (DAT30...Building with AWS Databases: Match Your Workload to the Right Database (DAT30...
Building with AWS Databases: Match Your Workload to the Right Database (DAT30...
 
Data Warehouse
Data WarehouseData Warehouse
Data Warehouse
 
Datacenter Pulse Stack v2
Datacenter Pulse Stack v2Datacenter Pulse Stack v2
Datacenter Pulse Stack v2
 
Why Data Lake should be the foundation of Enterprise Data Architecture
Why Data Lake should be the foundation of Enterprise Data ArchitectureWhy Data Lake should be the foundation of Enterprise Data Architecture
Why Data Lake should be the foundation of Enterprise Data Architecture
 
Why data lake should be the foundation of enterprise data architecture by Raj...
Why data lake should be the foundation of enterprise data architecture by Raj...Why data lake should be the foundation of enterprise data architecture by Raj...
Why data lake should be the foundation of enterprise data architecture by Raj...
 
Leveraging Cloud Analytics to Support Data-Driven Decisions
Leveraging Cloud Analytics to Support Data-Driven DecisionsLeveraging Cloud Analytics to Support Data-Driven Decisions
Leveraging Cloud Analytics to Support Data-Driven Decisions
 
Elevate MongoDB with ODBC/JDBC
Elevate MongoDB with ODBC/JDBCElevate MongoDB with ODBC/JDBC
Elevate MongoDB with ODBC/JDBC
 
Applying AWS Purpose-Built Database Strategy - SRV307 - Anaheim AWS Summit
Applying AWS Purpose-Built Database Strategy - SRV307 - Anaheim AWS SummitApplying AWS Purpose-Built Database Strategy - SRV307 - Anaheim AWS Summit
Applying AWS Purpose-Built Database Strategy - SRV307 - Anaheim AWS Summit
 
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
 
Trends in Supporting Production Apache HBase Clusters
Trends in Supporting Production Apache HBase ClustersTrends in Supporting Production Apache HBase Clusters
Trends in Supporting Production Apache HBase Clusters
 
Voice Powered Analytics
Voice Powered AnalyticsVoice Powered Analytics
Voice Powered Analytics
 
Ddn 2017 10_dse_primer
Ddn 2017 10_dse_primerDdn 2017 10_dse_primer
Ddn 2017 10_dse_primer
 
TDWI Roundtable: The HANA EDW
TDWI Roundtable: The HANA EDWTDWI Roundtable: The HANA EDW
TDWI Roundtable: The HANA EDW
 

Kürzlich hochgeladen

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Kürzlich hochgeladen (20)

HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

June 2013 CHUG: HBase Schema Design Basics

  • 1. Copyright © 2013 Segel & Associates All Rights Reserved. HBase 101: Schema Design Basics Chicago area Hadoop User Group (C.H.U.G) Monday, July 15, 13
  • 2. Agenda. Introduction: About the speaker (blah blah blah); What is Hadoop/HBase; Why HBase?; RDBMS vs HBase; What HBase isn’t. Schema Design: HBase Schema Components; Design walk through I; Design walk through II. Secondary Indexing. Summary / Questions.
  • 3. About the Speaker. Michael has been working with Hadoop-related technologies since 2009. Currently an independent consultant working for Segel & Associates. Likes long walks off of short piers. ;-) Founding member of Big Data Anonymous. (It’s a real disease, people!)
  • 4. A Quick Review of Hadoop. Big data is becoming an increasingly important part of every business. (1300+ members in CHUG!) MapReduce is a distributed programming model that makes it easier for developers to analyze massive datasets. Hadoop is a distributed computing framework with many components, such as MapReduce. HDFS has historically been a “write once, read many” (WORM) file system, which makes implementing create-update-delete (CRUD) operations a challenge.
  • 5. The Hadoop Ecosystem. HDFS - Distributed File System. MapReduce - A distributed framework for executing work in parallel. Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS. Pig - A top-down scripting language to manipulate data. HBase - A NoSQL, non-sequential data store.
  • 7. What is HBase? A NoSQL database. ‘Column Oriented’ (at the storage layer). Highly distributed. Highly scalable. A non-relational persistent object store...
  • 8. What is not HBase. HBase is not... A relational database. A system with transactional support. Built on a traditional, updatable file system (updates are via cells). A stand-alone system (but could be...). The only NoSQL game in town.
  • 9. RDBMS vs HBase (RDBMS listed first in each pair). ACID compliant vs. no ACID compliance; sharding/partitions vs. distributed regions; SQL vs. key lookup/key range scans; triggers/stored procedures vs. coprocessors; indexes (B+Tree, R-Tree) vs. no indexing; highly normalized vs. denormalized; primitive data types vs. byte arrays; in-place updates vs. cell versioning.
  • 10. Audience Participation: Questions? How many students have a strong RDBMS background? Data modeling? Use of a non-relational engine like Pick (Revelation, U2, ... COBOL)?
  • 11. Why HBase? Handles unstructured or semi-structured data. Handles enormous data volumes. Flexible. Ad-hoc access as well as full or partial table scans. Cost-effective scalability. Near linear scalability. Part of the Hadoop ecosystem.
  • 12. Let’s take a look at SCHEMAS
  • 13. Schema Design. HBase is not an RDBMS. No SQL language or syntax. Joins in HBase are expensive. HBase can store complex types within a single column. Column families are a factor in designs. Schema design is one of the more difficult things when working with HBase.
  • 14. What makes up HBase (from a conceptual point of view): Tables; Column Families; Columns; Cells (versions). (More on this later...)
  • 15. A PRACTICAL EXAMPLE... Acme Supply Order Entry System: Order Entry, Pick Slip, Shipping, Invoice. [Sample sales receipt. Acme Supply ("You want it, we’ve got it!"), 310 Erie St., Chicago, IL 60654, Phone (312) 555-12112, Fax (312) 555-1214, sales@acme.com. Date 12/22/2012, Receipt # abc123. Sold to: Wylie Coyote, P.O. Box 123, Rock River Canyon, AZ. Line item: Qty 10, Item # 12345677, 250 LB. Steel Anvils, unit price $150.00, discount 10%, line total $1350.00. Shipping $550.00. Subtotal $1900.00. Total $1900.00. "Thank you for your business!"]
  • 18. RDBMS: What did we see? Highly normalized design. Lots of joins to get a single record. Requires a FOR EACH loop to fetch the detail records. (Or a join against the master record.)
  • 19. Now Let’s Look at HBase! The defined schema contains the table name and column families. Columns are not defined. (The # of versions per column is!) Data is stored in key-value pairs. The key is the primary index. (There really is no index per se.) The value is a set of one or more columns which contain Java byte arrays. Each table should be considered stand-alone. The only exception would be indices created on non-key fields. (Secondary indexes... More on this later...)
  • 20. How to translate the model to HBase: Traditional Entity Relation design tools are of limited use. Each table should be atomic, with only weak relationships to other tables. A record should be completely contained in a single row. Define your record based on the data present. Encapsulate data by grouping related data within the record.
  • 22. Remember our Invoice? Company; Company Address; Customer Name; Ship To: Address; Date(s); Phone #(s); Line Items: SKU, Description, Qty, Unit Price, Total Price; SubTotal; Tax(s); Total. Here are a couple of groupings that encapsulate data.
  • 23. HBASE SCHEMA. Invoice: invoice_id, company_name, inv_date, company_address, customer_shipTo_address, customer_billing_address, customer_phone, customer_name, customer_id, reference_docs, line_items, total, discount, taxes, sub_total. This is just an example. There are other options! Question(s) for the audience: Should this be a table? What about a column family? [Hint: Think of the entire system, e.g. Order Entry...]
  • 26. LET’S TAKE A CLOSER LOOK. Invoice (same fields as on slide 23). The invoice_id is our row key. This is a unique ID specific to one and only one invoice. Bonus question: Is this really a good row key? These [the highlighted fields] represent structures which encapsulate primitive data types.
  • 27. Let’s Look at our Key. Fastest access is when accessing a single record based on the full key. Scans are more expensive; full table scans are very expensive. Invoice (same fields as on slide 23): invoice_id is unique. How do we access invoices? What are our use cases? Why is it not a good key?
  • 29. So what makes a good key? Uniqueness. Easy access for known and predominant use cases. You need to know your data. You need to know your access patterns. Not in sort order: sequential #s and sorted input are bad for inserts and table splits. (Hot spots)
  • 30. So how do we access Orders/Invoices/etc...? By specific Invoice #: pick slips, customer inquiries. By customer number/id, then Order/Invoice #: customer inquiries. ∴ The key should be: customer_id | invoice_id? But what about a Timestamp?
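The composite key proposed on slide 30 can be sketched in a few lines. This is only an illustration under stated assumptions: a sorted Python list stands in for an HBase table, the `|` separator is assumed never to occur inside an id, and `make_row_key` and `prefix_scan` are hypothetical helper names, not HBase API calls.

```python
from bisect import bisect_left

SEP = b"|"  # assumed delimiter; must never appear inside an id


def make_row_key(customer_id: str, invoice_id: str) -> bytes:
    """Build the composite row key customer_id | invoice_id as bytes."""
    return customer_id.encode("utf-8") + SEP + invoice_id.encode("utf-8")


def prefix_scan(sorted_keys, prefix: bytes):
    """Return every key starting with prefix, relying on byte-sorted order."""
    start = bisect_left(sorted_keys, prefix)
    hits = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so nothing further can match
        hits.append(key)
    return hits


# A sorted list plays the role of the table's row keys.
keys = sorted([
    make_row_key("C0042", "INV-0007"),
    make_row_key("C0007", "INV-0003"),
    make_row_key("C0042", "INV-0001"),
    make_row_key("C0100", "INV-0002"),
])

# "By customer number/id, then invoice #": one partial-key scan.
customer_invoices = prefix_scan(keys, b"C0042" + SEP)
```

With this layout a customer inquiry is a single partial-key scan, while a lookup by invoice number alone needs either the customer id as well or a secondary index.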
  • 33. SEQUENTIAL KEYS AND TABLE SPLITS. New Data Pre-Split (Row ID, Column Data): 1 something; 2 Cat; 3 Dog; 4 word; 7 silly; 11 newspaper; 17 git; 27 snarf; 43 pink; 68 brain; 107 takeover; 169 Something else; 267 More Data; 421 And some more data.; 665 Snark; 1050 Lewis Carroll; 1657 Alf; 2616 POJO; 4129 Busted; 6517 Some more data; 10286 Time to Split. Labels on the table: "To become Region A" and "To become Region B". Just before the table splits we can see how the data is going to be split into regions.
  • 36. TABLE SPLITS. New Data Post-Split (Row ID, Column Data). Region A: 1 something; 2 Cat; 3 Dog; 4 word; 7 silly; 11 newspaper; 17 git; 27 snarf; 43 pink; 68 brain; 107 takeover; 169 Something else. Region B: 267 More Data; 421 And some more data.; 665 Snark; 1050 Lewis Carroll; 1657 Alf; 2616 POJO; 4129 Busted; 6517 Some more data; 10286 Time to Split; 16236 Some new data; 25628 Whinnie; 40452 Pooh; 63850 Davincci; 100784 Code; 159081 Tigger is Awsome. As we can see, the region splits into A and B. As new rows are added, they are added to the right. Region B will grow until it splits and then...
  • 37. Putting this in Perspective. When the region splits, it will split in half. All of the new writes will continue to be appended to the right side of the youngest region. Then that region will split. Over time you will have many regions that each contain approximately 1/2 the region’s max file size. In the end, all inserts will ‘hot spot’ to the last region, and it will split, leaving two half-filled regions. So you have two bad side effects: ‘hot spotting’ and lots of half-filled regions.
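A toy simulation makes both side effects from slide 37 visible. It is a rough sketch, not HBase's actual split logic: regions are plain Python lists, the split threshold is counted in rows rather than file size, and `simulate_sequential_inserts` is a hypothetical name.

```python
def simulate_sequential_inserts(num_rows: int, max_rows: int):
    """Insert keys 1..num_rows in ascending order and return rows per region.

    Assumption for illustration: a region splits in half as soon as it
    holds more than max_rows rows.
    """
    regions = [[]]  # regions kept in key order
    for key in range(1, num_rows + 1):
        hot = regions[-1]  # an ascending key always sorts after every existing row
        hot.append(key)
        if len(hot) > max_rows:
            mid = len(hot) // 2
            regions[-1] = hot[:mid]      # lower half stays put and never grows again
            regions.append(hot[mid:])    # upper half becomes the new hot region
    return [len(region) for region in regions]


sizes = simulate_sequential_inserts(num_rows=1000, max_rows=100)
older_regions, hot_region = sizes[:-1], sizes[-1]
```

Every write lands in the last region, and every region left behind by a split stays at about half of the threshold, which is the pattern the slide describes.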
  • 38. Alternatives: Hashing the row key... If we hash the row key using a SHA-1 or MD5 hash, the key will be inserted in a ‘random’ order. No regional ‘hotspotting’. In order to fetch rows efficiently, you need to know your entire key, hash it, then use get() to fetch the specific row. Hashing works great if you know your entire row key. Hashing kills if you want to do partial key scans. Some truncate the hash and prepend it to the key to get the desired distribution and also guarantee uniqueness. (This is different from using a salt.)
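Both variants from slide 38 can be sketched with the standard library. This is a hedged illustration: a dict stands in for the table, and `hashed_key`, `prefixed_key`, and the 4-byte prefix length are assumptions chosen for the example, not anything the talk prescribes.

```python
import hashlib


def hashed_key(natural_key: str) -> bytes:
    """Full MD5 of the natural key: spreads writes, but only point gets work."""
    return hashlib.md5(natural_key.encode("utf-8")).digest()


def prefixed_key(natural_key: str, prefix_len: int = 4) -> bytes:
    """Truncated hash prepended to the natural key.

    The prefix gives the distribution; the untouched natural key that
    follows keeps the row key unique even if two prefixes collide.
    """
    raw = natural_key.encode("utf-8")
    return hashlib.md5(raw).digest()[:prefix_len] + raw


# A dict stands in for the table; get() needs the complete natural key.
table = {hashed_key(f"INV-{n:06d}"): f"invoice {n}" for n in range(1, 6)}
row = table.get(hashed_key("INV-000003"))

# The same key always hashes to the same place, so no lookup table is needed.
same_key_twice = prefixed_key("INV-000003") == prefixed_key("INV-000003")
```

Because the hash is deterministic, a reader who knows the full key can recompute the row key, which is what separates this approach from a round-robin salt.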
  • 39. Keep to a low Salt diet! http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/ The concept is to use a ‘prefix-salt’ and ‘round robin’ the inserts. While it solves the issue of a single region hotspotting, it has some nasty side effects: Complicates scan(). Complicates get().
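The side effects named on slide 39 show up even in a tiny model. The sketch below assumes four salt buckets and a one-byte prefix assigned round robin; a sorted list stands in for the table, and all function names are hypothetical.

```python
from itertools import count

NUM_BUCKETS = 4      # assumed bucket count for illustration
_next_write = count()


def salted_key(natural_key: bytes) -> bytes:
    """Prefix the key with a one-byte bucket chosen round robin."""
    bucket = next(_next_write) % NUM_BUCKETS
    return bytes([bucket]) + natural_key


def candidate_keys_for_get(natural_key: bytes):
    """A reader cannot know the bucket, so a get() must try every one."""
    return [bytes([b]) + natural_key for b in range(NUM_BUCKETS)]


def scan_range(table, start: bytes, stop: bytes):
    """One logical range scan becomes NUM_BUCKETS scans plus a merge."""
    hits = []
    for b in range(NUM_BUCKETS):
        lo, hi = bytes([b]) + start, bytes([b]) + stop
        hits.extend(key[1:] for key in table if lo <= key < hi)
    return sorted(hits)


# Ten sequential keys, written in order, end up spread over all buckets.
table = sorted(salted_key(b"%08d" % n) for n in range(10))
buckets_used = {key[0] for key in table}
found = scan_range(table, b"00000003", b"00000006")
```

Writes are spread evenly, but every read now fans out across all buckets, which is the cost the slide warns about.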
  • 40. Key Design Summary. Think about your data. Think about how you access the data and what should be in the key. Avoid sequential keys if possible; understand the issues with hashing the key. Keys are sorted in Byte[] order. Keep it simple. (KISS) There are always alternatives, so YMMV.
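"Keys are sorted in Byte[] order" has a practical consequence for numeric ids, shown here as a small sketch. The encodings are examples chosen for illustration (decimal text, zero-padded text, 8-byte big-endian), not a recommendation from the talk.

```python
import struct

ids = (9, 10, 100)

# Plain decimal text sorts byte by byte, so "10" and "100" come before "9".
text_keys = sorted(str(n).encode("ascii") for n in ids)

# Zero-padding to a fixed width restores numeric order.
padded_keys = sorted(b"%010d" % n for n in ids)

# So does a fixed-width big-endian encoding (non-negative values only).
binary_keys = sorted(struct.pack(">Q", n) for n in ids)
decoded = [struct.unpack(">Q", key)[0] for key in binary_keys]
```

Whichever encoding is used, range scans only behave numerically when the byte order of the key matches the numeric order of the value.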
  • 41. Column Families: The Good, The Bad and the Ugly. The Good: Column families allow you to partially segregate related data with the same key. The Bad: All actions to a table occur to all column families at the same time. The more column families, the longer splits and compactions take. Rule of thumb: no more than 3-5 column families per table. The Ugly: The Bad gets worse when you consider that when you split a region of a table, all of the column families’ regions are split too. This can lead to lots of small regions.
  • 42. Column Family Use Case. Customer Orders: Order Entry, Pick Slips, Shipping Slips, Invoices. All column families use the same row key. Each column family contains unique data, specific to a step in the order process. Denormalize the data. Repeat information as needed. Column family rows are roughly the same size.
  • 43. Schema Summary Review. Consider column families, but use sparingly. Focus on data access patterns. The key to success is in the key itself. Secondary indexing is always an option. (We will talk about this in the next section.)
  • 44. Advanced Schema Design
  • 46. Remember our Invoice? Company; Company Address; Customer Name; Ship To: Address; Date(s); Phone #(s); Line Items: SKU, Description, Qty, Unit Price, Total Price; SubTotal; Tax(s); Total. Here are a couple of groupings that encapsulate data.
  • 47. Complex Data Types. Columns are Byte[]. It’s now possible to think of storing data in 3 dimensions: Row x Column (2D); Row x Column x Structured Blob (3D).
  • 48. Storing Structure Options. Byte[] are essentially blobs. Almost anything goes. Types of objects: Strings (String.getBytes()); Custom Java Structures; Avro; Java Serialization; Custom toString().getBytes(); Avro. We like Avro!
  • 49. Avro Makes Life Easier. Avro is a data serializer and RPC system. Created by Doug Cutting (also of Hadoop fame). Language-independent APIs (Java, C, C++, C#, others). Schema based, defined with JSON. Supports dynamic typing. Untagged data results in smaller serialized size. No manually assigned field IDs. When schemas change, old and new schema are present. Avro can serialize to both a binary and JSON format. Splittable and compressible. Avro is both a service and a class library. Focus on Java APIs: http://avro.apache.org/docs/current/api/java/index.html Avro relies on UTF-8 for Strings.
  • 50. Schemas. Avro schema for the Address field (address-rec.avpr):
        {
          "namespace": "com.CHUG",
          "name": "AddressRecord",
          "type": "record",
          "fields": [
            {"name": "street1", "type": "string", "comment": "First street address."},
            {"name": "street2", "type": ["null", "string"], "comment": "Second street address."},
            {"name": "city", "type": "string"},
            {"name": "state", "type": "string"},
            {"name": "zip", "type": "string"}
          ]
        }
    This is just one example.
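To show how a structured address could end up in a single Byte[] cell, here is a rough sketch that does not use the Avro library at all. It parses the slide's schema with the standard `json` module, applies a hand-rolled check that only understands `"string"` and the `["null", "string"]` union, and uses JSON text as a stand-in for Avro's binary encoding. `conforms` and `to_cell_value` are hypothetical helpers.

```python
import json

# The schema from the slide, minus the optional comment attributes.
ADDRESS_SCHEMA = json.loads("""
{
  "namespace": "com.CHUG",
  "name": "AddressRecord",
  "type": "record",
  "fields": [
    {"name": "street1", "type": "string"},
    {"name": "street2", "type": ["null", "string"]},
    {"name": "city", "type": "string"},
    {"name": "state", "type": "string"},
    {"name": "zip", "type": "string"}
  ]
}
""")


def conforms(record: dict, schema: dict) -> bool:
    """Hand-rolled check covering only string and nullable-string fields."""
    for field in schema["fields"]:
        declared = field["type"]
        allowed = declared if isinstance(declared, list) else [declared]
        value = record.get(field["name"])
        if value is None:
            if "null" not in allowed:
                return False  # required field is missing
        elif not (isinstance(value, str) and "string" in allowed):
            return False
    return True


def to_cell_value(record: dict, schema: dict) -> bytes:
    """Serialize a checked record to bytes (JSON here, Avro binary in practice)."""
    if not conforms(record, schema):
        raise ValueError("record does not match schema")
    return json.dumps(record, sort_keys=True).encode("utf-8")


address = {"street1": "310 Erie St.", "city": "Chicago", "state": "IL", "zip": "60654"}
cell = to_cell_value(address, ADDRESS_SCHEMA)
restored = json.loads(cell.decode("utf-8"))
```

The point is the shape of the idea: one column holds one serialized structure, and the schema decides which fields may be absent.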
  • 51. Secondary Indexing
  • 52. Adding an Index. HBase does not natively support indices. Which type of index to implement is up to you: Inverted Table; Lucene; Solr.
  • 53. Indexing Issues. HBase writes are atomic. Index maintenance has to be synchronized with the base table. Coprocessors are still not fully baked. Writes to the index will incur additional costs. (The index region will most likely not be on the same RS as the base table’s region.) All code will be custom. You are on your own. YMMV in terms of performance.
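An inverted table, the first option on slide 52, can be sketched with two dicts. Everything here is an assumption for illustration: the base table maps row key to customer name only, the index key is `value + 0x00 + row_key`, and the two writes in `put_invoice` are not atomic, which is exactly the synchronization problem slide 53 raises.

```python
base_table = {}    # row key -> customer name (one column, for brevity)
index_table = {}   # customer name + NUL + row key -> empty value

NUL = b"\x00"      # assumed separator; must not occur in indexed values


def index_key(value: bytes, row_key: bytes) -> bytes:
    """Inverted-table key: indexed value first, base row key last."""
    return value + NUL + row_key


def put_invoice(row_key: bytes, customer_name: bytes) -> None:
    """Write the base row and keep the index in step (two separate writes)."""
    old = base_table.get(row_key)
    if old is not None and old != customer_name:
        index_table.pop(index_key(old, row_key), None)  # drop the stale entry
    base_table[row_key] = customer_name
    index_table[index_key(customer_name, row_key)] = b""


def lookup_by_customer_name(name: bytes):
    """Prefix-match the index, then return the base-table row keys."""
    prefix = name + NUL
    return sorted(key[len(prefix):] for key in index_table if key.startswith(prefix))


put_invoice(b"C0042|INV-0001", b"Wylie Coyote")
put_invoice(b"C0042|INV-0007", b"Wylie Coyote")
put_invoice(b"C0100|INV-0002", b"Road Runner")
before_update = lookup_by_customer_name(b"Wylie Coyote")

put_invoice(b"C0042|INV-0001", b"Road Runner")  # an update must fix the index too
after_update = lookup_by_customer_name(b"Wylie Coyote")
```

In a real cluster a failure between the two writes leaves the index stale or dangling, so the maintenance code needs its own recovery story.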
  • 54. So What have we learned? Row key design is everything. Denormalizing data is critical. HBase is not rational, so to keep your sanity, forget your rational modeling techniques. Avro is a powerful feature you can use to add another dimension to your schema. Secondary indexing is possible! Know your data!
  • 55. Some Key Take Aways: Pearls of Wisdom. There is more than one way to skin a cat. So you need to take what was presented with a grain of Kosher salt. (The grains are bigger.) You need to experiment on your own. Reading someone’s blog or slide deck is no replacement for hands-on experience. Just because Facebook does something doesn’t mean it’s a good idea. They have a different way of looking at problems, and what works for them will not necessarily work for you. It takes years of experience to know when to break the rules and which rules you can break. Good clean code will always outperform the rest. Always stick to the KISS methodology.
  • 56. Questions? Thank you for coming and we hope that you’ve liked the show. Congratulations Chicago Blackhawks! 2013 Stanley Cup Champions