Jon Hsieh, Software Engineer @ Cloudera and HBase Committer
Apache HBase is a distributed non-relational database that provides low-latency random read/write access to massive quantities of data. This talk will be broken up into two parts. First, I'll talk about how, in the past few years, HBase has been deployed in production at companies like Facebook, Pinterest, Groupon, and eBay, and about the vibrant community of contributors from around the world, including folks at Cloudera, Salesforce.com, Intel, Hortonworks, Yahoo!, and Xiaomi. Second, I'll talk about the features in the newest release, 0.96.x, and in the upcoming 0.98.x release.
Apache HBase: Where We've Been and What's Upcoming
1. Apache HBase – Where we've been and what's upcoming
Jonathan Hsieh | @jmhsieh
Tech lead / Software Engineer at Cloudera | HBase PMC Member
Hadoop Users Group UK, April 10, 2014
2. Who Am I?
• Cloudera:
  • Tech Lead, HBase Team
  • Software Engineer
• Apache HBase committer / PMC
• Apache Flume founder / PMC
• U of Washington:
  • Research in Distributed Systems
3. What is Apache HBase?
Apache HBase is a reliable, column-oriented data store that provides consistent, low-latency, random read/write access.
[Diagram: applications and MapReduce jobs access HBase, which stores data in HDFS and coordinates through ZooKeeper]
4. HBase provides Low-latency Random Access
• Writes: 1–3 ms; 1k–20k writes/sec per node
• Reads: 0–3 ms cached, 10–30 ms from disk; 10k–40k reads/sec per node from cache
• Cell size: 0 B–3 MB
• Read, write, and insert data anywhere in the table
[Diagram: reads and writes landing at arbitrary rows of a sorted table]
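As a concrete illustration of random point access, here is a minimal sketch using the 0.96-era Java client; the table name "mytable", row key, and column family "d" are assumptions for the example, not from the deck:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomAccessExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");  // 0.96-era client handle
        try {
            // Random write: a single-row Put, durable via the WAL by default.
            Put put = new Put(Bytes.toBytes("row-42"));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("col1"), Bytes.toBytes("value"));
            table.put(put);

            // Random read: point lookup by primary key.
            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("d"), Bytes.toBytes("col1"))));
        } finally {
            table.close();
        }
    }
}
```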
5. Core Properties
• ACID guarantees on a row
  • Writes are durable
• Strong consistency first, then availability
  • After a failure, recover and return the current value instead of returning a stale value
  • CAS and atomic increments can be efficient (see the sketch after this list)
• Sorted by primary key
  • Short scans are efficient
• Partitioned by primary key
• Log-structured merge tree
  • Writes are extremely efficient
  • Reads are efficient
  • Periodic layout optimizations for reads ("compactions") are required
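A fragment sketching the row-level atomic primitives mentioned above; it assumes an already-open HTable named `table`, and the row key, family "d", and qualifiers are illustrative:

```java
// checkAndPut: atomic compare-and-set on a single row.
Put put = new Put(Bytes.toBytes("row-42"));
put.add(Bytes.toBytes("d"), Bytes.toBytes("state"), Bytes.toBytes("active"));
boolean applied = table.checkAndPut(
    Bytes.toBytes("row-42"),
    Bytes.toBytes("d"), Bytes.toBytes("state"),
    Bytes.toBytes("pending"),  // apply the Put only if the current value matches
    put);

// incrementColumnValue: atomic counter update with no read-modify-write race.
long hits = table.incrementColumnValue(
    Bytes.toBytes("row-42"), Bytes.toBytes("d"), Bytes.toBytes("hits"), 1L);
```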
7. Apache HBase Timeline
• Nov '06: Google BigTable paper at OSDI '06
• Apr '07: First Apache HBase commit, as a Hadoop contrib project
• Jan '08: Promoted to Hadoop subproject
• Summer '09: StumbleUpon goes production on HBase ~0.20
• Apr '10: Apache HBase becomes a top-level project
• Apr '11: CDH3 GA with HBase 0.90.1
• Summer '11: Facebook Messages on HBase; Web Crawl Cache
• Nov '11: Cassini on HBase
• Jan '12: 0.92.0
• May '12: 0.94.0; HBaseCon 2012
• Jan '13: Phoenix on HBase
• Jun '13: HBaseCon 2013
• Oct '13: 0.96.0
• Feb '14: 0.98.0
8. Developer Community
• Active community!
• Diverse committers from many organizations
13. What's here and new today?
Today: Apache HBase 0.96.2 / 0.98.1
14. Critical Features
Disaster Recovery:
• Cluster replication
• Table snapshots (see the sketch after this list)
• CopyTable
• Import / Export tables
• Metadata corruption repair tool (hbck)
Administrative and Continuity:
• Kerberos-based authentication
• ACL-based authorization
• Config change via rolling restart
• Within-version rolling upgrade
• Protobuf-based wire protocol for RPC future-proofing
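A minimal sketch of the snapshot admin API named above (0.96-era calls; the table and snapshot names are illustrative):

```java
// Take a point-in-time snapshot (metadata plus file references, no data copy),
// then materialize a new table from it.
HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
try {
    admin.snapshot("users-snap-20140410", "users");
    admin.cloneSnapshot("users-snap-20140410", "users_restored");
    // admin.restoreSnapshot("users-snap-20140410");  // rolls a disabled table back
} finally {
    admin.close();
}
```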
15. Hardened for 0.96
Table Administration:
• Online schema change
• Online region merging
• Continuous fault-injection testing with "Chaos Monkey"
Performance Tuning:
• Alternate key encodings for efficient memory usage
• Exploring compaction policy minimizes compaction storms
• Smart, adaptive stochastic region load balancer
• Fast split policy for new tables
16. MR over Table Snapshots (0.98, CDH5.0)
• Previously, MapReduce jobs over HBase required an online full-table scan
• Idea: take a snapshot and run the MR job over the snapshot files
  • Doesn't use the HBase client
  • Avoids affecting HBase caches
  • 3–5x performance boost
[Diagram: map/reduce tasks reading snapshot files directly instead of scanning the live table]
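A hedged sketch of the snapshot-scanning MR API this refers to (HBASE-8369, in 0.98); the snapshot name, scratch path, and the trivial counting mapper are assumptions for the example:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class SnapshotScanJob {
    // Counts cells per row, reading HFiles from the snapshot, not the live table.
    static class CountCellsMapper extends TableMapper<Text, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(row.get()), new LongWritable(value.size()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "scan-snapshot");
        job.setJarByClass(SnapshotScanJob.class);
        TableMapReduceUtil.initTableSnapshotMapperJob(
            "mytable-snap",                 // an existing snapshot (illustrative name)
            new Scan(),                     // the scan is applied to the snapshot files
            CountCellsMapper.class,
            Text.class, LongWritable.class,
            job, true,
            new Path("/tmp/snap-restore")); // scratch dir where HFile links are restored
        job.waitForCompletion(true);
    }
}
```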
17. Mean Time to Recovery (MTTR)
• Machine failures happen in distributed systems
• MTTR: the average unavailability when automatically recovering from a failure
• Recovery time for an unclean data center power cycle
[Timeline: region unavailable → detect → repair → notify → region available and client aware]
18. Fast notification and detection (0.96)
• Proactive notification of HMaster failure (0.96)
• Proactive notification of RegionServer failure (0.96)
• Notify client on recovery (0.96)
• Fast server failover (hardware)
[Timeline: region unavailable → detect → split → assign → replay (the HDFS-heavy steps) → region available for RW]
19. Distributed log replay (experimental, 0.96)
• Previously there were two IO-intensive passes:
  • Log splitting to intermediate files
  • Assign and log replay
• Now just one IO-heavy pass: assign first, then split+replay
• Improves read and write recovery times
• Off by default currently* (an enabling sketch follows)
[Timeline: region unavailable → detect → assign → region available for replaying writes → split + replay → region available for RW]
*Caveat: if you override timestamps you could see repeatable-read isolation violations (use tags to fix this)
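The feature is gated by a single switch; a minimal sketch, assuming the property name as it appears in the 0.98-era defaults:

```java
// Opting into distributed log replay (off by default in 0.96/0.98).
Configuration conf = HBaseConfiguration.create();
conf.setBoolean("hbase.master.distributed.log.replay", true);
```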
20. Cell Tags (0.98, experimental)
• Mechanism for attaching arbitrary metadata to Cells
• Motivation: finer-grained isolation
  • Used for Accumulo-style cell-level visibility
  • Main feature for 0.98
• Other uses:
  • Add sequence numbers to enable correct fast read/write recovery
  • Potential for schema tags
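The visibility-labels feature built on tags surfaces in the 0.98 client roughly as in the fragment below (classes from org.apache.hadoop.hbase.security.visibility); this is a sketch, the labels and expression are illustrative, and the labels must first be defined by an administrator:

```java
// Writer attaches a visibility expression; it is stored as a tag on the cell.
Put put = new Put(Bytes.toBytes("row-1"));
put.add(Bytes.toBytes("d"), Bytes.toBytes("col1"), Bytes.toBytes("value"));
put.setCellVisibility(new CellVisibility("(secret|topsecret)&!probationary"));
table.put(put);

// Reader presents authorizations; cells whose expression doesn't pass are filtered out.
Get get = new Get(Bytes.toBytes("row-1"));
get.setAuthorizations(new Authorizations("secret"));
Result result = table.get(get);
```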
21. HTrace (0.96, experimental)
• Problem: where is time being spent inside HBase?
• Solution: the HTrace framework
  • Inspired by Google's Dapper
  • Threaded through HBase and HDFS
  • Tracks time spent in calls in a distributed system by tracking spans* on different machines
*Some assembly still required.
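Client-side, tracing a request looked roughly like the fragment below. This is a hedged sketch: package and constant names (org.cloudera.htrace, Sampler.ALWAYS) shifted between HTrace releases of that era, and the span name is illustrative:

```java
// Open a span around client work; RPCs issued inside the scope carry the
// span id to the server side, so server time shows up in the same trace.
TraceScope scope = Trace.startSpan("get-user-row", Sampler.ALWAYS);
try {
    table.get(new Get(Bytes.toBytes("row-42")));
} finally {
    scope.close();  // closing the scope ends the span
}
```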
22. HTrace: Distributed Tracing in HBase and HDFS
• Framework inspired by Google's Dapper
• Tracks time spent in RPC calls across different machines
• Threaded through HBase (0.96) and, in the future, HDFS
[Diagram: an HBase client's spans cover RPC calls through hbase:meta, a RegionServer, ZK, and the HDFS NameNode and DataNode]
23. Zipkin – Visualizing Spans
• UI + visualization system, written by Twitter
• Zipkin HBase storage
• Zipkin HTrace integration
• View where time from a specific call is spent in HBase, HDFS, and ZK
24. A Future HBase: What's Upcoming
25. Outline
• Improved mean time to recovery (MTTR)
• Improved predictability
• Improved usability
• Improved multitenancy
27. Distributed log replay (experimental, 0.96)
(Recap of slide 19, repeated as the starting point for the improvements that follow: one IO-heavy pass, assign first, then split+replay; improves read and write recovery times; off by default.)
28. Distributed log replay with fast write recovery
• Writes in HBase do not incur reads
• With distributed log replay, regions are already open for write
• So: allow fresh writes while replaying old logs*
[Timeline: region unavailable → detect → assign → region available for all writes → split + replay → region available for RW]
*Caveat: if you override timestamps you could see repeatable-read isolation violations (use tags to fix this)
29. Fast Read Recovery (proposed)
• Idea: pristine-region fast read recovery
  • If a region has no edits to replay, it is consistent and can recover read/write immediately
• Idea: shadow regions for fast read recovery
  • A shadow region tails the WAL of the primary region
  • The shadow memstore is at most one HDFS block behind; catch up, then recover read/write
• Currently some progress in trunk
[Timeline: region unavailable → detect → assign → "can we guarantee no new edits?" / "can we guarantee we have all edits?" → region available for all RW]
31. Common causes of performance variability (and mitigations)
• Locality loss → favored nodes, HDFS block affinity
• Compaction → exploring compactor
• GC → off-heap cache
• Hardware hiccups → multi-WAL, HDFS speculative (hedged) read
32. Performance degraded after recovery
• After recovery, reads suffer a performance hit: regions have lost locality
• To maintain performance after failover, we need to regain locality
  • Compact the region to regain locality
  • We can do better by using HDFS features
[Graph: performance over time; service recovers with degraded performance, and full performance returns only once compaction restores locality]
33. Read Throughput: Favored Nodes (experimental, 0.96)
• Control and track where block replicas are
  • All files for a region are created such that their blocks go to the same set of favored nodes
  • When failing over, assign the region to one of those favored nodes
• Currently a preview feature in 0.96
  • Disabled by default because it doesn't work well with the latest balancer or with splits
  • Will likely use upcoming HDFS block affinity for better operability
• Originally on Facebook's 0.89; ported to 0.96
[Graph: service recovers with performance sustained because the region is assigned to a favored node]
34. Read latency: HDFS hedged read (CDH5.0)
• HBase RegionServers use the HDFS client to read one of the 3 HDFS block replicas
• If you chose the slow node, your reads are slow
• If a read is taking too long, speculatively issue it against another replica that may be faster (a configuration sketch follows)
[Diagram: a slow read against one HDFS replica is hedged by a second read against another replica]
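Hedged reads are enabled through two HDFS client properties (HDFS-5776); the specific values below are illustrative:

```java
// Enabling HDFS hedged reads for the RegionServer's HDFS client.
// A thread-pool size > 0 turns the feature on; the threshold says how long
// to wait on the first replica before issuing the hedged read.
Configuration conf = HBaseConfiguration.create();
conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
conf.setLong("dfs.client.hedged.read.threshold.millis", 10);
```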
35. Read latency: Read Replicas (in progress)
• The HBase client reads from the region's primary region server
• If you chose the slow node, your reads are slow
• Idea: read replicas of the region assigned to other region servers; replicas periodically catch up (via snapshots or shadow-region memstores)
• The client specifies whether a stale read is OK; if a read is taking too long, speculatively go to another replica that may be faster
[Diagram: a slow read against the primary region replica is retried against a possibly stale secondary replica]
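This work (HBASE-10070) was still in progress at the time of the talk; the client API it eventually shipped looks roughly like the fragment below. A sketch, not something available in 0.96/0.98:

```java
// Opt a single read into timeline consistency: the client may fall back to
// a secondary region replica, which can serve slightly stale data.
Get get = new Get(Bytes.toBytes("row-42"));
get.setConsistency(Consistency.TIMELINE);
Result result = table.get(get);
if (result.isStale()) {
    // Served by a secondary replica that may lag the primary.
}
```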
36. Write latency: Multiple WALs (in progress)
• HBase's HDFS client writes 3 replicas
• Minimum write latency is bounded by the slowest of the 3 replicas
• Idea: if a write is taking too long, duplicate it on another set of replicas that may be faster
[Diagram: a slow WAL write to one set of HDFS replicas is duplicated onto another set]
38. Making HBase easier to use and tune
• Difficult to see what is happening in HBase
• Easy to make poor design decisions early without realizing it
• New developments:
  • Memory auto-tuning
  • HTrace + Zipkin
  • Frameworks for schema design
39. Memory Use Auto-tuning (trunk)
• Memory is divided between:
  • the memstore (used for serving recent writes)
  • the block cache (used for read hot spots)
• Need to choose the balance for the workload (the knobs are sketched below)
[Diagram: three heap splits: read-heavy (large block cache), balanced, and write-heavy (large memstore)]
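Before auto-tuning, this balance is set by hand; a minimal sketch of the 0.96-era properties, both fractions of RegionServer heap (the values are illustrative, skewed here toward a write-heavy workload):

```java
// Manually splitting RegionServer heap between reads and writes.
Configuration conf = HBaseConfiguration.create();
conf.setFloat("hfile.block.cache.size", 0.25f);                        // block cache (reads)
conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.5f);  // memstores (writes)
```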
40. HBase Schemas
• HBase application developers must iterate to find a suitable HBase schema
• Schema is critical for performance at scale
  • How can we make this easier?
  • How can we reduce the expertise required to do this?
• Today:
  • Lots of tuning knobs
  • Developers need to understand column families, rowkey design, data encoding, …
  • Some choices are expensive to change after the fact
41. How should I arrange my data?
• Isomorphic data representations!

Tall skinny table with compound rowkey:

| rowkey   | d:   |
|----------|------|
| bob-col1 | aaaa |
| bob-col2 | bbbb |
| bob-col3 | cccc |
| bob-col4 | dddd |
| jon-col1 | eeee |
| jon-col2 | ffff |
| jon-col3 | gggg |
| jon-col4 | hhhh |

Short fat table using column qualifiers:

| rowkey | d:col1 | d:col2 | d:col3 | d:col4 |
|--------|--------|--------|--------|--------|
| bob    | aaaa   | bbbb   | cccc   | dddd   |
| jon    | eeee   | ffff   | gggg   | hhhh   |

Short fat table using column families:

| rowkey | col1: | col2: | col3: | col4: |
|--------|-------|-------|-------|-------|
| bob    | aaaa  | bbbb  | cccc  | dddd  |
| jon    | eeee  | ffff  | gggg  | hhhh  |
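A fragment contrasting how the same logical data is written under two of the layouts above; it assumes an open HTable named `table`, and the family and qualifier names mirror the tables:

```java
byte[] d = Bytes.toBytes("d");

// Tall skinny: compound rowkey "user-column", a single cell per row.
Put tall = new Put(Bytes.toBytes("bob-col1"));
tall.add(d, Bytes.toBytes(""), Bytes.toBytes("aaaa"));
table.put(tall);

// Short fat: one row per user, one qualifier per logical column.
Put fat = new Put(Bytes.toBytes("bob"));
fat.add(d, Bytes.toBytes("col1"), Bytes.toBytes("aaaa"));
fat.add(d, Bytes.toBytes("col2"), Bytes.toBytes("bbbb"));
table.put(fat);
```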
42. How should I arrange my data?
• Isomorphic data representations! (the same three layouts as the previous slide)
With great power comes great responsibility! How can we make this easier for users?
43. Impala
• Scalable low-latency SQL querying for HDFS (and HBase!)
• ODBC/JDBC driver interface
• Highlights:
  • Uses the Hive metastore and its Hive-HBase connector configuration conventions
  • Native-code implementation; uses JIT compilation for query-execution optimization
  • Authentication via Kerberos support
• Open sourced by Cloudera
• https://github.com/cloudera/impala
44. Phoenix
• A SQL skin over HBase targeting low-latency queries
• JDBC SQL interface
• Highlights:
  • Adds types
  • Handles compound row-key encoding
  • Secondary indices in development
  • Provides some pushdown aggregations (coprocessor)
• Open sourced by Salesforce.com; work from James Taylor, Jesse Yates, et al.
• https://github.com/forcedotcom/phoenix
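Because Phoenix exposes plain JDBC, querying HBase becomes standard Java; a minimal sketch, where the ZooKeeper host in the URL and the `metrics` table are assumptions for the example:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class PhoenixQuery {
    public static void main(String[] args) throws Exception {
        // The Phoenix JDBC URL names the ZooKeeper quorum of the HBase cluster.
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        try {
            ResultSet rs = conn.createStatement().executeQuery(
                "SELECT host, SUM(cpu) FROM metrics GROUP BY host"); // pushed-down aggregation
            while (rs.next()) {
                System.out.println(rs.getString(1) + " " + rs.getLong(2));
            }
        } finally {
            conn.close();
        }
    }
}
```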
45. Kite (née Cloudera Development Kit / CDK)
• APIs that provide a Dataset abstraction
  • Provides a get/put/delete API over Avro objects
  • HBase support in progress
• Highlights:
  • Supports multiple components of the Hadoop distros (Flume, Morphlines, Hive, Crunch, HCat)
  • Provides types using Avro and Parquet formats for encoding entities
  • Manages schema evolution
• Open sourced by Cloudera
• https://github.com/kite-sdk/kite
46. Multi-tenancy
Many apps and users in a single cluster
47. Growing HBase
• Pre-0.96.0: scaling up HBase for single HBase applications
  • Essentially a single user for a single app
  • Ex: Facebook Messages: one application, many HBase clusters; users sharded to different pods
• Focused on continuity and disaster-recovery features:
  • Cross-cluster replication
  • Table snapshots
  • Rolling upgrades
[Chart: scalability, # of clusters vs. # of isolated applications: one giant application across multiple clusters]
48. Growing HBase
• In 0.96 we introduce primitives for supporting multitenancy
  • Many users, many applications, one HBase cluster
  • Need some control over the interactions different users cause
  • Ex: manage MR analytics and low-latency serving in one cluster
[Chart: multitenancy moves from one giant application across multiple clusters toward many applications in one shared cluster]
49. Namespaces (0.96)
• Namespaces provide an abstraction for multiple tenants to create and manage their own tables within a large HBase instance (a sketch follows)
[Diagram: blue, green, and orange namespaces within one cluster]
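A minimal sketch of the 0.96 namespace admin API; the namespace "blue", table "users", and family "d" are illustrative:

```java
// Create a namespace, then a table inside it.
HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
try {
    admin.createNamespace(NamespaceDescriptor.create("blue").build());

    // Namespaced tables are addressed as "<namespace>:<table>".
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("blue", "users"));
    desc.addFamily(new HColumnDescriptor("d"));
    admin.createTable(desc);
} finally {
    admin.close();
}
```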
50. Multitenancy goals
• Security (0.96)
  • Separate admin ACLs for different sets of tables
• Quotas (in progress)
  • Max tables, max regions
• Performance isolation (in progress)
  • Limit the performance impact that load on one table has on others
• Priority (future)
  • Prioritize some workloads/tables/users before others
51. Isolation with Region Server Groups (in progress)
[Diagram: region assignment distribution without region server groups: regions from the blue, green, and orange namespaces are intermixed across all servers]
52. Isolation with Region Server Groups (in progress)
[Diagram: region assignment distribution with Region Server Groups (RSG): regions from the blue, green, and orange namespaces are confined to their own server groups]
54. Summary by Version

|              | 0.90 (CDH3)               | 0.92/0.94 (CDH4)             | 0.96 (CDH5)                                             | Next (0.98 / 1.0.0)                |
|--------------|---------------------------|------------------------------|---------------------------------------------------------|------------------------------------|
| New Features | Stability                 | Reliability                  | Continuity                                              | Multitenancy                       |
| MTTR         | Recovery in hours         | Recovery in minutes          | Recovery of writes in seconds, reads in 10's of seconds | Recovery in seconds (reads+writes) |
| Perf         | Baseline                  | Better throughput            | Optimizing performance                                  | Predictable performance            |
| Usability    | HBase developer expertise | HBase operational experience | Distributed-systems admin experience                    | Application developer experience   |