This document discusses strategies for reducing the mean time to recovery (MTTR) in HBase to below one minute. It outlines how HBase recovery works and the key components involved. Techniques discussed include faster failure detection through lower ZooKeeper timeouts, more parallelism in region reassignment, and a rewritten data recovery process in HBase 0.96. It also notes that MTTR often stays high because losing a regionserver usually means losing a datanode as well, and HDFS's handling of the failed datanode (slow dead-node detection and bulk re-replication) gets in the way of recovery.
HBase: How to get MTTR below 1 minute
1. How to get the MTTR below 1 minute and more
Devaraj Das (ddas@hortonworks.com)
Nicolas Liochon (nkeywal@gmail.com)
2. Outline
• What is this? Why are we talking about this topic? Why does it matter? …
• HBase recovery – an overview
• HDFS issues
• Beyond MTTR (performance post recovery)
• Conclusion / Future / Q & A
3. What is MTTR? Why is it important? …
• Mean Time To Recovery -> the average time required to repair a failed component (courtesy: Wikipedia)
• Enterprises want an MTTR of ZERO
– Data should always be available with no degradation of perceived SLAs
– Practically hard to obtain, but it's the goal
• Close to zero MTTR is especially important for HBase
– Given it is used in near real-time systems
• MTTR in other NoSQL systems & databases
4. HBase basics
• Strongly consistent
– Writes are ordered with reads
– Once written, the data will stay
• Built on top of HDFS
• When a machine fails the cluster remains available, and its data as well
• We're just speaking about the piece of data that was handled by this machine
5. Write path
• WAL – Write Ahead Log
• A write is finished once written on all HDFS nodes (see the client sketch below)
• The client communicates with the region servers
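To make the write path concrete, here is a minimal client sketch using the 0.94/0.96-era Java API; the table name, row and column names are hypothetical. With default durability, put() returns only after the edit has been appended to the WAL (replicated on the HDFS DataNodes of the pipeline) and applied to the MemStore.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // 'usertable' and the 'f' family are illustrative; any existing table works.
    HTable table = new HTable(conf, "usertable");
    Put put = new Put(Bytes.toBytes("row-42"));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value"));
    // put() returns once the edit is in the WAL (replicated on the HDFS
    // DataNodes of the write pipeline) and in the region server's MemStore.
    table.put(put);
    table.close();
  }
}
```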
6. We're in a distributed system
• You can't distinguish a slow server from a dead server
• Everything, or nearly everything, is based on timeouts
• Smaller timeouts mean more false positives
• HBase copes well with false positives, but they always have a cost
• The smaller the timeouts, the better
9. Recovery process
• Failure detection: ZooKeeper heartbeats the servers and expires the session when a server does not reply
• Region assignment: the master reallocates the regions to the other servers
• Failure recovery: read the WAL and rewrite the data again
• The client stops the connection to the dead server and goes to the new one
(Diagram: ZK heartbeat with the region servers; region assignment involves master, RS and ZK; data recovery involves region servers and DataNodes; the client talks to the region servers)
10. So…
• Detect the failure as fast as possible
• Reassign as fast as possible
• Read / rewrite the WAL as fast as possible
• That's obvious
11. The obvious – failure detection
• Failure detection
– Set the ZooKeeper timeout to 30s instead of the old 180s default (configuration sketch after this list)
– Beware of the GC, but lower values are possible
– ZooKeeper detects the errors sooner than the configured timeout
• 0.96
– HBase scripts clean the ZK node when the server is kill -9ed
• => Detection time becomes 0
– Can be used by any monitoring tool
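A minimal sketch of the timeout discussed above. In practice the value lives in hbase-site.xml on the region servers; it is shown programmatically here only to document the property name and value.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FailureDetectionTimeout {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Normally set in hbase-site.xml on every region server.
    // 30s instead of the old 180s default; keep it above the worst expected
    // GC pause, or ZooKeeper will expire sessions of healthy servers.
    conf.setInt("zookeeper.session.timeout", 30000);
    System.out.println("zookeeper.session.timeout = "
        + conf.get("zookeeper.session.timeout"));
  }
}
```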
12. The obvious – faster data recovery
• Not so obvious actually
• Already distributed since 0.92
– The larger the cluster, the better
• Completely rewritten in 0.96
– Recovery itself rewritten in 0.96
– Will be covered in the second part
13. The obvious – faster assignment
• Faster assignment
– Just improving performance
• Parallelism
• Speed
– Globally 'much' faster
– Backported to 0.94
• Still possible to do better for huge numbers of regions
• A few seconds for most cases
14. With this
• Detection: from 180s to 30s
• Data recovery: around 10s
• Reassignment: from tens of seconds to seconds
15. Do you think we're better with this?
• The answer is NO
• Actually yes, but if and only if HDFS is fine
– But when you lose a regionserver, you've just lost a datanode
16. DataNode crash is expensive!
• One replica of the WAL edits is on the crashed DN
– 33% of the reads during the regionserver recovery will go to it
• Many writes will go to it as well (the smaller the cluster, the higher that probability)
• The NameNode re-replicates the data (maybe TBs) that was on this node to restore the replica count
– The NameNode does this work only after a fairly long timeout (10 minutes by default)
17. HDFS – stale mode
• Live: as today, used for reads & writes, using locality
• Stale (after 30 seconds, can be less): not used for writes, used as a last resort for reads (configuration sketch below)
• Dead (after 10 minutes, don't change this): as today, not used. And actually, it's better to do the HBase recovery before HDFS re-replicates the TBs of data of this node
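A minimal sketch of the stale-mode settings, assuming an HDFS version that supports stale-datanode handling. These properties belong in hdfs-site.xml on the NameNode; they are shown programmatically only to document the names and values.

```java
import org.apache.hadoop.conf.Configuration;

public class StaleModeSettings {
  public static void main(String[] args) throws Exception {
    // Normally set in hdfs-site.xml on the NameNode.
    Configuration conf = new Configuration();
    conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);   // stale DN = last resort for reads
    conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);  // never write to a stale DN
    conf.setLong("dfs.namenode.stale.datanode.interval", 30 * 1000L);  // mark stale after 30s, can be less
    // Leave the dead-node timeout (~10 minutes) alone, as the slide advises.
    conf.writeXml(System.out);
  }
}
```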
18. Results
• We can do more reads/writes to HDFS during the recovery
• Multiple failures are still possible
– Stale mode will still play its role
– And set dfs.timeout to 30s
– This limits the effect of two failures in a row. The cost of the second failure is 30s if you were unlucky
19. Are we done?
• We're not bad
• But there is still something
22. The client
• You want the client to be patient
• Retrying when the system is already loaded is not good
• You want the client to learn about region servers dying, and to be able to react immediately
• You want this to scale
23. Solution
• The master notifies the client
– A cheap multicast message with the "dead servers" list. Sent 5 times for safety. (Configuration sketch below.)
– Off by default.
– On reception, the client immediately stops waiting on the TCP connection. You can now enjoy a large hbase.rpc.timeout
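A sketch of the knobs involved, assuming the 0.96-era status publisher: the master-side multicast is switched on with hbase.status.published (off by default, as the slide says), after which the client can afford a generous hbase.rpc.timeout. The exact listener wiring and the timeout value shown are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class DeadServerNotification {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Master side (hbase-site.xml): publish the "dead servers" list over multicast.
    conf.setBoolean("hbase.status.published", true);  // off by default
    // Client side: with immediate dead-server notification, a large RPC
    // timeout no longer hurts recovery time.
    conf.setInt("hbase.rpc.timeout", 300000);          // 5 minutes, illustrative value
    conf.writeXml(System.out);
  }
}
```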
24. Full workflow
• t0 – t1: the client reads and writes; the RegionServer is serving reads and writes
• t1: the RegionServer crashes
• t2: the affected regions are reassigned; the client can write again
• t3: the data is recovered; the client can read and write again
• t4: back to normal
25. Are we done?
• In a way, yes
– There are a lot of things around asynchronous writes, and reads during recovery
– That will be for another time, but there will be some nice things in 0.96
• And a couple of them are presented in the second part of this talk!
26. Faster recovery
• Previous algo
– Read the WAL files
– Write new HFiles
– Tell the region server it got new HFiles
• Puts pressure on the namenode
– Remember: don't put pressure on the namenode
• New algo (toggle sketch after this list):
– Read the WAL
– Write to the regionserver
– We're done (we have seen great improvements in our tests)
– TBD: assign the WAL to a RegionServer local to a replica
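A sketch of how the new recovery path is toggled, assuming the 0.96-era distributed log replay feature; the property is cluster-wide and normally lives in hbase-site.xml, and whether it is on by default varies by release.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class DistributedLogReplay {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Cluster-wide setting (hbase-site.xml): replay WAL edits directly to the
    // recovering region servers instead of writing intermediate files,
    // which keeps the pressure off the namenode.
    conf.setBoolean("hbase.master.distributed.log.replay", true);
    conf.writeXml(System.out);
  }
}
```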
29. Write during recovery
• Hey, you can write during the WAL replay
• Events stream: your new recovery time is the failure detection time: max 30s, likely less!
30. MemStore flush
• Real life: some tables are updated at a given moment, then left alone
– With a non-empty MemStore
– More data to recover
• It's now possible to guarantee that we don't have MemStores with old data (sketch below)
• Improves real-life MTTR
• Helps snapshots
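One way to bound how old unflushed MemStore data can get is the periodic flush interval shown below; treat this property as an assumption about the knob involved, not necessarily the exact mechanism the slide refers to.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class PeriodicMemstoreFlush {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Flush MemStores whose edits are older than this interval, so a table
    // that is written once and then left alone does not keep unflushed edits
    // around that would have to be replayed from the WAL after a crash.
    conf.setLong("hbase.regionserver.optionalcacheflushinterval", 3600 * 1000L); // 1 hour, illustrative
    conf.writeXml(System.out);
  }
}
```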
31. .META.
• .META.
– There is no -ROOT- in 0.95/0.96
– But .META. failures are critical
• A lot of small improvements
– The server now tells the client when a region has moved (the client can avoid going to meta)
• And a big one
– The .META. WAL is managed separately to allow an immediate recovery of META
– With the new MemStore flush, this ensures a quick recovery
32. Data locality post recovery
• HBase performance depends on data-locality
• After a recovery, you've lost it
– Bad for performance
• Here come region groups
• Assign 3 favored RegionServers for every region (see the sketch after this list)
• On failures, assign the region to one of the secondaries
• The data-locality issue is minimized on failures
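A sketch of enabling favored-node placement, assuming the FavoredNodeLoadBalancer that ships with later HBase releases; treat the class name as an assumption rather than the exact mechanism described in the talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FavoredNodesSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Master-side setting (hbase-site.xml): pick 3 favored region servers per
    // region and place the HDFS block replicas on them, so a failover target
    // already has a local copy of the StoreFiles.
    conf.set("hbase.master.loadbalancer.class",
        "org.apache.hadoop.hbase.master.balancer.FavoredNodeLoadBalancer");
    conf.writeXml(System.out);
  }
}
```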
33. (Diagram) RegionServer1 serves three regions, and their StoreFile blocks are scattered across the cluster (racks 1-3) with one replica local to RegionServer1. After the failure, the servers taking over the regions read Blk1, Blk2 and Blk3 remotely.
34. (Diagram) RegionServer1 serves three regions, and their StoreFile blocks are placed on specific machines on the other racks. After the failure, there are no remote reads.
35. Conclusion
• The target was "from often 10 minutes to always less than 1 minute"
– We're almost there
• Most of it is available in 0.96, some parts were backported
• Real-life testing of the improvements is in progress
• Room for more improvements