This document discusses strategies for reducing the mean time to recovery (MTTR) in HBase to below one minute. It outlines how HBase recovery works and the key components involved. Techniques discussed include faster failure detection through lower ZooKeeper timeouts, more parallelism in region reassignment, and a rewritten data recovery process in HBase 0.96. It also notes that MTTR often stays high because losing a regionserver usually means losing a datanode as well, and HDFS's handling of the failed datanode (slow dead-node detection and bulk re-replication) gets in the way of recovery.
HBase: How to get MTTR below 1 minute
1. How to get the MTTR below 1 minute and more
Devaraj Das (ddas@hortonworks.com)
Nicolas Liochon (nkeywal@gmail.com)
2. Outline
• What is this? Why are we talking about this topic? Why does it matter? …
• HBase recovery – an overview
• HDFS issues
• Beyond MTTR (performance post recovery)
• Conclusion / Future / Q & A
3. What is MTTR? Why is it important? …
• Mean Time To Recovery -> the average time required to repair a failed component (courtesy: Wikipedia)
• Enterprises want an MTTR of ZERO
– Data should always be available with no degradation of perceived SLAs
– Practically hard to obtain, but it's the goal
• Close to zero MTTR is especially important for HBase
– Given it is used in near real-time systems
• MTTR in other NoSQL systems & databases
4. HBase basics
• Strongly consistent
– Writes are ordered with reads
– Once written, the data will stay
• Built on top of HDFS
• When a machine fails the cluster remains available, and its data as well
• We're just speaking about the piece of data that was handled by this machine
5. Write path
• WAL – Write Ahead Log
• A write is finished once written on all HDFS nodes (see the client sketch below)
• The client communicates with the region servers
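To make the write path concrete, here is a minimal client sketch using the 0.94/0.96-era Java API; the table name, row and column names are hypothetical. With default durability, put() returns only after the edit has been appended to the WAL (replicated on the HDFS DataNodes of the pipeline) and applied to the MemStore.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // 'usertable' and the 'f' family are illustrative; any existing table works.
    HTable table = new HTable(conf, "usertable");
    Put put = new Put(Bytes.toBytes("row-42"));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value"));
    // put() returns once the edit is in the WAL (replicated on the HDFS
    // DataNodes of the write pipeline) and in the region server's MemStore.
    table.put(put);
    table.close();
  }
}
```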
6. We're in a distributed system
• You can't distinguish a slow server from a dead server
• Everything, or nearly everything, is based on timeouts
• Smaller timeouts mean more false positives
• HBase copes well with false positives, but they always have a cost
• The smaller the timeouts, the better
9. Recovery process
• Failure detection: ZooKeeper heartbeats the servers and expires the session when a server does not reply
• Region assignment: the master reallocates the regions to the other servers
• Failure recovery: read the WAL and rewrite the data again
• The client stops the connection to the dead server and goes to the new one
(Diagram: ZK heartbeat with the region servers; region assignment involves master, RS and ZK; data recovery involves region servers and DataNodes; the client talks to the region servers)
10. So…
• Detect the failure as fast as possible
• Reassign as fast as possible
• Read / rewrite the WAL as fast as possible
• That's obvious
11. The obvious – failure detection
• Failure detection
– Set the ZooKeeper timeout to 30s instead of the old 180s default (configuration sketch after this list)
– Beware of the GC, but lower values are possible
– ZooKeeper detects the errors sooner than the configured timeout
• 0.96
– HBase scripts clean the ZK node when the server is kill -9ed
• => Detection time becomes 0
– Can be used by any monitoring tool
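A minimal sketch of the timeout discussed above. In practice the value lives in hbase-site.xml on the region servers; it is shown programmatically here only to document the property name and value.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FailureDetectionTimeout {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Normally set in hbase-site.xml on every region server.
    // 30s instead of the old 180s default; keep it above the worst expected
    // GC pause, or ZooKeeper will expire sessions of healthy servers.
    conf.setInt("zookeeper.session.timeout", 30000);
    System.out.println("zookeeper.session.timeout = "
        + conf.get("zookeeper.session.timeout"));
  }
}
```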
12. The obvious – faster data recovery
• Not so obvious actually
• Already distributed since 0.92
– The larger the cluster, the better
• Completely rewritten in 0.96
– Recovery itself rewritten in 0.96
– Will be covered in the second part
13. The obvious – faster assignment
• Faster assignment
– Just improving performance
• Parallelism
• Speed
– Globally 'much' faster
– Backported to 0.94
• Still possible to do better for huge numbers of regions
• A few seconds for most cases
14. With this
• Detection: from 180s to 30s
• Data recovery: around 10s
• Reassignment: from tens of seconds to seconds
15. Do you think we're better with this?
• The answer is NO
• Actually yes, but if and only if HDFS is fine
– But when you lose a regionserver, you've just lost a datanode
16. DataNode crash is expensive!
• One replica of the WAL edits is on the crashed DN
– 33% of the reads during the regionserver recovery will go to it
• Many writes will go to it as well (the smaller the cluster, the higher that probability)
• The NameNode re-replicates the data (maybe TBs) that was on this node to restore the replica count
– The NameNode does this work only after a fairly long timeout (10 minutes by default)
17. HDFS – stale mode
• Live: as today, used for reads & writes, using locality
• Stale (after 30 seconds, can be less): not used for writes, used as a last resort for reads (configuration sketch below)
• Dead (after 10 minutes, don't change this): as today, not used. And actually, it's better to do the HBase recovery before HDFS re-replicates the TBs of data of this node
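A minimal sketch of the stale-mode settings, assuming an HDFS version that supports stale-datanode handling. These properties belong in hdfs-site.xml on the NameNode; they are shown programmatically only to document the names and values.

```java
import org.apache.hadoop.conf.Configuration;

public class StaleModeSettings {
  public static void main(String[] args) throws Exception {
    // Normally set in hdfs-site.xml on the NameNode.
    Configuration conf = new Configuration();
    conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);   // stale DN = last resort for reads
    conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);  // never write to a stale DN
    conf.setLong("dfs.namenode.stale.datanode.interval", 30 * 1000L);  // mark stale after 30s, can be less
    // Leave the dead-node timeout (~10 minutes) alone, as the slide advises.
    conf.writeXml(System.out);
  }
}
```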
18. Results
• We can do more reads/writes to HDFS during the recovery
• Multiple failures are still possible
– Stale mode will still play its role
– And set dfs.timeout to 30s
– This limits the effect of two failures in a row. The cost of the second failure is 30s if you were unlucky
19. Are we done?
• We're not bad
• But there is still something
22. The client
• You want the client to be patient
• Retrying when the system is already loaded is not good
• You want the client to learn about region servers dying, and to be able to react immediately
• You want this to scale
23. Solution
• The master notifies the client
– A cheap multicast message with the "dead servers" list. Sent 5 times for safety. (Configuration sketch below.)
– Off by default.
– On reception, the client immediately stops waiting on the TCP connection. You can now enjoy a large hbase.rpc.timeout
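A sketch of the knobs involved, assuming the 0.96-era status publisher: the master-side multicast is switched on with hbase.status.published (off by default, as the slide says), after which the client can afford a generous hbase.rpc.timeout. The exact listener wiring and the timeout value shown are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class DeadServerNotification {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Master side (hbase-site.xml): publish the "dead servers" list over multicast.
    conf.setBoolean("hbase.status.published", true);  // off by default
    // Client side: with immediate dead-server notification, a large RPC
    // timeout no longer hurts recovery time.
    conf.setInt("hbase.rpc.timeout", 300000);          // 5 minutes, illustrative value
    conf.writeXml(System.out);
  }
}
```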
24. Full workflow
• t0 – t1: the client reads and writes; the RegionServer is serving reads and writes
• t1: the RegionServer crashes
• t2: the affected regions are reassigned; the client can write again
• t3: the data is recovered; the client can read and write again
• t4: back to normal
25. Are we done?
• In a way, yes
– There are a lot of things around asynchronous writes, and reads during recovery
– That will be for another time, but there will be some nice things in 0.96
• And a couple of them are presented in the second part of this talk!
26. Faster recovery
• Previous algo
– Read the WAL files
– Write new HFiles
– Tell the region server it got new HFiles
• Puts pressure on the namenode
– Remember: don't put pressure on the namenode
• New algo (toggle sketch after this list):
– Read the WAL
– Write to the regionserver
– We're done (we have seen great improvements in our tests)
– TBD: assign the WAL to a RegionServer local to a replica
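A sketch of how the new recovery path is toggled, assuming the 0.96-era distributed log replay feature; the property is cluster-wide and normally lives in hbase-site.xml, and whether it is on by default varies by release.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class DistributedLogReplay {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Cluster-wide setting (hbase-site.xml): replay WAL edits directly to the
    // recovering region servers instead of writing intermediate files,
    // which keeps the pressure off the namenode.
    conf.setBoolean("hbase.master.distributed.log.replay", true);
    conf.writeXml(System.out);
  }
}
```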
29. Write during recovery
• Hey, you can write during the WAL replay
• Events stream: your new recovery time is the failure detection time: max 30s, likely less!
30. MemStore flush
• Real life: some tables are updated at a given moment, then left alone
– With a non-empty MemStore
– More data to recover
• It's now possible to guarantee that we don't have MemStores with old data (sketch below)
• Improves real-life MTTR
• Helps snapshots
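One way to bound how old unflushed MemStore data can get is the periodic flush interval shown below; treat this property as an assumption about the knob involved, not necessarily the exact mechanism the slide refers to.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class PeriodicMemstoreFlush {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Flush MemStores whose edits are older than this interval, so a table
    // that is written once and then left alone does not keep unflushed edits
    // around that would have to be replayed from the WAL after a crash.
    conf.setLong("hbase.regionserver.optionalcacheflushinterval", 3600 * 1000L); // 1 hour, illustrative
    conf.writeXml(System.out);
  }
}
```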
31. .META.
• .META.
– There is no -ROOT- in 0.95/0.96
– But .META. failures are critical
• A lot of small improvements
– The server now tells the client when a region has moved (the client can avoid going to meta)
• And a big one
– The .META. WAL is managed separately to allow an immediate recovery of META
– With the new MemStore flush, this ensures a quick recovery
32. Data locality post recovery
• HBase performance depends on data-locality
• After a recovery, you've lost it
– Bad for performance
• Here come region groups
• Assign 3 favored RegionServers for every region (see the sketch after this list)
• On failures, assign the region to one of the secondaries
• The data-locality issue is minimized on failures
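A sketch of enabling favored-node placement, assuming the FavoredNodeLoadBalancer that ships with later HBase releases; treat the class name as an assumption rather than the exact mechanism described in the talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FavoredNodesSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Master-side setting (hbase-site.xml): pick 3 favored region servers per
    // region and place the HDFS block replicas on them, so a failover target
    // already has a local copy of the StoreFiles.
    conf.set("hbase.master.loadbalancer.class",
        "org.apache.hadoop.hbase.master.balancer.FavoredNodeLoadBalancer");
    conf.writeXml(System.out);
  }
}
```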
33. (Diagram) RegionServer1 serves three regions, and their StoreFile blocks are scattered across the cluster (racks 1-3) with one replica local to RegionServer1. After the failure, the servers taking over the regions read Blk1, Blk2 and Blk3 remotely.
34. (Diagram) RegionServer1 serves three regions, and their StoreFile blocks are placed on specific machines on the other racks. After the failure, there are no remote reads.
35. Conclusion
• The target was "from often 10 minutes to always less than 1 minute"
– We're almost there
• Most of it is available in 0.96, some parts were backported
• Real-life testing of the improvements is in progress
• Room for more improvements