This document summarizes an approach for de-duplicating merchant records using Hadoop and Lucene. Key elements include:
1. Fuzzy string matching of merchant names, addresses, and phone numbers is used to identify potential duplicate records. Match scores are computed as a weighted combination of per-attribute similarities.
2. Records are partitioned by state and matched against a Lucene index of reference data for that state only, to allow parallel processing.
3. The system was optimized to achieve a 10x speedup, processing 1 million records in 1 hour instead of 10 hours previously. Configuration changes and code optimizations improved performance.
4. Best practices like caching frequently used values, using HBase utilities such as TableMapReduceUtil, and abstracting HBase reads behind convenience methods improve performance, readability, and maintainability.
3. Fuzzy matching & de-duplicating merchants
Company ABC:
name: The Windsor Press, Inc.
address: PO Box 465 6 North Third Street
city: Hamburg
state: PA
zip: 19526
phone: (610) 562-2267

Company PQR:
name: The Windsor Press
address: P.O. Box 465 6 North 3rd St.
city: Hamburg
state: PA
zip: 19526-0465
phone: (610) 562-2267
Both of the above vendor records map to external Dun & Bradstreet reference data:
DUNSnum: 002114902
Name: The Windsor-Press Inc
Street: 6 N 3rd St
City: Hamburg
State: PA
Zip: 19526-1502
Phone: (610)-562-2267
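Since both vendor records share the same phone number, normalization alone already links them to the same D&B entry. A minimal Java sketch of phone normalization along these lines (the class and method names are illustrative, not from the original system):

```java
public class PhoneNormalizer {
    // Strip punctuation and whitespace, map vanity letters to keypad
    // digits, and drop anything after an extension marker like "x"/"ext".
    public static String normalize(String raw) {
        String s = raw.toLowerCase();
        // Remove extensions: "562-2267 ext. 12" -> "562-2267"
        s = s.replaceAll("(x|ext\\.?)\\s*\\d+$", "");
        StringBuilder digits = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (Character.isDigit(c)) {
                digits.append(c);
            } else if (c >= 'a' && c <= 'z') {
                digits.append(letterToDigit(c)); // 1-800-FLOWERS style
            }
            // punctuation, spaces, parentheses are dropped
        }
        return digits.toString();
    }

    private static char letterToDigit(char c) {
        final String keypad = "22233344455566677778889999"; // a..z
        return keypad.charAt(c - 'a');
    }
}
```

With this, both "(610) 562-2267" and "(610)-562-2267" normalize to the same key, "6105622267".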
5. De-duping system architecture

[Architecture diagram: Input → Data Import → Address Standardizer → name/phone/address Matchers (which consult merchant reference data) → Matcher scores → Score Combiner → Merchant Splicer → Applications such as auto-complete and transaction categorization.]
6. HBase schema example: Merchant table

Row key: 25204939
Info (column family):
  name: Crepevine
  street: 367 University Avenue
  city: Palo Alto
  state: CA
  zip: 94031
  county: Santa Clara County
  country: United States of America
  website: www.crepevine.com
  phoneNumber: 16503233900
  latitude: 37.430211
  longitude: -122.098221
  source: internet
  mint_category: Food & Dining
  qbo_category: Restaurants
  NAICS: 722110
  SIC: 5182
Mapping (column family):
  sourcename: 10000048, 10000075
7. MapReduce algorithm for matching

[Diagram: the Mapper reads an input merchant A, generates potential matches, and looks up a candidate subset (merchants A1-A4) from the Lucene index. The Reducer compares attribute values via custom matching and outputs a score between 0 and 1 per candidate pair, e.g. A:A1 0.6, A:A2 0.9, A:A3 0.4, A:A4 0.667.]
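Outside of Hadoop, the reduce step boils down to scoring each candidate and keeping the best one. A plain-Java sketch of that logic (class and method names are illustrative; in the real pipeline this runs inside a Hadoop Reducer with a custom similarity function):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

public class CandidateScorer {
    /** Score every candidate against the input merchant; scores fall in [0, 1]. */
    public static Map<String, Double> scoreAll(String merchant,
                                               Iterable<String> candidates,
                                               ToDoubleBiFunction<String, String> similarity) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (String candidate : candidates) {
            scores.put(candidate, similarity.applyAsDouble(merchant, candidate));
        }
        return scores;
    }

    /** Pick the best-scoring candidate, or null if none clears the threshold. */
    public static String bestMatch(Map<String, Double> scores, double threshold) {
        String best = null;
        double bestScore = threshold;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() >= bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Given the example scores from the diagram (A1 0.6, A2 0.9, A3 0.4, A4 0.667), `bestMatch` with a 0.5 threshold selects A2.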
8. Fuzzy-matching implementation details

• Normalization & string pre-processing
  – Case, punctuation & special characters
  – Phone numbers: letter-to-digit conversion, remove extensions
  – Biz names: special handling for common suffixes like Inc, Corp, LLC
  – USA addresses: 123 North Main Ave becomes 123 N. Main
• Jaccard and Jaro-Winkler string similarity approaches
• Final Score = (0.4 * phone confidence) + (0.25 * name confidence) + (0.35 * address confidence)
  – Two businesses with the same phone are likely to be the same business
  – Same with email address
  – A similar business name is less important
  – And sometimes two businesses share the same address
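A self-contained sketch of the normalization and scoring described above, using Jaccard token similarity (the deck also mentions Jaro-Winkler, omitted here for brevity) and the stated weights; helper names are illustrative, not from the original code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class FuzzyScore {
    /** Lowercase, strip punctuation, and drop common corporate suffixes. */
    public static String normalizeName(String name) {
        String s = name.toLowerCase().replaceAll("[^a-z0-9 ]", " ");
        s = s.replaceAll("\\b(inc|corp|llc|co|ltd)\\b", " ");
        return s.trim().replaceAll("\\s+", " ");
    }

    /** Jaccard similarity over whitespace-delimited tokens: |A∩B| / |A∪B|. */
    public static double jaccard(String a, String b) {
        Set<String> sa = new HashSet<>(Arrays.asList(a.split("\\s+")));
        Set<String> sb = new HashSet<>(Arrays.asList(b.split("\\s+")));
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    /** Final Score = 0.4*phone + 0.25*name + 0.35*address, per the slide. */
    public static double finalScore(double phone, double name, double address) {
        return 0.4 * phone + 0.25 * name + 0.35 * address;
    }
}
```

After normalization, "The Windsor Press, Inc." and "The Windsor Press" both reduce to "the windsor press" and score a perfect name match, so the shared phone number dominates the final score.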
9. 10x speedup via optimizations!

• De-duping 1 million sample merchants takes about 1 hour (previously took 10 hours)
• Writing back a sample set of 31 million records into the HBase cluster takes about 30 mins (previously took 4 hours 37 mins)
• These metrics were calculated on a 20-node Hadoop cluster (HBase installed on 5 nodes)
10. Optimizations – overall system design

Idea: partition the address match by US state to allow parallelism
1. Select the subset of the input table from a particular state (e.g. NY)
2. Apply matching to a Lucene index that contains only reference data from that state
   – Each single-state Lucene index is small, fits entirely in memory
   – Standardize the addresses, normalize the strings
   – Compare using string distance metrics
3. Run all 50 states (+ Washington DC, Puerto Rico, etc.)
   – Let Oozie run these in parallel
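The partitioning idea can be sketched in plain Java: group records by state, then process each partition independently. Here `parallelStream` stands in for Oozie, and the per-state "match" is a placeholder for the real Lucene lookup (all names are illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

public class StatePartitioner {
    /** Minimal merchant record for this sketch (illustrative fields). */
    public static final class Merchant {
        public final String name;
        public final String state;
        public Merchant(String name, String state) { this.name = name; this.state = state; }
    }

    /** Group input records by US state so each partition can be matched
     *  against a small, single-state reference index. */
    public static Map<String, List<Merchant>> partitionByState(List<Merchant> input) {
        return input.stream().collect(Collectors.groupingBy(m -> m.state));
    }

    /** Run the per-state match in parallel. The placeholder "match" just
     *  counts records; the real step queries an in-memory Lucene index
     *  built from that state's reference data. */
    public static Map<String, Integer> matchAllStates(Map<String, List<Merchant>> partitions) {
        Map<String, Integer> matched = new ConcurrentHashMap<>();
        partitions.entrySet().parallelStream().forEach(e ->
            matched.put(e.getKey(), e.getValue().size()));
        return matched;
    }
}
```

Because no two partitions share state, the per-state jobs need no coordination, which is what makes the fork-join parallelism safe.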
11. Optimizations – HBase config

Set caching parameters to make our full table scans faster:

scan.setCaching(500);
– transfers 500 rows at a time to the client to be processed
– scanner timeout exceptions are possible if you set it too high

scan.setCacheBlocks(false);
– avoids block cache churn

hbase.regionserver.lease.period = 10 minutes
– clients must report in within this period or they are considered dead
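The lease period is set cluster-side in hbase-site.xml; a sketch of what that entry might look like, with 10 minutes expressed in milliseconds:

```xml
<!-- hbase-site.xml: allow slow scan clients up to 10 minutes between
     scanner calls before the region server expires the lease -->
<property>
  <name>hbase.regionserver.lease.period</name>
  <value>600000</value>
</property>
```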
12. Optimizations – code level

Cache frequently used column family and column names as immutable byte arrays in a public interface:

public static final byte[] COLUMN_NAME = Bytes.toBytes("name");
public static final byte[] COLUMN_FAMILY_INFO = Bytes.toBytes("info");

• Improves readability
• Minor runtime performance improvement
13. Best practices – Hadoop interfacing

• For Hadoop jobs interfacing with HBase, used TableMapReduceUtil
  – On the input side (source) as well as the output side (sink)
  – Instead of doing a regular input split
• When writing to an HBase table, emitted a 'put' from the Mapper or Reducer instead of a regular HTable put
  – Use context.write(rowKey, put)
  – Much faster than doing an HTable.put(), even for a bulk put
14. Best practices – readability, maintainability

Client gets values out of Result via convenience methods:

String val = HBaseUtils.getColumnValue(result, COLUMN_FAMILY_INFO, COLUMN_NAME);
Double lat = HBaseUtils.getDoubleColumnValue(result, COLUMN_FAMILY_INFO, COLUMN_LATITUDE);
Long sicCode = HBaseUtils.getLongColumnValue(result, COLUMN_FAMILY_INFO, COLUMN_SIC);
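Such helpers are thin wrappers over an HBase Result lookup plus a type conversion. A dependency-free sketch of the same idea, with a Map standing in for HBase's Result and UTF-8 strings standing in for HBase's binary encodings (all names here are illustrative, not the original HBaseUtils):

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class HBaseUtilsSketch {
    /** Stand-in for an HBase Result: "family:qualifier" -> raw bytes. */
    public static byte[] getColumnBytes(Map<String, byte[]> row, String family, String qualifier) {
        return row.get(family + ":" + qualifier);
    }

    public static String getColumnValue(Map<String, byte[]> row, String family, String qualifier) {
        byte[] raw = getColumnBytes(row, family, qualifier);
        return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
    }

    public static Double getDoubleColumnValue(Map<String, byte[]> row, String family, String qualifier) {
        String s = getColumnValue(row, family, qualifier);
        return s == null ? null : Double.valueOf(s);
    }

    public static Long getLongColumnValue(Map<String, byte[]> row, String family, String qualifier) {
        String s = getColumnValue(row, family, qualifier);
        return s == null ? null : Long.valueOf(s);
    }
}
```

The payoff is that callers never touch byte arrays or null-check boilerplate directly, which is the readability win the slide describes.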
16. Thank You!

Michael J. Radwin
Twitter: @michael_radwin
17. MR Workflow (Oozie)

[Workflow diagram: Start → Data Import → Address Standardizer → fork into Name Matcher, Phone Matcher, and Address Matcher (fork-join) → Score Combiner → Splicer → End; any failed action transitions to the Failed end state.]
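The fork/join structure in the diagram maps directly onto an Oozie workflow definition. A sketch of what that XML might look like; the action bodies (map-reduce configuration) are omitted and all action names are illustrative:

```xml
<!-- Sketch of the fork/join shape from the diagram; <map-reduce> action
     bodies are omitted, and node names are illustrative -->
<workflow-app name="dedupe-wf" xmlns="uri:oozie:workflow:0.2">
  <start to="data-import"/>
  <action name="data-import">
    <!-- map-reduce config omitted -->
    <ok to="address-standardizer"/>
    <error to="fail"/>
  </action>
  <action name="address-standardizer">
    <!-- map-reduce config omitted -->
    <ok to="matcher-fork"/>
    <error to="fail"/>
  </action>
  <fork name="matcher-fork">
    <path start="name-matcher"/>
    <path start="phone-matcher"/>
    <path start="address-matcher"/>
  </fork>
  <!-- each matcher action transitions ok -> matcher-join, error -> fail -->
  <join name="matcher-join" to="score-combiner"/>
  <action name="score-combiner">
    <!-- map-reduce config omitted -->
    <ok to="splicer"/>
    <error to="fail"/>
  </action>
  <action name="splicer">
    <!-- map-reduce config omitted -->
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>De-dupe workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
```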
18. Backups via HBase Export

• Backups are taken before a new dataset is added or before updates to an existing dataset are applied
• Master dataset on HBase
  – Backed up before merge
  – Uses a live-cluster backup done via HBase Export
  – Data can be reimported using HBase Import
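HBase ships Export and Import as MapReduce jobs runnable from the command line against a live cluster. A sketch of how such a backup might be invoked (the table name and HDFS paths are hypothetical):

```shell
# Export the merchant table to an HDFS backup directory
# (runs as an MR job against the live cluster)
hbase org.apache.hadoop.hbase.mapreduce.Export merchant /backups/merchant-snapshot

# Later, re-import the backup into an existing table
hbase org.apache.hadoop.hbase.mapreduce.Import merchant /backups/merchant-snapshot
```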