This document summarizes an approach for de-duplicating merchant records using Hadoop and Lucene. Key elements include:
1. Fuzzy string matching of merchant names, addresses, and phone numbers is used to identify potential duplicate records. Match scores are computed as a weighted combination of per-attribute similarities.
2. Records are partitioned by state and matched against a Lucene index of reference data for that state only, to allow parallel processing.
3. The system was optimized to achieve a 10x speedup, processing 1 million records in 1 hour instead of 10 hours previously. Configuration changes and code optimizations improved performance.
4. Best practices like caching frequently used values, using HBase utilities such as TableMapReduceUtil, and abstracting HBase reads behind convenience methods improve performance, readability, and maintainability.
3. Fuzzy matching & de-duplicating merchants
Company ABC:
name: The Windsor Press, Inc.
address: PO Box 465 6 North Third Street
city: Hamburg
state: PA
zip: 19526
phone: (610) 562-2267

Company PQR:
name: The Windsor Press
address: P.O. Box 465 6 North 3rd St.
city: Hamburg
state: PA
zip: 19526-0465
phone: (610) 562-2267
Both of the above vendor records map to external Dun & Bradstreet reference data:
DUNSnum: 002114902
Name: The Windsor-Press Inc
Street: 6 N 3rd St
City: Hamburg
State: PA
Zip: 19526-1502
Phone: (610)-562-2267
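Since both vendor records share the same phone number, normalization alone already links them to the same D&B entry. A minimal Java sketch of phone normalization along these lines (the class and method names are illustrative, not from the original system):

```java
public class PhoneNormalizer {
    // Strip punctuation and whitespace, map vanity letters to keypad
    // digits, and drop anything after an extension marker like "x"/"ext".
    public static String normalize(String raw) {
        String s = raw.toLowerCase();
        // Remove extensions: "562-2267 ext. 12" -> "562-2267"
        s = s.replaceAll("(x|ext\\.?)\\s*\\d+$", "");
        StringBuilder digits = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (Character.isDigit(c)) {
                digits.append(c);
            } else if (c >= 'a' && c <= 'z') {
                digits.append(letterToDigit(c)); // 1-800-FLOWERS style
            }
            // punctuation, spaces, parentheses are dropped
        }
        return digits.toString();
    }

    private static char letterToDigit(char c) {
        final String keypad = "22233344455566677778889999"; // a..z
        return keypad.charAt(c - 'a');
    }
}
```

With this, both "(610) 562-2267" and "(610)-562-2267" normalize to the same key, "6105622267".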
5. De-duping system architecture

[Architecture diagram: Input → Data Import → Address Standardizer → name/phone/address Matchers (which consult merchant reference data) → Matcher scores → Score Combiner → Merchant Splicer → Applications such as auto-complete and transaction categorization.]
6. HBase schema example: Merchant table

Row key: 25204939
Info (column family):
  name: Crepevine
  street: 367 University Avenue
  city: Palo Alto
  state: CA
  zip: 94031
  county: Santa Clara County
  country: United States of America
  website: www.crepevine.com
  phoneNumber: 16503233900
  latitude: 37.430211
  longitude: -122.098221
  source: internet
  mint_category: Food & Dining
  qbo_category: Restaurants
  NAICS: 722110
  SIC: 5182
Mapping (column family):
  sourcename: 10000048, 10000075
7. MapReduce algorithm for matching

[Diagram: the Mapper reads an input merchant A, generates potential matches, and looks up a candidate subset (merchants A1-A4) from the Lucene index. The Reducer compares attribute values via custom matching and outputs a score between 0 and 1 per candidate pair, e.g. A:A1 0.6, A:A2 0.9, A:A3 0.4, A:A4 0.667.]
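Outside of Hadoop, the reduce step boils down to scoring each candidate and keeping the best one. A plain-Java sketch of that logic (class and method names are illustrative; in the real pipeline this runs inside a Hadoop Reducer with a custom similarity function):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

public class CandidateScorer {
    /** Score every candidate against the input merchant; scores fall in [0, 1]. */
    public static Map<String, Double> scoreAll(String merchant,
                                               Iterable<String> candidates,
                                               ToDoubleBiFunction<String, String> similarity) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (String candidate : candidates) {
            scores.put(candidate, similarity.applyAsDouble(merchant, candidate));
        }
        return scores;
    }

    /** Pick the best-scoring candidate, or null if none clears the threshold. */
    public static String bestMatch(Map<String, Double> scores, double threshold) {
        String best = null;
        double bestScore = threshold;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() >= bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Given the example scores from the diagram (A1 0.6, A2 0.9, A3 0.4, A4 0.667), `bestMatch` with a 0.5 threshold selects A2.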
8. Fuzzy-matching implementation details

• Normalization & string pre-processing
  – Case, punctuation & special characters
  – Phone numbers: letter-to-digit conversion, remove extensions
  – Biz names: special handling for common suffixes like Inc, Corp, LLC
  – USA addresses: 123 North Main Ave becomes 123 N. Main
• Jaccard and Jaro-Winkler string similarity approaches
• Final Score = (0.4 * phone confidence) + (0.25 * name confidence) + (0.35 * address confidence)
  – Two businesses with the same phone are likely to be the same business
  – Same with email address
  – A similar business name is less important
  – And sometimes two businesses share the same address
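A self-contained sketch of the normalization and scoring described above, using Jaccard token similarity (the deck also mentions Jaro-Winkler, omitted here for brevity) and the stated weights; helper names are illustrative, not from the original code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class FuzzyScore {
    /** Lowercase, strip punctuation, and drop common corporate suffixes. */
    public static String normalizeName(String name) {
        String s = name.toLowerCase().replaceAll("[^a-z0-9 ]", " ");
        s = s.replaceAll("\\b(inc|corp|llc|co|ltd)\\b", " ");
        return s.trim().replaceAll("\\s+", " ");
    }

    /** Jaccard similarity over whitespace-delimited tokens: |A∩B| / |A∪B|. */
    public static double jaccard(String a, String b) {
        Set<String> sa = new HashSet<>(Arrays.asList(a.split("\\s+")));
        Set<String> sb = new HashSet<>(Arrays.asList(b.split("\\s+")));
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    /** Final Score = 0.4*phone + 0.25*name + 0.35*address, per the slide. */
    public static double finalScore(double phone, double name, double address) {
        return 0.4 * phone + 0.25 * name + 0.35 * address;
    }
}
```

After normalization, "The Windsor Press, Inc." and "The Windsor Press" both reduce to "the windsor press" and score a perfect name match, so the shared phone number dominates the final score.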
9. 10x speedup via optimizations!

• De-duping 1 million sample merchants takes about 1 hour (previously took 10 hours)
• Writing back a sample set of 31 million records into the HBase cluster takes about 30 mins (previously took 4 hours 37 mins)
• These metrics were calculated on a 20-node Hadoop cluster (HBase installed on 5 nodes)
10. Optimizations – overall system design

Idea: partition the address match by US state to allow parallelism
1. Select the subset of the input table from a particular state (e.g. NY)
2. Apply matching to a Lucene index that contains only reference data from that state
   – Each single-state Lucene index is small, fits entirely in memory
   – Standardize the addresses, normalize the strings
   – Compare using string distance metrics
3. Run all 50 states (+ Washington DC, Puerto Rico, etc.)
   – Let Oozie run these in parallel
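The partitioning idea can be sketched in plain Java: group records by state, then process each partition independently. Here `parallelStream` stands in for Oozie, and the per-state "match" is a placeholder for the real Lucene lookup (all names are illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

public class StatePartitioner {
    /** Minimal merchant record for this sketch (illustrative fields). */
    public static final class Merchant {
        public final String name;
        public final String state;
        public Merchant(String name, String state) { this.name = name; this.state = state; }
    }

    /** Group input records by US state so each partition can be matched
     *  against a small, single-state reference index. */
    public static Map<String, List<Merchant>> partitionByState(List<Merchant> input) {
        return input.stream().collect(Collectors.groupingBy(m -> m.state));
    }

    /** Run the per-state match in parallel. The placeholder "match" just
     *  counts records; the real step queries an in-memory Lucene index
     *  built from that state's reference data. */
    public static Map<String, Integer> matchAllStates(Map<String, List<Merchant>> partitions) {
        Map<String, Integer> matched = new ConcurrentHashMap<>();
        partitions.entrySet().parallelStream().forEach(e ->
            matched.put(e.getKey(), e.getValue().size()));
        return matched;
    }
}
```

Because no two partitions share state, the per-state jobs need no coordination, which is what makes the fork-join parallelism safe.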
11. Optimizations – HBase config

Set caching parameters to make our full table scans faster:

scan.setCaching(500);
– transfers 500 rows at a time to the client to be processed
– scanner timeout exceptions are possible if you set it too high

scan.setCacheBlocks(false);
– avoids block cache churn

hbase.regionserver.lease.period = 10 minutes
– clients must report in within this period or they are considered dead
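The lease period is set cluster-side in hbase-site.xml; a sketch of what that entry might look like, with 10 minutes expressed in milliseconds:

```xml
<!-- hbase-site.xml: allow slow scan clients up to 10 minutes between
     scanner calls before the region server expires the lease -->
<property>
  <name>hbase.regionserver.lease.period</name>
  <value>600000</value>
</property>
```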
12. Optimizations – code level

Cache frequently used column family and column names as immutable byte arrays in a public interface:

public static final byte[] COLUMN_NAME = Bytes.toBytes("name");
public static final byte[] COLUMN_FAMILY_INFO = Bytes.toBytes("info");

• Improves readability
• Minor runtime performance improvement
13. Best practices – Hadoop interfacing

• For Hadoop jobs interfacing with HBase, used TableMapReduceUtil
  – On the input side (source) as well as the output side (sink)
  – Instead of doing a regular input split
• When writing to an HBase table, emitted a 'put' from the Mapper or Reducer instead of a regular HTable put
  – Use context.write(rowKey, put)
  – Much faster than doing an HTable.put(), even for a bulk put
14. Best practices – readability, maintainability

Client gets values out of Result via convenience methods:

String val = HBaseUtils.getColumnValue(result, COLUMN_FAMILY_INFO, COLUMN_NAME);
Double lat = HBaseUtils.getDoubleColumnValue(result, COLUMN_FAMILY_INFO, COLUMN_LATITUDE);
Long sicCode = HBaseUtils.getLongColumnValue(result, COLUMN_FAMILY_INFO, COLUMN_SIC);
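Such helpers are thin wrappers over an HBase Result lookup plus a type conversion. A dependency-free sketch of the same idea, with a Map standing in for HBase's Result and UTF-8 strings standing in for HBase's binary encodings (all names here are illustrative, not the original HBaseUtils):

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class HBaseUtilsSketch {
    /** Stand-in for an HBase Result: "family:qualifier" -> raw bytes. */
    public static byte[] getColumnBytes(Map<String, byte[]> row, String family, String qualifier) {
        return row.get(family + ":" + qualifier);
    }

    public static String getColumnValue(Map<String, byte[]> row, String family, String qualifier) {
        byte[] raw = getColumnBytes(row, family, qualifier);
        return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
    }

    public static Double getDoubleColumnValue(Map<String, byte[]> row, String family, String qualifier) {
        String s = getColumnValue(row, family, qualifier);
        return s == null ? null : Double.valueOf(s);
    }

    public static Long getLongColumnValue(Map<String, byte[]> row, String family, String qualifier) {
        String s = getColumnValue(row, family, qualifier);
        return s == null ? null : Long.valueOf(s);
    }
}
```

The payoff is that callers never touch byte arrays or null-check boilerplate directly, which is the readability win the slide describes.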
16. Thank You!

Michael J. Radwin
Twitter: @michael_radwin
17. MR Workflow (Oozie)

[Workflow diagram: Start → Data Import → Address Standardizer → fork into Name Matcher, Phone Matcher, and Address Matcher (fork-join) → Score Combiner → Splicer → End; any failed action transitions to the Failed end state.]
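The fork/join structure in the diagram maps directly onto an Oozie workflow definition. A sketch of what that XML might look like; the action bodies (map-reduce configuration) are omitted and all action names are illustrative:

```xml
<!-- Sketch of the fork/join shape from the diagram; <map-reduce> action
     bodies are omitted, and node names are illustrative -->
<workflow-app name="dedupe-wf" xmlns="uri:oozie:workflow:0.2">
  <start to="data-import"/>
  <action name="data-import">
    <!-- map-reduce config omitted -->
    <ok to="address-standardizer"/>
    <error to="fail"/>
  </action>
  <action name="address-standardizer">
    <!-- map-reduce config omitted -->
    <ok to="matcher-fork"/>
    <error to="fail"/>
  </action>
  <fork name="matcher-fork">
    <path start="name-matcher"/>
    <path start="phone-matcher"/>
    <path start="address-matcher"/>
  </fork>
  <!-- each matcher action transitions ok -> matcher-join, error -> fail -->
  <join name="matcher-join" to="score-combiner"/>
  <action name="score-combiner">
    <!-- map-reduce config omitted -->
    <ok to="splicer"/>
    <error to="fail"/>
  </action>
  <action name="splicer">
    <!-- map-reduce config omitted -->
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>De-dupe workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
```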
18. Backups via HBase Export

• Backups are taken before a new dataset is added or before updates to an existing dataset are applied
• Master dataset on HBase
  – Backed up before merge
  – Uses a live-cluster backup done via HBase Export
  – Data can be reimported using HBase Import
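HBase ships Export and Import as MapReduce jobs runnable from the command line against a live cluster. A sketch of how such a backup might be invoked (the table name and HDFS paths are hypothetical):

```shell
# Export the merchant table to an HDFS backup directory
# (runs as an MR job against the live cluster)
hbase org.apache.hadoop.hbase.mapreduce.Export merchant /backups/merchant-snapshot

# Later, re-import the backup into an existing table
hbase org.apache.hadoop.hbase.mapreduce.Import merchant /backups/merchant-snapshot
```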