Merchant Mastering & De-duping with Hadoop and Lucene
Hadoop in Action @ Hadoop Summit, June 13th, 2012
Michael J. Radwin, Intuit

Merchant contact information

Fuzzy matching & de-duplicating merchants

                   Company ABC                                  Company PQR

name: The Windsor Press, Inc.              name: The Windsor Press
address: PO Box 465 6 North Third Street   address: P.O. Box 465 6 North 3rd St.
city: Hamburg                              city: Hamburg
state: PA                                  state: PA
zip: 19526                                 zip: 19526-0465
phone: (610) 562-2267                      phone: (610) 562-2267



     Both of the above vendor records map to external reference data (Dun & Bradstreet):

                DUNSnum: 002114902
                Name: The Windsor-Press Inc
                Street: 6 N 3rd St
                City: Hamburg
                State: PA
                Zip: 19526-1502
                Phone: (610)-562-2267

Automatic transaction categorization

                  09/20/2010 ORCHARD SUPPLY #690 MOUNTAIN VI026460773
                  415-691-2000 320102640145034981 $20.09
De-duping system architecture

    [Architecture diagram] Numbered stages, as labeled in the figure:

    1. Import input data
    2. Address Standardizer
    3. Matchers (name, phone, address)
    4. Matcher scores
    5. Score Combiner
    6. Merchant Splicer
    7. Applications (auto-complete, transaction categorization)

    The diagram also shows a merchant reference data store.

HBase schema example: Merchant table

    Row key     Info (column family)                Mapping (column family)
    25204939    name:Crepevine                      sourcename:10000048, 10000075
                street:367 University Avenue
                city:Palo Alto
                state:CA
                zip:94031
                county:Santa Clara County
                country:United States of America
                website:www.crepevine.com
                phoneNumber:16503233900
                latitude:37.430211
                longitude:-122.098221
                source:internet
                mint_category:Food & Dining
                qbo_category:Restaurants
                NAICS:722110
                SIC:5182

MapReduce algorithm for matching

    [Diagram] Mapper: for each input merchant A, look up a subset of potential
    matches in Lucene (merchants A1, A2, A3, A4).

    Reducer: compare attribute values via custom matching and output a score
    between 0 and 1 for each pair:

        A:   A1   0.6
        A:   A2   0.9
        A:   A3   0.4
        A:   A4   0.667

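    The mapper/reducer split on this slide can be sketched in plain Java, without Hadoop or Lucene. This is an illustration only: the `Merchant` class, the ZIP-based lookup that stands in for the Lucene query, and the token-overlap score that stands in for the custom matchers are all assumptions, since the slides do not spell them out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java stand-in for the matching job: no Hadoop or Lucene involved.
class MatchSketch {

    // Hypothetical merchant record with only the fields this sketch needs.
    static final class Merchant {
        final String id;
        final String name;
        final String zip;
        Merchant(String id, String name, String zip) {
            this.id = id;
            this.name = name;
            this.zip = zip;
        }
    }

    // Reference merchants keyed by 5-digit ZIP (assumed stand-in for the Lucene index).
    static Map<String, List<Merchant>> buildIndex(List<Merchant> reference) {
        Map<String, List<Merchant>> index = new HashMap<>();
        for (Merchant m : reference) {
            index.computeIfAbsent(zip5(m.zip), k -> new ArrayList<>()).add(m);
        }
        return index;
    }

    // Reduces ZIP+4 values such as "19526-0465" to "19526".
    static String zip5(String zip) {
        return zip.length() > 5 ? zip.substring(0, 5) : zip;
    }

    // "Map" step: return the subset of potential matches for one input merchant.
    static List<Merchant> candidates(Map<String, List<Merchant>> index, Merchant input) {
        List<Merchant> found = index.get(zip5(input.zip));
        return found != null ? found : Collections.<Merchant>emptyList();
    }

    // "Reduce" step: score one (input, candidate) pair between 0 and 1
    // using the Jaccard overlap of lower-cased name tokens.
    static double score(Merchant a, Merchant b) {
        Set<String> ta = tokens(a.name);
        Set<String> tb = tokens(b.name);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(ta);
        common.retainAll(tb);
        return (double) common.size() / union.size();
    }

    // Splits on anything that is not a letter or digit and drops empty tokens.
    static Set<String> tokens(String s) {
        Set<String> out = new HashSet<>();
        for (String t : s.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Merchant> reference = Arrays.asList(
            new Merchant("r1", "The Windsor-Press Inc", "19526-1502"),
            new Merchant("r2", "Hamburg Diner", "19526"));
        Merchant input = new Merchant("a", "The Windsor Press, Inc.", "19526");
        for (Merchant c : candidates(buildIndex(reference), input)) {
            System.out.println(input.id + ": " + c.id + " " + score(input, c));
        }
    }
}
```

    Restricting the comparison to a small candidate set is what keeps the reducer's pairwise scoring affordable; the sketch does not try to reproduce the scores shown on the slide.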
Fuzzy-matching implementation details

    • Normalization & string pre-processing
       – Case, punctuation & special characters
       – Phone numbers: letter-to-digit conversion, remove extensions
       – Biz names: special handling for common suffixes like Inc, Corp, LLC
       – USA addresses: 123 North Main Ave becomes 123 N. Main
    • Jaccard and Jaro-Winkler string similarity approaches
    • Final Score = (0.4 * phone confidence) + (0.25 * name confidence) + (0.35 * address confidence)
       – Two businesses with same phone are likely to be the same business
       – Same with email address
       – Similar business name less important
       – And sometimes two businesses share the same address

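    A minimal Java sketch of the phone and business-name normalization and of the weighted final score listed on this slide. The class and method names are hypothetical, the extension formats handled ("x12", "ext. 12") are assumptions, and address standardization is left out; only the three weights come from the slide.

```java
import java.util.Locale;

// Sketch of the pre-processing and score-combination steps from the slide.
class FuzzyScoreSketch {

    // Phone: drop a trailing extension, map keypad letters to digits, keep digits only.
    static String normalizePhone(String raw) {
        String s = raw.toUpperCase(Locale.ROOT);
        // Assumed extension formats: "x12", "ext 12", "ext. 12" at the end of the string.
        s = s.replaceAll("\\s*(EXT\\.?|X)\\s*\\d+\\s*$", "");
        StringBuilder digits = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c >= '0' && c <= '9') {
                digits.append(c);
            } else if (c >= 'A' && c <= 'Z') {
                digits.append(keypadDigit(c));
            }
        }
        return digits.toString();
    }

    // Standard telephone keypad: ABC=2, DEF=3, GHI=4, JKL=5, MNO=6, PQRS=7, TUV=8, WXYZ=9.
    static char keypadDigit(char letter) {
        String keys = "22233344455566677778889999";
        return keys.charAt(letter - 'A');
    }

    // Business name: lower-case, replace punctuation with spaces,
    // collapse whitespace, then strip one trailing Inc/Corp/LLC suffix.
    static String normalizeName(String raw) {
        String s = raw.toLowerCase(Locale.ROOT)
            .replaceAll("[^a-z0-9 ]+", " ")
            .replaceAll("\\s+", " ")
            .trim();
        return s.replaceAll(" (inc|corp|llc)$", "");
    }

    // Weighted combination from the slide; each confidence is expected in [0, 1].
    static double finalScore(double phone, double name, double address) {
        return 0.4 * phone + 0.25 * name + 0.35 * address;
    }

    public static void main(String[] args) {
        System.out.println(normalizePhone("(610) 562-2267 x12"));
        System.out.println(normalizeName("The Windsor Press, Inc."));
        System.out.println(normalizeName("The Windsor-Press Inc"));
        // Phone matches exactly; name and address confidences are made-up inputs.
        System.out.println(finalScore(1.0, 0.9, 0.8));
    }
}
```

    Because the weights sum to 1.0, the final score stays between 0 and 1 whenever the three confidences do, and a phone match alone contributes 0.4.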
10x speedup via optimizations!

    • De-duping 1 million sample merchants takes about 1 hour (previously took 10 hours)
    • Writing back a sample set of 31 million records into the HBase cluster takes about 30 mins (previously took 4 hours 37 mins)
    • These metrics calculated on a 20-node Hadoop cluster (HBase installed on 5 nodes)

Optimizations – overall system design

     Idea: partition address match by US state to allow parallelism
     1.  Select subset of input table from a particular state (e.g. NY)
     2.  Apply matching to a Lucene index that contains only reference data from that state
        – Each single-state Lucene index is small, fits entirely in memory
        – Standardize the addresses, normalize the strings
        – Compare using string distance metrics
     3.  Run all 50 states (+ Washington DC, Puerto Rico, etc)
        – Let Oozie run these in parallel

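     The partition-by-state idea can be illustrated with a small Java sketch. Everything here is an assumption made for illustration: the `Row` class is hypothetical, `matchState` is a placeholder for the per-state Lucene matching, and a local thread pool stands in for Oozie running the per-state jobs in parallel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class StatePartitionSketch {

    // Hypothetical input row: an id and a two-letter state code.
    static final class Row {
        final String id;
        final String state;
        Row(String id, String state) {
            this.id = id;
            this.state = state;
        }
    }

    // Step 1: split the input by state so each partition can be matched on its own.
    static Map<String, List<Row>> partitionByState(List<Row> rows) {
        Map<String, List<Row>> byState = new TreeMap<>();
        for (Row r : rows) {
            byState.computeIfAbsent(r.state, k -> new ArrayList<>()).add(r);
        }
        return byState;
    }

    // Step 2 placeholder: a real job would match against that state's index.
    // Here it only reports how many rows the partition holds.
    static String matchState(String state, List<Row> rows) {
        return state + ":" + rows.size();
    }

    // Step 3: run every state's job concurrently and collect results in state order.
    static List<String> runAll(Map<String, List<Row>> byState, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Map.Entry<String, List<Row>> e : byState.entrySet()) {
                Callable<String> job = () -> matchState(e.getKey(), e.getValue());
                futures.add(pool.submit(job));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                try {
                    results.add(f.get());
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for a state job", ex);
                } catch (ExecutionException ex) {
                    throw new IllegalStateException("state job failed", ex.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("1", "NY"));
        rows.add(new Row("2", "CA"));
        rows.add(new Row("3", "NY"));
        System.out.println(runAll(partitionByState(rows), 4));
    }
}
```

     The partitions share no data, so each state's job needs only its own small index in memory and the jobs can run in any order.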
Optimizations – HBase config

     Set caching parameters to make our full table scans faster

     scan.setCaching(500);
        – transfers 500 rows at a time to the client to be processed
        – Scanner timeout exceptions possible if you set it too high

     scan.setCacheBlocks(false);
        – avoid the block cache churning

     hbase.regionserver.lease.period = 10 minutes
        – Clients must report in within this period else they are considered dead

Optimizations – code level

     Cache frequently used column family and column names as immutable byte arrays in a public interface

     public static final byte[] COLUMN_NAME = Bytes.toBytes("name");
     public static final byte[] COLUMN_FAMILY_INFO = Bytes.toBytes("info");

     •  Improves readability
     •  Minor runtime performance improvement

Best practices – Hadoop interfacing

     • For Hadoop jobs interfacing with HBase, used TableMapReduceUtil
        – On the input side (source) as well as the output side (sink)
        – Instead of doing a regular input split

     • When writing to HBase table, emitted a ‘put’ from Mapper or Reducer instead of a regular HTable put
        – Use context.write(rowKey, put)
        – Much faster than doing an HTable.put(), even for a bulk put

Best practices – readability, maintainability

     Client gets values out of Result via convenience methods:

     String val = HBaseUtils.getColumnValue(result,
         COLUMN_FAMILY_INFO, COLUMN_NAME);

     Double lat = HBaseUtils.getDoubleColumnValue(result,
         COLUMN_FAMILY_INFO, COLUMN_LATITUDE);

     Long sicCode = HBaseUtils.getLongColumnValue(result,
         COLUMN_FAMILY_INFO, COLUMN_SIC);

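Best practices – HBaseUtils implementation

     import org.apache.hadoop.hbase.client.Result;
     import org.apache.hadoop.hbase.util.Bytes;

     public class HBaseUtils {
         // Returns the cell value as a String, or null if the column is absent.
         public static String getColumnValue(Result result, byte[] type, byte[] columnName) {
             return Bytes.toString(result.getValue(type, columnName));
         }

         // Returns the cell value parsed as a Double, or null if it is absent or not numeric.
         public static Double getDoubleColumnValue(Result result, byte[] type, byte[] columnName) {
             try {
                 return Double.parseDouble(getColumnValue(result, type, columnName));
             } catch (Exception e) {
                 return null;
             }
         }
     }
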
Thank You!

     Michael J. Radwin
     Twitter: @michael_radwin

MR Workflow (Oozie)

     [Workflow diagram] Nodes shown: Start, Data Import, Address Standardizer (fork-join), Name Matcher, Phone Matcher, Address Matcher (fork-join), Score Combiner, Splicer, End. Steps are connected by OK transitions, with a separate Failed path.

Backups via HBase Export

     • Backups done before new dataset is added or updates of existing data set are to be added
     • Master dataset on HBase
        – Backed up before merge
        – Uses Live Cluster Backup done using HBase Export
        – Data can be reimported using HBase Import

 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 

Kürzlich hochgeladen (20)

Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 

Merchant Lookup Service Intuit

  • 1. Merchant Mastering & De-duping with Hadoop and Lucene
    Hadoop in Action @ Hadoop Summit, June 13th, 2012
    Michael J. Radwin, Intuit
  • 3. Fuzzy matching & de-duplicating merchants
    Field     Company ABC                            Company PQR
    name      The Windsor Press, Inc.                The Windsor Press
    address   PO Box 465 6 North Third Street        P.O. Box 465 6 North 3rd St.
    city      Hamburg                                Hamburg
    state     PA                                     PA
    zip       19526                                  19526-0465
    phone     (610) 562-2267                         (610) 562-2267
    Both of the above vendor records map to external reference data (Dun & Bradstreet):
    DUNSnum: 002114902
    Name: The Windsor-Press Inc
    Street: 6 N 3rd St
    City: Hamburg
    State: PA
    Zip: 19526-1502
    Phone: (610)-562-2267
  • 4. Automatic transaction categorization
    09/20/2010 ORCHARD SUPPLY #690 MOUNTAIN VI026460773 415-691-2000 320102640145034981 $20.09
  • 5. De-duping system architecture
    [Diagram] Numbered components:
    1. Import (input data)
    2. Address Standardizer
    3. Matchers (name, phone, address)
    4. Matcher scores
    5. Score Combiner
    6. Merchant Splicer
    7. Merchant reference data / Applications (auto-complete, transaction categorization)
  • 6. HBase schema example: Merchant table
    Row key: 25204939
    Info (column family):
      name: Crepevine
      street: 367 University Avenue
      city: Palo Alto
      state: CA
      zip: 94031
      county: Santa Clara County
      country: United States of America
      website: www.crepevine.com
      phoneNumber: 16503233900
      latitude: 37.430211
      longitude: -122.098221
      source: internet
      mint_category: Food & Dining
      qbo_category: Restaurants
      NAICS: 722110
      SIC: 5182
    Mapping (column family):
      sourcename: 10000048, 10000075
  • 7. MapReduce algorithm for matching
    [Diagram]
    Mapper: for input Merchant A, look up Lucene to generate a subset of potential matches (Merchants A1, A2, A3, A4)
    Reducer: compare attribute values via custom matching and output a score between 0 and 1
      A: A1 0.6
      A: A2 0.9
      A: A3 0.4
      A: A4 0.667
    Result: Matched
  • 8. Fuzzy-matching implementation details
    • Normalization & string pre-processing
      – Case, punctuation & special characters
      – Phone numbers: letter-to-digit conversion, remove extensions
      – Biz names: special handling for common suffixes like Inc, Corp, LLC
      – USA addresses: 123 North Main Ave becomes 123 N. Main
    • Jaccard and Jaro-Winkler string similarity approaches
    • Final Score = (0.4 * phone confidence) + (0.25 * name confidence) + (0.35 * address confidence)
      – Two businesses with same phone are likely to be the same business
      – Same with email address
      – Similar business name less important
      – And sometimes two businesses share the same address
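A minimal sketch of how the pieces on this slide could fit together, assuming simplified rules. Only the three weights come from the slide. The class name `MerchantScoreSketch`, the normalization regexes, the exact-match phone confidence, and the use of token-level Jaccard for names are illustrative assumptions, not the deck's implementation (Jaro-Winkler, suffix handling, letter-to-digit conversion and address standardization are left out).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; not part of the original deck.
final class MerchantScoreSketch {
    // Weights as stated on the slide.
    static final double PHONE_WEIGHT = 0.4;
    static final double NAME_WEIGHT = 0.25;
    static final double ADDRESS_WEIGHT = 0.35;

    /** Lower-cases, turns punctuation into spaces, collapses whitespace (assumed rules). */
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9 ]", " ")
                .trim()
                .replaceAll("\\s+", " ");
    }

    /** Keeps only the digits of a phone number (assumed rule; extensions are not handled). */
    static String digitsOnly(String phone) {
        return phone.replaceAll("[^0-9]", "");
    }

    /** Jaccard similarity over the sets of normalized, whitespace-separated tokens. */
    static double jaccard(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(normalize(a).split(" ")));
        Set<String> tb = new HashSet<>(Arrays.asList(normalize(b).split(" ")));
        ta.remove("");
        tb.remove("");
        if (ta.isEmpty() && tb.isEmpty()) {
            return 0.0; // no tokens on either side: treat as no evidence of a match
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    /** Weighted combination from the slide; each confidence is expected in [0, 1]. */
    static double finalScore(double phone, double name, double address) {
        return PHONE_WEIGHT * phone + NAME_WEIGHT * name + ADDRESS_WEIGHT * address;
    }

    public static void main(String[] args) {
        double phone = digitsOnly("(610) 562-2267").equals(digitsOnly("(610)-562-2267")) ? 1.0 : 0.0;
        double name = jaccard("The Windsor Press, Inc.", "The Windsor Press");
        double address = jaccard("PO Box 465 6 North Third Street", "P.O. Box 465 6 North 3rd St.");
        System.out.println(finalScore(phone, name, address));
    }
}
```

With these assumed rules, the two Windsor Press names from slide 3 share three of four distinct tokens, so their name confidence is 0.75. The low address score the sketch produces for "Third Street" versus "3rd St." illustrates why the deck standardizes addresses before comparing them.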
  • 9. 10x speedup via optimizations!
    • De-duping 1 million sample merchants takes about 1 hour (previously took 10 hours)
    • Writing back a sample set of 31 million records into the HBase cluster takes about 30 mins (previously took 4 hours 37 mins)
    • These metrics calculated on a 20-node Hadoop cluster (HBase installed on 5 nodes)
  • 10. Optimizations – overall system design
    Idea: partition address match by US state to allow parallelism
    1. Select subset of input table from a particular state (e.g. NY)
    2. Apply matching to a Lucene index that contains only reference data from that state
      – Each single-state Lucene index is small, fits entirely in memory
      – Standardize the addresses, normalize the strings
      – Compare using string distance metrics
    3. Run all 50 states (+ Washington DC, Puerto Rico, etc)
      – Let Oozie run these in parallel
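The partitioning idea can be illustrated with a small, self-contained sketch. It is not the deck's code: a Java parallel stream stands in for the Oozie-driven parallel jobs, in-memory lists stand in for the per-state Lucene indexes, and `StatePartitionSketch`, `Merchant`, `byState` and `partitionedComparisons` are hypothetical names. It only counts how many input/reference pairs would be compared once both sides are grouped by state.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration; not part of the original deck.
final class StatePartitionSketch {
    /** Minimal merchant record with illustrative fields. */
    static final class Merchant {
        final String name;
        final String state;

        Merchant(String name, String state) {
            this.name = name;
            this.state = state;
        }
    }

    /** Groups merchants by state code, mirroring one small index per state. */
    static Map<String, List<Merchant>> byState(List<Merchant> merchants) {
        return merchants.stream()
                .collect(Collectors.groupingBy(m -> m.state, TreeMap::new, Collectors.toList()));
    }

    /** Counts comparisons when each input record is only matched against same-state reference data. */
    static long partitionedComparisons(List<Merchant> input, List<Merchant> reference) {
        Map<String, List<Merchant>> refByState = byState(reference);
        // Each state is independent, so the per-state work can run in parallel.
        return byState(input).entrySet().parallelStream()
                .mapToLong(e -> (long) e.getValue().size()
                        * refByState.getOrDefault(e.getKey(), Collections.<Merchant>emptyList()).size())
                .sum();
    }

    public static void main(String[] args) {
        List<Merchant> input = Arrays.asList(
                new Merchant("in-1", "NY"), new Merchant("in-2", "NY"), new Merchant("in-3", "PA"));
        List<Merchant> reference = Arrays.asList(
                new Merchant("ref-1", "NY"), new Merchant("ref-2", "NY"), new Merchant("ref-3", "NY"),
                new Merchant("ref-4", "PA"), new Merchant("ref-5", "CA"), new Merchant("ref-6", "CA"));
        System.out.println("partitioned:   " + partitionedComparisons(input, reference));
        System.out.println("unpartitioned: " + (long) input.size() * reference.size());
    }
}
```

In this toy data set, partitioning cuts the candidate pairs from 18 to 7, and each state's share of the work touches only that state's reference records, which is what allows a per-state index to stay small.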
  • 11. Optimizations – HBase config
    Set caching parameters to make our full table scans faster
    scan.setCaching(500);
      – transfers 500 rows at a time to the client to be processed
      – Scanner timeout exceptions possible if you set it too high
    scan.setCacheBlocks(false);
      – avoid the block cache churning
    hbase.regionserver.lease.period = 10 minutes
      – Clients must report in within this period else they are considered dead
  • 12. Optimizations – code level
    Cache frequently used column family and column names as immutable byte arrays in a public interface
      public static final byte[] COLUMN_NAME = Bytes.toBytes("name");
      public static final byte[] COLUMN_FAMILY_INFO = Bytes.toBytes("info");
    • Improves readability
    • Minor runtime performance improvement
  • 13. Best practices – Hadoop interfacing
    • For Hadoop jobs interfacing with HBase, used TableMapReduceUtil
      – On the input side (source) as well as the output side (sink)
      – Instead of doing a regular input split
    • When writing to HBase table, emitted a ‘put’ from Mapper or Reducer instead of a regular HTable put
      – Use context.write(rowKey, put)
      – Much faster than doing an HTable.put(), even for a bulk put
  • 14. Best practices – readability, maintainability
    Client gets values out of Result via convenience methods:
      String val = HBaseUtils.getColumnValue(result, COLUMN_FAMILY_INFO, COLUMN_NAME);
      Double lat = HBaseUtils.getDoubleColumnValue(result, COLUMN_FAMILY_INFO, COLUMN_LATITUDE);
      Long sicCode = HBaseUtils.getLongColumnValue(result, COLUMN_FAMILY_INFO, COLUMN_SIC);
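  • 15. Best practices – HBaseUtils implementation
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.util.Bytes;

      public class HBaseUtils {
          // Returns the cell value as a String, or null if the cell is absent.
          public static String getColumnValue(Result result, byte[] type, byte[] columnName) {
              return Bytes.toString(result.getValue(type, columnName));
          }

          // Returns the cell value parsed as a Double, or null if it is absent or not numeric.
          public static Double getDoubleColumnValue(Result result, byte[] type, byte[] columnName) {
              try {
                  return Double.parseDouble(getColumnValue(result, type, columnName));
              } catch (Exception e) {
                  return null;
              }
          }
      }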
  • 16. Thank You!
    Michael J. Radwin
    Twitter: @michael_radwin
  • 17. MR Workflow (Oozie)
    [Workflow diagram] Nodes: Start, Data Import, Address Standardizer (fork-join), Name Matcher, Phone Matcher, Address Matcher (fork-join), Score Combiner, Splicer, End. Steps are linked by OK transitions, with a Failed exit.
  • 18. Backups via HBase Export
    • Backups done before new dataset is added or updates of existing data set are to be added
    • Master dataset on HBase
      – Backed up before merge
      – Uses Live Cluster Backup done using HBase Export
      – Data can be reimported using HBase Import