
Tamr // Data Driven NYC // September 2014

Tamr Co-Founder Ihab Ilyas presented at September 2014's edition of Data Driven NYC. Tamr is a data connection platform.

Published in: Technology

  1. Moving Data Curation from Theory to Practice. Ihab Ilyas, University of Waterloo.
  2. [Slide without text]
  3. Data Curation: Many Definitions and One Goal. Extract value from data.
  4. Many Technical Challenges: Automatic Schema Mapping (figure label: <Part Number>). Example: a global equipment manufacturer with thousands of products across hundreds of databases from multiple suppliers, all talking about the same part numbers.
  5. Many Technical Challenges: Record Linkage and Deduplication (figure: Record 1, Record 2, Record 3 and Record 4 linked into one Unified Record). Example: Thomson Reuters spent 6 months on a single deduplication project of a subset of their data sources.
  6. Many Technical Challenges: Missing Values. Example: most real data collected from sensors, surveys, and agents has a high percentage of N/A or null entries, special values (99999), etc.
  7. Common Data Quality Issues

         ID  Name   ZIP    City           State  Income
         1   Green  60610  Chicago        IL     30k
         2   Green  60611  Chicago        IL     32k
         3   Peter         New Yrk        NY     40k
         4   John   11507  New York       NY     40k
         5   Gree   90057  Los Angeles    CA     55k
         6   Chuck  90057  San Francisco  CA     30k
         1   Green  60610  Chicago        IL     31k

     Annotations on the slide: Duplicates, Syntactic Error, Missing Value, Integrity Constraint Violation. Repair values shown next to the table: 11507, New York, Los Angeles.
  8. Are We Missing the Real Challenges? "Poor data quality is the norm rather than the exception, but most organizations are in a state of denial about this issue." (Gartner Group)
  9. Realities of Data Curation Efforts: data is owned by people and is not an orphan. Result: fully automated cleaning will probably never be adopted in an enterprise setting. Figure labels: Data Stewards, IT, Data Experts, Constant Interaction.
  10. Realities of Data Curation Efforts: scale renders most solutions un-deployable. Result: all cleaning algorithms, including record linkage, need to be rethought to work at scale and avoid quadratic complexity. De-duping one million records naïvely can take weeks (even on a big machine).
  11. Realities of Data Curation Efforts: data variety is even worse. Result: curation requires its own stack, including transformations and adaptors.
  12. Realities of Data Curation Efforts: iterative by nature, not by design. Result: curation needs to be incremental and agile, with low startup overhead. A curation solution should be a part of the data production line. Figure label: Data stream.
  13. Pragmatic ≠ Unprincipled
  14. Leverage All Data [CIDR 2013]
      • Data ownership
      • Scale
      • Variety
      • Incremental cleaning
  15. Tamr: Machine Learning with Human Insight. Identify sources, understand relationships and curate the massive variety of siloed data. Diagram labels: Structured and Semi-structured Data Sources; Collaborative Curation; Data Experts (Source owners); Data Stewards and Curators; Data Inventory; APIs; Systems; Tools; Data Scientists; Advanced Algorithms & Machine Learning; Expert Input; Integrated Data & Metadata; Expert Directory. Highlighted: Data Ownership, Non-programmatic Interfaces.
  16. Tamr: Machine Learning with Human Insight (same text and diagram as slide 15). Highlighted: Scale & Variety.
  17. Tamr: Machine Learning with Human Insight (same text and diagram as slide 15). Highlighted: Incremental.
  18. Example Tamr Functionality: Entity Resolution

      Relation with duplicates:

          ID  name   ZIP    Income
          P1  Green  51519  30k
          P2  Green  51518  32k
          P3  Peter  30528  40k
          P4  Peter  30528  40k
          P5  Gree   51519  55k
          P6  Chuck  51519  30k

      Steps shown: Compute Pair-wise Similarity, Cluster Similar Records, Merge Clusters.
      Similarity graph: nodes P1 to P6 with edge weights 0.9, 0.3, 0.5 and 1.0 (which pair each weight belongs to is not recoverable from this transcript). Clusters: C1, C2, C3.

      Clean relation:

          ID  name   ZIP    Income
          C1  Green  51519  39k
          C2  Peter  30528  40k
          C3  Chuck  51519  30k
  19. Tamr Solution
      • Object linkage model
        – Treats schema mapping and record linkage as one process
        – Feature extraction and evidence accumulation
      • Novel fuzzy blocking
        – A hierarchy of classifiers, from binning candidates in the same comparison clusters to aggressive linkage of duplicates
      • Open channel with humans in different capacities
        – Experts to increase the training sets
        – Stewards for verification and application of updates
        – IT for modeling user enterprise rules
  20. Tamr Architecture
  21. Suggestions to Build a Unified Schema
  22. ML to Match Records Across Sources
  23. Thank you. www.tamr.com
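
The checks named on slides 6 and 7 (missing values hidden behind special codes, and an integrity constraint between ZIP and City) can be illustrated with a small Python sketch. It is not taken from the talk: the sentinel set, the helper names `normalise` and `fd_violations`, and the choice to encode row 3's empty ZIP as "N/A" are all assumptions made for illustration. The rows themselves are the ones on slide 7.

```python
from collections import defaultdict

# Assumed encodings of "missing"; 99999 is the special value mentioned on slide 6.
SENTINELS = {"", "N/A", "null", "99999"}


def normalise(value):
    """Map sentinel encodings of a missing value to None; strip whitespace otherwise."""
    cleaned = value.strip()
    return None if cleaned in SENTINELS else cleaned


# Rows from slide 7 (ZIP and City only). Row 3's ZIP is blank on the slide;
# it is written as "N/A" here purely as an illustrative encoding.
ROWS = [
    {"id": 1, "zip": "60610", "city": "Chicago"},
    {"id": 2, "zip": "60611", "city": "Chicago"},
    {"id": 3, "zip": "N/A", "city": "New Yrk"},
    {"id": 4, "zip": "11507", "city": "New York"},
    {"id": 5, "zip": "90057", "city": "Los Angeles"},
    {"id": 6, "zip": "90057", "city": "San Francisco"},
    {"id": 1, "zip": "60610", "city": "Chicago"},
]


def fd_violations(rows, lhs, rhs):
    """Return lhs values that map to more than one rhs value (lhs -> rhs is violated)."""
    seen = defaultdict(set)
    for row in rows:
        left, right = normalise(row[lhs]), normalise(row[rhs])
        if left is not None and right is not None:
            seen[left].add(right)
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


missing_zip_ids = [row["id"] for row in ROWS if normalise(row["zip"]) is None]
violations = fd_violations(ROWS, "zip", "city")
print(missing_zip_ids)
print(violations)
```

A check like this only flags the conflict on ZIP 90057; deciding which city is correct, or that "New Yrk" is a misspelling, still needs other evidence or a human, which is the point slide 9 makes.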
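
Slide 10 warns about quadratic complexity in record linkage, and slide 19 names blocking as part of the answer. The sketch below shows plain key-based blocking, a standard textbook technique, and not the hierarchy of classifiers that slide 19 describes. The blocking key (first three ZIP digits) and the function names are assumptions chosen for illustration; the records are those from slide 18.

```python
from collections import defaultdict
from itertools import combinations

# Records from slide 18: id -> (name, ZIP).
RECORDS = {
    "P1": ("Green", "51519"),
    "P2": ("Green", "51518"),
    "P3": ("Peter", "30528"),
    "P4": ("Peter", "30528"),
    "P5": ("Gree", "51519"),
    "P6": ("Chuck", "51519"),
}


def block_key(record):
    """Hypothetical blocking key: the first three digits of the ZIP code."""
    return record[1][:3]


def candidate_pairs(records):
    """Group records by blocking key and pair up only records sharing a block."""
    blocks = defaultdict(list)
    for rid, record in records.items():
        blocks[block_key(record)].append(rid)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


all_pairs = list(combinations(RECORDS, 2))
pairs = candidate_pairs(RECORDS)
print(len(all_pairs), "pairs without blocking,", len(pairs), "with blocking")
```

On six records the saving is small (15 comparisons down to 7), but the number of comparisons now grows with the block sizes instead of the square of the table size. The trade-off is that true duplicates with different keys are never compared, which is presumably why the talk speaks of fuzzy blocking.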
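
The three steps on slide 18 (compute pair-wise similarity, cluster similar records, merge clusters) can be sketched as follows. This is a minimal illustration under stated assumptions, not Tamr's algorithm: the similarity function (a 0.7/0.3 weighted mix of `difflib` string ratios on name and ZIP), the 0.8 threshold, the union-find clustering and the merge rule (most common name and ZIP, mean income) are all choices made here, so the scores will not match the edge weights on the slide. The mean-income rule was picked because it reproduces the 39k shown for C1.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

# Relation with duplicates from slide 18: id -> (name, ZIP, income in thousands).
RECORDS = {
    "P1": ("Green", "51519", 30),
    "P2": ("Green", "51518", 32),
    "P3": ("Peter", "30528", 40),
    "P4": ("Peter", "30528", 40),
    "P5": ("Gree", "51519", 55),
    "P6": ("Chuck", "51519", 30),
}


def similarity(a, b):
    """Assumed score: weighted string similarity of name (0.7) and ZIP (0.3)."""
    name_sim = SequenceMatcher(None, a[0], b[0]).ratio()
    zip_sim = SequenceMatcher(None, a[1], b[1]).ratio()
    return 0.7 * name_sim + 0.3 * zip_sim


def cluster(records, threshold=0.8):
    """Union-find over all pairs whose similarity reaches the threshold."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Naive all-pairs comparison: quadratic, acceptable only for a toy table.
    for a, b in combinations(records, 2):
        if similarity(records[a], records[b]) >= threshold:
            parent[find(b)] = find(a)

    groups = {}
    for rid in records:
        groups.setdefault(find(rid), []).append(rid)
    return list(groups.values())


def merge(records, members):
    """Merge one cluster: most common name and ZIP, mean income."""
    rows = [records[m] for m in members]
    name = Counter(r[0] for r in rows).most_common(1)[0][0]
    zip_code = Counter(r[1] for r in rows).most_common(1)[0][0]
    income = sum(r[2] for r in rows) / len(rows)
    return (name, zip_code, income)


clusters = cluster(RECORDS)
clean = [merge(RECORDS, members) for members in clusters]
for members, row in zip(clusters, clean):
    print(members, "->", row)
```

The all-pairs loop in `cluster` is exactly the quadratic step that slide 10 says does not scale; in practice it would be restricted to candidate pairs produced by some form of blocking.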
