SlideShare ist ein Scribd-Unternehmen logo
1 von 44
Downloaden Sie, um offline zu lesen
Matching Dirty Data

   Yet another wheel



             Jeff Sherwood, Programmer.
             Anjanette Young, Systems Librarian.
             University of Washington, Libraries.
DSpace Repository
 Goal   Ingest Metadata and PDF's for ETD's received from
        UMI into a DSpace repository.
Electronic Theses & Dissertations


       Sources       Output
MARC Fields


UMI Records       III Records


=001 (Filename)   =001   (OCLC number)
=520 (Abstract)   =100   (Author)
                  =245   (Title)
                  =260   (Date published)
                  =502   (type and date)
                  =695   (Department)
                  =941   (Local identifier)
dublin_core.xml

<dublin_core>
  <dcvalue element="identifier" qualifier="other">
    iii[941]</dcvalue>
  <dcvalue element="title" qualifier="none">
    iii[245][a][b]</dcvalue>
  <dcvalue element="contributor" qualifier="author">
    iii[100][a][b][c]</dcvalue>
  <dcvalue element="description" qualifier="abstract">
    umi[520][a]</dcvalue>
  <dcvalue element="subject" qualifier="other">
    iii[655][a][x]</dcvalue>
</dublin_core>
MARC Loader . . . No.


|||0|0| | |0|n|G|0|@ov_action="o"
|||0|0| | |0|n|G|0
|@ov_protect="b=V0123456789d(690,695:d)
                      hn(590:d)y(099,249,852,856:d)y(910,925,
                      980,981)F26"
035|001 |+|0|0|b|o|0|y|N|0|%001(start="1-9",char="!-~")
245||+|0|0|b|t|0|y|N|0|%bracket="h"
500-599||+|0|0|b|n|0|y|N|0|
600-651||-w|0|0|b|d|0|y|N|0|
653-657||+|0|0|b|d|0|y|N|0|
690-699||-w|0|0|b|d|0|y|N|0|
700-715||-w|0|0|b|b|0|y|N|0|
730-740||-w|0|0|b|f|0|y|N|0|
Matching overview


Ham-fisted Method
1. Exact Title + Exact Author
2. Exact Title + Shortened Author

Cool Math Method
Calculate Similarity of Title
Calculate Similarity of Author
1. Exact Title + Fuzzy Author
2. Fuzzy Title + Fuzzy Author
3. Fuzzy Title or Fuzzy Author
Pymarc - the MARC Hammer


umi_dict = {
 Alaskan Bootlegger: {author: Leon Kania, umi_count = 1},
 title2_value: {author: author2_value, umi_count = index2},
 ...
  }

iii_dict = {
   Alaskan Bootlegger: {author: Leon W. Kania, iii_count = 9},
  title2_value: {author: author2_value, iii_count = index2},
  ...
   }
Exact title + exact author

# Exact Title
# Create sets out of the dictionary keys
umi_set = set(umi_dict.iterkeys())
iii_set = set(iii_dict.iterkeys())

# Find the Intersection of sets.
title_match = umi_set & iii_set


# Verify Intersection with Exact Author
for x in title_match:
   if umi_dict[x][author] == iii_dict[x][author]:
       . . . do stuff.
Exact title + Truncated author


  def shortenAuthorName(name):
    #Leon W. Kania        ->     [Leon, W., Kania]
    namelist = str(name).split()
    if len(namelist) > 2:
        shortname = "%s %s" % (namelist[0], namelist[-1])
    else:
        shortname = name
    return shortname
"If you break three spokes, it is time for a rebuild"
                      Charles Hadrann,
                      "Hadrann Wheelcraft Method – Part 1 Lacing"
Rogues
Gallery
USE OF CROWN LENGTH TO DEFINE
STEM FORM: SEGMENTED TAPER
EQUATION (DOUGLAS FIR)


Use of crown length to define stem form
:: segmented taper equation
Towards an understanding of seismic
performance of three-dimensional
structures: Stability and reliability

Towards an understanding of seismic
performance of 3D structures :: stability
& reliability
Hoekstra, Hopi Danielle Elisabeth

Hoekstra, Danielle E
Arnason, Halldor

Halldór Árnason
Levenshtein Edit
   Distance
Edit distance is the number of
operations required to transform one
string of characters into the another.
How many steps to turn
kitten
into sitting?
3
kitten ➔ sitten    (k changes to s)

sitten ➔ sittin    (e changes to i)

sittin ➔ sitting   (insert g)
LD is Always...


   ≥ difference in string lengths
   ≤ length of the longer string
   = 0 if the strings are identical
Similarity Score
Optimizations
Reduce the Search Space


   "A stochastic model
   of cyclical interaction
   processes"




                             All titles
Reduce the Search Space

    Identify Stopwords
    the: 24587         Throw out common
    for: 7643          words in titles
    with: 3323
    effects: 1958
    evaluation: 1073
    ...
    hypoxic: 1         Keep the rarer ones
    reduplication: 1
    picaresque: 1
    emperador 1
    heteroduplex 1
Reduce the Search Space

      Extract Significant Words

 "Stochastic             stochastic
 models for DNA          dna
 sequence data"          sequence
Reduce the Search Space


  rec = {'title': 'Stochastic models...',}

  index['stochastic'].append(rec)
  index['dna'].append(rec)
  index['sequence'].append(rec)
Reduce the Search Space


                             index['stochastic']


{'title': "Stochastic models for DNA sequence data", ...}
{'title': "A stochastic model of clan systems", ...}
{'title': "A stochastic model of cyclical interaction processes", ...}
{'title': "Stochastic reliability models for maintained systems", ...}
{'title': "Uniform approximation and almost periodicity of doubly stochastic operators", ...}
Normalize Names



  Hoekstra, Hopi Danielle Elisabeth

  Hoekstra, Danielle E
Normalize Names



          Hoekstra, H

          Hoekstra, D
Normalize Names



        Arnason, Halldor

        Halldór Árnason
Normalize Names



          Arnason, H

          Árnason, H
Improvements
Jaro-Winkler Algorithm
What's a "match"?

  Two characters match if they are a
  reasonable distance from one another
  as defined by:
Example



   s1 = Martha
   s2 = Marhta
Example



   s1 = Martha
   s2 = Marhta
Jaro-Winkler works
best for short strings
Resources
Levenshtein & Jaro-Winkler

  Background
  http://en.wikipedia.org/wiki/Levenshtein_distance
  http://en.wikipedia.org/wiki/Jaro-Winkler_distance


  Code
  http://pypi.python.org/pypi/editdist/0.1
  http://pypi.python.org/pypi/python-Levenshtein/0.10.1
Miscellaneous

  String Comparison Tutorial
  http://bit.ly/ZGSmF

  SecondString - Java text analysis library
  http://secondstring.sourceforge.net/

  MarcXimiL - MARC de-duping package
  http://marcximil.sourceforge.net/
http://snurl.com/uggtn

Weitere ähnliche Inhalte

Was ist angesagt?

Indexing and Query Optimizer (Aaron Staple)
Indexing and Query Optimizer (Aaron Staple)Indexing and Query Optimizer (Aaron Staple)
Indexing and Query Optimizer (Aaron Staple)
MongoSF
 

Was ist angesagt? (19)

Towards Parallel Nonmonotonic Reasoning with Billions of Facts
Towards Parallel Nonmonotonic Reasoning with Billions of FactsTowards Parallel Nonmonotonic Reasoning with Billions of Facts
Towards Parallel Nonmonotonic Reasoning with Billions of Facts
 
R Language Introduction
R Language IntroductionR Language Introduction
R Language Introduction
 
RDataMining slides-time-series-analysis
RDataMining slides-time-series-analysisRDataMining slides-time-series-analysis
RDataMining slides-time-series-analysis
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
 
RDataMining slides-clustering-with-r
RDataMining slides-clustering-with-rRDataMining slides-clustering-with-r
RDataMining slides-clustering-with-r
 
Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013
 
Cassandra Day Atlanta 2015: Data Modeling 101
Cassandra Day Atlanta 2015: Data Modeling 101Cassandra Day Atlanta 2015: Data Modeling 101
Cassandra Day Atlanta 2015: Data Modeling 101
 
Enter The Matrix
Enter The MatrixEnter The Matrix
Enter The Matrix
 
Java lecture 7 1
Java lecture 7 1Java lecture 7 1
Java lecture 7 1
 
Functional es6
Functional es6Functional es6
Functional es6
 
R programming language
R programming languageR programming language
R programming language
 
User biglm
User biglmUser biglm
User biglm
 
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개RUCK 2017 MxNet과 R을 연동한 딥러닝 소개
RUCK 2017 MxNet과 R을 연동한 딥러닝 소개
 
Indexing and Query Optimizer (Aaron Staple)
Indexing and Query Optimizer (Aaron Staple)Indexing and Query Optimizer (Aaron Staple)
Indexing and Query Optimizer (Aaron Staple)
 
Rsplit apply combine
Rsplit apply combineRsplit apply combine
Rsplit apply combine
 
Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on r
 
R intro 20140716-advance
R intro 20140716-advanceR intro 20140716-advance
R intro 20140716-advance
 
Cloudera - A Taste of random decision forests
Cloudera - A Taste of random decision forestsCloudera - A Taste of random decision forests
Cloudera - A Taste of random decision forests
 
30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature Selection30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature Selection
 

Ähnlich wie Matching Dirty Data

Slides
SlidesSlides
Slides
butest
 
The Geometry of Learning
The Geometry of LearningThe Geometry of Learning
The Geometry of Learning
fridolin.wild
 
Exploring C# DSLs: LINQ, Fluent Interfaces and Expression Trees
Exploring C# DSLs: LINQ, Fluent Interfaces and Expression TreesExploring C# DSLs: LINQ, Fluent Interfaces and Expression Trees
Exploring C# DSLs: LINQ, Fluent Interfaces and Expression Trees
rasmuskl
 
ECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERING
George Simov
 

Ähnlich wie Matching Dirty Data (20)

Recursion schemes in Scala
Recursion schemes in ScalaRecursion schemes in Scala
Recursion schemes in Scala
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
POLITEKNIK MALAYSIA
POLITEKNIK MALAYSIAPOLITEKNIK MALAYSIA
POLITEKNIK MALAYSIA
 
CS101- Introduction to Computing- Lecture 26
CS101- Introduction to Computing- Lecture 26CS101- Introduction to Computing- Lecture 26
CS101- Introduction to Computing- Lecture 26
 
Efficient blocking method for a large scale citation matching
Efficient blocking method for a large scale citation matchingEfficient blocking method for a large scale citation matching
Efficient blocking method for a large scale citation matching
 
The Scala Programming Language
The Scala Programming LanguageThe Scala Programming Language
The Scala Programming Language
 
Ggplot2 v3
Ggplot2 v3Ggplot2 v3
Ggplot2 v3
 
R language introduction
R language introductionR language introduction
R language introduction
 
Hands On Spring Data
Hands On Spring DataHands On Spring Data
Hands On Spring Data
 
Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)
 
Slides
SlidesSlides
Slides
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
The Geometry of Learning
The Geometry of LearningThe Geometry of Learning
The Geometry of Learning
 
Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter Data
 
Text classification using Text kernels
Text classification using Text kernelsText classification using Text kernels
Text classification using Text kernels
 
Class 31: Deanonymizing
Class 31: DeanonymizingClass 31: Deanonymizing
Class 31: Deanonymizing
 
Exploring C# DSLs: LINQ, Fluent Interfaces and Expression Trees
Exploring C# DSLs: LINQ, Fluent Interfaces and Expression TreesExploring C# DSLs: LINQ, Fluent Interfaces and Expression Trees
Exploring C# DSLs: LINQ, Fluent Interfaces and Expression Trees
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
ECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERINGECO_TEXT_CLUSTERING
ECO_TEXT_CLUSTERING
 
Everything is composable
Everything is composableEverything is composable
Everything is composable
 

Kürzlich hochgeladen

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Kürzlich hochgeladen (20)

Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformLess Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern Enterprise
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software Engineering
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
 

Matching Dirty Data

  • 1. Matching Dirty Data Yet another wheel Jeff Sherwood, Programmer. Anjanette Young, Systems Librarian. University of Washington, Libraries.
  • 2. DSpace Repository Goal Ingest Metadata and PDF's for ETD's received from UMI into a DSpace repository.
  • 3. Electronic Theses & Dissertations Sources Output
  • 4. MARC Fields UMI Records III Records =001 (Filename) =001 (OCLC number) =520 (Abstract) =100 (Author) =245 (Title) =260 (Date published) =502 (type and date) =695 (Department) =941 (Local identifier)
  • 5. dublin_core.xml <dublin_core> <dcvalue element="identifier" qualifier="other"> iii[941]</dcvalue> <dcvalue element="title" qualifier="none"> iii[245][a][b]</dcvalue> <dcvalue element="contributor" qualifier="author"> iii[100][a][b][c]</dcvalue> <dcvalue element="description" qualifier="abstract"> umi[520][a]</dcvalue> <dcvalue element="subject" qualifier="other"> iii[655][a][x]</dcvalue> </dublin_core>
  • 6. MARC Loader . . . No. |||0|0| | |0|n|G|0|@ov_action="o" |||0|0| | |0|n|G|0 |@ov_protect="b=V0123456789d(690,695:d) hn(590:d)y(099,249,852,856:d)y(910,925, 980,981)F26" 035|001 |+|0|0|b|o|0|y|N|0|%001(start="1-9",char="!-~") 245||+|0|0|b|t|0|y|N|0|%bracket="h" 500-599||+|0|0|b|n|0|y|N|0| 600-651||-w|0|0|b|d|0|y|N|0| 653-657||+|0|0|b|d|0|y|N|0| 690-699||-w|0|0|b|d|0|y|N|0| 700-715||-w|0|0|b|b|0|y|N|0| 730-740||-w|0|0|b|f|0|y|N|0|
  • 7. Matching overview Ham-fisted Method 1. Exact Title + Exact Author 2. Exact Title + Shortened Author Cool Math Method Calculate Similarity of Title Calculate Similarity of Author 1. Exact Title + Fuzzy Author 2. Fuzzy Title + Fuzzy Author 3. Fuzzy Title or Fuzzy Author
  • 8. Pymarc - the MARC Hammer umi_dict = { Alaskan Bootlegger: {author: Leon Kania, umi_count = 1}, title2_value: {author: author2_value, umi_count = index2}, ... } iii_dict = { Alaskan Bootlegger: {author: Leon W. Kania, iii_count = 9}, title2_value: {author: author2_value, iii_count = index2}, ... }
  • 9. Exact title + exact author # Exact Title # Create sets out of the dictionary keys umi_set = set(umi_dict.iterkeys()) iii_set = set(iii_dict.iterkeys()) # Find the Intersection of sets. title_match = umi_set & iii_set # Verify Intersection with Exact Author for x in title_match: if umi_dict[x][author] == iii_dict[x][author]: . . . do stuff.
  • 10. Exact title + Truncated author def shortenAuthorName(name): #Leon W. Kania -> [Leon, W., Kania] namelist = str(name).split() if len(namelist) > 2: shortname = "%s %s" % (namelist[0], namelist[-1]) else: shortname = name return shortname
  • 11. "If you break three spokes, it is time for a rebuild" Charles Hadrann, "Hadrann Wheelcraft Method – Part 1 Lacing"
  • 13. USE OF CROWN LENGTH TO DEFINE STEM FORM: SEGMENTED TAPER EQUATION (DOUGLAS FIR) Use of crown length to define stem form :: segmented taper equation
  • 14. Towards an understanding of seismic performance of three-dimensional structures: Stability and reliability Towards an understanding of seismic performance of 3D structures :: stability & reliability
  • 15. Hoekstra, Hopi Danielle Elisabeth Hoekstra, Danielle E
  • 17.
  • 18. Levenshtein Edit Distance
  • 19. Edit distance is the number of operations required to transform one string of characters into the another.
  • 20. How many steps to turn kitten into sitting?
  • 21. 3
  • 22. kitten ➔ sitten (k changes to s) sitten ➔ sittin (e changes to i) sittin ➔ sitting (insert g)
  • 23. LD is Always... ≥ difference in string lengths ≤ length of the longer string = 0 if the strings are identical
  • 26. Reduce the Search Space "A stochastic model of cyclical interaction processes" All titles
  • 27. Reduce the Search Space Identify Stopwords the: 24587 Throw out common for: 7643 words in titles with: 3323 effects: 1958 evaluation: 1073 ... hypoxic: 1 Keep the rarer ones reduplication: 1 picaresque: 1 emperador 1 heteroduplex 1
  • 28. Reduce the Search Space Extract Significant Words "Stochastic stochastic models for DNA dna sequence data" sequence
  • 29. Reduce the Search Space rec = {'title': 'Stochastic models...',} index['stochastic'].append(rec) index['dna'].append(rec) index['sequence'].append(rec)
  • 30. Reduce the Search Space index['stochastic'] {'title': "Stochastic models for DNA sequence data", ...} {'title': "A stochastic model of clan systems", ...} {'title': "A stochastic model of cyclical interaction processes", ...} {'title': "Stochastic reliability models for maintained systems", ...} {'title': "Uniform approximation and almost periodicity of doubly stochastic operators", ...}
  • 31. Normalize Names Hoekstra, Hopi Danielle Elisabeth Hoekstra, Danielle E
  • 32. Normalize Names Hoekstra, H Hoekstra, D
  • 33. Normalize Names Arnason, Halldor Halldór Árnason
  • 34. Normalize Names Arnason, H Árnason, H
  • 37. What's a "match"? Two characters match if they are a reasonable distance from one another as defined by:
  • 38. Example s1 = Martha s2 = Marhta
  • 39. Example s1 = Martha s2 = Marhta
  • 42. Levenshtein & Jaro-Winkler Background http://en.wikipedia.org/wiki/Levenshtein_distance http://en.wikipedia.org/wiki/Jaro-Winkler_distance Code http://pypi.python.org/pypi/editdist/0.1 http://pypi.python.org/pypi/python-Levenshtein/0.10.1
  • 43. Miscellaneous String Comparison Tutorial http://bit.ly/ZGSmF SecondString - Java text analysis library http://secondstring.sourceforge.net/ MarcXimiL - MARC de-duping package http://marcximil.sourceforge.net/