Matching Dirty Data

   Yet another wheel



             Jeff Sherwood, Programmer.
             Anjanette Young, Systems Librarian.
             University of Washington, Libraries.
DSpace Repository
 Goal   Ingest metadata and PDFs for ETDs received from
        UMI into a DSpace repository.
Electronic Theses & Dissertations


       Sources       Output
MARC Fields


UMI Records       III Records


=001 (Filename)   =001   (OCLC number)
=520 (Abstract)   =100   (Author)
                  =245   (Title)
                  =260   (Date published)
                  =502   (type and date)
                  =695   (Department)
                  =941   (Local identifier)
dublin_core.xml

<dublin_core>
  <dcvalue element="identifier" qualifier="other">
    iii[941]</dcvalue>
  <dcvalue element="title" qualifier="none">
    iii[245][a][b]</dcvalue>
  <dcvalue element="contributor" qualifier="author">
    iii[100][a][b][c]</dcvalue>
  <dcvalue element="description" qualifier="abstract">
    umi[520][a]</dcvalue>
  <dcvalue element="subject" qualifier="other">
    iii[655][a][x]</dcvalue>
</dublin_core>
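
The mapping above can be sketched with Python's standard xml.etree.ElementTree module. This is only an illustration under stated assumptions: the function name build_dublin_core and the list of already-extracted (element, qualifier, text) triples are hypothetical, and pulling the subfields out of the III and UMI MARC records is left out.

```python
import xml.etree.ElementTree as ET


def build_dublin_core(values):
    """Build a <dublin_core> element from (element, qualifier, text) triples.

    `values` is a hypothetical list of triples holding text that has
    already been extracted from the III and UMI records.
    """
    root = ET.Element("dublin_core")
    for element, qualifier, text in values:
        dcvalue = ET.SubElement(root, "dcvalue", element=element, qualifier=qualifier)
        dcvalue.text = text
    return root


# Placeholder values standing in for iii[941], iii[245], iii[100] and umi[520].
example = [
    ("identifier", "other", "local-id-placeholder"),
    ("title", "none", "Alaskan Bootlegger"),
    ("contributor", "author", "Leon W. Kania"),
    ("description", "abstract", "Abstract text from the UMI record."),
]

xml_text = ET.tostring(build_dublin_core(example), encoding="unicode")
```

Keeping the field extraction separate from the XML writing means the same writer can be reused once the UMI and III records have been matched.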
MARC Loader . . . No.


|||0|0| | |0|n|G|0|@ov_action="o"
|||0|0| | |0|n|G|0
|@ov_protect="b=V0123456789d(690,695:d)
                      hn(590:d)y(099,249,852,856:d)y(910,925,
                      980,981)F26"
035|001 |+|0|0|b|o|0|y|N|0|%001(start="1-9",char="!-~")
245||+|0|0|b|t|0|y|N|0|%bracket="h"
500-599||+|0|0|b|n|0|y|N|0|
600-651||-w|0|0|b|d|0|y|N|0|
653-657||+|0|0|b|d|0|y|N|0|
690-699||-w|0|0|b|d|0|y|N|0|
700-715||-w|0|0|b|b|0|y|N|0|
730-740||-w|0|0|b|f|0|y|N|0|
Matching overview


Ham-fisted Method
1. Exact Title + Exact Author
2. Exact Title + Shortened Author

Cool Math Method
Calculate Similarity of Title
Calculate Similarity of Author
1. Exact Title + Fuzzy Author
2. Fuzzy Title + Fuzzy Author
3. Fuzzy Title or Fuzzy Author
Pymarc - the MARC Hammer


umi_dict = {
    "Alaskan Bootlegger": {"author": "Leon Kania", "umi_count": 1},
    # title2_value: {"author": author2_value, "umi_count": index2},
    # ...
}

iii_dict = {
    "Alaskan Bootlegger": {"author": "Leon W. Kania", "iii_count": 9},
    # title2_value: {"author": author2_value, "iii_count": index2},
    # ...
}
Exact title + exact author

# Exact Title
# Create sets out of the dictionary keys (the titles)
umi_set = set(umi_dict)
iii_set = set(iii_dict)

# Find the intersection of the sets.
title_match = umi_set & iii_set


# Verify the intersection with exact author
for x in title_match:
    if umi_dict[x]["author"] == iii_dict[x]["author"]:
        ...  # do stuff
Exact title + Truncated author


def shortenAuthorName(name):
    # "Leon W. Kania" -> ["Leon", "W.", "Kania"] -> "Leon Kania"
    namelist = str(name).split()
    if len(namelist) > 2:
        # Keep only the first and last words of the name
        shortname = "%s %s" % (namelist[0], namelist[-1])
    else:
        shortname = name
    return shortname
"If you break three spokes, it is time for a rebuild"
                      Charles Hadrann,
                      "Hadrann Wheelcraft Method – Part 1 Lacing"
Rogues
Gallery
USE OF CROWN LENGTH TO DEFINE
STEM FORM: SEGMENTED TAPER
EQUATION (DOUGLAS FIR)


Use of crown length to define stem form
:: segmented taper equation
Towards an understanding of seismic
performance of three-dimensional
structures: Stability and reliability

Towards an understanding of seismic
performance of 3D structures :: stability
& reliability
Hoekstra, Hopi Danielle Elisabeth

Hoekstra, Danielle E
Arnason, Halldor

Halldór Árnason
Levenshtein Edit
   Distance
Edit distance is the number of
single-character operations required to transform
one string of characters into another.
How many steps to turn
kitten
into sitting?
3
kitten ➔ sitten    (k changes to s)

sitten ➔ sittin    (e changes to i)

sittin ➔ sitting   (insert g)
LD is Always...


   ≥ difference in string lengths
   ≤ length of the longer string
   = 0 if the strings are identical
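
The definition and the kitten/sitting walk-through can be reproduced with a short dynamic-programming routine. This is a minimal pure-Python sketch rather than the authors' code; the function name levenshtein is chosen here for illustration, and the packages listed under Resources would be the faster option in practice.

```python
def levenshtein(a, b):
    """Return the Levenshtein edit distance between strings a and b."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")  # 3, as in the walk-through
```

The three properties listed above hold for this routine: the result is never smaller than the difference in lengths, never larger than the longer string, and zero for identical strings.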
Similarity Score
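
One common way to turn an edit distance into a similarity score between 0 and 1 is to divide the distance by the length of the longer string and subtract the result from 1. Whether the talk used exactly this normalization is an assumption, and the names edit_distance and similarity are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        new_row = [i]
        for j, cb in enumerate(b, start=1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1, row[j - 1] + (ca != cb)))
        row = new_row
    return row[-1]


def similarity(a, b):
    """Score in [0, 1]: 1.0 for identical strings, lower as more edits are needed."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - edit_distance(a, b) / longest


score = similarity("Leon Kania", "Leon W. Kania")
```

Because the edit distance never exceeds the length of the longer string, the score always stays within 0 and 1, which makes a single cutoff usable for both short author names and long titles.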
Optimizations
Reduce the Search Space


   "A stochastic model
   of cyclical interaction
   processes"




                             All titles
Reduce the Search Space

    Identify Stopwords
    the: 24587         Throw out common
    for: 7643          words in titles
    with: 3323
    effects: 1958
    evaluation: 1073
    ...
    hypoxic: 1         Keep the rarer ones
    reduplication: 1
    picaresque: 1
    emperador: 1
    heteroduplex: 1
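
A frequency table like the one above can be produced with collections.Counter. The sketch below is illustrative only: the tokenizing regular expression, the function names and the threshold value are assumptions, and the four sample titles are taken from elsewhere in these slides rather than from the full data set.

```python
import re
from collections import Counter


def word_counts(titles):
    """Count how many times each lowercased word occurs across all titles."""
    counts = Counter()
    for title in titles:
        counts.update(re.findall(r"\w+", title.lower()))
    return counts


def stopwords(counts, threshold):
    """Words that occur more than `threshold` times (hypothetical cutoff)."""
    return {word for word, n in counts.items() if n > threshold}


titles = [
    "Use of crown length to define stem form",
    "Towards an understanding of seismic performance of three-dimensional structures",
    "A stochastic model of cyclical interaction processes",
    "Stochastic models for DNA sequence data",
]
counts = word_counts(titles)
common = stopwords(counts, threshold=2)  # only "of" occurs more than twice here
```

On a real collection the cutoff would be picked by inspecting the sorted counts, for example with counts.most_common().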
Reduce the Search Space

      Extract Significant Words

 "Stochastic             stochastic
 models for DNA          dna
 sequence data"          sequence
Reduce the Search Space


  from collections import defaultdict

  index = defaultdict(list)   # word -> list of records
  rec = {'title': 'Stochastic models...',}

  index['stochastic'].append(rec)
  index['dna'].append(rec)
  index['sequence'].append(rec)
Reduce the Search Space


                             index['stochastic']


{'title': "Stochastic models for DNA sequence data", ...}
{'title': "A stochastic model of clan systems", ...}
{'title': "A stochastic model of cyclical interaction processes", ...}
{'title': "Stochastic reliability models for maintained systems", ...}
{'title': "Uniform approximation and almost periodicity of doubly stochastic operators", ...}
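
The indexing steps above can be put together in a small sketch: build the word index once, then compare each incoming title only against records that share a significant word with it. The stopword set and the function names below are hypothetical; the stopwords were picked so that the DNA title yields the same three words as the slide.

```python
import re
from collections import defaultdict

# Hypothetical stopword list; in practice it would come from the word counts.
STOPWORDS = {"a", "an", "and", "for", "of", "the", "with", "model", "models", "data"}


def significant_words(title):
    """Lowercased words of a title that are not stopwords."""
    return [w for w in re.findall(r"\w+", title.lower()) if w not in STOPWORDS]


def build_index(records):
    """Map each significant title word to the records whose title contains it."""
    index = defaultdict(list)
    for rec in records:
        for word in set(significant_words(rec["title"])):
            index[word].append(rec)
    return index


def candidates(index, title):
    """Records sharing at least one significant word with `title`."""
    found = {}
    for word in significant_words(title):
        for rec in index.get(word, []):
            found[id(rec)] = rec  # keyed by identity so each record appears once
    return list(found.values())


records = [
    {"title": "Stochastic models for DNA sequence data"},
    {"title": "A stochastic model of clan systems"},
    {"title": "Use of crown length to define stem form"},
]
index = build_index(records)
matches = candidates(index, "A stochastic model of cyclical interaction processes")
```

Only the candidate records then need the more expensive fuzzy comparison, instead of every title in the collection.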
Normalize Names



  Hoekstra, Hopi Danielle Elisabeth

  Hoekstra, Danielle E
Normalize Names



          Hoekstra, H

          Hoekstra, D
Normalize Names



        Arnason, Halldor

        Halldór Árnason
Normalize Names



          Arnason, H

          Árnason, H
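
A name normalizer along these lines can be sketched as follows. It is an illustration, not the authors' routine: the function names are hypothetical, treating the last word of an uninverted name as the surname is an assumption, and stripping accents goes one step further than the slide, which keeps the accent in "Árnason, H".

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks, so that 'Árnason' becomes 'Arnason'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_name(name):
    """Reduce a name to 'Surname, F' (surname plus first initial)."""
    name = strip_accents(name).strip()
    if "," in name:
        # Inverted form: 'Surname, Given names'
        surname, given = (part.strip() for part in name.split(",", 1))
    else:
        # Direct form: 'Given names Surname' (assumes the last word is the surname)
        parts = name.split()
        if not parts:
            return ""
        surname, given = parts[-1], " ".join(parts[:-1])
    initial = given[:1].upper()
    return f"{surname}, {initial}" if initial else surname


same = normalize_name("Arnason, Halldor") == normalize_name("Halldór Árnason")
different = normalize_name("Hoekstra, Hopi Danielle Elisabeth") != normalize_name("Hoekstra, Danielle E")
```

The Hoekstra pair still fails to match after normalization because the two records disagree on the first given name, so such cases still need a fuzzy comparison or a manual check.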
Improvements
Jaro-Winkler Algorithm
What's a "match"?

  Two characters match if they are the same
  and no farther apart than:

    floor(max(|s1|, |s2|) / 2) - 1 positions
Example



   s1 = Martha
   s2 = Marhta
Example



   s1 = Martha
   s2 = Marhta
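
The Martha/Marhta example can be worked through with a small pure-Python sketch of the standard Jaro and Jaro-Winkler definitions. The function names, the prefix scale of 0.1 and the four-character prefix limit follow the usual textbook formulation and are not taken from the slides.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if equal and within this many positions of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order count as transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro score boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


j = jaro("Martha", "Marhta")           # six matches, one transposition (t/h)
jw = jaro_winkler("Martha", "Marhta")  # boosted by the shared prefix "Mar"
```

For this pair all six characters match and the swapped "th" counts as one transposition, giving a Jaro score of 17/18 (about 0.944); the shared three-character prefix raises the Jaro-Winkler score to about 0.961.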
Jaro-Winkler works
best for short strings
Resources
Levenshtein & Jaro-Winkler

  Background
  http://en.wikipedia.org/wiki/Levenshtein_distance
  http://en.wikipedia.org/wiki/Jaro-Winkler_distance


  Code
  http://pypi.python.org/pypi/editdist/0.1
  http://pypi.python.org/pypi/python-Levenshtein/0.10.1
Miscellaneous

  String Comparison Tutorial
  http://bit.ly/ZGSmF

  SecondString - Java text analysis library
  http://secondstring.sourceforge.net/

  MarcXimiL - MARC de-duping package
  http://marcximil.sourceforge.net/
http://snurl.com/uggtn
