LEVERAGING SEMANTIC FINGERPRINTS FOR BUILDING AUTHOR NETWORKS
Bob Kasenchak
Production Coordinator
DHUG 2014
NAMED ENTITY DISAMBIGUATION
Most publishers (and many other organizations) need to disambiguate lists of:
Persons: authors, editors, members, employees
Institutions: colleges, companies, laboratories, organizations
Copyright 2014 Access Innovations, Inc.
BUT WHY DISAMBIGUATE?
Facilitate content discovery
Browse by Author or Institution name
Resolve member, author, marketing lists
Link out to other organizations (e.g., ORCID)
Demonstrate value to stakeholders
e.g., college libraries are less apt to cancel subscriptions if they are shown how many of their professors are published in your content
Market research and analysis
TWO DISAMBIGUATION PROCESSES
Matching algorithms
String matching
Fuzzy matching
Leveraging other data associated with each entity to
increase matching probability and reduce false
matches, such as:
Country
Date
Co-authors
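As a rough illustration of the two processes above, the sketch below combines a fuzzy string score with bonuses for agreeing associated data. It is not the presenter's algorithm: the function names, the record layout (dicts with "name", "country", and "coauthors" keys), and the 0.1 and 0.2 bonus weights are all assumptions chosen for the example. Dates could be added in the same way.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] between two case-folded strings."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def match_score(candidate: dict, record: dict) -> float:
    """Combine fuzzy name similarity with agreement on associated data.

    The bonus weights are illustrative assumptions, not tuned values.
    """
    score = string_similarity(candidate["name"], record["name"])

    # Same country: small boost (hypothetical weight).
    country = candidate.get("country")
    if country and country == record.get("country"):
        score += 0.1

    # At least one shared co-author: larger boost (hypothetical weight).
    shared = set(candidate.get("coauthors", [])) & set(record.get("coauthors", []))
    if shared:
        score += 0.2

    return min(score, 1.0)
```

With this shape, two similar names that also share a co-author score higher than the same two names with no supporting data, which is the effect the slide describes.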
TWO-PHASE WORKFLOW
Initial set of raw data is used to create an authority file
Questionable names are subject to human review
Authority file is subject to constant review and
cleanup
Entities are extracted from new content and compared
to the authority file
Anomalies are reviewed and matched to existing
records or added as new entities
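The second phase of this workflow could be sketched as a triage step: each incoming entity is compared to the authority file, confident matches are accepted, and everything else goes to a human reviewer, optionally with a suggested record. The threshold values and the names `AUTO_MATCH`, `SUGGEST`, `best_match`, and `triage` are hypothetical, and the authority file is simplified to a list of names.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

AUTO_MATCH = 0.92  # hypothetical: at or above this, accept the match automatically
SUGGEST = 0.75     # hypothetical: at or above this, show the reviewer a suggestion


def best_match(name: str, authority: List[str]) -> Tuple[Optional[str], float]:
    """Return the most similar authority entry and its similarity score."""
    best, best_score = None, 0.0
    for entry in authority:
        score = SequenceMatcher(None, name.casefold(), entry.casefold()).ratio()
        if score > best_score:
            best, best_score = entry, score
    return best, best_score


def triage(name: str, authority: List[str]) -> Tuple[str, Optional[str]]:
    """Classify an incoming name as an automatic match or a review item."""
    entry, score = best_match(name, authority)
    if score >= AUTO_MATCH:
        return ("matched", entry)
    # Anomalies go to an editor, who matches by hand or adds a new entity.
    return ("review", entry if score >= SUGGEST else None)
```

The reviewer's decision (match to an existing record or create a new one) then updates the authority file, so later batches have more to match against.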
INSTITUTION DISAMBIGUATION
Having a clean Institution authority file allows for
better processing of persons
Institution work is easier and more clear-cut than person disambiguation
Develop standards and practices, but be prepared to
change or add to them as new data comes to light
Forcing data into a bad paradigm isn’t helpful
The data should inform your standards and
practices
INSTITUTION DISAMBIGUATION FLOW (diagram)
QUALITY OF RAW DATA MATTERS
Well-formed source data?
Structured or unstructured?
Legacy content?
Often not as well structured
Or auto-tagged, so can be unreliable
Parsed using punctuation etc. as delimiters
Common abbreviations and stopwords
Also, leverage country information if available
INSTITUTIONS: RAW DATA
Ohio Aerosp. Inst., Cleveland, OH 44142
Ohio Aerospace Institute (OAI)
Ohio Dominican University
Ohio Institute of Technology
Ohio Northern University
Ohio State
Ohio State Univ., Columbus, OH
Ohio State Univ., Columbus, OH 43210
Ohio State Univ., Columbus, OH 43210-1298
Ohio State Univ., Dept. of Linguist.
Ohio State Univ., Dept. of Mech. Eng., Columbus, OH 43210, mechprof@osu.edu
Ohio University
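A minimal sketch of how raw strings like these might be parsed, using commas as delimiters and expanding common abbreviations as the previous slide suggests. The abbreviation table, the stopword list, and the function names are assumptions made for the example; a production table would be built from the data itself.

```python
import re

# Hypothetical abbreviation table; a real one would be derived from the dataset.
ABBREVIATIONS = {
    "univ.": "University",
    "inst.": "Institute",
    "aerosp.": "Aerospace",
    "dept.": "Department",
}

# Hypothetical stopword list used only when building a comparison key.
STOPWORDS = {"of", "the", "and", "for"}


def normalize_institution(raw: str) -> str:
    """Keep the first comma-delimited segment and expand known abbreviations."""
    head = raw.split(",")[0]
    # Drop a trailing parenthetical acronym such as "(OAI)".
    head = re.sub(r"\s*\([^)]*\)\s*$", "", head)
    words = [ABBREVIATIONS.get(word.lower(), word) for word in head.split()]
    return " ".join(words)


def match_key(raw: str) -> str:
    """Lowercased, stopword-free key for comparing normalized names."""
    words = normalize_institution(raw).lower().split()
    return " ".join(word for word in words if word not in STOPWORDS)
```

Under these assumptions, "Ohio Aerosp. Inst., Cleveland, OH 44142" and "Ohio Aerospace Institute (OAI)" reduce to the same name, while a bare "Ohio State" stays ambiguous and would still need human review.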
INSTITUTION DISAMBIGUATION FLOW (diagram)
HUMAN EDITORIAL REVIEW
Two kinds of human intervention are used:
QC of automated matches for accuracy
Culls out errors
Gather data to iteratively adjust matching algorithms
Reviewing non-matched entities
Match by hand to existing authority file
Create new listings for new entities
EDITORIAL REVIEW INTERFACE
Institutions to be reviewed
Authority File lookup
Search results
AUTHORS (AND OTHER PERSONS)
Persons are trickier than institutions!
Variants
Nicknames
Middle name, initial, or nothing
Initials
Suffixes and Prefixes
Similar last names
Name changes
Transliterations
NAMES: RAW DATA
Carlson, N.
Carlson, Neil N.
Carlson, P.
Carlson, R. L.
Carlson, R. M. K.
Carlson, R. W.
Carlson, Roy
Carlson, Roy F.
Carlson, T. A.
Carlson, Thomas
Carlson, Thomas A.
Carlson, Thomas J.
Carlson, W. G.
Carlson, William
Carlson, William V.
Which, if any, are the same person?
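One way to narrow the question is a compatibility filter: two name strings can only refer to the same person if the surnames agree and every given-name token is consistent (an initial matches any name starting with that letter, and a missing middle name is tolerated). The sketch below assumes the "Last, Given" layout shown above; the function names are hypothetical. It does not handle nicknames, name changes, or transliterations, and a compatible pair is only a candidate, not a confirmed match.

```python
from typing import List, Tuple


def parse_name(raw: str) -> Tuple[str, List[str]]:
    """Split 'Last, Given M.' into a lowercased surname and given-name tokens."""
    last, _, given = raw.partition(",")
    tokens = [token.strip(".").lower() for token in given.split()]
    return last.strip().lower(), [token for token in tokens if token]


def tokens_compatible(a: str, b: str) -> bool:
    """An initial is compatible with any name sharing its first letter."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def could_be_same_person(x: str, y: str) -> bool:
    """True if the two name strings do not contradict each other."""
    last_x, given_x = parse_name(x)
    last_y, given_y = parse_name(y)
    if last_x != last_y:
        return False
    # zip stops at the shorter list, so a missing middle name is tolerated.
    return all(tokens_compatible(a, b) for a, b in zip(given_x, given_y))
```

Applied to the list above, "Carlson, T. A." is compatible with "Carlson, Thomas A." but not with "Carlson, Thomas J.", while "Carlson, Thomas" is compatible with both, which is why associated data is needed to decide.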
PERSON NAME DISAMBIGUATION FLOW (diagram)
RESOLVER; SEMANTIC FINGERPRINTS (diagram)
AUTHOR NAME AUTHORITY FILE
Each author record is linked to other associated data:
Every DOI (or other document #)
Every co-author
Every institution
Dates of publication
Subject terms from thesaurus used to index
content associated with each person
Each of these is used in the disambiguation algorithm
to weight the potential matches of similar names
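An author record carrying this associated data might look like the sketch below. The class name, field names, and `add_paper` method are assumptions for illustration, not the schema of any particular product; the DOI values in a usage example would be placeholders.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class AuthorRecord:
    """Hypothetical authority-file entry for one disambiguated person."""

    name: str
    dois: Set[str] = field(default_factory=set)
    coauthors: Set[str] = field(default_factory=set)
    institutions: Set[str] = field(default_factory=set)
    years: Set[int] = field(default_factory=set)
    # Thesaurus terms with counts, accumulated over all linked papers.
    subject_terms: Counter = field(default_factory=Counter)

    def add_paper(
        self,
        doi: str,
        coauthors: Iterable[str],
        institution: str,
        year: int,
        terms: Iterable[str],
    ) -> None:
        """Link one more document and its associated data to this author."""
        self.dois.add(doi)
        self.coauthors.update(coauthors)
        self.institutions.add(institution)
        self.years.add(year)
        self.subject_terms.update(terms)
```

Each field gives the matching step one more signal to weigh when an incoming name resembles this record.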
LEVERAGING THESAURUS TERMS
The indexing from every paper by each known author
comprises a weighted subject “fingerprint”
Potential matching names from incoming content are
associated with the indexing from each paper
Subject terms are compared to potential matches to
increase certainty weighting
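The comparison described above can be sketched as follows, assuming the fingerprint is a simple count of thesaurus terms across an author's papers and that similarity is measured with cosine similarity. Both choices are assumptions for the example; the slides do not say how the terms are weighted or compared.

```python
import math
from collections import Counter
from typing import Iterable, List


def fingerprint(papers: Iterable[List[str]]) -> Counter:
    """Accumulate the indexing terms of every paper into one weighted profile."""
    profile: Counter = Counter()
    for terms in papers:
        profile.update(terms)
    return profile


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count profiles, in [0, 1]."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

An incoming paper indexed with terms that overlap a known author's fingerprint would score higher against that author than against a same-named author working in an unrelated field, and that score can be added to the name-match weighting.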
PERSON NAME DISAMBIGUATION FLOW (diagram)
EDITORIAL REVIEW INTERFACE
Authors to be reviewed
Authority File lookup
Search results
ITERATIVE PROCESSES
Every batch of new content adds more data for the
matching algorithms to use
The authority files should be reviewed by editors for
QC to keep the files clean
Editors can suggest tweaks to the algorithm based on
the results that are being sent to them for review and
QC of the authority files
• Too many obvious matches being kicked out; or
• Bad automatic matches being added to authority files
CONTENT-AWARE PROCESSES
Every dataset is different, so the named entity
disambiguation processes and algorithms should be
modified to suit
More “adjustable” than “one-size-fits-all”
Basic processes can be customized to suit different
datasets and client needs
Leveraging thesaurus/subject terms from indexing is
a huge addition to the disambiguation algorithms
NAMED ENTITY DISAMBIGUATION PROCESSES AND PROCEDURES
Bob Kasenchak
Project Coordinator
November 20, 2013
Thank You – Any Questions?