Presented at the 10th annual Data Harmony Users Group meeting on Wednesday, February 12, 2014 by Bob Kasenchak of Access Innovations, Inc. With the rise of ORCID and other universal databases of researchers and institutions, it is increasingly crucial for publishers to sort out their own data containing named entities. This talk details Access Innovations' approach to author disambiguation, which includes a taxonomy-based solution in addition to algorithmic processes. The presentation includes a case study.
2. NAMED ENTITY DISAMBIGUATION
Most publishers (and many other organizations) need to disambiguate lists of:
Persons
Authors
Editors
Members
Employees
Institutions
Colleges, Companies, Laboratories, Organizations
Copyright 2014 Access Innovations, Inc.
3. BUT WHY DISAMBIGUATE?
Facilitate content discovery
Browse by Author or Institution name
Resolve member, author, marketing lists
Link out to other organizations (e.g., ORCID)
Demonstrate value to stakeholders
e.g., College libraries less apt to cancel subscriptions if they are shown how many of their professors are published in your content
Market research and analysis
4. TWO DISAMBIGUATION PROCESSES
Matching algorithms
String matching
Fuzzy matching
Leveraging other data associated with each entity to increase matching probability and reduce false matches, such as:
Country
Date
Co-authors
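The combination of fuzzy string matching with associated data can be sketched as follows. This is a minimal illustration, not the algorithm used in the case study: the function names, the record layout (`name`, `country`, `coauthors`), and the `boost` value are all assumptions, and the standard-library `difflib.SequenceMatcher` stands in for whatever fuzzy matcher is actually used. Publication dates could be folded in the same way.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Fuzzy string similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(candidate, record, boost=0.1):
    """Name similarity, nudged upward for each piece of agreeing metadata.

    `candidate` and `record` are plain dicts (hypothetical layout);
    `boost` is an illustrative value, not a tuned weight.
    """
    score = name_similarity(candidate["name"], record["name"])
    # Same country makes a true match more likely.
    if candidate.get("country") and candidate.get("country") == record.get("country"):
        score += boost
    # So does at least one shared co-author.
    if set(candidate.get("coauthors", [])) & set(record.get("coauthors", [])):
        score += boost
    return min(score, 1.0)
```

With this shape, two records with similar names but different countries and no shared co-authors score lower than two records that agree on both, which is the intended effect of the extra data.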
5. TWO-PHASE WORKFLOW
Initial set of raw data is used to create an authority file
Questionable names are subject to human review
Authority file is subject to constant review and cleanup
Entities are extracted from new content and compared to the authority file
Anomalies are reviewed and matched to existing records or added as new entities
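The second phase of the workflow above might look roughly like this. It is a sketch under assumptions: the `AUTO_MATCH` threshold, the function names, and the use of `difflib` for scoring are all hypothetical, and the authority file is reduced to a list of name strings.

```python
from difflib import SequenceMatcher

AUTO_MATCH = 0.92  # hypothetical threshold: accept without human review


def best_match(name, authority):
    """Return (closest authority entry, similarity) for an extracted name."""
    scored = [
        (entry, SequenceMatcher(None, name.lower(), entry.lower()).ratio())
        for entry in authority
    ]
    return max(scored, key=lambda pair: pair[1], default=(None, 0.0))


def process_batch(extracted, authority):
    """Compare extracted entities to the authority file.

    Confident matches are accepted; everything else is queued for an
    editor, together with the closest existing record as a suggestion.
    """
    matched, review_queue = {}, []
    for name in extracted:
        entry, score = best_match(name, authority)
        if score >= AUTO_MATCH:
            matched[name] = entry
        else:
            review_queue.append((name, entry, score))
    return matched, review_queue
```

An editor working through `review_queue` would either confirm the suggested record or add the name to the authority file as a new entity.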
6. INSTITUTION DISAMBIGUATION
Having a clean Institution authority file allows for
better processing of persons
The work is easier and more clear-cut
Develop standards and practices, but be prepared to change or add to them as new data comes to light
Forcing data into a bad paradigm isn’t helpful
The data should inform your standards and practices
8. QUALITY OF RAW DATA MATTERS
Well-formed source data?
Structured or unstructured?
Legacy content?
Often not as well structured
Or auto-tagged, so can be unreliable
Parsed using punctuation etc. as delimiters
Common abbreviations and stopwords
Also, leverage country information if available
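Parsing on punctuation and expanding common abbreviations can be sketched as below. The abbreviation table and stopword list are hypothetical and far smaller than a real one, which would be built up from the data itself; country information is not handled here.

```python
import re

# Hypothetical abbreviation table and stopword list, for illustration only.
ABBREVIATIONS = {
    "univ": "university", "inst": "institute", "dept": "department",
    "aerosp": "aerospace", "mech": "mechanical", "eng": "engineering",
}
STOPWORDS = {"of", "the", "and", "for"}


def parse_affiliation(raw):
    """Split an affiliation string on commas and normalize each segment."""
    raw = re.sub(r"\([^)]*\)", " ", raw)  # drop parentheticals such as "(OAI)"
    segments = []
    for part in raw.split(","):
        if "@" in part:  # skip e-mail addresses
            continue
        tokens = re.findall(r"[a-z]+", part.lower())  # letters only; drops ZIP codes
        tokens = [ABBREVIATIONS.get(t, t) for t in tokens if t not in STOPWORDS]
        if tokens:
            segments.append(" ".join(tokens))
    return segments


def institution_key(raw):
    """Use the first normalized segment as a candidate institution key."""
    segments = parse_affiliation(raw)
    return segments[0] if segments else ""
```

Under these assumptions, "Ohio Aerosp. Inst., Cleveland, OH 44142" and "Ohio Aerospace Institute (OAI)" reduce to the same key, while a bare "Ohio State" would still need fuzzy matching or editorial review.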
9. INSTITUTIONS: RAW DATA
Ohio Aerosp. Inst., Cleveland, OH 44142
Ohio Aerospace Institute (OAI)
Ohio Dominican University
Ohio Institute of Technology
Ohio Northern University
Ohio State
Ohio State Univ., Columbus, OH
Ohio State Univ., Columbus, OH 43210
Ohio State Univ., Columbus, OH 43210-1298
Ohio State Univ., Dept. of Linguist.
Ohio State Univ., Dept. of Mech. Eng., Columbus, OH 43210, mechprof@osu.edu
Ohio University
11. HUMAN EDITORIAL REVIEW
Two kinds of human intervention are used:
QC of automated matches for accuracy
Culls out errors
Gather data to iteratively adjust matching algorithms
Reviewing non-matched entities
Match by hand to existing authority file
Create new listings for new entities
13. AUTHORS (AND OTHER PERSONS)
Persons are trickier than institutions!
Variants
Nicknames
Middle name, initial, or nothing
Initials
Suffixes and Prefixes
Similar last names
Name changes
Transliterations
14. NAMES: RAW DATA
Carlson, N.
Carlson, Neil N.
Carlson, P.
Carlson, R. L.
Carlson, R. M. K.
Carlson, R. W.
Carlson, Roy
Carlson, Roy F.
Carlson, T. A.
Carlson, Thomas
Carlson, Thomas A.
Carlson, Thomas J.
Carlson, W. G.
Carlson, William
Carlson, William V.
Which, if any, are the same person?
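A first pass over a list like this can at least rule out pairs that contradict each other. The sketch below is an assumption about how such a filter might look, with hypothetical function names; it treats an initial as compatible with any given name starting with that letter, and it does not handle nicknames, name changes, or transliterations. Compatibility only nominates candidates; it does not show that two names are the same person.

```python
def parse_name(raw):
    """Split 'Last, First M.' into (last name, list of given-name tokens)."""
    last, _, given = raw.partition(",")
    tokens = [t.strip(".").lower() for t in given.split() if t.strip(".")]
    return last.strip().lower(), tokens


def compatible(a, b):
    """True if neither name contradicts the other.

    Given names are compared position by position; a one-letter token
    is treated as an initial. Extra trailing tokens on one side are ignored.
    """
    last_a, given_a = parse_name(a)
    last_b, given_b = parse_name(b)
    if last_a != last_b:
        return False
    for x, y in zip(given_a, given_b):
        if len(x) == 1 or len(y) == 1:
            if x[0] != y[0]:
                return False
        elif x != y:
            return False
    return True
```

On the list above, "Carlson, T. A." is compatible with "Carlson, Thomas A." but not with "Carlson, Thomas J.", while "Carlson, Thomas" is compatible with all three, which is exactly why the additional data described next is needed.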
18. AUTHOR NAME AUTHORITY FILE
Each author record is linked to other associated data:
Every DOI (or other document #)
Every co-author
Every institution
Dates of publication
Subject terms from thesaurus used to index content associated with each person
Each of these is used in the disambiguation algorithm to weight the potential matches of similar names
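One way to picture such a record is as a small data structure that accumulates evidence paper by paper. The class and field names below are hypothetical, chosen only to mirror the list above; they are not the schema of the actual authority file.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AuthorRecord:
    """One authority-file entry (hypothetical field names)."""
    author_id: str
    preferred_name: str
    documents: set = field(default_factory=set)              # DOIs or other document numbers
    coauthors: Counter = field(default_factory=Counter)      # co-author -> shared papers
    institutions: Counter = field(default_factory=Counter)   # institution -> papers
    years: set = field(default_factory=set)                  # publication years
    subject_terms: Counter = field(default_factory=Counter)  # thesaurus term -> papers

    def add_paper(self, doc_id, coauthors, institutions, year, terms):
        """Attach one paper's metadata; a repeated document is ignored."""
        if doc_id in self.documents:
            return
        self.documents.add(doc_id)
        self.coauthors.update(coauthors)
        self.institutions.update(institutions)
        self.years.add(year)
        self.subject_terms.update(terms)
```

Keeping counts rather than plain sets for co-authors, institutions, and subject terms lets a matching step weight frequent associations more heavily than one-off ones.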
19. LEVERAGING THESAURUS TERMS
The indexing from every paper by each known author comprises a weighted subject “fingerprint”
Potential matching names from incoming content are associated with the indexing from each paper
Subject terms are compared to potential matches to increase certainty weighting
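A subject fingerprint comparison could be sketched as a term-count vector per author and a cosine similarity against the indexing of an incoming paper. Cosine similarity is an assumption made here for illustration; the talk does not say which measure is used, and the function names are hypothetical.

```python
import math
from collections import Counter


def fingerprint(papers):
    """Build a weighted fingerprint: thesaurus term -> number of papers.

    `papers` is an iterable of term lists, one list per indexed paper.
    """
    fp = Counter()
    for terms in papers:
        fp.update(set(terms))
    return fp


def cosine(a, b):
    """Cosine similarity between two term-weight mappings, in [0, 1]."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For example, a known author indexed under "Plasma physics" on most papers would score well against an incoming paper with that term and score zero against one indexed only under "Phonology" (both terms invented for this example). The resulting value would then be one more weight alongside country, date, and co-authors.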
22. ITERATIVE PROCESSES
Every batch of new content adds more data for the matching algorithms to use
The authority files should be reviewed by editors for QC to keep the files clean
Editors can suggest tweaks to the algorithm based on the results that are being sent to them for review and QC of the authority files
Too many obvious matches being kicked out; or
Bad automatic matches being added to authority files
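The two symptoms above map naturally onto a matching threshold that is too strict or too lenient. As a purely illustrative sketch (the function name, the rate inputs, and every numeric default are assumptions, and in practice the adjustment is an editorial judgment rather than an automatic rule):

```python
def adjust_threshold(threshold, review_confirm_rate, auto_error_rate,
                     step=0.01, max_confirm=0.8, max_error=0.02):
    """Suggest a new auto-match threshold from editorial feedback.

    review_confirm_rate: share of reviewed items editors confirmed as
        matches (high means obvious matches are being kicked out).
    auto_error_rate: share of automatic matches editors found to be wrong.
    All default values are illustrative, not recommendations.
    """
    if auto_error_rate > max_error:
        # Bad automatic matches are reaching the authority file: be stricter.
        return min(1.0, threshold + step)
    if review_confirm_rate > max_confirm:
        # Editors are mostly rubber-stamping the queue: be more lenient.
        return max(0.0, threshold - step)
    return threshold
```

Bad automatic matches are checked first because they corrupt the authority file, whereas an overly cautious threshold only costs editor time.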
23. CONTENT-AWARE PROCESSES
Every dataset is different, so the named entity disambiguation processes and algorithms should be modified to suit
More “adjustable” than “one-size-fits-all”
Basic processes can be customized to suit different datasets and client needs
Leveraging thesaurus/subject terms from indexing is a huge addition to the disambiguation algorithms