Leveraging Semantic Fingerprinting for Building Author Networks

LEVERAGING SEMANTIC FINGERPRINTS FOR BUILDING AUTHOR NETWORKS
Bob Kasenchak
Production Coordinator
DHUG 2014

NAMED ENTITY DISAMBIGUATION
Most publishers (and many other organizations) need to
disambiguate lists of:
Persons: Authors, Editors, Members, Employees
Institutions: Colleges, Companies, Laboratories, Organizations
Copyright 2014 Access Innovations, Inc.

BUT WHY DISAMBIGUATE?
Facilitate content discovery
Browse by Author or Institution name
Resolve member, author, marketing lists
Link out to other organizations (e.g., ORCID)
Demonstrate value to stakeholders
e.g., college libraries are less likely to cancel
subscriptions when shown how many of their
professors publish in your content
Market research and analysis

TWO DISAMBIGUATION PROCESSES
Matching algorithms
String matching
Fuzzy matching
Leveraging other data associated with each entity to
increase matching probability and reduce false
matches, such as:
Country
Date
Co-authors
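
The two processes above can be sketched together as one scoring function. This is only an illustration, not the production algorithm: the field names (`name`, `country`, `coauthors`) and the boost values 0.10 and 0.15 are assumptions chosen for the example, and `difflib.SequenceMatcher` stands in for whatever fuzzy matcher is actually used.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Fuzzy string similarity in [0, 1], ignoring case and extra spaces."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def match_score(candidate: dict, record: dict) -> float:
    """Combine name similarity with agreement on associated data."""
    score = name_similarity(candidate["name"], record["name"])
    # Assumed boost: same country makes a true match more likely.
    if candidate.get("country") and candidate.get("country") == record.get("country"):
        score += 0.10
    # Assumed boost: at least one shared co-author is strong evidence.
    shared = set(candidate.get("coauthors", [])) & set(record.get("coauthors", []))
    if shared:
        score += 0.15
    return min(score, 1.0)


incoming = {"name": "Carlson, T. A.", "country": "US", "coauthors": ["Smith, J."]}
rec_a = {"name": "Carlson, Thomas A.", "country": "US", "coauthors": ["Smith, J.", "Lee, K."]}
rec_b = {"name": "Carlson, Thomas A.", "country": "SE", "coauthors": []}

score_a = match_score(incoming, rec_a)
score_b = match_score(incoming, rec_b)
```

Both records carry the same name string, so string matching alone cannot separate them; only the associated data ranks `rec_a` above `rec_b`.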

TWO-PHASE WORKFLOW
Initial set of raw data is used to create an authority file
Questionable names are subject to human review
Authority file is subject to constant review and
cleanup
Entities are extracted from new content and compared
to the authority file
Anomalies are reviewed and matched to existing
records or added as new entities
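
The second phase, comparing extracted entities to the authority file and routing anomalies to review, might look like the sketch below. The two cutoffs (`AUTO_MATCH`, `REVIEW`) and the three outcome labels are assumptions for illustration; real thresholds would be tuned against editorial feedback.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

AUTO_MATCH = 0.95  # assumed cutoff: accept the match without review
REVIEW = 0.75      # assumed cutoff: plausible match, send to an editor


def triage(entity: str, authority: List[str]) -> Tuple[str, Optional[str]]:
    """Return ('match' | 'review' | 'new', best authority entry or None)."""
    best, best_score = None, 0.0
    for canonical in authority:
        s = SequenceMatcher(None, entity.lower(), canonical.lower()).ratio()
        if s > best_score:
            best, best_score = canonical, s
    if best_score >= AUTO_MATCH:
        return "match", best
    if best_score >= REVIEW:
        return "review", best
    # Nothing close enough: candidate for a new authority record.
    return "new", None


authority_file = ["Ohio University", "Ohio State University"]
exact = triage("Ohio University", authority_file)
abbreviated = triage("Ohio State Univ.", authority_file)
unknown = triage("Zzz Qqq", authority_file)
```

Here the exact string is matched automatically, the abbreviated form lands in the review queue next to its closest authority entry, and the unrelated string is proposed as a new entity.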

INSTITUTION DISAMBIGUATION
Having a clean Institution authority file allows for
better processing of persons
Institution work is easier and more clear-cut than person work
Develop standards and practices, but be prepared to
change or add to them as new data comes to light
Forcing data into a bad paradigm isn’t helpful
The data should inform your standards and
practices

INSTITUTION DISAMBIGUATION FLOW

QUALITY OF RAW DATA MATTERS
Well-formed source data?
Structured or unstructured?
Legacy content?
Often not as well structured
Or auto-tagged, so can be unreliable
Parsed using punctuation etc. as delimiters
Common abbreviations and stopwords
Also, leverage country information if
available
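
As a rough sketch of the parsing step, the function below splits on commas, expands common abbreviations, and drops stopwords. The abbreviation and stopword tables are tiny, assumed examples; a real table would be far longer and dataset-specific, and taking only the first comma-delimited segment is a simplifying assumption.

```python
import re

# Assumed, illustrative tables; not taken from any real configuration.
ABBREVIATIONS = {
    "univ": "university",
    "inst": "institute",
    "aerosp": "aerospace",
}
STOPWORDS = {"of", "the", "and", "for"}


def normalize_institution(raw: str) -> str:
    """Reduce a raw affiliation string to a comparable institution key."""
    # Assume the first comma-delimited segment carries the institution name;
    # later segments (department, city, ZIP, email) are ignored here.
    head = raw.split(",")[0]
    # Drop parenthetical acronyms such as "(OAI)".
    head = re.sub(r"\([^)]*\)", " ", head)
    tokens = re.findall(r"[a-z]+", head.lower())
    expanded = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(t for t in expanded if t not in STOPWORDS)


key_1 = normalize_institution(
    "Ohio State Univ., Dept. of Mech. Eng., Columbus, OH 43210, mechprof@osu.edu"
)
key_2 = normalize_institution("Ohio Aerosp. Inst., Cleveland, OH 44142")
key_3 = normalize_institution("Ohio Aerospace Institute (OAI)")
```

Normalization collapses the two Ohio Aerospace Institute variants to one key, but a bare "Ohio State" would still not equal "ohio state university", which is the kind of case left for fuzzy matching or human review.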

INSTITUTIONS: RAW DATA
Ohio Aerosp. Inst., Cleveland, OH 44142
Ohio Aerospace Institute (OAI)
Ohio Dominican University
Ohio Institute of Technology
Ohio Northern University
Ohio State
Ohio State Univ., Columbus, OH
Ohio State Univ., Columbus, OH 43210
Ohio State Univ., Columbus, OH 43210-1298
Ohio State Univ., Dept. of Linguist.
Ohio State Univ., Dept. of Mech. Eng., Columbus, OH 43210, mechprof@osu.edu
Ohio University

HUMAN EDITORIAL REVIEW
Two kinds of human intervention are used:
QC of automated matches for accuracy
Culls out errors
Gather data to iteratively adjust matching algorithms
Reviewing non-matched entities
Match by hand to existing authority file
Create new listings for new entities

EDITORIAL REVIEW INTERFACE
Institutions to be reviewed
Authority File lookup
Search results

AUTHORS (AND OTHER PERSONS)
Persons are trickier than institutions!
Variants
Nicknames
Middle name, initial, or nothing
Initials
Suffixes and Prefixes
Similar last names
Name changes
Transliterations

NAMES: RAW DATA
Carlson, N.
Carlson, Neil N.
Carlson, P.
Carlson, R. L.
Carlson, R. M. K.
Carlson, R. W.
Carlson, Roy
Carlson, Roy F.
Carlson, T. A.
Carlson, Thomas
Carlson, Thomas A.
Carlson, Thomas J.
Carlson, W. G.
Carlson, William
Carlson, William V.
Which, if any, are the same person?
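
A first pass over a list like this can at least rule out pairs that cannot be the same person. The sketch below is a hypothetical compatibility check, assuming names arrive as "Last, Given Middle" and treating an initial as agreeing with any name that starts with it. It handles only the initials and missing-middle-name variants; nicknames, name changes, and transliterations are out of scope.

```python
def parse_name(raw: str):
    """Split 'Last, First M.' into (last, [given tokens]), lowercased."""
    last, _, given = raw.partition(",")
    tokens = [t.strip(".").lower() for t in given.split()]
    return last.strip().lower(), [t for t in tokens if t]


def tokens_agree(a: str, b: str) -> bool:
    """An initial agrees with any name that starts with the same letter."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def compatible(name_a: str, name_b: str) -> bool:
    """True if the two name forms could refer to the same person."""
    last_a, given_a = parse_name(name_a)
    last_b, given_b = parse_name(name_b)
    if last_a != last_b:
        return False
    # zip stops at the shorter list, so a missing middle name never conflicts.
    return all(tokens_agree(x, y) for x, y in zip(given_a, given_b))


same_candidate = compatible("Carlson, T. A.", "Carlson, Thomas A.")
different = compatible("Carlson, Thomas A.", "Carlson, Thomas J.")
```

Compatibility only means a pair is a candidate: "Carlson, Thomas" is compatible with both "Thomas A." and "Thomas J.", so other evidence is still needed to decide.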

PERSON NAME DISAMBIGUATION FLOW

RESOLVER; SEMANTIC FINGERPRINTS

AUTHOR NAME AUTHORITY FILE
Each author record is linked to other associated data:
Every DOI (or other document #)
Every co-author
Every institution
Dates of publication
Subject terms from thesaurus used to index
content associated with each person
Each of these is used in the disambiguation algorithm
to weight the potential matches of similar names
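
One way to picture such a record is the data structure below. The class name, field names, and the DOIs in the usage lines are all made up for illustration; the source does not describe how the authority file is actually stored.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AuthorRecord:
    """Hypothetical authority-file entry for one disambiguated author."""
    name: str
    document_ids: set = field(default_factory=set)   # DOIs or other document numbers
    coauthors: set = field(default_factory=set)
    institutions: set = field(default_factory=set)
    pub_years: set = field(default_factory=set)
    subject_terms: Counter = field(default_factory=Counter)  # thesaurus term -> count

    def add_paper(self, doc_id, coauthors, institution, year, terms):
        """Fold one paper's metadata and indexing into the record."""
        self.document_ids.add(doc_id)
        self.coauthors.update(coauthors)
        self.institutions.add(institution)
        self.pub_years.add(year)
        self.subject_terms.update(terms)


# Placeholder DOIs and terms, not real publications.
record = AuthorRecord(name="Carlson, Thomas A.")
record.add_paper("10.0000/example.1", ["Smith, J."], "Ohio State University",
                 2011, ["Combustion", "Turbulence"])
record.add_paper("10.0000/example.2", ["Lee, K."], "Ohio State University",
                 2013, ["Combustion", "Heat transfer"])
```

Keeping subject terms as counts rather than a flat set means terms that recur across an author's papers carry more weight later.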

LEVERAGING THESAURUS TERMS
The indexing from every paper by each known author
comprises a weighted subject “fingerprint”
Potential matching names from incoming content are
associated with the indexing from each paper
Subject terms are compared to potential matches to
increase certainty weighting
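
A minimal sketch of the fingerprint comparison follows, assuming the fingerprint is a simple count of thesaurus terms and that similarity is measured with cosine similarity. Both choices are assumptions made for this example; the source says only that the fingerprint is weighted and that subject terms are compared.

```python
import math
from collections import Counter


def fingerprint(papers) -> Counter:
    """Build a term-count fingerprint from the indexing of each paper."""
    fp = Counter()
    for terms in papers:
        fp.update(terms)
    return fp


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count fingerprints, in [0, 1]."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Two known authors with the same name form but different subject areas
# (term lists are invented for illustration).
engineer = fingerprint([["Combustion", "Turbulence"], ["Combustion", "Heat transfer"]])
linguist = fingerprint([["Phonology", "Syntax"], ["Syntax", "Morphology"]])

# Indexing of an incoming paper by an as-yet-unresolved "Carlson, T."
incoming = Counter(["Combustion", "Fluid dynamics"])

sim_engineer = cosine(incoming, engineer)
sim_linguist = cosine(incoming, linguist)
```

The incoming paper shares a term with the engineer's fingerprint and none with the linguist's, so the subject evidence raises the weight of the first candidate only.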

EDITORIAL REVIEW INTERFACE
Authors to be reviewed
Authority File lookup
Search results

ITERATIVE PROCESSES
Every batch of new content adds more data for the
matching algorithms to use
The authority files should be reviewed by editors for
QC to keep the files clean
Editors can suggest tweaks to the algorithm based on
the results that are being sent to them for review and
QC of the authority files
Too many obvious matches being kicked out; or
Bad automatic matches being added to authority files

CONTENT-AWARE PROCESSES
Every dataset is different, so the named entity
disambiguation processes and algorithms should be
modified to suit
More “adjustable” than “one-size-fits-all”
Basic processes can be customized to suit different
datasets and client needs
Leveraging thesaurus/subject terms from indexing is
a huge addition to the disambiguation algorithms

NAMED ENTITY DISAMBIGUATION PROCESSES AND PROCEDURES
Bob Kasenchak
Project Coordinator
November 20, 2013
Thank You – Any Questions?
