SlideShare a Scribd company logo
1 of 16
Recent advances in the project
EXCITE – Extraction of Citations
from PDF Documents
Philipp Mayr
GESIS – Leibniz Institute for the Social Sciences
2018-09-03, Bologna
http://excite.west.uni-koblenz.de/
#opencitations
#WOOC2018
EXCITE team
• PI: Steffen Staab (WeST), Philipp Mayr (GESIS)
• Researchers: Behnam Ghavimi, Zeyd Boukhers
• Developer: Azam Hosseini
• Collaborators: Heinrich Hartmann, Martin Körner
2
EXCITE: Background
3
• We run productive search systems and research in information
retrieval, recommendation systems and knowledge discovery
− SSOAR https://www.gesis.org/ssoar/ (48K full texts)
− GESIS Search https://search.gesis.org/ (242K data sets +
further materials)
• National literatures are not well represented in major citation
indices (like WoS, Scopus)
• Shortage of citation data for the international and German
social sciences (Social Science Citation Index is not enough)
• Open availability of citation data is improving but still very
limited
EXCITE: Main objectives
• Develop web services to allow third-parties to
extract citation data from arbitrary publications
• Develop a toolchain of reference extraction and
matching software
• Integrate and publish the extracted citation data in
reusable formats
• Narrow the supply gap of citation data in the social
sciences
4
EXCITE: toolchain
(1) Extraction of text from source documents (PDFs),
(2) Identification of reference sections and other forms of embedded reference
information within the text,
(3) Segmentation of individual references into its constituent fields such as author, title,
etc.,
(4) Matching of reference strings against bibliographic databases,
(5) Export and publication of matched references to reusable formats (convert to OCC)
5
Training data
EXCITE: recent advances
• All components are available as reusable components, see
https://github.com/exciteproject
• EXparser – tool to extracting and segment references (see talk
by Zeyd Boukhers: “A Generic Approach for Reference
Extraction from PDF Documents” tomorrow)
• Annotators and Gold standards – tools to annotate references
and different gold standards to train and test the tools
• EXmatcher – tool to match references to bibliographic
databases which base on solr, elasticsearch
• EXpublisher – tool to convert EXCITE data to JSON-LD
• Public demo http://excite.west.uni-koblenz.de/excite
• Extracted and matched data in productive systems, e.g.
https://search.gesis.org/publication/gesis-ssoar-10004
6
EXAnnotators: Reference Identification
7
http://excite.west.uni-koblenz.de/refanno
EXAnnotators: Reference Segmentation
8
http://excite.west.uni-koblenz.de/seganno
EXCITE: Demo
9
DEMO
EXCITE: Demo
10
Uploading File Display References
Result
http://excite.west.uni-koblenz.de/excite
EXmatcher
• Input are segmented reference strings with probabilities
for each segment
• Output are matched document ids
11
EXmatcher
hybrid
approach -
combination of
blocking
techniques
and a
classifier
algorithm
Input: strings,
segments,
probabilities 12
EXPublisher
• Converting extracted and matched data to the OCC ontology (incl.
EXCITE Identifier in the OCI)
• Enrichment of the reference information by external metadata
13https://github.com/exciteproject/EXpublisher
EXMatcher and ExPublisher will be included in the demo soon!
Next steps in EXCITE
• EXCITE references to be published in OpenCitationCorpus
• Public EXCITE API for testing (to be public soon)
• Reference Matching to Crossref to be added in the
demo/API
• Gold Standards (German/English/Reference
Section/Footnotes) to be completed
• Extractions models for German and English texts
• More Social Science data to be processed and released
• We try to process ArXiv for OCC
14
Thank you
Contact:
Dr Philipp Mayr
GESIS - Leibniz Institute for the Social Sciences, Germany
Email: philipp.mayr@gesis.org
Twitter: @philipp_mayr
• Project website
http://excite.west.uni-koblenz.de/
• EXCITE mailing list: Subscribe to our Newsletter.
• Demo http://excite.west.uni-koblenz.de/excite
• GIT https://github.com/exciteproject/
15
EXCITE: Toolchain
16

More Related Content

What's hot

ODIN Final Event - Publishing and citing, and the role of persistent identifiers
ODIN Final Event - Publishing and citing, and the role of persistent identifiersODIN Final Event - Publishing and citing, and the role of persistent identifiers
ODIN Final Event - Publishing and citing, and the role of persistent identifiersdatacite
 
skos-history: Tracking the evolution of Knowledge Organization Systems
skos-history: Tracking the evolution of Knowledge Organization Systemsskos-history: Tracking the evolution of Knowledge Organization Systems
skos-history: Tracking the evolution of Knowledge Organization SystemsJoachim Neubert
 
ODIN Final Event - Submission to datacentres
ODIN Final Event - Submission to datacentresODIN Final Event - Submission to datacentres
ODIN Final Event - Submission to datacentresdatacite
 
Introduction to Data Journalism
Introduction to Data JournalismIntroduction to Data Journalism
Introduction to Data JournalismIrina Radchenko
 
Federating Research Profiling Data
Federating Research Profiling DataFederating Research Profiling Data
Federating Research Profiling Dataericmeeks
 
ODIN Final Event - The Care and Feeding of Scientific Data
ODIN Final Event - The Care and Feeding of Scientific DataODIN Final Event - The Care and Feeding of Scientific Data
ODIN Final Event - The Care and Feeding of Scientific Datadatacite
 
What ami searching_hollis+articlestab
What ami searching_hollis+articlestabWhat ami searching_hollis+articlestab
What ami searching_hollis+articlestabEmily Singley
 
Studying Facebook via Data Extraction: a Netvizz tutorial at the Digital Meth...
Studying Facebook via Data Extraction: a Netvizz tutorial at the Digital Meth...Studying Facebook via Data Extraction: a Netvizz tutorial at the Digital Meth...
Studying Facebook via Data Extraction: a Netvizz tutorial at the Digital Meth...Digital Methods Initiative
 
Mining and Mapping the Research Landscape
Mining and Mapping the Research LandscapeMining and Mapping the Research Landscape
Mining and Mapping the Research LandscapeSimon Price
 
Introduction to Crossref, Seoul - Ed Pentz
Introduction to Crossref, Seoul - Ed PentzIntroduction to Crossref, Seoul - Ed Pentz
Introduction to Crossref, Seoul - Ed PentzCrossref
 
ORCID: An Overview - Alice Meadows
ORCID: An Overview - Alice MeadowsORCID: An Overview - Alice Meadows
ORCID: An Overview - Alice MeadowsCrossref
 
Citations and References in DBpedia
Citations and References in DBpediaCitations and References in DBpedia
Citations and References in DBpediaKrzysztof Wecel
 
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...Robert H. McDonald
 

What's hot (20)

ODIN Final Event - Publishing and citing, and the role of persistent identifiers
ODIN Final Event - Publishing and citing, and the role of persistent identifiersODIN Final Event - Publishing and citing, and the role of persistent identifiers
ODIN Final Event - Publishing and citing, and the role of persistent identifiers
 
skos-history: Tracking the evolution of Knowledge Organization Systems
skos-history: Tracking the evolution of Knowledge Organization Systemsskos-history: Tracking the evolution of Knowledge Organization Systems
skos-history: Tracking the evolution of Knowledge Organization Systems
 
Dash UCCSC 2016
Dash UCCSC 2016Dash UCCSC 2016
Dash UCCSC 2016
 
Shieh "Enabling Descriptive Data to be Linked at the Smithsonian Libraries"
Shieh "Enabling Descriptive Data to be Linked at the Smithsonian Libraries"Shieh "Enabling Descriptive Data to be Linked at the Smithsonian Libraries"
Shieh "Enabling Descriptive Data to be Linked at the Smithsonian Libraries"
 
Meadows apr28-1
Meadows apr28-1Meadows apr28-1
Meadows apr28-1
 
ODIN Final Event - Submission to datacentres
ODIN Final Event - Submission to datacentresODIN Final Event - Submission to datacentres
ODIN Final Event - Submission to datacentres
 
Godby "'What are the 'entities that matter?' And how much should we say about...
Godby "'What are the 'entities that matter?' And how much should we say about...Godby "'What are the 'entities that matter?' And how much should we say about...
Godby "'What are the 'entities that matter?' And how much should we say about...
 
Introduction to Data Journalism
Introduction to Data JournalismIntroduction to Data Journalism
Introduction to Data Journalism
 
Federating Research Profiling Data
Federating Research Profiling DataFederating Research Profiling Data
Federating Research Profiling Data
 
Sparling and Cohen "BIBFRAME Implementation at the University of Alberta Libr...
Sparling and Cohen "BIBFRAME Implementation at the University of Alberta Libr...Sparling and Cohen "BIBFRAME Implementation at the University of Alberta Libr...
Sparling and Cohen "BIBFRAME Implementation at the University of Alberta Libr...
 
ODIN Final Event - The Care and Feeding of Scientific Data
ODIN Final Event - The Care and Feeding of Scientific DataODIN Final Event - The Care and Feeding of Scientific Data
ODIN Final Event - The Care and Feeding of Scientific Data
 
What ami searching_hollis+articlestab
What ami searching_hollis+articlestabWhat ami searching_hollis+articlestab
What ami searching_hollis+articlestab
 
Studying Facebook via Data Extraction: a Netvizz tutorial at the Digital Meth...
Studying Facebook via Data Extraction: a Netvizz tutorial at the Digital Meth...Studying Facebook via Data Extraction: a Netvizz tutorial at the Digital Meth...
Studying Facebook via Data Extraction: a Netvizz tutorial at the Digital Meth...
 
Mining and Mapping the Research Landscape
Mining and Mapping the Research LandscapeMining and Mapping the Research Landscape
Mining and Mapping the Research Landscape
 
Introduction to Crossref, Seoul - Ed Pentz
Introduction to Crossref, Seoul - Ed PentzIntroduction to Crossref, Seoul - Ed Pentz
Introduction to Crossref, Seoul - Ed Pentz
 
Dash: data sharing made easy
Dash: data sharing made easyDash: data sharing made easy
Dash: data sharing made easy
 
ORCID: An Overview - Alice Meadows
ORCID: An Overview - Alice MeadowsORCID: An Overview - Alice Meadows
ORCID: An Overview - Alice Meadows
 
Citations and References in DBpedia
Citations and References in DBpediaCitations and References in DBpedia
Citations and References in DBpedia
 
April 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early Adopters
April 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early AdoptersApril 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early Adopters
April 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early Adopters
 
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...
 

Similar to Recent advances in the project EXCITE – Extraction of Citations from PDF Documents

Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Anja Jentzsch
 
Mendeley Data: Enhancing Data Discovery, Sharing and Reuse
Mendeley Data: Enhancing Data Discovery, Sharing and ReuseMendeley Data: Enhancing Data Discovery, Sharing and Reuse
Mendeley Data: Enhancing Data Discovery, Sharing and ReuseAnita de Waard
 
DSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platformDSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platformAndrea Bollini
 
Research dissemination presentation
Research dissemination presentationResearch dissemination presentation
Research dissemination presentationJohn Turner
 
2015 04-21-eexcess emtacl
2015 04-21-eexcess emtacl2015 04-21-eexcess emtacl
2015 04-21-eexcess emtaclTamara Pianos
 
Moving from an IR to a CRIS, the why & how
Moving from an IR to a CRIS, the why & howMoving from an IR to a CRIS, the why & how
Moving from an IR to a CRIS, the why & howDavid T Palmer
 
Citation Management Using Mendeley Software
Citation Management  Using Mendeley SoftwareCitation Management  Using Mendeley Software
Citation Management Using Mendeley SoftwareDave Marcial
 
EOSC and libraries
EOSC and librariesEOSC and libraries
EOSC and librariesSarah Jones
 
Focus on research workshop
Focus on research workshopFocus on research workshop
Focus on research workshopbellalli
 
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرنمحاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرنمركز البحوث الأقسام العلمية
 
Research Data Publishing
Research Data PublishingResearch Data Publishing
Research Data PublishingBrian Hole
 
British Library
British LibraryBritish Library
British Libraryclarivate
 
Open Access to Scholarly Research: Implications for Research Libraries
Open Access to Scholarly Research: Implications for Research LibrariesOpen Access to Scholarly Research: Implications for Research Libraries
Open Access to Scholarly Research: Implications for Research LibrariesAnup Kumar Das
 
MyScienceWork & NFAIS - Webinar 07 11 2017
MyScienceWork & NFAIS - Webinar 07 11 2017MyScienceWork & NFAIS - Webinar 07 11 2017
MyScienceWork & NFAIS - Webinar 07 11 2017Sarah Amrani
 
CNI 2018: A Research Object Authoring Tool for the Data Commons
CNI 2018: A Research Object Authoring Tool for the Data CommonsCNI 2018: A Research Object Authoring Tool for the Data Commons
CNI 2018: A Research Object Authoring Tool for the Data CommonsAnita de Waard
 
Library support for the scientific publishing cycle @ Rothamsted Research
Library support for the scientific publishing cycle @ Rothamsted ResearchLibrary support for the scientific publishing cycle @ Rothamsted Research
Library support for the scientific publishing cycle @ Rothamsted ResearchTim Wales
 
Scientific Data overview of Data Descriptors - WT Data-Literature integration...
Scientific Data overview of Data Descriptors - WT Data-Literature integration...Scientific Data overview of Data Descriptors - WT Data-Literature integration...
Scientific Data overview of Data Descriptors - WT Data-Literature integration...Susanna-Assunta Sansone
 

Similar to Recent advances in the project EXCITE – Extraction of Citations from PDF Documents (20)

Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)
 
Mendeley Data: Enhancing Data Discovery, Sharing and Reuse
Mendeley Data: Enhancing Data Discovery, Sharing and ReuseMendeley Data: Enhancing Data Discovery, Sharing and Reuse
Mendeley Data: Enhancing Data Discovery, Sharing and Reuse
 
DSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platformDSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platform
 
Research dissemination presentation
Research dissemination presentationResearch dissemination presentation
Research dissemination presentation
 
2015 04-21-eexcess emtacl
2015 04-21-eexcess emtacl2015 04-21-eexcess emtacl
2015 04-21-eexcess emtacl
 
6_ULiege_presentation.pdf
6_ULiege_presentation.pdf6_ULiege_presentation.pdf
6_ULiege_presentation.pdf
 
Moving from an IR to a CRIS, the why & how
Moving from an IR to a CRIS, the why & howMoving from an IR to a CRIS, the why & how
Moving from an IR to a CRIS, the why & how
 
Citation Management Using Mendeley Software
Citation Management  Using Mendeley SoftwareCitation Management  Using Mendeley Software
Citation Management Using Mendeley Software
 
EOSC and libraries
EOSC and librariesEOSC and libraries
EOSC and libraries
 
Focus on research workshop
Focus on research workshopFocus on research workshop
Focus on research workshop
 
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرنمحاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرن
 
Research Data Publishing
Research Data PublishingResearch Data Publishing
Research Data Publishing
 
British Library
British LibraryBritish Library
British Library
 
Open Access to Scholarly Research: Implications for Research Libraries
Open Access to Scholarly Research: Implications for Research LibrariesOpen Access to Scholarly Research: Implications for Research Libraries
Open Access to Scholarly Research: Implications for Research Libraries
 
MyScienceWork & NFAIS - Webinar 07 11 2017
MyScienceWork & NFAIS - Webinar 07 11 2017MyScienceWork & NFAIS - Webinar 07 11 2017
MyScienceWork & NFAIS - Webinar 07 11 2017
 
CNI 2018: A Research Object Authoring Tool for the Data Commons
CNI 2018: A Research Object Authoring Tool for the Data CommonsCNI 2018: A Research Object Authoring Tool for the Data Commons
CNI 2018: A Research Object Authoring Tool for the Data Commons
 
Library support for the scientific publishing cycle @ Rothamsted Research
Library support for the scientific publishing cycle @ Rothamsted ResearchLibrary support for the scientific publishing cycle @ Rothamsted Research
Library support for the scientific publishing cycle @ Rothamsted Research
 
NISO Webinar: Library Linked Data: From Vision to Reality
NISO Webinar: Library Linked Data: From Vision to RealityNISO Webinar: Library Linked Data: From Vision to Reality
NISO Webinar: Library Linked Data: From Vision to Reality
 
لتحليل الدراسات السابقة Nails محاضرة برنامج
  لتحليل الدراسات السابقة Nails محاضرة برنامج  لتحليل الدراسات السابقة Nails محاضرة برنامج
لتحليل الدراسات السابقة Nails محاضرة برنامج
 
Scientific Data overview of Data Descriptors - WT Data-Literature integration...
Scientific Data overview of Data Descriptors - WT Data-Literature integration...Scientific Data overview of Data Descriptors - WT Data-Literature integration...
Scientific Data overview of Data Descriptors - WT Data-Literature integration...
 

More from GESIS

10th BIR Workshop @ECIR 2020: introduction
10th  BIR Workshop @ECIR 2020: introduction10th  BIR Workshop @ECIR 2020: introduction
10th BIR Workshop @ECIR 2020: introductionGESIS
 
From closed to open access: A case study of flipped journals
From closed to open access: A case study of flipped journalsFrom closed to open access: A case study of flipped journals
From closed to open access: A case study of flipped journalsGESIS
 
Highly cited references in PLOS ONE and their in-text usage over time
Highly cited references in PLOS ONE and their in-text usage over timeHighly cited references in PLOS ONE and their in-text usage over time
Highly cited references in PLOS ONE and their in-text usage over timeGESIS
 
4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural...
4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural...4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural...
4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural...GESIS
 
Bibliometric-enhanced Information Retrieval: Connecting IR with Bibliometrics
Bibliometric-enhanced Information Retrieval: Connecting IR with BibliometricsBibliometric-enhanced Information Retrieval: Connecting IR with Bibliometrics
Bibliometric-enhanced Information Retrieval: Connecting IR with BibliometricsGESIS
 
Analyzing the network structure and gender differences of the “NKOS community”
Analyzing the network structure and gender differences of the “NKOS community”Analyzing the network structure and gender differences of the “NKOS community”
Analyzing the network structure and gender differences of the “NKOS community”GESIS
 
Searching beyond datasets in the Social Sciences
Searching beyond datasets in the Social SciencesSearching beyond datasets in the Social Sciences
Searching beyond datasets in the Social SciencesGESIS
 
Bedeutung von Text Mining am Beispiel der Sozialwissenschaften
Bedeutung von Text Mining am Beispiel der SozialwissenschaftenBedeutung von Text Mining am Beispiel der Sozialwissenschaften
Bedeutung von Text Mining am Beispiel der SozialwissenschaftenGESIS
 
Contextualised Browsing in a Digital Library’s Living Lab
Contextualised Browsing in a Digital Library’s Living LabContextualised Browsing in a Digital Library’s Living Lab
Contextualised Browsing in a Digital Library’s Living LabGESIS
 
41st European Conference on Information Retrieval (ECIR 2019)
41st European Conference on Information Retrieval (ECIR 2019)41st European Conference on Information Retrieval (ECIR 2019)
41st European Conference on Information Retrieval (ECIR 2019)GESIS
 
Offenes kollaboratives Schreiben: Eine „Open Science“-Infrastruktur am Beispi...
Offenes kollaboratives Schreiben: Eine „Open Science“-Infrastruktur am Beispi...Offenes kollaboratives Schreiben: Eine „Open Science“-Infrastruktur am Beispi...
Offenes kollaboratives Schreiben: Eine „Open Science“-Infrastruktur am Beispi...GESIS
 
A Complete Year of User Retrieval Sessions in a Social Sciences Academic Sear...
A Complete Year of User Retrieval Sessions in a Social Sciences Academic Sear...A Complete Year of User Retrieval Sessions in a Social Sciences Academic Sear...
A Complete Year of User Retrieval Sessions in a Social Sciences Academic Sear...GESIS
 
Challenges in Extracting and Managing References
Challenges in Extracting and Managing ReferencesChallenges in Extracting and Managing References
Challenges in Extracting and Managing ReferencesGESIS
 
Opening Scholarly Communication in Social Sciences by Connecting Collaborativ...
Opening Scholarly Communication in Social Sciences by Connecting Collaborativ...Opening Scholarly Communication in Social Sciences by Connecting Collaborativ...
Opening Scholarly Communication in Social Sciences by Connecting Collaborativ...GESIS
 
Measuring the usefulness of Knowledge Organization Systems in Information Ret...
Measuring the usefulness of Knowledge Organization Systems in Information Ret...Measuring the usefulness of Knowledge Organization Systems in Information Ret...
Measuring the usefulness of Knowledge Organization Systems in Information Ret...GESIS
 
Recent Advances in Bibliometric-Enhanced Information Retrieval
Recent Advances in Bibliometric-Enhanced Information RetrievalRecent Advances in Bibliometric-Enhanced Information Retrieval
Recent Advances in Bibliometric-Enhanced Information RetrievalGESIS
 
Analyzing the research output presented at European Networked Knowledge Organ...
Analyzing the research output presented at European Networked Knowledge Organ...Analyzing the research output presented at European Networked Knowledge Organ...
Analyzing the research output presented at European Networked Knowledge Organ...GESIS
 
Introduction to the 15th NKOS workshop @TPDL2016
Introduction to the 15th NKOS workshop @TPDL2016Introduction to the 15th NKOS workshop @TPDL2016
Introduction to the 15th NKOS workshop @TPDL2016GESIS
 
Recent applications of Knowledge Organization Systems
Recent applications of Knowledge Organization SystemsRecent applications of Knowledge Organization Systems
Recent applications of Knowledge Organization SystemsGESIS
 
Using co-authorship networks for author name disambiguation
Using co-authorship networks for author name disambiguationUsing co-authorship networks for author name disambiguation
Using co-authorship networks for author name disambiguationGESIS
 

More from GESIS (20)

10th BIR Workshop @ECIR 2020: introduction
10th  BIR Workshop @ECIR 2020: introduction10th  BIR Workshop @ECIR 2020: introduction
10th BIR Workshop @ECIR 2020: introduction
 
From closed to open access: A case study of flipped journals
From closed to open access: A case study of flipped journalsFrom closed to open access: A case study of flipped journals
From closed to open access: A case study of flipped journals
 
Highly cited references in PLOS ONE and their in-text usage over time
Highly cited references in PLOS ONE and their in-text usage over timeHighly cited references in PLOS ONE and their in-text usage over time
Highly cited references in PLOS ONE and their in-text usage over time
 
4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural...
4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural...4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural...
4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural...
 
Bibliometric-enhanced Information Retrieval: Connecting IR with Bibliometrics
Bibliometric-enhanced Information Retrieval: Connecting IR with BibliometricsBibliometric-enhanced Information Retrieval: Connecting IR with Bibliometrics
Bibliometric-enhanced Information Retrieval: Connecting IR with Bibliometrics
 
Analyzing the network structure and gender differences of the “NKOS community”
Analyzing the network structure and gender differences of the “NKOS community”Analyzing the network structure and gender differences of the “NKOS community”
Analyzing the network structure and gender differences of the “NKOS community”
 
Searching beyond datasets in the Social Sciences
Searching beyond datasets in the Social SciencesSearching beyond datasets in the Social Sciences
Searching beyond datasets in the Social Sciences
 
Bedeutung von Text Mining am Beispiel der Sozialwissenschaften
Bedeutung von Text Mining am Beispiel der SozialwissenschaftenBedeutung von Text Mining am Beispiel der Sozialwissenschaften
Bedeutung von Text Mining am Beispiel der Sozialwissenschaften
 
Contextualised Browsing in a Digital Library’s Living Lab
Contextualised Browsing in a Digital Library’s Living LabContextualised Browsing in a Digital Library’s Living Lab
Contextualised Browsing in a Digital Library’s Living Lab
 
41st European Conference on Information Retrieval (ECIR 2019)
41st European Conference on Information Retrieval (ECIR 2019)41st European Conference on Information Retrieval (ECIR 2019)
41st European Conference on Information Retrieval (ECIR 2019)
 
Offenes kollaboratives Schreiben: Eine „Open Science“-Infrastruktur am Beispi...
Offenes kollaboratives Schreiben: Eine „Open Science“-Infrastruktur am Beispi...Offenes kollaboratives Schreiben: Eine „Open Science“-Infrastruktur am Beispi...
Offenes kollaboratives Schreiben: Eine „Open Science“-Infrastruktur am Beispi...
 
A Complete Year of User Retrieval Sessions in a Social Sciences Academic Sear...
A Complete Year of User Retrieval Sessions in a Social Sciences Academic Sear...A Complete Year of User Retrieval Sessions in a Social Sciences Academic Sear...
A Complete Year of User Retrieval Sessions in a Social Sciences Academic Sear...
 
Challenges in Extracting and Managing References
Challenges in Extracting and Managing ReferencesChallenges in Extracting and Managing References
Challenges in Extracting and Managing References
 
Opening Scholarly Communication in Social Sciences by Connecting Collaborativ...
Opening Scholarly Communication in Social Sciences by Connecting Collaborativ...Opening Scholarly Communication in Social Sciences by Connecting Collaborativ...
Opening Scholarly Communication in Social Sciences by Connecting Collaborativ...
 
Measuring the usefulness of Knowledge Organization Systems in Information Ret...
Measuring the usefulness of Knowledge Organization Systems in Information Ret...Measuring the usefulness of Knowledge Organization Systems in Information Ret...
Measuring the usefulness of Knowledge Organization Systems in Information Ret...
 
Recent Advances in Bibliometric-Enhanced Information Retrieval
Recent Advances in Bibliometric-Enhanced Information RetrievalRecent Advances in Bibliometric-Enhanced Information Retrieval
Recent Advances in Bibliometric-Enhanced Information Retrieval
 
Analyzing the research output presented at European Networked Knowledge Organ...
Analyzing the research output presented at European Networked Knowledge Organ...Analyzing the research output presented at European Networked Knowledge Organ...
Analyzing the research output presented at European Networked Knowledge Organ...
 
Introduction to the 15th NKOS workshop @TPDL2016
Introduction to the 15th NKOS workshop @TPDL2016Introduction to the 15th NKOS workshop @TPDL2016
Introduction to the 15th NKOS workshop @TPDL2016
 
Recent applications of Knowledge Organization Systems
Recent applications of Knowledge Organization SystemsRecent applications of Knowledge Organization Systems
Recent applications of Knowledge Organization Systems
 
Using co-authorship networks for author name disambiguation
Using co-authorship networks for author name disambiguationUsing co-authorship networks for author name disambiguation
Using co-authorship networks for author name disambiguation
 

Recently uploaded

Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 

Recently uploaded (20)

Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 

Recent advances in the project EXCITE – Extraction of Citations from PDF Documents

  • 1. Recent advances in the project EXCITE – Extraction of Citations from PDF Documents Philipp Mayr GESIS – Leibniz Institute for the Social Sciences 2018-09-03, Bologna http://excite.west.uni-koblenz.de/ #opencitations #WOOC2018
  • 2. EXCITE team • PI: Steffen Staab (WeST), Philipp Mayr (GESIS) • Researchers: Behnam Ghavimi, Zeyd Boukhers • Developer: Azam Hosseini • Collaborators: Heinrich Hartmann, Martin Körner 2
  • 3. EXCITE: Background 3 • We run productive search systems and research in information retrieval, recommendation systems and knowledge discovery − SSOAR https://www.gesis.org/ssoar/ (48K full texts) − GESIS Search https://search.gesis.org/ (242K data sets + further materials) • National literatures are not well represented in major citation indices (like WoS, Scopus) • Shortage of citation data for the international and German social sciences (Social Science Citation Index is not enough) • Open availability of citation data is improving but still very limited
  • 4. EXCITE: Main objectives • Develop web services to allow third-parties to extract citation data from arbitrary publications • Develop a toolchain of reference extraction and matching software • Integrate and publish the extracted citation data in reusable formats • Narrow the supply gap of citation data in the social sciences 4
  • 5. EXCITE: toolchain (1) Extraction of text from source documents (PDFs), (2) Identification of reference sections and other forms of embedded reference information within the text, (3) Segmentation of individual references into its constituent fields such as author, title, etc., (4) Matching of reference strings against bibliographic databases, (5) Export and publication of matched references to reusable formats (convert to OCC) 5 Training data
  • 6. EXCITE: recent advances • All components are available as reusable components, see https://github.com/exciteproject • EXparser – tool to extracting and segment references (see talk by Zeyd Boukhers: “A Generic Approach for Reference Extraction from PDF Documents” tomorrow) • Annotators and Gold standards – tools to annotate references and different gold standards to train and test the tools • EXmatcher – tool to match references to bibliographic databases which base on solr, elasticsearch • EXpublisher – tool to convert EXCITE data to JSON-LD • Public demo http://excite.west.uni-koblenz.de/excite • Extracted and matched data in productive systems, e.g. https://search.gesis.org/publication/gesis-ssoar-10004 6
  • 10. EXCITE: Demo 10 Uploading File Display References Result http://excite.west.uni-koblenz.de/excite
  • 11. EXmatcher • Input are segmented reference strings with probabilities for each segment • Output are matched document ids 11
  • 12. EXmatcher hybrid approach - combination of blocking techniques and a classifier algorithm Input: strings, segments, probabilities 12
  • 13. EXPublisher • Converting extracted and matched data to the OCC ontology (incl. EXCITE Identifier in the OCI) • Enrichment of the reference information by external metadata 13https://github.com/exciteproject/EXpublisher EXMatcher and ExPublisher will be included in the demo soon!
  • 14. Next steps in EXCITE • EXCITE references to be published in OpenCitationCorpus • Public EXCITE API for testing (to be public soon) • Reference Matching to Crossref to be added in the demo/API • Gold Standards (German/English/Reference Section/Footnotes) to be completed • Extractions models for German and English texts • More Social Science data to be processed and released • We try to process ArXiv for OCC 14
  • 15. Thank you Contact: Dr Philipp Mayr GESIS - Leibniz Institute for the Social Sciences, Germany Email: philipp.mayr@gesis.org Twitter: @philipp_mayr • Project website http://excite.west.uni-koblenz.de/ • EXCITE mailing list: Subscribe to our Newsletter. • Demo http://excite.west.uni-koblenz.de/excite • GIT https://github.com/exciteproject/ 15