Automatic Indexing of Bibliographic
Metadata: The AgroTagger use case
Fabrizio Celli – Food and Agriculture
Organization of the UN - 27th March 2014
Before Starting…
• AGROVOC is FAO's 30-year-old multilingual vocabulary
containing more than 32 000 concepts in 22 languages
(http://aims.fao.org/standards/agrovoc/about)
• AGRIS (http://agris.fao.org/) is a database of more than 7
million bibliographic references in agriculture
– A collaborative network of more than 150 institutions from 65
countries
– AGRIS bibliographic metadata are enhanced with AGROVOC
descriptors, which is very important when adopting Linked Open
Data (LOD) technologies (http://agris.fao.org/content/about)
• Both are exposed as RDF
Automatic Indexing of Bibliographic
Metadata: The AgroTagger use case -
Fabrizio Celli - 27/03/2014
Outline
• Disambiguation
• How does it work?
• Use Case 1: indexing AGRIS resources
• Use Case 2: crawling the Web
Disambiguation
• At a high level of abstraction, AgroTagger is a
keyword extractor that uses the AGROVOC
thesaurus to enhance bibliographic resources
• The name AgroTagger may refer to different tools:
– MIMOS-hosted IIT Kanpur AgroTagger: a tool developed in
collaboration with the Indian Institute of Technology Kanpur
(IITK) in 2010, built on top of the popular Keyphrase
Extraction Algorithm (KEA, http://www.nzdl.org/Kea/)
Disambiguation (2)
– A Web Application developed by MIMOS in
collaboration with IITK and FAO
(http://kt.mimos.my/AgroTagger/)
• Built on top of the IITK tagging service
• It generates keywords as RDF triples
• It builds a tag cloud showing the most commonly
extracted keywords
• More information on AIMS:
http://aims.fao.org/agrotagger
Disambiguation (3)
• «AgroTagger» also refers to a command-line
application based on MAUI
(https://code.google.com/p/maui-indexer/)
• There is neither a graphical interface nor a Web service
on top of the application
• It is a Java API
• This is the AgroTagger described in this presentation!
MAUI
• Maui is named after the Polynesian mythological hero
and demi-god, who would transform himself into
different kinds of birds to perform many of his exploits
• Similarly, the Maui algorithm combines two software
tools named after New Zealand native birds: Kea
(the keyphrase extraction algorithm) and Weka (the
machine learning toolkit used to create the topic indexing
model from documents with topics assigned by people
and to apply it to new documents)
• Maui automatically identifies main topics in text
documents
How does it work?
• The purpose of the application is to index some Web
resources (i.e. URLs) with the AGROVOC thesaurus
• The application can accept two different inputs:
– A text file with a list of URLs
– The output file of an Apache Nutch Web crawler (which
contains a list of discovered URLs, but in a specific format)
• The output is a set of connections between input URLs
and some extracted AGROVOC URIs
– It can be a simple text file or a set of triples (N-Triples
serialization)
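The N-Triples output described above can be sketched as follows. This is only an illustration under stated assumptions, not the AgroTagger implementation: the function name `to_ntriples`, the choice of `dcterms:subject` as the linking predicate, and the placeholder concept identifier are all hypothetical, since the slides do not say which property the tool emits.

```python
# Hypothetical sketch: serialize "source URL -> AGROVOC URIs" connections
# as N-Triples. The predicate below (dcterms:subject) is an assumption.
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def to_ntriples(connections):
    """Return one N-Triples line per (source URL, AGROVOC URI) pair.

    `connections` maps each source URL to a set of AGROVOC URIs.
    URIs are sorted so that the output is deterministic.
    """
    lines = []
    for url, agrovoc_uris in connections.items():
        for uri in sorted(agrovoc_uris):
            lines.append(f"<{url}> <{DCTERMS_SUBJECT}> <{uri}> .")
    return "\n".join(lines)


example = {
    "http://example.org/report.pdf": {
        # Placeholder identifier, not a real AGROVOC concept.
        "http://aims.fao.org/aos/agrovoc/c_0000001",
    }
}
print(to_ntriples(example))
```

Each output line is a complete triple, so a file produced this way could in principle be loaded directly into a triplestore.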
[Diagram: a text file with a list of URLs of Web resources (input) -> AgroTagger -> output]
How does it work?
• For each URL in the input file
– Download the resource
– Run the MAUI indexer trained with AGROVOC (the
application was trained with 780 bibliographic
resources manually indexed by FAO cataloguers)
– Update the output file with discovered
connections (source URL -> set of AGROVOC URIs)
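The loop above can be sketched roughly as follows. This is a hedged illustration, not the AgroTagger source: `read_url_list`, `download`, and `index_urls` are hypothetical names, and `indexer` is a stand-in for the trained MAUI model (any callable that takes the downloaded bytes and returns AGROVOC URIs). Only the plain URL-list input is covered; the Nutch output format is not handled here.

```python
# Hypothetical sketch of the per-URL loop; not the AgroTagger source code.
from urllib.request import urlopen


def read_url_list(text):
    """Parse a plain-text URL list: one URL per line, blank lines ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def download(url, timeout=30):
    """Fetch the raw bytes of a Web resource."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def index_urls(urls, indexer, fetch=download):
    """Map each URL to the set of AGROVOC URIs returned by `indexer`.

    `indexer` stands in for the trained MAUI model. URLs whose download
    fails with a network or I/O error (OSError) are skipped.
    """
    connections = {}
    for url in urls:
        try:
            content = fetch(url)
        except OSError:
            continue  # unreachable resource: leave it out of the output
        connections[url] = set(indexer(content))
    return connections
```

Passing the download function as a parameter keeps the loop testable without network access; the real application may be organised quite differently.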
Use Case 1:
indexing AGRIS
resources
AGRIS
• A collection of more than 7 million
bibliographic references in agriculture
• AGRIS records come with AGROVOC
descriptors
• An RDF-aware system
– the AGRIS database is exposed as RDF
– AGROVOC is the backbone to interlink to external
sources of information (statistics, distribution
maps, country profiles, germplasm data…)
The problem
• Some AGRIS records have not been
indexed with AGROVOC keywords
• When AGROVOC keywords are not available, an
AGRIS record cannot be interlinked to external
sources of information
The solution
Not yet implemented!
An example
• In 2012 AGRIS received 28,582 bibliographic
records from the World Bank
• All records came with a full-text link, but with no
associated keywords
• By running the AgroTagger we were able to
assign from 4 to 10 AGROVOC keywords to
each World Bank resource
• We manually evaluated a random sample of the
output, with good results!
[Figure: AgroTagger output]
Use Case 2:
crawling the Web
The setting
• Objective: discovering Web resources in
agriculture and interlinking them to AGRIS
records
• Tools:
– Apache Nutch Crawler
– AgroTagger Java API
• Final Goal: when the system displays an AGRIS
record, a list of related Web resources should
be available to the user
The algorithm
• The Apache Nutch Web Crawler, after some
tuning, crawls the Web starting from a list of
preselected URLs
– The output of the Crawler (a list of discovered URLs) is
given to the AgroTagger
• The AgroTagger assigns some AGROVOC URIs to
each URL discovered by the Crawler
• AGRIS records are interlinked to these URLs if
they share at least 5 AGROVOC URIs (the
number has to be tuned)
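The interlinking rule above can be sketched as a set intersection with a tunable threshold. This is a minimal illustration, not the actual AGRIS code: the function name `interlink`, the `min_common` parameter, and the dictionary-of-sets data layout are assumptions made here.

```python
# Hypothetical sketch of the interlinking step; not the actual AGRIS code.
def interlink(agris_records, web_resources, min_common=5):
    """Yield (AGRIS record, Web URL, shared count) for every pair that
    shares at least `min_common` AGROVOC URIs.

    Both arguments map an identifier to a set of AGROVOC URIs.
    This is a naive all-pairs comparison.
    """
    for record_id, record_uris in agris_records.items():
        for url, url_uris in web_resources.items():
            shared = len(record_uris & url_uris)
            if shared >= min_common:
                yield record_id, url, shared
```

The all-pairs loop is quadratic in the number of items, so at the scale of millions of records something like an inverted index from AGROVOC URI to records would probably be needed; that design is not described in the slides.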
First test: some numbers
• A first test started from the URL:
http://ageconsearch.umn.edu/
• 101,000 distinct Web resources were
discovered by the Web crawler and associated with
AGROVOC URIs by the AgroTagger
• An algorithm tried to match AGRIS data to these
resources
– E.g. the resource
«http://www.waeaonline.org/WEForum/WEF-Vol.9-No.2-Fall2010.pdf»
was associated with the AGRIS
record «http://agris.fao.org/aos/records/US7938594»
First test: some numbers (2)
Number of AGRIS records    Common AGROVOC URIs between AGRIS    Number of associations
                           and the output of the Crawler
900 K                      3                                    17 million
530 K                      4                                    1.9 million
2.3 million                5                                    1.27 million
Future
• Further qualitative and quantitative tests
• Optimization of the algorithm to run faster
• Tuning of the physical infrastructure
• Complete automation of procedures (e.g. the
output goes directly to a triplestore)
• Reach the final goal: when the system displays
an AGRIS record, a list of related Web
resources is available to the user
Thank you !

More Related Content

Similar to Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

Agris (agricultural information system)
Agris (agricultural information system)Agris (agricultural information system)
Agris (agricultural information system)yashir16
 
Web services and the Development of Semantic Applications
Web services and the Development of Semantic ApplicationsWeb services and the Development of Semantic Applications
Web services and the Development of Semantic ApplicationsTrish Whetzel
 
Developing a network of content providers: The case of Organic.Edunet
Developing a network of content providers: The case of Organic.EdunetDeveloping a network of content providers: The case of Organic.Edunet
Developing a network of content providers: The case of Organic.EdunetVassilis Protonotarios
 
Bioschemas Workshop
Bioschemas WorkshopBioschemas Workshop
Bioschemas WorkshopNiall Beard
 
2007 08 26 Dc Keynote Keizer
2007 08 26 Dc Keynote Keizer2007 08 26 Dc Keynote Keizer
2007 08 26 Dc Keynote KeizerJohannes Keizer
 
Web Crawler For Mining Web Data
Web Crawler For Mining Web DataWeb Crawler For Mining Web Data
Web Crawler For Mining Web DataIRJET Journal
 
AnalogIST/ezPAARSE: Analysing Locally Gathered Logfiles to Determine Users’ A...
AnalogIST/ezPAARSE: Analysing Locally Gathered Logfiles to Determine Users’ A...AnalogIST/ezPAARSE: Analysing Locally Gathered Logfiles to Determine Users’ A...
AnalogIST/ezPAARSE: Analysing Locally Gathered Logfiles to Determine Users’ A...LIBER Europe
 
Global Information Systems for Plant Genetic Resources (2009)
Global Information Systems for Plant Genetic Resources (2009)Global Information Systems for Plant Genetic Resources (2009)
Global Information Systems for Plant Genetic Resources (2009)Dag Endresen
 
Presentation at the VIVO 2011 conference
Presentation at the VIVO 2011 conferencePresentation at the VIVO 2011 conference
Presentation at the VIVO 2011 conferenceJohannes Keizer
 
Prototype Crop Wild Relatives Portal, at the IMC Meeting (2007)
Prototype Crop Wild Relatives Portal, at the IMC Meeting (2007)Prototype Crop Wild Relatives Portal, at the IMC Meeting (2007)
Prototype Crop Wild Relatives Portal, at the IMC Meeting (2007)Dag Endresen
 
Jisc Publications Router: Delivering Open Access Content to Institutions
Jisc Publications Router: Delivering Open Access Content to InstitutionsJisc Publications Router: Delivering Open Access Content to Institutions
Jisc Publications Router: Delivering Open Access Content to InstitutionsEDINA, University of Edinburgh
 
Introduction to Big data
Introduction to Big dataIntroduction to Big data
Introduction to Big datacthanopoulos
 
App db egi.tf.2013.v2
App db egi.tf.2013.v2App db egi.tf.2013.v2
App db egi.tf.2013.v2Nuno Ferreira
 
Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011Juan Sequeda
 
Global RDF Descriptors for Germplasm Data
Global RDF Descriptors for Germplasm DataGlobal RDF Descriptors for Germplasm Data
Global RDF Descriptors for Germplasm DataVassilis Protonotarios
 

Similar to Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase (20)

Agris (agricultural information system)
Agris (agricultural information system)Agris (agricultural information system)
Agris (agricultural information system)
 
Web services and the Development of Semantic Applications
Web services and the Development of Semantic ApplicationsWeb services and the Development of Semantic Applications
Web services and the Development of Semantic Applications
 
Developing a network of content providers: The case of Organic.Edunet
Developing a network of content providers: The case of Organic.EdunetDeveloping a network of content providers: The case of Organic.Edunet
Developing a network of content providers: The case of Organic.Edunet
 
Bioschemas Workshop
Bioschemas WorkshopBioschemas Workshop
Bioschemas Workshop
 
2007 08 26 Dc Keynote Keizer
2007 08 26 Dc Keynote Keizer2007 08 26 Dc Keynote Keizer
2007 08 26 Dc Keynote Keizer
 
Web Crawler For Mining Web Data
Web Crawler For Mining Web DataWeb Crawler For Mining Web Data
Web Crawler For Mining Web Data
 
AnalogIST/ezPAARSE: Analysing Locally Gathered Logfiles to Determine Users’ A...
AnalogIST/ezPAARSE: Analysing Locally Gathered Logfiles to Determine Users’ A...AnalogIST/ezPAARSE: Analysing Locally Gathered Logfiles to Determine Users’ A...
AnalogIST/ezPAARSE: Analysing Locally Gathered Logfiles to Determine Users’ A...
 
AGRIS: an RDF-aware system in the agricultural domain
AGRIS: an RDF-aware system in the agricultural domainAGRIS: an RDF-aware system in the agricultural domain
AGRIS: an RDF-aware system in the agricultural domain
 
Global Information Systems for Plant Genetic Resources (2009)
Global Information Systems for Plant Genetic Resources (2009)Global Information Systems for Plant Genetic Resources (2009)
Global Information Systems for Plant Genetic Resources (2009)
 
Presentation at the VIVO 2011 conference
Presentation at the VIVO 2011 conferencePresentation at the VIVO 2011 conference
Presentation at the VIVO 2011 conference
 
GBIF Work Programme 2016 Update
GBIF Work Programme 2016 UpdateGBIF Work Programme 2016 Update
GBIF Work Programme 2016 Update
 
Prototype Crop Wild Relatives Portal, at the IMC Meeting (2007)
Prototype Crop Wild Relatives Portal, at the IMC Meeting (2007)Prototype Crop Wild Relatives Portal, at the IMC Meeting (2007)
Prototype Crop Wild Relatives Portal, at the IMC Meeting (2007)
 
Jisc Publications Router
Jisc Publications RouterJisc Publications Router
Jisc Publications Router
 
Jisc Publications Router: Delivering Open Access Content to Institutions
Jisc Publications Router: Delivering Open Access Content to InstitutionsJisc Publications Router: Delivering Open Access Content to Institutions
Jisc Publications Router: Delivering Open Access Content to Institutions
 
Introduction to Big data
Introduction to Big dataIntroduction to Big data
Introduction to Big data
 
App db egi.tf.2013.v2
App db egi.tf.2013.v2App db egi.tf.2013.v2
App db egi.tf.2013.v2
 
An approach for knowledge-driven product, process and resource mappings for a...
An approach for knowledge-driven product, process and resource mappings for a...An approach for knowledge-driven product, process and resource mappings for a...
An approach for knowledge-driven product, process and resource mappings for a...
 
Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011
 
Global RDF Descriptors for Germplasm Data
Global RDF Descriptors for Germplasm DataGlobal RDF Descriptors for Germplasm Data
Global RDF Descriptors for Germplasm Data
 
AKstem Service: Supporting the AGRIS Network
AKstem Service: Supporting the AGRIS NetworkAKstem Service: Supporting the AGRIS Network
AKstem Service: Supporting the AGRIS Network
 

More from AIMS (Agricultural Information Management Standards)

More from AIMS (Agricultural Information Management Standards) (20)

Linked Data Competency Index : Mapping the field for teachers and learners
 Linked Data Competency Index : Mapping the field for teachers and learners Linked Data Competency Index : Mapping the field for teachers and learners
Linked Data Competency Index : Mapping the field for teachers and learners
 
Metadata as Standard: improving Interoperability through the Research Data Al...
Metadata as Standard: improving Interoperability through the Research Data Al...Metadata as Standard: improving Interoperability through the Research Data Al...
Metadata as Standard: improving Interoperability through the Research Data Al...
 
Assigning Digital Object Identifiers (DOIs) to Plant Genetic Resources
Assigning Digital Object Identifiers (DOIs) to Plant Genetic ResourcesAssigning Digital Object Identifiers (DOIs) to Plant Genetic Resources
Assigning Digital Object Identifiers (DOIs) to Plant Genetic Resources
 
VocBench 3: some insights on the forthcoming release
VocBench 3: some insights on the forthcoming release VocBench 3: some insights on the forthcoming release
VocBench 3: some insights on the forthcoming release
 
The case for Digital Objects Identifiers (DOIs) in support of research activi...
The case for Digital Objects Identifiers (DOIs) in support of research activi...The case for Digital Objects Identifiers (DOIs) in support of research activi...
The case for Digital Objects Identifiers (DOIs) in support of research activi...
 
Webinar@AIMS_FAIR Principles and Data Management Planning
Webinar@AIMS_FAIR Principles and Data Management PlanningWebinar@AIMS_FAIR Principles and Data Management Planning
Webinar@AIMS_FAIR Principles and Data Management Planning
 
Webinar@ASIRA: How to foster openness from an academic library
Webinar@ASIRA: How to foster openness from an academic library Webinar@ASIRA: How to foster openness from an academic library
Webinar@ASIRA: How to foster openness from an academic library
 
Webinar@ASIRA: A Practitioners Approach to Open Data for Agricultural Research
Webinar@ASIRA: A Practitioners Approach to Open Data for Agricultural Research Webinar@ASIRA: A Practitioners Approach to Open Data for Agricultural Research
Webinar@ASIRA: A Practitioners Approach to Open Data for Agricultural Research
 
Webinar@ASIRA: AuthorAID: Supporting Developing Country Researchers in Publis...
Webinar@ASIRA: AuthorAID: Supporting Developing Country Researchers in Publis...Webinar@ASIRA: AuthorAID: Supporting Developing Country Researchers in Publis...
Webinar@ASIRA: AuthorAID: Supporting Developing Country Researchers in Publis...
 
Webinar@ASIRA: Introduction to Using TEEAL to Access Agricultural Journals
Webinar@ASIRA: Introduction to Using TEEAL to Access Agricultural Journals Webinar@ASIRA: Introduction to Using TEEAL to Access Agricultural Journals
Webinar@ASIRA: Introduction to Using TEEAL to Access Agricultural Journals
 
Webinar@ASIRA: Access to Global Online Research in Agriculture (AGORA)
Webinar@ASIRA: Access to Global Online Research in Agriculture (AGORA) Webinar@ASIRA: Access to Global Online Research in Agriculture (AGORA)
Webinar@ASIRA: Access to Global Online Research in Agriculture (AGORA)
 
Webinar@ASIRA: AGRIS: Providing Access to Agricultural Research and Technolog...
Webinar@ASIRA: AGRIS: Providing Access to Agricultural Research and Technolog...Webinar@ASIRA: AGRIS: Providing Access to Agricultural Research and Technolog...
Webinar@ASIRA: AGRIS: Providing Access to Agricultural Research and Technolog...
 
Webinar@ASIRA: New Roles for Changing Times UNAM Subject Librarians in Context
Webinar@ASIRA: New Roles for Changing Times UNAM Subject Librarians in Context Webinar@ASIRA: New Roles for Changing Times UNAM Subject Librarians in Context
Webinar@ASIRA: New Roles for Changing Times UNAM Subject Librarians in Context
 
Webinar@ASIRA: Emerging Themes in Agricultural Research Publishing
Webinar@ASIRA: Emerging Themes in Agricultural Research PublishingWebinar@ASIRA: Emerging Themes in Agricultural Research Publishing
Webinar@ASIRA: Emerging Themes in Agricultural Research Publishing
 
Webinar@AIMS: OKAD & F1000Research: a very different approach to publishing a...
Webinar@AIMS: OKAD & F1000Research: a very different approach to publishing a...Webinar@AIMS: OKAD & F1000Research: a very different approach to publishing a...
Webinar@AIMS: OKAD & F1000Research: a very different approach to publishing a...
 
Research4Life: La bibliothèque qui ouvre ses portes
Research4Life: La bibliothèque qui ouvre ses portesResearch4Life: La bibliothèque qui ouvre ses portes
Research4Life: La bibliothèque qui ouvre ses portes
 
Publishing skos concept schemes with skosmos
Publishing skos concept schemes with skosmosPublishing skos concept schemes with skosmos
Publishing skos concept schemes with skosmos
 
Research4Life: La biblioteca que abre puertas
Research4Life: La biblioteca que abre puertasResearch4Life: La biblioteca que abre puertas
Research4Life: La biblioteca que abre puertas
 
Research4Life: The library that opens doors
Research4Life: The library that opens doorsResearch4Life: The library that opens doors
Research4Life: The library that opens doors
 
Webinar@AIMS: Perspective on Big Data in the CGIAR
Webinar@AIMS: Perspective on Big Data in the CGIARWebinar@AIMS: Perspective on Big Data in the CGIAR
Webinar@AIMS: Perspective on Big Data in the CGIAR
 

Recently uploaded

A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

  • 1. Automatic Indexing of Bibliographic Metadata: The AgroTagger use case Fabrizio Celli – Food and Agriculture Organization of the UN - 27th March 2014
  • 2. Before Starting… • AGROVOC is the FAO 30 years old multilingual vocabulary containing more than 32 000 concepts in 22 languages (http://aims.fao.org/standards/agrovoc/about ) • AGRIS (http://agris.fao.org/ ) is a database of more than 7 million bibliographic references in Agriculture – A collaborative network of more than 150 institutions from 65 countries – AGRIS bibliographic metadata are enhanced by AGROVOC descriptors, which is very important in the context of adopting LOD technologies (http://agris.fao.org/content/about ) • Both are exposed as RDF Automatic Indexing of Bibliographic Metadata: The AgroTagger use case - Fabrizio Celli - 27/03/2014
  • 3. Outline • Disambiguation • How does it work? • Use Case 1: indexing AGRIS resources • Use Case 2: crawling the Web
  • 4. Disambiguation • At a high level of abstraction, AgroTagger is a keyword extractor that uses the AGROVOC thesaurus to enhance bibliographic resources • The name AgroTagger may refer to different tools: – MIMOS-hosted IIT Kanpur Agrotagger: a tool developed in collaboration with the Indian Institute of Technology Kanpur (IITK) in 2010, built on top of the popular Keyphrase Extraction Algorithm (KEA, http://www.nzdl.org/Kea/)
  • 5. Disambiguation (2) – A Web application developed by MIMOS in collaboration with IITK and FAO (http://kt.mimos.my/AgroTagger/) • Built on top of the IITK tagging service • It generates keywords as RDF triples • It builds a tag cloud showing the most commonly extracted keywords • More information on AIMS: http://aims.fao.org/agrotagger
  • 6. Disambiguation (3) • «AgroTagger» also refers to a command-line application based on MAUI (https://code.google.com/p/maui-indexer/) • It has neither a graphical interface nor a Web service on top of it • It is a Java API • This is the AgroTagger covered in this presentation!
  • 7. MAUI • Maui is named after the Polynesian mythological hero and demigod who would transform himself into different kinds of birds to perform many of his exploits • Similarly, the Maui algorithm assimilates two software tools named after New Zealand native birds: Kea (the keyphrase extraction algorithm) and Weka (the machine learning toolkit used to create the topic indexing model from documents with topics assigned by people and to apply it to new documents) • Maui automatically identifies the main topics in text documents
  • 8. How does it work? • The purpose of the application is to index Web resources (i.e. URLs) with the AGROVOC thesaurus • The application accepts two different inputs: – A text file with a list of URLs – The output file of an Apache Nutch Web crawler (which contains a list of discovered URLs, but in a specific format) • The output is a set of connections between the input URLs and the extracted AGROVOC URIs – It can be a simple text file or a set of triples (N-Triples serialization)
  • 9. [Diagram] A text file with a list of URLs of Web resources (input) → AgroTagger → output
  • 10. How does it work? • For each URL in the input file: – Download the resource – Run the MAUI indexer trained with AGROVOC (the application was trained on 780 bibliographic resources manually indexed by FAO cataloguers) – Update the output file with the discovered connections (source URL -> set of AGROVOC URIs)
  • 11. Use Case 1: indexing AGRIS resources
  • 12. AGRIS • A collection of more than 7 million bibliographic references in agriculture • AGRIS records come with AGROVOC descriptors • An RDF-aware system – the AGRIS database is exposed as RDF – AGROVOC is the backbone to interlink to external sources of information (statistics, distribution maps, country profiles, germplasm data…)
  • 13. (no text on this slide)
  • 14. The problem • Some AGRIS records have not been indexed with AGROVOC keywords • When AGROVOC keywords are not available, an AGRIS record cannot be interlinked to external sources of information
  • 15. The solution • Not yet implemented!
  • 16. An example • In 2012 AGRIS received 28,582 bibliographic records from the World Bank • All records came with a full-text link, but with no associated keywords • By running the AgroTagger we were able to assign from 4 to 10 AGROVOC keywords to each World Bank resource • We manually evaluated the quality of a random sample of the output, with good results!
  • 17. AgroTagger output
  • 18. Use Case 2: crawling the Web
  • 19. The setting • Objective: discovering Web resources in agriculture and interlinking them to AGRIS records • Tools: – Apache Nutch crawler – AgroTagger Java API • Final goal: when the system displays an AGRIS record, a list of related Web resources should be available to the user
  • 20. The algorithm • The Apache Nutch Web crawler, after tuning, crawls the Web starting from a list of preselected URLs – The output of the crawler (a list of discovered URLs) is given to the AgroTagger • The AgroTagger assigns some AGROVOC URIs to each URL discovered by the crawler • AGRIS records are interlinked to these URLs if they share at least 5 AGROVOC URIs (the threshold has to be tuned)
  • 21. First test: some numbers • A first test started from the URL http://ageconsearch.umn.edu/ • 101,000 distinct Web resources were discovered by the Web crawler and associated with AGROVOC URIs by the AgroTagger • An algorithm tried to match AGRIS data to these resources – E.g. the resource «http://www.waeaonline.org/WEForum/WEF-Vol.9-No.2-Fall2010.pdf» was associated with the AGRIS record «http://agris.fao.org/aos/records/US7938594»
  • 22. First test: some numbers (2)
      Number of AGRIS records | Common AGROVOC URIs between AGRIS and the output of the Crawler | Number of associations
      900 K | 3 | 17 MLN
      530 K | 4 | 1,9 MLN
      2,3 MLN | 5 | 1,27 MLN
  • 23. Future • Further qualitative/quantitative tests • Optimization of the algorithm to run faster • Tuning of the physical infrastructure • Complete automation of procedures (e.g. the output goes directly to a triplestore) • Reach the final goal: when the system displays an AGRIS record, a list of related Web resources is available to the user
  • 24. Thank you!
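
As a rough illustration of two steps in the slides, the sketch below serializes AgroTagger-style output (source URL -> set of AGROVOC URIs, slides 8 and 10) as N-Triples and applies the matching rule from slide 20 (link an AGRIS record to a Web resource when they share at least 5 AGROVOC URIs). It is not the AgroTagger Java API: the function names, the `dcterms:subject` predicate, the `example.org` URLs and the concept identifiers are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumption: dcterms:subject links a resource to its AGROVOC concepts.
# The slides do not name the predicate that AgroTagger actually emits.
SUBJECT = "http://purl.org/dc/terms/subject"


def to_ntriples(tags):
    """Serialize {source URL: set of AGROVOC URIs} as N-Triples lines."""
    return [
        f"<{url}> <{SUBJECT}> <{concept}> ."
        for url in sorted(tags)
        for concept in sorted(tags[url])
    ]


def match(records, resources, min_common=5):
    """Link each AGRIS record to the Web resources that share at least
    min_common AGROVOC URIs with it. Both arguments map a URI to a set
    of AGROVOC concept URIs."""
    # Inverted index: concept URI -> resources tagged with that concept,
    # so only resources sharing at least one concept are ever counted.
    by_concept = defaultdict(set)
    for url, concepts in resources.items():
        for concept in concepts:
            by_concept[concept].add(url)

    links = {}
    for record, concepts in records.items():
        shared = defaultdict(int)
        for concept in concepts:
            for url in by_concept.get(concept, ()):
                shared[url] += 1
        hits = sorted(url for url, n in shared.items() if n >= min_common)
        if hits:
            links[record] = hits
    return links


def concept(n):
    # Hypothetical concept identifiers, not real AGROVOC entries.
    return f"http://aims.fao.org/aos/agrovoc/c_{n}"


# Hypothetical crawler output tagged by the indexer, and one AGRIS record.
resources = {
    "http://example.org/a.pdf": {concept(n) for n in range(1, 7)},
    "http://example.org/b.pdf": {concept(n) for n in range(5, 9)},
}
records = {"http://example.org/agris/record1": {concept(n) for n in range(2, 8)}}

links = match(records, resources)                     # threshold of 5
loose_links = match(records, resources, min_common=3)
triples = to_ntriples(resources)
```

With the default threshold only `a.pdf` is linked to the record (5 shared concepts), while lowering `min_common` to 3 also links `b.pdf`. This mirrors the table on slide 22, where the number of associations drops as the required number of common URIs rises.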

Editor's Notes

  1. Tuning parameters, both for the crawler and for the matching algorithm • Parallelization • Cloud