The webinar presents AgroTagger, a keyword extractor based on MAUI that uses the AGROVOC thesaurus as its set of allowable keywords. It processes the full text of publications and extracts related AGROVOC keywords.
Automatic Indexing of Bibliographic Metadata: The AgroTagger Use Case
1. Automatic Indexing of Bibliographic
Metadata: The AgroTagger use case
Fabrizio Celli – Food and Agriculture
Organization of the UN - 27th March 2014
2. Before Starting…
• AGROVOC is FAO's 30-year-old multilingual vocabulary,
containing more than 32,000 concepts in 22 languages
(http://aims.fao.org/standards/agrovoc/about)
• AGRIS (http://agris.fao.org/) is a database of more than 7
million bibliographic references in agriculture
– A collaborative network of more than 150 institutions from 65
countries
– AGRIS bibliographic metadata are enriched with AGROVOC
descriptors, which is important for adopting Linked Open Data (LOD)
technologies (http://agris.fao.org/content/about)
• Both are exposed as RDF
Automatic Indexing of Bibliographic
Metadata: The AgroTagger use case -
Fabrizio Celli - 27/03/2014
3. Outline
• Disambiguation
• How does it work?
• Use Case 1: indexing AGRIS resources
• Use Case 2: crawling the Web
4. Disambiguation
• At a high level of abstraction, AgroTagger is a
keyword extractor that uses the AGROVOC
thesaurus to enhance bibliographic resources
• The name AgroTagger may refer to different tools:
– MIMOS-hosted IIT Kanpur AgroTagger: a tool developed in
collaboration with the Indian Institute of Technology Kanpur
(IITK) in 2010, built on top of the popular Keyphrase
Extraction Algorithm (KEA, http://www.nzdl.org/Kea/)
5. Disambiguation (2)
– A Web Application developed by MIMOS in
collaboration with IITK and FAO
(http://kt.mimos.my/AgroTagger/)
• built on top of the IITK tagging service
• It generates keywords as RDF triples
• It builds a tag cloud showing the most commonly
extracted keywords
• More information on AIMS:
http://aims.fao.org/agrotagger
6. Disambiguation (3)
• «AgroTagger» also refers to a command-line
application based on MAUI
(https://code.google.com/p/maui-indexer/)
• There is neither a graphical interface nor a Web service
on top of the application
• It is a Java API
• This is the AgroTagger covered in this presentation!
7. MAUI
• Maui is named after the Polynesian mythological hero
and demi-god, who would transform himself into
different kinds of birds to perform many of his exploits
• Similarly, the Maui algorithm combines two software
tools named after New Zealand native birds: Kea
(a keyphrase extraction algorithm) and Weka (a
machine learning toolkit, used to create the topic indexing
model from documents with topics assigned by people
and to apply it to new documents)
• Maui automatically identifies main topics in text
documents
8. How does it work?
• The purpose of the application is to index Web
resources (i.e. URLs) with the AGROVOC thesaurus
• The application accepts two different inputs:
– A text file with a list of URLs
– The output file of an Apache Nutch Web crawler (which
contains a list of discovered URLs, but in a specific format)
• The output is a set of connections between input URLs
and the extracted AGROVOC URIs
– It can be a simple text file or a set of triples (N-Triples
serialization)
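A minimal sketch of the simplest input and of the N-Triples output, written in Python for brevity (the application itself is a Java API). The slides do not say which predicate links a URL to an AGROVOC URI, so `dcterms:subject` is an assumption here, and the function names are hypothetical. No IRI escaping is attempted.

```python
from typing import Iterable, List

# Assumption: the predicate is not named in the slides; dcterms:subject
# is used purely for illustration.
SUBJECT_PREDICATE = "http://purl.org/dc/terms/subject"


def read_url_list(text: str) -> List[str]:
    """Parse the plain-text input: one URL per line, blank lines ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def to_ntriples(url: str, concept_uris: Iterable[str]) -> str:
    """Serialize one URL and its AGROVOC URIs as N-Triples, one triple per line."""
    return "".join(
        f"<{url}> <{SUBJECT_PREDICATE}> <{concept}> .\n"
        for concept in concept_uris
    )
```

Each input URL yields one line per extracted concept, so the output file can be appended to as the URLs are processed.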
9. Diagram: a text file with a list of URLs of Web resources (input) -> AgroTagger -> output
10. How does it work?
• For each URL in the input file
– Download the resource
– Run the MAUI indexer trained with AGROVOC (the
application was trained with 780 bibliographic
resources manually indexed by FAO cataloguers)
– Update the output file with discovered
connections (source URL -> set of AGROVOC URIs)
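The loop above can be sketched as follows, in Python for brevity. This is an illustration under stated assumptions, not the application's code: `fetch` and `index` are hypothetical callables standing in for the download step and the trained MAUI indexer, and skipping unreachable resources is also an assumption.

```python
from typing import Callable, Dict, Iterable, List


def tag_urls(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    index: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Map each source URL to the AGROVOC URIs extracted from its content."""
    connections: Dict[str, List[str]] = {}
    for url in urls:
        try:
            text = fetch(url)  # step 1: download the resource
        except OSError:
            continue  # assumption: unreachable resources are skipped
        concepts = index(text)  # step 2: run the trained indexer
        if concepts:
            connections[url] = concepts  # step 3: source URL -> AGROVOC URIs
    return connections
```

Passing the downloader and the indexer as parameters keeps the loop independent of how either step is implemented, which also makes it easy to try out with stand-ins.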
11. Use Case 1:
indexing AGRIS
resources
12. AGRIS
• A collection of more than 7 million
bibliographic references in agriculture
• AGRIS records come with AGROVOC
descriptors
• An RDF-aware system
– the AGRIS database is exposed as RDF
– AGROVOC is the backbone for interlinking with external
sources of information (statistics, distribution
maps, country profiles, germplasm data…)
14. The problem
• Some AGRIS records have not been
indexed with AGROVOC keywords
• When AGROVOC keywords are not available, an
AGRIS record cannot be interlinked with external
sources of information
15. The solution
Not yet implemented!
16. An example
• In 2012 AGRIS received 28,582 bibliographic
records from the World Bank
• All records came with a full-text link, but with no
associated keywords
• By running the AgroTagger we were able to
assign between 4 and 10 AGROVOC keywords to
each World Bank resource
• We manually evaluated a random sample of the
output, with good results!
18. Use Case 2:
crawling the Web
19. The setting
• Objective: discovering Web resources in
agriculture and interlinking them to AGRIS
records
• Tools:
– Apache Nutch crawler
– AgroTagger Java API
• Final Goal: when the system displays an AGRIS
record, a list of related Web resources should
be available to the user
20. The algorithm
• After tuning, the Apache Nutch Web crawler
crawls the Web starting from a list of
preselected URLs
– The output of the crawler (a list of discovered URLs) is
given to the AgroTagger
• The AgroTagger assigns some AGROVOC URIs to
each URL discovered by the crawler
• AGRIS records are interlinked with these URLs if
they share at least 5 AGROVOC URIs (this
threshold has to be tuned)
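The matching rule in the last bullet can be sketched as a set intersection with a tunable threshold. This Python version is a brute-force illustration only: the function name and data layout are assumptions, and comparing every record with every URL would not scale to millions of records without further optimization (for example an index from concept to records).

```python
from typing import Dict, List, Set, Tuple


def interlink(
    agris: Dict[str, Set[str]],
    web: Dict[str, Set[str]],
    min_common: int = 5,
) -> List[Tuple[str, str, int]]:
    """Return (AGRIS record, Web URL, shared count) for every pair that
    shares at least min_common AGROVOC URIs."""
    links: List[Tuple[str, str, int]] = []
    for record, record_concepts in agris.items():
        for url, url_concepts in web.items():
            shared = len(record_concepts & url_concepts)
            if shared >= min_common:
                links.append((record, url, shared))
    return links
```

Lowering `min_common` produces more associations at the cost of weaker matches, which is why the threshold needs tuning against real data.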
21. First test: some numbers
• A first test started from the URL:
http://ageconsearch.umn.edu/
• 101,000 distinct Web resources were
discovered by the Web crawler and associated with
AGROVOC URIs by the AgroTagger
• An algorithm tried to match AGRIS data to these
resources
– E.g. the resource
«http://www.waeaonline.org/WEForum/WEF-Vol.9-No.2-Fall2010.pdf»
was associated with the AGRIS
record «http://agris.fao.org/aos/records/US7938594»
22. First test: some numbers (2)
Number of AGRIS records | Common AGROVOC URIs between AGRIS and the output of the Crawler | Number of associations
900 K | 3 | 17 MLN
530 K | 4 | 1.9 MLN
2.3 MLN | 5 | 1.27 MLN
23. Future
• Other qualitative/quantitative tests
• Optimization of the algorithm to run faster
• Tuning of the physical infrastructure
• Complete automation of procedures (e.g. the
output goes directly to a triplestore)
• Reach the final goal: when the system displays
an AGRIS record, a list of related Web
resources is available to the user
24. Thank you !
Editor's Notes
• Tuning parameters, both for the crawler and for the matching algorithm
• Parallelization
• Cloud