SlideShare ist ein Scribd-Unternehmen logo
1 von 4
Downloaden Sie, um offline zu lesen
Personalised
access to
cultural heritage
spaces

Recommendations for
the automatic enrichment of DL content
using open source software
Authors:

Eneko Agirre and Arantxa Otegi

www.paths-project.eu
PATHS is funded by the European Commission FP7 programme under Digital Libraries and Digital Preservation
Recommendations for the automatic enrichment
of DL content using open source software

1 Introduction
2 Producing intra-collections links
2.1 Similarity links
2.2 Typed-similarity links
3 Producing background links
4 Ontology extension
References

1 Introduction
PATHS uses text processing software to enable the following functionality in the prototypes
(see deliverables D2.1 and D2.2 http://paths-project.eu/eng/Resources):
●
●
●

intra-collection links
background links
ontology extension

The aim of PATHS is to investigate the use of those functionalities to better serve exploration
by users. Most of the software is in-house and relatively complex. The release of the software
as open source was not in the DOW, and the exploitations plan goes in the software-as-aservice direction, where the processing is done in servers prepared by the partners. For
instance, in a closely related Best Practice Network (LoCloud http://www.locloud.eu/), servers
for automatic content enrichment of digital library items are to be produced.
Alternatively, projects like OpeNER (http://www.opener-project.org/) are devoted to the
release of open source tools which are similar to the ones used for text processing in PATHS.
The release of all PATHS software as open source out-of-the-box packages would be a
major undertaking, on a par to those European projects.
The goal of this document is to present an overall set of recommendations for the automatic
enrichment of Digital Libraries content using open source software. We think this would be
useful for third-parties who would like to offer similar services. Note that this is not a step-bystep guide for reimplementation, but an overall view of the required software and
programming effort involved.
This document is structured according to each of the enrichment tasks described in PATHS
deliverables D2.1 and D2.2.

PATHS is funded by the European Commission FP7 programme under Digital Libraries and Digital Preservation
2

Producing intra-collections links

PATHS produced both generic similarity links and more specific typed-similarity links.

2.1 Similarity links
The target items would need to be parsed by the following Natural Language Processing
(NLP) tools: Pos tagging and lemmatization. Several open source products exist, including
the two used in PATHS:
CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml)
and
Freeling (http://nlp.lsi.upc.edu/freeling/).
Regarding multilinguality PATHS worked on Spanish and English texts. Freeling covers both,
and Stanford works out-of-the-box for English. Recent projects like OpenNER also offer a
suite of open source NLP tools including, in addition to English and Spanish, four other
European languages.
In addition, we used in-house scripts to process the internal representation of the items,
extract the textual pieces, and produce the enriched representations.
The actual production of the similarity links requires additional software, which in this case
has been developed in-house. No out-of-the box alternatives exist. The interested parties
would need to replicate the software described in (Aletras et al. 2012).
As an alternative, the PATHS server demonstrating similarity links described in (Agirre et al.
2013b) uses the functionality provided by a search engine (Solr http://lucene.apache.org/solr/)
to provide similar items. Note that this alternative does not require additional NLP tools, as it
uses their own stop words and stemming algorithm.

2.2 Typed-similarity links
The target items would need to be parsed by the following NLP tools: Pos tagging and
lemmatization (see previous section).
In addition, we used in-house scripts to process the internal representation of the items,
extract the textual pieces, and produce the enriched representations.
The actual production of the typed-similarity links requires additional software, including open
source machine learning software (Weka http://www.cs.waikato.ac.nz/ml/weka/) and the inhouse scripts to extract features from items, train the machine learning models on the
publicly available typed-similarity datasets produced by PATHS (http://ixa2.si.ehu.es/sts/),
and use the machine learning models on the target items. No out-of-the-box alternatives
exist. The interested parties would need to replicate the software as described in (Agirre et al.
2013a).
3

Producing background links

In order to produce background links we used Wikipedia Miner, an open source software
available at http://wikipedia-miner.cms.waikato.ac.nz. We used wikipedia miner out-of-thebox. In addition, we used in-house scripts to process the internal representation of the items,
extract the textual pieces, and produce the enriched representations.

4

Ontology extension

The target items would need to be parsed by the following NLP tools: Pos tagging and
lemmatisation (see previous section).
In addition, we used in-house scripts to process the internal representation of the items,
extract the textual pieces, and produce the enriched representations.
The actual production of the vocabulary requires additional software. We first extract the
background links (see previous section), and then find the most relevant Wikipedia articles
per item. This is done globally, analysing the statistics of the whole collection. Those articles
are used to categorize the items according to a Wikipedia-based category system, which is
trimmed-down to only cover the categories which are relevant to the collection at hand. No
out-of-the-box alternatives exist. The interested parties would need to replicate the software
as described in (Fernando et al. 2012).

References
Agirre E., Aletras N., Gonzalez-Agirre A., Rigau G., Stevenson M. (2013a). UBC UOSTYPED: Regression for typed-similarity. The Second Joint Conference on Lexical and
Computational Semantics (*SEM 2013)
Agirre E., Barrena A., Fernandez K., Miranda E., Otegi A., Soroa A. (2013b). PATHSenrich:
a Web Service Prototype for Automatic Cultural Heritage Item Enrichment in Research and
Advanced Technology for Digital Libraries, International Conference on Theory and Practice
of Digital Libraries, TPDL 2013, Valletta, Malta, September 22-26, 2013. Lecture Notes in
Computer Science, vol. 8092, pp 462-465.
Aletras N., Stevenson M., Clough P. (2012). Computing Similarity between Items in a Digital
Library of Cultural Heritage. In ACM JOCCH.
Fernando S., Hall M., Agirre E., Soroa A., Clough P., Stevenson M. (2012). Comparing
taxonomies for organising collections of documents. In Proceedings of the 24th International
Conference on Computational Linguistics (Coling 2012), pages 879-894, Mumbai, India,
December 2012.

Weitere ähnliche Inhalte

Andere mochten auch

PATHS system architecture
PATHS system architecturePATHS system architecture
PATHS system architecture
pathsproject
 
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...
mediamera
 
Word pressで情報を得るのに役立つwebサイトの紹介
Word pressで情報を得るのに役立つwebサイトの紹介Word pressで情報を得るのに役立つwebサイトの紹介
Word pressで情報を得るのに役立つwebサイトの紹介
Akinori Tateyama
 
PATHS Evaluation of the 1st paths prototype
PATHS Evaluation of the 1st paths prototypePATHS Evaluation of the 1st paths prototype
PATHS Evaluation of the 1st paths prototype
pathsproject
 
A pilot on Semantic Textual Similarity
A pilot on Semantic Textual SimilarityA pilot on Semantic Textual Similarity
A pilot on Semantic Textual Similarity
pathsproject
 
My E-mail appears as spam - troubleshooting path - part 11 of 17
My E-mail appears as spam - troubleshooting path - part 11 of 17My E-mail appears as spam - troubleshooting path - part 11 of 17
My E-mail appears as spam - troubleshooting path - part 11 of 17
Eyal Doron
 

Andere mochten auch (18)

Boletim informativo - janeiro
Boletim informativo - janeiroBoletim informativo - janeiro
Boletim informativo - janeiro
 
Fichas s
Fichas sFichas s
Fichas s
 
PATHS system architecture
PATHS system architecturePATHS system architecture
PATHS system architecture
 
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...
 
Word pressで情報を得るのに役立つwebサイトの紹介
Word pressで情報を得るのに役立つwebサイトの紹介Word pressで情報を得るのに役立つwebサイトの紹介
Word pressで情報を得るのに役立つwebサイトの紹介
 
My E-mail appears as spam | The 7 major reasons | Part 5#17
My E-mail appears as spam | The 7 major reasons | Part 5#17My E-mail appears as spam | The 7 major reasons | Part 5#17
My E-mail appears as spam | The 7 major reasons | Part 5#17
 
Think before you speak
Think before you speakThink before you speak
Think before you speak
 
Evaluating the Use of Clustering for Automatically Organising Digital Library...
Evaluating the Use of Clustering for Automatically Organising Digital Library...Evaluating the Use of Clustering for Automatically Organising Digital Library...
Evaluating the Use of Clustering for Automatically Organising Digital Library...
 
Zp primary school, ganegaon khalsa
Zp primary school, ganegaon khalsaZp primary school, ganegaon khalsa
Zp primary school, ganegaon khalsa
 
Cross-lingual event-mining using wordnet as a shared knowledge interface
Cross-lingual event-mining using wordnet as a shared knowledge interfaceCross-lingual event-mining using wordnet as a shared knowledge interface
Cross-lingual event-mining using wordnet as a shared knowledge interface
 
What are the possible damages of phishing and spoofing mail attacks part 2#...
What are the possible damages of phishing and spoofing mail attacks   part 2#...What are the possible damages of phishing and spoofing mail attacks   part 2#...
What are the possible damages of phishing and spoofing mail attacks part 2#...
 
PATHS Evaluation of the 1st paths prototype
PATHS Evaluation of the 1st paths prototypePATHS Evaluation of the 1st paths prototype
PATHS Evaluation of the 1st paths prototype
 
PATHS state of the art monitoring report
PATHS state of the art monitoring reportPATHS state of the art monitoring report
PATHS state of the art monitoring report
 
A pilot on Semantic Textual Similarity
A pilot on Semantic Textual SimilarityA pilot on Semantic Textual Similarity
A pilot on Semantic Textual Similarity
 
PATHS at EuropeanaTech 2011, Vienna
PATHS at EuropeanaTech 2011, ViennaPATHS at EuropeanaTech 2011, Vienna
PATHS at EuropeanaTech 2011, Vienna
 
Semantic Enrichment of Cultural Heritage content in PATHS
Semantic Enrichment of Cultural Heritage content in PATHSSemantic Enrichment of Cultural Heritage content in PATHS
Semantic Enrichment of Cultural Heritage content in PATHS
 
The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...
The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...
The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...
 
My E-mail appears as spam - troubleshooting path - part 11 of 17
My E-mail appears as spam - troubleshooting path - part 11 of 17My E-mail appears as spam - troubleshooting path - part 11 of 17
My E-mail appears as spam - troubleshooting path - part 11 of 17
 

Ähnlich wie Recommendations for the automatic enrichment of digital library content using open source software, PATHS report

Real time text stream processing - a dynamic and distributed nlp pipeline
Real time text stream  processing - a dynamic and distributed nlp pipelineReal time text stream  processing - a dynamic and distributed nlp pipeline
Real time text stream processing - a dynamic and distributed nlp pipeline
Conference Papers
 

Ähnlich wie Recommendations for the automatic enrichment of digital library content using open source software, PATHS report (20)

PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
 
C04 07 1519
C04 07 1519C04 07 1519
C04 07 1519
 
Real time text stream processing - a dynamic and distributed nlp pipeline
Real time text stream  processing - a dynamic and distributed nlp pipelineReal time text stream  processing - a dynamic and distributed nlp pipeline
Real time text stream processing - a dynamic and distributed nlp pipeline
 
Sentiment analyzer and opinion mining
Sentiment analyzer and opinion miningSentiment analyzer and opinion mining
Sentiment analyzer and opinion mining
 
2010-04-14 EDUCON eMadrid uned mrartacho
2010-04-14 EDUCON eMadrid uned mrartacho2010-04-14 EDUCON eMadrid uned mrartacho
2010-04-14 EDUCON eMadrid uned mrartacho
 
07419051111154767
0741905111115476707419051111154767
07419051111154767
 
Knowledge based-interaction-in-software-development
Knowledge based-interaction-in-software-developmentKnowledge based-interaction-in-software-development
Knowledge based-interaction-in-software-development
 
ABCD Open Source Software for managing ETD repositories
ABCD Open Source Software for managing ETD repositoriesABCD Open Source Software for managing ETD repositories
ABCD Open Source Software for managing ETD repositories
 
New Goals of PARES: Spanish Archives Web Portal
New Goals of PARES: Spanish Archives Web PortalNew Goals of PARES: Spanish Archives Web Portal
New Goals of PARES: Spanish Archives Web Portal
 
A Service-Oriented National E-Theses Information System And Repository
A Service-Oriented National E-Theses Information System And RepositoryA Service-Oriented National E-Theses Information System And Repository
A Service-Oriented National E-Theses Information System And Repository
 
Research Tool - End Note
Research Tool - End NoteResearch Tool - End Note
Research Tool - End Note
 
Dspace
DspaceDspace
Dspace
 
Dspace
DspaceDspace
Dspace
 
A Distributed Audio Personalization Framework over Android
A Distributed Audio Personalization Framework over AndroidA Distributed Audio Personalization Framework over Android
A Distributed Audio Personalization Framework over Android
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
 
A knowledge-workbench-for-software-development
A knowledge-workbench-for-software-developmentA knowledge-workbench-for-software-development
A knowledge-workbench-for-software-development
 
IRJET- Hosting NLP based Chatbot on AWS Cloud using Docker
IRJET-  	  Hosting NLP based Chatbot on AWS Cloud using DockerIRJET-  	  Hosting NLP based Chatbot on AWS Cloud using Docker
IRJET- Hosting NLP based Chatbot on AWS Cloud using Docker
 
QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...
QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...
QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...
 
Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012
 

Mehr von pathsproject

Comparing taxonomies for organising collections of documents presentation
Comparing taxonomies for organising collections of documents presentationComparing taxonomies for organising collections of documents presentation
Comparing taxonomies for organising collections of documents presentation
pathsproject
 
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
SemEval-2012 Task 6: A Pilot on Semantic Textual SimilaritySemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
pathsproject
 
Comparing taxonomies for organising collections of documents
Comparing taxonomies for organising collections of documentsComparing taxonomies for organising collections of documents
Comparing taxonomies for organising collections of documents
pathsproject
 
PATHS Final prototype interface design v1.0
PATHS Final prototype interface design v1.0PATHS Final prototype interface design v1.0
PATHS Final prototype interface design v1.0
pathsproject
 
PATHS Second prototype-functional-spec
PATHS Second prototype-functional-specPATHS Second prototype-functional-spec
PATHS Second prototype-functional-spec
pathsproject
 
PATHS Final state of art monitoring report v0_4
PATHS  Final state of art monitoring report v0_4PATHS  Final state of art monitoring report v0_4
PATHS Final state of art monitoring report v0_4
pathsproject
 
PATHS first paths prototype
PATHS first paths prototypePATHS first paths prototype
PATHS first paths prototype
pathsproject
 
PATHS Content processing 2nd prototype-revised.v2
PATHS Content processing 2nd prototype-revised.v2PATHS Content processing 2nd prototype-revised.v2
PATHS Content processing 2nd prototype-revised.v2
pathsproject
 
PATHS Content processing 1st prototype
PATHS  Content processing 1st prototypePATHS  Content processing 1st prototype
PATHS Content processing 1st prototype
pathsproject
 

Mehr von pathsproject (20)

Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
 
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
 
Implementing Recommendations in the PATHS system, SUEDL 2013
Implementing Recommendations in the PATHS system, SUEDL 2013Implementing Recommendations in the PATHS system, SUEDL 2013
Implementing Recommendations in the PATHS system, SUEDL 2013
 
User-Centred Design to Support Exploration and Path Creation in Cultural Her...
 User-Centred Design to Support Exploration and Path Creation in Cultural Her... User-Centred Design to Support Exploration and Path Creation in Cultural Her...
User-Centred Design to Support Exploration and Path Creation in Cultural Her...
 
Generating Paths through Cultural Heritage Collections Latech2013 paper
Generating Paths through Cultural Heritage Collections Latech2013 paperGenerating Paths through Cultural Heritage Collections Latech2013 paper
Generating Paths through Cultural Heritage Collections Latech2013 paper
 
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
 
Generating Paths through Cultural Heritage Collections, LATECH 2013 paper
Generating Paths through Cultural Heritage Collections, LATECH 2013 paperGenerating Paths through Cultural Heritage Collections, LATECH 2013 paper
Generating Paths through Cultural Heritage Collections, LATECH 2013 paper
 
PATHS @ LATECH 2013
PATHS @ LATECH 2013PATHS @ LATECH 2013
PATHS @ LATECH 2013
 
PATHS at the eChallenges conference
PATHS at the eChallenges conferencePATHS at the eChallenges conference
PATHS at the eChallenges conference
 
PATHS at the EAA conference 2013
PATHS at the EAA conference 2013PATHS at the EAA conference 2013
PATHS at the EAA conference 2013
 
PATHS at the eCult dialogue day 2013
PATHS at the eCult dialogue day 2013PATHS at the eCult dialogue day 2013
PATHS at the eCult dialogue day 2013
 
Comparing taxonomies for organising collections of documents presentation
Comparing taxonomies for organising collections of documents presentationComparing taxonomies for organising collections of documents presentation
Comparing taxonomies for organising collections of documents presentation
 
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
SemEval-2012 Task 6: A Pilot on Semantic Textual SimilaritySemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
 
Comparing taxonomies for organising collections of documents
Comparing taxonomies for organising collections of documentsComparing taxonomies for organising collections of documents
Comparing taxonomies for organising collections of documents
 
PATHS Final prototype interface design v1.0
PATHS Final prototype interface design v1.0PATHS Final prototype interface design v1.0
PATHS Final prototype interface design v1.0
 
PATHS Second prototype-functional-spec
PATHS Second prototype-functional-specPATHS Second prototype-functional-spec
PATHS Second prototype-functional-spec
 
PATHS Final state of art monitoring report v0_4
PATHS  Final state of art monitoring report v0_4PATHS  Final state of art monitoring report v0_4
PATHS Final state of art monitoring report v0_4
 
PATHS first paths prototype
PATHS first paths prototypePATHS first paths prototype
PATHS first paths prototype
 
PATHS Content processing 2nd prototype-revised.v2
PATHS Content processing 2nd prototype-revised.v2PATHS Content processing 2nd prototype-revised.v2
PATHS Content processing 2nd prototype-revised.v2
 
PATHS Content processing 1st prototype
PATHS  Content processing 1st prototypePATHS  Content processing 1st prototype
PATHS Content processing 1st prototype
 

Kürzlich hochgeladen

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Kürzlich hochgeladen (20)

Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Recommendations for the automatic enrichment of digital library content using open source software, PATHS report

  • 1. Personalised access to cultural heritage spaces Recommendations for the automatic enrichment of DL content using open source software Authors: Eneko Agirre and Arantxa Otegi www.paths-project.eu PATHS is funded by the European Commission FP7 programme under Digital Libraries and Digital Preservation
  • 2. Recommendations for the automatic enrichment of DL content using open source software 1 Introduction 2 Producing intra-collections links 2.1 Similarity links 2.2 Typed-similarity links 3 Producing background links 4 Ontology extension References 1 Introduction PATHS uses text processing software to enable the following functionality in the prototypes (see deliverables D2.1 and D2.2 http://paths-project.eu/eng/Resources): ● ● ● intra-collection links background links ontology extension The aim of PATHS is to investigate the use of those functionalities to better serve exploration by users. Most of the software is in-house and relatively complex. The release of the software as open source was not in the DOW, and the exploitations plan goes in the software-as-aservice direction, where the processing is done in servers prepared by the partners. For instance, in a closely related Best Practice Network (LoCloud http://www.locloud.eu/), servers for automatic content enrichment of digital library items are to be produced. Alternatively, projects like OpeNER (http://www.opener-project.org/) are devoted to the release of open source tools which are similar to the ones used for text processing in PATHS. The release of all PATHS software as open source out-of-the-box packages would be a major undertaking, on a par to those European projects. The goal of this document is to present an overall set of recommendations for the automatic enrichment of Digital Libraries content using open source software. We think this would be useful for third-parties who would like to offer similar services. Note that this is not a step-bystep guide for reimplementation, but an overall view of the required software and programming effort involved. This document is structured according to each of the enrichment tasks described in PATHS deliverables D2.1 and D2.2. PATHS is funded by the European Commission FP7 programme under Digital Libraries and Digital Preservation
  • 3. 2 Producing intra-collections links PATHS produced both generic similarity links and more specific typed-similarity links. 2.1 Similarity links The target items would need to be parsed by the following Natural Language Processing (NLP) tools: Pos tagging and lemmatization. Several open source products exist, including the two used in PATHS: CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml) and Freeling (http://nlp.lsi.upc.edu/freeling/). Regarding multilinguality PATHS worked on Spanish and English texts. Freeling covers both, and Stanford works out-of-the-box for English. Recent projects like OpenNER also offer a suite of open source NLP tools including, in addition to English and Spanish, four other European languages. In addition, we used in-house scripts to process the internal representation of the items, extract the textual pieces, and produce the enriched representations. The actual production of the similarity links requires additional software, which in this case has been developed in-house. No out-of-the box alternatives exist. The interested parties would need to replicate the software described in (Aletras et al. 2012). As an alternative, the PATHS server demonstrating similarity links described in (Agirre et al. 2013b) uses the functionality provided by a search engine (Solr http://lucene.apache.org/solr/) to provide similar items. Note that this alternative does not require additional NLP tools, as it uses their own stop words and stemming algorithm. 2.2 Typed-similarity links The target items would need to be parsed by the following NLP tools: Pos tagging and lemmatization (see previous section). In addition, we used in-house scripts to process the internal representation of the items, extract the textual pieces, and produce the enriched representations. The actual production of the typed-similarity links requires additional software, including open source machine learning software (Weka http://www.cs.waikato.ac.nz/ml/weka/) and the inhouse scripts to extract features from items, train the machine learning models on the publicly available typed-similarity datasets produced by PATHS (http://ixa2.si.ehu.es/sts/), and use the machine learning models on the target items. No out-of-the-box alternatives exist. The interested parties would need to replicate the software as described in (Agirre et al. 2013a).
  • 4. 3 Producing background links In order to produce background links we used Wikipedia Miner, an open source software available at http://wikipedia-miner.cms.waikato.ac.nz. We used wikipedia miner out-of-thebox. In addition, we used in-house scripts to process the internal representation of the items, extract the textual pieces, and produce the enriched representations. 4 Ontology extension The target items would need to be parsed by the following NLP tools: Pos tagging and lemmatisation (see previous section). In addition, we used in-house scripts to process the internal representation of the items, extract the textual pieces, and produce the enriched representations. The actual production of the vocabulary requires additional software. We first extract the background links (see previous section), and then find the most relevant Wikipedia articles per item. This is done globally, analysing the statistics of the whole collection. Those articles are used to categorize the items according to a Wikipedia-based category system, which is trimmed-down to only cover the categories which are relevant to the collection at hand. No out-of-the-box alternatives exist. The interested parties would need to replicate the software as described in (Fernando et al. 2012). References Agirre E., Aletras N., Gonzalez-Agirre A., Rigau G., Stevenson M. (2013a). UBC UOSTYPED: Regression for typed-similarity. The Second Joint Conference on Lexical and Computational Semantics (*SEM 2013) Agirre E., Barrena A., Fernandez K., Miranda E., Otegi A., Soroa A. (2013b). PATHSenrich: a Web Service Prototype for Automatic Cultural Heritage Item Enrichment in Research and Advanced Technology for Digital Libraries, International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Valletta, Malta, September 22-26, 2013. Lecture Notes in Computer Science, vol. 8092, pp 462-465. Aletras N., Stevenson M., Clough P. (2012). Computing Similarity between Items in a Digital Library of Cultural Heritage. In ACM JOCCH. Fernando S., Hall M., Agirre E., Soroa A., Clough P., Stevenson M. (2012). Comparing taxonomies for organising collections of documents. In Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012), pages 879-894, Mumbai, India, December 2012.