SlideShare ist ein Scribd-Unternehmen logo
1 von 4
Downloaden Sie, um offline zu lesen
Personalised
access to
cultural heritage
spaces

Recommendations for
the automatic enrichment of DL content
using open source software
Authors:

Eneko Agirre and Arantxa Otegi

www.paths-project.eu
PATHS is funded by the European Commission FP7 programme under Digital Libraries and Digital Preservation
Recommendations for the automatic enrichment
of DL content using open source software

1 Introduction
2 Producing intra-collections links
2.1 Similarity links
2.2 Typed-similarity links
3 Producing background links
4 Ontology extension
References

1 Introduction
PATHS uses text processing software to enable the following functionality in the prototypes
(see deliverables D2.1 and D2.2 http://paths-project.eu/eng/Resources):
●
●
●

intra-collection links
background links
ontology extension

The aim of PATHS is to investigate the use of those functionalities to better serve exploration
by users. Most of the software is in-house and relatively complex. The release of the software
as open source was not in the DOW, and the exploitations plan goes in the software-as-aservice direction, where the processing is done in servers prepared by the partners. For
instance, in a closely related Best Practice Network (LoCloud http://www.locloud.eu/), servers
for automatic content enrichment of digital library items are to be produced.
Alternatively, projects like OpeNER (http://www.opener-project.org/) are devoted to the
release of open source tools which are similar to the ones used for text processing in PATHS.
The release of all PATHS software as open source out-of-the-box packages would be a
major undertaking, on a par to those European projects.
The goal of this document is to present an overall set of recommendations for the automatic
enrichment of Digital Libraries content using open source software. We think this would be
useful for third-parties who would like to offer similar services. Note that this is not a step-bystep guide for reimplementation, but an overall view of the required software and
programming effort involved.
This document is structured according to each of the enrichment tasks described in PATHS
deliverables D2.1 and D2.2.

PATHS is funded by the European Commission FP7 programme under Digital Libraries and Digital Preservation
2

Producing intra-collections links

PATHS produced both generic similarity links and more specific typed-similarity links.

2.1 Similarity links
The target items would need to be parsed by the following Natural Language Processing
(NLP) tools: Pos tagging and lemmatization. Several open source products exist, including
the two used in PATHS:
CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml)
and
Freeling (http://nlp.lsi.upc.edu/freeling/).
Regarding multilinguality PATHS worked on Spanish and English texts. Freeling covers both,
and Stanford works out-of-the-box for English. Recent projects like OpenNER also offer a
suite of open source NLP tools including, in addition to English and Spanish, four other
European languages.
In addition, we used in-house scripts to process the internal representation of the items,
extract the textual pieces, and produce the enriched representations.
The actual production of the similarity links requires additional software, which in this case
has been developed in-house. No out-of-the box alternatives exist. The interested parties
would need to replicate the software described in (Aletras et al. 2012).
As an alternative, the PATHS server demonstrating similarity links described in (Agirre et al.
2013b) uses the functionality provided by a search engine (Solr http://lucene.apache.org/solr/)
to provide similar items. Note that this alternative does not require additional NLP tools, as it
uses their own stop words and stemming algorithm.

2.2 Typed-similarity links
The target items would need to be parsed by the following NLP tools: Pos tagging and
lemmatization (see previous section).
In addition, we used in-house scripts to process the internal representation of the items,
extract the textual pieces, and produce the enriched representations.
The actual production of the typed-similarity links requires additional software, including open
source machine learning software (Weka http://www.cs.waikato.ac.nz/ml/weka/) and the inhouse scripts to extract features from items, train the machine learning models on the
publicly available typed-similarity datasets produced by PATHS (http://ixa2.si.ehu.es/sts/),
and use the machine learning models on the target items. No out-of-the-box alternatives
exist. The interested parties would need to replicate the software as described in (Agirre et al.
2013a).
3

Producing background links

In order to produce background links we used Wikipedia Miner, an open source software
available at http://wikipedia-miner.cms.waikato.ac.nz. We used wikipedia miner out-of-thebox. In addition, we used in-house scripts to process the internal representation of the items,
extract the textual pieces, and produce the enriched representations.

4

Ontology extension

The target items would need to be parsed by the following NLP tools: Pos tagging and
lemmatisation (see previous section).
In addition, we used in-house scripts to process the internal representation of the items,
extract the textual pieces, and produce the enriched representations.
The actual production of the vocabulary requires additional software. We first extract the
background links (see previous section), and then find the most relevant Wikipedia articles
per item. This is done globally, analysing the statistics of the whole collection. Those articles
are used to categorize the items according to a Wikipedia-based category system, which is
trimmed-down to only cover the categories which are relevant to the collection at hand. No
out-of-the-box alternatives exist. The interested parties would need to replicate the software
as described in (Fernando et al. 2012).

References
Agirre E., Aletras N., Gonzalez-Agirre A., Rigau G., Stevenson M. (2013a). UBC UOSTYPED: Regression for typed-similarity. The Second Joint Conference on Lexical and
Computational Semantics (*SEM 2013)
Agirre E., Barrena A., Fernandez K., Miranda E., Otegi A., Soroa A. (2013b). PATHSenrich:
a Web Service Prototype for Automatic Cultural Heritage Item Enrichment in Research and
Advanced Technology for Digital Libraries, International Conference on Theory and Practice
of Digital Libraries, TPDL 2013, Valletta, Malta, September 22-26, 2013. Lecture Notes in
Computer Science, vol. 8092, pp 462-465.
Aletras N., Stevenson M., Clough P. (2012). Computing Similarity between Items in a Digital
Library of Cultural Heritage. In ACM JOCCH.
Fernando S., Hall M., Agirre E., Soroa A., Clough P., Stevenson M. (2012). Comparing
taxonomies for organising collections of documents. In Proceedings of the 24th International
Conference on Computational Linguistics (Coling 2012), pages 879-894, Mumbai, India,
December 2012.

Weitere ähnliche Inhalte

Andere mochten auch

PATHS system architecture
PATHS system architecturePATHS system architecture
PATHS system architecturepathsproject
 
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...mediamera
 
Word pressで情報を得るのに役立つwebサイトの紹介
Word pressで情報を得るのに役立つwebサイトの紹介Word pressで情報を得るのに役立つwebサイトの紹介
Word pressで情報を得るのに役立つwebサイトの紹介Akinori Tateyama
 
My E-mail appears as spam | The 7 major reasons | Part 5#17
My E-mail appears as spam | The 7 major reasons | Part 5#17My E-mail appears as spam | The 7 major reasons | Part 5#17
My E-mail appears as spam | The 7 major reasons | Part 5#17Eyal Doron
 
Evaluating the Use of Clustering for Automatically Organising Digital Library...
Evaluating the Use of Clustering for Automatically Organising Digital Library...Evaluating the Use of Clustering for Automatically Organising Digital Library...
Evaluating the Use of Clustering for Automatically Organising Digital Library...pathsproject
 
Cross-lingual event-mining using wordnet as a shared knowledge interface
Cross-lingual event-mining using wordnet as a shared knowledge interfaceCross-lingual event-mining using wordnet as a shared knowledge interface
Cross-lingual event-mining using wordnet as a shared knowledge interfacepathsproject
 
What are the possible damages of phishing and spoofing mail attacks part 2#...
What are the possible damages of phishing and spoofing mail attacks   part 2#...What are the possible damages of phishing and spoofing mail attacks   part 2#...
What are the possible damages of phishing and spoofing mail attacks part 2#...Eyal Doron
 
PATHS Evaluation of the 1st paths prototype
PATHS Evaluation of the 1st paths prototypePATHS Evaluation of the 1st paths prototype
PATHS Evaluation of the 1st paths prototypepathsproject
 
PATHS state of the art monitoring report
PATHS state of the art monitoring reportPATHS state of the art monitoring report
PATHS state of the art monitoring reportpathsproject
 
A pilot on Semantic Textual Similarity
A pilot on Semantic Textual SimilarityA pilot on Semantic Textual Similarity
A pilot on Semantic Textual Similaritypathsproject
 
PATHS at EuropeanaTech 2011, Vienna
PATHS at EuropeanaTech 2011, ViennaPATHS at EuropeanaTech 2011, Vienna
PATHS at EuropeanaTech 2011, Viennapathsproject
 
Semantic Enrichment of Cultural Heritage content in PATHS
Semantic Enrichment of Cultural Heritage content in PATHSSemantic Enrichment of Cultural Heritage content in PATHS
Semantic Enrichment of Cultural Heritage content in PATHSpathsproject
 
The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...
The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...
The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...Eyal Doron
 
My E-mail appears as spam - troubleshooting path - part 11 of 17
My E-mail appears as spam - troubleshooting path - part 11 of 17My E-mail appears as spam - troubleshooting path - part 11 of 17
My E-mail appears as spam - troubleshooting path - part 11 of 17Eyal Doron
 

Andere mochten auch (18)

Boletim informativo - janeiro
Boletim informativo - janeiroBoletim informativo - janeiro
Boletim informativo - janeiro
 
Fichas s
Fichas sFichas s
Fichas s
 
PATHS system architecture
PATHS system architecturePATHS system architecture
PATHS system architecture
 
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...
Величко М.В. (2012) — Территориальные инновации, их внедрение и кадровое обе...
 
Word pressで情報を得るのに役立つwebサイトの紹介
Word pressで情報を得るのに役立つwebサイトの紹介Word pressで情報を得るのに役立つwebサイトの紹介
Word pressで情報を得るのに役立つwebサイトの紹介
 
My E-mail appears as spam | The 7 major reasons | Part 5#17
My E-mail appears as spam | The 7 major reasons | Part 5#17My E-mail appears as spam | The 7 major reasons | Part 5#17
My E-mail appears as spam | The 7 major reasons | Part 5#17
 
Think before you speak
Think before you speakThink before you speak
Think before you speak
 
Evaluating the Use of Clustering for Automatically Organising Digital Library...
Evaluating the Use of Clustering for Automatically Organising Digital Library...Evaluating the Use of Clustering for Automatically Organising Digital Library...
Evaluating the Use of Clustering for Automatically Organising Digital Library...
 
Zp primary school, ganegaon khalsa
Zp primary school, ganegaon khalsaZp primary school, ganegaon khalsa
Zp primary school, ganegaon khalsa
 
Cross-lingual event-mining using wordnet as a shared knowledge interface
Cross-lingual event-mining using wordnet as a shared knowledge interfaceCross-lingual event-mining using wordnet as a shared knowledge interface
Cross-lingual event-mining using wordnet as a shared knowledge interface
 
What are the possible damages of phishing and spoofing mail attacks part 2#...
What are the possible damages of phishing and spoofing mail attacks   part 2#...What are the possible damages of phishing and spoofing mail attacks   part 2#...
What are the possible damages of phishing and spoofing mail attacks part 2#...
 
PATHS Evaluation of the 1st paths prototype
PATHS Evaluation of the 1st paths prototypePATHS Evaluation of the 1st paths prototype
PATHS Evaluation of the 1st paths prototype
 
PATHS state of the art monitoring report
PATHS state of the art monitoring reportPATHS state of the art monitoring report
PATHS state of the art monitoring report
 
A pilot on Semantic Textual Similarity
A pilot on Semantic Textual SimilarityA pilot on Semantic Textual Similarity
A pilot on Semantic Textual Similarity
 
PATHS at EuropeanaTech 2011, Vienna
PATHS at EuropeanaTech 2011, ViennaPATHS at EuropeanaTech 2011, Vienna
PATHS at EuropeanaTech 2011, Vienna
 
Semantic Enrichment of Cultural Heritage content in PATHS
Semantic Enrichment of Cultural Heritage content in PATHSSemantic Enrichment of Cultural Heritage content in PATHS
Semantic Enrichment of Cultural Heritage content in PATHS
 
The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...
The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...
The importance of Exchange 2013 CAS in Exchange 2013 coexistence | Part 2/2 |...
 
My E-mail appears as spam - troubleshooting path - part 11 of 17
My E-mail appears as spam - troubleshooting path - part 11 of 17My E-mail appears as spam - troubleshooting path - part 11 of 17
My E-mail appears as spam - troubleshooting path - part 11 of 17
 

Ähnlich wie Recommendations for the automatic enrichment of digital library content using open source software, PATHS report

PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...pathsproject
 
Real time text stream processing - a dynamic and distributed nlp pipeline
Real time text stream  processing - a dynamic and distributed nlp pipelineReal time text stream  processing - a dynamic and distributed nlp pipeline
Real time text stream processing - a dynamic and distributed nlp pipelineConference Papers
 
Sentiment analyzer and opinion mining
Sentiment analyzer and opinion miningSentiment analyzer and opinion mining
Sentiment analyzer and opinion miningAnkush Mehta
 
2010-04-14 EDUCON eMadrid uned mrartacho
2010-04-14 EDUCON eMadrid uned mrartacho2010-04-14 EDUCON eMadrid uned mrartacho
2010-04-14 EDUCON eMadrid uned mrartachoeMadrid network
 
07419051111154767
0741905111115476707419051111154767
07419051111154767gayatri 24
 
Knowledge based-interaction-in-software-development
Knowledge based-interaction-in-software-developmentKnowledge based-interaction-in-software-development
Knowledge based-interaction-in-software-developmentDimitris Panagiotou
 
ABCD Open Source Software for managing ETD repositories
ABCD Open Source Software for managing ETD repositoriesABCD Open Source Software for managing ETD repositories
ABCD Open Source Software for managing ETD repositoriessangeetadhamdhere
 
A Service-Oriented National E-Theses Information System And Repository
A Service-Oriented National E-Theses Information System And RepositoryA Service-Oriented National E-Theses Information System And Repository
A Service-Oriented National E-Theses Information System And RepositoryJill Brown
 
Research Tool - End Note
Research Tool - End NoteResearch Tool - End Note
Research Tool - End Noteador
 
A Distributed Audio Personalization Framework over Android
A Distributed Audio Personalization Framework over AndroidA Distributed Audio Personalization Framework over Android
A Distributed Audio Personalization Framework over AndroidUniversity of Piraeus
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT ecij
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT ecij
 
A knowledge-workbench-for-software-development
A knowledge-workbench-for-software-developmentA knowledge-workbench-for-software-development
A knowledge-workbench-for-software-developmentDimitris Panagiotou
 
IRJET- Hosting NLP based Chatbot on AWS Cloud using Docker
IRJET-  	  Hosting NLP based Chatbot on AWS Cloud using DockerIRJET-  	  Hosting NLP based Chatbot on AWS Cloud using Docker
IRJET- Hosting NLP based Chatbot on AWS Cloud using DockerIRJET Journal
 
QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...
QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...
QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...FIAT/IFTA
 
Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012François Belleau
 

Ähnlich wie Recommendations for the automatic enrichment of digital library content using open source software, PATHS report (20)

PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
 
C04 07 1519
C04 07 1519C04 07 1519
C04 07 1519
 
Real time text stream processing - a dynamic and distributed nlp pipeline
Real time text stream  processing - a dynamic and distributed nlp pipelineReal time text stream  processing - a dynamic and distributed nlp pipeline
Real time text stream processing - a dynamic and distributed nlp pipeline
 
Sentiment analyzer and opinion mining
Sentiment analyzer and opinion miningSentiment analyzer and opinion mining
Sentiment analyzer and opinion mining
 
2010-04-14 EDUCON eMadrid uned mrartacho
2010-04-14 EDUCON eMadrid uned mrartacho2010-04-14 EDUCON eMadrid uned mrartacho
2010-04-14 EDUCON eMadrid uned mrartacho
 
07419051111154767
0741905111115476707419051111154767
07419051111154767
 
Knowledge based-interaction-in-software-development
Knowledge based-interaction-in-software-developmentKnowledge based-interaction-in-software-development
Knowledge based-interaction-in-software-development
 
ABCD Open Source Software for managing ETD repositories
ABCD Open Source Software for managing ETD repositoriesABCD Open Source Software for managing ETD repositories
ABCD Open Source Software for managing ETD repositories
 
New Goals of PARES: Spanish Archives Web Portal
New Goals of PARES: Spanish Archives Web PortalNew Goals of PARES: Spanish Archives Web Portal
New Goals of PARES: Spanish Archives Web Portal
 
A Service-Oriented National E-Theses Information System And Repository
A Service-Oriented National E-Theses Information System And RepositoryA Service-Oriented National E-Theses Information System And Repository
A Service-Oriented National E-Theses Information System And Repository
 
Research Tool - End Note
Research Tool - End NoteResearch Tool - End Note
Research Tool - End Note
 
Dspace
DspaceDspace
Dspace
 
Dspace
DspaceDspace
Dspace
 
A Distributed Audio Personalization Framework over Android
A Distributed Audio Personalization Framework over AndroidA Distributed Audio Personalization Framework over Android
A Distributed Audio Personalization Framework over Android
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
 
A knowledge-workbench-for-software-development
A knowledge-workbench-for-software-developmentA knowledge-workbench-for-software-development
A knowledge-workbench-for-software-development
 
IRJET- Hosting NLP based Chatbot on AWS Cloud using Docker
IRJET-  	  Hosting NLP based Chatbot on AWS Cloud using DockerIRJET-  	  Hosting NLP based Chatbot on AWS Cloud using Docker
IRJET- Hosting NLP based Chatbot on AWS Cloud using Docker
 
QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...
QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...
QEBU: AN ADVANCED GRAPHICAL EDITOR FOR THE EBUCORE METADATA SET | Paolo PASIN...
 
Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012Bio2RDF presentation at Combine 2012
Bio2RDF presentation at Combine 2012
 

Mehr von pathsproject

Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...pathsproject
 
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...pathsproject
 
Implementing Recommendations in the PATHS system, SUEDL 2013
Implementing Recommendations in the PATHS system, SUEDL 2013Implementing Recommendations in the PATHS system, SUEDL 2013
Implementing Recommendations in the PATHS system, SUEDL 2013pathsproject
 
User-Centred Design to Support Exploration and Path Creation in Cultural Her...
 User-Centred Design to Support Exploration and Path Creation in Cultural Her... User-Centred Design to Support Exploration and Path Creation in Cultural Her...
User-Centred Design to Support Exploration and Path Creation in Cultural Her...pathsproject
 
Generating Paths through Cultural Heritage Collections Latech2013 paper
Generating Paths through Cultural Heritage Collections Latech2013 paperGenerating Paths through Cultural Heritage Collections Latech2013 paper
Generating Paths through Cultural Heritage Collections Latech2013 paperpathsproject
 
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...pathsproject
 
Generating Paths through Cultural Heritage Collections, LATECH 2013 paper
Generating Paths through Cultural Heritage Collections, LATECH 2013 paperGenerating Paths through Cultural Heritage Collections, LATECH 2013 paper
Generating Paths through Cultural Heritage Collections, LATECH 2013 paperpathsproject
 
PATHS @ LATECH 2013
PATHS @ LATECH 2013PATHS @ LATECH 2013
PATHS @ LATECH 2013pathsproject
 
PATHS at the eChallenges conference
PATHS at the eChallenges conferencePATHS at the eChallenges conference
PATHS at the eChallenges conferencepathsproject
 
PATHS at the EAA conference 2013
PATHS at the EAA conference 2013PATHS at the EAA conference 2013
PATHS at the EAA conference 2013pathsproject
 
PATHS at the eCult dialogue day 2013
PATHS at the eCult dialogue day 2013PATHS at the eCult dialogue day 2013
PATHS at the eCult dialogue day 2013pathsproject
 
Comparing taxonomies for organising collections of documents presentation
Comparing taxonomies for organising collections of documents presentationComparing taxonomies for organising collections of documents presentation
Comparing taxonomies for organising collections of documents presentationpathsproject
 
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
SemEval-2012 Task 6: A Pilot on Semantic Textual SimilaritySemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
SemEval-2012 Task 6: A Pilot on Semantic Textual Similaritypathsproject
 
Comparing taxonomies for organising collections of documents
Comparing taxonomies for organising collections of documentsComparing taxonomies for organising collections of documents
Comparing taxonomies for organising collections of documentspathsproject
 
PATHS Final prototype interface design v1.0
PATHS Final prototype interface design v1.0PATHS Final prototype interface design v1.0
PATHS Final prototype interface design v1.0pathsproject
 
PATHS Second prototype-functional-spec
PATHS Second prototype-functional-specPATHS Second prototype-functional-spec
PATHS Second prototype-functional-specpathsproject
 
PATHS Final state of art monitoring report v0_4
PATHS  Final state of art monitoring report v0_4PATHS  Final state of art monitoring report v0_4
PATHS Final state of art monitoring report v0_4pathsproject
 
PATHS first paths prototype
PATHS first paths prototypePATHS first paths prototype
PATHS first paths prototypepathsproject
 
PATHS Content processing 2nd prototype-revised.v2
PATHS Content processing 2nd prototype-revised.v2PATHS Content processing 2nd prototype-revised.v2
PATHS Content processing 2nd prototype-revised.v2pathsproject
 
PATHS Content processing 1st prototype
PATHS  Content processing 1st prototypePATHS  Content processing 1st prototype
PATHS Content processing 1st prototypepathsproject
 

Mehr von pathsproject (20)

Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
 
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
 
Implementing Recommendations in the PATHS system, SUEDL 2013
Implementing Recommendations in the PATHS system, SUEDL 2013Implementing Recommendations in the PATHS system, SUEDL 2013
Implementing Recommendations in the PATHS system, SUEDL 2013
 
User-Centred Design to Support Exploration and Path Creation in Cultural Her...
 User-Centred Design to Support Exploration and Path Creation in Cultural Her... User-Centred Design to Support Exploration and Path Creation in Cultural Her...
User-Centred Design to Support Exploration and Path Creation in Cultural Her...
 
Generating Paths through Cultural Heritage Collections Latech2013 paper
Generating Paths through Cultural Heritage Collections Latech2013 paperGenerating Paths through Cultural Heritage Collections Latech2013 paper
Generating Paths through Cultural Heritage Collections Latech2013 paper
 
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
 
Generating Paths through Cultural Heritage Collections, LATECH 2013 paper
Generating Paths through Cultural Heritage Collections, LATECH 2013 paperGenerating Paths through Cultural Heritage Collections, LATECH 2013 paper
Generating Paths through Cultural Heritage Collections, LATECH 2013 paper
 
PATHS @ LATECH 2013
PATHS @ LATECH 2013PATHS @ LATECH 2013
PATHS @ LATECH 2013
 
PATHS at the eChallenges conference
PATHS at the eChallenges conferencePATHS at the eChallenges conference
PATHS at the eChallenges conference
 
PATHS at the EAA conference 2013
PATHS at the EAA conference 2013PATHS at the EAA conference 2013
PATHS at the EAA conference 2013
 
PATHS at the eCult dialogue day 2013
PATHS at the eCult dialogue day 2013PATHS at the eCult dialogue day 2013
PATHS at the eCult dialogue day 2013
 
Comparing taxonomies for organising collections of documents presentation
Comparing taxonomies for organising collections of documents presentationComparing taxonomies for organising collections of documents presentation
Comparing taxonomies for organising collections of documents presentation
 
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
SemEval-2012 Task 6: A Pilot on Semantic Textual SimilaritySemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
 
Comparing taxonomies for organising collections of documents
Comparing taxonomies for organising collections of documentsComparing taxonomies for organising collections of documents
Comparing taxonomies for organising collections of documents
 
PATHS Final prototype interface design v1.0
PATHS Final prototype interface design v1.0PATHS Final prototype interface design v1.0
PATHS Final prototype interface design v1.0
 
PATHS Second prototype-functional-spec
PATHS Second prototype-functional-specPATHS Second prototype-functional-spec
PATHS Second prototype-functional-spec
 
PATHS Final state of art monitoring report v0_4
PATHS  Final state of art monitoring report v0_4PATHS  Final state of art monitoring report v0_4
PATHS Final state of art monitoring report v0_4
 
PATHS first paths prototype
PATHS first paths prototypePATHS first paths prototype
PATHS first paths prototype
 
PATHS Content processing 2nd prototype-revised.v2
PATHS Content processing 2nd prototype-revised.v2PATHS Content processing 2nd prototype-revised.v2
PATHS Content processing 2nd prototype-revised.v2
 
PATHS Content processing 1st prototype
PATHS  Content processing 1st prototypePATHS  Content processing 1st prototype
PATHS Content processing 1st prototype
 

Kürzlich hochgeladen

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 

Kürzlich hochgeladen (20)

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 

Recommendations for the automatic enrichment of digital library content using open source software, PATHS report

  • 1. Personalised access to cultural heritage spaces Recommendations for the automatic enrichment of DL content using open source software Authors: Eneko Agirre and Arantxa Otegi www.paths-project.eu PATHS is funded by the European Commission FP7 programme under Digital Libraries and Digital Preservation
  • 2. Recommendations for the automatic enrichment of DL content using open source software 1 Introduction 2 Producing intra-collections links 2.1 Similarity links 2.2 Typed-similarity links 3 Producing background links 4 Ontology extension References 1 Introduction PATHS uses text processing software to enable the following functionality in the prototypes (see deliverables D2.1 and D2.2 http://paths-project.eu/eng/Resources): ● ● ● intra-collection links background links ontology extension The aim of PATHS is to investigate the use of those functionalities to better serve exploration by users. Most of the software is in-house and relatively complex. The release of the software as open source was not in the DOW, and the exploitations plan goes in the software-as-aservice direction, where the processing is done in servers prepared by the partners. For instance, in a closely related Best Practice Network (LoCloud http://www.locloud.eu/), servers for automatic content enrichment of digital library items are to be produced. Alternatively, projects like OpeNER (http://www.opener-project.org/) are devoted to the release of open source tools which are similar to the ones used for text processing in PATHS. The release of all PATHS software as open source out-of-the-box packages would be a major undertaking, on a par to those European projects. The goal of this document is to present an overall set of recommendations for the automatic enrichment of Digital Libraries content using open source software. We think this would be useful for third-parties who would like to offer similar services. Note that this is not a step-bystep guide for reimplementation, but an overall view of the required software and programming effort involved. This document is structured according to each of the enrichment tasks described in PATHS deliverables D2.1 and D2.2. PATHS is funded by the European Commission FP7 programme under Digital Libraries and Digital Preservation
  • 3. 2 Producing intra-collections links PATHS produced both generic similarity links and more specific typed-similarity links. 2.1 Similarity links The target items would need to be parsed by the following Natural Language Processing (NLP) tools: Pos tagging and lemmatization. Several open source products exist, including the two used in PATHS: CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml) and Freeling (http://nlp.lsi.upc.edu/freeling/). Regarding multilinguality PATHS worked on Spanish and English texts. Freeling covers both, and Stanford works out-of-the-box for English. Recent projects like OpenNER also offer a suite of open source NLP tools including, in addition to English and Spanish, four other European languages. In addition, we used in-house scripts to process the internal representation of the items, extract the textual pieces, and produce the enriched representations. The actual production of the similarity links requires additional software, which in this case has been developed in-house. No out-of-the box alternatives exist. The interested parties would need to replicate the software described in (Aletras et al. 2012). As an alternative, the PATHS server demonstrating similarity links described in (Agirre et al. 2013b) uses the functionality provided by a search engine (Solr http://lucene.apache.org/solr/) to provide similar items. Note that this alternative does not require additional NLP tools, as it uses their own stop words and stemming algorithm. 2.2 Typed-similarity links The target items would need to be parsed by the following NLP tools: Pos tagging and lemmatization (see previous section). In addition, we used in-house scripts to process the internal representation of the items, extract the textual pieces, and produce the enriched representations. The actual production of the typed-similarity links requires additional software, including open source machine learning software (Weka http://www.cs.waikato.ac.nz/ml/weka/) and the inhouse scripts to extract features from items, train the machine learning models on the publicly available typed-similarity datasets produced by PATHS (http://ixa2.si.ehu.es/sts/), and use the machine learning models on the target items. No out-of-the-box alternatives exist. The interested parties would need to replicate the software as described in (Agirre et al. 2013a).
  • 4. 3 Producing background links In order to produce background links we used Wikipedia Miner, an open source software available at http://wikipedia-miner.cms.waikato.ac.nz. We used wikipedia miner out-of-thebox. In addition, we used in-house scripts to process the internal representation of the items, extract the textual pieces, and produce the enriched representations. 4 Ontology extension The target items would need to be parsed by the following NLP tools: Pos tagging and lemmatisation (see previous section). In addition, we used in-house scripts to process the internal representation of the items, extract the textual pieces, and produce the enriched representations. The actual production of the vocabulary requires additional software. We first extract the background links (see previous section), and then find the most relevant Wikipedia articles per item. This is done globally, analysing the statistics of the whole collection. Those articles are used to categorize the items according to a Wikipedia-based category system, which is trimmed-down to only cover the categories which are relevant to the collection at hand. No out-of-the-box alternatives exist. The interested parties would need to replicate the software as described in (Fernando et al. 2012). References Agirre E., Aletras N., Gonzalez-Agirre A., Rigau G., Stevenson M. (2013a). UBC UOSTYPED: Regression for typed-similarity. The Second Joint Conference on Lexical and Computational Semantics (*SEM 2013) Agirre E., Barrena A., Fernandez K., Miranda E., Otegi A., Soroa A. (2013b). PATHSenrich: a Web Service Prototype for Automatic Cultural Heritage Item Enrichment in Research and Advanced Technology for Digital Libraries, International Conference on Theory and Practice of Digital Libraries, TPDL 2013, Valletta, Malta, September 22-26, 2013. Lecture Notes in Computer Science, vol. 8092, pp 462-465. Aletras N., Stevenson M., Clough P. (2012). Computing Similarity between Items in a Digital Library of Cultural Heritage. In ACM JOCCH. Fernando S., Hall M., Agirre E., Soroa A., Clough P., Stevenson M. (2012). Comparing taxonomies for organising collections of documents. In Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012), pages 879-894, Mumbai, India, December 2012.