Structured Vs Unstructured: Extracting Information from Classics Scholarly Texts

Structured Vs Unstructured:
Extracting Information From Classics
Scholarly Texts

Matteo Romanello
Centre for Computing in the Humanities

PhD Seminar
London 28/01/2010


Overview

Introduction

Motivations and Background

Methodology

Work Phases

Expected Results



The Project at a glance

Project started in October 2009;
Field of application: Digital Humanities, Classics
(particularly Greek literature);
co-supervision between the CCH and the CS department
at King’s -> application of Computational Linguistics
methods



Goal

Devising an automatic system to improve information retrieval
over a discipline-specific corpus of unstructured texts
focus on secondary sources
automatic -> scalable to huge amounts of data
information retrieval -> the task of retrieving information
unstructured texts -> raw texts (e.g. .txt files) as opposed
to structured/encoded XML



The Million Book Library

archive.org, Google Books -> growth of
volume of information available in
electronic format
longer “shelf-life” of books in
Classics/Humanities
results of traditional search engines ->
high recall but low precision
need for effective tools to access
information for research purposes
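
The "high recall but low precision" point can be made concrete with a minimal sketch. The document identifiers and counts below are invented for illustration only; they are not measurements from any search engine.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant hits / everything retrieved
    recall    = relevant hits / everything relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: the engine returns 100 documents,
# but only 5 documents in the collection are actually relevant.
retrieved = [f"doc{i}" for i in range(1, 101)]
relevant = ["doc1", "doc2", "doc3", "doc4", "doc200"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this invented case most relevant documents are found (high recall), but the scholar has to sift through many irrelevant hits (low precision).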



Information extraction in Classics

lack of tools comparable to Citeseer, CiteseerX, GoPubMed for
other disciplines
are JSTOR’s features/functionalities enough for scholarly
purposes?
still issues with encoding of Ancient Greek (e.g., The +$%j& of
Danaids)



Access points to information

going beyond TOCs or string
matching-based IR
access points meaningful for Classics
scholars

Contribution to research
problems peculiar to Classics can help to
improve the performance of existing
tools/algorithms
Analysis of papers published in a Classics
journal (or archive) as corpus



Mining and information extraction from classics texts

no ad-hoc gold standards/training set
lack of tools specifically tailored to Classics resources
electronically available text does not mean electronic text

Possible corpus analysis
citation patterns
citation and co-citation networks
trends in the Classics citation practice
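
As a minimal sketch of what a co-citation analysis could look like, the snippet below counts how often two sources are cited together in the same article. The article identifiers and their reference lists are hypothetical; in practice they would come from the extracted citations.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing):
    """Count, for each pair of cited sources, how many articles cite both."""
    counts = Counter()
    for refs in citing.values():
        # Sort so that each pair always appears in the same order.
        for a, b in combinations(sorted(set(refs)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical input: article id -> sources cited in that article.
articles = {
    "article_1": ["Hom. Il.", "Aesch. Sept."],
    "article_2": ["Hom. Il.", "Aesch. Sept.", "Hes. fr."],
    "article_3": ["Hom. Il.", "Hes. fr."],
}

counts = cocitation_counts(articles)
for pair, n in counts.most_common():
    print(pair, n)
```

The resulting pair counts can be read as weighted edges of a co-citation network.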



Finding Mentions of Realia

mentions of realia are information that matters -> importance of print
indexes in Classics
Using realia as access points to information
Identifying mentions of Realia
Disambiguation, different spellings or translations of names

Kinds of realia we are interested in extracting

1. Place names (ancient and modern);
2. Relevant person names (mythological names, ancient authors, modern
scholars);
3. References to primary and secondary sources (canonical texts and
modern publications about them)



Reuse of Structured Information

Over the last years scholars have produced several
structured datasources:
use of structured information to train machine-learning
based tools to mine unstructured texts
Related projects: EROCS by IBM
current practice: Wikipedia/DBpedia as datasource of
structured information
what improvements by using a discipline-specific
Knowledge Base?



Corpus building

Getting materials
Crawling online archives

Characteristics of considered corpora
Open Access -> publicly accessible
Possibly multilingual

Extracting the text from collected documents
Tools for text extraction from PDF -> open issues with
Ancient Greek encoding
re-OCR documents, even the born-digital ones



Corpus Building II

Corpora
Princeton/Stanford Working Papers in Classics (PSWPC)
Lexis
300 articles in 2 corpora

OCR
FineReader
OCRopus (layout analysis)
text extracted from PDFs (tools like pdftotext etc.)



Structured datasources

Information about the same entities (i.e. realia) can be
spread over several datasources
partial overlaps
Datasources can use different formats (text, DB, HTML,
XML etc.)
no interoperability



Structured datasources II

To create a semantic knowledge base (KB)
import each datasource
map it to high level ontologies (e.g., CIDOC-CRM)
find overlaps between datasources -> aligning the
records
The obtained knowledge base will be used as support for all the
text processing tasks
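
A minimal sketch of the record-alignment step, assuming each datasource has already been imported as a list of records with an id and a name. The record layout, the identifiers and the normalisation rule are assumptions made for illustration; mapping to an ontology such as CIDOC-CRM is not shown.

```python
import unicodedata


def normalise(name):
    """Lower-case a name, strip diacritics and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.casefold().split())


def align(source_a, source_b):
    """Map each record id in source_a to the ids in source_b with the same normalised name."""
    index = {}
    for rec in source_b:
        index.setdefault(normalise(rec["name"]), []).append(rec["id"])
    return {rec["id"]: index.get(normalise(rec["name"]), []) for rec in source_a}


# Hypothetical records from two datasources.
source_a = [
    {"id": "a1", "name": "Aeschylus"},
    {"id": "a2", "name": "Hésiode"},
]
source_b = [
    {"id": "b1", "name": "  AESCHYLUS "},
    {"id": "b2", "name": "Hesiode"},
    {"id": "b3", "name": "Homer"},
]

alignment = align(source_a, source_b)
print(alignment)
```

Exact matching on normalised names only catches trivial variation. Different spellings or translations of the same name (e.g. across languages) would need fuzzier matching or additional evidence.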



Corpus Processing

1. sentence identification
2. entity extraction (named entity recognition +
disambiguation)
KB used to build up an entity context
3. canonical references extraction
KB provides training data
4. modern bibliographic references extraction
KB provides lists of journals/place names/authors to improve
the performance of the tool



Canonical References Extraction

1. citations used specifically for primary sources (i.e. works of
ancient authors)
2. essential entry point to information: they refer to the research object,
i.e. Ancient Texts
3. logical instead of physical citation scheme (e.g., chapter/paragraph
vs. page)
4. variation -> time, style, language (regexp insufficient!)

Example
Hom. Il. XII 1
Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803
Hes. fr. 321 M.-W.
Callimaco, ’ep.’ 28 Pf., 5-6
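
To illustrate why a regular expression alone falls short, here is a minimal sketch that runs one hand-written pattern over the four examples above. The pattern is an assumption made for this illustration, not the extractor planned in the project.

```python
import re

# Illustrative pattern only: "Author. Work. scope", where the work title
# may be wrapped in quotes and the scope is either "ROMAN number" or a
# line number/range.
CANONICAL_REF = re.compile(
    r"(?P<author>[A-Z][a-z]+\.)\s+"
    r"['’]?(?P<work>[A-Z][a-z]+\.)['’]?\s+"
    r"(?P<scope>[IVXLC]+\s+\d+|\d+(?:-\d+)?)"
)

examples = [
    "Hom. Il. XII 1",
    "Aesch. ’Sept.’ 565-67",
    "Hes. fr. 321 M.-W.",
    "Callimaco, ’ep.’ 28 Pf., 5-6",
]

# True where the pattern matches from the start of the string.
matched = [CANONICAL_REF.match(text) is not None for text in examples]
for text, ok in zip(examples, matched):
    print(f"{'match   ' if ok else 'no match'}  {text}")
```

The pattern catches the first two references but misses the fragment citation with its edition siglum and the reference with an unabbreviated Italian author name. Every further variant needs another hand-written rule.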



Results
Automatically provide multiple meaningful entry points to
information
Enrich the corpus with links to resources (particularly
primary sources)
Improve the user access to the corpus
Demonstrate the scalability of the approach
Tools/Resources
Knowledge Base for Classics
Articles with improved text quality
Corpora released
single tools for information extraction (e.g. Canonical
References Extractor)

