Structured Vs Unstructured: Extracting Information from Classics Scholarly Texts
1. Introduction Motivations Methodology WorkPhases ExpectedResults
Structured Vs Unstructured:
Extracting Information From Classics
Scholarly Texts
Matteo Romanello
Centre for Computing in the Humanities
PhD Seminar
London 28/01/2010
Extracting Information From Classics Scholarly Texts CCH
Overview
Introduction
Motivations and Background
Methodology
Work Phases
Expected Results
The Project at a glance
Project started in October 2009;
Field of application: Digital Humanities, Classics
(particularly Greek literature);
co-supervision between the CCH and the CS department
at King’s -> application of Computational Linguistics
methods
Goal
Devising an automatic system to improve information retrieval
over a discipline-specific corpus of unstructured texts
focus on secondary sources
automatic -> scalable to large amounts of data
information retrieval -> the task of finding the documents or
passages relevant to a query
unstructured texts -> raw texts (e.g. .txt files) as opposed
to structured/encoded XML
The Million Book Library
archive.org, Google Books -> growth in the
volume of information available in
electronic format
longer “shelf-life” of books in
Classics/Humanities
results of traditional search engines ->
high recall but low precision
need for effective tools to access
information for research purposes
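The "high recall but low precision" point can be made concrete with a small worked example. The sketch below is illustrative only: the document identifiers and the query outcome are hypothetical, not measurements from any search engine.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents are relevant, and the engine returns
# 20 documents that include all 4 of them.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {f"d{i}" for i in range(1, 21)}

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.2 1.0
```

Nothing relevant is missed (recall 1.0), but only one result in five is useful (precision 0.2), so the scholar still has to filter the list by hand.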
Information extraction in Classics
lack of tools comparable to those available for other disciplines
(CiteSeer, CiteSeerX, GoPubMed)
are JSTOR’s features/functionalities enough for scholarly
purposes?
still issues with the encoding of Ancient Greek (e.g., The +$%j& of
Danaids)
Access points to information
going beyond tables of contents (TOCs) or
IR based on string matching
access points meaningful for Classics
scholars
Contribution to research
problems peculiar to Classics can help to
improve the performance of existing
tools/algorithms
Analysis of papers published in a Classics
journal (or archive) as corpus
Mining and information extraction from classics texts
no ad hoc gold standards or training sets
lack of tools specifically tailored to Classics resources
electronically available text does not mean electronic text
Possible corpus analysis
citation patterns
citation and co-citation networks
trends in Classics citation practice
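As an illustration of the co-citation analysis mentioned above, two works are co-cited whenever the same article cites both. The following is a minimal sketch, assuming citations have already been extracted and normalised; the article identifiers and reference lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing):
    """Count how often each pair of works is cited by the same article.

    `citing` maps an article identifier to the works it cites.
    """
    counts = Counter()
    for refs in citing.values():
        # Sort so that each pair has a single canonical ordering.
        for a, b in combinations(sorted(set(refs)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical, already-normalised citation data.
articles = {
    "article_1": ["Hom. Il.", "Hes. Theog.", "Aesch. Sept."],
    "article_2": ["Hom. Il.", "Hes. Theog."],
    "article_3": ["Hom. Il.", "Aesch. Sept."],
}

counts = cocitation_counts(articles)
for pair, n in counts.most_common():
    print(pair, n)
```

The resulting pair counts can serve as edge weights of a co-citation network.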
Finding Mentions of Realia
mentions of realia are information that matters -> importance of print
indexes in Classics
Using realia as access points to information
Identifying mentions of Realia
Disambiguation, different spellings or translations of names
Kinds of realia we are interested in extracting
1. Place names (ancient and modern);
2. Relevant personal names (mythological names, ancient authors, modern
scholars);
3. References to primary and secondary sources (canonical texts and
modern publications about them).
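The problem of different spellings and translations of the same name can be sketched as a lookup against a list of known variants. This is a simplified illustration, not the project's method: the variant table and the helper names (`normalise`, `resolve`) are hypothetical, and true disambiguation (one name, several possible entities) is not addressed.

```python
import unicodedata


def normalise(name):
    """Strip diacritics and case so that spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold().strip()


# Hypothetical variant list: one canonical key per entity.
VARIANTS = {
    "aeschylus": ["Aeschylus", "Aischylos", "Eschilo", "Eschyle", "Αἰσχύλος"],
    "callimachus": ["Callimachus", "Kallimachos", "Callimaco", "Callimaque"],
}
LOOKUP = {normalise(v): key for key, forms in VARIANTS.items() for v in forms}


def resolve(mention):
    """Return the canonical key for a mention, or None if it is unknown."""
    return LOOKUP.get(normalise(mention))


print(resolve("Callimaco"))  # callimachus
print(resolve("Sophocles"))  # None
```

A discipline-specific knowledge base would play the role of the hand-written `VARIANTS` table here.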
Reuse of Structured Information
Over the last few years, scholars have produced several
structured datasources:
use of structured information to train machine-learning-based
tools to mine unstructured texts
Related projects: EROCS by IBM
current practice: Wikipedia/DBpedia as datasource of
structured information
what improvements can be gained by using a discipline-specific
Knowledge Base?
Corpus building
Getting materials
Crawling online archives
Characteristics of considered corpora
Open Access -> publicly accessible
Possibly multilingual
Extracting the text from collected documents
Tools for text extraction from PDF -> open issues with
Ancient Greek encoding
re-OCR documents, even the born-digital ones
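One way to spot the Ancient Greek encoding problem after text extraction is to measure how much of a passage actually consists of Greek-script characters. The heuristic below is only a sketch under that assumption; the function name `greek_ratio` is hypothetical, and any decision about which documents to re-OCR would need thresholds chosen on real data.

```python
import unicodedata


def greek_ratio(text):
    """Fraction of alphabetic characters that belong to the Greek script."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    greek = sum(
        1 for c in letters if unicodedata.name(c, "").startswith("GREEK")
    )
    return greek / len(letters)


# Correctly encoded Greek versus a garbled extraction of a Greek word.
print(greek_ratio("μῆνιν ἄειδε θεά"))       # 1.0
print(greek_ratio("The +$%j& of Danaids"))  # 0.0
```

A passage that is known to quote Greek but scores 0.0 is a candidate for re-OCR.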
Corpus Building II
Corpora
Princeton/Stanford Working Papers in Classics (PSWPC)
Lexis
300 articles in 2 corpora
OCR
FineReader
OCRopus (layout analysis)
text extracted from PDFs (tools like pdftotext etc.)
Structured datasources
Information about the same entities (i.e. realia) can be
spread over several datasources
partial overlaps
Datasources can use different formats (text, DB, HTML,
XML etc.)
no interoperability
Structured datasources II
To create a semantic knowledge base (KB)
import each datasource
map it to high level ontologies (e.g., CIDOC-CRM)
find overlaps between datasources -> aligning the
records
The resulting knowledge base will be used to support all the
text processing tasks
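The import-and-align step can be illustrated with a toy example in which records from two datasources are merged under a shared key. This is a sketch under strong simplifying assumptions: the source names, record fields and values are hypothetical, matching is done on a case-folded name only, and the mapping to an ontology such as CIDOC-CRM is not shown.

```python
from collections import defaultdict


def align(*sources):
    """Group records from several datasources under a normalised name key.

    Each source is a (source_name, list_of_records) pair, and every record
    is a dict with at least a "name" field.
    """
    merged = defaultdict(dict)
    for source_name, records in sources:
        for record in records:
            key = record["name"].casefold().strip()
            merged[key][source_name] = record
    return dict(merged)


# Hypothetical datasources describing overlapping sets of places.
gazetteer = ("gazetteer", [{"name": "Athens", "lat": 37.97, "lon": 23.72}])
catalogue = ("catalogue", [
    {"name": "athens ", "type": "place"},
    {"name": "Argos", "type": "place"},
])

kb = align(gazetteer, catalogue)
# Entities described by more than one datasource, i.e. the overlaps.
overlap = [key for key, by_source in kb.items() if len(by_source) > 1]
print(overlap)  # ['athens']
```

Real alignment would need more robust matching than exact name equality, for instance over spelling variants or identifiers.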
Corpus Processing
1. sentence identification
2. entity extraction (named entity recognition +
disambiguation)
KB used to build up the context of each entity
3. canonical references extraction
KB provides training data
4. modern bibliographic references extraction
KB provides lists of journals/place names/authors to improve
the performance of the tool
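Sentence identification, the first step above, is less trivial in this domain than it looks, because abbreviated references are full of periods. The sketch below is a naive rule-based splitter, not the tool used in the project; the abbreviation list is a small hypothetical sample.

```python
import re

# Hypothetical sample of abbreviations that must not end a sentence.
ABBREVIATIONS = {
    "hom.", "il.", "aesch.", "sept.", "ar.", "arch.", "hes.", "fr.",
    "cf.", "e.g.", "i.e.",
}


def split_sentences(text):
    """Split on ., ! or ? followed by whitespace and a capital letter,
    unless the token before the boundary is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.start() + 1]
        last_token = candidate.split()[-1].casefold()
        if last_token in ABBREVIATIONS:
            continue  # e.g. "Hom." inside a reference, not a sentence end
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "The scene recalls Hom. Il. XII 1. See also Hes. fr. 321 M.-W. for a parallel."
for sentence in split_sentences(text):
    print(sentence)
```

With this input the splitter yields two sentences and keeps "Hom. Il. XII 1" intact; a list of abbreviations drawn from the knowledge base could replace the hand-written set.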
Canonical References Extraction
1. citations used specifically for primary sources (i.e. works of
ancient authors)
2. essential entry point to information: refer to the research object,
i.e. Ancient Texts
3. logical instead of physical citation scheme (e.g., chapter/paragraph
vs. page)
4. variation -> time, style, language (regexp insufficient!)
Example
Hom. Il. XII 1
Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803
Hes. fr. 321 M.-W.
Callimaco, ’ep.’ 28 Pf., 5-6
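The claim that regular expressions are insufficient can be checked against the four examples above. The pattern below is a deliberately simple baseline written for this illustration (it is not taken from the project): it expects an abbreviated author, an abbreviated capitalised work title and a numeric locus.

```python
import re

# Baseline pattern: "Author. Work. [Roman numeral] number[-number]".
CANONICAL_REF = re.compile(
    r"(?P<author>[A-Z][a-z]+\.)\s+"
    r"(?P<work>[A-Z][a-z]+\.)\s+"
    r"(?P<locus>[IVXLC]+\s+\d+(?:-\d+)?|\d+(?:-\d+)?)"
)

examples = [
    "Hom. Il. XII 1",
    "Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803",
    "Hes. fr. 321 M.-W.",
    "Callimaco, ’ep.’ 28 Pf., 5-6",
]

results = [CANONICAL_REF.search(example) is not None for example in examples]
for example, found in zip(examples, results):
    print(f"{found!s:5}  {example}")
```

Only the first example matches. Quoted work titles, fragment numbers with editor sigla and unabbreviated author names in another language all fall outside the pattern, and each new exception makes a hand-written pattern harder to maintain.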
Results
Automatically provide multiple meaningful entry points to
information
Enrich the corpus with links to resources (particularly
primary sources)
Improve user access to the corpus
Demonstrate the scalability of the approach
Tools/Resources
Knowledge Base for Classics
Articles with improved text quality
Corpora released
individual tools for information extraction (e.g. Canonical
References Extractor)