1. Structured and Unstructured:
Extracting Information From Classics
Scholarly Texts
Matteo Romanello
Centre for Computing in the Humanities
King’s College London
Graduate Colloquium - DHSI 2010
University of Victoria BC - 8th June 2010
Romanello CCH
Extracting Information From Scholarly Texts
2. The Project at a glance
Project started in October 2009;
Disciplines: Digital Humanities, Classics, Computer
Science;
co-supervised by:
Willard McCarty (KCL, Department of Digital Humanities)
Jonathan Ginzburg (KCL, Department of Computer
Science)
project supported by an AHRC (Arts and Humanities
Research Council) award
3. Goal
Devising an automatic system to improve semantic
information retrieval over a discipline-specific corpus of
unstructured texts
focus on secondary sources (e.g. journal papers) as
opposed to primary sources (i.e. Ancient Texts)
automatic -> scales to large amounts of data
information retrieval -> the task of retrieving information
unstructured texts -> raw texts (e.g. .txt files) as opposed
to the structured/encoded XML
Example
“Hom. Il. XII 1”: sequence of 14 characters meaning “first line
of the twelfth book of Homer’s Iliad”
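A minimal sketch of how those 14 characters could be turned into structured data. The lookup tables and the helper names (AUTHORS, WORKS, roman_to_int, parse_reference) are hypothetical illustrations, not part of the project, and the pattern covers only this one citation style.

```python
import re

# Hypothetical lookup tables; a real system would draw these from a knowledge base.
AUTHORS = {"Hom.": "Homer"}
WORKS = {("Hom.", "Il."): "Iliad"}

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'XII' to an integer."""
    total = 0
    # Pair each digit with its successor; a trailing space marks the end.
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # Subtractive notation: a smaller digit before a larger one is subtracted.
        if nxt in ROMAN and ROMAN[nxt] > value:
            total -= value
        else:
            total += value
    return total


def parse_reference(text):
    """Parse 'Author. Work. BOOK line' into a record, or return None."""
    m = re.fullmatch(r"(\w+\.) (\w+\.) ([IVXLC]+) (\d+)", text)
    if m is None:
        return None
    author, work, book, line = m.groups()
    if author not in AUTHORS or (author, work) not in WORKS:
        return None
    return {
        "author": AUTHORS[author],
        "work": WORKS[(author, work)],
        "book": roman_to_int(book),
        "line": int(line),
    }


print(parse_reference("Hom. Il. XII 1"))
```

Anything that deviates from this single pattern returns None, which is the limitation that motivates a more robust extraction approach.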
4. Semantic Information Retrieval
Semantic vs String Matching based IR
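The contrast can be illustrated with a toy example. The documents, the entity identifier "work:iliad" and the annotation layer below are all invented for illustration; the sketch assumes that some earlier step has already linked each document to the entities it mentions.

```python
# Hypothetical mini-corpus: three ways a text may (or may not) refer to the Iliad.
DOCS = {
    "d1": "see Hom. Il. XII 1",
    "d2": "the first line of Iliad 12",
    "d3": "cf. Hes. fr. 321 M.-W.",
}

# Hypothetical annotations produced by an entity extraction step.
ANNOTATIONS = {
    "d1": {"work:iliad"},
    "d2": {"work:iliad"},
    "d3": set(),
}


def string_search(query):
    """String matching: return documents whose text contains the query."""
    return sorted(doc for doc, text in DOCS.items() if query in text)


def semantic_search(entity):
    """Semantic retrieval: return documents annotated with the entity."""
    return sorted(doc for doc, ents in ANNOTATIONS.items() if entity in ents)


print(string_search("Iliad"))         # misses the abbreviated citation in d1
print(semantic_search("work:iliad"))  # finds both d1 and d2
```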
5. Named Entities as Entry Point to Information
Entities to be extracted:
1 Place Names (ancient and modern);
2 Relevant Person Names (mythological names, ancient authors,
modern scholars)
3 References to primary and secondary sources (canonical
texts and modern publications about them)
7. Corpus building
Getting materials
Crawling online archives
Extracting the text from collected documents
Tools for text extraction from PDF -> open issues with
Ancient Greek encoding
re-OCR documents, even the born-digital ones
8. Corpus Building II
Corpora
open access, multilingual
Princeton/Stanford Working Papers in Classics (PSWPC)
Lexis online
470 articles in 2 corpora
OCR
Finereader
Ocropus (layout analysis)
text extracted from PDFs (tools like pdftotext etc.)
Alignment of multiple OCR outputs
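One possible way to align two OCR outputs is a token-level diff. The sketch below uses Python's standard difflib rather than whatever alignment method the project adopted, and the two sample strings are invented; the function names are hypothetical.

```python
from difflib import SequenceMatcher


def align_tokens(a, b):
    """Align two OCR outputs token by token.

    Returns (tag, tokens_a, tokens_b) triples, where tag is one of
    'equal', 'replace', 'delete' or 'insert'.
    """
    ta, tb = a.split(), b.split()
    matcher = SequenceMatcher(None, ta, tb, autojunk=False)
    return [
        (tag, ta[i1:i2], tb[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


def disagreements(a, b):
    """Keep only the spans where the two outputs differ."""
    return [(x, y) for tag, x, y in align_tokens(a, b) if tag != "equal"]


# Invented sample outputs: one engine misreads "m" as "rn".
ocr_a = "Horn. Il. XII 1 is cited"
ocr_b = "Hom. Il. XII 1 is cited"
print(disagreements(ocr_a, ocr_b))
```

The disagreeing spans are the candidates for correction, for instance by voting when more than two outputs are available.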
9. Building the Knowledge Base (KB)
Goal: integrate different data sources into a single KB
Why?
Information about the same entities spread over several
data sources
Data sources might use different output formats (raw text,
DBs, HTML, XML etc.)
partial overlaps but no interoperability
How?
Use of high-level ontologies to map records related to the
same entity
Result: KB containing semantic data
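A minimal sketch of the integration idea. A plain lookup table (SAME_AS) stands in here for the ontology-based mapping, which is a deliberate simplification; the two sources, their field names and all identifiers are hypothetical.

```python
from collections import defaultdict

# Hypothetical records from two sources that use different field names.
SOURCE_A = [{"id": "A:1", "name": "Homer", "lang": "grc"}]
SOURCE_B = [{"key": "B:77", "author_label": "Homerus", "works": ["Iliad", "Odyssey"]}]

# Hypothetical mapping from source-local identifiers to one shared entity
# identifier; it stands in for a mapping expressed through an ontology.
SAME_AS = {"A:1": "entity:homer", "B:77": "entity:homer"}


def build_kb(sources):
    """Merge records about the same entity into a single KB entry.

    `sources` is a list of (id_field, label_field, records) triples.
    """
    kb = defaultdict(lambda: {"labels": set(), "records": []})
    for id_field, label_field, records in sources:
        for rec in records:
            entity = SAME_AS.get(rec[id_field])
            if entity is None:
                continue  # no known mapping: leave the record out
            kb[entity]["labels"].add(rec[label_field])
            kb[entity]["records"].append(rec)
    return dict(kb)


kb = build_kb([
    ("id", "name", SOURCE_A),
    ("key", "author_label", SOURCE_B),
])
print(sorted(kb["entity:homer"]["labels"]))
```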
10. Corpus Processing
Tasks
1 sentence identification
2 entity extraction (named entity recognition +
disambiguation)
KB used to build up the context of an entity
3 canonical references extraction
KB provides training data
4 modern bibliographic references extraction
KB provides lists of journals/place names/authors to improve
the performance of the tool
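The first task is harder than it looks in this corpus, because citation abbreviations such as "Hom." end in a full stop. The sketch below is one simple, assumed approach (not the project's implementation): an abbreviation list, which could plausibly be supplied by the KB, prevents false sentence breaks. ABBREVIATIONS and split_sentences are hypothetical names.

```python
# Hypothetical abbreviation list; it could be derived from the KB.
ABBREVIATIONS = {"Hom.", "Il.", "cf.", "fr."}


def split_sentences(text):
    """Split on sentence-final punctuation, skipping known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("See Hom. Il. XII 1. It opens the book."))
```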
12. Canonical References Extraction
1 citations used specifically for primary sources (i.e. works of
ancient authors)
2 essential entry point to information: refer to the research
object, i.e. ancient texts
3 logical instead of physical citation scheme (e.g., chapter/paragraph
vs. page)
4 variation -> time, style, language (regexp insufficient!)
Example
Hom. Il. XII 1
Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803
Hes. fr. 321 M.-W.
Callimaco, ’ep.’ 28 Pf., 5-6
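To make the point about regular expressions concrete, the sketch below tests one naive pattern against the four references above. The pattern (NAIVE) and the helper (coverage) are illustrative assumptions; the pattern is written to fit only the first example.

```python
import re

# Naive pattern: "Author. Work. ROMAN-BOOK line", as in "Hom. Il. XII 1".
NAIVE = re.compile(r"[A-Z][a-z]+\. [A-Z][a-z]+\. [IVXLC]+ \d+")

# The four references from the slide, copied verbatim.
EXAMPLES = [
    "Hom. Il. XII 1",
    "Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803",
    "Hes. fr. 321 M.-W.",
    "Callimaco, ’ep.’ 28 Pf., 5-6",
]


def coverage(pattern, examples):
    """Map each example to whether the pattern matches it in full."""
    return {ex: pattern.fullmatch(ex) is not None for ex in examples}


for ref, matched in coverage(NAIVE, EXAMPLES).items():
    print(matched, ref)
```

Only the first reference matches. Each further variant (quoted titles, line ranges, fragment numbers with editor sigla, non-abbreviated author names) would need its own rule, which is why a purely regexp-based extractor does not generalize.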
13. So What?
New Possible Research Questions:
How has the citing of primary sources in Classics changed?
What are the characteristics of citation and co-citation
networks?
Are the traditional IR tools in Classics actually exhaustive?
14. Why a Digital Humanities project?
Better understanding of
the discipline's specificities
users’ needs
Writing code to develop a project means
formalizing the way a given result is obtained
creating a repeatable and thus refutable process
introducing a reasoning based on the analysis of
quantitative data into Classics
Being able to
apply the products of DH research to traditional scholarship
15. Thanks for your attention!
matteo.romanello@kcl.ac.uk
http://kcl.academia.edu/MatteoRomanello