M. Romanello, E-scholia: scenari digitali per la comunicazione scientifica in...
[poster] Structured and Unstructured: Extracting Information from Classics Scholarly Texts
1. EXTRACTING INFORMATION FROM CLASSICS SCHOLARLY TEXTS
Matteo Romanello, matteo.romanello@kcl.ac.uk
THE PROJECT AT A GLANCE EXTRACTING CANONICAL
A KNOWLEDGE-BASED APPROACH
● PhD in Digital Humanities (DH) REFERENCES
Idea:
● Project started in October 2009 Classicists usually refer to primary sources by
● Project co-supervised by:
●exploiting accurate information contained in means of abbreviated references, called canonical
structured data sources; references.
● Willard McCarty (CCH KCL)
● Jonathan Ginzburg (DCS KCL)
● using information in the KB to train machine- Explicit linking of primary and secondary sources
learning based components; contained in the Digital Library implies being able to
●Supported by the Arts and Humanities Research
Council (AHRC) Obstacles: automatically interpret and extract such references,
which is still an open issue.
●different (heterogeneous) data formats, XML
dialects, DB schemas etc.; CRefEx is a tool I'm developing for the automatic
extraction of canonical references.
INFORMATION RETRIEVAL IN CLASSICS lack of semantic interoperability.
The digital image of the Venetus A (Marc. Graec. Z.454) used for the graphics has been produced by the CHS (Harvard U)
●
● APh: state of the art in IR (Information Retrieval) in Possible solution: using high level ontologies (e.g. EXTRACTING BIBLIOGRAPHIC
the Classics CIDOC CRM) to make data interoperable REFERENCES
● Model: centralized, analytic (selective), manual Once integrated into a Knowledge Base that The task of extracting modern bibliographic for the
labour, highly (absolutely?) accurate, subscription- information can be used to train machine-learning purpose of Automatic Citation Indexing is well
based access based components. established in disciplines such as Computer
● Future(s): How can we complement it? With which Science.
tools? What can be improved in the tool used by the WHICH CORPUS TO BE MINED? However, some assumptions behind a tool such as
majority of Classicists to retrieve bibliography? Parscit (used to build CiteseerX) are not always
3 scenarios: applicable to Classics papers.
1) LEXIS: electronic version of a Classics journal. This is an example of how the complexity of
Online papers are PDF produced scanning the Humanities materials can lead to an improvement of
...OTHER POSSIBLE MODEL(S) actual printed copies (open access,no OCR, low tools and algorithms drawn from other fields.
● Open source and open access quality images);
Decentralized: harvesting rather than centralising
EMERGING RESEARCH QUESTIONS
●
2)PSWPC (Princeton Stanford working papers in
●Automatic: how to reduce the labour of manual Classics): open archive; text stripped from PDFs ●
How did the practice of citing ancient texts change
annotation? often is scarcely accurate (e.g. Greek sequences)
after that tools such as the TLG were introduced?
● Errorful/noisy input and acceptable error rate 3) JSTOR's Data for Research API: no controls ● What the citation networks in Classics look like?
● Measurable completeness/exhaustivity over text processing; access to data for ~180k
● How the two approaches compare to each other? Classics papers (to word frequency, extracted ●What emerges from the comparison with citation
references, bigrams/trigrams, key terms). networks in other disciplines?
●What are the access point to information
meaningful for Classicists?
Centre of Computing in the Humanities (CCH), King's College London