Introduction               Motivations                 Methodology         WorkPhases   ExpectedResults




                     Structured Vs Unstructured:
                 Extracting Information From Classics
                            Scholarly Texts

                                               Matteo Romanello
                                     Centre for Computing in the Humanities

                                                         PhD Seminar
                                                       London 28/01/2010




Extracting Information From Classics Scholarly Texts                                              CCH




Overview


       Introduction

       Motivations and Background

       Methodology

       Work Phases

       Expected Results








The Project at a glance




               Project started in October 2009;
               Field of application: Digital Humanities, Classics
               (particularly Greek literature);
               co-supervision between the CCH and the CS department
               at King’s -> application of Computational Linguistics
               methods








Goal



       Devising an automatic system to improve information retrieval
       over a discipline-specific corpus of unstructured texts
               focus on secondary sources
               automatic -> scalable to large amounts of data
               information retrieval -> the task of retrieving the
               information relevant to a query
               unstructured texts -> raw texts (e.g. .txt files) as opposed
               to structured/encoded XML








The Million Book Library



           archive.org, Google Books -> growth in the
           volume of information available in
           electronic format
           longer “shelf-life” of books in
           Classics/Humanities
           results of traditional search engines ->
           high recall but low precision
           need for effective tools to access
           information for research purposes
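
The "high recall but low precision" remark can be made concrete with a small sketch. The document identifiers and relevance judgements below are invented for illustration and do not come from the project.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved vs. relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: the engine returns 10 documents, only 2 of which are
# relevant, and those 2 are all the relevant documents in the collection.
retrieved = [f"doc{i}" for i in range(1, 11)]
relevant = ["doc3", "doc7"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case nothing relevant is missed (recall 1.0), yet the reader still has to sift through eight irrelevant hits (precision 0.2).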








Information extraction in Classics


               lack of tools comparable to those available for other
               disciplines (CiteSeer, CiteSeerX, GoPubMed)
               are JSTOR’s features/functionalities enough for scholarly
               purposes?
               still issues with the encoding of Ancient Greek (e.g., The +$%j& of
               Danaids)








Access points to information

           going beyond TOCs or string
           matching-based IR
           access points meaningful for Classics
           scholars

   Contribution to research
           problems peculiar to Classics can help to
           improve the performance of existing
           tools/algorithms
           Analysis of papers published in a Classics
           journal (or archive) as corpus






Mining and information extraction from Classics texts


               no ad-hoc gold standards/training sets
               lack of tools specifically tailored to Classics resources
               electronically available text does not mean electronic text

       Possible corpus analysis
               citation patterns
               citation and co-citation networks
               trends in the Classics citation practice
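
As an illustration of the co-citation analysis listed above, the following sketch counts how often two works are cited by the same article. The article names and their citation lists are hypothetical, and the counting scheme is only one simple way to define co-citation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical citing articles mapped to the canonical works they cite.
citations = {
    "article_A": ["Hom. Il.", "Aesch. Sept.", "Hes. Th."],
    "article_B": ["Hom. Il.", "Aesch. Sept."],
    "article_C": ["Hom. Il.", "Hes. Th."],
}

def cocitation_counts(citations):
    """Count how often each pair of works is cited together by one article."""
    counts = Counter()
    for cited in citations.values():
        # Sort so that each unordered pair always has the same key.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts

counts = cocitation_counts(citations)
for pair, n in counts.most_common():
    print(pair, n)
```

The resulting pair counts can serve as edge weights of a co-citation network.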








Finding Mentions of Realia

               mentions of realia are information that matters -> importance of print
               indexes in Classics
               Using realia as access points to information
               Identifying mentions of realia
               Disambiguation; different spellings or translations of names

       Kinds of realia we are interested in extracting

           1. Place names (ancient and modern);
           2. Relevant person names (mythological names, ancient authors, modern
              scholars);
           3. References to primary and secondary sources (canonical texts and
              modern publications about them).
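
The problem of different spellings or translations of the same name can be sketched with a small lookup table. The gazetteer entries, the identifiers and the normalisation rule are assumptions made for illustration. This handles spelling variants only; it does not disambiguate two entities that share a name.

```python
import unicodedata

# Hypothetical mini-gazetteer: canonical identifier -> known spellings.
GAZETTEER = {
    "person:aeschylus": ["Aeschylus", "Aischylos", "Eschilo", "Eschyle"],
    "person:callimachus": ["Callimachus", "Kallimachos", "Callimaco"],
}

def normalise(name):
    """Strip diacritics and case so that spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

# Invert the gazetteer: normalised variant -> canonical identifier.
LOOKUP = {
    normalise(variant): identifier
    for identifier, variants in GAZETTEER.items()
    for variant in variants
}

def resolve(mention):
    """Return the canonical identifier for a mention, or None if unknown."""
    return LOOKUP.get(normalise(mention))

print(resolve("Callimaco"))
print(resolve("Homer"))
```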







Reuse of Structured Information


       Over the last few years scholars have produced several
       structured datasources:
               use of structured information to train machine-learning
               based tools to mine unstructured texts
               Related projects: EROCS by IBM
               current practice: Wikipedia/DBpedia as datasource of
               structured information
               what improvements come from using a discipline-specific
               Knowledge Base?








Corpus building

       Getting materials
       Crawling online archives

       Characteristics of considered corpora
               Open Access -> publicly accessible
               Possibly multilingual

       Extracting the text from collected documents
               Tools for text extraction from PDF -> open issues with
               Ancient Greek encoding
               re-OCR documents, even the born-digital ones
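
One way to spot passages where the extraction step has mangled Ancient Greek is to flag tokens that consist mostly of symbols. The heuristic and its threshold are assumptions for illustration, not the project's method. It only catches Greek rendered as punctuation, not Greek rendered as wrong letters.

```python
def suspicious_tokens(text, threshold=0.5):
    """Return tokens whose share of non-alphanumeric characters exceeds threshold."""
    flagged = []
    for token in text.split():
        # Ignore ordinary punctuation attached to the edges of a word.
        core = token.strip(".,;:()[]\"'")
        if not core:
            continue
        non_alnum = sum(1 for c in core if not c.isalnum())
        if non_alnum / len(core) > threshold:
            flagged.append(token)
    return flagged

print(suspicious_tokens("The +$%j& of Danaids"))
```

Tokens flagged this way could mark a document as a candidate for re-OCR.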





Corpus Building II


       Corpora
               Princeton/Stanford Working Papers in Classics (PSWPC)
               Lexis
               300 articles in 2 corpora

       OCR
               FineReader
               OCRopus (layout analysis)
               text extracted from PDFs (tools like pdftotext etc.)







Structured datasources




               Information about the same entities (i.e. realia) can be
               spread over several datasources
               partial overlaps
               Datasources can use different formats (text, DB, HTML,
               XML etc.)
               no interoperability








Structured datasources II



       To create a semantic knowledge base (KB)
               import each datasource
               map it to high-level ontologies (e.g., CIDOC-CRM)
               find overlaps between datasources -> aligning the
               records
       The resulting knowledge base will support all the
       text processing tasks
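
The record-alignment step can be sketched as a join between two datasources on a shared key. The records, identifiers and the naive name-based key below are hypothetical. The sketch leaves out the ontology mapping, and a real alignment would likely need fuzzier matching.

```python
# Two hypothetical datasources describing overlapping sets of ancient authors.
source_a = [
    {"id": "a1", "name": "Homer"},
    {"id": "a2", "name": "Hesiod"},
    {"id": "a3", "name": "Aeschylus"},
]
source_b = [
    {"id": "b7", "name": "HESIOD "},
    {"id": "b8", "name": "aeschylus"},
    {"id": "b9", "name": "Callimachus"},
]

def key(record):
    """Naive matching key (an assumption): trimmed, case-insensitive name."""
    return record["name"].strip().casefold()

def align(left, right):
    """Pair records sharing a key; return (pairs, only_left, only_right)."""
    index = {key(r): r for r in right}
    pairs, only_left = [], []
    for record in left:
        match = index.pop(key(record), None)
        if match is not None:
            pairs.append((record["id"], match["id"]))
        else:
            only_left.append(record["id"])
    # Whatever is left in the index had no counterpart in `left`.
    only_right = [r["id"] for r in index.values()]
    return pairs, only_left, only_right

pairs, only_a, only_b = align(source_a, source_b)
print(pairs, only_a, only_b)
```

The unmatched records on either side show the partial overlap between the two datasources.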








Corpus Processing



          1. sentence identification
          2. entity extraction (named entity recognition +
             disambiguation)
                       KB used to build up an entity context
          3. canonical reference extraction
                 KB provides training data
          4. modern bibliographic reference extraction
                KB provides lists of journals/place names/authors to improve
                the performance of the tool








Canonical References Extraction

          1. citations used specifically for primary sources (i.e. works of
             ancient authors)
          2. essential entry point to information: they refer to the research object,
             i.e. ancient texts
          3. logical instead of physical citation scheme (e.g., chapter/paragraph
             vs. page)
          4. variation -> time, style, language (regexp insufficient!)

       Example
       Hom. Il. XII 1
       Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803
       Hes. fr. 321 M.-W.
       Callimaco, ’ep.’ 28 Pf., 5-6
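
A short sketch shows why a single regular expression falls short on the examples above. The pattern is a deliberately naive assumption (abbreviated author, abbreviated work, book, line) and is not the extractor developed in the project.

```python
import re

# Naive pattern (an assumption): "Author. Work. BOOK LINE".
NAIVE_REF = re.compile(
    r"\b([A-Z][a-z]+\.)\s+([A-Z][a-z]+\.)\s+([IVXLC]+|\d+)\s+(\d+)"
)

# The four example references from the slide.
EXAMPLES = [
    "Hom. Il. XII 1",
    "Aesch. ’Sept.’ 565-67",
    "Hes. fr. 321 M.-W.",
    "Callimaco, ’ep.’ 28 Pf., 5-6",
]

def find_reference(text):
    """Return (author, work, book, line) if the naive pattern matches, else None."""
    match = NAIVE_REF.search(text)
    return match.groups() if match else None

for example in EXAMPLES:
    print(example, "->", find_reference(example))
```

Only the first example is recognised. Quoted work titles, fragment numbers with editor sigla and unabbreviated author names in another language all escape the pattern, and each new variant would require yet another hand-written rule.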






Results
               Automatically provide multiple meaningful entry points to
               information
               Enrich the corpus with links to resources (particularly
               primary sources)
               Improve user access to the corpus
               Demonstrate the scalability of the approach
       Tools/Resources
               Knowledge Base for Classics
               Articles with improved text quality
               Corpora released
               individual tools for information extraction (e.g. Canonical
               References Extractor)

Extracting Information From Classics Scholarly Texts                                        CCH

Weitere ähnliche Inhalte

Ähnlich wie Stuctured Vs Unstructured: Extracting Information from Classics Scholarly Texts

Recent_Trends_in_Deep_Learning_Based_Open-Domain_Textual_Question_Answering_S...
Recent_Trends_in_Deep_Learning_Based_Open-Domain_Textual_Question_Answering_S...Recent_Trends_in_Deep_Learning_Based_Open-Domain_Textual_Question_Answering_S...
Recent_Trends_in_Deep_Learning_Based_Open-Domain_Textual_Question_Answering_S...
ataloadane
 
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology:  A Large-Scale Taxonomy of Research AreasThe Computer Science Ontology:  A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
Angelo Salatino
 
Understanding Information Architecture
Understanding Information ArchitectureUnderstanding Information Architecture
Understanding Information Architecture
Scott Abel
 
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research AreasThe Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
Angelo Salatino
 

Ähnlich wie Stuctured Vs Unstructured: Extracting Information from Classics Scholarly Texts (20)

Romanello tokyo
Romanello tokyoRomanello tokyo
Romanello tokyo
 
Structured and Unstructured:Extracting Information From Classics Scholarly Texts
Structured and Unstructured:Extracting Information From Classics Scholarly TextsStructured and Unstructured:Extracting Information From Classics Scholarly Texts
Structured and Unstructured:Extracting Information From Classics Scholarly Texts
 
[poster] Extracting Information From Classics Scholarly Texts
[poster] Extracting Information From Classics Scholarly Texts[poster] Extracting Information From Classics Scholarly Texts
[poster] Extracting Information From Classics Scholarly Texts
 
A Clean Slate?
A Clean Slate?A Clean Slate?
A Clean Slate?
 
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
 
Advanced Knowledge Technologies (AKT) -highlights 2006
Advanced Knowledge Technologies (AKT) -highlights 2006Advanced Knowledge Technologies (AKT) -highlights 2006
Advanced Knowledge Technologies (AKT) -highlights 2006
 
Ontology-based information extraction in the DERI Reading Group
Ontology-based information extraction in the DERI Reading GroupOntology-based information extraction in the DERI Reading Group
Ontology-based information extraction in the DERI Reading Group
 
MULTI-LEARNING SPECIAL SESSION / EDUCON 2018 / EMADRID TEAM
MULTI-LEARNING SPECIAL SESSION / EDUCON 2018 / EMADRID TEAMMULTI-LEARNING SPECIAL SESSION / EDUCON 2018 / EMADRID TEAM
MULTI-LEARNING SPECIAL SESSION / EDUCON 2018 / EMADRID TEAM
 
Web Information Extraction for the DB Research Domain
Web Information Extraction for the DB Research DomainWeb Information Extraction for the DB Research Domain
Web Information Extraction for the DB Research Domain
 
Web Information Extraction for the Database Research Domain
Web Information Extraction for the Database Research DomainWeb Information Extraction for the Database Research Domain
Web Information Extraction for the Database Research Domain
 
Recent_Trends_in_Deep_Learning_Based_Open-Domain_Textual_Question_Answering_S...
Recent_Trends_in_Deep_Learning_Based_Open-Domain_Textual_Question_Answering_S...Recent_Trends_in_Deep_Learning_Based_Open-Domain_Textual_Question_Answering_S...
Recent_Trends_in_Deep_Learning_Based_Open-Domain_Textual_Question_Answering_S...
 
Coding Your Results
Coding Your ResultsCoding Your Results
Coding Your Results
 
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...
 
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology:  A Large-Scale Taxonomy of Research AreasThe Computer Science Ontology:  A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
 
Understanding Information Architecture
Understanding Information ArchitectureUnderstanding Information Architecture
Understanding Information Architecture
 
Poster: Using Open Source Tools to Improve Access to Oral History Collections
Poster: Using Open Source Tools to Improve Access to Oral History CollectionsPoster: Using Open Source Tools to Improve Access to Oral History Collections
Poster: Using Open Source Tools to Improve Access to Oral History Collections
 
An Open Context for Archaeology
An Open Context for ArchaeologyAn Open Context for Archaeology
An Open Context for Archaeology
 
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research AreasThe Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
 
Scidisc Slides 27 Sept 2010
Scidisc Slides 27 Sept 2010Scidisc Slides 27 Sept 2010
Scidisc Slides 27 Sept 2010
 
The Nature of Information
The Nature of InformationThe Nature of Information
The Nature of Information
 

Mehr von Matteo Romanello

Exploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in ClassicsExploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in Classics
Matteo Romanello
 
DARIAH Geo-browser: Exploring Data through Time and Space
DARIAH Geo-browser: Exploring Data through Time and SpaceDARIAH Geo-browser: Exploring Data through Time and Space
DARIAH Geo-browser: Exploring Data through Time and Space
Matteo Romanello
 
Rethinking Critical Editions of Fragments by Ontologies
Rethinking Critical Editions of Fragments by OntologiesRethinking Critical Editions of Fragments by Ontologies
Rethinking Critical Editions of Fragments by Ontologies
Matteo Romanello
 
Presentatio @ ELPUB 2008, Toronto
Presentatio @ ELPUB 2008, TorontoPresentatio @ ELPUB 2008, Toronto
Presentatio @ ELPUB 2008, Toronto
Matteo Romanello
 
Linking Primary and Secondary by Microformats
Linking Primary and Secondary by MicroformatsLinking Primary and Secondary by Microformats
Linking Primary and Secondary by Microformats
Matteo Romanello
 
M.Romanello Ecal Presentation
M.Romanello Ecal PresentationM.Romanello Ecal Presentation
M.Romanello Ecal Presentation
Matteo Romanello
 

Mehr von Matteo Romanello (15)

Towards the Automatic Retrieval of Cited Parallel Passages from Secondary Lit...
Towards the Automatic Retrieval of Cited Parallel Passages from Secondary Lit...Towards the Automatic Retrieval of Cited Parallel Passages from Secondary Lit...
Towards the Automatic Retrieval of Cited Parallel Passages from Secondary Lit...
 
Scaling up the Extraction of Canonical Citations in Classics
Scaling up the Extraction of Canonical Citations in ClassicsScaling up the Extraction of Canonical Citations in Classics
Scaling up the Extraction of Canonical Citations in Classics
 
Transforming Indexes Locorum into Citation Networks
Transforming Indexes Locorum into Citation NetworksTransforming Indexes Locorum into Citation Networks
Transforming Indexes Locorum into Citation Networks
 
Enhancing and Extending the Digital Study of Intertextuality (pt. 2): Reveali...
Enhancing and Extending the Digital Study of Intertextuality (pt. 2): Reveali...Enhancing and Extending the Digital Study of Intertextuality (pt. 2): Reveali...
Enhancing and Extending the Digital Study of Intertextuality (pt. 2): Reveali...
 
Introduction to the Text Reuse panel at DH 2014
Introduction to the Text Reuse panel at DH 2014Introduction to the Text Reuse panel at DH 2014
Introduction to the Text Reuse panel at DH 2014
 
Exploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in ClassicsExploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in Classics
 
DARIAH Geo-browser: Exploring Data through Time and Space
DARIAH Geo-browser: Exploring Data through Time and SpaceDARIAH Geo-browser: Exploring Data through Time and Space
DARIAH Geo-browser: Exploring Data through Time and Space
 
Greedy Enough for the Grid?
Greedy Enough for the Grid?Greedy Enough for the Grid?
Greedy Enough for the Grid?
 
DIGITAL HUMANITIES E FILOLOGIA Un'introduzione
DIGITAL HUMANITIES   E FILOLOGIA   Un'introduzioneDIGITAL HUMANITIES   E FILOLOGIA   Un'introduzione
DIGITAL HUMANITIES E FILOLOGIA Un'introduzione
 
Ht159 Poster
Ht159 PosterHt159 Poster
Ht159 Poster
 
Rethinking Critical Editions of Fragments by Ontologies
Rethinking Critical Editions of Fragments by OntologiesRethinking Critical Editions of Fragments by Ontologies
Rethinking Critical Editions of Fragments by Ontologies
 
Presentatio @ ELPUB 2008, Toronto
Presentatio @ ELPUB 2008, TorontoPresentatio @ ELPUB 2008, Toronto
Presentatio @ ELPUB 2008, Toronto
 
Linking Primary and Secondary by Microformats
Linking Primary and Secondary by MicroformatsLinking Primary and Secondary by Microformats
Linking Primary and Secondary by Microformats
 
M. Romanello, E-scholia: scenari digitali per la comunicazione scientifica in...
M. Romanello, E-scholia: scenari digitali per la comunicazione scientifica in...M. Romanello, E-scholia: scenari digitali per la comunicazione scientifica in...
M. Romanello, E-scholia: scenari digitali per la comunicazione scientifica in...
 
M.Romanello Ecal Presentation
M.Romanello Ecal PresentationM.Romanello Ecal Presentation
M.Romanello Ecal Presentation
 

Kürzlich hochgeladen

Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
Chris Hunter
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
heathfieldcps1
 

Kürzlich hochgeladen (20)

PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Role Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxRole Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptx
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural ResourcesEnergy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 

Stuctured Vs Unstructured: Extracting Information from Classics Scholarly Texts

  • 1. Introduction Motivations Methodology WorkPhases ExpectedResults Structured Vs Unstructured: Extracting Information From Classics Scholarly Texts Matteo Romanello1 1 Centre for Computing in the Humanities PhD Seminar London 28/01/2010 Extracting Information From Classics Scholarly Texts CCH
  • 2. Introduction Motivations Methodology WorkPhases ExpectedResults Overview Introduction Motivations and Background Methodology Work Phases Expected Results Extracting Information From Classics Scholarly Texts CCH
  • 3. Introduction Motivations Methodology WorkPhases ExpectedResults Overview Introduction Motivations and Background Methodology Work Phases Expected Results Extracting Information From Classics Scholarly Texts CCH
  • 4. Introduction Motivations Methodology WorkPhases ExpectedResults The Project at a glance Project started in October 2009; Field of application: Digital Humanities, Classics (particularly Greek literature); co-supervision between the CCH and the CS department at King’s -> application of Computational Linguistics method Extracting Information From Classics Scholarly Texts CCH
  • 5. Introduction Motivations Methodology WorkPhases ExpectedResults Goal Devising an automatic system to improve information retrieval over a discipline-specific corpus of unstructured texts focus on secondary sources automatic -> scalable with huge amount of data information retrieval -> the task of retrieving information unstructured texts -> raw texts (e.g. .txt files) as opposed to the structured/encoded XML Extracting Information From Classics Scholarly Texts CCH
  • 6. Introduction Motivations Methodology WorkPhases ExpectedResults Overview Introduction Motivations and Background Methodology Work Phases Expected Results Extracting Information From Classics Scholarly Texts CCH
  • 7. Introduction Motivations Methodology WorkPhases ExpectedResults The Million Book Library archives.org, Google Books -> growth of volume of information available in electronic format longer “shelf-life” of books in Classics/Humanities results of traditional search engines -> high recall but low precision need for effective tools to access information for research purposes Extracting Information From Classics Scholarly Texts CCH
  • 8. Introduction Motivations Methodology WorkPhases ExpectedResults Information extraction in Classics lack of tools comparable to Citeseer, CiteseerX, GoPubMed for other disciplines are JSTOR’s features/functionalities enough for scholarly purposes? still issues with encoding of ancient greek (e.g., The +$%j& of Danaids) Extracting Information From Classics Scholarly Texts CCH
  • 9. Introduction Motivations Methodology WorkPhases ExpectedResults Access points to information going beyond TOCs or string matching-based IR access points meaningful for Classics scholars Contribution to research problems peculiar of Classics can help to improve the performances of existing tools/algorithms Analysis of papers published in a Classics journal (or archive) as corpus Extracting Information From Classics Scholarly Texts CCH
  • 10. Introduction Motivations Methodology WorkPhases ExpectedResults Mining and information extraction from classics texts no ad-hoc gold standards/training set lack of tools specifically tailored to Classics resources electronically available text does not mean electronic text Possible corpus analysis citation patterns citation and co-citation networks trends in the Classics citation practice Extracting Information From Classics Scholarly Texts CCH
  • 11. Introduction Motivations Methodology WorkPhases ExpectedResults Overview Introduction Motivations and Background Methodology Work Phases Expected Results Extracting Information From Classics Scholarly Texts CCH
  • 12. Introduction Motivations Methodology WorkPhases ExpectedResults Finding Mentions of Realia mentions of realia are information that matter -> importance of print indexes in Classics Using realia as access points to information Identifying mentions of Realia Disambiguation, different spellings or translations of names Kinds of realia we are interested in extracting 1. Place Names (ancient and modern); 2. Relevant person Names(mythological names, ancient authors, modern scholars) 3. Reference to primary and secondary sources (canonical texts and modern publications about them) Extracting Information From Classics Scholarly Texts CCH
  • 13. Reuse of Structured Information: over the last years scholars have been producing several structured datasources; use of structured information to train machine-learning-based tools to mine unstructured texts. Related projects: EROCS by IBM. Current practice: Wikipedia/DBpedia as datasource of structured information; what improvements come from using a discipline-specific Knowledge Base?
  • 14. Overview: Introduction; Motivations and Background; Methodology; Work Phases; Expected Results.
  • 16. Corpus building. Getting materials: crawling online archives. Characteristics of considered corpora: Open Access -> publicly accessible; possibly multilingual. Extracting the text from collected documents: tools for text extraction from PDF -> open issues with Ancient Greek encoding; re-OCR documents, even the born-digital ones.
  • 17. Corpus Building II. Corpora: Princeton/Stanford Working Papers in Classics (PSWPC); Lexis; 300 articles in 2 corpora. OCR: Finereader; Ocropus (layout analysis); text extracted from PDFs (tools like pdftotext etc.).
  • 18. Structured datasources: information about the same entities (i.e. realia) can be spread over several datasources -> partial overlaps; datasources can use different formats (text, DB, HTML, XML etc.) -> no interoperability.
  • 19. Structured datasources II. To create a semantic knowledge base (KB): import each datasource; map it to high-level ontologies (e.g., CIDOC-CRM); find overlaps between datasources -> aligning the records. The resulting knowledge base will be used as support for all the text processing tasks.
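The record-alignment step can be sketched as follows. This is only an illustration under a strong assumption: two records are treated as overlapping when they share a name after simple normalization (case, diacritics, punctuation). The slides do not specify the matching technique, and the names `normalize`, `align`, `gazetteer`, and `authority_file` are hypothetical.

```python
import unicodedata


def normalize(name):
    """Lowercase and strip diacritics and punctuation so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch for ch in no_marks.lower() if ch.isalnum())


def align(source_a, source_b):
    """Return sorted (id_a, id_b) pairs whose records share a normalized name."""
    index = {}
    for id_b, names in source_b.items():
        for name in names:
            index.setdefault(normalize(name), set()).add(id_b)
    matches = set()
    for id_a, names in source_a.items():
        for name in names:
            for id_b in index.get(normalize(name), ()):
                matches.add((id_a, id_b))
    return sorted(matches)


# Hypothetical records from two datasources: record id -> known name variants
gazetteer = {"g1": ["Athenai", "Athens"], "g2": ["Thebai"]}
authority_file = {"a7": ["ATHENS"], "a9": ["Thēbai", "Thebes"]}
aligned = align(gazetteer, authority_file)
```

Exact matching on normalized names is the simplest possible criterion; translated or heavily variant names would need fuzzier matching or additional attributes.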
  • 20. Corpus Processing: (1) sentence identification; (2) entity extraction (named entity recognition + disambiguation): KB used to build up an entity context; (3) canonical references extraction: KB provides training data; (4) modern bibliographic references extraction: KB provides lists of journals, place names, and authors to improve the performance of the tool.
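As a rough illustration of step 4, the sketch below shows how lists drawn from a knowledge base could be used to tag known journals, places, and authors in a bibliographic reference. It is a plain dictionary lookup, not the project's tool; the function `tag_known_entities`, the list contents, and the reference string are all invented for the example.

```python
def tag_known_entities(text, kb_lists):
    """Return (label, surface form, start offset) for each KB entry found in text."""
    found = []
    lowered = text.lower()
    for label, entries in kb_lists.items():
        for entry in entries:
            # Case-insensitive search for the first occurrence of the entry
            start = lowered.find(entry.lower())
            if start != -1:
                found.append((label, text[start:start + len(entry)], start))
    return sorted(found, key=lambda hit: hit[2])


# Hypothetical KB lists and a made-up reference string
kb_lists = {
    "journal": ["Classical Quarterly"],
    "place": ["Oxford"],
    "author": ["Smith"],
}
reference = "J. Smith, A Note on Hesiod, Classical Quarterly 10, Oxford."
hits = tag_known_entities(reference, kb_lists)
```

In a real pipeline such hits would more plausibly serve as features for a trained reference parser than as final output.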
  • 21. Canonical References Extraction: (1) citations used specifically for primary sources (i.e. works of ancient authors); (2) essential entry point to information: they refer to the research object, i.e. ancient texts; (3) logical instead of physical citation scheme (e.g., chapter/paragraph vs. page); (4) variation -> time, style, language (regexp insufficient!). Examples: Hom. Il. XII 1 / Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803 / Hes. fr. 321 M.-W. / Callimaco, ’ep.’ 28 Pf., 5-6.
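The remark that regular expressions are insufficient can be made concrete with a small sketch. The pattern below is a deliberately naive assumption (abbreviated author, abbreviated work, optional Roman-numeral book, line number) written for this illustration, not one used in the project; it is run against the four example lines from the slide.

```python
import re

# Naive pattern: "Author. Work. [Book] line[-line]", e.g. "Hom. Il. XII 1"
NAIVE = re.compile(
    r"[A-Z][a-z]+\.\s"      # abbreviated author, e.g. "Hom."
    r"[A-Z][a-z]+\.\s"      # abbreviated work, e.g. "Il."
    r"(?:[IVXLC]+\s)?"      # optional book number in Roman numerals
    r"\d+(?:-\d+)?"         # line number or simple range
)

# The example lines from the slide
EXAMPLES = [
    "Hom. Il. XII 1",
    "Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803",
    "Hes. fr. 321 M.-W.",
    "Callimaco, ’ep.’ 28 Pf., 5-6",
]

matched = [ref for ref in EXAMPLES if NAIVE.fullmatch(ref)]
missed = [ref for ref in EXAMPLES if not NAIVE.fullmatch(ref)]
```

Only the Homer reference matches in full. Quoted work titles, fragment numbers with editor sigla, unabbreviated author names in another language, and multiple ranges each require a further special case, which is the kind of variation the slide points to.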
  • 22. Overview: Introduction; Motivations and Background; Methodology; Work Phases; Expected Results.
  • 23. Results: automatically provide multiple meaningful entry points to information; enrich the corpus with links to resources (particularly primary sources); improve user access to the corpus; demonstrate the scalability of the approach. Tools/Resources: Knowledge Base for Classics; articles with improved text quality; corpora released; single tools for information extraction (e.g. Canonical References Extractor).