IBM Haifa Research Lab

PreMapper:
Improving Entity Extraction Accuracy in the
Digital Humanities
Cormac Hampson (cormac.hampson@scss.tdc.ie)
Ella Rabinovich (ellak@il.ibm.com), Sara Porat (porat@il.ibm.com)
Maya Koleva (maya.koleva@commetric.com), Ivan Uzunov (ivan.uzunov@commetric.com)
Owen Conlan (owen.conlan@scss.tcd.ie)

© 2014 IBM Corporation

What is CULTURA
• Digital humanities portal supporting the exploration of cultural heritage collections by a range of different users
  • Professional researchers and historians
  • Students with little or no experience of a particular archive
• There are three digitised collections in the portal
  • 1641 Depositions (http://cultura-project.eu/1641/)
  • Bureau of Military History - 1916 Rising (http://cultura-project.eu/1916)
  • IPSA Collection (http://cultura-project.eu/ipsa)


Smart Content Analysis with Entity-Relationship Extraction
• A powerful technique for injecting semantics into unstructured text
  • Employs Natural Language Processing (NLP)
  • Involves training on a dataset and/or using prior knowledge (e.g., dictionaries) so that specific entities can be identified within the text
• Each collection introduces its own entity-relationship model
  • Entities, e.g., Person, Location, Event
  • Entity attributes, e.g., Person.occupation, Deposition.mentioned_date
  • Relations between entities, e.g., Person at Location
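
As an illustration only, an entity-relationship model of this kind could be represented with two small record types. The class names, field names and sample values below are assumptions made for this sketch; they are not taken from the CULTURA schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Entity:
    """One extracted entity (hypothetical structure, not the CULTURA schema)."""
    entity_id: str
    entity_type: str                 # e.g. "Person", "Location", "Event"
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Relation:
    """A typed link between two entities, e.g. Person at Location."""
    source_id: str
    relation_type: str
    target_id: str


# Sample values are invented for illustration.
person = Entity("p1", "Person", "Robert Andrew", {"occupation": "merchant"})
place = Entity("l1", "Location", "Dublin")
relation = Relation(person.entity_id, "at", place.entity_id)
```

Keeping attributes in a plain dictionary lets each collection define its own attribute names (such as `occupation` or `mentioned_date`) without changing the record types.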


Entity Extraction – Example
<title> first-name last-name
sir Robert Andrew


Manual Updates of Extracted Entities - Motivation
• Automatic entity extraction cannot provide full accuracy
• The 1641 Depositions collection introduces additional difficulty due to its noisy text, inconsistent grammar and spelling
• Extraction errors can damage a curator’s trust in the automatic processing, as well as an end user’s overall confidence in the system
• Approaches that improve the accuracy of entity extraction are of major benefit to the CULTURA environment


Entities Visualisation and Modification with PreMapper
• PreMapper is a web-based visualisation and analysis tool that is integrated into the CULTURA environment
• Provides visualisation and editing of entities, maps, flows and relationships between individuals and groups
• Entities (people, organizations) are represented by nodes; links represent relationships between these nodes


Manual Changes of Extracted Entities
PreMapper enables curators of the collection to make manual changes to the extracted entities using a GUI
• Add/delete/update an entity
• Merge two entities into a single entity (entity disambiguation)
• Add/delete a relationship between entities

The entity “sir phelim” can be merged with the entity “phelim neil” if an expert deems that these entities refer to the same person
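
A merge of this kind can be sketched on a simple node-and-link graph. This is a minimal illustration under stated assumptions, not PreMapper's implementation: the `Node` type, the `merge_nodes` function, the identifiers and the sample location are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Link = Tuple[str, str, str]  # (source_id, relation_type, target_id)


@dataclass
class Node:
    """Hypothetical graph node standing for one extracted entity."""
    node_id: str
    name: str
    aliases: Set[str] = field(default_factory=set)


def merge_nodes(nodes: Dict[str, Node], links: List[Link],
                keep_id: str, drop_id: str) -> List[Link]:
    """Fold drop_id into keep_id and return the rewired, de-duplicated links."""
    if keep_id == drop_id:
        raise ValueError("cannot merge an entity with itself")
    keep = nodes[keep_id]
    dropped = nodes.pop(drop_id)
    # Remember the dropped name so the merged entity can still be found by it.
    keep.aliases |= {dropped.name} | dropped.aliases
    merged: List[Link] = []
    for src, rel, dst in links:
        src = keep_id if src == drop_id else src
        dst = keep_id if dst == drop_id else dst
        link = (src, rel, dst)
        # Skip self-links and duplicates created by the merge.
        if src != dst and link not in merged:
            merged.append(link)
    return merged


nodes = {
    "e1": Node("e1", "sir phelim"),
    "e2": Node("e2", "phelim neil"),
    "l1": Node("l1", "some location"),  # placeholder, not from the collection
}
links = [("e1", "at", "l1"), ("e2", "at", "l1")]
links = merge_nodes(nodes, links, keep_id="e2", drop_id="e1")
```

After the call, only `e2` remains for the person, it carries “sir phelim” as an alias, and the two identical “at” links collapse into one.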


General Flow


Entity Disambiguation via PreMapper
• The task of determining the identity of entities mentioned in the text
  • e.g., based on an entity’s key attributes
• Entity disambiguation in historical content is one of the main challenges for CULTURA’s professional users
  • Are “Rob. Meredith” and “Robert Meredith” the same person?
  • Are “sir Phelim” and “Phelim o neil” the same person?
  • Entity scope matters (disambiguation of entities found in the same deposition vs. entities found in different depositions)
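
The questions above are ultimately for a domain expert to answer, but a tool could pre-select candidate pairs for review. The following is a rough sketch of one possible heuristic, not the method used in CULTURA: the title list, the prefix rule for abbreviations, the "likely"/"possible" labels and the deposition identifiers are all assumptions.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

TITLES = {"sir", "mr", "mrs", "lord", "lady"}  # assumed title dictionary


@dataclass(frozen=True)
class Mention:
    text: str
    deposition_id: str  # hypothetical identifier of the source deposition


def name_tokens(text: str) -> List[str]:
    """Lower-case the name, drop punctuation and titles."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TITLES]


def tokens_match(a: str, b: str) -> bool:
    """Equal tokens, or one is an abbreviation (prefix, 3+ letters) of the other."""
    short, long_ = sorted((a, b), key=len)
    return short == long_ or (len(short) >= 3 and long_.startswith(short))


def compatible(a: str, b: str) -> bool:
    """Every token of the shorter name must match some token of the longer one."""
    short, long_ = sorted((name_tokens(a), name_tokens(b)), key=len)
    return bool(short) and all(
        any(tokens_match(s, l) for l in long_) for s in short
    )


def suggest_merge(a: Mention, b: Mention) -> Optional[str]:
    """Return a review hint; scope decides how strong the hint is."""
    if not compatible(a.text, b.text):
        return None
    return "likely" if a.deposition_id == b.deposition_id else "possible"
```

With this heuristic, “Rob. Meredith” and “Robert Meredith” are flagged as candidates, as are “sir Phelim” and “Phelim o neil”, while “Robert Andrew” and “Robert Meredith” are not. The result is only a suggestion: the merge itself remains an expert decision.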

Non-functional challenges
• Authorization – who is allowed to make changes?
• Personalization – what is the scope of a specific change (specific researcher, group of researchers, the entire professional community)?
• Verification – who verifies the changes?
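
The slides leave these questions open. Purely as an illustration of how the three concerns might show up in a change record, here is a small sketch; the `Change` type, the `Scope` values and the policy in `visible_to` (community-wide changes become visible only once verified) are assumptions, not decisions made in CULTURA.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class Scope(Enum):
    RESEARCHER = "researcher"   # visible to the author only
    GROUP = "group"             # visible to the author's group
    COMMUNITY = "community"     # visible to the entire professional community


@dataclass
class Change:
    author: str
    description: str
    scope: Scope
    verified_by: Optional[str] = None


def submit(change: Change, editors: Set[str]) -> Change:
    """Authorization: only known editors may make changes."""
    if change.author not in editors:
        raise PermissionError(f"{change.author} may not edit entities")
    return change


def verify(change: Change, reviewer: str) -> None:
    """Verification: assumed rule that authors cannot verify their own change."""
    if reviewer == change.author:
        raise ValueError("a change must be verified by someone else")
    change.verified_by = reviewer


def visible_to(change: Change, user: str, group: Set[str]) -> bool:
    """Personalization: decide who sees the change, based on its scope."""
    if user == change.author:
        return True
    if change.scope is Scope.GROUP:
        return user in group
    if change.scope is Scope.COMMUNITY:
        return change.verified_by is not None
    return False
```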


Summary and Future Work
• Entity-relationship extraction is a powerful technique for extracting structured information from unstructured documents
• PreMapper is a visualization tool that allows domain experts to improve the accuracy of the entity-relationship data
• Domain experts’ feedback is important in refining the user interface with the CULTURA environment
  • It becomes vital when entity extraction is error-prone, as with the 1641 Depositions collection, which contains a lot of noise and misspellings
• Future work includes further exploration and design of the fully integrated end-to-end solution

http://staging1.commetric.com:8080/cultura/?q=1641&ids=836062r034&nodeTypeId=7&layout=circle#svg-graph-editor-switch




Editor's notes

  1. 1641 Depositions – witness testimonies, mainly by Protestants but also by Catholics, concerning their experiences during the 1641 Irish rebellion. These testimonies document incidents such as robbery, military actions, imprisonment and murder. 1916 Rising – documents events that happened during the rising at Enniscorthy (Ireland) in 1916. IPSA collection – a digital archive of illuminated medieval manuscripts, with high-quality drawings and metadata (mainly herbs and astrology).
  2. In our work, entity extraction involved defining dictionaries for each entity type we wanted to extract (e.g. FirstName, LastName, Location, Date), and a set of parsing rules that use these dictionaries to annotate a piece of text that presumably matches an entity, an entity’s attribute or a relationship between entities.
  3. The parsing rule that identifies this Person consists of the person’s title (e.g., sir, Mrs), then FirstName followed by LastName, or LastName followed by FirstName. We have a few tens of these rules for a comprehensive and (to some extent) accurate annotation of text.
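
A dictionary-driven rule of this shape can be sketched as follows. This is a simplified illustration, not the rule engine used in the project: the dictionary contents, the function name `extract_persons`, the choice to treat the title as optional and the sample sentence are all assumptions.

```python
import re
from typing import Dict, List

# Tiny illustrative dictionaries; real ones would be far larger.
TITLES = {"sir", "mr", "mrs"}
FIRST_NAMES = {"robert", "phelim"}
LAST_NAMES = {"andrew", "meredith"}


def extract_persons(text: str) -> List[Dict[str, str]]:
    """Annotate '<title> FirstName LastName' or '<title> LastName FirstName'."""
    words = re.findall(r"[A-Za-z]+", text)
    persons: List[Dict[str, str]] = []
    i = 0
    while i < len(words):
        start = i
        title = ""
        # Assumption: the title is optional in this sketch.
        if words[i].lower() in TITLES:
            title = words[i]
            i += 1
        pair = words[i:i + 2]
        if len(pair) == 2:
            a, b = pair[0].lower(), pair[1].lower()
            if a in FIRST_NAMES and b in LAST_NAMES:
                persons.append({"title": title, "first_name": pair[0], "last_name": pair[1]})
                i += 2
                continue
            if a in LAST_NAMES and b in FIRST_NAMES:
                persons.append({"title": title, "first_name": pair[1], "last_name": pair[0]})
                i += 2
                continue
        # No rule matched at this position; move on by one word.
        i = start + 1
    return persons


found = extract_persons("sir Robert Andrew was present")  # invented sentence
```

Here `found` holds a single Person annotation with title “sir”, first name “Robert” and last name “Andrew”. Misspelled names would fall outside the dictionaries and be missed, which is the kind of error the manual updates are meant to correct.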
  4. The main part of the paper, and therefore of this presentation, focuses on the task of manually updating extracted entities. Entity extraction cannot reach 100% accuracy even under perfect conditions (such as clean and well-structured text); it becomes much more difficult for the 1641 Depositions collection, with its noisy text, huge number of misspellings and inconsistent grammar. As an example, the same person’s name can be written differently within the same deposition; yet we strive to disambiguate these names and conclude that they refer to the same person. Another example is entity attributes: identifying a Person’s occupation, religion or origin becomes challenging due to the inconsistent structure of sentences. In CULTURA we adopted an approach for manual modification of extracted information using the PreMapper tool, which I present next.
  6. Note that entity-relationship data is used for several (currently decoupled) purposes: (1) entity visualisation with the PreMapper tool and (2) entity-oriented search (EoS), which is not the subject of this presentation but, in a few words, is used for retrieval and exploration of extracted entities. Changes to extracted entities should be reflected in the EoS component, which uses the open-source Lucene search engine for retrieval. We will focus on the architecture for distributing the modifications made by users in one of the next slides.